Hyperthreading

Although Hyperthreading (officially called Hyper-Threading Technology) has been available since 2002, the cluster btrzx2, installed in 2017, is the first in Bayreuth to have this feature enabled. Hyper-Threading Technology is a form of simultaneous multithreading (SMT) introduced by Intel, while the concept behind the technology was patented by Sun Microsystems. Architecturally, a processor with Hyper-Threading Technology consists of two logical processors per physical core, each of which has its own architectural state. Each logical processor can be individually halted, interrupted, or directed to execute a specified thread independently of the other logical processor sharing the same physical core.

The following examples focus on btrzx2's nodes labeled compute8; the other "CPU-Based" node types, labeled compute20 and compute40, are analogous but feature more cores. The compute8 nodes are named for their 8 physical cores: each node has two Intel Xeon E5-2623 v4 CPUs with a clock speed of 2.60 GHz and 4 physical cores each. The output of the command numactl --hardware shows this layout:


available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 32673 MB
node 0 free: 30863 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 32768 MB
node 1 free: 31225 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

Please note that in this output, node means a CPU of the cluster's compute node, i.e., each compute node holds two Intel Xeon E5-2623 v4 CPUs, labeled node 0 and node 1, each with 32 GB of RAM attached. (For considerations about RAM access, please see the page about NUMA.) In this setup, the first hyperthread of the first core of node 0 bears the label 0, while the second hyperthread of the same core bears the label 8. This can be seen by submitting the following job, which requests two threads (ppn=2), to btrzx2's queueing system:


#PBS -l nodes=1:ppn=2:compute8,walltime=00:50:00
#PBS -j oe
numactl --show

The result will be an output file containing


policy: default
preferred node: current
physcpubind: 0 8 
cpubind: 0 
nodebind: 0 
membind: 0

where the line physcpubind lists the hyperthread cores allocated to this job. Interestingly, when requesting 4 physical cores in the form of 8 hyperthread cores with the parameter ppn=8, the output shows that all threads are again preferably allocated on the same CPU chip:


policy: default
preferred node: current
physcpubind: 0 1 2 3 8 9 10 11 
cpubind: 0 
nodebind: 0 
membind: 0

All hyperthreads of a compute node of type compute8 are requested with ppn=16, which produces the output


policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1
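The relation between the labels in the physcpubind lines and the physical cores can be spelled out with a few lines of shell. This is only a minimal sketch: the function name physical_cores is hypothetical (it is not part of numactl), and it assumes the compute8 numbering shown above, where the labels n and n+8 are the two hyperthreads of physical core n.

```shell
# Hypothetical helper (not part of numactl): reduce logical CPU labels to
# physical core indices. Assumes the compute8 numbering shown above, where
# labels n and n+8 are the two hyperthreads of physical core n.
physical_cores() {
    ncores=$1   # number of physical cores per compute node (8 on compute8)
    shift
    for cpu in "$@"; do
        echo $(( cpu % ncores ))
    done | sort -nu | paste -sd' ' -
}

physical_cores 8 0 8                  # labels from the ppn=2 job, prints: 0
physical_cores 8 0 1 2 3 8 9 10 11    # labels from the ppn=8 job, prints: 0 1 2 3
```

In each of the outputs above, both labels of a physical core are handed out together, so ppn=2 corresponds to one physical core and ppn=8 to four.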

Therefore, a sequential job should request nodes=1:ppn=2, since using one physical core occupies both of its hyperthread cores. While jobs parallelized with OpenMP can benefit from using all hyperthreads of a compute node, jobs parallelized with MPI are usually slowed down when running more MPI processes than there are physical cores available. The following example executes an MPI job on two cluster nodes.


#PBS -l nodes=2:ppn=16:compute8,walltime=00:05:00
#PBS -j oe
module load intel_parallel_studio_xe_2016_update4
awk 'NR%2==1' $PBS_NODEFILE > my_nodefile
$MPI_RUN -ordered-output -prepend-rank -machinefile my_nodefile /bin/tcsh -c 'taskset -c -p $$; hostname' | sort

This example uses the commands taskset -c -p $$ and hostname to show the mapping between hyperthreads, MPI processes, and compute nodes. The job requests all hyperthread cores of two nodes (nodes=2:ppn=16:compute8; for the other node types this can be done analogously, e.g. with nodes=2:ppn=40:compute20), and the line awk 'NR%2==1' $PBS_NODEFILE > my_nodefile then singles out every second logical core. The PBS queueing system passes the name of a file containing a list of all hyperthread cores to use in the environment variable $PBS_NODEFILE. Here awk 'NR%2==1' prints the odd-numbered lines of that file into the file my_nodefile, so that only one MPI process is started per physical core. Because the output is piped through sort, the ranks appear in lexicographic rather than numeric order. The resulting job output will look like


[0] pid 44595's current affinity list: 0,8
[0] r03n30
[10] pid 43651's current affinity list: 2,10
[10] r03n29
[11] pid 43652's current affinity list: 3,11
[11] r03n29
[12] pid 43653's current affinity list: 4,12
[12] r03n29
[13] pid 43654's current affinity list: 5,13
[13] r03n29
[14] pid 43655's current affinity list: 6,14
[14] r03n29
[15] pid 43656's current affinity list: 7,15
[15] r03n29
[1] pid 44596's current affinity list: 1,9
[1] r03n30
[2] pid 44597's current affinity list: 2,10
[2] r03n30
[3] pid 44598's current affinity list: 3,11
[3] r03n30
[4] pid 44599's current affinity list: 4,12
[4] r03n30
[5] pid 44600's current affinity list: 5,13
[5] r03n30
[6] pid 44601's current affinity list: 6,14
[6] r03n30
[7] pid 44602's current affinity list: 7,15
[7] r03n30
[8] pid 43649's current affinity list: 0,8
[8] r03n29
[9] pid 43650's current affinity list: 1,9
[9] r03n29
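The effect of the awk filter can also be tried outside the queueing system. The following is a minimal sketch under stated assumptions: it builds a stand-in for $PBS_NODEFILE, assuming the file lists each host once per allocated hyperthread core and grouped by host; the variable name mock_nodefile is hypothetical, and the host names are taken from the output above.

```shell
# Stand-in for $PBS_NODEFILE (assumption: one line per allocated hyperthread
# core, grouped by host; 16 lines per host for ppn=16).
mock_nodefile=$(mktemp)
for host in r03n30 r03n29; do
    i=0
    while [ "$i" -lt 16 ]; do
        echo "$host"
        i=$((i + 1))
    done
done > "$mock_nodefile"

# Same filter as in the job script: keep the odd-numbered lines.
awk 'NR%2==1' "$mock_nodefile" > my_nodefile

# Count the remaining entries per host: 8 each, one per physical core.
sort my_nodefile | uniq -c
rm -f "$mock_nodefile"
```

With 8 entries left per host, 16 MPI processes are started in total, which matches the ranks [0] to [15] in the job output.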

Responsible for the editorial content: Dr. rer. nat. Ingo Schelter