William,

The src/slurmd/slurmd/get_mach_stat.c file includes a main() function, so it can be compiled stand-alone (see the comments at the top of the file). When you run the executable, it will display the processor mappings you presented below. This should help validate your table and help explain the observed masks that you found.
Don

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Williams, Kevin E. (Federal SIP)
Sent: Monday, April 04, 2011 6:43 PM
To: [email protected]
Subject: [slurm-dev] Question on task affinity

Hello,

We are running SLURM 2.2.0 on a Red Hat 5.5 cluster. In an effort to understand the operation of the task/affinity plugin, we have set up this simple test. Note that the target node is a 2-processor, 6-cores-per-processor, 2-threads-per-core hyperthreaded system. The partition is set up with only the one node to avoid any resource allocation issues. The actual program being run is not included in this message.

We run the following command:

srun -n12 --cpu_bind=verbose sh -c 'program and parameters'

By default, SLURM uses the following cpu masks:

cpu_bind;MASK - target, task  0  0 [29400]: mask 0x8 set
cpu_bind;MASK - target, task  1  1 [29401]: mask 0x2 set
cpu_bind;MASK - target, task  2  2 [29402]: mask 0x8000 set
cpu_bind;MASK - target, task  3  3 [29403]: mask 0x2000 set
cpu_bind;MASK - target, task  4  4 [29404]: mask 0x800 set
cpu_bind;MASK - target, task  5  5 [29405]: mask 0x200 set
cpu_bind;MASK - target, task  6  6 [29406]: mask 0x800000 set
cpu_bind;MASK - target, task  7  7 [29407]: mask 0x200000 set
cpu_bind;MASK - target, task  8  8 [29408]: mask 0x80 set
cpu_bind;MASK - target, task  9  9 [29409]: mask 0x20 set
cpu_bind;MASK - target, task 10 10 [29410]: mask 0x80000 set
cpu_bind;MASK - target, task 11 11 [29411]: mask 0x20000 set

With the current default settings, the program takes about 25 seconds to finish.
Now we run it as follows:

srun -n12 --cpu_bind=rank,verbose sh -c 'program and parameters'

cpu_bind;RANK - target, task  0  0 [29815]: mask 0x1 set
cpu_bind;RANK - target, task  1  1 [29816]: mask 0x2 set
cpu_bind;RANK - target, task  2  2 [29817]: mask 0x4 set
cpu_bind;RANK - target, task  3  3 [29818]: mask 0x8 set
cpu_bind;RANK - target, task  4  4 [29819]: mask 0x10 set
cpu_bind;RANK - target, task  5  5 [29820]: mask 0x20 set
cpu_bind;RANK - target, task  6  6 [29821]: mask 0x40 set
cpu_bind;RANK - target, task  7  7 [29822]: mask 0x80 set
cpu_bind;RANK - target, task  8  8 [29823]: mask 0x100 set
cpu_bind;RANK - target, task  9  9 [29824]: mask 0x200 set
cpu_bind;RANK - target, task 10 10 [29825]: mask 0x400 set
cpu_bind;RANK - target, task 11 11 [29826]: mask 0x800 set

With the mask settings provided by --cpu_bind=rank, it takes only about 12 seconds to finish.

We believe these masks represent processor numbers as listed in /proc/cpuinfo, with bit 0x1 for processor 0, 0x2 for processor 1, etc. If that is the case, here is a table showing how the masks and processor numbers correspond to the apicid, physical id, and core id fields:

apic  proc  phys  core    mask
   0     3     0     0       8
   1    15     0     0    8000
   2    11     0     1     800
   3    23     0     1  800000
   4     7     0     2      80
   5    19     0     2   80000
  16     1     0     8       2
  17    13     0     8    2000
  18     9     0     9     200
  19    21     0     9  200000
  20     5     0    10      20
  21    17     0    10   20000
  32     0     1     0       1
  33    12     1     0    1000
  34     8     1     1     100
  35    20     1     1  100000
  36     4     1     2      10
  37    16     1     2   10000
  48     2     1     8       4
  49    14     1     8    4000
  50    10     1     9     400
  51    22     1     9  400000
  52     6     1    10      40
  53    18     1    10   40000

As you can see, the default set of masks corresponds to the 12 threads on socket 0, rather than balancing the load between sockets 0 and 1. We would like to understand why SLURM is choosing these masks by default. The relevant slurm.conf parameters are:

TaskEpilog      = (null)
TaskPlugin      = task/affinity
TaskPluginParam = (null type)
TaskProlog      = (null)

Please advise.

Kevin Williams
Hewlett Packard
[email protected]
