I've hit an issue with CPU binding in Slurm 21.08.5 that I'm hoping someone
might be able to help with.  I scanned the e-mail list but didn't see this
one; apologies if I missed it.  Maybe I just need a better understanding of
why this is happening, but it feels like a bug.

The issue is that if I add --hint=nomultithread to an salloc (or sbatch),
it seems to break the binding for the srun within it.  It works fine with a
direct srun.

Here are examples of running the sruns directly, where things look good:

~> srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP  - cn4, task  0  0 [103837]: mask 0x1 set
cpu-bind=MAP  - cn4, task  1  1 [103838]: mask 0x10000 set
cpu-bind=MAP  - cn4, task  2  2 [103839]: mask 0x100000000 set
cpu-bind=MAP  - cn4, task  3  3 [103840]: mask 0x1000000000000 set

~> srun --hint=nomultithread -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP  - cn4, task  0  0 [103992]: mask 0x1 set
cpu-bind=MAP  - cn4, task  1  1 [103993]: mask 0x10000 set
cpu-bind=MAP  - cn4, task  2  2 [103994]: mask 0x100000000 set
cpu-bind=MAP  - cn4, task  3  3 [103995]: mask 0x1000000000000 set
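
As a reading aid for the masks above: with map_cpu, CPU N corresponds to a
mask with only bit N set, so the four IDs passed to srun line up with the
four masks reported.  A minimal Python sketch of that conversion (the
function name is only for illustration and is not part of Slurm):

```python
def cpu_ids_to_masks(cpu_ids):
    """Return the single-CPU hex mask for each CPU ID (bit N set for CPU N)."""
    return [hex(1 << cpu) for cpu in cpu_ids]


if __name__ == "__main__":
    # The CPU IDs passed to --cpu_bind=map_cpu in the runs above.
    for cpu, mask in zip([0, 16, 32, 48], cpu_ids_to_masks([0, 16, 32, 48])):
        print(f"cpu {cpu:2d} -> mask {mask}")
```

Running it should print one mask per CPU ID, which can be compared by eye
with the "mask ... set" values in the srun output.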

And here are the sruns wrapped by an salloc:

~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282077
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MAP  - cn4, task  0  0 [169441]: mask 0x1 set
cpu-bind=MAP  - cn4, task  1  1 [169442]: mask 0x10000 set
cpu-bind=MAP  - cn4, task  2  2 [169443]: mask 0x100000000 set
cpu-bind=MAP  - cn4, task  3  3 [169444]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282077

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282078
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task  0  0 [169586]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  1  1 [169587]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  2  2 [169588]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  3  3 [169589]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282078

I do see that the reported binding type has changed from cpu-bind=MAP to
cpu-bind=MASK.  Maybe that is a clue. :)  Even if I pass in an explicit
mask, it is not fully honored when the hint is present:

~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282084
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task  0  0 [125303]: mask 0x1 set
cpu-bind=MASK - cn4, task  1  1 [125304]: mask 0x1000 set
cpu-bind=MASK - cn4, task  2  2 [125305]: mask 0x100000000 set
cpu-bind=MASK - cn4, task  3  3 [125306]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282084

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282085
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task  0  0 [125462]: mask 0x1 set
cpu-bind=MASK - cn4, task  1  1 [125463]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  2  2 [125464]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task  3  3 [125465]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282085
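
To make this kind of comparison less error-prone than reading hex by eye,
here is a rough Python sketch that parses the verbose cpu-bind lines and
reports which tasks ended up with a mask other than the one requested.  The
regular expression is inferred only from the output format shown in this
post, and the helper names are hypothetical; the wide mask in the sample
text is a made-up stand-in (0xf), not a value from the runs above.

```python
import re

# Pattern inferred from lines such as
#   cpu-bind=MASK - cn4, task  0  0 [125462]: mask 0x1 set
# \s+ also matches a line break, so a mask wrapped onto the next line is fine.
LINE_RE = re.compile(
    r"cpu-bind=(?P<kind>\w+)\s+-\s+(?P<node>[^,]+),\s+task\s+(?P<task>\d+)\s+\d+\s+"
    r"\[(?P<pid>\d+)\]:\s+mask\s+(?P<mask>0x[0-9a-fA-F]+)\s+set"
)


def mask_to_cpus(value):
    """Return the sorted CPU IDs whose bits are set in an integer mask."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def applied_masks(output):
    """Map task ID -> applied mask (as int) parsed from verbose output text."""
    return {int(m.group("task")): int(m.group("mask"), 16)
            for m in LINE_RE.finditer(output)}


def mismatches(requested, output):
    """Return {task: (requested_cpus, applied_cpus)} where the masks differ."""
    applied = applied_masks(output)
    result = {}
    for task, want_hex in enumerate(requested):
        want = int(want_hex, 16)
        got = applied.get(task)
        if got is not None and got != want:
            result[task] = (mask_to_cpus(want), mask_to_cpus(got))
    return result


if __name__ == "__main__":
    requested = ["0x1", "0x1000"]
    # Task 0 line is from the post; task 1 uses an illustrative mask (0xf).
    sample = (
        "cpu-bind=MASK - cn4, task  0  0 [125462]: mask 0x1 set\n"
        "cpu-bind=MASK - cn4, task  1  1 [125463]: mask \n0xf set\n"
    )
    for task, (want, got) in mismatches(requested, sample).items():
        print(f"task {task}: requested CPUs {want}, applied CPUs {got}")
```

One side note from decoding the values: 0x1000 has bit 12 set, so the
second requested mask selects CPU 12 rather than the CPU 16 used in the
map_cpu runs.  That does not change the comparison, since the run without
the hint applied 0x1000 exactly as given.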

Note that the mask is ignored for tasks 1, 2, and 3 in this latter case.
I'm pretty sure my syntax is correct, since it worked in the first test
without the hint.  I also have 22.05.0 installed but not active.  I'll try
it with that later today and report the results.

Brent
