I've hit an issue with binding using Slurm 21.08.5 that I'm hoping someone might be able to help with. I took a scan through the e-mail list but didn't see this one; apologies if I missed it. Maybe I just need a better understanding of why this is happening, but it feels like a bug.

The issue is that if I add --hint=nomultithread to an salloc (or sbatch), it seems to break the binding for the srun within it. It works fine if it is a direct srun.

Here are the examples of running the sruns directly, where things look good:

~> srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP - cn4, task 0 0 [103837]: mask 0x1 set
cpu-bind=MAP - cn4, task 1 1 [103838]: mask 0x10000 set
cpu-bind=MAP - cn4, task 2 2 [103839]: mask 0x100000000 set
cpu-bind=MAP - cn4, task 3 3 [103840]: mask 0x1000000000000 set

~> srun --hint=nomultithread -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
cpu-bind=MAP - cn4, task 0 0 [103992]: mask 0x1 set
cpu-bind=MAP - cn4, task 1 1 [103993]: mask 0x10000 set
cpu-bind=MAP - cn4, task 2 2 [103994]: mask 0x100000000 set
cpu-bind=MAP - cn4, task 3 3 [103995]: mask 0x1000000000000 set

And here are the sruns wrapped by an salloc:

~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282077
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MAP - cn4, task 0 0 [169441]: mask 0x1 set
cpu-bind=MAP - cn4, task 1 1 [169442]: mask 0x10000 set
cpu-bind=MAP - cn4, task 2 2 [169443]: mask 0x100000000 set
cpu-bind=MAP - cn4, task 3 3 [169444]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282077

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,map_cpu:0,16,32,48 /bin/true
salloc: Granted job allocation 282078
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task 0 0 [169586]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 1 1 [169587]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 2 2 [169588]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 3 3 [169589]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282078

I do see that the binding has changed to cpu-bind=MASK. Maybe that is a clue. :)

Even if I send in a mask, mine is not fully used in the presence of the hint:

~> salloc --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282084
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task 0 0 [125303]: mask 0x1 set
cpu-bind=MASK - cn4, task 1 1 [125304]: mask 0x1000 set
cpu-bind=MASK - cn4, task 2 2 [125305]: mask 0x100000000 set
cpu-bind=MASK - cn4, task 3 3 [125306]: mask 0x1000000000000 set
salloc: Relinquishing job allocation 282084

~> salloc --hint=nomultithread --exclusive -N 1 -n 4 srun -n 4 -N 1 --ntasks-per-node=4 --cpu_bind=v,mask_cpu:0x1,0x1000,0x100000000,0x1000000000000 /bin/true
salloc: Granted job allocation 282085
salloc: Waiting for resource configuration
salloc: Nodes cn4 are ready for job
cpu-bind=MASK - cn4, task 0 0 [125462]: mask 0x1 set
cpu-bind=MASK - cn4, task 1 1 [125463]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 2 2 [125464]: mask 0xf0000000000000000000000000000000f set
cpu-bind=MASK - cn4, task 3 3 [125465]: mask 0xf0000000000000000000000000000000f set
salloc: Relinquishing job allocation 282085

Note that the mask is ignored for tasks 1, 2, and 3 in this latter case. I'm pretty sure my syntax is correct, as it worked in the first test without the hint.

I also have 22.05.0 installed but not active. I'll try it with that later today and report the results.

Brent
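
In case it helps when comparing the masks in the verbose output, here is a minimal Python sketch that turns a hex CPU mask into the list of CPU indices it selects. The helper name mask_to_cpus is hypothetical (it is not part of Slurm), and it assumes the usual convention that bit N of the mask corresponds to CPU N.

```python
def mask_to_cpus(mask):
    """Return the CPU indices whose bits are set in a hex mask string.

    Assumes bit N of the mask stands for CPU N. Hypothetical helper,
    not part of Slurm.
    """
    value = int(mask, 16)  # base 16 accepts an optional "0x" prefix
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)
        value >>= 1
        bit += 1
    return cpus


if __name__ == "__main__":
    # Masks taken from the srun/salloc output in this message.
    for m in ["0x1", "0x1000", "0x10000", "0x100000000", "0x1000000000000"]:
        print(m, mask_to_cpus(m))
```

With this, 0x10000 decodes to CPU 16 and 0x1000 decodes to CPU 12, so the second entry of the mask_cpu list selects CPU 12 rather than the CPU 16 used in the map_cpu runs. The same function can be applied to the wide mask reported under --hint=nomultithread to see which CPUs each task ended up bound to.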
