salloc/mpirun does not work well with task affinity socket binding.  The 
following example illustrates the problem.

[sulu] (slurm) mnp> salloc -p bones-only -N1-1 -n3 --cpu_bind=socket mpirun cat /proc/self/status | grep Cpus_allowed_list
salloc: Granted job allocation 387
--------------------------------------------------------------------------
An invalid physical processor id was returned ...

The problem is that for mpirun jobs Slurm launches only a single task, 
regardless of the value of -n.  This confuses the socket binding logic in 
task/affinity, which binds the task to a single CPU instead of to all of 
the allocated CPUs on the socket.  When MPI then attempts to bind to any 
of the other allocated CPUs on the socket, it gets the "invalid physical 
processor id" error.  Note that the problem may occur even if the user 
does not explicitly request socket binding.  If task/affinity is 
configured and the allocated CPUs are a whole number of sockets, Slurm 
will use "implicit auto binding" to sockets, triggering the problem.
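
To check for the symptom from inside an allocation, one option is to 
compare the CPUs the launched task is allowed to run on with the CPUs 
expected on the socket.  The Python sketch below is only an illustration 
and is not part of the patch: parse_cpu_list, allowed_cpus and 
missing_cpus are hypothetical helper names, and the expected list "0-3" is 
an assumed example value that has to be adjusted to the node's actual 
socket layout.  It reads the same Cpus_allowed_list field from 
/proc/self/status that the command above greps for.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def allowed_cpus(status_path="/proc/self/status"):
    """Read Cpus_allowed_list for the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return []


def missing_cpus(expected, allowed):
    """Return the expected CPUs that the task is not bound to."""
    return sorted(set(expected) - set(allowed))


if __name__ == "__main__":
    # Assumption: the allocated socket holds CPUs 0-3; adjust per node.
    expected = parse_cpu_list("0-3")
    allowed = allowed_cpus()
    print("allowed:", allowed)
    print("missing:", missing_cpus(expected, allowed))
```

If the task has been bound to a single CPU as described above, the 
"missing" list would contain the remaining CPUs of the socket; an empty 
list would suggest the binding covers the whole socket.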

The attached patch fixes the problem for 2.3.0.

Regards,
Martin

Attachment: 314985Fix_2_3_0.patch
Description: Binary data
