Thank you, Martin. This patch will be in Slurm v2.3.1 (with some minor cosmetic changes: I removed the now-unused variable "q" and replaced some spaces with tabs).

Moe


Quoting [email protected]:

salloc/mpirun does not play well with task affinity socket binding.
The following example illustrates the problem.

[sulu] (slurm) mnp> salloc -p bones-only -N1-1 -n3 --cpu_bind=socket mpirun cat /proc/self/status | grep Cpus_allowed_list
salloc: Granted job allocation 387
--------------------------------------------------------------------------
An invalid physical processor id was returned ...

The problem is that with mpirun jobs Slurm launches only a single task,
regardless of the value of -n.  This confuses the socket binding logic in
task affinity.  The result is that task affinity binds the task to only a
single CPU instead of all the allocated CPUs on the socket.  When MPI
attempts to bind to any of the other allocated CPUs on the socket, it gets
the "invalid physical processor id" error.  Note that the problem may occur
even if socket binding is not explicitly requested by the user.  If
task/affinity is configured and the allocated CPUs are a whole number of
sockets, Slurm will use "implicit auto binding" to sockets, triggering the
problem.
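
To check what a task is actually bound to, the Cpus_allowed_list value
from /proc/self/status can be expanded into explicit CPU ids.  The
following is a minimal Python sketch and is not part of the patch; the
function names and the sample status text are hypothetical, chosen only
for illustration.

```python
def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range "lo-hi" is inclusive on both ends.
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus_from_status(status_text):
    """Return the CPU ids from the Cpus_allowed_list line of a status dump."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    return set()


if __name__ == "__main__":
    # Hypothetical sample text; on Linux, read /proc/self/status instead.
    sample = "Name:\tcat\nCpus_allowed_list:\t0-3,8\n"
    print(sorted(allowed_cpus_from_status(sample)))
```

With the behavior described above, a task on a socket with several
allocated CPUs would report a single id here rather than the full set.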

The attached patch fixes the problem for 2.3.0.

Regards,
Martin




