Don,

With the code history online now, you can see who changed what and when. More information about this patch is available here:
https://github.com/SchedMD/slurm/commit/df0bd9b47b4cf679493ca0e8880e6d8bfd00fa7f

I can observe the problem that you describe. Perhaps you could work on a fix for this and the original problem with Yiannis.

Moe Jette
SchedMD



Quoting [email protected]:

I received a bug report on SLURM 2.2.4 about something that was previously
working in SLURM 2.2.1.

The problem occurs on a node that has 1 socket with 4 cores, when
attempting to run 4 tasks with "mpirun" under a SLURM allocation of that
single node, with task affinity and cpusets enabled.   Here are the
relevant lines from the "slurm.conf":

SelectType=select/cons_res
SelectTypeParameters=CR_Socket

TaskPlugin=task/affinity
TaskPluginParam=cpusets

NodeName=n1 NodeHostname=bones NodeAddr=bones CoresPerSocket=4 Sockets=1
ThreadsPerCore=1

The commands that are being run are:

salloc -N1-1 -n1 mpirun hostname   <-- works ok
salloc -N1-1 -n2 mpirun hostname   <-- works ok
salloc -N1-1 -n3 mpirun hostname   <-- works ok
salloc -N1-1 -n4 mpirun hostname   <-- fails with a message :
                                        An invalid physical processor id
                                        was returned when attempting to
                                        set processor affinity

In jobs of this type,  the salloc provides the SLURM allocation of the
target node and then starts "mpirun" on the requesting node.   The mpirun
process calls the SLURM API to cause "slurmd" on the target node to start
the "orted" process, which then runs the appropriate number of MPI jobs
(or in this case, just a simple "hostname" command) on the target node.
Since "task/affinity" is set,  the "slurmd" creates a cpuset before
starting "orted",  and both the "orted" and the processes it spawns run
inside this cpuset.   The cpuset should contain the number of cpus
required to run the requested number of tasks.

And in fact this is what happens for "-n1", "-n2", and "-n3".   The slurmd
uses cpu bind method 'mask_cpu' to generate cpu masks that set the
appropriate number of processors for the cpuset to accommodate the number
of tasks that "orted" eventually launches.   This occurs in the
"lllp_distribution" code in module "dist_tasks.c".

When the magic number of 4 is reached,  which happens to be the exact
number of cores within the single socket on the node,  the code switches
over to doing "implicit auto binding" and sets a task distribution type of
"SLURM_DIST_CYCLIC" and a cpu_bind_type of CPU_BIND_TO_SOCKETS.    This
causes the "task_layout_lllp_cyclic" routine to be called, which ends up
generating a mask based on sockets, cores, or threads, depending on the
distribution type.   This mask must then be "expanded" to encompass the
appropriate number of cpus for the unit in question,  so a call is made to
"_expand_masks".

In SLURM 2.2.1, the "_expand_masks" routine called a routine called
"_blot_mask"  for both CPU_BIND_TO_CORES and CPU_BIND_TO_SOCKETS.   This
expanded the bit mask to allow for multiple threads in a core or multiple
cores in a socket.   Sometime between SLURM 2.2.1 and SLURM 2.2.4,  a new
routine called "_blot_mask_sockets" was added, and is being called for the
CPU_BIND_TO_SOCKETS case, instead of calling "_blot_mask".   Here is the
code for that routine:

/* helper function for _expand_masks()
 * foreach task, consider which other tasks have set bits on the same
 * socket */
static void _blot_mask_sockets(const uint32_t maxtasks, const uint32_t task,
                               bitstr_t **masks, uint16_t blot)
{
        uint16_t i, j, size = 0;
        uint32_t q;

        char *str = NULL;

        if (!masks[task])
                return;
        size = bit_size(masks[task]);
        for (i = 0; i < size; i++) {
                if (bit_test(masks[task], i)) {
                        /* check if other tasks have set bits on this
                         * socket */
                        uint16_t start = (i / blot) * blot;
                        for (j = start; j < start+blot; j++) {
                                for (q = 0; q < maxtasks; q++) {
                                        if ((q != task) &&
                                            bit_test(masks[q], j)) {
                                                bit_set(masks[task], j);
                                        }
                                }
                        }
                }
        }
}


What I would expect this routine to do is to expand the bit mask for the
task to include all the cpus on the socket,  but it does not do this.  It
ends up leaving the mask set to only 1 bit.   This causes slurmd to create
a cpuset that only contains one cpu,  so the MPI jobs that "orted" starts
fail when they attempt to bind to their individual cpus.

I do not understand what this routine is supposed to accomplish.  The
comment: "foreach task, consider which other tasks have set bits on the
same socket" seems to indicate that it is supposed to take into account
masks that may be set for other tasks on this node for the same job,  but
I don't understand what this "taking into account" is intended for.   It
appears to want to set bits in the current mask if bits are set in masks
for other tasks.   Are all the tasks supposed to end up within the same
cpuset eventually?   In any case,  because my job has only one task (the
"orted" run),  it never sets any additional bits, and I get the situation
described in the previous paragraph.

My current temporary "fix" is to add back the call to "_blot_mask" in
the CPU_BIND_TO_SOCKETS path.  I added it in front of the call to
"_blot_mask_sockets", and left that call in place, since it seemed to do
nothing anyway.    Here is the modified call:

        if (cpu_bind_type & CPU_BIND_TO_SOCKETS) {
                if (hw_threads*hw_cores < 2)
                        return;
                for (i = 0; i < maxtasks; i++) {
                        _blot_mask(masks[i], hw_threads*hw_cores);
                        _blot_mask_sockets(maxtasks, i, masks,
                                           hw_threads*hw_cores);
                }
                return;
        }

The call to  "_blot_mask" now expands the mask  to include the full
socket,  and the cpuset gets created with all 4 cpus, so the MPI tasks
bind correctly.

But I don't really understand what is going on in the code.   Can someone
explain the rationale behind the addition of the "_blot_mask_sockets"
routine and its call?   What is supposed to happen with these masks when the
unit is the "socket"?  Perhaps this code is intended to handle cases where
there are many tasks in the job, and many sockets on the same node,  but
it falls short when there is just a single task.

  -Don Albert-
