I received a bug report on SLURM 2.2.4 about something that was previously 
working in SLURM 2.2.1.

The problem occurs on a node that has 1 socket with 4 cores, when attempting 
to run 4 tasks with "mpirun" under a SLURM allocation of the single node, 
with task affinity and cpusets also in use. Here are the relevant lines from 
the "slurm.conf":

SelectType=select/cons_res
SelectTypeParameters=CR_Socket

TaskPlugin=task/affinity
TaskPluginParam=cpusets

NodeName=n1 NodeHostname=bones NodeAddr=bones CoresPerSocket=4 Sockets=1 
ThreadsPerCore=1

The commands that are being run are:

salloc -N1-1 -n1 mpirun hostname   <-- works ok
salloc -N1-1 -n2 mpirun hostname   <-- works ok
salloc -N1-1 -n3 mpirun hostname   <-- works ok
salloc -N1-1 -n4 mpirun hostname   <-- fails with a message : 
                                        An invalid physical processor id 
                                        was returned when attempting to
                                        set processor affinity

In jobs of this type, salloc provides the SLURM allocation of the target 
node and then starts "mpirun" on the requesting node. The mpirun process 
calls the SLURM API to have "slurmd" on the target node start the "orted" 
process, which then runs the appropriate number of MPI tasks (or in this 
case, just a simple "hostname" command) on the target node. Since 
"task/affinity" is set, "slurmd" creates a cpuset before starting "orted", 
and both "orted" and the processes it spawns run inside this cpuset. The 
cpuset should contain enough cpus to run the requested number of tasks.

And in fact this is what happens for "-n1", "-n2", and "-n3". The slurmd 
uses cpu bind method 'mask_cpu' to generate cpu masks that set the 
appropriate number of processors for the cpuset to accommodate the number 
of tasks that "orted" eventually launches. This occurs in the 
"lllp_distribution" code in module "dist_tasks.c".

When the magic number of 4 is reached, which happens to be the exact number 
of cores within the single socket on the node, the code switches over to 
doing "implicit auto binding" and sets a task distribution type of 
"SLURM_DIST_CYCLIC" and a cpu_bind_type of CPU_BIND_TO_SOCKETS. This causes 
the "task_layout_lllp_cyclic" routine to be called, which ends up generating 
a mask based on sockets, cores, or threads, depending on the distribution 
type. This mask must then be "expanded" to encompass the appropriate number 
of cpus for the unit in question, so a call is made to "_expand_masks".

In SLURM 2.2.1, the "_expand_masks" routine called "_blot_mask" for both the 
CPU_BIND_TO_CORES and CPU_BIND_TO_SOCKETS cases. This expanded the bit mask 
to allow for multiple threads in a core or multiple cores in a socket. 
Sometime between SLURM 2.2.1 and SLURM 2.2.4, a new routine called 
"_blot_mask_sockets" was added, and it is now called for the 
CPU_BIND_TO_SOCKETS case instead of "_blot_mask". Here is the code for that 
routine:

/* helper function for _expand_masks()
 * foreach task, consider which other tasks have set bits on the same socket */
static void _blot_mask_sockets(const uint32_t maxtasks, const uint32_t task,
                               bitstr_t **masks, uint16_t blot)
{
        uint16_t i, j, size = 0;
        uint32_t q;

        char *str = NULL;

        if (!masks[task])
                return;
        size = bit_size(masks[task]);
        for (i = 0; i < size; i++) {
                if (bit_test(masks[task], i)) {
                        /* check if other tasks have set bits on this socket */
                        uint16_t start = (i / blot) * blot;
                        for (j = start; j < start+blot; j++) {
                                for (q = 0; q < maxtasks; q++) {
                                        if ((q != task) &&
                                            bit_test(masks[q], j)) {
                                                bit_set(masks[task], j);
                                        }
                                }
                        }
                }
        }
}


What I would expect this routine to do is to expand the bit mask for the 
task to include all the cpus on the socket, but it does not do this. It 
ends up leaving the mask set to only 1 bit. This causes slurmd to create a 
cpuset that contains only one cpu, so the MPI jobs that "orted" starts fail 
when they attempt to bind to their individual cpus.

I do not understand what this routine is supposed to accomplish. The 
comment, "foreach task, consider which other tasks have set bits on the 
same socket", seems to indicate that it is supposed to take into account 
masks that may be set for other tasks of the same job on this node, but I 
don't understand what this "taking into account" is intended for. It 
appears to set bits in the current mask only if those bits are already set 
in the masks of other tasks. Are all the tasks supposed to end up within 
the same cpuset eventually? In any case, because my job has only one task 
(the "orted" run), it never sets any additional bits, and I get the 
situation described in the previous paragraph.

My current temporary "fix" is to add back the call to "_blot_mask" in the 
CPU_BIND_TO_SOCKETS path. I added it in front of the call to 
"_blot_mask_sockets", and left that call in place, as it seemed to do 
nothing anyway. Here is the modified call:

        if (cpu_bind_type & CPU_BIND_TO_SOCKETS) {
                if (hw_threads*hw_cores < 2)
                        return;
                for (i = 0; i < maxtasks; i++) {
                        _blot_mask(masks[i], hw_threads*hw_cores);
                        _blot_mask_sockets(maxtasks, i, masks,
                                           hw_threads*hw_cores);
                }
                return;
        }

The call to "_blot_mask" now expands the mask to include the full socket, 
and the cpuset gets created with all 4 cpus, so the MPI tasks bind 
correctly.

But I don't really understand what is going on in the code. Can someone 
explain the rationale behind the addition of the "_blot_mask_sockets" 
routine and its call? What is supposed to happen with these masks when the 
unit is the "socket"? Perhaps this code is intended to handle cases where 
there are many tasks in the job and many sockets on the same node, but it 
falls short when there is just a single task.

  -Don Albert-
