Moe,

At the time I entered this report, I didn't realize that the change in
question had been put in by Yiannis. We have discussed this and are
working on finding a fix.

  -Don-

[email protected] wrote on 06/30/2011 01:30:53 PM:

> Don,
> 
> With the code history online now, you can see who changed what and
> when. More information about this patch is available here:
> https://github.com/SchedMD/slurm/commit/df0bd9b47b4cf679493ca0e8880e6d8bfd00fa7f
> 
> I can observe the problem that you describe. Perhaps you could work on 
> a fix for this and the original problem with Yiannis.
> 
> Moe Jette
> SchedMD
> 
> 
> 
> Quoting [email protected]:
> 
> > I received a bug report on SLURM 2.2.4 about something that was
> > previously working in SLURM 2.2.1.
> >
> > The problem occurs on a node that has 1 socket with 4 cores, when
> > attempting to run 4 tasks with "mpirun" under a SLURM allocation of
> > the single node, and also using task affinity with cpusets. Here are
> > the relevant lines from the "slurm.conf":
> >
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Socket
> >
> > TaskPlugin=task/affinity
> > TaskPluginParam=cpusets
> >
> > NodeName=n1 NodeHostname=bones NodeAddr=bones CoresPerSocket=4
> > Sockets=1 ThreadsPerCore=1
> >
> > The commands that are being run are:
> >
> > salloc -N1-1 -n1 mpirun hostname   <-- works ok
> > salloc -N1-1 -n2 mpirun hostname   <-- works ok
> > salloc -N1-1 -n3 mpirun hostname   <-- works ok
> > salloc -N1-1 -n4 mpirun hostname   <-- fails with the message:
> >                                         An invalid physical processor
> >                                         id was returned when attempting
> >                                         to set processor affinity
> >
> > In jobs of this type, the salloc provides the SLURM allocation of the
> > target node and then starts "mpirun" on the requesting node. The
> > mpirun process calls the SLURM API to cause "slurmd" on the target
> > node to start the "orted" process, which then runs the appropriate
> > number of MPI jobs (or in this case, just a simple "hostname" command)
> > on the target node. Since "task/affinity" is set, the "slurmd" creates
> > a cpuset before starting "orted", and both "orted" and the processes
> > it spawns run inside this cpuset. The cpuset should contain the number
> > of cpus required to run the requested number of tasks.
> >
> > And in fact this is what happens for "-n1", "-n2", and "-n3". The
> > slurmd uses cpu bind method 'mask_cpu' to generate cpu masks that set
> > the appropriate number of processors for the cpuset to accommodate the
> > number of tasks that "orted" eventually launches. This occurs in the
> > "lllp_distribution" code in module "dist_tasks.c".
> >
> > When the magic number of 4 is reached, which happens to be the exact
> > number of cores within the single socket on the node, the code
> > switches over to doing "implicit auto binding" and sets a task
> > distribution type of "SLURM_DIST_CYCLIC" and a cpu_bind_type of
> > CPU_BIND_TO_SOCKETS. This causes the "task_layout_lllp_cyclic" routine
> > to be called, which ends up generating a mask based on sockets, cores,
> > or threads, depending on the distribution type. This mask must then be
> > "expanded" to encompass the appropriate number of cpus for the unit in
> > question, so a call is made to "_expand_masks".
> >
> > In SLURM 2.2.1, the "_expand_masks" routine called "_blot_mask" for
> > both CPU_BIND_TO_CORES and CPU_BIND_TO_SOCKETS. This expanded the bit
> > mask to allow for multiple threads in a core or multiple cores in a
> > socket. Sometime between SLURM 2.2.1 and SLURM 2.2.4, a new routine
> > called "_blot_mask_sockets" was added, and is being called for the
> > CPU_BIND_TO_SOCKETS case, instead of calling "_blot_mask". Here is
> > the code for that routine:
> >
> > /* helper function for _expand_masks()
> >  * foreach task, consider which other tasks have set bits on the same
> >  * socket */
> > static void _blot_mask_sockets(const uint32_t maxtasks,
> >                                const uint32_t task,
> >                                bitstr_t **masks, uint16_t blot)
> > {
> >         uint16_t i, j, size = 0;
> >         uint32_t q;
> >
> >         char *str = NULL;
> >
> >         if (!masks[task])
> >                 return;
> >         size = bit_size(masks[task]);
> >         for (i = 0; i < size; i++) {
> >                 if (bit_test(masks[task], i)) {
> >                         /* check if other tasks have set bits on this
> >                          * socket */
> >                         uint16_t start = (i / blot) * blot;
> >                         for (j = start; j < start+blot; j++) {
> >                                 for (q = 0; q < maxtasks; q++) {
> >                                         if ((q != task) &&
> >                                             bit_test(masks[q], j)) {
> >                                                 bit_set(masks[task], j);
> >                                         }
> >                                 }
> >                         }
> >                 }
> >         }
> > }
> >
> >
> > What I would expect this routine to do is to expand the bit mask for
> > the task to include all the cpus on the socket, but it does not do
> > this. It ends up leaving the mask set to only 1 bit. This causes
> > slurmd to create a cpuset that only contains one cpu, so the MPI jobs
> > that "orted" starts fail when they attempt to bind to their
> > individual cpus.
> >
> > I do not understand what this routine is supposed to accomplish. The
> > comment "foreach task, consider which other tasks have set bits on
> > the same socket" seems to indicate that it is supposed to take into
> > account masks that may be set for other tasks on this node for the
> > same job, but I don't understand what this "taking into account" is
> > intended for. It appears to set bits in the current mask only if
> > those bits are already set in the masks of other tasks. Are all the
> > tasks supposed to end up within the same cpuset eventually? In any
> > case, because my job has only one task (the "orted" run), it never
> > sets any additional bits, and I get the situation described in the
> > previous paragraph.
> >
> > My current temporary "fix" is to add back the call to "_blot_mask"
> > in the CPU_BIND_TO_SOCKETS path. I added it in front of the call to
> > "_blot_mask_sockets", and left that call in place, since it seemed
> > to do nothing anyway. Here is the modified call:
> >
> >         if (cpu_bind_type & CPU_BIND_TO_SOCKETS) {
> >                 if (hw_threads*hw_cores < 2)
> >                         return;
> >                 for (i = 0; i < maxtasks; i++) {
> >                         _blot_mask(masks[i], hw_threads*hw_cores);
> >                         _blot_mask_sockets(maxtasks, i, masks,
> >                                            hw_threads*hw_cores);
> >                 }
> >                 return;
> >         }
> >
> > The call to "_blot_mask" now expands the mask to include the full
> > socket, and the cpuset gets created with all 4 cpus, so the MPI
> > tasks bind correctly.
> >
> > But I don't really understand what is going on in the code. Can
> > someone explain the rationale behind the addition of the
> > "_blot_mask_sockets" routine and call? What is supposed to happen
> > with these masks when the unit is the "socket"? Perhaps this code is
> > intended to handle cases where there are many tasks in the job and
> > many sockets on the same node, but it falls short when there is just
> > a single task.
> >
> >   -Don Albert-