Moe,

At the time I entered this report, I didn't realize that the change in
question had been put in by Yiannis. We have discussed this and are
working on finding a fix.

-Don

[email protected] wrote on 06/30/2011 01:30:53 PM:

> Don,
>
> With the code history online now you can see who changed what and
> when. More information about this patch is available here:
> https://github.com/SchedMD/slurm/commit/df0bd9b47b4cf679493ca0e8880e6d8bfd00fa7f
>
> I can observe the problem that you describe. Perhaps you could work on
> a fix for this and the original problem with Yiannis.
>
> Moe Jette
> SchedMD
>
> Quoting [email protected]:
>
> > I received a bug report on SLURM 2.2.4 about something that was
> > previously working in SLURM 2.2.1.
> >
> > The problem occurs on a node that has 1 socket with 4 cores, when
> > attempting to run 4 tasks with "mpirun" under a SLURM allocation of
> > the single node, and also using task affinity with cpusets. Here are
> > the relevant lines from the "slurm.conf":
> >
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Socket
> >
> > TaskPlugin=task/affinity
> > TaskPluginParam=cpusets
> >
> > NodeName=n1 NodeHostname=bones NodeAddr=bones CoresPerSocket=4 Sockets=1 ThreadsPerCore=1
> >
> > The commands that are being run are:
> >
> > salloc -N1-1 -n1 mpirun hostname   <-- works ok
> > salloc -N1-1 -n2 mpirun hostname   <-- works ok
> > salloc -N1-1 -n3 mpirun hostname   <-- works ok
> > salloc -N1-1 -n4 mpirun hostname   <-- fails with the message:
> >         An invalid physical processor id was returned when
> >         attempting to set processor affinity
> >
> > In jobs of this type, the salloc provides the SLURM allocation of the
> > target node and then starts "mpirun" on the requesting node. The
> > mpirun process calls the SLURM API to cause "slurmd" on the target
> > node to start the "orted" process, which then runs the appropriate
> > number of MPI jobs (or in this case, just a simple "hostname"
> > command) on the target node. Since "task/affinity" is set, the
> > "slurmd" creates a cpuset before starting "orted", and both the
> > "orted" and the processes it spawns run inside this cpuset.
> > The cpuset should contain the number of cpus required to run the
> > requested number of tasks.
> >
> > And in fact this is what happens for "-n1", "-n2", and "-n3". The
> > slurmd uses cpu bind method 'mask_cpu' to generate cpu masks that
> > set the appropriate number of processors for the cpuset to
> > accommodate the number of tasks that "orted" eventually launches.
> > This occurs in the "lllp_distribution" code in module "dist_tasks.c".
> >
> > When the magic number of 4 is reached, which happens to be the exact
> > number of cores within the single socket on the node, the code
> > switches over to doing "implicit auto binding" and sets a task
> > distribution type of SLURM_DIST_CYCLIC and a cpu_bind_type of
> > CPU_BIND_TO_SOCKETS. This causes the "task_layout_lllp_cyclic"
> > routine to be called, which ends up generating a mask based on
> > sockets, cores, or threads, depending on the distribution type. This
> > mask must then be "expanded" to encompass the appropriate number of
> > cpus for the unit in question, so a call is made to "_expand_masks".
> >
> > In SLURM 2.2.1, the "_expand_masks" routine called a routine named
> > "_blot_mask" for both CPU_BIND_TO_CORES and CPU_BIND_TO_SOCKETS.
> > This expanded the bit mask to allow for multiple threads in a core
> > or multiple cores in a socket. Sometime between SLURM 2.2.1 and
> > SLURM 2.2.4, a new routine called "_blot_mask_sockets" was added,
> > and is being called for the CPU_BIND_TO_SOCKETS case, instead of
> > calling "_blot_mask".
> > Here is the code for that routine:
> >
> > /* helper function for _expand_masks()
> >  * foreach task, consider which other tasks have set bits on the
> >  * same socket */
> > static void _blot_mask_sockets(const uint32_t maxtasks,
> >                                const uint32_t task,
> >                                bitstr_t **masks, uint16_t blot)
> > {
> >         uint16_t i, j, size = 0;
> >         uint32_t q;
> >
> >         char *str = NULL;
> >
> >         if (!masks[task])
> >                 return;
> >         size = bit_size(masks[task]);
> >         for (i = 0; i < size; i++) {
> >                 if (bit_test(masks[task], i)) {
> >                         /* check if other tasks have set bits on
> >                          * this socket */
> >                         uint16_t start = (i / blot) * blot;
> >                         for (j = start; j < start+blot; j++) {
> >                                 for (q = 0; q < maxtasks; q++) {
> >                                         if ((q != task) &&
> >                                             bit_test(masks[q], j)) {
> >                                                 bit_set(masks[task], j);
> >                                         }
> >                                 }
> >                         }
> >                 }
> >         }
> > }
> >
> > What I would expect this routine to do is to expand the bit mask for
> > the task to include all the cpus on the socket, but it does not do
> > this. It ends up leaving the mask set to only 1 bit. This causes
> > slurmd to create a cpuset that only contains one cpu, so the MPI
> > jobs that "orted" starts fail when they attempt to bind to their
> > individual cpus.
> >
> > I do not understand what this routine is supposed to accomplish. The
> > comment "foreach task, consider which other tasks have set bits on
> > the same socket" seems to indicate that it is supposed to take into
> > account masks that may be set for other tasks on this node for the
> > same job, but I don't understand what this "taking into account" is
> > intended for. It appears to want to set bits in the current mask if
> > bits are set in masks for other tasks. Are all the tasks supposed to
> > end up within the same cpuset eventually? In any case, because my
> > job has only one task (the "orted" run), it never sets any
> > additional bits, and I get the situation described in the previous
> > paragraph.
> >
> > My current temporary "fix" is to add back in the call to
> > "_blot_mask" in the CPU_BIND_TO_SOCKETS path. I added it back in
> > front of the call to "_blot_mask_sockets", and left that call in
> > place, as it seemed to do nothing anyway. Here is the modified call:
> >
> >         if (cpu_bind_type & CPU_BIND_TO_SOCKETS) {
> >                 if (hw_threads*hw_cores < 2)
> >                         return;
> >                 for (i = 0; i < maxtasks; i++) {
> >                         _blot_mask(masks[i], hw_threads*hw_cores);
> >                         _blot_mask_sockets(maxtasks, i, masks,
> >                                            hw_threads*hw_cores);
> >                 }
> >                 return;
> >         }
> >
> > The call to "_blot_mask" now expands the mask to include the full
> > socket, and the cpuset gets created with all 4 cpus, so the MPI
> > tasks bind correctly.
> >
> > But I don't really understand what is going on in the code. Can
> > someone explain the rationale behind the addition of the
> > "_blot_mask_sockets" routine and call? What is supposed to happen
> > with these masks when the unit is the "socket"? Perhaps this code is
> > intended to handle cases where there are many tasks in the job, and
> > many sockets on the same node, but it falls short when there is just
> > a single task.
> >
> > -Don Albert
