There was an error in the patch I sent yesterday for this problem.  The 
attached patch is a corrected version.
Regards,
Martin

Martin Perry/US/BULL 
02/24/2011 02:04 PM

To: [email protected]
cc: [email protected], "[email protected]" <[email protected]>
Subject: Slurmd abort when using task affinity with plane distribution

Slurmd may abort when using task affinity with the plane distribution 
method.  I think the problem is in function _task_layout_plane in 
src/common/slurm_step_layout.c.  The function does not support 
heterogeneous allocations of cpus across nodes.  The following example 
illustrates the problem:

slurm.conf settings:
SelectType=select/cons_res
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity
TaskPluginParam=sched,cores

command:
srun -p bones-chekov-scotty -N 3-3 -n 6 -l -m plane=2 hostname | sort

In this example, slurm allocates 4 cores from one node and 1 core each 
from the other two nodes (block allocation method).  But 
_task_layout_plane distributes 2 tasks to each node, even though two of 
the nodes only have 1 allocated core.  When task affinity detects this 
condition, it aborts slurmd with the following error (from the slurmd 
log): "error: task/affinity: only 1 bits in avail_map for 2 tasks!"
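
The mismatch can be reproduced with a small Python sketch.  This is not 
the code from src/common/slurm_step_layout.c; the function name 
naive_plane_layout and the node order (4, 1, 1 cores) are assumptions 
made only for illustration.  It hands out tasks in blocks of the plane 
size, round-robin across nodes, without looking at how many cores each 
node was allocated:

```python
def naive_plane_layout(num_nodes, num_tasks, plane_size):
    """Assign task ids in blocks of plane_size, round-robin over nodes.

    Hypothetical sketch: core counts are deliberately ignored, which is
    the behaviour described in the report above.
    """
    tasks = [[] for _ in range(num_nodes)]
    task_id = 0
    while task_id < num_tasks:
        for node in range(num_nodes):
            for _ in range(plane_size):
                if task_id >= num_tasks:
                    break
                tasks[node].append(task_id)
                task_id += 1
    return tasks


# Assumed allocation from the example: one node with 4 cores, two with 1.
cores = [4, 1, 1]
layout = naive_plane_layout(len(cores), num_tasks=6, plane_size=2)

# Nodes that received more tasks than they have allocated cores.
overcommitted = [i for i, t in enumerate(layout) if len(t) > cores[i]]
print(layout)         # [[0, 1], [2, 3], [4, 5]]
print(overcommitted)  # [1, 2]
```

The two single-core nodes each end up with 2 tasks, which is the 
condition task/affinity reports as "only 1 bits in avail_map for 2 
tasks".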

The attached patch fixes the problem for slurm version 2.2.1 by modifying 
_task_layout_plane to take the allocation into account when distributing 
tasks across nodes.  Here is the same example after the patch has been 
applied, showing that the job runs successfully and the tasks have been 
correctly distributed in accordance with the block allocation and plane=2 
distribution:

[sulu] (slurm) mnp> srun -p bones-chekov-scotty -N 3-3 -n 6 -l -m plane=2 hostname | sort
0: scotty
1: scotty
2: chekov
3: bones
4: scotty
5: scotty
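
For readers without the patch at hand, the following Python sketch shows 
one way a plane distribution can respect the per-node allocation.  It is 
not the patch itself: the function name plane_layout_with_cpus, the node 
order (scotty, chekov, bones) and the choice to raise an error when 
tasks outnumber cores are all assumptions for illustration.  Each node 
receives up to plane-size tasks per round, capped by its remaining 
allocated cores:

```python
def plane_layout_with_cpus(cpus_per_node, num_tasks, plane_size):
    """Plane distribution capped by each node's allocated cores.

    Hypothetical sketch, not the SLURM implementation.
    """
    tasks = [[] for _ in cpus_per_node]
    task_id = 0
    while task_id < num_tasks:
        placed = False
        for node, cpus in enumerate(cpus_per_node):
            # Cores on this node that do not yet have a task.
            room = cpus - len(tasks[node])
            for _ in range(min(plane_size, room)):
                if task_id >= num_tasks:
                    break
                tasks[node].append(task_id)
                task_id += 1
                placed = True
        if not placed:
            # Assumption: refuse to overcommit instead of stacking tasks.
            raise ValueError("more tasks than allocated cores")
    return tasks


# Assumed node order and core counts for the example job.
hosts = ["scotty", "chekov", "bones"]
layout = plane_layout_with_cpus([4, 1, 1], num_tasks=6, plane_size=2)

# Flatten to "task: host" lines, sorted by task id like the srun output.
lines = sorted(
    f"{t}: {hosts[n]}" for n, ids in enumerate(layout) for t in ids
)
print("\n".join(lines))
```

Under these assumptions the sketch prints the same task-to-host mapping 
as the srun output above: tasks 0, 1, 4 and 5 on scotty, task 2 on 
chekov and task 3 on bones.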


Regards,
Martin



Attachment: taskassignplanedistribfix_2-2-1.patch
Description: Binary data

Attachment: taskassignplanedistribfixv2_2-2-1.patch
Description: Binary data