There is an error in the patch I sent yesterday for this problem. The attached patch is a corrected version.

Regards,
Martin
Martin Perry/US/BULL
02/24/2011 02:04 PM

To [email protected]
cc [email protected], "[email protected]" <[email protected]>
Subject Slurmd abort when using task affinity with plane distribution

Slurmd may abort when using task affinity with the plane distribution method. I think the problem is in function _task_layout_plane in src/common/slurm_step_layout.c. The function does not support heterogeneous allocations of CPUs across nodes. The following example illustrates the problem:

slurm.conf settings:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    TaskPlugin=task/affinity
    TaskPluginParam=sched,cores

command:

    srun -p bones-chekov-scotty -N 3-3 -n 6 -l -m plane=2 hostname | sort

In this example, slurm allocates 4 cores from one node and 1 core each from the other two nodes (block allocation method). But _task_layout_plane distributes 2 tasks to each node, even though two of the nodes only have 1 allocated core. When task affinity detects this condition, it aborts slurmd with the following error (from the slurmd log):

    "error: task/affinity: only 1 bits in avail_map for 2 tasks!"

The attached patch fixes the problem for slurm version 2.2.1 by modifying _task_layout_plane to take the allocation into account when distributing tasks across nodes. Here is the same example after the patch has been applied, showing that the job runs successfully and the tasks have been correctly distributed in accordance with the block allocation and plane=2 distribution:

    [sulu] (slurm) mnp> srun -p bones-chekov-scotty -N 3-3 -n 6 -l -m plane=2 hostname | sort
    0: scotty
    1: scotty
    2: chekov
    3: bones
    4: scotty
    5: scotty

Regards,
Martin
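For readers without the patch at hand, the idea of an allocation-aware plane distribution can be sketched roughly as follows. This is only an illustration in Python, not the code from the patch or from slurm_step_layout.c: the function name plane_layout, its arguments, and the choice to raise an error when there are more tasks than allocated CPUs are all assumptions made for the example. In each round-robin pass a node receives at most plane_size tasks, and never more than its remaining allocated CPUs.

```python
def plane_layout(cpus_per_node, num_tasks, plane_size):
    """Hypothetical sketch: map task ids to node indices.

    cpus_per_node[i] is the number of CPUs allocated on node i.
    Returns a list where element t is the node index of task t.
    """
    if plane_size < 1:
        raise ValueError("plane_size must be at least 1")
    if num_tasks > sum(cpus_per_node):
        # Assumption for this sketch: no overcommit of allocated CPUs.
        raise ValueError("more tasks than allocated CPUs")

    remaining = list(cpus_per_node)  # CPUs still free on each node
    assignment = []                  # assignment[task_id] = node index

    while len(assignment) < num_tasks:
        for node, free in enumerate(remaining):
            # Give this node one plane of tasks, capped by its free CPUs
            # and by the number of tasks still to be placed.
            take = min(plane_size, free, num_tasks - len(assignment))
            assignment.extend([node] * take)
            remaining[node] -= take

    return assignment


if __name__ == "__main__":
    nodes = ["scotty", "chekov", "bones"]  # node order assumed for the example
    layout = plane_layout([4, 1, 1], num_tasks=6, plane_size=2)
    for task_id, node in enumerate(layout):
        print(f"{task_id}: {nodes[node]}")
```

With 4, 1 and 1 allocated cores and plane=2, the sketch places tasks 0-1 on the first node, task 2 on the second, task 3 on the third, and tasks 4-5 back on the first node, which is the distribution shown in the srun output above.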
taskassignplanedistribfix_2-2-1.patch
Description: Binary data
taskassignplanedistribfixv2_2-2-1.patch
Description: Binary data
