SLURM version slurm-2.4.0-0.pre3.

I have a slurm.conf with the following node configuration:

...
# NODES
NodeName=DEFAULT State=UNKNOWN Sockets=2 CoresPerSocket=4 RealMemory=3952
NodeName=cn[5-6,9-10]
NodeName=cn[7-8] Sockets=1 CoresPerSocket=4
...
Error output:

$ salloc -n 32 -w cn[7-8] srun -n 32 -l hostname|sort
salloc: Granted job allocation 669
salloc: Relinquishing job allocation 669
00: cn6
01: cn6
02: cn6
03: cn6
04: cn6
05: cn6
06: cn6
07: cn6
08: cn7
09: cn7
10: cn7
11: cn7
12: cn7
13: cn7
14: cn7
15: cn7
16: cn8
17: cn8
18: cn8
19: cn8
20: cn8
21: cn8
22: cn8
23: cn8
24: cn9
25: cn9
26: cn9
27: cn9
28: cn10
29: cn10
30: cn10
31: cn10

8 tasks each were assigned to nodes cn7 and cn8, which have only 4 cores each according to the configuration above. Those 8-task shares should have been assigned to cn9 and cn10, which have 8 cores each.

The following patch fixes it:

--- step_mgr.c.orig	2012-02-02 09:37:07.000000000 +0800
+++ step_mgr.c	2012-02-02 10:24:49.000000000 +0800
@@ -1922,7 +1922,7 @@
 			xfree(step_specs->node_list);
 			step_specs->node_list = bitmap2node_name(nodeset);
 		} else {
-			step_node_list = bitmap2node_name(nodeset);
+			step_node_list = bitmap2node_name_sortable(nodeset, false);
 			xfree(step_specs->node_list);
 			step_specs->node_list = xstrdup(step_node_list);
 		}

With this patch, the output of the same command is:

$ salloc -n 32 -w cn[7-8] srun -n 32 -l hostname|sort
salloc: Granted job allocation 670
salloc: Relinquishing job allocation 670
00: cn6
01: cn6
02: cn6
03: cn6
04: cn6
05: cn6
06: cn6
07: cn6
08: cn9
09: cn9
10: cn9
11: cn9
12: cn9
13: cn9
14: cn9
15: cn9
16: cn10
17: cn10
18: cn10
19: cn10
20: cn10
21: cn10
22: cn10
23: cn10
24: cn7
25: cn7
26: cn7
27: cn7
28: cn8
29: cn8
30: cn8
31: cn8
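The report does not say why the patch works, so the following is only one plausible reading: the per-node task counts seem to be kept in the order the nodes are defined in slurm.conf (cn6, cn9, cn10, cn7, cn8), while the sorted node list names them as cn6, cn7, cn8, cn9, cn10, so pairing the two by position gives the 4-core nodes the 8-task shares. The sketch below is a standalone illustration of that mismatch and is not SLURM code. The arrays and the helper tasks_for() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration only; none of this is taken from SLURM. */
#define NODE_CNT 5

/* Allocated nodes in the order slurm.conf defines them (assumed). */
static const char *const conf_order[NODE_CNT] =
	{ "cn6", "cn9", "cn10", "cn7", "cn8" };

/* The same nodes after the name list has been sorted. */
static const char *const sorted_order[NODE_CNT] =
	{ "cn6", "cn7", "cn8", "cn9", "cn10" };

/* Task counts indexed in configuration order: 8-core nodes, then 4-core. */
static const int tasks_per_node[NODE_CNT] = { 8, 8, 8, 4, 4 };

/*
 * Return the task count that ends up paired with `node` when the counts
 * are matched to `names` purely by position, or -1 if the node is absent.
 */
static int tasks_for(const char *const names[], const char *node)
{
	assert(names != NULL && node != NULL);
	for (size_t i = 0; i < NODE_CNT; i++) {
		if (strcmp(names[i], node) == 0)
			return tasks_per_node[i];
	}
	return -1;
}
```

Under these assumptions, tasks_for(sorted_order, "cn7") yields 8 and tasks_for(sorted_order, "cn9") yields 4, which matches the faulty output, while the unsorted conf_order list gives cn7 4 tasks and cn9 8 tasks, which matches the output after the patch.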