Thanks for the analysis and patch. This change will be in SLURM
versions 2.3.4 and 2.4.0-pre4 when released.
Quoting Hongjia Cao <hj...@nudt.edu.cn>:
SLURM version slurm-2.4.0-0.pre3.
I have a slurm.conf with the following node configuration:
...
# NODES
NodeName=DEFAULT State=UNKNOWN Sockets=2 CoresPerSocket=4 RealMemory=3952
NodeName=cn[5-6,9-10]
NodeName=cn[7-8] Sockets=1 CoresPerSocket=4
...
Error output:
$ salloc -n 32 -w cn[7-8] srun -n 32 -l hostname|sort
salloc: Granted job allocation 669
salloc: Relinquishing job allocation 669
00: cn6
01: cn6
02: cn6
03: cn6
04: cn6
05: cn6
06: cn6
07: cn6
08: cn7
09: cn7
10: cn7
11: cn7
12: cn7
13: cn7
14: cn7
15: cn7
16: cn8
17: cn8
18: cn8
19: cn8
20: cn8
21: cn8
22: cn8
23: cn8
24: cn9
25: cn9
26: cn9
27: cn9
28: cn10
29: cn10
30: cn10
31: cn10
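Eight tasks each were assigned to nodes cn7 and cn8, which have only
4 cores each (Sockets=1, CoresPerSocket=4). The 8-task blocks should
instead be assigned to nodes cn9 and cn10, which have 8 cores each,
with 4 tasks each on cn7 and cn8.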
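As a quick check of the expected per-node task counts, here is a small Python sketch (not SLURM code). It assumes the CPU count of a node is Sockets * CoresPerSocket, with one thread per core, and that node lines inherit from NodeName=DEFAULT unless they override a value. The names DEFAULT, NODE_LINES and cpus_per_node are made up for the illustration.

```python
# Model of the node configuration quoted above (illustrative only).
DEFAULT = {"Sockets": 2, "CoresPerSocket": 4}

# (node names, per-line overrides), in slurm.conf order.
NODE_LINES = [
    (["cn5", "cn6", "cn9", "cn10"], {}),
    (["cn7", "cn8"], {"Sockets": 1, "CoresPerSocket": 4}),
]


def cpus_per_node(default, node_lines):
    """Return {node name: CPU count}, assuming CPUs = Sockets * CoresPerSocket."""
    cpus = {}
    for names, overrides in node_lines:
        conf = {**default, **overrides}  # overrides win over DEFAULT
        for name in names:
            cpus[name] = conf["Sockets"] * conf["CoresPerSocket"]
    return cpus


CPUS = cpus_per_node(DEFAULT, NODE_LINES)

# Nodes that appear in the job output above.
ALLOCATED = ["cn6", "cn7", "cn8", "cn9", "cn10"]
TOTAL = sum(CPUS[n] for n in ALLOCATED)

print({n: CPUS[n] for n in ALLOCATED})
print(TOTAL)
```

Under these assumptions cn7 and cn8 provide 4 CPUs each and the other three nodes 8 each, for 32 in total, so a 32-task step fits only if cn7 and cn8 get 4 tasks each.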
The following patch fixes it:
--- step_mgr.c.orig 2012-02-02 09:37:07.000000000 +0800
+++ step_mgr.c 2012-02-02 10:24:49.000000000 +0800
@@ -1922,7 +1922,7 @@
         xfree(step_specs->node_list);
         step_specs->node_list = bitmap2node_name(nodeset);
     } else {
-        step_node_list = bitmap2node_name(nodeset);
+        step_node_list = bitmap2node_name_sortable(nodeset, false);
         xfree(step_specs->node_list);
         step_specs->node_list = xstrdup(step_node_list);
     }
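My reading of why this helps, sketched in Python rather than in SLURM's C: the per-node task counts follow the internal node table order (the order of the NodeName lines in slurm.conf), while bitmap2node_name() appears to return the names sorted. Pairing a sorted name list with unsorted counts shifts the counts onto the wrong nodes. This is a model under those assumptions, not the actual step_mgr.c logic. NODE_TABLE, ALLOCATED and step_layout are hypothetical names, and the CPU counts are derived from the configuration quoted earlier.

```python
import re

# Node table in slurm.conf order with assumed CPUs per node.
NODE_TABLE = [("cn5", 8), ("cn6", 8), ("cn9", 8), ("cn10", 8),
              ("cn7", 4), ("cn8", 4)]

# Nodes in the job allocation (cn5 is not part of it).
ALLOCATED = {"cn6", "cn7", "cn8", "cn9", "cn10"}


def numeric_key(name):
    """Sort key on the trailing node number, so cn10 sorts after cn9."""
    return int(re.search(r"\d+$", name).group())


def step_layout(sort_names):
    """Pair node names with task counts taken in node-table order.

    sort_names=True models a sorted name list (the behaviour before
    the patch); sort_names=False keeps names in node-table order.
    """
    picked = [(n, c) for n, c in NODE_TABLE if n in ALLOCATED]
    names = [n for n, _ in picked]
    counts = [c for _, c in picked]
    if sort_names:
        names = sorted(names, key=numeric_key)
    return list(zip(names, counts))


print(step_layout(True))
print(step_layout(False))
```

With sort_names=True the model gives cn7 and cn8 eight tasks each, matching the error output; with sort_names=False it gives cn6, cn9 and cn10 eight tasks each and cn7 and cn8 four each.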
With this patch, the output of the same command is:
$ salloc -n 32 -w cn[7-8] srun -n 32 -l hostname|sort
salloc: Granted job allocation 670
salloc: Relinquishing job allocation 670
00: cn6
01: cn6
02: cn6
03: cn6
04: cn6
05: cn6
06: cn6
07: cn6
08: cn9
09: cn9
10: cn9
11: cn9
12: cn9
13: cn9
14: cn9
15: cn9
16: cn10
17: cn10
18: cn10
19: cn10
20: cn10
21: cn10
22: cn10
23: cn10
24: cn7
25: cn7
26: cn7
27: cn7
28: cn8
29: cn8
30: cn8
31: cn8