Thanks for the analysis and patch. This change will be in SLURM versions 2.3.4 and 2.4.0-pre4 when released.

Quoting Hongjia Cao <hj...@nudt.edu.cn>:


SLURM version slurm-2.4.0-0.pre3.

I have a slurm.conf with the following node configuration:
...
# NODES
NodeName=DEFAULT State=UNKNOWN Sockets=2 CoresPerSocket=4 RealMemory=3952
NodeName=cn[5-6,9-10]
NodeName=cn[7-8] Sockets=1 CoresPerSocket=4
...

Error output:

$ salloc -n 32 -w cn[7-8] srun -n 32 -l hostname|sort
salloc: Granted job allocation 669
salloc: Relinquishing job allocation 669
00: cn6
01: cn6
02: cn6
03: cn6
04: cn6
05: cn6
06: cn6
07: cn6
08: cn7
09: cn7
10: cn7
11: cn7
12: cn7
13: cn7
14: cn7
15: cn7
16: cn8
17: cn8
18: cn8
19: cn8
20: cn8
21: cn8
22: cn8
23: cn8
24: cn9
25: cn9
26: cn9
27: cn9
28: cn10
29: cn10
30: cn10
31: cn10

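Nodes cn7 and cn8, which have only 4 cores each, were assigned 8 tasks
each, while cn9 and cn10, which have 8 cores each, received only 4 tasks
each. The extra tasks should have been assigned to cn9 and cn10.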

The following patch fixes it:

--- step_mgr.c.orig     2012-02-02 09:37:07.000000000 +0800
+++ step_mgr.c  2012-02-02 10:24:49.000000000 +0800
@@ -1922,7 +1922,7 @@
                xfree(step_specs->node_list);
                step_specs->node_list = bitmap2node_name(nodeset);
        } else {
-               step_node_list = bitmap2node_name(nodeset);
+               step_node_list = bitmap2node_name_sortable(nodeset, false);
                xfree(step_specs->node_list);
                step_specs->node_list = xstrdup(step_node_list);
        }


With this patch, the same command produces:

$ salloc -n 32 -w cn[7-8] srun -n 32 -l hostname|sort
salloc: Granted job allocation 670
salloc: Relinquishing job allocation 670
00: cn6
01: cn6
02: cn6
03: cn6
04: cn6
05: cn6
06: cn6
07: cn6
08: cn9
09: cn9
10: cn9
11: cn9
12: cn9
13: cn9
14: cn9
15: cn9
16: cn10
17: cn10
18: cn10
19: cn10
20: cn10
21: cn10
22: cn10
23: cn10
24: cn7
25: cn7
26: cn7
27: cn7
28: cn8
29: cn8
30: cn8
31: cn8
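
The thread does not spell out why the sort order matters. One reading
that is consistent with both listings, offered as an assumption rather
than as a description of the real step_mgr.c logic, is that per-node task
counts are kept in node-table order (the order of the NodeName lines)
while the step's node list was being emitted in sorted order, so counts
and names were paired off by one another. The sketch below reproduces
that pairing; every name in it (build_names, tasks_on, task_counts) is
hypothetical and is not part of SLURM.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NODE_CNT 5

/* Task counts in node-table order: cn6, cn9, cn10 have 8 cores,
 * cn7 and cn8 have 4 (taken from the slurm.conf excerpt above). */
const int task_counts[NODE_CNT] = {8, 8, 8, 4, 4};

/* Order host names by numeric suffix, as a sorted host list would.
 * Assumes every name has the two-character prefix "cn". */
static int cmp_host(const void *a, const void *b)
{
    const char *ha = *(const char * const *)a;
    const char *hb = *(const char * const *)b;
    return atoi(ha + 2) - atoi(hb + 2);
}

/* Fill `out` with the allocated node names in node-table order;
 * when `sorted` is nonzero, sort them afterwards (the assumed
 * pre-patch behaviour). */
void build_names(const char *out[NODE_CNT], int sorted)
{
    static const char *table_order[NODE_CNT] =
        {"cn6", "cn9", "cn10", "cn7", "cn8"};
    memcpy(out, table_order, sizeof(table_order));
    if (sorted)
        qsort(out, NODE_CNT, sizeof(out[0]), cmp_host);
}

/* Return the task count that ends up paired with `name`, or -1. */
int tasks_on(const char *names[], const int tasks[], int n,
             const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0)
            return tasks[i];
    return -1;
}

/* Print the name/count pairing for one of the two orderings. */
void print_layout(int sorted)
{
    const char *names[NODE_CNT];
    build_names(names, sorted);
    for (int i = 0; i < NODE_CNT; i++)
        printf("%s: %d tasks\n", names[i], task_counts[i]);
}
```

Under this assumption, print_layout(1) pairs cn7 and cn8 with 8 tasks
each, as in the first listing, and print_layout(0) pairs them with 4
tasks each, as in the listing produced with the patch.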




