A couple of GRES bugs were fixed about a week ago, and the fixes will be
in v14.11.4. If you like, you can probably get the commits from GitHub
and apply them to v14.03. Commit information below:

commit 72cefd541e0c112d81c3681b7179fb3d777bad88
Author: Morris Jette <[email protected]>
Date:   Thu Jan 15 09:13:20 2015 -0800

    GRES scheduling fix

    Fix for GRES scheduling in which there is CPU topology defined or
    GRES types defined and there is more than 1 GPU per topology record
    in slurmctld. Without this fix, only one GRES could be allocated
    from each defined topology.
    bug 1369

commit ce1d99f5ade31f415fea5e53c5144ec0d71b971f
Author: Morris Jette <[email protected]>
Date:   Wed Jan 14 16:25:44 2015 -0800

    Fix for slurmctld abort on gres error

    The slurmctld could abort with a gres configuration having
    Type= configured, but no CPU binding configured.

Quoting Franco Broi <[email protected]>:
Not sure if this has already been reported and fixed. It was being
caused by a single queued job which I cancelled. Resubmitted and it ran
ok.
slurm 14.03.6
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000541d90 in _job_alloc (job_gres_list=<value optimized out>, node_gres_list=0x1443e68, node_cnt=16, node_offset=0, cpu_cnt=<value optimized out>, job_id=99387, node_name=0x145bd38 "delta1", core_bitmap=0x14c9d28) at gres.c:3047
3047 if (job_gres_ptr->gres_bit_alloc[node_offset]) {
Missing separate debuginfos, use: debuginfo-install glibc-2.13-1.x86_64 munge-libs-0.5.9-3.fc14.x86_64
(gdb) where
#0 0x0000000000541d90 in _job_alloc (job_gres_list=<value optimized out>, node_gres_list=0x1443e68, node_cnt=16, node_offset=0, cpu_cnt=<value optimized out>, job_id=99387, node_name=0x145bd38 "delta1", core_bitmap=0x14c9d28) at gres.c:3047
#1 gres_plugin_job_alloc (job_gres_list=<value optimized out>, node_gres_list=0x1443e68, node_cnt=16, node_offset=0, cpu_cnt=<value optimized out>, job_id=99387, node_name=0x145bd38 "delta1", core_bitmap=0x14c9d28) at gres.c:3216
#2 0x00007f432ffd44e9 in _add_job_to_res (job_ptr=0x14c9818, action=0) at select_cons_res.c:817
#3 0x00007f432ffd79b7 in select_p_select_nodeinfo_set (job_ptr=0x14c9818) at select_cons_res.c:2376
#4 0x0000000000462806 in select_nodes (job_ptr=<value optimized out>, test_only=false, select_node_bitmap=<value optimized out>) at node_scheduler.c:1815
#5 0x00000000004556ab in schedule (job_limit=100) at job_scheduler.c:1198
#6 0x000000000043289a in _slurmctld_background (argc=<value optimized out>, argv=<value optimized out>) at controller.c:1589
#7 main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561
(gdb) p *(gres_job_state_t *) job_gres_data
$2 = {gres_cnt_alloc = 1, node_cnt = 16, gres_bit_alloc = 0x0, gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x14d6578}
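
The gdb output above shows gres_bit_alloc = 0x0, so the index expression at gres.c:3047 reads through a NULL pointer, which would explain the signal 11. The following is only a minimal sketch of that failure mode and of a defensive check. demo_job_state_t and demo_has_node_bitmap are hypothetical names made up for illustration; this is not Slurm's gres_job_state_t definition and not the change made in the commits listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the job GRES state seen in gdb.
 * Only the fields needed to illustrate the crash are modelled. */
typedef struct {
    uint32_t gres_cnt_alloc;   /* GRES count allocated to the job */
    uint32_t node_cnt;         /* number of nodes in the allocation */
    void   **gres_bit_alloc;   /* per-node bitmap array; 0x0 in the core */
} demo_job_state_t;

/* Return 1 if a bitmap exists for the node at node_offset, else 0.
 * The array pointer is checked before it is indexed, so a NULL
 * gres_bit_alloc yields 0 instead of a segmentation fault. */
static int demo_has_node_bitmap(const demo_job_state_t *js,
                                uint32_t node_offset)
{
    if (js == NULL || js->gres_bit_alloc == NULL)
        return 0;              /* nothing allocated: the crashing case */
    if (node_offset >= js->node_cnt)
        return 0;              /* offset outside the allocation */
    return js->gres_bit_alloc[node_offset] != NULL;
}
```

With the values from the core (node_cnt = 16, gres_bit_alloc = 0x0, node_offset = 0), the unguarded form of the test dereferences NULL, while the guarded helper simply reports that no bitmap is present.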
Cheers,
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support