Hi, I've been having sporadic slurmctld crashes with slurm 2.6.5 that
require clearing the state to recover from.
I haven't yet been able to isolate the hardware issue that makes GPUs
disappear from slurm's view (nvidia-smi still lists them), but when it
happens slurmctld begins to log:

[2014-02-17T05:26:46.000] error: gres/gpu: job 14727 and node mendieta06 bitmap sizes differ (1 != 2)
[2014-02-17T05:26:46.000] error: gres/gpu: job 14731 dealloc node mendieta06 gres count underflow

Later, when a job that was using the missing resource ends for any reason
(in the case below, through job_time_limit), slurmctld aborts on a failed
assertion. The backtrace is:

#0  0x0000003c000328e5 in raise () from /lib64/libc.so.6
#1  0x0000003c000340c5 in abort () from /lib64/libc.so.6
#2  0x0000003c0002ba0e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003c0002bad0 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004a41fa in bit_test (b=<value optimized out>, bit=<value optimized out>) at bitstring.c:183
#5  0x0000000000533790 in _job_dealloc (job_gres_list=<value optimized out>, node_gres_list=0x249a0e8, node_offset=0, job_id=14727, node_name=0x246cd68 "mendieta06") at gres.c:3160
#6  gres_plugin_job_dealloc (job_gres_list=<value optimized out>, node_gres_list=0x249a0e8, node_offset=0, job_id=14727, node_name=0x246cd68 "mendieta06") at gres.c:3228
#7  0x00007f4f14741a35 in _rm_job_from_res (part_record_ptr=0x2474758, node_usage=0x24cc418, job_ptr=0x2477108, action=0) at select_cons_res.c:1169
#8  0x00007f4f14741dc2 in select_p_job_fini (job_ptr=<value optimized out>) at select_cons_res.c:2140
#9  0x000000000045bf57 in deallocate_nodes (job_ptr=0x2477108, timeout=true, suspended=false, preempted=false) at node_scheduler.c:478
#10 0x0000000000442281 in job_time_limit () at job_mgr.c:5465
#11 0x000000000042fa8f in _slurmctld_background (no_data=<value optimized out>) at controller.c:1462
#12 0x000000000043259f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:586

This looks to be caused by gres_cnt_config being inconsistent with the
node's state at job deallocation time, but I'm not familiar enough with the
gres code to propose a decent solution.
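
To illustrate the kind of failure I suspect, here is a small standalone
sketch. It is not slurm code: toy_bitmap, toy_bit_test, toy_bit_set,
toy_bit_clear and toy_job_dealloc are hypothetical names, and I am only
assuming that the job and node GPU bitmaps end up with different sizes, as
the "1 != 2" log line suggests. The asserting toy_bit_test mimics the abort
seen in frame #4, and toy_job_dealloc shows one possible defensive approach:
walk only the bits that both bitmaps can describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy fixed-capacity bitmap. Hypothetical, NOT slurm's bitstr_t. */
typedef struct {
    size_t nbits;      /* number of valid bits */
    uint64_t words[4]; /* storage for up to 256 bits */
} toy_bitmap;

/* Aborts on an out-of-range bit, like the assertion in the backtrace. */
static bool toy_bit_test(const toy_bitmap *b, size_t bit)
{
    assert(bit < b->nbits);
    return ((b->words[bit / 64] >> (bit % 64)) & 1u) != 0;
}

static void toy_bit_set(toy_bitmap *b, size_t bit)
{
    assert(bit < b->nbits);
    b->words[bit / 64] |= UINT64_C(1) << (bit % 64);
}

static void toy_bit_clear(toy_bitmap *b, size_t bit)
{
    assert(bit < b->nbits);
    b->words[bit / 64] &= ~(UINT64_C(1) << (bit % 64));
}

/*
 * Hypothetical defensive dealloc: release only the GPUs that both the
 * job bitmap and the node bitmap can describe, so a size mismatch never
 * reaches the asserting accessors. Returns the number of bits released.
 */
static size_t toy_job_dealloc(const toy_bitmap *job_alloc,
                              toy_bitmap *node_alloc)
{
    size_t limit = job_alloc->nbits < node_alloc->nbits
                       ? job_alloc->nbits
                       : node_alloc->nbits;
    size_t released = 0;

    for (size_t i = 0; i < limit; i++) {
        if (toy_bit_test(job_alloc, i) && toy_bit_test(node_alloc, i)) {
            toy_bit_clear(node_alloc, i);
            released++;
        }
    }
    return released;
}
```

With a 2-bit job bitmap and a 1-bit node bitmap, a loop bounded by the job's
size would call toy_bit_test on the node with bit 1 and abort, whereas the
clamped loop above releases bit 0 only. Whether clamping is the right fix
for slurm, or whether the job should instead be flagged, I can't say.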


-- 
Carlos S. Bederián
Instituto de Física Enrique Gaviola - CONICET
Medina Allende S/N, Ciudad Universitaria
X5000HUA Córdoba, Argentina
