We’re running SLURM 15.08.8 and are having problems with slurmctld and slurmdbd aborting. It started with slurmctld aborting often enough that a cron job was added to check whether slurmctld was running and restart it if not. Over time I think this must have corrupted data, as slurmdbd is now aborting as well. The core dumps from slurmctld show the following in gdb:
Core was generated by `/opt/slurm/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, p_ptr=0x27fcfd0) at gang.c:432
432     gang.c: No such file or directory.
        in gang.c

The errors in the slurmdb.log are of the following kind:

[2016-06-21T10:00:02.916] error: We have more allocated time than is possible (24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:02.916] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more allocated time than is possible (24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:05.766] error: We have more allocated time than is possible (24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 2016-06-20T21:00:00 tres 1

From a search I found a script (lost.pl) attached to a SchedMD ticket that reports “lost jobs”. I ran it and it returned 4942 lines. Is there some way to clean up the database so that slurmdbd no longer hits these errors and stops aborting? Thanks!
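
P.S. In case it is relevant to how the data got into this state: the restart watchdog mentioned above is just a root cron entry along these lines (simplified, and the interval and init-script path are from memory, so treat it as a sketch rather than the exact entry):

    # restart slurmctld if it is not running (approximate reconstruction)
    */5 * * * * pgrep -x slurmctld >/dev/null || /etc/init.d/slurm start

So slurmctld has been killed by the segfault and restarted by cron repeatedly, without a clean shutdown in between.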