We’re running SLURM 15.08.8 and having problems with slurmctld and slurmdbd 
aborting. It started with slurmctld aborting often enough that a cron job was 
added to check whether slurmctld was running and restart it if not. Over time 
I think this must have corrupted data, as slurmdbd is now aborting as well. 
gdb shows this for the slurmctld core dumps:

Core was generated by `/opt/slurm/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, 
p_ptr=0x27fcfd0) at gang.c:432
432         gang.c: No such file or directory.
                in gang.c

The slurmdb.log contains error messages like the following:

[2016-06-21T10:00:02.916] error: We have more allocated time than is possible 
(24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 
2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:02.916] error: We have more time than is possible 
(23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 
2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more allocated time than is possible 
(24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 
2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more time than is possible 
(23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 
2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:05.766] error: We have more allocated time than is possible 
(24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 
2016-06-20T21:00:00 tres 1

Searching around, I found a script (lost.pl) attached to a SchedMD ticket 
that reports “lost jobs”. Running it returned 4942 lines.

Is there some way to clean up the database so that slurmdbd stops logging 
these errors and stops aborting?

Thanks!
