We’re running SLURM 15.08.8 and are having problems with slurmctld and slurmdbd aborting. It started with slurmctld aborting often enough that a cron job was added to check whether slurmctld was running and restart it if not. Over time I think this must have corrupted data, as slurmdbd is now aborting as well. The core dumps from slurmctld show the following in gdb:
Core was generated by `/opt/slurm/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, p_ptr=0x27fcfd0) at gang.c:432
432     gang.c: No such file or directory.
        in gang.c

The errors in the slurmdb.log are of the following kind:

[2016-06-21T10:00:02.916] error: We have more allocated time than is possible (24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:02.916] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more allocated time than is possible (24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:05.766] error: We have more allocated time than is possible (24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 2016-06-20T21:00:00 tres 1

From a search I found a script (lost.pl) attached to a SchedMD ticket that reports “lost jobs”. I ran it and it returned 4942 lines. Is there some way to clean up the database so that slurmdbd no longer hits these errors and stops aborting? Thanks!
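
P.S. In case it is relevant to how the data got into this state: the restart watchdog mentioned above is just a root cron entry along these lines (simplified, and the interval and init-script path are from memory, so treat it as a sketch rather than the exact entry):

    # restart slurmctld if it is not running (approximate reconstruction)
    */5 * * * * pgrep -x slurmctld >/dev/null || /etc/init.d/slurm start

So slurmctld has been killed by the segfault and restarted by cron repeatedly, without a clean shutdown in between.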