Hi Mike –
We have also recently experienced slurmctld failures causing orphaned jobs. We had many instances of the same error you referenced, and we ended up having to manually update the SLURM accounting database. I documented the procedure in our internal wiki; if you're interested, it's linked below. Please understand that this is risky, comes with no support, and should be considered for reference only. :) Also, I'm sure more seasoned experts here have a much better way of solving the same problem.

https://www.dropbox.com/s/j1ez0spgc27z5eb/Manual%20Updating%20for%20Orphaned%20Jobs.pdf?dl=0

--
Ed Swindelles
University of Connecticut

From: Mike Newton [mailto:jmnew...@duke.edu]
Sent: Tuesday, June 21, 2016 12:42 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] slurmctld and slurmdb aborts

We're running SLURM 15.08.8 and having problems with slurmctld and slurmdbd aborting. It started with slurmctld aborting often enough that a cron job was added to check whether slurmctld was running and restart it if not. Over time I think this must have corrupted data, as slurmdbd is now aborting as well.

The core dumps from slurmctld show this in gdb:

Core was generated by `/opt/slurm/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, p_ptr=0x27fcfd0) at gang.c:432
432     gang.c: No such file or directory.
        in gang.c

The slurmdb.log has the following kind of error messages:

[2016-06-21T10:00:02.916] error: We have more allocated time than is possible (24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:02.916] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more allocated time than is possible (24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:04.348] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1
[2016-06-21T10:00:05.766] error: We have more allocated time than is possible (24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 2016-06-20T21:00:00 tres 1

From a search I found a script attached to a SchedMD ticket that shows "lost jobs" (lost.pl). I ran it and it returned 4942 lines. Is there some way to clean up the database so that slurmdbd stops hitting these errors and aborting?

Thanks!
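For anyone following along, below is a minimal, read-only sketch of the kind of query that lost.pl and a manual cleanup start from: finding accounting records that were started but never closed out. This is only a sketch under assumptions, not the documented procedure from the wiki link above. It assumes the standard slurmdbd MySQL/MariaDB schema (a per-cluster `<cluster>_job_table` with columns id_job, id_user, job_name, time_start, time_end, state), the cluster name "dscr" taken from the log messages above, and placeholder connection details; verify all of that against your own database. It only lists candidates; any actual UPDATE should be done with slurmdbd stopped and a fresh database backup in hand.

#!/usr/bin/env python3
# Read-only sketch: list accounting records that started but were never
# marked finished. Compare the output against `squeue`; anything listed
# here that the controller no longer knows about is a candidate
# orphaned/runaway accounting record.
#
# Assumptions (not from the original thread): cluster name "dscr",
# standard slurmdbd schema, placeholder credentials and DB name.
import pymysql  # pip install pymysql

CLUSTER = "dscr"  # placeholder: your cluster name as known to slurmdbd

conn = pymysql.connect(
    host="localhost",          # placeholder
    user="slurm",              # placeholder
    password="CHANGE_ME",      # placeholder
    database="slurm_acct_db",  # common default; verify your StorageLoc
)

try:
    with conn.cursor() as cur:
        # Jobs with a start time but no end time in accounting.
        cur.execute(
            f"SELECT id_job, id_user, job_name, "
            f"FROM_UNIXTIME(time_start), state "
            f"FROM `{CLUSTER}_job_table` "
            f"WHERE time_start > 0 AND time_end = 0 "
            f"ORDER BY time_start"
        )
        for id_job, id_user, job_name, started, state in cur.fetchall():
            print(f"{id_job}\t{id_user}\t{job_name}\t{started}\tstate={state}")
finally:
    conn.close()

As a follow-up note: more recent sacctmgr releases have a "runaway jobs" listing (sacctmgr show runawayjobs) that automates much of this comparison and can fix the records for you; it may be worth checking whether the release you are on supports it before resorting to hand-edited SQL.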