Thanks for the document, Ed!

Mike Newton
Duke University
From: "Swindelles, Ed" <ed.swindel...@uconn.edu> Reply-To: slurm-dev <slurm-dev@schedmd.com> Date: Wednesday, June 22, 2016 at 10:55 AM To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] RE: slurmctld and slurmdb aborts Hi Mike – We have also recently experienced slurmctl failures causing orphaned jobs. We had many instances of the same error you referenced. We ended up having to manually update the SLURM accounting database. I documented the procedure in our internal wiki. If you’re interested its linked below. Please understand that this is risky, comes with no support, and should be considered for reference only. ☺ Also, I’m sure more seasoned experts here have a much better way of solving the same problem. https://www.dropbox.com/s/j1ez0spgc27z5eb/Manual%20Updating%20for%20Orphaned%20Jobs.pdf?dl=0 -- Ed Swindelles University of Connecticut From: Mike Newton [mailto:jmnew...@duke.edu] Sent: Tuesday, June 21, 2016 12:42 PM To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] slurmctld and slurmdb aborts We’re running SLURM 15.08.8 and having problems with slurmctld and slurmdb aborting. It started with slurmctld aborting enough that a cron was added to check if slurmctld was running and if not restart it. Over time I think this must have corrupted data as slurmdb is now aborting. The core dumps from slurmctld are showing this gdb: Core was generated by `/opt/slurm/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, p_ptr=0x27fcfd0) at gang.c:432 432 gang.c: No such file or directory. in gang.c The errors in the slurmdb.log have the following kind of error messages: [2016-06-21T10:00:02.916] error: We have more allocated time than is possible (24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1 [2016-06-21T10:00:02.916] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1 [2016-06-21T10:00:04.348] error: We have more allocated time than is possible (24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1 [2016-06-21T10:00:04.348] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1 [2016-06-21T10:00:05.766] error: We have more allocated time than is possible (24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 2016-06-20T21:00:00 tres 1 From a search I found a script attached to a ticket at schedmd that showed “lost jobs” (lost.pl). I ran this and it returned 4942 lines. Is there some way to clean up the database so that the slurmdb does not have the errors and keep from aborting? Thanks!