Thanks for the document, Ed!

Mike Newton
Duke University
From: "Swindelles, Ed" <ed.swindel...@uconn.edu> Reply-To: slurm-dev <slurm-dev@schedmd.com> Date: Wednesday, June 22, 2016 at 10:55 AM To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] RE: slurmctld and slurmdb aborts Hi Mike – We have also recently experienced slurmctl failures causing orphaned jobs. We had many instances of the same error you referenced. We ended up having to manually update the SLURM accounting database. I documented the procedure in our internal wiki. If you’re interested its linked below. Please understand that this is risky, comes with no support, and should be considered for reference only. ☺ Also, I’m sure more seasoned experts here have a much better way of solving the same problem. https://www.dropbox.com/s/j1ez0spgc27z5eb/Manual%20Updating%20for%20Orphaned%20Jobs.pdf?dl=0 -- Ed Swindelles University of Connecticut From: Mike Newton [mailto:jmnew...@duke.edu] Sent: Tuesday, June 21, 2016 12:42 PM To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] slurmctld and slurmdb aborts We’re running SLURM 15.08.8 and having problems with slurmctld and slurmdb aborting. It started with slurmctld aborting enough that a cron was added to check if slurmctld was running and if not restart it. Over time I think this must have corrupted data as slurmdb is now aborting. The core dumps from slurmctld are showing this gdb: Core was generated by `/opt/slurm/sbin/slurmctld'. Program terminated with signal 11, Segmentation fault. #0 0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, p_ptr=0x27fcfd0) at gang.c:432 432 gang.c: No such file or directory. in gang.c The errors in the slurmdb.log have the following kind of error messages: [2016-06-21T10:00:02.916] error: We have more allocated time than is possible (24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1 [2016-06-21T10:00:02.916] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1 [2016-06-21T10:00:04.348] error: We have more allocated time than is possible (24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1 [2016-06-21T10:00:04.348] error: We have more time than is possible (23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1 [2016-06-21T10:00:05.766] error: We have more allocated time than is possible (24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 2016-06-20T21:00:00 tres 1 From a search I found a script attached to a ticket at schedmd that showed “lost jobs” (lost.pl). I ran this and it returned 4942 lines. Is there some way to clean up the database so that the slurmdb does not have the errors and keep from aborting? Thanks!