Hi Mike –

 

We have also recently experienced slurmctld failures causing orphaned jobs. We hit many instances of the same error you referenced and ended up having to manually update the SLURM accounting database. I documented the procedure in our internal wiki; if you're interested, it's linked below. Please understand that this is risky, comes with no support, and should be considered for reference only. :) Also, I'm sure more seasoned experts here have a much better way of solving the same problem.

 

https://www.dropbox.com/s/j1ez0spgc27z5eb/Manual%20Updating%20for%20Orphaned%20Jobs.pdf?dl=0
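
To give a rough flavor of the idea (the PDF has the actual steps), here is a minimal sketch in Python. Everything in it is an assumption on my part -- the pymysql connection details, the default slurm_acct_db schema with a per-cluster <cluster>_job_table, and the end time/state that get stamped in -- so treat it as illustration only, and only run anything like this against a copy of your database:

#!/usr/bin/env python3
# Rough sketch of the kind of manual cleanup we did -- NOT the exact
# procedure from the wiki PDF. Assumes the default slurm_acct_db MySQL
# schema (a per-cluster <cluster>_job_table with id_job, time_start,
# time_end and state columns); cluster name, credentials and the
# end time/state stamped in are all site-specific assumptions.
import pymysql

CLUSTER = "dscr"              # cluster name (yours shows up as dscr(6465) in the log)
JOB_TABLE = f"{CLUSTER}_job_table"

conn = pymysql.connect(host="localhost", user="slurm",
                       password="********", database="slurm_acct_db")
try:
    with conn.cursor() as cur:
        # Find job records that started but never got an end time --
        # the "orphaned" rows that break the rollup and show up in lost.pl.
        cur.execute(
            f"SELECT id_job, time_start FROM {JOB_TABLE} "
            "WHERE time_start > 0 AND time_end = 0")
        orphans = cur.fetchall()
        print(f"{len(orphans)} candidate orphaned jobs")

        # After confirming (e.g. against squeue) that none of these jobs
        # are actually still running, close them out. Here we just stamp
        # time_end = time_start; pick whatever end time and state your
        # site considers least wrong.
        for id_job, time_start in orphans:
            cur.execute(
                f"UPDATE {JOB_TABLE} SET time_end = %s, state = %s "
                "WHERE id_job = %s",
                (time_start, 3, id_job))   # 3 = JOB_COMPLETE in slurm.h (assumption)
    conn.commit()
finally:
    conn.close()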

 

--

Ed Swindelles

University of Connecticut

 

From: Mike Newton [mailto:jmnew...@duke.edu] 
Sent: Tuesday, June 21, 2016 12:42 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] slurmctld and slurmdb aborts

 

We’re running SLURM 15.08.8 and having problems with slurmctld and slurmdbd aborting. It started with slurmctld aborting often enough that a cron job was added to check whether slurmctld was running and restart it if not. Over time I think this must have corrupted data, as slurmdbd is now aborting too. The core dumps from slurmctld show this in gdb:

 

Core was generated by `/opt/slurm/sbin/slurmctld'.

Program terminated with signal 11, Segmentation fault.

#0  0x0000000000444150 in _job_fits_in_active_row (job_ptr=0x7f7028323820, 
p_ptr=0x27fcfd0) at gang.c:432

432         gang.c: No such file or directory.

                in gang.c

 

slurmdb.log contains error messages like the following:

 

[2016-06-21T10:00:02.916] error: We have more allocated time than is possible 
(24742237 > 23274000) for cluster dscr(6465) from 2016-06-20T18:00:00 - 
2016-06-20T19:00:00 tres 1

[2016-06-21T10:00:02.916] error: We have more time than is possible 
(23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 
2016-06-20T18:00:00 - 2016-06-20T19:00:00 tres 1

[2016-06-21T10:00:04.348] error: We have more allocated time than is possible 
(24642666 > 23274000) for cluster dscr(6465) from 2016-06-20T19:00:00 - 
2016-06-20T20:00:00 tres 1

[2016-06-21T10:00:04.348] error: We have more time than is possible 
(23274000+259200+0)(23533200) > 23274000 for cluster dscr(6465) from 
2016-06-20T19:00:00 - 2016-06-20T20:00:00 tres 1

[2016-06-21T10:00:05.766] error: We have more allocated time than is possible 
(24319743 > 23274000) for cluster dscr(6465) from 2016-06-20T20:00:00 - 
2016-06-20T21:00:00 tres 1

 

From a search I found a script (lost.pl) attached to a ticket at SchedMD that shows “lost jobs”. I ran it and it returned 4942 lines.

 

Is there some way to clean up the database so that slurmdbd stops logging these errors and stops aborting?

 

Thanks!
