Hi Mats,

Mats Kronberg <[email protected]> writes:
> Hi all,
>
> We seem to have some bad data in the sacct database (some jobs that
> finished long ago are listed as PENDING). This makes it impossible to
> delete an association (using sacctmgr), since SLURM believes a job is
> still pending.
>
> Example:
>
> # squeue -j 2771
> slurm_load_jobs error: Invalid job id specified
>
> # scontrol show job 2771
> slurm_load_jobs error: Invalid job id specified
>
> # sacct --jobs=2771
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
>         2771     ZF2HYB   triolith snic002-1+         64    PENDING      0:0
>
>
> There are more jobs like this, I found a total of 310 PENDING entries
> that did not match a pending job (according to squeue) out of 40000 in
> the sacct database.
>
> Since we create external job completion logs using a script
> (JobCompLoc in slurm.conf), I know that job 2771 was actually started.
> This particular job ended in NODE_FAIL, but other affected jobs ended
> normally (COMPLETED) after a normal runtime.
>
>
> How do we fix this?
>
> Is there a safe way to modify the database? The sacctmgr man page says
> "The DerivedExitCode and Comment fields are the only fields of a job
> record in the database that can be modified after job completion".
>
> SLURM version: 2.4.5
>
>
> --
> Mats Kronberg
> NSC, National Supercomputer Centre

You might find a script I wrote and posted to the list useful:

http://permalink.gmane.org/gmane.comp.distributed.slurm.devel/3624

Your case seems slightly different from ours, as our problem involved
non-existent jobs in the state RUNNING, rather than PENDING. However,
the script is reasonably generic, so it should at least provide a
starting point.

Regards

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email [email protected]
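
For reference, the check described in the quoted message (PENDING entries
in sacct with no matching job in squeue) could be sketched roughly as
below. This is not the script linked above and makes no changes to the
database; it only lists candidate job IDs. The function names, the
`JobID|State` output layout, and the exact sacct/squeue options are
assumptions that should be checked against the man pages of the
installed SLURM version.

```python
import subprocess


def parse_sacct_pending(text):
    """Return the set of job IDs that sacct reports as PENDING.

    Assumes lines of the form 'JobID|State', e.g. as produced by
    sacct --parsable2 --noheader --format=JobID,State.
    """
    ids = set()
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 2:
            continue  # blank or malformed line
        job_id, state = fields[0], fields[1]
        if "." in job_id:
            continue  # skip job steps such as 2771.batch
        if state.startswith("PENDING"):
            ids.add(job_id)
    return ids


def parse_squeue_ids(text):
    """Return the set of job IDs squeue knows about (one ID per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def stale_pending(sacct_text, squeue_text):
    """Job IDs that are PENDING in accounting but unknown to squeue."""
    return sorted(parse_sacct_pending(sacct_text) - parse_squeue_ids(squeue_text))


def report(since):
    """Run sacct and squeue and print the stale PENDING job IDs.

    'since' is a start time understood by sacct --starttime; the option
    set used here is an assumption, not taken from the original thread.
    """
    sacct_out = subprocess.check_output(
        ["sacct", "--allusers", "--noheader", "--parsable2",
         "--starttime=" + since, "--format=JobID,State"],
        text=True,
    )
    squeue_out = subprocess.check_output(
        ["squeue", "--noheader", "--format=%i"], text=True
    )
    for job_id in stale_pending(sacct_out, squeue_out):
        print(job_id)
```

Calling `report()` with a start date early enough to cover the affected
jobs would print one candidate job ID per line; the parsing functions
can also be tried on saved command output first.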
