Hi all,
We seem to have some bad data in the sacct database: some jobs that
finished long ago are still listed as PENDING. This makes it impossible to
delete an association with sacctmgr, since SLURM believes a job is
still pending.
Example:
# squeue -j 2771
slurm_load_jobs error: Invalid job id specified
# scontrol show job 2771
slurm_load_jobs error: Invalid job id specified
# sacct --jobs=2771
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2771             ZF2HYB   triolith snic002-1+         64    PENDING      0:0
There are more jobs like this: out of 40000 entries in the sacct
database, I found a total of 310 PENDING entries that did not match a
pending job (according to squeue).
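For anyone who wants to run the same comparison, here is a rough sketch of
the idea, not the exact commands I used. The helper name stale_pending and
the file names are made up for illustration, and the sacct/squeue options in
the comments are my assumption of what fits; check them against the man
pages for your SLURM version.

```shell
#!/bin/sh
# Hypothetical helper: print job IDs that are listed in file $1
# (PENDING according to sacct) but not in file $2 (PENDING according
# to squeue). Both files hold one job ID per line.
stale_pending() {
    acct_sorted=$(mktemp) || return 1
    queue_sorted=$(mktemp) || { rm -f "$acct_sorted"; return 1; }
    # comm needs both inputs sorted the same way.
    sort -u "$1" > "$acct_sorted"
    sort -u "$2" > "$queue_sorted"
    # -23 keeps only the lines unique to the first file.
    comm -23 "$acct_sorted" "$queue_sorted"
    rm -f "$acct_sorted" "$queue_sorted"
}

# Assumed usage on a live system (<date> is a placeholder for a start
# time early enough to cover the old jobs):
#   sacct --allusers --allocations --state=PENDING --starttime=<date> \
#         --noheader --parsable2 --format=JobID > acct_pending.txt
#   squeue --noheader --states=PENDING --format=%i > queue_pending.txt
#   stale_pending acct_pending.txt queue_pending.txt
```

Every job ID this prints is PENDING in the accounting database but unknown
to the scheduler as a pending job, which is the mismatch described above.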
Since we create external job completion logs using a script
(JobCompLoc in slurm.conf), I know that job 2771 was actually started.
This particular job ended in NODE_FAIL, but other affected jobs ended
normally (COMPLETED) after a normal runtime.
How do we fix this?
Is there a safe way to modify the database? The sacctmgr man page says
"The DerivedExitCode and Comment fields are the only fields of a job
record in the database that can be modified after job completion".
SLURM version: 2.4.5
--
Mats Kronberg
NSC, National Supercomputer Centre