Hi Mats,

Mats Kronberg <[email protected]> writes:

> Hi all,
> 
> We seem to have some bad data in the sacct database (some jobs that
> finished long ago are listed as PENDING). This makes it impossible to
> delete an association (using sacctmgr), since SLURM believes a job is
> still pending.
> 
> Example:
> 
> # squeue -j 2771
> slurm_load_jobs error: Invalid job id specified
> 
> # scontrol show job 2771
> slurm_load_jobs error: Invalid job id specified
> 
> # sacct --jobs=2771
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 2771             ZF2HYB   triolith snic002-1+         64    PENDING      0:0
> 
> 
> There are more jobs like this, I found a total of 310 PENDING entries
> that did not match a pending job (according to squeue) out of 40000 in
> the sacct database.
> 
> Since we create external job completion logs using a script
> (JobCompLoc in slurm.conf), I know that job 2771 was actually started.
> This particular job ended in NODE_FAIL, but other affected jobs ended
> normally (COMPLETED) after a normal runtime.
> 
> 
> How do we fix this?
> 
> Is there a safe way to modify the database? The sacctmgr man page says
> "The DerivedExitCode and Comment fields are the only fields of a job
> record in the database that can be modified after job completion".
> 
> SLURM version: 2.4.5
> 
> 
> -- 
> Mats Kronberg
> NSC, National Supercomputer Centre

You might find a script I wrote and posted to the list useful: 

http://permalink.gmane.org/gmane.comp.distributed.slurm.devel/3624

Your case seems slightly different from ours, as our problem involved
non-existent jobs in the state RUNNING, rather than PENDING.  However,
the script is reasonably generic, so it should at least provide a
starting point.
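
For the first step, finding the affected jobs, something along the
following lines might do.  This is only a rough sketch and not the
script from the link above.  It assumes that sacct is called with
--parsable2 and --format=JobID,State, so that each line looks like
"2771|PENDING", and that squeue prints one job ID per line.  The
function names and the start date are made up for illustration, and I
have not tested the command-line part against 2.4.5.

```python
import subprocess


def parse_sacct(text):
    """Return the job IDs that sacct lists as PENDING.

    Expects 'JobID|State' lines (sacct --parsable2 --noheader).
    """
    ids = set()
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 2:
            continue
        job_id, state = fields[0], fields[1]
        # Skip job steps such as '2771.0' and anything non-numeric.
        if not job_id.isdigit():
            continue
        if state == "PENDING":
            ids.add(job_id)
    return ids


def parse_squeue(text):
    """Return the job IDs squeue knows about (one ID per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def find_stale_pending(sacct_text, squeue_text):
    """Job IDs that are PENDING in the database but unknown to squeue."""
    stale = parse_sacct(sacct_text) - parse_squeue(squeue_text)
    return sorted(stale, key=int)


def collect_from_cluster(start="2012-01-01"):
    """Run sacct and squeue and compare them.

    The start date is a placeholder (assumption); pick one that covers
    the period in which the bad records were written.
    """
    sacct_out = subprocess.check_output(
        ["sacct", "--allusers", "--noheader", "--parsable2",
         "--starttime=" + start, "--format=JobID,State"],
        universal_newlines=True)
    squeue_out = subprocess.check_output(
        ["squeue", "-h", "-o", "%i"], universal_newlines=True)
    return find_stale_pending(sacct_out, squeue_out)
```

Calling collect_from_cluster() on the machine where sacct and squeue
work should give you the list of suspect job IDs.  It only reports
them and does not touch the database, so it should be safe to run
before deciding how to fix the records.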

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
