Hello all,
Twice now I've found a node draining with reason "Duplicate jobid".
The slurmctld log shows:
backfill: Started JobId=X in <...> on Y
_slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
email msg to <...>: SLURM Job_id=X Name=<...> Failed, Run time 00:00:00, PENDING, ExitCode 0
drain_nodes: node Y state set to DRAIN
error: Duplicate jobid on nodes Y, set to state DRAIN
Requeuing JobID=X State=0x0 NodeCnt=0
Job X shows reason "launch failed requeued held".
I'm guessing job X is the offending job here.
Is it expected behaviour that a failed job launch is handled as a
duplicate jobid? If so, can anybody elaborate on this, and do I need to
do anything (besides resuming the node)?
Or is this a bug, perhaps caused by the timing of the requeue?
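
In case it helps anyone check their own slurmctld log for the same pattern, here is a minimal Python sketch. It assumes the log lines look exactly like the excerpt above (the wording may differ between Slurm versions), and the function name is purely illustrative. It pairs each "Duplicate jobid" error with the next "Requeuing JobID=" line, which is only a guess at the offending job, as noted above.

```python
import re

# Patterns copied from the log excerpt above; treating them as stable
# across Slurm versions is an assumption.
DUPLICATE_RE = re.compile(r"error: Duplicate jobid on nodes (\S+), set to state DRAIN")
REQUEUE_RE = re.compile(r"Requeuing JobID=(\S+)")


def find_duplicate_jobid_drains(lines):
    """Return (node, job_id) pairs, one per 'Duplicate jobid' drain.

    job_id comes from the first 'Requeuing JobID=' line after the error,
    or is None when no such line follows. This pairing is a heuristic.
    """
    results = []
    pending_node = None  # node from a drain error still awaiting a requeue line
    for line in lines:
        dup = DUPLICATE_RE.search(line)
        if dup:
            # A second error before any requeue: record the first without a job.
            if pending_node is not None:
                results.append((pending_node, None))
            pending_node = dup.group(1)
            continue
        req = REQUEUE_RE.search(line)
        if req and pending_node is not None:
            results.append((pending_node, req.group(1)))
            pending_node = None
    if pending_node is not None:
        results.append((pending_node, None))
    return results
```

Passing an open log file (or any iterable of lines) to find_duplicate_jobid_drains gives the list of drained nodes together with the job that was requeued right after each error.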
Best,
Robbert
--
Robbert Eggermont Intelligent Systems
[email protected] Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology