Hello all,

Twice now I've found a node draining with reason "Duplicate jobid".

The slurmctld log shows:
backfill: Started JobId=X in <...> on Y
_slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
email msg to <...>: SLURM Job_id=X Name=<...> Failed, Run time 00:00:00, PENDING, ExitCode 0
drain_nodes: node Y state set to DRAIN
error: Duplicate jobid on nodes Y, set to state DRAIN
Requeuing JobID=X State=0x0 NodeCnt=0

Job X shows reason "launch failed requeued held".

I'm guessing job X is the offending job here.

Is it expected behaviour that a failed job launch is handled as a duplicate jobid? If so, can anybody elaborate on this, and do I need to do anything besides resuming the node?

Or is this a bug, perhaps caused by the timing of the requeue?

Best,

Robbert

--
Robbert Eggermont                                  Intelligent Systems
[email protected]         Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234                         Delft University of Technology
