[slurm-dev] Duplicate jobid, launch failed?

Robbert Eggermont Wed, 16 Mar 2016 03:21:13 -0700


Hello all,


Twice now I've found a node draining with reason "Duplicate jobid".

The slurmctld log shows:
backfill: Started JobId=X in <...> on Y
_slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
email msg to <...>: SLURM Job_id=X Name=<...> Failed, Run time 00:00:00,PENDING, ExitCode 0
drain_nodes: node Y state set to DRAIN
error: Duplicate jobid on nodes Y, set to state DRAIN
Requeuing JobID=X State=0x0 NodeCnt=0

Job X shows reason "launch failed requeued held".
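
To check how often this happens, the log could be scanned for the two lines that seem to matter. The following is only a rough sketch: the patterns are copied from the messages quoted above and may not match other Slurm versions, and the function name, job ID 1234, and node name node01 are hypothetical placeholders rather than values from my cluster.

```python
import re

# Patterns copied from the log lines quoted above. They are an assumption
# about the message format and may differ between Slurm versions.
DUPLICATE_RE = re.compile(r"error: Duplicate jobid on nodes (\S+), set to state DRAIN")
REQUEUE_RE = re.compile(r"Requeuing JobID=(\S+)")


def scan_slurmctld_log(lines):
    """Return (drained_nodes, requeued_jobs) found in the given log lines."""
    drained_nodes = []
    requeued_jobs = []
    for line in lines:
        # search() rather than match(), so a timestamp prefix does not matter
        match = DUPLICATE_RE.search(line)
        if match:
            drained_nodes.append(match.group(1))
        match = REQUEUE_RE.search(line)
        if match:
            requeued_jobs.append(match.group(1))
    return drained_nodes, requeued_jobs


# Hypothetical sample mirroring the excerpt above (X -> 1234, Y -> node01).
sample = [
    "backfill: Started JobId=1234 in main on node01",
    "_slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0",
    "drain_nodes: node node01 state set to DRAIN",
    "error: Duplicate jobid on nodes node01, set to state DRAIN",
    "Requeuing JobID=1234 State=0x0 NodeCnt=0",
]

nodes, jobs = scan_slurmctld_log(sample)
print("drained nodes:", nodes)
print("requeued jobs:", jobs)
```

Pairing each drained node with the requeue line that follows it would at least show whether the same job is involved every time.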

I'm guessing job X is the offending job here.

Is it expected behaviour that a failed job launch is handled as a duplicate jobid? If so, can anybody elaborate on this, and do I need to do anything (besides resuming the node)?


Or is this a bug, perhaps caused by the timing of the requeue?

Best,

Robbert

--
Robbert Eggermont                                  Intelligent Systems
[email protected]         Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234                         Delft University of Technology
