Re: [slurm-dev] JobRequeue and NODE_FAIL

Yuri D'Elia Fri, 09 Dec 2011 01:53:55 -0800

On Wed, 7 Dec 2011 18:35:54 +0100
"Yuri D'Elia" <[email protected]> wrote:


> # sacct --duplicates -S2011-12-06T10:00:00 -o jobid,jobname,state,node,start -j 34458,34459
>        JobID    JobName      State        NodeList               Start
> ------------ ---------- ---------- --------------- -------------------
> 34459         MER_20892  NODE_FAIL         abz07gm 2011-12-06T20:57:46
> 34459         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:02:10
> 34458         MER_20892  NODE_FAIL         abz07gm 2011-12-06T20:57:46
> 34458         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:01:42
> 34458         MER_20892  COMPLETED         abz03gm 2011-12-06T22:03:07
> 34459         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:03:50
> 
> Notice how job 34459 was immediately rescheduled on "abz03gm" at 22:02, only
> to fail *again* with NODE_FAIL (which of course isn't true: the node didn't
> really fail), and was then rescheduled a third time on the same node
> "abz03gm", where it failed again.
>
> Meanwhile, job 34458 was scheduled on abz03gm, failed immediately the first
> time, was rescheduled just one second afterwards on the same node, and
> succeeded.

And, just for the record: the jobs marked as COMPLETED didn't actually run
(I checked their output files).
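
For anyone who wants to spot the same pattern in their own accounting data, here is a minimal sketch that groups the sacct rows quoted above by job and reports nodes on which a job hit NODE_FAIL more than once. The function names are hypothetical, and the parser assumes whitespace-separated columns in the order jobid, jobname, state, node, start, with no spaces inside job names; it is not a general sacct parser.

```python
from collections import defaultdict

# Rows copied from the sacct output quoted above.
SACCT_ROWS = """\
34459         MER_20892  NODE_FAIL         abz07gm 2011-12-06T20:57:46
34459         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:02:10
34458         MER_20892  NODE_FAIL         abz07gm 2011-12-06T20:57:46
34458         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:01:42
34458         MER_20892  COMPLETED         abz03gm 2011-12-06T22:03:07
34459         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:03:50
"""


def summarize_requeues(text):
    """Return {jobid: [(start, state, node), ...]} sorted by start time."""
    attempts = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Assumption: exactly five columns and a numeric job ID;
        # this skips header, separator and blank lines.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        jobid, _name, state, node, start = fields
        attempts[jobid].append((start, state, node))
    for runs in attempts.values():
        runs.sort()  # ISO 8601 timestamps sort chronologically as strings
    return dict(attempts)


def repeated_node_failures(attempts):
    """Return {jobid: {node: count}} for nodes with more than one NODE_FAIL."""
    result = {}
    for jobid, runs in attempts.items():
        counts = defaultdict(int)
        for _start, state, node in runs:
            if state == "NODE_FAIL":
                counts[node] += 1
        repeated = {node: n for node, n in counts.items() if n > 1}
        if repeated:
            result[jobid] = repeated
    return result


if __name__ == "__main__":
    attempts = summarize_requeues(SACCT_ROWS)
    for jobid, nodes in repeated_node_failures(attempts).items():
        print(jobid, nodes)
```

On the rows above this flags only job 34459 on abz03gm. It cannot detect the other problem, a job recorded as COMPLETED that never actually ran; that still requires checking the output files.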
