On Wed, 7 Dec 2011 18:35:54 +0100 "Yuri D'Elia" <[email protected]> wrote:
> # sacct --duplicates -S2011-12-06T10:00:00 -o jobid,jobname,state,node,start
>   -j 34458,34459
> JobID        JobName    State      NodeList        Start
> ------------ ---------- ---------- --------------- -------------------
> 34459        MER_20892  NODE_FAIL  abz07gm         2011-12-06T20:57:46
> 34459        MER_20892  NODE_FAIL  abz03gm         2011-12-06T22:02:10
> 34458        MER_20892  NODE_FAIL  abz07gm         2011-12-06T20:57:46
> 34458        MER_20892  NODE_FAIL  abz03gm         2011-12-06T22:01:42
> 34458        MER_20892  COMPLETED  abz03gm         2011-12-06T22:03:07
> 34459        MER_20892  NODE_FAIL  abz03gm         2011-12-06T22:03:50
>
> Notice how job 34459 was immediately rescheduled on "abz03gm" at 22:02, to
> fail *again* with NODE_FAIL (but of course this isn't true, the node didn't
> fail really), to be rescheduled the third time on the same node "abz03gm" and
> fail again.
>
> Meanwhile, job 34458 was scheduled on abz03gm, to fail immediately the first
> time, to be rescheduled just one second afterwards on the same node and
> succeed.

And, just for posterity: the jobs marked as COMPLETED didn't actually run
(I checked the output files).
