On Wed, 07 Dec 2011 10:01:06 -0700
Moe Jette <[email protected]> wrote:
> Hi Yiri,
>
> The requeued job keeps the same job ID, name, priority, qos, etc. The
> requeued job submit time is reset to the current time, so you will see
> two records in the accounting logs for that job ID with different
> submit times. The job also has an environment variable set for it,
> SLURM_RESTART_COUNT.
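For reference, a batch script can use that variable to detect whether it is running after a requeue. A minimal sketch (SLURM_RESTART_COUNT is unset on the first execution, so default it to 0; the job name and ID here are only illustrative):

```shell
#!/bin/bash
#SBATCH --requeue

# Unset on the first run; set to the number of restarts on a requeued run.
restarts=${SLURM_RESTART_COUNT:-0}

if [ "$restarts" -gt 0 ]; then
    echo "Job $SLURM_JOB_ID restarted $restarts time(s)"
fi
```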
There seem to be some bugs in the requeuing (at least in my 2.4 test
installation). This is what I got when one of the nodes crashed:
# sacct --duplicates -S2011-12-06T10:00:00 -o jobid,jobname,state,node,start -j 34458,34459
JobID JobName State NodeList Start
------------ ---------- ---------- --------------- -------------------
34459 MER_20892 NODE_FAIL abz07gm 2011-12-06T20:57:46
34459 MER_20892 NODE_FAIL abz03gm 2011-12-06T22:02:10
34458 MER_20892 NODE_FAIL abz07gm 2011-12-06T20:57:46
34458 MER_20892 NODE_FAIL abz03gm 2011-12-06T22:01:42
34458 MER_20892 COMPLETED abz03gm 2011-12-06T22:03:07
34459 MER_20892 NODE_FAIL abz03gm 2011-12-06T22:03:50
Notice how job 34459 was immediately requeued on "abz03gm" at 22:02, only to
fail *again* with NODE_FAIL (which cannot be right, since that node had not
actually failed), and was then requeued a third time on the same node
"abz03gm", where it failed once more.
Meanwhile, job 34458 was scheduled on abz03gm, failed immediately on its first
attempt there, was requeued just one second afterwards on the same node, and
succeeded.
I have 32 instances of this happening. It seems that the first requeuing
always lands on the first node of the partition (abz03gm), and it always
fails. The second requeuing (on the same node) sometimes succeeds and
sometimes does not, but no more than 3 attempts are ever made.