On Wed, 07 Dec 2011 10:01:06 -0700
Moe Jette <[email protected]> wrote:

> Hi Yiri,
> 
> The requeued job keeps the same job ID, name, priority, qos, etc. The  
> requeued job submit time is reset to the current time, so you will see  
> two records in the accounting logs for that job ID with different  
> submit times. The job also has an environment variable set for it,  
> SLURM_RESTART_COUNT.

There seem to be some bugs in the rescheduling (at least in my 2.4 test 
installation):

This is what I got when one of the nodes crashed:

# sacct --duplicates -S2011-12-06T10:00:00 -o jobid,jobname,state,node,start -j 
34458,34459
       JobID    JobName      State        NodeList               Start
------------ ---------- ---------- --------------- -------------------
34459         MER_20892  NODE_FAIL         abz07gm 2011-12-06T20:57:46
34459         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:02:10
34458         MER_20892  NODE_FAIL         abz07gm 2011-12-06T20:57:46
34458         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:01:42
34458         MER_20892  COMPLETED         abz03gm 2011-12-06T22:03:07
34459         MER_20892  NODE_FAIL         abz03gm 2011-12-06T22:03:50

Notice how job 34459 was immediately rescheduled on "abz03gm" at 22:02, only to 
fail *again* with NODE_FAIL (which can't be right; the node didn't really 
fail), and was then rescheduled a third time on the same node "abz03gm", where 
it failed once more.

Meanwhile, job 34458 was rescheduled on abz03gm, failed immediately, and was 
then rescheduled just one second later on the same node, where it succeeded.

I have 32 instances of this happening. The first requeuing always seems to land 
on the first node of the partition (abz03gm), and it always fails. The second 
requeuing (on the same node) sometimes works and sometimes doesn't, but no more 
than 3 attempts are ever performed.
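Incidentally, since requeued runs share a job ID, the SLURM_RESTART_COUNT 
variable mentioned above is the only way the job script itself can tell a 
requeue from a first run. A minimal sketch (the echo messages and cleanup step 
are hypothetical, not from any real script of mine):

```shell
#!/bin/bash
#SBATCH --requeue
# SLURM_RESTART_COUNT is set by Slurm on requeued jobs; it is unset (or 0)
# on the first run, so default it to 0 here.
restarts=${SLURM_RESTART_COUNT:-0}

if [ "$restarts" -gt 0 ]; then
    # A requeued run may find partial output from the failed attempt,
    # so this is where cleanup/recovery logic would go.
    echo "requeued run, restart count $restarts"
else
    echo "first run"
fi
```

That at least lets the job log which attempt it is, which would have made the 
accounting records above easier to untangle.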
