Ah, good to know. I do prefer that behavior, just didn't expect it. Thanks.

-Paul Edmon-

On 03/03/2015 02:00 PM, David Bigagli wrote:


Ah ok, the job failed to launch. In that case Slurm requeues the job in a held state; the previous behaviour was to terminate the job.
The reason for this is to avoid the job failing to dispatch over and over.
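If you do want to retry such a job, releasing the hold manually should work, something along these lines (job id is a placeholder):

  scontrol show job <jobid>     # inspect why the job was held
  scontrol release <jobid>      # clear the hold so it can be scheduled again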

On 03/03/2015 10:53 AM, Paul Edmon wrote:

In this case the node was in a funny state where it couldn't resolve
user IDs.  So right after the job tried to launch it failed and
requeued.  We just let the scheduler do what it will when it reports
NODE_FAIL.

-Paul Edmon-


On 03/03/2015 01:20 PM, David Bigagli wrote:

How do you set your node down? If I run a job and then issue

'scontrol update node=prometeo state=down'

the job is requeued in the pending state. Do you have an epilog?

On 03/03/2015 10:12 AM, Paul Edmon wrote:

We are definitely using the default for that one.  So it should be
requeueing just fine.
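
To double-check what the running config actually has, something like this should confirm it:

  scontrol show config | grep -i requeue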

-Paul Edmon-

On 03/03/2015 01:05 PM, Lipari, Don wrote:
It looks like the governing config parameter would be:

JobRequeue
     This option controls what to do by default after a node failure.
     If JobRequeue is set to a value of 1, then any batch job running
     on the failed node will be requeued for execution on different
     nodes.  If JobRequeue is set to a value of 0, then any job running
     on the failed node will be terminated.  Use the sbatch --no-requeue
     or --requeue option to change the default behavior for individual
     jobs.  The default value is 1.

According to this, the job should be requeued and not held.
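
For reference, a minimal sketch of the relevant pieces (parameter names as in the man page, values only illustrative):

  # slurm.conf: requeue batch jobs on node failure (this is the default)
  JobRequeue=1

  # per-job override at submit time
  sbatch --no-requeue job.sh   # terminate instead of requeue on node failure
  sbatch --requeue job.sh      # explicitly allow requeue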

-----Original Message-----
From: Paul Edmon [mailto:ped...@cfa.harvard.edu]
Sent: Tuesday, March 03, 2015 9:14 AM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue Exit


Basically the node cuts out due to hardware issues and the jobs are
requeued.  I'm just trying to figure out why it sent them into a held
state as opposed to just requeueing as normal.  Thoughts?
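
(In case it matters, the state and reason of the requeued jobs can be checked with something like the following; format codes per the squeue man page, user name is a placeholder:

  squeue -u <user> -o "%i %T %r")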

-Paul Edmon-

On 03/03/2015 12:11 PM, David Bigagli wrote:
There are no default values for these parameters; you have to
configure your own.  In your case, does the prolog fail or does the
node change state while the jobs are running?

On 03/03/2015 08:30 AM, Paul Edmon wrote:
So what are the default values for these two options?  We recently
updated to 14.11 and jobs that previously would have just requeued
due
to node failure are now going into a held state.

*RequeueExit*
     Enables automatic job requeue for jobs which exit with the
     specified values.  Separate multiple exit codes with a comma.
     Jobs will be put back into pending state and later scheduled
     again.  Restarted jobs will have the environment variable
     *SLURM_RESTART_COUNT* set to the number of times the job has
     been restarted.

*RequeueExitHold*
     Enables automatic requeue of jobs into pending state in hold,
     meaning their priority is zero.  Separate multiple exit codes
     with a comma.  These jobs are put in the *JOB_SPECIAL_EXIT*
     exit state.  Restarted jobs will have the environment variable
     *SLURM_RESTART_COUNT* set to the number of times the job has
     been restarted.
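
For reference, an illustrative example of how these would be set if configured (the exit codes here are made up):

  # slurm.conf
  RequeueExit=142,143     # requeue jobs that exit with these codes
  RequeueExitHold=144     # requeue in held state (JOB_SPECIAL_EXIT)

and a batch script can check how many times it has been restarted:

  #!/bin/bash
  #SBATCH --requeue
  echo "restart count: ${SLURM_RESTART_COUNT:-0}"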

-Paul Edmon-


