Ah, good to know. I do prefer that behavior, just didn't expect it. Thanks.

-Paul Edmon-

On 03/03/2015 02:00 PM, David Bigagli wrote:


Ah ok, the job failed to launch. In that case Slurm requeues the job in a held state; the previous behaviour was to terminate the job.
The reason for this is to avoid the job failing to dispatch over and over.
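If you do want to retry such a job, releasing the hold manually should work, something along these lines (job id is a placeholder):

  scontrol show job <jobid>     # inspect why the job was held
  scontrol release <jobid>      # clear the hold so it can be scheduled again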

On 03/03/2015 10:53 AM, Paul Edmon wrote:

In this case the node was in a funny state where it couldn't resolve
user IDs.  So right after the job tried to launch it failed and
requeued.  We just let the scheduler do what it will when it reports
NODE_FAIL.

-Paul Edmon-


On 03/03/2015 01:20 PM, David Bigagli wrote:

How do you set your node down? If I run a job and then issue

'scontrol update node=prometeo state=down'

the job is requeued in the pending state. Do you have an epilog?

On 03/03/2015 10:12 AM, Paul Edmon wrote:

We are definitely using the default for that one.  So it should be
requeueing just fine.
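
To double-check what the running config actually has, something like this should confirm it:

  scontrol show config | grep -i requeue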

-Paul Edmon-

On 03/03/2015 01:05 PM, Lipari, Don wrote:
It looks like the governing config parameter would be:

JobRequeue
     This option controls what to do by default after a node failure.
     If JobRequeue is set to a value of 1, then any batch job running
     on the failed node will be requeued for execution on different
     nodes.  If JobRequeue is set to a value of 0, then any job running
     on the failed node will be terminated.  Use the sbatch --no-requeue
     or --requeue option to change the default behavior for individual
     jobs.  The default value is 1.

According to this, the job should be requeued and not held.
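
For reference, a minimal sketch of the relevant pieces (parameter names as in the man page, values only illustrative):

  # slurm.conf: requeue batch jobs on node failure (this is the default)
  JobRequeue=1

  # per-job override at submit time
  sbatch --no-requeue job.sh   # terminate instead of requeue on node failure
  sbatch --requeue job.sh      # explicitly allow requeue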

-----Original Message-----
From: Paul Edmon [mailto:ped...@cfa.harvard.edu]
Sent: Tuesday, March 03, 2015 9:14 AM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue Exit


Basically the node cuts out due to hardware issues and the jobs are
requeued.  I'm just trying to figure out why it sent them into a held
state as opposed to just requeueing as normal.  Thoughts?
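
(In case it matters, the state and reason of the requeued jobs can be checked with something like the following; format codes per the squeue man page, user name is a placeholder:

  squeue -u <user> -o "%i %T %r")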

-Paul Edmon-

On 03/03/2015 12:11 PM, David Bigagli wrote:
There are no default values for these parameters; you have to
configure your own.  In your case, does the prolog fail or does the
node change state while the jobs are running?

On 03/03/2015 08:30 AM, Paul Edmon wrote:
So what are the default values for these two options?  We recently
updated to 14.11 and jobs that previously would have just requeued
due
to node failure are now going into a held state.

*RequeueExit*
     Enables automatic job requeue for jobs which exit with the
     specified values.  Separate multiple exit codes with a comma.
     Jobs will be put back into pending state and later scheduled
     again.  Restarted jobs will have the environment variable
     *SLURM_RESTART_COUNT* set to the number of times the job has
     been restarted.

*RequeueExitHold*
     Enables automatic requeue of jobs into pending state in hold,
     meaning their priority is zero.  Separate multiple exit codes
     with a comma.  These jobs are put in the *JOB_SPECIAL_EXIT*
     exit state.  Restarted jobs will have the environment variable
     *SLURM_RESTART_COUNT* set to the number of times the job has
     been restarted.
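
For reference, an illustrative example of how these would be set if configured (the exit codes here are made up):

  # slurm.conf
  RequeueExit=142,143     # requeue jobs that exit with these codes
  RequeueExitHold=144     # requeue in held state (JOB_SPECIAL_EXIT)

and a batch script can check how many times it has been restarted:

  #!/bin/bash
  #SBATCH --requeue
  echo "restart count: ${SLURM_RESTART_COUNT:-0}"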

-Paul Edmon-


