[slurm-dev] MaxCPUs option in slurm.conf

2015-03-03 Thread tejas.deshpande
Hi Team,

Is it possible to set a MaxCPUs option on partitions in slurm.conf?
I need to restrict jobs to a certain number of cores.
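
For context, what I am after is roughly the sketch below (partition and
node names are made up, and I am not sure MaxCPUsPerNode is the right
knob, hence the question):

    # hypothetical slurm.conf partition line; MaxCPUsPerNode caps how many
    # CPUs a job may use on each node of this partition
    PartitionName=short Nodes=node[01-16] MaxCPUsPerNode=8 State=UP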


--
Regards,
Tejas



[slurm-dev] Requeue Exit

2015-03-03 Thread Paul Edmon
So what are the default values for these two options?  We recently 
updated to 14.11 and jobs that previously would have just requeued due 
to node failure are now going into a held state.


*RequeueExit*
   Enables automatic job requeue for jobs which exit with the specified
   values. Separate multiple exit codes with a comma. Jobs will be put
   back into the pending state and later scheduled again. Restarted jobs
   will have the environment variable *SLURM_RESTART_COUNT* set to the
   number of times the job has been restarted.

*RequeueExitHold*
   Enables automatic requeue of jobs into the pending state with a hold,
   meaning their priority is zero. Separate multiple exit codes with a
   comma. These jobs are put into the *JOB_SPECIAL_EXIT* exit state.
   Restarted jobs will have the environment variable
   *SLURM_RESTART_COUNT* set to the number of times the job has been
   restarted.
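
For reference, the kind of job these would apply to might look like this
minimal sketch (the payload and its exit codes are hypothetical):

    #!/bin/bash
    #SBATCH --job-name=requeue-demo
    #SBATCH --requeue
    # SLURM_RESTART_COUNT is unset on the first run and counts later requeues
    echo "restart count: ${SLURM_RESTART_COUNT:-0}"
    ./my_payload   # hypothetical application; its exit code is what
                   # RequeueExit/RequeueExitHold match against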


-Paul Edmon-



[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli


There are no default values for these parameters; you have to configure
your own. In your case, does the prolog fail, or does the node change
state while the jobs are running?
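
For example, something along these lines in slurm.conf (the exit codes
below are arbitrary, just for illustration):

    # jobs exiting with 100 or 101 are requeued and rescheduled
    RequeueExit=100,101
    # jobs exiting with 102 are requeued but held (priority 0, JOB_SPECIAL_EXIT)
    RequeueExitHold=102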





--

Thanks,
  /David/Bigagli

www.schedmd.com


[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


Basically the node cuts out due to hardware issues and the jobs are 
requeued.  I'm just trying to figure out why it sent them into a held 
state instead of simply requeueing as normal. Thoughts?


-Paul Edmon-






[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli


How do you set your node down? If I run a job and then issue

'scontrol update node=prometeo state=down'

the job is requeued into the pending state. Do you have an epilog?




--

Thanks,
  /David/Bigagli

www.schedmd.com


[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


We are definitely using the default for that one.  So it should be 
requeueing just fine.


-Paul Edmon-

On 03/03/2015 01:05 PM, Lipari, Don wrote:

It looks like the governing config parameter would be:

JobRequeue
   This option controls what to do by default after a node failure. If
   JobRequeue is set to a value of 1, then any batch job running on the
   failed node will be requeued for execution on different nodes. If
   JobRequeue is set to a value of 0, then any job running on the failed
   node will be terminated. Use the sbatch --no-requeue or --requeue
   option to change the default behavior for individual jobs. The
   default value is 1.

According to this, the job should be requeued and not held.
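
For reference, a minimal sketch of the pieces Don describes (the job
script name is a placeholder):

    # slurm.conf: requeue batch jobs on node failure (the documented default)
    JobRequeue=1

    # per-job overrides at submission time
    sbatch --requeue    job.sh
    sbatch --no-requeue job.sh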





[slurm-dev] Expanding TotalCPU to include child processes

2015-03-03 Thread Scott Yockel
Slurm-Dev,

Is there anything in the works to extend TotalCPU so that it also tracks 
child-process user and system time?  I see that TotalCPU is currently 
defined as: "provides a measure of the task's parent process and does not 
include CPU time of child processes."  I ask because it would be nice to 
profile how well multi-core jobs are using the system, a sort of 
parallel-efficiency measure.  One could compare (wall time * CPUs) to such 
a full CPU total and see whether users were "hogging" cores.  Back in my 
LSF days, LSF had a Hog Factor that was something like this.  Right now the 
only way I see to catch this is while it is happening on the cluster, not 
after job completion.
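
For what it's worth, the post-job comparison I have in mind could be
sketched with today's sacct fields (the job id is a placeholder), even
though it would miss child-process time for the reason above:

    # Elapsed * AllocCPUS vs. TotalCPU gives a crude parallel-efficiency figure
    sacct -j <jobid> --format=JobID,Elapsed,AllocCPUS,TotalCPU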
 

Cheers,

~Scott
==
Dr. Scott Yockel | Senior Team Lead of HPC
FAS Research Computing | Harvard University
38 Oxford Street Cambridge, MA
Office: 211A | Phone: 617-496-7468
==



[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


In this case the node was in a funny state where it couldn't resolve 
user IDs.  So right after the job tried to launch, it failed and 
requeued.  We just let the scheduler do what it will when it lists 
NODE_FAIL.


-Paul Edmon-







[slurm-dev] Re: Requeue Exit

2015-03-03 Thread David Bigagli



Ah, OK. The job failed to launch; in that case Slurm requeues the job in 
a held state, whereas the previous behaviour was to terminate the job.

The reason for this is to avoid the job dispatch failing over and over.
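
If needed, a job that was requeued into the held JOB_SPECIAL_EXIT state
can be released by hand once the underlying problem is fixed, e.g. (the
job id is a placeholder):

    scontrol release <jobid>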






--

Thanks,
  /David/Bigagli

www.schedmd.com


[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon


Ah, good to know.  I do prefer that behavior, just didn't expect it.  
Thanks.


-Paul Edmon-
