[slurm-dev] MaxCPUs option in slurm.conf
Hi Team, Is it possible to set a MaxCPUs option on partitions in slurm.conf? I need to restrict jobs to a certain number of cores. -- Regards, Tejas
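A sketch of two common ways to cap cores (the partition, node, and QOS names here are illustrative, not taken from the original post): the PartitionName line in slurm.conf accepts MaxCPUsPerNode, which limits how many CPUs one job may use on each node of the partition, and a QOS attached to the partition can cap the total CPUs per job:

    # slurm.conf -- per-node CPU cap for jobs in this partition:
    PartitionName=short Nodes=node[01-16] MaxCPUsPerNode=8 State=UP

    # Alternative: cap total CPUs per job with a QOS tied to the partition:
    #   sacctmgr add qos short_qos
    #   sacctmgr modify qos short_qos set MaxCpusPerJob=16
    # then in slurm.conf:
    #   PartitionName=short Nodes=node[01-16] QOS=short_qos State=UP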
[slurm-dev] Requeue Exit
So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state. *RequeueExit* Enables automatic job requeue for jobs which exit with the specified values. Separate multiple exit codes with commas. Jobs will be put back into the pending state and later scheduled again. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. *RequeueExitHold* Enables automatic requeue of jobs into a pending, held state, meaning their priority is zero. Separate multiple exit codes with commas. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon-
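For context, both options are set in slurm.conf; a minimal sketch (the exit codes are arbitrary examples, not values from this thread):

    # Requeue jobs exiting with code 100 or 101 back to the pending state:
    RequeueExit=100,101
    # Requeue jobs exiting with code 102 into a held (priority zero) state:
    RequeueExitHold=102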
[slurm-dev] Re: Requeue Exit
There are no default values for these parameters; you have to configure your own. In your case, does the prolog fail, or does the node change state while the jobs are running? On 03/03/2015 08:30 AM, Paul Edmon wrote: So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state. [...] -Paul Edmon- -- Thanks, /David/Bigagli www.schedmd.com
[slurm-dev] Re: Requeue Exit
Basically the node cuts out due to hardware issues and the job is requeued. I'm just trying to figure out why it sent them into a held state as opposed to simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters; you have to configure your own. In your case, does the prolog fail, or does the node change state while the jobs are running? [...]
[slurm-dev] Re: Requeue Exit
How do you set your node down? If I run a job and then issue 'scontrol update node=prometeo state=down' the job is requeued to pending. Do you have an epilog? On 03/03/2015 10:12 AM, Paul Edmon wrote: We are definitely using the default for that one. So it should be requeueing just fine. -Paul Edmon- [...] -- Thanks, /David/Bigagli www.schedmd.com
[slurm-dev] Re: Requeue Exit
We are definitely using the default for that one. So it should be requeueing just fine. -Paul Edmon- On 03/03/2015 01:05 PM, Lipari, Don wrote: It looks like the governing config parameter would be: JobRequeue This option controls what to do by default after a node failure. If JobRequeue is set to a value of 1, then any batch job running on the failed node will be requeued for execution on different nodes. If JobRequeue is set to a value of 0, then any job running on the failed node will be terminated. Use the sbatch --no-requeue or --requeue option to change the default behavior for individual jobs. The default value is 1. According to this, the job should be requeued and not held. [...]
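A sketch of the settings Don describes (the value shown is the documented default):

    # slurm.conf -- requeue batch jobs after node failure (default is 1):
    JobRequeue=1

    # Per-job overrides at submission time:
    #   sbatch --no-requeue job.sh   # terminate on node failure instead
    #   sbatch --requeue job.sh      # requeue even if JobRequeue=0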
[slurm-dev] Expanding TotalCPU to include child processes
Slurm-Dev, Is there anything in the works to add the capacity of TotalCPU to also track child process user and system time? I see that currently TotalCPU is defined as: "provides a measure of the task's parent process and does not include CPU time of child processes." I ask because it would be nice to profile how well multi-core jobs are using the system, a sort of parallel efficiency measure. One could compare (wall time * cpus) to (FullCPUTotal) and see whether users were "hogging" cores. Back in my LSF days, there was a Hog Factor that worked something like this. Right now the only way I see to catch this is while it is happening on the cluster, not post job completion. Cheers, ~Scott == Dr. Scott Yockel | Senior Team Lead of HPC FAS Research Computing | Harvard University 38 Oxford Street Cambridge, MA Office: 211A | Phone: 617-496-7468 ==
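For what it's worth, a rough post-hoc check is already possible from the standard accounting fields, subject to the undercounting caveat above; the job ID here is illustrative:

    # CPUTime = Elapsed * AllocCPUS, so efficiency is roughly TotalCPU / CPUTime:
    sacct -j 12345 --format=JobID,Elapsed,AllocCPUS,CPUTime,TotalCPU

    # Example: TotalCPU=02:00:00 on AllocCPUS=4 over Elapsed=01:00:00
    # gives CPUTime=04:00:00, i.e. about 50% parallel efficiency.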
[slurm-dev] Re: Requeue Exit
In this case the node was in a funny state where it couldn't resolve user IDs. So right after the job tried to launch, it failed and was requeued. We just let the scheduler do what it will when it lists NODE_FAIL. -Paul Edmon- On 03/03/2015 01:20 PM, David Bigagli wrote: How do you set your node down? If I run a job and then issue 'scontrol update node=prometeo state=down' the job is requeued to pending. Do you have an epilog? [...]
[slurm-dev] Re: Requeue Exit
Ah, OK. The job failed to launch; in this case Slurm requeues the job in a held state, where the previous behaviour was to terminate the job. The reason for this is to avoid the job failing to dispatch over and over. On 03/03/2015 10:53 AM, Paul Edmon wrote: In this case the node was in a funny state where it couldn't resolve user IDs. So right after the job tried to launch, it failed and was requeued. We just let the scheduler do what it will when it lists NODE_FAIL. -Paul Edmon- [...] -- Thanks, /David/Bigagli www.schedmd.com
[slurm-dev] Re: Requeue Exit
Ah, good to know. I do prefer that behavior, just didn't expect it. Thanks. -Paul Edmon- On 03/03/2015 02:00 PM, David Bigagli wrote: Ah, OK. The job failed to launch; in this case Slurm requeues the job in a held state, where the previous behaviour was to terminate the job. The reason for this is to avoid the job failing to dispatch over and over. [...]
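For anyone hitting the same behavior: jobs requeued this way sit in the JOB_SPECIAL_EXIT state (shown as SE by squeue) with zero priority until released. A sketch, with an illustrative job ID:

    # List jobs held in the special-exit state:
    squeue -t SPECIAL_EXIT

    # Release a held job so it is eligible to be scheduled again:
    scontrol release 12345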