[slurm-dev] Re: SLURM with VASP
Trey,

Do you also have the stack size set to unlimited in /etc/sysconfig/slurm ('ulimit -s unlimited')? We have that in ours, and it may have been added for VASP. I just did a search for "vasp" and "stack", and that did show up as a documented problem at other sites.

Mike Robbert

On Jan 29, 2015, at 8:30 AM, Trey Dockendorf treyd...@tamu.edu wrote:

> I'm doing the same in two places. I have 'ulimit -l unlimited' in /etc/sysconfig/slurm (sourced by the slurm service on CentOS) and also these two lines in /etc/security/limits.d/unlimited_memlock.conf:
>
>     * hard memlock unlimited
>     * soft memlock unlimited
>
> I'm thinking this is due to the virtual memory limits we enforce, which is something I'm going to test.
>
> Thanks,
> - Trey
[slurm-dev] Re: restarting checkpoint after slurm_checkpoint_vacate API call
slurm_checkpoint_vacate() triggers a checkpoint operation, which can take minutes to complete. You can't slurm_checkpoint_restart() the job until the checkpoint operation and its accounting are complete. Adding some sleep/retry logic should do what you want.

Quoting Manuel Rodríguez Pascual manuel.rodriguez.pasc...@gmail.com:

> Good morning all,
>
> I am facing a problem when using the slurm.h API to manage checkpoints. What I want to do is checkpoint a running task, shut it down, and then restore it somewhere (on the same node or another one). slurm.conf is configured with:
>
>     CheckpointType=checkpoint/blcr
>     JobCheckpointDir=/home/slurm/
>
> My code, after initial verifications, goes like:
>
>     int max_wait = 60;
>     if (slurm_checkpoint_vacate(opt.jobid, opt.stepid, max_wait, "/home/slurm/") != 0)
>         _show_error_and_exit();
>
>     /* just in case it is still not stopped */
>     slurm_kill_job(opt.jobid, 9, KILL_JOB_ARRAY);
>
>     char *checkpoint_location = "/home/slurm";
>     if (slurm_checkpoint_restart(opt.jobid, opt.stepid, 0, checkpoint_location) != 0)
>         _show_error_and_exit();
>
> The errno and error message I get are:
>
>     2011: Duplicate job id
>
> and this content in slurmctld:
>
>     re-use active job_id 2570
>     slurmctld: _slurm_rpc_checkpoint restart 2570: Duplicate job id
>
> If I do instead
>
>     if (slurm_checkpoint_restart(opt.jobid + 1, opt.stepid, 0, checkpoint_location) != 0)
>
> the errno and error message I get are:
>
>     2: No such file or directory
>
> and this content in slurmctld:
>
>     No job ckpt file (/home/slurm//2571.ckpt) to read
>     slurmctld: _slurm_rpc_checkpoint restart 2570: No such file or directory
>
> Which is right: the file does not exist, so of course it cannot start it. However, if I specify /home/slurm/2570/ (the folder created by the checkpoint_vacate call) as image_dir, the result is the same. Besides that, it seems that the image_dir input parameter is not read, only the default: if I set my checkpoint_location to /a/b/c, for example, the log shows the same error, so it is still trying to find the image in /home/slurm.
>
> So, this said, have you got any help or suggestions on how to deal with checkpoints via the Slurm API? Am I doing something wrong? Is there any working example I can see? Should I be using other calls instead of these?
>
> Thanks for your help. Best regards,
>
> Manuel

--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
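For what it's worth, the sleep/retry logic Moe suggests could be sketched like this. This is an illustrative shell sketch, not Slurm code: checkpoint_done is a hypothetical stand-in for however your site detects completion (for example, checking that the job's .ckpt file exists under JobCheckpointDir), and here it just simulates success on the third poll.

```shell
#!/bin/sh
# Sketch of sleep/retry logic before calling slurm_checkpoint_restart().
# checkpoint_done is a hypothetical stand-in for a real completion test,
# e.g.:  [ -f "/home/slurm/${JOBID}.ckpt" ]
polls_needed=3                      # simulate: "done" on the 3rd poll

checkpoint_done() {
    polls_needed=$((polls_needed - 1))
    [ "$polls_needed" -le 0 ]
}

# Retry up to $1 times, sleeping $2 seconds between polls.
# Returns 0 once the predicate succeeds, 1 on timeout.
wait_for_checkpoint() {
    tries=$1
    interval=$2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if checkpoint_done; then
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}

if wait_for_checkpoint 10 1; then
    echo "checkpoint complete, safe to restart"
else
    echo "timed out waiting for checkpoint"
fi
```

The same loop structure works in C around the slurm_checkpoint_* calls; the key point is simply to bound the wait and re-test rather than restart immediately after vacate returns.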
[slurm-dev] Re: SLURM with VASP
We have included 'ulimit -l unlimited' in the slurmd init script -- to be precise, in /etc/default/slurm-llnl, which gets automatically sourced by the slurmd init script on Ubuntu. On an Ubuntu 14.04 server, slurmd would otherwise get the limits set for root, which are more restrictive.

Ben

Christopher Samuel wrote on 01/29/15 03:16:

> On 29/01/15 09:26, Trey Dockendorf wrote:
>
>> Thanks for the response. We use PropagateResourceLimits=NONE and also set both hard and soft memlock to unlimited on all compute nodes via a file in /etc/security/limits.d.
>
> It's still worth putting 'ulimit -a' in the batch script just before calling VASP to capture what is actually getting set, just in case there's something odd going on. We had people using VASP last year with Slurm and OpenMPI and they didn't seem to have any issues.
>
> Best of luck!
> Chris

--
Dr. B.J.W. Polman, CCZ, Radboud University.
Osiris beheerder NWI
Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands
Phone: +31-24-3653360
e-mail: ben.pol...@science.ru.nl
[slurm-dev] restarting checkpoint after slurm_checkpoint_vacate API call
Good morning all,

I am facing a problem when using the slurm.h API to manage checkpoints. What I want to do is checkpoint a running task, shut it down, and then restore it somewhere (on the same node or another one). slurm.conf is configured with:

    CheckpointType=checkpoint/blcr
    JobCheckpointDir=/home/slurm/

My code, after initial verifications, goes like:

    int max_wait = 60;
    if (slurm_checkpoint_vacate(opt.jobid, opt.stepid, max_wait, "/home/slurm/") != 0)
        _show_error_and_exit();

    /* just in case it is still not stopped */
    slurm_kill_job(opt.jobid, 9, KILL_JOB_ARRAY);

    char *checkpoint_location = "/home/slurm";
    if (slurm_checkpoint_restart(opt.jobid, opt.stepid, 0, checkpoint_location) != 0)
        _show_error_and_exit();

The errno and error message I get are:

    2011: Duplicate job id

and this content in slurmctld:

    re-use active job_id 2570
    slurmctld: _slurm_rpc_checkpoint restart 2570: Duplicate job id

If I do instead

    if (slurm_checkpoint_restart(opt.jobid + 1, opt.stepid, 0, checkpoint_location) != 0)

the errno and error message I get are:

    2: No such file or directory

and this content in slurmctld:

    No job ckpt file (/home/slurm//2571.ckpt) to read
    slurmctld: _slurm_rpc_checkpoint restart 2570: No such file or directory

Which is right: the file does not exist, so of course it cannot start it. However, if I specify /home/slurm/2570/ (the folder created by the checkpoint_vacate call) as image_dir, the result is the same. Besides that, it seems that the image_dir input parameter is not read, only the default: if I set my checkpoint_location to /a/b/c, for example, the log shows the same error, so it is still trying to find the image in /home/slurm.

So, this said, have you got any help or suggestions on how to deal with checkpoints via the Slurm API? Am I doing something wrong? Is there any working example I can see? Should I be using other calls instead of these?

Thanks for your help. Best regards,

Manuel
[slurm-dev] Re: SLURM with VASP
I'm doing the same in two places. I have 'ulimit -l unlimited' in /etc/sysconfig/slurm (sourced by the slurm service on CentOS) and also these two lines in /etc/security/limits.d/unlimited_memlock.conf:

    * hard memlock unlimited
    * soft memlock unlimited

I'm thinking this is due to the virtual memory limits we enforce, which is something I'm going to test.

Thanks,
- Trey

=
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979) 458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jan 29, 2015 at 4:48 AM, Ben Polman ben.pol...@science.ru.nl wrote:

> We have included 'ulimit -l unlimited' in the slurmd init script -- to be precise, in /etc/default/slurm-llnl, which gets automatically sourced by the slurmd init script on Ubuntu. On an Ubuntu 14.04 server, slurmd would otherwise get the limits set for root, which are more restrictive.
>
> Ben
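Spelled out, the two places Trey describes amount to something like the following (paths as given in the thread; the values are the ones he quotes):

```
# /etc/security/limits.d/unlimited_memlock.conf
*  hard  memlock  unlimited
*  soft  memlock  unlimited

# /etc/sysconfig/slurm (sourced by the slurm init script on CentOS)
ulimit -l unlimited
```

The limits.d file covers PAM-based logins, while the sysconfig file raises the limit for the slurmd daemon itself at service start.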
[slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?
No, there is not. In a typical configuration Slurm binds applications to specific CPUs (using a task plugin), which is essential for decent performance for most parallel applications. Your best option would probably be to configure Slurm with TaskPlugin=task/none and then drain/resume nodes based upon load using some script. It's less than ideal...

Quoting Novosielski, Ryan novos...@ca.rutgers.edu:

> Hi all,
>
> Running Slurm 14.11 in pre-production. I had a question, since I can't find the answer in the documentation: is there any config option that says to mark a node unavailable if the load on it is higher than a certain value? This feature of Maui is helping with our testing, as we can have both Slurm and Torque/Maui running at the same time and a job will not be scheduled by Torque/Maui on a node that appears to have anything running on it (e.g. something spawned by Slurm). I don't see a similar feature in Slurm.
>
> Thanks for your help.

--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
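A drain/resume script along the lines Moe suggests might look like this minimal sketch. The threshold and the commented-out scontrol invocations are assumptions to adapt per site; here the script only prints what it would do. It could run from cron or as a node health check.

```shell
#!/bin/sh
# Sketch: drain a node when its 1-minute load average exceeds a
# threshold, resume it otherwise.  MAX_LOAD is a site-specific choice.
MAX_LOAD=${MAX_LOAD:-24}

# Print "drain" if the integer 1-minute load in $1 exceeds the
# threshold, "resume" otherwise.
decide_state() {
    if [ "$1" -gt "$MAX_LOAD" ]; then
        echo drain
    else
        echo resume
    fi
}

node=$(hostname -s)
# 1-minute load average, truncated to an integer for POSIX sh arithmetic
load=$(cut -d' ' -f1 /proc/loadavg | cut -d. -f1)

case $(decide_state "$load") in
    drain)
        # scontrol update NodeName="$node" State=DRAIN Reason="load=$load"
        echo "would drain $node (load $load)" ;;
    resume)
        # scontrol update NodeName="$node" State=RESUME
        echo "would resume $node (load $load)" ;;
esac
```

One caveat with this approach: State=RESUME clears any drain reason, so a production version should first check (e.g. via 'scontrol show node') that the node was drained by this script and not by an administrator.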
[slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?
Thanks for the quick answer, folks. It's probably not worth figuring anything out for this -- it's in limited testing anyway, so limiting the users and telling them to be careful is probably fine for the month or so that we'd be doing this. I just wanted to make sure there was no quick solution that I'd missed.

--
*Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |-*O*-
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
||  \\ Sciences  | OIRT/High Perf Res Comp - MSB C630, Newark
     `'

From: Mehdi Denou [mehdi.de...@bull.net]
Sent: Thursday, January 29, 2015 11:02 AM
To: slurm-dev
Subject: [slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?

Maybe you can script something with the HealthCheckProgram?

On 29/01/2015 16:58, je...@schedmd.com wrote:

> No, there is not. In a typical configuration Slurm binds applications to specific CPUs (using a task plugin), which is essential for decent performance for most parallel applications. Your best option would probably be to configure Slurm with TaskPlugin=task/none and then drain/resume nodes based upon load using some script. It's less than ideal...

--
---
Mehdi Denou
International HPC support
+336 45 57 66 56
[slurm-dev] Re: SLURM with VASP
We currently do not set the stack size on compute nodes; the default is left in place. In my test batches for VASP I was setting 'ulimit -s unlimited' at the beginning of the batch script, as I found mention of that problem on the VASP forum in relation to segfaults. Given others have mentioned stack size, I'll try with and without, and may find that setting an unlimited stack size across all compute nodes does have benefit.

Thanks,
- Trey

On Thu, Jan 29, 2015 at 10:45 AM, Michael Robbert mrobb...@mines.edu wrote:

> Trey,
>
> Do you also have the stack size set to unlimited in /etc/sysconfig/slurm ('ulimit -s unlimited')? We have that in ours, and it may have been added for VASP. I just did a search for "vasp" and "stack", and that did show up as a documented problem at other sites.
>
> Mike Robbert
[slurm-dev] Re: SLURM with VASP
Just to make the point explicit: if the slurmd on a node is not started with the limit set to unlimited, it does not help to put 'ulimit -s unlimited' in the batch script, since you can't raise the limit above what is set on the parent process (slurmd).

Ben

-----Original Message-----
From: Trey Dockendorf treyd...@tamu.edu
To: slurm-dev slurm-dev@schedmd.com
Sent: Thu, 29 Jan 2015 20:21
Subject: [slurm-dev] Re: SLURM with VASP

> We currently do not set the stack size on compute nodes; the default is left in place. In my test batches for VASP I was setting 'ulimit -s unlimited' at the beginning of the batch script, as I found mention of that problem on the VASP forum in relation to segfaults. Given others have mentioned stack size, I'll try with and without, and may find that setting an unlimited stack size across all compute nodes does have benefit.
>
> Thanks,
> - Trey
[slurm-dev] Re: SLURM with VASP
I was not aware of that -- thanks for teaching me something new. I'll add that ulimit to the slurmd init script as one of my test cases for resolving this.

Thanks,
- Trey

On Thu, Jan 29, 2015 at 1:47 PM, ben.pol...@science.ru.nl wrote:

> Just to make the point explicit: if the slurmd on a node is not started with the limit set to unlimited, it does not help to put 'ulimit -s unlimited' in the batch script, since you can't raise the limit above what is set on the parent process (slurmd).
>
> Ben
[slurm-dev] Re: SLURM with VASP
On 30/01/15 07:19, Trey Dockendorf wrote:

> I was not aware of that -- thanks for teaching me something new. I'll add that ulimit to the slurmd init script as one of my test cases for resolving this.

Once you've done that, can you send the output of 'ulimit -a' from inside the VASP batch job (just before it starts) so we can see what the limits are, please?

thanks,
Chris

--
Christopher Samuel    Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
[slurm-dev] Re: SLURM with VASP
This is after I set 'ulimit -s unlimited' in /etc/sysconfig/slurm. The job had -N4 --ntasks-per-node=4, and the partition has MaxMemPerCPU=4000.

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 257745
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) 15974400
    open files                      (-n) 8192
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) unlimited
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 257745
    virtual memory          (kbytes, -v) 32108543
    file locks                      (-x) unlimited

The 'max locked memory' is set explicitly on compute nodes, and the 'open files' value is defined in /etc/sysconfig/slurm based on a recommendation from the docs. Thus far the unlimited stack size has allowed VASP to run far longer than it ever has previously using the input files I was given. It has even begun printing what I would assume is useful output (I know little of what to expect from VASP in terms of output). This is progress!

Is there any reason to put that unlimited stack size in /etc/sysconfig/slurm and not in /etc/security/limits.d/? I prefer the latter, as that's where I tend to look on a system when inspecting what limits are being set.

Thanks,
- Trey

On Thu, Jan 29, 2015 at 4:15 PM, Christopher Samuel sam...@unimelb.edu.au wrote:

> Once you've done that, can you send the output of 'ulimit -a' from inside the VASP batch job (just before it starts) so we can see what the limits are, please?
>
> thanks,
> Chris
[slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?
On Wed, Jan 28, 2015 at 6:54 PM, Novosielski, Ryan novos...@ca.rutgers.edu wrote:

> Running Slurm 14.11 in pre-production. Is there any config option that says to mark a node unavailable if the load on it is higher than a certain value? I don't see a similar feature in Slurm.

This can be done fairly easily using NHC 1.4.1 and its built-in check_ps_loadavg() check. See the documentation for more details (http://go.lbl.gov/nhc#check_ps_loadavg).

HTH!
Michael

--
Michael Jennings m...@lbl.gov
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E  W: 510-495-2687
MS 050B-3209    F: 510-486-8615
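For reference, the corresponding nhc.conf entry would be a single line along these lines. The argument syntax here is an assumption from memory, not taken from this thread; verify against the NHC documentation linked above:

```
# /etc/nhc/nhc.conf -- fail the health check (and drain the node) when
# the 1-minute load exceeds 24; threshold and syntax are illustrative
* || check_ps_loadavg 24
```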