[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Michael Robbert
Trey,
Do you also have Stack size set to unlimited in /etc/sysconfig/slurm? ‘ulimit 
-s unlimited’
We have that in ours, and it may have been for VASP. I just did a search for
'vasp' and 'stack', and that did show up as a documented problem at other sites.

Mike Robbert

 On Jan 29, 2015, at 8:30 AM, Trey Dockendorf treyd...@tamu.edu wrote:
 
 I'm doing the same in two places.  I have 'ulimit -l unlimited' in 
 /etc/sysconfig/slurm (sourced by slurm service on CentOS) and also these two 
 lines in /etc/security/limits.d/unlimited_memlock.conf
 
 * hard memlock unlimited
 * soft memlock unlimited
 
 I'm thinking this is due to virtual memory limits we enforce which is 
 something I'm going to test.
 
 Thanks,
  - Trey
 
 =
 
 Trey Dockendorf 
 Systems Analyst I 
 Texas A&M University 
 Academy for Advanced Telecommunications and Learning Technologies 
 Phone: (979)458-2396 
 Email: treyd...@tamu.edu 
 Jabber: treyd...@tamu.edu
 On Thu, Jan 29, 2015 at 4:48 AM, Ben Polman ben.pol...@science.ru.nl wrote:
 
 Christopher Samuel wrote on 01/29/15 03:16:
 
 we have included a
 
 ulimit -l unlimited
 
 in the slurmd init script, to be precise in /etc/default/slurm-llnl which 
 gets automatically sourced
 by the slurmd init script on ubuntu
 
 on an Ubuntu 14.04 server slurmd would otherwise
 get the limits set for root, which are more restrictive
 
 Ben
 
 On 29/01/15 09:26, Trey Dockendorf wrote:
 
 Thanks for the response.  We use PropagateResourceLimits=NONE and also
 set both hard and soft for memlock to unlimited on all compute nodes via
 a file in /etc/security/limits.d.
 It's still worth putting ulimit -a in the batch script just before
 calling VASP to capture what is actually getting set, just in case
 there's something odd going on..
 
 We had people using VASP last year with Slurm and OpenMPI and they
 didn't seem to have any issues.
 
 Best of luck!
 Chris
 
 
 -- 
 -
 Dr. B.J.W. Polman, CCZ, Radboud University.
 Osiris administrator NWI
 Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360
 e-mail: ben.pol...@science.ru.nl
 
 





[slurm-dev] Re: restarting checkpoint after slurm_checkpoint_vacate API call

2015-01-29 Thread jette


The slurm_checkpoint_vacate() triggers a checkpoint operation, which  
could take minutes to complete. You can't slurm_checkpoint_restart()  
the job until the checkpoint operation and accounting are complete.  
Adding some sleep/retry logic should do what you want.
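A minimal sketch of that sleep/retry idea, reusing slurm_checkpoint_restart()
with the same arguments as in the quoted code below; the retry count and
5-second poll interval here are arbitrary, not a recommendation:

#include <unistd.h>        /* sleep() */
#include <stdint.h>
#include <slurm/slurm.h>   /* slurm_checkpoint_restart() */

/* Keep retrying the restart until the vacated job has been fully
 * checkpointed and purged by slurmctld, or until we give up.
 * Returns 0 on success, -1 if every attempt failed. */
static int restart_with_retry(uint32_t jobid, uint32_t stepid,
                              char *image_dir, int max_tries)
{
        int i;

        for (i = 0; i < max_tries; i++) {
                if (slurm_checkpoint_restart(jobid, stepid, 0,
                                             image_dir) == 0)
                        return 0;       /* restart accepted */
                /* e.g. the "Duplicate job id" case below: the vacate has
                 * not finished yet, so wait a bit and try again. */
                sleep(5);
        }
        return -1;
}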


Quoting Manuel Rodríguez Pascual manuel.rodriguez.pasc...@gmail.com:


Good morning all,

I am facing a problem when using the slurm.h API to manage checkpoints.

What I want to do is checkpoint a running task, shut it down, and then
restore it somewhere (on the same node or on another one).

slurm.conf is configured with:
CheckpointType=checkpoint/blcr
JobCheckpointDir=/home/slurm/

My code, after the initial verifications, goes like this:

int max_wait = 60;
if (slurm_checkpoint_vacate(opt.jobid, opt.stepid, max_wait,
                            "/home/slurm/") != 0)
        _show_error_and_exit();

// just in case it is still not stopped
slurm_kill_job(opt.jobid, 9, KILL_JOB_ARRAY);

char *checkpoint_location = "/home/slurm";
if (slurm_checkpoint_restart(opt.jobid, opt.stepid, 0,
                             checkpoint_location) != 0)
        _show_error_and_exit();


The errno and error message I get is:

2011: Duplicate job id

and this content in slurmctld:

re-use active job_id 2570
slurmctld: _slurm_rpc_checkpoint restart 2570: Duplicate job id


if I do instead

if (slurm_checkpoint_restart(opt.jobid + 1, opt.stepid, 0,
                             checkpoint_location) != 0)

The errno and error message I get is:

2: No such file or directory

and this content in slurmctld:

No job ckpt file (/home/slurm//2571.ckpt) to read
slurmctld: _slurm_rpc_checkpoint restart 2570: No such file or directory

Which is right: the file does not exist, so of course it cannot restart it.
However, if I specify /home/slurm/2570/ (the folder created by the
checkpoint_vacate call) as image_dir, the result is the same.

Besides that, it seems that the image_dir input parameter is not read,
only the default one. If I set my checkpoint_location to /a/b/c, for
example, the log shows the same error, indicating that it is still trying
to find the image in /home/slurm.


So, that said, do you have any suggestions on how to deal with
checkpoints through the Slurm API? Am I doing something wrong? Is there
any working example I can see? Should I be using other calls instead of
these?


Thanks for your help. Best regards,


Manuel



--
Morris Moe Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support


[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Ben Polman


Christopher Samuel wrote on 01/29/15 03:16:

we have included a

ulimit -l unlimited

in the slurmd init script, to be precise in /etc/default/slurm-llnl which gets 
automatically sourced
by the slurmd init script on ubuntu

on an Ubuntu 14.04 server slurmd would otherwise
get the limits set for root, which are more restrictive

Ben


On 29/01/15 09:26, Trey Dockendorf wrote:


Thanks for the response.  We use PropagateResourceLimits=NONE and also
set both hard and soft for memlock to unlimited on all compute nodes via
a file in /etc/security/limits.d.

It's still worth putting ulimit -a in the batch script just before
calling VASP to capture what is actually getting set, just in case
there's something odd going on..

We had people using VASP last year with Slurm and OpenMPI and they
didn't seem to have any issues.

Best of luck!
Chris



--
-
Dr. B.J.W. Polman, CCZ, Radboud University.
Osiris administrator NWI
Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360
e-mail: ben.pol...@science.ru.nl


[slurm-dev] restarting checkpoint after slurm_checkpoint_vacate API call

2015-01-29 Thread Manuel Rodríguez Pascual
Good morning all,

I am facing a problem when using the slurm.h API to manage checkpoints.

What I want to do is checkpoint a running task, shut it down, and then
restore it somewhere (on the same node or on another one).

slurm.conf is configured with:
CheckpointType=checkpoint/blcr
JobCheckpointDir=/home/slurm/

My code, after the initial verifications, goes like this:

int max_wait = 60;
if (slurm_checkpoint_vacate(opt.jobid, opt.stepid, max_wait,
                            "/home/slurm/") != 0)
        _show_error_and_exit();

// just in case it is still not stopped
slurm_kill_job(opt.jobid, 9, KILL_JOB_ARRAY);

char *checkpoint_location = "/home/slurm";
if (slurm_checkpoint_restart(opt.jobid, opt.stepid, 0,
                             checkpoint_location) != 0)
        _show_error_and_exit();


The errno and error message I get is:

2011: Duplicate job id

and this content in slurmctld:

re-use active job_id 2570
slurmctld: _slurm_rpc_checkpoint restart 2570: Duplicate job id


if I do instead

if (slurm_checkpoint_restart(opt.jobid + 1, opt.stepid, 0,
                             checkpoint_location) != 0)

The errno and error message I get is:

2: No such file or directory

and this content in slurmctld:

No job ckpt file (/home/slurm//2571.ckpt) to read
slurmctld: _slurm_rpc_checkpoint restart 2570: No such file or directory

Which is right: the file does not exist, so of course it cannot restart it.
However, if I specify /home/slurm/2570/ (the folder created by the
checkpoint_vacate call) as image_dir, the result is the same.

Besides that, it seems that the image_dir input parameter is not read,
only the default one. If I set my checkpoint_location to /a/b/c, for
example, the log shows the same error, indicating that it is still trying
to find the image in /home/slurm.


So, that said, do you have any suggestions on how to deal with
checkpoints through the Slurm API? Am I doing something wrong? Is there
any working example I can see? Should I be using other calls instead of
these?


Thanks for your help. Best regards,


Manuel


[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Trey Dockendorf
I'm doing the same in two places.  I have 'ulimit -l unlimited' in
/etc/sysconfig/slurm (sourced by slurm service on CentOS) and also these
two lines in /etc/security/limits.d/unlimited_memlock.conf

* hard memlock unlimited
* soft memlock unlimited

I'm thinking this is due to virtual memory limits we enforce which is
something I'm going to test.

Thanks,
 - Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jan 29, 2015 at 4:48 AM, Ben Polman ben.pol...@science.ru.nl
wrote:


 Christopher Samuel wrote on 01/29/15 03:16:

 we have included a

 ulimit -l unlimited

 in the slurmd init script, to be precise in /etc/default/slurm-llnl which
 gets automatically sourced
 by the slurmd init script on ubuntu

 on an Ubuntu 14.04 server slurmd would otherwise
 get the limits set for root, which are more restrictive

 Ben

  On 29/01/15 09:26, Trey Dockendorf wrote:

  Thanks for the response.  We use PropagateResourceLimits=NONE and also
 set both hard and soft for memlock to unlimited on all compute nodes via
 a file in /etc/security/limits.d.

 It's still worth putting ulimit -a in the batch script just before
 calling VASP to capture what is actually getting set, just in case
 there's something odd going on..

 We had people using VASP last year with Slurm and OpenMPI and they
 didn't seem to have any issues.

 Best of luck!
 Chris



 --
 -
 Dr. B.J.W. Polman, CCZ, Radboud University.
 Osiris administrator NWI
 Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone:
 +31-24-3653360
 e-mail: ben.pol...@science.ru.nl



[slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?

2015-01-29 Thread jette


No there is not.

In a typical configuration Slurm binds applications to specific CPUs  
(using a task plugin), which is essential for decent performance for  
most parallel applications. Your best option would probably be to  
configure Slurm with TaskPlugin=task/none and then drain/resume  
nodes based upon load using some script. It's less than ideal...
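A rough sketch of such a drain/resume script (the threshold, the reason text,
and the assumption that the node was only ever drained by this script are all
placeholders, not a tested recipe):

#!/bin/bash
# Drain this node when the 1-minute load average exceeds a threshold,
# and resume it once the load drops again.  Sketch only.
THRESHOLD=16
NODE=$(hostname -s)
LOAD=$(cut -d' ' -f1 /proc/loadavg)

if awk -v t="$THRESHOLD" '{ exit !($1 > t) }' /proc/loadavg; then
    scontrol update NodeName="$NODE" State=DRAIN Reason="load $LOAD over $THRESHOLD"
else
    # Careful: RESUME also clears drains that were set by hand for other reasons
    scontrol update NodeName="$NODE" State=RESUME
fi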


Quoting Novosielski, Ryan novos...@ca.rutgers.edu:


Hi all,

Running Slurm 14.11 in pre-production. Had a question since I can't  
find the answer in the documentation. Is there any config option  
that says to mark a node unavailable if the load on it is higher  
than a certain value? This feature of Maui is helping with our  
testing as we can have both Slurm and Torque/Maui running at the  
same time and a job will not be scheduled by Torque/Maui on a node  
that appears to have anything running on it (e.g. something spawned 
by Slurm). I don't see a similar feature in Slurm.


Thanks for your help.

--
 *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS  |-*O*-
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
  `'



--
Morris Moe Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support


[slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?

2015-01-29 Thread Novosielski, Ryan

Thanks for the quick answer, folks. It's probably not worth figuring anything
out for this -- it's in limited testing anyway, so limiting the users and
telling them to be careful is probably fine for the month or so that we'd be
doing this. I just wanted to make sure there was no quick solution that I'd
missed.

--
 *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS  |-*O*-
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
  `'

From: Mehdi Denou [mehdi.de...@bull.net]
Sent: Thursday, January 29, 2015 11:02 AM
To: slurm-dev
Subject: [slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?

Maybe you can script something with the HealthCheckProgram?
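For reference, the slurm.conf side of that would be along these lines (the
script path and interval are placeholders, not a recommendation):

# slurm.conf (sketch): run a site-provided check on the nodes periodically
HealthCheckProgram=/usr/local/sbin/node_health.sh
HealthCheckInterval=300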

On 29/01/2015 16:58, je...@schedmd.com wrote:

 No there is not.

 In a typical configuration Slurm binds applications to specific CPUs
 (using a task plugin), which is essential for decent performance for
 most parallel applications. Your best option would probably be to
 configure Slurm with TaskPlugin=task/none and then drain/resume
 nodes based upon load using some script. It's less than ideal...

 Quoting Novosielski, Ryan novos...@ca.rutgers.edu:

 Hi all,

 Running Slurm 14.11 in pre-production. Had a question since I can't
 find the answer in the documentation. Is there any config option that
 says to mark a node unavailable if the load on it is higher than a
 certain value? This feature of Maui is helping with our testing as we
 can have both Slurm and Torque/Maui running at the same time and a
 job will not be scheduled by Torque/Maui on a node that appears to
 have anything running on it (e.g. something spawned by Slurm). I don't
 see a similar feature in Slurm.

 Thanks for your help.

 --
  *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
  || \\UTGERS  |-*O*-
  ||_// Biomedical | Ryan Novosielski - Senior Technologist
  || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
  ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
   `'



--
---
Mehdi Denou
International HPC support
+336 45 57 66 56


[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Trey Dockendorf
We currently do not set stack size on compute nodes.  The default is left
in place.  In my test batches for VASP I was setting ulimit -s unlimited at
the beginning of the batch script as I found mention of that problem on the
VASP forum in relation to segfaults.  Given others have mentioned stack
size I'll try with and without and may find that setting unlimited stack
size across all compute nodes does have benefit.
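If the cluster-wide setting does turn out to help, it would amount to a
limits.d entry alongside the memlock one quoted below, for example (the file
name is illustrative):

# /etc/security/limits.d/unlimited_stack.conf -- sketch only
* soft stack unlimited
* hard stack unlimited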

Thanks,
- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jan 29, 2015 at 10:45 AM, Michael Robbert mrobb...@mines.edu
wrote:

 Trey,
 Do you also have Stack size set to unlimited in /etc/sysconfig/slurm?
 ‘ulimit -s unlimited’
 We have that in ours, and it may have been for VASP. I just did a search
 for 'vasp' and 'stack', and that did show up as a documented problem at other
 sites.

 Mike Robbert

 On Jan 29, 2015, at 8:30 AM, Trey Dockendorf treyd...@tamu.edu wrote:

  I'm doing the same in two places.  I have 'ulimit -l unlimited' in
 /etc/sysconfig/slurm (sourced by slurm service on CentOS) and also these
 two lines in /etc/security/limits.d/unlimited_memlock.conf

 * hard memlock unlimited
 * soft memlock unlimited

 I'm thinking this is due to virtual memory limits we enforce which is
 something I'm going to test.

 Thanks,
  - Trey

 =

 Trey Dockendorf
 Systems Analyst I
 Texas A&M University
 Academy for Advanced Telecommunications and Learning Technologies
 Phone: (979)458-2396
 Email: treyd...@tamu.edu
 Jabber: treyd...@tamu.edu

 On Thu, Jan 29, 2015 at 4:48 AM, Ben Polman ben.pol...@science.ru.nl
 wrote:


 Christopher Samuel wrote on 01/29/15 03:16:

 we have included a

 ulimit -l unlimited

 in the slurmd init script, to be precise in /etc/default/slurm-llnl which
 gets automatically sourced
 by the slurmd init script on ubuntu

 on an Ubuntu 14.04 server slurmd would otherwise
 get the limits set for root, which are more restrictive

 Ben

  On 29/01/15 09:26, Trey Dockendorf wrote:

  Thanks for the response.  We use PropagateResourceLimits=NONE and also
 set both hard and soft for memlock to unlimited on all compute nodes via
 a file in /etc/security/limits.d.

 It's still worth putting ulimit -a in the batch script just before
 calling VASP to capture what is actually getting set, just in case
 there's something odd going on..

 We had people using VASP last year with Slurm and OpenMPI and they
 didn't seem to have any issues.

 Best of luck!
 Chris



 --
 -
 Dr. B.J.W. Polman, CCZ, Radboud University.
 Osiris administrator NWI
 Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone:
 +31-24-3653360
 e-mail: ben.pol...@science.ru.nl






[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread ben . polman

Just to make the point explicit: if slurmd on a node is not started with the
limit raised to unlimited, it does not help to set it to unlimited in the batch
script, since you cannot raise a limit above the hard limit inherited from the
parent process (slurmd).
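In other words, the raise has to go in the file the init script sources before
slurmd starts; roughly (paths and limits as discussed earlier in this thread):

# /etc/sysconfig/slurm (CentOS) or /etc/default/slurm-llnl (Debian/Ubuntu),
# sourced by the slurmd init script before the daemon starts -- sketch only
ulimit -l unlimited   # max locked memory (memlock)
ulimit -s unlimited   # stack size (the VASP issue in this thread)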

Ben 


-Original Message-
From: Trey Dockendorf treyd...@tamu.edu
To: slurm-dev slurm-dev@schedmd.com
Sent: Thu, 29 Jan 2015 20:21
Subject: [slurm-dev] Re: SLURM with VASP

We currently do not set stack size on compute nodes.  The default is left
in place.  In my test batches for VASP I was setting ulimit -s unlimited at
the beginning of the batch script as I found mention of that problem on the
VASP forum in relation to segfaults.  Given others have mentioned stack
size I'll try with and without and may find that setting unlimited stack
size across all compute nodes does have benefit.

Thanks,
- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jan 29, 2015 at 10:45 AM, Michael Robbert mrobb...@mines.edu
wrote:

 Trey,
 Do you also have Stack size set to unlimited in /etc/sysconfig/slurm?
 ‘ulimit -s unlimited’
 We have that in ours, and it may have been for VASP. I just did a search
 for 'vasp' and 'stack', and that did show up as a documented problem at other
 sites.

 Mike Robbert

 On Jan 29, 2015, at 8:30 AM, Trey Dockendorf treyd...@tamu.edu wrote:

  I'm doing the same in two places.  I have 'ulimit -l unlimited' in
 /etc/sysconfig/slurm (sourced by slurm service on CentOS) and also these
 two lines in /etc/security/limits.d/unlimited_memlock.conf

 * hard memlock unlimited
 * soft memlock unlimited

 I'm thinking this is due to virtual memory limits we enforce which is
 something I'm going to test.

 Thanks,
  - Trey

 =

 Trey Dockendorf
 Systems Analyst I
 Texas A&M University
 Academy for Advanced Telecommunications and Learning Technologies
 Phone: (979)458-2396
 Email: treyd...@tamu.edu
 Jabber: treyd...@tamu.edu

 On Thu, Jan 29, 2015 at 4:48 AM, Ben Polman ben.pol...@science.ru.nl
 wrote:


 Christopher Samuel wrote on 01/29/15 03:16:

 we have included a

 ulimit -l unlimited

 in the slurmd init script, to be precise in /etc/default/slurm-llnl which
 gets automatically sourced
 by the slurmd init script on ubuntu

 on an Ubuntu 14.04 server slurmd would otherwise
 get the limits set for root, which are more restrictive

 Ben

  On 29/01/15 09:26, Trey Dockendorf wrote:

  Thanks for the response.  We use PropagateResourceLimits=NONE and also
 set both hard and soft for memlock to unlimited on all compute nodes via
 a file in /etc/security/limits.d.

 It's still worth putting ulimit -a in the batch script just before
 calling VASP to capture what is actually getting set, just in case
 there's something odd going on..

 We had people using VASP last year with Slurm and OpenMPI and they
 didn't seem to have any issues.

 Best of luck!
 Chris



 --
 -
 Dr. B.J.W. Polman, CCZ, Radboud University.
 Osiris administrator NWI
 Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone:
 +31-24-3653360
 e-mail: ben.pol...@science.ru.nl






[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Trey Dockendorf
I was not aware of that, thanks for teaching me something new.  I'll add
that ulimit to slurmd init script as one of my test cases for resolving
this.

Thanks,
- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jan 29, 2015 at 1:47 PM, ben.pol...@science.ru.nl wrote:


 Just to make the point explicit: if slurmd on a node is not started with the
 limit raised to unlimited, it does not help to set it to unlimited in the
 batch script, since you cannot raise a limit above the hard limit inherited
 from the parent process (slurmd).

 Ben


 -Original Message-
 From: Trey Dockendorf treyd...@tamu.edu
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Thu, 29 Jan 2015 20:21
 Subject: [slurm-dev] Re: SLURM with VASP

 We currently do not set stack size on compute nodes.  The default is left
 in place.  In my test batches for VASP I was setting ulimit -s unlimited at
 the beginning of the batch script as I found mention of that problem on the
 VASP forum in relation to segfaults.  Given others have mentioned stack
 size I'll try with and without and may find that setting unlimited stack
 size across all compute nodes does have benefit.

 Thanks,
 - Trey

 =

 Trey Dockendorf
 Systems Analyst I
 Texas A&M University
 Academy for Advanced Telecommunications and Learning Technologies
 Phone: (979)458-2396
 Email: treyd...@tamu.edu
 Jabber: treyd...@tamu.edu

 On Thu, Jan 29, 2015 at 10:45 AM, Michael Robbert mrobb...@mines.edu
 wrote:

 Trey,
 Do you also have Stack size set to unlimited in /etc/sysconfig/slurm?
 ‘ulimit -s unlimited’
 We have that in ours, and it may have been for VASP. I just did a search
 for 'vasp' and 'stack', and that did show up as a documented problem at other
 sites.

 Mike Robbert

 On Jan 29, 2015, at 8:30 AM, Trey Dockendorf treyd...@tamu.edu wrote:

  I'm doing the same in two places.  I have 'ulimit -l unlimited' in
 /etc/sysconfig/slurm (sourced by slurm service on CentOS) and also these
 two lines in /etc/security/limits.d/unlimited_memlock.conf

 * hard memlock unlimited
 * soft memlock unlimited

 I'm thinking this is due to virtual memory limits we enforce which is
 something I'm going to test.

 Thanks,
  - Trey

 =

 Trey Dockendorf
 Systems Analyst I
 Texas A&M University
 Academy for Advanced Telecommunications and Learning Technologies
 Phone: (979)458-2396
 Email: treyd...@tamu.edu
 Jabber: treyd...@tamu.edu

 On Thu, Jan 29, 2015 at 4:48 AM, Ben Polman ben.pol...@science.ru.nl
 wrote:


 Christopher Samuel wrote on 01/29/15 03:16:

 we have included a

 ulimit -l unlimited

 in the slurmd init script, to be precise in /etc/default/slurm-llnl
 which gets automatically sourced
 by the slurmd init script on ubuntu

 on an Ubuntu 14.04 server slurmd would otherwise
 get the limits set for root, which are more restrictive

 Ben

  On 29/01/15 09:26, Trey Dockendorf wrote:

  Thanks for the response.  We use PropagateResourceLimits=NONE and also
 set both hard and soft for memlock to unlimited on all compute nodes
 via
 a file in /etc/security/limits.d.

 It's still worth putting ulimit -a in the batch script just before
 calling VASP to capture what is actually getting set, just in case
 there's something odd going on..

 We had people using VASP last year with Slurm and OpenMPI and they
 didn't seem to have any issues.

 Best of luck!
 Chris



 --
 -
 Dr. B.J.W. Polman, CCZ, Radboud University.
 Osiris administrator NWI
 Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone:
 +31-24-3653360
 e-mail: ben.pol...@science.ru.nl







[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Christopher Samuel

On 30/01/15 07:19, Trey Dockendorf wrote:

 I was not aware of that, thanks for teaching me something new.  I'll add
 that ulimit to slurmd init script as one of my test cases for resolving
 this.

Once you've done that can you send the output of ulimit -a from inside
the VASP batch job (just before it starts) so we can see what the limits
are please?
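For context, a minimal sketch of the kind of batch script in question (the node
counts match the test job mentioned later in the thread; the launch line and
vasp binary name are placeholders):

#!/bin/bash
#SBATCH -N 4
#SBATCH --ntasks-per-node=4

# Only effective if slurmd's own hard limit allows it -- see Ben's note above
ulimit -s unlimited

# Capture the limits actually in effect just before VASP starts
ulimit -a

srun vasp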

thanks,
Chris
-- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: SLURM with VASP

2015-01-29 Thread Trey Dockendorf
This is after I set ulimit -s unlimited in /etc/sysconfig/slurm.  The job
had -N4 --ntasks-per-node=4 and partition has MaxMemPerCPU=4000.

core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 0
file size   (blocks, -f) unlimited
pending signals (-i) 257745
max locked memory   (kbytes, -l) unlimited
max memory size (kbytes, -m) 15974400
open files  (-n) 8192
pipe size   (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) unlimited
cpu time   (seconds, -t) unlimited
max user processes  (-u) 257745
virtual memory  (kbytes, -v) 32108543
file locks  (-x) unlimited

The 'max locked memory' is set explicitly on the compute nodes, and the 'open
files' limit is defined in /etc/sysconfig/slurm based on a recommendation from the docs.

Thus far the unlimited stack size has allowed VASP to run far longer than it
ever has previously with the input files I was given.  It has even begun
printing what I assume is useful output (I know little about what to expect
from VASP in terms of output).  This is progress!

Is there any reason to put that unlimited stack size in /etc/sysconfig/slurm
rather than in /etc/security/limits.d/?  I prefer the latter, as that's where
I tend to look on a system when inspecting what limits are being set.

Thanks,
- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jan 29, 2015 at 4:15 PM, Christopher Samuel sam...@unimelb.edu.au
wrote:


 On 30/01/15 07:19, Trey Dockendorf wrote:

  I was not aware of that, thanks for teaching me something new.  I'll add
  that ulimit to slurmd init script as one of my test cases for resolving
  this.

 Once you've done that can you send the output of ulimit -a from inside
 the VASP batch job (just before it starts) so we can see what the limits
 are please?

 thanks,
 Chris
 --
  Christopher SamuelSenior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/  http://twitter.com/vlsci



[slurm-dev] Re: Parameter Analogous to MAXLOAD on Torque/Maui?

2015-01-29 Thread Michael Jennings

On Wed, Jan 28, 2015 at 6:54 PM, Novosielski, Ryan
novos...@ca.rutgers.edu wrote:

 Running Slurm 14.11 in pre-production. Had a question since I can't find the 
 answer in the documentation. Is there any config option that says to mark a 
 node unavailable if the load on it is higher than a certain value? This 
 feature of Maui is helping with our testing as we can have both Slurm and 
 Torque/Maui running at the same time and a job will not be scheduled by 
 Torque/Maui on a node that appears to have anything running on it (e.g. 
 something spawned by Slurm). I don't see a similar feature in Slurm.

This can be done fairly easily using NHC 1.4.1 and its built-in
check_ps_loadavg() check.  See the documentation for more details
(http://go.lbl.gov/nhc#check_ps_loadavg)
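A sketch of what that looks like in an NHC configuration file (the threshold is
an arbitrary example, and the exact argument semantics should be checked against
the documentation linked above):

# nhc.conf (sketch only)
# Flag the node as unhealthy when the load average exceeds the threshold,
# so NHC can mark/drain it according to its own configuration.
 * || check_ps_loadavg 40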

HTH!
Michael

-- 
Michael Jennings m...@lbl.gov
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E  W: 510-495-2687
MS 050B-3209  F: 510-486-8615