[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-15 Thread Bill Barth
We don’t use cgroups with our SLURM at this time, though we have some ongoing 
investigations in that direction. There’s probably a way to get both plugins to 
cooperate.

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445
 
 

On 8/10/17, 12:31 PM, "Kilian Cavalotti"  wrote:


Hi Bill,

On Thu, Aug 10, 2017 at 5:33 AM, Bill Barth  wrote:
> If you add the same line from /etc/pam.d/system-auth (or your OS’s
> equivalent) to /etc/pam.d/slurm, then srun- and sbatch-initiated shells and
> processes will also have the directory properly set up.

That indeed seems like good advice to make sure XDG_RUNTIME_DIR is
consistently defined in users' environments wherever they're running, but
last time I checked, pam_systemd wasn't playing nice with Slurm's
cgroups feature (and that's a euphemism). Because systemd manages its
own cgroup hierarchy for user sessions, it caused all sorts of
issues when Slurm tried to set up its own cgroup structures for
tracking jobs' resources and enforcing limits, which prompted us to
actually *remove* pam_systemd from our compute node configurations.

Do you use cgroups in your Slurm setup with pam_systemd on nodes? And
if so, did you notice any issue with cgroups?

Cheers,
-- 
Kilian




[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-14 Thread Kilian Cavalotti

On Thu, Aug 10, 2017 at 10:31 AM, Kilian Cavalotti
 wrote:
> Do you use cgroups in your Slurm setup with pam_systemd on nodes? And
> if so, did you notice any issue with cgroups?

For what it's worth, I just checked again with Slurm 17.02 and CentOS
7.3, and can confirm that enabling pam_systemd.so in /etc/pam.d/slurm
breaks cgroup enforcement, at least for device access. We do enforce GPU
isolation through Slurm's ConstrainDevices setting and
/etc/slurm/cgroup_allowed_devices_file.conf, and as soon as
pam_systemd is active, all GPUs are visible from any job.
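
For reference, the relevant part of our cgroup.conf looks roughly like
this (illustrative, not the complete file):

    ### /etc/slurm/cgroup.conf (sketch)
    # requires TaskPlugin=task/cgroup in slurm.conf
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

The allowed-devices file lists the devices every job may always access;
GPUs are deliberately left out of it, so only the GRES actually allocated
to a job end up whitelisted in its device cgroup.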

Since this is far more important to us than XDG_* dirs, we disable
pam_systemd on our systems. That seems to be the official SchedMD
recommendation too (see https://bugs.schedmd.com/show_bug.cgi?id=3674
and https://bugs.schedmd.com/show_bug.cgi?id=3158).

Cheers,
-- 
Kilian


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread Kilian Cavalotti

Hi Bill,

On Thu, Aug 10, 2017 at 5:33 AM, Bill Barth  wrote:
> If you add the same line from /etc/pam.d/system-auth (or your OS’s 
> equivalent) to /etc/pam.d/slurm, then srun- and sbatch-initiated shells and 
> processes will also have the directory properly set up.

That indeed seems like good advice to make sure XDG_RUNTIME_DIR is
consistently defined in users' environments wherever they're running, but
last time I checked, pam_systemd wasn't playing nice with Slurm's
cgroups feature (and that's a euphemism). Because systemd manages its
own cgroup hierarchy for user sessions, it caused all sorts of
issues when Slurm tried to set up its own cgroup structures for
tracking jobs' resources and enforcing limits, which prompted us to
actually *remove* pam_systemd from our compute node configurations.
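
(Concretely, on our CentOS 7 nodes systemd tracks each login session under
something like
/sys/fs/cgroup/systemd/user.slice/user-<uid>.slice/session-<n>.scope,
while slurmd builds its own hierarchy along the lines of
/sys/fs/cgroup/<controller>/slurm/uid_<uid>/job_<jobid>/..., and the two
don't coordinate.)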

Do you use cgroups in your Slurm setup with pam_systemd on nodes? And
if so, did you notice any issue with cgroups?

Cheers,
-- 
Kilian


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread Bill Barth
Fortunately, once we figured out what systemd was doing, we didn’t need to
interact with it beyond adding its PAM module configuration line to Slurm’s
PAM config file.

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445
 
 

On 8/10/17, 8:34 AM, "John Hearns"  wrote:

Bill, thank you very much for that.  I guess I have to get my systemd hat on.
A hat which is very large and composed of many parts, and indeed functions 
as a pair of pants too.

On 10 August 2017 at 14:33, Bill Barth  wrote:

If you use a modern enough OS (RHEL/CentOS 7, etc.), XDG_RUNTIME_DIR will
probably be set and mounted (it’s a tmpfs with a limited max size, mounted
per-session under /run/user/) on your login nodes, on any node that
environment propagates to (like the first compute node of a job), and
anywhere that the user (or MPI stack) sshes to, due to the PAM integration
of pam_systemd.so in the auth process. Just having the environment variable
set is not quite enough, though; you also need it mounted and unmounted at
the end of each shell session. If you add the same line from
/etc/pam.d/system-auth (or your OS’s equivalent) to /etc/pam.d/slurm, then
srun- and sbatch-initiated shells and processes will also have the
directory properly set up. MPI jobs that use ssh will get the mount
automatically due to the ssh PAM integration with systemd, but those that
use PMI-* and srun need the additional PAM integration.

Like it or not, this systemd-based/freedesktop.org system for a private,
ephemeral temporary directory appears to be the future on Linux, and lots
of GUI-based programs (Qt) are already expecting it. There are instructions
in the standard for what you’re supposed to do as a developer if it doesn’t
exist or has the wrong permissions, but this method is at least becoming
standardized across Linux distributions. We first discovered this recently
on some new CentOS 7 boxes that we were running under SLURM, when some GUI
apps complained because it wasn’t mounted. It took a little while to figure
out where in the PAM stack to insert the pam_systemd.so configuration line
to guarantee that it was working for all our SLURM jobs, but the above
method seems to solve the problem.

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445

On 8/10/17, 3:06 AM, "Fokke Dijkstra"  wrote:

We use the spank-private-tmp plugin developed at HPC2N in Sweden:
https://github.com/hpc2n/spank-private-tmp

See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
for a presentation about the plugin.

2017-08-10 9:31 GMT+02:00 John Hearns :

I am sure someone discussed this topic on this list a few months ago... if
it rings any bells please let me know.
I am not discussing setting the TMPDIR environment variable and creating a
new TMPDIR directory on a per-job basis - though thank you for the help I
did get when discussing this.

Rather I would like to set up a new namespace when a job runs such that
/tmp is unique to every job.  /tmp can of course be a directory uniquely
created for that job and deleted afterwards.
This is to cope with any software which writes to /tmp rather than using
the TMPDIR variable.
If anyone does this let me know what your techniques are please.

--
Fokke Dijkstra
 
Research and Innovation Support
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands
+31-50-363 9243


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
Bill, thank you very much for that.  I guess I have to get my systemd hat on.
A hat which is very large and composed of many parts, and indeed functions
as a pair of pants too.



On 10 August 2017 at 14:33, Bill Barth  wrote:

> If you use a modern enough OS (RHEL/CentOS 7, etc), XDG_RUNTIME_DIR will
> probably be set and mounted (it’s a tmpfs with a limited max size mounted,
> per-session, under /run/user/) on your login nodes, any node that
> environment propagates to (like the first compute node of a job), and
> anywhere that the user (or MPI stack) sshes to due to the PAM integration
> of pam_systemd.so in the auth process. Just having the environment variable
> set is not quite enough, though; you also need it mounted and unmounted at
> the end of each shell session. If you add the same line from
> /etc/pam.d/system-auth (or your OS’s equivalent) to /etc/pam.d/slurm, then
> srun- and sbatch-initiated shells and processes will also have the
> directory properly set up. MPI jobs that use ssh will get the mount
> automatically due to the ssh PAM integration with systemd, but those that
> use PMI-* and srun need the additional PAM integration.
>
> Like it or not, this systemd-based/freedesktop.org system for a private,
> ephemeral temporary directory appears to be the future on Linux, and lots
> of GUI-based programs (Qt) are already expecting it. There are instructions
> in the standard for what you’re supposed to do as a developer if it doesn’t
> exist or has the wrong permissions, but this method is at least becoming
> standardized across Linux distributions. We first discovered this recently
> on some new CentOS 7 boxes that we were running under SLURM, when some GUI
> apps complained because it wasn’t mounted. It took a little
> while to figure out where in the PAM stack to insert the pam_systemd.so
> configuration line to guarantee that it was working for all our SLURM jobs,
> but the above method seems to solve the problem.
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bba...@tacc.utexas.edu|   Phone: (512) 232-7069
> Office: ROC 1.435|   Fax:   (512) 475-9445
>
>
>
> On 8/10/17, 3:06 AM, "Fokke Dijkstra"  wrote:
>
> We use the spank-private-tmp plugin developed at HPC2N in Sweden:
>
> https://github.com/hpc2n/spank-private-tmp
>
>
>
> See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
> for a presentation about the plugin.
>
>
>
>
> 2017-08-10 9:31 GMT+02:00 John Hearns :
>
> I am sure someone discussed this topic on this list a few months
> ago... if it rings any bells please let me know.
> I am not discussing setting the TMPDIR environment variable and
> creating a new TMPDIR directory on a per-job basis - though thank you for
> the help I did get when discussing this.
>
>
> Rather I would like to set up a new namespace when a job runs such
> that /tmp is unique to every job.  /tmp can of course be a directory
> uniquely created for that job and deleted afterwards.
> This is to cope with any software which writes to /tmp rather than
> using the TMPDIR variable.
> If anyone does this let me know what your techniques are please.
>
>
>
>
>
>
>
>
>
>
>
> --
> Fokke Dijkstra
>  
> Research and Innovation Support
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA  Groningen, The Netherlands
> +31-50-363 9243
>
>
>
>
>
>
>
>


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread Bill Barth
If you use a modern enough OS (RHEL/CentOS 7, etc), XDG_RUNTIME_DIR will 
probably be set and mounted (it’s a tmpfs with a limited max size mounted, 
per-session, under /run/user/) on your login nodes, any node that 
environment propagates to (like the first compute node of a job), and anywhere 
that the user (or MPI stack) sshes to due to the PAM integration of 
pam_systemd.so in the auth process. Just having the environment variable set is
not quite enough, though; you also need it mounted and unmounted at the end of
each shell session. If you add the same line from /etc/pam.d/system-auth (or 
your OS’s equivalent) to /etc/pam.d/slurm, then srun- and sbatch-initiated 
shells and processes will also have the directory properly set up. MPI jobs 
that use ssh will get the mount automatically due to the ssh PAM integration 
with systemd, but those that use PMI-* and srun need the additional PAM 
integration.
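
For concreteness, on our CentOS 7 systems the line in question is the
pam_systemd session entry; a sketch (check your own system-auth for the
exact form and its position in the stack):

    ### appended to /etc/pam.d/slurm, copied from /etc/pam.d/system-auth
    session     optional      pam_systemd.so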

Like it or not, this systemd-based/freedesktop.org system for a private, 
ephemeral temporary directory appears to be the future on Linux, and lots of 
GUI-based programs (Qt) are already expecting it. There are instructions in the 
standard for what you’re supposed to do as a developer if it doesn’t exist or 
has the wrong permissions, but this method is at least becoming standardized 
across Linux distributions. We first discovered this recently on some new
CentOS 7 boxes that we were running under SLURM, when some GUI apps
complained because it wasn’t mounted. It took a little while to figure out
where in the PAM stack to insert the pam_systemd.so configuration line to 
guarantee that it was working for all our SLURM jobs, but the above method 
seems to solve the problem.

Best,
Bill.

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445
 
 

On 8/10/17, 3:06 AM, "Fokke Dijkstra"  wrote:

We use the spank-private-tmp plugin developed at HPC2N in Sweden:
https://github.com/hpc2n/spank-private-tmp

See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
for a presentation about the plugin.

2017-08-10 9:31 GMT+02:00 John Hearns :

I am sure someone discussed this topic on this list a few months ago... if
it rings any bells please let me know.
I am not discussing setting the TMPDIR environment variable and creating a
new TMPDIR directory on a per-job basis - though thank you for the help I
did get when discussing this.

Rather I would like to set up a new namespace when a job runs such that
/tmp is unique to every job.  /tmp can of course be a directory uniquely
created for that job and deleted afterwards.
This is to cope with any software which writes to /tmp rather than using
the TMPDIR variable.
If anyone does this let me know what your techniques are please.

-- 
Fokke Dijkstra 
  
Research and Innovation Support 
Center for Information Technology, University of Groningen 
Postbus 11044, 9700 CA  Groningen, The Netherlands 
+31-50-363 9243


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
Fokke, thank you very much for the response.


On 10 August 2017 at 10:07, Fokke Dijkstra  wrote:

> We use the spank-private-tmp plugin developed at HPC2N in Sweden:
> https://github.com/hpc2n/spank-private-tmp
>
> See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
> for a presentation about the plugin.
>
>
> 2017-08-10 9:31 GMT+02:00 John Hearns :
>
>> I am sure someone discussed this topic on this list a few months ago...
>> if it rings any bells please let me know.
>> I am not discussing setting the TMPDIR environment variable and creating
>> a new TMPDIR directory on a per-job basis - though thank you for the help I
>> did get when discussing this.
>>
>> Rather I would like to set up a new namespace when a job runs such that
>> /tmp is unique to every job.  /tmp can of course be a directory uniquely
>> created for that job and deleted afterwards.
>> This is to cope with any software which writes to /tmp rather than using
>> the TMPDIR variable.
>> If anyone does this let me know what your techniques are please.
>>
>>
>
>
> --
> Fokke Dijkstra  
> Research and Innovation Support
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA  Groningen, The Netherlands
> +31-50-363 9243
>


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread Fokke Dijkstra
We use the spank-private-tmp plugin developed at HPC2N in Sweden:
https://github.com/hpc2n/spank-private-tmp

See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
for a presentation about the plugin.
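
It hooks in as a SPANK plugin, so it gets loaded through plugstack.conf,
roughly along these lines (the .so path below is just an example; see the
plugin's README for the actual name and supported arguments):

    # /etc/slurm/plugstack.conf
    required  /usr/lib64/slurm/private-tmp.so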


2017-08-10 9:31 GMT+02:00 John Hearns :

> I am sure someone discussed this topic on this list a few months ago... if
> it rings any bells please let me know.
> I am not discussing setting the TMPDIR environment variable and creating a
> new TMPDIR directory on a per-job basis - though thank you for the help I
> did get when discussing this.
>
> Rather I would like to set up a new namespace when a job runs such that
> /tmp is unique to every job.  /tmp can of course be a directory uniquely
> created for that job and deleted afterwards.
> This is to cope with any software which writes to /tmp rather than using
> the TMPDIR variable.
> If anyone does this let me know what your techniques are please.
>
>
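
For anyone curious what the plugin does under the hood: conceptually each
job gets its own mount namespace, with a per-job directory bind-mounted
over /tmp so that only that job's processes see it. A rough stand-alone
sketch of that mechanism in C (not the plugin's actual code; it needs root,
error handling is trimmed, and the "jobdir" path is just a placeholder):

    /* per-job private /tmp via a mount namespace -- illustrative sketch */
    #define _GNU_SOURCE
    #include <sched.h>      /* unshare(), CLONE_NEWNS */
    #include <stdio.h>      /* perror() */
    #include <sys/mount.h>  /* mount(), MS_* flags */
    #include <sys/stat.h>   /* mkdir() */
    #include <unistd.h>     /* execvp() */

    int main(int argc, char **argv)
    {
        /* A real plugin would derive this from the Slurm job ID. */
        const char *jobdir = argc > 1 ? argv[1] : "/local/job-tmp";

        mkdir(jobdir, 0700);  /* per-job private directory; EEXIST ignored */

        /* Private mount namespace for this process and its children. */
        if (unshare(CLONE_NEWNS) != 0) { perror("unshare"); return 1; }

        /* Keep our mount changes from leaking back to the host namespace. */
        if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
            perror("mount MS_PRIVATE"); return 1;
        }

        /* Bind the per-job directory over /tmp: only this job sees it. */
        if (mount(jobdir, "/tmp", NULL, MS_BIND, NULL) != 0) {
            perror("bind mount"); return 1;
        }

        /* Hand over to the job's command (here just a shell). */
        execvp("/bin/sh", (char *[]){ "/bin/sh", NULL });
        perror("execvp");
        return 1;
    }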


-- 
Fokke Dijkstra  
Research and Innovation Support
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands
+31-50-363 9243