Hi all,

I'm wondering what the current status/best practice is regarding slurm
and systemd both trying to control the same cgroup hierarchies.

We currently use systemd 224 and slurm 14.11.6 with proctrack/cgroup and
task/cgroup (we're based on Debian 8, but we compile our own slurm and
other software).

The problem is that if I use systemd sessions (i.e. pam_systemd as a
session module in /etc/pam.d/slurm), then systemd moves the processes to
its own cgroup subtree, e.g. /proc/$$/cgroup inside a slurm job:

9:blkio:/user.slice/user-3573.slice
8:cpuset:/slurm/uid_3573/job_16307/step_0
7:devices:/user.slice/user-3573.slice
6:net_prio:/
5:freezer:/slurm/uid_3573/job_16307/step_0
4:perf_event:/
3:memory:/slurm/uid_3573/job_16307/step_0/task_0
2:cpu,cpuacct:/slurm/uid_3573/job_16307/step_0/task_0
1:name=systemd:/user.slice/user-3573.slice/session-11.scope
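
For reference, the relevant part of /etc/pam.d/slurm is essentially just
the pam_systemd session line, something like (other modules omitted):

session    optional   pam_systemd.so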

If I don't use a systemd session, the cgroups are OK, e.g.:

9:blkio:/system.slice
8:cpuset:/slurm/uid_3573/job_16306/step_0
7:devices:/slurm/uid_3573/job_16306/step_0
6:net_prio:/
5:freezer:/slurm/uid_3573/job_16306/step_0
4:perf_event:/
3:memory:/slurm/uid_3573/job_16306/step_0/task_0
2:cpu,cpuacct:/slurm/uid_3573/job_16306/step_0/task_0
1:name=systemd:/system.slice/slurmd.service

However, there are two issues:
- systemd sees the processes as belonging to slurmd in the system.slice
  instead of in the user.slice, which can be problematic when
  e.g. restarting slurmd: depending on configuration it might kill all
  the jobs (see the drop-in sketch after this list).

- systemd is in charge of some session-related setup. We ran into a
  missing /run/user/<uid> directory, which some programs need
  (e.g. virsh/libvirt), and I suspect there are other setups we're
  missing as well.
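
For the first issue, the "depending on configuration" part is the slurmd
unit itself; what I have in mind is a drop-in along these lines (a sketch;
I'm not sure Delegate= is enough to keep systemd 224 away from the slurm
hierarchies, which is part of the question):

# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
# only kill the main slurmd process on restart, not every process
# (i.e. every job step) in its control group
KillMode=process
# tell systemd that the service manages its own cgroup sub-hierarchy
Delegate=yes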

Currently the devices cgroup is the problematic one (we constrain
/dev/nvidia*), but I suspect systemd will try to take over the other
cgroup hierarchies in the future.
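
To be concrete, the device constraining is set up roughly like this
(paths and device names are illustrative, not our exact files):

# cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

# gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

so a job only gets access to the /dev/nvidia* devices it was allocated,
as long as nothing else rewrites the devices hierarchy.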

Our current experimental solution is a pam_exec script (attached below)
that saves the cgroups before pam_systemd runs and restores them
afterwards, but I suspect it will cause other problems.
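
With the workaround in place, the session part of /etc/pam.d/slurm looks
roughly like this (the script path is just an example name):

session    optional   pam_exec.so /usr/local/sbin/slurm-save-cgroup save
session    optional   pam_systemd.so
session    optional   pam_exec.so /usr/local/sbin/slurm-save-cgroup restore

i.e. the script below records the slurm cgroups of the process before
pam_systemd runs, and writes its PID back into them afterwards.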

So my questions are: what is the best way of handling this? Is it
solved/changed in 15.08? Should I send this question to the systemd
maintainers? Should I even use pam_systemd for slurm jobs?


Thanks in advance,
    Yair.



Our experimental pam_exec script:
#!/bin/bash

# Run from /etc/pam.d/slurm via pam_exec, once with "save" before
# pam_systemd and once with "restore" after it.  The process we care
# about is our parent (the PAM'ed process), so find its PID.
_ppid=$(awk '$1=="PPid:"{print $2}' /proc/$$/status)
if [[ -z "$_ppid" ]]; then
    echo "Don't know my PPID" 1>&2
    exit 1
fi
_savefile="/tmp/.savecgroup.${PAM_SERVICE}.${PAM_TYPE}.${PAM_USER}.${_ppid}"

case "$1" in
    save)
        # Remember the cgroups slurm placed us in before pam_systemd
        # migrates the process into the user.slice.
        cat "/proc/${_ppid}/cgroup" > "${_savefile}"
    ;;
    restore)
        # Put the process back into every saved /slurm/... cgroup by
        # writing its PID into the corresponding tasks file.
        for cgroup in $(awk -F: '$3~/^\/slurm\// {printf "/sys/fs/cgroup/%s%s/tasks\n", $2, $3}' "${_savefile}"); do
            echo "${_ppid}" >> "${cgroup}"
        done
        rm -f "${_savefile}"
    ;;
    *)
        echo "Either save or restore" 1>&2
        exit 2
    ;;
esac

exit 0
