Hi all,
I'm wondering what the current status/best practice is regarding Slurm and systemd mutual control of cgroups. We currently use systemd 224 and Slurm 14.11.6 with proctrack/cgroup and task/cgroup (we're based on Debian 8, but we compile our own Slurm and other software).

The problem is that if I use systemd sessions (i.e. pam_systemd in the session stack of /etc/pam.d/slurm), systemd moves the processes into its own cgroup subtree. For example, /proc/$$/cgroup inside a Slurm job shows:

  9:blkio:/user.slice/user-3573.slice
  8:cpuset:/slurm/uid_3573/job_16307/step_0
  7:devices:/user.slice/user-3573.slice
  6:net_prio:/
  5:freezer:/slurm/uid_3573/job_16307/step_0
  4:perf_event:/
  3:memory:/slurm/uid_3573/job_16307/step_0/task_0
  2:cpu,cpuacct:/slurm/uid_3573/job_16307/step_0/task_0
  1:name=systemd:/user.slice/user-3573.slice/session-11.scope

If I don't use a systemd session, the cgroups are ok, e.g.:

  9:blkio:/system.slice
  8:cpuset:/slurm/uid_3573/job_16306/step_0
  7:devices:/slurm/uid_3573/job_16306/step_0
  6:net_prio:/
  5:freezer:/slurm/uid_3573/job_16306/step_0
  4:perf_event:/
  3:memory:/slurm/uid_3573/job_16306/step_0/task_0
  2:cpu,cpuacct:/slurm/uid_3573/job_16306/step_0/task_0
  1:name=systemd:/system.slice/slurmd.service

However, there are two issues with that:

- systemd sees the processes as belonging to slurmd in the system.slice instead of to the user in the user.slice, which can be problematic when e.g. restarting slurmd - depending on configuration, it might kill all the jobs.
- systemd is in charge of some session-related setup. We encountered a missing /run/user/<uid> directory, which some programs need (e.g. virsh/libvirt), and I suspect there are other things it sets up as well.

Currently the devices cgroup is the problematic one (we constrain /dev/nvidia*), but I suspect systemd will try to take over the other cgroup hierarchies in the future.

Our current experimental solution is a pam_exec script that saves the cgroups before pam_systemd runs and restores them afterwards (script below, and a sketch of the PAM stack at the end). But I suspect there might be other problems with this approach.

So my questions are: what are the best methods of handling this? Is it solved/changed in 15.08? Should I send this question to the systemd maintainers? Should I even use pam_systemd for Slurm jobs?

Thanks in advance,
    Yair.

Our experimental pam_exec script:

  #!/bin/bash
  # Save the cgroups of the process being put through the PAM session
  # stack before pam_systemd runs, and move it back into the saved
  # Slurm cgroups afterwards.

  _ppid=$(awk '$1=="PPid:"{print $2}' /proc/$$/status)
  if [[ -z "$_ppid" ]]; then
      echo "Don't know my PPID" 1>&2
      exit 1
  fi

  _savefile="/tmp/.savecgroup.${PAM_SERVICE}.${PAM_TYPE}.${PAM_USER}.${_ppid}"

  case "$1" in
      save)
          # Remember the cgroups before pam_systemd moves the process.
          cat "/proc/${_ppid}/cgroup" > "${_savefile}"
          ;;
      restore)
          # Move the process back into every saved /slurm/... cgroup.
          for cgroup in $(awk -F: '$3~/^\/slurm\// {printf "/sys/fs/cgroup/%s%s/tasks\n", $2, $3}' "${_savefile}"); do
              echo "${_ppid}" >> "${cgroup}"
          done
          ;;
      *)
          echo "Either save or restore" 1>&2
          exit 2
          ;;
  esac
  exit 0
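For completeness, the session part of our /etc/pam.d/slurm wraps the script around pam_systemd roughly like the following. This is only a sketch of the "save before pam_systemd, restore after" idea; the script path is just an example, and pam_exec passes PAM_SERVICE/PAM_TYPE/PAM_USER to the script via the environment:

  # /etc/pam.d/slurm (session stack) - sketch, script path is an example
  session    optional    pam_exec.so    /usr/local/sbin/savecgroup save
  session    optional    pam_systemd.so
  session    optional    pam_exec.so    /usr/local/sbin/savecgroup restore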