Hi all,
I'm wondering what the current status and best practice are for slurm
and systemd sharing control of cgroups.
We currently use systemd 224 and slurm 14.11.6 with proctrack/cgroup,
task/cgroup (we're based on debian 8, but we compile our own slurm and
other software).
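For reference, the relevant part of our configuration looks roughly
like this (a sketch, not our verbatim files):

```
# slurm.conf (excerpt)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf (sketch; this is what constrains the devices cgroup,
# e.g. /dev/nvidia*)
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices.conf
```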
The problem is that if I use systemd sessions (i.e. a pam_systemd
session line in /etc/pam.d/slurm), then systemd moves the processes
into its own cgroup subtree, e.g. /proc/$$/cgroup inside a slurm job:
9:blkio:/user.slice/user-3573.slice
8:cpuset:/slurm/uid_3573/job_16307/step_0
7:devices:/user.slice/user-3573.slice
6:net_prio:/
5:freezer:/slurm/uid_3573/job_16307/step_0
4:perf_event:/
3:memory:/slurm/uid_3573/job_16307/step_0/task_0
2:cpu,cpuacct:/slurm/uid_3573/job_16307/step_0/task_0
1:name=systemd:/user.slice/user-3573.slice/session-11.scope
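(For context, our pam stack in /etc/pam.d/slurm looks roughly like
this - a sketch with illustrative module lines, not our verbatim file:)

```
# /etc/pam.d/slurm (sketch)
account  required  pam_access.so
session  required  pam_limits.so
session  optional  pam_systemd.so   # the line that re-parents the cgroups
```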
If I don't use a systemd session, the cgroups are OK, e.g.:
9:blkio:/system.slice
8:cpuset:/slurm/uid_3573/job_16306/step_0
7:devices:/slurm/uid_3573/job_16306/step_0
6:net_prio:/
5:freezer:/slurm/uid_3573/job_16306/step_0
4:perf_event:/
3:memory:/slurm/uid_3573/job_16306/step_0/task_0
2:cpu,cpuacct:/slurm/uid_3573/job_16306/step_0/task_0
1:name=systemd:/system.slice/slurmd.service
However, there are two issues:
- systemd sees the processes as belonging to slurmd in the system.slice
instead of in the user.slice, which can be problematic when,
e.g., restarting slurmd - depending on the service's KillMode
configuration, it might kill all the jobs along with the daemon.
- systemd is in charge of some session-related setup. We
encountered a missing /run/user/<uid> directory, which some programs
need (e.g. virsh/libvirt), but I suspect there is other setup as well.
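(To illustrate the second issue, here's the kind of minimal check we
run inside a job to see whether the session setup happened - a sketch,
nothing slurm-specific:)

```shell
#!/bin/bash
# Check for the per-user runtime directory that pam_systemd/logind
# normally creates when a session is registered.
uid=$(id -u)
if [[ -d "/run/user/${uid}" ]]; then
    echo "runtime dir present: /run/user/${uid}"
else
    echo "runtime dir missing: /run/user/${uid}"
fi
```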
Currently the devices cgroup is the problematic one (we constrain
/dev/nvidia*), but I suspect systemd will try to take over the other
cgroup hierarchies in the future.
Our current experimental solution is a pam_exec script that saves
the cgroups before pam_systemd runs and restores them afterwards, but I
suspect this might cause other problems.
So my questions are: what is the best way to handle this? Is it
solved/changed in 15.08? Should I take this question to the systemd
maintainers? Should I even use pam_systemd for slurm jobs?
Thanks in advance,
Yair.
Our experimental pam_exec script:
#!/bin/bash
# pam_exec helper: save the cgroups slurm assigned to the session's
# parent process before pam_systemd runs, and move it back afterwards.

# pam_exec runs us as a child of the process entering the session,
# so the process whose cgroups we care about is our parent.
_ppid=$(awk '$1 == "PPid:" { print $2 }' /proc/$$/status)
if [[ -z "$_ppid" ]]; then
    echo "Don't know my PPID" 1>&2
    exit 1
fi

_savefile="/tmp/.savecgroup.${PAM_SERVICE}.${PAM_TYPE}.${PAM_USER}.${_ppid}"

case "$1" in
    save)
        cat "/proc/${_ppid}/cgroup" > "${_savefile}"
        ;;
    restore)
        # Rebuild the tasks-file path for every hierarchy slurm manages
        # and move the parent process back into it.
        for cgroup in $(awk -F: '$3 ~ /^\/slurm\// { printf "/sys/fs/cgroup/%s%s/tasks\n", $2, $3 }' "${_savefile}"); do
            echo "$_ppid" >> "$cgroup"
        done
        rm -f "${_savefile}"   # clean up the savefile
        ;;
    *)
        echo "Either save or restore" 1>&2
        exit 2
        ;;
esac
exit 0
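To illustrate what the restore branch is meant to do: the awk turns
each saved /proc/<pid>/cgroup line back into the tasks file of the
corresponding hierarchy, e.g.:

```shell
#!/bin/bash
# Feed the restore-branch awk a sample saved cgroup line.
echo '5:freezer:/slurm/uid_3573/job_16307/step_0' |
    awk -F: '$3 ~ /^\/slurm\// { printf "/sys/fs/cgroup/%s%s/tasks\n", $2, $3 }'
# prints /sys/fs/cgroup/freezer/slurm/uid_3573/job_16307/step_0/tasks
```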