Hello,
we recently upgraded our SLURM nodes (SLURM 2.6.2) to Fedora 19 (kernel 3.11).
We use the cgroup plugins for TaskPlugin and ProctrackType:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
Since the upgrade we noticed error messages like this (after a job
completes) in the slurmd log file:
task/cgroup: not removing job memcg : Device or resource busy
task/cgroup: not removing user memcg : Device or resource busy
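For what it's worth, this is roughly the check I run by hand after a job finishes, to see why rmdir on the job's cgroup would return "Device or resource busy" (usually leftover tasks or child cgroups). The job path at the bottom is hypothetical; adjust it to your mount point:

```shell
#!/bin/sh
# Sketch: check whether a cgroup directory can be removed, or why not.
# rmdir on a cgroup returns EBUSY if tasks are still attached or child
# cgroups still exist. (For the memory controller, removal can also
# fail transiently while the kernel is still releasing charges.)
check_cgroup() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
    elif [ -s "$dir/tasks" ]; then
        echo "busy: $dir still has attached tasks"
    elif [ -n "$(find "$dir" -mindepth 1 -type d 2>/dev/null)" ]; then
        echo "busy: $dir still has child cgroups"
    else
        echo "free: $dir should be removable with rmdir"
    fi
}

# Hypothetical example path; substitute your uid/job IDs.
check_cgroup /cgroup/memory/slurm/uid_1000/job_1894127
```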
(The jobs themselves behave nicely: they don't spawn other processes,
and we see no problems when running the same processes outside SLURM.)
After roughly 10 completed jobs, the node becomes unresponsive. It
doesn't crash completely, but you can no longer SSH in or do much that
is useful. Once I was lucky enough to log in on a console and check the
running processes. Two processes were in D state:
- slurmstepd
- rmdir /cgroup/cpuset/slurm/uid_XXXXXXX/job_1894127/step_0 (executed by
the release agent).
dmesg didn't show anything strange. I wasn't able to do more diagnostics
because the node hung completely soon after that.
After that I focused on the cgroup configuration:
- disabling the release agent makes no difference
- I have now disabled the memory cgroup (ConstrainRAMSpace=no); the
cpuset cgroup is still active (ConstrainCores=yes). At first sight this
seems to fix the problem.
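For reference, the relevant part of our cgroup.conf now looks roughly like this (only the lines discussed here; everything else is unchanged):

```
# cgroup.conf (sketch, current workaround)
ConstrainCores=yes       # cpuset cgroup still enforced
ConstrainRAMSpace=no     # memory cgroup disabled to avoid the hangs
```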
At the moment, my best guess is a problem with the cgroup subsystem.
Has anyone seen a similar problem, or does anyone have an idea how to
fix it? Does anyone have a stable setup with cgroups enabled on Fedora
19 (or another distribution with a 3.11 kernel)?
Thanks,
Bram.