Hello,
we recently upgraded our SLURM nodes (SLURM 2.6.2) to Fedora 19 (kernel 3.11).
We use the cgroup plugins for TaskPlugin and ProctrackType:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
Since the upgrade we noticed error messages like this (after a job
completes) in the slurmd log file:
task/cgroup: not removing job memcg : Device or resource busy
task/cgroup: not removing user memcg : Device or resource busy
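For what it's worth, this is roughly the check I run by hand after a job finishes, to see why rmdir on the job's cgroup would return "Device or resource busy" (usually leftover tasks or child cgroups). The job path at the bottom is hypothetical; adjust it to your mount point:

```shell
#!/bin/sh
# Sketch: check whether a cgroup directory can be removed, or why not.
# rmdir on a cgroup returns EBUSY if tasks are still attached or child
# cgroups still exist. (For the memory controller, removal can also
# fail transiently while the kernel is still releasing charges.)
check_cgroup() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
    elif [ -s "$dir/tasks" ]; then
        echo "busy: $dir still has attached tasks"
    elif [ -n "$(find "$dir" -mindepth 1 -type d 2>/dev/null)" ]; then
        echo "busy: $dir still has child cgroups"
    else
        echo "free: $dir should be removable with rmdir"
    fi
}

# Hypothetical example path; substitute your uid/job IDs.
check_cgroup /cgroup/memory/slurm/uid_1000/job_1894127
```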
(The jobs themselves behave nicely: they don't spawn other processes,
and we see no problems when running the same processes outside SLURM.)
After roughly 10 completed jobs, the node becomes unresponsive. It
doesn't crash completely, but you can no longer SSH in or do much that
is useful. Once I was lucky enough to log in on a console and check the
running processes. Two processes were in D state:
- slurmstepd
- rmdir /cgroup/cpuset/slurm/uid_XXXXXXX/job_1894127/step_0 (executed by
the release agent).
dmesg didn't show anything strange. I wasn't able to do more diagnostics
because the node hung completely soon after that.
After that I focused on the cgroup configuration:
- disabling the release agent makes no difference
- I have now disabled the memory cgroup (ConstrainRAMSpace=no); the
cpuset cgroup is still active (ConstrainCores=yes). At first sight this
seems to fix the problem.
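For reference, the relevant part of our cgroup.conf now looks roughly like this (only the lines discussed here; everything else is unchanged):

```
# cgroup.conf (sketch, current workaround)
ConstrainCores=yes       # cpuset cgroup still enforced
ConstrainRAMSpace=no     # memory cgroup disabled to avoid the hangs
```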
At the moment, my best guess is a problem with the cgroup subsystem.
Has anyone seen a similar problem, or does anyone have an idea how to
fix it? Does anyone have a stable setup with cgroups enabled on Fedora
19 (or another distribution with a 3.11 kernel)?
Thanks,
Bram.