"Mark A. Grondona" <mgrond...@llnl.gov> writes:

> There will be detailed documentation regarding memory cgroups in
> the Documentation for your kernel (or, the latest documentation is
> here
>
>  http://www.kernel.org/doc/Documentation/cgroups/memory.txt

Thanks!  I'm going to go through that.

> I'm sure you have also read through the cgroup.conf(5) manpage in
> SLURM.

Yes. :)  After reading it, I still wasn't certain what the different
limits were actually limiting, or how the Allowed* and Constrain*
settings interact.

> SLURM sets memory.limit_in_bytes to the allocated memory for
[...]

Thanks for that description.  That made it much clearer to me how
task/cgroup works (or should work. :)
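
In case it helps anyone else reading along: as I understand it now, the
Constrain* options decide which limits task/cgroup sets at all, while the
Allowed*/Max*/Min* options scale them relative to the job's allocation and
the node's total memory.  So a cgroup.conf roughly like the following
(just an illustration, with the percentages matching my slurmd log further
down, not our complete config) should give the limits I see:

  ConstrainRAMSpace=yes     # set memory.limit_in_bytes for the job/step
  ConstrainSwapSpace=yes    # also set memory.memsw.limit_in_bytes
  AllowedRAMSpace=100       # percent of the job's allocated memory
  AllowedSwapSpace=0        # extra swap on top of that, in percent
  MaxRAMPercent=100         # upper cap, as percent of the node's RAM
  MaxSwapPercent=100        # upper cap for RAM+swap
  MinRAMSpace=30            # lower floor, in MB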

> In the slurmd log for the job there should be a line of output
> which details the settings that slurm is applying to the job step
> memory cgroup.

For my job, it says:

task/cgroup/memory: total:64530M allowed:100%, swap:0%, max:100%(64530M) max+swap:100%(129060M) min:30M

task/cgroup: /slurm/uid_10231/job_344: alloc=64530MB mem.limit=64530MB memsw.limit=64530MB
task/cgroup: /slurm/uid_10231/job_344/step_4294967294: alloc=64530MB mem.limit=64530MB memsw.limit=64530MB

I guess that means both memory.limit_in_bytes and
memory.memsw.limit_in_bytes are set to 64530MB, for both the job and
the step.
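
To double-check, the limits can also be read straight out of the cgroup
filesystem on the compute node.  A small sketch (assuming a cgroup v1
memory controller mounted at /sys/fs/cgroup/memory and the hierarchy
path from the log above; both are site-specific):

  import os

  CGROUP_ROOT = "/sys/fs/cgroup/memory"   # mount point varies between sites
  JOB_CGROUP = "slurm/uid_10231/job_344"  # path taken from the slurmd log

  # memory.limit_in_bytes / memory.memsw.limit_in_bytes are the files the
  # task/cgroup plugin writes the RAM and RAM+swap limits into.
  for name in ("memory.limit_in_bytes", "memory.memsw.limit_in_bytes"):
      with open(os.path.join(CGROUP_ROOT, JOB_CGROUP, name)) as f:
          limit = int(f.read())
      print("%s = %d bytes (%d MB)" % (name, limit, limit // 2**20))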

> I have a spank plugin that essentially greps the dmesg output
> after job completion and issues such a message to the stderr of
> the job if a task has been terminated by the OOM killer. It is
> not perfect, but works 90% of the time. I can send it to you if
> you like.

Yes, I'd very much like that!  Jobs being killed by the memory limit
are quite common on our cluster, and users get confused when there is
no message telling them why the job died.
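
For the archives, the gist of such a check could look something like the
sketch below.  This is not Mark's plugin, just a rough illustration in
Python; the kernel's OOM messages differ between kernel versions, so the
pattern here is an assumption, not something to rely on blindly:

  import re
  import subprocess
  import sys

  def report_oom_kills(task_pids):
      """Warn on stderr for each of our task PIDs the OOM killer hit."""
      dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
      # Typical lines look like "Killed process 12345 (a.out) ..." or
      # "Memory cgroup out of memory: Kill process 12345 (a.out) ...";
      # the exact wording depends on the kernel version.
      pattern = re.compile(r"[Kk]ill(?:ed)? process (\d+) \(([^)]+)\)")
      for pid, comm in pattern.findall(dmesg):
          if int(pid) in task_pids:
              print("task %s (pid %s) appears to have been killed by the "
                    "OOM killer (memory limit exceeded?)" % (comm, pid),
                    file=sys.stderr)

  # e.g. report_oom_kills({12345, 12346}) with the PIDs of the step's tasks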


Thanks for a very informative answer!


-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
