Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-10 Thread Bjørn-Helge Mevik
Matthew BETTINGER writes: > Just curious if this option or oom setting (which we use) can leave > the nodes in CG "completing" state. I don't think so. As far as I know, jobs go into completing state when Slurm is cancelling them or when they exit on their own, and stay in that state until

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-09 Thread Matthew BETTINGER
Just curious if this option or oom setting (which we use) can leave the nodes in CG "completing" state. We see CG states quite often, and the only way out is to reboot the node. I believe it occurs when the parent process dies or gets killed or Z? Thanks. MB On 10/8/19, 6:11 AM, "slurm-users on

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-09 Thread Jean-mathieu CHANTREIN
- Original message - > Maybe I missed something else... That's right. Thanks to Bjørn-Helge, who helped me. You must enable swapaccount in the kernel as shown here: https://unix.stackexchange.com/questions/531480/what-does-swapaccount-1-in-grub-cmdline-linux-default-do By default, this is
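
A quick way to see whether a node was booted with the parameter mentioned above is to look at the kernel command line. The following is only a minimal sketch: the function names are hypothetical, it assumes a Linux node where /proc/cmdline is readable, and it only reports whether swapaccount=1 was passed at boot, not what the kernel does when the parameter is absent.

```python
def swapaccount_enabled(cmdline):
    """Return True if the boot parameter swapaccount=1 is present.

    `cmdline` is the kernel command line as a single string,
    e.g. the contents of /proc/cmdline.
    """
    return "swapaccount=1" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

On a compute node, swapaccount_enabled(read_cmdline()) would then indicate whether the parameter took effect after the GRUB change and a reboot.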

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Marcus Boden writes: > you're looking for KillOnBadExit in the slurm.conf: > KillOnBadExit [...] > this should terminate the job if a step or a process gets oom-killed. That is a good tip! But as I read the documentation (I haven't tested it), it will only kill the job step itself, it will

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Juergen Salk writes: > that is interesting. We have a very similar setup as well. However, in > our Slurm test cluster I have noticed that it is not the *job* that > gets killed. Instead, the OOM killer terminates one (or more) > *processes* Yes, that is how the kernel OOM killer works. This

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Jean-mathieu CHANTREIN
Hello, thanks for your answers, > - Does it work if you remove the space in "TaskPlugin=task/affinity, > task/cgroup"? (Slurm can be quite picky when reading slurm.conf). That was already the case; I made a mistake when I copied and pasted... So there is no space here. > > - See in slurmd.log on the node(s) of
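
The stray space after the comma discussed above is easy to miss by eye. As a rough illustration, the sketch below flags key=value lines in a slurm.conf-style text where a comma is followed by whitespace. The function name is hypothetical and this is only a heuristic; it does not reproduce how Slurm itself parses the file.

```python
import re


def find_comma_space(conf_text):
    """Return (line_number, line) pairs where a comma is followed by whitespace.

    Comments (everything after '#') are ignored, and only lines that
    contain a key=value assignment are checked.
    """
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        if re.search(r",\s+\S", line):
            hits.append((lineno, line))
    return hits


sample = (
    "TaskPlugin=task/affinity, task/cgroup\n"
    "SelectTypeParameters=CR_CPU_Memory\n"
    "MemLimitEnforce=yes\n"
)
for lineno, line in find_comma_space(sample):
    print(f"line {lineno}: suspicious space after comma: {line}")
```

Running it on the settings quoted in this thread reports the TaskPlugin line and nothing else.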

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Juergen Salk
> On 19-10-08 10:36, Juergen Salk wrote: > > * Bjørn-Helge Mevik [191008 08:34]: > > > Jean-mathieu CHANTREIN writes: > > > > > > > I tried using, in slurm.conf > > > > TaskPlugin=task/affinity, task/cgroup > > > > SelectTypeParameters=CR_CPU_Memory > > > > MemLimitEnforce=yes > > > > > > >

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Marcus Boden
Hi Jürgen, you're looking for KillOnBadExit in the slurm.conf: KillOnBadExit If set to 1, a step will be terminated immediately if any task is crashed or aborted, as indicated by a non-zero exit code. With the default value of 0, if one of the processes is crashed or aborted the other

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Juergen Salk
* Bjørn-Helge Mevik [191008 08:34]: > Jean-mathieu CHANTREIN writes: > > > I tried using, in slurm.conf > > TaskPlugin=task/affinity, task/cgroup > > SelectTypeParameters=CR_CPU_Memory > > MemLimitEnforce=yes > > > > and in cgroup.conf: > > CgroupAutomount=yes > > ConstrainCores=yes > >

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN writes: > I tried using, in slurm.conf > TaskPlugin=task/affinity, task/cgroup > SelectTypeParameters=CR_CPU_Memory > MemLimitEnforce=yes > > and in cgroup.conf: > CgroupAutomount=yes > ConstrainCores=yes > ConstrainRAMSpace=yes > ConstrainSwapSpace=yes >

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Renfro, Michael
Our cgroup settings are quite a bit different, and we don’t allow jobs to swap, but the following works to limit memory here (I know, because I get frequent emails from users who don’t change their jobs from the default 2 GB per CPU that we use): CgroupMountpoint="/sys/fs/cgroup"

[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Jean-mathieu CHANTREIN
Hello, I tried using, in slurm.conf TaskPlugin=task/affinity, task/cgroup SelectTypeParameters=CR_CPU_Memory MemLimitEnforce=yes and in cgroup.conf: CgroupAutomount=yes ConstrainCores=yes ConstrainRAMSpace=yes ConstrainSwapSpace=yes MaxSwapPercent=10 TaskAffinity=no But when the job