Thank you Loris!
Like many of our jobs, this is an embarrassingly parallel analysis, where we have to strike a compromise between a completely granular array of >100,000 small jobs and some kind of serialisation through loops. So the individual jobs where I noticed this behaviour are actually already part of an array :)

Cheers,

Arthur

-------------------------------------------------------------
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-------------------------------------------------------------

From: slurm-users <[email protected]> On Behalf Of Loris Bennett
Sent: Tuesday, 8 June 2021 16:05
To: Slurm User Community List <[email protected]>
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed

Dear Arthur,

Arthur Gilly <[email protected]> writes:

> Dear Slurm users,
>
> I am looking for a SLURM setting that will kill a job immediately when any
> subprocess of that job hits an OOM limit. Several posts have touched upon
> that, e.g.:
> https://www.mail-archive.com/[email protected]/msg04091.html and
> https://www.mail-archive.com/[email protected]/msg04190.html or
> https://bugs.schedmd.com/show_bug.cgi?id=3216 but I cannot find an answer
> that works in our setting.
>
> The two options I have found are:
>
> 1 Set shebang to #!/bin/bash -e, which we don’t want to do as we’d need to
> change this for hundreds of scripts from another cluster where we had a
> different scheduler, AND it would kill tasks for other runtime errors (e.g.
> if one command in the script doesn’t find a file).
>
> 2 Set KillOnBadExit=1. I am puzzled by this one. This is supposed to be
> overridden by srun’s -K option. Using the example below, srun -K --mem=1G
> ./multalloc.sh would be expected to kill the job at the first OOM. But it
> doesn’t, and happily keeps reporting 3 oom-kill events. So, will this work?
>
> The reason we want this is that we have scripts that execute programs in
> loops. These programs are slow and memory intensive. When the first one
> crashes for OOM, the next iterations also crash. In the current setup, we
> are wasting days executing loops where every iteration crashes after an
> hour or so due to OOM.

Not an answer to your question, but if your runs are independent, would using a job array help you here?

Cheers,

Loris

> We are using cgroups (and we want to keep them) with the following config:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainKmemSpace=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPercent=10
> TaskAffinity=no
>
> Relevant bits from slurm.conf:
>
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SelectType=select/cons_tres
> GresTypes=gpu,mps,bandwidth
>
> Very simple example:
>
> #!/bin/bash
> # multalloc.sh – each line is a very simple cpp program that allocates a 8Gb vector and fills it with random floats
> echo one
> ./alloc8Gb
> echo two
> ./alloc8Gb
> echo three
> ./alloc8Gb
> echo done.
>
> This is submitted as follows:
>
> sbatch --mem=1G ./multalloc.sh
>
> The log is:
>
> one
> ./multalloc.sh: line 4: 231155 Killed ./alloc8Gb
> two
> ./multalloc.sh: line 6: 231181 Killed ./alloc8Gb
> three
> ./multalloc.sh: line 8: 231263 Killed ./alloc8Gb
> done.
> slurmstepd: error: Detected 3 oom-kill event(s) in StepId=3130111.batch
> cgroup. Some of your processes may have been killed by the cgroup
> out-of-memory handler.
>
> I am expecting an OOM job kill right before “two”.
>
> Any help appreciated.
>
> Best regards,
>
> Arthur
>
> -------------------------------------------------------------
> Dr. Arthur Gilly
> Head of Analytics
> Institute of Translational Genomics
> Helmholtz-Centre Munich (HMGU)
> -------------------------------------------------------------
>
> Helmholtz Zentrum München
> Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
> Ingolstädter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschäftsführung: Prof. Dr. med. Dr. h.c. Matthias Tschöp, Kerstin Günther
> Registergericht: Amtsgericht München HRB 6466
> USt-IdNr: DE 129521671

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin
Email [email protected]
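For the multalloc.sh example quoted in this thread, one possible script-level workaround (not a Slurm setting, and not something confirmed anywhere in the thread) is to check each command's exit status. A process killed by SIGKILL, which is what the "Killed" lines in the log indicate, is reported by the shell with status 137 (128 + 9). The sketch below assumes a hypothetical helper called run_or_abort and uses a child shell that kills itself as a stand-in for ./alloc8Gb. Unlike the -e shebang, it only aborts on SIGKILL, so other runtime errors (such as a missing file) do not stop the script. Note that a status of 137 cannot distinguish an OOM kill from any other SIGKILL.

```shell
#!/bin/bash
# Hypothetical helper, not part of Slurm: run a command and abort the whole
# script if that command was terminated by SIGKILL (shell status 128+9=137).
run_or_abort() {
    "$@"
    local status=$?
    if [ "$status" -eq 137 ]; then
        echo "aborting: '$*' was killed by SIGKILL (possibly OOM)" >&2
        exit "$status"
    fi
    # Any other status is passed through without stopping the script.
    return "$status"
}

# Demo: 'sh -c "kill -9 \$\$"' stands in for ./alloc8Gb being OOM-killed.
# The command substitution runs in a subshell, so the exit inside
# run_or_abort ends only the demo, not the calling shell.
demo_output=$(
    echo one
    run_or_abort sh -c 'kill -9 $$'
    echo two
)
demo_status=$?
echo "$demo_output"
echo "exit status: $demo_status"
```

In multalloc.sh itself, each ./alloc8Gb line would become run_or_abort ./alloc8Gb, so the batch script would exit with a non-zero status before reaching "echo two".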

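The compromise mentioned in the reply, between one array task per item and a single long loop, can be sketched as a chunked job array: each array task handles a contiguous block of items. This is only an illustration. TOTAL_ITEMS, CHUNK_SIZE, chunk_bounds and process_item are hypothetical names, and the numbers are chosen to match the ">100,000 small jobs" figure, not taken from the actual pipeline. SLURM_ARRAY_TASK_ID is the environment variable Slurm sets for each array task. The usable array size depends on the site's MaxArraySize setting.

```shell
#!/bin/bash
#SBATCH --array=0-99
#SBATCH --mem=1G
# Sketch only: all names and sizes below are assumptions for illustration.

TOTAL_ITEMS=${TOTAL_ITEMS:-100000}   # hypothetical number of work items
CHUNK_SIZE=${CHUNK_SIZE:-1000}       # hypothetical items per array task

# Print "first last" (0-based, inclusive) for the chunk of a given task id.
chunk_bounds() {
    local task_id=$1 chunk_size=$2 total=$3
    local first=$(( task_id * chunk_size ))
    local last=$(( first + chunk_size - 1 ))
    # Clamp the final chunk to the last available item.
    if [ "$last" -ge "$total" ]; then
        last=$(( total - 1 ))
    fi
    echo "$first $last"
}

# Placeholder for the real per-item work.
process_item() {
    echo "processing item $1"
}

# Outside Slurm the task id defaults to 0 so the script can be dry-run.
set -- $(chunk_bounds "${SLURM_ARRAY_TASK_ID:-0}" "$CHUNK_SIZE" "$TOTAL_ITEMS")
first=$1
last=$2

i=$first
while [ "$i" -le "$last" ]; do
    process_item "$i"
    i=$(( i + 1 ))
done
```

With these assumed values, 100 array tasks of 1,000 items each cover the 100,000 items. A task that runs out of memory then affects only its own chunk, although the loop inside each task still needs its own check if later iterations should not run after an OOM kill.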