Also consider cached data, e.g. the NFS page cache. You won't necessarily see 
it in your process's own memory usage, but depending on your setup/settings it 
may be getting accounted for in the job's cgroup.
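
If you want to check whether page cache is being counted against the job, you 
can inspect the job's memory cgroup on the compute node while the job is 
running. A minimal sketch, assuming the common cgroup v1 layout (the exact 
path under /sys/fs/cgroup/memory depends on your cgroup.conf and may differ 
on your site):

  # Run from within the job (e.g. in the batch script or an interactive step).
  # The slurm/uid_<uid>/job_<jobid> layout below is an assumption about the
  # typical cgroup v1 hierarchy created by Slurm's cgroup plugins.
  CG=/sys/fs/cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOB_ID}
  grep -E '^(cache|rss|mapped_file) ' "$CG/memory.stat"   # cached vs. anonymous memory
  cat "$CG/memory.max_usage_in_bytes"                     # peak usage seen by the cgroup

If the cache figure is large relative to rss, cached file data (rather than 
your program's own allocations) may be what is pushing the job over its limit.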

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Loris Bennett
Sent: 14 February 2018 12:06
To: Geert Kapteijns <ghkaptei...@gmail.com>
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some 
point.

Geert Kapteijns <ghkaptei...@gmail.com> writes:

> Hi everyone,
>
> I’m running into out-of-memory errors when I specify an array job. 
> Needless to say, 100M should be more than enough, and increasing the 
> allocated memory to 1G doesn't solve the problem. I call my script as
> follows: sbatch --array=100-199 run_batch_job. run_batch_job contains
>
> #!/bin/env bash
> #SBATCH --partition=lln
> #SBATCH --output=/home/user/outs/%x.out.%a
> #SBATCH --error=/home/user/outs/%x.err.%a
> #SBATCH --cpus-per-task=1
> #SBATCH --mem-per-cpu=100M
> #SBATCH --time=2-00:00:00
>
> srun my_program.out $SLURM_ARRAY_TASK_ID
>
> Instead of using --mem-per-cpu and --cpus-per-task, I’ve also tried the 
> following:
>
> #SBATCH --mem=100M
> #SBATCH --ntasks=1  # Number of cores
> #SBATCH --nodes=1  # All cores on one machine
>
> But in both cases for some of the runs, I get the error:
>
> slurmstepd: error: Exceeded job memory limit at some point.
> srun: error: obelix-cn002: task 0: Out Of Memory
> slurmstepd: error: Exceeded job memory limit at some point.
>
> I’ve also posted the question on Stack Overflow. Does anyone know what is 
> happening here?

Maybe once in a while a simulation really does just use more memory than you 
were expecting.  Have a look at the output of

  sacct -j 123456 -o jobid,maxrss,state --units=M

with the appropriate job ID.
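
For an array job, sacct also accepts the <jobid>_<index> form, so you can look 
at the memory high-water mark of a single task; the job ID and array index 
below are just placeholders:

  sacct -j 123456_142 -o jobid,maxrss,state --units=M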

Regards

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
