Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-15 Thread Williams, Jenny Avis
Here we see this too.  There is a difference in behavior depending on whether the 
program runs out of the "standard" NFS or the GPFS filesystem.

If the I/O is from NFS, there can be conditions where we see this with some 
frequency on a given problem.  It does not happen every time, but it can be reproduced.

The same routine run over GPFS would likely not present this error.
Our GPFS, however, is configured with a huge local LROC, memory pinned for mmfsd, 
etc.  You have to push the I/O much harder on GPFS than on NFS to get a D 
wait on that filesystem.
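
For what it's worth, a quick way to spot tasks stuck in uninterruptible (D)
sleep while a job is hammering the filesystem is something along these lines
(plain ps/awk, nothing Slurm-specific):

  # List processes currently in D state (uninterruptible sleep), which usually
  # means they are blocked on filesystem or other I/O; wchan shows the kernel
  # wait channel they are sitting in.
  ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'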

It appears to correlate with how efficiently the file caching is handled.

Jenny
UNC Chapel Hill

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
John DeSantis
Sent: Wednesday, February 14, 2018 9:50 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some 
point.

Geert,

Considering the following response from Loris:

> Maybe once in a while a simulation really does just use more memory 
> than you were expecting.  Have a look at the output of
> 
>   sacct -j 123456 -o jobid,maxrss,state --units=M
> 
> with the appropriate job ID.

This can certainly happen!  

I'd suggest profiling the job(s) in question; perhaps a loop of `ps` with the 
appropriate output modifiers, e.g. 'rss' (and vsz if you're tracking virtual 
memory usage).  

We've seen jobs that will terminate after several hours of run time 
because their memory usage spiked between JobAcctGatherFrequency sampling 
intervals (every 30 seconds, as configured in slurm.conf).

John DeSantis


On Wed, 14 Feb 2018 13:05:41 +0100
Loris Bennett  wrote:

> Geert Kapteijns  writes:
> 
> > Hi everyone,
> >
> > I’m running into out-of-memory errors when I specify an array job.
> > Needless to say, 100M should be more than enough, and increasing the 
> > allocated memory to 1G doesn't solve the problem. I call my script 
> > as follows: sbatch --array=100-199 run_batch_job.
> > run_batch_job contains
> >
> > #!/bin/env bash
> > #SBATCH --partition=lln
> > #SBATCH --output=/home/user/outs/%x.out.%a
> > #SBATCH --error=/home/user/outs/%x.err.%a
> > #SBATCH --cpus-per-task=1
> > #SBATCH --mem-per-cpu=100M
> > #SBATCH --time=2-00:00:00
> >
> > srun my_program.out $SLURM_ARRAY_TASK_ID
> >
> > Instead of using --mem-per-cpu and --cpus-per-task, I’ve also tried 
> > the following:
> >
> > #SBATCH --mem=100M
> > #SBATCH --ntasks=1  # Number of cores
> > #SBATCH --nodes=1  # All cores on one machine
> >
> > But in both cases for some of the runs, I get the error:
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> > srun: error: obelix-cn002: task 0: Out Of Memory
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> > I’ve also posted the question on stackoverflow. Does anyone know 
> > what is happening here?
> 
> Maybe once in a while a simulation really does just use more memory 
> than you were expecting.  Have a look at the output of
> 
>   sacct -j 123456 -o jobid,maxrss,state --units=M
> 
> with the appropriate job ID.
> 
> Regards
> 
> Loris
> 




Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-14 Thread John DeSantis
Geert,

Considering the following response from Loris:

> Maybe once in a while a simulation really does just use more memory
> than you were expecting.  Have a look at the output of
> 
>   sacct -j 123456 -o jobid,maxrss,state --units=M
> 
> with the appropriate job ID.

This can certainly happen!  

I'd suggest profiling the job(s) in question; perhaps a loop of `ps`
with the appropriate output modifiers, e.g. 'rss' (and vsz if you're
tracking virtual memory usage).  
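
As a rough sketch (hypothetical PID and interval, adjust to taste), something
like this polled alongside the job is usually enough to catch a spike:

  # Poll RSS and VSZ (in KiB) for a process every 5 seconds until it exits.
  # 12345 is a placeholder for the PID of the task you want to watch.
  PID=12345
  while kill -0 "$PID" 2>/dev/null; do
      ps -o pid,rss,vsz,comm -p "$PID"
      sleep 5
  done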

We've seen jobs that will terminate after several hours of run
time because their memory usage spiked between JobAcctGatherFrequency
sampling intervals (every 30 seconds, as configured in slurm.conf).
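
You can check what the gathering plugin and interval actually are on your
cluster with scontrol, e.g.:

  # Show the accounting-gather settings as the controller currently sees them
  # (JobAcctGatherType and JobAcctGatherFrequency come from slurm.conf).
  scontrol show config | grep -i jobacctgather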

John DeSantis


On Wed, 14 Feb 2018 13:05:41 +0100
Loris Bennett  wrote:

> Geert Kapteijns  writes:
> 
> > Hi everyone,
> >
> > I’m running into out-of-memory errors when I specify an array job.
> > Needless to say, 100M should be more than enough, and increasing
> > the allocated memory to 1G doesn't solve the problem. I call my
> > script as follows: sbatch --array=100-199 run_batch_job.
> > run_batch_job contains
> >
> > #!/bin/env bash
> > #SBATCH --partition=lln
> > #SBATCH --output=/home/user/outs/%x.out.%a
> > #SBATCH --error=/home/user/outs/%x.err.%a
> > #SBATCH --cpus-per-task=1
> > #SBATCH --mem-per-cpu=100M
> > #SBATCH --time=2-00:00:00
> >
> > srun my_program.out $SLURM_ARRAY_TASK_ID
> >
> > Instead of using --mem-per-cpu and --cpus-per-task, I’ve also tried
> > the following:
> >
> > #SBATCH --mem=100M
> > #SBATCH --ntasks=1  # Number of cores
> > #SBATCH --nodes=1  # All cores on one machine
> >
> > But in both cases for some of the runs, I get the error:
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> > srun: error: obelix-cn002: task 0: Out Of Memory
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> > I’ve also posted the question on stackoverflow. Does anyone know
> > what is happening here?
> 
> Maybe once in a while a simulation really does just use more memory
> than you were expecting.  Have a look at the output of
> 
>   sacct -j 123456 -o jobid,maxrss,state --units=M
> 
> with the appropriate job ID.
> 
> Regards
> 
> Loris
> 




Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-14 Thread Chris Bridson (NBI)
Also consider any cached data, e.g. from NFS.  You won't necessarily see 
this directly, but it might be getting accounted against the job in the cgroup, 
depending on your setup/settings.
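
If cgroups are in use, comparing the cache and rss counters in the job's memory
cgroup can show whether page cache is being charged to the job.  The path below
is only an example for a cgroup v1 layout, with hypothetical UID and job ID; it
will differ depending on your CgroupMountpoint and Slurm version:

  # "cache" is page cache charged to the cgroup, "rss" is anonymous memory.
  grep -E '^(cache|rss|total_cache|total_rss) ' \
      /sys/fs/cgroup/memory/slurm/uid_1000/job_123456/memory.stat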

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Loris Bennett
Sent: 14 February 2018 12:06
To: Geert Kapteijns 
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some 
point.

Geert Kapteijns  writes:

> Hi everyone,
>
> I’m running into out-of-memory errors when I specify an array job. 
> Needless to say, 100M should be more than enough, and increasing the 
> allocated memory to 1G doesn't solve the problem. I call my script as
> follows: sbatch --array=100-199 run_batch_job. run_batch_job contains
>
> #!/bin/env bash
> #SBATCH --partition=lln
> #SBATCH --output=/home/user/outs/%x.out.%a
> #SBATCH --error=/home/user/outs/%x.err.%a
> #SBATCH --cpus-per-task=1
> #SBATCH --mem-per-cpu=100M
> #SBATCH --time=2-00:00:00
>
> srun my_program.out $SLURM_ARRAY_TASK_ID
>
> Instead of using --mem-per-cpu and --cpus-per-task, I’ve also tried the 
> following:
>
> #SBATCH --mem=100M
> #SBATCH --ntasks=1  # Number of cores
> #SBATCH --nodes=1  # All cores on one machine
>
> But in both cases for some of the runs, I get the error:
>
> slurmstepd: error: Exceeded job memory limit at some point.
> srun: error: obelix-cn002: task 0: Out Of Memory
> slurmstepd: error: Exceeded job memory limit at some point.
>
> I’ve also posted the question on stackoverflow. Does anyone know what is 
> happening here?

Maybe once in a while a simulation really does just use more memory than you 
were expecting.  Have a look at the output of

  sacct -j 123456 -o jobid,maxrss,state --units=M

with the appropriate job ID.

Regards

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de



Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

2018-02-14 Thread Loris Bennett
Geert Kapteijns  writes:

> Hi everyone,
>
> I’m running into out-of-memory errors when I specify an array job. Needless 
> to say, 100M should be more than enough, and increasing the allocated memory 
> to 1G doesn't solve the problem. I call my script as
> follows: sbatch --array=100-199 run_batch_job. run_batch_job contains
>
> #!/bin/env bash
> #SBATCH --partition=lln
> #SBATCH --output=/home/user/outs/%x.out.%a
> #SBATCH --error=/home/user/outs/%x.err.%a
> #SBATCH --cpus-per-task=1
> #SBATCH --mem-per-cpu=100M
> #SBATCH --time=2-00:00:00
>
> srun my_program.out $SLURM_ARRAY_TASK_ID
>
> Instead of using --mem-per-cpu and --cpus-per-task, I’ve also tried the 
> following:
>
> #SBATCH --mem=100M
> #SBATCH --ntasks=1  # Number of cores
> #SBATCH --nodes=1  # All cores on one machine
>
> But in both cases for some of the runs, I get the error:
>
> slurmstepd: error: Exceeded job memory limit at some point.
> srun: error: obelix-cn002: task 0: Out Of Memory
> slurmstepd: error: Exceeded job memory limit at some point.
>
> I’ve also posted the question on stackoverflow. Does anyone know what is 
> happening here?

Maybe once in a while a simulation really does just use more memory than you
were expecting.  Have a look at the output of

  sacct -j 123456 -o jobid,maxrss,state --units=M

with the appropriate job ID.

Regards

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de