[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-16 Thread Stefan Doerr
Thank you very much! It seems there is agreement that jobacct_gather/linux
sums up shared memory, which is very probably the cause of my problem.
We are now switching to jobacct_gather/cgroup to see whether it counts
shared memory correctly.
I'll report back with the results.

On Fri, Dec 16, 2016 at 10:47 AM, Bjørn-Helge Mevik wrote:

>
> Christopher Samuel  writes:
>
> > On 16/12/16 02:15, Stefan Doerr wrote:
> >
> >> If I check on "top" indeed it shows all processes using the same amount
> >> of memory. Hence if I spawn 10 processes and you sum usages it would
> >> look like 10x the memory usage.
> >
> > Do you have:
> >
> > JobAcctGatherType=jobacct_gather/linux
> >
> > or:
> >
> > JobAcctGatherType=jobacct_gather/cgroup
>
> We discovered a while ago that if your job runs several processes that
> use shared memory, jobacct_gather/linux will count the shared memory
> once for each process, resulting in a larger RSS count than the real one
> (this is also the case for top).  One can ask the plugin not to count
> shared memory at all (JobAcctGatherParams=NoShared), but that will give
> you an RSS that is too low if the job uses shared memory.  Alternatively,
> one can ask it to use the "Proportional set size" [1]
> (JobAcctGatherParams=UsePss), which is what cgroup uses (I believe), and
> sounds like the best estimate to me.
>
>
> [1] https://en.wikipedia.org/wiki/Proportional_set_size
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-16 Thread Bjørn-Helge Mevik

Christopher Samuel  writes:

> On 16/12/16 02:15, Stefan Doerr wrote:
>
>> If I check on "top" indeed it shows all processes using the same amount
>> of memory. Hence if I spawn 10 processes and you sum usages it would
>> look like 10x the memory usage.
>
> Do you have:
>
> JobAcctGatherType=jobacct_gather/linux
>
> or:
>
> JobAcctGatherType=jobacct_gather/cgroup

We discovered a while ago that if your job runs several processes that
use shared memory, jobacct_gather/linux will count the shared memory
once for each process, resulting in a larger RSS count than the real one
(this is also the case for top).  One can ask the plugin not to count
shared memory at all (JobAcctGatherParams=NoShared), but that will give
you an RSS that is too low if the job uses shared memory.  Alternatively,
one can ask it to use the "Proportional set size" [1]
(JobAcctGatherParams=UsePss), which is what cgroup uses (I believe), and
sounds like the best estimate to me.


[1] https://en.wikipedia.org/wiki/Proportional_set_size
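
For reference, a minimal slurm.conf sketch of the two variants described
above (a sketch only, based on the option names mentioned here; adapt to
your site and Slurm version):

# slurm.conf -- keep the /proc-based gatherer, but account PSS instead of RSS
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss
# ...or ignore shared pages entirely:
#JobAcctGatherParams=NoShared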

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel

On 16/12/16 10:33, Kilian Cavalotti wrote:

> I remember Danny recommending to use jobacct_gather/linux over
> jobacct_gather/cgroup, because "cgroup adds quite a bit of overhead
> with very little benefit".
> 
> Did that change?

We took that advice but reverted because of this issue (from memory).

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Kilian Cavalotti

On Thu, Dec 15, 2016 at 11:47 PM, Douglas Jacobsen  wrote:
>
> There are other good reasons to use jobacct_gather/cgroup. In particular, if
> memory enforcement is used, jobacct_gather/linux will cause a job to be
> terminated if the summed memory exceeds the limit, which is fine so long as
> large-memory processes aren't forking and artificially inflating the apparent
> memory usage that jobacct_gather/linux sees when summing up contributions
> through the /proc interface.  jobacct_gather/cgroup, on the other hand,
> accounts memory much more reliably, even for workloads where large-memory
> processes (e.g., Java) fork child processes.

I remember Danny recommending to use jobacct_gather/linux over
jobacct_gather/cgroup, because "cgroup adds quite a bit of overhead
with very little benefit".
(cf. https://groups.google.com/forum/#!msg/slurm-devel/RdvQVn7So1w/YeB1Yq9bRjoJ)

Did that change?

Cheers,
-- 
Kilian


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Douglas Jacobsen


There are other good reasons to use jobacct_gather/cgroup. In particular, if
memory enforcement is used, jobacct_gather/linux will cause a job to be
terminated if the summed memory exceeds the limit, which is fine so long as
large-memory processes aren't forking and artificially inflating the apparent
memory usage that jobacct_gather/linux sees when summing up contributions
through the /proc interface.  jobacct_gather/cgroup, on the other hand,
accounts memory much more reliably, even for workloads where large-memory
processes (e.g., Java) fork child processes.
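
For anyone making the switch, a sketch of the relevant configuration; the
ProctrackType/TaskPlugin and cgroup.conf lines are the usual companions of
the cgroup gatherer and are assumptions on my part, not something verified
in this thread:

# slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
CgroupAutomount=yes
ConstrainRAMSpace=yes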



On 12/15/16 2:22 PM, Christopher Samuel wrote:

On 16/12/16 02:15, Stefan Doerr wrote:


If I check on "top" indeed it shows all processes using the same amount
of memory. Hence if I spawn 10 processes and you sum usages it would
look like 10x the memory usage.

Do you have:

JobAcctGatherType=jobacct_gather/linux

or:

JobAcctGatherType=jobacct_gather/cgroup

If the former, try the latter and see if it helps get better numbers (we
went to the former after suggestions from SchedMD but, if my own rather
unreliable memory serves, had to revert due to issues similar to those you
are seeing).

Best of luck,
Chris


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Christopher Samuel

On 16/12/16 02:15, Stefan Doerr wrote:

> If I check on "top" indeed it shows all processes using the same amount
> of memory. Hence if I spawn 10 processes and you sum usages it would
> look like 10x the memory usage.

Do you have:

JobAcctGatherType=jobacct_gather/linux

or:

JobAcctGatherType=jobacct_gather/cgroup

If the former, try the latter and see if it helps get better numbers (we
went to the former after suggestions from SchedMD but, if my own rather
unreliable memory serves, had to revert due to issues similar to those you
are seeing).

Best of luck,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Benjamin Redling

On 15 December 2016 at 14:48:24 CET, Stefan Doerr wrote:
>$ sinfo --version
>slurm 15.08.11
>
>$ sacct --format="CPUTime,MaxRSS" -j 72491
>   CPUTime MaxRSS
>-- --
>  00:27:06
>  00:27:06  37316236K
>
>
>I will have to ask the sysadms about cgroups since I'm just a user
>here.
>
>On Thu, Dec 15, 2016 at 12:05 PM, Merlin Hartley <
>merlin-sl...@mrc-mbu.cam.ac.uk> wrote:
>
>> I’ve been having the very same problem since I tried to enable
>Accounting
>> (slurmdbd) - so I have now had to disable accounting.
>>
>> It would seem therefore that this part of the documentation should be
>> updated:
>> https://slurm.schedmd.com/archive/slurm-15.08-latest/accounting.html
>> "To enable any limit enforcement you must at least have
>> *AccountingStorageEnforce=limits* in your slurm.conf, otherwise, even
>if
>> you have limits set, they will not be enforced. "
>>
>> I did not set that option at all in my slurm.conf and yet memory
>limits
>> started to be enforced - and again I don’t believe the memory
>estimate was
>> anything like correct.
>>
>> In the new year I may try accounting again but with
>"MemLimitEnforce=no”
>> set as well :)
>>
>>
>> Merlin
>>
>>
>> --
>> Merlin Hartley
>> IT Systems Engineer
>> MRC Mitochondrial Biology Unit
>> Cambridge, CB2 0XY
>> United Kingdom
>>
>> On 15 Dec 2016, at 10:32, Uwe Sauter  wrote:
>>
>>
>> You are correct. Which version do you run? Do you have cgroups
>enabled?
>> Can you enable debugging for slurmd on the nodes? The
>> output should contain what Slurm calculates as maximum memory for a
>job.
>>
>> One other option is to configure MemLimitEnforce=no (which defaults
>> to yes since 14.11).
>>
>>
>> On 15.12.2016 at 11:26, Stefan Doerr wrote:
>>
>> But this doesn't answer my question why it reports 10 times as much
>memory
>> usage as it is actually using, no?
>>
>> On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:
>>
>>
>>There are only two memory-related options, "--mem" and "--mem-per-cpu".
>>
>>--mem tells Slurm the memory requirement of the job (if used with
>>sbatch) or the step (if used with srun), but not the requirement
>>of each process.
>>
>>--mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
>>If only --mem-per-cpu is used without other options, the memory
>>requirement is calculated using the configured number of cores
>>(NOT the number of cores requested), as far as I can tell.
>>
>>You might want to play a bit more with the additional options.
>>
>>
>>
>>On 14.12.2016 at 12:09, Stefan Doerr wrote:
>>
>> Hi, I'm running a python batch job on SLURM with the following options
>>
>> #!/bin/bash
>> #
>> #SBATCH --job-name=metrics
>> #SBATCH --partition=xxx
>> #SBATCH --cpus-per-task=6
>> #SBATCH --mem=2
>> #SBATCH --output=slurm.%N.%j.out
>> #SBATCH --error=slurm.%N.%j.err
>>
>> So as I understand each process will have 20GB of RAM dedicated to it.
>>
>> Running it I get:
>>
>> slurmstepd: Job 72475 exceeded memory limit (39532832 > 2048), being
>> killed
>> slurmstepd: Exceeded job memory limit
>>
>> This however cannot be true. I've run the same script locally and it uses
>> 1-2GB of RAM. If it was using 40GB I would have gone to
>> swap and definitely noticed.
>>
>> So I put some prints in my python code to see how much memory is used and
>> indeed it shows a max usage of 1.7GB and before the
>> error 1.2GB usage.
>>
>> What is happening here? I mean I could increase the mem option but then I
>> will be able to run much fewer jobs on my machines which
>> seems really limiting.
>>
>>
>>
>>
>>

Hi, AFAIK memory usage with cgroups is more than plain RSS: it also includes
file cache, etc.
So, the plugins in use would be really interesting.
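As an example, on a node where the job runs one can look at what the memory
cgroup itself counts; rss and cache are reported separately in memory.stat.
The path below assumes Slurm's usual uid_*/job_* cgroup layout and is only a
sketch:

grep -E '^(rss|cache|total_rss|total_cache) ' \
    /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_<jobid>/memory.stat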
Regards, Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Stefan Doerr
I decided to test it locally. So I ran the exact batch script that I run on
SLURM on my machine and monitored max memory usage with time -v

The first numbers that you see are what Python reports as RSS and virtual
memory currently in use. As you can see, it maxes out at 1.7GB RSS, which
agrees perfectly with the "Maximum resident set size (kbytes): 1742836"
that time -v reports.

The only thing I can think of that could confuse SLURM is that my
calculations are done using joblib.Parallel in Python, which forks the
process to parallelize my calculations:
https://pythonhosted.org/joblib/parallel.html#using-the-threading-backend
As you can read here:
"By default Parallel uses the Python multiprocessing module to fork
separate Python worker processes to execute tasks concurrently on separate
CPUs. This is a reasonable default for generic Python programs but it
induces some overhead as the input and output data need to be serialized in
a queue for communication with the worker processes."

If I check on "top" indeed it shows all processes using the same amount of
memory. Hence if I spawn 10 processes and you sum usages it would look like
10x the memory usage.


[a@xxx yyy]$ /usr/bin/time -v bash run.sh
...
128.8192 1033.396224
128.262144 1032.871936
...
1776.664576 2648.588288
472.502272 1368.522752
...
142.925824 1037.533184
135.794688 1037.254656
Command being timed: "bash run.sh"
User time (seconds): 3501.55
System time (seconds): 263.83
Percent of CPU this job got: 1421%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:24.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1742836
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 4
Minor (reclaiming a frame) page faults: 83921772
Voluntary context switches: 111865
Involuntary context switches: 98409
Swaps: 0
File system inputs: 904096
File system outputs: 29224
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
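
To see the double counting caused by the forked workers by hand, one can
compare summed RSS with summed PSS over the worker processes. This is only
a sketch: the "python" filter in pgrep is an assumption and should be
adjusted to match the actual worker processes.

# Sum VmRSS over the workers -- roughly what jobacct_gather/linux adds up
for p in $(pgrep -u "$USER" python); do
    awk '/^VmRSS:/ {print $2}' "/proc/$p/status"
done | awk '{s += $1} END {print s, "kB summed RSS"}'

# Sum PSS instead: shared pages are split among the processes mapping them
for p in $(pgrep -u "$USER" python); do
    awk '/^Pss:/ {s += $2} END {print s}' "/proc/$p/smaps"
done | awk '{s += $1} END {print s, "kB summed PSS"}'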

On Thu, Dec 15, 2016 at 2:50 PM, Stefan Doerr  wrote:

> $ sinfo --version
> slurm 15.08.11
>
> $ sacct --format="CPUTime,MaxRSS" -j 72491
>CPUTime MaxRSS
> -- --
>   00:27:06
>   00:27:06  37316236K
>
>
> I will have to ask the sysadms about cgroups since I'm just a user here.
>
> On Thu, Dec 15, 2016 at 12:05 PM, Merlin Hartley <
> merlin-sl...@mrc-mbu.cam.ac.uk> wrote:
>
>> I’ve been having the very same problem since I tried to enable Accounting
>> (slurmdbd) - so I have now had to disable accounting.
>>
>> It would seem therefore that this part of the documentation should be
>> updated:
>> https://slurm.schedmd.com/archive/slurm-15.08-latest/accounting.html
>> "To enable any limit enforcement you must at least have
>> *AccountingStorageEnforce=limits* in your slurm.conf, otherwise, even if
>> you have limits set, they will not be enforced. "
>>
>> I did not set that option at all in my slurm.conf and yet memory limits
>> started to be enforced - and again I don’t believe the memory estimate was
>> anything like correct.
>>
>> In the new year I may try accounting again but with "MemLimitEnforce=no”
>> set as well :)
>>
>>
>> Merlin
>>
>>
>> --
>> Merlin Hartley
>> IT Systems Engineer
>> MRC Mitochondrial Biology Unit
>> Cambridge, CB2 0XY
>> United Kingdom
>>
>> On 15 Dec 2016, at 10:32, Uwe Sauter  wrote:
>>
>>
>> You are correct. Which version do you run? Do you have cgroups enabled?
>> Can you enable debugging for slurmd on the nodes? The
>> output should contain what Slurm calculates as maximum memory for a job.
>>
>> One other option is to configure MemLimitEnforce=no (which defaults to
>> yes since 14.11).
>>
>>
>> On 15.12.2016 at 11:26, Stefan Doerr wrote:
>>
>> But this doesn't answer my question why it reports 10 times as much
>> memory usage as it is actually using, no?
>>
>> On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:
>>
>>
>>There are only two memory-related options, "--mem" and "--mem-per-cpu".
>>
>>--mem tells Slurm the memory requirement of the job (if used with
>>sbatch) or the step (if used with srun), but not the requirement
>>of each process.
>>
>>--mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
>>If only --mem-per-cpu is used without other options, the memory
>>requirement is calculated using the configured number of cores
>>(NOT the number of cores requested), as far as I can tell.
>>
>>You might want to play a bit more with the additional options.
>>
>>
>>
>>On 14.12.2016 at 12:09, Stefan Doerr wrote:
>>
>> Hi, I'm running a python batch job on SLURM with the following options
>>
>> #!/bin/bash
>> #
>> #SBATCH --job-name=metrics
>> #SBATCH --partition=xxx
>> #SBATCH --cpus-per-task=6
>> #SBATCH 

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Stefan Doerr
$ sinfo --version
slurm 15.08.11

$ sacct --format="CPUTime,MaxRSS" -j 72491
   CPUTime MaxRSS
-- --
  00:27:06
  00:27:06  37316236K


I will have to ask the sysadms about cgroups since I'm just a user here.
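
(In case it is useful to others: the gather plugin and related settings can
usually be read without admin rights, e.g. with something along the lines of

scontrol show config | grep -E 'JobAcctGather|MemLimitEnforce|ProctrackType'

though of course the sysadmins have the authoritative answer.)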

On Thu, Dec 15, 2016 at 12:05 PM, Merlin Hartley <
merlin-sl...@mrc-mbu.cam.ac.uk> wrote:

> I’ve been having the very same problem since I tried to enable Accounting
> (slurmdbd) - so I have now had to disable accounting.
>
> It would seem therefore that this part of the documentation should be
> updated:
> https://slurm.schedmd.com/archive/slurm-15.08-latest/accounting.html
> "To enable any limit enforcement you must at least have
> *AccountingStorageEnforce=limits* in your slurm.conf, otherwise, even if
> you have limits set, they will not be enforced. "
>
> I did not set that option at all in my slurm.conf and yet memory limits
> started to be enforced - and again I don’t believe the memory estimate was
> anything like correct.
>
> In the new year I may try accounting again but with "MemLimitEnforce=no”
> set as well :)
>
>
> Merlin
>
>
> --
> Merlin Hartley
> IT Systems Engineer
> MRC Mitochondrial Biology Unit
> Cambridge, CB2 0XY
> United Kingdom
>
> On 15 Dec 2016, at 10:32, Uwe Sauter  wrote:
>
>
> You are correct. Which version do you run? Do you have cgroups enabled?
> Can you enable debugging for slurmd on the nodes? The
> output should contain what Slurm calculates as maximum memory for a job.
>
> One other option is to configure MemLimitEnforce=no (which defaults to yes
> since 14.11).
>
>
> On 15.12.2016 at 11:26, Stefan Doerr wrote:
>
> But this doesn't answer my question why it reports 10 times as much memory
> usage as it is actually using, no?
>
> On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:
>
>
>There are only two memory-related options, "--mem" and "--mem-per-cpu".
>
>--mem tells Slurm the memory requirement of the job (if used with
>sbatch) or the step (if used with srun), but not the requirement
>of each process.
>
>--mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
>If only --mem-per-cpu is used without other options, the memory
>requirement is calculated using the configured number of cores
>(NOT the number of cores requested), as far as I can tell.
>
>You might want to play a bit more with the additional options.
>
>
>
>On 14.12.2016 at 12:09, Stefan Doerr wrote:
>
> Hi, I'm running a python batch job on SLURM with the following options
>
> #!/bin/bash
> #
> #SBATCH --job-name=metrics
> #SBATCH --partition=xxx
> #SBATCH --cpus-per-task=6
> #SBATCH --mem=2
> #SBATCH --output=slurm.%N.%j.out
> #SBATCH --error=slurm.%N.%j.err
>
> So as I understand each process will have 20GB of RAM dedicated to it.
>
> Running it I get:
>
> slurmstepd: Job 72475 exceeded memory limit (39532832 > 2048), being
> killed
> slurmstepd: Exceeded job memory limit
>
> This however cannot be true. I've run the same script locally and it uses
> 1-2GB of RAM. If it was using 40GB I would have gone to
> swap and definitely noticed.
>
> So I put some prints in my python code to see how much memory is used and
> indeed it shows a max usage of 1.7GB and before the
> error 1.2GB usage.
>
> What is happening here? I mean I could increase the mem option but then I
> will be able to run much fewer jobs on my machines which
> seems really limiting.
>
>
>
>
>


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Merlin Hartley
I’ve been having the very same problem since I tried to enable Accounting 
(slurmdbd) - so I have now had to disable accounting.

It would seem therefore that this part of the documentation should be updated:
https://slurm.schedmd.com/archive/slurm-15.08-latest/accounting.html
"To enable any limit enforcement you must at least have 
AccountingStorageEnforce=limits in your slurm.conf, otherwise, even if you have 
limits set, they will not be enforced. "

I did not set that option at all in my slurm.conf and yet memory limits started 
to be enforced - and again I don’t believe the memory estimate was anything 
like correct.

In the new year I may try accounting again but with "MemLimitEnforce=no” set as 
well :)
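
For reference, the relevant slurm.conf line would simply be something like
the following (a sketch; MemLimitEnforce is the option Uwe mentions below,
defaulting to yes since 14.11):

# slurm.conf
MemLimitEnforce=no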


Merlin


--
Merlin Hartley
IT Systems Engineer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom

> On 15 Dec 2016, at 10:32, Uwe Sauter  wrote:
> 
> 
> You are correct. Which version do you run? Do you have cgroups enabled? Can 
> you enable debugging for slurmd on the nodes? The
> output should contain what Slurm calculates as maximum memory for a job.
> 
> One other option is to configure MemLimitEnforce=no (which defaults to yes
> since 14.11).
>
>
> On 15.12.2016 at 11:26, Stefan Doerr wrote:
>> But this doesn't answer my question why it reports 10 times as much memory 
>> usage as it is actually using, no?
>> 
>> On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter wrote:
>> 
>> 
>>There are only two memory-related options, "--mem" and "--mem-per-cpu".
>>
>>--mem tells Slurm the memory requirement of the job (if used with
>>sbatch) or the step (if used with srun), but not the requirement
>>of each process.
>>
>>--mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
>>If only --mem-per-cpu is used without other options, the memory
>>requirement is calculated using the configured number of cores
>>(NOT the number of cores requested), as far as I can tell.
>>
>>You might want to play a bit more with the additional options.
>>
>>
>>
>>On 14.12.2016 at 12:09, Stefan Doerr wrote:
>>> Hi, I'm running a python batch job on SLURM with the following options
>>> 
>>> #!/bin/bash
>>> #
>>> #SBATCH --job-name=metrics
>>> #SBATCH --partition=xxx
>>> #SBATCH --cpus-per-task=6
>>> #SBATCH --mem=2
>>> #SBATCH --output=slurm.%N.%j.out
>>> #SBATCH --error=slurm.%N.%j.err
>>> 
>>> So as I understand each process will have 20GB of RAM dedicated to it.
>>> 
>>> Running it I get:
>>> 
>>> slurmstepd: Job 72475 exceeded memory limit (39532832 > 2048), being 
>>> killed
>>> slurmstepd: Exceeded job memory limit
>>> 
>>> This however cannot be true. I've run the same script locally and it uses
>>> 1-2GB of RAM. If it was using 40GB I would have gone to
>>> swap and definitely noticed.
>>> 
>>> So I put some prints in my python code to see how much memory is used and 
>>> indeed it shows a max usage of 1.7GB and before the
>>> error 1.2GB usage.
>>> 
>>> What is happening here? I mean I could increase the mem option but then I
>>> will be able to run much fewer jobs on my machines which
>>> seems really limiting.
>> 
>> 



[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Uwe Sauter

You are correct. Which version do you run? Do you have cgroups enabled? Can you 
enable debugging for slurmd on the nodes? The
output should contain what Slurm calculates as maximum memory for a job.

One other option is to configure MemLimitEnforce=no (which defaults to yes
since 14.11).


On 15.12.2016 at 11:26, Stefan Doerr wrote:
> But this doesn't answer my question why it reports 10 times as much memory 
> usage as it is actually using, no?
> 
> On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter wrote:
> 
> 
> There are only two memory-related options, "--mem" and "--mem-per-cpu".
>
> --mem tells Slurm the memory requirement of the job (if used with
> sbatch) or the step (if used with srun), but not the requirement
> of each process.
>
> --mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
> If only --mem-per-cpu is used without other options, the memory
> requirement is calculated using the configured number of cores
> (NOT the number of cores requested), as far as I can tell.
>
> You might want to play a bit more with the additional options.
>
>
>
> On 14.12.2016 at 12:09, Stefan Doerr wrote:
> > Hi, I'm running a python batch job on SLURM with the following options
> >
> > #!/bin/bash
> > #
> > #SBATCH --job-name=metrics
> > #SBATCH --partition=xxx
> > #SBATCH --cpus-per-task=6
> > #SBATCH --mem=2
> > #SBATCH --output=slurm.%N.%j.out
> > #SBATCH --error=slurm.%N.%j.err
> >
> > So as I understand each process will have 20GB of RAM dedicated to it.
> >
> > Running it I get:
> >
> > slurmstepd: Job 72475 exceeded memory limit (39532832 > 2048), 
> being killed
> > slurmstepd: Exceeded job memory limit
> >
> > This however cannot be true. I've run the same script locally and it uses
> > 1-2GB of RAM. If it was using 40GB I would have gone to
> > swap and definitely noticed.
> >
> > So I put some prints in my python code to see how much memory is used 
> and indeed it shows a max usage of 1.7GB and before the
> > error 1.2GB usage.
> >
> > What is happening here? I mean I could increase the mem option but then I
> > will be able to run much fewer jobs on my machines which
> > seems really limiting.
> 
> 


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Stefan Doerr
But this doesn't answer my question why it reports 10 times as much memory
usage as it is actually using, no?

On Wed, Dec 14, 2016 at 1:00 PM, Uwe Sauter  wrote:

>
> There are only two memory-related options, "--mem" and "--mem-per-cpu".
>
> --mem tells Slurm the memory requirement of the job (if used with
> sbatch) or the step (if used with srun), but not the requirement
> of each process.
>
> --mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
> If only --mem-per-cpu is used without other options, the memory
> requirement is calculated using the configured number of cores
> (NOT the number of cores requested), as far as I can tell.
>
> You might want to play a bit more with the additional options.
>
>
>
> On 14.12.2016 at 12:09, Stefan Doerr wrote:
> > Hi, I'm running a python batch job on SLURM with the following options
> >
> > #!/bin/bash
> > #
> > #SBATCH --job-name=metrics
> > #SBATCH --partition=xxx
> > #SBATCH --cpus-per-task=6
> > #SBATCH --mem=2
> > #SBATCH --output=slurm.%N.%j.out
> > #SBATCH --error=slurm.%N.%j.err
> >
> > So as I understand each process will have 20GB of RAM dedicated to it.
> >
> > Running it I get:
> >
> > slurmstepd: Job 72475 exceeded memory limit (39532832 > 2048), being
> killed
> > slurmstepd: Exceeded job memory limit
> >
> > This however cannot be true. I've run the same script locally and it
> uses 1-2GB of RAM. If it was using 40GB I would have gone to
> > swap and definitely noticed.
> >
> > So I put some prints in my python code to see how much memory is used
> and indeed it shows a max usage of 1.7GB and before the
> > error 1.2GB usage.
> >
> > What is happening here? I mean I could increase the mem option but then
> I will be able to run much fewer jobs on my machines which
> > seems really limiting.
>


[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-14 Thread Uwe Sauter

There are only two memory-related options, "--mem" and "--mem-per-cpu".

--mem tells Slurm the memory requirement of the job (if used with sbatch)
or the step (if used with srun), but not the requirement of each process.

--mem-per-cpu is used in combination with --ntasks and --cpus-per-task.
If only --mem-per-cpu is used without other options, the memory
requirement is calculated using the configured number of cores (NOT the
number of cores requested), as far as I can tell.

You might want to play a bit more with the additional options.
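
As an illustration of the difference, a sketch only (values are made up;
--mem and --mem-per-cpu take megabytes by default):

#!/bin/bash
# 20 GB for the whole job, no matter how many processes it forks:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=20480
# Alternatively (commented out here): 4 GB per CPU, i.e. 6 x 4 GB = 24 GB total
##SBATCH --mem-per-cpu=4096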



On 14.12.2016 at 12:09, Stefan Doerr wrote:
> Hi, I'm running a python batch job on SLURM with the following options
> 
> #!/bin/bash
> #
> #SBATCH --job-name=metrics
> #SBATCH --partition=xxx
> #SBATCH --cpus-per-task=6
> #SBATCH --mem=2
> #SBATCH --output=slurm.%N.%j.out
> #SBATCH --error=slurm.%N.%j.err
> 
> So as I understand each process will have 20GB of RAM dedicated to it.
> 
> Running it I get:
> 
> slurmstepd: Job 72475 exceeded memory limit (39532832 > 2048), being 
> killed
> slurmstepd: Exceeded job memory limit
> 
> This however cannot be true. I've run the same script locally and it uses 
> 1-2GB of RAM. If it was using 40GB I would have gone to
> swap and definitely noticed.
> 
> So I put some prints in my python code to see how much memory is used and 
> indeed it shows a max usage of 1.7GB and before the
> error 1.2GB usage.
> 
> What is happening here? I mean I could increase the mem option but then I 
> will be able to run much fewer jobs on my machines which
> seems really limiting.