Thank you for the descriptions, community! When tasks in step_extern and
tasks in step_batch are active at the same time, how are the memory
accounting and summary done? When memory goes over the limit, which one
gets killed?
On Fri, Mar 17, 2017 at 12:18 PM, Nicholas McCollum
Hello,
I currently have a small cluster for testing. Each compute node contains
2 sockets with 14 cores per CPU and a total of 128 GB RAM. I would like
to set up Slurm such that two jobs can simultaneously share one compute
node, effectively giving each job 1 socket (with binding) and half the total
RAM.
+1 : I tried getting oom_notifierd working in CentOS 7 but was
unsuccessful. I'd be greatly interested if anyone has gotten this to
work. I've ported some of the other BYU cgroup fencing tools over to
CentOS 7 and added minor functionality improvements if anyone is
interested.
Thank you to Ryan
usage_in_bytes is not actually usage in bytes, by the way. It's often
close but I have seen wildly different values. See
https://lkml.org/lkml/2011/3/28/93 and
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
5.5. memory.stat is what you want for accurate data.
Yep, that's it. While the fix is specific to the
JobAcctGatherType=jobacct_gather/cgroup plugin, you would need to be using
ProctrackType=proctrack/cgroup &
TaskPlugin=task/cgroup for SLURM to be using cgroups.
---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development
Thank you. I had some doubt about the accuracy of memory.stat. Sam, what
slurm conf parameters do you recommend to try your fix in bug #3531? There
are three places where the cgroup plugin could be used:
JobAcctGatherType = jobacct_gather/*cgroup*
ProctrackType = proctrack/*cgroup*
TaskPlugin = task/*cgroup*
Yes, memory.usage_in_bytes is one sum, but in memory.stat the two figures
are split …
# cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat | grep -Ew "^rss|^cache"
cache 16758034432
rss 663552
The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to address
For the case of the simple 'cp' test job that copies a 5 GB file, the
underlying issue is how to distinguish the kinds of memory used: which part
is RSS and which is file (page) cache. cgroup reports them as one sum:
memory.memsw.* (we have swap turned off). The file cache can be small or very
big.
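To make the RSS-versus-cache split concrete, here is a small sketch (not the actual bug #3531 patch; the function and variable names are ours) that parses cgroup v1 memory.stat contents and reports the two figures separately:

```python
# Hypothetical sketch: split a cgroup v1 memory.stat into rss vs page cache,
# so a job copying a large file is not charged for reclaimable cache.
def split_memory_stat(stat_text):
    """Return (rss_bytes, cache_bytes) parsed from memory.stat contents."""
    fields = {}
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            fields[key] = int(value)
    return fields.get("rss", 0), fields.get("cache", 0)

# Figures taken from the memory.stat example earlier in the thread:
sample = "cache 16758034432\nrss 663552\nmapped_file 0\n"
rss, cache = split_memory_stat(sample)
print(rss, cache)  # rss is tiny; nearly all of the usage is page cache
```

For the 'cp' test above, almost the entire sum reported by memory.memsw.* is cache, which is why accounting on the raw total is misleading.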
Good examples:
https://hpc.nih.gov/docs/job_dependencies.html
BR
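As an illustration of the chaining that the NIH page describes (the job IDs and script name below are made up; only the --dependency=afterok:&lt;id&gt;[:&lt;id&gt;...] flag syntax is from the Slurm docs):

```python
# Illustrative sketch of building an sbatch --dependency argument.
def dependency_arg(dep_type, *job_ids):
    """Build an sbatch dependency flag, e.g. --dependency=afterok:101:102."""
    return "--dependency=" + dep_type + ":" + ":".join(str(j) for j in job_ids)

# Submit a post-processing job that runs only if jobs 101 and 102 succeed:
cmd = ["sbatch", dependency_arg("afterok", 101, 102), "postprocess.sh"]
print(" ".join(cmd))  # sbatch --dependency=afterok:101:102 postprocess.sh
```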
On 2017-03-15 17:37, Álvaro pc wrote:
> Hi again!
>
> I would really like to know about the behaviour of the --dependency argument.
>
> Does nobody know anything?
>
> *Álvaro Ponce Cabrera.*
>
>
> 2017-03-14 12:31 GMT+01:00 Álvaro pc
Yes - I anonymize certain details of what I throw up on paste sites... that's
one of those :)
-----Original Message-----
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de]
Sent: Friday, March 17, 2017 9:55 AM
To: slurm-dev
Subject: [slurm-dev] RE: MaxJobs on association not being
Hi,
I believe you can get that message ('Exceeded job memory limit at some point')
even if the job finishes fine. When the cgroup is created (by SLURM) it
sets memory.limit_in_bytes to the job's memory request.
During the life of the job the kernel updates a number of
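A rough sketch of that behaviour (names are illustrative, not SLURM internals): the requested memory becomes the cgroup limit, and if the kernel's peak-usage counter ever touches that limit, you can see the warning even though the job completed fine.

```python
# Illustrative model: SLURM writes the job's memory request into the
# cgroup's memory.limit_in_bytes; the kernel tracks peak usage (e.g. in
# memory.max_usage_in_bytes). A peak that reaches the limit can trigger
# the "Exceeded job memory limit at some point" message.
MB = 1024 * 1024

def exceeded_limit(requested_mb, max_usage_in_bytes):
    """True if peak usage reached the configured cgroup limit."""
    return max_usage_in_bytes >= requested_mb * MB

print(exceeded_limit(4096, 4096 * MB))  # peak hit the limit -> warning shown
print(exceeded_limit(4096, 1024 * MB))  # peak well under the limit
```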
I've observed that the utilization and top-users listings look inconsistent
to me.
Do I understand correctly that the percentages used by users should sum to
the percent allocated in the cluster utilization?
cheers,
Marcin
# sreport cluster utilization Start=2017-03-01 -t percent
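To state the intuition in the question as a check (the figures below are made up, and this is an assumption about how the numbers should relate, not a statement of sreport internals):

```python
# Hypothetical per-user "Used" percentages from an sreport-style report.
user_used = {"alice": 22.5, "bob": 14.0, "carol": 3.5}
allocated_percent = 40.0  # cluster-level "Allocated" figure (made up)

# The question's expectation: the user shares sum to the allocated share.
print(abs(sum(user_used.values()) - allocated_percent) < 1e-9)
```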
Re hi,
On 2017-03-17 03:01, Will Dennis wrote:
> My slurm.conf:
> https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw
>
>> Are you sure the current running config is the one in the file?
>> Did you double check via "scontrol show config"
>
> Yes, all params
What kind of error information will we get if applications try to use more
memory than declared, as in the test we did before?
Shenglong
> On Mar 17, 2017, at 9:41 AM, Wensheng Deng wrote:
>
> The file is copied fine. It is just the error message that is annoying.
>
>
>
> On Thu, Mar
The file is copied fine. It is just the error message that is annoying.
On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist
wrote:
> On 2017-03-15 17:52, Wensheng Deng wrote:
> > No, it does not help:
> >
> > $ scontrol show config |grep -i jobacct
> >
> >
Dear All,
Yesterday I did some tests and it seemed that the scheduling was following
CPU load, but I was wrong.
My configuration is at the moment:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN
Today I submitted 70 threaded jobs to the queue and here is the CPU_LOAD
info
node1