[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
Thank you for the descriptions, community! When tasks in step_extern and tasks in step_batch are active at the same time, how is the memory accounting and summary done? When memory is over the limit, which one is killed? On Fri, Mar 17, 2017 at 12:18 PM, Nicholas McCollum

[slurm-dev] Exclusive socket configuration help

2017-03-17 Thread Cyrus Proctor
Hello, I currently have a small cluster for testing. Each compute node contains 2 sockets with 14 cores per CPU and a total of 128 GB RAM. I would like to set up Slurm such that two jobs can simultaneously share one compute node, effectively giving 1 socket (with binding) and half the total

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Nicholas McCollum
+1 : I tried getting oom_notifierd working in CentOS 7 but was unsuccessful. I'd be greatly interested if anyone has gotten this to work. I've ported some of the other BYU cgroup fencing tools over to CentOS 7 and added minor functionality improvements if anyone is interested. Thank you to Ryan

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Ryan Cox
usage_in_bytes is not actually usage in bytes, by the way. It's often close but I have seen wildly different values. See https://lkml.org/lkml/2011/3/28/93 and https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section 5.5. memory.stat is what you want for accurate data. I

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Sam Gallop (NBI)
Yep, that's it. While the fix is specific to the JobAcctGatherType=jobacct_gather/cgroup plugin, you would need to be using ProctrackType=proctrack/cgroup & TaskPlugin=task/cgroup for SLURM to be using cgroups. --- Samuel Gallop Computing infrastructure for Science CiS Support & Development

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
Thank you. I had some doubt about the accuracy of memory.stat. Sam, what slurm conf parameters do you recommend to try your fix in bug #3531? There are three places where cgroup plugin could be used: JobAcctGatherType = jobacct_gather/*cgroup* ProctrackType = proctrack/*cgroup*

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Sam Gallop (NBI)
Yes, the memory.usage_in_bytes is one sum, but in memory.stat the two figures are split …
# cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat | grep -Ew "^rss|^cache"
cache 16758034432
rss 663552
The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to address
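To make the rss/cache split concrete, here is a minimal Python sketch that parses memory.stat-style text and reports the two figures separately along with their sum. The function names parse_memory_stat and rss_and_cache are hypothetical helpers introduced only for illustration, and the sample text reuses the two values quoted in the message above; this is not how Slurm or the fix in bug 3531 computes anything.

```python
def parse_memory_stat(text):
    """Parse cgroup v1 memory.stat content into a dict of integer counters.

    Hypothetical helper for illustration; lines that are not
    "<name> <integer>" pairs are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def rss_and_cache(stats):
    """Return (rss, cache, rss + cache) in bytes; missing keys count as 0."""
    rss = stats.get("rss", 0)
    cache = stats.get("cache", 0)
    return rss, cache, rss + cache


# Sample values copied from the grep output quoted in the message.
sample = "cache 16758034432\nrss 663552\n"
rss, cache, total = rss_and_cache(parse_memory_stat(sample))
print(f"rss={rss} bytes, cache={cache} bytes, rss+cache={total} bytes")
```

On a live node the text would come from reading the step's memory.stat file instead of the sample string; with the sample values, nearly all of the charged memory is page cache rather than RSS.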

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
For the simple 'cp' test job, which copies a 5 GB file, the underlying issue is how to distinguish the memory used: which part is RSS and which part is file cache. cgroup reports them as one sum: memory.memsw.* (we turn swap off). The file cache can be small or very big

[slurm-dev] Re: Fwd: Dependency Problem In Full Queue

2017-03-17 Thread Benjamin Redling
Good examples: https://hpc.nih.gov/docs/job_dependencies.html BR On 2017-03-15 17:37, Álvaro pc wrote: > Hi again! > > I would really like to know about the behaviour of --dependency argument.. > > Nobody know anything? > > *Álvaro Ponce Cabrera.* > > > 2017-03-14 12:31 GMT+01:00 Álvaro pc

[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Will Dennis
Yes - I anonymize certain details of what I throw up on paste sites... that's one of those :) -Original Message- From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de] Sent: Friday, March 17, 2017 9:55 AM To: slurm-dev Subject: [slurm-dev] RE: MaxJobs on association not being

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Sam Gallop (NBI)
Hi, I believe you can get that message ('Exceeded job memory limit at some point') even if the job finishes fine. When the cgroup is created (by SLURM) it updates memory.limit_in_bytes with the job memory request coded in the job. During the life of the job the kernel updates a number of

[slurm-dev] sreport inconsistency

2017-03-17 Thread Marcin Stolarek
I've observed that the utilization and top users listings look inconsistent to me. Do I understand correctly that the percent used by users should sum to the percent allocated in cluster utilization? cheers, Marcin # sreport cluster utilization Start=2017-03-01 -t percent

[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Benjamin Redling
Re hi, On 2017-03-17 03:01, Will Dennis wrote: > My slurm.conf: > https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw > >> Are you sure the current running config is the one in the file? >> Did you double check via "scontrol show config" > > Yes, all params

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Shenglong Wang
What kind of error information will we get if applications try to use more memory than declared, as in the test we did before? Shenglong > On Mar 17, 2017, at 9:41 AM, Wensheng Deng wrote: > > The file is copied fine. It is just the error message that is annoying. > > > > On Thu, Mar

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
The file is copied fine. It is just the error message that is annoying. On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist wrote: > On 2017-03-15 17:52, Wensheng Deng wrote: > > No, it does not help: > > > > $ scontrol show config |grep -i jobacct > > > >

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-17 Thread kesim
Dear All, Yesterday I did some tests and it seemed that the scheduling is following CPU load but I was wrong. My configuration is at the moment: SelectType=select/cons_res SelectTypeParameters=CR_CPU,CR_LLN Today I submitted 70 threaded jobs to the queue and here is the CPU_LOAD info node1