Re: [slurm-users] QoS settings in sacctmgr requires restarting slurmctld to take effect

2019-01-11 Thread Jianwen Wei
Thank you. I synced /etc/slurm.conf from the slurmctld node and then re-ran slurmdbd as root, which seems to solve the problem. Before that, "sacctmgr list clusters" complained about a missing /etc/slurm.conf. Best, Jianwen > On Jan 11, 2019, at 12:33, Chris Samuel wrote: > > On 10/1/19 6:15
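For reference, the fix described above amounts to roughly the following (a sketch; the hostname is a placeholder):

    # copy the controller's slurm.conf to this host, then restart slurmdbd
    scp slurmctld-host:/etc/slurm.conf /etc/slurm.conf
    systemctl restart slurmdbd
    # verify that sacctmgr can now find the config and see the cluster
    sacctmgr list clusters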

Re: [slurm-users] Array job execution trouble: some jobs in the array fail

2019-01-11 Thread Lyn Gerner
Hi Jean-Mathieu, I'd also recommend that you update to 17.11.12. I had issues with job arrays in 17.11.7, such as tasks erroneously being held as "DependencyNeverSatisfied", which I'm pleased to report I have not seen in .12. Best, Lyn On Fri, Jan 11, 2019 at 8:13 AM Jean-mathieu CHANTREIN <

Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-11 Thread Baker D . J .
Hi Chris, Thank you for your comments. Yesterday I experimented with increasing the PriorityWeightJobSize and that does appear to have quite a profound effect on the job mix executing at any one time. Larger jobs (needing 5 nodes or above) are now getting a decent share of the nodes in the
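For anyone searching the archives, the weight in question is set in slurm.conf under the multifactor priority plugin; the values below are purely illustrative, not the ones used on this cluster:

    # slurm.conf (illustrative values only)
    PriorityType=priority/multifactor
    PriorityWeightJobSize=100000   # larger jobs gain more priority
    PriorityWeightAge=10000
    PriorityWeightFairshare=100000
    # apply without restarting the daemons:
    #   scontrol reconfigure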

Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-11 Thread Skouson, Gary
You should be able to turn on some backfill debug info from slurmctld and have Slurm output the backfill information. Take a look at the DebugFlags settings Backfill and BackfillMap. Your bf_window is set to 3600 minutes, i.e. 2.5 days; if the start time of the large job is further out than that, it
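A sketch of what that might look like (the bf_window value below is only an example; bf_window is expressed in minutes):

    # slurm.conf
    DebugFlags=Backfill,BackfillMap
    SchedulerParameters=bf_window=7200   # look 5 days ahead instead of 2.5

    # or toggle the flags at run time:
    scontrol setdebugflags +Backfill
    scontrol setdebugflags +BackfillMap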

Re: [slurm-users] slurm, memory accounting and memory mapping

2019-01-11 Thread Sergey Koposov
Hi Janne, On Fri, 2019-01-11 at 10:37 +0200, Janne Blomqvist wrote: > On 11/01/2019 08.29, Sergey Koposov wrote: > > What is your memory limit configuration in slurm? Anyway, a few things to > > check: I guess the most relevant (uncommented) params I could see in slurm.conf are
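The parameter list is cut off above; for context, the slurm.conf settings that typically govern memory enforcement look something like the following (illustrative only, not the poster's actual values):

    # slurm.conf (typical memory-related settings, illustrative)
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory   # memory is a consumable resource
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    TaskPlugin=task/cgroup                # cgroup-based enforcement, if used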

Re: [slurm-users] GPU gres error for 1 of 3 GPU types

2019-01-11 Thread Sean McGrath
I forgot to mention before that we are running Slurm version 18.08.3. On Fri, Jan 11, 2019 at 10:35:09AM -0500, Paul Edmon wrote: > I'm pretty sure that gres.conf has to be on all the nodes as well > and not just the master. Thanks Paul. We deploy the same slurm configuration, including the

Re: [slurm-users] GPU gres error for 1 of 3 GPU types

2019-01-11 Thread Paul Edmon
I'm pretty sure that gres.conf has to be on all the nodes as well and not just the master. -Paul Edmon- On 1/11/19 5:21 AM, Sean McGrath wrote: Hi everyone, Your help for this would be much appreciated please. We have a cluster with 3 types of gpu configured in gres. Users can successfully
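To illustrate the point (node names, GPU types, and device paths below are made up), the same gres.conf would need to be present on every node, e.g.:

    # gres.conf, identical on all GPU nodes (placeholders throughout)
    NodeName=gpu[01-04] Name=gpu Type=tesla  File=/dev/nvidia[0-1]
    NodeName=gpu[05-06] Name=gpu Type=kepler File=/dev/nvidia0
    NodeName=gpu[07-08] Name=gpu Type=pascal File=/dev/nvidia0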

Re: [slurm-users] Array job execution trouble: some jobs in the array fail

2019-01-11 Thread Jean-mathieu CHANTREIN
You don't put any limitation on your master nodes? Answering my own question: I only had to change the PropagateResourceLimits variable in slurm.conf to NONE. This is not a problem since I enable cgroups directly on each of the compute nodes. Regards. Jean-Mathieu > From: "Jean-Mathieu
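For the archives, the change described above is a single line in slurm.conf:

    # slurm.conf
    PropagateResourceLimits=NONE   # do not copy the submit host's ulimits to jobs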

Re: [slurm-users] Array job execution trouble: some jobs in the array fail

2019-01-11 Thread Jean-mathieu CHANTREIN
Hello Jeffrey. That's exactly it. Thank you very much; I would not have thought of that. I had actually set an nproc limit of 20 in /etc/security/limits.conf to avoid potential misuse by some users. I had not imagined for one second that it could propagate to the compute nodes! You
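The offending entry would have looked roughly like this (the exact line is an assumption based on the description above):

    # /etc/security/limits.conf on the login/submit node (illustrative)
    *    hard    nproc    20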

Re: [slurm-users] Array job execution trouble: some jobs in the array fail

2019-01-11 Thread Jeffrey Frey
What does ulimit tell you on the compute node(s) where the jobs are running? The error message you cited arises when a user has reached the per-user process count limit (e.g. "ulimit -u"). If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. oversubscribe),
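One quick way to check the limit as a job actually sees it, rather than on the login node, is something like:

    # print the per-user process limit from inside a job on a compute node
    srun -N 1 bash -c 'ulimit -u'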

[slurm-users] GPU gres error for 1 of 3 GPU types

2019-01-11 Thread Sean McGrath
Hi everyone, your help with this would be much appreciated. We have a cluster with 3 types of GPU configured in gres. Users can successfully request 2 of the GPU types, but the third errors when requested. Here is the successful salloc behaviour: root@boole01:/etc/slurm # salloc
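(The salloc line itself is truncated above; a typed GPU request generally takes the form below, with a placeholder type name.)

    # request one GPU of a specific gres type; 'tesla' is a placeholder
    salloc -N 1 --gres=gpu:tesla:1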

[slurm-users] Array job execution trouble: some jobs in the array fail

2019-01-11 Thread Jean-mathieu CHANTREIN
Hello, I'm new to Slurm (I used SGE before) and I'm new to this list. I am having some difficulties with Slurm array jobs; maybe you can help me? I am working with Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare. For my current user, I have the following
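For context, a minimal array job of the kind described might look like this (the script is assumed, not taken from the original post):

    #!/bin/bash
    #SBATCH --job-name=array_test
    #SBATCH --array=1-100          # 100 independent array tasks
    #SBATCH --ntasks=1
    #SBATCH --mem=1G
    #SBATCH --time=00:10:00

    echo "Running array task ${SLURM_ARRAY_TASK_ID} on $(hostname)"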

Re: [slurm-users] slurm, memory accounting and memory mapping

2019-01-11 Thread Bjørn-Helge Mevik
Sergey Koposov writes: > The trick is that my code uses memory mapping (i.e. mmap) of one > single large file (~12 Gb) in each thread on each node. > With this technique in the past despite the fact the file is > (read-only) mmaped in say 16 threads, the actual memory footprint was > still ~ 12

Re: [slurm-users] slurm, memory accounting and memory mapping

2019-01-11 Thread Janne Blomqvist
On 11/01/2019 08.29, Sergey Koposov wrote: > Hi, > > I've recently migrated to slurm from pbs on our cluster. Because of that, now > the job memory limits are > strictly enforced and that causes my code to get killed. > The trick is that my code uses memory mapping (i.e. mmap) of one single large
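One option that is sometimes suggested for mmap-heavy codes (a hedged sketch, not necessarily what the full reply recommended) is to change how the accounting plugin counts shared pages:

    # slurm.conf (illustrative; check the slurm.conf man page for your version)
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherParams=UsePss       # account proportional set size instead of RSS
    # or: JobAcctGatherParams=NoShared   # exclude shared pages entirely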