Re: [slurm-users] Best practice: How much node memory to specify in slurm.conf?

2018-01-16 Thread Marcin Stolarek
I think that it depends on your kernel and the way the cluster is booted (for instance initrd size). You can check the memory used by the kernel in the dmesg output - search for the line starting with "Memory:". This is fixed. It may also be a good idea to "reserve" some space for cache and buffers - check h
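
A hedged illustration of the check Marcin describes (the exact dmesg wording varies between kernel versions; the second command is only there for comparison):

    # How much memory the kernel reserved at boot
    dmesg | grep 'Memory:'
    # Compare with what userspace (and hence slurmd) actually sees
    free -m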

[slurm-users] Best practice: How much node memory to specify in slurm.conf?

2018-01-16 Thread Greg Wickham
We’re using cgroups to limit the memory of jobs, but in our slurm.conf the total node memory capacity is currently specified. Doing this, there could be times when physical memory is oversubscribed (physical allocation per job plus kernel memory requirements) and then swapping will occur. Is ther
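
One common way to keep that headroom, sketched here with made-up numbers rather than anything from this site: advertise slightly less than the physical total in slurm.conf and let the cgroup plugin enforce the per-job limits.

    # slurm.conf (illustrative values for a 16 GB node)
    TaskPlugin=task/cgroup
    NodeName=node[01-08] CPUs=16 RealMemory=15000 State=UNKNOWN

    # cgroup.conf
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0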

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread John DeSantis
Matthieu, > I would bet on something like LDAP requests taking too much time > because of a missing sssd cache. Good point! It's easy to forget to check something as "simple" as user look-up when something is taking "too long". John DeSantis On Tue, 16 Jan 2018 19:13:06 +0100 Matthieu Hautreux
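
A quick, hedged way to confirm that user look-up really is the slow part on the slurmctld host (the username is a placeholder, and sss_cache is only relevant where sssd is in use):

    sss_cache -E        # expire the sssd cache
    time id someuser    # first lookup goes to LDAP
    time id someuser    # second lookup should hit the cache; a large gap is telling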

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Gennaro Oliva
Ciao Elisabetta, On Tue, Jan 16, 2018 at 04:32:47PM +0100, Elisabetta Falivene wrote: > being again able to launch slurmctld on the master and slurmd on the nodes. great! > *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN* > to > *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=U
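
The likely reason for lowering the value: RealMemory in slurm.conf must not exceed what slurmd itself detects on the node, otherwise the node is flagged as having low memory. A hedged way to check the detected value on a compute node:

    slurmd -C
    # prints something like: NodeName=node01 CPUs=16 ... RealMemory=15999 ...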

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread Matthieu Hautreux
Hi, In this kind of issue, one good thing to do is to get a backtrace of slurmctld during the slowdown. You should thus easily identify the subcomponent responsible for the issue. I would bet on something like LDAP requests taking too much time because of a missing sssd cache. Regards Matthieu
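
A hedged sketch of grabbing such a backtrace without stopping the daemon (assumes gdb, and ideally the Slurm debug symbols, are installed on the controller):

    # Attach briefly to slurmctld and dump the backtrace of every thread
    gdb -batch -ex 'thread apply all bt' -p $(pidof slurmctld)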

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread John DeSantis
Ciao Alessandro, > setting MessageTimeout to 20 didn't solve it :( > > looking at slurmctld logs I noticed many warnings like these > > Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large > processing time from _slurm_rpc_dump_par
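
The quoted warning already hints at why 20 seconds was not enough: usec=42850604 is roughly 43 seconds, so the RPC still overruns the timeout. Raising MessageTimeout further (illustrative value below) only hides the slow RPC rather than fixing it.

    # slurm.conf
    MessageTimeout=60    # 42850604 usec ~= 42.9 s > 20 s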

Re: [slurm-users] ntpd or chrony?

2018-01-16 Thread Ryan Novosielski
On 01/14/2018 09:11 PM, Lachlan Musicman wrote: > Hi all, > > As part of both Munge and SLURM, time synchronised servers are > necessary. > > I keep finding chrony installed and running and ntpd stopped. I > turn chrony off and restart/enable ntpd bu
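
For completeness, a hedged sketch of pinning one time daemon on a systemd-based distribution so the other stops coming back; either daemon keeps clocks close enough for Munge, the point is to run exactly one of them.

    systemctl disable --now chronyd
    systemctl enable --now ntpd    # the unit may be called "ntp" on Debian/Ubuntu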

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Elisabetta Falivene
Here is the solution and another (minor) problem! Investigating in the direction of the pid problem, I found that the configuration had *SlurmctldPidFile=/var/run/slurmctld.pid* and *SlurmdPidFile=/var/run/slurmd.pid*, but the pid was searched for in /var/run/slurm-llnl, so I changed in the slurm.conf
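
In other words, the PidFile paths in slurm.conf have to match what the service units expect. A sketch of what the changed lines presumably look like, using the /var/run/slurm-llnl location mentioned above:

    # slurm.conf
    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid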

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Elisabetta Falivene
> It seems like the pidfiles in systemd and slurm.conf are different. Check > if they are the same and if not adjust the slurm.conf pid files. That > should prevent systemd from killing slurm. > Emh, sorry, how can I do this? > On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene, > wrote: > >> The de
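
A hedged way to do the comparison being asked about: print the PIDFile used by the systemd units and the PidFile settings in slurm.conf and check that they agree (the paths below are typical Debian locations and may differ elsewhere):

    grep -i pidfile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service
    grep -i pidfile /etc/slurm-llnl/slurm.conf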

Re: [slurm-users] Slurm not starting

2018-01-16 Thread Elisabetta Falivene
> > slurmd: debug2: _slurm_connect failed: Connection refused > > slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused > This sounds like the compute node cannot connect back to slurmctld on the management node; you should check that the IP address
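
A few hedged checks to run from a compute node for the "Connection refused" above; the address and port are the ones in the quoted log.

    # Which controller address and port does the node think it should use?
    scontrol show config | grep -iE 'ControlMachine|ControlAddr|SlurmctldPort'
    # Is anything actually listening there?
    nc -zv 192.168.1.1 6817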

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread Alessandro Federico
Hi, setting MessageTimeout to 20 didn't solve it :( looking at slurmctld logs I noticed many warnings like these: Jan 16 05:11:00 r000u17l01 slurmctld[22307]: Warning: Note very large processing time from _slurm_rpc_dump_partitions: usec=42850604 began=05:10:17.289 Jan 16 05:20:58 r000u17l01 slu
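
One hedged way to see which RPCs are piling up, and how long they take on average, is slurmctld's built-in statistics:

    sdiag            # per-RPC counts and average/total processing times
    sdiag --reset    # clear the counters, then sample again after a slowdown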

Re: [slurm-users] restrict application to a given partition

2018-01-16 Thread Juan A. Cordero Varelaq
I ended up with a simpler solution: I tweaked the program executable (a bash script) so that it inspects which partition it is running on and, if it's the wrong one, exits. I just added the following lines: if [ $SLURM_JOB_PARTITION == 'big' ]; then exit_code=126
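
The snippet is cut off by the archive; a minimal self-contained sketch of the same idea, keeping the partition name and exit code from the message and adding quoting so the test does not break when the variable is unset:

    #!/bin/bash
    # Refuse to run if this script was submitted to the restricted partition
    if [ "${SLURM_JOB_PARTITION:-}" == "big" ]; then
        echo "This application may not run in partition 'big'" >&2
        exit 126
    fi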

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread Alessandro Federico
Hi Trevor, thank you very much, we'll give it a try. ale - Original Message - > From: "Trevor Cooper" > To: "Slurm User Community List" > Sent: Tuesday, January 16, 2018 12:10:21 AM > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv > operation > > Alessandro, >

[slurm-users] Overriding job QOS limit for individual user?

2018-01-16 Thread Loris Bennett
Hi, Before we started using QOS for jobs, I could restrict the number of jobs for an individual user with, say, sacctmgr modify user where name=alice account=physics set maxjobs=1 However, now that we have configured QOS for jobs, if a user requests a QOS which has MaxJobs set to, say, 10, the abov
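
The preview is cut off, but one hedged workaround for this situation is to give the restricted user a dedicated QOS with a tighter per-user limit and make it the only QOS they can request (the names and the limit are illustrative):

    sacctmgr add qos alice_restricted
    sacctmgr modify qos alice_restricted set MaxJobsPerUser=1
    sacctmgr modify user where name=alice account=physics set qos=alice_restricted

Note that setting qos= replaces the user's allowed QOS list, so they can no longer submit under the shared QOS.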