Re: [slurm-users] epilog when job is killed for max time

2018-11-08 Thread Noam Bernstein
Thanks - that's an awesome, yet horrible, hack :) Noam

Re: [slurm-users] epilog when job is killed for max time

2018-11-08 Thread Josep Manel Andrés Moscardó
Hi, Somebody else gave me this piece of code (I hope he doesn't mind me sharing it :) , at least it is how they do it:

#!/bin/bash
#SBATCH --signal=B:USR1@300 #<-- This will make Slurm send signal USR1 to the bash process 300 seconds before the time limit
#SBATCH -t 00:06:00
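The snippet above is cut off before it shows what the script does with the signal. A minimal sketch of one way to react to it, assuming the batch script installs a trap for USR1: the handler name on_time_limit, the WORK_SECONDS variable and the sleep workload are placeholders for illustration, not part of the original post.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@300
#SBATCH -t 00:06:00

# Hypothetical handler: replace the body with real cleanup
# (copy results off scratch, write a checkpoint, ...).
on_time_limit() {
    echo "USR1 received: cleaning up before the time limit"
}
trap on_time_limit USR1

# Placeholder workload (assumption). It runs in the background and the
# script blocks in `wait`, because the shell only runs a trap once the
# current foreground command returns, whereas `wait` is interrupted as
# soon as the signal arrives.
sleep "${WORK_SECONDS:-0}" &
wait
```

Note that after the handler runs, `wait` returns and the script carries on to its end; whether the workload should be stopped, or waited for again, is left to the real script.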

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Miguel A. Sánchez
Hi, and thanks for all your answers; sorry for the delay in replying. Yesterday I installed Slurm-18.08.3 on the controller machine to check whether the seff command works with this latest release. The behavior has improved, but I still receive an error message: #

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Marcus Wagner
Hi Miguel, this is because SchedMD changed the stats field. rss_max no longer exists (cf. line 225 of seff). You need to evaluate the field stats{tres_usage_in_max} and take the value after '2='; this is the memory value in bytes instead of kbytes, so it should be divided by 1024
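The extraction described above can be sketched in shell. This is only an illustration, not the actual seff fix (seff is a Perl script): the function name tres_mem_kb and the sample string are made up, and the comma-separated id=value layout of tres_usage_in_max is assumed.

```shell
#!/bin/sh
# tres_mem_kb: print the value after "2=" (memory, in bytes) from a
# TRES usage string, converted to kbytes. Returns non-zero if the
# string has no "2=" entry.
tres_mem_kb() {
    tres_string=$1
    # One id=value pair per line, then keep only the value of id 2.
    # Anchoring on ^2= avoids matching ids such as 12.
    bytes=$(printf '%s\n' "$tres_string" | tr ',' '\n' | sed -n 's/^2=//p')
    [ -n "$bytes" ] || return 1
    echo $(( bytes / 1024 ))
}

# Hypothetical sample value, for illustration only.
tres_mem_kb "1=1200,2=2097152,6=0"
```

With the sample string this prints 2048, since 2097152 bytes / 1024 = 2048 kbytes.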

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Brian Andrus
We use sssd with realmd; enumeration is off. Brian Andrus

[slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Brian Andrus
All, I am seeing what looks like the same issue as https://bugs.schedmd.com/show_bug.cgi?id=2119, where slurmctld is not picking up new accounts unless it is restarted. I have 4 clusters (non-federated), all using the same slurmdbd. When I added an association for user name=me cluster=DevOps

Re: [slurm-users] bug 2119 with slurm 18.08.2

2018-11-08 Thread Marcin Stolarek
I have had a very similar issue for quite some time and have been unable to find its root cause. Are you using sssd with AD as a data source, with only a subtree of entries searched? That is my case. Did you disable user enumeration? That is also what I have. I didn't find any evidence that it's related, but...

[slurm-users] virtual memory limit exceeded

2018-11-08 Thread Noam Bernstein
Can anyone shed some light on where the _virtual_ memory limit comes from? We're getting jobs killed with the message slurmstepd: error: Step 3664.0 exceeded virtual memory limit (79348101120 > 72638634393), being killed Is this a limit that's dictated by cgroup.conf or by some srun option

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-08 Thread Paddy Doyle
Hi all, It looks like we can use the API to avoid having to manually parse the '2=' value from the stats{tres_usage_in_max} value. I've submitted a bug report and patch: https://bugs.schedmd.com/show_bug.cgi?id=6004 The minimal changes needed would be in the attached seff.patch. Hope that