Re: [slurm-users] 19.05.0 x11 in sbatch

2019-05-29 Thread Hidas, Dean
I certainly missed that. The documentation seems to still have the --x11 flag in it (below). Is there another way to use x11 via sbatch with similar behavior? This has affected some users and I'd like to find a similarly simple solution. The man page (19.05.0/share/man/man1/sbatch.1) and
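Per the release notes discussed in this thread, the rewritten X11 code in 19.05 only works with srun/salloc, so one possible workaround (a sketch, not a confirmed drop-in replacement for "#SBATCH --x11") is to start an interactive step from a login session that already has SSH X forwarding:

    # assumes ssh -X/-Y to the login node; uses srun's --x11 and --pty options
    srun --x11 --pty xterm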

Re: [slurm-users] 19.05.0 x11 in sbatch

2019-05-29 Thread Sean Crosby
Hi Dean, On Thu, 30 May 2019 at 07:30, Hidas, Dean <dhi...@bnl.gov> wrote: Is there any idea what I might have missed? The release notes say that the new X11 code will not work with sbatch - https://github.com/SchedMD/slurm/blob/master/RELEASE_NOTES NOTE: The X11 forwarding code has

[slurm-users] 19.05.0 x11 in sbatch

2019-05-29 Thread Hidas, Dean
Hello, I recently upgraded from 18.08.7 to 19.05.0 and at the moment it seems that the following is not accepted in an sbatch script, although it was working fine for us under 18.x: #SBATCH --x11=all (or batch, first, last). sbatch exits with the following message: sbatch: unrecognized option

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alex Chekholko
I think this error usually means that your node cn7 has either the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf. E.g. try 'srun --nodelist=cn7 ping -c 1 cn7' On Wed, May 29, 2019 at 6:00 AM Alexander Åhman wrote: > Hi, > Have a very strange problem. The cluster has been working
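A few generic follow-up checks in the same spirit (hostnames taken from this thread; the commands are standard Slurm/Linux tools, not anything site-specific):

    srun --nodelist=cn7 getent hosts cn7    # how cn7 resolves its own name
    getent hosts cn7                        # how the login host resolves it
    scontrol show node cn7 | grep -i -E 'NodeAddr|NodeHostName'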

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Paul Edmon
I believe it is still the case, but I haven't tested it.  I put this in way back when partition_job_depth was first introduced (which was eons ago now).  We run about 100 or so partitions, so this has served us well as a general rule.  What happens is that if you set partition job depth too

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Christoph Brüning
Hi Chad, for us (also running slurm 17.11), the crucial point was the balance between PriorityWeightFairshare, PriorityWeightAge and PriorityMaxAge. We set the PriorityWeightAge high (higher than PriorityWeightFairshare, in fact), so that even a job by some power user will eventually be the
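A minimal illustrative slurm.conf fragment for that balance (values invented for the example, not this site's actual settings):

    PriorityType=priority/multifactor
    PriorityMaxAge=14-0            # age factor saturates after 14 days
    PriorityWeightAge=10000        # deliberately higher than the fairshare weight
    PriorityWeightFairshare=5000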

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Kilian Cavalotti
Hi Paul, I'm wondering about this part in your SchedulerParameters: ### default_queue_depth should be some multiple of the partition_job_depth, ### ideally number_of_partitions * partition_job_depth, but typically the main ### loop exits prematurely if you go over about 400. A
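Expressed as configuration, the rule of thumb under discussion would look roughly like this (numbers invented for illustration):

    # e.g. ~100 partitions, each scheduled 4 deep per main-loop pass:
    SchedulerParameters=default_queue_depth=400,partition_job_depth=4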

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alexander Åhman
I have tried to find a network error but can't see anything. Every node I've tested has the same (and correct) view of things. _On node cn7:_ (the problematic one) em1: link/ether 50:9a:4c:79:31:4d inet 10.28.3.137/24 _On login machine:_ [alex@li1 ~]$ host cn7 cn7.ydesign.se has address

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Paul Edmon
For reference we are running 18.08.7 -Paul Edmon- On 5/29/19 10:39 AM, Paul Edmon wrote: Sure.  Here is what we have: ## Scheduling # ### This section is specific to scheduling ### Tells the scheduler to enforce limits for all

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Paul Edmon
Sure.  Here is what we have: ## Scheduling # ### This section is specific to scheduling ### Tells the scheduler to enforce limits for all partitions ### that a job submits to. EnforcePartLimits=ALL ### Let's slurm know that we have a
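For anyone who wants to compare against their own site without digging through slurm.conf, the live values can be dumped from the controller (generic commands, not part of the quoted config):

    scontrol show config | grep -E 'SchedulerType|SchedulerParameters|EnforcePartLimits'
    scontrol show config | grep -E '^Priority'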

[slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Julius, Chad
All, We rushed our Slurm install due to a short timeframe and missed some important items. We are now looking to implement a better system than the first-in, first-out scheduling we have now. My question: are the defaults listed in the slurm.conf file a good start? Would anyone be willing to share

Re: [slurm-users] Node weight / Job Preemption

2019-05-29 Thread Paul Edmon
I might look at these options: *preempt_reorder_count=#* Specify how many attempts should be made in reordering preemptable jobs to minimize the count of jobs preempted. The default value is 1. High values may adversely impact performance. The logic to support this option is only
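preempt_reorder_count is a SchedulerParameters option; setting it (together with the related preempt_strict_order, assuming that is the other option meant here) would look something like this in slurm.conf, with an illustrative value:

    SchedulerParameters=preempt_reorder_count=3,preempt_strict_order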

Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Matthew BETTINGER
Ok thanks we will look into that! Thought we were the only ones who had the problem, and yes, it's like Windows 98SE: you can try all you want but eventually we end up rebooting the nodes. Interns are starting to show up and you know they can bend a cluster in ways you've never seen before. We

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Ole Holm Nielsen
Hi Alexander, The error "can't find address for host cn7" would indicate a DNS problem. What is the output of "host cn7" from the srun host li1? How many network devices are in your subnet? It may be that the Linux kernel is doing "ARP cache thrashing" if the number of devices approaches
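If ARP cache thrashing turns out to be the issue, the usual generic mitigation (standard Linux sysctls, not something prescribed in this thread) is to raise the kernel's neighbour-table thresholds:

    # /etc/sysctl.d/90-arp.conf -- illustrative values
    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 16384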

[slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alexander Åhman
Hi, Have a very strange problem. The cluster has been working just fine until one node died, and now I can't submit jobs to 2 of the nodes using srun from the login machine. Using sbatch works just fine, and so does srun from the same host as slurmctld. All the other nodes work just fine

[slurm-users] Node weight / Job Preemption

2019-05-29 Thread Mike Harvey
I am relatively new to SLURM, and am having difficulty configuring our scheduling to behave as we'd like. Partition based job preemption is configured as follows: PreemptType=preempt/partition_prio PreemptMode=suspend,gang This has been working fine. However, we recently added an older
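For the node-weight half of the question, preference between newer and older hardware is normally expressed with Weight on the node definitions (a generic slurm.conf sketch with hypothetical node names; lower weight is allocated first):

    NodeName=node[01-10] Weight=1     # newer nodes, preferred
    NodeName=oldnode01   Weight=100   # older node, used only when nothing else fits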

Re: [slurm-users] Checking RawUsage against GrpTRESMins

2019-05-29 Thread Bjørn-Helge Mevik
Paddy Doyle writes: > Hi Jacob, > > On Tue, May 28, 2019 at 11:38:23AM -0400, Jacob Chappell wrote: > >> Hello all, >> >> Is it possible in Slurm to check RawUsage against GrpTRESMins and prevent a >> job from being submitted if the RawUsage exceeds the GrpTRESMins? My center >> needs this
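For anyone wanting to inspect the two values being compared, the usual commands are (generic examples; <account> is a placeholder):

    sshare -l -A <account>          # RawUsage column per association
    sacctmgr show assoc where account=<account> format=Account,User,GrpTRESMins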

Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Yair Yarom
Hi, Check the UnkillableStepProgram and UnkillableStepTimeout options in slurm.conf. We use them to drain the stuck nodes and mail us - as in your case, stuck processes usually require a reboot. Since the drained strigger will never get triggered, we also set a finished trigger for the next RUNNING job.
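A minimal sketch of that setup (the paths, timeout and jobid are illustrative, and the scripts named here are hypothetical):

    # slurm.conf
    UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh
    UnkillableStepTimeout=180

    # one-shot trigger on the next job that starts running
    strigger --set --jobid=12345 --fini --program=/usr/local/sbin/clear_drain.sh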