Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault wrote: > My problem with that, though, is: what if each script (the 9 scripts in my earlier example) has different requirements? For example, running on a different partition, or setting a different time limit? My understanding
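One way to give each script its own partition and time limit is to submit a separate array per script from a small wrapper; a rough sketch (the script names, partitions, and limits below are only placeholders):

    #!/bin/bash
    # Submit each test script as its own job array with its own resources.
    sbatch --array=1-20 --partition=gpu      --time=02:00:00 test_resnet.sh
    sbatch --array=1-20 --partition=gpu-long --time=12:00:00 test_lstm.sh
    sbatch --array=1-50 --partition=cpu      --time=00:30:00 test_parser.sh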

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Thank you Paul. If the admins do agree to create various QOS job limits or GPU limits (e.g. 5,10,15,20,...) then that could be a powerful solution. This would allow me to use job arrays. I still prefer a user-side solution if possible, because I'd like my script to be cluster-agnostic as much as
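If the admins do create such a ladder of QOSes, the setup on their side might look roughly like this (the QOS name and limits are only illustrative):

    # Create a QOS capping GPUs and jobs per user, then submit against it.
    sacctmgr add qos gpu10
    sacctmgr modify qos gpu10 set MaxTRESPerUser=gres/gpu=10 MaxJobsPerUser=10
    sbatch --qos=gpu10 --array=1-100 my_tests.sh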

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon
Yes, QoS's are dynamic. -Paul Edmon- On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote: Hi Paul, Thanks for your pointers. I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as expressed to Peter in the reply I just sent, I wonder if the QOS of a job can
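For what it's worth, the QOS of a pending job can be changed with scontrol, subject to the user having access to the target QOS (the job ID and QOS name here are placeholders):

    scontrol update jobid=12345 qos=gpu10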

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Paul, Thanks for your pointers. I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as expressed to Peter in the reply I just sent, I wonder if the QOS of a job can be changed while it's pending (submitted but not yet running). Regards, Guillaume. On Fri, Aug 30,

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Steven, Those both sound like potentially good solutions. So basically, you're saying that if I script it properly, I can use a single job array to launch multiple scripts by using a master sbatch script. My problem with that, though, is: what if each script (the 9 scripts in my earlier
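A master array script that fans out to several test scripts might look something like this sketch (the script names and the index split are invented):

    #!/bin/bash
    #SBATCH --array=1-30
    # Map each array index onto one of the underlying test scripts.
    case $(( (SLURM_ARRAY_TASK_ID - 1) / 10 )) in
        0) bash test_a.sh "$SLURM_ARRAY_TASK_ID" ;;
        1) bash test_b.sh "$SLURM_ARRAY_TASK_ID" ;;
        2) bash test_c.sh "$SLURM_ARRAY_TASK_ID" ;;
    esac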

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Brian Andrus
After you restart slurmctld, do "scontrol reconfigure". Brian Andrus On 8/30/2019 6:57 AM, Robert Kudyba wrote: I had set RealMemory to a really high number as I misinterpreted the recommendation. NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1 But now I
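The usual sequence after editing slurm.conf, assuming a systemd-managed slurmctld, is roughly:

    systemctl restart slurmctld
    scontrol reconfigure    # have the running daemons re-read the configuration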

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon
A QoS is probably your best bet. Another variant might be MCS, which you can use to help reduce resource fragmentation. For limits, though, QoS will be your best bet. -Paul Edmon- On 8/30/19 7:33 AM, Steven Dick wrote: It would still be possible to use job arrays in this situation, it's just

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-30 Thread Robert Kudyba
I had set RealMemory to a really high number as I misinterpreted the recommendation. NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1 But now I set it to: RealMemory=191000 I restarted slurmctld. And according to the Bright Cluster support team: "Unless it
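RealMemory in slurm.conf is given in megabytes, and running slurmd -C on the node prints the hardware Slurm actually detects, which is a safe basis for the value. A sketch of the check and the corrected line:

    # On the compute node, print detected hardware (including RealMemory in MB):
    slurmd -C
    # slurm.conf entry using a value just under the detected memory:
    NodeName=node[001-003] CoresPerSocket=12 RealMemory=191000 Sockets=2 Gres=gpu:1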

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Steven Dick
It would still be possible to use job arrays in this situation; it's just slightly messy. The way a job array works is that you submit a single script, and that script is provided an integer for each subjob. The integer is in a range, with a possible step (default=1). To run the situation you
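In concrete terms, the range and step go on the --array option and each subjob reads its own index from SLURM_ARRAY_TASK_ID; a minimal sketch (the script name is a placeholder):

    # 10 subjobs with indices 0, 10, 20, ..., 90
    sbatch --array=0-90:10 run_case.sh
    # inside run_case.sh:
    echo "running case $SLURM_ARRAY_TASK_ID"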

[slurm-users] Usage splitting

2019-08-30 Thread Stefan Staeglich
Hi, we have some compute nodes paid for by different project owners. 10% are owned by project A and 90% are owned by project B. We want to implement the following policy over every given time period (e.g. two weeks): - Project A doesn't use more than 10% of the cluster in this time period -
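One common starting point is fairshare weights on the two accounts plus a usage decay window roughly matching the two-week period; a sketch (account names are placeholders, and fairshare alone gives soft priorities rather than a hard cap):

    # Give project A 10% and project B 90% of the shares:
    sacctmgr add account projA fairshare=10
    sacctmgr add account projB fairshare=90
    # In slurm.conf, let recorded usage decay on roughly a two-week horizon:
    PriorityDecayHalfLife=7-0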

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Guillaume Perrault Archambault
Hi Steven, Thanks for taking the time to reply to my post. Setting a limit on the number of jobs for a single array isn't sufficient, because regression tests need to launch multiple arrays, and I would need a job limit that takes effect across all launched jobs. It's very possible I'm not
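A purely user-side throttle across all submissions can simply count the user's queued jobs before each sbatch call; a rough sketch (the cap and sleep interval are arbitrary):

    #!/bin/bash
    MAX_JOBS=50
    submit_throttled() {
        # Wait until the user's total job count (running + pending) drops below the cap.
        while [ "$(squeue -h -u "$USER" | wc -l)" -ge "$MAX_JOBS" ]; do
            sleep 60
        done
        sbatch "$@"
    }
    submit_throttled --array=1-100 test_a.sh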