[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Will Dennis
Yes - I anonymize certain details of what I throw up on paste sites... that's 
one of those :)

-Original Message-
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de] 
Sent: Friday, March 17, 2017 9:55 AM
To: slurm-dev
Subject: [slurm-dev] RE: MaxJobs on association not being respected


Re hi,

On 2017-03-17 03:01, Will Dennis wrote:
> My slurm.conf:
> https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYh
> yRLivL9gydE=/raw
> 
>> Are you sure the current running config is the one in the file?
>> Did you double check via "scontrol show config"
> 
> Yes, all params set in slurm.conf are showing correctly.

the sacctmgr output from your first mail ("ml-cluster") doesn't fit the 
slurm.conf you provided ("test-cluster"). Can you clarify that?

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Benjamin Redling

Re hi,

On 2017-03-17 03:01, Will Dennis wrote:
> My slurm.conf:
> https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw
> 
>> Are you sure the current running config is the one in the file?
>> Did you double check via "scontrol show config"
> 
> Yes, all params set in slurm.conf are showing correctly.

the sacctmgr output from your first mail ("ml-cluster") doesn't fit the
slurm.conf you provided ("test-cluster"). Can you clarify that?

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] RE: MaxJobs on association not being respected

2017-03-16 Thread Will Dennis
My slurm.conf:
https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw

>Are you sure the current running config is the one in the file?
>Did you double check via "scontrol show config"

Yes, all params set in slurm.conf are showing correctly.
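
For reference, the double-check looks roughly like this (the file path assumes 
a standard install; adjust for your layout):

# what the running slurmctld is actually using
scontrol show config | grep -i AccountingStorageEnforce

# what is in the config file it was started from
grep -i AccountingStorageEnforce /etc/slurm/slurm.conf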

Thanks!
Will

-Original Message-
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de] 
Sent: Thursday, March 16, 2017 7:54 PM
To: slurm-dev
Subject: [slurm-dev] RE: MaxJobs on association not being respected


Hello Will,

in case you didn't make any progress in the meantime:
are you allowed to post the full slurm.conf of the test setup?
That would be nice, just to make sure nobody misses a seemingly irrelevant 
part. Skimming your posts didn't reveal any obvious flaws to me in the parts 
you provided.

Are you sure the current running config is the one in the file?
Did you double check via "scontrol show config"

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] RE: MaxJobs on association not being respected

2017-03-16 Thread Benjamin Redling

Hello Will,

On 2017-03-15 18:13, Will Dennis wrote:
> Here are their definitions in slurm.conf:
> 
> # PARTITIONS
> PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 
> DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off 
> State=UP
> PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 
> DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off 
> State=UP
> PartitionName=scavenger Nodes=[nodelist] Default=NO DefMemPerCPU=2048 
> DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=10 PreemptMode=requeue 
> State=UP
> 
> Considering the ‘long’ partition, what is the best way to set up limits of 
> how many jobs can be submitted to it concurrently by a user, or how to limit 
> number of CPUs used? 
> 
> As can be seen from my prior post, we are utilizing job accounting via 
> slurmdbd.

in case you didn't make any progress in the meantime:
are you allowed to post the full slurm.conf of the test setup?
That would be nice, just to make sure nobody misses a seemingly irrelevant 
part. Skimming your posts didn't reveal any obvious flaws to me in the parts 
you provided.

Are you sure the current running config is the one in the file?
Did you double check via "scontrol show config"

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] RE: MaxJobs on association not being respected

2017-03-15 Thread Will Dennis
Hi again,

Let me back up and explain what we are trying to do, maybe there’s a better way 
to do it...

We have three partitions set up in Slurm currently:

- ‘batch’ :  the regular everyday partition folks use to submit jobs; it is set 
as the default partition, and has a 5-hour maximum job runtime limit.
- ‘long’ :  designed for long-running jobs; there is no max job time limit set, 
but we want to restrict how many jobs (and/or perhaps how many CPUs) a given 
user can run (use) in it concurrently.
- ‘scavenger’ :  designed for low-priority (most probably long-running) jobs; 
there is no max job time limit set, but any job submitted to the prior two 
partitions that needs resources currently held by scavenger jobs should “bump” 
those jobs, which then go back into the queue to be re-run (see the preemption 
sketch after the partition definitions below).

Here are their definitions in slurm.conf:

# PARTITIONS
PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off State=UP
PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off State=UP
PartitionName=scavenger Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=10 PreemptMode=requeue State=UP
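
For the scavenger “bump” behaviour described above, the cluster-level 
preemption settings also come into play; roughly the following (a sketch only, 
not our literal config):

# slurm.conf (illustrative excerpt)
# Preempt based on partition PriorityTier; the per-partition PreemptMode
# values above then decide what happens to the preempted jobs.
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE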

Considering the ‘long’ partition, what is the best way to limit how many jobs a 
given user can run in it concurrently, or to limit the number of CPUs they use?

As can be seen from my prior post, we are utilizing job accounting via slurmdbd.
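
For concreteness, the kind of per-user limit in question would be set on the 
association along these lines (a sketch; ‘ml’ is our account, but the user name 
and CPU count here are only illustrative):

# allow at most one running job in 'long' for this user
sacctmgr modify user where name=alex account=ml partition=long set MaxJobs=1

# and/or cap the total CPUs that user's running jobs in 'long' may hold at once
sacctmgr modify user where name=alex account=ml partition=long set GrpTRES=cpu=16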

Thanks,
Will



From: Will Dennis 
Sent: Friday, March 10, 2017 1:56 PM
To: slurm-dev
Cc: Lyn Gerner
Subject: Re: [slurm-dev] MaxJobs on association not being respected

I currently have this set in slurm.conf as:

AccountingStorageEnforce=limits


On Mar 10, 2017, at 1:53 PM, Lyn Gerner <schedulerqu...@gmail.com> wrote:

Hey Will,

Check to make sure you have selected the correct value for 
AccountingStorageEnforce. Sounds like it may be that.

Best of luck,
Lyn

-- Forwarded message --
From: Will Dennis <wden...@nec-labs.com>
Date: Fri, Mar 10, 2017 at 8:30 AM
Subject: [slurm-dev] MaxJobs on association not being respected
To: slurm-dev <slurm-dev@schedmd.com>

Hi all,

Generally new to Slurm here, so please forgive any ignorance...

We have a test cluster (three compute nodes) running Slurm 16.05.4 in 
operation, with the ‘multifactor’ scheduler in use. We have set up slurmdb, and 
have set up associations for the users on partitions of the cluster, as follows:

[root@ml43 ~]# sacctmgr show associations

   Cluster  Account   User  Partition  Share  MaxJobs     QOS
----------  -------  -----  ---------  -----  -------  ------
ml-cluster     root                        1           normal
ml-cluster     root   root                 1           normal
ml-cluster       ml                        1           normal
ml-cluster       ml   alex  scavenger      1           normal
ml-cluster       ml   alex      batch      1           normal
ml-cluster       ml   alex       long      1        1  normal
ml-cluster       ml   iain  scavenger      1           normal
ml-cluster       ml   iain      batch      1           normal
ml-cluster       ml   iain       long      1           normal

[slurm-dev] Re: MaxJobs on association not being respected

2017-03-10 Thread Will Dennis
I currently have this set in slurm.conf as:

AccountingStorageEnforce=limits


On Mar 10, 2017, at 1:53 PM, Lyn Gerner <schedulerqu...@gmail.com> wrote:

Hey Will,

Check to make sure you have selected the correct value for 
AccountingStorageEnforce. Sounds like it may be that.
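
(For reference, a minimal slurm.conf sketch; per the slurm.conf docs, "limits" 
implies "associations", and "qos"/"safe" can be added for stricter enforcement:)

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits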

Best of luck,
Lyn

-- Forwarded message --
From: Will Dennis <wden...@nec-labs.com>
Date: Fri, Mar 10, 2017 at 8:30 AM
Subject: [slurm-dev] MaxJobs on association not being respected
To: slurm-dev <slurm-dev@schedmd.com>

Hi all,

Generally new to Slurm here, so please forgive any ignorance...

We have a test cluster (three compute nodes) running Slurm 16.05.4 in 
operation, with the ‘multifactor’ scheduler in use. We have set up slurmdb, and 
have set up associations for the users on partitions of the cluster, as follows:

[root@ml43 ~]# sacctmgr show associations

   Cluster  Account   User  Partition  Share  MaxJobs     QOS
----------  -------  -----  ---------  -----  -------  ------
ml-cluster     root                        1           normal
ml-cluster     root   root                 1           normal
ml-cluster       ml                        1           normal
ml-cluster       ml   alex  scavenger      1           normal
ml-cluster       ml   alex      batch      1           normal
ml-cluster       ml   alex       long      1        1  normal
ml-cluster       ml   iain  scavenger      1           normal
ml-cluster       ml   iain      batch      1           normal
ml-cluster       ml   iain       long      1           normal


As you may notice, we have set up a “MaxJobs” limit of “1” for the ‘alex’ user 
on the ‘long’ partition. What we want to do is enforce a maximum of one job 
running at a time per user for the ‘long’ partition. However, when the user 
‘alex’ submitted a number of jobs to this partition, all of them ran:

[root@ml43 ~]# squeue
 JOBID PARTITION     NAME  USER ST   TIME  NODES NODELIST(REASON)
   324      long   tmp.sh  alex PD   0:00      1 (Resources)
   321      long   tmp.sh  alex  R   1:56      1 ml46
   323      long   tmp.sh  alex  R   0:33      1 ml53
   322      long   tmp.sh  alex  R   0:36      1 ml48
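
(As a sanity check, the MaxJobs value can be read back off the association 
itself; a sketch of the query, limited to the relevant columns:)

sacctmgr show assoc where user=alex partition=long \
    format=Cluster,Account,User,Partition,MaxJobs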

From the output of “sshare” we verified the right queue got the jobs:

[root@ml43 ~]# sshare -am
             Account       User  Partition  RawShares  NormShares  RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ---------- ----------- --------- ------------- ----------
root                                                         1.00      7977          1.00       0.50
 root                      root                     1        0.50         0          0.00       1.00
 ml                                                 1        0.50      7977          1.00       0.25
  ml                       alex  scavenger          1        0.08         0          0.17       0.25
  ml                       alex      batch          1        0.08         0          0.17       0.25
  ml                       alex       long          1        0.08      7977          1.00   0.000244
  ml