[slurm-dev] RE: MaxJobs on association not being respected
Yes - I anonymize certain details of what I throw up on paste sites... that's one of those :)

-----Original Message-----
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de]
Sent: Friday, March 17, 2017 9:55 AM
To: slurm-dev
Subject: [slurm-dev] RE: MaxJobs on association not being respected
[slurm-dev] RE: MaxJobs on association not being respected
Re hi,

On 2017-03-17 03:01, Will Dennis wrote:
> My slurm.conf:
> https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw
>
>> Are you sure the current running config is the one in the file?
>> Did you double check via "scontrol show config"
>
> Yes, all params set in slurm.conf are showing correctly.

The sacctmgr output from your first mail ("ml-cluster") doesn't fit the slurm.conf you provided ("test-cluster"). Can you clarify that?

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
[slurm-dev] RE: MaxJobs on association not being respected
My slurm.conf:
https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw

> Are you sure the current running config is the one in the file?
> Did you double check via "scontrol show config"

Yes, all params set in slurm.conf are showing correctly.

Thanks!
Will

-----Original Message-----
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de]
Sent: Thursday, March 16, 2017 7:54 PM
To: slurm-dev
Subject: [slurm-dev] RE: MaxJobs on association not being respected
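For readers following along, the double-check suggested above can be done like this (a sketch; assumes shell access on the controller host of a running Slurm cluster):

```
# Show the value slurmctld is actually using, as opposed to what is in the file:
scontrol show config | grep AccountingStorageEnforce

# If slurm.conf was edited after the daemons started, make them re-read it:
scontrol reconfigure
```

Note that some slurm.conf changes require a daemon restart rather than a reconfigure; when in doubt, restart slurmctld.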
[slurm-dev] RE: MaxJobs on association not being respected
Hello Will,

On 2017-03-15 18:13, Will Dennis wrote:
> Here are their definitions in slurm.conf:
>
> # PARTITIONS
> PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off State=UP
> PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off State=UP
> PartitionName=scavenger Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=10 PreemptMode=requeue State=UP
>
> Considering the 'long' partition, what is the best way to set up limits on how many jobs can be submitted to it concurrently by a user, or how to limit the number of CPUs used?
>
> As can be seen from my prior post, we are utilizing job accounting via slurmdbd.

In case you didn't make any progress in the meantime: are you allowed to post the full slurm.conf of the test setup? Would be nice, just to make sure nobody misses a seemingly irrelevant part. Skimming your posts didn't reveal to me any obvious flaws in the parts you provided.

Are you sure the current running config is the one in the file? Did you double check via "scontrol show config"

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
[slurm-dev] RE: MaxJobs on association not being respected
Hi again,

Let me back up and explain what we are trying to do; maybe there's a better way to do it...

We have three partitions set up in Slurm currently:

- 'batch': the regular everyday partition folks can use to submit jobs; it is set as the default partition, and has a 5-hour maximum job runtime limit.
- 'long': designed for long-running jobs; there is no max job time limit set, but we want to restrict how many jobs (and/or maybe CPUs) a given user can run (use) concurrently.
- 'scavenger': designed for low-priority (most probably long-running) jobs; there is no max job time limit set, but any job submitted to the prior two partitions that needs resources in use by the scavenger partition should "bump" the scavenger jobs, which will go back into the queue to be re-run.

Here are their definitions in slurm.conf:

# PARTITIONS
PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off State=UP
PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off State=UP
PartitionName=scavenger Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=10 PreemptMode=requeue State=UP

Considering the 'long' partition, what is the best way to set up limits on how many jobs can be submitted to it concurrently by a user, or how to limit the number of CPUs used?

As can be seen from my prior post, we are utilizing job accounting via slurmdbd.
Thanks,
Will

From: Will Dennis
Sent: Friday, March 10, 2017 1:56 PM
To: slurm-dev
Cc: Lyn Gerner
Subject: Re: [slurm-dev] MaxJobs on association not being respected
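Two common ways to express the per-user limits asked about above (a sketch only; the QOS name "longlimit" and the cpu=8 figure are hypothetical placeholders, and commands assume a slurmdbd-backed cluster):

```
# Option A: per-association limits, as already attempted in this thread
# (MaxJobs caps running jobs, MaxTRESPerJob caps CPUs per job):
sacctmgr modify user where name=alex partition=long set MaxJobs=1 MaxTRESPerJob=cpu=8

# Option B: a QOS attached to the partition, so the limit applies to
# every user of 'long' without touching each association:
sacctmgr add qos longlimit
sacctmgr modify qos longlimit set MaxJobsPerUser=1 MaxTRESPerUser=cpu=8
# then in slurm.conf:
#   PartitionName=long ... QOS=longlimit
# and make sure QOS limits are enforced:
#   AccountingStorageEnforce=limits,qos
```

The partition-QOS route scales better when the limit should apply uniformly to all users rather than being maintained association by association.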
[slurm-dev] Re: MaxJobs on association not being respected
I currently have this set in slurm.conf as:

AccountingStorageEnforce=limits

On Mar 10, 2017, at 1:53 PM, Lyn Gerner <schedulerqu...@gmail.com> wrote:

> Hey Will,
>
> Check to make sure you have selected the correct value for AccountingStorageEnforce. Sounds like it may be that.
>
> Best of luck,
> Lyn

---------- Forwarded message ----------
From: Will Dennis <wden...@nec-labs.com>
Date: Fri, Mar 10, 2017 at 8:30 AM
Subject: [slurm-dev] MaxJobs on association not being respected
To: slurm-dev <slurm-dev@schedmd.com>

Hi all,

Generally new to Slurm here, so please forgive any ignorance...

We have a test cluster (three compute nodes) running Slurm 16.05.4 in operation, with the 'multifactor' scheduler in use. We have set up slurmdb, and have set up associations for the users on partitions of the cluster, as follows:

[root@ml43 ~]# sacctmgr show associations
   Cluster  Account  User   Partition  Share  MaxJobs  QOS
---------- -------- ------ ---------- ------ -------- -------
ml-cluster  root                           1           normal
ml-cluster  root     root                  1           normal
ml-cluster  ml                             1           normal
ml-cluster  ml       alex   scavenger      1           normal
ml-cluster  ml       alex   batch          1           normal
ml-cluster  ml       alex   long           1        1  normal
ml-cluster  ml       iain   scavenger      1           normal
ml-cluster  ml       iain   batch          1           normal
ml-cluster  ml       iain   long           1           normal
(columns that were empty for every row - GrpJobs, GrpTRES, GrpSubmit, GrpWall, GrpTRESMins, MaxTRES, MaxTRESPerNode, MaxSubmit, MaxWall, MaxTRESMins, Def QOS, GrpTRESRunMin - omitted here)

As you may notice, we have set up a "MaxJobs" limit of "1" for the 'alex' user on the 'long' partition. What we want to do is enforce a maximum of one job running at a time per user for the 'long' partition.
However, when the user 'alex' submitted a number of jobs to this partition, all of them ran:

[root@ml43 ~]# squeue
 JOBID PARTITION    NAME  USER ST  TIME NODES NODELIST(REASON)
   324      long  tmp.sh  alex PD  0:00     1 (Resources)
   321      long  tmp.sh  alex  R  1:56     1 ml46
   323      long  tmp.sh  alex  R  0:33     1 ml53
   322      long  tmp.sh  alex  R  0:36     1 ml48

From the output of "sshare" we verified the right queue got the job:

[root@ml43 ~]# sshare -am
Account   User  Partition  RawShares  NormShares  RawUsage  EffectvUsage  FairShare
root                                        1.00      7977          1.00       0.50
 root     root                        1     0.50         0          0.00       1.00
 ml                                   1     0.50      7977          1.00       0.25
  ml      alex  scavenger             1     0.08         0          0.17       0.25
  ml      alex  batch                 1     0.08         0          0.17       0.25
  ml      alex  long                  1     0.08      7977          1.00   0.000244
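One way to narrow down why a limit like this is ignored is to confirm both that the association actually carries it and that enforcement is switched on (a sketch; assumes the cluster from this thread):

```
# Confirm the association really stores MaxJobs=1 for alex on 'long':
sacctmgr show assoc where user=alex partition=long \
    format=cluster,account,user,partition,maxjobs

# Confirm limit enforcement is active in the *running* config,
# not just in the slurm.conf file on disk:
scontrol show config | grep AccountingStorageEnforce
```

If the running config differs from the file, a "scontrol reconfigure" (or a slurmctld restart) is needed before association limits take effect.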