Re: [slurm-users] Resource Limits

2023-04-20 Thread Jason Simms
Hello Ole and Hoot,

First, Hoot, thank you for your question. I've managed Slurm for a few
years now and still feel like I don't have a great understanding about
managing or limiting resources.

Ole, thanks for your continued support of the user community with your
documentation. I do wish not only that more of your information were
contained within the official docs, but also that there were even clearer
discussions around certain topics.

As an example, you write that "It is important to configure slurm.conf so
that the locked memory limit isn’t propagated to the batch jobs" by
setting PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether
you are suggesting that literally everyone should have that set, or whether
it only applies to certain configurations. We don't have it set, for
instance, but we've not run into trouble with jobs failing due to locked
memory errors.

Then, in the official docs, to which you link, it says that "it may also be
desirable to lock the slurmd daemon's memory to help ensure that it keeps
responding if memory swapping begins" by creating /etc/sysconfig/slurm
containing the line SLURMD_OPTIONS="-M". Would there ever be a reason *not*
to include that? That is, I can't think it would ever be desirable for
slurmd to stop responding. So is that another "universal" recommendation, I
wonder?
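
For reference, the two settings discussed above would look roughly like the
fragment below. This is a minimal sketch rather than a recommendation: the
slurm.conf path is an assumption (it varies by site), and whether either
setting suits a given cluster is exactly the open question here.

```shell
# Fragment for slurm.conf (assumed location: /etc/slurm/slurm.conf).
# Do not propagate the submitting shell's locked-memory limit to batch jobs.
PropagateResourceLimitsExcept=MEMLOCK

# Fragment for /etc/sysconfig/slurm, as quoted from the official docs.
# -M asks slurmd to lock its own memory so it keeps responding under swapping.
SLURMD_OPTIONS="-M"
```

One way to see what a job actually receives is to print the locked-memory
limit from inside a job, for example: srun bash -c 'ulimit -l'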

It may be me talking as a new-ish user, but I would find it helpful to have
a concise document laying out common or useful configuration options to
consult when setting up or reconfiguring Slurm. I'm certain my configuration
has inefficient settings and is missing options that I should have.

Warmest regards,
Jason

On Thu, Apr 20, 2023 at 2:11 AM Ole Holm Nielsen wrote:

> Hi Hoot,
>
> On 4/20/23 00:15, Hoot Thompson wrote:
> > Is there a ‘how to’ or recipe document for setting up and enforcing
> > resource limits? I can establish accounts, users, and set limits but
> > 'current value' is not incrementing after running jobs.
>
> I have written about resource limits in this Wiki page:
>
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits
>
> IHTH,
> Ole
>
>

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms


Re: [slurm-users] Resource Limits

2023-04-20 Thread Hoot Thompson
An update: GrpTRES registers a value while a job is running, but GrpTRESMins
does not, so I still have something wrong. The docs read as if GrpTRESMins is
in fact an aggregate number.
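
To make the distinction concrete: GrpTRES caps what is in use at one moment,
whereas GrpTRESMins caps accumulated TRES-minutes. Below is a minimal sketch
of that arithmetic, assuming a hypothetical job that holds 4 CPUs for 30
minutes and ignoring usage decay (Slurm decays this usage when
PriorityDecayHalfLife is set, so real values may be lower). The function name
cpu_minutes is made up for illustration.

```shell
#!/usr/bin/env bash
# cpu-minutes charged by one job = allocated CPUs x elapsed minutes.
cpu_minutes() {
    # $1 = allocated CPUs, $2 = elapsed wall-clock minutes
    echo $(( $1 * $2 ))
}

limit=1000                   # GrpTRESMins cpu limit shown in this thread
used=$(cpu_minutes 4 30)     # hypothetical job: 4 CPUs for 30 minutes
echo "used=${used} remaining=$(( limit - used ))"
# prints: used=120 remaining=880
```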

> On Apr 20, 2023, at 1:01 PM, Ole Holm Nielsen wrote:
> 
> On 20-04-2023 18:23, Hoot Thompson wrote:
>> Ole,
>> Earlier I found your Slurm_tools posting and found it very useful. This 
>> remains my problem, ‘current value’ not incrementing even after making 
>> needed changes to slurm.conf.
> 
> The ‘current value’ refers to those jobs that are currently running. Does 
> that answer your question?
> 
> /Ole
> 
>> ./showuserlimits -u ubuntu
>> scontrol -o show assoc_mgr users=ubuntu account=testing flags=Assoc
>> Association (Parent account):
>> ClusterName = dev-uid-testing
>> Account = testing
>> UserName =
>> Partition =
>> Priority = 0
>> ID = 6
>> SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00
>> UsageRaw/Norm/Efctv = 0.00/1.00/0.00
>> ParentAccount = root, current value = 1
>> Lft = 2
>> DefAssoc = No
>> GrpJobs =
>> GrpJobsAccrue =
>> GrpSubmitJobs =
>> GrpWall =
>> GrpTRES =
>> cpu:Limit = 1500, current value = 0
>> GrpTRESMins =
>> GrpTRESRunMins =
>> MaxJobs =
>> MaxJobsAccrue =
>> MaxSubmitJobs =
>> MaxWallPJ =
>> MaxTRESPJ =
>> MaxTRESPN =
>> MaxTRESMinsPJ =
>> MinPrioThresh =
>> Association (User):
>> ClusterName = dev-uid-testing
>> Account = testing
>> UserName = ubuntu, UID=1000
>> Partition =
>> Priority = 0
>> ID = 9
>> SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00
>> UsageRaw/Norm/Efctv = 0.00/1.00/0.00
>> ParentAccount =
>> Lft = 3
>> DefAssoc = Yes
>> GrpJobs =
>> GrpJobsAccrue =
>> GrpSubmitJobs =
>> GrpWall =
>> GrpTRES =
>> cpu:Limit = 1500, current value = 0
>> GrpTRESMins =
>> cpu:Limit = 1000, current value = 0
>> GrpTRESRunMins =
>> MaxJobs =
>> MaxJobsAccrue =
>> MaxSubmitJobs =
>> MaxWallPJ =
>> MaxTRESPJ =
>> MaxTRESPN =
>> MaxTRESMinsPJ =
>> MinPrioThresh =
>> Slurm share information:
>> Account  User  RawShares  NormShares  RawUsage  EffectvUsage  FairShare
>> -------  ----  ---------  ----------  --------  ------------  ---------
>> testing  ubuntu  1   0  0.00   0.00
>> Clearly I’m still missing something or I don’t understand how it’s supposed 
>> to work.
>> Hoot
>>> On Apr 20, 2023, at 2:10 AM, Ole Holm Nielsen wrote:
>>> 
>>> Hi Hoot,
>>> 
>>> On 4/20/23 00:15, Hoot Thompson wrote:
>>>> Is there a ‘how to’ or recipe document for setting up and enforcing
>>>> resource limits? I can establish accounts, users, and set limits but
>>>> 'current value' is not incrementing after running jobs.
>>> 
>>> I have written about resource limits in this Wiki page:
>>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits
> 
> 




Re: [slurm-users] Resource Limits

2023-04-20 Thread Hoot Thompson
And it indeed does show current value for a running job!! Do I feel stupid :-)



> On Apr 20, 2023, at 1:01 PM, Ole Holm Nielsen wrote:
> 
> On 20-04-2023 18:23, Hoot Thompson wrote:
>> Ole,
>> Earlier I found your Slurm_tools posting and found it very useful. This 
>> remains my problem, ‘current value’ not incrementing even after making 
>> needed changes to slurm.conf.
> 
> The ‘current value’ refers to those jobs that are currently running. Does 
> that answer your question?
> 
> /Ole
> 




Re: [slurm-users] Resource Limits

2023-04-20 Thread Hoot Thompson
Ah, I thought that was the aggregate of past and current jobs.



> On Apr 20, 2023, at 1:01 PM, Ole Holm Nielsen wrote:
> 
> On 20-04-2023 18:23, Hoot Thompson wrote:
>> Ole,
>> Earlier I found your Slurm_tools posting and found it very useful. This 
>> remains my problem, ‘current value’ not incrementing even after making 
>> needed changes to slurm.conf.
> 
> The ‘current value’ refers to those jobs that are currently running. Does 
> that answer your question?
> 
> /Ole
> 




Re: [slurm-users] Resource Limits

2023-04-20 Thread Ole Holm Nielsen

On 20-04-2023 18:23, Hoot Thompson wrote:
> Ole,
>
> Earlier I found your Slurm_tools posting and found it very useful. This
> remains my problem, ‘current value’ not incrementing even after making
> needed changes to slurm.conf.


The ‘current value’ refers to those jobs that are currently running. 
Does that answer your question?


/Ole







Re: [slurm-users] Resource Limits

2023-04-20 Thread Hoot Thompson
Ole,

Earlier I found your Slurm_tools posting and found it very useful. This remains 
my problem, ‘current value’ not incrementing even after making needed changes 
to slurm.conf.


./showuserlimits -u ubuntu
scontrol -o show assoc_mgr users=ubuntu account=testing flags=Assoc
Association (Parent account):
  ClusterName = dev-uid-testing
  Account = testing
  UserName =
  Partition =
  Priority = 0
  ID = 6
  SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00
  UsageRaw/Norm/Efctv = 0.00/1.00/0.00
  ParentAccount = root, current value = 1
  Lft = 2
  DefAssoc = No
  GrpJobs =
  GrpJobsAccrue =
  GrpSubmitJobs =
  GrpWall =
  GrpTRES =
    cpu: Limit = 1500, current value = 0
  GrpTRESMins =
  GrpTRESRunMins =
  MaxJobs =
  MaxJobsAccrue =
  MaxSubmitJobs =
  MaxWallPJ =
  MaxTRESPJ =
  MaxTRESPN =
  MaxTRESMinsPJ =
  MinPrioThresh =
Association (User):
  ClusterName = dev-uid-testing
  Account = testing
  UserName = ubuntu, UID=1000
  Partition =
  Priority = 0
  ID = 9
  SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00
  UsageRaw/Norm/Efctv = 0.00/1.00/0.00
  ParentAccount =
  Lft = 3
  DefAssoc = Yes
  GrpJobs =
  GrpJobsAccrue =
  GrpSubmitJobs =
  GrpWall =
  GrpTRES =
    cpu: Limit = 1500, current value = 0
  GrpTRESMins =
    cpu: Limit = 1000, current value = 0
  GrpTRESRunMins =
  MaxJobs =
  MaxJobsAccrue =
  MaxSubmitJobs =
  MaxWallPJ =
  MaxTRESPJ =
  MaxTRESPN =
  MaxTRESMinsPJ =
  MinPrioThresh =

Slurm share information:
Account  User  RawShares  NormShares  RawUsage  EffectvUsage  FairShare
-------  ----  ---------  ----------  --------  ------------  ---------
testing  ubuntu  1   0  0.00   0.00


Clearly I’m still missing something or I don’t understand how it’s supposed to 
work.

Hoot
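
When comparing limits before, during, and after a job, it can help to reduce
the listing above to just the lines that carry a limit and a current value.
The sketch below does that with awk. It assumes the "Limit = N, current value
= M" layout shown in this message; parse_limits is a hypothetical helper
name, not part of showuserlimits.

```shell
#!/usr/bin/env bash
# Print "<limit-name> <tres> <limit> <current>" for every TRES limit line.
parse_limits() {
    awk '
        # A bare "Name =" line names the limit for the TRES lines below it.
        /^[ \t]*[A-Za-z]+[ \t]*=[ \t]*$/ { name = $1; next }
        # A line such as "cpu: Limit = 1500, current value = 0" has the numbers.
        /Limit[ \t]*=/ {
            tres = $1;    sub(/:.*/, "", tres)
            limit = $0;   sub(/.*Limit[ \t]*=[ \t]*/, "", limit)
            sub(/,.*/, "", limit)
            current = $0; sub(/.*current value[ \t]*=[ \t]*/, "", current)
            sub(/[ \t]+$/, "", current)
            print name, tres, limit, current
        }
    '
}

# Sample input copied from the user association above.
parse_limits <<'EOF'
  GrpTRES =
    cpu: Limit = 1500, current value = 0
  GrpTRESMins =
    cpu: Limit = 1000, current value = 0
EOF
# prints:
#   GrpTRES cpu 1500 0
#   GrpTRESMins cpu 1000 0
```

On a live system the same function could be fed from the tool directly, for
example: ./showuserlimits -u ubuntu | parse_limits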



> On Apr 20, 2023, at 2:10 AM, Ole Holm Nielsen wrote:
> 
> Hi Hoot,
> 
> On 4/20/23 00:15, Hoot Thompson wrote:
>> Is there a ‘how to’ or recipe document for setting up and enforcing resource 
>> limits? I can establish accounts, users, and set limits but 'current value' 
>> is not incrementing after running jobs.
> 
> I have written about resource limits in this Wiki page:
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits
> 
> IHTH,
> Ole
> 



Re: [slurm-users] Resource Limits

2023-04-20 Thread Hoot Thompson
Thank you for this. I’ll give it a read but no promises that I won’t be back 
with more questions!

Hoot

> On Apr 20, 2023, at 2:10 AM, Ole Holm Nielsen wrote:
> 
> Hi Hoot,
> 
> On 4/20/23 00:15, Hoot Thompson wrote:
>> Is there a ‘how to’ or recipe document for setting up and enforcing resource 
>> limits? I can establish accounts, users, and set limits but 'current value' 
>> is not incrementing after running jobs.
> 
> I have written about resource limits in this Wiki page:
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits
> 
> IHTH,
> Ole
> 




Re: [slurm-users] Resource Limits

2023-04-20 Thread Ole Holm Nielsen

Hi Hoot,

On 4/20/23 00:15, Hoot Thompson wrote:
> Is there a ‘how to’ or recipe document for setting up and enforcing resource
> limits? I can establish accounts, users, and set limits but 'current value' is
> not incrementing after running jobs.


I have written about resource limits in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits

IHTH,
Ole
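
As a rough illustration of the kind of recipe asked about above, the sketch
below shows the general shape of setting and then inspecting association
limits. The user, account, and limit values are the ones that appear
elsewhere in this thread. The dry-run wrapper (run, DRY_RUN) and the function
names are hypothetical helpers, added so that the script only prints the
commands by default. Limits are only enforced when AccountingStorageEnforce in
slurm.conf includes "limits", so treat this as a starting point under those
assumptions and see the wiki page for the full picture.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands unless DRY_RUN=0 is set on a Slurm host.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"          # show what would be executed
    else
        "$@"               # execute for real
    fi
}

set_limits() {
    # $1=user  $2=account  $3=GrpTRES cpu limit  $4=GrpTRESMins cpu-minute limit
    run sacctmgr -i modify user where name="$1" account="$2" \
        set GrpTRES=cpu="$3" GrpTRESMins=cpu="$4"
}

show_limits() {
    # The raw query that showuserlimits prints and wraps.
    run scontrol -o show assoc_mgr users="$1" account="$2" flags=Assoc
}

set_limits ubuntu testing 1500 1000
show_limits ubuntu testing
```

Running show_limits while a job is active and again after it finishes is a
simple way to watch which counters reflect running jobs only.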