[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
Also --
scontrol show nodes
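For instance, filtered down to the fields that usually matter here (a sketch only; field names follow scontrol's node output and may shift slightly between versions):

# NodeName / CPUAlloc+CPUTot / State / Reason lines per node
scontrol show nodes | egrep -i "nodename=|cpualloc|state=|reason="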

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Williams, Jenny Avis via slurm-users
I use an alias, slist = `sed 's/ /\n/g' | sort | uniq` -- do not copy/paste lines containing "--"; the characters may not be the two hyphens intended. The examples below are for Slurm 23.02.7.  These commands assume administrator access.
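If it helps, a minimal sketch of that alias (assumes bash and GNU sed, which treats \n in the replacement as a newline):

# e.g. in ~/.bashrc -- splits space-separated Key=Value output into one field per line
alias slist="sed 's/ /\n/g' | sort | uniq"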


This is a generalized set of areas I use to find out why things just are not moving along.  Either there is indeed a QOS being applied, just not in the way you expect; or the scheduler is bogged down, the pending reason is not updating, and the actual job Reason is different; or the scheduler is indeed "stuck" on something you just aren't seeing yet.

-- Hunt for all QOS applied: on the user, the account, or the partition
Do not qualify the per-user sacctmgr listing; if any fields are non-empty in any of the entries, include those in the format list.  For instance, if there are limits or QOS's at the account tier, they may or may not come into play; if the user, group, or partition has QOS's applied, they may or may not come into play, depending on, e.g., whether "parent" has been set.  If there is a GrpTRES in a QOS applied to the partition, that GrpTRES applies to the partition, not to the "group"/account that at least I tend to assume (or wish) it applies to.  Grp at the partition level means the jobs in aggregate across the partition.  How that plays out when users have jobs running in this partition and possibly others can get interesting.

So, if the pending reason is correct and in the end there is a QOS somewhere at play, look at all QOS that are in any way related at the partition, user, and account levels.

scontrol show partition normal | slist | egrep -i "min|max|qos|oversubscribe|allow"
sacctmgr list associations where account=users_acct
sacctmgr list assoc where user=user
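For example, a wider association listing that also surfaces any attached QOS -- the format fields below are standard sacctmgr association options, so add whichever ones turn up non-empty on your system:

sacctmgr list assoc where user=user \
    format=cluster,account,user,partition,share,qos,defaultqos,grptres,maxtres,maxjobs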

For any and all QOS's that come out of this output, do an sacctmgr listing so you can see any fields with data.
sacctmgr show qos where name=qosname format=etc.
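For instance (the format fields below are the standard sacctmgr QOS options; trim to taste):

sacctmgr show qos where name=qosname \
    format=name,priority,flags,grptres,grpjobs,grpsubmitjobs,maxtresperuser,maxjobsperuser,maxsubmitjobsperuser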


-- resource competition

squeue -p normal -t pd --Format=prioritylong,jobid,state,tres:50 | sort -n -k1,2
Is there any job in this partition that is pending with reason "Resources"?
Are the nodes shared with another partition, and if so, is there a job pending 
in that partition with reason Resources?
*If you have more than one partition containing the same nodes, list all of those partitions in the -p option as a comma-separated list, not just the one the job is in. Any higher-priority job will block lower-priority jobs competing for the same resources if it is ineligible for backfill. Run that squeue across all of the partitions to see, in priority order, the jobs competing for the same resources, as in the example below.
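For example, using the normal and interactive partitions from the sinfo output further down this thread (partition names here are just examples; list every partition that shares the nodes):

# sorted highest priority first
squeue -p normal,interactive -t pd --Format=prioritylong,jobid,partition,state,reason,tres:50 | sort -rn -k1,1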


scontrol show job jobID | slist | egrep -i "oversubscribe|qos|schednodelist|tres|cmd|workdir|min"
# Typically I just do scontrol show job jobID | slist and then scan the list.  That limit is hiding somewhere...

From that job:
QOS
OverSubscribe -- should say YES; if it says anything else, that is the reason (the user has added --exclusive).
SchedNodeList
A favorite recent sticking point.  The next job or next few jobs will claim dibs on a node that the scheduler believes will become free, but that node will show up as "idle", so look for other jobs in the partition that have SchedNodeList set.  Those are the jobs hanging onto the apparently idle nodes.
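A rough way to list them (a bash sketch; it walks the pending jobs in the partition and keeps only those where scontrol reports a non-empty SchedNodeList):

for j in $(squeue -h -p normal -t pd -o "%i"); do
    s=$(scontrol show job "$j" | grep -o "SchedNodeList=[^ ]*")
    # skip jobs where the field is absent, empty, or (null)
    case "$s" in ""|"SchedNodeList="|"SchedNodeList=(null)") ;; *) echo "$j $s" ;; esac
done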

scontrol show partition normal | slist
Give the full listing, unqualified -- of particular interest are any fields that mention e.g. Max, Min, QOS, OverSubscribe, or Priority.
Are all partitions the same priority, or do they vary?  If there are other partitions with higher priority, they may be absorbing the scheduler's attention, especially if there are short-running jobs there.
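A quick way to compare them side by side (PriorityTier and PriorityJobFactor are the fields scontrol prints per partition):

# prints each PartitionName line plus the line carrying PriorityJobFactor/PriorityTier
scontrol show partition | egrep -i "partitionname=|prioritytier"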

-- There is a "snag"
Look at "sdiag -r ; sleep 120; sdiag".
Users running something like "watch squeue" or "watch sacct" can tank the responsiveness of the scheduler.  Under the heading "Remote Procedure Call statistics by user", any user whose count is on the order of root's could be causing scheduler slowdowns.
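To jump straight to that section of the output (heading text as printed by recent sdiag versions):

sdiag | sed -n '/Remote Procedure Call statistics by user/,$p'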

Look at " sacct -S now-2minute -E now -a -s completed,failed -X 
--format=elapsed" -- if you have large numbers of "short" jobs in any 
partitions, where YMMV for what short means, the scheduler can be overwhelmed.
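For a rough count, something along these lines (treats anything that finished in under a minute as "short"; the one-hour window and the threshold are arbitrary, so adjust both):

sacct -S now-1hour -E now -a -s completed,failed -X -n -P --format=elapsed \
    | awk -F: '$1 == "00" && $2 == "00" {n++} END {print n+0, "jobs finished in under a minute"}'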

I hope this helps.



[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Ole Holm Nielsen via slurm-users

Hi Simon,

Maybe you could print the user's limits using this tool:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
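If you want to see the raw numbers directly, the controller's in-memory association table can also be dumped with scontrol; it shows each limit together with the usage currently counted against it, which is what the AssocGrp* checks are evaluated against. A sketch, using the user name from the report below:

# keeps the per-association header line plus the GrpTRES* lines
scontrol show assoc_mgr users=andrewss flags=assoc | egrep -i "username=|grptres"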

Which version of Slurm do you run?

/Ole

On 3/14/24 17:47, Simon Andrews via slurm-users wrote:
Our cluster has developed a strange intermittent behaviour where jobs are 
being put into a pending state because they aren’t passing the 
AssocGrpCpuLimit, even though the user submitting has enough cpus for the 
job to run.


For example:

$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"

 JOBID PARTITION     NAME     USER ST       TIME MIN_MEM MIN_CPU     NODELIST(REASON)
   799    normal hostname andrewss PD       0:00      2G       5    (AssocGrpCpuLimit)


...so the job isn’t running, and it’s the only job in the queue, but:

$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES

   Account       User  Partition     Share       GrpTRES
---------- ---------- ---------- --------- -------------
  andrewss   andrewss     normal         1         cpu=5

That user has a limit of 5 CPUs so the job should run.

The weird thing is that this effect is intermittent.  A job can hang and 
the queue will stall for ages but will then suddenly start working and you 
can submit several jobs and they all work, until one fails again.


The cluster has active nodes and plenty of resource:

$ sinfo

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*        up   infinite      2   idle compute-0-[6-7]
interactive    up 1-12:00:00      3   idle compute-1-[0-1,3]

The slurmctld log just says:

[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259


Whilst it’s in this state I can run other jobs with core requests of up to 
4 and they work, but not 5.  It’s like slurm is adding one CPU to the 
request and then denying it.



I’m sure I’m missing something fundamental but would appreciate it if 
someone could point out what it is!

