Hi Don,

And thanks for your answer.

" sacctmgr show qos" shows:
Name Priority GraceTime Preempt PreemptMode Flags UsageThres GrpCPUs GrpCPUMins GrpJobs GrpNodes GrpSubmit GrpWall MaxCPUs MaxCPUMins MaxJobs MaxNodes MaxSubmit MaxWall ---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- -------- ----------- ------- -------- --------- ----------- -------- ----------- ------- -------- --------- ----------- normal 0 cluster short 100 cluster 2 4 00:15:00 testproj 1 cluster seqver 30 cluster interact 100 cluster 1 1 12:00:00
i.e. there are no limits on the "normal" QOS that the jobs run in.
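A narrower format string makes the node-related columns easier to read; something like this should do it (these are standard sacctmgr QOS field names, though the exact set of fields can vary between Slurm versions):

    # list only the node-related limits and the flags of each QOS
    sacctmgr show qos format=Name,Priority,GrpNodes,MaxNodes,MaxWall,Flags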

I can find no slurmctld.log messages matching my problem jobs.

I notice that if I change the job by removing a Feature request, the
reason PartitionNodeLimit disappears, and it does not come back when I
put the Feature request back.
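Something like the following can be used to remove and restore a Feature
request on a pending job (the job id and feature name are placeholders,
and whether an empty value clears the field may depend on the Slurm version):

    # replace the feature request on a pending job
    scontrol update JobId=1234 Features=fat
    # clear the feature request again (a blank value removes it)
    scontrol update JobId=1234 Features=""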

The partition definition has not changed over time, and jobs without
Feature requests sometimes get this PartitionNodeLimit sickness as well.
I have tried to find something related in the source code, but found
nothing. I probably need to find time to dive deeper into the source.

Just now I have 503 jobs waiting in the queue, and 38 of those have lost
their priority (i.e., priority is 1) with reason PartitionNodeLimit,
requesting anything from one node to 35 nodes. There are more than 300
nodes in the alloc or idle states, so I have difficulty seeing why the
Reason is not Priority or Resources.
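A listing along these lines shows the pending jobs together with their
priority, requested node count and reason (all standard squeue format fields):

    # pending jobs: job id, partition, priority, node count, reason
    squeue -t PD -o "%.18i %.9P %.8Q %.6D %.12r"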

Cheers,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden


Lipari, Don wrote:
Correction:  the GrpNodes or MaxNodes limit of the normal qos will not cause 
the PartitionNodeLimit problem.  It's the qos's PartitionMaxNodes flag that 
will exempt the job from the partition's node limits.
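For completeness, that flag is set on a QOS roughly like this (the exact
sacctmgr syntax may vary between versions):

    # allow jobs in the normal QOS to exceed the partition's node limits
    sacctmgr modify qos normal set Flags+=PartitionMaxNodes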

-----Original Message-----
From: owner-slurm-...@lists.llnl.gov [mailto:owner-slurm-d...@lists.llnl.gov] On Behalf Of Lipari, Don
Sent: Monday, October 17, 2011 8:54 AM
To: slurm-dev@lists.llnl.gov
Subject: RE: [slurm-dev] Some jobs lose their priority with Reason=PartitionNodeLimit

Lennart,

Your slurmctld.log may contain more info regarding whether a max or min
nodes limit is being exceeded:

Example:  Job xxx requested too many nodes of partition...
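Something along these lines would find such messages (the log path is
site-specific; adjust it to wherever your slurmctld writes its log):

    # search the controller log for partition node-limit messages
    grep -i "too many nodes" /var/log/slurm/slurmctld.log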

However, it could also be caused by a node limit on the normal qos:

Run " sacctmgr show qos" to see whether there's a GrpNodes or MaxNodes
limit set for the normal qos.
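Or, restricted to the normal qos only (standard sacctmgr fields):

    # show just the node limits of the normal QOS
    sacctmgr show qos normal format=Name,GrpNodes,MaxNodes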

Don

-----Original Message-----
From: owner-slurm-...@lists.llnl.gov [mailto:owner-slurm-d...@lists.llnl.gov] On Behalf Of Lennart Karlsson
Sent: Saturday, October 15, 2011 3:23 AM
To: slurm-dev@lists.llnl.gov
Cc: HAUTREUX Matthieu
Subject: Re: [slurm-dev] Some jobs lose their priority with Reason=PartitionNodeLimit

On 10/14/2011 03:02 PM, HAUTREUX Matthieu wrote:
Lennart,

I might be wrong, but it seems that your nodes are already allocated,
as they all have the "alloc" state in "salloc -p node".

As you have configured the partition with "Shared=Exclusive", as soon
as one core is allocated on a node, the whole node is allocated
exclusively. As a result, your submission is pending, waiting for free
nodes to run. As soon as a running job holding 4 nodes finishes, your
job should be started.

HTH
Matthieu

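For reference, the behaviour Matthieu describes comes from a partition
definition along these lines in slurm.conf (the partition name "node" is
taken from the thread, the node list is a placeholder):

    # exclusive allocation: one core allocated => the whole node is allocated
    PartitionName=node Nodes=<nodelist> Shared=EXCLUSIVE State=UP
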
Matthieu,

You answered why my job is not starting, a straightforward answer to
one of the most common questions you get. Thanks for answering, but I
worry about a more specific detail: my PartitionNodeLimit problem.

My question is why the job loses its normal priority, goes
down to the lowest priority and gets marked with
the PartitionNodeLimit label.
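A single job's priority and pending reason can also be checked directly
(the job id is a placeholder):

    # show the priority and pending reason of one job
    scontrol show job 1234 | grep -E "Priority|Reason"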

I would like the job to keep its normal priority, so it has
a normal chance to start later on.

Cheers,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
    http://www.uppmax.uu.se


