[slurm-dev] Some jobs lose their priority with Reason=PartitionNodeLimit

Lennart Karlsson Fri, 14 Oct 2011 05:48:39 -0700

Hi,

When reading in the squeue manual page about Reason=PartitionNodeLimit,
I find the explanation "The number of nodes required by this job is
outside of it’s partitions current limits. Can also indicate that
required nodes are DOWN or DRAINED."


Well, that does not help very much in my current scenario:

PartitionName=node
AllocNodes=ALL AllowGroups=ALL Default=NO
DefaultTime=00:01:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1
Nodes=q[1-32,45-348]
Priority=1 RootOnly=NO Shared=EXCLUSIVE PreemptMode=OFF

State=UP TotalCPUs=2688 TotalNodes=336 DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED


# sinfo -p node
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
node up 7-00:00:00 336 alloc q[1-32,45-348]

# squeue -j 1479243
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1479243 node Vopt shuang PD 0:00 4 (PartitionNodeLimit)

# scontrol show job 1479243
JobId=1479243 Name=Vopt
UserId=shuang(41279) GroupId=uppmax(40001)
Priority=1 Account=p2003036 QOS=normal WCKey=*
JobState=PENDING Reason=PartitionNodeLimit Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=2-23:00:00 TimeMin=N/A
SubmitTime=2011-10-13T21:18:12 EligibleTime=2011-10-13T21:18:12
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=node AllocNode:Sid=kalkyl1:27843
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=4-4 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=thin Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/bubo/home/h5/shuang/VO2/mono1/mg211/hse/scf0/fm/hse-ispin1/run-vasp-5.2-kalkyl
WorkDir=/bubo/home/h5/shuang/VO2/mono1/mg211/hse/scf0/fm/hse-ispin1

Jobscript starts like this:
#!/bin/bash -l
#SBATCH -J Vopt
#SBATCH -t 71:00:00
#SBATCH -p node -N 4
#SBATCH -A p2003036



If I wait long enough, the job will start, sometimes after jumping
a few times between normal job priority values and priority value one,
but it is very annoying that the start gets delayed by this strange
PartitionNodeLimit error state.
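
To pin down when the priority flips, the job could be sampled at intervals.
This is only a rough sketch: it assumes that the squeue format fields %Q
(integer priority) and %r (reason) are available in this version, and
sample_job and the SQUEUE variable are names made up for this example.

```shell
# Sketch: print one timestamped line with job id, priority and reason.
# SQUEUE can be set to another command (hypothetical hook, used for dry runs).
sample_job() {
    jobid=$1
    # -h suppresses the header; %i = job id, %Q = priority, %r = reason
    line=$("${SQUEUE:-squeue}" -h -j "$jobid" -o '%i %Q %r')
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line"
}
```

Calling "sample_job 1479243" once a minute from a loop and saving the output
should show whether the drops to priority one coincide with the
PartitionNodeLimit reason.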

Four nodes should not be too few or too many for the partition, so
something else is wrong. What? (I am able to get the same behaviour
by submitting a 32-core job on our 64-core SMP machine.)
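
As a sanity check on the numbers shown above, the comparison I would expect
can be written out as below. This is only arithmetic on the displayed values
(NumNodes=4-4, MinNodes=1, MaxNodes=UNLIMITED), not a claim about how the
controller actually decides; within_limits is a helper name invented here.

```shell
# Returns 0 if the requested node count lies within [min, max].
# A max of "UNLIMITED" means no upper bound, as in the scontrol output.
within_limits() {
    req=$1; min=$2; max=$3
    [ "$req" -ge "$min" ] || return 1
    [ "$max" = "UNLIMITED" ] || [ "$req" -le "$max" ]
}

# Values taken from the partition and job listings above.
if within_limits 4 1 UNLIMITED; then
    echo "4 nodes is inside the partition limits"
fi
```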

The problem does not appear every time we submit four node jobs, but
I am not really sure about when it appears. Perhaps it happens only
with jobs that specify the "-N" flag. I presume that it would not
appear if "-N 4" were replaced by "-n 32".
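
The variant I have in mind, which I have not tested, would differ only in
the resource line. The figure 32 assumes 8 cores per node, which follows from
TotalCPUs=2688 over TotalNodes=336 in the partition listing:

```shell
#!/bin/bash -l
#SBATCH -J Vopt
#SBATCH -t 71:00:00
# Ask for 32 tasks instead of 4 nodes (assumes 8 cores per node).
#SBATCH -p node -n 32
#SBATCH -A p2003036
```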

I am running version 2.3.0-2.

Cheers,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
http://www.uppmax.uu.se
