We recently tried to implement accounting and fair queuing. For completeness, the system is a Cray XE6m.
In slurm.conf, we have:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=sdb
AccountingStorageEnforce=limits
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
MessageTimeout=45 # problems with race condition!
# PARTITIONS
PartitionName=workq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00 AllowGroups=ALL Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-091,094-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=135
PartitionName=debugq Default=YES Priority=5 DefaultTime=60 MaxTime=4:00:00 AllowGroups=ALL Nodes=nid00[002-007,024-029] MaxNodes=4
PartitionName=wofq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00 AllowGroups=ALL Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-091,094-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287] MaxNodes=135
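As a quick cross-check of the partition definitions above, here is a minimal Python sketch that lists which partitions set Default=YES. It is only an illustration: the helper name default_partitions is made up, the node lists are shortened to keep it readable, and it assumes each partition definition sits on a single line. The slurm.conf documentation describes Default as marking the partition used when a job does not name one, so more than one hit may be worth a second look.

```python
# Partition lines modelled on the slurm.conf excerpt above. The node lists are
# shortened here (an assumption made purely for readability).
SLURM_CONF = """\
PartitionName=workq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00 AllowGroups=ALL Nodes=nid00[002-007] MaxNodes=135
PartitionName=debugq Default=YES Priority=5 DefaultTime=60 MaxTime=4:00:00 AllowGroups=ALL Nodes=nid00[002-007,024-029] MaxNodes=4
PartitionName=wofq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00 AllowGroups=ALL Nodes=nid00[002-007] MaxNodes=135
"""


def default_partitions(conf_text):
    """Return the names of partitions whose line sets Default=YES.

    Hypothetical helper for illustration; it is not part of Slurm.
    """
    names = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line.lower().startswith("partitionname="):
            continue
        # Split the line into Key=Value tokens; keys compared case-insensitively.
        fields = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                fields[key.lower()] = value
        if fields.get("default", "NO").upper() == "YES":
            names.append(fields["partitionname"])
    return names


print(default_partitions(SLURM_CONF))
```

On a live system, the same check would be run against the full text of slurm.conf instead of the shortened sample.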
In sacctmgr, I have the following associations:
Cluster Account User Partition Share GrpJobs GrpNodes
GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes
MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
GrpCPURunMins
---------- ---------- ---------- ---------- --------- ------- --------
-------- ------- --------- ----------- ----------- ------- --------
-------- --------- ----------- ----------- -------------------- ---------
-------------
loki root 1
normal
loki root root 1
normal
loki debug 200
4
2 00:15:00 normal
loki debug chenghao debugq 200
4
2 00:15:00 normal
loki debug chris.kar+ debugq 200
4
2 00:15:00 normal
loki debug cpotvin debugq 200
4
2 00:15:00 normal
loki debug gerry debugq 200
4
2 00:15:00 normal
loki debug james.cor+ debugq 200
4
2 00:15:00 normal
loki debug jdgao debugq 200
4
2 00:15:00 normal
loki debug kknopf83 debugq 200
4
2 00:15:00 normal
loki debug mansell debugq 200
4
2 00:15:00 normal
loki debug mflora debugq 200
4
2 00:15:00 normal
loki debug nyussouf debugq 200
4
2 00:15:00 normal
loki debug skinnerp debugq 200
4
2 00:15:00 normal
loki debug tajones debugq 200
4
2 00:15:00 normal
loki debug wicker debugq 200
4
2 00:15:00 normal
loki debug wof debugq 200
4
2 00:15:00 normal
loki largequeue 100
96 1024
06:00:00 normal
loki largequeue chenghao workq 100
96 1024
06:00:00 normal
loki largequeue chris.kar+ workq 100
96 1024
06:00:00 normal
loki largequeue cpotvin workq 100
96 1024
06:00:00 normal
loki largequeue gerry workq 100
96 1024
06:00:00 normal
loki largequeue james.cor+ workq 100
96 1024
06:00:00 normal
loki largequeue jdgao workq 100
96 1024
06:00:00 normal
loki largequeue kknopf83 workq 100
96 1024
06:00:00 normal
loki largequeue mansell workq 100
96 1024
06:00:00 normal
loki largequeue mflora workq 100
96 1024
06:00:00 normal
loki largequeue nyussouf workq 100
96 1024
06:00:00 normal
loki largequeue skinnerp workq 100
96 1024
06:00:00 normal
loki largequeue tajones 100
96 1024
06:00:00 normal
loki largequeue wicker workq 100
96 1024
06:00:00 normal
loki largequeue wof workq 100
96 1024
06:00:00 normal
loki realtime 1000
128 4096
01:00:00 normal
loki realtime wof wofq 1000
128 4096
01:00:00 normal
loki smallqueue 100
36 288
06:00:00 normal
loki smallqueue chenghao workq 100
36 288
06:00:00 normal
loki smallqueue chris.kar+ workq 100
36 288
06:00:00 normal
loki smallqueue cpotvin workq 100
36 288
06:00:00 normal
loki smallqueue gerry workq 100
36 288
06:00:00 normal
loki smallqueue james.cor+ workq 100
36 288
06:00:00 normal
loki smallqueue jdgao workq 100
36 288
06:00:00 normal
loki smallqueue kknopf83 workq 100
36 288
06:00:00 normal
loki smallqueue mansell workq 100
36 288
06:00:00 normal
loki smallqueue mflora workq 100
36 288
06:00:00 normal
loki smallqueue nyussouf workq 100
36 288
06:00:00 normal
loki smallqueue skinnerp workq 100
36 288
06:00:00 normal
loki smallqueue tajones workq 100
36 288
06:00:00 normal
loki smallqueue wicker workq 100
36 288
06:00:00 normal
loki smallqueue wof workq 100
36 288
06:00:00 normal
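To make the listing above easier to query, here is a small Python sketch that checks whether a given account/user/partition combination has an association. It models only a handful of rows copied from the listing, the function name has_association is made up, and the lookup rule (exact partition match first, then an association with a blank partition) is an assumption for illustration, not a statement of how Slurm resolves associations internally.

```python
# (account, user, partition) rows copied from the sacctmgr listing above.
# None stands for a blank Partition column. Only a few rows are modelled.
ASSOCIATIONS = {
    ("debug", "wof", "debugq"),
    ("largequeue", "wof", "workq"),
    ("realtime", "wof", "wofq"),
    ("smallqueue", "wof", "workq"),
    ("largequeue", "tajones", None),
}


def has_association(account, user, partition):
    """Return True if the combination matches a modelled association.

    Assumed rule (hypothetical): an exact account/user/partition row matches,
    and a row with a blank partition matches any partition.
    """
    if (account, user, partition) in ASSOCIATIONS:
        return True
    return (account, user, None) in ASSOCIATIONS


for combo in [
    ("realtime", "wof", "wofq"),
    ("smallqueue", "wof", "wofq"),
    ("largequeue", "tajones", "wofq"),
]:
    print(combo, has_association(*combo))
```

Under these assumptions, a job that lands in a partition other than the one named in the user's association for that account finds no matching row.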
I have a user whose jobs keep getting rejected with an error claiming that
she has an account/partition mismatch. However, the partition named in the
error (wofq) does not appear anywhere in her Slurm submission script.
I'm baffled. Any suggestions?
--
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)
