What does the submit script and/or 'sbatch' command look like? What's the error in slurmctld.log?

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Thu, Aug 6, 2015 at 3:53 PM, Gerry Creager - NOAA Affiliate <[email protected]> wrote:

> We recently tried to implement accounting and fair queuing. For
> completeness, the system is a Cray XE6m
>
> In slurm.conf, we have:
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=sdb
> AccountingStorageEnforce=limits
> PriorityType=priority/multifactor
>
> PriorityWeightAge=1000
> PriorityWeightFairshare=10000
> PriorityWeightJobSize=1000
> PriorityWeightPartition=1000
> PriorityWeightQOS=0 # don't use the qos factor
>
> MessageTimeout=45 # problems with race condition!
>
> # PARTITIONS
> PartitionName=workq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00
> AllowGroups=ALL
> Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-091,094-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287]
> MaxNodes=135
> PartitionName=debugq Default=YES Priority=5 DefaultTime=60 MaxTime=4:00:00
> AllowGroups=ALL Nodes=nid00[002-007,024-029] MaxNodes=4
> PartitionName=wofq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00
> AllowGroups=ALL
> Nodes=nid00[002-007,024-029,040-043,046-049,052-055,064-071,088-091,094-099,100-103,120-127,136-151,160-167,184-199,216-223,232-247,256-263,280-287]
> MaxNodes=135
>
> IN sacctmgr, I have the following associations:
> Cluster Account User Partition Share GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS GrpCPURunMins
> ---------- ---------- ---------- ---------- --------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- -------------
> loki root 1 normal
> loki root root 1 normal
> loki debug 200 4 2 00:15:00 normal
> loki debug chenghao debugq 200 4 2 00:15:00 normal
> loki debug chris.kar+ debugq 200 4 2 00:15:00 normal
> loki debug cpotvin debugq 200 4 2 00:15:00 normal
> loki debug gerry debugq 200 4 2 00:15:00 normal
> loki debug james.cor+ debugq 200 4 2 00:15:00 normal
> loki debug jdgao debugq 200 4 2 00:15:00 normal
> loki debug kknopf83 debugq 200 4 2 00:15:00 normal
> loki debug mansell debugq 200 4 2 00:15:00 normal
> loki debug mflora debugq 200 4 2 00:15:00 normal
> loki debug nyussouf debugq 200 4 2 00:15:00 normal
> loki debug skinnerp debugq 200 4 2 00:15:00 normal
> loki debug tajones debugq 200 4 2 00:15:00 normal
> loki debug wicker debugq 200 4 2 00:15:00 normal
> loki debug wof debugq 200 4 2 00:15:00 normal
> loki largequeue 100 96 1024 06:00:00 normal
> loki largequeue chenghao workq 100 96 1024 06:00:00 normal
> loki largequeue chris.kar+ workq 100 96 1024 06:00:00 normal
> loki largequeue cpotvin workq 100 96 1024 06:00:00 normal
> loki largequeue gerry workq 100 96 1024 06:00:00 normal
> loki largequeue james.cor+ workq 100 96 1024 06:00:00 normal
> loki largequeue jdgao workq 100 96 1024 06:00:00 normal
> loki largequeue kknopf83 workq 100 96 1024 06:00:00 normal
> loki largequeue mansell workq 100 96 1024 06:00:00 normal
> loki largequeue mflora workq 100 96 1024 06:00:00 normal
> loki largequeue nyussouf workq 100 96 1024 06:00:00 normal
> loki largequeue skinnerp workq 100 96 1024 06:00:00 normal
> loki largequeue tajones 100 96 1024 06:00:00 normal
> loki largequeue wicker workq 100 96 1024 06:00:00 normal
> loki largequeue wof workq 100 96 1024 06:00:00 normal
> loki realtime 1000 128 4096 01:00:00 normal
> loki realtime wof wofq 1000 128 4096 01:00:00 normal
> loki smallqueue 100 36 288 06:00:00 normal
> loki smallqueue chenghao workq 100 36 288 06:00:00 normal
> loki smallqueue chris.kar+ workq 100 36 288 06:00:00 normal
> loki smallqueue cpotvin workq 100 36 288 06:00:00 normal
> loki smallqueue gerry workq 100 36 288 06:00:00 normal
> loki smallqueue james.cor+ workq 100 36 288 06:00:00 normal
> loki smallqueue jdgao workq 100 36 288 06:00:00 normal
> loki smallqueue kknopf83 workq 100 36 288 06:00:00 normal
> loki smallqueue mansell workq 100 36 288 06:00:00 normal
> loki smallqueue mflora workq 100 36 288 06:00:00 normal
> loki smallqueue nyussouf workq 100 36 288 06:00:00 normal
> loki smallqueue skinnerp workq 100 36 288 06:00:00 normal
> loki smallqueue tajones workq 100 36 288 06:00:00 normal
> loki smallqueue wicker workq 100 36 288 06:00:00 normal
> loki smallqueue wof workq 100 36 288 06:00:00 normal
>
> I've a user who keeps getting error'd out, with a claim that she has an
> account/partition mismatch. The partition specified is not anywhere in her
> slurm submission script, however (wofq).
>
> I'm baffled. Any suggestions?
> --
> Gerry Creager
> NSSL/CIMMS
> 405.325.6371
> ++++++++++++++++++++++
> “Big whorls have little whorls,
> That feed on their velocity;
> And little whorls have lesser whorls,
> And so on to viscosity.”
> Lewis Fry Richardson (1881-1953)
>
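
One thing that may be worth checking while waiting on the script and the log: in the quoted slurm.conf, workq, debugq and wofq are all marked Default=YES. As far as I know, slurmctld keeps only one default partition and the last one read takes over, but treat that as an assumption to verify against the slurm.conf documentation and slurmctld.log. If it holds, a job submitted without a partition would land in wofq, and the only wofq association in the sacctmgr listing is user wof under account realtime. The sketch below is a rough, hypothetical helper (default_partitions is not part of Slurm) that lists the partitions flagged as default in a slurm.conf excerpt; the node lists are shortened to "..." here for readability.

```python
# Excerpt of the PARTITIONS section from the quoted slurm.conf.
# Node lists are abbreviated to "..." (an edit for brevity, not the real values).
SLURM_CONF = """\
# PARTITIONS
PartitionName=workq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00 AllowGroups=ALL Nodes=... MaxNodes=135
PartitionName=debugq Default=YES Priority=5 DefaultTime=60 MaxTime=4:00:00 AllowGroups=ALL Nodes=nid00[002-007,024-029] MaxNodes=4
PartitionName=wofq Default=YES Priority=1 DefaultTime=60 MaxTime=06:00:00 AllowGroups=ALL Nodes=... MaxNodes=135
"""


def default_partitions(conf_text):
    """Return the names of partitions whose line sets Default=YES, in file order."""
    names = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line.lower().startswith("partitionname="):
            continue
        # Each partition line is a run of whitespace-separated Key=Value pairs.
        pairs = (tok.split("=", 1) for tok in line.split() if "=" in tok)
        fields = {key.lower(): value for key, value in pairs}
        if fields.get("default", "NO").upper() == "YES":
            names.append(fields["partitionname"])
    return names


defaults = default_partitions(SLURM_CONF)
print("Partitions marked Default=YES:", defaults)
if len(defaults) > 1:
    # Assumption: the last Default=YES line read is the one that takes effect.
    print("More than one default; last listed is:", defaults[-1])
```

If this turns out to be the cause, leaving Default=YES on a single partition, or having the user pass the partition explicitly to sbatch, would be the obvious things to try.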
