Hi,

I'm trying to get the topology plugin working and something doesn't seem
right. I have a westmere partition with 32 nodes in it. Once I configure the
topology plugin, things start acting strangely. Submitting a job with more
than 18 nodes fails with this message:
sbatch: error: Batch job submission failed: Requested node configuration is not available
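
To be concrete, a submission like the one below fails (19 is just an example;
any node count over 18 does the same, and namdsubmit.slurm is the same script
used further down):

> sbatch -N 19 --exclusive -p westmere namdsubmit.slurm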

I can submit an 18-node job and a 14-node job that will run
simultaneously: nodes 1-14 are allocated to the 14-node job and nodes 15-32 are
allocated to the 18-node job. It seems like a job on this partition can't
include nodes from both the 1-14 range and the 15-32 range in the same job.
Even forcing one node from each range into a two-node job fails:

> sbatch -N 2 --exclusive -p westmere -w 'midway[014,015]' namdsubmit.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available
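
The node split described above is easy to confirm while the two jobs are
running with something like this (the format string just prints job id, node
count, and node list):

> squeue -p westmere -o "%i %D %N"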

The logs for slurmctld say this when a job is submitted:

Jun 29 14:27:38 midway-mgt slurmctld[26644]: debug:  job 15332: best_fit topology failure : no switch satisfying the request found
Jun 29 14:27:38 midway-mgt slurmctld[26644]: debug:  job 15332: best_fit topology failure : no switch satisfying the request found
Jun 29 14:27:38 midway-mgt slurmctld[26644]: debug:  job 15332: best_fit topology failure : no switch satisfying the request found
Jun 29 14:27:38 midway-mgt slurmctld[26644]: _pick_best_nodes: job 15332 never runnable
Jun 29 14:27:38 midway-mgt slurmctld[26644]: _slurm_rpc_submit_batch_job: Requested node configuration is not available
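
If more scheduler detail would help, I can raise the controller's log level at
run time; something like this should do it (debug2 is just a guess at a useful
level):

> scontrol setdebug debug2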


I have another partition, sandyb, that works fine with the same
configuration: with the topology plugin enabled I can submit to as many nodes
as I want and the jobs run.
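
For comparison, something like this runs without complaint on sandyb, even
though 40 nodes necessarily spans several of the leaf switches listed below
(40 is just an example):

> sbatch -N 40 --exclusive -p sandyb namdsubmit.slurm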

Am I configuring something incorrectly, or is there a bug in the topology plugin?

Here is the topology.conf:

SwitchName=s00 Nodes=midway[037-038,075-076,113-114,151-152,189-190,227-228,259-262]
SwitchName=s01 Nodes=midway[001-018]
SwitchName=s02 Nodes=midway[019-032]
SwitchName=s03 Nodes=midway-bigmem[01-02]
SwitchName=s04 Nodes=midway[039-056]
SwitchName=s05 Nodes=midway[057-074]
SwitchName=s06 Nodes=midway[077-094]
SwitchName=s07 Nodes=midway[095-112]
SwitchName=s08 Nodes=midway[115-132]
SwitchName=s09 Nodes=midway[133-150]
SwitchName=s10 Nodes=midway[153-170]
SwitchName=s11 Nodes=midway[171-188]
SwitchName=s12 Nodes=midway[191-208]
SwitchName=s13 Nodes=midway[209-226]
SwitchName=s14 Nodes=midway[233-242]
SwitchName=s15 Nodes=midway[243-256]
SwitchName=spine Switches=s[01-15]
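
For reference, the tree that slurmctld actually builds from this file can be
checked with:

> scontrol show topology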

Here is the slurm.conf:

ControlMachine=midway-mgt
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/cgroup
TopologyPlugin=topology/tree
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerTimeSlice=10
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
AccountingStorageEnforce=limits,qos
AccountingStorageHost=midway-mgt
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=midway
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=debug
SlurmdDebug=3
PreemptMode=suspend,gang
PreemptType=preempt/partition_prio


NodeName=midway[037,038,075,076,113,114,151,152,189,190,227,228,259-262] Feature=lc,e5-2670,32G,noib Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=1000
NodeName=midway[039-074,077-112,115-150,153-188,191-226,233-251] Feature=tc,e5-2670,32G,ib Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=2000
NodeName=midway[252-256] Feature=tc,e5-2670,32G,ib Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Weight=2000
NodeName=midway[001-032] Feature=tc,x5675,24G,ib Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 Weight=2000
NodeName=midway-bigmem[01-02] Feature=tc,e5-2670,256G,ib Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Weight=2000

PartitionName=westmere Nodes=midway[001-032] Default=NO MaxTime=INFINITE State=UP Priority=20 PreemptMode=off
PartitionName=sandyb Nodes=midway[037-228,233-251,259-262] Default=YES MaxTime=INFINITE State=UP Priority=20 PreemptMode=off
PartitionName=sandyb-ht Nodes=midway[252-256] Default=NO MaxTime=INFINITE State=UP Priority=20 PreemptMode=off
PartitionName=bigmem Nodes=midway-bigmem[01-02] MaxTime=INFINITE State=UP Priority=20 PreemptMode=off



-- 
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
