Hi, I'm trying to get the topology plugin working and something doesn't seem right. I have a westmere partition with 32 nodes in it. Once I configure the topology plugin, things start acting strangely. Submitting a job with more than 18 nodes fails with this message:

sbatch: error: Batch job submission failed: Requested node configuration is not available
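For example, any submission like this with more than 18 nodes fails (namdsubmit.slurm is my batch script):

> sbatch -N 19 -p westmere namdsubmit.slurm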
I can submit an 18 node job and a 14 node job that will run simultaneously: nodes 1-14 are allocated to the 14 node job and nodes 15-32 are allocated to the 18 node job. It seems like a job on this partition can't include nodes from the 1-14 range and the 15-32 range in the same job:

> sbatch -N 2 --exclusive -p westmere -w 'midway[014,015]' namdsubmit.slurm
sbatch: error: Batch job submission failed: Requested node configuration is not available

The slurmctld log says this when a job is submitted:

Jun 29 14:27:38 midway-mgt slurmctld[26644]: debug: job 15332: best_fit topology failure : no switch satisfying the request found
Jun 29 14:27:38 midway-mgt slurmctld[26644]: debug: job 15332: best_fit topology failure : no switch satisfying the request found
Jun 29 14:27:38 midway-mgt slurmctld[26644]: debug: job 15332: best_fit topology failure : no switch satisfying the request found
Jun 29 14:27:38 midway-mgt slurmctld[26644]: _pick_best_nodes: job 15332 never runnable
Jun 29 14:27:38 midway-mgt slurmctld[26644]: _slurm_rpc_submit_batch_job: Requested node configuration is not available

I have another sandyb partition that works fine with this same configuration: I can submit to as many nodes as I want with the topology plugin configured. Am I configuring something strangely, or is there a bug in the topology plugin?

Here is the topology.conf:

SwitchName=s00 Nodes=midway[037-038,075-076,113-114,151-152,189-190,227-228,259-262]
SwitchName=s01 Nodes=midway[001-018]
SwitchName=s02 Nodes=midway[019-032]
SwitchName=s03 Nodes=midway-bigmem[01-02]
SwitchName=s04 Nodes=midway[039-056]
SwitchName=s05 Nodes=midway[057-074]
SwitchName=s06 Nodes=midway[077-094]
SwitchName=s07 Nodes=midway[095-112]
SwitchName=s08 Nodes=midway[115-132]
SwitchName=s09 Nodes=midway[133-150]
SwitchName=s10 Nodes=midway[153-170]
SwitchName=s11 Nodes=midway[171-188]
SwitchName=s12 Nodes=midway[191-208]
SwitchName=s13 Nodes=midway[209-226]
SwitchName=s14 Nodes=midway[233-242]
SwitchName=s15 Nodes=midway[243-256]
SwitchName=spine Switches=s[01-15]
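For what it's worth, the way I read that file, the westmere nodes span two leaf switches that share the spine, so I would have expected the split to fall at 18/19 rather than 14/15, and I would have expected the two-node midway[014,015] job above to fit entirely within s01:

spine
 |- s01: midway[001-018]  (18 nodes)
 |- s02: midway[019-032]  (14 nodes)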
Here is the slurm.conf:

ControlMachine=midway-mgt
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
StateSaveLocation=/tmp
SwitchType=switch/none
TaskPlugin=task/cgroup
TopologyPlugin=topology/tree
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerTimeSlice=10
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
AccountingStorageEnforce=limits,qos
AccountingStorageHost=midway-mgt
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=midway
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=debug
SlurmdDebug=3
PreemptMode=suspend,gang
PreemptType=preempt/partition_prio
NodeName=midway[037,038,075,076,113,114,151,152,189,190,227,228,259-262] Feature=lc,e5-2670,32G,noib Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=1000
NodeName=midway[039-074,077-112,115-150,153-188,191-226,233-251] Feature=tc,e5-2670,32G,ib Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 Weight=2000
NodeName=midway[252-256] Feature=tc,e5-2670,32G,ib Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Weight=2000
NodeName=midway[001-032] Feature=tc,x5675,24G,ib Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 Weight=2000
NodeName=midway-bigmem[01-02] Feature=tc,e5-2670,256G,ib Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Weight=2000
PartitionName=westmere Nodes=midway[001-032] Default=NO MaxTime=INFINITE State=UP Priority=20 PreemptMode=off
PartitionName=sandyb Nodes=midway[037-228,233-251,259-262] Default=YES MaxTime=INFINITE State=UP Priority=20 PreemptMode=off
PartitionName=sandyb-ht Nodes=midway[252-256] Default=NO MaxTime=INFINITE State=UP Priority=20 PreemptMode=off
PartitionName=bigmem Nodes=midway-bigmem[01-02] MaxTime=INFINITE State=UP Priority=20 PreemptMode=off

--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
