Hi James,

Unfortunately I can't help with your real problem, the "Duplicated NodeName" error, but I do have a hint for configuring SLURM, below.
On 31/07/12 03:45, James Sweet wrote:
>
> Hi,
>
> I am trying to configure slurm to run on a 4 rack BG/Q system and am getting
> stuck in creating a config that slurmctld likes. I have already created
> blocks in mmcs for both a whole rack and also each individual midplane. To
> start I would like to try and create static blocks in slurm that map to the
> rack/midplane blocks in mmcs. I have read though the slurm bluegene admin
> guide but i'm unsure as to how to fix my config to sort out the "Duplicated
> NodeName" error i am seeing when I try and run slurmctld in debug mode.

The way SLURM works with Blue Gene systems is that you define the blocks that you want SLURM to create in the bluegene.conf file rather than through MMCS.

The bluegene.conf man page has the full details, but here is a bluegene.conf file for a four-rack BG/Q with SLURM set to use static partitioning with eight blocks, each one midplane:

#
# bluegene.conf file generated by smap
# See the bluegene.conf man page for more information
#
MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=2
# We have 4 IO nodes per midplane, 32 for the four-rack system
IONodesPerMP=4
# Once any cnodes in a block are in the error state stop running
# jobs in that block
MaxBlockInError=0
MidplaneNodeCnt=512
NodeCardNodeCnt=32
AllowSubBlockAllocations=No
LayoutMode=STATIC
#LayoutMode=DYNAMIC
#
# Block Layout
#
###############################################################################
# Full-system bgblock, implicitly created
# MP=[0000x1011] Type=TORUS # 2x1x2x2 = 8 midplanes
###############################################################################
MPs=0000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=0010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=0001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=0011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
#MPs=0000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=0010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=0001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=0011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks

Here you can see each midplane is specified with MPs=<midplane coordinate in four dimensions>, and we're telling SLURM to use the whole midplane as a block (by telling it that the network connection is the full TORUS in all four dimensions).

The commented-out lines after those are for a static block layout where each midplane is split into four 128-cnode blocks (the smallest real block our system can support because of our small number of IO nodes). I left them in there just to show you another example of static partitioning.

And as the comment says, the full-system block is implicitly created, so it doesn't need to be defined.

Hopefully someone else can chime in on the Duplicated NodeName error.

Hope that helps!
Mark

>
> The error appears on both 2.4.1 and 2.5.0-0.pre2.
> The following slurm.conf, bluegene.conf and slurmctrld -Dvvv output are
> for 2.5.0-0.pre2
>
> * slurm.conf
>
>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf |sed '/^$/d'
>> ClusterName=bgq-pre_ga
>> ControlMachine=bgqsn
>> SlurmUser=slurm
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> CacheGroups=0
>> ReturnToService=0
>> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
>> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> SchedulerType=sched/backfill
>> SelectType=select/bluegene
>> FastSchedule=1
>> DebugFlags=BGBlockPick,SelectType
>> SlurmctldDebug=3
>> SlurmctldLogFile=/tmp/slurm.log
>> SlurmdDebug=3
>> JobCompType=jobcomp/none
>> NodeName=bgq[0000x1011] State=UNKNOWN
>> PartitionName=DEFAULT Shared=FORCE
>> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
>
> * bluegene.conf
>
>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
>> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
>> IONodesPerMP=8 # io semi-poor
>> BridgeAPILogFile=/tmp/bridgeapi.log
>> BridgeAPIVerbose=2
>> DebugFlags=BGBlockPick,SelectType
>> BasePartitionNodeCnt=512
>> NodeCardNodeCnt=32
>> LayoutMode=STATIC
>
> * slurmctld
>
>> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
>> slurmctld: pidfile not locked, assuming no running daemon
>> slurmctld: Warning: Core limit is only 0 KB
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
>> slurmctld: Accounting storage NOT INVOKED plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
>> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
>> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
>> slurmctld: Munge cryptographic signature plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
>> slurmctld: BlueGene node selection plugin loading...
>> slurmctld: debug: Setting dimensions from slurm.conf file
>> slurmctld: Attempting to contact MMCS
>> slurmctld: BlueGene configured with 2122 midplanes
>> slurmctld: debug: We are using 2122 of the system.
>> slurmctld: BlueGene plugin loaded successfully
>> slurmctld: BlueGene node selection plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
>> slurmctld: preempt/none loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
>> slurmctld: debug3: Success.
>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug: No backup controller to shutdown
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
>> slurmctld: switch NONE plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
>> slurmctld: topology NONE plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
>> [jim@bgqsn 2.5.0-0.pre2]$
>
> Any pointers or help would be much appreciated.
>
> Many Thanks
>
> James Sweet
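
As an aside on the static layout in my bluegene.conf: the eight MPs= lines simply enumerate every midplane coordinate in the box 0000 to 1011 (2x1x2x2 = 8 midplanes). Here is a minimal Python sketch of that enumeration, in case it helps to see it spelled out. It is not part of SLURM; the function names are hypothetical, and it assumes one decimal digit per dimension with the first coordinate varying fastest (the order the lines appear in my file).

```python
from itertools import product


def expand_midplanes(start, end):
    """List every midplane coordinate in the box start..end (inclusive).

    Assumption: each dimension is a single decimal digit, e.g. "0000".."1011".
    """
    lo = [int(ch) for ch in start]
    hi = [int(ch) for ch in end]
    # product() varies its LAST argument fastest, so reverse the dimensions
    # to make the FIRST coordinate vary fastest, then reverse each result back.
    ranges = [range(l, h + 1) for l, h in zip(lo, hi)][::-1]
    return ["".join(str(d) for d in combo[::-1]) for combo in product(*ranges)]


def static_block_lines(start, end):
    """Build one whole-midplane block definition line per midplane."""
    return [f"MPs={mp} Type=T,T,T,T" for mp in expand_midplanes(start, end)]


if __name__ == "__main__":
    for line in static_block_lines("0000", "1011"):
        print(line)
```

Running it prints eight MPs= lines, one per midplane, which you could compare against the block section of a hand-written bluegene.conf.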
