Mark,

Thanks very much for the config; that makes more sense now. I had been trying to use 'smap -Dc' and then 'save /tmp/bluegene.conf' to generate the config, but it never specified any MPs and always set 'LayoutMode' to dynamic.
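As a cross-check on the MPs lines in the quoted config, here is a minimal Python sketch of how a coordinate box such as 0000x1011 expands into individual midplanes. The function name expand_midplanes is hypothetical (it is not part of SLURM or smap), and the digit alphabet (0-9 then A-Z) is an assumption about the coordinate notation; only 0 and 1 occur on this system.

```python
from itertools import product

# Assumed digit alphabet for one coordinate character (0-9, then A-Z).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def expand_midplanes(box: str) -> list[str]:
    """Expand 'AAAAxBBBB' into every coordinate inside the box, inclusive."""
    start, end = box.split("x")
    # One inclusive range per dimension, built from the matching characters.
    ranges = [
        range(DIGITS.index(lo), DIGITS.index(hi) + 1)
        for lo, hi in zip(start, end)
    ]
    return ["".join(DIGITS[d] for d in coord) for coord in product(*ranges)]


coords = expand_midplanes("0000x1011")
for c in coords:
    # One static block per midplane, torus in all four dimensions.
    print(f"MPs={c} Type=T,T,T,T")
```

For the 2x1x2x2 system this yields eight coordinates, one per midplane, which is the same set Mark lists by hand (the order differs).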
I'll make sure to add the BG patch also.

Thanks again,
James

On 31/07/2012 07:57, Mark Nelson wrote:
> Hi James,
>
> Unfortunately I can't help you with your real problem about "Duplicated
> NodeName", but I do have a hint for configuring SLURM, below.
>
> On 31/07/12 03:45, James Sweet wrote:
>>
>> Hi,
>>
>> I am trying to configure slurm to run on a 4 rack BG/Q system and am getting
>> stuck in creating a config that slurmctld likes. I have already created
>> blocks in mmcs for both a whole rack and also each individual midplane. To
>> start I would like to try and create static blocks in slurm that map to the
>> rack/midplane blocks in mmcs. I have read through the slurm bluegene admin
>> guide but I'm unsure as to how to fix my config to sort out the "Duplicated
>> NodeName" error I am seeing when I try and run slurmctld in debug mode.
>
> The way SLURM works with Blue Gene systems is that you define the blocks that
> you want SLURM to create in the bluegene.conf file rather than through
> MMCS.
> The bluegene.conf man page has the full details, but here is a
> bluegene.conf file for a four-rack BG/Q with SLURM set to use static
> partitioning with eight blocks, each one midplane:
>
> #
> # bluegene.conf file generated by smap
> # See the bluegene.conf man page for more information
> #
> MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
>
> BridgeAPILogFile=/var/log/slurm/bridgeapi.log
> BridgeAPIVerbose=2
> # We have 4 IO nodes per midplane, 32 for the four-rack system
> IONodesPerMP=4
> # Once any cnodes in a block are in the error state stop running
> # jobs in that block
> MaxBlockInError=0
>
> MidplaneNodeCnt=512
> NodeCardNodeCnt=32
>
> AllowSubBlockAllocations=No
> LayoutMode=STATIC
> #LayoutMode=DYNAMIC
>
> #
> # Block Layout
> #
> ###############################################################################
> # Full-system bgblock, implicitly created
> # MP=[0000x1011] Type=TORUS # 2x1x2x2 = 8 midplanes
> ###############################################################################
> MPs=0000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=0010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=0001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=0011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> #MPs=0000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=0010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=0001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=0011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
>
>
> Here you can see each midplane is specified with MPs=<midplane coordinate in
> four dimensions> and we're telling SLURM to use the whole midplane as a
> block (by telling it that the network connection is the full TORUS in all
> four dimensions).
> The lines below, commented out, are for a static block layout where each
> midplane is split into four 128 cnode blocks (the smallest real block our
> system can support because of our small number of IO nodes). I left it in
> there just to show you another example of static partitioning.
> And as the comment says, the full system block is implicitly created so it
> doesn't need to be defined.
>
> Hopefully someone else can chime in on Duplicated NodeName error.
>
> Hope that helps!
> Mark
>
>>
>> The error appears on both 2.4.1 and 2.5.0-0.pre2. The following slurm.conf,
>> bluegene.conf and slurmctld -Dvvv output are for 2.5.0-0.pre2
>>
>> * slurm.conf
>>
>>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf
>>> |sed '/^$/d'
>>> ClusterName=bgq-pre_ga
>>> ControlMachine=bgqsn
>>> SlurmUser=slurm
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> AuthType=auth/munge
>>> StateSaveLocation=/tmp
>>> SlurmdSpoolDir=/tmp/slurmd
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> CacheGroups=0
>>> ReturnToService=0
>>> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
>>> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>> SelectType=select/bluegene
>>> FastSchedule=1
>>> DebugFlags=BGBlockPick,SelectType
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/tmp/slurm.log
>>> SlurmdDebug=3
>>> JobCompType=jobcomp/none
>>> NodeName=bgq[0000x1011] State=UNKNOWN
>>> PartitionName=DEFAULT Shared=FORCE
>>> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
>>
>> * bluegene.conf
>>
>>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^#
>>> /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
>>> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
>>> IONodesPerMP=8 # io semi-poor
>>> BridgeAPILogFile=/tmp/bridgeapi.log
>>> BridgeAPIVerbose=2
>>> DebugFlags=BGBlockPick,SelectType
>>> BasePartitionNodeCnt=512
>>> NodeCardNodeCnt=32
>>> LayoutMode=STATIC
>>
>> * slurmctld
>>
>>> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
>>> slurmctld: pidfile not locked, assuming no running daemon
>>> slurmctld: Warning: Core limit is only 0 KB
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
>>> slurmctld: Accounting storage NOT INVOKED plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: not enforcing associations and no list was given so we
>>> are giving a blank list
>>> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
>>> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
>>> slurmctld: Munge cryptographic signature plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
>>> slurmctld: BlueGene node selection plugin loading...
>>> slurmctld: debug: Setting dimensions from slurm.conf file
>>> slurmctld: Attempting to contact MMCS
>>> slurmctld: BlueGene configured with 2122 midplanes
>>> slurmctld: debug: We are using 2122 of the system.
>>> slurmctld: BlueGene plugin loaded successfully
>>> slurmctld: BlueGene node selection plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
>>> slurmctld: preempt/none loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
>>> slurmctld: debug3: Success.
>>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
>>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug: No backup controller to shutdown
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
>>> slurmctld: switch NONE plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
>>> slurmctld: debug3: Trying to load plugin
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
>>> slurmctld: topology NONE plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
>>> [jim@bgqsn 2.5.0-0.pre2]$
>>
>> Any pointers or help would be much appreciated.
>>
>> Many Thanks
>>
>> James Sweet
>>
>

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
