James,

This is perfect timing, since I am also trying to install slurm on our brand new BlueGene/Q.
I have tried lots of things with both 2.3.5 and 2.4.1 and always get the same error. I even tried a single explicit hostname, and it still says the name is duplicated. Thinking the bluegene plugin might be building the node list itself, I left out the NodeName line entirely; then it says there are no node names.

I looked at the code that generates the error. It builds a list of names and, before adding each name, searches the list to see whether the name is already there. With only a single node defined, that check should run just once, so something must be going wrong before this call. I am going to trace this with a debugger next unless someone comes up with an answer.

This is very likely related to the bluegene plugin, since an almost identical slurm config works fine on my non-bluegene test cluster.

Carl Schmidtmann

--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester

----- Original Message -----
>
> Mark,
>
> Thanks very much for the config; that makes more sense now. I have
> been trying to use 'smap -Dc' and then 'save /tmp/bluegene.conf' to
> generate the config but it never specified any MPs and always set
> the 'LayoutMode' to dynamic.
>
> I'll make sure to add the BG patch also.
>
> Thanks Again,
>
> James
>
> On 31/07/2012 07:57, Mark Nelson wrote:
> > Hi James,
> >
> > Unfortunately I can't help you with your real problem about
> > "Duplicated NodeName", but I do have a hint for configuring SLURM,
> > below.
> >
> > On 31/07/12 03:45, James Sweet wrote:
> >>
> >> Hi,
> >>
> >> I am trying to configure slurm to run on a 4 rack BG/Q system and
> >> am getting stuck in creating a config that slurmctld likes. I
> >> have already created blocks in mmcs for both a whole rack and also
> >> each individual midplane. To start I would like to try and create
> >> static blocks in slurm that map to the rack/midplane blocks in mmcs.
> >> I have read through the slurm bluegene admin guide but I'm unsure
> >> as to how to fix my config to sort out the "Duplicated NodeName"
> >> error I am seeing when I try and run slurmctld in debug mode.
> >
> > The way SLURM works with Blue Gene systems is that you define the
> > blocks that you want SLURM to create in the bluegene.conf file
> > rather than through MMCS. The bluegene.conf man page has the full
> > details, but here is a bluegene.conf file for a four-rack BG/Q with
> > SLURM set to use static partitioning with eight blocks, each one
> > midplane:
> >
> > #
> > # bluegene.conf file generated by smap
> > # See the bluegene.conf man page for more information
> > #
> > MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
> >
> > BridgeAPILogFile=/var/log/slurm/bridgeapi.log
> > BridgeAPIVerbose=2
> > # We have 4 IO nodes per midplane, 32 for the four-rack system
> > IONodesPerMP=4
> > # Once any cnodes in a block are in the error state stop running
> > # jobs in that block
> > MaxBlockInError=0
> >
> > MidplaneNodeCnt=512
> > NodeCardNodeCnt=32
> >
> > AllowSubBlockAllocations=No
> > LayoutMode=STATIC
> > #LayoutMode=DYNAMIC
> >
> > #
> > # Block Layout
> > #
> > ###############################################################################
> > # Full-system bgblock, implicitly created
> > # MP=[0000x1011] Type=TORUS # 2x1x2x2 = 8 midplanes
> > ###############################################################################
> > MPs=0000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=1000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=0010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=1010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=0001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=1001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=0011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > MPs=1011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> > #MPs=0000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=1000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=0010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=1010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=0001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=1001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=0011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> > #MPs=1011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> >
> >
> > Here you can see each midplane is specified with MPs=<midplane
> > coordinate in four dimensions> and we're telling SLURM to use the
> > whole midplane as a block (by telling it that the network
> > connection is the full TORUS in all four dimensions).
> > The lines below those, commented out, are for a static block layout
> > where each midplane is split into four 128-cnode blocks (the
> > smallest real block our system can support because of our small
> > number of IO nodes). I left it in there just to show you another
> > example of static partitioning.
> > And as the comment says, the full system block is implicitly
> > created so it doesn't need to be defined.
> >
> > Hopefully someone else can chime in on the Duplicated NodeName error.
> >
> > Hope that helps!
> > Mark
> >
> >>
> >> The error appears on both 2.4.1 and 2.5.0-0.pre2.
> >> The following slurm.conf, bluegene.conf and slurmctld -Dvvv output
> >> are for 2.5.0-0.pre2
> >>
> >> * slurm.conf
> >>
> >>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf |sed '/^$/d'
> >>> ClusterName=bgq-pre_ga
> >>> ControlMachine=bgqsn
> >>> SlurmUser=slurm
> >>> SlurmctldPort=6817
> >>> SlurmdPort=6818
> >>> AuthType=auth/munge
> >>> StateSaveLocation=/tmp
> >>> SlurmdSpoolDir=/tmp/slurmd
> >>> SwitchType=switch/none
> >>> MpiDefault=none
> >>> SlurmctldPidFile=/var/run/slurmctld.pid
> >>> SlurmdPidFile=/var/run/slurmd.pid
> >>> ProctrackType=proctrack/pgid
> >>> CacheGroups=0
> >>> ReturnToService=0
> >>> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
> >>> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
> >>> SlurmctldTimeout=300
> >>> SlurmdTimeout=300
> >>> InactiveLimit=0
> >>> MinJobAge=300
> >>> KillWait=30
> >>> Waittime=0
> >>> SchedulerType=sched/backfill
> >>> SelectType=select/bluegene
> >>> FastSchedule=1
> >>> DebugFlags=BGBlockPick,SelectType
> >>> SlurmctldDebug=3
> >>> SlurmctldLogFile=/tmp/slurm.log
> >>> SlurmdDebug=3
> >>> JobCompType=jobcomp/none
> >>> NodeName=bgq[0000x1011] State=UNKNOWN
> >>> PartitionName=DEFAULT Shared=FORCE
> >>> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
> >>
> >> * bluegene.conf
> >>
> >>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
> >>> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
> >>> IONodesPerMP=8 # io semi-poor
> >>> BridgeAPILogFile=/tmp/bridgeapi.log
> >>> BridgeAPIVerbose=2
> >>> DebugFlags=BGBlockPick,SelectType
> >>> BasePartitionNodeCnt=512
> >>> NodeCardNodeCnt=32
> >>> LayoutMode=STATIC
> >>
> >> * slurmctld
> >>
> >>> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
> >>> slurmctld: pidfile not locked, assuming no running daemon
> >>> slurmctld: Warning: Core limit is only 0 KB
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
> >>> slurmctld: Accounting storage NOT INVOKED plugin loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
> >>> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
> >>> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
> >>> slurmctld: Munge cryptographic signature plugin loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
> >>> slurmctld: BlueGene node selection plugin loading...
> >>> slurmctld: debug: Setting dimensions from slurm.conf file
> >>> slurmctld: Attempting to contact MMCS
> >>> slurmctld: BlueGene configured with 2122 midplanes
> >>> slurmctld: debug: We are using 2122 of the system.
> >>> slurmctld: BlueGene plugin loaded successfully
> >>> slurmctld: BlueGene node selection plugin loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
> >>> slurmctld: preempt/none loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
> >>> slurmctld: debug3: Success.
> >>> slurmctld: Checkpoint plugin loaded: checkpoint/none
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
> >>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: debug: No backup controller to shutdown
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
> >>> slurmctld: switch NONE plugin loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
> >>> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
> >>> slurmctld: topology NONE plugin loaded
> >>> slurmctld: debug3: Success.
> >>> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
> >>> [jim@bgqsn 2.5.0-0.pre2]$
> >>
> >> Any pointers or help would be much appreciated.
> >>
> >> Many Thanks
> >>
> >> James Sweet
> >>
> >
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
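
For reference, the duplicate check described at the top of this message can be sketched roughly as below. This is only an illustration in Python, not SLURM's actual code: expand_range and register_nodes are hypothetical names, and the sketch assumes each of the four coordinates in bgq[0000x1011] is a single decimal digit. It expands the range from the slurm.conf above and registers each name with a search-before-add check.

```python
from itertools import product


def expand_range(prefix, start, end):
    """Expand a box range such as 0000x1011 into node names.

    Assumption (not taken from SLURM): each coordinate is one decimal
    digit and start <= end in every dimension.
    """
    if len(start) != len(end):
        raise ValueError("start and end must have the same number of dimensions")
    # One inclusive range per dimension, e.g. 0..1, 0..0, 0..1, 0..1
    axes = [range(int(lo), int(hi) + 1) for lo, hi in zip(start, end)]
    return [prefix + "".join(str(d) for d in coord) for coord in product(*axes)]


def register_nodes(names):
    """Add names to a list, refusing any name that is already present."""
    registered = []
    for name in names:
        if name in registered:  # search the list before adding
            raise ValueError(f"Duplicated NodeName {name} in the config file")
        registered.append(name)
    return registered


names = expand_range("bgq", "0000", "1011")
nodes = register_nodes(names)
print(len(nodes), nodes[0], nodes[-1])  # prints: 8 bgq0000 bgq1011
```

A single pass over the expanded range never trips the check, while calling register_nodes(names + names) fails on bgq0000, the same name the fatal message reports. That is at least consistent with the registration being reached more than once, though it still needs confirming with a debugger.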
