Oh, you might also want to try out this simple patch, seeing as you're on a Blue Gene: http://bugs.schedmd.com/show_bug.cgi?id=95
Mark.

On 31/07/12 03:45, James Sweet wrote:
>
> Hi,
>
> I am trying to configure slurm to run on a 4 rack BG/Q system and am getting
> stuck in creating a config that slurmctld likes. I have already created
> blocks in mmcs for both a whole rack and also each individual midplane. To
> start I would like to try and create static blocks in slurm that map to the
> rack/midplane blocks in mmcs. I have read through the slurm bluegene admin
> guide but I'm unsure as to how to fix my config to sort out the "Duplicated
> NodeName" error I am seeing when I try and run slurmctld in debug mode.
>
> The error appears on both 2.4.1 and 2.5.0-0.pre2. The following slurm.conf,
> bluegene.conf and slurmctld -Dvvv output are for 2.5.0-0.pre2
>
> * slurm.conf
>
>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf
>> |sed '/^$/d'
>> ClusterName=bgq-pre_ga
>> ControlMachine=bgqsn
>> SlurmUser=slurm
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> CacheGroups=0
>> ReturnToService=0
>> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
>> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> SchedulerType=sched/backfill
>> SelectType=select/bluegene
>> FastSchedule=1
>> DebugFlags=BGBlockPick,SelectType
>> SlurmctldDebug=3
>> SlurmctldLogFile=/tmp/slurm.log
>> SlurmdDebug=3
>> JobCompType=jobcomp/none
>> NodeName=bgq[0000x1011] State=UNKNOWN
>> PartitionName=DEFAULT Shared=FORCE
>> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
>
> * bluegene.conf
>
>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^#
>> /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
>> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
>> IONodesPerMP=8 # io semi-poor
>> BridgeAPILogFile=/tmp/bridgeapi.log
>> BridgeAPIVerbose=2
>> DebugFlags=BGBlockPick,SelectType
>> BasePartitionNodeCnt=512
>> NodeCardNodeCnt=32
>> LayoutMode=STATIC
>
> * slurmctld
>
>> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
>> slurmctld: pidfile not locked, assuming no running daemon
>> slurmctld: Warning: Core limit is only 0 KB
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
>> slurmctld: Accounting storage NOT INVOKED plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: not enforcing associations and no list was given so we
>> are giving a blank list
>> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
>> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
>> slurmctld: Munge cryptographic signature plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
>> slurmctld: BlueGene node selection plugin loading...
>> slurmctld: debug: Setting dimensions from slurm.conf file
>> slurmctld: Attempting to contact MMCS
>> slurmctld: BlueGene configured with 2122 midplanes
>> slurmctld: debug: We are using 2122 of the system.
>> slurmctld: BlueGene plugin loaded successfully
>> slurmctld: BlueGene node selection plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
>> slurmctld: preempt/none loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
>> slurmctld: debug3: Success.
>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug: No backup controller to shutdown
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
>> slurmctld: switch NONE plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
>> slurmctld: debug3: Trying to load plugin
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
>> slurmctld: topology NONE plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
>> [jim@bgqsn 2.5.0-0.pre2]$
>
> Any pointers or help would be much appreciated.
>
> Many Thanks
>
> James Sweet
>
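For anyone checking the NodeName line in the quoted slurm.conf by hand, here is a rough Python sketch of how a range like bgq[0000x1011] could be expanded into individual midplane names. It assumes the "x" form describes a box between two corner coordinates, with one base-36 digit per dimension; expand_box is a hypothetical helper written for illustration, not part of Slurm, and it does not reproduce or diagnose the "Duplicated NodeName" error itself.

```python
import itertools
import re

# Assumption: each coordinate is a single base-36 digit (0-9, then A-Z).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def expand_box(expr):
    """Expand 'prefix[STARTxEND]' into the node names inside the box.

    Hypothetical helper: it only illustrates the assumed box semantics.
    """
    match = re.fullmatch(r"([A-Za-z]+)\[([0-9A-Z]+)x([0-9A-Z]+)\]", expr)
    if match is None:
        raise ValueError("not a box expression: %r" % expr)
    prefix, start, end = match.groups()
    if len(start) != len(end):
        raise ValueError("corner coordinates differ in length")

    axes = []
    for lo, hi in zip(start, end):
        a, b = DIGITS.index(lo), DIGITS.index(hi)
        if a > b:
            raise ValueError("start corner exceeds end corner")
        # Every digit from lo to hi (inclusive) along this dimension.
        axes.append(DIGITS[a:b + 1])

    # Cartesian product over all dimensions gives every coordinate in the box.
    return [prefix + "".join(coord) for coord in itertools.product(*axes)]


names = expand_box("bgq[0000x1011]")
print(len(names), names)
```

Under those assumptions the range covers 8 distinct names, bgq0000 through bgq1011, which lines up with 4 racks of 2 midplanes each. It would also be consistent with reading "2122" in the log as per-dimension sizes 2x1x2x2, though that reading is a guess and not something the log confirms.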
