Hi,

I am trying to configure Slurm to run on a 4-rack BG/Q system and am getting stuck creating a config that slurmctld accepts. I have already created blocks in MMCS for both a whole rack and each individual midplane. To start, I would like to create static blocks in Slurm that map to the rack/midplane blocks in MMCS. I have read through the Slurm BlueGene admin guide, but I'm unsure how to fix my config to resolve the "Duplicated NodeName" error I see when I run slurmctld in debug mode.
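For context, this is how I understand the node range I use in the slurm.conf below. It is only a minimal sketch of my reading of the admin guide, not Slurm's own parsing code: I am assuming each character of the name is a base-36 coordinate in one midplane dimension and that `x` spans the box between the two corners. The helper name `expand_bg_range` is made up for illustration.

```python
from itertools import product

# Assumption: coordinates are single base-36 digits (0-9, then A-Z).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def expand_bg_range(expr):
    """Expand a box range such as 'bgq[0000x1011]' into midplane names.

    Hypothetical helper: it mirrors my understanding of the notation,
    not the actual Slurm implementation.
    """
    prefix, _, rest = expr.partition("[")
    start, _, end = rest.rstrip("]").partition("x")
    if not start or len(start) != len(end):
        raise ValueError("expected PREFIX[STARTxEND] with equal-length corners")

    axes = []
    for lo, hi in zip(start, end):
        lo_i, hi_i = DIGITS.index(lo), DIGITS.index(hi)
        if lo_i > hi_i:
            raise ValueError("start corner must not exceed end corner")
        # Every coordinate value between the two corners on this axis.
        axes.append(DIGITS[lo_i:hi_i + 1])

    # Cartesian product of the per-axis values gives every midplane in the box.
    return [prefix + "".join(coord) for coord in product(*axes)]


names = expand_bg_range("bgq[0000x1011]")
print(len(names), names)  # 8 names, bgq0000 through bgq1011
```

If that reading is right, the range covers 8 midplanes (2 x 1 x 2 x 2, i.e. 4 racks with 2 midplanes each), and each name, including bgq0000, should appear only once.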
The error appears on both 2.4.1 and 2.5.0-0.pre2. The following slurm.conf, bluegene.conf and slurmctld -Dvvv output are for 2.5.0-0.pre2.

* slurm.conf

> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf |sed '/^$/d'
> ClusterName=bgq-pre_ga
> ControlMachine=bgqsn
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=0
> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/bluegene
> FastSchedule=1
> DebugFlags=BGBlockPick,SelectType
> SlurmctldDebug=3
> SlurmctldLogFile=/tmp/slurm.log
> SlurmdDebug=3
> JobCompType=jobcomp/none
> NodeName=bgq[0000x1011] State=UNKNOWN
> PartitionName=DEFAULT Shared=FORCE
> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes

* bluegene.conf

> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
> IONodesPerMP=8 # io semi-poor
> BridgeAPILogFile=/tmp/bridgeapi.log
> BridgeAPIVerbose=2
> DebugFlags=BGBlockPick,SelectType
> BasePartitionNodeCnt=512
> NodeCardNodeCnt=32
> LayoutMode=STATIC

* slurmctld

> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: Warning: Core limit is only 0 KB
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
> slurmctld: Accounting storage NOT INVOKED plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
> slurmctld: Munge cryptographic signature plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
> slurmctld: BlueGene node selection plugin loading...
> slurmctld: debug: Setting dimensions from slurm.conf file
> slurmctld: Attempting to contact MMCS
> slurmctld: BlueGene configured with 2122 midplanes
> slurmctld: debug: We are using 2122 of the system.
> slurmctld: BlueGene plugin loaded successfully
> slurmctld: BlueGene node selection plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
> slurmctld: preempt/none loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
> slurmctld: debug3: Success.
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug: No backup controller to shutdown
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
> slurmctld: switch NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
> slurmctld: debug3: Trying to load plugin /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
> slurmctld: topology NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
> [jim@bgqsn 2.5.0-0.pre2]$

Any pointers or help would be much appreciated.

Many Thanks

James Sweet

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
