Hi James,

Unfortunately I can't help you with your real problem about "Duplicated 
NodeName", but I do have a hint for configuring SLURM, below.

On 31/07/12 03:45, James Sweet wrote:
>
> Hi,
>
> I am trying to configure slurm to run on a 4 rack BG/Q system and am getting 
> stuck in creating a config that slurmctld likes. I have already created
> blocks in mmcs for both a whole rack and also each individual midplane. To 
> start I would like to try and create static blocks in slurm that map to the
> rack/midplane blocks in mmcs. I have read through the slurm bluegene admin 
> guide but I'm unsure as to how to fix my config to sort out the "Duplicated
> NodeName" error I am seeing when I try to run slurmctld in debug mode.

The way SLURM works with Blue Gene systems is that you define the blocks 
that you want SLURM to create in the bluegene.conf file rather than 
through MMCS. The bluegene.conf man page has the full details, but here 
is a bluegene.conf file for a four-rack BG/Q with SLURM set to use 
static partitioning with eight blocks, each one midplane:

#
# bluegene.conf file generated by smap
# See the bluegene.conf man page for more information
#
MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware

BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=2
# We have 4 IO nodes per midplane, 32 for the four-rack system
IONodesPerMP=4
# Once any cnodes in a block are in the error state stop running
# jobs in that block
MaxBlockInError=0

MidplaneNodeCnt=512
NodeCardNodeCnt=32

AllowSubBlockAllocations=No
LayoutMode=STATIC
#LayoutMode=DYNAMIC

#
# Block Layout
#
###############################################################################
# Full-system bgblock, implicitly created
# MP=[0000x1011] Type=TORUS         # 2x1x2x2 = 8 midplanes
###############################################################################
MPs=0000       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1000       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=0010       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1010       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=0001       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1001       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=0011       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
MPs=1011       Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
#MPs=0000       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1000       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=0010       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1010       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=0001       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1001       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=0011       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
#MPs=1011       Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks


Here you can see each midplane is specified with MPs=<midplane 
coordinate in four dimensions> and we're telling SLURM to use the whole 
midplane as a block (by telling it that the network connection is the 
full TORUS in all four dimensions).
The commented-out lines below those are an alternative static layout in 
which each midplane is split into four 128-cnode blocks (the smallest 
real block our system can support because of our small number of IO 
nodes). I left them in just to show you another example of static 
partitioning.
As the comment says, the full-system block is created implicitly, so 
it doesn't need to be defined.
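
To make the coordinate notation concrete, here is a small Python sketch 
that expands a range such as 0000x1011 into the individual midplane 
coordinates and prints one MPs= line for each. It is only an 
illustration, not SLURM code: the function name expand_midplane_range is 
made up for this example, and it assumes each coordinate is a single 
character per dimension (digits first, then letters) and that the range 
gives the low and high corners of a box.

```python
from itertools import product

# Assumption: one character per dimension, digits then upper-case letters.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def expand_midplane_range(spec):
    """Expand a range like '0000x1011' into a list of midplane coordinates.

    Hypothetical helper for illustration; not part of SLURM.
    """
    start, end = spec.split("x")
    if len(start) != len(end):
        raise ValueError("start and end must have the same number of dimensions")
    axes = []
    for lo, hi in zip(start, end):
        lo_i, hi_i = DIGITS.index(lo), DIGITS.index(hi)
        if lo_i > hi_i:
            raise ValueError("start corner must not exceed end corner")
        # All values this dimension can take, inclusive of both corners.
        axes.append(DIGITS[lo_i:hi_i + 1])
    # Cartesian product over the dimensions gives every midplane in the box.
    return ["".join(coord) for coord in product(*axes)]


midplanes = expand_midplane_range("0000x1011")
for mp in midplanes:
    # One whole-midplane block per coordinate, torus in all four dimensions.
    print(f"MPs={mp}       Type=T,T,T,T")
```

For 0000x1011 this yields the same eight coordinates as the MPs= lines 
in the file above (in a different order), which matches the 2x1x2x2 = 8 
midplanes noted in the comment.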

Hopefully someone else can chime in on the "Duplicated NodeName" error.

Hope that helps!
Mark

>
> The error appears on both 2.4.1 and 2.5.0-0.pre2. The following slurm.conf, 
> bluegene.conf and slurmctld -Dvvv output are for 2.5.0-0.pre2
>
> * slurm.conf
>
>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf 
>> |sed '/^$/d'
>> ClusterName=bgq-pre_ga
>> ControlMachine=bgqsn
>> SlurmUser=slurm
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> AuthType=auth/munge
>> StateSaveLocation=/tmp
>> SlurmdSpoolDir=/tmp/slurmd
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>> CacheGroups=0
>> ReturnToService=0
>> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
>> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> SchedulerType=sched/backfill
>> SelectType=select/bluegene
>> FastSchedule=1
>> DebugFlags=BGBlockPick,SelectType
>> SlurmctldDebug=3
>> SlurmctldLogFile=/tmp/slurm.log
>> SlurmdDebug=3
>> JobCompType=jobcomp/none
>> NodeName=bgq[0000x1011] State=UNKNOWN
>> PartitionName=DEFAULT Shared=FORCE
>> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
>
> * bluegene.conf
>
>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# 
>> /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
>> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
>> IONodesPerMP=8 # io semi-poor
>> BridgeAPILogFile=/tmp/bridgeapi.log
>> BridgeAPIVerbose=2
>> DebugFlags=BGBlockPick,SelectType
>> BasePartitionNodeCnt=512
>> NodeCardNodeCnt=32
>> LayoutMode=STATIC
>
> * slurmctld
>
>> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
>> slurmctld: pidfile not locked, assuming no running daemon
>> slurmctld: Warning: Core limit is only 0 KB
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
>> slurmctld: Accounting storage NOT INVOKED plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: not enforcing associations and no list was given so we 
>> are giving a blank list
>> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
>> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
>> slurmctld: Munge cryptographic signature plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
>> slurmctld: BlueGene node selection plugin loading...
>> slurmctld: debug:  Setting dimensions from slurm.conf file
>> slurmctld: Attempting to contact MMCS
>> slurmctld: BlueGene configured with 2122 midplanes
>> slurmctld: debug:  We are using 2122 of the system.
>> slurmctld: BlueGene plugin loaded successfully
>> slurmctld: BlueGene node selection plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
>> slurmctld: preempt/none loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
>> slurmctld: debug3: Success.
>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug:  No backup controller to shutdown
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
>> slurmctld: switch NONE plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
>> slurmctld: debug3: Trying to load plugin 
>> /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
>> slurmctld: topology NONE plugin loaded
>> slurmctld: debug3: Success.
>> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
>> [jim@bgqsn 2.5.0-0.pre2]$
>
> Any pointers or help would be much appreciated.
>
> Many Thanks
>
> James Sweet
>
