Mark,

Thanks very much for the config; that makes more sense now. I have been trying 
to use 'smap -Dc' and then 'save /tmp/bluegene.conf' to generate the 
config, but it never specified any MPs and always set the 'LayoutMode' to 
dynamic.
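
For anyone following along, here is a minimal Python sketch of how the eight
MPs lines in Mark's config below could be generated from a range such as
0000x1011. It assumes each coordinate is a single decimal digit per dimension
and that the range covers every combination between the start and end digit,
which matches the "2x1x2x2 = 8 midplanes" comment in that config. The helper
names expand_midplane_range and static_block_lines are made up for
illustration; they are not part of SLURM.

```python
from itertools import product


def expand_midplane_range(spec):
    """Expand a range such as '0000x1011' into individual midplane coordinates.

    Assumption: one decimal digit per dimension, and the range includes every
    combination between the start and end digit in each dimension.
    """
    start, end = spec.split("x")
    if len(start) != len(end):
        raise ValueError("start and end must have the same number of digits")
    # One inclusive digit range per dimension, e.g. 0..1, 0..0, 0..1, 0..1.
    axes = [range(int(lo), int(hi) + 1) for lo, hi in zip(start, end)]
    return ["".join(str(d) for d in coord) for coord in product(*axes)]


def static_block_lines(spec, conn_type="T,T,T,T"):
    """Build one 'MPs=... Type=...' line per midplane in the range."""
    return [f"MPs={mp} Type={conn_type}" for mp in expand_midplane_range(spec)]


if __name__ == "__main__":
    for line in static_block_lines("0000x1011"):
        print(line)
```

Running it prints one MPs line per midplane. The ordering differs from Mark's
file, but the set of coordinates is the same.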

I'll make sure to add the BG patch also.

Thanks Again,

James

On 31/07/2012 07:57, Mark Nelson wrote:
> Hi James,
>
> Unfortunately I can't help you with your real problem about "Duplicated 
> NodeName", but I do have a hint for configuring SLURM, below.
>
> On 31/07/12 03:45, James Sweet wrote:
>>
>> Hi,
>>
>> I am trying to configure slurm to run on a 4 rack BG/Q system and am getting 
>> stuck in creating a config that slurmctld likes. I have already created
>> blocks in mmcs for both a whole rack and also each individual midplane. To 
>> start I would like to try and create static blocks in slurm that map to the
>> rack/midplane blocks in mmcs. I have read through the slurm bluegene admin 
>> guide but I'm unsure as to how to fix my config to sort out the "Duplicated
>> NodeName" error I am seeing when I try and run slurmctld in debug mode.
>
> The way SLURM works with Blue Gene systems is that you define the blocks that 
> you want SLURM to create in the bluegene.conf file rather than through
> MMCS. The bluegene.conf man page has the full details, but here is a 
> bluegene.conf file for a four-rack BG/Q with SLURM set to use static 
> partitioning with eight blocks, each one midplane:
>
> #
> # bluegene.conf file generated by smap
> # See the bluegene.conf man page for more information
> #
> MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
>
> BridgeAPILogFile=/var/log/slurm/bridgeapi.log
> BridgeAPIVerbose=2
> # We have 4 IO nodes per midplane, 32 for the four-rack system
> IONodesPerMP=4
> # Once any cnodes in a block are in the error state, stop running
> # jobs in that block
> MaxBlockInError=0
>
> MidplaneNodeCnt=512
> NodeCardNodeCnt=32
>
> AllowSubBlockAllocations=No
> LayoutMode=STATIC
> #LayoutMode=DYNAMIC
>
> #
> # Block Layout
> #
> ###############################################################################
> # Full-system bgblock, implicitly created
> # MP=[0000x1011] Type=TORUS # 2x1x2x2 = 8 midplanes
> ###############################################################################
> MPs=0000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1000 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=0010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1010 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=0001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1001 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=0011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> MPs=1011 Type=T,T,T,T # 1x1x1x1 = one 512-cnode block
> #MPs=0000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1000 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=0010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1010 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=0001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1001 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=0011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
> #MPs=1011 Type=SMALL 128CNBlocks=4 # 1x1x1x1 = four 128-cnode blocks
>
>
> Here you can see each midplane is specified with MPs=<midplane coordinate in 
> four dimensions> and we're telling SLURM to use the whole midplane as a
> block (by telling it that the network connection is the full TORUS in all 
> four dimensions).
> The commented-out lines at the end of the file are for a static block layout 
> where each midplane is split into four 128-cnode blocks (the smallest real 
> block our system can support because of our small number of IO nodes). I left 
> them in just to show you another example of static partitioning.
> And as the comment says, the full system block is implicitly created so it 
> doesn't need to be defined.
>
> Hopefully someone else can chime in on the "Duplicated NodeName" error.
>
> Hope that helps!
> Mark
>
>>
>> The error appears on both 2.4.1 and 2.5.0-0.pre2. The following slurm.conf, 
>> bluegene.conf and slurmctld -Dvvvvv output are for 2.5.0-0.pre2.
>>
>> * slurm.conf
>>
>>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf 
>>> |sed '/^$/d'
>>> ClusterName=bgq-pre_ga
>>> ControlMachine=bgqsn
>>> SlurmUser=slurm
>>> SlurmctldPort=6817
>>> SlurmdPort=6818
>>> AuthType=auth/munge
>>> StateSaveLocation=/tmp
>>> SlurmdSpoolDir=/tmp/slurmd
>>> SwitchType=switch/none
>>> MpiDefault=none
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> ProctrackType=proctrack/pgid
>>> CacheGroups=0
>>> ReturnToService=0
>>> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
>>> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> InactiveLimit=0
>>> MinJobAge=300
>>> KillWait=30
>>> Waittime=0
>>> SchedulerType=sched/backfill
>>> SelectType=select/bluegene
>>> FastSchedule=1
>>> DebugFlags=BGBlockPick,SelectType
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/tmp/slurm.log
>>> SlurmdDebug=3
>>> JobCompType=jobcomp/none
>>> NodeName=bgq[0000x1011] State=UNKNOWN
>>> PartitionName=DEFAULT Shared=FORCE
>>> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
>>
>> * bluegene.conf
>>
>>> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# 
>>> /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf |sed '/^$/d'
>>> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
>>> IONodesPerMP=8 # io semi-poor
>>> BridgeAPILogFile=/tmp/bridgeapi.log
>>> BridgeAPIVerbose=2
>>> DebugFlags=BGBlockPick,SelectType
>>> BasePartitionNodeCnt=512
>>> NodeCardNodeCnt=32
>>> LayoutMode=STATIC
>>
>> * slurmctld
>>
>>> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
>>> slurmctld: pidfile not locked, assuming no running daemon
>>> slurmctld: Warning: Core limit is only 0 KB
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
>>> slurmctld: Accounting storage NOT INVOKED plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: not enforcing associations and no list was given so we 
>>> are giving a blank list
>>> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
>>> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
>>> slurmctld: Munge cryptographic signature plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
>>> slurmctld: BlueGene node selection plugin loading...
>>> slurmctld: debug: Setting dimensions from slurm.conf file
>>> slurmctld: Attempting to contact MMCS
>>> slurmctld: BlueGene configured with 2122 midplanes
>>> slurmctld: debug: We are using 2122 of the system.
>>> slurmctld: BlueGene plugin loaded successfully
>>> slurmctld: BlueGene node selection plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
>>> slurmctld: preempt/none loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
>>> slurmctld: debug3: Success.
>>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
>>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug: No backup controller to shutdown
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
>>> slurmctld: switch NONE plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
>>> slurmctld: debug3: Trying to load plugin 
>>> /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
>>> slurmctld: topology NONE plugin loaded
>>> slurmctld: debug3: Success.
>>> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
>>> [jim@bgqsn 2.5.0-0.pre2]$
>>
>> Any pointers or help would be much appreciated.
>>
>> Many Thanks
>>
>> James Sweet
>>
>

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
