Hi,

I am trying to configure Slurm to run on a 4-rack BG/Q system and am getting
stuck creating a config that slurmctld accepts. I have already created
blocks in MMCS for both a whole rack and each individual midplane. To
start, I would like to create static blocks in Slurm that map to the
rack/midplane blocks in MMCS. I have read through the Slurm BlueGene admin
guide, but I'm unsure how to fix my config to resolve the "Duplicated
NodeName" error I am seeing when I run slurmctld in debug mode.

The error appears on both 2.4.1 and 2.5.0-0.pre2. The following slurm.conf,
bluegene.conf and slurmctld -Dvvvvv output are for 2.5.0-0.pre2.

* slurm.conf

> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/slurm.conf | sed '/^$/d'
> ClusterName=bgq-pre_ga
> ControlMachine=bgqsn
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/tmp
> SlurmdSpoolDir=/tmp/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/pgid
> CacheGroups=0
> ReturnToService=0
> Prolog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_prolog
> Epilog=/opt/slurm/2.5.0-0.pre2/sbin/slurm_epilog
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/bluegene
> FastSchedule=1
> DebugFlags=BGBlockPick,SelectType
> SlurmctldDebug=3
> SlurmctldLogFile=/tmp/slurm.log
> SlurmdDebug=3
> JobCompType=jobcomp/none
> NodeName=bgq[0000x1011] State=UNKNOWN
> PartitionName=DEFAULT Shared=FORCE
> PartitionName=pbatch State=UP Nodes=bgq[0000x1011] Default=Yes
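
For reference, this is how I read the node range in the NodeName and
PartitionName lines above. It is only a minimal sketch, assuming the "x"
notation describes a box between two corner coordinates with one digit per
dimension; the helper expand_box is hypothetical and not part of Slurm.

```python
from itertools import product

# Coordinate digits, assuming Slurm's base-36 convention (0-9, then A-Z).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def expand_box(prefix, start, end):
    """Expand prefix[start x end] into every coordinate inside the box.

    Illustrative helper only: start and end are the two corners of the
    box, one digit per dimension.
    """
    if len(start) != len(end):
        raise ValueError("corners must have the same number of dimensions")
    per_dim = []
    for lo, hi in zip(start, end):
        lo_i, hi_i = DIGITS.index(lo), DIGITS.index(hi)
        if lo_i > hi_i:
            raise ValueError("start corner must not exceed end corner")
        # All digits this dimension can take, inclusive of both corners.
        per_dim.append(DIGITS[lo_i:hi_i + 1])
    return [prefix + "".join(coord) for coord in product(*per_dim)]


names = expand_box("bgq", "0000", "1011")
print(len(names))  # 8
print(names)
```

Under that reading, bgq[0000x1011] covers 2 x 1 x 2 x 2 = 8 midplanes
(bgq0000 through bgq1011), which is what I would expect for 4 racks of 2
midplanes each, and each name appears only once.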

* bluegene.conf

> [jim@bgqsn 2.5.0-0.pre2]$ grep -v ^# /opt/slurm/2.5.0-0.pre2/etc/bluegene.conf | sed '/^$/d'
> MloaderImage=/bgsys/drivers/ppcfloor/boot/uloader
> IONodesPerMP=8 # io semi-poor
> BridgeAPILogFile=/tmp/bridgeapi.log
> BridgeAPIVerbose=2
> DebugFlags=BGBlockPick,SelectType
> BasePartitionNodeCnt=512
> NodeCardNodeCnt=32
> LayoutMode=STATIC

* slurmctld

> [jim@bgqsn 2.5.0-0.pre2]$ sudo ./sbin/slurmctld -Dvvvvv
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: Warning: Core limit is only 0 KB
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/accounting_storage_none.so
> slurmctld: Accounting storage NOT INVOKED plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: not enforcing associations and no list was given so we are 
> giving a blank list
> slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
> slurmctld: slurmctld version 2.5.0-pre2 started on cluster bgq-pre_ga
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/crypto_munge.so
> slurmctld: Munge cryptographic signature plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/select_bluegene.so
> slurmctld: BlueGene node selection plugin loading...
> slurmctld: debug:  Setting dimensions from slurm.conf file
> slurmctld: Attempting to contact MMCS
> slurmctld: BlueGene configured with 2122 midplanes
> slurmctld: debug:  We are using 2122 of the system.
> slurmctld: BlueGene plugin loaded successfully
> slurmctld: BlueGene node selection plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/preempt_none.so
> slurmctld: preempt/none loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/checkpoint_none.so
> slurmctld: debug3: Success.
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/jobacct_gather_none.so
> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug:  No backup controller to shutdown
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/switch_none.so
> slurmctld: switch NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: debug3: Prefix is bgq bgq[0000x1011] 4
> slurmctld: debug3: Trying to load plugin 
> /opt/slurm/2.5.0-0.pre2/lib/slurm/topology_none.so
> slurmctld: topology NONE plugin loaded
> slurmctld: debug3: Success.
> slurmctld: fatal: Duplicated NodeName bgq0000 in the config file
> [jim@bgqsn 2.5.0-0.pre2]$

Any pointers or help would be much appreciated.

Many Thanks

James Sweet

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
