Hello,

I am currently trying to set up Slurm to run on our 4-frame, 8-midplane BlueGene/Q.
I am quite new to BlueGene, but I have an existing background in HPC
sysadmin; even so, I'm finding the configuration of Slurm on BG/Q a bit confusing.

Currently I have 8 blocks already set up in MMCS and would like to get
Slurm to manage each of these 8 separate resources. The reason behind this
is that we wish to map any application failures to a specific midplane for
hardware fault-finding, as we are still in the process of cabling and
configuring the racks.

I have read through the BG-specific docs, but I am at a loss as to how to set up
my bluegene.conf and slurm.conf files to achieve a single 'queue' for
each of the 8 midplanes. Running smap -Dc doesn't seem to generate a
bluegene.conf file, and I have tried running slurmctld to see if the verbose
output will help:

[jsweet@bgqsn ~]$ /opt/slurm/2.3.3/sbin/slurmctld -D -f 
/opt/slurm/2.3.3/etc/slurm.conf -vvvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are 
giving a blank list
slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
slurmctld: slurmctld version 2.3.3 started on cluster bgq
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/select_bluegene.so
slurmctld: BlueGene node selection plugin loading...
slurmctld: debug:  Setting dimensions from slurm.conf file
slurmctld: Attempting to contact MMCS
slurmctld: BlueGene configured with 2122 midplanes
slurmctld: debug:  We are using 1112 of the system.
slurmctld: BlueGene plugin loaded successfully
slurmctld: BlueGene node selection plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  No backup controller to shutdown
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Prefix is bgq bgq[0000x0001] 4
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  No DownNodes
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin 
/opt/slurm/2.3.3/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: error: read_slurm_conf: default partition not set.
slurmctld: error: Could not open node state file /tmp/node_state: No such file 
or directory
slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open front_end state file /tmp/front_end_state: No 
such file or directory
slurmctld: error: NOTE: Trying backup front_end_state save file. Information 
may be lost!
slurmctld: No node state file (/tmp/front_end_state.old) to recover
slurmctld: error: Incomplete front_end node data checkpoint file
slurmctld: Recovered state of 0 front_end nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such file or 
directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: error: hostlist.c:1727 Invalid range: `000x001': Invalid argument
slurmctld: hostlist.c:3069: hostlist_ranged_string_dims: Assertion `hl != 
((void *)0)' failed.

**bluegene.conf**

MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
LayoutMode=STATIC
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
Numpsets=8 #used for IO poor systems (Can't create 32 cnode blocks)
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=0
BPs=[000x001] Type=TORUS # 1x1x1 = 4-32 c-node blocks 3-128 c-node blocks

**slurm.conf**

ClusterName=bgq
ControlMachine=bgqsn
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=0
Prolog=/opt/slurm/2.3.3/etc/bg_prolog
Epilog=/opt/slurm/2.3.3/etc/bg_epilog
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SelectType=select/bluegene
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
FrontEndName=bgqsn State=UNKNOWN
NodeName=bgq[0000x0001] CPUS=9216 State=UNKNOWN
PartitionName=R00-M0



What changes do I need to make to my bluegene.conf and slurm.conf files to get
the 8-'queue' setup that I'm looking for?
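
For reference, my current best guess, pieced together from the docs, is
something like the following. The midplane coordinates and partition names here
are purely illustrative, since I'm not sure of the correct 4-digit coordinates
for our system, and I don't know whether single-midplane BPs lines are even the
right approach:

```
# bluegene.conf (guess) -- one static block per midplane;
# coordinates below are placeholders, not our real layout
BPs=[0000] Type=TORUS
BPs=[0001] Type=TORUS
# ...one line per midplane, 8 in total...

# slurm.conf (guess) -- one partition per midplane
PartitionName=R00-M0 Nodes=bgq[0000] Default=YES State=UP
PartitionName=R00-M1 Nodes=bgq[0001] State=UP
# ...one partition per midplane, 8 in total...
```

If this is roughly the right shape, pointers on the correct coordinate syntax
would be much appreciated.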

Thanks for your help.

James
-- 
James Sweet
ACF Systems Administrator

EPCC,
School of Physics,
The University of Edinburgh,
James Clerk Maxwell Building,
Mayfield Road,
Edinburgh. EH9 3JZ.
Tel: 0131 445 7831


The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
