Hi there,

I'm playing around with SLURM 2.4-pre2 emulating a Blue Gene/P, and I'm seeing a strange issue: as soon as slurmd connects to slurmctld, the front end node is immediately marked DOWN with the reason "Front end unexpectedly rebooted".

I've got all three SLURM daemons running on one machine; below is the output from slurmctld and slurmd:

slurm-dev:~# slurmctld -D -vvv
slurmctld: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
slurmctld: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmctld: debug:  slurmdbd: Sent DbdInit msg
slurmctld: slurmdbd: recovered 0 pending RPCs
slurmctld: debug2: user markn default acct is ibm
slurmctld: debug2: user swail default acct is ibm
slurmctld: debug2: user bjpop default acct is vlsci
slurmctld: debug2: user samuel default acct is vlsci
slurmctld: debug2: user brian default acct is vpac
slurmctld: debug2: user root default acct is root
slurmctld: slurmctld version 2.4.0-pre2 started on cluster tambo
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: BlueGene node selection plugin loading...
slurmctld: debug:  Setting dimensions from slurm.conf file
slurmctld: Attempting to contact MMCS
slurmctld: BlueGene configured with 122 midplanes
slurmctld: debug:  We are using 122 of the system.
slurmctld: BlueGene plugin loaded successfully
slurmctld: BlueGene node selection plugin loaded
slurmctld: preempt/none loaded
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: Job accounting gather LINUX plugin loaded
slurmctld: debug:  No backup controller to shutdown
slurmctld: switch NONE plugin loaded
slurmctld: topology 3d_torus plugin loaded
slurmctld: debug:  No DownNodes
slurmctld: debug2: partition main does not allow root jobs
slurmctld: debug2: partition filler does not allow root jobs
slurmctld: jobcomp/script plugin loaded init
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug2: ba_update_mp_state: new state of [000] is UNKNOWN
slurmctld: debug2: ba_update_mp_state: new state of [001] is UNKNOWN
slurmctld: debug2: ba_update_mp_state: new state of [010] is UNKNOWN
slurmctld: debug2: ba_update_mp_state: new state of [011] is UNKNOWN
slurmctld: Recovered state of 4 nodes
slurmctld: Recovered state of 1 front_end nodes
slurmctld: Recovered information about 0 jobs
slurmctld: debug:  bluegene: select_p_state_restore
slurmctld: Recovered 0 blocks
slurmctld: No blocks created until jobs are submitted
slurmctld: debug:  Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: debug2: Sending cpu count of 8192 for cluster
slurmctld: debug:  Priority MULTIFACTOR plugin loaded
slurmctld: debug:  power_save module disabled, SuspendTime < 0
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: name:slurm-dev boot_time:1327379842 up_time:0
slurmctld: debug2: ba_update_mp_state: new state of [000] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [001] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [010] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [011] is IDLE
slurmctld: debug:  Nodes bgp[000x011] have registered
slurmctld: debug2: _slurm_rpc_node_registration complete for slurm-dev usec=19682
slurmctld: debug:  Spawning registration agent for slurm-dev 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type 1001
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: name:slurm-dev boot_time:1327379842 up_time:0
slurmctld: Front end slurm-dev unexpectedly rebooted
slurmctld: debug2: ba_update_mp_state: new state of [000] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [001] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [010] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [011] is IDLE
slurmctld: debug:  Nodes bgp[000x011] have registered
slurmctld: debug2: _slurm_rpc_node_registration complete for slurm-dev usec=19573
slurmctld: debug2: node_did_resp slurm-dev
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: _slurm_rpc_dump_front_end, size=92 usec=19
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: schedule() returning, no front end nodes are available


slurm-dev:~# slurmd -D -vvv
slurmd: debug:  siblings is 4 (> 1), ignored
slurmd: debug:  cores is 4 (> 1), ignored
slurmd: topology 3d_torus plugin loaded
slurmd: task NONE plugin loaded
slurmd: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 2.4.0-pre2 started
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Tue 24 Jan 2012 15:37:22 +1100
slurmd: Procs=1 Sockets=1 Cores=1 Threads=1 Memory=1536 TmpDisk=10240 Uptime=0
slurmd: debug2: got this type of message 1001


slurm-dev:~# scontrol show frontend
FrontendName=slurm-dev State=DOWN Reason=Front end unexpectedly rebooted [slurm@2012-01-24T15:34:56]
   BootTime=2012-01-24T15:37:24 SlurmdStartTime=2012-01-24T15:37:22
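
One thing I notice above is that slurmd reports Uptime=0 and both registrations show up_time:0. My guess (purely from reading the logs, I haven't dug into the code) is that slurmctld derives the node's boot time from the reported uptime, so with an uptime of 0 the boot time is effectively the slurmd start time, and every slurmd restart then looks like the front end itself rebooted. The sort of check I'm imagining is sketched below; this is not the actual SLURM source and all of the names are made up for illustration:

/* Sketch only: my guess at the kind of check slurmctld might perform on a
 * front end registration.  Not the real SLURM code; every name here is
 * hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct front_end_rec {
    time_t boot_time;   /* boot time remembered from the previous registration */
};

static void handle_registration(struct front_end_rec *fe, uint32_t up_time)
{
    time_t now = time(NULL);
    time_t boot_time = now - up_time;   /* up_time == 0 means boot_time == now */

    if (fe->boot_time && (boot_time > fe->boot_time)) {
        /* The node claims a newer boot time than the one we recorded,
         * so it looks like it rebooted behind our back. */
        printf("Front end unexpectedly rebooted\n");
        /* presumably the node is then marked DOWN with that reason */
    }
    fe->boot_time = boot_time;
}

If that's right, the Uptime=0 that slurmd prints at startup would be the real culprit rather than anything front end specific, but again that's just a guess.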



I can successfully resume the front end node with an scontrol update:
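
slurm-dev:~# scontrol update frontendname=slurm-dev state=resume

which leads to this in slurmctld: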
slurmctld: debug2: Processing RPC: REQUEST_UPDATE_FRONT_END from uid=0
slurmctld: update_front_end: set state of slurm-dev to IDLE
slurmctld: debug2: _slurm_rpc_update_front_end complete for slurm-dev usec=93

and a happy (and IDLE) frontend node:

slurm-dev:~# scontrol show frontend
FrontendName=slurm-dev State=IDLE Reason=(null)
   BootTime=2012-01-24T15:37:24 SlurmdStartTime=2012-01-24T15:37:22

but I'm just wondering what's causing this in the first place. The SLURM config file is included below.

Any help would be greatly appreciated.

Thanks!
Mark.
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-dev
ControlAddr=slurm-dev
AuthType=auth/munge
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=NO
Epilog=/usr/local/slurm/etc/bgepilog.sh
#PrologSlurmctld= 
#FirstJobId=1 
#JobCheckpointDir=/var/slurm/checkpoint 
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0 
#JobRequeue=1 
#KillOnBadExit=0 
#Licenses=foo*4,bar 
MailProg=/usr/bin/mail 
#MaxTasksPerNode=128 
MpiDefault=none
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/linuxproc
Prolog=/usr/local/slurm/etc/bgprolog.sh
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
ReturnToService=1
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/jobs
TaskEpilog=/usr/local/slurm/etc/taskepilog.sh
TaskProlog=/usr/local/slurm/etc/taskprolog.sh
TopologyPlugin=topology/3d_torus
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=6
#TmpFs=/tmp 
#TrackWCKey=no 
#TreeWidth= 
# 
# 
# TIMERS 
BatchStartTimeout=600 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
InactiveLimit=0
KillWait=120
MessageTimeout=30 
#ResvOverRun=0 
MinJobAge=300
MaxJobCount=5000
#OverTimeLimit=0 
# 
# 
# SCHEDULING 
#SchedulerRootFilter=1 
#SchedulerType=sched/wiki2
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/bluegene
# 
# 
# JOB PRIORITY 
PriorityType=priority/multifactor
# Set PriorityDecayHalfLife to 0 to enforce hard limits per association
PriorityDecayHalfLife=0
PriorityCalcPeriod=00:05:00
PriorityFavorSmall=NO
PriorityMaxAge=7-0
# Reset the usage period every quarter
PriorityUsageResetPeriod=DAILY
PriorityWeightAge=1000
#PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=10000
#PriorityWeightQOS=0 # don't use the qos factor
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageLoc=/var/log/slurm/accounting
AccountingStorageEnforce=associations,limits
AccountingStorageHost=slurm-dev
ClusterName=tambo
#DebugFlags=
JobCompLoc=/usr/local/slurm/etc/jobcompletion.sh
JobCompType=jobcomp/script
JobAcctGatherType=jobacct_gather/linux
# disable periodic job sampling - accounting on job termination only
JobAcctGatherFrequency=0
# 
# FRONTEND NODES
#FrontendName=DEFAULT
FrontendName=slurm-dev FrontendAddr=slurm-dev
# 
# COMPUTE NODES 
NodeName=DEFAULT Procs=2048 RealMemory=2097152 State=UNKNOWN 
NodeName=bgp[000x011] NodeAddr=slurm-dev NodeHostname=slurm-dev

PartitionName=DEFAULT Shared=FORCE DefaultTime=0:10:0 
PartitionName=main Nodes=bgp[000x011] Default=YES State=UP Priority=50000
PartitionName=filler Nodes=bgp[000x011] Default=NO State=UP Priority=100
