Mark,
Look at get_up_time() in src/slurmd/slurmd/get_mach_stat.c. The logic
is very much operating-system dependent, and it seems not to work on
your system. If you can figure out how to make it work on your system
and send a patch, that would be great.
The idea behind the logic is that if a node running jobs reboots, then
all of its processes are killed and slurm needs to clean up those jobs.
Moe
Quoting Mark Nelson <mdnels...@gmail.com>:
Hi there,
I'm playing around with SLURM 2.4-pre2 emulating a Blue Gene/P, and
I'm having a strange issue: when slurmd connects to slurmctld, the
front-end node is immediately marked DOWN with the reason
"Front end unexpectedly rebooted".
I've got all three SLURM daemons running on one machine and below is
the output from slurmctld and slurmd:
slurm-dev:~# slurmctld -D -vvv
slurmctld: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
slurmctld: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmctld: debug: slurmdbd: Sent DbdInit msg
slurmctld: slurmdbd: recovered 0 pending RPCs
slurmctld: debug2: user markn default acct is ibm
slurmctld: debug2: user swail default acct is ibm
slurmctld: debug2: user bjpop default acct is vlsci
slurmctld: debug2: user samuel default acct is vlsci
slurmctld: debug2: user brian default acct is vpac
slurmctld: debug2: user root default acct is root
slurmctld: slurmctld version 2.4.0-pre2 started on cluster tambo
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: BlueGene node selection plugin loading...
slurmctld: debug: Setting dimensions from slurm.conf file
slurmctld: Attempting to contact MMCS
slurmctld: BlueGene configured with 122 midplanes
slurmctld: debug: We are using 122 of the system.
slurmctld: BlueGene plugin loaded successfully
slurmctld: BlueGene node selection plugin loaded
slurmctld: preempt/none loaded
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: Job accounting gather LINUX plugin loaded
slurmctld: debug: No backup controller to shutdown
slurmctld: switch NONE plugin loaded
slurmctld: topology 3d_torus plugin loaded
slurmctld: debug: No DownNodes
slurmctld: debug2: partition main does not allow root jobs
slurmctld: debug2: partition filler does not allow root jobs
slurmctld: jobcomp/script plugin loaded init
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug2: ba_update_mp_state: new state of [000] is UNKNOWN
slurmctld: debug2: ba_update_mp_state: new state of [001] is UNKNOWN
slurmctld: debug2: ba_update_mp_state: new state of [010] is UNKNOWN
slurmctld: debug2: ba_update_mp_state: new state of [011] is UNKNOWN
slurmctld: Recovered state of 4 nodes
slurmctld: Recovered state of 1 front_end nodes
slurmctld: Recovered information about 0 jobs
slurmctld: debug: bluegene: select_p_state_restore
slurmctld: Recovered 0 blocks
slurmctld: No blocks created until jobs are submitted
slurmctld: debug: Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: debug2: Sending cpu count of 8192 for cluster
slurmctld: debug: Priority MULTIFACTOR plugin loaded
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: name:slurm-dev boot_time:1327379842 up_time:0
slurmctld: debug2: ba_update_mp_state: new state of [000] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [001] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [010] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [011] is IDLE
slurmctld: debug: Nodes bgp[000x011] have registered
slurmctld: debug2: _slurm_rpc_node_registration complete for slurm-dev usec=19682
slurmctld: debug: Spawning registration agent for slurm-dev 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type 1001
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got them all
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
slurmctld: debug2: name:slurm-dev boot_time:1327379842 up_time:0
slurmctld: Front end slurm-dev unexpectedly rebooted
slurmctld: debug2: ba_update_mp_state: new state of [000] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [001] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [010] is IDLE
slurmctld: debug2: ba_update_mp_state: new state of [011] is IDLE
slurmctld: debug: Nodes bgp[000x011] have registered
slurmctld: debug2: _slurm_rpc_node_registration complete for slurm-dev usec=19573
slurmctld: debug2: node_did_resp slurm-dev
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: _slurm_rpc_dump_front_end, size=92 usec=19
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug: sched: schedule() returning, no front end nodes are available
slurm-dev:~# slurmd -D -vvv
slurmd: debug: siblings is 4 (> 1), ignored
slurmd: debug: cores is 4 (> 1), ignored
slurmd: topology 3d_torus plugin loaded
slurmd: task NONE plugin loaded
slurmd: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 2.4.0-pre2 started
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Tue 24 Jan 2012 15:37:22 +1100
slurmd: Procs=1 Sockets=1 Cores=1 Threads=1 Memory=1536 TmpDisk=10240 Uptime=0
slurmd: debug2: got this type of message 1001
slurm-dev:~# scontrol show frontend
FrontendName=slurm-dev State=DOWN Reason=Front end unexpectedly rebooted [slurm@2012-01-24T15:34:56]
BootTime=2012-01-24T15:37:24 SlurmdStartTime=2012-01-24T15:37:22
I can successfully run "scontrol update frontendname=slurm-dev
state=resume", which leads to:
slurmctld: debug2: Processing RPC: REQUEST_UPDATE_FRONT_END from uid=0
slurmctld: update_front_end: set state of slurm-dev to IDLE
slurmctld: debug2: _slurm_rpc_update_front_end complete for slurm-dev usec=93
and a happy (and IDLE) frontend node:
slurm-dev:~# scontrol show frontend
FrontendName=slurm-dev State=IDLE Reason=(null)
BootTime=2012-01-24T15:37:24 SlurmdStartTime=2012-01-24T15:37:22
but I'm wondering what's causing this. The SLURM config file is
attached.
Any help would be greatly appreciated.
Thanks!
Mark.