Hi all,
I am using Pacemaker/Corosync together with iSCSI to build a highly available server. Everything worked well at first, but two days ago errors started to appear.
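
For context, the setup follows the standard corosync 2.x layout. A minimal sketch of the general shape (the cluster name, addresses, and the second node entry here are placeholders, not my exact file):

totem {
        version: 2
        cluster_name: mycluster
        transport: udpu
}

nodelist {
        node {
                ring0_addr: node0
                nodeid: 1
        }
        node {
                # placeholder: a second node is assumed for the HA pair
                ring0_addr: node1
                nodeid: 2
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}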

When I start one of the nodes, it always shows as offline:

Last updated: Mon Aug 15 17:31:54 2016
Last change: Mon Aug 15 16:34:30 2016 via crmd on node0
Current DC: NONE
1 Nodes configured
0 Resources configured

Node node0 (1): UNCLEAN (offline)
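
For reference, the status above comes from crm_mon; these are the checks I am running (assuming the stock Pacemaker/Corosync 2.x command-line tools):

crm_mon -1                # one-shot cluster status (source of the output above)
crm_node -l               # nodes as Pacemaker sees them (id, name, state)
corosync-quorumtool -s    # quorum and membership as Corosync sees them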

In /var/log/messages I see:
Aug 15 09:25:04 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:04 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:07 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
Aug 15 09:25:09 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:10 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:12 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
Aug 15 09:25:15 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:15 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:18 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
Aug 15 09:25:20 node0 kernel: connection1:0: detected conn error (1020)
Aug 15 09:25:20 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:23 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
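
For what it's worth, the iSCSI session state can also be inspected directly like this (the target IQN in the logout example is a placeholder, not my real target):

iscsiadm -m session -P 1                              # active sessions and their connection state
iscsiadm -m node -T iqn.2016-08.example:target0 -u    # manual logout from the session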

That looks like an iSCSI error, so I stopped iSCSI and restarted Corosync. The node is still offline as before, and the log now shows the following:

Aug 15 17:32:04 node0 crmd[7208]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 15 17:32:04 node0 crmd[7208]: notice: do_lrm_control: Disconnected from the LRM
Aug 15 17:32:04 node0 crmd[7208]: notice: terminate_cs_connection: Disconnecting from Corosync
Aug 15 17:32:04 node0 crmd[7208]: error: crmd_fast_exit: Could not recover from internal error
Aug 15 17:32:04 node0 pacemakerd[7100]: error: pcmk_child_exit: Child process crmd (7208) exited: Generic Pacemaker error (201)
Aug 15 17:32:04 node0 pacemakerd[7100]: notice: pcmk_process_exit: Respawning failed child process: crmd
Aug 15 17:32:04 node0 crmd[7209]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 15 17:32:04 node0 crmd[7209]: notice: main: CRM Git Version: 368c726
Aug 15 17:32:05 node0 crmd[7209]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Aug 15 17:32:05 node0 crmd[7209]: notice: cluster_connect_quorum: Quorum acquired
Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now lost (was member)
Aug 15 17:32:05 node0 crmd[7209]: error: reap_dead_nodes: We're not part of the cluster anymore
Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_ERROR from reap_dead_nodes() received in state S_STARTING
Aug 15 17:32:05 node0 crmd[7209]: notice: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Aug 15 17:32:05 node0 crmd[7209]: warning: do_recover: Fast-tracking shutdown in response to errors
Aug 15 17:32:05 node0 crmd[7209]: error: do_started: Start cancelled... S_RECOVERY
Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Aug 15 17:32:05 node0 crmd[7209]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
Aug 15 17:32:05 node0 crmd[7209]: notice: do_lrm_control: Disconnected from the LRM
Aug 15 17:32:05 node0 crmd[7209]: notice: terminate_cs_connection: Disconnecting from Corosync
Aug 15 17:32:05 node0 crmd[7209]: error: crmd_fast_exit: Could not recover from internal error
Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_child_exit: Child process crmd (7209) exited: Generic Pacemaker error (201)
Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_process_exit: Child respawn count exceeded by crmd
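
The log shows the node joining as a member and immediately being declared lost, after which reap_dead_nodes aborts the start. To dig further I can compare how Corosync and Pacemaker each see the membership, in case there is a node name/ID mismatch (commands assume corosync 2.x):

corosync-cmapctl | grep members   # membership as Corosync reports it
crm_node -n                       # local node name as Pacemaker resolves it
crm_node -i                       # local node id
crm_node -l                       # all nodes Pacemaker knows about

Any pointers on what to check next would be appreciated.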