Maybe you should attach the Pacemaker log file (/var/log/pacemaker.log). By the way, is your network running well?
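As a quick triage sketch (the sample file and its path are purely illustrative here, reproducing two lines from your quoted report): counting how often the 1020 error recurs in the system log can tell you whether the iSCSI session is still flapping, which would usually point at the network path to the target. On a live node you would grep the real /var/log/messages instead of the sample.

```shell
#!/bin/sh
# Illustrative sample: two lines copied from the quoted /var/log/messages.
# On a real node, grep /var/log/messages directly instead.
cat > /tmp/messages.sample <<'EOF'
Aug 15 09:25:04 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
Aug 15 09:25:10 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
EOF

# Count how many times the connection-close error (1020) was logged.
grep -c 'ISCSI_ERR_TCP_CONN_CLOSE' /tmp/messages.sample
```

If that count keeps growing while the cluster is up, I would look at the network before suspecting Pacemaker itself.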
On Tue, 2016-08-16 at 09:35 +0800, 刘明 wrote:
> Hi all,
> I am using pacemaker/corosync and iscsi to build a highly available
> server. At the beginning it worked very well, but two days ago an
> error appeared.
>
> When I start one node, it is always offline:
>
> Last updated: Mon Aug 15 17:31:54 2016
> Last change: Mon Aug 15 16:34:30 2016 via crmd on node0
> Current DC: NONE
> 1 Nodes configured
> 0 Resources configured
>
> Node node0 (1): UNCLEAN (offline)
>
> In the log /var/log/messages:
>
> Aug 15 09:25:04 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:04 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:07 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> Aug 15 09:25:09 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:10 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:12 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> Aug 15 09:25:15 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:15 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:18 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
> Aug 15 09:25:20 node0 kernel: connection1:0: detected conn error (1020)
> Aug 15 09:25:20 node0 iscsid: Kernel reported iSCSI connection 1:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (3)
> Aug 15 09:25:23 node0 iscsid: connection1:0 is operational after recovery (1 attempts)
>
> That looks like an iscsi error. Then I stopped iscsi and restarted
> corosync, but the node is still offline as before, and the log is as
> follows:
>
> Aug 15 17:32:04 node0 crmd[7208]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
> Aug 15 17:32:04 node0 crmd[7208]: notice: do_lrm_control: Disconnected from the LRM
> Aug 15 17:32:04 node0 crmd[7208]: notice: terminate_cs_connection: Disconnecting from Corosync
> Aug 15 17:32:04 node0 crmd[7208]: error: crmd_fast_exit: Could not recover from internal error
> Aug 15 17:32:04 node0 pacemakerd[7100]: error: pcmk_child_exit: Child process crmd (7208) exited: Generic Pacemaker error (201)
> Aug 15 17:32:04 node0 pacemakerd[7100]: notice: pcmk_process_exit: Respawning failed child process: crmd
> Aug 15 17:32:04 node0 crmd[7209]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
> Aug 15 17:32:04 node0 crmd[7209]: notice: main: CRM Git Version: 368c726
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Aug 15 17:32:05 node0 crmd[7209]: notice: cluster_connect_quorum: Quorum acquired
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now member (was (null))
> Aug 15 17:32:05 node0 crmd[7209]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node0[1] - state is now lost (was member)
> Aug 15 17:32:05 node0 crmd[7209]: error: reap_dead_nodes: We're not part of the cluster anymore
> Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_ERROR from reap_dead_nodes() received in state S_STARTING
> Aug 15 17:32:05 node0 crmd[7209]: notice: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
> Aug 15 17:32:05 node0 crmd[7209]: warning: do_recover: Fast-tracking shutdown in response to errors
> Aug 15 17:32:05 node0 crmd[7209]: error: do_started: Start cancelled... S_RECOVERY
> Aug 15 17:32:05 node0 crmd[7209]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Aug 15 17:32:05 node0 crmd[7209]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
> Aug 15 17:32:05 node0 crmd[7209]: notice: do_lrm_control: Disconnected from the LRM
> Aug 15 17:32:05 node0 crmd[7209]: notice: terminate_cs_connection: Disconnecting from Corosync
> Aug 15 17:32:05 node0 crmd[7209]: error: crmd_fast_exit: Could not recover from internal error
> Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_child_exit: Child process crmd (7209) exited: Generic Pacemaker error (201)
> Aug 15 17:32:05 node0 pacemakerd[7100]: error: pcmk_process_exit: Child respawn count exceeded by crmd
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org