Re: [ClusterLabs] data loss of network would cause Pacemaker exit abnormally
On 08/30/2016 01:58 PM, chenhj wrote:
> Hi,
>
> This is a continuation of the email below (I did not subscribe to this
> mailing list):
>
> http://clusterlabs.org/pipermail/users/2016-August/003838.html
>
>> From the above, I suspect that the node with the network loss was the
>> DC, and from its point of view, it was the other node that went away.
>
> Yes, the node with the network loss was the DC (node2).
>
> Could someone explain what the following messages mean, and why the
> pacemakerd process exits instead of rejoining the CPG group?
>
>> Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_membership:
>>     We're not part of CPG group 'pacemakerd' anymore!

This means the node was kicked out of the membership. I don't remember
what that implies; I'm guessing the node exits because the cluster will
most likely fence it after kicking it out.

>>> [root@node3 ~]# rpm -q corosync
>>> corosync-1.4.1-7.el6.x86_64
>>
>> That is quite old ...
>>
>>> [root@node3 ~]# cat /etc/redhat-release
>>> CentOS release 6.3 (Final)
>>> [root@node3 ~]# pacemakerd -F
>>> Pacemaker 1.1.14-1.el6 (Build: 70404b0)
>>
>> and I doubt that many people have tested Pacemaker 1.1.14 against
>> corosync 1.4.1 ... quite far away from each other release-wise ...
>
> pacemaker 1.1.14 + corosync-1.4.7 can also reproduce this problem, but
> seemingly with lower probability.

The corosync 2 series is a major improvement, but some config changes
are necessary.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
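[Editor's note: to make the "some config changes are necessary" remark concrete, here is a minimal sketch of what the poster's udpu setup might look like under corosync 2.x. The nodelist replaces the interface/member list, and votequorum replaces the old pacemaker service plugin. The addresses mirror the corosync.conf quoted later in this thread; the cluster_name is an assumption, not taken from the thread.]

```
totem {
    version: 2
    cluster_name: hacluster    # assumed name, not from the thread
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 192.168.125.134
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.125.129
        nodeid: 2
    }
    node {
        ring0_addr: 192.168.125.135
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: no
}
```

Note that with corosync 2.x there is no `service { }` plugin block: Pacemaker runs as its own service and is started separately, after corosync.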
[ClusterLabs] data loss of network would cause Pacemaker exit abnormally
Hi,

This is a continuation of the email below (I did not subscribe to this
mailing list):

http://clusterlabs.org/pipermail/users/2016-August/003838.html

>> From the above, I suspect that the node with the network loss was the
>> DC, and from its point of view, it was the other node that went away.

Yes, the node with the network loss was the DC (node2).

Could someone explain what the following messages mean, and why the
pacemakerd process exits instead of rejoining the CPG group?

> Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_membership:
>     We're not part of CPG group 'pacemakerd' anymore!

>> [root@node3 ~]# rpm -q corosync
>> corosync-1.4.1-7.el6.x86_64
>
> That is quite old ...
>
>> [root@node3 ~]# cat /etc/redhat-release
>> CentOS release 6.3 (Final)
>> [root@node3 ~]# pacemakerd -F
>> Pacemaker 1.1.14-1.el6 (Build: 70404b0)
>
> and I doubt that many people have tested Pacemaker 1.1.14 against
> corosync 1.4.1 ... quite far away from each other release-wise ...

pacemaker 1.1.14 + corosync-1.4.7 can also reproduce this problem, but
seemingly with lower probability.
Re: [ClusterLabs] data loss of network would cause Pacemaker exit abnormally
On 08/27/2016 09:15 PM, chenhj wrote:
> Hi all,
>
> When I use the following command to simulate data loss of network at one
> member of my 3-node Pacemaker+Corosync cluster, it sometimes causes
> Pacemaker on another node to exit.
>
> tc qdisc add dev eth2 root netem loss 90%
>
> Is there any method to avoid this problem?
>
> [root@node3 ~]# ps -ef|grep pacemaker
> root     32540     1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189      32542     1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
> root     33491 11491  0 00:58 pts/1    00:00:00 grep pacemaker
>
> /var/log/cluster/corosync.log
>
> Aug 27 12:33:59 [46855] node3        cib:     info: cib_process_request:
>     Completed cib_modify operation for section status: OK (rc=0,
>     origin=local/attrd/230, version=10.657.19)
> Aug 27 12:33:59 corosync [CPG   ] chosen downlist: sender r(0)
>     ip(192.168.125.129) ; members(old:2 left:1)
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
>     Node 2172496064 joined group pacemakerd (counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
>     Node 2172496064 still member of group pacemakerd (peer=node2,
>     counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_update_peer_proc:
>     pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
>     Node 2273159360 still member of group pacemakerd (peer=node3,
>     counter=12.1)
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_cs_flush:
>     Sent 0 CPG messages (1 remaining, last=19): Try again (6)
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
>     Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_update_peer_proc:
>     pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
>     Node 2172496064 still member of group pacemakerd (peer=node2,
>     counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_membership:
>     We're not part of CPG group 'pacemakerd' anymore!
> Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_dispatch:
>     Evicted from CPG membership

From the above, I suspect that the node with the network loss was the
DC, and from its point of view, it was the other node that went away.

Proper quorum and fencing configuration should prevent this from being
an issue. Once the one node sees heavy network loss, the other node(s)
should fence it before it causes too many problems.

> Aug 27 12:33:59 [46849] node3 pacemakerd:    error: mcp_cpg_destroy:
>     Connection destroyed
> Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_xml_cleanup:
>     Cleaning up memory from libxml2
> Aug 27 12:33:59 [46858] node3      attrd:    error: crm_ipc_read:
>     Connection to pacemakerd failed
> Aug 27 12:33:59 [46858] node3      attrd:    error: mainloop_gio_callback:
>     Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
> Aug 27 12:33:59 [46858] node3      attrd:     crit: attrd_cs_destroy:
>     Lost connection to Corosync service!
> Aug 27 12:33:59 [46858] node3      attrd:   notice: main: Exiting...
> Aug 27 12:33:59 [46858] node3      attrd:   notice: main:
>     Disconnecting client 0x12579a0, pid=46860...
> Aug 27 12:33:59 [46858] node3      attrd:    error: attrd_cib_connection_destroy:
>     Connection to the CIB terminated...
> Aug 27 12:33:59 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd
>     (conn=0x1955f80, async-conn=0x1955f80) left
> Aug 27 12:33:59 [46856] node3 stonith-ng:    error: crm_ipc_read:
>     Connection to pacemakerd failed
> Aug 27 12:33:59 [46856] node3 stonith-ng:    error: mainloop_gio_callback:
>     Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
> Aug 27 12:33:59 [46856] node3 stonith-ng:    error: stonith_peer_cs_destroy:
>     Corosync connection terminated
> Aug 27 12:33:59 [46856] node3 stonith-ng:     info: stonith_shutdown:
>     Terminating with 1 clients
> Aug 27 12:33:59 [46856] node3 stonith-ng:     info: cib_connection_destroy:
>     Connection to the CIB closed.
> ...
>
> please see corosynclog.txt for details of the log
>
> [root@node3 ~]# cat /etc/corosync/corosync.conf
> totem {
>         version: 2
>         secauth: off
>         interface {
>                 member {
>                         memberaddr: 192.168.125.134
>                 }
>                 member {
>                         memberaddr: 192.168.125.129
>                 }
>                 member {
>                         memberaddr: 192.168.125.135
>                 }
>
>                 ringnumber: 0
>                 bindnetaddr: 192.168.125.135
>                 mcastport: 5405
>
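[Editor's note: to make the "proper quorum and fencing configuration" advice above concrete, here is a hedged sketch using pcs syntax. The fence agent, device name, address, and credentials below are placeholders for illustration, not details from this thread; substitute the agent that matches your hardware.]

```
# Stop resources when quorum is lost, and keep fencing enabled
# (stonith-enabled defaults to true, but make it explicit).
pcs property set no-quorum-policy=stop
pcs property set stonith-enabled=true

# Register a fence device for node2. fence_ipmilan and every value
# below are placeholders -- adjust for your environment.
pcs stonith create fence-node2 fence_ipmilan \
    ipaddr=192.168.125.200 login=admin passwd=secret \
    pcmk_host_list=node2

# Verify the configuration.
pcs stonith show
pcs property list
```

With this in place, when one node starts dropping most of its traffic, the remaining quorate partition should fence it rather than letting the flapping membership evict healthy nodes' daemons.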
[ClusterLabs] data loss of network would cause Pacemaker exit abnormally
Hi all,

When I use the following command to simulate data loss of network at one
member of my 3-node Pacemaker+Corosync cluster, it sometimes causes
Pacemaker on another node to exit.

tc qdisc add dev eth2 root netem loss 90%

Is there any method to avoid this problem?

[root@node3 ~]# ps -ef|grep pacemaker
root     32540     1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
189      32542     1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
root     33491 11491  0 00:58 pts/1    00:00:00 grep pacemaker

/var/log/cluster/corosync.log

Aug 27 12:33:59 [46855] node3        cib:     info: cib_process_request:
    Completed cib_modify operation for section status: OK (rc=0,
    origin=local/attrd/230, version=10.657.19)
Aug 27 12:33:59 corosync [CPG   ] chosen downlist: sender r(0)
    ip(192.168.125.129) ; members(old:2 left:1)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
    Node 2172496064 joined group pacemakerd (counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
    Node 2172496064 still member of group pacemakerd (peer=node2,
    counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_update_peer_proc:
    pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
    Node 2273159360 still member of group pacemakerd (peer=node3,
    counter=12.1)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_cs_flush:
    Sent 0 CPG messages (1 remaining, last=19): Try again (6)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
    Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_update_peer_proc:
    pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:
    Node 2172496064 still member of group pacemakerd (peer=node2,
    counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_membership:
    We're not part of CPG group 'pacemakerd' anymore!
Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_dispatch:
    Evicted from CPG membership
Aug 27 12:33:59 [46849] node3 pacemakerd:    error: mcp_cpg_destroy:
    Connection destroyed
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_xml_cleanup:
    Cleaning up memory from libxml2
Aug 27 12:33:59 [46858] node3      attrd:    error: crm_ipc_read:
    Connection to pacemakerd failed
Aug 27 12:33:59 [46858] node3      attrd:    error: mainloop_gio_callback:
    Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
Aug 27 12:33:59 [46858] node3      attrd:     crit: attrd_cs_destroy:
    Lost connection to Corosync service!
Aug 27 12:33:59 [46858] node3      attrd:   notice: main: Exiting...
Aug 27 12:33:59 [46858] node3      attrd:   notice: main:
    Disconnecting client 0x12579a0, pid=46860...
Aug 27 12:33:59 [46858] node3      attrd:    error: attrd_cib_connection_destroy:
    Connection to the CIB terminated...
Aug 27 12:33:59 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd
    (conn=0x1955f80, async-conn=0x1955f80) left
Aug 27 12:33:59 [46856] node3 stonith-ng:    error: crm_ipc_read:
    Connection to pacemakerd failed
Aug 27 12:33:59 [46856] node3 stonith-ng:    error: mainloop_gio_callback:
    Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
Aug 27 12:33:59 [46856] node3 stonith-ng:    error: stonith_peer_cs_destroy:
    Corosync connection terminated
Aug 27 12:33:59 [46856] node3 stonith-ng:     info: stonith_shutdown:
    Terminating with 1 clients
Aug 27 12:33:59 [46856] node3 stonith-ng:     info: cib_connection_destroy:
    Connection to the CIB closed.
...

please see corosynclog.txt for details of the log

[root@node3 ~]# cat /etc/corosync/corosync.conf
totem {
        version: 2
        secauth: off
        interface {
                member {
                        memberaddr: 192.168.125.134
                }
                member {
                        memberaddr: 192.168.125.129
                }
                member {
                        memberaddr: 192.168.125.135
                }

                ringnumber: 0
                bindnetaddr: 192.168.125.135
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
}

logging {
        fileline: off
        to_logfile: yes
        to_syslog: no
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

service {
        ver: 1
        name: pacemaker
}

Environment:

[root@node3 ~]# rpm -q corosync
corosync-1.4.1-7.el6.x86_64
[root@node3 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)
[root@node3 ~]# pacemakerd -F
Pacemaker 1.1.14-1.el6 (Bu
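[Editor's note: for repeating the experiment above, the tc command can be wrapped so the impairment is always removed afterwards. A small sketch, to be run as root; the interface name and loss rate mirror the command in the post, while the duration and script name are arbitrary choices for illustration.]

```
#!/bin/sh
# Inject packet loss on an interface for a fixed window, then restore it.
IFACE=${1:-eth2}
LOSS=${2:-90%}
DURATION=${3:-60}

# Add an egress netem qdisc that drops the given fraction of packets.
tc qdisc add dev "$IFACE" root netem loss "$LOSS"

# Observe the cluster while the link is degraded.
sleep "$DURATION"

# Delete the qdisc to restore normal traffic.
tc qdisc del dev "$IFACE" root
```

While the rule is active, `tc -s qdisc show dev eth2` will display the netem qdisc and its drop statistics.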