Re: [Pacemaker] [corosync] active/active with Radius
This is really a question for the pacemaker list, so CCing. Regards, Honza Hi, I would like Corosync to manage Radius in an active/active configuration, but I don't know how I should add this, so I was wondering if somebody could point me in the right direction. Thanks and kind regards, Soph. -- Details -- So far I have this: # crm configure show node centos6-radius0-kawazu node centos6-radius1-yetti primitive failover-ip ocf:heartbeat:IPaddr \ params ip=192.168.10.200 \ op monitor interval=2s property $id=cib-bootstrap-options \ dc-version=1.1.10-14.el6_5.2-368c726 \ cluster-infrastructure=classic openais (with plugin) \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore And I wondered if I should add this: # crm configure primitive RADIUS lsb:radiusd op monitor interval=5s timeout=20s start-delay=0s Using an ocf:heartbeat agent might be better, but I read that it may not work once radiusd has forked. ( Reference: http://oss.clusterlabs.org/pipermail/pacemaker/2012-April/013790.html ) If not, then how should I configure this? My O/S is CentOS 6. -- End --
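For an active/active setup the usual approach is to clone the radiusd resource so an instance runs on every node, next to the floating IPaddr resource. The following is only a sketch, assuming the stock CentOS 6 radiusd init script is LSB-compliant; the p_radiusd and cl_radiusd names are made up for illustration:

# crm configure primitive p_radiusd lsb:radiusd \
    op monitor interval=5s timeout=20s
# crm configure clone cl_radiusd p_radiusd \
    meta clone-max=2 clone-node-max=1 interleave=true

The existing failover-ip resource can then float between two always-running radiusd instances; whether clients follow the floating IP or are configured with both server addresses is a separate design decision.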
Re: [Pacemaker] [Openais] Issues with a squid cluster.
This is really a question for the pacemaker list, so CCing. Regards, Honza Redeye wrote: I am not certain where I should post this; hopefully someone will point me in the right direction. I have a two-node cluster on Ubuntu 12.04 with corosync, pacemaker, and squid. Squid is not starting at boot; pacemaker is controlling that. The two servers are communicating just fine, and pacemaker starts, stops, and monitors the squid resources just fine too. My problem is that I am unable to do anything with the squid instances. For example, I want to update an ACL, and I want to bounce the squid service to load the new settings. service squid3 stop|start|status|restart|etc. does nothing; it returns unknown instance. ps -af | grep squid shows two instances, one as user root and one as user proxy, and squid is doing what it is supposed to. What can I do to remedy this?
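Once Pacemaker owns squid, the upstart/init script no longer knows about the running instance, so manage it through the cluster instead. A sketch, assuming crmsh and a resource name of p_squid (illustrative, not taken from the post):

# crm resource stop p_squid
# crm resource start p_squid

bounces squid cluster-wide after a configuration change (newer crmsh also has "crm resource restart"), while

# crm configure property maintenance-mode=true
# crm configure property maintenance-mode=false

brackets any period where you want to handle squid by hand without the cluster reacting. For a pure ACL change, squid's own reload (squid3 -k reconfigure) may be enough and avoids a restart entirely.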
Re: [Pacemaker] [Openais] problem to delete resource
This is really a question for the pacemaker list, so CCing. Regards, Honza Vladimir Berezovski (vberezov) wrote: Hi, I added a new resource like crm(live)configure# primitive p_drbd_ora ocf:linbit:drbd params drbd_resource=clusterdb_res_ora op monitor interval=60s but its status is FAILED (unmanaged). I tried to stop and delete it, but with no result - it is still running. How can I resolve this? [root@node1 ~]# crm configure show node node1 \ attributes standby=off node node2 primitive p_drbd_ora ocf:linbit:drbd \ params drbd_resource=clusterdb_res_ora \ op monitor interval=60s \ meta target-role=Stopped is-managed=true property cib-bootstrap-options: \ dc-version=1.1.11-97629de \ cluster-infrastructure=classic openais (with plugin) \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1422887129 rsc_defaults rsc-options: \ resource-stickiness=100 [root@node1 ~]# crm_mon -1 Last updated: Mon Feb 2 17:12:40 2015 Last change: Mon Feb 2 16:44:52 2015 Stack: classic openais (with plugin) Current DC: node1 - partition WITHOUT quorum Version: 1.1.11-97629de 2 Nodes configured, 2 expected votes 1 Resources configured Online: [ node1 ] OFFLINE: [ node2 ] p_drbd_ora (ocf::linbit:drbd): FAILED node1 (unmanaged) Failed actions: p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, last-rc-change='Mon Feb 2 16:54:19 2015', queued=0ms, exec=26ms p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, last-rc-change='Mon Feb 2 16:54:19 2015', queued=0ms, exec=26ms #crm resource stop p_drbd_ora [root@node1 ~]# crm configure delete p_drbd_ora ERROR: resource p_drbd_ora is running, can't delete it Regards, Vladimir Berezovski
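One way out of this state, sketched here rather than taken from the thread (crmsh assumed): the delete is refused because the failed stop is still recorded in the status section, so clear that first, then stop and delete:

# crm resource cleanup p_drbd_ora
# crm resource stop p_drbd_ora
# crm configure delete p_drbd_ora

The 'not configured' (rc=6) result from the stop also hints that the ocf:linbit:drbd agent is unhappy with how it is configured - it normally expects to run inside a master/slave (ms) resource rather than as a plain primitive - so the same failure may return until that is addressed.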
Re: [Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn, One more thing to clarify. You said rebind can be avoided - what does it mean? By that I mean that as long as you don't shut down the interface, everything will work as expected. Interface shutdown is an administrator decision; the system doesn't do it automagically :) Regards, Honza Thank you, Kostya On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Thank you. Now I am aware of it. Thank you, Kostya On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote: Kostiantyn, Honza, Thank you for helping me. So, there is no defined behavior in case one of the interfaces is not in the system? You are right. There is no defined behavior. Regards, Honza Thank you, Kostya On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote: Kostiantyn, According to https://access.redhat.com/solutions/638843 , the interface that is defined in corosync.conf must be present in the system (see the bottom of the article, section ROOT CAUSE). To confirm that, I made a couple of tests. Here is a part of the corosync.conf file (in a free-write form) (the original config file is also attached): === rrp_mode: passive ring0_addr is defined in corosync.conf ring1_addr is defined in corosync.conf === --- Two-node cluster --- Test #1: -- IP for ring0 is not defined in the system: -- Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs: Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343. Result: Corosync and Pacemaker are not running. Test #2: -- IP for ring1 is not defined in the system: -- Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from corosync: Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0 Result: Corosync and Pacemaker are not running. Test #3: rrp_mode: active leads to the same result, except the Corosync and Pacemaker init scripts return status running. But /var/log/cluster/corosync.log still shows a lot of errors like: Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2) Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database. And half of Pacemaker's services are not running (including the Cluster Information Base (CIB)). --- For a single node mode --- IP for ring0 is not defined in the system: Corosync fails to start. IP for ring1 is not defined in the system: Corosync and Pacemaker are started. It is possible that the configuration will be applied successfully (50%), and it is possible that the cluster is not running any resources, and it is possible that the node cannot be put in standby mode (shows: communication error), and it is possible that the cluster is running all resources, but the applied configuration is not guaranteed to be fully loaded (some rules can be missed). --- Conclusions: --- It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working at any moment. So, is this correct? Do my assumptions make any sense? 
I didn't find any other explanation on the net ... . Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (this would make corosync unusable for physical NIC/switch or cable failure), but they must be up and have the correct IP. When this is not the case, corosync rebinds to localhost and weird things happen. Removal of this rebinding has been a long-time TODO, but there are still more important bugs (especially because the rebind can be avoided). Regards, Honza Thank you, Kostya On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Corosync fails to start if there is no such network interface configured in the system. Even with rrp_mode: passive, the problem is the same when at least one network interface is not configured in the system. Is this the expected behavior? I
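To make the requirement concrete: with redundant rings, every interface named in the totem section must exist and carry its address on each node before corosync starts, even if the link behind it is down. A minimal fragment of the kind being discussed (the addresses are placeholders, not taken from the attached config):

totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.20.0
        mcastaddr: 239.255.2.1
        mcastport: 5405
    }
}

Here both a 192.168.10.x and a 192.168.20.x address must be configured and up on every node; if one is missing, corosync either refuses to start (ring0) or misbehaves and rebinds (ring1), which matches the test results above.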
Re: [Pacemaker] [corosync] CoroSync's UDPu transport for public IP addresses?
Dmitry, Great, it works! Thank you. It would be extremely helpful if this information were included in the default corosync.conf as comments: - regarding the allowed and even preferred absence of totem.interface in the case of UDPu Yep - that the quorum section must not be empty, and that the default quorum.provider could be corosync_votequorum (but not empty). This is not entirely true. quorum.provider cannot be an empty string, or generally must be a valid provider like corosync_votequorum. But an unspecified quorum.provider works without any problem (as in the example configuration file). The truth is that Pacemaker must then be configured in a way that quorum is not required. Regards, Honza It would help novices install and launch corosync instantly. On Fri, Jan 16, 2015 at 7:31 PM, Jan Friesse jfrie...@redhat.com wrote: Dmitry Koterov wrote: such messages (for now). But, anyway, DNS names in ringX_addr seem not to work, and no relevant messages are in the default logs. Maybe add some validations for ringX_addr? I'm having resolvable DNS names: root@node1:/etc/corosync# ping -c1 -W100 node1 | grep from 64 bytes from node1 (127.0.1.1): icmp_seq=1 ttl=64 time=0.039 ms This is the problem. Resolving node1 to localhost (127.0.0.1) is simply wrong. Names you want to use in corosync.conf should resolve to the interface address. I believe the other nodes have a similar setting (so node2 resolved on node2 is again 127.0.0.1) Wow! What a shame! How could I miss it... So you're absolutely right, thanks: that was the cause, an entry in /etc/hosts. On some machines I removed it manually, but on others - didn't. Now I do it automatically by sed -i -r /^.*[[:space:]]$host([[:space:]]|\$)/d /etc/hosts in the initialization script. I apologize for the mess. So now I have only one place in corosync.conf where I need to specify a plain IP address for UDPu: totem.interface.bindnetaddr. If I specify 0.0.0.0 there, I'm getting the message Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!' in the logs (BTW it does not say that I made a mistake in bindnetaddr). Is there a way to completely untie from IP addresses? You can just remove the whole interface section completely. Corosync will find the correct address from the nodelist. Regards, Honza Please try to fix this problem first and let's see if this will solve the issue you are hitting. Regards, Honza root@node1:/etc/corosync# ping -c1 -W100 node2 | grep from 64 bytes from node2 (188.166.54.190): icmp_seq=1 ttl=55 time=88.3 ms root@node1:/etc/corosync# ping -c1 -W100 node3 | grep from 64 bytes from node3 (128.199.116.218): icmp_seq=1 ttl=51 time=252 ms With the corosync.conf below, nothing works: ... nodelist { node { ring0_addr: node1 } node { ring0_addr: node2 } node { ring0_addr: node3 } } ... Jan 14 10:47:44 node1 corosync[15061]: [MAIN ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service. Jan 14 10:47:44 node1 corosync[15061]: [MAIN ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow Jan 14 10:47:44 node1 corosync[15062]: [TOTEM ] Initializing transport (UDP/IP Unicast). Jan 14 10:47:44 node1 corosync[15062]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1 Jan 14 10:47:44 node1 corosync[15062]: [TOTEM ] The network interface [a.b.c.d] is now up. 
Jan 14 10:47:44 node1 corosync[15062]: [SERV ] Service engine loaded: corosync configuration map access [0] Jan 14 10:47:44 node1 corosync[15062]: [QB] server name: cmap Jan 14 10:47:44 node1 corosync[15062]: [SERV ] Service engine loaded: corosync configuration service [1] Jan 14 10:47:44 node1 corosync[15062]: [QB] server name: cfg Jan 14 10:47:44 node1 corosync[15062]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2] Jan 14 10:47:44 node1 corosync[15062]: [QB] server name: cpg Jan 14 10:47:44 node1 corosync[15062]: [SERV ] Service engine loaded: corosync profile loading service [4] Jan 14 10:47:44 node1 corosync[15062]: [WD] No Watchdog, try modprobe a watchdog Jan 14 10:47:44 node1 corosync[15062]: [WD] no resources configured. Jan 14 10:47:44 node1 corosync[15062]: [SERV ] Service engine loaded: corosync watchdog service [7] Jan 14 10:47:44 node1 corosync[15062]: [QUORUM] Using quorum provider corosync_votequorum Jan 14 10:47:44 node1 corosync[15062]: [QUORUM] Quorum provider: corosync_votequorum failed to initialize. Jan 14 10:47:44 node1 corosync[15062]: [SERV ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!' Jan 14 10:47:44 node1 corosync[15062]: [MAIN ] Corosync Cluster Engine exiting with status 20 at service.c:356. But with IP addresses specified in ringX_addr, everything works: ... nodelist { node
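For reference, a minimal corosync 2.x configuration of the shape Honza describes - UDPu transport, no totem.interface section at all, addresses taken from the nodelist, and votequorum enabled - might look like this (the IPs are placeholders, not the poster's real ones):

totem {
    version: 2
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 10.0.0.1
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.2
        nodeid: 2
    }
    node {
        ring0_addr: 10.0.0.3
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

With three nodes listed, votequorum derives expected_votes from the nodelist, which avoids the 'nodelist or quorum.expected_votes must be configured!' error quoted above.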
Re: [Pacemaker] [corosync] CoroSync's UDPu transport for public IP addresses?
:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync configuration service [1] Jan 14 10:48:28 node1 corosync[15156]: [QB] server name: cfg Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2] Jan 14 10:48:28 node1 corosync[15156]: [QB] server name: cpg Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync profile loading service [4] Jan 14 10:48:28 node1 corosync[15156]: [WD] No Watchdog, try modprobe a watchdog Jan 14 10:48:28 node1 corosync[15156]: [WD] no resources configured. Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync watchdog service [7] Jan 14 10:48:28 node1 corosync[15156]: [QUORUM] Using quorum provider corosync_votequorum Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5] Jan 14 10:48:28 node1 corosync[15156]: [QB] server name: votequorum Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3] Jan 14 10:48:28 node1 corosync[15156]: [QB] server name: quorum Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] adding new UDPU member {a.b.c.d} Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] adding new UDPU member {e.f.g.h} Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] adding new UDPU member {i.j.k.l} Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] A new membership (m.n.o.p:80) was formed. Members joined: 1760315215 Jan 14 10:48:28 node1 corosync[15156]: [QUORUM] Members[1]: 1760315215 Jan 14 10:48:28 node1 corosync[15156]: [MAIN ] Completed service synchronization, ready to provide service. On Mon, Jan 5, 2015 at 6:45 PM, Jan Friesse jfrie...@redhat.com wrote: Dmitry, Sure, in logs I see adding new UDPU member {IP_ADDRESS} (so DNS names are definitely resolved), but in practice the cluster does not work, as I said above. So validations of ringX_addr in corosync.conf would be very helpful in corosync. that's weird. Because as long as DNS is resolved, corosync works only with IP. This means, code path is exactly same with IP or with DNS. Do you have logs from corosync? Honza On Fri, Jan 2, 2015 at 2:49 PM, Jan Friesse jfrie...@redhat.com wrote: Dmitry, No, I meant that if you pass a domain name in ring0_addr, there are no errors in logs, corosync even seems to find nodes (based on its logs), And crm_node -l shows them, but in practice nothing really works. A verbose error message would be very helpful in such case. This sounds weird. Are you sure that DNS names really maps to correct IP address? In logs there should be something like adding new UDPU member {IP_ADDRESS}. Regards, Honza On Tuesday, December 30, 2014, Daniel Dehennin daniel.dehen...@baby-gnu.org wrote: Dmitry Koterov dmitry.kote...@gmail.com javascript:; writes: Oh, seems I've found the solution! At least two mistakes was in my corosync.conf (BTW logs did not say about any errors, so my conclusion is based on my experiments only). 1. nodelist.node MUST contain only IP addresses. No hostnames! They simply do not work, crm status shows no nodes. And no warnings are in logs regarding this. You can add name like this: nodelist { node { ring0_addr: public-ip-address-of-the-first-machine name: node1 } node { ring0_addr: public-ip-address-of-the-second-machine name: node2 } } I used it on Ubuntu Trusty with udpu. Regards. 
-- Daniel Dehennin Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: [Pacemaker] CoroSync's UDPu transport for public IP addresses?
] Service engine loaded: corosync watchdog service [7] Jan 14 10:48:28 node1 corosync[15156]: [QUORUM] Using quorum provider corosync_votequorum Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5] Jan 14 10:48:28 node1 corosync[15156]: [QB] server name: votequorum Jan 14 10:48:28 node1 corosync[15156]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3] Jan 14 10:48:28 node1 corosync[15156]: [QB] server name: quorum Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] adding new UDPU member {a.b.c.d} Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] adding new UDPU member {e.f.g.h} Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] adding new UDPU member {i.j.k.l} Jan 14 10:48:28 node1 corosync[15156]: [TOTEM ] A new membership (m.n.o.p:80) was formed. Members joined: 1760315215 Jan 14 10:48:28 node1 corosync[15156]: [QUORUM] Members[1]: 1760315215 Jan 14 10:48:28 node1 corosync[15156]: [MAIN ] Completed service synchronization, ready to provide service. On Mon, Jan 5, 2015 at 6:45 PM, Jan Friesse jfrie...@redhat.com wrote: Dmitry, Sure, in logs I see adding new UDPU member {IP_ADDRESS} (so DNS names are definitely resolved), but in practice the cluster does not work, as I said above. So validations of ringX_addr in corosync.conf would be very helpful in corosync. that's weird. Because as long as DNS is resolved, corosync works only with IP. This means, code path is exactly same with IP or with DNS. Do you have logs from corosync? Honza On Fri, Jan 2, 2015 at 2:49 PM, Jan Friesse jfrie...@redhat.com wrote: Dmitry, No, I meant that if you pass a domain name in ring0_addr, there are no errors in logs, corosync even seems to find nodes (based on its logs), And crm_node -l shows them, but in practice nothing really works. A verbose error message would be very helpful in such case. This sounds weird. Are you sure that DNS names really maps to correct IP address? In logs there should be something like adding new UDPU member {IP_ADDRESS}. Regards, Honza On Tuesday, December 30, 2014, Daniel Dehennin daniel.dehen...@baby-gnu.org wrote: Dmitry Koterov dmitry.kote...@gmail.com javascript:; writes: Oh, seems I've found the solution! At least two mistakes was in my corosync.conf (BTW logs did not say about any errors, so my conclusion is based on my experiments only). 1. nodelist.node MUST contain only IP addresses. No hostnames! They simply do not work, crm status shows no nodes. And no warnings are in logs regarding this. You can add name like this: nodelist { node { ring0_addr: public-ip-address-of-the-first-machine name: node1 } node { ring0_addr: public-ip-address-of-the-second-machine name: node2 } } I used it on Ubuntu Trusty with udpu. Regards. 
-- Daniel Dehennin Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: [Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn, Honza, Thank you for helping me. So, there is no defined behavior in case one of the interfaces is not in the system? You are right. There is no defined behavior. Regards, Honza Thank you, Kostya On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote: Kostiantyn, According to the https://access.redhat.com/solutions/638843 , the interface, that is defined in the corosync.conf, must be present in the system (see at the bottom of the article, section ROOT CAUSE). To confirm that I made a couple of tests. Here is a part of the corosync.conf file (in a free-write form) (also attached the origin config file): === rrp_mode: passive ring0_addr is defined in corosync.conf ring1_addr is defined in corosync.conf === --- Two-node cluster --- Test #1: -- IP for ring0 is not defines in the system: -- Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs: Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343. Result: Corosync and Pacemaker are not running. Test #2: -- IP for ring1 is not defines in the system: -- Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from the corosync: Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0 Result: Corosync and Pacemaker are not running. Test #3: rrp_mode: active leads to the same result, except Corosync and Pacemaker init scripts return status running. But still vim /var/log/cluster/corosync.log shows a lot of errors like: Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2) Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database. And half of the Pacemaker's services are not running (including Cluster Information Base (CIB)). --- For a single node mode --- IP for ring0 is not defines in the system: Corosync fails to start. IP for ring1 is not defines in the system: Corosync and Pacemaker are started. It is possible that configuration will be applied successfully (50%), and it is possible that the cluster is not running any resources, and it is possible that the node cannot be put in a standby mode (shows: communication error), and it is possible that the cluster is running all resources, but applied configuration is not guaranteed to be fully loaded (some rules can be missed). --- Conclusions: --- It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working every moment. So, is it correct? Does my assumptions make any sense? I didn't any other explanation in the network ... . Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (this would make corosync unusable for physical NIC/Switch or cable failure), but they must be up and have correct ip. When this is not the case, corosync rebinds to localhost and weird things happens. Removal of this rebinding is long time TODO, but there are still more important bugs (especially because rebind can be avoided). 
Regards, Honza Thank you, Kostya On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Corosync fails to start if there is no such network interface configured in the system. Even with rrp_mode: passive the problem is the same when at least one network interface is not configured in the system. Is this the expected behavior? I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong? Thank you, Kostya ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http
Re: [Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn, According to the https://access.redhat.com/solutions/638843 , the interface, that is defined in the corosync.conf, must be present in the system (see at the bottom of the article, section ROOT CAUSE). To confirm that I made a couple of tests. Here is a part of the corosync.conf file (in a free-write form) (also attached the origin config file): === rrp_mode: passive ring0_addr is defined in corosync.conf ring1_addr is defined in corosync.conf === --- Two-node cluster --- Test #1: -- IP for ring0 is not defines in the system: -- Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs: Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343. Result: Corosync and Pacemaker are not running. Test #2: -- IP for ring1 is not defines in the system: -- Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from the corosync: Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0 Result: Corosync and Pacemaker are not running. Test #3: rrp_mode: active leads to the same result, except Corosync and Pacemaker init scripts return status running. But still vim /var/log/cluster/corosync.log shows a lot of errors like: Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2) Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database. And half of the Pacemaker's services are not running (including Cluster Information Base (CIB)). --- For a single node mode --- IP for ring0 is not defines in the system: Corosync fails to start. IP for ring1 is not defines in the system: Corosync and Pacemaker are started. It is possible that configuration will be applied successfully (50%), and it is possible that the cluster is not running any resources, and it is possible that the node cannot be put in a standby mode (shows: communication error), and it is possible that the cluster is running all resources, but applied configuration is not guaranteed to be fully loaded (some rules can be missed). --- Conclusions: --- It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working every moment. So, is it correct? Does my assumptions make any sense? I didn't any other explanation in the network ... . Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (this would make corosync unusable for physical NIC/Switch or cable failure), but they must be up and have correct ip. When this is not the case, corosync rebinds to localhost and weird things happens. Removal of this rebinding is long time TODO, but there are still more important bugs (especially because rebind can be avoided). Regards, Honza Thank you, Kostya On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Corosync fails to start if there is no such network interface configured in the system. Even with rrp_mode: passive the problem is the same when at least one network interface is not configured in the system. 
Is this the expected behavior? I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong? Thank you, Kostya
Re: [Pacemaker] CoroSync's UDPu transport for public IP addresses?
Dmitry, Sure, in logs I see adding new UDPU member {IP_ADDRESS} (so DNS names are definitely resolved), but in practice the cluster does not work, as I said above. So validations of ringX_addr in corosync.conf would be very helpful in corosync. that's weird. Because as long as DNS is resolved, corosync works only with IP. This means, code path is exactly same with IP or with DNS. Do you have logs from corosync? Honza On Fri, Jan 2, 2015 at 2:49 PM, Jan Friesse jfrie...@redhat.com wrote: Dmitry, No, I meant that if you pass a domain name in ring0_addr, there are no errors in logs, corosync even seems to find nodes (based on its logs), And crm_node -l shows them, but in practice nothing really works. A verbose error message would be very helpful in such case. This sounds weird. Are you sure that DNS names really maps to correct IP address? In logs there should be something like adding new UDPU member {IP_ADDRESS}. Regards, Honza On Tuesday, December 30, 2014, Daniel Dehennin daniel.dehen...@baby-gnu.org wrote: Dmitry Koterov dmitry.kote...@gmail.com javascript:; writes: Oh, seems I've found the solution! At least two mistakes was in my corosync.conf (BTW logs did not say about any errors, so my conclusion is based on my experiments only). 1. nodelist.node MUST contain only IP addresses. No hostnames! They simply do not work, crm status shows no nodes. And no warnings are in logs regarding this. You can add name like this: nodelist { node { ring0_addr: public-ip-address-of-the-first-machine name: node1 } node { ring0_addr: public-ip-address-of-the-second-machine name: node2 } } I used it on Ubuntu Trusty with udpu. Regards. -- Daniel Dehennin Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] CoroSync's UDPu transport for public IP addresses?
Dmitry, No, I meant that if you pass a domain name in ring0_addr, there are no errors in logs, corosync even seems to find nodes (based on its logs), And crm_node -l shows them, but in practice nothing really works. A verbose error message would be very helpful in such case. This sounds weird. Are you sure that DNS names really maps to correct IP address? In logs there should be something like adding new UDPU member {IP_ADDRESS}. Regards, Honza On Tuesday, December 30, 2014, Daniel Dehennin daniel.dehen...@baby-gnu.org wrote: Dmitry Koterov dmitry.kote...@gmail.com javascript:; writes: Oh, seems I've found the solution! At least two mistakes was in my corosync.conf (BTW logs did not say about any errors, so my conclusion is based on my experiments only). 1. nodelist.node MUST contain only IP addresses. No hostnames! They simply do not work, crm status shows no nodes. And no warnings are in logs regarding this. You can add name like this: nodelist { node { ring0_addr: public-ip-address-of-the-first-machine name: node1 } node { ring0_addr: public-ip-address-of-the-second-machine name: node2 } } I used it on Ubuntu Trusty with udpu. Regards. -- Daniel Dehennin Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] CMAN and Pacemaker with IPv6
Teerapatr Dear Honza, Sorry to say this, but I found a new error again. LOL This time, I have already installed 1.4.1-17 as you advised. And the nodename, without altname, is mapped to IPv6 using the hosts file. Everything is fine, but the 2 nodes can't communicate with each other. So I added the multicast address manually, using the command `ccs -f /etc/cluster/cluster.conf --setmulticast ff::597` on both nodes. After that, CMAN cannot start. ff:: is not a valid IPv6 multicast address. Use something like ff3e::597. Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs...[ OK ] Starting cman... Timed-out waiting for cluster Check cluster logs for details [FAILED] I also found a lot of log output, but I think this is where the problem occurs. Jul 15 13:36:14 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service. Jul 15 13:36:14 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp Jul 15 13:36:14 corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf Jul 15 13:36:14 corosync [MAIN ] Successfully parsed cman config Jul 15 13:36:14 corosync [TOTEM ] Initializing transport (UDP/IP Multicast). Jul 15 13:36:14 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Jul 15 13:36:14 corosync [TOTEM ] Unable to bind the socket to receive multicast packets: Cannot assign requested address (99) Jul 15 13:36:14 corosync [TOTEM ] Could not set traffic priority: Socket operation on non-socket (88) Jul 15 13:36:14 corosync [TOTEM ] The network interface [2001:db8::151] is now up. Jul 15 13:36:14 corosync [QUORUM] Using quorum provider quorum_cman Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Jul 15 13:36:14 corosync [CMAN ] CMAN 3.0.12.1 (built Apr 14 2014 09:36:10) started Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync CMAN membership service 2.90 Jul 15 13:36:14 corosync [SERV ] Service engine loaded: openais checkpoint service B.01.01 Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync configuration service Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01 Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync profile loading service Jul 15 13:36:14 corosync [QUORUM] Using quorum provider quorum_cman Jul 15 13:36:14 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Jul 15 13:36:14 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. Jul 15 13:36:17 corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. Jul 15 13:36:19 corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. Jul 15 13:36:20 corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. 
I cannot find a solution on the Internet for [TOTEM ] Unable to bind the socket to receive multicast packets: Cannot assign requested address (99). Do you have any idea? Teenigma On Tue, Jul 15, 2014 at 10:02 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: Honza Great, thank you very much. But the terrible thing for me is that I'm using the package from the OpenSUSE repo. When I turn back to the CentOS repo, which stores a lower version, a dependency problem occurs. Anyway, thank you for your help. Teenigma On Mon, Jul 14, 2014 at 8:51 PM, Jan Friesse jfrie...@redhat.com wrote: Honza, How do I include the patch with my CentOS package? Do I need to compile them manually? Yes. Also, the official CentOS version was never 1.4.5. If you are using CentOS, just use the stock 1.4.1-17.1. The patch is included there. Honza TeEniGMa On Mon, Jul 14, 2014 at 3:21 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, For more information, these are the logs from /var/log/messages ... Jul 14 10:28:07 wh00 kernel: : DLM (built Mar 25 2014 20:01:13) installed Jul 14 10:28:07 wh00 corosync[2716]: [MAIN
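For reference, the corrected call from the message above, using a multicast group with valid IPv6 flag/scope bits (ff3e::597 is only an example; any proper IPv6 multicast address of that form should do):

# ccs -f /etc/cluster/cluster.conf --setmulticast ff3e::597

run on both nodes and followed by a cman restart. The 'Unable to bind the socket to receive multicast packets: Cannot assign requested address (99)' message is consistent with corosync being asked to join an address like ff::597 that the kernel rejects.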
Re: [Pacemaker] CMAN and Pacemaker with IPv6
Teerapatr, For more information, these are LOG from /var/log/messages ... Jul 14 10:28:07 wh00 kernel: : DLM (built Mar 25 2014 20:01:13) installed Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Corosync Cluster Engine ('1.4.5'): started and ready to provide service. Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Corosync built-in features: nss Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Successfully parsed cman config Jul 14 10:28:07 wh00 corosync[2716]: [TOTEM ] Initializing transport (UDP/IP Multicast). Jul 14 10:28:07 wh00 corosync[2716]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Jul 14 10:28:07 wh00 corosync[2716]: [TOTEM ] The network interface is down. ^^^ This line is important. This means, corosync was unable to find interface with given IPv6 address. There was regression in v1.4.5 causing this behavior. It's fixed in v1.4.6 (patch is https://github.com/corosync/corosync/commit/d76759ec26ecaeb9cc01f49e9eb0749b61454d27). So you can ether apply patch or (recommended) upgrade to 1.4.7. Regards, Honza Jul 14 10:28:10 wh00 pacemaker: Aborting startup of Pacemaker Cluster Manager ... Te On Mon, Jul 14, 2014 at 10:07 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: Dear Honza, Sorry for late reply. After I have tested with all new configuration. On IPv6 only, and with no altname. I face with error below, Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs...[ OK ] Starting cman... corosync died with signal: 6 Check cluster logs for details [FAILED] And, exactly, there are no any enabled firewall, I also configure the Multicast address as manual. Could you advise me the solution? Many thanks in advance. Te On Thu, Jul 10, 2014 at 6:14 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, Hi Honza, As you said I use the nodename identify by hostname (which be accessed via IPv6) and the node also has the altname (which be IPv4 address). This doesn't work. Both hostname and altname have to be same IP version. Now, I configure the mcast address for both nodename and altname manually. The CMAN and Pacemaker can start ad well. But they don't communicate to another node. PLease make sure (as I've wrote in previous email) your firewall doesn't block mcast and corosync traffic (just disable it) and switch doesn't block multicast (this is very often the case). If these are VMs, make sure to properly configure bridge (just disable firewall) and allow mcast_querier. Honza On node0, crm_mon show node1 offline. In the same way, node one show node0 is down. So the split brain problem occur here. Regards, Te On Thu, Jul 10, 2014 at 2:50 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, OK, some problems are solved. I use the incorrect hostname. For now, the new problem has occured. Starting cman... Node address family does not match multicast address family Unable to get the configuration Node address family does not match multicast address family cman_tool: corosync daemon didn't start Check cluster logs for details [FAILED] This looks like one of your node is also reachable via ipv4 and ipv4 resolving is proffered. Please make sure to set only ipv6 address and try it again. Of course set mcast addr by hand maybe helpful (even-tho I don't believe it will solve problem you are hitting)). 
Also make sure ip6tables are properly configured and your switch is able to pass ipv6 mcast traffic. Regards, Honza How can i fix it? Or just assigned the multicast address in the configuration? Regards, Te On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: I not found any LOG message /var/log/messages ... Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager ... and this is what display when I try to start pacemaker # /etc/init.d/pacemaker start Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs...[ OK ] Starting cman... Cannot find node name in cluster.conf Unable to get the configuration Cannot find node name in cluster.conf cman_tool
Re: [Pacemaker] CMAN and Pacemaker with IPv6
Honza, How do I include the patch with my CentOS package? Do I need to compile them manually? Yes. Also official CentOS version was never 1.4.5. If you are using CentOS, just use stock 1.4.1-17.1. Patch is included there. Honza TeEniGMa On Mon, Jul 14, 2014 at 3:21 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, For more information, these are LOG from /var/log/messages ... Jul 14 10:28:07 wh00 kernel: : DLM (built Mar 25 2014 20:01:13) installed Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Corosync Cluster Engine ('1.4.5'): started and ready to provide service. Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Corosync built-in features: nss Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf Jul 14 10:28:07 wh00 corosync[2716]: [MAIN ] Successfully parsed cman config Jul 14 10:28:07 wh00 corosync[2716]: [TOTEM ] Initializing transport (UDP/IP Multicast). Jul 14 10:28:07 wh00 corosync[2716]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Jul 14 10:28:07 wh00 corosync[2716]: [TOTEM ] The network interface is down. ^^^ This line is important. This means, corosync was unable to find interface with given IPv6 address. There was regression in v1.4.5 causing this behavior. It's fixed in v1.4.6 (patch is https://github.com/corosync/corosync/commit/d76759ec26ecaeb9cc01f49e9eb0749b61454d27). So you can ether apply patch or (recommended) upgrade to 1.4.7. Regards, Honza Jul 14 10:28:10 wh00 pacemaker: Aborting startup of Pacemaker Cluster Manager ... Te On Mon, Jul 14, 2014 at 10:07 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: Dear Honza, Sorry for late reply. After I have tested with all new configuration. On IPv6 only, and with no altname. I face with error below, Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs...[ OK ] Starting cman... corosync died with signal: 6 Check cluster logs for details [FAILED] And, exactly, there are no any enabled firewall, I also configure the Multicast address as manual. Could you advise me the solution? Many thanks in advance. Te On Thu, Jul 10, 2014 at 6:14 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, Hi Honza, As you said I use the nodename identify by hostname (which be accessed via IPv6) and the node also has the altname (which be IPv4 address). This doesn't work. Both hostname and altname have to be same IP version. Now, I configure the mcast address for both nodename and altname manually. The CMAN and Pacemaker can start ad well. But they don't communicate to another node. PLease make sure (as I've wrote in previous email) your firewall doesn't block mcast and corosync traffic (just disable it) and switch doesn't block multicast (this is very often the case). If these are VMs, make sure to properly configure bridge (just disable firewall) and allow mcast_querier. Honza On node0, crm_mon show node1 offline. In the same way, node one show node0 is down. So the split brain problem occur here. Regards, Te On Thu, Jul 10, 2014 at 2:50 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, OK, some problems are solved. I use the incorrect hostname. For now, the new problem has occured. Starting cman... 
Node address family does not match multicast address family Unable to get the configuration Node address family does not match multicast address family cman_tool: corosync daemon didn't start Check cluster logs for details [FAILED] This looks like one of your node is also reachable via ipv4 and ipv4 resolving is proffered. Please make sure to set only ipv6 address and try it again. Of course set mcast addr by hand maybe helpful (even-tho I don't believe it will solve problem you are hitting)). Also make sure ip6tables are properly configured and your switch is able to pass ipv6 mcast traffic. Regards, Honza How can i fix it? Or just assigned the multicast address in the configuration? Regards, Te On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: I not found any LOG message /var/log/messages ... Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager ... and this is what display when I try to start pacemaker # /etc/init.d/pacemaker start Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup
Re: [Pacemaker] CMAN and Pacemaker with IPv6
Teerapatr, OK, some problems are solved. I used the incorrect hostname. For now, a new problem has occurred. Starting cman... Node address family does not match multicast address family Unable to get the configuration Node address family does not match multicast address family cman_tool: corosync daemon didn't start Check cluster logs for details [FAILED] This looks like one of your nodes is also reachable via IPv4 and IPv4 resolving is preferred. Please make sure to set only an IPv6 address and try it again. Of course, setting the mcast addr by hand may be helpful (even though I don't believe it will solve the problem you are hitting). Also make sure ip6tables is properly configured and your switch is able to pass IPv6 mcast traffic. Regards, Honza How can I fix it? Or should I just assign the multicast address in the configuration? Regards, Te On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: I did not find any log message /var/log/messages ... Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager ... and this is what is displayed when I try to start pacemaker # /etc/init.d/pacemaker start Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs...[ OK ] Starting cman... Cannot find node name in cluster.conf Unable to get the configuration Cannot find node name in cluster.conf cman_tool: corosync daemon didn't start Check cluster logs for details [FAILED] Stopping cluster: Leaving fence domain... [ OK ] Stopping gfs_controld...[ OK ] Stopping dlm_controld...[ OK ] Stopping fenced... [ OK ] Stopping cman...[ OK ] Unloading kernel modules... [ OK ] Unmounting configfs... [ OK ] Aborting startup of Pacemaker Cluster Manager One more thing: because of this problem, I removed the record from DNS for now and mapped it in the /etc/hosts file instead, as shown below. /etc/hosts ... 2001:db8:0:1::1 node0.example.com 2001:db8:0:1::2 node1.example.com ... Is there any configuration that would help me get more logs? On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote: On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com wrote: Dear All, I have implemented HA on dual-stack servers. At first I did not deploy the IPv6 records in DNS, and CMAN and Pacemaker worked as normal. But after I created the records on the DNS server, I found that CMAN can't start. Do CMAN and Pacemaker support IPv6? I don't think pacemaker cares. What errors did you get?
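To make the advice concrete, a hedged cluster.conf sketch for an IPv6-only two-node CMAN setup, reusing the example names from the /etc/hosts snippet above (the multicast address is only an illustrative valid IPv6 group, and fencing is omitted):

<cluster name="ipv6cluster" config_version="1">
  <cman two_node="1" expected_votes="1">
    <multicast addr="ff3e::597"/>
  </cman>
  <clusternodes>
    <clusternode name="node0.example.com" nodeid="1"/>
    <clusternode name="node1.example.com" nodeid="2"/>
  </clusternodes>
</cluster>

The key point from the thread is that node0.example.com and node1.example.com must resolve only to their IPv6 addresses; if a name also resolves to IPv4 and that resolution wins, you get exactly the 'Node address family does not match multicast address family' error shown above.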
Re: [Pacemaker] CMAN and Pacemaker with IPv6
Teerapatr, Hi Honza, As you said I use the nodename identify by hostname (which be accessed via IPv6) and the node also has the altname (which be IPv4 address). This doesn't work. Both hostname and altname have to be same IP version. Now, I configure the mcast address for both nodename and altname manually. The CMAN and Pacemaker can start ad well. But they don't communicate to another node. PLease make sure (as I've wrote in previous email) your firewall doesn't block mcast and corosync traffic (just disable it) and switch doesn't block multicast (this is very often the case). If these are VMs, make sure to properly configure bridge (just disable firewall) and allow mcast_querier. Honza On node0, crm_mon show node1 offline. In the same way, node one show node0 is down. So the split brain problem occur here. Regards, Te On Thu, Jul 10, 2014 at 2:50 PM, Jan Friesse jfrie...@redhat.com wrote: Teerapatr, OK, some problems are solved. I use the incorrect hostname. For now, the new problem has occured. Starting cman... Node address family does not match multicast address family Unable to get the configuration Node address family does not match multicast address family cman_tool: corosync daemon didn't start Check cluster logs for details [FAILED] This looks like one of your node is also reachable via ipv4 and ipv4 resolving is proffered. Please make sure to set only ipv6 address and try it again. Of course set mcast addr by hand maybe helpful (even-tho I don't believe it will solve problem you are hitting)). Also make sure ip6tables are properly configured and your switch is able to pass ipv6 mcast traffic. Regards, Honza How can i fix it? Or just assigned the multicast address in the configuration? Regards, Te On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote: I not found any LOG message /var/log/messages ... Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager ... and this is what display when I try to start pacemaker # /etc/init.d/pacemaker start Starting cluster: Checking if cluster has been disabled at boot...[ OK ] Checking Network Manager... [ OK ] Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs...[ OK ] Starting cman... Cannot find node name in cluster.conf Unable to get the configuration Cannot find node name in cluster.conf cman_tool: corosync daemon didn't start Check cluster logs for details [FAILED] Stopping cluster: Leaving fence domain... [ OK ] Stopping gfs_controld...[ OK ] Stopping dlm_controld...[ OK ] Stopping fenced... [ OK ] Stopping cman...[ OK ] Unloading kernel modules... [ OK ] Unmounting configfs... [ OK ] Aborting startup of Pacemaker Cluster Manager another one thing, according to the happened problem, I remove the record from DNS for now and map it in to /etc/hosts files instead, as shown below. /etc/hosts ... 2001:db8:0:1::1 node0.example.com 2001:db8:0:1::2 node1.example.com ... Is there any configure that help me to got more log ? On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote: On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com wrote: Dear All, I has implemented the HA on dual stack servers, Firstly, I doesn't deploy IPv6 record on DNS yet. The CMAN and PACEMAKER can work as normal. But, after I create record on DNS server, i found the error that cann't start CMAN. Are CMAN and PACEMAKER support the IPv6? I don;t think pacemaker cares. 
What errors did you get? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
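For reference, when CMAN/corosync runs over IPv6 the node names in cluster.conf must resolve to IPv6 addresses only and the multicast address must be IPv6 as well. The sketch below is illustrative only (the cluster name, host names, node ids and the ff15::1 multicast address are assumptions, not taken from the thread), and the ip6tables rule assumes the default corosync ports 5404-5405:

<cluster name="ipv6cluster" config_version="1">
  <cman two_node="1" expected_votes="1">
    <multicast addr="ff15::1"/>
  </cman>
  <clusternodes>
    <clusternode name="node0.example.com" nodeid="1"/>
    <clusternode name="node1.example.com" nodeid="2"/>
  </clusternodes>
</cluster>

# allow cluster traffic on both nodes (or simply disable the firewall while testing):
ip6tables -I INPUT -p udp -m udp --dport 5404:5405 -j ACCEPT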
Re: [Pacemaker] [Openais] unmanaged resource failed - how to get back?
Stefan, sending to Pacemaker list because your question seems to be not Corosync related. Regards, Honza Senftleben, Stefan (itsc) napsal(a): Hello, I set the cluster in a maintainance mode with: crm configure property maintenance-mode=true . Afterwards I did stop one resource manually, but after turning of the maintainance mode, the resource is in status unmanaged FAILED. But the resource is running already. What shoud I do now, to get the resource managed by pacemaker? Greetings Stefan Last updated: Mon Jun 30 12:42:45 2014 Last change: Mon Jun 30 12:41:33 2014 Stack: openais Current DC: lxds05 - partition with quorum Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 2 Nodes configured, 2 expected votes 10 Resources configured. Online: [ lxds05 lxds07 ] Full list of resources: Resource Group: group_omd pri_fs_omd (ocf::heartbeat:Filesystem):Started lxds05 pri_apache2(ocf::heartbeat:apache):Started lxds05 pri_nagiosIP (ocf::heartbeat:IPaddr2): Started lxds05 Master/Slave Set: ms_drbd_omd [pri_drbd_omd] Masters: [ lxds05 ] Slaves: [ lxds07 ] Clone Set: clone_ping [pri_ping] Started: [ lxds07 lxds05 ] res_MailTo_omd_group(ocf::heartbeat:MailTo):Stopped omd_itsc(ocf::omd:omdnagios): Started lxds05 (unmanaged) FAILED res_MailTo_omd_itsc (ocf::heartbeat:MailTo):Stopped Node Attributes: * Node lxds05: + master-pri_drbd_omd:0 : 1 + pingd : 3000 * Node lxds07: + master-pri_drbd_omd:1 : 1 + pingd : 3000 Migration summary: * Node lxds07: * Node lxds05: omd_itsc: migration-threshold=100 fail-count=2 last-failure='Mon Jun 30 12:39:03 2014' Failed actions: omd_itsc_stop_0 (node=lxds05, call=49, rc=1, status=complete): unknown error ___ Openais mailing list open...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/openais ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
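Once the cause of the failed stop is understood, the usual way out of the unmanaged FAILED state is to clean up the failed operation so that pacemaker re-probes the resource and clears the fail-count; a minimal sketch using the resource name from the thread:

# crm resource cleanup omd_itsc
# equivalent low-level command:
# crm_resource --cleanup --resource omd_itsc
# confirm the fail-count is gone and the resource is managed again:
# crm_mon -1fr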
Re: [Pacemaker] [Openais] Filesystem vs. Master-Slave MySQL resource
Matej, this is really question for pacemaker mailing list. Hello, I have the following setup: 2 nodes: db-01, db-02 Groups of resources: fs-01: iscsi+lvm+fs at db-01 fs-02: iscsi+lvm+fs at db-02 fs-01 is for mounting data files for MySQL at db-01, fs-02 for db-02 MySQL resources: primitive p_mysql mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf datadir=/var/lib/mysql/db replication_user=replicant replication_passwd= test_user=test test_passwd= \ op start timeout=120 interval=0 \ op stop timeout=120 interval=0 \ op promote timeout=120 interval=0 \ op demote timeout=120 interval=0 \ op monitor role=Master timeout=30 interval=5 \ op monitor role=Slave timeout=30 interval=8 ms ms_mysql p_mysql \ meta notify=true master-max=1 clone-max=2 target-role=Started is-managed=true To force groups at right nodes I have following: location loc_mysql-1 fs-01 inf: db-01 location loc_mysql-1n fs-01 -inf: db-02 location loc_mysql-2 fs-02 inf: db-02 location loc_mysql-2n fs-02 -inf: db-01 I have troubles with order. I need to configure startup of ms_mysql after FS mounts. There are several scenarios: 1) Both nodes online - start both fs-01 and fs-02 - start ms_mysql, one node as Master, other as Slave 2) Only one node online - start related fs-0x - start ms_mysql at one node 3) Running both nodes, standby slave - stop ms_mysql:Slave - stop related fs 4) Running both nodes, standby master - demote master - promote slave to became master - stop slave (ex master) - stop related fs I have troubles to configure the right dependecies betwen fs-01, fs-02, ms_mysql:start, ms_mysql:promote, etc... I can provide more detais as needed. Thanks for your help. Best regards Matej Gajdos — e-mail: matej.gaj...@digmia.com DIGMIA s.r.o. Lazaretská 12 81108 Bratislava ___ Openais mailing list open...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/openais ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
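For readers unfamiliar with the syntax involved, ordering and colocating a master/slave resource against a filesystem group in crmsh generally looks like the sketch below (resource names reused from the thread; this only illustrates the constraint syntax and does not by itself solve the per-node dependency described in the thread, where each clone instance should wait only for the filesystem group on its own node):

# crm configure
order ord_fs01_mysql inf: fs-01 ms_mysql:start
colocation col_master_fs01 inf: ms_mysql:Master fs-01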
Re: [Pacemaker] auto_tie_breaker in two node cluster
I do not quite understand how auto_tie_breaker works. Say we have a cluster with 2 nodes and the auto_tie_breaker feature enabled. Each node has 2 NICs: one NIC is used for cluster communication and the other one is used for providing services from the cluster. So the question is how the nodes will distinguish between two possible situations: 1) the connection between the nodes is lost, but both nodes keep working; 2) the power supply of node 1 (which has the lowest node-id) breaks down and node 2 keeps working. In the 1st case, according to the description of auto_tie_breaker, the node with the lowest node-id in the cluster will keep working. In that particular situation this is a good result, because both nodes are in a good state (both could keep working). In the 2nd case the only working node is #2, and the node-id of that node is not the lowest one. So what happens in this case? What logic applies, given that we have lost the node with the lowest node-id in a 2-node cluster? there is no qdiskd for votequorum yet Are there plans to implement it? Kostya, yes, there are plans to implement qdisk (a network-based one). Regards, Honza Many thanks, Kostya ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
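For context, auto_tie_breaker is a votequorum option set in the quorum section of corosync.conf; when a cluster with an even number of nodes splits, only the partition containing the tie-breaker node (by default the one with the lowest node-id) keeps quorum. A minimal sketch, assuming corosync 2.x with votequorum:

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    auto_tie_breaker: 1
}

Note that auto_tie_breaker only decides which partition stays quorate on an even split; it cannot tell a dead peer from an unreachable one, which is exactly the distinction asked about above, and in case 2 the surviving node (without the tie-breaker) would lose quorum. That gap is why fencing and/or a quorum device are still needed.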
Re: [Pacemaker] pacemaker not started by corosync on ubuntu 14.04
Vladimir, Vladimir napsal(a): Hello everyone, I'm trying to get corosync/pacemaker run on Ubuntu 14.04. In my Ubuntu 12.04 setups pacemaker was started by corosync. Actually I thought the Yes. 12.04 used corosync 1.x with pacemaker plugin. service {...} section in the corosync.conf is specified for this purpose. Of course I could put pacemaker into the runlevel but I asked myself if the behaviour was just changed or if I maybe have a mistake in my corosync.conf. Behavior just changed. 14.04 uses corosync 2.x and there are no plugins (service section). So pacemaker is no longer started by corosync and you have to start both corosync and pacemaker (I believe upstart can handle dependencies, so probably starting only pacemaker is enough). Regards, Honza I started with this minimal corosync.conf: totem { version: 2 secauth: off interface { ringnumber: 0 bindnetaddr: 172.16.100.0 mcastaddr: 239.255.42.1 mcastport: 5405 } } service { name: pacemaker ver: 1 } quorum { provider: corosync_votequorum expected_votes: 2 } aisexec { user: root group: root } logging { fileline: off to_stderr: yes to_logfile: no to_syslog: yes syslog_facility: daemon debug: on timestamp: on logger_subsys { subsys: AMF debug: off tags: enter|leave|trace1|trace2|trace3|trace4|trace6 } } Thanks in advance. Kind regards Vladimir ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
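For completeness, on Ubuntu 14.04 this means enabling and starting both init scripts yourself; a hedged sketch (the START=yes gate in /etc/default/corosync is how the Debian/Ubuntu packaging of that era enabled corosync, so verify it against your package):

# make sure /etc/default/corosync contains the line: START=yes
update-rc.d corosync defaults
update-rc.d pacemaker defaults
service corosync start
service pacemaker start

The service { name: pacemaker ver: 1 } block in corosync.conf is not acted on by corosync 2.x and can simply be dropped.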
Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms
Emmanuel, emmanuel segura napsal(a): Helllo Jan, I'm using corosync+pacemaker on Sles 11 Sp1 and this is a critical system, Oh, ok. i don't think i'll get the authorization for upgrade system, but i would like to know if there is any bug about this issue in my current corosync release. This is hard to say. Suse guys probably included many patches, so it would make sense to try to contact Suse support. After very very quick look to git, following patches may be related: 559d4083ed8355fe83f275e53b9c8f52a91694b2, 02c5dffa5bb8579c223006fa1587de9ba7409a3d, 64d0e5ace025cc929e42896c5d6beb3ef75b8244, 6fae42ba72006941c1fde99616ea30f4f10ebb38, c7e686181bcd0e975b09725502bef02c7d0c338a. But still keep in mind that between latest 1.3.6 (what I believe is more or less what you are using) and current origin/flatiron are 118 patches... Regards, Honza Thanks Emmanuel 2014-04-30 17:07 GMT+02:00 Jan Friesse jfrie...@redhat.com: Emmanuel, emmanuel segura napsal(a): Hello Jan, Thanks for the explanation, but i saw this in my log. :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages. corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages. corosync [TOTEM ] A processor failed, forming new configuration. corosync [CLM ] CLM CONFIGURATION CHANGE corosync [CLM ] New Configuration: corosync [CLM ] r(0) ip(10.xxx.xxx.xxx) corosync [CLM ] Members Left: corosync [CLM ] r(0) ip(10.xxx.xxx.xxx) corosync [CLM ] Members Joined: corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 6904: memb=1, new=0, lost=1 corosync [pcmk ] info: pcmk_peer_update: memb: node01 891257354 corosync [pcmk ] info: pcmk_peer_update: lost: node02 874480 : when this happen, corosync needs to retransmit the toten? from what i understood the toten need to be retransmit, but in my case a new configuration was formed This my corosync version corosync-1.3.3-0.3.1 1.3.3 is unsupported for ages. Please upgrade to newest 1.4.6 (if you are using cman) or 2.3.3 (if you are not using cman). Also please change your pacemaker to not use plugin (upgrade to 2.3.3 will solve it automatically, because plugins in corosync 2.x are no longer support). Regards, Honza Thanks 2014-04-30 9:42 GMT+02:00 Jan Friesse jfrie...@redhat.com: Emmanuel, there is no need to trigger fencing on Process pause detected Also fencing is not triggered if membership didn't changed. So let's say token was lost but during gather state all nodes replied, then there is no change of membership and no need to fence. I believe your situation was: - one node is little overloaded - token lost - overload over - gather state - every node is alive - no fencing Regards, Honza emmanuel segura napsal(a): Hello Jan, Forget the last mail: Hello Jan, I found this problem in two hp blade system and the strange thing is the fencing was not triggered :(, but it's enabled 2014-04-25 18:36 GMT+02:00 emmanuel segura emi2f...@gmail.com: Hello Jan, I found this problem in two hp blade system and the strange thing is the fencing was triggered :( 2014-04-25 9:27 GMT+02:00 Jan Friesse jfrie...@redhat.com: Emanuel, emmanuel segura napsal(a): Hello List, I have this two lines in my cluster logs, somebody can help to know what this means. :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages. corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages. Corosync internally checks gap between member join messages. 
If such gap is token/2, it means, that corosync was not scheduled to run by kernel for too long, and it should discard membership messages. Original intend was to detect paused process. If pause is detected, it's better to discard old membership messages and initiate new query then sending outdated view. So there are various reasons why this is triggered, but today it's usually VM with overloaded host machine. corosync [TOTEM ] A processor failed, forming new configuration. :: I know the corosync [TOTEM ] A processor failed, forming new configuration message is when the toten package is definitely lost. Thanks Regards, Honza ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman
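One practical way to check whether a vendor build already carries the fixes Honza lists is to compare the package changelog with the upstream commits (a generic sketch; package naming and patch handling on SLES may differ):

rpm -q corosync
rpm -q --changelog corosync | less
# in an upstream corosync git checkout, inspect one of the listed commits:
git show 559d4083ed8355fe83f275e53b9c8f52a91694b2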
Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms
Emmanuel, there is no need to trigger fencing on Process pause detected Also fencing is not triggered if membership didn't changed. So let's say token was lost but during gather state all nodes replied, then there is no change of membership and no need to fence. I believe your situation was: - one node is little overloaded - token lost - overload over - gather state - every node is alive - no fencing Regards, Honza emmanuel segura napsal(a): Hello Jan, Forget the last mail: Hello Jan, I found this problem in two hp blade system and the strange thing is the fencing was not triggered :(, but it's enabled 2014-04-25 18:36 GMT+02:00 emmanuel segura emi2f...@gmail.com: Hello Jan, I found this problem in two hp blade system and the strange thing is the fencing was triggered :( 2014-04-25 9:27 GMT+02:00 Jan Friesse jfrie...@redhat.com: Emanuel, emmanuel segura napsal(a): Hello List, I have this two lines in my cluster logs, somebody can help to know what this means. :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages. corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages. Corosync internally checks gap between member join messages. If such gap is token/2, it means, that corosync was not scheduled to run by kernel for too long, and it should discard membership messages. Original intend was to detect paused process. If pause is detected, it's better to discard old membership messages and initiate new query then sending outdated view. So there are various reasons why this is triggered, but today it's usually VM with overloaded host machine. corosync [TOTEM ] A processor failed, forming new configuration. :: I know the corosync [TOTEM ] A processor failed, forming new configuration message is when the toten package is definitely lost. Thanks Regards, Honza ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- esta es mi vida e me la vivo hasta que dios quiera ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms
Emmanuel, emmanuel segura napsal(a): Hello Jan, Thanks for the explanation, but i saw this in my log. :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages. corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages. corosync [TOTEM ] A processor failed, forming new configuration. corosync [CLM ] CLM CONFIGURATION CHANGE corosync [CLM ] New Configuration: corosync [CLM ] r(0) ip(10.xxx.xxx.xxx) corosync [CLM ] Members Left: corosync [CLM ] r(0) ip(10.xxx.xxx.xxx) corosync [CLM ] Members Joined: corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 6904: memb=1, new=0, lost=1 corosync [pcmk ] info: pcmk_peer_update: memb: node01 891257354 corosync [pcmk ] info: pcmk_peer_update: lost: node02 874480 : when this happen, corosync needs to retransmit the toten? from what i understood the toten need to be retransmit, but in my case a new configuration was formed This my corosync version corosync-1.3.3-0.3.1 1.3.3 is unsupported for ages. Please upgrade to newest 1.4.6 (if you are using cman) or 2.3.3 (if you are not using cman). Also please change your pacemaker to not use plugin (upgrade to 2.3.3 will solve it automatically, because plugins in corosync 2.x are no longer support). Regards, Honza Thanks 2014-04-30 9:42 GMT+02:00 Jan Friesse jfrie...@redhat.com: Emmanuel, there is no need to trigger fencing on Process pause detected Also fencing is not triggered if membership didn't changed. So let's say token was lost but during gather state all nodes replied, then there is no change of membership and no need to fence. I believe your situation was: - one node is little overloaded - token lost - overload over - gather state - every node is alive - no fencing Regards, Honza emmanuel segura napsal(a): Hello Jan, Forget the last mail: Hello Jan, I found this problem in two hp blade system and the strange thing is the fencing was not triggered :(, but it's enabled 2014-04-25 18:36 GMT+02:00 emmanuel segura emi2f...@gmail.com: Hello Jan, I found this problem in two hp blade system and the strange thing is the fencing was triggered :( 2014-04-25 9:27 GMT+02:00 Jan Friesse jfrie...@redhat.com: Emanuel, emmanuel segura napsal(a): Hello List, I have this two lines in my cluster logs, somebody can help to know what this means. :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages. corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages. Corosync internally checks gap between member join messages. If such gap is token/2, it means, that corosync was not scheduled to run by kernel for too long, and it should discard membership messages. Original intend was to detect paused process. If pause is detected, it's better to discard old membership messages and initiate new query then sending outdated view. So there are various reasons why this is triggered, but today it's usually VM with overloaded host machine. corosync [TOTEM ] A processor failed, forming new configuration. :: I know the corosync [TOTEM ] A processor failed, forming new configuration message is when the toten package is definitely lost. 
Thanks Regards, Honza ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms
Emanuel, emmanuel segura napsal(a): Hello List, I have these two lines in my cluster logs; can somebody help me understand what they mean? :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages. corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages. Corosync internally checks the gap between member join messages. If such a gap is longer than token/2, it means that corosync was not scheduled to run by the kernel for too long, and it should discard membership messages. The original intent was to detect a paused process. If a pause is detected, it's better to discard old membership messages and initiate a new query than to send an outdated view. So there are various reasons why this is triggered, but today it's usually a VM with an overloaded host machine. corosync [TOTEM ] A processor failed, forming new configuration. :: I know the corosync [TOTEM ] A processor failed, forming new configuration message appears when the totem packet is definitely lost. Thanks Regards, Honza ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] corosync does not reflect the node status correctly
Michael, Michael Schwartzkopff napsal(a): Hi, we just upgraded to corosync-1.4.5-2.5 from the SUSE build server. On one cluster we have the problem that corosync-objctl does not reflect the status So if I understand it correctly, you have multiple clusters, all of them were upgraded, and only on one of them does this bug appear? of nodes properly. Even when the other node stops corosync we still see: runtime.totem.mrp.srp.members.ID.status=joined Is this consistent between nodes? I mean, do ALL nodes see the already stopped node as joined, or do some of them see it as left? Regards, Honza But the log says: [TOTEM] A processor joined or left the membership and a new membership was formed. Any ideas? Kind regards, Michael Schwartzkopff ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
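A quick way to narrow this down is to dump the runtime membership keys on every node and compare them, together with the ring status (a sketch for corosync 1.4; on that branch the membership keys normally live under runtime.totem.pg.mrp.srp.members):

corosync-objctl -a | grep -i members
corosync-cfgtool -s    # ring status as seen by the local node

If only one node still reports the stopped peer as joined, the problem is local to that node's corosync; if all nodes agree, the log and the object database genuinely disagree.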
Re: [Pacemaker] Errors while compiling
Stephan Buchner napsal(a): Hm, i tried recompiling all three packages (libqb, corosync and pacemaker), using versions which have been marked stable by the gentoo project. I used the following versions: libqb = 0.14.4 corosync= 1.4.5 pacemaker = 1.1.11 Now i get this error, which seems at least related to the last one i got: CC corosync.lo corosync.c:38:27: fatal error: corosync/cmap.h: No such file or directory compilation terminated. make[2]: *** [corosync.lo] Fehler 1 make[2]: Leaving directory `/opt/srccluster/pacemaker-Pacemaker-1.1.11/lib/cluster' make[1]: *** [all-recursive] Fehler 1 make[1]: Leaving directory `/opt/srccluster/pacemaker-Pacemaker-1.1.11/lib' make: *** [core] Fehler 1 Am i missing something here? I loosely followed this guide: cmap is included in corosync 2.x. Also libqb 0.14.4 is known to be buggy, please use 0.17.0 http://clusterlabs.org/wiki/SourceInstall Am 17.03.2014 06:11, schrieb Andrew Beekhof: Its looking for cmap_handle_t which will be in one of the corosync headers. What version of corosync have you got installed? On 15 Mar 2014, at 12:18 am, Stephan Buchner buch...@linux-systeme.de wrote: Hm, i installed libcrmcluster1-dev and libcrmcommon2-dev on my debian system, still the same error :/ Am 14.03.2014 14:07, schrieb emmanuel segura: maybe you are missing crm dev library 2014-03-14 13:39 GMT+01:00 Stephan Buchner buch...@linux-systeme.de: Hey everyone! I am trying to compile pacemaker from source for some time - but i keep getting the same errors, despite using different versions. I did the following to get this: 1. ./autogen.sh 2. ./configure --prefix=/opt/cluster/ --disable-fatal-warnings 3. make After that step i always get this error: http://pastebin.com/eXFmhUUD I get this on version 1.10, as on 1.11 Any ideas? -- Stephan Buchner buch...@linux-systeme.de +49 201 - 29 88 319 +49 172 - 7 222 333 Linux-Systeme GmbH Langenbergerstr. 179, 45277 Essen www.linux-systeme.de +49 201 - 29 88 30 Amtsgericht Essen, HRB 14729 Geschäftsführer Jörg Hinz ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- esta es mi vida e me la vivo hasta que dios quiera ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- Stephan Buchner buch...@linux-systeme.de +49 201 - 29 88 319 +49 172 - 7 222 333 Linux-Systeme GmbH Langenbergerstr. 
179, 45277 Essen www.linux-systeme.de +49 201 - 29 88 30 Amtsgericht Essen, HRB 14729 Geschäftsführer Jörg Hinz ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
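When building this stack from source the order matters, because pacemaker's build looks for headers such as corosync/cmap.h that only corosync 2.x installs; a hedged sketch of the usual sequence (prefixes are illustrative, and PKG_CONFIG_PATH must be adjusted if you install outside /usr):

# libqb 0.17.0
./autogen.sh && ./configure --prefix=/usr && make && make install
# corosync 2.3.x (provides corosync/cmap.h)
./autogen.sh && ./configure --prefix=/usr && make && make install
# pacemaker 1.1.11
./autogen.sh && ./configure --prefix=/usr --disable-fatal-warnings && make && make install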
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): Hi Honza, What I also found in the log related to the freeze at 12:22:26: Corosync main process was not scheduled for ... Can It be the general cause of the issue? Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597-[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943-[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943-[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647-[10.9.1.3]:161 Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout increase. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.). Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12: Regards, Attila -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Thursday, March 13, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Thursday, March 13, 2014 1:45 PM To: The Pacemaker cluster resource manager; Andrew Beekhof Subject: Re: [Pacemaker] Pacemaker/corosync freeze Hello, -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Thursday, March 13, 2014 10:03 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze ... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce. Perfect. Another option you can try to set is netmtu (1200 is usually safe). Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm To be honest, I had to wait much longer for this reproduction as before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue came, cpu was at 100% on all nodes - this time only on ctmgr, which was the DC... 
I hope you can find some useful details in the log. Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run crm_verify - L to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Perhaps Andrew could comment on that. Any idea? Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) - And maybe also newer pacemaker I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from Ubuntu package. I am currently building libqb 0.17.0, will update you on the results. In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all coroync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr In the meantime I will install the new libqb and send logs if we have further issues. Thank you very much for your help! Regards, Attila One more question: If I install libqb 0.17.0 from
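The "Corosync main process was not scheduled for ... ms" warning threshold is derived from the totem token timeout, so the usual mitigation (besides fixing the CPU starvation itself) is to raise the token timeout in corosync.conf. The values below are illustrative, not a recommendation from the thread:

totem {
    version: 2
    token: 10000                              # in ms; the default is much lower
    token_retransmits_before_loss_const: 10
    # existing interface/transport settings unchanged
}

All nodes should use the same value, and corosync has to be restarted (or the configuration reloaded, where supported) for it to take effect.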
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): Hi Honza, What I also found in the log related to the freeze at 12:22:26: Corosync main process was not scheduled for ... Can It be the general cause of the issue? I don't think it will cause issue you are hitting BUT keep in mind that if corosync is not scheduled for long time, it's probably fenced by other node. So increase timeout is vital. Honza Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597-[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943-[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943-[10.9.1.3]:161 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647-[10.9.1.3]:161 Mar 13 12:22:26 ctmgr corosync[3024]: [MAIN ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout increase. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] The token was lost in the OPERATIONAL state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] A processor failed, forming new configuration. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.). Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Creating commit token because I am the rep. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Saving state aru 6a8c high seq received 6a8c Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] Storing new sequence id for ring 7dc Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering COMMIT state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] got commit token Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] entering RECOVERY state. Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [0] member 10.9.1.3: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [1] member 10.9.1.41: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [2] member 10.9.1.42: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [3] member 10.9.1.71: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [4] member 10.9.1.72: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [5] member 10.9.2.11: Mar 13 12:22:26 ctmgr corosync[3024]: [TOTEM ] TRANS [6] member 10.9.2.12: Regards, Attila -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Thursday, March 13, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze -Original Message- From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Thursday, March 13, 2014 1:45 PM To: The Pacemaker cluster resource manager; Andrew Beekhof Subject: Re: [Pacemaker] Pacemaker/corosync freeze Hello, -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Thursday, March 13, 2014 10:03 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze ... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce. Perfect. Another option you can try to set is netmtu (1200 is usually safe). Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm To be honest, I had to wait much longer for this reproduction as before, even though there was no change in the corosync configuration - just potentially some system updates. 
But anyway, the issue is unfortunately still there. Previously, when this issue came, cpu was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log. Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run crm_verify - L to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Perhaps Andrew could comment on that. Any idea? Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) - And maybe also newer pacemaker I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from Ubuntu package. I am currently building libqb 0.17.0, will update you on the results. In the meantime we had another freeze, which did not seem to be related to any restarts, but brought all coroync processes to 100%. Please check out the corosync.log, perhaps it is a different cause: http://pastebin.com/WMwzv0Rr
Re: [Pacemaker] Pacemaker/corosync freeze
... Also can you please try to set debug: on in corosync.conf and paste full corosync.log then? I set debug to on, and did a few restarts but could not reproduce the issue yet - will post the logs as soon as I manage to reproduce. Perfect. Another option you can try to set is netmtu (1200 is usually safe). Finally I was able to reproduce the issue. I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when node was up again). The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm To be honest, I had to wait much longer for this reproduction as before, even though there was no change in the corosync configuration - just potentially some system updates. But anyway, the issue is unfortunately still there. Previously, when this issue came, cpu was at 100% on all nodes - this time only on ctmgr, which was the DC... I hope you can find some useful details in the log. Attila, what seems to be interesting is Configuration ERRORs found during PE processing. Please run crm_verify -L to identify issues. I'm unsure how much is this problem but I'm really not pacemaker expert. Anyway, I have theory what may happening and it looks like related with IPC (and probably not related to network). But to make sure we will not try fixing already fixed bug, can you please build: - New libqb (0.17.0). There are plenty of fixes in IPC - Corosync 2.3.3 (already plenty IPC fixes) - And maybe also newer pacemaker I know you were not very happy using hand-compiled sources, but please give them at least a try. Thanks, Honza Thanks, Attila Regards, Honza There are also a few things that might or might not be related: 1) Whenever I want to edit the configuration with crm configure edit, ... ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
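The two concrete suggestions in this exchange translate, roughly, into the following (illustrative sketch):

# show the configuration errors the policy engine complained about:
crm_verify -L -V

# and, if MTU/fragmentation is suspected, cap the totem packet size in corosync.conf:
totem {
    netmtu: 1200
    # existing settings unchanged
}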
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. 
I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush:Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Attila, Shall I try to downgrade to 1.4.6? What is the difference in that build? Or where should I start troubleshooting? First of all, 1.x branch (flatiron) is maintained so even it looks like a old version, it's quite a new. It contains more or less only
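For what it's worth, "Try again (6)" is CS_ERR_TRY_AGAIN coming back from corosync's CPG interface, i.e. corosync is temporarily refusing to accept or deliver messages. When the cluster is wedged like this, it helps to capture corosync's own view before restarting anything (generic corosync 2.x commands):

corosync-quorumtool -s    # membership and quorum as corosync sees it
corosync-cpgtool          # CPG groups and their members
corosync-cfgtool -s       # status of each configured ring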
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. 
We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages
Re: [Pacemaker] Pacemaker/corosync freeze
Attila Megyeri napsal(a): -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 2:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): Hello Jan, Thank you very much for your help so far. -Original Message- From: Jan Friesse [mailto:jfrie...@redhat.com] Sent: Wednesday, March 12, 2014 9:51 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze Attila Megyeri napsal(a): -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 10:27 PM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote: -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Tuesday, March 11, 2014 12:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote: Thanks for the quick response! -Original Message- From: Andrew Beekhof [mailto:and...@beekhof.net] Sent: Friday, March 07, 2014 3:48 AM To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] Pacemaker/corosync freeze On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote: Hello, We have a strange issue with Corosync/Pacemaker. From time to time, something unexpected happens and suddenly the crm_mon output remains static. When I check the cpu usage, I see that one of the cores uses 100% cpu, but cannot actually match it to either the corosync or one of the pacemaker processes. In such a case, this high CPU usage is happening on all 7 nodes. I have to manually go to each node, stop pacemaker, restart corosync, then start pacemeker. Stoping pacemaker and corosync does not work in most of the cases, usually a kill -9 is needed. Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty. Using udpu as transport, two rings on Gigabit ETH, rro_mode passive. Logs are usually flooded with CPG related messages, such as: Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:49 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) Mar 06 18:10:50 [1316] ctsip1 crmd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=8): Try again (6) OR Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10933): Try again ( That is usually a symptom of corosync getting into a horribly confused state. Version? Distro? Have you checked for an update? Odd that the user of all that CPU isn't showing up though. As I wrote I use Ubuntu trusty, the exact package versions are: corosync 2.3.0-1ubuntu5 pacemaker 1.1.10+git20130802-1ubuntu2 Ah sorry, I seem to have missed that part. There are no updates available. The only option is to install from sources, but that would be very difficult to maintain and I'm not sure I would get rid of this issue. What do you recommend? 
The same thing as Lars, or switch to a distro that stays current with upstream (git shows 5 newer releases for that branch since it was released 3 years ago). If you do build from source, its probably best to go with v1.4.6 Hm, I am a bit confused here. We are using 2.3.0, I swapped the 2 for a 1 somehow. A bit distracted, sorry. I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the same issue - after some time CPU gets to 100%, and the corosync log is flooded with messages like: Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:55 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:56 [4798] ctdb2 crmd: info: crm_cs_flush: Sent 0 CPG messages (51 remaining, last=3995): Try again (6) Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush: Sent 0 CPG messages (48 remaining, last=3671): Try again (6) Mar 12 07:36:57 [4798] ctdb2 crmd
Re: [Pacemaker] [corosync] corosync Segmentation fault.
Andrey, what version of corosync and libqb are you using? Can you please attach output from valgrind (and gdb backtrace)? Thanks, Honza Andrey Groshev napsal(a): Hi, ALL. Something I already confused, or after updating any package or myself something broke, but call corosycn killed by segmentation fault signal. I correctly understood that does not link the library libqb ? . (gdb) n [New Thread 0x74b2b700 (LWP 9014)] 1266if ((flock_err = corosync_flock (corosync_lock_file, getpid ())) != COROSYNC_DONE_EXIT) { (gdb) n 1280totempg_initialize ( (gdb) n 1284totempg_service_ready_register ( (gdb) n 1287totempg_groups_initialize ( (gdb) n 1292totempg_groups_join ( (gdb) n 1307schedwrk_init ( (gdb) n 1314qb_loop_run (corosync_poll_handle); (gdb) n Program received signal SIGSEGV, Segmentation fault. 0x771e581c in free () from /lib64/libc.so.6 (gdb) ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
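For anyone reproducing this, a foreground run under valgrind plus a gdb backtrace is what is being asked for; a minimal sketch:

valgrind --leak-check=full --track-origins=yes /usr/sbin/corosync -f 2> corosync-valgrind.log

gdb /usr/sbin/corosync
(gdb) run -f
# ... wait for the SIGSEGV ...
(gdb) bt full
(gdb) thread apply all bt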
Re: [Pacemaker] [corosync] corosync Segmentation fault.
Andrey, can you please give a try to patch [PATCH] votequorum: Properly initialize atb and atb_string which I've sent to ML (it should be there soon)? Thanks, Honza Andrey Groshev napsal(a): 26.02.2014, 12:11, Jan Friesse jfrie...@redhat.com: Andrey, what version of corosync and libqb are you using? Can you please attach output from valgrind (and gdb backtrace)? ,,, 1314qb_loop_run (corosync_poll_handle); (gdb) n Program received signal SIGSEGV, Segmentation fault. 0x771e581c in free () from /lib64/libc.so.6 (gdb) bt #0 0x771e581c in free () from /lib64/libc.so.6 #1 0x77fe77ec in votequorum_readconfig (runtime=value optimized out) at votequorum.c:1293 #2 0x77fe8300 in votequorum_exec_init_fn (api=value optimized out) at votequorum.c:2115 #3 0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200760) at service.c:139 #4 0x77fe4197 in votequorum_init (api=0x78200980, q_set_quorate_fn=0x77fda5b0 quorum_api_set_quorum) at votequorum.c:2255 #5 0x77fda42f in quorum_exec_init_fn (api=0x78200980) at vsf_quorum.c:280 #6 0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200c40) at service.c:139 #7 0x77feede9 in corosync_service_defaults_link_and_init (corosync_api=0x78200980) at service.c:348 #8 0x77fe9621 in main_service_ready () at main.c:978 #9 0x77b90b0f in main_iface_change_fn (context=0x77f73010, iface_addr=value optimized out, iface_no=0) at totemsrp.c:4672 #10 0x77b8a734 in timer_function_netif_check_timeout (data=0x78304f10) at totemudp.c:672 #11 0x777289f8 in ?? () from /usr/lib64/libqb.so.0 #12 0x77727016 in qb_loop_run () from /usr/lib64/libqb.so.0 #13 0x77fea930 in main (argc=value optimized out, argv=value optimized out, envp=value optimized out) at main.c:1314 Unfortunately, I have not yet used a valgrind. Or hangs, or fast end with : # valgrind /usr/sbin/corosync -f ==2137== Memcheck, a memory error detector ==2137== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. ==2137== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==2137== Command: /usr/sbin/corosync -f ==2137== ==2137== ==2137== HEAP SUMMARY: ==2137== in use at exit: 29,876 bytes in 193 blocks ==2137== total heap usage: 890 allocs, 697 frees, 100,824 bytes allocated ==2137== ==2137== LEAK SUMMARY: ==2137==definitely lost: 0 bytes in 0 blocks ==2137==indirectly lost: 0 bytes in 0 blocks ==2137== possibly lost: 539 bytes in 22 blocks ==2137==still reachable: 29,337 bytes in 171 blocks ==2137== suppressed: 0 bytes in 0 blocks ==2137== Rerun with --leak-check=full to see details of leaked memory ==2137== ==2137== For counts of detected and suppressed errors, rerun with: -v ==2137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 6) Now read manual about valgrind. Thanks, Honza Andrey Groshev napsal(a): Hi, ALL. Something I already confused, or after updating any package or myself something broke, but call corosycn killed by segmentation fault signal. I correctly understood that does not link the library libqb ? . (gdb) n [New Thread 0x74b2b700 (LWP 9014)] 1266if ((flock_err = corosync_flock (corosync_lock_file, getpid ())) != COROSYNC_DONE_EXIT) { (gdb) n 1280totempg_initialize ( (gdb) n 1284totempg_service_ready_register ( (gdb) n 1287totempg_groups_initialize ( (gdb) n 1292totempg_groups_join ( (gdb) n 1307schedwrk_init ( (gdb) n 1314qb_loop_run (corosync_poll_handle); (gdb) n Program received signal SIGSEGV, Segmentation fault. 
0x771e581c in free () from /lib64/libc.so.6 (gdb) ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
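(For reference: the fuller valgrind run hinted at by the summary above would look roughly like this; the binary path is the one used in the thread, and the two extra options are standard valgrind flags.)
valgrind --leak-check=full --track-origins=yes /usr/sbin/corosync -f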
Re: [Pacemaker] [corosync] corosync Segmentation fault.
Andrey Groshev napsal(a): 26.02.2014, 16:11, Jan Friesse jfrie...@redhat.com: Andrey, can you please give a try to patch [PATCH] votequorum: Properly initialize atb and atb_string which I've sent to ML (it should be there soon)? Yes. Service is running. Thanks. # corosync-quorumtool -l Membership information -- Nodeid Votes Name 172793104 1 dev-cluster2-node1 (local) Continue tests. In messages logs I see Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15480]: [error] trying to recv chunk of size 1024 but 4030249 available Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15497]: [error] trying to recv chunk of size 1024 but 40489 available Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15514]: [error] Corrupt blackbox: File header hash (436212587) does not match calculated hash (-1660939413) Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15531]: [error] Corrupt blackbox: File header hash (8328043) does not match calculated hash (-905964693) Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15548]: [error] Corrupt blackbox: File header hash (12651) does not match calculated hash (21972) . At this time build libqb. It tests or real errors? Looks more like build tests. Honza Thanks, Honza Andrey Groshev napsal(a): 26.02.2014, 12:11, Jan Friesse jfrie...@redhat.com: Andrey, what version of corosync and libqb are you using? Can you please attach output from valgrind (and gdb backtrace)? ,,, 1314qb_loop_run (corosync_poll_handle); (gdb) n Program received signal SIGSEGV, Segmentation fault. 0x771e581c in free () from /lib64/libc.so.6 (gdb) bt #0 0x771e581c in free () from /lib64/libc.so.6 #1 0x77fe77ec in votequorum_readconfig (runtime=value optimized out) at votequorum.c:1293 #2 0x77fe8300 in votequorum_exec_init_fn (api=value optimized out) at votequorum.c:2115 #3 0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200760) at service.c:139 #4 0x77fe4197 in votequorum_init (api=0x78200980, q_set_quorate_fn=0x77fda5b0 quorum_api_set_quorum) at votequorum.c:2255 #5 0x77fda42f in quorum_exec_init_fn (api=0x78200980) at vsf_quorum.c:280 #6 0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200c40) at service.c:139 #7 0x77feede9 in corosync_service_defaults_link_and_init (corosync_api=0x78200980) at service.c:348 #8 0x77fe9621 in main_service_ready () at main.c:978 #9 0x77b90b0f in main_iface_change_fn (context=0x77f73010, iface_addr=value optimized out, iface_no=0) at totemsrp.c:4672 #10 0x77b8a734 in timer_function_netif_check_timeout (data=0x78304f10) at totemudp.c:672 #11 0x777289f8 in ?? () from /usr/lib64/libqb.so.0 #12 0x77727016 in qb_loop_run () from /usr/lib64/libqb.so.0 #13 0x77fea930 in main (argc=value optimized out, argv=value optimized out, envp=value optimized out) at main.c:1314 Unfortunately, I have not yet used a valgrind. Or hangs, or fast end with : # valgrind /usr/sbin/corosync -f ==2137== Memcheck, a memory error detector ==2137== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. 
==2137== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info ==2137== Command: /usr/sbin/corosync -f ==2137== ==2137== ==2137== HEAP SUMMARY: ==2137== in use at exit: 29,876 bytes in 193 blocks ==2137== total heap usage: 890 allocs, 697 frees, 100,824 bytes allocated ==2137== ==2137== LEAK SUMMARY: ==2137==definitely lost: 0 bytes in 0 blocks ==2137==indirectly lost: 0 bytes in 0 blocks ==2137== possibly lost: 539 bytes in 22 blocks ==2137==still reachable: 29,337 bytes in 171 blocks ==2137== suppressed: 0 bytes in 0 blocks ==2137== Rerun with --leak-check=full to see details of leaked memory ==2137== ==2137== For counts of detected and suppressed errors, rerun with: -v ==2137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 6) Now read manual about valgrind. Thanks, Honza Andrey Groshev napsal(a): Hi, ALL. Something I already confused, or after updating any package or myself something broke, but call corosycn killed by segmentation fault signal. I correctly understood that does not link the library libqb ? . (gdb) n [New Thread 0x74b2b700 (LWP 9014)] 1266if ((flock_err = corosync_flock (corosync_lock_file, getpid ())) != COROSYNC_DONE_EXIT) { (gdb) n 1280totempg_initialize ( (gdb) n 1284totempg_service_ready_register ( (gdb) n 1287totempg_groups_initialize ( (gdb) n 1292totempg_groups_join ( (gdb) n 1307schedwrk_init
Re: [Pacemaker] Multicast pitfalls? corosync [TOTEM ] Retransmit List:
Beo, are you experiencing a cluster split? If the answer is no, then you don't need to do anything; maybe a network buffer is just getting filled. But if the answer is yes, try reducing the MTU size (netmtu in the configuration) to a value like 1000. Regards, Honza Beo Banks napsal(a): Hi, I have a fresh 2-node cluster (kvm host1 - guest = nodeA | kvm host2 - guest = NodeB) and it seems to work, but from time to time I get a lot of errors like Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 196 198 184 185 186 187 188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 197 199 Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 197 199 184 185 186 187 188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 196 198 Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 196 198 184 185 186 187 188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 197 199 Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 197 199 184 185 186 187 188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 196 198 Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 196 198 184 185 186 187 188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 197 199 Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 197 199 184 185 186 187 188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 196 198 I am using the newest RHEL 6.5 version. I have also already tried to solve the issue with echo 1 > /sys/class/net/virbr0/bridge/multicast_querier (on the host system) but with no luck... I have disabled iptables and selinux, same issue. How can I solve it? Thanks, Beo ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
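(For reference: a minimal totem section with the reduced MTU Honza suggests is sketched below; only the netmtu value comes from this thread, the addresses are placeholders.)
totem {
    version: 2
    netmtu: 1000
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}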
Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
Brian J. Murrell (brian) napsal(a): I seem to have another instance where pacemaker fails to exit at the end of a shutdown. Here's the log from the start of the service pacemaker stop: Dec 3 13:00:39 wtm-60vm8 crmd[14076]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Dec 3 13:00:39 wtm-60vm8 crmd[14076]: info: do_te_invoke: Processing graph 19 (ref=pe_calc-dc-1386093636-83) derived from /var/lib/pengine/pe-input-40.bz2 ... Dec 3 13:05:08 wtm-60vm8 pacemakerd[14067]: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again Dec 3 13:05:08 wtm-60vm8 pacemakerd[14067]: notice: pcmk_shutdown_worker: Shutdown complete Dec 3 13:05:08 wtm-60vm8 pacemakerd[14067]: info: main: Exiting pacemakerd These types of shutdown failure issues seem to always end up with the series of: error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again Even though the above messages seem to indicate that pacemaker did finally exit, it did not, as can be seen by looking at the process table: 14032 ? Ssl 0:01 corosync 14067 ? S 0:00 pacemakerd 14071 ? Ss 0:00 \_ /usr/libexec/pacemaker/cib So what does this sending message via cpg FAILED: (rc=6) mean exactly? Error 6 means try again. This happens either if corosync is overloaded or when it is creating a new membership. Please take a look at /var/log/cluster/corosync.log and check whether you see something strange there (and make sure you have the newest corosync). Regards, Honza Or any other ideas what happened to this shutdown to cause it to fail/hang ultimately? Cheers, b. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
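(For reference: one possible way to act on Honza's suggestion; the log path comes from the thread, the grep pattern and package query are merely illustrative.)
grep -Ei 'error|retransmit|fail' /var/log/cluster/corosync.log | tail -n 50
rpm -q corosync pacemaker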
Re: [Pacemaker] Network outage debugging
Andrew Beekhof napsal(a): On 13 Nov 2013, at 11:49 am, Sean Lutner s...@rentul.net wrote: On Nov 12, 2013, at 7:33 PM, Andrew Beekhof and...@beekhof.net wrote: On 13 Nov 2013, at 11:22 am, Sean Lutner s...@rentul.net wrote: On Nov 12, 2013, at 6:01 PM, Andrew Beekhof and...@beekhof.net wrote: On 13 Nov 2013, at 6:10 am, Sean Lutner s...@rentul.net wrote: The folks testing the cluster I've been building have run a script which blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. During this time, the network being down seems to cause some odd behavior from pacemaker resulting in it dying. The cluster is two nodes and running four custom resources on EC2 instances. The OS is CentOS 6.4 with the config below: I've attached the /var/log/messages and /var/log/cluster/corosync.log from the time period during the test. I've having some difficulty in piecing together what happened and am hoping someone can shed some light on the problem. Any indications why pacemaker is dying on that node? Because corosync is dying underneath it: Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: send_ais_text:Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110) Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: pcmk_cpg_dispatch:Connection to the CPG API failed: 2 Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: cib_ais_destroy:Corosync connection lost! Exiting. Nov 09 14:51:49 [942] ip-10-50-3-251cib: info: terminate_cib:cib_ais_destroy: Exiting fast... Is that the expected behavior? It is expected behaviour when corosync dies. Ideally corosync wouldn't die though. What other debugging can I do to try to find out why corosync died? There are various logging setting that may help. CC'ing Jan to see if he has any suggestions. If corosync really died corosync-fplay output (right after corosync death) and coredump are most useful. Regards, Honza Thanks Is it because the DC was the other node? No. I did notice that there was an attempted fence operation but it didn't look successful. 
[root@ip-10-50-3-122 ~]# pcs config Corosync Nodes: Pacemaker Nodes: ip-10-50-3-122 ip-10-50-3-251 Resources: Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf) Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s Operations: monitor interval=5s Clone: EIP-AND-VARNISH-clone Group: EIP-AND-VARNISH Resource: Varnish (provider=redhat type=varnish.sh class=ocf) Operations: monitor interval=5s Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf) Operations: monitor interval=5s Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf) Operations: monitor interval=5s Resource: ec2-fencing (type=fence_ec2 class=stonith) Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02 Operations: monitor start-delay=30s interval=0 timeout=150s Location Constraints: Ordering Constraints: ClusterEIP_54.215.143.166 then Varnish Varnish then Varnishlog Varnishlog then Varnishncsa Colocation Constraints: Varnish with ClusterEIP_54.215.143.166 Varnishlog with Varnish Varnishncsa with Varnishlog Cluster Properties: dc-version: 1.1.8-7.el6-394e906 cluster-infrastructure: cman last-lrm-refresh: 1384196963 no-quorum-policy: ignore stonith-enabled: true net-failure-messages-110913.outnet-failure-corosync-110913.out ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
Re: [Pacemaker] Network outage debugging
Sean Lutner napsal(a): On Nov 13, 2013, at 3:15 AM, Jan Friesse jfrie...@redhat.com wrote: Andrew Beekhof napsal(a): On 13 Nov 2013, at 11:49 am, Sean Lutner s...@rentul.net wrote: On Nov 12, 2013, at 7:33 PM, Andrew Beekhof and...@beekhof.net wrote: On 13 Nov 2013, at 11:22 am, Sean Lutner s...@rentul.net wrote: On Nov 12, 2013, at 6:01 PM, Andrew Beekhof and...@beekhof.net wrote: On 13 Nov 2013, at 6:10 am, Sean Lutner s...@rentul.net wrote: The folks testing the cluster I've been building have run a script which blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. During this time, the network being down seems to cause some odd behavior from pacemaker resulting in it dying. The cluster is two nodes and running four custom resources on EC2 instances. The OS is CentOS 6.4 with the config below: I've attached the /var/log/messages and /var/log/cluster/corosync.log from the time period during the test. I've having some difficulty in piecing together what happened and am hoping someone can shed some light on the problem. Any indications why pacemaker is dying on that node? Because corosync is dying underneath it: Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: send_ais_text:Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110) Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: pcmk_cpg_dispatch:Connection to the CPG API failed: 2 Nov 09 14:51:49 [942] ip-10-50-3-251cib:error: cib_ais_destroy:Corosync connection lost! Exiting. Nov 09 14:51:49 [942] ip-10-50-3-251cib: info: terminate_cib:cib_ais_destroy: Exiting fast... Is that the expected behavior? It is expected behaviour when corosync dies. Ideally corosync wouldn't die though. What other debugging can I do to try to find out why corosync died? There are various logging setting that may help. CC'ing Jan to see if he has any suggestions. If corosync really died corosync-fplay output (right after corosync death) and coredump are most useful. Regards, Honza So the process to collect this would be: - Run the test - Watch the logs for corosync to die - Run corosync-fplay and capture the output (will corosync-fplay file.out suffice?) Yes. Usually, file is quite large, so gzip/xz is good idea. - Capture a core dump from corosync How do I capture the core dump? Is it something that has to be enabled in the /etc/corosync/corosync.conf file first and then run the tests? I've not done this in the past. This really depends. Do you have abrt enabled? If so, core is processed via abrt. (Way how to find out if abrt is running is to look to kernel.core_pattern sysctl. There is something different then classic value core). If you do not have abrt enabled, you must make sure to enable core dumps. When executing corosync via cman, it should be enabled automatically (start_global function does ulimit -c unlimited). If you are using corosync itself, create file /etc/default/corosync with content ulimit -c unlimited. Coredumps are stored in /var/lib/corosync/core.* (maybe you have already some of them there, so just take a look). Now, please install corosynclib-devel package and use http://stackoverflow.com/questions/5115613/core-dump-file-analysis Important part is to execute bt (or even better, thread apply all bt) and send output from this command. Regards, Honza Thanks Thanks Is it because the DC was the other node? No. I did notice that there was an attempted fence operation but it didn't look successful. 
[root@ip-10-50-3-122 ~]# pcs config Corosync Nodes: Pacemaker Nodes: ip-10-50-3-122 ip-10-50-3-251 Resources: Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf) Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s Operations: monitor interval=5s Clone: EIP-AND-VARNISH-clone Group: EIP-AND-VARNISH Resource: Varnish (provider=redhat type=varnish.sh class=ocf) Operations: monitor interval=5s Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf) Operations: monitor interval=5s Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf) Operations: monitor interval=5s Resource: ec2-fencing (type=fence_ec2 class=stonith) Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02 Operations: monitor start-delay=30s interval=0 timeout=150s Location Constraints: Ordering Constraints: ClusterEIP_54.215.143.166 then Varnish Varnish then Varnishlog Varnishlog then Varnishncsa Colocation Constraints: Varnish with ClusterEIP_54.215.143.166 Varnishlog with Varnish Varnishncsa with Varnishlog Cluster Properties: dc-version: 1.1.8-7.el6-394e906 cluster
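(For reference: the collection steps described above, written out as commands; the paths come from the thread itself, and the core file name depends on the PID of the crashed process.)
sysctl kernel.core_pattern                            # anything other than 'core' usually means abrt handles the dump
echo 'ulimit -c unlimited' > /etc/default/corosync    # when corosync is started on its own rather than via cman
corosync-fplay > fplay.out && gzip fplay.out          # capture the flight data right after the crash
ls /var/lib/corosync/core.*                           # existing core dumps, if any
gdb corosync /var/lib/corosync/core.<pid>             # then: bt, or better, thread apply all bt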
Re: [Pacemaker] Simple installation Pacemaker + CMAN + fence-agents
Andrew Beekhof napsal(a): Something seems very wrong with this at the corosync level. Even fenced and the dlm are having issues. Jan: Could this be firewall related? Yes. This can be ether firewall on mcast issue. I would recommend to turn off firewall completely (for testing). If this doesn't help, try omping for multicast test. Honza On 27 Sep 2013, at 10:44 pm, Bartłomiej Wójcik bartlomiej.woj...@turbineam.com wrote: W dniu 2013-09-27 04:26, Andrew Beekhof pisze: On 26/09/2013, at 8:35 PM, Bartłomiej Wójcik bartlomiej.woj...@turbineam.com wrote: Hello, I install Pacemaker in accordance with http://clusterlabs.org/quickstart-ubuntu.html on Ubuntu 13.04 two nodes changing only the IP addresses. /etc/cluster/cluster.conf: ?xml version=1.0? cluster config_version=1 name=pacemaker1 logging debug=off/ clusternodes clusternode name=fmpgpool4 nodeid=1 fence method name=pcmk-redirect device name=pcmk port=fmpgpool4/ /method /fence /clusternode clusternode name=fmpgpool5 nodeid=2 fence method name=pcmk-redirect device name=pcmk port=fmpgpool5/ /method /fence /clusternode /clusternodes fencedevices fencedevice name=pcmk agent=fence_pcmk/ /fencedevices /cluster gets only the server: ps -ef|grep pacemaker pacemakerd What do the logs from pacemakerd say? and nothing more I try to do: crm configure property stonith-enabled=false and gets: Signon to CIB failed: connection failed Init failed, could not perform requested operations ERROR: cannot parse xml: no element found: line 1, column 0 ERROR: No CIB! I don't know what could be wrong. Regards! ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org Hello, corosync.log: Sep 26 11:14:50 corosync [MAIN ] Corosync Cluster Engine ('1.4.4'): started and ready to provide service. Sep 26 11:14:50 corosync [MAIN ] Corosync built-in features: nss Sep 26 11:14:50 corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf Sep 26 11:14:50 corosync [MAIN ] Successfully parsed cman config Sep 26 11:14:50 corosync [MAIN ] Successfully configured openais services to load Sep 26 11:14:50 corosync [TOTEM ] Initializing transport (UDP/IP Multicast). Sep 26 11:14:50 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). Sep 26 11:14:50 corosync [TOTEM ] The network interface [10.0.0.34] is now up. 
Sep 26 11:14:50 corosync [QUORUM] Using quorum provider quorum_cman Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Sep 26 11:14:50 corosync [CMAN ] CMAN 3.1.8 (built Jan 17 2013 06:24:33) started Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync CMAN membership service 2.90 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: openais cluster membership service B.01.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: openais event service B.01.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: openais checkpoint service B.01.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: openais message service B.03.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: openais distributed locking service B.03.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: openais timer service A.01.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync configuration service Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01 Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync profile loading service Sep 26 11:14:50 corosync [QUORUM] Using quorum provider quorum_cman Sep 26 11:14:50 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1 Sep 26 11:14:56 corosync [CLM ] Members Left: Sep 26 11:14:56 corosync [CLM ] Members Joined: Sep 26 11:14:56 corosync [CLM ] r(0) ip(10.0.0.35) Sep 26 11:14:56 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Set r/w permissions for uid=108, gid=0 on /var/log/cluster/corosync.log
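(For reference: one way to run the tests Honza suggests; the node names are the ones from this cluster.conf, and the omping invocation is just an illustrative multicast check run on both nodes at the same time.)
iptables -F                          # temporarily clear the firewall for testing
omping -c 30 fmpgpool4 fmpgpool5     # run on both nodes; reports unicast and multicast loss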
Re: [Pacemaker] Could not initialize corosync configuration API error 2
Andrew, this problem was already discussed on the corosync ML. Andrew Beekhof napsal(a): Jan: not sure if you're on the pacemaker list On 29 Oct 2013, at 6:43 pm, Bauer, Stefan (IZLBW Extern) stefan.ba...@iz.bwl.de wrote: Dear Developers/Users, we're using Pacemaker 1.1.7 and Corosync Cluster Engine 1.4.2 with Debian 6 and a recent vanilla kernel (3.10). On quite a lot of our clusters we can no longer check the ring status: corosync-cfgtool -s returns: Could not initialize corosync configuration API error 2 A reboot fixes the problem. Even though the status is not returned, I see traffic on the ring interfaces and the cluster is operational. We're using rrp_mode: active with 2 ring interfaces with multicast. Is this a known problem? Not that I know of. CC'ing Jan (corosync maintainer) Please try upgrading from 1.4.2 to 1.4.6. There are about 105 patches and (according to git) 83 files changed, 2623 insertions(+), 652 deletions(-). There are no new features, only fixes. Does a workaround exist so that we are not forced to reboot the machines regularly? Any help is greatly appreciated. Regards Stefan Bauer ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org Regards, Honza ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
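(For reference: quick checks before and after the suggested upgrade; corosync -v prints the running version, and corosync-cfgtool -s is the ring status command from the report above.)
corosync -v
corosync-cfgtool -s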
Re: [Pacemaker] Pacemaker 1.1.8 and corosync's cpg service?
Mike, did you enter the local node in the nodelist? Because this may explain the behavior you were describing. Honza Mike Edwards napsal(a): On Tue, May 21, 2013 at 11:15:56AM +1000, Andrew Beekhof babbled thus: cpg_join() is returning CS_ERR_TRY_AGAIN here. Jan: Any idea why this might happen? That's a fair time to be blocked for. Looks like the problem was with the udpu transport. Switching to udp let pacemaker start. I've also noticed that multicast fails to work in this environment, though whether the issue lies with our switches, VMware, or CentOS 6 itself, I'm unsure as of yet. Thanks Andrew. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Pacemaker 1.1.8 and corosync's cpg service?
Mike Edwards napsal(a): Which would be the recommended transport? I'm not tied to any particular method. As long as UDP (multicast) works for you, it's the better solution (better tested, faster, ...). UDPU is targeted at deployments where multicast is a problem. Regards, Honza On Wed, May 22, 2013 at 10:01:37AM +1000, Andrew Beekhof babbled thus: I think nodelist only works for corosync 2.x. So if you want to use udpu you might need to look up the corosync 1.x syntax. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
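(For reference: a minimal multicast interface section of the kind recommended here; the addresses are placeholders rather than values from this thread.)
totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.23.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}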
Re: [Pacemaker] Pacemaker 1.1.8 and corosync's cpg service?
Actually, I've reviewed that config file again and it looks like you are using corosync 1.x. There, nodelist really is not supported; what is supported is a member object inside the interface section (see corosync.conf.example.udpu). For corosync 2.x, a member object inside the interface object also works, but it is internally converted to the recommended form with a nodelist (so that's what you've sent). Regards, Honza Mike Edwards napsal(a): Yep. The config I pasted has the bindnetaddr set to 10.10.23.50, which also happens to be defined as node 1. On Wed, May 22, 2013 at 09:28:13AM +0200, Jan Friesse babbled thus: Mike, did you enter the local node in the nodelist? Because this may explain the behavior you were describing. Honza ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
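(For reference: a sketch of the corosync 1.x udpu syntax described above, with the member objects inside the interface section; 10.10.23.50 is the node mentioned in the thread, the second address is a placeholder.)
totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.23.0
        mcastport: 5405
        member {
            memberaddr: 10.10.23.50
        }
        member {
            memberaddr: 10.10.23.51
        }
    }
}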
Re: [Pacemaker] [Openais] Hawk 0.5.2 Debian packages
Great news! Regards, Honza Charles Williams napsal(a): Hey all, I recently got a chance to finally build Debian packages for the 0.5.2 version of ClusterLabs Hawk GUI. These are Squeeze packages ATM (Wheezy to come next week dependent upon testing of the current packages) and I am looking for people interested in testing. If so. just head over to http://wiki.itadmins.net/doku.php?id=high_availability:hawk0.5.2 if you have any problems or such just let me know. I would like to be able to get Wheezy packages finished in the next couple of weeks. Thanks for your time, Chuck ___ Openais mailing list open...@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/openais ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [corosync] Corosync memory usage rising
Andrew Beekhof napsal(a): On Thu, Jan 31, 2013 at 8:10 AM, Yves Trudeau y.trud...@videotron.ca wrote: Hi, Is there any known memory leak issue corosync 1.4.1. I have a setup here where corosync eats memory at a few kB a minute: 1.4.1 for sure. But it looks you are using 1.4.1-7 (EL 6.3.z), and I must say no, there is no known bug like this. Are you running pacemaker (if so plugin or cpg version)? OpenAIS services loaded? Is it clean corosync or corosync executed via cman? Honza [root@mys002 mysql]# while [ 1 ]; do ps faxu | grep corosync | grep -v grep; sleep 60; done root 11071 0.2 0.0 624256 8840 ?Ssl 09:14 0:02 corosync root 11071 0.2 0.0 624344 9144 ?Ssl 09:14 0:02 corosync root 11071 0.2 0.0 624344 9424 ?Ssl 09:14 0:02 corosync It goes on like that until no more memory which is still a long time. Another has corosync running for a long time: [root@mys001 mysql]# ps faxu | grep corosync | grep -v grep root 15735 0.2 21.5 4038664 3429592 ? Ssl 2012 184:19 corosync which is nearly 3.4GB. Holy heck! Bouncing to the corosync ML for comment. [root@mys002 mysql]# rpm -qa | grep -i coro corosynclib-1.4.1-7.el6_3.1.x86_64 corosync-1.4.1-7.el6_3.1.x86_64 [root@mys002 mysql]# uname -a Linux mys002 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux looking at smaps of the process, I found this: 020b6000-d2b34000 rw-p 00:00 0 Size:3418616 kB Rss: 3417756 kB Pss: 3417756 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 3417756 kB Referenced: 3417064 kB Anonymous: 3417756 kB AnonHugePages: 3416064 kB Swap: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB this setup is using udpu totem { version: 2 secauth: on threads: 0 window_size: 5 max_messages: 5 netmtu: 1000 token: 5000 join: 1000 consensus: 5000 interface { member { memberaddr: 10.103.7.91 } member { memberaddr: 10.103.7.92 } ringnumber: 0 bindnetaddr: 10.103.7.91 mcastport: 5405 ttl: 1 } transport: udpu } with special timings because of issues with the vmware setup. Any idea of what could be causing this? Regards, Yves ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
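(For reference: a few illustrative ways to answer the questions above about the setup; corosync-objctl is the corosync 1.x object database tool, the service.d path is where a pacemaker plugin stanza usually lives, and the grep patterns are only one way to do the check.)
ps -ef | grep -E 'corosync|cman|pacemakerd'
corosync-objctl | grep -i service
ls /etc/corosync/service.d/ 2>/dev/null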
Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster
: 11,075,087 bytes in 1,613 blocks ==5453== suppressed: 0 bytes in 0 blocks ==5453== Rerun with --leak-check=full to see details of leaked memory ==5453== ==5453== For counts of detected and suppressed errors, rerun with: -v ==5453== Use --track-origins=yes to see where uninitialised values come from ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2) Bus error (core dumped) I was also able to capture non-truncated fdata: http://sources.xes-inc.com/downloads/fdata-20121107 Here is the coredump: http://sources.xes-inc.com/downloads/vgcore.5453 I was not able to get corosync to crash without pacemaker also running, though I was not able to test for a long period of time. Another thing I discovered tonight was that the 127.0.1.1 entry in /etc/hosts (on both storage0 and storage1) was the source of the extra localhost entry in the cluster. I have removed this extraneous node so now only the 3 real nodes remain and commented out this line in /etc/hosts on all nodes in the cluster. http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html Thanks, Andrew - Original Message - From: Jan Friesse jfrie...@redhat.com To: Andrew Martin amar...@xes-inc.com Cc: Angus Salkeld asalk...@redhat.com, disc...@corosync.org, pacemaker@oss.clusterlabs.org Sent: Wednesday, November 7, 2012 2:00:20 AM Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster Andrew, Andrew Martin napsal(a): A bit more data on this problem: I was doing some maintenance and had to briefly disconnect storagequorum's connection to the STONITH network (ethernet cable #7 in this diagram): http://sources.xes-inc.com/downloads/storagecluster.png Since corosync has two rings (and is in active mode), this should cause no disruption to the cluster. However, as soon as I disconnected cable #7, corosync on storage0 died (corosync was already stopped on storage1), which caused pacemaker on storage0 to also shutdown. I was not able to obtain a coredump this time as apport is still running on storage0. I strongly believe corosync fault is because of original problem you have. Also I would recommend you to try passive mode. Passive mode is better, because if one link fails, passive mode make progress (delivers messages), where active mode doesn't (up to moment, when ring is marked as failed. After that, passive/active behaves same). Also passive mode is much better tested. What else can I do to debug this problem? Or, should I just try to downgrade to corosync 1.4.2 (the version available in the Ubuntu repositories)? I would really like to find main issue (which looks like libqb one, rather then corosync). But if you decide to downgrade, please downgrade to latest 1.4.x series (1.4.4 for now). 1.4.2 has A LOT of known bugs. Thanks, Andrew Regards, Honza - Original Message - From: Andrew Martin amar...@xes-inc.com To: Angus Salkeld asalk...@redhat.com Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org Sent: Tuesday, November 6, 2012 2:01:17 PM Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster Hi Angus, I recompiled corosync with the changes you suggested in exec/main.c to generate fdata when SIGBUS is triggered. Here 's the corresponding coredump and fdata files: http://sources.xes-inc.com/downloads/core.13027 http://sources.xes-inc.com/downloads/fdata.20121106 (gdb) thread apply all bt Thread 1 (Thread 0x77fec700 (LWP 13027)): #0 0x7775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0 #1 0x777656b9 in ?? 
() from /usr/lib/libqb.so.0 #2 0x777637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0 #3 0x55571700 in ?? () #4 0x77bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5 #5 0x77bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5 #6 0x77bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5 #7 0x7775d46f in ?? () from /usr/lib/libqb.so.0 #8 0x7775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0 #9 0x55560945 in main () I've also been doing some hardware tests to rule it out as the cause of this problem: mcelog has found no problems and memtest finds the memory to be healthy as well. Thanks, Andrew - Original Message - From: Angus Salkeld asalk...@redhat.com To: pacemaker@oss.clusterlabs.org, disc...@corosync.org Sent: Friday, November 2, 2012 8:18:51 PM Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster On 02/11/12 13:07 -0500, Andrew Martin wrote: Hi Angus, Corosync died again while using libqb 0.14.3. Here is the coredump from today: http://sources.xes-inc.com/downloads/corosync.nov2.coredump # corosync -f notice [MAIN ] Corosync Cluster Engine ('2.1.0
Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster
Andrew, good news. I believe that I've found reproducer for problem you are facing. Now, to be sure it's really same, can you please run : df (interesting is /dev/shm) and send output of ls -la /dev/shm? I believe /dev/shm is full. Now, as a quick workaround, just delete all qb-* from /dev/shm and cluster should work. There are basically two problems: - ipc_shm is leaking memory - if there is no memory, libqb mmap nonallocated memory and receives sigbus Angus is working on both issues. Regards, Honza Jan Friesse napsal(a): Andrew, thanks for valgrind report (even it didn't showed anything useful) and blackbox. We believe that problem is because of access to invalid memory mapped by mmap operation. There are basically 3 places where we are doing mmap. 1.) corosync cpg_zcb functions (I don't believe this is the case) 2.) LibQB IPC 3.) LibQB blackbox Now, because nether me nor Angus are able to reproduce the bug, can you please: - apply patches Check successful initialization of IPC and Add support for selecting IPC type (later versions), or use corosync from git (ether needle or master branch, they are same) - compile corosync - Add qb { ipc_type: socket } to corosync.conf - Try running corosync This may, but may not help solve problem, but it should help us to diagnose if problem is or isn't IPC one. Thanks, Honza Andrew Martin napsal(a): Angus and Honza, I recompiled corosync with --enable-debug. Below is a capture of the valgrind output when corosync dies, after switching rrp_mode to passive: # valgrind corosync -f ==5453== Memcheck, a memory error detector ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al. ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info ==5453== Command: corosync -f ==5453== notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service. info [MAIN ] Corosync built-in features: debug pie relro bindnow ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised byte(s) ==5453== at 0x54D233D: ??? (syscall-template.S:82) ==5453== by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3BFC8: totemudp_token_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E38CF0: totemnet_token_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E40FB5: totemrrp_token_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3C1A4: totemudp_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E38EBC: totemnet_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== Address 0x7feff7f58 is on thread 1's stack ==5453== ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s) ==5453== at 0x54D233D: ??? (syscall-template.S:82) ==5453== by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E49A04: ??? 
(in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== Address 0x7feffb9da is on thread 1's stack ==5453== ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s) ==5453== at 0x54D233D: ??? (syscall-template.S:82) ==5453== by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) ==5453== Address
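(For reference: the checks and the quick workaround described above, written out as commands; the paths come from the thread, and removing the leaked files assumes corosync is not currently running.)
df -h /dev/shm
ls -la /dev/shm
rm -f /dev/shm/qb-*
The socket-based IPC mentioned earlier in the thread is selected with a small corosync.conf addition:
qb {
    ipc_type: socket
}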
Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster
that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to the corosync to hang up? Here's the corresponding corosync log file (next time I should have a core dump as well): http://pastebin.com/5FLKg7We Hi Andrew I can't see much wrong with the log either. If you could run with the latest (libqb-0.14.3) and post a backtrace if it still happens, that would be great. Thanks Angus Thanks, Andrew - Original Message - From: Jan Friesse jfrie...@redhat.com To: Andrew Martin amar...@xes-inc.com Cc: disc...@corosync.org, The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, November 1, 2012 7:55:52 AM Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster Ansdrew, I was not able to find anything interesting (from corosync point of view) in configuration/logs (corosync related). What would be helpful: - if corosync died, there should be /var/lib/corosync/fdata-DATETTIME-PID of dead corosync. Can you please xz them and store somewhere (they are quiet large but well compressible). - If you are able to reproduce problem (what seems like you are), can you please allow generating of coredumps and store somewhere backtrace of coredump? (coredumps are stored in /var/lib/corosync as core.PID, and way to obtain coredump is gdb corosync /var/lib/corosync/core.pid, and here thread apply all bt). If you are running distribution with ABRT support, you can also use ABRT to generate report. Regards, Honza Andrew Martin napsal(a): Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt and start it as soon as it fails, so only one of those times resulted in a STONITH of storage1. I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output: http://pastebin.com/eAmJSmsQ In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration: http://pastebin.com/DFL3hNvz It seems that an extra node, 16777343 localhost has been added to the cluster after storage1 was STONTIHed (must be the localhost interface on storage1). Is there anyway to prevent this? Does this help to determine why corosync is dying, and what I can do to fix it? Thanks, Andrew - Original Message - From: Andrew Martin amar...@xes-inc.com To: disc...@corosync.org Sent: Thursday, November 1, 2012 12:11:35 AM Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster Hello, I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are real nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy=freeze), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from eac! h! ! n! o! de during this period. 
corosync.conf: http://pastebin.com/vWQDVmg8 Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that it binds to the correct interface (since potentially in the future those machines may have two interfaces on the same subnet). corosync.log from storage0: http://pastebin.com/HK8KYDDQ corosync.log from storage1: http://pastebin.com/sDWkcPUz corosync.log from storagequorum (the DC during this period): http://pastebin.com/uENQ5fnf Issuing service corosync start service pacemaker start on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring? Thanks, Andrew Martin ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman
Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster
it will generate it automatically. (I see you are getting a bus error) - :(. -A Thanks, Andrew - Original Message - From: Angus Salkeld asalk...@redhat.com To: pacemaker@oss.clusterlabs.org, disc...@corosync.org Sent: Thursday, November 1, 2012 5:11:23 PM Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster On 01/11/12 14:32 -0500, Andrew Martin wrote: Hi Honza, Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging? I did find find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. This time when corosync died I noticed the following in dmesg: Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000] This error was only present for one of the many other times corosync has died. I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to the corosync to hang up? Here's the corresponding corosync log file (next time I should have a core dump as well): http://pastebin.com/5FLKg7We Hi Andrew I can't see much wrong with the log either. If you could run with the latest (libqb-0.14.3) and post a backtrace if it still happens, that would be great. Thanks Angus Thanks, Andrew - Original Message - From: Jan Friesse jfrie...@redhat.com To: Andrew Martin amar...@xes-inc.com Cc: disc...@corosync.org, The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, November 1, 2012 7:55:52 AM Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster Ansdrew, I was not able to find anything interesting (from corosync point of view) in configuration/logs (corosync related). What would be helpful: - if corosync died, there should be /var/lib/corosync/fdata-DATETTIME-PID of dead corosync. Can you please xz them and store somewhere (they are quiet large but well compressible). - If you are able to reproduce problem (what seems like you are), can you please allow generating of coredumps and store somewhere backtrace of coredump? (coredumps are stored in /var/lib/corosync as core.PID, and way to obtain coredump is gdb corosync /var/lib/corosync/core.pid, and here thread apply all bt). If you are running distribution with ABRT support, you can also use ABRT to generate report. Regards, Honza Andrew Martin napsal(a): Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt and start it as soon as it fails, so only one of those times resulted in a STONITH of storage1. I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output: http://pastebin.com/eAmJSmsQ In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration: http://pastebin.com/DFL3hNvz It seems that an extra node, 16777343 localhost has been added to the cluster after storage1 was STONTIHed (must be the localhost interface on storage1). Is there anyway to prevent this? Does this help to determine why corosync is dying, and what I can do to fix it? 
Thanks, Andrew - Original Message - From: Andrew Martin amar...@xes-inc.com To: disc...@corosync.org Sent: Thursday, November 1, 2012 12:11:35 AM Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster Hello, I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are real nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy=freeze), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from eac! h! ! n! o! de
Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster
Angus Salkeld napsal(a): On 02/11/12 13:07 -0500, Andrew Martin wrote: Hi Angus, Corosync died again while using libqb 0.14.3. Here is the coredump from today: http://sources.xes-inc.com/downloads/corosync.nov2.coredump # corosync -f notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service. info [MAIN ] Corosync built-in features: pie relro bindnow Bus error (core dumped) Here's the log: http://pastebin.com/bUfiB3T3 Did your analysis of the core dump reveal anything? I can't get any symbols out of these coredumps. Can you try get a backtrace? Andrew, as I've wrote in original mail, backtrace can be got by: coredumps are stored in /var/lib/corosync as core.PID, and way to obtain coredump is gdb corosync /var/lib/corosync/core.pid, and here thread apply all bt). If you are running distribution with ABRT support, you can also use ABRT to generate report. It's also pretty weird that you are getting SIGBUS. SIGBUS is pretty usually result of accessing unaligned memory on processors without support to access that (for example Sparc). This doesn't seem to be your case (because of AMD64). Is there a way for me to make it generate fdata with a bus error, or how else can I gather additional information to help debug this? if you look in exec/main.c and look for SIGSEGV you will see how the mechanism for fdata works. Just and a handler for SIGBUS and hook it up. Then you should be able to get the fdata for both. I'd rather be able to get a backtrace if possible. Also if possible, please try to compile with --enable-debug (both libqb and corosync) to get as much information as possible. -Angus Regards, Honza Thanks, Andrew - Original Message - From: Angus Salkeld asalk...@redhat.com To: pacemaker@oss.clusterlabs.org, disc...@corosync.org Sent: Thursday, November 1, 2012 5:47:16 PM Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster On 01/11/12 17:27 -0500, Andrew Martin wrote: Hi Angus, I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f): http://sources.xes-inc.com/downloads/corosync.coredump Thanks, looking... There still isn't anything added to /var/lib/corosync however. What do I need to do to enable the fdata file to be created? Well if it crashes with SIGSEGV it will generate it automatically. (I see you are getting a bus error) - :(. -A Thanks, Andrew - Original Message - From: Angus Salkeld asalk...@redhat.com To: pacemaker@oss.clusterlabs.org, disc...@corosync.org Sent: Thursday, November 1, 2012 5:11:23 PM Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster On 01/11/12 14:32 -0500, Andrew Martin wrote: Hi Honza, Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging? I did find find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. 
This time when corosync died I noticed the following in dmesg: Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000] This error was only present for one of the many other times corosync has died. I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to the corosync to hang up? Here's the corresponding corosync log file (next time I should have a core dump as well): http://pastebin.com/5FLKg7We Hi Andrew I can't see much wrong with the log either. If you could run with the latest (libqb-0.14.3) and post a backtrace if it still happens, that would be great. Thanks Angus Thanks, Andrew - Original Message - From: Jan Friesse jfrie...@redhat.com To: Andrew Martin amar...@xes-inc.com Cc: disc...@corosync.org, The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, November 1, 2012 7:55:52 AM Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster Ansdrew, I was not able to find anything interesting (from corosync point of view) in configuration/logs (corosync related). What would be helpful: - if corosync died, there should be /var/lib/corosync/fdata-DATETTIME-PID of dead corosync. Can you please xz them and store somewhere (they are quiet large but well compressible). - If you are able to reproduce problem (what seems like you are), can you please allow
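(For reference: a sketch of rebuilding both components with debug symbols as requested above; it assumes source checkouts with the usual autotools layout, which matches how corosync 2.1.0 was built from source in this thread.)
cd libqb && ./autogen.sh && ./configure --enable-debug && make && make install
cd ../corosync && ./autogen.sh && ./configure --enable-debug && make && make install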
Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster
Ansdrew, I was not able to find anything interesting (from corosync point of view) in configuration/logs (corosync related). What would be helpful: - if corosync died, there should be /var/lib/corosync/fdata-DATETTIME-PID of dead corosync. Can you please xz them and store somewhere (they are quiet large but well compressible). - If you are able to reproduce problem (what seems like you are), can you please allow generating of coredumps and store somewhere backtrace of coredump? (coredumps are stored in /var/lib/corosync as core.PID, and way to obtain coredump is gdb corosync /var/lib/corosync/core.pid, and here thread apply all bt). If you are running distribution with ABRT support, you can also use ABRT to generate report. Regards, Honza Andrew Martin napsal(a): Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt and start it as soon as it fails, so only one of those times resulted in a STONITH of storage1. I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output: http://pastebin.com/eAmJSmsQ In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration: http://pastebin.com/DFL3hNvz It seems that an extra node, 16777343 localhost has been added to the cluster after storage1 was STONTIHed (must be the localhost interface on storage1). Is there anyway to prevent this? Does this help to determine why corosync is dying, and what I can do to fix it? Thanks, Andrew - Original Message - From: Andrew Martin amar...@xes-inc.com To: disc...@corosync.org Sent: Thursday, November 1, 2012 12:11:35 AM Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster Hello, I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are real nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy=freeze), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from each no! de during this period. corosync.conf: http://pastebin.com/vWQDVmg8 Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that it binds to the correct interface (since potentially in the future those machines may have two interfaces on the same subnet). corosync.log from storage0: http://pastebin.com/HK8KYDDQ corosync.log from storage1: http://pastebin.com/sDWkcPUz corosync.log from storagequorum (the DC during this period): http://pastebin.com/uENQ5fnf Issuing service corosync start service pacemaker start on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring? 
Thanks, Andrew Martin ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ discuss mailing list disc...@corosync.org http://lists.corosync.org/mailman/listinfo/discuss ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org