[ClusterLabs] corosync totem.token too long may cause pacemaker (cluster) instability?
Hi,

We changed totem.token from 3s to 60s. Then some strange things were observed, such as nodes unexpectedly going offline. I read the corosync.conf manpage, but I still don't understand the reason. Can anyone explain this? Or maybe our conf is broken?

Our corosync.conf:

compatibility: whitetank

quorum {
    provider: corosync_votequorum
    two_node: 0
}

totem {
    version: 2
    token: 6
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    rrp_mode: none
    secauth: off
    threads: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 39.39.0.5
        mcastport: 5405
    }
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

nodelist {
    node {
        ring0_addr: 39.39.0.4
        nodeid: 1
    }
    node {
        ring0_addr: 39.39.0.5
        nodeid: 2
    }
    node {
        ring0_addr: 39.39.0.6
        nodeid: 3
    }
}

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
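[Editor's note, not part of the original thread: per corosync.conf(5), both totem.token and totem.consensus are expressed in milliseconds, so the posted "token: 6" would mean 6 ms, not 6 s, and the manpage requires consensus to be at least 1.2 × token. A minimal sketch of that sanity check, assuming a 60-second token:]

```shell
# Hedged sketch: corosync interprets token/consensus in milliseconds,
# and corosync.conf(5) says consensus must be at least 1.2 * token.
token_ms=60000                              # a 60 s token, in ms
min_consensus_ms=$(( token_ms * 12 / 10 ))  # 1.2 * token, integer math
echo "token=${token_ms}ms min_consensus=${min_consensus_ms}ms"
```

With token at 60000 ms, the posted "consensus: 3600" would be far below the required minimum, which could explain membership instability.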
[ClusterLabs] pending actions
Hi,

Occasionally, I find my cluster with one pending action not being executed for some minutes (I guess until the "PEngine Recheck Timer" elapses). Running "crm_simulate -SL" shows the pending actions. I'm still confused about how this can happen, why it happens, and how to avoid it.

Earlier today, I started my test cluster with 3 nodes and a master/slave resource [1], all with positive master scores (1001, 1000 and 990), and the cluster kept the promote action as a pending action for 15 minutes. You will find in attachment the first 3 pengine inputs executed after the cluster startup.

What are the consequences if I set cluster-recheck-interval to 30s, for instance?

Thanks in advance for your lights :)

Regards,

[1] here is the setup: http://dalibo.github.io/PAF/Quick_Start-CentOS-7.html#cluster-resource-creation-and-management

--
Jehan-Guillaume de Rorthais
Dalibo

Attachments: pe-input-417.bz2, pe-input-418.bz2, pe-input-419.bz2
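[Editor's note, not part of the original thread: cluster-recheck-interval is a real Pacemaker cluster property. On a pcs-managed cluster it could be changed as sketched below; the 30s value is just the one being asked about, and lowering it trades more frequent policy-engine runs for faster pickup of missed events.]

```shell
# Sketch: lower the recheck interval on a pcs-managed cluster.
pcs property set cluster-recheck-interval=30s

# Verify the current value.
pcs property show cluster-recheck-interval
```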
Re: [ClusterLabs] resource was disabled automatically
On 03/06/2017 08:29 PM, cys wrote:
> At 2017-03-07 05:47:19, "Ken Gaillot" wrote:
>> To figure out why a resource was stopped, you want to check the logs on
>> the DC (which will be the node with the most "pengine:" messages around
>> that time). When the PE decides a resource needs to be stopped, you'll
>> see a message like
>>
>>   notice: LogActions: Stop()
>>
>> Often, by looking at the messages before that, you can see what led it
>> to decide that.
>
> Thanks Ken. It's really helpful.
> Finally I found the debug log of pengine (in a separate file). It has this
> message:
> "All nodes for resource p_vs-scheduler are unavailable, unclean or shutting
> down..."
> So it seems this is what caused vs-scheduler to be disabled.
>
> If all nodes come back to a good state, will pengine start the resource
> automatically?
> I did it manually yesterday.

Yes, whenever a node changes state (such as becoming available), the pengine will recheck what can be done.
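[Editor's note, not part of the original thread: the log search described above might look like the following sketch. The log path is an assumption (it varies by distribution), and journalctl applies only to systemd-based installs:]

```shell
# Sketch: find pengine "LogActions: Stop" decisions on the DC.
# /var/log/cluster/corosync.log is an assumed path; adjust as needed.
grep -E 'pengine.*LogActions.*Stop' /var/log/cluster/corosync.log

# Or, on systemd-based systems:
journalctl -u pacemaker | grep -E 'pengine.*LogActions.*Stop'
```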
Re: [ClusterLabs] FenceAgentAPI
On 07/03/17 05:09 AM, Jan Pokorný wrote:
> On 06/03/17 17:12 -0500, Digimer wrote:
>> The old FenceAgentAPI document on fedorahosted is gone now that fedora
>> hosted is closed. So I created a copy on the clusterlabs wiki:
>>
>> http://wiki.clusterlabs.org/wiki/FenceAgentAPI
>
> Note that just a few days ago I announced that the page has moved to
> https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md, see
> http://oss.clusterlabs.org/pipermail/developers/2017-February/000438.html
> (that hit just the developers list; I don't think it's of interest to
> users of the stack as such). Therefore that's another duplicate, just as
> http://wiki.clusterlabs.org/wiki/Fedorahosted.org_FenceAgentAPI
> (linked from the original fedorahosted.org page so as to allow for
> future flexibility should the content still be visible, which turned
> out not to be the case) is.
>
> I will add you (or whoever wants to maintain that file) to the
> linux-cluster group at pagure.io so you can edit the underlying Markdown
> file (just let me know your Fedora Account System username off-list).
> The file itself is tracked in a git repository; access URLs were
> provided in the announcement email.

Ah! I missed that. Glad to see it was covered already. I was mainly stepping up because I thought it was missed, but if you're on it, you know the topic better than I do. So if you want help, I'm happy to do what I can, but otherwise I'll stand back and let you do your thing. :)

>> It desperately needs an update. Specifically, it needs '-o metadata'
>> properly explained. I am happy to update this document and change the
>> cman/cluster.conf example over to a pacemaker example, etc., but I do not
>> feel like I am authoritative on the XML validation side of things.
>>
>> Can someone give me, even just point-form notes, how to explain this?
>> If so, I'll create a 'FenceAgentAPI - Working' document and I will have
>> anyone interested comment before making it an official update.
>>
>> Comments?

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
Re: [ClusterLabs] Antw: Expected recovery behavior of remote-node guest when corosync ring0 is lost in a passive mode RRP config?
Ulrich,

Thank you very much for your feedback. You wrote, "Could it be you forgot "allow-migrate=true" at the resource level or some migration IP address at the node level? I only have SLES11 here..."

I know for sure that the pacemaker remote node (zs95kjg110102) I mentioned below is configured correctly for pacemaker Live Guest Migration. I can demonstrate this using the 'pcs resource move' CLI.

I will migrate this "remote node" guest (zs95kjg110102) and resource "zs95kjg110102_res" to another cluster node (e.g. zs95kjpcs1 / 10.20.93.12), using the 'pcs1' hostname / IP; it is currently running on zs93kjpcs1 (10.20.93.11):

[root@zs95kj ~]# pcs resource show | grep zs95kjg110102_res
 zs95kjg110102_res (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1

[root@zs93kj ~]# pcs resource move zs95kjg110102_res zs95kjpcs1

[root@zs93kj ~]# pcs resource show | grep zs95kjg110102_res
 zs95kjg110102_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1

## On zs95kjpcs1, you can see that the guest is actually running there...

[root@zs95kj ~]# virsh list | grep zs95kjg110102
 63   zs95kjg110102   running

[root@zs95kj ~]# ping 10.20.110.102
PING 10.20.110.102 (10.20.110.102) 56(84) bytes of data.
64 bytes from 10.20.110.102: icmp_seq=1 ttl=63 time=0.775 ms

So, everything seems set up correctly for live guest migration of this VirtualDomain resource.

What I am really looking for is a way to ensure 100% availability of a "live guest migratable" pacemaker remote node guest in a situation where the interface (in this case vlan1293) carrying the ring0_addr goes down.

I thought that maybe configuring Redundant Ring Protocol (RRP) for corosync would provide this, but from what I've seen so far it doesn't look that way. If the ring0_addr interface is lost in an RRP configuration while the remote guest is connected to the host using that ring0_addr, the guest gets rebooted and reestablishes the "remote-node-to-host" connection over the ring1_addr, which is great as long as you don't care if the guest gets rebooted.

Corosync is doing its job of preventing the cluster node from being fenced by failing over its heartbeat messaging to ring1; however, the remote-node guests take a short-term hit due to the remote-node-to-host reconnect.

In the event of a ring0_addr failure, I don't see any attempt by pacemaker to migrate the remote node to another cluster node, but maybe this is by design, since there is no alternate path for the guest to use for LGM (i.e. ring0 is a single point of failure). If the guest could be migrated over an alternate route, it would prevent the guest outage.

Maybe my question is... is there any way to facilitate an alternate Live Guest Migration path in the event of a ring0_addr failure? This might also apply to a single ring protocol as well.

Thanks,

Scott Greenlese ... KVM on System Z - Solutions Test, IBM Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com

From: "Ulrich Windl"
To:
Date: 03/02/2017 02:39 AM
Subject: [ClusterLabs] Antw: Expected recovery behavior of remote-node guest when corosync ring0 is lost in a passive mode RRP config?

>>> "Scott Greenlese" schrieb am 01.03.2017 um 22:07 in Nachricht :
> Hi..
>
> I am running a few corosync "passive mode" Redundant Ring Protocol (RRP)
> failure scenarios, where my cluster has several remote-node VirtualDomain
> resources running on each node in the cluster, which have been configured
> to allow Live Guest Migration (LGM) operations.
>
> While both corosync rings are active, if I drop ring0 on a given node where
> I have remote nodes (guests) running, I noticed that the guest will be
> shut down / restarted on the same host, after which the connection is
> re-established and the guest proceeds to run on that same cluster node.

Could it be you forgot "allow-migrate=true" at the resource level or some migration IP address at the node level? I only have SLES11 here...

> I am wondering why pacemaker doesn't try to "live" migrate the remote node
> (guest) to a different node, instead of rebooting the guest? Is there some
> way to configure the remote nodes such that the recovery action is LGM
> instead of reboot when the host-to-remote_node connection is lost in an
> RRP situation? I guess the next question is, is it even possible to LGM a
> remote node guest if the corosync ring fails over from ring0 to ring1
> (or vice versa)?
>
> # For example, here's a remote node's VirtualDomain resource definition.
>
> [root@zs95kj]# pcs resource show zs95kjg110102_res
>  Resource: zs95kjg110102_res (class=ocf provider=heartbeat type=VirtualDomain)
>   Attributes: config=/guestxml/nfs1/zs95kjg110102.xml hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true remote-node=zs95kjg110102 remote-addr=10.20.110.102
>   Operations: start interval=0s timeout=480 (zs95kjg110102_res-start-interval-0s)
>               stop interval=0s timeout=120 (zs95kjg11010
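[Editor's note, not part of the original thread: the resource definition quoted above corresponds roughly to a pcs creation command like the sketch below. All attribute values are taken from the quoted output; the command form itself is an assumption about how the resource was originally created.]

```shell
# Sketch: creating a VirtualDomain remote-node resource like the one
# quoted above, with live migration enabled via allow-migrate=true.
pcs resource create zs95kjg110102_res ocf:heartbeat:VirtualDomain \
    config=/guestxml/nfs1/zs95kjg110102.xml \
    hypervisor=qemu:///system \
    migration_transport=ssh \
    meta allow-migrate=true remote-node=zs95kjg110102 remote-addr=10.20.110.102 \
    op start interval=0s timeout=480 \
    op stop interval=0s timeout=120
```

Note that remote-addr here is a single address; this is part of why a ring0_addr failure leaves the remote-node connection with no alternate path unless it is re-established over another route.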
Re: [ClusterLabs] FenceAgentAPI
On 06/03/17 17:12 -0500, Digimer wrote:
> The old FenceAgentAPI document on fedorahosted is gone now that fedora
> hosted is closed. So I created a copy on the clusterlabs wiki:
>
> http://wiki.clusterlabs.org/wiki/FenceAgentAPI

Note that just a few days ago I announced that the page has moved to
https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md, see
http://oss.clusterlabs.org/pipermail/developers/2017-February/000438.html
(that hit just the developers list; I don't think it's of interest to
users of the stack as such). Therefore that's another duplicate, just as
http://wiki.clusterlabs.org/wiki/Fedorahosted.org_FenceAgentAPI
(linked from the original fedorahosted.org page so as to allow for
future flexibility should the content still be visible, which turned
out not to be the case) is.

I will add you (or whoever wants to maintain that file) to the
linux-cluster group at pagure.io so you can edit the underlying Markdown
file (just let me know your Fedora Account System username off-list).
The file itself is tracked in a git repository; access URLs were
provided in the announcement email.

> It desperately needs an update. Specifically, it needs '-o metadata'
> properly explained. I am happy to update this document and change the
> cman/cluster.conf example over to a pacemaker example, etc., but I do not
> feel like I am authoritative on the XML validation side of things.
>
> Can someone give me, even just point-form notes, how to explain this?
> If so, I'll create a 'FenceAgentAPI - Working' document and I will have
> anyone interested comment before making it an official update.
>
> Comments?

--
Jan (Poki)
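[Editor's note, not part of the original thread: the '-o metadata' action being discussed makes a fence agent print an XML self-description on stdout, which tools use for discovery and validation. The sketch below shows the general shape of that output; the agent name "fence_example" and its single parameter are hypothetical, and the exact schema is defined by the fence-agents metadata format, not this sketch.]

```xml
<?xml version="1.0" ?>
<resource-agent name="fence_example" shortdesc="Example fence agent">
  <longdesc>Hypothetical agent, shown only to illustrate the metadata shape.</longdesc>
  <parameters>
    <parameter name="ip" unique="0" required="1">
      <getopt mixed="-a, --ip=[ip]" />
      <content type="string" />
      <shortdesc lang="en">IP address of the fence device</shortdesc>
    </parameter>
  </parameters>
  <actions>
    <action name="on" />
    <action name="off" />
    <action name="reboot" />
    <action name="status" />
    <action name="metadata" />
  </actions>
</resource-agent>
```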