Re: [ClusterLabs] Users Digest, Vol 44, Issue 11
Greetings from a confused user!

We are running Pacemaker as part of a load-balanced cluster of two members, both VMware VMs, with both acting as stepping-stones to our DNS recursive resolvers (RR). Simple use case: the /etc/resolv.conf on the *NIX boxes points at both IPs, and the cluster forwards to one of multiple RRs for DNS resolution.

Today, for an as-yet undetermined reason, one of the two members started failing to connect to the RRs. Intermittently. And quite annoyingly, as this has affected data center operations. No matter what we've tried, one member fails intermittently while the other is fine. And we've tried:

- reboot of the affected member - it came back up clean and fine, but the issue remained.
- fail the cluster over, moving both IPs to the second member server; failover was successful, the problem remained. This moved the entire cluster to a different VM on a different VMware host server, so a different NIC, etc.
- fail the cluster back to the original server; both IPs appear on the 'suspect' VM, and the problem remained.
- restore the cluster; both IPs are on the proper VMs, but the one still fails intermittently while the second just chugs along.

Any ideas what could be causing this? Is this something that could be caused by the cluster config? Has anybody ever seen anything similar? Our current unsustainable workaround is to remove the IP of the affected member from the *NIX resolv.conf file. I appreciate any reasonable suggestions. (I am not the creator of the cluster, just the guy trying to figure it out. Unfortunately the creator, my mentor, is dearly departed and, in times like this, sorely missed.)

Any replies will be read and responded to early tomorrow AM. Thanks for understanding.

--
Jeff Westgate

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
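One way to quantify the intermittent failures described above is to query each cluster IP in a loop and count the timeouts; comparing the per-IP failure ratio helps distinguish a cluster/VIP problem from an RR problem. A hedged sketch (the two VIP addresses and the test domain are placeholders, not from the original post):

```shell
#!/bin/sh
# Hypothetical cluster VIPs -- substitute the real forwarder addresses.
VIPS="192.0.2.10 192.0.2.11"

for ip in $VIPS; do
    fail=0
    for i in $(seq 1 100); do
        # +time=2 +tries=1: fail fast so intermittent drops are counted
        dig @"$ip" example.com +time=2 +tries=1 >/dev/null 2>&1 || fail=$((fail + 1))
    done
    echo "$ip: $fail/100 queries failed"
done
```

If only the suspect member's VIP shows failures while the healthy member's VIP is clean from the same client, the problem is on (or behind) that member rather than in the clients' resolver configuration.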
Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
What I am struggling to understand here is why it is being referred to as a "SAN" when it is not concurrently available... how are you mounting this "SAN" on each host?

On 5 September 2018 at 18:55, Andrei Borzenkov wrote:
> 05.09.2018 19:13, Lentes, Bernd wrote:
> > Hi guys,
> >
> > just to be sure. I thought (maybe I'm wrong) that having a VM on shared storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN, allows live migration because Pacemaker takes care that the ext3 fs is at any time only mounted on one node.
>
> While live migration requires concurrent access from both nodes at the same time.
>
> > I tried it, but "live" migration wasn't possible. The VM was always shut down before migration. Or do I need OCFS2?
>
> You need to be able to access the image from both nodes at the same time. If the image is on a file system, it must be a clustered file system. Or just use a raw device for the image, as already suggested.
>
> > Could anyone clarify this?
> >
> > Bernd

--
Mark Adams
Director -- Open Virtualisation Solutions Ltd.
Registered in England and Wales number: 07709887
Office Address: 274 Verdant Lane, London, SE6 1TW
Office: +44 (0)333 355 0160 Mobile: +44 (0)750 800 1289
Site: http://www.openvs.co.uk
Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
05.09.2018 19:13, Lentes, Bernd wrote:
> Hi guys,
>
> just to be sure. I thought (maybe I'm wrong) that having a VM on shared storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN, allows live migration because Pacemaker takes care that the ext3 fs is at any time only mounted on one node.

While live migration requires concurrent access from both nodes at the same time.

> I tried it, but "live" migration wasn't possible. The VM was always shut down before migration. Or do I need OCFS2?

You need to be able to access the image from both nodes at the same time. If the image is on a file system, it must be a clustered file system. Or just use a raw device for the image, as already suggested.

> Could anyone clarify this?
>
> Bernd
Re: [ClusterLabs] Pacemaker startup retries
> If you build from source, you can apply the patch that fixes the issue
> to the 1.1.14 code base:
> https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d [1]

Just applied the patch and now it works as expected. The unseen node is only rebooted once on startup.

Thanks a lot!

Cheers
Cesar

Links:
[1] https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d
Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
Why use a file system for the raw image at all, when you can use an LV directly as the block device for your VM?

> On 5 Sep 2018, at 18:34, Lentes, Bernd wrote:
>
> - On Sep 5, 2018, at 6:28 PM, FeldHost™ Admin ad...@feldhost.cz wrote:
>
>> hello, yes, you need ocfs2 or gfs2, but in your case (raw image) probably better to use lvm
>
> I use cLVM. The fs for the raw image resides on a clustered VG/LV.
> But nevertheless I still need a cluster fs because of the concurrent access?
>
> Bernd
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias H. Tschoep, Heinrich Bassler, Dr. rer. nat. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
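The suggestion above -- pointing the guest at a logical volume instead of an image file on a file system -- looks roughly like this in the libvirt domain XML. A sketch only: the VG/LV names are hypothetical, and it assumes each VM gets its own dedicated LV on the clustered VG:

```
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <!-- hypothetical clustered VG and per-VM LV -->
  <source dev='/dev/vg_cluster/lv_vm1'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

Because both hosts can open the block device and no file system is mounted on top, there is nothing that forbids the brief concurrent access live migration needs.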
Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
Or iSCSI, or NFS. But there are many better solutions nowadays. Of course it all depends on your setup, but ext3 is crazy old now!

On Wed, 5 Sep 2018, 19:35 Lentes, Bernd, wrote:
>
> - On Sep 5, 2018, at 6:28 PM, FeldHost™ Admin ad...@feldhost.cz wrote:
>
> > hello, yes, you need ocfs2 or gfs2, but in your case (raw image) probably better to use lvm
>
> I use cLVM. The fs for the raw image resides on a clustered VG/LV.
> But nevertheless I still need a cluster fs because of the concurrent access?
>
> Bernd
Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
- On Sep 5, 2018, at 6:28 PM, FeldHost™ Admin ad...@feldhost.cz wrote:

> hello, yes, you need ocfs2 or gfs2, but in your case (raw image) probably better to use lvm

I use cLVM. The fs for the raw image resides on a clustered VG/LV.
But nevertheless I still need a cluster fs because of the concurrent access?

Bernd
Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
hello, yes, you need OCFS2 or GFS2, but in your case (raw image) it is probably better to use LVM

> On 5 Sep 2018, at 18:13, Lentes, Bernd wrote:
>
> Hi guys,
>
> just to be sure. I thought (maybe I'm wrong) that having a VM on shared storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN, allows live migration because Pacemaker takes care that the ext3 fs is at any time only mounted on one node. I tried it, but "live" migration wasn't possible. The VM was always shut down before migration. Or do I need OCFS2?
> Could anyone clarify this?
>
> Bernd
>
> --
>
> Bernd Lentes
> Systemadministration
> Institut für Entwicklungsgenetik
> Gebäude 35.34 - Raum 208
> HelmholtzZentrum münchen
> bernd.len...@helmholtz-muenchen.de
> phone: +49 89 3187 1241
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/idg
>
> he who makes mistakes can learn something
> he who does nothing can learn nothing
Re: [ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions
On Wed, 2018-09-05 at 17:43 +0200, Ulrich Windl wrote:
> > > > Kadlecsik József wrote on 05.09.2018 at 15:33:
> > Hi,
> >
> > For testing purposes one of our nodes was put in standby mode and then rebooted several times. When the standby node started up, it joined the cluster as a new member and this resulted in transitions between the online nodes. However, when the standby node was rebooted in mid-transition, it triggered further transitions. As a result, live migrations were aborted and guests were stopped/started.
> >
> > How can one make sure that join/leave operations of standby nodes do not affect the location of the running resources?
> >
> > It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on Debian stretch nodes.

Node joins/leaves do and should trigger new transitions, but that should not result in any actions if the node is in standby. The cluster will wait for any actions in progress (such as a live migration) to complete before beginning a new transition, so there is likely something else going on that is affecting the migration.

> Logs and more details, please!

Particularly the detail log on the DC should be helpful. It will have "pengine:" messages with "saving inputs" at each transition.

> > Best regards,
> > Jozsef

--
Ken Gaillot
[ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?
Hi guys,

just to be sure: I thought (maybe I'm wrong) that having a VM on shared storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN, allows live migration, because Pacemaker takes care that the ext3 fs is at any time only mounted on one node. I tried it, but "live" migration wasn't possible. The VM was always shut down before migration. Or do I need OCFS2?

Could anyone clarify this?

Bernd
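A configuration along the lines suggested elsewhere in the thread -- a raw LV per guest instead of a file on ext3, and a VirtualDomain resource allowed to migrate -- might look like this in crm shell syntax. A sketch only: the resource name, config path, and timeouts are hypothetical, and it assumes the clustered VG (cLVM/DLM) is already managed by the cluster:

```
# Hypothetical: the guest's disk is a raw LV on the clustered VG,
# so no Filesystem resource (and no cluster fs) is needed.
primitive vm_test ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm_test.xml" \
           hypervisor="qemu:///system" migration_transport="ssh" \
    meta allow-migrate="true" \
    op monitor interval="30s" timeout="60s" \
    op migrate_to timeout="180s" interval="0" \
    op migrate_from timeout="180s" interval="0"
```

The key point is `allow-migrate="true"`: without it, Pacemaker moves the resource by stopping the VM on one node and starting it on the other, which matches the "always shut down before migration" behaviour described above.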
Re: [ClusterLabs] Pacemaker startup retries
On Wed, 2018-09-05 at 17:21 +0200, Cesar Hernandez wrote:
> > P.S. If the issue is just a matter of timing when you're starting both nodes, you can start corosync on both nodes first, then start pacemaker on both nodes. That way pacemaker on each node will immediately see the other node's presence.
>
> Well, rebooting a server takes approximately 2 minutes.
> I think I'm going to keep the same workaround I have on other servers:
>
> - set crm stonith-timeout=300s
> - have a "sleep 180" in the fencing script, so the fencing will always last 3 minutes
>
> So when crm fences a node on startup, the fencing script will return after 3 minutes. By that time, the other node should be up and fencing won't be retried.
>
> What do you think about this workaround?
>
> The other solution would be updating Pacemaker, but I have tested this 1.1.14 on many servers, and I don't want to take the risk of updating to 1.1.15 and (maybe) hitting some other new issues...
>
> Thanks a lot!
> Cesar

If you build from source, you can apply the patch that fixes the issue to the 1.1.14 code base:
https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d

--
Ken Gaillot
[ClusterLabs] Antw: Re: Antw: Re: Antw: Q: native_color scores for clones
>>> Ken Gaillot wrote on 05.09.2018 at 16:13 in message <1536156803.4205.1.ca...@redhat.com>:
[...]
> In the case of stickiness, lib/pengine/complex.c has this code:
>
>     (*rsc)->stickiness = 0;
>     ...
>     value = g_hash_table_lookup((*rsc)->meta, XML_RSC_ATTR_STICKINESS);
>     if (value != NULL && safe_str_neq("default", value)) {
>         (*rsc)->stickiness = char2score(value);
>     }
>
> which defaults the stickiness to 0, then uses the integer value of "resource-stickiness" from meta-attributes (as long as it's not the literal string "default"). This is after meta-attributes have been unpacked, which takes care of the precedence of operation attributes > rsc_defaults > legacy properties.

Hi!

Another built-in special rule ("default") 8-(

What data type is stickiness, BTW? I thought it was integer ;-)

[...]

Regards,
Ulrich
[ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions
>>> Kadlecsik József wrote on 05.09.2018 at 15:33:
> Hi,
>
> For testing purposes one of our nodes was put in standby mode and then rebooted several times. When the standby node started up, it joined the cluster as a new member and this resulted in transitions between the online nodes. However, when the standby node was rebooted in mid-transition, it triggered further transitions. As a result, live migrations were aborted and guests were stopped/started.
>
> How can one make sure that join/leave operations of standby nodes do not affect the location of the running resources?
>
> It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on Debian stretch nodes.

Logs and more details, please!

> Best regards,
> Jozsef
Re: [ClusterLabs] Pacemaker startup retries
> P.S. If the issue is just a matter of timing when you're starting both nodes, you can start corosync on both nodes first, then start pacemaker on both nodes. That way pacemaker on each node will immediately see the other node's presence.

Well, rebooting a server takes approximately 2 minutes.
I think I'm going to keep the same workaround I have on other servers:

- set crm stonith-timeout=300s
- have a "sleep 180" in the fencing script, so the fencing will always last 3 minutes

So when crm fences a node on startup, the fencing script will return after 3 minutes. By that time, the other node should be up and fencing won't be retried.

What do you think about this workaround?

The other solution would be updating Pacemaker, but I have tested this 1.1.14 on many servers, and I don't want to take the risk of updating to 1.1.15 and (maybe) hitting some other new issues...

Thanks a lot!
Cesar
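The workaround described above might be sketched as a wrapper script, paired with the raised timeout (`crm configure property stonith-timeout=300s`). Heavily hedged: the agent path and name are hypothetical, and note that for the entire 180 s delay the cluster has already fenced the node but not yet acted on the result, which needs careful thought in your environment:

```shell
#!/bin/sh
# Hypothetical wrapper installed in place of the real fencing agent.
# Run the real fencing action first...
/usr/sbin/fence_real_agent "$@"   # hypothetical path to the real agent
rc=$?
# ...then hold the result for 180s so the fenced peer has time to
# finish rebooting before any fencing retry can be considered.
sleep 180
exit $rc
```

The 300 s stonith-timeout must stay comfortably above the 180 s sleep plus the real agent's own runtime, otherwise the delayed fencing result itself times out.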
Re: [ClusterLabs] Pacemaker startup retries
On Wed, 2018-09-05 at 09:51 -0500, Ken Gaillot wrote:
> On Wed, 2018-09-05 at 16:38 +0200, Cesar Hernandez wrote:
> > Hi
> >
> > > Ah, this rings a bell. Despite having fenced the node, the cluster still considers the node unseen. That was a regression in 1.1.14 that was fixed in 1.1.15. :-(
> >
> > Oh :( I'm using Pacemaker 1.1.14.
> > Do you know if these reboot retries are only run 3 times? In all the tests I've done, the rebooting stopped after 3 times.
> >
> > Thanks
> > Cesar
>
> No, if I remember correctly, it would just keep going until it saw the node. Not sure why it stops after 3.

P.S. If the issue is just a matter of timing when you're starting both nodes, you can start corosync on both nodes first, then start pacemaker on both nodes. That way pacemaker on each node will immediately see the other node's presence.

--
Ken Gaillot
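Ken's ordering suggestion amounts to the following sequence, assuming systemd units (an assumption -- on sysvinit systems the service commands differ):

```shell
# Step 1: on BOTH nodes, start only the membership layer.
systemctl start corosync

# Step 2: only once corosync is running on both nodes,
# start pacemaker on both.
systemctl start pacemaker
```

Because corosync membership is already formed when pacemaker starts, neither node's pacemaker ever observes the other as "unseen", so startup fencing is not triggered.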
Re: [ClusterLabs] Pacemaker startup retries
On Wed, 2018-09-05 at 16:38 +0200, Cesar Hernandez wrote:
> Hi
>
> > Ah, this rings a bell. Despite having fenced the node, the cluster still considers the node unseen. That was a regression in 1.1.14 that was fixed in 1.1.15. :-(
>
> Oh :( I'm using Pacemaker 1.1.14.
> Do you know if these reboot retries are only run 3 times? In all the tests I've done, the rebooting stopped after 3 times.
>
> Thanks
> Cesar

No, if I remember correctly, it would just keep going until it saw the node. Not sure why it stops after 3.

--
Ken Gaillot
Re: [ClusterLabs] Pacemaker startup retries
Hi

> Ah, this rings a bell. Despite having fenced the node, the cluster still considers the node unseen. That was a regression in 1.1.14 that was fixed in 1.1.15. :-(

Oh :( I'm using Pacemaker 1.1.14.
Do you know if these reboot retries are only run 3 times? In all the tests I've done, the rebooting stopped after 3 times.

Thanks
Cesar
Re: [ClusterLabs] Pacemaker startup retries
On Wed, 2018-09-05 at 13:31 +0200, Cesar Hernandez wrote:
> Hi
>
> > The first fencing is legitimate -- the node hasn't been seen at start-up, and so needs to be fenced. The second fencing will be the one of interest. Also, look for the result of the first fencing.
>
> The first fencing finished with OK, as did the other two fencing operations:
>
> Aug 31 10:59:31 [30612] node1 stonith-ng: notice: log_operation: Operation 'reboot' [31075] (call 2 from crmd.30616) for host 'node2' with device 'st-fence_propio:0' returned: 0 (OK)
>
> And the next log entries are:
>
> Aug 31 10:59:31 [30612] node1 stonith-ng: notice: remote_op_done: Operation reboot of node2 by node1 for crmd.30616@node1.2d7857e7: OK
> Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_callback: Stonith operation 2/81:0:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7: OK (0)
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_get_peer: Created entry 48db3347-5bbe-4cd4-b7ba-db4c697c3146/0x55adbeb587f0 for node node2/0 (2 total)
> Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: node2 is now in unknown state
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_get_peer: Node 0 has uuid node2
> Aug 31 10:59:31 [30616] node1 crmd: notice: crm_update_peer_state_iter: crmd_peer_down: Node node2[0] - state is now lost (was (null))
> Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: node2 is now lost (was in unknown state)
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_update_peer_proc: crmd_peer_down: Node node2[0] - all processes are now offline
> Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: Client node2/peer now has status [offline] (DC=true, changed=1)
> Aug 31 10:59:31 [30616] node1 crmd: info: crm_update_peer_expected: crmd_peer_down: Node node2[0] - expected state is now down (was (null))
> Aug 31 10:59:31 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 10:59:31 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/transient_attributes
> Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
>
> After some other entries I see:
>
> Aug 31 10:59:37 [30615] node1 pengine: warning: pe_fence_node: Node node2 will be fenced because the peer has not been seen by the cluster

Ah, this rings a bell. Despite having fenced the node, the cluster still considers the node unseen. That was a regression in 1.1.14 that was fixed in 1.1.15. :-(

> Aug 31 10:59:37 [30615] node1 pengine: warning: determine_online_status: Node node2 is unclean
>
> The server takes approx. 2 minutes to reboot, so it's normal that it hasn't been seen after just 6 seconds. But I don't know why the server is rebooted three times:
>
> Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
> Aug 31 10:59:53 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by client crmd.30616
> Aug 31 11:00:05 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
>
> After the last message, I only see that it stops fencing and starts resources:
>
> Aug 31 11:00:05 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
> Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/68, version=0.60.30)
> Aug 31 11:00:05 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 11:00:05 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/transient_attributes
> Aug 31 11:00:05 [30616] node1 crmd: info: cib_fencing_updated: Fencing update 68 for node2: complete
> Aug 31 11:00:05 [30616] node1 crmd: notice: te_rsc_command: Initiating action 66: start p_fs_database_start_0 on node1 (local)
> Aug 31 11:00:05 [30616] node1 crmd: info: do_lrm_rsc_op: Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 op=p_fs_database_start_0
> Aug 31 11:00:05
Re: [ClusterLabs] Antw: Re: Antw: Q: native_color scores for clones
On Wed, 2018-09-05 at 09:32 +0200, Ulrich Windl wrote: > > > > Ken Gaillot schrieb am 04.09.2018 um > > > > 19:21 in Nachricht > > <1536081690.4387.6.ca...@redhat.com>: > > On Tue, 2018-09-04 at 11:22 +0200, Ulrich Windl wrote: > > > > > > In Reply to my message am 30.08.2018 um 12:23 in Nachricht > > > > > > <5B87C5A0.A46 : 161 : > > > > > > 60728>: > > > > Hi! > > > > > > > > After having found showscores.sh, I thought I can improve the > > > > perfomance by > > > > porting it to Perl, but it seems the slow part actually is > > > > calling > > > > pacemakers > > > > helper scripts like crm_attribute, crm_failcount, etc... > > > > > > Actually the performance gain was less than expected, until I > > > added a > > > cache for calling external programs reading stickiness, fail > > > count > > > and migration threshold. Here are the numbers: > > > > > > showscores.sh (original): > > > real0m46.181s > > > user0m15.573s > > > sys 0m21.761s > > > > > > showscores.pl (without cache): > > > real0m46.053s > > > user0m15.861s > > > sys 0m20.997s > > > > > > showscores.pl (with cache): > > > real0m25.998s > > > user0m7.940s > > > sys 0m12.609s > > > > > > This made me think whether it's possible to retrieve such > > > attributes > > > in a more efficient way, arising the question how the > > > corresponding > > > tools actually do work (those attributes are obviously not part > > > of > > > the CIB). > > > > Actually they are ... the policy engine (aka scheduler in 2.0) has > > only > > the CIB to make decisions. The other daemons have some additional > > state > > that can affect their behavior, but the scheduling of actions > > relies > > solely on the CIB. > > > > Stickiness and migration-threshold are in the resource > > configuration > > (or defaults); fail counts are in the transient node attributes in > > the > > status section (which can only be retrieved from the live CIB or > > attribute daemons, not the CIB on disk, which may be why you didn't > > see > > them). 
> > Looking for the fail count, the closest match I got looked like this: > operation_key="prm_LVM_VMD_monitor_0" operation="monitor" crm-debug- > origin="build_active_RAs" crm_feature_set="3.0.10" transition- > key="58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662" transition- > magic="0:0;58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662" > on_node="h05" call-id="143" rc-code="0" op-status="0" interval="0" > last-run="1533809981" last-rc-change="1533809981" exec-time="89" > queue-time="0" op-digest="22698b9dba36e2926819f13c77569222"/> That's the most recently failed instance of the operation (for that node and resource). It's used to show the "failed actions" section of the crm_mon display. > Do I have to count and filter such elements, or is there a more > direct way to get the fail count? The simplest way to get the fail count is with the crm_failcount tool. That's especially true since per-operation fail counts were added in 1.1.17. In the CIB XML, fail counts are stored within . The attribute name will start with "fail-count- ". They can also be queried via attrd_updater, if you know the attribute name. > > > I can get the CIB via cib_admin and I can parse the XML if > > > needed, > > > but how can I get these other attributes? Tracing crm_attribute, > > > it > > > seems it reads some partially binary file from locations like > > > /dev/shm/qb-attrd-response-*. > > > > The /dev/shm files are simply where IPC communication is buffered, > > so > > that's just the response from a cib query (the equivalent of > > cibadmin > > -Q). > > > > > I would think that _all_ relevant attributes should be part of > > > the > > > CIB... > > > > Yep, they are :) > > I'm still having problems, sorry. > > > > > Often a final value is calculated from the CIB configuration, > > rather > > than directly in it. 
> > For example, for stickiness, the actual value could be in the
> > resource configuration, a resource template, or resource defaults,
> > or (pre-2.0) the legacy cluster properties for default stickiness.
> > The configuration "unpacking" code will choose the final value
> > based on a hierarchy of preference.
>
> I guess the actual algorithm is hidden somewhere. Could I do that
> with XPath queries and some accumulation of numbers (like using the
> max or min), or is it more complicated?

Most of the default values are directly in the code that unpacks the
configuration. Much of it is in lib/pengine/unpack.c, but parts are
scattered elsewhere in lib/pengine (and it's not an easy read). Based
on how the memory allocation works, anything not explicitly defaulted
otherwise by code defaults to 0.

> > > The other thing I realized was that "migration threshold" and
> > > "stickiness" are both undefined for several resources (because
> > > the default values for those aren't defined either). I really
> > > wonder: why not (e.g.) specify a default stickiness as integer 0
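[Editor's note: a sketch of pulling both kinds of values out of a single CIB dump, as discussed above. The mini-CIB below is hypothetical, and xmllint (from libxml2) is assumed to be installed; against a real cluster you would dump the live CIB with `cibadmin -Q`, since fail counts exist only in the live status section.]

```shell
#!/usr/bin/env bash
# Hypothetical mini-CIB; in practice: cibadmin -Q > /tmp/cib.xml
cat > /tmp/cib.xml <<'EOF'
<cib>
  <configuration>
    <resources>
      <primitive id="prm_demo" class="ocf" provider="heartbeat" type="Dummy">
        <meta_attributes id="prm_demo-meta">
          <nvpair id="prm_demo-stick" name="resource-stickiness" value="1000"/>
        </meta_attributes>
      </primitive>
    </resources>
  </configuration>
  <status>
    <node_state uname="h05">
      <transient_attributes id="h05">
        <instance_attributes id="status-h05">
          <nvpair id="status-h05-fc" name="fail-count-prm_demo#monitor_0" value="2"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>
EOF

# Configured stickiness, one XPath pass instead of one tool call per resource:
xmllint --xpath 'string(//nvpair[@name="resource-stickiness"]/@value)' /tmp/cib.xml; echo

# All fail-count attributes in the status section:
xmllint --xpath '//transient_attributes//nvpair[starts-with(@name,"fail-count-")]/@name' /tmp/cib.xml; echo
```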
[ClusterLabs] Rebooting a standby node triggers lots of transitions
Hi,

For testing purposes, one of our nodes was put in standby mode and then rebooted several times. When the standby node started up, it joined the cluster as a new member, and that resulted in transitions between the online nodes. However, when the standby node was rebooted mid-transition, it triggered yet more transitions. As a result, live migrations were aborted and guests were stopped/started.

How can one make sure that join/leave operations of standby nodes do not affect the location of the running resources?

It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on Debian stretch nodes.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
         H-1525 Budapest 114, POB. 49, Hungary
Re: [ClusterLabs] Pacemaker startup retries
Hi,

> The first fencing is legitimate -- the node hasn't been seen at
> start-up, and so needs to be fenced. The second fencing will be the
> one of interest. Also, look for the result of the first fencing.

The first fencing finished with OK, as did the other two fencing operations.

Aug 31 10:59:31 [30612] node1 stonith-ng: notice: log_operation: Operation 'reboot' [31075] (call 2 from crmd.30616) for host 'node2' with device 'st-fence_propio:0' returned: 0 (OK)

And the next log entries are:

Aug 31 10:59:31 [30612] node1 stonith-ng: notice: remote_op_done: Operation reboot of node2 by node1 for crmd.30616@node1.2d7857e7: OK
Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_callback: Stonith operation 2/81:0:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7: OK (0)
Aug 31 10:59:31 [30616] node1 crmd: info: crm_get_peer: Created entry 48db3347-5bbe-4cd4-b7ba-db4c697c3146/0x55adbeb587f0 for node node2/0 (2 total)
Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: node2 is now in unknown state
Aug 31 10:59:31 [30616] node1 crmd: info: crm_get_peer: Node 0 has uuid node2
Aug 31 10:59:31 [30616] node1 crmd: notice: crm_update_peer_state_iter: crmd_peer_down: Node node2[0] - state is now lost (was (null))
Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: node2 is now lost (was in unknown state)
Aug 31 10:59:31 [30616] node1 crmd: info: crm_update_peer_proc: crmd_peer_down: Node node2[0] - all processes are now offline
Aug 31 10:59:31 [30616] node1 crmd: info: peer_update_callback: Client node2/peer now has status [offline] (DC=true, changed=1)
Aug 31 10:59:31 [30616] node1 crmd: info: crm_update_peer_expected: crmd_peer_down: Node node2[0] - expected state is now down (was (null))
Aug 31 10:59:31 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/lrm
Aug 31 10:59:31 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/transient_attributes
Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616

After some other entries I see:

Aug 31 10:59:37 [30615] node1 pengine: warning: pe_fence_node: Node node2 will be fenced because the peer has not been seen by the cluster
Aug 31 10:59:37 [30615] node1 pengine: warning: determine_online_status: Node node2 is unclean

The server takes approximately 2 minutes to reboot, so it's normal that it hasn't been seen after just 6 seconds. But I don't know why the server is rebooted three times:

Aug 31 10:59:31 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616
Aug 31 10:59:53 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by client crmd.30616
Aug 31 11:00:05 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616

After the last message, I only see that it stops fencing and starts resources:

Aug 31 11:00:05 [30616] node1 crmd: notice: tengine_stonith_notify: Peer node2 was terminated (reboot) by node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
Aug 31 11:00:05 [30611] node1 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/68, version=0.60.30)
Aug 31 11:00:05 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/lrm
Aug 31 11:00:05 [30616] node1 crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='node2']/transient_attributes
Aug 31 11:00:05 [30616] node1 crmd: info: cib_fencing_updated: Fencing update 68 for node2: complete
Aug 31 11:00:05 [30616] node1 crmd: notice: te_rsc_command: Initiating action 66: start p_fs_database_start_0 on node1 (local)
Aug 31 11:00:05 [30616] node1 crmd: info: do_lrm_rsc_op: Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 op=p_fs_database_start_0
Aug 31 11:00:05 [30613] node1 lrmd: info: log_execute: executing - rsc:p_fs_database action:start call_id:70
Aug 31 11:00:05 [30616] node1 crmd: notice: te_rsc_command: Initiating action 77: start p_fs_datosweb_start_0 on node1 (local)
Aug 31 11:00:05 [30616] node1 crmd: info: do_lrm_rsc_op: Performing key=77:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 op=p_fs_datosweb_start_0
Aug 31
[ClusterLabs] Antw: Re: Q: Configure date_expression rule with crm shell without using XML
>>> Kristoffer Grönlund wrote on 04.09.2018 at 16:17 in message <1536070650.9365.15.ca...@suse.de>:
> On Tue, 2018-09-04 at 16:00 +0200, Ulrich Windl wrote:
>> Hi!
>>
>> I have a question: Can I use crm shell to configure time-based
>> meta_attributes without using XML?
>> I tried to configure it using XML, but since then the resource
>> configuration can be displayed as XML only.
>> What I'm trying to do is to configure resource-specific stickiness
>> based on time only, wanting to set stickiness to 0 in the evening
>> hours to allow re-balancing resources.
>
> It should be possible to do this via rule expressions in crm shell,
> but not all XML that is possible is expressible via crm shell syntax.
>
> If you can show me the XML for the resource which doesn't render in
> crm shell I could probably be more specific.
>
> http://crmsh.github.io/man/#topics_Syntax_RuleExpressions

Hi Kristoffer,

crmsh is so incredibly cool: I could add my changes like the following (snippet) without having to deal with XML (as presented in clusterlabs' example "Change resource-stickiness during working hours"):

meta 1: resource-stickiness=0 \
meta 2: rule date spec hours=7-18 weekdays=1-5 resource-stickiness=1000

and that translates to the corresponding XML rule.

Now if "show changed" could display the changes in "diff -u" or wdiff format, it would be even cooler. Maybe create another command like "show changes" or "show diff[erences]"... ;-)

Regards,
Ulrich

> --
> Cheers,
> Kristoffer
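[Editor's note: the XML that such a crmsh rule corresponds to was elided above; roughly, it follows the "Change resource-stickiness during working hours" example from Pacemaker Explained. All ids below are illustrative, not what crmsh actually generates.]

```xml
<meta_attributes id="demo-meta-2">
  <rule id="demo-rule" score="0">
    <!-- date_expression with operation="date_spec" matches when the
         current time falls inside the nested date_spec ranges -->
    <date_expression id="demo-date" operation="date_spec">
      <date_spec id="demo-date-spec" hours="7-18" weekdays="1-5"/>
    </date_expression>
  </rule>
  <nvpair id="demo-stickiness" name="resource-stickiness" value="1000"/>
</meta_attributes>
```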