Re: [ClusterLabs] Users Digest, Vol 44, Issue 11

2018-09-05 Thread Jeffrey Westgate
Greetings from a confused user;

We are running Pacemaker as part of a load-balanced cluster of two members, 
both VMware VMs, with both acting as stepping-stones to our DNS recursive 
resolvers (RR).  Simple use - the /etc/resolv.conf on the *NIX boxes points 
at both IPs, and the cluster forwards to one of multiple RRs for DNS resolution.
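
(For illustration only - the addresses and resource names below are placeholders, 
not taken from this post - the setup presumably amounts to something like:)

  # /etc/resolv.conf on the *NIX clients (192.0.2.x are placeholder addresses)
  nameserver 192.0.2.11    # floating IP held by cluster member A
  nameserver 192.0.2.12    # floating IP held by cluster member B

  # Pacemaker side (crmsh), one floating IP per member (hypothetical names)
  primitive ip_dns1 ocf:heartbeat:IPaddr2 \
      params ip=192.0.2.11 cidr_netmask=24 op monitor interval=10s
  primitive ip_dns2 ocf:heartbeat:IPaddr2 \
      params ip=192.0.2.12 cidr_netmask=24 op monitor interval=10s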

Today, for an as-yet undetermined reason, one of the two members started 
failing to connect to the RRs. Intermittently. And quite annoyingly, as this 
has affected data center operations.  No matter what we've tried, one member 
fails intermittently; the other is fine.
And we've tried - 
 - rebooted the affected member - it came back up clean and fine, but the 
issue remained.
 - failed the cluster over, moving both IPs to the second member server; failover 
was successful, but the problem remained.
   -- this moved the entire cluster to a different VM on a different VMware host 
server, so different NIC, etc...
 - failed the cluster back to the original server; both IPs appeared on the 
'suspect' VM, and the problem remained.
 - restored the cluster; both IPs are on the proper VMs, but the one still fails 
intermittently while the second just chugs along.

Any ideas what could be causing this?  Is this something that could be caused 
by the cluster config?  Anybody ever seen anything similar?

Our current unsustainable workaround is to remove the IP for the affected 
member from the *NIX resolv.conf file.

I appreciate any reasonable suggestions.  (I am not the creator of the cluster, 
just the guy trying to figure it out. Unfortunately the creator, and my mentor, is 
dearly departed and, in times like this, sorely missed.)

Any replies will be read and responded to early tomorrow AM.  Thanks for 
understanding.
--
Jeff Westgate
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread Mark Adams
What I am struggling to understand here is why it is being referred to as
a "SAN" when it is not concurrently available... How are you mounting this
"SAN" on each host?

On 5 September 2018 at 18:55, Andrei Borzenkov  wrote:

> 05.09.2018 19:13, Lentes, Bernd wrote:
> > Hi guys,
> >
> > just to be sure. I thought (maybe i'm wrong) that having a VM on a
> shared storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN
> allows live-migration because pacemaker takes care that the ext3 fs is at
> any time only mounted on one node.
>
>
> While live migration requires concurrent access from both nodes at the
> same time.
>
> > I tried it, but "live"-migration wasn't possible. The vm was always
> shutdown before migration. Or do i need OCFS2 ?
>
> You need to be able to access image from both nodes at the same time. If
> image is on file system, it must be clustered filesystem. Or just use
> raw device for image as already suggested.
>
> > Could anyone clarifies this ?
> >
> >
> > Bernd
> >
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Mark Adams
Director
--
Open Virtualisation Solutions Ltd.
Registered in England and Wales number: 07709887
Office Address: 274 Verdant Lane, London, SE6 1TW
Office: +44 (0)333 355 0160
Mobile: +44 (0)750 800 1289
Site: http://www.openvs.co.uk
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread Andrei Borzenkov
05.09.2018 19:13, Lentes, Bernd wrote:
> Hi guys,
> 
> just to be sure. I thought (maybe i'm wrong) that having a VM on a shared 
> storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN allows 
> live-migration because pacemaker takes care that the ext3 fs is at any time 
> only mounted on one node.


Live migration, on the other hand, requires concurrent access from both nodes
at the same time.

> I tried it, but "live"-migration wasn't possible. The vm was always shutdown 
> before migration. Or do i need OCFS2 ?

You need to be able to access the image from both nodes at the same time. If
the image is on a file system, it must be a clustered file system. Or just use a
raw device for the image, as already suggested.
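
(A minimal sketch of the raw-device route - resource, file, and LV names are
hypothetical, not taken from this thread: point the guest disk at the shared LV
directly and let Pacemaker manage a migratable VirtualDomain resource.)

  # libvirt domain XML: disk on the raw (clustered) LV, no filesystem in between
  #   <disk type='block' device='disk'>
  #     <source dev='/dev/vg_cluster/lv_vm1'/>    <!-- placeholder LV path -->
  #     <target dev='vda' bus='virtio'/>
  #   </disk>

  # crmsh (placeholder names)
  primitive vm_vm1 ocf:heartbeat:VirtualDomain \
      params config="/etc/libvirt/qemu/vm1.xml" hypervisor="qemu:///system" \
          migration_transport=ssh \
      meta allow-migrate=true \
      op monitor interval=30s timeout=60s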

> Could anyone clarifies this ?
> 
> 
> Bernd
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread c.hernandez
 

>If you build from source, you can apply the patch that fixes the issue
>to the 1.1.14 code base:

>https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d
I just applied the patch, and now it works as expected. The unseen node is
only rebooted once on startup.

Thanks a lot!

Cheers
Cesar
 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread FeldHost™ Admin
Why use an FS for a raw image, when you can use an LV directly as a block device 
for your VM?

> On 5 Sep 2018, at 18:34, Lentes, Bernd  
> wrote:
> 
> 
> 
> - On Sep 5, 2018, at 6:28 PM, FeldHost™ Admin ad...@feldhost.cz wrote:
> 
>> hello, yes, you need ocfs2 or gfs2, but in your case (raw image) probably 
>> better
>> to use lvm
> 
> I use cLVM. The fs for the raw image resides on a clustered VG/LV.
> But nevertheless i still need a cluster fs because of the concurrent access ?
> 
> Bernd
> 
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias H. Tschoep, Heinrich 
> Bassler, Dr. rer. nat. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread Mark Adams
Or iSCSI, or NFS. But there are many better solutions nowadays. Of course it
does all depend on your setup, but ext3 is crazy old now!

On Wed, 5 Sep 2018, 19:35 Lentes, Bernd, 
wrote:

>
>
> - On Sep 5, 2018, at 6:28 PM, FeldHost™ Admin ad...@feldhost.cz wrote:
>
> > hello, yes, you need ocfs2 or gfs2, but in your case (raw image)
> probably better
> > to use lvm
>
> I use cLVM. The fs for the raw image resides on a clustered VG/LV.
> But nevertheless i still need a cluster fs because of the concurrent
> access ?
>
> Bernd
>
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias H. Tschoep, Heinrich
> Bassler, Dr. rer. nat. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread Lentes, Bernd


- On Sep 5, 2018, at 6:28 PM, FeldHost™ Admin ad...@feldhost.cz wrote:

> hello, yes, you need ocfs2 or gfs2, but in your case (raw image) probably 
> better
> to use lvm

I use cLVM. The FS for the raw image resides on a clustered VG/LV.
But nevertheless I still need a cluster FS because of the concurrent access?

Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias H. Tschoep, Heinrich 
Bassler, Dr. rer. nat. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread FeldHost™ Admin
Hello, yes, you need OCFS2 or GFS2, but in your case (raw image) it is probably 
better to use LVM.

> On 5 Sep 2018, at 18:13, Lentes, Bernd  
> wrote:
> 
> Hi guys,
> 
> just to be sure. I thought (maybe i'm wrong) that having a VM on a shared 
> storage (FC SAN), e.g. in a raw file on an ext3 fs on that SAN allows 
> live-migration because pacemaker takes care that the ext3 fs is at any time 
> only mounted on one node. I tried it, but "live"-migration wasn't possible. 
> The vm was always shutdown before migration. Or do i need OCFS2 ?
> Could anyone clarifies this ?
> 
> 
> Bernd
> 
> -- 
> 
> Bernd Lentes 
> Systemadministration 
> Institut für Entwicklungsgenetik 
> Gebäude 35.34 - Raum 208 
> HelmholtzZentrum münchen 
> [ mailto:bernd.len...@helmholtz-muenchen.de | 
> bernd.len...@helmholtz-muenchen.de ] 
> phone: +49 89 3187 1241 
> fax: +49 89 3187 2294 
> [ http://www.helmholtz-muenchen.de/idg | http://www.helmholtz-muenchen.de/idg 
> ] 
> 
> wer Fehler macht kann etwas lernen 
> wer nichts macht kann auch nichts lernen
> 
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias H. Tschoep, Heinrich 
> Bassler, Dr. rer. nat. Alfons Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions

2018-09-05 Thread Ken Gaillot
On Wed, 2018-09-05 at 17:43 +0200, Ulrich Windl wrote:
> > > > Kadlecsik József  wrote on 05.09.2018 at 15:33:
> > Hi,
> > 
> > For testing purposes one of our nodes was put in standby node and
> > then 
> > rebooted several times. When the standby node started up, it joined
> > the 
> > cluster as a new member and it resulted in transitions between the
> > online 
> > nodes. However, when the standby node was rebooted in
> > mid‑transitions, it 
> > triggered another transitions again. As a result, live migrations
> > was 
> > aborted and guests stopped/started.
> > 
> > How can one make sure that join/leave operations of standby nodes
> > do not 
> > affect the location of the running resources?
> > 
> > It's pacemaker 1.1.16‑1 with corosync 2.4.2‑3+deb9u1 on debian
> > stretch 
> > nodes.

Node joins/leaves do and should trigger new transitions, but that
should not result in any actions if the node is in standby.

The cluster will wait for any actions in progress (such as a live
migration) to complete before beginning a new transition, so there is
likely something else going on that is affecting the migration.

> Logs and more details, please!

Particularly the detail log on the DC should be helpful. It will have
"pengine:" messages with "saving inputs" at each transition.

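(Assuming the usual file locations - the log path and pe-input number below are
only illustrative - the relevant entries and the corresponding transition inputs
can be inspected roughly like this:)

  # on the DC (log path varies by distribution)
  grep -e "pengine.*saving inputs" /var/log/pacemaker.log

  # replay one of the saved transition inputs to see why actions were scheduled
  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-123.bz2
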
> > 
> > Best regards,
> > Jozsef
> > ‑‑
> > E‑mail : kadlecsik.joz...@wigner.mta.hu 
> > PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt 
> > Address: Wigner Research Centre for Physics, Hungarian Academy of
> > Sciences
> >  H‑1525 Budapest 114, POB. 49, Hungary
> > ___
> > Users mailing list: Users@clusterlabs.org 
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > Project Home: http://www.clusterlabs.org 
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratc
> > h.pdf 
> > Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] SAN, pacemaker, KVM: live-migration with ext3 ?

2018-09-05 Thread Lentes, Bernd
Hi guys,

just to be sure: I thought (maybe I'm wrong) that having a VM on shared 
storage (FC SAN), e.g. in a raw file on an ext3 FS on that SAN, allows 
live migration because Pacemaker takes care that the ext3 FS is only mounted on 
one node at any time. I tried it, but "live" migration wasn't possible. The 
VM was always shut down before migration. Or do I need OCFS2?
Could anyone clarify this?


Bernd

-- 

Bernd Lentes 
Systemadministration 
Institut für Entwicklungsgenetik 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum münchen 
[ mailto:bernd.len...@helmholtz-muenchen.de | 
bernd.len...@helmholtz-muenchen.de ] 
phone: +49 89 3187 1241 
fax: +49 89 3187 2294 
[ http://www.helmholtz-muenchen.de/idg | http://www.helmholtz-muenchen.de/idg ] 

wer Fehler macht kann etwas lernen 
wer nichts macht kann auch nichts lernen
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias H. Tschoep, Heinrich 
Bassler, Dr. rer. nat. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Ken Gaillot
On Wed, 2018-09-05 at 17:21 +0200, Cesar Hernandez wrote:
> > 
> > P.S. If the issue is just a matter of timing when you're starting
> > both
> > nodes, you can start corosync on both nodes first, then start
> > pacemaker
> > on both nodes. That way pacemaker on each node will immediately see
> > the
> > other node's presence.
> > -- 
> 
> Well rebooting a server lasts 2 minutes approximately. 
> I think I'm going to keep the same workaround I have on other
> servers:
> 
> -set crm stonith-timeout=300s
> -have a "sleep 180" in the fencing script, so the fencing will always
> last 3 minutes
> 
> So when crm fences a node on startup, the fencing script will return
> after 3 minutes. And at that time, the other node should be up and it
> won't be retried fencing
> 
> What you think about this workaround?
> 
> 
> The other solution would be updating pacemaker, but this 1.1.14 I
> have tested on many servers, and I don't want to take the risk to
> update to 1.1.15 and (maybe) have some other new issues...
> 
> Thanks a lot!
> Cesar

If you build from source, you can apply the patch that fixes the issue
to the 1.1.14 code base:

https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d
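
(Roughly, for a 1.1.14 source tree - the commands below are illustrative; adjust
paths and configure options to your own build:)

  cd pacemaker-1.1.14        # path is a placeholder
  curl -L -o fix.patch \
      https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d.patch
  patch -p1 < fix.patch
  ./autogen.sh && ./configure && make && sudo make install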
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Antw: Re: Antw: Q: native_color scores for clones

2018-09-05 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 05.09.2018 at 16:13 in message
<1536156803.4205.1.ca...@redhat.com>:
[...]
> In the case of stickiness, lib/pengine/complex.c has this code:
> 
> (*rsc)->stickiness = 0;
> ...
> value = g_hash_table_lookup((*rsc)->meta, XML_RSC_ATTR_STICKINESS);
> if (value != NULL && safe_str_neq("default", value)) {
> (*rsc)->stickiness = char2score(value);
> }
> 
> which defaults the stickiness to 0, then uses the integer value of
> "resource-stickiness" from meta-attributes (as long as it's not the
> literal string "default"). This is after meta-attributes have been
> unpacked, which takes care of the precedence of operation attributes >
> rsc_defaults > legacy properties.

Hi!

Another built-in special rule ("default") 8-( 
What data type is stickiness, BTW? I thought it was integer ;-)
[...]

Regards,
Ulrich



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions

2018-09-05 Thread Ulrich Windl
>>> Kadlecsik József  wrote on 05.09.2018 at 15:33:
> Hi,
> 
> For testing purposes one of our nodes was put in standby node and then 
> rebooted several times. When the standby node started up, it joined the 
> cluster as a new member and it resulted in transitions between the online 
> nodes. However, when the standby node was rebooted in mid‑transitions, it 
> triggered another transitions again. As a result, live migrations was 
> aborted and guests stopped/started.
> 
> How can one make sure that join/leave operations of standby nodes do not 
> affect the location of the running resources?
> 
> It's pacemaker 1.1.16‑1 with corosync 2.4.2‑3+deb9u1 on debian stretch 
> nodes.

Logs and more details, please!

> 
> Best regards,
> Jozsef
> ‑‑
> E‑mail : kadlecsik.joz...@wigner.mta.hu 
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt 
> Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
>  H‑1525 Budapest 114, POB. 49, Hungary
> ___
> Users mailing list: Users@clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Cesar Hernandez


> 
> P.S. If the issue is just a matter of timing when you're starting both
> nodes, you can start corosync on both nodes first, then start pacemaker
> on both nodes. That way pacemaker on each node will immediately see the
> other node's presence.
> -- 

Well, rebooting a server takes approximately 2 minutes.
I think I'm going to keep the same workaround I have on other servers:

- set the cluster property stonith-timeout=300s
- have a "sleep 180" in the fencing script, so the fencing will always take 3 
minutes

So when the cluster fences a node on startup, the fencing script will return after 
3 minutes. And by that time, the other node should be up, so fencing won't be 
retried.
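
(In concrete terms, the workaround amounts to something like the following - the
fence agent itself is site-specific and hypothetical here:)

  # cluster property (crmsh)
  crm configure property stonith-timeout=300s

  # inside the custom fence agent, before reporting success for 'reboot'
  sleep 180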

What do you think about this workaround?


The other solution would be updating Pacemaker, but I have tested this 1.1.14 
on many servers, and I don't want to take the risk of updating to 1.1.15 and 
(maybe) hitting some other new issues...

Thanks a lot!
Cesar

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Ken Gaillot
On Wed, 2018-09-05 at 09:51 -0500, Ken Gaillot wrote:
> On Wed, 2018-09-05 at 16:38 +0200, Cesar Hernandez wrote:
> > Hi
> > 
> > > 
> > > Ah, this rings a bell. Despite having fenced the node, the
> > > cluster
> > > still considers the node unseen. That was a regression in 1.1.14
> > > that
> > > was fixed in 1.1.15. :-(
> > > 
> > 
> >  Oh :( I'm using Pacemaker-1.1.14.
> > Do you know if this reboot retries are just run 3 times? All the
> > tests I've done the rebooting is finished after 3 times.
> > 
> > Thanks
> > Cesar
> 
> No, if I remember correctly, it would just keep going until it saw
> the
> node. Not sure why it stops after 3.

P.S. If the issue is just a matter of timing when you're starting both
nodes, you can start corosync on both nodes first, then start pacemaker
on both nodes. That way pacemaker on each node will immediately see the
other node's presence.
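
(On systemd-based distributions that is simply - run on both nodes, in this order:)

  systemctl start corosync
  # wait until corosync is up on both nodes, then:
  systemctl start pacemaker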
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Ken Gaillot
On Wed, 2018-09-05 at 16:38 +0200, Cesar Hernandez wrote:
> Hi
> 
> > 
> > Ah, this rings a bell. Despite having fenced the node, the cluster
> > still considers the node unseen. That was a regression in 1.1.14
> > that
> > was fixed in 1.1.15. :-(
> > 
> 
>  Oh :( I'm using Pacemaker-1.1.14.
> Do you know if this reboot retries are just run 3 times? All the
> tests I've done the rebooting is finished after 3 times.
> 
> Thanks
> Cesar

No, if I remember correctly, it would just keep going until it saw the
node. Not sure why it stops after 3.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Cesar Hernandez
Hi

> 
> Ah, this rings a bell. Despite having fenced the node, the cluster
> still considers the node unseen. That was a regression in 1.1.14 that
> was fixed in 1.1.15. :-(
> 

 Oh :( I'm using Pacemaker 1.1.14.
Do you know if these reboot retries are only run 3 times? In all the tests I've 
done, the rebooting finished after 3 attempts.

Thanks
Cesar


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Ken Gaillot
On Wed, 2018-09-05 at 13:31 +0200, Cesar Hernandez wrote:
> Hi
> 
> > 
> > The first fencing is legitimate -- the node hasn't been seen at
> > start-
> > up, and so needs to be fenced. The second fencing will be the one
> > of
> > interest. Also, look for the result of the first fencing.
> 
> The first fencing has finished with OK, as well as the other two
> fencing operations.
> 
> Aug 31 10:59:31 [30612] node1 stonith-ng:   notice:
> log_operation: Operation 'reboot' [31075] (call 2 from
> crmd.30616) for host 'node2' with device 'st-fence_propio:0'
> returned: 0 (OK)
> 
> And the next log entries are:
> 
> Aug 31 10:59:31 [30612] node1 stonith-ng:   notice:
> remote_op_done:Operation reboot of node2 by node1 for crmd.30616@
> node1.2d7857e7: OK
> Aug 31 10:59:31 [30616] node1   crmd:   notice:
> tengine_stonith_callback:  Stonith operation 2/81:0:0:c64efa2b-9366-
> 4c07-b5f1-6a2dbee79fe7: OK (0)
> Aug 31 10:59:31 [30616] node1   crmd: info:
> crm_get_peer:  Created entry 48db3347-5bbe-4cd4-b7ba-
> db4c697c3146/0x55adbeb587f0 for node node2/0 (2 total)
> Aug 31 10:59:31 [30616] node1   crmd: info:
> peer_update_callback:  node2 is now in unknown state
> Aug 31 10:59:31 [30616] node1   crmd: info:
> crm_get_peer:  Node 0 has uuid node2
> Aug 31 10:59:31 [30616] node1   crmd:   notice:
> crm_update_peer_state_iter:crmd_peer_down: Node node2[0] -
> state is now lost (was (null))
> Aug 31 10:59:31 [30616] node1   crmd: info:
> peer_update_callback:  node2 is now lost (was in unknown state)
> Aug 31 10:59:31 [30616] node1   crmd: info:
> crm_update_peer_proc:  crmd_peer_down: Node node2[0] - all
> processes are now offline
> Aug 31 10:59:31 [30616] node1   crmd: info:
> peer_update_callback:  Client node2/peer now has status [offline]
> (DC=true, changed= 1)
> Aug 31 10:59:31 [30616] node1   crmd: info:
> crm_update_peer_expected:  crmd_peer_down: Node node2[0] - expected
> state is now down (was (null))
> Aug 31 10:59:31 [30616] node1   crmd: info:
> erase_status_tag:  Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 10:59:31 [30616] node1   crmd: info:
> erase_status_tag:  Deleting xpath:
> //node_state[@uname='node2']/transient_attributes
> Aug 31 10:59:31 [30616] node1   crmd:   notice:
> tengine_stonith_notify:Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by
> client crmd.30616
> 
> After some other entries I see:
> 
> Aug 31 10:59:37 [30615] node1pengine:  warning:
> pe_fence_node: Node node2 will be fenced because the peer has not
> been seen by the cluster

Ah, this rings a bell. Despite having fenced the node, the cluster
still considers the node unseen. That was a regression in 1.1.14 that
was fixed in 1.1.15. :-(

> Aug 31 10:59:37 [30615] node1pengine:  warning:
> determine_online_status:   Node node2 is unclean
> 
> The server lasts aprox 2 minutes to reboot, so it's normal to haven't
> been seen after just 6 seconds. But I don't know why the server is
> rebooted three times:
> 
> Aug 31 10:59:31 [30616] node1   crmd:   notice:
> tengine_stonith_notify:   Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=2d7857e7-7e88-482a-812f-b343218974dc) by
> client crmd.30616
> 
> Aug 31 10:59:53 [30616] node1   crmd:   notice:
> tengine_stonith_notify:   Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by
> client crmd.30616
> 
> Aug 31 11:00:05 [30616] node1   crmd:   notice:
> tengine_stonith_notify:   Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by
> client crmd.30616
> 
> 
> After the last message, I only see it stops fencing and start
> resources:
> 
> 
> Aug 31 11:00:05 [30616] node1   crmd:   notice:
> tengine_stonith_notify:Peer node2 was terminated (reboot) by
> node1 for node1: OK (ref=17931f5b-76ea-4e3a-a792-535cea50afca) by
> client crmd.30616
> Aug 31 11:00:05 [30611] node1cib: info:
> cib_process_request:   Completed cib_modify operation for section
> status: OK (rc=0, origin=local/crmd/68, version=0.60.30)
> Aug 31 11:00:05 [30616] node1   crmd: info:
> erase_status_tag:  Deleting xpath: //node_state[@uname='node2']/lrm
> Aug 31 11:00:05 [30616] node1   crmd: info:
> erase_status_tag:  Deleting xpath:
> //node_state[@uname='node2']/transient_attributes
> Aug 31 11:00:05 [30616] node1   crmd: info:
> cib_fencing_updated:   Fencing update 68 for node2: complete
> Aug 31 11:00:05 [30616] node1   crmd:   notice:
> te_rsc_command:Initiating action 66: start p_fs_database_start_0
> on node1 (local)
> Aug 31 11:00:05 [30616] node1   crmd: info:
> do_lrm_rsc_op: Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-
> 6a2dbee79fe7 op=p_fs_database_start_0
> Aug 31 11:00:05 

Re: [ClusterLabs] Antw: Re: Antw: Q: native_color scores for clones

2018-09-05 Thread Ken Gaillot
On Wed, 2018-09-05 at 09:32 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot  wrote on 04.09.2018 at 19:21 in message
> > > > <1536081690.4387.6.ca...@redhat.com>:
> > On Tue, 2018-09-04 at 11:22 +0200, Ulrich Windl wrote:
> > > > > > In Reply to my message am 30.08.2018 um 12:23 in Nachricht
> > > > > > <5B87C5A0.A46 : 161 :
> > > 
> > > 60728>:
> > > > Hi!
> > > > 
> > > > After having found showscores.sh, I thought I can improve the
> > > > perfomance by 
> > > > porting it to Perl, but it seems the slow part actually is
> > > > calling
> > > > pacemakers 
> > > > helper scripts like crm_attribute, crm_failcount, etc...
> > > 
> > > Actually the performance gain was less than expected, until I
> > > added a
> > > cache for calling external programs reading stickiness, fail
> > > count
> > > and migration threshold. Here are the numbers:
> > > 
> > > showscores.sh (original):
> > > real0m46.181s
> > > user0m15.573s
> > > sys 0m21.761s
> > > 
> > > showscores.pl (without cache):
> > > real0m46.053s
> > > user0m15.861s
> > > sys 0m20.997s
> > > 
> > > showscores.pl (with cache):
> > > real0m25.998s
> > > user0m7.940s
> > > sys 0m12.609s
> > > 
> > > This made me think whether it's possible to retrieve such
> > > attributes
> > > in a more efficient way, arising the question how the
> > > corresponding
> > > tools actually do work (those attributes are obviously not part
> > > of
> > > the CIB).
> > 
> > Actually they are ... the policy engine (aka scheduler in 2.0) has
> > only
> > the CIB to make decisions. The other daemons have some additional
> > state
> > that can affect their behavior, but the scheduling of actions
> > relies
> > solely on the CIB.
> > 
> > Stickiness and migration-threshold are in the resource
> > configuration
> > (or defaults); fail counts are in the transient node attributes in
> > the
> > status section (which can only be retrieved from the live CIB or
> > attribute daemons, not the CIB on disk, which may be why you didn't
> > see
> > them).
> 
> Looking for the fail count, the closest match I got looked like this:
> <lrm_rsc_op operation_key="prm_LVM_VMD_monitor_0" operation="monitor"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662"
> transition-magic="0:0;58:107:7:8a33be7f-1b68-45bf-9143-54414ff3b662"
> on_node="h05" call-id="143" rc-code="0" op-status="0" interval="0"
> last-run="1533809981" last-rc-change="1533809981" exec-time="89"
> queue-time="0" op-digest="22698b9dba36e2926819f13c77569222"/>

That's the most recently failed instance of the operation (for that
node and resource). It's used to show the "failed actions" section of
the crm_mon display.

> Do I have to count and filter such elements, or is there a more
> direct way to get the fail count?

The simplest way to get the fail count is with the crm_failcount tool.
That's especially true since per-operation fail counts were added in
1.1.17.

In the CIB XML, fail counts are stored as node attributes under
<transient_attributes> in the status section. The attribute name will start
with "fail-count-".

They can also be queried via attrd_updater, if you know the attribute
name.
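
(For example - the resource and node names below are placeholders:)

  # per-resource fail count as Pacemaker computes it
  crm_failcount --query --resource my_rsc --node node1

  # the raw transient attribute, if you know its name
  attrd_updater --query --name fail-count-my_rsc --node node1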

> > > I can get the CIB via cib_admin and I can parse the XML if
> > > needed,
> > > but how can I get these other attributes? Tracing crm_attribute,
> > > it
> > > seems it reads some partially binary file from locations like
> > > /dev/shm/qb-attrd-response-*.
> > 
> > The /dev/shm files are simply where IPC communication is buffered,
> > so
> > that's just the response from a cib query (the equivalent of
> > cibadmin
> > -Q).
> > 
> > > I would think that _all_ relevant attributes should be part of
> > > the
> > > CIB...
> > 
> > Yep, they are :)
> 
> I'm still having problems, sorry.
> 
> > 
> > Often a final value is calculated from the CIB configuration,
> > rather
> > than directly in it. For example, for stickiness, the actual value
> > could be in the resource configuration, a resource template, or
> > resource defaults, or (pre-2.0) the legacy cluster properties for
> > default stickiness. The configuration "unpacking" code will choose
> > the
> > final value based on a hierarchy of preference.
> 
> I guess the actual algorithm is hidden somewhere. Could I do that
> with XPath queries and some accumulation of numbers (like using the
> max or min), or is it more complicated?

Most of the default values are directly in the code when unpacking the
configuration. Much of it is in lib/pengine/unpack.c, but parts are
scattered elsewhere in lib/pengine (and it's not an easy read).

Based on how the memory allocation works, anything not explicitly
defaulted otherwise by code defaults to 0.

> > > The other thing I realized was that both "migration threshold"
> > > and
> > > "stickiness" are both undefined for several resources (due to the
> > > fact that the default values for those also aren't defined). I
> > > really
> > > wonder: Why not (e.g.) specify a default stickiness as integer 0

[ClusterLabs] Rebooting a standby node triggers lots of transitions

2018-09-05 Thread Kadlecsik József
Hi,

For testing purposes one of our nodes was put in standby mode and then 
rebooted several times. When the standby node started up, it joined the 
cluster as a new member and that resulted in transitions between the online 
nodes. However, when the standby node was rebooted mid-transition, it 
triggered yet another transition. As a result, live migrations were 
aborted and guests were stopped/started.
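
(What the test amounted to, roughly - the node name is a placeholder:)

  crm node standby node3
  reboot    # repeated several times, sometimes while a transition was still running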

How can one make sure that join/leave operations of standby nodes do not 
affect the location of the running resources?

It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on debian stretch 
nodes.

Best regards,
Jozsef
--
E-mail : kadlecsik.joz...@wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
 H-1525 Budapest 114, POB. 49, Hungary
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup retries

2018-09-05 Thread Cesar Hernandez
Hi

> 
> The first fencing is legitimate -- the node hasn't been seen at start-
> up, and so needs to be fenced. The second fencing will be the one of
> interest. Also, look for the result of the first fencing.

The first fencing finished with OK, as did the other two fencing 
operations.

Aug 31 10:59:31 [30612] node1 stonith-ng:   notice: log_operation: 
Operation 'reboot' [31075] (call 2 from crmd.30616) for host 'node2' with 
device 'st-fence_propio:0' returned: 0 (OK)

And the next log entries are:

Aug 31 10:59:31 [30612] node1 stonith-ng:   notice: remote_op_done:
Operation reboot of node2 by node1 for crmd.30616@node1.2d7857e7: OK
Aug 31 10:59:31 [30616] node1   crmd:   notice: tengine_stonith_callback:  
Stonith operation 2/81:0:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7: OK (0)
Aug 31 10:59:31 [30616] node1   crmd: info: crm_get_peer:  Created 
entry 48db3347-5bbe-4cd4-b7ba-db4c697c3146/0x55adbeb587f0 for node node2/0 (2 
total)
Aug 31 10:59:31 [30616] node1   crmd: info: peer_update_callback:  
node2 is now in unknown state
Aug 31 10:59:31 [30616] node1   crmd: info: crm_get_peer:  Node 0 
has uuid node2
Aug 31 10:59:31 [30616] node1   crmd:   notice: crm_update_peer_state_iter: 
   crmd_peer_down: Node node2[0] - state is now lost (was (null))
Aug 31 10:59:31 [30616] node1   crmd: info: peer_update_callback:  
node2 is now lost (was in unknown state)
Aug 31 10:59:31 [30616] node1   crmd: info: crm_update_peer_proc:  
crmd_peer_down: Node node2[0] - all processes are now offline
Aug 31 10:59:31 [30616] node1   crmd: info: peer_update_callback:  
Client node2/peer now has status [offline] (DC=true, changed= 1)
Aug 31 10:59:31 [30616] node1   crmd: info: crm_update_peer_expected:  
crmd_peer_down: Node node2[0] - expected state is now down (was (null))
Aug 31 10:59:31 [30616] node1   crmd: info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/lrm
Aug 31 10:59:31 [30616] node1   crmd: info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/transient_attributes
Aug 31 10:59:31 [30616] node1   crmd:   notice: tengine_stonith_notify:
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616

After some other entries I see:

Aug 31 10:59:37 [30615] node1pengine:  warning: pe_fence_node: Node 
node2 will be fenced because the peer has not been seen by the cluster
Aug 31 10:59:37 [30615] node1pengine:  warning: determine_online_status:   
Node node2 is unclean

The server takes approximately 2 minutes to reboot, so it's normal that it hasn't 
been seen after just 6 seconds. But I don't know why the server is rebooted three times:

Aug 31 10:59:31 [30616] node1   crmd:   notice: tengine_stonith_notify: 
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=2d7857e7-7e88-482a-812f-b343218974dc) by client crmd.30616

Aug 31 10:59:53 [30616] node1   crmd:   notice: tengine_stonith_notify: 
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=2835cb08-362d-4d39-9133-3a7dcefb913c) by client crmd.30616

Aug 31 11:00:05 [30616] node1   crmd:   notice: tengine_stonith_notify: 
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616


After the last message, I only see it stop fencing and start resources:


Aug 31 11:00:05 [30616] node1   crmd:   notice: tengine_stonith_notify:
Peer node2 was terminated (reboot) by node1 for node1: OK 
(ref=17931f5b-76ea-4e3a-a792-535cea50afca) by client crmd.30616
Aug 31 11:00:05 [30611] node1cib: info: cib_process_request:   
Completed cib_modify operation for section status: OK (rc=0, 
origin=local/crmd/68, version=0.60.30)
Aug 31 11:00:05 [30616] node1   crmd: info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/lrm
Aug 31 11:00:05 [30616] node1   crmd: info: erase_status_tag:  Deleting 
xpath: //node_state[@uname='node2']/transient_attributes
Aug 31 11:00:05 [30616] node1   crmd: info: cib_fencing_updated:   
Fencing update 68 for node2: complete
Aug 31 11:00:05 [30616] node1   crmd:   notice: te_rsc_command:
Initiating action 66: start p_fs_database_start_0 on node1 (local)
Aug 31 11:00:05 [30616] node1   crmd: info: do_lrm_rsc_op: 
Performing key=66:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 
op=p_fs_database_start_0
Aug 31 11:00:05 [30613] node1   lrmd: info: log_execute:   
executing - rsc:p_fs_database action:start call_id:70
Aug 31 11:00:05 [30616] node1   crmd:   notice: te_rsc_command:
Initiating action 77: start p_fs_datosweb_start_0 on node1 (local)
Aug 31 11:00:05 [30616] node1   crmd: info: do_lrm_rsc_op: 
Performing key=77:2:0:c64efa2b-9366-4c07-b5f1-6a2dbee79fe7 
op=p_fs_datosweb_start_0
Aug 31 

[ClusterLabs] Antw: Re: Q: Configure date_expression rule with crm shell without using XML

2018-09-05 Thread Ulrich Windl
>>> Kristoffer Grönlund  wrote on 04.09.2018 at 16:17 in
message <1536070650.9365.15.ca...@suse.de>:
> On Tue, 2018‑09‑04 at 16:00 +0200,  Ulrich Windl  wrote:
>> Hi!
>> 
>> I have a question: Can I use crm shell to configure time‑based
>> meta_attributes without using XML?
>> I tried to configure it using XML, but since then the resource
>> configuration can be displayed as XML only.
>> What I'm trying to do is to configure resource‑specific stickiness
>> based on time only, wanting to set stickiness to 0 in the evening
>> hours to allow re‑balancing resources.
>> 
> 
> It should be possible to do this via rule expressions in crm shell, but
> not all XML that possible is expressible via crm shell syntax.
> 
> If you can show me the XML for the resource which doesn't render in crm
> shell I could probably be more specific.
> 
> http://crmsh.github.io/man/#topics_Syntax_RuleExpressions 

Hi Kristoffer,

crmsh is so incredibly cool: I could add my changes like the following
(snippet) without having to deal with XML (as presented in ClusterLabs'
example "Change resource-stickiness during working hours"):

meta 1: resource-stickiness=0 \
meta 2: rule date spec hours=7-18 weekdays=1-5 resource-stickiness=1000

and that translates to

<meta_attributes id="1">
  <nvpair id="1-resource-stickiness" name="resource-stickiness" value="0"/>
</meta_attributes>
<meta_attributes id="2">
  <rule id="2-rule">
    <date_expression id="2-rule-expression" operation="date_spec">
      <date_spec id="2-rule-expression-date_spec" hours="7-18" weekdays="1-5"/>
    </date_expression>
  </rule>
  <nvpair id="2-resource-stickiness" name="resource-stickiness" value="1000"/>
</meta_attributes>

Now if "show changed" could display the changes in "diff -u" or wdiff format,
it would be even cooler. Maybe create another command like "show changes" or
"show diff[erences]"... ;-)

Regards,
Ulrich

> 
>> Regards,
>> Ulrich
>> 
>> 
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
>> pdf
>> Bugs: http://bugs.clusterlabs.org 
>> 
> ‑‑ 
> 
> Cheers,
> Kristoffer
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org