Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-29 Thread Jan Pokorný
On 28/09/16 16:30 -0400, Scott Greenlese wrote:
> Also,  I have tried simulating a failed cluster node (to trigger a
> STONITH action) by killing the corosync daemon on one node, but all
> that does is respawn the daemon ...  causing a temporary / transient
> failure condition, and no fence takes place.   Is there a way to
> kill corosync in such a way that it stays down?   Is there a best
> practice for STONITH testing?

This makes me seriously wonder what could cause this involuntary
daemon-scoped high availability...

Are you sure you are using the upstream-provided initscript/unit file?

(Just hope there's no fence_corosync_restart.)
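
If systemd is in play, a quick sanity check would be something like this
(just a sketch, assuming the stock corosync.service unit name):

  systemctl cat corosync.service | grep -i restart   # shows any Restart= in the unit file or drop-ins
  systemctl show corosync.service -p Restart         # the effective restart policy systemd will apply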

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-29 Thread Tomas Jelinek

Dne 29.9.2016 v 00:14 Ken Gaillot napsal(a):

On 09/28/2016 03:57 PM, Scott Greenlese wrote:

A quick addendum...

After sending this post, I decided to stop pacemaker on the single,
Online node in the cluster,
and this effectively killed the corosync daemon:

[root@zs93kl VD]# date;pcs cluster stop
Wed Sep 28 16:39:22 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...


Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.


[root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
Wed Sep 28 16:46:19 EDT 2016


Totally irrelevant, but a little trick I picked up somewhere: when
grepping for a process, square-bracketing a character lets you avoid the
"grep -v", e.g. "ps -ef | grep cor[o]"

It's nice when I remember to use it ;)
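
Or, assuming a recent enough procps is around, pgrep sidesteps the
self-match entirely:

  pgrep -a corosync    # PID plus full command line of any corosync process, no grep -v needed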


[root@zs93kl VD]#



Next, I went to a node in "Pending" state, and sure enough... the pcs
cluster stop killed the daemon there, too:

[root@zs95kj VD]# date;pcs cluster stop
Wed Sep 28 16:48:15 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
Wed Sep 28 16:48:38 EDT 2016
[root@zs95kj VD]#

So, this answers my own question... cluster stop should kill corosync.
So why is `pcs cluster stop --all` failing to kill corosync?


It should. At least you've narrowed it down :)


This is a bug in pcs. Thanks for spotting it and providing a detailed
description. I filed the bug here:
https://bugzilla.redhat.com/show_bug.cgi?id=1380372


Regards,
Tomas




Thanks...


Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com




From: Scott Greenlese/Poughkeepsie/IBM
To: kgail...@redhat.com, Cluster Labs - All topics related to
open-source clustering welcomed <users@clusterlabs.org>
Date: 09/28/2016 04:30 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior




Hi folks..

I have some follow-up questions about corosync daemon status after
cluster shutdown.

Basically, what should happen to corosync on a cluster node when
pacemaker is shut down on that node?
On my 5 node cluster, when I do a global shutdown, the pacemaker
processes exit, but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution. I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.

With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note: The "No route to host" messages are expected because that node /
LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4
active nodes. I do show it later).

I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
have quorum when it comes up and activates
pacemaker, which is enabled to autostart at boot time on all 5 cluster
nodes. At this point, only 1 out of 5
nodes should be Online to the cluster, and therefore ... no quorum.

I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
Online, and "partition with quorum":


Corosync determines quorum, pacemaker just uses it. If corosync is
running, the node contributes to quorum.
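As far as I know, that is also why those "pending" nodes count here:
"pending" in the status output means corosync already sees the node as a
member, while the pacemaker-level join has not completed. You can query
corosync's own view directly (standard corosync 2.x tooling assumed):

  corosync-quorumtool -s    # quorum status: expected votes, total votes, quorate or not
  corosync-quorumtool -l    # the nodes corosync currently counts as members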


[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016 Last change: Mon Sep 26 16:15:08
2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
partition with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.


Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync
|grep -v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
roo

Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-28 Thread Scott Greenlese

A quick addendum...

After sending this post, I decided to stop pacemaker on the single, Online
node in the cluster,
and this effectively killed the corosync daemon:

[root@zs93kl VD]# date;pcs cluster stop
Wed Sep 28 16:39:22 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...


[root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
Wed Sep 28 16:46:19 EDT 2016
[root@zs93kl VD]#



Next, I went to a node in "Pending" state, and sure enough... the pcs
cluster stop killed the daemon there, too:

[root@zs95kj VD]# date;pcs cluster stop
Wed Sep 28 16:48:15 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
Wed Sep 28 16:48:38 EDT 2016
[root@zs95kj VD]#

So, this answers my own question...  cluster stop should kill corosync.
So why is `pcs cluster stop --all` failing to kill corosync?

Thanks...


Scott Greenlese ... IBM KVM on System Z Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com





From:   Scott Greenlese/Poughkeepsie/IBM
To: kgail...@redhat.com, Cluster Labs - All topics related to
open-source clustering welcomed <users@clusterlabs.org>
Date:   09/28/2016 04:30 PM
Subject:Re: [ClusterLabs] Pacemaker quorum behavior


Hi folks..

I have some follow-up questions about corosync daemon status after cluster
shutdown.

Basically, what should happen to corosync on a cluster node when pacemaker
is shut down on that node?
On my 5 node cluster, when I do a global shutdown, the pacemaker processes
exit, but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution.   I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.

With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note:  The "No route to host" messages are expected because that node /
LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4
active nodes. I do show it later).

I then powered on the one zs93KLpcs1 LPAR,  so in theory I should not have
quorum when it comes up and activates
pacemaker, which is enabled to autostart at boot time on all 5 cluster
nodes.  At this point, only 1 out of 5
nodes should be Online to the cluster, and therefore ... no quorum.

I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
Online, and "partition with quorum":

[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016  Last change: Mon Sep 26
16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

 zs95kjg109062_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.


Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep
-v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root 155374  1  0 Sep26 ?00:10:17 corosync
zs95KL
root  22933  1  0 11:51 ?00:00:54 corosync
zs95kj
root  19382  1  0 Sep26 ?00:10:15 corosync
zs93kj
root 129102  1  0 Sep26 ?00:12:10 corosync
zs93kl
root  21894  1  0 15:19 ?00:00:00 corosync


But, pacemaker is only running on the one, online node:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd |
grep -v grep"; done
Wed Sep 28 15:23:29 EDT 2016
zs90KP
zs95KL
zs95kj
zs93kj
zs93kl
root  23005  1  0 15:19 ?00:00:00 /usr/sbin/pacemakerd -f
You have new mail in /var/spool/mail/root
[root@zs95kj VD]#


This situation wreaks havoc on my VirtualDomain resources, as the majority
of them are in FAILED or Stopped state, and to my
surprise... many of them show as Started:

[root@zs93kl VD]# date;pcs resource show |grep zs93KL
Wed Sep 28 15:55:29 EDT 2016
 zs95kjg109062_res  (ocf::heartbe

Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-28 Thread Scott Greenlese
): Started zs93KLpcs1
 zs95kjg110122_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110123_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110124_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110125_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110126_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110128_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110129_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110130_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110131_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110132_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110133_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110134_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110135_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110137_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110138_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110139_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110140_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110141_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110142_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110143_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110144_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110145_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110146_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110148_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110149_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110150_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110152_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110154_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110155_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110156_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110159_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110160_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110161_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110164_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110165_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110166_res  (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1


Pacemaker is attempting to activate all VirtualDomain resources on the one
cluster node.

So back to my original question... what should happen to corosync when I do
a cluster stop?
If it should be deactivated, what would prevent this?

Also,  I have tried simulating a failed cluster node (to trigger a STONITH
action) by killing the
corosync daemon on one node, but all that does is respawn the daemon ...
causing a temporary / transient
failure condition, and no fence takes place.   Is there a way to kill
corosync in such a way
that it stays down?   Is there a best practice for STONITH testing?
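
Approaches I've seen suggested for exercising fencing (I'm not sure which
counts as best practice; the node name below is just an example):

  pcs stonith fence zs90kppcs1        # ask the cluster to fence a specific node
  stonith_admin --reboot zs90kppcs1   # drive the fencer directly
  echo c > /proc/sysrq-trigger        # run ON the victim; crashes it instantly (needs sysrq enabled)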

As usual, thanks in advance for your advice.

Scott Greenlese ... IBM KVM on System Z -  Solutions Test,  Poughkeepsie,
N.Y.
  INTERNET:  swgre...@us.ibm.com





From:   Ken Gaillot <kgail...@redhat.com>
To: users@clusterlabs.org
Date:   09/09/2016 06:23 PM
Subject:Re: [ClusterLabs] Pacemaker quorum behavior



On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>
>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>
>> Please see my answers nested below (sections entitled, "Scott's
>> Reply"). Thanks!
>>
>> - Scott
>>
>>
>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgre...@us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>>
>>
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> To: users@clusterlabs.org
>> Date: 09/08/2016 10:59 AM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> 
>>
>>
>>
>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>> >
>> > Hi all...
>> >
>> > I have a few very basic questions for the group.
>> >
>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>> > VirtualDomai

Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-09 Thread Ken Gaillot
On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>
>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>
>> Please see my answers nested below (sections entitled, "Scott's
>> Reply"). Thanks!
>>
>> - Scott
>>
>>
>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgre...@us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>>
>>
>> From: Klaus Wenninger <kwenn...@redhat.com>
>> To: users@clusterlabs.org
>> Date: 09/08/2016 10:59 AM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> 
>>
>>
>>
>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>> >
>> > Hi all...
>> >
>> > I have a few very basic questions for the group.
>> >
>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>> > VirtualDomain pacemaker-remote nodes
>> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
>> > to be 'symmetric' and I have no
>> > location constraints on the 200 VirtualDomain resources (other than to
>> > prevent the opaque guests
>> > from running on the pacemaker remote node resources). My quorum is set
>> > as:
>> >
>> > quorum {
>> > provider: corosync_votequorum
>> > }
>> >
>> > As an experiment, I powered down one LPAR in the cluster, leaving 4
>> > powered up with the pcsd service up on the 4 survivors
>> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
>> > survivors. I then started pacemaker/corosync on a single cluster
>> >
>>
>> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
>> did you check the status of the individual services?
>>
>> Scott's reply:
>>
>> No, I only assumed that pacemaker was down because I got this back on
>> my pcs status
>> command from each cluster node:
>>
>> [root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
>> zs93kjpcs1 ; do ssh $host pcs status; done
>> Wed Sep 7 15:49:27 EDT 2016
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node

In my experience, this is sufficient to say that pacemaker and corosync
aren't running.

>>
>> What else should I check?  The pcsd.service unit was still up, since I
>> didn't stop that anywhere. Should I have run
>> "ps -ef | grep -e pacemaker -e corosync" to check the state before
>> assuming it was really down?
>>
>>
> Guess the answer from Poki should guide you well here ...
>>
>>
>> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
>> > resources activating on the single node.
>> > This was not what I was expecting. I assumed that no resources would
>> > activate / start on any cluster nodes
>> > until 3 out of the 5 total cluster nodes had pacemaker/corosync running.

Your expectation is correct; I'm not sure what happened in this case.
There are some obscure corosync options (e.g. last_man_standing,
allow_downscale) that could theoretically lead to this, but I don't get
the impression you're using anything unusual.
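
For reference, a sketch of where such options would live in corosync.conf
(none of them are implied to be set here; see votequorum(5)):

quorum {
    provider: corosync_votequorum
    # examples of options that change the quorum math:
    # last_man_standing: 1   # recalculate quorum as nodes drop out of the membership
    # allow_downscale: 1     # let expected_votes shrink when nodes leave cleanly
    # wait_for_all: 1        # after a fresh start, stay inquorate until all nodes have been seen once
}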

>> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
>> > this is what I see :
>> >
>> > [root@zs95kj VD]# date;pcs status |less
>> > Wed Sep 7 15:51:17 EDT 2016
>> > Cluster name: test_cluster_2
>> > Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
>> > 2016 by hacluster via crmd on zs93kjpcs1
>> > Stack: corosync
>> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>> > partition with quorum
>> > 106 nodes and 304 resources configured
>> >
>> > Node zs93KLpcs1: pending
>> > Node zs93kjpcs1: pending
>> > Node zs95KLpcs1: pending
>> > Online: [ zs95kjpcs1 ]
>> > OFFLINE: [ zs90kppcs1 ]
>> >
>> > .
>> > .
>> > .
>> > PCSD Status:
>> > zs93kjpcs1: On

Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-09 Thread Jan Pokorný
On 09/09/16 14:13 -0400, Scott Greenlese wrote:
> You had mentioned this command:
> 
> pstree -p | grep -A5 $(pidof -x pcs)
> 
> I'm not quite sure what the $(pidof -x pcs) represents??

This is a "command substitution" shell construct (the newer, preferred
form of the `backtick` notation) that in this particular case was meant
to yield the PID of the running pcs command.  The whole compound command
was then meant to possibly discover what pcs is running under the hood,
because that's what might get stuck.
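
In other words, if a pcs command were still running at that moment:

  pidof -x pcs                 # prints the PID of a running "pcs" script, if any
  pstree -p | grep -A5 12345   # what the full command expands to, with 12345 standing in for that PID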

> On an "Online" cluster node, I see:
> 
> [root@zs93kj ~]# ps -ef |grep pcs |grep -v grep
> root  18876  1  0 Sep07 ?
> 00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
> root  18905  18876  0 Sep07 ?00:00:00 /bin/bash -c ulimit -S -c
> 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
> root  18906  18905  0 Sep07 ?00:04:22 /usr/bin/ruby
> -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
> [root@zs93kj ~]#
> 
> If I use the 18876 PID on a healthy node, I get..
> 
> [root@zs93kj ~]# pstree -p |grep -A5 18876
>|-pcsd(18876)---bash(18905)---ruby(18906)-+-{ruby}(19102)
>| |-{ruby}(20212)
>| `-{ruby}(224258)
>|-pkcsslotd(18851)
>|-polkitd(19091)-+-{polkitd}(19100)
>||-{polkitd}(19101)
> 
> 
> Is this what you meant for me to do?

Only if my guess about the "pcs cluster stop" command being stuck had been
right, which is not the case, as you explained.

> If so, I'll be sure to do that next time I suspect processes are not
> exiting on cluster kill or stop.

In this other case, you really want to consult "systemctl status X"
for X in (corosync, pacemaker).   And to be really sure, run, for instance,
"pgrep Y" for Y in (pacemakerd, crmd, corosync).

(I hope I didn't confuse you too much with the original wild guess.)
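
To spell the above out, something along these lines would do (just a sketch):

  for s in corosync pacemaker; do systemctl --no-pager status $s | head -n 3; done
  for p in pacemakerd crmd corosync; do pgrep -l $p; done   # no output means no such process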

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-09 Thread Scott Greenlese
ualDomain): Started zs95kjpcs1
 zs95kjg109065_res  (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109066_res  (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109067_res  (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109068_res  (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
.
.
.
PCSD Status:
  zs93kjpcs1: Online
  zs95kjpcs1: Online
  zs95KLpcs1: Online
  zs90kppcs1: Offline
  zs93KLpcs1: Online


Check resources again:

Wed Sep  7 16:09:52 EDT 2016

 ### VirtualDomain Resource Statistics: ###

"_res" Virtual Domain resources:
  Started on zs95kj: 199
  Started on zs93kj: 0
  Started on zs95KL: 0
  Started on zs93KL: 0
  Started on zs90KP: 0
  Total Started: 199
  Total NOT Started: 1


I have since isolated all the corrupted virtual domain images and disabled
their VirtualDomain resources.
We already rebooted all five cluster nodes, after installing a new KVM
driver on them.

Now,  the quorum calculation and behavior seem to be working perfectly, as
expected.

I started pacemaker on the nodes, one at a time... and, after 3 of the 5
nodes had pacemaker "Online" ...
resources activated and were evenly distributed across them.

In summary,  a lesson learned here is to check the status of the pcs process
to be certain pacemaker and corosync are indeed "offline" and that all
threads of that process have terminated.
You had mentioned this command:

pstree -p | grep -A5 $(pidof -x pcs)

I'm not quite sure what the $(pidof -x pcs) represents??

On an "Online" cluster node, I see:

[root@zs93kj ~]# ps -ef |grep pcs |grep -v grep
root  18876  1  0 Sep07 ?
00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root  18905  18876  0 Sep07 ?00:00:00 /bin/bash -c ulimit -S -c
0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root  18906  18905  0 Sep07 ?00:04:22 /usr/bin/ruby
-I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
[root@zs93kj ~]#

If I use the 18876 PID on a healthy node, I get..

[root@zs93kj ~]# pstree -p |grep -A5 18876
   |-pcsd(18876)---bash(18905)---ruby(18906)-+-{ruby}(19102)
   | |-{ruby}(20212)
   | `-{ruby}(224258)
   |-pkcsslotd(18851)
   |-polkitd(19091)-+-{polkitd}(19100)
   ||-{polkitd}(19101)


Is this what you meant for me to do?  If so, I'll be sure to do that next
time I suspect processes are not exiting on cluster kill or stop.

Thanks


Scott Greenlese ... IBM z/BX Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966




From:   Jan Pokorný <jpoko...@redhat.com>
To: Cluster Labs - All topics related to open-source clustering
welcomed <users@clusterlabs.org>
Cc: Si Bo Niu <nius...@cn.ibm.com>, Scott
    Loveland/Poughkeepsie/IBM@IBMUS, Michael
Tebolt/Poughkeepsie/IBM@IBMUS
Date:   09/08/2016 02:43 PM
Subject:Re: [ClusterLabs] Pacemaker quorum behavior



On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
>
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts,  I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?

If it seemed to have stuck, you can perhaps run some diagnostics like:

  pstree -p | grep -A5 $(pidof -x pcs)

across the nodes to see what process(es) pcs is waiting on, next time.

> so I ended up doing individual (pcs cluster kill ) on
> each of the four cluster nodes.   I then had to stop the virtual
> domains manually via 'virsh destroy ' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly if corosync processes were indeed dead.

--
Jan (Poki)
[attachment "attyopgs.dat" deleted by Scott Greenlese/Poughkeepsie/IBM]
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-09 Thread Klaus Wenninger
On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>
> Hi Klaus, thanks for your prompt and thoughtful feedback...
>
> Please see my answers nested below (sections entitled, "Scott's
> Reply"). Thanks!
>
> - Scott
>
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
>
>
> From: Klaus Wenninger <kwenn...@redhat.com>
> To: users@clusterlabs.org
> Date: 09/08/2016 10:59 AM
> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>
> 
>
>
>
> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
> >
> > Hi all...
> >
> > I have a few very basic questions for the group.
> >
> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
> > VirtualDomain pacemaker-remote nodes
> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
> > to be 'symmetric' and I have no
> > location constraints on the 200 VirtualDomain resources (other than to
> > prevent the opaque guests
> > from running on the pacemaker remote node resources). My quorum is set
> > as:
> >
> > quorum {
> > provider: corosync_votequorum
> > }
> >
> > As an experiment, I powered down one LPAR in the cluster, leaving 4
> > powered up with the pcsd service up on the 4 survivors
> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
> > survivors. I then started pacemaker/corosync on a single cluster
> >
>
> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
> did you check the status of the individual services?
>
> Scott's reply:
>
> No, I only assumed that pacemaker was down because I got this back on
> my pcs status
> command from each cluster node:
>
> [root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
> zs93kjpcs1 ; do ssh $host pcs status; done
> Wed Sep 7 15:49:27 EDT 2016
> Error: cluster is not currently running on this node
> Error: cluster is not currently running on this node
> Error: cluster is not currently running on this node
> Error: cluster is not currently running on this node
>  
>
> What else should I check?  The pcsd.service unit was still up, since I
> didn't stop that anywhere. Should I have run
> "ps -ef | grep -e pacemaker -e corosync" to check the state before
> assuming it was really down?
>
>
Guess the answer from Poki should guide you well here ...
>
>
> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
> > resources activating on the single node.
> > This was not what I was expecting. I assumed that no resources would
> > activate / start on any cluster nodes
> > until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
> >
> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
> > this is what I see :
> >
> > [root@zs95kj VD]# date;pcs status |less
> > Wed Sep 7 15:51:17 EDT 2016
> > Cluster name: test_cluster_2
> > Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
> > 2016 by hacluster via crmd on zs93kjpcs1
> > Stack: corosync
> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> > partition with quorum
> > 106 nodes and 304 resources configured
> >
> > Node zs93KLpcs1: pending
> > Node zs93kjpcs1: pending
> > Node zs95KLpcs1: pending
> > Online: [ zs95kjpcs1 ]
> > OFFLINE: [ zs90kppcs1 ]
> >
> > .
> > .
> > .
> > PCSD Status:
> > zs93kjpcs1: Online
> > zs95kjpcs1: Online
> > zs95KLpcs1: Online
> > zs90kppcs1: Offline
> > zs93KLpcs1: Online
> >
> > So, what exactly constitutes an "Online" vs. "Offline" cluster node
> > w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
> > nodes,
> > so where does that fall? And why "pending"? What does that mean?
> >
> > Also, what exactly is the cluster's expected reaction to quorum loss?
> > Cluster resources will be stopped or something else?
> >
> Depends on how you configure it using cluster property no-quorum-policy
> (default: stop).
>
> Scott's reply:
>
> This is how the policy is configured:
>
> [root@zs95kj VD]# date;pcs config |grep quorum
> Thu Sep

Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Jan Pokorný
On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
> 
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts,  I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?

If it seemed to have stuck, you can perhaps run some diagnostics like:

  pstree -p | grep -A5 $(pidof -x pcs)

across the nodes to see what process(es) pcs is waiting on, next time.

> so I ended up doing individual (pcs cluster kill ) on
> each of the four cluster nodes.   I then had to stop the virtual
> domains manually via 'virsh destroy ' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly if corosync processes were indeed dead.

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Scott Greenlese

Hi Klaus, thanks for your prompt and thoughtful feedback...

Please see my answers nested below (sections entitled, "Scott's Reply").
Thanks!

- Scott


Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966




From:   Klaus Wenninger <kwenn...@redhat.com>
To: users@clusterlabs.org
Date:   09/08/2016 10:59 AM
Subject:Re: [ClusterLabs] Pacemaker quorum behavior



On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>
> Hi all...
>
> I have a few very basic questions for the group.
>
> I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
> VirtualDomain pacemaker-remote nodes
> plus 100 "opaque" VirtualDomain resources. The cluster is configured
> to be 'symmetric' and I have no
> location constraints on the 200 VirtualDomain resources (other than to
> prevent the opaque guests
> from running on the pacemaker remote node resources). My quorum is set
> as:
>
> quorum {
> provider: corosync_votequorum
> }
>
> As an experiment, I powered down one LPAR in the cluster, leaving 4
> powered up with the pcsd service up on the 4 survivors
> but corosync/pacemaker down (pcs cluster stop --all) on the 4
> survivors. I then started pacemaker/corosync on a single cluster
>

"pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
did you check the status of the individual services?

Scott's reply:

No, I only assumed that pacemaker was down because I got this back on my
pcs status
command from each cluster node:

[root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 ; do ssh $host pcs status; done
Wed Sep  7 15:49:27 EDT 2016
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node


What else should I check?  The pcsd.service unit was still up, since I
didn't stop that anywhere. Should I have run
"ps -ef | grep -e pacemaker -e corosync" to check the state before
assuming it was really down?




> node (pcs cluster start), and this resulted in the 200 VirtualDomain
> resources activating on the single node.
> This was not what I was expecting. I assumed that no resources would
> activate / start on any cluster nodes
> until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
>
> After starting pacemaker/corosync on the single host (zs95kjpcs1),
> this is what I see :
>
> [root@zs95kj VD]# date;pcs status |less
> Wed Sep 7 15:51:17 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
> 2016 by hacluster via crmd on zs93kjpcs1
> Stack: corosync
> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
>
> Node zs93KLpcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Online: [ zs95kjpcs1 ]
> OFFLINE: [ zs90kppcs1 ]
>
> .
> .
> .
> PCSD Status:
> zs93kjpcs1: Online
> zs95kjpcs1: Online
> zs95KLpcs1: Online
> zs90kppcs1: Offline
> zs93KLpcs1: Online
>
> So, what exactly constitutes an "Online" vs. "Offline" cluster node
> w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
> nodes,
> > so where does that fall? And why "pending"? What does that mean?
>
> Also, what exactly is the cluster's expected reaction to quorum loss?
> Cluster resources will be stopped or something else?
>
Depends on how you configure it using cluster property no-quorum-policy
(default: stop).

Scott's reply:

This is how the policy is configured:

[root@zs95kj VD]# date;pcs config |grep quorum
Thu Sep  8 13:18:33 EDT 2016
 no-quorum-policy: stop

What should I expect with the 'stop' setting?
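
For reference, my reading of Pacemaker Explained (the pcs syntax below is
only a sketch):

  pcs property list --all | grep no-quorum-policy    # show the current / default value
  pcs property set no-quorum-policy=stop
    # stop:    stop all resources in the partition that lost quorum (the default)
    # ignore:  keep managing resources as if the partition were quorate
    # freeze:  keep what is already running, but start nothing new
    # suicide: fence every node in the affected partition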


>
>
> Where can I find this documentation?
>
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/

Scott's reply:

OK, I'll keep looking thru this doc, but I don't easily find the
no-quorum-policy explained.

Thanks..


>
>
> Thanks!
>
> Scott Greenlese - IBM Solution Test Team.
>
>
>
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



___
Users mailing list: Us

[ClusterLabs] Pacemaker quorum behavior

2016-09-08 Thread Scott Greenlese

Hi all...

I have a few very basic questions for the group.

I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100 VirtualDomain
pacemaker-remote nodes
plus 100 "opaque" VirtualDomain resources. The cluster is configured to be
'symmetric' and I have no
location constraints on the 200 VirtualDomain resources (other than to
prevent the opaque guests
from running on the pacemaker remote node resources).  My quorum is set as:

quorum {
provider: corosync_votequorum
}

As an experiment, I powered down one LPAR in the cluster, leaving 4 powered
up with the pcsd service up on the 4 survivors
but corosync/pacemaker down (pcs cluster stop --all) on the 4 survivors.
I then started pacemaker/corosync on a single cluster
node (pcs cluster start), and this resulted in the 200 VirtualDomain
resources activating on the single node.
This was not what I was expecting.  I assumed that no resources would
activate / start on any cluster nodes
until 3 out of the 5 total cluster nodes had pacemaker/corosync running.
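Sanity-checking the arithmetic, assuming plain corosync_votequorum with one
vote per node: quorum = floor(5/2) + 1 = 3 votes, so that expectation matches
the documented behavior, and on a running node it can be confirmed with:

  corosync-quorumtool -s    # shows Expected votes, Total votes and Quorate: Yes/No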

After starting pacemaker/corosync on the single host (zs95kjpcs1), this is
what I see :

[root@zs95kj VD]# date;pcs status |less
Wed Sep  7 15:51:17 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep  7 15:51:18 2016  Last change: Wed Sep  7
15:30:12 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs93KLpcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Online: [ zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]

.
.
.
PCSD Status:
  zs93kjpcs1: Online
  zs95kjpcs1: Online
  zs95KLpcs1: Online
  zs90kppcs1: Offline
  zs93KLpcs1: Online

So, what exactly constitutes an "Online" vs. "Offline" cluster node w.r.t.
quorum calculation?   Seems like in my case, it's "pending" on 3 nodes,
so where does that fall?   And why "pending"?  What does that mean?

Also, what exactly is the cluster's expected reaction to quorum loss?
Cluster resources will be stopped or something else?

Where can I find this documentation?

Thanks!

Scott Greenlese -  IBM Solution Test Team.



Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org