Re: [ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Andrei Borzenkov
18.07.2018 04:21, Confidential Company wrote:
>>> Hi,
>>>
>>> On my two-node active/passive setup, I configured fencing via
>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
>> expected
>>> that both nodes will be stonithed simultaneously.
>>>
>>> On my test scenario, Node1 has ClusterIP resource. When I
>> disconnect
>>> service/corosync link physically, Node1 was fenced and Node2 keeps
>> alive
>>> given pcmk_delay=0 on both nodes.
>>>
>>> Can you explain the behavior above?
>>>
>>
>> #node1 could not connect to ESX because links were disconnected. As
>> the
>> #most obvious explanation.
>>
>> #You have logs, you are the only one who can answer this question
>> with
>> #some certainty. Others can only guess.
>>
>>
>> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
>> machine (node). The second interface is used for the ESX link, so fencing
>> can be executed even though the corosync link is disconnected. Looking
>> forward to your response. Thanks
> 
> #Having no fence delay means a death match (each node killing the other)
> #is possible, but it doesn't guarantee that it will happen. Some of the
> #time, one node will detect the outage and fence the other one before
> #the other one can react.
> 
> #It's basically an Old West shoot-out -- they may reach for their guns
> #at the same time, but one may be quicker.
> 
> #As Andrei suggested, the logs from both nodes could give you a timeline
> #of what happened when.
> 
> 
> Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
> should have fenced Node2 first, but in the actual test, Node1 was
> fenced/shut down by Node2.
> 

Node1 tried to fence but failed. It could be connectivity, it could be
credentials.
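One way to tell connectivity from credentials apart is to run the fence agent by hand from Node1, outside the cluster. A hedged sketch, reusing the IP/login/port values from the posted config (the long options are the standard fence-agent ones; adjust if your agent version differs):

```shell
# From ArcosRhel1: can we reach fence2's ESX endpoint and log in?
fence_vmware_soap --ip=172.16.10.152 --username=admin --password=123pass \
    --ssl-insecure --action=list

# If listing works, query the specific VM (the "port" in the config):
fence_vmware_soap --ip=172.16.10.152 --username=admin --password=123pass \
    --ssl-insecure --plug='ArcosRhel2(Ben)' --action=status
```

If `list` fails the same way as the logged "Unable to connect/login to fencing device", the problem is outside Pacemaker.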

> Is it possible to have a two-node active/passive setup in
> pacemaker/corosync where only the node whose link goes down gets
> fenced?
> 

If you could determine which node was disconnected you would not need
any fencing at all.

> Thanks guys
> 
> *LOGS from Node2:*
> 
> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
> forming new configuration.
...
> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
> fenced because the node is no longer part of the cluster
...
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation 'reboot'
> [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1'
> returned: 0 (OK)
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]:  notice: Operation reboot of
> ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Stonith operation
> 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]:  notice: Peer ArcosRhel1 was
> terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
...
> 
> 
> 
> *LOGS from NODE1*
> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed,
> forming new configuration
> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be
> fenced because the node is no longer part of the cluster
...
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off'
> to pcmk_reboot_action='off'
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: Fence1 can not fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]:  notice: fence2 can fence
> (reboot) ArcosRhel2: static-list
> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to
> fencing device
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device
> ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [  ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning:
> fence_vmware_soap[7157] stderr: [  ]
> 
> 
> 
> 
> 
> 
>>> See my config below:
>>>
>>> [root@ArcosRhel2 cluster]# pcs config
>>> Cluster Name: ARCOSCLUSTER
>>> Corosync Nodes:
>>>  ArcosRhel1 ArcosRhel2
>>> Pacemaker Nodes:
>>>  ArcosRhel1 ArcosRhel2
>>>
>>> Resources:
>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>>               start interval=0s timeout=20s (ClusterIP-start-
>> interval-0s)
>>>               stop interval=0s timeout=20s (ClusterIP-stop-
>> interval-0s)
>>>
>>> Stonith Devices:
>>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>>   Attributes: action=off ipaddr=172.16.10.151 login=admin
>> passwd=123pass
>>> pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s
>> port=ArcosRhel1(Joniel)
>>> ssl_insecure=1 pcmk_delay_max=0s
>>>   Operations: monitor 

Re: [ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Confidential Company
> > Hi,
> >
> > On my two-node active/passive setup, I configured fencing via
> > fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
> expected
> > that both nodes will be stonithed simultaneously.
> >
> > On my test scenario, Node1 has ClusterIP resource. When I
> disconnect
> > service/corosync link physically, Node1 was fenced and Node2 keeps
> alive
> > given pcmk_delay=0 on both nodes.
> >
> > Can you explain the behavior above?
> >
>
> #node1 could not connect to ESX because links were disconnected. As
> the
> #most obvious explanation.
>
> #You have logs, you are the only one who can answer this question
> with
> #some certainty. Others can only guess.
>
>
> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface is used for the ESX link, so fencing
> can be executed even though the corosync link is disconnected. Looking
> forward to your response. Thanks

#Having no fence delay means a death match (each node killing the other)
#is possible, but it doesn't guarantee that it will happen. Some of the
#time, one node will detect the outage and fence the other one before
#the other one can react.

#It's basically an Old West shoot-out -- they may reach for their guns
#at the same time, but one may be quicker.

#As Andrei suggested, the logs from both nodes could give you a timeline
#of what happened when.


Hi Andrei, kindly see the logs below. Based on the log timestamps, Node1
should have fenced Node2 first, but in the actual test, Node1 was
fenced/shut down by Node2.

Is it possible to have a two-node active/passive setup in pacemaker/corosync
where only the node whose link goes down gets fenced?

Thanks guys

*LOGS from Node2:*

Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed,
forming new configuration.
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] A new membership (
172.16.10.242:220) was formed. Members left: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] Failed to receive the
leave message. failed: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [QUORUM] Members[1]: 2
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [MAIN  ] Completed service
synchronization, ready to provide service.
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Node ArcosRhel1 state is
now lost
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Removing all ArcosRhel1
attributes for peer loss
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Lost attribute writer
ArcosRhel1
Jul 17 13:33:28 ArcosRhel2 attrd[1082]:  notice: Purged 1 peers with id=1
and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 cib[1079]:  notice: Node ArcosRhel1 state is now
lost
Jul 17 13:33:28 ArcosRhel2 cib[1079]:  notice: Purged 1 peers with id=1
and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: Node ArcosRhel1 state is
now lost
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Our DC node (ArcosRhel1)
left the cluster
Jul 17 13:33:28 ArcosRhel2 pacemakerd[1074]:  notice: Node ArcosRhel1 state
is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]:  notice: Node ArcosRhel1 state
is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]:  notice: Purged 1 peers with
id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: State transition S_NOT_DC
-> S_ELECTION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]:  notice: State transition S_ELECTION
-> S_INTEGRATION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Input I_ELECTION_DC
received in state S_INTEGRATION from do_election_check
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be
fenced because the node is no longer part of the cluster
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 is
unclean
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action fence2_stop_0 on
ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action ClusterIP_stop_0
on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Scheduling Node
ArcosRhel1 for STONITH
Jul 17 13:33:30 ArcosRhel2 pengine[1083]:  notice: Move
 fence2#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]:  notice: Move
 ClusterIP#011(Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Calculated transition 0
(with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-20.bz2
Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Requesting fencing (reboot)
of node ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 crmd[1084]:  notice: Initiating start operation
fence2_start_0 locally on ArcosRhel2
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Client
crmd.1084.cd70178e wants to fence (reboot) 'ArcosRhel1' with device '(any)'
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Requesting peer
fencing (reboot) of ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]:  notice: Fence1 

Re: [ClusterLabs] FYI: regression using 2.0.0 / 1.1.19 Pacemaker Remote node with older cluster nodes

2018-07-17 Thread Ken Gaillot
Upon further investigation, there is no problem when resource agents
are called by the cluster, which thankfully makes this issue less
significant.

The problem occurs when "crm_node -n" is called on the command line or
by a script, on a Pacemaker Remote node running 1.1.19 or 2.0.0 or
later, with cluster nodes running 1.1.18 or earlier. Upgrading cluster
nodes before Pacemaker Remote nodes avoids the issue.

If you have any custom resource agents, a good practice is to make sure
that they do not call any unnecessary commands (including "crm_node -n"
or "ocf_local_nodename") for meta-data actions. This will not only be
more efficient, but also make command-line meta-data calls immune to
issues like this.

A complete solution would make every command-line "crm_node -n" call
take longer and have more chances to fail, so I'm inclined to leave
this as a known issue, and rely on the workarounds.
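As a sketch of that practice (the agent name and metadata below are illustrative, not any real agent): answer the meta-data action from static text, before any code path that could call `crm_node -n`:

```shell
#!/bin/sh
# Sketch of an OCF resource agent skeleton that keeps meta-data free of
# cluster calls such as "crm_node -n" (agent contents are illustrative).

meta_data() {
    # Static text only: a meta-data query never contacts the cluster,
    # so it cannot hang on a Pacemaker Remote node.
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="example-agent">
  <version>1.0</version>
  <actions>
    <action name="meta-data" timeout="5s"/>
    <action name="start" timeout="20s"/>
  </actions>
</resource-agent>
EOF
}

case "$1" in
    meta-data)
        meta_data
        ;;
    start|stop|monitor)
        # Look up the node name only in actions that genuinely need it,
        # never at the top of the script.
        node_name="$(crm_node -n 2>/dev/null || uname -n)"
        # ... real work would go here ...
        ;;
esac
```

With this shape, a command-line meta-data call returns immediately even when the node cannot reach any cluster node.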

On Mon, 2018-07-16 at 09:21 -0500, Ken Gaillot wrote:
> Hi all,
> 
> The just-released Pacemaker 2.0.0 and 1.1.19 releases have an issue
> when a Pacemaker Remote node is upgraded before the cluster nodes.
> 
> Pacemaker 2.0.0 contains a fix (also backported to 1.1.19) for the
> longstanding issue of "crm_node -n" getting the wrong name when run
> on
> the command line of a Pacemaker Remote node whose node name is
> different from its local hostname.
> 
> However, the fix can cause resource agents running on a Pacemaker
> Remote node to hang when used with a cluster node older than 2.0.0 /
> 1.1.19.
> 
> The only workaround is to upgrade all cluster nodes before upgrading
> any Pacemaker Remote nodes (which is the recommended practice
> anyway).
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antwort: Antw: corosync/dlm fencing?

2018-07-17 Thread Jan Pokorný

On 16/07/18 11:44 +0200, Philipp Achmüller wrote:
> Unfortunately it is not obvious to me - the "grep fence" output is attached
> in my original message.

Sifting your logs a bit:

> ---
> Node: siteb-2 (DC):
> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move 
> stonith-sbd#011(Started sitea-1 -> siteb-1)
> [...]
> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: Initiating 
> stop operation stonith-sbd_stop_0 on sitea-1
> [...]
> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: Initiating 
> start operation stonith-sbd_start_0 on siteb-1
> [...]
> 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]:   [TOTEM ] A new 
> membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]:   [TOTEM ] Failed 
> to receive the leave message. failed: 2
> 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]:   notice: Node 
> siteb-1 state is now lost
> [...]
> 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]:  warning: No reason to 
> expect node 2 to be down
> [...]
> 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]:   [QUORUM] 
> Members[3]: 1 3 4
> 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]:   [MAIN  ] 
> Completed service synchronization, ready to provide service.
> [...]
> 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: closing 
> connection to node 2
> [...]
> 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]:  warning: 
> Scheduling Node siteb-1 for STONITH

> ---
> Node sitea-1:
> 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]:   [TOTEM ] A new 
> membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]:   [TOTEM ] Failed to 
> receive the leave message. failed: 2
> 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]:   notice: Node siteb-1 
> state is now lost
> [...]
> 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]:   [QUORUM] 
> Members[3]: 1 3 4
> 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]:   [MAIN  ] Completed 
> service synchronization, ready to provide service.
> 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]:   notice: Node siteb-1 
> state is now lost
> 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence 
> request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith
> 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: closing 
> connection to node 2
> 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time: Found 
> 0 entries for 2/(null): 0 in progress, 0 completed
> 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]:   notice: Client 
> stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device '(any)'
> 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]:   notice: 
> Requesting peer fencing (reboot) of siteb-1
> 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]:   notice: 
> Operation reboot of siteb-1 by sitea-2 for 
> stonith-api.171087@sitea-1.9fe08723: OK
> 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]: 
> stonith_api_kick: Node 2/(null) kicked: reboot
> 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]:   notice: Peer lnx0361b 
> was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
> 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence 
> result 2 pid 171087 result 0 exit status
> 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence 
> status 2 receive 0 from 1 walltime 1530169410 local 59552

> ---
> Node sitea-2:
> 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]:   [TOTEM ] A 
> processor failed, forming new configuration.
> 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]:   [TOTEM ] A new 
> membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]:   [TOTEM ] Failed to 
> receive the leave message. failed: 2
> 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]:   notice: Node siteb-1 
> state is now lost
> [...]
> 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]:   [QUORUM] 
> Members[3]: 1 3 4
> 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]:   [MAIN  ] Completed 
> service synchronization, ready to provide service.
> [...]
> 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]:   notice: 
> stonith-sbd can fence (reboot) siteb-1: dynamic-list
> 2018-06-28T09:02:52.495902+02:00 sitea-2 stonith-ng[9656]:   notice: Delaying 
> reboot on stonith-sbd for 25358ms (timeout=300s)
> 2018-06-28T09:03:29.093957+02:00 sitea-2 stonith-ng[9656]:   notice: 
> Operation 'reboot' [231293] 

Re: [ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Ken Gaillot
On Tue, 2018-07-17 at 21:29 +0800, Confidential Company wrote:
> 
> > Hi,
> >
> > On my two-node active/passive setup, I configured fencing via
> > fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I
> expected
> > that both nodes will be stonithed simultaneously.
> >
> > On my test scenario, Node1 has ClusterIP resource. When I
> disconnect
> > service/corosync link physically, Node1 was fenced and Node2 keeps
> alive
> > given pcmk_delay=0 on both nodes.
> >
> > Can you explain the behavior above?
> >
> 
> #node1 could not connect to ESX because links were disconnected. As
> the
> #most obvious explanation.
> 
> #You have logs, you are the only one who can answer this question
> with
> #some certainty. Others can only guess.
> 
> 
> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface is used for the ESX link, so fencing
> can be executed even though the corosync link is disconnected. Looking
> forward to your response. Thanks

Having no fence delay means a death match (each node killing the other)
is possible, but it doesn't guarantee that it will happen. Some of the
time, one node will detect the outage and fence the other one before
the other one can react.

It's basically an Old West shoot-out -- they may reach for their guns
at the same time, but one may be quicker.

As Andrei suggested, the logs from both nodes could give you a timeline
of what happened when.
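A common way to rule the death match out entirely is to give one fence device a delay, so the two nodes never reach for their guns at the same instant. A sketch against the poster's device names (the 10s value is a judgment call, and `pcmk_delay_base` needs a newer Pacemaker than the 1.1.16 shown in the config):

```shell
# Delaying Fence1 (the device that fences ArcosRhel1) means ArcosRhel2
# waits before shooting, so ArcosRhel1 is favored to survive a split.
pcs stonith update Fence1 pcmk_delay_max=10s

# Pacemaker 1.1.17+ also offers a static per-device delay:
#   pcs stonith update Fence1 pcmk_delay_base=5s
```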

> > See my config below:
> >
> > [root@ArcosRhel2 cluster]# pcs config
> > Cluster Name: ARCOSCLUSTER
> > Corosync Nodes:
> >  ArcosRhel1 ArcosRhel2
> > Pacemaker Nodes:
> >  ArcosRhel1 ArcosRhel2
> >
> > Resources:
> >  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: cidr_netmask=32 ip=172.16.10.243
> >   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
> >               start interval=0s timeout=20s (ClusterIP-start-
> interval-0s)
> >               stop interval=0s timeout=20s (ClusterIP-stop-
> interval-0s)
> >
> > Stonith Devices:
> >  Resource: Fence1 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.151 login=admin
> passwd=123pass
> > pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s
> port=ArcosRhel1(Joniel)
> > ssl_insecure=1 pcmk_delay_max=0s
> >   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
> >  Resource: fence2 (class=stonith type=fence_vmware_soap)
> >   Attributes: action=off ipaddr=172.16.10.152 login=admin
> passwd=123pass
> > pcmk_delay_max=0s pcmk_host_list=ArcosRhel2
> pcmk_monitor_timeout=60s
> > port=ArcosRhel2(Ben) ssl_insecure=1
> >   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> > Fencing Levels:
> >
> > Location Constraints:
> >   Resource: Fence1
> >     Enabled on: ArcosRhel2 (score:INFINITY)
> > (id:location-Fence1-ArcosRhel2-INFINITY)
> >   Resource: fence2
> >     Enabled on: ArcosRhel1 (score:INFINITY)
> > (id:location-fence2-ArcosRhel1-INFINITY)
> > Ordering Constraints:
> > Colocation Constraints:
> > Ticket Constraints:
> >
> > Alerts:
> >  No alerts defined
> >
> > Resources Defaults:
> >  No defaults set
> > Operations Defaults:
> >  No defaults set
> >
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: ARCOSCLUSTER
> >  dc-version: 1.1.16-12.el7-94ff4df
> >  have-watchdog: false
> >  last-lrm-refresh: 1531810841
> >  stonith-enabled: true
> >
> > Quorum:
> >   Options:
> >
> >
> >
-- 
Ken Gaillot 


[ClusterLabs] pcs 0.10.0.alpha.1 available

2018-07-17 Thread Tomas Jelinek

I am happy to announce the first alpha of pcs-0.10: pcs-0.10.0.alpha.1.

Source code is available at:
https://github.com/ClusterLabs/pcs/archive/0.10.0.alpha.1.tar.gz
or
https://github.com/ClusterLabs/pcs/archive/0.10.0.alpha.1.zip

Pcs-0.10 is the new main pcs branch supporting Corosync 3.x and
Pacemaker 2.x clusters while dropping support for older Corosync and
Pacemaker versions. Pcs-0.9, being in maintenance mode, continues to
support Corosync 1.x/2.x and Pacemaker 1.x clusters.

Main changes in this alpha:
* Added support for Corosync 3.x and Kronosnet
* Node names are now fully supported
* Python 3.6+ and Ruby 2.2+ are now required


Complete change log for pcs-0.10.alpha against 0.9.163:
### Removed
- Pcs-0.10 removes support for CMAN, Corosync 1.x, Corosync 2.x and
  Pacemaker 1.x based clusters. For managing those clusters use
  pcs-0.9.x.
- Pcs-0.10 requires Python 3.6 and Ruby 2.2; support for older Python
  and Ruby versions has been removed.
- `pcs resource failcount reset` command has been removed as `pcs
  resource cleanup` does exactly the same job. ([rhbz#1427273])
- `pcs cluster node delete`, a deprecated alias to `pcs cluster node
  remove`, has been removed

### Added
- Validation for an inaccessible resource inside a bundle
  ([rhbz#1462248])
- Options to filter failures by an operation and its interval in `pcs
  resource cleanup` and `pcs resource failcount show` commands
  ([rhbz#1427273])

### Fixed
- `pcs cib-push diff-against=` does not consider an empty diff as an
  error ([ghpull#166])
- `pcs resource update` does not create an empty meta\_attributes
  element any more ([rhbz#1568353])
- `pcs resource debug-*` commands provide debug messages even with
  pacemaker-1.1.18 and newer ([rhbz#1574898])
- Improve `pcs quorum device add` usage and man page ([rhbz#1476862])
- Removing resources using web UI when the operation takes longer than
  expected ([rhbz#1579911])
- Removing a cluster node no longer leaves the node in the CIB and
  therefore cluster status even if the removal is run on the node which
  is being removed ([rhbz#1595829])

### Changed
- Authentication has been overhauled ([rhbz#1549535]):
  - The `pcs cluster auth` command only authenticates nodes in a local
cluster and does not accept a node list.
  - The new command for authentication is `pcs host auth`. It allows
specifying host names, addresses and pcsd ports.
  - Previously, running `pcs cluster auth A B C` caused A, B and C to be
all authenticated against each other. Now, `pcs host auth A B C`
makes the local host authenticated against A, B and C. This allows
better control of what is authenticated against what.
  - The `pcs pcsd clear-auth` command has been replaced by `pcs pcsd
deauth` and `pcs host deauth` commands. The new commands allow
deauthenticating a single host / token as well as all hosts / tokens.
  - These changes are not backward compatible. You should use the `pcs
host auth` command to re-authenticate your hosts.
- The `pcs cluster setup` command has been overhauled ([rhbz#1158816],
  [rhbz#1183103]):
  - It works with Corosync 3.x only and supports knet as well as
udp/udpu.
  - Node names are now supported.
  - The number of Corosync options configurable by the command has been
significantly increased.
  - The syntax of the command has been completely changed to accommodate
the changes and new features.
- The `pcs cluster node add` command has been overhauled
  ([rhbz#1158816], [rhbz#1183103])
  - It works with Corosync 3.x only and supports knet as well as
udp/udpu.
  - Node names are now supported.
  - The syntax of the command has been changed to accommodate new
features and to be consistent with other pcs commands.
- The `pcs cluster node remove` has been overhauled ([rhbz#1158816],
  [rhbz#1595829]):
  - It works with Corosync 3.x only and supports knet as well as
udp/udpu.
  - It is now possible to remove more than one node at once.
  - Removing a cluster node no longer leaves the node in the CIB and
therefore cluster status even if the removal is run on the node
which is being removed
- Node names are fully supported now and are no longer coupled with node
  addresses. It is possible to set up a cluster where Corosync
  communicates over different addresses than pcs/pcsd. ([rhbz#1158816],
  [rhbz#1183103])
- Commands related to resource failures have been overhauled to support
  changes in pacemaker. Failures are now tracked per resource operations
  on top of resources and nodes. ([rhbz#1427273], [rhbz#1588667])
- `--watchdog` and `--device` options of `pcs stonith sbd enable` and
  `pcs stonith sbd device setup` commands have been replaced with
  `watchdog` and `device` options respectively
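For readers upgrading, the authentication overhaul roughly maps as follows (host names and addresses are illustrative; check `pcs host auth --help` on your build for the exact syntax):

```shell
# pcs-0.9 style, now removed:
#   pcs cluster auth node1 node2

# pcs-0.10 style: the local host authenticates against the listed hosts,
# then the cluster is set up by node name.
pcs host auth node1 addr=192.0.2.11 node2 addr=192.0.2.12
pcs cluster setup newcluster node1 node2

# Tokens can now be dropped per host:
pcs host deauth node1
```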

### Security
- CVE-2018-1086: Debug parameter removal bypass, allowing information
  disclosure ([rhbz#1557366])
- CVE-2018-1079: Privilege escalation via authorized user malicious REST
  call ([rhbz#1550243])


Thanks / congratu

[ClusterLabs] Weird Fencing Behavior

2018-07-17 Thread Confidential Company
> Hi,
>
> On my two-node active/passive setup, I configured fencing via
> fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I expected
> that both nodes will be stonithed simultaneously.
>
> On my test scenario, Node1 has ClusterIP resource. When I disconnect
> service/corosync link physically, Node1 was fenced and Node2 keeps alive
> given pcmk_delay=0 on both nodes.
>
> Can you explain the behavior above?
>

#node1 could not connect to ESX because links were disconnected. As the
#most obvious explanation.

#You have logs, you are the only one who can answer this question with
#some certainty. Others can only guess.


Oops, my bad. I forgot to mention: I have two interfaces on each virtual
machine (node). The second interface is used for the ESX link, so fencing
can be executed even though the corosync link is disconnected. Looking
forward to your response. Thanks

>
>
> See my config below:
>
> [root@ArcosRhel2 cluster]# pcs config
> Cluster Name: ARCOSCLUSTER
> Corosync Nodes:
>  ArcosRhel1 ArcosRhel2
> Pacemaker Nodes:
>  ArcosRhel1 ArcosRhel2
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=32 ip=172.16.10.243
>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>   start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>   stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>
> Stonith Devices:
>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
> pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
> ssl_insecure=1 pcmk_delay_max=0s
>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
> pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
> port=ArcosRhel2(Ben) ssl_insecure=1
>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
>   Resource: Fence1
> Enabled on: ArcosRhel2 (score:INFINITY)
> (id:location-Fence1-ArcosRhel2-INFINITY)
>   Resource: fence2
> Enabled on: ArcosRhel1 (score:INFINITY)
> (id:location-fence2-ArcosRhel1-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: ARCOSCLUSTER
>  dc-version: 1.1.16-12.el7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1531810841
>  stonith-enabled: true
>
> Quorum:
>   Options:
>
>
>


Re: [ClusterLabs] Weird Fencing Behavior?

2018-07-17 Thread Andrei Borzenkov
On Tue, Jul 17, 2018 at 10:58 AM, Confidential Company
 wrote:
> Hi,
>
> On my two-node active/passive setup, I configured fencing via
> fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I expected
> that both nodes will be stonithed simultaneously.
>
> On my test scenario, Node1 has ClusterIP resource. When I disconnect
> service/corosync link physically, Node1 was fenced and Node2 keeps alive
> given pcmk_delay=0 on both nodes.
>
> Can you explain the behavior above?
>

The most obvious explanation: node1 could not connect to ESX because its
links were disconnected.

You have logs, you are the only one who can answer this question with
some certainty. Others can only guess.

>
>
> See my config below:
>
> [root@ArcosRhel2 cluster]# pcs config
> Cluster Name: ARCOSCLUSTER
> Corosync Nodes:
>  ArcosRhel1 ArcosRhel2
> Pacemaker Nodes:
>  ArcosRhel1 ArcosRhel2
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=32 ip=172.16.10.243
>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>   start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>   stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>
> Stonith Devices:
>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
> pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
> ssl_insecure=1 pcmk_delay_max=0s
>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
> pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
> port=ArcosRhel2(Ben) ssl_insecure=1
>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
>   Resource: Fence1
> Enabled on: ArcosRhel2 (score:INFINITY)
> (id:location-Fence1-ArcosRhel2-INFINITY)
>   Resource: fence2
> Enabled on: ArcosRhel1 (score:INFINITY)
> (id:location-fence2-ArcosRhel1-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: ARCOSCLUSTER
>  dc-version: 1.1.16-12.el7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1531810841
>  stonith-enabled: true
>
> Quorum:
>   Options:
>
>
>


[ClusterLabs] Weird Fencing Behavior?

2018-07-17 Thread Confidential Company
Hi,

On my two-node active/passive setup, I configured fencing via
fence_vmware_soap. I configured pcmk_delay=0 on both nodes so I expected
that both nodes would be stonithed simultaneously.

In my test scenario, Node1 has the ClusterIP resource. When I physically
disconnect the service/corosync link, Node1 is fenced and Node2 stays alive,
given pcmk_delay=0 on both nodes.

Can you explain the behavior above?



See my config below:

[root@ArcosRhel2 cluster]# pcs config
Cluster Name: ARCOSCLUSTER
Corosync Nodes:
 ArcosRhel1 ArcosRhel2
Pacemaker Nodes:
 ArcosRhel1 ArcosRhel2

Resources:
 Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=32 ip=172.16.10.243
  Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
  start interval=0s timeout=20s (ClusterIP-start-interval-0s)
  stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)

Stonith Devices:
 Resource: Fence1 (class=stonith type=fence_vmware_soap)
  Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
ssl_insecure=1 pcmk_delay_max=0s
  Operations: monitor interval=60s (Fence1-monitor-interval-60s)
 Resource: fence2 (class=stonith type=fence_vmware_soap)
  Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
port=ArcosRhel2(Ben) ssl_insecure=1
  Operations: monitor interval=60s (fence2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: Fence1
Enabled on: ArcosRhel2 (score:INFINITY)
(id:location-Fence1-ArcosRhel2-INFINITY)
  Resource: fence2
Enabled on: ArcosRhel1 (score:INFINITY)
(id:location-fence2-ArcosRhel1-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ARCOSCLUSTER
 dc-version: 1.1.16-12.el7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1531810841
 stonith-enabled: true

Quorum:
  Options: