Re: [ClusterLabs] Weird Fencing Behavior
18.07.2018 04:21, Confidential Company wrote:
>>> Hi,
>>>
>>> On my two-node active/passive setup, I configured fencing via
>>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
>>> that both nodes would be stonithed simultaneously.
>>>
>>> In my test scenario, Node1 has the ClusterIP resource. When I disconnected
>>> the service/corosync link physically, Node1 was fenced and Node2 stayed
>>> alive, given pcmk_delay=0 on both nodes.
>>>
>>> Can you explain the behavior above?
>>
>> # node1 could not connect to ESX because links were disconnected. That is
>> # the most obvious explanation.
>>
>> # You have logs; you are the only one who can answer this question with
>> # some certainty. Others can only guess.
>>
>> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
>> machine (node). The second interface was used for the ESX links, so fencing
>> can be executed even though the corosync links were disconnected. Looking
>> forward to your response. Thanks.
>
> # Having no fence delay means a death match (each node killing the other)
> # is possible, but it doesn't guarantee that it will happen. Some of the
> # time, one node will detect the outage and fence the other one before
> # the other one can react.
>
> # It's basically an Old West shoot-out -- they may reach for their guns
> # at the same time, but one may be quicker.
>
> # As Andrei suggested, the logs from both nodes could give you a timeline
> # of what happened when.
>
> Hi Andrei, kindly see the logs below. Based on the timestamps, Node1 should
> have fenced Node2 first, but in the actual test, Node1 was fenced/shut down
> by Node2.

Node1 tried to fence but failed. It could be connectivity, it could be
credentials.

> Is it possible to have a two-node active/passive setup in pacemaker/corosync
> where the node that gets disconnected (interface down) is the only one that
> gets fenced?

If you could determine which node was disconnected, you would not need any
fencing at all.
> Thanks, guys.
>
> *LOGS from Node2:*
>
> Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed, forming new configuration.
> ...
> Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be fenced because the node is no longer part of the cluster
> ...
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]: notice: Operation 'reboot' [2323] (call 2 from crmd.1084) for host 'ArcosRhel1' with device 'Fence1' returned: 0 (OK)
> Jul 17 13:33:50 ArcosRhel2 stonith-ng[1080]: notice: Operation reboot of ArcosRhel1 by ArcosRhel2 for crmd.1084@ArcosRhel2.0426e6e1: OK
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Stonith operation 2/12:0:0:f9418e1f-1f13-4033-9eaa-aec705f807ef: OK (0)
> Jul 17 13:33:50 ArcosRhel2 crmd[1084]: notice: Peer ArcosRhel1 was terminated (reboot) by ArcosRhel2 for ArcosRhel2: OK
> ...
>
> *LOGS from Node1:*
>
> Jul 17 13:33:26 ArcoSRhel1 corosync[1464]: [TOTEM ] A processor failed, forming new configuration
> Jul 17 13:33:28 ArcoSRhel1 pengine[1476]: warning: Node ArcosRhel2 will be fenced because the node is no longer part of the cluster
> ...
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: warning: Mapping action='off' to pcmk_reboot_action='off'
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Fence1 can not fence (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: fence2 can fence (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: Fence1 can not fence (reboot) ArcosRhel2: static-list
> Jul 17 13:33:28 ArcoSRhel1 stonith-ng[1473]: notice: fence2 can fence (reboot) ArcosRhel2: static-list
> Jul 17 13:33:46 ArcoSRhel1 fence_vmware_soap: Unable to connect/login to fencing device
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning: fence_vmware_soap[7157] stderr: [ Unable to connect/login to fencing device ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning: fence_vmware_soap[7157] stderr: [ ]
> Jul 17 13:33:46 ArcoSRhel1 stonith-ng[1473]: warning: fence_vmware_soap[7157] stderr: [ ]
>
>>> See my config below:
>>>
>>> [root@ArcosRhel2 cluster]# pcs config
>>> Cluster Name: ARCOSCLUSTER
>>> Corosync Nodes:
>>>  ArcosRhel1 ArcosRhel2
>>> Pacemaker Nodes:
>>>  ArcosRhel1 ArcosRhel2
>>>
>>> Resources:
>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>
>>> Stonith Devices:
>>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>>>    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
>>>    ssl_insecure=1 pcmk_delay_max=0s
>>>   Operations: monitor
Re: [ClusterLabs] Weird Fencing Behavior
>> Hi,
>>
>> On my two-node active/passive setup, I configured fencing via
>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
>> that both nodes would be stonithed simultaneously.
>>
>> In my test scenario, Node1 has the ClusterIP resource. When I disconnected
>> the service/corosync link physically, Node1 was fenced and Node2 stayed
>> alive, given pcmk_delay=0 on both nodes.
>>
>> Can you explain the behavior above?
>
> # node1 could not connect to ESX because links were disconnected. That is
> # the most obvious explanation.
>
> # You have logs; you are the only one who can answer this question with
> # some certainty. Others can only guess.

Oops, my bad. I forgot to mention: I have two interfaces on each virtual
machine (node). The second interface was used for the ESX links, so fencing
can be executed even though the corosync links were disconnected. Looking
forward to your response. Thanks.

# Having no fence delay means a death match (each node killing the other)
# is possible, but it doesn't guarantee that it will happen. Some of the
# time, one node will detect the outage and fence the other one before
# the other one can react.

# It's basically an Old West shoot-out -- they may reach for their guns
# at the same time, but one may be quicker.

# As Andrei suggested, the logs from both nodes could give you a timeline
# of what happened when.

Hi Andrei, kindly see the logs below. Based on the timestamps, Node1 should
have fenced Node2 first, but in the actual test, Node1 was fenced/shut down
by Node2.

Is it possible to have a two-node active/passive setup in pacemaker/corosync
where the node that gets disconnected (interface down) is the only one that
gets fenced?

Thanks, guys.

*LOGS from Node2:*

Jul 17 13:33:27 ArcosRhel2 corosync[1048]: [TOTEM ] A processor failed, forming new configuration.
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] A new membership (172.16.10.242:220) was formed. Members left: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [TOTEM ] Failed to receive the leave message. failed: 1
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [QUORUM] Members[1]: 2
Jul 17 13:33:28 ArcosRhel2 corosync[1048]: [MAIN  ] Completed service synchronization, ready to provide service.
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Removing all ArcosRhel1 attributes for peer loss
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Lost attribute writer ArcosRhel1
Jul 17 13:33:28 ArcosRhel2 attrd[1082]: notice: Purged 1 peers with id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 cib[1079]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 cib[1079]: notice: Purged 1 peers with id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Our DC node (ArcosRhel1) left the cluster
Jul 17 13:33:28 ArcosRhel2 pacemakerd[1074]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]: notice: Node ArcosRhel1 state is now lost
Jul 17 13:33:28 ArcosRhel2 stonith-ng[1080]: notice: Purged 1 peers with id=1 and/or uname=ArcosRhel1 from the membership cache
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: notice: State transition S_NOT_DC -> S_ELECTION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: notice: State transition S_ELECTION -> S_INTEGRATION
Jul 17 13:33:28 ArcosRhel2 crmd[1084]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 will be fenced because the node is no longer part of the cluster
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Node ArcosRhel1 is unclean
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action fence2_stop_0 on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Action ClusterIP_stop_0 on ArcosRhel1 is unrunnable (offline)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Scheduling Node ArcosRhel1 for STONITH
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: notice: Move fence2 (Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: notice: Move ClusterIP (Started ArcosRhel1 -> ArcosRhel2)
Jul 17 13:33:30 ArcosRhel2 pengine[1083]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-20.bz2
Jul 17 13:33:30 ArcosRhel2 crmd[1084]: notice: Requesting fencing (reboot) of node ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 crmd[1084]: notice: Initiating start operation fence2_start_0 locally on ArcosRhel2
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Client crmd.1084.cd70178e wants to fence (reboot) 'ArcosRhel1' with device '(any)'
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Requesting peer fencing (reboot) of ArcosRhel1
Jul 17 13:33:30 ArcosRhel2 stonith-ng[1080]: notice: Fence1
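When reconstructing a fencing race like this, syslog from each node can be cross-checked against the fencer's own record of fencing actions. A sketch (run on a surviving node; node names are the ones from this thread):

```
# Show fencing actions the cluster knows about, per target node.
stonith_admin --history ArcosRhel1
stonith_admin --history ArcosRhel2

# Or ask for every known target at once.
stonith_admin --history '*' --verbose
```

This only queries state from a running cluster, so it complements rather than replaces the syslog timeline above.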
Re: [ClusterLabs] FYI: regression using 2.0.0 / 1.1.19 Pacemaker Remote node with older cluster nodes
Upon further investigation, there is no problem when resource agents are
called by the cluster, which thankfully makes this issue less significant.

The problem occurs when "crm_node -n" is called on the command line or by a
script, on a Pacemaker Remote node running 1.1.19 or 2.0.0 or later, with
cluster nodes running 1.1.18 or earlier. Upgrading cluster nodes before
Pacemaker Remote nodes avoids the issue.

If you have any custom resource agents, a good practice is to make sure that
they do not call any unnecessary commands (including "crm_node -n" or
"ocf_local_nodename") for meta-data actions. This will not only be more
efficient, but will also make command-line meta-data calls immune to issues
like this.

A complete solution would make every command-line "crm_node -n" call take
longer and have more chances to fail, so I'm inclined to leave this as a
known issue and rely on the workarounds.

On Mon, 2018-07-16 at 09:21 -0500, Ken Gaillot wrote:
> Hi all,
>
> The just-released Pacemaker 2.0.0 and 1.1.19 releases have an issue
> when a Pacemaker Remote node is upgraded before the cluster nodes.
>
> Pacemaker 2.0.0 contains a fix (also backported to 1.1.19) for the
> longstanding issue of "crm_node -n" getting the wrong name when run on
> the command line of a Pacemaker Remote node whose node name is
> different from its local hostname.
>
> However, the fix can cause resource agents running on a Pacemaker
> Remote node to hang when used with a cluster node older than 2.0.0 /
> 1.1.19.
>
> The only workaround is to upgrade all cluster nodes before upgrading
> any Pacemaker Remote nodes (which is the recommended practice anyway).
--
Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
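The meta-data advice can be sketched as a minimal, hypothetical OCF-style agent skeleton (not any particular shipped agent): the meta-data action returns before anything that could contact the cluster, so "crm_node -n" is never run for meta-data calls.

```shell
#!/bin/sh
# Hypothetical agent skeleton: answer meta-data without touching the cluster.

meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="example" version="1.0">
  <longdesc lang="en">Example agent skeleton</longdesc>
  <shortdesc lang="en">Example</shortdesc>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="30s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

case "${1:-}" in
meta-data)
    # Return immediately; do NOT call crm_node -n (or similar) here.
    meta_data
    exit 0
    ;;
start|stop|monitor)
    # Only actions that actually need the node name should look it up.
    NODENAME=$(crm_node -n 2>/dev/null || uname -n)
    # ... real agent work would go here ...
    exit 0
    ;;
esac
```

With this shape, `my-agent meta-data` works on any host, cluster or not, regardless of crm_node behavior.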
Re: [ClusterLabs] Antwort: Antw: corosync/dlm fencing?
On 16/07/18 11:44 +0200, Philipp Achmüller wrote:
> Unfortunately it is not obvious for me - the "grep fence" is attached
> in my original message.

Sifting your logs a bit:

> ---
> Node: siteb-2 (DC):
> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]: notice: Move stonith-sbd (Started sitea-1 -> siteb-1)
> [...]
> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]: notice: Initiating stop operation stonith-sbd_stop_0 on sitea-1
> [...]
> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]: notice: Initiating start operation stonith-sbd_start_0 on siteb-1
> [...]
> 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]: [TOTEM ] A processor failed, forming new configuration.
> 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]: [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]: [TOTEM ] Failed to receive the leave message. failed: 2
> 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]: notice: Node siteb-1 state is now lost
> [...]
> 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]: warning: No reason to expect node 2 to be down
> [...]
> 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]: [QUORUM] Members[3]: 1 3 4
> 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]: [MAIN  ] Completed service synchronization, ready to provide service.
> [...]
> 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: closing connection to node 2
> [...]
> 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]: warning: Scheduling Node siteb-1 for STONITH
> ---
> Node sitea-1:
> 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]: [TOTEM ] A processor failed, forming new configuration.
> 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]: [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]: [TOTEM ] Failed to receive the leave message. failed: 2
> 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]: notice: Node siteb-1 state is now lost
> [...]
> 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]: [QUORUM] Members[3]: 1 3 4
> 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]: [MAIN  ] Completed service synchronization, ready to provide service.
> 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]: notice: Node siteb-1 state is now lost
> 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith
> 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: closing connection to node 2
> 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time: Found 0 entries for 2/(null): 0 in progress, 0 completed
> 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]: notice: Client stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device '(any)'
> 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]: notice: Requesting peer fencing (reboot) of siteb-1
> 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]: notice: Operation reboot of siteb-1 by sitea-2 for stonith-api.171087@sitea-1.9fe08723: OK
> 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]: stonith_api_kick: Node 2/(null) kicked: reboot
> 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]: notice: Peer lnx0361b was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
> 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence result 2 pid 171087 result 0 exit status
> 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence status 2 receive 0 from 1 walltime 1530169410 local 59552
> ---
> Node sitea-2:
> 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]: [TOTEM ] A processor failed, forming new configuration.
> 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]: [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
> 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]: [TOTEM ] Failed to receive the leave message. failed: 2
> 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]: notice: Node siteb-1 state is now lost
> [...]
> 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]: [QUORUM] Members[3]: 1 3 4
> 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]: [MAIN  ] Completed service synchronization, ready to provide service.
> [...]
> 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]: notice: stonith-sbd can fence (reboot) siteb-1: dynamic-list
> 2018-06-28T09:02:52.495902+02:00 sitea-2 stonith-ng[9656]: notice: Delaying reboot on stonith-sbd for 25358ms (timeout=300s)
> 2018-06-28T09:03:29.093957+02:00 sitea-2 stonith-ng[9656]: notice: Operation 'reboot' [231293]
Re: [ClusterLabs] Weird Fencing Behavior
On Tue, 2018-07-17 at 21:29 +0800, Confidential Company wrote:
>> Hi,
>>
>> On my two-node active/passive setup, I configured fencing via
>> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
>> that both nodes would be stonithed simultaneously.
>>
>> In my test scenario, Node1 has the ClusterIP resource. When I disconnected
>> the service/corosync link physically, Node1 was fenced and Node2 stayed
>> alive, given pcmk_delay=0 on both nodes.
>>
>> Can you explain the behavior above?
>
> # node1 could not connect to ESX because links were disconnected. That is
> # the most obvious explanation.
>
> # You have logs; you are the only one who can answer this question with
> # some certainty. Others can only guess.
>
> Oops, my bad. I forgot to mention: I have two interfaces on each virtual
> machine (node). The second interface was used for the ESX links, so fencing
> can be executed even though the corosync links were disconnected. Looking
> forward to your response. Thanks.

Having no fence delay means a death match (each node killing the other) is
possible, but it doesn't guarantee that it will happen. Some of the time,
one node will detect the outage and fence the other one before the other
one can react.

It's basically an Old West shoot-out -- they may reach for their guns at
the same time, but one may be quicker.

As Andrei suggested, the logs from both nodes could give you a timeline of
what happened when.
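One common way out of that shoot-out is to give exactly one fence device a delay, so the two nodes never fire at the same instant. A sketch, using the device names from the poster's config (the 10s value is illustrative):

```
# Delay the device that targets ArcosRhel1; on a symmetric split,
# ArcosRhel1 then gets a head start fencing ArcosRhel2 and survives.
pcs stonith update Fence1 pcmk_delay_max=10s
pcs stonith update fence2 pcmk_delay_max=0s
```

pcmk_delay_max adds a random delay up to the given value; newer Pacemaker (1.1.17+) also offers pcmk_delay_base for a fixed per-device delay, which makes the survivor deterministic.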
>> See my config below:
>>
>> [root@ArcosRhel2 cluster]# pcs config
>> Cluster Name: ARCOSCLUSTER
>> Corosync Nodes:
>>  ArcosRhel1 ArcosRhel2
>> Pacemaker Nodes:
>>  ArcosRhel1 ArcosRhel2
>>
>> Resources:
>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>   Attributes: cidr_netmask=32 ip=172.16.10.243
>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>
>> Stonith Devices:
>>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>>    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
>>    ssl_insecure=1 pcmk_delay_max=0s
>>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>>    pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>>    port=ArcosRhel2(Ben) ssl_insecure=1
>>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
>> Fencing Levels:
>>
>> Location Constraints:
>>  Resource: Fence1
>>   Enabled on: ArcosRhel2 (score:INFINITY)
>>    (id:location-Fence1-ArcosRhel2-INFINITY)
>>  Resource: fence2
>>   Enabled on: ArcosRhel1 (score:INFINITY)
>>    (id:location-fence2-ArcosRhel1-INFINITY)
>> Ordering Constraints:
>> Colocation Constraints:
>> Ticket Constraints:
>>
>> Alerts:
>>  No alerts defined
>>
>> Resources Defaults:
>>  No defaults set
>> Operations Defaults:
>>  No defaults set
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  cluster-name: ARCOSCLUSTER
>>  dc-version: 1.1.16-12.el7-94ff4df
>>  have-watchdog: false
>>  last-lrm-refresh: 1531810841
>>  stonith-enabled: true
>>
>> Quorum:
>>  Options:

--
Ken Gaillot
[ClusterLabs] pcs 0.10.0.alpha.1 available
I am happy to announce the first alpha of pcs-0.10: pcs-0.10.0.alpha.1.

Source code is available at:
https://github.com/ClusterLabs/pcs/archive/0.10.0.alpha.1.tar.gz
or
https://github.com/ClusterLabs/pcs/archive/0.10.0.alpha.1.zip

Pcs-0.10 is the new main pcs branch, supporting Corosync 3.x and Pacemaker
2.x clusters while dropping support for older Corosync and Pacemaker
versions. Pcs-0.9, being in maintenance mode, continues to support Corosync
1.x/2.x and Pacemaker 1.x clusters.

Main changes in this alpha:
* Added support for Corosync 3.x and Kronosnet
* Node names are now fully supported
* Python 3.6+ and Ruby 2.2+ are now required

Complete change log for pcs-0.10.0.alpha.1 against pcs-0.9.163:

### Removed
- Pcs-0.10 removes support for CMAN, Corosync 1.x, Corosync 2.x and
  Pacemaker 1.x based clusters. For managing those clusters use pcs-0.9.x.
- Pcs-0.10 requires Python 3.6 and Ruby 2.2; support for older Python and
  Ruby versions has been removed.
- The `pcs resource failcount reset` command has been removed, as `pcs
  resource cleanup` does exactly the same job. ([rhbz#1427273])
- `pcs cluster node delete`, a deprecated alias of `pcs cluster node
  remove`, has been removed.

### Added
- Validation for an inaccessible resource inside a bundle ([rhbz#1462248])
- Options to filter failures by an operation and its interval in the `pcs
  resource cleanup` and `pcs resource failcount show` commands
  ([rhbz#1427273])

### Fixed
- `pcs cib-push diff-against=` does not consider an empty diff as an error
  ([ghpull#166])
- `pcs resource update` no longer creates an empty meta_attributes element
  ([rhbz#1568353])
- `pcs resource debug-*` commands provide debug messages even with
  pacemaker-1.1.18 and newer ([rhbz#1574898])
- Improved `pcs quorum device add` usage and man page ([rhbz#1476862])
- Removing resources using the web UI when the operation takes longer than
  expected ([rhbz#1579911])
- Removing a cluster node no longer leaves the node in the CIB (and
  therefore in cluster status), even if the removal is run on the node
  which is being removed ([rhbz#1595829])

### Changed
- Authentication has been overhauled ([rhbz#1549535]):
  - The `pcs cluster auth` command only authenticates nodes in a local
    cluster and does not accept a node list.
  - The new command for authentication is `pcs host auth`. It allows
    specifying host names, addresses and pcsd ports.
  - Previously, running `pcs cluster auth A B C` caused A, B and C to all
    be authenticated against each other. Now, `pcs host auth A B C` makes
    the local host authenticated against A, B and C. This allows better
    control of what is authenticated against what.
  - The `pcs pcsd clear-auth` command has been replaced by the `pcs pcsd
    deauth` and `pcs host deauth` commands. The new commands allow
    deauthenticating a single host / token as well as all hosts / tokens.
  - These changes are not backward compatible. You should use the `pcs
    host auth` command to re-authenticate your hosts.
- The `pcs cluster setup` command has been overhauled ([rhbz#1158816],
  [rhbz#1183103]):
  - It works with Corosync 3.x only and supports knet as well as udp/udpu.
  - Node names are now supported.
  - The number of Corosync options configurable by the command has been
    significantly increased.
  - The syntax of the command has been completely changed to accommodate
    the changes and new features.
- The `pcs cluster node add` command has been overhauled ([rhbz#1158816],
  [rhbz#1183103]):
  - It works with Corosync 3.x only and supports knet as well as udp/udpu.
  - Node names are now supported.
  - The syntax of the command has been changed to accommodate new features
    and to be consistent with other pcs commands.
- The `pcs cluster node remove` command has been overhauled
  ([rhbz#1158816], [rhbz#1595829]):
  - It works with Corosync 3.x only and supports knet as well as udp/udpu.
  - It is now possible to remove more than one node at once.
  - Removing a cluster node no longer leaves the node in the CIB (and
    therefore in cluster status), even if the removal is run on the node
    which is being removed.
- Node names are now fully supported and are no longer coupled with node
  addresses. It is possible to set up a cluster where Corosync
  communicates over different addresses than pcs/pcsd. ([rhbz#1158816],
  [rhbz#1183103])
- Commands related to resource failures have been overhauled to support
  changes in Pacemaker. Failures are now tracked per resource operation,
  on top of resources and nodes. ([rhbz#1427273], [rhbz#1588667])
- The `--watchdog` and `--device` options of the `pcs stonith sbd enable`
  and `pcs stonith sbd device setup` commands have been replaced with
  `watchdog` and `device` options respectively.

### Security
- CVE-2018-1086: Debug parameter removal bypass, allowing information
  disclosure ([rhbz#1557366])
- CVE-2018-1079: Privilege escalation via authorized user malicious REST
  call ([rhbz#1550243])

Thanks / congratu
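The authentication overhaul can be illustrated with a before/after sketch (host names, addresses and the pcsd port are hypothetical):

```
# pcs-0.9.x: all listed nodes were authenticated against each other.
pcs cluster auth node1 node2 node3

# pcs-0.10: the local host is authenticated against the listed hosts only;
# an address (and optionally a pcsd port) may be given per host.
pcs host auth node1 addr=192.0.2.11 node2 addr=192.0.2.12:2224

# Deauthentication (replaces `pcs pcsd clear-auth`):
pcs host deauth node1
pcs pcsd deauth
```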
[ClusterLabs] Weird Fencing Behavior
> Hi,
>
> On my two-node active/passive setup, I configured fencing via
> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
> that both nodes would be stonithed simultaneously.
>
> In my test scenario, Node1 has the ClusterIP resource. When I disconnected
> the service/corosync link physically, Node1 was fenced and Node2 stayed
> alive, given pcmk_delay=0 on both nodes.
>
> Can you explain the behavior above?

# node1 could not connect to ESX because links were disconnected. That is
# the most obvious explanation.

# You have logs; you are the only one who can answer this question with
# some certainty. Others can only guess.

Oops, my bad. I forgot to mention: I have two interfaces on each virtual
machine (node). The second interface was used for the ESX links, so fencing
can be executed even though the corosync links were disconnected. Looking
forward to your response. Thanks.

> See my config below:
>
> [root@ArcosRhel2 cluster]# pcs config
> Cluster Name: ARCOSCLUSTER
> Corosync Nodes:
>  ArcosRhel1 ArcosRhel2
> Pacemaker Nodes:
>  ArcosRhel1 ArcosRhel2
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=32 ip=172.16.10.243
>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>
> Stonith Devices:
>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
>    ssl_insecure=1 pcmk_delay_max=0s
>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>    pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>    port=ArcosRhel2(Ben) ssl_insecure=1
>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
>  Resource: Fence1
>   Enabled on: ArcosRhel2 (score:INFINITY)
>    (id:location-Fence1-ArcosRhel2-INFINITY)
>  Resource: fence2
>   Enabled on: ArcosRhel1 (score:INFINITY)
>    (id:location-fence2-ArcosRhel1-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: ARCOSCLUSTER
>  dc-version: 1.1.16-12.el7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1531810841
>  stonith-enabled: true
>
> Quorum:
>  Options:
Re: [ClusterLabs] Weird Fencing Behavior?
On Tue, Jul 17, 2018 at 10:58 AM, Confidential Company wrote:
> Hi,
>
> On my two-node active/passive setup, I configured fencing via
> fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
> that both nodes would be stonithed simultaneously.
>
> In my test scenario, Node1 has the ClusterIP resource. When I disconnected
> the service/corosync link physically, Node1 was fenced and Node2 stayed
> alive, given pcmk_delay=0 on both nodes.
>
> Can you explain the behavior above?

node1 could not connect to ESX because links were disconnected. That is the
most obvious explanation.

You have logs; you are the only one who can answer this question with some
certainty. Others can only guess.

> See my config below:
>
> [root@ArcosRhel2 cluster]# pcs config
> Cluster Name: ARCOSCLUSTER
> Corosync Nodes:
>  ArcosRhel1 ArcosRhel2
> Pacemaker Nodes:
>  ArcosRhel1 ArcosRhel2
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=32 ip=172.16.10.243
>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>
> Stonith Devices:
>  Resource: Fence1 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
>    pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
>    ssl_insecure=1 pcmk_delay_max=0s
>   Operations: monitor interval=60s (Fence1-monitor-interval-60s)
>  Resource: fence2 (class=stonith type=fence_vmware_soap)
>   Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
>    pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
>    port=ArcosRhel2(Ben) ssl_insecure=1
>   Operations: monitor interval=60s (fence2-monitor-interval-60s)
> Fencing Levels:
>
> Location Constraints:
>  Resource: Fence1
>   Enabled on: ArcosRhel2 (score:INFINITY)
>    (id:location-Fence1-ArcosRhel2-INFINITY)
>  Resource: fence2
>   Enabled on: ArcosRhel1 (score:INFINITY)
>    (id:location-fence2-ArcosRhel1-INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: ARCOSCLUSTER
>  dc-version: 1.1.16-12.el7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1531810841
>  stonith-enabled: true
>
> Quorum:
>  Options:
[ClusterLabs] Weird Fencing Behavior?
Hi,

On my two-node active/passive setup, I configured fencing via
fence_vmware_soap. I configured pcmk_delay=0 on both nodes, so I expected
that both nodes would be stonithed simultaneously.

In my test scenario, Node1 has the ClusterIP resource. When I disconnected
the service/corosync link physically, Node1 was fenced and Node2 stayed
alive, given pcmk_delay=0 on both nodes.

Can you explain the behavior above?

See my config below:

[root@ArcosRhel2 cluster]# pcs config
Cluster Name: ARCOSCLUSTER
Corosync Nodes:
 ArcosRhel1 ArcosRhel2
Pacemaker Nodes:
 ArcosRhel1 ArcosRhel2

Resources:
 Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=32 ip=172.16.10.243
  Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
              start interval=0s timeout=20s (ClusterIP-start-interval-0s)
              stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)

Stonith Devices:
 Resource: Fence1 (class=stonith type=fence_vmware_soap)
  Attributes: action=off ipaddr=172.16.10.151 login=admin passwd=123pass
   pcmk_host_list=ArcosRhel1 pcmk_monitor_timeout=60s port=ArcosRhel1(Joniel)
   ssl_insecure=1 pcmk_delay_max=0s
  Operations: monitor interval=60s (Fence1-monitor-interval-60s)
 Resource: fence2 (class=stonith type=fence_vmware_soap)
  Attributes: action=off ipaddr=172.16.10.152 login=admin passwd=123pass
   pcmk_delay_max=0s pcmk_host_list=ArcosRhel2 pcmk_monitor_timeout=60s
   port=ArcosRhel2(Ben) ssl_insecure=1
  Operations: monitor interval=60s (fence2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
 Resource: Fence1
  Enabled on: ArcosRhel2 (score:INFINITY)
   (id:location-Fence1-ArcosRhel2-INFINITY)
 Resource: fence2
  Enabled on: ArcosRhel1 (score:INFINITY)
   (id:location-fence2-ArcosRhel1-INFINITY)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ARCOSCLUSTER
 dc-version: 1.1.16-12.el7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1531810841
 stonith-enabled: true

Quorum:
 Options: