Re: [ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-21 Thread Andrei Borzenkov
21.05.2019 0:46, Ken Gaillot пишет: >> >>> From what's described here, the op-restart-digest is changing every >>> time, which means something is going wrong in the hash comparison >>> (since the definition is not really changing). >>> >>> The log that stands out to me is: >>> >>> trace May 18

Re: [ClusterLabs] Antw: Re: Q: ocf:pacemaker:NodeUtilization monitor

2019-05-29 Thread Andrei Borzenkov
29.05.2019 11:12, Ulrich Windl пишет: Jan Pokorný schrieb am 28.05.2019 um 16:31 in > Nachricht > <20190528143145.ga29...@redhat.com>: >> On 27/05/19 08:28 +0200, Ulrich Windl wrote: >>> I configured ocf:pacemaker:NodeUtilization more or less for fun, and I >> realized that the cluster rrep

Re: [ClusterLabs] Antw: Re: Antw: Re: Q: ocf:pacemaker:NodeUtilization monitor

2019-06-02 Thread Andrei Borzenkov
03.06.2019 9:09, Ulrich Windl пишет: > 118 if [ -x $xentool ]; then > 119 $xentool info | awk >>> '/total_memory/{printf("%d\n",$3);exit(0)}' > 120 else > 121 ocf_log warn "Can only set hv_memory for Xen hypervisor" > 122 echo "0" S

Re: [ClusterLabs] EXTERNAL: Re: Pacemaker not reacting as I would expect when two resources fail at the same time

2019-06-08 Thread Andrei Borzenkov
08.06.2019 5:12, Harvey Shepherd пишет: > Thank you for your advice Ken. Sorry for the delayed reply - I was trying out > a few things and trying to capture extra info. The changes that you suggested > make sense, and I have incorporated them into my config. However, the > original issue remains

Re: [ClusterLabs] Strange monitor return code log for LSB resource

2019-06-25 Thread Andrei Borzenkov
25.06.2019 16:53, Harvey Shepherd пишет: > Hi All, > > > I have a 2 node cluster running under Pacemaker 2.0.2, with around 20 > resources configured, the majority of which are LSB resources, but there are > also a few OCF ones. One of the LSB resources is controlled via an init > script calle

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-27 Thread Andrei Borzenkov
On Fri, Jun 28, 2019 at 7:24 AM Harvey Shepherd wrote: > > Hi All, > > > I'm running Pacemaker 2.0.2 on a two node cluster. It runs one master/slave > resource (I'll refer to it as the king resource) and about 20 other resources > which are a mixture of: > > > - resources that only run on the ki

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-28 Thread Andrei Borzenkov
29.06.2019 6:01, Harvey Shepherd пишет: > > As you can see, it eventually gives up in the transition attempt and starts a > new one. Eventually the failed king resource master has had time to come back > online and it then just promotes it again and forgets about trying to > failover. I'm not s

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-28 Thread Andrei Borzenkov
29.06.2019 8:05, Harvey Shepherd пишет: > There is an ordering constraint - everything must be started after the king > resource. But even if this constraint didn't exist I don't see that it should > logically make any difference due to all the non-clone resources being > colocated with the mast

Re: [ClusterLabs] Problems with master/slave failovers

2019-06-29 Thread Andrei Borzenkov
28.06.2019 9:45, Andrei Borzenkov пишет: > On Fri, Jun 28, 2019 at 7:24 AM Harvey Shepherd > wrote: >> >> Hi All, >> >> >> I'm running Pacemaker 2.0.2 on a two node cluster. It runs one master/slave >> resource (I'll refer to it as the king re

Re: [ClusterLabs] Problems with master/slave failovers

2019-07-01 Thread Andrei Borzenkov
02.07.2019 2:30, Harvey Shepherd пишет: >> The "transition summary" is just a resource-by-resource list, not the >> order things will be done. The "executing cluster transition" section >> is the order things are being done. > > Thanks Ken. I think that's where the problem is originating. If you l

Re: [ClusterLabs] Problems with master/slave failovers

2019-07-03 Thread Andrei Borzenkov
On Wed, Jul 3, 2019 at 12:59 AM Ken Gaillot wrote: > > On Mon, 2019-07-01 at 23:30 +, Harvey Shepherd wrote: > > > The "transition summary" is just a resource-by-resource list, not > > > the > > > order things will be done. The "executing cluster transition" > > > section > > > is the order th

Re: [ClusterLabs] "node is unclean" leads to gratuitous reboot

2019-07-09 Thread Andrei Borzenkov
On Tue, Jul 9, 2019 at 3:54 PM Michael Powell < michael.pow...@harmonicinc.com> wrote: > I have a two-node cluster with a problem. If I start Corosync/Pacemaker > on one node, and then delay startup on the 2nd node (which is otherwise > up and running), the 2nd node will be rebooted very soon aft

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-09 Thread Andrei Borzenkov
09.07.2019 13:08, Danka Ivanović пишет: > Hi I didn't manage to start master with postgres, even if I increased start > timeout. I checked executable paths and start options. > When cluster is running with manually started master and slave started over > pacemaker, everything works ok. Today we had

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote: > > > P.S. crm_resource is called by resource agent (pgsqlms). And it shows > > result of original resource probing which makes it confusing. At least > > it explains where these logs entries come from. > > Not sure tu understand

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote: > > > > Jul 09 09:16:32 [2679] postgres1 lrmd:debug: > > > child_kill_helper: Kill pid 12735's group Jul 09 09:16:34 [2679] > > > postgres1 lrmd: warning: child_timeout_callback: > > > PGSQL_monitor_15000 proces

Re: [ClusterLabs] [EXTERNAL] Re: "node is unclean" leads to gratuitous reboot

2019-07-11 Thread Andrei Borzenkov
On Thu, Jul 11, 2019 at 12:58 PM Lars Ellenberg wrote: > > On Wed, Jul 10, 2019 at 06:15:56PM +, Michael Powell wrote: > > Thanks to you and Andrei for your responses. In our particular > > situation, we want to be able to operate with either node in > > stand-alone mode, or with both nodes p

Re: [ClusterLabs] Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Andrei Borzenkov
On Tue, Jul 16, 2019 at 9:48 AM Nishant Nakate wrote: > > > On Tue, Jul 16, 2019 at 11:33 AM Ulrich Windl > wrote: >> >> >>> Nishant Nakate schrieb am 16.07.2019 um 05:37 >> >>> in >> Nachricht >> : >> > Hi All, >> > >> > I am new to this community and HA tools. Need some guidance on my curren

Re: [ClusterLabs] Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Andrei Borzenkov
On Tue, Jul 16, 2019 at 11:01 AM Nishant Nakate wrote: > >> > >> > I will give you a quick overview of the system. There would be 3 nodes >> > configured in a cluster. One would act as a leader and others as >> > followers. Our system would be actively running on all the three nodes and >> > se

Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Andrei Borzenkov
On Thu, Jul 25, 2019 at 3:20 AM Ondrej wrote: > > Is there any plan on getting this also into 1.1 branch? > If yes, then I would be for just introducing the configuration option in > 1.1.x with default to 'stop'. > +1 for back porting it from someone who just recently hit this (puzzling) behavior

[ClusterLabs] Reusing resource set in multiple constraints

2019-07-27 Thread Andrei Borzenkov
Is it possible to have single definition of resource set that is later references in order and location constraints? All syntax in documentation or crmsh presumes inline set definition in location or order statement. In this particular case there will be set of filesystems that need to be colocate
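
For context, the inline-set syntax that the documentation and crmsh expect looks roughly like this (a sketch only; the resource names fs1/fs2/fs3 and grp-app are placeholders, not taken from the thread):

    order o-fs-before-app Mandatory: ( fs1 fs2 fs3 ) grp-app
    colocation c-fs-with-app inf: ( fs1 fs2 fs3 ) grp-app

Whether such a set can instead be defined once and referenced from several constraints is exactly the question raised here.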

Re: [ClusterLabs] Reusing resource set in multiple constraints

2019-07-28 Thread Andrei Borzenkov
27.07.2019 11:04, Andrei Borzenkov пишет: > Is it possible to have single definition of resource set that is later > references in order and location constraints? All syntax in > documentation or crmsh presumes inline set definition in location or > order statement. > > In th

[ClusterLabs] corosync.service (and sbd.service) are not stopped on pacemaker shutdown when corosync-qdevice is used

2019-07-28 Thread Andrei Borzenkov
corosync.service sets StopWhenUnneeded=yes which normally stops it when pacemaker is shut down. Unfortunately, corosync-qdevice.service declares Requires=corosync.service and corosync-qdevice.service itself is *not* stopped when pacemaker.service is stopped. Which means corosync.service remains "nee
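
As an illustration only (my assumption, not necessarily the conclusion reached in the thread), one systemd-level workaround would be a drop-in that propagates pacemaker's stop to corosync-qdevice, so that corosync becomes "unneeded" again:

    # hypothetical drop-in: /etc/systemd/system/corosync-qdevice.service.d/stop-with-pacemaker.conf
    [Unit]
    PartOf=pacemaker.service

After adding the drop-in, a systemctl daemon-reload is needed for it to take effect.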

[ClusterLabs] Node reset on shutdown by SBD watchdog with corosync-qdevice

2019-07-28 Thread Andrei Borzenkov
In two node cluster + qnetd I consistently see the node that is being shut down last being reset during shutdown. I.e. - shutdown the first node - OK - shutdown the second node - reset As far as I understand what happens is - during shutdown pacemaker.service is stopped first. In above configura

Re: [ClusterLabs] corosync.service (and sbd.service) are not stopped on pacemaker shutdown when corosync-qdevice is used

2019-07-29 Thread Andrei Borzenkov
On Mon, Jul 29, 2019 at 9:52 AM Jan Friesse wrote: > > Andrei > > Andrei Borzenkov napsal(a): > > corosync.service sets StopWhenUnneeded=yes which normally stops it when > > This was the case only for very limited time (v 3.0.1) and it's removed > now (v 3.0.2) beca

Re: [ClusterLabs] Reusing resource set in multiple constraints

2019-08-03 Thread Andrei Borzenkov
29.07.2019 22:07, Ken Gaillot пишет: > On Sat, 2019-07-27 at 11:04 +0300, Andrei Borzenkov wrote: >> Is it possible to have single definition of resource set that is >> later >> references in order and location constraints? All syntax in >> documentation or crmsh presum

[ClusterLabs] How to clean up failed fencing action?

2019-08-03 Thread Andrei Borzenkov
I'm using sbd watchdog and stonith-watchdog-timeout without explicit stonith agents (shared nothing cluster). How can I clean up failed fencing action? Current DC: ha1 (version 2.0.1+20190408.1b68da8e8-1.3-2.0.1+20190408.1b68da8e8) - partition with quorum Last updated: Sat Aug 3 19:10:12 2019 Las

Re: [ClusterLabs] Query on HA

2019-08-05 Thread Andrei Borzenkov
There is no one-size-fits-all answer. You should enable and configure stonith in pacemaker (which is disabled, otherwise described situation would not happen). You may consider wait_for_all (or better two_node) options in corosync that would prevent pacemaker to start unless both nodes are up. On
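
For reference, those votequorum options live in the quorum section of corosync.conf; a minimal sketch:

    quorum {
        provider: corosync_votequorum
        two_node: 1        # implicitly enables wait_for_all unless it is set explicitly
    }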

Re: [ClusterLabs] Compile fence agent on Ubuntu failing

2019-08-07 Thread Andrei Borzenkov
07.08.2019 12:21, Oleg Ulyanov пишет: > Hi all, > I’m facing a problem with fence_vmware_soap on Ubuntu 16.04. Being able to > resolve dependency missing by manually installing python packages, I still > not able to connect to my vcenter. Apparently it’s a problem with 4.0.22 > version and --ssl-

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-09 Thread Andrei Borzenkov
On Fri, Aug 9, 2019 at 9:25 AM Jan Friesse wrote: > > Олег Самойлов napsal(a): > > Hello all. > > > > I have a test bed with several virtual machines to test pacemaker. I > > simulate random failure on one of the node. The cluster will be on several > > data centres, so there is not stonith devi

Re: [ClusterLabs] Gracefully stop nodes one by one with disk-less sbd

2019-08-09 Thread Andrei Borzenkov
09.08.2019 16:34, Yan Gao пишет: > Hi, > > With disk-less sbd, it's fine to stop cluster service from the cluster > nodes all at the same time. > > But if to stop the nodes one by one, for example with a 3-node cluster, > after stopping the 2nd node, the only remaining node resets itself with:

Re: [ClusterLabs] Strange lost quorum with qdevice

2019-08-11 Thread Andrei Borzenkov
Отправлено с iPhone > 12 авг. 2019 г., в 8:46, Jan Friesse написал(а): > > Олег Самойлов napsal(a): >>> 9 авг. 2019 г., в 9:25, Jan Friesse написал(а): >>> Please do not set dpd_interval that high. dpd_interval on qnetd side is not >>> about how often is the ping is sent. Could you please re

Re: [ClusterLabs] Antw: Re: Gracefully stop nodes one by one with disk-less sbd

2019-08-12 Thread Andrei Borzenkov
Отправлено с iPhone 12 авг. 2019 г., в 9:48, Ulrich Windl написал(а): >>>> Andrei Borzenkov schrieb am 09.08.2019 um 18:40 in > Nachricht <217d10d8-022c-eaf6-28ae-a4f58b2f9...@gmail.com>: >> 09.08.2019 16:34, Yan Gao пишет: >>> Hi, >>> >&

Re: [ClusterLabs] Master/slave failover does not work as expected

2019-08-12 Thread Andrei Borzenkov
On Mon, Aug 12, 2019 at 4:12 PM Michael Powell < michael.pow...@harmonicinc.com> wrote: > At 07:44:49, the ss agent discovers that the master instance has failed on > node *mgraid…-0* as a result of a failed *ssadm* request in response to > an *ss_monitor()* operation. It issues a *crm_master -Q

Re: [ClusterLabs] [EXTERNAL] Users Digest, Vol 55, Issue 19

2019-08-12 Thread Andrei Borzenkov
ow...@clusterlabs.org > > When replying, please edit your Subject line so it is more specific than "Re: > Contents of Users digest..." > > > Today's Topics: > >1. why is node fenced ? (Lentes, Bernd) >2. Postgres HA - pacemaker RA do

Re: [ClusterLabs] Pacemaker - mounting md devices and run quotaon command

2019-08-19 Thread Andrei Borzenkov
On Tue, Aug 20, 2019 at 1:03 AM Del Monaco, Andrea wrote: > > Hi Users, > > > > As per title – do you know if there is some resource in pacemaker that allows > a filesystem (md array) to be mounted and then run the quotaon command on it Is not quota information persistent so it is enough to run

Re: [ClusterLabs] node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-22 Thread Andrei Borzenkov
22.08.2019 10:07, Ulrich Windl пишет: > Hi! > > When starting pacemaker (1.1.19+20181105.ccd6b5b10-3.10.1) on a node that had > been down for a while, I noticed some unexpected messages about the node name: > > pacemakerd: notice: get_node_name: Could not obtain a node name for > corosync n
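
Giving each node an explicit name in the corosync.conf nodelist avoids this kind of nodeid-to-name lookup; a sketch with placeholder addresses:

    nodelist {
        node {
            ring0_addr: 192.0.2.11
            name: node1
            nodeid: 1
        }
        node {
            ring0_addr: 192.0.2.12
            name: node2
            nodeid: 2
        }
    }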

Re: [ClusterLabs] Thoughts on crm shell

2019-08-22 Thread Andrei Borzenkov
22.08.2019 12:49, Ulrich Windl пишет: > Hi! > > It's been a while since I used crm shell, and now after having moved from > SLES11 to SLES12 (jhaving to use it again), I realized a few things: > > 1) As the ptest command is crm_simulate now, shouldn't crm shell's ptest (in > configure) be accom

Re: [ClusterLabs] Antw: Re: node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-26 Thread Andrei Borzenkov
On Mon, Aug 26, 2019 at 9:59 AM Ulrich Windl wrote: > Also see my earlier message. If adding the node name to corosync conf is > highly recommended, I wonder why SUSE's SLES procedure does not set it... > If you mean ha-cluster-init/ha-cluster-join, it just invokes "crm cluster", so you may cons

Re: [ClusterLabs] Command to show location constraints?

2019-08-27 Thread Andrei Borzenkov
27.08.2019 18:24, Casey & Gina пишет: > Hi, I'm looking for a way to show just location constraints, if they exist, > for a cluster. I'm looking for the same data shown in the output of `pcs > config` under the "Location Constraints:" header, but without all the rest, > so that I can write a sc
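
A few ways to list only the location constraints, depending on which shell is installed (exact flags vary between versions, so treat this as a sketch):

    pcs constraint location show --full      # pcs, including constraint ids
    crm configure show type:location         # crmsh
    cibadmin --query --scope constraints     # raw XML, all constraint types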

Re: [ClusterLabs] New status reporting for starting/stopping resources in 1.1.19-8.el7

2019-08-30 Thread Andrei Borzenkov
31.08.2019 6:39, Chris Walker пишет: > Hello, > The 1.1.19-8 EL7 version of Pacemaker contains a commit ‘Feature: crmd: > default record-pending to TRUE’ that is not in the ClusterLabs Github repo. commit b48ceeb041cee65a9b93b9b76235e475fa1a128f Author: Ken Gaillot Date: Mon Oct 16 09:45:18 2

Re: [ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

2019-09-03 Thread Andrei Borzenkov
03.09.2019 11:09, Marco Marino пишет: > Hi, I have a problem with fencing on a two node cluster. It seems that > randomly the cluster cannot complete monitor operation for fence devices. > In log I see: > crmd[8206]: error: Result of monitor operation for fence-node2 on > ld2.mydomain.it: Timed O

Re: [ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

2019-09-03 Thread Andrei Borzenkov
04.09.2019 0:27, wf...@niif.hu пишет: > Jeevan Patnaik writes: > >> [16187] node1 corosync warning [MAIN ] Corosync main process was not >> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token >> timeout increase. >> [...] >> 2. How to fix this? We have not much load on the nodes

Re: [ClusterLabs] IPaddr2 RA and multicast mac

2019-09-03 Thread Andrei Borzenkov
04.09.2019 1:27, Tomer Azran пишет: > Hello, > > When using IPaddr2 RA in order to set a cloned IP address resource: > > pcs resource create vip1 ocf:heartbeat:IPaddr2 ip=10.0.0.100 iflabel=vip1 > cidr_netmask=24 flush_routes=true op monitor interval=30s > pcs resource clone vip1 clone-max=2 clo

Re: [ClusterLabs] IPAddr2 RA and CLUSTERIP local_node

2019-09-03 Thread Andrei Borzenkov
04.09.2019 2:03, Tomer Azran пишет: > Hello, > > When using IPaddr2 RA in order to set a cloned IP address resource: > > pcs resource create vip1 ocf:heartbeat:IPaddr2 ip=10.0.0.100 iflabel=vip1 > cidr_netmask=24 flush_routes=true op monitor interval=30s > pcs resource clone vip1 clone-max=2 clo

Re: [ClusterLabs] Antw: Re: pacemaker resources under systemd

2019-09-12 Thread Andrei Borzenkov
On Thu, Sep 12, 2019 at 12:40 PM Ulrich Windl wrote: > > Hi! > > I just discovered an unpleasant side-effect of this: > SLES has "zypper ps" to show processes that use obsoleted binaries. Now if any > resource binary was replaced, zypper suggests to restart pacemaker (which is > nonsense, of cours

Re: [ClusterLabs] Antw: Re: Antw: Re: pacemaker resources under systemd

2019-09-12 Thread Andrei Borzenkov
On Thu, Sep 12, 2019 at 3:45 PM Ulrich Windl wrote: > > >>> Andrei Borzenkov schrieb am 12.09.2019 um 14:21 in > Nachricht > : > > On Thu, Sep 12, 2019 at 12:40 PM Ulrich Windl > > wrote: > >> > >> Hi! > >> > >> I just disco

Re: [ClusterLabs] Fence_sbd script in Fedora30?

2019-09-23 Thread Andrei Borzenkov
23.09.2019 23:23, Vitaly Zolotusky пишет: > Hello, > I am trying to upgrade to Fedora 30. The platform is a two node cluster with > pacemaker. > In Fedora 28 we were using an old fence_sbd script from 2013: > > # This STONITH script drives the shared-storage stonith plugin. > # Copyright (C) 2013

[ClusterLabs] SBD with shared device - loss of both interconnect and shared device?

2019-10-09 Thread Andrei Borzenkov
What happens if both interconnect and shared device is lost by node? I assume node will reboot, correct? Now assuming (two node cluster) second node still can access shared device it will fence (via SBD) and continue takeover, right? If both nodes lost shared device, both nodes will reboot and if

Re: [ClusterLabs] Howto stonith in the case of any interface failure?

2019-10-09 Thread Andrei Borzenkov
On Wed, Oct 9, 2019 at 10:59 AM Kadlecsik József wrote: > > Hello, > > The nodes in our cluster have got backend and frontend interfaces: the > former ones are for the storage and cluster (corosync) traffic and the > latter ones are for the public services of KVM guests only. > > One of the nodes

Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Andrei Borzenkov
On Thu, Oct 10, 2019 at 11:16 AM Ulrich Windl wrote: > > Hi! > > In recent SLES there is "cluster MD", like in > cluster-md-kmp-default-4.12.14-197.18.1.x86_64 > (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko). > However I could not find any manual page for it. > > Where i

Re: [ClusterLabs] Why is node fenced ?

2019-10-10 Thread Andrei Borzenkov
10.10.2019 18:22, Lentes, Bernd пишет: > HI, > > i have a two node cluster running on SLES 12 SP4. > I did some testing on it. > I put one into standby (ha-idg-2), the other (ha-idg-1) got fenced a few > minutes later because i made a mistake. > ha-idg-2 was DC. ha-idg-1 made a fresh boot and i s

Re: [ClusterLabs] What happened to "crm resource migrate"?

2019-10-15 Thread Andrei Borzenkov
On Tue, Oct 15, 2019 at 11:58 AM Yan Gao wrote: > > > > Help for "move" still says: > > resource# help move > > Move a resource to another node > > > > Move a resource away from its current location. > Looks like an issue in the version of crmsh. > > Xin, could you please take a look? > > "crm_res
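
For comparison, the low-level commands being discussed look like this (resource and node names are placeholders; option spellings differ slightly between Pacemaker 1.1 and 2.0):

    crm_resource --move  --resource my-rsc --node node2   # adds a temporary location preference
    crm_resource --clear --resource my-rsc                # removes it again (older releases: --un-move)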

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-18 Thread Andrei Borzenkov
According to it, you have a symmetric cluster (and apparently made a typo trying to change it) On Fri, Oct 18, 2019 at 10:29 AM Raffaele Pantaleoni wrote: > > Il 17/10/2019 18:08, Ken Gaillot ha scritto: > > This does sound odd, possibly a bug. Can you provide the output of "pcs >
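
Switching to an opt-in (asymmetric) cluster and then whitelisting nodes looks roughly like this with pcs (resource and node names are placeholders):

    pcs property set symmetric-cluster=false
    pcs constraint location my-rsc prefers node1=INFINITY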

Re: [ClusterLabs] -INFINITY location constraint not honored?

2019-10-18 Thread Andrei Borzenkov
18.10.2019 12:43, Raffaele Pantaleoni пишет: > > Il 18/10/2019 10:21, Andrei Borzenkov ha scritto: >> According to it, you have symmetric cluster (and apparently made typo >> trying to change it) >> >> > name="symmetric-cluster" value=&quo

Re: [ClusterLabs] Antw: Safe way to stop pacemaker on both nodes of a two node cluster

2019-10-23 Thread Andrei Borzenkov
21.10.2019 9:39, Ulrich Windl пишет: "Dileep V Nair" schrieb am 20.10.2019 um 17:54 in > Nachricht > > m>: > >> Hi, >> >> I am confused about the best way to stop pacemaker on both nodes of a >> two node cluster. The options I know of are >> 1. Put the cluster in Maintenance Mode, sto

Re: [ClusterLabs] SLES12 SP4: update_cib_stonith_devices_v2 nonsense "Watchdog will be used via SBD if fencing is required"

2019-10-23 Thread Andrei Borzenkov
23.10.2019 13:35, Ulrich Windl пишет: > Hi! > > In SLES12 SP4 I'm kind of annoyed due to repeating messages "unpack_config: > Watchdog will be used via SBD if fencing is required". > > While examining another problem, I found this sequence: > * Some unrelated resource was moved (migrated) > *

Re: [ClusterLabs] reducing corosync-qnetd "response time"

2019-10-24 Thread Andrei Borzenkov
24.10.2019 16:54, Sherrard Burton пишет: > background: > we are upgrading a (very) old HA cluster running heartbeat DRBD and NFS, > with no stonith, to a much more modern implementation. for the existing > cluster, as well as the new one, the disk space requirements make > running a full three-node

Re: [ClusterLabs] active/passive resource config

2019-10-24 Thread Andrei Borzenkov
On Fri, Oct 25, 2019 at 9:03 AM jyd <471204...@qq.com> wrote: > > Hi: > I want to use pacemaker to manage a resource named A, I want A only > started on one node, > only when the node is down or A cannot be started on this node, the A resource > will be started on other nodes. > And config a virtual
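
A minimal sketch of that pattern with pcs; ocf:heartbeat:Dummy stands in for the real application and all names and addresses are placeholders:

    pcs resource create A ocf:heartbeat:Dummy op monitor interval=30s
    pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 op monitor interval=30s
    pcs constraint colocation add vip with A INFINITY
    pcs constraint order start A then start vip

A plain (non-cloned) primitive already runs on at most one node, so failover to another node is the default behaviour once fencing is in place.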

Re: [ClusterLabs] Support for 'score' in rsc_order is deprecated...use 'kind' instead...

2019-10-28 Thread Andrei Borzenkov
28.10.2019 20:00, Jean-Francois Malouin пишет: > Hi, > > Building a new pacemaker cluster using corosync 3.0 and pacemaker 2.0.1 on > Debian/Buster 10 > I get this error when trying to insert a order constraint in the CIB to first > promote drbd to primary > then start/scan LVM. It used to work
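
The kind-based form of such an ordering looks like this (resource names are placeholders, not taken from the original CIB):

    pcs constraint order promote drbd-clone then start lvm-vg kind=Mandatory
    # crmsh equivalent:
    # order o-drbd-before-lvm Mandatory: drbd-clone:promote lvm-vg:start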

Re: [ClusterLabs] volume group won't start in a nested DRBD setup

2019-10-28 Thread Andrei Borzenkov
28.10.2019 22:44, Jean-Francois Malouin пишет: > Hi, > > Is there any new magic that I'm unaware of that needs to be added to a > pacemaker cluster using a DRBD nested setup? pacemaker 2.0.x and DRBD 8.4.10 > on > Debian/Buster on a 2-node cluster with stonith. > Eventually this will host a bunch

Re: [ClusterLabs] fencing on iscsi device not working

2019-10-30 Thread Andrei Borzenkov
30.10.2019 15:46, RAM PRASAD TWISTED ILLUSIONS пишет: > Hi everyone, > > I am trying to set up a storage cluster with two nodes, both running debian > buster. The two nodes called, duke and miles, have a LUN residing on a SAN > box as their shared storage device between them. As you can see in the

Re: [ClusterLabs] Antw: Re: fencing on iscsi device not working

2019-11-06 Thread Andrei Borzenkov
06.11.2019 18:55, Ken Gaillot пишет: > On Wed, 2019-11-06 at 08:04 +0100, Ulrich Windl wrote: > Ken Gaillot schrieb am 05.11.2019 um > 16:05 in >> >> Nachricht >> : >>> Coincidentally, the documentation for the pcmk_host_check default >>> was >>> recently updated for the upcoming 2.0.3 rel

Re: [ClusterLabs] node avoidance still leads to "status=Not installed" error for monitor op

2019-11-30 Thread Andrei Borzenkov
29.11.2019 16:37, Dennis Jacobfeuerborn пишет: Hi, I'm currently trying to set up a drbd 8.4 resource in a 3-node pacemaker cluster. The idea is to have nodes storage1 and storage2 running with the drbd clones and only use the third node storage3 for quorum. The way I'm trying to do this: pcs cl
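
One way to both ban a resource from a node and suppress the probe that produces the "Not installed" error is the resource-discovery option on the location constraint; a sketch only (constraint id and resource name are placeholders, storage3 is the node from the post, and exact syntax varies a little between pcs versions):

    pcs constraint location add ban-drbd-storage3 my-drbd-clone storage3 -INFINITY resource-discovery=never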

Re: [ClusterLabs] Concept of a Shared ipaddress/resource for generic applicatons

2019-11-30 Thread Andrei Borzenkov
29.11.2019 17:46, Jan Pokorný пишет: On 27/11/19 20:13 +, matt_murd...@amat.com wrote: I finally understand that there is a Apache Resource for Pacemaker that assigns a single virtual ipaddress that "floats" between two nodes as in webservers. https://access.redhat.com/documentation/en-us/re

Re: [ClusterLabs] Fuzzy/misleading references to "restart" of a resource

2019-12-04 Thread Andrei Borzenkov
On Thu, Dec 5, 2019 at 1:04 AM Jan Pokorný wrote: > > On 04/12/19 21:19 +0100, Jan Pokorný wrote: > > OTOH, this enforced split of state transitions is perhaps what makes > > the transaction (comprising perhaps countless other interdependent > > resources) serializable and thus feasible at all (th

Re: [ClusterLabs] serious problem with iSCSILogicalUnit

2019-12-16 Thread Andrei Borzenkov
16.12.2019 18:26, Stefan K пишет: > I thnik I got it.. > > It looks like that (A) > order pcs_rsc_order_set_iscsi-server_haip iscsi-server:start > iscsi-lun00:start iscsi-lun01:start iscsi-lun02:start ha-ip:start > symmetrical=false It is different from configuration you show originally. > ord

Re: [ClusterLabs] Understanding advisory resource ordering

2020-01-11 Thread Andrei Borzenkov
08.01.2020 17:30, Achim Leitner пишет: > Hi, > > some progress on this issue: > > Am 20.12.19 um 13:37 schrieb Achim Leitner: >> After pacemaker restart, we have Transition 0 with the DRBD actions, >> followed 4s later with Transition 1 including all VM actions with >> correct ordering. 32s later

Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6 (Was: Concept of a Shared ipaddress/resource for generic applicatons)[

2020-01-11 Thread Andrei Borzenkov
04.01.2020 01:42, Valentin Vidić пишет: > On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote: >> What you've used appears to be akin to what this chunk of manpage >> suggests (amongst others): >> https://git.netfilter.org/iptables/tree/extensions/libxt_cluster.man >> >> which is (yet anoth

Re: [ClusterLabs] Making xt_cluster IP load-sharing work with IPv6

2020-01-14 Thread Andrei Borzenkov
14.01.2020 17:47, Jan Pokorný пишет: > On 11/01/20 19:47 +0300, Andrei Borzenkov wrote: >> 04.01.2020 01:42, Valentin Vidić пишет: >>> On Thu, Jan 02, 2020 at 09:52:09PM +0100, Jan Pokorný wrote: >>>> What you've used appears to be akin to what this chunk of ma

Re: [ClusterLabs] multi-site clusters vs disaster recovery clusters

2020-02-05 Thread Andrei Borzenkov
05.02.2020 18:16, Олег Самойлов пишет: > Hi all. > > I am reading the documentation about new (for me) pacemaker, which came with > RedHat 8. > > And I see two different chapters, which both tried to solve exactly the same > problem. > > One is CONFIGURING DISASTER RECOVERY CLUSTERS (pcs dr):

Re: [ClusterLabs] Why Do Nodes Leave the Cluster?

2020-02-05 Thread Andrei Borzenkov
05.02.2020 20:55, Eric Robinson пишет: > The two servers 001db01a and 001db01b were up and responsive. Neither had > been rebooted and neither were under heavy load. There's no indication in the > logs of loss of network connectivity. Any ideas on why both nodes seem to > think the other one is

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks

2020-02-27 Thread Andrei Borzenkov
27.02.2020 20:54, Ken Gaillot пишет: > On Thu, 2020-02-27 at 18:43 +0100, Jehan-Guillaume de Rorthais wrote: Speaking about shutdown, what is the status of clean shutdown of the cluster handled by Pacemaker? Currently, I advice to stop resources gracefully (eg. using pcs re

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Coming in Pacemaker 2.0.4: shutdown locks

2020-02-27 Thread Andrei Borzenkov
28.02.2020 01:55, Ken Gaillot пишет: > On Thu, 2020-02-27 at 22:39 +0300, Andrei Borzenkov wrote: >> 27.02.2020 20:54, Ken Gaillot пишет: >>> On Thu, 2020-02-27 at 18:43 +0100, Jehan-Guillaume de Rorthais >>> wrote: >>>>>> Speaking about shutdown, w

Re: [ClusterLabs] Coming in Pacemaker 2.0.4: fencing delay based on what resources are where

2020-03-21 Thread Andrei Borzenkov
21.03.2020 20:07, Ken Gaillot пишет: > Hi all, > > I am happy to announce a feature that was discussed on this list a > while back. It will be in Pacemaker 2.0.4 (the first release candidate > is expected in about three weeks). > > A longstanding concern in two-node clusters is that in a split br

Re: [ClusterLabs] fence_mpath and failed IP

2020-03-30 Thread Andrei Borzenkov
31.03.2020 05:56, Ken Gaillot пишет: > On Sat, 2020-02-22 at 03:50 +0200, Strahil Nikolov wrote: >> Hello community, >> >> Recently I have started playing with fence_mpath and I have noticed >> that when the node is fenced, the node is kicked out of the >> cluster (corosync & pacemaker are shut d

Re: [ClusterLabs] Is 20 seconds to complete redis switchover to be expected?

2020-03-31 Thread Andrei Borzenkov
31.03.2020 09:27, steven prothero пишет: > Hello, > > I am new with Pacemaker (new to redis also) and appreciate the info shared > here. > > I believe with Redis sentinel a switchover is about 2 seconds. > Reading a post about Pacemaker with Redis, the author said he was > doing it in 3 seconds

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-06 Thread Andrei Borzenkov
06.04.2020 17:05, Sherrard Burton пишет: > ...or at least that's what i think is happening :-) > > two-node cluster, plus quorum-only node. testing the behavior when > active node is gracefully rebooted. all seems well initially. resources > are migrated, come up and function as expected. > > but

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-06 Thread Andrei Borzenkov
06.04.2020 20:57, Sherrard Burton пишет: > > > On 4/6/20 1:20 PM, Sherrard Burton wrote: >> >> >> On 4/6/20 12:35 PM, Andrei Borzenkov wrote: >>> 06.04.2020 17:05, Sherrard Burton пишет: >>>> >>>> from the quorum node: >> .

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-07 Thread Andrei Borzenkov
07.04.2020 00:21, Sherrard Burton пишет: >> >> It looks like some timing issue or race condition. After reboot node >> manages to contact qnetd first, before connection to other node is >> established. Qnetd behaves as documented - it sees two equal size >> partitions and favors the partition that

Re: [ClusterLabs] temporary loss of quorum when member starts to rejoin

2020-04-08 Thread Andrei Borzenkov
08.04.2020 10:12, Jan Friesse пишет: > Sherrard, > >> i could not determine which of these sub-threads to include this in, >> so i am going to (reluctantly) top-post it. >> >> i switched the transport to udp, and in limited testing i seem to not >> be hitting the race condition. of course i have n

Re: [ClusterLabs] unable to start fence_scsi on a newly added node

2020-04-18 Thread Andrei Borzenkov
16.04.2020 18:58, Stefan Sabolowitsch пишет: > Hi there, > i have expanded a cluster with 2 nodes with an additional one "elastic-03". > However, fence_scsi does not start on the new node. > > pcs-status: > [root@logger cluster]# pcs status > Cluster name: cluster_elastic > Stack: corosync > Curr

Re: [ClusterLabs] Merging partitioned two_node cluster?

2020-05-04 Thread Andrei Borzenkov
05.05.2020 06:39, Nickle, Richard пишет: > I have a two node cluster managing a VIP. The service is an SMTP service. > This could be active/active, it doesn't matter which node accepts the SMTP > connection, but I wanted to make sure that a VIP was in place so that there > was a well-known address

Re: [ClusterLabs] Merging partitioned two_node cluster?

2020-05-05 Thread Andrei Borzenkov
05.05.2020 16:44, Nickle, Richard пишет: > Thanks Honza and Andrei (and Strahil? I might have missed a message in the > thread...) > Yep, all messages from Strahil end up in spam folder. ___ Manage your subscription: https://lists.clusterlabs.org/mailm

Re: [ClusterLabs] Merging partitioned two_node cluster?

2020-05-05 Thread Andrei Borzenkov
be used with udpu transport at all. > Is it possible that the calculation of my base network in 'bindnetaddr' > doesn't account for networks with CIDR mask bits greater than 24? (which > would have non-zero least significant bytes.) > > Thanks, > > Rick > >

Re: [ClusterLabs] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-17 Thread Andrei Borzenkov
17.06.2020 22:05, Howard пишет: > Hello, recently I received some really great advice from this community > regarding changing the token timeout value in corosync. Thank you! Since > then the cluster has been working perfectly with no errors in the log for > more than a week. > > This morning I lo

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Andrei Borzenkov
18.06.2020 18:24, Ken Gaillot пишет: > Note that a failed start of a stonith device will not prevent the > cluster from using that device for fencing. It just prevents the > cluster from monitoring the device. > My understanding is that if stonith resource cannot run anywhere, it also won't be us

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-18 Thread Andrei Borzenkov
18.06.2020 20:16, Howard пишет: > Thanks for the replies! I will look at the failure-timeout resource > attribute and at adjusting the timeout from 20 to 30 seconds. It is funny > that the 100 tries message is symbolic. > It is not symbolic, it is INFINITY. From pacemaker documentation If th

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Andrei Borzenkov
igible.  After 30 minutes it will start trying again. >> >> On Thu, Jun 18, 2020 at 12:29 PM Ken Gaillot > <mailto:kgail...@redhat.com>> wrote: >> >> On Thu, 2020-06-18 at 21:32 +0300, Andrei Borzenkov wrote: >> > 18.06.2020 18:24, Ken Gaillot пишет:

Re: [ClusterLabs] Antw: [EXT] Failed fencing monitor process (fence_vmware_soap) RHEL 8

2020-06-19 Thread Andrei Borzenkov
19.06.2020 01:13, Howard пишет: > Thanks for all the help so far. With your assistance, I'm very close to > stable. > > Made the following changes to the vmfence stonith resource: > > Meta Attrs: failure-timeout=30m migration-threshold=10 > Operations: monitor interval=60s (vmfence-monitor-int
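
For reference, meta attributes like the ones quoted above can be set (or adjusted later) with pcs; a sketch assuming the stonith resource really is named vmfence as in the post:

    pcs resource meta vmfence failure-timeout=30m migration-threshold=10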

[ClusterLabs] Two node cluster and extended distance/site failure

2020-06-23 Thread Andrei Borzenkov
Two node is what I almost exclusively deal with. It works reasonably well in one location where failures to perform fencing are rare and can be mitigated by two different fencing methods. Usually SBD is reliable enough, as failure of shared storage also implies failure of the whole cluster. When t

Re: [ClusterLabs] Antw: [EXT] Two node cluster and extended distance/site failure

2020-06-24 Thread Andrei Borzenkov
24.06.2020 10:28, Ulrich Windl пишет: >> >> Usual recommendation is third site which functions as witness. This >> works fine up to failure of this third site itself. Unavailability of >> the witness makes normal maintenance of either of two nodes impossible. > > That's a problem of pacemaker: > A

Re: [ClusterLabs] Antw: Re: Antw: [EXT] Two node cluster and extended distance/site failure

2020-06-24 Thread Andrei Borzenkov
24.06.2020 12:20, Ulrich Windl пишет: >> >> How Service Guard handles loss of shared storage? > > When a node is up it would log the event; if a node is down it wouldn't care; > if a node detects a communication problem with the other node, it would fence > itself. > So in case of split brain wi

Re: [ClusterLabs] [Off-topic] Message threading (Was: Antw: [EXT] Re: Two node cluster and extended distance/site failure)

2020-06-29 Thread Andrei Borzenkov
29.06.2020 14:57, Ulrich Windl пишет: Klaus Wenninger schrieb am 29.06.2020 um 10:12 in > Nachricht > > [...] >> My mailer was confused by all this combinations of >> "Antw: Re: Antw:" and didn't compose mails into a >> thread properly. Which is why I missed further >> discussion where it was

Re: [ClusterLabs] Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Andrei Borzenkov
29.06.2020 20:20, Tony Stocker пишет: > >> >> >> The most interesting part seems to be the question how you define (and >> detect) a failure that will cause a node switch. > > That is a VERY good question! How many mounts failed is the critical > number when you have 130+? If a single one fails,

Re: [ClusterLabs] cluster problems after let's encrypt

2020-07-06 Thread Andrei Borzenkov
06.07.2020 19:13, fatcha...@gmx.de пишет: > Hi, > > I'm running a two node corosync httpd-cluster on a CentOS 7. > corosync-2.4.5-4.el7.x86_64 > pcs-0.9.168-4.el7.centos.x86_64 > Today I used lets encrypt to install https for two domains on that system. > After that the node with the new https-do

Re: [ClusterLabs] Automatic restart of Pacemaker after reboot and filesystem unmount problem

2020-07-14 Thread Andrei Borzenkov
14.07.2020 14:56, Grégory Sacré пишет: > Dear all, > > > I'm pretty new to Pacemaker so I must be missing something but I cannot find > it in the documentation. > > I'm setting up a SAMBA File Server cluster with DRBD and Pacemaker. Here are > the relevant pcs commands related to the mount par

Re: [ClusterLabs] qnetd and booth arbitrator running together in a 3rd geo site

2020-07-14 Thread Andrei Borzenkov
14.07.2020 13:19, Rohit Saini пишет: > Also, " Keep in mind that neither qdevice nor booth is "replacement" for > stonith. " > > Why not? qdevice/booth are handling the split-brain scenario, keeping one > master only even in case of local/geo network disjoints. Can you please > clarify more on th

[ClusterLabs] fence_virt architecture? (was: Re: Still Beginner STONITH Problem)

2020-07-18 Thread Andrei Borzenkov
18.07.2020 03:36, Reid Wahl пишет: > I'm not sure that the libvirt backend is intended to be used in this way, > with multiple hosts using the same multicast address. From the > fence_virt.conf man page: > > ~~~ > BACKENDS >libvirt >The libvirt plugin is the simplest plugin. It

Re: [ClusterLabs] Still Beginner STONITH Problem

2020-07-19 Thread Andrei Borzenkov
02.07.2020 18:18, stefan.schm...@farmpartner-tec.com пишет: > Hello, > > I hope someone can help with this problem. We are (still) trying to get > Stonith to achieve a running active/active HA Cluster, but sadly to no > avail. > > There are 2 Centos Hosts. On each one there is a virtual Ubuntu VM

Re: [ClusterLabs] fence_virt architecture? (was: Re: Still Beginner STONITH Problem)

2020-07-20 Thread Andrei Borzenkov
t I'm > right (libvirt network was in NAT mode) or wrong (VMs using Host's bond > in a bridged network). > > > > Best Regards, > > Strahil Nikolov > > > > На 19 юли 2020 г. 9:45:29 GMT+03:00, Andrei Borzenkov < > arvidj...@gmail.com> написа: >
