[ClusterLabs] Antw: Re: Resource won't start, crm_resource -Y does not help

2019-07-22 Thread Ulrich Windl
22120911.uz4cgybmxsced...@redhat.com>: > Sounds like your RA returns e.g. OCF_ERR_ARGS or similar where it > shouldn't. > > Try starting the resource with crm_resource and add ‑VV which should > show you the code as it's being run. > > On 22/07/19 13:55 +0200, Ulrich Wi

[ClusterLabs] Resource won't start, crm_resource -Y does not help

2019-07-22 Thread Ulrich Windl
Hi! Playing with some new RA that won't start, I found this in crm_resource's man page: -Y, --why Show why resources are not running, optionally filtered by --resource and/or --node When I tried it, all I got was: # crm_resource -r prm_idredir_test -Y Resource

[ClusterLabs] Antw: Re: resource location preference vs utilization

2019-07-16 Thread Ulrich Windl
>>> schrieb am 16.07.2019 um 15:20 in Nachricht <87k1cixgii.fsf...@lant.ki.iif.hu>: > "Ulrich Windl" writes: > >> schrieb am 15.07.2019 um 18:41 in Nachricht > <87o91vp7vv@lant.ki.iif.hu>: >> >>> In a mostly symmetrical clus

[ClusterLabs] Antw: Re: Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Ulrich Windl
>>> Nishant Nakate schrieb am 16.07.2019 um 10:11 in Nachricht ... > Maybe because my knowledge of resource agents is not enough. Having my > processes as resources automatically handled by pacemaker will work out. I > will need to find out more on making my services (written in CPP) resource >

[ClusterLabs] Antw: Re: Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Ulrich Windl
>>> Nishant Nakate schrieb am 16.07.2019 um 08:58 in Nachricht : > On Tue, Jul 16, 2019 at 11:33 AM Ulrich Windl < > ulrich.wi...@rz.uni-regensburg.de> wrote: > >> >>> Nishant Nakate schrieb am 16.07.2019 um >> 05:37 in >> Nachricht >> :

[ClusterLabs] Antw: Interacting with Pacemaker from my code

2019-07-16 Thread Ulrich Windl
>>> Nishant Nakate schrieb am 16.07.2019 um 05:37 in Nachricht : > Hi All, > > I am new to this community and HA tools. Need some guidance on my current > handling of pacemaker. > > For one of my projects, I am using pacemaker for high availability. > Following the instructions provided in setup

[ClusterLabs] Antw: Re: Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Ulrich Windl
>>> Jehan-Guillaume de Rorthais schrieb am 10.07.2019 um 13:14 in Nachricht <20190710131427.3876ea36@firost>: > On Wed, 10 Jul 2019 12:53:59 +0300 > Andrei Borzenkov wrote: > >> On Wed, Jul 10, 2019 at 12:42 PM Jehan‑Guillaume de Rorthais >> wrote: >> >> > >> > > > Jul 09 09:16:32 [2679]

[ClusterLabs] Antw: Re: Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 01.07.2019 um 14:42 in Nachricht <20190701124215.gn31...@redhat.com>: > On 01/07/19 13:26 +0200, Ulrich Windl wrote: >>>>> Jan Pokorný schrieb am 27.06.2019 um 12:02 >>>>> in Nachricht <20190627100209.gf31...@redhat.c

[ClusterLabs] Antw: Re: PCSD - High Memory Usage

2019-07-01 Thread Ulrich Windl
Would running pcsd under valgrind be an option? In addition to checking for leaks, it can also provide some memory usage statistics (who is using how much)... >>> Tomas Jelinek schrieb am 27.06.2019 um 15:30 in Nachricht <363f827e-d05d-309f-7ab6-c43e268df...@redhat.com>: > Hi, > > We (pcs
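A rough sketch of what that suggestion might look like (the wrapper name, log path, daemon path, and valgrind options are assumptions, not taken from the thread):

```shell
# Hypothetical wrapper: run a daemon in the foreground under valgrind
# to collect leak reports and heap-usage statistics.
run_under_valgrind() {
  valgrind --leak-check=full --show-leak-kinds=all \
           --log-file="/var/log/valgrind-$(basename "$1").log" "$@"
}
# Example (assumed binary path): run_under_valgrind /usr/lib/pcsd/pcsd
```

The daemon would need to run in the foreground (not fork) for valgrind to produce a useful report on exit.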

[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 27.06.2019 um 12:02 in Nachricht <20190627100209.gf31...@redhat.com>: > On 25/06/19 12:20 ‑0500, Ken Gaillot wrote: >> On Tue, 2019‑06‑25 at 11:06 +, Somanath Jeeva wrote: >> Addressing the root cause, I'd first make sure corosync is running at >> real‑time priority

[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Somanath Jeeva schrieb am 25.06.2019 um 13:06 in Nachricht > I have not configured fencing in our setup . However I would like to know if > the split brain can be avoided when high CPU occurs. It seems you like to ride a bicycle with crossed arms while trying to avoid falling ;-) > >

[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 24.06.2019 um 16:57 in Nachricht <95f51b52283d05bbd948e4508c406d7ccb64.ca...@redhat.com>: > On Mon, 2019‑06‑24 at 08:52 +0200, Jan Friesse wrote: >> Somanath, >> >> > Hi All, >> > >> > I have a two node cluster with multicast (udp) transport . The >> > multicast

[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

2019-07-01 Thread Ulrich Windl
>>> Jan Friesse schrieb am 24.06.2019 um 08:52 in Nachricht : > Somanath, > >> Hi All, >> >> I have a two node cluster with multicast (udp) transport . The multicast IP > used in 224.1.1.1 . > > Would you mind to give a try to UDPU (unicast)? For two node cluster > there is going to be no

[ClusterLabs] Antw: Re: two virtual domains start and stop every 15 minutes

2019-07-01 Thread Ulrich Windl
To me it looks like a broken migration configuration. >>> "Lentes, Bernd" schrieb am 19.06.2019 um 18:46 in Nachricht <1654529492.1465807.1560962767193.javamail.zim...@helmholtz-muenchen.de>: > ‑ On Jun 15, 2019, at 4:30 PM, Bernd Lentes bernd.lentes@helmholtz‑muenchen.de > wrote: > >>

[ClusterLabs] Antw: Re: PostgreSQL PAF failover issue

2019-07-01 Thread Ulrich Windl
>>> Tiemen Ruiten schrieb am 14.06.2019 um 16:43 in Nachricht : > Right, so I may have been too fast to give up. I set maintenance mode back > on and promoted ph-sql-04 manually. Unfortunately I don't have the logs of > ph-sql-03 anymore because I reinitialized it. > > You mention that demote

[ClusterLabs] Antw: File System does not do a recovery on fail over

2019-06-11 Thread Ulrich Windl
>>> Indivar Nair schrieb am 09.06.2019 um 14:52 in Nachricht : > Hello ..., > > I have an Active-Passive cluster with two nodes hosting an XFS Filesystem > over a CLVM Volume. > > If a failover happens, the volume is mounted on the other node without a > recovery that usually happens to a

[ClusterLabs] Antw: Re: Antw: Re: Q: ocf:pacemaker:NodeUtilization monitor

2019-06-03 Thread Ulrich Windl
>>> Andrei Borzenkov schrieb am 29.05.2019 um 20:31 in Nachricht <1d0c775a-6854-bede-e241-8c23d5919...@gmail.com>: > 29.05.2019 11:12, Ulrich Windl wrote: >>>>> Jan Pokorný schrieb am 28.05.2019 um 16:31 in >> Nachricht >> <20190528143145.ga29...@

[ClusterLabs] Antw: Re: Q: ocf:pacemaker:NodeUtilization monitor

2019-05-29 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 28.05.2019 um 16:31 in Nachricht <20190528143145.ga29...@redhat.com>: > On 27/05/19 08:28 +0200, Ulrich Windl wrote: >> I configured ocf:pacemaker:NodeUtilization more or less for fun, and I > realized that the cluster reports no pro

[ClusterLabs] Q: ocf:pacemaker:NodeUtilization monitor

2019-05-27 Thread Ulrich Windl
Hi! I configured ocf:pacemaker:NodeUtilization more or less for fun, and I realized that the cluster reports no problems, but in syslog I have these unusual messages: 2019-05-27T08:21:07.748149+02:00 h06 lrmd[16599]: notice: prm_node_util_monitor_30:15028:stderr [ info: Writing node

[ClusterLabs] Antw: Re: Antw: why is node fenced ?

2019-05-23 Thread Ulrich Windl
>>> "Lentes, Bernd" schrieb am 23.05.2019 um 15:01 in Nachricht <1029244418.9784641.1558616472505.javamail.zim...@helmholtz-muenchen.de>: > > ‑ On May 20, 2019, at 8:28 AM, Ulrich Windl ulrich.wi...@rz.uni‑regensburg.de > wrote: > >>>>>

[ClusterLabs] action 'monitor_Stopped' not found in Resource Agent meta-data

2019-05-23 Thread Ulrich Windl
Hi! Reading the release notes on SLES12 HAE, I found the ``op monitor role="Stopped" ...`` operation that had been discussed here before, too. When trying to configure it, I get an error message (from crm shell): WARNING: prm_ping_gw1-v582: action 'monitor_Stopped' not found in Resource Agent
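For reference, a monitor for the Stopped role would be configured in crm shell roughly like this (the resource name is taken from the message above; the parameters and intervals are illustrative, and the two monitor operations must use different intervals):

```
primitive prm_ping_gw1-v582 ocf:pacemaker:ping \
    params host_list="192.168.0.1" \
    op monitor interval=30s timeout=60s \
    op monitor role=Stopped interval=45s timeout=60s
```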

[ClusterLabs] Antw: Re: Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-21 Thread Ulrich Windl
Hi! So maybe the original defective RA would be valuable for debugging the issue. I guess the RA was invalid in some way that wasn't detected or handled properly... Regards, Ulrich >>> Andrei Borzenkov schrieb am 21.05.2019 um 09:13 in Nachricht : > 21.05.2019 0:46, Ken Gaillot wrote: >>>

[ClusterLabs] Antw: Re: Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-21 Thread Ulrich Windl
>>> Kadlecsik József schrieb am 20.05.2019 um 23:15 in Nachricht : [...] > stopping/starting of the resources. :‑) I haven't thought that "id" is > reserved as parameter name. The attribute "id" is "very much reserved" even in HTML, XML, SGML, etc. I don't know about "id" being reserved as an

[ClusterLabs] Antw: Re: Constant stop/start of resource in spite of interval=0

2019-05-20 Thread Ulrich Windl
What worries me is "Rejecting name for unique". >>> Kadlecsik József schrieb am 20.05.2019 um 14:37 in Nachricht : > On Sun, 19 May 2019, Kadlecsik József wrote: > >> On Sat, 18 May 2019, Kadlecsik József wrote: >> >> > On Sat, 18 May 2019, Kadlecsik József wrote: >> > >> > > On Sat, 18 May

[ClusterLabs] Antw: why is node fenced ?

2019-05-20 Thread Ulrich Windl
>>> "Lentes, Bernd" schrieb am 16.05.2019 um 17:10 in Nachricht <1151882511.6631123.1558019430655.javamail.zim...@helmholtz-muenchen.de>: > Hi, > > my HA-Cluster with two nodes fenced one on 14th of may. > ha-idg-1 has been the DC, ha-idg-2 was fenced. > It happened around 11:30 am. > The log

[ClusterLabs] Antw: Corosync crash

2019-05-07 Thread Ulrich Windl
>>> Klecho schrieb am 07.05.2019 um 08:59 in Nachricht <5e84375f-a631-cf06-50dc-58150fd78...@gmail.com>: > Hi, > > During the weekend my corosync daemon suddenly died without anything in > the logs, except this: > > May 5 20:39:16 ZZZ kernel: [1605277.136049] traps: corosync[2811] trap >

[ClusterLabs] Antw: Re: How to correctly stop cluster with active stonith watchdog?

2019-05-06 Thread Ulrich Windl
>>> Andrei Borzenkov schrieb am 05.05.2019 um 07:43 in Nachricht <033573b9-188f-baf6-e4b9-ba73150a3...@gmail.com>: > 30.04.2019 19:47, Олег Самойлов wrote: >> >> >>> 30 Apr 2019, 19:38, Andrei Borzenkov >>> wrote: >>> >>> 30.04.2019 19:34, Олег Самойлов wrote: > No. I

[ClusterLabs] Antw: crm_mon output to html-file - is there a way to manipulate the html-file ?

2019-05-06 Thread Ulrich Windl
>>> "Lentes, Bernd" schrieb am 03.05.2019 um 19:18 in Nachricht <595056197.709875.1556903922143.javamail.zim...@helmholtz-muenchen.de>: > Hi, > > on my cluster nodes i established a systemd service which starts crm_mon > which writes cluster information into a html-file so i can see the state

[ClusterLabs] Antw: monitor timed out with unknown error

2019-05-06 Thread Ulrich Windl
>>> Arkadiy Kulev schrieb am 05.05.2019 um 15:14 in >>> Nachricht : > Hello! > > I run pacemaker on 2 active/active hosts which balance the load of 2 public > IP addresses. > A few days ago we ran a very CPU/network intensive process on one of the 2 > hosts and Pacemaker failed. > > I've

[ClusterLabs] corosync.conf: A fatal syntax error that's not detected

2019-04-30 Thread Ulrich Windl
Hi! Trying to upgrade one corosync 1 cluster (SLES11 SP4) to corosync 2 (SLES12 SP4) resulted in a two-node cluster that happily fences each node, and little else. A first investigation indicated that I simply placed the "transport: udpu" line within each "interface" instead of globally. It
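For reference, a sketch of the intended layout (addresses illustrative): `transport` is a `totem`-level option, not an `interface` option, and as described above a misplaced copy is apparently accepted without any error:

```
totem {
    version: 2
    transport: udpu          # correct place: directly in the totem section
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        # transport: udpu    # wrong place: reportedly not flagged as an error
    }
}
```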

[ClusterLabs] Antw: Re: Pacemaker detail log directory permissions

2019-04-29 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 29.04.2019 um 17:22 in Nachricht <20190429152200.ga19...@redhat.com>: > On 29/04/19 14:58 +0200, Jan Pokorný wrote: >> On 29/04/19 08:20 +0200, Ulrich Windl wrote: >>>>>> Jan Pokorný schrieb am 25.04.2019 um 18:49 >>>

[ClusterLabs] Antw: Re: Pacemaker detail log directory permissions

2019-04-29 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 25.04.2019 um 18:49 in Nachricht <20190425164946.gf23...@redhat.com>: > On 24/04/19 09:32 ‑0500, Ken Gaillot wrote: >> On Wed, 2019‑04‑24 at 16:08 +0200, wf...@niif.hu wrote: >>> Make install creates /var/log/pacemaker with mode 0770, owned by >>> hacluster:haclient.

[ClusterLabs] Warning (SLES 12 SP4): ocf:heartbeat:CTDB does not work any more

2019-04-25 Thread Ulrich Windl
Hi! I managed to get my cluster up again after upgrading from SLES11 SP4 to SLES12 SP4, but my CTDB Samba won't start any more. The problem is: CTDB(prm_s02_ctdb)[30904]: ERROR: Failed to execute /usr/sbin/ctdbd. lrmd[27341]: notice: prm_s02_ctdb_start_0:30857:stderr [ Invalid option

[ClusterLabs] Q: "confirmed=false"?

2019-04-24 Thread Ulrich Windl
Hi! I have a question: What is the difference between "confirmed=true" and "confirmed=false" actions, like here: Apr 24 08:30:20 h01 crmd[10774]: notice: process_lrm_event: Operation prm_xen_v01_migrate_from_0: ok (node=h01, call=169, rc=0, cib-update=150, confirmed=true) Apr 24 08:30:26 h01

[ClusterLabs] Antw: Re: Coming in 2.0.2: check whether a date-based rule is expired

2019-04-24 Thread Ulrich Windl
Hi! I know that April 1st is gone, but maybe we should have "user-friendly durations" also? Maybe like: "a deep breath", meaning "30 seconds" "a guru meditation", meaning "5 minutes" "a coffee break", meaning "15 minutes" "a lunch break", meaning "half an hour" ... Typical maintenance tasks can

[ClusterLabs] Q: nodes staying "UNCLEAN (offline)" -- why?

2019-04-23 Thread Ulrich Windl
Hi! After some tweaking following the update from SLES11 to SLES12, I built a new config file for corosync. Corosync is happy, pacemaker says the nodes are online, but the cluster status still says both nodes are "UNCLEAN (offline)". Why? Messages I see are: crmd: info: peer_update_callback:Client

[ClusterLabs] Antw: Re: Question on permissions for pcsd ghost files

2019-04-23 Thread Ulrich Windl
>>> Tomas Jelinek schrieb am 23.04.2019 um 12:36 in Nachricht : > The files are listed as ghost files in order to let rpm know they belong > to pcs but are not distributed in rpm packages. Those files are created > by pcsd in runtime. I guess the 000 permissions come from the fact those >

[ClusterLabs] "Funny" message from corosync (SLES12 SP4)

2019-04-23 Thread Ulrich Windl
Hi! Fighting with the changes between corosync 1 (SLES11 SP4) and corosync 2 (SLES12 SP4), I got some "funny" error message: corosync[13979]: [MAIN ] parse error in config: The token hold timeout parameter (16 ms) may not be less than (30 ms). The funny part is that "hold" is not set at all

[ClusterLabs] some comments on corosync.conf(5)

2019-04-23 Thread Ulrich Windl
Hi! Reading the corosync.conf manual page of corosync 2.3.6 (SLES12 SP4), I have some random comments: The tag indent for "clear_node_high_bit" seems broken. Shouldn't "hold" be "token_hold"? Why is "token_retransmits_before_loss_const" so long (why the "_const")? The description states

[ClusterLabs] Antw: Re: Antw: Coming in pacemaker 2.0.2: XML output from tools

2019-04-23 Thread Ulrich Windl
>>> Christopher Lumens schrieb am 18.04.2019 um 17:07 in Nachricht <2044533839.17332252.1555600059522.javamail.zim...@redhat.com>: >> As XML is only as good as ist structure, would you present the structure of >> such XML output? > > If you enjoy reading XML that describes XML, you can check

[ClusterLabs] Antw: shutdown of 2-Node cluster when power outage

2019-04-18 Thread Ulrich Windl
>>> "Lentes, Bernd" schrieb am 18.04.2019 um 16:11 in Nachricht <63714399.378325.196700800.javamail.zim...@helmholtz-muenchen.de>: > Hi, > > i have a two-node cluster, both servers are buffered by an UPS. > If power is gone the UPS sends after a configurable time a signal via > network to

[ClusterLabs] Antw: Coming in pacemaker 2.0.2: XML output from tools

2019-04-18 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 17.04.2019 um 22:53 in Nachricht <57f8ec2dcee0c715e2bc1146005e86e7844306d1.ca...@redhat.com>: > Hi all, > > Another new feature considered experimental in the upcoming pacemaker > release is XML output from tools for easier automated parsing. To begin > with, only

[ClusterLabs] Antw: Question about fencing

2019-04-18 Thread Ulrich Windl
>>> JCA <1.41...@gmail.com> schrieb am 17.04.2019 um 22:50 in Nachricht : > I am trying to get fencing working, as described in the "Cluster from > Scratch" guide, and I am stymied at get-go :-( > > The document mentions a property named stonith-enabled. When I was trying > to get my first

[ClusterLabs] Antw: Re: Resource not starting correctly

2019-04-16 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 16.04.2019 um 00:30 in Nachricht <144df656215fc1ed6b3a35cffd1cbd2436f2a785.ca...@redhat.com>: [...] > The cluster successfully probed the service on both nodes, and started > it on node one. It then tried to start a 30‑second recurring monitor > for the service, but the

[ClusterLabs] Antw: Resource not starting correctly II

2019-04-16 Thread Ulrich Windl
>>> JCA <1.41...@gmail.com> schrieb am 15.04.2019 um 23:30 in Nachricht : > Well, I remain puzzled. I added a statement to the end of my script in > order to capture its return value. Much to my surprise, when I create the > associated resource (as described in my previous post) myapp-script gets

[ClusterLabs] Antw: Re: SBD as watchdog daemon

2019-04-15 Thread Ulrich Windl
>>> schrieb am 15.04.2019 um 13:03 in Nachricht <566fe1cd-b8fd-41e0-bc07-1722be14e...@ya.ru>: > >> 14 Apr 2019, 10:12, Andrei Borzenkov wrote: > > Thanks for the explanation, I think this will be a good addition to the SBD > manual. (The SBD manual needs this.) But my

[ClusterLabs] Antw: corosync caused network breakdown

2019-04-09 Thread Ulrich Windl
>>> Sven Möller schrieb am 08.04.2019 um 16:11 in Nachricht <20190408141109.horde.fk4h-6rlnqpo3s3muppw...@cloudy.nichthelfer.de>: > Hi, > we were running a corosync config including 2 Rings for about 2.5 years on a > two node NFS Cluster (active/passive). The first ring (ring 0) is configured >

[ClusterLabs] Antw: Re: Antw: Re: Issue with DB2 HADR cluster

2019-04-03 Thread Ulrich Windl
>>> Valentin Vidic schrieb am 03.04.2019 um 09:26 in Nachricht <20190403072602.gw9...@gavran.carpriv.carnet.hr>: > On Wed, Apr 03, 2019 at 09:13:58AM +0200, Ulrich Windl wrote: >> I'm surprised: Once sbd writes the fence command, it usually takes >> less than 3 se

[ClusterLabs] Antw: Re: Issue with DB2 HADR cluster

2019-04-03 Thread Ulrich Windl
>>> Digimer schrieb am 02.04.2019 um 19:49 in Nachricht <6c6302f4-844b-240d-8d0e-727dddf36...@alteeve.ca>: [...] > It's worth noting that SBD fencing is "better than nothing", but slow. > IPMI and/or PDU fencing completes a lot faster. I'm surprised: Once sbd writes the fence command, it

[ClusterLabs] Antw: Why do clusters have a name?

2019-03-27 Thread Ulrich Windl
>>> Brian Reichert schrieb am 26.03.2019 um 21:12 in Nachricht <20190326201259.gj36...@numachi.com>: > This will sound like a dumb question: > > The manpage for pcs(8) implies that to set up a cluster, one needs > to provide a name. > > Why do clusters have names? Seems to be traditional.

[ClusterLabs] Antw: Re: Colocation constraint moving resource

2019-03-27 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 26.03.2019 um 20:28 in Nachricht <1d8d000ab946586783fc9adec3063a1748a5b06f.ca...@redhat.com>: > On Tue, 2019-03-26 at 22:12 +0300, Andrei Borzenkov wrote: >> 26.03.2019 17:14, Ken Gaillot wrote: >> > On Tue, 2019-03-26 at 14:11 +0100, Thomas Singleton wrote: >> > > Dear

[ClusterLabs] Antw: Re: Question on sharing data with DRDB

2019-03-21 Thread Ulrich Windl
>>> Valentin Vidic schrieb am 20.03.2019 um 19:00 in Nachricht <20190320180007.gx9...@gavran.carpriv.carnet.hr>: > On Wed, Mar 20, 2019 at 01:47:56PM ‑0400, Digimer wrote: >> Not when DRBD is configured correctly. You sent 'fencing >> resource‑and‑stonith;' and set the appropriate fence handler.

[ClusterLabs] Antw: Re: Question on sharing data with DRDB

2019-03-21 Thread Ulrich Windl
>>> Digimer schrieb am 20.03.2019 um 18:47 in Nachricht <38f790d0-4bb6-53b8-7eb4-b285ed147...@alteeve.ca>: > On 2019-03-20 1:46 p.m., Valentin Vidic wrote: >> On Wed, Mar 20, 2019 at 01:34:52PM -0400, Digimer wrote: >>> Depending on your fail-over tolerances, I might add NFS to the mix and >>>

[ClusterLabs] Antw: Re: Question on sharing data with DRDB

2019-03-21 Thread Ulrich Windl
>>> Digimer schrieb am 20.03.2019 um 17:37 in Nachricht <37a2b613-62ce-a552-804a-df5199674...@alteeve.ca>: > Note; > > Cluster filesystems are amazing if you need them, and to be avoided if > at all possible. The overhead from the cluster locking hurts performance > quite a lot, and adds a

[ClusterLabs] Antw: Resource creation information

2019-03-13 Thread Ulrich Windl
Hi! On "2." (provider): I think "heartbeat" is purely historical (as you might have found out, the provider is mostly a subdirectory in the tree where RAs are located), and for compatibility nobody dared or cared to change it. Personally I'm using my own provider (consisting of a four-letter hash of

[ClusterLabs] Antw: Re: (no subject)

2019-03-12 Thread Ulrich Windl
>>> Alex Crow schrieb am 11.03.2019 um 22:28 in Nachricht : > On 11/03/2019 21:18, Full Name wrote: >> I am a complete newbie here, so please bear with me, if I ask something > stupid and/or obvious. >> >> I have been able to deploy and configure the software across three > nodes,

[ClusterLabs] SLES11SP4 and SLES12 SP4: different crypto type

2019-02-26 Thread Ulrich Windl
Hi! I'm playing with upgrading a SLES11 SP4 cluster to SLES12 SP4. After having upgraded one node in a two-node cluster, I see the following messages on the SLES12 node: [...] Feb 27 08:46:16 h02 corosync[5198]: [TOTEM ] A new membership (172.20.16.2:2336) was formed. Members joined:

[ClusterLabs] Antw: Re: Continuous master monitor failure of a resource in case some other resource is being promoted

2019-02-26 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 26.02.2019 um 16:27 in >>> Nachricht : [...] > > Actions that have been *scheduled* but not *initiated* can be aborted. > But anytime a resource agent has been invoked, we wait for that process > to complete. I guess it's to receive the regular exit code. > [...] >

[ClusterLabs] Antw: Monitor error

2019-02-26 Thread Ulrich Windl
>>> Valer Nur schrieb am 25.02.2019 um 17:22 in Nachricht <863485302.5185430.155721...@mail.yahoo.com>: > Hi, > I am a newbie on this. I have followed the great documentation to create an > active/passive cluster. I skipped the Apache part since I do not need it. All > seems to be working

[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Ulrich Windl
>>> Edwin Török schrieb am 20.02.2019 um 12:30 in Nachricht <0a49f593-1543-76e4-a8ab-06a48c596...@citrix.com>: > On 20/02/2019 07:57, Jan Friesse wrote: >> Edwin, >>> >>> >>> On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: > On 19/02/2019 16:26,

[ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

2019-02-19 Thread Ulrich Windl
>>> Eric Robinson schrieb am 19.02.2019 um 21:06 in Nachricht >> -Original Message- >> From: Users On Behalf Of Ken Gaillot >> Sent: Tuesday, February 19, 2019 10:31 AM >> To: Cluster Labs - All topics related to open-source clustering welcomed >> >> Subject: Re: [ClusterLabs] Why Do

[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Ulrich Windl
error 4 in libc-2.17.so[7f221c554000+1c2000] >> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 >> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 >> 19 0f 6f 0f 66 0f 74 c1 66 0f d7

[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 18.02.2019 um 21:08 in Nachricht <20190218200816.gd23...@redhat.com>: > On 15/02/19 08:48 +0100, Jan Friesse wrote: >> Ulrich Windl napsal(a): >>> IMHO any process running at real-time priorities must make sure >>> that

[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-17 Thread Ulrich Windl
Hi! I also wonder: With SCHED_RR, would a sched_yield() at a proper place in the 100% CPU loop also fix this issue? Or do you think "we need real-time, and cannot allow any other task to run"? Regards, Ulrich >>> Edwin Török schrieb am 15.02.2019 um 17:58 in Nachricht : > On 15/02/2019 16:08,

[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Ulrich Windl
I would expect that, as strace interrupts the RT task to query the code; you should run strace at the same RT priority ;-) >>> Edvin Torok 14.02.19 19.54 Uhr >>> Apologies for top posting, the strace you asked for is available here (although running strace itself had side-effect of getting
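A sketch of that suggestion using chrt (the helper name, priority value, and output path are assumptions; corosync's actual SCHED_RR priority may differ):

```shell
# Hypothetical helper: attach strace to a SCHED_RR process while
# running the tracer itself at the same real-time priority, so the
# traced task cannot starve strace.
trace_at_rt() {
  pid=$1 prio=$2
  chrt --rr "$prio" strace -f -tt -p "$pid" -o "/tmp/strace.$pid"
}
# Example (assumed priority): trace_at_rt "$(pidof corosync)" 99
```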

[ClusterLabs] Antw: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Ulrich Windl
Hi! IMHO any process running at real-time priorities must make sure that it consumes the CPU only for short moments that are really critical to be performed in time. Specifically having some code that performs poorly (for various reasons) is absolutely _not_ a candidate to be run with real-time

[ClusterLabs] Antw: Re: Is fencing really a must for Postgres failover?

2019-02-13 Thread Ulrich Windl
Hi! I wonder: Can we close this thread with "You have been warned, so please don't come back later, crying! In the meantime you can do what you want to do."? Regards, Ulrich >>> Jehan-Guillaume de Rorthais schrieb am 13.02.2019 um 15:05 in Nachricht <20190213150549.47634671@firost>: > On Wed,

[ClusterLabs] Antw: Announcing hawk-apiserver, now in ClusterLabs

2019-02-12 Thread Ulrich Windl
Hello! I'd like to comment as an "old" SuSE customer: I'm amazed that lighttpd is dropped in favor of some new go application: SuSE now has a base system that needs (correct me if I'm wrong): shell, perl, python, java, go, ruby, ...? Maybe each programmer has his favorite. Personally I also

[ClusterLabs] Antw: Is fencing really a must for Postgres failover?

2019-02-11 Thread Ulrich Windl
>>> Maciej S schrieb am 11.02.2019 um 12:34 in Nachricht : > I was wondering if anyone can give a plain answer if fencing is really > needed in case there are no shared resources being used (as far as I define > shared resource). > > We want to use PAF or other Postgres (with replicated data

[ClusterLabs] Antw: Re: Pacemaker log showing time mismatch after

2019-02-03 Thread Ulrich Windl
>>> Jan Pokorný schrieb am 01.02.2019 um 08:10 in Nachricht <20190201071011.gb7...@redhat.com>: > On 28/01/19 09:47 ‑0600, Ken Gaillot wrote: >> On Mon, 2019‑01‑28 at 18:04 +0530, Dileep V Nair wrote: >> Pacemaker can handle the clock jumping forward, but not backward. > > I am rather surprised,

[ClusterLabs] Antw: [pacemaker] Discretion with glib v2.59.0+ recommended

2019-01-21 Thread Ulrich Windl
Hi! IMHO it's like in Perl: When relying on hash keys being returned in any particular (or even stable) order, the idea is just broken! Either keep the keys in an extra array for ordering, or sort them in some way... Regards, Ulrich >>> Jan Pokorný schrieb am 18.01.2019 um 20:32 in Nachricht
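The same advice applies outside Perl; a small bash illustration (bash, like Perl, leaves associative-array key order unspecified, so sort the keys for a stable iteration order):

```shell
declare -A h=([b]=2 [a]=1 [c]=3)
# "${!h[@]}" may come back in any order; sort it for reproducibility.
sorted_keys=$(printf '%s\n' "${!h[@]}" | sort | paste -sd, -)
echo "$sorted_keys"   # a,b,c
```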

[ClusterLabs] Antw: Re: Antw: Re: Unexpected resource restart

2019-01-20 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 17.01.2019 um 18:45 in Nachricht : > On Thu, 2019‑01‑17 at 07:49 +0100, Ulrich Windl wrote: >> > > > Ken Gaillot schrieb am 16.01.2019 um >> > > > 16:34 in Nachricht >> >> <02e51f6d4f7c7c11161d54e2968c23

[ClusterLabs] Antw: Re: Antw: Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Ulrich Windl
properly, so the other node will assume some failure and fence (to be sure the other side is dead). Regards, Ulrich >>> "Bryan K. Walton" schrieb am 16.01.2019 um 16:36 in Nachricht <20190116153625.zybof7hrkuueh...@mygeeto.inside.leepfrog.com>: > On Wed, Jan 16, 2019 at 04

[ClusterLabs] Antw: Re: Unexpected resource restart

2019-01-16 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 16.01.2019 um 16:34 in >>> Nachricht <02e51f6d4f7c7c11161d54e2968c23c77c4a1eed.ca...@redhat.com>: [...] > In retrospect, interleave=true should have been the default. I've never > seen a case where false made sense, and people get bit by overlooking > it all the time.

[ClusterLabs] Antw: Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Ulrich Windl
Hi! I guess we need more logs; especially some events from storage2 before fencing is triggered. Regards, Ulrich >>> "Bryan K. Walton" schrieb am 16.01.2019 um 16:03 in Nachricht <20190116150321.3j2f2upz67eth...@mygeeto.inside.leepfrog.com>: > I have posed this question on the DRBD‑user list,

[ClusterLabs] Antw: Pacemaker API structure and pkg-config files

2019-01-14 Thread Ulrich Windl
>>> schrieb am 14.01.2019 um 11:48 in Nachricht <87va2rjyzv@lant.ki.iif.hu>: > Hi, > > Recently I spent some time mapping the interrelations of the C header > files constituting the Pacemaker API. In the end I decided they were so > tightly interdependent that there was really no useful way

[ClusterLabs] Antw: Re: Proposal for machine-friendly output from Pacemaker tools

2019-01-08 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 08.01.2019 um 18:28 in Nachricht <2ecefc63baa56a76a6eeca7c696fc7a1653eb620.ca...@redhat.com>: > On Tue, 2019-01-08 at 17:23 +0100, Kristoffer Grönlund wrote: >> On Tue, 2019-01-08 at 10:07 -0600, Ken Gaillot wrote: >> > On Tue, 2019-01-08 at 10:30 +0100, Kristoffer

[ClusterLabs] Antw: Re: Trying to understand the default action of a fence agent

2019-01-08 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 08.01.2019 um 17:55 in Nachricht : > On Tue, 2019‑01‑08 at 07:35 ‑0600, Bryan K. Walton wrote: >> Hi, >> >> I'm building a two node cluster with Centos 7.6 and DRBD. These >> nodes >> are connected upstream to two Brocade switches. I'm trying to enable >> fencing by

[ClusterLabs] Antw: Proposal for machine-friendly output from Pacemaker tools

2019-01-07 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 08.01.2019 um 00:52 in Nachricht : > There has been some discussion in the past about generating more > machine‑friendly output from pacemaker CLI tools for scripting and > high‑level interfaces, as well as possibly adding a pacemaker REST API. Interesting: XML being

[ClusterLabs] Antw: Re: SuSE12SP3 HAE SBD Communication Issue

2019-01-06 Thread Ulrich Windl
>>> Andrei Borzenkov schrieb am 22.12.2018 um 05:27 in Nachricht <3897ef3e-220e-7377-9647-24965eab4...@gmail.com>: > 21.12.2018 12:09, Klaus Wenninger wrote: >> On 12/21/2018 08:15 AM, Fulong Wang wrote: >>> Hello Experts, >>> >>> I'm new to this mailing list. >>> Please kindly forgive me if this mail

[ClusterLabs] Antw: Re: SuSE12SP3 HAE SBD Communication Issue

2018-12-27 Thread Ulrich Windl
Hi! To offline a SCSI disk: "echo offline > /sys/block/sdX/device/state". The opposite is not "online", BTW, but: "echo running > /sys/block/sdX/device/state". You could also try "echo 'scsi remove-single-device MAGIC' > /proc/scsi/scsi", where MAGIC is (AFAIR) "HOST BUS TARGET LUN". Regards, Ulrich
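Wrapped up as shell helpers (the function names are made up here; the device name and the HOST BUS TARGET LUN numbers are placeholders you must fill in, and all of this needs root and acts on real hardware):

```shell
# Take a SCSI disk offline, e.g. scsi_offline sdb
scsi_offline() { echo offline > "/sys/block/$1/device/state"; }
# The opposite state to write is "running", not "online":
scsi_online()  { echo running > "/sys/block/$1/device/state"; }
# Remove a device via the legacy /proc interface; arguments are
# HOST BUS TARGET LUN, e.g. scsi_remove 0 0 1 0
scsi_remove()  { echo "scsi remove-single-device $1 $2 $3 $4" > /proc/scsi/scsi; }
```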

[ClusterLabs] Antw: Re: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-18 Thread Ulrich Windl
>>> Chris Walker schrieb am 18.12.2018 um 17:13 in Nachricht : [...] > 2. As Ken mentioned, synchronize the starting of Corosync and Pacemaker. I > did this with a simple ExecStartPre systemd script: > > [root@bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf > [Service] >

[ClusterLabs] Q: syslog-ng RA in SLES 12 SP3

2018-12-18 Thread Ulrich Windl
Hi! I just noticed that in SLES12 SP3 there is a syslog-ng RA, but it seems there is no syslog-ng package available, and "$SYSLOG_NG_EXE" is not set any more. Thus that RA is unusable IMHO. I was running a non-privileged syslog-ng with its own log files on a high port in SLES11, but it seems

[ClusterLabs] Antw: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-17 Thread Ulrich Windl
>>> Vitaly Zolotusky schrieb am 17.12.2018 um 21:43 in >>> Nachricht <1782126841.215210.1545079428...@webmail6.networksolutionsemail.com>: > Hello, > I have a 2 node cluster and stonith is configured for SBD and fence_ipmilan. > fence_ipmilan for node 1 is configured for 0 delay and for node 2

[ClusterLabs] Antw: Corosync 3.0.0 is available at corosync.org!

2018-12-17 Thread Ulrich Windl
>>> Jan Friesse schrieb am 14.12.2018 um 15:06 in Nachricht <991569e4-2430-30f1-1bbc-827be7637...@redhat.com>: [...] > ‑ UDP/UDPU transports are still present, but supports only single ring > (RRP is gone in favor of Knet) and doesn't support encryption [...] I wonder: Is there a migration

[ClusterLabs] Antw: Corosync-qdevice 3.0.0 is available at GitHub!

2018-12-17 Thread Ulrich Windl
Hi! Once again you forgot the one-line summary of what it is ;-) I guess a quorum device... Regards, Ulrich >>> Jan Friesse schrieb am 12.12.2018 um 15:20 in Nachricht : > I am pleased to announce the first stable release of Corosync‑Qdevice > 3.0.0 available immediately from GitHub at >

[ClusterLabs] Antw: Coming in Pacemaker 2.0.1 / 1.1.20: improved fencing history

2018-12-17 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 11.12.2018 um 21:48 in Nachricht <33316cc0570a12255c7de7dd387caee9c5058e37.ca...@redhat.com>: > Hi all, > > I expect to have the first release candidate for Pacemaker 2.0.1 > available soon! It will mostly be a bug fix release, but one > interesting new feature is

[ClusterLabs] Antw: live migration rarely fails seemingly without reason

2018-12-03 Thread Ulrich Windl
Hi! It seems your systems run with non-operative fencing, and the cluster wants to fence a node. Maybe bring the cluster to a clean state first, then repeat the test. Regards, Ulrich >>> "Lentes, Bernd" schrieb am 03.12.2018 um 16:40 in Nachricht

[ClusterLabs] Antw: How to backup?

2018-11-26 Thread Ulrich Windl
>>> lejeczek schrieb am 23.11.2018 um 15:56 in Nachricht <46d2baf6-a03d-9aac-fceb-7bcffb383...@yahoo.co.uk>: > hi guys, > > Do we have tools or maybe outside of the cluster suite there is a way to > backup cluster? > > I'm obviously talking about configuration so potentially cluster could >
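For the backup question above, the stock Pacemaker CLI already covers the common case. This is a sketch using `cibadmin`, which exists in any Pacemaker installation; the file paths are examples, and it must run on a node of a live cluster:

```shell
# Dump the complete CIB (configuration plus current status) to a file
cibadmin --query > /root/cib-backup.xml

# Corosync's configuration is a plain file and can be copied directly
cp /etc/corosync/corosync.conf /root/corosync.conf.bak

# Later, push the saved CIB back into a running cluster
cibadmin --replace --xml-file /root/cib-backup.xml
```

Replacing the whole CIB also restores node/resource status sections, so for a clean re-import on a rebuilt cluster it can be preferable to extract and replace only the configuration section.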

[ClusterLabs] Antw: VirtualDomain & parallel shutdown

2018-11-20 Thread Ulrich Windl
>>> Klechomir schrieb am 20.11.2018 um 11:40 in Nachricht <12860117.ByXx81i3mo@bobo>: > Hi list, > Bumped onto the following issue lately: > > When multiple VMs are given shutdown right one-after-another and the shutdown of > the first VM takes long, the others aren't being shut down at all

[ClusterLabs] Antw: Re: Antw: Placing resource based on least load on a node

2018-11-19 Thread Ulrich Windl
>>> Michael Schwartzkopff schrieb am 20.11.2018 um 08:41 in >>> Nachricht : > Am 20.11.18 um 08:35 schrieb Bernd: >> Am 2018-11-20 08:06, schrieb Ulrich Windl: >>>>>> Bernd schrieb am 20.11.2018 um 07:21 in >>>>>> Nachricht >&

[ClusterLabs] Antw: Announcing Anvil! m2 v2.0.7

2018-11-19 Thread Ulrich Windl
Hi! You forgot the most important piece of information: "What is it?" I guess it's so obvious for you that you forgot to mention. ;-) Regards, Ulrich >>> Digimer schrieb am 20.11.2018 um 08:25 in Nachricht <3ff31468-4052-dda7-7841-4c04985ad...@alteeve.ca>: > *

[ClusterLabs] Antw: Placing resource based on least load on a node

2018-11-19 Thread Ulrich Windl
>>> Bernd schrieb am 20.11.2018 um 07:21 in Nachricht : > Hi, > > I'd like to run a certain bunch of cronjobs from time to time on the > cluster node (four node cluster) that has the lowest load of all four > nodes. > > The parameters wanted for this system yet to build are > > * automatic

[ClusterLabs] Antw: Start Timeout.

2018-11-14 Thread Ulrich Windl
Hi! Maybe syslog provides help on what's going wrong... Regards, Ulrich >>> Michael Gaberkorn schrieb am 14.11.2018 um 13:17 in Nachricht <408efd9b-2c9b-431b-8bb3-108b37dc0...@bd-innovations.com>: > Hello. > > > I installed ha-cluster with Postgresql-11 with high amount of data (5-9 >

[ClusterLabs] Antw: Re: IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Ulrich Windl
>>> Valentin Vidic schrieb am 13.11.2018 um 17:04 in Nachricht <20181113160419.gv3...@gavran.carpriv.carnet.hr>: > On Tue, Nov 13, 2018 at 04:06:34PM +0100, Valentin Vidic wrote: >> Could be some kind of ARP inspection going on in the networking equipment, >> so check switch logs if you have

[ClusterLabs] Antw: Re: Pacemaker auto restarts disabled groups

2018-11-12 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 08.11.2018 um 17:58 in >>> Nachricht <1541696332.5197.3.ca...@redhat.com>: > On Thu, 2018-11-08 at 12:14 +, Ian Underhill wrote: [...] > > Each transition is a set of actions needed to get to the desired state. > "Complete" are actions that were initiated and a

[ClusterLabs] Q: repeating message " cmirrord[17741]: [yEa32lLX] Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN"

2018-11-11 Thread Ulrich Windl
Hi! While analyzing some odd cluster problem in SLES11 SP4, I found this message repeating quite a lot (several times per second) with the same text: [...more...] Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX] Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN Nov 10 22:10:47 h05

[ClusterLabs] Antw: Re: Best-practices for changing networks settings in a cluster?

2018-11-05 Thread Ulrich Windl
>>> Ken Gaillot schrieb am 06.11.2018 um 00:12 in >>> Nachricht <1541459570.5061.11.ca...@redhat.com>: > On Mon, 2018-11-05 at 16:14 -0600, Ryan Thomas wrote: >> I have a two node cluster. I restart the network after making >> changes to the network settings. But, as soon as I restart the >>

[ClusterLabs] Antw: About the Pacemaker

2018-10-24 Thread Ulrich Windl
>>> "T. Ladd Omar" schrieb am 23.10.2018 um 15:06 in >>> Nachricht : > Hi all, I send this message to get some answers for my questions about > Pacemaker. > 1. In order to cleanup start-failed resources automatically, I add > failure-timeout attribute for resources, however, the common way to

[ClusterLabs] Antw: Re: Antw: Any CLVM/DLM users around?

2018-10-02 Thread Ulrich Windl
ace to disappear... > unless you disable fencing for DLM. > > I am now speculating that DLM restarts when the communications fail, and > the theory that disabling startup fencing for DLM > (enable_startup_fencing=0) may be the solution to my problem (reverting my > enable_fencing=0 DLM
