Re: [ClusterLabs] dlm_controld 4.0.4 exits when crmd is fencing another node

2016-04-26 Thread Valentin Vidic
On Fri, Jan 22, 2016 at 07:57:52PM +0300, Vladislav Bogdanov wrote: > Tried reverting this one and a51b2bb ("If an error occurs unlink the > lock file and exit with status 1") one-by-one and both together, the > same result. > > So problem seems to be somewhere deeper. I've got the same

Re: [ClusterLabs] pcs testsuite status

2016-06-29 Thread Valentin Vidic
On Wed, Jun 29, 2016 at 10:31:42AM +0200, Tomas Jelinek wrote: > This should be replaceable by any agent which does not provide unfencing, > i.e. it does not have on_target="1" automatic="1" attributes in > <action name="on" />. You may need to experiment with a few agents to find one which > works. Just

Re: [ClusterLabs] eventmachine gem in pcsd

2016-06-30 Thread Valentin Vidic
On Thu, Jun 30, 2016 at 01:27:25PM +0200, Tomas Jelinek wrote: > It seems eventmachine can be safely dropped as all tests passed without it. Great, thanks for confirming. -- Valentin

[ClusterLabs] pcs testsuite status

2016-06-28 Thread Valentin Vidic
I'm trying to run pcs tests on Debian unstable, but there are some strange failures like diffs failing due to an additional space at the end of the line or just with "Error: cannot load cluster status, xml does not conform to the schema". Any idea what could be the issue here? I assume the tests

Re: [ClusterLabs] simple active/active router using pacemaker+corosync

2017-01-26 Thread Valentin Vidic
On Thu, Jan 26, 2017 at 12:10:24PM +0100, Arturo Borrero Gonzalez wrote: > I have a rather simple 2 nodes active/active router using pacemaker+corosync. > > Why active-active? Well, one node holds the virtual IPv4 resources and > the other node holds the virtual IPv6 resources. > On failover,

Re: [ClusterLabs] epic fail

2017-07-23 Thread Valentin Vidic
On Sun, Jul 23, 2017 at 07:27:03AM -0500, Dmitri Maziuk wrote: > So yesterday I ran yum update that pulled in the new pacemaker and tried to > restart it. The node went into its usual "can't unmount drbd because kernel > is using it" and got stonith'ed in the middle of yum transaction. The end >

Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 11:01:26AM -0500, Dimitri Maziuk wrote: > Lsof/fuser show the PID of the process holding FS open as "kernel". That could be the NFS server running in the kernel. -- Valentin

Re: [ClusterLabs] epic fail

2017-07-24 Thread Valentin Vidic
On Mon, Jul 24, 2017 at 10:38:40AM -0500, Ken Gaillot wrote: > Standby is not necessary, it's just a cautious step that allows the > admin to verify that all resources moved off correctly. The restart that > yum does should be sufficient for pacemaker to move everything. > > A restart shouldn't

Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles

2017-06-30 Thread Valentin Vidic
On Fri, Mar 31, 2017 at 05:43:02PM -0500, Ken Gaillot wrote: > Here's an example of the CIB XML syntax (higher-level tools will likely > provide a more convenient interface): > [XML example not preserved in the archive] Would it be possible to make this a bit more generic, like [XML example not preserved], so we have support for other container

Re: [ClusterLabs] Coming in Pacemaker 1.1.17: container bundles

2017-07-01 Thread Valentin Vidic
On Fri, Jun 30, 2017 at 12:46:29PM -0500, Ken Gaillot wrote: > The challenge is that some properties are docker-specific and other > container engines will have their own specific properties. > > We decided to go with a tag for each supported engine -- so if we add > support for rkt, we'll add a

Re: [ClusterLabs] corosync service not automatically started

2017-10-10 Thread Valentin Vidic
On Tue, Oct 10, 2017 at 10:35:17AM +0200, Václav Mach wrote: > Oct 10 10:27:05 r1nren.et.cesnet.cz corosync[709]: [QB] Denied > connection, is not ready (709-1337-18) > Oct 10 10:27:06 r1nren.et.cesnet.cz corosync[709]: [QB] Denied > connection, is not ready (709-1337-18) > Oct 10

Re: [ClusterLabs] corosync service not automatically started

2017-10-10 Thread Valentin Vidic
On Tue, Oct 10, 2017 at 11:26:24AM +0200, Václav Mach wrote: > # The primary network interface > allow-hotplug eth0 > iface eth0 inet dhcp > # This is an autoconfigured IPv6 interface > iface eth0 inet6 auto allow-hotplug or dhcp could be causing problems. You can try disabling corosync and
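A sketch of that kind of test, assuming the usual systemd unit names:

  # Stop the cluster stack from starting automatically at boot...
  systemctl disable corosync pacemaker
  # ...then start it by hand once the network is definitely up, and watch
  # whether the "Denied connection" messages from the QB layer come back.
  systemctl start corosync pacemaker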

Re: [ClusterLabs] PostgreSQL Automatic Failover (PAF) v2.2.0

2017-10-05 Thread Valentin Vidic
On Tue, Sep 12, 2017 at 04:48:19PM +0200, Jehan-Guillaume de Rorthais wrote: > PostgreSQL Automatic Failover (PAF) v2.2.0 has been released on September > 12th 2017 under the PostgreSQL licence. > > See: https://github.com/dalibo/PAF/releases/tag/v2.2.0 > > PAF is a PostgreSQL resource agent for

Re: [ClusterLabs] trouble with IPaddr2

2017-10-11 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 10:51:04AM +0200, Stefan Krueger wrote: > primitive HA_IP-Serv1 IPaddr2 \ > params ip=172.16.101.70 cidr_netmask=16 \ > op monitor interval=20 timeout=30 on-fail=restart nic=bond0 \ > meta target-role=Started There might be something wrong with the

Re: [ClusterLabs] trouble with IPaddr2

2017-10-11 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 01:29:40PM +0200, Stefan Krueger wrote: > ohh damn.. thanks a lot for this hint.. I deleted all the IPs on enp4s0f0, and > then it works.. > but could you please explain why it now works? why does it have a problem with these > IPs? AFAICT, it found a better interface with that
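A quick way to see what the auto-detection has to choose from (addresses taken from the thread):

  # List every interface already carrying an address in 172.16.0.0/16
  ip -o addr show | grep ' 172\.16\.'
  # Show which interface the routing table would pick for the VIP itself
  ip route get 172.16.101.70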

Re: [ClusterLabs] trouble with IPaddr2

2017-10-12 Thread Valentin Vidic
On Wed, Oct 11, 2017 at 02:36:24PM +0200, Valentin Vidic wrote: > AFAICT, it found a better interface with that subnet and tried > to use it instead of the one specified in the parameters :) > > But maybe IPaddr2 should just skip interface auto-detection > if an explicit inte

Re: [ClusterLabs] XenServer guest and host watchdog

2017-09-08 Thread Valentin Vidic
On Fri, Sep 08, 2017 at 12:57:12PM +, Mark Syms wrote: > As we discussed regarding the handling of watchdog in XenServer, both > guest and host, I've had a discussion with our subject matter expert > (Andrew, cc'd) on this topic. The guest watchdogs are handled by a > hardware timer in the

Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-10 Thread Valentin Vidic
On Sun, Sep 10, 2017 at 08:27:47AM +0200, Ferenc Wágner wrote: > Confirmed: setting watchdog_device: off cluster wide got rid of the > above warnings. Interesting, what brand or version of IPMI has this problem? -- Valentin

Re: [ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

2017-09-11 Thread Valentin Vidic
On Mon, Sep 11, 2017 at 04:18:08PM +0200, Klaus Wenninger wrote: > Just for my understanding: You are using watchdog-handling in corosync? Corosync package in Debian gets built with --enable-watchdog so by default it takes /dev/watchdog during runtime. Don't think SUSE or RedHat packages get
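A sketch of how to check that on a node; the directive name is the one mentioned above, but the exact section it belongs in should be verified against corosync.conf(5):

  # See whether corosync currently holds the watchdog device open.
  fuser -v /dev/watchdog
  # Turning it off means adding "watchdog_device: off" to /etc/corosync/corosync.conf
  # on every node and restarting corosync.
  grep -n watchdog /etc/corosync/corosync.conf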

Re: [ClusterLabs] XenServer guest and host watchdog

2017-09-09 Thread Valentin Vidic
On Fri, Sep 08, 2017 at 09:39:26PM +0100, Andrew Cooper wrote: > Yes.  The internal mechanism of the host watchdog is to use one > performance counter to count retired instructions and generate an NMI > roughly once every half second (give or take C and P states). > > Separately, there is a one

Re: [ClusterLabs] PostgreSQL Automatic Failover (PAF) v2.2.0

2017-10-05 Thread Valentin Vidic
On Thu, Oct 05, 2017 at 08:55:59PM +0200, Jehan-Guillaume de Rorthais wrote: > It doesn't seem impossible, however I'm not sure of the complexity around > this. > > You would have to either hack PAF and detect failover/migration or create a > new > RA that would always be part of the transition

Re: [ClusterLabs] Antw: Re: pcsd processes using 100% CPU

2018-07-23 Thread Valentin Vidic
On Thu, May 24, 2018 at 12:16:16AM -0600, Casey & Gina wrote: > Tried that, it doesn't seem to do anything but prefix the lines with the pid: > > [pid 24923] sched_yield() = 0 > [pid 24923] sched_yield() = 0 > [pid 24923] sched_yield() = 0 We managed to

Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:31:13AM -0400, Patrick Whitney wrote: > But, when I invoke the "human" stonith power device (i.e. I turn the node > off), the other node collapses... > > In the logs I supplied, I basically do this: > > 1. stonith fence (With fence scsi) After fence_scsi finishes the

Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:02:06AM -0400, Patrick Whitney wrote: > What I'm having trouble understanding is why dlm flattens the remaining > "running" node when the already fenced node is shutdown... I'm having > trouble understanding how power fencing would cause dlm to behave any > differently

Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 09:13:08AM -0400, Patrick Whitney wrote: > So when the cluster suggests that DLM is shutdown on coro-test-1: > Clone Set: dlm-clone [dlm] > Started: [ coro-test-2 ] > Stopped: [ coro-test-1 ] > > ... DLM isn't actually stopped on 1? If you can connect to the

Re: [ClusterLabs] 2 node cluster dlm/clvm trouble

2018-09-11 Thread Valentin Vidic
On Tue, Sep 11, 2018 at 04:14:08PM +0300, Vladislav Bogdanov wrote: > And that is not an easy task sometimes, because main part of dlm runs in > kernel. > In some circumstances the only option is to forcibly reset the node. Exactly, killing the power on the node will stop the DLM code running in

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Valentin Vidic
On Wed, Jul 11, 2018 at 08:01:46PM +0200, Salvatore D'angelo wrote: > Yes, but doing what you suggested the system finds that sysV is > installed and tries to leverage the update-rc.d scripts and the failure > occurs: > > root@pg1:~# systemctl enable corosync > corosync.service is not a native

Re: [ClusterLabs] Problem with pacemaker init.d script

2018-07-11 Thread Valentin Vidic
On Wed, Jul 11, 2018 at 04:31:31PM -0600, Casey & Gina wrote: > Forgive me for interjecting, but how did you upgrade on Ubuntu? I'm > frustrated with limitations in 1.1.14 (particularly in PCS so not sure > if it's relevant), and Ubuntu is ignoring my bug reports, so it would > be great to

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Valentin Vidic
On Mon, Mar 12, 2018 at 04:31:46PM +0100, Klaus Wenninger wrote: > Nope. Whenever the cluster is completely down... > Otherwise nodes would come up - if not seeing each other - > happily with both starting all services because they don't > know what already had been running on the other node. >

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Valentin Vidic
On Mon, Mar 12, 2018 at 01:58:21PM +0100, Klaus Wenninger wrote: > But isn't dlm directly interfering with corosync so > that it would get the quorum state from there? > As you have 2-node set probably on a 2-node-cluster > this would - after both nodes down - wait for all > nodes up first. Isn't

Re: [ClusterLabs] chap lio-t / iscsitarget disabled - why?

2018-04-03 Thread Valentin Vidic
On Tue, Apr 03, 2018 at 04:48:00PM +0200, Stefan Friedel wrote: > we've a running drbd - iscsi cluster (two nodes Debian stretch, pacemaker / > corosync, res group w/ ip + iscsitarget/lio-t + iscsiluns + lvm etc. on top of > drbd etc.). Everything is running fine - but we didn't manage to get CHAP

Re: [ClusterLabs] False negative from kamailio resource agent

2018-03-23 Thread Valentin Vidic
On Thu, Mar 22, 2018 at 03:36:55PM -0400, Alberto Mijares wrote: > Straight to the question: how can I manually run a resource agent > script (kamailio) simulating the pacemaker's environment without > actually having pacemaker running? You should be able to run it with something like: #
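A minimal sketch of doing that by hand, assuming the standard agent path; the OCF_RESKEY_* parameter shown is only a placeholder for whatever the cluster configuration actually passes:

  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_pidfile=/var/run/kamailio.pid   # placeholder parameter, adjust to the real config
  /usr/lib/ocf/resource.d/heartbeat/kamailio monitor; echo "monitor rc=$?"
  /usr/lib/ocf/resource.d/heartbeat/kamailio start;   echo "start rc=$?"
  # ocf-tester from resource-agents can also drive a full test cycle:
  ocf-tester -n kamailio-test -o pidfile=/var/run/kamailio.pid /usr/lib/ocf/resource.d/heartbeat/kamailio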

Re: [ClusterLabs] Position of pacemaker in today's HA world

2018-10-05 Thread Valentin Vidic
On Fri, Oct 05, 2018 at 11:34:10AM -0500, Ken Gaillot wrote: > The next big challenge is that high availability is becoming a subset > of the "orchestration" space in terms of how we fit into IT > departments. Systemd and Kubernetes are the clear leaders in service > orchestration today and likely 

Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-10-11 Thread Valentin Vidic
On Thu, Oct 11, 2018 at 01:25:52PM -0400, Daniel Ragle wrote: > For the 12 second window it *does* work in, it appears as though it works > only on one of the two servers (and always the same one). My twelve seconds > of pings runs continuously then stops; while attempts to hit the Web server >

Re: [ClusterLabs] LIO iSCSI target fails to start

2018-10-11 Thread Valentin Vidic
On Wed, Oct 10, 2018 at 02:36:21PM +0200, Stefan K wrote: > I think my config is correct, but it still fails with "This Target > already exists in configFS" but "targetcli ls" shows nothing. It seems to find something in /sys/kernel/config/target. Maybe it was set up outside of pacemaker somehow?
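A quick check and cleanup sketch (clearconfig wipes the whole LIO configuration, so only run it if nothing else is managed there):

  # Any leftover targets show up under configfs even when targetcli lists nothing.
  ls /sys/kernel/config/target/iscsi/
  # If the leftovers belong to nothing else, wipe the in-kernel LIO configuration:
  targetcli clearconfig confirm=true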

Re: [ClusterLabs] resource-agents v4.2.0 rc1

2018-10-18 Thread Valentin Vidic
On Wed, Oct 17, 2018 at 12:03:18PM +0200, Oyvind Albrigtsen wrote: > - apache: retry PID check. I noticed that the ocft test started failing for apache in this version. Not sure if the test is broken or the agent. Can you check if the test still works for you? Restoring the previous version of

Re: [ClusterLabs] [ClusterLabs Developers] resource-agents v4.2.0 rc1

2018-10-19 Thread Valentin Vidic
On Fri, Oct 19, 2018 at 11:09:34AM +0200, Kristoffer Grönlund wrote: > I wonder if perhaps there was a configuration change as well, since the > return code seems to be configuration related. Maybe something changed > in the build scripts that moved something around? Wild guess, but... Seems to

Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 09:06:56AM -0500, Daniel Ragle wrote: > Thanks, finally getting back to this. Putting a tshark on both nodes and > then restarting the VIP-clone resource shows the pings coming through for 12 > seconds, always on node2, then stop. I.E., before/after those 12 seconds >

Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 04:06:34PM +0100, Valentin Vidic wrote: > Could be some kind of ARP inspection going on in the networking equipment, > so check switch logs if you have access to that. Also it seems to require multicast, so better check for that too :) -- Va

Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 05:04:19PM +0100, Valentin Vidic wrote: > Also it seems to require multicast, so better check for that too :) And while the CLUSTERIP resource seems to work for me in a test cluster, the following clone definition: clone cip-clone cip \ meta clone-max=2 cl
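For comparison, a crm shell sketch of a cloned IPaddr2 in CLUSTERIP mode; the names, address and meta values below are assumptions, not the definition from this cluster:

  crm configure primitive cip ocf:heartbeat:IPaddr2 \
      params ip=192.168.1.100 cidr_netmask=24 clusterip_hash=sourceip-sourceport \
      op monitor interval=30s
  crm configure clone cip-clone cip \
      meta clone-max=2 clone-node-max=2 globally-unique=true interleave=true

The globally-unique=true clone is what puts IPaddr2 into its iptables CLUSTERIP load-sharing mode, which in turn relies on a multicast MAC for the shared IP, hence the note above about checking multicast handling in the switches.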

Re: [ClusterLabs] IPaddr2 works for 12 seconds then stops

2018-11-13 Thread Valentin Vidic
On Tue, Nov 13, 2018 at 11:01:46AM -0600, Ken Gaillot wrote: > Clone instances have a default stickiness of 1 (instead of the usual 0) > so that they aren't needlessly shuffled around nodes every transition. > You can temporarily set an explicit stickiness of 0 to let them > rebalance, then unset
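In pcs terms that temporary change could look like this (clone name taken from earlier in the thread, values illustrative):

  # Drop the stickiness so the clone instances are free to rebalance...
  pcs resource meta VIP-clone resource-stickiness=0
  # ...and once they have moved, remove the override again (empty value unsets it).
  pcs resource meta VIP-clone resource-stickiness=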

Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0

2018-10-09 Thread Valentin Vidic
On Tue, Oct 09, 2018 at 12:07:38PM +0200, Oyvind Albrigtsen wrote: > I've created a PR for the library detection and try/except imports: > https://github.com/ClusterLabs/fence-agents/pull/242 Thanks, I will give it a try right away... -- Valentin

Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0

2018-10-09 Thread Valentin Vidic
On Tue, Oct 02, 2018 at 03:13:51PM +0200, Oyvind Albrigtsen wrote: > ClusterLabs is happy to announce fence-agents v4.3.0. > > The source code is available at: > https://github.com/ClusterLabs/fence-agents/releases/tag/v4.3.0 > > The most significant enhancements in this release are: > - new

Re: [ClusterLabs] [ClusterLabs Developers] fence-agents v4.3.0

2018-10-09 Thread Valentin Vidic
On Tue, Oct 09, 2018 at 10:55:08AM +0200, Oyvind Albrigtsen wrote: > It seems like the if-line should be updated to check for those 2 > libraries (from the imports in the agent). Yes, that might work too. Also would it be possible to make the imports in the openstack agent conditional so the

Re: [ClusterLabs] About fencing stonith

2018-09-26 Thread Valentin Vidic
On Thu, Sep 06, 2018 at 04:47:32PM -0400, Digimer wrote: > It depends on the hardware you have available. In your case, RPi has no > IPMI or similar feature, so you'll need something external, like a > switched PDU. I like the APC AP7900 (or your country's variant), which > you can often get used

Re: [ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 12:41:11PM +0100, Valentin Vidic wrote: > This is what pacemaker says about the resource restarts: > > Jan 16 11:19:08 node1 pacemaker-schedulerd[713]: notice: * Start dlm:1 > ( node2 ) > Jan 16 11:19:08 node1 pacemaker-scheduler

Re: [ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 12:16:04PM +, Andrew Price wrote: > The only thing that stands out to me with this config is the lack of > ordering constraint between dlm and lvmlockd. Not sure if that's the issue > though. They are both in the storage group so the order should be dlm then lockd?
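A group implies both ordering and colocation of its members; if the group were ever split up, the explicit equivalent would be roughly the following (a pcs sketch using the names from the thread):

  pcs constraint order start dlm then start lockd
  pcs constraint colocation add lockd with dlm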

Re: [ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 12:28:59PM +0100, Valentin Vidic wrote: > When node2 is set to standby resources stop running there. However when > node2 is brought back online, it causes the resources on node1 to stop > and then start again which is a bit unexpected? > > Maybe the dep

[ClusterLabs] Unexpected resource restart

2019-01-16 Thread Valentin Vidic
Hi all, I'm testing the following configuration with two nodes: Clone: storage-clone Meta Attrs: interleave=true target-role=Started Group: storage Resource: dlm (class=ocf provider=pacemaker type=controld) Resource: lockd (class=ocf provider=heartbeat type=lvmlockd) Clone:

Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 09:03:21AM -0600, Bryan K. Walton wrote: > The exit code 4 would seem to suggest that storage1 should be fenced. > But the switch ports connected to storage1 are still enabled. > > Am I misreading the logs here? This is a clean reboot, maybe fencing > isn't supposed to

Re: [ClusterLabs] Trying to Understanding crm-fence-peer.sh

2019-01-16 Thread Valentin Vidic
On Wed, Jan 16, 2019 at 04:20:03PM +0100, Valentin Vidic wrote: > I think drbd always calls crm-fence-peer.sh when it becomes disconnected > primary. In this case storage1 has closed the DRBD connection and > storage2 has become a disconnected primary. > > Maybe the probl

Re: [ClusterLabs] Status of Pacemaker 2 support in SBD?

2019-01-11 Thread Valentin Vidic
On Fri, Jan 11, 2019 at 12:42:02PM +0100, wf...@niif.hu wrote: > I opened https://github.com/ClusterLabs/sbd/pull/62 with our current > patches, but I'm just a middle man here. Valentin, do you agree to > upstream these two remaining patches of yours? Sure thing, merge anything you can... --

Re: [ClusterLabs] Upgrading from CentOS 6 to CentOS 7

2019-01-03 Thread Valentin Vidic
On Thu, Jan 03, 2019 at 04:56:26PM -0600, Ken Gaillot wrote: > Right -- not only that, but corosync 1 (CentOS 6) and corosync 2 > (CentOS 7) are not compatible for running in the same cluster. I suppose it is the same situation for upgrading from corosync 2 to corosync 3? -- Valentin

Re: [ClusterLabs] Antw: Re: Issue with DB2 HADR cluster

2019-04-03 Thread Valentin Vidic
On Wed, Apr 03, 2019 at 09:13:58AM +0200, Ulrich Windl wrote: > I'm surprised: Once sbd writes the fence command, it usually takes > less than 3 seconds until the victim is dead. If you power off a > server, the PDU still may have one or two seconds "power reserve", so > the host may not be down

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 09:36:58AM -0600, JCA wrote: > # pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1" > directory="/tmp/Testing" > fstype="ext4" ext4 can only be mounted on one node at a time. If you need to access files on both nodes at the same time then a
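If simultaneous access is really needed, the Filesystem resource would use a cluster filesystem and be cloned; a sketch reusing the names from the thread (fstype and clone options are illustrative, and this also assumes dual-primary DRBD plus dlm underneath):

  pcs -f fs_cfg resource create TestFS Filesystem device="/dev/drbd1" \
      directory="/tmp/Testing" fstype="gfs2" clone interleave=true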

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 12:37:21PM -0400, Digimer wrote: > Cluster filesystems are amazing if you need them, and to be avoided if > at all possible. The overhead from the cluster locking hurts performance > quite a lot, and adds a non-trivial layer of complexity. > > I say this as someone who

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 01:34:52PM -0400, Digimer wrote: > Depending on your fail-over tolerances, I might add NFS to the mix and > have the NFS server run on one node or the other, exporting your ext4 FS > that sits on DRBD in single-primary mode. > > The failover (if the NFS host died) would

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 01:44:06PM -0400, Digimer wrote: > GFS2 notified the peers of disk changes, and DRBD handles actually > copying to changes to the peer. > > Think of DRBD, in this context, as being mdadm RAID, like how writing to > /dev/md0 is handled behind the scenes to write to both

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 07:31:02PM +0100, Valentin Vidic wrote: > Right, but I'm not sure how this would help in the above situation > unless the DRBD can undo the local write that did not succeed on the > peer? Ah, it seems the activity log handles the undo by storing the location of th

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote: > Not when DRBD is configured correctly. You set 'fencing > resource-and-stonith;' and set the appropriate fence handler. This tells > DRBD to not proceed with a write while a node is in an unknown state > (which happens when the node stops

Re: [ClusterLabs] Question on sharing data with DRDB

2019-03-20 Thread Valentin Vidic
On Wed, Mar 20, 2019 at 02:01:07PM -0400, Digimer wrote: > On 2019-03-20 2:00 p.m., Valentin Vidic wrote: > > On Wed, Mar 20, 2019 at 01:47:56PM -0400, Digimer wrote: > >> Not when DRBD is configured correctly. You sent 'fencing > >> resource-and-stonith;' and set th

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 10:23:17PM +, Eric Robinson wrote: > I'm looking through the docs but I don't see how to set the on-fail value for > a resource. It is not set on the resource itself but on each of the actions (monitor, start, stop). -- Valentin
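For example (a sketch with pcs, reusing the resource name from the thread; the interval and on-fail values are only illustrative):

  # on-fail is set per operation, not on the resource itself.
  pcs resource update p_mysql_002 \
      op monitor interval=30s on-fail=restart \
      op stop timeout=45s on-fail=block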

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 09:03:43PM +, Eric Robinson wrote: > Here are the relevant corosync logs. > > It appears that the stop action for resource p_mysql_002 failed, and > that caused a cascading series of service changes. However, I don't > understand why, since no other resources are

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 09:33:42PM +, Eric Robinson wrote: > I just noticed that. I also noticed that the lsb init script has a > hard-coded stop timeout of 30 seconds. So if the init script waits > longer than the cluster resource timeout of 15s, that would cause the Yes, you should use
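So the cluster-side stop timeout needs to be at least as long as the init script can take; a sketch with pcs (resource name and value assumed):

  # Give the stop operation more headroom than the init script's hard-coded 30s,
  # so a slow but successful stop is not treated as a failure.
  pcs resource update p_mysql_002 op stop timeout=45s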

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 08:34:21PM +, Eric Robinson wrote: > Why is it that when one of the resources that start with p_mysql_* > goes into a FAILED state, all the other MySQL services also stop? Perhaps stop is not working correctly for these lsb services, so for example stopping

Re: [ClusterLabs] Why Do All The Services Go Down When Just One Fails?

2019-02-16 Thread Valentin Vidic
On Sat, Feb 16, 2019 at 08:50:57PM +, Eric Robinson wrote: > Which logs? You mean /var/log/cluster/corosync.log? On the DC node pacemaker will be logging the actions it is trying to run (start or stop some resources). > But even if the stop action is resulting in an error, why would the >

Re: [ClusterLabs] Announcing hawk-apiserver, now in ClusterLabs

2019-02-12 Thread Valentin Vidic
On Tue, Feb 12, 2019 at 08:00:38PM +0100, Kristoffer Grönlund wrote: > One final note: hawk-apiserver uses a project called go-pacemaker > located at https://github.com/krig/go-pacemaker. I intend to transfer > this to ClusterLabs as well. go-pacemaker is still somewhat rough around > the edges,

Re: [ClusterLabs] Corosync crash

2019-05-07 Thread Valentin Vidic
On Tue, May 07, 2019 at 09:59:03AM +0300, Klecho wrote: > During the weekend my corosync daemon suddenly died without anything in the > logs, except this: > > May  5 20:39:16 ZZZ kernel: [1605277.136049] traps: corosync[2811] trap > invalid opcode ip:5635c376f2eb sp:7ffc3e109950 error:0 in >