Re: [ClusterLabs] Debugging problems with resource timeout without any actions from cluster

2017-10-17 Thread Ken Gaillot
On Tue, 2017-10-17 at 15:30 +0600, Sergey Korobitsin wrote:
> Ken Gaillot ☫ → To Cluster Labs - All topics related to open-source
> clustering welcomed @ Thu, Oct 12, 2017 09:47 -0500
> 
> Thanks for the answer, Ken,
> 
> > > I found several ways to achieve that:
> > > 
> > > 1. Put cluster in maintenance mode (as described here:
> > >    https://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters/)
> > > 
> > >    As far as I understand, services will be monitored, all logs written,
> > >    etc., but no action in case of failures will be taken. Is that right?
> > 
> > Actually, maintenance mode stops all monitors (except those with
> > role=Stopped, which ensure a service is not running).
> 
> OK, got it.
> 
> > > 2. Put the particular resource to unmanaged mode, as described here:
> > >    http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#s-monitoring-unmanaged
> > 
> > Disabling starts and stops is the exact purpose of unmanaged, so this
> > is one way to get what you want. FYI you can also set this as a global
> > default for all resources by setting it in the resource defaults
> > section of the configuration.
> 
> OK, got it too.
> 
> > > 3. Start all resources and remove start and stop operations from them.
> > 
> > :-O
> 
> This is a kinda quirky way, but it exists! :-)
> 
> > > Which is the best way to achieve my purpose? I would like cluster to run
> > > as usual (and logging as usual or with trace on problematic resource),
> > > but no action in case of monitor failure should be taken.
> > 
> > That's actually a different goal, also easily accomplished, by setting
> > on-fail=ignore on the monitor operation. From the sound of it, this is
> > closer to what you want, since the cluster is still allowed to
> > start/stop resources when you standby a node, etc.
> 
> I'll try this one.
> 
> > You could also delete the recurring monitor operation from the
> > configuration, and it wouldn't run at all. But keeping it and setting
> > on-fail=ignore lets you see failures in cluster status.
> > 
> > However, I'm not sure bypassing the monitor is the best solution to
> > this problem. If the problem is simply that your database monitor can
> > legitimately take longer than 20 seconds in normal operation, then
> > raise the timeout as needed.
> 
> I want to determine why it needed more than 20 seconds, and under what
> circumstances.

Ah, excellent, that's what on-fail=ignore is useful for :-)

-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Debugging problems with resource timeout without any actions from cluster

2017-10-17 Thread Sergey Korobitsin
Ken Gaillot ☫ → To Cluster Labs - All topics related to open-source clustering 
welcomed @ Thu, Oct 12, 2017 09:47 -0500

Thanks for the answer, Ken,

> > I found several ways to achieve that:
> > 
> > 1. Put cluster in maintenance mode (as described here:
> >    https://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters/)
> > 
> >    As far as I understand, services will be monitored, all logs written,
> >    etc., but no action in case of failures will be taken. Is that right?
> 
> Actually, maintenance mode stops all monitors (except those with
> role=Stopped, which ensure a service is not running).

OK, got it.

> > 2. Put the particular resource to unmanaged mode, as described here:
> >    http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#s-monitoring-unmanaged
> 
> Disabling starts and stops is the exact purpose of unmanaged, so this
> is one way to get what you want. FYI you can also set this as a global
> default for all resources by setting it in the resource defaults
> section of the configuration.

OK, got it too.

> > 3. Start all resources and remove start and stop operations from
> > them.
> 
> :-O

This is a kinda quirky way, but it exists! :-)

> > Which is the best way to achieve my purpose? I would like cluster to run
> > as usual (and logging as usual or with trace on problematic resource),
> > but no action in case of monitor failure should be taken.
> 
> That's actually a different goal, also easily accomplished, by setting
> on-fail=ignore on the monitor operation. From the sound of it, this is
> closer to what you want, since the cluster is still allowed to
> start/stop resources when you standby a node, etc.

I'll try this one.

> You could also delete the recurring monitor operation from the
> configuration, and it wouldn't run at all. But keeping it and setting
> on-fail=ignore lets you see failures in cluster status.

> However, I'm not sure bypassing the monitor is the best solution to
> this problem. If the problem is simply that your database monitor can
> legitimately take longer than 20 seconds in normal operation, then
> raise the timeout as needed.

I want to determine why it needed more than 20 seconds, and under what
circumstances.

-- 
Bright regards, Sergey Korobitsin,
Chief Research Officer
Arta Software, http://arta.kz/
xmpp:underta...@jabber.arta.kz

not to resist this tendency; by the most decisive leap forward - an idea,
and by the most creative of all actions - idleness.
  -- Tristan Tzara



Re: [ClusterLabs] Debugging problems with resource timeout without any actions from cluster

2017-10-12 Thread Ken Gaillot
On Thu, 2017-10-12 at 17:13 +0600, Sergey Korobitsin wrote:
> Hello,
> I am experiencing a strange problem with the MySQL resource agent from
> Percona: sometimes its monitor operation is killed by lrmd due to a
> timeout, like this:
> 
> Oct 12 12:26:46 sde1 lrmd[14812]:  warning: p_mysql_monitor_5000 process (PID 28991) timed out
> Oct 12 12:27:15 sde1 lrmd[14812]:  warning: p_mysql_monitor_5000:28991 - timed out after 2ms
> Oct 12 12:27:15 sde1 crmd[14815]:    error: Result of monitor operation for p_mysql on sde1: Timed Out
> 
> I am investigating the problem now, but the trouble is that no
> extraordinary DB load or anything like that was detected. When those
> timeouts happen, Pacemaker tries to move MySQL (and all resources
> colocated with it) to the other node (I have a two-node cluster). For
> some reason I have the other node in standby mode now, so Pacemaker
> moves the resources back, restarting them. All this moving and
> restarting makes our services unavailable for some time, which is
> unwanted.
> 
> So, my purpose is to keep the cluster with MySQL and the other colocated
> resources up, but with resource monitoring only, and without starting,
> stopping, promoting, or demoting resources, etc.
> 
> I found several ways to achieve that:
> 
> 1. Put cluster in maintenance mode (as described here:
>    https://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters/)
> 
>    As far as I understand, services will be monitored, all logs written,
>    etc., but no action in case of failures will be taken. Is that right?

Actually, maintenance mode stops all monitors (except those with
role=Stopped, which ensure a service is not running).
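
For reference, a minimal sketch in crm shell syntax (the same syntax as
the p_mysql configuration quoted at the end of this message), assuming
crmsh is the tool in use. Maintenance mode is a cluster-wide property,
and the line below is meant to be entered under "crm configure":

```
# Cluster-wide: the cluster stops managing all resources and, as noted
# above, stops their recurring monitors (except role=Stopped ones).
property maintenance-mode=true
```

Setting the property back to false returns the cluster to normal
operation.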

> 
> 2. Put the particular resource to unmanaged mode, as described here:
>    http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#s-monitoring-unmanaged

Disabling starts and stops is the exact purpose of unmanaged, so this
is one way to get what you want. FYI you can also set this as a global
default for all resources by setting it in the resource defaults
section of the configuration.
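
A minimal sketch of both variants in crm shell syntax, assuming crmsh
is the tool in use. The first line is a fragment that would be added to
the existing p_mysql primitive definition; the second is a standalone
line under "crm configure":

```
# Per resource: a meta attribute on the p_mysql primitive.
meta is-managed=false

# Or as a global default for all resources:
rsc_defaults is-managed=false
```

With either variant the cluster no longer starts or stops the affected
resources until is-managed is set back to true or removed.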

> 3. Start all resources and remove start and stop operations from
> them.

:-O

> Which is the best way to achieve my purpose? I would like cluster to run
> as usual (and logging as usual or with trace on problematic resource),
> but no action in case of monitor failure should be taken.

That's actually a different goal, also easily accomplished, by setting
on-fail=ignore on the monitor operation. From the sound of it, this is
closer to what you want, since the cluster is still allowed to
start/stop resources when you standby a node, etc.

You could also delete the recurring monitor operation from the
configuration, and it wouldn't run at all. But keeping it and setting
on-fail=ignore lets you see failures in cluster status.
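
As a sketch, assuming the p_mysql definition quoted at the end of this
message is edited with crm shell, the two monitor operations might
become:

```
# Only the monitor operations of p_mysql are shown; the rest of the
# primitive stays as it is.
op monitor interval=5s role=Master on-fail=ignore OCF_CHECK_LEVEL=1 \
op monitor interval=2s role=Slave on-fail=ignore OCF_CHECK_LEVEL=1
```

The monitors keep running at the same intervals, so a timeout is still
logged and shown in cluster status, but it no longer triggers recovery.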

However, I'm not sure bypassing the monitor is the best solution to
this problem. If the problem is simply that your database monitor can
legitimately take longer than 20 seconds in normal operation, then
raise the timeout as needed.
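
The monitor operations in the configuration quoted below set no timeout
of their own, so presumably the 20-second cluster default applies. A
sketch of raising it per operation in crm shell syntax follows; the 60s
value is only a placeholder, not a recommendation:

```
# timeout=60s is an illustrative value; pick one based on how long the
# monitor is actually observed to take.
op monitor interval=5s role=Master timeout=60s OCF_CHECK_LEVEL=1 \
op monitor interval=2s role=Slave timeout=60s OCF_CHECK_LEVEL=1
```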

> Here is the configuration of MySQL resource:
> 
> primitive p_mysql ocf:percona:mysql \
>     params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" \
>         socket="/var/run/mysqld/mysqld.sock" replication_user=slave_user \
>         replication_passwd=password max_slave_lag=180 \
>         evict_outdated_slaves=false binary="/usr/sbin/mysqld" \
>         test_user=test test_passwd=test \
>     op start interval=0 timeout=60s \
>     op stop interval=0 timeout=60s \
>     op monitor interval=5s role=Master OCF_CHECK_LEVEL=1 \
>     op monitor interval=2s role=Slave OCF_CHECK_LEVEL=1
> 
-- 
Ken Gaillot 
