Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-23 Thread Ken Gaillot
On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
> 
> 
> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot wrote:
> 
> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
> >> Ken Gaillot writes:
> >>
> >>> I'm not saying it's a bad idea, just that it's more complicated than it
> >>> first sounds, so it's worth thinking through the implications.
> >>
> >> Thinking about it and looking at how complicated it gets, maybe what
> >> you'd really want, to make it clearer for the user, is the ability to
> >> explicitly configure the behavior, either globally or per-resource. So
> >> instead of having to tweak a set of variables that interact in complex
> >> ways, you'd configure something like rule expressions,
> >>
> >> <recovery>
> >>   <restart attempts="3"/>
> >>   <migrate attempts="1"/>
> >>   <fence/>
> >> </recovery>
> >>
> >> So, try to restart the service 3 times, if that fails migrate the
> >> service, if it still fails, fence the node.
> >>
> >> (obviously the details and XML syntax are just an example)
> >>
> >> This would then replace on-fail, migration-threshold, etc.
> >
> > I must admit I wasn't able to follow the previous emails in this thread on
> > the first pass, which is not the case with this procedural
> > (sequence-ordered) approach.  Though one could argue it doesn't take the
> > type of operation into account, which might again open the door for
> > non-obvious interactions.
> 
> "restart" is the only on-fail value that it makes sense to escalate.
> 
> block/stop/fence/standby are final. Block means "don't touch the
> resource again", so there can't be any further response to failures.
> Stop/fence/standby move the resource off the local node, so failure
> handling is reset (there are 0 failures on the new node to begin with).
> 
> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
> then migrate", but I can't think of a real-world situation where that
> makes sense, 
> 
> 
> really?
> 
> it is not uncommon to hear "I know it's failed, but I don't want the
> cluster to do anything until it's _really_ failed"

Hmm, I guess that would be similar to how monitoring systems such as
Nagios can be configured to send an alert only if N checks in a row
fail. That's useful where transient outages (e.g. a webserver hitting
its request limit) are acceptable for a short time.

I'm not sure that's translatable to Pacemaker. Pacemaker's error count
is not "in a row" but "since the count was last cleared".

"Ignore up to three monitor failures if they occur in a row [or, within
10 minutes?], then try soft recovery for the next two monitor failures,
then ban this node for the next monitor failure." Not sure being able to
say that is worth the complexity.
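
For comparison, the bracketed "within 10 minutes" reading is roughly what the
existing failure-timeout meta-attribute provides: the fail count expires after
a quiet period rather than being counted "in a row". A minimal CIB sketch,
with made-up ids and a Dummy resource standing in for a real one:

  <primitive id="test-rsc" class="ocf" provider="pacemaker" type="Dummy">
    <meta_attributes id="test-rsc-meta">
      <!-- ban the resource from a node after 3 failures there ... -->
      <nvpair id="test-rsc-mt" name="migration-threshold" value="3"/>
      <!-- ... but clear the count once 10 minutes pass with no new failure -->
      <nvpair id="test-rsc-ft" name="failure-timeout" value="10min"/>
    </meta_attributes>
    <operations>
      <!-- a failed monitor gets a soft recovery until the threshold is hit -->
      <op id="test-rsc-monitor" name="monitor" interval="30s" on-fail="restart"/>
    </operations>
  </primitive>

This only approximates "N failures in a row"; it does not ignore the first
failures outright.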

> 
> and it would be a significant re-implementation of "ignore"
> (which currently ignores the state of having failed, as opposed to a
> particular instance of failure).
> 
> 
> agreed
>  
> 
> 
> What the interface needs to express is: "If this operation fails,
> optionally try a soft recovery [always stop+start], but if <N> failures
> occur on the same node, proceed to a [configurable] hard recovery".
> 
> And of course the interface will need to be different depending on how
> certain details are decided, e.g. whether any failures count toward <N>
> or just failures of one particular operation type, and whether the hard
> recovery type can vary depending on what operation failed.
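
For reference, the pieces that exist today are the per-operation on-fail
setting (so the immediate response can already differ by operation type) and
the per-resource migration-threshold (whose fail count is shared by all
operations on a node). A hedged sketch with made-up names:

  <primitive id="db" class="ocf" provider="pacemaker" type="Dummy">
    <meta_attributes id="db-meta">
      <!-- after 2 failures on a node, ban the resource from that node -->
      <nvpair id="db-mt" name="migration-threshold" value="2"/>
    </meta_attributes>
    <operations>
      <!-- a failed monitor gets a soft recovery (stop+start) -->
      <op id="db-monitor" name="monitor" interval="10s" on-fail="restart"/>
      <!-- a failed stop cannot be recovered locally, so fence the node -->
      <op id="db-stop" name="stop" interval="0" on-fail="fence"/>
    </operations>
  </primitive>

What is missing, per the discussion above, is a way to configure the
escalation target (the "hard recovery") once the threshold is reached.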

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Linux-ha-dev] Announcing crmsh release 2.1.7

2016-09-23 Thread Ken Gaillot
On 09/23/2016 06:59 AM, Kostiantyn Ponomarenko wrote:
>>> Out of curiosity: What do you use it for, where the two_node option
> is not sufficient?
> 
> Along with starting the cluster with two nodes, I also need the
> possibility of starting the cluster with only one node.
> The "two_node" option doesn't provide that.

Actually it can, if you use "two_node: 1" with "wait_for_all: 0".

The risk with that configuration is that both nodes can start without
seeing each other, and both start resources.
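
A minimal corosync.conf quorum section for that combination might look like
the sketch below (surrounding sections omitted; the split-brain caveat above
still applies):

  quorum {
      provider: corosync_votequorum
      two_node: 1
      # two_node normally implies wait_for_all: 1; overriding it lets a
      # single node start and claim quorum without ever seeing its peer,
      # which is exactly the risk described above
      wait_for_all: 0
  }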

> 
> Thank you,
> Kostia
> 
> On Fri, Sep 2, 2016 at 11:33 AM, Kristoffer Grönlund wrote:
> 
> Kostiantyn Ponomarenko writes:
> 
> > Hi,
> >
> >>> If "scripts: no-quorum-policy=ignore" is becoming depreciated
> > Are there any plans to get rid of this option?
> > Am I missing something?
> 
> The above is talking about crmsh cluster configuration scripts, not core
> Pacemaker. As far as I know, no-quorum-policy=ignore is not being
> deprecated in Pacemaker.
> 
> However, it is no longer the recommended configuration for two node
> clusters.
> 
> >
> > PS: this option is very useful (vital) to me. And "two_node" option won't
> > replace it.
> >
> 
> Out of curiosity: What do you use it for, where the two_node option is
> not sufficient?
> 
> Cheers,
> Kristoffer
> 
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com 



Re: [ClusterLabs] [Linux-ha-dev] Announcing crmsh release 2.1.7

2016-09-23 Thread Kostiantyn Ponomarenko
>> Out of curiosity: What do you use it for, where the two_node option is not
>> sufficient?

Along with starting the cluster with two nodes, I also need the possibility
of starting the cluster with only one node.
The "two_node" option doesn't provide that.

Thank you,
Kostia

On Fri, Sep 2, 2016 at 11:33 AM, Kristoffer Grönlund wrote:

> Kostiantyn Ponomarenko writes:
>
> > Hi,
> >
> >>> If "scripts: no-quorum-policy=ignore" is becoming depreciated
> > Are there any plans to get rid of this option?
> > Am I missing something?
>
> The above is talking about crmsh cluster configuration scripts, not core
> Pacemaker. As far as I know, no-quorum-policy=ignore is not being
> deprecated in Pacemaker.
>
> However, it is no longer the recommended configuration for two node
> clusters.
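
For reference, the option under discussion maps to the Pacemaker cluster
property below, shown as a hedged CIB XML sketch (the property-set id is the
conventional one); the corosync-level two_node setting is the usually
recommended alternative for two-node clusters:

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <!-- keep managing resources even if this partition loses quorum -->
      <nvpair id="opt-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
    </cluster_property_set>
  </crm_config>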
>
> >
> > PS: this option is very useful (vital) to me. And "two_node" option won't
> > replace it.
> >
>
> Out of curiosity: What do you use it for, where the two_node option is
> not sufficient?
>
> Cheers,
> Kristoffer
>
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
>


Re: [ClusterLabs] kind=Optional order constraint not working at startup

2016-09-23 Thread Auer, Jens
Hi,

> But if A can tolerate an outage of B, why does it matter whether A is started
> before or after B? By the same logic it should be able to reconnect once B is
> up? At least that is what I'd expect.
In our case B is the file system resource that stores the configuration file
for resource A. Resource A is a cloned resource that is started on both servers
in our cluster. On the active node, A should read the config file from the
shared file system. On the passive node it reads a default file. After that the
config file is not read anymore and thus the shared filesystem can go down and
up again without disturbing the other resource.
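
For reference, the kind=Optional ordering from the subject line would look
roughly like the sketch below in CIB XML (resource ids are made up). An
optional order only applies when both actions fall into the same transition,
so it does not force the clone to restart when the filesystem later stops or
moves:

  <rsc_order id="order-fs-then-app" kind="Optional"
             first="shared-fs" then="app-clone"/>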

After moving the filesystem to the passive node for failover, the process
updates itself by reading the configuration from the new ini file. This
requires that the shared filesystem is started on the node, but I don't want
to restart the process for internal reasons.

I could start the processes before the shared filesystem is started and then
always re-sync. However, this will confuse the users because they don't expect
this to happen.

In the end we probably will not go with cloned resources and just start them
cleanly after the shared filesystem is started on a node. This is much simpler
and will solve the ordering problems here. It should also be possible to put
everything in a group as they are additionally co-located.
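
A hedged sketch of that group-based layout, with made-up resource names and
parameters; members of a group are implicitly colocated and started in the
order listed, so no separate order or colocation constraints are needed:

  <group id="grp-app">
    <primitive id="shared-fs" class="ocf" provider="heartbeat" type="Filesystem">
      <instance_attributes id="shared-fs-ia">
        <nvpair id="shared-fs-dev" name="device" value="/dev/vg_shared/lv_config"/>
        <nvpair id="shared-fs-dir" name="directory" value="/srv/config"/>
        <nvpair id="shared-fs-fstype" name="fstype" value="xfs"/>
      </instance_attributes>
    </primitive>
    <!-- Dummy stands in for the real application resource agent -->
    <primitive id="app-proc" class="ocf" provider="pacemaker" type="Dummy"/>
  </group>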

Cheers,
  Jens