Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Casey & Gina
> Well, that does not sound very polite to user :)

The thing that really threw me off was Pacemaker rebooting the node as soon as 
I tried to start the cluster on it without the database running.

Is there a way to prevent this from happening?  Some way to indicate to 
Pacemaker, "Hey, I'm not willing/able to start the resource here because it 
appears to be in a corrupt state", while not causing the node to be fenced 
because it thinks that the resource is running when it isn't?

It would be perfectly safe to not fence the node, in this case...
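
For readers of the archive, a minimal sketch of the recovery defaults that make this kind of fencing possible. It assumes the documented Pacemaker defaults for the on-fail operation option: restart for most operations and, for stop, fence when STONITH is enabled and block otherwise. The function names are hypothetical, and whether this is the exact path taken on the node in this thread is not confirmed here.

```python
def default_on_fail(operation: str, stonith_enabled: bool) -> str:
    """Return the default on-fail behaviour assumed for an operation."""
    if operation == "stop":
        # A resource that cannot be stopped cannot be recovered safely,
        # so the node is fenced when fencing is available.
        return "fence" if stonith_enabled else "block"
    # start, monitor, promote, demote: recover by restarting the resource.
    return "restart"


def outcome_after_failed_start(stop_succeeds: bool, stonith_enabled: bool) -> str:
    """Toy model: recovering from a failed start begins with a stop."""
    first_reaction = default_on_fail("start", stonith_enabled)
    if stop_succeeds:
        return first_reaction
    # The stop issued during recovery failed as well, so the stop
    # operation's own on-fail default decides what happens next.
    return default_on_fail("stop", stonith_enabled)


if __name__ == "__main__":
    print(outcome_after_failed_start(stop_succeeds=False, stonith_enabled=True))
```

In this model a node holding no healthy resource can still end up fenced: it is enough that the agent reports an error and the follow-up stop does not succeed either.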

-- 
Casey
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Ken Gaillot
On Thu, 2018-05-31 at 22:43 +0200, Jehan-Guillaume de Rorthais wrote:
> On Thu, 31 May 2018 22:52:12 +0300
> Andrei Borzenkov  wrote:
> 
> > 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > > Sorry for getting back to you so late.
> > > 
> > > On Fri, 25 May 2018 11:58:59 -0600
> > > Casey & Gina  wrote:
> > >   
> > > > > > On May 25, 2018, at 7:01 AM, Casey Allen Shobe
> > > > > > wrote:
> > > > > > Actually, why is Pacemaker fencing the standby node just
> > > > > > because a
> > > > > > resource fails to start there?  I thought only the master
> > > > > > should be
> > > > > > fenced if it were assumed to be broken.
> > > > 
> > > > This is probably the most important thing to ask outside of the
> > > > PAF
> > > > resource agent which many may not be as fluent with as
> > > > pacemaker itself,
> > > > and perhaps the most indicative of me setting something up
> > > > incorrectly
> > > > outside of that resource agent.
> > > > 
> > > > My understanding of fencing was that pacemaker would only fence
> > > > a node if
> > > > it was the master but had stopped responding, to avoid a split-
> > > > brain
> > > > situation. Why would pacemaker ever fence a standby node with
> > > > no resources
> > > > currently allocated to it?  
> > > 
> > > So, as discussed on IRC and for the mailing list history, here is
> > > the
> > > answer:
> > > 
> > > https://clusterlabs.github.io/PAF/administration.html#failover
> > > 
> > > In short: after a failure (either on a primary or a standby), you
> > > MUST fix
> > > things on the node before starting Pacemaker.
> > > 
> > > If you don't, PAF will detect something incoherent and raise an
> > > error,
> > > leading Pacemaker to most likely fence your node, again.
> > >   
> > 
> > Well, that does not sound very polite to user :)
> 
> Sure :)
> 
> But at least, It's been documented as you pointed earlier.
> 
> After a failure and an automatic failover, either you have some
> automatic
> failback process somewhere...or you have to fix some things around.
> 
> PAF is not able to do automatic failback.
> 
> > Another database RA I mentioned somewhere in this thread has
> > different
> > approach - it starts database in its monitor action and start
> > action is
> > effectively dummy.
> 
> Mh, I would have to study that. But I'm not thrill about such
> behavior at a
> first look.
> 
> > So start always succeeds from pacemaker point of
> > view, but database won't be started until manually synchronized
> > again by
> > administrator.
> 
> It seems scary...What about the stop action? What if the monitor
> detect an
> error? Well, I really should check this RA you are talking about to
> answer my
> questions.
> 
> > Downside is that pacemaker resource status does not reflect
> > database
> > status. I wish pacemaker supported something like "requires manual
> > intervention" resource state that would not be treated like error
> > (causing all sorts of fatal consequences) but still evaluated for
> > dependencies (i.e. dependent resources would not be started). That
> > would
> > be ideal for such case.

I'm not clear what such a result would mean. Is the goal to stop
dependent resources, but not the resource itself? And/or to block all
further management of the resource?

> Good idea.
> 
> I have a couple more:
> * handling errors from notify actions

I could imagine notify supporting on-fail, defaulting to ignore. Would
that do what you want? Should notify errors count toward the resource
fail count?

> * supporting migrate-to/from for multistate RA
> * having real infinite master score :)

What behavior isn't supported by current infinity?

> 
> Cheers,
-- 
Ken Gaillot 


Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Jehan-Guillaume de Rorthais
On Thu, 31 May 2018 22:52:12 +0300
Andrei Borzenkov  wrote:

> 31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> > Sorry for getting back to you so late.
> > 
> > On Fri, 25 May 2018 11:58:59 -0600
> > Casey & Gina  wrote:
> >   
> >>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe 
> >>> wrote:   
> >>>> Actually, why is Pacemaker fencing the standby node just because a
> >>>> resource fails to start there?  I thought only the master should be
> >>>> fenced if it were assumed to be broken.
> >>
> >> This is probably the most important thing to ask outside of the PAF
> >> resource agent which many may not be as fluent with as pacemaker itself,
> >> and perhaps the most indicative of me setting something up incorrectly
> >> outside of that resource agent.
> >>
> >> My understanding of fencing was that pacemaker would only fence a node if
> >> it was the master but had stopped responding, to avoid a split-brain
> >> situation. Why would pacemaker ever fence a standby node with no resources
> >> currently allocated to it?  
> > 
> > So, as discussed on IRC and for the mailing list history, here is the
> > answer:
> > 
> > https://clusterlabs.github.io/PAF/administration.html#failover
> > 
> > In short: after a failure (either on a primary or a standby), you MUST fix
> > things on the node before starting Pacemaker.
> > 
> > If you don't, PAF will detect something incoherent and raise an error,
> > leading Pacemaker to most likely fence your node, again.
> >   
> 
> Well, that does not sound very polite to user :)

Sure :)

But at least it has been documented, as you pointed out earlier.

After a failure and an automatic failover, either you have some automatic
failback process somewhere...or you have to fix a few things yourself.

PAF is not able to do automatic failback.

> Another database RA I mentioned somewhere in this thread has different
> approach - it starts database in its monitor action and start action is
> effectively dummy.

Hmm, I would have to study that. But I'm not thrilled about such behavior at
first look.

> So start always succeeds from pacemaker point of
> view, but database won't be started until manually synchronized again by
> administrator.

It seems scary... What about the stop action? What if the monitor detects an
error? Well, I really should check the RA you are talking about to answer my
questions.

> Downside is that pacemaker resource status does not reflect database
> status. I wish pacemaker supported something like "requires manual
> intervention" resource state that would not be treated like error
> (causing all sorts of fatal consequences) but still evaluated for
> dependencies (i.e. dependent resources would not be started). That would
> be ideal for such case.

Good idea.

I have a couple more:
* handling errors from notify actions
* supporting migrate-to/from for multistate RA
* having real infinite master score :)

Cheers,


Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Andrei Borzenkov
31.05.2018 22:18, Jehan-Guillaume de Rorthais wrote:
> Sorry for getting back to you so late.
> 
> On Fri, 25 May 2018 11:58:59 -0600
> Casey & Gina  wrote:
> 
>>> On May 25, 2018, at 7:01 AM, Casey Allen Shobe 
>>> wrote: 
>>>> Actually, why is Pacemaker fencing the standby node just because a
>>>> resource fails to start there?  I thought only the master should be fenced
>>>> if it were assumed to be broken.
>>
>> This is probably the most important thing to ask outside of the PAF resource
>> agent which many may not be as fluent with as pacemaker itself, and perhaps
>> the most indicative of me setting something up incorrectly outside of that
>> resource agent.
>>
>> My understanding of fencing was that pacemaker would only fence a node if it
>> was the master but had stopped responding, to avoid a split-brain situation.
>> Why would pacemaker ever fence a standby node with no resources currently
>> allocated to it?
> 
> So, as discussed on IRC and for the mailing list history, here is the answer:
> 
> https://clusterlabs.github.io/PAF/administration.html#failover
> 
> In short: after a failure (either on a primary or a standby), you MUST fix
> things on the node before starting Pacemaker.
> 
> If you don't, PAF will detect something incoherent and raise an error, leading
> Pacemaker to most likely fence your node, again.
> 

Well, that does not sound very polite to user :)

Another database RA I mentioned somewhere in this thread takes a different
approach: it starts the database in its monitor action, and its start action
is effectively a dummy. So start always succeeds from Pacemaker's point of
view, but the database won't be started until it has been manually
synchronized again by the administrator.

The downside is that the Pacemaker resource status does not reflect the
database status. I wish Pacemaker supported something like a "requires manual
intervention" resource state that would not be treated as an error
(causing all sorts of fatal consequences) but would still be evaluated for
dependencies (i.e. dependent resources would not be started). That would
be ideal for such a case.
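
To make the "dummy start" approach concrete, here is a toy model of it. It is only an illustration of the behaviour described above, not the code of the RA in question; the class name and the `synchronized` flag are hypothetical. The numeric values are the standard OCF return codes for success and "not running".

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


class LazyStartAgent:
    """Toy resource agent: start is a no-op, monitor brings the database up."""

    def __init__(self) -> None:
        self.started = False           # has the cluster asked us to start?
        self.synchronized = False      # hypothetical flag set by the admin after a manual resync
        self.database_running = False  # real state of the database

    def start(self) -> int:
        # Always succeeds, so the cluster never sees a start failure.
        self.started = True
        return OCF_SUCCESS

    def monitor(self) -> int:
        if not self.started:
            return OCF_NOT_RUNNING
        # The actual start happens here, and only once the data is in sync.
        if self.synchronized and not self.database_running:
            self.database_running = True
        # Success is reported either way: this is the downside noted above.
        return OCF_SUCCESS

    def stop(self) -> int:
        self.database_running = False
        self.started = False
        return OCF_SUCCESS


if __name__ == "__main__":
    agent = LazyStartAgent()
    agent.start()
    print(agent.monitor(), agent.database_running)
    agent.synchronized = True
    print(agent.monitor(), agent.database_running)
```

Between start and the manual resync, the cluster sees a healthy resource while no database is running, so anything ordered after this resource would be started against a database that is not there.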



Re: [ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-31 Thread Jehan-Guillaume de Rorthais
Sorry for getting back to you so late.

On Fri, 25 May 2018 11:58:59 -0600
Casey & Gina  wrote:

> > On May 25, 2018, at 7:01 AM, Casey Allen Shobe 
> > wrote: 
> >> Actually, why is Pacemaker fencing the standby node just because a
> >> resource fails to start there?  I thought only the master should be fenced
> >> if it were assumed to be broken.  
> 
> This is probably the most important thing to ask outside of the PAF resource
> agent which many may not be as fluent with as pacemaker itself, and perhaps
> the most indicative of me setting something up incorrectly outside of that
> resource agent.
> 
> My understanding of fencing was that pacemaker would only fence a node if it
> was the master but had stopped responding, to avoid a split-brain situation.
> Why would pacemaker ever fence a standby node with no resources currently
> allocated to it?

So, as discussed on IRC and for the mailing list history, here is the answer:

https://clusterlabs.github.io/PAF/administration.html#failover

In short: after a failure (either on a primary or a standby), you MUST fix
things on the node before starting Pacemaker.

If you don't, PAF will detect something incoherent and raise an error, leading
Pacemaker to most likely fence your node, again.

For instance, after a primary crash, you will have to resync it as a standby of
the new master before starting Pacemaker on the node and handing it back to PAF.
It is actually really important if you don't want to end up with a silently
corrupted standby in your cluster.
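
A minimal sketch of a pre-flight check along these lines, to run by hand before starting Pacemaker on the repaired node. It assumes PostgreSQL 11 or earlier, where a standby is configured through recovery.conf; the function names and the host names in the example are hypothetical. It only inspects configuration and cannot tell whether the data itself was actually resynced.

```python
def parse_recovery_conf(text: str) -> dict:
    """Parse 'key = value' lines; comments and blank lines are skipped.

    Simplification: a '#' inside a quoted value is treated as a comment.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings


def looks_like_standby_of(text: str, primary_host: str) -> bool:
    """True if the file enables standby mode and points at primary_host."""
    settings = parse_recovery_conf(text)
    if settings.get("standby_mode", "off").lower() not in ("on", "true", "yes", "1"):
        return False
    # Compare whole 'host=...' tokens to avoid matching a prefix of another name.
    conninfo_tokens = settings.get("primary_conninfo", "").split()
    return f"host={primary_host}" in conninfo_tokens


if __name__ == "__main__":
    example = (
        "standby_mode = 'on'\n"
        "primary_conninfo = 'host=node1 application_name=node2'\n"
    )
    print(looks_like_standby_of(example, "node1"))
```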

Cheers,


[ClusterLabs] Why would a standby node be fenced? (was: How to set up fencing/stonith)

2018-05-25 Thread Casey & Gina
> On May 25, 2018, at 7:01 AM, Casey Allen Shobe  
> wrote:
> 
>> Actually, why is Pacemaker fencing the standby node just because a resource 
>> fails to start there?  I thought only the master should be fenced if it were 
>> assumed to be broken.

This is probably the most important thing to ask outside of the PAF resource 
agent which many may not be as fluent with as pacemaker itself, and perhaps the 
most indicative of me setting something up incorrectly outside of that resource 
agent.

My understanding of fencing was that pacemaker would only fence a node if it 
was the master but had stopped responding, to avoid a split-brain situation.  
Why would pacemaker ever fence a standby node with no resources currently 
allocated to it?

Regards,
-- 
Casey