Re: [ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Ferenc Wágner
Andrei Borzenkov writes:

> On Wed, Mar 16, 2016 at 2:22 PM, Ferenc Wágner wrote:
>
>> Pacemaker explained says about this cluster option:
>>
>> Advanced Use Only: Should the cluster shoot unseen nodes? Not using
>> the default is very unsafe!
>>
>> 1. What are those "unseen" nodes?
>
> Nodes that lost communication with other nodes (think of unplugging cables)

Translating to node status, does it mean UNCLEAN (offline) nodes which
suddenly return?  Can Pacemaker tell these apart from abruptly power
cycled nodes (when reboot happens before the comeback)?  I guess if a
node was successfully fenced at the time, it won't be considered
UNCLEAN, but is that the only way to avoid that?

>> And a possibly related question:
>>
>> 2. If I've got UNCLEAN (offline) nodes, is there a way to clean them up,
>>    so that they don't get fenced when I switch them on?  I mean without
>>    removing the node altogether, to keep its capacity settings for
>>    example.
>
> You can declare node as down using "crm node clearstate". You should
> not really do it unless you ascertained that node is actually
> physically down.

Great.  Is there an equivalent in bare bones Pacemaker, that is, not
involving the CRM shell?  Like deleting some status or LRMD history
element of the node, for example?

>> And some more about fencing:
>>
>> 3. What's the difference in cluster behavior between
>>    - stonith-enabled=FALSE (9.3.2: how often will the stop operation be
>>      retried?)
>>    - having no configured STONITH devices (resources won't be started,
>>      right?)
>>    - failing to STONITH with some error (on every node)
>>    - timing out the STONITH operation
>>    - manual fencing
>
> I do not think there is much difference. Without fencing pacemaker
> cannot make decision to relocate resources so cluster will be stuck.

Then I wonder why I hear the "must have working fencing if you value
your data" mantra so often (and always without explanation).  After all,
it does not risk the data, only the automatic cluster recovery, right?

>> 4. What's the modern way to do manual fencing?  (stonith_admin
>>    --confirm + what?
>
> node name.

:) I worded that question really poorly.  I meant to ask what kind of
cluster (STONITH) configuration makes the cluster sit patiently until I
do the manual fencing, then carry on without timeouts or other errors.
Just as if some automatic fencing agent did the job, but letting me
investigate the node status beforehand.
-- 
Thanks,
Feri

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Andrei Borzenkov
On Wed, Mar 16, 2016 at 4:18 PM, Lars Ellenberg wrote:
> On Wed, Mar 16, 2016 at 01:47:52PM +0100, Ferenc Wágner wrote:
>> >> And some more about fencing:
>> >>
>> >> 3. What's the difference in cluster behavior between
>> >>    - stonith-enabled=FALSE (9.3.2: how often will the stop operation be
>> >>      retried?)
>> >>    - having no configured STONITH devices (resources won't be started,
>> >>      right?)
>> >>    - failing to STONITH with some error (on every node)
>> >>    - timing out the STONITH operation
>> >>    - manual fencing
>> >
>> > I do not think there is much difference. Without fencing pacemaker
>> > cannot make decision to relocate resources so cluster will be stuck.
>>
>> Then I wonder why I hear the "must have working fencing if you value
>> your data" mantra so often (and always without explanation).  After all,
>> it does not risk the data, only the automatic cluster recovery, right?
>
> stonith-enabled=false
> means:
> if some node becomes unresponsive,
> it is immediately *assumed* it was "clean" dead.
> no fencing takes place,
> resource takeover happens without further protection.
>

Oh! Actually it is not quite clear from the documentation; the
documentation does not explain what happens in case of
stonith-enabled=false at all.



[ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Ferenc Wágner
Hi,

Pacemaker explained says about this cluster option:

Advanced Use Only: Should the cluster shoot unseen nodes? Not using
the default is very unsafe!

1. What are those "unseen" nodes?

And a possibly related question:

2. If I've got UNCLEAN (offline) nodes, is there a way to clean them up,
   so that they don't get fenced when I switch them on?  I mean without
   removing the node altogether, to keep its capacity settings for
   example.

And some more about fencing:

3. What's the difference in cluster behavior between
   - stonith-enabled=FALSE (9.3.2: how often will the stop operation be
     retried?)
   - having no configured STONITH devices (resources won't be started, right?)
   - failing to STONITH with some error (on every node)
   - timing out the STONITH operation
   - manual fencing

4. What's the modern way to do manual fencing?  (stonith_admin
   --confirm + what?  I ask because meatware.so comes from
   cluster-glue and uses the old API).
-- 
Thanks,
Feri



Re: [ClusterLabs] Pacemaker startup-fencing

2016-03-19 Thread Lars Ellenberg
On Wed, Mar 16, 2016 at 01:47:52PM +0100, Ferenc Wágner wrote:
> >> And some more about fencing:
> >>
> >> 3. What's the difference in cluster behavior between
> >>    - stonith-enabled=FALSE (9.3.2: how often will the stop operation be
> >>      retried?)
> >>    - having no configured STONITH devices (resources won't be started,
> >>      right?)
> >>    - failing to STONITH with some error (on every node)
> >>    - timing out the STONITH operation
> >>    - manual fencing
> >
> > I do not think there is much difference. Without fencing pacemaker
> > cannot make decision to relocate resources so cluster will be stuck.
> 
> Then I wonder why I hear the "must have working fencing if you value
> your data" mantra so often (and always without explanation).  After all,
> it does not risk the data, only the automatic cluster recovery, right?

stonith-enabled=false
means:
if some node becomes unresponsive,
it is immediately *assumed* it was "clean" dead.
no fencing takes place,
resource takeover happens without further protection.

That very much risks at least data divergence (replicas evolving
independently), if not data corruption (shared disks and the like).
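
The behavior discussed in this thread can be sketched as a toy decision
function. This is only an illustrative model, not Pacemaker code: the
names FenceResult and recovery_action are made up for this example, and
the "blocked" outcome reflects the statement quoted above that without
working fencing the cluster cannot decide to relocate resources.

```python
from enum import Enum


class FenceResult(Enum):
    """Hypothetical outcomes of a fencing attempt (names are illustrative)."""
    SUCCESS = "success"
    FAILED = "failed"
    TIMED_OUT = "timed out"


def recovery_action(stonith_enabled, fence_result=None):
    """Toy model: what happens to the resources of an unresponsive node.

    Not Pacemaker's real logic; it only encodes the behavior
    described in this thread.
    """
    if not stonith_enabled:
        # The node is *assumed* to be cleanly dead; nothing verifies it.
        # If it is in fact still running, both sides may write the data.
        return "takeover without protection"
    if fence_result is FenceResult.SUCCESS:
        # The node is known to be down, so takeover is safe.
        return "takeover after fencing"
    # Fencing failed, timed out, or was never possible: recovery waits.
    return "blocked"


if __name__ == "__main__":
    print(recovery_action(stonith_enabled=False))
    print(recovery_action(stonith_enabled=True, fence_result=FenceResult.SUCCESS))
    print(recovery_action(stonith_enabled=True, fence_result=FenceResult.TIMED_OUT))
```

In this model only the first case lets two nodes run the same resource
at once, which is where the divergence or corruption risk comes from;
the "blocked" case costs availability but not data.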

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support



Re: [ClusterLabs] Pacemaker startup-fencing

2016-03-18 Thread Ferenc Wágner
Andrei Borzenkov writes:

> On Wed, Mar 16, 2016 at 4:18 PM, Lars Ellenberg wrote:
>
>> On Wed, Mar 16, 2016 at 01:47:52PM +0100, Ferenc Wágner wrote:
>>
>>>>> And some more about fencing:
>>>>>
>>>>> 3. What's the difference in cluster behavior between
>>>>>    - stonith-enabled=FALSE (9.3.2: how often will the stop operation be
>>>>>      retried?)
>>>>>    - having no configured STONITH devices (resources won't be started,
>>>>>      right?)
>>>>>    - failing to STONITH with some error (on every node)
>>>>>    - timing out the STONITH operation
>>>>>    - manual fencing
>>>>
>>>> I do not think there is much difference. Without fencing pacemaker
>>>> cannot make decision to relocate resources so cluster will be stuck.
>>>
>>> Then I wonder why I hear the "must have working fencing if you value
>>> your data" mantra so often (and always without explanation).  After all,
>>> it does not risk the data, only the automatic cluster recovery, right?
>>
>> stonith-enabled=false
>> means:
>> if some node becomes unresponsive,
>> it is immediately *assumed* it was "clean" dead.
>> no fencing takes place,
>> resource takeover happens without further protection.
>
> Oh! Actually it is not quite clear from the documentation; the
> documentation does not explain what happens in case of
> stonith-enabled=false at all.

Yes, this is a crucially important piece of information, which should be
prominently announced in the documentation.  Thanks for spelling it out,
Lars.  Hope you don't mind that I turned your text into
https://github.com/ClusterLabs/pacemaker/pull/960.
-- 
Feri
