Re: [ClusterLabs] How to cancel a fencing request?

2018-04-10 Thread Jehan-Guillaume de Rorthais
On Tue, 10 Apr 2018 11:24:04 +0200
Klaus Wenninger  wrote:

> On 04/10/2018 08:48 AM, Jehan-Guillaume de Rorthais wrote:
> > On Mon, 09 Apr 2018 17:59:26 -0500
> > Ken Gaillot  wrote:
> >  
> >> On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote:  
> >>> On Tue, 03 Apr 2018 17:35:43 -0500
> >>> Ken Gaillot  wrote:
> >>> 
>  On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> > On 04/03/2018 05:43 PM, Ken Gaillot wrote:  
> >> On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:  
> >>> On 04/02/2018 04:02 PM, Ken Gaillot wrote:  
>  On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de
>  Rorthais
>  wrote:  
> >>> [...]
> >>> -inf constraints like that should effectively prevent
> >>> stonith-actions from being executed on that nodes.  
> >> It shouldn't ...
> >>
> >> Pacemaker respects target-role=Started/Stopped for controlling
> >> execution of fence devices, but location (or even whether the
> >> device is
> >> "running" at all) only affects monitors, not execution.
> >>   
> >>> Though there are a few issues with location constraints
> >>> and stonith-devices.
> >>>
> >>> When stonithd brings up the devices from the cib it
> >>> runs the parts of pengine that fully evaluate these
> >>> constraints and it would disable the stonith-device
> >>> if the resource is unrunable on that node.  
> >> That should be true only for target-role, not everything that
> >> affects
> >> runnability  
> > cib_device_update bails out via a removal of the device if
> > - role == stopped
> > - node not in allowed_nodes-list of stonith-resource
> > - weight is negative
> >
> > Wouldn't that include a -inf rule for a node?  
>  Well, I'll be ... I thought I understood what was going on there.
>  :-)
>  You're right.
> 
>  I've frequently seen it recommended to ban fence devices from their
>  target when using one device per target. Perhaps it would be better
>  to
>  give a lower (but positive) score on the target compared to the
>  other
>  node(s), so it can be used when no other nodes are available.
> >>> Wait, you mean a fencing resource can be triggered from its own
> >>> target? Wat
> >>> happen then? Node suicide and all the cluster nodes are shutdown?
> >>>
> >>> Thanks,
> >> A node can fence itself, though it will be the cluster's last resort
> >> when no other node can. It doesn't necessarily imply all other nodes
> >> are shut down ...  
> > Indeed, sorry I was clear enough: I was talking about a fencing race
> > situation.  
> Fencing races (including when suicide is involved) should normally be
> prevented by one partition not having quorum.
> They should only be an issue with the two-node feature enabled.
> Which scenario did you have in mind?

The two-node scenario: the exact one I described upthread, minus the -inf
location constraint, as Ken suggested.

> >> there may be other nodes up, but they are not allowed
> >> execute the relevant fence device for whatever reason.  
> > In such situation, how other node can confirm the node fence itself without
> > confirmation?  
> 
> Basically I see 2 cases:
> - sbd with watchdog-fencing where the other nodes assume
>   suicide to be successful after a certain time

Sure. With watchdog enabled cluster wide.

> - basically if a node is able to commit suicide (while part of
>   a quorate partition) I would expect it to come back online
>   after reboot telling the cluster that the resources are down

I would expect that as well, but the fencing request hasn't been confirmed to
anyone yet:

* is it enough that the node reboots and probes its resources to declare they
  are all stopped?
* is it enough for the node to acknowledge to the DC/stonithd that the fencing
  request succeeded?
* what if the fencing action is not "reboot" but "off"?

Thanks for your help!


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-10 Thread Klaus Wenninger
On 04/10/2018 08:48 AM, Jehan-Guillaume de Rorthais wrote:
> On Mon, 09 Apr 2018 17:59:26 -0500
> Ken Gaillot  wrote:
>
>> On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote:
>>> On Tue, 03 Apr 2018 17:35:43 -0500
>>> Ken Gaillot  wrote:
>>>   
 On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:  
> On 04/03/2018 05:43 PM, Ken Gaillot wrote:    
>> On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:    
>>> On 04/02/2018 04:02 PM, Ken Gaillot wrote:    
 On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de
 Rorthais
 wrote:    
>>> [...]  
>>> -inf constraints like that should effectively prevent
>>> stonith-actions from being executed on that nodes.    
>> It shouldn't ...
>>
>> Pacemaker respects target-role=Started/Stopped for controlling
>> execution of fence devices, but location (or even whether the
>> device is
>> "running" at all) only affects monitors, not execution.
>>     
>>> Though there are a few issues with location constraints
>>> and stonith-devices.
>>>
>>> When stonithd brings up the devices from the cib it
>>> runs the parts of pengine that fully evaluate these
>>> constraints and it would disable the stonith-device
>>> if the resource is unrunable on that node.    
>> That should be true only for target-role, not everything that
>> affects
>> runnability    
> cib_device_update bails out via a removal of the device if
> - role == stopped
> - node not in allowed_nodes-list of stonith-resource
> - weight is negative
>
> Wouldn't that include a -inf rule for a node?    
 Well, I'll be ... I thought I understood what was going on there.
 :-)
 You're right.

 I've frequently seen it recommended to ban fence devices from their
 target when using one device per target. Perhaps it would be better
 to
 give a lower (but positive) score on the target compared to the
 other
 node(s), so it can be used when no other nodes are available.
>>> Wait, you mean a fencing resource can be triggered from its own
>>> target? Wat
>>> happen then? Node suicide and all the cluster nodes are shutdown?
>>>
>>> Thanks,  
>> A node can fence itself, though it will be the cluster's last resort
>> when no other node can. It doesn't necessarily imply all other nodes
>> are shut down ...
> Indeed, sorry I was clear enough: I was talking about a fencing race
> situation.
Fencing races (including when suicide is involved) should normally be
prevented by one partition not having quorum.
They should only be an issue with the two-node feature enabled.
Which scenario did you have in mind?
>
>> there may be other nodes up, but they are not allowed
>> execute the relevant fence device for whatever reason.
> In such situation, how other node can confirm the node fence itself without
> confirmation?

Basically I see two cases:
- sbd with watchdog-fencing, where the other nodes assume the
  suicide to be successful after a certain time (see the sketch below)
- otherwise, if a node is able to commit suicide (while part of
  a quorate partition), I would expect it to come back online
  after the reboot, telling the cluster that the resources are down
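
For the first case, the usual prerequisite is sbd running on every node plus
a cluster property along these lines (a minimal crmsh sketch; the timeout
value is only an illustrative assumption and has to fit the actual
sbd/watchdog setup):

  # watchdog-based self-fencing: the other nodes consider the suicide
  # successful once this timeout has expired
  property stonith-watchdog-timeout=10s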

Regards,
Klaus

>
>> But of course there might be no other nodes up, in which case, yes, the
>> cluster dies (the idea being that the node is known to be malfunctioning, so
>> stop it from possibly corrupting data).
> This make sense to me.
>
> Thanks,



Re: [ClusterLabs] How to cancel a fencing request?

2018-04-09 Thread Jehan-Guillaume de Rorthais
On Mon, 09 Apr 2018 17:59:26 -0500
Ken Gaillot  wrote:

> On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 03 Apr 2018 17:35:43 -0500
> > Ken Gaillot  wrote:
> >   
> > > On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:  
> > > > On 04/03/2018 05:43 PM, Ken Gaillot wrote:    
> > > > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:    
> > > > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:    
> > > > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de
> > > > > > > Rorthais
> > > > > > > wrote:    
> > 
> > [...]  
> > > > > > 
> > > > > > -inf constraints like that should effectively prevent
> > > > > > stonith-actions from being executed on that nodes.    
> > > > > 
> > > > > It shouldn't ...
> > > > > 
> > > > > Pacemaker respects target-role=Started/Stopped for controlling
> > > > > execution of fence devices, but location (or even whether the
> > > > > device is
> > > > > "running" at all) only affects monitors, not execution.
> > > > >     
> > > > > > Though there are a few issues with location constraints
> > > > > > and stonith-devices.
> > > > > > 
> > > > > > When stonithd brings up the devices from the cib it
> > > > > > runs the parts of pengine that fully evaluate these
> > > > > > constraints and it would disable the stonith-device
> > > > > > if the resource is unrunable on that node.    
> > > > > 
> > > > > That should be true only for target-role, not everything that
> > > > > affects
> > > > > runnability    
> > > > 
> > > > cib_device_update bails out via a removal of the device if
> > > > - role == stopped
> > > > - node not in allowed_nodes-list of stonith-resource
> > > > - weight is negative
> > > > 
> > > > Wouldn't that include a -inf rule for a node?    
> > > 
> > > Well, I'll be ... I thought I understood what was going on there.
> > > :-)
> > > You're right.
> > > 
> > > I've frequently seen it recommended to ban fence devices from their
> > > target when using one device per target. Perhaps it would be better
> > > to
> > > give a lower (but positive) score on the target compared to the
> > > other
> > > node(s), so it can be used when no other nodes are available.
> > 
> > Wait, you mean a fencing resource can be triggered from its own
> > target? Wat
> > happen then? Node suicide and all the cluster nodes are shutdown?
> > 
> > Thanks,  
> 
> A node can fence itself, though it will be the cluster's last resort
> when no other node can. It doesn't necessarily imply all other nodes
> are shut down ...

Indeed, sorry, I wasn't clear enough: I was talking about a fencing race
situation.

> there may be other nodes up, but they are not allowed
> execute the relevant fence device for whatever reason.

In such a situation, how can the other nodes know the node really fenced
itself, without any confirmation?

> But of course there might be no other nodes up, in which case, yes, the
> cluster dies (the idea being that the node is known to be malfunctioning, so
> stop it from possibly corrupting data).

This makes sense to me.

Thanks,


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-09 Thread Ken Gaillot
On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote:
> On Tue, 03 Apr 2018 17:35:43 -0500
> Ken Gaillot  wrote:
> 
> > On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> > > On 04/03/2018 05:43 PM, Ken Gaillot wrote:  
> > > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:  
> > > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:  
> > > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de
> > > > > > Rorthais
> > > > > > wrote:  
> 
> [...]
> > > > > 
> > > > > -inf constraints like that should effectively prevent
> > > > > stonith-actions from being executed on that nodes.  
> > > > 
> > > > It shouldn't ...
> > > > 
> > > > Pacemaker respects target-role=Started/Stopped for controlling
> > > > execution of fence devices, but location (or even whether the
> > > > device is
> > > > "running" at all) only affects monitors, not execution.
> > > >   
> > > > > Though there are a few issues with location constraints
> > > > > and stonith-devices.
> > > > > 
> > > > > When stonithd brings up the devices from the cib it
> > > > > runs the parts of pengine that fully evaluate these
> > > > > constraints and it would disable the stonith-device
> > > > > if the resource is unrunable on that node.  
> > > > 
> > > > That should be true only for target-role, not everything that
> > > > affects
> > > > runnability  
> > > 
> > > cib_device_update bails out via a removal of the device if
> > > - role == stopped
> > > - node not in allowed_nodes-list of stonith-resource
> > > - weight is negative
> > > 
> > > Wouldn't that include a -inf rule for a node?  
> > 
> > Well, I'll be ... I thought I understood what was going on there.
> > :-)
> > You're right.
> > 
> > I've frequently seen it recommended to ban fence devices from their
> > target when using one device per target. Perhaps it would be better
> > to
> > give a lower (but positive) score on the target compared to the
> > other
> > node(s), so it can be used when no other nodes are available.
> 
> Wait, you mean a fencing resource can be triggered from its own
> target? Wat
> happen then? Node suicide and all the cluster nodes are shutdown?
> 
> Thanks,

A node can fence itself, though it will be the cluster's last resort
when no other node can. It doesn't necessarily imply all other nodes
are shut down ... there may be other nodes up, but they are not allowed
to execute the relevant fence device for whatever reason. But of course
there might be no other nodes up, in which case, yes, the cluster dies
(the idea being that the node is known to be malfunctioning, so stop it
from possibly corrupting data).
-- 
Ken Gaillot 


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-09 Thread Jehan-Guillaume de Rorthais
On Tue, 03 Apr 2018 16:59:21 -0500
Ken Gaillot  wrote:

> On Tue, 2018-04-03 at 21:33 +0200, Jehan-Guillaume de Rorthais wrote:
[...]
> > > > I'm not sure to understand the doc correctly in regard with this
> > > > property. Does
> > > > pcmk_delay_max delay the request itself or the execution of the
> > > > request?
> > > > 
> > > > In other words, is it:
> > > > 
> > > >   delay -> fence query -> fencing action
> > > > 
> > > > or 
> > > > 
> > > >   fence query -> delay -> fence action
> > > > 
> > > > ?
> > > > 
> > > > The first definition would solve this issue, but not the second.
> > > > As I
> > > > understand it, as soon as the fence query has been sent, the node
> > > > status is
> > > > "UNCLEAN (online)".    
> > > 
> > > The latter -- you're correct, the node is already unclean by that
> > > time.
> > > Since the stop did not succeed, the node must be fenced to continue
> > > safely.  
> > 
> > Thank you for this clarification.
> > 
> > Do you want to patch to add this clarification to the documentation ?  
> 
> Sure, it never hurts :)

I realize this is not as clear in my mind as I thought.

* who holds the action for some time? crmd or stonithd?
* in a two-node cluster fencing race, if one node is killed, what happens to
  its fencing query that was on hold? I suppose it will be overwritten by the
  new CIB version from the other node once it joins the cluster again?

Thanks,


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-09 Thread Jehan-Guillaume de Rorthais
On Tue, 03 Apr 2018 17:35:43 -0500
Ken Gaillot  wrote:

> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> > On 04/03/2018 05:43 PM, Ken Gaillot wrote:  
> > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:  
> > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:  
> > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > > > > wrote:  
[...]
> > > > 
> > > > -inf constraints like that should effectively prevent
> > > > stonith-actions from being executed on that nodes.  
> > > 
> > > It shouldn't ...
> > > 
> > > Pacemaker respects target-role=Started/Stopped for controlling
> > > execution of fence devices, but location (or even whether the
> > > device is
> > > "running" at all) only affects monitors, not execution.
> > >   
> > > > Though there are a few issues with location constraints
> > > > and stonith-devices.
> > > > 
> > > > When stonithd brings up the devices from the cib it
> > > > runs the parts of pengine that fully evaluate these
> > > > constraints and it would disable the stonith-device
> > > > if the resource is unrunable on that node.  
> > > 
> > > That should be true only for target-role, not everything that
> > > affects
> > > runnability  
> > 
> > cib_device_update bails out via a removal of the device if
> > - role == stopped
> > - node not in allowed_nodes-list of stonith-resource
> > - weight is negative
> > 
> > Wouldn't that include a -inf rule for a node?  
> 
> Well, I'll be ... I thought I understood what was going on there. :-)
> You're right.
> 
> I've frequently seen it recommended to ban fence devices from their
> target when using one device per target. Perhaps it would be better to
> give a lower (but positive) score on the target compared to the other
> node(s), so it can be used when no other nodes are available.

Wait, you mean a fencing resource can be triggered from its own target? What
happens then? Node suicide, and all the cluster nodes end up shut down?

Thanks,


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-05 Thread Klaus Wenninger
On 04/05/2018 06:45 AM, Andrei Borzenkov wrote:
> 04.04.2018 01:35, Ken Gaillot пишет:
>> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> ...
> -inf constraints like that should effectively prevent
> stonith-actions from being executed on that nodes.
 It shouldn't ...

 Pacemaker respects target-role=Started/Stopped for controlling
 execution of fence devices, but location (or even whether the
 device is
 "running" at all) only affects monitors, not execution.

> Though there are a few issues with location constraints
> and stonith-devices.
>
> When stonithd brings up the devices from the cib it
> runs the parts of pengine that fully evaluate these
> constraints and it would disable the stonith-device
> if the resource is unrunable on that node.
 That should be true only for target-role, not everything that
 affects
 runnability
>>> cib_device_update bails out via a removal of the device if
>>> - role == stopped
>>> - node not in allowed_nodes-list of stonith-resource
>>> - weight is negative
>>>
>>> Wouldn't that include a -inf rule for a node?
>> Well, I'll be ... I thought I understood what was going on there. :-)
>> You're right.
>>
>> I've frequently seen it recommended to ban fence devices from their
>> target when using one device per target. Perhaps it would be better to
>> give a lower (but positive) score on the target compared to the other
>> node(s), so it can be used when no other nodes are available.
>>
> Oh! So I must have misunderstood comments on this in earlier discussions.
>
> So ability to place stonith resource on node does impact ability to
> perform stonith using this resource, right? OTOH decision which node is
> eligible to use stonith resource for stonith may not match decision
> which node is eligible to start stonith resource? Even more confusing ...

Something like that, yes ... and sorry for the confusion ...
Maybe easier to grasp: "Has to be able to run there but doesn't
actually have to be started there right at the moment"

Regards,
Klaus

>>> It is of course clear that no pengine-decision to start
>>> a stonith-resource is required for it to be used for
>>> fencing.
>>>
> This means that there is only subset of usual (co-)locating restrictions
> that is taken into account? Is it all documented somewhere?

IIRC there are restrictions mentioned in the documentation.
But what is written there didn't ring the right bells for me,
at least not immediately, without having a look at the code ;-)
So we are working on something easier to grasp there.
I guess for now the crucial rule is not to use anything that
might alter location-rule results over time (attributes, rules
with time in them, ...).




Re: [ClusterLabs] How to cancel a fencing request?

2018-04-04 Thread Andrei Borzenkov
04.04.2018 01:35, Ken Gaillot wrote:
> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
...
>>

 -inf constraints like that should effectively prevent
 stonith-actions from being executed on that nodes.
>>>
>>> It shouldn't ...
>>>
>>> Pacemaker respects target-role=Started/Stopped for controlling
>>> execution of fence devices, but location (or even whether the
>>> device is
>>> "running" at all) only affects monitors, not execution.
>>>
 Though there are a few issues with location constraints
 and stonith-devices.

 When stonithd brings up the devices from the cib it
 runs the parts of pengine that fully evaluate these
 constraints and it would disable the stonith-device
 if the resource is unrunable on that node.
>>>
>>> That should be true only for target-role, not everything that
>>> affects
>>> runnability
>>
>> cib_device_update bails out via a removal of the device if
>> - role == stopped
>> - node not in allowed_nodes-list of stonith-resource
>> - weight is negative
>>
>> Wouldn't that include a -inf rule for a node?
> 
> Well, I'll be ... I thought I understood what was going on there. :-)
> You're right.
> 
> I've frequently seen it recommended to ban fence devices from their
> target when using one device per target. Perhaps it would be better to
> give a lower (but positive) score on the target compared to the other
> node(s), so it can be used when no other nodes are available.
> 

Oh! So I must have misunderstood comments on this in earlier discussions.

So the ability to place a stonith resource on a node does impact the ability
to perform stonith using this resource, right? OTOH, the decision about which
node is eligible to use a stonith resource for fencing may not match the
decision about which node is eligible to start the stonith resource? Even more
confusing ...

>> It is of course clear that no pengine-decision to start
>> a stonith-resource is required for it to be used for
>> fencing.
>>

This means that only a subset of the usual (co-)location restrictions is taken
into account? Is it all documented somewhere?


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-03 Thread Ken Gaillot
On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> On 04/03/2018 05:43 PM, Ken Gaillot wrote:
> > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
> > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > > > wrote:
> > > > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > > > Andrei Borzenkov  wrote:
> > > > > 
> > > > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > I experienced a problem in a two node cluster. It has one
> > > > > > > FA
> > > > > > > per
> > > > > > > node and
> > > > > > > location constraints to avoid the node each of them are
> > > > > > > supposed
> > > > > > > to
> > > > > > > interrupt. 
> > > > > > 
> > > > > > If you mean stonith resource - for all I know location it
> > > > > > does
> > > > > > not
> > > > > > affect stonith operations and only changes where monitoring
> > > > > > action
> > > > > > is
> > > > > > performed.
> > > > > 
> > > > > Sure.
> > > > > 
> > > > > > You can create two stonith resources and declare that each
> > > > > > can fence only single node, but that is not location
> > > > > > constraint, it
> > > > > > is
> > > > > > resource configuration. Showing your configuration would be
> > > > > > helpflul to
> > > > > > avoid guessing.
> > > > > 
> > > > > True, I should have done that. A conf worth thousands of
> > > > > words :)
> > > > > 
> > > > >   crm conf<<EOC
> > > > >   primitive fence_vm_srv1
> > > > > stonith:fence_virsh   \
> > > > > params pcmk_host_check="static-list"
> > > > > pcmk_host_list="srv1"  \
> > > > >    ipaddr="192.168.2.1"
> > > > > login=""  \
> > > > >    identity_file="/root/.ssh/id_rsa" 
> > > > >    \
> > > > >    port="srv1-d8"
> > > > > action="off"  \
> > > > > op monitor interval=10s
> > > > > 
> > > > >   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> > > > > 
> > > > >   primitive fence_vm_srv2
> > > > > stonith:fence_virsh   \
> > > > > params pcmk_host_check="static-list"
> > > > > pcmk_host_list="srv2"  \
> > > > >    ipaddr="192.168.2.1"
> > > > > login=""  \
> > > > >    identity_file="/root/.ssh/id_rsa" 
> > > > >    \
> > > > >    port="srv2-d8"
> > > > > action="off"  \
> > > > > op monitor interval=10s
> > > > > 
> > > > >   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> > > > >   
> > > > >   EOC
> > > > > 
> > > 
> > > -inf constraints like that should effectively prevent
> > > stonith-actions from being executed on that nodes.
> > 
> > It shouldn't ...
> > 
> > Pacemaker respects target-role=Started/Stopped for controlling
> > execution of fence devices, but location (or even whether the
> > device is
> > "running" at all) only affects monitors, not execution.
> > 
> > > Though there are a few issues with location constraints
> > > and stonith-devices.
> > > 
> > > When stonithd brings up the devices from the cib it
> > > runs the parts of pengine that fully evaluate these
> > > constraints and it would disable the stonith-device
> > > if the resource is unrunable on that node.
> > 
> > That should be true only for target-role, not everything that
> > affects
> > runnability
> 
> cib_device_update bails out via a removal of the device if
> - role == stopped
> - node not in allowed_nodes-list of stonith-resource
> - weight is negative
> 
> Wouldn't that include a -inf rule for a node?

Well, I'll be ... I thought I understood what was going on there. :-)
You're right.

I've frequently seen it recommended to ban fence devices from their
target when using one device per target. Perhaps it would be better to
give a lower (but positive) score on the target compared to the other
node(s), so it can be used when no other nodes are available.
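
For example, something along these lines (crmsh syntax, reusing the
fence_vm_srv* names from the config quoted above; the scores are illustrative
assumptions, only their relative order matters):

  # prefer running the device away from its target ...
  location fence_vm_srv1-prefers-srv2 fence_vm_srv1 100: srv2
  # ... but keep a small positive score on the target itself, so it can
  # still be used as a last resort when no other node is available
  location fence_vm_srv1-last-resort-srv1 fence_vm_srv1 10: srv1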

> It is of course clear that no pengine-decision to start
> a stonith-resource is required for it to be used for
> fencing.
> 
> Regards,
> Klaus
> 
> > 
> > > But this part is not retriggered for location contraints
> > > with attributes or other content that would dynamically
> > > change. So one has to stick with constraints as simple
> > > and static as those in the example above.
> > > 
> > > Regarding adding/removing location constraints dynamically
> > > I remember a bug that should have got fixed round 1.1.18
> > > that led to improper handling and actually usage of
> > > stonith-devices disabled or banned from certain nodes.
> > > 
> > > Regards,
> > > Klaus
> > >  
> > > > > > > During some tests, a ms resource raised an error during
> > > > > > > the
> > > > > > > stop
> > > > > > > action on
> > > > > > > both nodes. So both nodes were supposed to be fenced.
> > > > > > 
> > > > > > In two-node cluster you can set pcmk_delay_max so that both
> > > > > > nodes
> > > > > > do not
> > > > > > attempt fencing simultaneously.

Re: [ClusterLabs] How to cancel a fencing request?

2018-04-03 Thread Ken Gaillot
On Tue, 2018-04-03 at 21:33 +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 02 Apr 2018 09:02:24 -0500
> Ken Gaillot  wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > Andrei Borzenkov  wrote:
> 
> [...]
> > > > In two-node cluster you can set pcmk_delay_max so that both
> > > > nodes
> > > > do not
> > > > attempt fencing simultaneously.  
> > > 
> > > I'm not sure to understand the doc correctly in regard with this
> > > property. Does
> > > pcmk_delay_max delay the request itself or the execution of the
> > > request?
> > > 
> > > In other words, is it:
> > > 
> > >   delay -> fence query -> fencing action
> > > 
> > > or 
> > > 
> > >   fence query -> delay -> fence action
> > > 
> > > ?
> > > 
> > > The first definition would solve this issue, but not the second.
> > > As I
> > > understand it, as soon as the fence query has been sent, the node
> > > status is
> > > "UNCLEAN (online)".  
> > 
> > The latter -- you're correct, the node is already unclean by that
> > time.
> > Since the stop did not succeed, the node must be fenced to continue
> > safely.
> 
> Thank you for this clarification.
> 
> Do you want to patch to add this clarification to the documentation ?

Sure, it never hurts :)

> 
> > > > > The first node did, but no FA was then able to fence the
> > > > > second
> > > > > one. So the
> > > > > node stayed DC and was reported as "UNCLEAN (online)".
> > > > > 
> > > > > We were able to fix the original ressource problem, but not
> > > > > to
> > > > > avoid the
> > > > > useless second node fencing.
> > > > > 
> > > > > My questions are:
> > > > > 
> > > > > 1. is it possible to cancel the fencing request 
> > > > > 2. is it possible reset the node status to "online" ?   
> > > > 
> > > > Not that I'm aware of.  
> > > 
> > > Argh!
> > > 
> > > ++  
> > 
> > You could fix the problem with the stopped service manually, then
> > run
> > "stonith_admin --confirm=" (or higher-level tool
> > equivalent).
> > That tells the cluster that you took care of the issue yourself, so
> > fencing can be considered complete.
> 
> Oh, OK. I was wondering if it could help.
> 
> For the complete story, while I was working on this cluster, we tried
> first to
> "unfence" the node using "stonith_admin --unfence "...and
> it actually
> rebooted the node (using fence_vmware_soap) without cleaning its
> status??
> 
> ...So we actually cleaned the status using "--confirm" after the
> complete
> reboot.
> 
> Thank you for this clarification again.
> 
> > The catch there is that the cluster will assume you stopped the
> > node,
> > and all services on it are stopped. That could potentially cause
> > some
> > headaches if it's not true. I'm guessing that if you unmanaged all
> > the
> > resources on it first, then confirmed fencing, the cluster would
> > detect
> > everything properly, then you could re-manage.
> 
> Good to know. Thanks again.
> 
-- 
Ken Gaillot 


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-03 Thread Jehan-Guillaume de Rorthais
On Tue, 3 Apr 2018 07:36:31 +0200
Klaus Wenninger  wrote:

> On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:  
> >> On Sun, 1 Apr 2018 09:01:15 +0300
> >> Andrei Borzenkov  wrote:
> >>  
> >>> 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:  
>  Hi all,
> 
>  I experienced a problem in a two node cluster. It has one FA per
>  node and
>  location constraints to avoid the node each of them are supposed
>  to
>  interrupt.   
> >>> If you mean stonith resource - for all I know location it does not
> >>> affect stonith operations and only changes where monitoring action
> >>> is
> >>> performed.  
> >> Sure.
> >>  
> >>> You can create two stonith resources and declare that each
> >>> can fence only single node, but that is not location constraint, it
> >>> is
> >>> resource configuration. Showing your configuration would be
> >>> helpflul to
> >>> avoid guessing.  
> >> True, I should have done that. A conf worth thousands of words :)
> >>
> >>   crm conf<<EOC
> >>   primitive fence_vm_srv1 stonith:fence_virsh   \
> >> params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
> >>    ipaddr="192.168.2.1" login=""  \
> >>    identity_file="/root/.ssh/id_rsa"\
> >>    port="srv1-d8" action="off"  \
> >> op monitor interval=10s
> >>
> >>   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> >>
> >>   primitive fence_vm_srv2 stonith:fence_virsh   \
> >> params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
> >>    ipaddr="192.168.2.1" login=""  \
> >>    identity_file="/root/.ssh/id_rsa"\
> >>    port="srv2-d8" action="off"  \
> >> op monitor interval=10s
> >>
> >>   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> >>   
> >>   EOC
> >>  
> 
> -inf constraints like that should effectively prevent
> stonith-actions from being executed on that nodes.
> Though there are a few issues with location constraints
> and stonith-devices.

Not sure I understand; I don't want to prevent stonith actions on those nodes.
So here is a quick clarification of what I had in mind with this:

  * fence_vm_srv2 is supposed to be able to fence srv2
  * should fence_vm_srv2 fence srv2, it must be able to reply and then confirm
    the stonith action
  * so fence_vm_srv2 must not start on srv2

Repeat the same for fence_vm_srv1.

So stonith actions can run, but only:

  * fence_vm_srv2 from srv1 to kill srv2
  * and fence_vm_srv1 from srv2 to kill srv1.

[...]
> >> In other words, is it:
> >>
> >>   delay -> fence query -> fencing action
> >>
> >> or 
> >>
> >>   fence query -> delay -> fence action
> >>
> >> ?
> >>
> >> The first definition would solve this issue, but not the second. As I
> >> understand it, as soon as the fence query has been sent, the node
> >> status is
> >> "UNCLEAN (online)".  
> > The latter -- you're correct, the node is already unclean by that time.
> > Since the stop did not succeed, the node must be fenced to continue
> > safely.  
> 
> Well, pcmk_delay_base/max are made for the case
> where both nodes in a 2-node-cluster loose contact
> and see the respectively other as unclean.
> If the looser gets fenced it's view of the partner-
> node becomes irrelevant.

IIRC, the surviving node was DC and was seeing itself as "UNCLEAN (online)", as
this was the only way to stop the failing resource. There was just no fencing
resource available to kill it.


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-03 Thread Klaus Wenninger
On 04/03/2018 05:43 PM, Ken Gaillot wrote:
> On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
>> On 04/02/2018 04:02 PM, Ken Gaillot wrote:
>>> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
>>> wrote:
 On Sun, 1 Apr 2018 09:01:15 +0300
 Andrei Borzenkov  wrote:

> 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
>> Hi all,
>>
>> I experienced a problem in a two node cluster. It has one FA
>> per
>> node and
>> location constraints to avoid the node each of them are
>> supposed
>> to
>> interrupt. 
> If you mean stonith resource - for all I know location it does
> not
> affect stonith operations and only changes where monitoring
> action
> is
> performed.
 Sure.

> You can create two stonith resources and declare that each
> can fence only single node, but that is not location
> constraint, it
> is
> resource configuration. Showing your configuration would be
> helpflul to
> avoid guessing.
 True, I should have done that. A conf worth thousands of words :)

   crm conf<<EOC
   primitive fence_vm_srv1 stonith:fence_virsh   \
 params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
    ipaddr="192.168.2.1" login=""  \
    identity_file="/root/.ssh/id_rsa"\
    port="srv1-d8" action="off"  \
 op monitor interval=10s

   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1

   primitive fence_vm_srv2 stonith:fence_virsh   \
 params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
    ipaddr="192.168.2.1" login=""  \
    identity_file="/root/.ssh/id_rsa"\
    port="srv2-d8" action="off"  \
 op monitor interval=10s

   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
   
   EOC

>> -inf constraints like that should effectively prevent
>> stonith-actions from being executed on that nodes.
> It shouldn't ...
>
> Pacemaker respects target-role=Started/Stopped for controlling
> execution of fence devices, but location (or even whether the device is
> "running" at all) only affects monitors, not execution.
>
>> Though there are a few issues with location constraints
>> and stonith-devices.
>>
>> When stonithd brings up the devices from the cib it
>> runs the parts of pengine that fully evaluate these
>> constraints and it would disable the stonith-device
>> if the resource is unrunable on that node.
> That should be true only for target-role, not everything that affects
> runnability

cib_device_update bails out via a removal of the device if
- role == stopped
- node not in allowed_nodes-list of stonith-resource
- weight is negative

Wouldn't that include a -inf rule for a node?
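
If I'm reading that right, a rough way to observe the effect from the
outside (assuming stonith_admin lists the devices registered with the local
fencer, which is my reading and not something stated above) would be:

  # run on each node; a device that cib_device_update dropped on this node
  # should not appear in its list
  stonith_admin --list-registered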

It is of course clear that no pengine-decision to start
a stonith-resource is required for it to be used for
fencing.

Regards,
Klaus

>
>> But this part is not retriggered for location contraints
>> with attributes or other content that would dynamically
>> change. So one has to stick with constraints as simple
>> and static as those in the example above.
>>
>> Regarding adding/removing location constraints dynamically
>> I remember a bug that should have got fixed round 1.1.18
>> that led to improper handling and actually usage of
>> stonith-devices disabled or banned from certain nodes.
>>
>> Regards,
>> Klaus
>>  
>> During some tests, a ms resource raised an error during the
>> stop
>> action on
>> both nodes. So both nodes were supposed to be fenced.
> In two-node cluster you can set pcmk_delay_max so that both
> nodes
> do not
> attempt fencing simultaneously.
 I'm not sure to understand the doc correctly in regard with this
 property. Does
 pcmk_delay_max delay the request itself or the execution of the
 request?

 In other words, is it:

   delay -> fence query -> fencing action

 or 

   fence query -> delay -> fence action

 ?

 The first definition would solve this issue, but not the second.
 As I
 understand it, as soon as the fence query has been sent, the node
 status is
 "UNCLEAN (online)".
>>> The latter -- you're correct, the node is already unclean by that
>>> time.
>>> Since the stop did not succeed, the node must be fenced to continue
>>> safely.
>> Well, pcmk_delay_base/max are made for the case
>> where both nodes in a 2-node-cluster loose contact
>> and see the respectively other as unclean.
>> If the looser gets fenced it's view of the partner-
>> node becomes irrelevant.
>>
>> The first node did, but no FA was then able to fence the
>> second
>> one. So the
>> node stayed DC and was reported as "UNCLEAN (online)".
>>

Re: [ClusterLabs] How to cancel a fencing request?

2018-04-03 Thread Jehan-Guillaume de Rorthais
On Mon, 02 Apr 2018 09:02:24 -0500
Ken Gaillot  wrote:
> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
> > On Sun, 1 Apr 2018 09:01:15 +0300
> > Andrei Borzenkov  wrote:
[...]
> > > In two-node cluster you can set pcmk_delay_max so that both nodes
> > > do not
> > > attempt fencing simultaneously.  
> > 
> > I'm not sure to understand the doc correctly in regard with this
> > property. Does
> > pcmk_delay_max delay the request itself or the execution of the
> > request?
> > 
> > In other words, is it:
> > 
> >   delay -> fence query -> fencing action
> > 
> > or 
> > 
> >   fence query -> delay -> fence action
> > 
> > ?
> > 
> > The first definition would solve this issue, but not the second. As I
> > understand it, as soon as the fence query has been sent, the node
> > status is
> > "UNCLEAN (online)".  
> 
> The latter -- you're correct, the node is already unclean by that time.
> Since the stop did not succeed, the node must be fenced to continue
> safely.

Thank you for this clarification.

Do you want a patch to add this clarification to the documentation?

> > > > The first node did, but no FA was then able to fence the second
> > > > one. So the
> > > > node stayed DC and was reported as "UNCLEAN (online)".
> > > > 
> > > > We were able to fix the original ressource problem, but not to
> > > > avoid the
> > > > useless second node fencing.
> > > > 
> > > > My questions are:
> > > > 
> > > > 1. is it possible to cancel the fencing request 
> > > > 2. is it possible reset the node status to "online" ?   
> > > 
> > > Not that I'm aware of.  
> > 
> > Argh!
> > 
> > ++  
> 
> You could fix the problem with the stopped service manually, then run
> "stonith_admin --confirm=" (or higher-level tool equivalent).
> That tells the cluster that you took care of the issue yourself, so
> fencing can be considered complete.

Oh, OK. I was wondering if it could help.

For the complete story: while I was working on this cluster, we first tried to
"unfence" the node using "stonith_admin --unfence <node>" ... and it actually
rebooted the node (using fence_vmware_soap) without cleaning its status??

...So we actually cleaned the status using "--confirm" after the complete
reboot.

Thank you for this clarification again.

> The catch there is that the cluster will assume you stopped the node,
> and all services on it are stopped. That could potentially cause some
> headaches if it's not true. I'm guessing that if you unmanaged all the
> resources on it first, then confirmed fencing, the cluster would detect
> everything properly, then you could re-manage.

Good to know. Thanks again.



Re: [ClusterLabs] How to cancel a fencing request?

2018-04-03 Thread Ken Gaillot
On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
> On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > Andrei Borzenkov  wrote:
> > > 
> > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
> > > > > Hi all,
> > > > > 
> > > > > I experienced a problem in a two node cluster. It has one FA
> > > > > per
> > > > > node and
> > > > > location constraints to avoid the node each of them are
> > > > > supposed
> > > > > to
> > > > > interrupt. 
> > > > 
> > > > If you mean stonith resource - for all I know location it does
> > > > not
> > > > affect stonith operations and only changes where monitoring
> > > > action
> > > > is
> > > > performed.
> > > 
> > > Sure.
> > > 
> > > > You can create two stonith resources and declare that each
> > > > can fence only single node, but that is not location
> > > > constraint, it
> > > > is
> > > > resource configuration. Showing your configuration would be
> > > > helpflul to
> > > > avoid guessing.
> > > 
> > > True, I should have done that. A conf worth thousands of words :)
> > > 
> > >   crm conf<<EOC
> > >   primitive fence_vm_srv1 stonith:fence_virsh   \
> > > params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
> > >    ipaddr="192.168.2.1" login=""  \
> > >    identity_file="/root/.ssh/id_rsa"\
> > >    port="srv1-d8" action="off"  \
> > > op monitor interval=10s
> > > 
> > >   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> > > 
> > >   primitive fence_vm_srv2 stonith:fence_virsh   \
> > > params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
> > >    ipaddr="192.168.2.1" login=""  \
> > >    identity_file="/root/.ssh/id_rsa"\
> > >    port="srv2-d8" action="off"  \
> > > op monitor interval=10s
> > > 
> > >   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> > >   
> > >   EOC
> > > 
> 
> -inf constraints like that should effectively prevent
> stonith-actions from being executed on that nodes.

It shouldn't ...

Pacemaker respects target-role=Started/Stopped for controlling
execution of fence devices, but location (or even whether the device is
"running" at all) only affects monitors, not execution.

> Though there are a few issues with location constraints
> and stonith-devices.
> 
> When stonithd brings up the devices from the cib it
> runs the parts of pengine that fully evaluate these
> constraints and it would disable the stonith-device
> if the resource is unrunable on that node.

That should be true only for target-role, not everything that affects
runnability

> But this part is not retriggered for location contraints
> with attributes or other content that would dynamically
> change. So one has to stick with constraints as simple
> and static as those in the example above.
> 
> Regarding adding/removing location constraints dynamically
> I remember a bug that should have got fixed round 1.1.18
> that led to improper handling and actually usage of
> stonith-devices disabled or banned from certain nodes.
> 
> Regards,
> Klaus
>  
> > > > > During some tests, a ms resource raised an error during the
> > > > > stop
> > > > > action on
> > > > > both nodes. So both nodes were supposed to be fenced.
> > > > 
> > > > In two-node cluster you can set pcmk_delay_max so that both
> > > > nodes
> > > > do not
> > > > attempt fencing simultaneously.
> > > 
> > > I'm not sure to understand the doc correctly in regard with this
> > > property. Does
> > > pcmk_delay_max delay the request itself or the execution of the
> > > request?
> > > 
> > > In other words, is it:
> > > 
> > >   delay -> fence query -> fencing action
> > > 
> > > or 
> > > 
> > >   fence query -> delay -> fence action
> > > 
> > > ?
> > > 
> > > The first definition would solve this issue, but not the second.
> > > As I
> > > understand it, as soon as the fence query has been sent, the node
> > > status is
> > > "UNCLEAN (online)".
> > 
> > The latter -- you're correct, the node is already unclean by that
> > time.
> > Since the stop did not succeed, the node must be fenced to continue
> > safely.
> 
> Well, pcmk_delay_base/max are made for the case
> where both nodes in a 2-node-cluster loose contact
> and see the respectively other as unclean.
> If the looser gets fenced it's view of the partner-
> node becomes irrelevant.
> 
> > > > > The first node did, but no FA was then able to fence the
> > > > > second
> > > > > one. So the
> > > > > node stayed DC and was reported as "UNCLEAN (online)".
> > > > > 
> > > > > We were able to fix the original ressource problem, but not
> > > > > to
> > > > > avoid the
> > > > > useless second node fencing.
> > > > > 
> > > > > My questions are:
> > > > > 
> > > > > 1. is it possible to cancel the fencing request

Re: [ClusterLabs] How to cancel a fencing request?

2018-04-02 Thread Klaus Wenninger
On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
>> On Sun, 1 Apr 2018 09:01:15 +0300
>> Andrei Borzenkov  wrote:
>>
>>> 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
 Hi all,

 I experienced a problem in a two node cluster. It has one FA per
 node and
 location constraints to avoid the node each of them are supposed
 to
 interrupt. 
>>> If you mean stonith resource - for all I know location it does not
>>> affect stonith operations and only changes where monitoring action
>>> is
>>> performed.
>> Sure.
>>
>>> You can create two stonith resources and declare that each
>>> can fence only single node, but that is not location constraint, it
>>> is
>>> resource configuration. Showing your configuration would be
>>> helpflul to
>>> avoid guessing.
>> True, I should have done that. A conf worth thousands of words :)
>>
>>   crm conf<<EOC
>>   primitive fence_vm_srv1 stonith:fence_virsh   \
>> params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
>>    ipaddr="192.168.2.1" login=""  \
>>    identity_file="/root/.ssh/id_rsa"\
>>    port="srv1-d8" action="off"  \
>> op monitor interval=10s
>>
>>   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
>>
>>   primitive fence_vm_srv2 stonith:fence_virsh   \
>> params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
>>    ipaddr="192.168.2.1" login=""  \
>>    identity_file="/root/.ssh/id_rsa"\
>>    port="srv2-d8" action="off"  \
>> op monitor interval=10s
>>
>>   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
>>   
>>   EOC
>>

-inf constraints like that should effectively prevent
stonith-actions from being executed on that nodes.
Though there are a few issues with location constraints
and stonith-devices.

When stonithd brings up the devices from the cib it
runs the parts of pengine that fully evaluate these
constraints and it would disable the stonith-device
if the resource is unrunnable on that node.
But this part is not retriggered for location constraints
with attributes or other content that would dynamically
change. So one has to stick with constraints as simple
and static as those in the example above.
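
For example (crmsh sketch; the node attribute name is a made-up
illustration):

  # static constraint, fine for stonithd's view of the device:
  location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1

  # attribute-based rule, better avoided for fence devices, since stonithd
  # will not re-evaluate it when the attribute changes later on:
  location fence_vm_srv1-by-attr fence_vm_srv1 \
    rule -inf: fencing-allowed eq no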

Regarding adding/removing location constraints dynamically
I remember a bug that should have got fixed round 1.1.18
that led to improper handling and actually usage of
stonith-devices disabled or banned from certain nodes.

Regards,
Klaus
 
 During some tests, a ms resource raised an error during the stop
 action on
 both nodes. So both nodes were supposed to be fenced.
>>> In two-node cluster you can set pcmk_delay_max so that both nodes
>>> do not
>>> attempt fencing simultaneously.
>> I'm not sure to understand the doc correctly in regard with this
>> property. Does
>> pcmk_delay_max delay the request itself or the execution of the
>> request?
>>
>> In other words, is it:
>>
>>   delay -> fence query -> fencing action
>>
>> or 
>>
>>   fence query -> delay -> fence action
>>
>> ?
>>
>> The first definition would solve this issue, but not the second. As I
>> understand it, as soon as the fence query has been sent, the node
>> status is
>> "UNCLEAN (online)".
> The latter -- you're correct, the node is already unclean by that time.
> Since the stop did not succeed, the node must be fenced to continue
> safely.

Well, pcmk_delay_base/max are made for the case
where both nodes in a 2-node cluster lose contact
and each sees the other as unclean.
If the loser gets fenced, its view of the partner
node becomes irrelevant.
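
For example (crmsh sketch reusing the names from the config quoted above; the
delay values are illustrative assumptions and the fence_virsh connection
parameters are left out for brevity):

  # fence_vm_srv1 is the device that shoots srv1; delaying it gives srv1 a
  # head start in a death match, making srv1 the likely survivor
  primitive fence_vm_srv1 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv1" \
           pcmk_delay_base="10s" \
    op monitor interval=10s
  primitive fence_vm_srv2 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv2" \
           pcmk_delay_base="0s" \
    op monitor interval=10s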

 The first node did, but no FA was then able to fence the second
 one. So the
 node stayed DC and was reported as "UNCLEAN (online)".

 We were able to fix the original ressource problem, but not to
 avoid the
 useless second node fencing.

 My questions are:

 1. is it possible to cancel the fencing request 
 2. is it possible reset the node status to "online" ? 
>>> Not that I'm aware of.
>> Argh!
>>
>> ++
> You could fix the problem with the stopped service manually, then run
> "stonith_admin --confirm=" (or higher-level tool equivalent).
> That tells the cluster that you took care of the issue yourself, so
> fencing can be considered complete.
>
> The catch there is that the cluster will assume you stopped the node,
> and all services on it are stopped. That could potentially cause some
> headaches if it's not true. I'm guessing that if you unmanaged all the
> resources on it first, then confirmed fencing, the cluster would detect
> everything properly, then you could re-manage.


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-02 Thread Ken Gaillot
On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
> On Sun, 1 Apr 2018 09:01:15 +0300
> Andrei Borzenkov  wrote:
> 
> > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
> > > Hi all,
> > > 
> > > I experienced a problem in a two node cluster. It has one FA per
> > > node and
> > > location constraints to avoid the node each of them are supposed
> > > to
> > > interrupt. 
> > 
> > If you mean stonith resource - for all I know location it does not
> > affect stonith operations and only changes where monitoring action
> > is
> > performed.
> 
> Sure.
> 
> > You can create two stonith resources and declare that each
> > can fence only single node, but that is not location constraint, it
> > is
> > resource configuration. Showing your configuration would be
> > helpflul to
> > avoid guessing.
> 
> True, I should have done that. A conf worth thousands of words :)
> 
>   crm conf<<EOC
>   primitive fence_vm_srv1 stonith:fence_virsh   \
> params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
>    ipaddr="192.168.2.1" login=""  \
>    identity_file="/root/.ssh/id_rsa"\
>    port="srv1-d8" action="off"  \
> op monitor interval=10s
> 
>   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> 
>   primitive fence_vm_srv2 stonith:fence_virsh   \
> params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
>    ipaddr="192.168.2.1" login=""  \
>    identity_file="/root/.ssh/id_rsa"\
>    port="srv2-d8" action="off"  \
> op monitor interval=10s
> 
>   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
>   
>   EOC
> 
> 
> > > During some tests, a ms resource raised an error during the stop
> > > action on
> > > both nodes. So both nodes were supposed to be fenced.
> > 
> > In two-node cluster you can set pcmk_delay_max so that both nodes
> > do not
> > attempt fencing simultaneously.
> 
> I'm not sure to understand the doc correctly in regard with this
> property. Does
> pcmk_delay_max delay the request itself or the execution of the
> request?
> 
> In other words, is it:
> 
>   delay -> fence query -> fencing action
> 
> or 
> 
>   fence query -> delay -> fence action
> 
> ?
> 
> The first definition would solve this issue, but not the second. As I
> understand it, as soon as the fence query has been sent, the node
> status is
> "UNCLEAN (online)".

The latter -- you're correct, the node is already unclean by that time.
Since the stop did not succeed, the node must be fenced to continue
safely.

> > > The first node did, but no FA was then able to fence the second
> > > one. So the
> > > node stayed DC and was reported as "UNCLEAN (online)".
> > > 
> > > We were able to fix the original ressource problem, but not to
> > > avoid the
> > > useless second node fencing.
> > > 
> > > My questions are:
> > > 
> > > 1. is it possible to cancel the fencing request 
> > > 2. is it possible reset the node status to "online" ? 
> > 
> > Not that I'm aware of.
> 
> Argh!
> 
> ++

You could fix the problem with the stopped service manually, then run
"stonith_admin --confirm=" (or higher-level tool equivalent).
That tells the cluster that you took care of the issue yourself, so
fencing can be considered complete.

The catch there is that the cluster will assume you stopped the node,
and all services on it are stopped. That could potentially cause some
headaches if it's not true. I'm guessing that if you unmanaged all the
resources on it first, then confirmed fencing, the cluster would detect
everything properly, then you could re-manage.
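
A rough sketch of that sequence (stonith_admin usage as above; the crm
commands and the "ms_pgsql"/"srv2" names are assumptions for illustration):

  crm resource unmanage ms_pgsql    # keep the cluster from acting on it
  # ... fix the stopped service by hand, then declare the fencing done:
  stonith_admin --confirm=srv2
  crm resource cleanup ms_pgsql     # re-probe so the real state is detected
  crm resource manage ms_pgsql      # hand control back to the cluster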
-- 
Ken Gaillot 


Re: [ClusterLabs] How to cancel a fencing request?

2018-04-02 Thread Jehan-Guillaume de Rorthais
On Sun, 1 Apr 2018 09:01:15 +0300
Andrei Borzenkov  wrote:

> 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
> > Hi all,
> > 
> > I experienced a problem in a two node cluster. It has one FA per node and
> > location constraints to avoid the node each of them are supposed to
> > interrupt. 
> 
> If you mean stonith resource - for all I know location it does not
> affect stonith operations and only changes where monitoring action is
> performed.

Sure.

> You can create two stonith resources and declare that each
> can fence only single node, but that is not location constraint, it is
> resource configuration. Showing your configuration would be helpflul to
> avoid guessing.

True, I should have done that. A conf worth thousands of words :)

  crm conf<<EOC
  [...]
  EOC

> > During some tests, a ms resource raised an error during the stop action on
> > both nodes. So both nodes were supposed to be fenced.
> 
> In two-node cluster you can set pcmk_delay_max so that both nodes do not
> attempt fencing simultaneously.

I'm not sure I understand the doc correctly with regard to this property. Does
pcmk_delay_max delay the request itself or the execution of the request?

In other words, is it:

  delay -> fence query -> fencing action

or 

  fence query -> delay -> fence action

?

The first definition would solve this issue, but not the second. As I
understand it, as soon as the fence query has been sent, the node status is
"UNCLEAN (online)".


> > The first node did, but no FA was then able to fence the second one. So the
> > node stayed DC and was reported as "UNCLEAN (online)".
> > 
> > We were able to fix the original ressource problem, but not to avoid the
> > useless second node fencing.
> > 
> > My questions are:
> > 
> > 1. is it possible to cancel the fencing request 
> > 2. is it possible reset the node status to "online" ? 
> 
> Not that I'm aware of.

Argh!

++


Re: [ClusterLabs] How to cancel a fencing request?

2018-03-31 Thread Andrei Borzenkov
31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
> Hi all,
> 
> I experienced a problem in a two node cluster. It has one FA per node and
> location constraints to avoid the node each of them are supposed to interrupt.
> 

If you mean a stonith resource - for all I know, location does not
affect stonith operations and only changes where the monitoring action is
performed. You can create two stonith resources and declare that each
can fence only a single node, but that is not a location constraint, it is
resource configuration. Showing your configuration would be helpful to
avoid guessing.

> During some tests, a ms resource raised an error during the stop action on
> both nodes. So both nodes were supposed to be fenced.
> 

In a two-node cluster you can set pcmk_delay_max so that both nodes do not
attempt fencing simultaneously.
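
For example (crmsh sketch with placeholder names; the delay value is
illustrative and the fence agent's connection parameters are omitted):

  # a random delay of up to 15s is added before the fencing action, so the
  # two nodes are unlikely to shoot each other at the same instant
  primitive fence_node1 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="node1" \
           pcmk_delay_max="15s" \
    op monitor interval=10s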

> The first node did, but no FA was then able to fence the second one. So the
> node stayed DC and was reported as "UNCLEAN (online)".
> 
> We were able to fix the original ressource problem, but not to avoid the
> useless second node fencing.
> 
> My questions are:
> 
> 1. is it possible to cancel the fencing request
> 2. is it possible reset the node status to "online" ?
> 

Not that I'm aware of.


[ClusterLabs] How to cancel a fencing request?

2018-03-31 Thread Jehan-Guillaume de Rorthais
Hi all,

I experienced a problem in a two node cluster. It has one FA per node and
location constraints so that each FA avoids the node it is supposed to fence.

During some tests, a ms resource raised an error during the stop action on
both nodes. So both nodes were supposed to be fenced.

The first node did, but no FA was then able to fence the second one. So the
node stayed DC and was reported as "UNCLEAN (online)".

We were able to fix the original resource problem, but not to avoid the
useless second node fencing.

My questions are:

1. is it possible to cancel the fencing request?
2. is it possible to reset the node status to "online"?

Thank you