Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 10 Apr 2018 11:24:04 +0200 Klaus Wenninger wrote: > On 04/10/2018 08:48 AM, Jehan-Guillaume de Rorthais wrote: > > On Mon, 09 Apr 2018 17:59:26 -0500 > > Ken Gaillot wrote: > > > >> On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote: > >>> On Tue, 03 Apr 2018 17:35:43 -0500 > >>> Ken Gaillot wrote: > >>> > On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: > > On 04/03/2018 05:43 PM, Ken Gaillot wrote: > >> On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: > >>> On 04/02/2018 04:02 PM, Ken Gaillot wrote: > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de > Rorthais > wrote: > >>> [...] > >>> -inf constraints like that should effectively prevent > >>> stonith-actions from being executed on that nodes. > >> It shouldn't ... > >> > >> Pacemaker respects target-role=Started/Stopped for controlling > >> execution of fence devices, but location (or even whether the > >> device is > >> "running" at all) only affects monitors, not execution. > >> > >>> Though there are a few issues with location constraints > >>> and stonith-devices. > >>> > >>> When stonithd brings up the devices from the cib it > >>> runs the parts of pengine that fully evaluate these > >>> constraints and it would disable the stonith-device > >>> if the resource is unrunable on that node. > >> That should be true only for target-role, not everything that > >> affects > >> runnability > > cib_device_update bails out via a removal of the device if > > - role == stopped > > - node not in allowed_nodes-list of stonith-resource > > - weight is negative > > > > Wouldn't that include a -inf rule for a node? > Well, I'll be ... I thought I understood what was going on there. > :-) > You're right. > > I've frequently seen it recommended to ban fence devices from their > target when using one device per target. 
> Perhaps it would be better to give a lower (but positive) score on the
> target compared to the other node(s), so it can be used when no other
> nodes are available.
> >>> Wait, you mean a fencing resource can be triggered from its own
> >>> target? What happens then? Node suicide, and all the cluster nodes
> >>> are shut down?
> >>>
> >>> Thanks,
> >> A node can fence itself, though it will be the cluster's last resort
> >> when no other node can. It doesn't necessarily imply all other nodes
> >> are shut down ...
> > Indeed, sorry I wasn't clear enough: I was talking about a fencing
> > race situation.
> Fencing races - as well when suicide is involved - shouldn't be
> prevented by one partition not having quorum.
> That should be an issue just with the 2-node feature enabled.
> Which scenario did you have in mind?

The two-node scenario. The exact one I described upthread, minus the -inf
location constraint, as Ken suggested.

> >> there may be other nodes up, but they are not allowed to
> >> execute the relevant fence device for whatever reason.
> > In such a situation, how can the other nodes confirm the node fenced
> > itself, without confirmation?
> Basically I see 2 cases:
> - sbd with watchdog-fencing where the other nodes assume
>   the suicide to be successful after a certain time

Sure. With the watchdog enabled cluster-wide.

> - basically if a node is able to commit suicide (while part of
>   a quorate partition) I would expect it to come back online
>   after reboot telling the cluster that the resources are down

I would expect that as well, but the fencing request hasn't been confirmed
to anyone yet:

* is it enough that the node reboots and probes its resources to declare
  they are all stopped?
* is it enough for the node to acknowledge to the DC/stonithd that the
  fencing request succeeded?
* what if the fencing action is not "reboot" but "off"?

Thanks for your help!
___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
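Ken's suggestion upthread - prefer another node, but keep the target itself eligible as a last resort - could look like the following crmsh sketch. The constraint names and scores are illustrative assumptions, reusing the fence_vm_srv1 device from this thread:

```shell
# Hypothetical sketch of the "lower (but positive) score" approach:
# fence_vm_srv1 (which fences srv1) strongly prefers srv2, but keeps a
# small positive score on srv1 itself, so the device stays eligible for
# a last-resort self-fence instead of being removed by a -inf ban.
crm configure <<'EOC'
location fence_vm_srv1-prefers-srv2 fence_vm_srv1 100: srv2
location fence_vm_srv1-last-resort fence_vm_srv1 10: srv1
EOC
```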
Re: [ClusterLabs] How to cancel a fencing request?
On 04/10/2018 08:48 AM, Jehan-Guillaume de Rorthais wrote: > On Mon, 09 Apr 2018 17:59:26 -0500 > Ken Gaillot wrote: > >> On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote: >>> On Tue, 03 Apr 2018 17:35:43 -0500 >>> Ken Gaillot wrote: >>> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: > On 04/03/2018 05:43 PM, Ken Gaillot wrote: >> On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: >>> On 04/02/2018 04:02 PM, Ken Gaillot wrote: On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote: >>> [...] >>> -inf constraints like that should effectively prevent >>> stonith-actions from being executed on that nodes. >> It shouldn't ... >> >> Pacemaker respects target-role=Started/Stopped for controlling >> execution of fence devices, but location (or even whether the >> device is >> "running" at all) only affects monitors, not execution. >> >>> Though there are a few issues with location constraints >>> and stonith-devices. >>> >>> When stonithd brings up the devices from the cib it >>> runs the parts of pengine that fully evaluate these >>> constraints and it would disable the stonith-device >>> if the resource is unrunable on that node. >> That should be true only for target-role, not everything that >> affects >> runnability > cib_device_update bails out via a removal of the device if > - role == stopped > - node not in allowed_nodes-list of stonith-resource > - weight is negative > > Wouldn't that include a -inf rule for a node? Well, I'll be ... I thought I understood what was going on there. :-) You're right. I've frequently seen it recommended to ban fence devices from their target when using one device per target. Perhaps it would be better to give a lower (but positive) score on the target compared to the other node(s), so it can be used when no other nodes are available. you could re-manage. >>> Wait, you mean a fencing resource can be triggered from its own >>> target? Wat >>> happen then? 
>>> Node suicide, and all the cluster nodes are shut down?
>>>
>>> Thanks,
>> A node can fence itself, though it will be the cluster's last resort
>> when no other node can. It doesn't necessarily imply all other nodes
>> are shut down ...
> Indeed, sorry I wasn't clear enough: I was talking about a fencing race
> situation.

Fencing races - as well when suicide is involved - shouldn't be prevented
by one partition not having quorum.
That should be an issue just with the 2-node feature enabled.
Which scenario did you have in mind?

> >> there may be other nodes up, but they are not allowed to
> >> execute the relevant fence device for whatever reason.
> In such a situation, how can the other nodes confirm the node fenced
> itself, without confirmation?

Basically I see 2 cases:
- sbd with watchdog-fencing, where the other nodes assume
  the suicide to be successful after a certain time
- basically, if a node is able to commit suicide (while part of
  a quorate partition) I would expect it to come back online
  after reboot, telling the cluster that the resources are down

Regards,
Klaus

> >> But of course there might be no other nodes up, in which case, yes,
> >> the cluster dies (the idea being that the node is known to be
> >> malfunctioning, so stop it from possibly corrupting data).
> This makes sense to me.
>
> Thanks,
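Klaus's first case (sbd with watchdog-fencing) relies on the stonith-watchdog-timeout cluster property. A minimal sketch, assuming sbd and a hardware watchdog are already set up on every node; the timeout value is illustrative:

```shell
# With sbd + a watchdog running cluster-wide, peers assume a requested
# self-fence succeeded once this timeout has expired, so no explicit
# confirmation from the fenced node is needed.
crm configure property stonith-watchdog-timeout=10s
```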
Re: [ClusterLabs] How to cancel a fencing request?
On Mon, 09 Apr 2018 17:59:26 -0500 Ken Gaillot wrote: > On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote: > > On Tue, 03 Apr 2018 17:35:43 -0500 > > Ken Gaillot wrote: > > > > > On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: > > > > On 04/03/2018 05:43 PM, Ken Gaillot wrote: > > > > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: > > > > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote: > > > > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de > > > > > > > Rorthais > > > > > > > wrote: > > > > [...] > > > > > > > > > > > > -inf constraints like that should effectively prevent > > > > > > stonith-actions from being executed on that nodes. > > > > > > > > > > It shouldn't ... > > > > > > > > > > Pacemaker respects target-role=Started/Stopped for controlling > > > > > execution of fence devices, but location (or even whether the > > > > > device is > > > > > "running" at all) only affects monitors, not execution. > > > > > > > > > > > Though there are a few issues with location constraints > > > > > > and stonith-devices. > > > > > > > > > > > > When stonithd brings up the devices from the cib it > > > > > > runs the parts of pengine that fully evaluate these > > > > > > constraints and it would disable the stonith-device > > > > > > if the resource is unrunable on that node. > > > > > > > > > > That should be true only for target-role, not everything that > > > > > affects > > > > > runnability > > > > > > > > cib_device_update bails out via a removal of the device if > > > > - role == stopped > > > > - node not in allowed_nodes-list of stonith-resource > > > > - weight is negative > > > > > > > > Wouldn't that include a -inf rule for a node? > > > > > > Well, I'll be ... I thought I understood what was going on there. > > > :-) > > > You're right. > > > > > > I've frequently seen it recommended to ban fence devices from their > > > target when using one device per target. 
> > > Perhaps it would be better to give a lower (but positive) score on
> > > the target compared to the other node(s), so it can be used when no
> > > other nodes are available.
> >
> > Wait, you mean a fencing resource can be triggered from its own
> > target? What happens then? Node suicide, and all the cluster nodes
> > are shut down?
> >
> > Thanks,
>
> A node can fence itself, though it will be the cluster's last resort
> when no other node can. It doesn't necessarily imply all other nodes
> are shut down ...

Indeed, sorry I wasn't clear enough: I was talking about a fencing race
situation.

> there may be other nodes up, but they are not allowed to
> execute the relevant fence device for whatever reason.

In such a situation, how can the other nodes confirm the node fenced
itself, without confirmation?

> But of course there might be no other nodes up, in which case, yes, the
> cluster dies (the idea being that the node is known to be
> malfunctioning, so stop it from possibly corrupting data).

This makes sense to me.

Thanks,
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 2018-04-10 at 00:02 +0200, Jehan-Guillaume de Rorthais wrote: > On Tue, 03 Apr 2018 17:35:43 -0500 > Ken Gaillot wrote: > > > On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: > > > On 04/03/2018 05:43 PM, Ken Gaillot wrote: > > > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: > > > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote: > > > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de > > > > > > Rorthais > > > > > > wrote: > > [...] > > > > > > > > > > -inf constraints like that should effectively prevent > > > > > stonith-actions from being executed on that nodes. > > > > > > > > It shouldn't ... > > > > > > > > Pacemaker respects target-role=Started/Stopped for controlling > > > > execution of fence devices, but location (or even whether the > > > > device is > > > > "running" at all) only affects monitors, not execution. > > > > > > > > > Though there are a few issues with location constraints > > > > > and stonith-devices. > > > > > > > > > > When stonithd brings up the devices from the cib it > > > > > runs the parts of pengine that fully evaluate these > > > > > constraints and it would disable the stonith-device > > > > > if the resource is unrunable on that node. > > > > > > > > That should be true only for target-role, not everything that > > > > affects > > > > runnability > > > > > > cib_device_update bails out via a removal of the device if > > > - role == stopped > > > - node not in allowed_nodes-list of stonith-resource > > > - weight is negative > > > > > > Wouldn't that include a -inf rule for a node? > > > > Well, I'll be ... I thought I understood what was going on there. > > :-) > > You're right. > > > > I've frequently seen it recommended to ban fence devices from their > > target when using one device per target. Perhaps it would be better > > to > > give a lower (but positive) score on the target compared to the > > other > > node(s), so it can be used when no other nodes are available. 
> Wait, you mean a fencing resource can be triggered from its own
> target? What happens then? Node suicide, and all the cluster nodes are
> shut down?
>
> Thanks,

A node can fence itself, though it will be the cluster's last resort
when no other node can. It doesn't necessarily imply all other nodes
are shut down ... there may be other nodes up, but they are not allowed
to execute the relevant fence device for whatever reason.

But of course there might be no other nodes up, in which case, yes, the
cluster dies (the idea being that the node is known to be malfunctioning,
so stop it from possibly corrupting data).
--
Ken Gaillot
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 03 Apr 2018 16:59:21 -0500 Ken Gaillot wrote:

> On Tue, 2018-04-03 at 21:33 +0200, Jehan-Guillaume de Rorthais wrote:
[...]
> > > > I'm not sure I understand the doc correctly with regard to this
> > > > property. Does pcmk_delay_max delay the request itself or the
> > > > execution of the request?
> > > >
> > > > In other words, is it:
> > > >
> > > > delay -> fence query -> fencing action
> > > >
> > > > or
> > > >
> > > > fence query -> delay -> fence action
> > > >
> > > > ?
> > > >
> > > > The first definition would solve this issue, but not the second.
> > > > As I understand it, as soon as the fence query has been sent, the
> > > > node status is "UNCLEAN (online)".
> > >
> > > The latter -- you're correct, the node is already unclean by that
> > > time. Since the stop did not succeed, the node must be fenced to
> > > continue safely.
> >
> > Thank you for this clarification.
> >
> > Do you want a patch to add this clarification to the documentation?
>
> Sure, it never hurts :)

I realize this is not as clear in my mind as I thought.

* who holds the action for some time? crmd or stonithd?
* in a two-node cluster fencing race, if one node is killed, what happens
  to its fencing query that was on hold? I suppose it will be overwritten
  with the new CIB version from the other node once it joins the cluster
  again?

Thanks,
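For reference, pcmk_delay_max is set on the fence device itself. A sketch reusing the fence_virsh device from this thread (the 10s value is illustrative, and login/identity parameters are omitted). Per the clarification in this message, the delay happens after the fence query, before the fence action:

```shell
# fence query -> random delay in [0, pcmk_delay_max] -> fence action
# Giving the devices a random delay window makes it unlikely that both
# nodes of a two-node cluster shoot each other at the same instant.
crm configure primitive fence_vm_srv1 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv1" \
        ipaddr="192.168.2.1" port="srv1-d8" action="off" \
        pcmk_delay_max=10s \
    op monitor interval=10s
```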
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 03 Apr 2018 17:35:43 -0500 Ken Gaillot wrote:

> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> > On 04/03/2018 05:43 PM, Ken Gaillot wrote:
> > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
> > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > > > > wrote:
[...]
> > > > -inf constraints like that should effectively prevent
> > > > stonith-actions from being executed on those nodes.
> > >
> > > It shouldn't ...
> > >
> > > Pacemaker respects target-role=Started/Stopped for controlling
> > > execution of fence devices, but location (or even whether the
> > > device is "running" at all) only affects monitors, not execution.
> > >
> > > > Though there are a few issues with location constraints
> > > > and stonith-devices.
> > > >
> > > > When stonithd brings up the devices from the cib it
> > > > runs the parts of pengine that fully evaluate these
> > > > constraints and it would disable the stonith-device
> > > > if the resource is unrunnable on that node.
> > >
> > > That should be true only for target-role, not everything that
> > > affects runnability
> >
> > cib_device_update bails out via a removal of the device if
> > - role == stopped
> > - node not in allowed_nodes-list of stonith-resource
> > - weight is negative
> >
> > Wouldn't that include a -inf rule for a node?
>
> Well, I'll be ... I thought I understood what was going on there. :-)
> You're right.
>
> I've frequently seen it recommended to ban fence devices from their
> target when using one device per target. Perhaps it would be better to
> give a lower (but positive) score on the target compared to the other
> node(s), so it can be used when no other nodes are available.

Wait, you mean a fencing resource can be triggered from its own target?
What happens then? Node suicide, and all the cluster nodes are shut down?
Thanks,
Re: [ClusterLabs] How to cancel a fencing request?
On 04/05/2018 06:45 AM, Andrei Borzenkov wrote: > 04.04.2018 01:35, Ken Gaillot пишет: >> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: > ... > -inf constraints like that should effectively prevent > stonith-actions from being executed on that nodes. It shouldn't ... Pacemaker respects target-role=Started/Stopped for controlling execution of fence devices, but location (or even whether the device is "running" at all) only affects monitors, not execution. > Though there are a few issues with location constraints > and stonith-devices. > > When stonithd brings up the devices from the cib it > runs the parts of pengine that fully evaluate these > constraints and it would disable the stonith-device > if the resource is unrunable on that node. That should be true only for target-role, not everything that affects runnability >>> cib_device_update bails out via a removal of the device if >>> - role == stopped >>> - node not in allowed_nodes-list of stonith-resource >>> - weight is negative >>> >>> Wouldn't that include a -inf rule for a node? >> Well, I'll be ... I thought I understood what was going on there. :-) >> You're right. >> >> I've frequently seen it recommended to ban fence devices from their >> target when using one device per target. Perhaps it would be better to >> give a lower (but positive) score on the target compared to the other >> node(s), so it can be used when no other nodes are available. >> > Oh! So I must have misunderstood comments on this in earlier discussions. > > So ability to place stonith resource on node does impact ability to > perform stonith using this resource, right? OTOH decision which node is > eligible to use stonith resource for stonith may not match decision > which node is eligible to start stonith resource? Even more confusing ... Something like that, yes ... and sorry for the confusion ... 
Maybe easier to grasp: "Has to be able to run there, but doesn't actually
have to be started there right at the moment."

Regards,
Klaus

>>> It is of course clear that no pengine-decision to start
>>> a stonith-resource is required for it to be used for
>>> fencing.
>
> This means that only a subset of the usual (co-)location restrictions
> is taken into account? Is it all documented somewhere?

IIRC there are restrictions mentioned in the documentation. But what is
written there didn't ring the right bells for me - at least not
immediately, without having a look at the code ;-) So we are working on
something easier to grasp there.

Guess for now the crucial rule is not to use anything that might alter
location-rule results over time (attributes, rules with time in them, ...).
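Klaus's rule of thumb - keep stonith location constraints static - can be illustrated with a crmsh sketch. The constraint names are illustrative; the commented-out constraint is the kind to avoid, since it depends on a node attribute that can change at runtime without stonithd re-evaluating the device's eligibility:

```shell
# Fine: a plain, static preference, evaluated the same way every time.
crm configure location fence_vm_srv1-on-srv2 fence_vm_srv1 100: srv2

# Risky: an attribute-based rule whose result can change over time,
# e.g. a connectivity attribute maintained by an ocf:pacemaker:ping
# resource. Do NOT use something like this for a stonith device:
# crm configure location fence_vm_srv1-dyn fence_vm_srv1 \
#     rule 100: defined pingd
```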
Re: [ClusterLabs] How to cancel a fencing request?
04.04.2018 01:35, Ken Gaillot пишет: > On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: ... >> -inf constraints like that should effectively prevent stonith-actions from being executed on that nodes. >>> >>> It shouldn't ... >>> >>> Pacemaker respects target-role=Started/Stopped for controlling >>> execution of fence devices, but location (or even whether the >>> device is >>> "running" at all) only affects monitors, not execution. >>> Though there are a few issues with location constraints and stonith-devices. When stonithd brings up the devices from the cib it runs the parts of pengine that fully evaluate these constraints and it would disable the stonith-device if the resource is unrunable on that node. >>> >>> That should be true only for target-role, not everything that >>> affects >>> runnability >> >> cib_device_update bails out via a removal of the device if >> - role == stopped >> - node not in allowed_nodes-list of stonith-resource >> - weight is negative >> >> Wouldn't that include a -inf rule for a node? > > Well, I'll be ... I thought I understood what was going on there. :-) > You're right. > > I've frequently seen it recommended to ban fence devices from their > target when using one device per target. Perhaps it would be better to > give a lower (but positive) score on the target compared to the other > node(s), so it can be used when no other nodes are available. > Oh! So I must have misunderstood comments on this in earlier discussions. So ability to place stonith resource on node does impact ability to perform stonith using this resource, right? OTOH decision which node is eligible to use stonith resource for stonith may not match decision which node is eligible to start stonith resource? Even more confusing ... >> It is of course clear that no pengine-decision to start >> a stonith-resource is required for it to be used for >> fencing. 
This means that only a subset of the usual (co-)location restrictions is
taken into account? Is it all documented somewhere?
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote: > On 04/03/2018 05:43 PM, Ken Gaillot wrote: > > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: > > > On 04/02/2018 04:02 PM, Ken Gaillot wrote: > > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais > > > > wrote: > > > > > On Sun, 1 Apr 2018 09:01:15 +0300 > > > > > Andrei Borzenkov wrote: > > > > > > > > > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет: > > > > > > > Hi all, > > > > > > > > > > > > > > I experienced a problem in a two node cluster. It has one > > > > > > > FA > > > > > > > per > > > > > > > node and > > > > > > > location constraints to avoid the node each of them are > > > > > > > supposed > > > > > > > to > > > > > > > interrupt. > > > > > > > > > > > > If you mean stonith resource - for all I know location it > > > > > > does > > > > > > not > > > > > > affect stonith operations and only changes where monitoring > > > > > > action > > > > > > is > > > > > > performed. > > > > > > > > > > Sure. > > > > > > > > > > > You can create two stonith resources and declare that each > > > > > > can fence only single node, but that is not location > > > > > > constraint, it > > > > > > is > > > > > > resource configuration. Showing your configuration would be > > > > > > helpflul to > > > > > > avoid guessing. > > > > > > > > > > True, I should have done that. 
> > > > > A conf worth thousands of words :)
> > > > >
> > > > > crm conf <<EOC
> > > > >
> > > > > primitive fence_vm_srv1 stonith:fence_virsh \
> > > > >   params pcmk_host_check="static-list" pcmk_host_list="srv1" \
> > > > >     ipaddr="192.168.2.1" login="" \
> > > > >     identity_file="/root/.ssh/id_rsa" \
> > > > >     port="srv1-d8" action="off" \
> > > > >   op monitor interval=10s
> > > > >
> > > > > location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> > > > >
> > > > > primitive fence_vm_srv2 stonith:fence_virsh \
> > > > >   params pcmk_host_check="static-list" pcmk_host_list="srv2" \
> > > > >     ipaddr="192.168.2.1" login="" \
> > > > >     identity_file="/root/.ssh/id_rsa" \
> > > > >     port="srv2-d8" action="off" \
> > > > >   op monitor interval=10s
> > > > >
> > > > > location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> > > > >
> > > > > EOC
> > >
> > > -inf constraints like that should effectively prevent
> > > stonith-actions from being executed on those nodes.
> >
> > It shouldn't ...
> >
> > Pacemaker respects target-role=Started/Stopped for controlling
> > execution of fence devices, but location (or even whether the
> > device is "running" at all) only affects monitors, not execution.
> >
> > > Though there are a few issues with location constraints
> > > and stonith-devices.
> > >
> > > When stonithd brings up the devices from the cib it
> > > runs the parts of pengine that fully evaluate these
> > > constraints and it would disable the stonith-device
> > > if the resource is unrunnable on that node.
> >
> > That should be true only for target-role, not everything that
> > affects runnability
>
> cib_device_update bails out via a removal of the device if
> - role == stopped
> - node not in allowed_nodes-list of stonith-resource
> - weight is negative
>
> Wouldn't that include a -inf rule for a node?

Well, I'll be ...
I thought I understood what was going on there. :-) You're right.

I've frequently seen it recommended to ban fence devices from their
target when using one device per target. Perhaps it would be better to
give a lower (but positive) score on the target compared to the other
node(s), so it can be used when no other nodes are available.

> It is of course clear that no pengine-decision to start
> a stonith-resource is required for it to be used for
> fencing.
>
> Regards,
> Klaus
>
> > > But this part is not retriggered for location constraints
> > > with attributes or other content that would dynamically
> > > change. So one has to stick with constraints as simple
> > > and static as those in the example above.
> > >
> > > Regarding adding/removing location constraints dynamically
> > > I remember a bug that should have got fixed around 1.1.18
> > > that led to improper handling and actual usage of
> > > stonith-devices disabled or banned from certain nodes.
> > >
> > > Regards,
> > > Klaus
> > >
> > > > > > > During some tests, a ms resource raised an error during
> > > > > > > the stop action on both nodes. So both nodes were supposed
> > > > > > > to be fenced.
> > > > > >
> > > > > > In a two-node cluster you can set pcmk_delay_max so that both
> > > > > > nodes do not attempt fencing simultaneously.
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 2018-04-03 at 21:33 +0200, Jehan-Guillaume de Rorthais wrote: > On Mon, 02 Apr 2018 09:02:24 -0500 > Ken Gaillot wrote: > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais > > wrote: > > > On Sun, 1 Apr 2018 09:01:15 +0300 > > > Andrei Borzenkov wrote: > > [...] > > > > In two-node cluster you can set pcmk_delay_max so that both > > > > nodes > > > > do not > > > > attempt fencing simultaneously. > > > > > > I'm not sure to understand the doc correctly in regard with this > > > property. Does > > > pcmk_delay_max delay the request itself or the execution of the > > > request? > > > > > > In other words, is it: > > > > > > delay -> fence query -> fencing action > > > > > > or > > > > > > fence query -> delay -> fence action > > > > > > ? > > > > > > The first definition would solve this issue, but not the second. > > > As I > > > understand it, as soon as the fence query has been sent, the node > > > status is > > > "UNCLEAN (online)". > > > > The latter -- you're correct, the node is already unclean by that > > time. > > Since the stop did not succeed, the node must be fenced to continue > > safely. > > Thank you for this clarification. > > Do you want to patch to add this clarification to the documentation ? Sure, it never hurts :) > > > > > > The first node did, but no FA was then able to fence the > > > > > second > > > > > one. So the > > > > > node stayed DC and was reported as "UNCLEAN (online)". > > > > > > > > > > We were able to fix the original ressource problem, but not > > > > > to > > > > > avoid the > > > > > useless second node fencing. > > > > > > > > > > My questions are: > > > > > > > > > > 1. is it possible to cancel the fencing request > > > > > 2. is it possible reset the node status to "online" ? > > > > > > > > Not that I'm aware of. > > > > > > Argh! > > > > > > ++ > > > > You could fix the problem with the stopped service manually, then > > run > > "stonith_admin --confirm=" (or higher-level tool > > equivalent). 
> > That tells the cluster that you took care of the issue yourself, so
> > fencing can be considered complete.
>
> Oh, OK. I was wondering if it could help.
>
> For the complete story, while I was working on this cluster, we tried
> first to "unfence" the node using "stonith_admin --unfence "... and
> it actually rebooted the node (using fence_vmware_soap) without
> cleaning its status??
>
> ... So we actually cleaned the status using "--confirm" after the
> complete reboot.
>
> Thank you for this clarification again.
>
> > The catch there is that the cluster will assume you stopped the node,
> > and all services on it are stopped. That could potentially cause some
> > headaches if it's not true. I'm guessing that if you unmanaged all
> > the resources on it first, then confirmed fencing, the cluster would
> > detect everything properly, then you could re-manage.
>
> Good to know. Thanks again.
--
Ken Gaillot
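Putting Ken's recovery procedure together, a hedged sketch (the resource and node names are illustrative assumptions): unmanage the affected resources, confirm the fence manually, then re-manage once probes look clean.

```shell
# Tell the cluster that the fencing of srv2 can be considered complete.
# Only do this once you are sure its resources are really down: the
# cluster will assume the node was stopped.
crm resource unmanage my_failed_rsc    # hypothetical resource name
stonith_admin --confirm=srv2
# ...check crm_mon output, then:
crm resource manage my_failed_rsc
```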
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 3 Apr 2018 07:36:31 +0200 Klaus Wenninger wrote:

> On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
> >> On Sun, 1 Apr 2018 09:01:15 +0300
> >> Andrei Borzenkov wrote:
> >>
> >>> 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
> >>>> Hi all,
> >>>>
> >>>> I experienced a problem in a two-node cluster. It has one FA per
> >>>> node and location constraints to avoid the node each of them is
> >>>> supposed to interrupt.
> >>> If you mean stonith resource - for all I know location does not
> >>> affect stonith operations and only changes where the monitoring
> >>> action is performed.
> >> Sure.
> >>
> >>> You can create two stonith resources and declare that each
> >>> can fence only a single node, but that is not a location
> >>> constraint, it is resource configuration. Showing your
> >>> configuration would be helpful to avoid guessing.
> >> True, I should have done that. A conf worth thousands of words :)
> >>
> >> crm conf <<EOC
> >>
> >> primitive fence_vm_srv1 stonith:fence_virsh \
> >>   params pcmk_host_check="static-list" pcmk_host_list="srv1" \
> >>     ipaddr="192.168.2.1" login="" \
> >>     identity_file="/root/.ssh/id_rsa" \
> >>     port="srv1-d8" action="off" \
> >>   op monitor interval=10s
> >>
> >> location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> >>
> >> primitive fence_vm_srv2 stonith:fence_virsh \
> >>   params pcmk_host_check="static-list" pcmk_host_list="srv2" \
> >>     ipaddr="192.168.2.1" login="" \
> >>     identity_file="/root/.ssh/id_rsa" \
> >>     port="srv2-d8" action="off" \
> >>   op monitor interval=10s
> >>
> >> location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> >>
> >> EOC
> >>
> -inf constraints like that should effectively prevent
> stonith-actions from being executed on those nodes.
> Though there are a few issues with location constraints
> and stonith-devices.

Not sure I understand; I don't want to prevent stonith actions on those
nodes.
So a quick clarification of what I had in mind with this: * fence_vm_srv2 is supposed to be able to fence srv2 * should fence_vm_srv2 fence srv2, it must be able to reply to, then confirm, the stonith action * so fence_vm_srv2 must not start on srv2 Repeat the same for fence_vm_srv1. So stonith actions can run, but: * fence_vm_srv2 from srv1 to kill srv2 * and fence_vm_srv1 from srv2 to kill srv1. [...] > >> In other words, is it: > >> > >> delay -> fence query -> fencing action > >> > >> or > >> > >> fence query -> delay -> fence action > >> > >> ? > >> > >> The first definition would solve this issue, but not the second. As I > >> understand it, as soon as the fence query has been sent, the node > >> status is > >> "UNCLEAN (online)". > > The latter -- you're correct, the node is already unclean by that time. > > Since the stop did not succeed, the node must be fenced to continue > > safely. > > Well, pcmk_delay_base/max are made for the case > where both nodes in a 2-node cluster lose contact > and see the respective other as unclean. > If the loser gets fenced, its view of the partner > node becomes irrelevant. IIRC, the surviving node was DC and was seeing itself as "UNCLEAN (online)", as this was the only way to stop the failing resource. There was just no fencing resource available to kill it. ___ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
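To illustrate the pcmk_delay_base/max point discussed above: a minimal sketch in crm shell syntax (the delay values are illustrative only; the other parameters follow the configuration quoted in this thread) of giving the two devices different static delays, so that a symmetric fence race in a two-node cluster has a deterministic winner:

```
# Sketch: asymmetric static delays. If both nodes request fencing at
# once, the device with the shorter delay fires first, so the other
# node dies before its own request executes. Values are illustrative.
primitive fence_vm_srv1 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv1" \
        pcmk_delay_base="10s" \
        ipaddr="192.168.2.1" login="" \
        identity_file="/root/.ssh/id_rsa" \
        port="srv1-d8" action="off" \
    op monitor interval=10s
primitive fence_vm_srv2 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv2" \
        pcmk_delay_base="0s" \
        ipaddr="192.168.2.1" login="" \
        identity_file="/root/.ssh/id_rsa" \
        port="srv2-d8" action="off" \
    op monitor interval=10s
```

Note that, per Ken's answer, the delay happens after the fence query, so it only decides which request executes first; the target is already marked unclean by then.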
Re: [ClusterLabs] How to cancel a fencing request?
On 04/03/2018 05:43 PM, Ken Gaillot wrote: > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: >> On 04/02/2018 04:02 PM, Ken Gaillot wrote: >>> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais >>> wrote: On Sun, 1 Apr 2018 09:01:15 +0300 Andrei Borzenkov wrote: > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет: >> Hi all, >> >> I experienced a problem in a two node cluster. It has one FA >> per >> node and >> location constraints to avoid the node each of them are >> supposed >> to >> interrupt. > If you mean stonith resource - for all I know location it does > not > affect stonith operations and only changes where monitoring > action > is > performed. Sure. > You can create two stonith resources and declare that each > can fence only single node, but that is not location > constraint, it > is > resource configuration. Showing your configuration would be > helpflul to > avoid guessing. True, I should have done that. A conf worth thousands of words :) crm conf<>>> primitive fence_vm_srv1 stonith:fence_virsh \ params pcmk_host_check="static-list" pcmk_host_list="srv1" \ ipaddr="192.168.2.1" login="" \ identity_file="/root/.ssh/id_rsa"\ port="srv1-d8" action="off" \ op monitor interval=10s location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1 primitive fence_vm_srv2 stonith:fence_virsh \ params pcmk_host_check="static-list" pcmk_host_list="srv2" \ ipaddr="192.168.2.1" login="" \ identity_file="/root/.ssh/id_rsa"\ port="srv2-d8" action="off" \ op monitor interval=10s location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2 EOC >> -inf constraints like that should effectively prevent >> stonith-actions from being executed on that nodes. > It shouldn't ... > > Pacemaker respects target-role=Started/Stopped for controlling > execution of fence devices, but location (or even whether the device is > "running" at all) only affects monitors, not execution. > >> Though there are a few issues with location constraints >> and stonith-devices. 
>> >> When stonithd brings up the devices from the cib it >> runs the parts of pengine that fully evaluate these >> constraints and it would disable the stonith-device >> if the resource is unrunable on that node. > That should be true only for target-role, not everything that affects > runnability cib_device_update bails out via a removal of the device if - role == stopped - node not in allowed_nodes-list of stonith-resource - weight is negative Wouldn't that include a -inf rule for a node? It is of course clear that no pengine-decision to start a stonith-resource is required for it to be used for fencing. Regards, Klaus > >> But this part is not retriggered for location contraints >> with attributes or other content that would dynamically >> change. So one has to stick with constraints as simple >> and static as those in the example above. >> >> Regarding adding/removing location constraints dynamically >> I remember a bug that should have got fixed round 1.1.18 >> that led to improper handling and actually usage of >> stonith-devices disabled or banned from certain nodes. >> >> Regards, >> Klaus >> >> During some tests, a ms resource raised an error during the >> stop >> action on >> both nodes. So both nodes were supposed to be fenced. > In two-node cluster you can set pcmk_delay_max so that both > nodes > do not > attempt fencing simultaneously. I'm not sure to understand the doc correctly in regard with this property. Does pcmk_delay_max delay the request itself or the execution of the request? In other words, is it: delay -> fence query -> fencing action or fence query -> delay -> fence action ? The first definition would solve this issue, but not the second. As I understand it, as soon as the fence query has been sent, the node status is "UNCLEAN (online)". >>> The latter -- you're correct, the node is already unclean by that >>> time. >>> Since the stop did not succeed, the node must be fenced to continue >>> safely. 
>> Well, pcmk_delay_base/max are made for the case >> where both nodes in a 2-node cluster lose contact >> and see the respective other as unclean. >> If the loser gets fenced, its view of the partner >> node becomes irrelevant. >> >> The first node did, but no FA was then able to fence the >> second >> one. So the >> node stayed DC and was reported as "UNCLEAN (online)". >>
Re: [ClusterLabs] How to cancel a fencing request?
On Mon, 02 Apr 2018 09:02:24 -0500 Ken Gaillot wrote: > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote: > > On Sun, 1 Apr 2018 09:01:15 +0300 > > Andrei Borzenkov wrote: [...] > > > In two-node cluster you can set pcmk_delay_max so that both nodes > > > do not > > > attempt fencing simultaneously. > > > > I'm not sure to understand the doc correctly in regard with this > > property. Does > > pcmk_delay_max delay the request itself or the execution of the > > request? > > > > In other words, is it: > > > > delay -> fence query -> fencing action > > > > or > > > > fence query -> delay -> fence action > > > > ? > > > > The first definition would solve this issue, but not the second. As I > > understand it, as soon as the fence query has been sent, the node > > status is > > "UNCLEAN (online)". > > The latter -- you're correct, the node is already unclean by that time. > Since the stop did not succeed, the node must be fenced to continue > safely. Thank you for this clarification. Do you want a patch to add this clarification to the documentation? > > > > The first node did, but no FA was then able to fence the second > > > > one. So the > > > > node stayed DC and was reported as "UNCLEAN (online)". > > > > > > > > We were able to fix the original resource problem, but not to > > > > avoid the > > > > useless second node fencing. > > > > > > > > My questions are: > > > > > > > > 1. is it possible to cancel the fencing request > > > > 2. is it possible to reset the node status to "online" ? > > > > > > Not that I'm aware of. > > > > Argh! > > > > ++ > > You could fix the problem with the stopped service manually, then run > "stonith_admin --confirm=" (or higher-level tool equivalent). > That tells the cluster that you took care of the issue yourself, so > fencing can be considered complete. Oh, OK. I was wondering if it could help.
For the complete story, while I was working on this cluster, we first tried to "unfence" the node using "stonith_admin --unfence "... and it actually rebooted the node (using fence_vmware_soap) without cleaning its status?? So we actually cleaned the status using "--confirm" after the complete reboot. Thank you for this clarification again. > The catch there is that the cluster will assume you stopped the node, > and all services on it are stopped. That could potentially cause some > headaches if it's not true. I'm guessing that if you unmanaged all the > resources on it first, then confirmed fencing, the cluster would detect > everything properly, then you could re-manage. Good to know. Thanks again.
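For the record, the recovery Ken describes might look like this (a sketch only: the resource and node names are placeholders, and it assumes the services on the supposedly fenced node were in fact brought down or will be re-detected correctly):

```
# Hypothetical recovery sequence after fixing the stop failure by hand.
# 1. Unmanage the affected resource so the cluster probes state instead
#    of acting on it:
crm resource unmanage my_ms_resource
# 2. Tell the cluster the pending fencing of srv2 can be considered done
#    (this is what stonith_admin --confirm does):
stonith_admin --confirm srv2
# 3. Once the cluster has re-probed resource state, manage again:
crm resource manage my_ms_resource
```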
Re: [ClusterLabs] How to cancel a fencing request?
On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote: > On 04/02/2018 04:02 PM, Ken Gaillot wrote: > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais > > wrote: > > > On Sun, 1 Apr 2018 09:01:15 +0300 > > > Andrei Borzenkov wrote: > > > > > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет: > > > > > Hi all, > > > > > > > > > > I experienced a problem in a two node cluster. It has one FA > > > > > per > > > > > node and > > > > > location constraints to avoid the node each of them are > > > > > supposed > > > > > to > > > > > interrupt. > > > > > > > > If you mean stonith resource - for all I know location it does > > > > not > > > > affect stonith operations and only changes where monitoring > > > > action > > > > is > > > > performed. > > > > > > Sure. > > > > > > > You can create two stonith resources and declare that each > > > > can fence only single node, but that is not location > > > > constraint, it > > > > is > > > > resource configuration. Showing your configuration would be > > > > helpflul to > > > > avoid guessing. > > > > > > True, I should have done that. A conf worth thousands of words :) > > > > > > crm conf< > > > > > primitive fence_vm_srv1 stonith:fence_virsh \ > > > params pcmk_host_check="static-list" pcmk_host_list="srv1" \ > > > ipaddr="192.168.2.1" login="" \ > > > identity_file="/root/.ssh/id_rsa"\ > > > port="srv1-d8" action="off" \ > > > op monitor interval=10s > > > > > > location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1 > > > > > > primitive fence_vm_srv2 stonith:fence_virsh \ > > > params pcmk_host_check="static-list" pcmk_host_list="srv2" \ > > > ipaddr="192.168.2.1" login="" \ > > > identity_file="/root/.ssh/id_rsa"\ > > > port="srv2-d8" action="off" \ > > > op monitor interval=10s > > > > > > location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2 > > > > > > EOC > > > > > -inf constraints like that should effectively prevent > stonith-actions from being executed on that nodes. 
It shouldn't ... Pacemaker respects target-role=Started/Stopped for controlling execution of fence devices, but location (or even whether the device is "running" at all) only affects monitors, not execution. > Though there are a few issues with location constraints > and stonith-devices. > > When stonithd brings up the devices from the cib it > runs the parts of pengine that fully evaluate these > constraints and it would disable the stonith-device > if the resource is unrunable on that node. That should be true only for target-role, not everything that affects runnability > But this part is not retriggered for location contraints > with attributes or other content that would dynamically > change. So one has to stick with constraints as simple > and static as those in the example above. > > Regarding adding/removing location constraints dynamically > I remember a bug that should have got fixed round 1.1.18 > that led to improper handling and actually usage of > stonith-devices disabled or banned from certain nodes. > > Regards, > Klaus > > > > > > During some tests, a ms resource raised an error during the > > > > > stop > > > > > action on > > > > > both nodes. So both nodes were supposed to be fenced. > > > > > > > > In two-node cluster you can set pcmk_delay_max so that both > > > > nodes > > > > do not > > > > attempt fencing simultaneously. > > > > > > I'm not sure to understand the doc correctly in regard with this > > > property. Does > > > pcmk_delay_max delay the request itself or the execution of the > > > request? > > > > > > In other words, is it: > > > > > > delay -> fence query -> fencing action > > > > > > or > > > > > > fence query -> delay -> fence action > > > > > > ? > > > > > > The first definition would solve this issue, but not the second. > > > As I > > > understand it, as soon as the fence query has been sent, the node > > > status is > > > "UNCLEAN (online)". 
> > > > The latter -- you're correct, the node is already unclean by that > > time. > > Since the stop did not succeed, the node must be fenced to continue > > safely. > > Well, pcmk_delay_base/max are made for the case > where both nodes in a 2-node cluster lose contact > and see the respective other as unclean. > If the loser gets fenced, its view of the partner > node becomes irrelevant. > > > > > The first node did, but no FA was then able to fence the > > > > > second > > > > > one. So the > > > > > node stayed DC and was reported as "UNCLEAN (online)". > > > > > > > > > > We were able to fix the original resource problem, but not > > > > > to > > > > > avoid the > > > > > useless second node fencing. > > > > > > > > > > My questions are: > > > > > > > > > > 1. is it possible to cancel the fencing request > > > > > 2. is it possible to reset the node status to "online" ?
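Given the cib_device_update conditions Klaus lists elsewhere in the thread (the device is dropped when its role is stopped, the node is absent from its allowed-nodes list, or its weight is negative), a lower but positive score on the device's own target would avoid all three while still steering the device away from its target. A sketch in the thread's crm syntax (constraint names and scores are illustrative):

```
# Sketch: prefer running fence_vm_srv1 away from its own target srv1,
# without banning it outright, so srv1 can still self-fence as a last
# resort when no other node is available.
location fence_vm_srv1-prefers-srv2 fence_vm_srv1 100: srv2
location fence_vm_srv1-low-on-srv1 fence_vm_srv1 10: srv1
```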
Re: [ClusterLabs] How to cancel a fencing request?
On 04/02/2018 04:02 PM, Ken Gaillot wrote: > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote: >> On Sun, 1 Apr 2018 09:01:15 +0300 >> Andrei Borzenkov wrote: >> >>> 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет: Hi all, I experienced a problem in a two node cluster. It has one FA per node and location constraints to avoid the node each of them are supposed to interrupt. >>> If you mean stonith resource - for all I know location it does not >>> affect stonith operations and only changes where monitoring action >>> is >>> performed. >> Sure. >> >>> You can create two stonith resources and declare that each >>> can fence only single node, but that is not location constraint, it >>> is >>> resource configuration. Showing your configuration would be >>> helpflul to >>> avoid guessing. >> True, I should have done that. A conf worth thousands of words :) >> >> crm conf<> >> primitive fence_vm_srv1 stonith:fence_virsh \ >> params pcmk_host_check="static-list" pcmk_host_list="srv1" \ >> ipaddr="192.168.2.1" login="" \ >> identity_file="/root/.ssh/id_rsa"\ >> port="srv1-d8" action="off" \ >> op monitor interval=10s >> >> location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1 >> >> primitive fence_vm_srv2 stonith:fence_virsh \ >> params pcmk_host_check="static-list" pcmk_host_list="srv2" \ >> ipaddr="192.168.2.1" login="" \ >> identity_file="/root/.ssh/id_rsa"\ >> port="srv2-d8" action="off" \ >> op monitor interval=10s >> >> location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2 >> >> EOC >> -inf constraints like that should effectively prevent stonith-actions from being executed on that nodes. Though there are a few issues with location constraints and stonith-devices. When stonithd brings up the devices from the cib it runs the parts of pengine that fully evaluate these constraints and it would disable the stonith-device if the resource is unrunable on that node. 
But this part is not retriggered for location contraints with attributes or other content that would dynamically change. So one has to stick with constraints as simple and static as those in the example above. Regarding adding/removing location constraints dynamically I remember a bug that should have got fixed round 1.1.18 that led to improper handling and actually usage of stonith-devices disabled or banned from certain nodes. Regards, Klaus During some tests, a ms resource raised an error during the stop action on both nodes. So both nodes were supposed to be fenced. >>> In two-node cluster you can set pcmk_delay_max so that both nodes >>> do not >>> attempt fencing simultaneously. >> I'm not sure to understand the doc correctly in regard with this >> property. Does >> pcmk_delay_max delay the request itself or the execution of the >> request? >> >> In other words, is it: >> >> delay -> fence query -> fencing action >> >> or >> >> fence query -> delay -> fence action >> >> ? >> >> The first definition would solve this issue, but not the second. As I >> understand it, as soon as the fence query has been sent, the node >> status is >> "UNCLEAN (online)". > The latter -- you're correct, the node is already unclean by that time. > Since the stop did not succeed, the node must be fenced to continue > safely. Well, pcmk_delay_base/max are made for the case where both nodes in a 2-node-cluster loose contact and see the respectively other as unclean. If the looser gets fenced it's view of the partner- node becomes irrelevant. The first node did, but no FA was then able to fence the second one. So the node stayed DC and was reported as "UNCLEAN (online)". We were able to fix the original ressource problem, but not to avoid the useless second node fencing. My questions are: 1. is it possible to cancel the fencing request 2. is it possible reset the node status to "online" ? >>> Not that I'm aware of. >> Argh! 
>> >> ++ > You could fix the problem with the stopped service manually, then run > "stonith_admin --confirm=" (or higher-level tool equivalent). > That tells the cluster that you took care of the issue yourself, so > fencing can be considered complete. > > The catch there is that the cluster will assume you stopped the node, > and all services on it are stopped. That could potentially cause some > headaches if it's not true. I'm guessing that if you unmanaged all the > resources on it first, then confirmed fencing, the cluster would detect > everything properly, then you could re-manage.
Re: [ClusterLabs] How to cancel a fencing request?
On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote: > On Sun, 1 Apr 2018 09:01:15 +0300 > Andrei Borzenkov wrote: > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет: > > > Hi all, > > > > > > I experienced a problem in a two node cluster. It has one FA per > > > node and > > > location constraints to avoid the node each of them are supposed > > > to > > > interrupt. > > > > If you mean stonith resource - for all I know location it does not > > affect stonith operations and only changes where monitoring action > > is > > performed. > > Sure. > > > You can create two stonith resources and declare that each > > can fence only single node, but that is not location constraint, it > > is > > resource configuration. Showing your configuration would be > > helpflul to > > avoid guessing. > > True, I should have done that. A conf worth thousands of words :) > > crm conf< > primitive fence_vm_srv1 stonith:fence_virsh \ > params pcmk_host_check="static-list" pcmk_host_list="srv1" \ > ipaddr="192.168.2.1" login="" \ > identity_file="/root/.ssh/id_rsa"\ > port="srv1-d8" action="off" \ > op monitor interval=10s > > location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1 > > primitive fence_vm_srv2 stonith:fence_virsh \ > params pcmk_host_check="static-list" pcmk_host_list="srv2" \ > ipaddr="192.168.2.1" login="" \ > identity_file="/root/.ssh/id_rsa"\ > port="srv2-d8" action="off" \ > op monitor interval=10s > > location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2 > > EOC > > > > > During some tests, a ms resource raised an error during the stop > > > action on > > > both nodes. So both nodes were supposed to be fenced. > > > > In two-node cluster you can set pcmk_delay_max so that both nodes > > do not > > attempt fencing simultaneously. > > I'm not sure to understand the doc correctly in regard with this > property. Does > pcmk_delay_max delay the request itself or the execution of the > request? 
> > In other words, is it: > > delay -> fence query -> fencing action > > or > > fence query -> delay -> fence action > > ? > > The first definition would solve this issue, but not the second. As I > understand it, as soon as the fence query has been sent, the node > status is > "UNCLEAN (online)". The latter -- you're correct, the node is already unclean by that time. Since the stop did not succeed, the node must be fenced to continue safely. > > > The first node did, but no FA was then able to fence the second > > > one. So the > > > node stayed DC and was reported as "UNCLEAN (online)". > > > > > > We were able to fix the original resource problem, but not to > > > avoid the > > > useless second node fencing. > > > > > > My questions are: > > > > > > 1. is it possible to cancel the fencing request > > > 2. is it possible to reset the node status to "online" ? > > > > Not that I'm aware of. > > Argh! > > ++ You could fix the problem with the stopped service manually, then run "stonith_admin --confirm=" (or higher-level tool equivalent). That tells the cluster that you took care of the issue yourself, so fencing can be considered complete. The catch there is that the cluster will assume you stopped the node, and all services on it are stopped. That could potentially cause some headaches if it's not true. I'm guessing that if you unmanaged all the resources on it first, then confirmed fencing, the cluster would detect everything properly, then you could re-manage. -- Ken Gaillot
Re: [ClusterLabs] How to cancel a fencing request?
On Sun, 1 Apr 2018 09:01:15 +0300 Andrei Borzenkov wrote: > 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote: > > Hi all, > > > > I experienced a problem in a two node cluster. It has one FA per node and > > location constraints to avoid the node each of them are supposed to > > interrupt. > > If you mean stonith resource - for all I know location it does not > affect stonith operations and only changes where monitoring action is > performed. Sure. > You can create two stonith resources and declare that each > can fence only single node, but that is not location constraint, it is > resource configuration. Showing your configuration would be helpful to > avoid guessing. True, I should have done that. A conf worth thousands of words :) crm conf< > During some tests, a ms resource raised an error during the stop action on > > both nodes. So both nodes were supposed to be fenced. > > In two-node cluster you can set pcmk_delay_max so that both nodes do not > attempt fencing simultaneously. I'm not sure I understand the doc correctly with regard to this property. Does pcmk_delay_max delay the request itself or the execution of the request? In other words, is it: delay -> fence query -> fencing action or fence query -> delay -> fence action ? The first definition would solve this issue, but not the second. As I understand it, as soon as the fence query has been sent, the node status is "UNCLEAN (online)". > > The first node did, but no FA was then able to fence the second one. So the > > node stayed DC and was reported as "UNCLEAN (online)". > > > > We were able to fix the original resource problem, but not to avoid the > > useless second node fencing. > > > > My questions are: > > > > 1. is it possible to cancel the fencing request > > 2. is it possible to reset the node status to "online" ? > > Not that I'm aware of. Argh!
++
Re: [ClusterLabs] How to cancel a fencing request?
31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote: > Hi all, > > I experienced a problem in a two node cluster. It has one FA per node and > location constraints to avoid the node each of them are supposed to interrupt. > If you mean stonith resource - for all I know location it does not affect stonith operations and only changes where monitoring action is performed. You can create two stonith resources and declare that each can fence only single node, but that is not location constraint, it is resource configuration. Showing your configuration would be helpful to avoid guessing. > During some tests, a ms resource raised an error during the stop action on > both nodes. So both nodes were supposed to be fenced. > In two-node cluster you can set pcmk_delay_max so that both nodes do not attempt fencing simultaneously. > The first node did, but no FA was then able to fence the second one. So the > node stayed DC and was reported as "UNCLEAN (online)". > > We were able to fix the original resource problem, but not to avoid the > useless second node fencing. > > My questions are: > > 1. is it possible to cancel the fencing request > 2. is it possible to reset the > node status to "online" ? > Not that I'm aware of.
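Andrei's pcmk_delay_max suggestion, as a sketch (crm shell syntax; the 15s value is illustrative, and the other parameters follow the configuration quoted later in the thread): the device waits a random delay up to the given maximum before executing, making it unlikely that both nodes fence each other at the same instant:

```
# Sketch: random delay in [0, 15s] before this device executes a fencing
# action against srv2; give the peer device the same parameter so either
# node can win the race.
primitive fence_vm_srv2 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv2" \
        pcmk_delay_max="15s" \
        ipaddr="192.168.2.1" login="" \
        identity_file="/root/.ssh/id_rsa" \
        port="srv2-d8" action="off" \
    op monitor interval=10s
```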
[ClusterLabs] How to cancel a fencing request?
Hi all, I experienced a problem in a two node cluster. It has one FA per node and location constraints to avoid the node each of them are supposed to interrupt. During some tests, a ms resource raised an error during the stop action on both nodes. So both nodes were supposed to be fenced. The first node did, but no FA was then able to fence the second one. So the node stayed DC and was reported as "UNCLEAN (online)". We were able to fix the original resource problem, but not to avoid the useless second node fencing. My questions are: 1. is it possible to cancel the fencing request 2. is it possible to reset the node status to "online" ? Thank you