[ClusterLabs] How can I prevent multiple start of IPaddr2 in an environment using fence_mpath?

2018-04-05 Thread 飯田 雄介
Hi all,
I am testing an environment that uses fence_mpath with the following settings.

===
  Stack: corosync
  Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
  Last updated: Fri Apr  6 13:16:20 2018
  Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

  2 nodes configured
  13 resources configured

  Online: [ x3650e x3650f ]

  Full list of resources:

   fenceMpath-x3650e    (stonith:fence_mpath):    Started x3650e
   fenceMpath-x3650f    (stonith:fence_mpath):    Started x3650f
   Resource Group: grpPostgreSQLDB
       prmFsPostgreSQLDB1    (ocf::heartbeat:Filesystem):    Started x3650e
       prmFsPostgreSQLDB2    (ocf::heartbeat:Filesystem):    Started x3650e
       prmFsPostgreSQLDB3    (ocf::heartbeat:Filesystem):    Started x3650e
       prmApPostgreSQLDB     (ocf::heartbeat:pgsql):         Started x3650e
   Resource Group: grpPostgreSQLIP
       prmIpPostgreSQLDB     (ocf::heartbeat:IPaddr2):       Started x3650e
   Clone Set: clnDiskd1 [prmDiskd1]
       Started: [ x3650e x3650f ]
   Clone Set: clnDiskd2 [prmDiskd2]
       Started: [ x3650e x3650f ]
   Clone Set: clnPing [prmPing]
       Started: [ x3650e x3650f ]
===
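
For context, per-node fence_mpath devices of this kind are typically created
with something like the following; the device path, reservation keys and
options here are illustrative assumptions, not this cluster's actual
configuration:

===
  # Illustrative sketch only - device path, keys and options are assumed.
  pcs stonith create fenceMpath-x3650e fence_mpath \
      pcmk_host_list=x3650e key=1 devices=/dev/mapper/mpatha \
      meta provides=unfencing
  pcs stonith create fenceMpath-x3650f fence_mpath \
      pcmk_host_list=x3650f key=2 devices=/dev/mapper/mpatha \
      meta provides=unfencing
===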

When a split brain occurs in this environment, x3650f executes fencing and the
resources are started on x3650f.
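
One common way to provoke such a split brain for testing - not necessarily
what was done here - is to block the corosync interconnect on one node,
assuming corosync's default UDP port 5405:

===
  # Illustrative only: drop corosync traffic on this node to simulate
  # loss of the cluster interconnect (assumes the default port 5405).
  iptables -A INPUT  -p udp --dport 5405 -j DROP
  iptables -A OUTPUT -p udp --dport 5405 -j DROP
===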

=== view of x3650e ===
  Stack: corosync
  Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
  Last updated: Fri Apr  6 13:16:36 2018
  Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

  2 nodes configured
  13 resources configured

  Node x3650f: UNCLEAN (offline)
  Online: [ x3650e ]

  Full list of resources:

   fenceMpath-x3650e    (stonith:fence_mpath):    Started x3650e
   fenceMpath-x3650f    (stonith:fence_mpath):    Started [ x3650e x3650f ]
   Resource Group: grpPostgreSQLDB
       prmFsPostgreSQLDB1    (ocf::heartbeat:Filesystem):    Started x3650e
       prmFsPostgreSQLDB2    (ocf::heartbeat:Filesystem):    Started x3650e
       prmFsPostgreSQLDB3    (ocf::heartbeat:Filesystem):    Started x3650e
       prmApPostgreSQLDB     (ocf::heartbeat:pgsql):         Started x3650e
   Resource Group: grpPostgreSQLIP
       prmIpPostgreSQLDB     (ocf::heartbeat:IPaddr2):       Started x3650e
   Clone Set: clnDiskd1 [prmDiskd1]
       prmDiskd1    (ocf::pacemaker:diskd):    Started x3650f (UNCLEAN)
       Started: [ x3650e ]
   Clone Set: clnDiskd2 [prmDiskd2]
       prmDiskd2    (ocf::pacemaker:diskd):    Started x3650f (UNCLEAN)
       Started: [ x3650e ]
   Clone Set: clnPing [prmPing]
       prmPing    (ocf::pacemaker:ping):    Started x3650f (UNCLEAN)
       Started: [ x3650e ]
===

=== view of x3650f ===
  Stack: corosync
  Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
  Last updated: Fri Apr  6 13:16:36 2018
  Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

  2 nodes configured
  13 resources configured

  Online: [ x3650f ]
  OFFLINE: [ x3650e ]

  Full list of resources:

   fenceMpath-x3650e    (stonith:fence_mpath):    Started x3650f
   fenceMpath-x3650f    (stonith:fence_mpath):    Started x3650f
   Resource Group: grpPostgreSQLDB
       prmFsPostgreSQLDB1    (ocf::heartbeat:Filesystem):    Started x3650f
       prmFsPostgreSQLDB2    (ocf::heartbeat:Filesystem):    Started x3650f
       prmFsPostgreSQLDB3    (ocf::heartbeat:Filesystem):    Started x3650f
       prmApPostgreSQLDB     (ocf::heartbeat:pgsql):         Started x3650f
   Resource Group: grpPostgreSQLIP
       prmIpPostgreSQLDB     (ocf::heartbeat:IPaddr2):       Started x3650f
   Clone Set: clnDiskd1 [prmDiskd1]
       Started: [ x3650f ]
       Stopped: [ x3650e ]
   Clone Set: clnDiskd2 [prmDiskd2]
       Started: [ x3650f ]
       Stopped: [ x3650e ]
   Clone Set: clnPing [prmPing]
       Started: [ x3650f ]
       Stopped: [ x3650e ]
===

However, the IPaddr2 resource on x3650e does not stop until a pgsql monitor
error occurs. During that window, IPaddr2 is active on both nodes at the same time.
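
As a quick check during that window, something like the following can be run
on both nodes; the interface name, address and device path below are
placeholders, not values taken from this configuration:

===
  # Is the virtual IP currently configured on this node?
  # (eth0 and 192.168.0.10 are placeholders.)
  ip -o addr show dev eth0 | grep 192.168.0.10

  # Which reservation keys are still registered on the shared multipath
  # device? (Placeholder device path.) fence_mpath removes the fenced
  # node's key, which cuts off its disk access but does not by itself
  # stop non-disk resources such as IPaddr2 there.
  mpathpersist --in -k /dev/mapper/mpatha
===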

=== view of after pgsql monitor error ===
  Stack: corosync
  Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
  Last updated: Fri Apr  6 13:16:56 2018
  Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

  2 nodes configured
  13 resources configured

  Node x3650f: UNCLEAN (offline)
  Online: [ x3650e ]

  Full list of resources:

   fenceMpath-x3650e    (stonith:fence_mpath):    Started x3650e
   fenceMpath-x3650f    (stonith:fence_mpath):    Started [ x3650e x3650f ]
   Resource Group: grpPostgreSQLDB
       prmFsPostgreSQLDB1    (ocf::heartbeat:Filesystem):    Started x3650e
       prmFsPostgreSQLDB2    (ocf::heartbeat:Filesystem):    Started x3650e
       prmFsPostgreSQLDB3    (ocf::heartbeat:Filesystem):    Started x3650e
       prmApPostgreSQLDB     (ocf::heartbeat:pgsql):         Stopped
   Resource Group: grpPostgreSQLIP
       prmIpPostgreSQLDB     (ocf::heartbeat:IPaddr2):       Stopped
   Clone Set: clnDiskd1 [p

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-05 Thread Thomas Lamprecht

Hi Honza,

On 03/09/2018 05:26 PM, Jan Friesse wrote:

Thomas,

TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


Hope I did both tests right, but as it reproduces multiple times
with testcpg and with our cpg usage in our filesystem, this seems like a
valid test result, not just a single occurrence.


I've tested it too and yes, you are 100% right. The bug is there and it's
pretty easy to reproduce when the node with the lowest nodeid is paused. It's
slightly harder when a node with a higher nodeid is paused.
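
A rough sketch of that pause scenario, using testcpg from the corosync source
tree (the timeout and process handling below are only assumptions):

===
  # Run testcpg on every node so each joins the same CPG group, then on
  # the node with the lowest nodeid pause corosync past the token timeout:
  kill -STOP $(pidof corosync)
  sleep 30                        # assumed to exceed the token timeout
  kill -CONT $(pidof corosync)
  # ...then compare the ConfchgCallback output of testcpg on the nodes.
===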




Were you able to make any progress on this issue?
We'd really like a fix for this, so if there's anything I can do to help,
just hit me up. :)


Otherwise, I have a (slightly hacky) workaround here (on the cpg client side).
If you think the issue isn't easy to address anytime soon, I'd polish that
patch up and we could use it while waiting for the real fix.

cheers,
Thomas




Re: [ClusterLabs] How to cancel a fencing request?

2018-04-05 Thread Klaus Wenninger
On 04/05/2018 06:45 AM, Andrei Borzenkov wrote:
> 04.04.2018 01:35, Ken Gaillot wrote:
>> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> ...
> -inf constraints like that should effectively prevent
> stonith-actions from being executed on those nodes.
 It shouldn't ...

 Pacemaker respects target-role=Started/Stopped for controlling
 execution of fence devices, but location (or even whether the
 device is
 "running" at all) only affects monitors, not execution.

> Though there are a few issues with location constraints
> and stonith-devices.
>
> When stonithd brings up the devices from the cib it
> runs the parts of pengine that fully evaluate these
> constraints and it would disable the stonith-device
> if the resource is unrunnable on that node.
 That should be true only for target-role, not everything that
 affects
 runnability
>>> cib_device_update bails out via a removal of the device if
>>> - role == stopped
>>> - node not in allowed_nodes-list of stonith-resource
>>> - weight is negative
>>>
>>> Wouldn't that include a -inf rule for a node?
>> Well, I'll be ... I thought I understood what was going on there. :-)
>> You're right.
>>
>> I've frequently seen it recommended to ban fence devices from their
>> target when using one device per target. Perhaps it would be better to
>> give a lower (but positive) score on the target compared to the other
>> node(s), so it can be used when no other nodes are available.
>>
> Oh! So I must have misunderstood comments on this in earlier discussions.
>
> So the ability to place a stonith resource on a node does impact the ability
> to perform stonith using this resource, right? OTOH the decision about which
> node is eligible to use a stonith resource for fencing may not match the
> decision about which node is eligible to start the stonith resource? Even
> more confusing ...

Something like that, yes ... and sorry for the confusion ...
Maybe easier to grasp: "Has to be able to run there but doesn't
actually have to be started there right at the moment."

Regards,
Klaus

>>> It is of course clear that no pengine-decision to start
>>> a stonith-resource is required for it to be used for
>>> fencing.
>>>
> This means that only a subset of the usual (co-)location restrictions is
> taken into account? Is it all documented somewhere?

IIRC there are restrictions mentioned in the documentation.
But what is written there didn't ring the right bells for me -
at least not immediately, without having a look at the code ;-)
So we are working on something easier to grasp there.
I guess for now the crucial rule is not to use anything that
might alter location-rule results over time (attributes, rules
with time in them, ...).
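
As an illustration of the "lower but positive score" idea mentioned above, a
per-node fence device could be placed with something like this (resource and
node names, and the scores, are only examples):

===
  # Illustrative only: prefer running node1's fence device on node2, but
  # still allow it on node1 itself as a last resort.
  pcs constraint location fence-node1 prefers node2=100
  pcs constraint location fence-node1 prefers node1=50
===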

