Re: [ClusterLabs] How can I prevent multiple start of IPaddr2 in an environment using fence_mpath?
06.04.2018 07:30, 飯田 雄介 writes:
> Hi, all
> I am testing the environment using fence_mpath with the following settings. [...]
[ClusterLabs] How can I prevent multiple start of IPaddr2 in an environment using fence_mpath?
Hi, all

I am testing an environment that uses fence_mpath, with the following settings.

===
Stack: corosync
Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition with quorum
Last updated: Fri Apr  6 13:16:20 2018
Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

2 nodes configured
13 resources configured

Online: [ x3650e x3650f ]

Full list of resources:

 fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
 fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
 Resource Group: grpPostgreSQLDB
     prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
     prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
     prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
     prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
 Resource Group: grpPostgreSQLIP
     prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
 Clone Set: clnDiskd1 [prmDiskd1]
     Started: [ x3650e x3650f ]
 Clone Set: clnDiskd2 [prmDiskd2]
     Started: [ x3650e x3650f ]
 Clone Set: clnPing [prmPing]
     Started: [ x3650e x3650f ]
===

When split-brain occurs in this environment, x3650f executes the fence and the resources are started on x3650f.

=== view of x3650e
Stack: corosync
Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
Last updated: Fri Apr  6 13:16:36 2018
Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

2 nodes configured
13 resources configured

Node x3650f: UNCLEAN (offline)
Online: [ x3650e ]

Full list of resources:

 fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
 fenceMpath-x3650f (stonith:fence_mpath): Started[ x3650e x3650f ]
 Resource Group: grpPostgreSQLDB
     prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
     prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
     prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
     prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650e
 Resource Group: grpPostgreSQLIP
     prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650e
 Clone Set: clnDiskd1 [prmDiskd1]
     prmDiskd1 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
     Started: [ x3650e ]
 Clone Set: clnDiskd2 [prmDiskd2]
     prmDiskd2 (ocf::pacemaker:diskd): Started x3650f (UNCLEAN)
     Started: [ x3650e ]
 Clone Set: clnPing [prmPing]
     prmPing (ocf::pacemaker:ping): Started x3650f (UNCLEAN)
     Started: [ x3650e ]

=== view of x3650f
Stack: corosync
Current DC: x3650f (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
Last updated: Fri Apr  6 13:16:36 2018
Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

2 nodes configured
13 resources configured

Online: [ x3650f ]
OFFLINE: [ x3650e ]

Full list of resources:

 fenceMpath-x3650e (stonith:fence_mpath): Started x3650f
 fenceMpath-x3650f (stonith:fence_mpath): Started x3650f
 Resource Group: grpPostgreSQLDB
     prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650f
     prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650f
     prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650f
     prmApPostgreSQLDB (ocf::heartbeat:pgsql): Started x3650f
 Resource Group: grpPostgreSQLIP
     prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Started x3650f
 Clone Set: clnDiskd1 [prmDiskd1]
     Started: [ x3650f ]
     Stopped: [ x3650e ]
 Clone Set: clnDiskd2 [prmDiskd2]
     Started: [ x3650f ]
     Stopped: [ x3650e ]
 Clone Set: clnPing [prmPing]
     Started: [ x3650f ]
     Stopped: [ x3650e ]
===

However, IPaddr2 on x3650e does not stop until a pgsql monitor error occurs. During that window, IPaddr2 is running on both nodes at the same time.

=== view of after pgsql monitor error ===
Stack: corosync
Current DC: x3650e (version 1.1.17-1.el7-b36b869) - partition WITHOUT quorum
Last updated: Fri Apr  6 13:16:56 2018
Last change: Thu Mar  1 18:38:02 2018 by root via cibadmin on x3650e

2 nodes configured
13 resources configured

Node x3650f: UNCLEAN (offline)
Online: [ x3650e ]

Full list of resources:

 fenceMpath-x3650e (stonith:fence_mpath): Started x3650e
 fenceMpath-x3650f (stonith:fence_mpath): Started[ x3650e x3650f ]
 Resource Group: grpPostgreSQLDB
     prmFsPostgreSQLDB1 (ocf::heartbeat:Filesystem): Started x3650e
     prmFsPostgreSQLDB2 (ocf::heartbeat:Filesystem): Started x3650e
     prmFsPostgreSQLDB3 (ocf::heartbeat:Filesystem): Started x3650e
     prmApPostgreSQLDB (ocf::heartbeat:pgsql): Stopped
 Resource Group: grpPostgreSQLIP
     prmIpPostgreSQLDB (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: clnDiskd1 [p
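A mitigation that is often suggested for storage-only fencing like fence_mpath is to pair it with a hardware watchdog: fence_mpath only revokes the victim's SCSI-3 reservation key, so the fenced node keeps running resources that never touch the shared disk (such as IPaddr2) until something else stops them; with a watchdog check, the fenced node reboots itself once its key has been preempted. The sketch below is illustrative only: the key value and device path are placeholders, and the location of the watchdog check script shipped by fence-agents may differ per distribution.

```shell
# Illustrative sketch, not a drop-in fix.

# 1) One stonith device per node, each with its own reservation key;
#    provides=unfencing lets the node re-register its key after a fence.
pcs stonith create fenceMpath-x3650e fence_mpath \
    key=1 pcmk_host_list=x3650e devices=/dev/mapper/mpatha \
    meta provides=unfencing

# 2) Let watchdog(8) reboot a node whose key has been preempted;
#    fence-agents ships a check script for this (path may vary).
cp /usr/share/cluster/fence_mpath_check /etc/watchdog.d/
systemctl enable --now watchdog
```

With the watchdog in place, the window in which IPaddr2 runs on both nodes is bounded by the watchdog interval rather than by when the pgsql monitor happens to fail.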
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi Honza,

On 03/09/2018 at 05:26 PM, Jan Friesse wrote:
> Thomas,
>> TotemConfchgCallback: ringid (1.1436)
>> active processors 3: 1 2 3
>> EXIT
>> Finalize result is 1 (should be 1)
>>
>> Hope I did both tests right, but as it reproduces multiple times with
>> testcpg and with our cpg usage in our filesystem, this seems validly
>> tested, not just a single occurrence.
>
> I've tested it too and yes, you are 100% right. The bug is there and it's
> pretty easy to reproduce when the node with the lowest nodeid is paused.
> It's slightly harder when a node with a higher nodeid is paused.

Were you able to make some progress on this issue? We'd really like a fix
for this, so if there's anything I can do to help, just hit me up. :)

Otherwise, I have a (slightly hacky) workaround here (on the cpg client
side); if you think the issue won't be easy to address anytime soon, I'd
polish that patch up and we could use it while waiting for the real fix.

cheers,
Thomas

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
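For reference, the pause-based reproduction described above can be scripted roughly as follows. This is a sketch under assumptions: a three-node cluster where node1 has the lowest nodeid, testcpg already running on every node, and a 30-second pause being long enough to exceed the token timeout.

```shell
# Freeze (don't kill) corosync on the lowest-nodeid node, so the other
# two nodes declare it dead and form a new ring.
ssh node1 'kill -STOP $(pidof corosync)'

# Wait past the token timeout so the membership change happens.
sleep 30

# Resume it and watch the confchg callbacks that testcpg prints on each
# node; the bug shows up as missing/inconsistent config change events.
ssh node1 'kill -CONT $(pidof corosync)'
```

Pausing rather than killing the process is what matters here: the frozen node still holds its CPG state and rejoins with it when resumed, which is the situation that triggers the callback bug.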
Re: [ClusterLabs] How to cancel a fencing request?
On 04/05/2018 06:45 AM, Andrei Borzenkov wrote:
> 04.04.2018 01:35, Ken Gaillot writes:
>> On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> ...
> -inf constraints like that should effectively prevent
> stonith-actions from being executed on those nodes.

It shouldn't ... Pacemaker respects target-role=Started/Stopped for
controlling execution of fence devices, but location (or even whether the
device is "running" at all) only affects monitors, not execution.

> Though there are a few issues with location constraints
> and stonith-devices.
>
> When stonithd brings up the devices from the cib it
> runs the parts of pengine that fully evaluate these
> constraints and it would disable the stonith-device
> if the resource is unrunnable on that node.

That should be true only for target-role, not everything that affects
runnability.

>>> cib_device_update bails out via a removal of the device if
>>> - role == stopped
>>> - node not in allowed_nodes-list of stonith-resource
>>> - weight is negative
>>>
>>> Wouldn't that include a -inf rule for a node?

>> Well, I'll be ... I thought I understood what was going on there. :-)
>> You're right.
>>
>> I've frequently seen it recommended to ban fence devices from their
>> target when using one device per target. Perhaps it would be better to
>> give a lower (but positive) score on the target compared to the other
>> node(s), so it can be used when no other nodes are available.

> Oh! So I must have misunderstood comments on this in earlier discussions.
>
> So the ability to place a stonith resource on a node does impact the
> ability to perform stonith using this resource, right? OTOH the decision
> which node is eligible to use a stonith resource for stonith may not
> match the decision which node is eligible to start the stonith resource?
> Even more confusing ...

Something like that, yes ... and sorry for the confusion ...
Maybe easier to grasp: "It has to be able to run there, but doesn't
actually have to be started there right at the moment."

Regards,
Klaus

>>> It is of course clear that no pengine-decision to start
>>> a stonith-resource is required for it to be used for
>>> fencing.
>>>
> This means that only a subset of the usual (co-)location restrictions
> is taken into account? Is it all documented somewhere?

iirc there are restrictions mentioned in the documentation. But what is
written there didn't ring the right bells for me - at least not
immediately without having a look at the code ;-) So we are working on
something easier to grasp there.

Guess for now the crucial rule is not to use anything that might alter
location-rule results over time (attributes, rules with time in them, ...).
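Ken's suggestion above (a lower, but still positive, score on the device's own target instead of an outright ban) could look like the following in pcs syntax. The resource and node names are purely illustrative, as are the score values.

```shell
# Anti-pattern being discussed: a -INFINITY ban on the target, which
# (per cib_device_update) can disable the device there entirely:
#   pcs constraint location fenceMpath-x3650e avoids x3650e

# Preferred sketch: favor the peer, but keep the target's score positive
# so the device stays usable when no other node is available.
pcs constraint location fenceMpath-x3650e prefers x3650f=100
pcs constraint location fenceMpath-x3650e prefers x3650e=10
```

The point of keeping the target's score positive is exactly the rule quoted above: the device has to remain *runnable* on a node for that node to use it for fencing, even though it need not be started there at that moment.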