Hi,

You are right, the problem does seem to come from some services that are started at boot.

My installation script disables boot-time startup for all the services we use, which is why I hadn't focused on this possibility.

But after a quick investigation, it turns out a colleague had the "good idea" of writing a "safety" script that monitors certain services and starts them itself, outside of Pacemaker's control.
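
For anyone who runs into the same symptom: one quick way to confirm that a unit is being started outside of Pacemaker is to look at that unit's journal on the passive node (service1 below is just a placeholder for the real unit name):

    $ journalctl -u service1 --since "1 hour ago"

Any start entries there that don't line up with Pacemaker's own actions in the corosync/pacemaker logs point to something external starting the service.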


Sorry for contacting you over such a small mistake,

Thank you for the help; it was effective.

Quentin



On 27/08/2020 at 09:56, Reid Wahl wrote:
Hi, Quentin. Thanks for the logs!

I see you highlighted the fact that SERVICE1 was in "Stopping" state on both node 1 and node 2 when node 1 was rejoining the cluster. I also noted the following later in the logs, as well as some similar messages earlier:

Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:    Operation monitor found resource SERVICE1 active on NODE1
Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:    Operation monitor found resource SERVICE1 active on NODE1
Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:    Operation monitor found resource SERVICE4 active on NODE2
Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:    Operation monitor found resource SERVICE1 active on NODE2
...
Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:           1 : NODE1
Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:           2 : NODE2
...
Aug 27 08:47:02 [1330] NODE2    pengine:    error: native_create_actions:  Resource SERVICE1 is active on 2 nodes (attempting recovery)
Aug 27 08:47:02 [1330] NODE2    pengine:   notice: native_create_actions:  See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information

Can you make sure that all the cluster-managed systemd services are disabled from starting at boot (i.e., `systemctl is-enabled service1`, and the same for all the others) on both nodes? If they are enabled, disable them.
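
For example, something along these lines on each node would show what is still enabled and turn it off (substitute your real unit names for service1/service2/service3):

    # systemctl is-enabled service1 service2 service3
    # systemctl disable service1 service2 service3

`systemctl is-enabled` prints one status per unit, so anything that reports "enabled" is a candidate for the disable command.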

On Thu, Aug 27, 2020 at 12:46 AM Citron Vert <[email protected]> wrote:

    Hi,

    Sorry for using this email address; my name is Quentin. Thank you
    for your reply.

    I had already tried the stickiness solution (with the deprecated
    property). I tried the command you gave me, and it did not change
    anything.

    Resources don't seem to move from node to node (I don't see any
    changes with the crm_mon command).


    In the logs I found this line: "error: native_create_actions:
    Resource SERVICE1 is active on 2 nodes"

    That is what led me to contact you, to understand and learn a
    little more about this cluster, and why there are resources
    running on the passive node.


    You will find attached the logs from the reboot of the passive
    node, as well as my cluster configuration.

    I think I'm missing something in the configuration or the logs
    that I don't understand.


    Thank you in advance for your help,

    Quentin


    On 26/08/2020 at 20:16, Reid Wahl wrote:
    Hi, Citron.

    Based on your description, it sounds like some resources
    **might** be moving from node 1 to node 2, failing on node 2, and
    then moving back to node 1. If that's what's happening (and even
    if it's not), then it's probably smart to set some resource
    stickiness as a resource default. The below command sets a
    resource stickiness score of 1.

        # pcs resource defaults resource-stickiness=1
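
    (To double-check that the default took effect — in the pcs 0.9 series
    shipped with CentOS 7, if I remember the syntax right, listing the
    current resource defaults is just:

        # pcs resource defaults

    and resource-stickiness=1 should show up in that output once set.)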

    Also note that the "default-resource-stickiness" cluster property
    is deprecated and should not be used.

    Finally, an explicit default resource stickiness score of 0 can
    interfere with the placement of cloned resource instances. If you
    don't want any stickiness, then it's better to leave stickiness
    unset. That way, primitives will have a stickiness of 0, but
    clone instances will have a stickiness of 1.
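
    As a rough cleanup sketch, assuming the pcs 0.9 syntax that ships with
    CentOS 7 (worth double-checking against `man pcs` on your nodes), the
    deprecated cluster property can be dropped first:

        # pcs property unset default-resource-stickiness

    Then, if you do want stickiness, the resource-defaults command above
    sets it in the supported way; otherwise just leave it unset, for the
    clone-instance reason described above.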

    If adding stickiness does not resolve the issue, can you share
    your cluster configuration and some logs that show the issue
    happening? Off the top of my head I'm not sure why resources
    would start and stop on node 2 without moving away from node1,
    unless they're clone instances that are starting and then failing
    a monitor operation on node 2.

    On Wed, Aug 26, 2020 at 8:42 AM Citron Vert
    <[email protected] <mailto:[email protected]>> wrote:

        Hello,
        I am contacting you because I have a problem with my cluster
        and I cannot find (nor understand) any information that can
        help me.

        I have a 2-node cluster (pacemaker, corosync, pcs) installed
        on CentOS 7 with our configuration applied.
        Everything seems to work fine, but here is what happens:

          * Node1 and Node2 are running well, with Node1 as primary
          * I reboot Node2, which is passive (no changes on Node1)
          * Node2 comes back into the cluster as passive
          * the corosync logs show resources getting started and then
            stopped on Node2
          * the "crm_mon" command shows some resources on Node1 getting
            restarted

        I don't understand how this is supposed to work.
        If a node comes back and becomes passive (since Node1 is
        already running as primary), is there any reason for the
        resources to be started and then stopped on the new passive node?

        One of my resources becomes unstable because it gets started
        and then stopped too quickly on Node2, which seems to make it
        restart on Node1 without a failover.

        I have tried several things and solutions proposed by different
        sites and forums, but without success.


        Is there a way to make a node that joins the cluster as
        passive not start its own resources?


        Thanks in advance,


        Here is some information, just in case:

        $ rpm -qa | grep -E "corosync|pacemaker|pcs"
        corosync-2.4.5-4.el7.x86_64
        pacemaker-cli-1.1.21-4.el7.x86_64
        pacemaker-1.1.21-4.el7.x86_64
        pcs-0.9.168-4.el7.centos.x86_64
        corosynclib-2.4.5-4.el7.x86_64
        pacemaker-libs-1.1.21-4.el7.x86_64
        pacemaker-cluster-libs-1.1.21-4.el7.x86_64


                <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" 
value="false"/>
                <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" 
value="ignore"/>
                <nvpair id="cib-bootstrap-options-dc-deadtime" name="dc-deadtime" 
value="120s"/>
                <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" 
value="false"/>
                <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
value="1.1.21-4.el7-f14e36fd43"/>
                <nvpair id="cib-bootstrap-options-cluster-infrastructure" 
name="cluster-infrastructure" value="corosync"/>
                <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" 
value="CLUSTER"/>
                <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" 
value="1598446314"/>
                <nvpair id="cib-bootstrap-options-default-resource-stickiness" 
name="default-resource-stickiness" value="0"/>







    --
    Regards,

    Reid Wahl, RHCA
    Software Maintenance Engineer, Red Hat
    CEE - Platform Support Delivery - ClusterHA



--
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
