Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
On Tue, 2017-08-15 at 08:42 +0200, Jan Friesse wrote:
> Ken Gaillot napsal(a):
> > On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> >> On Wed, 2017-08-02 at 09:59 +0000, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> In Pacemaker-1.1.17, the attribute updated while starting pacemaker is
> >>> not displayed in crm_mon.
> >>> In Pacemaker-1.1.16, it is displayed and results are different.
> >>>
> >>> https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> >>> This commit is the cause, but is the following result (3.) expected
> >>> behavior?
> >>
> >> This turned out to be an odd one. The sequence of events is:
> >>
> >> 1. When the node leaves the cluster, the DC (correctly) wipes all its
> >> transient attributes from attrd and the CIB.
> >>
> >> 2. Pacemaker is newly started on the node, and a transient attribute is
> >> set before the node joins the cluster.
> >>
> >> 3. The node joins the cluster, and its transient attributes (including
> >> the new value) are sync'ed with the rest of the cluster, in both attrd
> >> and the CIB. So far, so good.
> >>
> >> 4. Because this is the node's first join since its crmd started, its
> >> crmd wipes all of its transient attributes again. The idea is that the
> >> node may have restarted so quickly that the DC hasn't yet done it (step
> >> 1 here), so clear them now to avoid any problems with old values.
> >> However, the crmd wipes only the CIB -- not attrd (arguably a bug).
> >
> > Whoops, clarification: the node may have restarted so quickly that
> > corosync didn't notice it left, so the DC would never have gotten the
>
> Corosync always notices when a node leaves, no matter whether the node is
> gone longer than the token timeout or not.

Looking back at the original commit, it has a comment "OpenAIS has a nasty
habit of not being able to tell if a node is returning or didn't leave in
the first place", so it looks like it's only relevant on legacy stacks.

>
> > "peer lost" message that triggers wiping its transient attributes.
> >
> > I suspect the crmd wipes only the CIB in this case because we assumed
> > attrd would be empty at this point -- missing exactly this case where a
> > value was set between start-up and first join.
> >
> >> 5. With the older pacemaker version, both the joining node and the DC
> >> would request a full write-out of all values from attrd. Because step 4
> >> only wiped the CIB, this ends up restoring the new value. With the newer
> >> pacemaker version, this step is no longer done, so the value winds up
> >> staying in attrd but not in the CIB (until the next write-out naturally
> >> occurs).
> >>
> >> I don't have a solution yet, but step 4 is clearly the problem (rather
> >> than the new code that skips step 5, which is still a good idea
> >> performance-wise). I'll keep working on it.
> >>
> >>> [test case]
> >>> 1. Start pacemaker on two nodes at the same time and update the attribute
> >>>    during startup.
> >>>    In this case, the attribute is displayed in crm_mon.
> >>>
> >>>    [root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; attrd_updater -n KEY -U V-1' ; \
> >>>                    ssh -f node3 'systemctl start pacemaker ; attrd_updater -n KEY -U V-3'
> >>>    [root@node1 ~]# crm_mon -QA1
> >>>    Stack: corosync
> >>>    Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum
> >>>
> >>>    2 nodes configured
> >>>    0 resources configured
> >>>
> >>>    Online: [ node1 node3 ]
> >>>
> >>>    No active resources
> >>>
> >>>    Node Attributes:
> >>>    * Node node1:
> >>>        + KEY                       : V-1
> >>>    * Node node3:
> >>>        + KEY                       : V-3
> >>>
> >>> 2. Restart pacemaker on node1, and update the attribute during startup.
> >>>
> >>>    [root@node1 ~]# systemctl stop pacemaker
> >>>    [root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U V-10
> >>>
> >>> 3. The attribute is registered in attrd but it is not registered in CIB,
> >>>    so the updated attribute is not displayed in crm_mon.
> >>>
> >>>    [root@node1 ~]# attrd_updater -Q -n KEY -A
> >>>    name="KEY" host="node3" value="V-3"
> >>>    name="KEY" host="node1" value="V-10"
> >>>
> >>>    [root@node1 ~]# crm_mon -QA1
> >>>    Stack: corosync
> >>>    Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum
> >>>
> >>>    2 nodes configured
> >>>    0 resources configured
> >>>
> >>>    Online: [ node1 node3 ]
> >>>
> >>>    No active resources
> >>>
> >>>    Node Attributes:
> >>>    * Node node1:
> >>>    * Node node3:
> >>>        + KEY                       : V-3
> >>>
> >>> Best Regards
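
As a side note for anyone reproducing this, the attrd/CIB mismatch described
above can be checked directly by comparing attrd's view with the status
section of the CIB. This is only a minimal sketch, assuming the usual
node_state/transient_attributes layout in the CIB status section and the
node and attribute names from the test case above:

    # What attrd currently holds, cluster-wide:
    [root@node1 ~]# attrd_updater -Q -n KEY -A

    # What actually reached the CIB status section for node1:
    [root@node1 ~]# cibadmin --query --xpath "//node_state[@uname='node1']//transient_attributes"

If the first command shows the value but the second returns no matching
nvpair, you are seeing exactly the gap between step 4 and step 5 described
above.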
Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
Hi Ken,

Thanks for the explanation.

As additional information, we are using a daemon (*1) that registers
Corosync's ring status as node attributes, so I want to avoid cases where
attributes are not displayed.

*1 ifcheckd. It is not a resource; it is always running and registers the
attributes while Pacemaker is running.
( https://github.com/linux-ha-japan/pm_extras/tree/master/tools )

Attribute example:

  Node Attributes:
  * Node rhel73-1:
      + ringnumber_0                    : 192.168.101.131 is UP
      + ringnumber_1                    : 192.168.102.131 is UP
  * Node rhel73-2:
      + ringnumber_0                    : 192.168.101.132 is UP
      + ringnumber_1                    : 192.168.102.132 is UP

Regards,
Kazunori INOUE

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, August 15, 2017 2:42 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
>
> On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> > On Wed, 2017-08-02 at 09:59 +0000, 井上 和徳 wrote:
> > > Hi,
> > >
> > > In Pacemaker-1.1.17, the attribute updated while starting pacemaker is
> > > not displayed in crm_mon.
> > > In Pacemaker-1.1.16, it is displayed and results are different.
> > >
> > > https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> > > This commit is the cause, but is the following result (3.) expected
> > > behavior?
> >
> > This turned out to be an odd one. The sequence of events is:
> >
> > 1. When the node leaves the cluster, the DC (correctly) wipes all its
> > transient attributes from attrd and the CIB.
> >
> > 2. Pacemaker is newly started on the node, and a transient attribute is
> > set before the node joins the cluster.
> >
> > 3. The node joins the cluster, and its transient attributes (including
> > the new value) are sync'ed with the rest of the cluster, in both attrd
> > and the CIB. So far, so good.
> >
> > 4. Because this is the node's first join since its crmd started, its
> > crmd wipes all of its transient attributes again. The idea is that the
> > node may have restarted so quickly that the DC hasn't yet done it (step
> > 1 here), so clear them now to avoid any problems with old values.
> > However, the crmd wipes only the CIB -- not attrd (arguably a bug).
>
> Whoops, clarification: the node may have restarted so quickly that
> corosync didn't notice it left, so the DC would never have gotten the
> "peer lost" message that triggers wiping its transient attributes.
>
> I suspect the crmd wipes only the CIB in this case because we assumed
> attrd would be empty at this point -- missing exactly this case where a
> value was set between start-up and first join.
>
> > 5. With the older pacemaker version, both the joining node and the DC
> > would request a full write-out of all values from attrd. Because step 4
> > only wiped the CIB, this ends up restoring the new value. With the newer
> > pacemaker version, this step is no longer done, so the value winds up
> > staying in attrd but not in the CIB (until the next write-out naturally
> > occurs).
> >
> > I don't have a solution yet, but step 4 is clearly the problem (rather
> > than the new code that skips step 5, which is still a good idea
> > performance-wise). I'll keep working on it.
> >
> > > [test case]
> > > 1. Start pacemaker on two nodes at the same time and update the attribute
> > >    during startup.
> > >    In this case, the attribute is displayed in crm_mon.
> > >
> > >    [root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; attrd_updater -n KEY -U V-1' ; \
> > >                    ssh -f node3 'systemctl start pacemaker ; attrd_updater -n KEY -U V-3'
> > >    [root@node1 ~]# crm_mon -QA1
> > >    Stack: corosync
> > >    Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum
> > >
> > >    2 nodes configured
> > >    0 resources configured
> > >
> > >    Online: [ node1 node3 ]
> > >
> > >    No active resources
> > >
> > >    Node Attributes:
> > >    * Node node1:
> > >        + KEY                       : V-1
> > >    * Node node3:
> > >        + KEY                       : V-3
> > >
> > > 2. Restart pacemaker on node1, and update the attribute during startup.
> > >
> > >    [root@node1 ~]# systemctl stop pacemaker
> > >    [root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U V-10
> > >
> > > 3. The attribute is registered in attrd but it is not registered in CIB,
> > >    so the updated attribute is not displayed in crm_mon.
> > >
> > >    [root@node1 ~]# attrd_updater -Q -n KEY -A
> > >    name="KEY" host="node3" value="V-3"
> > >    name="KEY" host="node1" value="V-10"
> > >
> > >    [root@node1 ~]# crm_mon -QA1
> > >    Stack: corosync
> > >    Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with quorum
> >
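
A possible interim workaround for a setup like the ifcheckd one described
above, until pacemaker itself is fixed, is to force attrd to write its
current values back to the CIB once the node has finished joining. This is
only a sketch, not an official recommendation: it assumes your attrd_updater
build supports the advanced -R/--refresh option, which asks the attribute
manager to resend all of its current values to the CIB.

    # After the node shows as online in crm_mon, force a write-out of
    # everything attrd currently holds:
    [root@node1 ~]# attrd_updater -R

A daemon such as ifcheckd could do the equivalent itself by simply
re-registering its attributes once it sees that the node has joined.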
Re: [ClusterLabs] Antw: Re: Notification agent and Notification recipients
Thanks for clarifying.

Regards,
Sriram.

On Mon, Aug 14, 2017 at 7:34 PM, Klaus Wenninger wrote:
> On 08/14/2017 03:19 PM, Sriram wrote:
>
> Yes, I had pre-created the script file with the required permissions.
>
> [root@node1 alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4140 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
> [root@node2 alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4139 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
> [root@node3 alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4139 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
>
> Later I observed that the user "hacluster" is not able to create the log
> file under /usr/share/pacemaker/alert_file.log.
> I am sorry, I should have observed this in the log before posting the
> query. When I gave the path as /tmp/alert_file.log instead, the file is
> created now.
> Thanks for pointing it out.
>
> I have one more clarification.
>
> The resource is running on node2:
>
> [root@node2 tmp]# pcs resource
>  TRR    (ocf::heartbeat:TimingRedundancyRA):    Started node2
>
> And I executed the command below to put node2 into standby:
>
> [root@node2 tmp]# pcs node standby node2
>
> The resource shifted to node3, because of the higher location constraint:
>
> [root@node2 tmp]# pcs resource
>  TRR    (ocf::heartbeat:TimingRedundancyRA):    Started node3
>
> The log file was created on node2 (resource stopped) and on node3
> (resource started).
>
> Node1 was not notified about the resource shift; I mean, no log file was
> created there.
> I think it's because alerts are designed to notify external agents about
> cluster events, not to provide internal notifications.
>
> Is my understanding correct?
>
>
> Quite simple: crmd of node1 just didn't have anything to do with shifting
> the resource from node2 -> node3. There is no additional information
> passed between the nodes just to create a full set of notifications on
> every node. If you want to have a full log (or whatever your alert-agent
> is doing) in one place, this would be up to your alert-agent.
>
>
> Regards,
> Klaus
>
>
> Regards,
> Sriram.
>
>
> On Mon, Aug 14, 2017 at 5:42 PM, Klaus Wenninger wrote:
>> On 08/14/2017 12:32 PM, Sriram wrote:
>>
>> Hi Ken,
>>
>> I used the alerts as well, but it does not seem to be working.
>>
>> Please check the configuration below:
>>
>> [root@node1 alerts]# pcs config show
>> Cluster Name:
>> Corosync Nodes:
>> Pacemaker Nodes:
>>  node1 node2 node3
>>
>> Resources:
>>  Resource: TRR (class=ocf provider=heartbeat type=TimingRedundancyRA)
>>   Operations: start interval=0s timeout=60s (TRR-start-interval-0s)
>>               stop interval=0s timeout=20s (TRR-stop-interval-0s)
>>               monitor interval=10 timeout=20 (TRR-monitor-interval-10)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>>   Resource: TRR
>>     Enabled on: node1 (score:100) (id:location-TRR-node1-100)
>>     Enabled on: node2 (score:200) (id:location-TRR-node2-200)
>>     Enabled on: node3 (score:300) (id:location-TRR-node3-300)
>> Ordering Constraints:
>> Colocation Constraints:
>> Ticket Constraints:
>>
>> Alerts:
>>  Alert: alert_file (path=/usr/share/pacemaker/alert_file.sh)
>>   Options: debug_exec_order=false
>>   Meta options: timeout=15s
>>   Recipients:
>>    Recipient: recipient_alert_file_id (value=/usr/share/pacemaker/alert_file.log)
>>
>>
>> Did you pre-create the file with proper rights? Be aware that the
>> alert-agent is called as user hacluster.
>>
>> Resources Defaults:
>>  resource-stickiness: INFINITY
>> Operations Defaults:
>>  No defaults set
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  dc-version: 1.1.15-11.el7_3.4-e174ec8
>>  default-action-timeout: 240
>>  have-watchdog: false
>>  no-quorum-policy: ignore
>>  placement-strategy: balanced
>>  stonith-enabled: false
>>  symmetric-cluster: false
>>
>> Quorum:
>>  Options:
>>
>>
>> /usr/share/pacemaker/alert_file.sh does not get called whenever I
>> trigger a failover scenario.
>> Please let me know if I'm missing anything.
>>
>>
>> Do you get any logs - like for startup of resources - or nothing at all?
>>
>> Regards,
>> Klaus
>>
>>
>> Regards,
>> Sriram.
>>
>> On Tue, Aug 8, 2017 at 8:29 PM, Ken Gaillot wrote:
>>
>>> On Tue, 2017-08-08 at 17:40 +0530, Sriram wrote:
>>> > Hi Ulrich,
>>> >
>>> > Please see inline.
>>> >
>>> > On Tue, Aug 8, 2017 at 2:01 PM, Ulrich Windl wrote:
>>>         Sriram wrote on 08.08.2017 at 09:30 in message
>>>         +dv...@mail.gmail.com>:
>>> > > Hi Ken & Jan,
>>> > >
>>> > > In the cluster we have, there is only one resource running. It's an
>>> > > OPT-IN cluster with resource-stickiness set to INFINITY.
>>> > >
>>> > > Just to clarify my question, let's take a scenario where there are
>>> > > four nodes N1,
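
For reference, a minimal alert agent along the lines of the alert_file.sh
discussed above could look like the sketch below. This is only an
illustrative example, not the script actually used in the thread: the
CRM_alert_* variables are the environment variables Pacemaker exports to
alert agents, and the fallback path /tmp/alert_file.log is simply the
recipient value that ended up working above.

    #!/bin/sh
    # Minimal alert agent sketch: append one line per cluster event to the
    # file given as the recipient value (fall back to /tmp if unset).
    logfile="${CRM_alert_recipient:-/tmp/alert_file.log}"

    # CRM_alert_kind is "node", "fencing" or "resource"; the other
    # variables are filled in depending on the kind. CRM_alert_timestamp
    # is formatted according to the alert's timestamp-format meta option.
    case "$CRM_alert_kind" in
        node)
            echo "$CRM_alert_timestamp node $CRM_alert_node is now $CRM_alert_desc" >>"$logfile"
            ;;
        fencing)
            echo "$CRM_alert_timestamp fencing of $CRM_alert_node: $CRM_alert_desc" >>"$logfile"
            ;;
        resource)
            echo "$CRM_alert_timestamp $CRM_alert_task of $CRM_alert_rsc on $CRM_alert_node: $CRM_alert_desc (rc=$CRM_alert_rc)" >>"$logfile"
            ;;
        *)
            echo "$CRM_alert_timestamp unhandled alert kind: $CRM_alert_kind" >>"$logfile"
            ;;
    esac
    exit 0

Because the agent runs as the hacluster user, both the script and the
recipient file need suitable permissions, for example:

    [root@node1 ~]# chmod 755 /usr/share/pacemaker/alert_file.sh
    [root@node1 ~]# touch /tmp/alert_file.log
    [root@node1 ~]# chown hacluster:haclient /tmp/alert_file.log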