Re: [ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-15 Thread Ken Gaillot
On Tue, 2017-08-15 at 08:42 +0200, Jan Friesse wrote:
> Ken Gaillot wrote:
> > On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> >> On Wed, 2017-08-02 at 09:59 +, 井上 和徳 wrote:
> >>> Hi,
> >>>
> >>> In Pacemaker-1.1.17, an attribute updated while pacemaker is starting is 
> >>> not displayed in crm_mon.
> >>> In Pacemaker-1.1.16 it is displayed, so the behavior differs.
> >>>
> >>> https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> >>> This commit is the cause, but is the following result (3.) the expected 
> >>> behavior?
> >>
> >> This turned out to be an odd one. The sequence of events is:
> >>
> >> 1. When the node leaves the cluster, the DC (correctly) wipes all its
> >> transient attributes from attrd and the CIB.
> >>
> >> 2. Pacemaker is newly started on the node, and a transient attribute is
> >> set before the node joins the cluster.
> >>
> >> 3. The node joins the cluster, and its transient attributes (including
> >> the new value) are sync'ed with the rest of the cluster, in both attrd
> >> and the CIB. So far, so good.
> >>
> >> 4. Because this is the node's first join since its crmd started, its
> >> crmd wipes all of its transient attributes again. The idea is that the
> >> node may have restarted so quickly that the DC hasn't yet done it (step
> >> 1 here), so clear them now to avoid any problems with old values.
> >> However, the crmd wipes only the CIB -- not attrd (arguably a bug).
> >
> > Whoops, clarification: the node may have restarted so quickly that
> > corosync didn't notice it left, so the DC would never have gotten the
> 
> Corosync always notices when a node leaves, no matter whether the node is 
> gone for longer than the token timeout or comes back within it.

Looking back at the original commit, it has a comment "OpenAIS has a
nasty habit of not being able to tell if a node is returning or didn't
leave in the first place", so it looks like it's only relevant on legacy
stacks.

> 
> > "peer lost" message that triggers wiping its transient attributes.
> >
> > I suspect the crmd wipes only the CIB in this case because we assumed
> > attrd would be empty at this point -- missing exactly this case where a
> > value was set between start-up and first join.
> >
> >> 5. With the older pacemaker version, both the joining node and the DC
> >> would request a full write-out of all values from attrd. Because step 4
> >> only wiped the CIB, this ends up restoring the new value. With the newer
> >> pacemaker version, this step is no longer done, so the value winds up
> >> staying in attrd but not in CIB (until the next write-out naturally
> >> occurs).
> >>
> >> I don't have a solution yet, but step 4 is clearly the problem (rather
> >> than the new code that skips step 5, which is still a good idea
> >> performance-wise). I'll keep working on it.
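
In the meantime, a possible workaround (just a sketch, assuming the
attrd_updater -R/--refresh option is available in your build) is to compare
what attrd holds with what actually reached the CIB, and then force attrd to
flush its values back out once the node has joined:

   # What attrd has in memory
   [root@node1 ~]# attrd_updater -Q -n KEY -A

   # What made it into the CIB
   [root@node1 ~]# cibadmin --query --xpath "//nvpair[@name='KEY']"

   # Ask attrd to write all of its current values out to the CIB
   [root@node1 ~]# attrd_updater -R

Simply re-setting the attribute (attrd_updater -n KEY -U V-10) after the
join completes should also trigger a write-out.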
> >>
> >>> [test case]
> >>> 1. Start pacemaker on two nodes at the same time and update the attribute 
> >>> during startup.
> >>> In this case, the attribute is displayed in crm_mon.
> >>>
> >>> [root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; 
> >>> attrd_updater -n KEY -U V-1' ; \
> >>> ssh -f node3 'systemctl start pacemaker ; 
> >>> attrd_updater -n KEY -U V-3'
> >>> [root@node1 ~]# crm_mon -QA1
> >>> Stack: corosync
> >>> Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with 
> >>> quorum
> >>>
> >>> 2 nodes configured
> >>> 0 resources configured
> >>>
> >>> Online: [ node1 node3 ]
> >>>
> >>> No active resources
> >>>
> >>>
> >>> Node Attributes:
> >>> * Node node1:
> >>> + KEY   : V-1
> >>> * Node node3:
> >>> + KEY   : V-3
> >>>
> >>>
> >>> 2. Restart pacemaker on node1, and update the attribute during startup.
> >>>
> >>> [root@node1 ~]# systemctl stop pacemaker
> >>> [root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U 
> >>> V-10
> >>>
> >>>
> >>> 3. The attribute is registered in attrd but it is not registered in CIB,
> >>> so the updated attribute is not displayed in crm_mon.
> >>>
> >>> [root@node1 ~]# attrd_updater -Q -n KEY -A
> >>> name="KEY" host="node3" value="V-3"
> >>> name="KEY" host="node1" value="V-10"
> >>>
> >>> [root@node1 ~]# crm_mon -QA1
> >>> Stack: corosync
> >>> Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with 
> >>> quorum
> >>>
> >>> 2 nodes configured
> >>> 0 resources configured
> >>>
> >>> Online: [ node1 node3 ]
> >>>
> >>> No active resources
> >>>
> >>>
> >>> Node Attributes:
> >>> * Node node1:
> >>> * Node node3:
> >>> + KEY   : V-3
> >>>
> >>>
> >>> Best Regards
> >>>

Re: [ClusterLabs] Updated attribute is not displayed in crm_mon

2017-08-15 Thread 井上 和徳
Hi Ken,

Thanks for the explanation.

As additional information, we are using a daemon (*1) that registers
Corosync's ring status as node attributes, so I want to avoid situations
where attributes are not displayed.

*1 ifcheckd is always running (it is not a resource) and registers
   attributes while Pacemaker is running.
   ( https://github.com/linux-ha-japan/pm_extras/tree/master/tools )
   Attribute example :

   Node Attributes:
   * Node rhel73-1:
   + ringnumber_0  : 192.168.101.131 is UP
   + ringnumber_1  : 192.168.102.131 is UP
   * Node rhel73-2:
   + ringnumber_0  : 192.168.101.132 is UP
   + ringnumber_1  : 192.168.102.132 is UP
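
   For reference, ifcheckd registers these values as transient node
   attributes; the kind of call involved is roughly the following (a sketch
   with hypothetical values, not the daemon's actual code):

      attrd_updater -n ringnumber_0 -U "192.168.101.131 is UP"
      attrd_updater -n ringnumber_1 -U "192.168.102.131 is UP"

   Because these can be set right after Pacemaker starts, they hit exactly
   the window described above where attrd has the value but the CIB does not.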

Regards,
Kazunori INOUE

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, August 15, 2017 2:42 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Updated attribute is not displayed in crm_mon
> 
> On Mon, 2017-08-14 at 12:33 -0500, Ken Gaillot wrote:
> > On Wed, 2017-08-02 at 09:59 +, 井上 和徳 wrote:
> > > Hi,
> > >
> > > In Pacemaker-1.1.17, an attribute updated while pacemaker is starting is 
> > > not displayed in crm_mon.
> > > In Pacemaker-1.1.16 it is displayed, so the behavior differs.
> > >
> > > https://github.com/ClusterLabs/pacemaker/commit/fe44f400a3116a158ab331a92a49a4ad8937170d
> > > This commit is the cause, but is the following result (3.) the expected 
> > > behavior?
> >
> > This turned out to be an odd one. The sequence of events is:
> >
> > 1. When the node leaves the cluster, the DC (correctly) wipes all its
> > transient attributes from attrd and the CIB.
> >
> > 2. Pacemaker is newly started on the node, and a transient attribute is
> > set before the node joins the cluster.
> >
> > 3. The node joins the cluster, and its transient attributes (including
> > the new value) are sync'ed with the rest of the cluster, in both attrd
> > and the CIB. So far, so good.
> >
> > 4. Because this is the node's first join since its crmd started, its
> > crmd wipes all of its transient attributes again. The idea is that the
> > node may have restarted so quickly that the DC hasn't yet done it (step
> > 1 here), so clear them now to avoid any problems with old values.
> > However, the crmd wipes only the CIB -- not attrd (arguably a bug).
> 
> Whoops, clarification: the node may have restarted so quickly that
> corosync didn't notice it left, so the DC would never have gotten the
> "peer lost" message that triggers wiping its transient attributes.
> 
> I suspect the crmd wipes only the CIB in this case because we assumed
> attrd would be empty at this point -- missing exactly this case where a
> value was set between start-up and first join.
> 
> > 5. With the older pacemaker version, both the joining node and the DC
> > would request a full write-out of all values from attrd. Because step 4
> > only wiped the CIB, this ends up restoring the new value. With the newer
> > pacemaker version, this step is no longer done, so the value winds up
> > staying in attrd but not in CIB (until the next write-out naturally
> > occurs).
> >
> > I don't have a solution yet, but step 4 is clearly the problem (rather
> > than the new code that skips step 5, which is still a good idea
> > performance-wise). I'll keep working on it.
> >
> > > [test case]
> > > 1. Start pacemaker on two nodes at the same time and update the attribute 
> > > during startup.
> > >In this case, the attribute is displayed in crm_mon.
> > >
> > >[root@node1 ~]# ssh -f node1 'systemctl start pacemaker ; 
> > > attrd_updater -n KEY -U V-1' ; \
> > >ssh -f node3 'systemctl start pacemaker ; 
> > > attrd_updater -n KEY -U V-3'
> > >[root@node1 ~]# crm_mon -QA1
> > >Stack: corosync
> > >Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with 
> > > quorum
> > >
> > >2 nodes configured
> > >0 resources configured
> > >
> > >Online: [ node1 node3 ]
> > >
> > >No active resources
> > >
> > >
> > >Node Attributes:
> > >* Node node1:
> > >+ KEY   : V-1
> > >* Node node3:
> > >+ KEY   : V-3
> > >
> > >
> > > 2. Restart pacemaker on node1, and update the attribute during startup.
> > >
> > >[root@node1 ~]# systemctl stop pacemaker
> > >[root@node1 ~]# systemctl start pacemaker ; attrd_updater -n KEY -U 
> > > V-10
> > >
> > >
> > > 3. The attribute is registered in attrd but it is not registered in CIB,
> > >so the updated attribute is not displayed in crm_mon.
> > >
> > >[root@node1 ~]# attrd_updater -Q -n KEY -A
> > >name="KEY" host="node3" value="V-3"
> > >name="KEY" host="node1" value="V-10"
> > >
> > >[root@node1 ~]# crm_mon -QA1
> > >Stack: corosync
> > >Current DC: node3 (version 1.1.17-1.el7-b36b869) - partition with 
> > > quorum
> > >
> 

Re: [ClusterLabs] Antw: Re: Notification agent and Notification recipients

2017-08-15 Thread Sriram
Thanks for clarifying.

Regards,
Sriram.

On Mon, Aug 14, 2017 at 7:34 PM, Klaus Wenninger wrote:

> On 08/14/2017 03:19 PM, Sriram wrote:
>
> Yes, I had pre-created the script file with the required permissions.
>
> [root@*node1* alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4140 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
> [root@*node2* alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4139 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
> [root@*node3* alerts]# ls -l /usr/share/pacemaker/alert_file.sh
> -rwxr-xr-x. 1 root root 4139 Aug 14 01:51 /usr/share/pacemaker/alert_file.sh
>
> Later I observed that the user "hacluster" is not able to create the log
> file at /usr/share/pacemaker/alert_file.log.
> I am sorry, I should have noticed this in the logs before posting the
> query. After I changed the path to /tmp/alert_file.log, the agent is able
> to create it.
> Thanks for pointing it out.
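
For anyone hitting the same problem: the alert-agent runs as the hacluster
user, so the recipient path must be writable by that user. A minimal sketch
of an agent along these lines (assuming the standard CRM_alert_* environment
variables; adjust paths and fields as needed) would be:

   #!/bin/sh
   # Append one line per cluster event to the recipient file.
   # The file (or its directory) must be writable by the hacluster user.
   logfile="${CRM_alert_recipient:-/tmp/alert_file.log}"
   printf '%s kind=%s node=%s rsc=%s task=%s desc=%s\n' \
       "$(date)" "$CRM_alert_kind" "$CRM_alert_node" \
       "$CRM_alert_rsc" "$CRM_alert_task" "$CRM_alert_desc" >> "$logfile"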
>
> I have one more question.
>
> If the resource is running on node2:
> [root@node2 tmp]# pcs resource
>  TRR (ocf::heartbeat:TimingRedundancyRA): Started node2
>
> and I execute the command below to put node2 in standby:
> [root@node2 tmp]# pcs node standby node2
>
> the resource shifts to node3 because of its higher location constraint:
> [root@node2 tmp]# pcs resource
>  TRR (ocf::heartbeat:TimingRedundancyRA): Started node3
>
>
> The log file was created on node2 (where the resource stopped) and on
> node3 (where the resource started).
>
> Node1 was not notified about the resource shift; I mean, no log file was
> created there.
> I assume that is because alerts are designed to notify external agents
> about cluster events, not to provide internal notifications between nodes.
>
> Is my understanding correct?
>
>
> Quite simple: the crmd on node1 just didn't have anything to do with
> shifting the resource from node2 -> node3. There is no additional
> information passed between the nodes just to create a full set of
> notifications on every node. If you want a full log (or whatever your
> alert-agent is doing) in one place, that would be up to your alert-agent.
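
If a single, complete log is wanted, one option (just a sketch, assuming a
logger(1) that supports remote syslog via -n) is to have the alert-agent on
every node forward each event to one central syslog host instead of writing
a local file:

   #!/bin/sh
   # Forward the alert to a central syslog server; "loghost" is a placeholder.
   logger -n loghost -t pacemaker-alert \
       "kind=$CRM_alert_kind node=$CRM_alert_node rsc=$CRM_alert_rsc desc=$CRM_alert_desc"

That collects everything in one place without Pacemaker having to pass any
extra information between the cluster nodes.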
>
>
> Regards,
> Klaus
>
>
> Regards,
> Sriram.
>
>
>
> On Mon, Aug 14, 2017 at 5:42 PM, Klaus Wenninger wrote:
>
>> On 08/14/2017 12:32 PM, Sriram wrote:
>>
>> Hi Ken,
>>
>> I used alerts as well, but they do not seem to be working.
>>
>> Please check the below configuration
>> [root@node1 alerts]# pcs config show
>> Cluster Name:
>> Corosync Nodes:
>> Pacemaker Nodes:
>>  node1 node2 node3
>>
>> Resources:
>>  Resource: TRR (class=ocf provider=heartbeat type=TimingRedundancyRA)
>>   Operations: start interval=0s timeout=60s (TRR-start-interval-0s)
>>   stop interval=0s timeout=20s (TRR-stop-interval-0s)
>>   monitor interval=10 timeout=20 (TRR-monitor-interval-10)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>>   Resource: TRR
>> Enabled on: node1 (score:100) (id:location-TRR-node1-100)
>> Enabled on: node2 (score:200) (id:location-TRR-node2-200)
>> Enabled on: node3 (score:300) (id:location-TRR-node3-300)
>> Ordering Constraints:
>> Colocation Constraints:
>> Ticket Constraints:
>>
>> Alerts:
>>  Alert: alert_file (path=/usr/share/pacemaker/alert_file.sh)
>>   Options: debug_exec_order=false
>>   Meta options: timeout=15s
>>   Recipients:
>>    Recipient: recipient_alert_file_id (value=/usr/share/pacemaker/alert_file.log)
>>
>>
>> Did you pre-create the file with proper rights? Be aware that the
>> alert-agent
>> is called as user hacluster.
>>
>>
>> Resources Defaults:
>>  resource-stickiness: INFINITY
>> Operations Defaults:
>>  No defaults set
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  dc-version: 1.1.15-11.el7_3.4-e174ec8
>>  default-action-timeout: 240
>>  have-watchdog: false
>>  no-quorum-policy: ignore
>>  placement-strategy: balanced
>>  stonith-enabled: false
>>  symmetric-cluster: false
>>
>> Quorum:
>>   Options:
>>
>>
>> /usr/share/pacemaker/alert_file.sh does not get called when I
>> trigger a failover scenario.
>> Please let me know if I'm missing anything.
>>
>>
>> Do you get any logs - like for startup of resources - or nothing at all?
>>
>> Regards,
>> Klaus
>>
>>
>>
>>
>> Regards,
>> Sriram.
>>
>> On Tue, Aug 8, 2017 at 8:29 PM, Ken Gaillot wrote:
>>
>>> On Tue, 2017-08-08 at 17:40 +0530, Sriram wrote:
>>> > Hi Ulrich,
>>> >
>>> >
>>> > Please see inline.
>>> >
>>> > On Tue, Aug 8, 2017 at 2:01 PM, Ulrich Windl
>>> >  wrote:
>>> > >>> Sriram wrote on 08.08.2017 at 09:30 in message
>>> > >> > +dv...@mail.gmail.com>:
>>> > > Hi Ken & Jan,
>>> > >
>>> > > In the cluster we have, there is only one resource running. It's an
>>> > > opt-in cluster with resource-stickiness set to INFINITY.
>>> > >
>>> > > Just to clarify my question, let's take a scenario where there are
>>> > > four nodes N1,