Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-18 Thread Afek, Ifat (Nokia - IL)


From: Yujun Zhang 
Date: Tuesday, 17 January 2017 at 02:41


Sounds good.

Have you created an etherpad page for collecting topics, Ifat?

Here: https://etherpad.openstack.org/p/vitrage-pike-design-sessions


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-16 Thread Yujun Zhang
Sounds good.

Have you created an etherpad page for collecting topics, Ifat?

On Mon, Jan 16, 2017 at 10:43 PM Afek, Ifat (Nokia - IL) <
ifat.a...@nokia.com> wrote:

>
>
> *From: *Yujun Zhang 
> *Date: *Sunday, 15 January 2017 at 17:53
>
>
>
> About fault and alarm: what I was thinking about is the causal (deduction)
> chain in root cause analysis.
>
>
>
> Fault state means the resource is not fully functional and it is evaluated
> by related indicators. There are alarms on events like power loss or
> measurands like CPU high, memory low, temperature high. There are also
> alarms based on deduced state, such as "host fault", "instance fault".
>
>
>
> So an example chain would be
>
> · "FAULT: power line cut off" =(monitor)=> "ALARM: host power
> loss" =(inspect)=> "FAULT: host is unavailable" =(action)=> "ALARM: host
> fault"
>
> · "FAULT: power line cut off" =(monitor)=> "ALARM: host power
> loss" =(inspect)=> "FAULT: host is unavailable" =(inspect)=> "FAULT:
> instance is unavailable" =(action)=> "ALARM: instance fault"
>
> If we omit the resource, then we get the causal chain as it is in Vitrage
>
> · "ALARM: host power loss" =(causes)=> "ALARM: host fault"
>
> · "ALARM: host power loss" =(causes)=> "ALARM: instance fault"
>
> But what the user cares about might be that "FAULT: power line cut off"
> causes all these alarms. What I haven't worked out yet is the equivalence
> between fault and alarm.
>
>
>
> I may have made it more complex with my *immature* thoughts. It could be
> even more complex if we consider multiple upstream causes and downstream
> outcomes. It may be an interesting topic to discuss in a design session.
>
>
>
>
>
> [Ifat] I agree. Let’s discuss this in the next design session we’ll have
>
>
>
>


Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-16 Thread Afek, Ifat (Nokia - IL)

From: Yujun Zhang 
Date: Sunday, 15 January 2017 at 17:53


About fault and alarm: what I was thinking about is the causal (deduction)
chain in root cause analysis.

Fault state means the resource is not fully functional and it is evaluated by
related indicators. There are alarms on events like power loss or measurands
like CPU high, memory low, temperature high. There are also alarms based on
deduced state, such as "host fault", "instance fault".

So an example chain would be

· "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss"
=(inspect)=> "FAULT: host is unavailable" =(action)=> "ALARM: host fault"
· "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss"
=(inspect)=> "FAULT: host is unavailable" =(inspect)=> "FAULT: instance is
unavailable" =(action)=> "ALARM: instance fault"

If we omit the resource, then we get the causal chain as it is in Vitrage

· "ALARM: host power loss" =(causes)=> "ALARM: host fault"
· "ALARM: host power loss" =(causes)=> "ALARM: instance fault"

But what the user cares about might be that "FAULT: power line cut off" causes
all these alarms. What I haven't worked out yet is the equivalence between
fault and alarm.

I may have made it more complex with my immature thoughts. It could be even
more complex if we consider multiple upstream causes and downstream outcomes. It
may be an interesting topic to discuss in a design session.


[Ifat] I agree. Let’s discuss this in the next design session we’ll have




Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-15 Thread Yujun Zhang
About fault and alarm: what I was thinking about is the causal (deduction)
chain in root cause analysis.

Fault state means the resource is not fully functional and it is evaluated
by related indicators. There are alarms on events like power loss or
measurands like CPU high, memory low, temperature high. There are also
alarms based on deduced state, such as "host fault", "instance fault".

So an example chain would be

   - "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss"
   =(inspect)=> "FAULT: host is unavailable" =(action)=> "ALARM: host fault"
   - "FAULT: power line cut off" =(monitor)=> "ALARM: host power loss"
   =(inspect)=> "FAULT: host is unavailable" =(inspect)=> "FAULT: instance is
   unavailable" =(action)=> "ALARM: instance fault"

If we omit the resource, then we get the causal chain as it is in Vitrage

   - "ALARM: host power loss" =(causes)=> "ALARM: host fault"
   - "ALARM: host power loss" =(causes)=> "ALARM: instance fault"

But what the user cares about might be that "FAULT: power line cut off"
causes all these alarms. What I haven't worked out yet is the equivalence
between fault and alarm.
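
The step from the full chain to the alarm-only chain can be sketched in a few lines of Python. This is only an illustration, not Vitrage code: the `Node` type and the helpers `alarm_causal_edges` and `root_fault` are hypothetical names, and the node texts are copied from the example chains above.

```python
from collections import namedtuple

# Hypothetical node type for this sketch; kind is "FAULT" or "ALARM".
Node = namedtuple("Node", ["kind", "text"])


def alarm_causal_edges(chain):
    """Drop FAULT nodes and link each remaining alarm to the next one."""
    alarms = [node.text for node in chain if node.kind == "ALARM"]
    return list(zip(alarms, alarms[1:]))


def root_fault(chain):
    """Return the first FAULT in the chain, or None if there is none."""
    for node in chain:
        if node.kind == "FAULT":
            return node.text
    return None


host_chain = [
    Node("FAULT", "power line cut off"),
    Node("ALARM", "host power loss"),
    Node("FAULT", "host is unavailable"),
    Node("ALARM", "host fault"),
]
instance_chain = [
    Node("FAULT", "power line cut off"),
    Node("ALARM", "host power loss"),
    Node("FAULT", "host is unavailable"),
    Node("FAULT", "instance is unavailable"),
    Node("ALARM", "instance fault"),
]

for chain in (host_chain, instance_chain):
    print(alarm_causal_edges(chain), "<- root fault:", root_fault(chain))
```

Both chains collapse to a single "causes" edge between alarms, while `root_fault` keeps the information that the alarm-only view loses.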

I may have made it more complex with my *immature* thoughts. It could be
even more complex if we consider multiple upstream causes and downstream
outcomes. It may be an interesting topic to discuss in a design session.

On Sun, Jan 15, 2017 at 9:21 PM Afek, Ifat (Nokia - IL) 
wrote:

> Hi Yinliyin,
>
>
>
> There are two use cases:
>
> One is yours, where you have a single monitor that generates “real”
> alarms, and Vitrage that generates deduced alarms.
>
> Another is where someone has a few monitors, and there might be a
> collision/equivalence between their alarms.
>
>
>
> The solution that you suggested might solve the first use case, but I
> wouldn’t want to ignore the second one, which is also valid.
>
>
>
> Regarding some of your specific suggestions:
>
> 1.  In templates, we only define the alarm entity for the datasource
> that the alarm is reported by, such as Nagios.
>
> [Ifat] This will only work for a single monitor.
>
> 2.  When the evaluator deduces an alarm, it would raise the alarm with
> the type set to the datasource that would report the alarm, not
> vitrage.
>
> [Ifat] I don’t think this is right. In the Vitrage Alarm view in the UI,
> displaying the deduced alarm as “Nagios” is misleading, since Nagios did
> not report this alarm.
>
>
>
> I can think of a solution that is specific to the deduced alarms case,
> where we will replace a Vitrage alarm with a “real” alarm whenever there is
> a collision. This solution is easier, but we should carefully examine all
> use cases to make sure there is no ambiguity. However, for the more general
> use case I would prefer the option that we discussed in a previous mail, of
> having two (or more) alarms connected with an ‘equivalent’ relationship.
>
>
>
> What do you think?
>
> Ifat.
>
>
>
>
>
> *From: *"yinli...@zte.com.cn" 
> *Date: *Saturday, 14 January 2017 at 09:57
>
> · It won’t solve the general problem of two different monitors
> that raise the same alarm
>
> ·   [yinliyin] Generally, we would only deploy one monitor for a
> same alarm.
>
> · It won’t solve possible conflicts of timestamp and severity
> between different monitors
>
> ·  [yinliyin] Please see the following contents.
>
> · It will make the decision of when to delete the alarm more
> complex (delete it when the deduced alarm is deleted? When Nagios alarm is
> deleted? both? And how to change the timestamp and severity in these cases?)
>
> ·  [yinliyin] Please see the following contents.
>
> The following is the basic idea for solving the problem in this
> situation:
>
> 1.  In templates, we only define the alarm entity for the
> datasource that the alarm is reported by, such as Nagios.
>
> 2.  When the evaluator deduces an alarm, it would raise the alarm with
> the type set to the datasource that would report the alarm, not
> vitrage.
>
> 3.  When entity_graph gets the events from the "evaluator_queue" (all
> the alarms in the "evaluator_queue" are deduced alarms), it queries the
> graph to find out whether the same alarm was reported by the datasource.
> If so, it would discard the alarm.
>
> 4.  When entity_graph gets the events from "queue", it queries the
> graph to find out whether the same alarm was deduced by the evaluator. If
> so, it would replace the alarm in the graph with the newly arrived
> alarm reported by the datasource.
>
> 5.  When the evaluator deduces that an alarm should be deleted, it
> deletes the alarm whatever its generation type (generated by the
> datasource or deduced by the evaluator).
>
> 6.  When the datasource reports a recovery event for an alarm, entity_graph
> would query the graph to find out whether the alarm exists. If the alarm
> does not exist, entity_graph would 
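
Steps 3 to 5 of the quoted proposal can be sketched as follows. This is a minimal illustration under stated assumptions, not the Vitrage entity_graph API: the graph is modelled as a plain dict keyed by (resource, alarm name), and the function names are hypothetical. Step 6 is left out because the quoted text is cut off.

```python
DATASOURCE = "datasource"
EVALUATOR = "evaluator"


def handle_deduced(graph, key):
    """Step 3: keep a deduced alarm only if no reported alarm exists."""
    if graph.get(key) == DATASOURCE:
        return False  # discard: the datasource already reported this alarm
    graph[key] = EVALUATOR
    return True


def handle_reported(graph, key):
    """Step 4: a reported alarm replaces a deduced one (or is simply added).

    Returns True when a deduced alarm was replaced.
    """
    replaced = graph.get(key) == EVALUATOR
    graph[key] = DATASOURCE
    return replaced


def handle_deduced_delete(graph, key):
    """Step 5: delete the alarm regardless of how it was generated."""
    return graph.pop(key, None) is not None


graph = {}
key = ("host-1", "host fault")
handle_deduced(graph, key)   # evaluator raises the alarm first
handle_reported(graph, key)  # the datasource report replaces it
print(graph)
```

Note that step 5 lets the evaluator remove an alarm that the datasource reported, which is exactly the deletion question raised earlier in the thread.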

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-15 Thread Afek, Ifat (Nokia - IL)
From: Yujun Zhang 
Date: Thursday, 12 January 2017 at 17:37

On Thu, Jan 12, 2017 at 5:12 PM Afek, Ifat (Nokia - IL) wrote:

'deduced' vs 'monitored' would be good enough for most cases. Unless we
identify a real use case, I also think there is no need to bring in a
quantitative indicator like a counter or a probability.

[Ifat] I agree.

Personally, I don’t think this is needed. I think that if Nagios reports an 
error, then it is confident enough without getting it from another monitor.

You are right. We would consider a reported alarm as a reliable indicator of
fault. What I was thinking about is: when the alarm is not seen, can we be
sure there is no fault?

Another situation is a slow upstream alarm with a fast downstream alarm. I don't
have an actual example at the moment, so please allow me to imagine an extreme
condition.

Suppose a host fault will cause an instance fault. But due to some restriction,
the host fault is scanned only once every hour, while the instance fault can be
scanned every second. Now, we get alarms from 10 instances on the same host. Can
we deduce that the host is likely in fault status? If so, we may raise a
"deduced" alarm on the host and trigger an immediate scan, which may result in a
"monitored" alarm. In this way, we reduce the time needed to detect the root
cause, i.e. the host fault.

[Ifat] I understand the use case.


An alternative solution is to distinguish fault from alarm. An alarm is actually
a reflection of fault status. Besides the directly linked alarm, fault status can
also be deduced from downstream alarms. I haven't thought this model through yet;
it just crossed my mind. Any comments are welcome.

[Ifat] Isn’t ‘fault vs. alarm’ just a different terminology for ‘deduced vs. 
monitored’?




Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-12 Thread Yujun Zhang
Hi, Ifat

Your comments are quite right. See my additional explanation inline.

On Thu, Jan 12, 2017 at 5:12 PM Afek, Ifat (Nokia - IL) 
wrote:

>
>
> One possible solution would be introducing a high level (abstract)
> template from the user's view, and then converting it to Vitrage scenario
> templates (or directly to the graph). The *more sources* (nagios, vitrage
> deduction) we get from the system for an abstract alarm, the
> *more confidence* we have in a real fault. And the confidence of an alarm
> could be included in the scenario condition.
>
>
>
> [Ifat] I understand your idea, not sure yet if it helps with the use case.
>
> How would you imagine the ‘confidence’ property? As Boolean or a counter?
> One option is ‘deduced’ vs. ‘monitored’.
>
> Another option is to count the number of monitors that reported it.
>

'deduced' vs 'monitored' would be good enough for most cases. Unless we
identify a real use case, I also think there is no need to bring
in a quantitative indicator like a counter or a probability.


> Personally, I don’t think this is needed. I think that if Nagios reports
> an error, then it is confident enough without getting it from another
> monitor.
>

You are right. We would consider a reported alarm as a reliable indicator
of fault. What I was thinking about is: when the alarm is not seen, can
we be sure there is no fault?

Another situation is a slow upstream alarm with a fast downstream alarm. I
don't have an actual example at the moment, so please allow me to imagine
an extreme condition.

Suppose a host fault will cause an instance fault. But due to some
restriction, the host fault is scanned only once every hour, while the
instance fault can be scanned every second. Now, we get alarms from 10
instances on the same host. Can we deduce that the host is likely in fault
status? If so, we may raise a "deduced" alarm on the host and trigger an
immediate scan, which may result in a "monitored" alarm. In this way, we
reduce the time needed to detect the root cause, i.e. the host fault.
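
A minimal sketch of this deduction, assuming active instance alarms arrive as (instance, host) pairs. The function name `hosts_to_suspect` and the default threshold of 10 (taken from the example above) are illustrative only; nothing here is an existing Vitrage interface.

```python
from collections import Counter


def hosts_to_suspect(instance_alarms, threshold=10):
    """Return hosts with at least `threshold` distinct alarmed instances.

    instance_alarms: iterable of (instance_id, host_id) pairs, one per
    active instance alarm. Repeated pairs are counted once.
    """
    per_host = Counter(host for _instance, host in set(instance_alarms))
    return sorted(host for host, count in per_host.items() if count >= threshold)


alarms = [("vm-%d" % i, "host-1") for i in range(10)] + [("vm-x", "host-2")]
for host in hosts_to_suspect(alarms):
    # A real system would raise a "deduced" host alarm here and request an
    # immediate scan of the host from the monitor.
    print("deduced host fault on", host)
```

The threshold is a tuning knob: a lower value detects the root cause sooner but raises more deduced alarms that the follow-up scan may not confirm.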

An alternative solution is to distinguish fault from alarm. An alarm is
actually a reflection of fault status. Besides the directly linked alarm,
fault status can also be deduced from downstream alarms. I haven't thought
this model through yet; it just crossed my mind. Any comments are welcome.


Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-12 Thread Afek, Ifat (Nokia - IL)
Hi Yujun,

See my comments inline.

Ifat.

From: Yujun Zhang 
Date: Wednesday, 11 January 2017 at 12:12


I have just realized abstract alarm is not a good term. What I was talking 
about is fault and alarm.

Fault is what actually happens, and alarm is how it is detected (or deduced).


On Wed, Jan 11, 2017 at 5:13 PM Yujun Zhang wrote:

I think YinLiYin's idea is a reasonable requirement from end users. They care
more about the real faults in the system, not how they are detected. Though it
will bring considerable design and engineering challenges, it creates value for
customers. I'm quite positive about this evolution.

[Ifat] Of course. I never argued about the need, just tried to figure out how 
we should implement it.

One possible solution would be introducing a high level (abstract) template
from the user's view, and then converting it to Vitrage scenario templates (or
directly to the graph). The more sources (nagios, vitrage deduction) we get
from the system for an abstract alarm, the more confidence we have in a real
fault. And the confidence of an alarm could be included in the scenario condition.

[Ifat] I understand your idea, not sure yet if it helps with the use case.
How would you imagine the ‘confidence’ property? As Boolean or a counter? One 
option is ‘deduced’ vs. ‘monitored’. Another option is to count the number of 
monitors that reported it. Personally, I don’t think this is needed. I think 
that if Nagios reports an error, then it is confident enough without getting it 
from another monitor.




Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-11 Thread Yujun Zhang
I have just realized abstract alarm is not a good term. What I was talking
about is *fault* and *alarm*.

Fault is what actually happens, and alarm is how it is detected (or
deduced).

On Wed, Jan 11, 2017 at 5:13 PM Yujun Zhang 
wrote:

> Yes, if we consider the Vitrage scenario evaluator as a pseudo monitor.
>
> I think YinLiYin's idea is a reasonable requirement from end users. They
> care more about the *real faults* in the system, not how they are
> detected. Though it will bring considerable design and engineering
> challenges, it creates value for customers. I'm quite positive about this
> evolution.
>
> One possible solution would be introducing a high level (abstract)
> template from the user's view, and then converting it to Vitrage scenario
> templates (or directly to the graph). The *more sources* (nagios, vitrage
> deduction) we get from the system for an abstract alarm, the
> *more confidence* we have in a real fault. And the confidence of an alarm
> could be included in the scenario condition.
>
> On Wed, Jan 11, 2017 at 4:08 PM Afek, Ifat (Nokia - IL) <
> ifat.a...@nokia.com> wrote:
>
> You are right. But as I see it, the case of Vitrage suspect vs. the real
> Nagios alarm is just one example of the more general case of two monitors
> reporting the same alarm.
>
> Don’t you think so?
>
>
>
> *From: *Yujun Zhang
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)"
> *Date: *Wednesday, 11 January 2017 at 09:46
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev@lists.openstack.org>, "yinli...@zte.com.cn"
> *Cc: *"han.jin...@zte.com.cn", "wang.we...@zte.com.cn",
> "zhang.yuj...@zte.com.cn", "jia.peiy...@zte.com.cn",
> "gong.yah...@zte.com.cn"
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
> Hi, Ifat
>
>
>
> If I understand it correctly, your concern is mainly about the same alarm
> coming from different monitors, not the "suspect" status discussed in
> another thread.
>
>
>
> On Tue, Jan 10, 2017 at 10:21 PM Afek, Ifat (Nokia - IL) <
> ifat.a...@nokia.com> wrote:
>
> Hi Yinliyin,
>
>
>
> At first I thought that changing the deduced to be a property on the alarm
> might help in solving your use case. But now I think most of the problems
> will remain the same:
>
>
>
> ·  It won’t solve the general problem of two different monitors that
> raise the same alarm
>
> ·  It won’t solve possible conflicts of timestamp and severity between
> different monitors
>
> ·  It will make the decision of when to delete the alarm more complex
> (delete it when the deduced alarm is deleted? When Nagios alarm is deleted?
> both? And how to change the timestamp and severity in these cases?)
>
>
>
> So I don’t think that making this change is beneficial.
>
> What do you think?
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinli...@zte.com.cn"
> *Date: *Monday, 9 January 2017 at 05:29
> *To: *"Afek, Ifat (Nokia - IL)"
> *Cc: *"openstack-dev@lists.openstack.org", "han.jin...@zte.com.cn",
> "wang.we...@zte.com.cn", "zhang.yuj...@zte.com.cn",
> "jia.peiy...@zte.com.cn", "gong.yah...@zte.com.cn"
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
> Hi Ifat,
>
> I think there is a situation where all the alarms are reported by
> the monitored system. We use vitrage to:
>
> 1.  Find the relationships between the alarms, and find the root
> cause.
>
> 2.  Deduce an alarm before it really occurs. This comprises
> two aspects:
>
>     1) A causes B: when A has occurred, we deduce that B will
>     occur
>
>     2) B is caused by A: when B has occurred, we deduce that A
>     must have occurred
>
> In "2", we do expect vitrage to raise the alarm before the
> alarm is reported, because the alarm may be lost or delayed for some
> reason. So we would write "raise alarm" actions in the scenarios of the
> template. I think that whether the alarm is reported or deduced should be a
> state property of the alarm. The reported vertex and the deduced vertex of
> the same alarm should be merged into one vertex.
>
>
>
>  Best Regards,
>
>  Yinliyin.
>
> Original Mail
>
> *From:* <ifat.a...@nokia.com>;
>
> *To:* <openstack-dev@lists.openstack.org>;
>
> *Cc:* 韩静6838; 王维雅00042110; 章宇军10200531; 贾培源10101785; 龚亚辉6092001895;
>
> *Date:* 2017-01-07 02:18
>
> *Subject:* *Re: [openstack-dev] [Vitrage] About alarms reported by
> 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-11 Thread Yujun Zhang
Yes, if we consider the Vitrage scenario evaluator as a pseudo monitor.

I think YinLiYin's idea is a reasonable requirement from end users. They
care more about the *real faults* in the system, not how they are detected.
Though it will bring considerable design and engineering challenges, it
creates value for customers. I'm quite positive about this evolution.

One possible solution would be introducing a high level (abstract) template
from the user's view, and then converting it to Vitrage scenario templates
(or directly to the graph). The *more sources* (nagios, vitrage deduction)
we get from the system for an abstract alarm, the *more confidence* we have
in a real fault. And the confidence of an alarm could be included in the
scenario condition.

On Wed, Jan 11, 2017 at 4:08 PM Afek, Ifat (Nokia - IL) 
wrote:

> You are right. But as I see it, the case of Vitrage suspect vs. the real
> Nagios alarm is just one example of the more general case of two monitors
> reporting the same alarm.
>
> Don’t you think so?
>
>
>
> *From: *Yujun Zhang
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)"
> *Date: *Wednesday, 11 January 2017 at 09:46
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev@lists.openstack.org>, "yinli...@zte.com.cn"
> *Cc: *"han.jin...@zte.com.cn", "wang.we...@zte.com.cn",
> "zhang.yuj...@zte.com.cn", "jia.peiy...@zte.com.cn",
> "gong.yah...@zte.com.cn"
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
> Hi, Ifat
>
>
>
> If I understand it correctly, your concern is mainly about the same alarm
> coming from different monitors, not the "suspect" status discussed in
> another thread.
>
>
>
> On Tue, Jan 10, 2017 at 10:21 PM Afek, Ifat (Nokia - IL) <
> ifat.a...@nokia.com> wrote:
>
> Hi Yinliyin,
>
>
>
> At first I thought that changing the deduced to be a property on the alarm
> might help in solving your use case. But now I think most of the problems
> will remain the same:
>
>
>
> ·  It won’t solve the general problem of two different monitors that
> raise the same alarm
>
> ·  It won’t solve possible conflicts of timestamp and severity between
> different monitors
>
> ·  It will make the decision of when to delete the alarm more complex
> (delete it when the deduced alarm is deleted? When Nagios alarm is deleted?
> both? And how to change the timestamp and severity in these cases?)
>
>
>
> So I don’t think that making this change is beneficial.
>
> What do you think?
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinli...@zte.com.cn"
> *Date: *Monday, 9 January 2017 at 05:29
> *To: *"Afek, Ifat (Nokia - IL)"
> *Cc: *"openstack-dev@lists.openstack.org", "han.jin...@zte.com.cn",
> "wang.we...@zte.com.cn", "zhang.yuj...@zte.com.cn",
> "jia.peiy...@zte.com.cn", "gong.yah...@zte.com.cn"
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
> Hi Ifat,
>
> I think there is a situation where all the alarms are reported by
> the monitored system. We use vitrage to:
>
> 1.  Find the relationships between the alarms, and find the root
> cause.
>
> 2.  Deduce an alarm before it really occurs. This comprises
> two aspects:
>
>     1) A causes B: when A has occurred, we deduce that B will
>     occur
>
>     2) B is caused by A: when B has occurred, we deduce that A
>     must have occurred
>
> In "2", we do expect vitrage to raise the alarm before the
> alarm is reported, because the alarm may be lost or delayed for some
> reason. So we would write "raise alarm" actions in the scenarios of the
> template. I think that whether the alarm is reported or deduced should be a
> state property of the alarm. The reported vertex and the deduced vertex of
> the same alarm should be merged into one vertex.
>
>
>
>  Best Regards,
>
>  Yinliyin.
>
> Original Mail
>
> *From:* <ifat.a...@nokia.com>;
>
> *To:* <openstack-dev@lists.openstack.org>;
>
> *Cc:* 韩静6838; 王维雅00042110; 章宇军10200531; 贾培源10101785; 龚亚辉6092001895;
>
> *Date:* 2017-01-07 02:18
>
> *Subject:* *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator*
>
>
>
> Hi YinLiYin,
>
>
>
> This is an interesting question. Let me divide my answer into two parts.
>
>
>
> First, the case that you described with Nagios and Vitrage. This problem
> depends on the specific Nagios tests that you 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-11 Thread Afek, Ifat (Nokia - IL)
You are right. But as I see it, the case of Vitrage suspect vs. the real Nagios 
alarm is just one example of the more general case of two monitors reporting 
the same alarm.
Don’t you think so?

From: Yujun Zhang
Reply-To: "OpenStack Development Mailing List (not for usage questions)"
Date: Wednesday, 11 January 2017 at 09:46
To: "OpenStack Development Mailing List (not for usage questions)",
"yinli...@zte.com.cn"
Cc: "han.jin...@zte.com.cn", "wang.we...@zte.com.cn",
"zhang.yuj...@zte.com.cn", "jia.peiy...@zte.com.cn", "gong.yah...@zte.com.cn"
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and
the alarms generated by vitrage evaluator

Hi, Ifat

If I understand it correctly, your concern is mainly about the same alarm coming
from different monitors, not the "suspect" status discussed in another thread.

On Tue, Jan 10, 2017 at 10:21 PM Afek, Ifat (Nokia - IL) wrote:

Hi Yinliyin,

At first I thought that changing the deduced to be a property on the alarm 
might help in solving your use case. But now I think most of the problems will 
remain the same:

·  It won’t solve the general problem of two different monitors that raise the 
same alarm
·  It won’t solve possible conflicts of timestamp and severity between 
different monitors
·  It will make the decision of when to delete the alarm more complex (delete 
it when the deduced alarm is deleted? When Nagios alarm is deleted? both? And 
how to change the timestamp and severity in these cases?)

So I don’t think that making this change is beneficial.
What do you think?

Best Regards,
Ifat.


From: "yinli...@zte.com.cn"
Date: Monday, 9 January 2017 at 05:29
To: "Afek, Ifat (Nokia - IL)"
Cc: "openstack-dev@lists.openstack.org", "han.jin...@zte.com.cn",
"wang.we...@zte.com.cn", "zhang.yuj...@zte.com.cn", "jia.peiy...@zte.com.cn",
"gong.yah...@zte.com.cn"
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and
the alarms generated by vitrage evaluator



Hi Ifat,

I think there is a situation where all the alarms are reported by the
monitored system. We use vitrage to:

1.  Find the relationships between the alarms, and find the root cause.

2.  Deduce an alarm before it really occurs. This comprises two aspects:

    1) A causes B: when A has occurred, we deduce that B will occur

    2) B is caused by A: when B has occurred, we deduce that A must have
    occurred

In "2", we do expect vitrage to raise the alarm before the alarm is reported,
because the alarm may be lost or delayed for some reason. So we would write
"raise alarm" actions in the scenarios of the template. I think that whether
the alarm is reported or deduced should be a state property of the alarm. The
reported vertex and the deduced vertex of the same alarm should be merged into
one vertex.



 Best Regards,

 Yinliyin.

Original Mail
From: <ifat.a...@nokia.com>;
To: <openstack-dev@lists.openstack.org>;
Cc: 韩静6838; 王维雅00042110; 章宇军10200531; 贾培源10101785; 龚亚辉6092001895;
Date: 2017-01-07 02:18
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the
alarms generated by vitrage evaluator


Hi YinLiYin,

This is an interesting question. Let me divide my answer into two parts.

First, the case that you described with Nagios and Vitrage. This problem
depends on the specific Nagios tests that you configure in your system, as well
as on the Vitrage templates that you use. For example, you can use
Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
alarms on the virtual and application layers. This way you will never have
duplicated alarms. If you want to use Nagios to monitor the other layers as
well, you can simply modify the Vitrage templates so they don’t raise the
deduced alarms that Nagios may generate, and use the templates to show RCA
between different Nagios alarms.

Now let’s 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-10 Thread Yujun Zhang
Hi, Ifat

If I understand it correctly, your concern is mainly about the same alarm
coming from different monitors, not the "suspect" status discussed in another
thread.

On Tue, Jan 10, 2017 at 10:21 PM Afek, Ifat (Nokia - IL) <
ifat.a...@nokia.com> wrote:

Hi Yinliyin,



At first I thought that changing the deduced to be a property on the alarm
might help in solving your use case. But now I think most of the problems
will remain the same:



   - It won’t solve the general problem of two different monitors that
   raise the same alarm
   - It won’t solve possible conflicts of timestamp and severity between
   different monitors
   - It will make the decision of when to delete the alarm more complex
   (delete it when the deduced alarm is deleted? When Nagios alarm is deleted?
   both? And how to change the timestamp and severity in these cases?)



So I don’t think that making this change is beneficial.

What do you think?



Best Regards,

Ifat.





*From: *"yinli...@zte.com.cn"
*Date: *Monday, 9 January 2017 at 05:29
*To: *"Afek, Ifat (Nokia - IL)"
*Cc: *"openstack-dev@lists.openstack.org", "han.jin...@zte.com.cn",
"wang.we...@zte.com.cn", "zhang.yuj...@zte.com.cn",
"jia.peiy...@zte.com.cn", "gong.yah...@zte.com.cn"
*Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
datasource and the alarms generated by vitrage evaluator



Hi Ifat,

I think there is a situation where all the alarms are reported by
the monitored system. We use vitrage to:

1.  Find the relationships between the alarms, and find the root
cause.

2.  Deduce an alarm before it really occurs. This comprises
two aspects:

    1) A causes B: when A has occurred, we deduce that B will
    occur

    2) B is caused by A: when B has occurred, we deduce that A
    must have occurred

In "2", we do expect vitrage to raise the alarm before the
alarm is reported, because the alarm may be lost or delayed for some
reason. So we would write "raise alarm" actions in the scenarios of the
template. I think that whether the alarm is reported or deduced should be a
state property of the alarm. The reported vertex and the deduced vertex of
the same alarm should be merged into one vertex.



 Best Regards,

 Yinliyin.

Original message

*From:* <ifat.a...@nokia.com>;

*To:* <openstack-dev@lists.openstack.org>;

*Cc:* 韩静6838;王维雅00042110;章宇军10200531;贾培源10101785;龚亚辉6092001895;

*Date:* 2017-01-07 02:18

*Subject:* *Re: [openstack-dev] [Vitrage] About alarms reported by
datasource and the alarms generated by vitrage evaluator*



Hi YinLiYin,



This is an interesting question. Let me divide my answer into two parts.



First, the case that you described with Nagios and Vitrage. This problem
depends on the specific Nagios tests that you configure in your system, as
well as on the Vitrage templates that you use. For example, you can use
Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
alarms on the virtual and application layers. This way you will never have
duplicated alarms. If you want to use Nagios to monitor the other layers
as well, you can simply modify the Vitrage templates so they don’t raise the
deduced alarms that Nagios may generate, and use the templates to show RCA
between different Nagios alarms.

Now let’s talk about the more general case. Vitrage can receive alarms from
different monitors, including Nagios, Zabbix, collectd and Aodh. If you are
using more than one monitor, it is possible that the same alarm (maybe
with a different name) will be raised twice. We need to create a mechanism
to identify such cases and create a single alarm with the properties of
both monitors. This has not been designed in detail yet, so if you have
any suggestions we will be happy to hear them.
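
One way to picture the "identify such cases" step is an alias table that maps each monitor's own alarm name onto a shared canonical name. The sketch below is only an illustration in plain Python: the alias entries, `canonical_key`, and `ingest` are hypothetical and are not part of Vitrage or of any monitor's API.

```python
# Hypothetical alias table: (monitor, monitor-specific alarm name) -> canonical name.
ALIASES = {
    ("nagios", "CPU load"): "cpu_high",
    ("zabbix", "Processor load is too high"): "cpu_high",
}

# Canonical key -> {monitor: properties reported by that monitor}.
merged = {}


def canonical_key(monitor, resource, raw_name):
    """Map a monitor-specific alarm name onto a (resource, canonical name) key."""
    name = ALIASES.get((monitor, raw_name), raw_name)
    return (resource, name)


def ingest(monitor, resource, raw_name, **props):
    """Store the monitor's properties under the shared key and return that key."""
    key = canonical_key(monitor, resource, raw_name)
    merged.setdefault(key, {})[monitor] = props
    return key


first = ingest("nagios", "host-1", "CPU load", severity="critical")
second = ingest("zabbix", "host-1", "Processor load is too high", severity="warning")
print(first == second, len(merged))
```

Keeping each monitor's properties side by side postpones the harder question of how to combine conflicting severities and timestamps.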



Best Regards,

Ifat.





*From: *"yinli...@zte.com.cn" <yinli...@zte.com.cn>
*Reply-To: *"OpenStack Development Mailing List (not for usage questions)" <
openstack-dev@lists.openstack.org>
*Date: *Friday, 6 January 2017 at 03:27
*To: *"openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org
>
*Cc: *"gong.yah...@zte.com.cn" <gong.yah...@zte.com.cn>, "
han.jin...@zte.com.cn" <han.jin...@zte.com.cn>, "wang.we...@zte.com.cn" <
wang.we...@zte.com.cn>, "jia.peiy...@zte.com.cn" <jia.peiy...@zte.com.cn>, "
zhang.yuj...@zte.com.cn" <zhang.yuj...@zte.com.cn>
*Subject: *[openstack-dev] [Vitrage] About alarms reported by datasource
and the alarms generated by vitrage evaluator



Hi all,

   Vitrage generates alarms according to the templates. All the alarms raised
by Vitrage have the type "vitrage". Suppose Nagios has an alarm A. Alarm A
is raised by the Vitrage evaluator according to the action part of a scenario,
so the type of alarm A is "vitrage". If Nagios reported alarm A 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-10 Thread Afek, Ifat (Nokia - IL)
Hi Yinliyin,

At first I thought that making "deduced" a property of the alarm 
might help in solving your use case. But now I think most of the problems will 
remain the same:

  *   It won’t solve the general problem of two different monitors that raise 
the same alarm
  *   It won’t solve possible conflicts of timestamp and severity between 
different monitors
  *   It will make the decision of when to delete the alarm more complex 
(delete it when the deduced alarm is deleted? When the Nagios alarm is deleted? 
Both? And how should the timestamp and severity change in these cases?)

So I don’t think that making this change is beneficial.
What do you think?

Best Regards,
Ifat.


From: "yinli...@zte.com.cn" 
Date: Monday, 9 January 2017 at 05:29
To: "Afek, Ifat (Nokia - IL)" 
Cc: "openstack-dev@lists.openstack.org" , 
"han.jin...@zte.com.cn" , "wang.we...@zte.com.cn" 
, "zhang.yuj...@zte.com.cn" , 
"jia.peiy...@zte.com.cn" , "gong.yah...@zte.com.cn" 

Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and 
the alarms generated by vitrage evaluator




Hi Ifat,

 I think there is a situation in which all the alarms are reported by the 
monitored system. We use Vitrage to:

1.  Find the relationships between the alarms, and find the root cause.

2.  Deduce an alarm before it actually occurs. This comprises two 
aspects:

 1) A causes B:  when A occurs, we deduce that B will occur

 2) B is caused by A:  when B occurs, we deduce that A must 
have occurred

In "2", we do expect Vitrage to raise the alarm before the alarm 
is reported, because the alarm could be lost or delayed for some reason.  So 
we would write "raise alarm" actions in the scenarios of the template.  I think 
that whether the alarm is reported or deduced should be a state property of the 
alarm. The reported vertex and the deduced vertex of the same alarm should be 
merged into one vertex.



 Best Regards,

 Yinliyin.

殷力殷 YinLiYin
Project Manager
NIV Shanghai Dept. V/Wireless Product R&D Institute/Wireless Product Operation
D502, ZTE Corporation R Center, 889# Bibo Road,
Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203
T: +86 21 68896229
M: +86 13641895907
E: yinli...@zte.com.cn
www.zte.com.cn

Original message
From: <ifat.a...@nokia.com>;
To: <openstack-dev@lists.openstack.org>;
Cc: 韩静6838;王维雅00042110;章宇军10200531;贾培源10101785;龚亚辉6092001895;
Date: 2017-01-07 02:18
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator


Hi YinLiYin,

This is an interesting question. Let me divide my answer into two parts.

First, the case that you described with Nagios and Vitrage. This problem 
depends on the specific Nagios tests that you configure in your system, as well 
as on the Vitrage templates that you use. For example, you can use 
Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced 
alarms on the virtual and application layers. This way you will never have 
duplicated alarms. If you want to use Nagios to monitor the other layers as 
well, you can simply modify the Vitrage templates so they don’t raise the deduced 
alarms that Nagios may generate, and use the templates to show RCA between 
different Nagios alarms.

Now let’s talk about the more general case. Vitrage can receive alarms from 
different monitors, including Nagios, Zabbix, collectd and Aodh. If you are 
using more than one monitor, it is possible that the same alarm (maybe with a 
different name) will be raised twice. We need to create a mechanism to identify 
such cases and create a single alarm with the properties of both monitors. This 
has not been designed in detail yet, so if you have any suggestions we will be 
happy to hear them.

Best Regards,
Ifat.


From: "yinli...@zte.com.cn" <yinli...@zte.com.cn>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" 
<openstack-dev@lists.openstack.org>
Date: Friday, 6 January 2017 at 03:27
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Cc: "gong.yah...@zte.com.cn" <gong.yah...@zte.com.cn>, "han.jin...@zte.com.cn" 
<han.jin...@zte.com.cn>, "wang.we...@zte.com.cn" <wang.we...@zte.com.cn>, 
"jia.peiy...@zte.com.cn" <jia.peiy...@zte.com.cn>, "zhang.yuj...@zte.com.cn" 
<zhang.yuj...@zte.com.cn>
Subject: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator


Hi all,

   Vitrage generates alarms according to the templates. All the alarms raised by 
Vitrage have the type "vitrage". Suppose Nagios has 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-09 Thread Yujun Zhang
Instinctively, I prefer 2.b.

I am not sure whether it could be linked to the vitrage_id[1] evolution. If a
UUID is created for the alarm, the implementation could be quite straightforward.

[1]: https://blueprints.launchpad.net/vitrage/+spec/standard-vitrage-id

On Tue, Jan 10, 2017 at 1:55 AM Afek, Ifat (Nokia - IL) 
wrote:

> Hi Yujun,
>
>
>
> I understand the use case now, thanks for the detailed explanation.
>
>
>
> Supporting this use case will require some development in Vitrage. Let me
> try to list down the requirements and options that we have.
>
>
>
> 1.   Requirement: Raise ‘suspect’ deduced alarms in Vitrage.
>
> Implementation: Quite straightforward. There is no way to set a ‘suspect’
> property in Vitrage right now, but it should be easy to add this option.
>
>
>
> 2.   Requirement: Change a ‘suspect’ alarm of type ‘vitrage’ to a
> ‘real’ alarm of type ‘nagios’.
>
> Implementation: There are a few alternatives for achieving this goal:
>
>
>
> a.   Delete the ‘suspect’ alarm and create the ‘real’ alarm. This
> will require supporting ‘not’ condition in the templates. An example
> scenario:
>
> condition: vm_alarm and not nagios_alarm:
>
>(action: create vitrage alarm)
>
> condition: nagios_alarm and vitrage_alarm:
>
>(action: delete vitrage_alarm)
>
>
>
> b.   Have both the ‘suspect’ alarm and the ‘real’ alarm, and create an
> ‘equivalent’ relationship between them. Configuring the template should be
> easy, however it won’t look nice in the UI. In past discussions we
> mentioned an option to group some vertices together in the UI. If we have
> this option, we might want to group these two alarms together.
>
>
>
> c.   Merge the two alarms. This solution seems the most reasonable
> one at first, but it is not trivial. For example: suppose one alarm is
> defined as ‘critical’ and was raised at 10:01, and the other alarm was
> defined as ‘warning’ and was raised at 10:02. How will you combine the two?
> And what if the ‘critical’ alarm then goes down, will you know how to
> change the severity back to ‘warning’? In the case of Vitrage vs. Nagios we
> would prefer Nagios; but let’s think of the more general case of
> two different monitors.
>
>
>
> 3.   In one of your emails you mentioned an option of having two
> ‘suspects’. Suppose vm_alarm is raised, will you raise two suspect vitrage
> alarms, e.g. host_alarm and switch_alarm? And if you then receive
> host_alarm from nagios, would you like to delete the suspect switch_alarm,
> or keep it? If you would like to delete it, it will require supporting
> ‘not’ in the template condition.
>
>
>
> Personally I would go for option 2b, but I will be happy to hear your
> thoughts about it.
>
>
>
> Hope I helped, but I suspect I just made things more complicated ;-)
>
> Ifat.
>
>
>
>
>
> *From: *Yujun Zhang 
>
>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" 
>
> *Date: *Sunday, 8 January 2017 at 17:38
>
>
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev@lists.openstack.org>
> *Cc: *"han.jin...@zte.com.cn" , "
> wang.we...@zte.com.cn" , "gong.yah...@zte.com.cn" <
> gong.yah...@zte.com.cn>, "jia.peiy...@zte.com.cn" ,
> "zhang.yuj...@zte.com.cn" 
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
> Maybe I have missed something in the scenario template, but it seems you
> have understood my idea quite correctly :-)
>
>
>
> See further explanation inline
>
> On Sun, Jan 8, 2017 at 3:06 PM Afek, Ifat (Nokia - IL) <
> ifat.a...@nokia.com> wrote:
>
> Hi Yujun,
>
>
>
> Thanks for the explanation, but I still don’t fully understand.
>
>
>
> Let me start with the current state:
>
> 1.   Introduce a flexible `metadata` dict into the ALARM entity
>
> [Ifat] Already exists. An alarm is represented as a vertex in the entity
> graph, with a dictionary of properties.
>
>
>
>  [yujunz] Can the alarm vertex be updated by scenario action? e.g. raise
> an alarm and set the property `suspect` to true.
>
>
>
> 2.   Allow generating update event[1] on metadata change
>
> 3.   Allow using ALARM metadata in scenario condition
>
> [Ifat] Already exists. You can define properties in the ‘entities’ section
> in Vitrage templates
>
>
>
> [yujunz] How do I specify the condition if one specified alarm is
> 'suspicious', e.g. condition: host_alarm.suspect ?
>
>
>
> 4.   Allow setting ALARM metadata in scenario action
>
>
>
> If I understand correctly, you are suggesting that one scenario will add
> metadata to an existing alarm, which will trigger an event, and as a result
> another scenario might be executed?
>
>
>
> [yujunz] Exactly
>
>
>
> Can you describe a use case where this behavior will help calculating the
> root cause?
>
>
>
> 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-09 Thread Afek, Ifat (Nokia - IL)
Hi Yujun,

I understand the use case now, thanks for the detailed explanation.

Supporting this use case will require some development in Vitrage. Let me try 
to list down the requirements and options that we have.


1.   Requirement: Raise ‘suspect’ deduced alarms in Vitrage.

Implementation: Quite straightforward. There is no way to set a ‘suspect’ 
property in Vitrage right now, but it should be easy to add this option.



2.   Requirement: Change a ‘suspect’ alarm of type ‘vitrage’ to a ‘real’ 
alarm of type ‘nagios’.

Implementation: There are a few alternatives for achieving this goal:



a.   Delete the ‘suspect’ alarm and create the ‘real’ alarm. This will 
require supporting ‘not’ condition in the templates. An example scenario:

condition: vm_alarm and not nagios_alarm:

   (action: create vitrage alarm)

condition: nagios_alarm and vitrage_alarm:

   (action: delete vitrage_alarm)



b.   Have both the ‘suspect’ alarm and the ‘real’ alarm, and create an ‘equivalent’ 
relationship between them. Configuring the template should be easy, however it 
won’t look nice in the UI. In past discussions we mentioned an option to group 
some vertices together in the UI. If we have this option, we might want to 
group these two alarms together.



c.   Merge the two alarms. This solution seems the most reasonable one at 
first, but it is not trivial. For example: suppose one alarm is defined as 
‘critical’ and was raised at 10:01, and the other alarm was defined as 
‘warning’ and was raised at 10:02. How will you combine the two? And what if 
the ‘critical’ alarm then goes down, will you know how to change the severity 
back to ‘warning’? In the case of Vitrage vs. Nagios we would prefer 
Nagios; but let’s think of the more general case of two different monitors.


3.   In one of your emails you mentioned an option of having two 
‘suspects’. Suppose vm_alarm is raised, will you raise two suspect vitrage 
alarms, e.g. host_alarm and switch_alarm? And if you then receive host_alarm 
from nagios, would you like to delete the suspect switch_alarm, or keep it? If 
you would like to delete it, it will require supporting ‘not’ in the template 
condition.

Personally I would go for option 2b, but I will be happy to hear your thoughts 
about it.
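
The difficulty described under option 2c can be made concrete with a small sketch. It assumes one possible policy that was not decided in this thread: the merged alarm keeps one report per monitor, shows the highest severity and the earliest timestamp among the active reports, and disappears only when every monitor has cleared it. `MergedAlarm` and the monitor names are hypothetical and are not Vitrage code.

```python
from dataclasses import dataclass, field

# Assumed ordering of severities; a real deployment may have more levels.
SEVERITY_RANK = {"warning": 1, "critical": 2}


@dataclass
class MergedAlarm:
    # monitor name -> (severity, raised_at) for each monitor still reporting
    reports: dict = field(default_factory=dict)

    def raise_from(self, monitor, severity, raised_at):
        self.reports[monitor] = (severity, raised_at)

    def clear_from(self, monitor):
        self.reports.pop(monitor, None)

    @property
    def active(self):
        return bool(self.reports)

    @property
    def severity(self):
        """Highest severity among the monitors that still report the alarm."""
        if not self.reports:
            return None
        severities = [sev for sev, _ in self.reports.values()]
        return max(severities, key=SEVERITY_RANK.__getitem__)

    @property
    def raised_at(self):
        """Earliest timestamp among the remaining reports ("HH:MM" strings)."""
        if not self.reports:
            return None
        return min(ts for _, ts in self.reports.values())


alarm = MergedAlarm()
alarm.raise_from("monitor_a", "critical", "10:01")
alarm.raise_from("monitor_b", "warning", "10:02")
print(alarm.severity, alarm.raised_at)

alarm.clear_from("monitor_a")  # the 'critical' report goes down
print(alarm.severity, alarm.raised_at)
```

Because each report is kept separately, the severity can fall back to ‘warning’ once the ‘critical’ report is cleared. The cost is that the merged vertex has to remember every source.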

Hope I helped, but I suspect I just made things more complicated ;-)
Ifat.


From: Yujun Zhang 
Reply-To: "OpenStack Development Mailing List (not for usage questions)" 

Date: Sunday, 8 January 2017 at 17:38
To: "OpenStack Development Mailing List (not for usage questions)" 

Cc: "han.jin...@zte.com.cn" , "wang.we...@zte.com.cn" 
, "gong.yah...@zte.com.cn" , 
"jia.peiy...@zte.com.cn" , "zhang.yuj...@zte.com.cn" 

Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and 
the alarms generated by vitrage evaluator

Maybe I have missed something in the scenario template, but it seems you have 
understood my idea quite correctly :-)

See further explanation inline
On Sun, Jan 8, 2017 at 3:06 PM Afek, Ifat (Nokia - IL) wrote:
Hi Yujun,

Thanks for the explanation, but I still don’t fully understand.

Let me start with the current state:
1.   Introduce a flexible `metadata` dict into the ALARM entity
[Ifat] Already exists. An alarm is represented as a vertex in the entity graph, 
with a dictionary of properties.

 [yujunz] Can the alarm vertex be updated by scenario action? e.g. raise an 
alarm and set the property `suspect` to true.

2.   Allow generating update event[1] on metadata change
3.   Allow using ALARM metadata in scenario condition
[Ifat] Already exists. You can define properties in the ‘entities’ section in 
Vitrage templates

[yujunz] How do I specify the condition if one specified alarm is 'suspicious', 
e.g. condition: host_alarm.suspect ?

4.   Allow setting ALARM metadata in scenario action

If I understand correctly, you are suggesting that one scenario will add 
metadata to an existing alarm, which will trigger an event, and as a result 
another scenario might be executed?

[yujunz] Exactly

Can you describe a use case where this behavior will help calculating the root 
cause?

[yujunz] Here's the simplified case derived from YinLiYin's example. Suppose we 
add a causal relationship from `host_alarm` to `instance_alarm`, i.e. a host 
alarm will cause an instance alarm. If an instance alarm is detected (but no host 
alarm), it is "suspicious" that it may be caused by a host alarm. The reason 
could be a delayed or lost event. Instead of waiting for the snapshot service to 
update the host status, we want to run a diagnostic action to check it proactively.

In this case, we want to set the upstream (host) of a confirmed alarm 
(instance) to "suspect" and trigger 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-09 Thread yinliyin
Hi Ifat, 

 I think there is a situation in which all the alarms are reported by the 
monitored system. We use Vitrage to:

1.  Find the relationships between the alarms, and find the root cause.

2.  Deduce an alarm before it actually occurs. This comprises two 
aspects:

 1) A causes B:  when A occurs, we deduce that B will occur

 2) B is caused by A:  when B occurs, we deduce that A must 
have occurred

In "2", we do expect Vitrage to raise the alarm before the alarm 
is reported, because the alarm could be lost or delayed for some reason.  So 
we would write "raise alarm" actions in the scenarios of the template.  I think 
that whether the alarm is reported or deduced should be a state property of the 
alarm. The reported vertex and the deduced vertex of the same alarm should be 
merged into one vertex. 
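
The idea of a single vertex with a reported/deduced state can be sketched as follows. This is only an illustration of the proposal in plain Python, under the assumption that a vertex is keyed by resource and alarm name and that a report always overrides a deduction. `upsert_alarm` and the `state` values are hypothetical names, not Vitrage API.

```python
# (resource, alarm name) -> properties of the single merged vertex
alarms = {}


def upsert_alarm(resource, name, state):
    """Create the vertex if missing; let 'reported' confirm a 'deduced' alarm.

    A later deduction never downgrades an alarm that was already reported.
    """
    key = (resource, name)
    current = alarms.get(key)
    if current is None:
        alarms[key] = {"state": state}
    elif current["state"] == "deduced" and state == "reported":
        current["state"] = "reported"
    return alarms[key]


upsert_alarm("host-1", "A", "deduced")   # raised by a template scenario
upsert_alarm("host-1", "A", "reported")  # the monitor reports it later
print(len(alarms), alarms[("host-1", "A")]["state"])
```

With this keying the template only needs one alarm entity per alarm, at the price of deciding which source wins when properties conflict.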





 Best Regards,

 Yinliyin.

殷力殷 YinLiYin
Project Manager
NIV Shanghai Dept. V/Wireless Product R&D Institute/Wireless Product Operation
D502, ZTE Corporation R Center, 889# Bibo Road,
Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203
T: +86 21 68896229
M: +86 13641895907
E: yinli...@zte.com.cn
www.zte.com.cn

Original message

From: <ifat.a...@nokia.com>
To: <openstack-dev@lists.openstack.org>
Cc: 韩静6838;王维雅00042110;章宇军10200531;贾培源10101785;龚亚辉6092001895
Date: 2017-01-07 02:18
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator







Hi YinLiYin,

This is an interesting question. Let me divide my answer into two parts.

First, the case that you described with Nagios and Vitrage. This problem 
depends on the specific Nagios tests that you configure in your system, as well 
as on the Vitrage templates that you use. For example, you can use 
Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced 
alarms on the virtual and application layers. This way you will never have 
duplicated alarms. If you want to use Nagios to monitor the other layers as 
well, you can simply modify the Vitrage templates so they don’t raise the deduced 
alarms that Nagios may generate, and use the templates to show RCA between 
different Nagios alarms.

Now let’s talk about the more general case. Vitrage can receive alarms from 
different monitors, including Nagios, Zabbix, collectd and Aodh. If you are 
using more than one monitor, it is possible that the same alarm (maybe with a 
different name) will be raised twice. We need to create a mechanism to identify 
such cases and create a single alarm with the properties of both monitors. This 
has not been designed in detail yet, so if you have any suggestions we will be 
happy to hear them.

Best Regards,
Ifat.


 


 



From: "yinli...@zte.com.cn" <yinli...@zte.com.cn>
 Reply-To: "OpenStack Development Mailing List (not for usage questions)" 
<openstack-dev@lists.openstack.org>
 Date: Friday, 6 January 2017 at 03:27
 To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
 Cc: "gong.yah...@zte.com.cn" <gong.yah...@zte.com.cn>, "han.jin...@zte.com.cn" 
<han.jin...@zte.com.cn>, "wang.we...@zte.com.cn" <wang.we...@zte.com.cn>, 
"jia.peiy...@zte.com.cn" <jia.peiy...@zte.com.cn>, "zhang.yuj...@zte.com.cn" 
<zhang.yuj...@zte.com.cn>
 Subject: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator



 



Hi all, 


   Vitrage generates alarms according to the templates. All the alarms raised by 
Vitrage have the type "vitrage". Suppose Nagios has an alarm A. Alarm A is 
raised by the Vitrage evaluator according to the action part of a scenario, so the 
type of alarm A is "vitrage". If Nagios reports alarm A later, a new alarm A with 
type "Nagios" is generated in the entity graph. There are then two 
vertices for the same alarm in the graph, and we have to define two alarm 
entities, two relationships, and two scenarios in the template file to make the 
alarm propagation procedure work.


   This makes it inconvenient to describe the fault model of a system with a lot 
of alarms. How can we solve this problem?
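
The duplication described above can be reproduced with a toy model. The sketch assumes that a vertex is identified by its datasource type together with the resource and alarm name. That keying is an assumption made for illustration and is not taken from the Vitrage code; `AlarmKey`, `graph`, and `add_alarm` are hypothetical names.

```python
from typing import NamedTuple


class AlarmKey(NamedTuple):
    source: str    # datasource type, e.g. "vitrage" or "nagios"
    resource: str  # the resource the alarm is raised on
    name: str      # alarm name


graph = {}  # vertex key -> vertex properties


def add_alarm(source, resource, name, **props):
    graph[AlarmKey(source, resource, name)] = props


# The evaluator deduces alarm A first, then Nagios reports the same alarm.
add_alarm("vitrage", "host-1", "A", severity="critical")
add_alarm("nagios", "host-1", "A", severity="critical")

same_alarm = [k for k in graph if (k.resource, k.name) == ("host-1", "A")]
print(len(same_alarm))
```

Because the source is part of the key, the same alarm ends up as two vertices, and every template entity, relationship, and scenario has to be written once per source.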


 


殷力殷 YinLiYin

 D502, ZTE Corporation R Center, 889# Bibo Road, 
 Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203 
 T: +86 21 68896229
 M: +86 13641895907 
 E: yinli...@zte.com.cn
 www.zte.com.cn


Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-08 Thread Yujun Zhang
Maybe I have missed something in the scenario template, but it seems you
have understood my idea quite correctly :-)

See further explanation inline

On Sun, Jan 8, 2017 at 3:06 PM Afek, Ifat (Nokia - IL) 
wrote:

> Hi Yujun,
>
>
>
> Thanks for the explanation, but I still don’t fully understand.
>
>
>
> Let me start with the current state:
>
> 1.   Introduce a flexible `metadata` dict into the ALARM entity
>
> [Ifat] Already exists. An alarm is represented as a vertex in the entity
> graph, with a dictionary of properties.
>

 [yujunz] Can the alarm vertex be updated by scenario action? e.g. raise an
alarm and set the property `suspect` to true.

2.   Allow generating update event[1] on metadata change
>
> 3.   Allow using ALARM metadata in scenario condition
>
> [Ifat] Already exists. You can define properties in the ‘entities’ section
> in Vitrage templates
>

[yujunz] How do I specify the condition if one specified alarm is
'suspicious', e.g. condition: host_alarm.suspect ?

4.   Allow setting ALARM metadata in scenario action
>
>
>
> If I understand correctly, you are suggesting that one scenario will add
> metadata to an existing alarm, which will trigger an event, and as a result
> another scenario might be executed?
>

[yujunz] Exactly

Can you describe a use case where this behavior will help calculating the
> root cause?
>

[yujunz] Here's the simplified case derived from YinLiYin's example.
Suppose we add a causal relationship from `host_alarm` to `instance_alarm`,
i.e. a host alarm will cause an instance alarm. If an instance alarm is
detected (but no host alarm), it is "suspicious" that it may be caused by a
host alarm. The reason could be a delayed or lost event. Instead of waiting
for the snapshot service to update the host status, we want to run a
diagnostic action to check it proactively.

In this case, we want to set the upstream (host) of a confirmed alarm
(instance) to "suspect" and trigger a diagnostic action on this change.
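
A minimal sketch of this use case, in plain Python rather than a Vitrage template: when a confirmed alarm arrives and its known cause is not active, the cause is recorded as a suspect and a diagnostic is requested. The `causes` table, `on_alarm_reported`, and `diagnostics_requested` are hypothetical names used only for illustration.

```python
# Causal model: effect alarm -> upstream cause alarm (assumed for the example).
causes = {"instance_alarm": "host_alarm"}

active = {}                 # alarm name -> properties
diagnostics_requested = []  # upstream alarms waiting for a diagnostic action


def on_alarm_reported(name):
    """Record a confirmed alarm and mark its missing upstream cause as suspect."""
    active[name] = {"suspect": False}
    upstream = causes.get(name)
    if upstream is not None and upstream not in active:
        active[upstream] = {"suspect": True}
        diagnostics_requested.append(upstream)


# The instance alarm arrives, but the host alarm was delayed or lost.
on_alarm_reported("instance_alarm")
print(active["host_alarm"], diagnostics_requested)
```

If the host alarm is reported afterwards, calling `on_alarm_reported("host_alarm")` replaces the suspect entry with a confirmed one.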

Hope that I have made the use case clear.

Thanks,
>
> Ifat.
>
>
>
>
>
> *From: *Yujun Zhang 
>
>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" 
>
> *Date: *Saturday, 7 January 2017 at 09:27
>
>
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev@lists.openstack.org>
>
> *Cc: *"han.jin...@zte.com.cn" , "
> wang.we...@zte.com.cn" , "gong.yah...@zte.com.cn" <
> gong.yah...@zte.com.cn>, "jia.peiy...@zte.com.cn" ,
> "zhang.yuj...@zte.com.cn" 
> *Subject: *Re: [openstack-dev] [Vitrage] About alarms reported by
> datasource and the alarms generated by vitrage evaluator
>
>
>
> The two questions raised by YinLiYin are actually one, i.e. *how to enrich
> the alarm properties* that can be used as a condition in root cause
> deduction.
>
> Both 'suspect' and 'datasource' are additional information that may be
> referred to as a condition in a general fault model, a.k.a. a scenario in
> Vitrage.
>
> It seems it could be done by:
>
> 1.  Introduce a flexible `metadata` dict into the ALARM entity
>
> 2.  Allow generating an update event[1] on metadata change
>
> 3.  Allow using ALARM metadata in a scenario condition
>
> 4.  Allow setting ALARM metadata in a scenario action
>
> This leaves flexibility for continued development through complex
> scenario templates, and keeps the Vitrage evaluator simple and generic.
>
>
>
> My two cents.
>
>
>
> [1]:
> http://docs.openstack.org/developer/vitrage/scenario-evaluator.html#concepts-and-guidelines
>
>
>
>
> On Sat, Jan 7, 2017 at 2:23 AM Afek, Ifat (Nokia - IL) <
> ifat.a...@nokia.com> wrote:
>
> Hi YinLiYin,
>
>
>
> This is an interesting question. Let me divide my answer into two parts.
>
>
>
> First, the case that you described with Nagios and Vitrage. This problem
> depends on the specific Nagios tests that you configure in your system, as
> well as on the Vitrage templates that you use. For example, you can use
> Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
> alarms on the virtual and application layers. This way you will never have
> duplicated alarms. If you want to use Nagios to monitor the other layers as
> well, you can simply modify Vitrage templates so they don’t raise the
> deduced alarms that Nagios may generate, and use the templates to show RCA
> between different Nagios alarms.
>
>
>
> Now let’s talk about the more general case. Vitrage can receive alarms
> from different monitors, including Nagios, Zabbix, collectd and Aodh. If
> you are using more than one monitor, it is possible that the same alarm
> (maybe with a different name) will be raised twice. We need to create a
> mechanism to identify such cases and create a single alarm with the
> properties of both monitors. This has not been designed in detail yet, so
> if you have any 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-07 Thread Afek, Ifat (Nokia - IL)
Hi Yujun,

Thanks for the explanation, but I still don’t fully understand.

Let me start with the current state:
1.   Introduce a flexible `metadata` dict into the ALARM entity
[Ifat] Already exists. An alarm is represented as a vertex in the entity graph, 
with a dictionary of properties.
2.   Allow generating update event[1] on metadata change
3.   Allow using ALARM metadata in scenario condition
[Ifat] Already exists. You can define properties in the ‘entities’ section in 
Vitrage templates
4.   Allow setting ALARM metadata in scenario action

If I understand correctly, you are suggesting that one scenario will add 
metadata to an existing alarm, which will trigger an event, and as a result 
another scenario might be executed?
Can you describe a use case where this behavior will help calculating the root 
cause?

Thanks,
Ifat.


From: Yujun Zhang 
Reply-To: "OpenStack Development Mailing List (not for usage questions)" 

Date: Saturday, 7 January 2017 at 09:27
To: "OpenStack Development Mailing List (not for usage questions)" 

Cc: "han.jin...@zte.com.cn" , "wang.we...@zte.com.cn" 
, "gong.yah...@zte.com.cn" , 
"jia.peiy...@zte.com.cn" , "zhang.yuj...@zte.com.cn" 

Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and 
the alarms generated by vitrage evaluator

The two questions raised by YinLiYin are actually one, i.e. how to enrich the 
alarm properties that can be used as a condition in root cause deduction.

Both 'suspect' and 'datasource' are additional information that may be referred 
to as a condition in a general fault model, a.k.a. a scenario in Vitrage.

It seems it could be done by:

1.  Introduce a flexible `metadata` dict into the ALARM entity
2.  Allow generating an update event[1] on metadata change
3.  Allow using ALARM metadata in a scenario condition
4.  Allow setting ALARM metadata in a scenario action

This leaves flexibility for continued development through complex 
scenario templates, and keeps the Vitrage evaluator simple and generic.

My two cents.

[1]: 
http://docs.openstack.org/developer/vitrage/scenario-evaluator.html#concepts-and-guidelines

On Sat, Jan 7, 2017 at 2:23 AM Afek, Ifat (Nokia - IL) wrote:
Hi YinLiYin,

This is an interesting question. Let me divide my answer into two parts.

First, the case that you described with Nagios and Vitrage. This problem 
depends on the specific Nagios tests that you configure in your system, as well 
as on the Vitrage templates that you use. For example, you can use 
Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced 
alarms on the virtual and application layers. This way you will never have 
duplicated alarms. If you want to use Nagios to monitor the other layers as 
well, you can simply modify Vitrage templates so they don’t raise the deduced 
alarms that Nagios may generate, and use the templates to show RCA between 
different Nagios alarms.

Now let’s talk about the more general case. Vitrage can receive alarms from 
different monitors, including Nagios, Zabbix, collectd and Aodh. If you are 
using more than one monitor, it is possible that the same alarm (maybe with a 
different name) will be raised twice. We need to create a mechanism to identify 
such cases and create a single alarm with the properties of both monitors. This 
has not been designed in detail yet, so if you have any suggestions we will be 
happy to hear them.

Best Regards,
Ifat.


From: "yinli...@zte.com.cn"
Reply-To: "OpenStack Development Mailing List (not for usage questions)"
Date: Friday, 6 January 2017 at 03:27
To: "openstack-dev@lists.openstack.org"
Cc: "gong.yah...@zte.com.cn", "han.jin...@zte.com.cn",
"wang.we...@zte.com.cn", "jia.peiy...@zte.com.cn",
"zhang.yuj...@zte.com.cn"
Subject: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator


Hi all,

   Vitrage generates alarms according to the templates. All the alarms raised by 
Vitrage have the type "vitrage". Suppose 

Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-06 Thread Yujun Zhang
The two questions raised by YinLiYin are actually one, i.e. *how to enrich
the alarm properties* that can be used as a condition in root cause
deduction.

Both 'suspect' and 'datasource' are additional information that may be
referred to as a condition in a general fault model, a.k.a. a scenario in
Vitrage.

It seems it could be done by:

   1. Introduce a flexible `metadata` dict into the ALARM entity
   2. Allow generating an update event[1] on metadata change
   3. Allow using ALARM metadata in a scenario condition
   4. Allow setting ALARM metadata in a scenario action

This leaves flexibility for continued development through complex
scenario templates, and keeps the Vitrage evaluator simple and generic.
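
The four steps can be sketched as a tiny event loop. This is a toy model in plain Python and says nothing about how the Vitrage evaluator is actually implemented. `Alarm`, `set_metadata`, `scenarios`, and `run` are hypothetical names, and the `suspect` and `diagnosed` keys are example metadata.

```python
from collections import deque


class Alarm:
    def __init__(self, name):
        self.name = name
        self.metadata = {}  # step 1: free-form metadata on the alarm


events = deque()  # pending update events


def set_metadata(alarm, key, value):
    """Step 4: set metadata. Step 2: emit an update event only on a real change."""
    if alarm.metadata.get(key) != value:
        alarm.metadata[key] = value
        events.append(alarm)


# Step 3: scenarios are (condition, action) pairs that read alarm metadata.
scenarios = [
    (
        lambda a: a.metadata.get("suspect") is True and "diagnosed" not in a.metadata,
        lambda a: set_metadata(a, "diagnosed", True),
    ),
]


def run():
    """Process update events until none are left; return how many actions fired."""
    fired = 0
    while events:
        alarm = events.popleft()
        for condition, action in scenarios:
            if condition(alarm):
                action(alarm)
                fired += 1
    return fired


host = Alarm("host_alarm")
set_metadata(host, "suspect", True)
fired = run()
print(fired, host.metadata)
```

Emitting an event only when a value really changes is what keeps one scenario from re-triggering itself forever. Any real design would need a similar guard.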

My two cents.

[1]:
http://docs.openstack.org/developer/vitrage/scenario-evaluator.html#concepts-and-guidelines


On Sat, Jan 7, 2017 at 2:23 AM Afek, Ifat (Nokia - IL) 
wrote:

> Hi YinLiYin,
>
>
>
> This is an interesting question. Let me divide my answer into two parts.
>
>
>
> First, the case that you described with Nagios and Vitrage. This problem
> depends on the specific Nagios tests that you configure in your system, as
> well as on the Vitrage templates that you use. For example, you can use
> Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced
> alarms on the virtual and application layers. This way you will never have
> duplicated alarms. If you want to use Nagios to monitor the other layers as
> well, you can simply modify Vitrage templates so they don’t raise the
> deduced alarms that Nagios may generate, and use the templates to show RCA
> between different Nagios alarms.
>
>
>
> Now let’s talk about the more general case. Vitrage can receive alarms
> from different monitors, including Nagios, Zabbix, collectd and Aodh. If
> you are using more than one monitor, it is possible that the same alarm
> (maybe with a different name) will be raised twice. We need to create a
> mechanism to identify such cases and create a single alarm with the
> properties of both monitors. This has not been designed in detail yet, so
> if you have any suggestions we will be happy to hear them.
>
>
>
> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"yinli...@zte.com.cn" 
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" 
> *Date: *Friday, 6 January 2017 at 03:27
> *To: *"openstack-dev@lists.openstack.org" <
> openstack-dev@lists.openstack.org>
> *Cc: *"gong.yah...@zte.com.cn", "han.jin...@zte.com.cn",
> "wang.we...@zte.com.cn" <wang.we...@zte.com.cn>, "jia.peiy...@zte.com.cn",
> "zhang.yuj...@zte.com.cn"
> *Subject: *[openstack-dev] [Vitrage] About alarms reported by datasource
> and the alarms generated by vitrage evaluator
>
>
>
> Hi all,
>
>    Vitrage generates alarms according to the templates. All alarms raised
> by Vitrage have the type "vitrage". Suppose Nagios has an alarm A. When
> alarm A is raised by the Vitrage evaluator according to the action part of
> a scenario, its type is "vitrage". If Nagios reports alarm A later, a new
> alarm A with type "Nagios" is generated in the entity graph. There are then
> two vertices for the same alarm in the graph, and we have to define two
> alarm entities, two relationships and two scenarios in the template file to
> make the alarm propagation procedure work.
>
>    It is inconvenient to describe the fault model of a system with a lot of
> alarms this way. How can we solve this problem?
>
>
>
> 殷力殷 YinLiYin
>
>
>
>
>
>
> 上海市浦东新区碧波路889号中兴研发大楼D502
> D502, ZTE Corporation R&D Center, 889# Bibo Road,
> Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203
> T: +86 21 68896229
> M: +86 13641895907
> E: yinli...@zte.com.cn
> www.zte.com.cn
>


Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the alarms generated by vitrage evaluator

2017-01-06 Thread Afek, Ifat (Nokia - IL)
Hi YinLiYin,

This is an interesting question. Let me divide my answer into two parts.

First, the case that you described with Nagios and Vitrage. This problem 
depends on the specific Nagios tests that you configure in your system, as well 
as on the Vitrage templates that you use. For example, you can use 
Nagios/Zabbix to monitor the physical layer, and Vitrage to raise deduced 
alarms on the virtual and application layers. This way you will never have 
duplicated alarms. If you want to use Nagios to monitor the other layers as 
well, you can simply modify Vitrage templates so they don’t raise the deduced 
alarms that Nagios may generate, and use the templates to show RCA between 
different Nagios alarms.

Now let’s talk about the more general case. Vitrage can receive alarms from 
different monitors, including Nagios, Zabbix, collectd and Aodh. If you are 
using more than one monitor, it is possible that the same alarm (maybe with a 
different name) will be raised twice. We need to create a mechanism to identify 
such cases and create a single alarm with the properties of both monitors. This 
has not been designed in detail yet, so if you have any suggestions we will be 
happy to hear them.
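
One possible shape for such a mechanism, purely as a sketch and not an agreed
design: map each monitor-specific alarm name to a canonical name, and merge
alarms that share the same canonical name and resource. Everything below is
an assumption for illustration, including the `ALIASES` table, the
`merge_alarms` function, the dict layout of an alarm, and the example alarm
names, which are made up rather than taken from real Nagios or Zabbix
configurations.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical alias table:
# (monitor, monitor-specific alarm name) -> canonical alarm name.
ALIASES: Dict[Tuple[str, str], str] = {
    ("nagios", "CPU load"): "cpu_high",
    ("zabbix", "Processor load is too high"): "cpu_high",
}


def merge_alarms(raw_alarms: Iterable[dict]) -> List[dict]:
    """Merge alarms that describe the same condition on the same resource."""
    merged: Dict[Tuple[str, str], dict] = {}
    for raw in raw_alarms:
        source, name = raw["datasource"], raw["name"]
        # Unknown names fall back to themselves, so they are never merged
        # with alarms from another monitor by accident.
        canonical = ALIASES.get((source, name), name)
        key = (canonical, raw["resource"])
        alarm = merged.setdefault(
            key, {"name": canonical, "resource": raw["resource"], "sources": {}}
        )
        # Keep each monitor's properties side by side under its own name.
        alarm["sources"][source] = raw.get("properties", {})
    return list(merged.values())


alarms = merge_alarms([
    {"datasource": "nagios", "name": "CPU load", "resource": "host-1",
     "properties": {"severity": "CRITICAL"}},
    {"datasource": "zabbix", "name": "Processor load is too high",
     "resource": "host-1", "properties": {"priority": 4}},
    {"datasource": "nagios", "name": "CPU load", "resource": "host-2",
     "properties": {"severity": "WARNING"}},
])
for alarm in alarms:
    print(alarm)
```

Here the two host-1 alarms collapse into one entry that keeps both monitors'
properties, while the host-2 alarm stays separate. Who maintains the alias
table, and whether matching should be by name at all, is exactly the open
part of the design.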

Best Regards,
Ifat.


From: "yinli...@zte.com.cn" 
Reply-To: "OpenStack Development Mailing List (not for usage questions)"
Date: Friday, 6 January 2017 at 03:27
To: "openstack-dev@lists.openstack.org"
Cc: "gong.yah...@zte.com.cn", "han.jin...@zte.com.cn", "wang.we...@zte.com.cn",
"jia.peiy...@zte.com.cn", "zhang.yuj...@zte.com.cn"
Subject: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator


Hi all,

   Vitrage generates alarms according to the templates. All alarms raised by 
Vitrage have the type "vitrage". Suppose Nagios has an alarm A. When alarm A is 
raised by the Vitrage evaluator according to the action part of a scenario, its 
type is "vitrage". If Nagios reports alarm A later, a new alarm A with type 
"Nagios" is generated in the entity graph. There are then two vertices for the 
same alarm in the graph, and we have to define two alarm entities, two 
relationships and two scenarios in the template file to make the alarm 
propagation procedure work.

   It is inconvenient to describe the fault model of a system with a lot of 
alarms this way. How can we solve this problem?



殷力殷 YinLiYin




上海市浦东新区碧波路889号中兴研发大楼D502
D502, ZTE Corporation R&D Center, 889# Bibo Road,
Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203
T: +86 21 68896229
M: +86 13641895907
E: yinli...@zte.com.cn
www.zte.com.cn


