Re: [openstack-dev] [vitrage] error handling

2017-06-01 Thread Afek, Ifat (Nokia - IL/Kfar Sava)


From: "Yujun Zhang (ZTE)" 
Date: Thursday, 1 June 2017 at 18:10


On Thu, Jun 1, 2017 at 10:49 PM Afek, Ifat (Nokia - IL/Kfar Sava) wrote:

So for now we agree that we need to add a UI for configuration information and 
datasources status.

Sounds good. To implement this in the UI, we will also need an API to expose 
them, right?


Of course ☺

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vitrage] error handling

2017-06-01 Thread Yujun Zhang (ZTE)
On Thu, Jun 1, 2017 at 10:49 PM Afek, Ifat (Nokia - IL/Kfar Sava) <
ifat.a...@nokia.com> wrote:

> So for now we agree that we need to add a UI for configuration information
> and datasources status.
>
>
Sounds good. To implement this in the UI, we will also need an API to expose
them, right?

-- 
Yujun Zhang


Re: [openstack-dev] [vitrage] error handling

2017-06-01 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
Hi Yujun,

Indeed, during the initialization phase it might be beneficial to make sure the 
user is aware of configuration problems (although I’m not sure that crashing is 
the solution). The problem is that the same code is executed both in 
initialization and later on, so telling the difference is not trivial.

So for now we agree that we need to add a UI for configuration information and 
datasources status.

Best Regards,
Ifat.


Re: [openstack-dev] [vitrage] error handling

2017-05-30 Thread Yujun Zhang (ZTE)
On Tue, May 30, 2017 at 3:59 PM Afek, Ifat (Nokia - IL/Kfar Sava) <
ifat.a...@nokia.com> wrote:

> Hi Yujun,
>
>
>
> You started an interesting discussion. I think that the distinction
> between an operational error and a programmer error is correct and we
> should always keep that in mind.
>
>
>
> I agree that having an overall design for error handling in Vitrage is a
> good idea; but I disagree that until then we better let it crash.
>
>
>
> I think that Vitrage is made out of many pieces that don’t necessarily
> depend on one another. For example, if one datasource fails, everything
> else can work as usual – so why crash? Similarly, if one template fails to
> load, all other templates can still be activated.
>

This usually, if not always, happens during the initialization phase,
doesn't it? That is a period when a human is watching, and such failures
should be detected during deployment or user acceptance testing. So if
something fails, it is better to isolate the problem before continuing to
run, e.g. correct the invalid template or data source configuration, or
remove the template and disable the data source. Such errors are permanent
and will not recover automatically.

Here we need to distinguish the case where a data source is temporarily
unavailable, because of a network connection issue or because the data
source is not up yet. In that case, I agree we had better start the
remaining components and retry periodically until it recovers.
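
A minimal sketch of the distinction drawn above between transient and
permanent failures. TransientError, PermanentError and call_with_retry are
hypothetical names used only for illustration; they are not part of
Vitrage. The sketch also gives up after a fixed number of attempts, whereas
the paragraph above suggests retrying periodically until the data source
recovers.

```python
import time


class TransientError(Exception):
    """Hypothetical: the data source may recover (network down, not up yet)."""


class PermanentError(Exception):
    """Hypothetical: retrying will not help (e.g. invalid configuration)."""


def call_with_retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func, retrying only when it raises TransientError.

    PermanentError and any unexpected exception propagate immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)
```

The sleep function is a parameter only so that the waiting can be replaced
in tests; a real service would more likely reschedule the call from its
periodic task loop.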


> Another aspect is that the main purpose of Vitrage is to provide insights.
> In case of a failure in one datasource/template, some of the insights might
> be missing. But this will not lead to inaccurate behavior or to wrong
> actions being executed in the system. IMO, we should give the user as much
> information as possible given that we have only part of the input.
>

I agree, if enough insights can be provided by the running system. We can
improve the handling of permanent errors. Even better would be to support
hot loading of components and templates.

What I don't like much is that errors are sometimes handled without enough
detail. In that case, a crash with a stack trace is more useful than a user
"friendly" message like "failed to start xxx component" or "invalid
configuration file" (I'm not talking about Vitrage; this is quite common in
many projects).
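
One way to keep the details while still handling the error, sketched with
the standard logging module. start_component is a hypothetical helper, not
a Vitrage function. Logger.exception logs at ERROR level and appends the
current traceback to the message.

```python
import logging

LOG = logging.getLogger(__name__)


def start_component(name, start):
    """Start a component; on failure log the full traceback, return False."""
    try:
        start()
        return True
    except Exception:
        # Keeps the stack trace in the log instead of replacing it with a
        # bare "failed to start" message.
        LOG.exception("failed to start component %s", name)
        return False
```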

My preference is "good error handling" > "no error handling" > "bad error
handling". Though it is difficult to tell what is good error handling and
what is bad...

Regarding the use cases that you mentioned:
>
>
>
>1. invalid configuration file
>
> [Ifat] This should depend on the specific configuration. If keystone is
> misconfigured, nothing will work of course. But if for example Zabbix is
> misconfigured, Vitrage should work and show the topology and the non-Zabbix
> alarms.
>

Agree. It should be handled differently depending on the kind of error and
how critical it is.


>
>1. failed to communicate with data source
>
> [Ifat] I think that the error should be logged, and all other datasources
> should work as usual.
>

Yes, and it would be good to have a retry mechanism


>
>1. malformed data from data source
>
> [Ifat] I think that the error should be logged, and all other datasources
> should work as usual. This problem means we must modify the code in the
> datasource itself, but until then Vitrage should work, right?
>
Yes, I think this can happen when the data source version changes. We
should discard the data and indicate the error. The other parts should not
be affected.
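
A sketch of "discard the data and indicate the error". The field names in
REQUIRED_FIELDS are an invented schema for illustration, not the format of
any real datasource.

```python
REQUIRED_FIELDS = ("id", "type", "state")  # hypothetical schema


def split_valid_events(events):
    """Return (valid, rejected); rejected holds (event, reason) pairs."""
    valid, rejected = [], []
    for event in events:
        if not isinstance(event, dict):
            rejected.append((event, "not a mapping"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in event]
        if missing:
            rejected.append((event, "missing fields: " + ", ".join(missing)))
        else:
            valid.append(event)
    return valid, rejected
```

The rejected list carries a reason for each discarded item, so the error can
be reported rather than silently dropped.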


>1. failed to execute an action
>
> [Ifat] Again, that’s a problem that requires code changes; but why fail
> other actions?
>

What I meant here is a temporary failure, e.g. when you try to mark a host
down but cannot reach it because of a network connection issue or for some
other reason.


>1. ...
>
> BTW, it might be a good idea to add API/UI for showing the configuration
> and the status of the datasources. We all know that errors in the log files
> are often ignored…
>

Sure, the errors I mentioned above are what system operators could
encounter even with a correct configuration, and they are not related to
software bugs. Displaying them in the UI would be very helpful. The log
files are more for engineers analysing the root cause.
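
A rough sketch of the kind of status record an API/UI could expose. The
DatasourceStatus class and its method names are hypothetical; no such
Vitrage API is confirmed by this thread.

```python
import json


class DatasourceStatus:
    """Hypothetical in-memory record of each datasource's last known state."""

    def __init__(self):
        self._status = {}

    def report_ok(self, name):
        self._status[name] = {"state": "ok", "error": None}

    def report_error(self, name, error):
        self._status[name] = {"state": "error", "error": str(error)}

    def as_json(self):
        """Serialize the status table, e.g. as the body of an API response."""
        return json.dumps(self._status, sort_keys=True)
```

Each datasource would call report_ok or report_error after every poll, and
the UI would read the serialized table instead of the log files.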


> Best Regards,
>
> Ifat.
>
>
>
>
>
> *From: *"Yujun Zhang (ZTE)" <zhangyujun+...@gmail.com>
> *Reply-To: *"OpenStack Development Mailing List (not for usage
> questions)" <openstack-dev@lists.openstack.org>
> *Date: *Monday, 29 May 2017 at 16:13
> *To: *"OpenStack Development Mailing List (not for usage questions)" <
> openstack-dev@lists.openstack.org>
>

Re: [openstack-dev] [vitrage] error handling

2017-05-30 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
Hi Yujun,

You started an interesting discussion. I think that the distinction between an 
operational error and a programmer error is correct and we should always keep 
that in mind.

I agree that having an overall design for error handling in Vitrage is a good 
idea; but I disagree that until then we should let it crash.

I think that Vitrage is made out of many pieces that don’t necessarily depend 
on one another. For example, if one datasource fails, everything else can work 
as usual – so why crash? Similarly, if one template fails to load, all other 
templates can still be activated.

Another aspect is that the main purpose of Vitrage is to provide insights. In 
case of a failure in one datasource/template, some of the insights might be 
missing. But this will not lead to inaccurate behavior or to wrong actions 
being executed in the system. IMO, we should give the user as much information 
as possible given that we have only part of the input.
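
The isolation described above could be sketched as follows. collect_all and
the shape of its argument are assumptions made for illustration, not the
actual Vitrage datasource interface. Catching Exception here is deliberately
broad: the point is that one failing datasource must not stop the others.

```python
import logging

LOG = logging.getLogger(__name__)


def collect_all(datasources):
    """Poll every datasource; a failure in one does not stop the others.

    `datasources` maps a name to a zero-argument callable returning a list
    of entities. Returns (entities, failed), where failed maps a datasource
    name to its error message.
    """
    entities, failed = [], {}
    for name, fetch in datasources.items():
        try:
            entities.extend(fetch())
        except Exception as exc:
            LOG.exception("datasource %s failed", name)
            failed[name] = str(exc)
    return entities, failed
```

The caller gets partial results plus the list of failures, so the user can
still be given as much information as the available input allows.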

Regarding the use cases that you mentioned:


  1.  invalid configuration file
[Ifat] This should depend on the specific configuration. If keystone is 
misconfigured, nothing will work of course. But if for example Zabbix is 
misconfigured, Vitrage should work and show the topology and the non-Zabbix 
alarms.


  2.  failed to communicate with data source
[Ifat] I think that the error should be logged, and all other datasources 
should work as usual.


  3.  malformed data from data source

[Ifat] I think that the error should be logged, and all other datasources 
should work as usual. This problem means we must modify the code in the 
datasource itself, but until then Vitrage should work, right?


  4.  failed to execute an action
[Ifat] Again, that’s a problem that requires code changes; but why fail other 
actions?


  5.  ...

BTW, it might be a good idea to add API/UI for showing the configuration and 
the status of the datasources. We all know that errors in the log files are 
often ignored…

Best Regards,
Ifat.




[openstack-dev] [vitrage] error handling

2017-05-29 Thread Yujun Zhang (ZTE)
Prompted by a recent code review, I think the error handling rules are worth
a thorough discussion.

I once read an article[1] from Joyent, and its distinction between
*operational* errors and *programmer* errors impressed me. The article is
written for Node.js, but the principle also applies to other programming
languages.

The basic rules recommended by Joyent are:

   - Handling operational errors
   - (Not) handling programmer errors

There is also one rule in the OpenStack style guidelines[2] close to this
idea.

[H201] Do not write except:, use except Exception: at the very least. When
catching an exception you should be as specific so you don’t mistakenly
catch unexpected exceptions.
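
To illustrate H201 together with the operational/programmer distinction,
here is a minimal sketch. ConfigError and load_config are hypothetical names
used only for this example. Only the expected operational failure (malformed
input) is caught; anything else is left to propagate.

```python
import json


class ConfigError(Exception):
    """Hypothetical error type for an operational problem: bad configuration."""


def load_config(text):
    """Parse a JSON configuration string, handling only the expected failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Operational error: the input is malformed. Report it with context.
        raise ConfigError("invalid configuration: %s" % exc) from exc
    # Anything else (e.g. a TypeError from passing None instead of a string)
    # is a programmer error and is deliberately not caught here.
```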

I do think that until we have well-designed error handling, it is better to
let it crash. It is dangerous to hide errors and keep the system running in
an undetermined state.

So the question is *what kind of operational errors are we facing in
vitrage?* I can think of something like

   1. invalid configuration file
   2. failed to communicate with data source
   3. malformed data from data source
   4. failed to execute an action
   5. ...

Maybe this could be the first step for the error handling design.

[1]: https://www.joyent.com/node-js/production/design/errors
[2]: https://docs.openstack.org/developer/hacking/

-- 
Yujun Zhang