Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Adam Spiers

Thanks for the clarification Greg.  This sounds like it has the
potential to be a very useful capability.  May I suggest that you
propose a new user story for it, along similar lines to this existing
one?

http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

Waines, Greg <greg.wai...@windriver.com> wrote:

Yes that’s correct.
VM Heartbeating / Health-check Monitoring would introduce intrusive / white-box 
type monitoring of VMs / Instances.

I realize this is somewhat in the gray-zone of what a cloud should be 
monitoring or not,
but I believe it provides an alternative for Applications deployed in VMs that 
do not have an external monitoring/management entity like a VNF Manager in the 
MANO architecture.
And even for VMs with VNF Managers, it provides a highly reliable alternate 
monitoring path that does not rely on Tenant Networking.

You’re correct, that VM HB/HC Monitoring would leverage
https://wiki.libvirt.org/page/Qemu_guest_agent
that would require the agent to be installed in the images for talking back to 
the compute host.
( there are other examples of similar approaches in openstack ... the 
murano-agent for installation, the swift-agent for object store management )
Although here, in the case of VM HB/HC Monitoring, via the QEMU Guest Agent, 
the messaging path is internal thru a QEMU virtual serial device.  i.e. a very 
simple interface with very few dependencies ... it’s up and available very 
early in VM lifecycle and virtually always up.

Wrt failure modes / use-cases

· a VM’s response to a Heartbeat Challenge Request can be as simple as
  just ACK-ing, this alone allows for detection of:

  o   a failed or hung QEMU/KVM instance, or

  o   a failed or hung VM’s OS, or

  o   a failure of the VM’s OS to schedule the QEMU Guest Agent daemon, or

  o   a failure of the VM to route basic IO via linux sockets.

· I have had feedback that this is similar to the virtual hardware
  watchdog of QEMU/KVM ( https://libvirt.org/formatdomain.html#elementsWatchdog )

· However, the VM Heartbeat / Health-check Monitoring

  o   provides a higher-level (i.e. application-level) heartbeating

      §  i.e. if the Heartbeat requests are being answered by the Application
         running within the VM

  o   provides more than just heartbeating, as the Application can use it to
      trigger a variety of audits,

  o   provides a mechanism for the Application within the VM to report a Health
      Status / Info back to the Host / Cloud,

  o   provides notification of the Heartbeat / Health-check status to
      higher-level cloud entities thru Vitrage

      §  e.g.   VM-Heartbeat-Monitor - to - Vitrage - (EventAlarm) - Aodh - ... - VNF-Manager
                                                    - (StateChange) - Nova - ... - VNF Manager


Greg.



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev





Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Adam Spiers

Yep :-)  That's pretty much exactly what I was suggesting elsewhere in
this thread:

http://lists.openstack.org/pipermail/openstack-dev/2017-May/116748.html

Waines, Greg <greg.wai...@windriver.com> wrote:

Excellent.
Yeah I just watched your Boston Summit presentation and noticed, at least when 
you were talking about host-monitoring, you were looking at having alternative 
backends for reporting e.g. to masakari-api or to mistral or ... to Vitrage :)

Greg.



Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Waines, Greg
Excellent.
Yeah I just watched your Boston Summit presentation and noticed, at least when 
you were talking about host-monitoring, you were looking at having alternative 
backends for reporting e.g. to masakari-api or to mistral or ... to Vitrage :)

Greg.



Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Waines, Greg
Yes that’s correct.
VM Heartbeating / Health-check Monitoring would introduce intrusive / white-box 
type monitoring of VMs / Instances.

I realize this is somewhat in the gray-zone of what a cloud should be 
monitoring or not,
but I believe it provides an alternative for Applications deployed in VMs that 
do not have an external monitoring/management entity like a VNF Manager in the 
MANO architecture.
And even for VMs with VNF Managers, it provides a highly reliable alternate 
monitoring path that does not rely on Tenant Networking.

You’re correct, that VM HB/HC Monitoring would leverage
https://wiki.libvirt.org/page/Qemu_guest_agent
that would require the agent to be installed in the images for talking back to 
the compute host.
( there are other examples of similar approaches in openstack ... the 
murano-agent for installation, the swift-agent for object store management )
Although here, in the case of VM HB/HC Monitoring, via the QEMU Guest Agent, 
the messaging path is internal thru a QEMU virtual serial device.  i.e. a very 
simple interface with very few dependencies ... it’s up and available very 
early in VM lifecycle and virtually always up.
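
As a rough illustration of the messaging path described above: the QEMU Guest Agent exchanges JSON messages over the virtual serial channel, and `guest-ping` is its built-in no-op liveness command. The sketch below only builds the request and interprets a reply. The `send_and_receive` callable is hypothetical and stands in for whatever the compute host would use to reach the channel; nothing here is taken from existing Vitrage or Masakari code.

```python
import json
from typing import Callable, Optional


def build_guest_ping() -> str:
    """Serialize the QEMU Guest Agent 'guest-ping' command."""
    return json.dumps({"execute": "guest-ping"})


def is_alive(raw_reply: Optional[str]) -> bool:
    """Treat a well-formed reply carrying a 'return' key as a heartbeat ACK."""
    if raw_reply is None:
        # Timeout: nothing came back over the channel.
        return False
    try:
        reply = json.loads(raw_reply)
    except ValueError:
        # Unparseable data is treated the same as no answer.
        return False
    return isinstance(reply, dict) and "return" in reply


def heartbeat_once(send_and_receive: Callable[[str], Optional[str]]) -> bool:
    """Send one challenge through the (hypothetical) transport and check it."""
    return is_alive(send_and_receive(build_guest_ping()))
```

A successful `guest-ping` comes back as `{"return": {}}`, so a missing, malformed, or error reply is what the host side would count as a missed heartbeat.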

Wrt failure modes / use-cases

· a VM’s response to a Heartbeat Challenge Request can be as simple as
  just ACK-ing, this alone allows for detection of:

  o   a failed or hung QEMU/KVM instance, or

  o   a failed or hung VM’s OS, or

  o   a failure of the VM’s OS to schedule the QEMU Guest Agent daemon, or

  o   a failure of the VM to route basic IO via linux sockets.

· I have had feedback that this is similar to the virtual hardware
  watchdog of QEMU/KVM ( https://libvirt.org/formatdomain.html#elementsWatchdog )

· However, the VM Heartbeat / Health-check Monitoring

  o   provides a higher-level (i.e. application-level) heartbeating

      §  i.e. if the Heartbeat requests are being answered by the Application
         running within the VM

  o   provides more than just heartbeating, as the Application can use it to
      trigger a variety of audits,

  o   provides a mechanism for the Application within the VM to report a Health
      Status / Info back to the Host / Cloud,

  o   provides notification of the Heartbeat / Health-check status to
      higher-level cloud entities thru Vitrage

      §  e.g.   VM-Heartbeat-Monitor - to - Vitrage - (EventAlarm) - Aodh - ... - VNF-Manager
                                                    - (StateChange) - Nova - ... - VNF Manager
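
The failure detection and health reporting in the list above could be sketched roughly as follows. Everything here is hypothetical: the `HeartbeatMonitor` class, the `max_missed` threshold, and the `status` / `info` reply fields are illustrative assumptions, not part of any existing API or of the proposal itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HeartbeatMonitor:
    """Tracks heartbeat challenge replies from one VM (hypothetical sketch)."""
    instance_id: str
    max_missed: int = 3  # assumed threshold before declaring a failure
    missed: int = 0
    events: List[dict] = field(default_factory=list)

    def record_reply(self, reply: Optional[dict]) -> None:
        """reply is None on timeout, otherwise e.g. {} or {"status": ...}."""
        if reply is None:
            self.missed += 1
            # Emit once, exactly when the threshold is reached.
            if self.missed == self.max_missed:
                self._emit("heartbeat-failed", "no reply to challenge")
            return
        self.missed = 0
        # A bare ACK ({}) counts as healthy; the application may say otherwise.
        if reply.get("status", "healthy") != "healthy":
            self._emit("health-check-failed", reply.get("info", ""))

    def _emit(self, kind: str, info: str) -> None:
        # A real monitor would forward this to e.g. Vitrage; here it is queued.
        self.events.append(
            {"instance": self.instance_id, "event": kind, "info": info})
```

A simple ACK resets the miss counter, while an application-level reply can carry a health status that is reported even though the VM itself is still answering.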


Greg.




Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-16 Thread Adam Spiers

Waines, Greg wrote:

Sam,

Two other, higher-level points I wanted to discuss with you about Masakari.


First,
so I notice that you are doing monitoring, auto-recovery and even host
maintenance type functionality as part of the Masakari architecture.

are you open to some configurability (enabling/disabling) of these
capabilities?


I can't speak for Sampath or the Masakari developers, but the monitors
are standalone components.  Currently they can only send notifications
in a format which the masakari-api service can understand, but I guess
it wouldn't be hard to extend them to send notifications in other
formats if that made sense.


e.g. OPNFV guys would NOT want auto-recovery, they would prefer that fault
events get reported to Vitrage ... and eventually filter up to Aodh Alarms
that get received by VNFManagers which would be responsible for the recovery.

e.g. some deployers of openstack might want to disable parts or all of your
monitoring, if using other mechanisms such as Zabbix or Nagios for the host
monitoring (say)


Yes, exactly!  This kind of configurability and flexibility which
would allow each cloud architect to choose which monitoring / alerting
/ recovery components suit their requirements best in a "mix'n'match"
fashion, is exactly what we are aiming for with our modular approach
to the design of compute plane HA.  If the various monitoring
components adopt a driver-based approach to alerting and/or the
ability to alert via a lowest common denominator format such as simple
HTTP POST of JSON blobs, then it should be possible for each cloud
deployer to integrate the monitors with whichever reporting dashboards
/ recovery workflow controllers best satisfy their requirements.
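
To make the driver-based idea above a little more concrete, here is a minimal sketch of a monitor that picks an alert driver by name, with one "lowest common denominator" driver that POSTs a JSON blob. The registry, the driver name `http-json`, the event fields, and the URL are all assumptions made up for illustration; this is not how masakari-monitors is currently implemented.

```python
import json
import urllib.request
from typing import Callable, Dict

# Registry mapping a driver name to a callable that builds one alert request.
DRIVERS: Dict[str, Callable[[dict, str], urllib.request.Request]] = {}


def driver(name: str):
    """Decorator that registers an alert driver under the given name."""
    def register(func):
        DRIVERS[name] = func
        return func
    return register


@driver("http-json")
def build_http_json_alert(event: dict, url: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying the event as JSON."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def alert(driver_name: str, event: dict, url: str,
          send=urllib.request.urlopen):
    """Look up the configured driver and hand its request to `send`."""
    request = DRIVERS[driver_name](event, url)
    return send(request)
```

Passing a different `send` callable (as a test would) avoids any network traffic, and adding a Vitrage- or Mistral-specific driver would just mean registering another function under a new name.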


Second, are you open to configurably having fault events reported to
Vitrage ?


Again I can't speak on behalf of the Masakari project, but this sounds
like a great idea to me :)


Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-16 Thread Adam Spiers

Waines, Greg wrote:

thanks for the pointers Sam.

I took a quick look.
I agree that the VM Heartbeat / Health-check looks like a good fit into 
Masakari.

Currently your instance monitoring looks like it is strictly black-box type
monitoring thru libvirt events.
Is that correct ?
i.e. you do not do any intrusive type monitoring of the instance thru the QEMU
Guest Agent facility, correct ?


That is correct:

https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py


I think this is what VM Heartbeat / Health-check would add to Masakari.
Let me know if you agree.


OK, so you are looking for something slightly different I guess, based
on this QEMU guest agent?

   https://wiki.libvirt.org/page/Qemu_guest_agent

That would require the agent to be installed in the images, which is
extra work but I imagine quite easily justifiable in some scenarios.
What failure modes do you have in mind for covering with this
approach - things like the guest kernel freezing, for instance?


Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-16 Thread Adam Spiers

Afek, Ifat (Nokia - IL/Kfar Sava) wrote:

On 16/05/2017, 4:36, "Sam P" wrote:

   Hi Greg,

   In Masakari [0] for VMHA, we have already implemented a somewhat
   similar function in masakari-monitors.
   Masakari-monitors runs on the nova-compute node, and monitors for host,
   process or instance failures.
   The Masakari instance monitor has similar functionality to what you
   have described.
   Please see [1] for more details on instance monitoring.
   [0] https://wiki.openstack.org/wiki/Masakari
   [1] https://github.com/openstack/masakari-monitors/tree/master/masakarimonitors/instancemonitor

   Once masakari-monitors detects failures, it will send notifications to
   masakari-api to take appropriate recovery actions to recover that VM
   from failures.


You can also find out more about our architectural plans by watching
this talk which Sampath and I gave in Boston:

  https://www.openstack.org/videos/boston-2017/high-availability-for-instances-moving-to-a-converged-upstream-solution

The slides are here:

  https://aspiers.github.io/openstack-summit-2017-boston-compute-ha/

We didn't go into much depth on monitoring and recovery of individual
VMs, but as Sampath explained, Masakari already handles both of these.


Hi Greg, Sam,

As Vitrage is about correlating alarms that come from different
sources, and is not a monitor by itself – I think that it can benefit
from information retrieved by both Masakari and Zabbix monitors.

Zabbix is already integrated into Vitrage. I don’t know if there are
specific tests for VM heartbeat, but I think it is very likely that
there are.  Regarding Masakari – looking at your documents, I believe
that integrating your monitoring information into Vitrage could be
quite straightforward.


Yes, this makes sense.  Masakari already cleanly decouples
monitoring/alerting from automated recovery, so it could support this
quite nicely.  And the modular converged architecture we explained in
the presentation will maintain that clean separation of
responsibilities whilst integrating Masakari together with other
components such as Pacemaker, Mistral, and maybe Vitrage too.

For example whilst so far this thread has been about VM instance
monitoring, another area where Vitrage could integrate with Masakari
is compute host monitoring.

If you watch this part of our presentation where we explained the next
generation architecture, you'll see that we propose a new
"nova-host-alerter" component which has a driver-based mechanism for
alerting different services when a compute host experiences a failure:

   https://youtu.be/YPKE1guti8E?t=32m43s

So one obvious possibility would be to add a driver for Vitrage, so
that Vitrage can be alerted when Pacemaker spots a host failure.

Similarly, we could extend Pacemaker configurations to alert Vitrage
when individual processes such as nova-compute or libvirtd fail.
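
One way the Pacemaker side of this could look, as a sketch only: Pacemaker's alerts feature runs an external agent and passes event details to it in `CRM_alert_*` environment variables. The function below turns those into a JSON-ready dict. The output field names, and the idea of forwarding the result to Vitrage, are assumptions for illustration; no such Vitrage integration is described in this thread.

```python
from typing import Mapping


def event_from_pacemaker_env(env: Mapping[str, str]) -> dict:
    """Translate Pacemaker alert variables into a generic event dict.

    The output keys are hypothetical; only the CRM_alert_* names come
    from Pacemaker itself.
    """
    return {
        # Kind of alert, e.g. "node", "fencing" or "resource".
        "kind": env.get("CRM_alert_kind", "unknown"),
        # Name of the affected cluster node.
        "node": env.get("CRM_alert_node", ""),
        # Affected resource, e.g. a nova-compute or libvirtd resource.
        "resource": env.get("CRM_alert_rsc", ""),
        "description": env.get("CRM_alert_desc", ""),
        "timestamp": env.get("CRM_alert_timestamp", ""),
    }
```

An alert agent script would call this with `os.environ` and then hand the resulting dict to whichever alerting driver the deployer has chosen.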

If you would like to discuss any of this further or have any more
questions, in addition to this mailing list we are also available to
talk on the #openstack-ha IRC channel!

Cheers,
Adam

P.S. I've added the [HA] badge to this thread since this discussion is
definitely related to high availability.
