Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Thanks for the clarification, Greg. This sounds like it has the potential to be a very useful capability. May I suggest that you propose a new user story for it, along similar lines to this existing one?

http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

Waines, Greg <greg.wai...@windriver.com> wrote:
> Yes, that’s correct. VM Heartbeating / Health-check Monitoring would introduce intrusive / white-box type monitoring of VMs / Instances.
>
> I realize this is somewhat in the gray zone of what a cloud should or should not be monitoring, but I believe it provides an alternative for Applications deployed in VMs that do not have an external monitoring/management entity like a VNF Manager in the MANO architecture. And even for VMs with VNF Managers, it provides a highly reliable alternate monitoring path that does not rely on Tenant Networking.
>
> You’re correct that VM HB/HC Monitoring would leverage the QEMU Guest Agent ( https://wiki.libvirt.org/page/Qemu_guest_agent ), which would require the agent to be installed in the images for talking back to the compute host. (There are other examples of similar approaches in OpenStack ... the murano-agent for installation, the swift-agent for object store management.) Although here, in the case of VM HB/HC Monitoring via the QEMU Guest Agent, the messaging path is internal, thru a QEMU virtual serial device, i.e. a very simple interface with very few dependencies ... it’s up and available very early in the VM lifecycle and virtually always up.
>
> Wrt failure modes / use-cases
>
> · a VM’s response to a Heartbeat Challenge Request can be as simple as just ACK-ing; this alone allows for detection of:
>     o a failed or hung QEMU/KVM instance, or
>     o a failed or hung VM’s OS, or
>     o a failure of the VM’s OS to schedule the QEMU Guest Agent daemon, or
>     o a failure of the VM to route basic IO via linux sockets.
> · I have had feedback that this is similar to the virtual hardware watchdog of QEMU/KVM ( https://libvirt.org/formatdomain.html#elementsWatchdog )
> · However, the VM Heartbeat / Health-check Monitoring
>     o provides a higher-level (i.e. application-level) heartbeating
>         § i.e. if the Heartbeat requests are being answered by the Application running within the VM
>     o provides more than just heartbeating, as the Application can use it to trigger a variety of audits,
>     o provides a mechanism for the Application within the VM to report a Health Status / Info back to the Host / Cloud,
>     o provides notification of the Heartbeat / Health-check status to higher-level cloud entities thru Vitrage
>         § e.g. VM-Heartbeat-Monitor - to - Vitrage - (EventAlarm) - Aodh - ... - VNF-Manager - (StateChange) - Nova - ... - VNF Manager
>
> Greg.
>
> From: Adam Spiers <aspi...@suse.com>
> Reply-To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
> Date: Tuesday, May 16, 2017 at 7:29 PM
> To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
> Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
>
> > Waines, Greg <greg.wai...@windriver.com> wrote:
> > > thanks for the pointers Sam.
> > >
> > > I took a quick look. I agree that the VM Heartbeat / Health-check looks like a good fit into Masakari. Currently your instance monitoring looks like it is strictly black-box type monitoring thru libvirt events. Is that correct? i.e. you do not do any intrusive type monitoring of the instance thru the QEMU Guest Agent facility, correct?
> >
> > That is correct:
> >
> > https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py
> >
> > > I think this is what VM Heartbeat / Health-check would add to Masakari. Let me know if you agree.
> >
> > OK, so you are looking for something slightly different I guess, based on this QEMU guest agent?
> >
> > https://wiki.libvirt.org/page/Qemu_guest_agent
> >
> > That would require the agent to be installed in the images, which is extra work but I imagine quite easily justifiable in some scenarios. What failure modes do you have in mind for covering with this approach - things like the guest kernel freezing, for instance?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Yep :-) That's pretty much exactly what I was suggesting elsewhere in this thread:

http://lists.openstack.org/pipermail/openstack-dev/2017-May/116748.html

Waines, Greg <greg.wai...@windriver.com> wrote:
> Excellent.
>
> Yeah, I just watched your Boston Summit presentation and noticed that, at least when you were talking about host monitoring, you were looking at having alternative backends for reporting, e.g. to masakari-api or to mistral or ... to Vitrage :)
>
> Greg.
>
> From: Adam Spiers <aspi...@suse.com>
> Reply-To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
> Date: Tuesday, May 16, 2017 at 7:42 PM
> To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
> Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
>
> > Waines, Greg <greg.wai...@windriver.com> wrote:
> > > Sam,
> > >
> > > Two other higher-level points I wanted to discuss with you about Masakari.
> > >
> > > First, I notice that you are doing monitoring, auto-recovery and even host maintenance type functionality as part of the Masakari architecture. Are you open to some configurability (enabling/disabling) of these capabilities?
> >
> > I can't speak for Sampath or the Masakari developers, but the monitors are standalone components. Currently they can only send notifications in a format which the masakari-api service can understand, but I guess it wouldn't be hard to extend them to send notifications in other formats if that made sense.
> >
> > > e.g. OPNFV guys would NOT want auto-recovery; they would prefer that fault events get reported to Vitrage ... and eventually filter up to Aodh Alarms that get received by VNF Managers, which would be responsible for the recovery.
> > >
> > > e.g. some deployers of OpenStack might want to disable parts or all of your monitoring, if using other mechanisms such as Zabbix or Nagios for the host monitoring (say)
> >
> > Yes, exactly! This kind of configurability and flexibility, which would allow each cloud architect to choose which monitoring / alerting / recovery components suit their requirements best in a "mix'n'match" fashion, is exactly what we are aiming for with our modular approach to the design of compute plane HA. If the various monitoring components adopt a driver-based approach to alerting and/or the ability to alert via a lowest common denominator format such as simple HTTP POST of JSON blobs, then it should be possible for each cloud deployer to integrate the monitors with whichever reporting dashboards / recovery workflow controllers best satisfy their requirements.
> >
> > > Second, are you open to configurably having fault events reported to Vitrage?
> >
> > Again, I can't speak on behalf of the Masakari project, but this sounds like a great idea to me :)
Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Excellent.

Yeah, I just watched your Boston Summit presentation and noticed that, at least when you were talking about host monitoring, you were looking at having alternative backends for reporting, e.g. to masakari-api or to mistral or ... to Vitrage :)

Greg.

From: Adam Spiers <aspi...@suse.com>
Reply-To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Date: Tuesday, May 16, 2017 at 7:42 PM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

> Waines, Greg <greg.wai...@windriver.com> wrote:
> > Sam,
> >
> > Two other higher-level points I wanted to discuss with you about Masakari.
> >
> > First, I notice that you are doing monitoring, auto-recovery and even host maintenance type functionality as part of the Masakari architecture. Are you open to some configurability (enabling/disabling) of these capabilities?
>
> I can't speak for Sampath or the Masakari developers, but the monitors are standalone components. Currently they can only send notifications in a format which the masakari-api service can understand, but I guess it wouldn't be hard to extend them to send notifications in other formats if that made sense.
>
> > e.g. OPNFV guys would NOT want auto-recovery; they would prefer that fault events get reported to Vitrage ... and eventually filter up to Aodh Alarms that get received by VNF Managers, which would be responsible for the recovery.
> >
> > e.g. some deployers of OpenStack might want to disable parts or all of your monitoring, if using other mechanisms such as Zabbix or Nagios for the host monitoring (say)
>
> Yes, exactly! This kind of configurability and flexibility, which would allow each cloud architect to choose which monitoring / alerting / recovery components suit their requirements best in a "mix'n'match" fashion, is exactly what we are aiming for with our modular approach to the design of compute plane HA. If the various monitoring components adopt a driver-based approach to alerting and/or the ability to alert via a lowest common denominator format such as simple HTTP POST of JSON blobs, then it should be possible for each cloud deployer to integrate the monitors with whichever reporting dashboards / recovery workflow controllers best satisfy their requirements.
>
> > Second, are you open to configurably having fault events reported to Vitrage?
>
> Again, I can't speak on behalf of the Masakari project, but this sounds like a great idea to me :)
Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Yes, that’s correct. VM Heartbeating / Health-check Monitoring would introduce intrusive / white-box type monitoring of VMs / Instances.

I realize this is somewhat in the gray zone of what a cloud should or should not be monitoring, but I believe it provides an alternative for Applications deployed in VMs that do not have an external monitoring/management entity like a VNF Manager in the MANO architecture. And even for VMs with VNF Managers, it provides a highly reliable alternate monitoring path that does not rely on Tenant Networking.

You’re correct that VM HB/HC Monitoring would leverage the QEMU Guest Agent ( https://wiki.libvirt.org/page/Qemu_guest_agent ), which would require the agent to be installed in the images for talking back to the compute host. (There are other examples of similar approaches in OpenStack ... the murano-agent for installation, the swift-agent for object store management.) Although here, in the case of VM HB/HC Monitoring via the QEMU Guest Agent, the messaging path is internal, thru a QEMU virtual serial device, i.e. a very simple interface with very few dependencies ... it’s up and available very early in the VM lifecycle and virtually always up.

Wrt failure modes / use-cases

· a VM’s response to a Heartbeat Challenge Request can be as simple as just ACK-ing; this alone allows for detection of:
    o a failed or hung QEMU/KVM instance, or
    o a failed or hung VM’s OS, or
    o a failure of the VM’s OS to schedule the QEMU Guest Agent daemon, or
    o a failure of the VM to route basic IO via linux sockets.
· I have had feedback that this is similar to the virtual hardware watchdog of QEMU/KVM ( https://libvirt.org/formatdomain.html#elementsWatchdog )
· However, the VM Heartbeat / Health-check Monitoring
    o provides a higher-level (i.e. application-level) heartbeating
        § i.e. if the Heartbeat requests are being answered by the Application running within the VM
    o provides more than just heartbeating, as the Application can use it to trigger a variety of audits,
    o provides a mechanism for the Application within the VM to report a Health Status / Info back to the Host / Cloud,
    o provides notification of the Heartbeat / Health-check status to higher-level cloud entities thru Vitrage
        § e.g. VM-Heartbeat-Monitor - to - Vitrage - (EventAlarm) - Aodh - ... - VNF-Manager - (StateChange) - Nova - ... - VNF Manager

Greg.

From: Adam Spiers <aspi...@suse.com>
Reply-To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Date: Tuesday, May 16, 2017 at 7:29 PM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

> Waines, Greg <greg.wai...@windriver.com> wrote:
> > thanks for the pointers Sam.
> >
> > I took a quick look. I agree that the VM Heartbeat / Health-check looks like a good fit into Masakari. Currently your instance monitoring looks like it is strictly black-box type monitoring thru libvirt events. Is that correct? i.e. you do not do any intrusive type monitoring of the instance thru the QEMU Guest Agent facility, correct?
>
> That is correct:
>
> https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py
>
> > I think this is what VM Heartbeat / Health-check would add to Masakari. Let me know if you agree.
>
> OK, so you are looking for something slightly different I guess, based on this QEMU guest agent?
>
> https://wiki.libvirt.org/page/Qemu_guest_agent
>
> That would require the agent to be installed in the images, which is extra work but I imagine quite easily justifiable in some scenarios. What failure modes do you have in mind for covering with this approach - things like the guest kernel freezing, for instance?
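The heartbeat challenge / ACK exchange described in this message could be sketched roughly as below. This is only an illustration under stated assumptions, not the proposed implementation: `HeartbeatMonitor`, `max_missed`, and the optional `health` field in the reply are hypothetical names. The `guest-ping` command is a standard QEMU Guest Agent command whose bare ACK is `{"return": {}}`. The transport is injected as a callable, so the sketch does not depend on any particular way of reaching the agent; on a real compute host it might wrap something like libvirt-python's `libvirt_qemu.qemuAgentCommand`, with the wrapper translating agent timeouts into `TimeoutError`.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Standard QEMU Guest Agent ping; a bare ACK looks like {"return": {}}.
PING = '{"execute": "guest-ping"}'


@dataclass
class HeartbeatMonitor:
    """Tracks heartbeat state for one VM (hypothetical helper, not a real API)."""

    # Injected transport: returns the guest's raw JSON reply, or raises
    # TimeoutError when no reply arrives in time.
    send_challenge: Callable[[str], str]
    max_missed: int = 3  # assumed threshold before declaring failure
    missed: int = 0

    def poll(self) -> str:
        """Send one challenge; return 'healthy', 'degraded' or 'failed'."""
        try:
            reply = json.loads(self.send_challenge(PING))
            # "health" is an assumed extension an application might attach.
            health = reply["return"].get("health", "ok")
        except (TimeoutError, ValueError, KeyError, TypeError, AttributeError):
            # No reply or a malformed one: hung QEMU, hung guest OS,
            # unscheduled agent, or broken guest I/O all look the same here.
            self.missed += 1
            return "failed" if self.missed >= self.max_missed else "degraded"
        self.missed = 0
        return "healthy" if health == "ok" else "degraded"
```

A caller would invoke `poll()` on a timer and forward any state change to whichever alerting backend is configured. Counting consecutive misses rather than reacting to a single one is a design choice made here to tolerate a briefly busy guest.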
Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Waines, Greg wrote:
> Sam,
>
> Two other higher-level points I wanted to discuss with you about Masakari.
>
> First, I notice that you are doing monitoring, auto-recovery and even host maintenance type functionality as part of the Masakari architecture. Are you open to some configurability (enabling/disabling) of these capabilities?

I can't speak for Sampath or the Masakari developers, but the monitors are standalone components. Currently they can only send notifications in a format which the masakari-api service can understand, but I guess it wouldn't be hard to extend them to send notifications in other formats if that made sense.

> e.g. OPNFV guys would NOT want auto-recovery; they would prefer that fault events get reported to Vitrage ... and eventually filter up to Aodh Alarms that get received by VNF Managers, which would be responsible for the recovery.
>
> e.g. some deployers of OpenStack might want to disable parts or all of your monitoring, if using other mechanisms such as Zabbix or Nagios for the host monitoring (say)

Yes, exactly! This kind of configurability and flexibility, which would allow each cloud architect to choose which monitoring / alerting / recovery components suit their requirements best in a "mix'n'match" fashion, is exactly what we are aiming for with our modular approach to the design of compute plane HA. If the various monitoring components adopt a driver-based approach to alerting and/or the ability to alert via a lowest common denominator format such as simple HTTP POST of JSON blobs, then it should be possible for each cloud deployer to integrate the monitors with whichever reporting dashboards / recovery workflow controllers best satisfy their requirements.

> Second, are you open to configurably having fault events reported to Vitrage?

Again, I can't speak on behalf of the Masakari project, but this sounds like a great idea to me :)
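To make the driver-based alerting idea in this message concrete, here is a minimal sketch of a driver registry with one "lowest common denominator" driver that builds an HTTP POST carrying a JSON blob. Everything named here (`DRIVERS`, `register`, `http_json`, `build_alert`, the event fields and the endpoint) is hypothetical and is not taken from masakari-monitors or any other project; it only illustrates the shape such a mechanism might take.

```python
import json
import urllib.request
from typing import Callable, Dict

# Registry mapping a driver name to a function that builds the outgoing
# request. Further (hypothetical) drivers would register the same way.
DRIVERS: Dict[str, Callable[[str, dict], urllib.request.Request]] = {}


def register(name: str):
    """Decorator that adds an alert driver to the registry under `name`."""
    def wrap(fn):
        DRIVERS[name] = fn
        return fn
    return wrap


@register("http_json")
def http_json(endpoint: str, event: dict) -> urllib.request.Request:
    """Lowest-common-denominator driver: HTTP POST of a JSON blob."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def build_alert(driver: str, endpoint: str, event: dict) -> urllib.request.Request:
    """Look up the configured driver and build (but do not send) the alert."""
    return DRIVERS[driver](endpoint, event)
```

A deployer would pick the driver name in configuration and send the result with `urllib.request.urlopen(request, timeout=5)`. Building the request separately from sending it keeps each driver easy to test without a network.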
Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Waines, Greg wrote:
> thanks for the pointers Sam.
>
> I took a quick look. I agree that the VM Heartbeat / Health-check looks like a good fit into Masakari. Currently your instance monitoring looks like it is strictly black-box type monitoring thru libvirt events. Is that correct? i.e. you do not do any intrusive type monitoring of the instance thru the QEMU Guest Agent facility, correct?

That is correct:

https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py

> I think this is what VM Heartbeat / Health-check would add to Masakari. Let me know if you agree.

OK, so you are looking for something slightly different I guess, based on this QEMU guest agent?

https://wiki.libvirt.org/page/Qemu_guest_agent

That would require the agent to be installed in the images, which is extra work but I imagine quite easily justifiable in some scenarios. What failure modes do you have in mind for covering with this approach - things like the guest kernel freezing, for instance?
Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring
Afek, Ifat (Nokia - IL/Kfar Sava) wrote:
> On 16/05/2017, 4:36, "Sam P" wrote:
> > Hi Greg,
> >
> > In Masakari [0] for VMHA, we have already implemented a somewhat similar function in masakari-monitors. Masakari-monitors runs on the nova-compute node and monitors host, process and instance failures. The Masakari instance monitor has similar functionality to what you have described. Please see [1] for more details on instance monitoring.
> >
> > [0] https://wiki.openstack.org/wiki/Masakari
> > [1] https://github.com/openstack/masakari-monitors/tree/master/masakarimonitors/instancemonitor
> >
> > Once masakari-monitors detects failures, it will send notifications to masakari-api to take appropriate recovery actions to recover that VM from failures.

You can also find out more about our architectural plans by watching this talk which Sampath and I gave in Boston:

https://www.openstack.org/videos/boston-2017/high-availability-for-instances-moving-to-a-converged-upstream-solution

The slides are here:

https://aspiers.github.io/openstack-summit-2017-boston-compute-ha/

We didn't go into much depth on monitoring and recovery of individual VMs, but as Sampath explained, Masakari already handles both of these.

> Hi Greg, Sam,
>
> As Vitrage is about correlating alarms that come from different sources, and is not a monitor by itself, I think that it can benefit from information retrieved by both Masakari and Zabbix monitors.
>
> Zabbix is already integrated into Vitrage. I don’t know if there are specific tests for VM heartbeat, but I think it is very likely that there are. Regarding Masakari: looking at your documents, I believe that integrating your monitoring information into Vitrage could be quite straightforward.

Yes, this makes sense. Masakari already cleanly decouples monitoring/alerting from automated recovery, so it could support this quite nicely. And the modular converged architecture we explained in the presentation will maintain that clean separation of responsibilities whilst integrating Masakari together with other components such as Pacemaker, Mistral, and maybe Vitrage too.

For example, whilst so far this thread has been about VM instance monitoring, another area where Vitrage could integrate with Masakari is compute host monitoring. If you watch this part of our presentation where we explained the next generation architecture, you'll see that we propose a new "nova-host-alerter" component which has a driver-based mechanism for alerting different services when a compute host experiences a failure:

https://youtu.be/YPKE1guti8E?t=32m43s

So one obvious possibility would be to add a driver for Vitrage, so that Vitrage can be alerted when Pacemaker spots a host failure. Similarly, we could extend Pacemaker configurations to alert Vitrage when individual processes such as nova-compute or libvirtd fail.

If you would like to discuss any of this further or have any more questions, in addition to this mailing list we are also available to talk on the #openstack-ha IRC channel!

Cheers,
Adam

P.S. I've added the [HA] badge to this thread since this discussion is definitely related to high availability.
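As a rough illustration of the point quoted above, that alarms from different sources (for example Masakari and Zabbix monitors) can be correlated, the following sketch groups alarms by resource and reports the resources flagged by more than one source. The `Alarm` tuple shape, the `correlate` function, and the sample names are all assumptions made for this example; this is not how Vitrage itself models or evaluates alarms.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical alarm shape: (source, resource_id, alarm_name).
Alarm = Tuple[str, str, str]


def correlate(alarms: Iterable[Alarm]) -> Dict[str, List[str]]:
    """Return resources reported by more than one source.

    The result maps each such resource ID to the sorted list of sources
    that raised an alarm against it.
    """
    sources: Dict[str, Set[str]] = defaultdict(set)
    for source, resource_id, _name in alarms:
        sources[resource_id].add(source)
    return {rid: sorted(found) for rid, found in sources.items() if len(found) > 1}
```

For instance, an instance-down alarm from one monitor and a heartbeat-lost alarm from another on the same VM would surface that VM, while a VM flagged by only one source would be left out.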