Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
Tim,

Regarding this discussion, there is now at least a plan in Heat to allow management of VMs not launched by that service: https://blueprints.launchpad.net/heat/+spec/adopt-stack

So hopefully, in the future, HARestarter will be able to support medium availability for all types of instances.

--
Best regards,
Oleg Gelbukh
Mirantis Labs

On Wed, Oct 9, 2013 at 3:28 PM, Tim Bell tim.b...@cern.ch wrote:

Would the HARestarter approach work for VMs which were not launched by Heat? We expect to have some applications driven by Heat but lots of others would not be (especially the more 'pet'-like traditional workloads).

Tim

From: Oleg Gelbukh [mailto:ogelb...@mirantis.com]
Sent: 09 October 2013 13:01
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

Hello,

We have much interest in this discussion (with a focus on the second scenario outlined by Tim), and are working on its design at the moment. Thanks to everyone for the valuable insights in this thread.

It looks like the external orchestration daemon problem is already partially solved by Heat with the HARestarter resource [1].

Hypervisor failure detection is also a more or less solved problem in Nova [2]. There are other candidates for that task as well, like Ceilometer's hardware agent [3] (still WIP to my knowledge).

[1] https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
[2] http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
[3] https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices

--
Best regards,
Oleg Gelbukh
Mirantis Labs

On Wed, Oct 9, 2013 at 9:26 AM, Tim Bell tim.b...@cern.ch wrote:

I have proposed the summit design session for Hong Kong (http://summit.openstack.org/cfp/details/103) to discuss exactly these sorts of points. We have the low-level Nova commands but need a service to automate the process.
Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
There are also times when I know a hypervisor needs to be failed even if Nova has not detected it. Typical examples would be an intervention on a network cable or the retirement of a rack.

The problem of VM zombies does need to be addressed too. It is not simple to solve.

Thus, I feel a shared effort in this area is needed rather than each deployment having its own scripts...

Tim

From: Alex Glikson [mailto:glik...@il.ibm.com]
Sent: 09 October 2013 14:00
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

> Hypervisor failure detection is also a more or less solved problem in Nova [2]. There are other candidates for that task as well, like Ceilometer's hardware agent [3] (still WIP to my knowledge).

The problem is that in some cases you want to be *really* sure that the hypervisor is down before running 'evacuate' (otherwise it could lead to an application crash). And you want to do it at scale. So, polling and traditional monitoring might not be good enough for a fully automated service (e.g., you may need to do 'fencing' to ensure that the node will not suddenly come back with all the VMs still running).

Regards,
Alex

From: Oleg Gelbukh ogelb...@mirantis.com
To: OpenStack Development Mailing List openstack-dev@lists.openstack.org
Date: 09/10/2013 02:09 PM
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

Hello,

We have much interest in this discussion (with a focus on the second scenario outlined by Tim), and are working on its design at the moment. Thanks to everyone for the valuable insights in this thread.

It looks like the external orchestration daemon problem is already partially solved by Heat with the HARestarter resource [1].

Hypervisor failure detection is also a more or less solved problem in Nova [2]. There are other candidates for that task as well, like Ceilometer's hardware agent [3] (still WIP to my knowledge).
[1] https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
[2] http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
[3] https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices

--
Best regards,
Oleg Gelbukh
Mirantis Labs
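The two concerns raised in this message (an operator may know a host must be failed before Nova notices, and evacuation should only start once the node is reliably fenced) could be combined roughly as follows. This is only a sketch under stated assumptions: `handle_suspect_host`, the `fence` and `evacuate` callables, and the returned status strings are hypothetical names for illustration, not part of Nova or Heat.

```python
def handle_suspect_host(host, looks_down, fence, evacuate, operator_override=False):
    """Evacuate a host only after it has been fenced successfully.

    host              -- host name (any hashable identifier)
    looks_down        -- True if monitoring believes the host has failed
    fence             -- hypothetical callable(host) -> bool; True means the
                         node is guaranteed not to come back with VMs running
    evacuate          -- hypothetical callable(host) that starts the evacuation
    operator_override -- True if an operator declares the host failed
                         (e.g. rack retirement), regardless of monitoring
    """
    if not (looks_down or operator_override):
        # Nothing suggests a failure, so leave the host alone.
        return "ignored"
    if not fence(host):
        # Without a confirmed fence, evacuating risks duplicate ("zombie") VMs.
        return "fencing-failed"
    evacuate(host)
    return "evacuated"
```

The order of the checks is the point of the sketch: detection or an operator decision only nominates a host, and fencing is what actually authorizes the evacuation.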
Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
Hi Folks,

I am also very much curious about this. Earlier this bp had a dependency on query scheduler, which is now merged. It will be very helpful if anyone can throw some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani

On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen chris.frie...@windriver.com wrote:

I'm interested in automatically evacuating instances in the case of a failed compute node. I found the following blueprint that covers exactly this case:

https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

However, the comments there seem to indicate that the code that orchestrates the evacuation shouldn't go into nova (referencing the Havana design summit).

Why wouldn't this type of behaviour belong in nova? (Is there a summary of discussions at the summit?) Is there a recommended place where this sort of thing should go?

Thanks,
Chris

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
Seems that this can be broken into 3 incremental pieces.

First, it would be great if the ability to schedule a single 'evacuate' would be finally merged (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).

Then, it would make sense to have the logic that evacuates an entire host (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host). The reasoning behind suggesting that this should not necessarily be in Nova is, perhaps, that it *can* be implemented outside Nova using the individual 'evacuate' API.

Finally, it should be possible to close the loop and invoke the evacuation automatically as a result of a failure detection (not clear how exactly this would work, though).

Hopefully we will have at least the first part merged soon (not sure if anyone is actively working on a rebase).

Regards,
Alex

From: Syed Armani dce3...@gmail.com
To: OpenStack Development Mailing List openstack-dev@lists.openstack.org
Date: 09/10/2013 12:04 AM
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

Hi Folks,

I am also very much curious about this. Earlier this bp had a dependency on query scheduler, which is now merged. It will be very helpful if anyone can throw some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani

On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen chris.frie...@windriver.com wrote:

I'm interested in automatically evacuating instances in the case of a failed compute node. I found the following blueprint that covers exactly this case:

https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

However, the comments there seem to indicate that the code that orchestrates the evacuation shouldn't go into nova (referencing the Havana design summit).

Why wouldn't this type of behaviour belong in nova? (Is there a summary of discussions at the summit?) Is there a recommended place where this sort of thing should go?
Thanks,
Chris
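The second piece described in Alex's message (evacuating an entire host outside Nova, built only on a per-instance 'evacuate' call) might look something like the sketch below. It assumes a caller-supplied `evacuate_one` callable that raises an exception on failure; that callable, the function name `evacuate_host`, and the retry count are illustrative assumptions rather than an existing API.

```python
def evacuate_host(instances, evacuate_one, max_attempts=2):
    """Evacuate every instance of a failed host, one call per instance.

    instances    -- iterable of instance identifiers on the failed host
    evacuate_one -- hypothetical callable(instance); raises on failure
    max_attempts -- how many times to try each instance before giving up

    Returns (moved, failed), where failed is a list of
    (instance, error message) pairs left for an operator or policy to handle.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    moved, failed = [], []
    for inst in instances:
        last_error = None
        for _ in range(max_attempts):
            try:
                evacuate_one(inst)
            except Exception as exc:  # keep going: one bad VM must not block the rest
                last_error = exc
            else:
                moved.append(inst)
                break
        else:
            # Every attempt raised, so record the instance as failed.
            failed.append((inst, str(last_error)))
    return moved, failed
```

Returning the failures instead of raising keeps the decision about what to do next (retry later, alert an admin, leave the VM down) outside the loop.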
Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
I was working on https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance but hadn't looked at it because of the feature freeze (FF). I will restore the patch soon.

Thanks
Mahesh
Developer | ThoughtWorks | +1 (210) 716 1767

On Tue, Oct 8, 2013 at 5:20 PM, Alex Glikson glik...@il.ibm.com wrote:

Seems that this can be broken into 3 incremental pieces.

First, it would be great if the ability to schedule a single 'evacuate' would be finally merged (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).

Then, it would make sense to have the logic that evacuates an entire host (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host). The reasoning behind suggesting that this should not necessarily be in Nova is, perhaps, that it *can* be implemented outside Nova using the individual 'evacuate' API.

Finally, it should be possible to close the loop and invoke the evacuation automatically as a result of a failure detection (not clear how exactly this would work, though).

Hopefully we will have at least the first part merged soon (not sure if anyone is actively working on a rebase).

Regards,
Alex

From: Syed Armani dce3...@gmail.com
To: OpenStack Development Mailing List openstack-dev@lists.openstack.org
Date: 09/10/2013 12:04 AM
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

Hi Folks,

I am also very much curious about this. Earlier this bp had a dependency on query scheduler, which is now merged. It will be very helpful if anyone can throw some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani

On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen chris.frie...@windriver.com wrote:

I'm interested in automatically evacuating instances in the case of a failed compute node.
I found the following blueprint that covers exactly this case:

https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

However, the comments there seem to indicate that the code that orchestrates the evacuation shouldn't go into nova (referencing the Havana design summit).

Why wouldn't this type of behaviour belong in nova? (Is there a summary of discussions at the summit?) Is there a recommended place where this sort of thing should go?

Thanks,
Chris
Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
On 10/08/2013 03:20 PM, Alex Glikson wrote:

> Seems that this can be broken into 3 incremental pieces. First, it would be great if the ability to schedule a single 'evacuate' would be finally merged (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).

Agreed.

> Then, it would make sense to have the logic that evacuates an entire host (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host). The reasoning behind suggesting that this should not necessarily be in Nova is, perhaps, that it *can* be implemented outside Nova using the individual 'evacuate' API.

This actually more or less exists already in the nova host-evacuate command. One major issue with it, however, is that it requires the caller to specify whether all the instances are on shared or local storage, so it can't handle a mix of local and shared storage for the instances. If any of them boot off block storage, for instance, you need to move them first and then do the remaining ones as a group.

It would be nice to embed the knowledge of whether or not an instance is on shared storage in the instance itself at creation time. I envision specifying this in the config file for the compute manager along with the instance storage location, and the compute manager could set the field in the instance at creation time.

> Finally, it should be possible to close the loop and invoke the evacuation automatically as a result of a failure detection (not clear how exactly this would work, though). Hopefully we will have at least the first part merged soon (not sure if anyone is actively working on a rebase).

My interpretation of the discussion so far is that the nova maintainers would prefer this to be driven by an outside orchestration daemon.

Currently the only way a service is recognized to be down is if someone calls is_up() and it notices that the service hasn't sent an update in the last minute.
There's nothing in nova actively scanning for compute node failures, which is where the outside daemon comes in.

Also, there is some complexity involved in dealing with auto-evacuate: What do you do if an evacuate fails? How do you recover intelligently if there is no admin involved?

Chris
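The passive liveness rule Chris describes (a service counts as down once it has not reported within the last minute) and the active scan that an outside daemon would add can be sketched as below. This is not Nova's code: the 60-second window simply follows the "last minute" in the message above, and `find_down_hosts` with its `heartbeats` mapping is a hypothetical helper for illustration.

```python
from datetime import datetime, timedelta

# Assumption: 60 seconds, following the "last minute" mentioned in the thread.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Return True if the last report falls within the allowed window."""
    return abs(now - last_heartbeat) <= down_time


def find_down_hosts(heartbeats, now, down_time=SERVICE_DOWN_TIME):
    """Return the sorted names of hosts whose last report is too old.

    heartbeats -- hypothetical mapping of host name -> datetime of last report;
                  a real daemon would have to read this from Nova's service records.
    """
    return sorted(
        host for host, seen in heartbeats.items()
        if not is_up(seen, now, down_time)
    )


if __name__ == "__main__":
    now = datetime(2013, 10, 9, 12, 0, 0)
    beats = {
        "compute-1": now - timedelta(seconds=10),
        "compute-2": now - timedelta(seconds=90),
    }
    print(find_down_hosts(beats, now))
```

A daemon would call something like `find_down_hosts` on a timer; as the thread notes, a stale report alone may not be enough evidence to start an evacuation.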
Re: [openstack-dev] [nova] automatically evacuate instances on compute failure
I have proposed the summit design session for Hong Kong (http://summit.openstack.org/cfp/details/103) to discuss exactly these sorts of points. We have the low-level Nova commands but need a service to automate the process.

I see two scenarios:

- A hardware intervention needs to be scheduled, please rebalance this workload elsewhere before it fails completely
- A hypervisor has failed, please recover what you can using shared storage and give me a policy on what to do with the other VMs (restart, leave down till repair etc.)

Most OpenStack production sites have some sort of script doing this sort of thing now. However, each one implements the migration logic differently, so there is no agreed best practice approach.

Tim

-----Original Message-----
From: Chris Friesen [mailto:chris.frie...@windriver.com]
Sent: 09 October 2013 00:48
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

On 10/08/2013 03:20 PM, Alex Glikson wrote:

> Seems that this can be broken into 3 incremental pieces. First, it would be great if the ability to schedule a single 'evacuate' would be finally merged (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).

Agreed.

> Then, it would make sense to have the logic that evacuates an entire host (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host). The reasoning behind suggesting that this should not necessarily be in Nova is, perhaps, that it *can* be implemented outside Nova using the individual 'evacuate' API.

This actually more or less exists already in the nova host-evacuate command. One major issue with it, however, is that it requires the caller to specify whether all the instances are on shared or local storage, so it can't handle a mix of local and shared storage for the instances. If any of them boot off block storage, for instance, you need to move them first and then do the remaining ones as a group.
It would be nice to embed the knowledge of whether or not an instance is on shared storage in the instance itself at creation time. I envision specifying this in the config file for the compute manager along with the instance storage location, and the compute manager could set the field in the instance at creation time.

> Finally, it should be possible to close the loop and invoke the evacuation automatically as a result of a failure detection (not clear how exactly this would work, though). Hopefully we will have at least the first part merged soon (not sure if anyone is actively working on a rebase).

My interpretation of the discussion so far is that the nova maintainers would prefer this to be driven by an outside orchestration daemon.

Currently the only way a service is recognized to be down is if someone calls is_up() and it notices that the service hasn't sent an update in the last minute. There's nothing in nova actively scanning for compute node failures, which is where the outside daemon comes in.

Also, there is some complexity involved in dealing with auto-evacuate: What do you do if an evacuate fails? How do you recover intelligently if there is no admin involved?

Chris
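Tim's two scenarios, together with Chris's suggestion of recording per instance whether it lives on shared storage, suggest a small per-instance planning step. The sketch below is one possible shape for it, not an agreed design: the `Instance` tuple, the action names, and `plan_actions` are all hypothetical, chosen only to illustrate how a mixed shared/local host could be handled in one pass.

```python
from collections import namedtuple

# Hypothetical record; on_shared_storage is the per-instance flag proposed above.
Instance = namedtuple("Instance", ["name", "on_shared_storage"])

# Hypothetical action names, for illustration only.
LIVE_MIGRATE = "live-migrate"
EVACUATE_SHARED = "evacuate-on-shared-storage"
REBUILD = "rebuild-elsewhere"
LEAVE_DOWN = "leave-down-until-repair"


def plan_actions(instances, host_failed, local_policy=LEAVE_DOWN):
    """Map each instance name to an action.

    host_failed=False models the scheduled intervention: the host still runs,
    so everything is moved away before the work starts.
    host_failed=True models the failed hypervisor: shared-storage instances are
    recovered, and the rest follow the operator's local_policy.
    """
    if local_policy not in (REBUILD, LEAVE_DOWN):
        raise ValueError("unknown policy: %r" % (local_policy,))

    plan = {}
    for inst in instances:
        if not host_failed:
            plan[inst.name] = LIVE_MIGRATE
        elif inst.on_shared_storage:
            plan[inst.name] = EVACUATE_SHARED
        else:
            plan[inst.name] = local_policy
    return plan
```

Deciding per instance rather than per host is what removes the need for the caller to declare the whole host as shared or local up front.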
[openstack-dev] [nova] automatically evacuate instances on compute failure
I'm interested in automatically evacuating instances in the case of a failed compute node. I found the following blueprint that covers exactly this case:

https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

However, the comments there seem to indicate that the code that orchestrates the evacuation shouldn't go into nova (referencing the Havana design summit).

Why wouldn't this type of behaviour belong in nova? (Is there a summary of discussions at the summit?) Is there a recommended place where this sort of thing should go?

Thanks,
Chris