Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-16 Thread Oleg Gelbukh
Tim,

Regarding this discussion, there is now at least a plan in Heat to allow
management of VMs not launched by that service:
https://blueprints.launchpad.net/heat/+spec/adopt-stack

So hopefully in the future HARestarter will be able to support medium
availability for all types of instances.

--
Best regards,
Oleg Gelbukh
Mirantis Labs


On Wed, Oct 9, 2013 at 3:28 PM, Tim Bell  wrote:

> Would the HARestarter approach work for VMs which were not launched by
> Heat ?
>
> We expect to have some applications driven by Heat but lots of others
> would not be (especially the more 'pet'-like traditional workloads).
>
> Tim

Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Tim Bell

There are also times when I know a hypervisor needs to be failed even if Nova 
has not detected it. Typical examples would be an intervention on a network 
cable or retirement of a rack.

The problem of zombie VMs does need to be addressed too, and it is not simple to solve.

Thus, I feel a shared effort in this area is needed rather than each deployment 
having its own scripts...

Tim


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Oleg Gelbukh
Alex,

You are absolutely right. We need multiple confirmations that the host has
actually failed at the physical level and must be evacuated before we proceed
with auto-evacuate. Fencing is also a requirement for this kind of action.
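
The "multiple confirmations" idea can be sketched as a simple quorum over
independent probes. This is only an illustration under stated assumptions:
the probe names and the `host_confirmed_down` helper are hypothetical and do
not correspond to any existing Nova or Heat API.

```python
from typing import Callable, Dict

# A probe takes a host name and returns True when it believes the host is down.
# The probe names used in the demo below are hypothetical placeholders.
Probe = Callable[[str], bool]


def host_confirmed_down(host: str, probes: Dict[str, Probe], required: int) -> bool:
    """Return True only if at least `required` independent probes report the host down."""
    votes = 0
    for probe in probes.values():
        try:
            if probe(host):
                votes += 1
        except Exception:
            # A probe that errors out gives no opinion; it never counts as a confirmation.
            continue
    return votes >= required


if __name__ == "__main__":
    probes = {
        "service_heartbeat": lambda host: True,  # stand-in: heartbeat is stale
        "ping": lambda host: True,               # stand-in: host does not answer
        "ipmi_power": lambda host: False,        # stand-in: BMC reports power on
    }
    print(host_confirmed_down("compute-01", probes, required=2))
    print(host_confirmed_down("compute-01", probes, required=3))
```

Treating a failing probe as "no opinion" is a deliberate choice here: a broken
monitoring path should make auto-evacuation less likely, not more.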

This etherpad [1] has some great notes and suggestions on the topic. I hope
they will be incorporated into Tim's session, or any other session on this
topic.

[1] https://etherpad.openstack.org/openstack-instance-high-availability

--
Best regards,
Oleg Gelbukh
Mirantis Labs



Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Oleg Gelbukh
Tim,

Right now it won't, and that is the problem we are trying to solve: combine
HARestarter with an additional script/service so that one doesn't interfere
with the other.
This also exposes a design gap, as we are effectively going to have two
services executing the same task. This gap is something we want to address
eventually.

--
Best regards,
Oleg Gelbukh
Mirantis Inc.


On Wed, Oct 9, 2013 at 3:28 PM, Tim Bell  wrote:

> Would the HARestarter approach work for VMs which were not launched by
> Heat ?
>
> We expect to have some applications driven by Heat but lots of others
> would not be (especially the more 'pet'-like traditional workloads).
>
> Tim

Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Alex Glikson
> Hypervisor failure detection is also more or less solved problem in Nova
> [2]. There are other candidates for that task as well, like Ceilometer's
> hardware agent [3] (still WIP to my knowledge).

The problem is that in some cases you want to be *really* sure that the
hypervisor is down before running 'evacuate' (otherwise it could lead to
an application crash). And you want to do it at scale. So, polling and
traditional monitoring might not be good enough for a fully-automated
service (e.g., you may need to do 'fencing' to ensure that the node will
not suddenly come back with all the VMs still running).
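
The ordering constraint described above (fence first, evacuate only
afterwards) could look roughly like the sketch below. The `fence` and
`evacuate` callables are hypothetical stand-ins for whatever power-control
and evacuation mechanisms a deployment uses; nothing here is an existing
Nova interface.

```python
from typing import Callable, Iterable, List


class FencingError(RuntimeError):
    """Raised when the failed host could not be fenced."""


def evacuate_after_fencing(
    host: str,
    instances: Iterable[str],
    fence: Callable[[str], bool],
    evacuate: Callable[[str], None],
) -> List[str]:
    """Evacuate instances only after the host has been fenced successfully."""
    # Fence first: if the node cannot be isolated, it might come back with the
    # VMs still running, so no instance is touched in that case.
    if not fence(host):
        raise FencingError(f"could not fence {host}; refusing to evacuate")
    evacuated = []
    for instance in instances:
        evacuate(instance)
        evacuated.append(instance)
    return evacuated
```

Raising instead of silently skipping keeps the decision visible to an
operator when fencing fails.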

Regards,
Alex





Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Tim Bell
Would the HARestarter approach work for VMs which were not launched by Heat?

We expect to have some applications driven by Heat but lots of others would not 
be (especially the more 'pet'-like traditional workloads).

Tim


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Oleg Gelbukh
Hello,

We are very interested in this discussion (with a focus on the second
scenario outlined by Tim) and are working on its design at the moment.
Thanks to everyone for the valuable insights in this thread.

It looks like the external orchestration daemon problem is already partially
solved by Heat with the HARestarter resource [1].

Hypervisor failure detection is also a more or less solved problem in Nova
[2]. There are other candidates for that task as well, like Ceilometer's
hardware agent [3] (still WIP to my knowledge).

[1]
https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
[2]
http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
[3]
https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices
--
Best regards,
Oleg Gelbukh
Mirantis Labs


On Wed, Oct 9, 2013 at 9:26 AM, Tim Bell  wrote:

> I have proposed the summit design session for Hong Kong (
> http://summit.openstack.org/cfp/details/103) to discuss exactly these
> sort of points. We have the low level Nova commands but need a service to
> automate the process.
>
> I see two scenarios
>
> - A hardware intervention needs to be scheduled, please rebalance this
> workload elsewhere before it fails completely
> - A hypervisor has failed, please recover what you can using shared
> storage and give me a policy on what to do with the other VMs (restart,
> leave down till repair etc.)
>
> Most OpenStack production sites have some sort of script doing this sort
> of thing now. However, each one will be implementing the logic for
> migration differently so there is no agreed best practise approach.
>
> Tim
>
> > -----Original Message-----
> > From: Chris Friesen [mailto:chris.frie...@windriver.com]
> > Sent: 09 October 2013 00:48
> > To: openstack-dev@lists.openstack.org
> > Subject: Re: [openstack-dev] [nova] automatically evacuate instances on
> > compute failure
> >
> > On 10/08/2013 03:20 PM, Alex Glikson wrote:
> > > Seems that this can be broken into 3 incremental pieces. First, would
> > > be great if the ability to schedule a single 'evacuate' would be
> > > finally merged
> > > (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
> >
> > Agreed.
> >
> > > Then, it would make sense to have the logic that evacuates an entire
> > > host
> > > (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> > > The reasoning behind suggesting that this should not necessarily be in
> > > Nova is, perhaps, that it *can* be implemented outside Nova using the
> > > individual 'evacuate' API.
> >
> > This actually more-or-less exists already in the existing "nova
> > host-evacuate" command.  One major issue with this however is that it
> > requires the caller to specify whether all the instances are on shared
> > or local storage, and so it can't handle a mix of local and shared
> > storage for the instances.   If any of them boot off block storage for
> > instance you need to move them first and then do the remaining ones as a
> > group.
> >
> > It would be nice to embed the knowledge of whether or not an instance is
> > on shared storage in the instance itself at creation time.  I
> > envision specifying this in the config file for the compute manager
> > along with the instance storage location, and the compute manager
> > could set the field in the instance at creation time.
> >
> > > Finally, it should be possible to close the loop and invoke the
> > > evacuation automatically as a result of a failure detection (not clear
> > > how exactly this would work, though). Hopefully we will have at least
> > > the first part merged soon (not sure if anyone is actively working on
> > > a rebase).
> >
> > My interpretation of the discussion so far is that the nova maintainers
> > would prefer this to be driven by an outside orchestration daemon.
> >
> > Currently the only way a service is recognized to be "down" is if
> > someone calls is_up() and it notices that the service hasn't sent an update
> > in the last minute.  There's nothing in nova actively scanning for
> > compute node failures, which is where the outside daemon comes in.
> >
> > Also, there is some complexity involved in dealing with auto-evacuate:
> > What do you do if an evacuate fails?  How do you recover intelligently
> > if there is no admin involved?
> >
> > Chris
> >
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Tim Bell
I have proposed the summit design session for Hong Kong 
(http://summit.openstack.org/cfp/details/103) to discuss exactly these sorts of 
points. We have the low-level Nova commands but need a service to automate the 
process.

I see two scenarios

- A hardware intervention needs to be scheduled, please rebalance this workload 
elsewhere before it fails completely
- A hypervisor has failed, please recover what you can using shared storage and 
give me a policy on what to do with the other VMs (restart, leave down till 
repair etc.)
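
The two scenarios above can be read as a small decision table. The following is only a sketch under stated assumptions: the Action names, the Instance fields and plan_recovery() are all hypothetical, not part of Nova or of any existing site script. It shows how a per-instance policy could cover the VMs that are not on shared storage.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"        # host still up: move the workload away
    EVACUATE = "evacuate"      # host down, disk on shared storage
    RESTART = "restart"        # host down, local disk: rebuild elsewhere
    LEAVE_DOWN = "leave_down"  # host down, local disk: wait for repair


@dataclass
class Instance:
    name: str
    on_shared_storage: bool
    # Hypothetical per-instance policy, used only when the local disk
    # is unreachable because the hypervisor has failed.
    local_disk_policy: Action = Action.LEAVE_DOWN


def plan_recovery(instances, host_failed):
    """Return {instance name: Action} for the instances of one hypervisor."""
    plan = {}
    for inst in instances:
        if not host_failed:
            # Scenario 1: scheduled intervention, host is still running.
            plan[inst.name] = Action.MIGRATE
        elif inst.on_shared_storage:
            # Scenario 2: failed host, but the disk survives elsewhere.
            plan[inst.name] = Action.EVACUATE
        else:
            # Scenario 2: failed host, local disk; fall back to the policy.
            plan[inst.name] = inst.local_disk_policy
    return plan
```

Keeping the policy on the instance rather than on the host is one possible design choice; a site could just as well keep it per tenant or per flavor.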

Most OpenStack production sites have some sort of script doing this sort of 
thing now. However, each one will be implementing the logic for migration 
differently so there is no agreed best practise approach.

Tim

> -----Original Message-----
> From: Chris Friesen [mailto:chris.frie...@windriver.com]
> Sent: 09 October 2013 00:48
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [nova] automatically evacuate instances on 
> compute failure
> 
> On 10/08/2013 03:20 PM, Alex Glikson wrote:
> > Seems that this can be broken into 3 incremental pieces. First, would
> > be great if the ability to schedule a single 'evacuate' would be
> > finally merged
> > (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
> 
> Agreed.
> 
> > Then, it would make sense to have the logic that evacuates an entire
> > host
> > (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> > The reasoning behind suggesting that this should not necessarily be in
> > Nova is, perhaps, that it *can* be implemented outside Nova using the
> > individual 'evacuate' API.
> 
> This actually more-or-less exists already in the existing "nova 
> host-evacuate" command.  One major issue with this however is that it
> requires the caller to specify whether all the instances are on shared or 
> local storage, and so it can't handle a mix of local and shared
> storage for the instances.   If any of them boot off block storage for
> instance you need to move them first and then do the remaining ones as a 
> group.
> 
> It would be nice to embed the knowledge of whether or not an instance is on 
> shared storage in the instance itself at creation time.  I
> envision specifying this in the config file for the compute manager along 
> with the instance storage location, and the compute manager
> could set the field in the instance at creation time.
> 
> > Finally, it should be possible to close the loop and invoke the
> > evacuation automatically as a result of a failure detection (not clear
> > how exactly this would work, though). Hopefully we will have at least
> > the first part merged soon (not sure if anyone is actively working on
> > a rebase).
> 
> My interpretation of the discussion so far is that the nova maintainers would 
> prefer this to be driven by an outside orchestration daemon.
> 
> Currently the only way a service is recognized to be "down" is if someone 
> calls is_up() and it notices that the service hasn't sent an update
> in the last minute.  There's nothing in nova actively scanning for compute 
> node failures, which is where the outside daemon comes in.
> 
> Also, there is some complexity involved in dealing with auto-evacuate:
> What do you do if an evacuate fails?  How do you recover intelligently if 
> there is no admin involved?
> 
> Chris
> 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Chris Friesen

On 10/08/2013 03:20 PM, Alex Glikson wrote:

> Seems that this can be broken into 3 incremental pieces. First, would be
> great if the ability to schedule a single 'evacuate' would be finally
> merged
> (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).


Agreed.


> Then, it would make sense to have the logic that evacuates an entire
> host
> (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> The reasoning behind suggesting that this should not necessarily be in
> Nova is, perhaps, that it *can* be implemented outside Nova using the
> individual 'evacuate' API.


This actually more-or-less exists already in the existing "nova 
host-evacuate" command.  One major issue with this however is that it 
requires the caller to specify whether all the instances are on shared 
or local storage, and so it can't handle a mix of local and shared 
storage for the instances.   If any of them boot off block storage for 
instance you need to move them first and then do the remaining ones as a 
group.


It would be nice to embed the knowledge of whether or not an instance is 
on shared storage in the instance itself at creation time.  I envision 
specifying this in the config file for the compute manager along with 
the instance storage location, and the compute manager could set the 
field in the instance at creation time.
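
To make the mixed-storage case concrete, here is a minimal sketch, assuming each instance record already carries an on_shared_storage field as proposed above. The field, group_by_storage() and evacuate_host() are hypothetical names, and the evacuate argument is a stand-in callable rather than the real client call.

```python
from collections import defaultdict


def group_by_storage(instances):
    """Split instance records into {True: [...], False: [...]} by storage type."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["on_shared_storage"]].append(inst["name"])
    return dict(groups)


def evacuate_host(instances, evacuate):
    """Evacuate every instance of a failed host, shared-storage ones first.

    `evacuate` is a caller-supplied callable taking
    (name, on_shared_storage=...); it stands in for whatever issues the
    real per-instance evacuate request.
    """
    groups = group_by_storage(instances)
    order = []
    for shared in (True, False):
        for name in groups.get(shared, []):
            evacuate(name, on_shared_storage=shared)
            order.append((name, shared))
    return order
```

With the flag stored per instance, the caller no longer has to assert one storage type for the whole host; each call passes the value that belongs to that instance.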



> Finally, it should be possible to close the
> loop and invoke the evacuation automatically as a result of a failure
> detection (not clear how exactly this would work, though). Hopefully we
> will have at least the first part merged soon (not sure if anyone is
> actively working on a rebase).


My interpretation of the discussion so far is that the nova maintainers 
would prefer this to be driven by an outside orchestration daemon.


Currently the only way a service is recognized to be "down" is if 
someone calls is_up() and it notices that the service hasn't sent an 
update in the last minute.  There's nothing in nova actively scanning 
for compute node failures, which is where the outside daemon comes in.
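
As an illustration of the check described above, and of what an outside daemon would add, here is a rough sketch. It is not the nova implementation: the one-minute threshold is taken from the description above, and is_up() and find_down_hosts() as written here are simplified, hypothetical versions working on plain timestamps.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, mirroring "hasn't sent an update in the last minute".
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_heartbeat, now=None, threshold=SERVICE_DOWN_TIME):
    """Passive check: a service is up if its last update is recent enough."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_heartbeat) <= threshold


def find_down_hosts(heartbeats, now=None):
    """Active scan an outside daemon could run periodically.

    `heartbeats` maps host name to the time of its last update.
    Returns the sorted names of hosts that fail the is_up() check.
    """
    return sorted(
        host for host, seen in heartbeats.items() if not is_up(seen, now)
    )
```

The difference is only in who calls the check: is_up() answers when asked, while find_down_hosts() is the loop that nothing in nova runs today.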


Also, there is some complexity involved in dealing with auto-evacuate: 
What do you do if an evacuate fails?  How do you recover intelligently 
if there is no admin involved?
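
One possible answer to the questions above, offered only as a sketch: try a bounded number of target hosts, and if all of them fail, stop and flag the instance for a human instead of looping forever. evacuate_with_retries(), the result dictionary and the "needs_admin" status are hypothetical, and evacuate is again a stand-in callable.

```python
def evacuate_with_retries(instance, candidate_hosts, evacuate, max_attempts=3):
    """Try to evacuate `instance` to each candidate host in turn.

    `evacuate(instance, host)` is a caller-supplied callable that raises
    on failure. At most `max_attempts` hosts are tried. The returned dict
    records the outcome so that a failed evacuation is reported rather
    than silently retried forever.
    """
    tried = []
    last_error = "no candidate hosts"
    for host in candidate_hosts[:max_attempts]:
        tried.append(host)
        try:
            evacuate(instance, host)
        except Exception as exc:  # sketch: any failure moves to the next host
            last_error = str(exc)
            continue
        return {"instance": instance, "status": "evacuated",
                "host": host, "tried": tried}
    # Nothing worked: leave the instance alone and escalate to an admin.
    return {"instance": instance, "status": "needs_admin",
            "host": None, "tried": tried, "error": last_error}
```

This does not recover intelligently on its own; it only makes sure the unattended case ends in a well-defined state that an operator can act on.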


Chris

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Mahesh K P
I was working on
https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance.
I hadn't looked at it because of the feature freeze (FF); I will restore the
patch soon.

Thanks
Mahesh
Developer | ThoughtWorks | +1 (210) 716 1767


On Tue, Oct 8, 2013 at 5:20 PM, Alex Glikson  wrote:

> Seems that this can be broken into 3 incremental pieces. First, would be
> great if the ability to schedule a single 'evacuate' would be finally
> merged
> (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
> Then, it would make sense to have the logic that evacuates an entire host
> (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
> The reasoning behind suggesting that this should not necessarily be in Nova
> is, perhaps, that it *can* be implemented outside Nova using the individual
> 'evacuate' API. Finally, it should be possible to close the loop and invoke
> the evacuation automatically as a result of a failure detection (not clear
> how exactly this would work, though). Hopefully we will have at least the
> first part merged soon (not sure if anyone is actively working on a rebase).
>
> Regards,
> Alex
>
>
>
>
> From:    Syed Armani
> To:      OpenStack Development Mailing List
>          <openstack-dev@lists.openstack.org>
> Date:    09/10/2013 12:04 AM
> Subject: Re: [openstack-dev] [nova] automatically evacuate
>          instances on compute failure
>
>
>
> Hi Folks,
>
> I am also very much curious about this. Earlier this bp had a dependency
> on query scheduler, which is now merged. It will be very helpful if anyone
> can throw some light on the fate of this bp.
>
> Thanks.
>
> Cheers,
> Syed Armani
>
>
> On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen <
> chris.frie...@windriver.com> wrote:
> I'm interested in automatically evacuating instances in the case of a
> failed compute node.  I found the following blueprint that covers exactly
> this case:
>
> https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically
>
> However, the comments there seem to indicate that the code that
> orchestrates the evacuation shouldn't go into nova (referencing the Havana
> design summit).
>
> Why wouldn't this type of behaviour belong in nova?  (Is there a summary
> of discussions at the summit?)  Is there a recommended place where this
> sort of thing should go?
>
> Thanks,
> Chris
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Alex Glikson
Seems that this can be broken into 3 incremental pieces. First, it would be 
great if the ability to schedule a single 'evacuate' would be finally 
merged 
(https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance). 
Then, it would make sense to have the logic that evacuates an entire 
host 
(https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host). 
The reasoning behind suggesting that this should not necessarily be in 
Nova is, perhaps, that it *can* be implemented outside Nova using the 
individual 'evacuate' API. Finally, it should be possible to close the loop 
and invoke the evacuation automatically as a result of a failure detection 
(not clear how exactly this would work, though). Hopefully we will have at 
least the first part merged soon (not sure if anyone is actively working 
on a rebase).

Regards,
Alex




From:    Syed Armani
To:      OpenStack Development Mailing List
Date:    09/10/2013 12:04 AM
Subject: Re: [openstack-dev] [nova] automatically evacuate
         instances on compute failure



Hi Folks,

I am also very much curious about this. Earlier this bp had a dependency 
on query scheduler, which is now merged. It will be very helpful if anyone 
can throw some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani


On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen <
chris.frie...@windriver.com> wrote:
I'm interested in automatically evacuating instances in the case of a 
failed compute node.  I found the following blueprint that covers exactly 
this case:

https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically


However, the comments there seem to indicate that the code that 
orchestrates the evacuation shouldn't go into nova (referencing the Havana 
design summit).

Why wouldn't this type of behaviour belong in nova?  (Is there a summary 
of discussions at the summit?)  Is there a recommended place where this 
sort of thing should go?

Thanks,
Chris

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Syed Armani
Hi Folks,

I am also very much curious about this. Earlier this bp had a dependency on
query scheduler, which is now merged. It will be very helpful if anyone can
throw some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani


On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen  wrote:

> I'm interested in automatically evacuating instances in the case of a
> failed compute node.  I found the following blueprint that covers exactly
> this case:
>
> https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically
>
> However, the comments there seem to indicate that the code that
> orchestrates the evacuation shouldn't go into nova (referencing the Havana
> design summit).
>
> Why wouldn't this type of behaviour belong in nova?  (Is there a summary
> of discussions at the summit?)  Is there a recommended place where this
> sort of thing should go?
>
> Thanks,
> Chris
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova] automatically evacuate instances on compute failure

2013-09-25 Thread Chris Friesen
I'm interested in automatically evacuating instances in the case of a 
failed compute node.  I found the following blueprint that covers 
exactly this case:


https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

However, the comments there seem to indicate that the code that 
orchestrates the evacuation shouldn't go into nova (referencing the 
Havana design summit).


Why wouldn't this type of behaviour belong in nova?  (Is there a summary 
of discussions at the summit?)  Is there a recommended place where this 
sort of thing should go?


Thanks,
Chris

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev