Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-16 Thread Oleg Gelbukh
Tim,

Regarding this discussion, now there is at least a plan in Heat to allow
management of VMs not launched by that service:
https://blueprints.launchpad.net/heat/+spec/adopt-stack

So hopefully, in the future, HARestarter will be able to provide medium
availability for all types of instances.

--
Best regards,
Oleg Gelbukh
Mirantis Labs


On Wed, Oct 9, 2013 at 3:28 PM, Tim Bell tim.b...@cern.ch wrote:

 Would the HARestarter approach work for VMs which were not launched by
 Heat ?

 We expect to have some applications driven by Heat but lots of others
 would not be (especially the more 'pet'-like traditional workloads).

 Tim

 From: Oleg Gelbukh [mailto:ogelb...@mirantis.com]
 Sent: 09 October 2013 13:01
 To: OpenStack Development Mailing List
 Subject: Re: [openstack-dev] [nova] automatically evacuate instances on
 compute failure

 Hello,

 We are very interested in this discussion (with a focus on the second
 scenario outlined by Tim) and are working on its design at the moment.
 Thanks to everyone for the valuable insights in this thread.

 It looks like the external orchestration daemon problem is already
 partially solved by Heat with the HARestarter resource [1].

 Hypervisor failure detection is also a more or less solved problem in Nova
 [2]. There are other candidates for that task as well, like Ceilometer's
 hardware agent [3] (still WIP to my knowledge).

 [1]
 https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
 [2]
 http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
 [3]
 https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices
 --
 Best regards,
 Oleg Gelbukh
 Mirantis Labs

 On Wed, Oct 9, 2013 at 9:26 AM, Tim Bell tim.b...@cern.ch wrote:
 I have proposed the summit design session for Hong Kong (
 http://summit.openstack.org/cfp/details/103) to discuss exactly these
 sort of points. We have the low level Nova commands but need a service to
 automate the process.

 I see two scenarios

 - A hardware intervention needs to be scheduled, please rebalance this
 workload elsewhere before it fails completely
 - A hypervisor has failed, please recover what you can using shared
 storage and give me a policy on what to do with the other VMs (restart,
 leave down till repair etc.)

 Most OpenStack production sites have some sort of script doing this sort
 of thing now. However, each one will be implementing the logic for
 migration differently so there is no agreed best practise approach.

 Tim

  -Original Message-
  From: Chris Friesen [mailto:chris.frie...@windriver.com]
  Sent: 09 October 2013 00:48
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [nova] automatically evacuate instances on
 compute failure
 
  On 10/08/2013 03:20 PM, Alex Glikson wrote:
   Seems that this can be broken into 3 incremental pieces. First, would
   be great if the ability to schedule a single 'evacuate' would be
   finally merged
   (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
 
  Agreed.
 
   Then, it would make sense to have the logic that evacuates an entire
   host
   (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
   The reasoning behind suggesting that this should not necessarily be in
   Nova is, perhaps, that it *can* be implemented outside Nova using the
   individual 'evacuate' API.
 
  This actually more-or-less exists already in the existing nova
 host-evacuate command.  One major issue with this however is that it
  requires the caller to specify whether all the instances are on shared
 or local storage, and so it can't handle a mix of local and shared
  storage for the instances.   If any of them boot off block storage for
  instance you need to move them first and then do the remaining ones as a
 group.
 
  It would be nice to embed the knowledge of whether or not an instance is
 on shared storage in the instance itself at creation time.  I
  envision specifying this in the config file for the compute manager
 along with the instance storage location, and the compute manager
  could set the field in the instance at creation time.
 
   Finally, it should be possible to close the loop and invoke the
   evacuation automatically as a result of a failure detection (not clear
   how exactly this would work, though). Hopefully we will have at least
   the first part merged soon (not sure if anyone is actively working on
   a rebase).
 
  My interpretation of the discussion so far is that the nova maintainers
 would prefer this to be driven by an outside orchestration daemon.
 
  Currently the only way a service is recognized to be down is if
 someone calls is_up() and it notices that the service hasn't sent an update
  in the last minute.  There's nothing in nova actively scanning for
 compute node failures, which is where the outside daemon comes in.
 
  Also

Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-09 Thread Tim Bell

There are also times when I know a hypervisor needs to be failed even if Nova 
has not detected it. Typical examples would be an intervention on a network 
cable or retirement of a rack.

The problem of VM Zombies does need to be addressed too. Not simple to solve.

Thus, I feel a shared effort in this area is needed rather than each deployment 
having its own scripts...

Tim

From: Alex Glikson [mailto:glik...@il.ibm.com]
Sent: 09 October 2013 14:00
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on compute 
failure

 Hypervisor failure detection is also more or less solved problem in Nova [2]. 
 There are other candidates for that task as well, like Ceilometer's hardware 
 agent [3] (still WIP to my knowledge).

The problem is that in some cases you want to be *really* sure that the 
hypervisor is down before running 'evacuate' (otherwise it could lead to an 
application crash). And you want to do it at scale. So, polling and traditional
monitoring might not be good enough for a fully-automated service (e.g., you 
may need to do 'fencing' to ensure that the node will not suddenly come back 
with all the VMs still running).
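
The ordering described above (fence first, evacuate only afterwards) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `fence` and `evacuate` are hypothetical callables standing in for whatever power-control mechanism and Nova client call a deployment uses; neither is a real Nova API.

```python
def evacuate_after_fencing(host, instances, fence, evacuate):
    """Evacuate `instances` from `host` only if fencing is confirmed.

    `fence(host)` is a hypothetical callable that must return True only
    when the node is known to be powered off or isolated.
    `evacuate(instance)` is a hypothetical callable that requests the
    evacuation of one instance.
    Returns the list of instances for which evacuation was requested.
    """
    if not fence(host):
        # The host might still be running its VMs: evacuating now could
        # leave two copies of the same instance active.
        return []
    requested = []
    for instance in instances:
        evacuate(instance)
        requested.append(instance)
    return requested
```

The point of returning an empty list when fencing is not confirmed is that the caller can tell "nothing was touched" apart from a partial evacuation.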

Regards,
Alex




From: Oleg Gelbukh ogelb...@mirantis.com
To: OpenStack Development Mailing List openstack-dev@lists.openstack.org
Date: 09/10/2013 02:09 PM
Subject: Re: [openstack-dev] [nova] automatically evacuate instances on
compute failure




Hello,

We are very interested in this discussion (with a focus on the second scenario
outlined by Tim) and are working on its design at the moment. Thanks to everyone
for the valuable insights in this thread.

It looks like the external orchestration daemon problem is already partially
solved by Heat with the HARestarter resource [1].

Hypervisor failure detection is also a more or less solved problem in Nova [2].
There are other candidates for that task as well, like Ceilometer's hardware
agent [3] (still WIP to my knowledge).

[1] 
https://github.com/openstack/heat/blob/stable/grizzly/heat/engine/resources/instance.py#L35
[2] 
http://docs.openstack.org/developer/nova/api/nova.api.openstack.compute.contrib.hypervisors.html#module-nova.api.openstack.compute.contrib.hypervisors
[3] 
https://blueprints.launchpad.net/ceilometer/+spec/monitoring-physical-devices
--
Best regards,
Oleg Gelbukh
Mirantis Labs


On Wed, Oct 9, 2013 at 9:26 AM, Tim Bell tim.b...@cern.ch wrote:
I have proposed the summit design session for Hong Kong 
(http://summit.openstack.org/cfp/details/103) to discuss exactly these sort of 
points. We have the low level Nova commands but need a service to automate the 
process.

I see two scenarios

- A hardware intervention needs to be scheduled, please rebalance this workload 
elsewhere before it fails completely
- A hypervisor has failed, please recover what you can using shared storage and 
give me a policy on what to do with the other VMs (restart, leave down till 
repair etc.)

Most OpenStack production sites have some sort of script doing this sort of 
thing now. However, each one will be implementing the logic for migration 
differently so there is no agreed best practise approach.

Tim

 -Original Message-
 From: Chris Friesen [mailto:chris.frie...@windriver.com]
 Sent: 09 October 2013 00:48
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [nova] automatically evacuate instances on 
 compute failure

 On 10/08/2013 03:20 PM, Alex Glikson wrote:
  Seems that this can be broken into 3 incremental pieces. First, would
  be great if the ability to schedule a single 'evacuate' would be
  finally merged
  (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).

 Agreed.

  Then, it would make sense to have the logic that evacuates an entire
  host
  (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
  The reasoning behind suggesting that this should not necessarily be in
  Nova is, perhaps, that it *can* be implemented outside Nova using the
  individual 'evacuate' API.

 This actually more-or-less exists already in the existing nova 
 host-evacuate command.  One major issue with this however is that it
 requires the caller to specify whether all the instances are on shared or 
 local storage, and so it can't handle a mix of local and shared
 storage for the instances.   If any of them boot off block storage for
 instance you need to move them first and then do the remaining ones as a 
 group.

 It would be nice to embed the knowledge of whether or not an instance is on 
 shared storage in the instance itself at creation time.  I
 envision specifying this in the config file for the compute manager along 
 with the instance storage location, and the compute manager

Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Syed Armani
Hi Folks,

I am also very curious about this. Earlier, this bp had a dependency on the
query scheduler, which is now merged. It would be very helpful if anyone
could shed some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani


On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen chris.frie...@windriver.com
 wrote:

 I'm interested in automatically evacuating instances in the case of a
 failed compute node.  I found the following blueprint that covers exactly
 this case:

 https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

 However, the comments there seem to indicate that the code that
 orchestrates the evacuation shouldn't go into nova (referencing the Havana
 design summit).

 Why wouldn't this type of behaviour belong in nova?  (Is there a summary
 of discussions at the summit?)  Is there a recommended place where this
 sort of thing should go?

 Thanks,
 Chris

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Alex Glikson
Seems that this can be broken into 3 incremental pieces. First, it would be
great if the ability to schedule a single 'evacuate' were finally merged
(https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
Then, it would make sense to have the logic that evacuates an entire host
(https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
The reasoning behind suggesting that this should not necessarily be in
Nova is, perhaps, that it *can* be implemented outside Nova using the
individual 'evacuate' API. Finally, it should be possible to close the loop
and invoke the evacuation automatically as a result of failure detection
(it is not clear exactly how this would work, though). Hopefully we will have
at least the first part merged soon (not sure if anyone is actively working
on a rebase).

Regards,
Alex




From: Syed Armani dce3...@gmail.com
To: OpenStack Development Mailing List openstack-dev@lists.openstack.org
Date: 09/10/2013 12:04 AM
Subject: Re: [openstack-dev] [nova] automatically evacuate
instances on compute failure



Hi Folks,

I am also very curious about this. Earlier, this bp had a dependency
on the query scheduler, which is now merged. It would be very helpful if anyone
could shed some light on the fate of this bp.

Thanks.

Cheers,
Syed Armani


On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen 
chris.frie...@windriver.com wrote:
I'm interested in automatically evacuating instances in the case of a 
failed compute node.  I found the following blueprint that covers exactly 
this case:

https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically


However, the comments there seem to indicate that the code that 
orchestrates the evacuation shouldn't go into nova (referencing the Havana 
design summit).

Why wouldn't this type of behaviour belong in nova?  (Is there a summary 
of discussions at the summit?)  Is there a recommended place where this 
sort of thing should go?

Thanks,
Chris



Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Mahesh K P
I was working on
https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance.
I hadn't looked at it because of the feature freeze (FF); I will restore the
patch soon.

Thanks
Mahesh
Developer | ThoughtWorks | +1 (210) 716 1767


On Tue, Oct 8, 2013 at 5:20 PM, Alex Glikson glik...@il.ibm.com wrote:

 Seems that this can be broken into 3 incremental pieces. First, would be
 great if the ability to schedule a single 'evacuate' would be finally
 merged
 (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
 Then, it would make sense to have the logic that evacuates an entire host
 (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
 The reasoning behind suggesting that this should not necessarily be in Nova
 is, perhaps, that it *can* be implemented outside Nova using the individual
 'evacuate' API. Finally, it should be possible to close the loop and invoke
 the evacuation automatically as a result of a failure detection (not clear
 how exactly this would work, though). Hopefully we will have at least the
 first part merged soon (not sure if anyone is actively working on a rebase).

 Regards,
 Alex




 From: Syed Armani dce3...@gmail.com
 To: OpenStack Development Mailing List openstack-dev@lists.openstack.org
 Date: 09/10/2013 12:04 AM
 Subject: Re: [openstack-dev] [nova] automatically evacuate
 instances on compute failure



 Hi Folks,

 I am also very curious about this. Earlier, this bp had a dependency
 on the query scheduler, which is now merged. It would be very helpful if anyone
 could shed some light on the fate of this bp.

 Thanks.

 Cheers,
 Syed Armani


 On Wed, Sep 25, 2013 at 11:46 PM, Chris Friesen
 chris.frie...@windriver.com wrote:
 I'm interested in automatically evacuating instances in the case of a
 failed compute node.  I found the following blueprint that covers exactly
 this case:

 https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

 However, the comments there seem to indicate that the code that
 orchestrates the evacuation shouldn't go into nova (referencing the Havana
 design summit).

 Why wouldn't this type of behaviour belong in nova?  (Is there a summary
 of discussions at the summit?)  Is there a recommended place where this
 sort of thing should go?

 Thanks,
 Chris



Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Chris Friesen

On 10/08/2013 03:20 PM, Alex Glikson wrote:

Seems that this can be broken into 3 incremental pieces. First, would be
great if the ability to schedule a single 'evacuate' would be finally
merged
(https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).


Agreed.


Then, it would make sense to have the logic that evacuates an entire
host
(https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
The reasoning behind suggesting that this should not necessarily be in
Nova is, perhaps, that it *can* be implemented outside Nova using the
individual 'evacuate' API.


This actually more-or-less exists already in the existing nova
host-evacuate command.  One major issue with this, however, is that it
requires the caller to specify whether all the instances are on shared
or local storage, and so it can't handle a mix of local and shared
storage for the instances.  If any of them boot off block storage, for
instance, you need to move them first and then do the remaining ones as a
group.


It would be nice to embed the knowledge of whether or not an instance is 
on shared storage in the instance itself at creation time.  I envision 
specifying this in the config file for the compute manager along with 
the instance storage location, and the compute manager could set the 
field in the instance at creation time.
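
As a rough illustration of the idea above, here is a minimal Python sketch of a host-level evacuation that reads the storage flag from each instance instead of asking the caller for one global setting. Everything here is hypothetical: the `Instance` record, its `on_shared_storage` field (the proposed per-instance field, not an existing Nova attribute), and the `evacuate` callable that stands in for the single-instance evacuate call.

```python
from collections import namedtuple

# Hypothetical instance record; `on_shared_storage` stands for the
# per-instance field proposed above and is not an existing Nova attribute.
Instance = namedtuple("Instance", ["uuid", "on_shared_storage"])


def evacuate_host(instances, target_host, evacuate):
    """Evacuate every instance using its own storage flag.

    `evacuate(uuid, target_host, on_shared_storage)` is a hypothetical
    callable standing in for the single-instance evacuate call.
    Returns the (uuid, on_shared_storage) pairs in the order processed.
    """
    done = []
    for inst in instances:
        # The flag comes from the instance, so a host with a mix of
        # shared and local storage can be handled in one pass.
        evacuate(inst.uuid, target_host, inst.on_shared_storage)
        done.append((inst.uuid, inst.on_shared_storage))
    return done
```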



Finally, it should be possible to close the
loop and invoke the evacuation automatically as a result of a failure
detection (not clear how exactly this would work, though). Hopefully we
will have at least the first part merged soon (not sure if anyone is
actively working on a rebase).


My interpretation of the discussion so far is that the nova maintainers 
would prefer this to be driven by an outside orchestration daemon.


Currently the only way a service is recognized to be down is if 
someone calls is_up() and it notices that the service hasn't sent an 
update in the last minute.  There's nothing in nova actively scanning 
for compute node failures, which is where the outside daemon comes in.
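
To make the heartbeat check concrete, here is a minimal Python sketch of what an outside daemon could do on each scan. It is not Nova's implementation: the 60-second threshold simply mirrors the "last minute" mentioned above, and the `heartbeats` mapping and `find_down_hosts` helper are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumed threshold, mirroring the "last minute" mentioned above.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Return True if the last reported update is recent enough."""
    return abs(now - last_heartbeat) <= down_time


def find_down_hosts(heartbeats, now):
    """Return the sorted host names whose heartbeat is stale.

    `heartbeats` is a hypothetical mapping of host name to the datetime
    of the last update reported by that host's compute service.
    """
    return sorted(host for host, seen in heartbeats.items()
                  if not is_up(seen, now))
```

A daemon built on this would call `find_down_hosts` periodically rather than waiting for some other code path to call `is_up()`.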


Also, there is some complexity involved in dealing with auto-evacuate: 
What do you do if an evacuate fails?  How do you recover intelligently 
if there is no admin involved?


Chris



Re: [openstack-dev] [nova] automatically evacuate instances on compute failure

2013-10-08 Thread Tim Bell
I have proposed the summit design session for Hong Kong
(http://summit.openstack.org/cfp/details/103) to discuss exactly these sorts of
points. We have the low-level Nova commands but need a service to automate the
process.

I see two scenarios

- A hardware intervention needs to be scheduled, please rebalance this workload 
elsewhere before it fails completely
- A hypervisor has failed, please recover what you can using shared storage and 
give me a policy on what to do with the other VMs (restart, leave down till 
repair etc.)
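
The two scenarios above could be expressed as a small policy function, sketched here in Python. All scenario names, action names, and the `local_policy` parameter are hypothetical labels chosen for illustration; they do not correspond to existing Nova options.

```python
def plan_actions(scenario, instances, local_policy="leave_down"):
    """Map each instance name to an action for the given scenario.

    `instances` maps an instance name to True if it is on shared storage.
    All scenario and action names are hypothetical labels.
    """
    if scenario == "planned_intervention":
        # The host is still up, so everything can be moved away first.
        return {name: "migrate" for name in instances}
    if scenario == "hypervisor_failed":
        if local_policy not in ("restart", "leave_down"):
            raise ValueError("unknown policy: %s" % local_policy)
        # Shared-storage instances can be recovered; the rest follow
        # the operator's policy.
        return {name: "evacuate" if shared else local_policy
                for name, shared in instances.items()}
    raise ValueError("unknown scenario: %s" % scenario)
```

For example, `plan_actions("hypervisor_failed", {"db": True, "web": False})` would recover `db` from shared storage and leave `web` down until repair.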

Most OpenStack production sites have some sort of script doing this sort of 
thing now. However, each one will be implementing the logic for migration 
differently so there is no agreed best practise approach.

Tim

 -Original Message-
 From: Chris Friesen [mailto:chris.frie...@windriver.com]
 Sent: 09 October 2013 00:48
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [nova] automatically evacuate instances on 
 compute failure
 
 On 10/08/2013 03:20 PM, Alex Glikson wrote:
  Seems that this can be broken into 3 incremental pieces. First, would
  be great if the ability to schedule a single 'evacuate' would be
  finally merged
  (https://blueprints.launchpad.net/nova/+spec/find-host-and-evacuate-instance).
 
 Agreed.
 
  Then, it would make sense to have the logic that evacuates an entire
  host
  (https://blueprints.launchpad.net/python-novaclient/+spec/find-and-evacuate-host).
  The reasoning behind suggesting that this should not necessarily be in
  Nova is, perhaps, that it *can* be implemented outside Nova using the
  individual 'evacuate' API.
 
 This actually more-or-less exists already in the existing nova 
 host-evacuate command.  One major issue with this however is that it
 requires the caller to specify whether all the instances are on shared or 
 local storage, and so it can't handle a mix of local and shared
 storage for the instances.   If any of them boot off block storage for
 instance you need to move them first and then do the remaining ones as a 
 group.
 
 It would be nice to embed the knowledge of whether or not an instance is on 
 shared storage in the instance itself at creation time.  I
 envision specifying this in the config file for the compute manager along 
 with the instance storage location, and the compute manager
 could set the field in the instance at creation time.
 
  Finally, it should be possible to close the loop and invoke the
  evacuation automatically as a result of a failure detection (not clear
  how exactly this would work, though). Hopefully we will have at least
  the first part merged soon (not sure if anyone is actively working on
  a rebase).
 
 My interpretation of the discussion so far is that the nova maintainers would 
 prefer this to be driven by an outside orchestration daemon.
 
 Currently the only way a service is recognized to be down is if someone 
 calls is_up() and it notices that the service hasn't sent an update
 in the last minute.  There's nothing in nova actively scanning for compute 
 node failures, which is where the outside daemon comes in.
 
 Also, there is some complexity involved in dealing with auto-evacuate:
 What do you do if an evacuate fails?  How do you recover intelligently if 
 there is no admin involved?
 
 Chris
 


[openstack-dev] [nova] automatically evacuate instances on compute failure

2013-09-25 Thread Chris Friesen
I'm interested in automatically evacuating instances in the case of a 
failed compute node.  I found the following blueprint that covers 
exactly this case:


https://blueprints.launchpad.net/nova/+spec/evacuate-instance-automatically

However, the comments there seem to indicate that the code that 
orchestrates the evacuation shouldn't go into nova (referencing the 
Havana design summit).


Why wouldn't this type of behaviour belong in nova?  (Is there a summary 
of discussions at the summit?)  Is there a recommended place where this 
sort of thing should go?


Thanks,
Chris
