Re: [openstack-dev] [heat][nova] VM restarting on host failure in convergence

2014-09-19 Thread Jastrzebski, Michal
   All,
  
   Currently OpenStack does not have a built-in HA mechanism for tenant
   instances which could restore virtual machines in case of a host
   failure. OpenStack assumes every app is designed for failure and can
   handle instance failure and will self-remediate, but that is rarely
   the case for the very large Enterprise application ecosystem.
   Many existing enterprise applications are stateful, and assume that
   the physical infrastructure is always on.
  
 
  There is a fundamental debate that OpenStack's vendors need to work out
  here. Existing applications are well served by existing virtualization
  platforms. Turning OpenStack into a work-alike to oVirt is not the end
  goal here. It's a happy accident that traditional apps can sometimes be
  bent onto the cloud without much modification.
 
  The thing that clouds do is they give development teams a _limited_
  infrastructure that lets IT do what they're good at (keep the
  infrastructure up) and lets development teams do what they're good at
  (run their app). By putting HA into the _app_, and not the _infrastructure_,
  the dev teams get agility and scalability. No more waiting weeks for
  allocating specialized servers with hardware fencing setups and fibre
  channel controllers to house a shared disk system so the super reliable
  virtualization can hide HA from the user.
 
  Spin up vms. Spin up volumes.  Run some replication between regions,
  and be resilient.

I don't dispute that this is the way to go, but reality is somewhat
different. In a world of early design failures, low budgets and deadlines,
some good practices might be omitted early and be hard to implement later.

From a technical point of view, the cloud can help improve such apps, and
I think OpenStack should address that part of the market as well.

  So, as long as it is understood that whatever is being proposed should
  be an application centric feature, and not an infrastructure centric
  feature, this argument remains interesting in the cloud context.
  Otherwise, it is just an invitation for OpenStack to open up direct
  competition with behemoths like vCenter.
 
   Even the OpenStack controller services themselves do not gracefully
   handle failure.
  
 
  Which ones?

Heat has issues, Horizon has issues, and Neutron L3 only works in an
active-passive setup.

   When these applications were virtualized, they were virtualized on
   platforms that enabled very high SLAs for each virtual machine,
   allowing the application to not be rewritten as the IT team moved them
   from physical to virtual. Now while these apps cannot benefit from
   methods like automatic scaleout, the application owners will greatly
   benefit from the self-service capabilities they will receive as they
   utilize the OpenStack control plane.
  
 
  These apps were virtualized for IT's benefit. But the application authors
  and users are now stuck in high-cost virtualization. The cloud is best
  utilized when IT can control that cost and shift the burden of uptime
  to the users by offering them more overall capacity and flexibility with
  the caveat that the individual resources will not be as reliable.
 
  So what I'm most interested in is helping authors change their apps to
  be resilient on their own, not in putting more burden on IT.

This can be very costly, and therefore is not always possible.

   I'd like to suggest expanding the heat convergence mechanism to enable
   self-remediation of virtual machines and other heat resources.
  
 
  Convergence is still nascent. I don't know if I'd pile on to what might
  take another 12 - 18 months to get done anyway. We're just now figuring
  out how to get started where we thought we might already be 1/3 of the
  way through. Just something to consider.

We don't need to complete convergence to start working on that.
However long this might take, the sooner we start, the sooner we deliver.


Thanks,
Michał

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat][nova] VM restarting on host failure in convergence

2014-09-19 Thread Jastrzebski, Michal
   In short, what we'll need from nova is to have 100% reliable
   host-health monitor and equally reliable rebuild/evacuate mechanism
   with fencing and scheduler. In heat we need scalable and reliable
   event listener and engine to decide which action to perform in given
   situation.
 
  Unfortunately, I don't think Nova can provide this alone.  Nova only
  knows about whether or not the nova-compute daemon is currently
  communicating with the rest of the system.  Even if the nova-compute
  daemon drops out, the compute node may still be running all instances
  just fine.  We certainly don't want to impact those running workloads
  unless absolutely necessary.

But on the other hand, if a host is really down, nova might want to know
that, if only to change the instances' status to ERROR or whatever. I don't
think a situation where an instance is down due to host failure and nova
doesn't know about it is good for anyone.

  I understand that you're suggesting that we enhance Nova to be able to
  provide that level of knowledge and control.  I actually don't think
  Nova should have this knowledge of its underlying infrastructure.
 
  I would put the host monitoring infrastructure (to determine if a host
  is down) and fencing capability as out of scope for Nova and as a part
  of the supporting infrastructure.  Assuming those pieces can properly
  detect that a host is down and fence it, then all that's needed from
  Nova is the evacuate capability, which is already there.  There may be
  some enhancements that could be done to it, but surely it's quite close.

Why do you think nova shouldn't have information about the underlying
infra? Since the service group is plugin-based, we could develop a new
plugin to make nova's information more reliable without any impact on
the current code. I'm a bit concerned about the dependency injection we'd
have to do. I'd love to be in a situation where people get some level of
SLA (maybe not the best they can get) in heat out of the box, without a
bigger investment in infrastructure configuration.

  There's also the part where a notification needs to go out saying that
  the instance has failed.  Some thing (which could be Heat in the case of
  this proposal) can react to that, either directly or via ceilometer, for
  example.  There is an API today to hard reset the state of an instance
  to ERROR.  After a host is fenced, you could use this API to mark all
  instances on that host as dead.  I'm not sure if there's an easy way to
  do that for all instances on a host today.  That's likely an enhancement
  we could make to python-novaclient, similar to the evacuate all
  instances on a host enhancement that was done in novaclient.

Why wouldn't nova itself do that? I mean, in my opinion nova should know
the real status of its instances at all times.

Thanks,
Michał


Re: [openstack-dev] [heat][nova] VM restarting on host failure in convergence

2014-09-17 Thread Russell Bryant
On 09/17/2014 09:03 AM, Jastrzebski, Michal wrote:
 In short, what we'll need from nova is to have 100% reliable
 host-health monitor and equally reliable rebuild/evacuate mechanism
 with fencing and scheduler. In heat we need scalable and reliable
 event listener and engine to decide which action to perform in given
 situation.

Unfortunately, I don't think Nova can provide this alone.  Nova only
knows about whether or not the nova-compute daemon is currently
communicating with the rest of the system.  Even if the nova-compute
daemon drops out, the compute node may still be running all instances
just fine.  We certainly don't want to impact those running workloads
unless absolutely necessary.

I understand that you're suggesting that we enhance Nova to be able to
provide that level of knowledge and control.  I actually don't think
Nova should have this knowledge of its underlying infrastructure.

I would put the host monitoring infrastructure (to determine if a host
is down) and fencing capability as out of scope for Nova and as a part
of the supporting infrastructure.  Assuming those pieces can properly
detect that a host is down and fence it, then all that's needed from
Nova is the evacuate capability, which is already there.  There may be
some enhancements that could be done to it, but surely it's quite close.
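
The division of labour described above can be sketched roughly as follows.
This is only an illustration under stated assumptions: recover_host and the
four callables passed to it (is_host_down, fence, list_instances, evacuate)
are hypothetical names, not Nova or Heat APIs. The first two stand in for
the external monitoring and fencing infrastructure, and the last one for a
call to Nova's existing evacuate capability.

```python
from typing import Callable, Iterable, List


def recover_host(
    host: str,
    is_host_down: Callable[[str], bool],
    fence: Callable[[str], bool],
    list_instances: Callable[[str], Iterable[str]],
    evacuate: Callable[[str], None],
) -> List[str]:
    """Request evacuation only after `host` is confirmed down and fenced.

    All callables are hypothetical hooks supplied by the caller.
    Returns the IDs of the instances for which evacuation was requested.
    """
    # The external monitor, not Nova, decides whether the host is down.
    if not is_host_down(host):
        return []
    # Never evacuate from a host that could not be fenced: its instances
    # may still be running and writing to shared storage.
    if not fence(host):
        raise RuntimeError(f"fencing failed for {host}; refusing to evacuate")
    evacuated = []
    for instance_id in list_instances(host):
        evacuate(instance_id)
        evacuated.append(instance_id)
    return evacuated
```

The point of the ordering is that a lost nova-compute heartbeat alone never
triggers evacuation; running workloads are only touched once the host has
been declared down and fenced by the supporting infrastructure.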

There's also the part where a notification needs to go out saying that
the instance has failed.  Some thing (which could be Heat in the case of
this proposal) can react to that, either directly or via ceilometer, for
example.  There is an API today to hard reset the state of an instance
to ERROR.  After a host is fenced, you could use this API to mark all
instances on that host as dead.  I'm not sure if there's an easy way to
do that for all instances on a host today.  That's likely an enhancement
we could make to python-novaclient, similar to the evacuate all
instances on a host enhancement that was done in novaclient.
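
A minimal sketch of what such a "reset all instances on a host" helper
might look like. It assumes `nova` is an already authenticated
python-novaclient client with admin rights, and that servers.list with a
host filter in search_opts and servers.reset_state behave as I understand
them; both should be checked against the installed novaclient version.
The function name mark_host_instances_error is hypothetical.

```python
def mark_host_instances_error(nova, host):
    """Reset every instance on `host` to ERROR; return the IDs that were reset.

    `nova` is assumed to be a novaclient-style client (an assumption, not
    verified here); call this only after the host has been fenced.
    """
    # Listing by host across all tenants normally requires admin credentials.
    servers = nova.servers.list(search_opts={"host": host, "all_tenants": 1})
    reset = []
    for server in servers:
        nova.servers.reset_state(server, state="error")
        reset.append(server.id)
    return reset
```

Whatever listens for the resulting state change (Heat, directly or via
ceilometer) could then react to it.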

-- 
Russell Bryant



Re: [openstack-dev] [heat][nova] VM restarting on host failure in convergence

2014-09-17 Thread Clint Byrum
Excerpts from Jastrzebski, Michal's message of 2014-09-17 06:03:06 -0700:
 All,
 
 Currently OpenStack does not have a built-in HA mechanism for tenant
 instances which could restore virtual machines in case of a host
 failure. OpenStack assumes every app is designed for failure and can
 handle instance failure and will self-remediate, but that is rarely
 the case for the very large Enterprise application ecosystem.
 Many existing enterprise applications are stateful, and assume that
 the physical infrastructure is always on.
 

There is a fundamental debate that OpenStack's vendors need to work out
here. Existing applications are well served by existing virtualization
platforms. Turning OpenStack into a work-alike to oVirt is not the end
goal here. It's a happy accident that traditional apps can sometimes be
bent onto the cloud without much modification.

The thing that clouds do is they give development teams a _limited_
infrastructure that lets IT do what they're good at (keep the
infrastructure up) and lets development teams do what they're good at (run
their app). By putting HA into the _app_, and not the _infrastructure_,
the dev teams get agility and scalability. No more waiting weeks for
allocating specialized servers with hardware fencing setups and fibre
channel controllers to house a shared disk system so the super reliable
virtualization can hide HA from the user.

Spin up vms. Spin up volumes.  Run some replication between regions,
and be resilient.

So, as long as it is understood that whatever is being proposed should
be an application centric feature, and not an infrastructure centric
feature, this argument remains interesting in the cloud context.
Otherwise, it is just an invitation for OpenStack to open up direct
competition with behemoths like vCenter.

 Even the OpenStack controller services themselves do not gracefully
 handle failure.
 

Which ones?

 When these applications were virtualized, they were virtualized on
 platforms that enabled very high SLAs for each virtual machine,
 allowing the application to not be rewritten as the IT team moved them
 from physical to virtual. Now while these apps cannot benefit from
 methods like automatic scaleout, the application owners will greatly
 benefit from the self-service capabilities they will receive as they
 utilize the OpenStack control plane.
 

These apps were virtualized for IT's benefit. But the application authors
and users are now stuck in high-cost virtualization. The cloud is best
utilized when IT can control that cost and shift the burden of uptime
to the users by offering them more overall capacity and flexibility with
the caveat that the individual resources will not be as reliable.

So what I'm most interested in is helping authors change their apps to
be resilient on their own, not in putting more burden on IT.

 I'd like to suggest expanding the heat convergence mechanism to enable
 self-remediation of virtual machines and other heat resources.
 

Convergence is still nascent. I don't know if I'd pile on to what might
take another 12 - 18 months to get done anyway. We're just now figuring
out how to get started where we thought we might already be 1/3 of the
way through. Just something to consider.
