Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

2014-06-02 Thread Steve Baker
On 31/05/14 07:01, Zane Bitter wrote:
 On 29/05/14 19:52, Clint Byrum wrote:

 update-failure-recovery
 ===

 This is a blueprint I believe Zane is working on to land in Juno. It
 will
 allow us to retry a failed create or update action. Combined with the
 separate controller/compute node strategy, this may be our best option,
 but it is unclear whether that code will be available soon or not. The
 chunking is definitely required, because with 500 compute nodes, if
 node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
 cancelled, which makes the impact of a transient failure quite extreme.
 Also without chunking, we'll suffer from some of the performance
 problems we've seen where a single engine process will have to do all of
 the work to bring up a stack.

 Pros: * Uses blessed strategy

 Cons: * Implementation is not complete
   * Still suffers from heavy impact of failure
   * Requires chunking to be feasible

 I've already started working on this and I'm expecting to have this
 ready some time between the j-1 and j-2 milestones.

 I think these two strategies combined could probably get you a long
 way in the short term, though obviously they are not a replacement for
 the convergence strategy in the long term.


 BTW You missed off another strategy that we have discussed in the
 past, and which I think Steve Baker might(?) be working on: retrying
 failed calls at the client level.

As part of the client-plugins blueprint I'm planning on implementing
retry policies on API calls. So where currently we call:
self.nova().servers.create(**kwargs)

This will soon be:
self.client().servers.create(**kwargs)

And with a retry policy (assuming the default unique-ish server name is
used):
self.client_plugin().call_with_retry_policy('cleanup_yr_mess_and_try_again',
self.client().servers.create, **kwargs)

This should be suitable for handling transient errors on API calls such
as 500s, response timeouts or token expiration. It shouldn't be used for
resources which later come up in an ERROR state; convergence or
update-failure-recovery would be better for that.

These policies can start out simple and hard-coded, but there is
potential for different policies to be specified in heat.conf to cater
for the specific failure modes of a given cloud.
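
To make that concrete, a first hard-coded policy could look roughly like the
sketch below; the helper class, the cleanup hook and the names used here are
illustrative only, not the final client-plugins API:

import time


class RetryPolicy(object):
    """Retry a client call on transient failures (5xx, timeouts, auth)."""

    def __init__(self, max_attempts=3, delay=2,
                 retry_exceptions=(Exception,), cleanup=None):
        self.max_attempts = max_attempts
        self.delay = delay
        self.retry_exceptions = retry_exceptions
        # cleanup runs between attempts, e.g. to delete a half-created
        # server found via its unique-ish name before trying again.
        self.cleanup = cleanup

    def call(self, func, *args, **kwargs):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except self.retry_exceptions:
                if attempt == self.max_attempts:
                    raise
                if self.cleanup is not None:
                    self.cleanup(*args, **kwargs)
                time.sleep(self.delay)

# Usage would then be along the lines of:
# RetryPolicy(cleanup=delete_server_by_name).call(
#     self.client().servers.create, **kwargs)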

Expected to be ready between the j-1 and j-2 milestones.



Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

2014-06-02 Thread Mike Spreitzer
Steve Baker sba...@redhat.com wrote on 06/02/2014 05:37:25 PM:

  BTW You missed off another strategy that we have discussed in the
  past, and which I think Steve Baker might(?) be working on: retrying
  failed calls at the client level.
 
 As part of the client-plugins blueprint I'm planning on implementing
 retry policies on API calls. So where currently we call:
 self.nova().servers.create(**kwargs)
 
 This will soon be:
 self.client().servers.create(**kwargs)
 
 And with a retry policy (assuming the default unique-ish server name is
 used):
 
self.client_plugin().call_with_retry_policy('cleanup_yr_mess_and_try_again',
 self.client().servers.create, **kwargs)
 
 This should be suitable for handling transient errors on API calls such
 as 500s, response timeouts or token expiration. It shouldn't be used for
 resources which later come up in an ERROR state; convergence or
 update-failure-recovery would be better for that.

Response timeouts can be problematic here for non-idempotent operations, 
right?
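
For example, after a timeout on a create call we can't tell whether Nova
actually made the server, so a blind retry might leave us with a duplicate.
Something like the following (purely illustrative) guard, relying on that
unique-ish server name, would be needed before retrying:

def create_with_timeout_guard(client, name, **kwargs):
    # If the first attempt timed out, the server may already exist;
    # look it up by its unique-ish name before creating another one.
    existing = client.servers.list(search_opts={'name': name})
    if existing:
        return existing[0]
    return client.servers.create(name=name, **kwargs)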

Thanks,
Mike



Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

2014-06-02 Thread Clint Byrum
Excerpts from Steve Baker's message of 2014-06-02 14:37:25 -0700:
 On 31/05/14 07:01, Zane Bitter wrote:
  On 29/05/14 19:52, Clint Byrum wrote:
 
  update-failure-recovery
  ===
 
  This is a blueprint I believe Zane is working on to land in Juno. It
  will
  allow us to retry a failed create or update action. Combined with the
  separate controller/compute node strategy, this may be our best option,
  but it is unclear whether that code will be available soon or not. The
  chunking is definitely required, because with 500 compute nodes, if
  node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
  cancelled, which makes the impact of a transient failure quite extreme.
  Also without chunking, we'll suffer from some of the performance
  problems we've seen where a single engine process will have to do all of
  the work to bring up a stack.
 
  Pros: * Uses blessed strategy
 
  Cons: * Implementation is not complete
* Still suffers from heavy impact of failure
* Requires chunking to be feasible
 
  I've already started working on this and I'm expecting to have this
  ready some time between the j-1 and j-2 milestones.
 
  I think these two strategies combined could probably get you a long
  way in the short term, though obviously they are not a replacement for
  the convergence strategy in the long term.
 
 
  BTW You missed off another strategy that we have discussed in the
  past, and which I think Steve Baker might(?) be working on: retrying
  failed calls at the client level.
 
 As part of the client-plugins blueprint I'm planning on implementing
 retry policies on API calls. So where currently we call:
 self.nova().servers.create(**kwargs)
 
 This will soon be:
 self.client().servers.create(**kwargs)
 
 And with a retry policy (assuming the default unique-ish server name is
 used):
 self.client_plugin().call_with_retry_policy('cleanup_yr_mess_and_try_again',
 self.client().servers.create, **kwargs)
 
 This should be suitable for handling transient errors on API calls such
 as 500s, response timeouts or token expiration. It shouldn't be used for
 resources which later come up in an ERROR state; convergence or
 update-failure-recovery would be better for that.
 

Steve, this is fantastic work and sorely needed. Thank you for working on
it.

Unfortunately, machines that end up in an ERROR state are the majority of
our problem. IPMI and PXE can be unreliable in some environments, and
sometimes machines are broken in subtle ways. The odd bug in Neutron, Nova,
or Ironic will cause this as well.

Convergence is not available to us in the short term, and
update-failure-recovery is some time off too, so unfortunately we need more
solutions.



Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

2014-05-29 Thread Mike Spreitzer
Clint Byrum cl...@fewbar.com wrote on 05/29/2014 07:52:07 PM:

 I am writing to get some brainstorming started on how we might mitigate
 some of the issues we've seen while deploying large stacks on Heat. I am
 sending this to the dev list because it may involve landing fixes rather
 than just using different strategies. The problems outlined here are
 well known and reported as bugs or feature requests, but there may be
 more that we can do.
 
 ...
 
 Strategies:
 
 ...
 
 update-failure-recovery
 ===
 
 This is a blueprint I believe Zane is working on to land in Juno. It 
will
 allow us to retry a failed create or update action. Combined with the
 separate controller/compute node strategy, this may be our best option,
 but it is unclear whether that code will be available soon or not. The
 chunking is definitely required, because with 500 compute nodes, if
 node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
 cancelled, which makes the impact of a transient failure quite extreme.
 Also without chunking, we'll suffer from some of the performance
 problems we've seen where a single engine process will have to do all of
 the work to bring up a stack.
 
 Pros: * Uses blessed strategy
 
 Cons: * Implementation is not complete
   * Still suffers from heavy impact of failure
   * Requires chunking to be feasible

I like this one.  As I remarked in the convergence discussion, I think the 
first step there is a DB schema change to separate desired and observed 
state.  Once that is done, failure on one resource need not wedge a stack; 
non-dependent resources (like the peer compute nodes) can still be 
created.
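
As a rough illustration (not an actual Heat migration), the per-resource
table might grow something like:

import sqlalchemy as sa

resource = sa.Table(
    'resource', sa.MetaData(),
    sa.Column('id', sa.String(36), primary_key=True),
    sa.Column('stack_id', sa.String(36), nullable=False),
    # Desired state: what the current template and parameters ask for.
    sa.Column('desired_properties', sa.Text),
    # Observed state: what was last seen in the backing service.
    sa.Column('observed_properties', sa.Text),
    sa.Column('status', sa.String(255)),
    sa.Column('status_reason', sa.Text),
)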

This does not address the issue of putting a lot of work in one process; 
that requires a more radical change.

Regards,
Mike



Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks

2014-05-29 Thread Clint Byrum
Excerpts from Mike Spreitzer's message of 2014-05-30 05:42:43 +0530:
 Clint Byrum cl...@fewbar.com wrote on 05/29/2014 07:52:07 PM:
 
  I am writing to get some brainstorming started on how we might mitigate
  some of the issues we've seen while deploying large stacks on Heat. I am
  sending this to the dev list because it may involve landing fixes rather
  than just using different strategies. The problems outlined here are
  well known and reported as bugs or feature requests, but there may be
  more that we can do.
  
  ...
  
  Strategies:
  
  ...
  
  update-failure-recovery
  ===
  
  This is a blueprint I believe Zane is working on to land in Juno. It 
 will
  allow us to retry a failed create or update action. Combined with the
  separate controller/compute node strategy, this may be our best option,
  but it is unclear whether that code will be available soon or not. The
  chunking is definitely required, because with 500 compute nodes, if
  node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be
  cancelled, which makes the impact of a transient failure quite extreme.
  Also without chunking, we'll suffer from some of the performance
  problems we've seen where a single engine process will have to do all of
  the work to bring up a stack.
  
  Pros: * Uses blessed strategy
  
  Cons: * Implementation is not complete
* Still suffers from heavy impact of failure
* Requires chunking to be feasible
 
 I like this one.  As I remarked in the convergence discussion, I think the 
 first step there is a DB schema change to separate desired and observed 
 state.  Once that is done, failure on one resource need not wedge a stack; 
 non-dependent resources (like the peer compute nodes) can still be 
 created.

It's not just the observed state that you need in the database to resume.

You also need the parameters and the template snippet that have been
successfully applied.
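
Illustratively (and again, this is not a real migration), each resource
would also need to carry something like:

import sqlalchemy as sa

resource_applied = sa.Table(
    'resource_applied', sa.MetaData(),
    sa.Column('resource_id', sa.String(36), primary_key=True),
    # The template snippet that was last successfully applied...
    sa.Column('applied_template_snippet', sa.Text),
    # ...and the parameters that were in effect at the time.
    sa.Column('applied_parameters', sa.Text),
)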
