Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks
On 31/05/14 07:01, Zane Bitter wrote:
> On 29/05/14 19:52, Clint Byrum wrote:
>> update-failure-recovery
>> ===
>>
>> This is a blueprint I believe Zane is working on to land in Juno. It will allow us to retry a failed create or update action. Combined with the separate controller/compute node strategy, this may be our best option, but it is unclear whether that code will be available soon or not. The chunking is definitely required, because with 500 compute nodes, if node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be cancelled, which makes the impact of a transient failure quite extreme. Also, without chunking we'll suffer from some of the performance problems we've seen where a single engine process has to do all of the work to bring up a stack.
>>
>> Pros:
>> * Uses blessed strategy
>>
>> Cons:
>> * Implementation is not complete
>> * Still suffers from heavy impact of failure
>> * Requires chunking to be feasible
>
> I've already started working on this and I'm expecting to have it ready some time between the j-1 and j-2 milestones.
>
> I think these two strategies combined could probably get you a long way in the short term, though obviously they are not a replacement for the convergence strategy in the long term.
>
> BTW you missed off another strategy that we have discussed in the past, and which I think Steve Baker might(?) be working on: retrying failed calls at the client level.

As part of the client-plugins blueprint I'm planning on implementing retry policies on API calls. So where we currently call:

    self.nova().servers.create(**kwargs)

this will soon be:

    self.client().servers.create(**kwargs)

and, with a retry policy (assuming the default unique-ish server name is used):

    self.client_plugin().call_with_retry_policy(
        'cleanup_yr_mess_and_try_again',
        self.client().servers.create, **kwargs)

This should be suitable for handling transient errors on API calls such as 500s, response timeouts or token expiration. It shouldn't be used for resources which later come up in an ERROR state; convergence or update-failure-recovery would be better for that.

These policies can start out simple and hard-coded, but there is potential for different policies to be specified in heat.conf to cater for the specific failure modes of a given cloud (a rough sketch of the mechanics follows this message).

Expected to be ready between j-1 and j-2.
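To make the shape of that concrete, here is a minimal sketch of what such a retry wrapper could look like. Everything beyond the call_with_retry_policy name and the policy string is hypothetical; the real client-plugins code may differ, and the exception types, attempt counts and backoff values are stand-ins:

    import time

    # Hypothetical sketch only -- not the actual client-plugins code.
    # Real policies could eventually be loaded from heat.conf.
    RETRY_POLICIES = {
        'cleanup_yr_mess_and_try_again': {
            'retry_on': (IOError,),   # stand-in for 500s/timeouts/expired tokens
            'max_attempts': 3,
            'backoff_seconds': 2.0,
            'cleanup': lambda: None,  # e.g. delete a half-created server by name
        },
    }

    def call_with_retry_policy(policy_name, func, **kwargs):
        policy = RETRY_POLICIES[policy_name]
        for attempt in range(1, policy['max_attempts'] + 1):
            try:
                return func(**kwargs)
            except policy['retry_on']:
                if attempt == policy['max_attempts']:
                    raise  # give up; let the resource fail as it does today
                policy['cleanup']()  # undo any partial work before retrying
                time.sleep(policy['backoff_seconds'] * attempt)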
Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks
Steve Baker sba...@redhat.com wrote on 06/02/2014 05:37:25 PM:
>> BTW you missed off another strategy that we have discussed in the past, and which I think Steve Baker might(?) be working on: retrying failed calls at the client level.
>
> As part of the client-plugins blueprint I'm planning on implementing retry policies on API calls. So where we currently call:
>
>     self.nova().servers.create(**kwargs)
>
> this will soon be:
>
>     self.client().servers.create(**kwargs)
>
> and, with a retry policy (assuming the default unique-ish server name is used):
>
>     self.client_plugin().call_with_retry_policy(
>         'cleanup_yr_mess_and_try_again',
>         self.client().servers.create, **kwargs)
>
> This should be suitable for handling transient errors on API calls such as 500s, response timeouts or token expiration. It shouldn't be used for resources which later come up in an ERROR state; convergence or update-failure-recovery would be better for that.

Response timeouts can be problematic here for non-idempotent operations, right?

Thanks,
Mike
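To illustrate Mike's concern with a sketch (assuming the generated server name is unique enough to search on; the lookup below uses novaclient-style search_opts, and the exception type is a stand-in): after a response timeout the create may or may not have landed server-side, so a naive retry risks a duplicate server.

    def create_server_safely(client, name, **kwargs):
        # Sketch: a response timeout leaves the outcome of a
        # non-idempotent call unknown, so look before re-creating.
        try:
            return client.servers.create(name=name, **kwargs)
        except IOError:  # stand-in for a response timeout
            # The create may have succeeded anyway; search by the
            # unique-ish name before trying again.
            existing = client.servers.list(search_opts={'name': name})
            if existing:
                return existing[0]   # first attempt actually landed
            return client.servers.create(name=name, **kwargs)

This is also why the cleanup step in a retry policy matters for create-type calls.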
Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks
Excerpts from Steve Baker's message of 2014-06-02 14:37:25 -0700:
> As part of the client-plugins blueprint I'm planning on implementing retry policies on API calls.
> [...]
> This should be suitable for handling transient errors on API calls such as 500s, response timeouts or token expiration. It shouldn't be used for resources which later come up in an ERROR state; convergence or update-failure-recovery would be better for that.

Steve, this is fantastic work and sorely needed. Thank you for working on it.

Unfortunately, machines that come up in an ERROR state are the majority of our problem. IPMI and PXE can be unreliable in some environments, and sometimes machines are broken in subtle ways. Also, the odd bug in Neutron, Nova, or Ironic will cause this. Convergence is not available to us in the short term, and update-failure-recovery is really some time off too, so unfortunately we need more solutions.
Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks
Clint Byrum cl...@fewbar.com wrote on 05/29/2014 07:52:07 PM:
> I am writing to get some brainstorming started on how we might mitigate some of the issues we've seen while deploying large stacks on Heat. I am sending this to the dev list because it may involve landing fixes rather than just using different strategies. The problems outlined here are well known and reported as bugs or feature requests, but there may be more that we can do.
> ...
> Strategies:
> ...
> update-failure-recovery
> ===
>
> This is a blueprint I believe Zane is working on to land in Juno. It will allow us to retry a failed create or update action. Combined with the separate controller/compute node strategy, this may be our best option, but it is unclear whether that code will be available soon or not. The chunking is definitely required, because with 500 compute nodes, if node #250 fails, the remaining 249 nodes that are IN_PROGRESS will be cancelled, which makes the impact of a transient failure quite extreme. Also, without chunking we'll suffer from some of the performance problems we've seen where a single engine process has to do all of the work to bring up a stack.
>
> Pros:
> * Uses blessed strategy
>
> Cons:
> * Implementation is not complete
> * Still suffers from heavy impact of failure
> * Requires chunking to be feasible

I like this one. As I remarked in the convergence discussion, I think the first step there is a DB schema change to separate desired and observed state (see the sketch after this message). Once that is done, failure on one resource need not wedge a stack; non-dependent resources (like the peer compute nodes) can still be created.

This does not address the issue of putting a lot of work in one process; that requires a more radical change.

Regards,
Mike
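A minimal sketch of that separation (hypothetical table and column names, not Heat's actual models; SQLAlchemy declarative style assumed):

    from sqlalchemy import Column, Integer, String, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Resource(Base):
        # Hypothetical model: the desired state (what the template asks
        # for) lives apart from the observed state (what has actually
        # been achieved), so one resource failing only marks its own row
        # ERROR and non-dependent siblings can keep going.
        __tablename__ = 'resource'

        id = Column(Integer, primary_key=True)
        stack_id = Column(String(36), nullable=False)
        name = Column(String(255), nullable=False)
        desired_definition = Column(Text)           # resource snippet from the template
        observed_state = Column(String(32))         # e.g. CREATE_COMPLETE, ERROR
        physical_resource_id = Column(String(255))  # backing object, once created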
Re: [openstack-dev] [Heat] Short term scaling strategies for large Heat stacks
Excerpts from Mike Spreitzer's message of 2014-05-30 05:42:43 +0530:
> I like this one. As I remarked in the convergence discussion, I think the first step there is a DB schema change to separate desired and observed state. Once that is done, failure on one resource need not wedge a stack; non-dependent resources (like the peer compute nodes) can still be created.

It's not just the observed state that you need in the database to resume. You also need the parameters and the template snippet that have been successfully applied.
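In terms of the earlier hypothetical model, that would mean also persisting what was actually applied, along these lines (again, assumed names only, not actual Heat schema):

    from sqlalchemy import Column, Integer, String, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Resource(Base):
        # Alongside the observed state, persist exactly what was applied,
        # so a resumed update can diff the new template and parameters
        # against what this resource actually got.
        __tablename__ = 'resource'

        id = Column(Integer, primary_key=True)
        observed_state = Column(String(32))  # e.g. CREATE_COMPLETE, ERROR
        applied_definition = Column(Text)    # template snippet actually realized
        applied_parameters = Column(Text)    # JSON-encoded parameter values used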