Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
On 12 August 2014 08:35, Zane Bitter wrote:
> This sounds like the same figure I heard at the design summit; did the DB
> call optimisation work that Steve Baker did immediately after that not have
> any effect?

It helped a lot - I'm not sure where heat tops out now - I'm not aware of
rigorous benchmarks at this stage. I'm hoping we can get a large scale
integration test (virtual machine based) up soon, running periodically.
Ideally we'd have a microtest in the gate.

>> That was the issue. So we fixed that bug, but we never un-reverted
>> the patch that forks enough engines to use up all the CPUs on a box
>> by default. That would likely help a lot with metadata access speed
>> (we could manually do it in TripleO but we tend to push defaults. :)
>
> Right, and we decided we wouldn't because it's wrong to do that to people by
> default. In some cases the optimal running configuration for TripleO will
> differ from the friendliest out-of-the-box configuration for Heat users in
> general, and in those cases - of which this is one - TripleO will need to
> specify the configuration.

So - thanks for being clear about this (is it in the deployer docs for
heat?). That said, nova, neutron and other projects are defaulting to
one-worker-per-core, so I'm surprised that heat considers this
inappropriate, but our other APIs consider it appropriate :) What's
different about heat that makes this a bad default?

-Rob

--
Robert Collins
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
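[Editor's note: for readers following along, a deployer who does want heat-engine to fork one worker per core can set that explicitly rather than waiting on the default - a sketch of the relevant heat.conf stanza, assuming the `num_engine_workers` option (verify the option name against your Heat release):]

```ini
# /etc/heat/heat.conf -- sketch only; option name assumed, check your release.
[DEFAULT]
# Fork one heat-engine worker per CPU, mirroring the one-worker-per-core
# defaults mentioned above for nova/neutron (e.g. on an 8-core box):
num_engine_workers = 8
```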
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
On 12/08/14 06:20, Clint Byrum wrote:
> Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
>> On 11/08/14 10:46, Clint Byrum wrote:
>>> Right now we're stuck with an update that just doesn't work. It isn't
>>> just about update-failure-recovery, which is coming along nicely, but
>>> it is also about the lack of signals to control rebuild, poor support
>>> for addressing machines as groups, and unacceptable performance in
>>> large stacks.
>>
>> Are there blueprints/bugs filed for all of these issues?
>>
> Convergence addresses the poor performance for large stacks in general.
> We also have this:
>
> https://bugs.launchpad.net/heat/+bug/1306743
>
> Which shows how slow metadata access can get. I have worked on patches
> but haven't been able to complete them. We made big strides but we are
> at a point where 40 nodes polling Heat every 30s is too much for one CPU
> to handle. When we scaled Heat out onto more CPUs on one box by forking
> we ran into eventlet issues. We also ran into issues because even with
> many processes we can only use one to resolve templates for a single
> stack during update, which was also excessively slow.
>
> We haven't been able to come back around to those yet, but you can see
> where this has turned into a bit of a rat hole of optimization.
>
> action-aware-sw-config is sort of what we want for rebuild. We
> collaborated with the trove devs on how to also address it for resize
> a while back but I have lost track of that work as it has taken a back
> seat to more pressing issues.

We were discussing offloading metadata polling to a tempURL swift object;
that would certainly address the metadata polling scaling problem. But it
could also help with an out-of-band ansible workflow: anything (i.e.
Ansible) could push changed data to the swift object too.
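[Editor's note: for reference, a Swift tempURL signature is just an HMAC-SHA1 over the request method, expiry time, and object path - a minimal sketch, with made-up account, container, and key values:]

```python
import hmac
import time
from hashlib import sha1

def make_temp_url(method, path, key, duration_secs):
    """Build a Swift tempURL query string per the standard tempurl
    middleware scheme: HMAC-SHA1 over "METHOD\\nEXPIRES\\nPATH"."""
    expires = int(time.time() + duration_secs)
    body = '%s\n%s\n%s' % (method, expires, path)
    sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
    return '%s?temp_url_sig=%s&temp_url_expires=%s' % (path, sig, expires)

# Hypothetical object holding one server's metadata, polled by
# os-collect-config without ever touching the Heat API:
url = make_temp_url('GET', '/v1/AUTH_tripleo/metadata/server-0001',
                    'secret-tempurl-key', 3600)
```

Because the signed URL needs no keystone token, each node can poll (and Ansible can push) without adding any load to the Heat engines.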
And if you wanted to ensure that heat didn't overwrite that during an
accidental heat stack-update then you could configure os-collect-config to
poll from 2 swift objects, one for heat and one for manual updates. The
manual object could take precedence over the heat one for metadata merging,
which could give you a nice fine-grained override mechanism.

> Addressing groups is a general problem that I've had a hard time
> articulating in the past. Tomas Sedovic has done a good job with this
> TripleO spec, but I don't know that we've asked for an explicit change
> in a bug or spec in Heat just yet:
>
> https://review.openstack.org/#/c/97939/
>
> There are a number of other issues noted in that spec which are already
> addressed in Heat, but require refactoring in TripleO's templates and
> tools, and that work continues.

I'll follow up the potential solutions in the other thread:
http://lists.openstack.org/pipermail/openstack-dev/2014-August/042313.html

> The point remains: we need something that works now, and doing an
> alternate implementation for updates is actually faster than addressing
> all of these issues.

Thanks, that was a good summary of the issues, and I do appreciate the need
for both tactical and strategic solutions.
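[Editor's note: the precedence scheme described above could be as simple as a last-wins merge across collector sources - a hypothetical sketch, not os-collect-config's actual merge behaviour; all key names are invented:]

```python
def merge_metadata(sources):
    """Merge metadata dicts in priority order: later sources override
    earlier ones, key by key (shallow, last-wins merge)."""
    merged = {}
    for source in sources:
        merged.update(source)
    return merged

# Hypothetical payloads: the heat-managed swift object, then the
# manually-managed override object, which takes precedence.
heat_md = {'ntp_server': '10.0.0.1', 'image': 'overcloud-compute-1.2'}
manual_md = {'image': 'overcloud-compute-1.3-hotfix'}

result = merge_metadata([heat_md, manual_md])
# Manual values win where set; heat's values survive everywhere else.
```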
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
Excerpts from Zane Bitter's message of 2014-08-11 13:35:44 -0700:
> On 11/08/14 14:49, Clint Byrum wrote:
> > Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:
> >> On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
> >>> Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
> >>>> On 11/08/14 10:46, Clint Byrum wrote:
> >>>>> Right now we're stuck with an update that just doesn't work. It isn't
> >>>>> just about update-failure-recovery, which is coming along nicely, but
> >>>>> it is also about the lack of signals to control rebuild, poor support
> >>>>> for addressing machines as groups, and unacceptable performance in
> >>>>> large stacks.
> >>>>
> >>>> Are there blueprints/bugs filed for all of these issues?
> >>>
> >>> Convergence addresses the poor performance for large stacks in general.
> >>> We also have this:
> >>>
> >>> https://bugs.launchpad.net/heat/+bug/1306743
> >>>
> >>> Which shows how slow metadata access can get. I have worked on patches
> >>> but haven't been able to complete them. We made big strides but we are
> >>> at a point where 40 nodes polling Heat every 30s is too much for one CPU
>
> This sounds like the same figure I heard at the design summit; did the
> DB call optimisation work that Steve Baker did immediately after that
> not have any effect?

Steve's work got us to 40. From 7.

> >>> to handle. When we scaled Heat out onto more CPUs on one box by forking
> >>> we ran into eventlet issues. We also ran into issues because even with
> >>> many processes we can only use one to resolve templates for a single
> >>> stack during update, which was also excessively slow.
> >> Related to this, and a discussion we had recently at the TripleO meetup is
> >> this spec I raised today:
> >>
> >> https://review.openstack.org/#/c/113296/
> >>
> >> It's following up on the idea that we could potentially address (or at
> >> least mitigate, pending the fully convergence-ified heat) some of these
> >> scalability concerns, if TripleO moves from the one-giant-template model
> >> to a more modular nested-stack/provider model (e.g. what Tomas has been
> >> working on)
> >>
> >> I've not got into enough detail on that yet to be sure if it's achievable
> >> for Juno, but it seems initially to be complex-but-doable.
> >>
> >> I'd welcome feedback on that idea and how it may fit in with the more
> >> granular convergence-engine model.
> >>
> >> Can you link to the eventlet/forking issues bug please? I thought since
> >> bug #1321303 was fixed that multiple engines and multiple workers should
> >> work OK, and obviously that being true is a precondition to expending
> >> significant effort on the nested stack decoupling plan above.
> >>
> >
> > That was the issue. So we fixed that bug, but we never un-reverted
> > the patch that forks enough engines to use up all the CPUs on a box
> > by default. That would likely help a lot with metadata access speed
> > (we could manually do it in TripleO but we tend to push defaults. :)
>
> Right, and we decided we wouldn't because it's wrong to do that to
> people by default. In some cases the optimal running configuration for
> TripleO will differ from the friendliest out-of-the-box configuration
> for Heat users in general, and in those cases - of which this is one -
> TripleO will need to specify the configuration.

Whether or not the default should be to fork 1 process per CPU is a debate
for another time. The point is, we can safely use the forking in Heat now
to perhaps improve performance of metadata polling.
Chasing that, and other optimizations, has not led us to a place where we
can get to, say, 100 real nodes _today_. We're chasing another way to get
to the scale and capability we need _today_, in much the same way we did
with merge.py. We'll find the way to get it done more elegantly as time
permits.
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
On 11/08/14 14:49, Clint Byrum wrote:
> Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:
>> On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
>>> Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
>>>> On 11/08/14 10:46, Clint Byrum wrote:
>>>>> Right now we're stuck with an update that just doesn't work. It isn't
>>>>> just about update-failure-recovery, which is coming along nicely, but
>>>>> it is also about the lack of signals to control rebuild, poor support
>>>>> for addressing machines as groups, and unacceptable performance in
>>>>> large stacks.
>>>>
>>>> Are there blueprints/bugs filed for all of these issues?
>>>
>>> Convergence addresses the poor performance for large stacks in general.
>>> We also have this:
>>>
>>> https://bugs.launchpad.net/heat/+bug/1306743
>>>
>>> Which shows how slow metadata access can get. I have worked on patches
>>> but haven't been able to complete them. We made big strides but we are
>>> at a point where 40 nodes polling Heat every 30s is too much for one CPU

This sounds like the same figure I heard at the design summit; did the DB
call optimisation work that Steve Baker did immediately after that not
have any effect?

>>> to handle. When we scaled Heat out onto more CPUs on one box by forking
>>> we ran into eventlet issues. We also ran into issues because even with
>>> many processes we can only use one to resolve templates for a single
>>> stack during update, which was also excessively slow.
>>
>> Related to this, and a discussion we had recently at the TripleO meetup is
>> this spec I raised today:
>>
>> https://review.openstack.org/#/c/113296/
>>
>> It's following up on the idea that we could potentially address (or at
>> least mitigate, pending the fully convergence-ified heat) some of these
>> scalability concerns, if TripleO moves from the one-giant-template model
>> to a more modular nested-stack/provider model (e.g. what Tomas has been
>> working on)
>>
>> I've not got into enough detail on that yet to be sure if it's achievable
>> for Juno, but it seems initially to be complex-but-doable.
>> I'd welcome feedback on that idea and how it may fit in with the more
>> granular convergence-engine model.
>>
>> Can you link to the eventlet/forking issues bug please? I thought since
>> bug #1321303 was fixed that multiple engines and multiple workers should
>> work OK, and obviously that being true is a precondition to expending
>> significant effort on the nested stack decoupling plan above.
>
> That was the issue. So we fixed that bug, but we never un-reverted
> the patch that forks enough engines to use up all the CPUs on a box
> by default. That would likely help a lot with metadata access speed
> (we could manually do it in TripleO but we tend to push defaults. :)

Right, and we decided we wouldn't because it's wrong to do that to people
by default. In some cases the optimal running configuration for TripleO
will differ from the friendliest out-of-the-box configuration for Heat
users in general, and in those cases - of which this is one - TripleO will
need to specify the configuration.

cheers,
Zane.
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:
> On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
> > Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
> > > On 11/08/14 10:46, Clint Byrum wrote:
> > > > Right now we're stuck with an update that just doesn't work. It isn't
> > > > just about update-failure-recovery, which is coming along nicely, but
> > > > it is also about the lack of signals to control rebuild, poor support
> > > > for addressing machines as groups, and unacceptable performance in
> > > > large stacks.
> > >
> > > Are there blueprints/bugs filed for all of these issues?
> > >
> >
> > Convergence addresses the poor performance for large stacks in general.
> > We also have this:
> >
> > https://bugs.launchpad.net/heat/+bug/1306743
> >
> > Which shows how slow metadata access can get. I have worked on patches
> > but haven't been able to complete them. We made big strides but we are
> > at a point where 40 nodes polling Heat every 30s is too much for one CPU
> > to handle. When we scaled Heat out onto more CPUs on one box by forking
> > we ran into eventlet issues. We also ran into issues because even with
> > many processes we can only use one to resolve templates for a single
> > stack during update, which was also excessively slow.
>
> Related to this, and a discussion we had recently at the TripleO meetup is
> this spec I raised today:
>
> https://review.openstack.org/#/c/113296/
>
> It's following up on the idea that we could potentially address (or at
> least mitigate, pending the fully convergence-ified heat) some of these
> scalability concerns, if TripleO moves from the one-giant-template model
> to a more modular nested-stack/provider model (e.g. what Tomas has been
> working on)
>
> I've not got into enough detail on that yet to be sure if it's achievable
> for Juno, but it seems initially to be complex-but-doable.
> I'd welcome feedback on that idea and how it may fit in with the more
> granular convergence-engine model.
>
> Can you link to the eventlet/forking issues bug please? I thought since
> bug #1321303 was fixed that multiple engines and multiple workers should
> work OK, and obviously that being true is a precondition to expending
> significant effort on the nested stack decoupling plan above.
>

That was the issue. So we fixed that bug, but we never un-reverted the
patch that forks enough engines to use up all the CPUs on a box by
default. That would likely help a lot with metadata access speed (we could
manually do it in TripleO but we tend to push defaults. :)
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
> Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
> > On 11/08/14 10:46, Clint Byrum wrote:
> > > Right now we're stuck with an update that just doesn't work. It isn't
> > > just about update-failure-recovery, which is coming along nicely, but
> > > it is also about the lack of signals to control rebuild, poor support
> > > for addressing machines as groups, and unacceptable performance in
> > > large stacks.
> >
> > Are there blueprints/bugs filed for all of these issues?
> >
>
> Convergence addresses the poor performance for large stacks in general.
> We also have this:
>
> https://bugs.launchpad.net/heat/+bug/1306743
>
> Which shows how slow metadata access can get. I have worked on patches
> but haven't been able to complete them. We made big strides but we are
> at a point where 40 nodes polling Heat every 30s is too much for one CPU
> to handle. When we scaled Heat out onto more CPUs on one box by forking
> we ran into eventlet issues. We also ran into issues because even with
> many processes we can only use one to resolve templates for a single
> stack during update, which was also excessively slow.

Related to this, and a discussion we had recently at the TripleO meetup is
this spec I raised today:

https://review.openstack.org/#/c/113296/

It's following up on the idea that we could potentially address (or at
least mitigate, pending the fully convergence-ified heat) some of these
scalability concerns, if TripleO moves from the one-giant-template model
to a more modular nested-stack/provider model (e.g. what Tomas has been
working on)

I've not got into enough detail on that yet to be sure if it's achievable
for Juno, but it seems initially to be complex-but-doable.

I'd welcome feedback on that idea and how it may fit in with the more
granular convergence-engine model.

Can you link to the eventlet/forking issues bug please?
I thought since bug #1321303 was fixed that multiple engines and multiple
workers should work OK, and obviously that being true is a precondition to
expending significant effort on the nested stack decoupling plan above.

Steve
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
> On 11/08/14 10:46, Clint Byrum wrote:
> > Right now we're stuck with an update that just doesn't work. It isn't
> > just about update-failure-recovery, which is coming along nicely, but
> > it is also about the lack of signals to control rebuild, poor support
> > for addressing machines as groups, and unacceptable performance in
> > large stacks.
>
> Are there blueprints/bugs filed for all of these issues?
>

Convergence addresses the poor performance for large stacks in general.
We also have this:

https://bugs.launchpad.net/heat/+bug/1306743

Which shows how slow metadata access can get. I have worked on patches
but haven't been able to complete them. We made big strides but we are at
a point where 40 nodes polling Heat every 30s is too much for one CPU to
handle. When we scaled Heat out onto more CPUs on one box by forking we
ran into eventlet issues. We also ran into issues because even with many
processes we can only use one to resolve templates for a single stack
during update, which was also excessively slow.

We haven't been able to come back around to those yet, but you can see
where this has turned into a bit of a rat hole of optimization.

action-aware-sw-config is sort of what we want for rebuild. We
collaborated with the trove devs on how to also address it for resize a
while back but I have lost track of that work as it has taken a back seat
to more pressing issues.

Addressing groups is a general problem that I've had a hard time
articulating in the past. Tomas Sedovic has done a good job with this
TripleO spec, but I don't know that we've asked for an explicit change in
a bug or spec in Heat just yet:

https://review.openstack.org/#/c/97939/

There are a number of other issues noted in that spec which are already
addressed in Heat, but require refactoring in TripleO's templates and
tools, and that work continues.
The point remains: we need something that works now, and doing an
alternate implementation for updates is actually faster than addressing
all of these issues.
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
On 11/08/14 10:46, Clint Byrum wrote:
> Right now we're stuck with an update that just doesn't work. It isn't
> just about update-failure-recovery, which is coming along nicely, but
> it is also about the lack of signals to control rebuild, poor support
> for addressing machines as groups, and unacceptable performance in
> large stacks.

Are there blueprints/bugs filed for all of these issues?

-ZB
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
Excerpts from Steve Baker's message of 2014-08-10 15:33:26 -0700:
> On 02/08/14 04:07, Allison Randal wrote:
> > A few of us have been independently experimenting with Ansible as a
> > backend for TripleO, and have just decided to try experimenting
> > together. I've chatted with Robert, and he says that TripleO was always
> > intended to have pluggable backends (CM layer), and just never had
> > anyone interested in working on them. (I see it now, even in the early
> > docs and talks, I guess I just couldn't see the forest for the trees.)
> > So, the work is in line with the overall goals of the TripleO project.
> >
> > We're starting with a tiny scope, focused only on updating a running
> > TripleO deployment, so our first work is in:
> >
> > - Create an Ansible Dynamic Inventory plugin to extract metadata from Heat
> > - Improve/extend the Ansible nova_compute Cloud Module (or create a new
> > one), for Nova rebuild
> > - Develop a minimal handoff from Heat to Ansible, particularly focused
> > on the interactions between os-collect-config and Ansible
> >
> > We're merging our work in this repo, until we figure out where it should
> > live:
> >
> > https://github.com/allisonrandal/tripleo-ansible
> >
> > We've set ourselves one week as the first sanity-check to see whether
> > this idea is going anywhere, and we may scrap it all at that point. But,
> > it seems best to be totally transparent about the idea from the start,
> > so no-one is surprised later.
>
> Having pluggable backends for configuration seems like a good idea, and
> Ansible is a great choice for the first alternative backend.

TripleO is intended to be loosely coupled for many components, not just
in-instance configuration.

> However what this repo seems to be doing at the moment is bypassing heat
> to do a stack update, and I can only assume there is an eventual goal to
> not use heat at all for stack orchestration too.
> Granted, until blueprint update-failure-recovery lands[1] then doing a
> stack-update is about as much fun as russian roulette. But this effort
> is tactical rather than strategic, especially given TripleO's mission
> statement.

We intend to stay modular. Ansible won't replace Heat from end to end.

Right now we're stuck with an update that just doesn't work. It isn't just
about update-failure-recovery, which is coming along nicely, but it is
also about the lack of signals to control rebuild, poor support for
addressing machines as groups, and unacceptable performance in large
stacks.

We remain committed to driving these things into Heat, which will allow us
to address these things the way a large scale operation will need to. But
until we can land those things in Heat, we need something more flexible
like Ansible to go around Heat and do things in the exact order we need
them done.

Ansible doesn't have a REST API, which is a non-starter for modern
automation, but the need to control workflow is greater than the need to
have a REST API at this point.

> If I were to use Ansible for TripleO configuration I would start with
> something like the following:
> * Install an ansible software-config hook onto the image to be triggered
> by os-refresh-config[2][3]
> * Incrementally replace StructuredConfig resources in
> tripleo-heat-templates with SoftwareConfig resources that include the
> ansible playbooks via get_file
> * The above can start in a fork of tripleo-heat-templates, but can
> eventually be structured using resource providers so that the deployer
> chooses what configuration backend to use by selecting the environment
> file that contains the appropriate config resources
>
> Now you have a cloud orchestrated by heat and configured by Ansible.
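[Editor's note: the resource-provider selection in the quoted plan uses Heat's standard environment-file mechanism; a hypothetical sketch, where the `OS::TripleO::SoftwareConfig` alias and file path are invented for illustration:]

```yaml
# ansible-config-env.yaml -- hypothetical environment file; the deployer
# picks the configuration backend by passing this file to the stack
# create/update call (e.g. `heat stack-create -e ansible-config-env.yaml`).
resource_registry:
  # Map the templates' config resource alias to an Ansible-backed
  # provider template instead of the default implementation.
  OS::TripleO::SoftwareConfig: providers/ansible-software-config.yaml
```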
> If it is still deemed necessary to do an out-of-band update to the stack
> then you're in a much better position to do an ansible push, since you
> can use the same playbook files that heat used to bring up the stack.

That would be a good plan if we wanted to fix issues with os-*-config, but
that is the opposite of reality. We are working around Heat orchestration
issues with Ansible.
Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO
On 02/08/14 04:07, Allison Randal wrote:
> A few of us have been independently experimenting with Ansible as a
> backend for TripleO, and have just decided to try experimenting
> together. I've chatted with Robert, and he says that TripleO was always
> intended to have pluggable backends (CM layer), and just never had
> anyone interested in working on them. (I see it now, even in the early
> docs and talks, I guess I just couldn't see the forest for the trees.)
> So, the work is in line with the overall goals of the TripleO project.
>
> We're starting with a tiny scope, focused only on updating a running
> TripleO deployment, so our first work is in:
>
> - Create an Ansible Dynamic Inventory plugin to extract metadata from Heat
> - Improve/extend the Ansible nova_compute Cloud Module (or create a new
> one), for Nova rebuild
> - Develop a minimal handoff from Heat to Ansible, particularly focused
> on the interactions between os-collect-config and Ansible
>
> We're merging our work in this repo, until we figure out where it should
> live:
>
> https://github.com/allisonrandal/tripleo-ansible
>
> We've set ourselves one week as the first sanity-check to see whether
> this idea is going anywhere, and we may scrap it all at that point. But,
> it seems best to be totally transparent about the idea from the start,
> so no-one is surprised later.

Having pluggable backends for configuration seems like a good idea, and
Ansible is a great choice for the first alternative backend.

However what this repo seems to be doing at the moment is bypassing heat
to do a stack update, and I can only assume there is an eventual goal to
not use heat at all for stack orchestration too.

Granted, until blueprint update-failure-recovery lands[1] then doing a
stack-update is about as much fun as russian roulette. But this effort is
tactical rather than strategic, especially given TripleO's mission
statement.
If I were to use Ansible for TripleO configuration I would start with
something like the following:
* Install an ansible software-config hook onto the image to be triggered
  by os-refresh-config[2][3]
* Incrementally replace StructuredConfig resources in
  tripleo-heat-templates with SoftwareConfig resources that include the
  ansible playbooks via get_file
* The above can start in a fork of tripleo-heat-templates, but can
  eventually be structured using resource providers so that the deployer
  chooses what configuration backend to use by selecting the environment
  file that contains the appropriate config resources

Now you have a cloud orchestrated by heat and configured by Ansible. If it
is still deemed necessary to do an out-of-band update to the stack then
you're in a much better position to do an ansible push, since you can use
the same playbook files that heat used to bring up the stack.

[1] https://review.openstack.org/#/c/112938/
[2] https://review.openstack.org/#/c/95937/
[3] http://git.openstack.org/cgit/openstack/heat-templates/tree/hot/software-config/elements/heat-config
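[Editor's note: the "Ansible Dynamic Inventory plugin" mentioned in the thread is, at its core, a script that emits JSON groups and hostvars when invoked with --list. A minimal sketch of that output shape, with invented server records standing in for metadata extracted from Heat - this is not the actual tripleo-ansible plugin:]

```python
import json

def build_inventory(servers):
    """Turn (name, group, ip, metadata) records - e.g. scraped from Heat
    stack resources - into the JSON structure Ansible expects from a
    dynamic inventory script's --list output."""
    inventory = {'_meta': {'hostvars': {}}}
    for name, group, ip, metadata in servers:
        inventory.setdefault(group, {'hosts': []})['hosts'].append(name)
        # Per-host variables, including the address Ansible should SSH to.
        inventory['_meta']['hostvars'][name] = dict(metadata,
                                                    ansible_ssh_host=ip)
    return inventory

# Invented example records standing in for Heat-provided server metadata:
servers = [
    ('overcloud-control-0', 'controller', '192.0.2.10', {'role': 'control'}),
    ('overcloud-compute-0', 'compute', '192.0.2.20', {'role': 'compute'}),
]
print(json.dumps(build_inventory(servers), indent=2))
```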