Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Clint Byrum
Excerpts from Steve Baker's message of 2014-08-10 15:33:26 -0700:
 On 02/08/14 04:07, Allison Randal wrote:
  A few of us have been independently experimenting with Ansible as a
  backend for TripleO, and have just decided to try experimenting
  together. I've chatted with Robert, and he says that TripleO was always
  intended to have pluggable backends (CM layer), and just never had
  anyone interested in working on them. (I see it now, even in the early
  docs and talks, I guess I just couldn't see the forest for the trees.)
  So, the work is in line with the overall goals of the TripleO project.
 
  We're starting with a tiny scope, focused only on updating a running
  TripleO deployment, so our first work is in:
 
  - Create an Ansible Dynamic Inventory plugin to extract metadata from Heat
  - Improve/extend the Ansible nova_compute Cloud Module (or create a new
  one), for Nova rebuild
  - Develop a minimal handoff from Heat to Ansible, particularly focused
  on the interactions between os-collect-config and Ansible
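  
  To make the first item concrete, here is a minimal sketch of a Heat-backed
  dynamic inventory. It assumes python-heatclient, a pre-obtained token and
  Heat endpoint in the environment, and Ansible's standard --list/--host JSON
  contract; it is an illustration only, not the plugin being developed in the
  repo mentioned below.
  
      #!/usr/bin/env python
      # Illustrative Heat-backed dynamic inventory; HEAT_URL, OS_AUTH_TOKEN and
      # OVERCLOUD_STACK are assumed environment variables, not an established
      # convention. A real plugin would authenticate properly and resolve
      # reachable addresses (e.g. via Nova) instead of returning server UUIDs.
      import json
      import os
      import sys
  
      from heatclient.client import Client as HeatClient
  
  
      def build_inventory(stack_name):
          heat = HeatClient('1',
                            endpoint=os.environ['HEAT_URL'],
                            token=os.environ['OS_AUTH_TOKEN'])
          inventory = {'_meta': {'hostvars': {}}}
          for res in heat.resources.list(stack_name):
              if res.resource_type != 'OS::Nova::Server':
                  continue
              # Group servers by their logical resource name (e.g. controller0).
              group = inventory.setdefault(res.resource_name, {'hosts': []})
              group['hosts'].append(res.physical_resource_id)
              inventory['_meta']['hostvars'][res.physical_resource_id] = {
                  'heat_resource_name': res.resource_name,
              }
          return inventory
  
  
      if __name__ == '__main__':
          if len(sys.argv) > 1 and sys.argv[1] == '--host':
              # Per-host vars are already provided via _meta above.
              print(json.dumps({}))
          else:
              stack = os.environ.get('OVERCLOUD_STACK', 'overcloud')
              print(json.dumps(build_inventory(stack)))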
 
  We're merging our work in this repo, until we figure out where it should
  live:
 
  https://github.com/allisonrandal/tripleo-ansible
 
  We've set ourselves one week as the first sanity-check to see whether
  this idea is going anywhere, and we may scrap it all at that point. But,
  it seems best to be totally transparent about the idea from the start,
  so no-one is surprised later.
 
 Having pluggable backends for configuration seems like a good idea, and
 Ansible is a great choice for the first alternative backend.
 

TripleO is intended to be loosely coupled across many of its components, not
just in-instance configuration.

 However, what this repo seems to be doing at the moment is bypassing heat
 to do a stack update, and I can only assume there is an eventual goal to
 not use heat at all for stack orchestration too.


 Granted, until blueprint update-failure-recovery lands[1], doing a
 stack-update is about as much fun as Russian roulette. But this effort
 is tactical rather than strategic, especially given TripleO's mission
 statement.
 

We intend to stay modular. Ansible won't replace Heat from end to end.

Right now we're stuck with an update that just doesn't work. It isn't
just about update-failure-recovery, which is coming along nicely, but
it is also about the lack of signals to control rebuild, poor support
for addressing machines as groups, and unacceptable performance in
large stacks.

We remain committed to driving these features into Heat, which will allow
us to address them the way a large-scale operation will need to.

But until we can land those things in Heat, we need something more
flexible like Ansible to go around Heat and do things in the exact
order we need them done. Ansible doesn't have a REST API, which would
normally be a non-starter for modern automation, but right now the need
to control workflow is greater than the need for a REST API.

 If I were to use Ansible for TripleO configuration I would start with
 something like the following:
 * Install an ansible software-config hook onto the image to be triggered
 by os-refresh-config[2][3]
 * Incrementally replace StructuredConfig resources in
 tripleo-heat-templates with SoftwareConfig resources that include the
 ansible playbooks via get_file
 * The above can start in a fork of tripleo-heat-templates, but can
 eventually be structured using resource providers so that the deployer
 chooses what configuration backend to use by selecting the environment
 file that contains the appropriate config resources
 
 Now you have a cloud orchestrated by heat and configured by Ansible. If
 it is still deemed necessary to do an out-of-band update to the stack
 then you're in a much better position to do an ansible push, since you
 can use the same playbook files that heat used to bring up the stack.
 

That would be a good plan if the issues we were working around were in
os-*-config, but they are not. We are working around Heat orchestration
issues with Ansible.



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Zane Bitter

On 11/08/14 10:46, Clint Byrum wrote:

Right now we're stuck with an update that just doesn't work. It isn't
just about update-failure-recovery, which is coming along nicely, but
it is also about the lack of signals to control rebuild, poor support
for addressing machines as groups, and unacceptable performance in
large stacks.


Are there blueprints/bugs filed for all of these issues?

-ZB



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
 On 11/08/14 10:46, Clint Byrum wrote:
  Right now we're stuck with an update that just doesn't work. It isn't
  just about update-failure-recovery, which is coming along nicely, but
  it is also about the lack of signals to control rebuild, poor support
  for addressing machines as groups, and unacceptable performance in
  large stacks.
 
 Are there blueprints/bugs filed for all of these issues?
 

Convergence addresses the poor performance for large stacks in general.
We also have this:

https://bugs.launchpad.net/heat/+bug/1306743

Which shows how slow metadata access can get. I have worked on patches
but haven't been able to complete them. We made big strides but we are
at a point where 40 nodes polling Heat every 30s is too much for one CPU
to handle. When we scaled Heat out onto more CPUs on one box by forking
we ran into eventlet issues. We also ran into issues because even with
many processes we can only use one to resolve templates for a single
stack during update, which was also excessively slow.
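
For a rough sense of scale (illustrative arithmetic only; the 40-node /
30-second figures are the ones above, the 100-node case is an extrapolation):

    # Back-of-the-envelope polling load; 40 nodes / 30s are the figures above,
    # the 100-node case is just an extrapolation.
    def polls_per_second(nodes, interval_s=30):
        return nodes / float(interval_s)

    for nodes in (40, 100):
        print("%d nodes -> %.1f metadata requests/sec to heat" %
              (nodes, polls_per_second(nodes)))
    # 40 nodes -> 1.3 requests/sec (already saturating one CPU, per the bug above)
    # 100 nodes -> 3.3 requests/sec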

We haven't been able to come back around to those yet, but you can see
where this has turned into a bit of a rat hole of optimization.

action-aware-sw-config is sort of what we want for rebuild. We
collaborated with the trove devs on how to also address it for resize
a while back but I have lost track of that work as it has taken a back
seat to more pressing issues.

Addressing groups is a general problem that I've had a hard time
articulating in the past. Tomas Sedovic has done a good job with this
TripleO spec, but I don't know that we've asked for an explicit change
in a bug or spec in Heat just yet:

https://review.openstack.org/#/c/97939/

There are a number of other issues noted in that spec which are already
addressed in Heat, but require refactoring in TripleO's templates and
tools, and that work continues.

The point remains: we need something that works now, and doing an
alternate implementation for updates is actually faster than addressing
all of these issues.



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Steven Hardy
On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
 Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
  On 11/08/14 10:46, Clint Byrum wrote:
   Right now we're stuck with an update that just doesn't work. It isn't
   just about update-failure-recovery, which is coming along nicely, but
   it is also about the lack of signals to control rebuild, poor support
   for addressing machines as groups, and unacceptable performance in
   large stacks.
  
  Are there blueprints/bugs filed for all of these issues?
  
 
 Convergence addresses the poor performance for large stacks in general.
 We also have this:
 
 https://bugs.launchpad.net/heat/+bug/1306743
 
 Which shows how slow metadata access can get. I have worked on patches
 but haven't been able to complete them. We made big strides but we are
 at a point where 40 nodes polling Heat every 30s is too much for one CPU
 to handle. When we scaled Heat out onto more CPUs on one box by forking
 we ran into eventlet issues. We also ran into issues because even with
 many processes we can only use one to resolve templates for a single
 stack during update, which was also excessively slow.

Related to this, and a discussion we had recently at the TripleO meetup is
this spec I raised today:

https://review.openstack.org/#/c/113296/

It's following up on the idea that we could potentially address (or at
least mitigate, pending the fully convergence-ified heat) some of these
scalability concerns, if TripleO moves from the one-giant-template model
to a more modular nested-stack/provider model (e.g. what Tomas has been
working on)

I've not got into enough detail on that yet to be sure if it's achievable
for Juno, but it seems initially to be complex-but-doable.

I'd welcome feedback on that idea and how it may fit in with the more
granular convergence-engine model.

Can you link to the eventlet/forking issues bug please?  I thought since
bug #1321303 was fixed that multiple engines and multiple workers should
work OK, and obviously that being true is a precondition to expending
significant effort on the nested stack decoupling plan above.

Steve



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Clint Byrum
Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:
 On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
  Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
   On 11/08/14 10:46, Clint Byrum wrote:
Right now we're stuck with an update that just doesn't work. It isn't
just about update-failure-recovery, which is coming along nicely, but
it is also about the lack of signals to control rebuild, poor support
for addressing machines as groups, and unacceptable performance in
large stacks.
   
   Are there blueprints/bugs filed for all of these issues?
   
  
  Convergence addresses the poor performance for large stacks in general.
  We also have this:
  
  https://bugs.launchpad.net/heat/+bug/1306743
  
  Which shows how slow metadata access can get. I have worked on patches
  but haven't been able to complete them. We made big strides but we are
  at a point where 40 nodes polling Heat every 30s is too much for one CPU
  to handle. When we scaled Heat out onto more CPUs on one box by forking
  we ran into eventlet issues. We also ran into issues because even with
  many processes we can only use one to resolve templates for a single
  stack during update, which was also excessively slow.
 
 Related to this, and a discussion we had recently at the TripleO meetup is
 this spec I raised today:
 
 https://review.openstack.org/#/c/113296/
 
 It's following up on the idea that we could potentially address (or at
 least mitigate, pending the fully convergence-ified heat) some of these
 scalability concerns, if TripleO moves from the one-giant-template model
 to a more modular nested-stack/provider model (e.g. what Tomas has been
 working on)
 
 I've not got into enough detail on that yet to be sure if it's achievable
 for Juno, but it seems initially to be complex-but-doable.
 
 I'd welcome feedback on that idea and how it may fit in with the more
 granular convergence-engine model.
 
 Can you link to the eventlet/forking issues bug please?  I thought since
 bug #1321303 was fixed that multiple engines and multiple workers should
 work OK, and obviously that being true is a precondition to expending
 significant effort on the nested stack decoupling plan above.
 

That was the issue. So we fixed that bug, but we never un-reverted
the patch that forks enough engines to use up all the CPUs on a box
by default. That would likely help a lot with metadata access speed
(we could manually do it in TripleO but we tend to push defaults. :)



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Zane Bitter

On 11/08/14 14:49, Clint Byrum wrote:

Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:

On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:

Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:

On 11/08/14 10:46, Clint Byrum wrote:

Right now we're stuck with an update that just doesn't work. It isn't
just about update-failure-recovery, which is coming along nicely, but
it is also about the lack of signals to control rebuild, poor support
for addressing machines as groups, and unacceptable performance in
large stacks.


Are there blueprints/bugs filed for all of these issues?



Convergence addresses the poor performance for large stacks in general.
We also have this:

https://bugs.launchpad.net/heat/+bug/1306743

Which shows how slow metadata access can get. I have worked on patches
but haven't been able to complete them. We made big strides but we are
at a point where 40 nodes polling Heat every 30s is too much for one CPU


This sounds like the same figure I heard at the design summit; did the 
DB call optimisation work that Steve Baker did immediately after that 
not have any effect?



to handle. When we scaled Heat out onto more CPUs on one box by forking
we ran into eventlet issues. We also ran into issues because even with
many processes we can only use one to resolve templates for a single
stack during update, which was also excessively slow.


Related to this, and a discussion we had recently at the TripleO meetup is
this spec I raised today:

https://review.openstack.org/#/c/113296/

It's following up on the idea that we could potentially address (or at
least mitigate, pending the fully convergence-ified heat) some of these
scalability concerns, if TripleO moves from the one-giant-template model
to a more modular nested-stack/provider model (e.g. what Tomas has been
working on)

I've not got into enough detail on that yet to be sure if it's achievable
for Juno, but it seems initially to be complex-but-doable.

I'd welcome feedback on that idea and how it may fit in with the more
granular convergence-engine model.

Can you link to the eventlet/forking issues bug please?  I thought since
bug #1321303 was fixed that multiple engines and multiple workers should
work OK, and obviously that being true is a precondition to expending
significant effort on the nested stack decoupling plan above.



That was the issue. So we fixed that bug, but we never un-reverted
the patch that forks enough engines to use up all the CPUs on a box
by default. That would likely help a lot with metadata access speed
(we could manually do it in TripleO but we tend to push defaults. :)


Right, and we decided we wouldn't because it's wrong to do that to 
people by default. In some cases the optimal running configuration for 
TripleO will differ from the friendliest out-of-the-box configuration 
for Heat users in general, and in those cases - of which this is one - 
TripleO will need to specify the configuration.


cheers,
Zane.



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-08-11 13:35:44 -0700:
 On 11/08/14 14:49, Clint Byrum wrote:
  Excerpts from Steven Hardy's message of 2014-08-11 11:40:07 -0700:
  On Mon, Aug 11, 2014 at 11:20:50AM -0700, Clint Byrum wrote:
  Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
  On 11/08/14 10:46, Clint Byrum wrote:
  Right now we're stuck with an update that just doesn't work. It isn't
  just about update-failure-recovery, which is coming along nicely, but
  it is also about the lack of signals to control rebuild, poor support
  for addressing machines as groups, and unacceptable performance in
  large stacks.
 
  Are there blueprints/bugs filed for all of these issues?
 
 
  Convergence addresses the poor performance for large stacks in general.
  We also have this:
 
  https://bugs.launchpad.net/heat/+bug/1306743
 
  Which shows how slow metadata access can get. I have worked on patches
  but haven't been able to complete them. We made big strides but we are
  at a point where 40 nodes polling Heat every 30s is too much for one CPU
 
 This sounds like the same figure I heard at the design summit; did the 
 DB call optimisation work that Steve Baker did immediately after that 
 not have any effect?
 

Steve's work got us to 40. From 7.

  to handle. When we scaled Heat out onto more CPUs on one box by forking
  we ran into eventlet issues. We also ran into issues because even with
  many processes we can only use one to resolve templates for a single
  stack during update, which was also excessively slow.
 
  Related to this, and a discussion we had recently at the TripleO meetup is
  this spec I raised today:
 
  https://review.openstack.org/#/c/113296/
 
  It's following up on the idea that we could potentially address (or at
  least mitigate, pending the fully convergence-ified heat) some of these
  scalability concerns, if TripleO moves from the one-giant-template model
  to a more modular nested-stack/provider model (e.g. what Tomas has been
  working on)
 
  I've not got into enough detail on that yet to be sure if it's achievable
  for Juno, but it seems initially to be complex-but-doable.
 
  I'd welcome feedback on that idea and how it may fit in with the more
  granular convergence-engine model.
 
  Can you link to the eventlet/forking issues bug please?  I thought since
  bug #1321303 was fixed that multiple engines and multiple workers should
  work OK, and obviously that being true is a precondition to expending
  significant effort on the nested stack decoupling plan above.
 
 
  That was the issue. So we fixed that bug, but we never un-reverted
  the patch that forks enough engines to use up all the CPUs on a box
  by default. That would likely help a lot with metadata access speed
  (we could manually do it in TripleO but we tend to push defaults. :)
 
 Right, and we decided we wouldn't because it's wrong to do that to 
 people by default. In some cases the optimal running configuration for 
 TripleO will differ from the friendliest out-of-the-box configuration 
 for Heat users in general, and in those cases - of which this is one - 
 TripleO will need to specify the configuration.
 

Whether or not the default should be to fork 1 process per CPU is a
debate for another time. The point is, we can safely use the forking in
Heat now to perhaps improve performance of metadata polling.
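
For illustration only, the per-CPU forking pattern being discussed looks
roughly like this generic sketch; it is not Heat's actual service code or
configuration:

    # Generic sketch of one worker process per CPU; not heat-engine's real
    # service launcher, just the pattern the reverted patch implemented.
    import multiprocessing


    def engine_worker(worker_id):
        # Placeholder for an engine/RPC main loop.
        print("engine worker %d started" % worker_id)


    if __name__ == '__main__':
        workers = multiprocessing.cpu_count()  # one process per CPU by default
        procs = [multiprocessing.Process(target=engine_worker, args=(i,))
                 for i in range(workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()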

Chasing that, and other optimizations, has not led us to a place where
we can get to, say, 100 real nodes _today_. We're chasing another way to
get to the scale and capability we need _today_, in much the same way
we did with merge.py. We'll find the way to get it done more elegantly
as time permits.



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Steve Baker
On 12/08/14 06:20, Clint Byrum wrote:
 Excerpts from Zane Bitter's message of 2014-08-11 08:16:56 -0700:
 On 11/08/14 10:46, Clint Byrum wrote:
 Right now we're stuck with an update that just doesn't work. It isn't
 just about update-failure-recovery, which is coming along nicely, but
 it is also about the lack of signals to control rebuild, poor support
 for addressing machines as groups, and unacceptable performance in
 large stacks.
 Are there blueprints/bugs filed for all of these issues?

 Convergence addresses the poor performance for large stacks in general.
 We also have this:

 https://bugs.launchpad.net/heat/+bug/1306743

 Which shows how slow metadata access can get. I have worked on patches
 but haven't been able to complete them. We made big strides but we are
 at a point where 40 nodes polling Heat every 30s is too much for one CPU
 to handle. When we scaled Heat out onto more CPUs on one box by forking
 we ran into eventlet issues. We also ran into issues because even with
 many processes we can only use one to resolve templates for a single
 stack during update, which was also excessively slow.

 We haven't been able to come back around to those yet, but you can see
 where this has turned into a bit of a rat hole of optimization.

 action-aware-sw-config is sort of what we want for rebuild. We
 collaborated with the trove devs on how to also address it for resize
 a while back but I have lost track of that work as it has taken a back
 seat to more pressing issues.

We were discussing offloading metadata polling to a tempURL swift
object; that would certainly deal with scaling metadata polling.
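
A rough sketch of what publishing metadata behind a TempURL could look like
(the container, object and key below are hypothetical; the signing is Swift's
standard TempURL HMAC-SHA1, and the key would need to be set as
X-Account-Meta-Temp-URL-Key on the account):

    # Sketch: build a Swift TempURL that os-collect-config could poll for
    # metadata. Names and the key are hypothetical; the signature format is
    # Swift's documented TempURL scheme.
    import hmac
    import time
    from hashlib import sha1


    def make_temp_url(swift_base, account_path, container, obj, key,
                      ttl=3600, method='GET'):
        expires = int(time.time()) + ttl
        path = '%s/%s/%s' % (account_path, container, obj)
        body = '%s\n%s\n%s' % (method, expires, path)
        sig = hmac.new(key.encode('utf-8'), body.encode('utf-8'),
                       sha1).hexdigest()
        return '%s%s?temp_url_sig=%s&temp_url_expires=%s' % (
            swift_base, path, sig, expires)


    print(make_temp_url('https://swift.example.com',
                        '/v1/AUTH_tripleo', 'metadata', 'controller0.json',
                        key='secret-temp-url-key'))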

This could also help with an out-of-band ansible workflow: anything
(i.e. Ansible) could push changed data to the swift object as well.
And if you wanted to ensure that heat didn't overwrite that during an
accidental heat stack-update, you could configure os-collect-config
to poll two swift objects, one for heat and one for manual updates.
The manual object could take precedence over the heat one during metadata
merging, which would give you a nice fine-grained override mechanism.
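
A sketch of that precedence, i.e. a merge where keys from the manual object
win over the heat one (illustrative only, not os-collect-config's actual
merging code; the sample data is made up):

    # Metadata from the "manual" swift object overrides the "heat" object on
    # a per-key basis; nested sections are merged recursively.
    def merge_metadata(heat_md, manual_md):
        merged = dict(heat_md)
        for key, value in manual_md.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_metadata(merged[key], value)
            else:
                merged[key] = value  # manual update wins
        return merged


    heat_md = {'ntp': {'servers': ['0.pool.ntp.org']}, 'image': 'overcloud-v1'}
    manual_md = {'image': 'overcloud-v2-hotfix'}  # pushed out-of-band, e.g. by Ansible
    print(merge_metadata(heat_md, manual_md))
    # {'ntp': {'servers': ['0.pool.ntp.org']}, 'image': 'overcloud-v2-hotfix'}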

 Addressing groups is a general problem that I've had a hard time
 articulating in the past. Tomas Sedovic has done a good job with this
 TripleO spec, but I don't know that we've asked for an explicit change
 in a bug or spec in Heat just yet:

 https://review.openstack.org/#/c/97939/

 There are a number of other issues noted in that spec which are already
 addressed in Heat, but require refactoring in TripleO's templates and
 tools, and that work continues.
I'll follow up on the potential solutions in the other thread:
http://lists.openstack.org/pipermail/openstack-dev/2014-August/042313.html

 The point remains: we need something that works now, and doing an
 alternate implementation for updates is actually faster than addressing
 all of these issues.
Thanks, that was a good summary of the issues, and I do appreciate the
need for both tactical and strategic solutions.




Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-11 Thread Robert Collins
On 12 August 2014 08:35, Zane Bitter zbit...@redhat.com wrote:

 This sounds like the same figure I heard at the design summit; did the DB
 call optimisation work that Steve Baker did immediately after that not have
 any effect?

It helped a lot - I'm not sure where heat tops out now; I'm not aware
of rigorous benchmarks at this stage. I'm hoping we can get a large-scale
integration test (virtual-machine based) running periodically soon.
Ideally we'd have a microtest in the gate.


 That was the issue. So we fixed that bug, but we never un-reverted
 the patch that forks enough engines to use up all the CPUs on a box
 by default. That would likely help a lot with metadata access speed
 (we could manually do it in TripleO but we tend to push defaults. :)


 Right, and we decided we wouldn't because it's wrong to do that to people by
 default. In some cases the optimal running configuration for TripleO will
 differ from the friendliest out-of-the-box configuration for Heat users in
 general, and in those cases - of which this is one - TripleO will need to
 specify the configuration.

So - thanks for being clear about this (is it in the deployer docs for heat?).

That said, nova, neutron and other projects are defaulting to
one-worker-per-core, so I'm surprised that heat considers this
inappropriate, but our other APIs consider it appropriate :) What's
different about heat that makes this a bad default?

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [TripleO][heat] a small experiment with Ansible in TripleO

2014-08-10 Thread Steve Baker
On 02/08/14 04:07, Allison Randal wrote:
 A few of us have been independently experimenting with Ansible as a
 backend for TripleO, and have just decided to try experimenting
 together. I've chatted with Robert, and he says that TripleO was always
 intended to have pluggable backends (CM layer), and just never had
 anyone interested in working on them. (I see it now, even in the early
 docs and talks, I guess I just couldn't see the forest for the trees.)
 So, the work is in line with the overall goals of the TripleO project.

 We're starting with a tiny scope, focused only on updating a running
 TripleO deployment, so our first work is in:

 - Create an Ansible Dynamic Inventory plugin to extract metadata from Heat
 - Improve/extend the Ansible nova_compute Cloud Module (or create a new
 one), for Nova rebuild
 - Develop a minimal handoff from Heat to Ansible, particularly focused
 on the interactions between os-collect-config and Ansible

 We're merging our work in this repo, until we figure out where it should
 live:

 https://github.com/allisonrandal/tripleo-ansible

 We've set ourselves one week as the first sanity-check to see whether
 this idea is going anywhere, and we may scrap it all at that point. But,
 it seems best to be totally transparent about the idea from the start,
 so no-one is surprised later.

Having pluggable backends for configuration seems like a good idea, and
Ansible is a great choice for the first alternative backend.

However, what this repo seems to be doing at the moment is bypassing heat
to do a stack update, and I can only assume there is an eventual goal to
not use heat at all for stack orchestration too.

Granted, until blueprint update-failure-recovery lands[1], doing a
stack-update is about as much fun as Russian roulette. But this effort
is tactical rather than strategic, especially given TripleO's mission
statement.

If I were to use Ansible for TripleO configuration I would start with
something like the following:
* Install an ansible software-config hook onto the image to be triggered
by os-refresh-config[2][3]
* Incrementally replace StructuredConfig resources in
tripleo-heat-templates with SoftwareConfig resources that include the
ansible playbooks via get_file
* The above can start in a fork of tripleo-heat-templates, but can
eventually be structured using resource providers so that the deployer
chooses what configuration backend to use by selecting the environment
file that contains the appropriate config resources

Now you have a cloud orchestrated by heat and configured by Ansible. If
it is still deemed necessary to do an out-of-band update to the stack
then you're in a much better position to do an ansible push, since you
can use the same playbook files that heat used to bring up the stack.

[1] https://review.openstack.org/#/c/112938/
[2] https://review.openstack.org/#/c/95937/
[3]
http://git.openstack.org/cgit/openstack/heat-templates/tree/hot/software-config/elements/heat-config
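
For reference, the hooks under [3] follow a simple contract: the
os-refresh-config script feeds one software config as JSON on stdin, and the
hook reports results as JSON on stdout. A minimal, hypothetical ansible hook
along those lines might look like this (paths and input handling are
illustrative, not the real hook):

    #!/usr/bin/env python
    # Hypothetical "ansible" heat-config hook: reads one software config as
    # JSON on stdin, runs the embedded playbook locally, and reports results
    # as JSON on stdout. Illustrative only.
    import json
    import subprocess
    import sys
    import tempfile


    def main():
        config = json.loads(sys.stdin.read())

        # The SoftwareConfig's "config" property (pulled into the template
        # via get_file) is assumed to be the playbook text itself.
        with tempfile.NamedTemporaryFile(suffix='.yaml', delete=False) as f:
            f.write(config.get('config', '').encode('utf-8'))
            playbook = f.name

        # Pass deployment inputs through as extra variables.
        extra_vars = dict((i['name'], i.get('value'))
                          for i in config.get('inputs', []))

        proc = subprocess.Popen(
            ['ansible-playbook', '-i', 'localhost,', '-c', 'local',
             '--extra-vars', json.dumps(extra_vars), playbook],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()

        json.dump({'deploy_stdout': out.decode('utf-8', 'replace'),
                   'deploy_stderr': err.decode('utf-8', 'replace'),
                   'deploy_status_code': proc.returncode}, sys.stdout)


    if __name__ == '__main__':
        main()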
