Re: [openstack-dev] [heat][sahara][magnum][tripleo] Scaling nested stack validation

2017-04-03 Thread Zane Bitter

On 28/11/16 14:20, Zane Bitter wrote:

On 23/11/16 17:58, Zane Bitter wrote:

I also investigated another issue, which is that since the fix for
https://bugs.launchpad.net/heat/+bug/1388140 landed (in Kilo) I believe
we are validating nested stacks multiple times (specifically, m times,
where m is the stack's depth in the tree):

  root                     child                    grandchild

  create
   -> validate --> validate --> validate
   -> Resource.create ===> create
                            -> validate --> validate
                            -> Resource.create ===> create
                                                     -> validate

The only good news here is that ResourceGroup is smart enough to make
sure that it generates a nested stack with at most 1 resource to
validate when validate() is called. (However, when the nested stack is
created, and thus validated, it is of course full-sized.) Autoscaling
groups make no such allowances, but the patch above should actually have
the same effect. (We can't get rid of the special case for ResourceGroup
though, because of index substitution.)

An obvious fix would be to disable validation - or, more specifically,
validation of _resources_ - on create/update for stacks that have a
non-null owner_id (i.e. nested stacks), so that we had something like:

  root                     child                    grandchild

  create
   -> validate --> validate --> validate
   -> Resource.create ===> create
                            -> Resource.create ===> create

That would eliminate the duplication/triplication/multiplication of
validation. It would also mean that we'd cut out the expensive part of
ResourceGroup validation with index substitution, leaving only the cheap
part.

One downside is that in the ResourceGroup/index substitution case we'd
be creating resources whose definitions hadn't _ever_ been validated. I
_think_ that's safe, in the sense that you'd just hear about errors
later, as opposed to everything falling over in a heap, but it's
difficult to be certain. Hearing about problems late is also not ideal
(since it may cause otherwise-healthy siblings to be cancelled), but I
would guess that heavy users like TripleO developers would say that it's
worth the tradeoff.


https://launchpad.net/bugs/1645336
https://review.openstack.org/#/c/403828/


It turned out to be more subtle than that:
https://bugs.launchpad.net/heat/+bug/1675589

Global state sucks.

- ZB


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat][sahara][magnum][tripleo] Scaling nested stack validation

2016-11-28 Thread Zane Bitter

On 23/11/16 17:58, Zane Bitter wrote:

I also investigated another issue, which is that since the fix for
https://bugs.launchpad.net/heat/+bug/1388140 landed (in Kilo) I believe
we are validating nested stacks multiple times (specifically, m times,
where m is the stack's depth in the tree):

  root                     child                    grandchild

  create
   -> validate --> validate --> validate
   -> Resource.create ===> create
                            -> validate --> validate
                            -> Resource.create ===> create
                                                     -> validate

The only good news here is that ResourceGroup is smart enough to make
sure that it generates a nested stack with at most 1 resource to
validate when validate() is called. (However, when the nested stack is
created, and thus validated, it is of course full-sized.) Autoscaling
groups make no such allowances, but the patch above should actually have
the same effect. (We can't get rid of the special case for ResourceGroup
though, because of index substitution.)

An obvious fix would be to disable validation - or, more specifically,
validation of _resources_ - on create/update for stacks that have a
non-null owner_id (i.e. nested stacks), so that we had something like:

  root                     child                    grandchild

  create
   -> validate --> validate --> validate
   -> Resource.create ===> create
                            -> Resource.create ===> create

That would eliminate the duplication/triplication/multiplication of
validation. It would also mean that we'd cut out the expensive part of
ResourceGroup validation with index substitution, leaving only the cheap
part.

One downside is that in the ResourceGroup/index substitution case we'd
be creating resources whose definitions hadn't _ever_ been validated. I
_think_ that's safe, in the sense that you'd just hear about errors
later, as opposed to everything falling over in a heap, but it's
difficult to be certain. Hearing about problems late is also not ideal
(since it may cause otherwise-healthy siblings to be cancelled), but I
would guess that heavy users like TripleO developers would say that it's
worth the tradeoff.


https://launchpad.net/bugs/1645336
https://review.openstack.org/#/c/403828/

- ZB



Re: [openstack-dev] [heat][sahara][magnum][tripleo] Scaling nested stack validation

2016-11-24 Thread Thomas Herve
On Wed, Nov 23, 2016 at 11:58 PM, Zane Bitter wrote:
> We discussed $SUBJECT at the summit as one of the main performance problems
> that people are running into when trying to create very large autoscaling
> groups, as projects like Sahara, Magnum, TripleO, OpenShift are wont to do.
> Of course, as we all know, validation happens synchronously, so it's prone
> to causing RPC timeouts that mean a hard failure of the parent stack.
>
> First the good news - I just committed this patch:
>
> https://review.openstack.org/#/c/400961/
>
> which should mean that from now on resources with identical definitions will
> not all be validated; instead we'll validate just one representative. In
> theory, autoscaling groups should now validate in constant rather than
> linear time. If anyone from one of the affected projects is able to confirm
> this, then I'd be happy to backport the patch to stable/newton. It really is
> very simple.
>
> The bad news here is for users of ResourceGroups with %index% substitution
> (*cough*TripleO*cough*) - this makes each resource definition unique, so it
> won't benefit from this fix. (Adding this to my mental list of reasons why
> index substitution is bad.)
>
>
> I also investigated another issue, which is that since the fix for
> https://bugs.launchpad.net/heat/+bug/1388140 landed (in Kilo) I believe we
> are validating nested stacks multiple times (specifically, m times, where m
> is the stack's depth in the tree):
>
>   root                     child                    grandchild
>
>   create
>    -> validate --> validate --> validate
>    -> Resource.create ===> create
>                             -> validate --> validate
>                             -> Resource.create ===> create
>                                                      -> validate
>
> The only good news here is that ResourceGroup is smart enough to make sure
> that it generates a nested stack with at most 1 resource to validate when
> validate() is called. (However, when the nested stack is created, and thus
> validated, it is of course full-sized.) Autoscaling groups make no such
> allowances, but the patch above should actually have the same effect. (We
> can't get rid of the special case for ResourceGroup though, because of index
> substitution.)
>
> An obvious fix would be to disable validation - or, more specifically,
> validation of _resources_ - on create/update for stacks that have a non-null
> owner_id (i.e. nested stacks), so that we had something like:
>
>   root                     child                    grandchild
>
>   create
>    -> validate --> validate --> validate
>    -> Resource.create ===> create
>                             -> Resource.create ===> create
>
> That would eliminate the duplication/triplication/multiplication of
> validation. It would also mean that we'd cut out the expensive part of
> ResourceGroup validation with index substitution, leaving only the cheap
> part.
>
> One downside is that in the ResourceGroup/index substitution case we'd be
> creating resources whose definitions hadn't _ever_ been validated. I _think_
> that's safe, in the sense that you'd just hear about errors later, as
> opposed to everything falling over in a heap, but it's difficult to be
> certain. Hearing about problems late is also not ideal (since it may cause
> otherwise-healthy siblings to be cancelled), but I would guess that heavy
> users like TripleO developers would say that it's worth the tradeoff.
>
> However, one other thing about this bothers me. The part of validation that
> we're keeping:
>
>    -> validate --> validate --> validate
>
> involves loading all of the nested stacks in memory at once (i.e. the thing
> we were not supposed to be doing any more in Kilo, in favour of farming
> nested stacks out over RPC.) As we discovered when we found out we were
> doing the same thing with outputs[1], this is a bit like hanging out a giant
> "Kick Me" sign for the OOM Killer.
>
> That's mitigated quite a lot by my patch though... we'll load the whole
> autoscaling group stack in memory, but if its members are themselves nested
> stacks we'll load only one of them. So the scaling tendencies will hopefully
> be dominated by the complexity of your templates more than the size of
> your deployment. ResourceGroup is in a better position, because its nested
> stack will actually have only one member, so the size shouldn't affect
> memory consumption at all during validation.
>
> Some options:
> 1) Chalk it up to an acceptable tradeoff
> 2) Add a single-member special case for autoscaling group validation
> 3) Farm out the nested validation over RPC
> 4) Both (2) & (3)
> 5) Some totally different arrangement of how nested stacks are validated

I think I'd like to see what difference 3 makes. Maybe then also do 2.
Again, we really need to have some reproducible big template that 

[openstack-dev] [heat][sahara][magnum][tripleo] Scaling nested stack validation

2016-11-23 Thread Zane Bitter
We discussed $SUBJECT at the summit as one of the main performance 
problems that people are running into when trying to create very large 
autoscaling groups, as projects like Sahara, Magnum, TripleO, OpenShift 
are wont to do. Of course, as we all know, validation happens 
synchronously, so it's prone to causing RPC timeouts that mean a hard 
failure of the parent stack.


First the good news - I just committed this patch:

https://review.openstack.org/#/c/400961/

which should mean that from now on resources with identical definitions 
will not all be validated; instead we'll validate just one 
representative. In theory, autoscaling groups should now validate in 
constant rather than linear time. If anyone from one of the affected 
projects is able to confirm this, then I'd be happy to backport the 
patch to stable/newton. It really is very simple.


The bad news here is for users of ResourceGroups with %index% 
substitution (*cough*TripleO*cough*) - this makes each resource 
definition unique, so it won't benefit from this fix. (Adding this to my 
mental list of reasons why index substitution is bad.)
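
To make the difference concrete, here is a minimal sketch of the idea, not the actual code from the patch. The helper names (definition_key, representatives, make_group) and the dict layout of a resource definition are assumptions made purely for illustration. It picks one representative per distinct definition, and shows that %index% substitution leaves nothing to collapse:

```python
import json


def definition_key(definition):
    """Return a hashable fingerprint of a resource definition (hypothetical helper)."""
    return json.dumps(definition, sort_keys=True)


def representatives(definitions):
    """Pick one resource name per distinct definition; only these need validating."""
    seen = {}
    for name, definition in definitions.items():
        # Keep the first resource seen for each distinct definition.
        seen.setdefault(definition_key(definition), name)
    return sorted(seen.values())


def make_group(count, name_template):
    """Build `count` member definitions, replacing '%index%' per member."""
    return {
        str(i): {
            "type": "OS::Nova::Server",
            "properties": {"name": name_template.replace("%index%", str(i))},
        }
        for i in range(count)
    }


identical = make_group(1000, "worker")
indexed = make_group(1000, "worker-%index%")

print(len(representatives(identical)))  # one representative to validate
print(len(representatives(indexed)))    # every member is unique
```

With identical members the validation work no longer depends on the group size; with index substitution it is still linear in the number of members.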



I also investigated another issue, which is that since the fix for 
https://bugs.launchpad.net/heat/+bug/1388140 landed (in Kilo) I believe 
we are validating nested stacks multiple times (specifically, m times, 
where m is the stack's depth in the tree):


  root                     child                    grandchild

  create
   -> validate --> validate --> validate
   -> Resource.create ===> create
                            -> validate --> validate
                            -> Resource.create ===> create
                                                     -> validate

The only good news here is that ResourceGroup is smart enough to make 
sure that it generates a nested stack with at most 1 resource to 
validate when validate() is called. (However, when the nested stack is 
created, and thus validated, it is of course full-sized.) Autoscaling 
groups make no such allowances, but the patch above should actually have 
the same effect. (We can't get rid of the special case for ResourceGroup 
though, because of index substitution.)


An obvious fix would be to disable validation - or, more specifically, 
validation of _resources_ - on create/update for stacks that have a 
non-null owner_id (i.e. nested stacks), so that we had something like:


  root                     child                    grandchild

  create
   -> validate --> validate --> validate
   -> Resource.create ===> create
                            -> Resource.create ===> create

That would eliminate the duplication/triplication/multiplication of 
validation. It would also mean that we'd cut out the expensive part of 
ResourceGroup validation with index substitution, leaving only the cheap 
part.
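
The counting argument can be checked with a toy model. This is only a sketch under simplifying assumptions: the Stack class, the owner attribute (standing in for owner_id) and the skip_nested_validation flag are hypothetical, and for brevity the flag skips the whole validate step for nested stacks rather than just the validation of resources.

```python
from collections import Counter


class Stack:
    """Toy stand-in for a stack; `owner` plays the role of owner_id."""

    def __init__(self, name, children=()):
        self.name = name
        self.owner = None
        self.children = list(children)
        for child in self.children:
            child.owner = self


def validate(stack, counts):
    """Validation recurses synchronously into every nested stack."""
    counts[stack.name] += 1
    for child in stack.children:
        validate(child, counts)


def create(stack, counts, skip_nested_validation=False):
    """Validate (unless skipped for nested stacks), then create each child."""
    is_nested = stack.owner is not None
    if not (skip_nested_validation and is_nested):
        validate(stack, counts)
    for child in stack.children:
        create(child, counts, skip_nested_validation)


def count_validations(skip_nested_validation):
    """Create a root -> child -> grandchild tree and count validations."""
    tree = Stack("root", [Stack("child", [Stack("grandchild")])])
    counts = Counter()
    create(tree, counts, skip_nested_validation)
    return dict(counts)


print(count_validations(False))  # a stack at depth m is validated m times
print(count_validations(True))   # each stack is validated exactly once
```

In the first case the grandchild is validated three times (once from each ancestor's create, plus its own); in the second, only the root's create triggers validation, and it still covers the whole tree.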


One downside is that in the ResourceGroup/index substitution case we'd 
be creating resources whose definitions hadn't _ever_ been validated. I 
_think_ that's safe, in the sense that you'd just hear about errors 
later, as opposed to everything falling over in a heap, but it's 
difficult to be certain. Hearing about problems late is also not ideal 
(since it may cause otherwise-healthy siblings to be cancelled), but I 
would guess that heavy users like TripleO developers would say that it's 
worth the tradeoff.


However, one other thing about this bothers me. The part of validation 
that we're keeping:


   -> validate --> validate --> validate

involves loading all of the nested stacks in memory at once (i.e. the 
thing we were not supposed to be doing any more in Kilo, in favour of 
farming nested stacks out over RPC.) As we discovered when we found out 
we were doing the same thing with outputs[1], this is a bit like hanging 
out a giant "Kick Me" sign for the OOM Killer.


That's mitigated quite a lot by my patch though... we'll load the whole 
autoscaling group stack in memory, but if its members are themselves 
nested stacks we'll load only one of them. So the scaling tendencies 
will hopefully be dominated by the complexity of your templates more 
than the size of your deployment. ResourceGroup is in a better 
position, because its nested stack will actually have only one member, 
so the size shouldn't affect memory consumption at all during validation.


Some options:
1) Chalk it up to an acceptable tradeoff
2) Add a single-member special case for autoscaling group validation
3) Farm out the nested validation over RPC
4) Both (2) & (3)
5) Some totally different arrangement of how nested stacks are validated

Discuss.

cheers,
Zane.

[1] https://review.openstack.org/#/c/383839/
