Re: [openstack-dev] [magnum][heat] 2 million requests / sec, 100s of nodes

2016-08-09 Thread Zane Bitter

On 07/08/16 19:52, Clint Byrum wrote:

Excerpts from Steve Baker's message of 2016-08-08 10:11:29 +1200:

On 05/08/16 21:48, Ricardo Rocha wrote:

Hi.

Quick update is 1000 nodes and 7 million reqs/sec :) - and the number
of requests should be higher but we had some internal issues. We have
a submission for Barcelona to provide a lot more details.

But a couple questions came during the exercise:

1. Do we really need a volume in the VMs? On large clusters this is a
burden, and local storage only should be enough?

2. We observe a significant delay (~10min, which is half the total
time to deploy the cluster) on heat when it seems to be crunching the
kube_minions nested stacks. Once it's done, it still adds new stacks
gradually, so it doesn't look like it precomputed all the info in advance

Anyone tried to scale Heat to stacks this size? We end up with a stack
with:
* 1000 nested stacks (depth 2)
* 22000 resources
* 47008 events

And already changed most of the timeout/retrial values for rpc to get
this working.

This delay is already visible in clusters of 512 nodes, but 40% of the
time in 1000 nodes seems like something we could improve. Any hints on
Heat configuration optimizations for large stacks very welcome.


Yes, we recommend you set the following in /etc/heat/heat.conf [DEFAULT]:
max_resources_per_stack = -1

Enforcing this for large stacks has a very high overhead; we make this
change in the TripleO undercloud too.
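
Concretely, that's a one-line change in heat.conf - a minimal sketch, assuming
the stock INI layout and that heat-engine is restarted afterwards:

    # /etc/heat/heat.conf
    [DEFAULT]
    # -1 disables the per-stack resource count entirely; any positive value
    # re-enables the recursive check across all nested stacks.
    max_resources_per_stack = -1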



Wouldn't this necessitate having a private Heat just for Magnum? Not
having a resource limit per stack would leave your Heat engines
vulnerable to being DoS'd by malicious users, since one can create many
many thousands of resources, and thus python objects, in just a couple
of cleverly crafted templates (which is why I added the setting).


Although when you added it, all of the resources in a tree of nested 
stacks got handled by a single engine, so sending a really big tree of 
nested stacks was an easy way to DoS Heat. That's no longer the case 
since Kilo; we farm the child stacks out over RPC, so the difficulty of 
carrying out a DoS increases in proportion to the number of cores you 
have running Heat, whereas before it was constant. (This is also the 
cause of the performance problem, since counting all the resources in 
the tree when the entire thing was already loaded in-memory was easy.)


Convergence splits it up even further, farming out each _resource_ as 
well as each stack over RPC.


I had the thought that having a per-tenant resource limit might be both 
more effective at protecting the limited resource and more efficient to 
calculate, since we could have the DB simply count the Resource rows for 
stacks in a given tenant instead of recursively loading all of the 
stacks in a tree and counting the resources in heat-engine. However, the 
tenant isn't stored directly in the Stack table, and people who know 
databases tell me the resulting joins would be fearsome.
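
To make that concrete, the query side would look something like the sketch 
below - purely illustrative, with simplified stand-in models rather than 
Heat's real sqlalchemy schema, and assuming a tenant column on the stack row, 
which (as noted above) is exactly what's missing and what makes the real 
joins painful:

    # Illustrative sketch only - not Heat's actual models or schema.
    from sqlalchemy import (Column, ForeignKey, Integer, String,
                            create_engine, func)
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Stack(Base):
        __tablename__ = 'stack'
        id = Column(Integer, primary_key=True)
        # Hypothetical column: Heat doesn't store the tenant directly here.
        tenant = Column(String(64), index=True)

    class Resource(Base):
        __tablename__ = 'resource'
        id = Column(Integer, primary_key=True)
        stack_id = Column(Integer, ForeignKey('stack.id'), index=True)

    def count_tenant_resources(session, tenant_id):
        # One join + COUNT in the database, instead of loading every nested
        # stack into heat-engine and counting resources in Python.
        return (session.query(func.count(Resource.id))
                       .join(Stack, Resource.stack_id == Stack.id)
                       .filter(Stack.tenant == tenant_id)
                       .scalar())

    if __name__ == '__main__':
        engine = create_engine('sqlite://')
        Base.metadata.create_all(engine)
        with Session(engine) as session:
            print(count_tenant_resources(session, 'demo-tenant'))

With the real schema the tenant presumably has to be joined in from 
elsewhere, which is where the fearsome part would come in.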


I'm still not convinced it'd be worse than what we have now, even after 
Steve did a lot of work to make it much, much better than it was at one 
point ;)



This makes perfect sense in the undercloud of TripleO, which is a
private, single tenant OpenStack. But, for Magnum.. now you're talking
about the Heat that users have access to.


Indeed, and now that we're seeing other users of very large stacks 
(Sahara is another) I think we need to come up with a solution that is 
both efficient enough to use on a large/deep tree of nested stacks and 
still tunable to protect against DoS at whatever scale Heat is deployed.


cheers,
Zane.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [magnum][heat] 2 million requests / sec, 100s of nodes

2016-08-08 Thread Zane Bitter

On 08/08/16 17:09, Ricardo Rocha wrote:

* trying the convergence_engine: as far as I could see this is already
there, just not enabled by default. We can give it a try and let you
know how it goes if there's no obvious drawback. Would it just work
with the current schema? We're running Heat Mitaka.


There have been a bunch of bug fixes (including performance fixes) during 
Newton, which is why it wasn't enabled by default in Mitaka (and 
hopefully will be in Newton - it is enabled on master right now). We 
also haven't been backporting convergence bugfixes to the stable branch. 
So while you can just turn it on and it should work modulo bugs, I 
wouldn't recommend it.
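
For anyone who wants to experiment with it anyway, the switch itself is just 
a heat.conf flag - sketched below, with the usual caveat that this is from 
memory and worth double-checking against the release notes. As far as I 
recall, only stacks created after the flip use convergence; pre-existing 
stacks stay on the legacy path.

    # /etc/heat/heat.conf
    [DEFAULT]
    # Off by default in Mitaka. Restart heat-engine after changing it;
    # newly created stacks will then use the convergence architecture.
    convergence_engine = true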


cheers,
Zane.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [magnum][heat] 2 million requests / sec, 100s of nodes

2016-08-08 Thread Ricardo Rocha
Hi.

On Mon, Aug 8, 2016 at 6:17 PM, Zane Bitter  wrote:
> On 05/08/16 12:01, Hongbin Lu wrote:
>>
>> Add [heat] to the title to get more feedback.
>>
>>
>>
>> Best regards,
>>
>> Hongbin
>>
>>
>>
>> *From:*Ricardo Rocha [mailto:rocha.po...@gmail.com]
>> *Sent:* August-05-16 5:48 AM
>> *To:* OpenStack Development Mailing List (not for usage questions)
>> *Subject:* Re: [openstack-dev] [magnum] 2 million requests / sec, 100s
>> of nodes
>>
>>
>>
>> Hi.
>>
>>
>>
>> Quick update is 1000 nodes and 7 million reqs/sec :) - and the number of
>> requests should be higher but we had some internal issues. We have a
>> submission for Barcelona to provide a lot more details.
>>
>>
>>
>> But a couple questions came during the exercise:
>>
>>
>>
>> 1. Do we really need a volume in the VMs? On large clusters this is a
>> burden, and local storage only should be enough?
>>
>>
>>
>> 2. We observe a significant delay (~10min, which is half the total time
>> to deploy the cluster) on heat when it seems to be crunching the
>> kube_minions nested stacks. Once it's done, it still adds new stacks
>> gradually, so it doesn't look like it precomputed all the info in advance
>>
>>
>>
>> Anyone tried to scale Heat to stacks this size? We end up with a stack
>> with:
>>
>> * 1000 nested stacks (depth 2)
>>
>> * 22000 resources
>>
>> * 47008 events
>
>
> Wow, that's a big stack :) TripleO has certainly been pushing the boundaries
> of how big a stack Heat can handle, but this sounds like another step up
> even from there.
>
>> And already changed most of the timeout/retrial values for rpc to get
>> this working.
>>
>>
>>
>> This delay is already visible in clusters of 512 nodes, but 40% of the
>> time in 1000 nodes seems like something we could improve. Any hints on
>> Heat configuration optimizations for large stacks very welcome.
>
>
> Y'all were right to set max_resources_per_stack to -1, because actually
> checking the number of resources in a tree of stacks is sloow. (Not as
> slow as it used to be when it was O(n^2), but still pretty slow.)
>
> We're actively working on trying to make Heat more horizontally scalable
> (even at the cost of some performance penalty) so that if you need to handle
> this kind of scale then you'll be able to reach it by adding more
> heat-engines. Another big step forward on this front is coming with Newton,
> as (barring major bugs) the convergence_engine architecture will be enabled
> by default.
>
> RPC timeouts are caused by the synchronous work that Heat does before
> returning a result to the caller. Most of this is validation of the data
> provided by the user. We've talked about trying to reduce the amount of
> validation done synchronously to a minimum (just enough to guarantee that we
> can store and retrieve the data from the DB) and push the rest into the
> asynchronous part of the stack operation alongside the actual create/update.
> (FWIW, TripleO typically uses a 600s RPC timeout.)
>
> The "QueuePool limit of size ... overflow ... reached" sounds like we're
> pulling messages off the queue even when we don't have threads available in
> the pool to pass them to. If you have a fix for this it would be much
> appreciated. However, I don't think there's any guarantee that just leaving
> messages on the queue can't lead to deadlocks. The problem with very large
> trees of nested stacks is not so much that it's a lot of stacks (Heat
> doesn't have _too_ much trouble with that) but that they all have to be
> processed simultaneously. e.g. to validate the top level stack you also need
> to validate all of the lower level stacks before returning the result. If
> higher-level stacks consume all of the thread pools then you'll get a
> deadlock as you'll be unable to validate any lower-level stacks. At this
> point you'd have maxed out the capacity of your Heat engines to process
> stacks simultaneously and you'd need to scale out to more Heat engines. The
> solution is probably to try to limit the number of nested stack validations we
> send out concurrently.
>
> Improving performance at scale is a priority area of focus for the Heat team
> at the moment. That's been mostly driven by TripleO and Sahara, but we'd be
> very keen to hear about the kind of loads that Magnum is putting on Heat and
> working with folks across the community to figure out how to improve things
> for those use cases.

Thanks for the detailed reply, especially regarding the handling of
the nested stacks by the engines, much clearer now.

Seems like there are a couple of things we can try already:
* scaling the heat engines (we're currently running 3 nodes with 5
engines each; we can check if adding more helps, though it seems with >1000
nested stacks it will be hard to avoid starvation)
* trying the convergence_engine: as far as I could see this is already
there, just not enabled by default. We can give it a try and let you
know how it goes if there's no obvious drawback. Would it just work
with the current schema? We're running Heat Mitaka.

Re: [openstack-dev] [magnum][heat] 2 million requests / sec, 100s of nodes

2016-08-08 Thread Zane Bitter

On 05/08/16 12:01, Hongbin Lu wrote:

Add [heat] to the title to get more feedback.



Best regards,

Hongbin



*From:*Ricardo Rocha [mailto:rocha.po...@gmail.com]
*Sent:* August-05-16 5:48 AM
*To:* OpenStack Development Mailing List (not for usage questions)
*Subject:* Re: [openstack-dev] [magnum] 2 million requests / sec, 100s
of nodes



Hi.



Quick update is 1000 nodes and 7 million reqs/sec :) - and the number of
requests should be higher but we had some internal issues. We have a
submission for Barcelona to provide a lot more details.



But a couple questions came during the exercise:



1. Do we really need a volume in the VMs? On large clusters this is a
burden, and local storage only should be enough?



2. We observe a significant delay (~10min, which is half the total time
to deploy the cluster) on heat when it seems to be crunching the
kube_minions nested stacks. Once it's done, it still adds new stacks
gradually, so it doesn't look like it precomputed all the info in advance



Anyone tried to scale Heat to stacks this size? We end up with a stack with:

* 1000 nested stacks (depth 2)

* 22000 resources

* 47008 events


Wow, that's a big stack :) TripleO has certainly been pushing the 
boundaries of how big a stack Heat can handle, but this sounds like 
another step up even from there.



And already changed most of the timeout/retrial values for rpc to get
this working.



This delay is already visible in clusters of 512 nodes, but 40% of the
time in 1000 nodes seems like something we could improve. Any hints on
Heat configuration optimizations for large stacks very welcome.


Y'all were right to set max_resources_per_stack to -1, because actually 
checking the number of resources in a tree of stacks is sloow. (Not 
as slow as it used to be when it was O(n^2), but still pretty slow.)


We're actively working on trying to make Heat more horizontally scalable 
(even at the cost of some performance penalty) so that if you need to 
handle this kind of scale then you'll be able to reach it by adding more 
heat-engines. Another big step forward on this front is coming with 
Newton, as (barring major bugs) the convergence_engine architecture will 
be enabled by default.
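
Scaling out can mean more heat-engine hosts, but also more worker processes 
per host - a hedged heat.conf sketch (the option name should be right, but 
check the defaults for your release):

    # /etc/heat/heat.conf
    [DEFAULT]
    # Number of heat-engine worker processes forked on each host.
    # Defaults roughly to the host's CPU count on recent releases; raise
    # it if the engines are the bottleneck and the host has headroom.
    num_engine_workers = 16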


RPC timeouts are caused by the synchronous work that Heat does before 
returning a result to the caller. Most of this is validation of the data 
provided by the user. We've talked about trying to reduce the amount of 
validation done synchronously to a minimum (just enough to guarantee 
that we can store and retrieve the data from the DB) and push the rest 
into the asynchronous part of the stack operation alongside the actual 
create/update. (FWIW, TripleO typically uses a 600s RPC timeout.)
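
The timeout in question is the standard oslo.messaging knob in heat.conf; a 
sketch of the TripleO-style value mentioned above (whatever is calling Heat - 
e.g. Magnum's client - needs to be willing to wait that long too):

    # /etc/heat/heat.conf
    [DEFAULT]
    # Seconds to wait for an RPC response before giving up.
    # TripleO reportedly runs its very large stacks with 600.
    rpc_response_timeout = 600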


The "QueuePool limit of size ... overflow ... reached" sounds like we're 
pulling messages off the queue even when we don't have threads available 
in the pool to pass them to. If you have a fix for this it would be much 
appreciated. However, I don't think there's any guarantee that just 
leaving messages on the queue can't lead to deadlocks. The problem with 
very large trees of nested stacks is not so much that it's a lot of 
stacks (Heat doesn't have _too_ much trouble with that) but that they 
all have to be processed simultaneously. e.g. to validate the top level 
stack you also need to validate all of the lower level stacks before 
returning the result. If higher-level stacks consume all of the thread 
pools then you'll get a deadlock as you'll be unable to validate any 
lower-level stacks. At this point you'd have maxed out the capacity of 
your Heat engines to process stacks simultaneously and you'd need to 
scale out to more Heat engines. The solution is probably to try to limit 
the number of nested stack validations we send out concurrently.
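
As an illustration of that last idea (and emphatically not Heat's actual 
code), bounding the number of in-flight nested-stack validations with a 
fixed-size pool looks roughly like this; the pool size, the function names 
and the eventlet choice are all assumptions for the sketch:

    # Illustrative only: cap how many nested-stack validations run at once
    # so a deep tree can't exhaust the worker pool its parents are waiting on.
    import eventlet
    eventlet.monkey_patch()

    VALIDATION_POOL = eventlet.GreenPool(size=8)  # hypothetical limit

    def validate_stack(name):
        # Stand-in for the real per-stack validation RPC call.
        print('validating %s' % name)

    def validate_children(child_stacks):
        # spawn() blocks once 8 validations are in flight, so a huge tree
        # trickles through instead of fanning out all at once.
        for child in child_stacks:
            VALIDATION_POOL.spawn(validate_stack, child)
        VALIDATION_POOL.waitall()

    if __name__ == '__main__':
        validate_children(['kube_minion_%d' % i for i in range(20)])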


Improving performance at scale is a priority area of focus for the Heat 
team at the moment. That's been mostly driven by TripleO and Sahara, but 
we'd be very keen to hear about the kind of loads that Magnum is putting 
on Heat and working with folks across the community to figure out how to 
improve things for those use cases.


cheers,
Zane.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [magnum][heat] 2 million requests / sec, 100s of nodes

2016-08-08 Thread Tim Bell

On 08 Aug 2016, at 11:51, Ricardo Rocha wrote:

Hi.

On Mon, Aug 8, 2016 at 1:52 AM, Clint Byrum wrote:
Excerpts from Steve Baker's message of 2016-08-08 10:11:29 +1200:
On 05/08/16 21:48, Ricardo Rocha wrote:
Hi.

Quick update is 1000 nodes and 7 million reqs/sec :) - and the number
of requests should be higher but we had some internal issues. We have
a submission for Barcelona to provide a lot more details.

But a couple questions came during the exercise:

1. Do we really need a volume in the VMs? On large clusters this is a
burden, and local storage only should be enough?

2. We observe a significant delay (~10min, which is half the total
time to deploy the cluster) on heat when it seems to be crunching the
kube_minions nested stacks. Once it's done, it still adds new stacks
gradually, so it doesn't look like it precomputed all the info in advance

Anyone tried to scale Heat to stacks this size? We end up with a stack
with:
* 1000 nested stacks (depth 2)
* 22000 resources
* 47008 events

And already changed most of the timeout/retrial values for rpc to get
this working.

This delay is already visible in clusters of 512 nodes, but 40% of the
time in 1000 nodes seems like something we could improve. Any hints on
Heat configuration optimizations for large stacks very welcome.

Yes, we recommend you set the following in /etc/heat/heat.conf [DEFAULT]:
max_resources_per_stack = -1

Enforcing this for large stacks has a very high overhead; we make this
change in the TripleO undercloud too.


Wouldn't this necessitate having a private Heat just for Magnum? Not
having a resource limit per stack would leave your Heat engines
vulnerable to being DoS'd by malicious users, since one can create many
many thousands of resources, and thus python objects, in just a couple
of cleverly crafted templates (which is why I added the setting).

This makes perfect sense in the undercloud of TripleO, which is a
private, single tenant OpenStack. But, for Magnum.. now you're talking
about the Heat that users have access to.

We have it already at -1 for these tests. As you say, a malicious user
could DoS; right now this is manageable in our environment. But maybe
move it to a per-tenant value, or some special policy? The stacks are
created under a separate domain for magnum (for trustees); we could
also use that for separation.


If there were a quota system within Heat for items like stacks and resources,
this could be controlled through that.

Looks like https://blueprints.launchpad.net/heat/+spec/add-quota-api-for-heat 
did not make it into upstream though.

Tim

A separate Heat instance sounds like overkill.

Cheers,
Ricardo


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [magnum][heat] 2 million requests / sec, 100s of nodes

2016-08-05 Thread Hongbin Lu
Add [heat] to the title to get more feedback.

Best regards,
Hongbin

From: Ricardo Rocha [mailto:rocha.po...@gmail.com]
Sent: August-05-16 5:48 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [magnum] 2 million requests / sec, 100s of nodes

Hi.

Quick update is 1000 nodes and 7 million reqs/sec :) - and the number of 
requests should be higher but we had some internal issues. We have a submission 
for Barcelona to provide a lot more details.

But a couple questions came during the exercise:

1. Do we really need a volume in the VMs? On large clusters this is a burden, 
and local storage only should be enough?

2. We observe a significant delay (~10min, which is half the total time to 
deploy the cluster) on heat when it seems to be crunching the kube_minions 
nested stacks. Once it's done, it still adds new stacks gradually, so it 
doesn't look like it precomputed all the info in advance

Anyone tried to scale Heat to stacks this size? We end up with a stack with:
* 1000 nested stacks (depth 2)
* 22000 resources
* 47008 events

And already changed most of the timeout/retrial values for rpc to get this 
working.

This delay is already visible in clusters of 512 nodes, but 40% of the time in 
1000 nodes seems like something we could improve. Any hints on Heat 
configuration optimizations for large stacks very welcome.

Cheers,
  Ricardo

On Sun, Jun 19, 2016 at 10:59 PM, Brad Topol wrote:

Thanks Ricardo! This is very exciting progress!

--Brad


Brad Topol, Ph.D.
IBM Distinguished Engineer
OpenStack
(919) 543-0646
Internet: bto...@us.ibm.com
Assistant: Kendra Witherspoon (919) 254-0680


From: Ton Ngo/Watson/IBM@IBMUS
To: "OpenStack Development Mailing List \(not for usage questions\)" 
>
Date: 06/17/2016 12:10 PM
Subject: Re: [openstack-dev] [magnum] 2 million requests / sec, 100s of nodes





Thanks Ricardo for sharing the data, this is really encouraging!
Ton,


From: Ricardo Rocha
To: "OpenStack Development Mailing List (not for usage questions)"
Date: 06/17/2016 08:16 AM
Subject: [openstack-dev] [magnum] 2 million requests / sec, 100s of nodes




Hi.

Just thought the Magnum team would be happy to hear :)

We had access to some hardware the last couple days, and tried some
tests with Magnum and Kubernetes - following an original blog post
from the kubernetes team.

Got a 200 node kubernetes bay (800 cores) reaching 2 million requests / sec.

Check here for some details:
https://openstack-in-production.blogspot.ch/2016/06/scaling-magnum-and-kubernetes-2-million.html

We'll try bigger in a couple weeks, also using the Rally work from
Winnie, Ton and Spyros to see where it breaks. Already identified a
couple issues, will add bugs or push patches for those. If you have
ideas or suggestions for the next tests let us know.

Magnum is looking pretty good!

Cheers,
Ricardo

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev