Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Ihar Hrachyshka
Another potentially interesting devstack service that may help us to
understand our memory usage is peakmem_tracker. At this point, it's
not enabled anywhere. I proposed a devstack-gate patch to enable it at:
https://review.openstack.org/#/c/434511/
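
To give an idea of what such a tracker boils down to, here is a rough
Python sketch (the actual peakmem_tracker is a devstack shell service;
this rendition and its output format are purely illustrative): poll
/proc/meminfo once a second and record the low-water mark of available
memory, i.e. the closest the node got to the OOM edge.

import time

def meminfo_kb(field):
    # Return the value of a /proc/meminfo field in kB, or None if absent.
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(field + ':'):
                return int(line.split()[1])
    return None

def track_peak(interval=1.0):
    lowest = None
    while True:
        avail = meminfo_kb('MemAvailable')
        if avail is None:
            avail = meminfo_kb('MemFree')   # older kernels
        if avail is not None and (lowest is None or avail < lowest):
            lowest = avail
            print('new low-water mark: %d kB available' % lowest)
        time.sleep(interval)

if __name__ == '__main__':
    track_peak()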

On Wed, Feb 15, 2017 at 12:38 PM, Ihar Hrachyshka  wrote:
> Another potentially relevant piece of information: we saw before that the
> oom-killer is triggered while the 8GB of swap is barely used. This
> behavior is hard to explain, since we set the kernel swappiness sysctl
> knob to 30:
>
> https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L432
>
> (and any value above 0 means that if memory is requested and there is
> swap available to fulfill it, the allocation will not fail; swappiness
> only controls the kernel's willingness to swap process pages instead of
> dropping disk cache entries, so it may affect performance, but it should
> not affect malloc behavior).
>
> The only reason I can think of for a memory allocation request to
> trigger the trap while swap is free is when the request is for a
> RAM-locked page (memory can be locked either with mlock(2) or with
> mmap(2) when MAP_LOCKED is used). To understand whether that's the case
> in the gate, I am adding a new mlock_tracker service to devstack:
> https://review.openstack.org/#/c/434470/
>
> The patch that enables the service in Pike+ gate is:
> https://review.openstack.org/#/c/434474/
>
> Thanks,
> Ihar
>
> On Wed, Feb 15, 2017 at 5:21 AM, Andrea Frittoli
>  wrote:
>> Some (new?) data on the oom kill issue in the gate.
>>
>> I filed a new bug / E-R query yet for the issue [1][2] since it looks to me
>> like the issue is not specific to mysqld - oom-kill will just pick the best
>> candidate, which in most cases happens to be mysqld. The next most likely
>> candidate to show errors in the logs is keystone, since token requests are
>> rather frequent, more than any other API call probably.
>>
>> According to logstash [3] all failures identified by [2] happen on RAX nodes
>> [3], which I hadn't realised before.
>>
>> Comparing dstat data between the failed run and a successful on an OVH node
>> [4], the main difference I can spot is free memory.
>> For the same test job, the free memory tends to be much lower, quite close
>> to zero for the majority of the time on the RAX node. My guess is that an
>> unlucky scheduling of tests may cause a slightly higher peak in memory usage
>> and trigger the oom-kill.
>>
>> I find it hard to relate lower free memory to a specific cloud provider /
>> underlying virtualisation technology, but maybe someone has an idea about
>> how that could be?
>>
>> Andrea
>>
>> [0]
>> http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28
>> [1] https://bugs.launchpad.net/tempest/+bug/1664953
>> [2] https://review.openstack.org/434238
>> [3]
>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22
>> [4]
>> http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz
>>
>> On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo
>>  wrote:
>>>
>>> Jeremy Stanley wrote:
>>>
>>>
>>> > It's an option of last resort, I think. The next consistent flavor
>>> > up in most of the providers donating resources is double the one
>>> > we're using (which is a fairly typical pattern in public clouds). As
>>> > aggregate memory constraints are our primary quota limit, this would
>>> > effectively halve our current job capacity.
>>>
>>> Properly coordinated with all the cloud the providers, they could create
>>> flavours which are private but available to our tenants, where a 25-50% more
>>> RAM would be just enough.
>>>
>>> I agree that should probably be a last resort tool, and we should keep
>>> looking for proper ways to find where we consume unnecessary RAM and make
>>> sure that's properly freed up.
>>>
>>> It could be interesting to coordinate such flavour creation in the mean
>>> time, even if we don't use it now, we could eventually test it or put it to
>>> work if we find trapped anytime later.
>>>
>>>
>>> On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann 
>>> wrote:

 On 2/5/2017 1:19 PM, Clint Byrum wrote:
>
>
> Also I wonder if there's ever been any serious consideration given to
> switching to protobuf? Feels like one could make oslo.versionedobjects
> a wrapper around protobuf relatively easily, but perhaps that's already
> been explored in a forum that I wasn't paying attention to.


 I've never heard of anyone attempting that.

 --

 Thanks,

 Matt Riedemann




Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Ihar Hrachyshka
Another potentially relevant piece of information: we saw before that the
oom-killer is triggered while the 8GB of swap is barely used. This
behavior is hard to explain, since we set the kernel swappiness sysctl
knob to 30:

https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L432

(and any value above 0 means that if memory is requested and there is
swap available to fulfill it, the allocation will not fail; swappiness
only controls the kernel's willingness to swap process pages instead of
dropping disk cache entries, so it may affect performance, but it should
not affect malloc behavior).

The only reason I can think of for a memory allocation request to
trigger the trap while swap is free is when the request is for a
RAM-locked page (memory can be locked either with mlock(2) or with
mmap(2) when MAP_LOCKED is used). To understand whether that's the case
in the gate, I am adding a new mlock_tracker service to devstack:
https://review.openstack.org/#/c/434470/

The patch that enables the service in the Pike+ gate is:
https://review.openstack.org/#/c/434474/
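
For reference, the check itself is conceptually simple; a rough Python
sketch of it (the proposed service is a devstack shell script, so this is
just an illustration of the idea) is to scan /proc/<pid>/status for a
non-zero VmLck value:

import glob

def locked_processes():
    # Yield (pid, name, kB) for every process holding RAM-locked pages.
    for status in glob.glob('/proc/[0-9]*/status'):
        name, locked_kb = '?', 0
        try:
            with open(status) as f:
                for line in f:
                    if line.startswith('Name:'):
                        name = line.split()[1]
                    elif line.startswith('VmLck:'):
                        locked_kb = int(line.split()[1])
        except IOError:
            continue  # process exited while we were reading
        if locked_kb:
            yield status.split('/')[2], name, locked_kb

if __name__ == '__main__':
    for pid, name, kb in locked_processes():
        print('%s (pid %s): %d kB locked' % (name, pid, kb))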

Thanks,
Ihar

On Wed, Feb 15, 2017 at 5:21 AM, Andrea Frittoli
 wrote:
> Some (new?) data on the oom kill issue in the gate.
>
> I filed a new bug / E-R query for the issue [1][2] since it looks to me
> like the issue is not specific to mysqld - oom-kill will just pick the best
> candidate, which in most cases happens to be mysqld. The next most likely
> candidate to show errors in the logs is keystone, since token requests are
> rather frequent, more than any other API call probably.
>
> According to logstash [3] all failures identified by [2] happen on RAX nodes
> [3], which I hadn't realised before.
>
> Comparing dstat data between the failed run and a successful one on an OVH node
> [4], the main difference I can spot is free memory.
> For the same test job, the free memory tends to be much lower, quite close
> to zero for the majority of the time on the RAX node. My guess is that an
> unlucky scheduling of tests may cause a slightly higher peak in memory usage
> and trigger the oom-kill.
>
> I find it hard to relate lower free memory to a specific cloud provider /
> underlying virtualisation technology, but maybe someone has an idea about
> how that could be?
>
> Andrea
>
> [0]
> http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28
> [1] https://bugs.launchpad.net/tempest/+bug/1664953
> [2] https://review.openstack.org/434238
> [3]
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22
> [4]
> http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz
>
> On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo
>  wrote:
>>
>> Jeremy Stanley wrote:
>>
>>
>> > It's an option of last resort, I think. The next consistent flavor
>> > up in most of the providers donating resources is double the one
>> > we're using (which is a fairly typical pattern in public clouds). As
>> > aggregate memory constraints are our primary quota limit, this would
>> > effectively halve our current job capacity.
>>
>> Properly coordinated with all the cloud the providers, they could create
>> flavours which are private but available to our tenants, where a 25-50% more
>> RAM would be just enough.
>>
>> I agree that should probably be a last resort tool, and we should keep
>> looking for proper ways to find where we consume unnecessary RAM and make
>> sure that's properly freed up.
>>
>> It could be interesting to coordinate such flavour creation in the mean
>> time, even if we don't use it now, we could eventually test it or put it to
>> work if we find trapped anytime later.
>>
>>
>> On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann 
>> wrote:
>>>
>>> On 2/5/2017 1:19 PM, Clint Byrum wrote:


 Also I wonder if there's ever been any serious consideration given to
 switching to protobuf? Feels like one could make oslo.versionedobjects
 a wrapper around protobuf relatively easily, but perhaps that's already
 been explored in a forum that I wasn't paying attention to.
>>>
>>>
>>> I've never heard of anyone attempting that.
>>>
>>> --
>>>
>>> Thanks,
>>>
>>> Matt Riedemann
>>>
>>>
>>>

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Jeremy Stanley
On 2017-02-15 13:21:16 + (+), Andrea Frittoli wrote:
[...]
> According to logstash [3] all failures identified by [2] happen on RAX
> nodes [3], which I hadn't realised before.
[...]
> I find it hard to relate lower free memory to a specific cloud provider /
> underlying virtualisation technology, but maybe someone has an idea about
> how that could be?

That provider is, AFAIK, the only Xen-based environment in which we
test. Is it possible memory allocations in a Xen DomU incur
additional overhead compared to other popular hypervisors?
-- 
Jeremy Stanley



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Andrea Frittoli
Some (new?) data on the oom kill issue in the gate.

I filed a new bug / E-R query for the issue [1][2] since it looks to me
like the issue is not specific to mysqld - oom-kill will just pick the best
candidate, which in most cases happens to be mysqld. The next most likely
candidate to show errors in the logs is keystone, since token requests are
rather frequent, more than any other API call probably.

According to logstash [3] all failures identified by [2] happen on RAX
nodes, which I hadn't realised before.

Comparing dstat data between the failed run and a successful one on an OVH node
[4], the main difference I can spot is free memory.
For the same test job, the free memory tends to be much lower, quite close
to zero for the majority of the time on the RAX node. My guess is that an
unlucky scheduling of tests may cause a slightly higher peak in memory
usage and trigger the oom-kill.
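
For anyone who wants to repeat the comparison, something along these lines
would do it (a sketch only, assuming the gate's dstat CSV layout: banner
rows, then a header row whose first "free" column belongs to the memory
usage group, then one sample per second, with values in bytes):

import csv
import sys

def free_memory_series(path, column='free'):
    series, idx = [], None
    with open(path) as f:
        for row in csv.reader(f):
            if idx is None:
                if column in row:
                    idx = row.index(column)   # first "free" = memory group
                continue
            try:
                series.append(float(row[idx]))
            except (IndexError, ValueError):
                continue
    return series

for log in sys.argv[1:]:
    mem = free_memory_series(log)
    if mem:
        print('%s: min %.0f MB, mean %.0f MB free'
              % (log, min(mem) / 2.0 ** 20, sum(mem) / len(mem) / 2.0 ** 20))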

I find it hard to relate lower free memory to a specific cloud provider /
underlying virtualisation technology, but maybe someone has an idea about
how that could be?

Andrea

[0]
http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28

[1] https://bugs.launchpad.net/tempest/+bug/1664953
[2] https://review.openstack.org/434238
[3]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22

[4]
http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz


On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo 
wrote:

Jeremy Stanley wrote:


> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.

Properly coordinated with all the cloud providers, they could create
flavours which are private but available to our tenants, where 25-50%
more RAM would be just enough.

I agree that should probably be a last-resort tool, and we should keep
looking for proper ways to find where we consume unnecessary RAM and make
sure that's properly freed up.

It could be interesting to coordinate such flavour creation in the
meantime; even if we don't use it now, we could eventually test it or put
it to work if we find ourselves trapped anytime later.


On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann  wrote:

On 2/5/2017 1:19 PM, Clint Byrum wrote:


Also I wonder if there's ever been any serious consideration given to
switching to protobuf? Feels like one could make oslo.versionedobjects
a wrapper around protobuf relatively easily, but perhaps that's already
been explored in a forum that I wasn't paying attention to.


I've never heard of anyone attempting that.

-- 

Thanks,

Matt Riedemann




Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-06 Thread Miguel Angel Ajo Pelayo
Jeremy Stanley wrote:


> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.

Properly coordinated with all the cloud providers, they could create
flavours which are private but available to our tenants, where 25-50%
more RAM would be just enough.

I agree that should probably be a last-resort tool, and we should keep
looking for proper ways to find where we consume unnecessary RAM and make
sure that's properly freed up.

It could be interesting to coordinate such flavour creation in the
meantime; even if we don't use it now, we could eventually test it or put
it to work if we find ourselves trapped anytime later.


On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann  wrote:

> On 2/5/2017 1:19 PM, Clint Byrum wrote:
>
>>
>> Also I wonder if there's ever been any serious consideration given to
>> switching to protobuf? Feels like one could make oslo.versionedobjects
>> a wrapper around protobuf relatively easily, but perhaps that's already
>> been explored in a forum that I wasn't paying attention to.
>>
>
> I've never heard of anyone attempting that.
>
> --
>
> Thanks,
>
> Matt Riedemann
>
>


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-05 Thread Matt Riedemann

On 2/5/2017 1:19 PM, Clint Byrum wrote:


Also I wonder if there's ever been any serious consideration given to
switching to protobuf? Feels like one could make oslo.versionedobjects
a wrapper around protobuf relatively easily, but perhaps that's already
been explored in a forum that I wasn't paying attention to.


I've never heard of anyone attempting that.

--

Thanks,

Matt Riedemann



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-05 Thread Clint Byrum
Excerpts from Matt Riedemann's message of 2017-02-04 16:09:56 -0600:
> On 2/2/2017 4:01 PM, Sean Dague wrote:
> >
> > The only services that are running on Apache in standard gate jobs are
> > keystone and the placement api. Everything else is still the
> > oslo.service stack (which is basically run eventlet as a preforking
> > static worker count webserver).
> >
> > The ways in which OpenStack and oslo.service uses eventlet are known to
> > have scaling bottle necks. The Keystone team saw substantial throughput
> > gains going over to apache hosting.
> >
> > -Sean
> >
> 
> FWIW, coincidentally the nova team is going to work on running nova-api 
> under apache in some select jobs in Pike because it turns out that 
> TripleO was running that configuration in Newton which is considered 
> experimental in nova (we don't do some things when running in that mode 
> which are actually pretty critical to how the code functions for 
> upgrades). So if Apache/eventlet is related, maybe we'll see some 
> differences after making that change.
> 
> But I also wouldn't be surprised if Nova is creating more versioned 
> objects which reference other full versioned objects (rather than just 
> an id reference) and maybe some of those are hanging around longer than 
> they should be.
> 

Has there ever been an effort to profile memory usage of oslo versioned
objects, including the actual objects defined in each major project? I
would be willing to wager there are enough circular references that
they're pretty sticky.
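
Even without a full profiler, a crude check along these lines (a generic
sketch, not tied to any particular project or to versioned objects
specifically) would show which types pile up and how much is only
reclaimable by the cycle collector:

import gc
from collections import Counter

def count_by_type():
    return Counter(type(o).__module__ + '.' + type(o).__name__
                   for o in gc.get_objects())

before = count_by_type()
unreachable = gc.collect()    # number of unreachable objects found by a full pass
after = count_by_type()

print('unreachable objects found by a full gc pass: %d' % unreachable)
for name, delta in (before - after).most_common(10):
    print('%6d  %s' % (delta, name))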

Also I wonder if there's ever been any serious consideration given to
switching to protobuf? Feels like one could make oslo.versionedobjects
a wrapper around protobuf relatively easily, but perhaps that's already
been explored in a forum that I wasn't paying attention to. I feel like
that's another example of something I hear in the hallways that doesn't
get any forward progress.



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Matt Riedemann

On 2/2/2017 4:01 PM, Sean Dague wrote:


The only services that are running on Apache in standard gate jobs are
keystone and the placement api. Everything else is still the
oslo.service stack (which basically runs eventlet as a preforking,
static-worker-count webserver).

The ways in which OpenStack and oslo.service use eventlet are known to
have scaling bottlenecks. The Keystone team saw substantial throughput
gains going over to apache hosting.

-Sean



FWIW, coincidentally the nova team is going to work on running nova-api 
under apache in some select jobs in Pike because it turns out that 
TripleO was running that configuration in Newton which is considered 
experimental in nova (we don't do some things when running in that mode 
which are actually pretty critical to how the code functions for 
upgrades). So if Apache/eventlet is related, maybe we'll see some 
differences after making that change.


But I also wouldn't be surprised if Nova is creating more versioned 
objects which reference other full versioned objects (rather than just 
an id reference) and maybe some of those are hanging around longer than 
they should be.


--

Thanks,

Matt Riedemann



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Matt Riedemann

On 2/2/2017 2:32 PM, Armando M. wrote:


Not sure I agree on this one, this has been observed multiple times in
the gate already [1] (though I am not sure there's a bug for it), and I
don't believe it has anything to do with the number of API workers,
unless not even two workers are enough.

[1]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22('Connection%20aborted.'%2C%20BadStatusLine(%5C%22''%5C%22%2C)%5C%22




I think that's this:

http://status.openstack.org//elastic-recheck/index.html#1630664

--

Thanks,

Matt Riedemann



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Joshua Harlow

Another option is to turn on the following (for python 3.4+ jobs)

https://docs.python.org/3/library/tracemalloc.html

I think Victor Stinner (who we all know as haypo) has some experience
with that and even did some of the backport patches for 2.7, so he may
have some ideas on how we can plug that in.


Then assuming the following works we can even have a nice UI to analyze 
its reports & do comparison diffs:


http://pytracemalloc.readthedocs.io/tracemallocqt.html

One idea from mtreinish was to hook the following (or some variant of 
it) into oslo.service to get some data:


http://pytracemalloc.readthedocs.io/examples.html#thread-to-write-snapshots-into-files-every-minutes
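
Roughly, I'd expect that to look something like this (a sketch only; the
file names and the one-minute interval are made up, and where exactly it
gets started inside oslo.service is the open question):

import threading
import time
import tracemalloc

def snapshot_loop(prefix='/tmp/tracemalloc', interval=60):
    counter = 0
    while True:
        time.sleep(interval)
        snap = tracemalloc.take_snapshot()
        snap.dump('%s-%d.dump' % (prefix, counter))  # read back with Snapshot.load()
        counter += 1

def start_snapshotting():
    tracemalloc.start(25)   # keep 25 frames of traceback per allocation
    t = threading.Thread(target=snapshot_loop)
    t.daemon = True
    t.start()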

Of course the other big question (that I don't actually know the answer
to) is how tracemalloc works in wsgi containers (such as apache or
eventlet or uwsgi or ...). Seeing that a part of our http services run in
such containers, it seems like a useful thing to wonder about :)


-Josh

Joshua Harlow wrote:

An example of what this (dozer) gathers (attached).

-Josh

Joshua Harlow wrote:

Has anyone tried:

https://github.com/mgedmin/dozer/blob/master/dozer/leak.py#L72

This piece of middleware creates some nice graphs (using PIL) that may
help identify which areas are using what memory (and/or leaking).

https://pypi.python.org/pypi/linesman might also be somewhat useful to
have running.

How any process takes more than 100MB here blows my mind (horizon is
doing nicely, ha); what are people caching in process to have RSS that
large (1.95 GB, woah).

Armando M. wrote:

Hi,

[TL;DR]: OpenStack services have steadily increased their memory
footprints. We need a concerted way to address the oom-kills experienced
in the openstack gate, as we may have reached a ceiling.

Now the longer version:


We have been experiencing some instability in the gate lately due to a
number of reasons. When everything adds up, this means it's rather
difficult to merge anything and knowing we're in feature freeze, that
adds to stress. One culprit was identified to be [1].

We initially tried to increase the swappiness, but that didn't seem to
help. Then we have looked at the resident memory in use. When going back
over the past three releases we have noticed that the aggregated memory
footprint of some openstack projects has grown steadily. We have the
following:

* Mitaka
o neutron: 1.40GB
o nova: 1.70GB
o swift: 640MB
o cinder: 730MB
o keystone: 760MB
o horizon: 17MB
o glance: 538MB
* Newton
o neutron: 1.59GB (+13%)
o nova: 1.67GB (-1%)
o swift: 779MB (+21%)
o cinder: 878MB (+20%)
o keystone: 919MB (+20%)
o horizon: 21MB (+23%)
o glance: 721MB (+34%)
* Ocata
o neutron: 1.75GB (+10%)
o nova: 1.95GB (+16%)
o swift: 703MB (-9%)
o cinder: 920MB (+4%)
o keystone: 903MB (-1%)
o horizon: 25MB (+20%)
o glance: 740MB (+2%)

Numbers are approximated and I only took a couple of samples, but in a
nutshell, the majority of the services have seen double digit growth
over the past two cycles in terms of the amount of RSS memory they use.

Since [1] is observed only since ocata [2], I imagine it's pretty
reasonable to assume that the memory increase may well be a determining
factor in the oom-kills we see in the gate.

Profiling and surgically reducing the memory used by each component in
each service is a lengthy process, but I'd rather see some gate relief
right away. Reducing the number of API workers helps bring the RSS
memory down back to mitaka levels:

* neutron: 1.54GB
* nova: 1.24GB
* swift: 694MB
* cinder: 778MB
* keystone: 891MB
* horizon: 24MB
* glance: 490MB

However, it may have other side effects, like longer execution times, or
increase of timeouts.

Where do we go from here? I am not particularly fond of stop-gap [4],
but it is the one fix that most widely addresses the memory increase we
have experienced across the board.

Thanks,
Armando

[1] https://bugs.launchpad.net/neutron/+bug/1656386

[2]
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog


[3]
http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/


[4] https://review.openstack.org/#/c/427921


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Paul Belanger
On Fri, Feb 03, 2017 at 06:14:01PM +, Jeremy Stanley wrote:
> On 2017-02-03 11:12:04 +0100 (+0100), Miguel Angel Ajo Pelayo wrote:
> [...]
> > So, would it be realistic to bump the flavors RAM to favor our stability in
> > the short term? (considering that the less amount of workload our clouds
> > will be able to take is fewer, but the failure rate will also be fewer, so
> > the rechecks will be reduced).
> 
> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.
> 
++
I completely agree. Halving our quota limit to address the issue of increased
memory consumption seems like the wrong approach.



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-03 Thread Jeremy Stanley
On 2017-02-03 11:12:04 +0100 (+0100), Miguel Angel Ajo Pelayo wrote:
[...]
> So, would it be realistic to bump the flavors' RAM to favor our stability
> in the short term? (considering that the amount of workload our clouds
> will be able to take would be lower, but the failure rate would also be
> lower, so the rechecks would be reduced).

It's an option of last resort, I think. The next consistent flavor
up in most of the providers donating resources is double the one
we're using (which is a fairly typical pattern in public clouds). As
aggregate memory constraints are our primary quota limit, this would
effectively halve our current job capacity.
-- 
Jeremy Stanley



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-03 Thread Miguel Angel Ajo Pelayo
On Fri, Feb 3, 2017 at 7:55 AM, IWAMOTO Toshihiro 
wrote:

> At Wed, 1 Feb 2017 16:24:54 -0800,
> Armando M. wrote:
> >
> > Hi,
> >
> > [TL;DR]: OpenStack services have steadily increased their memory
> > footprints. We need a concerted way to address the oom-kills experienced
> in
> > the openstack gate, as we may have reached a ceiling.
> >
> > Now the longer version:
> > 
> >
> > We have been experiencing some instability in the gate lately due to a
> > number of reasons. When everything adds up, this means it's rather
> > difficult to merge anything and knowing we're in feature freeze, that
> adds
> > to stress. One culprit was identified to be [1].
> >
> > We initially tried to increase the swappiness, but that didn't seem to
> > help. Then we have looked at the resident memory in use. When going back
> > over the past three releases we have noticed that the aggregated memory
> > footprint of some openstack projects has grown steadily. We have the
> > following:
>
> Not sure if it is due to memory shortage, but VMs running CI jobs are
> experiencing sluggishness, which may be the cause of the ovs related
> timeouts[1]. Tempest jobs run dstat to collect system info every
> second. When the timeouts[1] happen, dstat output is also often missing
> for several seconds, which means a VM is having trouble scheduling
> both the ovs related processes and the dstat process.
> Those ovs timeouts affect every project and happen much more often than
> the oom-kills.
>
> Some details are on the lp bug page[2].
>
> The correlation between such sluggishness and VM paging activity is not
> clear. I wonder if VM hosts are under high load or if increasing VM
> memory would help. Those VMs have no free ram for file cache and file
> pages are read again and again, leading to extra IO loads on VM hosts
> and adversely affecting other VMs on the same host.
>
>
Iwamoto, that makes a lot of sense to me.

That makes me think that increasing the available RAM per instance could be
beneficial, even if we'd be able to run less workloads simultaneously.
Compute hosts would see their pressure reduced (since they can accommodate
less workload), instances would run more smoothly, because they'd have more
room for caching and buffers, and we may also see the OOM issues alleviated.

BUT, even if that's a suitable approach for all those problems, which could
very well be inter-related, we should still keep trying to find the culprit
of our memory footprint growth and take countermeasures where reasonable.

Sometimes more RAM is just the cost of progress (new features, the ability
to do online upgrades, better synchronisation patterns based on caching,
etc...); sometimes we'd be able to slash the memory usage by converting
some of our small, repeated services into other things (I'm thinking of
the neutron-ns-metadata proxy being converted to haproxy or nginx + a neat
piece of config).

So, would it be realistic to bump the flavors' RAM to favor our stability
in the short term? (considering that the amount of workload our clouds
will be able to take would be lower, but the failure rate would also be
lower, so the rechecks would be reduced).




>
> [1] http://logstash.openstack.org/#dashboard/file/logstash.json?
> query=message%3A%5C%22no%20response%20to%20inactivity%20probe%5C%22
> [2] https://bugs.launchpad.net/neutron/+bug/1627106/comments/14
>
> --
> IWAMOTO Toshihiro
>


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread IWAMOTO Toshihiro
At Wed, 1 Feb 2017 16:24:54 -0800,
Armando M. wrote:
> 
> Hi,
> 
> [TL;DR]: OpenStack services have steadily increased their memory
> footprints. We need a concerted way to address the oom-kills experienced in
> the openstack gate, as we may have reached a ceiling.
> 
> Now the longer version:
> 
> 
> We have been experiencing some instability in the gate lately due to a
> number of reasons. When everything adds up, this means it's rather
> difficult to merge anything and knowing we're in feature freeze, that adds
> to stress. One culprit was identified to be [1].
> 
> We initially tried to increase the swappiness, but that didn't seem to
> help. Then we have looked at the resident memory in use. When going back
> over the past three releases we have noticed that the aggregated memory
> footprint of some openstack projects has grown steadily. We have the
> following:

Not sure if it is due to memory shortage, but VMs running CI jobs are
experiencing sluggishness, which may be the cause of the ovs related
timeouts[1]. Tempest jobs run dstat to collect system info every
second. When the timeouts[1] happen, dstat output is also often missing
for several seconds, which means a VM is having trouble scheduling
both the ovs related processes and the dstat process.
Those ovs timeouts affect every project and happen much more often than
the oom-kills.
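
A crude way to spot that from logs we already collect (a sketch, assuming
the gate's dstat CSV has an "epoch" timestamp column and one sample per
second) is to look for holes in the dstat timestamps themselves:

import csv
import sys

def dstat_gaps(path, threshold=2.0):
    gaps, last, idx = [], None, None
    with open(path) as f:
        for row in csv.reader(f):
            if idx is None:
                if 'epoch' in row:          # locate the header row
                    idx = row.index('epoch')
                continue
            try:
                now = float(row[idx])
            except (IndexError, ValueError):
                continue
            if last is not None and now - last > threshold:
                gaps.append((last, now - last))
            last = now
    return gaps

for start, length in dstat_gaps(sys.argv[1]):
    print('gap of %.1fs starting at epoch %.0f' % (length, start))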

Some details are on the lp bug page[2].

The correlation between such sluggishness and VM paging activity is not
clear. I wonder if VM hosts are under high load or if increasing VM
memory would help. Those VMs have no free RAM for file cache, so file
pages are read again and again, leading to extra IO load on VM hosts
and adversely affecting other VMs on the same host.


[1] 
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22no%20response%20to%20inactivity%20probe%5C%22
[2] https://bugs.launchpad.net/neutron/+bug/1627106/comments/14

--
IWAMOTO Toshihiro



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Joshua Harlow

Has anyone tried:

https://github.com/mgedmin/dozer/blob/master/dozer/leak.py#L72

This piece of middleware creates some nice graphs (using PIL) that may 
help identify which areas are using what memory (and/or leaking).
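
Wiring it in is pretty trivial; a toy sketch (a real service would wrap
its own WSGI application the same way, and if I read leak.py right the
graphs get served by the middleware itself under its /_dozer URLs):

from wsgiref.simple_server import make_server
from dozer import Dozer

def app(environ, start_response):
    # stand-in for a real service's WSGI application
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello\n']

application = Dozer(app)   # middleware samples per-type object counts over time

if __name__ == '__main__':
    make_server('127.0.0.1', 8080, application).serve_forever()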


https://pypi.python.org/pypi/linesman might also be somewhat useful to 
have running.


How any process takes more than 100MB here blows my mind (horizon is 
doing nicely, ha); what are people caching in process to have RSS that 
large (1.95 GB, woah).


Armando M. wrote:

Hi,

[TL;DR]: OpenStack services have steadily increased their memory
footprints. We need a concerted way to address the oom-kills experienced
in the openstack gate, as we may have reached a ceiling.

Now the longer version:


We have been experiencing some instability in the gate lately due to a
number of reasons. When everything adds up, this means it's rather
difficult to merge anything and knowing we're in feature freeze, that
adds to stress. One culprit was identified to be [1].

We initially tried to increase the swappiness, but that didn't seem to
help. Then we have looked at the resident memory in use. When going back
over the past three releases we have noticed that the aggregated memory
footprint of some openstack projects has grown steadily. We have the
following:

  * Mitaka
  o neutron: 1.40GB
  o nova: 1.70GB
  o swift: 640MB
  o cinder: 730MB
  o keystone: 760MB
  o horizon: 17MB
  o glance: 538MB
  * Newton
  o neutron: 1.59GB (+13%)
  o nova: 1.67GB (-1%)
  o swift: 779MB (+21%)
  o cinder: 878MB (+20%)
  o keystone: 919MB (+20%)
  o horizon: 21MB (+23%)
  o glance: 721MB (+34%)
  * Ocata
  o neutron: 1.75GB (+10%)
  o nova: 1.95GB (+16%)
  o swift: 703MB (-9%)
  o cinder: 920MB (+4%)
  o keystone: 903MB (-1%)
  o horizon: 25MB (+20%)
  o glance: 740MB (+2%)

Numbers are approximated and I only took a couple of samples, but in a
nutshell, the majority of the services have seen double digit growth
over the past two cycles in terms of the amount of RSS memory they use.
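
For anyone who wants to reproduce numbers like these on their own devstack
node, something like this is enough (a rough sketch; the name matching is
naive, and keystone and the placement api run under apache, so they would
need matching on the wsgi process names instead):

import glob

def rss_kb(proc_dir):
    with open(proc_dir + '/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

def service_rss_kb(pattern):
    total = 0
    for proc in glob.glob('/proc/[0-9]*'):
        try:
            with open(proc + '/cmdline') as f:
                cmdline = f.read()
            if pattern in cmdline:
                total += rss_kb(proc)
        except IOError:
            continue   # process exited while we were looking at it
    return total

for svc in ('neutron', 'nova', 'cinder', 'keystone', 'glance', 'swift'):
    print('%-10s %6.2f GB' % (svc, service_rss_kb(svc) / (1024.0 * 1024.0)))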

Since [1] is observed only since ocata [2], I imagine it's pretty
reasonable to assume that the memory increase may well be a determining
factor in the oom-kills we see in the gate.

Profiling and surgically reducing the memory used by each component in
each service is a lengthy process, but I'd rather see some gate relief
right away. Reducing the number of API workers helps bring the RSS
memory down back to mitaka levels:

  * neutron: 1.54GB
  * nova: 1.24GB
  * swift: 694MB
  * cinder: 778MB
  * keystone: 891MB
  * horizon: 24MB
  * glance: 490MB

However, it may have other side effects, like longer execution times, or
increase of timeouts.

Where do we go from here? I am not particularly fond of stop-gap [4],
but it is the one fix that most widely addresses the memory increase we
have experienced across the board.

Thanks,
Armando

[1] https://bugs.launchpad.net/neutron/+bug/1656386

[2]
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
[3]
http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
[4] https://review.openstack.org/#/c/427921



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Robert Collins
On 3 Feb. 2017 16:14, "Robert Collins"  wrote:

This may help. http://jam-bazaar.blogspot.co.nz/2009/11/memory-debugging-with-meliae.html

-rob


Oh, and if I recall correctly RunSnakeRun supports both heapy and meliae.

-rob


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Robert Collins
This may help.
http://jam-bazaar.blogspot.co.nz/2009/11/memory-debugging-with-meliae.html
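
The gist of it, if I remember right (treat this as a sketch; the paths are
made up):

# In the process you want to inspect: dump every live object to JSON.
from meliae import scanner
scanner.dump_all_objects('/tmp/memory-dump.json')

# Later, offline:
from meliae import loader
om = loader.load('/tmp/memory-dump.json')
print(om.summarize())   # object counts and bytes, grouped by type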

-rob

On 3 Feb. 2017 10:39, "Armando M."  wrote:

>
>
> On 2 February 2017 at 13:36, Ihar Hrachyshka  wrote:
>
>> On Thu, Feb 2, 2017 at 7:44 AM, Matthew Treinish 
>> wrote:
>> > Yeah, I'm curious about this too, there seems to be a big jump in
>> Newton for
>> > most of the project. It might not a be a single common cause between
>> them, but
>> > I'd be curious to know what's going on there.
>>
>> Both Matt from Nova as well as me and Armando suspect
>> oslo.versionedobjects. Pattern of memory consumption raise somewhat
>> correlates with the level of adoption for the library, at least in
>> Neutron. That being said, we don't have any numbers, so at this point
>> it's just pointing fingers into Oslo direction. :) Armando is going to
>> collect actual memory profile.
>>
>
> I'll do my best, but I can't guarantee I can come up with something in
> time for RC.
>
>
>>
>> Ihar
>>


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Ed Leafe
On Feb 2, 2017, at 10:16 AM, Matthew Treinish  wrote:

> 

If that was intentional, it is the funniest thing I’ve read today. :)

-- Ed Leafe








Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Kevin Benton
I'm referring to Apache sitting in between the services now as a TLS
terminator and connection proxy. That was not the configuration before but
it is now the default devstack behavior.

See this example from Newton:
http://logs.openstack.org/73/428073/2/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/028ea38/logs/apache_config/
Then this from master:
http://logs.openstack.org/32/421832/4/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/5af5c7c/logs/apache_config/


>The ways in which OpenStack and oslo.service uses eventlet are known to
>have scaling bottle necks. The Keystone team saw substantial throughput
>gains going over to apache hosting.


Right, but there is a difference between scaling issues and a single worker
not being able to handle the peak 5 concurrent requests or so that the gate
jobs experience. The eventlet wsgi server should have no issues with our
gate load.

On Thu, Feb 2, 2017 at 3:01 PM, Sean Dague  wrote:

> On 02/02/2017 04:07 PM, Kevin Benton wrote:
> > This error seems to be new in the ocata cycle. It's either related to a
> > dependency change or the fact that we put Apache in between the services
> > now. Handling more concurrent requests than workers wasn't an issue
> > before.
> >
> > It seems that you are suggesting that eventlet can't handle concurrent
> > connections, which is the entire purpose of the library, no?
>
> The only services that are running on Apache in standard gate jobs are
> keystone and the placement api. Everything else is still the
> oslo.service stack (which basically runs eventlet as a preforking,
> static-worker-count webserver).
>
> The ways in which OpenStack and oslo.service use eventlet are known to
> have scaling bottlenecks. The Keystone team saw substantial throughput
> gains going over to apache hosting.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 04:07 PM, Kevin Benton wrote:
> This error seems to be new in the ocata cycle. It's either related to a
> dependency change or the fact that we put Apache in between the services
> now. Handling more concurrent requests than workers wasn't an issue
> before.  
>
> It seems that you are suggesting that eventlet can't handle concurrent
> connections, which is the entire purpose of the library, no?

The only services that are running on Apache in standard gate jobs are
keystone and the placement api. Everything else is still the
oslo.service stack (which basically runs eventlet as a preforking,
static-worker-count webserver).
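
For those not familiar with that stack, the pattern is roughly this (a toy
sketch, not oslo.service itself; the real code also handles signals and
worker respawning, and gets the worker count from the services' *_workers
options):

import os
import eventlet
from eventlet import wsgi

def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']

sock = eventlet.listen(('127.0.0.1', 8080))
workers = 2

for _ in range(workers):
    if os.fork() == 0:       # child: serve requests cooperatively on green threads
        wsgi.server(sock, app)
        os._exit(0)

os.wait()                    # parent: wait on the static set of children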

The ways in which OpenStack and oslo.service use eventlet are known to
have scaling bottlenecks. The Keystone team saw substantial throughput
gains going over to apache hosting.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Kevin Benton
Note the HTTPS in the traceback in the bug report. Also the mention of
adjusting the Apache mpm settings to fix it. That seems to point to an
issue with Apache in the middle rather than eventlet and API_WORKERS.

On Feb 2, 2017 14:36, "Ihar Hrachyshka"  wrote:

> The BadStatusLine error is well known:
> https://bugs.launchpad.net/nova/+bug/1630664
>
> Now, it doesn't mean that the root cause of the error message is the
> same, and it may as well be that lowering the number of workers
> triggered it. All I am saying is we saw that error in the past.
>
> Ihar
>
> On Thu, Feb 2, 2017 at 1:07 PM, Kevin Benton  wrote:
> > This error seems to be new in the ocata cycle. It's either related to a
> > dependency change or the fact that we put Apache in between the services
> > now. Handling more concurrent requests than workers wasn't an issue
> before.
> >
> > It seems that you are suggesting that eventlet can't handle concurrent
> > connections, which is the entire purpose of the library, no?
> >
> > On Feb 2, 2017 13:53, "Sean Dague"  wrote:
> >>
> >> On 02/02/2017 03:32 PM, Armando M. wrote:
> >> >
> >> >
> >> > On 2 February 2017 at 12:19, Sean Dague wrote:
> >> > > On 02/02/2017 02:28 PM, Armando M. wrote:
> >> > > > On 2 February 2017 at 10:08, Sean Dague wrote:
> >> > > > > On 02/02/2017 12:49 PM, Armando M. wrote:
> >> > > > > > On 2 February 2017 at 08:40, Sean Dague wrote:
> >> > > > > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> >> > > > > > > >
> >> > > > > > > > We definitely aren't saying running a single worker is how
> >> > > > > > > > we recommend people run OpenStack by doing this. But it
> >> > > > > > > > just adds on to the differences between the gate and what
> >> > > > > > > > we expect things actually look like.
> >> > > > > > >
> >> > > > > > > I'm all for actually getting to the bottom of this, but
> >> > > > > > > honestly real memory profiling is needed here. The growth
> >> > > > > > > across projects probably means that some common libraries
> >> > > > > > > are some part of this. The ever growing requirements list is
> >> > > > > > > demonstrative of that. Code reuse is good, but if we are
> >> > > > > > > importing much of a library to get access to a couple of
> >> > > > > > > functions, we're going to take a bunch of memory weight on
> >> > > > > > > that (especially if that library has friendly auto imports
> >> > > > > > > in top level __init__.py so we can't get only the parts we
> >> > > > > > > want).
> >> > > > > > >
> >> > > > > > > Changing the worker count is just shuffling around deck
> >> > > > > > > chairs.
> >> > > > > > >
> >> > > > > > > I'm not familiar enough with memory profiling tools in
> >> > > > > > > python to know the right approach we should take there to
> >> > > > > > > get this down to individual libraries / objects that are
> >> > > > > > > containing all our memory. Anyone more skilled here able to
> >> > > > > > > help lead the way?
> >> > > > > >
> >> > > > > > From what I hear, the overall consensus on this matter is to
> >> > > > > > determine what actually caused the memory consumption bump and
> >> > > > > > how to address it, but that's more of a medium to long term
> >> > > > > > action. In fact, to me this is one of the top priority matters
> >> > > > > > we should talk about at the imminent PTG.
> >> > > > > >
> >> > > > > > For the time being, and to provide relief to the gate, should
> >> > > > > > we want to lock the API_WORKERS to 1? I'll post something for
> >> > > > > > review and see how many people shoot it down :)
> >> > > > >
> >> > > > > I don't think we want to do that. It's going to force down the
> >> > > > > eventlet API workers to being a single process, and it's not
> >> > > > > super clear that eventlet handles backups on the inbound socket
> >> > > > > well. I honestly would expect that creates different hard to
> >> > > > > debug issues, especially with high chatter rates between
> >> > > > > services.

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Mikhail Medvedev
On Thu, Feb 2, 2017 at 12:28 PM, Jeremy Stanley  wrote:
> On 2017-02-02 04:27:51 + (+), Dolph Mathews wrote:
>> What made most services jump +20% between mitaka and newton? Maybe there is
>> a common cause that we can tackle.
> [...]
>
> Almost hesitant to suggest this one but since we primarily use
> Ubuntu 14.04 LTS for stable/mitaka jobs and 16.04 LTS for later
> branches, could bloat in a newer release of the Python 2.7
> interpreter there (or something even lower-level still like glibc)
> be a contributing factor?

In our third-party CI (IBM KVM on Power) we run both stable/mitaka and
master on Ubuntu Xenial. I went ahead and plotted dstat graphs, see
http://dal05.objectstorage.softlayer.net/v1/AUTH_3d8e6ecb-f597-448c-8ec2-164e9f710dd6/pkvmci/dstat20170202/
. It does look like there is some difference in overall memory use -
mitaka uses a bit less. This is anecdotal, but still is an extra data
point. Also note that we have 12G of ram, and we do not see oom kills.

> I agree it's more likely bloat in some
> commonly-used module (possibly even one developed outside our
> community), but potential system-level overhead probably should also
> get some investigation.
> --
> Jeremy Stanley
>

Mikhail Medvedev
IBM



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Clay Gerrard
On Thu, Feb 2, 2017 at 12:50 PM, Sean Dague  wrote:

>
> This is one of the reasons to get the wsgi stack off of eventlet and
> into a real webserver, as they handle HTTP request backups much much
> better.
>
>
To some extent I think this is generally true for *many* common workloads,
but the specifics depend *a lot* on the application under the webserver
that's servicing those requests.

I'm not entirely sure what you have in mind, and may be mistaken to assume
this is a reference to Apache/mod_wsgi?  If that's the case, depending on
how you configure it - aren't you still going to end up with an instance of
the wsgi application per worker-process and have the same front-of-line
queueing issue unless you increase workers?  Maybe if the application is
thread-safe you can use os thread workers - and preemptive interruption for
the GIL is more attractive for the application than eventlet's cooperative
interruption.  Either way, it's not obvious that has a big impact on the
memory footprint issue (assuming the issue is memory growth in the
application and not specifically eventlet.wsgi.server).  But you may have
more relevant experience than I do - happy to be enlightened!

Thanks,

-Clay


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 13:36, Ihar Hrachyshka  wrote:

> On Thu, Feb 2, 2017 at 7:44 AM, Matthew Treinish 
> wrote:
> > Yeah, I'm curious about this too, there seems to be a big jump in Newton
> for
> > most of the project. It might not a be a single common cause between
> them, but
> > I'd be curious to know what's going on there.
>
> Both Matt from Nova as well as me and Armando suspect
> oslo.versionedobjects. The pattern of memory consumption growth somewhat
> correlates with the level of adoption of the library, at least in
> Neutron. That being said, we don't have any numbers, so at this point
> it's just pointing fingers in Oslo's direction. :) Armando is going to
> collect an actual memory profile.
>

I'll do my best, but I can't guarantee I can come up with something in time
for RC.


>
> Ihar
>


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 13:34, Ihar Hrachyshka  wrote:

> The BadStatusLine error is well known:
> https://bugs.launchpad.net/nova/+bug/1630664


That's the one! I knew I had seen it in the past!


>
>
> Now, it doesn't mean that the root cause of the error message is the
> same, and it may as well be that lowering the number of workers
> triggered it. All I am saying is we saw that error in the past.
>
> Ihar
>
> On Thu, Feb 2, 2017 at 1:07 PM, Kevin Benton  wrote:
> > This error seems to be new in the ocata cycle. It's either related to a
> > dependency change or the fact that we put Apache in between the services
> > now. Handling more concurrent requests than workers wasn't an issue
> before.
> >
> > It seems that you are suggesting that eventlet can't handle concurrent
> > connections, which is the entire purpose of the library, no?
> >
> > On Feb 2, 2017 13:53, "Sean Dague"  wrote:
> >>
> >> On 02/02/2017 03:32 PM, Armando M. wrote:
> >> >
> >> >
> >> > On 2 February 2017 at 12:19, Sean Dague wrote:
> >> > > On 02/02/2017 02:28 PM, Armando M. wrote:
> >> > > > On 2 February 2017 at 10:08, Sean Dague wrote:
> >> > > > > On 02/02/2017 12:49 PM, Armando M. wrote:
> >> > > > > > On 2 February 2017 at 08:40, Sean Dague wrote:
> >> > > > > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> >> > > > > > > >
> >> > > > > > > > We definitely aren't saying running a single worker is how
> >> > > > > > > > we recommend people run OpenStack by doing this. But it
> >> > > > > > > > just adds on to the differences between the gate and what
> >> > > > > > > > we expect things actually look like.
> >> > > > > > >
> >> > > > > > > I'm all for actually getting to the bottom of this, but
> >> > > > > > > honestly real memory profiling is needed here. The growth
> >> > > > > > > across projects probably means that some common libraries
> >> > > > > > > are some part of this. The ever growing requirements list is
> >> > > > > > > demonstrative of that. Code reuse is good, but if we are
> >> > > > > > > importing much of a library to get access to a couple of
> >> > > > > > > functions, we're going to take a bunch of memory weight on
> >> > > > > > > that (especially if that library has friendly auto imports
> >> > > > > > > in top level __init__.py so we can't get only the parts we
> >> > > > > > > want).
> >> > > > > > >
> >> > > > > > > Changing the worker count is just shuffling around deck
> >> > > > > > > chairs.
> >> > > > > > >
> >> > > > > > > I'm not familiar enough with memory profiling tools in
> >> > > > > > > python to know the right approach we should take there to
> >> > > > > > > get this down to individual libraries / objects that are
> >> > > > > > > containing all our memory. Anyone more skilled here able to
> >> > > > > > > help lead the way?
> >> > > > > >
> >> > > > > > From what I hear, the overall consensus on this matter is to
> >> > > > > > determine what actually caused the memory consumption bump and
> >> > > > > > how to address it, but that's more of a medium to long term
> >> > > > > > action. In fact, to me this is one of the top priority matters
> >> > > > > > we should talk about at the imminent PTG.
> >> > > > > >
> >> > > > > > For the time being, and to provide relief to the gate, should
> >> > > > > > we want to lock the API_WORKERS to 1? I'll post something for
> >> > > > > > review and see how many people shoot it down :)
> >> > > > >
> >> > > > > I don't think we want to do that. It's going to force down the
> >> > > > > eventlet API workers to being a single process, and it's not
> >> > > > > super clear that eventlet handles backups on the inbound socket
> >> > > > > well. I honestly would expect that creates different hard to
> >> > > > > debug issues, especially with high chatter rates between
> >> > > > > services.

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Ihar Hrachyshka
On Thu, Feb 2, 2017 at 7:44 AM, Matthew Treinish  wrote:
> Yeah, I'm curious about this too, there seems to be a big jump in Newton for
> most of the projects. It might not be a single common cause between them, but
> I'd be curious to know what's going on there.

Matt from Nova, Armando, and I all suspect oslo.versionedobjects. The
pattern of memory consumption growth somewhat correlates with the level of
adoption of the library, at least in Neutron. That being said, we don't
have any numbers, so at this point it's just pointing fingers in Oslo's
direction. :) Armando is going to collect an actual memory profile.
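
As a quick first check before the real profile lands, something like the
stdlib-only sketch below could at least show how much resident memory a
single suspect import drags in. This is only illustrative: the module name
is just an example, it runs in a fresh interpreter, and it captures
import-time cost only, not growth from objects created at runtime.

    # rough_import_cost.py - hypothetical sketch, Linux only (ru_maxrss is kB there)
    import importlib
    import resource
    import sys

    def max_rss_kb():
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    if __name__ == "__main__":
        # e.g. python rough_import_cost.py oslo_versionedobjects.base
        module = sys.argv[1] if len(sys.argv) > 1 else "oslo_versionedobjects.base"
        before = max_rss_kb()
        importlib.import_module(module)
        after = max_rss_kb()
        print("%s adds roughly %d kB of max RSS at import time" % (module, after - before))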

Ihar

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 12:50, Sean Dague  wrote:

> On 02/02/2017 03:32 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 12:19, Sean Dague  > > wrote:
> >
> > On 02/02/2017 02:28 PM, Armando M. wrote:
> > >
> > >
> > > On 2 February 2017 at 10:08, Sean Dague wrote:
> > >
> > > On 02/02/2017 12:49 PM, Armando M. wrote:
> > > >
> > > >
> > > > On 2 February 2017 at 08:40, Sean Dague wrote:
> > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > > > 
> > > > > 
> > > > >
> > > > > We definitely aren't saying running a single worker is
> how
> > > we recommend people
> > > > > run OpenStack by doing this. But it just adds on to the
> > > differences between the
> > > > > gate and what we expect things actually look like.
> > > >
> > > > I'm all for actually getting to the bottom of this, but
> > > honestly real
> > > > memory profiling is needed here. The growth across
> projects
> > > probably
> > > > means that some common libraries are some part of this.
> The
> > > ever growing
> > > > requirements list is demonstrative of that. Code reuse is
> > > good, but if
> > > > we are importing much of a library to get access to a
> > couple of
> > > > functions, we're going to take a bunch of memory weight
> > on that
> > > > (especially if that library has friendly auto imports in
> > top level
> > > > __init__.py so we can't get only the parts we want).
> > > >
> > > > Changing the worker count is just shuffling around deck
> > chairs.
> > > >
> > > > I'm not familiar enough with memory profiling tools in
> > python
> > > to know
> > > > the right approach we should take there to get this down
> to
> > > individual
> > > > libraries / objects that are containing all our memory.
> > Anyone
> > > more
> > > > skilled here able to help lead the way?
> > > >
> > > >
> > > > From what I hear, the overall consensus on this matter is to
> > determine
> > > > what actually caused the memory consumption bump and how to
> > > address it,
> > > > but that's more of a medium to long term action. In fact, to
> me
> > > this is
> > > > one of the top priority matters we should talk about at the
> > > imminent PTG.
> > > >
> > > > For the time being, and to provide relief to the gate,
> should we
> > > want to
> > > > lock the API_WORKERS to 1? I'll post something for review
> > and see how
> > > > many people shoot it down :)
> > >
> > > I don't think we want to do that. It's going to force down the
> > eventlet
> > > API workers to being a single process, and it's not super
> > clear that
> > > eventlet handles backups on the inbound socket well. I
> > honestly would
> > > expect that creates different hard to debug issues, especially
> > with high
> > > chatter rates between services.
> > >
> > >
> > > I must admit I share your fear, but out of the tests that I have
> > > executed so far in [1,2,3], the house didn't burn in a fire. I am
> > > looking for other ways to have a substantial memory saving with a
> > > relatively quick and dirty fix, but coming up empty handed thus
> far.
> > >
> > > [1] https://review.openstack.org/#/c/428303/
> > 
> > > [2] https://review.openstack.org/#/c/427919/
> > 
> > > [3] https://review.openstack.org/#/c/427921/
> > 
> >
> > This failure in the first patch -
> > http://logs.openstack.org/03/428303/1/check/gate-tempest-
> dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-
> api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
> >  dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-
> api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751>
> > looks exactly like I would expect by API Worker starvation.
> >
> >
> > Not sure I agree on this one, this has been observed multiple times in
> > the gate already [1] (though I am not sure there's a bug 

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Ihar Hrachyshka
The BadStatusLine error is well known:
https://bugs.launchpad.net/nova/+bug/1630664

Now, it doesn't mean that the root cause of the error message is the
same, and it may as well be that lowering the number of workers
triggered it. All I am saying is we saw that error in the past.

Ihar

On Thu, Feb 2, 2017 at 1:07 PM, Kevin Benton  wrote:
> This error seems to be new in the ocata cycle. It's either related to a
> dependency change or the fact that we put Apache in between the services
> now. Handling more concurrent requests than workers wasn't an issue before.
>
> It seems that you are suggesting that eventlet can't handle concurrent
> connections, which is the entire purpose of the library, no?
>
> On Feb 2, 2017 13:53, "Sean Dague"  wrote:
>>
>> On 02/02/2017 03:32 PM, Armando M. wrote:
>> >
>> >
>> > On 2 February 2017 at 12:19, Sean Dague > > > wrote:
>> >
>> > On 02/02/2017 02:28 PM, Armando M. wrote:
>> > >
>> > >
>> > > On 2 February 2017 at 10:08, Sean Dague wrote:
>> > >
>> > > On 02/02/2017 12:49 PM, Armando M. wrote:
>> > > >
>> > > >
>> > > > On 2 February 2017 at 08:40, Sean Dague wrote:
>> > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
>> > > > 
>> > > > > 
>> > > > >
>> > > > > We definitely aren't saying running a single worker is
>> > how
>> > > we recommend people
>> > > > > run OpenStack by doing this. But it just adds on to
>> > the
>> > > differences between the
>> > > > > gate and what we expect things actually look like.
>> > > >
>> > > > I'm all for actually getting to the bottom of this, but
>> > > honestly real
>> > > > memory profiling is needed here. The growth across
>> > projects
>> > > probably
>> > > > means that some common libraries are some part of this.
>> > The
>> > > ever growing
>> > > > requirements list is demonstrative of that. Code reuse
>> > is
>> > > good, but if
>> > > > we are importing much of a library to get access to a
>> > couple of
>> > > > functions, we're going to take a bunch of memory weight
>> > on that
>> > > > (especially if that library has friendly auto imports in
>> > top level
>> > > > __init__.py so we can't get only the parts we want).
>> > > >
>> > > > Changing the worker count is just shuffling around deck
>> > chairs.
>> > > >
>> > > > I'm not familiar enough with memory profiling tools in
>> > python
>> > > to know
>> > > > the right approach we should take there to get this down
>> > to
>> > > individual
>> > > > libraries / objects that are containing all our memory.
>> > Anyone
>> > > more
>> > > > skilled here able to help lead the way?
>> > > >
>> > > >
>> > > > From what I hear, the overall consensus on this matter is to
>> > determine
>> > > > what actually caused the memory consumption bump and how to
>> > > address it,
>> > > > but that's more of a medium to long term action. In fact, to
>> > me
>> > > this is
>> > > > one of the top priority matters we should talk about at the
>> > > imminent PTG.
>> > > >
>> > > > For the time being, and to provide relief to the gate,
>> > should we
>> > > want to
>> > > > lock the API_WORKERS to 1? I'll post something for review
>> > and see how
>> > > > many people shoot it down :)
>> > >
>> > > I don't think we want to do that. It's going to force down the
>> > eventlet
>> > > API workers to being a single process, and it's not super
>> > clear that
>> > > eventlet handles backups on the inbound socket well. I
>> > honestly would
>> > > expect that creates different hard to debug issues, especially
>> > with high
>> > > chatter rates between services.
>> > >
>> > >
>> > > I must admit I share your fear, but out of the tests that I have
>> > > executed so far in [1,2,3], the house didn't burn in a fire. I am
>> > > looking for other ways to have a substantial memory saving with a
>> > > relatively quick and dirty fix, but coming up empty handed thus
>> > far.
>> > >
>> > > [1] https://review.openstack.org/#/c/428303/
>> >   

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Kevin Benton
This error seems to be new in the ocata cycle. It's either related to a
dependency change or the fact that we put Apache in between the services
now. Handling more concurrent requests than workers wasn't an issue before.


It seems that you are suggesting that eventlet can't handle concurrent
connections, which is the entire purpose of the library, no?
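
For anyone following along who hasn't looked at the model: below is a
minimal, self-contained illustration (not our actual service wiring; the
port and handler are made up) of a single eventlet WSGI worker serving many
concurrent connections from one process with green threads, which is the
behaviour I'm referring to.

    # single-worker eventlet illustration
    import eventlet
    eventlet.monkey_patch()

    from eventlet import wsgi

    def app(environ, start_response):
        eventlet.sleep(1)  # simulated slow I/O; other requests keep being served
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok\n']

    if __name__ == '__main__':
        wsgi.server(eventlet.listen(('127.0.0.1', 8080)), app)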

On Feb 2, 2017 13:53, "Sean Dague"  wrote:

> On 02/02/2017 03:32 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 12:19, Sean Dague  > > wrote:
> >
> > On 02/02/2017 02:28 PM, Armando M. wrote:
> > >
> > >
> > > On 2 February 2017 at 10:08, Sean Dague wrote:
> > >
> > > On 02/02/2017 12:49 PM, Armando M. wrote:
> > > >
> > > >
> > > > On 2 February 2017 at 08:40, Sean Dague wrote:
> > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > > > 
> > > > > 
> > > > >
> > > > > We definitely aren't saying running a single worker is
> how
> > > we recommend people
> > > > > run OpenStack by doing this. But it just adds on to the
> > > differences between the
> > > > > gate and what we expect things actually look like.
> > > >
> > > > I'm all for actually getting to the bottom of this, but
> > > honestly real
> > > > memory profiling is needed here. The growth across
> projects
> > > probably
> > > > means that some common libraries are some part of this.
> The
> > > ever growing
> > > > requirements list is demonstrative of that. Code reuse is
> > > good, but if
> > > > we are importing much of a library to get access to a
> > couple of
> > > > functions, we're going to take a bunch of memory weight
> > on that
> > > > (especially if that library has friendly auto imports in
> > top level
> > > > __init__.py so we can't get only the parts we want).
> > > >
> > > > Changing the worker count is just shuffling around deck
> > chairs.
> > > >
> > > > I'm not familiar enough with memory profiling tools in
> > python
> > > to know
> > > > the right approach we should take there to get this down
> to
> > > individual
> > > > libraries / objects that are containing all our memory.
> > Anyone
> > > more
> > > > skilled here able to help lead the way?
> > > >
> > > >
> > > > From what I hear, the overall consensus on this matter is to
> > determine
> > > > what actually caused the memory consumption bump and how to
> > > address it,
> > > > but that's more of a medium to long term action. In fact, to
> me
> > > this is
> > > > one of the top priority matters we should talk about at the
> > > imminent PTG.
> > > >
> > > > For the time being, and to provide relief to the gate,
> should we
> > > want to
> > > > lock the API_WORKERS to 1? I'll post something for review
> > and see how
> > > > many people shoot it down :)
> > >
> > > I don't think we want to do that. It's going to force down the
> > eventlet
> > > API workers to being a single process, and it's not super
> > clear that
> > > eventlet handles backups on the inbound socket well. I
> > honestly would
> > > expect that creates different hard to debug issues, especially
> > with high
> > > chatter rates between services.
> > >
> > >
> > > I must admit I share your fear, but out of the tests that I have
> > > executed so far in [1,2,3], the house didn't burn in a fire. I am
> > > looking for other ways to have a substantial memory saving with a
> > > relatively quick and dirty fix, but coming up empty handed thus
> far.
> > >
> > > [1] https://review.openstack.org/#/c/428303/
> > 
> > > [2] https://review.openstack.org/#/c/427919/
> > 
> > > [3] https://review.openstack.org/#/c/427921/
> > 
> >
> > This failure in the first patch -
> > http://logs.openstack.org/03/428303/1/check/gate-tempest-
> dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-
> api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
> > 

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 03:32 PM, Armando M. wrote:
> 
> 
> On 2 February 2017 at 12:19, Sean Dague  > wrote:
> 
> On 02/02/2017 02:28 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 10:08, Sean Dague wrote:
> >
> > On 02/02/2017 12:49 PM, Armando M. wrote:
> > >
> > >
> > > On 2 February 2017 at 08:40, Sean Dague wrote:
> > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > > 
> > > > 
> > > >
> > > > We definitely aren't saying running a single worker is how
> > we recommend people
> > > > run OpenStack by doing this. But it just adds on to the
> > differences between the
> > > > gate and what we expect things actually look like.
> > >
> > > I'm all for actually getting to the bottom of this, but
> > honestly real
> > > memory profiling is needed here. The growth across projects
> > probably
> > > means that some common libraries are some part of this. The
> > ever growing
> > > requirements list is demonstrative of that. Code reuse is
> > good, but if
> > > we are importing much of a library to get access to a
> couple of
> > > functions, we're going to take a bunch of memory weight
> on that
> > > (especially if that library has friendly auto imports in
> top level
> > > __init__.py so we can't get only the parts we want).
> > >
> > > Changing the worker count is just shuffling around deck
> chairs.
> > >
> > > I'm not familiar enough with memory profiling tools in
> python
> > to know
> > > the right approach we should take there to get this down to
> > individual
> > > libraries / objects that are containing all our memory.
> Anyone
> > more
> > > skilled here able to help lead the way?
> > >
> > >
> > > From what I hear, the overall consensus on this matter is to
> determine
> > > what actually caused the memory consumption bump and how to
> > address it,
> > > but that's more of a medium to long term action. In fact, to me
> > this is
> > > one of the top priority matters we should talk about at the
> > imminent PTG.
> > >
> > > For the time being, and to provide relief to the gate, should we
> > want to
> > > lock the API_WORKERS to 1? I'll post something for review
> and see how
> > > many people shoot it down :)
> >
> > I don't think we want to do that. It's going to force down the
> eventlet
> > API workers to being a single process, and it's not super
> clear that
> > eventlet handles backups on the inbound socket well. I
> honestly would
> > expect that creates different hard to debug issues, especially
> with high
> > chatter rates between services.
> >
> >
> > I must admit I share your fear, but out of the tests that I have
> > executed so far in [1,2,3], the house didn't burn in a fire. I am
> > looking for other ways to have a substantial memory saving with a
> > relatively quick and dirty fix, but coming up empty handed thus far.
> >
> > [1] https://review.openstack.org/#/c/428303/
> 
> > [2] https://review.openstack.org/#/c/427919/
> 
> > [3] https://review.openstack.org/#/c/427921/
> 
> 
> This failure in the first patch -
> 
> http://logs.openstack.org/03/428303/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
> 
> 
> looks exactly like I would expect by API Worker starvation.
> 
> 
> Not sure I agree on this one, this has been observed multiple times in
> the gate already [1] (though I am not sure there's a bug for it), and I
> don't believe it has anything to do with the number of API workers,
> unless not even two workers are enough.

There is no guarantee that 2 workers are enough. I'm not surprised if we
see some of that failure today. This was all guess work on trimming worker
counts to 

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 12:19, Sean Dague  wrote:

> On 02/02/2017 02:28 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 10:08, Sean Dague  > > wrote:
> >
> > On 02/02/2017 12:49 PM, Armando M. wrote:
> > >
> > >
> > > On 2 February 2017 at 08:40, Sean Dague wrote:
> > >
> > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > > 
> > > > 
> > > >
> > > > We definitely aren't saying running a single worker is how
> > we recommend people
> > > > run OpenStack by doing this. But it just adds on to the
> > differences between the
> > > > gate and what we expect things actually look like.
> > >
> > > I'm all for actually getting to the bottom of this, but
> > honestly real
> > > memory profiling is needed here. The growth across projects
> > probably
> > > means that some common libraries are some part of this. The
> > ever growing
> > > requirements list is demonstrative of that. Code reuse is
> > good, but if
> > > we are importing much of a library to get access to a couple of
> > > functions, we're going to take a bunch of memory weight on that
> > > (especially if that library has friendly auto imports in top
> level
> > > __init__.py so we can't get only the parts we want).
> > >
> > > Changing the worker count is just shuffling around deck chairs.
> > >
> > > I'm not familiar enough with memory profiling tools in python
> > to know
> > > the right approach we should take there to get this down to
> > individual
> > > libraries / objects that are containing all our memory. Anyone
> > more
> > > skilled here able to help lead the way?
> > >
> > >
> > > From what I hear, the overall consensus on this matter is to
> determine
> > > what actually caused the memory consumption bump and how to
> > address it,
> > > but that's more of a medium to long term action. In fact, to me
> > this is
> > > one of the top priority matters we should talk about at the
> > imminent PTG.
> > >
> > > For the time being, and to provide relief to the gate, should we
> > want to
> > > lock the API_WORKERS to 1? I'll post something for review and see
> how
> > > many people shoot it down :)
> >
> > I don't think we want to do that. It's going to force down the
> eventlet
> > API workers to being a single process, and it's not super clear that
> > eventlet handles backups on the inbound socket well. I honestly would
> > expect that creates different hard to debug issues, especially with
> high
> > chatter rates between services.
> >
> >
> > I must admit I share your fear, but out of the tests that I have
> > executed so far in [1,2,3], the house didn't burn in a fire. I am
> > looking for other ways to have a substantial memory saving with a
> > relatively quick and dirty fix, but coming up empty handed thus far.
> >
> > [1] https://review.openstack.org/#/c/428303/
> > [2] https://review.openstack.org/#/c/427919/
> > [3] https://review.openstack.org/#/c/427921/
>
> This failure in the first patch -
> http://logs.openstack.org/03/428303/1/check/gate-tempest-
> dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-
> api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
> looks exactly like I would expect by API Worker starvation.
>

Not sure I agree on this one, this has been observed multiple times in the
gate already [1] (though I am not sure there's a bug for it), and I don't
believe it has anything to do with the number of API workers, unless not
even two workers are enough.

[1]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22('Connection%20aborted.'%2C%20BadStatusLine(%5C%22''%5C%22%2C)%5C%22



> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 02:28 PM, Armando M. wrote:
> 
> 
> On 2 February 2017 at 10:08, Sean Dague  > wrote:
> 
> On 02/02/2017 12:49 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 08:40, Sean Dague wrote:
> >
> > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > 
> > > 
> > >
> > > We definitely aren't saying running a single worker is how
> we recommend people
> > > run OpenStack by doing this. But it just adds on to the
> differences between the
> > > gate and what we expect things actually look like.
> >
> > I'm all for actually getting to the bottom of this, but
> honestly real
> > memory profiling is needed here. The growth across projects
> probably
> > means that some common libraries are some part of this. The
> ever growing
> > requirements list is demonstrative of that. Code reuse is
> good, but if
> > we are importing much of a library to get access to a couple of
> > functions, we're going to take a bunch of memory weight on that
> > (especially if that library has friendly auto imports in top level
> > __init__.py so we can't get only the parts we want).
> >
> > Changing the worker count is just shuffling around deck chairs.
> >
> > I'm not familiar enough with memory profiling tools in python
> to know
> > the right approach we should take there to get this down to
> individual
> > libraries / objects that are containing all our memory. Anyone
> more
> > skilled here able to help lead the way?
> >
> >
> > From what I hear, the overall consensus on this matter is to determine
> > what actually caused the memory consumption bump and how to
> address it,
> > but that's more of a medium to long term action. In fact, to me
> this is
> > one of the top priority matters we should talk about at the
> imminent PTG.
> >
> > For the time being, and to provide relief to the gate, should we
> want to
> > lock the API_WORKERS to 1? I'll post something for review and see how
> > many people shoot it down :)
> 
> I don't think we want to do that. It's going to force down the eventlet
> API workers to being a single process, and it's not super clear that
> eventlet handles backups on the inbound socket well. I honestly would
> expect that creates different hard to debug issues, especially with high
> chatter rates between services.
> 
> 
> I must admit I share your fear, but out of the tests that I have
> executed so far in [1,2,3], the house didn't burn in a fire. I am
> looking for other ways to have a substantial memory saving with a
> relatively quick and dirty fix, but coming up empty handed thus far.
> 
> [1] https://review.openstack.org/#/c/428303/
> [2] https://review.openstack.org/#/c/427919/
> [3] https://review.openstack.org/#/c/427921/

This failure in the first patch -
http://logs.openstack.org/03/428303/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
looks exactly like I would expect by API Worker starvation.

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 10:08, Sean Dague  wrote:

> On 02/02/2017 12:49 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 08:40, Sean Dague  > > wrote:
> >
> > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > 
> > > 
> > >
> > > We definitely aren't saying running a single worker is how we
> recommend people
> > > run OpenStack by doing this. But it just adds on to the
> differences between the
> > > gate and what we expect things actually look like.
> >
> > I'm all for actually getting to the bottom of this, but honestly real
> > memory profiling is needed here. The growth across projects probably
> > means that some common libraries are some part of this. The ever
> growing
> > requirements list is demonstrative of that. Code reuse is good, but
> if
> > we are importing much of a library to get access to a couple of
> > functions, we're going to take a bunch of memory weight on that
> > (especially if that library has friendly auto imports in top level
> > __init__.py so we can't get only the parts we want).
> >
> > Changing the worker count is just shuffling around deck chairs.
> >
> > I'm not familiar enough with memory profiling tools in python to know
> > the right approach we should take there to get this down to
> individual
> > libraries / objects that are containing all our memory. Anyone more
> > skilled here able to help lead the way?
> >
> >
> > From what I hear, the overall consensus on this matter is to determine
> > what actually caused the memory consumption bump and how to address it,
> > but that's more of a medium to long term action. In fact, to me this is
> > one of the top priority matters we should talk about at the imminent PTG.
> >
> > For the time being, and to provide relief to the gate, should we want to
> > lock the API_WORKERS to 1? I'll post something for review and see how
> > many people shoot it down :)
>
> I don't think we want to do that. It's going to force down the eventlet
> API workers to being a single process, and it's not super clear that
> eventlet handles backups on the inbound socket well. I honestly would
> expect that creates different hard to debug issues, especially with high
> chatter rates between services.
>

I must admit I share your fear, but out of the tests that I have executed
so far in [1,2,3], the house didn't burn in a fire. I am looking for other
ways to have a substantial memory saving with a relatively quick and dirty
fix, but coming up empty handed thus far.

[1] https://review.openstack.org/#/c/428303/
[2] https://review.openstack.org/#/c/427919/
[3] https://review.openstack.org/#/c/427921/


>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Jeremy Stanley
On 2017-02-02 04:27:51 + (+), Dolph Mathews wrote:
> What made most services jump +20% between mitaka and newton? Maybe there is
> a common cause that we can tackle.
[...]

Almost hesitant to suggest this one but since we primarily use
Ubuntu 14.04 LTS for stable/mitaka jobs and 16.04 LTS for later
branches, could bloat in a newer release of the Python 2.7
interpreter there (or something even lower-level still like glibc)
be a contributing factor? I agree it's more likely bloat in some
commonly-used module (possibly even one developed outside our
community), but potential system-level overhead probably should also
get some investigation.
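
A cheap way to get a rough signal on that theory, if someone has a trusty
node and a xenial node handy: run the snippet below with each distro's
stock /usr/bin/python2.7 and compare the numbers. It only measures the bare
interpreter's own max RSS, so it won't catch allocator behaviour under real
workloads, but a large gap would be suggestive.

    # baseline_rss.py - prints the idle interpreter's max RSS (kB on Linux)
    import resource
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)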
-- 
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 12:49 PM, Armando M. wrote:
> 
> 
> On 2 February 2017 at 08:40, Sean Dague  > wrote:
> 
> On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> 
> > 
> >
> > We definitely aren't saying running a single worker is how we recommend 
> people
> > run OpenStack by doing this. But it just adds on to the differences 
> between the
> > gate and what we expect things actually look like.
> 
> I'm all for actually getting to the bottom of this, but honestly real
> memory profiling is needed here. The growth across projects probably
> means that some common libraries are some part of this. The ever growing
> requirements list is demonstrative of that. Code reuse is good, but if
> we are importing much of a library to get access to a couple of
> functions, we're going to take a bunch of memory weight on that
> (especially if that library has friendly auto imports in top level
> __init__.py so we can't get only the parts we want).
> 
> Changing the worker count is just shuffling around deck chairs.
> 
> I'm not familiar enough with memory profiling tools in python to know
> the right approach we should take there to get this down to individual
> libraries / objects that are containing all our memory. Anyone more
> skilled here able to help lead the way?
> 
> 
> From what I hear, the overall consensus on this matter is to determine
> what actually caused the memory consumption bump and how to address it,
> but that's more of a medium to long term action. In fact, to me this is
> one of the top priority matters we should talk about at the imminent PTG.
> 
> For the time being, and to provide relief to the gate, should we want to
> lock the API_WORKERS to 1? I'll post something for review and see how
> many people shoot it down :)

I don't think we want to do that. It's going to force down the eventlet
API workers to being a single process, and it's not super clear that
eventlet handles backups on the inbound socket well. I honestly would
expect that creates different hard to debug issues, especially with high
chatter rates between services.

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 08:40, Sean Dague  wrote:

> On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> 
> > 
> >
> > We definitely aren't saying running a single worker is how we recommend
> people
> > run OpenStack by doing this. But it just adds on to the differences
> between the
> > gate and what we expect things actually look like.
>
> I'm all for actually getting to the bottom of this, but honestly real
> memory profiling is needed here. The growth across projects probably
> means that some common libraries are some part of this. The ever growing
> requirements list is demonstrative of that. Code reuse is good, but if
> we are importing much of a library to get access to a couple of
> functions, we're going to take a bunch of memory weight on that
> (especially if that library has friendly auto imports in top level
> __init__.py so we can't get only the parts we want).
>
> Changing the worker count is just shuffling around deck chairs.
>
> I'm not familiar enough with memory profiling tools in python to know
> the right approach we should take there to get this down to individual
> libraries / objects that are containing all our memory. Anyone more
> skilled here able to help lead the way?
>

>From what I hear, the overall consensus on this matter is to determine what
actually caused the memory consumption bump and how to address it, but
that's more of a medium to long term action. In fact, to me this is one of
the top priority matters we should talk about at the imminent PTG.

For the time being, and to provide relief to the gate, should we want to
lock the API_WORKERS to 1? I'll post something for review and see how many
people shoot it down :)

Thanks for your feedback!
Cheers,
Armando


>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Andrey Kurilin
On Thu, Feb 2, 2017 at 6:40 PM, Sean Dague  wrote:

> On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> 
> > 
> >
> > We definitely aren't saying running a single worker is how we recommend
> people
> > run OpenStack by doing this. But it just adds on to the differences
> between the
> > gate and what we expect things actually look like.
>
> I'm all for actually getting to the bottom of this, but honestly real
> memory profiling is needed here. The growth across projects probably
> means that some common libraries are some part of this. The ever growing
> requirements list is demonstrative of that. Code reuse is good, but if
> we are importing much of a library to get access to a couple of
> functions, we're going to take a bunch of memory weight on that
> (especially if that library has friendly auto imports in top level
> __init__.py so we can't get only the parts we want).
>

Sounds like a new version of the "oslo-incubator" idea.


>
> Changing the worker count is just shuffling around deck chairs.
>
> I'm not familiar enough with memory profiling tools in python to know
> the right approach we should take there to get this down to individual
> libraries / objects that are containing all our memory. Anyone more
> skilled here able to help lead the way?
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Best regards,
Andrey Kurilin.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 11:16 AM, Matthew Treinish wrote:

> 
> 
> We definitely aren't saying running a single worker is how we recommend people
> run OpenStack by doing this. But it just adds on to the differences between 
> the
> gate and what we expect things actually look like.

I'm all for actually getting to the bottom of this, but honestly real
memory profiling is needed here. The growth across projects probably
means that some common libraries are some part of this. The ever growing
requirements list is demonstrative of that. Code reuse is good, but if
we are importing much of a library to get access to a couple of
functions, we're going to take a bunch of memory weight on that
(especially if that library has friendly auto imports in top level
__init__.py so we can't get only the parts we want).

Changing the worker count is just shuffling around deck chairs.

I'm not familiar enough with memory profiling tools in python to know
the right approach we should take there to get this down to individual
libraries / objects that are containing all our memory. Anyone more
skilled here able to help lead the way?
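
To get the ball rolling, one possible approach (a sketch only, assuming the
pympler package can be installed into the service's environment) is to dump
a per-type summary of live objects from inside a running API worker, for
example from a debug-only hook, and then diff the summaries between
releases:

    # object_summary.py - hypothetical helper built on pympler's muppy/summary
    from pympler import muppy, summary

    def dump_object_summary():
        objs = muppy.get_objects()      # every object the GC can see
        rows = summary.summarize(objs)  # aggregate by type with sizes
        summary.print_(rows)            # prints the largest types first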

-Sean

-- 
Sean Dague
http://dague.net

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Matthew Treinish
On Thu, Feb 02, 2017 at 11:10:22AM -0500, Matthew Treinish wrote:
> On Wed, Feb 01, 2017 at 04:24:54PM -0800, Armando M. wrote:
> > Hi,
> > 
> > [TL;DR]: OpenStack services have steadily increased their memory
> > footprints. We need a concerted way to address the oom-kills experienced in
> > the openstack gate, as we may have reached a ceiling.
> > 
> > Now the longer version:
> > 
> > 
> > We have been experiencing some instability in the gate lately due to a
> > number of reasons. When everything adds up, this means it's rather
> > difficult to merge anything and knowing we're in feature freeze, that adds
> > to stress. One culprit was identified to be [1].
> > 
> > We initially tried to increase the swappiness, but that didn't seem to
> > help. Then we have looked at the resident memory in use. When going back
> > over the past three releases we have noticed that the aggregated memory
> > footprint of some openstack projects has grown steadily. We have the
> > following:
> > 
> >- Mitaka
> >   - neutron: 1.40GB
> >   - nova: 1.70GB
> >   - swift: 640MB
> >   - cinder: 730MB
> >   - keystone: 760MB
> >   - horizon: 17MB
> >   - glance: 538MB
> >- Newton
> >- neutron: 1.59GB (+13%)
> >   - nova: 1.67GB (-1%)
> >   - swift: 779MB (+21%)
> >   - cinder: 878MB (+20%)
> >   - keystone: 919MB (+20%)
> >   - horizon: 21MB (+23%)
> >   - glance: 721MB (+34%)
> >- Ocata
> >   - neutron: 1.75GB (+10%)
> >   - nova: 1.95GB (+16%)
> >   - swift: 703MB (-9%)
> >   - cinder: 920MB (+4%)
> >   - keystone: 903MB (-1%)
> >   - horizon: 25MB (+20%)
> >   - glance: 740MB (+2%)
> > 
> > Numbers are approximated and I only took a couple of samples, but in a
> > nutshell, the majority of the services have seen double digit growth over
> > the past two cycles in terms of the amount of RSS memory they use.
> > 
> > Since [1] is observed only since ocata [2], I imagine that's pretty
> > reasonable to assume that the memory increase may well be a determining
> > factor in the oom-kills we see in the gate.
> > 
> > Profiling and surgically reducing the memory used by each component in each
> > service is a lengthy process, but I'd rather see some gate relief right
> > away. Reducing the number of API workers helps bring the RSS memory down
> > back to mitaka levels:
> > 
> >- neutron: 1.54GB
> >- nova: 1.24GB
> >- swift: 694MB
> >- cinder: 778MB
> >- keystone: 891MB
> >- horizon: 24MB
> >- glance: 490MB
> > 
> > However, it may have other side effects, like longer execution times, or
> > increase of timeouts.
> > 
> > Where do we go from here? I am not particularly fond of stop-gap [4], but
> > it is the one fix that most widely address the memory increase we have
> > experienced across the board.
> 
> So I have a couple of concerns with doing this. We're only running with 2
> workers per api service now and dropping it down to 1 means we have no more
> memory head room in the future. So this feels like we're just delaying the
> inevitable maybe for a cycle or 2. When we first started hitting OOM issues a
> couple years ago we dropped from nprocs to nprocs/2. [5] Back then we were 
> also
> running more services per job, it was back in the day of the integrated 
> release
> so all those projects were running. (like ceilometer, heat, etc.) So in a 
> little
> over 2 years the memory consumption for the 7 services has increased to the
> point where we're making up for a bunch of extra services that don't run in 
> the
> job anymore and we had to drop the worker count in half since. So if we were 
> to
> do this we don't have anymore room for when things keep growing. I think now 
> is
> the time we should start seriously taking a stance on our memory footprint
> growth and see if we can get it under control.
> 
> My second concern is the same as you here, the long term effects of this 
> change
> aren't exactly clear. With the limited sample size of the test patch[4] we 
> can't
> really say if it'll negatively affect run time or job success rates. I don't 
> think
> it should be too bad, tempest is only making 4 api requests at a time, and 
> most of
> the services should be able to handle that kinda load with a single worker. 
> (I'd
> hope)
> 
> This also does bring up the question of the gate config being representative
> of how we recommend running OpenStack. Like the reasons we try to use default
> config values as much as possible in devstack. We definitely aren't saying
> running a single worker



We definitely aren't saying running a single worker is how we recommend people
run OpenStack by doing this. But it just adds on to the differences between the
gate and what we expect things actually look like.

> 
> But, I'm not sure any of that is a blocker for moving forward with dropping 
> down
> to a single worker.
> 
> As an aside, I 

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Matthew Treinish
On Wed, Feb 01, 2017 at 04:24:54PM -0800, Armando M. wrote:
> Hi,
> 
> [TL;DR]: OpenStack services have steadily increased their memory
> footprints. We need a concerted way to address the oom-kills experienced in
> the openstack gate, as we may have reached a ceiling.
> 
> Now the longer version:
> 
> 
> We have been experiencing some instability in the gate lately due to a
> number of reasons. When everything adds up, this means it's rather
> difficult to merge anything and knowing we're in feature freeze, that adds
> to stress. One culprit was identified to be [1].
> 
> We initially tried to increase the swappiness, but that didn't seem to
> help. Then we have looked at the resident memory in use. When going back
> over the past three releases we have noticed that the aggregated memory
> footprint of some openstack projects has grown steadily. We have the
> following:
> 
>- Mitaka
>   - neutron: 1.40GB
>   - nova: 1.70GB
>   - swift: 640MB
>   - cinder: 730MB
>   - keystone: 760MB
>   - horizon: 17MB
>   - glance: 538MB
>- Newton
>- neutron: 1.59GB (+13%)
>   - nova: 1.67GB (-1%)
>   - swift: 779MB (+21%)
>   - cinder: 878MB (+20%)
>   - keystone: 919MB (+20%)
>   - horizon: 21MB (+23%)
>   - glance: 721MB (+34%)
>- Ocata
>   - neutron: 1.75GB (+10%)
>   - nova: 1.95GB (+16%)
>   - swift: 703MB (-9%)
>   - cinder: 920MB (+4%)
>   - keystone: 903MB (-1%)
>   - horizon: 25MB (+20%)
>   - glance: 740MB (+2%)
> 
> Numbers are approximated and I only took a couple of samples, but in a
> nutshell, the majority of the services have seen double digit growth over
> the past two cycles in terms of the amount of RSS memory they use.
> 
> Since [1] is observed only since ocata [2], I imagine that's pretty
> reasonable to assume that the memory increase may well be a determining
> factor in the oom-kills we see in the gate.
> 
> Profiling and surgically reducing the memory used by each component in each
> service is a lengthy process, but I'd rather see some gate relief right
> away. Reducing the number of API workers helps bring the RSS memory down
> back to mitaka levels:
> 
>- neutron: 1.54GB
>- nova: 1.24GB
>- swift: 694MB
>- cinder: 778MB
>- keystone: 891MB
>- horizon: 24MB
>- glance: 490MB
> 
> However, it may have other side effects, like longer execution times, or
> increase of timeouts.
> 
> Where do we go from here? I am not particularly fond of stop-gap [4], but
> it is the one fix that most widely address the memory increase we have
> experienced across the board.

So I have a couple of concerns with doing this. We're only running with 2
workers per api service now and dropping it down to 1 means we have no more
memory head room in the future. So this feels like we're just delaying the
inevitable maybe for a cycle or 2. When we first started hitting OOM issues a
couple years ago we dropped from nprocs to nprocs/2. [5] Back then we were also
running more services per job, it was back in the day of the integrated release
so all those projects were running. (like ceilometer, heat, etc.) So in a little
over 2 years the memory consumption for the 7 services has increased to the
point where we're making up for a bunch of extra services that don't run in the
job anymore and we had to drop the worker count in half since. So if we were to
do this we don't have anymore room for when things keep growing. I think now is
the time we should start seriously taking a stance on our memory footprint
growth and see if we can get it under control.

My second concern is the same as you here, the long term effects of this change
aren't exactly clear. With the limited sample size of the test patch[4] we can't
really say if it'll negatively affect run time or job success rates. I don't 
think
it should be too bad, tempest is only making 4 api requests at a time, and most 
of
the services should be able to handle that kinda load with a single worker. (I'd
hope)

This also does bring up the question of the gate config being representative
of how we recommend running OpenStack. Like the reasons we try to use default
config values as much as possible in devstack. We definitely aren't saying
running a single worker

But, I'm not sure any of that is a blocker for moving forward with dropping down
to a single worker.

As an aside, I also just pushed up: https://review.openstack.org/#/c/428220/ to
see if that provides any useful info. I'm doubtful that it will be helpful,
because it's the combination of services running causing the issue. But it
doesn't really hurt to collect that.

-Matt Treinish

> [1] https://bugs.launchpad.net/neutron/+bug/1656386
> [2]
> http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
> [3]
> http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
> 

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Matthew Treinish
On Thu, Feb 02, 2017 at 04:27:51AM +, Dolph Mathews wrote:
> What made most services jump +20% between mitaka and newton? Maybe there is
> a common cause that we can tackle.

Yeah, I'm curious about this too, there seems to be a big jump in Newton for
most of the projects. It might not be a single common cause between them, but
I'd be curious to know what's going on there.

> 
> I'd also be in favor of reducing the number of workers in the gate,
> assuming that doesn't also substantially increase the runtime of gate jobs.
> Does that environment variable (API_WORKERS) affect keystone and horizon?

It affects keystone in certain deploy modes (only uwsgi standalone, I think,
which means not for most jobs); if it's running under apache we rely on apache
to handle things. Which is why this doesn't work for horizon.

API_WORKERS was the interface we added to devstack after we started having OOM
issues the first time around (roughly 2 years ago). Back then we were running
the service defaults, which in most cases meant nproc workers per service.
API_WORKERS was added as a global flag to set that to something else for
all the services. Right now it defaults to nproc/4 as long as that's >=2:

https://github.com/openstack-dev/devstack/blob/master/stackrc#L714
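
For readers who don't want to chase the link, my reading of that default,
restated as a small Python sketch (an illustration only, not devstack's
actual bash), is:

    # illustration of the API_WORKERS default described above
    import multiprocessing

    def default_api_workers(explicit=None):
        if explicit is not None:  # honour an explicit API_WORKERS setting
            return explicit
        return max(multiprocessing.cpu_count() // 4, 2)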

which basically means in the gate right now we're only running with 2 api
workers per server. It's just that a lot of 

-Matt Treinish

> 
> On Wed, Feb 1, 2017 at 6:39 PM Kevin Benton  wrote:
> 
> > And who said openstack wasn't growing? ;)
> >
> > I think reducing API workers is a nice quick way to bring back some
> > stability.
> >
> > I have spent a bunch of time digging into the OOM killer events and
> > haven't yet figured out why they are being triggered. There is significant
> > swap space remaining in all of the cases I have seen so it's likely some
> > memory locking issue or kernel allocations blocking swap. Until we can
> > figure out the cause, we effectively have no usable swap space on the test
> > instances so we are limited to 8GB.
> >
> > On Feb 1, 2017 17:27, "Armando M."  wrote:
> >
> > Hi,
> >
> > [TL;DR]: OpenStack services have steadily increased their memory
> > footprints. We need a concerted way to address the oom-kills experienced in
> > the openstack gate, as we may have reached a ceiling.
> >
> > Now the longer version:
> > 
> >
> > We have been experiencing some instability in the gate lately due to a
> > number of reasons. When everything adds up, this means it's rather
> > difficult to merge anything and knowing we're in feature freeze, that adds
> > to stress. One culprit was identified to be [1].
> >
> > We initially tried to increase the swappiness, but that didn't seem to
> > help. Then we have looked at the resident memory in use. When going back
> > over the past three releases we have noticed that the aggregated memory
> > footprint of some openstack projects has grown steadily. We have the
> > following:
> >
> >- Mitaka
> >   - neutron: 1.40GB
> >   - nova: 1.70GB
> >   - swift: 640MB
> >   - cinder: 730MB
> >   - keystone: 760MB
> >   - horizon: 17MB
> >   - glance: 538MB
> >- Newton
> >- neutron: 1.59GB (+13%)
> >   - nova: 1.67GB (-1%)
> >   - swift: 779MB (+21%)
> >   - cinder: 878MB (+20%)
> >   - keystone: 919MB (+20%)
> >   - horizon: 21MB (+23%)
> >   - glance: 721MB (+34%)
> >- Ocata
> >   - neutron: 1.75GB (+10%)
> >   - nova: 1.95GB (+16%)
> >   - swift: 703MB (-9%)
> >   - cinder: 920MB (+4%)
> >   - keystone: 903MB (-1%)
> >   - horizon: 25MB (+20%)
> >   - glance: 740MB (+2%)
> >
> > Numbers are approximated and I only took a couple of samples, but in a
> > nutshell, the majority of the services have seen double digit growth over
> > the past two cycles in terms of the amount of RSS memory they use.
> >
> > Since [1] is observed only since ocata [2], I imagine that's pretty
> > reasonable to assume that the memory increase may well be a determining
> > factor in the oom-kills we see in the gate.
> >
> > Profiling and surgically reducing the memory used by each component in
> > each service is a lengthy process, but I'd rather see some gate relief
> > right away. Reducing the number of API workers helps bring the RSS memory
> > down back to mitaka levels:
> >
> >- neutron: 1.54GB
> >- nova: 1.24GB
> >- swift: 694MB
> >- cinder: 778MB
> >- keystone: 891MB
> >- horizon: 24MB
> >- glance: 490MB
> >
> > However, it may have other side effects, like longer execution times, or
> > increase of timeouts.
> >
> > Where do we go from here? I am not particularly fond of stop-gap [4], but
> > it is the one fix that most widely address the memory increase we have
> > experienced across the board.
> >
> > Thanks,
> > Armando
> >
> > [1] https://bugs.launchpad.net/neutron/+bug/1656386
> > [2]
> > 

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread IWAMOTO Toshihiro
At Wed, 1 Feb 2017 17:37:34 -0700,
Kevin Benton wrote:
> 
> [1  ]
> [1.1  ]
> And who said openstack wasn't growing? ;)
> 
> I think reducing API workers is a nice quick way to bring back some
> stability.
> 
> I have spent a bunch of time digging into the OOM killer events and haven't
> yet figured out why they are being triggered. There is significant swap
> space remaining in all of the cases I have seen so it's likely some memory

We can try increasing watermark_scale_factor instead.
I looked at 2 random oom-killer invocations, and in both cases free memory
was above the watermark. The oom-killer was triggered by a 16kB contiguous
page allocation from apparmor_file_alloc_security, so trying to disable
apparmor may also work.
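
If someone wants to poke at a held gate node, a small sketch like the one
below (assuming a kernel new enough to expose vm.watermark_scale_factor)
would show the current watermark setting and how much memory is mlocked or
otherwise unevictable, which is the other suspect raised in this thread:

    # vm_knobs.py - read-only check of the knobs discussed above
    def read(path):
        try:
            with open(path) as f:
                return f.read()
        except IOError:
            return ""

    wsf = read("/proc/sys/vm/watermark_scale_factor").strip()
    print("vm.watermark_scale_factor: %s" % (wsf or "not exposed by this kernel"))
    for line in read("/proc/meminfo").splitlines():
        if line.startswith(("Mlocked", "Unevictable")):
            print(line)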


> locking issue or kernel allocations blocking swap. Until we can figure out
> the cause, we effectively have no usable swap space on the test instances
> so we are limited to 8GB.
> 
> On Feb 1, 2017 17:27, "Armando M."  wrote:
> 
> > Hi,
> >
> > [TL;DR]: OpenStack services have steadily increased their memory
> > footprints. We need a concerted way to address the oom-kills experienced in
> > the openstack gate, as we may have reached a ceiling.
> >
> > Now the longer version:
> > 
> >
> > We have been experiencing some instability in the gate lately due to a
> > number of reasons. When everything adds up, this means it's rather
> > difficult to merge anything and knowing we're in feature freeze, that adds
> > to stress. One culprit was identified to be [1].
> >
> > We initially tried to increase the swappiness, but that didn't seem to
> > help. Then we have looked at the resident memory in use. When going back
> > over the past three releases we have noticed that the aggregated memory
> > footprint of some openstack projects has grown steadily. We have the
> > following:
> >
> >- Mitaka
> >   - neutron: 1.40GB
> >   - nova: 1.70GB
> >   - swift: 640MB
> >   - cinder: 730MB
> >   - keystone: 760MB
> >   - horizon: 17MB
> >   - glance: 538MB
> >- Newton
> >- neutron: 1.59GB (+13%)
> >   - nova: 1.67GB (-1%)
> >   - swift: 779MB (+21%)
> >   - cinder: 878MB (+20%)
> >   - keystone: 919MB (+20%)
> >   - horizon: 21MB (+23%)
> >   - glance: 721MB (+34%)
> >- Ocata
> >   - neutron: 1.75GB (+10%)
> >   - nova: 1.95GB (+16%)
> >   - swift: 703MB (-9%)
> >   - cinder: 920MB (+4%)
> >   - keystone: 903MB (-1%)
> >   - horizon: 25MB (+20%)
> >   - glance: 740MB (+2%)
> >
> > Numbers are approximated and I only took a couple of samples, but in a
> > nutshell, the majority of the services have seen double digit growth over
> > the past two cycles in terms of the amount of RSS memory they use.
> >
> > Since [1] is observed only since ocata [2], I imagine that's pretty
> > reasonable to assume that the memory increase may well be a determining
> > factor in the oom-kills we see in the gate.
> >
> > Profiling and surgically reducing the memory used by each component in
> > each service is a lengthy process, but I'd rather see some gate relief
> > right away. Reducing the number of API workers helps bring the RSS memory
> > down back to mitaka levels:
> >
> >- neutron: 1.54GB
> >- nova: 1.24GB
> >- swift: 694MB
> >- cinder: 778MB
> >- keystone: 891MB
> >- horizon: 24MB
> >- glance: 490MB
> >
> > However, it may have other side effects, like longer execution times, or
> > increase of timeouts.
> >
> > Where do we go from here? I am not particularly fond of stop-gap [4], but
> > it is the one fix that most widely address the memory increase we have
> > experienced across the board.
> >
> > Thanks,
> > Armando
> >
> > [1] https://bugs.launchpad.net/neutron/+bug/1656386
> > [2] http://logstash.openstack.org/#/dashboard/file/logstash.
> > json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
> > [3] http://logs.openstack.org/21/427921/1/check/gate-
> > tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
> > [4] https://review.openstack.org/#/c/427921
> >

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread Dolph Mathews
What made most services jump +20% between mitaka and newton? Maybe there is
a common cause that we can tackle.

I'd also be in favor of reducing the number of workers in the gate,
assuming that doesn't also substantially increase the runtime of gate jobs.
Does that environment variable (API_WORKERS) affect keystone and horizon?
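
For reference, the knob itself is just a devstack local.conf setting; a
minimal sketch of what capping the workers looks like on a devstack node
is below (the value 2 is only an example, and whether keystone and horizon
honor it is exactly the open question, since they typically run under
Apache):

    # local.conf on the devstack node
    [[local|localrc]]
    # Cap the number of API worker processes instead of letting services
    # default to something derived from the CPU count.
    API_WORKERS=2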

On Wed, Feb 1, 2017 at 6:39 PM Kevin Benton  wrote:

> And who said openstack wasn't growing? ;)
>
> I think reducing API workers is a nice quick way to bring back some
> stability.
>
> I have spent a bunch of time digging into the OOM killer events and
> haven't yet figured out why they are being triggered. There is significant
> swap space remaining in all of the cases I have seen so it's likely some
> memory locking issue or kernel allocations blocking swap. Until we can
> figure out the cause, we effectively have no usable swap space on the test
> instances so we are limited to 8GB.
>
> On Feb 1, 2017 17:27, "Armando M."  wrote:
>
> Hi,
>
> [TL;DR]: OpenStack services have steadily increased their memory
> footprints. We need a concerted way to address the oom-kills experienced in
> the openstack gate, as we may have reached a ceiling.
>
> Now the longer version:
> 
>
> We have been experiencing some instability in the gate lately for a
> number of reasons. When everything adds up, it becomes rather difficult
> to merge anything, and knowing we're in feature freeze, that adds to
> stress. One culprit was identified to be [1].
>
> We initially tried to increase the swappiness, but that didn't seem to
> help. We then looked at the resident memory in use. Going back over the
> past three releases, we noticed that the aggregated memory footprint of
> some openstack projects has grown steadily. We have the following:
>
>- Mitaka
>   - neutron: 1.40GB
>   - nova: 1.70GB
>   - swift: 640MB
>   - cinder: 730MB
>   - keystone: 760MB
>   - horizon: 17MB
>   - glance: 538MB
>- Newton
>   - neutron: 1.59GB (+13%)
>   - nova: 1.67GB (-1%)
>   - swift: 779MB (+21%)
>   - cinder: 878MB (+20%)
>   - keystone: 919MB (+20%)
>   - horizon: 21MB (+23%)
>   - glance: 721MB (+34%)
>- Ocata
>   - neutron: 1.75GB (+10%)
>   - nova: 1.95GB (+16%)
>   - swift: 703MB (-9%)
>   - cinder: 920MB (+4%)
>   - keystone: 903MB (-1%)
>   - horizon: 25MB (+20%)
>   - glance: 740MB (+2%)
>
> The numbers are approximate and I only took a couple of samples, but in
> a nutshell, the majority of the services have seen double-digit growth
> in the amount of RSS memory they use over the past two cycles.
>
> Since [1] has only been observed since Ocata [2], it seems reasonable to
> assume that the memory increase may well be a determining factor in the
> oom-kills we see in the gate.
>
> Profiling and surgically reducing the memory used by each component in
> each service is a lengthy process, but I'd rather see some gate relief
> right away. Reducing the number of API workers helps bring the RSS memory
> down back to mitaka levels:
>
>- neutron: 1.54GB
>- nova: 1.24GB
>- swift: 694MB
>- cinder: 778MB
>- keystone: 891MB
>- horizon: 24MB
>- glance: 490MB
>
> However, it may have other side effects, such as longer execution times
> or an increase in timeouts.
>
> Where do we go from here? I am not particularly fond of the stop-gap [4],
> but it is the one fix that most broadly addresses the memory increase we
> have experienced across the board.
>
> Thanks,
> Armando
>
> [1] https://bugs.launchpad.net/neutron/+bug/1656386
> [2]
> http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
> [3]
> http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
> [4] https://review.openstack.org/#/c/427921
>
-- 
-Dolph
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread Kevin Benton
And who said openstack wasn't growing? ;)

I think reducing API workers is a nice quick way to bring back some
stability.

I have spent a bunch of time digging into the OOM killer events and haven't
yet figured out why they are being triggered. There is significant swap
space remaining in all of the cases I have seen so it's likely some memory
locking issue or kernel allocations blocking swap. Until we can figure out
the cause, we effectively have no usable swap space on the test instances
so we are limited to 8GB.
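
As a rough sketch for checking the memory-locking theory on an affected
node (standard /proc fields; whether they account for the gap is exactly
what still needs to be figured out):

    # System-wide view: locked and unevictable pages that cannot be swapped.
    grep -E 'Mlocked|Unevictable' /proc/meminfo

    # Per-process view: print any process whose status shows a non-zero
    # locked (VmLck) size.
    for f in /proc/[0-9]*/status; do
        awk -v src="$f" '/^VmLck/ && $2 > 0 { print src, $0 }' "$f"
    done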

On Feb 1, 2017 17:27, "Armando M."  wrote:

> Hi,
>
> [TL;DR]: OpenStack services have steadily increased their memory
> footprints. We need a concerted way to address the oom-kills experienced in
> the openstack gate, as we may have reached a ceiling.
>
> Now the longer version:
> 
>
> We have been experiencing some instability in the gate lately for a
> number of reasons. When everything adds up, it becomes rather difficult
> to merge anything, and knowing we're in feature freeze, that adds to
> stress. One culprit was identified to be [1].
>
> We initially tried to increase the swappiness, but that didn't seem to
> help. We then looked at the resident memory in use. Going back over the
> past three releases, we noticed that the aggregated memory footprint of
> some openstack projects has grown steadily. We have the following:
>
>- Mitaka
>   - neutron: 1.40GB
>   - nova: 1.70GB
>   - swift: 640MB
>   - cinder: 730MB
>   - keystone: 760MB
>   - horizon: 17MB
>   - glance: 538MB
>- Newton
>   - neutron: 1.59GB (+13%)
>   - nova: 1.67GB (-1%)
>   - swift: 779MB (+21%)
>   - cinder: 878MB (+20%)
>   - keystone: 919MB (+20%)
>   - horizon: 21MB (+23%)
>   - glance: 721MB (+34%)
>- Ocata
>   - neutron: 1.75GB (+10%)
>   - nova: 1.95GB (+16%)
>   - swift: 703MB (-9%)
>   - cinder: 920MB (+4%)
>   - keystone: 903MB (-1%)
>   - horizon: 25MB (+20%)
>   - glance: 740MB (+2%)
>
> The numbers are approximate and I only took a couple of samples, but in
> a nutshell, the majority of the services have seen double-digit growth
> in the amount of RSS memory they use over the past two cycles.
>
> Since [1] has only been observed since Ocata [2], it seems reasonable to
> assume that the memory increase may well be a determining factor in the
> oom-kills we see in the gate.
>
> Profiling and surgically reducing the memory used by each component in
> each service is a lengthy process, but I'd rather see some gate relief
> right away. Reducing the number of API workers helps bring the RSS memory
> down back to mitaka levels:
>
>- neutron: 1.54GB
>- nova: 1.24GB
>- swift: 694MB
>- cinder: 778MB
>- keystone: 891MB
>- horizon: 24MB
>- glance: 490MB
>
> However, it may have other side effects, such as longer execution times
> or an increase in timeouts.
>
> Where do we go from here? I am not particularly fond of the stop-gap [4],
> but it is the one fix that most broadly addresses the memory increase we
> have experienced across the board.
>
> Thanks,
> Armando
>
> [1] https://bugs.launchpad.net/neutron/+bug/1656386
> [2] http://logstash.openstack.org/#/dashboard/file/logstash.
> json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
> [3] http://logs.openstack.org/21/427921/1/check/gate-
> tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
> [4] https://review.openstack.org/#/c/427921
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread Armando M.
Hi,

[TL;DR]: OpenStack services have steadily increased their memory
footprints. We need a concerted way to address the oom-kills experienced in
the openstack gate, as we may have reached a ceiling.

Now the longer version:


We have been experiencing some instability in the gate lately for a
number of reasons. When everything adds up, it becomes rather difficult
to merge anything, and knowing we're in feature freeze, that adds to
stress. One culprit was identified to be [1].

We initially tried to increase the swappiness, but that didn't seem to
help. We then looked at the resident memory in use. Going back over the
past three releases, we noticed that the aggregated memory footprint of
some openstack projects has grown steadily. We have the following:

   - Mitaka
  - neutron: 1.40GB
  - nova: 1.70GB
  - swift: 640MB
  - cinder: 730MB
  - keystone: 760MB
  - horizon: 17MB
  - glance: 538MB
   - Newton
  - neutron: 1.59GB (+13%)
  - nova: 1.67GB (-1%)
  - swift: 779MB (+21%)
  - cinder: 878MB (+20%)
  - keystone: 919MB (+20%)
  - horizon: 21MB (+23%)
  - glance: 721MB (+34%)
   - Ocata
  - neutron: 1.75GB (+10%)
  - nova: 1.95GB (+16%)
  - swift: 703MB (-9%)
  - cinder: 920MB (+4%)
  - keystone: 903MB (-1%)
  - horizon: 25MB (+20%)
  - glance: 740MB (+2%)
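
Summing the figures above gives a rough sense of how the aggregate
compares to an 8GB node (illustrative arithmetic only, covering just the
services listed here):

    awk 'BEGIN {
        printf "Mitaka: ~%.1f GB\n", 1.40 + 1.70 + 0.640 + 0.730 + 0.760 + 0.017 + 0.538
        printf "Newton: ~%.1f GB\n", 1.59 + 1.67 + 0.779 + 0.878 + 0.919 + 0.021 + 0.721
        printf "Ocata:  ~%.1f GB\n", 1.75 + 1.95 + 0.703 + 0.920 + 0.903 + 0.025 + 0.740
    }'
    # Prints roughly 5.8, 6.6 and 7.0 GB respectively.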

The numbers are approximate and I only took a couple of samples, but in a
nutshell, the majority of the services have seen double-digit growth in
the amount of RSS memory they use over the past two cycles.
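
For what it's worth, a crude way to sample such per-service aggregates on
a devstack node is something along these lines (not necessarily how the
numbers above were collected); summing RSS across forked API workers also
double-counts pages shared after fork, which is one more reason the
figures can only be approximate:

    # Aggregate RSS, in MB, of every process matching a service name.
    svc=neutron
    ps -eo rss=,args= | grep "$svc" | grep -v grep | \
        awk '{ sum += $1 } END { printf "~%.0f MB\n", sum / 1024 }'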

Since [1] has only been observed since Ocata [2], it seems reasonable to
assume that the memory increase may well be a determining factor in the
oom-kills we see in the gate.

Profiling and surgically reducing the memory used by each component in each
service is a lengthy process, but I'd rather see some gate relief right
away. Reducing the number of API workers helps bring the RSS memory down
back to mitaka levels:

   - neutron: 1.54GB
   - nova: 1.24GB
   - swift: 694MB
   - cinder: 778MB
   - keystone: 891MB
   - horizon: 24MB
   - glance: 490MB

However, it may have other side effects, such as longer execution times
or an increase in timeouts.

Where do we go from here? I am not particularly fond of the stop-gap [4],
but it is the one fix that most broadly addresses the memory increase we
have experienced across the board.

Thanks,
Armando

[1] https://bugs.launchpad.net/neutron/+bug/1656386
[2]
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
[3]
http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
[4] https://review.openstack.org/#/c/427921
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev