Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Matt Riedemann

On 2/2/2017 4:01 PM, Sean Dague wrote:


The only services that are running on Apache in standard gate jobs are
keystone and the placement API. Everything else is still on the
oslo.service stack (which basically runs eventlet as a preforking
webserver with a static worker count).
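
For anyone less familiar with that stack, here is a minimal sketch of
the preforking pattern described above; it is purely illustrative and
not the actual oslo.service code:

import os

import eventlet
import eventlet.wsgi

WORKERS = 4  # static worker count, analogous to the *_workers options


def app(environ, start_response):
    # Trivial WSGI app standing in for an OpenStack API service.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']


def main():
    # Parent opens the listening socket, then forks a fixed number of
    # workers that all accept on it via eventlet's green WSGI server.
    sock = eventlet.listen(('0.0.0.0', 8080))
    for _ in range(WORKERS):
        if os.fork() == 0:
            eventlet.wsgi.server(sock, app)
            os._exit(0)
    for _ in range(WORKERS):
        os.wait()


if __name__ == '__main__':
    main()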

The ways in which OpenStack and oslo.service use eventlet are known to
have scaling bottlenecks. The Keystone team saw substantial throughput
gains going over to Apache hosting.

-Sean



FWIW, coincidentally the nova team is going to work on running nova-api 
under Apache in some select jobs in Pike, because it turns out that 
TripleO was running that configuration in Newton, which is considered 
experimental in nova (in that mode we skip some things that are actually 
pretty critical to how the code functions for upgrades). So if Apache 
vs. eventlet is a factor, maybe we'll see some differences after making 
that change.


But I also wouldn't be surprised if nova is creating more versioned 
objects which reference other full versioned objects (rather than just 
an id reference), and maybe some of those are hanging around longer 
than they should.


--

Thanks,

Matt Riedemann



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Matt Riedemann

On 2/2/2017 2:32 PM, Armando M. wrote:


Not sure I agree on this one; this has been observed multiple times in
the gate already [1] (though I am not sure there's a bug for it), and I
don't believe it has anything to do with the number of API workers,
unless not even two workers are enough.

[1]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22('Connection%20aborted.'%2C%20BadStatusLine(%5C%22''%5C%22%2C)%5C%22




I think that's this:

http://status.openstack.org//elastic-recheck/index.html#1630664

--

Thanks,

Matt Riedemann



Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Joshua Harlow

Another option is to turn on the following (for python 3.4+ jobs)

https://docs.python.org/3/library/tracemalloc.html

I think Victor Stinner (who we all know as haypo) has some experience 
with that, and even did some of the backport patches for 2.7, so he may 
have some ideas on how we can plug that in.
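
For reference, the basic workflow is only a couple of calls; a minimal
sketch (Python 3.4+):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

snapshot1 = tracemalloc.take_snapshot()

# ... run the workload we want to profile ...
data = [bytearray(1024) for _ in range(10000)]

snapshot2 = tracemalloc.take_snapshot()

# Show the ten biggest allocation differences between the two snapshots.
for stat in snapshot2.compare_to(snapshot1, 'lineno')[:10]:
    print(stat)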


Then, assuming the following works, we can even have a nice UI to 
analyze its reports and do comparison diffs:


http://pytracemalloc.readthedocs.io/tracemallocqt.html

One idea from mtreinish was to hook the following (or some variant of 
it) into oslo.service to get some data:


http://pytracemalloc.readthedocs.io/examples.html#thread-to-write-snapshots-into-files-every-minutes
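
Roughly along the lines of that example, a sketch of what such a hook
could look like (names here are illustrative, not an existing
oslo.service API):

import os
import threading
import time
import tracemalloc


def _snapshot_loop(interval, prefix):
    counter = 1
    while True:
        time.sleep(interval)
        snapshot = tracemalloc.take_snapshot()
        filename = '%s-%d-%04d.pickle' % (prefix, os.getpid(), counter)
        # Dumped snapshots can be reloaded later with Snapshot.load().
        snapshot.dump(filename)
        counter += 1


def start_snapshot_thread(interval=60, prefix='/tmp/tracemalloc'):
    # Start tracing and write a snapshot to disk every `interval` seconds.
    tracemalloc.start()
    thread = threading.Thread(target=_snapshot_loop, args=(interval, prefix))
    thread.daemon = True
    thread.start()
    return thread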

Of course the other big question (that I don't actually know the answer 
to) is how tracemalloc works in WSGI containers (such as apache or 
eventlet or uwsgi or ...). Seeing that a good part of our HTTP services 
run in such containers, it seems like a useful thing to wonder about :)


-Josh

Joshua Harlow wrote:

An example of what this (dozer) gathers (attached).

-Josh

Joshua Harlow wrote:

Has anyone tried:

https://github.com/mgedmin/dozer/blob/master/dozer/leak.py#L72

This piece of middleware creates some nice graphs (using PIL) that may
help identify which areas are using what memory (and/or leaking).
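
For anyone who wants to try it, wiring dozer around a WSGI app should
look roughly like this (a sketch based on the dozer README; worth
double-checking against the version you install):

from dozer import Dozer


def application(environ, start_response):
    # Stand-in for an OpenStack API WSGI application.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok\n']


# Wrap the app; dozer then tracks object counts and serves its graphs
# under its own URL prefix on the wrapped application.
application = Dozer(application)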

https://pypi.python.org/pypi/linesman might also be somewhat useful to
have running.

That any process takes more than 100MB here blows my mind (horizon is
doing nicely, ha); what are people caching in-process to end up with
RSS that large (1.95GB, woah)?

Armando M. wrote:

Hi,

[TL;DR]: OpenStack services have steadily increased their memory
footprints. We need a concerted way to address the oom-kills experienced
in the openstack gate, as we may have reached a ceiling.

Now the longer version:


We have been experiencing some instability in the gate lately for a
number of reasons. When everything adds up, it becomes rather difficult
to merge anything, and knowing we're in feature freeze, that adds to
the stress. One culprit was identified to be [1].

We initially tried to increase the swappiness, but that didn't seem to
help. Then we looked at the resident memory in use. Going back over the
past three releases, we noticed that the aggregated memory footprint of
some openstack projects has grown steadily. We have the following:

* Mitaka
  * neutron: 1.40GB
  * nova: 1.70GB
  * swift: 640MB
  * cinder: 730MB
  * keystone: 760MB
  * horizon: 17MB
  * glance: 538MB
* Newton
  * neutron: 1.59GB (+13%)
  * nova: 1.67GB (-1%)
  * swift: 779MB (+21%)
  * cinder: 878MB (+20%)
  * keystone: 919MB (+20%)
  * horizon: 21MB (+23%)
  * glance: 721MB (+34%)
* Ocata
  * neutron: 1.75GB (+10%)
  * nova: 1.95GB (+16%)
  * swift: 703MB (-9%)
  * cinder: 920MB (+4%)
  * keystone: 903MB (-1%)
  * horizon: 25MB (+20%)
  * glance: 740MB (+2%)

Numbers are approximate and I only took a couple of samples, but in a
nutshell, the majority of the services have seen double-digit growth
over the past two cycles in terms of the amount of RSS memory they use.
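
For the curious, a rough sketch of how such per-service RSS numbers can
be sampled with psutil (just a sketch, assuming psutil is available on
the node; shared pages get double-counted when summing across workers):

from collections import defaultdict

import psutil

SERVICES = ('neutron', 'nova', 'swift', 'cinder', 'keystone', 'glance')


def rss_by_service():
    # Sum resident set size across all processes whose command line
    # mentions one of the services.
    totals = defaultdict(int)
    for proc in psutil.process_iter(['cmdline', 'memory_info']):
        info = proc.info
        if not info['cmdline'] or info['memory_info'] is None:
            continue
        cmdline = ' '.join(info['cmdline'])
        for service in SERVICES:
            if service in cmdline:
                totals[service] += info['memory_info'].rss
                break
    return totals


for service, rss in sorted(rss_by_service().items()):
    print('%-10s %6.0f MB' % (service, rss / (1024.0 * 1024.0)))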

Since [1] has been observed only since ocata [2], I think it's pretty
reasonable to assume that the memory increase may well be a determining
factor in the oom-kills we see in the gate.

Profiling and surgically reducing the memory used by each component in
each service is a lengthy process, but I'd rather see some gate relief
right away. Reducing the number of API workers helps bring the RSS
memory back down to mitaka levels:

* neutron: 1.54GB
* nova: 1.24GB
* swift: 694MB
* cinder: 778MB
* keystone: 891MB
* horizon: 24MB
* glance: 490MB

However, it may have other side effects, like longer execution times or
an increase in timeouts.

Where do we go from here? I am not particularly fond of the stop-gap
[4], but it is the one fix that most widely addresses the memory
increase we have experienced across the board.

Thanks,
Armando

[1] https://bugs.launchpad.net/neutron/+bug/1656386

[2]
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog


[3]
http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/


[4] https://review.openstack.org/#/c/427921


Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Paul Belanger
On Fri, Feb 03, 2017 at 06:14:01PM +, Jeremy Stanley wrote:
> On 2017-02-03 11:12:04 +0100 (+0100), Miguel Angel Ajo Pelayo wrote:
> [...]
> > So, would it be realistic to bump the flavors' RAM to favor our stability
> > in the short term? (Considering that our clouds will be able to take less
> > workload, but the failure rate will also be lower, so rechecks will be
> > reduced.)
> 
> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.
> 
++
I completely agree. Halving our quota limit to address the issue of
increased memory consumption seems like the wrong approach.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev