Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Ihar Hrachyshka
Another potentially interesting devstack service that may help us understand our memory usage is peakmem_tracker. At this point, it's not enabled anywhere. I proposed a devstack-gate patch to enable it at: https://review.openstack.org/#/c/434511/ On Wed, Feb 15, 2017 at 12:38 PM, Ihar Hrachyshka
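peakmem_tracker's actual implementation lives in devstack; as a rough, stdlib-only sketch of the idea (the names below are illustrative, not devstack's API), a tracker can poll /proc/meminfo and remember the low-water mark of MemAvailable, i.e. the moment of peak memory pressure:

```python
import re
import threading
import time


def available_kb(meminfo_text):
    """Parse the MemAvailable line out of /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1))


class PeakMemTracker:
    """Poll /proc/meminfo and keep the minimum MemAvailable seen (Linux only)."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.min_available_kb = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        while not self._stop.is_set():
            with open("/proc/meminfo") as f:
                avail = available_kb(f.read())
            if self.min_available_kb is None or avail < self.min_available_kb:
                self.min_available_kb = avail
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Dumping the low-water mark at the end of a job would show how close the run came to exhausting memory, even if oom-killer never fired.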

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Ihar Hrachyshka
Another potentially relevant piece of info: we have seen before that oom-killer is triggered while 8GB of swap is barely used. This behavior is hard to explain, since we set the kernel swappiness sysctl knob to 30: https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L432 (and any value
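For reference, vm.swappiness is a single-integer proc file; higher values bias reclaim toward swapping anonymous pages, lower values toward dropping page cache, and neither guarantees swap actually gets used before oom-killer fires if reclaim stalls. A small hedged helper (function names are mine, not devstack-gate's) showing the knob and the sysctl invocation devstack-gate effectively performs:

```python
def parse_sysctl_value(text):
    """Parse the single integer a sysctl proc file (e.g. /proc/sys/vm/swappiness) contains."""
    return int(text.strip())


def set_swappiness_cmd(value):
    """Build the sysctl command line for setting vm.swappiness."""
    if not 0 <= value <= 100:
        raise ValueError("vm.swappiness is expected in [0, 100]")
    return ["sysctl", "-w", "vm.swappiness=%d" % value]
```

Reading the live value is just `parse_sysctl_value(open("/proc/sys/vm/swappiness").read())` on a Linux host.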

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Jeremy Stanley
On 2017-02-15 13:21:16 + (+), Andrea Frittoli wrote: [...] > According to logstash [3] all failures identified by [2] happen on RAX > nodes [3], which I hadn't realised before. [...] > I find it hard to relate lower free memory to a specific cloud provider / > underlying virtualisation

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-15 Thread Andrea Frittoli
Some (new?) data on the oom-kill issue in the gate. I filed a new bug / E-R query for the issue [1][2], since it looks to me like the issue is not specific to mysqld: oom-kill will just pick the best candidate, which in most cases happens to be mysqld. The next most likely candidate to show

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-06 Thread Miguel Angel Ajo Pelayo
Jeremy Stanley wrote: > It's an option of last resort, I think. The next consistent flavor > up in most of the providers donating resources is double the one > we're using (which is a fairly typical pattern in public clouds). As > aggregate memory constraints are our primary quota limit, this

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-05 Thread Matt Riedemann
On 2/5/2017 1:19 PM, Clint Byrum wrote: Also I wonder if there's ever been any serious consideration given to switching to protobuf? Feels like one could make oslo.versionedobjects a wrapper around protobuf relatively easily, but perhaps that's already been explored in a forum that I wasn't

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-05 Thread Clint Byrum
Excerpts from Matt Riedemann's message of 2017-02-04 16:09:56 -0600: > On 2/2/2017 4:01 PM, Sean Dague wrote: > > > > The only services that are running on Apache in standard gate jobs are > > keystone and the placement api. Everything else is still the > > oslo.service stack (which is basically

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Matt Riedemann
On 2/2/2017 4:01 PM, Sean Dague wrote: The only services that are running on Apache in standard gate jobs are keystone and the placement API. Everything else is still the oslo.service stack (which basically runs eventlet as a preforking, static-worker-count webserver). The ways in which

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Matt Riedemann
On 2/2/2017 2:32 PM, Armando M. wrote: Not sure I agree on this one, this has been observed multiple times in the gate already [1] (though I am not sure there's a bug for it), and I don't believe it has anything to do with the number of API workers, unless not even two workers are enough. [1]

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Joshua Harlow
Another option is to turn on the following (for python 3.4+ jobs): https://docs.python.org/3/library/tracemalloc.html I think Victor Stinner (who we all know as haypo) has some experience with that, and even did some of the backport patches for 2.7, so he may have some ideas on how we can
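Minimal tracemalloc usage looks like this (the workload line is a stand-in for whatever the service under test does):

```python
import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames of allocation traceback per block

# ... exercise the service / code under test ...
leaky = [bytes(1000) for _ in range(1000)]  # stand-in workload

# Top allocation sites, grouped by source line
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)

# Overall traced usage: current live bytes and the peak since start()
current, peak = tracemalloc.get_traced_memory()
print("current=%d bytes, peak=%d bytes" % (current, peak))
```

Comparing two snapshots with `snapshot.compare_to(old_snapshot, "lineno")` is the usual way to narrow down a slow leak across requests.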

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-04 Thread Paul Belanger
On Fri, Feb 03, 2017 at 06:14:01PM +, Jeremy Stanley wrote: > On 2017-02-03 11:12:04 +0100 (+0100), Miguel Angel Ajo Pelayo wrote: > [...] > > So, would it be realistic to bump the flavors RAM to favor our stability in > > the short term? (considering that the less amount of workload our

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-03 Thread Jeremy Stanley
On 2017-02-03 11:12:04 +0100 (+0100), Miguel Angel Ajo Pelayo wrote: [...] > So, would it be realistic to bump the flavors' RAM to favor our stability in > the short term? (considering that our clouds will be able to take less > workload, but the failure rate will also be

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-03 Thread Miguel Angel Ajo Pelayo
On Fri, Feb 3, 2017 at 7:55 AM, IWAMOTO Toshihiro wrote: > At Wed, 1 Feb 2017 16:24:54 -0800, Armando M. wrote: > > Hi, > > [TL;DR]: OpenStack services have steadily increased their memory > > footprints. We need a concerted way to address the oom-kills

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread IWAMOTO Toshihiro
At Wed, 1 Feb 2017 16:24:54 -0800, Armando M. wrote: > > Hi, > > [TL;DR]: OpenStack services have steadily increased their memory > footprints. We need a concerted way to address the oom-kills experienced in > the openstack gate, as we may have reached a ceiling. > > Now the longer version: >

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Joshua Harlow
Has anyone tried: https://github.com/mgedmin/dozer/blob/master/dozer/leak.py#L72 This piece of middleware creates some nice graphs (using PIL) that may help identify which areas are using what memory (and/or leaking). https://pypi.python.org/pypi/linesman might also be somewhat useful to
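dozer and linesman are third-party packages; as a stdlib-only sketch in the same spirit (the class name is illustrative, and the real dozer renders PIL graphs rather than printing deltas), a WSGI middleware can census live objects with gc before and after each request to show which types are growing:

```python
import gc
from collections import Counter


class ObjectCountMiddleware:
    """WSGI middleware that reports which Python types grew during a request.

    A crude leak-hunting aid: persistent growth of one type across many
    requests is a strong hint about what is being retained.
    """

    def __init__(self, app, top=5):
        self.app = app
        self.top = top

    def _census(self):
        # Count every live, gc-tracked object by type name.
        return Counter(type(o).__name__ for o in gc.get_objects())

    def __call__(self, environ, start_response):
        before = self._census()
        response = self.app(environ, start_response)
        growth = self._census() - before  # only positive deltas survive
        for type_name, delta in growth.most_common(self.top):
            print("grew %5d  %s" % (delta, type_name))
        return response
```

Wrapping a service is just `app = ObjectCountMiddleware(app)` in its WSGI pipeline; the census is expensive, so this is for debugging runs only.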

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Robert Collins
On 3 Feb. 2017 16:14, "Robert Collins" wrote: This may help. http://jam-bazaar.blogspot.co.nz/2009/11/memory-debugging-with-meliae.html -rob Oh, and if I recall correctly, RunSnakeRun supports both heapy and Meliae. -rob

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Robert Collins
This may help. http://jam-bazaar.blogspot.co.nz/2009/11/memory-debugging-with-meliae.html -rob On 3 Feb. 2017 10:39, "Armando M." wrote: > On 2 February 2017 at 13:36, Ihar Hrachyshka wrote: > > On Thu, Feb 2, 2017 at 7:44 AM, Matthew Treinish

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Ed Leafe
On Feb 2, 2017, at 10:16 AM, Matthew Treinish wrote: > If that was intentional, it is the funniest thing I’ve read today. :) -- Ed Leafe

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Kevin Benton
I'm referring to Apache sitting in between the services now as a TLS terminator and connection proxy. That was not the configuration before but it is now the default devstack behavior. See this example from Newton:

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 04:07 PM, Kevin Benton wrote: > This error seems to be new in the ocata cycle. It's either related to a > dependency change or the fact that we put Apache in between the services > now. Handling more concurrent requests than workers wasn't an issue > before. > > It seems that you

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Kevin Benton
Note the HTTPS in the traceback in the bug report. Also the mention of adjusting the Apache mpm settings to fix it. That seems to point to an issue with Apache in the middle rather than eventlet and API_WORKERS. On Feb 2, 2017 14:36, "Ihar Hrachyshka" wrote: > The

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Mikhail Medvedev
On Thu, Feb 2, 2017 at 12:28 PM, Jeremy Stanley wrote: > On 2017-02-02 04:27:51 + (+), Dolph Mathews wrote: >> What made most services jump +20% between mitaka and newton? Maybe there is >> a common cause that we can tackle. > [...] > > Almost hesitant to suggest this

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Clay Gerrard
On Thu, Feb 2, 2017 at 12:50 PM, Sean Dague wrote: > > This is one of the reasons to get the wsgi stack off of eventlet and > into a real webserver, as they handle HTTP request backups much much > better. > > To some extent I think this is generally true for *many* common

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 13:36, Ihar Hrachyshka wrote: > On Thu, Feb 2, 2017 at 7:44 AM, Matthew Treinish wrote: > > Yeah, I'm curious about this too, there seems to be a big jump in Newton for > > most of the projects. It might not be a single

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 13:34, Ihar Hrachyshka wrote: > The BadStatusLine error is well known: > https://bugs.launchpad.net/nova/+bug/1630664 That's the one! I knew I had seen it in the past! > Now, it doesn't mean that the root cause of the error message is the

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Ihar Hrachyshka
On Thu, Feb 2, 2017 at 7:44 AM, Matthew Treinish wrote: > Yeah, I'm curious about this too, there seems to be a big jump in Newton for > most of the projects. It might not be a single common cause between them, but > I'd be curious to know what's going on there. Both Matt

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 12:50, Sean Dague wrote: > On 02/02/2017 03:32 PM, Armando M. wrote: > > On 2 February 2017 at 12:19, Sean Dague wrote: > > > On 02/02/2017 02:28 PM, Armando M. wrote:

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Ihar Hrachyshka
The BadStatusLine error is well known: https://bugs.launchpad.net/nova/+bug/1630664 Now, it doesn't mean that the root cause of the error message is the same, and it may as well be that lowering the number of workers triggered it. All I am saying is we saw that error in the past. Ihar On Thu,

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Kevin Benton
This error seems to be new in the ocata cycle. It's either related to a dependency change or the fact that we put Apache in between the services now. Handling more concurrent requests than workers wasn't an issue before. It seems that you are suggesting that eventlet can't handle concurrent
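The workers-vs-concurrency question can be illustrated without eventlet (this sketch uses a stdlib thread pool purely as an analogy, not as the actual oslo.service mechanism): with N workers and M > N simultaneous requests, the excess requests queue and wait rather than fail, so a raw shortage of workers alone would show up as latency, not as BadStatusLine-style dropped connections:

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 2
REQUESTS = 6


def handle(i):
    time.sleep(0.1)  # pretend to do 100 ms of API work
    return i


start = time.time()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    # 6 requests against 2 workers: three waves of ~0.1 s each.
    results = list(pool.map(handle, range(REQUESTS)))
elapsed = time.time() - start

# Nothing is dropped; the extra requests simply wait their turn.
print("elapsed=%.2fs results=%s" % (elapsed, results))
```

A connection that dies mid-response (the BadStatusLine signature) points at something in the path resetting sockets, which is consistent with suspecting the proxy layer rather than worker count.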

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 03:32 PM, Armando M. wrote: > On 2 February 2017 at 12:19, Sean Dague wrote: > > On 02/02/2017 02:28 PM, Armando M. wrote: > > > On 2 February 2017 at 10:08, Sean Dague

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 12:19, Sean Dague wrote: > On 02/02/2017 02:28 PM, Armando M. wrote: > > On 2 February 2017 at 10:08, Sean Dague wrote: > > > On 02/02/2017 12:49 PM, Armando M. wrote:

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 02:28 PM, Armando M. wrote: > On 2 February 2017 at 10:08, Sean Dague wrote: > > On 02/02/2017 12:49 PM, Armando M. wrote: > > > On 2 February 2017 at 08:40, Sean Dague

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 10:08, Sean Dague wrote: > On 02/02/2017 12:49 PM, Armando M. wrote: > > On 2 February 2017 at 08:40, Sean Dague wrote: > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Jeremy Stanley
On 2017-02-02 04:27:51 + (+), Dolph Mathews wrote: > What made most services jump +20% between mitaka and newton? Maybe there is > a common cause that we can tackle. [...] Almost hesitant to suggest this one but since we primarily use Ubuntu 14.04 LTS for stable/mitaka jobs and 16.04 LTS

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 12:49 PM, Armando M. wrote: > On 2 February 2017 at 08:40, Sean Dague wrote: > > On 02/02/2017 11:16 AM, Matthew Treinish wrote: > > > We definitely aren't saying running

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Armando M.
On 2 February 2017 at 08:40, Sean Dague wrote: > On 02/02/2017 11:16 AM, Matthew Treinish wrote: > > We definitely aren't saying running a single worker is how we recommend people > > run OpenStack by doing this. But it just adds on

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Andrey Kurilin
On Thu, Feb 2, 2017 at 6:40 PM, Sean Dague wrote: > On 02/02/2017 11:16 AM, Matthew Treinish wrote: > > We definitely aren't saying running a single worker is how we recommend people > > run OpenStack by doing this. But it just adds

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Sean Dague
On 02/02/2017 11:16 AM, Matthew Treinish wrote: > We definitely aren't saying running a single worker is how we recommend people > run OpenStack by doing this. But it just adds on to the differences between the > gate and what we expect things actually

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Matthew Treinish
On Thu, Feb 02, 2017 at 11:10:22AM -0500, Matthew Treinish wrote: > On Wed, Feb 01, 2017 at 04:24:54PM -0800, Armando M. wrote: > > Hi, > > > > [TL;DR]: OpenStack services have steadily increased their memory > > footprints. We need a concerted way to address the oom-kills experienced in > > the

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Matthew Treinish
On Wed, Feb 01, 2017 at 04:24:54PM -0800, Armando M. wrote: > Hi, > > [TL;DR]: OpenStack services have steadily increased their memory > footprints. We need a concerted way to address the oom-kills experienced in > the openstack gate, as we may have reached a ceiling. > > Now the longer version:

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-02 Thread Matthew Treinish
On Thu, Feb 02, 2017 at 04:27:51AM +, Dolph Mathews wrote: > What made most services jump +20% between mitaka and newton? Maybe there is > a common cause that we can tackle. Yeah, I'm curious about this too, there seems to be a big jump in Newton for most of the projects. It might not be a

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread IWAMOTO Toshihiro
At Wed, 1 Feb 2017 17:37:34 -0700, Kevin Benton wrote: > And who said openstack wasn't growing? ;) > > I think reducing API workers is a nice quick way to bring back some > stability. > > I have spent a bunch of time digging into the OOM killer events and haven't > yet

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread Dolph Mathews
What made most services jump +20% between mitaka and newton? Maybe there is a common cause that we can tackle. I'd also be in favor of reducing the number of workers in the gate, assuming that doesn't also substantially increase the runtime of gate jobs. Does that environment variable
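The environment variable in question is devstack's API_WORKERS. A hedged sketch of how a service-side default might look (the half-the-CPUs fallback is illustrative of the common pattern, not devstack's exact logic, and the function name is mine):

```python
import multiprocessing
import os


def api_worker_count(environ=os.environ):
    """Pick an API worker count: honor API_WORKERS if set, else half the CPUs.

    Clamped to at least 1 so a gate job exporting API_WORKERS=0 (or a
    single-CPU host) still gets one worker.
    """
    value = environ.get("API_WORKERS")
    if value is not None:
        return max(int(value), 1)
    return max(multiprocessing.cpu_count() // 2, 1)
```

Since each preforked worker is a full copy of the service's Python heap, halving API_WORKERS roughly halves that service's resident footprint, which is why it is the first knob reached for here.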

Re: [openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread Kevin Benton
And who said openstack wasn't growing? ;) I think reducing API workers is a nice quick way to bring back some stability. I have spent a bunch of time digging into the OOM killer events and haven't yet figured out why they are being triggered. There is significant swap space remaining in all of
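When oom-killer fires, the kernel logs a /proc/meminfo-style dump, and the puzzling signature described here is plenty of SwapFree at kill time. A small helper (name and shape are mine) to pull used-swap out of that text when trawling job logs:

```python
import re


def swap_used_kb(meminfo_text):
    """Compute used swap, in kB, from /proc/meminfo-style text.

    SwapTotal - SwapFree: near zero at oom-kill time means the kernel
    killed rather than swapped, pointing at stalled reclaim instead of
    genuine swap exhaustion.
    """
    fields = dict(
        (m.group(1), int(m.group(2)))
        for m in re.finditer(r"^(SwapTotal|SwapFree):\s+(\d+)\s+kB",
                             meminfo_text, re.MULTILINE)
    )
    return fields["SwapTotal"] - fields["SwapFree"]
```

Running this over the dmesg excerpts from failed jobs would quantify "significant swap space remaining" across providers.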

[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

2017-02-01 Thread Armando M.
Hi, [TL;DR]: OpenStack services have steadily increased their memory footprints. We need a concerted way to address the oom-kills experienced in the openstack gate, as we may have reached a ceiling. Now the longer version: We have been experiencing some
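The per-service footprint numbers behind a report like this come from each process's resident set size; on Linux that is the VmRSS line in /proc/<pid>/status. A minimal helper for collecting it (function names are illustrative):

```python
import re


def parse_vmrss_kb(status_text):
    """Pull VmRSS (resident set size, kB) out of /proc/<pid>/status text.

    Returns 0 for processes with no VmRSS line (e.g. kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def rss_kb(pid):
    """Resident set size of one live process, in kB (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vmrss_kb(f.read())
```

Summing this per service name across releases is the kind of measurement that shows the steady growth (and the Mitaka-to-Newton jump) discussed in the thread; note RSS double-counts pages shared between preforked workers, so it overstates the aggregate somewhat.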