Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-23 Thread Kevin Benton
What I don't understand is why the OOM killer is being invoked when there
is almost no swap space being used at all. Check out the memory output when
it's killed:

http://logs.openstack.org/59/382659/26/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7de01d0/logs/syslog.txt.gz#_Jan_11_15_54_36

"Jan 11 15:54:36 ubuntu-xenial-rax-ord-6599274 kernel: Free swap  =
7994832kB
Jan 11 15:54:36 ubuntu-xenial-rax-ord-6599274 kernel: Total swap =
7999020kB"

Do we have something set that is effectively disabling the usage of swap
space?
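
For illustration, a minimal sketch (assuming a Linux host; this is not
something the gate currently runs) that prints the kernel knobs which most
commonly keep swap from being used before the OOM killer fires:

---
# Hedged sketch: read the vm sysctls that influence swap behaviour. A
# vm.swappiness of 0 strongly discourages swapping, and per-cgroup memory
# limits can trigger cgroup-local OOM kills while system swap sits idle.
for knob in ('swappiness', 'overcommit_memory', 'overcommit_ratio'):
    with open('/proc/sys/vm/%s' % knob) as f:
        print('vm.%s = %s' % (knob, f.read().strip()))
---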

On Wed, Jan 18, 2017 at 4:13 PM, Joe Gordon  wrote:

>
>
> On Thu, Jan 19, 2017 at 10:27 AM, Matt Riedemann <mrie...@linux.vnet.ibm.com> wrote:
>
>> On 1/18/2017 4:53 AM, Jens Rosenboom wrote:
>>
>>> To me it looks like the times of 2G are long gone; Nova is using
>>> almost 2G all by itself. And 8G may be getting tight if additional
>>> stuff like Ceph is being added.
>>>
>>>
>> I'm not really surprised at all about Nova being a memory hog with the
>> versioned object stuff we have, which does its own nesting of objects.
>>
>> What tools do people use to profile the memory usage by the types of
>> objects in memory while this is running?
>
>
> objgraph and guppy/heapy
>
> http://smira.ru/wp-content/uploads/2011/08/heapy.html
>
> https://www.huyng.com/posts/python-performance-analysis
>
> You can also use gc.get_objects()
> (https://docs.python.org/2/library/gc.html#gc.get_objects) to get a list
> of all objects in memory and go from there.
>
> Slots (https://docs.python.org/2/reference/datamodel.html#slots) are
> useful for reducing the memory usage of objects.
>
>
>>
>> --
>>
>> Thanks,
>>
>> Matt Riedemann
>>


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-18 Thread Joe Gordon
On Thu, Jan 19, 2017 at 10:27 AM, Matt Riedemann  wrote:

> On 1/18/2017 4:53 AM, Jens Rosenboom wrote:
>
>> To me it looks like the times of 2G are long gone; Nova is using
>> almost 2G all by itself. And 8G may be getting tight if additional
>> stuff like Ceph is being added.
>>
>>
> I'm not really surprised at all about Nova being a memory hog with the
> versioned object stuff we have, which does its own nesting of objects.
>
> What tools do people use to profile the memory usage by the types of
> objects in memory while this is running?


objgraph and guppy/heapy

http://smira.ru/wp-content/uploads/2011/08/heapy.html

https://www.huyng.com/posts/python-performance-analysis

You can also use gc.get_objects()
(https://docs.python.org/2/library/gc.html#gc.get_objects) to get a list
of all objects in memory and go from there.
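
A minimal sketch of that approach, tallying live objects by type:

---
import gc
from collections import Counter

# Count every object the garbage collector tracks, grouped by type name,
# and print the twenty most common ones.
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, count in counts.most_common(20):
    print('%8d %s' % (count, name))
---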

Slots (https://docs.python.org/2/reference/datamodel.html#slots) are useful
for reducing the memory usage of objects.
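
An illustrative comparison (the Point classes here are hypothetical, just
to show the effect):

---
import sys

class Point(object):           # regular class: each instance carries a __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlottedPoint(object):    # __slots__ replaces the per-instance __dict__
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x, self.y = x, y

# The per-instance dict alone costs this much, before its contents:
print(sys.getsizeof(Point(1, 2).__dict__))
# SlottedPoint instances have no __dict__ at all, so large numbers of
# small objects get noticeably cheaper.
---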


>
> --
>
> Thanks,
>
> Matt Riedemann
>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-18 Thread Ian Wienand
On 01/14/2017 02:48 AM, Jakub Libosvar wrote:
> recently I noticed we got oom-killer in action in one of our jobs [1]. 

> Any other ideas?

I spent quite a while chasing down similar issues with CentOS some time
ago.  I do have some ideas :)

The symptom is probably that mysql gets chosen by the OOM killer, but
it's unlikely to be mysql's fault; it's just big and a good target.

If the system is going offline, I added the ability to turn on the
netconsole in devstack-gate with [1].  As the comment mentions, you
can put little tests that stream data into /dev/kmsg and they will
generally get off the host, even if ssh has been killed.  I found this
very useful for getting the initial oops data (I've used this several
times for other gate oopses, including other kernel issues we've
seen).
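
A tiny sketch of that trick (needs root; assumes a Linux host):

---
# Anything written to /dev/kmsg lands in the kernel ring buffer, so
# netconsole can stream it off the host even after userspace (ssh
# included) has been killed.
with open('/dev/kmsg', 'w') as kmsg:
    kmsg.write('memory-debug: still alive\n')
---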

For starting to pin down what is really consuming the memory, the
first thing I did was write a peak-memory usage tracker that gave me
stats on memory growth during the devstack run [2].  You have to
enable this with "enable_service peakmem_tracker".  This starts to
give you the big picture of where memory is going.
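
The idea behind the tracker, sketched in Python for illustration (the
real tool is a shell script [2]):

---
import time

def mem_available_kb():
    # MemAvailable is the kernel's estimate of how much memory new
    # workloads can use without swapping.
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])

# Sample periodically and report each new low-water mark.
low = mem_available_kb()
while True:
    avail = mem_available_kb()
    if avail < low:
        low = avail
        print('new low-water mark: %d kB available' % low)
    time.sleep(10)
---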

At this point you should have a rough idea of the real cause, and
you're going to want to start dumping /proc/<pid>/smaps of target
processes to get an idea of where the memory they're allocating is
going, or at the very least what libraries might be involved.  The
next step is going to depend on what you need to target...
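
As a first cut, something like this hedged sketch (pass the target pid
as an argument) shows where a process's memory actually sits:

---
import sys
from collections import defaultdict

# Sum proportional set size (Pss) per mapping in /proc/<pid>/smaps to
# see which libraries or anonymous regions hold the memory.
pss = defaultdict(int)
mapping = '[anon]'
with open('/proc/%s/smaps' % sys.argv[1]) as f:
    for line in f:
        fields = line.split()
        if ':' not in fields[0]:
            # mapping header: "start-end perms offset dev inode [path]"
            mapping = fields[5] if len(fields) > 5 else '[anon]'
        elif fields[0] == 'Pss:':
            pss[mapping] += int(fields[1])  # value is in kB

for name, kb in sorted(pss.items(), key=lambda kv: -kv[1])[:15]:
    print('%8d kB  %s' % (kb, name))
---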

If it's python, it can get a bit tricky to see where the memory is
going, but there are a number of approaches.  At the time, despite it
being mostly unmaintained, I had some success with guppy [3].  In
my case, for example, I managed to hook into swift's wsgi startup and
run that under guppy, giving me the ability to get some heap stats.
From my notes [4] that looked something like:

---
import signal
import sys

from guppy import hpy
# parse_options, run_wsgi and server come from swift, as in the
# bin/swift-object-server startup script this hooks into.
from swift.common.utils import parse_options
from swift.common.wsgi import run_wsgi
from swift.obj import server

def handler(signum, frame):
    # Dump guppy heap statistics whenever SIGUSR1 is received.
    hp = hpy()
    with open('/tmp/heap.txt', 'w+') as f:
        f.write(str(hp.heap()))

if __name__ == '__main__':
    conf_file, options = parse_options()
    signal.signal(signal.SIGUSR1, handler)

    sys.exit(run_wsgi(conf_file, 'object-server',
                      global_conf_callback=server.global_conf_callback,
                      **options))
---
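
With that in place, sending SIGUSR1 to the running object-server process
(kill -USR1 <pid>) dumps heap statistics to /tmp/heap.txt, which you can
then diff between runs.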

There are of course other tools, from gdb to malloc tracers, etc.

But that was enough that I could try different things and compare the
heap usage.  Once you've got the smoking gun ... well, then the hard
work of fixing it starts :) In my case it was pycparser, and we came up
with a good solution [5].

Hopefully those are some useful tips ... #openstack-infra can of course
help with holding VMs etc. as required.

-i

[1] 
http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n438
[2] 
https://git.openstack.org/cgit/openstack-dev/devstack/tree/tools/peakmem_tracker.sh
[3] https://pypi.python.org/pypi/guppy/
[4] https://etherpad.openstack.org/p/oom-in-rax-centos7-CI-job
[5] https://github.com/eliben/pycparser/issues/72

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-18 Thread Matt Riedemann

On 1/18/2017 4:53 AM, Jens Rosenboom wrote:

To me it looks like the times of 2G are long gone; Nova is using
almost 2G all by itself. And 8G may be getting tight if additional
stuff like Ceph is being added.



I'm not really surprised at all about Nova being a memory hog with the
versioned object stuff we have, which does its own nesting of objects.


What tools do people use to profile the memory usage by the types of
objects in memory while this is running?


--

Thanks,

Matt Riedemann


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-18 Thread Matt Riedemann

On 1/13/2017 9:48 AM, Jakub Libosvar wrote:

Hi,

recently I noticed we got oom-killer in action in one of our jobs [1]. I
saw it several times, so far only with the linux bridge job. The
consequence is that usually mysqld gets killed as the process that
consumes most of the memory; sometimes even nova-api gets killed.

Does anybody know whether we can bump memory on nodes in the gate
without losing resources for running other jobs?
Has anybody experience with memory consumption being higher when using
linux bridge agents?

Any other ideas?

Thanks,
Jakub

[1]
http://logs.openstack.org/73/373973/13/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/295d92f/logs/syslog.txt.gz#_Jan_11_13_56_32


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



I don't think it's just the linuxbridge job, see:

http://status.openstack.org//elastic-recheck/index.html#1656850

And the linked logstash query, then expand by build_name.

I also tracked that in logstash to have started around 1/10, which was
within our 10 days of logs, so something happened around then to start
tipping us over. I had some leads in the bug report, but I think the
keystone team took over from there.


--

Thanks,

Matt Riedemann


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-18 Thread Jens Rosenboom
2017-01-13 17:56 GMT+01:00 Clark Boylan :
> On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
>> Does anybody know whether we can bump memory on nodes in the gate
>> without losing resources for running other jobs?
>> Has anybody experience with memory consumption being higher when using
>> linux bridge agents?
>>
>> Any other ideas?
>
> Ideally I think we would see more work to reduce memory consumption.
> Heat has been able to more than halve their memory usage recently [0].
> Perhaps start by identifying the biggest memory hogs and go from there?
>
> [0]
> http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html

In order to have some real data, I've run reproduce.sh for a random
full tempest check and aggregated the memory usage from ps output
during the tempest run [1].
To me it looks like the times of 2G are long gone; Nova is using
almost 2G all by itself. And 8G may be getting tight if additional
stuff like Ceph is being added.
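
A sketch of that kind of aggregation (not the exact script behind [1]),
summing resident set size per command name from ps:

---
import subprocess
from collections import defaultdict

# Sum resident set size (RSS, in kB) per command name across all processes.
rss = defaultdict(int)
output = subprocess.check_output(['ps', '-eo', 'rss,comm'],
                                 universal_newlines=True)
for line in output.splitlines()[1:]:
    kb, comm = line.split(None, 1)
    rss[comm.strip()] += int(kb)

for comm, kb in sorted(rss.items(), key=lambda kv: -kv[1])[:20]:
    print('%8d kB  %s' % (kb, comm))
---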

As a side note, we are seeing consistent failures for the Chef
OpenStack Cookbook integration tests on infra. We have set up an
external CI now running on 12G instances and are getting successful
results there. [2]

[1] http://paste.openstack.org/show/595348/
[2] https://review.openstack.org/409900

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-13 Thread Dariusz Śmigiel
2017-01-13 11:13 GMT-06:00 Kevin Benton :
> Sounds like we must have a memory leak in the Linux bridge agent if that's
> the only difference between the Linux bridge job and the ovs ones. Is there
> a bug tracking this?

Just created one [1]. So far this issue has been observed in two cases
(both mentioned in the bug description).

[1] https://bugs.launchpad.net/neutron/+bug/1656386

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-13 Thread Kevin Benton
Sounds like we must have a memory leak in the Linux bridge agent if that's
the only difference between the Linux bridge job and the ovs ones. Is there
a bug tracking this?

On Jan 13, 2017 08:58, "Clark Boylan"  wrote:

> On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
> > Does anybody know whether we can bump memory on nodes in the gate
> > without losing resources for running other jobs?
> > Has anybody experience with memory consumption being higher when using
> > linux bridge agents?
> >
> > Any other ideas?
>
> Ideally I think we would see more work to reduce memory consumption.
> Heat has been able to more than halve their memory usage recently [0].
> Perhaps start by identifying the biggest memory hogs and go from there?
>
> [0]
> http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html
>
> Clark
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-13 Thread Clark Boylan
On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
> Does anybody know whether we can bump memory on nodes in the gate 
> without losing resources for running other jobs?
> Has anybody experience with memory consumption being higher when using 
> linux bridge agents?
> 
> Any other ideas?

Ideally I think we would see more work to reduce memory consumption.
Heat has been able to more than halve their memory usage recently [0].
Perhaps start by identifying the biggest memory hogs and go from there?

[0]
http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html

Clark

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

2017-01-13 Thread Jeremy Stanley
On 2017-01-13 16:48:26 +0100 (+0100), Jakub Libosvar wrote:
[...]
> Does anybody know whether we can bump memory on nodes in the gate without
> losing resources for running other jobs?
[...]

We picked 8GB back when typical devstack-gate jobs only used around
2GB of memory, to make sure there was a hard upper limit developers
could expect when trying to recreate the same tests locally on their
systems. It would take a lot of convincing to raise that further
(and yes, it would reduce the number of test instances we can run in
most of our providers, since memory is generally the limiting factor
for our nova quotas).
-- 
Jeremy Stanley

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev