Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
What I don't understand is why the OOM killer is being invoked when there is almost no swap space being used at all. Check out the memory output when it's killed: http://logs.openstack.org/59/382659/26/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7de01d0/logs/syslog.txt.gz#_Jan_11_15_54_36

"Jan 11 15:54:36 ubuntu-xenial-rax-ord-6599274 kernel: Free swap = 7994832kB
Jan 11 15:54:36 ubuntu-xenial-rax-ord-6599274 kernel: Total swap = 7999020kB"

Do we have something set that is effectively disabling the usage of swap space?

On Wed, Jan 18, 2017 at 4:13 PM, Joe Gordon wrote:
>
> On Thu, Jan 19, 2017 at 10:27 AM, Matt Riedemann <mrie...@linux.vnet.ibm.com> wrote:
>
>> On 1/18/2017 4:53 AM, Jens Rosenboom wrote:
>>
>>> To me it looks like the times of 2G are long gone, Nova is using
>>> almost 2G all by itself. And 8G may be getting tight if additional
>>> stuff like Ceph is being added.
>>
>> I'm not really surprised at all about Nova being a memory hog with the
>> versioned object stuff we have which does its own nesting of objects.
>>
>> What tools do people use to profile the memory usage by the
>> types of objects in memory while this is running?
>
> objgraph and guppy/heapy
>
> http://smira.ru/wp-content/uploads/2011/08/heapy.html
>
> https://www.huyng.com/posts/python-performance-analysis
>
> You can also use gc.get_objects() (https://docs.python.org/2/library/gc.html#gc.get_objects)
> to get a list of all objects in memory and go from there.
>
> Slots (https://docs.python.org/2/reference/datamodel.html#slots) are
> useful for reducing the memory usage of objects.
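One thing worth ruling out is a sysctl that steers the kernel away from swap or enforces strict overcommit accounting. A minimal sketch (generic Linux procfs paths, not verified against the gate images) that dumps the relevant tunables:

```python
def read_int(path):
    """Read a single integer value from a procfs file."""
    with open(path) as f:
        return int(f.read().strip())

def swap_settings():
    return {
        # 0 tells the kernel to strongly avoid swapping application pages.
        "vm.swappiness": read_int("/proc/sys/vm/swappiness"),
        # 2 = strict overcommit accounting: allocations can start failing
        # (and the OOM killer can be provoked) while swap sits unused.
        "vm.overcommit_memory": read_int("/proc/sys/vm/overcommit_memory"),
    }

if __name__ == "__main__":
    for name, value in swap_settings().items():
        print("%s = %d" % (name, value))
```

Comparing these values between the gate image and a stock Xenial install would quickly show whether anything was deliberately tuned.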
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
On Thu, Jan 19, 2017 at 10:27 AM, Matt Riedemann wrote:
> On 1/18/2017 4:53 AM, Jens Rosenboom wrote:
>
>> To me it looks like the times of 2G are long gone, Nova is using
>> almost 2G all by itself. And 8G may be getting tight if additional
>> stuff like Ceph is being added.
>
> I'm not really surprised at all about Nova being a memory hog with the
> versioned object stuff we have which does its own nesting of objects.
>
> What tools do people use to profile the memory usage by the
> types of objects in memory while this is running?

objgraph and guppy/heapy

http://smira.ru/wp-content/uploads/2011/08/heapy.html

https://www.huyng.com/posts/python-performance-analysis

You can also use gc.get_objects() (https://docs.python.org/2/library/gc.html#gc.get_objects) to get a list of all objects in memory and go from there.

Slots (https://docs.python.org/2/reference/datamodel.html#slots) are useful for reducing the memory usage of objects.
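As a starting point, a census of live objects via gc.get_objects() plus a quick comparison of per-instance sizes with and without __slots__ might look like this (a generic sketch, not tied to any OpenStack service):

```python
import gc
import sys
from collections import Counter

def top_types(n=10):
    """Count live objects by type name using gc.get_objects()."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(n)

# __slots__ removes the per-instance __dict__, which adds up quickly
# for objects created by the thousand.
class WithDict(object):
    def __init__(self):
        self.a = 1
        self.b = 2

class WithSlots(object):
    __slots__ = ('a', 'b')
    def __init__(self):
        self.a = 1
        self.b = 2

if __name__ == "__main__":
    for name, count in top_types():
        print("%-20s %d" % (name, count))
    w = WithDict()
    print("with __dict__: %d bytes" % (sys.getsizeof(w) + sys.getsizeof(w.__dict__)))
    print("with __slots__: %d bytes" % sys.getsizeof(WithSlots()))
```

Running top_types() periodically and diffing the counts (which is essentially what objgraph.show_growth() automates) is a cheap way to spot which type is leaking.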
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
On 01/14/2017 02:48 AM, Jakub Libosvar wrote:
> recently I noticed we got oom-killer in action in one of our jobs [1].
> Any other ideas?

I spent quite a while chasing down similar things with centos a while ago, so I do have some ideas :) The symptom is probably that mysql gets chosen by the OOM killer, but it's unlikely to be mysql's fault; it's just big and a good target.

If the system is going offline, I added the ability to turn on the netconsole in devstack-gate with [1]. As the comment mentions, you can put little tests that stream data into /dev/kmsg and they will generally get off the host, even if ssh has been killed. I found this very useful for getting the initial oops data (I've used this several times for other gate oopses, including other kernel issues we've seen).

For starting to pin down what is really consuming the memory, the first thing I did was write a peak-memory usage tracker that gave me stats on memory growth during the devstack run [2]. You have to enable this with "enable_service peakmem_tracker". This starts to give you the big picture of where memory is going. At this point, you should have a rough idea of the real cause, and you're going to want to start dumping /proc/<pid>/smaps of target processes to get an idea of where the memory they're allocating is going, or at the very least what libraries might be involved.

The next step is going to depend on what you need to target. If it's Python, it can get a bit tricky to see where the memory is going, but there are a number of approaches. At the time, despite it being mostly unmaintained, I had some success with guppy [3]. In my case, for example, I managed to hook into swift's wsgi startup and run that under guppy, giving me the ability to get some heap stats.
From my notes [4] that looked something like:

---
import signal, os, sys
from guppy import hpy

def handler(signum, frame):
    f = open('/tmp/heap.txt', 'w+')
    f.write("testing\n")
    hp = hpy()
    f.write(str(hp.heap()))
    f.close()

if __name__ == '__main__':
    conf_file, options = parse_options()
    signal.signal(signal.SIGUSR1, handler)
    sys.exit(run_wsgi(conf_file, 'object-server',
                      global_conf_callback=server.global_conf_callback,
                      **options))
---

There are of course other tools, from gdb to malloc tracers, etc. But that was enough that I could try different things and compare the heap usage.

Once you've got the smoking gun ... well, then the hard work starts of fixing it :) In my case it was pycparser and we came up with a good solution [5].

Hopefully that's some useful tips ... #openstack-infra can of course help with holding vms etc. as required.

-i

[1] http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n438
[2] https://git.openstack.org/cgit/openstack-dev/devstack/tree/tools/peakmem_tracker.sh
[3] https://pypi.python.org/pypi/guppy/
[4] https://etherpad.openstack.org/p/oom-in-rax-centos7-CI-job
[5] https://github.com/eliben/pycparser/issues/72
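The /proc/<pid>/smaps dumping mentioned above can be sketched as a small aggregator that sums the per-mapping counters by mapped file (a generic sketch; exact fields available vary by kernel version):

```python
import re

def smaps_totals(pid="self"):
    """Sum Rss/Pss/Private_Dirty (kB) per mapped file from /proc/<pid>/smaps."""
    totals = {}
    current = None
    with open("/proc/%s/smaps" % pid) as f:
        for line in f:
            # Mapping header: "address perms offset dev inode [pathname]"
            m = re.match(r'^[0-9a-f]+-[0-9a-f]+ \S+ \S+ \S+ \S+\s*(.*)$', line)
            if m:
                current = m.group(1) or "[anon]"
                totals.setdefault(current,
                                  {"Rss": 0, "Pss": 0, "Private_Dirty": 0})
                continue
            parts = line.split(":")
            if current and parts[0] in ("Rss", "Pss", "Private_Dirty"):
                # Value lines look like "Rss:        12 kB"
                totals[current][parts[0]] += int(parts[1].strip().split()[0])
    return totals

if __name__ == "__main__":
    ranked = sorted(smaps_totals().items(), key=lambda kv: -kv[1]["Pss"])
    for path, t in ranked[:10]:
        print("%8d kB Pss  %s" % (t["Pss"], path))
```

Sorting by Pss (proportional set size) gives a fairer picture than Rss when shared libraries are mapped into many processes.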
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
On 1/18/2017 4:53 AM, Jens Rosenboom wrote:
> To me it looks like the times of 2G are long gone, Nova is using
> almost 2G all by itself. And 8G may be getting tight if additional
> stuff like Ceph is being added.

I'm not really surprised at all about Nova being a memory hog with the versioned object stuff we have which does its own nesting of objects.

What tools do people use to profile the memory usage by the types of objects in memory while this is running?

--

Thanks,

Matt Riedemann
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
On 1/13/2017 9:48 AM, Jakub Libosvar wrote:
> Hi,
>
> recently I noticed we got oom-killer in action in one of our jobs [1]. I saw
> it several times, so far only with the linux bridge job. The consequence is
> that usually mysqld gets killed as the process that consumes most of the
> memory, sometimes even nova-api gets killed.
>
> Does anybody know whether we can bump memory on nodes in the gate
> without losing resources for running other jobs?
> Has anybody experience with memory consumption being higher when using
> linux bridge agents?
>
> Any other ideas?
>
> Thanks,
> Jakub
>
> [1] http://logs.openstack.org/73/373973/13/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/295d92f/logs/syslog.txt.gz#_Jan_11_13_56_32

I don't think it's just the linuxbridge job, see:

http://status.openstack.org//elastic-recheck/index.html#1656850

And the linked logstash query, then expand by build_name.

I also tracked in logstash that this started around 1/10, which was within our 10 days of logs, so something happened around then to start tipping us over. I had some leads in the bug report but I think the keystone team took over from there.

--

Thanks,

Matt Riedemann
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
2017-01-13 17:56 GMT+01:00 Clark Boylan:
> On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
>> Does anybody know whether we can bump memory on nodes in the gate
>> without losing resources for running other jobs?
>> Has anybody experience with memory consumption being higher when using
>> linux bridge agents?
>>
>> Any other ideas?
>
> Ideally I think we would see more work to reduce memory consumption.
> Heat has been able to more than halve their memory usage recently [0].
> Perhaps start by identifying the biggest memory hogs and go from there?
>
> [0] http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html

In order to have some real data, I've run reproduce.sh for a random full tempest check and aggregated the memory usage from ps output during the tempest run [1].

To me it looks like the times of 2G are long gone, Nova is using almost 2G all by itself. And 8G may be getting tight if additional stuff like Ceph is being added.

As a side note, we are seeing consistent failures for the Chef OpenStack Cookbook integration tests on infra. We have set up an external CI now running on 12G instances and are getting successful results there. [2]

[1] http://paste.openstack.org/show/595348/
[2] https://review.openstack.org/409900
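The kind of per-process aggregation described above can be approximated with a one-off script over ps output (a sketch, not the script actually used for the paste; grouping by command name only, so forked workers are summed together):

```python
import subprocess
from collections import defaultdict

def rss_by_command():
    """Aggregate resident set size (kB) per command name from ps output."""
    # "rss=,comm=" suppresses the header line, giving "<rss> <command>" rows.
    out = subprocess.check_output(["ps", "-e", "-o", "rss=,comm="])
    totals = defaultdict(int)
    for line in out.decode().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            totals[parts[1].strip()] += int(parts[0])
    return dict(totals)

if __name__ == "__main__":
    ranked = sorted(rss_by_command().items(), key=lambda kv: -kv[1])
    for comm, rss in ranked[:15]:
        print("%10d kB  %s" % (rss, comm))
```

Note that summing RSS double-counts shared pages across processes, so the totals overstate actual usage somewhat; it is still good enough to rank the hogs.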
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
2017-01-13 11:13 GMT-06:00 Kevin Benton:
> Sounds like we must have a memory leak in the Linux bridge agent if that's
> the only difference between the Linux bridge job and the ovs ones. Is there
> a bug tracking this?

Just created one [1]. For now, this issue was observed in two cases (mentioned in the bug description).

[1] https://bugs.launchpad.net/neutron/+bug/1656386
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
Sounds like we must have a memory leak in the Linux bridge agent if that's the only difference between the Linux bridge job and the ovs ones. Is there a bug tracking this?

On Jan 13, 2017 08:58, "Clark Boylan" wrote:
> On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
>> Does anybody know whether we can bump memory on nodes in the gate
>> without losing resources for running other jobs?
>> Has anybody experience with memory consumption being higher when using
>> linux bridge agents?
>>
>> Any other ideas?
>
> Ideally I think we would see more work to reduce memory consumption.
> Heat has been able to more than halve their memory usage recently [0].
> Perhaps start by identifying the biggest memory hogs and go from there?
>
> [0] http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html
>
> Clark
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
On Fri, Jan 13, 2017, at 07:48 AM, Jakub Libosvar wrote:
> Does anybody know whether we can bump memory on nodes in the gate
> without losing resources for running other jobs?
> Has anybody experience with memory consumption being higher when using
> linux bridge agents?
>
> Any other ideas?

Ideally I think we would see more work to reduce memory consumption. Heat has been able to more than halve their memory usage recently [0]. Perhaps start by identifying the biggest memory hogs and go from there?

[0] http://lists.openstack.org/pipermail/openstack-dev/2017-January/109748.html

Clark
Re: [openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job
On 2017-01-13 16:48:26 +0100 (+0100), Jakub Libosvar wrote:
[...]
> Does anybody know whether we can bump memory on nodes in the gate without
> losing resources for running other jobs?
[...]

We picked 8gb back when typical devstack-gate jobs only used around 2gb of memory, to make sure there was a hard upper limit developers could expect when trying to recreate the same tests locally on their systems. It would take a lot of convincing to raise that further (and yes, it would reduce the number of test instances we can run in most of our providers, since memory is generally the limiting factor for our nova quotas).

--
Jeremy Stanley