Re: [openstack-dev] Your next semi weekly gate status report

2017-03-29 Thread Ian Wienand

On 03/28/2017 08:57 AM, Clark Boylan wrote:

> 1. Libvirt crashes: http://status.openstack.org/elastic-recheck/#1643911
> and http://status.openstack.org/elastic-recheck/#1646779
>
> Libvirt is randomly crashing during the job, which causes things to fail
> (for obvious reasons). Addressing this will likely require someone with
> experience debugging libvirt, since it's most likely a bug isolated to
> libvirt. We're looking for someone familiar with libvirt internals to
> drive the effort to fix this issue.


Ok, from the bug [1] we're seeing malloc() corruption.

While I agree that a coredump is not that likely to help, I would also
like to come to that conclusion after inspecting a coredump :) I've
found things in the heap before that give clues as to what real
problems are.

To this end, I've proposed [2] to keep coredumps.  It's a little
hackish, but I think it gets the job done.  [3] enables this in
devstack-gate and saves any dumps to the logs.
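For reference, the general shape of what keeping coredumps involves can
be sketched as below.  The directory and pattern here are illustrative
assumptions on my part, not the actual values from [2]:

```shell
# Sketch only: persist core dumps so a crashed libvirtd leaves evidence.
# CORE_DIR and the pattern are hypothetical; see [2] for the real change.
CORE_DIR=/tmp/coredumps
mkdir -p "${CORE_DIR}"

# Allow processes in this shell to write unlimited-size core files
# (may be capped by the hard limit on some hosts, hence the fallback).
ulimit -c unlimited 2>/dev/null || true

# The kernel expands %e to the executable name and %p to the pid.
# Writing core_pattern needs root, so just show what we would set.
CORE_PATTERN="${CORE_DIR}/core.%e.%p"
echo "would set kernel.core_pattern=${CORE_PATTERN}"
```

The dump files can then be copied into the job's log directory at
teardown, which is what [3] wires up in d-g.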

As suggested, running under valgrind would be great but probably
impractical until we narrow it down a little.  Another thing I've had
some success with is Electric Fence [4], which puts guard regions
around allocations so that out-of-bounds access faults at the time of
access.  I've proposed [5] to try this out, but unfortunately it's not
looking particularly promising.  I'm open to suggestions; for example,
something like tcmalloc might give us a different failure and could be
another clue.  If we get something vaguely reliable here, our best bet
might be to run a parallel non-voting job on all changes to see what
we can pick up.
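To illustrate the LD_PRELOAD approach (the library path and target
binary below are assumptions that vary by distro, and are not taken
from [5]):

```shell
# Sketch: wrap a binary with Electric Fence via LD_PRELOAD so that heap
# overruns fault immediately instead of corrupting the allocator state.
# Both paths are assumptions; adjust for the actual image.
EFENCE_LIB=/usr/lib/x86_64-linux-gnu/libefence.so.0
TARGET=/usr/sbin/libvirtd

if [ -r "${EFENCE_LIB}" ]; then
    # efence installed: preload it in front of the real allocator.
    WRAP_CMD="env LD_PRELOAD=${EFENCE_LIB} ${TARGET}"
else
    # Not installed: fall back to running the target unmodified.
    WRAP_CMD="${TARGET}"
fi
echo "${WRAP_CMD}"
```

The cost is that every allocation gets its own guard page, so memory
use balloons badly; that's why this only makes sense once the failure
is narrowed down to something reproducible.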

-i

[1] https://bugs.launchpad.net/nova/+bug/1643911
[2] https://review.openstack.org/451128
[3] https://review.openstack.org/451219
[4] http://elinux.org/Electric_Fence
[5] https://review.openstack.org/451136

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] Your next semi weekly gate status report

2017-03-27 Thread Clark Boylan
Hello,

Previously we saw that Libvirt crashes, OOMs, and Tempest SSH banner
failures were a problem. The SSH banner failures have since been sorted
out; thank you to everyone who helped work that out. For details please
see https://review.openstack.org/#/c/439638/. There is also a fix for a
race that was causing ssh to fail in test_attach_detach_volume at
https://review.openstack.org/#/c/449661/ (this change is not merged
yet, so it would be great if tempest cores could get it in).

To address the OOMs we've also seen work to reduce the memory overhead
in running devstack. Changes to modify Apache's memory use have gone in:
https://review.openstack.org/#/c/446741/
https://review.openstack.org/#/c/445910/

We also tried putting MySQL on a diet, but that had to be reverted,
https://review.openstack.org/#/c/446196/.

There is also a memory_tracker logging service, whose output you'll now
find in your job logs. This can be useful in determining where memory
went, which in turn helps target reductions.
https://review.openstack.org/#/c/434470/.
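Roughly speaking, a tracker like this just snapshots system and
per-process memory on an interval. A minimal sketch of one sample
(the fields here are assumptions of mine; see the linked change for
what the real service logs):

```shell
# Minimal sketch of a memory_tracker-style sample. The real service
# logs more detail; the commands and fields here are assumptions.
snapshot() {
    date                                      # timestamp the sample
    free -m                                   # system memory + swap, in MiB
    ps -eo rss,comm --sort=-rss | head -n 6   # biggest RSS consumers
}

SAMPLE=$(snapshot)
echo "${SAMPLE}"
```

Grepping the resulting log around the time of an OOM kill shows which
processes were growing and roughly when.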

It is great to see people taking an interest in addressing memory
issues, and according to elastic-recheck the OOM killer is no longer a
major problem. That said, there is more that we can do here.
Outstanding changes that may also help include:
https://review.openstack.org/#/c/447119/
https://review.openstack.org/#/c/450207/

But we also really need individual projects to be looking at the memory
consumption of openstack itself and work on trimming as they are able.

Unfortunately the Libvirt crashes continue to be a problem.

Current top issues:

1. Libvirt crashes: http://status.openstack.org/elastic-recheck/#1643911
and http://status.openstack.org/elastic-recheck/#1646779

Libvirt is randomly crashing during the job, which causes things to fail
(for obvious reasons). Addressing this will likely require someone with
experience debugging libvirt, since it's most likely a bug isolated to
libvirt. We're looking for someone familiar with libvirt internals to
drive the effort to fix this issue.

2. Network packet loss in OSIC

This has caused connectivity errors to external services. Various e-r
bugs like http://status.openstack.org/elastic-recheck/index.html#1282876
http://status.openstack.org/elastic-recheck/index.html#1674681
http://status.openstack.org/elastic-recheck/index.html#1669162
http://status.openstack.org/elastic-recheck/index.html#1326813 all
appear to have tripped on this. We expect that the problem has been
corrected, but we should keep an eye on these and make sure they fall
off the e-r list.

Also our classification rate has taken a nose dive lately:

http://status.openstack.org/elastic-recheck/data/integrated_gate.html

Something that would help is if people start classifying these
failures. While the overall failure rate is lower than in previous
weeks, a low classification rate means there are race conditions
(or other failures) we're not tracking yet, which will only make them
more difficult to fix. Normally, if the classification rate is below
90%, we've got at least one big persistent failure condition we're not
aware of yet.

Thank you,

mtreinish and clarkb
