Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Peter Portante
Hi Sean,

Given that the swift failure happened only once in the available
logstash-recorded history, do we still feel this is a major gate issue?

See:
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRkFJTDogdGVzdF9ub2RlX3dyaXRlX3RpbWVvdXRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiYWxsIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4MDExNzgwMX0=
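
(Aside: the fragment after the '#' in these logstash URLs is just
base64-encoded JSON describing the search. A minimal sketch to pull the
query back out, assuming Python 3 and only the standard library:)

    import base64
    import json
    from urllib.parse import urlparse

    def logstash_query(url):
        """Return the search parameters embedded in a logstash.openstack.org URL."""
        fragment = urlparse(url).fragment
        # Re-pad to a multiple of 4 in case the base64 padding was trimmed.
        padded = fragment + "=" * (-len(fragment) % 4)
        return json.loads(base64.b64decode(padded))

    # For the URL above this returns, among other fields:
    #   {"search": 'message:"FAIL: test_node_write_timeout"', "timeframe": "all", ...}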

Thanks,

-peter


Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Sean Dague
On 01/24/2014 11:18 AM, Peter Portante wrote:
 Hi Sean,
 
 Given the swift failure happened once in the available logstash recorded
 history, do we still feel this is a major gate issue?
 
 See: 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRkFJTDogdGVzdF9ub2RlX3dyaXRlX3RpbWVvdXRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiYWxsIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4MDExNzgwMX0=
 
 Thanks,
 
 -peter

In the last 7 days, Swift unit tests have failed 50 times in the gate
queue:
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRmluaXNoZWQ6IEZBSUxVUkVcIiBBTkQgcHJvamVjdDpcIm9wZW5zdGFjay9zd2lmdFwiIEFORCBidWlsZF9xdWV1ZTpnYXRlIEFORCBidWlsZF9uYW1lOmdhdGUtc3dpZnQtcHl0aG9uKiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4NTEwNzY1M30=
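
(Decoded, that link searches the last 7 days for:

    message:"Finished: FAILURE" AND project:"openstack/swift"
    AND build_queue:gate AND build_name:gate-swift-python*

i.e. any gate-swift-python* run in the gate queue whose console log ended
in "Finished: FAILURE", whatever the cause.)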

That's a pretty high rate of failure, and really needs investigation.

Unit tests should never be failing in the gate, for any project. Russell
did a great job sorting out some bad tests in Nova the last couple of
days, and it would be good for other projects that are seeing similar
issues to do the same.

-Sean

-- 
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net





Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Clay Gerrard



 That's a pretty high rate of failure, and really needs investigation.


That's a great point; did you look into the logs of any of those jobs?
Thanks for bringing it to my attention.

I saw a few swift tests that would pop; I'll open bugs to look into those.
But the cardinality of the failures (7) was dwarfed by Jenkins failures I
don't quite understand.

[EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
(e.g.
http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html
)

FATAL: command execution failed | java.io.InterruptedIOException (e.g.
http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html
)

These jobs are blowing up while setting up the workspace on the slave, and
we're not automatically retrying them?  How can this only be affecting swift?

-Clay


Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread John Griffith
On Fri, Jan 24, 2014 at 11:37 AM, Clay Gerrard clay.gerr...@gmail.com wrote:


 That's a pretty high rate of failure, and really needs investigation.


 That's a great point, did you look into the logs of any of those jobs?
 Thanks for bringing it to my attention.

 I saw a few swift tests that would pop, I'll open bugs to look into those.
 But the cardinality of the failures (7) was dwarfed by jenkins failures I
 don't quite understand.

 [EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
 (e.g.
 http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html)

 FATAL: command execution failed | java.io.InterruptedIOException (e.g.
 http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html)

 These jobs are blowing up setting up the workspace on the slave, and we're
 not automatically retrying them?  How can this only be effecting swift?

It's certainly not just swift:

http://logstash.openstack.org/#eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwNTg5MTg4NjY5fQ==


 -Clay





Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Peter Portante
Hi Sean,

In the last 7 days I see only 6 python27-based test failures:
http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNzogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk2Mjk0MDR9

And 4 python26-based test failures:
http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNjogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk1MzAzNTd9

Maybe the query you posted captures failures where the job did not even run?

And only 15 hits (well, 18, but three are within the same job, and some of
the tests are run twice, so the combined total is 10 hits):
http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRkFJTDpcIiBhbmQgbWVzc2FnZTpcInRlc3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4OTg1NTAzMX0=


Thanks,

-peter


Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Clark Boylan
On Fri, Jan 24, 2014, at 10:51 AM, John Griffith wrote:
 On Fri, Jan 24, 2014 at 11:37 AM, Clay Gerrard clay.gerr...@gmail.com
 wrote:
 
 
  That's a pretty high rate of failure, and really needs investigation.
 
 
  That's a great point, did you look into the logs of any of those jobs?
  Thanks for bringing it to my attention.
 
  I saw a few swift tests that would pop, I'll open bugs to look into those.
  But the cardinality of the failures (7) was dwarfed by jenkins failures I
  don't quite understand.
 
  [EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
  (e.g.
  http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html)
 
  FATAL: command execution failed | java.io.InterruptedIOException (e.g.
  http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html)
 
  These jobs are blowing up setting up the workspace on the slave, and we're
  not automatically retrying them?  How can this only be effecting swift?
 
 It's certainly not just swift:
 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwNTg5MTg4NjY5fQ==
 
 
  -Clay
 
 
 

This isn't all doom and gloom, but rather an unfortunate side effect of
how Jenkins aborts jobs. When a job is aborted, there are corner cases
where Jenkins does not catch all of the exceptions that may happen, and
that results in the build being reported as a failure instead of an abort.
Now all of this would be fine if we never aborted jobs, but it turns out
Zuul aggressively aborts jobs when it knows the result of that job will
not help anything (either the ability to merge or useful results to report
back to code reviewers).

I have a hunch (but would need to do a bunch of digging to confirm it)
that most of these errors are simply job aborts that happened in ways
that Jenkins couldn't recover from gracefully. Looking at the most
recent occurrence of this particular failure we see
https://review.openstack.org/#/c/66307 failed
gate-tempest-dsvm-neutron-large-ops. If we go to the comments on the
change we see that this particular failure was never reported, which
implies the failure happened as part of a build abort.

The other thing we can do to convince ourselves that this problem is
mostly poor reporting of job aborts is to restrict our logstash search to
build_queue:check. Only the gate queue aborts jobs in this way, so
occurrences in the check queue would indicate an actual problem. If we do
that, we see a bunch of hudson.remoting.RequestAbortedException errors,
which are also aborts that were not handled properly; since Zuul shouldn't
abort jobs in the check queue, these were probably the result of a human
aborting jobs after a Zuul restart.
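
For example, taking the exception string from the links above, something
like:

    "java.io.InterruptedIOException" AND build_queue:check

limits the search to check-queue runs, where Zuul-triggered aborts should
not be a factor.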

TL;DR: I believe this is mostly a non-issue and has to do with Zuul and
Jenkins quirks. If you see this error reported to Gerrit, we should do
more digging.

Clark



Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Sean Dague
On 01/24/2014 02:02 PM, Peter Portante wrote:
 Hi Sean,
 
 In the last 7 days I see only 6 python27 based test
 failures: 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNzogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk2Mjk0MDR9
 
 And 4 python26 based test
 failures: 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNjogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk1MzAzNTd9
 
 Maybe the query you posted captures failures where the job did not even run?
 
 And only 15 hits (well, 18, but three are within the same job, and some
 of the tests are run twice, so it is a combined of 10
 hits): 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRkFJTDpcIiBhbmQgbWVzc2FnZTpcInRlc3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4OTg1NTAzMX0=
 
 
 Thanks,

So it is true that the Interrupted exceptions (which occur when a job is
killed because of a reset) are sometimes turned into Fail events by the
system. That is one of the reasons the graphite data for failures is
incorrect; if you use just the graphite sourcing for fails, your numbers
will be overly pessimistic.

The following are probably better lists:
 - http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26 (7 uncategorized fails)
 - http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27 (5 uncategorized fails)

-Sean

-- 
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net





Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Joe Gordon
On Fri, Jan 24, 2014 at 10:37 AM, Clay Gerrard clay.gerr...@gmail.com wrote:



 That's a pretty high rate of failure, and really needs investigation.


 That's a great point, did you look into the logs of any of those jobs?
  Thanks for bringing it to my attention.


 I saw a few swift tests that would pop, I'll open bugs to look into those.
  But the cardinality of the failures (7) was dwarfed by jenkins failures I
 don't quite understand.


Here are all the unclassified swift unit test failures.

http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26
http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27



 [EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
 (e.g.
 http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html
 )

 FATAL: command execution failed | java.io.InterruptedIOException (e.g.
 http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html
 )

 These jobs are blowing up setting up the workspace on the slave, and we're
 not automatically retrying them?  How can this only be effecting swift?


https://bugs.launchpad.net/openstack-ci/+bug/1270309
https://review.openstack.org/#/c/67594/



 -Clay





Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Clay Gerrard
OH yeah that's much better.  I had found those eventually but had to dig
through all that other stuff :'(

Moving forward I think we can keep an eye on that page, open bugs for the
tests causing issues, and dig in.

Thanks again!

-Clay


On Fri, Jan 24, 2014 at 11:37 AM, Sean Dague s...@dague.net wrote:

 On 01/24/2014 02:02 PM, Peter Portante wrote:
  Hi Sean,
 
  In the last 7 days I see only 6 python27 based test
  failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNzogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk2Mjk0MDR9
 
  And 4 python26 based test
  failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNjogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk1MzAzNTd9
 
  Maybe the query you posted captures failures where the job did not even
 run?
 
  And only 15 hits (well, 18, but three are within the same job, and some
  of the tests are run twice, so it is a combined of 10
  hits):
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRkFJTDpcIiBhbmQgbWVzc2FnZTpcInRlc3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4OTg1NTAzMX0=
 
 
  Thanks,

 So it is true, that the Interupted exceptions (which is when a job is
 killed because of a reset) are some times being turned into Fail events
 by the system, which is one of the reasons the graphite data for
 failures is incorrect, and if you use just the graphite sourcing for
 fails, your numbers will be overly pessimistic.

 The following is probably better lists
  -

 http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26
 (7 uncategorized fails)
  -

 http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27
 (5 uncategorized fails)

 -Sean

 --
 Sean Dague
 Samsung Research America
 s...@dague.net / sean.da...@samsung.com
 http://dague.net






Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Salvatore Orlando
I've found that several jobs are exhibiting failures like bug 1254890
[1] and bug 1253896 [2] because openvswitch seems to be crashing the kernel.
The kernel trace usually reports either neutron-ns-metadata-proxy or
dnsmasq as the offending process, but [3] seems to clearly point to
ovs-vsctl.
254 events observed in the previous 6 days show a similar trace in the
logs [4].
This means that while this alone won't explain all the failures observed,
it is potentially one of the prominent root causes.

From the logs I have few hints about which kernel is running. It seems
there has been no update in the past 7 days, but I can't be sure.
Openvswitch builds are updated periodically. The last build I found not to
trigger failures was the one generated on 2014/01/16 at 01:58:18.
Unfortunately, version-wise I only ever see 1.4.0, with no build number.

I don't know if this will require getting in touch with Ubuntu, or if we
can just prep a different image with an OVS build known to work without
problems.

Salvatore

[1] https://bugs.launchpad.net/neutron/+bug/1254890
[2] https://bugs.launchpad.net/neutron/+bug/1253896
[3] http://paste.openstack.org/show/61869/
[4] kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917 and
filename:syslog.txt


On 24 January 2014 21:13, Clay Gerrard clay.gerr...@gmail.com wrote:

 OH yeah that's much better.  I had found those eventually but had to dig
 through all that other stuff :'(

 Moving forward I think we can keep an eye on that page, open bugs for
 those tests causing issue and dig in.

 Thanks again!

 -Clay


 On Fri, Jan 24, 2014 at 11:37 AM, Sean Dague s...@dague.net wrote:

 On 01/24/2014 02:02 PM, Peter Portante wrote:
  Hi Sean,
 
  In the last 7 days I see only 6 python27 based test
  failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNzogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk2Mjk0MDR9
 
  And 4 python26 based test
  failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNjogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk1MzAzNTd9
 
  Maybe the query you posted captures failures where the job did not even
 run?
 
  And only 15 hits (well, 18, but three are within the same job, and some
  of the tests are run twice, so it is a combined of 10
  hits):
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRkFJTDpcIiBhbmQgbWVzc2FnZTpcInRlc3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4OTg1NTAzMX0=
 
 
  Thanks,

 So it is true, that the Interupted exceptions (which is when a job is
 killed because of a reset) are some times being turned into Fail events
 by the system, which is one of the reasons the graphite data for
 failures is incorrect, and if you use just the graphite sourcing for
 fails, your numbers will be overly pessimistic.

 The following is probably better lists
  -

 http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26
 (7 uncategorized fails)
  -

 http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27
 (5 uncategorized fails)

 -Sean

 --
 Sean Dague
 Samsung Research America
 s...@dague.net / sean.da...@samsung.com
 http://dague.net









Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Joe Gordon
On Fri, Jan 24, 2014 at 6:57 PM, Salvatore Orlando sorla...@nicira.com wrote:

 I've found out that several jobs are exhibiting failures like bug 1254890
 [1] and bug 1253896 [2] because openvswitch seem to be crashing the kernel.
 The kernel trace reports as offending process usually either
 neutron-ns-metadata-proxy or dnsmasq, but [3] seem to clearly point to
 ovs-vsctl.
 254 events observed in the previous 6 days show a similar trace in the
 logs [4].
 This means that while this alone won't explain all the failures observed,
 it is however potentially one of the prominent root causes.

 From the logs I have little hints about the kernel running. It seems there
 has been no update in the past 7 days, but I can't be sure.
 Openvswitch builds are updated periodically. The last build I found not to
 trigger failures was the one generated on 2014/01/16 at 01:58:18.
 Unfortunately version-wise I always have only 1.4.0, no build number.

 I don't know if this will require getting in touch with ubuntu, or if we
 can just prep a different image which an OVS build known to work without
 problems.

 Salvatore

 [1] https://bugs.launchpad.net/neutron/+bug/1254890
 [2] https://bugs.launchpad.net/neutron/+bug/1253896
 [3] http://paste.openstack.org/show/61869/
 [4] kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917 and
 filename:syslog.txt


Do you want to track this as a separate bug and e-r fingerprint? It will
overlap with the other two bugs but will give us good numbers on
status.openstack.org/elastic-recheck/
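
(For reference, an e-r fingerprint is essentially just a logstash query
checked into the elastic-recheck repo. A sketch of what one for this
signature could look like, reusing the query from [4] above; the file
name and exact layout here are placeholders:)

    # queries/<bug-number>.yaml  (hypothetical path)
    query: >
      message:"kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917" AND
      filename:"syslog.txt"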



 On 24 January 2014 21:13, Clay Gerrard clay.gerr...@gmail.com wrote:

 OH yeah that's much better.  I had found those eventually but had to dig
 through all that other stuff :'(

 Moving forward I think we can keep an eye on that page, open bugs for
 those tests causing issue and dig in.

 Thanks again!

 -Clay


 On Fri, Jan 24, 2014 at 11:37 AM, Sean Dague s...@dague.net wrote:

 On 01/24/2014 02:02 PM, Peter Portante wrote:
  Hi Sean,
 
  In the last 7 days I see only 6 python27 based test
  failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNzogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk2Mjk0MDR9
 
  And 4 python26 based test
  failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNjogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk1MzAzNTd9
 
  Maybe the query you posted captures failures where the job did not
 even run?
 
  And only 15 hits (well, 18, but three are within the same job, and some
  of the tests are run twice, so it is a combined of 10
  hits):
 http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRkFJTDpcIiBhbmQgbWVzc2FnZTpcInRlc3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4OTg1NTAzMX0=
 
 
  Thanks,

 So it is true, that the Interupted exceptions (which is when a job is
 killed because of a reset) are some times being turned into Fail events
 by the system, which is one of the reasons the graphite data for
 failures is incorrect, and if you use just the graphite sourcing for
 fails, your numbers will be overly pessimistic.

 The following is probably better lists
  -

 http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26
 (7 uncategorized fails)
  -

 http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27
 (5 uncategorized fails)

 -Sean

 --
 Sean Dague
 Samsung Research America
 s...@dague.net / sean.da...@samsung.com
 http://dague.net












Re: [openstack-dev] Gate Status - Friday Edition

2014-01-24 Thread Chris Wright
* Salvatore Orlando (sorla...@nicira.com) wrote:
 I've found out that several jobs are exhibiting failures like bug 1254890
 [1] and bug 1253896 [2] because openvswitch seem to be crashing the kernel.
 The kernel trace reports as offending process usually either
 neutron-ns-metadata-proxy or dnsmasq, but [3] seem to clearly point to
 ovs-vsctl.

Hmm, that actually shows dnsmasq is the running/exiting process.
The ovs-vsctl was run nearly a half-second earlier.  Looks like
ovs-vsctl successfully added the tap device (assuming it's for
dnsmasq?).  And dnsmasq is exiting upon receiving a signal.  Shot in
the dark: has the neutron path that would end up killing dnsmasq
(Dnsmasq::reload_allocations()) changed recently?  I didn't see much.

 254 events observed in the previous 6 days show a similar trace in the logs
 [4].

That kernel (3.2.0) is over a year old.  And there have been some network
namespace fixes since then (IIRC, refcounting related).

 This means that while this alone won't explain all the failures observed,
 it is however potentially one of the prominent root causes.
 
 From the logs I have little hints about the kernel running. It seems there
 has been no update in the past 7 days, but I can't be sure.
 Openvswitch builds are updated periodically. The last build I found not to
 trigger failures was the one generated on 2014/01/16 at 01:58:18.
 Unfortunately version-wise I always have only 1.4.0, no build number.
 
 I don't know if this will require getting in touch with ubuntu, or if we
 can just prep a different image which an OVS build known to work without
 problems.
 
 Salvatore
 
 [1] https://bugs.launchpad.net/neutron/+bug/1254890
 [2] https://bugs.launchpad.net/neutron/+bug/1253896
 [3] http://paste.openstack.org/show/61869/
 [4] kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917 and
 filename:syslog.txt
