Re: [openstack-dev] Gate Status - Friday Edition
Hi Sean,

Given that the swift failure has happened only once in the available logstash recorded history, do we still feel this is a major gate issue? See:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRkFJTDogdGVzdF9ub2RlX3dyaXRlX3RpbWVvdXRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiYWxsIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4MDExNzgwMX0=

Thanks,

-peter
Re: [openstack-dev] Gate Status - Friday Edition
On 01/24/2014 11:18 AM, Peter Portante wrote:
> Hi Sean,
>
> Given that the swift failure has happened only once in the available logstash recorded history, do we still feel this is a major gate issue?

In the last 7 days Swift unit tests have failed 50 times in the gate queue:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRmluaXNoZWQ6IEZBSUxVUkVcIiBBTkQgcHJvamVjdDpcIm9wZW5zdGFjay9zd2lmdFwiIEFORCBidWlsZF9xdWV1ZTpnYXRlIEFORCBidWlsZF9uYW1lOmdhdGUtc3dpZnQtcHl0aG9uKiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4NTEwNzY1M30=

That's a pretty high rate of failure, and it really needs investigation. Unit tests should never be failing in the gate, for any project. Russell did a great job sorting out some bad tests in Nova over the last couple of days, and it would be good for other projects seeing similar issues to do the same.

-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
Re: [openstack-dev] Gate Status - Friday Edition
> That's a pretty high rate of failure, and it really needs investigation.

That's a great point, did you look into the logs of any of those jobs? Thanks for bringing it to my attention.

I saw a few swift tests that would pop; I'll open bugs to look into those. But the cardinality of the failures (7) was dwarfed by jenkins failures I don't quite understand:

[EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
(e.g. http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html)

FATAL: command execution failed | java.io.InterruptedIOException
(e.g. http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html)

These jobs are blowing up setting up the workspace on the slave, and we're not automatically retrying them? How can this only be affecting swift?

-Clay
Re: [openstack-dev] Gate Status - Friday Edition
On Fri, Jan 24, 2014 at 11:37 AM, Clay Gerrard clay.gerr...@gmail.com wrote:
> These jobs are blowing up setting up the workspace on the slave, and we're not automatically retrying them? How can this only be affecting swift?

It's certainly not just swift:

http://logstash.openstack.org/#eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwNTg5MTg4NjY5fQ==
Re: [openstack-dev] Gate Status - Friday Edition
Hi Sean,

In the last 7 days I see only 6 python27-based test failures:

http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNzogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk2Mjk0MDR9

And 4 python26-based test failures:

http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRVJST1I6ICAgcHkyNjogY29tbWFuZHMgZmFpbGVkXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTA1ODk1MzAzNTd9

Maybe the query you posted captures failures where the job did not even run?

And there are only 15 hits for failing tests (well, 18, but three are within the same job, and some of the tests are run twice, so it comes to a combined 10 hits):

http://logstash.openstack.org/#eyJzZWFyY2giOiJwcm9qZWN0Olwib3BlbnN0YWNrL3N3aWZ0XCIgQU5EIGJ1aWxkX3F1ZXVlOmdhdGUgQU5EIGJ1aWxkX25hbWU6Z2F0ZS1zd2lmdC1weXRob24qIEFORCBtZXNzYWdlOlwiRkFJTDpcIiBhbmQgbWVzc2FnZTpcInRlc3RcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDU4OTg1NTAzMX0=

Thanks,

-peter
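For reference, decoding the base64 fragments in those logstash URLs suggests the two kinds of searches differ roughly as follows (Lucene-style query syntax as used by the logstash UI; treat the exact strings as approximate):

    message:"Finished: FAILURE" AND project:"openstack/swift" AND build_queue:gate AND build_name:gate-swift-python*

versus

    project:"openstack/swift" AND build_queue:gate AND build_name:gate-swift-python* AND message:"ERROR:   py27: commands failed"
    project:"openstack/swift" AND build_queue:gate AND build_name:gate-swift-python* AND message:"ERROR:   py26: commands failed"

The first matches any build that finished with FAILURE, while the latter two match only builds where the tox run itself reported failing commands.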
Re: [openstack-dev] Gate Status - Friday Edition
On Fri, Jan 24, 2014, at 10:51 AM, John Griffith wrote:
> It's certainly not just swift:
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwNTg5MTg4NjY5fQ==

This isn't all doom and gloom, but rather an unfortunate side effect of how Jenkins aborts jobs. When a job is aborted, there are corner cases where Jenkins does not catch all of the exceptions that may happen, and that results in reporting the build as a failure instead of an abort. Now all of this would be fine if we never aborted jobs, but it turns out Zuul aggressively aborts jobs when it knows the result of that job will not help anything (either the ability to merge, or useful results to report back to code reviewers). I have a hunch (but would need to do a bunch of digging to confirm it) that most of these errors are simply job aborts that happened in ways Jenkins couldn't recover from gracefully.

Looking at the most recent occurrence of this particular failure, we see https://review.openstack.org/#/c/66307 failed gate-tempest-dsvm-neutron-large-ops. If we go to the comments on the change, we see that this particular failure was never reported, which implies the failure happened as part of a build abort.

The other thing we can do to convince ourselves that this problem is mostly poor reporting of job aborts is restricting our logstash search to build_queue:check. Only the gate queue aborts jobs in this way, so occurrences in the check queue would indicate an actual problem. If we do that, we see a bunch of hudson.remoting.RequestAbortedException, which are also aborts not handled properly; since Zuul shouldn't abort jobs in the check queue, those were probably the result of some human aborting jobs after a Zuul restart.

TL;DR: I believe this is mostly a non-issue and has to do with Zuul and Jenkins quirks. If you see this error reported to Gerrit we should do more digging.

Clark
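In other words, the comparison being suggested is roughly the following pair of logstash searches (a sketch, not the exact saved queries):

    "java.io.InterruptedIOException" AND build_queue:gate
    "java.io.InterruptedIOException" AND build_queue:check

The gate-queue variant is expected to be noisy because of aborted jobs; any significant number of hits on the check-queue variant would point at a real problem rather than abort fallout.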
Re: [openstack-dev] Gate Status - Friday Edition
On 01/24/2014 02:02 PM, Peter Portante wrote:
> In the last 7 days I see only 6 python27-based test failures, and 4 python26-based test failures.
>
> Maybe the query you posted captures failures where the job did not even run?

So it is true that the Interrupted exceptions (which occur when a job is killed because of a reset) are sometimes being turned into Fail events by the system. This is one of the reasons the graphite data for failures is incorrect; if you use just the graphite sourcing for fails, your numbers will be overly pessimistic.

The following are probably better lists:

- http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26 (7 uncategorized fails)
- http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27 (5 uncategorized fails)

-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
Re: [openstack-dev] Gate Status - Friday Edition
On Fri, Jan 24, 2014 at 10:37 AM, Clay Gerrard clay.gerr...@gmail.com wrote:
> I saw a few swift tests that would pop; I'll open bugs to look into those. But the cardinality of the failures (7) was dwarfed by jenkins failures I don't quite understand.

Here are all the unclassified swift unit test failures:

http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26
http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27

> These jobs are blowing up setting up the workspace on the slave, and we're not automatically retrying them? How can this only be affecting swift?

https://bugs.launchpad.net/openstack-ci/+bug/1270309
https://review.openstack.org/#/c/67594/
Re: [openstack-dev] Gate Status - Friday Edition
OH yeah, that's much better. I had found those eventually but had to dig through all that other stuff :'(

Moving forward I think we can keep an eye on that page, open bugs for the tests causing issues, and dig in.

Thanks again!

-Clay

On Fri, Jan 24, 2014 at 11:37 AM, Sean Dague s...@dague.net wrote:
> The following are probably better lists:
>
> - http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python26 (7 uncategorized fails)
> - http://status.openstack.org/elastic-recheck/data/uncategorized.html#gate-swift-python27 (5 uncategorized fails)
Re: [openstack-dev] Gate Status - Friday Edition
I've found out that several jobs are exhibiting failures like bug 1254890 [1] and bug 1253896 [2] because openvswitch seems to be crashing the kernel. The kernel trace usually reports the offending process as either neutron-ns-metadata-proxy or dnsmasq, but [3] seems to point clearly to ovs-vsctl.

254 events observed in the previous 6 days show a similar trace in the logs [4]. This means that while this alone won't explain all the failures observed, it is potentially one of the prominent root causes.

From the logs I have few hints about the kernel being run. It seems there has been no update in the past 7 days, but I can't be sure. Openvswitch builds are updated periodically; the last build I found that did not trigger failures was the one generated on 2014/01/16 at 01:58:18. Unfortunately, version-wise I always see only 1.4.0, with no build number.

I don't know if this will require getting in touch with Ubuntu, or if we can just prep a different image with an OVS build known to work without problems.

Salvatore

[1] https://bugs.launchpad.net/neutron/+bug/1254890
[2] https://bugs.launchpad.net/neutron/+bug/1253896
[3] http://paste.openstack.org/show/61869/
[4] kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917 and filename:syslog.txt

On 24 January 2014 21:13, Clay Gerrard clay.gerr...@gmail.com wrote:
> OH yeah, that's much better. I had found those eventually but had to dig through all that other stuff :'(
>
> Moving forward I think we can keep an eye on that page, open bugs for the tests causing issues, and dig in.
Re: [openstack-dev] Gate Status - Friday Edition
On Fri, Jan 24, 2014 at 6:57 PM, Salvatore Orlando sorla...@nicira.com wrote:
> I've found out that several jobs are exhibiting failures like bug 1254890 [1] and bug 1253896 [2] because openvswitch seems to be crashing the kernel.
>
> 254 events observed in the previous 6 days show a similar trace in the logs [4]. This means that while this alone won't explain all the failures observed, it is potentially one of the prominent root causes.

Do you want to track this as a separate bug and e-r fingerprint? It will overlap with the other two bugs, but it will give us good numbers on status.openstack.org/elastic-recheck/
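If we go that route, the fingerprint would presumably be built from the query in [4] above, along the lines of the following elastic-recheck entry. This is only a sketch: the bug number shown is one of the existing bugs (a dedicated bug may be preferable), and the exact format and layout of the elastic-recheck query definitions may differ.

    - bug: 1254890
      query: >
        message:"kernel BUG at /build/buildd/linux-3.2.0/fs/buffer.c:2917"
        AND filename:"syslog.txt"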
Re: [openstack-dev] Gate Status - Friday Edition
* Salvatore Orlando (sorla...@nicira.com) wrote:
> I've found out that several jobs are exhibiting failures like bug 1254890 [1] and bug 1253896 [2] because openvswitch seems to be crashing the kernel. The kernel trace usually reports the offending process as either neutron-ns-metadata-proxy or dnsmasq, but [3] seems to point clearly to ovs-vsctl.

Hmm, that actually shows dnsmasq is the running/exiting process. The ovs-vsctl was run nearly a half-second earlier. It looks like ovs-vsctl successfully added the tap device (assuming it's for dnsmasq?), and dnsmasq is exiting upon receiving a signal.

Shot in the dark: has the neutron path that would end up killing dnsmasq (Dnsmasq::reload_allocations()) changed recently? I didn't see much.

> 254 events observed in the previous 6 days show a similar trace in the logs [4].

That kernel (3.2.0) is over a year old, and there have been some network namespace fixes since then (IIRC, refcounting related).