Re: [openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?
On 06/05/14 23:55, Jeremy Stanley wrote:
> On 2014-05-06 15:52:04 +0100 (+0100), Derek Higgins wrote:
> [...]
>> The job simply got restarted and this kept happening until the job
>> passed.
>>
>> A legitimately failed job:
>> https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
> [...]
>
> If the job fails in such a way that it impacts communication between
> the slave and the Jenkins master, or tanks the slave so badly that it
> ceases responding entirely, Jenkins often does not report a build
> completion status. Because this happens rather unfortunately often due
> to the nature of connectivity in service providers and due to bugs in
> Jenkins, Zuul assumes it should automatically reattempt any job which
> ceases running without explanation.
>
> Perhaps one option would be to keep a retry counter and not reattempt
> a job which fails in this manner more than once or twice...?

It won't catch all cases, but it sounds like a good idea to me. If somebody familiar with the Zuul code can do it quickly, great; otherwise I can try to make myself familiar with it.

Derek.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?
Hi,

I've been working on a check job that uses devstack-gate jobs to run nova with the Docker driver. While doing this, I noticed that sometimes, during the nova boot for an instance, the node loses network connectivity (obviously a problem that needs to be worked on).

What's interesting is Zuul's behavior when this occurs in the check queue. The job simply got restarted, and this kept happening until the job passed.

A legitimately failed job:
https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html

Retry (also failed):
https://jenkins07.openstack.org/job/check-nova-docker-dsvm-f20/3/
http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5f26ed/console.html

Retried again (passed):
https://jenkins01.openstack.org/job/check-nova-docker-dsvm-f20/3/
http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/2ebfa88/console.html

And success gets reported back to Gerrit:
https://review.openstack.org/#/c/91514/

Patch Set 5: Verified+1
check-nova-docker-dsvm-f20 SUCCESS in 17m 27s (non-voting)

Wouldn't this behavior allow commits that cause intermittent network problems to sneak past the gating infrastructure more easily?

I'm guessing that the retry is being triggered in zuul/launcher/gearman.py : onBuildCompleted(), because onDisconnect calls onBuildCompleted with no results param.

Any thoughts?

thanks,
Derek.
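For readers following along, here is a rough sketch of the mechanism guessed at above. It is not Zuul's actual code: the Build and Launcher classes, the snake_case method names, and the "RETRY" marker are all made up for illustration. It only shows how a disconnect handler that calls the completion handler without a result could end up relaunching a job that really failed.

```python
from typing import List, Optional


class Build:
    """Hypothetical stand-in for a build record (not Zuul's real class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.result: Optional[str] = None
        self.attempts = 1


class Launcher:
    """Hypothetical launcher illustrating the suspected retry path."""

    def __init__(self) -> None:
        self.relaunched: List[str] = []

    def on_build_completed(self, build: Build,
                           result: Optional[str] = None) -> str:
        if result is None:
            # No result was reported, so the failure is assumed to be
            # the infrastructure's fault and the job is run again.
            build.attempts += 1
            self.relaunched.append(build.name)
            return "RETRY"
        build.result = result
        return result

    def on_disconnect(self, build: Build) -> str:
        # The worker went away: completion is signalled with no result,
        # which the handler above cannot tell apart from a provider outage.
        return self.on_build_completed(build)
```

In this sketch a node that loses its network because of the change under test looks exactly like a provider outage, so the job is relaunched with no limit until it happens to pass.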
Re: [openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?
On 06/05/14 16:17, Sean Dague wrote:
> On 05/06/2014 10:52 AM, Derek Higgins wrote:
>> Hi,
>>
>> I've been working on a check job that uses devstack-gate jobs to run
>> nova with the Docker driver. While doing this, I noticed that
>> sometimes, during the nova boot for an instance, the node loses
>> network connectivity (obviously a problem that needs to be worked on).
>>
>> What's interesting is Zuul's behavior when this occurs in the check
>> queue. The job simply got restarted, and this kept happening until the
>> job passed.
>>
>> A legitimately failed job:
>> https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
>>
>> Retry (also failed):
>> https://jenkins07.openstack.org/job/check-nova-docker-dsvm-f20/3/
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5f26ed/console.html
>>
>> Retried again (passed):
>> https://jenkins01.openstack.org/job/check-nova-docker-dsvm-f20/3/
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/2ebfa88/console.html
>>
>> And success gets reported back to Gerrit:
>> https://review.openstack.org/#/c/91514/
>>
>> Patch Set 5: Verified+1
>> check-nova-docker-dsvm-f20 SUCCESS in 17m 27s (non-voting)
>>
>> Wouldn't this behavior allow commits that cause intermittent network
>> problems to sneak past the gating infrastructure more easily?
>>
>> I'm guessing that the retry is being triggered in
>> zuul/launcher/gearman.py : onBuildCompleted(), because onDisconnect
>> calls onBuildCompleted with no results param.
>>
>> Any thoughts?
>
> There is some automatic retry facility in Zuul right now to deal with
> a set of issues which are considered recoverable and typically the
> fault of the infrastructure provider.
>
> There might be a way to slip something through; however, all failures
> in the gate do tend to get eyes on them, and I've yet to see this kind
> of issue slip through. So something to keep an eye out for.

Hasn't this problem already slipped through (although it's in the check queue, not the gate)? I mean, it can now be merged, and it was only noticed because I was watching the Zuul status page while the jobs were running.

> Would be curious to see if we can mine out these issues in elastic
> recheck.

The failed results are still reported to logstash from what I can see, so we can track them. I'll see if I can find any similar occurrences in other jobs and report back.

> -Sean
Re: [openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?
On 2014-05-06 15:52:04 +0100 (+0100), Derek Higgins wrote:
[...]
> The job simply got restarted and this kept happening until the job
> passed.
>
> A legitimately failed job:
> https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
[...]

If the job fails in such a way that it impacts communication between the slave and the Jenkins master, or tanks the slave so badly that it ceases responding entirely, Jenkins often does not report a build completion status. Because this happens rather unfortunately often due to the nature of connectivity in service providers and due to bugs in Jenkins, Zuul assumes it should automatically reattempt any job which ceases running without explanation.

Perhaps one option would be to keep a retry counter and not reattempt a job which fails in this manner more than once or twice...?

-- 
Jeremy Stanley
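A minimal sketch of the retry-counter idea, assuming nothing about how Zuul is actually structured: RetryTracker, resolve(), the (change, job) key, and the limit of two retries are all hypothetical names and values chosen for illustration. A build that ends with no reported result is retried until the per-job counter hits the limit, after which it is recorded as a failure instead of being relaunched again.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Optional

MAX_RETRIES = 2  # assumption: "once or twice", per the suggestion above


class RetryTracker:
    """Hypothetical per-job counter for builds that vanish without a result."""

    def __init__(self, max_retries: int = MAX_RETRIES) -> None:
        self.max_retries = max_retries
        self._retries: DefaultDict[Hashable, int] = defaultdict(int)

    def should_retry(self, job_key: Hashable) -> bool:
        # Refuse once the limit has been reached; otherwise count this retry.
        if self._retries[job_key] >= self.max_retries:
            return False
        self._retries[job_key] += 1
        return True


def resolve(tracker: RetryTracker, job_key: Hashable,
            result: Optional[str]) -> str:
    """Map a build outcome to what gets reported or done next."""
    if result is not None:
        return result  # a real result is passed through untouched
    # No result: retry while under the limit, then give up and fail.
    return "RETRY" if tracker.should_retry(job_key) else "FAILURE"


tracker = RetryTracker()
key = ("91514,5", "check-nova-docker-dsvm-f20")
outcomes = [resolve(tracker, key, None) for _ in range(3)]
print(outcomes)
```

As noted elsewhere in the thread, this would not catch every case: a change that only drops the network now and then could still pass on its first or second retry.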