Re: [openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?

2014-05-07 Thread Derek Higgins
On 06/05/14 23:55, Jeremy Stanley wrote:
 On 2014-05-06 15:52:04 +0100 (+0100), Derek Higgins wrote:
 [...]
 The job simply got restarted and this kept happening until the job passed.

 A legitimately failed job :
   https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/

 http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
 [...]
 
 If the job fails in such a way that it impacts communication between
 the slave and the Jenkins master, or tanks the slave so badly that
 it ceases responding entirely, Jenkins often does not report a build
 completion status. Because this happens rather unfortunately often
 due to the nature of connectivity in service providers and due to
 bugs in Jenkins, Zuul assumes it should automatically reattempt any
 job which ceases running without explanation.
 
 Perhaps one option would be to keep a retry counter and not
 reattempt a job which fails in this manner more than once or
 twice...?

It won't catch all cases, but it sounds like a good idea to me. If
somebody familiar with the Zuul code can do it quickly, great;
otherwise I can try to get familiar with it myself.
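
For discussion, here is a rough, stand-alone sketch of what such a retry
counter could look like. It is not Zuul's actual code: MAX_RETRIES, the
Build class, the on_build_completed function and the relaunch callback are
all hypothetical names, and reporting "FAILURE" once the limit is reached
is an assumption (a distinct status might be preferable).

```python
MAX_RETRIES = 2  # assumed limit, following the "once or twice" suggestion


class Build:
    """Hypothetical build record; not Zuul's real Build class."""

    def __init__(self, job_name):
        self.job_name = job_name
        self.result = None
        self.retries = 0


def on_build_completed(build, result=None, relaunch=lambda b: None):
    """Record a result, or retry a result-less build a bounded number of times."""
    if result is not None:
        # The worker reported a real result: keep it.
        build.result = result
        return build.result
    if build.retries < MAX_RETRIES:
        # No result (e.g. lost connection): retry, but count the attempt.
        build.retries += 1
        relaunch(build)
        return None
    # Retry budget exhausted: stop retrying and report a failure.
    build.result = "FAILURE"  # assumed status, see note above
    return build.result


# Usage: three disconnects in a row no longer loop forever.
relaunches = []
build = Build("check-nova-docker-dsvm-f20")
for _ in range(3):
    outcome = on_build_completed(build, relaunch=relaunches.append)
print(build.retries, len(relaunches), outcome)
```

With a limit of 2, the third result-less completion is reported as a
failure instead of triggering another run.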

Derek.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?

2014-05-06 Thread Derek Higgins
Hi,

I've been working on a check job that uses devstack-gate jobs to run
nova with the docker driver. While doing this I noticed that
sometimes, during the nova boot for an instance, the node loses network
connectivity (obviously a problem that needs to be worked on).
What's interesting is Zuul's behavior when this occurs in the check queue.
The job simply got restarted and this kept happening until the job passed.

A legitimately failed job :
  https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/

http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html

Retry (also failed)  :
  https://jenkins07.openstack.org/job/check-nova-docker-dsvm-f20/3/

http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5f26ed/console.html

Retried again (passed)   :
  https://jenkins01.openstack.org/job/check-nova-docker-dsvm-f20/3/

http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/2ebfa88/console.html

And success gets reported back to Gerrit
https://review.openstack.org/#/c/91514/
Patch Set 5: Verified+1
check-nova-docker-dsvm-f20 SUCCESS in 17m 27s (non-voting)


Wouldn't this behavior allow commits that cause intermittent network
problems to more easily sneak past the gating infrastructure?


I'm guessing that the retry is being triggered in
zuul/launcher/gearman.py : onBuildCompleted()

because onDisconnect calls onBuildCompleted with no results param
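
To illustrate the suspected path, here is a small stand-alone sketch. It
is not Zuul's actual code: the SketchLauncher and Build classes, the
launch helper and the attempts attribute are hypothetical; only the method
names onDisconnect and onBuildCompleted come from the guess above. It
shows how a completion handler that treats a missing result as "run it
again" keeps retrying until some attempt reports a result.

```python
class Build:
    """Hypothetical build record; not Zuul's real Build class."""

    def __init__(self, job_name):
        self.job_name = job_name
        self.result = None
        self.attempts = 0


class SketchLauncher:
    """Hypothetical stand-in for a launcher; not Zuul's real class."""

    def launch(self, build):
        # Start (or restart) the job and count the attempt.
        build.attempts += 1

    def onBuildCompleted(self, build, result=None):
        if result is None:
            # No result reported: assume an infrastructure problem and rerun.
            self.launch(build)
            return
        build.result = result

    def onDisconnect(self, build):
        # The worker vanished, so completion is signalled with no result.
        self.onBuildCompleted(build)


launcher = SketchLauncher()
build = Build("check-nova-docker-dsvm-f20")
launcher.launch(build)                        # first attempt
launcher.onDisconnect(build)                  # node lost network: rerun
launcher.onDisconnect(build)                  # lost again: rerun
launcher.onBuildCompleted(build, "SUCCESS")   # third attempt reports a result
print(build.attempts, build.result)           # prints "3 SUCCESS"
```

Nothing in this sketch remembers the two lost attempts, which matches what
was seen on patch set 5: only the final SUCCESS is reported.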

Any thoughts?

thanks,
Derek.



Re: [openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?

2014-05-06 Thread Derek Higgins
On 06/05/14 16:17, Sean Dague wrote:
 On 05/06/2014 10:52 AM, Derek Higgins wrote:
 Hi,

 I've been working on a check job that uses devstack-gate jobs to run
 nova with the docker driver. While doing this I noticed that
 sometimes, during the nova boot for an instance, the node loses network
 connectivity (obviously a problem that needs to be worked on).
 What's interesting is Zuul's behavior when this occurs in the check queue.
 The job simply got restarted and this kept happening until the job passed.

 A legitimately failed job :
   https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/

 http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html

 Retry (also failed)  :
   https://jenkins07.openstack.org/job/check-nova-docker-dsvm-f20/3/

 http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5f26ed/console.html

 Retried again (passed)   :
   https://jenkins01.openstack.org/job/check-nova-docker-dsvm-f20/3/

 http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/2ebfa88/console.html

 And success gets reported back to Gerrit
 https://review.openstack.org/#/c/91514/
 Patch Set 5: Verified+1
 check-nova-docker-dsvm-f20 SUCCESS in 17m 27s (non-voting)


 Wouldn't this behavior allow commits that cause intermittent network
 problems to more easily sneak past the gating infrastructure?


 I'm guessing that the retry is being triggered in
 zuul/launcher/gearman.py : onBuildCompleted()

 because onDisconnect calls onBuildCompleted with no results param

 Any thoughts?
 
 There is some automatic retry facility in zuul right now to deal with a
 set of issues which are considered recoverable and typically the fault
 of the infrastructure provider.
 
 There might be a way to slip something through; however, all failures in
 the gate do tend to get eyes on them, and I've yet to see this kind of
 issue slip through. So something to keep an eye out for.

Hasn't this problem already slipped through (although it's in the check
queue, not the gate)? I mean, it can now be merged, and it was only noticed
because I was watching the Zuul status page while the jobs were running.

 Would be curious to see if we can mine out these issues in elastic
 recheck. The failed results are still reported to logstash from what I
 can see, so we can track them.

I'll see if I can find any similar occurrences in other jobs and report
back.
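
As a starting point for that search, here is a rough sketch of the kind of
check that could be run over exported build records. The record layout
(change, patchset, job, order, result) and the find_masked_failures
function are hypothetical, not a real logstash or elastic-recheck schema,
and the sample records are hand-written for illustration: the first three
mirror the runs linked in this thread, the last one is made up.

```python
from collections import defaultdict


def find_masked_failures(builds):
    """Return (change, patchset, job) keys where a success followed a non-success."""
    runs = defaultdict(list)
    # Group results per change/patchset/job in the order the runs happened.
    for b in sorted(builds, key=lambda b: b["order"]):
        runs[(b["change"], b["patchset"], b["job"])].append(b["result"])
    masked = []
    for key, results in runs.items():
        earlier_failed = any(r != "SUCCESS" for r in results[:-1])
        if results[-1] == "SUCCESS" and earlier_failed:
            masked.append(key)
    return masked


job = "check-nova-docker-dsvm-f20"
builds = [
    {"change": "91514", "patchset": 5, "job": job, "order": 1, "result": "FAILURE"},
    {"change": "91514", "patchset": 5, "job": job, "order": 2, "result": "FAILURE"},
    {"change": "91514", "patchset": 5, "job": job, "order": 3, "result": "SUCCESS"},
    {"change": "hypothetical-change", "patchset": 1, "job": "hypothetical-job",
     "order": 1, "result": "SUCCESS"},
]
masked = find_masked_failures(builds)
print(masked)
```

Only patch sets whose final run passed after an earlier run did not are
listed, so a clean first-time pass is ignored.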

 
   -Sean
 
 
 
 




Re: [openstack-dev] [infra] Intermittent network problems allowed to sneak passed the gate?

2014-05-06 Thread Jeremy Stanley
On 2014-05-06 15:52:04 +0100 (+0100), Derek Higgins wrote:
[...]
 The job simply got restarted and this kept happening until the job passed.
 
 A legitimately failed job :
   https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
 
 http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
[...]

If the job fails in such a way that it impacts communication between
the slave and the Jenkins master, or tanks the slave so badly that
it ceases responding entirely, Jenkins often does not report a build
completion status. Because this happens rather unfortunately often
due to the nature of connectivity in service providers and due to
bugs in Jenkins, Zuul assumes it should automatically reattempt any
job which ceases running without explanation.

Perhaps one option would be to keep a retry counter and not
reattempt a job which fails in this manner more than once or
twice...?
-- 
Jeremy Stanley
