Hello folks,

apologize for this long post, but this issue is randomly failing CI job
and I would like to share the result of a long and subtle gray hair
inducing investigation :)

The most obvious symptom in the CI tests is the health-check failure of
this task: "Check for config-update post job". It can't find the commit
sha in config-update/builds/lastStableBuild logs. This is abnormal since
a previous task made sure the change has been Verified +2 and that it
was successfully merged, so there should have been a config-update run.

I believe this failure may also be affecting other tests that rely on
the config-update job to succeed (such as the test_project_test and
test_resource_workflow).


Here is an analysis of this failure:
https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console

The relevant lines are:
[+] Starting the health base check
Fri Dec  2 00:33:53 UTC 2016 Running health-check/zuul.yaml
Zuul integration test SUCCESS
Fri Dec  2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml
[...]
TASK [Check for config-update post job]
****************************************
fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed":
1200, "failed": true, "msg": "Timeout when waiting for search string
67ddb89966cc4999df1869d99d140dd969963416 in
/var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"}
        to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry


Those two health-check (zuul and gerritbot) are very similar as they
both trigger a config-update job.  Looking in zuul.log, we can observe
how the first change (1,1) is being merged and how the config-update is
being successfully executed:
https://softwarefactory-project.io/paste/show/661/

As you can see, gerrit.GerritWatcher received two events: change-merged
and ref-updated, resulting in two trigger event eventually running the
config-update.


Then, gerritbot test merge another config change (4,1),
gerrit.GerritWatcher didn't received the ref-update event and the
config-update didn't get triggered:
https://softwarefactory-project.io/paste/show/662/


The question is, does gerrit failed to emit that event, or does zuul
failed to process it? Using a fresh sf deployment, I tried to reproduce
the failure while dumping the gerrit events-stream in parallel:

    ssh sftests.com gerrit stream-events > stream.log &
    while true; do
        ansible-playbook health-check/zuul.yaml && \
        ansible-playbook health-check/gerritbot.yaml || break;
    done


After about 3 hours of execution, it finally reproduces the infamous error:
"Timeout when waiting for search string
65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in
/var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"

The stream.log do have both events logged:
https://softwarefactory-project.io/paste/show/664/

The zuul.log doesn't show any trace of the last updated-ref:
https://softwarefactory-project.io/paste/show/665/


That's it for now, next is to debug the GerritWatcher, but since it's
happening so rarely, it's not an easy task.

Alternatively, it may be worthy to run post pipeline on change-merged
instead of ref-updated.

If anybody have an idea what's going with that failure, please chime in!

Yours,
-Tristan

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
Softwarefactory-dev mailing list
Softwarefactory-dev@redhat.com
https://www.redhat.com/mailman/listinfo/softwarefactory-dev

Reply via email to