Re: [webkit-dev] Handling flaky layout-test failures in EWS

Alexey Proskuryakov Thu, 05 Dec 2019 09:14:08 -0800


> 5 дек. 2019 г., в 7:58 AM, Aakash Jain <aakash_j...@apple.com> написал(а):
> 
> > What is the current behavior when a patch introduces substantial flakiness?
> 
> For the patch which introduces substantial flakiness (and no consistent 
> failure in EWS), prior to r253049 (which I landed two days back), the patch 
> would be infinitely retried (unless atleast one of the flaky test failed 
> consistently, or the patch is obsolete).
> 
> After r253049, the patch would be marked as passing. EWS is simply not 
> detecting flakiness introduced by a patch (since it doesn't know if the 
> flakiness is introduced by the patch or was pre-existing). However, if any of 
> those 10 tests failed consistently in two runs, the patch would be marked as 
> failing.
> 
> We can work towards improving EWS to better detect flakiness by using data 
> from new flakiness dashboard (which now has the API as well).
> 
> > This looks like decent evidence to make the EWS bubble red.
> We can do that. Have we seen any patch in the past which would fail 5+ tests 
> in first run, and fail completely different 5+ tests in second run? Also, 
> what should be the threshold, should it be 10 (5+5) different test failures 
> (or lower) in two runs?


This would have been difficult to see recently, because EWS just retried when 
hitting this. But I've certainly seen patches that increased flakiness like 
this, so it's a practical case.

Starting with 10 more test failures on two retries (combined) than in a clean 
run seems reasonable.

- Alexey


> Thanks
> Aakash
> 
>> On Dec 3, 2019, at 12:28 PM, Alexey Proskuryakov <a...@webkit.org 
>> <mailto:a...@webkit.org>> wrote:
>> 
>> 
>> Yes, I think that this makes more sense than retrying.
>> 
>> What is the current behavior when a patch introduces substantial flakiness? 
>> E.g. this scenario:
>> 
>> - First test run produces 5 failures.
>> - Second test run produces 5 different failures.
>> - Clean re-run produces no failures.
>> 
>> This looks like decent evidence to make the EWS bubble red. I don't know 
>> what exactly the threshold should be, but that should certainly be below 30.
>> 
>> - Alexey
>> 
>> 
>>> 3 дек. 2019 г., в 8:40 AM, Aakash Jain <aakash_j...@apple.com 
>>> <mailto:aakash_j...@apple.com>> написал(а):
>>> 
>>> Hi Everyone,
>>> 
>>> We have various layout-tests which are flaky (which sometimes pass and 
>>> sometimes fail/crash/timeout). EWS needs to work despite these flaky tests, 
>>> and need to be able to tell whether the patch being tested introduced any 
>>> test failure or not.
>>> 
>>> In EWS, we have logic (same logic in both old and new EWS) on how to deal 
>>> with flaky tests. The logic is roughly following:
>>> 
>>> - We apply the patch and build.
>>> 
>>> - Run layout-tests. During the test-run, we retry the failed tests. If 
>>> those tests pass in retry, the run is considered to have no failing test 
>>> (this retry part is same for EWS / build.webkit.org 
>>> <http://build.webkit.org/> / manual-run and part of run-webkit-test script 
>>> itself).
>>> 
>>> - If a test-run has some failures, EWS retry the test-run one more time (to 
>>> check if those test failures were flaky). 
>>> 
>>> - Then we do one more test-run on clean-tree (without the patch), and 
>>> compare the results of first run, second run, and clean tree run.
>>> 
>>> - After that we analyze the results from all three test-runs (which we call 
>>> first_run, second_run and clean_tree_run). 
>>> 
>>> 
>>> If there are different test failures in first and second runs (i.e.: flaky 
>>> test failures), and the patch being tested did not introduce any new test 
>>> failure, we retry the entire build (probably hoping that next time we will 
>>> not see the flakiness). This retry can result in almost infinite loop, if 
>>> someone commits a very flaky test (which is not rare), and would cause EWS 
>>> to be almost stuck until the flakiness is fixed.
>>> 
>>> From 
>>> https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352
>>>  
>>> <https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352>
>>>   ('Defer' means retrying the build).
>>> '''
>>> 350    # At this point we know that at least one test flaked, but no 
>>> consistent failures
>>> 351    # were introduced. This is a bit of a grey-zone.
>>> 352    return False  # Defer patch
>>> '''
>>> 
>>> 
>>> Retrying the entire build, just because of few flaky tests seems excessive 
>>> and inefficient. This doesn't help identify flaky tests, and only delays 
>>> the EWS result. Instead, we should mark the build as SUCCESS (since the 
>>> patch did not introduce any new consistent test failure).
>>> 
>>> In other EWS test-suites like API tests and JSC tests, we do not retry the 
>>> entire build on flaky test failures, we ignore the flaky tests, and only 
>>> focus on tests which failed consistently. We should do the similar thing 
>>> for layout-tests as well.
>>> 
>>> I am going to make that change for layout tests in 
>>> https://bugs.webkit.org/show_bug.cgi?id=204769 
>>> <https://bugs.webkit.org/show_bug.cgi?id=204769>. Please let me know if 
>>> anyone has a different opinion.
>>> 
>>> Thanks
>>> Aakash
>> 
>> 
>

_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev

Re: [webkit-dev] Handling flaky layout-test failures in EWS

Reply via email to