Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-05 Thread Alexey Proskuryakov


> On Dec 5, 2019, at 7:58 AM, Aakash Jain wrote:
> 
> > What is the current behavior when a patch introduces substantial flakiness?
> 
> For a patch which introduces substantial flakiness (and no consistent 
> failure in EWS), prior to r253049 (which I landed two days ago), the patch 
> would be retried indefinitely (unless at least one of the flaky tests failed 
> consistently, or the patch became obsolete).
> 
> After r253049, the patch would be marked as passing. EWS simply does not 
> detect flakiness introduced by a patch (since it doesn't know whether the 
> flakiness was introduced by the patch or was pre-existing). However, if any of 
> those 10 tests failed consistently in both runs, the patch would be marked as 
> failing.
> 
> We can work towards improving EWS to better detect flakiness by using data 
> from the new flakiness dashboard (which now has an API as well).
> 
> > This looks like decent evidence to make the EWS bubble red.
> We can do that. Have we seen any patch in the past that failed 5+ tests in 
> the first run and a completely different 5+ tests in the second run? Also, 
> what should the threshold be: 10 (5+5) different test failures across the two 
> runs, or lower?

This would have been difficult to see recently, because EWS just retried when 
hitting this. But I've certainly seen patches that increased flakiness like 
this, so it's a practical case.

Starting with 10 more test failures on two retries (combined) than in a clean 
run seems reasonable.
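That heuristic could be sketched roughly as follows (a simplification with hypothetical names, not actual EWS code):

```python
def should_mark_red(first_run_failures, second_run_failures, clean_run_failures,
                    threshold=10):
    # Tests that failed in either run with the patch applied.
    with_patch = set(first_run_failures) | set(second_run_failures)
    # Discard failures that also occur on a clean tree (pre-existing flakiness).
    extra = with_patch - set(clean_run_failures)
    # Red bubble once the patch adds `threshold` or more distinct failures.
    return len(extra) >= threshold
```

In the 5+5 scenario above (two disjoint sets of 5 failures, clean run green), this counts 10 extra failures and turns the bubble red.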

- Alexey


> Thanks
> Aakash
> 
>> On Dec 3, 2019, at 12:28 PM, Alexey Proskuryakov wrote:
>> 
>> 
>> Yes, I think that this makes more sense than retrying.
>> 
>> What is the current behavior when a patch introduces substantial flakiness? 
>> E.g. this scenario:
>> 
>> - First test run produces 5 failures.
>> - Second test run produces 5 different failures.
>> - Clean re-run produces no failures.
>> 
>> This looks like decent evidence to make the EWS bubble red. I don't know 
>> what exactly the threshold should be, but that should certainly be below 30.
>> 
>> - Alexey
>> 
>> 
>>> On Dec 3, 2019, at 8:40 AM, Aakash Jain wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> We have various layout-tests which are flaky (they sometimes pass and 
>>> sometimes fail/crash/timeout). EWS needs to work despite these flaky tests, 
>>> and needs to be able to tell whether the patch being tested introduced any 
>>> test failures.
>>> 
>>> In EWS, we have logic (the same logic in both the old and new EWS) for how 
>>> to deal with flaky tests. The logic is roughly as follows:
>>> 
>>> - We apply the patch and build.
>>> 
>>> - Run layout-tests. During the test run, we retry the failed tests. If 
>>> those tests pass on retry, the run is considered to have no failing tests 
>>> (this retry step is the same for EWS / build.webkit.org / manual runs, and 
>>> is part of the run-webkit-tests script itself).
>>> 
>>> - If a test run has some failures, EWS retries the test run one more time 
>>> (to check whether those test failures were flaky).
>>> 
>>> - Then we do one more test run on a clean tree (without the patch), and 
>>> compare the results of the first run, second run, and clean-tree run.
>>> 
>>> - After that, we analyze the results from all three test runs (which we 
>>> call first_run, second_run and clean_tree_run).
>>> 
>>> 
>>> If there are different test failures in the first and second runs (i.e., 
>>> flaky test failures), and the patch being tested did not introduce any new 
>>> test failures, we retry the entire build (presumably hoping that next time 
>>> we will not see the flakiness). This retry can result in an almost-infinite 
>>> loop if someone commits a very flaky test (which is not rare), and can 
>>> cause EWS to be almost stuck until the flakiness is fixed.
>>> 
>>> From 
>>> https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352 
>>> ('Defer' means retrying the build):
>>> '''
>>> 350    # At this point we know that at least one test flaked, but no consistent failures
>>> 351    # were introduced. This is a bit of a grey-zone.
>>> 352    return False  # Defer patch
>>> '''
>>> 
>>> 
>>> Retrying the entire build just because of a few flaky tests seems excessive 
>>> and inefficient. It doesn't help identify flaky tests, and only delays 
>>> the EWS result. Instead, we should mark the build as SUCCESS (since the 
>>> patch did not introduce any new consistent test failures).
>>> 
>>> In other EWS test suites, like API tests and JSC tests, we do not retry the 
>>> entire build on flaky test failures; we ignore the flaky tests and only 
>>> focus on tests which failed consistently. We should do the same thing 
>>> for layout-tests as well.
>>> 
>>> I am going to make that change for layout tests in 
>>> https://bugs.webkit.org/show_bug.cgi?id=204769. Please let me know if 
>>> anyone has a different opinion.

Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-05 Thread Aakash Jain


> On Dec 3, 2019, at 1:54 PM, Ryosuke Niwa wrote:
> 
> 
> On Tue, Dec 3, 2019 at 9:29 AM Alexey Proskuryakov wrote:
> 
> Yes, I think that this makes more sense than retrying.
> 
> What is the current behavior when a patch introduces substantial flakiness? 
> E.g. this scenario:
> 
> - First test run produces 5 failures.
> - Second test run produces 5 different failures.
> - Clean re-run produces no failures.
> 
> This looks like decent evidence to make the EWS bubble red. I don't know what 
> exactly the threshold should be, but that should certainly be below 30.
> 
> This makes sense to me.
> 
> Another scenario where flaky failures might be a real regression is when 
> tests which never failed before start failing flakily.
> 
> Can we fetch data from the flakiness dashboard and see if we’ve ever observed 
> the flakiness on a given test? That would let us get out of the guessing game.

Yes, that's a good idea, and something we have been exploring lately. We now 
also have the new results database (https://results.webkit.org), which has a 
REST API that EWS can use. Using data from the results database, we can 
significantly improve efficiency and flakiness detection in EWS. We started 
with some ideas in https://bugs.webkit.org/show_bug.cgi?id=204368.
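Once per-test history is available over that API, EWS could classify known-flaky tests with something like the sketch below (the data shape here is assumed for illustration, not the actual results.webkit.org schema):

```python
def flakiness_rate(history):
    # history: recent outcomes for one test, e.g. ['PASS', 'FAIL', 'PASS', ...]
    if not history:
        return 0.0
    return sum(1 for outcome in history if outcome != 'PASS') / len(history)

def is_known_flaky(history, min_rate=0.05):
    # Flaky means the test fails sometimes but not always; a rate of 1.0 is a
    # consistent failure, not flakiness.
    return min_rate <= flakiness_rate(history) < 1.0
```

A failure in a test that `is_known_flaky` already flags could then be ignored without retrying the whole build.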

> 
> I understand some flakiness might be specific to EWS, but I'd imagine that'd 
> be more of a minority. If it is common, we could also make EWS bots 
> periodically run full tests without patches and report results to the 
> flakiness dashboard, so that we have a corpus of flaky test failures on EWS 
> bots as well.
> 
> - R. Niwa

Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-05 Thread Aakash Jain
> What is the current behavior when a patch introduces substantial flakiness?

For a patch which introduces substantial flakiness (and no consistent failure 
in EWS), prior to r253049 (which I landed two days ago), the patch would be 
retried indefinitely (unless at least one of the flaky tests failed 
consistently, or the patch became obsolete).

After r253049, the patch would be marked as passing. EWS simply does not detect 
flakiness introduced by a patch (since it doesn't know whether the flakiness is 
introduced by the patch or was pre-existing). However, if any of those 10 tests 
failed consistently in both runs, the patch would be marked as failing.

We can work towards improving EWS to better detect flakiness by using data from 
the new flakiness dashboard (which now has an API as well).

> This looks like decent evidence to make the EWS bubble red.
We can do that. Have we seen any patch in the past that failed 5+ tests in the 
first run and a completely different 5+ tests in the second run? Also, what 
should the threshold be: 10 (5+5) different test failures across the two runs, 
or lower?
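For concreteness, the post-r253049 decision amounts to roughly the following sketch (a simplification with hypothetical names; the real logic lives in patchanalysistask.py):

```python
def classify(first_run_failures, second_run_failures, clean_tree_failures):
    # A failure only counts against the patch if it shows up in both runs.
    consistent = set(first_run_failures) & set(second_run_failures)
    # Failures that also happen on a clean tree are pre-existing, not the patch's.
    new_consistent = consistent - set(clean_tree_failures)
    # Purely flaky failures (present in only one run) no longer defer the patch.
    return 'FAILURE' if new_consistent else 'SUCCESS'
```

Under this logic, 5 failures in one run and 5 completely different ones in the other come out as SUCCESS, which is exactly the gap being discussed.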

Thanks
Aakash


___
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev


Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-03 Thread Ryosuke Niwa
On Tue, Dec 3, 2019 at 9:29 AM Alexey Proskuryakov wrote:

>
> Yes, I think that this makes more sense than retrying.
>
> What is the current behavior when a patch introduces substantial
> flakiness? E.g. this scenario:
>
> - First test run produces 5 failures.
> - Second test run produces 5 different failures.
> - Clean re-run produces no failures.
>
> This looks like decent evidence to make the EWS bubble red. I don't know
> what exactly the threshold should be, but that should certainly be below 30.
>

This makes sense to me.

Another scenario where flaky failures might be a real regression is when
tests which never failed before start failing flakily.

Can we fetch data from the flakiness dashboard and see if we’ve ever
observed the flakiness on a given test? That would let us get out of the
guessing game.
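That lookup could be as simple as the following sketch (the history format here is assumed for illustration; the real flakiness dashboard and results database expose different formats):

```python
def ever_observed_flaky(test_name, dashboard_history):
    # dashboard_history: mapping of test name -> list of past outcomes.
    outcomes = set(dashboard_history.get(test_name, []))
    # A test is historically flaky if it has both passed and not passed.
    return 'PASS' in outcomes and len(outcomes) > 1
```

A flaky failure in a test with no such history would then be strong evidence that the patch itself introduced the flakiness.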

I understand some flakiness might be specific to EWS, but I'd imagine that'd
be more of a minority. If it is common, we could also make EWS bots
periodically run full tests without patches and report results to the
flakiness dashboard, so that we have a corpus of flaky test failures on EWS
bots as well.

- R. Niwa

-- 
- R. Niwa
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev


Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-03 Thread Alexey Proskuryakov

Yes, I think that this makes more sense than retrying.

What is the current behavior when a patch introduces substantial flakiness? 
E.g. this scenario:

- First test run produces 5 failures.
- Second test run produces 5 different failures.
- Clean re-run produces no failures.

This looks like decent evidence to make the EWS bubble red. I don't know what 
exactly the threshold should be, but that should certainly be below 30.

- Alexey


> On Dec 3, 2019, at 8:40 AM, Aakash Jain wrote:
> 
> Hi Everyone,
> 
> We have various layout-tests which are flaky (they sometimes pass and 
> sometimes fail/crash/timeout). EWS needs to work despite these flaky tests, 
> and needs to be able to tell whether the patch being tested introduced any 
> test failures.
> 
> In EWS, we have logic (the same logic in both the old and new EWS) for how to 
> deal with flaky tests. The logic is roughly as follows:
> 
> - We apply the patch and build.
> 
> - Run layout-tests. During the test run, we retry the failed tests. If those 
> tests pass on retry, the run is considered to have no failing tests (this 
> retry step is the same for EWS / build.webkit.org / manual runs, and is part 
> of the run-webkit-tests script itself).
> 
> - If a test run has some failures, EWS retries the test run one more time (to 
> check whether those test failures were flaky).
> 
> - Then we do one more test run on a clean tree (without the patch), and 
> compare the results of the first run, second run, and clean-tree run.
> 
> - After that, we analyze the results from all three test runs (which we call 
> first_run, second_run and clean_tree_run).
> 
> 
> If there are different test failures in the first and second runs (i.e., 
> flaky test failures), and the patch being tested did not introduce any new 
> test failures, we retry the entire build (presumably hoping that next time we 
> will not see the flakiness). This retry can result in an almost-infinite loop 
> if someone commits a very flaky test (which is not rare), and can cause EWS 
> to be almost stuck until the flakiness is fixed.
> 
> From 
> https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352 
> ('Defer' means retrying the build):
> '''
> 350    # At this point we know that at least one test flaked, but no consistent failures
> 351    # were introduced. This is a bit of a grey-zone.
> 352    return False  # Defer patch
> '''
> 
> 
> Retrying the entire build just because of a few flaky tests seems excessive 
> and inefficient. It doesn't help identify flaky tests, and only delays the 
> EWS result. Instead, we should mark the build as SUCCESS (since the patch did 
> not introduce any new consistent test failures).
> 
> In other EWS test suites, like API tests and JSC tests, we do not retry the 
> entire build on flaky test failures; we ignore the flaky tests and only focus 
> on tests which failed consistently. We should do the same thing for 
> layout-tests as well.
> 
> I am going to make that change for layout tests in 
> https://bugs.webkit.org/show_bug.cgi?id=204769. Please let me know if anyone 
> has a different opinion.
> 
> Thanks
> Aakash


___
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev