> 5 дек. 2019 г., в 7:58 AM, Aakash Jain <aakash_j...@apple.com> написал(а): > > > What is the current behavior when a patch introduces substantial flakiness? > > For the patch which introduces substantial flakiness (and no consistent > failure in EWS), prior to r253049 (which I landed two days back), the patch > would be infinitely retried (unless atleast one of the flaky test failed > consistently, or the patch is obsolete). > > After r253049, the patch would be marked as passing. EWS is simply not > detecting flakiness introduced by a patch (since it doesn't know if the > flakiness is introduced by the patch or was pre-existing). However, if any of > those 10 tests failed consistently in two runs, the patch would be marked as > failing. > > We can work towards improving EWS to better detect flakiness by using data > from new flakiness dashboard (which now has the API as well). > > > This looks like decent evidence to make the EWS bubble red. > We can do that. Have we seen any patch in the past which would fail 5+ tests > in first run, and fail completely different 5+ tests in second run? Also, > what should be the threshold, should it be 10 (5+5) different test failures > (or lower) in two runs?
This would have been difficult to see recently, because EWS just retried when hitting this. But I've certainly seen patches that increased flakiness like this, so it's a practical case. Starting with 10 more test failures on two retries (combined) than in a clean run seems reasonable. - Alexey > Thanks > Aakash > >> On Dec 3, 2019, at 12:28 PM, Alexey Proskuryakov <a...@webkit.org >> <mailto:a...@webkit.org>> wrote: >> >> >> Yes, I think that this makes more sense than retrying. >> >> What is the current behavior when a patch introduces substantial flakiness? >> E.g. this scenario: >> >> - First test run produces 5 failures. >> - Second test run produces 5 different failures. >> - Clean re-run produces no failures. >> >> This looks like decent evidence to make the EWS bubble red. I don't know >> what exactly the threshold should be, but that should certainly be below 30. >> >> - Alexey >> >> >>> 3 дек. 2019 г., в 8:40 AM, Aakash Jain <aakash_j...@apple.com >>> <mailto:aakash_j...@apple.com>> написал(а): >>> >>> Hi Everyone, >>> >>> We have various layout-tests which are flaky (which sometimes pass and >>> sometimes fail/crash/timeout). EWS needs to work despite these flaky tests, >>> and need to be able to tell whether the patch being tested introduced any >>> test failure or not. >>> >>> In EWS, we have logic (same logic in both old and new EWS) on how to deal >>> with flaky tests. The logic is roughly following: >>> >>> - We apply the patch and build. >>> >>> - Run layout-tests. During the test-run, we retry the failed tests. If >>> those tests pass in retry, the run is considered to have no failing test >>> (this retry part is same for EWS / build.webkit.org >>> <http://build.webkit.org/> / manual-run and part of run-webkit-test script >>> itself). >>> >>> - If a test-run has some failures, EWS retry the test-run one more time (to >>> check if those test failures were flaky). >>> >>> - Then we do one more test-run on clean-tree (without the patch), and >>> compare the results of first run, second run, and clean tree run. >>> >>> - After that we analyze the results from all three test-runs (which we call >>> first_run, second_run and clean_tree_run). >>> >>> >>> If there are different test failures in first and second runs (i.e.: flaky >>> test failures), and the patch being tested did not introduce any new test >>> failure, we retry the entire build (probably hoping that next time we will >>> not see the flakiness). This retry can result in almost infinite loop, if >>> someone commits a very flaky test (which is not rare), and would cause EWS >>> to be almost stuck until the flakiness is fixed. >>> >>> From >>> https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352 >>> >>> <https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352> >>> ('Defer' means retrying the build). >>> ''' >>> 350 # At this point we know that at least one test flaked, but no >>> consistent failures >>> 351 # were introduced. This is a bit of a grey-zone. >>> 352 return False # Defer patch >>> ''' >>> >>> >>> Retrying the entire build, just because of few flaky tests seems excessive >>> and inefficient. This doesn't help identify flaky tests, and only delays >>> the EWS result. Instead, we should mark the build as SUCCESS (since the >>> patch did not introduce any new consistent test failure). >>> >>> In other EWS test-suites like API tests and JSC tests, we do not retry the >>> entire build on flaky test failures, we ignore the flaky tests, and only >>> focus on tests which failed consistently. We should do the similar thing >>> for layout-tests as well. >>> >>> I am going to make that change for layout tests in >>> https://bugs.webkit.org/show_bug.cgi?id=204769 >>> <https://bugs.webkit.org/show_bug.cgi?id=204769>. Please let me know if >>> anyone has a different opinion. >>> >>> Thanks >>> Aakash >> >> >
_______________________________________________ webkit-dev mailing list firstname.lastname@example.org https://lists.webkit.org/mailman/listinfo/webkit-dev