Re: [webkit-dev] EWS Comments on Bugzilla (Was: EWS now parses error logs in case of build failure)

2019-12-03 Thread Ryosuke Niwa
Great. I've completed the survey.

- R. Niwa

On Mon, Dec 2, 2019 at 5:19 PM Aakash Jain  wrote:

> There were multiple ideas discussed in this thread. I would like to gather
> more data about what do most people prefer. I have sent out a short survey
> in https://lists.webkit.org/pipermail/webkit-dev/2019-December/030980.html
>
> Thanks
> Aakash
>
> On Nov 5, 2019, at 12:04 PM, Alexey Proskuryakov  wrote:
>
>
>
> On Nov 4, 2019, at 1:37 PM, Ryosuke Niwa wrote:
>
>
> On Mon, Nov 4, 2019 at 9:40 AM Alexey Proskuryakov  wrote:
>
>>
>> Can you elaborate on that, how exactly is e-mailing on first failure
>> useful to reviewers?
>>
>> Getting rid of Bugzilla comments was one of the goals of EWS rewrite,
>> based on engineering feedback about noise in bugs and in e-mail, and I
>> wholeheartedly agree with this feedback. So I think that comments are
>> generally undesirable.
>>
>> Since I don't understand what your precise scenario is, I may be making
>> straw man arguments below, but here are some things that I think make the
>> proposed behavior unhelpful (add a comment on first failure, or when all
>> EWSes pass).
>>
>> 1. EWS comments in Bugzilla are so annoying that some people take the
>> radical step of manually hiding them. EWS history is archived anyway, there
>> is no need to look into comments for it.
>>
>> 2. There are often many people CC'ed on the bug to whom EWS data is
>> irrelevant or even mysterious (e.g. reporters, web developers or
>> non-reviewers). The noise is a slight annoyance, discouraging further
>> participation in the project.
>>
>> 3. I believe that for most reviewers, the mode of operation is one of
>> two: (1) review when pinged directly, or (2) go over the review queue when
>> one has the time. Getting EWS comments helps with neither.
>>
>> 4. Commenting when all EWSes pass is not very practical - it's too often
>> that we have some stragglers that take days (or forever). I don't think
>> that we can make it reliable even if we start actively policing EWS
>> responsiveness.
>>
>> 5. The reviewer likely wants to know the state of multiple EWSes if they
>> are going to wait for EWS at all. What exactly are they going to do after
>> getting an e-mail that one EWS failed?
>>
>
> I often use an EWS failure as a signal to hold off on reviewing a patch.
> Otherwise, the bug mail will stay in my inbox as one of the items to get to.
>
> I can see the usefulness in the somewhat unusual case of a super urgent
>> patch. We may want multiple people to watch it, so that members of CC list
>> would go and ask the patch author to update it with more urgency than
>> e-mail allows for. I think that opt-in is a better mechanism for that, so
>> that people who opted in would receive information about each EWS data
>> point.
>>
>
> I think there is value in knowing that a patch isn't ready, instead of
> having to open the bug to realize that.
>
>
> So just to clarify,
>
> - a major part of how you get to review bugs is by being CC'ed, and you
> review them when you have the time to read bugmail;
> - and you don't open the bug in Bugzilla if there is already an EWS
> failure by the time you read the e-mail where review is requested?
>
> That's clearly a valid benefit. In my mind, it probably doesn't outweigh
> the downsides. On the other hand, yours is a voice of someone who reviews
> way more patches than Maciej and me combined these days, so maybe more
> e-mail is an overall benefit to many of the reviewers.
>
> - Alexey
>
>
>
> - R. Niwa
>
>> On Nov 3, 2019, at 6:58 PM, Maciej Stachowiak wrote:
>>
>>
>> I think they are useful to actual and potential reviewers. Direct email
>> to the patch author is not something anyone can Cc themselves on, and is
>> not archived, so seems like a strictly worse form of communication.
>>
>> On Nov 2, 2019, at 9:34 AM, Alexey Proskuryakov  wrote:
>>
>>
>> My preference is still e-mailing the patch author directly (possibly,
>> also having an option to opt in for anyone). Bugzilla comments will always
>> be irrelevant for most people CC'ed on the bug, and they are almost always
>> undesirable to keep within the discussion flow.
>>
>> - Alexey
>>
>> On Nov 1, 2019, at 6:28 PM, Aakash Jain wrote:
>>
>> Sounds good. I prefer the single comment when the first failure occurs.
>> That way a notification would be sent as soon as the first failure happens.
>>
>> I'll implement that (assuming it's acceptable to everyone).
>>
>> Thanks
>> Aakash
>>
>> On Nov 1, 2019, at 8:35 PM, Maciej Stachowiak  wrote:
>>
>>
>> How about only a single comment when the first failure occurs? (Or else
>> when all bots pass, if there is never a failure.)
>>
>> This should help the author, the reviewer, and anyone else cc’d, without
>> being too spammy.
>>
>> On Nov 1, 2019, at 5:20 PM, Aakash Jain  wrote:
>>
>> Hi Ryosuke,
>>
>> Many people didn't like the noise from the EWS comments, and we removed the
>> comments per the previous discussion in:
>> 

Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-03 Thread Ryosuke Niwa
On Tue, Dec 3, 2019 at 9:29 AM Alexey Proskuryakov  wrote:

>
> Yes, I think that this makes more sense than retrying.
>
> What is the current behavior when a patch introduces substantial
> flakiness? E.g. this scenario:
>
> - First test run produces 5 failures.
> - Second test run produces 5 different failures.
> - Clean re-run produces no failures.
>
> This looks like decent evidence to make the EWS bubble red. I don't know
> what exactly the threshold should be, but that should certainly be below 30.
>

This makes sense to me.

Another scenario where flaky failures might be a real regression is when
tests which never failed before start failing flakily.

Can we fetch data from the flakiness dashboard and see if we’ve ever
observed the flakiness on a given test? That would let us get out of the
guessing game.
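As a rough sketch of what such a check could look like (the dashboard URL and
JSON response shape below are hypothetical placeholders, not the real
flakiness-dashboard API; only the history-classification helper is meant to be
concrete):

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint; the real flakiness-dashboard API may differ.
DASHBOARD_URL = "https://example.webkit.org/api/flakiness"


def is_flaky_history(results):
    """A test's recorded history looks flaky if it contains both
    passes and non-passes (fail/crash/timeout)."""
    return "PASS" in results and any(r != "PASS" for r in results)


def has_known_flaky_history(test_name):
    """Ask the dashboard whether this test has ever flaked before, so
    EWS can avoid blaming the current patch for a known-flaky test."""
    with urlopen("%s?test=%s" % (DASHBOARD_URL, test_name)) as response:
        history = json.load(response)
    return is_flaky_history(history.get("results", []))
```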

I understand some flakiness might be specific to EWS, but I'd imagine that'd
be more of a minority case. If it is common, we could also make EWS bots
periodically run full tests without patches and report the results to the
flakiness dashboard, so that we have a corpus of flaky test failures on EWS
bots as well.

- R. Niwa

On Dec 3, 2019, at 8:40 AM, Aakash Jain wrote:
>
>
> Hi Everyone,
>
> We have various layout-tests which are flaky (they sometimes pass and
> sometimes fail/crash/timeout). EWS needs to work despite these flaky tests,
> and needs to be able to tell whether the patch being tested introduced any
> test failure or not.
>
> In EWS, we have logic (the same logic in both the old and new EWS) for how
> to deal with flaky tests. The logic is roughly as follows:
>
> - We apply the patch and build.
>
> - Run layout-tests. During the test-run, we retry the failed tests. If
> those tests pass on retry, the run is considered to have no failing tests
> (this retry step is the same for EWS / build.webkit.org / manual runs, and
> is part of the run-webkit-tests script itself).
>
> - If a test-run has some failures, EWS retries the test-run one more time
> (to check whether those test failures were flaky).
>
> - Then we do one more test-run on a clean tree (without the patch), and
> compare the results of the first run, second run, and clean-tree run.
>
> - After that we analyze the results from all three test-runs (which we
> call first_run, second_run and clean_tree_run).
>
>
> If there are different test failures in the first and second runs (i.e.
> flaky test failures), and the patch being tested did not introduce any new
> test failure, we retry the entire build (probably hoping that next time we
> will not see the flakiness). This retry can result in an almost infinite
> loop if someone commits a very flaky test (which is not rare), and would
> cause EWS to be almost stuck until the flakiness is fixed.
>
> From
> https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352
>  ('Defer' means retrying the build).
> '''
> 350    # At this point we know that at least one test flaked, but no consistent failures
> 351    # were introduced. This is a bit of a grey-zone.
> 352    return False  # Defer patch
> '''
>
>
> Retrying the entire build just because of a few flaky tests seems excessive
> and inefficient. It doesn't help identify flaky tests, and only delays
> the EWS result. Instead, we should mark the build as SUCCESS (since the
> patch did not introduce any new consistent test failure).
>
> In other EWS test-suites, like API tests and JSC tests, we do not retry the
> entire build on flaky test failures; we ignore the flaky tests and focus
> only on tests which failed consistently. We should do the same for
> layout-tests as well.
>
> I am going to make that change for layout tests in
> https://bugs.webkit.org/show_bug.cgi?id=204769. Please let me know if
> anyone has a different opinion.
>
> Thanks
> Aakash
>
>
>
> ___
> webkit-dev mailing list
> webkit-dev@lists.webkit.org
> https://lists.webkit.org/mailman/listinfo/webkit-dev
>
-- 
- R. Niwa


Re: [webkit-dev] Handling flaky layout-test failures in EWS

2019-12-03 Thread Alexey Proskuryakov

Yes, I think that this makes more sense than retrying.

What is the current behavior when a patch introduces substantial flakiness? 
E.g. this scenario:

- First test run produces 5 failures.
- Second test run produces 5 different failures.
- Clean re-run produces no failures.

This looks like decent evidence to make the EWS bubble red. I don't know what 
exactly the threshold should be, but that should certainly be below 30.
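A sketch of this suggestion: if a patch makes many *different* tests flake
across the two runs while the clean-tree run is green, treat the patch as the
likely cause. The threshold value of 10 is an arbitrary placeholder (the
thread only establishes it should be well below 30):

```python
# Placeholder value; the thread does not settle on an exact threshold.
FLAKY_FAILURE_THRESHOLD = 10


def patch_introduces_flakiness(first_run_failures, second_run_failures,
                               clean_tree_failures):
    """Return True when the patch appears to introduce substantial
    flakiness, i.e. many tests failed in only one of the two runs with
    the patch applied, and those tests did not fail on the clean tree."""
    # Tests that failed in exactly one of the two patched runs are flaky.
    flaky = set(first_run_failures) ^ set(second_run_failures)
    # Pre-existing flakiness (also seen on the clean tree) doesn't count.
    flaky -= set(clean_tree_failures)
    return len(flaky) >= FLAKY_FAILURE_THRESHOLD
```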

- Alexey


> On Dec 3, 2019, at 8:40 AM, Aakash Jain wrote:
> 
> Hi Everyone,
> 
> We have various layout-tests which are flaky (they sometimes pass and
> sometimes fail/crash/timeout). EWS needs to work despite these flaky tests,
> and needs to be able to tell whether the patch being tested introduced any
> test failure or not.
> 
> In EWS, we have logic (the same logic in both the old and new EWS) for how
> to deal with flaky tests. The logic is roughly as follows:
> 
> - We apply the patch and build.
> 
> - Run layout-tests. During the test-run, we retry the failed tests. If those
> tests pass on retry, the run is considered to have no failing tests (this
> retry step is the same for EWS / build.webkit.org / manual runs, and is part
> of the run-webkit-tests script itself).
> 
> - If a test-run has some failures, EWS retries the test-run one more time
> (to check whether those test failures were flaky).
> 
> - Then we do one more test-run on a clean tree (without the patch), and
> compare the results of the first run, second run, and clean-tree run.
> 
> - After that we analyze the results from all three test-runs (which we call 
> first_run, second_run and clean_tree_run). 
> 
> 
> If there are different test failures in the first and second runs (i.e.
> flaky test failures), and the patch being tested did not introduce any new
> test failure, we retry the entire build (probably hoping that next time we
> will not see the flakiness). This retry can result in an almost infinite
> loop if someone commits a very flaky test (which is not rare), and would
> cause EWS to be almost stuck until the flakiness is fixed.
> 
> From 
> https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352
>   ('Defer' means retrying the build).
> '''
> 350    # At this point we know that at least one test flaked, but no consistent failures
> 351    # were introduced. This is a bit of a grey-zone.
> 352    return False  # Defer patch
> '''
> 
> 
> Retrying the entire build just because of a few flaky tests seems excessive
> and inefficient. It doesn't help identify flaky tests, and only delays
> the EWS result. Instead, we should mark the build as SUCCESS (since the
> patch did not introduce any new consistent test failure).
> 
> In other EWS test-suites, like API tests and JSC tests, we do not retry the
> entire build on flaky test failures; we ignore the flaky tests and focus
> only on tests which failed consistently. We should do the same for
> layout-tests as well.
> 
> I am going to make that change for layout tests in 
> https://bugs.webkit.org/show_bug.cgi?id=204769. Please let me know if anyone 
> has a different opinion.
> 
> Thanks
> Aakash




[webkit-dev] Handling flaky layout-test failures in EWS

2019-12-03 Thread Aakash Jain
Hi Everyone,

We have various layout-tests which are flaky (they sometimes pass and
sometimes fail/crash/timeout). EWS needs to work despite these flaky tests, and
needs to be able to tell whether the patch being tested introduced any test
failure or not.

In EWS, we have logic (the same logic in both the old and new EWS) for how to
deal with flaky tests. The logic is roughly as follows:

- We apply the patch and build.

- Run layout-tests. During the test-run, we retry the failed tests. If those
tests pass on retry, the run is considered to have no failing tests (this retry
step is the same for EWS / build.webkit.org / manual runs, and is part of the
run-webkit-tests script itself).

- If a test-run has some failures, EWS retries the test-run one more time (to
check whether those test failures were flaky).

- Then we do one more test-run on a clean tree (without the patch), and compare
the results of the first run, second run, and clean-tree run.

- After that we analyze the results from all three test-runs (which we call 
first_run, second_run and clean_tree_run). 


If there are different test failures in the first and second runs (i.e. flaky
test failures), and the patch being tested did not introduce any new test
failure, we retry the entire build (probably hoping that next time we will not
see the flakiness). This retry can result in an almost infinite loop if someone
commits a very flaky test (which is not rare), and would cause EWS to be almost
stuck until the flakiness is fixed.

From
https://trac.webkit.org/browser/webkit/trunk/Tools/Scripts/webkitpy/tool/bot/patchanalysistask.py#L352
('Defer' means retrying the build).
'''
350    # At this point we know that at least one test flaked, but no consistent failures
351    # were introduced. This is a bit of a grey-zone.
352    return False  # Defer patch
'''


Retrying the entire build just because of a few flaky tests seems excessive and
inefficient. It doesn't help identify flaky tests, and only delays the EWS
result. Instead, we should mark the build as SUCCESS (since the patch did not
introduce any new consistent test failure).

In other EWS test-suites, like API tests and JSC tests, we do not retry the
entire build on flaky test failures; we ignore the flaky tests and focus only
on tests which failed consistently. We should do the same for layout-tests as
well.
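A minimal sketch of the proposed decision (not the actual patchanalysistask.py
code; the function and argument names are illustrative):

```python
def patch_introduced_failures(first_run, second_run, clean_tree_run):
    """Each argument is the set of failing test names from that run.
    Returns the consistent, patch-introduced failures: tests that
    failed in *both* runs with the patch and not on the clean tree.
    An empty result means the build can be marked SUCCESS instead of
    being retried; tests that failed in only one run are treated as
    flaky and ignored."""
    consistent = set(first_run) & set(second_run)
    return consistent - set(clean_tree_run)
```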

I am going to make that change for layout tests in 
https://bugs.webkit.org/show_bug.cgi?id=204769. Please let me know if anyone 
has a different opinion.

Thanks
Aakash


Re: [webkit-dev] Safari Tech Preview 96 available on wpt.fyi!

2019-12-03 Thread Philip Jägenstedt
On Tue, Dec 3, 2019 at 3:05 AM Jiewen Tan  wrote:

> Hi Maciej,
>
> On Dec 2, 2019, at 4:10 PM, Maciej Stachowiak  wrote:
>
>
> There are a number of mysterious timeouts with 96. Not sure if they're
> flakiness or real?
>
> The new WebCrypto failures are surprising, but likely real and should be
> investigated:
> https://wpt.fyi/results/WebCryptoAPI/wrapKey_unwrapKey?diff=ADC=is%3Adifferent_id=347530011_id=381930013
> 
>
>
> I believe these tests are flaky. I made a PR to improve them a while
> ago. I should probably get those improvements landed sometime.
> https://github.com/web-platform-tests/wpt/pull/6102
>

That sounds very doable. I added you to the reviewers team to make this
easier and commented here:
https://github.com/web-platform-tests/wpt/pull/6102#issuecomment-561115470