Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-10 Thread Stephen Chenney
There is a significant practical problem with turning the tree red and
working with someone to rebaseline the tests. It takes multiple hours for
some bots to build and test a given patch. That means, at any moment, you
may have tens, and in some cases hundreds, of failing tests associated with
some changelist that you need to track on the bots. You might have more
failing tests associated with a different changelist, and so on.

How do you propose to track which baselines have been computed by which
bots for which changelists? In the past I have seen failing tests because
baselines were checked in too soon and some bot had not generated new
baselines. Things are even worse when a change goes in late in the day. How
will the pending rebaselines be handed off to the new gardener?

As the Chromium owner of tests that are frequently broken by third-party
contributors (they do their best to help us, but cannot generate all 15 or
so Chromium expectations), I would much rather have a bug filed against me
and a line in the expectations file. At least I then have visibility into
what's happening and a chance to verify the results. In a recent pass to
clean up expectations we found multiple cases where the wrong result had
been committed because the person who did it failed to realize the result
was wrong. While you might say "fix the test to make it obvious," it is not
obvious how to do that for all tests.

Why not simply attach an owner and a resolution date to each expectation?
The real problem right now is accountability and a way to remind people
that they have left expectations hanging.
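
To illustrate (the annotation syntax here is my invention, and the bug
number and test path are made up; the expectations syntax is the current
Chromium one, roughly), an annotated entry could look something like:

    // Owner: schen...@chromium.org  Resolve-by: 2012-04-24
    BUGWK80123 MAC : fast/dom/example.html = IMAGE

A simple script could then nag owners whose dates have passed.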

Cheers,
Stephen.

On Mon, Apr 9, 2012 at 9:19 PM, Julien Chaffraix jchaffr...@webkit.org wrote:

  If there's consensus in the meantime that it is better on balance to
  check in suppressions, perhaps we can figure out a better way to do
  that. Maybe (shudder) a second test_expectations file? Or maybe it
  would be better to actually check in suppressions marked as REBASELINE
  (or something like that)?
 
  That sounds quirky as it involves maintaining 2 sets of files.
 
  From my perspective, saying that we should discard the EWS result and
  allow changes to get in WebKit trunk, knowing they will turn the bots
  red, is a bad proposal regardless of how you justify it. In the small
  delta where the bots are red, you can bet people will miss something
  else that breaks.
 
 
  As Ryosuke points out, practically we're already in that situation -
  from what I can tell, the tree is red virtually all of the time, at
  least during US/Pacific working hours. It's not clear to me if the EWS
  has made this better or worse, but perhaps others have noticed a
  difference. That said, I doubt I like red trees any more than you do
  :)

 I wasn't talking about the tree's status quo here as it shouldn't
 impact the discussion. Just because the tree is red doesn't mean it's
 the right thing (tm) to throw in the towel and make it more red (even
 if we seem to agree on the badness of that :)).

 To add some thoughts here, saying that the tree is red covers several
 states (failing tests, not building...) and IMHO the EWS has at least
 helped on the building side. As far as the tests go, a lot of
 platform differences are unfortunately only uncovered on the bots.

 Thanks,
 Julien


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-10 Thread Ryosuke Niwa
On Tue, Apr 10, 2012 at 6:10 AM, Stephen Chenney schen...@chromium.org wrote:

 There is a significant practical problem with turning the tree red and
 working with someone to rebaseline the tests. It takes multiple hours for
 some bots to build and test a given patch. That means, at any moment, you
 may have tens, and in some cases hundreds, of failing tests associated with
 some changelist that you need to track on the bots. You might have more
 failing tests associated with a different changelist, and so on.


But you have to do this for non-Chromium ports anyway because they don't
use test_expectations.txt, and skipping the tests won't help you generate
new baselines. In my opinion, we should not further diverge from the way
things are done in other ports.

How do you propose to track which baselines have been computed by which
 bots for which changelists? In the past I have seen failing tests because
 baselines were checked in too soon and some bot had not generated new
 baselines. Things are even worse when a change goes in late in the day. How
 will the pending rebaselines be handed off to the new gardener?


You have to stick around as long as it takes to rebaseline, or notify the
relevant port contributors, after your patch lands.

http://www.webkit.org/coding/contributing.html clearly says, under
"Keeping the tree green":

In either case, your responsibility for the patch does not end with the
patch landing in the tree. There may be regressions from your change or
additional feedback from reviewers after the patch has landed. You can
watch the tree at build.webkit.org to make sure your patch builds and
passes tests on all platforms. It is your responsibility to be available
should regressions arise and to respond to additional feedback that happens
after a check-in.

Changes should succeed on all platforms, but it can be difficult to test on
every platform WebKit supports. Be certain that your change does not
introduce new test failures on the high-traffic Mac or Windows ports by
comparing the list of failing tests before and after your change. Your
change must at least compile on all platforms.

 As the Chromium owner of tests that are frequently broken by third-party
 contributors (they do their best to help us, but cannot generate all 15 or
 so Chromium expectations), I would much rather have a bug filed against me
 and a line in the expectations file. At least I then have visibility into
 what's happening and a chance to verify the results. In a recent pass to
 clean up expectations we found multiple cases where the wrong result had
 been committed because the person who did it failed to realize the result
 was wrong. While you might say "fix the test to make it obvious," it is not
 obvious how to do that for all tests.


This is an orthogonal issue to adding failing expectations prior to landing
your patch. It's about how rebaselining should be done. I agree that
rebaselining a test without understanding what the test intends to test, or
what the correct output is, is bad. But let's not mix that discussion into
this thread.

Why not simply attach an owner and a resolution date to each expectation?
 The real problem right now is accountability and a way to remind people
 that they have left expectations hanging.


That's what WebKit bugs are for. Ossy frequently files a bug and cc's the
patch author when a new test is added or a test starts failing and he
doesn't know whether the new result is correct or not. He also either skips
the test or rebaselines the test as needed. He also reverts patches when
the patch clearly introduced serious regressions (e.g. crashes on hundreds
of tests).

- Ryosuke


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-10 Thread Ryosuke Niwa
On Tue, Apr 10, 2012 at 11:42 AM, Stephen Chenney schen...@chromium.org wrote:

 On Tue, Apr 10, 2012 at 1:00 PM, Ryosuke Niwa rn...@webkit.org wrote:

 On Tue, Apr 10, 2012 at 6:10 AM, Stephen Chenney 
 schen...@chromium.org wrote:

 There is a significant practical problem with turning the tree red and
 working with someone to rebaseline the tests. It takes multiple hours for
 some bots to build and test a given patch. That means, at any moment, you
 may have tens, and in some cases hundreds, of failing tests associated with
 some changelist that you need to track on the bots. You might have more
 failing tests associated with a different changelist, and so on.


 But you have to do this for non-Chromium ports anyway because they don't
 use test_expectations.txt, and skipping the tests won't help you generate
 new baselines. In my opinion, we should not further diverge from the way
 things are done in other ports.


 How long on average does it take a builder to get through a change on
 another port? Right now the Chromium Mac 10.5 and 10.6 dbg builds are
 processing a patch from about 3 hours ago. About 20 patches have gone in
 since then. For the Mac 10.5 tree to ever be green would require there
 being no changes at all requiring new baselines for a 3-hour window.

 Just because other teams do it some way does not mean that Chromium, with
 its greater number of bots and platforms, should do it the same way.


Yes, it does mean that we should do it the same way. What if non-Chromium
ports started imposing arbitrary processes like this on the rest of us?
It'll be total chaos, and nobody would understand the right thing to do
for all ports.


 We are discussing a process here, not code, and in my mind the goal is to
 have the tree be as green as possible, with all failures tracked, a
 minimal expectations file, and as little engineer time spent as possible.


That's not our project goal. We have continuous builds and regression tests
to prevent regressions and improve stability, not to keep bots green.
Please review http://www.webkit.org/projects/goals.html

Just look at how often the non-Chromium Mac and Win builds are red. In
 particular, changes submitted via the commit queue take an indeterminate
 amount of time to go in, anything from an hour to several hours. Patch
 authors do not necessarily even have control over when the CQ+ is given.


That's why I don't use commit queue when I know my patch requires
platform-dependent rebaselines.


 Even when manually committing, if it takes 3 hours to create baselines
 then no patches can land in the afternoon. What if the bots are down or
 misbehaving?


We need to promptly fix those bots.

I would also point out the waste of resources when every contributor needs
 to track every failure around commit time in order to know when their own
 changes cause failures, and then track the bots to know when they are free
 to go home.


But that's clearly stated in the contribution guidelines.

 Why not simply attach an owner and a resolution date to each expectation?
 The real problem right now is accountability and a way to remind people
 that they have left expectations hanging.


 That's what WebKit bugs are for. Ossy frequently files a bug and cc's
 the patch author when a new test is added or a test starts failing and he
 doesn't know whether the new result is correct or not. He also either skips
 the test or rebaselines the test as needed. He also reverts patches when the
 patch clearly introduced serious regressions (e.g. crashes on hundreds of
 tests).


 Yes, Ossy does an excellent job of gardening. Unfortunately, on Chrome we
 have tens if not hundreds of gardeners and, as this thread has revealed, no
 clear agreement on the best way to garden.


That IS the problem. We have too many inexperienced gardeners who don't
understand the WebKit culture or the WebKit process.

I strongly believe that keeping the tree green is more important than
 having a clean expectations file.


I disagree. You're effectively just disabling the test temporarily.

Finally, there is no pain-free way to do this. The question is how to
 distribute the pain. Right now each gardener is using a process that
 distributes pain in their preferred way. From a community standpoint it
 would be nice if the Chromium team could come up with something consistent.


The process Chromium port uses should be consistent with non-Chromium ports.

- Ryosuke


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-10 Thread Ojan Vafai
I don't think we can come up with a hard and fast rule given current
tooling. In a theoretical future world in which it's easy to get expected
results off the EWS bots (or some other infrastructure), it would be
reasonable to expect people to incorporate the correct expected results for
any EWS-having ports before committing the patch. I expect we'd all agree
that would be better than turning the bots red or adding to
test_expectations.txt/Skipped files.

In the current world, it's a judgement call. If I expect a patch to need a
lot of platform-specific baselines, I'll make sure to commit it at a time
when I have hours to spare to clean up any failures or, if I can't stick
around for the bots to cycle, I'll add it to test_expectations.txt
appropriately.

Both approaches have nasty tradeoffs. It is probably worth writing up a
wiki page outlining these two options and explaining why you might do one
or the other for people new to the project, but I don't see benefit in
trying to pick a hard rule that everyone must follow.

Ojan

On Tue, Apr 10, 2012 at 11:58 AM, Ryosuke Niwa rn...@webkit.org wrote:

 On Tue, Apr 10, 2012 at 11:42 AM, Stephen Chenney 
 schen...@chromium.org wrote:

 On Tue, Apr 10, 2012 at 1:00 PM, Ryosuke Niwa rn...@webkit.org wrote:

 On Tue, Apr 10, 2012 at 6:10 AM, Stephen Chenney 
 schen...@chromium.org wrote:

 There is a significant practical problem with turning the tree red and
 working with someone to rebaseline the tests. It takes multiple hours for
 some bots to build and test a given patch. That means, at any moment, you
 may have tens, and in some cases hundreds, of failing tests associated with
 some changelist that you need to track on the bots. You might have more
 failing tests associated with a different changelist, and so on.


 But you have to do this for non-Chromium ports anyway because they don't
 use test_expectations.txt, and skipping the tests won't help you generate
 new baselines. In my opinion, we should not further diverge from the way
 things are done in other ports.


 How long on average does it take a builder to get through a change on
 another port? Right now the Chromium Mac 10.5 and 10.6 dbg builds are
 processing a patch from about 3 hours ago. About 20 patches have gone in
 since then. For the Mac 10.5 tree to ever be green would require there
 being no changes at all requiring new baselines for a 3-hour window.

 Just because other teams do it some way does not mean that Chromium, with
 its greater number of bots and platforms, should do it the same way.


 Yes, it does mean that we should do it the same way. What if non-Chromium
 ports started imposing arbitrary processes like this on the rest of us?
 It'll be total chaos, and nobody would understand the right thing to do
 for all ports.


 We are discussing a process here, not code, and in my mind the goal is to
 have the tree be as green as possible, with all failures tracked, a
 minimal expectations file, and as little engineer time spent as possible.


 That's not our project goal. We have continuous builds and regression
 tests to prevent regressions and improve stability, not to keep bots
 green. Please review http://www.webkit.org/projects/goals.html

 Just look at how often the non-Chromium Mac and Win builds are red. In
 particular, changes submitted via the commit queue take an indeterminate
 amount of time to go in, anything from an hour to several hours. Patch
 authors do not necessarily even have control over when the CQ+ is given.


 That's why I don't use commit queue when I know my patch requires
 platform-dependent rebaselines.


 Even when manually committing, if it takes 3 hours to create baselines
 then no patches can land in the afternoon. What if the bots are down or
 misbehaving?


 We need to promptly fix those bots.

 I would also point out the waste of resources when every contributor needs
 to track every failure around commit time in order to know when their own
 changes cause failures, and then track the bots to know when they are free
 to go home.


 But that's clearly stated in the contribution guidelines.

  Why not simply attach an owner and a resolution date to each
 expectation? The real problem right now is accountability and a way to
 remind people that they have left expectations hanging.


 That's what WebKit bugs are for. Ossy frequently files a bug and cc's
 the patch author when a new test is added or a test starts failing and he
 doesn't know whether the new result is correct or not. He also either skips
 the test or rebaselines the test as needed. He also reverts patches when the
 patch clearly introduced serious regressions (e.g. crashes on hundreds of
 tests).


 Yes, Ossy does an excellent job of gardening. Unfortunately, on Chrome we
 have tens if not hundreds of gardeners and, as this thread has revealed, no
 clear agreement on the best way to garden.


 That IS the problem. We have too many inexperienced gardeners who don't
 understand the WebKit culture or the WebKit process.

Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-10 Thread Ryosuke Niwa
On Tue, Apr 10, 2012 at 12:29 PM, Ojan Vafai o...@chromium.org wrote:

 I don't think we can come up with a hard and fast rule given current
 tooling. In a theoretical future world in which it's easy to get expected
 results off the EWS bots (or some other infrastructure), it would be
 reasonable to expect people to incorporate the correct expected results for
 any EWS-having ports before committing the patch. I expect we'd all agree
 that would be better than turning the bots red or adding to
 test_expectations.txt/Skipped files.

 In the current world, it's a judgement call. If I expect a patch to need a
 lot of platform-specific baselines, I'll make sure to commit it at a time
 when I have hours to spare to clean up any failures or, if I can't stick
 around for the bots to cycle, I'll add it to test_expectations.txt
 appropriately.

 Both approaches have nasty tradeoffs. It is probably worth writing up a
 wiki page outlining these two options and explaining why you might do one
 or the other for people new to the project, but I don't see benefit in
 trying to pick a hard rule that everyone must follow.


I agree. Both approaches have pros and cons. It will be good to document
them on the wiki as you say.

- Ryosuke


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-10 Thread Dirk Pranke
I agree with Ojan. It's clear that there are arguments for both
approaches and my initial note did not address all the situations that
come up. I will write up something further and put it on the wiki.

I will also continue mulling over what sorts of changes we could make to
the tools in the short term to make things better.

-- Dirk

On Tue, Apr 10, 2012 at 12:29 PM, Ojan Vafai o...@chromium.org wrote:
 I don't think we can come up with a hard and fast rule given current
 tooling. In a theoretical future world in which it's easy to get expected
 results off the EWS bots (or some other infrastructure), it would be
 reasonable to expect people to incorporate the correct expected results for
 any EWS-having ports before committing the patch. I expect we'd all agree
 that would be better than turning the bots red or adding to
 test_expectations.txt/Skipped files.

 In the current world, it's a judgement call. If I expect a patch to need a
 lot of platform-specific baselines, I'll make sure to commit it at a time
 when I have hours to spare to clean up any failures or, if I can't stick
 around for the bots to cycle, I'll add it to test_expectations.txt
 appropriately.

 Both approaches have nasty tradeoffs. It is probably worth writing up a wiki
 page outlining these two options and explaining why you might do one or the
 other for people new to the project, but I don't see benefit in trying to
 pick a hard rule that everyone must follow.

 Ojan

 On Tue, Apr 10, 2012 at 11:58 AM, Ryosuke Niwa rn...@webkit.org wrote:

 On Tue, Apr 10, 2012 at 11:42 AM, Stephen Chenney schen...@chromium.org
 wrote:

 On Tue, Apr 10, 2012 at 1:00 PM, Ryosuke Niwa rn...@webkit.org wrote:

 On Tue, Apr 10, 2012 at 6:10 AM, Stephen Chenney schen...@chromium.org
 wrote:

 There is a significant practical problem with turning the tree red and
 working with someone to rebaseline the tests. It takes multiple hours for
 some bots to build and test a given patch. That means, at any moment, you
 may have tens, and in some cases hundreds, of failing tests associated with
 some changelist that you need to track on the bots. You might have more
 failing tests associated with a different changelist, and so on.


 But you have to do this for non-Chromium ports anyway because they don't
 use test_expectations.txt, and skipping the tests won't help you generate
 new baselines. In my opinion, we should not further diverge from the way
 things are done in other ports.


 How long on average does it take a builder to get through a change on
 another port? Right now the Chromium Mac 10.5 and 10.6 dbg builds are
 processing a patch from about 3 hours ago. About 20 patches have gone in
 since then. For the Mac 10.5 tree to ever be green would require there
 being no changes at all requiring new baselines for a 3-hour window.

 Just because other teams do it some way does not mean that Chromium, with
 its greater number of bots and platforms, should do it the same way.


 Yes, it does mean that we should do it the same way. What if non-Chromium
 ports started imposing arbitrary processes like this on the rest of us?
 It'll be total chaos, and nobody would understand the right thing to do
 for all ports.


 We are discussing a process here, not code, and in my mind the goal is to
 have the tree be as green as possible, with all failures tracked, a
 minimal expectations file, and as little engineer time spent as possible.


 That's not our project goal. We have continuous builds and regression
 tests to prevent regressions and improve stability, not to keep bots
 green. Please review http://www.webkit.org/projects/goals.html

 Just look at how often the non-Chromium Mac and Win builds are red. In
 particular, changes submitted via the commit queue take an indeterminate
 amount of time to go in, anything from an hour to several hours. Patch
 authors do not necessarily even have control over when the CQ+ is given.


 That's why I don't use commit queue when I know my patch requires
 platform-dependent rebaselines.


 Even when manually committing, if it takes 3 hours to create baselines
 then no patches can land in the afternoon. What if the bots are down or
 misbehaving?


 We need to promptly fix those bots.

 I would also point out the waste of resources when every contributor
 needs to track every failure around commit time in order to know when their
 own changes cause failures, and then track the bots to know when they are
 free to go home.


 But that's clearly stated in the contribution guidelines.

 Why not simply attach an owner and a resolution date to each
 expectation? The real problem right now is accountability and a way to
 remind people that they have left expectations hanging.


 That's what WebKit bugs are for. Ossy frequently files a bug and cc's
 the patch author when a new test is added or a test starts failing and he
 doesn't know whether the new result is correct or not. He also either skips
 the test or rebaselines the test as needed. He also reverts patches when the
 patch clearly introduced serious regressions (e.g. crashes on hundreds of
 tests).

[webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Dirk Pranke
Hi all,

Recently I've noticed more people making changes and adding test
failure suppressions to various ports' test_expectations.txt files.
This is great!

However, I don't think we have an agreement over what the best
practices are here, so I thought I'd list out what I thought they
were, and others can comment / correct me as necessary:

1) Don't mix test_expectations.txt files and Skipped files. This is
really confusing to everyone involved ... your port should use one or
the other where possible. (*)
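
For reference, the two mechanisms look quite different on disk. Roughly
(paths and bug number invented for illustration), a Skipped file lists one
test path per line, while a test_expectations.txt entry carries modifiers
and expected outcomes:

    # platform/mac/Skipped: don't run the test at all
    fast/css/example.html

    // test_expectations.txt: run it, but expect a text failure on Mac
    BUGWK80123 MAC : fast/css/example.html = TEXT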

2) Entries in the test_expectations files should be short-lived. We
should be checking in what we believe the current output is,
*regardless of whether it's right or wrong*. A different way of
putting this is that we are more interested in changes in behavior
(regressions) rather than correctness. If you expect a test to be
failing for a long time in the same consistent way, check in the
failure and file a bug. Don't use test_expectations to suppress this.
(**)
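
Concretely (flag name from memory, test path invented), that workflow is
something like:

    # regenerate the expected files with the current, possibly wrong, output
    new-run-webkit-tests --reset-results fast/css/example.html
    # inspect the new -expected.txt, land it, and file a bug about the failure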

3) Don't use test_expectations.txt to suppress failures across a
single cycle of the bot, just so you can gather updated baselines
without the tree going red. While it might seem that you're doing tree
maintainers a favor, in my experience this just makes things confusing
and it's better to let the tree go red until you can actually
rebaseline things. It's too easy to add a suppression "for now" and
then forget about it.

These rules are not set in stone; they're just what I try to do. If
people think there are better ways of doing this, please speak up! If
there is consensus, I'll update the wikis accordingly.

Thanks!

-- Dirk

(*) I have an outstanding to-do to modify new-run-webkit-tests to support
a better way of tracking expectations, merging the inheritance/cascade
aspects of Skipped files with the flexibility in types of failures
that you get from expectations. Eventually we should have a mechanism
that replaces both, but for now, we don't. See
https://bugs.webkit.org/show_bug.cgi?id=83508 .

(**) Note that Chromium has not historically worked this way -- we
suppress failures rather than check in failing output -- but I believe
many / most of the Chromium gardeners have come to believe that this
is not the way things should work and we'd be better off checking in
the failing output instead. Note that it may make sense to do
different things based on your port's maturity, so you suppress things
while bringing a port up, and stop suppressing once your port is
stable.


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Dirk Pranke
On Mon, Apr 9, 2012 at 3:02 PM, James Robinson jam...@google.com wrote:
 3) Don't use test_expectations.txt to suppress failures across a
 single cycle of the bot, just so you can gather updated baselines
 without the tree going red. While it might seem that you're doing tree
 maintainers a favor, in my experience this just makes things confusing
 and it's better to let the tree go red until you can actually
 rebaseline things. It's too easy to add a suppression "for now" and
 then forget about it.


 How would you suggest someone contribute a patch that changes layout test
 results, if not by marking the test as failing?


Both this and Dana's response are good questions which I don't have
great answers for. Let me explain my thinking in a little more detail.

In my ideal world, you would be able to get updated baselines *prior*
to trying to land a patch. This is of course not really possible today
for any test that fails on multiple ports with different results, as
it's practically impossible to run more than a couple of ports by
hand, and we don't have a bot infrastructure to help you here.

In my closer-to-the-real-world ideal, the test_expectations.txt file
would be *empty*, all of the ports would be using it, and then
adding in suppressions for tests you expect to fail in your patch
would be a good signal.

In practice, however, marking a test as failing has a few problems:

1) if you don't get the list of suppressions right, then what shows up
on the bots can be confusing: you don't realize that more (or
different) tests are failing than the ones you suppressed. Also, you
can see tests fail on bots that you didn't account for because they
don't use the expectations at all (e.g., you suppressed Chromium
correctly, but the Apple bots go red).

2) adding suppressions to test_expectations.txt increases contention
on the test_expectations.txt file, confusing the gardeners /
bot-watchers and making merges harder. It is practically difficult to
tell the difference between "I just added these lines as a temporary
suppression" and "I added these lines as a suppression two weeks ago
but then forgot about them", and as Ojan will tell you, you end up
with a bunch of lines in the file that just have a comment saying
"just needs rebaselining", but you have no idea if those statements
are still true or not.
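
For example (entries invented), nothing in the file distinguishes a line
added an hour ago from one added two weeks ago:

    // Just needs rebaselining.
    BUGWK80001 MAC : fast/dom/one.html = TEXT
    // Just needs rebaselining.
    BUGWK80002 MAC : fast/dom/two.html = TEXT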

3) the current tool-of-the-day for managing rebaselining,
garden-o-matic, is much better suited to handling the unexpected
failures that show up on the bots than the expected failures you've
introduced.

That said, as Dana points out, my proposal makes it hard if not
impossible for the EWS to work at all; I'm biased because I hardly
ever post changes that are expected to introduce failures on the EWS
bots but should still land. I don't know what to do about this,
although perhaps moving to my near-ideal model might work.

If there's consensus in the meantime that it is better on balance to
check in suppressions, perhaps we can figure out a better way to do
that. Maybe (shudder) a second test_expectations file? Or maybe it
would be better to actually check in suppressions marked as REBASELINE
(or something like that)?
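
(For concreteness, and assuming the old-style expectations syntax with a
made-up bug number and path, such an entry might read

    BUGWK80123 REBASELINE MAC : fast/css/example.html = TEXT

with the marker telling tools and gardeners that the line should only live
until new baselines land.)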

-- Dirk


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Robert Hogan
On Monday 09 April 2012 22:42:33 Dirk Pranke wrote:
 Hi all,
 
 Recently I've noticed more people making changes and adding test
 failure suppressions to various ports' test_expectations.txt files.
 This is great!

Hi Dirk,

It would be good to have a page describing the right thing to do for each
of the ports for each of the following situations:

- Adding a new test with render tree/pixel results
- Changing the render tree/pixel results for existing tests

I think the correct practice will differ between ports as only the Chromium
bots (AFAIK) check pixel results. So this has the unfortunate consequence
that when you change the expected pixel result for a test that has pixel
results on all platforms, you end up with stale png results in the tree for
the platforms whose results you are not in a position either to generate
yourself or to persuade someone on #webkit to generate for you.

Thanks,
Robert



Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Julien Chaffraix
 In my ideal world, you would be able to get updated baselines *prior*
 to trying to land a patch. This is of course not really possible today
 for any test that fails on multiple ports with different results, as
 it's practically impossible to run more than a couple of ports by
 hand, and we don't have a bot infrastructure to help you here.

That would be the best outcome indeed; however, it would more or less
require all the EWS bots to be able to run the tests (including pixel
tests for the platforms that enable them).

 3) the current tool-of-the-day for managing rebaselining,
 garden-o-matic, is much better suited to handling the unexpected
 failures that show up on the bots than the expected failures you've
 introduced.

You could see it the other way: how could we make garden-o-matic
handle newly added suppressions better? Maybe some new sub-tool listing
the newly added suppressions would help? Ignore the suppressions added
more than XX hours ago?
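
A minimal sketch of such a sub-tool, assuming a git checkout and that the
suppressions live in a single test_expectations.txt (the cutoff and the
treatment of comment lines are placeholders):

    import subprocess
    import sys
    import time

    CUTOFF_HOURS = 24  # the "XX hours" above

    def recent_suppressions(path, cutoff_hours=CUTOFF_HOURS):
        # 'git blame --line-porcelain' emits a 'committer-time <epoch>'
        # header for every line, followed by the line itself prefixed
        # with a tab.
        out = subprocess.check_output(
            ['git', 'blame', '--line-porcelain', path])
        cutoff = time.time() - cutoff_hours * 3600
        entries, stamp = [], None
        for raw in out.decode('utf-8', 'replace').splitlines():
            if raw.startswith('committer-time '):
                stamp = int(raw.split()[1])
            elif raw.startswith('\t'):
                text = raw[1:].strip()
                # skip blank and '//' comment lines; report entries
                # committed within the cutoff window
                if text and not text.startswith('//') and stamp >= cutoff:
                    entries.append(text)
        return entries

    if __name__ == '__main__':
        for entry in recent_suppressions(sys.argv[1]):
            print(entry)

Garden-o-matic could surface the same information in its UI, of course;
the script is just the flavor of the thing.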

Your issue seems to be with suppressions sticking around in the tree,
not with suppressing in itself. Our current tool (garden-o-matic)
handles failures a lot better than it handles suppressions.

 If there's consensus in the meantime that it is better on balance to
 check in suppressions, perhaps we can figure out a better way to do
 that. Maybe (shudder) a second test_expectations file? Or maybe it
 would be better to actually check in suppressions marked as REBASELINE
 (or something like that)?

That sounds quirky as it involves maintaining 2 sets of files.

From my perspective, saying that we should discard the EWS result and
allow changes to get in WebKit trunk, knowing they will turn the bots
red, is a bad proposal regardless of how you justify it. In the small
delta where the bots are red, you can bet people will miss something
else that breaks.

Thanks,
Julien


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Ryosuke Niwa
On Mon, Apr 9, 2012 at 3:50 PM, Julien Chaffraix jchaffr...@webkit.org wrote:

   If there's consensus in the meantime that it is better on balance to
  check in suppressions, perhaps we can figure out a better way to do
  that. Maybe (shudder) a second test_expectations file? Or maybe it
  would be better to actually check in suppressions marked as REBASELINE
  (or something like that)?

 That sounds quirky as it involves maintaining 2 sets of files.

 From my perspective, saying that we should discard the EWS result and
 allow changes to get in WebKit trunk, knowing they will turn the bots
 red, is a bad proposal regardless of how you justify it. In the small
 delta where the bots are red, you can bet people will miss something
 else that breaks.


But that's the status quo. Until we get to the ideal world where every
port's EWS at least runs tests, there's no way around it, since we can't
possibly require all contributors to run tests on all ports and platforms.

Also, adding new lines to test_expectations.txt will only work on the
Chromium port, since other ports use Skipped files instead, and skipping a
test won't give you a new baseline on the bots. This discrepancy between
Chromium and non-Chromium ports imposes a significant cognitive stress on
contributors and is not justifiable in my opinion.

- Ryosuke


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Dirk Pranke
On Mon, Apr 9, 2012 at 3:50 PM, Julien Chaffraix jchaffr...@webkit.org wrote:
 In my ideal world, you would be able to get updated baselines *prior*
 to trying to land a patch. This is of course not really possible today
 for any test that fails on multiple ports with different results, as
 it's practically impossible to run more than a couple of ports by
 hand, and we don't have a bot infrastructure to help you here.

 That would be the best outcome indeed; however, it would more or less
 require all the EWS bots to be able to run the tests (including pixel
 tests for the platforms that enable them).


Yes. Probably even harder, we'd have to build tooling to extract the
results off of the bots and merge them into your patch.

 3) the current tool-of-the-day for managing rebaselining,
 garden-o-matic, is much better suited to handling the unexpected
 failures that show up on the bots than the expected failures you've
 introduced.

 You could see it the other way: how could we make garden-o-matic
 handle newly added suppressions better? Maybe some new sub-tool listing
 the newly added suppressions would help? Ignore the suppressions added
 more than XX hours ago?


Yes, that is similar.

 Your issue seems to be with suppressions sticking around in the tree,
 not with suppressing in itself. Our current tool (garden-o-matic)
 handles failures a lot better than it handles suppressions.


Yes.

 If there's consensus in the meantime that it is better on balance to
 check in suppressions, perhaps we can figure out a better way to do
 that. Maybe (shudder) a second test_expectations file? Or maybe it
 would be better to actually check in suppressions marked as REBASELINE
 (or something like that)?

 That sounds quirky as it involves maintaining 2 sets of files.

 From my perspective, saying that we should discard the EWS result and
 allow changes to get in WebKit trunk, knowing they will turn the bots
 red, is a bad proposal regardless of how you justify it. In the small
 delta where the bots are red, you can bet people will miss something
 else that breaks.


As Ryosuke points out, practically we're already in that situation -
from what I can tell, the tree is red virtually all of the time, at
least during US/Pacific working hours. It's not clear to me if the EWS
has made this better or worse, but perhaps others have noticed a
difference. That said, I doubt I like red trees any more than you do
:)

I had not considered the EWS implications when I wrote the initial
note, so I haven't decided if (or how) to revise my opinions yet.

-- Dirk


Re: [webkit-dev] handling failing tests (test_expectations, Skipped files, etc.)

2012-04-09 Thread Julien Chaffraix
 If there's consensus in the meantime that it is better on balance to
 check in suppressions, perhaps we can figure out a better way to do
 that. Maybe (shudder) a second test_expectations file? Or maybe it
 would be better to actually check in suppressions marked as REBASELINE
 (or something like that)?

 That sounds quirky as it involves maintaining 2 sets of files.

 From my perspective, saying that we should discard the EWS result and
 allow changes to get in WebKit trunk, knowing they will turn the bots
 red, is a bad proposal regardless of how you justify it. In the small
 delta where the bots are red, you can bet people will miss something
 else that breaks.


 As Ryosuke points out, practically we're already in that situation -
 from what I can tell, the tree is red virtually all of the time, at
 least during US/Pacific working hours. It's not clear to me if the EWS
 has made this better or worse, but perhaps others have noticed a
 difference. That said, I doubt I like red trees any more than you do
 :)

I wasn't talking about the tree's status quo here as it shouldn't
impact the discussion. Just because the tree is red doesn't mean it's
the right thing (tm) to throw in the towel and make it more red (even
if we seem to agree on the badness of that :)).

To add some thoughts here, saying that the tree is red covers several
states (failing tests, not building...) and IMHO the EWS has at least
helped on the building side. As far as the tests go, a lot of
platform differences are unfortunately only uncovered on the bots.

Thanks,
Julien