Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-17 Thread Ryosuke Niwa
On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa rn...@webkit.org wrote:
  On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org
 wrote:
 
  On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org
 
  wrote:
   I agree with the priorities above, at least. I also agree with the
   overall
   goal of making our implementation match our philosophy on testing.
  
   Ryosuke has raised a very valid point: it is not clear what a Gardener
   should do with a test that has changed behavior. Personally, the area
 I
   deal
   with is very susceptible to the single pixel differences matter
 issue,
   and
   I would prefer gardeners to flag updates in some way so that I can
 look
   at
   it myself (or one of the very few other experts can look at it). Maybe
   I'm
   just a control freak.
  
   In the current situation, gardeners are not likely to roll out a patch
   short
   of build failures, obvious test or perf regressions, or security
 issues.
   For
   the bulk of cases where a test result has changed, the gardener will
   either
   add it to expectations or go ahead and update the result.
  
   The result is a mixture of incorrect expected results and files listed
   in
   TestExpectations that may or may not be correct. The way I deal with
   this
   right now is to maintain a snapshot of TestExpectations and
 occasionally
   go
   through and look for newly added results, and check if they are in
 fact
   correct. And every now and then go look at older entries to see if
 they
   have
   been updated. Or I get lucky and notice that a new expected result has
   been
   checked in that is incorrect (I can mostly catch that by following
   checkins
   in the code).
  
   The proposed change does not alter this fundamental dynamic. Depending
   on
   what policy gardeners adopt, they might now rename the result as
 failing
   and
   remove the expected, or maybe they'll just update the expected. I'm
   still in
   a situation where I don't know the exact status of any given result,
 and
   where I have no easy way of knowing if a gardener has correctly
 updated
   an
   expected or failing image.
 
  So, if we had -passing/-failing, then people who knew what the results
  were supposed to be in a given directory (call them owners) could
  rename the existing -expected files into one category or another, and
  then a gardener could add newly-failing tests as -expected; it would
  then be trivial for the owner to look for new -expected files in that
  directory.
 
  Right now, an owner would have to periodically scan the directory for
  files changed since the last time they scanned the directory, and
  would have no way of knowing whether an updated file was correct or
  not (perhaps you could filter out by who committed the change, but
  it's certainly harder than ls *-expected).
 
  So, I would contend that my proposal would make it easier for us to
  have a process by which gardeners could punt on changes they're unsure
  about and for owners to subsequently review them. I think you could
  have a similar model if we just checked in *all* new results into
  -expected, but it would be harder to implement (perhaps not
  impossible, though, I haven't given it much thought yet).
 
 
  That seems like a much better model.

 To clarify: what does "That" refer to? My idea for gardeners to check
 in -expected and owners to move to -passing/-failing, or the idea
 that we check in new results into -expected and have some other way
 of reviewing new baselines? (or something else)?


The former. In particular, the idea that only experts of a given area can
check in -correct/-failing is promising because there are many cases where
pixel tests fail due to completely unrelated bugs and we want to consider
the result still correct. But we can implement this policy without the said
proposal. All we need to do is have gardeners add test expectations and
wait for experts to come in and decide whether we need a rebaseline, a bug
fix, or a rollout.

On the other hand, the pixel test output that's correct to one expert may
not be correct to another expert. For example, I might think that one
editing test's output is correct because it shows that the feature we're
testing is working properly. But Stephen might realize that this
-expected.png contains an off-by-one Skia bug. So categorizing -correct.png
and -failure.png may require multiple experts looking at the output, which
may or may not be practical.

Furthermore, some areas may not have an adequate number of experts or
experts who can triage -expected.png, which means that we'll further
fragment the process depending on how many experts we have for a given area.

 In a world where all failures are triaged promptly by qualified
  owners, all of this mechanism is unnecessary. Unfortunately, we don't
  live in that world at the moment.
 
 
  I'd argue that this is a bigger problem worth 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-17 Thread Dirk Pranke
On Fri, Aug 17, 2012 at 8:07 AM, Ryosuke Niwa rn...@webkit.org wrote:
 On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa rn...@webkit.org wrote:
  On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org
  wrote:
 
  On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney
  schen...@chromium.org
  wrote:
   I agree with the priorities above, at least. I also agree with the
   overall
   goal of making our implementation match our philosophy on testing.
  
   Ryosuke has raised a very valid point: it is not clear what a
   Gardener
   should do with a test that has changed behavior. Personally, the area
   I
   deal
   with is very susceptible to the single pixel differences matter
   issue,
   and
   I would prefer gardeners to flag updates in some way so that I can
   look
   at
   it myself (or one of the very few other experts can look at it).
   Maybe
   I'm
   just a control freak.
  
   In the current situation, gardeners are not likely to roll out a
   patch
   short
   of build failures, obvious test or perf regressions, or security
   issues.
   For
   the bulk of cases where a test result has changed, the gardener will
   either
   add it to expectations or go ahead and update the result.
  
   The result is a mixture of incorrect expected results and files
   listed
   in
   TestExpectations that may or may not be correct. The way I deal with
   this
   right now is to maintain a snapshot of TestExpectations and
   occasionally
   go
   through and look for newly added results, and check if they are in
   fact
   correct. And every now and then go look at older entries to see if
   they
   have
   been updated. Or I get lucky and notice that a new expected result
   has
   been
   checked in that is incorrect (I can mostly catch that by following
   checkins
   in the code).
  
   The proposed change does not alter this fundamental dynamic.
   Depending
   on
   what policy gardeners adopt, they might now rename the result as
   failing
   and
   remove the expected, or maybe they'll just update the expected. I'm
   still in
   a situation where I don't know the exact status of any given result,
   and
   where I have no easy way of knowing if a gardener has correctly
   updated
   an
   expected or failing image.
 
  So, if we had -passing/-failing, then people who knew what the results
  were supposed to be in a given directory (call them owners) could
  rename the existing -expected files into one category or another, and
  then a gardener could add newly-failing tests as -expected; it would
  then be trivial for the owner to look for new -expected files in that
  directory.
 
  Right now, an owner would have to periodically scan the directory for
  files changed since the last time they scanned the directory, and
  would have no way of knowing whether an updated file was correct or
  not (perhaps you could filter out by who committed the change, but
  it's certainly harder than ls *-expected).
 
  So, I would contend that my proposal would make it easier for us to
  have a process by which gardeners could punt on changes they're unsure
  about and for owners to subsequently review them. I think you could
  have a similar model if we just checked in *all* new results into
  -expected, but it would be harder to implement (perhaps not
  impossible, though, I haven't given it much thought yet).
 
 
  That seems like a much better model.

 To clarify: what does "That" refer to? My idea for gardeners to check
 in -expected and owners to move to -passing/-failing, or the idea
 that we check in new results into -expected and have some other way
 of reviewing new baselines? (or something else)?


 The former. In particular, the idea that only experts of a given area can
 check in -correct/-failing is promising because there are many cases where
 pixel tests fail due to completely unrelated bugs and we want to consider
 the result still correct. But we can implement this policy without the said
 proposal. All we need to do is have gardeners add test expectations and
 wait for experts to come in and decide whether we need a rebaseline, a bug
 fix, or a rollout.

A problem with just adding test expectations is that we lose
visibility into whether the test is failing the same way or a
different way. That is one of the (if not the primary) motivations for
my proposal ...
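
With a checked-in baseline (whether -expected, -passing, or -failing), a
change in how a test fails shows up as a diff against that file, and the
per-test history falls out of the repository. A rough sketch of what an
owner could do today (assuming a git checkout; the test path here is just
an example):

    import subprocess

    def baseline_history(baseline_path):
        """Print the commits that touched this baseline, following renames
        (e.g. an -expected.png that later becomes a -failing.png)."""
        subprocess.check_call(
            ['git', 'log', '--follow', '--oneline', '--', baseline_path])

    baseline_history('LayoutTests/editing/selection/select-all-expected.png')

A suppression line in TestExpectations carries no comparable trail.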


 On the other hand, the pixel test output that's correct to one expert may
 not be correct to another expert. For example, I might think that one
 editing test's output is correct because it shows that the feature we're
 testing is working properly. But Stephen might realize that this
 -expected.png contains an off-by-one Skia bug. So categorizing -correct.png
 and -failure.png may require multiple experts looking at the output, which
 may or may not be practical.

Perhaps. Obviously (a) there's a limit to what you can do here, and
(b) a test that requires multiple 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-17 Thread Dimitri Glazkov
While I don't have Dirk's patience, stamina, and determination to explain
and evangelize the advantages of better awareness of current LayoutTests
coverage, I will chime in to say that I gladly support any effort to
improve it. I am especially excited about the ability to distinguish
failing, passing, and failing-in-a-different-way expectations.

:DG


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-17 Thread Ryosuke Niwa
On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke dpra...@chromium.org wrote:

  On the other hand, the pixel test output that's correct to one expert may
  not be correct to another expert. For example, I might think that one
  editing test's output is correct because it shows that the feature we're
  testing is working properly. But Stephen might realize that this
  -expected.png contains an off-by-one Skia bug. So categorizing -correct.png
  and -failure.png may require multiple experts looking at the output, which
  may or may not be practical.

 Perhaps. Obviously (a) there's a limit to what you can do here, and
 (b) a test that requires multiple experts to verify its correctness
 is, IMO, a bad test :).


With that argument, almost all pixel tests are bad tests because pixel
tests in editing, for example, involve editing, rendering, and graphics
code. I don't think any single person can comprehend the entire stack to
tell with 100% confidence that the test result is exactly and precisely
correct.

  I think we should just check in whatever result we're
  currently seeing as -expected.png because then at least we wouldn't have any
  ambiguity in the process. We just check in whatever we're currently
  seeing, file a bug if we see a problem with the new result, and possibly
  roll out the patch after talking with the author/reviewer.

 This is basically saying we should just follow the existing
 non-Chromium process, right?


Yes. In addition, doing so will significantly reduce the complexity of the
current process.

This would seem to bring us back to step
 1: it doesn't address the problem that I identified with the existing
 non-Chromium process, namely that a non-expert can't tell by looking
 at the checkout what tests are believed to be passing or not.


What is the use case of this? I've been working on WebKit for more than 3
years, and I've never had to think about whether a test for an area outside
of my expertise has the correct output or not, other than when I was
gardening. And having -correct / -failing wouldn't have helped me know
what the correct output was when I was gardening anyway, because the new
output may just as well be a new -correct or -failing result.

I don't think searching bugzilla (as it is currently used) is a
 workable alternative.


Why not? Bugzilla is the tool we use to triage and track bugs. I don't see
a need for an alternative method to keep track of bugs.

  The new result we check in may not be 100% right but experts — e.g. me
 for
  editing and Stephen for Skia — can come in and look at recent changes to
  triage any new failures.
 
  In fact, it might be worthwhile for us to invest our time in improving
  tools to do this. For example, can we have a portal where I can see new
  rebaselines that happened in LayoutTests/editing and
  LayoutTests/platform/*/editing since the last time I visited the portal?
  e.g. it could show a chronological timeline of baselines along with a
  possible cause (a list of changesets, maybe?) for each rebaseline.

 We could build such a portal, sure. I would be interested to hear from
 others whether such a thing would be more or less useful than my
 proposal.

 Of course, you could just set up a watchlist for new expectations
 today. Probably not quite as polished as we could get with a portal,
 but dirt simple ..


That might be useful as long as it has an option to give us a digest
instead of sending an e-mail per commit.
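
For concreteness, the kind of digest I have in mind is roughly the
following, run over a checkout once a day or so (the watched paths, the
baseline suffixes, and the output format here are just placeholders):

    import subprocess
    from collections import defaultdict

    WATCHED_DIRS = ['LayoutTests/editing']                # placeholder paths
    BASELINE_SUFFIXES = ('-expected.png', '-expected.txt',
                         '-passing.png', '-failing.png')  # proposed suffixes

    def recent_baseline_changes(since='7 days ago'):
        """Map each recent commit to the baseline files it touched."""
        out = subprocess.check_output(
            ['git', 'log', '--since=' + since, '--name-only',
             '--pretty=format:@@%h %ad %s', '--date=short', '--'] + WATCHED_DIRS)
        changes = defaultdict(list)
        commit = None
        for line in out.decode('utf-8', 'replace').splitlines():
            if line.startswith('@@'):
                commit = line[2:]
            elif commit and line.strip().endswith(BASELINE_SUFFIXES):
                changes[commit].append(line.strip())
        return changes

    for commit, files in recent_baseline_changes().items():
        print(commit)
        for name in files:
            print('    ' + name)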

- Ryosuke


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-17 Thread Dirk Pranke
On Fri, Aug 17, 2012 at 11:29 AM, Ryosuke Niwa rn...@webkit.org wrote:
 On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke dpra...@chromium.org wrote:

  On the other hand, the pixel test output that's correct to one expert may
  not be correct to another expert. For example, I might think that one
  editing test's output is correct because it shows that the feature we're
  testing is working properly. But Stephen might realize that this
  -expected.png contains an off-by-one Skia bug. So categorizing -correct.png
  and -failure.png may require multiple experts looking at the output, which
  may or may not be practical.

 Perhaps. Obviously (a) there's a limit to what you can do here, and
 (b) a test that requires multiple experts to verify its correctness
 is, IMO, a bad test :).


 With that argument, almost all pixel tests are bad tests because pixel tests
 in editing, for example, involve editing, rendering, and graphics code.

If in order to tell a pixel test is correct you need to be aware of
how all of that stuff works, then, yes, it's a bad test. It can fail
too many different ways, and is testing too many different bits of
information. As Filip might suggest, it would be nice if we could
split such tests up. That said, I will freely grant that in many cases
we can't easily do better given the way things are currently
structured, and splitting up such tests would be an enormous amount of
work.

If the pixel test is testing whether a rectangle is actually green or
actually red, such a test is fine, doesn't need much subject matter
expertise, and it is hard to imagine how you'd test such a thing some
other way.

 I don't think any single person can comprehend the entire stack to tell with
 100% confidence that the test result is exactly and precisely correct.

Sure. Such a high bar should be avoided.

  I think we should just check in whatever result we're
  currently seeing as -expected.png because then at least we wouldn't have any
  ambiguity in the process. We just check in whatever we're currently
  seeing, file a bug if we see a problem with the new result, and possibly
  roll out the patch after talking with the author/reviewer.

 This is basically saying we should just follow the existing
 non-Chromium process, right?


 Yes. In addition, doing so will significantly reduce the complexity of the
 current process.

 This would seem to bring us back to step
 1: it doesn't address the problem that I identified with the existing
 non-Chromium process, namely that a non-expert can't tell by looking
 at the checkout what tests are believed to be passing or not.


 What is the use case of this? I've been working on WebKit for more than 3
 years, and I've never had to think about whether a test for an area outside
 of my expertise has the correct output or not, other than when I was
 gardening. And having -correct / -failing wouldn't have helped me know
 what the correct output was when I was gardening anyway, because the new
 output may just as well be a new -correct or -failing result.

I've done this frequently when gardening, when simply trying to learn
how a given chunk of code works and how a given chunk of tests work
(or don't work), and when trying to get a sense of how well our
product is or isn't passing tests.

Perhaps this is the case because I tend to work more on infrastructure
and testing, and look at stuff shallowly across the whole tree rather
than in depth in particular areas as you do.

 I don't think searching bugzilla (as it is currently used) is a workable
 alternative.


 Why not? Bugzilla is the tool we use to triage and track bugs. I don't see a
 need for an alternative method to keep track of bugs.

The way we currently use bugzilla, it is difficult if not impossible
to find a concise and accurate list of all the failing layout tests
meeting any sort of filename- or directory-based criteria (maybe you
can do it just for editing, I don't know). The layout test summary
reports that Ojan sends out to the chromium devs are an example of
this: he generates them from the TestExpectations files; doing so from
bugzilla is not currently feasible.
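
To make that concrete, the summary is essentially just a walk over the
expectation lines. A simplified sketch (this assumes a stripped-down
version of the new syntax and a couple of made-up entries; the real
parser in webkitpy handles far more than this):

    import re
    from collections import Counter

    SAMPLE = """
    webkit.org/b/12345 [ Mac ] editing/selection/select-all.html [ Failure ]
    webkit.org/b/23456 fast/forms/file-input.html [ ImageOnlyFailure ]
    webkit.org/b/34567 [ Debug ] fast/forms/validation-bubble.html [ Crash ]
    """

    LINE = re.compile(r'^\S+\s+(?:\[[^\]]*\]\s+)?(\S+)\s+\[([^\]]*)\]')

    def failures_by_directory(expectations_text):
        """Count non-passing entries per top-level LayoutTests directory."""
        counts = Counter()
        for line in expectations_text.splitlines():
            line = line.split('#', 1)[0].strip()   # drop comments and blanks
            match = LINE.match(line)
            if not match:
                continue
            test_path, expectations = match.groups()
            if 'Pass' not in expectations:
                counts[test_path.split('/')[0]] += 1
        return counts

    print(failures_by_directory(SAMPLE))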

Note that we could certainly extend bugzilla to make this easier, if
there was consensus to do so (and I would be in favor of this, but
that would also incur more process than we have today).

- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Dirk Pranke
On Wed, Aug 15, 2012 at 5:19 PM, Ryosuke Niwa rn...@webkit.org wrote:
 I have a concern that a lot of people wouldn't know what the correct
 output is for a given test.

 For a lot of pixel tests, deciding whether a given output is correct or not
 is really hard. e.g. some seemingly insignificant anti-aliasing difference
 may turn out to be the result of a bug in Skia, another graphics library, or
 the WebCore code that uses them.

 As a result,

 * people may check in wrong "correct" results
 * people may add failing results even though the new results are more or
   less correct


 This leads me to think that just checking in the current output as
 -expected.png and filing bugs separately is a better approach.


I think your observations are correct, but at least my experience as a
gardener/sheriff leads me to a different conclusion. Namely, when I'm
looking at a newly failing test, it is difficult if not impossible for
me to know if the existing baseline was previously believed to be
correct or not, and thus it's hard for me to tell if the new baseline
should be considered worse, better, or different. In theory I could go
look at the changelog for each test, but I would be skeptical that it
had enough useful information (I would expect most comments to be
along the lines of "rebaselining after X" with no indication of whether
the output is correct or not). This is just a theory, though.

This is why I want to test this theory :). It seems like if we got
experience with this on one (or more) ports for a couple of months we
would have a much more well-informed opinion, and I'm not seeing a
huge downside to at least trying this idea out.

-- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Dirk Pranke
On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo fpi...@apple.com wrote:

 2) Possibility of the sheriff getting it wrong.

 (2) concerns me most.  We're talking about using filenames to serve as a
 kind of unchecked comment.  We already know that comments are usually bad
 because there is no protection against them going stale.


Sheriffs can already get things wrong (and rebaseline when they
shouldn't). I believe that adding passing/failing to expected will
make things better in this regard, not worse.

Another idea/observation is that if we have multiple types of
expectation files, it might be easier to set up watchlists, e.g., let
me know whenever a file gets checked into fast/forms with an -expected
or -failing result. It seems like this might be useful, but I'm not
sure.

 In particular, to further clarify my position, if someone were to argue that
 Dirk's proposal would be a wholesale replacement for TestExpectations, then
 I would be more likely to be on board, since I very much like the idea of
 reducing the number of ways of doing things.  Maybe that's a good way to
 reach compromise.

 Dirk, what value do you see in TestExpectations were your change to be
 landed?  Do scenarios still exist where there would be a test for which (a)
 there is no -fail.* file, (b) the test is not skipped, and (c) it's marked
 with some annotation in TestExpectations?  I'm most interested in the
 question of whether such scenarios exist, since in my experience, whenever a
 test is not rebased, is not skipped, and is marked as failing in TestExpectations,
 it ends up just causing gardening overhead later.

This is a good question, because it is definitely my intent that this
change replace some existing practices, not add to them.

Currently, the Chromium port uses TestExpectations entries for four
different kinds of things: tests we don't ever plan to fix (WONTFIX),
tests that we skip because not doing so causes other tests to break,
tests that fail (reliably), and tests that are flaky.

Skipped files do not let you distinguish (programmatically) between
the first two categories, and so my plan is to replace Skipped files
with TestExpectations (using the new syntax discussed a month or so
ago) soon (next week or two at the latest).

I would like to replace using TestExpectations for failing tests (at
least for tests that are expected to keep failing indefinitely because
someone isn't working on an active fix) with this new mechanism.

That leaves flaky tests. One can debate what the right thing to do w/
flaky tests is here; I'm inclined to argue that flakiness is at least
as bad as failing, and we should probably be skipping them, but the
Chromium port has not yet actively tried this approach (I think other
ports probably have experience here, though).
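
To make those categories concrete, here is a toy sketch (not webkitpy
code, and the category names are made up) of what the harness would do
with a test in each bucket once Skipped files are folded into
TestExpectations and reliably failing tests move to checked-in -failing
baselines:

    def plan_for_test(category):
        """Map a triage category to what the harness does with the test."""
        if category == 'WONTFIX':
            return 'never run; the failure is expected and not worth tracking'
        if category == 'SKIP':
            return 'do not run; running it breaks other tests'
        if category == 'FAILS_RELIABLY':
            return 'run and compare against the checked-in -failing baseline'
        if category == 'FLAKY':
            return 'run with retries (or skip, if flakiness is treated as failure)'
        return 'run and compare against -expected (or -passing)'

    for category in ('WONTFIX', 'SKIP', 'FAILS_RELIABLY', 'FLAKY', 'PASSES'):
        print(category, '->', plan_for_test(category))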

Does that help answer your question / sway you at all?

-- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Ryosuke Niwa
On Thu, Aug 16, 2012 at 2:05 PM, Dirk Pranke dpra...@chromium.org wrote:

 I think your observations are correct, but at least my experience as a
 gardener/sheriff leads me to a different conclusion. Namely, when I'm
 looking at a newly failing test, it is difficult if not impossible for
 me to know if the existing baseline was previously believed to be
 correct or not, and thus it's hard for me to tell if the new baseline
 should be considered worse, better, or different.


How does the proposal solve this problem? Right now gardeners have two
options:

   - Rebaseline
   - Add a test expectation

Once we implement the proposal, we'll have at least three options:

   - Rebaseline -correct.png
   - Add/rebaseline -expected.png
   - Add/rebaseline -failing.png
   - (Optionally) Add a test expectation.

And that's a lot of options to choose from. The more options we have, the
more likely people are to make mistakes. We're already inconsistent in how
-expected.png is used because some people make mistakes. I'm afraid that
adding another set of expected results will result in even more mistakes and
an unrecoverable mess.

This is why I want to test this theory :). It seems like if we got
 experience with this on one (or more) ports for a couple of months we
 would have a much more well-informed opinion, and I'm not seeing a
 huge downside to at least trying this idea out.


Sure. But if we're doing this experiment on trunk, thereby imposing
significant cognitive load on every other port, then I'd like to see us
set *both* an exact date at which we decide whether this approach is
good or not and the criteria by which we decide, *before* the experiment
starts.

Like Filip, I'm *extremely* concerned about the prospect of us introducing
yet-another-way-of-doing-things, and not being able to get rid of it later.

- Ryosuke


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Ryosuke Niwa
On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo fpi...@apple.com wrote:


 On Aug 16, 2012, at 2:13 PM, Dirk Pranke wrote:

  On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo fpi...@apple.com wrote:
 
  2) Possibility of the sheriff getting it wrong.
 
  (2) concerns me most.  We're talking about using filenames to serve as a
  kind of unchecked comment.  We already know that comments are usually
 bad
  because there is no protection against them going stale.
 
 
  Sheriffs can already get things wrong (and rebaseline when they
  shouldn't). I believe that adding passing/failing to expected will
  make things better in this regard, not worse.

 In what way do things become better?  Because other people will see what
 the sheriff believed about the result?

 Can you articulate some more about what happens when you have both
 -expected and -failing?

 My specific concern is that after someone checks in a fix, we will have
 some sheriff accidentally misjudge the change in behavior to be a
 regression, and check in a -failing file.  And then we end up in a world of
 confusion.

 This is why I think that just having -expected files is better.  It is a
 kind of recognition that we're tracking changes in behavior, rather than
 comparing against some almighty notion of what it means to be correct.

 
  Another idea/observation is that if we have multiple types of
  expectation files, it might be easier to set up watchlists, e.g., let
  me know whenever a file gets checked into fast/forms with an -expected
  or -failing result. It seems like this might be useful, but I'm not
  sure.
 
  In particular, to further clarify my position, if someone were to argue
 that
  Dirk's proposal would be a wholesale replacement for TestExpectations,
 then
  I would be more likely to be on board, since I very much like the idea
 of
  reducing the number of ways of doing things.  Maybe that's a good way to
  reach compromise.
 
  Dirk, what value do you see in TestExpectations were your change to be
  landed?  Do scenarios still exist where there would be a test for which
 (a)
  there is no -fail.* file, (b) the test is not skipped, and (c) it's
 marked
  with some annotation in TestExpectations?  I'm most interested in the
  question of whether such scenarios exist, since in my experience, whenever a
 test is
  not rebased, is not skipped, and is marked as failing in
 TestExpectations,
  it ends up just causing gardening overhead later.
 
  This is a good question, because it is definitely my intent that this
  change replace some existing practices, not add to them.
 
  Currently, the Chromium port uses TestExpectations entries for four
  different kinds of things: tests we don't ever plan to fix (WONTFIX),
  tests that we skip because not doing so causes other tests to break,
  tests that fail (reliably), and tests that are flaky.
 
  Skipped files do not let you distinguish (programmatically) between
  the first two categories, and so my plan is to replace Skipped files
  with TestExpectations (using the new syntax discussed a month or so
  ago) soon (next week or two at the latest).
 
  I would like to replace using TestExpectations for failing tests (at
  least for tests that are expected to keep failing indefinitely because
  someone isn't working on an active fix) with this new mechanism.
 
  That leaves flaky tests. One can debate what the right thing to do w/
  flaky tests is here; I'm inclined to argue that flakiness is at least
  as bad as failing, and we should probably be skipping them, but the
  Chromium port has not yet actively tried this approach (I think other
  ports probably have experience here, though).
 
  Does that help answer your question / sway you at all?

 Yes, it does - it answers my question, though it perhaps doesn't sway me.
  My concerns are still that:

 1) Switching to skipping flaky tests wholesale in all ports would be
 great, and then we could get rid of the flakiness support.


This is not necessarily helpful in some cases. There are cases where one test
makes subsequent tests flaky, so skipping a flaky test doesn't fix the
problem; it just makes the next test flaky. The right fix, of course, is to
skip/fix the culprit, but pinpointing the test that makes subsequent tests
flaky has proved to be a time-consuming and challenging task.

2) The WONTFIX mode in TestExpectations feels to me more like a statement
 that you're just trying to see if the test doesn't crash.  Correct?  Either
 way, it's confusing.


Yes, I'd argue that they should just be skipped, but that's bikeshedding
for another time.

- Ryosuke


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Ojan Vafai
On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo fpi...@apple.com wrote:

 1) Switching to skipping flaky tests wholesale in all ports would be
 great, and then we could get rid of the flakiness support.


Then you don't notice when a flaky test stops being flaky. The cost of
flakiness support on the project does not seem large to me and it's pretty
frequent that a test will only be flaky for a few hundred runs (e.g. due to
a bad checkin that gets fixed), so we can then remove it from
TestExpectations and run the test as normal.

2) The WONTFIX mode in TestExpectations feels to me more like a statement
 that you're just trying to see if the test doesn't crash.  Correct?  Either
 way, it's confusing.


WONTFIX is for tests that don't make sense to fix for the given port (e.g.
dashboard-specific tests for non-Apple ports). It's a way of distinguishing
tests that we're skipping because of a bug on our part vs. tests that we're
skipping because the test doesn't apply to the port.

3) Your new mechanism feels like it's already covered by our existing use
 of -expected files.  I'm not quite convinced that having -failing in
 addition to -expected files would be all that helpful.


But there are many cases where we *know* the result is incorrect, e.g. it
doesn't match a spec. Sure, there are also many cases where it's not clear
what the correct behavior is.


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Ryosuke Niwa
On Thu, Aug 16, 2012 at 3:04 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa rn...@webkit.org wrote:
  Like Filip, I'm extremely concerned about the prospect of us introducing
  yet-another-way-of-doing-things, and not be able to get rid of it later.

 Presumably the only way we'd be not able to get rid of it would be if
 some port actually liked it, in which case, why would we get rid of it?


Because we already have too many ways to deal with test failures.

Why don't we at least merge Skipped and TestExpectations first?

- Ryosuke


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Ryosuke Niwa
On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org wrote:

On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org
 wrote:
  I agree with the priorities above, at least. I also agree with the
 overall
  goal of making our implementation match our philosophy on testing.
 
  Ryosuke has raised a very valid point: it is not clear what a Gardener
  should do with a test that has changed behavior. Personally, the area I
 deal
  with is very susceptible to the single pixel differences matter issue,
 and
  I would prefer gardeners to flag updates in some way so that I can look
 at
  it myself (or one of the very few other experts can look at it). Maybe
 I'm
  just a control freak.
 
  In the current situation, gardeners are not likely to roll out a patch
 short
  of build failures, obvious test or perf regressions, or security issues.
 For
  the bulk of cases where a test result has changed, the gardener will
 either
  add it to expectations or go ahead and update the result.
 
  The result is a mixture of incorrect expected results and files listed in
  TestExpectations that may or may not be correct. The way I deal with this
  right now is to maintain a snapshot of TestExpectations and occasionally
 go
  through and look for newly added results, and check if they are in fact
  correct. And every now and then go look at older entries to see if they
 have
  been updated. Or I get lucky and notice that a new expected result has
 been
  checked in that is incorrect (I can mostly catch that by following
 checkins
  in the code).
 
  The proposed change does not alter this fundamental dynamic. Depending on
  what policy gardeners adopt, they might now rename the result as failing
 and
  remove the expected, or maybe they'll just update the expected. I'm
 still in
  a situation where I don't know the exact status of any given result, and
  where I have no easy way of knowing if a gardener has correctly updated
 an
  expected or failing image.

 So, if we had -passing/-failing, then people who knew what the results
 were supposed to be in a given directory (call them owners) could
 rename the existing -expected files into one category or another, and
 then a gardener could add newly-failing tests as -expected; it would
 then be trivial for the owner to look for new -expected files in that
 directory.

 Right now, an owner would have to periodically scan the directory for
 files changed since the last time they scanned the directory, and
 would have no way of knowing whether an updated file was correct or
 not (perhaps you could filter out by who committed the change, but
 it's certainly harder than ls *-expected).

 So, I would contend that my proposal would make it easier for us to
 have a process by which gardeners could punt on changes they're unsure
 about and for owners to subsequently review them. I think you could
 have a similar model if we just checked in *all* new results into
 -expected, but it would be harder to implement (perhaps not
 impossible, though, I haven't given it much thought yet).


That seems like a much better model. Right now, one of the biggest problems
is that people rebaseline tests or add test expectations without
understanding what the correct output is. It happens for almost all ports.

Having -correct.png doesn't necessarily help that because even a seemingly
innocent one-pixel difference could be a bug (a patch may only add a
baseline to one platform), and it's really hard to tell what the expected
result is unless you have worked on the rendering code for a long time.

On the other hand, if you follow the chromium model of just
 suppressing the file, then we have no idea if the failure you
 currently see has any relationship to the failure that was originally
 seen, or if the file that is currently checked in has any relationship
 to what correct might now need to be (we know that the file was
 correct at some point but is now stale). So, in the chromium model, at
 least, you can do a grep through TestExpectations to look for failing
 tests, but you don't have the history you would get by looking at a
 history of changes to -expected/-passing/-failing.

 In a world where all failures are triaged promptly by qualified
 owners, all of this mechanism is unnecessary. Unfortunately, we don't
 live in that world at the moment.


I'd argue that this is a bigger problem worth solving. Differentiating
expected failures from correct results isn't going to help if gardeners
keep making mistakes.

- Ryosuke


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-16 Thread Dirk Pranke
On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa rn...@webkit.org wrote:
 On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org
 wrote:
  I agree with the priorities above, at least. I also agree with the
  overall
  goal of making our implementation match our philosophy on testing.
 
  Ryosuke has raised a very valid point: it is not clear what a Gardener
  should do with a test that has changed behavior. Personally, the area I
  deal
  with is very susceptible to the single pixel differences matter issue,
  and
  I would prefer gardeners to flag updates in some way so that I can look
  at
  it myself (or one of the very few other experts can look at it). Maybe
  I'm
  just a control freak.
 
  In the current situation, gardeners are not likely to roll out a patch
  short
  of build failures, obvious test or perf regressions, or security issues.
  For
  the bulk of cases where a test result has changed, the gardener will
  either
  add it to expectations or go ahead and update the result.
 
  The result is a mixture of incorrect expected results and files listed
  in
  TestExpectations that may or may not be correct. The way I deal with
  this
  right now is to maintain a snapshot of TestExpectations and occasionally
  go
  through and look for newly added results, and check if they are in fact
  correct. And every now and then go look at older entries to see if they
  have
  been updated. Or I get lucky and notice that a new expected result has
  been
  checked in that is incorrect (I can mostly catch that by following
  checkins
  in the code).
 
  The proposed change does not alter this fundamental dynamic. Depending
  on
  what policy gardeners adopt, they might now rename the result as failing
  and
  remove the expected, or maybe they'll just update the expected. I'm
  still in
  a situation where I don't know the exact status of any given result, and
  where I have no easy way of knowing if a gardener has correctly updated
  an
  expected or failing image.

 So, if we had -passing/-failing, then people who knew what the results
 were supposed to be in a given directory (call them owners) could
 rename the existing -expected files into one category or another, and
 then a gardener could add newly-failing tests as -expected; it would
 then be trivial for the owner to look for new -expected files in that
 directory.

 Right now, an owner would have to periodically scan the directory for
 files changed since the last time they scanned the directory, and
 would have no way of knowing whether an updated file was correct or
 not (perhaps you could filter out by who committed the change, but
 it's certainly harder than ls *-expected).

 So, I would contend that my proposal would make it easier for us to
 have a process by which gardeners could punt on changes they're unsure
 about and for owners to subsequently review them. I think you could
 have a similar model if we just checked in *all* new results into
 -expected, but it would be harder to implement (perhaps not
 impossible, though, I haven't given it much thought yet).


 That seems like a much better model.

To clarify: what does That refer to? My idea for gardeners to check
in -expected and owners to move to -passing/failing, or the idea
that we check in new results into -expected and have some other way
of reviewing new baselines? (or something else)?


 Right now, one of the biggest problems
 is that people rebaseline tests or add test expectations without
 understanding what the correct output is. It happens for almost all ports.

 Having -correct.png doesn't necessarily help that because even a seemingly
 innocent one-pixel difference could be a bug (a patch may only add a
 baseline to one platform), and it's really hard to tell what the expected
 result is unless you have worked on the rendering code for a long time.


Yes. I hate gardening/sheriffing because I feel like I usually don't
know whether to rebaseline or not :) (Although I suspect having
-correct would be better than not having it, at least).

 In a world where all failures are triaged promptly by qualified
 owners, all of this mechanism is unnecessary. Unfortunately, we don't
 live in that world at the moment.


 I'd argue that this is a bigger problem worth solving. Differentiating
 expected failures from correct results isn't going to help if gardeners
 keep making mistakes.


It is a bigger problem worth solving, but it is also much harder to
solve. As long as we have the combination of

1) changes can cause lots of tests to change expectations and need new baselines
2) it's acceptable to land changes without the accompanying updated baselines
3) there is pressure to keep the bots green (not only for its own
sake, but so that we can more easily identify *new* failures)
4) insufficient resources watching the tree (which is really true for
every port today, even chromium)

I think we're 

[webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Dirk Pranke
Hi all,

As many of you know, we normally treat the -expected files as
regression tests rather than correctness tests; they are intended
to capture the current behavior of the tree. As such, they
historically have not distinguished between a correct failure and an
incorrect failure.

The chromium port, however, has historically often not checked in
expectations for tests that are currently failing (or even have been
failing for a long time), and instead listed them in the expectations
file. This was primarily motivated by us wanting to know easily all of
the tests that were failing. However, this approach has its own
downsides.

I would like to move the project to a point where all of the ports
were basically using the same workflow/model, and combine the best
features of each approach [1].

To that end, I propose that we allow tests to have expectations that
end in '-passing' and '-failing' as well as '-expected'.

The meanings for '-passing' and '-failing' should be obvious, and
'-expected' can keep its current meaning of either or both of
"what we expect to happen" and "I don't know if this is correct or
not" :).

A given test will be allowed to only have one of the three potential
results at any one time/revision in a checkout. [2]

Because '-expected' will still be supported, this means that ports can
continue to work as they do today and we can try -passing/-failing on
a piecemeal basis to see if it's useful or not.

Ideally we will have some way (via a presubmit hook, or lint checks,
or something) to be able to generate a (checked-in) list (or perhaps
just a dashboard or web page) of all of the currently failing tests
and corresponding bugs from the -failing expectations, so that we
can keep one of the advantages that chromium has gotten out of their
TestExpectations files [3].
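
As a minimal sketch of what that check or report might look like (the
real thing would also need to associate bugs and enforce that a test has
only one of the three baselines), deriving the list of failing tests is
just a walk over the checked-in -failing baselines:

    import os

    def failing_tests(layout_tests_dir='LayoutTests'):
        """Yield tests (minus their extension) that have a -failing baseline."""
        for dirpath, _, filenames in os.walk(layout_tests_dir):
            for name in filenames:
                stem, _ = os.path.splitext(name)
                if stem.endswith('-failing'):
                    test_stem = os.path.join(dirpath, stem[:-len('-failing')])
                    yield os.path.relpath(test_stem, layout_tests_dir)

    for test in sorted(set(failing_tests())):
        print(test)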

I will update all of the tools (run-webkit-tests, garden-o-matic,
flakiness dashboard, etc.) as needed to make managing these things as
easy as possible. [4]

Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.

-- Dirk

Notes:

[1] Both the "check in the failures" and the "suppress the failures"
approaches have advantages and disadvantages:

Advantages for checking in failures:

* you can tell when a test starts failing differently
* the need to distinguish between different types of failures (text
vs. image vs. image+text) in the expectations file drops; the baseline
tells you what to expect
* the TestExpectations file can be much smaller and easier to manage -
the current Chromium file is a massively unruly mess
* the history of a particular test can be found by looking at the repo history

Disadvantages for checking in failures (advantages for just suppressing them):
* given current practice (just using -expected) you can't tell if a
particular -expected file is supposed to be correct or not
* it may create more churn in the checked-in baselines in the repo
* you can't get a list of all of the failing tests associated with a
particular bug as easily

There are probably lots of ways one could attempt to design a solution
to these problems; I believe the approach outlined above is perhaps
the simplest possible and allows for us to try it in small parts of
the test suite (and only on some ports) to see if it's useful or not
without forcing it on everyone.

[2] I considered allowing -passing and -failing to co-exist, but a
risk is that the passing or correct result for a test will change,
and if a test is currently expected to fail, we won't notice that that
port's passing result becomes stale. In addition, managing the
transitions across multiple files becomes a lot more
complicated/tricky. There are tradeoffs here, and it's possible some
other logic will work better in practice, but I thought it might be
best to start simple.

[3] I haven't figured out the best way to do this yet, or whether it's
better to keep this list inside the TestExpectations files or in
separate files, or just have a dashboard separate from the repository
completely ...

[4] rough sketch of the work needed to be done here:
* update run-webkit-tests to look for all three suffixes, and to
generate new -failing and/or -passing results as well as new
-expected results (see the sketch below)
* update garden-o-matic so that when you want to rebaseline a file you
can (optionally) indicate whether the new baseline is passing or
failing
* update webkit-patch rebaseline-expectations to (optionally) indicate
if the new baselines are passing or failing
* pass the information from run-webkit-tests to the flakiness
dashboard (via results.json) to indicate whether we matched against
-expected/-passing/-failing so that the dashboard can display the
right baselines for comparison.
* figure out the solution to [3] :).
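
For the first item above (looking for all three suffixes), the baseline
lookup is the easy part; roughly (a sketch, not actual webkitpy code, and
for a single extension only):

    import os

    SUFFIXES = ('-passing', '-failing', '-expected')

    def find_baseline(test_path, extension='.png'):
        """Return (suffix, path) for the first of the three baselines that exists."""
        stem, _ = os.path.splitext(test_path)
        for suffix in SUFFIXES:
            candidate = stem + suffix + extension
            if os.path.exists(candidate):
                return suffix, candidate
        return None, None

    print(find_baseline('LayoutTests/fast/forms/file-input.html'))

The suffix that matched is also exactly the bit of information that would
be passed along to the flakiness dashboard in results.json.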


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Filip Pizlo
This sounds like it's adding even more complication to an already complicated 
system.

Given how many tests we currently have, I also don't buy that continuing to run 
a test that is already known to fail provides much benefit.  If the test covers 
two things and one fails while the other succeeds (example #1: it prints 
correct results but doesn't crash; example #2: it literally has tests for two 
distinct unrelated features, and one feature is broken) then the correct thing 
to do is to split up the test.

So, I'd rather not continue to institutionalize this notion that we should have 
loads of incomprehensible machinery to reason about tests that have already 
given us all the information they were meant to give (i.e. they failed, end of 
story).

-F


On Aug 15, 2012, at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote:

 Hi all,
 
 As many of you know, we normally treat the -expected files as
 regression tests rather than correctness tests; they are intended
 to capture the current behavior of the tree. As such, they
 historically have not distinguished between a correct failure and an
 incorrect failure.
 
 The chromium port, however, has historically often not checked in
 expectations for tests that are currently failing (or even have been
 failing for a long time), and instead listed them in the expectations
 file. This was primarily motivated by us wanting to know easily all of
 the tests that were failing. However, this approach has its own
 downsides.
 
 I would like to move the project to a point where all of the ports
 were basically using the same workflow/model, and combine the best
 features of each approach [1].
 
 To that end, I propose that we allow tests to have expectations that
 end in '-passing' and '-failing' as well as '-expected'.
 
 The meanings for '-passing' and '-failing' should be obvious, and
 '-expected' can keep its current meaning of either or both of
 "what we expect to happen" and "I don't know if this is correct or
 not" :).
 
 A given test will be allowed to only have one of the three potential
 results at any one time/revision in a checkout. [2]
 
 Because '-expected' will still be supported, this means that ports can
 continue to work as they do today and we can try -passing/-failing on
 a piecemeal basis to see if it's useful or not.
 
 Ideally we will have some way (via a presubmit hook, or lint checks,
 or something) to be able to generate a (checked-in) list (or perhaps
 just a dashboard or web page) of all of the currently failing tests
 and corresponding bugs from the -failing expectations, so that we
 can keep one of the advantages that chromium has gotten out of their
 TestExpectations files [3].
 
 I will update all of the tools (run-webkit-tests, garden-o-matic,
 flakiness dashboard, etc.) as needed to make managing these things as
 easy as possible. [4]
 
 Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
 
 -- Dirk
 
 Notes:
 
 [1] Both the "check in the failures" and the "suppress the failures"
 approaches have advantages and disadvantages:
 
 Advantages for checking in failures:
 
 * you can tell when a test starts failing differently
 * the need to distinguish between different types of failures (text
 vs. image vs. image+text) in the expectations file drops; the baseline
 tells you what to expect
 * the TestExpectations file can be much smaller and easier to manage -
 the current Chromium file is a massively unruly mess
 * the history of a particular test can be found by looking at the repo history
 
 Disadvantages for checking in failures (advantages for just suppressing them):
 * given current practice (just using -expected) you can't tell if a
 particular -expected file is supposed to be correct or not
 * it may create more churn in the checked-in baselines in the repo
 * you can't get a list of all of the failing tests associated with a
 particular bug as easily
 
 There are probably lots of ways one could attempt to design a solution
 to these problems; I believe the approach outlined above is perhaps
 the simplest possible and allows for us to try it in small parts of
 the test suite (and only on some ports) to see if it's useful or not
 without forcing it on everyone.
 
 [2] I considered allowing -passing and -failing to co-exist, but a
 risk is that the passing or correct result for a test will change,
 and if a test is currently expected to fail, we won't notice that that
 port's passing result becomes stale. In addition, managing the
 transitions across multiple files becomes a lot more
 complicated/tricky. There are tradeoffs here, and it's possible some
 other logic will work better in practice, but I thought it might be
 best to start simple.
 
 [3] I haven't figured out the best way to do this yet, or whether it's
 better to keep this list inside the TestExpectations files or in
 separate files, or just have a dashboard separate from the repository
 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Dirk Pranke
On Wed, Aug 15, 2012 at 12:27 PM, Filip Pizlo fpi...@apple.com wrote:
 This sounds like it's adding even more complication to an already complicated 
 system.

In some ways, yes. In other ways, perhaps it will allow us to simplify
things; e.g., if we are checking in failing tests, there is much less
of a need for multiple failure keywords in the TestExpectations file
(so perhaps we can simplify them back to something closer to Skipped
files).

 Given how many tests we currently have, I also don't buy that continuing to 
 run a test that is already known to fail provides much benefit.

I'm not sure I understand your feedback here? It's common practice (on
all the ports as far as I know today) to rebaseline tests that are
currently failing so that they fail differently. Of course, we also
skip some tests while they are failing as well. However, common
objections to skipping tests are that we can lose coverage for a
feature and/or miss when a test starts failing worse (e.g. crashing).
And of course, a test might start passing again, but if we're skipping
it we wouldn't know that ...

 So, I'd rather not continue to institutionalize this notion that we should 
 have loads of incomprehensible machinery to reason about tests that have 
 already given us all the information they were meant to give (i.e. they 
 failed, end of story).

Are you suggesting that, rather than checking in new baselines at all
or having lots of logic to manage different kinds of failures, we
should just let failing tests fail (and keep the tree red) until a
change is either reverted or the test is fixed?

-- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Dirk Pranke
I've received at least one bit of feedback off-list that it might not
have been clear what problems I was trying to solve and whether the
solution would add enough benefit to be worth it. Let me try to
restate things, in case this helps ...

The problems I'm attempting to solve are:

1) Chromium does things differently from the other WebKit ports, and
this seems bad :).

2) the TestExpectations syntax is too complicated

3) When you suppress or skip failures (as chromium does), you can't
easily tell when a test changes behavior (starts failing differently).

4) We can't easily tell if a test's -expected output is actually
believed to be correct or not, which makes figuring out whether to
rebaseline or not difficult, and makes it hard to figure out whether a
change that causes a test to fail is actually introducing a new
problem, just making it fail differently, or even potentially fixing
something.

I agree that it's unclear how much churn there will be and how much
benefit there will be, but we've been talking about these problems for
years and I think without actually trying *something* we won't know if
there's a better way to solve this or not, and I personally think it's
worth trying. The solution I've described is the least intrusive
mechanism we can try that I've yet come up with.

-- Dirk

On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote:
 Hi all,

 As many of you know, we normally treat the -expected files as
 regression tests rather than correctness tests; they are intended
 to capture the current behavior of the tree. As such, they
 historically have not distinguished between a correct failure and an
 incorrect failure.

 The chromium port, however, has historically often not checked in
 expectations for tests that are currently failing (or even have been
 failing for a long time), and instead listed them in the expectations
 file. This was primarily motivated by us wanting to know easily all of
 the tests that were failing. However, this approach has its own
 downsides.

 I would like to move the project to a point where all of the ports
 were basically using the same workflow/model, and combine the best
 features of each approach [1].

 To that end, I propose that we allow tests to have expectations that
 end in '-passing' and '-failing' as well as '-expected'.

 The meanings for '-passing' and '-failing' should be obvious, and
 '-expected' can continue the current meaning of either or both of
 'what we expect to happen' and 'I don't know if this is correct or
 not' :).

 A given test will be allowed to only have one of the three potential
 results at any one time/revision in a checkout. [2]

 Because '-expected' will still be supported, this means that ports can
 continue to work as they do today and we can try -passing/-failing on
 a piecemeal basis to see if it's useful or not.

 Ideally we will have some way (via a presubmit hook, or lint checks,
 or something) to be able to generate a (checked-in) list (or perhaps
 just a dashboard or web page) of all of the currently failing tests
 and corresponding bugs from the -failing expectations, so that we
 can keep one of the advantages that chromium has gotten out of their
 TestExpectations files [3].
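
For illustration, a minimal sketch of such a report step might look like the
following; this is not part of the existing tooling, and the LayoutTests path
and the handling of the proposed '-failing' suffix are assumptions based on
the proposal above:

    # Sketch: list every test that has a checked-in '-failing' baseline,
    # as a stand-in for the checked-in list / dashboard described above.
    # Assumes it is run from the root of a WebKit checkout.
    import os

    LAYOUT_TESTS = 'LayoutTests'
    FAILING_MARKER = '-failing.'

    def failing_tests(root=LAYOUT_TESTS):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if FAILING_MARKER in name:
                    rel = os.path.relpath(os.path.join(dirpath, name), root)
                    # 'fast/js/foo-failing.txt' is reported as 'fast/js/foo'.
                    yield rel.partition(FAILING_MARKER)[0]

    if __name__ == '__main__':
        for test in sorted(set(failing_tests())):
            print(test)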

 I will update all of the tools (run-webkit-tests, garden-o-matic,
 flakiness dashboard, etc.) as needed to make managing these things as
 easy as possible. [4]

 Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.

 -- Dirk

 Notes:

 [1] Both the check in the failures and the suppress the failures
 approaches have advantages and disadvantages:

 Advantages for checking in failures:

 * you can tell when a test starts failing differently
 * the need to distinguish between different types of failures (text
 vs. image vs. image+text) in the expectations file drops; the baseline
 tells you what to expect
 * the TestExpectations file can be much smaller and easier to manage -
 the current Chromium file is a massively unruly mess
 * the history of a particular test can be found by looking at the repo history

 Disadvantages for checking in failures (advantages for just suppressing them):
 * given current practice (just using -expected) you can't tell if a
 particular -expected file is supposed to be correct or not
 * it may create more churn in the checked-in baselines in the repo
 * you can't get a list of all of the failing tests associated with a
 particular bug as easily

 There are probably lots of ways one could attempt to design a solution
 to these problems; I believe the approach outlined above is perhaps
 the simplest possible and allows for us to try it in small parts of
 the test suite (and only on some ports) to see if it's useful or not
 without forcing it on everyone.

 [2] I considered allowing -passing and -failing to co-exist, but a
 risk is that the passing or correct result for a test will change,
 and if a test is currently expected to fail, we 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Michael Saboff
It seems to me that there are two issues here.  One is a Chromium-specific 
issue about process conformity, and it seems to me that it should stay a 
Chromium issue without making the mechanism more complex for all ports.  The 
other ports seem to make 
things work using the existing framework.

The other broader issue is failing tests.  If I understand part of Filip's 
concern it is a signal to noise issue.  We do not want the pass / fail signal 
to be lost in the noise of expected failures.  Failing tests should be fixed as 
appropriate for failing platform(s).  That fixing might involve splitting off 
or removing a failing sub-test so that the remaining test adds value once 
again; that is especially true at the edge where a pass becomes a fail.  For me, a test failing 
differently provides unknown value as the noise of it being a failing test 
likely exceeds the benefit of the different failure mode signal.  It takes a 
non-zero effort to filter that noise and that effort is likely better spent 
fixing the test.

- Michael

On Aug 15, 2012, at 12:48 PM, Dirk Pranke dpra...@chromium.org wrote:

 I've received at least one bit of feedback off-list that it might not
 have been clear what problems I was trying to solve and whether the
 solution would add enough benefit to be worth it. Let me try to
 restate things, in case this helps ...
 
 The problems I'm attempting to solve are:
 
 1) Chromium does things differently from the other WebKit ports, and
 this seems bad :).
 
 2) the TestExpectations syntax is too complicated
 
 3) When you suppress or skip failures (as chromium does), you can't
 easily tell when test changes behavior (starts failing differently).
 
 4) We can't easily tell if a test's -expected output is actually
 believed to be correct or not, which makes figuring out whether to
 rebaseline or not difficult, and makes figuring out if your change
 that is causing a test to fail is actually introducing a new
 problem, just failing differently, or even potentially fixing
 something.
 
 I agree that it's unclear how much churn there will be and how much
 benefit there will be, but we've been talking about these problems for
 years and I think without actually trying *something* we won't know if
 there's a better way to solve this or not, and I personally think it's
 worth trying. The solution I've described is the least intrusive
 mechanism we can try that I've yet come up with.
 
 -- Dirk
 
 On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote:
 Hi all,
 
 As many of you know, we normally treat the -expected files as
 regression tests rather than correctness tests; they are intended
 to capture the current behavior of the tree. As such, they
 historically have not distinguished between a correct failure and an
 incorrect failure.
 
 The chromium port, however, has historically often not checked in
 expectations for tests that are currently failing (or even have been
 failing for a long time), and instead listed them in the expectations
 file. This was primarily motivated by us wanting to know easily all of
 the tests that were failing. However, this approach has its own
 downsides.
 
 I would like to move the project to a point where all of the ports
 were basically using the same workflow/model, and combine the best
 features of each approach [1].
 
 To that end, I propose that we allow tests to have expectations that
 end in '-passing' and '-failing' as well as '-expected'.
 
 The meanings for '-passing' and '-failing' should be obvious, and
 '-expected' can continue the current meaning of either or both of
 what we expect to happen and I don't know if this is correct or
 not :).
 
 A given test will be allowed to only have one of the three potential
 results at any one time/revision in a checkout. [2]
 
 Because '-expected' will still be supported, this means that ports can
 continue to work as they do today and we can try -passing/-failing on
 a piecemeal basis to see if it's useful or not.
 
 Ideally we will have some way (via a presubmit hook, or lint checks,
 or something) to be able to generate a (checked-in) list (or perhaps
 just a dashboard or web page) of all of the currently failing tests
 and corresponding bugs from the -failing expectations, so that we
 can keep one of the advantages that chromium has gotten out of their
 TestExpectations files [3].
 
 I will update all of the tools (run-webkit-tests, garden-o-matic,
 flakiness dashboard, etc.) as needed to make managing these things as
 easy as possible. [4]
 
 Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
 
 -- Dirk
 
 Notes:
 
 [1] Both the check in the failures and the suppress the failures
 approaches have advantages and disadvantages:
 
 Both approaches have their advantages and disadvantages:
 
 Advantages for checking in failures:
 
 * you can tell when a test starts failing differently
 * the need to distinguish between different types of failures (text
 vs. image vs. image+text) in the expectations file drops; 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Ojan Vafai
I think this is the best compromise between the benefits of the
historically chromium approach (i.e. add to TestExpectations) and the
historically non-chromium approach (either skip the test or check in a
failing result, usually the latter). The only thing Dirk's proposal changes
for these ports (i.e. all but Chromium) is that an expectation file's
suffix can be -passing or -failing in addition to the current -expected.

It makes for a clear list of broken tests that need fixing without
compromising the test suite's effectiveness at detecting new regressions.

I agree that we just need to try it and see how it turns out. For
non-chromium ports, it will be easy to switch back if we decide it's too
much overhead and Chromium will just need to come up with an alternate way
of keeping track of the list of tests that are failing.

On Wed, Aug 15, 2012 at 1:36 PM, Michael Saboff msab...@apple.com wrote:

 It seems to me that there are two issues here.  One is Chromium specific
 about process conformity.  It seems to me that should stay a Chromium issue
 without making the mechanism more complex for all ports.  The other ports
 seem to make things work using the existing framework.

 The other broader issue is failing tests.  If I understand part of Filip's
 concern it is a signal to noise issue.  We do not want the pass / fail
 signal to be lost in the noise of expected failures.  Failing tests should
 be fixed as appropriate for failing platform(s).  That fixing might involve
 splitting off or removing a failing sub-test so that the remaining test
 adds value once again.  Especially a pass becoming a fail edge.  For me,
 a test failing differently provides unknown value as the noise of it being
 a failing test likely exceeds the benefit of the different failure mode
 signal.  It takes a non-zero effort to filter that noise and that effort is
 likely better spent fixing the test.


Historically, all WebKit ports other than Chromium (especially the Apple
port) have taken exactly this approach. What you're arguing against is the
long-standing approach WebKit has taken to testing. I don't think
discussing this broader issue is useful since no one is asking the
non-Chromium ports to change this aspect. This has been discussed to death
on webkit-dev, with many people (mostly at Apple) being strong proponents
of checking in failing expectations.

This just brings the Chromium port in line with other ports without losing
the benefit the Chromium approach currently has of keeping a list of
failing tests that need fixing.



 - Michael

 On Aug 15, 2012, at 12:48 PM, Dirk Pranke dpra...@chromium.org wrote:

  I've received at least one bit of feedback off-list that it might not
  have been clear what problems I was trying to solve and whether the
  solution would add enough benefit to be worth it. Let me try to
  restate things, in case this helps ...
 
  The problems I'm attempting to solve are:
 
  1) Chromium does things differently from the other WebKit ports, and
  this seems bad :).
 
  2) the TestExpectations syntax is too complicated
 
  3) When you suppress or skip failures (as chromium does), you can't
  easily tell when test changes behavior (starts failing differently).
 
  4) We can't easily tell if a test's -expected output is actually
  believed to be correct or not, which makes figuring out whether to
  rebaseline or not difficult, and makes figuring out if your change
  that is causing a test to fail is actually introducing a new
  problem, just failing differently, or even potentially fixing
  something.
 
  I agree that it's unclear how much churn there will be and how much
  benefit there will be, but we've been talking about these problems for
  years and I think without actually trying *something* we won't know if
  there's a better way to solve this or not, and I personally think it's
  worth trying. The solution I've described is the least intrusive
  mechanism we can try that I've yet come up with.
 
  -- Dirk
 
  On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke dpra...@chromium.org
 wrote:
  Hi all,
 
  As many of you know, we normally treat the -expected files as
  regression tests rather than correctness tests; they are intended
  to capture the current behavior of the tree. As such, they
  historically have not distinguished between a correct failure and an
  incorrect failure.
 
  The chromium port, however, has historically often not checked in
  expectations for tests that are currently failing (or even have been
  failing for a long time), and instead listed them in the expectations
  file. This was primarily motivated by us wanting to know easily all of
  the tests that were failing. However, this approach has its own
  downsides.
 
  I would like to move the project to a point where all of the ports
  were basically using the same workflow/model, and combine the best
  features of each approach [1].
 
  To that end, I propose that we allow tests to have expectations that
  end in '-passing' 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Maciej Stachowiak

On Aug 15, 2012, at 12:27 PM, Filip Pizlo fpi...@apple.com wrote:

 This sounds like it's adding even more complication to an already complicated 
 system.
 
 Given how many tests we currently have, I also don't buy that continuing to 
 run a test that is already known to fail provides much benefit.  If the test 
 covers two things and one fails while the other succeeds (example #1: it 
 prints correct results but doesn't crash; example #2: it literally has tests 
 for two distinct unrelated features, and one feature is broken) then the 
 correct thing to do is to split up the test.

The case where this happens isn't so much testing unrelated features. It's more 
that a single test can face failures of multiple incidental aspects. Let me 
give an example. Let's say you have a test that's intended to verify JavaScript 
multiplication. Let's say it starts failing because of a regression in function 
call behavior. There may not be a clean way to split these two aspects. You may 
have to rewrite the multiplication test to work around this hypothetical 
function call bug. It may be nontrivial to figure out how to do that.

So I agree in principle that splitting orthogonal tests is good, but it's not 
always a good solution to avoiding chained regressions.

Note also that having more test files makes the test runs slower.

 
 So, I'd rather not continue to institutionalize this notion that we should 
 have loads of incomprehensible machinery to reason about tests that have 
 already given us all the information they were meant to give (i.e. they 
 failed, end of story).

I think another aspect of your reasoning doesn't entirely fit with the history 
of WebKit regression testing. Our tests were not originally designed to test 
against some abstract notion of correct behavior. They were designed to 
detect behavior changes. It's definitely not the case that, once behavior has 
changed once, there's no value to detecting when it changes again. Some of 
these behavior changes might be good and should form a new baseline. Others 
might be bad, but they still should not inhibit detecting other changes while 
they are investigated.


(I should note that I don't yet have an opinion on the new proposal; I have not 
read it. But I am skeptical that splitting tests is a practical solution.)


Regards,
Maciej


 
 -F
 
 
 On Aug 15, 2012, at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote:
 
 Hi all,
 
 As many of you know, we normally treat the -expected files as
 regression tests rather than correctness tests; they are intended
 to capture the current behavior of the tree. As such, they
 historically have not distinguished between a correct failure and an
 incorrect failure.
 
 The chromium port, however, has historically often not checked in
 expectations for tests that are currently failing (or even have been
 failing for a long time), and instead listed them in the expectations
 file. This was primarily motivated by us wanting to know easily all of
 the tests that were failing. However, this approach has its own
 downsides.
 
 I would like to move the project to a point where all of the ports
 were basically using the same workflow/model, and combine the best
 features of each approach [1].
 
 To that end, I propose that we allow tests to have expectations that
 end in '-passing' and '-failing' as well as '-expected'.
 
 The meanings for '-passing' and '-failing' should be obvious, and
 '-expected' can continue the current meaning of either or both of
 what we expect to happen and I don't know if this is correct or
 not :).
 
 A given test will be allowed to only have one of the three potential
 results at any one time/revision in a checkout. [2]
 
 Because '-expected' will still be supported, this means that ports can
 continue to work as they do today and we can try -passing/-failing on
 a piecemeal basis to see if it's useful or not.
 
 Ideally we will have some way (via a presubmit hook, or lint checks,
 or something) to be able to generate a (checked-in) list (or perhaps
 just a dashboard or web page) of all of the currently failing tests
 and corresponding bugs from the -failing expectations, so that we
 can keep one of the advantages that chromium has gotten out of their
 TestExpectations files [3].
 
 I will update all of the tools (run-webkit-tests, garden-o-matic,
 flakiness dashboard, etc.) as needed to make managing these things as
 easy as possible. [4]
 
 Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.
 
 -- Dirk
 
 Notes:
 
 [1] Both the check in the failures and the suppress the failures
 approaches have advantages and disadvantages:
 
 Both approaches have their advantages and disadvantages:
 
 Advantages for checking in failures:
 
 * you can tell when a test starts failing differently
 * the need to distinguish between different types of failures (text
 vs. image vs. image+text) in the expectations file drops; the baseline
 tells you what to expect
 * the TestExpectations file 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Dirk Pranke
On Wed, Aug 15, 2012 at 1:36 PM, Michael Saboff msab...@apple.com wrote:
 It seems to me that there are two issues here.  One is Chromium specific 
 about process conformity.  It seems to me that should stay a Chromium issue 
 without making the mechanism more complex for all ports.  The other ports 
 seem to make things work using the existing framework.


I'm definitely attempting to minimize the impact on ports (and people)
that are happy with things the way they are. One thing we could do to
help with this would be to require that any '-passing'/'-failing'
tests be limited to the platform/ directories, so that a given
port could opt in to this additional complexity (and we could replace
one Chromium-specific process with another one without affecting the
other ports).
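
As a rough sketch, such an opt-in restriction could be enforced by a lint
check along these lines; this is hypothetical code, and it assumes that
port-specific baselines live under a top-level platform/ directory inside
LayoutTests:

    # Sketch of a hypothetical lint check for the opt-in scheme above: flag
    # any '-passing'/'-failing' baseline that is not under a platform/
    # directory, so only ports that opted in carry the new suffixes.
    import os
    import sys

    NEW_SUFFIX_MARKERS = ('-passing.', '-failing.')

    def check_baselines(layout_tests_root):
        errors = []
        for dirpath, _dirnames, filenames in os.walk(layout_tests_root):
            rel_dir = os.path.relpath(dirpath, layout_tests_root)
            in_platform_dir = rel_dir.split(os.sep)[0] == 'platform'
            for name in filenames:
                if any(m in name for m in NEW_SUFFIX_MARKERS) and not in_platform_dir:
                    errors.append(os.path.join(rel_dir, name))
        return errors

    if __name__ == '__main__':
        bad = check_baselines(sys.argv[1] if len(sys.argv) > 1 else 'LayoutTests')
        for path in bad:
            print('new-style baseline outside platform/: %s' % path)
        sys.exit(1 if bad else 0)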

 The other broader issue is failing tests.  If I understand part of Filip's 
 concern it is a signal to noise issue.  We do not want the pass / fail signal 
 to be lost in the noise of expected failures.  Failing tests should be fixed 
 as appropriate for failing platform(s).  That fixing might involve splitting 
 off or removing a failing sub-test so that the remaining test adds value once 
 again.  Especially a pass becoming a fail edge.  For me, a test failing 
 differently provides unknown value as the noise of it being a failing test 
 likely exceeds the benefit of the different failure mode signal.  It takes a 
 non-zero effort to filter that noise and that effort is likely better spent 
 fixing the test.

If I understand you correctly, you're suggesting that if a test starts
to fail, you don't care how it is failing, and simply knowing that it
is failing is enough? And if a test doesn't meet that criterion, it
should be split into multiple tests such that each sub-test does? I
think that's a nice goal, but we're nowhere close to that in practice.

In particular, this isn't true of many pixel tests at all. Minor
changes in text rendering can cause a whole bunch of tests to fail,
and yet we still need to run those tests to look for real regressions.
Ideally we can replace some or most of these tests with text-only
tests and reftests, but we're a long way away from that, too.

-- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Filip Pizlo

On Aug 15, 2012, at 2:16 PM, Maciej Stachowiak m...@apple.com wrote:

 
 On Aug 15, 2012, at 12:27 PM, Filip Pizlo fpi...@apple.com wrote:
 
 This sounds like it's adding even more complication to an already 
 complicated system.
 
 Given how many tests we currently have, I also don't buy that continuing to 
 run a test that is already known to fail provides much benefit.  If the test 
 covers two things and one fails while the other succeeds (example #1: it 
 prints correct results but doesn't crash; example #2: it literally has tests 
 for two distinct unrelated features, and one feature is broken) then the 
 correct thing to do is to split up the test.
 
 The case where this happens isn't so much testing unrelated features. It's 
 more that a single test can face failures of multiple incidental aspects. Let 
 me give an example. Let's say you have a test that's intended to verify 
 JavaScript multiplication. Let's say it starts failing because of a 
 regression in function call behavior. There may not be a clean way to split 
 these two aspects. You may have to rewrite the multiplication test to work 
 around this hypothetical function call bug. It may be nontrivial to figure 
 out how to do that.
 
 So I agree in principle that splitting orthogonal tests is good, but it's not 
 always a good solution to avoiding chained regressions.
 
 Note also that having more test files makes the test rights slower.
 
 
 So, I'd rather not continue to institutionalize this notion that we should 
 have loads of incomprehensible machinery to reason about tests that have 
 already given us all the information they were meant to give (i.e. they 
 failed, end of story).
 
 I think another aspect of your reasoning doesn't entirely fit with the 
 history of WebKit regression testing. Our tests were not originally designed 
 to test against some abstract notion of correct behavior. They were 
 designed to detect behavior changes. It's definitely not the case that, once 
 behavior has changed once, there's no value to detecting when it changes 
 again. Some of these behavior changes might be good and should form a new 
 baseline. Others might be bad but still should not inhibit detecting other 
 testing while they are investigated.

Sorry, I was rather unclear.  I'm arguing against marking a test as 
expected-to-fail-but-not-crash.  I have no broad problem with rebasing the 
test, except if you're using -expected.* files to indicate that the test is in 
this expected-to-fail-but-not-crash state.  There is potentially a significant 
difference between tests that cover what is displayed and the loads of tests 
that cover discrete functionality (DOM APIs, JS behavior, etc).  Rebasing has 
very little benefit in the latter kinds of tests.

I'm also arguing against having yet another mechanism for indicating that a 
test is borked.

Finally, I think any strategy we use must take into account the fact that we 
currently have a *huge* number of tests.  While there may be examples of 
failing tests that still succeed in giving us meaningful coverage, I think 
there are far more examples of tests that provide no benefit, but require an 
evil troika of (1) CPU time to run, (2) network/disk time to check out and 
update, and (3) person time to rebase.  So, at this point, having sophisticated 
mechanisms for dealing with expected-to-fail-but-not-crash tests beyond filing 
a bug and skipping the test is likely to just be noise.

 
 
 (I should note that I don't yet have an opinion on the new proposal; I have 
 not read it. But I am skeptical that splitting tests is a practical solution.)
 
 
 Regards,
 Maciej
 
 
 
 -F
 
 
 On Aug 15, 2012, at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote:
 
 Hi all,
 
 As many of you know, we normally treat the -expected files as
 regression tests rather than correctness tests; they are intended
 to capture the current behavior of the tree. As such, they
 historically have not distinguished between a correct failure and an
 incorrect failure.
 
 The chromium port, however, has historically often not checked in
 expectations for tests that are currently failing (or even have been
 failing for a long time), and instead listed them in the expectations
 file. This was primarily motivated by us wanting to know easily all of
 the tests that were failing. However, this approach has its own
 downsides.
 
 I would like to move the project to a point where all of the ports
 were basically using the same workflow/model, and combine the best
 features of each approach [1].
 
 To that end, I propose that we allow tests to have expectations that
 end in '-passing' and '-failing' as well as '-expected'.
 
 The meanings for '-passing' and '-failing' should be obvious, and
 '-expected' can continue the current meaning of either or both of
 what we expect to happen and I don't know if this is correct or
 not :).
 
 A given test will be allowed to only have one of the three potential
 results at any one time/revision 

Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Filip Pizlo
Apparently I was somewhat unclear.  Let me restate.  We have the following 
mechanisms available when a test fails:

1) Check in a new -expected.* file.

2) Modify the test.

3) Modify a TestExpectations file.

4) Add the test to a Skipped file.

5) Remove the test entirely.

I have no problem with (1) unless it is intended to mark the test as 
expected-to-fail-but-not-crash.  I agree that using -expected.* to accomplish 
what TestExpectations accomplishes is not valuable, but I further believe that 
even TestExpectations is not valuable.

I broadly prefer (2) whenever possible.

I believe that (3) and (4) are redundant, and I don't buy the value of (3).

I don't like (5) but we should probably do more of it for tests that have a 
chronically low signal-to-noise ratio.

You're proposing a new mechanism.  I'm arguing that given the sheer number of 
tests, and the overheads associated with maintaining them, (4) is the broadly 
more productive strategy in terms of bugs-fixed/person-hours.  And, increasing 
the number of mechanisms for dealing with tests by 20% is likely to reduce 
overall productivity rather than helping anyone.

-F



On Aug 15, 2012, at 12:40 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Wed, Aug 15, 2012 at 12:27 PM, Filip Pizlo fpi...@apple.com wrote:
 This sounds like it's adding even more complication to an already 
 complicated system.
 
 In some ways, yes. In other ways, perhaps it will allow us to simplify
 things; e.g., if we are checking in failing tests, there is much less
 of a need for multiple failure keywords in the TestExpectations file
 (so perhaps we can simplify them back to something closer to Skipped
 files).
 
 Given how many tests we currently have, I also don't buy that continuing to 
 run a test that is already known to fail provides much benefit.
 
 I'm not sure I understand your feedback here? It's common practice (on
 all the ports as far as I know today) to rebaseline tests that are
 currently failing so that they fail differently. Of course, we also
 skip some tests while they are failing as well. However, common
 objections to skipping tests are that we can lose coverage for a
 feature and/or miss when a test starts failing worse (e.g. crashing)?
 And of course, a test might start passing again, but if we're skipping
 it we wouldn't know that ...
 
 So, I'd rather not continue to institutionalize this notion that we should 
 have loads of incomprehensible machinery to reason about tests that have 
 already given us all the information they were meant to give (i.e. they 
 failed, end of story).
 
 Are you suggesting that, rather than checking in new baselines at all
 or having lots of logic to manage different kinds of failures, we
 should just let failing tests fail (and keep the tree red) until a
 change is either reverted or the test is fixed?
 
 -- Dirk



Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Dirk Pranke
On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo fpi...@apple.com wrote:
 Apparently I was somewhat unclear.  Let me restate.  We have the following 
 mechanisms available when a test fails:

 1) Check in a new -expected.* file.

 2) Modify the test.

 3) Modify a TestExpectations file.

 4) Add the test to a Skipped file.

 5) Remove the test entirely.

 I have no problem with (1) unless it is intended to mark the test as 
 expected-to-fail-but-not-crash.  I agree that using -expected.* to accomplish 
 what TestExpectations accomplishes is not valuable, but I further believe 
 that even TestExpectations is not valuable.

 I broadly prefer (2) whenever possible.

 I believe that (3) and (4) are redundant, and I don't buy the value of (3).

 I don't like (5) but we should probably do more of it for tests that have a 
 chronically low signal-to-noise ratio.


Thank you for clarifying. I had actually written an almost identical
list but didn't send it, so I think we're on the same page at least as
far as understanding the problem goes ...

So, I would describe my suggestion as an improved variant of the kind
of (1) that can be used as expected-to-fail-but-not-crash (which
I'll call 1-fail); we would use it in cases where we use
(3), (4), or (1-fail) today.

I would also agree that we should do (2) where possible, but I don't
think this is easily possible for a large class of tests, especially
pixel tests, although I am currently working on other things that will
hopefully help here.

Chromium certainly does a lot of (3) today, and some (1-fail). Other
ports definitely use (1-fail) or (4) today, because (2) is rarely
possible for many, many tests.

We know that doing (1-fail), (3), or (4) causes real maintenance woes
down the road, but also that doing (1-fail) or (3) catches real
problems that simply skipping the test would not -- at some cost.
Whether the benefit is worth the cost, is not known, of course, but I
believe it is. I am hoping that my suggestion will have a lower
overall cost than doing (1-fail) or (3).

 You're proposing a new mechanism. I'm arguing that given the sheer number of 
 tests, and the overheads associated with maintaining them, (4) is the broadly 
 more productive strategy in terms of bugs-fixed/person-hours.  And, 
 increasing the number of mechanisms for dealing with tests by 20% is likely 
 to reduce overall productivity rather than helping anyone.


Why do you believe this to be true? I'm not being flippant here ... I
think this is a very plausible argument, and it may well be true, but
I don't know what the criteria we would use to evaluate it are. Some
of the possible factors are:

* the complexity of the test infrastructure and the cognitive load it
introduces on developers
* the cost of bugs that are missed because we're skipping the tests
intended to catch those bugs
* the cost of looking at regressions and trying to figure out if the
regression is something you care about or not
* the cost of looking at the -expected results and trying to figure
out if what is expected is correct or not

There may be others as well, but the last three are all very real in
my experience, and I believe they significantly outweigh the first
one, but I don't know how to objectively assess that (and I don't
think it's even possible since different people/teams/ports will weigh
these things differently).

-- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Filip Pizlo

On Aug 15, 2012, at 4:02 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo fpi...@apple.com wrote:
 Apparently I was somewhat unclear.  Let me restate.  We have the following 
 mechanisms available when a test fails:
 
 1) Check in a new -expected.* file.
 
 2) Modify the test.
 
 3) Modify a TestExpectations file.
 
 4) Add the test to a Skipped file.
 
 5) Remove the test entirely.
 
 I have no problem with (1) unless it is intended to mark the test as 
 expected-to-fail-but-not-crash.  I agree that using -expected.* to 
 accomplish what TestExpectations accomplishes is not valuable, but I further 
 believe that even TestExpectations is not valuable.
 
 I broadly prefer (2) whenever possible.
 
 I believe that (3) and (4) are redundant, and I don't buy the value of (3).
 
 I don't like (5) but we should probably do more of it for tests that have a 
 chronically low signal-to-noise ratio.
 
 
 Thank you for clarifying. I had actually written an almost identical
 list but didn't send it, so I think we're on the same page at least as
 far as understanding the problem goes ...
 
 So, I would describe my suggestion as an improved variant of the kind
 of (1) that can be used as expected-to-fail-but-not-crash (which
 I'll call 1-fail), and that we would use this in cases where we use
 (3), (4), or (1-fail) today.
 
 I would also agree that we should do (2) where possible, but I don't
 think this is easily possible for a large class of tests, especially
 pixel tests, although I am currently working on other things that will
 hopefully help here.
 
 Chromium certainly does a lot of (3) today, and some (1-fail). Other
 ports definitely use (1-fail) or (4) today, because (2) is rarely
 possible for many, many tests.
 
 We know that doing (1-fail), (3), or (4) causes real maintenance woes
 down the road, but also that doing (1-fail) or (3) catches real
 problems that simply skipping the test would not -- at some cost.
 Whether the benefit is worth the cost, is not known, of course, but I
 believe it is. I am hoping that my suggestion will have a lower
 overall cost than doing (1-fail) or (3).

I also believe that the trade-off is known, and specifically, I believe that 
having tests in the (1-fail) or (3) states is more costly than having them in 
(4) or (5).

 
 You're proposing a new mechanism. I'm arguing that given the sheer number of 
 tests, and the overheads associated with maintaining them, (4) is the 
 broadly more productive strategy in terms of bugs-fixed/person-hours.  And, 
 increasing the number of mechanisms for dealing with tests by 20% is likely 
 to reduce overall productivity rather than helping anyone.
 
 
 Why do you believe this to be true? I'm not being flippant here ... I
 think this is a very plausible argument, and it may well be true, but
 I don't know what the criteria we would use to evaluate it are. Some
 of the possible factors are:
 
 * the complexity of the test infrastructure and the cognitive load it
 introduces on developers
 * the cost of bugs that are missed because we're skipping the tests
 intended to catch those bugs
 * the cost of looking at regressions and trying to figure out if the
 regression is something you care about or not
 * the cost of looking at the -expected results and trying to figure
 out if what is expected is correct or not
 
 There may be others as well, but the last three are all very real in
 my experience, and I believe they significantly outweigh the first
 one, but I don't know how to objectively assess that (and I don't
 think it's even possible since different people/teams/ports will weigh
 these things differently).

I believe that the cognitive load is greater than any benefit from catching 
bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing 
to evaluate whether or not the expectation matches some notions of desired 
behavior.

And therein lies one possible source of disagreement.

But there is another source of disagreement: would adding a sixth facility that 
overlaps with (1-fail) or (3) help?  No, I don't believe it would.  It's just 
another mechanism leading to more possible arguments about which mechanism is 
better.

-F



Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Ryosuke Niwa
I have a concern that a lot of people wouldn't know what the correct
output is for a given test.

For a lot of pixel tests, deciding whether a given output is correct or not
is really hard, e.g. some seemingly insignificant anti-aliasing difference may
turn out to be the result of a bug in Skia or another graphics library, or in
the WebCore code that uses it.

As a result,

   - people may check in wrong "correct" results
   - people may add "failing" results even though the new results are more or
   less correct


This leads me to think that just checking in the current output as
-expected.png and filing bugs separately is a better approach.

- Ryosuke


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Dirk Pranke
On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo fpi...@apple.com wrote:

 On Aug 15, 2012, at 4:02 PM, Dirk Pranke dpra...@chromium.org wrote:

 On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo fpi...@apple.com wrote:
 Apparently I was somewhat unclear.  Let me restate.  We have the following 
 mechanisms available when a test fails:

 1) Check in a new -expected.* file.

 2) Modify the test.

 3) Modify a TestExpectations file.

 4) Add the test to a Skipped file.

 5) Remove the test entirely.

 I have no problem with (1) unless it is intended to mark the test as 
 expected-to-fail-but-not-crash.  I agree that using -expected.* to 
 accomplish what TestExpectations accomplishes is not valuable, but I 
 further believe that even TestExpectations is not valuable.

 I broadly prefer (2) whenever possible.

 I believe that (3) and (4) are redundant, and I don't buy the value of (3).

 I don't like (5) but we should probably do more of it for tests that have a 
 chronically low signal-to-noise ratio.


 Thank you for clarifying. I had actually written an almost identical
 list but didn't send it, so I think we're on the same page at least as
 far as understanding the problem goes ...

 So, I would describe my suggestion as an improved variant of the kind
 of (1) that can be used as expected-to-fail-but-not-crash (which
 I'll call 1-fail), and that we would use this in cases where we use
 (3), (4), or (1-fail) today.

 I would also agree that we should do (2) where possible, but I don't
 think this is easily possible for a large class of tests, especially
 pixel tests, although I am currently working on other things that will
 hopefully help here.

 Chromium certainly does a lot of (3) today, and some (1-fail). Other
 ports definitely use (1-fail) or (4) today, because (2) is rarely
 possible for many, many tests.

 We know that doing (1-fail), (3), or (4) causes real maintenance woes
 down the road, but also that doing (1-fail) or (3) catches real
 problems that simply skipping the test would not -- at some cost.
 Whether the benefit is worth the cost, is not known, of course, but I
 believe it is. I am hoping that my suggestion will have a lower
 overall cost than doing (1-fail) or (3).

 I also believe that the trade-off is known and, and specifically, I believe 
 that the cost of having any tests in the (1-fail) or (3) states is more 
 costly than having them in (4) or (5).


 You're proposing a new mechanism. I'm arguing that given the sheer number 
 of tests, and the overheads associated with maintaining them, (4) is the 
 broadly more productive strategy in terms of bugs-fixed/person-hours.  And, 
 increasing the number of mechanisms for dealing with tests by 20% is likely 
 to reduce overall productivity rather than helping anyone.


 Why do you believe this to be true? I'm not being flippant here ... I
 think this is a very plausible argument, and it may well be true, but
 I don't know what the criteria we would use to evaluate it are. Some
 of the possible factors are:

 * the complexity of the test infrastructure and the cognitive load it
 introduces on developers
 * the cost of bugs that are missed because we're skipping the tests
 intended to catch those bugs
 * the cost of looking at regressions and trying to figure out if the
 regression is something you care about or not
 * the cost of looking at the -expected results and trying to figure
 out if what is expected is correct or not

 There may be others as well, but the last three are all very real in
 my experience, and I believe they significantly outweigh the first
 one, but I don't know how to objectively assess that (and I don't
 think it's even possible since different people/teams/ports will weigh
 these things differently).

 I believe that the cognitive load is greater than any benefit from catching 
 bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing 
 to evaluate whether or not the expectation matches some notions of desired 
 behavior.

 And therein lies one possible source of disagreement.


Yes :)

 But there is another source of disagreement: would adding a sixth facility 
 that overlaps with (1-fail) or (3) help?  No, I don't believe it would.  It's 
 just another mechanism leading to more possible arguments about which 
 mechanism is better.

Perhaps. I think it will, obviously, or I wouldn't be proposing this
in the first place.

I welcome other opinions on this as well ...

-- Dirk


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Peter Kasting
On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo fpi...@apple.com wrote:

 I believe that the cognitive load is greater than any benefit from
 catching bugs incidentally by continuing to run a (1-fail) or (3) test, and
 continuing to evaluate whether or not the expectation matches some notions
 of desired behavior.


As someone who has spent a lot of time maintaining Chromium's expectations,
this seems clearly false, if your proposed alternative is to stop running
the test.  This is because a very common course of events is for a test to
begin failing, and then later on return to passing.  We (Chromium) see this
all the time with e.g. Skia changes, where for example the Skia folks will
rewrite gradient handling to more perfectly match some spec and as a result
dozens or hundreds of tests, many not explicitly intended to be about
gradient handling, will change and possibly begin passing.

By contrast, if we aren't running a test, we don't know when the test
begins passing again (except by trying to run it).  The resulting effect is
that skipped tests tend to remain skipped.  Tests that remain skipped are
no better than no tests.  And even if such tests are periodically retested,
once a test's output changes, there is a large window of time where the
test wasn't running, making it difficult to pinpoint exactly what caused
the change and whether the resulting effect is intentional and beneficial.

If we ARE running a test, then when the results change, knowing whether the
existing result was thought to be correct or not is a critical part of a
sheriff's job in deciding what to do about the change.  This is one reason
why Chromium has never gone down the path of simply checking in failure
expectations, and something that Dirk's proposal explicitly tries to
address while still allowing ports that (IMO mistakenly) don't care to
continue to not care.

We already have some good tooling (e.g. garden-o-matic) that could be
extended to show and update the small amount of additional info Dirk is
proposing.  I am very skeptical of abstract claims that this proposal
inflates complexity and decreases productivity in the absence of actually
testing a real workflow using the tools that we sheriffs really use to
maintain tree greenness.

I would like to see this proposal tested to get concrete feedback instead
of arguments on principle.

PK


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Filip Pizlo
The typical approach used in situations that you describe is to rebase, not 
skip.  This avoids the problem of not knowing when the test started passing.  
Hence, I'm not sure what you're implying.  Maybe a better example would help.


On Aug 15, 2012, at 5:39 PM, Peter Kasting pkast...@chromium.org wrote:

 On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo fpi...@apple.com wrote:
 I believe that the cognitive load is greater than any benefit from catching 
 bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing 
 to evaluate whether or not the expectation matches some notions of desired 
 behavior.
 
 As someone who has spent a lot of time maintaining Chromium's expectations, 
 this seems clearly false, if your proposed alternative is to stop running the 
 test.  This is because a very common course of events is for a test to begin 
 failing, and then later on return to passing.  We (Chromium) see this all the 
 time with e.g. Skia changes, where for example the Skia folks will rewrite 
 gradient handling to more perfectly match some spec and as a result dozens or 
 hundreds of tests, many not explicitly intended to be about gradient 
 handling, will change and possibly begin passing.
 
 By contrast, if we aren't running a test, we don't know when the test begins 
 passing again (except by trying to run it).  The resulting effect is that 
 skipped tests tend to remain skipped.  Tests that remain skipped are no 
 better than no tests.  And even if such tests are periodically retested, once 
 a test's output changes, there is a large window of time where the test 
 wasn't running, making it difficult to pinpoint exactly what caused the 
 change and whether the resulting effect is intentional and beneficial.
 
 If we ARE running a test, then when the results change, knowing whether the 
 existing result was thought to be correct or not is a critical part of a 
 sheriff's job in deciding what to do about the change.  This is one reason 
 why Chromium has never gone down the path of simply checking in failure 
 expectations, and something that Dirk's proposal explicitly tries to address 
 while still allowing ports that (IMO mistakenly) don't care to continue to 
 not care.
 
 We already have some good tooling (e.g. garden-o-matic) that could be 
 extended to show and update the small amount of additional info Dirk is 
 proposing.  I am very skeptical of abstract claims that this proposal 
 inflates complexity and decreases productivity in the absence of actually 
 testing a real workflow using the tools that we sheriffs really use to 
 maintain tree greenness.
 
 I would like to see this proposal tested to get concrete feedback instead of 
 arguments on principle.

I would not like to see our testing infrastructure get any more complicated 
than it already is, just because of a philosophical direction chosen 
unilaterally by one port.

 
 PK



Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Peter Kasting
On Wed, Aug 15, 2012 at 5:45 PM, Filip Pizlo fpi...@apple.com wrote:

 The typical approach used in situations that you describe is to rebase,
 not skip.


Which is precisely what Dirk is proposing.  Literally the only difference
here is that the rebased test expectation would contain an optional
notation about whether we believe the new baseline to be correct or
incorrect.
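
To illustrate how small that difference is, a harness's baseline lookup would
only have to treat the three suffixes as alternatives and report which one it
found.  The following is a sketch under the assumption that exactly one
baseline exists per test, as in Dirk's proposal; it is not the actual
run-webkit-tests code, and the names and paths are illustrative only:

    # Sketch: locate the single checked-in baseline for a test and report
    # whether it is believed correct (-passing), believed incorrect
    # (-failing), or of unknown status (-expected).
    import os

    BASELINE_SUFFIXES = ('-passing', '-failing', '-expected')

    def find_baseline(test_path, extension='.txt'):
        root = os.path.splitext(test_path)[0]
        for suffix in BASELINE_SUFFIXES:
            candidate = root + suffix + extension
            if os.path.exists(candidate):
                # True for -passing, False for -failing, None (unknown) otherwise.
                return candidate, {'-passing': True, '-failing': False}.get(suffix)
        return None, None

    # e.g. find_baseline('LayoutTests/fast/js/multiply.html') might return
    # ('LayoutTests/fast/js/multiply-failing.txt', False) for a known-broken test.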

Ports that don't care, or tests where we don't know, will be totally
unaffected.  I am not sure why this bothers you so much.  You talk about
making the infrastructure more complicated, but it doesn't seem like there
is much additional complication involved, and what minor complication is
involved is being volunteered to be handled by Dirk and other folks.

PK


Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Filip Pizlo

On Aug 15, 2012, at 5:51 PM, Peter Kasting pkast...@chromium.org wrote:

 On Wed, Aug 15, 2012 at 5:45 PM, Filip Pizlo fpi...@apple.com wrote:
 The typical approach used in situations that you describe is to rebase, not 
 skip.
 
 Which is precisely what Dirk is proposing.  Literally the only difference 
 here is that the rebased test expectation would contain an optional notation 
 about whether we believe the new baseline to be correct or incorrect.

Can you articulate the value that this gives?  We must weigh that against the 
downsides:

1) Increased complexity of infrastructure.

2) Possibility of the sheriff getting it wrong.

(2) concerns me most.  We're talking about using filenames to serve as a kind 
of unchecked comment.  We already know that comments are usually bad because 
there is no protection against them going stale.

 
 Ports that don't care, or tests where we don't know, will be totally 
 unaffected.  I am not sure why this bothers you so much.  You talk about 
 making the infrastructure more complicated, but it doesn't seem like there is 
 much additional complication involved, and what minor complication is 
 involved is being volunteered to be handled by Dirk and other folks.

Ultimately it will lead to a further multiplication of the number of possible ways 
of doing things, which is bad.  I'm sure some would love to get rid of Skipped 
files just as much as I would love to get rid of TestExpectations files.  Both 
are valid things to love, and imply that there must surely exist a middle 
ground: a way of doing things that is strictly better than the sum of the two.

In particular, to further clarify my position, if someone were to argue that 
Dirk's proposal would be a wholesale replacement for TestExpectations, then I 
would be more likely to be on board, since I very much like the idea of 
reducing the number of ways of doing things.  Maybe that's a good way to reach 
compromise.

Dirk, what value do you see in TestExpectations were your change to be landed?  
Do scenarios still exist where there would be a test for which (a) there is no 
-fail.* file, (b) the test is not skipped, and (c) it's marked with some 
annotation in TestExpectations?  I'm most interested in whether such 
scenarios exist, since in my experience, whenever a test is not rebased, is not 
skipped, and is marked as failing in TestExpectations, it ends up just causing 
gardening overhead later.

-F



Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations

2012-08-15 Thread Peter Kasting
On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo fpi...@apple.com wrote:

 2) Possibility of the sheriff getting it wrong.

 (2) concerns me most.  We're talking about using filenames to serve as a
 kind of unchecked comment.  We already know that comments are usually bad
 because there is no protection against them going stale.


I don't see how this is parallel to stale comments.  Tests get continually
compared to the given output and we see immediately when something changes.

It is certainly possible for whoever assigns the filenames to get things
wrong.  There are basically two mitigations of this.  One is allowing the
existing expected.xxx file extensions to remain, and encouraging people
to leave them as-is when they're not sure whether the existing result is
correct.  The other is for sheriffs to use this signal as just that -- a
signal -- just as today we use the expected.xxx files as a signal of what
the correct output might be.  The difference is that this can generally be
considered a stronger signal.  Historically, there's been no real attempt
to guarantee that an expected result is anything other than the test's
current behavior.

I'm sure some would love to get rid of Skipped files just as much as I
 would love to get rid of TestExpectations files.  Both are valid things to
 love, and imply that there must surely exist a middle ground: a way of
 doing things that is strictly better than the sum of the two.


That's exactly what we're trying to do.

The value of this change is that hopefully it would dramatically reduce the
amount of content in these, especially in TestExpectations files.  If you
want to kill these so much, then this is a change you should at least let
us test!

In particular, to further clarify my position, if someone were to argue
 that Dirk's proposal would be a wholesale replacement for TestExpectations,
 then I would be more likely to be on board, since I very much like the idea
 of reducing the number of ways of doing things.  Maybe that's a good way to
 reach compromise.


It's hard to know if we could completely eliminate them without testing
this, but yes, one goal here is to greatly reduce the need for
TestExpectations lines.  A related goal is to make the patterns and
mechanisms used by all ports more similar.  As someone who has noted his
frustration with both different ways of doing things and philosophical
directions chosen by one port, you would hopefully be well-served by this
direction.

PK