Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke dpra...@chromium.org wrote: On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa rn...@webkit.org wrote: On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org wrote: On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org wrote: I agree with the priorities above, at least. I also agree with the overall goal of making our implementation match our philosophy on testing. Ryosuke has raised a very valid point: it is not clear what a Gardener should do with a test that has changed behavior. Personally, the area I deal with is very susceptible to the single pixel differences matter issue, and I would prefer gardeners to flag updates in some way so that I can look at it myself (or one of the very few other experts can look at it). Maybe I'm just a control freak. In the current situation, gardeners are not likely to roll out a patch short of build failures, obvious test or perf regressions, or security issues. For the bulk of cases where a test result has changed, the gardener will either add it to expectations or go ahead and update the result. The result is a mixture of incorrect expected results and files listed in TestExpectations that may or may not be correct. The way I deal with this right now is to maintain a snapshot of TestExpectations and occasionally go through and look for newly added results, and check if they are in fact correct. And every now and then go look at older entries to see if they have been updated. Or I get lucky and notice that a new expected result has been checked in that is incorrect (I can mostly catch that by following checkins in the code). The proposed change does not alter this fundamental dynamic. Depending on what policy gardeners adopt, they might now rename the result as failing and remove the expected, or maybe they'll just update the expected. I'm still in a situation where I don't know the exact status of any given result, and where I have no easy way of knowing if a gardener has correctly updated an expected or failing image. So, if we had -passing/-failing, then people who knew what the results were supposed to be in a given directory (call them owners) could rename the existing -expected files into one category or another, and then a gardener could add newly-failing tests as -expected; it would then be trivial for the owner to look for new -expected files in that directory. Right now, an owner would have to periodically scan the directory for files changed since the last time they scanned the directory, and would have no way of knowing whether an updated file was correct or not (perhaps you could filter out by who committed the change, but it's certainly harder than ls *-expected). So, I would contend that my proposal would make it easier for us to have a process by which gardeners could punt on changes they're unsure about and for owners to subsequently review them. I think you could have a similar model if we just checked in *all* new results into -expected, but it would be harder to implement (perhaps not impossible, though, I haven't given it much thought yet). That seems like a much better model. To clarify: what does That refer to? My idea for gardeners to check in -expected and owners to move to -passing/failing] or the idea that we check in new results into -expected and have some other way of reviewing new baselines? (or something else)? The former. 
In particular, the idea that only experts of a given area can check in -correct/-failing is promising because there are many cases where pixel tests fail due to completely unrelated bugs and we want to consider the output still correct. But we can implement this policy without the said proposal. All we need to do is for gardeners to add test expectations and wait for experts to come in and decide whether we need a rebaseline, a bug fix, or a rollout. On the other hand, the pixel test output that's correct to one expert may not be correct to another expert. For example, I might think that one editing test's output is correct because it shows that the feature we're testing in the test is working properly. But Stephen might realize that this -expected.png contains an off-by-one Skia bug. So categorizing -correct.png and -failure.png may require multiple experts looking at the output, which may or may not be practical. Furthermore, some areas may not have an adequate number of experts, or experts who can triage -expected.png, which means that we'll further fragment the process depending on how many experts we have for a given area. In a world where all failures are triaged promptly by qualified owners, all of this mechanism is unnecessary. Unfortunately, we don't live in that world at the moment. I'd argue that this is a bigger problem worth solving.
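For what it's worth, the owner-review workflow described above (gardeners land new -expected files; owners periodically sweep their directories and promote each one to -passing or -failing) is straightforward to script against the repository history. A minimal sketch, assuming a Git checkout of WebKit; the directory and the last-reviewed revision are placeholder inputs, not real values:

#!/usr/bin/env python
# Sketch: list the -expected baselines touched since a given revision, so an
# "owner" can review them and decide whether each should become -passing or
# -failing. Assumes a Git checkout; LAST_REVIEWED is a placeholder commit.
import subprocess

def baselines_needing_review(directory, last_reviewed):
    """Return -expected.* files under `directory` changed since `last_reviewed`."""
    output = subprocess.check_output(
        ['git', 'diff', '--name-only', last_reviewed, 'HEAD', '--', directory],
        universal_newlines=True)
    return [path for path in output.splitlines()
            if '-expected.' in path.rsplit('/', 1)[-1]]

if __name__ == '__main__':
    for path in baselines_needing_review('LayoutTests/editing', 'LAST_REVIEWED'):
        print(path)

This is essentially the "harder than ls *-expected" step automated; the same loop could move the reviewed files to -passing/-failing names once the owner has looked at them.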
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Fri, Aug 17, 2012 at 8:07 AM, Ryosuke Niwa rn...@webkit.org wrote: On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke dpra...@chromium.org wrote: On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa rn...@webkit.org wrote: On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org wrote: On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org wrote: I agree with the priorities above, at least. I also agree with the overall goal of making our implementation match our philosophy on testing. Ryosuke has raised a very valid point: it is not clear what a Gardener should do with a test that has changed behavior. Personally, the area I deal with is very susceptible to the single pixel differences matter issue, and I would prefer gardeners to flag updates in some way so that I can look at it myself (or one of the very few other experts can look at it). Maybe I'm just a control freak. In the current situation, gardeners are not likely to roll out a patch short of build failures, obvious test or perf regressions, or security issues. For the bulk of cases where a test result has changed, the gardener will either add it to expectations or go ahead and update the result. The result is a mixture of incorrect expected results and files listed in TestExpectations that may or may not be correct. The way I deal with this right now is to maintain a snapshot of TestExpectations and occasionally go through and look for newly added results, and check if they are in fact correct. And every now and then go look at older entries to see if they have been updated. Or I get lucky and notice that a new expected result has been checked in that is incorrect (I can mostly catch that by following checkins in the code). The proposed change does not alter this fundamental dynamic. Depending on what policy gardeners adopt, they might now rename the result as failing and remove the expected, or maybe they'll just update the expected. I'm still in a situation where I don't know the exact status of any given result, and where I have no easy way of knowing if a gardener has correctly updated an expected or failing image. So, if we had -passing/-failing, then people who knew what the results were supposed to be in a given directory (call them owners) could rename the existing -expected files into one category or another, and then a gardener could add newly-failing tests as -expected; it would then be trivial for the owner to look for new -expected files in that directory. Right now, an owner would have to periodically scan the directory for files changed since the last time they scanned the directory, and would have no way of knowing whether an updated file was correct or not (perhaps you could filter out by who committed the change, but it's certainly harder than ls *-expected). So, I would contend that my proposal would make it easier for us to have a process by which gardeners could punt on changes they're unsure about and for owners to subsequently review them. I think you could have a similar model if we just checked in *all* new results into -expected, but it would be harder to implement (perhaps not impossible, though, I haven't given it much thought yet). That seems like a much better model. To clarify: what does That refer to? My idea for gardeners to check in -expected and owners to move to -passing/failing] or the idea that we check in new results into -expected and have some other way of reviewing new baselines? (or something else)? The former. 
In particular, the idea that only experts of a given area can check in -correct/-failing is promising because there are many cases where pixel tests fail due to completely unrelated bugs and we want to consider the output still correct. But we can implement this policy without the said proposal. All we need to do is for gardeners to add test expectations and wait for experts to come in and decide whether we need a rebaseline, a bug fix, or a rollout. A problem with just adding test expectations is that we lose visibility into whether the test is failing the same way or a different way. That is one of the (if not the primary) motivations for my proposal ... On the other hand, the pixel test output that's correct to one expert may not be correct to another expert. For example, I might think that one editing test's output is correct because it shows that the feature we're testing in the test is working properly. But Stephen might realize that this -expected.png contains an off-by-one Skia bug. So categorizing -correct.png and -failure.png may require multiple experts looking at the output, which may or may not be practical. Perhaps. Obviously (a) there's a limit to what you can do here, and (b) a test that requires multiple experts to verify its correctness is, IMO, a bad test :).
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
While I don't have Dirk's patience, stamina, and determination to explain and evangelize the advantages of better awareness of current LayoutTests coverage, I will chime in to say that I gladly support any effort to improve it. I am especially excited about the ability to distinguish failing, passing, and failing-in-a-different-way expectations. :DG
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke dpra...@chromium.org wrote: On the other hand, the pixel test output that's correct to one expert may not be correct to another expert. For example, I might think that one editing test's output is correct because it shows that the feature we're testing in the test is working properly. But Stephen might realize that this -expected.png contains an off-by-one Skia bug. So categorizing -correct.png and -failure.png may require multiple experts looking at the output, which may or may not be practical. Perhaps. Obviously (a) there's a limit to what you can do here, and (b) a test that requires multiple experts to verify its correctness is, IMO, a bad test :). With that argument, almost all pixel tests are bad tests because pixel tests in editing, for example, involve editing, rendering, and graphics code. I don't think any single person can comprehend the entire stack to tell with 100% confidence that the test result is exactly and precisely correct. I think we should just check in whatever result we're currently seeing as -expected.png because then we at least wouldn't have any ambiguity in the process. We just check in whatever we're currently seeing, file a bug if we see a problem with the new result, and possibly roll out the patch after talking with the author/reviewer. This is basically saying we should just follow the existing non-Chromium process, right? Yes. In addition, doing so will significantly reduce the complexity of the current process. This would seem to bring us back to step 1: it doesn't address the problem that I identified with the existing non-Chromium process, namely that a non-expert can't tell by looking at the checkout what tests are believed to be passing or not. What is the use case of this? I've been working on WebKit for more than 3 years, and I've never had to think about whether a test for an area outside of my expertise has the correct output or not, other than when I was gardening. And having -correct / -failing wouldn't have helped me know what the correct output was when I was gardening anyway, because the new output may well be a new -correct or -failing result. I don't think searching bugzilla (as it is currently used) is a workable alternative. Why not? Bugzilla is the tool we use to triage and track bugs. I don't see a need for an alternative method to keep track of bugs. The new result we check in may not be 100% right, but experts — e.g. me for editing and Stephen for Skia — can come in and look at recent changes to triage any new failures. In fact, it might be worthwhile for us to invest our time in improving tools to do this. For example, can we have a portal where I can see new rebaselines that happened in LayoutTests/editing and LayoutTests/platform/*/editing since the last time I visited the portal? e.g. it could show a chronological timeline of baselines along with a possible cause (a list of changesets, maybe?) for each baseline. We could build such a portal, sure. I would be interested to hear from others whether such a thing would be more or less useful than my proposal. Of course, you could just set up a watchlist for new expectations today. Probably not quite as polished as we could get with a portal, but dirt simple. That might be useful as long as it has an option to give us a digest instead of sending me an e-mail per commit. - Ryosuke
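The digest Ryosuke asks for doesn't strictly need a portal; a first approximation can be pulled straight out of the repository history. A rough sketch, assuming a Git checkout (the same idea works against SVN with svn log -v); the directories and the time window are only examples:

#!/usr/bin/env python
# Sketch of the "digest" idea: a chronological list of recent baseline changes
# under a few directories, grouped by the commit that caused them.
# Assumes a Git checkout; the directories and window are illustrative.
import subprocess

DIRECTORIES = ['LayoutTests/editing', 'LayoutTests/platform']  # examples
WINDOW = '1 week ago'

def recent_rebaselines():
    log = subprocess.check_output(
        ['git', 'log', '--since=' + WINDOW, '--name-only', '--date=short',
         '--pretty=format:@@%h %ad %s', '--'] + DIRECTORIES,
        universal_newlines=True)
    digest = []
    for line in log.splitlines():
        if line.startswith('@@'):
            digest.append((line[2:], []))          # (commit summary, baselines)
        elif '-expected.' in line or '-passing.' in line or '-failing.' in line:
            digest[-1][1].append(line)
    return [(commit, files) for commit, files in digest if files]

if __name__ == '__main__':
    for commit, files in recent_rebaselines():
        print(commit)
        for path in files:
            print('    ' + path)

Grouping by commit also gives the "possible cause" for each rebaseline essentially for free, since the changeset that touched the baseline is right there in the log.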
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Fri, Aug 17, 2012 at 11:29 AM, Ryosuke Niwa rn...@webkit.org wrote: On Fri, Aug 17, 2012 at 11:06 AM, Dirk Pranke dpra...@chromium.org wrote: On the other hand, the pixel test output that's correct to one expert may not be correct to another expert. For example, I might think that one editing test's output is correct because it shows the feature we're testing in the test is working properly. But Stephen might realizes that this -expected.png contains off-by-one Skia bug. So categorizing -correct.png and -failure.png may require multiple experts looking at the output, which may or may not be practical. Perhaps. Obviously (a) there's a limit to what you can do here, and (b) a test that requires multiple experts to verify its correctness is, IMO, a bad test :). With that argument, almost all pixel tests are bad tests because pixel tests in editing, for example, involve editing, rendering, and graphics code. If in order to tell a pixel test is correct you need to be aware of how all of that stuff works, then, yes, it's a bad test. It can fail too many different ways, and is testing too many different bits of information. As Filip might suggest, it would be nice if we could split such tests up. That said, I will freely grant that in many cases we can't easily do better given the way things are currently structured, and splitting up such tests would be an enormous amount of work. If the pixel test is testing whether a rectangle is actually green or actually red, such a test is fine, doesn't need much subject matter expertise, and it is hard to imagine how you'd test such a thing some other way. I don't think any single person can comprehend the entire stack to tell with a 100% confidence that the test result is exactly and precisely correct. Sure. Such a high bar should be avoided. I think we should just check-in whatever result we're currently seeing as -expected.png because we wouldn't at least have any ambiguity in the process then. We just check in whatever we're currently seeing and file a bug if we see a problem with the new result and possibly rollout the patch after talking with the author/reviewer. This is basically saying we should just follow the existing non-Chromium process, right? Yes. In addition, doing so will significantly reduce the complexity of the current process. This would seem to bring us back to step 1: it doesn't address the problem that I identified with the existing non-Chromium process, namely that a non-expert can't tell by looking at the checkout what tests are believed to be passing or not. What is the use case of this? I've been working on WebKit for more than 3 years, and I've never had to think about whether a test for an area outside of my expertise has the correct output or not other than when I was gardening. And having -correct / -failing wouldn't have helped me knowing what the correct output when I was gardening anyway because the new output may as well as be new -correct or -failing result. I've done this frequently when gardening, when simply trying to learn how a given chunk of code works and how a given chunk of tests work (or don't work), and when trying to get a sense of how well our product is or isn't passing tests. Perhaps this is the case because I tend to more work on infrastructure and testing, and look at stuff shallowly across the whole tree rather than in depth in particular areas as you do. I don't think searching bugzilla (as it is currently used) is a workable alternative. Why not? Bugzilla is the tool we use to triage and track bugs. 
I don't see a need for an alternative method to keep track of bugs. The way we currently use bugzilla, it is difficult if not impossible to find a concise and accurate list of all the failing layout tests meeting any sort of filename- or directory-based criteria (maybe you can do it just for editing, I don't know). The layout test summary reports that Ojan sends out to the chromium devs is an example of this: he generates that from the TestExpectations files; doing so from bugzilla is not currently feasible. Note that we could certainly extend bugzilla to make this easier, if there was consensus to do so (and I would be in favor of this, but that would also incur more process than we have today). - Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
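As a point of comparison, "generate the report from the TestExpectations files" is only a few lines of scripting, which is part of why it is so much easier than querying bugzilla. A sketch of the idea, assuming the older one-line format where an entry looks roughly like "BUGWK12345 MAC : editing/inserting/foo.html = IMAGE+TEXT"; a real report would use webkitpy's actual parser rather than this ad-hoc one:

# Sketch: a per-directory summary of suppressed tests pulled from a
# TestExpectations file. Assumes the older one-line "BUG... : path = FAILURE"
# format with // comments; webkitpy has a real parser that should be used instead.
from collections import defaultdict

def summarize(expectations_path, prefix=''):
    counts = defaultdict(int)
    with open(expectations_path) as f:
        for line in f:
            line = line.split('//')[0].strip()    # drop comments and blank lines
            if not line or ':' not in line or '=' not in line:
                continue
            test = line.split(':', 1)[1].split('=')[0].strip()
            if not test.startswith(prefix):
                continue
            counts[test.split('/')[0]] += 1
    return dict(counts)

# Example: how many suppressed tests live under each top-level directory.
# for directory, count in sorted(summarize('TestExpectations').items()):
#     print('%5d %s' % (count, directory))

Bugzilla has no comparable structured mapping from bugs to test paths, which is why a directory-filtered list like this is hard to pull out of it today.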
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 5:19 PM, Ryosuke Niwa rn...@webkit.org wrote: I have a concern that a lot of people wouldn't know what the correct output is for a given test. For a lot of pixel tests, deciding whether a given output is correct or not is really hard. e.g. some seemingly insignificant anti-aliasing difference may turn out to be the result of a bug in Skia or another graphics library, or in the WebCore code that uses it. As a result, people may check in wrong correct results, or may add failing results even though the new results are more or less correct. This leads me to think that just checking in the current output as -expected.png and filing bugs separately is a better approach. I think your observations are correct, but at least my experience as a gardener/sheriff leads me to a different conclusion. Namely, when I'm looking at a newly failing test, it is difficult if not impossible for me to know if the existing baseline was previously believed to be correct or not, and thus it's hard for me to tell if the new baseline should be considered worse, better, or different. In theory I could go look at the changelog for each test, but I would be skeptical that it had enough useful information (I would expect most comments to be along the lines of rebaselining after X, with no indication of whether the output is correct or not). This is just a theory, though. This is why I want to test this theory :). It seems like if we got experience with this on one (or more) ports for a couple of months we would have a much more well-informed opinion, and I'm not seeing a huge downside to at least trying this idea out. -- Dirk
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo fpi...@apple.com wrote: 2) Possibility of the sheriff getting it wrong. (2) concerns me most. We're talking about using filenames to serve as a kind of unchecked comment. We already know that comments are usually bad because there is no protection against them going stale. Sheriffs can already get things wrong (and rebaseline when they shouldn't). I believe that adding passing/failing to expected will make things better in this regard, not worse. Another idea/observation is that if we have multiple types of expectation files, it might be easier to set up watchlists, e.g., let me know whenever a file gets checked into fast/forms with an -expected or -failing result. It seems like this might be useful, but I'm not sure. In particular, to further clarify my position, if someone were to argue that Dirk's proposal would be a wholesale replacement for TestExpectations, then I would be more likely to be on board, since I very much like the idea of reducing the number of ways of doing things. Maybe that's a good way to reach compromise. Dirk, what value do you see in TestExpectations were your change to be landed? Do scenarios still exist where there would be a test for which (a) there is no -fail.* file, (b) the test is not skipped, and (c) it's marked with some annotation in TestExpectations? I'm most interested in the question of such scenarios exist, since in my experience, whenever a test is not rebased, is not skipped, and is marked as failing in TestExpectations, it ends up just causing gardening overhead later. This is a good question, because it is definitely my intent that this change replace some existing practices, not add to them. Currently, the Chromium port uses TestExpectations entries for four different kinds of things: tests we don't ever plan to fix (WONTFIX), tests that we skip because not doing so causes other tests to break, tests that fail (reliably), and tests that are flaky. Skipped files do not let you distinguish (programmatically) between the first two categories, and so my plan is to replace Skipped files with TestExpectations (using the new syntax discussed a month or so ago) soon (next week or two at the latest). I would like to replace using TestExpectations for failing tests (at least for tests that are expected to keep failing indefinitely because someone isn't working on an active fix) with this new mechanism. That leaves flaky tests. One can debate what the right thing to do w/ flaky tests is here; I'm inclined to argue that flakiness is at least as bad as failing, and we should probably be skipping them, but the Chromium port has not yet actively tried this approach (I think other ports probably have experience here, though). Does that help answer your question / sway you at all? -- Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Thu, Aug 16, 2012 at 2:05 PM, Dirk Pranke dpra...@chromium.org wrote: I think your observations are correct, but at least my experience as a gardener/sheriff leads me to a different conclusion. Namely, when I'm looking at a newly failing test, it is difficult if not impossible for me to know if the existing baseline was previously believed to be correct or not, and thus it's hard for me to tell if the new baseline should be considered worse, better, or different.
How does the proposal solve this problem? Right now gardeners have two options:
- Rebaseline
- Add a test expectation
Once we implement the proposal, we have at least three options:
- Rebaseline correct.png
- Add/rebaseline expected.png
- Add/rebaseline failure.png
- (Optionally) Add a test expectation.
And that's a lot of options to choose from. The more options we have, the more likely people are to make mistakes. We're already inconsistent in how -expected.png is used because some people make mistakes. I'm afraid that adding another set of expected results will result in even more mistakes and an unrecoverable mess.
This is why I want to test this theory :). It seems like if we got experience with this on one (or more) ports for a couple of months we would have a much more well-informed opinion, and I'm not seeing a huge downside to at least trying this idea out.
Sure. But if we're doing this experiment on trunk, thereby imposing significant cognitive load on every other port, then I'd like to see us set *both* an exact date at which we decide whether this approach is good or not and the criteria by which we decide this, *before* the experiment starts. Like Filip, I'm *extremely* concerned about the prospect of us introducing yet-another-way-of-doing-things and not being able to get rid of it later. - Ryosuke
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo fpi...@apple.com wrote: On Aug 16, 2012, at 2:13 PM, Dirk Pranke wrote: On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo fpi...@apple.com wrote: 2) Possibility of the sheriff getting it wrong. (2) concerns me most. We're talking about using filenames to serve as a kind of unchecked comment. We already know that comments are usually bad because there is no protection against them going stale. Sheriffs can already get things wrong (and rebaseline when they shouldn't). I believe that adding passing/failing to expected will make things better in this regard, not worse. In what way do things become better? Because other people will see what the sheriff believed about the result? Can you articulate some more about what happens when you have both -expected and -failing? My specific concern is that after someone checks in a fix, we will have some sheriff accidentally misjudge the change in behavior to be a regression, and check in a -failing file. And then we end up in a world of confusion. This is why I think that just having -expected files is better. It is a kind of recognition that we're tracking changes in behavior, rather than comparing against some almighty notion of what it means to be correct. Another idea/observation is that if we have multiple types of expectation files, it might be easier to set up watchlists, e.g., let me know whenever a file gets checked into fast/forms with an -expected or -failing result. It seems like this might be useful, but I'm not sure. In particular, to further clarify my position, if someone were to argue that Dirk's proposal would be a wholesale replacement for TestExpectations, then I would be more likely to be on board, since I very much like the idea of reducing the number of ways of doing things. Maybe that's a good way to reach compromise. Dirk, what value do you see in TestExpectations were your change to be landed? Do scenarios still exist where there would be a test for which (a) there is no -fail.* file, (b) the test is not skipped, and (c) it's marked with some annotation in TestExpectations? I'm most interested in the question of such scenarios exist, since in my experience, whenever a test is not rebased, is not skipped, and is marked as failing in TestExpectations, it ends up just causing gardening overhead later. This is a good question, because it is definitely my intent that this change replace some existing practices, not add to them. Currently, the Chromium port uses TestExpectations entries for four different kinds of things: tests we don't ever plan to fix (WONTFIX), tests that we skip because not doing so causes other tests to break, tests that fail (reliably), and tests that are flaky. Skipped files do not let you distinguish (programmatically) between the first two categories, and so my plan is to replace Skipped files with TestExpectations (using the new syntax discussed a month or so ago) soon (next week or two at the latest). I would like to replace using TestExpectations for failing tests (at least for tests that are expected to keep failing indefinitely because someone isn't working on an active fix) with this new mechanism. That leaves flaky tests. One can debate what the right thing to do w/ flaky tests is here; I'm inclined to argue that flakiness is at least as bad as failing, and we should probably be skipping them, but the Chromium port has not yet actively tried this approach (I think other ports probably have experience here, though). 
Does that help answer your question / sway you at all? Yes, it does - it answers my question, though it perhaps doesn't sway me. My concerns are still that: 1) Switching to skipping flaky tests wholesale in all ports would be great, and then we could get rid of the flakiness support. This is not necessarily helpful in some cases. There are cases where one test makes subsequent tests flaky, so skipping the flaky test doesn't fix the problem. Instead, it just makes the next test flaky. The right fix, of course, is to skip/fix the culprit, but pinpointing the test that makes subsequent tests flaky has proved to be a time-consuming and challenging task. 2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if the test doesn't crash. Correct? Either way, it's confusing. Yes, I'd argue that they should just be skipped, but that's bikeshedding for another time. - Ryosuke
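On the "which earlier test is poisoning the victim" problem: it is tedious by hand, but mechanically it is just a bisection over the tests that run before the flaky one. A rough sketch of that idea; run_tests() here is a hypothetical helper assumed to run the given tests followed by the victim and report whether the victim failed, and the sketch assumes a single, deterministic culprit (real flakiness usually needs several repeated runs per step):

# Sketch: bisect the tests that run before a flaky "victim" test to find the
# single test that makes it fail. run_tests(tests) is hypothetical: it runs
# `tests` in order, then the victim, and returns True if the victim failed.
def find_culprit(preceding_tests, run_tests):
    tests = list(preceding_tests)
    while len(tests) > 1:
        half = len(tests) // 2
        first, second = tests[:half], tests[half:]
        # If running only the first half still breaks the victim, the culprit
        # is in the first half; otherwise assume it is in the second half.
        tests = first if run_tests(first) else second
    return tests[0] if tests and run_tests(tests) else None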
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Thu, Aug 16, 2012 at 2:32 PM, Filip Pizlo fpi...@apple.com wrote: 1) Switching to skipping flaky tests wholesale in all ports would be great, and then we could get rid of the flakiness support. Then you don't notice when a flaky test stops being flaky. The cost of the flakiness support on the project does not seem large to me, and it's pretty frequent that a test will only be flaky for a few hundred runs (e.g. due to a bad checkin that gets fixed), so we can then remove it from TestExpectations and run the test as normal. 2) The WONTFIX mode in TestExpectations feels to me more like a statement that you're just trying to see if the test doesn't crash. Correct? Either way, it's confusing. WONTFIX is for tests that don't make sense to fix for the given port (e.g. dashboard-specific tests for non-Apple ports). It's a way of distinguishing tests that we're skipping because of a bug on our part vs. tests that we're skipping because the test doesn't apply to the port. 3) Your new mechanism feels like it's already covered by our existing use of -expected files. I'm not quite convinced that having -failing in addition to -expected files would be all that helpful. But there are many cases where we *know* the result is incorrect, e.g. it doesn't match a spec. Sure, there are also many cases where it's not clear what the correct behavior is.
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Thu, Aug 16, 2012 at 3:04 PM, Dirk Pranke dpra...@chromium.org wrote: On Thu, Aug 16, 2012 at 2:23 PM, Ryosuke Niwa rn...@webkit.org wrote: Like Filip, I'm extremely concerned about the prospect of us introducing yet-another-way-of-doing-things and not being able to get rid of it later. Presumably the only way we'd be unable to get rid of it would be if some port actually liked it, in which case, why would we get rid of it? Because we already have too many ways to deal with test failures. Why don't we at least merge Skipped and TestExpectations first? - Ryosuke
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
1. On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.orgwrote: On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org wrote: I agree with the priorities above, at least. I also agree with the overall goal of making our implementation match our philosophy on testing. Ryosuke has raised a very valid point: it is not clear what a Gardener should do with a test that has changed behavior. Personally, the area I deal with is very susceptible to the single pixel differences matter issue, and I would prefer gardeners to flag updates in some way so that I can look at it myself (or one of the very few other experts can look at it). Maybe I'm just a control freak. In the current situation, gardeners are not likely to roll out a patch short of build failures, obvious test or perf regressions, or security issues. For the bulk of cases where a test result has changed, the gardener will either add it to expectations or go ahead and update the result. The result is a mixture of incorrect expected results and files listed in TestExpectations that may or may not be correct. The way I deal with this right now is to maintain a snapshot of TestExpectations and occasionally go through and look for newly added results, and check if they are in fact correct. And every now and then go look at older entries to see if they have been updated. Or I get lucky and notice that a new expected result has been checked in that is incorrect (I can mostly catch that by following checkins in the code). The proposed change does not alter this fundamental dynamic. Depending on what policy gardeners adopt, they might now rename the result as failing and remove the expected, or maybe they'll just update the expected. I'm still in a situation where I don't know the exact status of any given result, and where I have no easy way of knowing if a gardener has correctly updated an expected or failing image. So, if we had -passing/-failing, then people who knew what the results were supposed to be in a given directory (call them owners) could rename the existing -expected files into one category or another, and then a gardener could add newly-failing tests as -expected; it would then be trivial for the owner to look for new -expected files in that directory. Right now, an owner would have to periodically scan the directory for files changed since the last time they scanned the directory, and would have no way of knowing whether an updated file was correct or not (perhaps you could filter out by who committed the change, but it's certainly harder than ls *-expected). So, I would contend that my proposal would make it easier for us to have a process by which gardeners could punt on changes they're unsure about and for owners to subsequently review them. I think you could have a similar model if we just checked in *all* new results into -expected, but it would be harder to implement (perhaps not impossible, though, I haven't given it much thought yet). That seems like a much better model. Right now, one of the biggest problem is that people rebaseline tests or add test expectations without understanding what the correct output is. It happens for almost all ports. Having -correct.png doesn't necessarily help that because even a seemingly innocent one-pixel difference could be a bug (a patch may only add a baseline to one platform), and it's really hard to tell what the expected result is unless you have worked on the rendering code for a long time. 
On the other hand, if you follow the chromium model of just suppressing the file, then we have no idea if the failure you currently see has any relationship to the failure that was originally seen, or if the file that is currently checked in has any relationship to what correct might now need to be (we know that the file was correct at some point but is now stale). So, in the chromium model, at least, you can do a grep through TestExpectations to look for failing tests, but you don't have the history you would get by looking at a history of changes to -expected/-passing/-failing. In a world where all failures are triaged promptly by qualified owners, all of this mechanism is unnecessary. Unfortunately, we don't live in that world at the moment. I'd argue that this is a bigger problem worth solving. Differentiating expected failure and correct results isn't going to help if gardeners were to keep making mistakes. - Ryosuke ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa rn...@webkit.org wrote: On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke dpra...@chromium.org wrote: On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney schen...@chromium.org wrote: I agree with the priorities above, at least. I also agree with the overall goal of making our implementation match our philosophy on testing. Ryosuke has raised a very valid point: it is not clear what a Gardener should do with a test that has changed behavior. Personally, the area I deal with is very susceptible to the single pixel differences matter issue, and I would prefer gardeners to flag updates in some way so that I can look at it myself (or one of the very few other experts can look at it). Maybe I'm just a control freak. In the current situation, gardeners are not likely to roll out a patch short of build failures, obvious test or perf regressions, or security issues. For the bulk of cases where a test result has changed, the gardener will either add it to expectations or go ahead and update the result. The result is a mixture of incorrect expected results and files listed in TestExpectations that may or may not be correct. The way I deal with this right now is to maintain a snapshot of TestExpectations and occasionally go through and look for newly added results, and check if they are in fact correct. And every now and then go look at older entries to see if they have been updated. Or I get lucky and notice that a new expected result has been checked in that is incorrect (I can mostly catch that by following checkins in the code). The proposed change does not alter this fundamental dynamic. Depending on what policy gardeners adopt, they might now rename the result as failing and remove the expected, or maybe they'll just update the expected. I'm still in a situation where I don't know the exact status of any given result, and where I have no easy way of knowing if a gardener has correctly updated an expected or failing image. So, if we had -passing/-failing, then people who knew what the results were supposed to be in a given directory (call them owners) could rename the existing -expected files into one category or another, and then a gardener could add newly-failing tests as -expected; it would then be trivial for the owner to look for new -expected files in that directory. Right now, an owner would have to periodically scan the directory for files changed since the last time they scanned the directory, and would have no way of knowing whether an updated file was correct or not (perhaps you could filter out by who committed the change, but it's certainly harder than ls *-expected). So, I would contend that my proposal would make it easier for us to have a process by which gardeners could punt on changes they're unsure about and for owners to subsequently review them. I think you could have a similar model if we just checked in *all* new results into -expected, but it would be harder to implement (perhaps not impossible, though, I haven't given it much thought yet). That seems like a much better model. To clarify: what does That refer to? My idea for gardeners to check in -expected and owners to move to -passing/failing, or the idea that we check in new results into -expected and have some other way of reviewing new baselines? (or something else)? Right now, one of the biggest problem is that people rebaseline tests or add test expectations without understanding what the correct output is. It happens for almost all ports. 
Having -correct.png doesn't necessarily help that because even a seemingly innocent one-pixel difference could be a bug (a patch may only add a baseline to one platform), and it's really hard to tell what the expected result is unless you have worked on the rendering code for a long time. Yes. I hate gardening/sheriffing because I feel like I usually don't know whether to rebaseline or not :) (Although I suspect having -correct would be better than not having it, at least). In a world where all failures are triaged promptly by qualified owners, all of this mechanism is unnecessary. Unfortunately, we don't live in that world at the moment. I'd argue that this is a bigger problem worth solving. Differentiating expected failure and correct results isn't going to help if gardeners were to keep making mistakes. It is a bigger problem worth solving, but it is also much harder to solve. As long as we have the combination of
1) changes can cause lots of tests to change expectations and need new baselines
2) it's acceptable to land changes without the accompanying updated baselines
3) there is pressure to keep the bots green (not only for its own sake, but so that we can more easily identify *new* failures)
4) insufficient resources watching the tree (which is really true for every port today, even chromium)
I think we're
[webkit-dev] A proposal for handling failing layout tests and TestExpectations
Hi all,

As many of you know, we normally treat the -expected files as regression tests rather than correctness tests; they are intended to capture the current behavior of the tree. As such, they historically have not distinguished between a correct failure and an incorrect failure. The chromium port, however, has historically often not checked in expectations for tests that are currently failing (or even have been failing for a long time), and instead listed them in the expectations file. This was primarily motivated by us wanting to know easily all of the tests that were failing. However, this approach has its own downsides. I would like to move the project to a point where all of the ports are basically using the same workflow/model, and combine the best features of each approach [1].

To that end, I propose that we allow tests to have expectations that end in '-passing' and '-failing' as well as '-expected'. The meanings for '-passing' and '-failing' should be obvious, and '-expected' can continue its current meaning of either or both of "what we expect to happen" and "I don't know if this is correct or not" :). A given test will be allowed to have only one of the three potential results at any one time/revision in a checkout. [2] Because '-expected' will still be supported, this means that ports can continue to work as they do today and we can try -passing/-failing on a piecemeal basis to see if it's useful or not.

Ideally we will have some way (via a presubmit hook, or lint checks, or something) to be able to generate a (checked-in) list (or perhaps just a dashboard or web page) of all of the currently failing tests and corresponding bugs from the -failing expectations, so that we can keep one of the advantages that chromium has gotten out of their TestExpectations files [3]. I will update all of the tools (run-webkit-tests, garden-o-matic, flakiness dashboard, etc.) as needed to make managing these things as easy as possible. [4]

Thoughts? I'm definitely open to suggestions/variants/other ideas/etc.

-- Dirk

Notes:

[1] Both the "check in the failures" and the "suppress the failures" approaches have advantages and disadvantages:

Advantages for checking in failures:
* you can tell when a test starts failing differently
* the need to distinguish between different types of failures (text vs. image vs. image+text) in the expectations file drops; the baseline tells you what to expect
* the TestExpectations file can be much smaller and easier to manage - the current Chromium file is a massively unruly mess
* the history of a particular test can be found by looking at the repo history

Disadvantages for checking in failures (advantages for just suppressing them):
* given current practice (just using -expected) you can't tell if a particular -expected file is supposed to be correct or not
* it may create more churn in the checked-in baselines in the repo
* you can't get a list of all of the failing tests associated with a particular bug as easily

There are probably lots of ways one could attempt to design a solution to these problems; I believe the approach outlined above is perhaps the simplest possible and allows for us to try it in small parts of the test suite (and only on some ports) to see if it's useful or not without forcing it on everyone.
[2] I considered allowing -passing and -failing to co-exist, but a risk is that the passing or correct result for a test will change, and if a test is currently expected to fail, we won't notice that that port's passing result has become stale. In addition, managing the transitions across multiple files becomes a lot more complicated/tricky. There are tradeoffs here, and it's possible some other logic will work better in practice, but I thought it might be best to start simple.

[3] I haven't figured out the best way to do this yet, or whether it's better to keep this list inside the TestExpectations files or in separate files, or just have a dashboard separate from the repository completely ...

[4] A rough sketch of the work that needs to be done here:
* update run-webkit-tests to look for all three suffixes, and to generate new -failing and/or -passing results as well as new -expected results
* update garden-o-matic so that when you want to rebaseline a file you can (optionally) indicate whether the new baseline is passing or failing
* update webkit-patch rebaseline-expectations to (optionally) indicate if the new baselines are passing or failing
* pass the information from run-webkit-tests to the flakiness dashboard (via results.json) to indicate whether we matched against -expected/-passing/-failing so that the dashboard can display the right baselines for comparison.
* figure out the solution to [3] :).
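To make the first bullet of [4] concrete, the baseline lookup in run-webkit-tests would stay essentially what it is today, just with two more suffixes in the search. A minimal sketch of that lookup (not the actual webkitpy code; the search-path handling and the single-extension assumption are simplified, and the directory names are only examples):

import os

# Sketch of how run-webkit-tests might pick the baseline for a test under the
# -passing/-failing/-expected proposal. Not the real webkitpy implementation.
SUFFIXES = ('-passing', '-failing', '-expected')  # '-expected' keeps today's meaning

def find_baseline(test_path, search_path, extension='.txt'):
    """Return (suffix, path) for the baseline of test_path, or (None, None).

    test_path is e.g. 'fast/forms/foo.html'; search_path is an ordered list of
    baseline directories, e.g. ['LayoutTests/platform/chromium-mac', 'LayoutTests'].
    """
    base = os.path.splitext(test_path)[0]
    for directory in search_path:
        candidates = [(suffix, os.path.join(directory, base + suffix + extension))
                      for suffix in SUFFIXES]
        present = [(suffix, path) for suffix, path in candidates if os.path.exists(path)]
        if len(present) > 1:
            # The proposal allows only one of the three per test at a time; a
            # lint or presubmit check would reject this state.
            raise ValueError('multiple baselines for %s in %s' % (test_path, directory))
        if present:
            return present[0]
    return (None, None)

Presumably a match against a -failing baseline would then be reported as an expected failure rather than a pass, and that distinction is what would be passed along to the flakiness dashboard via results.json, per the last bullet above.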
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
This sounds like it's adding even more complication to an already complicated system. Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to fail provides much benefit. If the test covers two things and one fails while the other succeeds (example #1: it prints correct results but doesn't crash; example #2: it literally has tests for two distinct unrelated features, and one feature is broken) then the correct thing to do is to split up the test. So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible machinery to reason about tests that have already given us all the information they were meant to give (i.e. they failed, end of story). -F On Aug 15, 2012, at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote: Hi all, As many of you know, we normally treat the -expected files as regression tests rather than correctness tests; they are intended to capture the current behavior of the tree. As such, they historically have not distinguished between a correct failure and an incorrect failure. The chromium port, however, has historically often not checked in expectations for tests that are currently failing (or even have been failing for a long time), and instead listed them in the expectations file. This was primarily motivated by us wanting to know easily all of the tests that were failing. However, this approach has its own downsides. I would like to move the project to a point where all of the ports were basically using the same workflow/model, and combine the best features of each approach [1]. To that end, I propose that we allow tests to have expectations that end in '-passing' and '-failing' as well as '-expected'. The meanings for '-passing' and '-failing' should be obvious, and '-expected' can continue the current meaning of either or both of what we expect to happen and I don't know if this is correct or not :). A given test will be allowed to only have one of the three potential results at any one time/revision in a checkout. [2] Because '-expected' will still be supported, this means that ports can continue to work as they do today and we can try -passing/-failing on a piecemeal basis to see if it's useful or not. Ideally we will have some way (via a presubmit hook, or lint checks, or something) to be able to generate a (checked-in) list (or perhaps just a dashboard or web page) of all of the currently failing tests and corresponding bugs from the -failing expectations, so that we can keep one of the advantages that chromium has gotten out of their TestExpectations files [3]. I will update all of the tools (run-webkit-tests, garden-o-matic, flakiness dashboard, etc.) as needed to make managing these things as easy as possible. [4] Thoughts? I'm definitely open to suggestions/variants/other ideas/etc. -- Dirk Notes: [1] Both the check in the failures and the suppress the failures approaches have advantages and disadvantages: Both approaches have their advantages and disadvantages: Advantages for checking in failures: * you can tell when a test starts failing differently * the need to distinguish between different types of failures (text vs. image vs. 
image+text) in the expectations file drops; the baseline tells you what to expect * the TestExpectations file can be much smaller and easier to manage - the current Chromium file is a massively unruly mess * the history of a particular test can be found by looking at the repo history Disadvantages for checking in failures (advantages for just suppressing them): * given current practice (just using -expected) you can't tell if a particular -expected file is supposed to be be correct or not * it may create more churn in the checked-in baselines in the repo * you can't get a list of all of the failing tests associated with a particular bug as easily There are probably lots of ways one could attempt to design a solution to these problems; I believe the approach outlined above is perhaps the simplest possible and allows for us to try it in small parts of the test suite (and only on some ports) to see if it's useful or not without forcing it on everyone. [2] I considered allowing -passing and -failing to co-exist, but a risk is that the passing or correct result for a test will change, and if a test is currently expected to fail, we won't notice that that port's passing result becomes stale. In addition, managing the transitions across multiple files becomes a lot more complicated/tricky. There are tradeoffs here, and it's possible some other logic will work better in practice, but I thought it might be best to start simple. [3] I haven't figured out the best way to do this yet, or whether it's better to keep this list inside the TestExpectations files or in separate files, or just have a dashboard separate from the repository
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 12:27 PM, Filip Pizlo fpi...@apple.com wrote: This sounds like it's adding even more complication to an already complicated system. In some ways, yes. In other ways, perhaps it will allow us to simplify things; e.g., if we are checking in failing tests, there is much less of a need for multiple failure keywords in the TestExpectations file (so perhaps we can simplify them back to something closer to Skipped files). Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to fail provides much benefit. I'm not sure I understand your feedback here? It's common practice (on all the ports as far as I know today) to rebaseline tests that are currently failing so that they fail differently. Of course, we also skip some tests while they are failing as well. However, common objections to skipping tests are that we can lose coverage for a feature and/or miss when a test starts failing worse (e.g. crashing)? And of course, a test might start passing again, but if we're skipping it we wouldn't know that ... So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible machinery to reason about tests that have already given us all the information they were meant to give (i.e. they failed, end of story). Are you suggesting that, rather than checking in new baselines at all or having lots of logic to manage different kinds of failures, we should just let failing tests fail (and keep the tree red) until a change is either reverted or the test is fixed? -- Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
I've received at least one bit of feedback off-list that it might not have been clear what problems I was trying to solve and whether the solution would add enough benefit to be worth it. Let me try to restate things, in case this helps ... The problems I'm attempting to solve are: 1) Chromium does things differently from the other WebKit ports, and this seems bad :). 2) the TestExpectations syntax is too complicated 3) When you suppress or skip failures (as chromium does), you can't easily tell when test changes behavior (starts failing differently). 4) We can't easily tell if a test's -expected output is actually believed to be correct or not, which makes figuring out whether to rebaseline or not difficult, and makes figuring out if your change that is causing a test to fail is actually introducing a new problem, just failing differently, or even potentially fixing something. I agree that it's unclear how much churn there will be and how much benefit there will be, but we've been talking about these problems for years and I think without actually trying *something* we won't know if there's a better way to solve this or not, and I personally think it's worth trying. The solution I've described is the least intrusive mechanism we can try that I've yet come up with. -- Dirk On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote: Hi all, As many of you know, we normally treat the -expected files as regression tests rather than correctness tests; they are intended to capture the current behavior of the tree. As such, they historically have not distinguished between a correct failure and an incorrect failure. The chromium port, however, has historically often not checked in expectations for tests that are currently failing (or even have been failing for a long time), and instead listed them in the expectations file. This was primarily motivated by us wanting to know easily all of the tests that were failing. However, this approach has its own downsides. I would like to move the project to a point where all of the ports were basically using the same workflow/model, and combine the best features of each approach [1]. To that end, I propose that we allow tests to have expectations that end in '-passing' and '-failing' as well as '-expected'. The meanings for '-passing' and '-failing' should be obvious, and '-expected' can continue the current meaning of either or both of what we expect to happen and I don't know if this is correct or not :). A given test will be allowed to only have one of the three potential results at any one time/revision in a checkout. [2] Because '-expected' will still be supported, this means that ports can continue to work as they do today and we can try -passing/-failing on a piecemeal basis to see if it's useful or not. Ideally we will have some way (via a presubmit hook, or lint checks, or something) to be able to generate a (checked-in) list (or perhaps just a dashboard or web page) of all of the currently failing tests and corresponding bugs from the -failing expectations, so that we can keep one of the advantages that chromium has gotten out of their TestExpectations files [3]. I will update all of the tools (run-webkit-tests, garden-o-matic, flakiness dashboard, etc.) as needed to make managing these things as easy as possible. [4] Thoughts? I'm definitely open to suggestions/variants/other ideas/etc. 
-- Dirk Notes: [1] Both the check in the failures and the suppress the failures approaches have advantages and disadvantages: Both approaches have their advantages and disadvantages: Advantages for checking in failures: * you can tell when a test starts failing differently * the need to distinguish between different types of failures (text vs. image vs. image+text) in the expectations file drops; the baseline tells you what to expect * the TestExpectations file can be much smaller and easier to manage - the current Chromium file is a massively unruly mess * the history of a particular test can be found by looking at the repo history Disadvantages for checking in failures (advantages for just suppressing them): * given current practice (just using -expected) you can't tell if a particular -expected file is supposed to be be correct or not * it may create more churn in the checked-in baselines in the repo * you can't get a list of all of the failing tests associated with a particular bug as easily There are probably lots of ways one could attempt to design a solution to these problems; I believe the approach outlined above is perhaps the simplest possible and allows for us to try it in small parts of the test suite (and only on some ports) to see if it's useful or not without forcing it on everyone. [2] I considered allowing -passing and -failing to co-exist, but a risk is that the passing or correct result for a test will change, and if a test is currently expected to fail, we
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
It seems to me that there are two issues here. One is Chromium specific about process conformity. It seems to me that should stay a Chromium issue without making the mechanism more complex for all ports. The other ports seem to make things work using the existing framework. The other broader issue is failing tests. If I understand part of Filip's concern it is a signal to noise issue. We do not want the pass / fail signal to be lost in the noise of expected failures. Failing tests should be fixed as appropriate for failing platform(s). That fixing might involve splitting off or removing a failing sub-test so that the remaining test adds value once again. Especially a pass becoming a fail edge. For me, a test failing differently provides unknown value as the noise of it being a failing test likely exceeds the benefit of the different failure mode signal. It takes a non-zero effort to filter that noise and that effort is likely better spent fixing the test. - Michael On Aug 15, 2012, at 12:48 PM, Dirk Pranke dpra...@chromium.org wrote: I've received at least one bit of feedback off-list that it might not have been clear what problems I was trying to solve and whether the solution would add enough benefit to be worth it. Let me try to restate things, in case this helps ... The problems I'm attempting to solve are: 1) Chromium does things differently from the other WebKit ports, and this seems bad :). 2) the TestExpectations syntax is too complicated 3) When you suppress or skip failures (as chromium does), you can't easily tell when test changes behavior (starts failing differently). 4) We can't easily tell if a test's -expected output is actually believed to be correct or not, which makes figuring out whether to rebaseline or not difficult, and makes figuring out if your change that is causing a test to fail is actually introducing a new problem, just failing differently, or even potentially fixing something. I agree that it's unclear how much churn there will be and how much benefit there will be, but we've been talking about these problems for years and I think without actually trying *something* we won't know if there's a better way to solve this or not, and I personally think it's worth trying. The solution I've described is the least intrusive mechanism we can try that I've yet come up with. -- Dirk On Wed, Aug 15, 2012 at 12:22 PM, Dirk Pranke dpra...@chromium.org wrote: Hi all, As many of you know, we normally treat the -expected files as regression tests rather than correctness tests; they are intended to capture the current behavior of the tree. As such, they historically have not distinguished between a correct failure and an incorrect failure. The chromium port, however, has historically often not checked in expectations for tests that are currently failing (or even have been failing for a long time), and instead listed them in the expectations file. This was primarily motivated by us wanting to know easily all of the tests that were failing. However, this approach has its own downsides. I would like to move the project to a point where all of the ports were basically using the same workflow/model, and combine the best features of each approach [1]. To that end, I propose that we allow tests to have expectations that end in '-passing' and '-failing' as well as '-expected'. The meanings for '-passing' and '-failing' should be obvious, and '-expected' can continue the current meaning of either or both of what we expect to happen and I don't know if this is correct or not :). 
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
I think this is the best compromise between the benefits of the historical Chromium approach (i.e. add to TestExpectations) and the historical non-Chromium approach (either skip the test or check in a failing result, usually the latter). The only thing Dirk's proposal changes for these ports (i.e. all but Chromium) is that an expectation file's suffix can be -passing or -failing in addition to the current -expected. It makes for a clear list of broken tests that need fixing without compromising the test suite's effectiveness at detecting new regressions. I agree that we just need to try it and see how it turns out. For non-Chromium ports, it will be easy to switch back if we decide it's too much overhead, and Chromium will just need to come up with an alternate way of keeping track of the list of tests that are failing.

On Wed, Aug 15, 2012 at 1:36 PM, Michael Saboff msab...@apple.com wrote: It seems to me that there are two issues here. One is Chromium specific about process conformity. It seems to me that should stay a Chromium issue without making the mechanism more complex for all ports. The other ports seem to make things work using the existing framework. The other broader issue is failing tests. If I understand part of Filip's concern it is a signal to noise issue. We do not want the pass / fail signal to be lost in the noise of expected failures. Failing tests should be fixed as appropriate for failing platform(s). That fixing might involve splitting off or removing a failing sub-test so that the remaining test adds value once again. Especially a pass becoming a fail edge. For me, a test failing differently provides unknown value as the noise of it being a failing test likely exceeds the benefit of the different failure mode signal. It takes a non-zero effort to filter that noise and that effort is likely better spent fixing the test.

Historically, all WebKit ports other than Chromium (especially the Apple port) have taken exactly this approach. What you're arguing against is the long-standing approach WebKit has taken to testing. I don't think discussing this broader issue is useful since no one is asking the non-Chromium ports to change this aspect. This has been discussed to death on webkit-dev, with many people (mostly at Apple) being strong proponents of checking in failing expectations. This just brings the Chromium port in line with other ports without losing the benefit the Chromium approach currently has of keeping a list of failing tests that need fixing.

- Michael

On Aug 15, 2012, at 12:48 PM, Dirk Pranke dpra...@chromium.org wrote: I've received at least one bit of feedback off-list that it might not have been clear what problems I was trying to solve and whether the solution would add enough benefit to be worth it. Let me try to restate things, in case this helps ...
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Aug 15, 2012, at 12:27 PM, Filip Pizlo fpi...@apple.com wrote: This sounds like it's adding even more complication to an already complicated system. Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to fail provides much benefit. If the test covers two things and one fails while the other succeeds (example #1: it prints correct results but doesn't crash; example #2: it literally has tests for two distinct unrelated features, and one feature is broken) then the correct thing to do is to split up the test.

The case where this happens isn't so much testing unrelated features. It's more that a single test can face failures of multiple incidental aspects. Let me give an example. Let's say you have a test that's intended to verify JavaScript multiplication. Let's say it starts failing because of a regression in function call behavior. There may not be a clean way to split these two aspects. You may have to rewrite the multiplication test to work around this hypothetical function call bug. It may be nontrivial to figure out how to do that. So I agree in principle that splitting orthogonal tests is good, but it's not always a good solution to avoiding chained regressions. Note also that having more test files makes the test runs slower.

So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible machinery to reason about tests that have already given us all the information they were meant to give (i.e. they failed, end of story).

I think another aspect of your reasoning doesn't entirely fit with the history of WebKit regression testing. Our tests were not originally designed to test against some abstract notion of correct behavior. They were designed to detect behavior changes. It's definitely not the case that, once behavior has changed once, there's no value to detecting when it changes again. Some of these behavior changes might be good and should form a new baseline. Others might be bad but still should not inhibit detecting other regressions while they are investigated. (I should note that I don't yet have an opinion on the new proposal; I have not read it. But I am skeptical that splitting tests is a practical solution.)

Regards, Maciej
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 1:36 PM, Michael Saboff msab...@apple.com wrote: It seems to me that there are two issues here. One is Chromium specific about process conformity. It seems to me that should stay a Chromium issue without making the mechanism more complex for all ports. The other ports seem to make things work using the existing framework. I'm definitely attempting to minimize the impact on ports (and people) that are happy with things the way they are. One thing we could do to help with this would be to require that any '-passing'/'-failing' tests be limited to the platform/ directories, so that way a given port could opt-in to this additional complexity (and we could replace one Chromium-specific process with another one without affecting the other ports). The other broader issue is failing tests. If I understand part of Filip's concern it is a signal to noise issue. We do not want the pass / fail signal to be lost in the noise of expected failures. Failing tests should be fixed as appropriate for failing platform(s). That fixing might involve splitting off or removing a failing sub-test so that the remaining test adds value once again. Especially a pass becoming a fail edge. For me, a test failing differently provides unknown value as the noise of it being a failing test likely exceeds the benefit of the different failure mode signal. It takes a non-zero effort to filter that noise and that effort is likely better spent fixing the test. If I understand you correctly, you're suggesting that if a test starts to fail, you don't care how it is failing, and simply knowing that it is failing is enough? And if a test doesn't meet that criteria, it should be split into multiple tests such that each sub-test does? I think that's a nice goal, but we're nowhere close to that in practice. In particular, this isn't true of many pixel tests at all. Minor changes in text rendering can cause a whole bunch of tests to fail, and yet we still need to run those tests to look for real regressions. Ideally we can replace some or most of these tests with text-only tests and reftests, but we're a long way away from that, too. -- Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
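[Editorial sketch] If the opt-in idea Dirk floats above were adopted, the enforcement could be as small as the following sketch -- a hypothetical lint-style pass, not existing check-webkit-style behavior, and the path conventions are assumptions:

    import os
    import re

    OPT_IN_RE = re.compile(r'-(passing|failing)\.(txt|png|wav)$')

    def violations(layout_tests_dir='LayoutTests'):
        """Yield -passing/-failing baselines that live outside platform/."""
        for dirpath, _, filenames in os.walk(layout_tests_dir):
            rel = os.path.relpath(dirpath, layout_tests_dir)
            in_platform_dir = rel.split(os.sep)[0] == 'platform'
            for name in filenames:
                if OPT_IN_RE.search(name) and not in_platform_dir:
                    yield os.path.join(dirpath, name)

A presubmit hook could reject any path yielded by violations(), so only ports that opt in (via their platform/<port>/ directories) ever see the new suffixes.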
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Aug 15, 2012, at 2:16 PM, Maciej Stachowiak m...@apple.com wrote: On Aug 15, 2012, at 12:27 PM, Filip Pizlo fpi...@apple.com wrote: This sounds like it's adding even more complication to an already complicated system. Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to fail provides much benefit. If the test covers two things and one fails while the other succeeds (example #1: it prints correct results but doesn't crash; example #2: it literally has tests for two distinct unrelated features, and one feature is broken) then the correct thing to do is to split up the test. The case where this happens isn't so much testing unrelated features. It's more that a single test can face failures of multiple incidental aspects. Let me give an example. Let's say you have a test that's intended to verify JavaScript multiplication. Let's say it starts failing because of a regression in function call behavior. There may not be a clean way to split these two aspects. You may have to rewrite the multiplication test to work around this hypothetical function call bug. It may be nontrivial to figure out how to do that. So I agree in principle that splitting orthogonal tests is good, but it's not always a good solution to avoiding chained regressions. Note also that having more test files makes the test rights slower. So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible machinery to reason about tests that have already given us all the information they were meant to give (i.e. they failed, end of story). I think another aspect of your reasoning doesn't entirely fit with the history of WebKit regression testing. Our tests were not originally designed to test against some abstract notion of correct behavior. They were designed to detect behavior changes. It's definitely not the case that, once behavior has changed once, there's no value to detecting when it changes again. Some of these behavior changes might be good and should form a new baseline. Others might be bad but still should not inhibit detecting other testing while they are investigated. Sorry, I was rather unclear. I'm arguing against marking a test as expected-to-fail-but-not-crash. I have no broad problem with rebasing the test, except if you're using -expected.* files to indicate that the test is in this expected-to-fail-but-not-crash state. There is potentially a significant difference between tests that cover what is displayed and the loads of tests that cover discrete functionality (DOM APIs, JS behavior, etc). Rebasing has very little benefit in the latter kinds of tests. I'm also arguing against having yet another mechanism for indicating that a test is borked. Finally, I think any strategy we use must take into account the fact that we currently have a *huge* number of tests. While there may be examples of failing tests that still succeed in giving us meaningful coverage, I think there are far more examples of tests that provide no benefit, but require an evil troika of (1) CPU time to run, (2) network/disk time to check out and update, and (3) person time to rebase. So, at this point, having sophisticated mechanisms for dealing with expected-to-fail-but-not-crash tests beyond filing a bug and skipping the test is likely to just be noise. (I should note that I don't yet have an opinion on the new proposal; I have not read it. But I am skeptical that splitting tests is a practical solution.) 
Regards, Maciej -F
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
Apparently I was somewhat unclear. Let me restate. We have the following mechanisms available when a test fails: 1) Check in a new -expected.* file. 2) Modify the test. 3) Modify a TestExpectations file. 4) Add the test to a Skipped file. 5) Remove the test entirely. I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash. I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable. I broadly prefer (2) whenever possible. I believe that (3) and (4) are redundant, and I don't buy the value of (3). I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio. You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours. And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone. -F On Aug 15, 2012, at 12:40 PM, Dirk Pranke dpra...@chromium.org wrote: On Wed, Aug 15, 2012 at 12:27 PM, Filip Pizlo fpi...@apple.com wrote: This sounds like it's adding even more complication to an already complicated system. In some ways, yes. In other ways, perhaps it will allow us to simplify things; e.g., if we are checking in failing tests, there is much less of a need for multiple failure keywords in the TestExpectations file (so perhaps we can simplify them back to something closer to Skipped files). Given how many tests we currently have, I also don't buy that continuing to run a test that is already known to fail provides much benefit. I'm not sure I understand your feedback here? It's common practice (on all the ports as far as I know today) to rebaseline tests that are currently failing so that they fail differently. Of course, we also skip some tests while they are failing as well. However, common objections to skipping tests are that we can lose coverage for a feature and/or miss when a test starts failing worse (e.g. crashing)? And of course, a test might start passing again, but if we're skipping it we wouldn't know that ... So, I'd rather not continue to institutionalize this notion that we should have loads of incomprehensible machinery to reason about tests that have already given us all the information they were meant to give (i.e. they failed, end of story). Are you suggesting that, rather than checking in new baselines at all or having lots of logic to manage different kinds of failures, we should just let failing tests fail (and keep the tree red) until a change is either reverted or the test is fixed? -- Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo fpi...@apple.com wrote: Apparently I was somewhat unclear. Let me restate. We have the following mechanisms available when a test fails: 1) Check in a new -expected.* file. 2) Modify the test. 3) Modify a TestExpectations file. 4) Add the test to a Skipped file. 5) Remove the test entirely. I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash. I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable. I broadly prefer (2) whenever possible. I believe that (3) and (4) are redundant, and I don't buy the value of (3). I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio. Thank you for clarifying. I had actually written an almost identical list but didn't send it, so I think we're on the same page at least as far as understanding the problem goes ... So, I would describe my suggestion as an improved variant of the kind of (1) that can be used as expected-to-fail-but-not-crash (which I'll call 1-fail), and that we would use this in cases where we use (3), (4), or (1-fail) today. I would also agree that we should do (2) where possible, but I don't think this is easily possible for a large class of tests, especially pixel tests, although I am currently working on other things that will hopefully help here. Chromium certainly does a lot of (3) today, and some (1-fail). Other ports definitely use (1-fail) or (4) today, because (2) is rarely possible for many, many tests. We know that doing (1-fail), (3), or (4) causes real maintenance woes down the road, but also that doing (1-fail) or (3) catches real problems that simply skipping the test would not -- at some cost. Whether the benefit is worth the cost, is not known, of course, but I believe it is. I am hoping that my suggestion will have a lower overall cost than doing (1-fail) or (3). You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours. And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone. Why do you believe this to be true? I'm not being flippant here ... I think this is a very plausible argument, and it may well be true, but I don't know what the criteria we would use to evaluate it are. Some of the possible factors are: * the complexity of the test infrastructure and the cognitive load it introduces on developers * the cost of bugs that are missed because we're skipping the tests intended to catch those bugs * the cost of looking at regressions and trying to figure out if the regression is something you care about or not * the cost of looking at the -expected results and trying to figure out if what is expected is correct or not There may be others as well, but the last three are all very real in my experience, and I believe they significantly outweigh the first one, but I don't know how to objectively assess that (and I don't think it's even possible since different people/teams/ports will weigh these things differently). -- Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Aug 15, 2012, at 4:02 PM, Dirk Pranke dpra...@chromium.org wrote: On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo fpi...@apple.com wrote: Apparently I was somewhat unclear. Let me restate. We have the following mechanisms available when a test fails: 1) Check in a new -expected.* file. 2) Modify the test. 3) Modify a TestExpectations file. 4) Add the test to a Skipped file. 5) Remove the test entirely. I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash. I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable. I broadly prefer (2) whenever possible. I believe that (3) and (4) are redundant, and I don't buy the value of (3). I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio. Thank you for clarifying. I had actually written an almost identical list but didn't send it, so I think we're on the same page at least as far as understanding the problem goes ... So, I would describe my suggestion as an improved variant of the kind of (1) that can be used as expected-to-fail-but-not-crash (which I'll call 1-fail), and that we would use this in cases where we use (3), (4), or (1-fail) today. I would also agree that we should do (2) where possible, but I don't think this is easily possible for a large class of tests, especially pixel tests, although I am currently working on other things that will hopefully help here. Chromium certainly does a lot of (3) today, and some (1-fail). Other ports definitely use (1-fail) or (4) today, because (2) is rarely possible for many, many tests. We know that doing (1-fail), (3), or (4) causes real maintenance woes down the road, but also that doing (1-fail) or (3) catches real problems that simply skipping the test would not -- at some cost. Whether the benefit is worth the cost, is not known, of course, but I believe it is. I am hoping that my suggestion will have a lower overall cost than doing (1-fail) or (3). I also believe that the trade-off is known and, and specifically, I believe that the cost of having any tests in the (1-fail) or (3) states is more costly than having them in (4) or (5). You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours. And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone. Why do you believe this to be true? I'm not being flippant here ... I think this is a very plausible argument, and it may well be true, but I don't know what the criteria we would use to evaluate it are. 
Some of the possible factors are: * the complexity of the test infrastructure and the cognitive load it introduces on developers * the cost of bugs that are missed because we're skipping the tests intended to catch those bugs * the cost of looking at regressions and trying to figure out if the regression is something you care about or not * the cost of looking at the -expected results and trying to figure out if what is expected is correct or not There may be others as well, but the last three are all very real in my experience, and I believe they significantly outweigh the first one, but I don't know how to objectively assess that (and I don't think it's even possible since different people/teams/ports will weigh these things differently). I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior. And therein lies one possible source of disagreement. But there is another source of disagreement: would adding a sixth facility that overlaps with (1-fail) or (3) help? No, I don't believe it would. It's just another mechanism leading to more possible arguments about which mechanism is better. -F ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
I have a concern that a lot of people wouldn't know what the correct output is for a given test. For a lot of pixel tests, deciding whether a given output is correct or not is really hard; e.g. a seemingly insignificant anti-aliasing difference may turn out to be the result of a bug in Skia or another graphics library, or in the WebCore code that uses it. As a result:
- people may check in wrong results as "correct"
- people may add "failing" results even though the new results are more or less correct
This leads me to think that just checking in the current output as -expected.png and filing bugs separately is a better approach. - Ryosuke
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo fpi...@apple.com wrote: On Aug 15, 2012, at 4:02 PM, Dirk Pranke dpra...@chromium.org wrote: On Wed, Aug 15, 2012 at 3:06 PM, Filip Pizlo fpi...@apple.com wrote: Apparently I was somewhat unclear. Let me restate. We have the following mechanisms available when a test fails: 1) Check in a new -expected.* file. 2) Modify the test. 3) Modify a TestExpectations file. 4) Add the test to a Skipped file. 5) Remove the test entirely. I have no problem with (1) unless it is intended to mark the test as expected-to-fail-but-not-crash. I agree that using -expected.* to accomplish what TestExpectations accomplishes is not valuable, but I further believe that even TestExpectations is not valuable. I broadly prefer (2) whenever possible. I believe that (3) and (4) are redundant, and I don't buy the value of (3). I don't like (5) but we should probably do more of it for tests that have a chronically low signal-to-noise ratio. Thank you for clarifying. I had actually written an almost identical list but didn't send it, so I think we're on the same page at least as far as understanding the problem goes ... So, I would describe my suggestion as an improved variant of the kind of (1) that can be used as expected-to-fail-but-not-crash (which I'll call 1-fail), and that we would use this in cases where we use (3), (4), or (1-fail) today. I would also agree that we should do (2) where possible, but I don't think this is easily possible for a large class of tests, especially pixel tests, although I am currently working on other things that will hopefully help here. Chromium certainly does a lot of (3) today, and some (1-fail). Other ports definitely use (1-fail) or (4) today, because (2) is rarely possible for many, many tests. We know that doing (1-fail), (3), or (4) causes real maintenance woes down the road, but also that doing (1-fail) or (3) catches real problems that simply skipping the test would not -- at some cost. Whether the benefit is worth the cost, is not known, of course, but I believe it is. I am hoping that my suggestion will have a lower overall cost than doing (1-fail) or (3). I also believe that the trade-off is known and, and specifically, I believe that the cost of having any tests in the (1-fail) or (3) states is more costly than having them in (4) or (5). You're proposing a new mechanism. I'm arguing that given the sheer number of tests, and the overheads associated with maintaining them, (4) is the broadly more productive strategy in terms of bugs-fixed/person-hours. And, increasing the number of mechanisms for dealing with tests by 20% is likely to reduce overall productivity rather than helping anyone. Why do you believe this to be true? I'm not being flippant here ... I think this is a very plausible argument, and it may well be true, but I don't know what the criteria we would use to evaluate it are. 
Some of the possible factors are: * the complexity of the test infrastructure and the cognitive load it introduces on developers * the cost of bugs that are missed because we're skipping the tests intended to catch those bugs * the cost of looking at regressions and trying to figure out if the regression is something you care about or not * the cost of looking at the -expected results and trying to figure out if what is expected is correct or not There may be others as well, but the last three are all very real in my experience, and I believe they significantly outweigh the first one, but I don't know how to objectively assess that (and I don't think it's even possible since different people/teams/ports will weigh these things differently). I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior. And therein lies one possible source of disagreement. Yes :) But there is another source of disagreement: would adding a sixth facility that overlaps with (1-fail) or (3) help? No, I don't believe it would. It's just another mechanism leading to more possible arguments about which mechanism is better. Perhaps. I think it will, obviously, or I wouldn't be proposing this in the first place. I welcome other opinions on this as well ... -- Dirk ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo fpi...@apple.com wrote: I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior. As someone who has spent a lot of time maintaining Chromium's expectations, this seems clearly false, if your proposed alternative is to stop running the test. This is because a very common course of events is for a test to begin failing, and then later on return to passing. We (Chromium) see this all the time with e.g. Skia changes, where for example the Skia folks will rewrite gradient handling to more perfectly match some spec and as a result dozens or hundreds of tests, many not explicitly intended to be about gradient handling, will change and possibly begin passing. By contrast, if we aren't running a test, we don't know when the test begins passing again (except by trying to run it). The resulting effect is that skipped tests tend to remain skipped. Tests that remain skipped are no better than no tests. And even if such tests are periodically retested, once a test's output changes, there is a large window of time where the test wasn't running, making it difficult to pinpoint exactly what caused the change and whether the resulting effect is intentional and beneficial. If we ARE running a test, then when the results change, knowing whether the existing result was thought to be correct or not is a critical part of a sheriff's job in deciding what to do about the change. This is one reason why Chromium has never gone down the path of simply checking in failure expectations, and something that Dirk's proposal explicitly tries to address while still allowing ports that (IMO mistakenly) don't care to continue to not care. We already have some good tooling (e.g. garden-o-matic) that could be extended to show and update the small amount of additional info Dirk is proposing. I am very skeptical of abstract claims that this proposal inflates complexity and decreases productivity in the absence of actually testing a real workflow using the tools that we sheriffs really use to maintain tree greenness. I would like to see this proposal tested to get concrete feedback instead of arguments on principle. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
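[Editorial sketch] To illustrate the dynamic Peter describes: if a test is only listed in a Skipped file, nothing notices when it starts passing again unless someone periodically re-runs it. A hypothetical helper for that chore might look like the sketch below; this is not an existing webkitpy tool, and the assumption that a zero exit status from the harness means "passed" is mine:

    import subprocess

    def newly_passing(skipped_file, run_cmd=('run-webkit-tests',)):
        """Re-run every test in a Skipped file and report the ones that now pass."""
        with open(skipped_file) as f:
            tests = [line.strip() for line in f
                     if line.strip() and not line.startswith('#')]
        passing = []
        for test in tests:
            # Assumes the harness exits with status 0 when the test passes.
            if subprocess.call(list(run_cmd) + [test]) == 0:
                passing.append(test)
        return passing

Under the -failing scheme, by contrast, the bots run the test on every revision, so "it started passing again" shows up as an ordinary unexpected result with no extra machinery.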
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
The typical approach used in situations that you describe is to rebase, not skip. This avoids the problem of not knowing when the test started passing. Hence, I'm not sure what you're implying. Maybe a better example would help. On Aug 15, 2012, at 5:39 PM, Peter Kasting pkast...@chromium.org wrote: On Wed, Aug 15, 2012 at 5:00 PM, Filip Pizlo fpi...@apple.com wrote: I believe that the cognitive load is greater than any benefit from catching bugs incidentally by continuing to run a (1-fail) or (3) test, and continuing to evaluate whether or not the expectation matches some notions of desired behavior. As someone who has spent a lot of time maintaining Chromium's expectations, this seems clearly false, if your proposed alternative is to stop running the test. This is because a very common course of events is for a test to begin failing, and then later on return to passing. We (Chromium) see this all the time with e.g. Skia changes, where for example the Skia folks will rewrite gradient handling to more perfectly match some spec and as a result dozens or hundreds of tests, many not explicitly intended to be about gradient handling, will change and possibly begin passing. By contrast, if we aren't running a test, we don't know when the test begins passing again (except by trying to run it). The resulting effect is that skipped tests tend to remain skipped. Tests that remain skipped are no better than no tests. And even if such tests are periodically retested, once a test's output changes, there is a large window of time where the test wasn't running, making it difficult to pinpoint exactly what caused the change and whether the resulting effect is intentional and beneficial. If we ARE running a test, then when the results change, knowing whether the existing result was thought to be correct or not is a critical part of a sheriff's job in deciding what to do about the change. This is one reason why Chromium has never gone down the path of simply checking in failure expectations, and something that Dirk's proposal explicitly tries to address while still allowing ports that (IMO mistakenly) don't care to continue to not care. We already have some good tooling (e.g. garden-o-matic) that could be extended to show and update the small amount of additional info Dirk is proposing. I am very skeptical of abstract claims that this proposal inflates complexity and decreases productivity in the absence of actually testing a real workflow using the tools that we sheriffs really use to maintain tree greenness. I would like to see this proposal tested to get concrete feedback instead of arguments on principle. I would not like to see our testing infrastructure get any more complicated than it already is, just because of a philosophical direction chosen unilaterally by one port. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 5:45 PM, Filip Pizlo fpi...@apple.com wrote: The typical approach used in situations that you describe is to rebase, not skip. Which is precisely what Dirk is proposing. Literally the only difference here is that the rebased test expectation would contain an optional notation about whether we believe the new baseline to be correct or incorrect. Ports that don't care, or tests where we don't know, will be totally unaffected. I am not sure why this bothers you so much. You talk about making the infrastructure more complicated, but it doesn't seem like there is much additional complication involved, and what minor complication is involved is being volunteered to be handled by Dirk and other folks. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Aug 15, 2012, at 5:51 PM, Peter Kasting pkast...@chromium.org wrote: On Wed, Aug 15, 2012 at 5:45 PM, Filip Pizlo fpi...@apple.com wrote: The typical approach used in situations that you describe is to rebase, not skip.

Which is precisely what Dirk is proposing. Literally the only difference here is that the rebased test expectation would contain an optional notation about whether we believe the new baseline to be correct or incorrect.

Can you articulate the value that this gives? We must weigh that against the downsides: 1) Increased complexity of infrastructure. 2) Possibility of the sheriff getting it wrong. (2) concerns me most. We're talking about using filenames to serve as a kind of unchecked comment. We already know that comments are usually bad because there is no protection against them going stale.

Ports that don't care, or tests where we don't know, will be totally unaffected. I am not sure why this bothers you so much. You talk about making the infrastructure more complicated, but it doesn't seem like there is much additional complication involved, and what minor complication is involved is being volunteered to be handled by Dirk and other folks.

Ultimately it will lead to a further multiplication of the number of possible ways of doing things, which is bad. I'm sure some would love to get rid of Skipped files just as much as I would love to get rid of TestExpectations files. Both are valid things to love, and imply that there must surely exist a middle ground: a way of doing things that is strictly better than the sum of the two. In particular, to further clarify my position, if someone were to argue that Dirk's proposal would be a wholesale replacement for TestExpectations, then I would be more likely to be on board, since I very much like the idea of reducing the number of ways of doing things. Maybe that's a good way to reach compromise. Dirk, what value do you see in TestExpectations were your change to be landed? Do scenarios still exist where there would be a test for which (a) there is no -fail.* file, (b) the test is not skipped, and (c) it's marked with some annotation in TestExpectations? I'm most interested in the question of whether such scenarios exist, since in my experience, whenever a test is not rebased, is not skipped, and is marked as failing in TestExpectations, it ends up just causing gardening overhead later. -F
Re: [webkit-dev] A proposal for handling failing layout tests and TestExpectations
On Wed, Aug 15, 2012 at 6:02 PM, Filip Pizlo fpi...@apple.com wrote: 2) Possibility of the sheriff getting it wrong. (2) concerns me most. We're talking about using filenames to serve as a kind of unchecked comment. We already know that comments are usually bad because there is no protection against them going stale. I don't see how this is parallel to stale comments. Tests get continually compared to the given output and we see immediately when something changes. It is certainly possible for whoever assigns the filenames to get things wrong. There are basically two mitigations of this. One is allowing the existing expected.xxx file extensions to remain, and encouraging people to leave them as-is when they're not sure whether the existing result is correct. The other is for sheriffs to use this signal as just that -- a signal -- just as today we use the expected.xxx files as a signal of what the correct output might be. The difference is that this can generally be considered a stronger signal. Historically, there's been no real attempt to guarantee that an expected result is anything other than the test's current behavior. I'm sure some would love to get rid of Skipped files just as much as I would love to get rid of TestExpectations files. Both are valid things to love, and imply that there must surely exist a middle ground: a way of doing things that is strictly better than the sum of the two. That's exactly what we're trying to do. The value of this change is that hopefully it would dramatically reduce the amount of content in these, especially in TestExpectations files. If you want to kill these so much, then this is a change you should at least let us test! In particular, to further clarify my position, if someone were to argue that Dirk's proposal would be a wholesale replacement for TestExpectations, then I would be more likely to be on board, since I very much like the idea of reducing the number of ways of doing things. Maybe that's a good way to reach compromise. It's hard to know if we could completely eliminate them without testing this, but yes, one goal here is to greatly reduce the need for TestExpectations lines. A related goal is to make the patterns and mechanisms used by all ports more similar. As someone who has noted his frustration with both different ways of doing things and philosophical directions chosen by one port, you would hopefully be well-served by this direction. PK ___ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo/webkit-dev
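[Editorial illustration] To make the hoped-for reduction concrete, here is an entirely made-up before/after; the bug number, port, and test path are placeholders, and the "before" line only approximates the shape of the old-style expectations syntax rather than quoting a real entry.

Before, one suppression line lives in TestExpectations, roughly:

    BUGWK12345 MAC : fast/dom/example.html = IMAGE

After, that line disappears: the port checks in platform/mac/fast/dom/example-failing.png instead, the bots keep running the test and diffing against that image, and the list of -failing files (plus whatever bug annotation the port keeps alongside them) replaces the suppression list.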