Hi Maciej,

Thanks for clarifying your concerns. I will address them a little out of order, because I think we probably agree on the important things even if we disagree on the less important or more theoretical ones.
First, as far as the "future" discoveries go, I agree we should try to fix as many of the known issues as possible before cutting over. It may be that many or all of them have already been fixed; I still need to verify some of these bugs. I definitely think the best way forward is to get NRWT bots up and see how things work in practice; that way we will have a lot of data on the pros and cons of each tool. That said,

On Thu, Apr 7, 2011 at 4:40 AM, Maciej Stachowiak <[email protected]> wrote:
> If the new test tool causes more failures, or worse yet causes more tests to
> give unpredictable results, then that makes our testing system worse. The
> main benefit of new-run-webkit-tests, as I understand it, is that it can run
> the tests a lot faster. But I don't think it's a good tradeoff to run the
> tests a lot faster on the buildbot, if the results we get will be less
> reliable. I'm actually kind of shocked that anyone would consider replacing
> the test script with one that is known to make our existing tests less
> reliable.

Ideally, a test harness is stable, consistent, and fast, exposes as many bugs as possible, and exposes those bugs in a way that is as reproducible as possible, and we should be shooting for that. But, just as our code should be bug-free and isn't, the test harness may not be able to be ideal either, at which point you have to prioritize some aspects of its behavior over others.

For example, ORWT runs the tests in the same order every time, in a single thread, and uses a long timeout. This makes the test results very stable and consistent, at the cost of potentially hiding some bugs (tests getting slower without actually timing out, or tests that depend on previous tests having run). NRWT, at least the way we run it by default in Chromium, uses a much shorter timeout and of course runs tests in parallel. This exposes those bugs, at the cost of making things appear flakier.
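To make the tradeoff concrete, here is a toy sketch of a parallel runner with a per-test timeout. This is not NRWT's actual code; the function and its behavior are invented for illustration (NRWT manages DumpRenderTree child processes, and really kills a test on timeout, which this sketch does not):

```python
import concurrent.futures

def run_tests(tests, jobs=4, timeout_seconds=1.0):
    """Toy parallel test runner (illustration only, not NRWT's code).

    `tests` maps a test name to a zero-argument callable standing in for
    "run this layout test"; it returns True for a pass.
    Returns a dict of name -> "PASS", "FAIL", or "TIMEOUT".
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(test): name for name, test in tests.items()}
        for future, name in futures.items():
            try:
                passed = future.result(timeout=timeout_seconds)
                results[name] = "PASS" if passed else "FAIL"
            except concurrent.futures.TimeoutError:
                # A shorter timeout surfaces slowing tests sooner, but
                # also makes marginal tests look flakier -- exactly the
                # tradeoff described above. (Unlike a real runner, this
                # sketch does not actually kill the slow test.)
                results[name] = "TIMEOUT"
    return results
```

Running the tests in parallel and in a shuffled order also breaks any hidden dependency of one test on another having run first, which is how those ordering bugs get exposed.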
We have attempted to build tooling to help with this, because we generally value finding more bugs over completely stable test runs. For example, we have an entirely separate tool called the "flakiness dashboard" that can help track the behavior of tests over time. So, your "less reliable" might actually be my "finding more bugs" :) NRWT also has at least a couple of hooks that ORWT does not for tuning toward your desired preferences, and we can configure them on a port-by-port basis.

> I don't really care why tests would turn flaky. It's entirely possible that
> these are bugs in the tests themselves, or in the code they are testing. That
> should still be fixed.

Of course. I'm certainly not suggesting that we shouldn't fix bugs. But, practically speaking, we are obviously okay with some tests failing, because we list them in Skipped files today. It may be that some of those tests would pass under NRWT, and we don't know that either, because NRWT reuses the existing Skipped files as-is. (At some point we might want to change this.)

You mentioned that the "main benefit" of NRWT is that it is faster, but another benefit is that it lets you classify the expected failures more precisely and detect real changes in how a test fails. If a test that used to render pixels incorrectly now actually produces a different render tree, we'll catch that. If it starts crashing, we'll catch that. ORWT only offers "run and expect it to pass" or "skip and potentially miss something changing". I actually consider this more useful than the speed improvements.

> Nor do I think that marking tests as flaky in the expectations file means we
> are not losing test coverage. If a test can randomly pass or fail, and we
> know that and the tool expects it, we are nonetheless not getting the benefit
> of learning when the test starts failing.

See above. The data is all there; it's somewhat a question of what you want to surface where.
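To illustrate the classification point: here is a toy model of expectations-based checking. The outcome keywords mirror the ones NRWT's expectations file uses, but the function and data structures are invented for this sketch and are not NRWT's actual implementation:

```python
# Toy model of expectations-based failure classification (illustration
# only; not NRWT's real code). The outcome keywords are the interesting
# part: a test can be *expected* to fail in a particular way, and any
# other result -- including starting to pass -- is surfaced as news.

OUTCOMES = {"PASS", "TEXT", "IMAGE", "IMAGE+TEXT", "CRASH", "TIMEOUT"}

def is_unexpected(expected, actual):
    """Return True when `actual` differs from the expected outcome(s).

    Under ORWT there are only two states: run-and-expect-pass, or skip
    (and see nothing). Here a test expected to fail with, say, IMAGE is
    still run, so a switch to CRASH or to a text diff gets flagged.
    """
    assert actual in OUTCOMES and expected <= OUTCOMES
    return actual not in expected

# A test known to render its pixels incorrectly:
known_image_failure = {"IMAGE"}
assert not is_unexpected(known_image_failure, "IMAGE")  # known, no alarm
assert is_unexpected(known_image_failure, "CRASH")      # failure mode changed
assert is_unexpected(known_image_failure, "PASS")       # started passing: also news
```

A flaky test can similarly be given a set of allowed outcomes, so the runner stays green for the known flake while still reporting any genuinely new behavior; the flakiness dashboard is then where the pass/fail history over time gets examined.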
We have largely attempted to build tools that get the best of both worlds. Maybe this is partially a question of what you would like the main waterfall and consoles to tell you, and perhaps I do not fully understand how different ports would answer that question?

-- Dirk

_______________________________________________
webkit-dev mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev

