reactions inlined

On 05/10/2012 05:01 PM, Osztrogonac Csaba wrote:
Hi All,

Alexis Menard wrote:
Hi,

Reading Simon's email about removing Qt4, I saw that there is a
plan to move to Amazon EC2.

State of the art of gardening Qt:

- Ossy is mostly the only gardener, which is unacceptable. Apple made a
move towards improving their bots (when you see kling gardening, it
tells you they changed something), Google is already pretty good, GTK
too; we need to be better. While people at the summit praised the Qt bots
for being green all the time, I think that hides a terrible truth: our
skiplist grows and grows and nobody looks after it, which conflicts a
bit with trying to release a stable trunk for Qt 5.0. How many mails
do we receive from Ossy complaining about quality?

It's not exactly true that I'm the only gardener. I do the gardening
together with my buildbot group. They are part-time developers,
because they are students and have courses, exams, etc. Most of them
aren't WebKit committers yet, so their gardening patches are usually
committed by me or somebody else from here. (But you can find their
names in the commit logs and changelogs, of course. :) )

I have to agree with the other point. With this small group we have the resources (enough time) only for fire-fighting: detecting which commit broke which tests, updating expected results where needed, filing bug reports, commenting on bugs, build fixes, etc. We don't have enough time to fix all the bugs ourselves in place of the people who caused them.

But gardening is very hard if most developers don't care about QA at all. When I comment on a bug with "your patch broke X.Y. layout/API test, diff: ...", I regularly get the question: "How can I run this test?" That isn't good, because it means this developer has never run the tests before. (But everybody should before committing.) Another problem is that many developers insist on keeping their buggy patch in trunk, but don't care about fixing the bug. In this case all we can do is skip the newly failing tests, because red bots with many failing tests would
make catching new regressions much more complex, sometimes impossible. But
in my opinion rolling out a buggy patch and relanding it after fixing would
cause less pain for everybody than a skiplist that grows and grows.
I don't know why folks hate rolling out patches. It doesn't mean that the
patch is wrong altogether. It isn't a death sentence for the patch or the author. :) It only means that the patch caused some trouble/regression and should be fixed. And fixing it offline is less painful for others than leaving a buggy patch in trunk. The Chromium guys usually roll out their own patches if they broke a test on the Qt
bot, before I even notice. Really. We should follow their good practice. ;)

- There is a huge delta, machine-wise, between what the bot is running and
what people use to develop. The bot runs Ubuntu; many of us run
ArchLinux/OpenSuse, while some of us run Ubuntu. That leads to results
on your machine that differ from what the bot produces.
Many, many times we have heard people say: "it passes on
my machine but not on the bot" -> added to the skiplist, because nobody
can really see what's going on with the bot. Szeged tried their best to
provide a virtual machine, but it was a bit of a failure as the VM
doesn't behave the same as the bot, and the VM behaves differently
depending on whether you run it on VMWare or VirtualBox.

Unfortunately the VMWare image wasn't the best solution. So we then
created a meta package for Ubuntu 11.10 which installs all dependencies:
https://launchpad.net/~u-szeged/+archive/sedkit
With this meta package you can install a full QtWebKit development environment in an hour.

Now the direction is moving to an Amazon Ubuntu image. But I think it still papers over the problem. It is _very good_ (but expensive) for ensuring everybody can simply reproduce the bot results. But we don't develop for only one platform. More platforms reveal more hidden and maybe serious bugs. If your patch works fine on the single reference platform, that doesn't mean there isn't any bug in it.

The biggest problem is that folks who don't use Ubuntu 11.10 get thousands of failing tests because of minor font differences. In this case the best solution isn't "I can't reproduce the results, so I won't run layout tests anymore." It would be more valuable for the whole project if font(config) experts tried to make WebKit, Qt, fontconfig or whatever else use the same fonts. I don't know whether it is possible; I don't know anything about fonts. Is it possible somehow to bundle a chosen fontconfig with Qt or WebKit and use it for regression testing on all distros instead
of sweating over different system fontconfig versions?

You are speaking about Linux, but it's not the only system where we want coverage. For example, on Mac fontconfig plays no role in the font game. We could use it, but then we would lose coverage of the real use case. Btw, there is some light
in the dark land of fonts:
- I have done some work to unify test results between Linux and Mac; hopefully
I can finish it in the near future.
- In Ubuntu 12.0, a strange bug has been fixed in freetype which made the Ahem
font produce wrong metrics (WidthXHeight=NxN+1 instead of NxN). Ahem is used
in a lot of tests in the css* directories. Currently our expectations are wrong, but if
we fix them these metrics will match across distros (everybody has been using the newer
freetype for a long time, except our beloved, stable Debian :D )


- We don't have any gardening plan.
The missing gardening plan is not the only problem. In my
opinion introducing contribution rules would be more important.
For example:
 - Developers should build the patch and run tests before committing.
(Or at least watch the bots after landing and fix/roll out quickly if something goes wrong)
 - What should I do if I broke the build / a layout test / an API test?
 - What should a gardener do if somebody doesn't care about the regression he/she caused?
 - What should the boss do if somebody regularly and intentionally breaks the rules? :)

I have to protest a bit. As Ossy describes it, it's really simple and straightforward: when somebody breaks a test, it means his patch is buggy, he should find the error in his changes, and everything will be fine. In reality, this is not always the case. When you break a test, it could mean different things:

    1. you did it wrong
Obviously you need to fix your patch.
    2. there is a bug in the system that you triggered somehow (even with a change that is totally right on its own)
Of course the right thing to do is to investigate the problem. But it could be very complex; maybe the bug exists in a different subsystem that you don't know well. I don't think it is always possible to find the manpower to fight
these bugs.
    3. there is a bug/imperfection in the test infrastructure that you triggered
Well, this is pretty annoying and relatively common. We should detect and solve these issues, but it's not really fair to stop a good patch from landing until somebody fixes the tools. Note that working out of trunk on top of your previous work is
possible, but it's not fun because you have to struggle more with rebasing.
    4. you caused some change that is not really a bug
Like some pixel differences that actual users could not even notice. I would say that if you make such a change then let's update the expectations, but that's not always possible since you cannot test your patch in every environment where we want coverage. (And if you don't use Ubuntu or Debian you cannot even produce results locally for Linux-desktop.)

After all, I think we should be careful about what rules we introduce. They should satisfy two requirements:
- we have to keep them. Not just the first week, not just the first month, but always. :)
- they must not block development too much. Who cares if we are rock stable if we cannot follow the evolution of the web?!

I agree with Ossy that we should allocate more effort to bug fixing / stabilisation, but I don't agree that we should banish the skip list once and for all. Actually there is no stable port of WebKit where the skip list is unused. I would say, let's try to find a better
balance between stability and the speed of development.




What could be improved:

- We need to make a gardening plan. We can't be serious about making
web browsers/APIs without improving our coverage. I know we don't have
many resources, but I think it should be OK to have one person doing it
for a week and then rotate. Really, a week of it may be boring, but your turn
comes only rarely (about once every two to three months). This
will free Ossy up to do something else, so Ossy can go back to
proper coding. I can make that list if people agree. Also it needs to
be enforced (maybe reviews could be the exception).
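
For illustration only, the one-person-per-week rotation proposed above could be sketched like this. The names, the start date and the pool size of ten are assumptions made up for the example (nothing in this thread fixes them); with ten volunteers each person's turn comes round every ten weeks, which is in the "two to three months" range mentioned.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(gardeners, start, weeks):
    """Return (week_start, gardener) pairs, one gardener per week, round-robin."""
    names = islice(cycle(gardeners), weeks)
    return [(start + timedelta(weeks=i), name) for i, name in enumerate(names)]


# Hypothetical volunteer names and an arbitrary Monday as the start date.
gardeners = ["dev%d" % n for n in range(1, 11)]
schedule = weekly_rotation(gardeners, date(2012, 5, 14), 12)

for week_start, name in schedule:
    print(week_start.isoformat(), name)
```

A list like this only covers who is on duty; it does not address enforcement or who fixes regressions outside the gardener's area.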

Gardening isn't so simple that one person alone can do it. One person can be enough for fire-fighting: build fixes, updating expected files, reporting bugs, fixing
some trivial bugs. But it isn't enough to fix all the regressions caused by others
who take no responsibility at all, or regressions that occur in a part of WebKit
you don't know anything about. Not to mention that there are many complex tests, and
it isn't trivial to decide whether a new result is correct or not.

I added our gardening timetable to this wiki:
https://trac.webkit.org/wiki/QtWebKitBuildBots

All new volunteers are very welcome. ;-) It would be great if you guys at INdT could join; you are close to the PDT timezone. And handling problems while they are fresh is always simpler than waiting for the Hungarian morning and trying to solve dozens of
new regressions, broken builds, assertions, flaky tests, ...

- We need to be able to test/stress/break the bot environment. Today
the fact that none of us can mess with the bot makes it hard to
reproduce bot failures that you can't see on your machine.
While I do understand that Ossy doesn't give
us the key to the bot (and we don't want that), we still need to have one to mess around with.

In the past we hacked the layout test system many times to make it able to run more than one bot on the same 8-24 core machine. But that only works for a single Linux user. We still have a strict limitation: another user trying to run tests on the same machine can kill all the bots, so for now only one user is
allowed. Given that, it isn't a good idea for anybody to log in and hack on
something. When I have to do it, I'm very, very careful, but sometimes I
break everything accidentally.

Not strictly connected to your points, but another infrastructural thing: when will we be able to run tests in parallel? Is it reliable right now? Could we make it the default configuration of nrwt (except on bots, until it is really stable) so folks would not have to know the command line switch by heart? (As far as I know it's not simple, because you need to call the real nrwt and not the Perl wrapper, and it's slightly different.) It would be much more fun to run the tests before uploading / landing
your patch if they did not take years to run.

-kbalazs

_______________________________________________
webkit-qt mailing list
[email protected]
http://lists.webkit.org/mailman/listinfo.cgi/webkit-qt
