On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak <m...@apple.com> wrote:

> On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:
>
>> For example, the framework could compute both sums _and_ geomeans, if
>> people thought both were valuable.
>>
>
> That's a plausible thing to do, but I think there's a downside: if you make
> a change that moves the two scores in opposite directions, the benchmark
> doesn't help you decide if it's good or not. Avoiding paralysis in the face
> of tradeoffs is part of the reason we look primarily at the total score, not
> the individual subtest scores. The whole point of a meta-benchmark like this
> is to force ourselves to simplemindedly look at only one number.


Yes, I originally had more text like "deciding how to use these scores would
be the hard part", and this is precisely why.
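
To make the divergence concrete, here's a toy sketch (TypeScript, with made-up
numbers that aren't from any real benchmark): the sum calls the change a win
while the geomean calls it a regression.

    // Hypothetical subtest times in ms (lower is better); a change speeds up
    // the slow test but slightly regresses the fast one.
    const before = [100, 10];
    const after  = [80, 14];

    const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
    const geomean = (xs: number[]) =>
      Math.exp(xs.reduce((a, b) => a + Math.log(b), 0) / xs.length);

    console.log(sum(before), "->", sum(after));          // 110 -> 94    (sum: better)
    console.log(geomean(before), "->", geomean(after));  // ~31.6 -> ~33.5 (geomean: worse)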

I suppose that if different vendors wanted to use different criteria to
determine what to do in the face of a tradeoff, the benchmark could simply
be a data source, rather than a strong guide.  But this would make it
difficult to use the benchmark to compare engines, which is currently a key
use of SunSpider (and is a key failing, IMO, of frameworks like Dromaeo that
don't run identical code on every engine [IIRC]).

> I think there's one way in which sampling the Web is not quite right. To
> some extent, what matters is not average density of an operation but peak
> density. An operation that's used a *lot* by a few sites and hardly used by
> most sites, may deserve a weighting above its average proportion of Web use.


If I understand you right, the effect you're noting is that speeding up
every web page by 1 ms might be a larger net win but a smaller perceived win
than speeding up, say, Gmail alone by 100 ms.

I think this is true.  One way to capture this would be to have at least part
of the benchmark concentrate on operations that are used in the inner loops of
any of n popular websites, without regard to their overall frequency on the
web.  (Although perhaps the two correlate well and there aren't a lot of "rare
but peaky" operations?  I don't know.)
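
As a strawman, a peak-based weighting might look something like this sketch
(TypeScript; the sites, operations, and fractions are entirely invented):

    // Hypothetical per-site profiles: fraction of script time each operation
    // accounts for on each site.  Purely illustrative numbers.
    const profiles: Record<string, number>[] = [
      { stringOps: 0.30, regexp: 0.01, gc: 0.10 },  // typical site
      { stringOps: 0.25, regexp: 0.02, gc: 0.12 },  // typical site
      { stringOps: 0.05, regexp: 0.60, gc: 0.08 },  // one site that hammers regexps
    ];

    const ops = Object.keys(profiles[0]);
    const avgWeight  = (op: string) =>
      profiles.reduce((a, p) => a + p[op], 0) / profiles.length;
    const peakWeight = (op: string) =>
      Math.max(...profiles.map(p => p[op]));

    for (const op of ops) {
      console.log(op, "avg", avgWeight(op).toFixed(2), "peak", peakWeight(op).toFixed(2));
    }
    // regexp averages ~0.21 but peaks at 0.60, so a peak-based weighting would
    // emphasize it much more than an average-based one would.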


> - GC load


I second this.  As people use more tabs and larger, more complex apps, the
performance of an engine under heavier GC load becomes more relevant.
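
For instance, a GC-load test might look roughly like this sketch (TypeScript;
the sizes are arbitrary and only meant to show the shape of such a test):

    // Sketch: allocate lots of short-lived objects while holding a sizable live
    // heap, roughly what many tabs running large apps impose on the collector.
    function gcLoadTest(liveObjects: number, churnIterations: number): number {
      const live: { payload: number[] }[] = [];
      for (let i = 0; i < liveObjects; i++) {
        live.push({ payload: new Array<number>(64).fill(i) });   // long-lived heap
      }
      const start = Date.now();
      let sink = 0;
      for (let i = 0; i < churnIterations; i++) {
        const temp = { payload: new Array<number>(64).fill(i) }; // short-lived garbage
        sink += temp.payload[0];
      }
      console.log("checksum", sink);  // keep the churn loop from being optimized away
      return Date.now() - start;      // dominated by allocation + collection cost
    }

    console.log(gcLoadTest(100000, 1000000), "ms");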

> It would be good to know what other things should be tested that are not
> sufficiently covered.


I think DOM bindings are hard to test but would benefit from benchmarking.
No public benchmark seems to test them well today.
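
Mostly what such a test would measure is the JS <-> native boundary, i.e.
something in the spirit of this sketch (TypeScript, browser-only; the
particular operations are just placeholders):

    // Sketch: time tight loops over binding-heavy operations, i.e. calls that
    // cross the JS <-> DOM boundary.
    function timeBindings(iterations: number): { create: number; read: number } {
      const container = document.createElement("div");

      let t0 = performance.now();
      for (let i = 0; i < iterations; i++) {
        container.appendChild(document.createElement("span")); // node creation via bindings
      }
      const create = performance.now() - t0;

      let total = 0;
      t0 = performance.now();
      for (let i = 0; i < iterations; i++) {
        total += container.childNodes.length;                  // property access via bindings
      }
      const read = performance.now() - t0;

      console.log("checksum", total);  // keep the read loop from being optimized away
      return { create, read };
    }

    console.log(timeBindings(100000));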

> * For example, Mozilla's TraceMonkey effort showed relatively little
> improvement on the V8 benchmark, even though it showed significant
> improvement on SunSpider and other benchmarks. I think TraceMonkey speedups
> are real and significant, so this would tend to undermine my confidence in
> the V8 benchmark's coverage.


I agree that the V8 benchmark's coverage is inadequate and that the example
you mention illustrates this, since TraceMonkey definitely performs better
than SpiderMonkey in my own usage.  I wonder whether there was an opposite
effect in a few cases, where benchmarks with very simple tight loops improved
_more_ under TM than "real-world code" did, but I think the answer to that is
simply that benchmarks should test both kinds of code.

PK
_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
