Re: [webkit-dev] Iterating SunSpider

2009-07-08 Thread Maciej Stachowiak


On Jul 7, 2009, at 8:50 PM, Geoffrey Garen wrote:

I also don't buy your conclusion -- that if regular expressions  
account for 1% of JavaScript time on the Internet overall, they  
need not be optimized.


I never said that.


You said the regular expression test was "most likely... the least  
relevant test" in SunSpider.


You said implementors' choice to optimize regular expressions  
because they were hot on SunSpider was "not what we want to  
encourage."


But maybe I misunderstood you. Do you think it was a good thing that  
SunSpider encouraged optimization of regular expressions? If so, do  
you think the same thing would have happened had SunSpider not used  
summation in calculating its scores?


I suspect this line of questioning will not result in effective  
persuasion or useful information transfer. It comes off as kind of a  
gotcha question.


My understanding of Mike's position is this:

- The slowest test on the benchmark will become a focus of  
optimization regardless of scoring method (thus, I assume he does not  
really think regexp optimization efforts are an utter waste).


- During the period when JS engines had made most things much faster
than they were when SunSpider first came out, but hadn't yet
extensively optimized regexps, the test gave a misleading and  
potentially unfair picture of overall performance. And this is a  
condition that could happen again in the future.


I think this is a plausible position, but I don't entirely buy these  
arguments, and I don't think they outweigh the reasons we chose to use  
summation scoring. I think it's ultimately a judgment call, and unless  
we have new information to present, we don't need to drag out the  
conversation or call each other to account on details of supporting arguments.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Geoffrey Garen
I also don't buy your conclusion -- that if regular expressions  
account for 1% of JavaScript time on the Internet overall, they need  
not be optimized.


I never said that.


You said the regular expression test was "most likely... the least  
relevant test" in SunSpider.


You said implementors' choice to optimize regular expressions because  
they were hot on SunSpider was "not what we want to encourage."


But maybe I misunderstood you. Do you think it was a good thing that  
SunSpider encouraged optimization of regular expressions? If so, do  
you think the same thing would have happened had SunSpider not used  
summation in calculating its scores?


Geoff


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 7:02 PM, Mike Belshe wrote:

On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak   
wrote:


- property access, involving at least some polymorphic access patterns
- method calls
- object-oriented programming patterns
- GC load
- programming in a style that makes significant use of closures

This sounds like good stuff to me.  A few more thoughts:
   - We also see sites with just huge chunks of JS code being  
delivered, yet sparsely used.  Perhaps a parsing/loading test is  
interesting.


I agree this is a common pattern. I think the string-unpack-code test  
(and to a lesser extent string-tagcloud) beats on parsing pretty heavily.
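
For what it's worth, a parse-heavy micro-test could be as simple as
timing the Function constructor over a large generated source string.
A hypothetical sketch - the sizes and names here are invented, not from
any existing suite:

var src = "var x = 0; x = x + 1;";
for (var i = 0; i < 12; i++)
    src += src;                      // double the source 12 times (~4096 copies)
var start = new Date();
var f = new Function(src);           // forces a full parse of the large body
var parseMs = new Date() - start;

Most of the time lands in the parser, since f is never called.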


   - Object cloning.  We should verify this is a useful test, but I  
believe template engines often use a pattern, combined with JSON
data, to clone JS objects.  This may be more of a DOM-level test, but
a JS equivalent should be doable.


I'd like to hear more about this.


   - JSON performance


string-tagcloud parses a giant JSON string, though not using the  
native JSON parsing facilities of ES5. Parsing of many shorter JSON  
expressions may be a useful test to add.
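
Sketch of what such a test might look like - field names and counts are
hypothetical, and the fallback covers engines without native ES5 JSON:

var parse = (typeof JSON !== "undefined" && JSON.parse)
    ? JSON.parse
    : function (s) { return eval("(" + s + ")"); };
var total = 0;
for (var i = 0; i < 10000; i++) {
    var obj = parse('{"id": ' + i + ', "name": "item' + i + '", "tags": [1, 2, 3]}');
    total += obj.id;                 // touch the result so the work can't be skipped
}

The point is many short, distinct strings rather than one giant one.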


   - Tests of prototype chain usage (basically the counter-programming-style
to closures)


There is some use of this but not a deep focus. Agreed it's good to  
test more.
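
To make the contrast concrete, here is the same counter in both styles -
illustrative only, not proposed test content:

// Closure style: state lives in the enclosing scope.
function makeCounter() {
    var n = 0;
    return function () { return ++n; };
}
// Prototype style: state on the instance, method found via the prototype chain.
function Counter() { this.n = 0; }
Counter.prototype.next = function () { return ++this.n; };

var c1 = makeCounter(), c2 = new Counter();
c1(); c2.next();                     // a real test would run each in a hot loop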




If I were to characterize SunSpider and V8Benchmark tests, the  
SunSpider tests are generally short and focused micro-benchmarks.   
The v8 tests are generally larger tests comprised of real code.


The SunSpider tests are a mix. For example, 3d-raytrace, string-tagcloud,
and the crypto tests are quite substantial examples of real
code solving a real problem. Some, like bitops-bits-in-byte, are very  
focused. That particular test came from a developer bug report and  
apparently comes from real game code, but it's a tiny part of the  
game; it used to make JavaScriptCore look really bad, which is why we  
included it.
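
From memory, the kernel of bitops-bits-in-byte is roughly this -
paraphrased, so details may differ from the shipped test:

function bitsinbyte(b) {
    var m = 1, c = 0;
    while (m < 0x100) {              // walk the 8 bit positions of a byte
        if (b & m) c++;
        m <<= 1;
    }
    return c;
}
for (var i = 0; i < 1000; i++)
    for (var b = 0; b < 256; b++)
        bitsinbyte(b);

Tiny, but it hammers integer ops, branches, and call overhead.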


One thing I have noticed about the v8 tests is that they include a lot  
of content translated from other programming languages, either  
automatically or by hand.


 Both types of test offer unique advantages.  The microbenchmarks  
provide a way to create lots of small tests which cover a certain  
pattern.  The larger tests are less focused, but require more  
features to work well together in the engine to get higher scores.   
TraceMonkey is fairly new, and with its tracing approach, it is not
surprising that its initial traces can optimize the micro-benchmarks
but not fully trace larger code like what is found in the
V8 benchmark.  In my opinion, both sets of tests are useful.


I do think TraceMonkey shows a bigger improvement on some categories  
of very trivial tests than on general code. But it seems to do better  
on most code than the V8 benchmark would indicate. I think this is  
because it gives the greatest benefit to operations other than  
function calls and property access, and those are the most heavily  
tested operations on the V8 benchmark.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 7:01 PM, Maciej Stachowiak  wrote:

>
> On Jul 7, 2009, at 6:43 PM, Mike Belshe wrote:
>
>
>> (There are other benchmarks that use summation, for example iBench, though
>> I am not sure these are examples of excellent benchmarks. Any benchmark that
>> consists of a single test also implicitly uses summation. I'm not sure what
>> other benchmarks do is as relevant to the technical merits.)
>>
>> Hehe - I don't think anyone has iBench except Apple :-)
>>
>
> This is now extremely tangential to the original point, but iBench is
> available to the general public here: <
> http://www.lionbridge.com/lionbridge/en-US/services/software-product-engineering/testing-veritest/benchmark-software.htm
> >


Thanks!


>
>
>  A lot of research has been put into benchmarking over the years; there is
>> good reason for these choices, and they aren't arbitrary.  I have not seen
>> research indicating that summing of scores is statistically useful, but
>> there are plenty that have chosen geometric means.
>>
>
>
> I think we're starting to repeat our positions at this point, without
> adding new information or really persuading each other.
>
> If you have research that shows statistical benefits to geometric mean
> scoring, or other new information to add, I would welcome it.


Only what is already on this thread or google for "geometric mean
benchmark".

Mike


>
>
>
> Regards,
> Maciej
>
>


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak  wrote:

>
> On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:
>
>  For example, the framework could compute both sums _and_ geomeans, if
>> people thought both were valuable.
>>
>
> That's a plausible thing to do, but I think there's a downside: if you make
> a change that moves the two scores in opposite directions, the benchmark
> doesn't help you decide if it's good or not. Avoiding paralysis in the face
> of tradeoffs is part of the reason we look primarily at the total score, not
> the individual subtest scores. The whole point of a meta-benchmark like this
> is to force ourselves to simplemindedly look at only one number.
>
>  We could agree on a way of benchmarking a representative sample of current
>> sites to get an idea of how widespread certain operations currently are.  We
>> could talk with the maintainers of jQuery, Dojo, etc. to see what sorts of
>> operations they think would be helpful to future apps to make faster.  We
>> could instrument browsers to have some sort of (opt-in) sampling of
>> real-world workloads.  etc.  Surely together we can come up with ways to
>> make Sunspider even better, while keeping its current strengths in mind.
>>
>
> I think these are all good ideas. I think there's one way in which sampling
> the Web is not quite right. To some extent, what matters is not average
> density of an operation but peak density. An operation that's used a *lot*
> by a few sites and hardly used by most sites, may deserve a weighting above
> its average proportion of Web use. I would like to hear input on what is
> inadequately covered. I tend to think there should be more coverage of the
> following:
>
> - property access, involving at least some polymorphic access patterns
> - method calls
> - object-oriented programming patterns
> - GC load
> - programming in a style that makes significant use of closures


This sounds like good stuff to me.  A few more thoughts:
   - We also see sites with just huge chunks of JS code being delivered, yet
sparsely used.  Perhaps a parsing/loading test is interesting.
   - Object cloning.  We should verify this is a useful test, but I believe
template engines often use a pattern, combined with JSON data, to clone JS
objects.  This may be more of a DOM-level test, but a JS equivalent should
be doable (see the sketch after this list).
   - JSON performance
   - Tests of prototype chain usage (basically the counter-programming-style
to closures)
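
Rough sketch of the cloning patterns I mean - hypothetical code, not taken
from any particular template engine:

// Shallow copy, the common template-engine idiom:
function clone(obj) {
    var copy = {};
    for (var key in obj) {
        if (obj.hasOwnProperty(key))
            copy[key] = obj[key];
    }
    return copy;
}
// Deep copy via a JSON round-trip (assumes a JSON implementation such as
// json2.js on engines where JSON isn't built in):
function deepClone(obj) {
    return JSON.parse(JSON.stringify(obj));
}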


If I were to characterize SunSpider and V8Benchmark tests, the SunSpider
tests are generally short and focused micro-benchmarks.  The v8 tests are
generally larger tests comprised of real code.  Both types of test offer
unique advantages.  The microbenchmarks provide a way to create lots of
small tests which cover a certain pattern.  The larger tests are less
focused, but require more features to work well together in the engine to
get higher scores.  TraceMonkey is fairly new, and with its tracing
approach, it is not surprising that its initial traces can optimize the
micro-benchmarks but not fully trace larger code like what is found in the
V8 benchmark.  In my opinion, both sets of tests are useful.

Mike






>
> I think the V8 benchmark does a much better job of covering the first four
> of these things. I also think it overweights them, to the exclusion of most
> other considerations(*). As I mentioned before, I'd like to include some of
> V8's tests in a future SunSpider 2.0 content set.
>
> It would be good to know what other things should be tested that are not
> sufficiently covered.
>
> Regards,
> Maciej
>
> * - For example, Mozilla's TraceMonkey effort showed relatively little
> improvement on the V8 benchmark, even though it showed significant
> improvement on SunSpider and other benchmarks. I think TraceMonkey speedups
> are real and significant, so this would tend to undermine my confidence in
> the V8 benchmark's coverage. Note: I don't mean to start a side thread about
> whether the V8 benchmark is good or not, I just wanted to justify my remarks
> above.


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 6:43 PM, Mike Belshe wrote:



(There are other benchmarks that use summation, for example iBench,  
though I am not sure these are examples of excellent benchmarks. Any  
benchmark that consists of a single test also implicitly uses  
summation. I'm not sure what other benchmarks do is as relevant to
the technical merits.)


Hehe - I don't think anyone has iBench except Apple :-)


This is now extremely tangential to the original point, but iBench is  
available to the general public here: <http://www.lionbridge.com/lionbridge/en-US/services/software-product-engineering/testing-veritest/benchmark-software.htm>


A lot of research has been put into benchmarking over the years;  
there is good reason for these choices, and they aren't arbitrary.   
I have not seen research indicating that summing of scores is  
statistically useful, but there are plenty that have chosen  
geometric means.



I think we're starting to repeat our positions at this point, without  
adding new information or really persuading each other.


If you have research that shows statistical benefits to geometric mean  
scoring, or other new information to add, I would welcome it.



Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 4:45 PM, Maciej Stachowiak  wrote:

>
> On Jul 7, 2009, at 4:28 PM, Mike Belshe wrote:
>
>
>> When SunSpider was first created, regexps were a small proportion of the
>> total execution in the fastest publicly available engines at the time.
>> Eventually, everything else got much faster. So at some point, SunSpider
>> said "it might be a good idea to quadruple the speed of regexp matching
>> now". But if it used a geometric mean, it would always say it's a good idea
>> to quadruple the speed of regexp matching, unless it omitted regexp tests
>> entirely. From any starting point, and regardless of speed of other
>> facilities, speeding up regexps by a factor of N would always show the same
>> improvement in your overall score. SunSpider, on the other hand, was
>> deliberately designed to highlight the area where an engine most needs
>> improvement.
>>
>> I don't think the optimization of regex would have been affected by using
>> a different scoring mechanism.  In both scoring methods, the score of the
>> slowest test is the best pick for improving your overall score.
>>
>
> I don't see how that's the case with geometric means. With a geometric
> mean, the score of the test you can most easily optimize is the best pick,
> assuming your goal is to most improve the overall score. Improving the
> fastest test by a factor of 2 improves your score exactly as much as
> improving the slowest test by a factor of 2. Thus, there is no bias towards
> improving the slowest test, unless there is reason to believe that test
> would be the easiest to optimize. We chose summation specifically to avoid
> this phenomenon - we wanted the benchmark to make us think about what most
> needs improvement, not just what is easiest to optimize.


Usually with performance you end up with an exponentially increasing effort
to squeeze out the same amount of perf.  I've rarely seen a case where a
single test can continually be improved with less effort than going after the
slower test.

I don't think a benchmark is often the right way to decide "what most
needs improvement".



> (There are other benchmarks that use summation, for example iBench, though
> I am not sure these are examples of excellent benchmarks. Any benchmark that
> consists of a single test also implicitly uses summation. I'm not sure what
> other benchmarks do is as relevant to the technical merits.)


Hehe - I don't think anyone has iBench except Apple :-)

A lot of research has been put into benchmarking over the years; there is
good reason for these choices, and they aren't arbitrary.  I have not seen
research indicating that summing of scores is statistically useful, but
there are plenty that have chosen geometric means.


Mike


>
>
> Regards,
> Maciej
>
>


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Peter Kasting
On Tue, Jul 7, 2009 at 5:08 PM, Maciej Stachowiak  wrote:

> On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:
>
>> For example, the framework could compute both sums _and_ geomeans, if
>> people thought both were valuable.
>>
>
> That's a plausible thing to do, but I think there's a downside: if you make
> a change that moves the two scores in opposite directions, the benchmark
> doesn't help you decide if it's good or not. Avoiding paralysis in the face
> of tradeoffs is part of the reason we look primarily at the total score, not
> the individual subtest scores. The whole point of a meta-benchmark like this
> is to force ourselves to simplemindedly look at only one number.


Yes, I originally had more text like "deciding how to use these scores would
be the hard part", and this is precisely why.

I suppose that if different vendors wanted to use different criteria to
determine what to do in the face of a tradeoff, the benchmark could simply
be a data source, rather than a strong guide.  But this would make it
difficult to use the benchmark to compare engines, which is currently a key
use of SunSpider (and is a key failing, IMO, of frameworks like Dromaeo that
don't run identical code on every engine [IIRC]).

I think there's one way in which sampling the Web is not quite right. To
> some extent, what matters is not average density of an operation but peak
> density. An operation that's used a *lot* by a few sites and hardly used by
> most sites may deserve a weighting above its average proportion of Web use.


If I understand you right, the effect you're noting is that speeding up
every web page by 1 ms might be a larger net win but a smaller perceived win
than speeding up, say, Gmail alone by 100 ms.

I think this is true.  One way to capture this would be to say that at least
part of the benchmark should concentrate on operations that are used in the
inner loops of any of n popular websites, without regard to their overall
frequency on the web.  (Although perhaps the two correlate well and there
aren't a lot of "rare but peaky" operations?  I don't know.)


> - GC load


I second this.  As people use more tabs and larger, more complex apps, the
performance of an engine under heavier GC load becomes more relevant.
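
A GC-load test could be little more than allocation churn - a hypothetical
sketch, with arbitrary constants:

function churn() {
    var keep = null;
    for (var i = 0; i < 1000000; i++) {
        var o = { a: i, b: [i, i + 1], c: "s" + i };  // short-lived garbage
        if ((i & 1023) === 0)
            keep = o;                                 // retain a sampled few
    }
    return keep;                     // forces some objects to survive collection
}
churn();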

It would be good to know what other things should be tested that are not
> sufficiently covered.


I think DOM bindings are hard to test and would benefit from benchmarking.
 No public benchmarks seem to test these well today.
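
A binding-heavy loop might look like the following - a sketch of the kind
of thing I mean (browser-only, and the constants are invented):

var container = document.createElement("div");
for (var i = 0; i < 1000; i++) {
    var child = document.createElement("span");
    child.appendChild(document.createTextNode("x" + i));
    container.appendChild(child);
}
// The hot part: crossing the JS/DOM boundary over and over.
var total = 0;
for (var j = 0; j < 1000; j++)
    total += container.childNodes.length + container.firstChild.nodeType;

The difficulty is that this measures the boundary plus both sides at once,
which is exactly why bindings are hard to isolate.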

* - For example, Mozilla's TraceMonkey effort showed relatively little
> improvement on the V8 benchmark, even though it showed significant
> improvement on SunSpider and other benchmarks. I think TraceMonkey speedups
> are real and significant, so this would tend to undermine my confidence in
> the V8 benchmark's coverage.


I agree that the V8 benchmark's coverage is inadequate and that the example
you mention illuminates that, because TraceMonkey definitely performs better
than SpiderMonkey in my own usage.  I wonder if there may have been an
opposite effect in a few cases where benchmarks with very simple tight loops
improved _more_ under TM than "real-world code" did, but I think the answer
to that is simply that benchmarks should be testing both kinds of code.

PK


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 4:19 PM, Peter Kasting wrote:

For example, the framework could compute both sums _and_ geomeans,  
if people thought both were valuable.


That's a plausible thing to do, but I think there's a downside: if you  
make a change that moves the two scores in opposite directions, the  
benchmark doesn't help you decide if it's good or not. Avoiding  
paralysis in the face of tradeoffs is part of the reason we look  
primarily at the total score, not the individual subtest scores. The  
whole point of a meta-benchmark like this is to force ourselves to  
simplemindedly look at only one number.


We could agree on a way of benchmarking a representative sample of  
current sites to get an idea of how widespread certain operations  
currently are.  We could talk with the maintainers of jQuery, Dojo,  
etc. to see what sorts of operations they think would be helpful to  
future apps to make faster.  We could instrument browsers to have  
some sort of (opt-in) sampling of real-world workloads.  etc.   
Surely together we can come up with ways to make Sunspider even  
better, while keeping its current strengths in mind.


I think these are all good ideas. I think there's one way in which  
sampling the Web is not quite right. To some extent, what matters is  
not average density of an operation but peak density. An operation  
that's used a *lot* by a few sites and hardly used by most sites may
deserve a weighting above its average proportion of Web use. I would  
like to hear input on what is inadequately covered. I tend to think  
there should be more coverage of the following:


- property access, involving at least some polymorphic access patterns (sketched below)
- method calls
- object-oriented programming patterns
- GC load
- programming in a style that makes significant use of closures
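
To illustrate the first item (a hypothetical micro-test, not proposed
content): read the same property name from objects of two different shapes,
which defeats a simple monomorphic inline cache.

function Point2(x, y)    { this.x = x; this.y = y; }
function Point3(x, y, z) { this.z = z; this.y = y; this.x = x; }
var objs = [];
for (var i = 0; i < 1000; i++)
    objs.push((i & 1) ? new Point3(i, i, i) : new Point2(i, i));
var sum = 0;
for (var j = 0; j < 100; j++)
    for (var k = 0; k < objs.length; k++)
        sum += objs[k].x;            // this access site sees two object shapes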

I think the V8 benchmark does a much better job of covering the first  
four of these things. I also think it overweights them, to the  
exclusion of most other considerations(*). As I mentioned before, I'd  
like to include some of V8's tests in a future SunSpider 2.0 content  
set.


It would be good to know what other things should be tested that are  
not sufficiently covered.


Regards,
Maciej

* - For example, Mozilla's TraceMonkey effort showed relatively little  
improvement on the V8 benchmark, even though it showed significant  
improvement on SunSpider and other benchmarks. I think TraceMonkey  
speedups are real and significant, so this would tend to undermine my  
confidence in the V8 benchmark's coverage. Note: I don't mean to start  
a side thread about whether the V8 benchmark is good or not, I just  
wanted to justify my remarks above. 


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 4:28 PM, Mike Belshe wrote:



When SunSpider was first created, regexps were a small proportion of  
the total execution in the fastest publicly available engines at
the time. Eventually, everything else got much faster. So at some  
point, SunSpider said "it might be a good idea to quadruple the  
speed of regexp matching now". But if it used a geometric mean, it  
would always say it's a good idea to quadruple the speed of regexp  
matching, unless it omitted regexp tests entirely. From any starting  
point, and regardless of speed of other facilities, speeding up  
regexps by a factor of N would always show the same improvement in  
your overall score. SunSpider, on the other hand, was deliberately  
designed to highlight the area where an engine most needs improvement.


I don't think the optimization of regex would have been affected by
using a different scoring mechanism.  In both scoring methods, the  
score of the slowest test is the best pick for improving your  
overall score.


I don't see how that's the case with geometric means. With a geometric  
mean, the score of the test you can most easily optimize is the best  
pick, assuming your goal is to most improve the overall score.
Improving the fastest test by a factor of 2 improves your score  
exactly as much as improving the slowest test by a factor of 2. Thus,  
there is no bias towards improving the slowest test, unless there is  
reason to believe that test would be the easiest to optimize. We chose  
summation specifically to avoid this phenomenon - we wanted the  
benchmark to make us think about what most needs improvement, not just  
what is easiest to optimize.
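
To spell that out with hypothetical numbers: take four subtest times of
[10, 10, 10, 400] ms. Under summation, halving the slow test takes the
total from 430 to 230, while halving a fast test only reaches 425 - a huge
difference in incentive. Under a geometric mean, both halvings improve the
score by the identical factor of 2^(1/4):

var times = [10, 10, 10, 400];
function geomean(t) {
    var p = 1;
    for (var i = 0; i < t.length; i++) p *= t[i];
    return Math.pow(p, 1 / t.length);
}
// Halve the slow test vs. halve a fast test - the geomeans come out equal:
geomean([10, 10, 10, 200]) === geomean([5, 10, 10, 400]);   // true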


(There are other benchmarks that use summation, for example iBench,  
though I am not sure these are examples of excellent benchmarks. Any  
benchmark that consists of a single test also implicitly uses  
summation. I'm not sure what other benchmarks do is as relevant of the  
technical merits.)


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 4:20 PM, Maciej Stachowiak  wrote:

>
> On Jul 7, 2009, at 4:01 PM, Mike Belshe wrote:
>
>
>> I'd like benchmarks to:
>>a) have meaning even as browsers change over time
>>b) evolve.  as new areas of JS (or whatever) become important, the
>> benchmark should have facilities to include that.
>>
>> Fair?  Good? Bad?
>>
>
> I think we can't rule out the possibility of a benchmark becoming less
> meaningful over time. I do think that we should eventually produce a new and
> rebalanced set of test content. I think it's fair to say that time is
> approaching for SunSpider.


I certainly agree that updating the benchmark over time is necessary :-)


>
>
> In particular, I don't think geometric means are a magic bullet.


Yes, using a geometric mean does not mean that you never need to update the
test suite.  But it does give you a lot of mileage :-)  And I think it's
closer to an "industry standard" than anything else (spec.org).



> When SunSpider was first created, regexps were a small proportion of the
> total execution in the fastest publicly available engines at the time.
> Eventually, everything else got much faster. So at some point, SunSpider
> said "it might be a good idea to quadruple the speed of regexp matching
> now". But if it used a geometric mean, it would always say it's a good idea
> to quadruple the speed of regexp matching, unless it omitted regexp tests
> entirely. From any starting point, and regardless of speed of other
> facilities, speeding up regexps by a factor of N would always show the same
> improvement in your overall score. SunSpider, on the other hand, was
> deliberately designed to highlight the area where an engine most needs
> improvement.


I don't think the optimization of regex would have been affected by using a
different scoring mechanism.  In both scoring methods, the score of the
slowest test is the best pick for improving your overall score.  So vendors
would still need to optimize it to keep up.


Mike



>
> I think the only real way to deal with this is to periodically revise and
> rebalance the benchmark.
>
> Regards,
> Maciej
>
>


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 3:58 PM, Geoffrey Garen  wrote:

> Are you saying that you did see Regex as being such a high percentage of
>> javascript code?  If so, we're using very different mixes of content for our
>> tests.
>>
>
> I'm saying that I don't buy your claim that regular expression performance
> should only count as 1% of a JavaScript benchmark.


I never said that.


> I don't buy your premise -- that regular expressions are only 1% of
> JavaScript execution time on the web -- because I think your sample size is
> small, and, anecdotally, I've seen significant web applications that make
> heavy use of regular expressions for parsing and validation.


If you've got data, please post it - that would be fantastic.


> I also don't buy your conclusion -- that if regular expressions account for
> 1% of JavaScript time on the Internet overall, they need not be optimized.


I never said that.


> First, generally, I think it's dubious to say that there's any major
> feature of JavaScript that need not be optimized. Second, it's important for
> all web apps to be fast in WebKit -- not just the ones that do what's common
> overall. Third, we want to enable not only the web applications of today,
> but also the web applications of tomorrow.


I never said that either.

I'm talking about how a score is computed.  You're throwing a lot of red
herrings in here, and I don't know why.  Is this turning into an "I'm mad at
Google vs WebKit" thing?  I'm trying to have an engineering discussion about
the merits of two different scoring mechanisms.


> To some extent, I think you must agree with me, since v8 copied
> JavaScriptCore in implementing a regular expression JIT, and the v8
> benchmark copied SunSpider in including regular expressions as a test
> component.


> I like SunSpider because of its balance. I think SunSpider, unlike some
> other benchmarks, tends to encourage broad thinking about all the different
> parts of the JavaScript language, and design tradeoffs between them, while
> discouraging tunnel vision. You can't just implement fast integer math, or
> fast property access, and call it a day on SunSpider. Instead, you need to
> consider many different language features, and do them all well.


You're completely missing the point.

Of course benchmarks should cover a broad set of features; that is not being
debated here.  I'd love to see the tests *expanded*.  Using a sum, it is
very, very difficult to add "many different language features" without
causing the benchmark scoring to be very misleading.  (See my other reply).

The only thing I'm debating here is how the score should be computed.  There
is the summation method, whose weighting varies over time, and the
geometric mean, which stays balanced.

To think of this another way - how is JavaScript benchmarking different from
other types of benchmarks?  If we could answer this question, that would
help me a lot.

What do you think of spec.org?  They've been using geometric mean to do
scoring for years:  http://www.spec.org/spec/glossary/#geometricmean  Other
benchmarks all seem to choose geometric means as well.  Peacekeeper also uses
a geometric mean:  http://service.futuremark.com/peacekeeper/faq.action  I'm
actually hard-pressed to find any benchmark which uses summation.  Are there
any non-SunSpider benchmarks using sums?

Thanks,
Mike











>
>
> Geoff
>


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Oliver Hunt
What you seem to think is better would be to repeatedly update  
sunspider every time something gets faster, ignoring entirely
that the value in sunspider is precisely that it has not changed.


Not quite what I'm saying :-)

I'd like benchmarks to:
a) have meaning even as browsers change over time
b) evolve.  as new areas of JS (or whatever) become important,  
the benchmark should have facilities to include that.


Fair?  Good? Bad?


It's not unreasonable, but it can't be done on a whim, and changes
cannot be made trivially.  Both re-weighting sunspider and adding new
tests as things are made faster are incredibly hard to do soundly,
because it becomes easy to end up obscuring meaningful data.


In the context of regex, for example, say sunspider had been reweighted
for the current generation of JS engines before anyone had looked at
regex.  Regex would not have stood out as being substantially slower,
and would likely not have been investigated, resulting in everyone
having regex an order of magnitude slower than current engines.
That's why sunspider has not been updated: after, what, a year and a
half(?), it can still show areas where performance can be improved, and
while it does that it's still useful.


So determining when it is sensible to update sunspider is difficult.
You may be right, and find that rebalancing shows new areas where
performance can be improved; but if you're wrong, you run the risk of
changing the benchmark from an actually useful development tool into
something that is only useful for producing a number at the end.


If we see one section of the test taking dramatically longer than  
another then we can assume that we have not been paying enough  
attention to performance in that area; this is how we originally
noticed just how slow the regex engine was.  If we had been  
continually rebalancing the test over and over again we would not  
have noticed this or other areas where performance could be (and  
has) improved.  It would also break sunspider as a means for  
tracking and/or preventing performance regressions.


Of course, using old versions of the benchmark for regression  
testing is not prohibited by iterating a benchmark.


But what happens when the benchmarks disagree about what counts as an
improvement?  You can't improve performance with one benchmark while
testing for regressions with another.


--Oliver


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Maciej Stachowiak


On Jul 7, 2009, at 4:01 PM, Mike Belshe wrote:



I'd like benchmarks to:
a) have meaning even as browsers change over time
b) evolve.  as new areas of JS (or whatever) become important,  
the benchmark should have facilities to include that.


Fair?  Good? Bad?


I think we can't rule out the possibility of a benchmark becoming less  
meaningful over time. I do think that we should eventually produce a  
new and rebalanced set of test content. I think it's fair to say that  
time is approaching for SunSpider.


In particular, I don't think geometric means are a magic bullet. When  
SunSpider was first created, regexps were a small proportion of the  
total execution in the fastest publicly available engines at the
time. Eventually, everything else got much faster. So at some point,  
SunSpider said "it might be a good idea to quadruple the speed of  
regexp matching now". But if it used a geometric mean, it would always  
say it's a good idea to quadruple the speed of regexp matching, unless  
it omitted regexp tests entirely. From any starting point, and  
regardless of speed of other facilities, speeding up regexps by a  
factor of N would always show the same improvement in your overall  
score. SunSpider, on the other hand, was deliberately designed to  
highlight the area where an engine most needs improvement.


I think the only real way to deal with this is to periodically revise  
and rebalance the benchmark.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Peter Kasting
I'm more verbose than Mike, but it seems like people are talking past each
other.

On Tue, Jul 7, 2009 at 3:25 PM, Oliver Hunt  wrote:

> If we see one section of the test taking dramatically longer than another
> then we can assume that we have not been paying enough attention to
> performance in that area,
>

It depends on what your goal with perf is.  If the goal is to balance
optimizations such that operation A always consumes the same time as
operation B, you are correct.  But is this always best?  The current design
says "yes".  The open question is whether that is the best possible design.

On Tue, Jul 7, 2009 at 3:58 PM, Geoffrey Garen  wrote:
>
> I also don't buy your conclusion -- that if regular expressions account for
> 1% of JavaScript time on the Internet overall, they need not be optimized.


I didn't see Mike say that regexes "did not need to be optimized".

If given an operation that occurs 20% of the time and another that occurs 1%
of the time, I certainly think it _might_ be appropriate to spend more
engineering effort on optimizing the first operation.  Knowing for sure
depends on how much you value the rarer cases, for reasons such as you give
next:

Second, it's important for all web apps to be fast in WebKit -- not just the
> ones that do what's common overall. Third, we want to enable not only the
> web applications of today, but also the web applications of tomorrow.


I strongly agree with these principles, but I don't see why the current
design necessarily does a better job of preserving them than all other
designs.  For example, let's say at the time SunSpider was created (and
everything was roughly equal-weighted) that one of the subtests tested a
horribly slow operation that would greatly benefit future web apps if it
improved substantially.  Unfortunately, the original equal-weighting
enshrines the slowness of this operation, relative to the others being
tested, such that if you begin to make it faster, the subtests become
unbalanced and you conclude that no further work on it is needed for the
time being.  This is a suboptimal outcome.

So in general, the question is: when some operation is slower than others,
what criteria can we use to make the best decisions about where to spend
developer effort?  Surely our greatest cost here is opportunity cost.

I accept Maciej's statement that the current design was intentional.  I also
accept that sums and geomeans each have drawbacks in guiding
decision-making.  I simply want to focus on finding the best possible design
for the framework.

For example, the framework could compute both sums _and_ geomeans, if people
thought both were valuable.  We could agree on a way of benchmarking a
representative sample of current sites to get an idea of how widespread
certain operations currently are.  We could talk with the maintainers of
jQuery, Dojo, etc. to see what sorts of operations they think would be
helpful to future apps to make faster.  We could instrument browsers to have
some sort of (opt-in) sampling of real-world workloads.  etc.  Surely
together we can come up with ways to make Sunspider even better, while
keeping its current strengths in mind.

PK


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Tue, Jul 7, 2009 at 3:25 PM, Oliver Hunt  wrote:

>
> On Jul 7, 2009, at 3:01 PM, Mike Belshe wrote:
>
> On Mon, Jul 6, 2009 at 10:11 AM, Geoffrey Garen  wrote:
>
>>  So, what you end up with after a couple of years is that the slowest test in
>>> the suite is the most significant part of the score.  Further, I'll predict
>>> that the slowest test will most likely be the least relevant test, because
>>> the truly important parts of JS engines were already optimized.  This has
>>> happened with Sunspider 0.9 - the regex portions of the test became the
>>> dominant factor, even though they were not nearly as prominent in the real
>>> world as they were in the benchmark.  This leads to implementors optimizing
>>> for the benchmark - and that is not what we want to encourage.
>>>
>>
>> How did you determine that regex performance is "not nearly as prominent
>> in the real world?"
>>
>
> For a while regex was 20-30% of the benchmark on most browsers even though
> it didn't consume 20-30% of the time that browsers spent inside javascript.
>
> You're right, but you're ignoring that for a long time before then it was
> consuming much, much less.  If everything else gets faster then the
> proportion of time spent in the area that is not improved will increase,
> potentially by quite a lot.
>

Ok.


>
> On the topic of use in the real world -- jQuery at least runs a regex over
> the result of most (all?) XHR transactions to see if they might be XML, and
> jQuery seems to be increasingly widely used, and frequently in conjunction
> with XHR.
>

Ok.


>
> What you seem to think is better would be to repeatedly update sunspider
> every time something gets faster, ignoring entirely that the value in
> sunspider is precisely that it has not changed.
>

Not quite what I'm saying :-)

I'd like benchmarks to:
a) have meaning even as browsers change over time
b) evolve.  as new areas of JS (or whatever) become important, the
benchmark should have facilities to include that.

Fair?  Good? Bad?



> If we see one section of the test taking dramatically longer than another
> then we can assume that we have not been paying enough attention to
> performance in that area; this is how we originally noticed just how slow
> the regex engine was.  If we had been continually rebalancing the test over
> and over again we would not have noticed this or other areas where
> performance could be (and has) improved.  It would also break sunspider as a
> means for tracking and/or preventing performance regressions.
>

Of course, using old versions of the benchmark for regression testing is not
prohibited by iterating a benchmark.

Mike




>
> --Oliver
>


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Geoffrey Garen
Are you saying that you did see Regex as being such a high  
percentage of javascript code?  If so, we're using very different  
mixes of content for our tests.


I'm saying that I don't buy your claim that regular expression  
performance should only count as 1% of a JavaScript benchmark.


I don't buy your premise -- that regular expressions are only 1% of  
JavaScript execution time on the web -- because I think your sample  
size is small, and, anecdotally, I've seen significant web  
applications that make heavy use of regular expressions for parsing  
and validation.


I also don't buy your conclusion -- that if regular expressions  
account for 1% of JavaScript time on the Internet overall, they need  
not be optimized.


First, generally, I think it's dubious to say that there's any major  
feature of JavaScript that need not be optimized. Second, it's  
important for all web apps to be fast in WebKit -- not just the ones  
that do what's common overall. Third, we want to enable not only the  
web applications of today, but also the web applications of tomorrow.


To some extent, I think you must agree with me, since v8 copied  
JavaScriptCore in implementing a regular expression JIT, and the v8  
benchmark copied SunSpider in including regular expressions as a test  
component.


I like SunSpider because of its balance. I think SunSpider, unlike  
some other benchmarks, tends to encourage broad thinking about all the  
different parts of the JavaScript language, and design tradeoffs  
between them, while discouraging tunnel vision. You can't just  
implement fast integer math, or fast property access, and call it a  
day on SunSpider. Instead, you need to consider many different  
language features, and do them all well.


Geoff


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Oliver Hunt


On Jul 7, 2009, at 3:01 PM, Mike Belshe wrote:

On Mon, Jul 6, 2009 at 10:11 AM, Geoffrey Garen   
wrote:
So, what you end up with after a couple of years is that the slowest
test in the suite is the most significant part of the score.   
Further, I'll predict that the slowest test will most likely be the  
least relevant test, because the truly important parts of JS engines  
were already optimized.  This has happened with Sunspider 0.9 - the  
regex portions of the test became the dominant factor, even though  
they were not nearly as prominent in the real world as they were in  
the benchmark.  This leads to implementors optimizing for the  
benchmark - and that is not what we want to encourage.


How did you determine that regex performance is "not nearly as  
prominent in the real world?"


For a while regex was 20-30% of the benchmark on most browsers even  
though it didn't consume 20-30% of the time that browsers spent  
inside javascript.
You're right, but you're ignoring that for a long time before then it  
was consuming much, much less.  If everything else gets faster then the
proportion of time spent in the area that is not improved will  
increase, potentially by quite a lot.


On the topic of use in the real world -- jQuery at least runs a regex  
over the result of most (all?) XHR transactions to see if they might  
be XML, and jQuery seems to be increasingly widely used, and
frequently in conjunction with XHR.
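
The kind of check in question - a hypothetical paraphrase, not jQuery's
actual source:

var rxml = /\bxml\b/i;
function looksLikeXml(xhr) {
    var ct = xhr.getResponseHeader("Content-Type") || "";
    return rxml.test(ct);            // one regex run per XHR response
}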


What you seem to think is better would be to repeatedly update  
sunspider every time something gets faster, ignoring entirely that
the value in sunspider is precisely that it has not changed.  If we
see one section of the test taking dramatically longer than another  
then we can assume that we have not been paying enough attention to  
performance in that area, this is how we originally noticed just how  
slow the regex engine was.  If we had been continually rebalancing the  
test over and over again we would not have noticed this or other areas  
where performance could be (and has) improved.  It would also break  
sunspider as a means for tracking and/or preventing performance  
regressions.


--Oliver


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak  wrote:

>
> On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:
>
> I'd like to understand what's going to happen with SunSpider in the future.
>  Here is a set of questions and criticisms.  I'm interested in how these can
> be addressed.
>
> There are 3 areas I'd like to see improved in
> SunSpider, some of which we've discussed before:
>
>
> #1: SunSpider is currently version 0.9.  Will SunSpider ever change?  Or is 
> it static?
> I believe that benchmarks need to be able to
> move with the times.  As JS Engines change and improve, and as new areas are 
> needed
> to be benchmarked, we need to be able to roll the version, fix bugs, and
> benchmark new features.  The SunSpider version has not changed for ~2yrs.
>  How can we change this situation?  Are there plans for a new version
> already underway?
>
>
> I've been thinking about updating SunSpider for some time. There are two
> categories of changes I've thought about:
>
> 1) Quality-of-implementation changes to the harness. Among these might be
> ability to use the harness with multiple test sets. That would be 1.0.
>
> 2) An updated set of tests - the current tests are too short, and don't
> adequately cover some areas of the language. I'd like to make the tests take
> at least 100ms each on modern browsers on recent hardware. I'd also be
> interested in incorporating some of the tests from the v8 benchmark suite,
> if the v8 developers were ok with this. That would be SunSpider 2.0.
>
> The reason I've been hesitant to make any changes is that the press and
> independent analysts latched on to SunSpider as a way of comparing
> JavaScript implementations. Originally, it was primarily intended to be a
> tool for the WebKit team to help us make our JavaScript faster. However, now
> that third parties are relying on it, there are two things I want to be really
> careful about:
>
> a) I don't want to invalidate people's published data, so significant
> changes to the test content would need to be published as a clearly separate
> version.
>
> b) I want to avoid accidentally or intentionally making changes that are
> biased in favor of Safari or WebKit-based browsers in general, or that even
> give that impression. That would hurt the test's credibility. When we first
> made SunSpider, Safari actually didn't do that great on it, which I think
> helped people believe that the test wasn't designed to make us look good, it
> was designed to be a relatively unbiased comparison.
>
> Thus, any change to the content would need to be scrutinized in some way.
> I'm not sure what it would take to get widespread agreement that a 2.0
> content set is fair, but I agree it's time to make one soonish (before the
> end of the year probably). Thoughts on this are welcome.
>
>
> #2: Use of summing as a scoring mechanism is problematic
> Unfortunately, the sum-based scoring techniques do not withstand the test
> of time as browsers improve.  When the benchmark was first introduced, each
> test was equally weighted and reasonably large.  Over time, however, the
> test becomes dominated by the slowest tests - basically the weighting of the
> individual tests is variable based on the performance of the JS engine under
> test.  Today's engines spend ~50% of their time on just string and date
> tests.  The other tests are largely irrelevant at this point, and becoming
> less relevant every day.  Eventually many of the tests will take near-zero
> time, and the benchmark will have to be scrapped unless we figure out a
> better way to score it.  Benchmarking research which long pre-dates
> SunSpider confirms that geometric means provide a better basis for
> comparison:  http://portal.acm.org/citation.cfm?id=5673 Can future
> versions of the SunSpider driver be made so that they won't become
> irrelevant over time?
>
>
> Use of summation instead of geometric mean was a considered choice. The
> intent is that engines should focus on whatever is slowest. A simplified
> example: let's say it's estimated that likely workload in the field will
> consist of 50% Operation A, and 50% of Operation B, and I can benchmark them
> in isolation. Now let's say in implementation Foo these operations are
> equally fast, while in implementation Bar, Operation A is 4x as fast as in
> Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric
> means would imply that Foo and Bar are equally good, but Bar would actually
> be twice as slow on the intended workload.
>
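
Working the quoted example through with hypothetical unit times:

  Foo:  A = 1.00, B = 1  ->  sum = 2.00,  geometric mean = sqrt(1.00 * 1) = 1
  Bar:  A = 0.25, B = 4  ->  sum = 4.25,  geometric mean = sqrt(0.25 * 4) = 1

The geometric means tie, yet on the 50/50 workload Bar takes 4.25/2, about
2.1x as long as Foo - which is the distortion the example is constructed
to show.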

BTW - the way to work around this is to have enough sub-benchmarks such that
this just doesn't happen.  If we have the right test coverage, it seems
unlikely to me that a code change would dramatically improve exactly one
test at an exponential expense of exactly one other test.  I'm not saying it
is impossible - just that code changes don't generally cause that behavior.
 To combat this we can implement a broader base of benchmarks as well as
longer-running tests that are not "too

Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
As I said, we can argue the mix of tests forever, but it is not useful.
 Yes, I would test using top-100 sites.  In the future, if a benchmark
claims to have a representative mix, it should document why.  Right?
Are you saying that you did see Regex as being such a high percentage of
javascript code?  If so, we're using very different mixes of content for our
tests.

Mike


On Tue, Jul 7, 2009 at 3:08 PM, Geoffrey Garen  wrote:

> So, I determined this through profiling.  If you profile your browser while
>> browsing websites, you won't find that it spends 20-30% of its javascript
>> execution time running regex (even with the old pcre).
>>
>
> What websites did you browse, and how did you choose them?
>
> Do you think your browsing is representative of all JavaScript
> applications?
>
> Geoff
>


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Geoffrey Garen
So, I determined this through profiling.  If you profile your  
browser while browsing websites, you won't find that it spends  
20-30% of its JavaScript execution time running regex (even with the
old PCRE).


What websites did you browse, and how did you choose them?

Do you think your browsing is representative of all JavaScript  
applications?


Geoff


Re: [webkit-dev] Iterating SunSpider

2009-07-07 Thread Mike Belshe
On Mon, Jul 6, 2009 at 10:11 AM, Geoffrey Garen  wrote:

>  So, what you end up with after a couple of years is that the slowest test in
>> the suite is the most significant part of the score.  Further, I'll predict
>> that the slowest test will most likely be the least relevant test, because
>> the truly important parts of JS engines were already optimized.  This has
>> happened with Sunspider 0.9 - the regex portions of the test became the
>> dominant factor, even though they were not nearly as prominent in the real
>> world as they were in the benchmark.  This leads to implementors optimizing
>> for the benchmark - and that is not what we want to encourage.
>>
>
> How did you determine that regex performance is "not nearly as prominent in
> the real world?"
>

For a while regex was 20-30% of the benchmark on most browsers even though
it didn't consume 20-30% of the time that browsers spent inside javascript.

So, I determined this through profiling.  If you profile your browser while
browsing websites, you won't find that it spends 20-30% of its JavaScript
execution time running regex (even with the old PCRE).  It's more like 1%.
 If this is true, then it's a shame to see this consume 20-30% of any
benchmark, because it means the benchmark scoring is not indicative of the
real world.  Maybe I just disagree with the mix ever having been very
representative?  Or maybe it changed over time?  I don't know because I
can't go back in time :-)  Perhaps one solution is to better document how a
mix is chosen.

I don't really want to make this a debate about regex and he-says/she-says
how expensive it is.  We should talk about the framework.  If the framework
is subject to this type of skew, where it can disproportionately weight a
test, is that something we should avoid?

Keep in mind I'm not recommending any change to existing SunSpider 0.9 -
just changes to future versions.

Maciej pointed out a case where he thought the geometric mean was worse; I
think that's a fair consideration if you have the perfect benchmark with an
exactly representative workload.  But we don't have the ability to make a
perfectly representative benchmark workload, and even if we did it would
change over time - eventually making the benchmark useless...

Mike


Re: [webkit-dev] Iterating SunSpider

2009-07-06 Thread Maciej Stachowiak


On Jul 6, 2009, at 10:11 AM, Geoffrey Garen wrote:

So, what you end up with after a couple of years is that the slowest
test in the suite is the most significant part of the score.   
Further, I'll predict that the slowest test will most likely be the  
least relevant test, because the truly important parts of JS  
engines were already optimized.  This has happened with Sunspider  
0.9 - the regex portions of the test became the dominant factor,  
even though they were not nearly as prominent in the real world as  
they were in the benchmark.  This leads to implementors optimizing  
for the benchmark - and that is not what we want to encourage.


How did you determine that regex performance is "not nearly as  
prominent in the real world?"



For reference: in current JavaScriptCore, the one regexp-centric test  
is about 4.6% of the score by time. Three of the string tests also spend
their time in regexps; however, I think those are among the tests that
most closely resemble what Web sites do. I believe the situation is  
roughly similar in other competitive JavaScript engines. This is  
probably not exactly proportionate but it doesn't dominate the test. I  
don't think any of this is a problem, unless one thinks the regexp  
improvements in Nitro, V8 and TraceMonkey were a waste of resources.


What I have seen happen is that numeric processing and especially  
integer math became a smaller and smaller proportion of the test,  
looking at the best publicly available engines over time. I think that  
turned out to be the case because math had much more room for  
optimization in naive implementations than, say, string processing.


Regards,
Maciej



Re: [webkit-dev] Iterating SunSpider

2009-07-06 Thread Geoffrey Garen
So, what you end up with after a couple of years is that the slowest
test in the suite is the most significant part of the score.   
Further, I'll predict that the slowest test will most likely be the  
least relevant test, because the truly important parts of JS engines  
were already optimized.  This has happened with Sunspider 0.9 - the  
regex portions of the test became the dominant factor, even though  
they were not nearly as prominent in the real world as they were in  
the benchmark.  This leads to implementors optimizing for the  
benchmark - and that is not what we want to encourage.


How did you determine that regex performance is "not nearly as  
prominent in the real world?"


Geoff
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread Zoltan Herczeg
Hi,

> Can future versions
> of the SunSpider driver be made so that they won't become irrelevant over
> time?

I feel the weighting is more of an issue here than the total runtime.
Eventually some tests become dominant, and the gain (or loss) on them
almost determines the final results.
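
As a quick illustration, with hypothetical numbers (not actual SunSpider
timings), of how a summed score self-weights toward the slowest test:

    // Two already-optimized tests and one slow test, times in ms.
    var times = [2, 2, 40];
    var total = 0;
    for (var i = 0; i < times.length; i++)
        total += times[i];    // total = 44

    // Halving the 40ms test moves the sum from 44 to 24, a ~45% better
    // score; halving one of the 2ms tests moves it from 44 to 43,
    // barely 2%.  So the optimization incentive concentrates on
    // whichever test happens to be slowest.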

Besides, there was a discussion about SunSpider enhancements a year ago.
We collected some new JS benchmarks and put them into a WindScorpion (it is
another name for SunSpider) extension package. However, the topic died away
after a short time.

Zoltan


___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread George Staikos


On 4-Jul-09, at 2:47 PM, Mike Belshe wrote:


#2: Use of summing as a scoring mechanism is problematic
Unfortunately, the sum-based scoring techniques do not withstand  
the test of time as browsers improve.  When the benchmark was first  
introduced, each test was equally weighted and reasonably large.   
Over time, however, the test becomes dominated by the slowest tests  
- basically the weighting of the individual tests is variable based  
on the performance of the JS engine under test.  Today's engines  
spend ~50% of their time on just string and date tests.  The other  
tests are largely irrelevant at this point, and becoming less  
relevant every day.  Eventually many of the tests will take near-zero
time, and the benchmark will have to be scrapped unless we
figure out a better way to score it.  Benchmarking research which
long pre-dates SunSpider confirms that geometric means provide a
better basis for comparison:  http://portal.acm.org/citation.cfm?id=5673
Can future versions of the SunSpider driver be made so that
they won't become irrelevant over time?


   Actually, this doesn't happen on all CPUs.  For example, CPUs
without an FPU have very different results.  Memory performance is also
a big factor.


#3: The SunSpider harness has a variance problem due to CPU power  
savings modes.
Because the test runs a tiny amount of JavaScript (often under
10ms) followed by a 500ms sleep, CPUs will go into power savings  
modes between test runs.  This radically changes the performance  
measurements and makes it so that comparison between two runs is  
dependent on the user's power savings mode.  To demonstrate this,  
run SunSpider on two machines: one with the Windows
"balanced" (default) setting for power, and then again with "high
performance".  It's easy to see skews of 30% between these two  
modes.  I think we should change the test harness to avoid such  
accidental effects.


   I've noticed this issue too.

--
George Staikos
Torch Mobile Inc.
http://www.torchmobile.com/

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread Maciej Stachowiak


On Jul 5, 2009, at 9:58 AM, Joe Mason wrote:


Maciej Stachowiak wrote:
I think the pauses were large in an attempt to get stable,  
repeatable results, but are probably longer than necessary to  
achieve this. I agree with you that the artifacts in "balanced"  
power mode are a problem. Do you know what timer thresholds avoid  
the effect? I think this would be a reasonable "1.0" kind of change.


Just a gut feeling, but I suspect the exact throttling algorithm  
would vary too much from machine to machine and OS version to OS  
version to ever find a good threshold to avoid it.  The best thing  
to do would be to have the harness turn off CPU throttling when it  
starts.  (This is possible from the command line under Linux, and I
assume on the Mac, but Windows might be a problem.)


The command-line harness doesn't pause at all; the in-browser
harness is the one we are concerned about. We obviously can't turn off
CPU throttling from the JavaScript level. I'll see if I can still get
repeatable cross-browser results with a much smaller delay.


Regards,
Maciej

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread Joe Mason

Maciej Stachowiak wrote:
I think the pauses were large in an attempt to get stable, repeatable 
results, but are probably longer than necessary to achieve this. I agree 
with you that the artifacts in "balanced" power mode are a problem. Do 
you know what timer thresholds avoid the effect? I think this would be a 
reasonable "1.0" kind of change.


Just a gut feeling, but I suspect the exact throttling algorithm would 
vary too much from machine to machine and OS version to OS version to 
ever find a good threshold to avoid it.  The best thing to do would be 
to have the harness turn off CPU throttling when it starts.  (This is
possible from the command line under Linux, and I assume on the Mac, but
Windows might be a problem.)


Joe
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Iterating SunSpider

2009-07-05 Thread Mike Belshe
On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak  wrote:

>
> On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:
>
> I'd like to understand what's going to happen with SunSpider in the future.
>  Here is a set of questions and criticisms.  I'm interested in how these can
> be addressed.
>
> There are 3 areas I'd like to see improved in
> SunSpider, some of which we've discussed before:
>
>
> #1: SunSpider is currently version 0.9.  Will SunSpider ever change?  Or is 
> it static?
> I believe that benchmarks need to be able to
> move with the times.  As JS engines change and improve, and as new areas
> need to be benchmarked, we need to be able to roll the version, fix bugs, and
> benchmark new features.  The SunSpider version has not changed for ~2yrs.
>  How can we change this situation?  Are there plans for a new version
> already underway?
>
>
> I've been thinking about updating SunSpider for some time. There are two
> categories of changes I've thought about:
>
> 1) Quality-of-implementation changes to the harness. Among these might be
> ability to use the harness with multiple test sets. That would be 1.0.
>

Cool


>
> 2) An updated set of tests - the current tests are too short, and don't
> adequately cover some areas of the language. I'd like to make the tests take
> at least 100ms each on modern browsers on recent hardware. I'd also be
> interested in incorporating some of the tests from the v8 benchmark suite,
> if the v8 developers were ok with this. That would be SunSpider 2.0.
>

Cool.  Use of v8 tests is just fine; they're all open source.


>
> The reason I've been hesitant to make any changes is that the press and
> independent analysts latched on to SunSpider as a way of comparing
> JavaScript implementations. Originally, it was primarily intended to be a
> tool for the WebKit team to help us make our JavaScript faster. However, now
> that third parties are relying on it, there are two things I want to be really
> careful about:
>
> a) I don't want to invalidate people's published data, so significant
> changes to the test content would need to be published as a clearly separate
> version.
>

Of course.  Small UI nit - the current SunSpider benchmark doesn't make the
version very prominent at all.  It would be nice to make it more salient.


>
> b) I want to avoid accidentally or intentionally making changes that are
> biased in favor of Safari or WebKit-based browsers in general, or that even
> give that impression. That would hurt the test's credibility. When we first
> made SunSpider, Safari actually didn't do that great on it, which I think
> helped people believe that the test wasn't designed to make us look good; it
> was designed to be a relatively unbiased comparison.
>

Of course.


>
> Thus, any change to the content would need to be scrutinized in some way.
> I'm not sure what it would take to get widespread agreement that a 2.0
> content set is fair, but I agree it's time to make one soonish (before the
> end of the year probably). Thoughts on this are welcome.
>
>
> #2: Use of summing as a scoring mechanism is problematic
> Unfortunately, the sum-based scoring techniques do not withstand the test
> of time as browsers improve.  When the benchmark was first introduced, each
> test was equally weighted and reasonably large.  Over time, however, the
> test becomes dominated by the slowest tests - basically the weighting of the
> individual tests is variable based on the performance of the JS engine under
> test.  Today's engines spend ~50% of their time on just string and date
> tests.  The other tests are largely irrelevant at this point, and becoming
> less relevant every day.  Eventually many of the tests will take near-zero
> time, and the benchmark will have to be scrapped unless we figure out a
> better way to score it.  Benchmarking research which long pre-dates
> SunSpider confirms that geometric means provide a better basis for
> comparison:  http://portal.acm.org/citation.cfm?id=5673 Can future
> versions of the SunSpider driver be made so that they won't become
> irrelevant over time?
>
>
> Use of summation instead of geometric mean was a considered choice. The
> intent is that engines should focus on whatever is slowest. A simplified
> example: let's say it's estimated that the likely workload in the field will
> consist of 50% Operation A and 50% Operation B, and I can benchmark them
> in isolation. Now let's say that in implementation Foo these operations are
> equally fast, while in implementation Bar, Operation A is 4x as fast as in
> Foo, while Operation B is 4x as slow as in Foo. A comparison by geometric
> means would imply that Foo and Bar are equally good, but Bar would actually
> be twice as slow on the intended workload.
>

I could almost buy this if:
   a)  we had a really, really representative workload of what web pages do,
broken down in exactly the correct proportions.
   b)  the representative workload remained representative over time.

I'll argue tha

Re: [webkit-dev] Iterating SunSpider

2009-07-04 Thread Maciej Stachowiak


On Jul 4, 2009, at 1:06 PM, Peter Kasting wrote:


On Sat, Jul 4, 2009 at 11:47 AM, Mike Belshe  wrote:
#3: The SunSpider harness has a variance problem due to CPU power  
savings modes.


This one worries me because it decreases the consistency/reproducibility
of test scores and makes it harder to compare
engines or to track one engine's scores over time.  For example,  
doing a bunch of CPU work just before running the benchmark can  
affect whether and when the CPU throttles down during the benchmark  
run.


Possible solution:
The Dromaeo test suite already incorporates the SunSpider individual
tests under a new benchmark harness which fixes all 3 of the above  
issues.   Thus, one approach would be to retire SunSpider 0.9 in  
favor of Dromaeo.   http://dromaeo.com/?sunspider  Dromaeo has also  
done a lot of good work to ensure statistical significance of the  
results.  Once we have a better benchmarking framework, it would be  
great to build a new microbenchmark mix which more realistically  
exercises today's JavaScript.


One complaint I have heard about the Dromaeo tests (not the harness)  
is that the actual JS that gets run differs from browser to browser  
(e.g. because it is a direct copy of a source library that does UA  
sniffing).  If this is true it means that this suite as-is isn't  
useful to compare engines to each other.


However, the Dromaeo _harness_ is probably a win as-is.

Of course, changing anything about SunSpider raises the question of
tracking historical performance.  Perhaps the harness could support
versioning, or perhaps people are simply willing to say "SunSpider
1.0 scores cannot be compared to SunSpider 0.9 scores".  I believe
this is the approach the V8 benchmark takes.


I think versioning the test content is right, and I think we should do
that over time. A harness change to avoid triggering power-saving mode
on Windows would be a reasonable thing to do without a version change.
I don't think Dromaeo is a good choice of harness - I don't think their
results are stable enough, and I am not confident in the statistical
soundness of their methodology.


Regards,
Maciej

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] Iterating SunSpider

2009-07-04 Thread Maciej Stachowiak


On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:

I'd like to understand what's going to happen with SunSpider in the  
future.  Here is a set of questions and criticisms.  I'm interested  
in how these can be addressed.


There are 3 areas I'd like to see improved in SunSpider, some of  
which we've discussed before:


#1: SunSpider is currently version 0.9.  Will SunSpider ever  
change?  Or is it static?
I believe that benchmarks need to be able to move with the times.   
As JS engines change and improve, and as new areas need to be
benchmarked, we need to be able to roll the version, fix bugs, and
benchmark new features.  The SunSpider version has not changed for  
~2yrs.  How can we change this situation?  Are there plans for a new  
version already underway?


I've been thinking about updating SunSpider for some time. There are  
two categories of changes I've thought about:


1) Quality-of-implementation changes to the harness. Among these might  
be ability to use the harness with multiple test sets. That would be  
1.0.


2) An updated set of tests - the current tests are too short, and  
don't adequately cover some areas of the language. I'd like to make  
the tests take at least 100ms each on modern browsers on recent  
hardware. I'd also be interested in incorporating some of the tests  
from the v8 benchmark suite, if the v8 developers were ok with this.  
That would be SunSpider 2.0.


The reason I've been hesitant to make any changes is that the press  
and independent analysts latched on to SunSpider as a way of comparing  
JavaScript implementations. Originally, it was primarily intended to  
be a tool for the WebKit team to help us make our JavaScript faster.  
However, now that third parties are relying on it, there are two things I
want to be really careful about:


a) I don't want to invalidate people's published data, so significant  
changes to the test content would need to be published as a clearly  
separate version.


b) I want to avoid accidentally or intentionally making changes that  
are biased in favor of Safari or WebKit-based browsers in general, or  
that even give that impression. That would hurt the test's  
credibility. When we first made SunSpider, Safari actually didn't do  
that great on it, which I think helped people believe that the test  
wasn't designed to make us look good; it was designed to be a
relatively unbiased comparison.


Thus, any change to the content would need to be scrutinized in some  
way. I'm not sure what it would take to get widespread agreement that  
a 2.0 content set is fair, but I agree it's time to make one soonish  
(before the end of the year probably). Thoughts on this are welcome.




#2: Use of summing as a scoring mechanism is problematic
Unfortunately, the sum-based scoring techniques do not withstand the  
test of time as browsers improve.  When the benchmark was first  
introduced, each test was equally weighted and reasonably large.   
Over time, however, the test becomes dominated by the slowest tests  
- basically the weighting of the individual tests is variable based  
on the performance of the JS engine under test.  Today's engines  
spend ~50% of their time on just string and date tests.  The other  
tests are largely irrelevant at this point, and becoming less  
relevant every day.  Eventually many of the tests will take near-zero
time, and the benchmark will have to be scrapped unless we
figure out a better way to score it.  Benchmarking research which  
long pre-dates SunSpider confirms that geometric means provide a  
better basis for comparison:  http://portal.acm.org/citation.cfm?id=5673 
 Can future versions of the SunSpider driver be made so that they  
won't become irrelevant over time?


Use of summation instead of geometric mean was a considered choice.  
The intent is that engines should focus on whatever is slowest. A  
simplified example: let's say it's estimated that the likely workload in
the field will consist of 50% Operation A and 50% Operation B, and
I can benchmark them in isolation. Now let's say that in implementation
Foo these operations are equally fast, while in implementation Bar,
Operation A is 4x as fast as in Foo, while Operation B is 4x as slow  
as in Foo. A comparison by geometric means would imply that Foo and  
Bar are equally good, but Bar would actually be twice as slow on the  
intended workload.
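
To make the arithmetic concrete, here is a small sketch of that example
in JavaScript (the 10ms base cost is an assumption; only the ratios
matter):

    // Hypothetical per-operation times in ms for a 50/50 A/B workload.
    var foo = { A: 10, B: 10 };            // equally fast
    var bar = { A: 10 / 4, B: 10 * 4 };    // A is 4x faster, B is 4x slower

    function sum(t)     { return t.A + t.B; }
    function geomean(t) { return Math.sqrt(t.A * t.B); }

    sum(foo);      // 20
    sum(bar);      // 42.5 -> summation shows Bar roughly 2x slower
    geomean(foo);  // 10
    geomean(bar);  // 10   -> geometric mean calls them equally good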


Of course, doing this requires a judgment call on the reasonable balance
of different kinds of code, and that balance needs to be re-evaluated  
periodically. But tests based on geometric means also make an implied  
judgment call. The operations comprising each individual test are  
added linearly. The test then judges that these particular  
combinations are each equally important.





#3: The SunSpider harness has a variance problem due to CPU power  
savings modes.
Because the test runs a tiny amount of JavaScript (often under 10ms)
followed by a 500ms sleep, CPUs will go into power savings modes between
test runs.

Re: [webkit-dev] Iterating SunSpider

2009-07-04 Thread Peter Kasting
On Sat, Jul 4, 2009 at 11:47 AM, Mike Belshe  wrote:

> #3: The SunSpider harness has a variance problem due to CPU power savings
> modes.
>

This one worries me because it decreases the consistency/reproducibility of
test scores and makes it harder to compare engines or to track one engine's
scores over time.  For example, doing a bunch of CPU work just before
running the benchmark can affect whether and when the CPU throttles down
during the benchmark run.

Possible solution:
> The Dromaeo test suite already incorporates the SunSpider individual tests
> under a new benchmark harness which fixes all 3 of the above issues.   Thus,
> one approach would be to retire SunSpider 0.9 in favor of Dromaeo.
> http://dromaeo.com/?sunspider  Dromaeo has also done a lot of good work to
> ensure statistical significance of the results.  Once we have a better
> benchmarking framework, it would be great to build a new microbenchmark mix
> which more realistically exercises today's JavaScript.
>

One complaint I have heard about the Dromaeo tests (not the harness) is that
the actual JS that gets run differs from browser to browser (e.g. because it
is a direct copy of a source library that does UA sniffing).  If this is
true it means that this suite as-is isn't useful to compare engines to each
other.
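
For illustration, here is a hypothetical sketch (not Dromaeo's actual
code) of the kind of UA-sniffed library code at issue - each engine ends
up executing a different path for the "same" test:

    // UA sniffing picks a code path per browser.
    var isGecko = navigator.userAgent.indexOf("Gecko/") !== -1;

    function getByTag(tagName) {
        if (isGecko) {
            // Only Gecko-based browsers ever run this path...
            return document.getElementsByTagName(tagName);
        }
        // ...while every other engine runs this slower fallback, so the
        // two engines are effectively scored on different work.
        var all = document.getElementsByTagName("*");
        var out = [];
        for (var i = 0; i < all.length; i++) {
            if (all[i].tagName.toLowerCase() === tagName.toLowerCase())
                out.push(all[i]);
        }
        return out;
    }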

However, the Dromaeo _harness_ is probably a win as-is.

Of course, changing anything about SunSpider raises the question of
tracking historical performance.  Perhaps the harness could support
versioning, or perhaps people are simply willing to say "SunSpider
1.0 scores cannot be compared to SunSpider 0.9 scores".  I believe this is
the approach the V8 benchmark takes.

PK
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


[webkit-dev] Iterating SunSpider

2009-07-04 Thread Mike Belshe
I'd like to understand what's going to happen with SunSpider in the future.
 Here is a set of questions and criticisms.  I'm interested in how these can
be addressed.

There are 3 areas I'd like to see improved in
SunSpider, some of which we've discussed before:

#1: SunSpider is currently version 0.9.  Will SunSpider ever change?
Or is it static?
I believe that benchmarks need to be able to
move with the times.  As JS engines change and improve, and as new
areas need to be benchmarked, we need to be able to roll the version,
fix bugs, and benchmark new features.  The SunSpider version has not
changed for ~2yrs.
 How can we change this situation?  Are there plans for a new version
already underway?

#2: Use of summing as a scoring mechanism is problematic
Unfortunately, the sum-based scoring techniques do not withstand the test of
time as browsers improve.  When the benchmark was first introduced, each
test was equally weighted and reasonably large.  Over time, however, the
test becomes dominated by the slowest tests - basically the weighting of the
individual tests is variable based on the performance of the JS engine under
test.  Today's engines spend ~50% of their time on just string and date
tests.  The other tests are largely irrelevant at this point, and becoming
less relevant every day.  Eventually many of the tests will take near-zero
time, and the benchmark will have to be scrapped unless we figure out a
better way to score it.  Benchmarking research which long pre-dates
SunSpider confirms that geometric means provide a better basis for
comparison:  http://portal.acm.org/citation.cfm?id=5673 Can future versions
of the SunSpider driver be made so that they won't become irrelevant over
time?

#3: The SunSpider harness has a variance problem due to CPU power savings
modes.
Because the test runs a tiny amount of JavaScript (often under 10ms)
followed by a 500ms sleep, CPUs will go into power savings modes between
test runs.  This radically changes the performance measurements and makes it
so that comparison between two runs is dependent on the user's power savings
mode.  To demonstrate this, run SunSpider on two machines: one with the
Windows "balanced" (default) setting for power, and then again with "high
performance".  It's easy to see skews of 30% between these two modes.  I
think we should change the test harness to avoid such accidental effects.

(BTW - if you change SunSpider's sleep from 500ms to 10ms, the test runs in
just a few seconds.  It is unclear to me why the pauses are so large.  My
browser gets a 650ms score, so, run 5 times, the suite should take ~3000ms.
 But due to the pauses, it takes over 1 minute to run the test, leaving the
CPU ~96% idle.)
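
One possible direction - sketched below with hypothetical names, not the
actual SunSpider driver code - is a harness loop whose pause is short
enough that typical OS power governors never clock the CPU down between
runs, while still yielding to the browser:

    var INTER_TEST_DELAY_MS = 10;   // instead of 500

    function runSuite(tests, onDone) {
        var results = [];
        function next(i) {
            if (i >= tests.length) {
                onDone(results);              // per-test elapsed times in ms
                return;
            }
            var start = new Date();
            tests[i]();                       // each entry is one test function
            results.push(new Date() - start);
            // 10ms still lets the page repaint between tests, but is too
            // short for the CPU to drop into a low-power state.
            setTimeout(function () { next(i + 1); }, INTER_TEST_DELAY_MS);
        }
        next(0);
    }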

Possible solution:
The Dromaeo test suite already incorporates the SunSpider individual tests
under a new benchmark harness which fixes all 3 of the above issues.   Thus,
one approach would be to retire SunSpider 0.9 in favor of Dromaeo.
http://dromaeo.com/?sunspider  Dromaeo has also done a lot of good work to
ensure statistical significance of the results.  Once we have a better
benchmarking framework, it would be great to build a new microbenchmark mix
which more realistically exercises today's JavaScript.

Thanks,
Mike
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev