Status: New
Owner: ----
New issue 2290 by [email protected]: Inconsistent performance of optimized
polymorphic call.
http://code.google.com/p/v8/issues/detail?id=2290
Extremely variable performance of V8 optimized code.
Take the attached html page and save it somewhere.
Load the page. After several seconds the text will be replaced with some
performance statistics.
The numbers are in runs per second, so higher is better.
If you refresh the page, the numbers are often different, by up to 4x.
This is a serious performance issue that affects code generated for the
dart2js compiler.
The test is taken from Dromaeo, the nextSibling benchmark of the
‘Traversal’ group.
There are four versions, see the test file for details.
nextSibling: Standard Dromaeo test with property access.
nextSibling-get-1: Property access replaced by function on subtree root.
nextSibling-get-many: Property access replaced by function on immediate
prototypes.
nextSibling-get-patched: Function on immediate prototype is installed on
demand by a stub on Object.prototype.
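The getter variants can be sketched as follows. This is only an illustration: plain constructors stand in for the DOM types, and names like get$nextSibling1 are made up here, not taken from the attached test file.

```javascript
// Illustrative sketch only; the real test operates on DOM prototypes.
// Simulated node types standing in for Text, HTMLDivElement, etc.
function Text(next) { this.nextSibling = next || null; }
function HTMLDivElement(next) { this.nextSibling = next || null; }

// nextSibling-get-1: a single function on a shared subtree root.
const Root = {};
Object.setPrototypeOf(Text.prototype, Root);
Object.setPrototypeOf(HTMLDivElement.prototype, Root);
Root.get$nextSibling1 = function () { return this.nextSibling; };

// nextSibling-get-many: the same small function installed separately
// on each immediate prototype.
for (const ctor of [Text, HTMLDivElement]) {
  ctor.prototype.get$nextSiblingMany = function () {
    return this.nextSibling;
  };
}

const last = new HTMLDivElement(null);
const first = new Text(last);
console.log(first.get$nextSibling1() === last);     // true
console.log(first.get$nextSiblingMany() === last);  // true
```

Both calls return the same sibling; they differ only in where on the prototype chain the function lives, which is what the benchmark varies.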
Our main concern is the last version, nextSibling-get-patched.
We use this pattern to add Dart methods to native DOM objects. We use this
pattern because the prototypes of DOM objects are not always available in
the initial execution context. The execution context may be a browser that
exposes a different subset of prototypes or a more constrained environment
like a worker or a stand-alone app running on d8 or nodejs. In these
diverse environments the program should succeed provided it does not
actually use any of the missing types.
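The on-demand patching pattern can be sketched like this (again with a plain constructor standing in for a DOM type; the stub and method names are illustrative):

```javascript
// Illustrative sketch; the real code patches native DOM prototypes.
function Comment() {}

// Stub on Object.prototype: on first call, find the receiver's
// immediate prototype, install the real method there, then dispatch
// again so the freshly installed method runs.  Non-enumerable so
// for-in loops are not polluted.
Object.defineProperty(Object.prototype, 'get$nextSibling', {
  value: function () {
    const proto = Object.getPrototypeOf(this);
    proto.get$nextSibling = function () { return this.nextSibling; };
    return this.get$nextSibling();
  },
  writable: true,
  configurable: true,
});

const a = new Comment();
a.nextSibling = 'sib';
console.log(a.get$nextSibling());  // 'sib'
// After the first call the method lives on Comment.prototype:
console.log(Comment.prototype.hasOwnProperty('get$nextSibling'));  // true
```

Installing the method on the immediate prototype is what creates the pre-patching and post-patching maps per Node type discussed in the analysis below.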
Typical result for Chrome 22.0.1229.2 dev-m on Windows:
nextSibling 477.08 ±0.99%
nextSibling-get-1 415.72 ±1.63%
nextSibling-get-many 434.46 ±2.41%
nextSibling-get-patched 112.73 ±0.37%
Notice that nextSibling-get-patched is much slower than the others, and all
the results have pretty low variance.
Now reload the page:
nextSibling 485.35 ±1.98%
nextSibling-get-1 344.21 ±4.76%
nextSibling-get-many 329.36 ±5.07%
nextSibling-get-patched 359.45 ±2.62%
Suddenly the middle ones are worse and nextSibling-get-patched is faster.
Unfortunately, the slow case is much more common than the fast case.
In comparison, Firefox 14.0.1 has a very predictable performance profile.
Each reload is essentially the same:
FF 14.0.1
nextSibling 274.30 ±0.99%
nextSibling-get-1 233.27 ±0.79%
nextSibling-get-many 230.56 ±2.20%
nextSibling-get-patched 203.31 ±0.63%
So for our most important case, Firefox is twice as fast, even though its
raw JS and/or DOM performance (nextSibling) is worse.
Analysis
The problems are (1) the lack of true polymorphic call caches and (2) an
essentially random choice of inlining candidates.
The workload is highly polymorphic. There are 10 types of DOM Node visited:
#nodes = 606
Text: 280
HTMLDivElement: 50
HTMLHeadingElement: 68
HTMLParagraphElement: 181
HTMLPreElement: 11
HTMLUListElement: 6
HTMLTableElement: 1
HTMLDListElement: 5
Comment: 3
HTMLOListElement: 1
The distribution is skewed: Text is 46%, the top four are 95% of the Nodes.
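A distribution like the one above can be tallied along the lines of the sketch below (a synthetic sibling chain stands in for the real DOM; the actual numbers come from the attached page):

```javascript
// Sketch: tally constructor names along a nextSibling chain.
function Text() {}
function HTMLDivElement() {}
const nodes = [new Text(), new HTMLDivElement(), new Text()];
nodes.forEach((n, i) => { n.nextSibling = nodes[i + 1] || null; });

const counts = {};
for (let cur = nodes[0]; cur; cur = cur.nextSibling) {
  const name = cur.constructor.name;
  counts[name] = (counts[name] || 0) + 1;
}
console.log(counts);  // { Text: 2, HTMLDivElement: 1 }
```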
Disassembling the inner loop shows that four maps are tested, with the
remaining six falling through to a CallIC. The difference in runtimes is
explained by which maps are chosen. If the top four are chosen, the runs
per second are comparable to those of the first version.
nextSibling-get-1 and nextSibling-get-many are roughly the same - there is
a distribution of runs/s caused by picking different subsets of the 10 maps
for the 10 Node types.
This hypothesis was tested by compiling Chrome + V8 with
kMaxCallPolymorphism = 10.
The runs/s for nextSibling-get-1 and nextSibling-get-many were the same and
consistently slightly higher than the loop containing cur.nextSibling.
This is explained by the inlining (10 times) of the small function. The
inlined this.nextSibling expression results in a LoadIC since nextSibling
is a DOM property. There is a performance benefit because each of the
LoadICs is now monomorphic. In effect, the cur.get$nextSibling1() call is
compiled into a massive true polymorphic LoadIC, which is faster than the
hash-table probing of the megamorphic LoadIC for cur.nextSibling.
nextSibling-get-patched is much worse because there are 20 maps - the
pre-patching map and the post-patching map for each Node. There are enough
maps that it is possible that the inlined polymorphic call tests useless
maps before falling through to a megamorphic call, which is susceptible to
hash table collisions.
Suggestions
The megamorphic hash table is address-hashed and unreliable run-to-run.
There are several ways to improve the situation.
* Use more counters
The megamorphic hash table contains ticks but these are usually 0 or 1 at
optimization time. They offer no useful prioritization. Exact counts
would provide higher quality information.
* Deprioritize transitioned maps
The original maps for the objects in the prototype chain in the
nextSibling-get-patched case have transitions to maps with the patched
functions. It is unlikely that the original maps are still in use. Caches
with maps with transitions should be sorted after caches without
transitions. This would fix the difference between
nextSibling-get-patched and nextSibling-get-1/-many.
* Use a true polymorphic call cache
The megamorphic hash table should be replaced with true polymorphic caches.
If call-site polymorphic caches are too expensive, a shared
selector-specific cache might work.
One idea is to adjust the unoptimized compiled function to use a true
polymorphic cache to allow gathering of more precise information on
deoptimization.
* Use a true polymorphic load cache
The fact that setting kMaxCallPolymorphism high caused superior performance
is a hint that a true polymorphic load cache for DOM properties might speed
up general programs.
The memory cost could be ameliorated by using the true site-specific
polymorphic cache only from locations in optimized code.
* Smarter optimization
Crankshaft inlined four copies of the same function (10 in the
experiment). The 4-way map test could be widened to a higher-degree test
if the target functions were recognized as identical and inlined as one copy.
(The only reason that multiple copies were beneficial was because the DOM
property access in the first benchmark went via the global megamorphic
cache; this would go away with a true polymorphic load cache.)
Appendix
Sample reloads.
nextSibling 470.09 ±6.34%
nextSibling-get-1 343.28 ±1.58%
nextSibling-get-many 486.35 ±3.23%
nextSibling-get-patched 394.40 ±1.49%
nextSibling 468.34 ±8.23%
nextSibling-get-1 488.03 ±1.14%
nextSibling-get-many 457.45 ±1.45%
nextSibling-get-patched 106.06 ±1.26%
nextSibling 477.04 ±1.21%
nextSibling-get-1 344.85 ±0.92%
nextSibling-get-many 438.75 ±1.22%
nextSibling-get-patched 219.72 ±0.94%
nextSibling 472.74 ±1.89%
nextSibling-get-1 352.17 ±1.16%
nextSibling-get-many 430.19 ±0.94%
nextSibling-get-patched 102.35 ±1.92%
nextSibling 470.04 ±1.04%
nextSibling-get-1 362.43 ±0.95%
nextSibling-get-many 442.45 ±1.73%
nextSibling-get-patched 224.80 ±0.54%
nextSibling 477.71 ±0.65%
nextSibling-get-1 439.87 ±1.94%
nextSibling-get-many 431.51 ±0.64%
nextSibling-get-patched 106.44 ±0.84%
nextSibling 470.04 ±2.13%
nextSibling-get-1 403.05 ±1.41%
nextSibling-get-many 337.54 ±1.13%
nextSibling-get-patched 391.47 ±3.25%
nextSibling 472.04 ±1.62%
nextSibling-get-1 361.28 ±0.90%
nextSibling-get-many 414.22 ±1.08%
nextSibling-get-patched 395.21 ±9.15%
nextSibling 476.38 ±1.61%
nextSibling-get-1 349.50 ±2.43%
nextSibling-get-many 425.92 ±0.97%
nextSibling-get-patched 108.70 ±0.85%
nextSibling 471.05 ±1.98%
nextSibling-get-1 407.58 ±0.95%
nextSibling-get-many 416.88 ±1.21%
nextSibling-get-patched 102.17 ±1.84%
nextSibling 461.44 ±0.59%
nextSibling-get-1 361.03 ±1.97%
nextSibling-get-many 332.67 ±1.11%
nextSibling-get-patched 373.18 ±0.90%
nextSibling 476.07 ±1.28%
nextSibling-get-1 414.78 ±0.94%
nextSibling-get-many 492.35 ±1.21%
nextSibling-get-patched 378.48 ±1.26%
Attachments:
dom-traverse.html 120 KB
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev