Status: New
Owner: ----

New issue 2290 by [email protected]: Inconsistent performance of optimized polymorphic call.
http://code.google.com/p/v8/issues/detail?id=2290

Extremely variable performance of V8 optimized code.


Take the attached html page and save it somewhere.
Load the page. After several seconds the text will be replaced with some performance statistics.
The numbers are in runs per second, so higher is better.
If you refresh the page, the numbers are often different, by up to a factor 4x. This is a serious performance issue that affects code generated for the dart2js compiler.

The test is taken from Dromaeo, the nextSibling benchmark of the ‘Traversal’ group.
There are four versions, see the test file for details.
nextSibling:    Standard Dromaeo test with property access.
nextSibling-get-1:      Property access replaced by function on subtree root.
nextSibling-get-many: Property access replaced by function on immediate prototypes. nextSibling-get-patched: Function on immedate prototype is installed on demand by a stub on Object.prototype.


Our main concern is the last version, nextSibling-get-patched.
We use this pattern to add Dart methods to native DOM objects. We use this pattern because the prototypes of DOM objects are not always available in the initial execution context. The execution context may be a browser that exposes a different subset of prototypes or a more constrained environment like a worker or a stand-alone app running on d8 or nodejs. In these diverse environments the program should succeed provided it does not actually use any of the missing types.

Typical result for Chrome 22.0.1229.2 dev-m on Windows:
nextSibling     477.08  ±0.99%
nextSibling-get-1       415.72  ±1.63%
nextSibling-get-many    434.46  ±2.41%
nextSibling-get-patched 112.73  ±0.37%


Notice that nextSibling-get-patched is much slower than the others, and all the results have pretty low variance.

Now reload the page:
nextSibling     485.35  ±1.98%
nextSibling-get-1       344.21  ±4.76%
nextSibling-get-many    329.36  ±5.07%
nextSibling-get-patched 359.45  ±2.62%


Suddenly the middle ones are worse and nextSibling-get-patched is faster.
Unfortunately, the slow case is much more common than the fast case.

In comparison, Firefox 14.0.1 has a very predictable performance profile. Each reload is essentially the same:

FF 14.0.1
nextSibling     274.30  ±0.99%
nextSibling-get-1       233.27  ±0.79%
nextSibling-get-many    230.56  ±2.20%
nextSibling-get-patched 203.31  ±0.63%


So for our most important case, Firefox is twice as fast, even though the raw JS and / or DOM performance (nextSibling) is worse.

Analysis


The problem is (1) lack of true polymorphic call caches (2) essentially random choice of inlining candidates.

The workload is highly polymorphic.  There are 10 types of DOM Node visited:
#nodes = 606
  Text: 280
  HTMLDivElement: 50
  HTMLHeadingElement: 68
  HTMLParagraphElement: 181
  HTMLPreElement: 11
  HTMLUListElement: 6
  HTMLTableElement: 1
  HTMLDListElement: 5
  Comment: 3
  HTMLOListElement: 1
The distribution is skewed: Text is 46%, the top four are 95% of the Nodes.

Disassembling the inner loop shows that four maps are tested, with the remaining six falling through to a CallIC. The difference in runtimes is explained by which maps are chosen. If the top four are chosen, the runs per second is comparable to the first version.

nextSibling-get-1 and nextSibling-get-many are roughly the same - there is a distribution of runs/s caused by picking different subsets of the 10 maps for the 10 Node types. This hypothesis was tested by compiling Chrome + V8 with kMaxCallPolymorphism = 10. The run/s for nextSibling-get-1 and nextSibling-get-many were the same and consistently slightly higher than the the loop containing cur.nextSibling. This is explained by the inlining (10 times) of the small function. The inlined this.nextSibling expression results in a LoadIC since nextSibling is a DOM property. There is a performance benefit because each of the LoadICs is now monomorphic. In effect, the cur.get$nextSibling1() call is compiled into a massive true polymorphic LoadIC, which is faster than the hash-table probing of the megamorphic LoadIC for cur.nextSibling.

nextSibling-get-patched is much worse because there are 20 maps - the pre-patching map and the post-patching map for each Node. There are enough maps that it is possible that the inlined polymorphic call tests useless maps before falling through to a megamorphic call, which is susceptible to hash table collisions.

Suggestions

The megamorphic hash table is address-hashed and unreliable run-to-run.
There are several ways to improve the situation.

* Use more counters

The megamorphic hash table contains ticks but these are usually 0 or 1 at optimization time. They offer no useful prioritization. Exact counts would provide higher quality information.

* Deprioritize transitioned maps

The original maps for the objects in the prototype chain in the nextSibling-get-patched case have transitions to maps with the patched functions. It is unlikely that the original maps are still in use. Caches with maps with transitions should be sorted after caches without transitions. This would fix the differentce between nextSibling-get-patched and nextSibling-get-1/-many.

* Use a true polymorphic call cache

The megamorphic hash table should be replaced with true polymorphic caches.
If call-site polymorphic caches are too expensive, a shared selector-specific cache might work. One idea is to adjust the unoptimized compiled function to use a true polymorphic cache to allow gathering of more precise information on deoptimization.

* Use a true polymorphic load cache

The fact that setting kMaxCallPolymorphism high caused superior performance is a hint that a true polymorphic load cache for DOM properties might speed up general programs. The memory cost could be ameliorated by using the true site-specific polymorphic cache only from locations in optimized code.

* Smarter optimization

Crankshaft inlined four copies of the same function (10 in the experiment). The 4-way map test could be optimized to a higher degree test if the target functions were recognized as the same and inlined as one copy. (The only reason that multiple copies were beneficial was because the DOM property access in the first benchmark went via the global megamorphic cache; this would go away with a true polymorphic load cache.)


Appendix

Sample reloads.

nextSibling     470.09  ±6.34%
nextSibling-get-1       343.28  ±1.58%
nextSibling-get-many    486.35  ±3.23%
nextSibling-get-patched 394.40  ±1.49%


nextSibling     468.34  ±8.23%
nextSibling-get-1       488.03  ±1.14%
nextSibling-get-many    457.45  ±1.45%
nextSibling-get-patched 106.06  ±1.26%


nextSibling     477.04  ±1.21%
nextSibling-get-1       344.85  ±0.92%
nextSibling-get-many    438.75  ±1.22%
nextSibling-get-patched 219.72  ±0.94%


nextSibling     472.74  ±1.89%
nextSibling-get-1       352.17  ±1.16%
nextSibling-get-many    430.19  ±0.94%
nextSibling-get-patched 102.35  ±1.92%


nextSibling     470.04  ±1.04%
nextSibling-get-1       362.43  ±0.95%
nextSibling-get-many    442.45  ±1.73%
nextSibling-get-patched 224.80  ±0.54%


nextSibling     477.71  ±0.65%
nextSibling-get-1       439.87  ±1.94%
nextSibling-get-many    431.51  ±0.64%
nextSibling-get-patched 106.44  ±0.84%

nextSibling     470.04  ±2.13%
nextSibling-get-1       403.05  ±1.41%
nextSibling-get-many    337.54  ±1.13%
nextSibling-get-patched 391.47  ±3.25%


nextSibling     472.04  ±1.62%
nextSibling-get-1       361.28  ±0.90%
nextSibling-get-many    414.22  ±1.08%
nextSibling-get-patched 395.21  ±9.15%


nextSibling     476.38  ±1.61%
nextSibling-get-1       349.50  ±2.43%
nextSibling-get-many    425.92  ±0.97%
nextSibling-get-patched 108.70  ±0.85%


nextSibling     471.05  ±1.98%
nextSibling-get-1       407.58  ±0.95%
nextSibling-get-many    416.88  ±1.21%
nextSibling-get-patched 102.17  ±1.84%


nextSibling     461.44  ±0.59%
nextSibling-get-1       361.03  ±1.97%
nextSibling-get-many    332.67  ±1.11%
nextSibling-get-patched 373.18  ±0.90%


nextSibling     476.07  ±1.28%
nextSibling-get-1       414.78  ±0.94%
nextSibling-get-many    492.35  ±1.21%
nextSibling-get-patched 378.48  ±1.26%


Attachments:
        dom-traverse.html  120 KB

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev

Reply via email to