Status: New
Owner: ----
New issue 2290 by [email protected]: Inconsistent performance of optimized
polymorphic call.
http://code.google.com/p/v8/issues/detail?id=2290
Extremely variable performance of V8 optimized code.
Take the attached html page and save it somewhere.
Load the page. After several seconds the text will be replaced with some
performance statistics.
The numbers are in runs per second, so higher is better.
If you refresh the page, the numbers are often different, by up to 4x.
This is a serious performance issue that affects code generated for the
dart2js compiler.
The test is taken from Dromaeo, the nextSibling benchmark of the
‘Traversal’ group.
There are four versions, see the test file for details.
nextSibling: Standard Dromaeo test with property access.
nextSibling-get-1: Property access replaced by function on subtree root.
nextSibling-get-many: Property access replaced by function on immediate
prototypes.
nextSibling-get-patched: Function on immediate prototype is installed on
demand by a stub on Object.prototype.
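The getter variants can be sketched as follows. This is only an illustration: plain constructors stand in for the DOM types, and names like get$nextSibling1 are made up here, not taken from the attached test file.

```javascript
// Illustrative sketch only; the real test operates on DOM prototypes.
// Simulated node types standing in for Text, HTMLDivElement, etc.
function Text(next) { this.nextSibling = next || null; }
function HTMLDivElement(next) { this.nextSibling = next || null; }

// nextSibling-get-1: a single function on a shared subtree root.
const Root = {};
Object.setPrototypeOf(Text.prototype, Root);
Object.setPrototypeOf(HTMLDivElement.prototype, Root);
Root.get$nextSibling1 = function () { return this.nextSibling; };

// nextSibling-get-many: the same small function installed separately
// on each immediate prototype.
for (const ctor of [Text, HTMLDivElement]) {
  ctor.prototype.get$nextSiblingMany = function () {
    return this.nextSibling;
  };
}

const last = new HTMLDivElement(null);
const first = new Text(last);
console.log(first.get$nextSibling1() === last);     // true
console.log(first.get$nextSiblingMany() === last);  // true
```

Both calls return the same sibling; they differ only in where on the prototype chain the function lives, which is what the benchmark varies.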
Our main concern is the last version, nextSibling-get-patched.
We use this pattern to add Dart methods to native DOM objects. We use this
pattern because the prototypes of DOM objects are not always available in
the initial execution context. The execution context may be a browser that
exposes a different subset of prototypes or a more constrained environment
like a worker or a stand-alone app running on d8 or nodejs. In these
diverse environments the program should succeed provided it does not
actually use any of the missing types.
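The on-demand patching pattern can be sketched like this (again with a plain constructor standing in for a DOM type; the stub and method names are illustrative):

```javascript
// Illustrative sketch; the real code patches native DOM prototypes.
function Comment() {}

// Stub on Object.prototype: on first call, find the receiver's
// immediate prototype, install the real method there, then dispatch
// again so the freshly installed method runs.  Non-enumerable so
// for-in loops are not polluted.
Object.defineProperty(Object.prototype, 'get$nextSibling', {
  value: function () {
    const proto = Object.getPrototypeOf(this);
    proto.get$nextSibling = function () { return this.nextSibling; };
    return this.get$nextSibling();
  },
  writable: true,
  configurable: true,
});

const a = new Comment();
a.nextSibling = 'sib';
console.log(a.get$nextSibling());  // 'sib'
// After the first call the method lives on Comment.prototype:
console.log(Comment.prototype.hasOwnProperty('get$nextSibling'));  // true
```

Installing the method on the immediate prototype is what creates the pre-patching and post-patching maps per Node type discussed in the analysis below.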
Typical result for Chrome 22.0.1229.2 dev-m on Windows:
nextSibling 477.08 ±0.99%
nextSibling-get-1 415.72 ±1.63%
nextSibling-get-many 434.46 ±2.41%
nextSibling-get-patched 112.73 ±0.37%
Notice that nextSibling-get-patched is much slower than the others, and all
the results have pretty low variance.
Now reload the page:
nextSibling 485.35 ±1.98%
nextSibling-get-1 344.21 ±4.76%
nextSibling-get-many 329.36 ±5.07%
nextSibling-get-patched 359.45 ±2.62%
Suddenly the middle ones are worse and nextSibling-get-patched is faster.
Unfortunately, the slow case is much more common than the fast case.
In comparison, Firefox 14.0.1 has a very predictable performance profile.
Each reload is essentially the same:
FF 14.0.1
nextSibling 274.30 ±0.99%
nextSibling-get-1 233.27 ±0.79%
nextSibling-get-many 230.56 ±2.20%
nextSibling-get-patched 203.31 ±0.63%
So for our most important case, Firefox is twice as fast, even though its
raw JS and/or DOM performance (nextSibling) is worse.
Analysis
The problems are (1) the lack of true polymorphic call caches and (2) an
essentially random choice of inlining candidates.
The workload is highly polymorphic. There are 10 types of DOM Node visited:
#nodes = 606
Text: 280
HTMLDivElement: 50
HTMLHeadingElement: 68
HTMLParagraphElement: 181
HTMLPreElement: 11
HTMLUListElement: 6
HTMLTableElement: 1
HTMLDListElement: 5
Comment: 3
HTMLOListElement: 1
The distribution is skewed: Text is 46%, the top four are 95% of the Nodes.
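A distribution like the one above can be tallied along the lines of the sketch below (a synthetic sibling chain stands in for the real DOM; the actual numbers come from the attached page):

```javascript
// Sketch: tally constructor names along a nextSibling chain.
function Text() {}
function HTMLDivElement() {}
const nodes = [new Text(), new HTMLDivElement(), new Text()];
nodes.forEach((n, i) => { n.nextSibling = nodes[i + 1] || null; });

const counts = {};
for (let cur = nodes[0]; cur; cur = cur.nextSibling) {
  const name = cur.constructor.name;
  counts[name] = (counts[name] || 0) + 1;
}
console.log(counts);  // { Text: 2, HTMLDivElement: 1 }
```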
Disassembling the inner loop shows that four maps are tested, with the
remaining six falling through to a CallIC. The difference in runtimes is
explained by which maps are chosen. If the top four are chosen, the runs
per second are comparable to those of the first version.
nextSibling-get-1 and nextSibling-get-many are roughly the same - there is
a distribution of runs/s caused by picking different subsets of the 10 maps
for the 10 Node types.
This hypothesis was tested by compiling Chrome + V8 with
kMaxCallPolymorphism = 10.
The runs/s for nextSibling-get-1 and nextSibling-get-many were the same and
consistently slightly higher than the loop containing cur.nextSibling.
This is explained by the inlining (10 times) of the small function. The
inlined this.nextSibling expression results in a LoadIC since nextSibling
is a DOM property. There is a performance benefit because each of the
LoadICs is now monomorphic. In effect, the cur.get$nextSibling1() call is
compiled into a massive true polymorphic LoadIC, which is faster than the
hash-table probing of the megamorphic LoadIC for cur.nextSibling.
nextSibling-get-patched is much worse because there are 20 maps - the
pre-patching map and the post-patching map for each Node. There are enough
maps that it is possible that the inlined polymorphic call tests useless
maps before falling through to a megamorphic call, which is susceptible to
hash table collisions.
Suggestions
The megamorphic hash table is address-hashed and unreliable run-to-run.
There are several ways to improve the situation.
* Use more counters
The megamorphic hash table contains ticks but these are usually 0 or 1 at
optimization time. They offer no useful prioritization. Exact counts
would provide higher quality information.
* Deprioritize transitioned maps
The original maps for the objects in the prototype chain in the
nextSibling-get-patched case have transitions to maps with the patched
functions. It is unlikely that the original maps are still in use. Caches
with maps with transitions should be sorted after caches without
transitions. This would fix the difference between
nextSibling-get-patched and nextSibling-get-1/-many.
* Use a true polymorphic call cache
The megamorphic hash table should be replaced with true polymorphic caches.
If call-site polymorphic caches are too expensive, a shared
selector-specific cache might work.
One idea is to adjust the unoptimized compiled function to use a true
polymorphic cache to allow gathering of more precise information on
deoptimization.
* Use a true polymorphic load cache
The fact that setting kMaxCallPolymorphism high caused superior performance
is a hint that a true polymorphic load cache for DOM properties might speed
up general programs.
The memory cost could be ameliorated by using the true site-specific
polymorphic cache only from locations in optimized code.
* Smarter optimization
Crankshaft inlined four copies of the same function (10 in the
experiment). The 4-way map test could be widened to a higher-degree test
if the target functions were recognized as identical and inlined as one copy.
(The only reason that multiple copies were beneficial was because the DOM
property access in the first benchmark went via the global megamorphic
cache; this would go away with a true polymorphic load cache.)
Appendix
Sample reloads.
nextSibling 470.09 ±6.34%
nextSibling-get-1 343.28 ±1.58%
nextSibling-get-many 486.35 ±3.23%
nextSibling-get-patched 394.40 ±1.49%
nextSibling 468.34 ±8.23%
nextSibling-get-1 488.03 ±1.14%
nextSibling-get-many 457.45 ±1.45%
nextSibling-get-patched 106.06 ±1.26%
nextSibling 477.04 ±1.21%
nextSibling-get-1 344.85 ±0.92%
nextSibling-get-many 438.75 ±1.22%
nextSibling-get-patched 219.72 ±0.94%
nextSibling 472.74 ±1.89%
nextSibling-get-1 352.17 ±1.16%
nextSibling-get-many 430.19 ±0.94%
nextSibling-get-patched 102.35 ±1.92%
nextSibling 470.04 ±1.04%
nextSibling-get-1 362.43 ±0.95%
nextSibling-get-many 442.45 ±1.73%
nextSibling-get-patched 224.80 ±0.54%
nextSibling 477.71 ±0.65%
nextSibling-get-1 439.87 ±1.94%
nextSibling-get-many 431.51 ±0.64%
nextSibling-get-patched 106.44 ±0.84%
nextSibling 470.04 ±2.13%
nextSibling-get-1 403.05 ±1.41%
nextSibling-get-many 337.54 ±1.13%
nextSibling-get-patched 391.47 ±3.25%
nextSibling 472.04 ±1.62%
nextSibling-get-1 361.28 ±0.90%
nextSibling-get-many 414.22 ±1.08%
nextSibling-get-patched 395.21 ±9.15%
nextSibling 476.38 ±1.61%
nextSibling-get-1 349.50 ±2.43%
nextSibling-get-many 425.92 ±0.97%
nextSibling-get-patched 108.70 ±0.85%
nextSibling 471.05 ±1.98%
nextSibling-get-1 407.58 ±0.95%
nextSibling-get-many 416.88 ±1.21%
nextSibling-get-patched 102.17 ±1.84%
nextSibling 461.44 ±0.59%
nextSibling-get-1 361.03 ±1.97%
nextSibling-get-many 332.67 ±1.11%
nextSibling-get-patched 373.18 ±0.90%
nextSibling 476.07 ±1.28%
nextSibling-get-1 414.78 ±0.94%
nextSibling-get-many 492.35 ±1.21%
nextSibling-get-patched 378.48 ±1.26%
Attachments:
dom-traverse.html 120 KB
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev