Reviewers: jarin, jvoung, stichnot
https://codereview.chromium.org/1157663007/diff/1/src/compiler/schedule.cc
File src/compiler/schedule.cc (right):
https://codereview.chromium.org/1157663007/diff/1/src/compiler/schedule.cc#newcode340
src/compiler/schedule.cc:340: os << " (in loop: B" <<
block->loop_header()->rpo_number() << ")";
Profiling is much easier when the printout shows, right next to each
block, whether that block is inside a loop.
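For illustration, a minimal standalone sketch of the annotation described
above. The Block struct and DescribeBlock() are hypothetical stand-ins, not
the V8 BasicBlock class; only the " (in loop: B" suffix is taken from the
patch.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a scheduled block: an RPO number plus a pointer
// to the header of the enclosing loop (nullptr when not inside a loop).
struct Block {
  int rpo_number;
  const Block* loop_header;
};

// Builds the label for one block and appends the loop annotation only when
// the block has an enclosing loop header.
std::string DescribeBlock(const Block& block) {
  assert(block.rpo_number >= 0);
  std::ostringstream os;
  os << "B" << block.rpo_number;
  if (block.loop_header != nullptr) {
    os << " (in loop: B" << block.loop_header->rpo_number << ")";
  }
  return os.str();
}
```

With a header block numbered 1 and a body block numbered 2 that points at
it, the body is labelled "B2 (in loop: B1)" and the header stays "B1".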
https://codereview.chromium.org/1157663007/diff/1/src/disassembler.cc
File src/disassembler.cc (right):
https://codereview.chromium.org/1157663007/diff/1/src/disassembler.cc#newcode149
src/disassembler.cc:149: out.AddFormatted("%p %4X ", prev_pc, prev_pc
- begin);
perf reports offsets in hex, so I changed our output to hex as well. This
makes it easier to profile and to peek at the output code (with block
labels).
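A small sketch of what the "%4X" specifier produces, using std::snprintf
rather than the disassembler's own AddFormatted(). FormatOffset() is a
hypothetical helper written for this example, and the cast to unsigned is
an assumption made so that the argument type matches the specifier.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the distance from `begin` to `pc` as
// uppercase hex, right-aligned to a minimum width of four characters.
std::string FormatOffset(const unsigned char* begin, const unsigned char* pc) {
  assert(pc >= begin);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%4X", static_cast<unsigned>(pc - begin));
  return std::string(buf);
}
```

An offset of 26 bytes is rendered as "  1A", which can be compared
directly against the hex offsets that perf reports.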
https://codereview.chromium.org/1157663007/diff/20001/src/compiler/register-allocator.cc
File src/compiler/register-allocator.cc (right):
https://codereview.chromium.org/1157663007/diff/20001/src/compiler/register-allocator.cc#newcode3477
src/compiler/register-allocator.cc:3477:
static_cast<size_t>(code()->InstructionBlockCount()))
Is there a reason InstructionBlockCount() is signed? I noticed its
implementation intentionally casts the unsigned "size()" of the
underlying collection to signed.
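To make the question concrete, here is a sketch of the pattern being
described. BlockList and IsValidBlockIndex() are hypothetical names, not
the V8 classes; the point is only that a signed accessor over an unsigned
size() forces size_t callers to cast back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the instruction sequence.
class BlockList {
 public:
  explicit BlockList(size_t n) : blocks_(n) {}
  // Signed accessor: narrows the container's unsigned size() to int.
  int BlockCount() const { return static_cast<int>(blocks_.size()); }

 private:
  std::vector<int> blocks_;
};

// A caller holding a size_t index must cast the count back to size_t to
// avoid a signed/unsigned comparison.
bool IsValidBlockIndex(const BlockList& list, size_t index) {
  assert(list.BlockCount() >= 0);
  return index < static_cast<size_t>(list.BlockCount());
}
```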
Description:
While tuning, I realized that the initial implementation suffered from an
inefficiency: if a range had multiple conflicts, only the first 2 were
considered, after which (regardless of weight) the remaining would win by
default. In some scenarios this led to inefficient (though correct)
codegen. To correct this while avoiding disproportionately large
compile-time regressions, I implemented an alternate data structure for
storing allocations.
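As a rough illustration of taking every conflict into account, the sketch
below compares a candidate's weight against all overlapping ranges instead
of stopping after the first two. The Range struct and CanEvictAll() are
hypothetical, and the linear scan is a placeholder: it is not the data
structure introduced by this patch, whose purpose is precisely to make
this kind of lookup cheap.

```cpp
#include <cassert>
#include <vector>

// Hypothetical live range over the half-open interval [start, end).
struct Range {
  int start;
  int end;
  float weight;
};

bool Overlaps(const Range& a, const Range& b) {
  return a.start < b.end && b.start < a.end;
}

// Returns true only if the candidate outweighs every range it overlaps
// with, however many conflicts there are.
bool CanEvictAll(const Range& candidate, const std::vector<Range>& assigned) {
  assert(candidate.start < candidate.end);
  for (const Range& r : assigned) {
    if (Overlaps(candidate, r) && r.weight >= candidate.weight) return false;
  }
  return true;
}
```

With three assigned ranges of weight 1, 2 and 9, a candidate of weight 5
that overlaps all three is rejected because of the third conflict, which a
check limited to the first two would have missed.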
There were a few more important changes from the initial implementation:
- up-front grouping of related live ranges (e.g. phi inputs and output),
and a first-attempt allocation of each group to the same register;
- conflict resolution: I initially believed that splitting blocked
live ranges should rely on conflicts, but after more analysis (including
reference implementations and the scarce documentation around the LLVM
implementation) I changed that to rely on "what's best for the range
itself", letting the weight mechanics of the algorithm converge to the
right decision. As a result, a few benchmarks that had previously
regressed are now improved, and a few with serious regressions (15-20%)
are now under the 10% range. I expect these remaining regressions to be
easier to understand and address, given the changes I'm introducing here.
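The up-front grouping in the first bullet can be pictured with a
union-find over live range ids. This is a swapped-in technique chosen for
the example, not necessarily how the patch builds its groups; RangeGroups,
Phi and GroupPhis() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Minimal union-find over live range ids 0..n-1.
class RangeGroups {
 public:
  explicit RangeGroups(size_t n) : parent_(n) {
    std::iota(parent_.begin(), parent_.end(), size_t{0});
  }

  // Returns the representative of x's group, halving paths on the way up.
  size_t Find(size_t x) {
    assert(x < parent_.size());
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];
      x = parent_[x];
    }
    return x;
  }

  // Merges the groups containing a and b.
  void Union(size_t a, size_t b) {
    size_t root_a = Find(a);
    size_t root_b = Find(b);
    parent_[root_a] = root_b;
  }

 private:
  std::vector<size_t> parent_;
};

// Hypothetical phi: one output range id and its input range ids.
struct Phi {
  size_t output;
  std::vector<size_t> inputs;
};

// Puts each phi's inputs and output into one group, so that an allocator
// could first try to give the whole group the same register.
RangeGroups GroupPhis(size_t range_count, const std::vector<Phi>& phis) {
  RangeGroups groups(range_count);
  for (const Phi& phi : phis) {
    for (size_t input : phi.inputs) groups.Union(input, phi.output);
  }
  return groups;
}
```

For a phi with output 2 and inputs 0 and 1, ranges 0, 1 and 2 end up in
one group while an unrelated range 3 stays on its own.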
The change also includes a few opportunistic changes. The optimized code
printer was outputting instruction offsets in decimal; I changed it to hex
to match how perf outputs them. Also, in printouts from the instruction
scheduler, each block now indicates which loop (if any) it belongs to,
which helps with perf analysis.
BUG=
Please review this at https://codereview.chromium.org/1157663007/
Base URL: https://chromium.googlesource.com/v8/v8.git@master
Affected files (+1181, -402 lines):
M src/compiler/register-allocator.h
M src/compiler/register-allocator.cc
M src/compiler/schedule.cc
M src/disassembler.cc
M src/zone-containers.h