[v8-dev] Re: Greedy allocator: perf work (issue 1157663007 by [email protected])

Mircea Trofin Thu, 11 Jun 2015 03:27:13 -0700

I believe there may be a difference in assumptions/premises. Let's see if
we can tease those apart. I also realize I am to blame for not having had
stated what my plans here were - which is quite possibly the reason for
these differences :)

When we started this effort, we had no data to indicate this new algorithm
has any value in the context of Turbofan. We had hints - LLVM's claims, and
Mozilla's (the latter being perhaps the closest to our scenario).

That's why I prioritized my work to first getting to numbers showing
feasibility: can this reg allocator produce better code quality than the
linear one, for a reasonable compile time cost. I think this is the main
difference between the plan I'm executing on and the one you're proposing -
if I understand it correctly: I believe your proposal starts from the
premise that the greedy algorithm *will* replace the linear one; I start
from the premise that we have (or had) no evidence that that should be the
case, and so we need to convince ourselves of that first.

I suppose this is a good place to ask: does this seem like a reasonable
starting point, or is there something I'm missing perhaps?

Currently, as per my last checkin, we have a sketch of the implementation
behind a flag- it's *almost* passing all the tests (what's keeping it red
is an assertion in 2 tests). I felt it served its purpose, though, and
didn't want to spend time green-ing an implementation I was going to change
anyway. This means we're basically at "step 2", however, given the
difference in premises, it follows that my next steps are a bit diverging
from what you're suggesting:

My next step (which is what I'm working through now, this current CL is a
stepping stone) is getting to a place where we have the data indicating
code quality is indeed better, and drive changes in the Turbofan codebase
based on the scenarios the new allocator is presenting. That means
continuing measurements comparing it with LS, and driving the right
heuristics in. At the same time, the problem of stability is there every
time we change heuristics, because we have no reason to believe that a new
way of splitting won't actually cause some security hole. I don't think we
can realistically stabilize and then optimize, I think it needs to happen
iteratively. That's why I want to start exercising the code as early as
possible with the security team, so that I understand the mechanics of fuzz
testing and start detecting bugs early... but this (fuzzing) will be very
likely a continuous effort. This is also why I'd find beneficial to flip
the switch on the algorithm in Canary.

Of course, if we can't see code quality improvements, then why bother
continuing :)

The next step after we have strong evidence of code quality value (and
fuzz/Canary issues fixed) is to look at opportunities for better compile
time. I suspect we won't beat LS there. The value, by design, of LS, is
fast compile time, but only reasonable code quality. This is not the
trade-off greedy is making. True, LLVM reported better compile time, but it
appears it was due to cleanup they meant to do - so we don't know if a
reimplemented, clean LS wouldn't have had even better compile time there.

This is likely another tripwire: we may find out that we can't find a
better trade-off than linear gives us already. To be clear, I'm not
suggesting punting compile time work until then - hence the work in this CL
for more appropriate data structures for certain usecases. I do think that
the "win" will be on code quality, and the compile time will be more of a
"this is the best we think we can do (while still not beating LS)", which
may be fine for an essentially AoT scenario - but we need the data to
reason about this tradeoff.

I wouldn't aggressively pursue changes to the live range model and so on
until they became dictated by the nature of the algorithm - unless your
point 4 suggests a full side-by-side coexistence of the 2 for a while? That
may actually work, but even then I'd want to see if and how much a change
is actually necessary (meaning, I want to see the turbo stats or a profiler
motivating any change).

Come to think of it, what I'm presenting is not so much different from the
plan you outlined, it's more that I want to approach it iteratively, with
more frequent trip wires, which I suppose is natural, given my stated
assumptions.

Coming back to more immediate topics, I'm not sure where the worry around
the size of the CL come from. It's still behind a flag, so it shouldn't
constitute any more of a risk than it currently is (from the last change) -
which is to say, none. True, we're exploring turning it on in Canary, but I
am more confident in hitting the prerequisites (green bots + fuzz tests)
with this CL than with the prototype I had before. Production is still
having the flag off, just like today.

In the framework outlined above, this CL is a stepping stone towards
collecting data supporting the value of replacing the allocator. I have a
subsequent change that builds on this, after which I should be able to
collect perf data, drive heuristics in, understand the motivation and
extent for any further changes to other parts of the model, get fuzz /
Canary crash reports, and see how all that turns back to affecting or not
perf.

To address 2 topics in 1 reply :) Jaroslav - the design LLVM has is
motivated by having 2 variants of the allocator, one that backtracks and
one that is faster, less optimal, and doesn't backtrack. I could see some
further factoring possible in the current implementation, for
maintainability reasons, at least, unless we also wanted to explore the 2
variant LLVM design; however, at this point, (given all the above) since I
believe the priority is getting code quality in a good place, then getting
the compile time in a good place, I'd punt too much such work until later.
Actually, if you were making a suggestion outside the timeline of this CL,
then yes, I absolutely agree.

On Thu, Jun 11, 2015 at 1:15 AM <[email protected]> wrote:

> This is actually a big CL, which I'm not sure we can get into landable
> state
> soonish. I suspect that we could proceed a lot faster if we split the work
> into
> manageable parts, something along these lines:
>
> 1.) Do the minimal amount of work required to make the "greedy allocator
> bot"
> green (i.e. passing all our current tests).
> 2.) Implement the basic algorithm (as mentioned in
> http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html)
> without
> any fancy heuristics, but with the interface introduced by LLVM to make the
> splitting/spilling decisions.
> 3.) Stabilize the basic algorithm and do measurements against LS.
> 4.) Based on that, split the basic allocator from LS and use appropriate
> data
> structures instead of trying to share all datastructures with LS.
> 5.) Start implementing the greedy heuristics on top of the basic allocator,
> just
> like LLVM did.
> 6.) Weaponize.
>
> WDYT?
>
> https://codereview.chromium.org/1157663007/
>

-- 
-- 
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
--- 
You received this message because you are subscribed to the Google Groups 
"v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[v8-dev] Re: Greedy allocator: perf work (issue 1157663007 by [email protected])

Reply via email to