Hey Pierre, thanks for sharing your findings with us. This data sounds very useful in evaluating possible paths V8 may take in the future. I'm adding a few colleagues who may have better insights into V8's compiler and register allocator.
Cheers,
Yang

On Thu, Oct 4, 2018 at 7:58 PM Pierre Langlois <[email protected]> wrote:
>
> Hello V8 devs!
>
> Recently we've been investigating how introducing callee-saved registers
> as part of the JavaScript calling convention would affect generated
> code. We understand it opens up questions on how different parts of V8
> would need to be adapted, specifically the GC, and it would require a
> lot of investigation and long term planning. Specifically, having the GC
> walk stack frames to dynamically find where callee-saved registers were
> saved would surely have a performance impact. But, maybe it could be
> balanced out by generating better code, we can't know until we
> investigate :-).
>
> So here we've only looked at the register allocator. And more
> specifically how it copes with not having to save and restore all
> registers around calls. It's already given us food for thought so we
> decided to share it.
>
> This is a long email, so TL;DR: Adding callee-saved registers introduces
> longer live ranges assigned to registers. This is very good for
> non-deferred blocks where a *lot* of gap moves are removed (up to
> half!). But, we get a *lot* more moves at deferred block boundaries. So
> much so that code size increases by 6% when running typescript on
> arm64. It looks like the register allocator could be improved to deal
> with longer register ranges in general.
>
> So, we've built a prototype that takes the list of callee-saved
> registers from a call descriptor and propagates it down so the register
> allocator can look it up for each call. And then it can decide to only
> clobber certain registers. After this all we needed to do was to define
> sets of callee-saved registers for each type of call descriptors:
>
> * Direct calls to C: Use callee-saved registers defined by the C
>   ABI. This also includes calls generated by the code-generator to
>   implement IEEE functions.
>
> * Call to the runtime: Clobber everything unless the call does not
>   have a frame state. A GC could be triggered and it needs everything
>   in memory.
>
> * Call to JS, stubs and builtins: We can define our own set of
>   callee-saved registers. When investigating we've picked the same set
>   as for C calls though.
>
> * We haven't investigated WASM calls yet.
>
> Of course, except for direct C calls, the generated code isn't
> functional. This was just an experiment to see what it looked like. What
> we can do though is make it a run-time option and gather statistics. We
> can force V8 to compile every optimised JS function twice, the second
> time with callee-saved registers, and then discard the latter. Finally
> we can compare them!
>
> Here is a link to the prototype if you want to take a look:
> https://chromium-review.googlesource.com/c/v8/v8/+/1261643 it's in the
> form of a series of 6 patches, marked as abandoned now.
>
> So without further ado, let's look at some numbers! We've looked at how
> gap moves and code size were affected. We ran the typescript benchmark
> from web-tooling 0.5.2 and displayed statistics in percentages. We've
> purposely split moves into deferred and non-deferred blocks after
> realising they were not affected in the same way at all.
>
> So here we are, numbers of instructions, register/stack slot moves and
> register/constant moves:
>
> | arch                    | instructions (%) | R <-> S (%) | deferred R <-> S (%) | R <- CST (%) | deferred R <- CST (%) |
> |-------------------------+------------------+-------------+----------------------+--------------+-----------------------|
> | arm64 (12 cs registers) |             6.30 |      -51.86 |               138.56 |       -19.48 |                144.35 |
> | arm (7 cs registers)    |             1.54 |      -34.60 |                57.53 |       -12.37 |                 42.93 |
> | x64 (5 cs registers)    |             1.37 |      -32.10 |                49.19 |        -8.13 |                 47.51 |
> | ia32 (3 cs registers)   |            -0.04 |       -7.92 |                 7.79 |        -1.54 |                  4.00 |
>
> We were hoping to get fewer moves in general but instead we got more!
> And especially on arm64 where we have a lot of registers. We cannot
> accept such a code size increase.
>
> That being said, if we forget about the deferred columns, we've
> *theoretically* gotten rid of up to half of register/stack moves! This
> is promising.
>
> But now we're a bit confused as to why the more callee-saved registers
> we have, the more moves we get in deferred blocks. We eventually linked
> it to splintering. Indeed, if we disable it with the
> --turbo-no-preprocess-ranges flag, we get the following results:
>
> | arch                    | instructions (%) | R <-> S (%) | deferred R <-> S (%) | R <- CST (%) | deferred R <- CST (%) |
> |-------------------------+------------------+-------------+----------------------+--------------+-----------------------|
> | arm64 (12 cs registers) |            -2.82 |      -37.20 |                -5.32 |       -13.24 |                 -8.29 |
> | arm (7 cs registers)    |            -2.35 |      -29.24 |               -18.87 |       -11.04 |                -10.53 |
> | x64 (5 cs registers)    |            -1.14 |      -24.74 |                 0.89 |        -6.52 |                 -0.56 |
> | ia32 (3 cs registers)   |            -0.47 |      -10.30 |               -11.01 |        -2.18 |                 -2.74 |
>
> It looks much more like what we'd like to get.
>
> You can find the full data sets attached to this email.
>
> --
> --
> v8-dev mailing list
> [email protected]
> http://groups.google.com/group/v8-dev
> ---
> You received this message because you are subscribed to the Google Groups
> "v8-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
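
For readers following the thread, the per-call policy described in the quoted message can be sketched roughly as below. This is a minimal illustration in Python, not V8 code: the `CallKind` enum, the 16-register machine, the `C_ABI_CALLEE_SAVED` set and all helper names are hypothetical. It also assumes that a runtime call without a frame state preserves the same set as a C call, which the message does not state explicitly.

```python
from enum import Enum, auto


class CallKind(Enum):
    C_DIRECT = auto()         # direct calls to C, including IEEE helper calls
    RUNTIME = auto()          # calls into the runtime
    JS_STUB_BUILTIN = auto()  # calls to JS, stubs and builtins


# Hypothetical 16-register machine; not a real V8 register configuration.
ALL_REGISTERS = frozenset(f"r{i}" for i in range(16))
# Illustrative stand-in for the callee-saved set defined by a C ABI.
C_ABI_CALLEE_SAVED = frozenset({"r4", "r5", "r6", "r7", "r8"})


def callee_saved_for(kind, has_frame_state=False):
    """Return the registers a call of this kind is assumed to preserve."""
    if kind is CallKind.RUNTIME and has_frame_state:
        # A GC may be triggered and needs everything in memory,
        # so nothing is treated as preserved.
        return frozenset()
    # C calls use the C ABI set; the experiment picked the same set for
    # JS, stubs and builtins. Reusing it for runtime calls without a
    # frame state is an assumption made for this sketch.
    return C_ABI_CALLEE_SAVED


def clobbered_by(kind, has_frame_state=False):
    """Registers the allocator must treat as overwritten by the call."""
    return ALL_REGISTERS - callee_saved_for(kind, has_frame_state)


def values_to_spill(live_in_registers, kind, has_frame_state=False):
    """Live values (name -> register) that must be saved around the call."""
    clobbered = clobbered_by(kind, has_frame_state)
    return sorted(v for v, reg in live_in_registers.items() if reg in clobbered)


live = {"a": "r1", "b": "r4", "c": "r7"}
print(values_to_spill(live, CallKind.JS_STUB_BUILTIN))  # ['a']
print(values_to_spill(live, CallKind.RUNTIME, True))    # ['a', 'b', 'c']
```

In this toy model, values held in callee-saved registers stay in place across the call, which is the effect behind the reduced register/stack moves in the non-deferred columns. It says nothing about splintering or deferred blocks.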
