Hi Pierre,

I was wondering about callee-saved registers in the context of WebAssembly and started a first experiment there.
For general JavaScript, callee-saved registers pose some complications for deoptimization information and garbage collection. In essence, you would need to keep track of which registers are callee-saved so that the GC can find the pointers stored in them. Whether such registers contain pointers would be call-site specific and would no longer depend just on the current frame.

Also, I would expect callee-saved registers to be mostly beneficial for runtime performance in smaller functions that do not have high register pressure. In JavaScript, we probably inline many of these into hot contexts anyway, diminishing the potential return.

In WebAssembly the story is a bit different. We currently do not have pointers to heap objects in registers, so callee-saved registers are much easier to handle from a GC perspective. Also, there is no deoptimization to worry about. There is a proposal for pointers to heap objects in the form of anyrefs, but even once that lands, those pointers should be rare. So it should be feasible to always spill them to the stack, even if they are in callee-saved registers. Most importantly, though, we do not currently inline in WebAssembly, so small-ish leaf functions are more common and hence there might be more performance to be gained. So looking into this in the context of WebAssembly might be more promising.

Regarding the excessive moves in deferred code: that is an artifact of the current register allocator and might be worth looking into for code-size reasons anyway. As the code is not performance critical, reducing the number of moves has not been a priority.

Cheers
Stephan

On Thu, Oct 4, 2018 at 7:58 PM Pierre Langlois <[email protected]> wrote:
>
> Hello V8 devs!
>
> Recently we've been investigating how introducing callee-saved registers
> as part of the JavaScript calling convention would affect generated
> code.
> We understand it opens up questions on how different parts of V8
> would need to be adapted, specifically the GC, and it would require a
> lot of investigation and long term planning. Specifically, having the GC
> walk stack frames to dynamically find where callee-saved registers were
> saved would surely have a performance impact. But, maybe it could be
> balanced out by generating better code, we can't know until we
> investigate :-).
>
> So here we've only looked at the register allocator. And more
> specifically how it copes with not having to save and restore all
> registers around calls. It's already given us food for thought so we
> decided to share it.
>
> This is a long email, so TL;DR: Adding callee-saved registers introduces
> longer live ranges assigned to registers. This is very good for
> non-deferred blocks where a *lot* of gap moves are removed (up to
> half!). But, we get a *lot* more moves at deferred block boundaries. So
> much so that code size increases by 6% when running typescript on
> arm64. It looks like the register allocator could be improved to deal
> with longer register ranges in general.
>
> So, we've built a prototype that takes the list of callee-saved
> registers from a call descriptor and propagates it down so the register
> allocator can look it up for each call. And then it can decide to only
> clobber certain registers. After this all we needed to do was to define
> sets of callee-saved registers for each type of call descriptors:
>
> * Direct calls to C: Use callee-saved registers defined by the C
>   ABI. This also includes calls generated by the code-generator to
>   implement IEEE functions.
>
> * Call to the runtime: Clobber everything unless the call does not
>   have a frame state. A GC could be triggered and it needs everything
>   in memory.
>
> * Call to JS, stubs and builtins: We can define our own set of
>   callee-saved registers. When investigating we've picked the same set
>   as for C calls though.
>
> * We haven't investigated WASM calls yet.
>
> Of course, except for direct C calls, the generated code isn't
> functional. This was just an experiment to see what it looked like. What
> we can do though is make it a run-time option and gather statistics. We
> can force V8 to compile every optimised JS function twice, the second
> time with callee-saved registers, and then discard the latter. Finally
> we can compare them!
>
> Here is a link to the prototype if you want to take a look:
> https://chromium-review.googlesource.com/c/v8/v8/+/1261643 it's in the
> form of a series of 6 patches, marked as abandoned now.
>
> So without further ado, let's look at some numbers! We've looked at how
> gap moves and code size were affected. We ran the typescript benchmark
> from web-tooling 0.5.2 and displayed statistics in percentages. We've
> purposely split moves into deferred and non-deferred blocks after
> realising they were not affected in the same way at all.
>
> So here we are, numbers of instructions, register/stack slot moves and
> register/constant moves:
>
> | arch                    | instructions (%) | R <-> S (%) | deferred R <-> S (%) | R <- CST (%) | deferred R <- CST (%) |
> |-------------------------+------------------+-------------+----------------------+--------------+-----------------------|
> | arm64 (12 cs registers) |             6.30 |      -51.86 |               138.56 |       -19.48 |                144.35 |
> | arm (7 cs registers)    |             1.54 |      -34.60 |                57.53 |       -12.37 |                 42.93 |
> | x64 (5 cs registers)    |             1.37 |      -32.10 |                49.19 |        -8.13 |                 47.51 |
> | ia32 (3 cs registers)   |            -0.04 |       -7.92 |                 7.79 |        -1.54 |                  4.00 |
>
> We were hoping to get fewer moves in general but instead we got more!
> And especially on arm64 where we have a lot of registers. We cannot
> accept such a code size increase.
>
> That being said, if we forget about the deferred columns, we've
> *theoretically* gotten rid of up to half of register/stack moves! This
> is promising.
>
> But now we're a bit confused as to why the more callee-saved registers
> we have, the more moves we get in deferred blocks. We eventually linked
> it to splintering. Indeed, if we disable it with the
> --turbo-no-preprocess-ranges flag, we get the following results:
>
> | arch                    | instructions (%) | R <-> S (%) | deferred R <-> S (%) | R <- CST (%) | deferred R <- CST (%) |
> |-------------------------+------------------+-------------+----------------------+--------------+-----------------------|
> | arm64 (12 cs registers) |            -2.82 |      -37.20 |                -5.32 |       -13.24 |                 -8.29 |
> | arm (7 cs registers)    |            -2.35 |      -29.24 |               -18.87 |       -11.04 |                -10.53 |
> | x64 (5 cs registers)    |            -1.14 |      -24.74 |                 0.89 |        -6.52 |                 -0.56 |
> | ia32 (3 cs registers)   |            -0.47 |      -10.30 |               -11.01 |        -2.18 |                 -2.74 |
>
> It looks much more like what we'd like to get.
>
> You can find the full data sets attached to this email.
>
> --
> --
> v8-dev mailing list
> [email protected]
> http://groups.google.com/group/v8-dev
> ---
> You received this message because you are subscribed to the Google Groups
> "v8-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
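
The per-call-descriptor clobber rules listed in the quoted message can be sketched roughly as follows. This is a minimal Python illustration, not V8 code: the `CallKind` enum, the `clobbered_registers` and `survives_call` helpers, and the register names are all hypothetical, and the eight-register file is made up rather than taken from any real ABI. It also assumes that runtime calls without a frame state preserve the same callee-saved set as the other call kinds, which the message does not state.

```python
from enum import Enum, auto


class CallKind(Enum):
    """Kinds of call descriptor named in the quoted message (hypothetical enum)."""
    C_FUNCTION = auto()   # direct calls to C
    RUNTIME = auto()      # calls into the runtime
    JS_OR_STUB = auto()   # calls to JS, stubs and builtins


# Made-up register file for illustration; not any real architecture or ABI.
ALLOCATABLE = frozenset(f"r{i}" for i in range(8))

# Assumed callee-saved set, shared by all call kinds as in the experiment,
# where the JS set was picked to match the C one.
CALLEE_SAVED = frozenset({"r4", "r5", "r6", "r7"})


def clobbered_registers(kind: CallKind, has_frame_state: bool = False) -> frozenset:
    """Return the registers the allocator must treat as clobbered by a call."""
    if kind is CallKind.RUNTIME and has_frame_state:
        # A GC could be triggered and needs everything in memory.
        return ALLOCATABLE
    # Assumption: every other case preserves the shared callee-saved set.
    return ALLOCATABLE - CALLEE_SAVED


def survives_call(reg: str, kind: CallKind, has_frame_state: bool = False) -> bool:
    """True if a value held in `reg` can stay in that register across the call."""
    return reg not in clobbered_registers(kind, has_frame_state)
```

With this shape, a value in a callee-saved register stays live across a C or JS call without a spill, which is where the longer register live ranges described in the TL;DR come from.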
