Reviewers: Erik Corry, Mads Ager,
Message:
Yes, I verified with the latest v8 bleeding edge. This change is no longer
needed since the register allocator takes care of the issue.
Description:
I am illustrating the optimization with two simple examples. I found some
performance improvement on the benchmarks (v8-suite).
EXAMPLE WITH Binary Op Stubs
++++++++++++++++++++++++++++
temp1 = x1 + x2;
temp2 = temp3 - x3;
THIS CODE:
=========
0xf5aa9444 100 e59b0014 ldr r0, [fp, #+20]
0xf5aa9448 104 e52d0004 str r0, [sp, #-4]!
0xf5aa944c 108 e59b0010 ldr r0, [fp, #+16]
0xf5aa9450 112 e49d1004 ldr r1, [sp], #+4
0xf5aa9454 116 e59fc088 ldr ip, [pc, #+136] ;; debug:
statement
208
;; code: STUB,
GenericBinaryOp, minor: 144
0xf5aa9458 120 e12fff3c blx ip
0xf5aa945c 124 e50b0010 str r0, [fp, #-16]
0xf5aa9460 128 e51b0018 ldr r0, [fp, #-24]
0xf5aa9464 132 e52d0004 str r0, [sp, #-4]!
0xf5aa9468 136 e59b000c ldr r0, [fp, #+12]
0xf5aa946c 140 e49d1004 ldr r1, [sp], #+4
0xf5aa9470 144 e59fc070 ldr ip, [pc, #+112] ;; debug:
statement
227
;; code: STUB,
GenericBinaryOp, minor: 148
0xf5aa9474 148 e12fff3c blx ip
WITH THIS OPTIMIZATION:
========================
0xf5ae1434 84 e59b1014 ldr r1, [fp, #+20]
0xf5ae1438 88 e59b0010 ldr r0, [fp, #+16]
0xf5ae143c 92 e59fc078 ldr ip, [pc, #+120] ;; debug:
statement
208
;; code: STUB,
GenericBinaryOp, minor: 144
0xf5ae1440 96 e12fff3c blx ip
0xf5ae1444 100 e50b0010 str r0, [fp, #-16]
0xf5ae1448 104 e51b1018 ldr r1, [fp, #-24]
0xf5ae144c 108 e59b000c ldr r0, [fp, #+12]
0xf5ae1450 112 e59fc068 ldr ip, [pc, #+104] ;; debug:
statement
227
;; code: STUB,
GenericBinaryOp, minor: 148
0xf5ae1454 116 e12fff3c blx ip
Example with Compares:
++++++++++++++++++++++
if(x1 > x2) temp4 = x1;
else temp4 = x2;
THIS CODE:
=========
0xf5a473c0 160 e59b0014 ldr r0, [fp, #+20]
0xf5a473c4 164 e52d0004 str r0, [sp, #-4]!
0xf5a473c8 168 e59b0010 ldr r0, [fp, #+16]
0xf5a473cc 172 e52d0004 str r0, [sp, #-4]!
0xf5a473d0 176 e49d1004 ldr r1, [sp], #+4
0xf5a473d4 180 e49d0004 ldr r0, [sp], #+4
0xf5a473d8 184 e1802001 orr r2, r0, r1
0xf5a473dc 188 e3120001 tst r2, #1
0xf5a473e0 192 0a000003 beq +20 -> 212 (0xf5a473f4)
0xf5a473e4 196 e59fc068 ldr ip, [pc, #+104] ;; debug:
statement
148
;; code: STUB,
Compare, minor: 92
0xf5a473e8 200 e12fff3c blx ip
WITH THIS OPTIMIZATION:
========================
0xf5afdc98 152 e59b0014 ldr r0, [fp, #+20]
0xf5afdc9c 156 e59b1010 ldr r1, [fp, #+16]
0xf5afdca0 160 e1802001 orr r2, r0, r1
0xf5afdca4 164 e3120001 tst r2, #1
0xf5afdca8 168 0a000003 beq +20 -> 188 (0xf5afdcbc)
0xf5afdcac 172 e59fc060 ldr ip, [pc, #+96] ;; debug:
statement
148
;; code: STUB,
Compare, minor: 44
0xf5afdcb0 176 e12fff3c blx ip
EXPLANATION OF THE CHANGES:
===========================
This change is implemented at the source of the problem, i.e., where the
redundant push/pop instructions are inserted in the instruction stream,
rather
than as a post-pass push/pop elimination phase. The current push/pop
elimination
phase can only delete the push/pop's of one of the input operands for a
2-input
operation, also it can't eliminate when push/pop use different registers.
Also
can the post-pass push/pop elimination phase be functionally correct if
there is
a control flow merge point in between the push and the pop ? I am not sure
how
that condition is checked, or if the design assumptions guarantees that such
situation never arises.
Since, the current implementation is based on the model where few registers
(r0,
r1) are pre-assigned for operands to operations with 2 inputs (e.g.,
BinaryOperations) and LoadSlot uses r0 to load a local or parameter and then
push/pop before the actual operation, an initial check is done to see if the
second operand of a 2-input operation meets the criteria for successful
optimization. For a successful optimization, firstly the second operand
shouldn't expand into a sequence of instructions (i.e., it is not an
expression
tree), rather should be a single direct access of a local/parameter as
load(fp+offset).
This verification is implemented in a lightweight fashion are done by some
probe
functions. Also, during actual Visit traversals for generation operand
access,
LoadAndSpill should go directly to the VisitSlot and LoadSlot through
VisitProxy, and if in between it goes to any other VisitXYZ() operations the
optimization is not done. So for the first operand a probing is not needed,
but
is needed and hence done only for the second operand before we actually
start
the traversal through LoadAndSpill for the first operand.
Please review this at http://codereview.chromium.org/1340003/show
SVN Base: http://v8.googlecode.com/svn/branches/bleeding_edge/
Affected files:
M src/arm/codegen-arm.h
M src/arm/codegen-arm.cc
M src/arm/virtual-frame-arm.h
M src/arm/virtual-frame-arm.cc
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev