Reviewers: Erik Corry, Mads Ager,

Message:
Yes, I verified with the latest v8 bleeding edge. This change is no longer
needed since the register allocator takes care of the issue.



Description:
I am illustrating the optimization with two simple examples. I found some
performance improvement on the benchmarks (v8-suite).

EXAMPLE WITH Binary Op Stubs
++++++++++++++++++++++++++++

  temp1 = x1 + x2;
  temp2 = temp3 - x3;


THIS CODE:
=========
0xf5aa9444   100  e59b0014       ldr r0, [fp, #+20]
0xf5aa9448   104  e52d0004       str r0, [sp, #-4]!
0xf5aa944c   108  e59b0010       ldr r0, [fp, #+16]
0xf5aa9450   112  e49d1004       ldr r1, [sp], #+4
0xf5aa9454 116 e59fc088 ldr ip, [pc, #+136] ;; debug: statement
208
                                                             ;; code: STUB,
GenericBinaryOp, minor: 144
0xf5aa9458   120  e12fff3c       blx ip
0xf5aa945c   124  e50b0010       str r0, [fp, #-16]
0xf5aa9460   128  e51b0018       ldr r0, [fp, #-24]
0xf5aa9464   132  e52d0004       str r0, [sp, #-4]!
0xf5aa9468   136  e59b000c       ldr r0, [fp, #+12]
0xf5aa946c   140  e49d1004       ldr r1, [sp], #+4
0xf5aa9470 144 e59fc070 ldr ip, [pc, #+112] ;; debug: statement
227
                                                             ;; code: STUB,
GenericBinaryOp, minor: 148
0xf5aa9474   148  e12fff3c       blx ip


WITH THIS OPTIMIZATION:
========================
0xf5ae1434    84  e59b1014       ldr r1, [fp, #+20]
0xf5ae1438    88  e59b0010       ldr r0, [fp, #+16]
0xf5ae143c 92 e59fc078 ldr ip, [pc, #+120] ;; debug: statement
208
                                                             ;; code: STUB,
GenericBinaryOp, minor: 144
0xf5ae1440    96  e12fff3c       blx ip
0xf5ae1444   100  e50b0010       str r0, [fp, #-16]
0xf5ae1448   104  e51b1018       ldr r1, [fp, #-24]
0xf5ae144c   108  e59b000c       ldr r0, [fp, #+12]
0xf5ae1450 112 e59fc068 ldr ip, [pc, #+104] ;; debug: statement
227
                                                             ;; code: STUB,
GenericBinaryOp, minor: 148
0xf5ae1454   116  e12fff3c       blx ip


Example with Compares:
++++++++++++++++++++++

  if(x1 > x2) temp4 = x1;
  else temp4 = x2;


THIS CODE:
=========
0xf5a473c0   160  e59b0014       ldr r0, [fp, #+20]
0xf5a473c4   164  e52d0004       str r0, [sp, #-4]!
0xf5a473c8   168  e59b0010       ldr r0, [fp, #+16]
0xf5a473cc   172  e52d0004       str r0, [sp, #-4]!
0xf5a473d0   176  e49d1004       ldr r1, [sp], #+4
0xf5a473d4   180  e49d0004       ldr r0, [sp], #+4
0xf5a473d8   184  e1802001       orr r2, r0, r1
0xf5a473dc   188  e3120001       tst r2, #1
0xf5a473e0   192  0a000003       beq +20 -> 212  (0xf5a473f4)
0xf5a473e4 196 e59fc068 ldr ip, [pc, #+104] ;; debug: statement
148
                                                             ;; code: STUB,
Compare, minor: 92
0xf5a473e8   200  e12fff3c       blx ip


WITH THIS OPTIMIZATION:
========================
0xf5afdc98   152  e59b0014       ldr r0, [fp, #+20]
0xf5afdc9c   156  e59b1010       ldr r1, [fp, #+16]
0xf5afdca0   160  e1802001       orr r2, r0, r1
0xf5afdca4   164  e3120001       tst r2, #1
0xf5afdca8   168  0a000003       beq +20 -> 188  (0xf5afdcbc)
0xf5afdcac 172 e59fc060 ldr ip, [pc, #+96] ;; debug: statement
148
                                                             ;; code: STUB,
Compare, minor: 44
0xf5afdcb0   176  e12fff3c       blx ip

EXPLANATION OF THE CHANGES:
===========================
This change is implemented at the source of the problem, i.e., where the
redundant push/pop instructions are inserted in the instruction stream, rather than as a post-pass push/pop elimination phase. The current push/pop elimination phase can only delete the push/pop's of one of the input operands for a 2-input operation, also it can't eliminate when push/pop use different registers. Also can the post-pass push/pop elimination phase be functionally correct if there is a control flow merge point in between the push and the pop ? I am not sure how
that condition is checked, or if the design assumptions guarantees that such
situation never arises.

Since, the current implementation is based on the model where few registers (r0,
r1) are pre-assigned for operands to operations with 2 inputs (e.g.,
BinaryOperations) and LoadSlot uses r0 to load a local or parameter and then
push/pop before the actual operation, an initial check is done to see if the
second operand of a 2-input operation meets the criteria for successful
optimization. For a successful optimization, firstly the second operand
shouldn't expand into a sequence of instructions (i.e., it is not an expression
tree), rather should be a single direct access of a local/parameter as
load(fp+offset).
This verification is implemented in a lightweight fashion are done by some probe functions. Also, during actual Visit traversals for generation operand access,
LoadAndSpill should go directly to the VisitSlot and LoadSlot through
VisitProxy, and if in between it goes to any other VisitXYZ() operations the
optimization is not done. So for the first operand a probing is not needed, but is needed and hence done only for the second operand before we actually start
the traversal through LoadAndSpill for the first operand.


Please review this at http://codereview.chromium.org/1340003/show

SVN Base: http://v8.googlecode.com/svn/branches/bleeding_edge/

Affected files:
  M     src/arm/codegen-arm.h
  M     src/arm/codegen-arm.cc
  M     src/arm/virtual-frame-arm.h
  M     src/arm/virtual-frame-arm.cc


--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev

Reply via email to