Hi guys,

Is TurboFan able to eliminate conditional branches by means of 
predication? I have a particular example in mind. Consider the following 
C++ code snippet:
std::size_t count = 0;
for (int32_t *p = begin; p != end; ++p) {
    if (*p < 42)
        ++count;
}

This is a simple loop that counts the number of values less than 42. I 
compiled it to the following WebAssembly code:
(loop $filter_i32
 (block $filter_i32.body
  (if
   (i32.lt_s
    (i32.load
     (local.get $5)
    )
    (i32.const 42)
   )
   (block $filter.accept
    (local.set $3
     (i32.add
      (local.get $3)
      (i32.const 1)
     )
    )
   )
  )
  (local.set $5
   (i32.add
    (local.get $5)
    (i32.const 4)
   )
  )
  (local.set $4
   (i32.add
    (local.get $4)
    (i32.const 1)
   )
  )
  (br_if $filter_i32
   (i32.lt_u
    (local.get $4)
    (global.get $size)
   )
  )
 )
)

$5 is the address of the next i32 value, $3 is the count of values less 
than 42, and $4 is the loop's induction variable, which is compared against 
$size in the loop condition.
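To make the shape of the WebAssembly loop easier to follow, here is a rough 
C++ rendering of it. This is only a sketch: the function name and parameter 
names are hypothetical, the counters are assumed to start at 0, and, like 
the excerpt above, it assumes there is at least one element because the 
condition is tested after the body.

```cpp
#include <cstdint>

// Hypothetical C++ rendering of the WebAssembly loop (names are made up).
// addr  ~ $5: address of the next i32 value
// count ~ $3: number of values less than 42
// i     ~ $4: induction variable, tested at the bottom of the loop
// Assumes size >= 1, since the excerpt shows no guard before the loop.
std::uint32_t filter_i32(const std::int32_t *addr, std::uint32_t size) {
    std::uint32_t count = 0;
    std::uint32_t i = 0;
    do {
        if (*addr < 42)   // (if (i32.lt_s (i32.load $5) (i32.const 42)) ...)
            ++count;      // $3 += 1
        ++addr;           // $5 += 4 bytes
        ++i;              // $4 += 1
    } while (i < size);   // (br_if $filter_i32 (i32.lt_u $4 $size))
    return count;
}
```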

When I execute this WASM code in V8 using TurboFan (Liftoff is disabled) 
and let V8 print the generated assembly, I get the following code for the 
loop:
0x2bdc0466b2f0    30  83c704         addl rdi,0x4
0x2bdc0466b2f3    33  4c8b5e23       REX.W movq r11,[rsi+0x23]
0x2bdc0466b2f7    37  493b23         REX.W cmpq rsp,[r11]
0x2bdc0466b2fa    3a  0f862d000000   jna 0x2bdc0466b32d  <+0x6d>
0x2bdc0466b300    40  448bdf         movl r11,rdi
0x2bdc0466b303    43  4c3bda         REX.W cmpq r11,rdx
0x2bdc0466b306    46  0f835c000000   jnc 0x2bdc0466b368  <+0xa8>
0x2bdc0466b30c    4c  42833c1b2a     cmpl [rbx+r11*1],0x2a
0x2bdc0466b311    51  0f8d04000000   jge 0x2bdc0466b31b  <+0x5b>
0x2bdc0466b317    57  4183c101       addl r9,0x1
0x2bdc0466b31b    5b  4183c001       addl r8,0x1
0x2bdc0466b31f    5f  44394108       cmpl [rcx+0x8],r8
0x2bdc0466b323    63  77cb           ja 0x2bdc0466b2f0  <+0x30>

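The jump at offset 0x3a looks like V8's stack check at the loop header (rsp 
is compared against a limit loaded through rsi); the loop's back edge is 
the jump at 0x63. The jump at 0x46 is V8's out-of-bounds check. The jump at 
0x51 implements the if-statement. If I compile the above C++ code with 
clang -O2, I get the following code for the loop: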
.LBB0_8:                                # =>This Inner Loop Header: Depth=1
xorl %esi, %esi
cmpl %ebp, (%rcx)
setl %sil
addq %rbx, %rsi
xorl %edi, %edi
cmpl %ebp, 4(%rcx)
setl %dil
addq %rsi, %rdi
xorl %esi, %esi
cmpl %ebp, 8(%rcx)
setl %sil
addq %rdi, %rsi
xorl %ebx, %ebx
cmpl %ebp, 12(%rcx)
setl %bl
addq %rsi, %rbx
addq $16, %rcx
addq $-4, %rdx
jne .LBB0_8

The loop has been unrolled by a factor of 4. (I omitted the code that 
handles the remaining size % 4 elements.) Further, the conditional branch 
has been eliminated and replaced by `setl` and `addq`, which is effectively 
an optimized form of predication.
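For reference, the `setl`/`addq` pattern corresponds to the following 
branch-free formulation at the source level. This is a minimal sketch; the 
function name and the threshold parameter are hypothetical. I have not 
verified that a WebAssembly toolchain emits this without an `if`, nor that 
TurboFan keeps it branch-free, so treat both as assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free variant of the counting loop (hypothetical helper).
// The comparison yields 0 or 1, which is added to the counter directly
// instead of guarding an increment with a conditional branch.
std::size_t count_less_than(const std::int32_t *begin,
                            const std::int32_t *end,
                            std::int32_t threshold) {
    std::size_t count = 0;
    for (const std::int32_t *p = begin; p != end; ++p)
        count += static_cast<std::size_t>(*p < threshold);
    return count;
}
```

This only works because the body of the if-statement is a plain increment. 
An arbitrary body cannot be rewritten this way in general.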

When I compare the performance of clang's code to that of TurboFan, clang's 
is around 10x faster. My question is: what can I do to improve the 
performance of that loop? Does TurboFan support loop unrolling or 
converting conditional branches into predicated code? I should add that 
relying on SIMD is not an option, as the body of the if-statement inside 
the loop can be arbitrary and cannot always be vectorized.

Regards,
Immanuel
