Hi guys,
Is TurboFan able to eliminate conditional branches by using some form of
predication? I have a particular example in mind. Consider the following
C++ code snippet:
std::size_t count = 0;
for (int32_t *p = begin; p != end; ++p) {
    if (*p < 42)
        ++count;
}
This is a simple loop that counts the number of values less than 42. I
compiled it to the following WebAssembly code:
(loop $filter_i32
  (block $filter_i32.body
    (if
      (i32.lt_s
        (i32.load
          (local.get $5)
        )
        (i32.const 42)
      )
      (block $filter.accept
        (local.set $3
          (i32.add
            (local.get $3)
            (i32.const 1)
          )
        )
      )
    )
    (local.set $5
      (i32.add
        (local.get $5)
        (i32.const 4)
      )
    )
    (local.set $4
      (i32.add
        (local.get $4)
        (i32.const 1)
      )
    )
    (br_if $filter_i32
      (i32.lt_u
        (local.get $4)
        (global.get $size)
      )
    )
  )
)
Here, $5 holds the address of the next i32 value, $3 is the count of values
less than 42, and $4 is the loop's induction variable, which is compared
against $size in the loop condition.
When I execute this WASM code in V8 using TurboFan (Liftoff is disabled)
and let V8 print the generated assembly, I get the following code for the
loop:
0x2bdc0466b2f0 30 83c704 addl rdi,0x4
0x2bdc0466b2f3 33 4c8b5e23 REX.W movq r11,[rsi+0x23]
0x2bdc0466b2f7 37 493b23 REX.W cmpq rsp,[r11]
0x2bdc0466b2fa 3a 0f862d000000 jna 0x2bdc0466b32d <+0x6d>
0x2bdc0466b300 40 448bdf movl r11,rdi
0x2bdc0466b303 43 4c3bda REX.W cmpq r11,rdx
0x2bdc0466b306 46 0f835c000000 jnc 0x2bdc0466b368 <+0xa8>
0x2bdc0466b30c 4c 42833c1b2a cmpl [rbx+r11*1],0x2a
0x2bdc0466b311 51 0f8d04000000 jge 0x2bdc0466b31b <+0x5b>
0x2bdc0466b317 57 4183c101 addl r9,0x1
0x2bdc0466b31b 5b 4183c001 addl r8,0x1
0x2bdc0466b31f 5f 44394108 cmpl [rcx+0x8],r8
0x2bdc0466b323 63 77cb ja 0x2bdc0466b2f0 <+0x30>
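The jump at offset 0x3a sits in the loop header; since it compares rsp
against a loaded limit, it appears to be V8's stack check rather than the
loop condition itself, which is the `ja` at 0x63. The jump at 0x46 is V8's
out-of-bounds check. The jump at 0x51 implements the if-statement. If I
compile the above C++ code with clang -O2, I get the following code for the
loop: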
.LBB0_8: # =>This Inner Loop Header: Depth=1
xorl %esi, %esi
cmpl %ebp, (%rcx)
setl %sil
addq %rbx, %rsi
xorl %edi, %edi
cmpl %ebp, 4(%rcx)
setl %dil
addq %rsi, %rdi
xorl %esi, %esi
cmpl %ebp, 8(%rcx)
setl %sil
addq %rdi, %rsi
xorl %ebx, %ebx
cmpl %ebp, 12(%rcx)
setl %bl
addq %rsi, %rbx
addq $16, %rcx
addq $-4, %rdx
jne .LBB0_8
The loop has been unrolled 4 times. (I omitted the code that covers the
remainder of size % 4.) Further, the conditional branch has been eliminated
and replaced by `setl` and `addq`, which is effectively an optimized form
of predication.
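To make the transformation concrete, here is a minimal source-level sketch
of what clang effectively does with the branch. The function names
count_branchy and count_branchless are made up for this example, and I have
not verified whether TurboFan emits branch-free code for the second form;
it only illustrates the setl/add idea.

```cpp
#include <cstddef>
#include <cstdint>

// Original form: the increment is guarded by a conditional branch.
std::size_t count_branchy(const std::int32_t *begin, const std::int32_t *end) {
    std::size_t count = 0;
    for (const std::int32_t *p = begin; p != end; ++p) {
        if (*p < 42)
            ++count;
    }
    return count;
}

// Branch-free form: the comparison result (0 or 1) is added directly,
// which maps naturally onto a cmp/setl/add sequence.
std::size_t count_branchless(const std::int32_t *begin, const std::int32_t *end) {
    std::size_t count = 0;
    for (const std::int32_t *p = begin; p != end; ++p)
        count += static_cast<std::size_t>(*p < 42);
    return count;
}
```

Both functions return the same result; only the second expresses the
predicate as data rather than control flow.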
When I compare the performance of clang's code to that of TurboFan, clang's
is around 10x faster. My question is: what can I do to improve the
performance of that loop? Does TurboFan support loop unrolling or the
conversion of conditional branches into branch-free code? I should add that
relying on SIMD is not an option, since the body of the if-statement inside
the loop can be arbitrary and cannot always be vectorized.
Regards,
Immanuel