Hi guys,

is TurboFan able to eliminate conditional branches by using some means of predication? I have a particular example in mind. Consider the following C++ code snippet:

    std::size_t count = 0;
    for (int32_t *p = begin; p != end; ++p) {
        if (*p < 42)
            ++count;
    }

A simple loop that counts the number of values less than 42. I compiled this to the following WebAssembly code:

    (loop $filter_i32
      (block $filter_i32.body
        (if
          (i32.lt_s
            (i32.load (local.get $5))
            (i32.const 42))
          (block $filter.accept
            (local.set $3
              (i32.add (local.get $3) (i32.const 1)))))
        (local.set $5
          (i32.add (local.get $5) (i32.const 4)))
        (local.set $4
          (i32.add (local.get $4) (i32.const 1)))
        (br_if $filter_i32
          (i32.lt_u (local.get $4) (global.get $size)))))

$5 is the address of the next i32 value, $3 is the count of values less than 42, and $4 is the induction variable of the loop and used in the loop header.

When I execute this WASM code in V8 using TurboFan (Liftoff is disabled) and let V8 print the produced assembly, I get the following code for the loop:

    0x2bdc0466b2f0    30  83c704           addl rdi,0x4
    0x2bdc0466b2f3    33  4c8b5e23         REX.W movq r11,[rsi+0x23]
    0x2bdc0466b2f7    37  493b23           REX.W cmpq rsp,[r11]
    0x2bdc0466b2fa    3a  0f862d000000     jna 0x2bdc0466b32d  <+0x6d>
    0x2bdc0466b300    40  448bdf           movl r11,rdi
    0x2bdc0466b303    43  4c3bda           REX.W cmpq r11,rdx
    0x2bdc0466b306    46  0f835c000000     jnc 0x2bdc0466b368  <+0xa8>
    0x2bdc0466b30c    4c  42833c1b2a       cmpl [rbx+r11*1],0x2a
    0x2bdc0466b311    51  0f8d04000000     jge 0x2bdc0466b31b  <+0x5b>
    0x2bdc0466b317    57  4183c101         addl r9,0x1
    0x2bdc0466b31b    5b  4183c001         addl r8,0x1
    0x2bdc0466b31f    5f  44394108         cmpl [rcx+0x8],r8
    0x2bdc0466b323    63  77cb             ja 0x2bdc0466b2f0  <+0x30>

The jump at 0x3a implements the loop header. The jump at 0x46 is V8's oob check. The jump at 0x51 implements the if-statement.

If I compile the above C++ code with clang -O2, I get the following code for the loop:

    .LBB0_8:                        # =>This Inner Loop Header: Depth=1
        xorl    %esi, %esi
        cmpl    %ebp, (%rcx)
        setl    %sil
        addq    %rbx, %rsi
        xorl    %edi, %edi
        cmpl    %ebp, 4(%rcx)
        setl    %dil
        addq    %rsi, %rdi
        xorl    %esi, %esi
        cmpl    %ebp, 8(%rcx)
        setl    %sil
        addq    %rdi, %rsi
        xorl    %ebx, %ebx
        cmpl    %ebp, 12(%rcx)
        setl    %bl
        addq    %rsi, %rbx
        addq    $16, %rcx
        addq    $-4, %rdx
        jne     .LBB0_8

The loop has been unrolled 4 times.
(I omitted the code that covers the remainder of size % 4.) Further, the conditional branch has been eliminated and replaced by `setl` and `addq`, which is effectively an optimized form of predication.

When I compare the performance of clang's code to that of TurboFan, clang is around 10x faster.

My question is: what can I do to improve the performance of that loop? Is loop unrolling or conversion of conditional branches supported in TurboFan?

I should add that relying on SIMD is not an option: the body of the if-statement inside the loop can be anything, so enforcing SIMD is not always possible.

Regards,
Immanuel
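
For reference, the `setl`/`addq` pattern in clang's output corresponds to adding the comparison result directly to the counter. Below is a minimal source-level sketch of that predicated form. The function name `count_less_than` and the `bound` parameter are hypothetical and not part of the original code. Whether TurboFan compiles the resulting WebAssembly without a branch is not established here; the sketch only shows the shape of the transformation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): branch-free form of the
// counting loop. The comparison result is added to the counter instead of
// guarding an increment with an if-statement.
std::size_t count_less_than(const std::int32_t *begin, const std::int32_t *end,
                            std::int32_t bound)
{
    std::size_t count = 0;
    for (const std::int32_t *p = begin; p != end; ++p) {
        // (*p < bound) is a bool; converting it yields 0 or 1.
        count += static_cast<std::size_t>(*p < bound);
    }
    return count;
}
```

At the WebAssembly level the same idea applies: `i32.lt_s` already produces an i32 that is 0 or 1, so its result can be fed to `i32.add` without an enclosing `if`. This only works when the guarded body can be expressed as arithmetic on the condition, which is not the case for an arbitrary if-body.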