I should add that the operating system is Linux 4.4, running on a 64-bit Intel(R) 
Xeon(R) CPU E5-1650.
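
As a reading aid for the V8 listing quoted below, here is a small sketch of how its immediates map back to the array dimensions. It assumes 4-byte int elements (as on wasm32); the names offset_A, offset_B and offset_C are mine and not anything from V8. Read this way, the base that is added after scaling the outer counter by 0x12c0 (%rdx in the listing) would be a row of A, so the register-to-array mapping may be worth double-checking.

```c
#include <stdint.h>

#define NI 1000
#define NJ 1100
#define NK 1200

/* Bytes per row of each matrix, assuming 4-byte int elements. */
enum {
    ROW_BYTES_A = NK * (int)sizeof(int32_t), /* A[NI][NK] */
    ROW_BYTES_B = NJ * (int)sizeof(int32_t), /* B[NK][NJ] */
    ROW_BYTES_C = NJ * (int)sizeof(int32_t)  /* C[NI][NJ] */
};

/* Immediates copied from the V8 listing, checked at compile time. */
_Static_assert(0x12c0 == ROW_BYTES_A, "0x12c0 is one row of A (4800 bytes)");
_Static_assert(0x1130 == ROW_BYTES_B, "0x1130 is one row of B (4400 bytes)");
_Static_assert(0x1130 == ROW_BYTES_C, "0x1130 is one row of C (4400 bytes)");
_Static_assert(0x44c == NJ, "inner loop bound is NJ");
_Static_assert(0x4b0 == NK, "middle loop bound is NK");
_Static_assert(0x3e8 == NI, "outer loop bound is NI");

/* Byte offsets into linear memory, written the way the listing computes
 * them: base + row * row_bytes + column * 4. */
uint32_t offset_A(uint32_t base, uint32_t i, uint32_t k)
{
    return base + i * 0x12c0u + k * 4u;
}

uint32_t offset_B(uint32_t base, uint32_t k, uint32_t j)
{
    return base + k * 0x1130u + j * 4u;
}

uint32_t offset_C(uint32_t base, uint32_t i, uint32_t j)
{
    return base + i * 0x1130u + j * 4u;
}
```

The three multiplies by 0x12c0 and 0x1130 in the listing correspond to the row terms above; the leaq instructions with scale 4 add the column term.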

On Sunday, September 23, 2018 at 5:49:47 PM UTC-4, Abhinav Jangda wrote:
>
> Hello Everyone,
>
> I have been studying the machine code generated by V8 for WebAssembly. I 
> took the following function *kernel_gemm* as an example:
>
> #   define NI 1000
> #   define NJ 1100
> #   define NK 1200
>
> void kernel_gemm(
>     int C[NI][NJ],
>     int A[NI][NK],
>     int B[NK][NJ])
> {
>   int i, j, k;
>
>   for (i = 0; i < NI; i++) {
>     for (k = 0; k < NK; k++) {
>       for (j = 0; j < NJ; j++) {
>         C[i][j] += A[i][k] * B[k][j];
>       }
>     }
>   }
> }
>
> The above file is compiled to WASM using the latest emsdk, based on 
> clang/llvm-6.0, and is executed by V8. After studying the machine code that 
> V8 generated for the above function, I found that there are extra stack loads:
>
>         movq %rax, -0x10(%rsp)
>         movq %rdx, -0x18(%rsp)
>         xorq %rsi,%rsi
>         movq $0, %rdi
>         nop
> L1:
>         imull $0x12c0, %esi, %r8d
>         addq %rdx, %r8
>         imull $0x1130,%esi,%r9d
>         addq %rax,%r9
>         xorq %r11,%r11
>         nop
> L2:
>         imull $0x1130,%r11d,%r12d
>         leaq (%r8,%r11,4),%r14
>         addq %rcx,%r12
>         xorq %rbx,%rbx
>         movl %ebx,%r15d
>         nop
>         nop
> L3:
>         leaq 0x1(%r15),%rax
>         leaq (%r12,%r15,4),%rbx
>         movl (%rdi,%r14,1),%edx
>         leaq (%r9,%r15,4),%r15
>         movl (%rdi,%rbx,1),%ebx
>         imull %ebx,%edx
>         movl (%rdi,%r15,1),%ebx
>         addl %ebx,%edx
>         movl %ebx,(%rdi,%r15,1)
>
>         cmpl $0x44c,%eax
>         jz L3END
>         movl %eax,%r15d
>         jmp L3 
> L3END:
>         addl $0x1,%r11d
>         cmpl $0x4b0,%r11d
>         jnz L2
>         addl $0x1,%esi
>         cmpl $0x3e8,%esi
>         jz L1END 
>         movq -0x18(%rsp),%rdx
>         movq -0x10(%rsp),%rax
>         jmp L1 
> L1END:
>         addq $0x20, %rsp
>
> As you can see, there are extra stack loads for the *rdx* and *rax* 
> registers in every iteration of the outermost loop (between *L3END* and 
> *L1END*). Clang, however, generates code that performs around 1.3x better 
> than V8's and has no stack loads of operands. According to the calling 
> convention of V8 generated code, the arguments are passed in registers 
> *rax*, *rcx*, *rdx*. Hence, *rdx*, and *rax* are for variables B and C 
> respectively.
> I have been trying to find out why there are extra loads. One reason 
> could be that V8's register allocator is not as good as clang's (which I 
> guess is fine, because V8 is a JIT and JITs are supposed to generate code 
> faster than AOT compilers). But I think there may be another reason, 
> such as On-Stack Replacement or preemption of code.
>
> It would be really great if anyone could point me to the relevant part of 
> the V8 source code. I have looked at wasm-compiler.cc but couldn't find 
> anything.
>
> NOTE: The V8 generated code was produced using Node.js v8.11.2 and has been 
> converted to a simpler format by replacing absolute addresses in the code 
> with labels. The above code, when assembled using clang (after taking care 
> of clang's calling conventions), performs exactly the same as the V8 
> generated code.
>
> For reference, the clang generated assembly code is:
>
>     xorl    %r8d, %r8d
>     .p2align    4, 0x90
> .LBB0_1:                                # %for.body
>                                         # =>This Loop Header: Depth=1
>                                         #     Child Loop BB0_2 Depth 2
>                                         #       Child Loop BB0_3 Depth 3
>     movq    %rdx, %r10
>     xorl    %r9d, %r9d
>     .p2align    4, 0x90
> .LBB0_2:                                # %for.body3
>                                         #   Parent Loop BB0_1 Depth=1
>                                         # =>  This Loop Header: Depth=2
>                                         #       Child Loop BB0_3 Depth 3
>     imulq   $4800, %r8, %rax        # imm = 0x12C0
>     addq    %rsi, %rax
>     leaq    (%rax,%r9,4), %r11
>     movq    $-1100, %rcx            # imm = 0xFBB4
>     .p2align    4, 0x90
> .LBB0_3:                                # %for.body6
>                                         #   Parent Loop BB0_1 Depth=1
>                                         #     Parent Loop BB0_2 Depth=2
>                                         # =>    This Inner Loop Header: Depth=3
>     movl    (%r11), %eax
>     movl    4400(%r10,%rcx,4), %ebx
>     imull   %eax, %ebx
>     movl    4400(%rdi,%rcx,4), %eax
>     addl    %ebx, %eax
>     movl    %eax, 4400(%rdi,%rcx,4)
>     addq    $1, %rcx
>     jne .LBB0_3
> # %bb.4:                                # %for.inc17
>                                         #   in Loop: Header=BB0_2 Depth=2
>     addq    $1, %r9
>     addq    $4400, %r10             # imm = 0x1130
>     cmpq    $1200, %r9              # imm = 0x4B0
>     jne .LBB0_2
> # %bb.5:                                # %for.inc20
>                                         #   in Loop: Header=BB0_1 Depth=1
>     addq    $1, %r8
>     addq    $4400, %rdi             # imm = 0x1130
>     cmpq    $1000, %r8              # imm = 0x3E8
>     jne .LBB0_1
> # %bb.6:                                # %for.end22
>     popq    %rbx
>     retq
>
>
>
>
> Thank You,
>
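
To put the clang listing quoted above into C terms: it appears to keep the current row pointers of C and B in registers, bump them by one row (4400 bytes) per iteration, and run the inner counter from -NJ up to zero so that the loop exit needs no separate compare. A rough sketch of that shape, next to the plain indexed loop, is below. The names gemm_indexed and gemm_bumped and the runtime size parameters are assumptions for illustration; this is my reading of the listing, not clang's actual output.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain indexed form, same i-k-j order as kernel_gemm, with runtime sizes. */
void gemm_indexed(size_t ni, size_t nj, size_t nk,
                  int32_t *C, const int32_t *A, const int32_t *B)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t k = 0; k < nk; k++)
            for (size_t j = 0; j < nj; j++)
                C[i * nj + j] += A[i * nk + k] * B[k * nj + j];
}

/* Pointer-bumping form: row pointers advance by one row per iteration
 * instead of being recomputed with a multiply, and the inner index
 * counts from -nj up to 0, addressing relative to the end of the row. */
void gemm_bumped(size_t ni, size_t nj, size_t nk,
                 int32_t *C, const int32_t *A, const int32_t *B)
{
    int32_t *c_row = C;
    for (size_t i = 0; i < ni; i++, c_row += nj) {
        const int32_t *b_row = B;
        for (size_t k = 0; k < nk; k++, b_row += nj) {
            const int32_t *a = &A[i * nk + k]; /* fixed for the inner loop */
            int32_t *c_end = c_row + nj;       /* one past the row of C */
            const int32_t *b_end = b_row + nj; /* one past the row of B */
            for (ptrdiff_t j = -(ptrdiff_t)nj; j != 0; j++)
                c_end[j] += *a * b_end[j];
        }
    }
}
```

Both functions compute the same result; the second only changes how the addresses are formed, which is the part that differs most visibly between the two listings.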
