[v8-dev] Extra Stack Loads generated in V8

Abhinav Jangda Sun, 23 Sep 2018 14:49:53 -0700

Hello Everyone,

I have been studying the machine code generated by V8 for WebAssembly. I 
took the following function *kernel_gemm* as an example:


#   define NI 1000
#   define NJ 1100
#   define NK 1200

void kernel_gemm(
    int C[NI][NJ],
    int A[NI][NK],
    int B[NK][NJ])
{
  int i, j, k;

  for (i = 0; i < NI; i++) {
    for (k = 0; k < NK; k++) {
      for (j = 0; j < NJ; j++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

The above file is compiled to WASM using the latest emsdk, based on 
clang/llvm-6.0, and is executed by V8. After studying the machine code 
that V8 generates for the above function, I found that there are extra 
stack loads:

        movq %rax, -0x10(%rsp)
        movq %rdx, -0x18(%rsp)
        xorq %rsi,%rsi
        movq $0, %rdi
        nop
L1:
        imull $0x12c0, %esi, %r8d
        addq %rdx, %r8
        imull $0x1130,%esi,%r9d
        addq %rax,%r9
        xorq %r11,%r11
        nop
L2:
        imull $0x1130,%r11d,%r12d
        leaq (%r8,%r11,4),%r14
        addq %rcx,%r12
        xorq %rbx,%rbx
        movl %ebx,%r15d
        nop
        nop
L3:
        leaq 0x1(%r15),%rax
        leaq (%r12,%r15,4),%rbx
        movl (%rdi,%r14,1),%edx
        leaq (%r9,%r15,4),%r15
        movl (%rdi,%rbx,1),%ebx
        imull %ebx,%edx
        movl (%rdi,%r15,1),%ebx
        addl %ebx,%edx
        movl %ebx,(%rdi,%r15,1)

        cmpl $0x44c,%eax
        jz L3END
        movl %eax,%r15d
        jmp L3 
L3END:
        addl $0x1,%r11d
        cmpl $0x4b0,%r11d
        jnz L2
        addl $0x1,%esi
        cmpl $0x3e8,%esi
        jz L1END 
        movq -0x18(%rsp),%rdx
        movq -0x10(%rsp),%rax
        jmp L1 
L1END:
        addq $0x20, %rsp

As you can see, there are extra stack loads for the *rdx* and *rax* 
registers in every iteration of the outermost loop (between *L3END* and 
*L1END*). However, clang generates code that performs around 1.3x better 
than V8's and has no stack loads of operands. According to the calling 
convention of V8-generated code, the arguments are passed in registers 
*rax*, *rcx*, and *rdx*. Hence, *rdx* and *rax* hold A and C respectively 
(at L1, *rdx* is offset by 0x12c0 = NK * 4 bytes per row and *rax* by 
0x1130 = NJ * 4 bytes per row).

I have been trying to find out why these extra loads are there. One reason 
could be that V8's register allocator is not as good as clang's (which I 
guess is fine, because V8 is a JIT and JITs are supposed to generate code 
faster than AOT compilers). But I think there may be another reason, such 
as On-Stack Replacement or preemption of code.

It would be really great if anyone could point me in the right direction in 
the V8 source code. I have looked at wasm-compiler.cc but couldn't find 
anything.

NOTE: The V8 code was generated using Node.js v8.11.2 and has been 
converted to a simpler format by replacing absolute addresses in the code 
with labels. The above code, when assembled using clang (after accounting 
for clang's calling convention), performs exactly the same as the 
V8-generated code.

For reference, the clang-generated assembly code is:

    xorl    %r8d, %r8d
    .p2align    4, 0x90
.LBB0_1:                                # %for.body
                                        # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
                                        #       Child Loop BB0_3 Depth 3
    movq    %rdx, %r10
    xorl    %r9d, %r9d
    .p2align    4, 0x90
.LBB0_2:                                # %for.body3
                                        #   Parent Loop BB0_1 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB0_3 Depth 3
    imulq   $4800, %r8, %rax        # imm = 0x12C0
    addq    %rsi, %rax
    leaq    (%rax,%r9,4), %r11
    movq    $-1100, %rcx            # imm = 0xFBB4
    .p2align    4, 0x90
.LBB0_3:                                # %for.body6
                                        #   Parent Loop BB0_1 Depth=1
                                        #     Parent Loop BB0_2 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
    movl    (%r11), %eax
    movl    4400(%r10,%rcx,4), %ebx
    imull   %eax, %ebx
    movl    4400(%rdi,%rcx,4), %eax
    addl    %ebx, %eax
    movl    %eax, 4400(%rdi,%rcx,4)
    addq    $1, %rcx
    jne .LBB0_3
# %bb.4:                                # %for.inc17
                                        #   in Loop: Header=BB0_2 Depth=2
    addq    $1, %r9
    addq    $4400, %r10             # imm = 0x1130
    cmpq    $1200, %r9              # imm = 0x4B0
    jne .LBB0_2
# %bb.5:                                # %for.inc20
                                        #   in Loop: Header=BB0_1 Depth=1
    addq    $1, %r8
    addq    $4400, %rdi             # imm = 0x1130
    cmpq    $1000, %r8              # imm = 0x3E8
    jne .LBB0_1
# %bb.6:                                # %for.end22
    popq    %rbx
    retq




Thank You,

