Hello everyone,
I have been studying the machine code that V8 generates for WebAssembly. I
took the following function, *kernel_gemm*, as an example:
#define NI 1000
#define NJ 1100
#define NK 1200
void kernel_gemm(
    int C[NI][NJ],
    int A[NI][NK],
    int B[NK][NJ])
{
    int i, j, k;
    for (i = 0; i < NI; i++) {
        for (k = 0; k < NK; k++) {
            for (j = 0; j < NJ; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
The above file is compiled to Wasm using the latest emsdk, based on
clang/LLVM 6.0, and is executed by V8. After studying the machine code that
V8 generates for this function, I found that there are extra stack loads:
movq %rax, -0x10(%rsp)
movq %rdx, -0x18(%rsp)
xorq %rsi,%rsi
movq $0, %rdi
nop
L1:
imull $0x12c0, %esi, %r8d
addq %rdx, %r8
imull $0x1130,%esi,%r9d
addq %rax,%r9
xorq %r11,%r11
nop
L2:
imull $0x1130,%r11d,%r12d
leaq (%r8,%r11,4),%r14
addq %rcx,%r12
xorq %rbx,%rbx
movl %ebx,%r15d
nop
nop
L3:
leaq 0x1(%r15),%rax
leaq (%r12,%r15,4),%rbx
movl (%rdi,%r14,1),%edx
leaq (%r9,%r15,4),%r15
movl (%rdi,%rbx,1),%ebx
imull %ebx,%edx
movl (%rdi,%r15,1),%ebx
addl %ebx,%edx
movl %ebx,(%rdi,%r15,1)
cmpl $0x44c,%eax
jz L3END
movl %eax,%r15d
jmp L3
L3END:
addl $0x1,%r11d
cmpl $0x4b0,%r11d
jnz L2
addl $0x1,%esi
cmpl $0x3e8,%esi
jz L1END
movq -0x18(%rsp),%rdx
movq -0x10(%rsp),%rax
jmp L1
L1END:
addq $0x20, %rsp
As you can see, *rdx* and *rax* are stored to the stack before *L1* and
reloaded from it on every iteration of the outermost loop (between *L3END*
and *L1END*), because the innermost loop overwrites both registers.
In contrast, clang generates code that performs around 1.3x better than
V8's and has no stack loads of operands. According to the calling convention of V8
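generated code, the arguments are passed in registers *rax*, *rcx*, and
*rdx*. Judging by the row strides they are combined with in the listing
(0x1130 = NJ * 4 for *rax*, 0x12c0 = NK * 4 for *rdx*), *rax* and *rdx*
hold the base addresses of C and A respectively.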
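The immediates in the V8 listing can be matched back to the array dimensions
in the source. The following is only a sketch: it assumes a 4-byte int (as on
wasm32 targets), and the helper names are made up for illustration and do not
come from V8.

```c
#include <stdint.h>

#define NI 1000
#define NJ 1100
#define NK 1200

/* Assumption: int is 4 bytes, as it is for wasm32 targets. */
enum { ELEM_SIZE = 4 };

/* Bytes per row of A[NI][NK]. */
static uint32_t row_bytes_A(void) { return NK * ELEM_SIZE; }

/* Bytes per row of C[NI][NJ] and of B[NK][NJ]. */
static uint32_t row_bytes_C_or_B(void) { return NJ * ELEM_SIZE; }

/* Returns 1 if every immediate in the V8 listing matches the source. */
static int immediates_match(void)
{
    return row_bytes_A() == 0x12c0       /* imull $0x12c0: scaled by i */
        && row_bytes_C_or_B() == 0x1130  /* imull $0x1130: scaled by i and by k */
        && NJ == 0x44c                   /* cmpl $0x44c: j loop bound */
        && NK == 0x4b0                   /* cmpl $0x4b0: k loop bound */
        && NI == 0x3e8;                  /* cmpl $0x3e8: i loop bound */
}
```

The 0x12c0 stride can only belong to A, since A is the only array with NK
columns; the 0x1130 stride is shared by C (indexed by i) and B (indexed by k).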
I have been trying to find out why these extra loads are there. One reason
could be that V8's register allocator is not as good as clang's (which I
guess is fine, because V8 is a JIT and JITs are expected to generate code
faster than AOT compilers). But I suspect there is another reason, such as
support for on-stack replacement or preemption of running code.
It would be really great if anyone could point me to the relevant part of
the V8 source code. I have looked at wasm-compiler.cc but couldn't find
anything.
NOTE: The V8 code above was generated using Node.js v8.11.2 and has been
simplified by replacing absolute addresses with labels. When assembled
using clang (after adjusting for clang's calling convention), it performs
exactly the same as the code V8 generates.
For reference, the assembly code generated by clang is:
xorl %r8d, %r8d
.p2align 4, 0x90
.LBB0_1: # %for.body
# =>This Loop Header: Depth=1
# Child Loop BB0_2 Depth 2
# Child Loop BB0_3 Depth 3
movq %rdx, %r10
xorl %r9d, %r9d
.p2align 4, 0x90
.LBB0_2: # %for.body3
# Parent Loop BB0_1 Depth=1
# => This Loop Header: Depth=2
# Child Loop BB0_3 Depth 3
imulq $4800, %r8, %rax # imm = 0x12C0
addq %rsi, %rax
leaq (%rax,%r9,4), %r11
movq $-1100, %rcx # imm = 0xFBB4
.p2align 4, 0x90
.LBB0_3: # %for.body6
# Parent Loop BB0_1 Depth=1
# Parent Loop BB0_2 Depth=2
# => This Inner Loop Header: Depth=3
movl (%r11), %eax
movl 4400(%r10,%rcx,4), %ebx
imull %eax, %ebx
movl 4400(%rdi,%rcx,4), %eax
addl %ebx, %eax
movl %eax, 4400(%rdi,%rcx,4)
addq $1, %rcx
jne .LBB0_3
# %bb.4: # %for.inc17
# in Loop: Header=BB0_2 Depth=2
addq $1, %r9
addq $4400, %r10 # imm = 0x1130
cmpq $1200, %r9 # imm = 0x4B0
jne .LBB0_2
# %bb.5: # %for.inc20
# in Loop: Header=BB0_1 Depth=1
addq $1, %r8
addq $4400, %rdi # imm = 0x1130
cmpq $1000, %r8 # imm = 0x3E8
jne .LBB0_1
# %bb.6: # %for.end22
popq %rbx
retq
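To make the shape of the clang loop easier to compare with the V8 one, here
is a rough C rendering of it. This is only a sketch of how I read the listing:
the function names are invented, and the dimensions are passed as parameters
(rather than the NI/NJ/NK macros) purely so that it can be tried on small
inputs. The inner index runs from -NJ up to 0 against end-of-row pointers, so
the loop exits on a plain zero test and no base pointer has to be recomputed
or reloaded inside the loop nest.

```c
#include <stddef.h>

/* Reference loop nest, same order (i, k, j) as kernel_gemm, on flat arrays. */
static void gemm_ref(size_t ni, size_t nj, size_t nk,
                     int *C, const int *A, const int *B)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t k = 0; k < nk; k++)
            for (size_t j = 0; j < nj; j++)
                C[i * nj + j] += A[i * nk + k] * B[k * nj + j];
}

/* Same computation, arranged like the clang listing. */
static void gemm_clang_shape(size_t ni, size_t nj, size_t nk,
                             int *C, const int *A, const int *B)
{
    for (size_t i = 0; i < ni; i++) {
        /* One past the end of row i of C (the 4400(%rdi,...) operand);
           clang bumps this pointer by one row instead of multiplying. */
        int *c_end = C + (i + 1) * nj;
        for (size_t k = 0; k < nk; k++) {
            const int *a = &A[i * nk + k];           /* address kept in r11 */
            const int *b_end = B + (k + 1) * nj;     /* 4400(%r10,...) */
            /* j counts from -nj up to 0, like rcx in the listing. */
            for (ptrdiff_t j = -(ptrdiff_t)nj; j != 0; j++)
                c_end[j] += *a * b_end[j];
        }
    }
}

/* Returns 1 if both versions agree on a small 2x3x4 case. */
static int shapes_agree(void)
{
    int A[2 * 4], B[4 * 3], C1[2 * 3] = {0}, C2[2 * 3] = {0};
    for (int n = 0; n < 8; n++) A[n] = n + 1;
    for (int n = 0; n < 12; n++) B[n] = 2 * n - 5;
    gemm_ref(2, 3, 4, C1, A, B);
    gemm_clang_shape(2, 3, 4, C2, A, B);
    for (int n = 0; n < 6; n++)
        if (C1[n] != C2[n]) return 0;
    return 1;
}
```

Note that *a is still loaded on every inner iteration, as in the listing,
since the compiler cannot assume that C does not alias A.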
Thank you,
--
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev