Background

After the Weex project switched from V8 to JSC, it gained an impressive 
performance boost. But the UC Browser project must use V8, because the 
browser is closely bound to that VM. This raises a question: why is JSC 
fast, and can V8 improve?
The First Look

V8 and JSC are CPU-intensive modules. The following diagram shows that most 
of their time is spent on the CPU. The lighter color shows the proportion of 
time not spent on the CPU; a deeper analysis shows that these time fragments 
are preempted by other threads.

<https://lh3.googleusercontent.com/-GT5EFjzGkh4/WWxznPVmWHI/AAAAAAAAIrk/-6FJ1jK2NLgeCnhK2wIjJN_I__p1kvKmACLcBGAs/s1600/333f16ff6564d153066a1b4080f76cd1.png>
Performance Tools

Logging and tracing techniques are not suitable for V8 and JSC, because they 
inflate the measured proportion of any code fragment that executes 
frequently, and frequently executed code is not necessarily slow code. Take 
the Star bytecode handler: it executes very frequently, but each execution 
takes little time. If a bytecode-tracing method were used, the Star bytecode 
handler would show the largest proportion. So I use the ARM PMU to take 
samples and identify the hot spots. The Linux kernel exposes the PMU 
mechanism through the perf_event_open syscall, and the perf utility records 
and reports the PMU samples.
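
For reference, here is a minimal sketch of sitting on top of that syscall 
directly: it opens a CPU-cycles sampling event with call chains for the 
current thread, roughly what perf record arranges under the hood. The 
sampling frequency and flag choices below are illustrative, not the exact 
values perf uses.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;          /* count a hardware PMU event */
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* ... specifically CPU cycles */
    attr.freq = 1;
    attr.sample_freq = 4000;                 /* ~4000 samples per second */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
    attr.exclude_kernel = 1;                 /* user space samples only */

    /* pid = 0: this thread; cpu = -1: any CPU; no group, no flags. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    /* Samples are delivered through an mmap'd ring buffer on fd; the
     * perf utility handles that part and writes them to perf.data. */
    close(fd);
    return 0;
}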
Get Things Ready

The Linux kernel perf implementation requires that the user space program 
use the APCS frame layout 
<https://www.cl.cam.ac.uk/~fms27/teaching/2001-02/arm-project/02-sort/apcs.txt>.
Only C/C++ code can easily use the APCS frame, by adding the compile option 
-mapcs-frame. So the following modification is made to the Linux kernel:

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 1d9f706e8180..be737f015127 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -982,8 +982,7 @@ early_initcall(init_hw_perf_events);
  * This code has been adapted from the ARM OProfile support.
  */
 struct frame_tail {
-   struct frame_tail __user *fp;
-   unsigned long sp;
+   void __user **fp;
    unsigned long lr;
 } __attribute__((packed));

@@ -991,16 +990,22 @@ struct frame_tail {
  * Get the return address for a single stackframe and return a pointer to the
  * next frame tail.
  */
-static struct frame_tail __user *
-user_backtrace(struct frame_tail __user *tail,
-          struct perf_callchain_entry *entry)
+static void __user **
+user_backtrace(void __user **tail,
+          struct perf_callchain_entry *entry,
+           bool* is_thumb)
 {
-   struct frame_tail buftail;
+   struct frame_tail buftail, __user *pbuf_tail;
+
+    if (*is_thumb)
+        pbuf_tail = (struct frame_tail __user*)tail;
+    else
+        pbuf_tail = (struct frame_tail __user*)(tail - 1);

    /* Also check accessibility of one struct frame_tail beyond */
-   if (!access_ok(VERIFY_READ, tail, sizeof(buftail)))
+   if (!access_ok(VERIFY_READ, pbuf_tail, sizeof(buftail)))
        return NULL;
-   if (__copy_from_user_inatomic(&buftail, tail, sizeof(buftail)))
+   if (__copy_from_user_inatomic(&buftail, pbuf_tail, sizeof(buftail)))
        return NULL;

    perf_callchain_store(entry, buftail.lr);
@@ -1009,23 +1014,28 @@ user_backtrace(struct frame_tail __user *tail,
     * Frame pointers should strictly progress back up the stack
     * (towards higher addresses).
     */
-   if (tail + 1 >= buftail.fp)
+   if ((void __user**)(tail + 1) >= buftail.fp)
        return NULL;
-
-   return buftail.fp - 1;
+    if (buftail.lr & 1)
+        *is_thumb = true;
+    else
+        *is_thumb = false;
+   return buftail.fp;
 }

 void
 perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 {
-   struct frame_tail __user *tail;
-
-
-   tail = (struct frame_tail __user *)regs->ARM_fp - 1;
+   void __user **tail;
+    bool is_thumb = thumb_mode(regs);
+
+    if (is_thumb)
+        tail = (void __user**)(regs->ARM_r7);
+    else
+        tail = (void __user **)(regs->ARM_fp);

    while ((entry->nr < PERF_MAX_STACK_DEPTH) &&
           tail && !((unsigned long)tail & 0x3))
-       tail = user_backtrace(tail, entry);
+       tail = user_backtrace(tail, entry, &is_thumb);
 }

 /*

But V8's standard frame is not compatible with the C/C++ frame: V8 defines 
that the fp register points to the start address of the saved fp slot on 
the stack, while C/C++ defines that fp points to the end address of the 
saved fp slot. So V8's JIT code cannot be unwound by the Linux kernel even 
after this modification. I think I can do better by checking whether the 
content that fp points to falls within the range of the thread stack.
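
A minimal sketch of that idea, written as user-space C with hypothetical 
names (the real check would live in the kernel unwinder): a candidate frame 
pointer is followed only if both it and the saved fp it points at fall 
inside the sampled thread's stack range.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical heuristic: accept a candidate frame pointer only if it
 * is aligned, lies inside the thread stack, and the saved fp it points
 * at also lies further up the same stack. This tolerates both the V8
 * layout and the C/C++ layout of the saved-fp slot. */
static bool plausible_frame(uintptr_t fp, uintptr_t stack_low,
                            uintptr_t stack_high)
{
    if (fp & 0x3)                              /* must be word aligned */
        return false;
    if (fp < stack_low || fp + sizeof(uintptr_t) > stack_high)
        return false;                          /* fp itself off-stack */

    uintptr_t saved_fp = *(uintptr_t *)fp;     /* the old fp content */
    /* Frames must progress towards higher addresses. */
    return saved_fp > fp && saved_fp < stack_high;
}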
Sampling

Follow the build instructions for weex-native 
<https://github.com/linzj/weex_native>, download the needed toolchain 
<https://github.com/linzj/weex_native/wiki/Build-Toolchains>, and run the 
build script 
<https://github.com/linzj/weex_native/wiki/The-checkout-and-build-script>. 
Start the playground demo and run perf record -g -t <tid> (with the JS 
thread's tid) to start recording. Then open the list demo page. Wait 
several seconds after rendering stops, then press Ctrl-C to stop recording.
Data Analysis

V8 has 4134 samples, and JSC has 2023 samples.

More samples mean more CPU time occupied.

V8's JIT code occupies 2286 samples, and JSC's occupies only 528: a huge gap.

V8's Ignition interpreter occupies 1220 samples, and JSC's LLInt occupies 
only 10 samples.

V8's LOAD_IC TurboFan builtin and LoadIC_Miss runtime function occupy 258 
samples, and JSC's get_by_id runtime function occupies 46. I can't collect 
JSC's IC samples because JSC's ICs really reside inside the JIT code rather 
than in data, so they are rather scattered. The following is the JIT 
histogram.

<https://lh3.googleusercontent.com/-T2vzCSIxXlk/WWxz_4JT6pI/AAAAAAAAIro/fGNz5DcfoYcQgs3GSv9u5IXw9BOQXfKLQCLcBGAs/s1600/5f9452eb22f83b9ab036d3d74420fadf.png>


Conclusion

V8 spends a lot of time inside Ignition, and its IC is slow. Most of the 
JIT code is generated by TurboFan, and TurboFan itself generates pretty 
high quality code. I made a modification to the LOAD_IC builtin so that it 
calls a runtime function to search for the handler instead of running the 
code TurboFan generates, and it takes 20 ms more time even though the 
runtime function is compiled with -O2. JSC spends much less time in LLInt; 
I think this means a well-crafted JIT code generator can defeat a 
well-crafted bytecode interpreter. A downside of TurboFan is that it takes 
too much time to compile. I changed interrupt_budget from 0x1800 to 0x100, 
but that had little impact on the current execution: because optimization 
takes a very long time to complete, the code replacement happens only after 
the current execution. For 3 consecutive runs of the list demo, 
interrupt_budget 0x1800 takes 875 ms, 866 ms, and 710 ms, while 0x100 takes 
869 ms, 745 ms, and 637 ms. It looks like the third run's improvement 
merely moves forward into the second run. So V8 needs a transitional JIT 
code generator, which takes much less time to compile than TurboFan and 
performs better than Ignition, to fill the gap between Ignition and 
TurboFan.
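
To make the budget mechanism concrete, here is a rough sketch in C of how a 
counter-style interrupt budget triggers tier-up. This is illustrative, not 
V8's actual code: because the optimizing compile finishes long after the 
trigger fires, lowering the budget mostly shifts when later runs benefit.

#include <stdbool.h>

/* Hypothetical per-function profile. interrupt_budget plays the role
 * of V8's flag of the same name (default 0x1800 in this experiment). */
typedef struct {
    int interrupt_budget;        /* e.g. 0x1800, or 0x100 when lowered */
    bool optimization_requested;
} FunctionProfile;

/* Called on back edges and returns. When the budget underflows, an
 * optimized compile is requested; it runs asynchronously, so the
 * current execution keeps running in the unoptimized tier and only a
 * later run picks up the optimized code. */
static void on_back_edge_or_return(FunctionProfile *p)
{
    if (--p->interrupt_budget <= 0 && !p->optimization_requested) {
        p->optimization_requested = true;
        p->interrupt_budget = 0x7fffffff;   /* stop re-triggering */
    }
}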

Another observation is that V8's IC is slow. It uses a lot of ldr 
instructions:

ldr map
ldr weakcell
ldr map instance type
ldr handler
...

On the ARM architecture, the CPU's instruction cache and data cache are 
separate, so this design introduces a lot more data cache misses. JSC, in 
contrast, only touches the instruction cache:

movt structureIDHigh
movw structureIDLow
compare
// handler code resides here.

So data cache misses may be a cause of the slow IC.
