Background

After the weex project switched from V8 to JSC, it gained an amazing performance boost. But the UC Browser project must keep using V8, because the browser is tightly bound to that VM. This raises a question: why is JSC fast, and can V8 be improved?

The First Look

V8 and JSC are CPU-intensive modules. The following diagram shows that most of their time is spent on the CPU. The lighter color shows the proportion not spent on the CPU; a deeper analysis shows those time fragments were preempted by other threads.

<https://lh3.googleusercontent.com/-GT5EFjzGkh4/WWxznPVmWHI/AAAAAAAAIrk/-6FJ1jK2NLgeCnhK2wIjJN_I__p1kvKmACLcBGAs/s1600/333f16ff6564d153066a1b4080f76cd1.png>

Performance Tools

Logging and tracing techniques are not suitable for V8 and JSC, because they inflate the measured proportion of any code fragment that executes frequently, and frequently executed code is not necessarily slow code. Take Ignition's Star bytecode handler: it executes very frequently, but each execution takes very little time. If bytecode execution were traced, the Star handler would show up with the largest proportion. So instead I use the ARM PMU to take samples and identify the hot spots. The Linux kernel exposes the PMU mechanism through the perf_event_open syscall, and the perf utility records and reports the PMU samples.
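For reference, here is a minimal sketch of opening a PMU cycle sampler through perf_event_open — the mechanism the perf utility is built on. The event choice, sample period, and function name are my own illustration, not code from this investigation:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* perf_event_open has no glibc wrapper, so invoke it via syscall(2). */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
        return (int)syscall(__NR_perf_event_open, attr, pid, cpu,
                            group_fd, flags);
}

int open_cycle_sampler(pid_t pid)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;         /* a real PMU counter   */
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = 100000;            /* sample every 100k cycles */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
        attr.exclude_kernel = 1;                /* user-space time only */

        /* Samples are then read from the fd's mmap'ed ring buffer. */
        return perf_event_open(&attr, pid, -1 /* any CPU */, -1, 0);
}

The PERF_SAMPLE_CALLCHAIN bit is what makes the kernel walk the user stack at every sample, and that stack walk is exactly the part that needs the kernel modification below.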
Get Things Ready

The Linux kernel's perf implementation requires that the user-space program use the APCS frame layout <https://www.cl.cam.ac.uk/~fms27/teaching/2001-02/arm-project/02-sort/apcs.txt>. Only C/C++ code can easily use the APCS frame, by adding the compile option -mapcs-frame. So the following modification was made to the Linux kernel:

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 1d9f706e8180..be737f015127 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -982,8 +982,7 @@ early_initcall(init_hw_perf_events);
  * This code has been adapted from the ARM OProfile support.
  */
 struct frame_tail {
-        struct frame_tail __user *fp;
-        unsigned long sp;
+        void __user **fp;
         unsigned long lr;
 } __attribute__((packed));
 
@@ -991,16 +990,22 @@ struct frame_tail {
  * Get the return address for a single stackframe and return a pointer to the
  * next frame tail.
  */
-static struct frame_tail __user *
-user_backtrace(struct frame_tail __user *tail,
-               struct perf_callchain_entry *entry)
+static void __user **
+user_backtrace(void __user **tail,
+               struct perf_callchain_entry *entry,
+               bool* is_thumb)
 {
-        struct frame_tail buftail;
+        struct frame_tail buftail, __user *pbuf_tail;
+
+        if (*is_thumb)
+                pbuf_tail = (struct frame_tail __user*)tail;
+        else
+                pbuf_tail = (struct frame_tail __user*)(tail - 1);
 
         /* Also check accessibility of one struct frame_tail beyond */
-        if (!access_ok(VERIFY_READ, tail, sizeof(buftail)))
+        if (!access_ok(VERIFY_READ, pbuf_tail, sizeof(buftail)))
                 return NULL;
-        if (__copy_from_user_inatomic(&buftail, tail, sizeof(buftail)))
+        if (__copy_from_user_inatomic(&buftail, pbuf_tail, sizeof(buftail)))
                 return NULL;
 
         perf_callchain_store(entry, buftail.lr);
@@ -1009,23 +1014,28 @@ user_backtrace(struct frame_tail __user *tail,
          * Frame pointers should strictly progress back up the stack
          * (towards higher addresses).
          */
-        if (tail + 1 >= buftail.fp)
+        if ((void __user**)(tail + 1) >= buftail.fp)
                 return NULL;
-
-        return buftail.fp - 1;
+        if (buftail.lr & 1)
+                *is_thumb = true;
+        else
+                *is_thumb = false;
+        return buftail.fp;
 }
 
 void
 perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
 {
-        struct frame_tail __user *tail;
-
-
-        tail = (struct frame_tail __user *)regs->ARM_fp - 1;
+        void __user **tail;
+        bool is_thumb = thumb_mode(regs);
 
+        if (is_thumb)
+                tail = (void __user**)(regs->ARM_r7);
+        else
+                tail = (void __user **)(regs->ARM_fp);
 
         while ((entry->nr < PERF_MAX_STACK_DEPTH) &&
                tail && !((unsigned long)tail & 0x3))
-                tail = user_backtrace(tail, entry);
+                tail = user_backtrace(tail, entry, &is_thumb);
 }
 
 /*

But V8's standard frame is not compatible with the C/C++ frame: V8 defines that the fp register points to the start address of the saved old fp on the stack, while C/C++ defines that fp points to its end address. So V8's JIT code still cannot be unwound by the Linux kernel even after this modification. I think I can do better by checking whether the value fp points to falls within the range of the thread's stack, as sketched below.
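Here is a minimal user-space sketch of that validation heuristic. It assumes the unwinder already knows the sampled thread's stack bounds (in the kernel these would come from the task's memory map); the helper name and types are hypothetical:

#include <stdbool.h>
#include <stdint.h>

/* Accept a candidate frame pointer only if the saved fp it points at
 * lands inside the sampled thread's stack.  This tolerates JIT frames
 * whose fp convention differs from APCS, at the cost of occasionally
 * following a false frame. */
static bool plausible_frame(uintptr_t fp,
                            uintptr_t stack_low, uintptr_t stack_high)
{
        uintptr_t saved_fp;

        if (fp < stack_low || fp + sizeof(saved_fp) > stack_high)
                return false;   /* fp itself is outside the stack */
        if (fp & (sizeof(saved_fp) - 1))
                return false;   /* misaligned, certainly not a frame */

        saved_fp = *(uintptr_t *)fp;    /* in the kernel this would be
                                           __copy_from_user_inatomic() */

        /* The caller's frame must sit above ours and inside the stack. */
        return saved_fp > fp && saved_fp < stack_high;
}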
Sampling

Follow the build instructions of weex-native <https://github.com/linzj/weex_native>, download the needed toolchain <https://github.com/linzj/weex_native/wiki/Build-Toolchains>, and run the build script <https://github.com/linzj/weex_native/wiki/The-checkout-and-build-script>. Start the playground demo and run "perf record -g -t <tid>" to start recording, then open the list demo page. Wait several seconds after rendering stops, then press ctrl-c to stop recording.

Data Analysis

V8 has 4134 samples and JSC has 2023; more samples mean more CPU time occupied. V8's JIT code accounts for 2286 samples while JSC's accounts for 528, a huge gap. V8's Ignition interpreter accounts for 1220 samples while JSC's LLInt accounts for only 10. V8's LOAD_IC TurboFan builtin plus the LoadIC_Miss runtime function account for 258 samples, against 46 for JSC's get_by_id runtime. I can't collect JSC's IC samples directly, because JSC's ICs really live inside the generated code rather than in data, so the samples are rather scattered. The following is the JIT histogram.

<https://lh3.googleusercontent.com/-T2vzCSIxXlk/WWxz_4JT6pI/AAAAAAAAIro/fGNz5DcfoYcQgs3GSv9u5IXw9BOQXfKLQCLcBGAs/s1600/5f9452eb22f83b9ab036d3d74420fadf.png>

Conclusion

V8 spends a lot of time inside Ignition, and its ICs are slow. Most of the JIT code is generated by TurboFan, and TurboFan itself generates pretty high-quality code: I modified the LOAD_IC builtin to call a runtime function that searches for the handler instead of using the code TurboFan generates, and it took 20 ms more even though the runtime function was compiled with -O2. JSC spends much less time in LLInt; I think this means a well-tuned JIT code generator can defeat a well-tuned bytecode interpreter.

A downside of TurboFan is that it takes too much time to compile. I changed interrupt_budget from 0x1800 to 0x100, but that had little impact on the current execution: because optimization takes very long to complete, the code replacement only happens after the current execution. For three consecutive runs of the list demo, the 0x1800 interrupt_budget took 875, 866, and 710 ms, and the 0x100 interrupt_budget took 869, 745, and 637 ms. It looks like the third run's improvement simply moves forward into the second run. So V8 needs a transitional JIT code generator, one that compiles much faster than TurboFan and performs better than Ignition, to fill the gap between them.

Another observation is that V8's IC is slow. It uses a lot of ldr instructions:

  ldr map
  ldr weakcell
  ldr map instance type
  ldr handler
  ...

On the ARM architecture the CPU's instruction cache and data cache are separate, so this design introduces a lot more data cache misses. JSC instead only touches the instruction cache:

  movt structureIDHigh
  movw structureIDLow
  compare
  // handler code resides here.

So data cache misses may be a cause of the slow ICs.
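To make the cache argument concrete, here is a loose C analogy of the two IC styles. The object layout, field names, and constant are hypothetical, and real ICs are JIT-emitted machine code rather than C:

#include <stdint.h>

/* Hypothetical layouts, for illustration only. */
struct Object {
        void     *map;          /* V8-style hidden-class pointer  */
        uint32_t  structure_id; /* JSC-style structure ID         */
        intptr_t  property;     /* the field the IC wants to load */
};

struct FeedbackSlot {           /* V8-style feedback vector slot  */
        void *expected_map;
        int   handler_offset;
};

intptr_t ic_miss(struct Object *obj);   /* slow path, declared only */

/* V8 style: the map, the expected map, and the handler are all data
 * loads (ldr), so every check can take D-cache misses. */
intptr_t load_data_driven(struct Object *obj, struct FeedbackSlot *slot)
{
        if (obj->map == slot->expected_map)     /* ldr; ldr; compare */
                return *(intptr_t *)((char *)obj + slot->handler_offset);
        return ic_miss(obj);
}

/* JSC style: the expected structure ID is an immediate baked into the
 * instruction stream (movw/movt on ARM), so the fast path touches only
 * the I-cache.  A compile-time constant is the closest C analogue of a
 * JIT-patched immediate. */
#define PATCHED_STRUCTURE_ID 0x1234u    /* rewritten in place by the JIT */

intptr_t load_code_patched(struct Object *obj)
{
        if (obj->structure_id == PATCHED_STRUCTURE_ID)  /* movw/movt; compare */
                return obj->property;                   /* inline handler */
        return ic_miss(obj);
}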
