Yes, that's clear. I didn't want to propose using "jstack -F" directly. I just wanted to say that it's possible for an external tool to get a "reasonably good" stack trace out of a JVM process at any time, and "jstack -F" can be taken as a template for how to do that.
That said, I still don't know how perf creates stack traces. Does it attach to the process with ptrace, or how else does it inspect the stacks after a performance counter event?

On Fri, Dec 5, 2014 at 8:34 PM, Staffan Larsen <staffan.lar...@oracle.com> wrote:
> Just to note that the implementation of “jstack -F” is not at all suitable
> for profiling since it has a very high overhead (it attaches a debugger to the
> process).
>
> /Staffan
>
>> On 5 dec 2014, at 20:22, Volker Simonis <volker.simo...@gmail.com> wrote:
>>
>> Hi Brendan,
>>
>> I'm still not understanding who is taking the actual stack traces (let
>> alone the symbols) in your examples. Is this done by 'perf' itself
>> based only on the frame pointer?
>>
>> As I wrote before, this is pretty hard to get right for a JVM, but
>> there are good approximations. Have you looked at the 'jstack' tool
>> which is part of the JDK? If you run it on a Java process, it will
>> give you exact stack traces with full inlining information. However
>> this only works at safepoints so it is probably not suitable for
>> profiling with performance counters. But you can also use 'jstack -F
>> -m' which gives you a 'best effort' mixed Java/C++ stacktrace (most
>> of the time even with inlined Java frames). This is probably the best
>> you can get when interrupting a running JVM at an arbitrary point in
>> time. As you mentioned in one of your blogs, the VM can be in the
>> C-Library or even in the kernel at that time which don't preserve the
>> frame pointer either. So it will be already hard to even walk up to
>> the first Java frame.
>>
>> But nevertheless, if the output of 'jstack -F -m' is "good enough" for
>> your purpose, you can implement something similar in 'perf' or a
>> helper library of 'perf' and be happy (I don't actually know how perf
>> takes stack traces but I suppose there may be some kind of callback
>> mechanism for walking unknown frames). This is actually not so hard.
>> I've recently implemented a "print_native_stack()" function within
>> hotspot itself (you can call it for example from gdb during debugging
>> - see http://hg.openjdk.java.net/jdk9/dev/hotspot/rev/86183a940db4).
>> Maybe you could call this function directly from 'perf' if perf
>> attaches with ptrace to the process (I assume it does or how else
>> could it walk the stack)?
>>
>> These were just some random thoughts with the hope that they may be helpful.
>>
>> Regards,
>> Volker
>>
>> PS: by the way - the flame graphs look really impressive and it would
>> be really nice to have something like this for Java.
>>
>>
>> On Thu, Dec 4, 2014 at 11:55 PM, Brendan Gregg
>> <brendan.d.gr...@gmail.com> wrote:
>>> G'Day,
>>>
>>> I've hacked hotspot to return the frame pointer, in part to see what this
>>> involves, and also to have a working prototype for analysis. Along with an
>>> agent to resolve symbols, this has allowed full stack profiling using Linux
>>> perf_events. The following flame graphs show the resulting profiles.
>>>
>>> A mixed mode CPU flame graph of a vert.x benchmark (click to zoom):
>>>
>>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-vertx.svg
>>>
>>> Same thing, but this time disabling inlining, to show more frames:
>>>
>>> http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-flamegraph.svg
>>>
>>> As expected, performance is worse without inlining. You can compare the
>>> flame graphs side by side to see why. Less time spent doing work / I/O!
>>>
>>> https://github.com/brendangregg/Misc/blob/master/java/openjdk8_b132-fp.diff
>>> is my patch, and currently only works for x86-64. It removes RBP from the
>>> register pools, and inserts "mov(rbp, rsp)" into two function prologues. It
>>> is also unsupported: use at your own risk. I'm not a veteran hotspot
>>> engineer, so chances I messed something up are high.
>>>
>>> I'd love to be able to enable frame pointers in Oracle JDK, e.g., with an
>>> -XX:+NoOmitFramePointer option.
>>> It could be put under
>>> -XX:+UnlockDiagnosticVMOptions or -XX:+UnlockExperimentalVMOptions. So long
>>> as we had some way to turn it on. If someone wants to include (improve,
>>> rewrite) my patch, please do.
>>>
>>> I don't have much perf data yet, but on the vert.x microbenchmark it looked
>>> like returning the frame pointer cost 2.6% performance. I hope that's
>>> somewhat worst-case for production workloads. (I was also able to recover
>>> the 2.6% by fine tuning other options, so were this a production change, I'd
>>> be hoping not to regress performance at all.)
>>>
>>> We've discussed this before
>>> (http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-October/thread.html#15939).
>>> The Solaris-assisted approach that Serguei Spitsyn described (JDK-6617153)
>>> should work very well. The JVM can run as-is, full stacks can be generated
>>> on-demand, and symbols should always be correct.
>>>
>>> The frame pointer approach costs a little performance, and only shows
>>> partial stacks after inlining (unless you disable inlining, but that can
>>> cost >40% performance). There is the other issue Volker Simonis mentioned as
>>> well, where some stacks may not be profiled correctly. And, if you are
>>> unlucky, symbols can move during the profile, so any static perf-map-agent
>>> map will translate some incorrectly. (I've considered developing a way to
>>> detect this, and highlight such frames as dubious.)
>>>
>>> At Netflix we are mostly Java on Linux. Switching to Oracle Solaris for this
>>> feature is going to be a tough sell, especially when the value of full stack
>>> profiling isn't widely understood. I personally think it might be a bit
>>> easier if a -XX:+NoOmitFramePointer option existed, so Linux users can try
>>> the feature, then consider the better Solaris version after gaining solid
>>> experience on why it is so important.
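
To make the symbol caveat concrete, here is a minimal sketch of how an address is translated through a perf map file. perf reads /tmp/perf-<pid>.map, where each line is "START SIZE symbolname" with START and SIZE in hex. The function names (parse_perf_map, resolve) and the sample addresses used with them are hypothetical, and this only illustrates the idea; it is not perf's actual lookup code.

```python
import bisect


def parse_perf_map(text):
    """Parse perf map text into a sorted list of (start, size, name).

    Each line is "START SIZE symbolname"; START and SIZE are hex
    numbers without a 0x prefix, and the name is the rest of the line.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start = int(parts[0], 16)
        size = int(parts[1], 16)
        entries.append((start, size, parts[2].strip()))
    entries.sort()
    return entries


def resolve(entries, addr):
    """Return the symbol whose [start, start + size) range holds addr."""
    starts = [entry[0] for entry in entries]
    # Index of the last entry that starts at or before addr.
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, size, name = entries[i]
    return name if addr < start + size else None
```

Since the map is a snapshot taken when the agent ran, an address sampled before or after the JIT moved or recompiled a method resolves to whatever occupied that range at dump time, which is how a frame can end up mislabelled.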
>>>
>>> We recently blogged about the value of stack profiling and flame graphs,
>>> http://techblog.netflix.com/2014/11/nodejs-in-flames.html, although this was
>>> for Node.js, which already has frame pointer support.
>>>
>>> If anyone wants to try generating these mixed mode CPU flame graphs
>>> themselves (in a test environment!), the first step is to compile OpenJDK 8
>>> b132 with the previous patch, and get that running. Also install the
>>> packages for the "perf" command. The remaining steps would be something
>>> like:
>>>
>>> # git clone --depth=1 https://github.com/brendangregg/FlameGraph
>>> # git clone --depth=1 https://github.com/jrudolph/perf-map-agent
>>> # cd perf-map-agent
>>> # export JAVA_HOME=/...
>>> # cmake .
>>> # make
>>> # perf record -F 99 -p `pgrep -n java` -g -- sleep 30
>>> # java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar \
>>>     net.virtualvoid.perf.AttachOnce `pgrep -n java`
>>> # perf script > ../FlameGraph/out.stacks
>>> # cd ../FlameGraph
>>> # ./stackcollapse-perf.pl < out.stacks | \
>>>     ./flamegraph.pl --color=java > out.svg
>>>
>>> Finally, if you are new to CPU flame graphs, see
>>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html .
>>>
>>> Brendan
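
For readers wondering what the collapse step in the pipeline above does, here is a rough sketch of the idea. It assumes a simplified "perf script" layout: an unindented header line per sample, followed by indented "address symbol (dso)" frame lines from leaf to root, with a blank line between samples. The function name fold_stacks is hypothetical, and this is not the logic of stackcollapse-perf.pl, which handles many more cases.

```python
from collections import Counter


def fold_stacks(perf_script_text):
    """Count identical stacks, keyed as "root;...;leaf"."""
    counts = Counter()
    frames = []
    # The extra "" makes sure the final sample is flushed.
    for line in perf_script_text.splitlines() + [""]:
        if not line.strip():
            if frames:
                # perf prints the leaf first; flame graphs want root first.
                counts[";".join(reversed(frames))] += 1
            frames = []
            continue
        if line[0] in " \t":
            parts = line.split()
            # parts[0] is the address and the last token is "(dso)";
            # whatever lies between them is the symbol name.
            symbol = " ".join(parts[1:-1]) if len(parts) >= 3 else parts[-1]
            frames.append(symbol)
        # Unindented lines are sample headers and are ignored here.
    return counts
```

Each key with its count, written as "stack count" on one line, is the kind of folded input that flamegraph.pl turns into an SVG.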