Maynard,

On 2014-12-08 16:05, Maynard Johnson wrote:
On 12/05/2014 05:09 PM, Brendan Gregg wrote:
G'Day Volker,

On Fri, Dec 5, 2014 at 11:22 AM, Volker Simonis
<volker.simo...@gmail.com> wrote:
Hi Brendan,

I'm still not understanding who is taking the actual stack traces (let
alone the symbols) in your examples. Is this done by 'perf' itself
based only on the frame pointer?

perf is walking the frame pointers.
Volker, to be specific, the perf profiling tool has a user space part and a
kernel space part. The collection of stack traces is done by the kernel.
When a user-specified event (or series of events) occurs, the process
being profiled is interrupted and the sampled information (which can
optionally include a full stack trace) is made available to the user space
perf tool to be saved to a file for future post-profiling processing.
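To make the frame-pointer walk concrete, here is a minimal sketch in Python over simulated stack memory. It is not the kernel's code. It assumes the usual x86-64 layout, where the word at the frame pointer holds the caller's saved frame pointer and the next word holds the return address. The addresses and the name walk_frame_pointers are made up for illustration.

```python
WORD = 8  # bytes per stack slot on a 64-bit machine (assumption)

def walk_frame_pointers(memory, pc, fp, max_depth=64):
    """Follow the saved-frame-pointer chain and return the sampled code addresses.

    memory maps a stack address to the 8-byte value stored there.
    Assumed layout: memory[fp] is the caller's frame pointer and
    memory[fp + WORD] is the return address into the caller.
    """
    trace = [pc]
    for _ in range(max_depth):
        # A frame pointer that does not point at readable stack ends the walk.
        if fp == 0 or fp not in memory or fp + WORD not in memory:
            break
        trace.append(memory[fp + WORD])
        next_fp = memory[fp]
        # The stack grows down, so a caller's frame must sit at a higher address.
        if next_fp <= fp:
            break
        fp = next_fp
    return trace

# Three hypothetical frames; the outermost one has a saved frame pointer of 0.
memory = {
    0x7000: 0x7020, 0x7008: 0x401100,
    0x7020: 0x7040, 0x7028: 0x401200,
    0x7040: 0x0,    0x7048: 0x401300,
}

good = walk_frame_pointers(memory, pc=0x401050, fp=0x7000)
# If compiled code reuses the frame pointer register as a general-purpose
# register, the sampled fp is garbage and the walk stops at the first frame.
broken = walk_frame_pointers(memory, pc=0x401050, fp=0x1234)
```

The second call shows why stacks come out truncated when code does not preserve the frame pointer: only the sampled instruction address survives.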

During the profiling phase, the perf tool collects information about the
profiled process's memory mappings, which allows for address-to-symbol
resolution. It's in the post-profiling phase where the sampled instruction,
along with its associated stack trace, is resolved to the appropriate symbol
(i.e., function/method) in a specific binary file (e.g., library, executable).

And if the VM creates a /tmp/perf-<PID>.map file to save information about
JITed methods, perf's post-profiling tool will find it and use it to
correlate sampled addresses it collected from the VM's executable anonymous
memory mappings to the method names.
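For reference, each line of a perf map file has the form "START SIZE symbolname", with START and SIZE as hex numbers without a 0x prefix. The lookup can be sketched as below. This is a rough Python illustration, not perf's actual implementation, and the sample addresses and method names are invented.

```python
import bisect

def parse_perf_map(text):
    """Parse perf map text into a sorted list of (start, size, name) tuples."""
    entries = []
    for line in text.splitlines():
        # Split on the first two runs of whitespace only, so the symbol
        # name may itself contain spaces.
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries, address):
    """Return the symbol whose [start, start + size) range covers address, or None."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0:
        start, size, name = entries[i]
        if address < start + size:
            return name
    return None

# Hypothetical map contents, as an agent might write to /tmp/perf-<PID>.map.
SAMPLE_MAP = """\
7f3a10000000 40 Interpreter
7f3a10001000 1a0 Lcom/example/Foo;::bar
"""

entries = parse_perf_map(SAMPLE_MAP)
hit = resolve(entries, 0x7f3a10001010)   # falls inside the second range
miss = resolve(entries, 0x7f3a10000040)  # one byte past the first range
```

Note that one range maps to exactly one name, which is why inlined callees do not show up as separate frames with this format.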

I seem to recall reading about perf having support for DWARF debug info.

If the VM (or a JVM/TI agent) could create DWARF debug symbols, could that be used to convey information about inlined functions and to unwind stacks without frame pointers? I realize that emitting DWARF debug symbols for generated code is not a trivial undertaking, but since perf does its sampling in the kernel and we can't disable inlining, that seems to be one of the few ways we can get complete stack traces.

There would be several other advantages to having DWARF symbols for generated code; GDB could use them when debugging the JVM, for example.

An alternate approach could be to extend the information in perf-<PID>.map to have more detailed PC ranges with information about which functions are inlined. A lot of that information is available in the VM but not necessarily exposed via the tool APIs.

/Mikael


-Maynard

A JVMTI agent, perf-map-agent, is providing a map file for symbol
translation under /tmp/perf-PID.map. Linux perf already hunts for such
a file when doing symbol translation.


As I wrote before, this is pretty hard to get right for a JVM, but
there are good approximations. Have you looked at the 'jstack' tool
which is part of the JDK? If you run it on a Java process, it will
give you exact stack traces with full inlining information. However
this only works at safepoints so it is probably not suitable for
profiling with performance counters.

Right, jstack works, and I get full correct stacks. I do really want
to take stacks at any moment: not just CPU samples, but when tracing
kernel TCP events, or PMC cache miss profiling, etc. perf can already
do many advanced tracing and profiling activities. I just needed the
Java stacks for context.

But you can also use 'jstack -F
-m', which gives you a 'best effort' mixed Java/C++ stack trace (most
of the time even with inlined Java frames). This is probably the best
you can get when interrupting a running JVM at an arbitrary point in
time. As you mentioned in one of your blogs, the VM can be in the
C library or even in the kernel at that time, and neither preserves the
frame pointer. So it will already be hard to even walk up to
the first Java frame.

Well, the JVMs I'm looking at are already built with
-fno-omit-frame-pointer (which is good). I edited hotspot to preserve
it as well.

Here's before I changed hotspot:

http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-nofp.svg

Yes, most stacks are clearly broken.

After changing hotspot:

http://www.brendangregg.com/FlameGraphs/cpu-mixedmode-vertx.svg

It's looking pretty good. If you look carefully on the far left and
right, there are 0.8% stacks in read() and write() directly from java,
which may well be broken (unless a java thread is calling these
directly; there could also be some gcc inlining going on). Even if
they are broken, I can see 98% of my profile. Plus, I'd be interested
to know what exactly is reusing the frame pointer, so we could fix
that too.

The Java stacks themselves are also about a third as deep as they
should be, due to inlining.


But nevertheless, if the output of 'jstack -F -m' is "good enough" for
your purpose, you can implement something similar in 'perf' or a
helper library of 'perf' and be happy (I don't actually know how perf
takes stack traces but I suppose there may be some kind of callback
mechanism for walking unknown frames). This is actually not so hard.
I've recently implemented a "print_native_stack()" function within
hotspot itself (you can call it for example from gdb during debugging
- see http://hg.openjdk.java.net/jdk9/dev/hotspot/rev/86183a940db4).
Maybe you could call this function directly from 'perf' if perf
attaches with ptrace to the process (I assume it does or how else
could it walk the stack)?

An OS-cooperative stack walker would be great, and I think the hotspot
team is already doing this for Oracle Solaris. Thanks for the code
too, this is pretty interesting.

jstack -F -m eats 0.5s of CPU for me, so it would need work to make
this into a 99 Hertz-capable profiler. Plus I'd like to pick arbitrary
kernel functions or tracepoints and get Java context from them, too.
Eg, TCP functions, memory allocation, disk I/O, etc.
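To put numbers on that, here is a back-of-the-envelope calculation in Python. The 0.5 s cost and the 99 Hertz rate come from the paragraph above. The 1% overhead target is only an assumed figure for illustration.

```python
def cpu_seconds_per_second(cost_per_sample_s, sample_rate_hz):
    """CPU time consumed by sampling for each second of wall-clock time."""
    return cost_per_sample_s * sample_rate_hz

def per_sample_budget_s(target_overhead, sample_rate_hz):
    """Largest per-sample cost that keeps sampling within target_overhead of one CPU."""
    return target_overhead / sample_rate_hz

# One jstack -F -m call per sample, at the rate mentioned above.
overhead = cpu_seconds_per_second(0.5, 99)

# Assumed target: sampling may use at most 1% of one CPU.
budget = per_sample_budget_s(0.01, 99)
budget_microseconds = budget * 1e6
```

So at 0.5 s per stack, sampling at 99 Hertz would need about 49.5 CPU-seconds per second, while a 1% budget leaves roughly 100 microseconds per sample.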


These were just some random thoughts with the hope that they may be helpful.

Yes, thanks!

Brendan

