Hi Mattis,

There are some interesting ideas here, and generally the more crash information available the better. That said, there are trade-offs in deciding what information belongs in a crash report versus logging/tracing output versus what should be left for post-mortem crash dump analysis.

There are limits on what you can do in the crash reporter because you are executing in the context of a signal handler, so you should only be using async-signal-safe library routines and VM functionality (a rule we already flout, though we usually get away with it). One way around this would be to have a dedicated crash reporter thread sitting in a semaphore wait (posting a semaphore is async-signal-safe), which is woken by the crashing thread once the information specific to the crashing thread (stack etc.) has been recorded.
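
A minimal sketch of that hand-off, assuming POSIX semaphores; the names here (g_report_sem, g_crash_sig, write_full_report, reporter_thread) are invented for illustration and none of this is existing VM code:

  #include <pthread.h>
  #include <semaphore.h>
  #include <signal.h>
  #include <unistd.h>

  static sem_t g_report_sem;                 // posted from the signal handler
  static volatile sig_atomic_t g_crash_sig;  // minimal state handed over

  static void write_full_report() {
    // Runs on the reporter thread, outside signal-handler context, so it has
    // more latitude than the handler (though the process is still crashed).
  }

  static void* reporter_thread(void*) {
    while (sem_wait(&g_report_sem) == -1) { /* retry on EINTR */ }
    write_full_report();
    _exit(2);                                // never return into the crashed VM
    return nullptr;
  }

  static void crash_handler(int sig, siginfo_t*, void*) {
    g_crash_sig = sig;
    // Record crashing-thread-specific data (stack, registers) here using only
    // async-signal-safe calls such as write(2), then hand off:
    sem_post(&g_report_sem);                 // sem_post is async-signal-safe
    for (;;) pause();                        // park; the reporter thread exits
  }

  int main() {
    sem_init(&g_report_sem, 0, 0);
    pthread_t tid;
    pthread_create(&tid, nullptr, reporter_thread, nullptr);

    struct sigaction sa = {};
    sa.sa_sigaction = crash_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
    // ... run the VM ...
  }

(Build with -pthread; a real reporter would of course also cover SIGBUS, SIGILL and friends.)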

That said, the other limitation is that the process has just crashed and you have no idea what memory corruption has occurred, so the more you try to do the more likely you are to hit a secondary problem. That implies emitting each piece of data as soon as we have it, in case we abort completely while trying to get the next piece.
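
For example, something along these lines: write(2) is async-signal-safe and unbuffered, so whatever was written before a secondary fault is already out. The dump_* helpers and section names are hypothetical placeholders:

  #include <unistd.h>

  // sizeof(s) - 1 avoids strlen(), which POSIX does not list as
  // async-signal-safe (string literals only).
  #define EMIT_LITERAL(fd, s) (void)write((fd), (s), sizeof(s) - 1)

  static void write_report_incrementally(int fd) {
    EMIT_LITERAL(fd, "# Crashing thread stack:\n");
    // dump_crashing_stack(fd);   // most valuable data first
    EMIT_LITERAL(fd, "# Registers and code around pc:\n");
    // dump_registers(fd);
    EMIT_LITERAL(fd, "# Heap / GC summary:\n");
    // dump_heap_summary(fd);     // walks of possibly-corrupt structures last
  }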

Cheers,
David

On 6/09/2013 9:32 PM, Mattis Castegren wrote:
Hi (re-sending mail after joining the mailing lists, sorry if you get
this mail twice)

My name is Mattis and I work with the JVM sustaining engineering team at
Oracle. I am starting up a project to improve the data we get in the
hs_err files when the JVM crashes. I have filed a JEP, but it has not
yet been approved. See my attachment for the initial draft including
motivation and scope. The main goal is not to completely solve new bugs
using only an hs_err file, but to help users, support, and development
debug their problems and find duplicates of fixed bugs or application
errors. It is also to provide more information that can be helpful when
doing core file debugging on new issues.

The first step in this project is to gather suggestions of data that
could help us when we see crashes. I am talking to the rest of the
sustaining engineering team and also to the Java support team, but I
also wanted to ask whether anyone on these aliases has any thoughts on what
data would help when we get an hs_err file. I'm looking for both big and
small suggestions. Whether the suggestions are feasible can be discussed
later.

Suggestions so far:

* Bigger changes

- Re-structure the hs_err file to put the most important data first, maybe
including a summary header. End users can't be expected to read through
the entire hs_err file. If we can put important hints about what went wrong
at the top, that could save a lot of time. Also, many web tools truncate
hs_err files, so we may never see the end of the files. This would also
help us triage incoming incidents faster

- Look at context-sensitive data. If we crash when compiling a method,
what additional data could we provide? Can we provide anything when
crashing in GC, or when running interpreted code?

- Could we verify data structures? If we could report that some GC table
had been corrupted, that could give a hint about the problem as well as
help with finding duplicates and known issues

- Suggest workarounds/first debug steps. Sometimes we immediately know
what the first debug step is. If we crash when running a compiled
method, try to disable compilation of that method. If we crash after
several OOMs, try increasing the Java heap or lowering heap usage. If we
could print these first steps (see the sketch after this list), bug filers
might provide more data when they file a bug. We could also highlight
"dangerous" options, like -Xverify:none
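
Purely as an illustration of the kind of mapping I mean (the enum, the method name parameter and the suggested flags are examples, not a decided design):

  #include <cstdio>

  enum class CrashContext { CompiledMethod, GarbageCollection, AfterRepeatedOOM, Unknown };

  // Hypothetical: print a first-debug-step hint based on where we crashed.
  static void print_first_debug_steps(CrashContext ctx, const char* method_name) {
    switch (ctx) {
      case CrashContext::CompiledMethod:
        printf("Hint: try excluding the method from compilation, e.g.\n"
               "  -XX:CompileCommand=exclude,%s\n", method_name);
        break;
      case CrashContext::AfterRepeatedOOM:
        printf("Hint: several OutOfMemoryErrors preceded the crash;\n"
               "try a larger -Xmx or reduce the amount of live data.\n");
        break;
      case CrashContext::GarbageCollection:
        printf("Hint: crash occurred during GC; on a debug build, re-run\n"
               "with -XX:+VerifyBeforeGC and -XX:+VerifyAfterGC.\n");
        break;
      default:
        break;
    }
  }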

* Additional Data

- The GC Strategy used

- The classpath variable

- Have we seen any OutOfMemoryErrors, StackOverflowErrors or C heap
allocation failures?

- Include hypervisor information. If we run on Xen or VMware, that is
very interesting data

- Heap usage and permgen usage (JDK 7) after the last full GC. If the
heap is 99% full, that is a good hint that we might have run into a
corner case in OOM handling, for example

- Write the names of the last exceptions thrown. We currently list the
addresses of the last exceptions; giving the type instead would be much
more useful (see the sketch after this list). Did we crash after 10
StackOverflowErrors? That's a good hint of what went wrong

- Make the GC Heap History easier to read. It's a great feature, but it
is somewhat hard to tell whether an event is a full GC or a young GC (YC), etc.

- Assembler instructions in dump files. Right now, we print the code
around the IP in hex format. This isn’t very helpful. If we could get
the crashing instructions as assembler instead, that would help a lot

- Growing and shrinking of the heap. We have seen a few issues when growing
or shrinking the Java heap. Saving the last few increases and decreases
with a timestamp would help in finding out whether this could be an issue

- Highlight if third-party JNI libraries have been used
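
For the exception-names item above, a minimal sketch of what I have in mind: a small, preallocated ring buffer of recent exception class names that the crash reporter can print. The structure and names are illustrative only, and synchronization on the throw path is elided:

  #include <cstdio>
  #include <cstring>

  struct RecentExceptions {
    static const int kCapacity = 8;
    char names[kCapacity][128];   // fixed-size slots: no allocation at crash time
    int  next  = 0;
    int  count = 0;

    void record(const char* class_name) {          // called on the throw path
      strncpy(names[next], class_name, sizeof(names[next]) - 1);
      names[next][sizeof(names[next]) - 1] = '\0';
      next = (next + 1) % kCapacity;
      if (count < kCapacity) count++;
    }

    void print_to(FILE* out) const {               // called by the crash reporter
      int idx = (next - count + kCapacity) % kCapacity;   // oldest entry first
      for (int i = 0; i < count; i++, idx = (idx + 1) % kCapacity) {
        fprintf(out, "Recent exception: %s\n", names[idx]);
      }
    }
  };

With something like that, the hs_err file could show java.lang.StackOverflowError ten times in a row rather than ten raw addresses.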

Please let me know if you have ideas of what information would make
hs_err files more useful, and I will add them to my list.

Kind Regards

/Mattis
