Re: Project to improve hs_err files

Volker Simonis Fri, 06 Sep 2013 05:53:02 -0700

Hi Mattis,

just some quick comments:


On Fri, Sep 6, 2013 at 1:32 PM, Mattis Castegren
<[email protected]> wrote:
>
> Hi (re-sending mail after joining the mailing lists, sorry if you get this 
> mail twice)
>
>
>
> My name is Mattis and I work with the JVM sustaining engineering team at 
> Oracle. I am starting up a project to improve the data we get in the hs_err 
> files when the JVM crashes. I have filed a JEP, but it has not yet been 
> approved. See my attachment for the initial draft including motivation and 
> scope.

There is already a similar JEP: JEP 146: Improve Fatal Error Logs
(http://openjdk.java.net/jeps/146)
Are they somehow related? Maybe the efforts should be combined?

>
> The main goal is not to completely solve new bugs by just using an hs_err 
> file, but to help users, support and development to debug their problems, 
> find duplicates of fixed bugs or application errors. It is also to provide 
> more information that can be helpful when doing core file debugging on new 
> issues.
>
>
>
> The first step in this project is to gather suggestions of data that could 
> help us when we see crashes. I am talking to the rest of the sustaining 
> engineering team and also to the Java support team, but I also wanted to ask 
> if anyone on these aliases has any thoughts on what data would help when we 
> get an hs_err file. I’m looking for both big and small suggestions. Deciding 
> if the suggestions are feasible or not can be discussed later.
>
> Suggestions so far:
>
>
>
> * Bigger changes
>
> - Re-structure hs_err file to put more important data first, maybe include a 
> summary header. End users can’t be expected to read through the entire hs_err 
> file. If we can put important hints of what went wrong at the top, that could 
> save a lot of time. Also, many web tools truncate hs_err files, so we may 
> never see the end of the files. This would also help us triage incoming 
> incidents faster
>
> - Look at context sensitive data. If we crash when compiling a method, what 
> additional data could we provide? Can we provide anything when crashing in 
> GC, or when running interpreted code?
>
> - Could we verify data structures? If we could list that some GC table had 
> been corrupted, that could give a hint at the problem as well as help with 
> finding duplicates and known issues
>
> - Suggest workarounds/first debug steps. Sometimes we immediately know what 
> the first debug step is. If we crash when running a compiled method, try to 
> disable compilation of that method. If we crash after several OOMs, try 
> increasing the Java heap or lowering heap usage. If we could print these first 
> steps, this could lead to bug filers providing more data when they file a 
> bug. We could also highlight "dangerous" options, like -Xverify:none
>
>

- Catch crashes in the compiler and recompile the same method with
full debug output turned on (i.e. dump the graphs of every
optimization step until the crash)
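
Regarding the highlighting of "dangerous" options such as -Xverify:none:
until the VM prints such a warning itself, a post-processing step could
approximate it. The following is only a minimal sketch under stated
assumptions: it assumes the hs_err file contains a line starting with
"jvm_args:" that lists the VM options separated by whitespace, the class
and method names are made up for illustration, and the list of flagged
options is an example rather than an agreed-upon set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any JDK tool.
final class RiskyFlagScanner {
    // Illustrative list only. -Xverify:none is the example from this thread;
    // -noverify is included on the assumption that it should be treated alike.
    private static final List<String> RISKY =
        Arrays.asList("-Xverify:none", "-noverify");

    // Returns the flagged options found on any "jvm_args:" line.
    // Splitting on whitespace is a simplification: option values that
    // themselves contain spaces are not handled.
    static List<String> findRisky(List<String> hsErrLines) {
        List<String> hits = new ArrayList<>();
        for (String line : hsErrLines) {
            if (!line.startsWith("jvm_args:")) {
                continue;
            }
            String args = line.substring("jvm_args:".length()).trim();
            for (String arg : args.split("\\s+")) {
                if (RISKY.contains(arg)) {
                    hits.add(arg);
                }
            }
        }
        return hits;
    }
}
```

A tool built on this could print the flagged options at the top of its
report, which is roughly the effect the summary header suggestion is after.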

>
> * Additional Data
>
> - The GC Strategy used
>
> - The classpath variable
>
> - Have we seen any OutOfMemoryErrors, StackOverflowErrors or C Heap 
> allocation failures?
>
> - Include Hypervisor information. If we run on Xen or VMware, that is very 
> interesting data
>
> - Heap usage and permgen usage (JDK7) after the last full GC. If the heap is 
> 99% full, that is a good hint that we might have run into a corner case in 
> OOM handling for example
>
> - Write the names of the last exceptions thrown. We currently list the 
> addresses of the last exceptions. Giving the type instead would be very good. 
> Did we crash after 10 StackOverflowErrors? That’s a good hint of what went 
> wrong
>
> - Make the GC Heap History easier to read. It’s a great feature, but it is 
> somewhat hard to read if an event is a full GC or a YC etc.
>
> - Assembler instructions in dump files. Right now, we print the code around 
> the IP in hex format. This isn’t very helpful. If we could get the crashing 
> instructions as assembler instead, that would help a lot
>

This can easily be done with the hsdis library. Unfortunately, the
hsdis library cannot be bundled with a commercial JDK because it is
based on GNU binutils, which is GPL-only.

But we could do two things:
 - provide hsdis as a separate download (but this wouldn't help after a crash)
 - provide a simple tool based on hsdis which can post-process the
hs_err-file and translate the hex-dump into readable assembler.
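
To make the second option a bit more concrete, here is a minimal sketch
of the first stage of such a post-processing tool: pulling the raw bytes
out of the "Instructions:" section so that they could be handed to a
disassembler. It assumes dump lines of the form
"0xADDRESS:   48 8b 05 ..." with whitespace-separated byte pairs directly
after the "Instructions:" header. The class and method names are made up
for illustration, and the actual call into hsdis is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any JDK tool.
final class HsErrInstructions {
    // Assumed line shape: an address, a colon, then byte pairs
    // separated by whitespace, e.g. "0x00000010:   48 8b 05".
    private static final Pattern DUMP_LINE = Pattern.compile(
        "^0x[0-9a-fA-F]+:\\s+([0-9a-fA-F]{2}(?:\\s+[0-9a-fA-F]{2})*)$");

    // Collects the bytes listed after the "Instructions:" header.
    // The first line that is not a dump line ends the section.
    static byte[] extract(List<String> lines) {
        List<Byte> collected = new ArrayList<>();
        boolean inSection = false;
        for (String line : lines) {
            if (!inSection) {
                inSection = line.startsWith("Instructions:");
                continue;
            }
            Matcher m = DUMP_LINE.matcher(line.trim());
            if (!m.matches()) {
                break;
            }
            for (String pair : m.group(1).split("\\s+")) {
                collected.add((byte) Integer.parseInt(pair, 16));
            }
        }
        byte[] result = new byte[collected.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = collected.get(i);
        }
        return result;
    }
}
```

The resulting byte array, together with the address on the first dump
line, is what a disassembler would need as input; wiring it up to hsdis
is not shown here.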

> - Growing and shrinking of heap. We have seen a few issues when growing or 
> shrinking the Java heap. Saving the last few increases and decreases with a 
> timestamp would help find out if this could be an issue
>
> - Highlight if third party JNI libs have been used
>
>
>
> Please let me know if you have ideas of what information would make hs_err 
> files more useful, and I will add them to my list.
>
>
>
> Kind Regards
>
> /Mattis
