Hi Mattis, just some quick comments:
On Fri, Sep 6, 2013 at 1:32 PM, Mattis Castegren <mattis.casteg...@oracle.com> wrote:
>
> Hi (re-sending mail after joining the mailing lists, sorry if you get this
> mail twice)
>
> My name is Mattis and I work with the JVM sustaining engineering team at
> Oracle. I am starting up a project to improve the data we get in the hs_err
> files when the JVM crashes. I have filed a JEP, but it has not yet been
> approved. See my attachment for the initial draft including motivation and
> scope.

There is already a similar JEP:

JEP 146: Improve Fatal Error Logs (http://openjdk.java.net/jeps/146)

Are they somehow related? Maybe the efforts should be combined?

> The main goal is not to completely solve new bugs by just using an hs_err
> file, but to help users, support and development to debug their problems,
> find duplicates of fixed bugs or application errors. It is also to provide
> more information that can be helpful when doing core file debugging on new
> issues.
>
> The first step in this project is to gather suggestions of data that could
> help us when we see crashes. I am talking to the rest of the sustaining
> engineering team and also to the Java support team, but I also wanted to ask
> if anyone on these aliases have any thoughts on what data would help when we
> get an hs_err file. I’m looking for both big and small suggestions. Deciding
> if the suggestions are feasible or not can be discussed later.
>
> Suggestions so far:
>
> * Bigger changes
>
> - Re-structure hs_err file to put more important data first, maybe include a
> summary header. End users can’t be expected to read through the entire hs_err
> file. If we can put important hints of what went wrong at the top, that could
> save a lot of time. Also, many web tools truncate hs_err files, so we may
> never see the end of the files. This would also help us to faster triage
> incoming incidents
>
> - Look at context sensitive data.
> If we crash when compiling a method, what additional data could we provide.
> Can we provide anything when crashing in GC, or when running interpreted
> code?
>
> - Could we verify data structures? If we could list that some GC table had
> been corrupted, that could give a hint at the problem as well as help with
> finding duplicates and known issues
>
> - Suggest workarounds/first debug steps. Sometimes we immediately know what
> the first debug step is. If we crash when running a compiled method, try to
> disable compilation of that method. If we crash after several OOMs, try
> increasing the Java heap or lower heap usage. If we could print these first
> steps, this could lead to bug filers providing more data when they file a
> bug. We could also highlight "dangerous" options, like -Xverify:none

- Catch crashes in the compiler and recompile the same method with full debug
  output turned on (i.e. dump the graphs of every optimization step until the
  crash)

> * Additional Data
>
> - The GC Strategy used
>
> - The classpath variable
>
> - Have we seen any OutOfMemoryErrors, StackOverflowErrors or C Heap
> allocation fails?
>
> - Include Hypervisor information. If we run on Xen or VMWare, that is very
> interesting data
>
> - Heap usage and permgen usage (JDK7) after the last full GC. If the heap is
> 99% full, that is a good hint that we might have run into a corner case in
> OOM handling for example
>
> - Write the names of the last exceptions thrown. We currently list the
> addresses of the last exceptions. Giving the type instead would be very good.
> Did we crash after 10 StackOverflowErrors? That’s a good hint of what went
> wrong
>
> - Make the GC Heap History more easy to read. It’s a great feature, but it is
> somewhat hard to read if an event is a full GC or a YC etc.
>
> - Assembler instructions in dump files. Right now, we print the code around
> the IP in hex format. This isn’t very helpful.
> If we could get the crashing instructions as assembler instead, that would
> help a lot

This can easily be done with the hsdis library. Unfortunately, the hsdis
library cannot be bundled with a commercial JDK because it is based on GNU
binutils, which are GPL-only. But we could do two things:
- provide hsdis as a separate download (but this wouldn't help after a crash)
- provide a simple tool based on hsdis which can post-process the hs_err file
  and translate the hex dump into readable assembler.

> - Growing and shrinking of heap. We have seen a few issues when growing or
> shrinking the java heap. Saving the last few increases and decreases with a
> timestamp would help finding out if this could be an issue
>
> - Highlight if third party JNI libs have been used
>
> Please let me know if you have ideas of what information would make hs_err
> files more useful, and I will add them to my list.
>
> Kind Regards
>
> /Mattis
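
To make the post-processing idea above a bit more concrete, here is a rough Python sketch of the first step such a tool would need: pulling the raw code bytes out of the hex dump so they can be handed to a disassembler. It is only an illustration, not an existing tool. The function name extract_instruction_bytes is made up, and the sketch assumes the dump is introduced by a line starting with "Instructions:" and followed by lines of the form "0x<address>:   <hex bytes>"; the exact layout may differ between JDK versions and platforms.

```python
import re

# Assumed layout (may vary by JDK version/platform):
#   Instructions: (pc=0x...)
#   0x<address>:   <hex bytes, optionally separated by spaces>
_DUMP_LINE = re.compile(r"^(0x[0-9a-fA-F]+):\s+((?:[0-9a-fA-F]{2}\s*)+)$")


def extract_instruction_bytes(hs_err_text):
    """Return (start_address, code_bytes) for the first hex dump found
    after an "Instructions:" line, or None if no such dump is found.

    Hypothetical helper for illustration only.
    """
    start = None
    chunks = []
    in_section = False
    for line in hs_err_text.splitlines():
        if line.startswith("Instructions:"):
            in_section = True
            continue
        if not in_section:
            continue
        match = _DUMP_LINE.match(line.strip())
        if match is None:
            if chunks:
                # First non-dump line after the dump ends the section.
                break
            continue
        if start is None:
            # Address of the first dumped byte.
            start = int(match.group(1), 16)
        chunks.append(bytes.fromhex(match.group(2)))
    if start is None:
        return None
    return start, b"".join(chunks)
```

The returned start address and bytes could then be passed to hsdis or any other disassembler for the crashing platform; that part is left out here because it depends on how the disassembler is packaged.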