Project to improve hs_err files

Mattis Castegren Fri, 06 Sep 2013 04:37:47 -0700

Hi (re-sending mail after joining the mailing lists, sorry if you get this mail 
twice)


 

My name is Mattis and I work with the JVM sustaining engineering team at 
Oracle. I am starting up a project to improve the data we get in the hs_err 
files when the JVM crashes. I have filed a JEP, but it has not yet been 
approved. See my attachment for the initial draft including motivation and 
scope. The main goal is not to completely solve new bugs by just using an 
hs_err file, but to help users, support and development to debug their 
problems, find duplicates of fixed bugs or application errors. It is also to 
provide more information that can be helpful when doing core file debugging on 
new issues.

 

The first step in this project is to gather suggestions of data that could help 
us when we see crashes. I am talking to the rest of the sustaining engineering 
team and also to the Java support team, but I also wanted to ask if anyone on 
these aliases has any thoughts on what data would help when we get an hs_err 
file. I'm looking for both big and small suggestions. Deciding if the 
suggestions are feasible or not can be discussed later.

Suggestions so far:

 

* Bigger changes

- Re-structure hs_err file to put more important data first, maybe include a 
summary header. End users can't be expected to read through the entire hs_err 
file. If we can put important hints of what went wrong at the top, that could 
save a lot of time. Also, many web tools truncate hs_err files, so we may never 
see the end of the files. This would also help us triage incoming incidents 
faster.

- Look at context-sensitive data. If we crash when compiling a method, what 
additional data could we provide? Can we provide anything when crashing in GC, 
or when running interpreted code?

- Could we verify data structures? If we could list that some GC table had been 
corrupted, that could give a hint at the problem as well as help with finding 
duplicates and known issues

- Suggest workarounds/first debug steps. Sometimes we immediately know what the 
first debug step is. If we crash when running a compiled method, try to disable 
compilation of that method. If we crash after several OOMs, try increasing the 
Java heap or lowering heap usage. If we could print these first steps, this could 
lead to bug filers providing more data when they file a bug. We could also 
highlight "dangerous" options, like -Xverify:none
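
To make the "first debug steps" idea concrete, here is a minimal sketch in 
Python (the real logic would live in the JVM's error reporting code). It only 
encodes the three examples from the bullet above. CrashContext, its fields and 
first_debug_steps are hypothetical names made up for this illustration, and the 
method syntax shown for -XX:CompileCommand=exclude should be checked against 
the JDK version in use.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CrashContext:
    """Hypothetical summary of what the JVM knew at crash time."""
    crashed_in_compiled_method: Optional[str] = None  # e.g. "java/lang/String.indexOf"
    oom_count: int = 0
    jvm_flags: List[str] = field(default_factory=list)


# Options flagged as "dangerous"; only -Xverify:none is named in this mail.
DANGEROUS_FLAGS = {"-Xverify:none"}


def first_debug_steps(ctx: CrashContext) -> List[str]:
    """Return human-readable hints for the top of the hs_err file."""
    hints = []
    if ctx.crashed_in_compiled_method:
        hints.append(
            "Crash in compiled code: try excluding the method with "
            f"-XX:CompileCommand=exclude,{ctx.crashed_in_compiled_method}"
        )
    if ctx.oom_count > 0:
        hints.append(
            f"{ctx.oom_count} OutOfMemoryErrors before the crash: "
            "try a larger Java heap (-Xmx) or lower heap usage"
        )
    for flag in ctx.jvm_flags:
        if flag in DANGEROUS_FLAGS:
            hints.append(f"Running with potentially dangerous option {flag}")
    return hints


if __name__ == "__main__":
    ctx = CrashContext("java/lang/String.indexOf", 3, ["-Xverify:none"])
    for hint in first_debug_steps(ctx):
        print("#", hint)
```

Keeping the rules as a flat list of independent checks means new hints can be 
added without touching the existing ones.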

 

* Additional Data

- The GC Strategy used

- The classpath variable

- Have we seen any OutOfMemoryErrors, StackOverflowErrors or C heap allocation 
failures?

- Include hypervisor information. If we run on Xen or VMware, that is very 
interesting data

- Heap usage and permgen usage (JDK7) after the last full GC. If the heap is 
99% full, that is a good hint that we might have run into a corner case in OOM 
handling for example

- Write the names of the last exceptions thrown. We currently list the 
addresses of the last exceptions. Giving the type instead would be very good. 
Did we crash after 10 StackOverflowErrors? That's a good hint of what went wrong

- Make the GC Heap History easier to read. It's a great feature, but it is 
somewhat hard to tell whether an event is a full GC or a YC etc.

- Assembler instructions in dump files. Right now, we print the code around the 
IP in hex format. This isn't very helpful. If we could get the crashing 
instructions as assembler instead, that would help a lot

- Growing and shrinking of heap. We have seen a few issues when growing or 
shrinking the Java heap. Saving the last few increases and decreases with a 
timestamp would help us find out whether resizing could be an issue

- Highlight if third party JNI libs have been used
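
As an illustration of the "last few increases and decreases with a timestamp" 
item, here is a minimal sketch of a fixed-size event log, written in Python for 
brevity. HeapResizeLog, HeapResize and the output format are hypothetical; the 
point is only that a bounded buffer keeps the cost constant while still 
retaining the most recent resizes for the crash report.

```python
import time
from collections import deque
from typing import NamedTuple


class HeapResize(NamedTuple):
    timestamp: float  # seconds since the epoch
    old_bytes: int
    new_bytes: int


class HeapResizeLog:
    """Keeps only the most recent heap resize events (hypothetical helper)."""

    def __init__(self, capacity=10):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, old_bytes, new_bytes, timestamp=None):
        if timestamp is None:
            timestamp = time.time()
        self._events.append(HeapResize(timestamp, old_bytes, new_bytes))

    def dump(self):
        """Return one line per retained event, oldest first."""
        lines = []
        for e in self._events:
            kind = "grow" if e.new_bytes > e.old_bytes else "shrink"
            lines.append(f"{e.timestamp:.3f} {kind} {e.old_bytes} -> {e.new_bytes}")
        return lines


if __name__ == "__main__":
    log = HeapResizeLog(capacity=2)
    log.record(100, 200)
    log.record(200, 400)
    log.record(400, 300)  # the first event is dropped here
    print("\n".join(log.dump()))
```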

 

Please let me know if you have ideas of what information would make hs_err 
files more useful, and I will add them to my list.

 

Kind Regards

/Mattis

Title: Include more useful information in hs_err files
Author: Mattis Castegren
Organization: Oracle
Owner: Mattis Castegren
Created: 2013/9/27
Type: Feature
State: Draft
Exposure: Open
Component: vm/svc
Scope: Impl
JSR:
RFE:
Discussion: serviceability dash dev at openjdk dot java dot net
Start: 2013/Q2
Depends:
Blocks:
Effort: S
Duration: L
Template: 1.0
Internal-refs:
Reviewed-by:
Endorsed-by:
Funded-by:

Summary
-------
Work to get more useful data in hs_err files, to make it easier for customers, 
support and development to find the reason for crashes without core file 
debugging. Work to point out common issues like OOMs and SOEs, and make it 
easier to verify known issues.

Goals
-----
Many of the Java Incidents and test failures that come in are about crashes, 
and have little more information than an hs_err file. If we can point out 
common problems and include more debug information in the hs_err files, that 
could save a lot of work.

If the header of an hs_err file says that "you have had 1400 OutOfMemoryErrors 
before the crash", the user can start by investigating these issues.
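
A header line like the one quoted above could be produced by counting the 
exceptions seen before the crash by type. The following is a minimal sketch in 
Python, assuming the exception class names are available as a list of strings; 
exception_summary and its parameters are hypothetical names, not an existing 
JVM interface.

```python
from collections import Counter


def exception_summary(thrown, top=3):
    """Build summary lines from a list of exception class names."""
    # Strip the package so "java.lang.OutOfMemoryError" -> "OutOfMemoryError".
    counts = Counter(name.rsplit(".", 1)[-1] for name in thrown)
    return [
        f"You have had {n} {name}{'s' if n != 1 else ''} before the crash"
        for name, n in counts.most_common(top)
    ]


if __name__ == "__main__":
    thrown = ["java.lang.OutOfMemoryError"] * 1400 + ["java.lang.StackOverflowError"]
    for line in exception_summary(thrown):
        print("#", line)
```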

The focus of this project should be both advanced users, like Dev or Sustaining 
Engineering, and regular users who see a crash in some application. The output 
must therefore be easy to understand, yet contain enough detail for advanced 
debugging.

Non-Goals
---------
The goal is not to find new bugs using only an hs_err file. This project may 
help in some cases, but if it's really a new bug we will usually need a core 
file or a reproducer.

Success Metrics
---------------
 
Motivation
----------
We get thousands of Crash Reports against Java. By including better information 
in the hs_err files, we can easily filter out bugs caused by user errors, known 
issues, etc.

Better hs_err files will also allow more bugs to be resolved either by the 
customer or by support, leading to lower costs in both the support organization 
and the development organizations.

Description
-----------
The first part of this project should be to investigate what information would 
help Dev, Sustaining and Support. We should look at solved crashes and see what 
information would have helped. We should also see which crashes turn out to be 
caused by user errors.

This should be done in the JVM, core and client areas, as all teams run 
into crashes. For core and client, these crashes are more often in the native 
libraries.

This investigation should result in several enhancement requests. Each of these 
can then be discussed independently on the OpenJDK aliases.

Alternatives
------------
 
Testing
-------
Testing can be difficult, as we will sometimes add features to make other 
debugging easier. Some of the features may therefore be hard to test if there 
are no known ways to trigger the problem.

We should run any testing we do have for hs_err files to make sure that the new 
features don't interfere with what is already there.

Risks and Assumptions
---------------------
The risks of these features will be small. When they are triggered, the program 
has already crashed. The biggest risk is that the new features may in turn 
contain bugs that could cause the hs_err files to be truncated, giving even 
less information instead of more.
 
Dependencies
------------
 
Impact
------
Low, should only be noticeable when Java crashes.
