On Thu, 14 Apr 2022 13:06:58 GMT, Johannes Bechberger <d...@openjdk.java.net> 
wrote:

>> Move the AsyncGetCallTrace method implementation into a separate method and 
>> wrap its call in non-assert compilation mode in `os::ThreadCrashProtection` 
>> like it is done in 
>> [JFR](https://github.com/openjdk/jdk/blob/965ea8d9cd29aee41ba2b1b0b0c67bb67eca22dd/src/hotspot/share/jfr/periodic/sampling/jfrThreadSampler.cpp#L165).
>> This prevents AsyncGetCallTrace from crashing on segmentation faults (but 
>> not on `guarantee`s).
>> 
>> If a crash is observed, then the `num_frames` field of the trace is set to 
>> `ticks_unknown_state` (-7) to signal a state that cannot be properly 
>> handled. `ticks_unknown_state` is currently also used for signaling unknown 
>> thread states but this should not be a problem, as the semantic is the same. 
>> If `num_frames` already has an error code then this error code is not 
>> changed. This helps to distinguish between errors in walking threads in Java 
>> and non-Java mode, as `num_frames` is set there before the walking to the 
>> appropriate error code.
>> 
>> _Thanks for @tstuefe for suggesting this._
>
> Johannes Bechberger has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Handle nested ThreadCrashProtection on POSIX

It could be worth exploring unifying the JFR and AGCT stackwalking mechanism 
then. If the crash handler is good enough for JFR, it should be considered good 
enough for AGCT.

As raised in a separate email exchange, this doesn't however fix all kind of 
crashes, such as calls to `guarantee(...)` (like in `CodeCache::find_blob`) 
since it does not trigger a signal but calls `VMError::report_and_die` directly 
(or equivalent). Since the code of JFR for stackunwinding is pretty much a 
copy/paste of AGCT, it's not clear to me what's the difference is (which locks 
are being held? which other synchronization mechanism?). By unifying, and 
making sure we use JFR's mechanism wholesale in AGCT, we would make sure AGCT 
is at least as stable as JFR. From our production use, JFR is stable (we 
haven't observe consistent crashes like we do with AGCT), so that is to me the 
reference we need to match.

I understand the AGCT2 proposal is unifying + native frames. So we could take 
the unifying and make it part of this enhancement to AGCT and keep the native 
frames for AGCT2.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8225

Reply via email to