On Fri, 11 Mar 2022 16:34:29 GMT, Anton Kozlov <akoz...@openjdk.org> wrote:

> > blocking SIGSEGV and SIGBUS - or other synchronous error signals like 
> > SIGFPE - and then triggering said signal is UB. What happens is 
> > OS-dependent. I saw processes vanishing, or hang, or core. It makes sense, 
> > since what is the kernel supposed to do. It cannot deliver the signal, and 
> > deferring it would require returning to the faulting instruction, that 
> > would just re-fault.
> > For some more details see e.g. 
> > https://bugs.openjdk.java.net/browse/JDK-8252533
> 
> This UB looks reasonable. My point is that a native thread would run fine 
> with SIGSEGV blocked. But then JVM decides it can do SafeFetch, and things 
> get nasty.

Blocking synchronous error signals makes zero sense even for normal programs, 
since you lose the ability to get core dumps. For the JVM in particular, it 
also breaks facilities that rely on these signals, like polling pages or 
dynamically querying CPU capabilities. So a JVM would not even start with 
synchronous error signals blocked.

> 
> > > Is there a crash that is fixed by the change? I just spotted it is an 
> > > enhancement, not a bug. Just trying to understand the problem.
> > 
> > 
> > Yes, this issue is a breakout from 
> > https://bugs.openjdk.java.net/browse/JDK-8282306, where we'd like to use 
> > SafeFetch to make stack walking in AsyncGetCallTrace more robust. AGCT is 
> > called from the signal handler, and it may run in any number of situations 
> > (e.g. in foreign threads, or threads that are in the process of getting 
> > dismantled, etc).
> 
> I mean, some way to verify the issue is fixed, e.g. a test that does not fail 
> anymore.

No, unfortunately no such test exists. Otherwise this regression would have 
been detected right away and we would not need this PR.

We do have a test, though, that exercises SafeFetch during error handling. That 
test can be tweaked for this purpose. So a test does not exist yet, but one can 
easily be written.

> 
> I see AsyncGetCallTrace to assume the JavaThread very soon, or do I look at 
> the wrong place? 
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/prims/forte.cpp#L569
> 
> > Another situation is error handling itself. When writing an hs-err file, we 
> > use SafeFetch to carefully tiptoe around the possibly corrupt VM state. 
> > If the original crash happened in a foreign thread, we still want some of 
> > these reports to work (e.g. dumping register content or printing stacks). 
> > So SafeFetch should be as robust as possible.
> 
> OK, thanks. I think we also handle recursive segfaults recover after 
> interpretation of the corrupted VM state. Otherwise, implementing the 
> printing functions would be too tedious and hard with SafeFetch alone. But I 
> see it's used in printing register content, at least.

Secondary error handling is a very coarse-grained tool. If an error reporting 
step crashes out, we continue with the next step. That has disadvantages, 
though. The total number of retries is very limited. And a faulting error 
reporting step still hurts, because its report is compromised. E.g. if the call 
stack printing crashes out, we have no call stack. This is not an abstract 
problem. It's a very concrete and typical problem.

A large part of my work involves hs-err reports. They are of very high 
importance to us. We (SAP) have invested a lot of time and effort in hardening 
OpenJDK error reporting, and SafeFetch is an important part of that. For 
example, we provided the facility that made SafeFetch usable in signal 
handling. It would be nice if our work was not compromised. Please let us find 
a way forward here.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7727
