On Thu, 14 Apr 2022 14:33:35 GMT, Johannes Bechberger <d...@openjdk.java.net> 
wrote:

> I compared the code of ASGCT and JFR and found only one instance of a method 
> that is used in ASGCT, but not in JFR and which does affect the VM state. It 
> is the `JavaThread::block_if_vm_exited` method which is transitively called, 
> a fix for this is coming. Therefore there is, in my opinion, no reason why we 
> cannot wrap AsyncGetCallTrace in ThreadCrashProtection.

Sorry, you did address none of my arguments. I see a reasoning chain that 
starts by JFR being "good enough" and from there concluding that using it in 
AGCT its good enough too. Well, maybe it was a bad idea in the first place. And 
right now, this code is only used by JFR. Reusing it perpetuates this pattern 
which others will follow. Because its so convenient just to catch all crashes. 

Does the code use Resource Area? Then its not sigjmp safe. Does it do any kind 
of dynamic allocation which could leak? Then its not sigjmp safe. Does it use 
locks? Again, not sigjmp safe. Does it use any kind of RAII objects? You guess 
it, its not sigjmp safe. Does it call libc functions that cause it to change 
state? Probably unsafe too.

And even if you today can prove that the whole code is safe, how do you prevent 
bitrot? These are general purpose hotspot functions. Someone may tomorrow use 
malloc, or RA. How would you know?

The fact that we never saw a problem in JFR does not prove much. A corruption 
may never show symptoms. It may only effect certain platforms, certain 
conditions, certain flukes of the memory layout god. There may be that one 
crash that you really should not have ignored or sigjmp'ed out of, but you 
never saw it in your tests, because it only happens to that one customer. What 
I really worry about is the delayed effect. Hiding the crash may cause delayed 
errors that are really hard to track down. Effectively transforming what would 
be a simple crash with an hs-err file and a core into a maintenance nightmare. 
That is why sigjmp is used very sparingly, and only in very controlled 
environments.

Put it another way: to really prove that what you do is safe, you have to 
answer, for every possible crash location:
1) is the crash itself benign and can be ignored?
2) is it safe to jump out of the crash code back to the entrance of AGCT, or 
would that interruption corrupt the VM or a system library?

My point is, if you can answer these questions, you can fix the crash. 
Therefore it makes no sense to wholesale catch signals. Either the code is safe 
and does not crash, then no need for catch-all. Or it is not, then its 
important to understand where we crash and why. Ignoring unknown crashes is not 
a good idea.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8225

Reply via email to