On Mon, 10 Jan 2022 05:19:26 GMT, Xin Liu <x...@openjdk.org> wrote:

> There is a handshake protocol between attach and HotSpot. Linux 
> VirtualMachineImpl sends SIGQUIT(3) if the AttachListener has not been 
> initialized. It expects "Signal Dispatcher" to handle SIGBREAK(same as 
> SIGQUIT) and create AttachListener. However, it is possible that attach 
> starts "handshake" before os::initialize_jdk_signal_support() is called. The 
> signal handler which handles SIGQUIT has not been installed. Prior to 
> os::initialize_jdk_signal_support(), universe_init() is called. Its time 
> complexity will be linear to the initial size of heap with 'AlwaysPreTouch'.  
> It takes 20~30 seconds to initialize 128g heap on a server-class host(eg. EC2 
> m4.10xlarge). Many tools such jcmd, jstack etc may force initializing HotSpot 
> quit prematurely. 
> 
> This patch checks '/proc/$pid/stat' SigCgt bitmask to ensure the signal will 
> be caught by the target process before striking it with SIGQUIT.  It will 
> make HotSpot more robust.  The fields of procfs are well 
> [documented](https://www.kernel.org/doc/html/latest/filesystems/proc.html#id10)
>  and have supported since Linux 2.6.30. libattach.so will not the only 
> consumer of it. I see that os_perf_linux.cpp supports it in libjvm.so. 
> 
> 
> Testing 
> 
> Before, this patch, once initialization takes long time, jcmd may quit the 
> java process. 
> 
> $java -Xms64g -XX:+AlwaysPreTouch -Xlog:gc+heap=debug:stderr 
> -XX:ParallelGCThreads=1 &
> [1] 9589
> [0.028s][debug][gc,heap] Minimum heap 68719476736  Initial heap 68719476736  
> Maximum heap 68719476736
> [0.034s][debug][gc,heap] Running G1 PreTouch with 1 workers for 16384 work 
> units pre-touching 68719476736B.
> $jcmd 9589 VM.flags
> 9589:
> [1]  + 9589 quit       java -Xms64g -XX:+AlwaysPreTouch 
> -Xlog:gc+heap=debug:stderr
> java.io.IOException: No such process
>         at jdk.attach/sun.tools.attach.VirtualMachineImpl.sendQuitTo(Native 
> Method)
>         at 
> jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:100)
>         at 
> jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
>         at 
> jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
>         at jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:113)
>         at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:97)
> 
> With this patch, jcmd will timeout but won't disrupt 15274. 
> 
> $ java -Xms64g -XX:+AlwaysPreTouch -XX:ParallelGCThreads=1 &
> [1] 15274
> $ jcmd 15274 VM.flags
> 15274:
> com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file 
> /proc/15274/root/tmp/.java_pid15274: target process 15274 doesn't respond 
> within 10500ms or HotSpot VM not loaded
>         at 
> jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:105)
>         at 
> jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
>         at 
> jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
>         at jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:113)
>         at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:97)

Hi @navyxliu,

nice catch. I can see how this can be annoying.

I propose a simpler and more robust way to fix it though. We (1) set up general 
hotspot signal handling very early, then (2) proceed to initialize the heap - 
which you have shown can take some time - then (3) set up SIGQUIT handling. We 
core if we get a quit signal before (3).

I would add SIGQUIT handling to the general signal handler too, just to cover 
the time frame between (1) and (3). At (3), it would be overwritten, but that 
would be fine. Before (3), we could just ignore SIGQUIT, or print out some 
generic information (I assume thread dumps are not yet possible at this stage).

Since the documented behavior for the JVM is to threaddump on SIGQUIT unless we 
run with -Xrs, I think this is acceptable behavior. Not printing thread dump or 
only printing partial information is preferable to quitting with core.

Then, this would work for any client that sends sigquit to a JVM, not just 
those using the attach framework. And it would work on all Posix platforms, not 
just Linux. And we'd would not have to rely on parsing the proc fs.

Als note that a solution implemented in the client as you did has the 
disadvantage that I need a modern jcmd for it to work. However, I often just 
use whatever jcmd is in the path. Better to handle this in the receiving 
hotspot.

I sketched out a simple patch to test if what I propose can work:


diff --git a/src/hotspot/os/posix/signals_posix.cpp 
b/src/hotspot/os/posix/signals_posix.cpp
index 2c020a79408..3f0dd91e03b 100644
--- a/src/hotspot/os/posix/signals_posix.cpp
+++ b/src/hotspot/os/posix/signals_posix.cpp
@@ -615,6 +615,11 @@ int JVM_HANDLE_XXX_SIGNAL(int sig, siginfo_t* info,
 #endif // ZERO
   }
 
+if (sig == BREAK_SIGNAL) {
+  printf("too early for thread dumps...\n");
+  signal_was_handled = true;
+}
+
   // Ignore SIGPIPE and SIGXFSZ (4229104, 6499219).
   if (!signal_was_handled &&
       (sig == SIGPIPE || sig == SIGXFSZ)) {
@@ -1279,6 +1284,7 @@ void install_signal_handlers() {
   set_signal_handler(SIGFPE);
   PPC64_ONLY(set_signal_handler(SIGTRAP);)
   set_signal_handler(SIGXFSZ);
+  set_signal_handler(BREAK_SIGNAL); // just for initialization phase

It still misses a number of things (I did not check signal mask setup and 
ReduceSignalUsage must be handled too), but it shows the general direction and 
worked as expected.

Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/7003

Reply via email to