Re: RFR (XS) fix for a safepoint deadlock (8047720)

David Holmes Sun, 29 Jun 2014 22:43:25 -0700

Hi Dan,

I see this has already gone in but I think it is worth looking closer at this.


On 28/06/2014 2:18 AM, Daniel D. Daugherty wrote:

Greetings,

I have a fix ready for the following bug:

     8047720 Xprof hangs on Solaris
     https://bugs.openjdk.java.net/browse/JDK-8047720

Here is the webrev URL:

http://cr.openjdk.java.net/~dcubed/8047720-webrev/0-jdk9-hs-rt/

This deadlock occurred between the following threads:

     Main thread   - Trying to stop the WatcherThread as part of
                     shutting down the VM; this thread is blocked
                     on the PeriodicTask_lock which keeps it from
                     reaching a safepoint.
     WatcherThread - Requested a VM_ForceSafepoint to complete
                     a JavaThread::java_suspend() call as part
                     of a FlatProfiler record_thread_ticks()
                     call; this thread owns the PeriodicTask_lock
                     since it is processing a periodic task.
     VMThread      - Trying to start a safepoint; this thread is
                     blocked waiting for the Main thread to reach
                     a safepoint.
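
To make the cycle in the three-thread description above explicit, here
is a minimal wait-for-graph sketch. It is a toy model, not HotSpot
code: the names WaitsFor, blocked_on_cycle and reported_deadlock are
hypothetical, and the edges are only a transcription of the
description (Main waits for the lock owned by the WatcherThread, the
WatcherThread waits for the VMThread to finish its synchronous
operation, and the VMThread waits for Main to reach a safepoint).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model: each blocked thread waits on exactly one other thread.
using WaitsFor = std::map<std::string, std::string>;

// Follow the wait-for chain from 'start'. Revisiting a thread means the
// chain runs into a cycle, so 'start' can never make progress.
bool blocked_on_cycle(const WaitsFor& g, const std::string& start) {
    std::set<std::string> seen;
    std::string cur = start;
    while (true) {
        if (!seen.insert(cur).second) return true;   // already visited: cycle
        auto it = g.find(cur);
        if (it == g.end()) return false;             // chain ends at a runnable thread
        cur = it->second;
    }
}

// Edges transcribed from the description of the hang (an assumption of
// this sketch, not output from any tool).
WaitsFor reported_deadlock() {
    return {
        {"Main", "WatcherThread"},      // blocked on PeriodicTask_lock
        {"WatcherThread", "VMThread"},  // waiting for VM_ForceSafepoint to complete
        {"VMThread", "Main"},           // waiting for Main to reach a safepoint
    };
}
```

Removing any one of the three edges breaks the cycle in this model,
which is why all three threads are needed for the hang.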

The PeriodicTask_lock is one of the VM internal locks and is
typically managed using Mutex::_no_safepoint_check_flag to
avoid deadlocks. Yes, the irony is dripping on the floor... :-)

What was overlooked here is that the holder of a lock that is acquired
without safepoint checks must never block at a safepoint whilst holding
that lock. In this case the blocking is indirect, caused by the
synchronous nature of the VM_Operation, rather than a direct result of
"blocking for the safepoint" (which the WatcherThread does not
participate in). I wonder if the WatcherThread should really be using
the async variant of VM_ForceSafepoint here?

The interesting part of this deadlock is that I think that it
is possible for other periodic tasks to hit it. Anything that
causes the WatcherThread to start a safepoint while processing
a periodic task should be susceptible to this race. Think about
the -XX:+DeoptimizeALot option and how it causes VM_Deopt
requests on thread state transitions... Interesting...

I don't think so. You need three threads involved to get the deadlock.
In the current case the main thread's locking of the PeriodicTask_lock
without a safepoint check is what causes the problem - that violates the
rules surrounding use of "no safepoint checks". The other methods that a
JavaThread might call that acquire the PeriodicTask_lock do perform the
safepoint checks, so they wouldn't deadlock. Hence it seems to me that
only WatcherThread::stop can lead to this problem. And as
WatcherThread::stop is only called from before_exit, and that can only
be called once, it seems to me that we could/should actually acquire the
lock with a safepoint check.
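
As a rough illustration of why the safepoint check matters for the main
thread, here is a toy model. It is not HotSpot code: ModelThread,
stalls_safepoint and safepoint_can_complete are hypothetical names, and
the rule encoded is a simplification that assumes a JavaThread blocked
on a lock with a safepoint check is treated as safepoint-safe, while
one blocked without the check is not.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified thread model; not HotSpot's Thread classes.
struct ModelThread {
    bool is_java_thread;        // only JavaThreads take part in safepoints
    bool blocked_on_lock;       // currently waiting to acquire a lock
    bool used_safepoint_check;  // the lock was requested with a safepoint check
};

// Assumed rule: a JavaThread blocked on a lock stalls the safepoint
// only if it asked for the lock without a safepoint check.
bool stalls_safepoint(const ModelThread& t) {
    return t.is_java_thread && t.blocked_on_lock && !t.used_safepoint_check;
}

// The safepoint can complete when no thread stalls it.
bool safepoint_can_complete(const std::vector<ModelThread>& threads) {
    for (const ModelThread& t : threads) {
        if (stalls_safepoint(t)) return false;
    }
    return true;
}
```

In this model, a main thread described as {true, true, false} stalls
the safepoint, while {true, true, true} does not; the WatcherThread is
not a JavaThread, so it never stalls it.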


Cheers,
David


Testing:
     - I found a way to add delays to the right spots in the
       VM to make the deadlock reproduce in just about every
       run of the test associated with the bug. The new
       os::naked_short_sleep() function is your friend. Thanks
       to Fred for adding that! See the bug report for the
       debugging diffs.
     - 72 hours of running the test in the bug report with
       delays enabled for product, fastdebug and jvmg bits
       in parallel on my Solaris X86 server.
     - JPRT test run
     - Aurora Adhoc results are in process; we're having issues
       with both a broken testbase build and infra problems
       with results not being uploaded.
