Re: RFR (XS) 6471769: Error: assert(_cur_stack_depth == count_frames(), "cur_stack_depth out of sync")

Daniel D. Daugherty Mon, 03 Feb 2014 14:55:10 -0800

On 2/3/14 1:13 PM, [email protected] wrote:

Dan,


Thank you for reviewing!
Some comments are below.

On 2/3/14 7:51 AM, Daniel D. Daugherty wrote:

On 1/31/14 7:58 PM, [email protected] wrote:

Please, review the fix for:
  https://bugs.openjdk.java.net/browse/JDK-6471769


Open webrev:
http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/hotspot/6471769-JVMTI-FRAME/


src/share/vm/prims/jvmtiEnv.cpp
    Thumbs up on the code change.

    I've never been a fan of the:

        if (is_thread_fully_suspended(...))
            do-something-direct
        else
            do-something-via-a-safepoint

    model. We don't have any mechanism in the VM to keep the
    is_thread_fully_suspended() constraint true over the
    execution time of the "do-something-direct" operation so
    we've always been exposed to races.


Thank you for this confirmation.
So that, we are in sync here.

Also, I think this
    model is used for several other JVM/TI calls so you might
    want to visit those also with an eye on safety.

Yes, I'm looking to isolate and double check other places like this inthe code.

Another piece of work is to isolate other bugs on this topic.


    For this specific change, I think the comment would be
    better as:

    // It is only safe to perform the direct operation on the current
    // thread. All other usage needs to use a vm-op for safety.


It is taken, thanks!

In particular, I don't agree with this part of the comment:
// Performance-wise the only important case is when currentthread is
    // the target thread.
I'm not sure how you can state that performance on the currentthread
    is the only important case, but maybe I don't understand what you're
    really trying to say here.


I agree, the comment is not clear.

Probably, the comment had to say: "the only case where we can keepperformance

without loosing safety is the target thread is current ...".
In fact, collecting data at a safepoint is the only way to make it safe.
If the target thread is not current then:

- a) thread has been suspended: already slow path, so thatperformance is not that critical- b) thread has not been suspended: the information we provide istransient,

        safepoint is the only way to get it safe (is is correct?)


Yes, the only safe way to get info for another thread is via
a safepoint operation.

Now for the more general question with respect to this bug: 6471769
I've having trouble connecting this change (which I like) to this
particular bug report.

The crash in the bug's description looks like this:
# Internal Error(/net/prt-solamd64-q1-2/PrtBuildDir/workspace/src/share/vm/prims/jvmtiThreadState.cpp,273), pid=9561, tid=2
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM(20060914135846.dcubed.service_hs_b02_merge.2-debug compiled mode)
#
# Error: assert(_cur_stack_depth == count_frames(),"cur_stack_depthout of sync")
V [libjvm.so+0x1217206];; void VMError::report_and_die()+0x606
V [libjvm.so+0x667841];; void report_assertion_failure(constchar*,int,const char*)+0x61
V [libjvm.so+0xca9557];; int JvmtiThreadState::cur_stack_depth()+0x6e7
V [libjvm.so+0xc1caaa];; voidJvmtiExport::post_method_exit(JavaThread*,methodOop,frame)+0x21baV [libjvm.so+0x8385de];; voidInterpreterRuntime::post_method_exit(JavaThread*)+0x21e
j java.io.ObjectInputStream$BlockDataInputStream.getBlockDataMode()Z+4

In the above crash, the current thread is trying to post a
JVMTI_EVENT_METHOD_EXIT and during the event posting code, it
calls cur_stack_depth() which fails an assert because the
cached stack depth doesn't match the current queried value:

   282  int JvmtiThreadState::cur_stack_depth() {
   283    uint32_t debug_bits = 0;
   284    guarantee(JavaThread::current() == get_thread() ||
285 JvmtiEnv::is_thread_fully_suspended(get_thread(), false,&debug_bits
),
   286      "must be current thread or suspended");
   287
288 if (!is_interp_only_mode() || _cur_stack_depth ==UNKNOWN_STACK_DEPTH)
 {
   289      _cur_stack_depth = count_frames();
   290    } else {
   291      // heavy weight assert
   292      assert(_cur_stack_depth == count_frames(),
   293             "cur_stack_depth out of sync");
   294    }
   295    return _cur_stack_depth;
   296  }


This is not an unsafe query from one thread to another target thread
so I don't see how the proposed change will address this failure mode.
In particular, the assertion is troubling because it tells me that
the cached stack depth for the current thread is either wrong or has
been corrupted.
The above are consequences of unsafe operations we are doing.
Current thread should not calculate frame count of target thread (ifdifferent) other than at a safepoint.
We assume the thread is suspended but, in fact, it was not.
So that we lost safety here and all manifestation we observe are theconsequences.


For at least the stack trace shown in the description of 6471769,
it is not an unsafe case because the thread is operating on itself.
All I'm saying here is that 6471769 is not the right bug to use for
this fix...

This bug covers the frames miscount issue.
In the fastdebug mode we may get the assert as in this bug report:
  https://bugs.openjdk.java.net/browse/JDK-8032223


This bug: JDK-8032223 seems to be a perfect match for code that you
are proposing the change in your current webrev. You have a thread
("JDWP Transport Listener: dt_socket") that is trying to query the
frame count on another thread... That target JavaThread met the
is_thread_fully_suspended() criteria when execution went down the
direct path, but later the target stopped meeting that criteria...

And boom...

There is another potential race that I forgot to check and fix butDavid pointed out.
I will try to cover it in the next version of webrev.


See the separate reply that I sent on that part...

It is still possible there can be other issues here (as you say).
But it is hard to make sure it is the case because it is extremelyhard to reproduce (you know it!).
I'd suggest to resolve it step-by-step:
  - fix the known safety issues
  - watch the nightly if anything else is left, file and fix new bugs


This sounds like a good plan.


To reiterate:

- I like the code change!
- I would like to see the comment tweaked a bit

You should consider going forward with this change using a different
bug ID, perhaps something like:

   JDK-NNNNNNN improve safety of JVM/TI GetFrameCount()


There is already bug on this topic (and there are even more dups):
  https://bugs.openjdk.java.net/browse/JDK-8032223

But I still believe, the issue is the same ...


I don't think the issues in JDK-6471769 and JDK-8032223 are the same;
well not the same for the crash in the description of JDK-6471769. I
didn't look closely at the other crashes...


Dan

Thanks!
Serguei
Dan
Summary:
There is a general issue in the suspend equivalent conditionmechanism:Two subsequent calls to the JvmtiEnv::is_thread_fully_suspended()may return different results:
    - 1-st: true
    - 2-nd: false

  This more generic suspend equivalent issue is covered by another bug:
    https://bugs.openjdk.java.net/browse/JDK-6280037
The bug to fix in this review is a specific manifestation of the6280037in the JVMTI GetFrameCount() that has a big impact on the SQEnightly.
  It is on the Test Stabilization radar (as well as the 6280037).
  There are many tests intermittently failing because of this.
The webrev for review is a one-liner work around the 6280037 forthe GetFrameCount().
  The JVMTI GetFrameCount() spec tells:
"If this function is called for a thread actively executingbytecodes (for example,not the current thread and not suspended), the informationreturned is transient."
So, it is Ok to call the GetFrameCount() for non-suspended targetthreads.To achieve safety, the frame count for non-suspended threads iscalculated at a safepoint.It should be Ok and more safe to do the same for suspended threadsas well.There is no big performance impact because it is already on a slowpath.It is still important to avoid safepointing when the target threadis current.
The bug 6280037 should go out of the Test Stabilization radar(remove the svc-nightly label)
  as the most of the impacted tests are covered by the 6471769.


Testing:
  In progress: nsk.jvmti, nsk.jdi, nsk.jdwp, impacted JTreg tests


Thanks,
Serguei

Re: RFR (XS) 6471769: Error: assert(_cur_stack_depth == count_frames(), "cur_stack_depth out of sync")

Reply via email to