Re: RFR: 8247729: GetObjectMonitorUsage() might return inconsistent information

[email protected] Wed, 17 Jun 2020 08:44:22 -0700

Hi Yasumasa,

This fix is not enough.

The function JvmtiEnvBase::get_object_monitor_usage works in two modes:in VMop and non-VMop.

The non-VMop mode has to be removed.


Thanks,
Serguei


On 6/17/20 02:18, Yasumasa Suenaga wrote:

(Change subject for RFR)

Hi,

I filed it to JBS and upload a webrev for it.
Could you review it?

  JBS: https://bugs.openjdk.java.net/browse/JDK-8247729
  webrev: http://cr.openjdk.java.net/~ysuenaga/JDK-8247729/webrev.00/

This change has passed tests on submit repo.
Also I tested it with serviceability/jvmti and vmTestbase/nsk/jvmti onLinux x64.
Thanks,

Yasumasa


On 2020/06/17 14:37, [email protected] wrote:
Yes. It seems we have a consensus.
Thank you for taking care about it.

Thanks,
Serguei


On 6/16/20 18:34, David Holmes wrote:
Ok, may I file it to JBS and fix it?
Go for it! :)

Cheers,
David

On 17/06/2020 10:23 am, Yasumasa Suenaga wrote:
On 2020/06/17 8:47, [email protected] wrote:
Hi Dan, David and Yasumasa,


On 6/16/20 07:39, Daniel D. Daugherty wrote:
On 6/15/20 9:28 PM, David Holmes wrote:
On 16/06/2020 10:57 am, Daniel D. Daugherty wrote:
On 6/15/20 7:19 PM, David Holmes wrote:
On 16/06/2020 8:40 am, Daniel D. Daugherty wrote:
On 6/15/20 6:14 PM, David Holmes wrote:
Hi Dan,

On 15/06/2020 11:38 pm, Daniel D. Daugherty wrote:
On 6/15/20 3:26 AM, David Holmes wrote:
On 15/06/2020 4:02 pm, Yasumasa Suenaga wrote:
Hi David,

On 2020/06/15 14:15, David Holmes wrote:
Hi Yasumasa,

On 15/06/2020 2:49 pm, Yasumasa Suenaga wrote:
Hi all,
I wonder why JvmtiEnvBase::get_object_monitor_usage()(implementation of GetObjectMonitorUsage()) does notperform at safepoint.
GetObjectMonitorUsage will use a safepoint if the targetis not suspended:
jvmtiError
JvmtiEnv::GetObjectMonitorUsage(jobject object,jvmtiMonitorUsage* info_ptr) {
   JavaThread* calling_thread = JavaThread::current();
jvmtiError err =get_object_monitor_usage(calling_thread, object, info_ptr);
   if (err == JVMTI_ERROR_THREAD_NOT_SUSPENDED) {
// Some of the critical threads were not suspended.go to a safepoint and try again VM_GetObjectMonitorUsage op(this, calling_thread,object, info_ptr);
     VMThread::execute(&op);
     err = op.result();
   }
   return err;
} /* end GetObject */
I saw this code, so I guess there are some cases whenJVMTI_ERROR_THREAD_NOT_SUSPENDED is not returned fromget_object_monitor_usage().
Monitor owner would be acquired from monitor object atfirst [1], but it would perform concurrently.If owner thread is not suspended, the owner might bechanged to others in subsequent code.
For example, the owner might release the monitor before[2].
The expectation is that when we find an owner thread itis either suspended or not. If it is suspended then itcannot release the monitor. If it is not suspended wedetect that and redo the whole query at a safepoint.
I think the owner thread might resume unfortunately aftersuspending check.
Yes you are right. I was thinking resuming also required asafepoint but it only requires the Threads_lock. So yesthe code is wrong.
Which code is wrong?
Yes, a rogue resume can happen when theGetObjectMonitorUsage() callerhas started the process of gathering the information whilenot at asafepoint. Thus the information returned byGetObjectMonitorUsage()
might be stale, but that's a bug in the agent code.
The code tries to make sure that it either collects dataabout a monitor owned by a thread that is suspended, or elseit collects that data at a safepoint. But the owning threadcan be resumed just after the code determined it wassuspended. The monitor can then be released and theinformation gathered not only stale but potentiallycompletely wrong as it could now be owned by a differentthread and will report that thread's entry count.
If the agent is not using SuspendThread(), then as soon as
GetObjectMonitorUsage() returns to the caller the information
can be stale. In fact as soon as the implementation returns
from the safepoint that gathered the info, the target thread
could have moved on.
That isn't the issue. That the info is stale is fine. But theexpectation is that the information was actually an accuratesnapshot of the state of the monitor at some point in time.The current code does not ensure that.
Please explain. I clearly don't understand why you think the info
returned isn't "an accurate snapshot of the state of the monitor
at some point in time".
Because it may not be a "snapshot" at all. There is noatomicity**. The reported owner thread may not own it any longerwhen the entry count is read, so straight away you may have thewrong entry count information. The set of threads trying toacquire the monitor, or wait on the monitor can change inunexpected ways. It would be possible for instance to report thesame thread as being the owner, being blocked trying to enterthe monitor, and being in the wait-set of the monitor -apparently all at the same time!
** even if the owner is suspended we don't have completeatomicity because threads can join the set of threads trying toenter the monitor (unless they are all suspended).
Consider the case when the monitor's owner is _not_ suspended:
- GetObjectMonitorUsage() uses a safepoint to gather the infoabout
    the object's monitor. Since we're at a safepoint, the info that
we are gathering cannot change until we return from thesafepoint.
    It is a snapshot and a valid one at that.

Consider the case when the monitor's owner is suspended:

  - GetObjectMonitorUsage() will gather info about the object's
    monitor while _not_ at a safepoint. Assuming that no other
    thread is suspended, then entry_count can change because
    another thread can block on entry while we are gathering
    info. waiter_count and waiters can change if a thread was
    in a timed wait that has timed out and now that thread is
    blocked on re-entry. I don't think that notify_waiter_count
    and notify_waiters can change.

    So in this case, the owner info and notify info is stable,
    but the entry_count and waiter info is not stable.

Consider the case when the monitor is not owned:

  - GetObjectMonitorUsage() will start to gather info about the
    object's monitor while _not_ at a safepoint. If it finds a
    thread on the entry queue that is not suspended, then it will
    bail out and redo the info gather at a safepoint. I just
    noticed that it doesn't check for suspension for the threads
    on the waiters list so a timed Object.wait() call can cause
    some confusion here.

    So in this case, the owner info is not stable if a thread
    comes out of a timed wait and reenters the monitor. This
    case is no different than if a "barger" thread comes in
    after the NULL owner field is observed and enters the
    monitor. We'll return that there is no owner, a list of
    suspended pending entry thread and a list of waiting
    threads. The reality is that the object's monitor is
    owned by the "barger" that completely bypassed the entry
    queue by virtue of seeing the NULL owner field at exactly
    the right time.

So the owner field is only stable when we have an owner. If
that owner is not suspended, then the other fields are also
stable because we gathered the info at a safepoint. If the
owner is suspended, then the owner and notify info is stable,
but the entry_count and waiter info is not stable.

If we have a NULL owner field, then the info is only stable
if you have a non-suspended thread on the entry list. Ouch!
That's deterministic, but not without some work.


Okay so only when we gather the info at a safepoint is all
of it a valid and stable snapshot. Unfortunately, we only
do that at a safepoint when the owner thread is not suspended
or if owner == NULL and one of the entry threads is not
suspended. If either of those conditions is not true, then
the different pieces of info is unstable to varying degrees.

As for this claim:
It would be possible for instance to report the same thread
as being the owner, being blocked trying to enter the monitor,
and being in the wait-set of the monitor - apparently all at
the same time!
I can't figure out a way to make that scenario work. If the
thread is seen as the owner and is not suspended, then we
gather info at a safepoint. If it is suspended, then it can't
then be seen as on the entry queue or on the wait queue since
it is suspended. If it is seen on the entry queue and is not
suspended, then we gather info at a safepoint. If it is
suspended on the entry queue, then it can't be seen on the
wait queue.

So the info instability of this API is bad, but it's not
quite that bad. :-) (That is a small mercy.)


Handshaking is not going to make this situation any better
for GetObjectMonitorUsage(). If the monitor is owned and we
handshake with the owner, the stability or instability of
the other fields remains the same as when SuspendThread is
used. Handshaking with all threads won't make the data as
stable as when at a safepoint because individual threads
can resume execution after doing their handshake so there
will still be field instability.


Short version: GetObjectMonitorUsage() should only gather
data at a safepoint. Yes, I've changed my mind.
I agree with this.
The advantages are:
  - the result is stable
  - the implementation can be simplified

Performance impact is not very clear but should not be that
big as suspending all the threads has some overhead too.
I'm not sure if using handshakes can make performance better.
Ok, may I file it to JBS and fix it?

Yasumasa
Thanks,
Serguei
Dan
David
-----
The only way to make sure you don't have stale information is
to use SuspendThread(), but it's not required. Perhaps the doc
should have more clear about the possibility of returning stale
info. That's a question for Robert F.
GetObjectMonitorUsage says nothing about thread's beingsuspended so I can't see how this could be construed as anagent bug.
In your scenario above, you mention that the target thread was
suspended, GetObjectMonitorUsage() was called while the target
was suspended, and then the target thread was resumed after
GetObjectMonitorUsage() checked for suspension, but before
GetObjectMonitorUsage() was able to gather the info.
All three of those calls: SuspendThread(),GetObjectMonitorUsage()and ResumeThread() are made by the agent and the agent shouldnotresume the target thread while also callingGetObjectMonitorUsage().
The calls were allowed to be made out of order so agent bug.
Perhaps. I was thinking more generally about an independentresume, but you're right that doesn't really make a lot ofsense. But when the spec says nothing about suspension ...
And it is intentional that suspension is not required. JVM/DIand JVM/PIused to require suspension for these kinds of get-the-infoAPIs. JVM/TI
intentionally was designed to not require suspension.
As I've said before, we could add a note about the data beingpotentiallystale unless SuspendThread is used. I think of it like stat(2).You canfetch the file's info, but there's no guarantee that the infois currentby the time you process what you got back. Is it too muchmotherhood to
state that the data might be stale? I could go either way...
Using a handshake on the owner thread will allow this to befixed in the future without forcing/using any safepoints.
I have to think about that which is why I'm avoiding talkingabout
handshakes in this thread.
Effectively the handshake can "suspend" the thread whilst themonitor is queried. In effect the operation would create aper-thread safepoint.
I "know" that, but I still need time to think about it andprobably
see the code to see if there are holes...
Semantically it is no different to the code actuallysuspending the owner thread, but it can't actually do thatbecause suspends/resume don't nest.
Yeah... we used have a suspend count back when we trackedinternal and
external suspends separately. That was a nightmare...

Dan
Cheers,
David
Dan
Cheers,
David
Dan
JavaThread::is_ext_suspend_completed() is used to checkthread state, it returns `true` when the thread issleeping [3], or when it performs in native [4].
Sure but if the thread is actually suspended it can'tcontinue execution in the VM or in Java code.
This appears to be an optimisation for the assumedcommon case where threads are first suspended and thenthe monitors are queried.
I agree with this, but I could find out it from JVMTIspec - it just says "Get information about the object'smonitor."
Yes it was just an implementation optimisation, nothing todo with the spec.
GetObjectMonitorUsage() might return incorrectinformation in some case.
It starts with finding owner thread, but the owner mightbe just before wakeup.So I think it is more safe if GetObjectMonitorUsage() iscalled at safepoint in any case.
Except we're moving away from safepoints to usingHandshakes, so this particular operation will require thatthe apparent owner is Handshake-safe (by entering ahandshake with it) before querying the monitor. This wouldstill be preferable I think to always using a safepointfor the entire operation.
Cheers,
David
-----
Thanks,

Yasumasa
[3]http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/runtime/thread.cpp#l671[4]http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/runtime/thread.cpp#l684
However there is still a potential bug as the threadreported as the owner may not be suspended at the timewe first see it, and may release the monitor, but thenit may get suspended before we call:
owning_thread =Threads::owning_thread_from_monitor_owner(tlh.list(),owner);
and so we think it is still the monitor owner andproceed to query the monitor information in a racy way.This can't happen when suspension itself requires asafepoint as the current thread won't go to thatsafepoint during this code. However, if suspension isimplemented via a direct handshake with the targetthread then we have a problem.
David
-----
Thanks,

Yasumasa
[1]http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/prims/jvmtiEnvBase.cpp#l973[2]http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/prims/jvmtiEnvBase.cpp#l996

Re: RFR: 8247729: GetObjectMonitorUsage() might return inconsistent information

Reply via email to