Hi Goetz, > Thanks for the quick reply.
Yes, this time it didn't take that long... [... snip ...] > > > > > I understand you annotate at safepoints where the escape analysis > > > > > finds out that an object is "better" than global escape. > > > > > This are the cases where the analysis identifies optimization > > > > > opportunities. These annotations are then used to deoptimize > > > > > frames and the objects referenced by them. > > > > > Doesn't this overestimate the optimized > > > > > objects? E.g., eliminate_alloc_node has many cases where it bails > > > > > out. > > > > > > > > Yes, the implementation is conservative, but it is comparatively simple > > and > > > > the additional debug > > > > info is just 2 flags per safepoint. > > > Thanks. It also helped that you explained to me offline that > > > there are more optimizations than only lock elimination and scalar > > > replacement done based on the ea information. > > > The ea refines the IR graph with allows follow up optimizations > > > which can not easily be tracked back to the escaping objects or > > > the call sites where they do not escape. > > > Thus, if there are non-global escaping objects, you have to > > > deoptimize the frame. > > > Did I repeat that correctly? > > > > Mostly, but there are also cases where deoptimization is required if and > > only > > if ea-local objects > > are passed as arguments. This is the case when values are not read directly > > from a frame, but from a callee frame. > Hmm, don't get this completely, but ok. Let C be a callee frame of B which is a callee of A. If you use JVMTI to read an object reference from a local variable of C then the implementation of JDK-8227745 deoptimizes A if it passes any ea-local as argument, because the reference could be ea-local in A and there might be optimizations that are invalid after the escape state change. > > > > Accesses to instance > > > > members or array elements can be optimized as well. > > > You mean the compiler can/will ignore volatile or memory ordering > > > requirements for non-escaping objects? Sounds reasonable to do. > > > > Yes, for instance. Also without volatile modifiers it will eliminate > > accesses. > > Here is an example: > > Method A has a NoEscape allocation O that is not scalar replaced. A calls > > Method B, which is not > > inlined. When you use your debugger to break in B, then modify a field of O, > > then this modification > > would have no effect without deoptimization, because the jit assumes that B > > cannot modify O without > > a reference to it. > Yes, A can keep O in a register, while the JVMTI thread would write to > the location in the stack where the local is held (if it was written back). Not quite. It is the value of the field of O that is in a register not the reference to O itself. The agent changes the field's value in the /java heap/ (remember: O is _not_ scalar replaced), but the fields value is not reloaded after return from B. > > > > > Syncronization: looks good. I think others had a look at this before. > > > > > > > > > > EscapeBarrier::deoptimize_objects_internal() > > > > > The method name is misleading, it is not used by > > > > > deoptimize_objects(). > > > > > Also, method with the same name is in Deopitmization. > > > > > Proposal: deoptimize_objects_thread() ? > > > > > > > > Sorry, but I don't see, why it would be misleading. > > > > What would be the meaning of 'deoptimize_objects_thread'? I don't > > > > understand that name. > > > 1. I have no idea why it's called "_internal". Because it is private? > > > By the name, I would expect that EscapeBarrier::deoptimize_objects() > > > calls it for some internal tasks. But it does not. > > > > Well, I'd say it is pretty internal, what's happening in that method. So > > IMHO > > the suffix _internal > > is a match. > > > > > 2. My proposal: deoptimize_objects_all_threads() iterates all threads > > > and calls deoptimize_objects(_one)_thread(thread) for each of these. > > > That's how I would have named it. > > > But no bike shedding, if you don't see what I mean it's not obvious. > > Ok. We could have a quick call, too, if you like. > Ok, I think I have understood the remaining points. I'm fine with this > so far. Thanks again and best regards, Richard. -----Original Message----- From: Lindenmaier, Goetz <goetz.lindenma...@sap.com> Sent: Mittwoch, 22. Juli 2020 18:22 To: Reingruber, Richard <richard.reingru...@sap.com>; serviceability-dev@openjdk.java.net; hotspot-compiler-...@openjdk.java.net; hotspot-runtime-...@openjdk.java.net Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents Hi Richard, Thanks for the quick reply. > > > With DeoptimizeObjectsALot enabled internal threads are started that > > > deoptimize frames and > > > objects. The number of threads started are given with > > > DeoptimizeObjectsALotThreadCountAll and > > > DeoptimizeObjectsALotThreadCountSingle. The former targets all > existing > > > threads whereas the > > > latter operates on a single thread selected round robin. > > > > > > I removed the mode where deoptimizations were performed at every nth > > > exit from the runtime. I never used it. > > > Do I get it right? You have a n:1 and a n:all test scenario. > > n:1: n threads deoptimize 1 Jana thread where n => DOALThreadCountSingle > > n:m: n threads deoptimize all Java threads where n = DOALThreadCountAll? > > Not quite. > > -XX:+DeoptimizeObjectsALot // required > -XX:DeoptimizeObjectsALotThreadCountAll=m > -XX:DeoptimizeObjectsALotThreadCountSingle=n > > Will start m+n threads. Each operating on all existing JavaThreads using > EscapeBarriers. The > difference between the 2 thread types is that one distinct EscapeBarrier > targets either just a > single thread or all exisitng threads at onece. If just one single thread is > targeted per > EscapeBarrier, then it is not always the same thread, but threads are selected > round robin. So there > will be n threads selecting independently single threads round robin per > EscapeBarrier and m threads > that target all threads in every EscapeBarrier. Ok, yes, that is how I understood it. > > > * EscapeBarrier::sync_and_suspend_one(): use a direct handshake and > > > execute it always independently > > > of is_thread_fully_suspended(). > > Is this also a performance optimization? > > Maybe a minor one. OK > > > * JavaThread::wait_for_object_deoptimization(): > > > - Bugfix: the last check of is_obj_deopt_suspend() must be /after/ the > > > safepoint check! This > > > caused issues with not walkable stacks with DeoptimizeObjectsALot. > > OK. As I understand, there was one safepoint check in the old version, > > now there is one in each iteration. I assume this is intended, right? > > Yes it is. The important thing here is (A) a safepoint check is needed /after/ > leaving a safe state > (_thread_in_native, _thread_blocked). (B) Shared variables that are modified > at safepoints or with handshakes need to be reread /after/ the safepoint > check. > > BTW: I only noticed now that since JDK-8240918 JavaThreads themselves > must disarm their polling > page. Originally (before handshakes) this was done by the VM thread. With > handshakes it was done by > the thread executing the handshake op. This was changed for > OrderAccess::cross_modify_fence() where > the poll is left armed if the thread is in native and sice JDK-8240918 it is > always left armed. So > when a thread leaves a safe state (native, blocked) and there was a > handshake/vm op, it will always > call SafepointMechanism::block_if_requested_slow(), even if the > handshake/vm operation have been > processed already and everybody else is happyly executing bytecodes :) Ok. > Still (A) and (B) hold. > > > - Added limited spinning inspired by HandshakeSpinYield to fix > > > regression in > > > microbenchmark [1] > > Ok. Nice improvement, nice catch! > > Yes. It certainly took some time to find out. > > > > > > > I refer to some more changes answering your questions and comments > inline > > > below. > > > > > > Thanks, > > > Richard. > > > > > > [1] Microbenchmark: > > > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbe > nchmark/ > > > > > > > > > I understand you annotate at safepoints where the escape analysis > > > > finds out that an object is "better" than global escape. > > > > This are the cases where the analysis identifies optimization > > > > opportunities. These annotations are then used to deoptimize > > > > frames and the objects referenced by them. > > > > Doesn't this overestimate the optimized > > > > objects? E.g., eliminate_alloc_node has many cases where it bails > > > > out. > > > > > > Yes, the implementation is conservative, but it is comparatively simple > and > > > the additional debug > > > info is just 2 flags per safepoint. > > Thanks. It also helped that you explained to me offline that > > there are more optimizations than only lock elimination and scalar > > replacement done based on the ea information. > > The ea refines the IR graph with allows follow up optimizations > > which can not easily be tracked back to the escaping objects or > > the call sites where they do not escape. > > Thus, if there are non-global escaping objects, you have to > > deoptimize the frame. > > Did I repeat that correctly? > > Mostly, but there are also cases where deoptimization is required if and only > if ea-local objects > are passed as arguments. This is the case when values are not read directly > from a frame, but from a callee frame. Hmm, don't get this completely, but ok. > > > Accesses to instance > > > members or array elements can be optimized as well. > > You mean the compiler can/will ignore volatile or memory ordering > > requirements for non-escaping objects? Sounds reasonable to do. > > Yes, for instance. Also without volatile modifiers it will eliminate accesses. > Here is an example: > Method A has a NoEscape allocation O that is not scalar replaced. A calls > Method B, which is not > inlined. When you use your debugger to break in B, then modify a field of O, > then this modification > would have no effect without deoptimization, because the jit assumes that B > cannot modify O without > a reference to it. Yes, A can keep O in a register, while the JVMTI thread would write to the location in the stack where the local is held (if it was written back). > > > > Syncronization: looks good. I think others had a look at this before. > > > > > > > > EscapeBarrier::deoptimize_objects_internal() > > > > The method name is misleading, it is not used by > > > > deoptimize_objects(). > > > > Also, method with the same name is in Deopitmization. > > > > Proposal: deoptimize_objects_thread() ? > > > > > > Sorry, but I don't see, why it would be misleading. > > > What would be the meaning of 'deoptimize_objects_thread'? I don't > > > understand that name. > > 1. I have no idea why it's called "_internal". Because it is private? > > By the name, I would expect that EscapeBarrier::deoptimize_objects() > > calls it for some internal tasks. But it does not. > > Well, I'd say it is pretty internal, what's happening in that method. So IMHO > the suffix _internal > is a match. > > > 2. My proposal: deoptimize_objects_all_threads() iterates all threads > > and calls deoptimize_objects(_one)_thread(thread) for each of these. > > That's how I would have named it. > > But no bike shedding, if you don't see what I mean it's not obvious. > Ok. We could have a quick call, too, if you like. Ok, I think I have understood the remaining points. I'm fine with this so far. Thanks, Goetz.