This change adds a fast-locking scheme as an alternative to the current 
stack-locking implementation. It retains the main advantage of stack-locking 
(namely fast locking in uncontended code paths) while avoiding the overloading 
of the mark word. That overloading causes massive problems with Lilliput, 
because it means we have to check for and deal with this situation whenever we 
access the mark word. Because header accesses are inherently racy, this turns 
out to be very complex and would require a variant of the inflation protocol to 
ensure that the object header is stable. (The current implementation of 
setting/fetching the identity hash (i-hash) provides a glimpse into that 
complexity.)

What the original stack-locking does is basically push a stack-lock onto the 
stack, consisting only of the displaced header, and CAS a pointer to that 
stack location into the object header (the lowest two header bits being 00 
indicates 'stack-locked'). The pointer into the stack can then be used to 
identify which thread currently owns the lock.
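
For reference, here is a minimal sketch of what the classic stack-locking fast 
path conceptually does; the names (`BasicLock`, `try_stack_lock`) and the tag 
handling are illustrative simplifications, not HotSpot's actual declarations:

```cpp
#include <atomic>
#include <cstdint>

constexpr uintptr_t unlocked_tag = 0b01;  // low mark-word bits of an unlocked object

struct BasicLock {                        // lives in the owning frame on the stack
  uintptr_t displaced_header;             // saved copy of the object's mark word
};

bool try_stack_lock(std::atomic<uintptr_t>& mark_word, BasicLock* lock) {
  uintptr_t mark = mark_word.load(std::memory_order_relaxed);
  if ((mark & 0b11) != unlocked_tag) {
    return false;                         // already locked or inflated: take the slow path
  }
  lock->displaced_header = mark;          // preserve the original header on the stack
  // A stack slot is word-aligned, so its low two bits are 00, which doubles
  // as the 'stack-locked' tag once the address is installed in the header.
  uintptr_t locked = reinterpret_cast<uintptr_t>(lock);
  return mark_word.compare_exchange_strong(mark, locked,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed);
}
```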

This change basically reverses stack-locking: it still CASes the lowest two 
header bits to 00 to indicate 'fast-locked', but does *not* overload the upper 
bits with a stack pointer. Instead, it pushes the object reference onto a 
thread-local lock-stack. This is a new structure: basically a small array of 
oops that is associated with each thread. Experience shows that this array 
typically remains very small (3-5 elements). Using this lock-stack, it is 
possible to query which threads own which locks. Most importantly, the most 
common question 'does the current thread own me?' is answered very quickly by 
a scan of the array. More complex queries like 'which thread owns X?' are not 
performed on very performance-critical paths (they usually occur in code like 
JVMTI or deadlock detection), where it is ok to do more complex operations 
(and we already do). The lock-stack is also a new set of GC roots, and would 
be scanned during thread scanning, possibly concurrently, via the normal 
protocols.
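
A hedged sketch of such a per-thread lock-stack is shown below; the actual 
class in the PR, its capacity and its growth policy may well differ, and 
`oop_t` merely stands in for HotSpot's oop type:

```cpp
#include <cassert>

typedef const void* oop_t;         // placeholder for HotSpot's oop

class LockStack {
  static const int CAPACITY = 8;   // grown on demand in the real implementation
  oop_t _oops[CAPACITY];
  int _top = 0;

public:
  bool can_push() const { return _top < CAPACITY; }

  void push(oop_t obj) {           // record a successful fast-lock
    assert(can_push() && "overflow must be checked before locking");
    _oops[_top++] = obj;
  }

  void remove(oop_t obj) {         // drop the entry again on unlock
    for (int i = 0; i < _top; i++) {
      if (_oops[i] == obj) {
        for (int j = i; j < _top - 1; j++) {
          _oops[j] = _oops[j + 1]; // keep the remaining entries in order
        }
        _top--;
        return;
      }
    }
  }

  // The common query 'does the current thread own this lock?' is a linear
  // scan of a typically 3-5 element array.
  bool contains(oop_t obj) const {
    for (int i = 0; i < _top; i++) {
      if (_oops[i] == obj) return true;
    }
    return false;
  }
};
```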

The lock-stack is grown when needed, which means that we need to check for 
potential overflow before attempting to lock. When the stack would overflow, 
the locking fast-paths call into the runtime to grow the stack and handle the 
locking there. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this 
check on method entry to avoid (possibly many) such checks at locking sites.
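
Putting the two sketches together, a hedged outline of the runtime fast path 
could look roughly like this (tag values and helper names are illustrative; 
in compiled code the overflow check is hoisted to method entry as described 
above):

```cpp
#include <atomic>
#include <cstdint>

// Reuses the LockStack sketch from above.
bool try_fast_lock(std::atomic<uintptr_t>& mark_word, LockStack& lock_stack,
                   oop_t obj) {
  if (!lock_stack.can_push()) {
    return false;                              // runtime call: grow the lock-stack, then lock
  }
  uintptr_t mark = mark_word.load(std::memory_order_relaxed);
  if ((mark & 0b11) != 0b01) {
    return false;                              // not unlocked: slow path / inflation
  }
  uintptr_t locked = mark & ~uintptr_t(0b11);  // low bits 00 now mean 'fast-locked'
  if (mark_word.compare_exchange_strong(mark, locked,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
    lock_stack.push(obj);                      // ownership is recorded thread-locally
    return true;
  }
  return false;                                // lost the race: slow path
}
```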

In contrast to stack-locking, fast-locking does *not* support recursive 
locking (yet). When a recursive lock is encountered, the fast-lock gets 
inflated to a full monitor. It is not clear whether it is worth adding support 
for recursive fast-locking.

One complication is that when a contending thread arrives at a fast-locked 
object, it must inflate the fast-lock to a full monitor. Normally, we need to 
know the current owning thread and record it in the monitor, so that the 
contending thread can wait for the current owner to properly exit the monitor. 
However, fast-locking doesn't have this information. What we do instead is 
record a special marker, ANONYMOUS_OWNER. When the thread that currently holds 
the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows the 
owner must be itself, fixes the owner to be itself, and then properly exits 
the monitor, thus handing it over to the contending thread.
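
The hand-off can be sketched as follows; the marker value and the field names 
are placeholders and do not reflect the actual ObjectMonitor layout:

```cpp
#include <atomic>

static void* const ANONYMOUS_OWNER = reinterpret_cast<void*>(1);  // placeholder value

struct MonitorSketch {
  std::atomic<void*> owner{nullptr};
};

// Contending thread: inflates a fast-locked object without knowing which
// thread holds the fast-lock, so it records a marker instead of a real owner.
void inflate_fast_locked(MonitorSketch& mon) {
  mon.owner.store(ANONYMOUS_OWNER, std::memory_order_release);
}

// Owning thread at monitorexit: observing the marker, it knows the owner must
// be itself, fixes the owner field, and then exits the monitor normally so
// that the contending thread can take over.
void exit_monitor(MonitorSketch& mon, void* self) {
  if (mon.owner.load(std::memory_order_acquire) == ANONYMOUS_OWNER) {
    mon.owner.store(self, std::memory_order_release);
  }
  // ... continue with the regular monitor-exit protocol ...
}
```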

As an alternative, I considered removing stack-locking altogether and only 
using heavy monitors. In most workloads this did not show measurable 
regressions. However, in a few workloads I observed severe regressions. All of 
them were using old synchronized Java collections (Vector, Stack), StringBuffer 
or similar code. The combination of two conditions leads to regressions without 
stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. 
single-threaded use of Vector or StringBuffer), and 2. the workload churns such 
locks. In other words, uncontended use of Vector, StringBuffer, etc. as such is 
ok, but creating lots of such single-use, single-threaded-locked objects leads 
to massive ObjectMonitor churn, which can cause a significant performance 
impact. But alas, such code exists, and we probably don't want to punish it if 
we can avoid it.

This change enables us to simplify (and speed up!) a lot of code:

- The inflation protocol is no longer necessary: we can directly CAS the 
(tagged) ObjectMonitor pointer into the object header.
- Accessing the hashcode could now always be done in the fast path, once the 
hashcode has been installed. Fast-locked headers can be used directly; for 
monitor-locked objects we can easily reach through to the displaced header. 
This is safe because Java threads participate in the monitor deflation 
protocol. This would be implemented in a separate PR (see the hedged sketch 
below).
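
A rough, hypothetical sketch of such a hashcode fast path is shown below; the 
tag values follow the usual mark-word conventions, but the hash extraction and 
the reach-through helper are simplified placeholders, not the code of that 
future PR:

```cpp
#include <cstdint>

struct DisplacedHeaderHolder {
  uintptr_t displaced_header;   // original mark word, including any installed hash
};

uintptr_t read_identity_hash(uintptr_t mark) {
  // Tag bits: 01 unlocked, 00 fast-locked, 10 monitor-locked (inflated).
  if ((mark & 0b11) == 0b10) {
    // Monitor-locked: reach through the tagged pointer to the displaced header.
    // Safe because Java threads participate in the monitor deflation protocol.
    DisplacedHeaderHolder* mon =
        reinterpret_cast<DisplacedHeaderHolder*>(mark & ~uintptr_t(0b11));
    mark = mon->displaced_header;
  }
  // Unlocked and fast-locked headers can be used directly.
  return (mark >> 8) & 0x7FFFFFFF;   // hypothetical hash-field extraction
}
```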


Testing:
 - [x] tier1 x86_64 x aarch64 x +UseFastLocking
 - [x] tier2 x86_64 x aarch64 x +UseFastLocking
 - [x] tier3 x86_64 x aarch64 x +UseFastLocking
 - [x] tier4 x86_64 x aarch64 x +UseFastLocking
 - [x] tier1 x86_64 x aarch64 x -UseFastLocking
 - [x] tier2 x86_64 x aarch64 x -UseFastLocking
 - [x] tier3 x86_64 x aarch64 x -UseFastLocking
 - [x] tier4 x86_64 x aarch64 x -UseFastLocking
 - [x] Several real-world applications have been tested with this change in 
tandem with Lilliput without any problems so far.

### Performance

#### Renaissance



| Benchmark | x86_64 stack-locking | x86_64 fast-locking | delta | aarch64 stack-locking | aarch64 fast-locking | delta |
| -- | -- | -- | -- | -- | -- | -- |
| AkkaUct | 841.884 | 836.948 | 0.59% | 1475.774 | 1465.647 | 0.69% |
| Reactors | 11444.511 | 11606.66 | -1.40% | 11382.594 | 11638.036 | -2.19% |
| Als | 1367.183 | 1359.358 | 0.58% | 1678.103 | 1688.067 | -0.59% |
| ChiSquare | 577.021 | 577.398 | -0.07% | 986.619 | 988.063 | -0.15% |
| GaussMix | 817.459 | 819.073 | -0.20% | 1154.293 | 1155.522 | -0.11% |
| LogRegression | 598.343 | 603.371 | -0.83% | 638.052 | 644.306 | -0.97% |
| MovieLens | 8248.116 | 8314.576 | -0.80% | 9898.1 | 10097.867 | -1.98% |
| NaiveBayes | 587.607 | 581.608 | 1.03% | 541.583 | 550.059 | -1.54% |
| PageRank | 3260.553 | 3263.472 | -0.09% | 4376.405 | 4381.101 | -0.11% |
| FjKmeans | 979.978 | 976.122 | 0.40% | 774.312 | 771.235 | 0.40% |
| FutureGenetic | 2187.369 | 2183.271 | 0.19% | 2685.722 | 2689.056 | -0.12% |
| ParMnemonics | 2527.228 | 2564.667 | -1.46% | 4278.225 | 4263.863 | 0.34% |
| Scrabble | 111.882 | 111.768 | 0.10% | 151.796 | 153.959 | -1.40% |
| RxScrabble | 210.252 | 211.38 | -0.53% | 310.116 | 315.594 | -1.74% |
| Dotty | 750.415 | 752.658 | -0.30% | 1033.636 | 1036.168 | -0.24% |
| ScalaDoku | 3072.05 | 3051.2 | 0.68% | 3711.506 | 3690.04 | 0.58% |
| ScalaKmeans | 211.427 | 209.957 | 0.70% | 264.38 | 265.788 | -0.53% |
| ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | 1088.182 | 1092.266 | -0.37% |
| Philosophers | 6450.124 | 6565.705 | -1.76% | 12017.964 | 11902.559 | 0.97% |
| FinagleChirper | 3953.623 | 3972.647 | -0.48% | 4750.751 | 4769.274 | -0.39% |
| FinagleHttp | 3970.526 | 4005.341 | -0.87% | 5294.125 | 5296.224 | -0.04% |

-------------

Commit messages:
 - Use testb when testing for anon owner in fast-path
 - Merge branch 'master' into JDK-8291555-v2
 - In x86_32 C2 fast_lock(), CAS thread directly, instead of CASing 
stack-pointer and then storing thread
 - x86 part of optimization to check for anon owner
 - Optimize the check-for-anon-owner fast-path
 - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
 - Fix method signature of build_frame() on ppc, arm and s390
 - Remove unnecessary assert
 - Refactorings and cleanups (runtime)
 - Some refactorings (x86)
 - ... and 72 more: https://git.openjdk.org/jdk/compare/e7fa150b...c4746710

Changes: https://git.openjdk.org/jdk/pull/10907/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10907&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8291555
  Stats: 1924 lines in 74 files changed: 1212 ins; 97 del; 615 mod
  Patch: https://git.openjdk.org/jdk/pull/10907.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10907/head:pull/10907

PR: https://git.openjdk.org/jdk/pull/10907
