On Wed, 5 Apr 2023 10:40:54 GMT, Roman Kennke <rken...@openjdk.org> wrote:
>> This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity.)
>>
>> What the original stack-locking does is basically to push a stack-lock onto the stack, consisting only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>>
>> This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference onto a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection), where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
>>
>> The lock-stack is fixed-size, currently with 8 elements. According to my experiments with various workloads, this covers the vast majority of workloads (in fact, most workloads seem to never exceed 5 active locks per thread at a time). We check for overflow in the fast-paths, and when the lock-stack is full, we take the slow-path, which inflates the lock to a monitor. That case should be very rare.
>>
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
>>
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.
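To make the lock-stack mechanism described above concrete, here is a minimal, self-contained C++ sketch. This is *not* the code in this PR; the class and member names are hypothetical, and `oop` is declared here only as a stand-in for HotSpot's Java-object reference type:

```c++
#include <cassert>

typedef class oopDesc* oop;  // stand-in for HotSpot's Java object reference

// Hypothetical per-thread lock-stack: a small, fixed-size array of oops.
class LockStack {
  static const int CAPACITY = 8;  // fixed size; a full stack forces inflation
  oop _base[CAPACITY];            // entries are also a new set of GC roots
  int _top = 0;

public:
  bool is_full() const { return _top == CAPACITY; }

  // monitorenter fast-path: after CASing the two low header bits to 00
  // ('fast-locked'), record that the current thread now holds this lock.
  void push(oop o) {
    assert(!is_full() && "caller must take the slow-path and inflate instead");
    _base[_top++] = o;
  }

  // monitorexit fast-path (this sketch assumes balanced locking).
  void pop(oop o) {
    assert(_top > 0 && _base[_top - 1] == o && "unbalanced monitorexit");
    _base[--_top] = nullptr;
  }

  // Answers the common question 'does the current thread own me?' with a
  // linear scan -- cheap, because the stack typically has 3-5 entries.
  bool contains(oop o) const {
    for (int i = 0; i < _top; i++) {
      if (_base[i] == o) return true;
    }
    return false;
  }
};
```

The fixed capacity mirrors the overflow handling described above: the fast-paths check `is_full()` and inflate to a full monitor when the stack is exhausted.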
>>
>> As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer), and 2. the workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>>
>> This change makes it possible to simplify (and speed up!) a lot of code:
>>
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
>> - Accessing the hashcode can now always be done in the fast-path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.
>>
>> Also, and I might be mistaken here, this new lightweight locking would make synchronized work better with Loom: because the lock-records are no longer scattered across the stack, but instead densely packed into the lock-stack, it should be easy for a vthread to save its lock-stack upon unmounting and restore it when re-mounting. However, I am not sure about this, and this PR does not attempt to implement that support.
>>
>> Testing:
>> - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>> - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>> - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>> - [x] Several real-world applications have been tested with this change in tandem with Lilliput, so far without any problems
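The hashcode reach-through mentioned in the list above is deferred to a separate PR, so purely as illustration, here is a hedged sketch of what that fast-path might look like, written against HotSpot-internal types (`markWord`, `ObjectMonitor`); the function name is made up and none of this is code from the PR:

```c++
#include "oops/markWord.hpp"
#include "oops/oop.inline.hpp"
#include "runtime/objectMonitor.inline.hpp"

// Hypothetical always-fast hash lookup; not part of this change.
static intptr_t read_hash_fast_path(oop obj) {
  markWord mark = obj->mark();
  if (mark.has_monitor()) {
    // Monitor-locked: reach through to the displaced header stored in the
    // ObjectMonitor. This is safe because Java threads participate in the
    // monitor deflation protocol, so the monitor stays stable here.
    mark = mark.monitor()->header();
  }
  // Neutral or fast-locked: the hash bits in the header are valid as-is,
  // since fast-locking no longer overloads the upper bits of the mark word.
  return mark.hash();  // 0 means no hash installed yet -> take the slow-path
}
```

In the real `FastHashCode`, a zero hash would still have to fall back to the slow-path that installs one.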
>>
>> ### Performance
>>
>> #### Simple Microbenchmark
>>
>> The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found [here](https://github.com/rkennke/fastlockbench). Numbers are in ns/op.
>>
>> | | x86_64 | aarch64 |
>> | -- | -- | -- |
>> | -UseFastLocking | 20.651 | 20.764 |
>> | +UseFastLocking | 18.896 | 18.908 |
>>
>> #### Renaissance
>>
>> | benchmark | x86_64 stack-locking | x86_64 fast-locking | Δ | aarch64 stack-locking | aarch64 fast-locking | Δ |
>> | -- | -- | -- | -- | -- | -- | -- |
>> | AkkaUct | 841.884 | 836.948 | 0.59% | 1475.774 | 1465.647 | 0.69% |
>> | Reactors | 11041.427 | 11181.451 | -1.25% | 11381.751 | 11521.318 | -1.21% |
>> | Als | 1367.183 | 1359.358 | 0.58% | 1678.103 | 1688.067 | -0.59% |
>> | ChiSquare | 577.021 | 577.398 | -0.07% | 986.619 | 988.063 | -0.15% |
>> | GaussMix | 817.459 | 819.073 | -0.20% | 1154.293 | 1155.522 | -0.11% |
>> | LogRegression | 598.343 | 603.371 | -0.83% | 638.052 | 644.306 | -0.97% |
>> | MovieLens | 8248.116 | 8314.576 | -0.80% | 7569.219 | 7646.828 | -1.01% |
>> | NaiveBayes | 587.607 | 581.608 | 1.03% | 541.583 | 550.059 | -1.54% |
>> | PageRank | 3260.553 | 3263.472 | -0.09% | 4376.405 | 4381.101 | -0.11% |
>> | FjKmeans | 979.978 | 976.122 | 0.40% | 774.312 | 771.235 | 0.40% |
>> | FutureGenetic | 2187.369 | 2183.271 | 0.19% | 2685.722 | 2689.056 | -0.12% |
>> | ParMnemonics | 2434.551 | 2468.763 | -1.39% | 4278.225 | 4263.863 | 0.34% |
>> | Scrabble | 111.882 | 111.768 | 0.10% | 151.796 | 153.959 | -1.40% |
>> | RxScrabble | 210.252 | 211.38 | -0.53% | 310.116 | 315.594 | -1.74% |
>> | Dotty | 750.415 | 752.658 | -0.30% | 1033.636 | 1036.168 | -0.24% |
>> | ScalaDoku | 3072.05 | 3051.2 | 0.68% | 3711.506 | 3690.04 | 0.58% |
>> | ScalaKmeans | 211.427 | 209.957 | 0.70% | 264.38 | 265.788 | -0.53% |
>> | ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% | 1088.182 | 1092.266 | -0.37% |
>> | Philosophers | 6450.124 | 6565.705 | -1.76% | 12017.964 | 11902.559 | 0.97% |
>> | FinagleChirper | 3953.623 | 3972.647 | -0.48% | 4750.751 | 4769.274 | -0.39% |
>> | FinagleHttp | 3970.526 | 4005.341 | -0.87% | 5294.125 | 5296.224 | -0.04% |
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
>   Named constants for LockingMode

v52 is failing quite a few JVM/TI tests with crashes that look like this:

```
# Internal Error (/opt/mach5/mesos/work_dir/slaves/741e9afd-8c02-45c3-b2e2-9db1450d0832-S40935/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/b154597d-2ba7-420d-81c1-ef13f408c137/runs/d52e181d-f011-47c8-a35f-30fcbba5c164/workspace/open/src/hotspot/share/runtime/javaThread.hpp:983), pid=1112738, tid=1112747
# assert(t->is_Java_thread()) failed: incorrect cast to JavaThread
#
# JRE version: Java(TM) SE Runtime Environment (21.0) (fastdebug build 21-internal-LTS-2023-04-04-2141101.daniel.daugherty.8291555forjdk21.git)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 21-internal-LTS-2023-04-04-2141101.daniel.daugherty.8291555forjdk21.git, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x155baa4]  is_lock_owned(Thread*, oop)+0x254

<snip>

Stack: [0x0000fffdb8970000,0x0000fffdb8b70000],  sp=0x0000fffdb8b6d480,  free space=2037k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x155baa4]  is_lock_owned(Thread*, oop)+0x254  (javaThread.hpp:983)
V  [libjvm.so+0x15627d0]  ObjectSynchronizer::FastHashCode(Thread*, oop)+0x720  (synchronizer.cpp:956)
V  [libjvm.so+0x12dbedc]  oopDesc::slow_identity_hash()+0x68  (oop.cpp:112)
V  [libjvm.so+0x105f044]  JvmtiTagMapTable::find(oop)+0x1c4  (oop.inline.hpp:364)
V  [libjvm.so+0x1056744]  CallbackWrapper::CallbackWrapper(JvmtiTagMap*, oop)+0xa4  (jvmtiTagMap.cpp:222)
V  [libjvm.so+0x105c538]  CallbackInvoker::invoke_advanced_stack_ref_callback(jvmtiHeapReferenceKind, long, long, int, _jmethodID*, long, int, oop)+0x158  (jvmtiTagMap.cpp:1754)
V  [libjvm.so+0x105d688]  JNILocalRootsClosure::do_oop(oop*)+0x168  (jvmtiTagMap.cpp:2019)
V  [libjvm.so+0xe8bac8]  JNIHandleBlock::oops_do(OopClosure*)+0x68  (jniHandles.cpp:411)
V  [libjvm.so+0x105ddbc]  VM_HeapWalkOperation::collect_stack_roots(JavaThread*, JNILocalRootsClosure*)+0x6dc  (jvmtiTagMap.cpp:2725)
V  [libjvm.so+0x105e368]  VM_HeapWalkOperation::collect_stack_roots()+0x138  (jvmtiTagMap.cpp:2772)
V  [libjvm.so+0x10555f0]  VM_HeapWalkOperation::doit()+0x6e0  (jvmtiTagMap.cpp:2827)
V  [libjvm.so+0x1686ac0]  VM_Operation::evaluate()+0x120  (vmOperations.cpp:71)
V  [libjvm.so+0x16b2710]  VMThread::evaluate_operation(VM_Operation*)+0xd0  (vmThread.cpp:281)
V  [libjvm.so+0x16b3204]  VMThread::inner_execute(VM_Operation*)+0x374  (vmThread.cpp:428)
V  [libjvm.so+0x16b33fc]  VMThread::loop()+0x8c  (vmThread.cpp:495)
V  [libjvm.so+0x16b352c]  VMThread::run()+0x9c  (vmThread.cpp:175)
V  [libjvm.so+0x15ad4b0]  Thread::call_run()+0xac  (thread.cpp:224)
V  [libjvm.so+0x130c0a8]  thread_native_entry(Thread*)+0x134  (os_linux.cpp:740)
C  [libpthread.so.0+0x7908]  start_thread+0x188
```

This code block in `ObjectSynchronizer::FastHashCode()`:

```c++
    // Fall thru so we only have one place that installs the hash in
    // the ObjectMonitor.
  } else if (LockingMode == 2 && mark.is_fast_locked() && is_lock_owned(current, obj)) {
    // This is a fast-lock owned by the calling thread so use the
    // markWord from the object.
    hash = mark.hash();
    if (hash != 0) {                  // if it has a hash, just return it
      return hash;
    }
  } else if (LockingMode == 1 && mark.has_locker() &&
             current->is_lock_owned((address)mark.locker())) {
```

is calling this static function:

```c++
static bool is_lock_owned(Thread* thread, oop obj) {
  assert(LockingMode == 2, "only call this with new lightweight locking enabled");
  return JavaThread::cast(thread)->lock_stack().contains(obj);
}
```

and that function used to look like this:

```diff
 static bool is_lock_owned(Thread* thread, oop obj) {
   assert(LockingMode == 2, "only call this with new lightweight locking enabled");
-  return thread->is_Java_thread() ? JavaThread::cast(thread)->lock_stack().contains(obj) : false;
+  return JavaThread::cast(thread)->lock_stack().contains(obj);
 }
```

so the `thread->is_Java_thread()` check is needed, since the VMThread is the one that's making this `is_lock_owned()` check as part of a hashcode operation.

There are 129 test failures in Mach5 Tier4 and 769 test failures in Mach5 Tier5. I don't know yet whether all are due to:

```
# Internal Error (/opt/mach5/mesos/work_dir/slaves/741e9afd-8c02-45c3-b2e2-9db1450d0832-S40935/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/b154597d-2ba7-420d-81c1-ef13f408c137/runs/d52e181d-f011-47c8-a35f-30fcbba5c164/workspace/open/src/hotspot/share/runtime/javaThread.hpp:983), pid=1112738, tid=1112747
# assert(t->is_Java_thread()) failed: incorrect cast to JavaThread
```

-------------

PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1497720340
PR Comment: https://git.openjdk.org/jdk/pull/10907#issuecomment-1497725623