Re: RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation [v104]

Kelvin Nilsen Thu, 14 May 2026 07:21:29 -0700

On Mon, 4 May 2026 23:03:06 GMT, Xiaolong Peng <[email protected]> wrote:


>> - [x] I confirm that I make this contribution in accordance with the 
>> [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).
>> 
>> Shenandoah always allocates memory under the heap lock. We have observed heavy 
>> heap lock contention on the memory allocation path in performance analysis of 
>> a service in which we tried to adopt Shenandoah. This change proposes an 
>> optimization of the memory allocation code path to reduce heap lock 
>> contention. Along with the optimization, Shenandoah memory allocation gets a 
>> better object-oriented design (OOD) that reuses the majority of the code:
>> 
>> * ShenandoahAllocator: base class of the allocators, most of the allocation 
>> code is in this class.
>> * ShenandoahMutatorAllocator: allocator for mutators; inherits from 
>> ShenandoahAllocator and only overrides the methods `alloc_start_index`, `verify`, 
>> `_alloc_region_count` and `_yield_to_safepoint` to customize the allocator 
>> for mutators.
>> * ShenandoahCollectorAllocator: allocator for collector allocation in the 
>> Collector partition; similar to ShenandoahMutatorAllocator, with only a few lines 
>> of code to customize the allocator for the Collector. 
>> * ShenandoahOldCollectorAllocator: allocator for mutator collector 
>> allocation in the OldCollector partition. It doesn't inherit the logic from 
>> ShenandoahAllocator for now; the `allocate` method has been overridden to 
>> delegate to `FreeSet::allocate_for_collector` due to the special allocation 
>> considerations for `plab` in old gen. We will rewrite this part later and 
>> move the code out of `FreeSet::allocate_for_collector`.
>> 
>> I'm not expecting a significant performance impact in most cases, since 
>> in most cases the contention on the heap lock is not high enough to cause a 
>> performance issue, but in some cases it may improve latency/performance:
>> 
>> 1. Dacapo lusearch test on EC2 host with 96 CPU cores, p90 is improved from 
>> 500+us to less than 150us, p99 from 1000+us to ~200us. 
>> 
>> java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms31G -Xmx31G 
>> -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions 
>> -XX:+UnlockDiagnosticVMOptions  -XX:-ShenandoahUncommit 
>> -XX:ShenandoahGCMode=generational  -XX:+UseTLAB -jar 
>> ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar  -n 10 lusearch  | grep "metered 
>> full smoothing"
>> 
>> 
>> Openjdk TIP:
>> 
>> ===== DaCapo tail latency, metered full smoothing: 50% 241098 usec, 90% 
>> 402356 usec, 99% 411065 usec, 99.9% 411763 usec, 99.99% 415531 usec, max 
>> 428584 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 902 usec, 9...
>
> Xiaolong Peng has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains 386 commits:
> 
>  - Merge branch 'openjdk:master' into cas-alloc-1
>  - Document _collector_allocator_reserved ordering contract
>  - Tame TestCASAllocContention: fewer threads, shorter run, smaller retention
>  - Comment: pair reserve-time inflation with read-time compensation
>    
>    Cross-reference ShenandoahFreeSet::reserve_alloc_regions_internal at
>    every 'bytes_allocated - mutator_allocator_remaining' subtraction site.
>    No behavior change.
>  - Include mutator allocator remaining bytes in unsafe_max_tlab_alloc
>  - Reserve mutator alloc regions after abbreviated degenerated GC
>  - fix: new jtreg tests miss -XX:+UnlockDiagnosticVMOptions
>  - fix: sync _top before CAS in unset_active_alloc_region
>  - fix: sync _top once on CAS success and reset_age after the loop in 
> unset_active_alloc_region
>  - Add jtreg tests for CAS allocator flag combinations and contention
>  - ... and 376 more: https://git.openjdk.org/jdk/compare/ebb3d688...ce68d1fa

Agree with @shipilev that we should simplify this PR so that it is more 
accessible to more reviewers and we can assimilate the changes in smaller 
incremental steps.

Makes sense to start with the ShenandoahAllocator refactoring as a separate PR.

A second step might be to simplify the accounting updates that are performed 
immediately following each change to the freeset tallies.

A third step might be to remove the alignment requirement on PLABs, given upstream 
changes that have already been integrated.  That way we wouldn't have to 
special-case the OldCollector allocator.

Then maybe we can integrate the CAS allocator change.

Not sure that the order I suggest above for these incremental steps represents 
the smoothest evolution of the software.  If you think a different order makes 
more sense, that's fine with me.
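
For reviewers less familiar with the idea behind the last step, here is a rough sketch of what a CAS-based bump-pointer allocation looks like in general. This is not code from the PR: the `Region` class, `par_allocate`, and `remaining` are hypothetical names chosen for illustration, and the real change has to deal with free set accounting, region retirement, and safepoints, none of which is shown here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a heap region; NOT Shenandoah's actual class.
class Region {
public:
  Region(char* bottom, size_t capacity)
    : _top(bottom), _end(bottom + capacity) {}

  // Try to carve `size` bytes off the region without taking a lock.
  // Returns nullptr when the region cannot satisfy the request; a real
  // allocator would then fall back to a slow path (e.g. under the heap lock).
  char* par_allocate(size_t size) {
    assert(size > 0);
    char* old_top = _top.load();
    while (true) {
      if (static_cast<size_t>(_end - old_top) < size) {
        return nullptr;
      }
      char* new_top = old_top + size;
      // On failure, compare_exchange_weak stores the current value of _top
      // into old_top, so the loop retries against the fresh value.
      if (_top.compare_exchange_weak(old_top, new_top)) {
        return old_top;
      }
    }
  }

  // Bytes still available in this region at the time of the call.
  size_t remaining() const {
    return static_cast<size_t>(_end - _top.load());
  }

private:
  std::atomic<char*> _top;  // next free byte, advanced by CAS
  char* const _end;         // one past the last usable byte
};
```

Threads that lose the race simply retry, so the shared lock is only needed once the region is exhausted.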

-------------

Changes requested by kdnilsen (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26171#pullrequestreview-4290680371
