> Note that any reference to pages from here on out refers to the concept of a 
> heap region in ZGC, not pages in the operating system (OS), unless stated 
> otherwise.

# Background

This PR addresses fragmentation by introducing a Mapped Cache that replaces the 
Page Cache in ZGC. The largest limitation of the Page Cache is that it is 
constrained by the abstraction of what a page is. The proposed Mapped Cache 
removes this limitation by decoupling memory from pages, allowing it to merge 
and split memory in ways that the Page Cache is not suited for. To facilitate 
the transition, much of the Page Allocator has been redesigned to work with the 
Mapped Cache.

In addition to fighting fragmentation, the new approach improves NUMA support 
and simplifies memory unmapping. Combined, these changes lay the foundation for 
even more improvements in ZGC, such as replacing multi-mapped memory with 
anonymous memory.

# Why a Mapped Cache?

The main benefit of the Mapped Cache is that adjacent virtual memory ranges in 
the cache can be merged to create larger ranges, enabling larger allocation 
requests to succeed more easily. Most notably, it allows allocations to succeed 
more often without "harvesting" smaller, discontiguous ranges. Harvesting 
negatively impacts both fragmentation and latency, as it requires remapping 
memory into a new contiguous virtual address range. Fragmentation becomes 
especially problematic in long-running programs and in environments with 
limited address space, where finding large contiguous regions can be difficult 
and may lead to a premature OutOfMemoryError (OOME).
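
To make the merging concrete, here is a minimal sketch (not the actual ZGC 
code) of inserting a freed range into a cache and coalescing it with its 
neighbors, using a `std::map` in place of the real tree:

```cpp
#include <cstdint>
#include <map>

// Hypothetical sketch: a cache of free virtual ranges keyed by start address.
// ZGC's real Mapped Cache uses an intrusive red-black tree instead.
std::map<uintptr_t, size_t> free_ranges; // start -> size

void insert_and_coalesce(uintptr_t start, size_t size) {
  // Merge with the following range if it begins exactly where we end.
  auto next = free_ranges.find(start + size);
  if (next != free_ranges.end()) {
    size += next->second;
    free_ranges.erase(next);
  }
  // Merge with the preceding range if it ends exactly where we begin.
  auto prev = free_ranges.lower_bound(start);
  if (prev != free_ranges.begin()) {
    --prev;
    if (prev->first + prev->second == start) {
      prev->second += size;
      return;
    }
  }
  free_ranges[start] = size;
}
```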

The Mapped Cache uses a self-balancing binary search tree to store memory 
ranges. Since the ranges are unused when inside the cache, the tree can use 
this memory to store metadata about itself, referred to as intrusive storage. 
This approach eliminates the need for dynamic memory allocation (e.g., malloc), 
which could otherwise introduce a latency overhead.
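
A minimal sketch of the intrusive idea, with hypothetical names (the real node 
layout in ZGC differs): because a cached range is unused but still mapped, the 
node describing the range can live in the first bytes of the range itself:

```cpp
#include <cstddef>
#include <new>

// Hypothetical sketch of intrusive storage: the tree node sits inside the
// free range it describes, so no dynamic allocation is needed.
struct FreeRangeNode {
  FreeRangeNode* left;   // red-black tree links
  FreeRangeNode* right;
  FreeRangeNode* parent;
  bool           red;    // node color
  size_t         size;   // size of the free range this node describes
};

// Cache a freed range: the first bytes of the (unused but still mapped)
// range hold the tree node itself, so no malloc is required.
FreeRangeNode* cache_range(void* range_start, size_t range_size) {
  // Tree insertion and rebalancing omitted.
  return new (range_start) FreeRangeNode{nullptr, nullptr, nullptr,
                                         /*red=*/true, range_size};
}
```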

# Fragmentation

Currently, ZGC has multiple strategies for dealing with fragmentation. In some 
edge cases, these strategies are not as efficient as we would like. By 
addressing fragmentation differently with the Mapped Cache, ZGC is in a better 
position to avoid such edge cases, which are costly even if they occur only 
once. This is especially impactful for programs running with a large heap.

## Virtual Memory Shuffling

In addition to the Mapped Cache, we have made some adjustments to how ZGC deals 
with virtual memory. Harvested memory must be remapped, which requires first 
claiming new contiguous virtual memory. We have added a feature in which the 
virtual memory of the harvested ranges can be reused, improving the likelihood 
of finding a contiguous range. Additionally, we have redesigned the 
defragmentation policy so that Large pages are always defragmented when freed: 
they are broken down and remapped into lower address space, in the hope of 
"filling holes" and creating larger contiguous ranges.

# NUMA and Partitions

In the current policy, ZGC interleaves memory across all NUMA nodes with a 
granularity of ZGranuleSize (2MB), which is the same size as a Small page. As a 
result, Small pages will end up on a single, preferably local, NUMA node, 
whilst larger allocations will likely end up on multiple NUMA nodes. In the 
new design, the policy is to prefer the local NUMA node for *all* allocation 
sizes whenever possible. Consequently, ZGC may be able to extract better 
performance from NUMA systems.
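
One way to picture the policy change is with the Linux `mbind(2)` call; this 
is an illustration with simplified helpers, not the actual mechanism ZGC uses:

```cpp
#include <numaif.h> // mbind; Linux-only, link with -lnuma
#include <cstddef>

static const size_t ZGranuleSize = 2 * 1024 * 1024; // 2MB granule

// Old policy (illustrative): place each 2MB granule on a NUMA node in
// round-robin order, so any allocation larger than one granule spans nodes.
void bind_interleaved(char* addr, size_t size, int num_nodes) {
  for (size_t i = 0; i * ZGranuleSize < size; i++) {
    unsigned long mask = 1UL << (i % num_nodes); // node for this granule
    mbind(addr + i * ZGranuleSize, ZGranuleSize, MPOL_BIND,
          &mask, 8 * sizeof(mask), 0);
  }
}

// New policy (illustrative): prefer the local node for the whole range,
// regardless of allocation size; the kernel falls back to other nodes
// only when the preferred node is out of memory.
void bind_local(char* addr, size_t size, int local_node) {
  unsigned long mask = 1UL << local_node;
  mbind(addr, size, MPOL_PREFERRED, &mask, 8 * sizeof(mask), 0);
}
```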

To support local NUMA allocations, the Page Allocator, and in turn the Java 
heap, has been split up into what we refer to as Partitions. A partition keeps 
track of its own heap size and Mapped Cache, allowing it to handle only the 
memory associated with its own share of the heap. The number of partitions 
currently matches the number of NUMA nodes; on non-NUMA systems, a single 
partition is used.
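
A hypothetical sketch of the partition layout (all names here are 
illustrative, not the actual ZGC types):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real ZGC types and queries.
struct MappedCacheSketch { /* tree of free mapped ranges, as sketched above */ };
uint32_t numa_node_count();   // number of NUMA nodes (1 on non-NUMA systems)
uint32_t current_numa_node(); // node of the allocating thread

// Each partition owns a share of the heap and its own Mapped Cache.
struct PartitionSketch {
  uint32_t          numa_id;      // the NUMA node this partition serves
  size_t            max_capacity; // this partition's share of the max heap size
  size_t            used;         // memory currently handed out
  MappedCacheSketch cache;        // mapped memory cached for reuse
};

// One partition per NUMA node; a single partition on non-NUMA systems.
std::vector<PartitionSketch> partitions(numa_node_count());

// Allocations first try the local partition (see the NUMA policy above).
PartitionSketch& partition_for_allocation() {
  return partitions[current_numa_node()];
}
```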

The introduction of partitions also establishes a foundation for more 
fine-grained control over the heap, paving the way for future enhancements, 
both further NUMA improvements and new features, such as Thread-Local GC.

# Defragmentation (Unmapping Memory)

Up until now, ZGC has unmapped memory asynchronously in a separate thread, so 
that other threads do not take a latency hit when unmapping memory. The main 
consumer of asynchronous unmapping has been harvesting, especially from a 
mutator thread, where synchronous unmapping could introduce unwanted latency.

With the introduction of the Mapped Cache, and by moving defragmentation away 
from mutator threads to the GC, asynchronous unmapping is no longer necessary 
to meet our latency goals. Instead, memory is now unmapped synchronously. The 
number of times memory is defragmented for page allocations has been reduced 
significantly: memory for Small pages never needs to be defragmented at all, 
and for Large pages, defragmentation has little effect on total latency, as 
they are costly to allocate anyway. For Medium pages, we have plans for future 
enhancements where memory is defragmented even less, or not at all.
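
The resulting policy could be summarized roughly as follows (a sketch; the 
enum mirrors ZGC's page types, but the function is ours):

```cpp
// Sketch of the per-page-type policy described above; the enum mirrors
// ZGC's Small/Medium/Large page types, the function itself is illustrative.
enum class PageTypeSketch { Small, Medium, Large };

bool defragment_when_freed(PageTypeSketch type) {
  switch (type) {
    case PageTypeSketch::Small:  return false; // never needs defragmentation
    case PageTypeSketch::Large:  return true;  // always remapped on free; the
                                               // cost is small next to allocation
    case PageTypeSketch::Medium: return true;  // today; future work aims to
                                               // defragment less, or not at all
  }
  return false;
}
```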

For clarity: with the removal of asynchronous unmapping, we have removed the 
ZUnmapper thread and the ZUnmap JFR event.

# Multi-Mapped Memory

Asynchronous unmapping has so far been possible because ZGC is backed by shared 
memory (on Linux), which allows memory to be multi-mapped. This is an artifact 
of non-generational ZGC, which used multi-mapping in its core design (see 
[this](https://wiki.openjdk.org/display/zgc/Pointer+Metadata+using+Multi-Mapped+memory)
 resource for more info). A goal we have in ZGC is to move from shared memory 
to anonymous memory. Anonymous memory has multiple benefits, one of them being 
easier configuration of Transparent Huge Pages (OS pages). However, anonymous 
memory does not support multi-mapping, so the transition has been blocked by 
the asynchronous unmapping feature. With asynchronous unmapping removed, we are 
now better prepared for transitioning to anonymous memory.
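
To illustrate the distinction on Linux (a sketch, not ZGC code): shared memory 
has a file descriptor that can be mapped at several addresses, while an 
anonymous mapping offers no handle to map a second time:

```cpp
#include <sys/mman.h> // mmap, memfd_create (Linux, needs _GNU_SOURCE)
#include <unistd.h>   // ftruncate
#include <cstddef>

// Shared memory can be multi-mapped: the same fd can be mapped at
// several virtual addresses, all aliasing the same physical memory.
void multi_map_shared(size_t size, void* addr1, void* addr2) {
  int fd = memfd_create("heap", 0);
  ftruncate(fd, (off_t)size);
  mmap(addr1, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
  mmap(addr2, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
}

// Anonymous memory has no fd, so there is nothing to map a second time:
// multi-mapping is not possible.
void* map_anonymous(size_t size) {
  return mmap(nullptr, size, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```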

# Additional Notes

This RFE comes with our own implementation of a red-black tree for the Mapped 
Cache. Another red-black tree was recently introduced by C. Norrbin in 
[JDK-8345314](https://bugs.openjdk.org/browse/JDK-8345314) (and enhanced in 
[JDK-8349211](https://bugs.openjdk.org/browse/JDK-8349211)). Our goal is to 
integrate with our own implementation initially, and to replace it with 
Norrbin's tree in a future RFE. We have our own tree implementation because 
Norrbin's tree was not yet finished when we were developing and testing this 
RFE.

Some new additions have been made to keep the current functionality in the 
Serviceability Agent (SA).

# Testing

* Oracle's tiers 1-8
* We have added a small set of new tests, both gtests and jtreg tests, covering 
the new functionality

# Performance

* Improvements in tail latency in SPECjbb2015.

* Improvements when using small OS pages in combination with NUMA.

* Small increase in the time it takes to run a GC, because some work has been 
moved from mutator threads to GC threads. This should not affect the total 
run-time of a program, as the total amount of work remains the same, while 
mutator latency improves.

* Other suitable benchmarks show no significant improvements or regressions.

-------------

Commit messages:
 - Whitespace fix in zunittest.hpp
 - Copyright years
 - 8350441: ZGC: Overhaul Page Allocation

Changes: https://git.openjdk.org/jdk/pull/24547/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24547&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8350441
  Stats: 12052 lines in 118 files changed: 7936 ins; 3218 del; 898 mod
  Patch: https://git.openjdk.org/jdk/pull/24547.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24547/head:pull/24547

PR: https://git.openjdk.org/jdk/pull/24547
