> This fix addresses a performance regression found on some aarch64 processors,
> namely the Apple M1, when we moved to a quarter round parallel implementation
> in JDK-8349106. After making some improvements in the ordering of the
> instructions in the 20-round loop we found that going back to a
> block-parallel implementation was faster, but it definitely needed the
> ordering changes for that to be the case. More importantly, the block
> parallel implementation with the interleaving turns out to be faster on even
> those processors that showed improvements when moving to the quarter round
> parallel implementation.
>
> There is a spreadsheet attached to the JBS bug that shows 3 different
> implementations relative to the current (QR-parallel with no interleaving)
> implementation on 3 different ARM64 processors. Comparative benchmarks can
> also be found below.
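
For readers less familiar with the terms in the description above, here is a minimal scalar sketch of the ChaCha20 quarter round and the 20-round loop as defined in RFC 8439. It is an illustration only and not the code in this pull request: the actual intrinsic is hand-scheduled AArch64 SIMD code, and the class and method names below (`ChaChaRoundSketch`, `quarterRound`, `twentyRounds`) are hypothetical.

```java
// Scalar reference sketch of the ChaCha20 round structure (RFC 8439, sections 2.1 to 2.3).
// Illustration only; names are hypothetical and do not come from the JDK sources.
final class ChaChaRoundSketch {

    // One quarter round over four 32-bit words of the 16-word state.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // The 20-round loop: 10 iterations, each with 4 column and 4 diagonal quarter rounds.
    static void twentyRounds(int[] s) {
        for (int i = 0; i < 10; i++) {
            // Column rounds
            quarterRound(s, 0, 4, 8, 12);
            quarterRound(s, 1, 5, 9, 13);
            quarterRound(s, 2, 6, 10, 14);
            quarterRound(s, 3, 7, 11, 15);
            // Diagonal rounds
            quarterRound(s, 0, 5, 10, 15);
            quarterRound(s, 1, 6, 11, 12);
            quarterRound(s, 2, 7, 8, 13);
            quarterRound(s, 3, 4, 9, 14);
        }
    }
}
```

The four quarter rounds within a column or diagonal step touch disjoint state words, so they are independent of each other. Roughly speaking, a vectorized implementation can use its SIMD lanes either for those independent quarter rounds of one block or for the same state word across several blocks; the quoted description compares those two lane mappings and the instruction ordering within the loop.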

Jamil Nimeh has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:

 - Merge with main
 - Regroup CRC32 stub generators together
 - Place columnar/diagonal alignment code into separate method
 - 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/24420/files
  - new: https://git.openjdk.org/jdk/pull/24420/files/7ae8802d..b6fb9136

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=02-03

  Stats: 251518 lines in 1780 files changed: 55210 ins; 190355 del; 5953 mod
  Patch: https://git.openjdk.org/jdk/pull/24420.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24420/head:pull/24420

PR: https://git.openjdk.org/jdk/pull/24420