On Mon, 21 Apr 2025 21:53:33 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> This fix addresses a performance regression found on some aarch64
>> processors, namely the Apple M1, when we moved to a quarter round parallel
>> implementation in JDK-8349106. After making some improvements in the
>> ordering of the instructions in the 20-round loop we found that going back
>> to a block-parallel implementation was faster, but it definitely needed the
>> ordering changes for that to be the case. More importantly, the block
>> parallel implementation with the interleaving turns out to be faster on even
>> those processors that showed improvements when moving to the quarter round
>> parallel implementation.
>>
>> There is a spreadsheet attached to the JBS bug that shows 3 different
>> implementations relative to the current (QR-parallel with no interleaving)
>> implementation on 3 different ARM64 processors. Comparative benchmarks can
>> also be found below.
>
> Jamil Nimeh has updated the pull request with a new target base due to a
> merge or a rebase. The incremental webrev excludes the unrelated changes
> brought in by the merge/rebase. The pull request contains four additional
> commits since the last revision:
>
>  - Merge with main
>  - Regroup CRC32 stub generators together
>  - Place columnar/diagonal alignment code into separate method
>  - 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX
>    aarch64

Marked as reviewed by aph (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/24420#pullrequestreview-2784665815