> This fix addresses a performance regression found on some aarch64 processors,
> namely the Apple M1, when we moved to a quarter round parallel implementation
> in JDK-8349106. After making some improvements in the ordering of the
> instructions in the 20-round loop we found that going back to a
> block-parallel implementation was faster, but it definitely needed the
> ordering changes for that to be the case. More importantly, the block
> parallel implementation with the interleaving turns out to be faster on even
> those processors that showed improvements when moving to the quarter round
> parallel implementation.
>
> There is a spreadsheet attached to the JBS bug that shows 3 different
> implementations relative to the current (QR-parallel with no interleaving)
> implementation on 3 different ARM64 processors. Comparative benchmarks can
> also be found below.
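
For readers unfamiliar with the terms in the description, here is a minimal scalar sketch of the ChaCha20 quarter round and the 20-round loop as specified in RFC 8439. It is not the aarch64 intrinsic from this pull request and says nothing about its NEON instruction ordering or interleaving; the class and method names (`ChaChaRoundSketch`, `quarterRound`, `doubleRound`, `twentyRounds`) are hypothetical and chosen only for illustration.

```java
// Illustrative only: a plain-Java rendering of the ChaCha20 round structure
// from RFC 8439. All names here are hypothetical, not taken from the JDK.
final class ChaChaRoundSketch {

    // One quarter round over four words of the 16-word state
    // (RFC 8439, section 2.1): add, xor, rotate by 16, 12, 8, 7.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // One double round: four quarter rounds on the columns of the 4x4
    // state, then four on its diagonals. The four quarter rounds within
    // each group touch disjoint words, so they are independent.
    static void doubleRound(int[] s) {
        // Columnar quarter rounds
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        // Diagonal quarter rounds
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    // The 20-round loop: ten double rounds applied to one block's state.
    static void twentyRounds(int[] s) {
        for (int i = 0; i < 10; i++) {
            doubleRound(s);
        }
    }
}
```

In this picture, the independence of the four quarter rounds inside each group is one source of parallelism, and the independence of separate keystream blocks (each with its own 16-word state) is another. How the intrinsic maps either onto vector registers is defined by the patch itself, not by this sketch.
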
Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision:

  Place columnar/diagonal alignment code into separate method

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/24420/files
  - new: https://git.openjdk.org/jdk/pull/24420/files/b530e166..fe865308

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=00-01

  Stats: 39 lines in 3 files changed: 33 ins; 0 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/24420.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24420/head:pull/24420

PR: https://git.openjdk.org/jdk/pull/24420