> This fix addresses a performance regression found on some aarch64 processors,
> namely the Apple M1, when we moved to a quarter round parallel implementation
> in JDK-8349106. After making some improvements in the ordering of the
> instructions in the 20-round loop we found that going back to a
> block-parallel implementation was faster, but it definitely needed the
> ordering changes for that to be the case. More importantly, the block
> parallel implementation with the interleaving turns out to be faster on even
> those processors that showed improvements when moving to the quarter round
> parallel implementation.
>
> There is a spreadsheet attached to the JBS bug that shows 3 different
> implementations relative to the current (QR-parallel with no interleaving)
> implementation on 3 different ARM64 processors. Comparative benchmarks can
> also be found below.
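
For readers unfamiliar with the terms used in the description, here is a minimal scalar sketch in Java of a ChaCha20 quarter round and the 20-round loop as laid out in RFC 8439. It assumes the intrinsic under discussion is ChaCha20, which the "quarter round" and "20-round loop" wording suggests but the text above does not state outright. It is not the aarch64 stub code from this PR, and the names `ChaChaSketch`, `quarterRound`, and `twentyRounds` are hypothetical, chosen only for illustration.

```java
public final class ChaChaSketch {

    // One quarter round (RFC 8439, section 2.1) on four words of the
    // 16-word state, selected by the indices a, b, c, d.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // The 20 rounds, written as 10 double rounds: four column quarter
    // rounds followed by four diagonal quarter rounds. The final addition
    // of the original state, which completes the block function, is
    // deliberately left out of this sketch.
    static void twentyRounds(int[] s) {
        for (int i = 0; i < 10; i++) {
            // Column round
            quarterRound(s, 0, 4, 8, 12);
            quarterRound(s, 1, 5, 9, 13);
            quarterRound(s, 2, 6, 10, 14);
            quarterRound(s, 3, 7, 11, 15);
            // Diagonal round
            quarterRound(s, 0, 5, 10, 15);
            quarterRound(s, 1, 6, 11, 12);
            quarterRound(s, 2, 7, 8, 13);
            quarterRound(s, 3, 4, 9, 14);
        }
    }
}
```

Roughly speaking, and as an interpretation rather than a description of the stub: the four quarter rounds within a column or diagonal round touch disjoint words, so a quarter round parallel layout can compute them together in vector lanes for a single block, whereas a block-parallel layout runs the same sequence of operations across several independent blocks at once. The instruction ordering and interleaving discussed above are scheduling concerns in the generated assembly and have no counterpart in this scalar version.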

Jamil Nimeh has updated the pull request incrementally with one additional commit since the last revision:

  Regroup CRC32 stub generators together

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/24420/files
  - new: https://git.openjdk.org/jdk/pull/24420/files/fe865308..7ae8802d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=01-02

  Stats: 82 lines in 1 file changed: 41 ins; 41 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/24420.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24420/head:pull/24420

PR: https://git.openjdk.org/jdk/pull/24420