On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
> This fix addresses a performance regression found on some aarch64 processors,
> namely the Apple M1, when we moved to a quarter round parallel implementation
> in JDK-8349106. After making some improvements in the ordering of the
> instructions in the 20-round loop we found that going back to a
> block-parallel implementation was faster, but it definitely needed the
> ordering changes for that to be the case. More importantly, the block
> parallel implementation with the interleaving turns out to be faster on even
> those processors that showed improvements when moving to the quarter round
> parallel implementation.
>
> There is a spreadsheet attached to the JBS bug that shows 3 different
> implementations relative to the current (QR-parallel with no interleaving)
> implementation on 3 different ARM64 processors. Comparative benchmarks can
> also be found below.

This pull request has now been integrated.

Changeset: 594b2651
Author:    Jamil Nimeh <jni...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/594b26516e5c01d7daa331db59bdbe8ab7dc0a6d
Stats:     395 lines in 3 files changed: 137 ins; 80 del; 178 mod

8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

Reviewed-by: aph

-------------

PR: https://git.openjdk.org/jdk/pull/24420