This fix addresses a performance regression seen on some aarch64 processors, 
notably the Apple M1, after the move to a quarter-round-parallel 
implementation in JDK-8349106.  After improving the ordering of the 
instructions in the 20-round loop, we found that going back to a 
block-parallel implementation was faster, but only with the ordering changes 
in place.  More importantly, the block-parallel implementation with 
interleaving turns out to be faster even on those processors that showed 
improvements when moving to the quarter-round-parallel implementation.
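
For readers unfamiliar with the two layouts, here is a minimal scalar sketch 
in Java of what "block-parallel" means. It is not the intrinsic code from this 
change. The class name, the LANES constant and the int[16][LANES] array 
standing in for vector registers are illustrative assumptions; only the 
quarter round itself follows RFC 8439. In the block-parallel layout each 
"vector" holds the same state word from several blocks, so a quarter round is 
a purely lane-wise operation. A quarter-round-parallel layout, as we read it, 
instead keeps the words of a single block in one vector and runs the four 
quarter rounds of a pass side by side.

```java
final class ChaChaLayoutSketch {
    // Assumption for illustration: four 32-bit lanes per 128-bit vector.
    static final int LANES = 4;

    // One ChaCha20 quarter round on a single block (RFC 8439, section 2.1).
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Block-parallel layout: v[w][l] is state word w of block l, so each v[w]
    // stands in for one vector register and every step is lane-wise.
    static void quarterRoundLanes(int[][] v, int a, int b, int c, int d) {
        for (int l = 0; l < LANES; l++) {
            v[a][l] += v[b][l]; v[d][l] = Integer.rotateLeft(v[d][l] ^ v[a][l], 16);
            v[c][l] += v[d][l]; v[b][l] = Integer.rotateLeft(v[b][l] ^ v[c][l], 12);
            v[a][l] += v[b][l]; v[d][l] = Integer.rotateLeft(v[d][l] ^ v[a][l], 8);
            v[c][l] += v[d][l]; v[b][l] = Integer.rotateLeft(v[b][l] ^ v[c][l], 7);
        }
    }

    // Column pass followed by diagonal pass, for one block.
    static void doubleRound(int[] s) {
        quarterRound(s, 0, 4, 8, 12);  quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14); quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15); quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);  quarterRound(s, 3, 4, 9, 14);
    }

    // The same double round applied to LANES blocks at once.
    static void doubleRoundLanes(int[][] v) {
        quarterRoundLanes(v, 0, 4, 8, 12);  quarterRoundLanes(v, 1, 5, 9, 13);
        quarterRoundLanes(v, 2, 6, 10, 14); quarterRoundLanes(v, 3, 7, 11, 15);
        quarterRoundLanes(v, 0, 5, 10, 15); quarterRoundLanes(v, 1, 6, 11, 12);
        quarterRoundLanes(v, 2, 7, 8, 13);  quarterRoundLanes(v, 3, 4, 9, 14);
    }

    public static void main(String[] args) {
        int[][] v = new int[16][LANES];      // block-parallel layout
        int[][] blocks = new int[LANES][16]; // one plain state per block
        for (int l = 0; l < LANES; l++) {
            for (int w = 0; w < 16; w++) {
                int x = 0x9e3779b9 * (16 * l + w + 1); // arbitrary test data
                v[w][l] = x;
                blocks[l][w] = x;
            }
        }
        // 10 double rounds = 20 rounds.
        for (int i = 0; i < 10; i++) {
            doubleRoundLanes(v);
            for (int l = 0; l < LANES; l++) doubleRound(blocks[l]);
        }
        boolean same = true;
        for (int l = 0; l < LANES; l++)
            for (int w = 0; w < 16; w++)
                same &= (v[w][l] == blocks[l][w]);
        System.out.println("layouts agree: " + same);
    }
}
```

In either layout, the four quarter rounds of a column (or diagonal) pass touch 
disjoint state words, which is what leaves room for the kind of instruction 
reordering described above. The sketch does not attempt to model that 
scheduling.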

There is a spreadsheet attached to the JBS bug that shows 3 different 
implementations relative to the current (QR-parallel with no interleaving) 
implementation on 3 different ARM64 processors.  Comparative benchmarks can 
also be found below.

-------------

Commit messages:
 - 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

Changes: https://git.openjdk.org/jdk/pull/24420/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8350126
  Stats: 488 lines in 3 files changed: 238 ins; 214 del; 36 mod
  Patch: https://git.openjdk.org/jdk/pull/24420.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24420/head:pull/24420

PR: https://git.openjdk.org/jdk/pull/24420
