> This fix addresses a performance regression found on some aarch64 processors, 
> namely the Apple M1, after we moved to a quarter-round-parallel implementation 
> in JDK-8349106.  After improving the ordering of the instructions in the 
> 20-round loop, we found that going back to a block-parallel implementation 
> was faster, but only with those ordering changes in place.  More importantly, 
> the block-parallel implementation with the interleaving turns out to be 
> faster even on those processors that showed improvements when moving to the 
> quarter-round-parallel implementation.
> 
> There is a spreadsheet attached to the JBS bug that shows 3 different 
> implementations relative to the current (QR-parallel with no interleaving) 
> implementation on 3 different ARM64 processors.  Comparative benchmarks can 
> also be found below.
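
For readers unfamiliar with the two layouts named in the description, here is a minimal scalar Java sketch. It is not the aarch64 stub code from this pull request; the class and method names (`ChaChaSketch`, `quarterRoundLanes`, `layoutsAgree`) are hypothetical and the test state is arbitrary. The quarter round and the column/diagonal index pattern follow RFC 8439. In the block-parallel layout each "vector" holds the same state word from several blocks, so every step is applied across all lanes at once. The instruction interleaving discussed above is a scheduling matter in the generated assembly and cannot be expressed at this level.

```java
public class ChaChaSketch {
    // One ChaCha quarter round (RFC 8439, section 2.1) on four words of one block.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // One double round on a single block: four column quarter rounds,
    // then four diagonal ones. Ten of these make the 20-round loop.
    static void doubleRound(int[] s) {
        quarterRound(s, 0, 4, 8, 12);  quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14); quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15); quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);  quarterRound(s, 3, 4, 9, 14);
    }

    // Lane-wise helpers standing in for SIMD add and xor-then-rotate.
    static void addInto(int[] x, int[] y) {
        for (int l = 0; l < x.length; l++) x[l] += y[l];
    }

    static void xorRotInto(int[] x, int[] y, int r) {
        for (int l = 0; l < x.length; l++) x[l] = Integer.rotateLeft(x[l] ^ y[l], r);
    }

    // Block-parallel quarter round: w[i][lane] is state word i of block `lane`.
    static void quarterRoundLanes(int[][] w, int a, int b, int c, int d) {
        addInto(w[a], w[b]); xorRotInto(w[d], w[a], 16);
        addInto(w[c], w[d]); xorRotInto(w[b], w[c], 12);
        addInto(w[a], w[b]); xorRotInto(w[d], w[a], 8);
        addInto(w[c], w[d]); xorRotInto(w[b], w[c], 7);
    }

    static void doubleRoundLanes(int[][] w) {
        quarterRoundLanes(w, 0, 4, 8, 12);  quarterRoundLanes(w, 1, 5, 9, 13);
        quarterRoundLanes(w, 2, 6, 10, 14); quarterRoundLanes(w, 3, 7, 11, 15);
        quarterRoundLanes(w, 0, 5, 10, 15); quarterRoundLanes(w, 1, 6, 11, 12);
        quarterRoundLanes(w, 2, 7, 8, 13);  quarterRoundLanes(w, 3, 4, 9, 14);
    }

    // Runs 10 double rounds per block and across lanes, then compares the results.
    static boolean layoutsAgree(int lanes) {
        int[][] blocks = new int[lanes][16];
        int[][] words = new int[16][lanes];
        for (int l = 0; l < lanes; l++) {
            for (int i = 0; i < 16; i++) {
                // Arbitrary test state; word 12 plays the per-block counter.
                int v = (i == 12) ? l : 0x9e3779b9 * (i + 1);
                blocks[l][i] = v;
                words[i][l] = v;
            }
        }
        for (int r = 0; r < 10; r++) {
            for (int[] b : blocks) doubleRound(b);
            doubleRoundLanes(words);
        }
        for (int l = 0; l < lanes; l++)
            for (int i = 0; i < 16; i++)
                if (blocks[l][i] != words[i][l]) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println("layouts agree: " + layoutsAgree(4));
    }
}
```

Both layouts perform the same arithmetic; they differ only in which values share a vector register, which is why the choice between them is purely a performance question.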

Jamil Nimeh has updated the pull request with a new target base due to a merge 
or a rebase. The incremental webrev excludes the unrelated changes brought in 
by the merge/rebase. The pull request contains four additional commits since 
the last revision:

 - Merge with main
 - Regroup CRC32 stub generators together
 - Place columnar/diagonal alignment code into separate method
 - 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/24420/files
  - new: https://git.openjdk.org/jdk/pull/24420/files/7ae8802d..b6fb9136

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=02-03

  Stats: 251518 lines in 1780 files changed: 55210 ins; 190355 del; 5953 mod
  Patch: https://git.openjdk.org/jdk/pull/24420.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24420/head:pull/24420

PR: https://git.openjdk.org/jdk/pull/24420
