> This fix addresses a performance regression seen on some aarch64 processors,
> namely the Apple M1, after we moved to a quarter-round-parallel
> implementation in JDK-8349106.  After reordering the instructions in the
> 20-round loop, we found that going back to a block-parallel implementation
> was faster, but only with the ordering changes in place.  More importantly,
> the block-parallel implementation with interleaving turns out to be faster
> even on the processors that showed improvements when we moved to the
> quarter-round-parallel implementation.
> 
> A spreadsheet attached to the JBS bug compares 3 different implementations
> against the current (QR-parallel with no interleaving) implementation on
> 3 different ARM64 processors.  Comparative benchmarks can also be found
> below.
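
For readers unfamiliar with the terms above, here is a minimal scalar sketch
of the ChaCha20 quarter round and the 20-round loop, following RFC 8439.  It
is not the aarch64 stub from this PR: the class and method names are
hypothetical, and it says nothing about the instruction ordering or
interleaving that the fix is about.  As the names suggest (this is a reading
of the description, not something the PR text spells out), a
quarter-round-parallel version would spread the four quarter rounds of one
block across vector lanes, while a block-parallel version would run the same
quarter round on several independent blocks at once.

```java
// Hypothetical illustration only; names are not taken from the JDK sources.
final class ChaChaRoundsSketch {

    // One ChaCha20 quarter round (RFC 8439, section 2.1) on four state words.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // The 20-round loop: 10 iterations of a column round plus a diagonal round.
    static void twentyRounds(int[] s) {
        for (int i = 0; i < 10; i++) {
            // Column round: the four quarter rounds touch disjoint words,
            // so they are independent of each other.
            quarterRound(s, 0, 4, 8, 12);
            quarterRound(s, 1, 5, 9, 13);
            quarterRound(s, 2, 6, 10, 14);
            quarterRound(s, 3, 7, 11, 15);
            // Diagonal round: again four independent quarter rounds.
            quarterRound(s, 0, 5, 10, 15);
            quarterRound(s, 1, 6, 11, 12);
            quarterRound(s, 2, 7, 8, 13);
            quarterRound(s, 3, 4, 9, 14);
        }
    }
}
```

Each quarter round is a strict dependency chain of add, xor and rotate, which
is why the order in which independent chains are issued can matter on a given
core.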

Jamil Nimeh has updated the pull request incrementally with one additional 
commit since the last revision:

  Regroup CRC32 stub generators together

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/24420/files
  - new: https://git.openjdk.org/jdk/pull/24420/files/fe865308..7ae8802d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24420&range=01-02

  Stats: 82 lines in 1 file changed: 41 ins; 41 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/24420.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24420/head:pull/24420

PR: https://git.openjdk.org/jdk/pull/24420
