On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

> This fix addresses a performance regression found on some aarch64 processors,
> namely the Apple M1, when we moved to a quarter-round-parallel implementation
> in JDK-8349106.  After making some improvements in the ordering of the
> instructions in the 20-round loop we found that going back to a
> block-parallel implementation was faster, but it definitely needed the
> ordering changes for that to be the case.  More importantly, the
> block-parallel implementation with the interleaving turns out to be faster
> even on those processors that showed improvements when moving to the
> quarter-round-parallel implementation.
> 
> There is a spreadsheet attached to the JBS bug that shows 3 different 
> implementations relative to the current (QR-parallel with no interleaving) 
> implementation on 3 different ARM64 processors.  Comparative benchmarks can 
> also be found below.

This pull request has now been integrated.

Changeset: 594b2651
Author:    Jamil Nimeh <jni...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/594b26516e5c01d7daa331db59bdbe8ab7dc0a6d
Stats:     395 lines in 3 files changed: 137 ins; 80 del; 178 mod

8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

Reviewed-by: aph

-------------

PR: https://git.openjdk.org/jdk/pull/24420
