On Mon, 21 Apr 2025 21:53:33 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> This fix addresses a performance regression found on some aarch64 
>> processors, namely the Apple M1, when we moved to a quarter-round-parallel 
>> implementation in JDK-8349106.  After making some improvements in the 
>> ordering of the instructions in the 20-round loop we found that going back 
>> to a block-parallel implementation was faster, but it definitely needed the 
>> ordering changes for that to be the case.  More importantly, the 
>> block-parallel implementation with the interleaving turns out to be faster 
>> even on those processors that showed improvements when moving to the 
>> quarter-round-parallel implementation.
>> 
>> There is a spreadsheet attached to the JBS bug that shows 3 different 
>> implementations relative to the current (QR-parallel with no interleaving) 
>> implementation on 3 different ARM64 processors.  Comparative benchmarks can 
>> also be found below.
>
> Jamil Nimeh has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains four additional 
> commits since the last revision:
> 
>  - Merge with main
>  - Regroup CRC32 stub generators together
>  - Place columnar/diagonal alignment code into separate method
>  - 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

Marked as reviewed by aph (Reviewer).

-------------

PR Review: https://git.openjdk.org/jdk/pull/24420#pullrequestreview-2784665815
