On Fri, 18 Apr 2025 01:11:48 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> This fix addresses a performance regression found on some aarch64 
>> processors, namely the Apple M1, when we moved to a quarter round parallel 
>> implementation in JDK-8349106.  After making some improvements in the 
>> ordering of the instructions in the 20-round loop we found that going back 
>> to a block-parallel implementation was faster, but it definitely needed the 
>> ordering changes for that to be the case.  More importantly, the 
>> block-parallel implementation with the interleaving turns out to be faster 
>> even on those processors that showed improvements when moving to the quarter round 
>> parallel implementation.
>> 
>> There is a spreadsheet attached to the JBS bug that shows 3 different 
>> implementations relative to the current (QR-parallel with no interleaving) 
>> implementation on 3 different ARM64 processors.  Comparative benchmarks can 
>> also be found below.
>
> Jamil Nimeh has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Regroup CRC32 stub generators together

Hi @theRealAph, just wanted to check in and see if you were happy with the 
function for the column/diagonal reassignments.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24420#issuecomment-2819028037
