On Fri, 18 Apr 2025 01:11:48 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

>> This fix addresses a performance regression found on some aarch64
>> processors, namely the Apple M1, when we moved to a quarter round parallel
>> implementation in JDK-8349106. After making some improvements in the
>> ordering of the instructions in the 20-round loop we found that going back
>> to a block-parallel implementation was faster, but it definitely needed the
>> ordering changes for that to be the case. More importantly, the block
>> parallel implementation with the interleaving turns out to be faster on even
>> those processors that showed improvements when moving to the quarter round
>> parallel implementation.
>>
>> There is a spreadsheet attached to the JBS bug that shows 3 different
>> implementations relative to the current (QR-parallel with no interleaving)
>> implementation on 3 different ARM64 processors. Comparative benchmarks can
>> also be found below.
>
> Jamil Nimeh has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Regroup CRC32 stub generators together

Hi @theRealAph, just wanted to check in and see if you were happy with the
function for the column/diagonal reassignments.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24420#issuecomment-2819028037