On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:
> This fix addresses a performance regression found on some aarch64 processors, > namely the Apple M1, when we moved to a quarter round parallel implementation > in JDK-8349106. After making some improvements in the ordering of the > instructions in the 20-round loop we found that going back to a > block-parallel implementation was faster, but it definitely needed the > ordering changes for that to be the case. More importantly, the block > parallel implementation with the interleaving turns out to be faster on even > those processors that showed improvements when moving to the quarter round > parallel implementation. > > There is a spreadsheet attached to the JBS bug that shows 3 different > implementations relative to the current (QR-parallel with no interleaving) > implementation on 3 different ARM64 processors. Comparative benchmarks can > also be found below. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4517: > 4515: bSet[0] = v16; bSet[1] = v17; bSet[2] = v18; bSet[3] = v19; > 4516: cSet[0] = v20; cSet[1] = v21; cSet[2] = v22; cSet[3] = v23; > 4517: dSet[0] = v24; dSet[1] = v25; dSet[2] = v26; dSet[3] = v27; Suggestion: bSet[0] = workSt[4]; bSet[1] = workSt[5]; bSet[2] = workSt[6]; bSet[3] = workSt[7]; cSet[0] = workSt[8]; cSet[1] = workSt[9]; cSet[2] = workSt[10]; cSet[3] = workSt[11]; dSet[0] = workSt[12]; dSet[1] = workSt[13]; dSet[2] = workSt[14]; dSet[3] = workSt[15]; How about something like this? The mapping from index to register, and from specification to implementation, is easier for this reviewer to understand. or maybe (better?) define a function, such that: regs_for_quarter_round(bSet, workSt, 4, 5, 6, 7); regs_for_quarter_round(cSet, workSt, 8, 9, 10, 11); regs_for_quarter_round(dSet, workSt, 12, 13, 14, 15); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24420#discussion_r2028932064