On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh <[email protected]> wrote:
> This fix addresses a performance regression found on some aarch64 processors,
> namely the Apple M1, when we moved to a quarter round parallel implementation
> in JDK-8349106. After making some improvements in the ordering of the
> instructions in the 20-round loop we found that going back to a
> block-parallel implementation was faster, but it definitely needed the
> ordering changes for that to be the case. More importantly, the block
> parallel implementation with the interleaving turns out to be faster on even
> those processors that showed improvements when moving to the quarter round
> parallel implementation.
>
> There is a spreadsheet attached to the JBS bug that shows 3 different
> implementations relative to the current (QR-parallel with no interleaving)
> implementation on 3 different ARM64 processors. Comparative benchmarks can
> also be found below.
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4517:
> 4515: bSet[0] = v16; bSet[1] = v17; bSet[2] = v18; bSet[3] = v19;
> 4516: cSet[0] = v20; cSet[1] = v21; cSet[2] = v22; cSet[3] = v23;
> 4517: dSet[0] = v24; dSet[1] = v25; dSet[2] = v26; dSet[3] = v27;
Suggestion:
bSet[0] = workSt[4]; bSet[1] = workSt[5]; bSet[2] = workSt[6]; bSet[3] =
workSt[7];
cSet[0] = workSt[8]; cSet[1] = workSt[9]; cSet[2] = workSt[10]; cSet[3] =
workSt[11];
dSet[0] = workSt[12]; dSet[1] = workSt[13]; dSet[2] = workSt[14]; dSet[3] =
workSt[15];
How about something like this? The mapping from index to register, and from
specification to implementation, is easier for this reviewer to understand.
or maybe (better?) define a function, such that:
regs_for_quarter_round(bSet, workSt, 4, 5, 6, 7);
regs_for_quarter_round(cSet, workSt, 8, 9, 10, 11);
regs_for_quarter_round(dSet, workSt, 12, 13, 14, 15);
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/24420#discussion_r2028932064