On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh <jni...@openjdk.org> wrote:

> This fix addresses a performance regression found on some aarch64 processors, 
> namely the Apple M1, when we moved to a quarter round parallel implementation 
> in JDK-8349106.  After making some improvements in the ordering of the 
> instructions in the 20-round loop we found that going back to a 
> block-parallel implementation was faster, but it definitely needed the 
> ordering changes for that to be the case.  More importantly, the block 
> parallel implementation with the interleaving turns out to be faster on even 
> those processors that showed improvements when moving to the quarter round 
> parallel implementation.
> 
> There is a spreadsheet attached to the JBS bug that shows 3 different 
> implementations relative to the current (QR-parallel with no interleaving) 
> implementation on 3 different ARM64 processors.  Comparative benchmarks can 
> also be found below.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4517:

> 4515:     bSet[0] = v16; bSet[1] = v17; bSet[2] = v18; bSet[3] = v19;
> 4516:     cSet[0] = v20; cSet[1] = v21; cSet[2] = v22; cSet[3] = v23;
> 4517:     dSet[0] = v24; dSet[1] = v25; dSet[2] = v26; dSet[3] = v27;

Suggestion:

    bSet[0] = workSt[4];  bSet[1] = workSt[5];  bSet[2] = workSt[6];  bSet[3] = 
workSt[7];
    cSet[0] = workSt[8];  cSet[1] = workSt[9];  cSet[2] = workSt[10]; cSet[3] = 
workSt[11];
    dSet[0] = workSt[12]; dSet[1] = workSt[13]; dSet[2] = workSt[14]; dSet[3] = 
workSt[15];

How about something like this? The mapping from index to register, and from 
specification to implementation, is easier for this reviewer to understand. 

or maybe (better?) define a function, such that:


regs_for_quarter_round(bSet, workSt, 4, 5, 6, 7);
regs_for_quarter_round(cSet, workSt, 8, 9, 10, 11);
regs_for_quarter_round(dSet, workSt, 12, 13, 14, 15);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24420#discussion_r2028932064

Reply via email to