On Fri, 4 Apr 2025 14:35:59 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> This fix addresses a performance regression found on some aarch64 
>> processors, namely the Apple M1, when we moved to a quarter round parallel 
>> implementation in JDK-8349106.  After making some improvements in the 
>> ordering of the instructions in the 20-round loop we found that going back 
>> to a block-parallel implementation was faster, but it definitely needed the 
>> ordering changes for that to be the case.  More importantly, the block 
>> parallel implementation with the interleaving turns out to be faster on even 
>> those processors that showed improvements when moving to the quarter round 
>> parallel implementation.
>> 
>> There is a spreadsheet attached to the JBS bug that shows 3 different 
>> implementations relative to the current (QR-parallel with no interleaving) 
>> implementation on 3 different ARM64 processors.  Comparative benchmarks can 
>> also be found below.
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4517:
> 
>> 4515:     bSet[0] = v16; bSet[1] = v17; bSet[2] = v18; bSet[3] = v19;
>> 4516:     cSet[0] = v20; cSet[1] = v21; cSet[2] = v22; cSet[3] = v23;
>> 4517:     dSet[0] = v24; dSet[1] = v25; dSet[2] = v26; dSet[3] = v27;
> 
> Suggestion:
> 
>     bSet[0] = workSt[4];  bSet[1] = workSt[5];  bSet[2] = workSt[6];  bSet[3] 
> = workSt[7];
>     cSet[0] = workSt[8];  cSet[1] = workSt[9];  cSet[2] = workSt[10]; cSet[3] 
> = workSt[11];
>     dSet[0] = workSt[12]; dSet[1] = workSt[13]; dSet[2] = workSt[14]; dSet[3] 
> = workSt[15];
> 
> How about something like this? The mapping from index to register, and from 
> specification to implementation, is easier for this reviewer to understand. 
> 
> or maybe (better?) define a function, such that:
> 
> 
> regs_for_quarter_round(bSet, workSt, 4, 5, 6, 7);
> regs_for_quarter_round(cSet, workSt, 8, 9, 10, 11);
> regs_for_quarter_round(dSet, workSt, 12, 13, 14, 15);

I like either approach, I'll try the function route.  Either way it definitely 
helps make things more clear.  Putting this in terms of the workSt indicies 
maps more closely to how things are described in the RFC.  Good call!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24420#discussion_r2029095293

Reply via email to