On Thu, 27 Mar 2025 21:42:08 GMT, Volodymyr Paprotski <vpaprot...@openjdk.org> 
wrote:

>> Ferenc Rakoczi has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Further readability improvements.
>>  - Added asserts for array sizes
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342:
> 
>> 340: // Performs two keccak() computations in parallel. The steps of the
>> 341: // two computations are executed interleaved.
>> 342: static address generate_double_keccak(StubGenerator *stubgen, 
>> MacroAssembler *_masm) {
> 
> This function seems ok. I didn't do as line-by-line an 'exact' review as for 
> the NTT intrinsics; I just put the new version into a diff next to the 
> original function. It seems like a reasonably clean 'refactor' (hardcoding 
> the block size and adding new input registers 10-14 makes it really easy to 
> spot vs. the original registers 0-4).
> 
> I didn't realize before that the 'top 3 limbs' are wasted. I guess it 
> doesn't matter; there are registers to spare aplenty, and it makes the 
> entire algorithm cleaner and easier to follow.
> 
> I also stared at the algorithm with the 'What about AVX2?' question. It 
> looks like this function would pretty much need to be rewritten :/
> 
> Last two questions:
> - How much performance is gained from doubling this function up?
> - If that's worth it, what if the input were quadrupled instead? (I scanned 
> the java code; it looked like NR was already parametrized to 2.) It looks 
> like there are almost enough registers here to go to 4 (I think 3 would need 
> to be freed up somehow; alternatively, the upper 3 limbs are empty in all 
> operations, so perhaps they could be used instead, at the expense of 
> readability.)

Well, the algorithm (keccak()) does the same operations on groups of 5 array 
elements (it works on what is essentially a 5x5 matrix, doing row and column 
operations, so putting 5 array entries in a vector register was the "natural" 
thing to do).

This function can only be used under very special circumstances, which occur 
during the generation of the "A matrix" in ML-KEM and ML-DSA; the speed of 
that matrix generation has almost doubled (I don't have exact numbers).
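To make the "two computations interleaved" idea concrete, here is a toy Java sketch (hypothetical names, and `step` is a stand-in placeholder, not a real Keccak round): one round loop advances two independent states, so the CPU can overlap instructions from one computation with latency gaps in the other.

```java
// Toy sketch of the interleaving pattern used by the double-keccak stub:
// two independent states stepped in one loop. The placeholder `step` is
// NOT a Keccak round; it only models an independent per-round update.
public class InterleaveSketch {
    static long step(long s, int round) {
        return Long.rotateLeft(s, 1) ^ round;  // placeholder round function
    }

    static long[] doubleRun(long a, long b) {
        for (int round = 0; round < 24; round++) {
            a = step(a, round);  // computation 1
            b = step(b, round);  // computation 2, fully independent of 1
        }
        return new long[] { a, b };
    }
}
```

The interleaved result is identical to running the two computations back to back; the win is purely in instruction-level parallelism, which is why it pays off when two independent hashes (two A-matrix entries) are needed at once.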

We are using 7 registers per state and 15 for the constants, so we have only 3 
to spare. We could perhaps juggle the constants, keeping only the ones needed 
next in registers and reloading the rest "just in time", but that might slow 
things down a bit: more load instructions executed, plus maybe some load 
delay. On the other hand, there would be more parallelism. I might try it out.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2024317665
