On Thu, 27 Mar 2025 21:42:08 GMT, Volodymyr Paprotski <vpaprot...@openjdk.org> wrote:
>> Ferenc Rakoczi has updated the pull request incrementally with two additional commits since the last revision:
>>
>>  - Further readability improvements.
>>  - Added asserts for array sizes
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342:
>
>> 340: // Performs two keccak() computations in parallel. The steps of the
>> 341: // two computations are executed interleaved.
>> 342: static address generate_double_keccak(StubGenerator *stubgen, MacroAssembler *_masm) {
>
> This function seems ok. I didn't do as line-by-line an 'exact' review as for the NTT intrinsics, but just put the new version into a diff next to the original function. It seems like a reasonably clean 'refactor' (hardcode the block size, add new input registers 10-14; that makes them really easy to spot vs. the original registers 0-4).
>
> I didn't realize before that the 'top 3 limbs' are wasted. I guess it doesn't matter; there are registers to spare aplenty, and it makes the entire algorithm cleaner and easier to follow.
>
> I did also stare at the algorithm with the 'What about AVX2?' question. It looks like this function would pretty much need to be rewritten :/
>
> Two last questions:
> - How much performance is gained from doubling this function up?
> - If that's worth it, what if instead the input were quadrupled? (I scanned the Java code; it looked like NR was already parametrized to 2.) It looks like there are almost enough registers here to go to 4 (I think 3 would need to be freed up somehow; alternatively, the upper 3 limbs are empty in all operations, so perhaps they could be used instead, at the expense of readability.)

Well, the keccak() algorithm does the same operations on 5 array elements at a time (it works on what is essentially a 5x5 matrix, doing row and column operations, so putting 5 array entries in a vector register was the "natural" thing to do).
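To illustrate the "same operation on 5 elements" structure, here is a minimal plain-Java sketch of Keccak's theta step on the 5x5 lane matrix (not the intrinsic itself; the class and method names are hypothetical). Each loop applies one operation uniformly across 5 lanes, which is exactly what a single vector instruction does when 5 lanes sit in one register:

```java
// Hypothetical sketch of Keccak's theta step on a 5x5 matrix of 64-bit lanes.
// Every loop body does the same operation on 5 lanes, matching one vector op.
public class KeccakThetaSketch {
    static long rotl(long x, int n) { return (x << n) | (x >>> (64 - n)); }

    // state[x][y]: 5x5 lane matrix, per FIPS 202 theta
    static void theta(long[][] state) {
        long[] c = new long[5];
        long[] d = new long[5];
        // column parities: XOR down each column
        for (int x = 0; x < 5; x++) {
            c[x] = state[x][0] ^ state[x][1] ^ state[x][2]
                 ^ state[x][3] ^ state[x][4];
        }
        // combine neighboring column parities
        for (int x = 0; x < 5; x++) {
            d[x] = c[(x + 4) % 5] ^ rotl(c[(x + 1) % 5], 1);
        }
        // fold back into every lane of the column
        for (int x = 0; x < 5; x++) {
            for (int y = 0; y < 5; y++) {
                state[x][y] ^= d[x];
            }
        }
    }
}
```

In the intrinsic, each `for (x)` loop over 5 lanes collapses to one instruction on a 5-lane-wide register; running two (or four) independent states just means interleaving two (or four) copies of that instruction stream on disjoint register sets.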
This function can only be used under very special circumstances, which occur during the generation of the "A matrix" in ML-KEM and ML-DSA; the speed of that matrix generation has almost doubled (I don't have exact numbers).

We are using 7 registers per state and 15 for the constants, so we have only 3 to spare. We could perhaps juggle the constants, keeping just the ones that will be needed next in registers and reloading the rest "just in time", but that might slow things down a bit: more load instructions executed, plus possibly some load latency. On the other hand, it would allow more parallelism. I might try it out.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2024317665