On Sun, 23 Mar 2025 17:00:43 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:

>> By using the aarch64 vector registers the speed of the computation of the 
>> ML-KEM algorithms (key generation, encapsulation, decapsulation) can be 
>> approximately doubled.
>
> Ferenc Rakoczi has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains eight commits:
> 
>  - Merged master.
>  - Fixed bad assertion.
>  - Fixed mismerge.
>  - Merged master.
>  - A little cleanup
>  - Merged master
>  - removing trailing spaces
>  - kyber aarch64 intrinsics

@ferakocz Thanks for another very good piece of work which appears to me to be 
functioning correctly and performantly.

The PR suffers from the same problems as the original ML_DSA one i.e. The 
mapping of data to registers and the overall structure of the generated code 
and its relation to the related Java code/the original algorithms will be hard 
for a maintainer to identify. I have reworked your patch to use vector 
sequences in this [draft PR](https://github.com/openjdk/jdk/pull/24419) in very 
much the same way as was done for the ML_DSA PR. This has significantly 
abstracted and clarified the register mappings that are in use in each kyber 
generator and has also made the higher level structure of the generated code 
much easier to follow.

Note that my rework of the generation routines was applied to your original PR 
after rebasing it on master. Before updating the kyber routines I also 
generalized a few of the VSeq<N> methods that benefit from being shared by both 
kyber and dilithium, most notably the montmul routines, and I added a few extra 
helpers.

The reworked version passes the ML_KEM functional test and gives similar 
performance improvements for the ML_KEM micro benchmark. The generated code 
does differ in a few places from what your original patch generates but only 
superficially - most notable is that a few loads/stores that rely on continued 
post-increments in the original instead use a constant offset or an add/load 
pair in the reworked code. This makes a very minor difference to code size and 
does not seem to affect performance.

I would like you to rework your PR to incorporate these changes because I 
believe it will make a big difference to maintainability.  n.b. it may be 
easier to integrate my changes by diffing your branch and mine and applying the 
resulting change set rather than trying to merge the changes. Please let me 
know if you have problems with the integration and need help.

I still have some further review comments and would also like to see more 
commenting to explain what the code is doing. However, I think it will be 
easier to do that after this rework has been integrated into your PR.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23663#pullrequestreview-2746672860

Reply via email to