On Thu, 21 May 2026 23:33:13 GMT, Shawn Emery <[email protected]> wrote:

> My last comment on unused lanes in the worst case scenario of three was just 
> a nice to have, so not a requirement. The current set of changes have passed 
> ML-KEM and ML-DSA test vectors and the benchmarks (see attached) for ML-KEM 
> are quite impressive on my Ryzen 9 9900X with 42-52% gains for AVX-512! 
> [Intrinsics ML_KEM Benchmarks - 
> JDK-8384353.pdf](https://github.com/user-attachments/files/28125497/Intrinsics.ML_KEM.Benchmarks.-.JDK-8384353.pdf)

Thanks for verifying, @smemery, and for the approval!

(I was typing a reply to that comment, mid-flight.) When I was implementing 
this (some time ago) I remember thinking through `nr=4`. This is the least 
intrusive Java change I could come up with. Alternatively, if changing the Java 
API is OK, we could pass `parInd` into the squeeze operation (it's usually 
`== nrPar`, except perhaps on the last round).

How bad is it? Out of the 6 total cases, 2 could do better: the ML-KEM 3x3 and 
ML-DSA 6x5 coefficient matrices will produce some inefficiencies (or 
opportunities for optimization).
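
To put rough numbers on that, here is a back-of-the-envelope sketch, not code from the PR. It assumes the matrix entries are sampled four at a time (`nr=4`) across the whole matrix, which may not match how the `generateA` loop actually batches them; `LaneUsage` and `idleLanes` are made-up names for illustration.

```java
// Hypothetical helper, not part of the JDK or of this PR.
class LaneUsage {
    // Assumed number of parallel states per batch (the `nr=4` case).
    static final int LANES = 4;

    // Lanes left idle in the final batch when `entries` matrix entries
    // are processed LANES at a time.
    static int idleLanes(int entries) {
        int rem = entries % LANES;
        return rem == 0 ? 0 : LANES - rem;
    }

    public static void main(String[] args) {
        // Matrix shapes (rows, cols) for the three ML-KEM and three
        // ML-DSA parameter sets.
        String[] names = {
            "ML-KEM-512", "ML-KEM-768", "ML-KEM-1024",
            "ML-DSA-44", "ML-DSA-65", "ML-DSA-87"
        };
        int[][] shapes = { {2, 2}, {3, 3}, {4, 4}, {4, 4}, {6, 5}, {8, 7} };

        for (int i = 0; i < names.length; i++) {
            int entries = shapes[i][0] * shapes[i][1];
            System.out.printf("%-12s %dx%d: %2d entries, %d idle lane(s) in last batch%n",
                    names[i], shapes[i][0], shapes[i][1], entries,
                    idleLanes(entries));
        }
    }
}
```

Under that batching assumption only the 3x3 case (3 idle lanes) and the 6x5 case (2 idle lanes) leave anything unused; the other four entry counts are multiples of 4.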

But I think we could work on that in a separate, Java-only PR. I revisited 
that `generateA` loop today and had some more thoughts.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31125#issuecomment-4513842687
