On Thu, 21 May 2026 23:33:13 GMT, Shawn Emery <[email protected]> wrote:
> My last comment on unused lanes in the worst case scenario of three was just
> a nice to have, so not a requirement. The current set of changes have passed
> ML-KEM and ML-DSA test vectors and the benchmarks (see attached) for ML-KEM
> are quite impressive on my Ryzen 9 9900X with 42-52% gains for AVX-512!
>
> [Intrinsics ML_KEM Benchmarks - JDK-8384353.pdf](https://github.com/user-attachments/files/28125497/Intrinsics.ML_KEM.Benchmarks.-.JDK-8384353.pdf)

Thanks for verifying @smemery! And the approval!

(I was typing a reply to that comment, mid-flight..)

When I was implementing this (some time ago) I remember thinking through `nr=4`.. This is the least intrusive Java change I could come up with. Alternatively, if changing the Java API is ok, we could pass the `parInd` into the squeeze operation (it's usually `== nrPar`, except perhaps on the last round).

How bad? Out of the 6 total cases, 2 might be better? (The ML-KEM 3x3 matrix of coefficients and the ML-DSA 6x5 matrix will produce some inefficiencies, or possibilities for optimization.) But I think we could work on that in a separate PR that is Java only..

I revisited that `generateA` loop today and had some more thoughts..

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31125#issuecomment-4513842687
