On Thu, 25 Jun 2026 05:59:54 GMT, Shawn Emery <[email protected]> wrote:
> > I ran ML-KEM and ML-DSA benchmarks on a Macbook Pro M3 36GB Memory, see > > attached. Everything is improved with the set of changes except for > > ML-KEM-512 (encapsulation:-0.92%/decapsulation:-2.25%) and ML-DSA-44 (key > > generation:-1.03%). > > [8384353BenchmarksM3Pro.pdf](https://github.com/user-attachments/files/29319713/8384353BenchmarksM3Pro.pdf) > > Reran the specific benchmarks that showed regression on a quieter system > (MacbookAir M3 8GB Memory) and averaged the results, which showed negligible > differences of the afflicted benchmarks as follows: ML-KEM-512 > (encapsulation:-0.38%/decapsulation:-0.21%) and ML-DSA-44 (key generation: > -0.18%). In other words, the current fix resolves the regression in > performance by either negligible differences or better performance, where the > high observed +6.55% (ML-KEM-768, which would be one of the algorithms that > would gain the most from this delta). Thanks for the measurements, and the extra verification.. the first set of results really threw me for a loop, I dont see why it would be slower. The second rerun.. seems we are safe, so the PR performance is acceptable? (My theory is that GC is introducing a lot of variability to the benchmarks?) ------ We should now match _exactly_ the same number of `doubleKeccak` as before (admittedly, in possibly different order..) int[][][] a = new int[mlDsa_k][mlDsa_l][]; ML_DSA_44 << here mlDsa_k = 4; mlDsa_l = 4; ML_DSA_65 mlDsa_k = 6; mlDsa_l = 5; ML_DSA_87 mlDsa_k = 8; mlDsa_l = 7; short[][][] a = new short[mlKem_k][mlKem_k][]; ML-KEM-512 << here mlKem_k = 2; ML-KEM-768 mlKem_k = 3; ML-KEM-1024 mlKem_k = 4; For the first batch, the call sequence.. ML_DSA_44 Before: 2 + 2; 2 + 2; 2 + 2; 2 + 2 Now: 4; 4; 4; 4 (but without quadKeccak, 4 becomes 2 + 2) ML-KEM-512 Before: 2 ; 2 Now: 2 ; 2 (no change, each row has to be fully processed before moving to next) ---- > With the crypto intrinsics that don't return a value we followed the > convention that instead of declaring them void, we declare them int and have > the intrinsic version return 0 and the Java version return 1 (it makes it > easy for tests to decide which version was actually running). This convention > could be applied to the @IntrinsicCadidate methods in this class, too. I meant to do this! Thanks for catching, done. ------------- PR Comment: https://git.openjdk.org/jdk/pull/31648#issuecomment-4802731372
