On Thu, 25 Jun 2026 18:07:11 GMT, Volodymyr Paprotski <[email protected]> wrote:
> > > I ran ML-KEM and ML-DSA benchmarks on a Macbook Pro M3 36GB Memory, see
> > > attached. Everything is improved with the set of changes except for
> > > ML-KEM-512 (encapsulation:-0.92%/decapsulation:-2.25%) and ML-DSA-44 (key
> > > generation:-1.03%).
> > > [8384353BenchmarksM3Pro.pdf](https://github.com/user-attachments/files/29319713/8384353BenchmarksM3Pro.pdf)
> >
> > Reran the specific benchmarks that showed regression on a quieter system
> > (MacbookAir M3 8GB Memory) and averaged the results, which showed
> > negligible differences of the afflicted benchmarks as follows: ML-KEM-512
> > (encapsulation:-0.38%/decapsulation:-0.21%) and ML-DSA-44 (key generation:
> > -0.18%). In other words, the current fix resolves the regression in
> > performance by either negligible differences or better performance, where
> > the high observed +6.55% (ML-KEM-768, which would be one of the algorithms
> > that would gain the most from this delta).
>
> Thanks for the measurements, and the extra verification.. the first set of
> results really threw me for a loop, I don't see why it would be slower.
>
> The second rerun.. seems we are safe, so the PR performance is acceptable?
> (My theory is that GC is introducing a lot of variability to the benchmarks?)

Yes, it's acceptable to me. I reran the benchmarks because of the variability
on the first AArch64 system tested. As an example, the difference in SE
between the two systems would by itself account for 0.68% variability in
ML-KEM-512 decapsulation.

> We should now match _exactly_ the same number of `doubleKeccak` as before
> (admittedly, in possibly different order..)
>
> ```
> int[][][] a = new int[mlDsa_k][mlDsa_l][];
> ML_DSA_44 << here
>   mlDsa_k = 4;
>   mlDsa_l = 4;
> ML_DSA_65
>   mlDsa_k = 6;
>   mlDsa_l = 5;
> ML_DSA_87
>   mlDsa_k = 8;
>   mlDsa_l = 7;
>
> short[][][] a = new short[mlKem_k][mlKem_k][];
> ML-KEM-512 << here
>   mlKem_k = 2;
> ML-KEM-768
>   mlKem_k = 3;
> ML-KEM-1024
>   mlKem_k = 4;
> ```
>
> For the first batch, the call sequence..
>
> ```
> ML_DSA_44
> Before: 2 + 2; 2 + 2; 2 + 2; 2 + 2
> Now: 4; 4; 4; 4 (but without quadKeccak, 4 becomes 2 + 2)
>
> ML-KEM-512
> Before: 2 ; 2
> Now: 2 ; 2 (no change, each row has to be fully processed before moving to
> next)
> ```

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31648#issuecomment-4803842971
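
For anyone wanting to check the quoted call sequences by hand, here is a rough
sketch of the counting argument. It is not the code from the PR: the class
`KeccakBatchSketch` and the helpers `rowBatches` and `doubleCalls` are
hypothetical names, and it assumes each matrix row is split greedily into
batches of at most 2 (before) or 4 (now), with a batch of 4 falling back to
2 + 2 when no `quadKeccak` is available.

```java
import java.util.ArrayList;
import java.util.List;

public class KeccakBatchSketch {
    // Hypothetical helper: split one matrix row of rowLen entries into
    // batches of at most maxBatch entries, processed row by row.
    static List<Integer> rowBatches(int rowLen, int maxBatch) {
        List<Integer> batches = new ArrayList<>();
        int remaining = rowLen;
        while (remaining > 0) {
            int b = Math.min(remaining, maxBatch);
            batches.add(b);
            remaining -= b;
        }
        return batches;
    }

    // Count two-way calls over the whole matrix, assuming (for illustration)
    // that a batch of size b costs ceil(b / 2) two-way calls, i.e. a batch
    // of 4 without a four-way primitive becomes 2 + 2.
    static int doubleCalls(int rows, int rowLen, int maxBatch) {
        int calls = 0;
        for (int r = 0; r < rows; r++) {
            for (int b : rowBatches(rowLen, maxBatch)) {
                calls += (b + 1) / 2;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // ML_DSA_44: mlDsa_k = 4 rows, mlDsa_l = 4 entries per row.
        System.out.println("ML_DSA_44 before: " + rowBatches(4, 2)
                + " per row, two-way calls = " + doubleCalls(4, 4, 2));
        System.out.println("ML_DSA_44 now:    " + rowBatches(4, 4)
                + " per row, two-way calls = " + doubleCalls(4, 4, 4));
        // ML-KEM-512: mlKem_k = 2 rows, 2 entries per row.
        System.out.println("ML-KEM-512 before: " + rowBatches(2, 2)
                + " per row, two-way calls = " + doubleCalls(2, 2, 2));
        System.out.println("ML-KEM-512 now:    " + rowBatches(2, 4)
                + " per row, two-way calls = " + doubleCalls(2, 2, 4));
    }
}
```

Under these assumptions both parameter sets give the same two-way call count
before and after, which is consistent with the claim quoted above. The sketch
says nothing about the larger parameter sets, whose batching is not described
in this thread.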
