On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:
>> By using the aarch64 vector registers the speed of the computation of the
>> ML-KEM algorithms (key generation, encapsulation, decapsulation) can be
>> approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional
> commits since the last revision:
>
>  - Code rearrange, some renaming, fixing comments
>  - Changes suggested by Andrew Dinn.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5665:

> 5663:     vs_ld2_post(vs_back(vs1), __ T8H, nttb);
> 5664:     vs_ld2_post(vs_front(vs4), __ T8H, ntta);
> 5665:     vs_ld2_post(vs_back(vs4), __ T8H, nttb);

Suggestion:

    vs_ld2_post(vs_front(vs1), __ T8H, ntta); // <a0, a1> x 8H
    vs_ld2_post(vs_back(vs1), __ T8H, nttb);  // <b0, b1> x 8H
    vs_ld2_post(vs_front(vs4), __ T8H, ntta); // <a2, a3> x 8H
    vs_ld2_post(vs_back(vs4), __ T8H, nttb);  // <b2, b3> x 8H

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5668:

> 5666:     // montmul the first and second pair of values loaded into vs1
> 5667:     // in order and then with one pair reversed storing the two
> 5668:     // results in vs3

Suggestion:

    // compute 4 montmul cross-products for pairs (a0,a1) and (b0,b1)
    // i.e. montmul the first and second halves of vs1 in order and
    // then with one sequence reversed storing the two results in vs3
    //
    // vs3[0] <- montmul(a0, b0)
    // vs3[1] <- montmul(a1, b1)
    // vs3[2] <- montmul(a0, b1)
    // vs3[3] <- montmul(a1, b0)

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5674:

> 5672:     // montmul the first and second pair of values loaded into vs4
> 5673:     // in order and then with one pair reversed storing the two
> 5674:     // results in vs1

Suggestion:

    // compute 4 montmul cross-products for pairs (a2,a3) and (b2,b3)
    // i.e. montmul the first and second halves of vs4 in order and
    // then with one sequence reversed storing the two results in vs1
    //
    // vs1[0] <- montmul(a2, b2)
    // vs1[1] <- montmul(a3, b3)
    // vs1[2] <- montmul(a2, b3)
    // vs1[3] <- montmul(a3, b2)

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5680:

> 5678:     // for each pair of results pick the second value in the first
> 5679:     // pair to create a sequence that we montmul by the zetas
> 5680:     // i.e. we want sequence <vs3[1], vs1[1]>

Suggestion:

    // montmul result 2 of each cross-product i.e. (a1*b1, a3*b3) by a zeta.
    // We can schedule two montmuls at a time if we use a suitable vector
    // sequence <vs3[1], vs1[1]>.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5683:

> 5681:     int delta = vs1[1]->encoding() - vs3[1]->encoding();
> 5682:     VSeq<2> vs5(vs3[1], delta);
> 5683:     kyber_montmul16(vs5, vz, vs5, vs_front(vs2), vq);

Suggestion:

    // vs3[1] <- montmul(montmul(a1, b1), z0)
    // vs1[1] <- montmul(montmul(a3, b3), z1)
    kyber_montmul16(vs5, vz, vs5, vs_front(vs2), vq);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044679089
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044682671
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044684696
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044689607
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2044691632
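
For readers following the suggested comments, the four cross-products and the extra multiply by a zeta can be sketched in scalar form. This is only an illustration under stated assumptions, not the stub's code: it assumes the ML-KEM modulus q = 3329, uses a plain modular multiply (the hypothetical helper mulmod) in place of the Montgomery multiply performed by kyber_montmul16, and assumes the products are finally combined as in the standard ML-KEM base-case multiplication, c0 = a0*b0 + a1*b1*zeta and c1 = a0*b1 + a1*b0. The final additions are not part of the excerpt quoted above. The names Q, Pair, mulmod and basemul are invented for this sketch.

```cpp
#include <cstdint>

// Assumption: the ML-KEM modulus. All identifiers below are illustrative only.
const int32_t Q = 3329;

// Plain modular multiply standing in for montmul. The real stub uses a
// Montgomery multiply, which also carries a scaling factor that this
// sketch ignores. Inputs are expected to lie in [0, Q).
inline int32_t mulmod(int32_t x, int32_t y) {
    return static_cast<int32_t>((static_cast<int64_t>(x) * y) % Q);
}

// One coefficient pair, representing lo + hi*X.
struct Pair {
    int32_t lo;
    int32_t hi;
};

// (a0 + a1*X) * (b0 + b1*X) mod (X^2 - zeta), coefficients mod Q.
inline Pair basemul(Pair a, Pair b, int32_t zeta) {
    // The four cross-products, in the order used by the suggested comments.
    int32_t a0b0 = mulmod(a.lo, b.lo);  // vs3[0]
    int32_t a1b1 = mulmod(a.hi, b.hi);  // vs3[1]
    int32_t a0b1 = mulmod(a.lo, b.hi);  // vs3[2]
    int32_t a1b0 = mulmod(a.hi, b.lo);  // vs3[3]

    // Only the a1*b1 term is multiplied by zeta, because X^2 == zeta.
    int32_t a1b1z = mulmod(a1b1, zeta);

    Pair c;
    c.lo = (a0b0 + a1b1z) % Q;  // constant term
    c.hi = (a0b1 + a1b0) % Q;   // coefficient of X
    return c;
}
```

With a = (1, 2), b = (3, 4) and zeta = 17, the four products are 3, 8, 4 and 6, so the sketch yields c0 = 3 + 8*17 = 139 and c1 = 4 + 6 = 10. The vector code does the same work for eight 16-bit lanes per register and two coefficient pairs at a time, which is why the two zeta multiplies can be scheduled together on the sequence <vs3[1], vs1[1]>.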