On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:
>> By using the aarch64 vector registers the speed of the computation of the >> ML-KEM algorithms (key generation, encapsulation, decapsulation) can be >> approximately doubled. > > Ferenc Rakoczi has updated the pull request incrementally with two additional > commits since the last revision: > > - Code rearrange, some renaming, fixing comments > - Changes suggested by Andrew Dinn. src/hotspot/cpu/aarch64/register_aarch64.hpp line 509: > 507: } > 508: > 509: // convenience methods for splitting 8-way of 4-way vector register Suggestion: // convenience methods for splitting 8-way or 4-way vector register src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5012: > 5010: assert(!va.is_constant(), "output vector must identify 2 different > registers"); > 5011: > 5012: // schedule 2 streams of i<nstructions across the vector sequences Suggestion: // schedule 2 streams of instructions across the vector sequences src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5166: > 5164: // > 5165: // On each level, we fill up the vector registers in such a way > that the > 5166: // array elements that need to be multiplied by the zetas be in one Suggestion: // array elements that need to be multiplied by the zetas are in one src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5168: > 5166: // array elements that need to be multiplied by the zetas be in one > 5167: // set of vector registers while the corresponding ones that don't > need to > 5168: // be multiplied, in another set. We can do 32 Montgomery > multiplications Suggestion: // be multiplied are in another set. We can do 32 Montgomery multiplications src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5278: > 5276: // level 4 > 5277: vs_ldpq(vq, kyberConsts); > 5278: int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 }; I'd like to add comment here to explain the coefficient grouping and likewise at level 5 and 6. So here we have: Suggestion: // Up to level 3 the coefficients multiplied by or added/subtracted // to the zetas occur in discrete blocks whose size is some multiple // of 32. At level 4 coefficients occur in 8 discrete blocks of size 16 // so they are loaded using an ldr at 8 distinct offsets. int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 }; src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5299: > 5297: > 5298: // level 5 > 5299: vs_ldpq(vq, kyberConsts); Suggestion: vs_ldpq(vq, kyberConsts); // At level 5 related coefficients occur in discrete blocks of size 8 so // need to be loaded interleaved using an ld2 operation with arrangement 2D src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319: > 5317: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4); > 5318: > 5319: // level 6 Suggestion: // level 6 // At level 6 related coefficients occur in discrete blocks of size 4 so // need to be loaded interleaved using an ld2 operation with arrangement 4S src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377: > 5375: // level 0 > 5376: vs_ldpq(vq, kyberConsts); > 5377: int offsets4[4] = { 0, 32, 64, 96 }; Again a comment Suggestion: // At level 6 related coefficients occur in discrete blocks of size 4 so // need to be loaded interleaved using an ld2 operation with arrangement 4S int offsets4[4] = { 0, 32, 64, 96 }; src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399: > 5397: vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4); > 5398: > 5399: // level 1 Again a comment Suggestion: // level 1 // At level 1 related coefficients occur in discrete blocks of size 8 so // need to be loaded interleaved using an ld2 operation with arrangement 2D src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5422: > 5420: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4); > 5421: > 5422: // level 2 Again Suggestion: // level 2 // At level 2 coefficients occur in 8 discrete blocks of size 16 // so they are loaded using employing an ldr at 8 distinct offsets. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464: > 5462: vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3); > 5463: > 5464: // level 3 Suggestion: // level 3 // From level 3 upwards coefficients occur in discrete blocks whose size is // some multiple of 32 so can be loaded using ldpq and suitable indexes. src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5591: > 5589: store64shorts(vs2, tmpAddr); > 5590: > 5591: load64shorts(vs1, tmpAddr); I'd like to make explicit the fact that we have avoided doing an add here (and in the next two cases) by adding a commented out generation step i.e. at this line insert Suggestion: // __ add(tmpAddr, coeffs, 128); // unneeded as implied by preceding store load64shorts(vs1, tmpAddr); src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5596: > 5594: store64shorts(vs2, tmpAddr); > 5595: > 5596: load64shorts(vs1, tmpAddr); Likewise insert: Suggestion: // __ add(tmpAddr, coeffs, 256); // unneeded as implied by preceding store load64shorts(vs1, tmpAddr); src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5601: > 5599: store64shorts(vs2, tmpAddr); > 5600: > 5601: load64shorts(vs1, tmpAddr); Likewise insert: Suggestion: // __ add(tmpAddr, coeffs, 384); // unneeded as implied by preceding store load64shorts(vs1, tmpAddr); src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5640: > 5638: VSeq<4> vs1(0), vs2(4); // 4 sets of 8x8H inputs/outputs/tmps > 5639: VSeq<4> vs3(16), vs4(20); > 5640: VSeq<2> vq(30); // pair of constants for montmul Suggestion: VSeq<2> vq(30); // pair of constants for montmul: qinv, q src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5642: > 5640: VSeq<2> vq(30); // pair of constants for montmul > 5641: VSeq<2> vz(28); // pair of zetas > 5642: VSeq<4> vc(27, 0); // constant sequence for montmul Suggestion: VSeq<4> vc(27, 0); // constant for montmul: montRSquareModQ src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6094: > 6092: VSeq<8> vc2_1(31, 0); > 6093: VSeq<2> vc2_2(31, 0); > 6094: FloatRegister vc2_3 = v31; Suggestion: // we also need a pair of corresponding constant sequences VSeq<8> vc1_1(30, 0); // kyber_q VSeq<2> vc1_2(30, 0); FloatRegister vc1_3 = v30; VSeq<8> vc2_1(31, 0); // kyberBarrettMultiplier VSeq<2> vc2_2(31, 0); FloatRegister vc2_3 = v31; src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6102: > 6100: // load q and the multiplier for the Barrett reduction > 6101: __ add(kyberConsts, kyberConsts, 16); > 6102: __ ldpq(vc1_3, vc2_3, kyberConsts); Suggestion: __ ldpq(vc1_3, vc2_3, kyberConsts); // kyber_q, kyberBarrettMultiplier src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6111: > 6109: __ ldr(vs1_3, __ Q, __ post(coeffs, 16)); > 6110: } > 6111: vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1); Suggestion: vs_sqdmulh(vs2_1, __ T8H, vs1_1, vc2_1); // vs2 <- (2 * vs1 * kyberBarrettMultiplier) >> 16 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6116: > 6114: __ sqdmulh(vs2_3, __ T8H, vs1_3, vc2_3); > 6115: } > 6116: vs_sshr(vs2_1, __ T8H, vs2_1, 11); Suggestion: vs_sshr(vs2_1, __ T8H, vs2_1, 11); // vs2 <- (vs1 * kyberBarrettMultiplier) >> 26 src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 6121: > 6119: __ sshr(vs2_3, __ T8H, vs2_3, 11); > 6120: } > 6121: vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1); Suggestion: vs_mlsv(vs1_1, __ T8H, vs2_1, vc1_1); // vs1 <- vs1 - vs2 * kyber_q ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041886663 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041888092 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889242 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041889966 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892243 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041892994 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041893924 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041895469 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041896626 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041898215 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041899623 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902206 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041902833 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041906126 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041909711 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041929011 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041867539 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041868670 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041881429 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883012 PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2041883686