On Thu, 10 Apr 2025 13:19:05 GMT, Ferenc Rakoczi <d...@openjdk.org> wrote:
>> By using the aarch64 vector registers the speed of the computation of the
>> ML-KEM algorithms (key generation, encapsulation, decapsulation) can be
>> approximately doubled.
>
> Ferenc Rakoczi has updated the pull request incrementally with two additional
> commits since the last revision:
>
>  - Code rearrange, some renaming, fixing comments
>  - Changes suggested by Andrew Dinn.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5300:

> 5298: // level 5
> 5299: vs_ldpq(vq, kyberConsts);
> 5300: int offsets4[4] = { 0, 32, 64, 96 };

Again, this needs a comment:

// At level 5, related coefficients occur in discrete blocks of 8 bytes, so
// they need to be loaded interleaved using an ld2 operation with arrangement 2D

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5319:

> 5317: vs_st2_indexed(vs1, __ T2D, coeffs, tmpAddr, 384, offsets4);
> 5318:
> 5319: // level 6

And again:

// At level 6, related coefficients occur in discrete blocks of 4 bytes, so
// they need to be loaded interleaved using an ld2 operation with arrangement 4S

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5377:

> 5375: // level 0
> 5376: vs_ldpq(vq, kyberConsts);
> 5377: int offsets4[4] = { 0, 32, 64, 96 };

Again, this needs a comment:

// At level 0, related coefficients occur in discrete blocks of 4 bytes, so
// they need to be loaded interleaved using an ld2 operation with arrangement 4S

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5399:

> 5397: vs_st2_indexed(vs1, __ T4S, coeffs, tmpAddr, 384, offsets4);
> 5398:
> 5399: // level 1

Again, this needs a comment:

// At level 1, related coefficients occur in discrete blocks of 8 bytes, so
// they need to be loaded interleaved using an ld2 operation with arrangement 2D

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5423:

> 5421:
> 5422: // level 2
> 5423: int offsets3[8] = { 0, 32, 64, 96, 128, 160, 192, 224 };

Again:

// At level 2, coefficients occur in 8 discrete blocks of 16 bytes,
// so they are loaded using an ldr at 8 distinct offsets.
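The ld2 comments above can be checked with a small host-side model. This is a sketch, not HotSpot code: `ld2_model` and `lanes_are_butterfly_pairs` are hypothetical helpers, and it assumes 16-bit coefficients with the usual ML-KEM butterfly distance of `len` coefficients (4 at forward level 5 and inverse level 1, 2 at forward level 6 and inverse level 0). It shows that an ld2 whose element size is 2 * len bytes places each coefficient and its butterfly partner in the same lane of the two destination registers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// One 128-bit vector register viewed as 8 16-bit coefficients.
using VReg = std::array<std::int16_t, 8>;

// Hypothetical model of an AArch64 LD2 (multiple structures) load of 32 bytes:
// elements are de-interleaved, so even-numbered elements go to `even` and
// odd-numbered elements go to `odd`. elem_bytes is 8 for 2D and 4 for 4S.
void ld2_model(const std::int16_t* src, int elem_bytes, VReg& even, VReg& odd) {
  const auto* in = reinterpret_cast<const unsigned char*>(src);
  auto* e = reinterpret_cast<unsigned char*>(even.data());
  auto* o = reinterpret_cast<unsigned char*>(odd.data());
  const int elems_per_reg = 16 / elem_bytes;
  for (int i = 0; i < elems_per_reg; i++) {
    std::memcpy(e + i * elem_bytes, in + (2 * i) * elem_bytes, elem_bytes);
    std::memcpy(o + i * elem_bytes, in + (2 * i + 1) * elem_bytes, elem_bytes);
  }
}

// Hypothetical check: load 16 coefficients whose values equal their indices and
// report whether every lane of `odd` holds the partner (index + len) of the
// same lane of `even`, i.e. whether one vector butterfly can process them.
bool lanes_are_butterfly_pairs(int len, int elem_bytes) {
  std::int16_t coeffs[16];
  for (int i = 0; i < 16; i++) coeffs[i] = static_cast<std::int16_t>(i);
  VReg even{}, odd{};
  ld2_model(coeffs, elem_bytes, even, odd);
  for (int k = 0; k < 8; k++) {
    if (odd[k] != even[k] + len) return false;
  }
  return true;
}
```

Swapping the arrangements (2D at distance 2, or 4S at distance 4) makes the check fail, which is why the levels using 8-byte blocks and those using 4-byte blocks need different ld2 arrangements.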
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5464:

> 5462: vs_str_indexed(vs1, __ Q, coeffs, 256, offsets3);
> 5463:
> 5464: // level 3

// From level 3 upwards, coefficients occur in discrete blocks whose size is
// some multiple of 32 bytes, so they can be loaded using ldpq and suitable indexes.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037571231
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037573218
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037577265
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037578385
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037581149
PR Review Comment: https://git.openjdk.org/jdk/pull/23663#discussion_r2037585101
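Taken together, the suggested comments describe a single rule: the size of the blocks of related coefficients decides which load is used. The sketch below encodes that rule so it can be cross-checked level by level. It is not HotSpot code; `load_strategy`, `forward_block_bytes` and `inverse_block_bytes` are hypothetical helpers, and the block sizes assume 16-bit coefficients with a butterfly distance of 128 >> level coefficients in the forward NTT and 2 << level coefficients in the inverse NTT.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of the review comments: map the size in bytes of a
// block of related coefficients to the load described for that block size.
std::string load_strategy(int block_bytes) {
  if (block_bytes == 4) return "ld2 4S";               // interleaved 32-bit elements
  if (block_bytes == 8) return "ld2 2D";               // interleaved 64-bit elements
  if (block_bytes == 16) return "ldr Q at 8 offsets";  // one Q register per block
  if (block_bytes % 32 == 0) return "ldpq";            // register pairs, suitable indexes
  return "unsupported";
}

// Assumed block sizes: 2 bytes per coefficient times the butterfly distance.
int forward_block_bytes(int level) { return 2 * (128 >> level); }
int inverse_block_bytes(int level) { return 2 * (2 << level); }
```

For example, `load_strategy(inverse_block_bytes(2))` yields the 8-offset ldr case, matching the level 2 comment, while every inverse level from 3 upwards falls into the ldpq case.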