On Mon, 9 Feb 2026 11:31:00 GMT, Andrew Dinn <[email protected]> wrote:
>> Ben Perez has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> fixed indexing bug in vs_ldpq, simplified vector loads in
>> generate_intpoly_assign()
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7988:
>
>> 7986: __ ld1(a_vec[0], __ T2D, aLimbs);
>> 7987: __ ldpq(a_vec[1], a_vec[2], Address(aLimbs, 16));
>> 7988: __ ldpq(a_vec[3], a_vec[4], Address(aLimbs, 48));
>
> I notice that here and elsewhere you have a 5 vector sequence and hence are
> not using vs_ldpq/stpq operations (because they only operate on even-length
> sequences). However, if you add a bit of extra 'apparatus' to
> register_aarch64.hpp you can then use the vs_ldpq/stpq operations.
>
> Your code processes the first register individually via ld1/st1 and then the
> remaining registers using a pair of loads, i.e. it operates as if the latter
> were a VSeq<4>. So, in register_aarch64.hpp you can add these functions:
>
>
> template<int N>
> FloatRegister vs_head(const VSeq<N>& v) {
>   static_assert(N > 1, "sequence length must be greater than 1");
>   return v.base();
> }
>
> template<int N>
> VSeq<N> vs_tail(const VSeq<N+1>& v) {
>   static_assert(N > 1, "tail sequence length must be greater than 1");
>   return VSeq<N>(v.base() + v.delta(), v.delta());
> }
>
> With those methods available you should be able to do all these VSeq<5> loads
> and stores using an ld1/st1 followed by a vs_ldpq_indexed or vs_stpq_indexed
> with a suitable start index and the same constant offset array, e.g. here you
> could use
>
> Suggestion:
>
> int offsets[2] = { 0, 32 };
> __ ld1(vs_head(a_vec), __ T2D, aLimbs);
> vs_ldpq_indexed(vs_tail(a_vec), aLimbs, 16, offsets);

Made these changes, except that I opted to use `__ ld1(a_vec[0], __ T2D, aLimbs)`
instead of a `vs_head` method because it seemed easier to read. Happy to add
that though.
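
For reference, here is a minimal standalone sketch of the head/tail split discussed above. It is not the HotSpot code: `VSeq` below is a hypothetical stand-in that models registers as plain `int` encodings, and it only mirrors the `base()`/`delta()` accessors used in the suggestion. One deliberate difference, which is an assumption about what would be needed rather than what was committed: `vs_tail` takes `VSeq<N>` and returns `VSeq<N - 1>`, because `N` inside `VSeq<N+1>` is a non-deduced context in C++, so a call like `vs_tail(a_vec)` would otherwise need an explicit template argument.

```cpp
#include <cassert>

// Hypothetical stand-in for HotSpot's VSeq<N>: a sequence of N vector
// registers described by a base register number and a stride (delta).
template<int N>
class VSeq {
  int _base;
  int _delta;
public:
  explicit VSeq(int base, int delta = 1) : _base(base), _delta(delta) {}
  int base() const { return _base; }
  int delta() const { return _delta; }
  // Register number of the i-th element of the sequence.
  int operator[](int i) const {
    assert(0 <= i && i < N);
    return _base + i * _delta;
  }
};

// First register of the sequence.
template<int N>
int vs_head(const VSeq<N>& v) {
  static_assert(N > 1, "sequence length must be greater than 1");
  return v.base();
}

// Everything after the first register, as a sequence one element shorter.
// N is deducible here because the parameter type is VSeq<N>, not VSeq<N+1>.
template<int N>
VSeq<N - 1> vs_tail(const VSeq<N>& v) {
  static_assert(N > 2, "sequence length must be greater than 2");
  return VSeq<N - 1>(v.base() + v.delta(), v.delta());
}
```

With a five-register sequence starting at register 16, `vs_head` yields register 16 and `vs_tail` yields the four-register sequence 17..20, which has the even length that the paired load/store helpers require.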
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2819156616