On Fri, 24 Oct 2025 13:36:52 GMT, Andrew Haley <[email protected]> wrote:
>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()`
>> method
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7144:
>
>> 7142:
>> 7143: address generate_intpoly_montgomeryMult_P256() {
>> 7144:
>
> As a general point, it would help everyone if you provided pseudocode for the
> whole thing.
Happy to add pseudocode but it will be quite long and almost identical to what
is already in `MontgomeryIntegerPolynomial256.mult()` except mine uses a loop
instead of unrolling the whole thing
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7160:
>
>> 7158: };
>> 7159:
>> 7160: Register c_ptr = r9;
>
> `rscratch1` and `rscratch2` are used freely by macros, so aliasing them is
> always rather sketchy. As far as I can tell the arg registers aren't used
> here, so it makes sense to use `r3`...
Is there a reason hotspot doesn't leave `r9` open for use as a caller saved
local variable like in the ARM docs
https://developer.arm.com/documentation/102374/0103/Procedure-Call-Standard.
Either way will fix.
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7169:
>
>> 7167: Register mul_tmp = r14;
>> 7168: Register n = r15;
>> 7169:
>
> Here, you could do something like
>
>
> RegSet scratch = RegSet::range(r3, r28) - rscratch1 - rscratch2;
>
> {
> auto r_it = scratch.begin();
> Register
> c_ptr = *r_it++,
> a_i = *r_it++,
> c_idx = *r_it++, //c_idx is not used at the same time as a_i
> limb_mask_scalar = *r_it++,
> b_j = *r_it++,
> mod_j = *r_it++,
> mod_ptr = *r_it++,
> mul_tmp = *r_it++,
> n = *r_it++;
> ...
> }
>
>
>
> Note that a RegSet iterator doesn't affect the RegSet it was created from, so
> once this block has ended you can allocate again from the set of scratch
> registers.
Is there by any chance documentation for `RegSet` that I can reference while
making these changes?
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7229:
>
>> 7227: __ shl(high_01, __ T2D, high_01, shift1);
>> 7228: __ ushr(tmp, __ T2D, low_01, shift2);
>> 7229: __ orr(high_01, __ T2D, high_01, tmp);
>
> Suggestion:
>
> __ orr(high_01, __ T16B, high_01, tmp);
>
> everywhere.
thanks for the suggestion! Does using `T16B` improve performance? Similarly,
should this be applied to `EOR` as well?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2475133382
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2539340501
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2475267081
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2474686235