Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64

Ben Perez Mon, 01 Dec 2025 13:32:05 -0800

On Fri, 24 Oct 2025 13:36:52 GMT, Andrew Haley <[email protected]> wrote:


>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()` 
>> method
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7144:
> 
>> 7142: 
>> 7143:   address generate_intpoly_montgomeryMult_P256() {
>> 7144: 
> 
> As a general point, it would help everyone if you provided pseudocode for the 
> whole thing.

Happy to add pseudocode but it will be quite long and almost identical to what 
is already in `MontgomeryIntegerPolynomial256.mult()` except mine uses a loop 
instead of unrolling the whole thing

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7160:
> 
>> 7158:     };
>> 7159: 
>> 7160:     Register c_ptr = r9;
> 
> `rscratch1` and `rscratch2` are used freely by macros, so aliasing them is 
> always rather sketchy. As far as I can tell the arg registers aren't used 
> here, so it makes sense to use `r3`...

Is there a reason hotspot doesn't leave `r9` open for use as a caller saved 
local variable like in the ARM docs 
https://developer.arm.com/documentation/102374/0103/Procedure-Call-Standard. 
Either way will fix.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7169:
> 
>> 7167:     Register mul_tmp = r14;
>> 7168:     Register n = r15;
>> 7169: 
> 
> Here, you could do something like
> 
> 
>     RegSet scratch = RegSet::range(r3, r28) - rscratch1 - rscratch2;
> 
>     {
>       auto r_it = scratch.begin();
>       Register
>         c_ptr = *r_it++,
>         a_i = *r_it++,
>         c_idx = *r_it++, //c_idx is not used at the same time as a_i
>         limb_mask_scalar = *r_it++,
>         b_j = *r_it++,
>         mod_j = *r_it++,
>         mod_ptr = *r_it++,
>         mul_tmp = *r_it++,
>         n = *r_it++;
>        ...
>     }
> 
> 
> 
> Note that a RegSet iterator doesn't affect the RegSet it was created from, so 
> once this block has ended you can allocate again from the set of scratch 
> registers.

Is there by any chance documentation for `RegSet` that I can reference while 
making these changes?

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7229:
> 
>> 7227:       __ shl(high_01, __ T2D, high_01, shift1);
>> 7228:       __ ushr(tmp, __ T2D, low_01, shift2);
>> 7229:       __ orr(high_01, __ T2D, high_01, tmp);
> 
> Suggestion:
> 
>       __ orr(high_01, __ T16B, high_01, tmp);
> 
> everywhere.

thanks for the suggestion! Does using `T16B` improve performance? Similarly, 
should this be applied to `EOR` as well?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2475133382
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2539340501
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2475267081
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2474686235

Re: RFR: 8355216: Accelerate P-256 arithmetic on aarch64

Reply via email to