On Fri, 5 Jun 2026 20:15:30 GMT, Shawn Emery <[email protected]> wrote:

> Curve25519 polynomial arithmetic uses intrinsics, implemented with 
> GPR-related instructions, for multiplication operations (method mult()). 
> Benchmark improvements include:
> 
> X25519 decapsulation: +9%
> X25519 encapsulation: +9%
> X25519 key agreement: +7%
> X25519 key-pair generation: +10%
> X25519-MLKEM decapsulation: +7%
> X25519-MLKEM encapsulation: +8%
> X25519-MLKEM key-pair generation: +8%
> EdDSA sign: +12%
> EdDSA verify: +12%
> EdDSA key-pair generation: +15%
> 
> Note 1: Unlike the x86_64 intrinsics implementation, the AArch64 
> implementation has no square() intrinsic. Using one caused a 3.3% 
> performance regression, owing to the efficiency of the symmetric squaring 
> shape in Java versus the inefficiency of the leaf calls and the additional 
> cycles required for 64-bit multiplication on AArch64.
> Note 2: The GPR-related instructions outperformed a hybrid solution 
> (GPR-related instructions for the first two iterations and Neon 
> instructions for the last two iterations).  The hybrid design was 4%/1% 
> slower in KEM decapsulation/encapsulation than the GPR-only version, 
> because the overhead of performing the limb splits and reconstruction 
> outweighed the gains from SIMD parallelism.
> 
> ---------
> - [X] I confirm that I make this contribution in accordance with the [OpenJDK 
> Interim AI Policy](https://openjdk.org/legal/ai).

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7678:

> 7676:   /**
> 7677:    * Arithmetic polynomial multiplication in Curve25519.  The algorithm mimics
> 7678:    * the version in Java, including the use of all columns (no folding method).

Please mention class `IntegerPolynomial25519` or file 
`IntegerPolynomial25519.java` to make it easier for maintainers to find the 
source.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7720:

> 7718:     const int32_t columns    = limbs * 2;
> 7719:     const uint64_t mask      = 0x7FFFFFFFFFFFFULL;
> 7720:     const uint64_t CARRY_ADD = 0x4000000000000ULL;

It might be clearer to construct these from bpl rather than leave readers to 
count the number of zeroes:

Suggestion:

    const uint64_t mask      = ((1ULL << bpl) - 1ULL);
    const uint64_t CARRY_ADD = (1ULL << (bpl - 1));
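
As a quick sanity check of the suggestion, here is a minimal standalone 
sketch. It assumes bpl is 51, which is inferred only from the width of the 
quoted mask; the helper names limb_mask and carry_add are hypothetical and do 
not appear in the PR. The static_asserts confirm at compile time that the 
derived expressions match the literals quoted above:

```cpp
#include <cstdint>

// Assumption: 51 bits per limb, inferred from the width of the quoted mask.
constexpr int bpl = 51;

// Hypothetical helpers mirroring the two suggested expressions.
constexpr uint64_t limb_mask(int bits) { return (1ULL << bits) - 1ULL; }
constexpr uint64_t carry_add(int bits) { return 1ULL << (bits - 1); }

// Both derived values must equal the hex literals from the reviewed code.
static_assert(limb_mask(bpl) == 0x7FFFFFFFFFFFFULL, "mask mismatch");
static_assert(carry_add(bpl) == 0x4000000000000ULL, "CARRY_ADD mismatch");
```

If bpl has a different value in the stub, the asserts fail to compile, which 
would itself indicate that the literals and bpl have drifted apart.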

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3414184380
PR Review Comment: https://git.openjdk.org/jdk/pull/31409#discussion_r3414209558
