On Mon, 26 Jan 2026 04:25:44 GMT, Ben Perez <[email protected]> wrote:

>> An aarch64 implementation of the `MontgomeryIntegerPolynomial256.mult()`
>> method and `IntegerPolynomial.conditionalAssign()`. Since 64-bit
>> multiplication is not supported on Neon and manually performing this
>> operation with 32-bit limbs is slower than with GPRs, a hybrid neon/gpr
>> approach is used. Neon instructions are used to compute intermediate values
>> used in the last two iterations of the main "loop", while the GPRs compute
>> the first few iterations. At the method level this improves performance by
>> ~9% and at the API level roughly 5%.
>>
>> Performance no intrinsic:
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2427.562 ± 24.923  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1757.495 ± 41.805  ops/s
>> PolynomialP256Bench.benchSquare             true  thrpt    8  2435.202 ± 20.822  ops/s
>> PolynomialP256Bench.benchSquare            false  thrpt    8  2420.390 ± 33.594  ops/s
>>
>> Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
>> SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt   40  8439.881 ± 29.838  ops/s
>> SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt   40  7990.614 ± 30.998  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt   40  2677.737 ±  8.400  ops/s
>> SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt   40  2619.297 ±  9.737  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1905.369 ± 3.745  ops/s
>>
>> Benchmark                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score   Error  Units
>> KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt   40  1903.997 ± 4.092  ops/s
>>
>>
>> Performance with intrinsic
>>
>> Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
>> PolynomialP256Bench.benchMultiply           true  thrpt    8  2676.599 ± 24.722  ops/s
>> PolynomialP256Bench.benchMultiply          false  thrpt    8  1770.589 ±  2.584  op...
>
> Ben Perez has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Added conditionalAssign() intrinsic, changed mult intrinsic to use hybrid
>   neon/gpr approach

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7957:

> 7955:
> 7956:     vs_eor(a_vec, a_vec, b_vec);
> 7957:

Likewise there are `vs_stpq` and `vs_stpq_indexed` that you can use here.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2727717489
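
For context on the `conditionalAssign()` intrinsic discussed above, here is a minimal Java sketch of a branch-free conditional assignment over `long[]` limbs. It is an illustration only, not the JDK source or the stub under review: the class name `ConditionalAssignSketch` is hypothetical, and it assumes `set` is always 0 or 1 and that both arrays have the same length. The XOR-and-mask formulation is also an assumption, chosen because it is consistent with the `vs_eor` step quoted in the diff context.

```java
import java.util.Arrays;

// Hypothetical illustration class; not the JDK implementation.
final class ConditionalAssignSketch {

    // Copies b into a when set == 1 and leaves a unchanged when set == 0,
    // without branching on set. Assumes set is 0 or 1 and a.length == b.length.
    static void conditionalAssign(int set, long[] a, long[] b) {
        long mask = -(long) set; // 0 -> all zero bits, 1 -> all one bits
        for (int i = 0; i < a.length; i++) {
            long diff = a[i] ^ b[i]; // bits where a and b differ
            a[i] ^= mask & diff;     // flips those bits only when mask is all ones
        }
    }

    public static void main(String[] args) {
        long[] a = {1L, 2L, 3L, 4L, 5L};
        long[] b = {9L, 8L, 7L, 6L, 5L};

        conditionalAssign(0, a, b);
        System.out.println(Arrays.toString(a)); // prints [1, 2, 3, 4, 5]

        conditionalAssign(1, a, b);
        System.out.println(Arrays.toString(a)); // prints [9, 8, 7, 6, 5]
    }
}
```

The same work is done for either value of `set`, which is the property a vectorized version would need to preserve: an XOR, an AND with a broadcast mask, and a second XOR per group of limbs.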
