On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley <[email protected]> wrote:
>> This provides a solid speedup of about 3-4x over the Java implementation.
>>
>> I have a vectorized version of this which uses a bunch of tricks to speed it
>> up, but it's complex and can still be improved. We're getting close to ramp
>> down, so I'm submitting this simple intrinsic so that we can get it reviewed
>> in time.
>>
>> Benchmarks:
>>
>>
>> ThunderX (2, I think):
>>
>> Benchmark                        (dataSize) (provider)  Mode  Cnt         Score          Error  Units
>> Poly1305DigestBench.updateBytes          64             thrpt    3  14078352.014 ± 4201407.966  ops/s
>> Poly1305DigestBench.updateBytes         256             thrpt    3   5154958.794 ± 1717146.980  ops/s
>> Poly1305DigestBench.updateBytes        1024             thrpt    3   1416563.273 ± 1311809.454  ops/s
>> Poly1305DigestBench.updateBytes       16384             thrpt    3     94059.570 ±    2913.021  ops/s
>> Poly1305DigestBench.updateBytes     1048576             thrpt    3      1441.024 ±     164.443  ops/s
>>
>> Benchmark                        (dataSize) (provider)  Mode  Cnt         Score          Error  Units
>> Poly1305DigestBench.updateBytes          64             thrpt    3   4516486.795 ±  419624.224  ops/s
>> Poly1305DigestBench.updateBytes         256             thrpt    3   1228542.774 ±  202815.694  ops/s
>> Poly1305DigestBench.updateBytes        1024             thrpt    3    316051.912 ±   23066.449  ops/s
>> Poly1305DigestBench.updateBytes       16384             thrpt    3     20649.561 ±    1094.687  ops/s
>> Poly1305DigestBench.updateBytes     1048576             thrpt    3       310.564 ±      31.053  ops/s
>>
>> Apple M1:
>>
>> Benchmark                        (dataSize) (provider)  Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64             thrpt    3  33551968.946 ± 849843.905  ops/s
>> Poly1305DigestBench.updateBytes         256             thrpt    3   9911637.214 ±  63417.224  ops/s
>> Poly1305DigestBench.updateBytes        1024             thrpt    3   2604370.740 ±  29208.265  ops/s
>> Poly1305DigestBench.updateBytes       16384             thrpt    3    165183.633 ±   1975.998  ops/s
>> Poly1305DigestBench.updateBytes     1048576             thrpt    3      2587.132 ±     40.240  ops/s
>>
>> Benchmark                        (dataSize) (provider)  Mode  Cnt         Score         Error  Units
>> Poly1305DigestBench.updateBytes          64             thrpt    3  12373649.589 ± 184757.721  ops/s
>> Poly1305DigestBench.upd...
>
> Andrew Haley has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Review comments
src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135:
> 7133: regs = (regs.remaining() + U_0HI + U_1HI).begin();
> 7134:
> 7135: // U_2:U_1:U_0 += (U_1HI >> 2)
This comment and the next one both need correcting: they mention U_0HI and
U_1HI, and, as the previous comment says, those registers are dead.
What actually happens here is best summarized as
// U_2:U_1:U_0 += (U_2 >> 2) * 5
or, to be clearer about the current encoding, which does it in several steps:
// rscratch1 = (U_2 >> 2)
// U_2 = U_2[1:0]
// U_2:U_1:U_0 += rscratch1
// U_2:U_1:U_0 += (rscratch1 << 2)
That is, any bits set from position 130 upwards are masked off, treated as an
integer in their own right, multiplied by 5 (added in once directly and once
shifted left by 2), and the result added back in at the bottom to update the
130-bit result U_2[1:0]:U_1[63:0]:U_0[63:0].
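The arithmetic behind this folding step rests on the identity 2^130 ≡ 5 (mod 2^130 - 5). As an illustrative sketch (Python big integers, not part of the patch), one can check that masking off the bits from position 130 upward and adding them back in multiplied by 5 preserves the value modulo the Poly1305 prime:

```python
# Illustrative sketch of the partial reduction described above.
# The modulus is p = 2**130 - 5, so 2**130 ≡ 5 (mod p): the accumulator
# bits from position 130 upward may be masked off, multiplied by 5, and
# added back at the bottom without changing the value mod p.
import random

P = (1 << 130) - 5

def fold(u):
    """One partial-reduction step: keep u[129:0], fold in (u >> 130) * 5."""
    hi = u >> 130               # the bits masked off (U_2 >> 2 over the accumulator)
    lo = u & ((1 << 130) - 1)   # the 130-bit result U_2[1:0]:U_1:U_0
    return lo + hi + (hi << 2)  # add hi and hi << 2, i.e. hi * 5, at the bottom

random.seed(1)
for _ in range(1000):
    u = random.getrandbits(192)    # a 192-bit U_2:U_1:U_0 accumulator
    assert fold(u) % P == u % P    # folding preserves the residue mod p
```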
I'm not sure whether this provides an opportunity for you to optimize by
doing the multiply by five earlier, i.e. replacing the code with this version:
// rscratch1 = (U_2 >> 2) * 5
__ lsr(rscratch1, U_2, 2);
__ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2);
// U_2 = U_2[1:0]
__ andr(U_2, U_2, (u8)3);
// U_2:U_1:U_0 += rscratch1
__ adds(U_0, U_0, rscratch1);
__ adcs(U_1, U_1, zr);
__ adc(U_2, U_2, zr);
The obvious concern is that multiplying rscratch1 by 5 might overflow 64
bits. Is that why you have implemented two add-and-carry steps? If so, why
is it legitimate to do the multiply by 5 up front in the final reduction that
follows the loop?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213062966