On Wed, 26 Oct 2022 15:47:08 GMT, vpaprotsk <[email protected]> wrote:
>> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 171:
>>
>>> 169: }
>>> 170:
>>> 171: if (len >= 1024) {
>>
>> Out of curiosity, do you have any perf numbers for the impact of this change
>> on systems that do not support AVX512? Does this help or hurt (or make a
>> negligible impact) on poly1305 updates when the input is 1K or larger?
>
> (The first commit in this PR actually has the code without the check if
> anyone wants to measure.. well its also trivial to edit..)
>
> I measured about 50% slowdown on 64 byte payloads. One could argue that 64
> bytes is not all that representative, but we don't get much out of assembler
> at that load either so it didn't seem worth it to figure out some sort of
> platform check.
>
> AVX512 needs at least 256 = 16 blocks.. there is overhead also
> pre-calculating powers of R that needs to be amortized. Assembler does fall
> back to 64-bit multiplies for <256, while the Java version will have to use
> the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not
> _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs,
> and whatever the benefit of 64, it gets erased by the interface-matching code
> copying limbs in and out..
>
> Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10%
> slower. I think its acceptable, in order to get 18x?
>
> Most/all of the slowdown comes from this need of copying limbs out/in.. I am
> looking at perhaps copying limbs out in the intrinsic instead. Not very
> 'pretty'.. limbs are hidden in a nested private class behind an interface.. I
> would be breaking what is a good design with neat encapsulation. (I
> accidentally forced-pushed that earlier, if you are curious; non-working).
> The current version of this code seems more robust in the long term?
10% is not a negligible impact. I see your point about AVX512 reaping the
rewards of this change, but there are plenty of x86_64 systems without AVX512
that will be impacted, not to mention other platforms like aarch64 which (for
this change at least) will never see the benefits from the intrinsic.
I don't have any suggestions right at this moment for how this could be
streamlined at all to help reduce the pain for non-AVX512 systems. Worth
looking into though.
-------------
PR: https://git.openjdk.org/jdk/pull/10582