Re: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

Jamil Nimeh Wed, 26 Oct 2022 13:49:33 -0700

On Wed, 26 Oct 2022 15:47:08 GMT, vpaprotsk <d...@openjdk.org> wrote:


>> src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java line 171:
>> 
>>> 169:         }
>>> 170: 
>>> 171:         if (len >= 1024) {
>> 
>> Out of curiosity, do you have any perf numbers for the impact of this change 
>> on systems that do not support AVX512?  Does this help or hurt (or make a 
>> negligible impact) on poly1305 updates when the input is 1K or larger?
>
> (The first commit in this PR actually has the code without the check if 
> anyone wants to measure.. well its also trivial to edit..)
> 
> I measured about 50% slowdown on 64 byte payloads. One could argue that 64 
> bytes is not all that representative, but we don't get much out of assembler 
> at that load either so it didn't seem worth it to figure out some sort of 
> platform check. 
> 
> AVX512 needs at least 256 = 16 blocks.. there is overhead also 
> pre-calculating powers of R that needs to be amortized. Assembler does fall 
> back to 64-bit multiplies for <256, while the Java version will have to use 
> the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not 
> _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, 
> and whatever the benefit of 64, it gets erased by the interface-matching code 
> copying limbs in and out..
> 
> Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% 
> slower. I think its acceptable, in order to get 18x?
> 
> Most/all of the slowdown comes from this need of copying limbs out/in.. I am 
> looking at perhaps copying limbs out in the intrinsic instead. Not very 
> 'pretty'.. limbs are hidden in a nested private class behind an interface.. I 
> would be breaking what is a good design with neat encapsulation. (I 
> accidentally forced-pushed that earlier, if you are curious; non-working). 
> The current version of this code seems more robust in the long term?

10% is not a negligible impact.  I see your point about AVX512 reaping the 
rewards of this change, but there are plenty of x86_64 systems without AVX512 
that will be impacted, not to mention other platforms like aarch64 which (for 
this change at least) will never see the benefits from the intrinsic.

I don't have any suggestions right at this moment for how this could be 
streamlined at all to help reduce the pain for non-AVX512 systems.  Worth 
looking into though.

-------------

PR: https://git.openjdk.org/jdk/pull/10582

Re: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

Reply via email to