Re: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

vpaprotsk Fri, 28 Oct 2022 13:24:06 -0700

On Thu, 27 Oct 2022 09:33:32 GMT, Jatin Bhateja <[email protected]> wrote:


>> vpaprotsk has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   extra whitespace character
>
> src/hotspot/cpu/x86/macroAssembler_x86_poly.cpp line 849:
> 
>> 847:   jcc(Assembler::less, L_process16Loop);
>> 848: 
>> 849:   poly1305_process_blocks_avx512(input, length,
> 
> Since entire code is based on 512 bit encoding misalignment penalty may be 
> costly here. A scalar peel handling (as done in tail) for input portion 
> before a 64 byte aligned address  could further improve the performance for 
> large block sizes.

Hmm, interesting. Is this about the loads, i.e. `evmovdquq` vs `evmovdqaq`? I 
was actually looking at using `evmovdqaq`, but there is no encoding for it yet. 
(Looking now at uops.info, the two seem to have identical timings; perhaps 
their measurements are off.) I tried (and removed) quite a few optimizations 
here, but not this one.
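
To make the peel idea concrete, here is a rough sketch of the address 
arithmetic in plain C++ (not MacroAssembler code). The names 
`bytes_to_align64`, `Peel` and `scalar_peel` are made up for illustration and 
are not in the PR. It assumes the scalar path only consumes whole 16-byte 
Poly1305 blocks, so a peel can only reach a 64-byte boundary when the input 
address is itself 16-byte aligned:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not in the PR): bytes from `addr` up to the next
// 64-byte boundary, or 0 if `addr` is already 64-byte aligned.
inline std::size_t bytes_to_align64(std::uintptr_t addr) {
  return static_cast<std::size_t>((64 - (addr & 63)) & 63);
}

// Result of planning a scalar peel in front of the vector loop.
struct Peel {
  std::size_t blocks;   // whole 16-byte blocks to handle in scalar code
  bool aligned_after;   // true if the vector loop would start 64-byte aligned
};

// Assumption: the peel must consume whole 16-byte blocks, otherwise the
// Poly1305 block boundaries would shift. If the gap is not a multiple of 16,
// or the input is shorter than the gap, no useful peel exists.
inline Peel scalar_peel(std::uintptr_t addr, std::size_t length) {
  const std::size_t gap = bytes_to_align64(addr);
  assert(gap < 64);
  if (gap % 16 != 0 || gap > length) {
    return {0, false};
  }
  return {gap / 16, true};
}
```

For example, `scalar_peel(16, 1024)` asks for 3 scalar blocks (48 bytes), 
while an address of 8 can never be brought to 64-byte alignment this way.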

Perhaps to have a record, while it's relatively fresh in my mind: since there 
is an 8-block vector multiply (I deleted a 16-block vector multiply), one can 
have a peeled-off version for just 256 as the minimum payload. In that case we 
only need R^1..R^8 (not R^1..R^16). I also tried a loop stride of 8 blocks 
instead of 16, but that gets quite a bit slower (20ish%?). There was also a 
version that did a much better interleaving of the multiplication with the 
loading of the next message block into limbs. There is potentially a better 
way to 'devolve' the vector loop at the tail, i.e. when 15 blocks are left, 
just do one more 8-block multiply; all the constants are already available.
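
As a note on why R^1..R^8 are enough for an 8-block step, and what 
'devolving' at the tail could look like, here is a toy scalar sketch in C++. 
It is not the PR's code or its limb layout: the modulus is 2^31 - 1 rather 
than Poly1305's 2^130 - 5 so that products fit in 64 bits, and the names 
`mulmod`, `horner`, `step8` and `process` are invented for the example. The 
point is only the identity that 8 Horner steps equal 
(h + m0)*R^8 + m1*R^7 + ... + m7*R^1:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy modulus (assumption for illustration only); real Poly1305 uses 2^130 - 5.
constexpr std::uint64_t P = 2147483647ULL;  // 2^31 - 1

inline std::uint64_t mulmod(std::uint64_t a, std::uint64_t b) {
  return (a % P) * (b % P) % P;  // both factors < 2^31, product fits in 64 bits
}

// One block at a time: h = (h + m[i]) * r mod P.
inline std::uint64_t horner(std::uint64_t h, const std::uint64_t* m,
                            std::size_t n, std::uint64_t r) {
  for (std::size_t i = 0; i < n; ++i) h = mulmod(h + m[i] % P, r);
  return h;
}

// Eight blocks in one step, using precomputed powers pw[k] = r^k, k = 1..8.
inline std::uint64_t step8(std::uint64_t h, const std::uint64_t* m,
                           const std::uint64_t* pw) {
  std::uint64_t acc = mulmod(h + m[0] % P, pw[8]);
  for (int i = 1; i < 8; ++i) acc = (acc + mulmod(m[i], pw[8 - i])) % P;
  return acc;
}

// Stride of 8 while possible, then fall back to single blocks. With 15
// blocks this is one 8-block step followed by 7 scalar steps.
inline std::uint64_t process(const std::vector<std::uint64_t>& m,
                             std::uint64_t r) {
  std::uint64_t pw[9];
  pw[0] = 1;
  for (int k = 1; k <= 8; ++k) pw[k] = mulmod(pw[k - 1], r);
  std::uint64_t h = 0;
  std::size_t i = 0;
  for (; i + 8 <= m.size(); i += 8) h = step8(h, &m[i], pw);
  return horner(h, m.data() + i, m.size() - i, r);
}
```

In the vectorized code the eight products would be computed in parallel 
lanes rather than in a loop; the sketch only checks the arithmetic against the 
plain block-by-block evaluation.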

I removed all of those eventually. Even then, the assembler code is currently 
already fairly complex. With the extra pre- and post-processing and if cases, 
I was struggling to keep up myself. Maybe code cleanup would have helped, so it 
_is_ possible to bring some of that back in for an extra 10+%? (There is a 
branch on my fork with that code.)

I guess that's my long way of saying 'I don't want to complicate the assembler 
loop'?

-------------

PR: https://git.openjdk.org/jdk/pull/10582
