Re: RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

Volodymyr Paprotski Fri, 04 Nov 2022 07:42:04 -0700

On Wed, 2 Nov 2022 03:16:57 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>>> And just looking now on uops.info, they seem to have identical timings?
>> 
>> Which instruction is used (aligned vs unaligned version) doesn't matter 
>> much here, because the penalty is a dynamic property of the address being 
>> accessed: misaligned accesses that cross a cache line boundary incur a 
>> penalty. Since cache lines are 64 bytes in size, every misaligned 512-bit 
>> access is penalized.
>
> I collected performance counters for the benchmark included with the patch, 
> and they show that around 30% of 64-byte loads span a cache line boundary.
> 
>  Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192':
> 
>       122385646614      cycles
>       328096538160      instructions              #    2.68  insn per cycle
>        64530343063      MEM_INST_RETIRED.ALL_LOADS
>        22900705491      MEM_INST_RETIRED.ALL_STORES
>        19815558484      MEM_INST_RETIRED.SPLIT_LOADS
>          701176106      MEM_INST_RETIRED.SPLIT_STORES
> 
> A scalar peel loop before the vector loop could avoid this penalty, but 
> since it operates over block streams that may be tricky. 
> We should also extend the scope of the optimization (preferably in this PR, 
> or in a subsequent one) to cover the [MAC computation routine accepting 
> ByteBuffer](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116).

To close this thread: @jatin-bhateja and I talked and realized that it is not 
possible to re-align the input here, at least not by peeling with a scalar 
loop. The scalar loop peels full blocks only (i.e. 16 bytes at a time). So out 
of the 64 possible starting offsets within a cache line, 1 is already aligned, 
3 can be aligned with the right peel, and the other 60 stay misaligned 
regardless.
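
For anyone who wants to check the counting argument above, here is a minimal 
sketch. It is not code from the patch; the class and method names are made up 
for illustration. It assumes 64-byte cache lines and a scalar loop that can 
only advance the input pointer in whole 16-byte blocks, and it simply 
enumerates all 64 starting offsets.

```java
// Illustrative only: PeelAlignment, blocksToPeel and classify are
// hypothetical names, not part of the JDK or of this PR.
class PeelAlignment {
    static final int CACHE_LINE = 64; // bytes; also the width of one 512-bit load
    static final int BLOCK = 16;      // bytes consumed per scalar-loop iteration

    // Number of 16-byte blocks to peel so that (offset + peeled bytes) is
    // 64-byte aligned, or -1 if no whole number of blocks gets there.
    static int blocksToPeel(int offset) {
        for (int k = 0; k < CACHE_LINE / BLOCK; k++) {
            if ((offset + k * BLOCK) % CACHE_LINE == 0) {
                return k;
            }
        }
        return -1;
    }

    // Counts offsets that are {already aligned, fixable by peeling, never aligned}.
    static int[] classify() {
        int aligned = 0, fixable = 0, unfixable = 0;
        for (int offset = 0; offset < CACHE_LINE; offset++) {
            int k = blocksToPeel(offset);
            if (k == 0) {
                aligned++;
            } else if (k > 0) {
                fixable++;
            } else {
                unfixable++;
            }
        }
        return new int[] {aligned, fixable, unfixable};
    }

    public static void main(String[] args) {
        int[] c = classify();
        System.out.println("already aligned: " + c[0]
                + ", fixable by peeling: " + c[1]
                + ", never aligned: " + c[2]);
    }
}
```

Only offsets that are multiples of 16 can ever reach a 64-byte boundary this 
way, which is where the 1 + 3 split comes from; every other offset keeps its 
low four bits no matter how many blocks are peeled.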

-------------

PR: https://git.openjdk.org/jdk/pull/10582
