On Tue, 1 Nov 2022 23:04:45 GMT, Vladimir Ivanov <vliva...@openjdk.org> wrote:

>> Hmm.. interesting. Is this for loading? `evmovdquq` vs `evmovdqaq`? I was 
>> actually looking at using evmovdqaq, but there is no encoding for it yet 
>> (and just looking now on uops.info, they seem to have identical timings? 
>> Perhaps their measurements are off..). There are quite a few optimizations 
>> I tried (and removed) here, but not this one..
>> 
>> Perhaps to have a record, while it's relatively fresh in my mind.. since 
>> there is an 8-block multiply (I deleted a 16-block vector multiply), one 
>> can have a peeled-off version for just 256 as the minimum payload.. In 
>> that case we only need R^1..R^8 (not R^1..R^16). I also tried a loop 
>> stride of 8 blocks instead of 16, but that gets quite a bit slower 
>> (20ish%?).. There was also a version that did a much better job of 
>> interleaving the multiplication with loading the next message block into 
>> limbs.. There is potentially a better way to 'devolve' the vector loop at 
>> the tail; i.e. when 15 blocks are left, just do one more 8-block multiply; 
>> all the constants are already available..
>> 
>> I removed all of those eventually. Even then, the assembler code is 
>> already fairly complex. With the extra pre- and post-processing and the 
>> special cases, I was struggling to keep up myself. Maybe code cleanup 
>> would help, so it _is_ possible to bring some of that back for an extra 
>> 10+%? (There is a branch on my fork with that code.)
>> 
>> I guess that's my long way of saying 'I don't want to complicate the 
>> assembler loop'?
>
>> And just looking now on uops.info, they seem to have identical timings?
> 
> The actual instruction used (aligned vs. unaligned version) doesn't matter 
> much here, because the penalty is a dynamic property of the address being 
> accessed: misaligned accesses that cross a cache line boundary incur a 
> penalty. Since a cache line is 64 bytes, every misaligned 512-bit access 
> crosses a line boundary and is penalized.

I collected performance counters for the benchmark included with the patch, 
and they show that around 30% of the 64-byte loads split a cache line 
(MEM_INST_RETIRED.SPLIT_LOADS / MEM_INST_RETIRED.ALL_LOADS = 
19,815,558,484 / 64,530,343,063 ≈ 0.31):

 Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192':

      122385646614      cycles
      328096538160      instructions              #    2.68  insn per cycle
       64530343063      MEM_INST_RETIRED.ALL_LOADS
       22900705491      MEM_INST_RETIRED.ALL_STORES
       19815558484      MEM_INST_RETIRED.SPLIT_LOADS
         701176106      MEM_INST_RETIRED.SPLIT_STORES
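
To tie the counters back to alignment: a 64-byte load splits a cache line 
exactly when its starting address is not 64-byte aligned. A minimal Java 
sketch of that condition (illustration only, not code from the patch):

    final class SplitLoadCheck {
        // A 64-byte (512-bit) load starting at address p crosses a 64-byte
        // cache line boundary iff p's offset within the line is nonzero.
        static boolean splitsCacheLine(long p) {
            return (p & 63) != 0;
        }
    }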

A scalar peel loop before the vector loop could avoid this penalty. 
We should also extend the scope of the optimization (preferably in this PR 
or in a subsequent one) to cover the [MAC computation routine accepting a 
ByteBuffer](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116).
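
For illustration, the peel idea could look roughly like this (a hedged Java 
sketch of the control flow only; scalarBlock/vectorLoop are hypothetical 
helpers, and the real stub would do this in assembly):

    final class PeelSketch {
        // Process 16-byte Poly1305 blocks with the scalar path until the
        // source address is 64-byte aligned, then enter the vector loop so
        // every 512-bit load stays within one cache line. Caveat: peeling
        // whole 16-byte blocks only reaches 64-byte alignment when 'src'
        // starts out 16-byte aligned; arbitrary offsets need finer peeling.
        static void poly1305Blocks(long src, int len) {
            while ((src & 63) != 0 && len >= 16) {
                scalarBlock(src);          // hypothetical scalar block step
                src += 16;
                len -= 16;
            }
            vectorLoop(src, len);          // loads now start line-aligned
        }

        private static void scalarBlock(long src) { /* placeholder */ }
        private static void vectorLoop(long src, int len) { /* placeholder */ }
    }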
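
As a usage sketch for the ByteBuffer path (standard javax.crypto API only; 
whether the provider's dispatch reaches Poly1305's engineUpdate(ByteBuffer) 
without an intermediate copy is an assumption that depends on the 
implementation):

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;
    import java.nio.ByteBuffer;
    import java.security.SecureRandom;

    public class ByteBufferMacPath {
        public static void main(String[] args) throws Exception {
            SecretKey key = KeyGenerator.getInstance("ChaCha20").generateKey();
            byte[] nonce = new byte[12];                 // 96-bit nonce
            new SecureRandom().nextBytes(nonce);

            // ChaCha20-Poly1305 computes a Poly1305 MAC over the ciphertext.
            Cipher c = Cipher.getInstance("ChaCha20-Poly1305");
            c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(nonce));

            ByteBuffer in  = ByteBuffer.allocateDirect(8192);
            ByteBuffer out = ByteBuffer.allocateDirect(8192 + 16); // + tag
            c.doFinal(in, out);
        }
    }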

-------------

PR: https://git.openjdk.org/jdk/pull/10582
