On Wed, 2 Nov 2022 03:16:57 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>>> And just looking now on uops.info, they seem to have identical timings?
>>
>> Actual instruction being used (aligned vs unaligned versions) doesn't matter
>> much here, because it's a dynamic property of the address being accessed:
>> misaligned accesses that cross a cache line boundary incur a penalty. Since
>> cache lines are 64 bytes in size, every misaligned 512-bit access is
>> penalized.
>
> I collected performance counters for the benchmark included with the patch,
> and they show that around 30% of 64-byte loads spanned a cache line
> boundary.
>
> Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192':
>
>     122385646614      cycles
>     328096538160      instructions    # 2.68 insn per cycle
>      64530343063      MEM_INST_RETIRED.ALL_LOADS
>      22900705491      MEM_INST_RETIRED.ALL_STORES
>      19815558484      MEM_INST_RETIRED.SPLIT_LOADS
>        701176106      MEM_INST_RETIRED.SPLIT_STORES
>
> A scalar peel loop before the vector loop could avoid this penalty, but
> given that it operates over block streams it may be tricky.
> We should also extend the scope of the optimization (preferably in this PR or in
> a subsequent one) to optimize the [MAC computation routine accepting
> ByteBuffer](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116).

To close this thread: @jatin-bhateja and I talked and realized that it is not possible to re-align the input here, at least not by peeling with a scalar loop. The scalar loop peels full blocks only (i.e. 16 bytes at a time). So out of the 64 possible starting positions within a cache line, 1 is already aligned, 3 could be aligned with the right peel, and 60 will land badly regardless.

-------------

PR: https://git.openjdk.org/jdk/pull/10582
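
For illustration only, the 1 / 3 / 60 count above can be checked with a small sketch. The class and method names here are hypothetical and not part of the patch; it assumes 64-byte cache lines and a scalar loop that can only peel whole 16-byte blocks, as described in the thread.

```java
// Hypothetical sketch, not code from the PR: counts which starting offsets
// within a 64-byte cache line can be re-aligned by peeling whole 16-byte blocks.
class PeelAlignment {
    static final int CACHE_LINE = 64; // bytes, as stated in the thread
    static final int BLOCK = 16;      // bytes peeled per scalar iteration

    // Returns the number of whole blocks to peel so that the vector loop
    // starts on a cache line boundary, or -1 if no such peel exists.
    static int blocksToPeel(int offset) {
        for (int k = 0; k < CACHE_LINE / BLOCK; k++) {
            if ((offset + k * BLOCK) % CACHE_LINE == 0) {
                return k;
            }
        }
        return -1;
    }

    // Returns {alreadyAligned, fixableByPeeling, neverAligned} over all offsets.
    static int[] classify() {
        int[] counts = new int[3];
        for (int offset = 0; offset < CACHE_LINE; offset++) {
            int k = blocksToPeel(offset);
            if (k == 0) {
                counts[0]++;
            } else if (k > 0) {
                counts[1]++;
            } else {
                counts[2]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = classify();
        System.out.println("aligned=" + c[0] + " fixable=" + c[1] + " stuck=" + c[2]);
    }
}
```

Only offsets that are multiples of 16 can ever reach a 64-byte boundary this way, which is why the remaining 60 offsets stay misaligned no matter how many blocks are peeled.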