On Thu, 5 May 2022 02:00:04 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
> Currently the vectorization of masked vector store is implemented by the > masked store instruction only on architectures that support the predicate > feature. The compiler will fall back to the java scalar code for > non-predicate supported architectures like ARM NEON. However, for these > systems, the masked store can be vectorized with the non-masked vector `"load > + blend + store"`. For example, storing a vector` "v"` controlled by a mask` > "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be > implemented with: > > > 1) mem_v = load(addr) ; non-masked load from the same memory > 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask > 3) store(addr, v) ; non-masked store into the memory > > > Since the first full loading needs the array offset must be inside of the > valid array bounds, we make the compiler do the vectorization only when the > offset is in range of the array boundary. And the compiler will still fall > back to the java scalar code if not all offsets are valid. Besides, the > original offset check for masked lanes are only applied when the offset is > not always inside of the array range. This also improves the performance for > masked store when the offset is always valid. The whole process is similar to > the masked load API. > > Here is the performance data for the masked vector store benchmarks on a X86 > non avx-512 system, which improves about `20x ~ 50x`: > > Benchmark before after Units > StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms > StoreMaskedBenchmark.longStoreArrayMask 40.929 1032.928 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms > > Similar performance gain can also be observed on ARM NEON system. > > And here is the performance data on X86 avx-512 system, which improves about > `1.88x - 2.81x`: > > Benchmark before after Units > StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms > StoreMaskedBenchmark.longStoreArrayMask 2025.031 4604.504 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms > > Similar performance gain can also be observed on ARM SVE system. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/8544