Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]
On Tue, 7 Jun 2022 04:29:40 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the >> array boundary is implemented with pure java scalar code to avoid the IOOBE >> (IndexOutOfBoundaryException). This is necessary for architectures that do >> not support the predicate feature. Because the masked load is implemented >> with a full vector load and a vector blend applied on it. And a full vector >> load will definitely cause the IOOBE which is not valid. However, for >> architectures that support the predicate feature like SVE/AVX-512/RVV, it >> can be vectorized with the predicated load instruction as long as the >> indexes of the masked lanes are within the bounds of the array. For these >> architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE >> part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >>byte[] a, int offset, >>VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must >> check whether the current backend supports it and falls back to the java >> scalar version if not. This is different from the normal masked vector load >> that the compiler will generate a full vector load and a vector blend if the >> predicate load is not supported. So to let the compiler make the expected >> action, an additional flag (i.e. `usePred`) is added to the existing >> "loadMasked" intrinsic, with the value "true" for the IOOBE part while >> "false" for the normal load. And the compiler will fail to intrinsify if the >> flag is "true" and the predicate load is not supported by the backend, which >> means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` >> on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains seven commits: > > - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE > - Merge branch 'jdk:master' into JDK-8283667 > - Merge branch 'jdk:master' into JDK-8283667 > - Use integer constant for offsetInRange all the way through > - Rename "use_predicate" to "needs_predicate" > - Rename the "usePred" to "offsetInRange" > - 8283667: [vectorapi] Vectorization for masked load with IOOBE with > predicate feature LGTM. - PR: https://git.openjdk.java.net/jdk/pull/8035
Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v6]
> Currently the vector load with mask when the given index happens out of the > array boundary is implemented with pure java scalar code to avoid the IOOBE > (IndexOutOfBoundaryException). This is necessary for architectures that do > not support the predicate feature. Because the masked load is implemented > with a full vector load and a vector blend applied on it. And a full vector > load will definitely cause the IOOBE which is not valid. However, for > architectures that support the predicate feature like SVE/AVX-512/RVV, it can > be vectorized with the predicated load instruction as long as the indexes of > the masked lanes are within the bounds of the array. For these architectures, > loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE > part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies species, >byte[] a, int offset, >VectorMask m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must > check whether the current backend supports it and falls back to the java > scalar version if not. This is different from the normal masked vector load > that the compiler will generate a full vector load and a vector blend if the > predicate load is not supported. So to let the compiler make the expected > action, an additional flag (i.e. `usePred`) is added to the existing > "loadMasked" intrinsic, with the value "true" for the IOOBE part while > "false" for the normal load. And the compiler will fail to intrinsify if the > flag is "true" and the predicate load is not supported by the backend, which > means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` > on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - Add constant OFFSET_IN_RANGE and OFFSET_OUT_OF_RANGE - Merge branch 'jdk:master' into JDK-8283667 - Merge branch 'jdk:master' into JDK-8283667 - Use integer constant for offsetInRange all the way through - Rename "use_predicate" to "needs_predicate" - Rename the "usePred" to "offsetInRange" - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature - Changes: https://git.openjdk.java.net/jdk/pull/8035/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8035=05 Stats: 453 lines in 44 files changed: 174 ins; 21 del; 258 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035