Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-07 Thread Jatin Bhateja
On Tue, 7 Jun 2022 02:22:53 GMT, Xiaohong Gong  wrote:

>> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java
>>  line 97:
>> 
>>> 95: public void byteLoadArrayMaskIOOBE() {
>>> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
>>> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, 
>>> i);
>> 
>> For other case "if (offset >= 0 && offset <= (a.length - species.length())) 
>> )" we are anyways intrinsifying, should we limit this micro to work only for 
>> newly optimized case.
>
> Yeah, thanks and it's really a good suggestion to limit this benchmark only 
> for the IOOBE cases. I locally modified the tests to make sure only the IOOBE 
> case happens and the results show good as well. But do you think it's better 
> to keep as it is since we can also see the performance of the common cases to 
> make sure no regressions happen? As the current benchmarks can also show the 
> performance gain by this PR.

It was just to remove the noise from a targeted micro benchmark. But we can 
keep it as it is.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-07 Thread Xiaohong Gong
On Tue, 7 Jun 2022 06:41:36 GMT, Jatin Bhateja  wrote:

>> Yeah, thanks and it's really a good suggestion to limit this benchmark only 
>> for the IOOBE cases. I locally modified the tests to make sure only the 
>> IOOBE case happens and the results show good as well. But do you think it's 
>> better to keep as it is since we can also see the performance of the common 
>> cases to make sure no regressions happen? As the current benchmarks can also 
>> show the performance gain by this PR.
>
> It was just to remove the noise from a targeted micro benchmark. But we can 
> keep it as it is.

OK, thanks!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Mon, 6 Jun 2022 15:41:06 GMT, Paul Sandoz  wrote:

> Looks good. As a follow on PR I think it would be useful to add constants 
> `OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`, then it becomes much clearer in 
> source and you can drop the `/* offsetInRange */` comment on the argument.

Hi @PaulSandoz , thanks for the advice! I'v rebased the codes and added these 
two constants in the latest patch. Thanks again for the review!

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Xiaohong Gong
On Mon, 6 Jun 2022 10:40:45 GMT, Jatin Bhateja  wrote:

>> Xiaohong Gong has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains five commits:
>> 
>>  - Merge branch 'jdk:master' into JDK-8283667
>>  - Use integer constant for offsetInRange all the way through
>>  - Rename "use_predicate" to "needs_predicate"
>>  - Rename the "usePred" to "offsetInRange"
>>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
>> predicate feature
>
> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java
>  line 97:
> 
>> 95: public void byteLoadArrayMaskIOOBE() {
>> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
>> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i);
> 
> For other case "if (offset >= 0 && offset <= (a.length - species.length())) 
> )" we are anyways intrinsifying, should we limit this micro to work only for 
> newly optimized case.

Yeah, thanks and it's really a good suggestion to limit this benchmark only for 
the IOOBE cases. I locally modified the tests to make sure only the IOOBE case 
happens and the results show good as well. But do you think it's better to keep 
as it is since we can also see the performance of the common cases to make sure 
no regressions happen? As the current benchmarks can also show the performance 
gain by this PR.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Paul Sandoz
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

Looks good. As a follow on PR I think it would be useful to add constants 
`OFFSET_IN_RANGE` and `OFFSET_OUT_OF_RANGE`, then it becomes much clearer in 
source and you can drop the `/* offsetInRange */` comment on the argument.

-

Marked as reviewed by psandoz (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-06 Thread Jatin Bhateja
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java 
line 97:

> 95: public void byteLoadArrayMaskIOOBE() {
> 96: for (int i = 0; i < inSize; i += bspecies.length()) {
> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i);

For other case "if (offset >= 0 && offset <= (a.length - species.length())) )" 
we are anyways intrinsifying, should we limit this micro to work only for newly 
optimized case.

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-02 Thread Sandhya Viswanathan
On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong  wrote:

>> Currently the vector load with mask when the given index happens out of the 
>> array boundary is implemented with pure java scalar code to avoid the IOOBE 
>> (IndexOutOfBoundaryException). This is necessary for architectures that do 
>> not support the predicate feature. Because the masked load is implemented 
>> with a full vector load and a vector blend applied on it. And a full vector 
>> load will definitely cause the IOOBE which is not valid. However, for 
>> architectures that support the predicate feature like SVE/AVX-512/RVV, it 
>> can be vectorized with the predicated load instruction as long as the 
>> indexes of the masked lanes are within the bounds of the array. For these 
>> architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE 
>> part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies species,
>>byte[] a, int offset,
>>VectorMask m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>   if (offset >= 0 && offset <= (a.length - species.length())) {
>>   return vsp.dummyVector().fromArray0(a, offset, m);
>>   }
>> 
>>   // FIXME: optimize
>>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>   return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must 
>> check whether the current backend supports it and falls back to the java 
>> scalar version if not. This is different from the normal masked vector load 
>> that the compiler will generate a full vector load and a vector blend if the 
>> predicate load is not supported. So to let the compiler make the expected 
>> action, an additional flag (i.e. `usePred`) is added to the existing 
>> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
>> "false" for the normal load. And the compiler will fail to intrinsify if the 
>> flag is "true" and the predicate load is not supported by the backend, which 
>> means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
>> on the x86 AVX-512 system:
>> 
>> Benchmark  before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with 
> predicate feature

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/8035


Re: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

2022-06-01 Thread Xiaohong Gong
> Currently the vector load with mask when the given index happens out of the 
> array boundary is implemented with pure java scalar code to avoid the IOOBE 
> (IndexOutOfBoundaryException). This is necessary for architectures that do 
> not support the predicate feature. Because the masked load is implemented 
> with a full vector load and a vector blend applied on it. And a full vector 
> load will definitely cause the IOOBE which is not valid. However, for 
> architectures that support the predicate feature like SVE/AVX-512/RVV, it can 
> be vectorized with the predicated load instruction as long as the indexes of 
> the masked lanes are within the bounds of the array. For these architectures, 
> loading with unmasked lanes does not raise exception.
> 
> This patch adds the vectorization support for the masked load with IOOBE 
> part. Please see the original java implementation (FIXME: optimize):
> 
> 
>   @ForceInline
>   public static
>   ByteVector fromArray(VectorSpecies species,
>byte[] a, int offset,
>VectorMask m) {
>   ByteSpecies vsp = (ByteSpecies) species;
>   if (offset >= 0 && offset <= (a.length - species.length())) {
>   return vsp.dummyVector().fromArray0(a, offset, m);
>   }
> 
>   // FIXME: optimize
>   checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>   return vsp.vOp(m, i -> a[offset + i]);
>   }
> 
> Since it can only be vectorized with the predicate load, the hotspot must 
> check whether the current backend supports it and falls back to the java 
> scalar version if not. This is different from the normal masked vector load 
> that the compiler will generate a full vector load and a vector blend if the 
> predicate load is not supported. So to let the compiler make the expected 
> action, an additional flag (i.e. `usePred`) is added to the existing 
> "loadMasked" intrinsic, with the value "true" for the IOOBE part while 
> "false" for the normal load. And the compiler will fail to intrinsify if the 
> flag is "true" and the predicate load is not supported by the backend, which 
> means that normal java path will be executed.
> 
> Also adds the same vectorization support for masked:
>  - fromByteArray/fromByteBuffer
>  - fromBooleanArray
>  - fromCharArray
> 
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` 
> on the x86 AVX-512 system:
> 
> Benchmark  before   After  Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
> 
> Similar performance gain can also be observed on 512-bit SVE system.

Xiaohong Gong has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains five commits:

 - Merge branch 'jdk:master' into JDK-8283667
 - Use integer constant for offsetInRange all the way through
 - Rename "use_predicate" to "needs_predicate"
 - Rename the "usePred" to "offsetInRange"
 - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate 
feature

-

Changes: https://git.openjdk.java.net/jdk/pull/8035/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk=8035=04
  Stats: 447 lines in 43 files changed: 168 ins; 21 del; 258 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8035.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035

PR: https://git.openjdk.java.net/jdk/pull/8035