On Fri, 11 Mar 2022 06:29:22 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

> The current vector `"NEG"` is implemented by subtracting the vector from zero
> in case the architecture does not support a negation instruction. To fit the
> predicate feature on architectures that support it, the masked vector `"NEG"`
> is implemented with the pattern `"v.not(m).add(1, m)"`. Both can be optimized
> to a single negation instruction for ARM SVE, and so can the non-masked
> `"NEG"` for NEON. Besides, implementing the masked `"NEG"` with subtraction on
> architectures that support neither a negation instruction nor the predicate
> feature also takes several fewer instructions than the current pattern.
> 
> To optimize the VectorAPI negation, this patch moves the implementation from
> the Java side to HotSpot. The compiler will generate different nodes according
> to the architecture:
>   - Generate the (predicated) negation node if the architecture supports it;
> otherwise, generate the `"zero.sub(v)"` pattern for the non-masked operation.
>   - Generate `"zero.sub(v, m)"` for the masked operation if the architecture
> does not have the predicate feature; otherwise, generate the original pattern
> `"v.xor(-1, m).add(1, m)"`.
> 
> So with this patch, the following transformations are applied:
> 
> For non-masked negation with NEON:
> 
>   movi    v16.4s, #0x0
>   sub v17.4s, v16.4s, v17.4s       ==> neg v17.4s, v17.4s
> 
> and with SVE:
> 
>   mov z16.s, #0
>   sub z18.s, z16.s, z17.s          ==> neg z16.s, p7/m, z16.s
> 
> For masked negation with NEON:
> 
>   movi    v17.4s, #0x1
>   mvn v19.16b, v18.16b
>   mov v20.16b, v16.16b             ==>  neg v18.4s, v17.4s
>   bsl v20.16b, v19.16b, v18.16b         bsl v19.16b, v18.16b, v17.16b
>   add v19.4s, v20.4s, v17.4s
>   mov v18.16b, v16.16b
>   bsl v18.16b, v19.16b, v20.16b
> 
> and with SVE:
> 
>   mov z16.s, #-1
>   mov z17.s, #1                    ==> neg z16.s, p0/m, z16.s
>   eor z18.s, p0/m, z18.s, z16.s
>   add z18.s, p0/m, z18.s, z17.s
> 
> Here are the performance gains for benchmarks (see [1][2]) on ARM and x86
> machines (note that the non-masked negation benchmarks do not have any
> improvement on x86 since no instructions are changed):
> 
> NEON:
> Benchmark                Gain
> Byte128Vector.NEG        1.029
> Byte128Vector.NEGMasked  1.757
> Short128Vector.NEG       1.041
> Short128Vector.NEGMasked 1.659
> Int128Vector.NEG         1.005
> Int128Vector.NEGMasked   1.513
> Long128Vector.NEG        1.003
> Long128Vector.NEGMasked  1.878
> 
> SVE (512-bit):
> Benchmark                Gain
> ByteMaxVector.NEG        1.10
> ByteMaxVector.NEGMasked  1.165
> ShortMaxVector.NEG       1.056
> ShortMaxVector.NEGMasked 1.195
> IntMaxVector.NEG         1.002
> IntMaxVector.NEGMasked   1.239
> LongMaxVector.NEG        1.031
> LongMaxVector.NEGMasked  1.191
> 
> x86 (non-AVX-512):
> Benchmark                Gain
> ByteMaxVector.NEGMasked  1.254
> ShortMaxVector.NEGMasked 1.359
> IntMaxVector.NEGMasked   1.431
> LongMaxVector.NEGMasked  1.989
> 
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1881
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L1896
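
For readers who want to check the arithmetic behind the patterns quoted above, here is a minimal scalar sketch in plain Java. It is not the HotSpot or Vector API code from the patch: the class `NegationPatterns` and its method names are hypothetical, an `int[]` stands in for a vector and a `boolean[]` for the mask, and the masked subtraction fallback is modeled as "negate, then blend under the mask" (the `neg` + `bsl` shape in the NEON listing), assuming inactive lanes keep their input value.

```java
import java.util.Arrays;

// Hypothetical scalar model of the negation patterns; names are illustrative only.
public class NegationPatterns {

    // Non-masked fallback "zero.sub(v)": every lane becomes 0 - v[i].
    public static int[] negBySub(int[] v) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = 0 - v[i];
        }
        return r;
    }

    // Masked pattern "v.not(m).add(1, m)": active lanes become (~v[i]) + 1,
    // which is two's-complement negation; inactive lanes are left unchanged.
    public static int[] negMaskedNotAdd(int[] v, boolean[] m) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            int t = m[i] ? ~v[i] : v[i];   // same as v[i] ^ -1 on active lanes
            r[i] = m[i] ? t + 1 : t;
        }
        return r;
    }

    // Masked fallback without predicates, modeled as subtract-from-zero on all
    // lanes followed by a blend that keeps the input on inactive lanes (assumption).
    public static int[] negMaskedSubBlend(int[] v, boolean[] m) {
        int[] negated = negBySub(v);
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = m[i] ? negated[i] : v[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {1, -2, 0, Integer.MIN_VALUE};
        boolean[] m = {true, false, true, true};
        int[] a = negMaskedNotAdd(v, m);
        int[] b = negMaskedSubBlend(v, m);
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
        System.out.println("patterns agree: " + Arrays.equals(a, b));
    }
}
```

Both masked variants produce the same lanes, including for `Integer.MIN_VALUE`, where negation wraps around to the same value. That equivalence is what allows the compiler to choose whichever form is cheapest on a given architecture.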

This pull request has now been integrated.

Changeset: d0668568
Author:    Xiaohong Gong <xg...@openjdk.org>
Committer: Jie Fu <ji...@openjdk.org>
URL:       https://git.openjdk.java.net/jdk/commit/d06685680c17583d56dc3d788d9a2ecea8812bc8
Stats:     325 lines in 15 files changed: 275 ins; 25 del; 25 mod

8282162: [vector] Optimize integral vector negation API

Reviewed-by: jiefu, psandoz, njian

-------------

PR: https://git.openjdk.java.net/jdk/pull/7782
