Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
Hi,

Please note that currently the test:

  int a[N];
  short b[N*2];
  for (int i = 0; i < N; ++i)
    a[i] = b[i*2];

is compiled to (with -march=corei7 -O2 -ftree-vectorize):

  movdqa   b(%rax), %xmm0
  movdqa   b-16(%rax), %xmm2
  pand     %xmm1, %xmm0
  pand     %xmm1, %xmm2
  packusdw %xmm2, %xmm0
  pmovsxwd %xmm0, %xmm2
  psrldq   $8, %xmm0
  pmovsxwd %xmm0, %xmm0
  movaps   %xmm2, a-32(%rax)
  movaps   %xmm0, a-16(%rax)

which is much closer to the requested sequence.

Thanks,
Evgeny

On Wed, Jun 25, 2014 at 8:34 PM, Cong Hou co...@google.com wrote:
> [...]
>
> This sounds good to me! I'll try to make a patch following your
> suggestion. Thank you!
>
> Cong
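For readers following along, the sequence above corresponds roughly to the
SSE4.1 intrinsics below. This is only a sketch, not part of any posted
patch: the function name is invented, and it assumes %xmm1 holds a
0x0000ffff mask in each 32-bit lane, which is what the pand usage implies.

  /* Sketch: extract b[0], b[2], ..., b[14] into a[0..7], sign-extending
     short -> int, mirroring the pand/packusdw/pmovsxwd sequence above.  */
  #include <smmintrin.h>  /* SSE4.1 */

  void
  extract_even_shorts (int *a, const short *b)
  {
    const __m128i mask = _mm_set1_epi32 (0xffff);
    __m128i v0 = _mm_load_si128 ((const __m128i *) b);        /* b[0..7]  */
    __m128i v1 = _mm_load_si128 ((const __m128i *) (b + 8));  /* b[8..15] */
    /* pand: keep only the even-indexed short in each 32-bit lane.  */
    v0 = _mm_and_si128 (v0, mask);
    v1 = _mm_and_si128 (v1, mask);
    /* packusdw: every lane is <= 0xffff after masking, so the unsigned
       saturation never triggers and this is a pure narrowing.  */
    __m128i even = _mm_packus_epi32 (v0, v1);
    /* pmovsxwd: sign-extend each short back to an int.  */
    __m128i r0 = _mm_cvtepi16_epi32 (even);
    __m128i r1 = _mm_cvtepi16_epi32 (_mm_srli_si128 (even, 8));
    _mm_store_si128 ((__m128i *) a, r0);
    _mm_store_si128 ((__m128i *) (a + 4), r1);
  }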
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener richard.guent...@gmail.com wrote:
> [...]
>
> I suggest to implement this optimization in vector lowering in
> tree-vect-generic.c.  This sees for your example
>
>   vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
>   vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
>   vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34,
>                                      { 0, 2, 4, 6, 8, 10, 12, 14 }>;
>   vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
>   vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;
>
> where you can apply the pattern matching and transform (after checking
> with the target, of course).

This sounds good to me! I'll try to make a patch following your
suggestion. Thank you!

Cong
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
On Sat, May 3, 2014 at 2:39 AM, Cong Hou co...@google.com wrote:
> [...]
>
> I am not sure if the RTL transform is more difficult to implement. I
> prefer the widening-loads method, which can be detected in a pattern
> recognizer. The target-related issue can be resolved by expanding the
> widening load only on those targets where this pattern is beneficial.
> But this requires new tree operations to be defined. What is your
> suggestion?
>
> I apologize for the delayed reply.

Likewise ;)

I suggest to implement this optimization in vector lowering in
tree-vect-generic.c.  This sees for your example

  vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
  vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
  vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34,
                                     { 0, 2, 4, 6, 8, 10, 12, 14 }>;
  vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
  vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

where you can apply the pattern matching and transform (after checking
with the target, of course).

Richard.
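To make the suggestion concrete, here is a minimal sketch (not from any
posted patch) of the matching step such a lowering in tree-vect-generic.c
could start with. The helper name is invented; the accessors are the
GIMPLE/tree APIs as they existed at the time. It only recognizes the
even-element selector; the odd-element case and the target query would
follow the same shape.

  /* Return true if STMT is a VEC_PERM_EXPR whose constant selector picks
     the even elements { 0, 2, 4, ... } of its two inputs.  */
  static bool
  vec_perm_selects_even_elts_p (gimple stmt)
  {
    if (!is_gimple_assign (stmt)
        || gimple_assign_rhs_code (stmt) != VEC_PERM_EXPR)
      return false;

    tree sel = gimple_assign_rhs3 (stmt);
    if (TREE_CODE (sel) != VECTOR_CST)
      return false;

    for (unsigned int i = 0; i < VECTOR_CST_NELTS (sel); ++i)
      if (TREE_INT_CST_LOW (VECTOR_CST_ELT (sel, i)) != 2 * i)
        return false;

    return true;
  }

One would then verify that the uses of the result are exactly the
[vec_unpack_lo_expr]/[vec_unpack_hi_expr] statements shown above before
rewriting them into the shift (or bit-and) form.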
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
> On Thu, 24 Apr 2014, Cong Hou wrote:
> > [...]
>
> This is an odd place to implement such a transform.  Also whether it is
> faster or not depends on the exact ISA you target - for example ppc has
> constraints on the maximum number of shifts carried out in parallel,
> and the above has 4 in very short succession, especially on the
> sign-extend path.

Thank you for the information about ppc. If this is an issue, I think we
can do it in a target-dependent way.

> So this looks more like an opportunity for a post-vectorizer transform
> on RTL, or for the vectorizer special-casing widening loads with a
> vectorizer pattern.

I am not sure if the RTL transform is more difficult to implement. I
prefer the widening-loads method, which can be detected in a pattern
recognizer. The target-related issue can be resolved by expanding the
widening load only on those targets where this pattern is beneficial.
But this requires new tree operations to be defined. What is your
suggestion?

I apologize for the delayed reply.

thanks,
Cong
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
On Thu, 24 Apr 2014, Cong Hou wrote:
> Given the following loop:
>
>   int a[N];
>   short b[N*2];
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i*2];
>
> after being vectorized, the access to b[i*2] will be compiled into
> several packing statements, while the type promotion from short to int
> will be compiled into several unpacking statements. With this patch,
> each pair of pack/unpack statements is replaced by less expensive
> statements (with shift or bit-and operations).
>
> [...]
>
> Bootstrapped and tested on x86-64. OK for trunk?

This is an odd place to implement such a transform.  Also whether it is
faster or not depends on the exact ISA you target - for example ppc has
constraints on the maximum number of shifts carried out in parallel, and
the above has 4 in very short succession, especially on the sign-extend
path.

So this looks more like an opportunity for a post-vectorizer transform
on RTL, or for the vectorizer special-casing widening loads with a
vectorizer pattern.

Richard.
[PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
Given the following loop:

  int a[N];
  short b[N*2];
  for (int i = 0; i < N; ++i)
    a[i] = b[i*2];

after being vectorized, the access to b[i*2] will be compiled into
several packing statements, while the type promotion from short to int
will be compiled into several unpacking statements. With this patch,
each pair of pack/unpack statements is replaced by less expensive
statements (with shift or bit-and operations).

On x86_64, the loop above is currently compiled into the following
assembly (with -O2 -ftree-vectorize):

  movdqu    0x10(%rcx),%xmm3
  movdqu    -0x20(%rcx),%xmm0
  movdqa    %xmm0,%xmm2
  punpcklwd %xmm3,%xmm0
  punpckhwd %xmm3,%xmm2
  movdqa    %xmm0,%xmm3
  punpcklwd %xmm2,%xmm0
  punpckhwd %xmm2,%xmm3
  movdqa    %xmm1,%xmm2
  punpcklwd %xmm3,%xmm0
  pcmpgtw   %xmm0,%xmm2
  movdqa    %xmm0,%xmm3
  punpckhwd %xmm2,%xmm0
  punpcklwd %xmm2,%xmm3
  movups    %xmm0,-0x10(%rdx)
  movups    %xmm3,-0x20(%rdx)

With this patch, the generated assembly is shown below:

  movdqu 0x10(%rcx),%xmm0
  movdqu -0x20(%rcx),%xmm1
  pslld  $0x10,%xmm0
  psrad  $0x10,%xmm0
  pslld  $0x10,%xmm1
  movups %xmm0,-0x10(%rdx)
  psrad  $0x10,%xmm1
  movups %xmm1,-0x20(%rdx)

Bootstrapped and tested on x86-64. OK for trunk?

thanks,
Cong


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..e7143f1 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	* tree-vect-stmts.c (detect_pack_unpack_pattern): New function.
+	(vect_gen_widened_results_half): Call detect_pack_unpack_pattern.
+
 2014-04-23  David Malcolm  <dmalc...@redhat.com>
 
 	* is-a.h: Update comments to reflect the following changes to the

diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..a8755b3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+	* gcc.dg/vect/vect-125.c: New test.
+
 2014-04-23  Jeff Law  <l...@redhat.com>
 
 	PR tree-optimization/60902

diff --git a/gcc/testsuite/gcc.dg/vect/vect-125.c b/gcc/testsuite/gcc.dg/vect/vect-125.c
new file mode 100644
index 0000000..988dea6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-125.c
@@ -0,0 +1,122 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <limits.h>
+#include "tree-vect.h"
+
+#define N 64
+
+char b[N];
+unsigned char c[N];
+short d[N];
+unsigned short e[N];
+
+__attribute__((noinline)) void
+test1 ()
+{
+  int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = b[i*2];
+      a[i+N/2] = b[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = c[i*2];
+      a[i+N/2] = c[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = d[i*2];
+      a[i+N/2] = d[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = e[i*2];
+      a[i+N/2] = e[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+      abort ();
+}
+
+__attribute__((noinline)) void
+test2 ()
+{
+  unsigned int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = b[i*2];
+      a[i+N/2] = b[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = c[i*2];
+      a[i+N/2] = c[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = d[i*2];
+      a[i+N/2] = d[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = e[i*2];
+      a[i+N/2] = e[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+      abort ();
+}
+
+int
+main ()
+{
+  b[0] = CHAR_MIN;
+  c[0] = UCHAR_MAX;
+  d[0] = SHRT_MIN;
+  e[0] = USHRT_MAX;
+
+  int i;
+  for (i = 1; i < N; i++)
+    {
+      b[i] = b[i-1] + 1;
+      c[i] = c[i-1] - 1;
+      d[i] = d[i-1] + 1;
+      e[i] = e[i-1] - 1;
+    }
+
+  test1 ();
+  test2 ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "A pack-unpack pattern is recognized" 32 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..d0cf1f4 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -3191,6 +3191,174 @@ vectorizable_simd_clone_call (gimple stmt, gimple_stmt_iterator *gsi,
 }
 
 
+/* Function detect_pack_unpack_pattern
+
+   Detect the following pattern:
+
+     S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 0, 2, 4, ... }>;
+   or
+     S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 1, 3, 5, ... }>;
+
+     S2  vect4 = [vec_unpack_lo_expr]
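A side note on why the two-shift replacement is correct on a little-endian
target such as x86: within each 32-bit lane, b[i*2] occupies the low 16
bits and b[i*2+1] the high 16 bits, so shifting left by 16 and then
arithmetically right by 16 (pslld/psrad) yields the sign-extended even
element, while a single bit-and with 0xffff yields the zero-extended one.
A tiny standalone check of that lane arithmetic (a sketch, not part of the
patch; it relies on GCC's implementation-defined arithmetic right shift
and unsigned-to-signed conversion):

  #include <assert.h>
  #include <string.h>

  int
  main (void)
  {
    short b[2] = { -1234, 5678 };
    int lane;
    memcpy (&lane, b, sizeof lane);   /* one 32-bit vector lane */

    /* Sign-extend path: pslld $16 then psrad $16 per lane.  */
    int lo = (int) ((unsigned int) lane << 16) >> 16;
    assert (lo == b[0]);

    /* Zero-extend path: a single bit-and per lane.  */
    assert ((lane & 0xffff) == (unsigned short) b[0]);

    return 0;
  }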