Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-11-21 Thread Evgeny Stupachenko
Hi,

Please note that currently the test:

int a[N];
short b[N*2];

for (int i = 0; i < N; ++i)
  a[i] = b[i*2];

is compiled to the following (with -march=corei7 -O2 -ftree-vectorize):

movdqa  b(%rax), %xmm0
movdqa  b-16(%rax), %xmm2
pand    %xmm1, %xmm0
pand    %xmm1, %xmm2
packusdw %xmm2, %xmm0
pmovsxwd %xmm0, %xmm2
psrldq  $8, %xmm0
pmovsxwd %xmm0, %xmm0
movaps  %xmm2, a-32(%rax)
movaps  %xmm0, a-16(%rax)

This is closer to the requested sequence.

Thanks,
Evgeny


On Wed, Jun 25, 2014 at 8:34 PM, Cong Hou co...@google.com wrote:
 On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Sat, May 3, 2014 at 2:39 AM, Cong Hou co...@google.com wrote:
 On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:

 int a[N];
 short b[N*2];

 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];


 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).

 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):

 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)


 With this patch, the generated assembly is shown below:

 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)


 Bootstrapped and tested on x86-64. OK for trunk?

 This is an odd place to implement such a transform.  Also, whether it
 is faster or not depends on the exact ISA you target - for
 example, ppc has constraints on the maximum number of shifts
 carried out in parallel, and the above has 4 in very short
 succession, especially on the sign-extend path.

 Thank you for the information about ppc. If this is an issue, I think
 we can do it in a target-dependent way.



 So this looks more like an opportunity for a post-vectorizer
 transform on RTL or for the vectorizer special-casing
 widening loads with a vectorizer pattern.

 I am not sure whether the RTL transform is more difficult to implement. I
 prefer the widening-loads method, which can be detected in a pattern
 recognizer. The target-related issue will be resolved by only
 expanding the widening load on those targets where this pattern is
 beneficial. But this requires new tree operations to be defined. What
 is your suggestion?

 I apologize for the delayed reply.

 Likewise ;)

 I suggest implementing this optimization in vector lowering in
 tree-vect-generic.c.  For your example it sees

   vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
   vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
   vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
   vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
   vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

 where you can apply the pattern matching and transform (after checking
 with the target, of course).

 This sounds good to me! I'll try to make a patch following your suggestion.

 Thank you!


 Cong


 Richard.


 thanks,
 Cong


 Richard.


Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-06-25 Thread Cong Hou
On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sat, May 3, 2014 at 2:39 AM, Cong Hou co...@google.com wrote:
 On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:

 int a[N];
 short b[N*2];

 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];


 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).

 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):

 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)


 With this patch, the generated assembly is shown below:

 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)


 Bootstrapped and tested on x86-64. OK for trunk?

 This is an odd place to implement such a transform.  Also, whether it
 is faster or not depends on the exact ISA you target - for
 example, ppc has constraints on the maximum number of shifts
 carried out in parallel, and the above has 4 in very short
 succession, especially on the sign-extend path.

 Thank you for the information about ppc. If this is an issue, I think
 we can do it in a target-dependent way.



 So this looks more like an opportunity for a post-vectorizer
 transform on RTL or for the vectorizer special-casing
 widening loads with a vectorizer pattern.

 I am not sure whether the RTL transform is more difficult to implement. I
 prefer the widening-loads method, which can be detected in a pattern
 recognizer. The target-related issue will be resolved by only
 expanding the widening load on those targets where this pattern is
 beneficial. But this requires new tree operations to be defined. What
 is your suggestion?

 I apologize for the delayed reply.

 Likewise ;)

 I suggest implementing this optimization in vector lowering in
 tree-vect-generic.c.  For your example it sees

   vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
   vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
   vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
   vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
   vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

 where you can apply the pattern matching and transform (after checking
 with the target, of course).

This sounds good to me! I'll try to make a patch following your suggestion.

Thank you!


Cong


 Richard.


 thanks,
 Cong


 Richard.


Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-06-24 Thread Richard Biener
On Sat, May 3, 2014 at 2:39 AM, Cong Hou co...@google.com wrote:
 On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:

 int a[N];
 short b[N*2];

 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];


 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).

 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):

 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)


 With this patch, the generated assembly is shown below:

 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)


 Bootstrapped and tested on x86-64. OK for trunk?

 This is an odd place to implement such a transform.  Also, whether it
 is faster or not depends on the exact ISA you target - for
 example, ppc has constraints on the maximum number of shifts
 carried out in parallel, and the above has 4 in very short
 succession, especially on the sign-extend path.

 Thank you for the information about ppc. If this is an issue, I think
 we can do it in a target-dependent way.



 So this looks more like an opportunity for a post-vectorizer
 transform on RTL or for the vectorizer special-casing
 widening loads with a vectorizer pattern.

 I am not sure whether the RTL transform is more difficult to implement. I
 prefer the widening-loads method, which can be detected in a pattern
 recognizer. The target-related issue will be resolved by only
 expanding the widening load on those targets where this pattern is
 beneficial. But this requires new tree operations to be defined. What
 is your suggestion?

 I apologize for the delayed reply.

Likewise ;)

I suggest implementing this optimization in vector lowering in
tree-vect-generic.c.  For your example it sees

  vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
  vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
  vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
  vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
  vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

where you can apply the pattern matching and transform (after checking
with the target, of course).
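
For concreteness, here is a rough sketch of how those two statements could
be matched during lowering.  This is only a sketch, not the actual patch:
the helper name is made up, though the GIMPLE accessors and tree codes it
uses are existing ones, and a real implementation would additionally have
to check that the permutation selector picks the even (or odd) elements and
that the target supports a cheap vector shift/and replacement:

/* Return true if STMT is a vec_unpack_lo/hi whose operand is produced by a
   VEC_PERM_EXPR, i.e. a candidate for the pack/unpack rewrite.  */
static bool
unpack_of_perm_p (gimple stmt)
{
  if (!is_gimple_assign (stmt))
    return false;

  enum tree_code code = gimple_assign_rhs_code (stmt);
  if (code != VEC_UNPACK_LO_EXPR && code != VEC_UNPACK_HI_EXPR)
    return false;

  tree op = gimple_assign_rhs1 (stmt);
  if (TREE_CODE (op) != SSA_NAME)
    return false;

  gimple def = SSA_NAME_DEF_STMT (op);
  return (is_gimple_assign (def)
          && gimple_assign_rhs_code (def) == VEC_PERM_EXPR);
}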

Richard.


 thanks,
 Cong


 Richard.


Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-05-02 Thread Cong Hou
On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener rguent...@suse.de wrote:
 On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:

 int a[N];
 short b[N*2];

 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];


 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).

 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):

 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)


 With this patch, the generated assembly is shown below:

 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)


 Bootstrapped and tested on x86-64. OK for trunk?

 This is an odd place to implement such a transform.  Also, whether it
 is faster or not depends on the exact ISA you target - for
 example, ppc has constraints on the maximum number of shifts
 carried out in parallel, and the above has 4 in very short
 succession, especially on the sign-extend path.

Thank you for the information about ppc. If this is an issue, I think
we can do it in a target-dependent way.



 So this looks more like an opportunity for a post-vectorizer
 transform on RTL or for the vectorizer special-casing
 widening loads with a vectorizer pattern.

I am not sure whether the RTL transform is more difficult to implement. I
prefer the widening-loads method, which can be detected in a pattern
recognizer. The target-related issue will be resolved by only
expanding the widening load on those targets where this pattern is
beneficial. But this requires new tree operations to be defined. What
is your suggestion?

I apologize for the delayed reply.


thanks,
Cong


 Richard.


Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-04-28 Thread Richard Biener
On Thu, 24 Apr 2014, Cong Hou wrote:

 Given the following loop:
 
 int a[N];
 short b[N*2];
 
 for (int i = 0; i < N; ++i)
   a[i] = b[i*2];
 
 
 After being vectorized, the access to b[i*2] will be compiled into
 several packing statements, while the type promotion from short to int
 will be compiled into several unpacking statements. With this patch,
 each pair of pack/unpack statements will be replaced by less expensive
 statements (with shift or bit-and operations).
 
 On x86_64, the loop above will be compiled into the following assembly
 (with -O2 -ftree-vectorize):
 
 movdqu 0x10(%rcx),%xmm3
 movdqu -0x20(%rcx),%xmm0
 movdqa %xmm0,%xmm2
 punpcklwd %xmm3,%xmm0
 punpckhwd %xmm3,%xmm2
 movdqa %xmm0,%xmm3
 punpcklwd %xmm2,%xmm0
 punpckhwd %xmm2,%xmm3
 movdqa %xmm1,%xmm2
 punpcklwd %xmm3,%xmm0
 pcmpgtw %xmm0,%xmm2
 movdqa %xmm0,%xmm3
 punpckhwd %xmm2,%xmm0
 punpcklwd %xmm2,%xmm3
 movups %xmm0,-0x10(%rdx)
 movups %xmm3,-0x20(%rdx)
 
 
 With this patch, the generated assembly is shown below:
 
 movdqu 0x10(%rcx),%xmm0
 movdqu -0x20(%rcx),%xmm1
 pslld  $0x10,%xmm0
 psrad  $0x10,%xmm0
 pslld  $0x10,%xmm1
 movups %xmm0,-0x10(%rdx)
 psrad  $0x10,%xmm1
 movups %xmm1,-0x20(%rdx)
 
 
 Bootstrapped and tested on x86-64. OK for trunk?

This is an odd place to implement such a transform.  Also, whether it
is faster or not depends on the exact ISA you target - for
example, ppc has constraints on the maximum number of shifts
carried out in parallel, and the above has 4 in very short
succession, especially on the sign-extend path.

So this looks more like an opportunity for a post-vectorizer
transform on RTL or for the vectorizer special-casing
widening loads with a vectorizer pattern.

Richard.


[PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

2014-04-24 Thread Cong Hou
Given the following loop:

int a[N];
short b[N*2];

for (int i = 0; i < N; ++i)
  a[i] = b[i*2];


After being vectorized, the access to b[i*2] will be compiled into
several packing statements, while the type promotion from short to int
will be compiled into several unpacking statements. With this patch,
each pair of pack/unpack statements will be replaced by less expensive
statements (with shift or bit-and operations).
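
To make the intent concrete, here is a scalar C model of what each int lane
computes after the transform.  This is only an illustration (the helper
functions are made up, not code from the patch) and assumes a little-endian
target, where b[i*2] sits in the low half of the 32-bit word that overlaps
it:

#include <stdint.h>

/* Signed case, corresponding to pslld $16 ; psrad $16 on a vector lane.  */
static inline int32_t
even_short_to_int (uint32_t word)
{
  /* Move the low 16 bits to the top, then shift back so the value comes out
     sign-extended (>> on a signed int is arithmetic on the targets of
     interest here).  */
  return (int32_t) (word << 16) >> 16;
}

/* Unsigned case, corresponding to a pand with 0x0000ffff on a vector lane.  */
static inline uint32_t
even_ushort_to_uint (uint32_t word)
{
  return word & 0xffff;
}

The vectorized code applies the same shift pair (or the single mask) to four
lanes at once, which is where the pslld/psrad instructions in the post-patch
assembly come from.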

On x86_64, the loop above will be compiled into the following assembly
(with -O2 -ftree-vectorize):

movdqu 0x10(%rcx),%xmm3
movdqu -0x20(%rcx),%xmm0
movdqa %xmm0,%xmm2
punpcklwd %xmm3,%xmm0
punpckhwd %xmm3,%xmm2
movdqa %xmm0,%xmm3
punpcklwd %xmm2,%xmm0
punpckhwd %xmm2,%xmm3
movdqa %xmm1,%xmm2
punpcklwd %xmm3,%xmm0
pcmpgtw %xmm0,%xmm2
movdqa %xmm0,%xmm3
punpckhwd %xmm2,%xmm0
punpcklwd %xmm2,%xmm3
movups %xmm0,-0x10(%rdx)
movups %xmm3,-0x20(%rdx)


With this patch, the generated assembly is shown below:

movdqu 0x10(%rcx),%xmm0
movdqu -0x20(%rcx),%xmm1
pslld  $0x10,%xmm0
psrad  $0x10,%xmm0
pslld  $0x10,%xmm1
movups %xmm0,-0x10(%rdx)
psrad  $0x10,%xmm1
movups %xmm1,-0x20(%rdx)


Bootstrapped and tested on x86-64. OK for trunk?


thanks,
Cong
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 117cdd0..e7143f1 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+   * tree-vect-stmts.c (detect_pack_unpack_pattern): New function.
+   (vect_gen_widened_results_half): Call detect_pack_unpack_pattern.
+
 2014-04-23  David Malcolm  dmalc...@redhat.com
 
* is-a.h: Update comments to reflect the following changes to the
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 62b07f4..a8755b3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2014-04-23  Cong Hou  <co...@google.com>
+
+   * gcc.dg/vect/vect-125.c: New test.
+
 2014-04-23  Jeff Law  l...@redhat.com
 
PR tree-optimization/60902
diff --git a/gcc/testsuite/gcc.dg/vect/vect-125.c b/gcc/testsuite/gcc.dg/vect/vect-125.c
new file mode 100644
index 000..988dea6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-125.c
@@ -0,0 +1,122 @@
+/* { dg-require-effective-target vect_int } */
+
+#include <limits.h>
+#include "tree-vect.h"
+
+#define N 64
+
+char b[N];
+unsigned char c[N];
+short d[N];
+unsigned short e[N];
+
+__attribute__((noinline)) void
+test1 ()
+{
+  int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = b[i*2];
+      a[i+N/2] = b[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = c[i*2];
+      a[i+N/2] = c[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = d[i*2];
+      a[i+N/2] = d[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = e[i*2];
+      a[i+N/2] = e[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+      abort ();
+}
+
+__attribute__((noinline)) void
+test2 ()
+{
+  unsigned int a[N];
+  int i;
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = b[i*2];
+      a[i+N/2] = b[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != b[i*2] || a[i+N/2] != b[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = c[i*2];
+      a[i+N/2] = c[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != c[i*2] || a[i+N/2] != c[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = d[i*2];
+      a[i+N/2] = d[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != d[i*2] || a[i+N/2] != d[i*2+1])
+      abort ();
+
+  for (i = 0; i < N/2; i++)
+    {
+      a[i] = e[i*2];
+      a[i+N/2] = e[i*2+1];
+    }
+  for (i = 0; i < N/2; i++)
+    if (a[i] != e[i*2] || a[i+N/2] != e[i*2+1])
+      abort ();
+}
+
+int
+main ()
+{
+  b[0] = CHAR_MIN;
+  c[0] = UCHAR_MAX;
+  d[0] = SHRT_MIN;
+  e[0] = USHRT_MAX;
+
+  int i;
+  for (i = 1; i < N; i++)
+    {
+      b[i] = b[i-1] + 1;
+      c[i] = c[i-1] - 1;
+      d[i] = d[i-1] + 1;
+      e[i] = e[i-1] - 1;
+    }
+
+  test1 ();
+  test2 ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 4 loops" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "A pack-unpack pattern is recognized" 32 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 1a51d6d..d0cf1f4 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -3191,6 +3191,174 @@ vectorizable_simd_clone_call (gimple stmt, gimple_stmt_iterator *gsi,
 }
 
 
+/* Function detect_pack_unpack_pattern
+
+   Detect the following pattern:
+
+   S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 0, 2, 4, ... }>;
+   or
+   S1  vect3 = VEC_PERM_EXPR <vect1, vect2, { 1, 3, 5, ... }>;
+
+   S2  vect4 = [vec_unpack_lo_expr]