Re: "match.pd" (was: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?)

2018-11-04 Thread Marc Glisse

(resent because of mail issues on my end)

On Mon, 22 Oct 2018, Thomas Schwinge wrote:


I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get

   j_10 = tmp_17 / 1025;
   i_11 = tmp_17 % 1025;
   _1 = (long unsigned int) j_10;
   _2 = _1 * 1025;
   _3 = (sizetype) i_11;
   _4 = _2 + _3;

or for a power of 2

   j_10 = tmp_17 >> 10;
   i_11 = tmp_17 & 1023;
   _1 = (long unsigned int) j_10;
   _2 = _1 * 1024;
   _3 = (sizetype) i_11;
   _4 = _2 + _3;

and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).

So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.


With a very simplistic "match.pd" rule (not yet any special cases
checking etc.):

diff --git gcc/match.pd gcc/match.pd
index b36d7ccb5dc3..4c23116308da 100644
--- gcc/match.pd
+++ gcc/match.pd
@@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
{ wide_int_to_tree (sizetype, off); })
  { swap_p ? @0 : @2; }))
{ rhs_tree; })
+
+/* Given:
+
+   j = in / N_I
+   i = in % N_I
+
+   ..., fold:
+
+   out = j * N_I + i
+
+   ..., into:
+
+   out = in
+*/
+
+/* As long as only considering N_I being INTEGER_CST (which are always second
+   argument?), probably don't need ":c" variants?  */
+
+(simplify
+ (plus:c
+  (mult:c
+   (trunc_div @0 INTEGER_CST@1)
+   INTEGER_CST@1)
+  (trunc_mod @0 INTEGER_CST@1))
+ (convert @0))


You should only specify INTEGER_CST@1 on the first occurence, the others
can be just @1. (you may be interested in @@1 at some point, but that
gets tricky)


..., the original code:

   int f1(int in)
   {
 int j = in / N_I;
 int i = in % N_I;

 int out = j * N_I + i;

 return out;
   }

... gets simplified from ("div-mod-0.c.027t.objsz1"):

   f1 (int in)
   {
 int out;
 int i;
 int j;
 int _1;
 int _6;

  :
 gimple_assign 
 gimple_assign 
 gimple_assign 
 gimple_assign 
 gimple_assign 
 gimple_return <_6>

   }

... to ("div-mod-0.c.028t.ccp1"):

   f1 (int in)
   {
 int out;
 int i;
 int j;
 int _1;

  :
 gimple_assign 
 gimple_assign 
 gimple_assign 
 gimple_return 

   }

(The three dead "gimple_assign"s get eliminated later on.)

So, that works.

However, it doesn't work yet for the original construct that I'd ran
into, which looks like this:

   [...]
   int i;
   int j;
   [...]
   signed int .offset.5_2;
   [...]
   unsigned int .offset.7_23;
   unsigned int .iter.0_24;
   unsigned int _25;
   unsigned int _26;
   [...]
   unsigned int .iter.0_32;
   [...]

:
   # gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   [...]

Resolving the "a[j][i] = 123" we'll need to look into later.

As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
get:

   [...]
   int i;
   int j;
   long unsigned int _1;
   long unsigned int _2;
   sizetype _3;
   sizetype _4;
   sizetype _5;
   int * _6;
   [...]
   signed int .offset.5_8;
   [...]
   unsigned int .offset.7_29;
   unsigned int .iter.0_30;
   unsigned int _31;
   unsigned int _32;
   [...]

:
   # gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   gimple_assign 
   [...]

Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
but "match.pd" doesn't agree yet.  Note the many "nop_expr"s here, which
I have not yet figured out how to handle, I suppose?  I tried some things
but couldn't get it to work.  Apparently the existing instances of
"(match (nop_convert @0)" and "Basic strip-useless-type-conversions /
strip_nops" rule also don't handle these; should they?  Or, are in fact
here the types mixed up too much?


"(match (nop_convert @0)" defines a shortcut so some transformations can
use nop_convert to detect some specific conversions, but it doesn't do
anything by itself. "Basic strip-useless-type-conversions" strips
conversions that are *useless*, essentially from a type to the same
type. If you want to handle true conversions, you need to do that
explicitly, see the many transformations that use convert? convert1?
convert2? and specify for which particular conversions the
transformation is valid.  Finding out the right conditions to detect
these conversions is often the most painful part of writing a match.pd
transformation.


I hope to get some time again soon to continue looking into this, but if
anybody got any ideas, I'm all ears.


--
Marc Glisse


Re: "match.pd" (was: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?)

2018-10-23 Thread Richard Biener
On Mon, Oct 22, 2018 at 6:35 PM Thomas Schwinge  wrote:
>
> Hi!
>
> Thanks for all your comments already!  I continued looked into this for a
> bit (but then got interrupted by a higher-priority task).  Regarding this
> one specifically:
>
> On Fri, 12 Oct 2018 21:14:11 +0200, Marc Glisse  wrote:
> > On Fri, 12 Oct 2018, Thomas Schwinge wrote:
> >
> > > Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
> > > also present when running the following code through the vectorizer:
> > >
> > >for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> > >  {
> > >int j = tmp / N_I;
> > >int i = tmp % N_I;
> > >a[j][i] = 0;
> > >  }
> > >
> > > ... whereas the following variant (obviously) does vectorize:
> > >
> > >int a[NJ * NI];
> > >
> > >for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> > >  a[tmp] = 0;
> >
> > I had a quick look at the difference, and a[j][i] remains in this form
> > throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
> >
> >j_10 = tmp_17 / 1025;
> >i_11 = tmp_17 % 1025;
> >_1 = (long unsigned int) j_10;
> >_2 = _1 * 1025;
> >_3 = (sizetype) i_11;
> >_4 = _2 + _3;
> >
> > or for a power of 2
> >
> >j_10 = tmp_17 >> 10;
> >i_11 = tmp_17 & 1023;
> >_1 = (long unsigned int) j_10;
> >_2 = _1 * 1024;
> >_3 = (sizetype) i_11;
> >_4 = _2 + _3;
> >
> > and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
> > I think that's true).
> >
> > So there are missing match.pd transformations in addition to whatever
> > scev/ivdep/other work is needed.
>
> With a very simplistic "match.pd" rule (not yet any special cases
> checking etc.):
>
> diff --git gcc/match.pd gcc/match.pd
> index b36d7ccb5dc3..4c23116308da 100644
> --- gcc/match.pd
> +++ gcc/match.pd
> @@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> { wide_int_to_tree (sizetype, off); })
>   { swap_p ? @0 : @2; }))
> { rhs_tree; })
> +
> +/* Given:
> +
> +   j = in / N_I
> +   i = in % N_I
> +
> +   ..., fold:
> +
> +   out = j * N_I + i
> +
> +   ..., into:
> +
> +   out = in
> +*/
> +
> +/* As long as only considering N_I being INTEGER_CST (which are always second
> +   argument?), probably don't need ":c" variants?  */
> +
> +(simplify
> + (plus:c
> +  (mult:c
> +   (trunc_div @0 INTEGER_CST@1)
> +   INTEGER_CST@1)
> +  (trunc_mod @0 INTEGER_CST@1))
> + (convert @0))
>
> ..., the original code:
>
> int f1(int in)
> {
>   int j = in / N_I;
>   int i = in % N_I;
>
>   int out = j * N_I + i;
>
>   return out;
> }
>
> ... gets simplified from ("div-mod-0.c.027t.objsz1"):
>
> f1 (int in)
> {
>   int out;
>   int i;
>   int j;
>   int _1;
>   int _6;
>
>:
>   gimple_assign 
>   gimple_assign 
>   gimple_assign 
>   gimple_assign 
>   gimple_assign 
>   gimple_return <_6>
>
> }
>
> ... to ("div-mod-0.c.028t.ccp1"):
>
> f1 (int in)
> {
>   int out;
>   int i;
>   int j;
>   int _1;
>
>:
>   gimple_assign 
>   gimple_assign 
>   gimple_assign 
>   gimple_return 
>
> }
>
> (The three dead "gimple_assign"s get eliminated later on.)
>
> So, that works.
>
> However, it doesn't work yet for the original construct that I'd ran
> into, which looks like this:
>
> [...]
> int i;
> int j;
> [...]
> signed int .offset.5_2;
> [...]
> unsigned int .offset.7_23;
> unsigned int .iter.0_24;
> unsigned int _25;
> unsigned int _26;
> [...]
> unsigned int .iter.0_32;
> [...]
>
>  :
> # gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> [...]
>
> Resolving the "a[j][i] = 123" we'll need to look into later.
>
> As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
> get:
>
> [...]
> int i;
> int j;
> long unsigned int _1;
> long unsigned int _2;
> sizetype _3;
> sizetype _4;
> sizetype _5;
> int * _6;
> [...]
> signed int .offset.5_8;
> [...]
> unsigned int .offset.7_29;
> unsigned int .iter.0_30;
> unsigned int _31;
> unsigned int _32;
> [...]
>
>  :
> # gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> gimple_assign 
> [...]
>
> Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
> but "match.pd" doesn't agree yet.  Note the many "nop_expr"s here, which
> I have not yet figured out how to handle, I suppose?  I tried 

"match.pd" (was: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?)

2018-10-22 Thread Thomas Schwinge
Hi!

Thanks for all your comments already!  I continued looked into this for a
bit (but then got interrupted by a higher-priority task).  Regarding this
one specifically:

On Fri, 12 Oct 2018 21:14:11 +0200, Marc Glisse  wrote:
> On Fri, 12 Oct 2018, Thomas Schwinge wrote:
> 
> > Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
> > also present when running the following code through the vectorizer:
> >
> >for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> >  {
> >int j = tmp / N_I;
> >int i = tmp % N_I;
> >a[j][i] = 0;
> >  }
> >
> > ... whereas the following variant (obviously) does vectorize:
> >
> >int a[NJ * NI];
> >
> >for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> >  a[tmp] = 0;
> 
> I had a quick look at the difference, and a[j][i] remains in this form 
> throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
> 
>j_10 = tmp_17 / 1025;
>i_11 = tmp_17 % 1025;
>_1 = (long unsigned int) j_10;
>_2 = _1 * 1025;
>_3 = (sizetype) i_11;
>_4 = _2 + _3;
> 
> or for a power of 2
> 
>j_10 = tmp_17 >> 10;
>i_11 = tmp_17 & 1023;
>_1 = (long unsigned int) j_10;
>_2 = _1 * 1024;
>_3 = (sizetype) i_11;
>_4 = _2 + _3;
> 
> and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least 
> I think that's true).
> 
> So there are missing match.pd transformations in addition to whatever 
> scev/ivdep/other work is needed.

With a very simplistic "match.pd" rule (not yet any special cases
checking etc.):

diff --git gcc/match.pd gcc/match.pd
index b36d7ccb5dc3..4c23116308da 100644
--- gcc/match.pd
+++ gcc/match.pd
@@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
{ wide_int_to_tree (sizetype, off); })
  { swap_p ? @0 : @2; }))
{ rhs_tree; })
+
+/* Given:
+
+   j = in / N_I
+   i = in % N_I
+
+   ..., fold:
+
+   out = j * N_I + i
+
+   ..., into:
+
+   out = in
+*/
+
+/* As long as only considering N_I being INTEGER_CST (which are always second
+   argument?), probably don't need ":c" variants?  */
+
+(simplify
+ (plus:c
+  (mult:c
+   (trunc_div @0 INTEGER_CST@1)
+   INTEGER_CST@1)
+  (trunc_mod @0 INTEGER_CST@1))
+ (convert @0))

..., the original code:

int f1(int in)
{
  int j = in / N_I;
  int i = in % N_I;

  int out = j * N_I + i;

  return out;
}

... gets simplified from ("div-mod-0.c.027t.objsz1"):

f1 (int in)
{
  int out;
  int i;
  int j;
  int _1;
  int _6;

   :
  gimple_assign 
  gimple_assign 
  gimple_assign 
  gimple_assign 
  gimple_assign 
  gimple_return <_6>

}

... to ("div-mod-0.c.028t.ccp1"):

f1 (int in)
{
  int out;
  int i;
  int j;
  int _1;

   :
  gimple_assign 
  gimple_assign 
  gimple_assign 
  gimple_return 

}

(The three dead "gimple_assign"s get eliminated later on.)

So, that works.

However, it doesn't work yet for the original construct that I'd ran
into, which looks like this:

[...]
int i;
int j;
[...]
signed int .offset.5_2;
[...]
unsigned int .offset.7_23;
unsigned int .iter.0_24;
unsigned int _25;
unsigned int _26;
[...]
unsigned int .iter.0_32;
[...]

 :
# gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
[...]

Resolving the "a[j][i] = 123" we'll need to look into later.

As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
get:

[...]
int i;
int j;
long unsigned int _1;
long unsigned int _2;
sizetype _3;
sizetype _4;
sizetype _5;
int * _6;
[...]
signed int .offset.5_8;
[...]
unsigned int .offset.7_29;
unsigned int .iter.0_30;
unsigned int _31;
unsigned int _32;
[...]

 :
# gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
gimple_assign 
[...]

Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
but "match.pd" doesn't agree yet.  Note the many "nop_expr"s here, which
I have not yet figured out how to handle, I suppose?  I tried some things
but couldn't get it to work.  Apparently the existing instances of
"(match (nop_convert @0)" and "Basic strip-useless-type-conversions /
strip_nops" rule also don't handle these; should they?  Or, are in fact
here the types mixed up too much?

I hope to get some time again soon to continue looking into this, but if
anybody got any ideas, I'm all ears.


Grüße
 Thomas