Re: Support fused multiply-adds in fully-masked reductions

2018-05-25 Thread Richard Biener
On Thu, May 24, 2018 at 2:17 PM Richard Sandiford <
richard.sandif...@linaro.org> wrote:

> Richard Biener  writes:
> > On Wed, May 16, 2018 at 11:26 AM Richard Sandiford <
> > richard.sandif...@linaro.org> wrote:
> >
> >> This patch adds support for fusing a conditional add or subtract
> >> with a multiplication, so that we can use fused multiply-add and
> >> multiply-subtract operations for fully-masked reductions.  E.g.
> >> for SVE we vectorise:
> >
> >>double res = 0.0;
> >>for (int i = 0; i < n; ++i)
> >>  res += x[i] * y[i];
> >
> >> using a fully-masked loop in which the loop body has the form:
> >
> >>res_1 = PHI<0(preheader), res_2(latch)>;
> >>avec = IFN_MASK_LOAD (loop_mask, a)
> >>bvec = IFN_MASK_LOAD (loop_mask, b)
> >>prod = avec * bvec;
> >>res_2 = IFN_COND_ADD (loop_mask, res_1, prod);
> >
> >> where the last statement does the equivalent of:
> >
> >>res_2 = loop_mask ? res_1 + prod : res_1;
> >
> >> (operating elementwise).  The point of the patch is to convert the last
> >> two statements into a single internal function that is the equivalent
of:
> >
> >>res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1;
> >
> >> (again operating elementwise).
> >
> >> All current conditional X operations have the form "do X or don't do X
> >> to the first operand" (add/don't add to first operand, etc.).  However,
> >> the FMA optabs and functions are ordered so that the accumulator comes
> >> last.  There were two obvious ways of resolving this: break the
> >> convention for conditional operators and have "add/don't add to the
> >> final operand" or break the convention for FMA and put the accumulator
> >> first.  The patch goes for the latter, but adds _REV to make it obvious
> >> that the operands are in a different order.
> >
> > Eh.  I guess you'll do the same to SAD/DOT_PROD/WIDEN_SUM?
> >
> > That said, I don't really see the "do or not do to the first operand",
it's
> > "do or not do the operation on operands 1 to 2 (or 3)".  None of the
> > current ops modify operand 1, they all produce a new value, no?

> Yeah, neither the current functions nor these ones actually changed
> operand 1.  It was all about deciding what the "else" value should be.
> The _REV thing was a "fix" for the fact that we wanted the else value
> to be the final operand of fma.

> Of course, the real fix was to make all the IFN_COND_* functions take an
> explicit else value, as you suggested in the review of the other patch
> in the series.  So all this _REV stuff is redundant now.

> Here's an updated version based on top of the IFN_COND_FMA patch
> that I just posted.  Tested in the same way.

OK.

Thanks,
Richard.

> Thanks,
> Richard

> 2018-05-24  Richard Sandiford  
>  Alan Hayward  
>  David Sherwood  

> gcc/
>  * internal-fn.h (can_interpret_as_conditional_op_p): Declare.
>  * internal-fn.c (can_interpret_as_conditional_op_p): New function.
>  * tree-ssa-math-opts.c (convert_mult_to_fma_1): Handle conditional
>  plus and minus and convert them into IFN_COND_FMA-based sequences.
>  (convert_mult_to_fma): Handle conditional plus and minus.

> gcc/testsuite/
>  * gcc.dg/vect/vect-fma-2.c: New test.
>  * gcc.target/aarch64/sve/reduc_4.c: Likewise.
>  * gcc.target/aarch64/sve/reduc_6.c: Likewise.
>  * gcc.target/aarch64/sve/reduc_7.c: Likewise.

> Index: gcc/internal-fn.h
> ===
> --- gcc/internal-fn.h   2018-05-24 13:05:46.049605128 +0100
> +++ gcc/internal-fn.h   2018-05-24 13:08:24.643987582 +0100
> @@ -196,6 +196,9 @@ extern internal_fn get_conditional_inter
>   extern internal_fn get_conditional_internal_fn (internal_fn);
>   extern tree_code conditional_internal_fn_code (internal_fn);
>   extern internal_fn get_unconditional_internal_fn (internal_fn);
> +extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
> +  tree_code *, tree (&)[3],
> +  tree *);

>   extern bool internal_load_fn_p (internal_fn);
>   extern bool internal_store_fn_p (internal_fn);
> Index: gcc/internal-fn.c
> ===
> --- gcc/internal-fn.c   2018-05-24 13:05:46.048606357 +0100
> +++ gcc/internal-fn.c   2018-05-24 13:08:24.643987582 +0100
> @@ -,6 +,62 @@ #define CASE(NAME) case IFN_COND_##NAME:
>   }
>   }

> +/* Return true if STMT can be interpreted as a conditional tree code
> +   operation of the form:
> +
> + LHS = COND ? OP (RHS1, ...) : ELSE;
> +
> +   operating elementwise if the operands are vectors.  This includes
> +   the case of an all-true COND, so that the operation always happens.
> +
> +   When returning true, set:
> +
> +   - *COND_OUT to the condition COND, or to NULL_TREE if the condition
> + is known to be all-true
> +   - *CODE_OUT to the tree code

Re: Support fused multiply-adds in fully-masked reductions

2018-05-24 Thread Richard Sandiford
Richard Biener  writes:
> On Wed, May 16, 2018 at 11:26 AM Richard Sandiford <
> richard.sandif...@linaro.org> wrote:
>
>> This patch adds support for fusing a conditional add or subtract
>> with a multiplication, so that we can use fused multiply-add and
>> multiply-subtract operations for fully-masked reductions.  E.g.
>> for SVE we vectorise:
>
>>double res = 0.0;
>>for (int i = 0; i < n; ++i)
>>  res += x[i] * y[i];
>
>> using a fully-masked loop in which the loop body has the form:
>
>>res_1 = PHI<0(preheader), res_2(latch)>;
>>avec = IFN_MASK_LOAD (loop_mask, a)
>>bvec = IFN_MASK_LOAD (loop_mask, b)
>>prod = avec * bvec;
>>res_2 = IFN_COND_ADD (loop_mask, res_1, prod);
>
>> where the last statement does the equivalent of:
>
>>res_2 = loop_mask ? res_1 + prod : res_1;
>
>> (operating elementwise).  The point of the patch is to convert the last
>> two statements into a single internal function that is the equivalent of:
>
>>res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1;
>
>> (again operating elementwise).
>
>> All current conditional X operations have the form "do X or don't do X
>> to the first operand" (add/don't add to first operand, etc.).  However,
>> the FMA optabs and functions are ordered so that the accumulator comes
>> last.  There were two obvious ways of resolving this: break the
>> convention for conditional operators and have "add/don't add to the
>> final operand" or break the convention for FMA and put the accumulator
>> first.  The patch goes for the latter, but adds _REV to make it obvious
>> that the operands are in a different order.
>
> Eh.  I guess you'll do the same to SAD/DOT_PROD/WIDEN_SUM?
>
> That said, I don't really see the "do or not do to the first operand", it's
> "do or not do the operation on operands 1 to 2 (or 3)".  None of the
> current ops modify operand 1, they all produce a new value, no?

Yeah, neither the current functions nor these ones actually changed
operand 1.  It was all about deciding what the "else" value should be.
The _REV thing was a "fix" for the fact that we wanted the else value
to be the final operand of fma.

Of course, the real fix was to make all the IFN_COND_* functions take an
explicit else value, as you suggested in the review of the other patch
in the series.  So all this _REV stuff is redundant now.

Here's an updated version based on top of the IFN_COND_FMA patch
that I just posted.  Tested in the same way.

Thanks,
Richard

2018-05-24  Richard Sandiford  
Alan Hayward  
David Sherwood  

gcc/
* internal-fn.h (can_interpret_as_conditional_op_p): Declare.
* internal-fn.c (can_interpret_as_conditional_op_p): New function.
* tree-ssa-math-opts.c (convert_mult_to_fma_1): Handle conditional
plus and minus and convert them into IFN_COND_FMA-based sequences.
(convert_mult_to_fma): Handle conditional plus and minus.

gcc/testsuite/
* gcc.dg/vect/vect-fma-2.c: New test.
* gcc.target/aarch64/sve/reduc_4.c: Likewise.
* gcc.target/aarch64/sve/reduc_6.c: Likewise.
* gcc.target/aarch64/sve/reduc_7.c: Likewise.

Index: gcc/internal-fn.h
===
--- gcc/internal-fn.h   2018-05-24 13:05:46.049605128 +0100
+++ gcc/internal-fn.h   2018-05-24 13:08:24.643987582 +0100
@@ -196,6 +196,9 @@ extern internal_fn get_conditional_inter
 extern internal_fn get_conditional_internal_fn (internal_fn);
 extern tree_code conditional_internal_fn_code (internal_fn);
 extern internal_fn get_unconditional_internal_fn (internal_fn);
+extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
+  tree_code *, tree (&)[3],
+  tree *);
 
 extern bool internal_load_fn_p (internal_fn);
 extern bool internal_store_fn_p (internal_fn);
Index: gcc/internal-fn.c
===
--- gcc/internal-fn.c   2018-05-24 13:05:46.048606357 +0100
+++ gcc/internal-fn.c   2018-05-24 13:08:24.643987582 +0100
@@ -,6 +,62 @@ #define CASE(NAME) case IFN_COND_##NAME:
 }
 }
 
+/* Return true if STMT can be interpreted as a conditional tree code
+   operation of the form:
+
+ LHS = COND ? OP (RHS1, ...) : ELSE;
+
+   operating elementwise if the operands are vectors.  This includes
+   the case of an all-true COND, so that the operation always happens.
+
+   When returning true, set:
+
+   - *COND_OUT to the condition COND, or to NULL_TREE if the condition
+ is known to be all-true
+   - *CODE_OUT to the tree code
+   - OPS[I] to operand I of *CODE_OUT
+   - *ELSE_OUT to the fallback value ELSE, or to NULL_TREE if the
+ condition is known to be all true.  */
+
+bool
+can_interpret_as_conditional_op_p (gimple *stmt, tree *cond_out,
+  tree_code *code_out,
+  tree (&ops)[3], tree 

Re: Support fused multiply-adds in fully-masked reductions

2018-05-24 Thread Richard Biener
On Wed, May 16, 2018 at 11:26 AM Richard Sandiford <
richard.sandif...@linaro.org> wrote:

> This patch adds support for fusing a conditional add or subtract
> with a multiplication, so that we can use fused multiply-add and
> multiply-subtract operations for fully-masked reductions.  E.g.
> for SVE we vectorise:

>double res = 0.0;
>for (int i = 0; i < n; ++i)
>  res += x[i] * y[i];

> using a fully-masked loop in which the loop body has the form:

>res_1 = PHI<0(preheader), res_2(latch)>;
>avec = IFN_MASK_LOAD (loop_mask, a)
>bvec = IFN_MASK_LOAD (loop_mask, b)
>prod = avec * bvec;
>res_2 = IFN_COND_ADD (loop_mask, res_1, prod);

> where the last statement does the equivalent of:

>res_2 = loop_mask ? res_1 + prod : res_1;

> (operating elementwise).  The point of the patch is to convert the last
> two statements into a single internal function that is the equivalent of:

>res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1;

> (again operating elementwise).

> All current conditional X operations have the form "do X or don't do X
> to the first operand" (add/don't add to first operand, etc.).  However,
> the FMA optabs and functions are ordered so that the accumulator comes
> last.  There were two obvious ways of resolving this: break the
> convention for conditional operators and have "add/don't add to the
> final operand" or break the convention for FMA and put the accumulator
> first.  The patch goes for the latter, but adds _REV to make it obvious
> that the operands are in a different order.

Eh.  I guess you'll do the same to SAD/DOT_PROD/WIDEN_SUM?

That said, I don't really see the "do or not do to the first operand", it's
"do or not do the operation on operands 1 to 2 (or 3)".  None of the
current ops modify operand 1, they all produce a new value, no?

> Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
> and x86_64-linux-gnu.  OK to install?

OK, but as said I don't see a reason for the operand order to differ
in the first place.

Richard.

> Richard


> 2018-05-16  Richard Sandiford  
>  Alan Hayward  
>  David Sherwood  

> gcc/
>  * doc/md.texi (cond_fma_rev, cond_fnma_rev): Document.
>  * optabs.def (cond_fma_rev, cond_fnma_rev): New optabs.
>  * internal-fn.def (COND_FMA_REV, COND_FNMA_REV): New internal
>  functions.
>  * internal-fn.h (can_interpret_as_conditional_op_p): Declare.
>  * internal-fn.c (cond_ternary_direct): New macro.
>  (expand_cond_ternary_optab_fn): Likewise.
>  (direct_cond_ternary_optab_supported_p): Likewise.
>  (FOR_EACH_CODE_MAPPING): Likewise.
>  (get_conditional_internal_fn): Use FOR_EACH_CODE_MAPPING.
>  (conditional_internal_fn_code): New function.
>  (can_interpret_as_conditional_op_p): Likewise.
>  * tree-ssa-math-opts.c (fused_cond_internal_fn): New function.
>  (convert_mult_to_fma_1): Transform calls to IFN_COND_ADD to
>  IFN_COND_FMA_REV and calls to IFN_COND_SUB to IFN_COND_FNMA_REV.
>  (convert_mult_to_fma): Handle calls to IFN_COND_ADD and
IFN_COND_SUB.
>  * genmatch.c (commutative_op): Handle CFN_COND_FMA_REV and
>  CFN_COND_FNMA_REV.
>  * config/aarch64/iterators.md (UNSPEC_COND_FMLA): New unspec.
>  (UNSPEC_COND_FMLS): Likewise.
>  (optab, sve_fp_op): Handle them.
>  (SVE_COND_INT_OP): Rename to...
>  (SVE_COND_INT2_OP): ...this.
>  (SVE_COND_FP_OP): Rename to...
>  (SVE_COND_FP2_OP): ...this.
>  (SVE_COND_FP3_OP): New iterator.
>  * config/aarch64/aarch64-sve.md (cond_): Update
>  for new iterator names.  Add a pattern for SVE_COND_FP3_OP.

> gcc/testsuite/
>  * gcc.target/aarch64/sve/reduc_4.c: New test.
>  * gcc.target/aarch64/sve/reduc_6.c: Likewise.
>  * gcc.target/aarch64/sve/reduc_7.c: Likewise.

> Index: gcc/doc/md.texi
> ===
> --- gcc/doc/md.texi 2018-05-16 10:23:03.590853492 +0100
> +++ gcc/doc/md.texi 2018-05-16 10:23:03.886838736 +0100
> @@ -6367,6 +6367,32 @@ be in a normal C @samp{?:} condition.
>   Operands 0, 2 and 3 all have mode @var{m}, while operand 1 has the mode
>   returned by @code{TARGET_VECTORIZE_GET_MASK_MODE}.

> +@cindex @code{cond_fma_rev@var{mode}} instruction pattern
> +@item @samp{cond_fma_rev@var{mode}}
> +Similar to @samp{cond_add@var{m}}, but compute:
> +@smallexample
> +op0 = op1 ? fma (op3, op4, op2) : op2;
> +@end smallexample
> +for scalars and:
> +@smallexample
> +op0[I] = op1[I] ? fma (op3[I], op4[I], op2[I]) : op2[I];
> +@end smallexample
> +for vectors.  The @samp{_rev} indicates that the addend (operand 2)
> +comes first.
> +
> +@cindex @code{cond_fnma_rev@var{mode}} instruction pattern
> +@item @samp{cond_fnma_rev@var{mode}}
> +Similar to @samp{cond_fma_rev@var{m}}, but negate operand 3 before
> +multiplying i

Support fused multiply-adds in fully-masked reductions

2018-05-16 Thread Richard Sandiford
This patch adds support for fusing a conditional add or subtract
with a multiplication, so that we can use fused multiply-add and
multiply-subtract operations for fully-masked reductions.  E.g.
for SVE we vectorise:

  double res = 0.0;
  for (int i = 0; i < n; ++i)
res += x[i] * y[i];

using a fully-masked loop in which the loop body has the form:

  res_1 = PHI<0(preheader), res_2(latch)>;
  avec = IFN_MASK_LOAD (loop_mask, a)
  bvec = IFN_MASK_LOAD (loop_mask, b)
  prod = avec * bvec;
  res_2 = IFN_COND_ADD (loop_mask, res_1, prod);

where the last statement does the equivalent of:

  res_2 = loop_mask ? res_1 + prod : res_1;

(operating elementwise).  The point of the patch is to convert the last
two statements into a single internal function that is the equivalent of:

  res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1;

(again operating elementwise).

All current conditional X operations have the form "do X or don't do X
to the first operand" (add/don't add to first operand, etc.).  However,
the FMA optabs and functions are ordered so that the accumulator comes
last.  There were two obvious ways of resolving this: break the
convention for conditional operators and have "add/don't add to the
final operand" or break the convention for FMA and put the accumulator
first.  The patch goes for the latter, but adds _REV to make it obvious
that the operands are in a different order.

Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu.  OK to install?

Richard


2018-05-16  Richard Sandiford  
Alan Hayward  
David Sherwood  

gcc/
* doc/md.texi (cond_fma_rev, cond_fnma_rev): Document.
* optabs.def (cond_fma_rev, cond_fnma_rev): New optabs.
* internal-fn.def (COND_FMA_REV, COND_FNMA_REV): New internal
functions.
* internal-fn.h (can_interpret_as_conditional_op_p): Declare.
* internal-fn.c (cond_ternary_direct): New macro.
(expand_cond_ternary_optab_fn): Likewise.
(direct_cond_ternary_optab_supported_p): Likewise.
(FOR_EACH_CODE_MAPPING): Likewise.
(get_conditional_internal_fn): Use FOR_EACH_CODE_MAPPING.
(conditional_internal_fn_code): New function.
(can_interpret_as_conditional_op_p): Likewise.
* tree-ssa-math-opts.c (fused_cond_internal_fn): New function.
(convert_mult_to_fma_1): Transform calls to IFN_COND_ADD to
IFN_COND_FMA_REV and calls to IFN_COND_SUB to IFN_COND_FNMA_REV.
(convert_mult_to_fma): Handle calls to IFN_COND_ADD and IFN_COND_SUB.
* genmatch.c (commutative_op): Handle CFN_COND_FMA_REV and
CFN_COND_FNMA_REV.
* config/aarch64/iterators.md (UNSPEC_COND_FMLA): New unspec.
(UNSPEC_COND_FMLS): Likewise.
(optab, sve_fp_op): Handle them.
(SVE_COND_INT_OP): Rename to...
(SVE_COND_INT2_OP): ...this.
(SVE_COND_FP_OP): Rename to...
(SVE_COND_FP2_OP): ...this.
(SVE_COND_FP3_OP): New iterator.
* config/aarch64/aarch64-sve.md (cond_): Update
for new iterator names.  Add a pattern for SVE_COND_FP3_OP.

gcc/testsuite/
* gcc.target/aarch64/sve/reduc_4.c: New test.
* gcc.target/aarch64/sve/reduc_6.c: Likewise.
* gcc.target/aarch64/sve/reduc_7.c: Likewise.

Index: gcc/doc/md.texi
===
--- gcc/doc/md.texi 2018-05-16 10:23:03.590853492 +0100
+++ gcc/doc/md.texi 2018-05-16 10:23:03.886838736 +0100
@@ -6367,6 +6367,32 @@ be in a normal C @samp{?:} condition.
 Operands 0, 2 and 3 all have mode @var{m}, while operand 1 has the mode
 returned by @code{TARGET_VECTORIZE_GET_MASK_MODE}.
 
+@cindex @code{cond_fma_rev@var{mode}} instruction pattern
+@item @samp{cond_fma_rev@var{mode}}
+Similar to @samp{cond_add@var{m}}, but compute:
+@smallexample
+op0 = op1 ? fma (op3, op4, op2) : op2;
+@end smallexample
+for scalars and:
+@smallexample
+op0[I] = op1[I] ? fma (op3[I], op4[I], op2[I]) : op2[I];
+@end smallexample
+for vectors.  The @samp{_rev} indicates that the addend (operand 2)
+comes first.
+
+@cindex @code{cond_fnma_rev@var{mode}} instruction pattern
+@item @samp{cond_fnma_rev@var{mode}}
+Similar to @samp{cond_fma_rev@var{m}}, but negate operand 3 before
+multiplying it.  That is, the instruction performs:
+@smallexample
+op0 = op1 ? fma (-op3, op4, op2) : op2;
+@end smallexample
+for scalars and:
+@smallexample
+op0[I] = op1[I] ? fma (-op3[I], op4[I], op2[I]) : op2[I];
+@end smallexample
+for vectors.
+
 @cindex @code{neg@var{mode}cc} instruction pattern
 @item @samp{neg@var{mode}cc}
 Similar to @samp{mov@var{mode}cc} but for conditional negation.  Conditionally
Index: gcc/optabs.def
===
--- gcc/optabs.def  2018-05-16 10:23:03.590853492 +0100
+++ gcc/optabs.def  2018-05-16 10:23:03.887838686 +0100
@@ -222,6 +222,8 @@ OPTAB_D (notcc_optab, "not$acc")
 OPTAB_D (movcc_opt