Re: Support fused multiply-adds in fully-masked reductions
On Thu, May 24, 2018 at 2:17 PM Richard Sandiford < richard.sandif...@linaro.org> wrote: > Richard Biener writes: > > On Wed, May 16, 2018 at 11:26 AM Richard Sandiford < > > richard.sandif...@linaro.org> wrote: > > > >> This patch adds support for fusing a conditional add or subtract > >> with a multiplication, so that we can use fused multiply-add and > >> multiply-subtract operations for fully-masked reductions. E.g. > >> for SVE we vectorise: > > > >>double res = 0.0; > >>for (int i = 0; i < n; ++i) > >> res += x[i] * y[i]; > > > >> using a fully-masked loop in which the loop body has the form: > > > >>res_1 = PHI<0(preheader), res_2(latch)>; > >>avec = IFN_MASK_LOAD (loop_mask, a) > >>bvec = IFN_MASK_LOAD (loop_mask, b) > >>prod = avec * bvec; > >>res_2 = IFN_COND_ADD (loop_mask, res_1, prod); > > > >> where the last statement does the equivalent of: > > > >>res_2 = loop_mask ? res_1 + prod : res_1; > > > >> (operating elementwise). The point of the patch is to convert the last > >> two statements into a single internal function that is the equivalent of: > > > >>res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1; > > > >> (again operating elementwise). > > > >> All current conditional X operations have the form "do X or don't do X > >> to the first operand" (add/don't add to first operand, etc.). However, > >> the FMA optabs and functions are ordered so that the accumulator comes > >> last. There were two obvious ways of resolving this: break the > >> convention for conditional operators and have "add/don't add to the > >> final operand" or break the convention for FMA and put the accumulator > >> first. The patch goes for the latter, but adds _REV to make it obvious > >> that the operands are in a different order. > > > > Eh. I guess you'll do the same to SAD/DOT_PROD/WIDEN_SUM? > > > > That said, I don't really see the "do or not do to the first operand", it's > > "do or not do the operation on operands 1 to 2 (or 3)". None of the > > current ops modify operand 1, they all produce a new value, no? > Yeah, neither the current functions nor these ones actually changed > operand 1. It was all about deciding what the "else" value should be. > The _REV thing was a "fix" for the fact that we wanted the else value > to be the final operand of fma. > Of course, the real fix was to make all the IFN_COND_* functions take an > explicit else value, as you suggested in the review of the other patch > in the series. So all this _REV stuff is redundant now. > Here's an updated version based on top of the IFN_COND_FMA patch > that I just posted. Tested in the same way. OK. Thanks, Richard. > Thanks, > Richard > 2018-05-24 Richard Sandiford > Alan Hayward > David Sherwood > gcc/ > * internal-fn.h (can_interpret_as_conditional_op_p): Declare. > * internal-fn.c (can_interpret_as_conditional_op_p): New function. > * tree-ssa-math-opts.c (convert_mult_to_fma_1): Handle conditional > plus and minus and convert them into IFN_COND_FMA-based sequences. > (convert_mult_to_fma): Handle conditional plus and minus. > gcc/testsuite/ > * gcc.dg/vect/vect-fma-2.c: New test. > * gcc.target/aarch64/sve/reduc_4.c: Likewise. > * gcc.target/aarch64/sve/reduc_6.c: Likewise. > * gcc.target/aarch64/sve/reduc_7.c: Likewise. > Index: gcc/internal-fn.h > === > --- gcc/internal-fn.h 2018-05-24 13:05:46.049605128 +0100 > +++ gcc/internal-fn.h 2018-05-24 13:08:24.643987582 +0100 > @@ -196,6 +196,9 @@ extern internal_fn get_conditional_inter > extern internal_fn get_conditional_internal_fn (internal_fn); > extern tree_code conditional_internal_fn_code (internal_fn); > extern internal_fn get_unconditional_internal_fn (internal_fn); > +extern bool can_interpret_as_conditional_op_p (gimple *, tree *, > + tree_code *, tree (&)[3], > + tree *); > extern bool internal_load_fn_p (internal_fn); > extern bool internal_store_fn_p (internal_fn); > Index: gcc/internal-fn.c > === > --- gcc/internal-fn.c 2018-05-24 13:05:46.048606357 +0100 > +++ gcc/internal-fn.c 2018-05-24 13:08:24.643987582 +0100 > @@ -,6 +,62 @@ #define CASE(NAME) case IFN_COND_##NAME: > } > } > +/* Return true if STMT can be interpreted as a conditional tree code > + operation of the form: > + > + LHS = COND ? OP (RHS1, ...) : ELSE; > + > + operating elementwise if the operands are vectors. This includes > + the case of an all-true COND, so that the operation always happens. > + > + When returning true, set: > + > + - *COND_OUT to the condition COND, or to NULL_TREE if the condition > + is known to be all-true > + - *CODE_OUT to the tree code
Re: Support fused multiply-adds in fully-masked reductions
Richard Biener writes: > On Wed, May 16, 2018 at 11:26 AM Richard Sandiford < > richard.sandif...@linaro.org> wrote: > >> This patch adds support for fusing a conditional add or subtract >> with a multiplication, so that we can use fused multiply-add and >> multiply-subtract operations for fully-masked reductions. E.g. >> for SVE we vectorise: > >>double res = 0.0; >>for (int i = 0; i < n; ++i) >> res += x[i] * y[i]; > >> using a fully-masked loop in which the loop body has the form: > >>res_1 = PHI<0(preheader), res_2(latch)>; >>avec = IFN_MASK_LOAD (loop_mask, a) >>bvec = IFN_MASK_LOAD (loop_mask, b) >>prod = avec * bvec; >>res_2 = IFN_COND_ADD (loop_mask, res_1, prod); > >> where the last statement does the equivalent of: > >>res_2 = loop_mask ? res_1 + prod : res_1; > >> (operating elementwise). The point of the patch is to convert the last >> two statements into a single internal function that is the equivalent of: > >>res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1; > >> (again operating elementwise). > >> All current conditional X operations have the form "do X or don't do X >> to the first operand" (add/don't add to first operand, etc.). However, >> the FMA optabs and functions are ordered so that the accumulator comes >> last. There were two obvious ways of resolving this: break the >> convention for conditional operators and have "add/don't add to the >> final operand" or break the convention for FMA and put the accumulator >> first. The patch goes for the latter, but adds _REV to make it obvious >> that the operands are in a different order. > > Eh. I guess you'll do the same to SAD/DOT_PROD/WIDEN_SUM? > > That said, I don't really see the "do or not do to the first operand", it's > "do or not do the operation on operands 1 to 2 (or 3)". None of the > current ops modify operand 1, they all produce a new value, no? Yeah, neither the current functions nor these ones actually changed operand 1. It was all about deciding what the "else" value should be. The _REV thing was a "fix" for the fact that we wanted the else value to be the final operand of fma. Of course, the real fix was to make all the IFN_COND_* functions take an explicit else value, as you suggested in the review of the other patch in the series. So all this _REV stuff is redundant now. Here's an updated version based on top of the IFN_COND_FMA patch that I just posted. Tested in the same way. Thanks, Richard 2018-05-24 Richard Sandiford Alan Hayward David Sherwood gcc/ * internal-fn.h (can_interpret_as_conditional_op_p): Declare. * internal-fn.c (can_interpret_as_conditional_op_p): New function. * tree-ssa-math-opts.c (convert_mult_to_fma_1): Handle conditional plus and minus and convert them into IFN_COND_FMA-based sequences. (convert_mult_to_fma): Handle conditional plus and minus. gcc/testsuite/ * gcc.dg/vect/vect-fma-2.c: New test. * gcc.target/aarch64/sve/reduc_4.c: Likewise. * gcc.target/aarch64/sve/reduc_6.c: Likewise. * gcc.target/aarch64/sve/reduc_7.c: Likewise. Index: gcc/internal-fn.h === --- gcc/internal-fn.h 2018-05-24 13:05:46.049605128 +0100 +++ gcc/internal-fn.h 2018-05-24 13:08:24.643987582 +0100 @@ -196,6 +196,9 @@ extern internal_fn get_conditional_inter extern internal_fn get_conditional_internal_fn (internal_fn); extern tree_code conditional_internal_fn_code (internal_fn); extern internal_fn get_unconditional_internal_fn (internal_fn); +extern bool can_interpret_as_conditional_op_p (gimple *, tree *, + tree_code *, tree (&)[3], + tree *); extern bool internal_load_fn_p (internal_fn); extern bool internal_store_fn_p (internal_fn); Index: gcc/internal-fn.c === --- gcc/internal-fn.c 2018-05-24 13:05:46.048606357 +0100 +++ gcc/internal-fn.c 2018-05-24 13:08:24.643987582 +0100 @@ -,6 +,62 @@ #define CASE(NAME) case IFN_COND_##NAME: } } +/* Return true if STMT can be interpreted as a conditional tree code + operation of the form: + + LHS = COND ? OP (RHS1, ...) : ELSE; + + operating elementwise if the operands are vectors. This includes + the case of an all-true COND, so that the operation always happens. + + When returning true, set: + + - *COND_OUT to the condition COND, or to NULL_TREE if the condition + is known to be all-true + - *CODE_OUT to the tree code + - OPS[I] to operand I of *CODE_OUT + - *ELSE_OUT to the fallback value ELSE, or to NULL_TREE if the + condition is known to be all true. */ + +bool +can_interpret_as_conditional_op_p (gimple *stmt, tree *cond_out, + tree_code *code_out, + tree (&ops)[3], tree
Re: Support fused multiply-adds in fully-masked reductions
On Wed, May 16, 2018 at 11:26 AM Richard Sandiford < richard.sandif...@linaro.org> wrote: > This patch adds support for fusing a conditional add or subtract > with a multiplication, so that we can use fused multiply-add and > multiply-subtract operations for fully-masked reductions. E.g. > for SVE we vectorise: >double res = 0.0; >for (int i = 0; i < n; ++i) > res += x[i] * y[i]; > using a fully-masked loop in which the loop body has the form: >res_1 = PHI<0(preheader), res_2(latch)>; >avec = IFN_MASK_LOAD (loop_mask, a) >bvec = IFN_MASK_LOAD (loop_mask, b) >prod = avec * bvec; >res_2 = IFN_COND_ADD (loop_mask, res_1, prod); > where the last statement does the equivalent of: >res_2 = loop_mask ? res_1 + prod : res_1; > (operating elementwise). The point of the patch is to convert the last > two statements into a single internal function that is the equivalent of: >res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1; > (again operating elementwise). > All current conditional X operations have the form "do X or don't do X > to the first operand" (add/don't add to first operand, etc.). However, > the FMA optabs and functions are ordered so that the accumulator comes > last. There were two obvious ways of resolving this: break the > convention for conditional operators and have "add/don't add to the > final operand" or break the convention for FMA and put the accumulator > first. The patch goes for the latter, but adds _REV to make it obvious > that the operands are in a different order. Eh. I guess you'll do the same to SAD/DOT_PROD/WIDEN_SUM? That said, I don't really see the "do or not do to the first operand", it's "do or not do the operation on operands 1 to 2 (or 3)". None of the current ops modify operand 1, they all produce a new value, no? > Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf > and x86_64-linux-gnu. OK to install? OK, but as said I don't see a reason for the operand order to differ in the first place. Richard. > Richard > 2018-05-16 Richard Sandiford > Alan Hayward > David Sherwood > gcc/ > * doc/md.texi (cond_fma_rev, cond_fnma_rev): Document. > * optabs.def (cond_fma_rev, cond_fnma_rev): New optabs. > * internal-fn.def (COND_FMA_REV, COND_FNMA_REV): New internal > functions. > * internal-fn.h (can_interpret_as_conditional_op_p): Declare. > * internal-fn.c (cond_ternary_direct): New macro. > (expand_cond_ternary_optab_fn): Likewise. > (direct_cond_ternary_optab_supported_p): Likewise. > (FOR_EACH_CODE_MAPPING): Likewise. > (get_conditional_internal_fn): Use FOR_EACH_CODE_MAPPING. > (conditional_internal_fn_code): New function. > (can_interpret_as_conditional_op_p): Likewise. > * tree-ssa-math-opts.c (fused_cond_internal_fn): New function. > (convert_mult_to_fma_1): Transform calls to IFN_COND_ADD to > IFN_COND_FMA_REV and calls to IFN_COND_SUB to IFN_COND_FNMA_REV. > (convert_mult_to_fma): Handle calls to IFN_COND_ADD and IFN_COND_SUB. > * genmatch.c (commutative_op): Handle CFN_COND_FMA_REV and > CFN_COND_FNMA_REV. > * config/aarch64/iterators.md (UNSPEC_COND_FMLA): New unspec. > (UNSPEC_COND_FMLS): Likewise. > (optab, sve_fp_op): Handle them. > (SVE_COND_INT_OP): Rename to... > (SVE_COND_INT2_OP): ...this. > (SVE_COND_FP_OP): Rename to... > (SVE_COND_FP2_OP): ...this. > (SVE_COND_FP3_OP): New iterator. > * config/aarch64/aarch64-sve.md (cond_): Update > for new iterator names. Add a pattern for SVE_COND_FP3_OP. > gcc/testsuite/ > * gcc.target/aarch64/sve/reduc_4.c: New test. > * gcc.target/aarch64/sve/reduc_6.c: Likewise. > * gcc.target/aarch64/sve/reduc_7.c: Likewise. > Index: gcc/doc/md.texi > === > --- gcc/doc/md.texi 2018-05-16 10:23:03.590853492 +0100 > +++ gcc/doc/md.texi 2018-05-16 10:23:03.886838736 +0100 > @@ -6367,6 +6367,32 @@ be in a normal C @samp{?:} condition. > Operands 0, 2 and 3 all have mode @var{m}, while operand 1 has the mode > returned by @code{TARGET_VECTORIZE_GET_MASK_MODE}. > +@cindex @code{cond_fma_rev@var{mode}} instruction pattern > +@item @samp{cond_fma_rev@var{mode}} > +Similar to @samp{cond_add@var{m}}, but compute: > +@smallexample > +op0 = op1 ? fma (op3, op4, op2) : op2; > +@end smallexample > +for scalars and: > +@smallexample > +op0[I] = op1[I] ? fma (op3[I], op4[I], op2[I]) : op2[I]; > +@end smallexample > +for vectors. The @samp{_rev} indicates that the addend (operand 2) > +comes first. > + > +@cindex @code{cond_fnma_rev@var{mode}} instruction pattern > +@item @samp{cond_fnma_rev@var{mode}} > +Similar to @samp{cond_fma_rev@var{m}}, but negate operand 3 before > +multiplying i
Support fused multiply-adds in fully-masked reductions
This patch adds support for fusing a conditional add or subtract with a multiplication, so that we can use fused multiply-add and multiply-subtract operations for fully-masked reductions. E.g. for SVE we vectorise: double res = 0.0; for (int i = 0; i < n; ++i) res += x[i] * y[i]; using a fully-masked loop in which the loop body has the form: res_1 = PHI<0(preheader), res_2(latch)>; avec = IFN_MASK_LOAD (loop_mask, a) bvec = IFN_MASK_LOAD (loop_mask, b) prod = avec * bvec; res_2 = IFN_COND_ADD (loop_mask, res_1, prod); where the last statement does the equivalent of: res_2 = loop_mask ? res_1 + prod : res_1; (operating elementwise). The point of the patch is to convert the last two statements into a single internal function that is the equivalent of: res_2 = loop_mask ? fma (avec, bvec, res_1) : res_1; (again operating elementwise). All current conditional X operations have the form "do X or don't do X to the first operand" (add/don't add to first operand, etc.). However, the FMA optabs and functions are ordered so that the accumulator comes last. There were two obvious ways of resolving this: break the convention for conditional operators and have "add/don't add to the final operand" or break the convention for FMA and put the accumulator first. The patch goes for the latter, but adds _REV to make it obvious that the operands are in a different order. Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf and x86_64-linux-gnu. OK to install? Richard 2018-05-16 Richard Sandiford Alan Hayward David Sherwood gcc/ * doc/md.texi (cond_fma_rev, cond_fnma_rev): Document. * optabs.def (cond_fma_rev, cond_fnma_rev): New optabs. * internal-fn.def (COND_FMA_REV, COND_FNMA_REV): New internal functions. * internal-fn.h (can_interpret_as_conditional_op_p): Declare. * internal-fn.c (cond_ternary_direct): New macro. (expand_cond_ternary_optab_fn): Likewise. (direct_cond_ternary_optab_supported_p): Likewise. (FOR_EACH_CODE_MAPPING): Likewise. (get_conditional_internal_fn): Use FOR_EACH_CODE_MAPPING. (conditional_internal_fn_code): New function. (can_interpret_as_conditional_op_p): Likewise. * tree-ssa-math-opts.c (fused_cond_internal_fn): New function. (convert_mult_to_fma_1): Transform calls to IFN_COND_ADD to IFN_COND_FMA_REV and calls to IFN_COND_SUB to IFN_COND_FNMA_REV. (convert_mult_to_fma): Handle calls to IFN_COND_ADD and IFN_COND_SUB. * genmatch.c (commutative_op): Handle CFN_COND_FMA_REV and CFN_COND_FNMA_REV. * config/aarch64/iterators.md (UNSPEC_COND_FMLA): New unspec. (UNSPEC_COND_FMLS): Likewise. (optab, sve_fp_op): Handle them. (SVE_COND_INT_OP): Rename to... (SVE_COND_INT2_OP): ...this. (SVE_COND_FP_OP): Rename to... (SVE_COND_FP2_OP): ...this. (SVE_COND_FP3_OP): New iterator. * config/aarch64/aarch64-sve.md (cond_): Update for new iterator names. Add a pattern for SVE_COND_FP3_OP. gcc/testsuite/ * gcc.target/aarch64/sve/reduc_4.c: New test. * gcc.target/aarch64/sve/reduc_6.c: Likewise. * gcc.target/aarch64/sve/reduc_7.c: Likewise. Index: gcc/doc/md.texi === --- gcc/doc/md.texi 2018-05-16 10:23:03.590853492 +0100 +++ gcc/doc/md.texi 2018-05-16 10:23:03.886838736 +0100 @@ -6367,6 +6367,32 @@ be in a normal C @samp{?:} condition. Operands 0, 2 and 3 all have mode @var{m}, while operand 1 has the mode returned by @code{TARGET_VECTORIZE_GET_MASK_MODE}. +@cindex @code{cond_fma_rev@var{mode}} instruction pattern +@item @samp{cond_fma_rev@var{mode}} +Similar to @samp{cond_add@var{m}}, but compute: +@smallexample +op0 = op1 ? fma (op3, op4, op2) : op2; +@end smallexample +for scalars and: +@smallexample +op0[I] = op1[I] ? fma (op3[I], op4[I], op2[I]) : op2[I]; +@end smallexample +for vectors. The @samp{_rev} indicates that the addend (operand 2) +comes first. + +@cindex @code{cond_fnma_rev@var{mode}} instruction pattern +@item @samp{cond_fnma_rev@var{mode}} +Similar to @samp{cond_fma_rev@var{m}}, but negate operand 3 before +multiplying it. That is, the instruction performs: +@smallexample +op0 = op1 ? fma (-op3, op4, op2) : op2; +@end smallexample +for scalars and: +@smallexample +op0[I] = op1[I] ? fma (-op3[I], op4[I], op2[I]) : op2[I]; +@end smallexample +for vectors. + @cindex @code{neg@var{mode}cc} instruction pattern @item @samp{neg@var{mode}cc} Similar to @samp{mov@var{mode}cc} but for conditional negation. Conditionally Index: gcc/optabs.def === --- gcc/optabs.def 2018-05-16 10:23:03.590853492 +0100 +++ gcc/optabs.def 2018-05-16 10:23:03.887838686 +0100 @@ -222,6 +222,8 @@ OPTAB_D (notcc_optab, "not$acc") OPTAB_D (movcc_opt