[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-05-18 Thread Kevin P. Neal via Phabricator via cfe-commits
kpn added a comment.

In D99675#2696327, @pengfei wrote:

> In D99675#2695424, @kpn wrote:
>
>> What changes are needed for a backend, and what happens if they aren't done?
>
> As far as I understand it, the backend does optimizations based on patterns 
> of known nodes and MIs. Inserting a new node/MI will block any optimizations 
> across the fence, so it respects the semantics of the intrinsic without 
> target-specific changes.
> I'm not sure whether there is room for optimization across the 
> `arithmetic.fence`. If there is, and no target changes are made for it, the 
> backend may lose some performance in those circumstances.
>
>> Having something needed for correctness silently not work seems ... 
>> sub-optimal.
>
> I think the backend is conservative about optimizations when the intrinsic 
> is used. It won't silently have a correctness issue, but it may silently 
> lose performance.

OK, that sounds fine, then.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99675/new/

https://reviews.llvm.org/D99675

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-16 Thread Pengfei Wang via Phabricator via cfe-commits
pengfei added a comment.

In D99675#2695424, @kpn wrote:

> What changes are needed for a backend, and what happens if they aren't done?

As far as I understand it, the backend does optimizations based on patterns of 
known nodes and MIs. Inserting a new node/MI will block any optimizations 
across the fence, so it respects the semantics of the intrinsic without 
target-specific changes.
I'm not sure whether there is room for optimization across the 
`arithmetic.fence`. If there is, and no target changes are made for it, the 
backend may lose some performance in those circumstances.

> Having something needed for correctness silently not work seems ... 
> sub-optimal.

I think the backend is conservative about optimizations when the intrinsic is 
used. It won't silently have a correctness issue, but it may silently lose 
performance.




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-16 Thread Kevin P. Neal via Phabricator via cfe-commits
kpn added a comment.

In D99675#2695480, @mibintc wrote:

> In D99675#2695424, @kpn wrote:
>
>> What changes are needed for a backend, and what happens if they aren't done?
>
> In the clang patch, I'm planning to add to TargetInfo a function along the 
> lines of "does the target support __arithmetic_fence?".
> In the llvm patch, the fallback implementation could be to simply ignore the 
> call and pass through the operand value. Is that adequate?

If clang is the only compiler to ever emit this new intrinsic then, yes, that's 
perfectly fine.

If a front-end other than clang uses the new fence then I'm nervous about 
having the fence just vanish. If the fence is used then it must be for 
correctness, right? Having something needed for correctness silently not work 
seems ... sub-optimal. It's the sort of thing that might not get caught in 
testing, and then you've got end-users running software that silently lacks 
something needed for correctness. That makes me nervous. I'd rather LLVM bomb 
instead of silently dropping this fence. Then developers know they have a 
problem before a product goes out the door.

But if I'm the only one that's nervous then that's OK and clang rejecting the 
compile would be sufficient.

Has this sort of issue come up in the past? How was it handled?




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-16 Thread Melanie Blower via Phabricator via cfe-commits
mibintc added a comment.

In D99675#2695424, @kpn wrote:

> What changes are needed for a backend, and what happens if they aren't done?

In the clang patch, I'm planning to add to TargetInfo a function along the 
lines of "does the target support __arithmetic_fence?".
In the llvm patch, the fallback implementation could be to simply ignore the 
call and pass through the operand value. Is that adequate?




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-16 Thread Kevin P. Neal via Phabricator via cfe-commits
kpn added a comment.

What changes are needed for a backend, and what happens if they aren't done?




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-16 Thread Melanie Blower via Phabricator via cfe-commits
mibintc updated this revision to Diff 338099.
mibintc added a comment.

I accidentally dropped the test case in the previous commit.  I'm just adding 
it back in -- under the llvm/test directory (previously it was in the wrong 
location).



Files:
  llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  llvm/include/llvm/CodeGen/BasicTTIImpl.h
  llvm/include/llvm/CodeGen/ISDOpcodes.h
  llvm/include/llvm/CodeGen/SelectionDAGISel.h
  llvm/include/llvm/IR/IRBuilder.h
  llvm/include/llvm/IR/Intrinsics.td
  llvm/include/llvm/Support/TargetOpcodes.def
  llvm/include/llvm/Target/Target.td
  llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
  llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
  llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
  llvm/test/CodeGen/X86/arithmetic_fence.ll

Index: llvm/test/CodeGen/X86/arithmetic_fence.ll
===
--- /dev/null
+++ llvm/test/CodeGen/X86/arithmetic_fence.ll
@@ -0,0 +1,122 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+fma | FileCheck %s --check-prefix=X86
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+fma | FileCheck %s --check-prefix=X64
+
+define float @f1(float %a, float %b, float %c) {
+; X86-LABEL: f1:
+; X86:   # %bb.0:
+; X86-NEXT:pushl %eax
+; X86-NEXT:.cfi_def_cfa_offset 8
+; X86-NEXT:vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; X86-NEXT:vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
+; X86-NEXT:vfmadd213ss {{.*#+}} xmm1 = (xmm0 * xmm1) + mem
+; X86-NEXT:vmovss %xmm1, (%esp)
+; X86-NEXT:flds (%esp)
+; X86-NEXT:popl %eax
+; X86-NEXT:.cfi_def_cfa_offset 4
+; X86-NEXT:retl
+;
+; X64-LABEL: f1:
+; X64:   # %bb.0:
+; X64-NEXT:vfmadd213ss {{.*#+}} xmm0 = (xmm1 * xmm0) + xmm2
+; X64-NEXT:retq
+  %mul = fmul fast float %b, %a
+  %add = fadd fast float %mul, %c
+  ret float %add
+}
+
+define float @f2(float %a, float %b, float %c) {
+; X86-LABEL: f2:
+; X86:   # %bb.0:
+; X86-NEXT:pushl %eax
+; X86-NEXT:.cfi_def_cfa_offset 8
+; X86-NEXT:vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; X86-NEXT:vmulss {{[0-9]+}}(%esp), %xmm0, %xmm0
+; X86-NEXT:#ARITH_FENCE
+; X86-NEXT:vaddss {{[0-9]+}}(%esp), %xmm0, %xmm0
+; X86-NEXT:vmovss %xmm0, (%esp)
+; X86-NEXT:flds (%esp)
+; X86-NEXT:popl %eax
+; X86-NEXT:.cfi_def_cfa_offset 4
+; X86-NEXT:retl
+;
+; X64-LABEL: f2:
+; X64:   # %bb.0:
+; X64-NEXT:vmulss %xmm0, %xmm1, %xmm0
+; X64-NEXT:#ARITH_FENCE
+; X64-NEXT:vaddss %xmm2, %xmm0, %xmm0
+; X64-NEXT:retq
+  %mul = fmul fast float %b, %a
+  %tmp = call float @llvm.arithmetic.fence.f32(float %mul)
+  %add = fadd fast float %tmp, %c
+  ret float %add
+}
+
+define double @f3(double %a, double %b, double %c) {
+; X86-LABEL: f3:
+; X86:   # %bb.0:
+; X86-NEXT:pushl %ebp
+; X86-NEXT:.cfi_def_cfa_offset 8
+; X86-NEXT:.cfi_offset %ebp, -8
+; X86-NEXT:movl %esp, %ebp
+; X86-NEXT:.cfi_def_cfa_register %ebp
+; X86-NEXT:andl $-8, %esp
+; X86-NEXT:subl $8, %esp
+; X86-NEXT:vmovsd {{.*#+}} xmm0 = mem[0],zero
+; X86-NEXT:vaddsd 16(%ebp), %xmm0, %xmm0
+; X86-NEXT:vmovsd %xmm0, (%esp)
+; X86-NEXT:fldl (%esp)
+; X86-NEXT:movl %ebp, %esp
+; X86-NEXT:popl %ebp
+; X86-NEXT:.cfi_def_cfa %esp, 4
+; X86-NEXT:retl
+;
+; X64-LABEL: f3:
+; X64:   # %bb.0:
+; X64-NEXT:vaddsd %xmm2, %xmm1, %xmm0
+; X64-NEXT:retq
+  %1 = fadd fast double %a, %b
+  %2 = fsub fast double %c, %a
+  %3 = fadd fast double %1, %2
+  ret double %3
+}
+
+define double @f4(double %a, double %b, double %c) {
+; X86-LABEL: f4:
+; X86:   # %bb.0:
+; X86-NEXT:pushl %ebp
+; X86-NEXT:.cfi_def_cfa_offset 8
+; X86-NEXT:.cfi_offset %ebp, -8
+; X86-NEXT:movl %esp, %ebp
+; X86-NEXT:.cfi_def_cfa_register %ebp
+; X86-NEXT:andl $-8, %esp
+; X86-NEXT:subl $8, %esp
+; X86-NEXT:vmovsd {{.*#+}} xmm0 = mem[0],zero
+; X86-NEXT:vmovsd {{.*#+}} xmm1 = mem[0],zero
+; X86-NEXT:vaddsd 16(%ebp), %xmm1, %xmm2
+; X86-NEXT:#ARITH_FENCE
+; X86-NEXT:vsubsd %xmm1, %xmm0, %xmm0
+; X86-NEXT:vaddsd %xmm0, %xmm2, %xmm0
+; X86-NEXT:vmovsd %xmm0, (%esp)
+; X86-NEXT:fldl (%esp)
+; X86-NEXT:movl %ebp, %esp
+; X86-NEXT:popl %ebp
+; X86-NEXT:.cfi_def_cfa %esp, 4
+; X86-NEXT:retl
+;
+; X64-LABEL: f4:
+; X64:   # %bb.0:
+; X64-NEXT:vaddsd %xmm1, %xmm0, %xmm1
+; X64-NEXT:#ARITH_FENCE
+; X64-NEXT:vsubsd %xmm0, %xmm2, %xmm0
+; X64-NEXT:vaddsd %xmm0, %xmm1, %xmm0
+; X64-NEXT:retq
+  %1 = fadd fast double %a, %b
+  %t = call double @llvm.arithmetic.fence.f64(double %1)
+  %2 = fsub fast double %c, %a
+  %3 = fadd fast double %t, %2
+  ret double %3
+}
+
+declare float @llvm.arithmetic.fence.f32(float)
+declare double @llvm.arithmetic.fence.f64(double)

[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-15 Thread Melanie Blower via Phabricator via cfe-commits
mibintc updated this revision to Diff 337879.
mibintc edited the summary of this revision.
mibintc added a comment.

This is a minor update from @pengfei which allows simple test cases to run 
end-to-end with clang.
Also I changed the "summary" to reflect the review discussion around the FMA 
optimization, to choose "FMA is not allowed across a fence".



Files:
  llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  llvm/include/llvm/CodeGen/BasicTTIImpl.h
  llvm/include/llvm/CodeGen/ISDOpcodes.h
  llvm/include/llvm/CodeGen/SelectionDAGISel.h
  llvm/include/llvm/IR/IRBuilder.h
  llvm/include/llvm/IR/Intrinsics.td
  llvm/include/llvm/Support/TargetOpcodes.def
  llvm/include/llvm/Target/Target.td
  llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
  llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
  llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp

Index: llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
===
--- llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
+++ llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
@@ -2321,6 +2321,11 @@
N->getOperand(0));
 }
 
+void SelectionDAGISel::Select_ARITH_FENCE(SDNode *N) {
+  CurDAG->SelectNodeTo(N, TargetOpcode::ARITH_FENCE, N->getValueType(0),
+   N->getOperand(0));
+}
+
 /// GetVBR - decode a vbr encoding whose top bit is set.
 LLVM_ATTRIBUTE_ALWAYS_INLINE static uint64_t
GetVBR(uint64_t Val, const unsigned char *MatcherTable, unsigned &Idx) {
@@ -2872,6 +2877,9 @@
   case ISD::FREEZE:
 Select_FREEZE(NodeToMatch);
 return;
+  case ISD::ARITH_FENCE:
+Select_ARITH_FENCE(NodeToMatch);
+return;
   }
 
   assert(!NodeToMatch->isMachineOpcode() && "Node already selected!");
Index: llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
===
--- llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7210,6 +7210,13 @@
 }
 break;
   }
+  case Intrinsic::arithmetic_fence: {
+auto DL = getCurSDLoc();
+SDValue Val = getValue(FPI.getArgOperand(0));
+EVT ResultVT = TLI.getValueType(DAG.getDataLayout(), FPI.getType());
+setValue(&FPI, DAG.getNode(ISD::ARITH_FENCE, DL, ResultVT, Val));
+return;
+  }
   }
 
   // A few strict DAG nodes carry additional operands that are not
Index: llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
===
--- llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -1275,6 +1275,9 @@
   case TargetOpcode::PSEUDO_PROBE:
 emitPseudoProbe(MI);
 break;
+  case TargetOpcode::ARITH_FENCE:
+OutStreamer->emitRawComment("ARITH_FENCE");
+break;
   default:
emitInstruction(&MI);
 if (CanDoExtraAnalysis) {
Index: llvm/include/llvm/Target/Target.td
===
--- llvm/include/llvm/Target/Target.td
+++ llvm/include/llvm/Target/Target.td
@@ -1172,6 +1172,12 @@
   let AsmString = "PSEUDO_PROBE";
   let hasSideEffects = 1;
 }
+def ARITH_FENCE : StandardPseudoInstruction {
+  let OutOperandList = (outs unknown:$dst);
+  let InOperandList = (ins unknown:$src);
+  let AsmString = "";
+  let hasSideEffects = false;
+}
 
 def STACKMAP : StandardPseudoInstruction {
   let OutOperandList = (outs);
Index: llvm/include/llvm/Support/TargetOpcodes.def
===
--- llvm/include/llvm/Support/TargetOpcodes.def
+++ llvm/include/llvm/Support/TargetOpcodes.def
@@ -117,6 +117,9 @@
 /// Pseudo probe
 HANDLE_TARGET_OPCODE(PSEUDO_PROBE)
 
+/// Arithmetic fence.
+HANDLE_TARGET_OPCODE(ARITH_FENCE)
+
 /// A Stackmap instruction captures the location of live variables at its
 /// position in the instruction stream. It is followed by a shadow of bytes
 /// that must lie within the function and not contain another stackmap.
Index: llvm/include/llvm/IR/Intrinsics.td
===
--- llvm/include/llvm/IR/Intrinsics.td
+++ llvm/include/llvm/IR/Intrinsics.td
@@ -1311,6 +1311,9 @@
 def int_pseudoprobe : Intrinsic<[], [llvm_i64_ty, llvm_i64_ty, llvm_i32_ty, llvm_i64_ty],
 [IntrInaccessibleMemOnly, IntrWillReturn]>;
 
+// Arithmetic fence intrinsic.
+def int_arithmetic_fence : Intrinsic<[llvm_anyfloat_ty], [LLVMMatchType<0>], [IntrNoMem]>;
+
 // Intrinsics to support half precision floating point format
 let IntrProperties = [IntrNoMem, IntrWillReturn] in {
 def int_convert_to_fp16   : DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_anyfloat_ty]>;
Index: llvm/include/llvm/IR/IRBuilder.h

[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-08 Thread Eli Friedman via Phabricator via cfe-commits
efriedma added a comment.

In D99675#2672138, @kbsmith1 wrote:

> In D99675#2671924, @efriedma wrote:
>
>> How is llvm.arith.fence() different from using "freeze" on a floating-point 
>> value?  The goal isn't really the same, sure, but the effects seem similar 
>> at first glance.
>
> They are similar.  However, freeze is a no-op if the operand can be proven 
> not to be undef or poison, and in such circumstances could be removed by an 
> optimizer.  llvm.arith.fence cannot be removed by an optimizer, because doing 
> so might allow instructions that were "outside" the fence to be 
> reassociated/distributed with the instructions/operands that were inside the 
> fence.

Okay.  In practice, it's basically impossible for us to prove that the result 
of "fast" arithmetic isn't poison, given the way ninf/nnan are defined, but 
depending on that would be fragile.




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-06 Thread Andy Kaylor via Phabricator via cfe-commits
andrew.w.kaylor added a comment.

In D99675#2671924, @efriedma wrote:

>> The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen 
>> before “+ c” and FMA guarantees that, but to prevent later optimizations 
>> from unpacking the FMA the correct transformation needs to be:
>>
>> llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c))
>
> Does this actually block later transforms from unpacking the FMA?  Maybe if 
> the FMA isn't marked "fast"...

Later transforms could unpack the FMA, but the result would be fenced. The 
intent isn't so much to prevent the FMA from being unpacked as to prevent 
losing the original fence semantics. That said, it doesn't quite work. For 
example, you might have this:

  %mul = fmul fast float %a, %b
  %fenced_mul = call float @llvm.arith.fence.f32(float %mul)
  %result = fadd fast float %fenced_mul, %c

If there are no other uses of %fenced_mul, that could become

  %tmp = call fast float @llvm.fmuladd.f32(float %a, float %b, float %c)
  %result = call float @llvm.arith.fence.f32(float %tmp)

If a later optimization decided to unpack this, it would become this:

  %mul = fmul fast float %a, %b
  %tmp = fadd fast float %mul, %c
  %result = call float @llvm.arith.fence.f32(float %tmp)

I suggested this as a way of enabling the FMA optimization. It brings the fadd 
into the fence, but still protects the fmul from being reassociated or 
otherwise transformed with other operations outside the fence. In a purely 
practical sense, this would probably work. In a more strict sense, though, I 
now see that it has the problem that you could legally distribute the addition 
within the fence. I can't see a practical reason anyone would do that, but the 
semantics would allow it. The same ("legal but not practical") is true of 
forming the fmuladd intrinsic before codegen, I think.

So, no, I don't think this works the way it was intended.

That might push us back to Kevin's suggestion of just not allowing the FMA 
optimization across a fence.




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-06 Thread Melanie Blower via Phabricator via cfe-commits
mibintc added a comment.

In D99675#2671924, @efriedma wrote:

>> The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen 
>> before “+ c” and FMA guarantees that, but to prevent later optimizations 
>> from unpacking the FMA the correct transformation needs to be:
>>
>> llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c))
>
> Does this actually block later transforms from unpacking the FMA?  Maybe if 
> the FMA isn't marked "fast"...

I'd like @pengfei to reply to this question. I think the overall idea is that 
many of the optimizations are pattern-based, and the existing patterns 
wouldn't match the new intrinsic.

> 
>
> How is llvm.arith.fence() different from using "freeze" on a floating-point 
> value?  The goal isn't really the same, sure, but the effects seem similar at 
> first glance.

Initially we thought the intrinsic "ssa.copy" could serve this purpose. 
However, ssa.copy is for a different purpose, and it gets optimized away.  We 
want arith.fence to survive through codegen; that's one reason why we think a 
new intrinsic is needed.




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-06 Thread Kevin B. Smith via Phabricator via cfe-commits
kbsmith1 added a comment.

In D99675#2671924, @efriedma wrote:

>> The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen 
>> before “+ c” and FMA guarantees that, but to prevent later optimizations 
>> from unpacking the FMA the correct transformation needs to be:
>>
>> llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c))
>
> Does this actually block later transforms from unpacking the FMA?  Maybe if 
> the FMA isn't marked "fast"...

I think we could define llvm.arith.fence such that this FMA contraction isn't 
legal/correct, or it could be left as is.  In the implementation that was used 
for the Intel compiler, FMA contraction did not occur across an __fence 
boundary.  It is unclear whether that was intended as the semantic, or if we 
just never bothered to implement that contraction.
Not allowing FMA contraction across the llvm.arith.fence would leave unpacking 
an FMA allowed under the same circumstances in which LLVM currently allows it.

> 
>
> How is llvm.arith.fence() different from using "freeze" on a floating-point 
> value?  The goal isn't really the same, sure, but the effects seem similar at 
> first glance.

They are similar.  However, freeze is a no-op if the operand can be proven not 
to be undef or poison, and in such circumstances could be removed by an 
optimizer.  llvm.arith.fence cannot be removed by an optimizer, because doing 
so might allow instructions that were "outside" the fence to be 
reassociated/distributed with the instructions/operands that were inside the 
fence.




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-04-06 Thread Eli Friedman via Phabricator via cfe-commits
efriedma added a comment.

> The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen 
> before “+ c” and FMA guarantees that, but to prevent later optimizations from 
> unpacking the FMA the correct transformation needs to be:
>
> llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c))

Does this actually block later transforms from unpacking the FMA?  Maybe if the 
FMA isn't marked "fast"...



How is llvm.arith.fence() different from using "freeze" on a floating-point 
value?  The goal isn't really the same, sure, but the effects seem similar at 
first glance.




[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

2021-03-31 Thread Melanie Blower via Phabricator via cfe-commits
mibintc created this revision.
mibintc added reviewers: andrew.w.kaylor, pengfei, kbsmith1.
Herald added subscribers: dexonsmith, jfb, hiraditya.
mibintc requested review of this revision.
Herald added a subscriber: jdoerfert.
Herald added a project: LLVM.

This is a proposal to add a new llvm intrinsic, llvm.arith.fence.  The purpose 
is to provide fine control, at the expression level, over floating point 
optimization when -ffast-math (-ffp-model=fast) is enabled.  We are also 
proposing a new clang builtin that provides access to this intrinsic, as well 
as a new clang command line option `-fprotect-parens` that will be implemented 
using this intrinsic.

This patch is authored by @pengfei

Rationale
-

Some expression transformations that are mathematically correct, such as 
reassociation and distribution, may be incorrect when dealing with finite 
precision floating point.  For example, these two expressions,

  (a + b) + c
  a + (b + c)

are equivalent mathematically in integer arithmetic, but not in floating point. 
 In some floating point (FP) models, the compiler is allowed to make these 
value-unsafe transformations for performance reasons, even when the programmer 
uses parentheses explicitly.  But the compiler must always honor the 
parentheses implied by llvm.arith.fence, regardless of the FP model settings.

Under `-ffp-model=fast`, llvm.arith.fence provides a way to partially enforce 
ordering in an FP expression.

| Original expression           | Transformed expression | Permitted? |
| ----------------------------- | ---------------------- | ---------- |
| (a + b) + c                   | a + (b + c)            | Yes!       |
| llvm.arith.fence((a + b) + c) | a + (b + c)            | No!        |



NOTE: llvm.arith.fence serves no purpose in value-safe FP modes like 
`-ffp-model=precise`, where FP expressions are already strictly ordered.

The new llvm intrinsic also enables the implementation of the option 
`-fprotect-parens` which is available in gfortran as well as the Intel C++ and 
Fortran compilers: icc and ifort.

Proposed llvm IR changes


Requirements for llvm.arith.fence:

- There is one operand. The input to the intrinsic is an llvm::Value and must 
be scalar floating point or vector floating point.
- The return type is the same as the operand type.
- The return value is equivalent to the operand.
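A minimal IR sketch (not taken from the patch) of a declaration and use 
consistent with these requirements, using the `llvm.arithmetic.fence` name 
that appears in this patch's test case:

```
declare float @llvm.arithmetic.fence.f32(float)

define float @protected_sum(float %a, float %b, float %c) {
  %ab = fadd fast float %a, %b
  ; Returns %ab unchanged, but (a + b) may not be reassociated
  ; or contracted with the fadd below.
  %fence = call float @llvm.arithmetic.fence.f32(float %ab)
  %res = fadd fast float %fence, %c
  ret float %res
}
```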

Optimizing llvm.arith.fence
---

- Constant folding: when the operand of llvm.arith.fence is constant, the 
operand's constant value may be substituted for the fence itself.
- CSE detection: no special changes needed. If E1 and E2 are common 
subexpressions, then llvm.arith.fence(E1) and llvm.arith.fence(E2) are as well.
- FMA transformation should be enabled, at least in the -ffp-model=fast case.
  - The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen 
before “+ c” and FMA guarantees that, but to prevent later optimizations from 
unpacking the FMA the correct transformation needs to be:

  llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c)) 



- In the -ffp-model=fast case, FMA formation doesn't happen until ISel, so we 
just need to add the llvm.arith.fence cases to the ISel pattern matching.
- There are some choices around the FMA optimization. For this example:

  %t1 = fmul double %x, %y
  %t2 = call double @llvm.arith.fence.f64(double %t1)
  %t3 = fadd contract double %t2, %z

  1. FMA is allowed across an arith.fence if and only if the FMF `contract` 
flag is set for the llvm.arith.fence operand. //We are recommending this 
choice.//
  2. FMA is not allowed across a fence
  3. The FMF `contract` flag should be set on the llvm.arith.fence intrinsic 
call if contraction should be enabled
- Fast Math Optimization:
  - The result of a llvm.arith.fence can participate in fast math 
optimizations.  For example:

  // This transformation is legal:
  w + llvm.arith.fence(x + y) + z   →   w + z + llvm.arith.fence(x + y)

- The operand of a llvm.arith.fence can participate in fast math optimizations. 
 For example:

  // This transformation is legal:
  llvm.arith.fence((x+y)+z) --> llvm.arith.fence(x+(y+z))



NOTE: We want fast-math optimization within the fence, but not across the fence.



- MIR Optimization:
  - The use of a pseudo-operation in the MIR serves the same purpose as the 
intrinsic in the IR, since all the optimizations are based on pattern 
matching over known DAGs/MIs.
  - The backend simply respects the llvm.arith.fence intrinsic: it builds an 
llvm.arith.fence node during DAG/ISel and emits a pseudo arithmetic_fence MI 
after it.
  - The pseudo arithmetic_fence MI turns into a comment when emitting assembly.

Other llvm changes needed -- utility functions
--

The ValueTracking utilities will need to be taught to handle the new intrinsic. 
For example, there are utility functions like `isKnownNeverNaN()` and 
`CannotBeOrderedLessThanZero()` that will