[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-19 Thread Sjoerd Meijer via Phabricator via cfe-commits
SjoerdMeijer abandoned this revision.
SjoerdMeijer added a comment.

This is implemented in https://reviews.llvm.org/D44591.


https://reviews.llvm.org/D44222





[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-16 Thread Abderrazek Zaafrani via Phabricator via cfe-commits
az added a comment.

I was not able to update this particular review with the new code, so I created
a new one at https://reviews.llvm.org/D44591

I managed to reuse the mulx scalar intrinsic work: not by calling the fp16
scalar intrinsic itself, which is not available here, but by using the same
frontend codegen, with an extract instruction before it.


https://reviews.llvm.org/D44222





[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-12 Thread Sjoerd Meijer via Phabricator via cfe-commits
SjoerdMeijer added inline comments.



Comment at: include/clang/Basic/arm_neon.td:1504
+  // Scalar floating point multiply extended (scalar, by element)
+  def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
+  def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;

az wrote:
> SjoerdMeijer wrote:
> > I found that unfortunately it's not that straightforward: this change
> > leads to wrong code generation, producing an fmul where an fmulx is
> > needed. I suspect this instruction description should be using
> > OP_SCALAR_MULX_LN, but the type decls are also wrong. I need to dig a
> > bit further here.
> Sorry for the confusion: the commented-out code was never intended to be
> used; it is a copy of the code for the intrinsic vmulh_lane(). It was
> written that way to point out that the vmulh_lane() and vmulxh_lane()
> intrinsics should be implemented in a similar way. The only useful thing in
> the commented-out code is the note that we need the scalar intrinsic
> vmulxh_f16(), which was implemented later in the scalar intrinsic patch.
>
> If we look at how vmulh_lane(a, b, lane) is implemented:
> x = extract(b, lane);
> res = a * x;
> return res;
>
> Similarly, I thought at the time that vmulxh_lane(a, b, lane) could be
> implemented as:
> x = extract(b, lane);
> res = vmulxh_f16(a, x);  // no native llvm mulx instruction, so use the
>                          // fp16 scalar intrinsic
> return res;
>
> I am not sure now that we can easily use a scalar intrinsic while
> generating the arm_neon.h file. If we cannot, I am thinking that the
> frontend should generate a new builtin for the intrinsic vmulxh_lane() that
> the backend recognizes and generates the right code for, namely
> fmulx h0, h0, v1.h[lane]. If you have made, or will be making, progress on
> this, then that is great. Otherwise, I can look at a frontend solution for
> it.
Hi Abderrazek,
Thanks for the clarifications! And I agree with your observations.
This simple change looked to do the right thing because, as you also said,
this vmulx is just an extract and a multiply, but it then incorrectly
generated an fmul where it should have been an fmulx. I briefly looked at
fixing this, but I also didn't see how I could use the scalar intrinsic here.
Passing a builtin is indeed the best approach, also because fmulx is
instruction-selected based on an intrinsic:

  defm FMULX: SIMDThreeSameVectorFP<0,0,0b011,"fmulx", int_aarch64_neon_fmulx>;

If you have the bandwidth to pick this up, that would be great; I started 
looking into the other failing AArch64 vector intrinsics.

Cheers,
Sjoerd.


https://reviews.llvm.org/D44222





[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-08 Thread Abderrazek Zaafrani via Phabricator via cfe-commits
az added inline comments.



Comment at: include/clang/Basic/arm_neon.td:1504
+  // Scalar floating point multiply extended (scalar, by element)
+  def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
+  def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;

SjoerdMeijer wrote:
> I found that unfortunately it's not that straightforward: this change leads
> to wrong code generation, producing an fmul where an fmulx is needed. I
> suspect this instruction description should be using OP_SCALAR_MULX_LN, but
> the type decls are also wrong. I need to dig a bit further here.
Sorry for the confusion: the commented-out code was never intended to be used;
it is a copy of the code for the intrinsic vmulh_lane(). It was written that
way to point out that the vmulh_lane() and vmulxh_lane() intrinsics should be
implemented in a similar way. The only useful thing in the commented-out code
is the note that we need the scalar intrinsic vmulxh_f16(), which was
implemented later in the scalar intrinsic patch.

If we look at how vmulh_lane(a, b, lane) is implemented:
x = extract(b, lane);
res = a * x;
return res;

Similarly, I thought at the time that vmulxh_lane(a, b, lane) could be
implemented as:
x = extract(b, lane);
res = vmulxh_f16(a, x);  // no native llvm mulx instruction, so use the
                         // fp16 scalar intrinsic
return res;

I am not sure now that we can easily use a scalar intrinsic while generating
the arm_neon.h file. If we cannot, I am thinking that the frontend should
generate a new builtin for the intrinsic vmulxh_lane() that the backend
recognizes and generates the right code for, namely fmulx h0, h0, v1.h[lane].
If you have made, or will be making, progress on this, then that is great.
Otherwise, I can look at a frontend solution for it.
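
For concreteness, here is a minimal C sketch of this lowering. It assumes the
scalar intrinsic vmulxh_f16() from arm_fp16.h is available (it only lands with
the later scalar intrinsic patch), and it hardcodes lane 3 because the lane
argument of vget_lane_f16() must be a compile-time constant:

  #include <arm_fp16.h>
  #include <arm_neon.h>

  // Sketch only: vmulxh_f16() comes from the later arm_fp16.h patch.
  // Lane 3 is hardcoded because the intrinsic requires a constant lane.
  static float16_t vmulxh_lane3_f16_sketch(float16_t a, float16x4_t b) {
    float16_t x = vget_lane_f16(b, 3); // x = extract(b, lane)
    return vmulxh_f16(a, x);           // no native llvm mulx, so use the
                                       // fp16 scalar intrinsic
  }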


https://reviews.llvm.org/D44222





[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-08 Thread Sjoerd Meijer via Phabricator via cfe-commits
SjoerdMeijer added inline comments.



Comment at: include/clang/Basic/arm_neon.td:1504
+  // Scalar floating point multiply extended (scalar, by element)
+  def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
+  def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;

I found that unfortunately it's not that straightforward: this change leads to
wrong code generation, producing an fmul where an fmulx is needed. I suspect
this instruction description should be using OP_SCALAR_MULX_LN, but the type
decls are also wrong. I need to dig a bit further here.
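
For reference, the reason a plain fmul is wrong code: fmul and fmulx differ
only in the handling of zero times infinity, where FMULX returns 2.0 with the
XOR of the operands' signs instead of NaN. A small C reference model of the
semantics, as a sketch to illustrate the difference rather than how the
backend implements it:

  #include <math.h>

  // Reference model of AArch64 FMULX semantics (illustration only):
  // identical to an ordinary multiply, except that (+/-0) * (+/-Inf),
  // which fmul turns into NaN, yields +/-2.0, the sign being the XOR
  // of the operands' signs.
  static float fmulx_ref(float a, float b) {
    if ((a == 0.0f && isinf(b)) || (isinf(a) && b == 0.0f)) {
      int neg = (signbit(a) != 0) != (signbit(b) != 0);
      return neg ? -2.0f : 2.0f;
    }
    return a * b; // every other case behaves exactly like fmul
  }

So any lowering of vmulxh_lane_f16() that goes through an ordinary multiply
silently changes these edge cases.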


https://reviews.llvm.org/D44222





[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-07 Thread Evandro Menezes via Phabricator via cfe-commits
evandro accepted this revision.
evandro added a comment.
This revision is now accepted and ready to land.

Looks pretty straightforward to me.


https://reviews.llvm.org/D44222





[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics

2018-03-07 Thread Sjoerd Meijer via Phabricator via cfe-commits
SjoerdMeijer created this revision.
SjoerdMeijer added reviewers: az, evandro, olista01.
Herald added subscribers: kristof.beyls, javed.absar, rengolin.

Add 2 vmulxh_lane vector intrinsics that were commented out.
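
For context, a minimal usage sketch of the two intrinsics being enabled, based
on the calls in the test file below (requires an ARMv8.2-A FP16 target, e.g.
-march=armv8.2-a+fp16):

  #include <arm_neon.h>

  // Usage sketch: vmulxh_lane_f16 multiplies by a lane of a 64-bit vector,
  // vmulxh_laneq_f16 by a lane of a 128-bit vector; the lane index must be
  // a compile-time constant.
  float16_t demo(float16_t a, float16x4_t v4, float16x8_t v8) {
    float16_t r1 = vmulxh_lane_f16(a, v4, 3);
    float16_t r2 = vmulxh_laneq_f16(a, v8, 7);
    return (float16_t)(r1 + r2);
  }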


https://reviews.llvm.org/D44222

Files:
  include/clang/Basic/arm_neon.td
  test/CodeGen/aarch64-v8.2a-neon-intrinsics.c


Index: test/CodeGen/aarch64-v8.2a-neon-intrinsics.c
===
--- test/CodeGen/aarch64-v8.2a-neon-intrinsics.c
+++ test/CodeGen/aarch64-v8.2a-neon-intrinsics.c
@@ -1223,27 +1223,25 @@
   return vmulxq_n_f16(a, b);
 }
 
-/* TODO: Not implemented yet (needs scalar intrinsic from arm_fp16.h)
-// CCHECK-LABEL: test_vmulxh_lane_f16
-// CCHECK: [[CONV0:%.*]] = fpext half %a to float
-// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
-// CCHECK: [[MUL:%.*]]   = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
-// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half
-// CCHECK: ret half [[CONV3:%.*]]
+// CHECK-LABEL: test_vmulxh_lane_f16
+// CHECK: [[CONV0:%.*]] = fpext half %a to float
+// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
+// CHECK: [[MUL:%.*]]   = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
+// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half
+// CHECK: ret half [[CONV3:%.*]]
 float16_t test_vmulxh_lane_f16(float16_t a, float16x4_t b) {
   return vmulxh_lane_f16(a, b, 3);
 }
 
-// CCHECK-LABEL: test_vmulxh_laneq_f16
-// CCHECK: [[CONV0:%.*]] = fpext half %a to float
-// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
-// CCHECK: [[MUL:%.*]]   = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
-// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half
-// CCHECK: ret half [[CONV3:%.*]]
+// CHECK-LABEL: test_vmulxh_laneq_f16
+// CHECK: [[CONV0:%.*]] = fpext half %a to float
+// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float
+// CHECK: [[MUL:%.*]]   = fmul float [[CONV0:%.*]], [[CONV0:%.*]]
+// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half
+// CHECK: ret half [[CONV3:%.*]]
 float16_t test_vmulxh_laneq_f16(float16_t a, float16x8_t b) {
   return vmulxh_laneq_f16(a, b, 7);
 }
-*/
 
 // CHECK-LABEL: test_vmaxv_f16
 // CHECK: [[TMP0:%.*]] = bitcast <4 x half> %a to <8 x i8>
Index: include/clang/Basic/arm_neon.td
===
--- include/clang/Basic/arm_neon.td
+++ include/clang/Basic/arm_neon.td
@@ -1499,11 +1499,10 @@
   def VMULX_LANEH   : IOpInst<"vmulx_lane", "ddgi", "hQh", OP_MULX_LN>;
   def VMULX_LANEQH  : IOpInst<"vmulx_laneq", "ddji", "hQh", OP_MULX_LN>;
   def VMULX_NH  : IOpInst<"vmulx_n", "dds", "hQh", OP_MULX_N>;
-  // TODO: Scalar floating point multiply extended (scalar, by element)
-  // Below ones are commented out because they need vmulx_f16(float16_t, float16_t)
-  // which will be implemented later with fp16 scalar intrinsic (arm_fp16.h)
-  //def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
-  //def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;
+
+  // Scalar floating point multiply extended (scalar, by element)
+  def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
+  def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;
 
   // ARMv8.2-A FP16 reduction vector intrinsics.
   def VMAXVH   : SInst<"vmaxv", "sd", "hQh">;

