[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
SjoerdMeijer abandoned this revision. SjoerdMeijer added a comment. This is implemented in https://reviews.llvm.org/D44591. https://reviews.llvm.org/D44222 ___ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
az added a comment. Was not able to update this particular review with the new code, So I created a new one in https://reviews.llvm.org/D44591 I manage to reuse the mulx scalar intrinsic work, not exactly calling the fp16 scalar intrinsic itself which is not available here but the same frontend codegen work with an extract instruction before that. https://reviews.llvm.org/D44222 ___ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
SjoerdMeijer added inline comments. Comment at: include/clang/Basic/arm_neon.td:1504 + // Scalar floating point multiply extended (scalar, by element) + def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>; + def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>; az wrote: > SjoerdMeijer wrote: > > I found that unfortunately it's not that straightforward. This leads to > > wrong code generation as it is generating a fmul instead of fmulx. I am > > suspecting this instruction description should be using OP_SCALAR_MULX_LN, > > but also the type decls are wrong. Need to dig a bit further here. > Sorry for confusion as the commented code was never intended to be used and > it is a copy of the code for the intrinsic vmulh_lane(). It was done that way > in order to point out that vmulh_lane() and vmulxh_lane() intrinsics should > be implemented in a similar way. The only useful thing in the commented code > is the explanation that we need the scalar intrinsic vmulxh_f16() which was > implemented in the scalar intrinsic patch later on. > > If we look at how vmulh_lane (a, b, lane) is implemented: > x = extract (b, lane); > res = a * x; > return res; > > Similarly, I thought at the time that vmulxh_lane (a, b, lane) can be > implemented: > x = extract (b, lane); > res = vmulxh_f16 (a, x); // no llvm native mulx instruction, so we use > the fp16 scalar intrinsic. > return res; > > I am not sure now that we can easily use scalar intrinsic while generating > the arm_neon.h file. In case we can not do that, I am thinking that the > frontend should generate a new builtin for intrinsic vmulxh_lane() that the > backend recognizes and generate the right code for it which is fmulx h0, h0, > v1.h[lane]. If you made or will be making progress on this, then that is > great. Otherwise, I can look at a frontend solution for it. Hi Abderrazek, Thanks for the clarifications! And I agree with your observations. This simple changed looked to do the right thing, because as you also said, this vmulx is just an extract and a multiply, but then it was incorrectly generating a fmul which should be a fmulx. I briefly looked at fixing this, but also didn't see how I could use the scalar intrinsic here. Looks like passing a builtin is indeed the best thing, also because fmulx is instruction selected based on a intrinsic: defm FMULX: SIMDThreeSameVectorFP<0,0,0b011,"fmulx", int_aarch64_neon_fmulx>; If you have the bandwidth to pick this up, that would be great; I started looking into the other failing AArch64 vector intrinsics. Cheers, Sjoerd. https://reviews.llvm.org/D44222 ___ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
az added inline comments. Comment at: include/clang/Basic/arm_neon.td:1504 + // Scalar floating point multiply extended (scalar, by element) + def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>; + def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>; SjoerdMeijer wrote: > I found that unfortunately it's not that straightforward. This leads to wrong > code generation as it is generating a fmul instead of fmulx. I am suspecting > this instruction description should be using OP_SCALAR_MULX_LN, but also the > type decls are wrong. Need to dig a bit further here. Sorry for confusion as the commented code was never intended to be used and it is a copy of the code for the intrinsic vmulh_lane(). It was done that way in order to point out that vmulh_lane() and vmulxh_lane() intrinsics should be implemented in a similar way. The only useful thing in the commented code is the explanation that we need the scalar intrinsic vmulxh_f16() which was implemented in the scalar intrinsic patch later on. If we look at how vmulh_lane (a, b, lane) is implemented: x = extract (b, lane); res = a * x; return res; Similarly, I thought at the time that vmulxh_lane (a, b, lane) can be implemented: x = extract (b, lane); res = vmulxh_f16 (a, x); // no llvm native mulx instruction, so we use the fp16 scalar intrinsic. return res; I am not sure now that we can easily use scalar intrinsic while generating the arm_neon.h file. In case we can not do that, I am thinking that the frontend should generate a new builtin for intrinsic vmulxh_lane() that the backend recognizes and generate the right code for it which is fmulx h0, h0, v1.h[lane]. If you made or will be making progress on this, then that is great. Otherwise, I can look at a frontend solution for it. https://reviews.llvm.org/D44222 ___ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
SjoerdMeijer added inline comments. Comment at: include/clang/Basic/arm_neon.td:1504 + // Scalar floating point multiply extended (scalar, by element) + def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>; + def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>; I found that unfortunately it's not that straightforward. This leads to wrong code generation as it is generating a fmul instead of fmulx. I am suspecting this instruction description should be using OP_SCALAR_MULX_LN, but also the type decls are wrong. Need to dig a bit further here. https://reviews.llvm.org/D44222 ___ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
evandro accepted this revision. evandro added a comment. This revision is now accepted and ready to land. Looks pretty straightforward to me. https://reviews.llvm.org/D44222 ___ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D44222: [AArch64] Add vmulxh_lane FP16 intrinsics
SjoerdMeijer created this revision. SjoerdMeijer added reviewers: az, evandro, olista01. Herald added subscribers: kristof.beyls, javed.absar, rengolin. Add 2 vmulxh_lane vector intrinsics that were commented out. https://reviews.llvm.org/D44222 Files: include/clang/Basic/arm_neon.td test/CodeGen/aarch64-v8.2a-neon-intrinsics.c Index: test/CodeGen/aarch64-v8.2a-neon-intrinsics.c === --- test/CodeGen/aarch64-v8.2a-neon-intrinsics.c +++ test/CodeGen/aarch64-v8.2a-neon-intrinsics.c @@ -1223,27 +1223,25 @@ return vmulxq_n_f16(a, b); } -/* TODO: Not implemented yet (needs scalar intrinsic from arm_fp16.h) -// CCHECK-LABEL: test_vmulxh_lane_f16 -// CCHECK: [[CONV0:%.*]] = fpext half %a to float -// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float -// CCHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] -// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half -// CCHECK: ret half [[CONV3:%.*]] +// CHECK-LABEL: test_vmulxh_lane_f16 +// CHECK: [[CONV0:%.*]] = fpext half %a to float +// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float +// CHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] +// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half +// CHECK: ret half [[CONV3:%.*]] float16_t test_vmulxh_lane_f16(float16_t a, float16x4_t b) { return vmulxh_lane_f16(a, b, 3); } -// CCHECK-LABEL: test_vmulxh_laneq_f16 -// CCHECK: [[CONV0:%.*]] = fpext half %a to float -// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float -// CCHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] -// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half -// CCHECK: ret half [[CONV3:%.*]] +// CHECK-LABEL: test_vmulxh_laneq_f16 +// CHECK: [[CONV0:%.*]] = fpext half %a to float +// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float +// CHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] +// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half +// CHECK: ret half [[CONV3:%.*]] float16_t test_vmulxh_laneq_f16(float16_t a, float16x8_t b) { return vmulxh_laneq_f16(a, b, 7); } -*/ // CHECK-LABEL: test_vmaxv_f16 // CHECK: [[TMP0:%.*]] = bitcast <4 x half> %a to <8 x i8> Index: include/clang/Basic/arm_neon.td === --- include/clang/Basic/arm_neon.td +++ include/clang/Basic/arm_neon.td @@ -1499,11 +1499,10 @@ def VMULX_LANEH : IOpInst<"vmulx_lane", "ddgi", "hQh", OP_MULX_LN>; def VMULX_LANEQH : IOpInst<"vmulx_laneq", "ddji", "hQh", OP_MULX_LN>; def VMULX_NH : IOpInst<"vmulx_n", "dds", "hQh", OP_MULX_N>; - // TODO: Scalar floating point multiply extended (scalar, by element) - // Below ones are commented out because they need vmulx_f16(float16_t, float16_t) - // which will be implemented later with fp16 scalar intrinsic (arm_fp16.h) - //def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>; - //def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>; + + // Scalar floating point multiply extended (scalar, by element) + def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>; + def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>; // ARMv8.2-A FP16 reduction vector intrinsics. def VMAXVH : SInst<"vmaxv", "sd", "hQh">; Index: test/CodeGen/aarch64-v8.2a-neon-intrinsics.c === --- test/CodeGen/aarch64-v8.2a-neon-intrinsics.c +++ test/CodeGen/aarch64-v8.2a-neon-intrinsics.c @@ -1223,27 +1223,25 @@ return vmulxq_n_f16(a, b); } -/* TODO: Not implemented yet (needs scalar intrinsic from arm_fp16.h) -// CCHECK-LABEL: test_vmulxh_lane_f16 -// CCHECK: [[CONV0:%.*]] = fpext half %a to float -// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float -// CCHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] -// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half -// CCHECK: ret half [[CONV3:%.*]] +// CHECK-LABEL: test_vmulxh_lane_f16 +// CHECK: [[CONV0:%.*]] = fpext half %a to float +// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float +// CHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] +// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half +// CHECK: ret half [[CONV3:%.*]] float16_t test_vmulxh_lane_f16(float16_t a, float16x4_t b) { return vmulxh_lane_f16(a, b, 3); } -// CCHECK-LABEL: test_vmulxh_laneq_f16 -// CCHECK: [[CONV0:%.*]] = fpext half %a to float -// CCHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float -// CCHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] -// CCHECK: [[CONV3:%.*]] = fptrunc float %mul to half -// CCHECK: ret half [[CONV3:%.*]] +// CHECK-LABEL: test_vmulxh_laneq_f16 +// CHECK: [[CONV0:%.*]] = fpext half %a to float +// CHECK: [[CONV1:%.*]] = fpext half %{{.*}} to float +// CHECK: [[MUL:%.*]] = fmul float [[CONV0:%.*]], [[CONV0:%.*]] +// CHECK: [[CONV3:%.*]] = fptrunc float %mul to half +//