[Bug target/70314] AVX512 not using kandw to combine comparison results

2020-08-05 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70314

--- Comment #6 from Hongtao.liu  ---
Same issue mentioned in PR88808

[Bug target/70314] AVX512 not using kandw to combine comparison results

2020-08-05 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70314

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #5 from Hongtao.liu  ---
(In reply to Marc Glisse from comment #4)
> We now generate for the original testcase
> 
>   vpcmpd  $1, %zmm3, %zmm2, %k1
>   vpcmpd  $1, %zmm1, %zmm0, %k0{%k1}
>   vpmovm2d    %k0, %zmm0
> 
> which looks great.
> 
> However, using | instead of &, we get
> 
>   vpcmpd  $1, %zmm1, %zmm0, %k0
>   vpcmpd  $1, %zmm3, %zmm2, %k1
>   kmovw   %k0, %eax
>   kmovw   %k1, %edx
>   orl %edx, %eax
>   kmovw   %eax, %k2

Yes, korw %k0, %k1, %k2 should be used.
I'll take a look.

>   vpmovm2d    %k2, %zmm0
> 
> Well, at least gimple did what it could, and it is now up to the target to
> handle logical operations on bool vectors / k* registers. There is probably
> already another bug report about that...
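The testcase shape under discussion can be sketched with GNU C vector extensions (type and function names here are illustrative, not the PR's exact code): combining two 512-bit compares with `&` already produces a masked compare, while `|` currently round-trips through general registers instead of using korw.

```c
#include <assert.h>

/* 64-byte integer vectors, matching the zmm operands in the asm above.
   Each lane of a comparison result is -1 (true) or 0 (false).  */
typedef int v16si __attribute__ ((vector_size (64)));

v16si cmp_and (v16si a, v16si b, v16si c, v16si d)
{
  return (a < b) & (c < d);   /* combines via a masked vpcmpd */
}

v16si cmp_or (v16si a, v16si b, v16si c, v16si d)
{
  return (a < b) | (c < d);   /* should combine via korw */
}
```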

[Bug tree-optimization/96481] New: SLP fails to vectorize VEC_COND_EXPR pattern.

2020-08-05 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96481

Bug ID: 96481
   Summary: SLP fails to vectorize VEC_COND_EXPR pattern.
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---

testcase not vectorized:
-
#include 

inline unsigned opt(unsigned a, unsigned b, unsigned c, unsigned d) {
return a > b ? c : d;
}

void opt( unsigned * __restrict dst, const unsigned *pa, const unsigned *pb,
const unsigned *pc, const unsigned  *pd )
{

 *dst++ = opt(*pa++, *pb++, *pc++, *pd++);
 *dst++ = opt(*pa++, *pb++, *pc++, *pd++);
 *dst++ = opt(*pa++, *pb++, *pc++, *pd++);
 *dst++ = opt(*pa++, *pb++, *pc++, *pd++);
}



testcase successfully vectorized:


inline unsigned opt(unsigned a, unsigned b, unsigned c, unsigned d) {
return a > b ? c : d;
}

void opt( unsigned * __restrict dst, const unsigned *pa, const unsigned *pb,
const unsigned *pc, const unsigned  *pd )
{
for (int i = 0; i != 4; i++)
 *dst++ = opt(*pa++, *pb++, *pc++, *pd++);
}


LLVM can handle both cases;
refer to https://godbolt.org/z/jYoPxT

[Bug target/96476] [Request] expose preferred vector width to preprocessor

2020-08-05 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96476

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #1 from Hongtao.liu  ---
TARGET_CPU_CPP_BUILTINS?
It says:
---
This function-like macro expands to a block of code that defines built-in
preprocessor macros and assertions for the target CPU, using the functions
builtin_define, builtin_define_std and builtin_assert.  When the front end
calls this macro it provides a trailing semicolon, and since it has finished
command line option processing your code can use those results freely.
---
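As a self-contained sketch of the mechanism described above: in GCC, builtin_define registers a predefined macro with the preprocessor; the stub below only records the string, and the __PREFERRED_VECTOR_WIDTH__ macro name is hypothetical (the bug only requests that *some* such macro exist).

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Stub standing in for GCC's builtin_define; in GCC it registers a
   predefined macro with the preprocessor.  */
static char last_def[64];
static void builtin_define (const char *s)
{
  snprintf (last_def, sizeof last_def, "%s", s);
}

/* Hypothetical TARGET_CPU_CPP_BUILTINS body exposing the preferred
   vector width, as this report requests.  */
static void target_cpu_cpp_builtins (int prefer_width)
{
  char buf[64];
  snprintf (buf, sizeof buf, "__PREFERRED_VECTOR_WIDTH__=%d", prefer_width);
  builtin_define (buf);
}
```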


[Bug target/96350] New: [cet] For ENDBR immediate, the binary would include a gadget that starts with a fake ENDBR64 opcode.

2020-07-27 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96350

Bug ID: 96350
   Summary: [cet] For ENDBR immediate, the binary would include a
gadget that starts with a fake ENDBR64 opcode.
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

ENDBR32 and ENDBR64 have specific opcodes:
-   ENDBR32: F3 0F 1E FB
-   ENDBR64: F3 0F 1E FA

And we want to ensure that attackers can't find unintended ENDBR32/64 opcode
matches in the binary.

Here’s an example:

If the compiler had to generate asm for the following code:
a = 0xF30F1EFA

it could, for example, generate:
mov dword ptr [a], 0xF30F1EFA

In such a case, the binary would include a gadget that starts with a fake
ENDBR64 opcode.

Therefore, the requirement for compilers is to split such generation into
multiple operations, so that the explicit immediate never appears in the binary.
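One way to satisfy this requirement, sketched here as C-level arithmetic (the real fix happens during codegen; this only demonstrates that a split form reconstructs the same value without the opcode bytes appearing literally):

```c
#include <assert.h>
#include <stdint.h>

/* Instead of materializing the immediate 0xF30F1EFA directly (whose bytes
   spell ENDBR64), emit a perturbed constant plus a fix-up, so the exact
   F3 0F 1E FA byte sequence never appears in the code stream.  */
uint32_t load_endbr_like_imm (void)
{
  volatile uint32_t t = 0xF30F1EFAu - 1u;   /* 0xF30F1EF9: harmless bytes */
  return t + 1u;
}
```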

[Bug target/96262] [11 Regression] ICE: in decompose, at rtl.h:2280 with -O -mavx512bw since r11-1411-gc7199fb6e694d1a0

2020-07-23 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96262

--- Comment #3 from Hongtao.liu  ---
a patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550427.html

[Bug target/96271] Failure to optimize memcmp of doubles to avoid going through memory

2020-07-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96271

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #3 from Hongtao.liu  ---
In testcase.c.267r.dse1, pass_dse1 fails to delete the store:

---
trying to replace DImode load in insn 7 from DFmode store in insn 2
-- could not extract bits of stored value

trying to replace DImode load in insn 8 from DFmode store in insn 3
-- could not extract bits of stored value

...


(insn 2 5 3 2 (set (mem/c:DF (plus:DI (reg/f:DI 19 frame)
(const_int -8 [0xfff8])) [1 a+0 S8 A64])
(reg:DF 20 xmm0 [ a ])) "pr96271_double.c":4:1 135 {*movdf_internal}
 (expr_list:REG_DEAD (reg:DF 20 xmm0 [ a ])

...


(insn 7 4 8 2 (set (reg:DI 87 [ MEM  [(char * {ref-all})] ])
(mem/c:DI (plus:DI (reg/f:DI 19 frame)
(const_int -8 [0xfff8])) [0 MEM 
[(char * {ref-all})]+0 S8 A64])) "pr96271_double.c":5:37 74 {*movdi_internal}
 (nil))

...

---

Shouldn't a DImode load behave the same as a DFmode load here?
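For context, the testcase shape in this PR can be reconstructed roughly as follows (hedged; the exact source is attached to the bug):

```c
#include <assert.h>
#include <string.h>

/* memcmp of two doubles: GCC spills both arguments to the stack and
   reloads them as DImode integers; DSE then fails to forward the DFmode
   stores to the DImode loads ("could not extract bits of stored value").  */
int doubles_bitwise_equal (double a, double b)
{
  return memcmp (&a, &b, sizeof (double)) == 0;
}
```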

[Bug target/96273] ice in extract_insn, at recog.c:2294, unrecognizable insn:

2020-07-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96273

Hongtao.liu  changed:

   What|Removed |Added

 CC||ubizjak at gmail dot com

--- Comment #2 from Hongtao.liu  ---
Caused by patch in PR95750?

[Bug target/96262] [11 Regression] ICE: in decompose, at rtl.h:2280 with -O -mavx512bw

2020-07-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96262

--- Comment #2 from Hongtao.liu  ---
2268  inline wi::storage_ref
2269  wi::int_traits <rtx_mode_t>::decompose (HOST_WIDE_INT *,
2270                                          unsigned int precision,
2271                                          const rtx_mode_t &x)
2272  {
2273    gcc_checking_assert (precision == get_precision (x));
2274    switch (GET_CODE (x.first))
2275      {
2276      case CONST_INT:
2277        if (precision < HOST_BITS_PER_WIDE_INT)
2278          /* Nonzero BImodes are stored as STORE_FLAG_VALUE, which on many
2279             targets is 1 rather than -1.  */
2280=>        gcc_checking_assert (INTVAL (x.first)
2281                               == sext_hwi (INTVAL (x.first), precision)
2282                               || (x.second == BImode && INTVAL (x.first) ==


(gdb) p debug_rtx (x.first)
(const_int 254 [0xfe])
(gdb) p INTVAL(x.first)
$48 = 254
(gdb) p sext_hwi (INTVAL (x.first), precision)
$49 = -2
(gdb) p precision
$50 = 8

For E_QImode, const_int 254 is equivalent to -2.  Should this condition be
relaxed?

Or it could be fixed by:
---
-  unsigned int and_constant, xor_constant;
+  char and_constant, xor_constant;  /* used by GEN_INT, will be promoted
                                        to HOST_WIDE_INT.  */
---
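The gdb session above can be mirrored with a small model of sext_hwi's semantics (my own sketch, not GCC's implementation; it assumes an arithmetic right shift for signed types, as gcc provides):

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low PREC bits of X, the way wide-int canonicalizes
   CONST_INTs: for precision 8, 254 (0xfe) becomes -2.  */
static int64_t sext_hwi_model (int64_t x, unsigned int prec)
{
  unsigned int shift = 64 - prec;
  /* Shift left as unsigned to avoid overflow, then arithmetic-shift
     right to replicate the sign bit.  */
  return (int64_t) ((uint64_t) x << shift) >> shift;
}
```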

[Bug target/96262] [11 Regression] ICE: in decompose, at rtl.h:2280 with -O -mavx512bw

2020-07-21 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96262

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #1 from Hongtao.liu  ---
It's introduced by my patch.

[Bug target/96246] [AVX512] inefficient code generation for vpblendm*

2020-07-20 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96246

--- Comment #2 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> With -mavx2 it works:
> 
> vpcmpgtd%ymm1, %ymm0, %ymm0
> vpblendvb   %ymm0, %ymm2, %ymm3, %ymm0
> 
> not sure how _load comes into play - we expand from
<avx512>_load<mode>_mask has the same rtx pattern as <avx512>_blendm<mode>; the
only difference is the constraint (<avx512>_load<mode>_mask has '0C' for the
second operand's constraint).

---
 1057 (define_insn "<avx512>_load<mode>_mask"
 1058   [(set (match_operand:V48_AVX512VL 0 "register_operand" "=v,v")
 1059     (vec_merge:V48_AVX512VL
 1060       (match_operand:V48_AVX512VL 1 "nonimmediate_operand" "v,m")
 1061       (match_operand:V48_AVX512VL 2 "nonimm_or_0_operand" "0C,0C")
 1062       (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk,Yk")))]

...


 1159 (define_insn "<avx512>_blendm<mode>"
 1160   [(set (match_operand:V48_AVX512VL 0 "register_operand" "=v")
 1161     (vec_merge:V48_AVX512VL
 1162       (match_operand:V48_AVX512VL 2 "nonimmediate_operand" "vm")
 1163       (match_operand:V48_AVX512VL 1 "register_operand" "v")
 1164       (match_operand:<avx512fmaskmode> 3 "register_operand" "Yk")))]

---
Because <avx512>_load<mode>_mask appears earlier (line 1057) than
<avx512>_blendm<mode> (line 1159) in the md file, after expand the pattern is
always recognized as <avx512>_load<mode>_mask, and pass_reload will only match
the '0' constraint, missing the 'v' constraint.
> 
> 
>[local count: 1073741824]:
>   _6 = .VCOND (a_2(D), b_3(D), c_4(D), d_5(D), 109);
>   return _6;

[Bug tree-optimization/96244] Redundant mask load generated

2020-07-20 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96244

--- Comment #2 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> so range-info is one index too pessimistic here.  So IMHO it's not about
> "redundant" masked loads, it's about the fact that we end up with loads
> at all here.  If c and d would not be register arguments we would have to
> perform loads and if they might trap we could not elide the masked load.

Compared to a masked load, a plain load seems more likely to be eliminated by
the backend in this situation.

[Bug target/96246] New: [AVX512] inefficient code generation for vpblendm*

2020-07-20 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96246

Bug ID: 96246
   Summary: [AVX512] inefficient code generation for vpblendm*
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

cat test.c

---
typedef int v8si __attribute__ ((__vector_size__ (32)));
v8si
foo (v8si a, v8si b, v8si c, v8si d)
{
return a > b ? c : d;
}
---

gcc11 -O2 -mavx512f -mavx512vl

gcc generate
---
vpcmpd  $6, %ymm1, %ymm0, %k1
vmovdqa32   %ymm2, %ymm3{%k1}
vmovdqa %ymm3, %ymm0 
ret
---

could be optimized to

---
vpcmpd  $6, %ymm1, %ymm0, %k1
vpblendmd   %ymm2, %ymm3, %ymm0 {%k1}
---

gcc fails to generate optimal code because, in sse.md,

(define_insn "<avx512>_load<mode>_mask" has the same pattern as
(define_insn "<avx512>_blendm<mode>" and appears earlier in the file, so the
rtx pattern is always recognized as <avx512>_load<mode>_mask, which misses the
opportunity in pass_reload and can't be combined into <avx512>_blendm<mode>
after reload.

[Bug target/96243] For vector compare to mask register, UNSPEC is needed instead of comparison operator

2020-07-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96243

--- Comment #1 from Hongtao.liu  ---
cut from cse.c
---
3342      case RTX_COMPARE:
3343      case RTX_COMM_COMPARE:
3344        /* See what items are actually being compared and set FOLDED_ARG[01]
3345           to those values and CODE to the actual comparison code.  If any are
3346           constant, set CONST_ARG0 and CONST_ARG1 appropriately.  We needn't
3347           do anything if both operands are already known to be constant.  */
3348
3349        /* ??? Vector mode comparisons are not supported yet.  */
3350        if (VECTOR_MODE_P (mode))
3351          break;
---

[Bug tree-optimization/96244] New: Redundant mask load generated

2020-07-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96244

Bug ID: 96244
   Summary: Redundant mask load generated
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---

cat test.c

---
typedef int v8si __attribute__ ((__vector_size__ (32)));
v8si
foo (v8si a, v8si b, v8si c, v8si d)
{
  v8si e;
for (int i = 0; i != 8; i++)
 e[i] = a[i] > b[i] ? c[i] : d[i];
return e;
}
---

gcc -Ofast -mavx2 test.c

cat test.c.238t.optimized
---
foo (v8si a, v8si b, v8si c, v8si d)
{
  vector(8) int vect_iftmp.19;
  vector(8) int vect_iftmp.18;
  vector(8)  mask__31.15;
  vector(8) int vect_iftmp.14;
  vector(8)  mask__28.11;

   [local count: 119292720]:
  mask__28.11_40 = b_50(D) < a_53(D);
  vect_iftmp.14_43 = .MASK_LOAD (, 32B, mask__28.11_40); ---> redundant
  mask__31.15_44 = b_50(D) >= a_53(D);
  vect_iftmp.18_47 = .MASK_LOAD (, 32B, mask__31.15_44); ---> redundant
  vect_iftmp.19_49 = .VCOND (b_50(D), a_53(D), vect_iftmp.18_47,
vect_iftmp.14_43, 110);
  return vect_iftmp.19_49;
---

could be optimized to 
---
vect_iftmp.19_49 = .VCOND (b_50(D), a_53(D), d, c);
---

[Bug target/96243] New: For vector compare to mask register, UNSPEC is needed instead of comparison operator

2020-07-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96243

Bug ID: 96243
   Summary: For vector compare to mask register, UNSPEC is needed
instead of comparison operator
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

When trying to relax

 (define_expand "<avx512>_eq<mode>3<mask_scalar_merge_name>"
   [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
-       (unspec:<avx512fmaskmode>
-         [(match_operand:VI48_AVX512VL 1 "nonimmediate_operand")
-          (match_operand:VI48_AVX512VL 2 "nonimmediate_operand")]
-         UNSPEC_MASKED_EQ))]
+       (eq:<avx512fmaskmode>
+         (match_operand:VI48_AVX512VL 1 "nonimmediate_operand")
+         (match_operand:VI48_AVX512VL 2 "nonimmediate_operand")))]
   "TARGET_AVX512F"
   "ix86_fixup_binary_operands_no_copy (EQ, <MODE>mode, operands);")

I got a runtime failure from gcc.target/i386/avx512vl-vpcmpeqq-2.c.  That's
because cse takes (eq:QI (reg:V4DI 90) (reg:V4DI 91)) as a boolean value and
does some optimization, which is not correct for a vector compare; other
places like combine hold the same assumption.

The pattern like 

(define_insn "*<avx512>_cmp<mode>3"
  [(set (match_operand:<avx512fmaskmode> 0 "register_operand" "=k")
    (match_operator:<avx512fmaskmode> 3 "ix86_comparison_int_operator"
      [(match_operand:VI48_AVX512VL 1 "register_operand" "v")
       (match_operand:VI48_AVX512VL 2 "nonimmediate_operand" "vm")]))]
  "TARGET_AVX512F"
  "vpcmp<ssemodesuffix>\t{%I3, %2, %1, %0|%0, %1, %2, %I3}"
  [(set_attr "type" "ssecmp")
   (set_attr "length_immediate" "1")
   (set_attr "prefix" "evex")
   (set_attr "mode" "<sseinsnmode>")])


Need to be fixed.
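The scalar-vs-mask semantics that cse gets wrong can be illustrated with a tiny model (illustrative code, not from the PR): a vector compare into a k register yields one bit per lane, not a single boolean, so folding (eq v1 v2) like a scalar equality is invalid.

```c
#include <assert.h>

/* Model of vpcmpeqq's k-register output for four 64-bit lanes: one bit
   per lane, packed into the low bits of a scalar mask.  Treating this
   as a plain 0/1 boolean loses the per-lane information.  */
static unsigned lane_mask_eq4 (const long long *a, const long long *b)
{
  unsigned m = 0;
  for (int i = 0; i < 4; i++)
    m |= (unsigned) (a[i] == b[i]) << i;
  return m;
}
```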

[Bug target/96201] x86 movsd/movsq string instructions and alignment inference

2020-07-14 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96201

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #1 from Hongtao.liu  ---
The issue is caused by pass_ivopts: ivopts selects only one IV for f3 (dn),
which seems suboptimal, and two IVs for f4 (sn, dn), which seems optimal.
---
loop in f3:

Selected IV set for loop 1 at pr96201.c:25, 10 avg niters, 1 IVs:
Candidate 8:
  Var befor: dn_24
  Var after: dn_18
  Incr POS: orig biv
  IV struct:
Type:   int *
Base:   (int *) _3
Step:   4
Biv:N
Overflowness wrto loop niter:   Overflow

loop in f4:

Selected IV set for loop 1 at pr96201.c:34, 10 avg niters, 2 IVs:
Candidate 6:
  Var befor: sn_26
  Var after: sn_20
  Incr POS: orig biv
  IV struct:
Type:   int *
Base:   sn_14
Step:   4
Object: (void *) sn_14
Biv:N
Overflowness wrto loop niter:   Overflow
Candidate 8:
  Var befor: dn_27
  Var after: dn_21
  Incr POS: orig biv
  IV struct:
Type:   int *
Base:   dn_16
Step:   4
Object: (void *) dn_16
Biv:N
Overflowness wrto loop niter:   Overflow

---

It then generates more instructions for f3, which pass_combine fails to
combine.

---
loop in f3:

Trying 19 -> 22:
   19: r83:DI=r92:DI
   22: [r83:DI]=r89:SI
  REG_DEAD r89:SI
  REG_DEAD r83:DI
Can't combine i2 into i3

Trying 21 -> 22:
   21: r89:SI=[r93:DI]
  REG_DEAD r93:DI
   22: [r83:DI]=r89:SI
  REG_DEAD r89:SI
  REG_DEAD r83:DI
Failed to match this instruction:
(set (mem:SI (reg/v/f:DI 83 [ dn ]) [1 *dn_2+0 S4 A32])
(mem:SI (reg/f:DI 93 [ _20 ]) [1 *_20+0 S4 A32]))

Trying 18, 21 -> 22:
   18: {r93:DI=r92:DI+r102:DI;clobber flags:CC;}
  REG_UNUSED flags:CC
   21: r89:SI=[r93:DI]
  REG_DEAD r93:DI
   22: [r83:DI]=r89:SI
  REG_DEAD r89:SI
  REG_DEAD r83:DI
Can't combine i1 into i3

Trying 21, 19 -> 22:
   21: r89:SI=[r93:DI]
  REG_DEAD r93:DI
   19: r83:DI=r92:DI
   22: [r83:DI]=r89:SI
  REG_DEAD r89:SI
  REG_DEAD r83:DI
Can't combine i1 into i3

(insn 18 16 19 4 (parallel [
(set (reg/f:DI 93 [ _20 ])
(plus:DI (reg/v/f:DI 92 [ dn ])
(reg:DI 102)))
(clobber (reg:CC 17 flags))
]) 210 {*adddi_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 19 18 20 4 (set (reg/v/f:DI 83 [ dn ])
(reg/v/f:DI 92 [ dn ])) 74 {*movdi_internal}
 (nil))
(insn 20 19 21 4 (parallel [
(set (reg/v/f:DI 92 [ dn ])
(plus:DI (reg/v/f:DI 92 [ dn ])
(const_int 4 [0x4])))
(clobber (reg:CC 17 flags))
]) "pr96201.c":25:24 210 {*adddi_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 21 20 22 4 (set (reg:SI 89 [ _9 ])
(mem:SI (reg/f:DI 93 [ _20 ]) [1 *_20+0 S4 A32])) "pr96201.c":25:29 75
{*movsi_internal}
 (expr_list:REG_DEAD (reg/f:DI 93 [ _20 ])
(nil)))
(insn 22 21 24 4 (set (mem:SI (reg/v/f:DI 83 [ dn ]) [1 *dn_2+0 S4 A32])
(reg:SI 89 [ _9 ])) "pr96201.c":25:27 75 {*movsi_internal}
 (expr_list:REG_DEAD (reg:SI 89 [ _9 ])
(expr_list:REG_DEAD (reg/v/f:DI 83 [ dn ])
(nil



loop in f4:

Trying 16, 18, 17 -> 19:
   16: {r89:DI=r89:DI+0x4;clobber flags:CC;}
  REG_UNUSED flags:CC
   18: r88:SI=[r89:DI-0x4]
   17: {r90:DI=r90:DI+0x4;clobber flags:CC;}
  REG_UNUSED flags:CC
   19: [r90:DI-0x4]=r88:SI
  REG_DEAD r88:SI
Successfully matched this instruction:
(parallel [
(set (mem:SI (reg/v/f:DI 90 [ dn ]) [1 MEM[base: dn_21, offset: -4B]+0
S4 A32])
(mem:SI (reg/v/f:DI 89 [ sn ]) [1 MEM[base: sn_20, offset: -4B]+0
S4 A32]))
(set (reg/v/f:DI 90 [ dn ])
(plus:DI (reg/v/f:DI 90 [ dn ])
(const_int 4 [0x4])))
(set (reg/v/f:DI 89 [ sn ])
(plus:DI (reg/v/f:DI 89 [ sn ])
(const_int 4 [0x4])))
])

(insn 16 15 17 3 (parallel [
(set (reg/v/f:DI 89 [ sn ])
(plus:DI (reg/v/f:DI 89 [ sn ])
(const_int 4 [0x4])))
(clobber (reg:CC 17 flags))
]) "pr96201.c":34:32 210 {*adddi_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 17 16 18 3 (parallel [
(set (reg/v/f:DI 90 [ dn ])
(plus:DI (reg/v/f:DI 90 [ dn ])
(const_int 4 [0x4])))
(clobber (reg:CC 17 flags))
]) "pr96201.c":34:24 210 {*adddi_1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 18 17 19 3 (set (reg:SI 88 [ _9 ])
(mem:SI (plus:DI (reg/v/f:DI 89 [ sn ])
(const_int -4 [0xfffc])) [1 MEM[base: sn_20,
offset: 
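The two loop shapes behind these dumps can be approximated at the source level (illustrative sketch only; the PR's actual f3/f4 testcase is attached to the bug):

```c
#include <assert.h>

/* f3-style: a single IV, with the second access stream addressed as
   IV + offset, which leaves separate add/load/store insns that combine
   can't merge.  */
void copy_one_iv (int *dn, long off_bytes, int n)
{
  for (int i = 0; i < n; i++, dn++)
    *dn = *(int *) ((char *) dn + off_bytes);
}

/* f4-style: two IVs, each auto-incremented, which combine merges into
   the movsd-style load+store+increment parallel shown above.  */
void copy_two_iv (int *dn, const int *sn, int n)
{
  for (int i = 0; i < n; i++)
    *dn++ = *sn++;
}
```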

[Bug target/96186] [11 regression] ICE: Unrecognizable insn since r11-1970-fab263ab0fc10ea08409b80afa7e8569438b8d28

2020-07-14 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96186

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #2 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> There's a duplicate bug IIRC.

It's caused by the same commit, but this issue still exists after the fix for
PR96144.

Should be fixed by
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549960.html

[Bug target/87767] Missing AVX512 memory broadcast for constant vector

2020-07-13 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87767

--- Comment #7 from Hongtao.liu  ---
a patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549713.html

[Bug target/95766] Failure to directly use vpbroadcastd for _mm_set1_epi32 when passing unsigned short

2020-07-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95766

--- Comment #4 from Hongtao.liu  ---
Simple case:

cat test.c:
int f(unsigned short a)
{
return a * 101;
}

gcc:
f(unsigned short):
  movzwl %di, %eax
  imull $101, %eax, %eax
  ret

llvm:
f(unsigned short): # @f(unsigned short)
  imull $101, %edi, %eax
  retq

GCC always does the conversion.
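Whether the movzwl is removable hinges on the upper bits of %edi; a quick model of the distinction (illustrative, not either compiler's code):

```c
#include <assert.h>
#include <stdint.h>

/* With the zero-extension (gcc's codegen): the result depends only on
   the low 16 bits of the register.  */
static uint32_t mul101_ext (uint32_t reg)
{
  return (uint32_t) (uint16_t) reg * 101u;
}

/* Without it (llvm's codegen): garbage in bits 16..31 of the register
   leaks into the product, so dropping the movzwl is only safe if the
   caller is assumed to have zero-extended the argument already.  */
static uint32_t mul101_noext (uint32_t reg)
{
  return reg * 101u;
}
```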

[Bug target/95766] Failure to directly use vpbroadcastd for _mm_set1_epi32 when passing unsigned short

2020-07-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95766

--- Comment #1 from Hongtao.liu  ---
Shouldn't **a** be extended to int first?

[Bug target/95524] Suboptimal codegen for shift by constant for v16qi/v32qi under -march=skylake

2020-07-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95524

Hongtao.liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao.liu  ---
Fixed in GCC11

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-07-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

Hongtao.liu  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Hongtao.liu  ---
Fixed in GCC11

[Bug target/95740] Failure to avoid using the stack when interpreting a float as an integer when it is modified afterwards

2020-06-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95740

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #2 from Hongtao.liu  ---
Increasing constraint preference and reducing the SSE->integer move cost
doesn't help.


modified   gcc/config/i386/i386.md  
@@ -2294,9 +2294,9 @@   

 (define_insn "*movsi_internal" 
   [(set (match_operand:SI 0 "nonimmediate_operand" 
-"=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")  
+"=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,r ,m,?*v,*k,*k ,*rm,*k")   
 (match_operand:SI 1 "general_operand"  
-"g ,re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k ,CBC"))]   
+"g ,re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,v,*v,r  ,*r,*km,*k ,CBC"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))"  
 {  
   switch (get_attr_type (insn))
modified   gcc/config/i386/x86-tune-costs.h 
@@ -1624,7 +1624,7 @@ struct processor_costs skylake_cost = {   
in 32,64,128,256 and 512-bit */ 
   {8, 8, 8, 12, 24},   /* cost of storing SSE registers
in 32,64,128,256 and 512-bit */ 
-  6, 6,   /* SSE->integer and integer->SSE moves */
+  2, 2,   /* SSE->integer and integer->SSE moves */

--

It seems to me that for the reloaded insn

   18: r89:SI=r87:SF#0

(insn 18 16 6 2 (set (reg:SI 89)
         (subreg:SI (reg:SF 87) 0))

LRA prefers to put reg:SI 89 into memory since it will be used later.
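The testcase shape (a hedged reconstruction from the insn dump; the exact source is in the PR):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer and then modify the result.
   This is the (subreg:SI (reg:SF 87) 0) in the dump; the complaint is
   that it currently goes through the stack instead of a direct
   SSE->integer move.  */
uint32_t float_bits_plus_one (float x)
{
  uint32_t u;
  memcpy (&u, &x, sizeof u);   /* bit-cast, no conversion */
  return u + 1;                /* "modified afterwards" */
}
```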

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-06-16 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

--- Comment #9 from Hongtao.liu  ---
(In reply to H.J. Lu from comment #8)
>  -march=skylake-avx512 gave:
> 
> [hjl@gnu-cfl-2 gcc]$
> /export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc
> -B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/
> /export/gnu/import/git/sources/gcc/gcc/testsuite/gcc.target/i386/avx512bw-
> pr95488-1.c  -march=skylake-avx512   -fno-diagnostics-show-caret
> -fno-diagnostics-show-line-numbers -fdiagnostics-color=never 
> -fdiagnostics-urls=never  -O2  -ffat-lto-objects -fno-ident -S -o
> avx512bw-pr95488-1.s
> [hjl@gnu-cfl-2 gcc]$ cat avx512bw-pr95488-1.s
>   .file   "avx512bw-pr95488-1.c"
>   .text
>   .p2align 4
>   .globl  mul_512
>   .type   mul_512, @function
> mul_512:
> .LFB0:
>   .cfi_startproc
>   vpunpcklbw  %ymm0, %ymm0, %ymm3
>   vpunpcklbw  %ymm1, %ymm1, %ymm2
>   vpunpckhbw  %ymm0, %ymm0, %ymm0
>   vpunpckhbw  %ymm1, %ymm1, %ymm1
>   vpmullw %ymm3, %ymm2, %ymm2
>   vpmullw %ymm0, %ymm1, %ymm1
>   vpshufb .LC0(%rip), %ymm2, %ymm0
>   vpshufb .LC1(%rip), %ymm1, %ymm1
>   vpor%ymm1, %ymm0, %ymm0
>   ret
>   .cfi_endproc
> .LFE0:
>   .size   mul_512, .-mul_512
>   .p2align 4
>   .globl  umul_512
>   .type   umul_512, @function
> umul_512:
> .LFB1:
>   .cfi_startproc
>   vpunpcklbw  %ymm0, %ymm0, %ymm3
>   vpunpcklbw  %ymm1, %ymm1, %ymm2
>   vpunpckhbw  %ymm0, %ymm0, %ymm0
>   vpunpckhbw  %ymm1, %ymm1, %ymm1
>   vpmullw %ymm3, %ymm2, %ymm2
>   vpmullw %ymm0, %ymm1, %ymm1
>   vpshufb .LC0(%rip), %ymm2, %ymm0
>   vpshufb .LC1(%rip), %ymm1, %ymm1
>   vpor%ymm1, %ymm0, %ymm0
>   ret
>   .cfi_endproc
> .LFE1:
>   .size   umul_512, .-umul_512

It's on purpose; maybe I'll add -mprefer-vector-width=512 to the testcase.

19498  /* Not generate zmm instruction when prefer 128/256 bit vector width.  */
19499  if (qimode == V32QImode
19500      && (TARGET_PREFER_AVX128 || TARGET_PREFER_AVX256))
19501    return false;


[Bug target/95524] Suboptimal codegen for shift by constant for v16qi/v32qi under -march=skylake

2020-06-15 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95524

--- Comment #3 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #0)

> icc has
> ---
> ashift(char __vector(16)):
> vpsllw    xmm1, xmm0, 5                                      #9.16
> vpand     xmm0, xmm1, XMMWORD PTR .L_2il0floatpacket.0[rip]  #9.16
> ret                                                          #9.16
> ashift2(char __vector(32), char __vector(32)):
> vpsllw    ymm2, ymm0, 5                                      #15.16
> vpand     ymm0, ymm2, YMMWORD PTR .L_2il0floatpacket.1[rip]  #15.16
> ret                                                          #15.16
> ashiftrt(char __vector(16)):
> vpsrlw    xmm1, xmm0, 5                                      #21.16
> vpand     xmm0, xmm1, XMMWORD PTR .L_2il0floatpacket.2[rip]  #21.16
> ret                                                          #21.16
> arshiftrt2(char __vector(32)):
> vpsrlw    ymm1, ymm0, 5                                      #27.16
> vpand     ymm0, ymm1, YMMWORD PTR .L_2il0floatpacket.3[rip]  #27.16
> ret                                                          #27.16
> .long
> 

ICC seems to generate incorrect instructions for ashiftrt, but clang gets it
right and is still better than gcc; refer to https://godbolt.org/z/ttV5xY
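The clang-style arithmetic right shift of bytes via word-granularity shifts can be modeled per byte; this is a sketch of the standard logical-shift-then-sign-fix trick (my own formulation, not lifted from either compiler):

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic >> 5 on a signed byte using only a logical shift and mask
   (what psrlw + pand provide per byte), then an xor/subtract against the
   shifted sign-bit position to restore the bits the logical shift zeroed.  */
static int8_t ashr5_byte (int8_t x)
{
  uint8_t t = (uint8_t) (((uint8_t) x >> 5) & 0x07);  /* logical shift */
  const uint8_t bias = 0x04;  /* sign bit position of the 3-bit result */
  return (int8_t) (uint8_t) ((t ^ bias) - bias);
}
```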

[Bug target/95524] Suboptimal codegen for shift by constant for v16qi/v32qi under -march=skylake

2020-06-15 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95524

--- Comment #2 from Hongtao.liu  ---
Microbenchmark improvements on Skylake client:
---
benchmark    improvement
ashift
  v16qi      13%
  v32qi      5%
  v64qi      7%

ashiftrt
  v16qi      5%
  v32qi      7%
  v64qi      6%

lshiftrt
  v16qi      16%
  v32qi      13%
  v64qi      6%
---

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-06-14 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

Hongtao.liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #7 from Hongtao.liu  ---
Fixed in GCC11.

[Bug target/95524] Suboptimal codegen for shift by constant for v16qi/v32qi under -march=skylake

2020-06-11 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95524

--- Comment #1 from Hongtao.liu  ---
Microbenchmark shows:

interleave_ashiftrt : 69023847
magic_ashiftrt :  62488066

Seems like a 10% improvement.

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-06-11 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

--- Comment #5 from Hongtao.liu  ---
Microbenchmark

cat test.c

#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

typedef char  v16qi  __attribute__ ((vector_size (16)));
extern v16qi interleave_mul (v16qi, v16qi);
extern v16qi extend_mul (v16qi, v16qi);

#define LOOP 3000


int
main ()
{
  int i;
  unsigned long long start, end;
  unsigned long long diff;
  unsigned int aux;
  v16qi *p0;
  v16qi *p1;
  v16qi x, y;

  p0 = (v16qi *) malloc (LOOP *  sizeof (*p0));
  p1 = (v16qi *) malloc (LOOP *  sizeof (*p1));
  for (i = 0; i < LOOP; i++)
for (int j = 0; j != 16; j++)
{
  p0[i][j] = 1 + i + j;
  p1[i][j] = 1 + i * i + j * j;
}

#if 1
  start = __rdtscp (&aux);
  for (i = 0; i < LOOP; i+=16)
y = interleave_mul (p0[i], p1[i]);
  end = __rdtscp (&aux);
  diff = end - start;

  printf ("interleave_mul : %lld\n", diff);

#endif

#if 1
  start = __rdtscp (&aux);
  for (i = 0; i < LOOP; i+=16)
x = extend_mul (p0[i], p1[i]);
  end = __rdtscp (&aux);
  diff = end - start;

  printf ("extend_mul :%lld\n", diff);
#endif

  free (p0);
  free (p1);

  return 0;
}
---
shows a bit of improvement:

interleave_mul : 10418
extend_mul :103922083

[Bug target/95400] -march=native and -march=icelake-client produce different results on icelake client

2020-06-04 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95400

--- Comment #5 from Hongtao.liu  ---
(In reply to Martin Liška from comment #4)
> Can we backport the change to active branches?

Backported to GCC9 and GCC10.
Partially backported to GCC8 (dropping the tremont and tigerlake parts).

[Bug target/95524] New: Suboptimal codegen for shift by constant for v16qi/v32qi under -march=skylake

2020-06-04 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95524

Bug ID: 95524
   Summary: Suboptimal codegen for shift by constant for v16qi/v32qi
under -march=skylake
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: x86_64-*-* i?86-*-*

cat test.c
---
typedef char v16qi __attribute__ ((vector_size (16)));
typedef char v32qi __attribute__ ((vector_size (32)));
typedef unsigned char v16uqi __attribute__ ((vector_size (16)));
typedef unsigned char v32uqi __attribute__ ((vector_size (32)));

v16qi
ashift (v16qi a)
{
return  a<<5;
}

v32qi
ashift2 (v32qi a, v32qi b)
{
return  a<<5;
}

v16qi
ashiftrt (v16qi a)
{
return  a>>5;
}

v32qi
arshiftrt2 (v32qi a)
{
return  a>>5;
}

v16uqi
lshiftrt (v16uqi a)
{
return  a>>5;
}

v32uqi
lshiftrt2 (v32uqi a)
{
return  a>>5;
}
---

gcc11 -O2 -march=skylake

---
ashift(char __vector(16)):
vpaddb  xmm0, xmm0, xmm0
vpaddb  xmm0, xmm0, xmm0
vpaddb  xmm0, xmm0, xmm0
vpaddb  xmm0, xmm0, xmm0
vpaddb  xmm0, xmm0, xmm0
ret
ashift2(char __vector(32), char __vector(32)):
vpaddb  ymm0, ymm0, ymm0
vpaddb  ymm0, ymm0, ymm0
vpaddb  ymm0, ymm0, ymm0
vpaddb  ymm0, ymm0, ymm0
vpaddb  ymm0, ymm0, ymm0
ret
ashiftrt(char __vector(16)):
vpmovsxbw   xmm2, xmm0
vpsrldq xmm1, xmm0, 8
vpmovsxbw   xmm1, xmm1
vpsraw  xmm0, xmm2, 5
vmovdqa xmm2, XMMWORD PTR .LC0[rip]
vpsraw  xmm1, xmm1, 5
vpand   xmm0, xmm2, xmm0
vpand   xmm2, xmm2, xmm1
vpackuswb   xmm0, xmm0, xmm2
ret
arshiftrt2(char __vector(32)):
vmovdqa ymm1, ymm0
vextracti128    xmm1, ymm1, 0x1
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vpmovsxbw   ymm0, xmm0
vpmovsxbw   ymm1, xmm1
vpsraw  ymm1, ymm1, 5
vpsraw  ymm0, ymm0, 5
vpand   ymm0, ymm2, ymm0
vpand   ymm2, ymm2, ymm1
vpackuswb   ymm0, ymm0, ymm2
vpermq  ymm0, ymm0, 216
ret
lshiftrt(unsigned char __vector(16)):
vpmovzxbw   xmm2, xmm0
vpsrldq xmm1, xmm0, 8
vpmovzxbw   xmm1, xmm1
vpsrlw  xmm0, xmm2, 5
vmovdqa xmm2, XMMWORD PTR .LC0[rip]
vpsrlw  xmm1, xmm1, 5
vpand   xmm0, xmm2, xmm0
vpand   xmm2, xmm2, xmm1
vpackuswb   xmm0, xmm0, xmm2
ret
lshiftrt2(unsigned char __vector(32)):
vmovdqa ymm1, ymm0
vextracti128    xmm1, ymm1, 0x1
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vpmovzxbw   ymm0, xmm0
vpmovzxbw   ymm1, xmm1
vpsrlw  ymm1, ymm1, 5
vpsrlw  ymm0, ymm0, 5
vpand   ymm0, ymm2, ymm0
vpand   ymm2, ymm2, ymm1
vpackuswb   ymm0, ymm0, ymm2
vpermq  ymm0, ymm0, 216
ret
.LC0:
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.LC1:
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
---

icc has
---
ashift(char __vector(16)):
vpsllwxmm1, xmm0, 5 #9.16
vpand xmm0, xmm1, XMMWORD PTR .L_2il0floatpacket.0[rip] #9.16
ret #9.16
ashift2(char __vector(32), char __vector(32)):
vpsllwymm2, ymm0, 5 #15.16
vpand ymm0, ymm2, YMMWORD PTR .L_2il0floatpacket.1[rip] #15.16
ret #15.16
ashiftrt(char __vector(16)):
vpsrlwxmm1, xmm0, 5 #21.16
vpand xmm0, xmm1, XMMWORD PTR .L_2il0floatpacket.2[rip] #21.16
ret #21.16
arshiftrt2(char __vector(32)):
vpsrlwymm1, ymm0, 5 #27.16
vpand ymm0, ymm1, YMMWORD PTR .L_2il0floatpacket.3[rip] #27.16
ret #27.16
lshiftrt(unsigned char __vector(16)):
vpsrlwxmm1, xmm0, 5 #33.16
vpand xmm0, xmm1, XMMWORD PTR .L_2il0floatpacket.2[rip] #33.16
ret #33.16
lshiftrt2(unsigned char __vector(3
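
The icc sequences above all use the standard trick for byte shifts: shift in 16-bit lanes (vpsllw/vpsrlw, since x86 has no byte-granular vector shift) and mask off the bits that leak across byte boundaries. A scalar sketch of one vpsrlw-plus-vpand step (the helper name is illustrative, not from either compiler):

```c
#include <stdint.h>

/* One 16-bit lane holding two bytes: vpsrlw shifts the whole lane, so the
   high byte's low bits leak into the low byte's top bits; vpand with a
   per-byte mask of (0xFF >> n) clears the leaked bits. */
static uint16_t word_srl_then_mask(uint8_t lo, uint8_t hi, int n) {
    uint16_t lane = (uint16_t)lo | ((uint16_t)hi << 8);
    lane = (uint16_t)(lane >> n);                        /* vpsrlw */
    uint16_t mask = (uint16_t)((0xFFu >> n) * 0x0101u);  /* 0xFF>>n per byte */
    return lane & mask;                                  /* vpand */
}
```

Each result byte then equals the corresponding input byte shifted on its own, which is why one shift plus one mask load can replace the unpack/shift/pack sequence GCC emits.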

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-06-03 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

--- Comment #4 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Richard Biener from comment #2)
> > (In reply to Hongtao.liu from comment #1)
> > > I think it's this TYPE_SIGN (TREE_TYPE (REG_EXPR (op1))).
> > 
> > That's not reliable.  Multiplication shouldn't care about sign?
I think you're right; as long as we only care about the lower 8 bits, sign
doesn't matter.
> 
> We need to extend v16qi to v16hi first, extension does care about sign.

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-06-03 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

--- Comment #3 from Hongtao.liu  ---
(In reply to Richard Biener from comment #2)
> (In reply to Hongtao.liu from comment #1)
> > I think it's this TYPE_SIGN (TREE_TYPE (REG_EXPR (op1))).
> 
> That's not reliable.  Multiplication shouldn't care about sign?

We need to extend v16qi to v16hi first, extension does care about sign.

[Bug target/95488] Suboptimal multiplication codegen for v16qi

2020-06-02 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

--- Comment #1 from Hongtao.liu  ---
I think it's this TYPE_SIGN (TREE_TYPE (REG_EXPR (op1))).

[Bug target/95488] New: Suboptimal multiplication codegen for v16qi

2020-06-02 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95488

Bug ID: 95488
   Summary: Suboptimal multiplication codegen for v16qi
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: x86_64-*-* i?86-*-*

cat test.c

---
typedef unsigned char v16qi __attribute__ ((vector_size (16)));
v16qi
foo (v16qi a, v16qi b)
{
return  a*b;
}
---

gcc -O2 -march=skylake-avx512

---
foo(unsigned char __vector(16), unsigned char __vector(16)):
vpunpcklbw  xmm3, xmm0, xmm0
vpunpcklbw  xmm2, xmm1, xmm1
vpunpckhbw  xmm0, xmm0, xmm0
vpunpckhbw  xmm1, xmm1, xmm1
vpmullw xmm2, xmm2, xmm3
vpmullw xmm1, xmm1, xmm0
vmovdqa xmm3, XMMWORD PTR .LC0[rip]
vpand   xmm0, xmm3, xmm2
vpand   xmm3, xmm3, xmm1
vpackuswb   xmm0, xmm0, xmm3
ret
.LC0:
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
.value  255
---

icc generate
---
foo(unsigned char __vector(16), unsigned char __vector(16)):
vpmovzxbw ymm2, xmm0#5.15
vpmovzxbw ymm3, xmm1#5.15
vpmullw   ymm4, ymm2, ymm3  #5.15
vpmovwb   xmm0, ymm4#5.15
vzeroupper  #5.15
ret #5.15
---

We can do better in ix86_expand_vecop_qihi; the problem is how to get sign info
for an RTX operand.
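
For what it's worth, the sign question only matters for the extension step, not for the product's low byte: zero- and sign-extended multiplies agree modulo 256. A minimal scalar check (the helper name is made up):

```c
#include <stdint.h>

/* The low byte of a widened product is the same whether the inputs were
   zero-extended (vpmovzxbw) or sign-extended (vpmovsxbw). */
static int low8_product_matches(uint8_t a, uint8_t b) {
    uint16_t zext = (uint16_t)((uint16_t)a * (uint16_t)b);
    int16_t  sext = (int16_t)((int8_t)a * (int8_t)b);
    return (uint8_t)zext == (uint8_t)sext;
}
```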

[Bug target/95453] Failure to avoid useless sign extension

2020-06-01 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95453

--- Comment #2 from Hongtao.liu  ---
Duplicate of PR95076?

[Bug target/95211] [11 Regression] ICE in emit_unop_insn, at optabs.c:3622

2020-05-29 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95211

--- Comment #9 from Hongtao.liu  ---
(In reply to Arseny Solokha from comment #8)
> Is there some further work pending, or should this PR be closed now?

Fixed in GCC11.

[Bug target/95256] [11 Regression] ICE in convert_move, at expr.c:278 since r11-263-g7c355156aa20eaec

2020-05-29 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95256

--- Comment #6 from Hongtao.liu  ---
(In reply to Arseny Solokha from comment #5)
> Is there some further work pending, or should this PR be closed now?

It's fixed.

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

--- Comment #20 from Hongtao.liu  ---
(In reply to Mark Wielaard from comment #19)
> (In reply to CVS Commits from comment #18)
> > gcc/testsuite/ChangeLog:
> > * gcc.target/i386/pr92658-avx512f.c: New test.
> > * gcc.target/i386/pr92658-avx512vl.c: Ditto.
> > * gcc.target/i386/pr92658-avx512bw-trunc.c: Ditto.
> 
> Note that the second one as committed has an extra closing brace which
> causes an error:
> 
> ERROR: gcc.target/i386/pr92658-avx512vl.c: unknown dg option: \} for "}"
> 
> diff --git a/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c
> b/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c
> index 50b32f968ac3..dc50084119b5 100644
> --- a/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c
> +++ b/gcc/testsuite/gcc.target/i386/pr92658-avx512vl.c
> @@ -121,7 +121,7 @@ truncdb_128 (v16qi * dst, v4si * __restrict src)
>dst[0] = *(v16qi *) tem;
>  }
>  
> -/* { dg-final { scan-assembler-times "vpmovqd" 2 } } } */
> +/* { dg-final { scan-assembler-times "vpmovqd" 2 } } */
>  /* { dg-final { scan-assembler-times "vpmovqw" 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times "vpmovqb" 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times "vpmovdw" 1 } } */

Oh, sorry for the typo.

[Bug target/95125] Unoptimal code for vectorized conversions

2020-05-22 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

--- Comment #5 from Hongtao.liu  ---
(In reply to Uroš Bizjak from comment #3)
> It turns out that a bunch of patterns have to be renamed (and testcases
> added).
> 
> Easyhack, waiting for someone to show some love to conversion patterns in
> sse.md.

Expanders for floatv4siv4df2 and fix_truncv4dfv4si2 already exist.

If float_double and fix_double are changed to
---
void
float_double (void)
{
d[0] = i[0];
d[1] = i[1];
d[2] = i[2];
d[3] = i[3];
}

void
fix_double (void)
{
i[0] = d[0];
i[1] = d[1];
i[2] = d[2];
i[3] = d[3];
}


it successfully generates

---
float_double():
vcvtdq2pd   i(%rip), %ymm0
vmovapd %ymm0, d(%rip)
vzeroupper
ret
fix_double():
vcvttpd2dqy d(%rip), %xmm0
vmovdqa %xmm0, i(%rip)
ret
-

[Bug target/95125] Unoptimal code for vectorized conversions

2020-05-21 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

--- Comment #4 from Hongtao.liu  ---
(In reply to Uroš Bizjak from comment #3)
> It turns out that a bunch of patterns have to be renamed (and testcases
> added).
> 
> Easyhack, waiting for someone to show some love to conversion patterns in
> sse.md.

I'll take a look.

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-20 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

--- Comment #17 from Hongtao.liu  ---
Created attachment 48570
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48570&action=edit
0001-Add-missing-vector-truncmn2-expanders-PR92658.patch

It seems there is only truncmn2 for truncate, and no expanders for us_truncate
and ss_truncate; am I missing something?

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

--- Comment #16 from Hongtao.liu  ---
(In reply to Uroš Bizjak from comment #15)
> I will leave truncations (Down Converts in Intel speak) which are AVX512F
> instructions to someone else. It should be easy to add missing patterns and
> tests following the example of committed patch.

I'll take a look.

[Bug target/94962] Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

--- Comment #6 from Hongtao.liu  ---
(In reply to Nemo from comment #5)
> (In reply to Jakub Jelinek from comment #2)
> 
> I would be happy if GCC could just emit optimal code (single vcmpeqd
> instruction) for this useful constant:
> 
> _mm256_set_m128i(_mm_setzero_si128(), _mm_set1_epi8(-1))
> 
> aka.
> 
> _mm256_inserti128_si256(_mm256_setzero_si256(), _mm_set1_epi8(-1), 0)
> 
> 
> (The latter is just what GCC uses to implement _mm256_zextsi128_si256, if I
> am reading the headers correctly.)
> 
> It's a minor thing, but I was a little surprised to find that none of the
> compilers I know of are able to do this. At least, not with any input I
> tried.

vmovdqa xmm0, xmm0 is not redundant here: it clears bits 128-255, which is the
meaning of `zext`.
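
A scalar model of the semantics in question, assuming the intrinsic's documented behavior (low 128 bits copied, bits 128-255 cleared):

```c
#include <stdint.h>

/* Scalar model of _mm256_zextsi128_si256: copy the low 128 bits and zero
   bits 128-255 -- the zeroing is the point, so an instruction that
   performs it is not redundant. */
static void zext128_to_256(const uint64_t src[2], uint64_t dst[4]) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = 0;  /* upper 128 bits cleared */
    dst[3] = 0;
}
```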

[Bug target/94962] Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

--- Comment #4 from Hongtao.liu  ---
(In reply to Jakub Jelinek from comment #2)
> But such an instruction isn't always redundant, it really depends on what
> the previous setter of the register did, whether the upper 128 bit of the
> 256-bit register are already guaranteed to be zero or not.

(define_insn "avx_vec_concat"
  [(set (match_operand:V_256_512 0 "register_operand" "=x,v,x,Yv")
(vec_concat:V_256_512
  (match_operand: 1 "nonimmediate_operand" "x,v,xm,vm")
  (match_operand: 2 "nonimm_or_0_operand"
"xm,vm,C,C")))]

define_insn "*_vinsert_0"
  [(set (match_operand:AVX512_VEC 0 "register_operand" "=v,x,Yv")
(vec_merge:AVX512_VEC
  (match_operand:AVX512_VEC 1 "reg_or_0_operand" "v,C,C")
  (vec_duplicate:AVX512_VEC
(match_operand: 2 "nonimmediate_operand"
"vm,xm,vm"))
  (match_operand:SI 3 "const_int_operand" "n,n,n")))]


The upper part is already zeroed.

> Thus the #c1 patch looks incorrect to me, one would need peephole2s or some
> combine patterns or target specific pass etc. to discover that at least for
> the common cases; and it isn't something we model in the RTL patterns (what
> insns guarantee which upper bits zero and what do not; and for some there
> can be different choices even in the same define_insn, we could implement
> something using widened registers and then there would be no guarantee etc.).

[Bug target/94962] Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

--- Comment #3 from Hongtao.liu  ---
You're right; from the Intel SDM:
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are
zeroed.

[Bug target/94962] Suboptimal AVX2 code for _mm256_zextsi128_si256(_mm_set1_epi8(-1))

2020-05-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94962

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #1 from Hongtao.liu  ---
The redundant vmovdqa xmm0, xmm0 is generated by

(insn:TI 7 6 14 2 (set (reg:V8SI 20 xmm0 [84])
(vec_concat:V8SI (reg:V4SI 20 xmm0 [86])
(const_vector:V4SI [
(const_int 0 [0]) repeated x4
])))
"/export/users2/liuhongt/install/gcc10_trunk/lib/gcc/x86_64-pc-linux-gnu/10.0.1/include/avxintrin.h":770:20
5296 {avx_vec_concatv8si}
 (expr_list:REG_EQUIV (const_vector:V8SI [
(const_int -1 [0x]) repeated x4
(const_int 0 [0]) repeated x4
])
(nil)))
-

It could be eliminated if the src operand has the same regno as the dest
operand.

---
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 7a7ecd4be87..4ff4cf55f74 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -21123,6 +21123,9 @@
}
 case 2:
 case 3:
+  if (register_operand (operands[1], <MODE>mode)
+      && REGNO (operands[1]) == REGNO (operands[0]))
+   return "";
   switch (get_attr_mode (insn))
{
case MODE_V16SF:
---

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-15 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

--- Comment #13 from Hongtao.liu  ---
*** Bug 92611 has been marked as a duplicate of this bug. ***

[Bug target/92611] auto vectorization failed for type promotation

2020-05-15 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92611

Hongtao.liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from Hongtao.liu  ---
x86 lacks vector extend / truncate

*** This bug has been marked as a duplicate of bug 92658 ***

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2020-05-15 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 92611, which changed state.

Bug 92611 Summary: auto vectorization failed for type promotation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92611

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

[Bug middle-end/92492] AVX512: Missed vectorization opportunity

2020-05-15 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92492
Bug 92492 depends on bug 92611, which changed state.

Bug 92611 Summary: auto vectorization failed for type promotation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92611

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

[Bug target/95078] Missing fwprop for SIB address

2020-05-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95078

--- Comment #2 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> TER should go away, not be extended.  So you are suggesting that we replace
> 
> leaq44(%rdi,%rdx,4), %rdx  --- redundant could be fwprop
> movl(%rdx), %eax
> movl$3, (%rsi)
> addl(%rdx), %eax
> 
> with
> 
> movl   44(%rdi,%rdx,4), %eax
> movl$3, (%rsi)
> addl   44(%rdi,%rdx,4), %eax
> 
Yes.
> ?  The variant that looks bigger is actually one byte smaller.  Note as
> soon as there are three uses it will be larger again...
> 
> So this is really something for RTL and yeah, fwprop only makes "local"
> decisions.  Note that I think that your proposed variant will consume
> more resources since the complex addressing modes are likely split into
> a separate uop.  Yes, overall I'd expect less latency for your sequence.
Yes. It will also increase register pressure, since propagation mostly
increases the live ranges of the base and index registers. It's a subtle
optimization; maybe a cost model could help, and fwprop should be smarter
about seeing the redundancy of the address calculation after propagation.

[Bug target/95078] New: Missing fwprop for SIB address

2020-05-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95078

Bug ID: 95078
   Summary: Missing fwprop for SIB address
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

cat test.c

int foo (int* p1, int* p2, int scale)
{
int ret = *(p1 + scale * 4 + 11);
*p2 = 3;
int ret2 = *(p1 + scale * 4 + 11);
return ret + ret2;
}

gcc11 -O2 test.c -S 

foo(int*, int*, int):
sall$2, %edx
movslq  %edx, %rdx
leaq44(%rdi,%rdx,4), %rdx  --- redundant could be fwprop
movl(%rdx), %eax
movl$3, (%rsi)
addl(%rdx), %eax
ret

fwprop failed to propagate this because it thinks the address 44(%rdi,%rdx,4)
is more expensive than (%rdx). That's correct locally, but from a global view,
if it could be propagated into both movl instructions, the leaq would be
eliminated, which benefits performance.

The ideal place to handle this issue is the TER optimization in pass_expand,
but currently TER only handles simple situations: single use and block level.

   A pass is made through the function, one block at a time.  No cross block
   information is tracked.

   Variables which only have one use, and whose defining stmt is considered
   a replaceable expression (see ssa_is_replaceable_p) are tracked to see
   whether they can be replaced at their use location.
Should TER be extended?

Another testcase has this issue in more complex cfg

Refer to
https://godbolt.org/z/ofjH9R
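
For reference, the folded addressing mode is exact: with %rdx holding scale*4, the displacement 44(%rdi,%rdx,4) addresses the same byte as *(p1 + scale*4 + 11) for 4-byte ints. A quick check of that arithmetic (function names are just for illustration):

```c
/* Byte offset of *(p1 + scale*4 + 11) for 4-byte ints ... */
static long source_offset_bytes(long scale) {
    return 4 * (scale * 4 + 11);  /* pointer arithmetic scaled by sizeof(int) */
}

/* ... versus the SIB form 44(%rdi,%rdx,4) after sall $2, %edx. */
static long sib_offset_bytes(long scale) {
    long rdx = scale * 4;         /* sall $2, %edx */
    return rdx * 4 + 44;          /* base + index*4 + disp */
}
```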

[Bug target/94118] Undocumented inline assembly [target] operand modifiers

2020-05-07 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94118

--- Comment #2 from Hongtao.liu  ---
(In reply to Frédéric Recoules from comment #0)
> The section 6.47.2.8 x86 Operand Modifiers of
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html is only about x86.
> 
> As it was done for Operand Constraints
> (https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-
> Constraints) it would be beneficial to create a separated page.
> 

Do you mean create a page like *Operand Modifiers for Particular Machines* and
move the section 6.47.2.8 x86 Operand Modifiers to the page?

[Bug target/94841] New: [10 Regression]527.cam4_r 7.68% regression on Intel Cascadelaker with -O2, 9.57% regression with -Ofast -march=native -funroll-loops -flto

2020-04-28 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94841

Bug ID: 94841
   Summary: [10 Regression]527.cam4_r 7.68% regression on Intel
Cascadelaker with -O2, 9.57% regression with -Ofast
-march=native -funroll-loops -flto
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com, tkoenig at gcc dot gnu.org,
wwwhhhyyy333 at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

Starting with


commit 06eca1acafa27e19e82dc73927394a7a4d0bdbc5
Author: Thomas König 
Date:   Thu Apr 23 20:30:01 2020 +0200

Fix PR 93956, wrong pointer when returned via function.

This one took a bit of detective work.  When array pointers point
to components of derived types, we currently set the span field
and then create an array temporary when we pass the array
pointer to a procedure as a non-pointer or non-target argument.
(This is inefficient, but that's for another release).

Now, the compiler detected this case when there was a direct assignment
like p => a%b, but not when p was returned either as a function result
or via an argument.  This patch fixes that.

2020-04-23  Thomas Koenig  

PR fortran/93956
* expr.c (gfc_check_pointer_assign): Also set subref_array_pointer
when a function returns a pointer.
* interface.c (gfc_set_subref_array_pointer_arg): New function.
(gfc_procedure_use): Call it.

2020-04-23  Thomas Koenig  

PR fortran/93956
* gfortran.dg/pointer_assign_13.f90: New test.

--

[Bug target/94736] Missing ENDBR at label

2020-04-25 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94736

--- Comment #1 from Hongtao.liu  ---
The indirect jump `goto *p` is optimized away, so there is no indirect jump
and thus no need to insert endbr64.

[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native

2020-03-30 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375

--- Comment #5 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #4)
> (In reply to Martin Jambor from comment #3)
> > (In reply to Hongtao.liu from comment #1)
> > > Try -mprefer-vector-width=128; 256-bit vectorization is not helpful for
> > > 548 according to our experience.
> > 
> > I have seen this helping on one system running SLES 15.1 and with
> > trunk abe13e1847f (Feb 17 2020) but not on another running openSUSE
> > Tumbleweed and with trunk revision 26b3e568a60 (Mar 23 2020).  So,
> > from my perspective, perhaps it helps, perhaps it doesn't.
> 
> What are your GCC options for openSUSE?
> 
> The default value of -mprefer-vector-width for -mtune=zenver1 is 128; if
> that's the case, it won't help.
> Different processors have different tunings, which may have different
> default vector widths.

For -march=native, it depends on the processor of your server/client.

[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native

2020-03-30 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375

--- Comment #4 from Hongtao.liu  ---
(In reply to Martin Jambor from comment #3)
> (In reply to Hongtao.liu from comment #1)
> > Try -mprefer-vector-width=128; 256-bit vectorization is not helpful for
> > 548 according to our experience.
> 
> I have seen this helping on one system running SLES 15.1 and with
> trunk abe13e1847f (Feb 17 2020) but not on another running openSUSE
> Tumbleweed and with trunk revision 26b3e568a60 (Mar 23 2020).  So,
> from my perspective, perhaps it helps, perhaps it doesn't.

What are your GCC options for openSUSE?

The default value of -mprefer-vector-width for -mtune=zenver1 is 128; if
that's the case, it won't help.
Different processors have different tunings, which may have different default
vector widths.

[Bug target/94373] 548.exchange2_r run time is 7-12% worse than GCC 9 at -O2 and generic march/mtune

2020-03-30 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94373

--- Comment #3 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #2)
> I think changing lea_cost from 2 to 1 for skylake can fix these regressions.
> 
> Since it's stage4 now, i hold my patch.

To clarify: that's for -O2 -mtune=skylake-avx512.

Not sure what causes the regression for -O2 -mtune=generic.

[Bug tree-optimization/94375] 548.exchange2_r run time is 8-18% worse than GCC 9 at -Ofast -march=native

2020-03-29 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94375

--- Comment #1 from Hongtao.liu  ---
Try -mprefer-vector-width=128; 256-bit vectorization is not helpful for 548
according to our experience.

[Bug target/94373] 548.exchange2_r run time is 7-12% worse than GCC 9 at -O2 and generic march/mtune

2020-03-29 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94373

--- Comment #2 from Hongtao.liu  ---
I think changing lea_cost from 2 to 1 for skylake can fix these regressions.

Since it's stage 4 now, I'm holding my patch.

[Bug target/93724] macro of _mm512_shrdi_epi16 lack a closing parenthesis

2020-02-14 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93724

Hongtao.liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao.liu  ---
Fixed in GCC10, backported to GCC9 and GCC8.

[Bug target/93696] AVX512VPOPCNTDQ writemask intrinsics produce incorrect results

2020-02-13 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93696

Hongtao.liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao.liu  ---
Fixed in GCC10, backported to GCC9.

[Bug target/93673] Fake error given by gcc when compiling for _kshift intrinsics

2020-02-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93673

Hongtao.liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao.liu  ---
Fixed in GCC10

[Bug target/93724] macro of _mm512_shrdi_epi16 lack a closing parenthesis

2020-02-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93724

--- Comment #1 from Hongtao.liu  ---
Created attachment 47832
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47832&action=edit
Fixed patch

[Bug target/93724] New: macro of _mm512_shrdi_epi16 lack a closing parenthesis

2020-02-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93724

Bug ID: 93724
   Summary: macro of _mm512_shrdi_epi16 lack a closing parenthesis
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

cat test.c
---
#include <immintrin.h>
__m512i foo(__m512i a, __m512i b){
return _mm512_shrdi_epi16 (a, b, 10);
}
---

gcc10_trunk -O0 -mavx512vbmi2 -S 

error

: In function '__m512i foo(__m512i, __m512i)':

:3:41: error: expected ')' before ';' token

3 | return _mm512_shrdi_epi16 (a, b, 10);

  | ^

In file included from
/opt/compiler-explorer/gcc-trunk-20200212/lib/gcc/x86_64-linux-gnu/10.0.1/include/immintrin.h:87,

 from :1:

:3:12: note: to match this '('

3 | return _mm512_shrdi_epi16 (a, b, 10);

  |^~

Compiler returned: 1
---

refer to https://godbolt.org/z/Nv5E6D
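
The fix is purely a matter of balancing parentheses in the -O0 macro form of the intrinsic. A toy macro (not the actual avx512vbmi2intrin.h text) reproducing the same class of error and its repair:

```c
/* Broken form: if uncommented, any use followed by ';' triggers the same
   "expected ')' before ';' token" diagnostic quoted above.
   #define SHRDI_BAD(a, b, c)  (((a) >> (c)) | ((b) << (8 - (c)))
*/

/* Fixed form: every '(' has a matching ')'. */
#define SHRDI_OK(a, b, c)  (((a) >> (c)) | ((b) << (8 - (c))))
```

The intrinsic only exists as a macro when the argument must stay a literal immediate, which is why the error shows up at -O0.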

affected intrinsics

_mm512_maskz_shrdi_epi64
_mm512_mask_shrdi_epi64
_mm512_shrdi_epi64
_mm256_maskz_shrdi_epi64
_mm256_mask_shrdi_epi64
_mm256_shrdi_epi64
_mm_maskz_shrdi_epi64
_mm_mask_shrdi_epi64
_mm_shrdi_epi64
_mm512_maskz_shrdi_epi32
_mm512_mask_shrdi_epi32
_mm512_shrdi_epi32
_mm256_maskz_shrdi_epi32
_mm256_mask_shrdi_epi32
_mm256_shrdi_epi32
_mm_maskz_shrdi_epi32
_mm_mask_shrdi_epi32
_mm_shrdi_epi32
_mm512_maskz_shrdi_epi16
_mm512_mask_shrdi_epi16
_mm512_shrdi_epi16
_mm256_maskz_shrdi_epi16
_mm256_mask_shrdi_epi16
_mm256_shrdi_epi16
_mm_maskz_shrdi_epi16
_mm_mask_shrdi_epi16
_mm_shrdi_epi16
_mm512_maskz_shldi_epi64
_mm512_mask_shldi_epi64
_mm512_shldi_epi64
_mm256_maskz_shldi_epi64
_mm256_mask_shldi_epi64
_mm256_shldi_epi64
_mm_maskz_shldi_epi64
_mm_mask_shldi_epi64
_mm_shldi_epi64
_mm512_maskz_shldi_epi32
_mm512_mask_shldi_epi32
_mm512_shldi_epi32
_mm256_maskz_shldi_epi32
_mm256_mask_shldi_epi32
_mm256_shldi_epi32
_mm_maskz_shldi_epi32
_mm_mask_shldi_epi32
_mm_shldi_epi32
_mm512_maskz_shldi_epi16
_mm512_mask_shldi_epi16
_mm512_shldi_epi16
_mm256_maskz_shldi_epi16
_mm256_mask_shldi_epi16
_mm256_shldi_epi16
_mm_maskz_shldi_epi16
_mm_mask_shldi_epi16
_mm_shldi_epi16

[Bug target/93670] ICE for _mm256_extractf32x4_ps (unrecognized insn)

2020-02-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93670

Hongtao.liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao.liu  ---
Fixed in GCC10

[Bug target/93696] New: AVX512VPOPCNTDQ writemask intrinsics produce incorrect results

2020-02-11 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93696

Bug ID: 93696
   Summary: AVX512VPOPCNTDQ writemask intrinsics produce incorrect
results
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

The writemask (mask) forms of the AVX512VPOPCNTDQ intrinsics generate incorrect
results. When the mask bit is not set, it appears the GCC implementation is
copying from the third parameter, whereas it should be copying from the first
parameter.

testcase:
cat test.c

#include <immintrin.h>
__m128i foo (__m128i dst, __mmask8 m, __m128i src)
{
  return _mm_mask_popcnt_epi64 (dst, m, src);
}

gcc10_trunk -O2 -mavx512vpopcntdq -mavx512vl -S

foo(long long __vector(2), unsigned char, long long __vector(2)):
kmovw   %edi, %k1
vpopcntq%xmm0, %xmm1{%k1}
vmovdqa64   %xmm1, %xmm0
ret

which is incorrect, it should be 

foo(__m128i, unsigned char, __m128i):
kmovw %edi, %k1 #4.10
vpopcntq  %xmm1, %xmm0{%k1} #4.10
ret #4.10

Refer to https://godbolt.org/z/EK12b0

Affected intrinsics

_mm256_mask_popcnt_epi64
_mm_mask_popcnt_epi64
_mm256_mask_popcnt_epi32
_mm_mask_popcnt_epi32
_mm512_mask_popcnt_epi32
_mm512_mask_popcnt_epi64
_mm512_mask_popcnt_epi16
_mm256_mask_popcnt_epi16
_mm_mask_popcnt_epi16
_mm512_mask_popcnt_epi8
_mm256_mask_popcnt_epi8
_mm_mask_popcnt_epi8
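
A scalar model of the documented merge-masking semantics, for reference (using GCC's __builtin_popcountll; the function name is illustrative):

```c
#include <stdint.h>

/* Model of _mm_mask_popcnt_epi64: when mask bit i is clear, lane i must
   keep the value from dst (the first argument), not from src. */
static void mask_popcnt_epi64_model(uint64_t dst[2], uint8_t m,
                                    const uint64_t src[2]) {
    for (int i = 0; i < 2; i++)
        if (m & (1u << i))
            dst[i] = (uint64_t)__builtin_popcountll(src[i]);
        /* else: dst[i] unchanged (merge-masking) */
}
```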

[Bug target/93673] Fake error given by gcc when compiling for _kshift intrinsics

2020-02-11 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93673

--- Comment #1 from Hongtao.liu  ---
Affected intrinsics
_kshiftli_mask16
_kshiftri_mask16

[Bug target/93673] New: Fake error given by gcc when compiling for _kshift intrinsics

2020-02-11 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93673

Bug ID: 93673
   Summary: Fake error given by gcc when compiling for _kshift
intrinsics
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

cat test.c

#include <immintrin.h>
#include <assert.h>
__mmask16 i__kshiftli_mask16_KSHIFTLW(__mmask16 arg_0, unsigned int arg_1){
__mmask16 result;
switch (arg_1) {
case 0xFF: // arg_1
result = _kshiftli_mask16(arg_0, 0xff);
break;
default: // arg_1
assert(false);
break;
}

return result;
}
---

gcc10_trunk test.c -S -O0 -mavx512f
error:
In file included from
/opt/compiler-explorer/gcc-trunk-20200211/lib/gcc/x86_64-linux-gnu/10.0.1/include/immintrin.h:55,

 from :1:

: In function '__mmask16 i__kshiftli_mask16_KSHIFTLW(__mmask16,
unsigned int)':

:7:13: error: the last argument must be an 8-bit immediate

7 |result = _kshiftli_mask16(arg_0, 0xff);

  | ^~~~

Compiler returned: 1

But 0xFF is already an 8-bit immediate.

Refer to https://godbolt.org/z/xirwxN
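
For reference, a scalar model of what _kshiftli_mask16 should accept, under my reading of the SDM (stated as an assumption): the count is an imm8, so 0xFF is in range, and counts of 16 or more simply yield an all-zero mask rather than being rejected.

```c
#include <stdint.h>

/* Model of _kshiftli_mask16: imm8 counts are valid up to 255; counts >= 16
   produce an all-zero mask. */
static uint16_t kshiftli_mask16_model(uint16_t k, unsigned int count) {
    return count < 16 ? (uint16_t)(k << count) : 0;
}
```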

[Bug target/93670] ICE for _mm256_extractf32x4_ps (unrecognized insn)

2020-02-10 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93670

--- Comment #1 from Hongtao.liu  ---
Refer to https://godbolt.org/z/QfpRWu

[Bug target/93670] New: ICE for _mm256_extractf32x4_ps (unrecognized insn)

2020-02-10 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93670

Bug ID: 93670
   Summary: ICE for _mm256_extractf32x4_ps (unrecognized insn)
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

cat test.c
---
#include <immintrin.h>
#include <assert.h>
__m128i i__mm256_extractf32x4_ps_VEXTRACTF32X4(__m256i arg_0, int arg_1) {
  __m128i result;
  switch (arg_1) {
  case 0x00: // arg_1
result = _mm256_extracti32x4_epi32(arg_0, 0x00);
break;
  case 0x01: // arg_1
result = _mm256_extracti32x4_epi32(arg_0, 0x01);
break;
  default: // arg_1
assert(0);
break;
  }
  return result;
}
---

gcc10_20200110 -O2 -S test.c -mavx512f -mavx512vl 

error:

test.c: In function ‘i__mm256_extractf32x4_ps_VEXTRACTF32X4’:
test.c:17:1: error: unrecognizable insn:
   17 | }
  | ^
(insn 20 19 21 6 (set (reg:V4SI 89)
(vec_merge:V4SI (vec_select:V4SI (reg:V8SI 90)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 3 [0x3])
]))
(reg:V4SI 91)
(reg:QI 92)))
"/export/users2/liuhongt/install/gcc10_trunk/lib/gcc/x86_64-pc-linux-gnu/10.0.0/include/avx512vlintrin.h":10055:20
-1
 (nil))
during RTL pass: vregs
test.c:17:1: internal compiler error: in extract_insn, at recog.c:2294
0x1185247 _fatal_insn(char const*, rtx_def const*, char const*, int, char
const*)
../../../gcc/gnu-toolchain/gcc/gcc/rtl-error.c:108
0x1185288 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
../../../gcc/gnu-toolchain/gcc/gcc/rtl-error.c:116
0x111d0d4 extract_insn(rtx_insn*)
../../../gcc/gnu-toolchain/gcc/gcc/recog.c:2294
0xd0eece instantiate_virtual_regs_in_insn
../../../gcc/gnu-toolchain/gcc/gcc/function.c:1607
0xd10517 instantiate_virtual_regs
../../../gcc/gnu-toolchain/gcc/gcc/function.c:1977
0xd105e2 execute
../../../gcc/gnu-toolchain/gcc/gcc/function.c:2026
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

affected intrinsics

_mm256_extractf32x4_ps
_mm256_mask_extractf32x4_ps
_mm256_maskz_extractf32x4_ps
_mm256_extracti32x4_epi32
_mm256_mask_extracti32x4_epi32
_mm256_maskz_extracti32x4_epi32

[Bug target/92295] Inefficient vector constructor

2020-02-03 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92295

--- Comment #4 from Hongtao.liu  ---
(In reply to Martin Liška from comment #3)
> Can we close the issue?

Yes, it's fixed in GCC10.

[Bug target/93243] misoptimization: minor changes of the code leads change up to +/- 30% performance on x86_64, -Os faster than -Ofast/O2/O3

2020-01-12 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93243

--- Comment #2 from Hongtao.liu  ---

> The diffs in the source code are:
> #if CASE & 1
> #define CMP(a, b) ((a) < (b))
> #else
> #define CMP(a, b) (((a) - (b)) < 0)
> #endif
> 
(a) < (b) is not equivalent to ((a) - (b)) < 0, so the
compiler will treat them differently.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #8 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #4)

> But that is not true any more.  So I think this optimization can be removed
> as it is too early.  Just double check the above testcase and the C++
> testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post
> that removal.  This optimization is most likely causing other missed
> optimizations already too.  So I would compile SPEC to see if there is any
> differences; my bet you might find some.

No big impact for SPEC2017, more or less like noise.

500.perlbench_r 0.21%
502.gcc_r   0.14%
505.mcf_r   -0.40%
520.omnetpp_r   -0.47%
523.xalancbmk_r -1.20%
525.x264_r  -1.26%
531.deepsjeng_r -0.05%
541.leela_r -0.39%
548.exchange2_r -0.09%
557.xz_r-0.16%
geomean for intrate -0.37%
503.bwaves_r-0.19%
507.cactuBSSN_r 0.23%
508.namd_r  -0.12%
510.parest_r0.18%
511.povray_r-0.30%
519.lbm_r   BuildSame   #VALUE!
521.wrf_r   -0.01%
526.blender_r   -0.44%
527.cam4_r  -0.17%
538.imagick_r   0.47%
544.nab_r   -1.00%
549.fotonik3d_r 0.09%
554.roms_r  0.28%
geomean for fprate  -0.08%
geomean -0.21%

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #6 from Hongtao.liu  ---
New failures caused by the removal:

unix/-m32: c-c++-common/restrict-2.c  -Wc++-compat   scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++14  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++17  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++2a  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: c-c++-common/restrict-2.c  -std=gnu++98  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m32: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 "  if " 3
unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m32: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will
duplicate bb" 3
unix/-m32: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores
chain" 2
unix/-m32: gcc.dg/vect/pr57558-2.c -flto -ffat-lto-objects  scan-tree-dump vect
"vectorized 1 loops"
unix/-m32: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"
unix/-m64: c-c++-common/builtins.c  -Wc++-compat  (test for excess errors)
unix/-m64: c-c++-common/restrict-2.c  -Wc++-compat   scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++14  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++17  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++2a  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: c-c++-common/restrict-2.c  -std=gnu++98  scan-tree-dump-times lim2
"Moving statement" 11
unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m64: gcc.dg/tree-ssa/copy-headers-5.c scan-tree-dump-times ch2 "  if " 3
unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump ch2 "is now do-while
loop"
unix/-m64: gcc.dg/tree-ssa/copy-headers-7.c scan-tree-dump-times ch2 "Will
duplicate bb" 3
unix/-m64: gcc.dg/tree-ssa/pr81744.c scan-tree-dump-times pcom "Store-stores
chain" 2
unix/-m64: gcc.dg/vect/pr57558-2.c -flto -ffat-lto-objects  scan-tree-dump vect
"vectorized 1 loops"
unix/-m64: gcc.dg/vect/pr57558-2.c scan-tree-dump vect "vectorized 1 loops"

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #5 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #4)

> But that is not true any more.  So I think this optimization can be removed
> as it is too early.  Just double check the above testcase and the C++
> testcase (g++.dg/opt/ptrintsum1.C) to make sure they still work and post
> that removal.

The removal works fine with those 2 testcases.

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-18 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #3 from Hongtao.liu  ---
(In reply to Andrew Pinski from comment #2)
> I think the problem is we are folding the right side of the array (with the
> conversion to size_t) too early.
> That is:
> src1[j-1]
> 
> Is being folded too early to have (j-1)*4
> 
> Fixing this up in match.pd is wrong.
> 
> This gets us the best code without any patch to match.pd:
> int foo(unsigned int *__restrict src1, int i, int k, int n)
> {
>   int j = k + n;
>   int sum = src1[j];
>   int jj = j-1;
>   sum += src1[jj];
>   if (i <= k)
> {
>   j+=2;
> int ii = j-3;
>   sum += src1[ii];
> }
>   return sum + j;
> }
> 
> See how j-1 and j-3 are not folded early and that fixes the issue.

It's done by the parser; excerpt from c-common.c:
--
3182   if (TREE_CODE (intop) == MINUS_EXPR)
3183     subcode = (subcode == PLUS_EXPR ? MINUS_EXPR : PLUS_EXPR);
3184   /* Convert both subexpression types to the type of intop,
3185      because weird cases involving pointer arithmetic
3186      can result in a sum or difference with different type args.  */
3187   ptrop = build_binary_op (EXPR_LOCATION (TREE_OPERAND (intop, 1)),
3188                            subcode, ptrop,
3189                            convert (int_type, TREE_OPERAND (intop, 1)),
3190                            true);
3191   intop = convert (int_type, TREE_OPERAND (intop, 0));
3192 }

--

ptrop ---> src1 + 18446744073709551612;
intop ---> j

It seems to be intentional???

[Bug tree-optimization/92980] [miss optimization]redundant load missed by fre.

2019-12-17 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

--- Comment #1 from Hongtao.liu  ---
test.c.033.fre1
foo (unsigned int * restrict src1, int i, int k, int n)
{
  int sum;
  int j;
  long unsigned int _1;
  long unsigned int _2;
  unsigned int * _3;
  unsigned int _4;
  sizetype _7;
  unsigned int * _8;
  unsigned int _9;
  unsigned int _11;
  long unsigned int _12;
  long unsigned int _13;
  sizetype _14;
  unsigned int * _15;
  unsigned int _16;
  unsigned int _18;
  int _31;

  <bb 2> :
  j_23 = k_21(D) + n_22(D);
  _1 = (long unsigned int) j_23;
  _2 = _1 * 4;
  _3 = src1_24(D) + _2;
  _4 = *_3;
  sum_26 = (int) _4;
  _7 = _2 + 18446744073709551612;
  _8 = src1_24(D) + _7;
  _9 = *_8;
  _11 = _4 + _9;
  sum_27 = (int) _11;
  if (k_21(D) >= i_28(D))
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]

  <bb 3> :
  j_29 = j_23 + 2;
  _12 = (long unsigned int) j_29;
  _13 = _12 * 4;
  _14 = _13 + 18446744073709551604; <--- it should be simplified to _7
  _15 = src1_24(D) + _14;
  _16 = *_15;
  _18 = _11 + _16;
  sum_30 = (int) _18;

  <bb 4> :
  # j_19 = PHI <j_23(2), j_29(3)>
  # sum_20 = PHI <sum_27(2), sum_30(3)>
  _31 = j_19 + sum_20;
  return _31;
}

[Bug tree-optimization/92980] New: [miss optimization]redundant load missed by fre.

2019-12-17 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92980

Bug ID: 92980
   Summary: [miss optimization]redundant load missed by fre.
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com, wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

cat test.c

int foo(unsigned int *__restrict src1, int i, int k, int n)
{
  int j = k + n;
  int sum = src1[j];
  sum += src1[j-1];
  if (i <= k)
{
  j+=2;
  sum += src1[j-3];
}
  return sum + j;
}


x86_64_gcctrunk -Ofast test.c -S
got 

foo:
.LFB0:
    .cfi_startproc
    addl    %edx, %ecx
    movl    %esi, %r8d
    movslq  %ecx, %rsi
    movl    (%rdi,%rsi,4), %eax
    addl    -4(%rdi,%rsi,4), %eax
    cmpl    %r8d, %edx
    jl      .L3
    addl    $2, %ecx
    movslq  %ecx, %rdx
    addl    -12(%rdi,%rdx,4), %eax  <--- redundant load; it's actually
                                         src1[j-1], which was loaded before.
.L3:
    addl    %ecx, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   foo, .-foo
    .ident  "GCC: (GNU) 10.0.0 20191117 (experimental)"
    .section    .note.GNU-stack,"",@progbits

it could be better like

foo:
.LFB0:
    .cfi_startproc
    addl    %edx, %ecx
    movl    %esi, %r9d
    movslq  %ecx, %rsi
    movl    -4(%rdi,%rsi,4), %r8d
    movl    (%rdi,%rsi,4), %eax
    addl    %r8d, %eax
    cmpl    %r9d, %edx
    jl      .L3
    addl    $2, %ecx
    addl    %r8d, %eax  <---- reuse earlier load result.
.L3:
    addl    %ecx, %eax
    ret
    .cfi_endproc

[Bug target/92865] [10 Regression] error: unrecognizable insn: in extract_insn, at recog.c:2294 since r279107

2019-12-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92865

--- Comment #6 from Hongtao.liu  ---
The point my patch missed is that for 512-bit vector compares, integer-mask
compares should still be used even with TARGET_XOP; that's the root cause of
this issue.

Refer to this part.
---
-  if (GET_MODE_SIZE (cmp_ops_mode) == 64)
+  if (ix86_valid_mask_cmp_mode (cmp_ops_mode))
 {
   unsigned int nbits = GET_MODE_NUNITS (cmp_ops_mode);
-  cmp_mode = int_mode_for_size (nbits, 0).require ();
   maskcmp = true;
+  cmp_mode = nbits > 8 ? int_mode_for_size (nbits, 0).require () :
E_QImode;
 }
   else
 cmp_mode = cmp_ops_mode;
@@ -3461,37 +3484,6 @@ ix86_expand_sse_cmp (rtx dest, enum rtx_code code, rtx
cmp_op0, rtx cmp_op1,
   || (op_false && reg_overlap_mentioned_p (dest, op_false)))
 dest = gen_reg_rtx (maskcmp ? cmp_mode : mode);

-  /* Compare patterns for int modes are unspec in AVX512F only.  */
-  if (maskcmp && (code == GT || code == EQ))
-{
-  rtx (*gen)(rtx, rtx, rtx);
-
-  switch (cmp_ops_mode)
-   {
-   case E_V64QImode:
- gcc_assert (TARGET_AVX512BW);
- gen = code == GT ? gen_avx512bw_gtv64qi3 : gen_avx512bw_eqv64qi3_1;
- break;
-   case E_V32HImode:
- gcc_assert (TARGET_AVX512BW);
- gen = code == GT ? gen_avx512bw_gtv32hi3 : gen_avx512bw_eqv32hi3_1;
- break;
-   case E_V16SImode:
- gen = code == GT ? gen_avx512f_gtv16si3 : gen_avx512f_eqv16si3_1;
- break;
-   case E_V8DImode:
- gen = code == GT ? gen_avx512f_gtv8di3 : gen_avx512f_eqv8di3_1;
- break;
-   default:
- gen = NULL;
-   }
-
-  if (gen)
-   {
- emit_insn (gen (dest, cmp_op0, cmp_op1));
- return dest;
-   }
-}
   x = gen_rtx_fmt_ee (code, cmp_mode, cmp_op0, cmp_op1);

   if (cmp_mode != mode && !maskcmp)
-

[Bug target/92865] [10 Regression] error: unrecognizable insn: in extract_insn, at recog.c:2294 since r279107

2019-12-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92865

--- Comment #5 from Hongtao.liu  ---
(In reply to Richard Biener from comment #4)
> (In reply to Hongtao.liu from comment #3)
> > Since TARGET_XOP only supports 128-bit vector compares,
> > ix86_valid_mask_cmp_mode should also handle 256/512-bit vector compares
> > when avx512f is available.
> > 
> > 
> > untested patch
> > 
> > @@ -3428,7 +3428,7 @@ static bool
> >  ix86_valid_mask_cmp_mode (machine_mode mode)
> >  {
> >/* XOP has its own vector conditional movement.  */
> > -  if (TARGET_XOP)
> > +  if (TARGET_XOP && GET_MODE_SIZE (mode) == 128)
> >  return false;
> 
> Shouldn't we do sth like TARGET_XOP && !TARGET_AVX512F instead?  That is
> maybe simply elide that check completely, not sure why it was added.
True, thanks
> 
> >/* AVX512F is needed for mask operation.  */
> > 
> > I'll add some testcase later.

[Bug target/92865] [10 Regression] error: unrecognizable insn: in extract_insn, at recog.c:2294 since r279107

2019-12-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92865

--- Comment #3 from Hongtao.liu  ---
Since TARGET_XOP only supports 128-bit vector compares, ix86_valid_mask_cmp_mode
should also handle 256/512-bit vector compares when avx512f is available.


untested patch

@@ -3428,7 +3428,7 @@ static bool
 ix86_valid_mask_cmp_mode (machine_mode mode)
 {
   /* XOP has its own vector conditional movement.  */
-  if (TARGET_XOP)
+  if (TARGET_XOP && GET_MODE_SIZE (mode) == 128)
 return false;

   /* AVX512F is needed for mask operation.  */

I'll add some testcase later.

[Bug target/92865] [10 Regression] error: unrecognizable insn: in extract_insn, at recog.c:2294 since

2019-12-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92865

--- Comment #1 from Hongtao.liu  ---
I'll take a look.

[Bug middle-end/85559] [meta-bug] Improve conditional move

2019-12-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85559
Bug 85559 depends on bug 92578, which changed state.

Bug 92578 Summary: [i386] cmov not generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92578

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |INVALID

[Bug target/92578] [i386] cmov not generated

2019-12-09 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92578

Hongtao.liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |INVALID

--- Comment #4 from Hongtao.liu  ---
It's by design.

[Bug target/92686] Inefficient mask operation for 128/256-bit vector VCOND_EXPR under avx512f

2019-12-08 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92686

Hongtao.liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao.liu  ---
Fixed in GCC10.

[Bug target/80969] [8 Regression] ICE in ix86_expand_prologue, at config/i386/i386.c:14606

2019-12-08 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80969

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #8 from Hongtao.liu  ---
ICE in r279107; I think my patch triggered a latent bug.

/export/users/liuhongt/gcc/gnu-toolchain/gcc/gcc/testsuite/gcc.target/i386/pr80969-1.c:
In function ‘main’:
/export/users/liuhongt/gcc/gnu-toolchain/gcc/gcc/testsuite/gcc.target/i386/pr80969-1.c:16:1:
internal compiler error: in sp_valid_at, at config/i386/i386.c:6160
0x747bd7 sp_valid_at
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.c:6160
0x10ded88 choose_basereg
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.c:6196
0x10df115 choose_baseaddr
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.c:6309
0x10df1b6 ix86_emit_save_reg_using_mov
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.c:6358
0x10f450e ix86_emit_save_sse_regs_using_mov
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.c:6447
0x10f55b0 ix86_expand_prologue()
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.c:8248
0x14368ba gen_prologue()
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.md:13130
0x10e5988 target_gen_prologue
../../../gcc/gnu-toolchain/gcc/gcc/config/i386/i386.md:19667
0xabcdb7 make_prologue_seq
../../../gcc/gnu-toolchain/gcc/gcc/function.c:5779
0xabcf82 thread_prologue_and_epilogue_insns()
../../../gcc/gnu-toolchain/gcc/gcc/function.c:5896
0xabd672 rest_of_handle_thread_prologue_and_epilogue
../../../gcc/gnu-toolchain/gcc/gcc/function.c:6387
0xabd672 execute
../../../gcc/gnu-toolchain/gcc/gcc/function.c:6463
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Same error for pr80969-4.c.

[Bug target/92686] Inefficient mask operation for 128/256-bit vector VCOND_EXPR under avx512f

2019-11-27 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92686

--- Comment #4 from Hongtao.liu  ---
(In reply to Richard Biener from comment #2)
> It would be definitely nice to have this.  Maybe add a tunable whether to use
> mask registers for SSE/AVX2?  
Sure for 128/256-bit vector under avx512f.

> Is there any boost frequency penalty for using them?
I don't see any frequency penalty from using mask registers.

> Using mask registers also looks like a way to reduce register pressure
> (in case the register pressure is not on the masks side).

Yes, it would save one vector register (used for the mask) and one
instruction (vpblendvb).

[Bug target/92686] Inefficient mask operation for 128/256-bit vector VCOND_EXPR under avx512f

2019-11-27 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92686

--- Comment #3 from Hongtao.liu  ---
Created attachment 47372
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47372&action=edit
Local patch; bootstrap and regression test on i386/x86_64 are OK.

Also, I found some interference with PR88547.

[Bug target/92686] Inefficient mask operation for 128/256-bit vector VCOND_EXPR under avx512f

2019-11-26 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92686

--- Comment #1 from Hongtao.liu  ---
My local patch shows there's no big performance impact on SPEC2017.

[Bug target/92686] New: Inefficient mask operation for 128/256-bit vector VCOND_EXPR under avx512f

2019-11-26 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92686

Bug ID: 92686
   Summary: Inefficient mask operation for 128/256-bit vector
VCOND_EXPR under avx512f
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com, wwwhhhyyy333 at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

Cat test.c

void mc_weight( unsigned int *__restrict dst, unsigned int *__restrict src1,
int *__restrict src2)
{
for( int x = 0; x < 16; x++ )
dst[x] = src1[x] > src2[x] ? src1[x] : dst[x];
}

With option -Ofast -march=skylake-avx512

GCC uses a ymm register as the mask and vpblendvb for the conditional vector
move:

vmovdqu32   (%rsi), %ymm0
vpminud     (%rdx), %ymm0, %ymm1
vpcmpeqd    %ymm1, %ymm0, %ymm1
vpblendvb   %ymm1, (%rdi), %ymm0, %ymm0
vmovdqu32   %ymm0, (%rdi)
vmovdqu32   32(%rsi), %ymm0
vpminud     32(%rdx), %ymm0, %ymm1
vpcmpeqd    %ymm1, %ymm0, %ymm1
vpblendvb   %ymm1, 32(%rdi), %ymm0, %ymm0
vmovdqu32   %ymm0, 32(%rdi)
vzeroupper


But avx512f has mask registers, so it could be better as:

vmovdqu   (%rsi), %ymm0             #5.25
vmovdqu   32(%rsi), %ymm1           #5.25
vpcmpud   $6, (%rdx), %ymm0, %k1    #5.25
vpcmpud   $6, 32(%rdx), %ymm1, %k2  #5.25
vmovdqu32 %ymm0, (%rdi){%k1}        #5.6
vmovdqu32 %ymm1, 32(%rdi){%k2}      #5.6
vzeroupper                          #6.1
ret                                 #6.1

That's because currently GCC only handles 512-bit vectors:
---
 3437  /* In AVX512F the result of comparison is an integer mask.  */   
 3438  bool maskcmp = false;
 3439  rtx x;   
 3440   
 3441  if (GET_MODE_SIZE (cmp_ops_mode) == 64)  
 3442{  
 3443  unsigned int nbits = GET_MODE_NUNITS (cmp_ops_mode); 
 3444  cmp_mode = int_mode_for_size (nbits, 0).require ();  
 3445  maskcmp = true;  
 3446}  
 3447  else  


With an additional -mprefer-vector-width=512, GCC generates:

vmovdqu32   (%rsi), %zmm0
vpminud     (%rdx), %zmm0, %zmm1
vpcmpeqd    %zmm1, %zmm0, %k1
vmovdqu32   (%rdi), %zmm0{%k1}
vmovdqu32   %zmm0, (%rdi)
vzeroupper
ret

Since mask registers are tied to the ISA, not the vector size, under avx512f we
can also have 128/256-bit vector conditional moves.

[Bug target/92611] auto vectorization failed for type promotation

2019-11-21 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92611

--- Comment #2 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> I think Richard laid ground for this to work on x86 (it needs AVX512?),
AVX512 is not needed.
> not sure what is needed in the backend here to make V4QI -> V4SI conversions
> vectorized?
There are many expressions that depend on type promotion, such as:

void foo(int *__restrict dst, char *__restrict src1,
char *__restrict src2)
{
for(int x = 0; x < 4; x++ )
dst[x] = src1[x] + src2[x];
}

And I think it also affects pr92492.

[Bug target/92611] New: auto vectorization failed for type promotation

2019-11-21 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92611

Bug ID: 92611
   Summary: auto vectorization failed for type promotation
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
  Target Milestone: ---
Target: i386, x86-64

Cat test.c

void foo(int *__restrict dst, char *__restrict src)
{
for(int x = 0; x < 4; x++ )
dst[x] = src[x];
}

Clang generates:

---
vpmovsxbd   (%rsi), %xmm0
vmovdqu %xmm0, (%rdi)
retq
---

while GCC generates:


movsbl  (%rsi), %eax
movl    %eax, (%rdi)
movsbl  1(%rsi), %eax
movl    %eax, 4(%rdi)
movsbl  2(%rsi), %eax
movl    %eax, 8(%rdi)
movsbl  3(%rsi), %eax
movl    %eax, 12(%rdi)
ret
---


Refer to https://godbolt.org/z/ckmXm_

[Bug target/92578] [i386] cmov not generated

2019-11-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92578

--- Comment #3 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> With newcnt-=2 you get
> 
> movl    %edx, %r8d
> movl    %esi, %eax
> leal    -2(%rsi), %edx
> cmpl    %r8d, %edi
> cmove   %edx, %eax
> ret
> 

It could be better using cmovne

leal    -2(%rsi), %eax
cmpl    %edx, %edi
cmovnel %esi, %eax
retq

[Bug target/92578] New: [i386] cmov not generated

2019-11-19 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92578

Bug ID: 92578
   Summary: [i386] cmov not generated
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: crazylht at gmail dot com
CC: hjl.tools at gmail dot com
  Target Milestone: ---

Cat test.c

int foo(int moves, int movecnt, int komove) {
int newcnt = movecnt;
if (moves == komove)
newcnt--;
return newcnt;
}

gcc10 -O2 test.c -S

cmpl    %edx, %edi
movl    %esi, %eax
sete    %dl
movzbl  %dl, %edx
subl    %edx, %eax
ret

It could be better like

cmpl    %edx, %edi      #6.12
lea     -1(%rsi), %eax  #5.9
cmovne  %esi, %eax      #6.12
ret

Just like icc did, refer to https://godbolt.org/z/6mqkt8

[Bug target/92448] Confusing using of TARGET_PREFER_AVX128

2019-11-17 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92448

Hongtao.liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongtao.liu  ---
Fixed in gcc10.

[Bug target/92492] [AVX512F] Icc generate much better code for loop vectorization

2019-11-13 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92492

--- Comment #3 from Hongtao.liu  ---
(In reply to Richard Biener from comment #2)
> ICC also uses effectively two vector sizes, v8qi and v8hi AFAICS?  But
> why does it use %ymm then...

I think it's v8qi and v8si; icc uses vpmovzxbd, not vpmovzxbw.

[Bug target/92492] [AVX512F] Icc generate much better code for loop vectorization

2019-11-13 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92492

--- Comment #1 from Hongtao.liu  ---
Much more simple case, exclude disturb of point alias and unknown loop count
cat test.c:

typedef unsigned char uint8_t;

static inline uint8_t x264_clip_uint8( int x )
{
  return x&(~63) ? (-x)>>7 : x;
}


void mc_weight( uint8_t *__restrict dst, uint8_t *__restrict src)
{
for( int x = 0; x < 16; x++ )
dst[x] = x264_clip_uint8(src[x]);
}

Refer to https://godbolt.org/z/YJXWRD 

GCC fails to vectorize the loop; icc succeeds.
