[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-07-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

--- Comment #15 from Uroš Bizjak  ---
(In reply to Joseph C. Sible from comment #14)
> I notice this change affects -Os too, even though "lock orq $0,(%rsp)" is 6
> bytes and "mfence" is only 3 bytes.

Yes, we can emit mfence for -Os. I'm testing the following patch:

--cut here--
diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
index c88750d3664..ed17bb00205 100644
--- a/gcc/config/i386/sync.md
+++ b/gcc/config/i386/sync.md
@@ -123,7 +123,8 @@
   rtx mem;

   if ((TARGET_64BIT || TARGET_SSE2)
- && !TARGET_AVOID_MFENCE)
+ && (optimize_function_for_size_p (cfun)
+ || !TARGET_AVOID_MFENCE))
mfence_insn = gen_mfence_sse2;
   else
mfence_insn = gen_mfence_nosse;
--cut here--

[Bug tree-optimization/96272] Failure to optimize overflow check

2020-07-22 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96272

--- Comment #3 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #2)
> Well, it needs the addition too, so I think this can't be done in match.pd,
> but would need to be done in some other pass (not sure which, perhaps
> phiopt?).

No, I was referring to the first step of the optimization. The converted source
would read something like:

unsigned
bar (unsigned a, unsigned b)
{
  int dummy;
  if (__builtin_uadd_overflow (a, b, &dummy))
    return UINT_MAX;
  return a + b;
}

The RTL CSE pass is able to eliminate one addition, resulting in:

bar:
addl    %esi, %edi
jc      .L5
movl    %edi, %eax
ret
.L5:
orl $-1, %eax
ret

Eventually, some tree pass could convert the above source to:

unsigned
bar (unsigned a, unsigned b)
{
  unsigned res;
  if (__builtin_uadd_overflow (a, b, &res))
    return UINT_MAX;
  return res;
}

which results in:

bar:
addl    %esi, %edi
movl    $-1, %eax
cmovnc  %edi, %eax
ret

[Bug tree-optimization/96272] Failure to optimize overflow check

2020-07-22 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96272

Uroš Bizjak  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2020-07-22
 Status|UNCONFIRMED |NEW

--- Comment #1 from Uroš Bizjak  ---
Confirmed, pattern to convert:

a > UINT_MAX - b;

to

__builtin_uadd_overflow

should be added to match.pd.
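
For reference, a minimal sketch (mine, not taken from the PR) of the two source
forms such a match.pd rule would have to treat as equivalent:

--cut here--
#include <limits.h>

/* Open-coded overflow check that should be recognized...  */
unsigned
add_sat_open (unsigned a, unsigned b)
{
  if (a > UINT_MAX - b)
    return UINT_MAX;
  return a + b;
}

/* ... and the equivalent form using the overflow builtin directly.  */
unsigned
add_sat_builtin (unsigned a, unsigned b)
{
  unsigned sum;
  if (__builtin_uadd_overflow (a, b, &sum))
    return UINT_MAX;
  return sum;
}
--cut here--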

[Bug target/96273] ice in extract_insn, at recog.c:2294, unrecognizable insn:

2020-07-22 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96273

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Uroš Bizjak  ---
Already fixed.

[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-07-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

Uroš Bizjak  changed:

   What|Removed |Added

  Attachment #48756|0   |1
is obsolete||
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Ever confirmed|0   |1
   Last reconfirmed||2020-07-20

--- Comment #11 from Uroš Bizjak  ---
Created attachment 48897
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48897&action=edit
Proposed patch

Patch in testing.

[Bug tree-optimization/96226] Failure to optimize shift+not to rotate

2020-07-17 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96226

--- Comment #1 from Uroš Bizjak  ---
The combine produces:

Trying 7, 8 -> 9:
7: r89:SI=0x1
8: {r88:SI=r89:SI<<...
[...]

There is a similar pattern "*<insn><mode>3_mask" in i386.md:

  [(set (match_operand:SWI48 0 "nonimmediate_operand")
        (any_rotate:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand")
          (subreg:QI
            (and:SI
              (match_operand:SI 2 "register_operand" "c")
              (match_operand:SI 3 "const_int_operand")) 0)))
   (clobber (reg:CC FLAGS_REG))]
  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)
   && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1"

However, the above pattern doesn't allow an immediate operand.
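
For reference, a reduced example (my reconstruction, not quoted from the PR) of
the kind of source involved:

--cut here--
/* ~(1u << (n & 31)) is all-ones except bit (n & 31), i.e. a rotate-left of
   ~1u by n; matching it would need the masked-count rotate pattern above to
   accept an immediate for the rotated operand.  */
unsigned
clear_bit_mask (unsigned n)
{
  return ~(1u << (n & 31));
}
--cut here--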

[Bug target/96189] Failure to use eflags from cmpxchg on x86

2020-07-16 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96189

Uroš Bizjak  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org
 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #5 from Uroš Bizjak  ---
Hm...

Please note that peephole2 scanning require exact RTL sequences, and already
fails for e.g.:

_Bool
foo (unsigned int *x, unsigned int z)
{
  unsigned int y = 0;
  __atomic_compare_exchange_n (x, , z, 0, __ATOMIC_RELAXED,
__ATOMIC_RELAXED);
  return y == 0;
}

(which is used in a couple of places throughout glibc), due to early peephole2
optimization that converts:

(insn 7 4 8 2 (set (reg:SI 0 ax [90])
(const_int 0 [0])) "cmpx0.c":5:3 75 {*movsi_internal}

to:

(insn 31 4 8 2 (parallel [
(set (reg:DI 0 ax [90])
(const_int 0 [0]))
(clobber (reg:CC 17 flags))

Other than that, the required sequence is broken quite often by various
reloads, due to the complexity of CMPXCHG insn.

However, __atomic_compare_exchange_n returns a boolean value that is exactly
what the first function is testing, so the following two functions are
equivalent:

--cut here--
_Bool
foo (unsigned int *x, unsigned int y, unsigned int z)
{
  unsigned int old_y = y;
  __atomic_compare_exchange_n (x, , z, 0, __ATOMIC_RELAXED,
__ATOMIC_RELAXED);
  return y == old_y;
}

_Bool
bar (unsigned int *x, unsigned int y, unsigned int z)
{
  return __atomic_compare_exchange_n (x, , z, 0, __ATOMIC_RELAXED,
__ATOMIC_RELAXED);
}
--cut here--

I wonder, if the above transformation can happen on the tree level, so it would
apply universally for all targets, and would also handle CMPXCHG[8,16]B
doubleword instructions on x86 targets.

Let's ask experts.

[Bug target/96189] Failure to use eflags from cmpxchg on x86

2020-07-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96189

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|--- |11.0
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Uroš Bizjak  ---
Implemented for gcc-11.

[Bug target/96189] Failure to use eflags from cmpxchg on x86

2020-07-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96189

--- Comment #3 from Uroš Bizjak  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:6c2848ad02feef5ac094d1158be3861819b3bb49

commit r11-2140-g6c2848ad02feef5ac094d1158be3861819b3bb49
Author: Uros Bizjak 
Date:   Wed Jul 15 21:27:00 2020 +0200

i386: Introduce peephole2 to use flags from CMPXCHG more [PR96189]

CMPXCHG instruction sets ZF flag if the values in the destination operand
and EAX register are equal; otherwise the ZF flag is cleared and value
from destination operand is loaded to EAX. Following assembly:

movl    %esi, %eax
lock cmpxchgl   %edx, (%rdi)
cmpl    %esi, %eax
sete    %al

can be optimized by removing the unneeded comparison, since set ZF flag
signals that no update to EAX happened.

2020-07-15  Uroš Bizjak  

gcc/ChangeLog:
PR target/95355
* config/i386/sync.md
(peephole2 to remove unneeded compare after CMPXCHG): New pattern.

gcc/testsuite/ChangeLog:
PR target/95355
* gcc.target/i386/pr96189.c: New test.

[Bug target/95355] [11 Regression] Assembler messages: Error: operand size mismatch for `vpmovzxbd' with -masm=intel since r11-485-gf6e40195ec3d3b402a5f6c58dbf359479bc4cbfa

2020-07-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95355

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|11.0|10.2
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #10 from Uroš Bizjak  ---
(In reply to Martin Liška from comment #5)
> Can the bug be marked as resolved?

Yes, this particular problem is fixed for gcc-10.2+.

[Bug target/95355] [11 Regression] Assembler messages: Error: operand size mismatch for `vpmovzxbd' with -masm=intel since r11-485-gf6e40195ec3d3b402a5f6c58dbf359479bc4cbfa

2020-07-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95355

--- Comment #9 from Uroš Bizjak  ---
(In reply to CVS Commits from comment #8)
> The master branch has been updated by Uros Bizjak :

Bah. Wrong PR reference, should be PR96189.

[Bug target/96189] Failure to use eflags from cmpxchg on x86

2020-07-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96189

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2020-07-15
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #2 from Uroš Bizjak  ---
Created attachment 48877
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48877&action=edit
Prototype patch

Introduce a peephole2 pattern that removes the comparison in certain cases.
Doubleword cmpxchg is not handled; the doubleword comparison sequence is just
too complicated at this late stage of compilation.

[Bug target/96062] Partial register stall caused by avoidable use of SETcc, and useless MOVZBL

2020-07-05 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96062

--- Comment #1 from Uroš Bizjak  ---
(In reply to Joseph C. Sible from comment #0)
> Consider this C code:
> 
> long ps4_syscall0(long n) {
> long ret;
> int carry;
> __asm__ __volatile__(
> "syscall"
> : "=a"(ret), "=@ccc"(carry)
> : "a"(n)
> : "rcx", "r8", "r9", "r10", "r11", "memory"
> );
> return carry ? -ret : ret;
> }
> 
> With "-O3", it results in this assembly:
> 
> ps4_syscall0:
> movq    %rdi, %rax
> syscall
> setc    %dl
> movq    %rax, %rdi
> movzbl  %dl, %edx
> negq    %rdi
> testl   %edx, %edx
> cmovne  %rdi, %rax
> ret
> 
> On modern Intel CPUs, doing "setc %dl" creates a false dependency on rdx.
> Doing "movzbl %dl, %edx" doesn't do anything to fix that. Here's some ways
> that we could improve this code, without having to fall back to a
> conditional branch:
> 
> 1. Get rid of "movzbl %dl, %edx" (since it doesn't help), and then do "testb
> %dl, %dl" instead of "testl %edx, %edx".

Just declare "_Bool carry". There is no need for int.

[Bug tree-optimization/95839] Failure to optimize addition of vector elements to vector addition

2020-06-24 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95839

--- Comment #2 from Uroš Bizjak  ---
What I find interesting is a similar case with the division instead of the
addition. Clang compiles it to:

divps   %xmm1, %xmm0
retq

Considering that we have [a0, a1, 0, 0] / [b0, b1, 0, 0], this will surely fire
an invalid-operation exception. I have explicitly avoided generating division
using the 4-element DIVPS for v2sf operands exactly due to this issue.

[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-06-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

--- Comment #10 from Uroš Bizjak  ---
Created attachment 48756
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48756=edit
Proposed patch

Patch in testing, survives GOMP testcases.

On a related note, the patch uses TARGET_USE_XCHG_FOR_ATOMIC_STORE, which
should probably be renamed to something more appropriate.

[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-06-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

--- Comment #9 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #8)

> The culprit is the %esp here, that adds the 0x67 prefix to the insn and
> will only work if %rsp is below 4GB.

Ah, indeed... I was in a bit of a hurry and didn't notice.

[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-06-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

--- Comment #7 from Uroš Bizjak  ---
Actually, x86_64 (at least my Fedora 32) does not like operations on the stack:

Starting program: /sdd/uros/git/gcc/gcc/testsuite/gcc.dg/atomic/a.out 

Program received signal SIGSEGV, Segmentation fault.
0x0040110a in main ()
(gdb) disass
Dump of assembler code for function main:
   0x00401106 <+0>:     push   %rbp
   0x00401107 <+1>:     mov    %rsp,%rbp
=> 0x0040110a <+4>:     lock orq $0x0,(%esp)
   0x00401111 <+11>:    mov    $0x0,%eax
   0x00401116 <+16>:    pop    %rbp
   0x00401117 <+17>:    retq
End of assembler dump.

I didn't investigate further, but the 32-bit executable works OK.

[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-06-18 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

--- Comment #4 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #3)
> How about the following patch:

Surely, mfence_nosse also needs to be enabled for
TARGET_USE_XCHG_FOR_ATOMIC_STORE.

> This will generate "lock orl $0, (%rsp)" instead of mfence.

Please also read [1] on why we avoid -4(%%esp).

[1] https://gcc.gnu.org/pipermail/gcc-patches/2017-February/469630.html

[Bug target/95750] [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

2020-06-18 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750

--- Comment #3 from Uroš Bizjak  ---
How about the following patch:

--cut here--
diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
index 9ab5456b227..7d9442d45b7 100644
--- a/gcc/config/i386/sync.md
+++ b/gcc/config/i386/sync.md
@@ -117,10 +117,11 @@
   rtx (*mfence_insn)(rtx);
   rtx mem;

-  if (TARGET_64BIT || TARGET_SSE2)
-   mfence_insn = gen_mfence_sse2;
-  else
+  if (!(TARGET_64BIT || TARGET_SSE2)
+ || TARGET_USE_XCHG_FOR_ATOMIC_STORE)
mfence_insn = gen_mfence_nosse;
+  else
+   mfence_insn = gen_mfence_sse2;

   mem = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
   MEM_VOLATILE_P (mem) = 1;
--cut here--

This will generate "lock orl $0, (%rsp)" instead of mfence.
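
A minimal example of the affected code (my sketch, not from the PR): with the
patch above, on targets where gen_mfence_nosse is selected this is expected to
expand to "lock orl $0, (%rsp)" rather than mfence.

--cut here--
void
fence_seq_cst (void)
{
  __atomic_thread_fence (__ATOMIC_SEQ_CST);
}
--cut here--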

[Bug target/95632] Redundant zero extension

2020-06-16 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95632

--- Comment #6 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Mel Chen from comment #2)
> > Is it possible to pretend that we have a pattern that can match xor (reg:SI
> > 80), (reg: SI 72), 0xa001 in combine pass?
> > And then, if the constant part is too large to put in to the immediate part,
> > it can be split to 2 xor in split pass.
> 
> Please note that the combine pass has its own (rather limited) splitter; it
> is documented in the second part of the "Defining How to Split Instructions"
> section. The example deals with an instruction whose immediate part is too
> large, and looks similar to your problem.

Oh, I missed the discussion above. In this case, x86 implements pre-reload
splits; please see the patterns decorated with the ix86_pre_reload_split
condition.

[Bug target/95632] Redundant zero extension

2020-06-16 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95632

--- Comment #5 from Uroš Bizjak  ---
(In reply to Mel Chen from comment #2)
> Is it possible to pretend that we have a pattern that can match xor (reg:SI
> 80), (reg: SI 72), 0xa001 in combine pass?
> And then, if the constant part is too large to put in to the immediate part,
> it can be split to 2 xor in split pass.

Please note that the combine pass has its own (rather limited) splitter; it is
documented in the second part of the "Defining How to Split Instructions"
section. The example deals with an instruction whose immediate part is too
large, and looks similar to your problem.

[Bug target/95652] GCC 8.3.1 generates syntactically incorrect assembly code with -masm=intel

2020-06-12 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95652

--- Comment #6 from Uroš Bizjak  ---
(In reply to Teo Samarzija from comment #5)
> (In reply to Uroš Bizjak from comment #4)
> > (In reply to Teo Samarzija from comment #3)
> > > Besides, how does CLANG compile that same code fine, also under Linux and
> > > with "-masm=intel"? Maybe you can copy the way CLANG does that.
> > 
> > Clang outputs:
> > 
> > mov dword ptr [rip + eax], 1088421888
> > 
> > which gld does't understand:
> > 
> > pr95652.s: Assembler messages:
> > pr95652.s:17: Error: `dword ptr [rip+eax]' is not a valid base/index
> > expression
> 
> Which version of CLANG? CLANG 9.0 compiles the code in the opening post
> correctly (under Linux and using Intel Syntax).

Try to assemble the produced assembly with gas.

[Bug target/95652] GCC 8.3.1 generates syntactically incorrect assembly code with -masm=intel

2020-06-12 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95652

--- Comment #4 from Uroš Bizjak  ---
(In reply to Teo Samarzija from comment #3)
> Besides, how does CLANG compile that same code fine, also under Linux and
> with "-masm=intel"? Maybe you can copy the way CLANG does that.

Clang outputs:

mov dword ptr [rip + eax], 1088421888

which gld doesn't understand:

pr95652.s: Assembler messages:
pr95652.s:17: Error: `dword ptr [rip+eax]' is not a valid base/index expression

[Bug target/95531] Failure to use TZCNT for __builtin_ffs

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95531

--- Comment #4 from Uroš Bizjak  ---
(In reply to Segher Boessenkool from comment #3)
> What is the question?  4+4 = 16?

Ah, indeed - the question is why combine changes CCCmode compare to CCZmode
compare.

[Bug middle-end/95528] [10/11 Regression] internal compiler error: in emit_move_insn, at expr.c:3814

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

--- Comment #3 from Uroš Bizjak  ---
Introduced by [1].

[1] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=bea408857a7d

[Bug middle-end/95528] [10/11 Regression] internal compiler error: in emit_move_insn, at expr.c:3814

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2020-06-04
 Ever confirmed|0   |1

--- Comment #2 from Uroš Bizjak  ---
Something is wrong with the handling of:

(define_expand "vec_pack_trunc_"
  [(set (match_operand: 0 "register_operand")
(ior:
  (ashift:
(zero_extend:
  (match_operand:SWI24 2 "register_operand"))
(match_dup 3))
  (zero_extend:
(match_operand:SWI24 1 "register_operand"]
  "TARGET_AVX512BW"
{
  operands[3] = GEN_INT (GET_MODE_BITSIZE (mode));
})

Disabling this expander "fixes" the testcase.

"-O2 -mavx512bw" is needed.

[Bug target/95531] Failure to use TZCNT for __builtin_ffs

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95531

Uroš Bizjak  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #2 from Uroš Bizjak  ---
Adding Segher to CC.

[Bug target/95531] Failure to use TZCNT for __builtin_ffs

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95531

--- Comment #1 from Uroš Bizjak  ---
So, what kind of sorcery is this?

(insn 2 4 3 2 (set (reg/v:SI 83 [ x ])
(reg:SI 5 di [ x ])) "pr95531.c":2:1 67 {*movsi_internal}
 (expr_list:REG_DEAD (reg:SI 5 di [ x ])
(nil)))

(...)

(insn 7 6 8 2 (parallel [
(set (reg:CCC 17 flags)
(compare:CCC (reg/v:SI 83 [ x ])
(const_int 0 [0])))
(set (reg:SI 82 [ <retval> ])
(ctz:SI (reg/v:SI 83 [ x ])))
]) "pr95531.c":3:12 786 {*tzcntsi_1}
 (expr_list:REG_DEAD (reg/v:SI 83 [ x ])
(nil)))
(insn 8 7 9 2 (set (reg:SI 82 [ <retval> ])
(if_then_else:SI (eq (reg:CCC 17 flags)
(const_int 0 [0]))
(reg:SI 84)
(reg:SI 82 [ <retval> ]))) "pr95531.c":3:12 1029 {*movsicc_noc}
 (expr_list:REG_DEAD (reg:SI 84)
(expr_list:REG_DEAD (reg:CCC 17 flags)
(nil

Combine does:

Trying 2 -> 7:
2: r83:SI=r86:SI
  REG_DEAD r86:SI
7: {flags:CCC=cmp(r83:SI,0);r82:SI=ctz(r83:SI);}
  REG_DEAD r83:SI
Successfully matched this instruction:
(parallel [
(set (reg:CCZ 17 flags)
(compare:CCZ (reg:SI 86)
(const_int 0 [0])))
(set (reg:SI 82 [ <retval> ])
(ctz:SI (reg:SI 86)))
])
Successfully matched this instruction:
(set (reg:SI 82 [ <retval> ])
(if_then_else:SI (eq (reg:CCZ 17 flags)
(const_int 0 [0]))
(reg:SI 84)
(reg:SI 82 [ <retval> ])))
allowing combination of insns 2 and 7
original costs 4 + 4 = 16
replacement cost 12
deferring deletion of insn with uid = 2.
modifying other_insn 8: r82:SI={(flags:CCZ==0)?r84:SI:r82:SI}
  REG_DEAD r84:SI
  REG_DEAD flags:CCC
deferring rescan insn with uid = 8.
modifying insn i3 7: {flags:CCZ=cmp(r86:SI,0);r82:SI=ctz(r86:SI);}
  REG_DEAD r86:SI
deferring rescan insn with uid = 7.

Combine changes CCCmode comparison to CCZmode.

[Bug target/95529] Failure to reuse flags generated by TZCNT for cmovcc on BMI-capable targets

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95529

--- Comment #6 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #5)

> > We can do a peephole that would convert REP BSF + TEST to BSF. However, on
> > BMI capable targets, REP BSF decodes as TZCNT, so the question is if one BSF
> > is faster than TZCNT + TEST?
> I would expect so, yes.

Not universally. While Intel is agnostic to either insn, on Ryzen:

TZCNT: latency 2, reciprocal throughput 0.5
BSF:   latency 3, reciprocal throughput 3

> With -mbmi and TZCNT we could also use the carry flag to elide the test.

We would have to change the mode of the flags reg on a follow up flags user
(CMOVE, *movsicc_noc) from reg:CCZ to reg:CCC. This can't be done during
combine.

> > (Please note that the conversion to CMOVE comes a bit late in the pass
> > sequence, so we can't convert TZCNT + TEST + CMOVE to TZCNT + CMOVC.)

[Bug target/95529] Failure to reuse flags generated by TZCNT for cmovcc on BMI-capable targets

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95529

Uroš Bizjak  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-06-04
 Status|UNCONFIRMED |NEW

--- Comment #4 from Uroš Bizjak  ---
(In reply to Gabriel Ravier from comment #2)
> If using `-mbmi`, shouldn't GCC be able to assume the target is, in fact, a
> BMI-capable CPU ? I understand that this bug report may be invalid for `bsf`
> (which would mean Clang has invalid behaviour, which seems odd but ok), but
> should I reopen this report/make a new report for `-mbmi` ?

We can do a peephole that would convert REP BSF + TEST to BSF. However, on BMI
capable targets, REP BSF decodes as TZCNT, so the question is if one BSF is
faster than TZCNT + TEST?

(Please note that the conversion to CMOVE comes a bit late in the pass
sequence, so we can't convert TZCNT + TEST + CMOVE to TZCNT + CMOVC.)

[Bug target/95529] Failure to reuse flags generated by trailing zeros instruction for cmovcc

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95529

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |WONTFIX
 Status|UNCONFIRMED |RESOLVED

--- Comment #1 from Uroš Bizjak  ---
Please note that "REP BSF" on BMI capable targets decodes as TZCNT, which is
not equal to BSF as far as flags are concerned. BSF sets zero flag on zero
input, where TZCNT sets carry flag on zero input.

GCC emits TZCNT by default on the premise that the same binary can benefit from
TZCNT runing faster than BSF on BMI capable targets. Unfortunatelly, flags
can't be reused due to the difference, explained above.
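
For illustration, a minimal sketch (my example, not the PR testcase) of the
shape of code involved; the x == 0 selection is what a CMOVcc reusing the
BSF/TZCNT flags would have to cover:

--cut here--
/* With BSF the zero-input case shows up in ZF, with TZCNT in CF, so the flags
   produced by the trailing-zero insn cannot feed the same CMOVcc condition
   for both encodings.  */
unsigned
ctz_or_default (unsigned x, unsigned def)
{
  return x ? __builtin_ctz (x) : def;
}
--cut here--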

[Bug target/95525] Bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95525

Uroš Bizjak  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Target Milestone|--- |10.2
   Last reconfirmed||2020-06-04
 Status|UNCONFIRMED |NEW

[Bug target/95525] Bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG

2020-06-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95525

--- Comment #1 from Uroš Bizjak  ---
PTA_WAITPKG is currently unused. Please just move PTA_WAITPKG and the
subsequent bits one place to the left.

[Bug libfortran/95418] [11 Regression] Static assert going off on MinGW

2020-06-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95418

Uroš Bizjak  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |FIXED
 Target||x86_64-w64-mingw32

--- Comment #11 from Uroš Bizjak  ---
Fixed.

[Bug target/95435] bad builtin memcpy performance with znver1/znver2 and 32bit

2020-06-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435

--- Comment #9 from Uroš Bizjak  ---
(In reply to Alexander Monakov from comment #8)
> There's no tuning tables for memcmp at all, existing structs cover only
> memset and memcpy. So as far as I see retuning memset/memcpy doesn't need to
> wait for [1], because there's no infrastructure in place for memcmp tuning,
> and adding that can be done independently. Updating Ryzen tables would not
> touch any code updated by H.J.Lu's patchset at all.

Agreed.

[Bug target/95435] bad builtin memcpy performance with znver1/znver2 and 32bit

2020-06-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435

Uroš Bizjak  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com

--- Comment #7 from Uroš Bizjak  ---
I think that stringops (including memcmp) for x86 targets should be retuned for
the new glibc, once [1] is approved and committed. Please note that currently
bench-stringop doesn't benchmark memcmp.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546919.html

[Bug target/95237] LOCAL_DECL_ALIGNMENT shrinks alignment, FAIL gcc.target/i386/pr69454-2.c

2020-06-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95237

--- Comment #3 from Uroš Bizjak  ---
ICEs are "fixed" by the first hunk, the testcase in Comment #0 by the second:

--cut here--
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 060e2df62ea..cd7abaf7e04 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16752,6 +16752,7 @@ ix86_local_alignment (tree exp, machine_mode mode,
   decl = NULL;
 }

+#if 0
   /* Don't do dynamic stack realignment for long long objects with
  -mpreferred-stack-boundary=2.  */
   if (!TARGET_64BIT
@@ -16761,6 +16762,7 @@ ix86_local_alignment (tree exp, machine_mode mode,
   && (!type || !TYPE_USER_ALIGN (type))
   && (!decl || !DECL_USER_ALIGN (decl)))
 align = 32;
+#endif

   /* If TYPE is NULL, we are allocating a stack slot for caller-save
  register in MODE.  We will return the largest alignment of XF
@@ -16868,6 +16870,7 @@ ix86_minimum_alignment (tree exp, machine_mode mode,
   if (TARGET_64BIT || align != 64 || ix86_preferred_stack_boundary >= 64)
 return align;

+#if 0
   /* Don't do dynamic stack realignment for long long objects with
  -mpreferred-stack-boundary=2.  */
   if ((mode == DImode || (type && TYPE_MODE (type) == DImode))
@@ -16877,6 +16880,7 @@ ix86_minimum_alignment (tree exp, machine_mode mode,
   gcc_checking_assert (!TARGET_STV);
   return 32;
 }
+#endif

   return align;
 }
--cut here--

[Bug libfortran/95418] [11 Regression] Static assert going off on MinGW

2020-05-31 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95418

--- Comment #7 from Uroš Bizjak  ---
Created attachment 48649
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48649&action=edit
Untested patch.

Can someone with access to a MinGW target please test the attached patch?

The layout is defined by the hardware, and gcc_struct reflects this layout.

BTW: I also doubt that defining _FP_STRUCT_LAYOUT in sfp-machine.h has any
effect; we have to use __attribute__ ((gcc_struct)) directly on the structure
definition.
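
I.e., something along these lines (a sketch only; the field layout here is
illustrative, not the real fpu-387.h structure):

--cut here--
/* Apply the attribute directly on the structure definition, instead of
   relying on a _FP_STRUCT_LAYOUT-style macro defined elsewhere.  */
typedef struct
{
  unsigned short cw;
  unsigned short sw;
} __attribute__ ((gcc_struct)) my_x87_env;
--cut here--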

[Bug libfortran/95418] [11 Regression] Static assert going off on MinGW

2020-05-31 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95418

--- Comment #6 from Uroš Bizjak  ---
(In reply to Thomas Koenig from comment #3)
> Adding the author of the patch.
> 
> Uros: I find no discussion of this patch on the fortran mailing list.
> Please remember to do so in the future if you touch the libgfortran
> or gcc/fortran directories.

Thomas,

Contrary to my other libgfortran contribution, I was under the impression that
the patch touches only deep architectural details of the x87 chip, and should
be (and in fact is) independent of libgfortran implementation.

I would like to point out that the part, referred in Comment #4 unifies the
structure definition with the ones in libgcc soft-fp and libatomic. So, if this
change turns out to be problematic for MinGW, then the existing definitions in
libgcc in libatomic are wrong as well. Actually, libgcc sfp-machine.h defines:

#ifdef __MINGW32__
  /* Make sure we are using gnu-style bitfield handling.  */
#define _FP_STRUCT_LAYOUT  __attribute__ ((gcc_struct))
#endif

which should probably be added to libgfortran fpu-387.h (and libatomic fenv.c).

[Bug target/95355] [11 Regression] Assembler messages: Error: operand size mismatch for `vpmovzxbd' with -masm=intel since r11-485-gf6e40195ec3d3b402a5f6c58dbf359479bc4cbfa

2020-05-27 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95355

--- Comment #2 from Uroš Bizjak  ---
This is a pre-existing problem.

There are a couple of wrong %q modifiers in vpmov* insn templates:

--cut here--
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index fde65391d7d..1cf1b8cea3b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17559,7 +17559,7 @@
(any_extend:V16SI
  (match_operand:V16QI 1 "nonimmediate_operand" "vm")))]
   "TARGET_AVX512F"
-  "vpmovbd\t{%1, %0|%0, %q1}"
+  "vpmovbd\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssemov")
(set_attr "prefix" "evex")
(set_attr "mode" "XI")])
@@ -17935,7 +17935,7 @@
(any_extend:V8DI
  (match_operand:V8HI 1 "nonimmediate_operand" "vm")))]
   "TARGET_AVX512F"
-  "vpmovwq\t{%1, %0|%0, %q1}"
+  "vpmovwq\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssemov")
(set_attr "prefix" "evex")
(set_attr "mode" "XI")])
--cut here--

[Bug target/95211] [11 Regression] ICE in emit_unop_insn, at optabs.c:3622

2020-05-25 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95211

--- Comment #5 from Uroš Bizjak  ---
This testcase is fixed by [1]

[1] https://gcc.gnu.org/pipermail/gcc-patches/2020-May/546408.html

[Bug target/95255] [10/11 Regression] ICE in gen_roundevendf2, at config/i386/i386.md:16328 since r10-2809-gd3b92f35d84f44a8

2020-05-24 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95255

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Uroš Bizjak  ---
Fixed for gcc-10.2+.

[Bug target/95255] [10/11 Regression] ICE in gen_roundevendf2, at config/i386/i386.md:16328 since r10-2809-gd3b92f35d84f44a8

2020-05-22 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95255

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #3 from Uroš Bizjak  ---
Created attachment 48580
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48580&action=edit
patch in testing.

[Bug target/95125] Unoptimal code for vectorized conversions

2020-05-22 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

Uroš Bizjak  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #6 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Uroš Bizjak from comment #3)
> > It turns out that a bunch of patterns have to be renamed (and testcases
> > added).
> > 
> > Easyhack, waiting for someone to show some love to conversion patterns in
> > sse.md.
> 
> expander for floatv4siv4df2, fix_truncv4dfv4si2 already exists.
> 
> if change **float_double fix_double** to
> ---
> void
> float_double (void)
> {
> d[0] = i[0];
> d[1] = i[1];
> d[2] = i[2];
> d[3] = i[3];
> }

Hm, the above is vectorized, but the equivalent:

void
float_double (void)
{
  for (int n = 0; n < 4; n++)
d[n] = i[n];
}

is not?

[Bug target/95169] [10/11 Regression] i386 comparison between nan and 0.0 triggers Invalid operation exception

2020-05-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95169

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Uroš Bizjak  ---
Fixed for gcc-10.2+.

[Bug target/95256] [11 Regression] ICE in convert_move, at expr.c:278 since r11-263-g7c355156aa20eaec

2020-05-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95256

Uroš Bizjak  changed:

   What|Removed |Added

 Depends on||95125
 CC||crazylht at gmail dot com

--- Comment #3 from Uroš Bizjak  ---
This PR and PR95211 will be fixed by the patch for PR95125. Here, the offending
pattern should be renamed to something like

avx512dq_fix<fixunssuffix>_truncv2sfv2di2

and an expander should be added:

(define_expand "fix_truncv2sfv2di2"
  [(set (match_operand:V2DI 0 "register_operand")
(any_fix:V2DI
  (match_operand:V2SF 1 "nonimmediate_operand")))]
  "TARGET_AVX512DQ && TARGET_AVX512VL"
{
  if (!MEM_P (operands[1]))
{
  operands[1] = simplify_gen_subreg (V4SFmode, operands[1], V2SFmode, 0);
  emit_insn (gen_avx512dq_fix<fixunssuffix>_truncv2sfv2di2 (operands[0],
                                                            operands[1]));
  DONE;
}
})


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125
[Bug 95125] Unoptimal code for vectorized conversions

[Bug target/95256] [11 Regression] ICE in convert_move, at expr.c:278 since r11-263-g7c355156aa20eaec

2020-05-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95256

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |11.0

--- Comment #2 from Uroš Bizjak  ---
The patch exposes another case of a named pattern with a non-conforming
operand mode.

This time, it is:

(define_insn "fix_truncv2sfv2di2"
  [(set (match_operand:V2DI 0 "register_operand" "=v")
(any_fix:V2DI
  (vec_select:V2SF
(match_operand:V4SF 1 "nonimmediate_operand" "vm")
(parallel [(const_int 0) (const_int 1)]]
  "TARGET_AVX512DQ && TARGET_AVX512VL"
  "vcvttps2qq\t{%1, %0|%0, %q1}"
  [(set_attr "type" "ssecvt")
   (set_attr "prefix" "evex")
   (set_attr "mode" "TI")])

[Bug target/95229] [11 Regression] in mark_jump_label_1

2020-05-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95229

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #10 from Uroš Bizjak  ---
Fixed. (thanks Martin for adjusting the testcase!).

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2020-05-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 95229, which changed state.

Bug 95229 Summary: [11 Regression] in mark_jump_label_1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95229

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/95255] [10/11 Regression] ICE in gen_roundevendf2, at config/i386/i386.md:16328 since r10-2809-gd3b92f35d84f44a8

2020-05-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95255

--- Comment #1 from Uroš Bizjak  ---
Please post your compile flags.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

--- Comment #18 from Uroš Bizjak  ---
Created attachment 48575
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48575=edit
Patch in testing.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

--- Comment #17 from Uroš Bizjak  ---
The problem is with the commutative operands; these somehow confuse the
postreload pass.

I'll commit a partial revert that basically puts back:

 (define_insn_and_split "*<code><mode>2"
-  [(set (match_operand:VF 0 "register_operand" "=x,v")
+  [(set (match_operand:VF 0 "register_operand" "=x,x,v,v")
(absneg:VF
- (match_operand:VF 1 "vector_operand" "%0,v")))
-   (use (match_operand:VF 2 "vector_operand" "xBm,vm"))]
+ (match_operand:VF 1 "vector_operand" "0,xBm,v,m")))
+   (use (match_operand:VF 2 "vector_operand" "xBm,0,vm,v"))]

with manual swapping of operands.

[Bug target/95238] [11 Regression] Invalid *pushsi2_rex64

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95238

--- Comment #4 from Uroš Bizjak  ---
(In reply to H.J. Lu from comment #3)
> (In reply to Uroš Bizjak from comment #2)
> > (In reply to H.J. Lu from comment #1)
> > > The "i" constraint shouldn't be used for flag_pic since symbolic constant
> > > leads to writable text in 32-bit mode and invalid in 64-bit mode.
> > 
> > Just a typo. "i" should be changed back to "e".
> 
> There are other "ri" in push patterns.  The 32 bit linker won't complain
> but will add DT_TEXTREL for "push $symbol" when generating shared object.

(define_insn "*push2"
  [(set (match_operand:DWI 0 "push_operand" "=<,<")
(match_operand:DWI 1 "general_no_elim_operand" "riF*o,*v"))]

This will never match a symbol, so "i" can be "n" as well.

;; For TARGET_64BIT we always round up to 8 bytes.
(define_insn "*pushsi2_rex64"
  [(set (match_operand:SI 0 "push_operand" "=X,X")
(match_operand:SI 1 "nonmemory_no_elim_operand" "re,*v"))]

This is changed to "e", as was before.

(define_insn "*pushsi2"
  [(set (match_operand:SI 0 "push_operand" "=<,<")
(match_operand:SI 1 "general_no_elim_operand" "ri*m,*v"))]

This was "i" before my patch.

(define_insn "*push2_prologue"
  [(set (match_operand:W 0 "push_operand" "=<")
(match_operand:W 1 "general_no_elim_operand" "r*m"))

This one is in effect "e".

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

--- Comment #15 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #13)
> So perhaps pre-reload splitter of that into the UNSPEC form?

Vector insns should be able to use a pre-reload splitter, but scalar
instructions depend on a post-reload splitter, because they are split in
different ways depending on the register set of the allocated register
(FP, XMM or even integer).

So, it really needs to be a post-reload splitter.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

Uroš Bizjak  changed:

   What|Removed |Added

 CC|uros at gcc dot gnu.org|

--- Comment #14 from Uroš Bizjak  ---
It's interesting to note that only *some* of the insns with memory uses get
removed.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

--- Comment #12 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #11)
> Note a 'use' is not something that needs to be preserved, so
> 
> (define_insn_and_split "*2"
>   [(set (match_operand:VF 0 "register_operand" "=x,v")
> (absneg:VF
>   (match_operand:VF 1 "vector_operand" "%0,v")))
>(use (match_operand:VF 2 "vector_operand" "xBm,vm"))]
>   "TARGET_SSE"
>   "#"
>   "&& reload_completed"
>   [(set (match_dup 0)
> (<absneg_op>:VF (match_dup 1) (match_dup 2)))]
>   ""
>   [(set_attr "isa" "noavx,avx")])
> 
> doesn't make much sense (before reload).  To me, that is.  Why do
> we go that obfuscated way at all?  I think a clean solution is to
> use an UNSPEC here (well, "clean"...).

The reason for this approach was that combine still processes the pattern as
abs/neg. Please see how *nabstf2_1 is defined.

[Bug target/95238] [11 Regression] Invalid *pushsi2_rex64

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95238

--- Comment #2 from Uroš Bizjak  ---
(In reply to H.J. Lu from comment #1)
> The "i" constraint shouldn't be used for flag_pic since symbolic constant
> leads to writable text in 32-bit mode and invalid in 64-bit mode.

Just a typo. "i" should be changed back to "e".

[Bug target/95229] [11 Regression] in mark_jump_label_1

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95229

--- Comment #7 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #6)

> That fixes the testcase.  But simplify_subreg is used in a lot more places
> so leaving to Uros to match up with expectations.

Oh, yes... We don't have hard regs here, so all of them should be changed to
simplify_gen_subreg. I have a patch.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

Uroš Bizjak  changed:

   What|Removed |Added

 CC||law at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org

--- Comment #9 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #7)
> Ooh, yes :(
> 
> '(use X)'
>  Represents the use of the value of X.  It indicates that the value
>  in X at this point in the program is needed, even though it may not
>  be apparent why this is so.  Therefore, the compiler will not
>  attempt to delete previous instructions whose only effect is to
>  store a value in X.  X must be a 'reg' expression.
> 
> Partial revert is in works.

Actually, no. The above applies to a standalone (use ...) RTX, not to a
(use ...) as part of a parallel. There are plenty of uses of memory_operand in
i386.md:

(define_insn "fix_truncdi_i387"
  [(set (match_operand:DI 0 "nonimmediate_operand" "=m")
(fix:DI (match_operand 1 "register_operand" "f")))
   (use (match_operand:HI 2 "memory_operand" "m"))
   (use (match_operand:HI 3 "memory_operand" "m"))
   (clobber (match_scratch:XF 4 "=&f"))]

Let's ask experts.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #7 from Uroš Bizjak  ---
Ooh, yes :(

'(use X)'
 Represents the use of the value of X.  It indicates that the value
 in X at this point in the program is needed, even though it may not
 be apparent why this is so.  Therefore, the compiler will not
 attempt to delete previous instructions whose only effect is to
 store a value in X.  X must be a 'reg' expression.

Partial revert is in works.

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

--- Comment #6 from Uroš Bizjak  ---
I think I found the issue.

Before the patch, we had:

(insn 375 373 2574 7 (parallel [
(set (reg:V4DF 21 xmm1 [orig:1681 vect__45.441 ] [1681])
(neg:V4DF (mem/c:V4DF (plus:DI (reg/f:DI 7 sp)
(const_int 160 [0xa0])) [3 %sfp+-1184 S32 A256])))
(use (reg:V4DF 20 xmm0 [3332]))
]) "fma_1.h":20:10 1487 {*negv4df2}
 (nil))

after the patch, reload is free to create:

(insn 375 3216 2578 7 (parallel [
(set (reg:V4DF 21 xmm1 [orig:1681 vect__45.441 ] [1681])
(neg:V4DF (reg:V4DF 20 xmm0 [3332])))
(use (mem/c:V4DF (plus:DI (reg/f:DI 7 sp)
(const_int 160 [0xa0])) [3 %sfp+-1184 S32 A256]))
]) "fma_1.h":20:10 1487 {*negv4df2}
 (nil))

which the postreload pass does not like, and simply deletes:

deleting insn with uid = 375.

Just like that. No substitution whatsoever.

So, is there some limitation with the (use) RTX, such that we can't have a
memory operand here?

[Bug target/95218] [11 Regression] FAIL: gcc.target/i386/fma_run_double_1.c execution test since r11-455-g94f687bd9ae37ece

2020-05-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95218

--- Comment #4 from Uroš Bizjak  ---
(In reply to Martin Liška from comment #3)
> Started with r11-455-g94f687bd9ae37ece.

It is not obvious from the referenced patch what is going wrong here.

Unfortunately, I have no FMA-capable machine; can someone please isolate one
small test that fails?

[Bug target/95211] [11 Regression] ICE in emit_unop_insn, at optabs.c:3622

2020-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95211

Uroš Bizjak  changed:

   What|Removed |Added

 Status|WAITING |NEW
 CC||jakub at gcc dot gnu.org

--- Comment #3 from Uroš Bizjak  ---
Confirmed.

sse.md includes a named pattern defined with non-conforming operands:

(define_expand "floatv2div2sf2"
  [(set (match_operand:V4SF 0 "register_operand" "=v")
(vec_concat:V4SF
(any_float:V2SF (match_operand:V2DI 1 "nonimmediate_operand" "vm"))
(match_dup 2)))]
  "TARGET_AVX512DQ && TARGET_AVX512VL"
  "operands[2] = CONST0_RTX (V2SFmode);")

V2SF vectorization now triggers this expander.

CC author.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2020-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 89386, which changed state.

Bug 89386 Summary: Generation of vectorized MULHRS (Multiply High with Round 
and Scale) instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89386

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/89386] Generation of vectorized MULHRS (Multiply High with Round and Scale) instruction

2020-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89386

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |10.0

--- Comment #3 from Uroš Bizjak  ---
Fixed also for x86 targets.

[Bug target/82261] x86: missing peephole for SHLD / SHRD

2020-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

--- Comment #3 from Uroš Bizjak  ---
(In reply to Michael Clark from comment #2)
> Just refreshing this issue. I found it while testing some code-gen on
> Godbolt:

The combiner creates:

Failed to match this instruction:
(parallel [
(set (reg:SI 89)
(ior:SI (ashift:SI (reg:SI 94)
(subreg:QI (reg/v:SI 88 [ n ]) 0))
(lshiftrt:SI (reg:SI 95)
(minus:QI (subreg:QI (reg:SI 91) 0)
(subreg:QI (reg/v:SI 88 [ n ]) 0)
(clobber (reg:CC 17 flags))
])

This is *almost* matched by:

(define_insn "x86_shld"
  [(set (match_operand:SI 0 "nonimmediate_operand" "+r*m")
(ior:SI (ashift:SI (match_dup 0)
  (match_operand:QI 2 "nonmemory_operand" "Ic"))
(lshiftrt:SI (match_operand:SI 1 "register_operand" "r")
  (minus:QI (const_int 32) (match_dup 2)))))
   (clobber (reg:CC FLAGS_REG))]

but the RTL combiner doesn't propagate (const_int 32) into the pattern.

I wonder if the tree combiner can help here.
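
For reference, a C form (my reduction, not quoted from the PR) of the kind of
code that produces the RTL above:

--cut here--
/* A 32-bit double shift by a variable count (assumed 1 <= n <= 31); the ior
   of the two shifts, with the (32 - n) counterpart, is what the x86_shld
   pattern above should match once (const_int 32) is propagated.  */
unsigned
shld32 (unsigned hi, unsigned lo, unsigned n)
{
  return (hi << n) | (lo >> (32 - n));
}
--cut here--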

[Bug tree-optimization/95201] New: Some x86 vector-extend patterns are not exercised.

2020-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95201

Bug ID: 95201
   Summary: Some x86 vector-extend patterns are not exercised.
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

Some of the x86 vector-extend patterns are not exercised by the middle end.
Currently, they are XFAILed in gcc.target/i386/pr92658-*.c:


pr92658-avx2.c:/* { dg-final { scan-assembler-times "pmovzxbq" 2 { xfail *-*-* } } } */
pr92658-sse4.c:/* { dg-final { scan-assembler-times "pmovzxbd" 2 { xfail *-*-* } } } */
pr92658-sse4.c:/* { dg-final { scan-assembler-times "pmovzxbq" 2 { xfail *-*-* } } } */
pr92658-sse4.c:/* { dg-final { scan-assembler-times "pmovzxwq" 2 { xfail *-*-* } } } */

These correspond to:

-O2 -ftree-vectorize -mavx2 is required:

--cut here--
typedef unsigned char v32qi __attribute__((vector_size (32)));
typedef unsigned short v16hi __attribute__((vector_size (32)));
typedef unsigned int v8si __attribute__((vector_size (32)));
typedef unsigned long long v4di __attribute__((vector_size (32)));

void
foo_u8_u64 (v4di * dst, v32qi * __restrict src)
{
  unsigned long long tem[4];
  tem[0] = (*src)[0];
  tem[1] = (*src)[1];
  tem[2] = (*src)[2];
  tem[3] = (*src)[3];
  dst[0] = *(v4di *) tem;
}

void
bar_u8_u64 (v4di * dst, v32qi src)
{
  unsigned long long tem[4];
  tem[0] = src[0];
  tem[1] = src[1];
  tem[2] = src[2];
  tem[3] = src[3];
  dst[0] = *(v4di *) tem;
}

/* { dg-final { scan-assembler-times "pmovzxbq" 2 { xfail *-*-* } } } */
--cut here--

-O2 -ftree-vectorize -msse4.1 is required:

--cut here--
typedef unsigned char v16qi __attribute__((vector_size (16)));
typedef unsigned short v8hi __attribute__((vector_size (16)));
typedef unsigned int v4si __attribute__((vector_size (16)));
typedef unsigned long long v2di __attribute__((vector_size (16)));

void
foo_u8_u32 (v4si * dst, v16qi * __restrict src)
{
  unsigned int tem[4];
  tem[0] = (*src)[0];
  tem[1] = (*src)[1];
  tem[2] = (*src)[2];
  tem[3] = (*src)[3];
  dst[0] = *(v4si *) tem;
}

void
bar_u8_u32 (v4si * dst, v16qi src)
{
  unsigned int tem[4];
  tem[0] = src[0];
  tem[1] = src[1];
  tem[2] = src[2];
  tem[3] = src[3];
  dst[0] = *(v4si *) tem;
}

/* { dg-final { scan-assembler-times "pmovzxbd" 2 { xfail *-*-* } } } */

void
foo_u8_u64 (v2di * dst, v16qi * __restrict src)
{
  unsigned long long tem[2];
  tem[0] = (*src)[0];
  tem[1] = (*src)[1];
  dst[0] = *(v2di *) tem;
}

void
bar_u8_u64 (v2di * dst, v16qi src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}

/* { dg-final { scan-assembler-times "pmovzxbq" 2 { xfail *-*-* } } } */

void
foo_u16_u64 (v2di * dst, v8hi * __restrict src)
{
  unsigned long long tem[2];
  tem[0] = (*src)[0];
  tem[1] = (*src)[1];
  dst[0] = *(v2di *) tem;
}

void
bar_u16_u64 (v2di * dst, v8hi src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}

/* { dg-final { scan-assembler-times "pmovzxwq" 2 { xfail *-*-* } } } */

Please note that these testcases also fail to vectorize in their loop forms,
e.g.:

--cut here--
void
foo_u8_u64 (v4di * dst, v32qi * __restrict src)
{
  unsigned long long tem[4];

  for (int i = 0; i < 4; i++)
tem[i] = (*src)[i];

  dst[0] = *(v4di *) tem;
}

void
bar_u8_u64 (v4di * dst, v32qi src)
{
  unsigned long long tem[4];

  for (int i = 0; i < 4; i++)
tem[i] = src[i];

  dst[0] = *(v4di *) tem;
}
--cut here--

Please see also PR 92658#c8 for some analysis.

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-19 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

Uroš Bizjak  changed:

   What|Removed |Added

   Keywords||easyhack
   Assignee|ubizjak at gmail dot com   |unassigned at gcc dot gnu.org
 CC||ubizjak at gmail dot com
 Status|ASSIGNED|NEW

--- Comment #15 from Uroš Bizjak  ---
I will leave truncations (Down Converts in Intel speak) which are AVX512F
instructions to someone else. It should be easy to add missing patterns and
tests following the example of committed patch.

[Bug target/95169] i386 comparison between nan and 0.0 triggers Invalid operation exception

2020-05-17 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95169

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #6 from Uroš Bizjak  ---
Created attachment 48551
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48551&action=edit
Prototype patch

[Bug target/95169] i386 comparison between nan and 0.0 triggers Invalid operation exception

2020-05-17 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95169

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2020-05-17
   Target Milestone|--- |10.2

--- Comment #5 from Uroš Bizjak  ---
There are two places in i386-expand.c that say:

  /* We may be reversing unordered compare to normal compare, that
 is not valid in general (we may convert non-trapping condition
 to trapping one), however on i386 we currently emit all
 comparisons unordered.  */

This is not the case anymore.

The compilation hits this place.

[Bug target/92658] x86 lacks vector extend / truncate

2020-05-14 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92658

--- Comment #10 from Uroš Bizjak  ---
The patch is ready to be pushed, it is waiting for a decision what to do with
failed cases.

Richi, should this patch move forward (eventually XFAILing failed cases), or do
you plan to look at the fails from the generic vectorizer POV?

[Bug target/95125] Unoptimal code for vectorized conversions

2020-05-14 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

--- Comment #3 from Uroš Bizjak  ---
It turns out that a bunch of patterns have to be renamed (and testcases added).

Easyhack, waiting for someone to show some love to conversion patterns in
sse.md.

[Bug target/95125] Unoptimal code for vectorized conversions

2020-05-14 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

--- Comment #2 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #1)
> ISTR I filed a duplicate 10 years ago or so.  The issue is the vectorizer
> could not handle V4DFmode -> V4SFmode conversions.
> 
> Could, because for SVE we added the capability but this requires
> additional instruction patterns (IIRC I filed a but about this last
> year).  Yep.  PR92658 it is.

Oh... yes. And it is even assigned to me. And there is a patch... ;)

Anyway, I was surprised, since my soon-to-be-committed v2sf-v2df conversion
patch was able to fully vectorize a similar testcase involving double[2] and
float[2], while the code involving four elements compiled to the mess below.

[Bug target/95125] New: Unoptimal code for vectorized conversions

2020-05-14 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95125

Bug ID: 95125
   Summary: Unoptimal code for vectorized conversions
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

Following testcase

--cut here--
float f[4];
double d[4];
int i[4];

void
float_truncate (void)
{
  for (int n = 0; n < 4; n++)
f[n] = d[n];
}

void
float_extend (void)
{
  for (int n = 0; n < 4; n++)
d[n] = f[n];
}

void
float_float (void)
{
  for (int n = 0; n < 4; n++)
f[n] = i[n];
}

void
fix_float (void)
{
  for (int n = 0; n < 4; n++)
i[n] = f[n];
}

void
float_double (void)
{
  for (int n = 0; n < 4; n++)
d[n] = i[n];
}

void
fix_double (void)
{
  for (int n = 0; n < 4; n++)
i[n] = d[n];
}
--cut here--

when compiled with "-O3 -mavx" should result in a single conversion
instruction.

float_truncate:
vxorps  %xmm0, %xmm0, %xmm0
vcvtsd2ss   d+8(%rip), %xmm0, %xmm2
vmovaps %xmm2, %xmm3
vcvtsd2ss   d(%rip), %xmm0, %xmm1
vcvtsd2ss   d+16(%rip), %xmm0, %xmm2
vcvtsd2ss   d+24(%rip), %xmm0, %xmm0
vunpcklps   %xmm0, %xmm2, %xmm2
vunpcklps   %xmm3, %xmm1, %xmm0
vmovlhps%xmm2, %xmm0, %xmm0
vmovaps %xmm0, f(%rip)
ret

float_extend:
vcvtps2pd   f(%rip), %xmm0
vmovapd %xmm0, d(%rip)
vxorps  %xmm0, %xmm0, %xmm0
vmovlps f+8(%rip), %xmm0, %xmm0
vcvtps2pd   %xmm0, %xmm0
vmovapd %xmm0, d+16(%rip)
ret

float_float:
vcvtdq2ps   i(%rip), %xmm0
vmovaps %xmm0, f(%rip)
ret

fix_float:
vcvttps2dq  f(%rip), %xmm0
vmovdqa %xmm0, i(%rip)
ret

float_double:
vcvtdq2pd   i(%rip), %xmm0
vmovapd %xmm0, d(%rip)
vpshufd $238, i(%rip), %xmm0
vcvtdq2pd   %xmm0, %xmm0
vmovapd %xmm0, d+16(%rip)
ret

fix_double:
pushq   %rbp
vmovapd d(%rip), %xmm1
vinsertf128 $0x1, d+16(%rip), %ymm1, %ymm0
movq    %rsp, %rbp
vcvttpd2dqy %ymm0, %xmm0
vmovdqa %xmm0, i(%rip)
vzeroupper
popq    %rbp
ret

Clang manages to emit optimal code.

[Bug target/95083] x86 fp_movcc expansion depends on real_cst sharing

2020-05-13 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95083

--- Comment #2 from Uroš Bizjak  ---
It looks to me that a couple of (scalar) splitters are missing in sse.md.

There is vector

(define_insn_and_split "*_blendv_lt"

Defined as:

  [(set (match_operand:VF_128_256 0 "register_operand" "=Yr,*x,x")
(unspec:VF_128_256
  [(match_operand:VF_128_256 1 "register_operand" "0,0,x")
   (match_operand:VF_128_256 2 "vector_operand" "YrBm,*xBm,xm")
   (lt:VF_128_256
 (match_operand: 3 "register_operand" "Yz,Yz,x")
 (match_operand: 4 "const0_operand" "C,C,C"))]
  UNSPEC_BLENDV))]

(please note const0 operand 4).

Probably a similar pattern that would degrade to MIN/MAX is missing, for
both the vector and the scalar versions.
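
For illustration, a hedged sketch of the kind of scalar FP select this is
about (an assumed shape, not the PR testcase): a selection on a "< 0"
comparison, which maps to a scalar blendv and, in special cases like the one
below, can degrade to MIN/MAX (up to signed-zero/NaN details):

--cut here--
double
clamp_negative (double x)
{
  return x < 0.0 ? 0.0 : x;   /* effectively max (x, 0.0) */
}
--cut here--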

[Bug tree-optimization/95060] vfnmsub132ps is not generated with -ffast-math

2020-05-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95060

Uroš Bizjak  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #1 from Uroš Bizjak  ---
Related to PR86999.

[Bug tree-optimization/95060] New: vfnmsub132ps is not generated with -ffast-math

2020-05-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95060

Bug ID: 95060
   Summary: vfnmsub132ps is not generated with -ffast-math
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

Following testcase:

--cut here--
float r[8], a[8], b[8], c[8];

void
test_fnms (void)
{
  for (int i = 0; i < 8; i++)
r[i] = -(a[i] * b[i]) - c[i];
}
--cut here--

compiles on x86_64 with "-O3 -mfma" to

vmovaps b(%rip), %ymm0
vmovaps c(%rip), %ymm1
vfnmsub132psa(%rip), %ymm1, %ymm0
vmovaps %ymm0, r(%rip)
vzeroupper
ret

However, when -ffast-math is added, the negation gets moved out of the insn:

vmovaps b(%rip), %ymm0
vmovaps c(%rip), %ymm1
vfmadd132ps a(%rip), %ymm1, %ymm0
->  vxorps  .LC0(%rip), %ymm0, %ymm0
vmovaps %ymm0, r(%rip)
vzeroupper
ret
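
For reference, the -ffast-math code above corresponds to computing the FMA
first and negating afterwards, i.e. (a sketch of the mathematically
equivalent source, not taken from the PR):

--cut here--
float r[8], a[8], b[8], c[8];   /* same arrays as in the testcase above */

void
test_fnms_fast (void)
{
  for (int i = 0; i < 8; i++)
    r[i] = -(a[i] * b[i] + c[i]);   /* same value as -(a[i] * b[i]) - c[i] */
}
--cut here--

so the missed optimization is recognizing that the negation can be folded
back into the multiply-add to form a single FNMS.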

[Bug target/95046] Vectorize V2SFmode operations

2020-05-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95046

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Target Milestone|--- |11.0
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

[Bug target/95046] Vectorize V2SFmode operations

2020-05-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95046

Uroš Bizjak  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-05-11
 Target||x86_64
 Status|UNCONFIRMED |NEW
   Severity|normal  |enhancement

[Bug target/95046] New: Vectorize V2SFmode operations

2020-05-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95046

Bug ID: 95046
   Summary: Vectorize V2SFmode operations
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

The compiler should vectorize V2SF operations using XMM registers.

The same principles that apply to integer MMX operations (mmx-with-sse)
should also apply to V2SFmode operations, but to avoid unwanted secondary
effects (e.g. spurious exceptions) extra care should be taken to load values
into registers with the parts outside of V2SFmode cleared.

Following testcase:

--cut here--
float r[2], a[2], b[2];

void foo (void)
{
  for (int i = 0; i < 2; i++)
r[i] = a[i] + b[i];
}
--cut here--

should vectorize to:

movqa(%rip), %xmm0
movqb(%rip), %xmm1
addps   %xmm1, %xmm0
movlps  %xmm0, r(%rip)

Please note the movq insn, which ensures that the top 64 bits of the 128-bit
xmm register are cleared.
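
A hedged intrinsics sketch of the desired sequence (for illustration only;
it additionally assumes SSE2 for the zero-extending 64-bit integer load,
which is not required by the report itself):

--cut here--
#include <immintrin.h>

float r[2], a[2], b[2];

void
foo_intrin (void)
{
  /* movq loads clear bits 64..127 of the xmm registers.  */
  __m128 va = _mm_castsi128_ps (_mm_loadl_epi64 ((const __m128i *) a));
  __m128 vb = _mm_castsi128_ps (_mm_loadl_epi64 ((const __m128i *) b));
  __m128 vr = _mm_add_ps (va, vb);
  /* movlps stores only the low 64 bits.  */
  _mm_storel_pi ((__m64 *) r, vr);
}
--cut here--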

[Bug target/91188] strict_low_part operations do not work

2020-05-07 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91188

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
   Target Milestone|10.2|10.0
 Resolution|--- |FIXED

--- Comment #6 from Uroš Bizjak  ---
Fixed in gcc-10.

[Bug tree-optimization/94877] Failure to simplify ~(x + 1) to -2 - x

2020-05-06 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94877

--- Comment #4 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #3)
> I'm not sure why this is considered a simplification, two insns vs. two, and
> on the subtraction it isn't specific to just one target, but I think for
> most the constant will need to be forced into register, the immediates the
> instructions have is mostly for the second operand.

Two ALU operations are merged into one, assuming that the move is "free".
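
For illustration, the identity behind the simplification, using the usual
two's complement rule ~y == -y - 1:

  ~(x + 1) == -(x + 1) - 1 == -2 - x

so a hypothetical pair of functions like

--cut here--
int f1 (int x) { return ~(x + 1); }   /* currently: add + not */
int f2 (int x) { return -2 - x; }     /* single subtraction from a constant */
--cut here--

should both end up as a single ALU operation plus a move.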

[Bug tree-optimization/94913] Failure to optimize not+cmp into overflow check

2020-05-05 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94913

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 CC||jakub at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2020-05-05

--- Comment #3 from Uroš Bizjak  ---
(In reply to Gabriel Ravier from comment #1)
> The same thing happens for this code :
> 
> bool f(unsigned x, unsigned y)
> {
> return (x - y - 1) >= x;
> }

This transformation is a job for the tree combiner.

[Bug tree-optimization/94913] Failure to optimize not+cmp into overflow check

2020-05-05 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94913

--- Comment #2 from Uroš Bizjak  ---
Created attachment 48458
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48458&action=edit
Prototype patch

Prototype patch for the missed optimization described in comment #0.

Following testcase:

--cut here--
int
foo (unsigned int x, unsigned int y)
{
  return ~x < y;
}

void f1 (void);
void f2 (void);

void
bar (unsigned int x, unsigned int y)
{
  if (~x < y)
f1 ();
  else
f2 ();
}
--cut here--

compiles (-O2) to:

foo:
xorl%eax, %eax
addl%esi, %edi
setc%al
ret

bar:
addl%esi, %edi
jnc .L4
jmp f1
.L4:
jmp f2
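
For reference, the reasoning behind the transformation: for unsigned x,
~x == UINT_MAX - x, so ~x < y is equivalent to x + y > UINT_MAX, i.e. to
the carry out of x + y.  A hedged sketch of the equivalent builtin form
(for illustration, not part of the patch):

--cut here--
int
foo_builtin (unsigned int x, unsigned int y)
{
  unsigned int sum;
  return __builtin_uadd_overflow (x, y, &sum);
}
--cut here--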

[Bug target/94795] Failure to use fast sbb method on x86 for spreading any set bit to all bits

2020-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94795

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|--- |11.0
 Status|ASSIGNED|RESOLVED
  Component|rtl-optimization|target

--- Comment #6 from Uroš Bizjak  ---
Implemented for gcc-11.

[Bug target/94650] Missed x86-64 peephole optimization: x >= large power of two

2020-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94650

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED
   Target Milestone|--- |11.0

--- Comment #4 from Uroš Bizjak  ---
Implemented for gcc-11.

[Bug tree-optimization/94914] Failure to optimize check of high part of 64-bit result of 32 by 32 multiplication into overflow check

2020-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94914

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
 CC||jakub at gcc dot gnu.org
   Last reconfirmed||2020-05-04

--- Comment #1 from Uroš Bizjak  ---
Confirmed.

llvm:

movl%edi, %eax
xorl%ecx, %ecx
mull%esi
seto%cl
movl%ecx, %eax
retq

gcc:

movl%esi, %esi
movl%edi, %edi
xorl%eax, %eax
imulq   %rsi, %rdi
shrq$32, %rdi
setne   %al
ret

GCC does have overflow-checking arithmetic builtins, but I'm not sure
whether they are currently used for anything other than UBSAN checking.
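
A hedged sketch of the source form that uses the overflow builtin directly
(for illustration; the question here is whether the middle end should
synthesize this from the shift-based check):

--cut here--
_Bool
mul_overflows (unsigned int a, unsigned int b)
{
  unsigned int prod;
  return __builtin_umul_overflow (a, b, &prod);
}
--cut here--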

[Bug tree-optimization/94921] Failure to optimize nots with sub into single add

2020-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94921

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2020-05-04
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #2 from Uroš Bizjak  ---
(In reply to Marc Glisse from comment #1)
> x + y ?

Correct.

llvm:

leal(%rdi,%rsi), %eax
retq

gcc:

notl%edi
subl%esi, %edi
movl%edi, %eax
notl%eax
ret

Confirmed, this looks like a job for the tree combiner.
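
Judging from the code above, the function computes something like
~(~x - y), and with the usual rule ~a == -a - 1 this folds as

  ~(~x - y) == -(~x - y) - 1 == (x + 1) + y - 1 == x + y

so the whole sequence is a single addition.  A hedged reconstruction of the
testcase shape (the PR's exact code may differ):

--cut here--
unsigned int
g (unsigned int x, unsigned int y)
{
  return ~(~x - y);   /* equivalent to x + y */
}
--cut here--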

[Bug c/94902] internal compiler error: output_operand: invalid use of register 'frame'

2020-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94902

--- Comment #3 from Uroš Bizjak  ---
This is the third time I have seen this type of bug report, and I really
don't know what is so magical about register number "19" that everybody
wants to use exactly that register.

If using this register number crashes the compiler, then just don't use it.

[Bug rtl-optimization/94837] Failure to optimize out spurious movbe into bswap

2020-04-29 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94837

--- Comment #5 from Uroš Bizjak  ---
Probably some secondary effect of subregs on register allocation; changing
"float" to "int" in the original testcase gets us the expected alternative
and optimal code using BSWAP.

[Bug rtl-optimization/94837] Failure to optimize out spurious movbe into bswap

2020-04-29 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94837

Uroš Bizjak  changed:

   What|Removed |Added

   Keywords|missed-optimization |ra
 CC||vmakarov at gcc dot gnu.org
   Last reconfirmed||2020-04-29
 Resolution|DUPLICATE   |---
 Status|RESOLVED|NEW
 Ever confirmed|0   |1

--- Comment #4 from Uroš Bizjak  ---
Looks like an RA (tuning?) problem.

We enter reload (-O2 -mmovbe -mtune=intel) with:

(insn 14 4 2 2 (set (reg:SF 87)
(reg:SF 20 xmm0 [ x ])) "pr94837.c":2:1 112 {*movsf_internal}
 (expr_list:REG_DEAD (reg:SF 20 xmm0 [ x ])
(nil)))
(insn 7 6 11 2 (set (subreg:SI (reg:SF 84 [  ]) 0)
(bswap:SI (subreg:SI (reg:SF 87) 0))) "pr94837.c":11:19 869
{*bswapsi2_movbe}
 (expr_list:REG_DEAD (reg:SF 87)
(nil)))
(insn 11 7 12 2 (set (reg/i:SF 20 xmm0)
(reg:SF 84 [  ])) "pr94837.c":12:1 112 {*movsf_internal}
 (expr_list:REG_DEAD (reg:SF 84 [  ])
(nil)))

and this sequence gets reloaded to:

(insn 17 6 7 2 (set (mem/c:SI (plus:DI (reg/f:DI 7 sp)
(const_int -4 [0xfffc])) [1 %sfp+-4 S4 A32])
(reg:SI 20 xmm0 [87])) "pr94837.c":11:19 67 {*movsi_internal}
 (nil))
(insn 7 17 16 2 (set (reg:SI 0 ax [88])
(bswap:SI (mem/c:SI (plus:DI (reg/f:DI 7 sp)
(const_int -4 [0xfffc])) [1 %sfp+-4 S4 A32])))
"pr94837.c":11:19 869 {*bswapsi2_movbe}
 (nil))
(insn 16 7 12 2 (set (reg:SI 20 xmm0 [orig:84  ] [84])
(reg:SI 0 ax [88])) "pr94837.c":11:19 67 {*movsi_internal}
 (nil))

One would expect the register allocator to choose alternative 0 from:

(define_insn "*bswap2_movbe"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=r,r,m")
(bswap:SWI48 (match_operand:SWI48 1 "nonimmediate_operand" "0,m,r")))]
  "TARGET_MOVBE
   && !(MEM_P (operands[0]) && MEM_P (operands[1]))"
  "@
bswap\t%0
movbe{}\t{%1, %0|%0, %1}
movbe{}\t{%1, %0|%0, %1}"

but for some reason this is not the case.
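
For reference, a hedged reconstruction of the testcase shape from the RTL
above (an SFmode argument whose bit pattern is byte-swapped as SImode and
returned as SFmode again; the PR's exact code may differ):

--cut here--
float
f (float x)
{
  unsigned int u;
  __builtin_memcpy (&u, &x, sizeof (u));
  u = __builtin_bswap32 (u);
  __builtin_memcpy (&x, &u, sizeof (x));
  return x;
}
--cut here--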

[Bug rtl-optimization/94838] Failure to optimize out useless zero-ing after register was already zero-ed

2020-04-29 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94838

Uroš Bizjak  changed:

   What|Removed |Added

  Component|target  |rtl-optimization

--- Comment #5 from Uroš Bizjak  ---
(In reply to Gabriel Ravier from comment #0)
> int f(bool b, int *p)
> {
> return b && *p;
> }
> 
> GCC generates this with -O3:
> 
> f(bool, int*):
>   xor eax, eax
>   test dil, dil
>   je .L1
>   mov edx, DWORD PTR [rsi]
>   xor eax, eax ; This can be removed, since eax is already 0 here
>   test edx, edx
>   setne al
> .L1:
>   ret

The first xor is the return value load and the second one comes from the
peephole2 pass that converts:

   11: NOTE_INSN_BASIC_BLOCK 3
   12: flags:CCZ=cmp([si:DI],0)
  REG_DEAD si:DI
   13: NOTE_INSN_DELETED
   32: ax:QI=flags:CCZ!=0
  REG_DEAD flags:CCZ
   33: ax:SI=zero_extend(ax:QI)

to:

   11: NOTE_INSN_BASIC_BLOCK 3
   40: {ax:SI=0;clobber flags:CC;}
   43: dx:SI=[si:DI]
   44: flags:CCZ=cmp(dx:SI,0)
   42: strict_low_part(ax:QI)=flags:CCZ!=0

The follow-up cprop-hardreg pass does not notice that zero is already
loaded into a register.

There is nothing the target-dependent part can do here; a follow-up RTL
hardreg propagation pass should fix this.

[Bug rtl-optimization/94795] Failure to use fast sbb method on x86 for spreading any set bit to all bits

2020-04-27 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94795

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #4 from Uroš Bizjak  ---
Created attachment 48386
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48386&action=edit
Proof of concept patch

Proof of concept patch that implements both suggestions and results in:

negl%edi
sbbl%eax, %eax
ret

for the first case and:

cmpl$1, %edi
sbbl%eax, %eax
ret

for the second.

For the record, the transformation triggers:

- for linux x86_64 defconfig: 338 times neg/sbb and 28 times cmp/sbb

- for GCC bootstrap: 296 times neg/sbb and 1246 times cmp/sbb
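
For reference, hedged reconstructions of the two cases (inferred from the
generated code above; the PR's exact functions may differ):

--cut here--
int
spread_any_set_bit (unsigned int x)
{
  return x ? -1 : 0;    /* neg; sbb */
}

int
spread_if_zero (unsigned int x)
{
  return x ? 0 : -1;    /* cmp $1; sbb */
}
--cut here--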

[Bug rtl-optimization/94795] Failure to use fast sbb method on x86 for spreading any set bit to all bits

2020-04-27 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94795

--- Comment #3 from Uroš Bizjak  ---
(In reply to Gabriel Ravier from comment #2)
> Also, I can also provide this a very similar function for which such an

This optimization could be implemented with a simple combine splitter:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b426c21d3dd..8ea3a4a141a 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17979,6 +18045,18 @@
  (clobber (reg:CC FLAGS_REG))])]
   "operands[2] = GEN_INT (INTVAL (operands[2]) + 1);")

+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+   (neg:SWI48
+ (eq:SWI48
+   (match_operand:SWI 1 "nonimmediate_operand")
+   (const_int 0]
+  ""
+  [(set (reg:CC FLAGS_REG) (compare:CC (match_dup 1) (const_int 1)))
+   (parallel [(set (match_dup 0)
+  (neg:SWI48 (ltu:SWI48 (reg:CC FLAGS_REG) (const_int 0))))
+ (clobber (reg:CC FLAGS_REG))])])
+
 (define_insn "*movcc_noc"
   [(set (match_operand:SWI248 0 "register_operand" "=r,r")
(if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
--cut here--

(QImode and HImode have to be added to the *x86_movcc_0_m1_neg pattern
for the above splitter to also handle QImode and HImode operands.)

[Bug target/94650] Missed x86-64 peephole optimization: x >= large power of two

2020-04-20 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94650

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #2 from Uroš Bizjak  ---
Created attachment 48315
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48315&action=edit
Prototype patch

Using this patch, the following asm is created (-O2):

--cut here--
check:
xorl%eax, %eax
shrq$40, %rdi
setne   %al
ret

test0:
shrq$40, %rdi
jne .L5
ret
.L5:
xorl%edi, %edi
jmp g

test1:
movq%rdi, %rax
shrq$40, %rax
jne .L8
ret
.L8:
jmp g
--cut here--
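
For reference, a hedged reconstruction of the testcase shapes from the asm
above (the threshold 1ULL << 40 matches the shrq $40; the PR's exact code
may differ):

--cut here--
void g (unsigned long long);

int
check (unsigned long long x)
{
  return x >= (1ULL << 40);
}

void
test0 (unsigned long long x)
{
  if (x >= (1ULL << 40))
    g (0);
}

void
test1 (unsigned long long x)
{
  if (x >= (1ULL << 40))
    g (x);
}
--cut here--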

[Bug target/94603] ICE: in extract_insn, at recog.c:2343 (unrecognizable insn) with -mno-sse2 and __builtin_ia32_movq128

2020-04-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94603

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|--- |8.5
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Uroš Bizjak  ---
Fixed for gcc-8.5+.

[Bug target/94603] ICE: in extract_insn, at recog.c:2343 (unrecognizable insn) with -mno-sse2 and __builtin_ia32_movq128

2020-04-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94603

--- Comment #6 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #5)
> (In reply to Uroš Bizjak from comment #4)
> > (In reply to Jakub Jelinek from comment #3)
> > > The testcase will need -msse -mno-sse2.
> > 
> > Yes, but the testcase is invalid, because __builtin_ia32_movq128 should not
> > be used without SSE2. Fixed compiler reports:
> > 
> > pr94603.c: In function ‘foo’:
> > pr94603.c:6:10: warning: implicit declaration of function
> > ‘__builtin_ia32_movq128’; did you mean ‘__builtin_ia32_movntps’?
> > [-Wimplicit-function-declaration]
> > pr94603.c:6:10: error: incompatible types when returning type ‘int’ but ‘V’
> > {aka ‘__vector(2) long long int’} was expected
> 
> I know.  But we (often) include even invalid testcases, perhaps with just
> dg-error "" and dg-warning "" (or use -w too) if we don't care about exact
> wording but just want to verify there is no ICE.

This is the testcase:

--cut here--
/* PR target/94603 */
/* { dg-do compile } */
/* { dg-options "-Wno-implicit-function-declaration -msse -mno-sse2" } */

typedef long long __attribute__ ((__vector_size__ (16))) V;

V
foo (V v)
{
  return __builtin_ia32_movq128 (v);  /* { dg-error "" } */
}
--cut here--

[Bug target/94603] ICE: in extract_insn, at recog.c:2343 (unrecognizable insn) with -mno-sse2 and __builtin_ia32_movq128

2020-04-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94603

--- Comment #4 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #3)
> The testcase will need -msse -mno-sse2.

Yes, but the testcase is invalid, because __builtin_ia32_movq128 should not
be used without SSE2. The fixed compiler reports:

pr94603.c: In function ‘foo’:
pr94603.c:6:10: warning: implicit declaration of function
‘__builtin_ia32_movq128’; did you mean ‘__builtin_ia32_movntps’?
[-Wimplicit-function-declaration]
pr94603.c:6:10: error: incompatible types when returning type ‘int’ but ‘V’
{aka ‘__vector(2) long long int’} was expected

[Bug target/94603] ICE: in extract_insn, at recog.c:2343 (unrecognizable insn) with -mno-sse2 and __builtin_ia32_movq128

2020-04-15 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94603

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #2 from Uroš Bizjak  ---
Created attachment 48278
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48278&action=edit
Patch in testing.

[Bug target/94561] [10 Regression] ICE in ix86_get_ssemov

2020-04-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94561

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2020-04-11
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||hjl.tools at gmail dot com

--- Comment #1 from Uroš Bizjak  ---
Confirmed, CC author.
