[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
(In reply to Andrew Pinski from comment #1)
> Pre takes:
>[local count: 1073741824]:
>   # str_9 = PHI 
>   # out_10 = PHI 
>   str_16 = str_9 + 1;
>   c_17 = *str_9;
>   _5 = (sizetype) c_17;
>   _6 = keep_lut_14(D) + _5;
>   inc_18 = *_6;
>   *out_10 = c_17;
>   _7 = (sizetype) inc_18;
>   out_20 = out_10 + _7;
>   if (c_17 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 955630224]:
>   goto ; [100.00%]
> 
> Changes it to:
>[local count: 1073741824]:
>   # str_9 = PHI 
>   # out_10 = PHI 
>   # prephitmp_60 = PHI 
>   str_16 = str_9 + 1;
>   c_17 = *str_9;
>   _5 = (sizetype) c_17;
>   _6 = keep_lut_14(D) + _5;
>   *out_10 = c_17;
>   _7 = (sizetype) prephitmp_60;
>   out_20 = out_10 + _7;
>   if (c_17 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 955630224]:
>   pretmp_55 = MEM[(char *)str_9 + 1B];
>   _57 = (sizetype) pretmp_55;
>   _58 = keep_lut_14(D) + _57;
>   pretmp_59 = *_58;
>   goto ; [100.00%]
> 
> And then ch_vect changes it to (basically undoing the PRE):
>[local count: 850510900]:
> 
>[local count: 955630224]:
>   # str_42 = PHI 
>   # str_41 = PHI 
>   # out_38 = PHI 
>   pretmp_55 = MEM[(char *)str_42 + 1B];
>   _57 = (sizetype) pretmp_55;
>   _58 = keep_lut_14(D) + _57;
>   pretmp_59 = *_58;
>   str_16 = str_41 + 1;
>   c_17 = *str_41;
>   *out_38 = c_17;
>   _7 = (sizetype) pretmp_59;
>   out_20 = out_38 + _7;
>   if (c_17 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
> And then IVOPTS comes around and does:
>[local count: 850510900]:
> 
>[local count: 955630224]:
>   # str_41 = PHI 
>   # out_38 = PHI 
>   pretmp_55 = MEM[(char *)str_41];
>   _57 = (sizetype) pretmp_55;
>   _58 = keep_lut_14(D) + _57;
>   pretmp_59 = *_58;
>   str_16 = str_41 + 1;
>   c_17 = MEM[(char *)str_16 + -1B];
>   *out_38 = c_17;
>   _7 = (sizetype) pretmp_59;
>   out_20 = out_38 + _7;
>   if (c_17 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
> Which is fine except FRE (and DOM), don't recognize the MEM[(char *)str_16 +
> -1B] and MEM[(char *)str_41] being the same.  For FRE, it almost looks like
> TARGET_MEM_REF is not handled 

TARGET_MEM_REF is handled but what is not is "forwarding" the str_16 definition
to the str_16 + -1B TARGET_MEM_REF - vn_reference_maybe_forwprop_address only
forwards to MEM_REFs (see the caller in valueize_refs_1 - the TARGET_MEM_REF
representation is different enough to make a trivial patch impossible).

Now, there's code to fold TARGET_MEM_REF back to MEM_REF, but it doesn't
trigger here as

  /* If possible use a plain MEM_REF instead of a TARGET_MEM_REF.
 ???  As IVOPTs does not follow restrictions to where the base
 pointer may point to create a MEM_REF only if we know that
 base is valid.  */
  if ((TREE_CODE (base) == ADDR_EXPR || TREE_CODE (base) == INTEGER_CST)

I _think_ that IVOPTs is fixed, but I'm not 100% sure ...

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

--- Comment #3 from Andrew Pinski  ---
(In reply to Peter Cordes from comment #2)
> Possibly also related:
Fully unrelated.

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

--- Comment #2 from Peter Cordes  ---
Possibly also related:

With different surrounding code, this loop can compile to asm which has two
useless movz / mov register copies in the loop at -O2 
(https://godbolt.org/z/PTcqzM6q7).  (To set up for entry into the next loop in
over-complicated ways, and doing this in the loop is unnecessary.)


  while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
str++;
  }


.L19:
movzbl  1(%rdi), %edx
addq$1, %rdi
movzbl  %dl, %ecx
movl%edx, %eax
cmpb$0, -120(%rsp,%rcx)
je  .L19

from source

void remove_chars(char *restrict str, const char *restrict remove)
{
  char lut[256] = {0};
  do {
lut[(unsigned char)*remove] = -1;
  }while(*remove++);

/***   Over complicated asm in this loop */
  while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
str++;
  }
  // str points at first char to *not* keep (or the terminating 0)
  const char *in = str;
  char *out = str;
  while (*in)
{
  char mask = lut[(unsigned char)*in];
unsigned char cin = *in, cout = *out;
*out = mask ? cout : cin;
  out += mask + 1;
  in++;
}
  *out = *in;
}

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

Andrew Pinski  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-06-05
 Ever confirmed|0   |1

--- Comment #1 from Andrew Pinski  ---
Pre takes:
   [local count: 1073741824]:
  # str_9 = PHI 
  # out_10 = PHI 
  str_16 = str_9 + 1;
  c_17 = *str_9;
  _5 = (sizetype) c_17;
  _6 = keep_lut_14(D) + _5;
  inc_18 = *_6;
  *out_10 = c_17;
  _7 = (sizetype) inc_18;
  out_20 = out_10 + _7;
  if (c_17 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 955630224]:
  goto ; [100.00%]

Changes it to:
   [local count: 1073741824]:
  # str_9 = PHI 
  # out_10 = PHI 
  # prephitmp_60 = PHI 
  str_16 = str_9 + 1;
  c_17 = *str_9;
  _5 = (sizetype) c_17;
  _6 = keep_lut_14(D) + _5;
  *out_10 = c_17;
  _7 = (sizetype) prephitmp_60;
  out_20 = out_10 + _7;
  if (c_17 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 955630224]:
  pretmp_55 = MEM[(char *)str_9 + 1B];
  _57 = (sizetype) pretmp_55;
  _58 = keep_lut_14(D) + _57;
  pretmp_59 = *_58;
  goto ; [100.00%]

And then ch_vect changes it to (basically undoing the PRE):
   [local count: 850510900]:

   [local count: 955630224]:
  # str_42 = PHI 
  # str_41 = PHI 
  # out_38 = PHI 
  pretmp_55 = MEM[(char *)str_42 + 1B];
  _57 = (sizetype) pretmp_55;
  _58 = keep_lut_14(D) + _57;
  pretmp_59 = *_58;
  str_16 = str_41 + 1;
  c_17 = *str_41;
  *out_38 = c_17;
  _7 = (sizetype) pretmp_59;
  out_20 = out_38 + _7;
  if (c_17 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

And then IVOPTS comes around and does:
   [local count: 850510900]:

   [local count: 955630224]:
  # str_41 = PHI 
  # out_38 = PHI 
  pretmp_55 = MEM[(char *)str_41];
  _57 = (sizetype) pretmp_55;
  _58 = keep_lut_14(D) + _57;
  pretmp_59 = *_58;
  str_16 = str_41 + 1;
  c_17 = MEM[(char *)str_16 + -1B];
  *out_38 = c_17;
  _7 = (sizetype) pretmp_59;
  out_20 = out_38 + _7;
  if (c_17 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

Which is fine except FRE (and DOM), don't recognize the MEM[(char *)str_16 +
-1B] and MEM[(char *)str_41] being the same.  For FRE, it almost looks like
TARGET_MEM_REF is not handled