[Bug rtl-optimization/64286] New: Redundant extend removal ignores vector element type

2014-12-12 Thread sergos.gnu at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64286

Bug ID: 64286
   Summary: Redundant extend removal ignores vector element type
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: sergos.gnu at gmail dot com

Created attachment 34266
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34266&action=edit
reproducer, taken from public sources

The problem is reproducible starting with 4.9 and also on trunk.


Line 29 contains a load into a V16QI vector

29: p2 = _mm_loadu_si128((__m128i *) (s - 3 * p));

which is later used at

60: work = _mm_or_si128(_mm_subs_epu8(p2, p1), _mm_subs_epu8(p1, p2));

and later zero-extended into a V16HI vector

151: p256_2 = _mm256_cvtepu8_epi16(p2);



In the dump of pass 217 (split2) we have:

(insn 207 204 209 2 (set (reg:V16QI 21 xmm0 [447])
(mem:V16QI (plus:DI (reg/f:DI 6 bp)
(const_int -114 [0xff8e])) [0  S16 A16]))
GCC_Bug.p.c:2609 1136 {*movv16qi_internal}
 (expr_list:REG_EQUIV (mem:V16QI (plus:DI (reg/f:DI 20 frame)
(const_int -66 [0xffbe])) [0  S16 A16])
(nil)))

...

(insn 236 235 238 2 (set (reg:V16QI 22 xmm1 [462])
(us_minus:V16QI (reg:V16QI 23 xmm2 [450])
(reg:V16QI 21 xmm0 [447]))) GCC_Bug.p.c:2925 2096
{*sse2_ussubv16qi3}
 (nil))

... (and a number of other operations using xmm0 as V16QI)

(insn 871 869 873 2 (set (reg:V16HI 21 xmm0 [orig:573 D.17673 ] [573])
(zero_extend:V16HI (reg:V16QI 21 xmm0 [447]))) GCC_Bug.p.c:5280 2521
{avx2_zero_extendv16qiv16hi2}
 (nil))


After that REE reports:

---
Trying to eliminate extension:
(insn 871 869 873 2 (set (reg:V16HI 21 xmm0 [orig:573 D.17673 ] [573])
(zero_extend:V16HI (reg:V16QI 21 xmm0 [447]))) GCC_Bug.p.c:5280 2521
{avx2_zero_extendv16qiv16hi2}
 (nil))
Tentatively merged extension with definition :
(insn 207 204 209 2 (set (reg:V16HI 21 xmm0)
(zero_extend:V16HI (mem:V16QI (plus:DI (reg/f:DI 6 bp)
(const_int -114 [0xff8e])) [0  S16 A16])))
GCC_Bug.p.c:2609 -1
 (nil))
deferring rescan insn with uid = 207.
All merges were successful.
Eliminated the extension.
---


That changes insn 207 to define xmm0 as V16HI, which renders all the V16QI
insns using xmm0 in between invalid.
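To make the effect concrete, here is a minimal sketch in portable C that models
the lanes with plain arrays instead of real AVX2 registers. The helper names
(zero_extend_v16qi, us_minus_v16qi, low_bytes_after_merge,
consumer_unaffected) are hypothetical and only stand in for the RTL operations
shown in the dumps; the low-byte-first lane layout is an assumption matching
x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };

/* Model of zero_extend:V16HI (V16QI): every byte becomes a 16-bit lane. */
static void zero_extend_v16qi(const uint8_t in[LANES], uint16_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = in[i];
}

/* Model of us_minus:V16QI: per-byte unsigned saturating subtraction. */
static void us_minus_v16qi(const uint8_t a[LANES], const uint8_t b[LANES],
                           uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = a[i] > b[i] ? (uint8_t)(a[i] - b[i]) : 0;
}

/* Bytes a V16QI consumer finds in the low 128 bits of the register once the
   load has been turned into a zero-extending V16HI load (assumed layout:
   low byte of each 16-bit lane first). */
static void low_bytes_after_merge(const uint8_t loaded[LANES],
                                  uint8_t seen[LANES])
{
    uint16_t wide[LANES];
    zero_extend_v16qi(loaded, wide);
    for (int i = 0; i < LANES; i++) {
        uint16_t lane = wide[i / 2];
        assert((lane >> 8) == 0); /* zero extension keeps the high byte clear */
        seen[i] = (i % 2 == 0) ? (uint8_t)(lane & 0xff) : (uint8_t)(lane >> 8);
    }
}

/* Returns 1 if a byte-wise consumer (here p1 -us p2) computes the same
   result before and after the merge, 0 otherwise. */
static int consumer_unaffected(const uint8_t p1[LANES], const uint8_t p2[LANES])
{
    uint8_t expected[LANES], seen[LANES], actual[LANES];
    us_minus_v16qi(p1, p2, expected);
    low_bytes_after_merge(p2, seen);
    us_minus_v16qi(p1, seen, actual);
    return memcmp(expected, actual, LANES) == 0;
}
```

In this model, consumer_unaffected() returns 0 for almost any non-zero p2,
because every odd byte seen by the V16QI consumer has become zero.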

The test should be compiled with

gcc -O2 GCC_Bug_min.c -mavx2

and run on an AVX2-enabled platform.

Correct output:
Is valid: 1

Incorrect output:
Is valid: 0


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-11-14 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #12 from Sergey Ostanevich 2012-11-14 18:56:22 UTC ---

Actually, it is not.

I found that PRE did not collect a memory access within the loop, which later
caused the vectorization to be missed. Here are the dumps before the commit
(the good one) and after it (the bad one):

:
pretmp_263 = (integer(kind=8)) ival2_82;
pretmp_264 = pretmp_263 + -1;
pretmp_265 = *xxtrt_46(D)[pretmp_264];

:
# ival2_10 = PHI
# ival2_14 = PHI
# prephitmp_266 = PHI
_83 = (integer(kind=8)) ival2_10;
_84 = _83 + -1;
_85 = *xxtrt_46(D)[_84];
_86 = (integer(kind=8)) ival2_14;
_87 = _86 + -1;
_88 = prephitmp_266;
if (_85 < _88)
  goto ;
else
  goto ;

:
goto ;

:

:
# ival2_15 = PHI
# prephitmp_237 = PHI <_88(90), _85(29)>
ival2_89 = ival2_10 + -1;
if (ival2_10 == ipos1_12)
  goto ;
else
  goto ;

:
goto ;
---
:

:
# ival2_10 = PHI
# ival2_14 = PHI
_83 = (integer(kind=8)) ival2_10;
_84 = _83 + -1;
_85 = *xxtrt_46(D)[_84];
_86 = (integer(kind=8)) ival2_14;
_87 = _86 + -1;
_88 = *xxtrt_46(D)[_87];
if (_85 < _88)
  goto ;
else
  goto ;

:
goto ;

:

:
# ival2_15 = PHI
ival2_89 = ival2_10 + -1;
if (ival2_10 == ipos1_12)
  goto ;
else
  goto ;

:
goto ;
---

So for the loop that starts at bb 28 you can see that the xxtrt_46 access was
not put into a pretmp. A possible reason is exactly what Richard mentioned:
extra candidates were collected and this one became less anticipatable.

Skipping partial partial redundancy for expression
 {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
 not partially anticipated on any to be optimized for speed edges
---
Found partial partial redundancy for expression
 {array_ref,mem_ref<0B>,xxtrt_46(D)}@.MEM_30(D) (0165)
Created phi prephitmp_237 = PHI <_88(90), _85(29)>
 in block 30


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-10-08 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #9 from Sergey Ostanevich 2012-10-08 08:55:25 UTC ---

Thanks for the reduced test, Dominique!

I see that the vectorizer did not manage to generate MIN after the change.
Also, it looks pretty similar to what I posted at first: no prephitmp was
created for the xxtrt_[] access.

> ival2_15 = _85 < prephitmp_266 ? ival2_10 : iva
> prephitmp_237 = MIN_EXPR <_85, prephitmp_266>;
---
< _86 = (integer(kind=8)) ival2_14;
< _87 = _86 + -1;
< _88 = *xxtrt_46(D)[_87];
< ival2_15 = _85 < _88 ? ival2_10 : ival2_14;

I suspect that one of the iterators you removed, possibly VEC_iterate,
traversed more than the one you created?

I also double-checked that for the reduced test MIN is not generated and does
not appear in the assembly. PMU measurements (VTune) confirm that the basic
blocks missing MIN account for the difference in clocks.
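
As a rough C analogue of the two loop shapes in the dumps (not the rnflow
source, which is Fortran), the sketch below contrasts a minimum search that
reloads x[best] on every iteration with one that carries the running minimum
in a scalar, which is the role prephitmp plays in the good dump. The function
names and the double element type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the degraded dump: the current minimum is reloaded from memory
   on every iteration, like `_88 = *xxtrt_46(D)[_87]`. */
static size_t argmin_reload(const double *x, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (x[i] < x[best])
            best = i;
    return best;
}

/* Shape of the good dump: the minimum is carried in a scalar, so the update
   is a plain select plus a MIN, as in
   `prephitmp_237 = MIN_EXPR <_85, prephitmp_266>`. */
static size_t argmin_carried(const double *x, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    double min = x[0];
    for (size_t i = 1; i < n; i++) {
        best = x[i] < min ? i : best;     /* select on the old minimum */
        min = x[i] < min ? x[i] : min;    /* MIN_EXPR <x[i], min> */
    }
    return best;
}
```

Both functions return the same index; only the second form exposes the
minimum as a value that does not depend on a load through the index.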


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #5 from Sergey Ostanevich 2012-09-26 20:07:26 UTC ---

For 093t.pre I see the following missing in the cptrf2 function; the first is
good, the second is degraded:

***
*** 8947,8966
goto ;

:
-   pretmp_325 = (integer(kind=8)) ival2_80;
-   pretmp_326 = pretmp_325 + -1;
-   pretmp_327 = *xxtrt_25(D)[pretmp_326];

:
# ival2_136 = PHI
# ival2_140 = PHI
-   # prephitmp_328 = PHI
_137 = (integer(kind=8)) ival2_136;
_138 = _137 + -1;
_139 = *xxtrt_25(D)[_138];
_141 = (integer(kind=8)) ival2_140;
_142 = _141 + -1;
!   _143 = prephitmp_328;
if (_139 < _143)
  goto ;
else
--- 8838,8853
goto ;

:

:
# ival2_136 = PHI
# ival2_140 = PHI
_137 = (integer(kind=8)) ival2_136;
_138 = _137 + -1;
_139 = *xxtrt_25(D)[_138];
_141 = (integer(kind=8)) ival2_140;
_142 = _141 + -1;
!   _143 = *xxtrt_25(D)[_142];
if (_139 < _143)
  goto ;
else
***

But what surprises me more is that the first diff already appears in
020t.inline_param1:

***
*** 16790,16794
calls:
  dtrti2/26 function not considered for inlining
!   loop depth: 0 freq:1000 size: 9 time: 18 callee size:82 stack:28
  dtrsm/21 function not considered for inlining
loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
--- 16790,16794
calls:
  dtrti2/26 function not considered for inlining
!   loop depth: 0 freq:1000 size: 9 time: 18 callee size:81 stack:28
  dtrsm/21 function not considered for inlining
loop depth: 0 freq:1000 size:16 time: 25 callee size:324 stack: 4
***


[Bug tree-optimization/54717] [4.8 Regression] Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



--- Comment #3 from Sergey Ostanevich 2012-09-26 15:11:38 UTC ---

Adding -### gives the following (only the options part is shown):

/export/users/syostane/pb11/gcc120914/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/f951
air.f90 "-march=corei7" -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt
-mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm
-mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd
-mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx --param
"l1-cache-size=32" --param "l1-cache-line-size=64" --param
"l2-cache-size=12288" "-mtune=corei7" -quiet -dumpbase air.f90 -auxbase air
-fintrinsic-modules-path
/export/users/syostane/pb11/gcc120914/lib/gcc/x86_64-unknown-linux-gnu/4.8.0/finclude
-o /tmp/ccmW82c1.s


[Bug tree-optimization/54717] New: Runtime regression: polyhedron test "rnflow" degraded

2012-09-26 Thread sergos.gnu at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54717



 Bug #: 54717
   Summary: Runtime regression: polyhedron test "rnflow" degraded
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: sergos@gmail.com

commit 024fee2c369096e6fe6cde620243df5843893004
Author: rguenth
Date:   Thu Sep 13 12:43:58 2012 +

2012-09-13  Richard Guenther

* tree-ssa-sccvn.h (enum vn_kind): New.
(vn_get_stmt_kind): Likewise.
* tree-ssa-sccvn.c (vn_get_stmt_kind): New function, adjust
ADDR_EXPR handling.
(visit_use): Use it.
* tree-ssa-pre.c (compute_avail): Likewise, simplify further.

* gcc.dg/tree-ssa/ssa-fre-37.c: New testcase.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@191253
138bc75d-0d04-0410-961f-82ee72b054a4

caused a 20% degradation on polyhedron's "rnflow":

commit 780bedc1ccae5ae85fb99afed8a1ac1cc598121b
Geometric Mean Execution Time =  18.28 seconds

commit 024fee2c369096e6fe6cde620243df5843893004
Geometric Mean Execution Time =  24.82 seconds

Compilation options used:
gfortran -march=native -ffast-math -funroll-loops -O3 -ftree-vectorize %n.f90
-static -o %n

[Bug target/50572] New: unstable performance on Atom due to loop alignment

2011-09-30 Thread sergos.gnu at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50572

 Bug #: 50572
   Summary: unstable performance on Atom due to loop alignment
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: sergos@gmail.com


After monitoring Atom performance on trunk for some time, I figured out that
we have significant (up to 15%) instability because of loop alignment.
Currently we have the following alignments for Atom:

  {&atom_cost, 16, 7, 16, 7, 16}

for

struct ptt
{
  const struct processor_costs *cost;   /* Processor costs */
  const int align_loop; /* Default alignments.  */
  const int align_loop_max_skip;
  const int align_jump;
  const int align_jump_max_skip;
  const int align_func;
};

This means we try to align loops to 16 bytes, but only if that takes no more
than 7 bytes of padding. This 'if' is the source of the instability. For a
reduction loop I observed an almost twofold slowdown because it did not fit
into a 16-byte block after being aligned to only 8 bytes.
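
A simplified model of that rule, written as a C sketch: the helper names
(pad_to, place_loop, crosses_16) are hypothetical, and the fallback to 8-byte
alignment when the 16-byte padding exceeds the limit is an assumption taken
from the "aligned by 8" observation above, not a copy of the assembler logic.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed to bring addr up to a multiple of align
   (align must be a power of two). */
static unsigned pad_to(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}

/* Align to `align` only if that costs at most max_skip bytes of padding;
   otherwise fall back to the weaker `fallback` alignment (assumed). */
static uintptr_t place_loop(uintptr_t addr, unsigned align,
                            unsigned max_skip, unsigned fallback)
{
    unsigned pad = pad_to(addr, align);
    if (pad <= max_skip)
        return addr + pad;
    return addr + pad_to(addr, fallback);
}

/* Returns 1 if a loop body of `size` bytes starting at `start` straddles
   a 16-byte boundary. */
static int crosses_16(uintptr_t start, unsigned size)
{
    assert(size > 0);
    return (start >> 4) != ((start + size - 1) >> 4);
}
```

With the {16, 7} setting, a loop that would start at 0x1009 is moved to 0x1010,
while one at 0x1001 only reaches 0x1008; a hypothetical 12-byte loop body then
fits in one 16-byte block in the first case and straddles two in the second,
although the source code is identical.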

I used the -falign-loops=16 option to measure the code size impact on SPEC2000,
compiling with -m32 -O2 -msse2 -mfpmath=sse -ffast-math -march=atom:

SPEC2000, .text section size

Test      Aligned  Current  Increase  % increase
------------------------------------------------
wupwise    630324   630084       240       0,04%
swim_      602612   602548        64       0,01%
mgrid_     608388   608212       176       0,03%
applu_     641684   641412       272       0,04%
mesa_      941444   938116      3328       0,35%
galgel_    813508   811764      1744       0,21%
art_       437572   437412       160       0,04%
equake_    442228   442084       144       0,03%
facerec    694948   694596       352       0,05%
ammp_      561428   560292      1136       0,20%
lucas_     663236   662948       288       0,04%
fma3d_    1565348  1560228      5120       0,33%
sixtrac   1537844  1534228      3616       0,24%
apsi_      719172   718340       832       0,12%
gzip_      480452   480020       432       0,09%
vpr_       548164   547156      1008       0,18%
cc1_      1554052  1546532      7520       0,49%
mcf_       434036   433908       128       0,03%
crafty_    592084   590836      1248       0,21%
parser_    509476   508276      1200       0,24%
eon_      1189348  1188852       496       0,04%
perlbmk    894292   891268      3024       0,34%
gap_       845636   841124      4512       0,54%
vortex_    969988   968788      1200       0,12%
bzip2_     472596   472260       336       0,07%
twolf_     607140   605044      2096       0,35%

Would it be acceptable to enable -falign-loops=16 under -mtune=atom at -O2?


[Bug middle-end/50315] Regression on Atom after fix #49958

2011-09-15 Thread sergos.gnu at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50315

--- Comment #7 from Sergey Ostanevich 2011-09-15 11:24:27 UTC ---
Richard, I believe your text should read as

> So you can go from (a +no b) +no c to a + no (b + c), dropping overflow
> knowledge on re-association.

And let me rephrase what Joseph said (just to be sure I got the idea):
we have to preserve the overflow semantics at the GIMPLE level to avoid
possible problems during translation into RTL.

Suppose a calculation does not overflow in 32 bits when performed in a
particular order, and we can use either 32-bit or 64-bit operations to perform
it. After reassociation in GIMPLE we can introduce an overflow in 32 bits,
which will lead to a wrong result if we use 64-bit operations.

Being aware of such a situation during translation, we could avoid the error,
but providing this data to the translator requires too much effort (or is even
impossible).

Is that right?
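
To check my understanding with concrete numbers, here is a small C sketch. It
emulates 32-bit wrapping addition in unsigned arithmetic (to stay clear of
undefined behaviour) and compares it with the same addition done in 64 bits.
The helper names are hypothetical, and the values a = INT32_MAX, b = 1, c = -1
are just one example I picked.

```c
#include <assert.h>
#include <stdint.h>

/* 1 if v is representable as a signed 32-bit value. */
static int fits32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* 32-bit addition with wraparound, done in unsigned arithmetic and then
   reinterpreted as a twos-complement signed value. */
static int64_t add32_wrap(int64_t a, int64_t b)
{
    assert(fits32(a) && fits32(b));
    uint32_t u = (uint32_t)a + (uint32_t)b;
    return u >= 0x80000000u ? (int64_t)u - 4294967296LL : (int64_t)u;
}

/* The same addition performed with 64-bit operations on sign-extended
   operands: nothing wraps at 2^31. */
static int64_t add64(int64_t a, int64_t b)
{
    return a + b;
}

/* 1 if the intermediate sum x + y is identical in both widths. */
static int intermediate_agrees(int32_t x, int32_t y)
{
    return add32_wrap(x, y) == add64(x, y);
}
```

For (a + c) + b the intermediate a + c is the same in both widths, while for
the reassociated (a + b) + c the intermediate a + b is INT32_MIN in 32 bits
but 2147483648 in 64 bits, so any use that relies on "no overflow" to pick the
wider operation sees a different value.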


[Bug c/50315] Regression on Atom after fix #49958

2011-09-07 Thread sergos.gnu at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50315

Sergey Ostanevich changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sergos.gnu at gmail dot com

--- Comment #4 from Sergey Ostanevich 2011-09-07 13:56:30 UTC ---
Richard,

Would it be a good idea to have a twos-complement architecture hook? On x86 we
can reassociate, since the architecture itself always behaves as
twos-complement. Introducing such a flag could help with this particular
reassociation and with another one that Ilya Enkovich implemented recently.

What's your opinion?


[Bug target/49206] [4.5/4.6/4.7 Regression] RA failure in spill_failure, at reload1.c:2113

2011-08-22 Thread sergos.gnu at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49206

--- Comment #3 from Sergey Ostanevich 2011-08-22 16:37:54 UTC ---
Is it right that while() is an infinite loop? Can at least some phases rely on
this?