[Bug tree-optimization/98673] pass fre4 inhibit pass dom3 to create much more optimized code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

Richard Biener changed:

What          |Removed |Added
Known to fail |        |10.2.1
See Also      |        |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
Known to work |        |11.0

--- Comment #6 from Richard Biener ---
So with your testcase on trunk I see for RISCV

        ble     a1,zero,.L2
        li      a6,4
        li      a5,0
        sub     a6,a6,a2
.L5:
        lw      a4,4(a2)
        slli    a7,a5,2
        add     t1,a6,a2
        addi    a5,a5,1
        ble     a4,a0,.L3
        lw      t3,0(a2)
        ble     t3,a0,.L9
.L3:
        addi    a2,a2,4
        bne     a1,a5,.L5

which is fine, same for x86.  This is usually a SSA coalescing issue where
a failed coalesce ends up splitting the backedge and emitting a move there.
I can see the issue on the branch where the problematic one is

;; basic block 4, loop depth 1
;;  pred: 3
;;        7
  # i_57 = PHI <0(3), i_41(7)>
...
;; basic block 7, loop depth 1
;;  pred: 4
;;        5
  i_41 = i_57 + 1;
  ivtmp.14_90 = ivtmp.14_91 + 4;
  if (_6 != i_41)
    goto ; [94.50%]
  else
    goto ; [5.50%]
;;  succ: 4
;;        8

;; basic block 8, loop depth 0
;;  pred: 7
  _87 = (sizetype) i_57;
  _146 = _87 + 2;

which is a use of the pre-increment i_57 on the loop exit edge.  This
inhibits coalescing of i_57 and i_41 causing the copy.  That's exactly
the issue noted in the cited PRs.  There have been patches floating around
re-materializing i_41 + 1 at the point of i_57 to make the coalescing
possible but I think nobody developed them in full.  See the thread
starting at
https://gcc.gnu.org/pipermail/gcc-patches/2018-March/495843.html
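
The liveness problem described in comment #6 can be mimicked at the source
level.  The following is only a rough sketch: the function names
exit_uses_old and exit_uses_new are hypothetical, the loop body is reduced
to the counter alone, and n >= 1 is assumed.  It imitates the shape of the
GIMPLE dump rather than GCC's actual out-of-SSA logic.  In the first
function the old counter value (playing the role of i_57) is still needed
after the loop, so old and new values are live at the same time and a copy
sits on the backedge.  In the second, the exit value is expressed through
the incremented counter (the role of i_41), so a single variable suffices.

```c
#include <stddef.h>

/* Exit block uses the pre-increment value (like i_57).  Both i_old and
   i_new are live at the exit test.  Assumes n >= 1.  */
size_t
exit_uses_old (long n)
{
  long i_old = 0, i_new;
  for (;;)
    {
      i_new = i_old + 1;        /* i_41 = i_57 + 1 */
      if (i_new == n)
        break;                  /* loop exit edge */
      i_old = i_new;            /* the copy on the backedge */
    }
  return (size_t) i_old + 2;    /* _146 = (sizetype) i_57 + 2 */
}

/* Same value computed from the post-increment counter (like i_41).
   The old value is dead after the increment.  Assumes n >= 1.  */
size_t
exit_uses_new (long n)
{
  long i = 0;
  do
    i = i + 1;
  while (i != n);
  return (size_t) i + 1;        /* (i_41 - 1) + 2 == i_41 + 1 */
}
```

Both functions return the same value for every n >= 1; only the live
ranges of the counter differ, which is the property that matters for
coalescing.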
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673
--- Comment #5 from jojo ---
Sorry for the late reply :)
Please test with the following C test case:
long
YTableLookup (long xValue, long xEntries, const long *xAxis,
              const long *yTable)
{
  int i;
  long xDelta;
  long outValue;
  for (i = 0; i < (xEntries - 1); i++)
    {
      if ((xValue < xAxis[i + 1]) && (xValue >= xAxis[i]))
        break;
    }
  if (i == (xEntries - 1))
    xValue = xAxis[i];
  xDelta = (long) ((xValue - xAxis[i]) * 1000) / (xAxis[i + 1] - xAxis[i]);
  outValue = (((((long) 1000 - xDelta) * (long) yTable[i]) / 1000) +
              ((xDelta * (long) yTable[i + 1]) / 1000));
  return outValue;
}
RISC-V cc1 options: -O2 -march=rv32gc -mabi=ilp32d
==================================================
YTableLookup:
        addi    a1,a1,-1
        ble     a1,zero,.L2
        li      a7,4
        li      a4,0
        sub     a7,a7,a2
        j       .L5
.L6:
        mv      a4,a5
.L5:
        lw      a6,4(a2)
        addi    a5,a4,1
        add     t1,a7,a2
        ble     a6,a0,.L3
        lw      t3,0(a2)
        slli    t4,a4,2
        ble     t3,a0,.L9
.L3:
        addi    a2,a2,4
        bne     a1,a5,.L6
        addi    a4,a4,2
        slli    t1,a4,2
        addi    t4,t1,-4
        add     t4,a3,t4
        li      a1,0
        li      a5,1000
.L4:
Or with x86 cc1 options: -O2 -march=i386
========================================
YTableLookup:
.LFB0:
        pushl   %ebp
.LCFI0:
        pushl   %edi
.LCFI1:
        pushl   %esi
.LCFI2:
        pushl   %ebx
.LCFI3:
        pushl   %ecx
.LCFI4:
        movl    24(%esp), %esi
        movl    32(%esp), %edi
        movl    28(%esp), %eax
        decl    %eax
        testl   %eax, %eax
        jle     .L2
        movl    $4, %ecx
        xorl    %edx, %edx
        jmp     .L5
        .align 4
.L6:
        movl    %ebx, %edx
.L5:
        movl    (%edi,%ecx), %ebx
        cmpl    %esi, %ebx
        jle     .L3
        leal    -4(%ecx), %ebp
        movl    %ebp, (%esp)
        movl    (%edi,%edx,4), %ebp
        cmpl    %esi, %ebp
        jle     .L10
.L3:
        leal    1(%edx), %ebx
        addl    $4, %ecx
        cmpl    %ebx, %eax
        jne     .L6
        leal    8(,%edx,4), %ecx
        movl    36(%esp), %eax
        leal    -4(%eax,%ecx), %esi
        xorl    %eax, %eax
        movl    $1000, %ebx
.L4:
Please check the redundant 'mov' instruction at .L6.
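
Before comparing the generated assembly, it may help to confirm that the
test case computes what is intended, namely a linear interpolation in a
table.  The sketch below repeats the function from comment #5 and adds a
tiny helper.  The sample arrays sample_x and sample_y and the helper
demo_lookup are hypothetical and not part of the bug report, and the
parenthesization of the outValue expression is an assumed reconstruction.
Note that when no interval matches, the function as posted reads
xAxis[i + 1] one element past the table, so the sample call stays inside
the table range.

```c
/* Test case from comment #5.  The grouping of the outValue expression
   is an assumed reconstruction.  */
long
YTableLookup (long xValue, long xEntries, const long *xAxis,
              const long *yTable)
{
  int i;
  long xDelta;
  long outValue;
  for (i = 0; i < (xEntries - 1); i++)
    {
      if ((xValue < xAxis[i + 1]) && (xValue >= xAxis[i]))
        break;
    }
  if (i == (xEntries - 1))
    xValue = xAxis[i];
  xDelta = (long) ((xValue - xAxis[i]) * 1000) / (xAxis[i + 1] - xAxis[i]);
  outValue = (((((long) 1000 - xDelta) * (long) yTable[i]) / 1000) +
              ((xDelta * (long) yTable[i + 1]) / 1000));
  return outValue;
}

/* Hypothetical sample data, not taken from the bug report.  */
static const long sample_x[] = { 0, 10, 20, 30 };
static const long sample_y[] = { 0, 100, 200, 300 };

/* Looks up x = 15, halfway between the entries at 10 and 20.  */
long
demo_lookup (void)
{
  return YTableLookup (15, 4, sample_x, sample_y);
}
```

With this data demo_lookup returns 150, halfway between the table values
100 and 200.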
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673
Richard Biener changed:

What          |Removed |Added
Known to work |        |8.4.0

--- Comment #4 from Richard Biener ---
Please try simplifying the testcase.  I've tried
long loadValue;
const long *engLoad;
float engLoadDelta1;
void foo()
{
  long i1, numXEntries = 50;
  for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
    {
      if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
        {
          break ;
        }
    }
  if( i1 == ( numXEntries - 1 ) )
    {
      loadValue = engLoad[i1] ;
    }
  engLoadDelta1 = (float)( loadValue - engLoad[i1] ) /
                  (float)( engLoad[i1 + 1] - engLoad[i1] ) ;
}
which on x86 doesn't exhibit the issue (same code with GCC 8 and GCC 10).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673
--- Comment #3 from jojo ---
(In reply to jojo from comment #2)
> (In reply to Richard Biener from comment #1)
> > The analysis sounds a bit confused. What is the transform that DOM cannot
> > do after the transform that FRE does? There's some older bug about
> > out-of-SSA coalescing issues with loops and liveness of induction
> > variables but it's not clear if this is related (the assembly doesn't
> > show the loop exit block).
> >
> > Can you name the loop in the source that is problematic?
> >
>
> see this loop:
>
> for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
> {
> if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
> {
> break ;
> }
> }
Richi: Can you please update Known to work?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673
--- Comment #2 from jojo ---
(In reply to Richard Biener from comment #1)
> The analysis sounds a bit confused. What is the transform that DOM cannot
> do after the transform that FRE does? There's some older bug about
> out-of-SSA coalescing issues with loops and liveness of induction
> variables but it's not clear if this is related (the assembly doesn't
> show the loop exit block).
>
> Can you name the loop in the source that is problematic?
>
see this loop:
for( i1 = 0 ; i1 < ( numXEntries - 1 ) ; i1++ )
  {
    if( ( loadValue < engLoad[i1+1] ) && ( loadValue >= engLoad[i1] ) )
      {
        break ;
      }
  }
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98673

Richard Biener changed:

What     |Removed |Added
CC       |        |rguenth at gcc dot gnu.org
Target   |        |riscv
Keywords |        |missed-optimization

--- Comment #1 from Richard Biener ---
The analysis sounds a bit confused.  What is the transform that DOM cannot
do after the transform that FRE does?  There's some older bug about
out-of-SSA coalescing issues with loops and liveness of induction variables
but it's not clear if this is related (the assembly doesn't show the loop
exit block).

Can you name the loop in the source that is problematic?

See PR86270 and PR70359
