[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-05 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

Martin Liška  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Martin Liška  ---
I fixed the basic problematic, please open a separate for the tricky C++
issues..

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-05 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #11 from Martin Liška  ---
Author: marxin
Date: Wed Sep  5 11:28:49 2018
New Revision: 264124

URL: https://gcc.gnu.org/viewcvs?rev=264124=gcc=rev
Log:
Group switch cases in switch lowering (PR tree-optimization/87205).

2018-09-05  Martin Liska  

PR tree-optimization/87205
* tree-switch-conversion.c (pass_lower_switch::execute):
Group cases for switch statements.
2018-09-05  Martin Liska  

PR tree-optimization/87205
* gcc.dg/tree-ssa/pr87205-2.c: New test.
* gcc.dg/tree-ssa/pr87205.c: New test.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr87205-2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr87205.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-switch-conversion.c

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-05 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #10 from Martin Liška  ---
(In reply to Peter Dimov from comment #8)
> (In reply to Martin Liška from comment #7)
> > I'm not sure here Y are different types here and member access based on
> > the type is distinct.
> 
> Yes, one could argue that, I suppose. But in the `return ((Y<0>*)p)->v;`
> case the member access _is_ lifted outside the jump table. If that's correct
> there, it should also be correct here. :-)

Yes, it's hoisted out (lifted). But it happens very late in optimization
pipeline, that's why we create first jump table and later we isolate the
expression.

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread pdimov at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #9 from Peter Dimov  ---
For more context, see https://godbolt.org/z/SzfpKr

```
#include 

template struct variant
{
std::aligned_union_t<0, T...> storage_;
unsigned index_;
};

template
auto visit( variant& v, F f )
{
switch( v.index_ )
{
case 0: return f( (T0&)v.storage_ );
case 1: return f( (T1&)v.storage_ );
case 2: return f( (T2&)v.storage_ );
case 3: return f( (T3&)v.storage_ );
case 4: return f( (T4&)v.storage_ );
case 5: return f( (T5&)v.storage_ );
default: __builtin_unreachable();
}
}

struct X
{
int v;
};

template struct Y: X
{
};

using V = variant, Y<1>, Y<2>, Y<3>, Y<4>, Y<5>>;

void f( X& );
int g( int );

int h1( V& v )
{
return visit( v, [](X const& x){ return x.v; } );
}

int h2( V& v )
{
return visit( v, [](auto&& x){ return x.v; } );
}

void h3( V& v )
{
return visit( v, [](auto&& x){ f(x); } );
}

int h4( V& v )
{
return visit( v, [](auto&& x){ return g(x.v); } );
}
```

This generates

```
h1(variant, Y<1>, Y<2>, Y<3>, Y<4>, Y<5> >&):
  mov eax, DWORD PTR [rdi]
  ret
h2(variant, Y<1>, Y<2>, Y<3>, Y<4>, Y<5> >&):
  mov eax, DWORD PTR [rdi]
  ret
h3(variant, Y<1>, Y<2>, Y<3>, Y<4>, Y<5> >&):
  cmp DWORD PTR [rdi+4], 5
  jbe .L15
.L15:
  jmp f(X&)
h4(variant, Y<1>, Y<2>, Y<3>, Y<4>, Y<5> >&):
  cmp DWORD PTR [rdi+4], 5
  jbe .L19
.L19:
  mov edi, DWORD PTR [rdi]
  jmp g(int)
```

so the member access is folded in both cases (which is good!), even though the
first occurs through X& and the second through Y&.

I've been unable to determine what makes the optimizations misfire. This code
should in principle be the same as the simplified one, but it isn't.

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread pdimov at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #8 from Peter Dimov  ---
(In reply to Martin Liška from comment #7)
> I'm not sure here Y are different types here and member access based on
> the type is distinct.

Yes, one could argue that, I suppose. But in the `return ((Y<0>*)p)->v;` case
the member access _is_ lifted outside the jump table. If that's correct there,
it should also be correct here. :-)

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #7 from Martin Liška  ---
(In reply to Peter Dimov from comment #5)
> Another:
> 
> ```
> struct X
> {
> int v;
> };
> 
> template struct Y: X
> {
> };
> 
> void f( int v );
> 
> void h( unsigned ix, void* p )
> {
> switch( ix )
> {
> case 0: f( ((Y<0>*)p)->v ); break;
> case 1: f( ((Y<1>*)p)->v ); break;
> case 2: f( ((Y<2>*)p)->v ); break;
> case 3: f( ((Y<3>*)p)->v ); break;
> case 4: f( ((Y<4>*)p)->v ); break;
> case 5: f( ((Y<5>*)p)->v ); break;
> default: __builtin_unreachable();
> }
> }
> ```
> 
> ```
> h(unsigned int, void*):
>   mov edi, edi
>   jmp [QWORD PTR .L4[0+rdi*8]]
> .L4:
>   .quad .L3
>   .quad .L3
>   .quad .L3
>   .quad .L3
>   .quad .L3
>   .quad .L3
> .L3:
>   mov edi, DWORD PTR [rsi]
>   jmp f(int)
> ```
> 
> https://godbolt.org/z/pGVx6W
> 
> This however demonstrates a different problem, so it may need to go into a
> separate bug.

I'm not sure here Y are different types here and member access based on the
type
is distinct.

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #6 from Martin Liška  ---

> ```
> h(unsigned int, void*):
>   cmp edi, 5
>   jbe .L5
> .L5:
>   mov rdi, rsi
>   jmp f(X*)
> ```
> 
> https://godbolt.org/z/2Lh_GZ

Good, my patch can handle that and can generate direct call:

_Z1hjPv:
.LFB0:
.cfi_startproc
movq%rsi, %rdi
jmp _Z1fP1X
.cfi_endproc

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread pdimov at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #5 from Peter Dimov  ---
Another:

```
struct X
{
int v;
};

template struct Y: X
{
};

void f( int v );

void h( unsigned ix, void* p )
{
switch( ix )
{
case 0: f( ((Y<0>*)p)->v ); break;
case 1: f( ((Y<1>*)p)->v ); break;
case 2: f( ((Y<2>*)p)->v ); break;
case 3: f( ((Y<3>*)p)->v ); break;
case 4: f( ((Y<4>*)p)->v ); break;
case 5: f( ((Y<5>*)p)->v ); break;
default: __builtin_unreachable();
}
}
```

```
h(unsigned int, void*):
  mov edi, edi
  jmp [QWORD PTR .L4[0+rdi*8]]
.L4:
  .quad .L3
  .quad .L3
  .quad .L3
  .quad .L3
  .quad .L3
  .quad .L3
.L3:
  mov edi, DWORD PTR [rsi]
  jmp f(int)
```

https://godbolt.org/z/pGVx6W

This however demonstrates a different problem, so it may need to go into a
separate bug.

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread pdimov at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

--- Comment #4 from Peter Dimov  ---
If the code is not the same the jump table is not optimized out and there's no
extra check. But it also happens with code that is not the same on the C++
side, for example:

```
struct X
{
int v;
};

template struct Y: X
{
};

void f( X* x );

void h( unsigned ix, void* p )
{
switch( ix )
{
case 0: f( (Y<0>*)p ); break;
case 1: f( (Y<1>*)p ); break;
case 2: f( (Y<2>*)p ); break;
case 3: f( (Y<3>*)p ); break;
case 4: f( (Y<4>*)p ); break;
case 5: f( (Y<5>*)p ); break;
default: __builtin_unreachable();
}
}
```

```
h(unsigned int, void*):
  cmp edi, 5
  jbe .L5
.L5:
  mov rdi, rsi
  jmp f(X*)
```

https://godbolt.org/z/2Lh_GZ

A variation on the same theme, which demonstrates another kind of missed
optimization:

```
struct X
{
int v;
};

template struct Y: X
{
};

int h( unsigned ix, void* p )
{
switch( ix )
{
case 0: return ((Y<0>*)p)->v;
case 1: return ((Y<1>*)p)->v;
case 2: return ((Y<2>*)p)->v;
case 3: return ((Y<3>*)p)->v;
case 4: return ((Y<4>*)p)->v;
case 5: return ((Y<5>*)p)->v;
default: __builtin_unreachable();
}
}
```

```
h(unsigned int, void*):
  mov edi, edi
  mov eax, DWORD PTR [rsi]
  jmp [QWORD PTR .L4[0+rdi*8]]
.L4:
  .quad .L9
  .quad .L8
  .quad .L7
  .quad .L6
  .quad .L5
  .quad .L3
.L5:
  ret
.L3:
  ret
.L9:
  ret
.L8:
  ret
.L7:
  ret
.L6:
  ret
```

https://godbolt.org/z/lCzlR2

There's a table, so there's no redundant check.

[Bug tree-optimization/87205] Inefficient code generation for switch

2018-09-04 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87205

Martin Liška  changed:

   What|Removed |Added

   Target Milestone|--- |9.0