[Bug tree-optimization/101301] Improving sparse switch statement

2023-04-03 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

Martin Liška  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|marxin at gcc dot gnu.org  |unassigned at gcc dot 
gnu.org

[Bug tree-optimization/101301] Improving sparse switch statement

2022-11-30 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #11 from CVS Commits  ---
The master branch has been updated by Martin Liska :

https://gcc.gnu.org/g:4fa25a7eb322f0a003c1eb15680c71ece345e01e

commit r13-4409-g4fa25a7eb322f0a003c1eb15680c71ece345e01e
Author: Martin Liska 
Date:   Mon Jan 24 15:45:38 2022 +0100

Improve profile handling in switch lowering.

PR tree-optimization/101301
PR tree-optimization/103680

gcc/ChangeLog:

* tree-switch-conversion.cc (bit_test_cluster::emit):
Handle correctly remaining probability.
(switch_decision_tree::try_switch_expansion): Fix BB's count
where a cluster expansion happens.
(switch_decision_tree::emit_cmp_and_jump_insns): Fill up also
BB count.
(switch_decision_tree::do_jump_if_equal): Likewise.
(switch_decision_tree::emit_case_nodes): Handle special case
for BT expansion which can also fallback to a default BB.
* tree-switch-conversion.h (cluster::cluster): Add
m_default_prob probability.

[Bug tree-optimization/101301] Improving sparse switch statement

2021-08-12 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

Martin Liška  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |marxin at gcc dot 
gnu.org
   Last reconfirmed||2021-08-12
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #10 from Martin Liška  ---
> Ha :-)  Too bad we can only warn "invalid sum of incoming frequencies", it
> still
> happens too often (read: at all) to actually error on it...  that would
> perhaps
> grab people's attention :-)

I can confirm switch lowering is not updating BB counts and I see some other
issues related to edge probabilities. I have a partial fix, but I'm going to
return to it once stage1 is gone.

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-06 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #9 from Segher Boessenkool  ---
(In reply to Thomas Koenig from comment #8)
> (In reply to Segher Boessenkool from comment #7)
>  
> > Yeah :-)  Of course in your testing the nine named cases have the same
> > probability,
> > which is not very true in practice (but is there any better guess possible),
> > and
> > the "default" case has that same probability for GCC (is there a better
> > estimate
> > for that, maybe?)
> 
> I think that we should leave something to do for the hardware branch
> predictors.  Any pattern should lead to better predictions. The test
> case is rather brutal because it is random.

Heh.  My question is if we would get better code if we assumed the default case
is more likely than the other cases, in general, on actual code "in the wild".
There is bound to be some literature about that, hrm...

> > (I expect there just is some typo or thinko somewhere in the pass, fwiw :-) 
> > )
> 
> As I have demonstrated above, such a thinko is rather easy to make :-)

Ha :-)  Too bad we can only warn "invalid sum of incoming frequencies", it
still
happens too often (read: at all) to actually error on it...  that would perhaps
grab people's attention :-)

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #8 from Thomas Koenig  ---
(In reply to Segher Boessenkool from comment #7)

> Yeah :-)  Of course in your testing the nine named cases have the same
> probability,
> which is not very true in practice (but is there any better guess possible),
> and
> the "default" case has that same probability for GCC (is there a better
> estimate
> for that, maybe?)

I think that we should leave something to do for the hardware branch
predictors.  Any pattern should lead to better predictions. The test
case is rather brutal because it is random.

> (I expect there just is some typo or thinko somewhere in the pass, fwiw :-) )

As I have demonstrated above, such a thinko is rather easy to make :-)

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-05 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #7 from Segher Boessenkool  ---
Assuming that you divide the "default" case into ten pieces of 0.01 each,
some slight corrections:

(In reply to Thomas Koenig from comment #6)
> int foo3 (int n)
> {
>   if (__builtin_expect_with_probability (n >= 5, 1, 0.55))
> {
>   if (__builtin_expect_with_probability (n >= 7, 1, 0.33/0.55))
> {
>   if (__builtin_expect_with_probability (n == 7, 1, 0.1/0.33))
> return 7;
>   if (__builtin_expect_with_probability (n == 8, 1, 0.1/0.23))
> return 8;
>   if (__builtin_expect_with_probability (n == 9, 1, 0.1/0.11))
> return 9;

0.1/0.13 for that last one.

> 
>   return 0;
> }
>   else
> {
>   if (__builtin_expect_with_probability (n == 5, 1, 0.1/0.22))
> return 5;
>   if (__builtin_expect_with_probability (n == 6, 1, 0.1/0.11))
> return 6;

0.1/0.12

> 
>   return 0;
> }
> }
>   else
> {
>   if (__builtin_expect_with_probability (n >= 3, 1, 0.22/0.45))
> {
>   if (__builtin_expect_with_probability (n == 3, 1, 0.1/0.22))
> return 3;
>   if (__builtin_expect_with_probability (n == 4, 1, 0.1/0.11))
> return 4;

0.1/0.12

> 
>   return 0;
> }
>   else
> {
>   if (__builtin_expect_with_probability (n == 1, 1, 0.1/0.23))
> return 1;
>   if (__builtin_expect_with_probability (n == 2, 1, 0.1/0.13))
> return 2;
> 
>   return 0;
> }
> }
> }
> 
> the numbers on POWER9 become
> 
> [tkoenig@gcc135 ~]$ gcc -O3 bench.c a.c
> [tkoenig@gcc135 ~]$ ./a.out
> foo: 7.134855
> foo2: 7.842507
> foo3: 6.624406
> [tkoenig@gcc135 ~]$ gcc -mcpu=native -O3 bench.c a.c
> [tkoenig@gcc135 ~]$ ./a.out
> foo: 6.458520
> foo2: 7.696735
> foo3: 6.196469
> 
> where, on a few runs, the differene betweeh foo and foo3 with -mcpu=native
> sometimes disappears and sometimes is larger (gcc135 is not a benchmark
> machine).
> 
> So, I'd say there some advantage in the compiler not lying to itself :-)

Yeah :-)  Of course in your testing the nine named cases have the same
probability,
which is not very true in practice (but is there any better guess possible),
and
the "default" case has that same probability for GCC (is there a better
estimate
for that, maybe?)

(I expect there just is some typo or thinko somewhere in the pass, fwiw :-) )

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #6 from Thomas Koenig  ---
If I create a foo3 function with

int foo3 (int n)
{
  if (__builtin_expect_with_probability (n >= 5, 1, 0.55))
{
  if (__builtin_expect_with_probability (n >= 7, 1, 0.33/0.55))
{
  if (__builtin_expect_with_probability (n == 7, 1, 0.1/0.33))
return 7;
  if (__builtin_expect_with_probability (n == 8, 1, 0.1/0.23))
return 8;
  if (__builtin_expect_with_probability (n == 9, 1, 0.1/0.11))
return 9;

  return 0;
}
  else
{
  if (__builtin_expect_with_probability (n == 5, 1, 0.1/0.22))
return 5;
  if (__builtin_expect_with_probability (n == 6, 1, 0.1/0.11))
return 6;

  return 0;
}
}
  else
{
  if (__builtin_expect_with_probability (n >= 3, 1, 0.22/0.45))
{
  if (__builtin_expect_with_probability (n == 3, 1, 0.1/0.22))
return 3;
  if (__builtin_expect_with_probability (n == 4, 1, 0.1/0.11))
return 4;

  return 0;
}
  else
{
  if (__builtin_expect_with_probability (n == 1, 1, 0.1/0.23))
return 1;
  if (__builtin_expect_with_probability (n == 2, 1, 0.1/0.13))
return 2;

  return 0;
}
}
}

the numbers on POWER9 become

[tkoenig@gcc135 ~]$ gcc -O3 bench.c a.c
[tkoenig@gcc135 ~]$ ./a.out
foo: 7.134855
foo2: 7.842507
foo3: 6.624406
[tkoenig@gcc135 ~]$ gcc -mcpu=native -O3 bench.c a.c
[tkoenig@gcc135 ~]$ ./a.out
foo: 6.458520
foo2: 7.696735
foo3: 6.196469

where, on a few runs, the differene betweeh foo and foo3 with -mcpu=native
sometimes disappears and sometimes is larger (gcc135 is not a benchmark
machine).

So, I'd say there some advantage in the compiler not lying to itself :-)

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-04 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #5 from Segher Boessenkool  ---
Looking at -fdump-tree-all-all, things are fine with "foo" until the
switchlower pass.  Before that all nine switch cases and the default
case all had probability 0.10; after the switch conversion there are
many basic blocks with "Invalid sum of incoming counts", and not
everything has the same probability any more.

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #4 from Thomas Koenig  ---
(In reply to Segher Boessenkool from comment #3)
> What CPU is in your Power system, and what -mcpu= did you compile with?
> What is the ABI you are using?  (That last one doesn't seem to matter in
> this particular case, but :-) )

The machine is gcc135 from the gcc compile farm. cat /proc/cpuinfo says

processor   : 0
cpu : POWER9, altivec supported
clock   : 3800.00MHz
revision: 2.2 (pvr 004e 1202)

...

I compiled without -mcpu.

With -mcpu=native, the times are

[tkoenig@gcc135 ~]$ gcc -mcpu=native -O3  bench.c a.c && ./a.out
foo: 5.679413
foo2: 6.799938

[tkoenig@gcc135 ~]$ gcc -mcpu=native -O3 -DFULL bench.c a.c && ./a.out
foo: 5.713161
foo2: 6.137598

so -mcpu=native makes a difference there.  The version of the compiler is

gcc-Version 12.0.0 20210701 (experimental) (GCC) 

HTH.

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-03 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #3 from Segher Boessenkool  ---
What CPU is in your Power system, and what -mcpu= did you compile with?
What is the ABI you are using?  (That last one doesn't seem to matter in
this particular case, but :-) )

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

Thomas Koenig  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
Some benchmarks are interesting, and it seems that things are a lot less clear
than I thought.

a.c contains the code in the original test. bench.c has


#include 
#include 
#include 

#define N 1024
#define TIMES (1024*1024)

int foo(int);
int foo2(int);

const int nums[] =
#ifdef FULL
  {  5000, 1, 15000, 2, 25000, 3, 35000, 4, 5,
5, 6, 6, 7, 7, 8, 8, 9, 9,
 10};
#else
  {  1, 2, 3, 4, 5, 6, 7, 8, 9 };
#endif

double this_time ()
{
  struct timeval tv;
  gettimeofday (, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
  double t1, t2;
  int *x;
  int r;
  int s;
  x = malloc(N * sizeof(*x));
  for (int i=0; i:
  e0:   ce 56 83 2c cmpwi   cr1,r3,2
  e4:   00 00 80 3c lis r4,0
  e8:   00 00 a0 38 li  r5,0
  ec:   02 00 c0 38 li  r6,2
  f0:   01 00 e0 38 li  r7,1
  f4:   01 00 00 3d lis r8,1
  f8:   67 2b 83 2a cmplwi  cr5,r3,1
  fc:   9c ad 03 28 cmplwi  r3,4
 100:   35 82 89 60 ori r9,r4,3
 104:   04 00 40 39 li  r10,4
 108:   6a 04 08 61 ori r8,r8,1130
 10c:   03 d9 84 60 ori r4,r4,5
 110:   9e 29 c6 7c iselr6,r6,r5,6
 114:   35 82 03 2b cmplwi  cr6,r3,3
 118:   00 48 83 7c cmpwcr1,r3,r9
 11c:   9e 35 c7 7c iselr6,r7,r6,22
 120:   03 00 e0 38 li  r7,3
 124:   01 00 69 6c xoris   r9,r3,1
 128:   9e 28 4a 7d iseleq  r10,r10,r5
 12c:   40 40 03 7c cmplw   r3,r8
 130:   66 2b 08 39 addir8,r8,0
 134:   d1 2f 89 2a cmplwi  cr5,r9,12241
 138:   9e 56 e7 7c iselr7,r7,r10,26
 13c:   06 00 40 39 li  r10,6
 140:   9f 86 09 2b cmplwi  cr6,r9,34463
 144:   38 5b 89 2b cmplwi  cr7,r9,23352
 148:   08 00 20 39 li  r9,8
 14c:   9e 28 4a 7d iseleq  r10,r10,r5
 150:   00 40 03 7c cmpwr3,r8
 154:   09 00 00 39 li  r8,9
 158:   9e 2f a9 7c iselr5,r9,r5,30
 15c:   1e 39 c6 7c iselr6,r6,r7,4
 160:   05 00 e0 38 li  r7,5
 164:   03 d9 83 28 cmplwi  cr1,r3,5
 168:   9e 2e a8 7c iselr5,r8,r5,26
 16c:   07 00 00 39 li  r8,7
 170:   9e 51 e7 7c iselr7,r7,r10,6
 174:   00 20 83 7c cmpwcr1,r3,r4
 178:   9e 2d a8 7c iselr5,r8,r5,22
 17c:   5e 38 65 7c iselgt  r3,r5,r7
 180:   1e 19 66 7c iselr3,r6,r3,4
 184:   20 00 63 78 clrldi  r3,r3,32
 188:   20 00 80 4e blr

I don't have any recent Intel hardware to test.

So, things are less clear then when just looking at the assembler code,
and the behavior of branch predictors is not easy to predict.

A small (up to 5% advantage) x86_64 could be a big disadvantage or
advantage on POWER, depending on how it is compiled...

[Bug tree-optimization/101301] Improving sparse switch statement

2021-07-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301

--- Comment #1 from Thomas Koenig  ---
Here's the assembly for the switch statement with -O3:

cmpl$5, %edi
je  .L6
jle .L18
movl$8, %eax
cmpl$8, %edi
je  .L1
jle .L19
xorl%eax, %eax
cmpl$9, %edi
sete%al
leal(%rax,%rax,8), %eax
ret
.p2align 4,,10
.p2align 3
.L18:
movl$3, %eax
cmpl$3, %edi
je  .L1
jle .L20
xorl%eax, %eax
cmpl$4, %edi
sete%al
sall$2, %eax
ret
.p2align 4,,10
.p2align 3
.L19:
movl$6, %eax
cmpl$6, %edi
je  .L1
xorl%eax, %eax
movl$7, %edx
cmpl$7, %edi
cmove   %edx, %eax
ret
.p2align 4,,10
.p2align 3
.L6:
movl$5, %eax
.L1:
ret
.p2align 4,,10
.p2align 3
.L20:
movl$1, %eax
cmpl$1, %edi
je  .L1
xorl%eax, %eax
cmpl$2, %edi
sete%al
addl%eax, %eax
ret

and here for the if statement version:

cmpl$4, %edi
jle .L2
cmpl$6, %edi
jle .L3
cmpl$7, %edi
je  .L6
cmpl$8, %edi
je  .L7
xorl%eax, %eax
cmpl$9, %edi
sete%al
leal(%rax,%rax,8), %eax
ret
.p2align 4,,10
.p2align 3
.L3:
cmpl$5, %edi
je  .L9
xorl%eax, %eax
movl$6, %edx
cmpl$6, %edi
cmove   %edx, %eax
ret
.p2align 4,,10
.p2align 3
.L2:
cmpl$2, %edi
jle .L5
cmpl$3, %edi
je  .L11
xorl%eax, %eax
cmpl$4, %edi
sete%al
sall$2, %eax
ret
.p2align 4,,10
.p2align 3
.L5:
movl$1, %eax
cmpl$1, %edi
je  .L1
xorl%eax, %eax
cmpl$2, %edi
sete%al
addl%eax, %eax
ret
.p2align 4,,10
.p2align 3
.L11:
movl$3, %eax
.L1:
ret
.p2align 4,,10
.p2align 3
.L7:
movl$8, %eax
ret
.p2align 4,,10
.p2align 3
.L9:
movl$5, %eax
ret
.p2align 4,,10
.p2align 3
.L6:
movl$7, %eax
ret

Stepping through the code with a debugger, here is a table on the number
of conditional jumps for all special values and in all relevant intervals:

 5000  5  3
 1 5  3
 15000 5  3
 2 5  3
 25000 5  3
 3 3  3
 35000 4  3
 4 4  3
 5 4  3
 5 1  3
 6 5  3
 6 5  3
 7 5  3
 7 5  3
 8 5  4
 8 3  4
 9 4  4
 9 4  4
10 4  4

If one assumes each case as equally likely, the number of conditional jumps
goes down from 81 to 62.

Assuming each case statement is reached once, the total number of
conditional jumps executed goes from 35 to 29.