[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread xry111 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #13 from Xi Ruoyao  ---
(In reply to David Malcolm from comment #10)
> (In reply to Jan Hubicka from comment #4)
> > I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> > but he doesn't.
> 
> Possibly a silly question, but how about changing the default in GCC 15? 
> What proportion of users actually make use of -fsemantic-interposition ?

At least when Glibc is built with -fno-semantic-interposition, several tests
fail.  I have not yet figured out whether these are test-suite issues or real
issues.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #12 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #11)
> (In reply to David Malcolm from comment #10)
> > (In reply to Jan Hubicka from comment #4)
> > > I keep mentioning to Larabel that he should use 
> > > -fno-semantic-interposition,
> > > but he doesn't.
> > 
> > Possibly a silly question, but how about changing the default in GCC 15? 
> > What proportion of users actually make use of -fsemantic-interposition ?
> 
> See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
> previous discussion on this.

Sorry
https://inbox.sourceware.org/gcc-patches/20210606231215.49899-1-mask...@google.com/

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #11 from Andrew Pinski  ---
(In reply to David Malcolm from comment #10)
> (In reply to Jan Hubicka from comment #4)
> > I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> > but he doesn't.
> 
> Possibly a silly question, but how about changing the default in GCC 15? 
> What proportion of users actually make use of -fsemantic-interposition ?

See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
previous discussion on this.

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

David Malcolm  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmalcolm at gcc dot gnu.org

--- Comment #10 from David Malcolm  ---
(In reply to Jan Hubicka from comment #4)
> I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> but he doesn't.

Possibly a silly question, but how about changing the default in GCC 15?  What
proportion of users actually make use of -fsemantic-interposition ?
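
For context, a minimal sketch of what the flag changes, assuming the code below is built into a shared library with -fPIC. The function names are hypothetical and not taken from SMHasher. With semantic interposition (GCC's default), the exported mix_once may be replaced at load time, so GCC has to be conservative about inlining it into mix_twice; with -fno-semantic-interposition the compiler may treat the local definition as the one that is called.

```c
#include <assert.h>

/* Hypothetical exported function of a shared library built with -fPIC. */
int mix_once(int x)
{
    return x * 31 + 7;
}

/* Under the default -fsemantic-interposition, a different definition of
   mix_once could be interposed at load time (e.g. via LD_PRELOAD), so the
   two calls below cannot simply be inlined and folded.  With
   -fno-semantic-interposition the compiler is allowed to do so. */
int mix_twice(int x)
{
    return mix_once(mix_once(x));
}

/* Sanity check of the non-interposed behaviour: (1*31+7)*31+7 == 1185. */
void self_check(void)
{
    assert(mix_twice(1) == 1185);
}
```

Comparing the output of gcc -O2 -fPIC -S with and without -fno-semantic-interposition on such a file is one way to see the difference.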

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #9 from Jan Hubicka  ---
Phoronix still reports the difference:
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #8 from Richard Biener  ---
Created attachment 57006
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57006&action=edit
unroll heuristics

this one

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener  ---
IMO it should be bound purely by growth/unrolled insns; the bound on the actual
unrolled iterations is somewhat artificial (it avoids really large unrolls when
we estimate the unrolled body to be zero and thus never hit any of the other
limits).  That said, I think we should get better at estimating growth - I
don't think we account for the reads from the constant arrays getting elided?
(Though eliding them is not always optimal.)

See the proposal on better estimation I had last year.
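
To illustrate the point about constant-array reads, here is a minimal sketch using only the first three entries of keccakf_piln and keccakf_rotc from comment #6. The ulong64 typedef and the ROL64 macro are assumptions standing in for the library's own definitions, and the hand-peeled function is written by hand for illustration, not taken from compiler output. Once the loop is completely peeled, every table load has a constant index and can fold to a literal, which a size estimate based on the rolled body would not see.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t ulong64;                                  /* assumption */
#define ROL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))   /* assumption, 0 < y < 64 */

/* First three entries of the tables quoted in comment #6. */
static const unsigned rotc[3] = { 1, 3, 6 };
static const unsigned piln[3] = { 10, 7, 11 };

/* Rolled form: two table loads per iteration. */
static void rho_pi_rolled(ulong64 s[25])
{
    ulong64 t = s[1], tmp;
    for (int i = 0; i < 3; i++) {
        unsigned j = piln[i];
        tmp = s[j];
        s[j] = ROL64(t, rotc[i]);
        t = tmp;
    }
}

/* Hand-peeled form: the table loads are gone, indices and rotate
   counts are literals. */
static void rho_pi_peeled(ulong64 s[25])
{
    ulong64 t = s[1], tmp;
    tmp = s[10]; s[10] = ROL64(t, 1); t = tmp;
    tmp = s[7];  s[7]  = ROL64(t, 3); t = tmp;
    tmp = s[11]; s[11] = ROL64(t, 6); t = tmp;
    (void)t;
}

/* Returns 1 if both forms produce the same state for a sample input. */
static int rho_pi_versions_agree(void)
{
    ulong64 a[25], b[25];
    for (int i = 0; i < 25; i++)
        a[i] = b[i] = 0x9E3779B97F4A7C15ULL * (ulong64)(i + 1);
    rho_pi_rolled(a);
    rho_pi_peeled(b);
    return memcmp(a, b, sizeof a) == 0;
}

static void self_check(void)
{
    assert(rho_pi_versions_agree());
}
```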

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #6 from Jan Hubicka  ---
The internal loops are:

static const unsigned keccakf_rotc[24] = {
   1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18,
   39, 61, 20, 44
};

static const unsigned keccakf_piln[24] = {
   10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14,
   22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{
   int i, j, round;
   ulong64 t, bc[5];

   for(round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
      /* Theta */
      for(i = 0; i < 5; i++)
         bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];

      for(i = 0; i < 5; i++) {
         t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
         for(j = 0; j < 25; j += 5)
            s[j + i] ^= t;
      }
      /* Rho Pi */
      t = s[1];
      for(i = 0; i < 24; i++) {
         j = keccakf_piln[i];
         bc[0] = s[j];
         s[j] = ROL64(t, keccakf_rotc[i]);
         t = bc[0];
      }
      /* Chi */
      for(j = 0; j < 25; j += 5) {
         for(i = 0; i < 5; i++)
            bc[i] = s[j + i];
         for(i = 0; i < 5; i++)
            s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
      }
      s[0] ^= keccakf_rndc[round];
   }
}

I suppose that with complete unrolling this will propagate, partly stay in
registers and fold.  I think increasing the default limits, especially at -O3,
may make sense.  The value of 16 has been there for a very long time (I think
since the initial implementation).
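
As a rough illustration of "propagate, partly stay in registers and fold", here is a hand-written sketch of the Theta step in rolled and peeled form. It assumes ulong64 is a 64-bit unsigned type and uses a generic rotate-left for ROL64; both are stand-ins for the library's definitions. In the peeled form the (i + 4) % 5 and (i + 1) % 5 indices are constants, so bc[] no longer needs to be an array in memory. The row loop over j is left rolled for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t ulong64;                                  /* assumption */
#define ROL64(x, y) (((x) << (y)) | ((x) >> (64 - (y))))   /* assumption, 0 < y < 64 */

/* Rolled form, as in the excerpt above. */
static void theta_rolled(ulong64 s[25])
{
    ulong64 t, bc[5];
    int i, j;
    for (i = 0; i < 5; i++)
        bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];
    for (i = 0; i < 5; i++) {
        t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
        for (j = 0; j < 25; j += 5)
            s[j + i] ^= t;
    }
}

/* Hand-peeled form: bc[] becomes five scalars and every modulo is folded. */
static void theta_peeled(ulong64 s[25])
{
    ulong64 b0 = s[0] ^ s[5] ^ s[10] ^ s[15] ^ s[20];
    ulong64 b1 = s[1] ^ s[6] ^ s[11] ^ s[16] ^ s[21];
    ulong64 b2 = s[2] ^ s[7] ^ s[12] ^ s[17] ^ s[22];
    ulong64 b3 = s[3] ^ s[8] ^ s[13] ^ s[18] ^ s[23];
    ulong64 b4 = s[4] ^ s[9] ^ s[14] ^ s[19] ^ s[24];
    ulong64 t0 = b4 ^ ROL64(b1, 1);
    ulong64 t1 = b0 ^ ROL64(b2, 1);
    ulong64 t2 = b1 ^ ROL64(b3, 1);
    ulong64 t3 = b2 ^ ROL64(b4, 1);
    ulong64 t4 = b3 ^ ROL64(b0, 1);
    for (int j = 0; j < 25; j += 5) {
        s[j + 0] ^= t0;
        s[j + 1] ^= t1;
        s[j + 2] ^= t2;
        s[j + 3] ^= t3;
        s[j + 4] ^= t4;
    }
}

/* Returns 1 if both forms produce the same state for a sample input. */
static int theta_versions_agree(void)
{
    ulong64 a[25], b[25];
    for (int i = 0; i < 25; i++)
        a[i] = b[i] = 0x9E3779B97F4A7C15ULL * (ulong64)(i + 1);
    theta_rolled(a);
    theta_peeled(b);
    return memcmp(a, b, sizeof a) == 0;
}

static void self_check(void)
{
    assert(theta_versions_agree());
}
```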

[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
                   |is almost 40% slower vs.    |is almost 40% slower vs.
                   |Clang                       |Clang (not enough complete
                   |                            |loop peeling)

--- Comment #5 from Jan Hubicka  ---
On my zen3 machine the default build gets me 180MB/s.
-O3 -flto -funroll-all-loops gets me 193MB/s.
-O3 -flto --param max-completely-peel-times=30 gets me 382MB/s; the speedup is
gone with --param max-completely-peel-times=20 (the default is 16).
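
One plausible reading of these numbers, sketched as a deliberately simplified model: the Rho Pi loop in keccakf runs 24 iterations, which is above both 16 and 20 but below 30. The helper below is hypothetical and models only the iteration-count limit; GCC's real decision also involves size limits such as max-completely-peeled-insns and its own iteration counting, so treat this as arithmetic, not as the compiler's logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Trip counts read off the keccakf source. */
enum {
    THETA_TRIPS  = 5,    /* for(i = 0; i < 5; i++)  */
    RHO_PI_TRIPS = 24    /* for(i = 0; i < 24; i++) */
};

/* Hypothetical helper: true if a loop with this many iterations stays
   within a given max-completely-peel-times value.  Simplified model,
   ignores all size-based limits. */
static bool within_peel_limit(unsigned trip_count, unsigned max_peel_times)
{
    return trip_count <= max_peel_times;
}

/* The three settings mentioned above, applied to the Rho Pi loop. */
static void self_check(void)
{
    assert(!within_peel_limit(RHO_PI_TRIPS, 16));  /* default: not peeled   */
    assert(!within_peel_limit(RHO_PI_TRIPS, 20));  /* 20: still not peeled  */
    assert(within_peel_limit(RHO_PI_TRIPS, 30));   /* 30: can be peeled     */
    assert(within_peel_limit(THETA_TRIPS, 16));    /* short loops always fit */
}
```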