[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #13 from Xi Ruoyao ---
(In reply to David Malcolm from comment #10)
> (In reply to Jan Hubicka from comment #4)
> > I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> > but he doesn't.
>
> Possibly a silly question, but how about changing the default in GCC 15?
> What proportion of users actually make use of -fsemantic-interposition?

At least when building glibc with -fno-semantic-interposition, several tests
fail. I have not yet figured out whether they are test-suite issues or real
issues.
--- Comment #12 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #11)
> (In reply to David Malcolm from comment #10)
> > (In reply to Jan Hubicka from comment #4)
> > > I keep mentioning to Larabel that he should use
> > > -fno-semantic-interposition, but he doesn't.
> >
> > Possibly a silly question, but how about changing the default in GCC 15?
> > What proportion of users actually make use of -fsemantic-interposition?
>
> See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
> previous discussion on this.

Sorry,
https://inbox.sourceware.org/gcc-patches/20210606231215.49899-1-mask...@google.com/
--- Comment #11 from Andrew Pinski ---
(In reply to David Malcolm from comment #10)
> (In reply to Jan Hubicka from comment #4)
> > I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> > but he doesn't.
>
> Possibly a silly question, but how about changing the default in GCC 15?
> What proportion of users actually make use of -fsemantic-interposition?

See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
previous discussion on this.
David Malcolm changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmalcolm at gcc dot gnu.org

--- Comment #10 from David Malcolm ---
(In reply to Jan Hubicka from comment #4)
> I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> but he doesn't.

Possibly a silly question, but how about changing the default in GCC 15?
What proportion of users actually make use of -fsemantic-interposition?
--- Comment #9 from Jan Hubicka ---
Phoronix still claims the difference:
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2
--- Comment #8 from Richard Biener ---
Created attachment 57006
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57006&action=edit
unroll heuristics

This one.
Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener ---
IMO it should be purely growth/unrolled-insns bound; the bound on the actual
number of unrolled iterations is somewhat artificial (it is there to avoid
really large unrolls when we estimate the unrolled body to be zero size and
thus never hit any of the other limits).

That said, I think we should get better at estimating growth - I don't think
we account for the reads from the constant arrays getting elided (though
eliding them is not always the optimal thing). See the proposal on better
estimation I had last year.
--- Comment #6 from Jan Hubicka ---
The internal loops are:

static const unsigned keccakf_rotc[24] = {
    1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
    27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44
};

static const unsigned keccakf_piln[24] = {
    10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
    15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{
    int i, j, round;
    ulong64 t, bc[5];

    for (round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
        /* Theta */
        for (i = 0; i < 5; i++)
            bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];

        for (i = 0; i < 5; i++) {
            t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
            for (j = 0; j < 25; j += 5)
                s[j + i] ^= t;
        }

        /* Rho Pi */
        t = s[1];
        for (i = 0; i < 24; i++) {
            j = keccakf_piln[i];
            bc[0] = s[j];
            s[j] = ROL64(t, keccakf_rotc[i]);
            t = bc[0];
        }

        /* Chi */
        for (j = 0; j < 25; j += 5) {
            for (i = 0; i < 5; i++)
                bc[i] = s[j + i];
            for (i = 0; i < 5; i++)
                s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
        }

        s[0] ^= keccakf_rndc[round];
    }
}

I suppose with complete unrolling this will propagate, partly stay in
registers, and fold. I think increasing the default limits, especially at
-O3, may make sense. The value of 16 has been there for a very long time (I
think since the initial implementation).
Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
                   |is almost 40% slower vs.    |is almost 40% slower vs.
                   |Clang                       |Clang (not enough complete
                   |                            |loop peeling)

--- Comment #5 from Jan Hubicka ---
On my Zen 3 machine:
  - the default build gets me 180 MB/s
  - -O3 -flto -funroll-all-loops gets me 193 MB/s
  - -O3 -flto --param max-completely-peel-times=30 gets me 382 MB/s

The speedup is gone with --param max-completely-peel-times=20; the default
value is 16.