Re: clang performance bug is worse on openbsd than freebsd
Sorting an array of around 300 (or 3) randomly created unsigned characters sounds like a task tailor made for binsort. (Which seems plausibly worth mentioning in this context.) That said, the key openbsd issues might not include performance on this particular benchmark. Thanks, -- Raul On Sun, Nov 7, 2021 at 9:16 PM Luke Small wrote: > > https://bugs.llvm.org/show_bug.cgi?id=50026 > > I reported it to the llvm people. it is two slightly different quicksort > algorithms which perform radically differently. The one which you could > assume would take more time, performs MUCH better. > > I made a custom quicksort algorithm which outperforms qsort by A LOT for > sorting an array of around 300 randomly created unsigned characters, which > is what I use it for. > > if you read the report a guy said there's a 10% difference for sorting 3 > million characters on freebsd, but there's about 40% performance difference > on OpenBSD. maybe it's also how the OpenBSD team modified clang to prevent > rop chain stuff or something? I'm using a westmere based intell server. > > it's the same for clang 11. > > What's the deal? > > -Luke sort_test2r.c Description: Binary data
Re: clang performance bug is worse on openbsd than freebsd
On 2021-11-08, Theo de Raadt wrote: > Stuart Henderson wrote: > >> On 2021-11-08, Otto Moerbeek wrote: >> > On Sun, Nov 07, 2021 at 08:13:36PM -0600, Luke Small wrote: >> > >> >> https://bugs.llvm.org/show_bug.cgi?id=50026 >> >> >> >> I reported it to the llvm people. it is two slightly different quicksort >> >> algorithms which perform radically differently. The one which you could >> >> assume would take more time, performs MUCH better. >> >> >> >> I made a custom quicksort algorithm which outperforms qsort by A LOT for >> >> sorting an array of around 300 randomly created unsigned characters, which >> >> is what I use it for. >> >> >> >> if you read the report a guy said there's a 10% difference for sorting 3 >> >> million characters on freebsd, but there's about 40% performance >> >> difference >> >> on OpenBSD. maybe it's also how the OpenBSD team modified clang to prevent >> >> rop chain stuff or something? I'm using a westmere based intell server. >> >> >> >> it's the same for clang 11. >> >> >> >> What's the deal? >> > >> > 1. Your test is too small. I increased the test size to get less error >> > in measurement. I also changed the code to either run quicksort or >> > quicksort depending on an argument. >> > >> > 2. I confirmed your observation using time on amd64, arm64 however >> > shows a difference more in line with expectations: >> > >> > [otto@mc:8]$ time ./a.out A >> > quicksort >> > time = 0.335777021 >> > 0m00.35s real 0m00.35s user 0m00.01s system >> > [otto@mc:9]$ time ./a.out B >> > quicksort1 >> > time = 0.345703033 >> > 0m00.36s real 0m00.36s user 0m00.01s system >> > >> > >> > I suspect some non-optimal decision of the optimizer on amd64 for the >> > first quicksort. >> > >> > A next step could be somebody looking at the code produced by the >> > compiler. That is not going to be me. >> >> This is another example of code quite badly affected by fixup-gadgets. >> With an increased test size and building separate objects for each of >> the two tests and with/without CFLAGS=-fno-fixup-gadgets: >> >> $ for i in quicksort*; do echo $i; time ./$i; done >> >> quicksort >> 0m02.33s real 0m02.32s user 0m00.00s system >> quicksort-no-fixup-gadgets >> 0m01.29s real 0m01.27s user 0m00.00s system >> quicksort1 >> 0m01.06s real 0m00.94s user 0m00.02s system >> quicksort1-no-fixup-gadgets >> 0m01.08s real 0m00.87s user 0m00.01s system >> >> > > The word everyone is looking for is "tradeoff", meaning this is intentional. Yes. But we have seen fixup-gadgets making a suboptimal choice before and it was improved. Maybe there's nothing more that can be done, maybe someone proficient in X86 asm will spot something - much easier to compare the two now we know exactly what's affecting this particular code and that it can be toggled with a compiler flag.
Re: clang performance bug is worse on openbsd than freebsd
Stuart Henderson wrote: > On 2021-11-08, Otto Moerbeek wrote: > > On Sun, Nov 07, 2021 at 08:13:36PM -0600, Luke Small wrote: > > > >> https://bugs.llvm.org/show_bug.cgi?id=50026 > >> > >> I reported it to the llvm people. it is two slightly different quicksort > >> algorithms which perform radically differently. The one which you could > >> assume would take more time, performs MUCH better. > >> > >> I made a custom quicksort algorithm which outperforms qsort by A LOT for > >> sorting an array of around 300 randomly created unsigned characters, which > >> is what I use it for. > >> > >> if you read the report a guy said there's a 10% difference for sorting 3 > >> million characters on freebsd, but there's about 40% performance difference > >> on OpenBSD. maybe it's also how the OpenBSD team modified clang to prevent > >> rop chain stuff or something? I'm using a westmere based intell server. > >> > >> it's the same for clang 11. > >> > >> What's the deal? > > > > 1. Your test is too small. I increased the test size to get less error > > in measurement. I also changed the code to either run quicksort or > > quicksort depending on an argument. > > > > 2. I confirmed your observation using time on amd64, arm64 however > > shows a difference more in line with expectations: > > > > [otto@mc:8]$ time ./a.out A > > quicksort > > time = 0.335777021 > > 0m00.35s real 0m00.35s user 0m00.01s system > > [otto@mc:9]$ time ./a.out B > > quicksort1 > > time = 0.345703033 > > 0m00.36s real 0m00.36s user 0m00.01s system > > > > > > I suspect some non-optimal decision of the optimizer on amd64 for the > > first quicksort. > > > > A next step could be somebody looking at the code produced by the > > compiler. That is not going to be me. > > This is another example of code quite badly affected by fixup-gadgets. > With an increased test size and building separate objects for each of > the two tests and with/without CFLAGS=-fno-fixup-gadgets: > > $ for i in quicksort*; do echo $i; time ./$i; done > > quicksort > 0m02.33s real 0m02.32s user 0m00.00s system > quicksort-no-fixup-gadgets > 0m01.29s real 0m01.27s user 0m00.00s system > quicksort1 > 0m01.06s real 0m00.94s user 0m00.02s system > quicksort1-no-fixup-gadgets > 0m01.08s real 0m00.87s user 0m00.01s system > > The word everyone is looking for is "tradeoff", meaning this is intentional.
Re: clang performance bug is worse on openbsd than freebsd
On 2021-11-08, Otto Moerbeek wrote: > On Sun, Nov 07, 2021 at 08:13:36PM -0600, Luke Small wrote: > >> https://bugs.llvm.org/show_bug.cgi?id=50026 >> >> I reported it to the llvm people. it is two slightly different quicksort >> algorithms which perform radically differently. The one which you could >> assume would take more time, performs MUCH better. >> >> I made a custom quicksort algorithm which outperforms qsort by A LOT for >> sorting an array of around 300 randomly created unsigned characters, which >> is what I use it for. >> >> if you read the report a guy said there's a 10% difference for sorting 3 >> million characters on freebsd, but there's about 40% performance difference >> on OpenBSD. maybe it's also how the OpenBSD team modified clang to prevent >> rop chain stuff or something? I'm using a westmere based intell server. >> >> it's the same for clang 11. >> >> What's the deal? > > 1. Your test is too small. I increased the test size to get less error > in measurement. I also changed the code to either run quicksort or > quicksort depending on an argument. > > 2. I confirmed your observation using time on amd64, arm64 however > shows a difference more in line with expectations: > > [otto@mc:8]$ time ./a.out A > quicksort > time = 0.335777021 > 0m00.35s real 0m00.35s user 0m00.01s system > [otto@mc:9]$ time ./a.out B > quicksort1 > time = 0.345703033 > 0m00.36s real 0m00.36s user 0m00.01s system > > > I suspect some non-optimal decision of the optimizer on amd64 for the > first quicksort. > > A next step could be somebody looking at the code produced by the > compiler. That is not going to be me. This is another example of code quite badly affected by fixup-gadgets. With an increased test size and building separate objects for each of the two tests and with/without CFLAGS=-fno-fixup-gadgets: $ for i in quicksort*; do echo $i; time ./$i; done quicksort 0m02.33s real 0m02.32s user 0m00.00s system quicksort-no-fixup-gadgets 0m01.29s real 0m01.27s user 0m00.00s system quicksort1 0m01.06s real 0m00.94s user 0m00.02s system quicksort1-no-fixup-gadgets 0m01.08s real 0m00.87s user 0m00.01s system
Re: clang performance bug is worse on openbsd than freebsd
On Sun, Nov 07, 2021 at 08:13:36PM -0600, Luke Small wrote: > https://bugs.llvm.org/show_bug.cgi?id=50026 > > I reported it to the llvm people. it is two slightly different quicksort > algorithms which perform radically differently. The one which you could > assume would take more time, performs MUCH better. > > I made a custom quicksort algorithm which outperforms qsort by A LOT for > sorting an array of around 300 randomly created unsigned characters, which > is what I use it for. > > if you read the report a guy said there's a 10% difference for sorting 3 > million characters on freebsd, but there's about 40% performance difference > on OpenBSD. maybe it's also how the OpenBSD team modified clang to prevent > rop chain stuff or something? I'm using a westmere based intell server. > > it's the same for clang 11. > > What's the deal? 1. Your test is too small. I increased the test size to get less error in measurement. I also changed the code to either run quicksort or quicksort depending on an argument. 2. I confirmed your observation using time on amd64, arm64 however shows a difference more in line with expectations: [otto@mc:8]$ time ./a.out A quicksort time = 0.335777021 0m00.35s real 0m00.35s user 0m00.01s system [otto@mc:9]$ time ./a.out B quicksort1 time = 0.345703033 0m00.36s real 0m00.36s user 0m00.01s system I suspect some non-optimal decision of the optimizer on amd64 for the first quicksort. A next step could be somebody looking at the code produced by the compiler. That is not going to be me. -Otto > > -Luke > > /* > clang sort_test2.c -O2 -o sort_test && ./sort_test > */ > > /* > * quicksort should be slightly faster than quicksort1 because it is > * identical except quicksort1 does more work using global counters. > * quicsort1 runs 41% faster. Using gcc, quicksort is instead slightly faster. > */ > > > /* > * I'm using clang 10.0.1 on OpenBSD. these are my results: > > > quicksort > time = 0.004501771 > > > > quicksort1 > time = 0.002655233 > */ > > > #include > #include > #include > #include > #include > #include > > size_t m; > size_t g; > size_t i; > > > /* > * quicksort sorts characters on a u_char array. > * 'a' is the first character on the array. > * 'b' is last character on the array > */ > static > void quicksort(u_char a[], u_char b[]) > { > u_char temp; > > if (b - a <= 1) > { > if (a == b) > return; > > if (*a > *b) > { > temp = *a; *a = *b; *b = temp; > } > return; > } > u_char * j = a + 1; > u_char * k = b; > u_char u = *a; > > /* >* This if() avoids repeated awkward special case branching >* in the similar if() in the "while (++j < k)" loop >*/ > if (*j > u) > { > do > { > if (*k < u) > { > temp = *j; *j = *k; *k = temp; > goto skip0; > } > } while (j < --k); > > quicksort(j, b); > return; > > skip0: > if (j == --k) > { > *a = *j; *j = u; > > quicksort(j+1, b); > return; > } > } > > > while (++j < k) > { > if (*j > u) > { > do > { > if (*k < u) > { > temp = *j; *j = *k; *k = temp; > goto skip; > } > } while (j < --k); > > quicksort(j, b); > > *a = *--j; *j = u; > > quicksort(a, j-1); > return; > > skip: > if (j == --k) > { > *a = *j; *j = u; > > quicksort(a, j-1); > quicksort(j+1, b); > return; > } > } > } > > if (*j > u) > { > quicksort(j, b); > > *a = *--j; *j = u; > > quicksort(a, j-1); > } > else > { > *a = *j; *j = u; > > quicksort(a, j-1); > > if (j == b) > return; > > quicksort(j+1, b); > } > } > > /* > * quicksort1 is identical to quicksort() but it alters some global counters > * it should take slightly longer to run, but it runs in half the time > */ > static > void quicksort1(u_char a[], u_char b[], size_t leve
clang performance bug is worse on openbsd than freebsd
https://bugs.llvm.org/show_bug.cgi?id=50026 I reported it to the llvm people. it is two slightly different quicksort algorithms which perform radically differently. The one which you could assume would take more time, performs MUCH better. I made a custom quicksort algorithm which outperforms qsort by A LOT for sorting an array of around 300 randomly created unsigned characters, which is what I use it for. if you read the report a guy said there's a 10% difference for sorting 3 million characters on freebsd, but there's about 40% performance difference on OpenBSD. maybe it's also how the OpenBSD team modified clang to prevent rop chain stuff or something? I'm using a westmere based intell server. it's the same for clang 11. What's the deal? -Luke /* clang sort_test2.c -O2 -o sort_test && ./sort_test */ /* * quicksort should be slightly faster than quicksort1 because it is * identical except quicksort1 does more work using global counters. * quicsort1 runs 41% faster. Using gcc, quicksort is instead slightly faster. */ /* * I'm using clang 10.0.1 on OpenBSD. these are my results: quicksort time = 0.004501771 quicksort1 time = 0.002655233 */ #include #include #include #include #include #include size_t m; size_t g; size_t i; /* * quicksort sorts characters on a u_char array. * 'a' is the first character on the array. * 'b' is last character on the array */ static void quicksort(u_char a[], u_char b[]) { u_char temp; if (b - a <= 1) { if (a == b) return; if (*a > *b) { temp = *a; *a = *b; *b = temp; } return; } u_char * j = a + 1; u_char * k = b; u_char u = *a; /* * This if() avoids repeated awkward special case branching * in the similar if() in the "while (++j < k)" loop */ if (*j > u) { do { if (*k < u) { temp = *j; *j = *k; *k = temp; goto skip0; } } while (j < --k); quicksort(j, b); return; skip0: if (j == --k) { *a = *j; *j = u; quicksort(j+1, b); return; } } while (++j < k) { if (*j > u) { do { if (*k < u) { temp = *j; *j = *k; *k = temp; goto skip; } } while (j < --k); quicksort(j, b); *a = *--j; *j = u; quicksort(a, j-1); return; skip: if (j == --k) { *a = *j; *j = u; quicksort(a, j-1); quicksort(j+1, b); return; } } } if (*j > u) { quicksort(j, b); *a = *--j; *j = u; quicksort(a, j-1); } else { *a = *j; *j = u; quicksort(a, j-1); if (j == b) return; quicksort(j+1, b); } } /* * quicksort1 is identical to quicksort() but it alters some global counters * it should take slightly longer to run, but it runs in half the time */ static void quicksort1(u_char a[], u_char b[], size_t level) { ++m; if (level > i) i = level; u_char temp; if (b - a <= 1) { ++g; if (a == b) return; if (*a > *b) { temp = *a; *a = *b; *b = temp; } return; } u_char * j = a + 1; u_char * k = b; u_char u = *a; /* * This if() avoids repeated awkward special case branching * in the similar if() in the "while (++j < k)" loop */ if (*j > u) { do { if (*k < u) { temp = *j; *j = *k; *k = temp; goto skip0; } } while (j < --k); quicksort1(j, b, level + 1); return; skip0: if (j == --k) { *a = *j; *j = u; quicksort1(j+1, b, level + 1); return; } } while (++j < k) { if (*j > u) { do { if (*k < u) { temp = *j; *j = *k; *k = temp; goto skip; } } while (j < --k); quicksort1(j, b, level + 1); *a = *--j; *j = u; quicksort1(a, j-1, level + 1); return; skip: if (j == --k) { *a = *j; *j = u; quicksort1(a, j-1, level + 1); quicksort1(j+1, b, level + 1); return; } } } if (*j > u) { quicksort1(j, b, level + 1); *a = *--j; *j = u; quicksort1(a, j-1, level + 1); } else { *a = *j; *j = u; quicksort1(a, j-1, level + 1); if (j == b) return; quicksort1(j+1, b, level + 1); } } int main() { long double cpu_time_used; struct timespec start, end; u_char *buffer, *buffer2; size_t y; size_t n = 3; buffer = (u_char*)malloc(n); if (buffer == NULL) err(1, "malloc"); buffer2 = (u_char*)malloc(n); if (buffer2 == NULL) err(1, "malloc"); for (y = 0; y < n; ++y) buffer[y] = rand() % 256; // run it once to clear any wrappers clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); memcpy(buffer2, buffer, n); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); quicksort(buffer2, buffer2 + n - 1); clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); cpu_time_used = (long double) (end.tv_sec - start.tv_sec) + (long double) (end.tv_nsec - start.tv_nsec) / (long double) 10; printf("\n\nquicksort\n"); printf("time = %.9Lf\n\n", cpu_time_used); memcpy(buffer2, buffer, n); i = m = g = 0; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); quicksort1(buffer2, buffer2 + n - 1, 1); clock