[Bug libstdc++/89120] std::minmax_element 2.5 times slower than hand written loop

2021-05-17 Thread antoshkka at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89120

--- Comment #2 from Antony Polukhin  ---
Long story short: I've found no way to improve the standard library code to
always work faster. I'm in favor of closing this ticket as invalid/wont fix.

Long story:

I've tried to add a specialization of minmax_element algorithm for std::less
comparators and arithmetic types. That specialization was doing more
comparisons but in a more predictable way. On big datasets the performance
increased, but decreased on small datasets.


Then I've tried another approach. If the comparison of __first with __next is
barely predictable, then just avoid branching on it.

Portable solution:

bool __b = __comp(__next, __first);   
_ForwardIterator __pots[3] = {__first, __next, __first};
_ForwardIterator __pot_min = *(__pots + __b);
_ForwardIterator __pot_max = *(__pots + __b + 1);

Special case for random access iterators:

bool __b = __comp(__next, __first);
_ForwardIterator __pot_min = __first, __pot_max = __next;
__pot_min += b;
__pot_max -= b;


Unfortunately both those approaches add some overhead for small datasets.
Another disadvantage, is that those approaches produce orthogonal results on
different compilers:  

GCC-9 performance gets better on big datasets
-
Benchmark  Time   CPU Iterations
-
naive_minmax/2 3 ns  3 ns  247522237
naive_minmax/8 7 ns  7 ns  103044422
naive_minmax/262144  1715635 ns1710406 ns407
naive_minmax/1048576 6970755 ns6947034 ns101

branchless_minmax/28 ns  8 ns   81324904
branchless_minmax/8   30 ns 30 ns   23494608
branchless_minmax/262144  457287 ns 456412 ns   1529
branchless_minmax/10485764267914 ns4219969 ns363



Clang-9 performance degrades on big datasets
-
Benchmark  Time   CPU Iterations
-
naive_minmax/2 2 ns  2 ns  380928404
naive_minmax/8 7 ns  7 ns   92642970
naive_minmax/262144   262921 ns 262288 ns   2630
naive_minmax/1048576 1149407 ns1147626 ns618

branchless_minmax/22 ns  2 ns  307146020
branchless_minmax/8   10 ns 10 ns   74417142
branchless_minmax/262144  425880 ns 425241 ns   1637
branchless_minmax/10485761747785 ns1745725 ns397


Final attempt. Different compilers optimize the algorithm differently. Clang
shows good performance on big datasets with >4k elements, GCC - on medium sized
datasets with 128-1k elements. Maybe providing more info on probabilities could
help both compilers to produce better code. But looks like heuristics already
deduce the probabilities to be close to 0.5,
__builtin_expect_with_probability(__b, true, 0.5) changed nothing in the
assembly https://godbolt.org/z/PqWoaKfhW

[Bug libstdc++/89120] std::minmax_element 2.5 times slower than hand written loop

2019-01-30 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89120

--- Comment #1 from Marc Glisse  ---
First there is the issue of iterator vs value, as in your other PR.

The performance of minmax depends a lot on the element type and the
distribution. The standard requires that we perform only 3n/2 comparisons (not
2n like your version), which essentially prescribes the implementation.
However, minmax is probably the most classical example shown in classes on
branch prediction. For a uniform choice of a permutation, the 2n comparisons
can be well predicted (O(log n) are mispredicted) while if you only do 3n/2
comparisons, n/2 of those are unpredictable, so n/4 are badly predicted, and
that costs a lot.