https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106677

            Bug ID: 106677
           Summary: Abstraction overhead with std::views::join
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: glisse at gcc dot gnu.org
  Target Milestone: ---

(from https://stackoverflow.com/q/73407636/1918193 )

#include <array>
#include <vector>
#include <ranges>

struct Foo {
    auto join() const { return m_array | std::views::join; }
    auto direct() const { return std::views::all(m_array[0]); }
    std::array<std::vector<int*>, 1> m_array;
};
__attribute__((noinline)) int sum_array(const Foo& foo)
{
    int result = 0;
    for (int* val : foo.join())
        result += *val;
    return result;
}
__attribute__((noinline)) int sum_vec(const Foo& foo)
{
    int result = 0;
    for (int* val : foo.direct())
        result += *val;
    return result;
}

I am using a snapshot from 20220719 with -std=gnu++2b -O3 and looking at
.optimized dumps.

sum_vec gets relatively nice, short code. sum_array gets something uglier.

  _18 = &foo_5(D)->m_array;
  _6 = foo_5(D) + 24;
  if (_6 != _18)

Err, x != x+24 should be folded to false? Let's add

  if(foo.m_array.begin()==foo.m_array.end())__builtin_unreachable();

to move forward.

  _16 = MEM[(int * const * const &)foo_4(D)];
  _17 = MEM[(int * const * const &)foo_4(D) + 8];
  if (_16 != _17)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

why are we guessing that the vector is probably empty? Let's look at more code

  <bb 2> [local count: 853673669]:
  _10 = &MEM[(const struct array *)foo_4(D)]._M_elems;
  _7 = foo_4(D) + 24;
  _16 = MEM[(int * const * const &)foo_4(D)];
  _17 = MEM[(int * const * const &)foo_4(D) + 8];
  if (_16 != _17)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 3> [local count: 806721618]:
  _18 = foo_4(D) + 24;

  <bb 4> [local count: 96636762]:
  # SR.89_28 = PHI <_10(2), _18(3)>
  # SR.90_41 = PHI <_16(2), 0B(3)>
  goto <bb 9>; [100.00%]

  <bb 9> [local count: 923031551]:
  # result_2 = PHI <0(4), result_12(8)>
  # SR.89_13 = PHI <SR.89_28(4), SR.89_54(8)>
  # SR.90_30 = PHI <SR.90_41(4), SR.90_27(8)>
  if (_7 == SR.89_13)
    goto <bb 10>; [30.00%]
  else
    goto <bb 12>; [70.00%]

  <bb 10> [local count: 276909463]:
  if (SR.90_30 == 0B)
    goto <bb 11>; [16.34%]
  else
    goto <bb 12>; [83.66%]

  <bb 11> [local count: 96636764]:
  # result_31 = PHI <result_2(10), result_12(7), result_12(5)>
  return result_31;

(why not _18 = _7 towards the beginning?)
It would be nice if threading could isolate the case of an empty vector: 2 -> 3
-> 4 -> 9 -> 10 -> 11: just return 0, and the rest of the code may become
easier to optimize.

Let me add

  if(foo.m_array[0].begin()==foo.m_array[0].end())__builtin_unreachable();

to avoid the empty vector case as well. This looks better, at least the inner
loop looks normal, but we are still iterating on the elements of m_array, when
we should be able to tell that it has exactly 1 element.

Reply via email to