https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106677
Bug ID: 106677 Summary: Abstraction overhead with std::views::join Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- (from https://stackoverflow.com/q/73407636/1918193 ) #include <array> #include <vector> #include <ranges> struct Foo { auto join() const { return m_array | std::views::join; } auto direct() const { return std::views::all(m_array[0]); } std::array<std::vector<int*>, 1> m_array; }; __attribute__((noinline)) int sum_array(const Foo& foo) { int result = 0; for (int* val : foo.join()) result += *val; return result; } __attribute__((noinline)) int sum_vec(const Foo& foo) { int result = 0; for (int* val : foo.direct()) result += *val; return result; } I am using a snapshot from 20220719 with -std=gnu++2b -O3 and looking at .optimized dumps. sum_vec gets relatively nice, short code. sum_array gets something uglier. _18 = &foo_5(D)->m_array; _6 = foo_5(D) + 24; if (_6 != _18) Err, x != x+24 should be folded to false? Let's add if(foo.m_array.begin()==foo.m_array.end())__builtin_unreachable(); to move forward. _16 = MEM[(int * const * const &)foo_4(D)]; _17 = MEM[(int * const * const &)foo_4(D) + 8]; if (_16 != _17) goto <bb 4>; [5.50%] else goto <bb 3>; [94.50%] why are we guessing that the vector is probably empty? Let's look at more code <bb 2> [local count: 853673669]: _10 = &MEM[(const struct array *)foo_4(D)]._M_elems; _7 = foo_4(D) + 24; _16 = MEM[(int * const * const &)foo_4(D)]; _17 = MEM[(int * const * const &)foo_4(D) + 8]; if (_16 != _17) goto <bb 4>; [5.50%] else goto <bb 3>; [94.50%] <bb 3> [local count: 806721618]: _18 = foo_4(D) + 24; <bb 4> [local count: 96636762]: # SR.89_28 = PHI <_10(2), _18(3)> # SR.90_41 = PHI <_16(2), 0B(3)> goto <bb 9>; [100.00%] <bb 9> [local count: 923031551]: # result_2 = PHI <0(4), result_12(8)> # SR.89_13 = PHI <SR.89_28(4), SR.89_54(8)> # SR.90_30 = PHI <SR.90_41(4), SR.90_27(8)> if (_7 == SR.89_13) goto <bb 10>; [30.00%] else goto <bb 12>; [70.00%] <bb 10> [local count: 276909463]: if (SR.90_30 == 0B) goto <bb 11>; [16.34%] else goto <bb 12>; [83.66%] <bb 11> [local count: 96636764]: # result_31 = PHI <result_2(10), result_12(7), result_12(5)> return result_31; (why not _18 = _7 towards the beginning?) It would be nice if threading could isolate the case of an empty vector: 2 -> 3 -> 4 -> 9 -> 10 -> 11: just return 0, and the rest of the code may become easier to optimize. Let me add if(foo.m_array[0].begin()==foo.m_array[0].end())__builtin_unreachable(); to avoid the empty vector case as well. This looks better, at least the inner loop looks normal, but we are still iterating on the elements of m_array, when we should be able to tell that it has exactly 1 element.