https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102435
Bug ID: 102435
Summary: gcc 9: aarch64 -ftree-loop-vectorize results in wrong code
Product: gcc
Version: 9.4.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: dimi...@unified-streaming.com
Target Milestone: ---

We noticed a problem with a loop optimization enabled by -O3 on a program
targeting AArch64. It turns out that this problem is specifically caused by
-ftree-loop-vectorize, and has actually been fixed by (or as a side effect of)
commit https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=c89366b12ff4f362
("[AArch64] Support vectorising with multiple vector sizes") by Richard
Sandiford.

However, this commit was made on master when it was gcc-10, so while the
problem does not occur with gcc 10.x and 11.x, it *does* occur with 9.x. In our
particular instance, this is the default version on Ubuntu 20.04 for arm64,
e.g. gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04).

Reduced test case:

// g++ -std=c++17 -O2 -ftree-loop-vectorize testcase.cpp
// or
// g++ -std=c++17 -O3 testcase.cpp

#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

struct sample_t
{
  sample_t(uint64_t dts, uint32_t duration)
  : dts_(dts)
  , duration_(duration)
  , cto_(0)
  , sample_description_index_(0)
  , pos_(0)
  , size_(0)
  , flags_(0)
  , aux_pos_(0)
  , aux_size_(0)
  { }

  uint64_t dts_;
  uint32_t duration_;
  int32_t cto_;
  uint32_t sample_description_index_;
  uint64_t pos_;
  uint32_t size_;
  uint32_t flags_;
  uint64_t aux_pos_;
  uint32_t aux_size_;
};

typedef std::vector<sample_t> samples_t;

__attribute__((__noinline__))
samples_t get_result(samples_t&& samples)
{
  uint64_t base_media_decode_time = ~0;

  auto first = samples.begin();
  auto last = samples.end();

  if(first != last)
  {
    base_media_decode_time = first->dts_;

    uint32_t duration = 0;
    for(--last; first != last; ++first)
    {
      duration = static_cast<uint32_t>(first[1].dts_ - first->dts_);
      first->duration_ = duration;
    }
    first->duration_ = duration;
  }

  return samples;
}

int main(void)
{
  samples_t samples_in = {
    {0, 3}, {3, 3}, {6, 3}, {9, 1}, {10, 2}
  };

  samples_t samples_out = get_result(std::move(samples_in));

  for(sample_t sample : samples_out)
  {
    std::cout << sample.dts_ << ", " << sample.duration_ << '\n';
  }

  // Expected output:
  // 0, 3
  // 3, 3
  // 6, 3
  // 9, 1
  // 10, 1
  //
  // Bad output:
  // 0, 3
  // 3, 0
  // 6, 0
  // 9, 0
  // 10, 0

  return 0;
}

Note that it appears vital that the struct sample_t is pretty large: removing
all of the members after the first two makes the output correct, even with
gcc 9 and -ftree-loop-vectorize. I have not determined precisely what the
cutoff size is.
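
For anyone who wants to turn the test case into a self-checking one, the expected durations can be computed independently of get_result(). The following is only a sketch: expected_durations() and check_report_case() are hypothetical helpers that are not part of the original report, and the sketch assumes that this simpler loop over plain integer vectors is itself compiled correctly (building it without -ftree-loop-vectorize would be the cautious choice).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the original report): for a list of dts
// values, compute the durations that get_result() is expected to store.
// Each sample gets the distance to the next dts; the last sample repeats
// the previous duration (or 0 if it is the only sample).
std::vector<uint32_t> expected_durations(const std::vector<uint64_t>& dts)
{
  std::vector<uint32_t> out(dts.size(), 0);
  if(dts.empty())
  {
    return out;
  }

  uint32_t duration = 0;
  for(std::size_t i = 0; i + 1 < dts.size(); ++i)
  {
    duration = static_cast<uint32_t>(dts[i + 1] - dts[i]);
    out[i] = duration;
  }
  out.back() = duration;

  return out;
}

// Hypothetical self-check using the input from the reduced test case.
void check_report_case()
{
  const std::vector<uint32_t> want = {3, 3, 3, 1, 1};
  assert(expected_durations({0, 3, 6, 9, 10}) == want);
}
```

Comparing each sample.duration_ in samples_out against the corresponding entry of this vector would let the test case fail with an assertion instead of relying on visual inspection of the output.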