[Bug c++/104084] [12 regression] Internal compiler error: tree check: expected target_expr, have compound_expr in build_new_1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104084 --- Comment #4 from Allan Jensen --- Created attachment 52217 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52217&action=edit -E output
[Bug c++/104084] [12 regression] Internal compiler error: tree check: expected target_expr, have compound_expr in build_new_1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104084 --- Comment #3 from Allan Jensen --- -v output: Using built-in specs. COLLECT_GCC=/opt/gcc/bin/g++-12 Target: x86_64-pc-linux-gnu Configured with: ../configure --enable-languages=c,c++ --prefix=/opt/gcc --program-suffix=-12 Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 12.0.1 20220117 (experimental) (GCC) COLLECT_GCC_OPTIONS='-MMD' '-MF' 'obj/third_party/libgav1/libgav1/loop_restoration_info.o.d' '-D' 'USE_UDEV' '-D' 'USE_AURA=1' '-D' 'USE_NSS_CERTS=1' '-D' 'USE_OZONE=1' '-D' 'OFFICIAL_BUILD' '-D' 'TOOLKIT_QT' '-D' '_FILE_OFFSET_BITS=64' '-D' '_LARGEFILE_SOURCE' '-D' '_LARGEFILE64_SOURCE' '-D' 'NO_UNWIND_TABLES' '-D' 'NDEBUG' '-D' 'NVALGRIND' '-D' 'DYNAMIC_ANNOTATIONS_ENABLED=0' '-D' 'LIBGAV1_MAX_BITDEPTH=10' '-D' 'LIBGAV1_THREADPOOL_USE_STD_MUTEX' '-D' 'LIBGAV1_ENABLE_LOGGING=0' '-D' 'LIBGAV1_PUBLIC=' '-I' 'gen' '-I' '../../../../../../qtwebengine/src/3rdparty/chromium' '-I' '../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src' '-I' '../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src/src' '-fno-ident' '-fno-strict-aliasing' '--param=ssp-buffer-size=4' '-fstack-protector' '-Wno-unknown-pragmas' '-Wno-parentheses' '-Wno-sign-compare' '-Wstringop-overflow=0' '-Wno-stringop-overread' '-Wno-psabi' '-Wno-multichar' '-Wno-format-zero-length' '-fno-unwind-tables' '-fno-asynchronous-unwind-tables' '-fPIC' '-pipe' '-pthread' '-m64' '-O2' '-fno-omit-frame-pointer' '-g1' '-fvisibility=hidden' '-Wno-unused-local-typedefs' '-Wno-maybe-uninitialized' '-Wno-deprecated-declarations' '-fno-delete-null-pointer-checks' '-Wno-comment' '-Wno-packed-not-aligned' '-Wno-dangling-else' '-Wno-missing-field-initializers' '-Wno-unused-parameter' '-O2' '-fdata-sections' '-ffunction-sections' '-std=gnu++14' '-fvisibility-inlines-hidden' '-Wno-narrowing' '-Wno-attributes' '-Wno-class-memaccess' '-Wno-subobject-linkage' '-Wno-invalid-offsetof' '-Wno-return-type' 
'-Wno-deprecated-copy' '-c' '-o' 'obj/third_party/libgav1/libgav1/loop_restoration_info.o' '-v' '-shared-libgcc' '-mtune=generic' '-march=x86-64' '-dumpdir' 'obj/third_party/libgav1/libgav1/' /opt/gcc/libexec/gcc/x86_64-pc-linux-gnu/12.0.1/cc1plus -quiet -v -I gen -I ../../../../../../qtwebengine/src/3rdparty/chromium -I ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src -I ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src/src -imultiarch x86_64-linux-gnu -MMD obj/third_party/libgav1/libgav1/loop_restoration_info.d -MF obj/third_party/libgav1/libgav1/loop_restoration_info.o.d -MQ obj/third_party/libgav1/libgav1/loop_restoration_info.o -D_GNU_SOURCE -D_REENTRANT -D USE_UDEV -D USE_AURA=1 -D USE_NSS_CERTS=1 -D USE_OZONE=1 -D OFFICIAL_BUILD -D TOOLKIT_QT -D _FILE_OFFSET_BITS=64 -D _LARGEFILE_SOURCE -D _LARGEFILE64_SOURCE -D NO_UNWIND_TABLES -D NDEBUG -D NVALGRIND -D DYNAMIC_ANNOTATIONS_ENABLED=0 -D LIBGAV1_MAX_BITDEPTH=10 -D LIBGAV1_THREADPOOL_USE_STD_MUTEX -D LIBGAV1_ENABLE_LOGGING=0 -D LIBGAV1_PUBLIC= ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src/src/loop_restoration_info.cc -quiet -dumpdir obj/third_party/libgav1/libgav1/ -dumpbase loop_restoration_info.cc -dumpbase-ext .cc -m64 -mtune=generic -march=x86-64 -g1 -O2 -O2 -Wno-unknown-pragmas -Wno-parentheses -Wno-sign-compare -Wstringop-overflow=0 -Wno-stringop-overread -Wno-psabi -Wno-multichar -Wno-format-zero-length -Wno-unused-local-typedefs -Wno-maybe-uninitialized -Wno-deprecated-declarations -Wno-comment -Wno-packed-not-aligned -Wno-dangling-else -Wno-missing-field-initializers -Wno-unused-parameter -Wno-narrowing -Wno-attributes -Wno-class-memaccess -Wno-subobject-linkage -Wno-invalid-offsetof -Wno-return-type -Wno-deprecated-copy -std=gnu++14 -version -fno-ident -fno-strict-aliasing -fstack-protector -fno-unwind-tables -fno-asynchronous-unwind-tables -fPIC -fno-omit-frame-pointer -fvisibility=hidden 
-fno-delete-null-pointer-checks -fdata-sections -ffunction-sections -fvisibility-inlines-hidden --param=ssp-buffer-size=4 -o - | as -v -I gen -I ../../../../../../qtwebengine/src/3rdparty/chromium -I ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src -I ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src/src --gdwarf-5 --64 -o obj/third_party/libgav1/libgav1/loop_restoration_info.o GNU assembler version 2.36.1 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.36.1 GNU C++14 (GCC) version 12.0.1 20220117 (experimental) (x86_64-pc-linux-gnu) compiled by GNU C version 12.0.1 20220117 (experimental), GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.0, isl version isl-0.23-GMP GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096 ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu" ignoring nonexistent directory "/opt/gcc/lib/gcc/x86_64-pc-linux-gnu/12.0.1/../../../../x86_64-pc-linux-gnu/include" #includ
[Bug c++/104084] [12 regression] Internal compiler error: tree check: expected target_expr, have compound_expr in build_new_1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104084 --- Comment #2 from Allan Jensen --- Removing the (std::nothrow), and declaring the untagged new operator (instead of declaring them deleted), seems to work around the issue.
[Bug c++/104084] New: [12 regression] Internal compiler error: tree check: expected target_expr, have compound_expr in build_new_1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104084 Bug ID: 104084 Summary: [12 regression] Internal compiler error: tree check: expected target_expr, have compound_expr in build_new_1 Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Another error encountered while compiling Qt with gcc 12. This time in libgav1 (used by Chromium). ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/libgav1/src/src/utils/dynamic_buffer.h:40:19: internal compiler error: tree check: expected target_expr, have compound_expr in build_new_1, at cp/init.c:3792 40 | buffer_.reset(new (std::nothrow) T[size]); | ^~ 0x8a12eb tree_check_failed(tree_node const*, char const*, int, char const*, ...) ../../gcc/tree.c:8702 0x6f06b9 tree_operand_check_code(tree_node*, tree_code, int, char const*, int, char const*) ../../gcc/tree.h:3950 0x6f06b9 build_new_1 ../../gcc/cp/init.c:3792 0xa5c8f1 build_new(unsigned int, vec**, tree_node*, tree_node*, vec**, int, int) ../../gcc/cp/init.c:4002 0xb4a6ad tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool) ../../gcc/cp/pt.c:20387 0xb750ca tsubst_copy_and_build_call_args ../../gcc/cp/pt.c:19761 0xb48c88 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool) ../../gcc/cp/pt.c:20508 0xb5d33f tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool) ../../gcc/cp/pt.c:19316 0xb5ed6b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool) ../../gcc/cp/pt.c:18329 0xb5e484 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool) ../../gcc/cp/pt.c:18301 0xb5e4ec tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool) ../../gcc/cp/pt.c:18658 0xb5c048 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool) ../../gcc/cp/pt.c:18287 0xb5c048 instantiate_body ../../gcc/cp/pt.c:26239 0xb5d080 instantiate_decl(tree_node*, bool, bool) ../../gcc/cp/pt.c:26532 0xb813d3 
instantiate_pending_templates(int) ../../gcc/cp/pt.c:26611 0xa3ba28 c_parse_final_cleanups() ../../gcc/cp/decl2.c:5097 Disabling optimizations or using different C++ standards, or fuzzing other compiler flags didn't seem to help. Let me know if you need the intermediate code.
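For readers without the preprocessed attachment, here is a minimal sketch of the kind of construct the backtrace and comment #2 describe: a template that evaluates `new (std::nothrow) T[size]` for a type whose untagged `operator new[]` is deleted. The names `Elem` and `DynamicBufferSketch` are hypothetical and are not taken from libgav1, and this is not a confirmed reproducer of the ICE; it only shows the shape of the code involved.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical element type: the untagged array new is deleted, so only the
// std::nothrow form can be used (the situation described in comment #2).
struct Elem {
    static void* operator new[](std::size_t) = delete;
    static void* operator new[](std::size_t n, const std::nothrow_t& tag) noexcept {
        return ::operator new[](n, tag);
    }
    static void operator delete[](void* p) noexcept { ::operator delete[](p); }
    static void operator delete[](void* p, const std::nothrow_t&) noexcept {
        ::operator delete[](p);
    }
    int value = 0;
};

// Hypothetical stand-in for the buffer class named in the diagnostic.
template <typename T>
class DynamicBufferSketch {
public:
    // Returns false if the allocation failed.
    bool Resize(std::size_t size) {
        buffer_.reset(new (std::nothrow) T[size]);  // expression flagged in the report
        return buffer_ != nullptr;
    }
    T* get() { return buffer_.get(); }

private:
    std::unique_ptr<T[]> buffer_;
};
```

The new-expression is only built when the member function is instantiated, which matches the tsubst_* frames in the backtrace above.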
[Bug c++/104078] New: Some type determination weirdness
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104078 Bug ID: 104078 Summary: Some type determination weirdness Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- In an attempt to compile Qt and specifically Qt WebEngine with latest gcc 12 from git today, I got the following weird error, from Skia inside Chromium inside QtWebengine: ./../../../../../qtwebengine/src/3rdparty/chromium/third_party/skia/src/gpu/GrRefCnt.h:173:73: error: ‘‘dependent_operator_type’ not supported by dump_type’ is not a valid type for a template non-type parameter 173 | gr_sp; | ^ ../../../../../../qtwebengine/src/3rdparty/chromium/third_party/skia/src/gpu/GrRefCnt.h:173:73: error: ‘‘dependent_operator_type’ not supported by dump_type’ is not a valid type for a template non-type parameter The error triggers in C++17 mode only, and the file compiles fine in C++20 mode (and in C++17 mode on older gccs, and clang and msvc, etc).
[Bug target/31667] Integer extensions vectorization could be improved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 --- Comment #6 from Allan Jensen --- (In reply to Andrew Pinski from comment #5) > We produce this now: > > movdqa x(%rip), %xmm1 > pxor%xmm0, %xmm0 > movdqa %xmm1, %xmm2 > punpckhbw %xmm0, %xmm1 > movaps %xmm1, y+16(%rip) > movdqa x+16(%rip), %xmm1 > punpcklbw %xmm0, %xmm2 > movaps %xmm2, y(%rip) > movdqa %xmm1, %xmm2 > punpckhbw %xmm0, %xmm1 > movaps %xmm1, y+48(%rip) > movdqa x+32(%rip), %xmm1 > punpcklbw %xmm0, %xmm2 > movaps %xmm2, y+32(%rip) > movdqa %xmm1, %xmm2 > punpckhbw %xmm0, %xmm1 > movaps %xmm1, y+80(%rip) > movdqa x+48(%rip), %xmm1 > punpcklbw %xmm0, %xmm2 > movaps %xmm2, y+64(%rip) > movdqa %xmm1, %xmm2 > punpckhbw %xmm0, %xmm1 > punpcklbw %xmm0, %xmm2 > movaps %xmm1, y+112(%rip) > movaps %xmm2, y+96(%rip) > > And even ICC produce a similar thing except scheduled differently. I hope that is because you forgot -msse4.1?
[Bug tree-optimization/78394] False positives of maybe-uninitialized with -Og
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78394 --- Comment #17 from Allan Jensen --- Yes, if you can figure out exactly what optimization passes it needs, then we could disable the warning when those passes are disabled.
[Bug c/97083] New: __builtin_lround and _builtin_llround not replaced with fcvtas on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97083 Bug ID: 97083 Summary: __builtin_lround and _builtin_llround not replaced with fcvtas on aarch64 Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- On aarch64, calling __builtin_round and casting the result to int or long long uses a single fcvtas instruction, but using __builtin_lround or __builtin_llround instead results in a function call. It seems they are missing the same optimization.
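A small sketch of the two forms being compared, for anyone who wants to inspect the generated assembly. The function names are made up for illustration; which instructions are emitted depends on the compiler version and flags, so none is claimed here.

```cpp
// Form 1: round to nearest (ties away from zero), then convert to integer.
long round_via_cast(double x) {
    return static_cast<long>(__builtin_round(x));
}

// Form 2: the combined builtin, which the report says becomes a library call.
long round_via_lround(double x) {
    return __builtin_lround(x);
}
```

Both functions return the same value for inputs that fit in a long, so comparing their assembly output (for example with -O2 -S on an aarch64 target) isolates the missed optimization.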
[Bug c/66970] Add __has_builtin() macro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66970 --- Comment #19 from Allan Jensen --- (In reply to felix from comment #18) > So even if this feature is adopted as-is, it will necessitate some changes > in the documentation. And while I can sympathise with claims that this > behaviour is surprising, what are the alternatives? If keywords should count > as built-ins, should __has_builtin(sizeof) expand to 1? Should > __has_builtin(volatile)? No, just keywords that begin with __builtin_.
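For context, this is roughly how such a macro tends to be used in portable code. It is a sketch, assuming a compiler that either provides __has_builtin or can live with the fallback definition; popcount32 is a hypothetical helper name.

```cpp
// Fallback for compilers that do not provide __has_builtin at all.
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_popcount)
// Use the builtin when the compiler reports it.
inline int popcount32(unsigned v) { return __builtin_popcount(v); }
#else
// Portable fallback: clear the lowest set bit until none remain.
inline int popcount32(unsigned v) {
    int n = 0;
    while (v) {
        v &= v - 1;
        ++n;
    }
    return n;
}
#endif
```

The question in the thread is only which names the macro should report as available; the usage pattern stays the same either way.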
[Bug rtl-optimization/43147] SSE shuffle merge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #9 from Allan Jensen --- (In reply to Marc Glisse from comment #6) > Created attachment 45303 [details] > example patch (untested) > > Making the meaning of shuffles visible in GIMPLE could help a bit (although > it wouldn't solve the problem completely because IIRC we don't dare combine > shuffles, since it is hard to find an optimal expansion for a shuffle and we > might pessimize some cases). With some other cases there are checks to see if a combined new tree can be generated as a single instruction and only combined in that case. And as soon as the compiler has SSSE3 available, we can shuffle anything as a single instruction, so combining them is always safe and fast.
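To illustrate what "combining shuffles" means here, a scalar sketch: two single-source permutes applied back to back are equivalent to one permute with a composed mask. The types and function names below are illustrative only and do not correspond to anything in GCC.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<int, 4>;
using Mask4 = std::array<int, 4>;

// result[i] = v[mask[i]], i.e. a single-source permute.
Vec4 shuffle(const Vec4& v, const Mask4& mask) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = v[mask[i]];
    return r;
}

// Mask equivalent to applying `first` and then `second`:
// shuffle(shuffle(v, first), second)[i] == v[first[second[i]]].
Mask4 combine(const Mask4& first, const Mask4& second) {
    Mask4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = first[second[i]];
    return r;
}
```

Whether the combined mask is cheap to expand is the target-specific part discussed in the comment above.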
[Bug c++/88475] -E -fdirectives-only clashes with raw strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88475 --- Comment #5 from Allan Jensen --- Note, you can fix the conflict with icecc by setting ICEC_REMOTE_CPP=0 Icecc will only do this to enable the remote cpp feature.
[Bug debug/68836] GCC can't properly emit debug info for function arguments in a back-trace when using -Og
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68836 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #8 from Allan Jensen --- Duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78685
[Bug debug/86582] [debug] vla size reported as 0 at Og
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86582 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #3 from Allan Jensen --- Wouldn't this be solved by disabling -ftree-dse for -Og, whereas bug 78685 is more complicated?
[Bug target/89057] [8/9 Regression] AArch64 ld3 st4 less optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89057 --- Comment #4 from Allan Jensen --- While that change might have made things worse, the real problem is probably that the registers for those instructions are loaded and stored using intrinsics, so proper register allocation and combining can't be performed. For ARMv7, for instance, the same code can be optimized to have no moves, just a single vswp instruction between ld3 and st4. MSVC and clang can do that, but GCC cannot.
[Bug target/89058] GCC 7->8 regression: ARM(64) ld3 st4 less optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89058 --- Comment #2 from Allan Jensen --- Oops, sorry.
[Bug target/89058] New: GCC 7->8 regression: ARM(64) ld3 st4 less optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89058 Bug ID: 89058 Summary: GCC 7->8 regression: ARM(64) ld3 st4 less optimized Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- When using the vld3_u8 and vst4_u8 intrinsics, the code generated with gcc8 is less efficient than the code generated with gcc7. One has 3 moves, and the other 9 moves. The code in question is: #include #include void qt_convert_rgb888_to_rgb32_neon(unsigned *dst, const unsigned char *src, int len) { if (!len) return; const unsigned *const end = dst + len; // align dst on 64 bits const int offsetToAlignOn8Bytes = (reinterpret_cast(dst) >> 2) & 0x1; for (int i = 0; i < offsetToAlignOn8Bytes; ++i) { *dst++ = 0xff00 | (src[0] << 16) | (src[1] << 8) | src[2]; src += 3; } if ((len - offsetToAlignOn8Bytes) >= 8) { const unsigned *const simdEnd = end - 7; // non-inline asm version (uses more moves) uint8x8x4_t dstVector; dstVector.val[3] = vdup_n_u8(0xff); do { uint8x8x3_t srcVector = vld3_u8(src); src += 3 * 8; dstVector.val[0] = srcVector.val[2]; dstVector.val[1] = srcVector.val[1]; dstVector.val[2] = srcVector.val[0]; vst4_u8((uint8_t*)dst, dstVector); dst += 8; } while (dst < simdEnd); } while (dst != end) { *dst++ = 0xff00 | (src[0] << 16) | (src[1] << 8) | src[2]; src += 3; } } With gcc 7.3 the inner loop is: .L5: ld3 {v4.8b - v6.8b}, [x1] add x1, x1, 24 orr v3.16b, v7.16b, v7.16b mov v0.8b, v6.8b mov v1.8b, v5.8b mov v2.8b, v4.8b st4 {v0.8b - v3.8b}, [x0] add x0, x0, 32 cmp x3, x0 bhi .L5 With gcc 8.2 the inner loop is: .L5: ld3 {v4.8b - v6.8b}, [x1] adrp x3, .LC1 add x1, x1, 24 ldr q3, [x3, #:lo12:.LC1] mov v16.8b, v6.8b mov v7.8b, v5.8b mov v4.8b, v4.8b ins v16.d[1], v17.d[0] ins v7.d[1], v17.d[0] ins v4.d[1], v17.d[0] mov v0.16b, v16.16b mov v1.16b, v7.16b mov v2.16b, v4.16b st4 {v0.8b - v3.8b}, [x0] add x0, x0, 32 cmp x2, x0 bhi .L5
[Bug target/89057] New: GCC 7->8 regression: ARM(64) ld3 st4 less optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89057 Bug ID: 89057 Summary: GCC 7->8 regression: ARM(64) ld3 st4 less optimized Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- When using the vld3_u8 and vst4_u8 intrinsics, the code generated with gcc8 is less efficient than the code generated with gcc7. One has 3 moves, and the other 9 moves. The code in question is: #include #include void qt_convert_rgb888_to_rgb32_neon(unsigned *dst, const unsigned char *src, int len) { if (!len) return; const unsigned *const end = dst + len; // align dst on 64 bits const int offsetToAlignOn8Bytes = (reinterpret_cast(dst) >> 2) & 0x1; for (int i = 0; i < offsetToAlignOn8Bytes; ++i) { *dst++ = 0xff00 | (src[0] << 16) | (src[1] << 8) | src[2]; src += 3; } if ((len - offsetToAlignOn8Bytes) >= 8) { const unsigned *const simdEnd = end - 7; // non-inline asm version (uses more moves) uint8x8x4_t dstVector; dstVector.val[3] = vdup_n_u8(0xff); do { uint8x8x3_t srcVector = vld3_u8(src); src += 3 * 8; dstVector.val[0] = srcVector.val[2]; dstVector.val[1] = srcVector.val[1]; dstVector.val[2] = srcVector.val[0]; vst4_u8((uint8_t*)dst, dstVector); dst += 8; } while (dst < simdEnd); } while (dst != end) { *dst++ = 0xff00 | (src[0] << 16) | (src[1] << 8) | src[2]; src += 3; } } With gcc 7.3 the inner loop is: .L5: ld3 {v4.8b - v6.8b}, [x1] add x1, x1, 24 orr v3.16b, v7.16b, v7.16b mov v0.8b, v6.8b mov v1.8b, v5.8b mov v2.8b, v4.8b st4 {v0.8b - v3.8b}, [x0] add x0, x0, 32 cmp x3, x0 bhi .L5 With gcc 8.2 the inner loop is: .L5: ld3 {v4.8b - v6.8b}, [x1] adrp x3, .LC1 add x1, x1, 24 ldr q3, [x3, #:lo12:.LC1] mov v16.8b, v6.8b mov v7.8b, v5.8b mov v4.8b, v4.8b ins v16.d[1], v17.d[0] ins v7.d[1], v17.d[0] ins v4.d[1], v17.d[0] mov v0.16b, v16.16b mov v1.16b, v7.16b mov v2.16b, v4.16b st4 {v0.8b - v3.8b}, [x0] add x0, x0, 32 cmp x2, x0 bhi .L5
[Bug c++/88475] -E -fdirectives-only clashes with raw strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88475 --- Comment #3 from Allan Jensen --- No, it has to be a raw-string to be valid. https://wandbox.org/permlink/I0yF3U3OXoH6LbIM
[Bug c++/88475] -E -fdirectives-only clashes with raw strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88475 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #1 from Allan Jensen --- I also see this with Debian's gcc 8.2.0 (gcc version 8.2.0 (Debian 8.2.0-14))
[Bug tree-optimization/78394] False positives of maybe-uninitialized with -Og
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78394 --- Comment #9 from Allan Jensen --- I see two other low-effort ways to possibly fix the issue: disable the warning like for -O0 as it is buggy, or, if we believe it still has some value in -Og even with the false positives, just remove it from -Wall or -Wextra, so it at least doesn't get enabled unless explicitly asked for.
[Bug c++/58407] [C++11] Should warn about deprecated implicit generation of copy constructor/assignment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58407 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #24 from Allan Jensen --- So with this the rule-of-three is now the rule-of-four or six?
[Bug target/85950] Unsafe-math-optimizations regresses optimization using SSE4.1 roundss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85950 --- Comment #6 from Allan Jensen --- Btw, I have tested and the patch works for my cases.
[Bug target/85950] Unsafe-math-optimizations regresses optimization using SSE4.1 roundss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85950 --- Comment #4 from Allan Jensen --- Btw, I found this while trying to figure out why std::round() wasn't also optimized to a single roundss instruction. Is that just a missing optimization, or is there a quirk that makes it not fit? I noticed the definition of the ROUND enum in i386.md is even missing the entry for normal rounding (0 AFAIK).
[Bug rtl-optimization/85950] Unsafe-math-optimizations regresses optimization using SSE4.1 roundss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85950 --- Comment #2 from Allan Jensen --- Created attachment 44196 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44196&action=edit Example To trigger it, you need both a rounding conversion and a conversion to integer.
[Bug rtl-optimization/85950] Unsafe-math-optimizations regresses optimization using SSE4.1 roundss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85950 --- Comment #1 from Allan Jensen --- Sorry, forget the example above. I will attach the real code that triggers it. Note it does not trigger with -fno-signed-zeros, -fno-trapping-math, -fassociative-math and -freciprocal-math, so it is something specific to unsafe-math-optimizations itself.
[Bug rtl-optimization/85950] New: Unsafe-math-optimizations regresses optimization using SSE4.1 roundss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85950 Bug ID: 85950 Summary: Unsafe-math-optimizations regresses optimization using SSE4.1 roundss Product: gcc Version: 8.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- When SSE4.1 is available, std::floor, std::ceil and their C counterparts are inlined as a single roundss instruction. However, if compiled with -Ofast, -ffast-math or -funsafe-math-optimizations specifically, you instead get a slightly improved version of the much slower SSE2 implementation of the same functions. For instance, compiling this with -msse4.1: #include double stdfloor(double a) { return std::floor(a); } double stdceil(double a) { return std::ceil(a); }
[Bug tree-optimization/85692] Two source permute not used for vector initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85692 --- Comment #5 from Allan Jensen --- Created attachment 44088 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44088&action=edit suggested patch
[Bug tree-optimization/85692] Two source permute not used for vector initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85692 --- Comment #4 from Allan Jensen --- Note I already posted a patch on gcc-patches myself. It is very similar to yours.
[Bug tree-optimization/85692] Two source permute not used for vector initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85692 --- Comment #1 from Allan Jensen --- Created attachment 44084 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44084&action=edit construct.cc Motivating examples. Compile with -msse4.1 for the second case.
[Bug tree-optimization/85692] New: Two source permute not used for vector initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85692 Bug ID: 85692 Summary: Two source permute not used for vector initialization Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- If a vector initialization is using elements from only a single vector source, it will be optimized as a shuffle, but if it is using elements from two, it will not be attempted. This appears to be a missing case in tree-ssa-forwprop.c:simplify_vector_constructor
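A minimal sketch of the two situations, using the GCC/Clang vector_size extension. The typedef and function names are illustrative and are not taken from the attached construct.cc; no claim is made about the exact instructions a given compiler version emits.

```cpp
// Four 32-bit ints in one 16-byte vector (GCC/Clang vector extension).
typedef int v4si __attribute__((vector_size(16)));

// All elements come from one source vector: the case the report says is
// already recognized as a shuffle.
v4si reverse_one(v4si a) {
    return v4si{a[3], a[2], a[1], a[0]};
}

// Elements come from two source vectors: the case the report says is not
// attempted, although a two-source permute could express it.
v4si interleave_two(v4si a, v4si b) {
    return v4si{a[0], b[1], a[2], b[3]};
}
```

Comparing the assembly of the two functions at -O2 should show whether the constructor was turned into a permute.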
[Bug rtl-optimization/85551] No strength reduction of modulo and integer vision
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85551 --- Comment #2 from Allan Jensen --- Hmm.. I appear to have made unsafe assumptions in the mod_opt cases. The first safe optimization version would then be: void mod_opt(int *a, int count, int stride, unsigned width) { int pos_opt = 0; for (int i = 0; i < count; ++i) { if (pos_opt < 0 || pos_opt >= width) pos_opt = pos_opt % width; a[i] = pos_opt; pos_opt += stride; } } To be able to completely get rid of modulo, you need to know or check for the size of stride compared to width.
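A small self-checking sketch of the same idea, under simplifying assumptions that are not in the comment above: stride is non-negative, width is a positive int, and i * stride does not overflow. The function names are hypothetical, and the naive loop is a guess at the kind of code being discussed since the attachment is not shown here.

```cpp
#include <vector>

// Naive form: one modulo per iteration.
std::vector<int> mod_naive(int count, int stride, int width) {
    std::vector<int> out(count);
    for (int i = 0; i < count; ++i) out[i] = (i * stride) % width;
    return out;
}

// Guarded form: keep a running position and only pay for the modulo when
// the position has left the range [0, width).
std::vector<int> mod_guarded(int count, int stride, int width) {
    std::vector<int> out(count);
    int pos = 0;
    for (int i = 0; i < count; ++i) {
        if (pos >= width) pos %= width;
        out[i] = pos;
        pos += stride;
    }
    return out;
}
```

When stride is small compared to width the guard is rarely taken, which is where the saving comes from; when stride exceeds width the modulo still runs on most iterations.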
[Bug rtl-optimization/85551] No strength reduction of modulo and integer vision
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85551 --- Comment #1 from Allan Jensen --- I also stumbled on this old motivating article when I tried googling the concept: http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-600.pdf
[Bug rtl-optimization/85551] New: No strength reduction of modulo and integer vision
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85551 Bug ID: 85551 Summary: No strength reduction of modulo and integer vision Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 44030 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44030&action=edit strmod.cpp Many simple loops using modulo naively can be optimized to not perform the expensive modulo/division every iteration, but GCC does not perform this strength reduction. I have attached a motivating example including two iterations of optimizations: an easy safe one (though it might interfere with vectorization if the arch has vectorized integer divisions), and a more aggressive one that is much faster in some cases but not always.
[Bug tree-optimization/85406] Unnecessary blend when vectorizing short-cutted calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406 --- Comment #6 from Allan Jensen --- Yeah, the a==255 was actually not a case I would expect the compiler to solve, which is why I changed the example to the a==0 case, which should be solvable using existing constant propagation. Note you can put both short-cuts in, though as it stands only gcc 7 and 8 can vectorize it with two conditions, so we can't use that in general code as we need it to be fast elsewhere too.
[Bug tree-optimization/85406] Unnecessary blend when vectorizing short-cutted calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406 --- Comment #4 from Allan Jensen --- Created attachment 43995 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43995&action=edit gccbug85406.cpp This version compiles with a pcmpeqd and pandn instead of a blend, but the principle is the same. Though the lack of a ptest in the beginning is worse, as that risks a performance regression compared to non-vectorized.
[Bug tree-optimization/85406] Unnecessary blend when vectorizing short-cutted calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406 --- Comment #3 from Allan Jensen --- You need to add the loop around it: void test(unsigned *buffer, int count) { for (int i = 0; i < count; ++i) buffer[i] = qPremultiply(buffer[i]); }
[Bug tree-optimization/85406] Unnecessary blend when vectorizing short-cutted calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406 --- Comment #1 from Allan Jensen --- Note it might be hard to figure out for the compiler that the result for a==255 will leave the input unchanged, but you can observe the same if you instead test for a == 0 (and return 0). In that case the compiler should have enough math deduction to be able to tell that the result of a==0 is always 0.
[Bug tree-optimization/85406] New: Unnecessary blend when vectorizing short-cutted calculations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406 Bug ID: 85406 Summary: Unnecessary blend when vectorizing short-cutted calculations Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- If you have something like this: inline unsigned qPremultiply(unsigned x) { const unsigned a = x >> 24; if (a == 255) return x; unsigned t = (x & 0xff00ff) * a; t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8; t &= 0xff00ff; x = ((x >> 8) & 0xff) * a; x = (x + ((x >> 8) & 0xff) + 0x80); x &= 0xff00; return x | t | (a << 24); } Gcc will vectorize it so that the longer calculation is always performed and with an added blend in the end to merge the two different return values. This is however unnecessary as the calculation will give the same result, and thus the blend can be saved. Also in any case it is actually a bit unsafe to vectorize as the performance difference between the two branches is substantial, and it happens that in this case the short-cut is likely to be valid most of the time, so a nonvectorized loop might be faster than a vectorized one by doing a lot less. The latter can be fixed, if the short-cut was also vectorized, for instance making the test for 4 values at a time and skip the long route if none of them need it.
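The last suggestion (test four values at a time and skip the long route if none of them need it) can be sketched in scalar code as below. qPremultiply is copied from the report; premultiply_blocked is a hypothetical name, and the sketch assumes unsigned is 32 bits wide. It only illustrates the control flow a vectorizer could emit and is not what GCC generates.

```cpp
// Copied from the report above.
inline unsigned qPremultiply(unsigned x) {
    const unsigned a = x >> 24;
    if (a == 255) return x;
    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;
    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);
}

// Hypothetical blocked loop: one test per group of four pixels.
void premultiply_blocked(unsigned* buffer, int count) {
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        // The AND of four pixels has alpha 255 only if all four are opaque.
        const unsigned all = buffer[i] & buffer[i + 1] & buffer[i + 2] & buffer[i + 3];
        if ((all >> 24) == 255) continue;  // short-cut: nothing to do for this group
        for (int j = 0; j < 4; ++j) buffer[i + j] = qPremultiply(buffer[i + j]);
    }
    for (; i < count; ++i) buffer[i] = qPremultiply(buffer[i]);  // tail
}
```

In a vectorized version the group test would correspond to a single compare plus a ptest-style branch, so mostly opaque images skip the multiply path entirely.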
[Bug tree-optimization/84777] -Os inhibits all vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 --- Comment #8 from Allan Jensen --- Yes, those I say are missing are compared to -O2. I was investigating this in relation to Qt. We either build these files with -O3, or with -Os for customers that are binary size sensitive. Since some of the image handling routines are quite heavy and have been written for auto-vectorization I was just checking if I could get it to work and the results with your patch are quite good: Normal sizes of qdrawhelper.o with -O3/-O2/-Os: 277704 / 198984 / 168440 With -O2 -ftree-vectorize: 242224 With -O2 -fopenmp: 219536 With -Os -ftree-loop-vectorize: 168440 (no change) With -Os -fopenmp: 177144 (with your patch) So most of the -Os benefit and still many of the central draw loops auto-vectorized. Haven't benchmarked it yet though.
[Bug tree-optimization/84777] -Os inhibits all vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 --- Comment #6 from Allan Jensen --- Great. Your patch worked with 90% of the marked loops! The remaining report things like this with -fopt-info-vec-missed: note: not vectorized: relevant stmt not supported: idisty.872_437 = (unsigned int) idisty_386; note: bad operation or unsupported loop bound. But the result is already pretty good for -fopenmp with manually marked loops.
[Bug tree-optimization/84777] -Os inhibits all vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 --- Comment #4 from Allan Jensen --- I will try the patch. I just tried -fopt-info-vec-missed and the message reported for every loop was: note: not vectorized: latch block not empty. note: bad loop form.
[Bug tree-optimization/84777] New: -Os inhibits all vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 Bug ID: 84777 Summary: -Os inhibits all vectorization Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Neither the command-line flag -ftree-loop-vectorize nor -fopenmp combined with "#pragma omp simd" works when -Os is active. It seems that when requested manually, vectorization should work even in -Os mode. I can almost see why -ftree-loop-vectorize wouldn't work, which is why I tried the manual marking of loops to vectorize, but the latter didn't work either. I would suggest documenting this behavior and fixing at least the vectorization of manually marked loops.
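For reference, a manually marked loop of the kind described looks roughly like this. The function is a made-up example, not code from Qt. The pragma only takes effect when OpenMP SIMD support is enabled (for example with -fopenmp or -fopenmp-simd); otherwise it is ignored and the loop is compiled as ordinary scalar code.

```cpp
// Hypothetical kernel: scale an array of floats by a constant factor.
void scale(float* dst, const float* src, int n, float k) {
    // Explicit request to vectorize this loop.
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        dst[i] = src[i] * k;
    }
}
```

Building this with -Os -fopenmp -fopt-info-vec-missed is one way to see whether the explicit request is honored.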
[Bug tree-optimization/84670] [8 Regression] ICE: in compute_antic_aux, at tree-ssa-pre.c:2148 with -O2 -fno-tree-dominator-opts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84670 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #13 from Allan Jensen --- *** Bug 84718 has been marked as a duplicate of this bug. ***
[Bug middle-end/84718] [8 regression] ICE when compiling chromium
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84718 Allan Jensen changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #5 from Allan Jensen --- Yes, an updated build which includes the fix from 84670 works. *** This bug has been marked as a duplicate of bug 84670 ***
[Bug middle-end/84718] [8 regression] ICE when compiling chromium
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84718 --- Comment #4 from Allan Jensen --- I will update my gcc build and check
[Bug middle-end/84718] [8 regression] ICE when compiling chromium
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84718 --- Comment #2 from Allan Jensen --- Created attachment 43568 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43568&action=edit spdy_alt_svc_wire_format.ii.gz
[Bug middle-end/84718] [8 regression] ICE when compiling chromium
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84718 --- Comment #1 from Allan Jensen --- Created attachment 43567 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43567&action=edit spdy_alt_svc_wire_format.s
[Bug middle-end/84718] New: [8 regression] ICE when compiling chromium
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84718 Bug ID: 84718 Summary: [8 regression] ICE when compiling chromium Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 43566 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43566&action=edit gcc log Using latest gcc 8 updated today I hit an internal compiler error in the Chromium part of qtwebengine in the file net/spdy/core/spdy_alt_svc_wire_format.cc
[Bug middle-end/84019] [7/8 regression] ICE in fold-const of std::complex division
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84019 Allan Jensen changed: What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |INVALID --- Comment #9 from Allan Jensen --- I now have trouble reproducing it. Let's assume for now that my configuration was wrong at the time this was still reproducible for me.
[Bug middle-end/84019] [7/8 regression] ICE in fold-const of std::complex division
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84019 --- Comment #8 from Allan Jensen --- Yes, I will take a look again and produce the intermediate results
[Bug lto/63688] all_symbols_read_handler: Assertion `lto_wrapper_argv' failed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63688 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #2 from Allan Jensen --- Yeah, that assert is kind of useless and the -plugin argument is basically pointless without the undocumented but required -plugin-opt commands. Though maybe that is a binutils bug?
[Bug middle-end/84019] [7/8 regression] ICE under fold-const
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84019 --- Comment #2 from Allan Jensen --- I can provide the intermediate code, but I haven't created a reduced test-case, so it would be big.
[Bug middle-end/84019] [7/8 regression] ICE under fold-const
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84019 --- Comment #1 from Allan Jensen --- First line of the ICE (the only line reported by system gcc) ../../src/init2.c:52: MPFR assertion failed: p >= 2 && p <= ((mpfr_prec_t)((mpfr_uprec_t)(~(mpfr_uprec_t)0)>>1))
[Bug middle-end/84019] New: [7/8 regression] ICE under fold-const
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84019 Bug ID: 84019 Summary: [7/8 regression] ICE under fold-const Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- ICE when compiling Chromium in QtWebEngine under certain conditions. With gcc 8: during GIMPLE pass: fre ../../../../../qtwebengine/src/3rdparty/chromium/third_party/WebKit/Source/platform/audio/IIRFilter.cpp: In member function ‘void blink::IIRFilter::GetFrequencyResponse(int, const float*, float*, float*)’: ../../../../../qtwebengine/src/3rdparty/chromium/third_party/WebKit/Source/platform/audio/IIRFilter.cpp:221:1: internal compiler error: Aborted } // namespace blink ^ 0xe6e39f crash_signal ../../gcc/toplev.c:325 0xa49596 do_mpc_arg2(tree_node*, tree_node*, tree_node*, int, int (*)(__mpc_struct*, __mpc_struct const*, __mpc_struct const*, int)) ../../gcc/builtins.c:10478 0xbb51f4 const_binop ../../gcc/fold-const.c:1405 0xbb5ee7 const_binop(tree_code, tree_node*, tree_node*, tree_node*) ../../gcc/fold-const.c:1705 0x11daa14 gimple_resimplify2(gimple**, code_helper*, tree_node*, tree_node**, tree_node* (*)(tree_node*)) ../../gcc/gimple-match-head.c:133 0x12ad258 gimple_simplify(gimple*, code_helper*, tree_node**, gimple**, tree_node* (*)(tree_node*), tree_node* (*)(tree_node*)) ../../gcc/gimple-match-head.c:643 0xbf904a gimple_fold_stmt_to_constant_1(gimple*, tree_node* (*)(tree_node*), tree_node* (*)(tree_node*)) ../../gcc/gimple-fold.c:6117 0x101a604 try_to_simplify ../../gcc/tree-ssa-sccvn.c:3982 0x101a604 visit_use ../../gcc/tree-ssa-sccvn.c:4033 0x101c736 process_scc ../../gcc/tree-ssa-sccvn.c:4363 0x101c736 extract_and_process_scc_for_name ../../gcc/tree-ssa-sccvn.c:4434 0x101c736 DFS ../../gcc/tree-ssa-sccvn.c:4484 0x101cbd3 sccvn_dom_walker::before_dom_children(basic_block_def*) ../../gcc/tree-ssa-sccvn.c:4917 0x15ad907 
dom_walker::walk(basic_block_def*) ../../gcc/domwalk.c:308 0x101d6ca run_scc_vn(vn_lookup_kind) ../../gcc/tree-ssa-sccvn.c:5033 0x101deea execute ../../gcc/tree-ssa-sccvn.c:6015 The same happens with system gcc (7.2 from Debian), but not with system gcc-6
[Bug tree-optimization/83847] [8 Regression] ICE in vectorizable_load, at tree-vect-stmts.c:7365
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83847 --- Comment #4 from Allan Jensen --- Full output from the ICE: during GIMPLE pass: vect /src/qt5/qtbase/src/corelib/kernel/qmetaobjectbuilder.cpp: In function ‘int buildMetaObject(QMetaObjectBuilderPrivate*, char*, int, bool)’: /src/qt5/qtbase/src/corelib/kernel/qmetaobjectbuilder.cpp:1174:12: internal compiler error: in vectorizable_load, at tree-vect-stmts.c:7365 static int buildMetaObject(QMetaObjectBuilderPrivate *d, char *buf, ^~~ 0x74c949 vectorizable_load ../../gcc/tree-vect-stmts.c:7365 0x10a40b4 vect_analyze_stmt(gimple*, bool*, _slp_tree*, _slp_instance*) ../../gcc/tree-vect-stmts.c:9355 0x10bddee vect_analyze_loop_operations ../../gcc/tree-vect-loop.c:1875 0x10bddee vect_analyze_loop_2 ../../gcc/tree-vect-loop.c:2254 0x10bddee vect_analyze_loop(loop*, _loop_vec_info*) ../../gcc/tree-vect-loop.c:2546 0x10d6b2d vectorize_loops() ../../gcc/tree-vectorizer.c:664 Please submit a full bug report,
[Bug tree-optimization/83847] [8 Regression] ICE in vectorizable_load, at tree-vect-stmts.c:7365
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83847 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #3 from Allan Jensen --- Affects building Qt 5.10 QtCore, but only if optimizing for certain architectures. I triggered it with /opt/gcc/bin/g++-8 -c -pipe -march=skylake -g -O3 -std=c++1z -fvisibility=hidden -fvisibility-inlines-hidden -Wall -W -Wvla -Wdate-time -Wshift-overflow=2 -Wduplicated-cond -Wno-stringop-overflow -D_REENTRANT -fPIC -DQT_NO_USING_NAMESPACE -DQT_NO_FOREACH -DELF_INTERPRETER=\"/lib64/ld-linux-x86-64.so.2\" -DQT_NO_NARROWING_CONVERSIONS_IN_CONNECT -DQT_BUILD_CORE_LIB -DQT_BUILDING_QT -DQT_NO_CAST_TO_ASCII -DQT_ASCII_CAST_WARNINGS -DQT_MOC_COMPAT -DQT_USE_QSTRINGBUILDER -DQT_DEPRECATED_WARNINGS -DQT_DISABLE_DEPRECATED_BEFORE=0x05 -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -DQT_NO_DEBUG -DPCRE2_CODE_UNIT_WIDTH=16 -I/src/qt5/qtbase/src/corelib -I. -Iglobal -I/src/qt5/qtbase/src/3rdparty/harfbuzz/src -I/src/qt5/qtbase/src/3rdparty/md5 -I/src/qt5/qtbase/src/3rdparty/md4 -I/src/qt5/qtbase/src/3rdparty/sha3 -I/src/qt5/qtbase/src/3rdparty/forkfd -I../../include -I../../include/QtCore -I../../include/QtCore/5.10.1 -I../../include/QtCore/5.10.1/QtCore -I.moc -I/src/qt5/qtbase/src/3rdparty/pcre2/src -isystem /usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/src/qt5/qtbase/mkspecs/linux-g++ -o .obj/qmetaobjectbuilder.o /src/qt5/qtbase/src/corelib/kernel/qmetaobjectbuilder.cpp Removing -march=skylake worked.
[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426 --- Comment #3 from Allan Jensen --- Note it appears the fact it can do it at all in -Os is new in gcc 7
[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426 --- Comment #2 from Allan Jensen --- Created attachment 42301 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42301&action=edit Assembler output with -Os -ftree-slp-vectorize
[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426 --- Comment #1 from Allan Jensen --- Created attachment 42300 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42300&action=edit Assembler output with -O3
[Bug tree-optimization/82426] New: Missed tree-slp-vectorization on -O2 and -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426 Bug ID: 82426 Summary: Missed tree-slp-vectorization on -O2 and -O3 Product: gcc Version: 7.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 42299 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42299&action=edit vectslp.cpp The attached example is a simple matrix multiplication. With -O3 or -O2 -ftree-slp-vectorize the basic-block is not vectorized. Oddly, with -Os -ftree-slp-vectorize it is.
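The attachment itself is not reproduced here. As a rough idea of the kind of code the report is about, here is a hypothetical 4x4 row-major matrix multiplication; mat4_mul is an illustrative name and this is not the attached vectslp.cpp. Each output row is a straight-line group of similar multiply-add expressions, which is the sort of basic block SLP vectorization targets.

```cpp
#include <cassert>

// Hypothetical 4x4 matrix product, row-major storage, out = a * b.
void mat4_mul(const float* a, const float* b, float* out)
{
    // The result must not alias either input.
    assert(out != a && out != b);

    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            out[row * 4 + col] = a[row * 4 + 0] * b[0 * 4 + col]
                               + a[row * 4 + 1] * b[1 * 4 + col]
                               + a[row * 4 + 2] * b[2 * 4 + col]
                               + a[row * 4 + 3] * b[3 * 4 + col];
        }
    }
}
```

Comparing the assembler from -O3 against -Os -ftree-slp-vectorize, as the attachments on this bug do, is how the difference was observed; whether this sketch shows the same difference is not verified.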
[Bug rtl-optimization/81174] bswap not recognized in |= statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81174 Allan Jensen changed: What|Removed |Added Version|6.1.1 |7.1.0 --- Comment #1 from Allan Jensen --- Also reproduced with gcc 4.8, 4.9, 5 and 7. Works in clang. With gcc 6+ it would sometimes work if bswap was called as part of a constructor.
[Bug rtl-optimization/81174] New: bswap not recognized in |= statement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81174 Bug ID: 81174 Summary: bswap not recognized in |= statement Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 41610 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41610&action=edit bswap-issue.cc While writing a big-endian bitfield accessor I noticed that bswap was not always recognized. It appears the problem triggers together with |= statements; at least replacing the |= statement with += solves the issue. I have attached a test case. The faulty one is the first; the other two work.
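To make the |= versus += distinction concrete, here is a hypothetical pair of big-endian 32-bit loads. The names load_be32_or and load_be32_add are illustrative, and this is not the attached bswap-issue.cc, so it may or may not trigger the missed optimization. Both functions compute the same value; they differ only in the operator used to combine the bytes.

```cpp
#include <cassert>
#include <cstdint>

// Assemble a big-endian 32-bit value with |=.
std::uint32_t load_be32_or(const unsigned char* p)
{
    assert(p != nullptr);
    std::uint32_t v = 0;
    v |= static_cast<std::uint32_t>(p[0]) << 24;
    v |= static_cast<std::uint32_t>(p[1]) << 16;
    v |= static_cast<std::uint32_t>(p[2]) << 8;
    v |= static_cast<std::uint32_t>(p[3]);
    return v;
}

// Same computation with +=. The byte fields do not overlap, so the
// sum equals the bitwise OR.
std::uint32_t load_be32_add(const unsigned char* p)
{
    assert(p != nullptr);
    std::uint32_t v = 0;
    v += static_cast<std::uint32_t>(p[0]) << 24;
    v += static_cast<std::uint32_t>(p[1]) << 16;
    v += static_cast<std::uint32_t>(p[2]) << 8;
    v += static_cast<std::uint32_t>(p[3]);
    return v;
}
```

On a little-endian target one would hope both become a plain load followed by bswap; inspecting the assembler output is the way to check.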
[Bug ipa/80277] New: ipa-icf missing overlooking functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80277 Bug ID: 80277 Summary: ipa-icf missing overlooking functions Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: ipa Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 41100 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41100&action=edit icf.cc Several functions that produce identical assembler are not merged by ipa-icf. I have attached an example, and only the two functions foo0 and foo1 that are identical in every detail are merged, though all the foo* functions produce identical assembler. I theorize it is because the function signature is compared before the content, and the templates and different types might cause that early comparison to fail when it shouldn't. I added a second test that just changed the return value but kept everything else identical and it also wasn't merged. A little unrelated: I noted the ipa-icf optimization is undone by -O3 as it re-inlines, though that is kind of pointless unless it is needed for second level inlining.
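As an illustration of "different types, identical assembler", consider the hypothetical template below. skip_four is an illustrative name and this is not the attached icf.cc. On targets where int and float are both 4 bytes, each instantiation simply advances the pointer by 16 bytes, so the generated instructions would be expected to match even though the signatures differ. Whether ipa-icf merges this particular pair is not something the report states.

```cpp
#include <cassert>

// Hypothetical template whose instantiations differ only in type.
template <typename T>
T* skip_four(T* p)
{
    assert(p != nullptr);
    return p + 4;  // advances by 4 * sizeof(T) bytes
}

// Explicit instantiations so both functions are emitted in this
// translation unit and are candidates for merging.
template int* skip_four<int>(int*);
template float* skip_four<float>(float*);
```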
[Bug target/80040] New: SSE4.1 ptest not always merged
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80040 Bug ID: 80040 Summary: SSE4.1 ptest not always merged Product: gcc Version: 6.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 40971 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40971&action=edit Example The intrinsics _mm_testz_si128 and _mm_testc_si128 both map to the exact same instruction and parameters. They are sometimes merged into just one instruction, but not always. I have attached an example where the two intrinsics are merged in the first function but not in the second.
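The following sketch shows why a single ptest can serve both intrinsics: the instruction sets ZF from (a AND b) and CF from (NOT a AND b) in one go, and each intrinsic reads one of the two flags. The struct MaskSummary and function summarize are hypothetical names, not from the attachment. The scalar branch exists only so the sketch also builds without -msse4.1.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

struct MaskSummary {
    bool none_set;  // no bit selected by mask is set in the value
    bool all_set;   // every bit selected by mask is set in the value
};

MaskSummary summarize(const std::uint8_t* value, const std::uint8_t* mask)
{
    assert(value != nullptr && mask != nullptr);
#if defined(__SSE4_1__)
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(value));
    const __m128i m = _mm_loadu_si128(reinterpret_cast<const __m128i*>(mask));
    // Both calls use the same operands, so one ptest could provide both flags.
    const bool none = _mm_testz_si128(v, m) != 0;  // ZF: (v & m) == 0
    const bool all = _mm_testc_si128(v, m) != 0;   // CF: (~v & m) == 0
    return {none, all};
#else
    // Scalar fallback with the same semantics over 16 bytes.
    bool none = true;
    bool all = true;
    for (int i = 0; i < 16; ++i) {
        if ((value[i] & mask[i]) != 0) none = false;
        if ((~value[i] & mask[i]) != 0) all = false;
    }
    return {none, all};
#endif
}
```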
[Bug target/80040] SSE4.1 ptest not always merged
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80040 --- Comment #2 from Allan Jensen --- Created attachment 40973 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40973&action=edit Assembler output from gcc 6 Easier to compare
[Bug target/80040] SSE4.1 ptest not always merged
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80040 --- Comment #1 from Allan Jensen --- Created attachment 40972 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40972&action=edit Assembler output
[Bug target/78921] New: SSE/AVX shuffle intrinsics uses builtins instead of __builtin_shuffle
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78921 Bug ID: 78921 Summary: SSE/AVX shuffle intrinsics uses builtins instead of __builtin_shuffle Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- The intrinsics for x86 SIMD shuffle instructions could be redeclared using __builtin_shuffle. This would help folding and better instruction selection. This has already been suggested on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756 and is also a necessary component of solving one part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78563 .
[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 --- Comment #13 from Allan Jensen --- The question is if the unaligned store is still slow on Excavator and Ryzen which support AVX2. As far as I understand the bulldozer architectures just prefer split AVX because it was basically emulating them with 128-bit micro-ops anyway.
[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 --- Comment #11 from Allan Jensen --- Btw, did you benchmark store splitting on AMD? It is also enabled for BDVER and ZNVER1.
[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 --- Comment #10 from Allan Jensen --- That would solve the problem, but also leave the behavior as Sandybridge only (nehalem didn't have AVX).
[Bug target/59874] Missing builtin (__builtin_clzs) when compiling with g++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59874 --- Comment #15 from Allan Jensen --- Yes, the patch works and it also evaluates at compile time.
[Bug target/59874] Missing builtin (__builtin_clzs) when compiling with g++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59874 --- Comment #8 from Allan Jensen --- Thanks that looks good. I will test it when I have a chance. I am changing the Qt sources to not assume the presence of __builtin_clzs when __BMI__ is defined. It can use __builtin_clz() and __builtin_ctz()-16U instead, but for general compatibility it is nice that GCC also keeps it around. Note, it would be even better though if GCC could support the short forms as generic builtins. That changes the semantics slightly, but they are named so similarly to the clz, clzl and clzll it would be easy to assume they also are generics, with similar semantics, and can work across all targets. Btw. I assume __builtin_clzs being a target specific builtin, that GCC never had the capability of resolving it at compile-time? If that is the case, it might actually be a bug that GCC allowed it at all in a constexpr function.
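A minimal sketch of the fallback described above, assuming unsigned int is 32 bits wide. The name clz16 is hypothetical and this is not the actual Qt code. Unlike the lzcnt instruction, the result is undefined for an input of 0, because __builtin_clz itself is undefined there.

```cpp
#include <cstdint>

// Count leading zeros of a 16-bit value using the generic 32-bit builtin.
// The value is zero-extended to 32 bits, which adds exactly 16 leading
// zeros, so those are subtracted again. Assumes a 32-bit unsigned int.
constexpr unsigned clz16(std::uint16_t v)
{
    return static_cast<unsigned>(__builtin_clz(v)) - 16U;
}

// The generic builtin can be evaluated at compile time.
static_assert(clz16(0x8000) == 0, "top bit set: no leading zeros");
static_assert(clz16(0x0001) == 15, "only the lowest bit set");
```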
[Bug c/66970] Add __has_builtin() macro
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66970 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #5 from Allan Jensen --- This just hit us again, when a patch release removed __builtin_clzs or renamed it to __builtin_lzcnt_u16. We need to be able to detect which ones exist at compile time; otherwise we can't ship headers that won't break when gcc updates.
[Bug target/59874] Missing builtin (__builtin_clzs) when compiling with g++
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59874 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #5 from Allan Jensen --- This is more problematic to fix in Qt itself. How can we determine if we should/can use __builtin_clzs or __lzcnt16? Note the former is practically standard being supported by both older gcc and clang. There is also the problem that we need to call a builtin, because the C-intrinsics don't work as constexpr.
[Bug target/70118] UBSan claims misaligned access in SSE instrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70118 Allan Jensen changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #7 from Allan Jensen --- Fixed in trunk
[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754 --- Comment #11 from Allan Jensen --- I think the issue I noted is completely separate from this one, so I opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 to deal with it. I think this one could probably be closed though.
[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 --- Comment #3 from Allan Jensen --- Created attachment 40298 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40298&action=edit Test compiled with gcc 6
[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 --- Comment #2 from Allan Jensen --- Created attachment 40297 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40297&action=edit Test compiled with -march=haswell
[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 --- Comment #1 from Allan Jensen --- Created attachment 40296 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40296&action=edit Test compiled with -mavx2
[Bug target/78762] New: Regression: Splitting unaligned AVX loads also when AVX2 is enabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 Bug ID: 78762 Summary: Regression: Splitting unaligned AVX loads also when AVX2 is enabled Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 40295 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40295&action=edit Test In gcc 7, when not optimizing for speed or for newer Intel architectures, unaligned AVX loads are now split. It appears this is on purpose, and the code related to it is quite old, but I haven't been able to trigger it with older versions of gcc (tried 4.9, 5 and 6). However, this is a special tuning intended for Sandybridge and possibly AMD cpus. It does not trigger on any AVX2 processor. Therefore it now causes a universal performance degradation in code optimized for generic AVX2. I suggest this tuning is disabled when avx2 is enabled.
[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754 --- Comment #10 from Allan Jensen --- No I mean it triggers when you compile with -mavx2, it is solved with -march=haswell. It appears the issue is the tune flag X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL is set for all processors that support avx2, but if you use generic+avx2, it still pessimistically optimizes for pre-avx2 processors setting MASK_AVX256_SPLIT_UNALIGNED_LOAD. Though since there are two controlling flags and the second X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL is still set for some avx2 processors (btver and znver) besides generic, it is harder to argue what generic+avx2 should do there.
[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754 --- Comment #8 from Allan Jensen --- Note this happens with -mavx2, but not with -march=haswell. It appears the tuning is a bit too pessimistic when avx2 is enabled on generic x64.
[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #7 from Allan Jensen --- This is significantly worse with integer operands. _mm256_storeu_si256((__m256i *)&data[3], _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]), _mm256_loadu_si256((const __m256i *)&data[1])) ); compiles to: vmovdqu 0x20(%rax),%xmm0 vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0 vmovdqu (%rax),%xmm1 vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1 vpaddd %ymm1,%ymm0,%ymm0 vmovups %xmm0,0x60(%rax) vextracti128 $0x1,%ymm0,0x70(%rax)
[Bug target/78563] SSE4.1 pmovzx shuffle pattern not recognized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78563 --- Comment #1 from Allan Jensen --- Created attachment 40177 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40177&action=edit Test
[Bug target/78563] New: SSE4.1 pmovzx shuffle pattern not recognized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78563 Bug ID: 78563 Summary: SSE4.1 pmovzx shuffle pattern not recognized Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- An unpack pattern with a 0 constant is neither folded nor recognized as a pmovzx instruction. SSE2 code: _mm_unpacklo_epi32(X, _mm_setzero_si128()) GCC code: __builtin_shuffle((__v4si)X, (__v4si)_mm_setzero_si128(), (__v4si){0, 4, 1, 5}); Both will produce the same result of an xor setting 0 and an unpack instruction, while with SSE4.1 it could emit a pmovzx instruction. Note epi32 is just an example used here because it is the most compact; this also affects the 8 and 16 bit equivalents. Looking in config/i386/i386.c, it seems like there is no code in the expand_vec_perm_* methods for detecting pmovzx patterns.
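The shuffle from the report can be written with GCC's generic vector extensions so it builds without any SSE headers. The sketch below is illustrative: interleave_low_with_zero and lane are hypothetical names, and the __builtin_shufflevector branch is only there so the snippet also compiles with clang. The result {x0, 0, x1, 0}, read as two 64-bit lanes on a little-endian target, is the zero extension of the two low 32-bit lanes, which is what pmovzxdq computes.

```cpp
#include <cassert>

typedef int v4si __attribute__((vector_size(16)));

// Interleave the two low lanes of x with zero: {x0, 0, x1, 0}.
v4si interleave_low_with_zero(v4si x)
{
    const v4si zero = {0, 0, 0, 0};
#if defined(__clang__)
    return __builtin_shufflevector(x, zero, 0, 4, 1, 5);
#else
    // Indices 0..3 select from x, 4..7 select from zero.
    const v4si mask = {0, 4, 1, 5};
    return __builtin_shuffle(x, zero, mask);
#endif
}

// Small helper to read one lane with a bounds check.
int lane(v4si v, int i)
{
    assert(i >= 0 && i < 4);
    return v[i];
}
```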
[Bug target/31667] Integer extensions vectorization could be improved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 --- Comment #4 from Allan Jensen --- (In reply to Allan Jensen from comment #3) > Gcc 5 and 6 produces code with pmovzx when compiling the example with -O3 > -msse4.1 > > I assume this can be closed. Note, as comment 1 says, it will not use a memory load, though instead it does half as many memory reads. movdqa 0x0(%rip),%xmm0# 8 pmovzxbw %xmm0,%xmm1 psrldq $0x8,%xmm0 pmovzxbw %xmm0,%xmm0 movaps %xmm1,0x0(%rip)# 1e movaps %xmm0,0x0(%rip)# 25
[Bug target/31667] Integer extensions vectorization could be improved
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #3 from Allan Jensen --- Gcc 5 and 6 produces code with pmovzx when compiling the example with -O3 -msse4.1 I assume this can be closed.
[Bug target/70118] UBSan claims misaligned access in SSE instrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70118 Allan Jensen changed: What|Removed |Added Attachment #40130|0 |1 is obsolete|| --- Comment #5 from Allan Jensen --- Created attachment 40140 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40140&action=edit Patch Updated patch confirmed to work
[Bug target/70118] UBSan claims misaligned access in SSE instrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70118 --- Comment #4 from Allan Jensen --- Created attachment 40130 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40130&action=edit Proposed patch On closer inspection, we are only almost there, two minor changes are still needed. (testing patch).
[Bug target/70118] UBSan claims misaligned access in SSE instrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70118 --- Comment #3 from Allan Jensen --- Or r217608
[Bug target/70118] UBSan claims misaligned access in SSE instrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70118 --- Comment #2 from Allan Jensen --- I believe this to be fixed by r239889
[Bug tree-optimization/78394] False positives of maybe-uninitialized with -Og
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78394 Allan Jensen changed: What|Removed |Added Attachment #40064|0 |1 is obsolete|| --- Comment #1 from Allan Jensen --- Created attachment 40065 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40065&action=edit maybe_uninitialized.cpp Added another example
[Bug tree-optimization/78394] New: False positives of maybe-uninitialized with -Og
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78394 Bug ID: 78394 Summary: False positives of maybe-uninitialized with -Og Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 40064 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40064&action=edit maybe_uninitialized.cpp Compiling with -Og produces a number of unique false positives for the maybe-uninitialized warnings. The warnings are only emitted for -Og and not for -O0, -O1, -O2 or -O3.
[Bug pch/63319] [5 Regression] ICE: Segmentation fault building qt5 with pch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63319 Allan Jensen changed: What|Removed |Added CC||linux at carewolf dot com --- Comment #12 from Allan Jensen --- There is a chance this has already been fixed. We recently ran into the issue again, see https://bugreports.qt.io/browse/QTBUG-56817 but it only affects GCC 5.3.1. On Debian's gcc 5.4.1 version it works.
[Bug tree-optimization/77902] Auto-vectorizes epilogue loops of manually vectorized functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77902 Allan Jensen changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from Allan Jensen --- Since it appears to be optimized better in gcc 7, let's say this is resolved.
[Bug tree-optimization/77902] Auto-vectorizes epilogue loops of manually vectorized functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77902 --- Comment #2 from Allan Jensen --- While this has been the case in both GCC 5 and GCC 6, it appears that both failing cases previously mentioned already produce the best-case result when using a fairly recent GCC 7. gcc version 7.0.0 20160923 (experimental) (GCC)
[Bug tree-optimization/77902] Auto-vectorizes epilogue loops of manually vectorized functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77902 --- Comment #1 from Allan Jensen --- Further experimentation shows that GCC can sometimes reason about the remaining range but does so inconsistently. For instance this example also fails: int result = 0; for (; count >= 4; count -= 4) { // Manually vectorized or batched code foobar_4x(result, vector); vector += 4; } for (; count >= 0; --count) { // Still autovectorized result += *vector++; } But replace the epilogue with a loop that counts up, and GCC appears to figure out it is pointless to vectorize: for (int i = 0; i < count; ++i) { // correctly not vectorized
[Bug tree-optimization/77902] New: Auto-vectorizes epilogue loops or manually vectorized functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77902 Bug ID: 77902 Summary: Auto-vectorizes epilogue loops or manually vectorized functions Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- Created attachment 39774 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39774&action=edit Example that triggers the pointless auto-vectorization A common pattern when manually vectorizing an inner function is to have a small epilogue that handles the remainder of the input vector that cannot be handled by the vectorized stepping. For instance: int i = 0; for (; i < (count - 3); i +=4) // do 4 at a time for (; i < count; ++i) // do 1 at a time When compiled with -O3 or -ftree-loop-vectorize that last epilogue may be auto-vectorized by GCC even though it can run at most 3 times, and the auto-vectorized code path will never be called. Rewriting it as int i = 0; for (; i < (count - 3); i +=4) // do 4 at a time for (int _i = 0; _i < 3 && i < count; ++_i, ++i) // do 1 at a time fixes the issue. I am guessing GCC would do well to learn a range from the main loop so that it can figure out on its own that the epilogue cannot run more than 3 times.
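The workaround described in this report can be sketched as a complete function. sum4 and sum_batched are hypothetical names, with sum4 standing in for hand-vectorized code, and the tail counter initialized to 0 so the bound is well defined. This is a sketch of the pattern, not the attached example.

```cpp
#include <cassert>

// Hypothetical stand-in for a manually vectorized step over 4 elements.
static int sum4(const int* p)
{
    return p[0] + p[1] + p[2] + p[3];
}

int sum_batched(const int* data, int count)
{
    assert(count >= 0);
    int result = 0;
    int i = 0;

    // Main loop: 4 elements at a time.
    for (; i < count - 3; i += 4)
        result += sum4(data + i);

    // Epilogue: at most 3 elements remain. The extra counter states that
    // bound explicitly, so the compiler has no reason to vectorize it.
    for (int tail = 0; tail < 3 && i < count; ++tail, ++i)
        result += data[i];

    return result;
}
```

The extra counter costs one more comparison per epilogue iteration, which is negligible for a loop that runs at most three times.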
[Bug c++/77796] New: tautological compare warning emitted for inherited static method comparisons
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77796 Bug ID: 77796 Summary: tautological compare warning emitted for inherited static method comparisons Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: linux at carewolf dot com Target Milestone: --- We have been running into several issues with the tautological compare warning in qtdeclarative: first there was https://bugreports.qt.io/browse/QTBUG-53373 (warning about comparing a typedef with its definition), and recently https://bugreports.qt.io/browse/QTBUG-56266 (warning about a method that is resolved to what it is compared to). In both cases the comparisons are not tautological, but merely resolvable at compile time, and are specifically used in places where they need to be resolvable at compile time. It makes no sense to warn about a comparison being resolvable at compile time in a place that demands a constexpr. The latest example can be reproduced with this simple code: class A { public: static void destroy() { } }; class B : public A { }; const int tbl[1] = { B::destroy == A::destroy ? 0 : 1 }; It specifically checks whether the method has been overwritten in a derived class, but since the names are looked up using two different scopes, it shouldn't trigger the tautological warning. Only comparing (A::destroy == A::destroy) should do that.
[Bug lto/65274] Internal compiler error: should die in combat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65274 --- Comment #4 from Allan Jensen --- It works now.