Re: Question about constructing vector types in GIMPLE pass
On Mon, 8 Apr 2024, Hanke Zhang via Gcc wrote:
> Hi, I've been working on strengthening auto-vectorization on Intel CPUs
> recently. I tried to do it in a GIMPLE pass, and I noticed that some
> vector types in the GIMPLE code are confusing to me. The example code is
> here:
>
>     _1 = MEM[(const __m256i_u * {ref-all})_2];
>
> I wondered how I could construct or get the type
> `(const __m256i_u * {ref-all})` in the GIMPLE pass. If you have any
> ideas that can help me, I'll be so grateful! :)

I am not sure what you are asking exactly. If you already have access to such a MEM_REF, then the documentation tells you where to look for this type:

"The first operand is the pointer being dereferenced; it will always have pointer or reference type. The second operand is a pointer constant serving as constant offset applied to the pointer being dereferenced with its type specifying the type to be used for type-based alias analysis. The type of the node specifies the alignment of the access."

If you want to create a new type similar to this one, you can build it with various tools:

* build_vector_type or build_vector_type_for_mode
* build_pointer_type_for_mode (*, VOIDmode, true) to build a pointer that can alias anything
* build_qualified_type to add const (probably useless)
* build_aligned_type to specify that it is unaligned

-- Marc Glisse
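The tools listed above could be combined roughly as follows. This is only a sketch against GCC's internal tree-building API: it cannot compile outside a GCC pass or plugin, and the element type and count chosen for a __m256i-like vector are my assumptions, not something stated in the thread.

```cpp
/* Sketch only: assumes GCC-internal headers (tree.h etc.) and runs inside
   a pass.  build_vector_type, build_aligned_type, build_qualified_type and
   build_pointer_type_for_mode are real internal APIs; the choice of
   long long as element type for a __m256i-like vector is an assumption.  */
tree elt  = long_long_integer_type_node;             /* 4 x 64 = 256 bits */
tree vec  = build_vector_type (elt, 4);              /* V4DI-like vector  */
tree unal = build_aligned_type (vec, BITS_PER_UNIT); /* __m256i_u: align 1 */
tree cst  = build_qualified_type (unal, TYPE_QUAL_CONST);     /* const    */
tree ptr  = build_pointer_type_for_mode (cst, VOIDmode, true);/* {ref-all} */
```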
Re: Building gcc with "-O -g"?
On Sat, 10 Feb 2024, Steve Kargl via Gcc wrote:
> So, how does one build all parts of gcc with "-O -g"? In my shell
> script, I have
>
>     CFLAGS="-O -g"
>     export CFLAGS
>     CXXFLAGS="-O -g"
>     export CXXFLAGS
>     BOOT_CFLAGS="-O -g"
>     export BOOT_CFLAGS
>     ../gcc/configure --prefix=$HOME/work --enable-languages=c,c++,fortran \
>       --enable-bootstrap --disable-libssp --disable-multilib
>
> but during bootstrap I see
>
>     /home/kargl/gcc/obj/./prev-gcc/xg++ -B/home/kargl/gcc/obj/./prev-gcc/ -B/home/kargl/work/x86_64-unknown-freebsd15.0/bin/ -nostdinc++ -B/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/src/.libs -B/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/libsupc++/.libs -I/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/include/x86_64-unknown-freebsd15.0 -I/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/include -I/home/kargl/gcc/gcc/libstdc++-v3/libsupc++ -L/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/src/.libs -L/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/libsupc++/.libs -fno-PIE -c -g -O2 -fno-checking -gtoggle -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -Wconditionally-supported -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common -DHAVE_CONFIG_H -fno-PIE -I. -I. -I../../gcc/gcc -I../../gcc/gcc/. -I../../gcc/gcc/../include -I../../gcc/gcc/../libcpp/include -I../../gcc/gcc/../libcody -I/usr/local/include -I../../gcc/gcc/../libdecnumber -I../../gcc/gcc/../libdecnumber/dpd -I../libdecnumber -I../../gcc/gcc/../libbacktrace -o fibonacci_heap.o -MT fibonacci_heap.o -MMD -MP -MF ./.deps/fibonacci_heap.TPo ../../gcc/gcc/fibonacci_heap.cc
>
> Note the "-g -O2". Why?

In addition to CFLAGS and BOOT_CFLAGS, you are missing CFLAGS_FOR_TARGET (plus the same 3 for CXX).

I don't know if that's still sufficient, but that's what I used to set a few years ago.

-- Marc Glisse
Re: Expected warning maybe-uninitialized does not appear using g++13.2.0?
On Thu, 21 Dec 2023, David Malcolm via Gcc wrote:
> On Wed, 2023-12-20 at 11:16 -0800, Eric Batchelor wrote:
>> Hello, I unintentionally stumbled upon some strange behaviour that
>> occurred due to a typo. I reproduced the behaviour where an object
>> (std::string in my case) can be passed to a function by reference,
>> uninitialized, WITHOUT a compiler warning. Changing the code to pass
>> the object by value DOES emit the warning. I don't think the compiled
>> code is incorrect; it segfaults, presumably due to uninitialized
>> members. I understand there may seldom be a reason to use
>> uninitialized objects, so "don't do that," but as I said this was
>> unintentional and it seems that it should have generated a warning,
>> which would have saved some head-scratching. Code to reproduce:
>>
>>     #include <string>
>>     std::string f(std::string &s) {
>>       s.append("x");
>>       return s;
>>     }
>>     int main() {
>>       std::string a = f(a);
>>     }
>>
>> Compile and run (no warning):
>>
>>     $ g++ -o uninit_obj uninit_obj.cpp -std=c++23 -Wall -Wpedantic -Wextra && ./uninit_obj
>>     Segmentation fault (core dumped)
>>
>> No difference whether using -O0 (or 1 2 3).
>
> As I understand it, -Wmaybe-uninitialized is purely intraprocedural,
> i.e. it works within each individual function, without considering the
> interactions *between* functions.

If you compile

    #include <string>
    static std::string f(std::string &s) {
      s.append("x");
      return s;
    }
    void g() {
      std::string a = f(a);
    }

with -O3, by the time we get to the uninit pass, function g starts with

    void g ()
    {
      size_type __dnew;
      struct string a;
    [...]
      [local count: 1073741824]:
      _26 = a._M_string_length;
      if (_26 == 4611686018427387903)

which should not require any interprocedural logic.

-- Marc Glisse
Re: libstdc++: Speed up push_back
On Thu, 23 Nov 2023, Jonathan Wakely wrote:
> That's why we need -fsane-operator-new

Although the standard forbids it, some of us just provide an inline implementation:

    inline void* operator new(std::size_t n) { return malloc(n); }
    inline void operator delete(void* p) noexcept { free(p); }
    inline void operator delete(void* p, std::size_t) noexcept { free(p); }

(I could certainly add a check to abort if malloc returns 0, or other details, depending on what the application calls for.)

It used to enable a number of optimizations; for instance, in gcc-9,

    auto f() { return std::vector<int>(4096); }

was optimized to just one call to calloc (someone broke that in gcc-10). Using LTO on libsupc++ is related.

I don't know if we want to define "sane" operators new/delete, or just have a flag that promises that we won't try to replace the default ones.

-- Marc Glisse
Re: libstdc++: Turn memmove to memcpy in vector reallocations
On Tue, 21 Nov 2023, Jonathan Wakely wrote:
> CC Marc Glisse who added the relocation support. He might recall why we
> use memmove when all uses are for newly-allocated storage, which cannot
> overlap the existing storage.

Going back a bit: https://gcc.gnu.org/pipermail/gcc-patches/2019-April/520658.html

"I think the call to memmove in __relocate_a_1 could probably be memcpy (I don't remember why I chose memmove)"

Going back a bit further: https://gcc.gnu.org/pipermail/gcc-patches/2018-September/505800.html

"I had to add a special case for trivial types, using memmove, to avoid perf regressions, since relocation takes precedence over the old path that is specialized to call memmove."

So the reason seems to be that vector already used memmove before my patch. You can dig further if you want to check why that is ;-)

-- Marc Glisse
Re: [PATCH] Remove unnecessary "& 1" in year_month_day_last::day()
On Sun, 5 Nov 2023, Cassio Neri wrote:
> When year_month_day_last::day() was implemented, Dr. Matthias Kretz
> realised that the operation "& 1" wasn't necessary, but we did not
> patch it at that time. This patch removes the unnecessary operation.

Is there an entry in gcc's bugzilla about having the optimizer handle this kind of optimization?

    unsigned f(unsigned x) {
      if (x >= 32) __builtin_unreachable();
      return 30 | (x & 1);  // --> 30 | x
    }

(That optimization would come in addition to your patch; doing the optimization by hand is still a good idea.)

It looks like the criterion would be a|(b&c) when the possible 1 bits of b are included in the certainly 1 bits of a|c.

-- Marc Glisse
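The stated criterion can be sanity-checked exhaustively for this instance; the harness below is mine, not from the thread. With a = 30 and c = 1, every possibly-set bit of an x known to be below 32 is certainly set in 30|1 == 31, so the mask is redundant.

```cpp
// Instance check for the suggested a|(b&c) -> a|b rewrite: compares the
// masked and unmasked forms over the whole range implied by the
// __builtin_unreachable annotation (x < 32).
unsigned f_masked(unsigned x) { return 30u | (x & 1u); }
unsigned f_plain(unsigned x)  { return 30u | x; }

bool equivalent_below_32() {
  for (unsigned x = 0; x < 32; ++x)
    if (f_masked(x) != f_plain(x))
      return false;
  return true;
}
```

Outside the annotated range the two forms diverge (e.g. x = 34), which shows the rewrite really depends on the range information.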
Re: Question about merging if-else blocks
On Wed, 27 Sep 2023, Hanke Zhang via Gcc wrote:
> Hi, I have recently been working on merging if-else statement blocks,
> and I found a rather bizarre phenomenon that I would like to ask about.
> A rough explanation is that for two consecutive if-else blocks, if
> their if conditions are exactly the same, they should be merged, like
> in the following program:
>
>     int a = atoi(argv[1]);
>     if (a) { printf("if 1"); } else { printf("else 1"); }
>     if (a) { printf("if 2"); } else { printf("else 2"); }
>
> After using the -O3 -flto optimization options, it can be optimized as
> follows:
>
>     int a = atoi(argv[1]);
>     if (a) { printf("if 1"); printf("if 2"); }
>     else { printf("else 1"); printf("else 2"); }
>
> But `a` here is a local variable. If I declare `a` as a global
> variable, it cannot be optimized as above. I would like to ask why this
> is. And is there any solution?

If 'a' is a global variable, how do you know 'printf' doesn't modify its value? (You could know it for printf, but it really depends on the function that is called.)

-- Marc Glisse
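The point about globals can be made concrete with an ordinary function in place of printf; the names here (`logger`, `branches_taken`) are hypothetical, mine only. Because `a` has external linkage, the compiler must assume any opaque call may modify it, so the two tests cannot be merged across the calls.

```cpp
// 'logger' stands in for printf.  A callee really can change the global,
// so the second 'if (a)' must re-test it after the first call.
int a;

void logger(const char *) { a = 0; }  // clobbers the global

int branches_taken(int init) {
  a = init;
  int taken = 0;
  if (a) { logger("if 1"); ++taken; }
  if (a) { logger("if 2"); ++taken; }  // not reached when logger cleared 'a'
  return taken;
}
```

If merging the two blocks were valid, `branches_taken(1)` would execute both bodies; with the clobbering callee it executes only the first.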
Re: Different ASM for ReLU function between GCC11 and GCC12
On Mon, 19 Jun 2023, André Günther via Gcc wrote:
> I noticed that a simple function like
>
>     auto relu( float x ) { return x > 0.f ? x : 0.f; }
>
> compiles to different ASM using GCC 11 (or lower) and GCC 12 (or
> higher). On -O3 -mavx2 the former compiles the above function to
>
>     relu(float):
>             vmaxss  xmm0, xmm0, DWORD PTR .LC0[rip]
>             ret
>     .LC0:
>             .long   0
>
> which is what I would naively expect and what clang essentially does
> too (clang actually uses an xor before the maxss to get the zero). The
> latter, however, compiles the function to
>
>     relu(float):
>             vxorps  xmm1, xmm1, xmm1
>             vcmpltss        xmm2, xmm1, xmm0
>             vblendvps       xmm0, xmm1, xmm0, xmm2
>             ret
>
> which looks like a missed optimisation. Does anyone know if there's a
> reason for the changed behaviour?

With -fno-signed-zeros -ffinite-math-only, gcc-12 still uses max instead of cmp+blend. So the first thing to check would be whether both versions give the same result on negative 0 and NaN.

-- Marc Glisse
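The suggested check can be run at the source level (my own harness, not from the thread): the semantics any max-based lowering has to preserve are that the ternary picks the zero arm for both -0.0f and NaN. Whether a particular vmaxss encoding matches then depends on its operand order, since maxss returns its second operand for unordered or equal inputs.

```cpp
#include <cmath>

// The function from the thread.  x > 0.f is false for both -0.0f and NaN,
// so a correct lowering must return +0.0f in both cases; a max instruction
// is only usable if its operand order makes it return the zero operand
// for NaN/equal inputs.
float relu(float x) { return x > 0.f ? x : 0.f; }
```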
Re: [PATCH] libstdc++: Add missing constexpr to simd
On Mon, 22 May 2023, Jonathan Wakely via Libstdc++ wrote:
>> * subscripting vector builtins is not allowed in constant expressions
>
> Is that just because nobody made it work (yet)?

Yes. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101651 and others.

>> * if the implementation would otherwise call SIMD intrinsics/builtins

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80517 and others. Makes sense to work around them for now.

-- Marc Glisse
Re: [GSoC] Conflicted Built-in Trait Name
On Sat, 25 Mar 2023, Ken Matsui via Gcc wrote:
> Built-in trait naming simply adds two underscores (__) to the original
> trait name. However, the same names are already in use for some
> built-in traits, such as is_void, is_pointer, and is_signed. For
> example, __is_void is used in the following files:
>
> * gcc/testsuite/g++.dg/tm/pr46567.C

This is a testcase; you can rename __is_void to whatever in there, it doesn't matter.

> * libstdc++-v3/include/bits/cpp_type_traits.h

This __is_void seems to be used in a single place, in include/debug/helper_functions.h; couldn't we tweak that code so that __is_void becomes unused and can be removed?

-- Marc Glisse
Re: [PATCH 2/2] libstdc++: use new built-in trait __add_const
On Tue, 21 Mar 2023, Ken Matsui via Libstdc++ wrote:
>    /// add_const
> +#if __has_builtin(__add_const)
> +  template<typename _Tp>
> +    struct add_const
> +    { using type = __add_const(_Tp); };
> +#else
>    template<typename _Tp>
>      struct add_const
>      { using type = _Tp const; };
> +#endif

Is that really better? You asked elsewhere if you should measure for each patch, and I think that, at least for such a trivial case, you need to demonstrate that there is a point. The drawbacks are obvious: more code in libstdc++, non-standard, and more builtins in the compiler. Using builtins makes more sense for complicated traits where you can save several instantiations. Now that you have done a couple of simple cases to see how it works, I think you should concentrate on the more complicated cases.

-- Marc Glisse
Re: Triggering -save-temps from the front-end code
On Mon, 28 Nov 2022, Florian Weimer via Gcc wrote:
> * Arsen Arsenović:
>> Florian Weimer via Gcc writes:
>>> Unfortunately, some build systems immediately delete the input source
>>> files. Is there some easy way I can dump the pre-processed and
>>> non-preprocessed sources to my log file? I tried to understand how
>>> -save-temps for crash recovery works, but it seems that this runs
>>> outside of the frontend, in the driver.
>>
>> Would dumping unconditionally into some "side channel" -dumpdir be a
>> sufficient workaround?
>
> If the file names overlap, and it seems they do in this case, the dump
> files are just overwritten. I don't see a way to make the file names
> unique. I guess for the tough cases, I can just keep running the build
> under strace.

You could override unlink with LD_PRELOAD. Use a special-purpose filesystem (gitfs? I haven't tried it yet). Wrap gcc with a command that calls the true gcc with a different TMPDIR / -dumpdir each time.

-- Marc Glisse
Re: Please, really, make `-masm=intel` the default for x86
On Fri, 25 Nov 2022, LIU Hao via Gcc wrote:
> I am a Windows developer and I have been writing x86 and amd64 assembly
> for more than ten years. One annoying thing about GCC is that, for x86,
> if I need to write a piece of inline assembly then I have to do it
> twice: once in AT&T syntax and once in Intel syntax.

The doc for -masm=dialect says:

    Darwin does not support 'intel'.

Assuming that's still true, and even with Mac Intel going away, it doesn't help.

-- Marc Glisse
Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
On Fri, 4 Nov 2022, Hongyu Wang via Gcc-patches wrote:
> This is a follow-up patch for PR98167. The sequence
>
>     c1 = VEC_PERM_EXPR (a, a, mask)
>     c2 = VEC_PERM_EXPR (b, b, mask)
>     c3 = c1 op c2
>
> can be optimized to
>
>     c = a op b
>     c3 = VEC_PERM_EXPR (c, c, mask)
>
> for all integer vector operations, and float operations with full
> permutation.

Hello, I assume the "full permutation" condition is to avoid performing some extra operations that would raise exception flags. If so, are there conditions (-fno-trapping-math?) where the transformation would be safe with arbitrary shuffles?

-- Marc Glisse
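For integer vectors the equivalence behind the patch is easy to sanity-check with a scalar model; this is my own sketch of the arithmetic, not the GIMPLE pattern itself.

```cpp
#include <array>

// Model of VEC_PERM_EXPR (v, v, mask) on a 4-element integer vector.
using V4 = std::array<int, 4>;

V4 perm(const V4 &v, const V4 &mask) {
  V4 r{};
  for (int i = 0; i < 4; ++i) r[i] = v[mask[i] & 3];
  return r;
}

// perm(a) + perm(b), element-wise, equals perm(a + b) for any mask that
// reads only lanes of a and b (same-index shuffle on both operands),
// which is what lets the permutation be sunk below the operation.
bool reorder_ok(const V4 &a, const V4 &b, const V4 &mask) {
  V4 lhs{}, sum{};
  V4 pa = perm(a, mask), pb = perm(b, mask);
  for (int i = 0; i < 4; ++i) lhs[i] = pa[i] + pb[i];
  for (int i = 0; i < 4; ++i) sum[i] = a[i] + b[i];
  return lhs == perm(sum, mask);
}
```

The "full permutation" caveat in the thread concerns floating point, where lanes dropped by a non-full shuffle could still raise exception flags when the operation is done before the shuffle.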
Re: Different outputs in Gimple pass dump generated by two different architectures
On Thu, 10 Nov 2022, Kevin Lee wrote:
> While looking at the failure of gcc.dg/uninit-pred-9_b.c, I observed
> that x86-64 and RISC-V have different outputs for the gimple pass since
> r12-4790-g4b3a325f07acebf4
> <https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=4b3a325f07acebf4>.

Probably since earlier.

> What would be causing the difference? Is this intended? Link
> <https://godbolt.org/z/eWxnYsK1z> for details. Thank you!

See LOGICAL_OP_NON_SHORT_CIRCUIT in fold-const.cc (and various discussions on the topic in the mailing lists and bugzilla).

-- Marc Glisse
Re: clarification question
On Sat, 22 Oct 2022, Péntek Imre via Gcc wrote:
> https://gcc.gnu.org/backends.html
>
> By "Architecture does not have a single condition code register" do you
> mean it has none, or do you mean it has multiple?

Either. If you look at the examples below it, there is a C for riscv, which has 0, and one for sparc, which has several.

-- Marc Glisse
Re: [PATCH] Optimize (X<
On Tue, 13 Sep 2022, Roger Sayle wrote:
> This patch tweaks the match.pd transformation previously added to fold
> (X<<C)+(Y<<C) to (X+Y)<<C.

In https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html , I read:

"Bitwise operators act on the representation of the value including both the sign and value bits, where the sign bit is considered immediately above the highest-value value bit. Signed '>>' acts on negative numbers by sign extension. As an extension to the C language, GCC does not use the latitude given in C99 and C11 only to treat certain aspects of signed '<<' as undefined."

To me, this means that for instance INT_MIN<<1 is well defined and evaluates to 0. But with this patch we turn (INT_MIN<<1)+(INT_MIN<<1) into (INT_MIN+INT_MIN)<<1, which is UB. If we decide not to support this extension anymore, I think we need to change the documentation first.

-- Marc Glisse
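The extension at issue can be pinned down with a small example of my own (assuming 32-bit int). Under GCC's documented representation-based semantics, INT_MIN << 1 shifts the sign bit out and yields 0, whereas INT_MIN + INT_MIN is signed overflow. The model below uses unsigned arithmetic so it stays fully standard:

```cpp
#include <climits>

// Representation-based signed left shift, as GCC documents it: the same
// bits as the unsigned shift.  INT_MIN + INT_MIN, by contrast, is signed
// overflow (UB), which is why rewriting (X<<1)+(Y<<1) as (X+Y)<<1 is not
// valid under that documentation.  Assumes 32-bit int.
int shl1_repr(int x) { return (int)((unsigned)x << 1); }
```

So the source expression (INT_MIN<<1)+(INT_MIN<<1) is a well-defined 0 under the documented extension, while the rewritten form overflows before the shift.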
Re: Floating-point comparisons in the middle-end
On Thu, 1 Sep 2022, Joseph Myers wrote:
> On Thu, 1 Sep 2022, FX via Gcc wrote:
>> A tentative patch is attached; it seems to work well on simple
>> examples, but for test coverage the hard part is going to be that the
>> comparisons seem to be optimised away very easily into their
>> non-signaling versions. Basically, if I do:
>
> Presumably that can be reproduced without depending on the new built-in
> function? In which case it's an existing bug somewhere in the
> optimizers.

    (simplify
     (cmp @0 REAL_CST@1)
     [...]
     (if (REAL_VALUE_ISNAN (TREE_REAL_CST (@1))
          && !tree_expr_signaling_nan_p (@1)
          && !tree_expr_maybe_signaling_nan_p (@0))
      { constant_boolean_node (cmp == NE_EXPR, type); })

only tries to preserve a comparison with an sNaN, but not with a qNaN. There are probably other issues, since various gcc devs used to have different opinions on the meaning of -ftrapping-math.

-- Marc Glisse
Re: GCC 12.1 Release Candidate available from gcc.gnu.org
On Mon, 2 May 2022, Boris Kolpackov wrote:
> Jakub Jelinek writes:
>> The first release candidate for GCC 12.1 is available [...]
>
> There is an unfixed bogus warning that is a regression in 12 and that I
> think will have a pretty wide effect (any code that assigns/appends a
> 1-char string literal to std::string):
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105329
>
> For example, in my relatively small codebase I had about 20 instances
> of this warning. Seeing that it's enabled as part of -Wall (not just
> -Wextra), I believe there will be a lot of grumpy users. There is a
> proposed workaround in this (duplicate) bug that looks pretty simple:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104336
>
> Perhaps it makes sense to consider it?

Please no, that workaround looks like a fragile hack (basically writing a+b-a to obfuscate b) that may break if you look at it sideways, and it likely makes the generated code a bit worse. Besides, we should take it as a hint that user code is also likely to trigger the warning by accident.

IMO there are several ways to make progress on this in parallel:

* improve the optimizer (as investigated by Andrew)

* tweak the warning so it becomes either cleverer or less eager (maybe even use the fact that this special case is in a system header? that should be a last resort though)

* move the warning from -Wall to -Wextra (a battle that has already been lost, no need to rehash it):

    --- a/gcc/c-family/c.opt
    +++ b/gcc/c-family/c.opt
    @@ -1450,3 +1450,3 @@ Warn when a declaration has duplicate const, volatile, restrict or _Atomic speci
     Wrestrict
    -C ObjC C++ ObjC++ Var(warn_restrict) Warning LangEnabledBy(C ObjC C++ ObjC++, Wall)
    +C ObjC C++ ObjC++ Var(warn_restrict) Warning LangEnabledBy(C ObjC C++ ObjC++, Wextra)
     Warn when an argument passed to a restrict-qualified parameter aliases with

* or admit that the documentation of -Wall no longer matches:

    --- a/gcc/doc/invoke.texi
    +++ b/gcc/doc/invoke.texi
    @@ -5729,4 +5729,3 @@ by this option and not enabled by the latter and vice versa.
     This enables all the warnings about constructions that some users
    -consider questionable, and that are easy to avoid (or modify to
    -prevent the warning), even in conjunction with macros.  This also
    +consider questionable.  This also
     enables some language-specific warnings described in @ref{C++ Dialect

-- Marc Glisse
Re: [PATCH] libstdc++: Implement std::unreachable() for C++23 (P0627R6)
On Thu, 31 Mar 2022, Jonathan Wakely wrote:
> On Thu, 31 Mar 2022 at 17:03, Marc Glisse via Libstdc++ wrote:
>> On Thu, 31 Mar 2022, Matthias Kretz via Gcc-patches wrote:
>>> I like it. But I'd like it even more if we could have
>>>
>>>     #elif defined _UBSAN
>>>     __ubsan_invoke_ub("reached std::unreachable()");
>>>
>>> But to my knowledge UBSAN has no hooks for the library like this
>>> (yet).
>>
>> -fsanitize=undefined already replaces __builtin_unreachable with its
>> own thing, so I was indeed going to ask if the assertion / trap
>> provide a better debugging experience compared to plain
>> __builtin_unreachable, with the possibility to get a stack trace
>> (UBSAN_OPTIONS=print_stacktrace=1), etc. Detecting if (the right
>> subset of) ubsan is enabled sounds like a good idea.
>
> Does UBsan define a macro that we can use to detect it?

https://github.com/google/sanitizers/issues/765 seems to say no (it could be outdated though), but they were asking for use cases to motivate adding one. Apparently there is a macro for clang, although I don't think it is fine-grained. Adding one to cppbuiltin.cc testing SANITIZE_UNREACHABLE looks easy; maybe we can do just this one, we don't need to go overboard and define macros for all possible suboptions of ubsan right now.

I don't think any of that prevents pushing your patch as-is for gcc-12.

-- Marc Glisse
Re: [PATCH] libstdc++: Implement std::unreachable() for C++23 (P0627R6)
On Thu, 31 Mar 2022, Matthias Kretz via Gcc-patches wrote:
> I like it. But I'd like it even more if we could have
>
>     #elif defined _UBSAN
>     __ubsan_invoke_ub("reached std::unreachable()");
>
> But to my knowledge UBSAN has no hooks for the library like this (yet).

-fsanitize=undefined already replaces __builtin_unreachable with its own thing, so I was indeed going to ask if the assertion / trap provide a better debugging experience compared to plain __builtin_unreachable, with the possibility to get a stack trace (UBSAN_OPTIONS=print_stacktrace=1), etc. Detecting if (the right subset of) ubsan is enabled sounds like a good idea.

-- Marc Glisse
Re: [PATCH] PR tree-optimization/101895: Fold VEC_PERM to help recognize FMA.
On Fri, 11 Mar 2022, Roger Sayle wrote:
> +(match vec_same_elem_p
> + CONSTRUCTOR@0
> + (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
> +                        ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0)) : @0

Ah, I didn't remember we needed that; we don't seem to be very consistent about it. Probably for this reason, the transformation "Prefer vector1 << scalar to vector1 << vector2" does not match

    typedef int vec __attribute__((vector_size(16)));
    vec f(vec a, int b) {
      vec bb = { b, b, b, b };
      return a << bb;
    }

which is only optimized at vector-lowering time.

> +/* Push VEC_PERM earlier if that may help FMA perception (PR101895). */
> +(for plusminus (plus minus)
> + (simplify
> +  (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
> +  (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))

Don't you want :s on mult and vec_perm?

-- Marc Glisse
Re: [PATCH] Implement constant-folding simplifications of reductions.
On Mon, 21 Feb 2022, Roger Sayle wrote:
> +/* Fold REDUC (@0 op VECTOR_CST) as REDUC (@0) op REDUC (VECTOR_CST). */
> +(for reduc (IFN_REDUC_PLUS IFN_REDUC_MAX IFN_REDUC_MIN IFN_REDUC_FMAX
> +            IFN_REDUC_FMIN IFN_REDUC_AND IFN_REDUC_IOR IFN_REDUC_XOR)
> +     op (plus max min IFN_FMAX IFN_FMIN bit_and bit_ior bit_xor)
> + (simplify (reduc (op @0 VECTOR_CST@1))
> +  (op (reduc:type @0) (reduc:type @1))))

I wonder if we need to test flag_associative_math for the 'plus' case, or if the presence of IFN_REDUC_PLUS is enough to justify the possible loss of precision.

-- Marc Glisse
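A scalar model of the identity (my own, mirroring only the IFN_REDUC_PLUS case): for integer elements it holds for any vector, whereas for floating point the regrouped additions can round differently, which is exactly the flag_associative_math concern raised above.

```cpp
#include <array>

using V4 = std::array<int, 4>;

// Model of IFN_REDUC_PLUS on a 4-element integer vector.
int reduc_plus(const V4 &v) {
  int s = 0;
  for (int x : v) s += x;
  return s;
}

// REDUC_PLUS (v + cst) == REDUC_PLUS (v) + REDUC_PLUS (cst): element-wise
// addition followed by one reduction regroups into two reductions plus a
// scalar add.  For ints this is exact; for floats it is only valid when
// reassociation is allowed.
bool fold_ok(const V4 &v, const V4 &cst) {
  V4 sum{};
  for (int i = 0; i < 4; ++i) sum[i] = v[i] + cst[i];
  return reduc_plus(sum) == reduc_plus(v) + reduc_plus(cst);
}
```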
Re: "cannot convert to a pointer type" error repeated tens of times
On Sat, 12 Feb 2022, Andrea Monaco via Gcc wrote:
>     #include <curl/curl.h>
>
>     int main (void)
>     {
>       float a;
>       curl_easy_setopt (NULL, 0, (void *) a);
>     }
>
> with "gcc -c bug.c" gives
>
>     bug.c: In function 'main':
>     bug.c:15:3: error: cannot convert to a pointer type
>        curl_easy_setopt (NULL, 0, (void *) a);
>        ^~~~
>     bug.c:15:3: error: cannot convert to a pointer type
>     bug.c:15:3: error: cannot convert to a pointer type
>     bug.c:15:3: error: cannot convert to a pointer type
>     [...]
>     bug.c:15:3: error: cannot convert to a pointer type
>     In file included from /usr/include/x86_64-linux-gnu/curl/curl.h:2826,
>                      from bug.c:1:
>     bug.c:15:3: error: cannot convert to a pointer type
>        curl_easy_setopt (NULL, 0, (void *) a);
>        ^~~~
>     bug.c:15:3: error: cannot convert to a pointer type
>        curl_easy_setopt (NULL, 0, (void *) a);
>        ^~~~
>     bug.c:15:3: error: cannot convert to a pointer type
>        curl_easy_setopt (NULL, 0, (void *) a);
>        ^~~~
>
> The error message is correct, but it is repeated tens of times. The
> function is declared this way in curl.h:
>
>     CURL_EXTERN CURLcode curl_easy_setopt(CURL *curl, CURLoption option, ...);

No, curl_easy_setopt is a macro. If you look at the preprocessed code, you get many statements doing the same wrong operation, and one error for each of them.

(Wrong list: this should go to gcc-help, or an issue on bugzilla.)

-- Marc Glisse
Re: gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'
On Wed, 2 Feb 2022, Toon Moene wrote:
> Fascinating. My gcc directory had both gmp-6.2.1 and gmp-6.1.0, but the
> symbolic link 'gmp' pointed to the old one. A similar problem for mpc,
> mpfr and isl ...

You need to pass --force to contrib/download_prerequisites if you want them to be updated.

-- Marc Glisse
Re: gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'
On Tue, 1 Feb 2022, Toon Moene wrote:
> I just ran a "ubsan" build on my x86_64-linux-gnu system. See:
>
> https://gcc.gnu.org/pipermail/gcc-testresults/2022-February/754454.html
>
> This is an interesting failure:
>
>     Executing on host: /home/toon/scratch/bld1142336/gcc/testsuite/gfortran29/../../gfortran -B/home/toon/scratch/bld1142336/gcc/testsuite/gfortran29/../../ -B/home/toon/scratch/bld1142336/x86_64-pc-linux-gnu/./libgfortran/ /home/toon/compilers/gcc/gcc/testsuite/gfortran.dg/graphite/pr39516.f -fdiagnostics-plain-output -fdiagnostics-plain-output -O -O2 -ftree-loop-linear -S -o pr39516.s (timeout = 300)
>     gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'
>     FAIL: gfortran.dg/graphite/pr39516.f   -O  (test for excess errors)
>     Excess errors:
>     gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'
>
> Note that the test case is pure Fortran source. The undefined error
> seems to come from a function inside the graphite library ...

Maybe try with a more recent version of GMP first? gcd_1.c only has 103 lines in release 6.2.1. A stack trace (UBSAN_OPTIONS=print_stacktrace=1) would make it easier to guess where this is coming from.

-- Marc Glisse
Re: reordering of trapping operations and volatile
On Sat, 8 Jan 2022, Martin Uecker via Gcc wrote:
> Am Samstag, den 08.01.2022, 13:41 +0100 schrieb Richard Biener:
>> On January 8, 2022 9:32:24 AM GMT+01:00, Martin Uecker wrote:
>>> Hi Richard, thank you for your quick response! I have a question
>>> regarding reordering of volatile accesses and trapping operations. My
>>> initial assumption (and hope) was that compilers take care to avoid
>>> creating traps that are incorrectly ordered relative to observable
>>> behavior. I had trouble finding examples, and my cursory glance at
>>> the code seemed to confirm that GCC carefully avoids this. But then
>>> someone showed me this example, where this can happen in GCC:
>>>
>>>     volatile int x;
>>>     int foo(int a, int b, _Bool store_to_x)
>>>     {
>>>       if (!store_to_x)
>>>         return a / b;
>>>       x = b;
>>>       return a / b;
>>>     }
>>>
>>> https://godbolt.org/z/vq3r8vjxr
>>>
>>> In this example a division is hoisted before the volatile store. (The
>>> division by zero which could trap is UB, of course.) As Martin Sebor
>>> pointed out, this is done as part of redundancy elimination in
>>> tree-ssa-pre.c, and it might simply be an oversight (and could then
>>> be fixed with a small change). Could you clarify whether such
>>> reordering is intentional and could be exploited in general also in
>>> other optimizations, or confirm that this is an oversight that
>>> affects only this specific case? If this is intentional, are there
>>> examples where this is important for optimization?
>>
>> In general there is no data flow information that prevents traps from
>> being reordered with respect to volatile accesses.
>
> Yes, although I think potentially trapping ops are not moved before
> calls (as this would be incorrect). So do you think it would be
> feasible to prevent this for volatile too?
>
>> The specific case could be easily mitigated in PRE. Another case would
>> be
>>
>>     a = c / d;
>>     x = 1;
>>     if (use_a)
>>       bar (a);
>>
>> where we'd sink the division across x into the guarded BB, I suspect.
>
> Yes.

Related example: https://godbolt.org/z/5WGhadre3

    volatile int x;
    void bar(int a);
    void foo(int c, int d)
    {
      int a = c / d;
      x = 1;
      if (d) bar(a);
    }

    foo:
            mov     DWORD PTR x[rip], 1
            test    esi, esi
            jne     .L4
            ret
    .L4:
            mov     eax, edi
            cdq
            idiv    esi
            mov     edi, eax
            jmp     bar

It would be nice to prevent this too, although I am less concerned about this direction: here the UB has already happened by the point where the trap was supposed to occur, so there is not much we could guarantee anyway. In the other case, the reordering could affect correct code executed before the trap.

-fnon-call-exceptions helps with the first testcase but not with the second one. I don't know if that's by accident, but the flag seems possibly relevant.

-- Marc Glisse
Re: [PATCH] tree-optimization/103514 Missing XOR-EQ-AND Optimization
> +/* (a & b) ^ (a == b) -> !(a | b) */
> +/* (a & b) == (a ^ b) -> !(a | b) */
> +(for first_op (bit_xor eq)
> +     second_op (eq bit_xor)
> + (simplify
> +  (first_op:c (bit_and:c @0 @1) (second_op:c @0 @1))
> +  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
> +       && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
> +   (convert (bit_not (bit_ior @0 @1))))))

I don't think you need types_match: if both are operands of bit_and, their types must already match. It isn't clear what the INTEGRAL_TYPE_P test is for.

Your two transformations don't seem that similar to me. The first one requires that a and b have the same type as the result of ==, so they are boolean-like. The second one makes sense for more general integers, but then it looks like it should produce (a|b)==0.

It doesn't look like we have a canonical representation between a^b and a!=b for booleans :-(

(Sorry for the broken thread; I was automatically unsubscribed because mailman doesn't like greylisting.)

-- Marc Glisse
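Both identities, plus the (a|b)==0 generalization suggested above, can be checked exhaustively; the harness is mine. The reason the wide-integer form works: a&b and a^b have disjoint set bits, so they compare equal exactly when both are zero.

```cpp
// Checks:
//  - for 0/1 operands, (a&b)^(a==b) and (a&b)==(a^b) both equal !(a|b);
//  - for general small ints, (a&b)==(a^b) is equivalent to (a|b)==0.
bool identities_hold() {
  for (int a = 0; a <= 1; ++a)
    for (int b = 0; b <= 1; ++b) {
      if (((a & b) ^ (a == b)) != (int)!(a | b)) return false;
      if (((a & b) == (a ^ b)) != !(a | b)) return false;
    }
  for (int a = -8; a < 8; ++a)      // small exhaustive range for ints
    for (int b = -8; b < 8; ++b)
      if (((a & b) == (a ^ b)) != ((a | b) == 0)) return false;
  return true;
}
```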
Re: [RFC/PATCH] C++ constexpr vs. floating point exceptions.
On Tue, 21 Sep 2021, Jason Merrill via Gcc-patches wrote:
> On Tue, Sep 21, 2021 at 4:30 PM Joseph Myers wrote:
>> On Tue, 21 Sep 2021, Jakub Jelinek via Gcc-patches wrote:
>>> On Tue, Sep 21, 2021 at 02:15:59PM +0100, Roger Sayle wrote:
>>>> Can you double check? Integer division by zero is undefined, but
>>>> isn't floating point division by zero defined by the appropriate
>>>> IEEE standards?
>>>
>>> https://eel.is/c++draft/expr.mul#4 doesn't make the division by zero
>>> behavior conditional on integral types.
>>
>> C has similar wording. The position for C is that Annex F semantics
>> take precedence over all the ways in which floating-point arithmetic
>> has undefined behavior in the main body of the standard. So
>> floating-point overflow and division by zero are fully defined in the
>> presence of Annex F support, while out-of-range conversions from
>> floating point to integer types produce an unspecified value (not
>> necessarily the same unspecified value for different executions of
>> the conversion in the abstract machine - as discussed in bug 93806,
>> GCC can get that wrong and act as if a single execution of such a
>> conversion in the abstract machine produces more than one result). In
>> C, as specified in Annex F, initializers for floating-point objects
>> with static or thread storage duration are evaluated with exceptions
>> discarded and the default rounding mode in effect; 7.0 / 0.0 is a
>> fully valid initializer for such an object to initialize it to
>> positive infinity. As I understand it, the question for this thread
>> is whether C++ constexpr should have a similar rule to C static
>> initializers (which ought to apply to 1.0 / 3.0, raising inexact,
>> just as much as to 7.0 / 0.0). The C rules are:
>>
>>     F.8.2 Translation
>>
>>     During translation the IEC 60559 default modes are in effect:
>>     - The rounding direction mode is rounding to nearest.
>>     - The rounding precision mode (if supported) is set so that
>>       results are not shortened.
>>     - Trapping or stopping (if supported) is disabled on all
>>       floating-point exceptions.
>>
>>     Recommended practice: The implementation should produce a
>>     diagnostic message for each translation-time floating-point
>>     exception, other than "inexact"; the implementation should then
>>     proceed with the translation of the program.
>
> I think following the same rules for C++ would be appropriate in a

I agree that looking at the C standard is more interesting; C++ is very bad at specifying anything float-related.

> diagnosing context: warn and continue. In a template argument
> deduction (SFINAE) context, where errors become silent substitution
> failures, it's probably better to treat them as non-constant.

I am trying to imagine a SFINAE example affected by whether 1./0. is constant. Does that mean A<0., __builtin_inf()> would fail to use the specialization in

    template<double, double> struct A {};
    template<double X> struct A<X, 1./0.> {};

? I don't like that; I believe it should use the specialization. With IEEE 754, 1./0. is perfectly well defined as +inf; the only question is whether it should also set a flag at runtime, which is not relevant in a manifestly consteval context (fetestexcept, etc. are not constexpr; that should be enough to catch mistakes). If some user wants to forbid FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW or FE_UNDERFLOW in compile-time operations, that looks like it could be part of a separate compiler flag or pragma, like C's FENV_ROUND can affect the rounding mode in static initializers (of course C++ templates make pragmas less convenient).

-- Marc Glisse
Re: unexpected result with -O2 solved via "volatile"
On Sun, 19 Sep 2021, Allin Cottrell via Gcc wrote: Should this perhaps be considered a bug? Below is a minimal test case for a type of calculation that occurs in my real code. It works as expected when compiled without optimization, but produces what seems like a wrong result when compiled with -O2, using both gcc 10.3.1 20210422 on Fedora and gcc 11.1.0-1 on Arch Linux. I realize there's a newer gcc release but it's not yet available for Arch, and looking at https://gcc.gnu.org/gcc-11/changes.html I didn't see anything to suggest that something relevant has changed. https://gcc.gnu.org/bugs/ says that you should first try compiling your code with -fsanitize=undefined, which tells you at runtime that your code is broken. Apart from that, bug reports should go to https://gcc.gnu.org/bugzilla/ and questions to gcc-h...@gcc.gnu.org. -- Marc Glisse
Re: [EXTERNAL] Re: [PATCH] tree-optimization: Optimize division followed by multiply [PR95176]
On Fri, 6 Aug 2021, Victor Tong wrote: Thanks for the feedback. Here's the updated pattern: /* X - (X - Y) --> Y */ (simplify (minus (convert1? @0) (convert2? (minus@2 (convert3? @@0) @1))) (if (ANY_INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_UNDEFINED(type) && !TYPE_OVERFLOW_SANITIZED(type) && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@2)) && TYPE_OVERFLOW_UNDEFINED(TREE_TYPE(@2)) && !TYPE_OVERFLOW_SANITIZED(TREE_TYPE(@2)) && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@0)) && TYPE_OVERFLOW_UNDEFINED(TREE_TYPE(@0)) && !TYPE_OVERFLOW_SANITIZED(TREE_TYPE(@0)) && TYPE_PRECISION (TREE_TYPE (@2)) <= TYPE_PRECISION (type) && TYPE_PRECISION (TREE_TYPE (@0)) <= TYPE_PRECISION (type)) (convert @1))) I kept the TYPE_OVERFLOW_SANITIZED checks because I saw other patterns that leverage undefined overflows check for it. I think this new pattern shouldn't be applied if overflow sanitizer checks are enabled. why is this testing TREE_TYPE (@0)? I'm checking the type of @0 because I'm concerned that there could be a case where @0's type isn't an integer type with undefined overflow. I tried creating a test case and couldn't seem to create one where this is violated but I kept the checks to avoid causing a regression. If I'm being overcautious and you feel that the type checks on @0 aren't needed, I can remove them. I think the precision check on TREE_TYPE(@0) is needed to avoid truncation cases though. It doesn't matter if @0 has undefined overflow, but it can matter that it be signed (yes, the 2 are correlated...) if it has the same precision as @2. Otherwise (int64_t)(-1u)-(int64_t)((int)(-1u)-0) is definitely not 0 and it has type:int64_t, @2:int, @0:unsigned. 
Ignoring the sanitizer, the confusing double matching of constants, and restricting to scalars, I think the tightest condition (without vrp) that works is TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@2)) || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@2)) && (TYPE_PRECISION (TREE_TYPE (@0)) < TYPE_PRECISION (TREE_TYPE (@2)) || TYPE_PRECISION (TREE_TYPE (@0)) == TYPE_PRECISION (TREE_TYPE (@2)) && !TYPE_UNSIGNED (TREE_TYPE (@0)) ) (where implicitly undefined => signed) but of course it is ok to handle only a subset. It is too late for me to think about @0 vs @@0. The patch you posted seems to accept the wrong (int128_t)LLONG_MAX-(int128_t)((int)LLONG_MAX-0) -> (int128_t)0 Using ANY_INTEGRAL_TYPE_P allows vectors, and TYPE_PRECISION mustn't be used for vectors (maybe we should add more checking to catch such uses), and they may not be very happy with convert either... Once we'd "inline" nop_convert genmatch would complain about this. Maybe the new transform could be about scalars, and we could restrict the old one to vectors, to simplify the code, but that wouldn't help genmatch with the fact that the pattern x-(x-y) would appear twice. Is it really that bad? It is suspicious, but can be justified. Is someone working on inlining nop_convert? I'd like to avoid breaking someone else's work if that's being worked on right now. I've also added some extra tests to cover this new pattern. I've attached a patch with my latest changes. From: Richard Biener Sent: Wednesday, July 28, 2021 2:59 AM To: Victor Tong Cc: Marc Glisse ; gcc-patches@gcc.gnu.org Subject: Re: [EXTERNAL] Re: [PATCH] tree-optimization: Optimize division followed by multiply [PR95176] On Tue, Jun 29, 2021 at 1:10 AM Victor Tong wrote: Thanks Richard and Marc. I wrote the following test case to compare the outputs of fn1() and fn1NoOpt() below with my extra pattern being applied. I tested the two functions with all of the integers from INT_MIN to INT_MAX.
long fn1 (int x) { return 42L - (long)(42 - x); } #pragma GCC push_options #pragma GCC optimize ("O0") long fn1NoOpt (int x) { volatile int y = (42 - x); return 42L - (long)y; } #pragma GCC pop_options int main () { for (long i=INT_MIN; i<=INT_MAX;i++) { auto valNoOpt = fn1NoOpt(i); auto valOpt = fn1(i); if (valNoOpt != valOpt) printf("valOpt=%ld, valNoOpt=%ld\n", valOpt, valNoOpt); } return 0; } I saw that the return values of fn1() and fn1NoOpt() differed when the input was between INT_MIN and INT_MIN+42 inclusive. When passing values in this range to fn1NoOpt(), a signed overflow is triggered which causes the value to differ (undefined behavior). This seems to go in line with what Marc described and I think the transformation is correct in the scenario above. I do think that type casts that result in truncation (i.e. from a higher precision to a lower one) or with unsigned types will result in an incorrect transformation so those scenarios need t
Re: [PATCH] Fold (X<
On Mon, 26 Jul 2021, Roger Sayle wrote: The one aspect that's a little odd is that each transform is paired with a convert@1 variant, using the efficient match machinery to expose any zero extension to fold-const.c's tree_nonzero_bits functionality. Copying the first transform for context +(for op (bit_ior bit_xor) + (simplify + (op (mult:s@0 @1 INTEGER_CST@2) + (mult:s@3 @1 INTEGER_CST@4)) + (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type) + && (tree_nonzero_bits (@0) & tree_nonzero_bits (@3)) == 0) + (mult @1 +{ wide_int_to_tree (type, wi::to_wide (@2) + wi::to_wide (@4)); }))) + (simplify + (op (mult:s@0 (convert@1 @2) INTEGER_CST@3) + (mult:s@4 (convert@1 @2) INTEGER_CST@5)) + (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type) + && (tree_nonzero_bits (@0) & tree_nonzero_bits (@4)) == 0) + (mult @1 +{ wide_int_to_tree (type, wi::to_wide (@3) + wi::to_wide (@5)); }))) Could you explain how the convert helps exactly? -- Marc Glisse
Re: An asm constraint issue (ARM FPU)
On Sun, 25 Jul 2021, Zoltán Kócsi wrote: I try to write a one-liner inline function to create a double from a 64-bit integer, not converting it to a double but the integer containing the bit pattern for the double (type spoofing). The compiler is arm-eabi-gcc 8.2.0. The target is a Cortex-A9, with NEON. According to the info page the assembler constraint "w" denotes an FPU double register, d0 - d31. The code is the following: double spoof( uint64_t x ) { double r; asm volatile ( " vmov.64 %[d],%Q[i],%R[i] \n" Isn't it supposed to be %P[d] for a double? (the documentation is very lacking...) : [d] "=w" (r) : [i] "q" (x) ); return r; } The command line: arm-eabi-gcc -O0 -c -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv4 \ test.c It compiles and the generated object code is this: <spoof>: 0: e52db004 push {fp} ; (str fp, [sp, #-4]!) 4: e28db000 add fp, sp, #0 8: e24dd014 sub sp, sp, #20 c: e14b01f4 strd r0, [fp, #-20] ; 0xffec 10: e14b21d4 ldrd r2, [fp, #-20] ; 0xffec 14: ec432b30 vmov d16, r2, r3 18: ed4b0b03 vstr d16, [fp, #-12] 1c: e14b20dc ldrd r2, [fp, #-12] 20: ec432b30 vmov d16, r2, r3 24: eeb00b60 vmov.f64 d0, d16 28: e28bd000 add sp, fp, #0 2c: e49db004 pop {fp} ; (ldr fp, [sp], #4) 30: e12fff1e bx lr which is not really efficient, but works. However, if I specify -O1, -O2 or -Os then the compilation fails because assembler complains. This is the assembly the compiler generated, (comments and irrelevant stuff removed): spoof: vmov.64 s0,r0,r1 bx lr where the problem is that 's0' is a single-precision float register and it should be 'd0' instead. Either I'm seriously missing something, in which case I would be most obliged if someone sent me to the right direction; or it is a compiler or documentation bug. Thanks, Zoltan -- Marc Glisse
Re: [PATCH] Fold bswap32(x) != 0 to x != 0 (and related transforms)
On Sun, 18 Jul 2021, Roger Sayle wrote: +(if (GIMPLE || !TREE_SIDE_EFFECTS (@0)) I don't think you need to worry about that, the general genmatch machinery is already supposed to take care of it. All the existing cases in match.pd are about cond_expr, where counting the occurrences of each @i is not reliable. -- Marc Glisse
Re: [EXTERNAL] Re: [PATCH] tree-optimization: Optimize division followed by multiply [PR95176]
On Fri, 18 Jun 2021, Richard Biener wrote: Option 2: Add a new pattern to support scenarios that the existing nop_convert pattern bails out on. Existing pattern: (simplify (minus (nop_convert1? @0) (nop_convert2? (minus (nop_convert3? @@0) @1))) (view_convert @1)) I tried to check with a program when T3 x; T1 y; (T2)x-(T2)((T1)x-y) can be safely replaced with (T2)y From the output, it looks like this is safe when T1 is at least as large as T2. It is wrong when T1 is unsigned and smaller than T2. And when T1 is signed and smaller than T2, it is ok if T3 is the same type as T1 (signed then) or has strictly less precision (any sign), and not in other cases. Note that this is when signed implies undefined overflow and unsigned implies wrapping, and I wouldn't put too much faith in this recently dusted program. And it doesn't say how to write the match.pd pattern with '?', "@@", disabling it if TYPE_OVERFLOW_SANITIZED, etc. Mostly, I wanted to say that if we are going to go handle more than nop_convert for more than just 1 or 2 easy transformations, I think some kind of computer verification would be useful, it would save a lot of time and headaches. (I just check by brute force all possible precisions (from 1 to 6) and signedness for T1, T2 and T3, all possible values for x and y, compute the before and after formulas, accepting if there is UB before, rejecting if there is UB after (and not before), and manually try to see a pattern in the list of types that work) -- Marc Glisse
Re: [PATCH] Simplify (view_convert ~a) < 0 to (view_convert a) >= 0 [PR middle-end/100738]
On Fri, 4 Jun 2021, Hongtao Liu via Gcc-patches wrote: On Tue, Jun 1, 2021 at 6:17 PM Marc Glisse wrote: On Tue, 1 Jun 2021, Hongtao Liu via Gcc-patches wrote: Hi: This patch is about to simplify (view_convert:type ~a) < 0 to (view_convert:type a) >= 0 when type is signed integer. Similar for (view_convert:type ~a) >= 0. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for the trunk? gcc/ChangeLog: PR middle-end/100738 * match.pd ((view_convert ~a) < 0 --> (view_convert a) >= 0, (view_convert ~a) >= 0 --> (view_convert a) < 0): New GIMPLE simplification. We already have /* Fold ~X op C as X op' ~C, where op' is the swapped comparison. */ (for cmp (simple_comparison) scmp (swapped_simple_comparison) (simplify (cmp (bit_not@2 @0) CONSTANT_CLASS_P@1) (if (single_use (@2) && (TREE_CODE (@1) == INTEGER_CST || TREE_CODE (@1) == VECTOR_CST)) (scmp @0 (bit_not @1) Would it make sense to try and generalize it a bit, say with (cmp (nop_convert1? (bit_not @0)) CONSTANT_CLASS_P) (scmp (view_convert:XXX @0) (bit_not @1)) Thanks for your advice, it looks great. And can I use *view_convert1?* instead of *nop_convert1?* here, because the original case is view_convert, and nop_convert would fail to simplify the case. Near the top of match.pd, you can see /* With nop_convert? combine convert? and view_convert? in one pattern plus conditionalize on tree_nop_conversion_p conversions. */ (match (nop_convert @0) (convert @0) (if (tree_nop_conversion_p (type, TREE_TYPE (@0) (match (nop_convert @0) (view_convert @0) (if (VECTOR_TYPE_P (type) && VECTOR_TYPE_P (TREE_TYPE (@0)) && known_eq (TYPE_VECTOR_SUBPARTS (type), TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0))) && tree_nop_conversion_p (TREE_TYPE (type), TREE_TYPE (TREE_TYPE (@0)) So at least the intention is that it can handle both NOP_EXPR for scalars and VIEW_CONVERT_EXPR for vectors, and I think we alread use it that way in some places in match.pd, like (simplify (negate (nop_convert? 
(bit_not @0))) (plus (view_convert @0) { build_each_one_cst (type); })) (simplify (bit_xor:c (nop_convert?:s (bit_not:s @0)) @1) (if (tree_nop_conversion_p (type, TREE_TYPE (@0))) (bit_not (bit_xor (view_convert @0) @1)))) (the 'if' seems redundant for this one) (simplify (negate (nop_convert? (negate @1))) (if (!TYPE_OVERFLOW_SANITIZED (type) && !TYPE_OVERFLOW_SANITIZED (TREE_TYPE (@1))) (view_convert @1))) etc. At some point this got some genmatch help, to handle '?' and numbers, so I don't remember all the details, but following these examples should work. -- Marc Glisse
Re: [PATCH] Simplify (view_convert ~a) < 0 to (view_convert a) >= 0 [PR middle-end/100738]
On Tue, 1 Jun 2021, Hongtao Liu via Gcc-patches wrote: Hi: This patch is about to simplify (view_convert:type ~a) < 0 to (view_convert:type a) >= 0 when type is signed integer. Similar for (view_convert:type ~a) >= 0. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for the trunk? gcc/ChangeLog: PR middle-end/100738 * match.pd ((view_convert ~a) < 0 --> (view_convert a) >= 0, (view_convert ~a) >= 0 --> (view_convert a) < 0): New GIMPLE simplification. We already have /* Fold ~X op C as X op' ~C, where op' is the swapped comparison. */ (for cmp (simple_comparison) scmp (swapped_simple_comparison) (simplify (cmp (bit_not@2 @0) CONSTANT_CLASS_P@1) (if (single_use (@2) && (TREE_CODE (@1) == INTEGER_CST || TREE_CODE (@1) == VECTOR_CST)) (scmp @0 (bit_not @1) Would it make sense to try and generalize it a bit, say with (cmp (nop_convert1? (bit_not @0)) CONSTANT_CLASS_P) (scmp (view_convert:XXX @0) (bit_not @1)) (I still believe that it is a bad idea that SSA_NAMEs are strongly typed, encoding the type in operations would be more convenient, but I think the time for that choice has long gone) -- Marc Glisse
Re: [ARM] PR66791: Replace calls to builtin in vmul_n (a, b) intrinsics with __a * __b
On Wed, 26 May 2021, Prathamesh Kulkarni via Gcc-patches wrote: The attached patch removes calls to builtins in vmul_n* (a, b) with __a * __b. I am not familiar with neon, but are __a and __b unsigned here? Otherwise, is vmul_n already undefined in case of overflow? -- Marc Glisse
Re: [PATCH] Optimize x < 0 ? ~y : y to (x >> 31) ^ y in match.pd
On Sun, 23 May 2021, apinski--- via Gcc-patches wrote: +(for cmp (ge lt) +/* x < 0 ? ~y : y into (x >> (prec-1)) ^ y. */ +/* x >= 0 ? ~y : y into ~((x >> (prec-1)) ^ y). */ + (simplify + (cond (cmp @0 integer_zerop) (bit_not @1) @1) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) +&& !TYPE_UNSIGNED (TREE_TYPE (@0)) +&& TYPE_PRECISION (TREE_TYPE (@0)) == TYPE_PRECISION (type)) Is there a risk that x is signed char (precision 8) and y is a vector with 8 elements? -- Marc Glisse
Re: [PATCH] phiopt: Optimize (x <=> y) cmp z [PR94589]
On Fri, 14 May 2021, Jakub Jelinek via Gcc-patches wrote: On Thu, May 06, 2021 at 09:42:41PM +0200, Marc Glisse wrote: We can probably do it in 2 steps, first something like (for cmp (eq ne) (simplify (cmp (bit_and:c @0 @1) @0) (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); }))) to get rid of the double use, and then simplify X==0 to X<=~C if C is a mask 111...000 (I thought we already had a function to detect such masks, or the 000...111, but I can't find them anymore). I agree that the comparison seems preferable, although if X is signed, the way GIMPLE represents types will add an inconvenient cast. And I think VRP already manages to use the bit test to derive a range. I've tried the second step, but it unfortunately regresses +FAIL: gcc.dg/ipa/propbits-2.c scan-tree-dump-not optimized "fail_test" +FAIL: gcc.dg/tree-ssa/loop-42.c scan-tree-dump-not ivcanon "under assumptions " so maybe it is better to keep these cases as the users wrote them. Thank you for trying it, the patch looks good (it would probably also work on GENERIC), but indeed it is just a canonicalization, not necessarily worth breaking other stuff. This seems to point to another missed optimization. Looking at propbits-2.c, if I rewrite the condition in f1 as if ((unsigned)x <= 0xf) I see later in VRP # RANGE [0, 4294967295] NONZERO 15 x.2_1 = (unsigned intD.9) x_4(D); if (x.2_1 <= 15) where we fail to derive a range from the nonzero bits. Maybe VRP expects IPA-CP should have done it already and doesn't bother recomputing it. I understand this may not be a priority though, that's fine. I didn't look at loop-42.c. -- Marc Glisse
Re: [PATCH] match.pd: Optimize (x & y) == x into (x & ~y) == 0 [PR94589]
On Tue, 11 May 2021, Jakub Jelinek via Gcc-patches wrote: On Thu, May 06, 2021 at 09:42:41PM +0200, Marc Glisse wrote: We can probably do it in 2 steps, first something like (for cmp (eq ne) (simplify (cmp (bit_and:c @0 @1) @0) (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); }))) to get rid of the double use, and then simplify X==0 to X<=~C if C is a mask 111...000 (I thought we already had a function to detect such masks, or the 000...111, but I can't find them anymore). Ok, here is the first step then. Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Or should it have cmp:c too given that == and != are commutative too? Ah, yes, you are right, good point on the cmp:c, thank you. 2021-05-11 Jakub Jelinek Marc Glisse PR tree-optimization/94589 * match.pd ((X & Y) == X -> (X & ~Y) == 0, (X | Y) == Y -> (X & ~Y) == 0): New GIMPLE simplifications. * gcc.dg/tree-ssa/pr94589-1.c: New test. --- gcc/match.pd.jj 2021-04-27 14:46:56.583716888 +0200 +++ gcc/match.pd 2021-05-10 22:31:49.726870421 +0200 @@ -4764,6 +4764,18 @@ (define_operator_list COND_TERNARY (cmp:c (bit_xor:c @0 @1) @0) (cmp @1 { build_zero_cst (TREE_TYPE (@1)); })) +#if GIMPLE + /* (X & Y) == X becomes (X & ~Y) == 0. */ + (simplify + (cmp (bit_and:c @0 @1) @0) + (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); })) + + /* (X | Y) == Y becomes (X & ~Y) == 0. */ + (simplify + (cmp (bit_ior:c @0 @1) @1) + (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); })) +#endif + /* (X ^ C1) op C2 can be rewritten as X op (C1 ^ C2).
*/ (simplify (cmp (convert?@3 (bit_xor @0 INTEGER_CST@1)) INTEGER_CST@2) --- gcc/testsuite/gcc.dg/tree-ssa/pr94589-1.c.jj 2021-05-10 22:36:10.574130179 +0200 +++ gcc/testsuite/gcc.dg/tree-ssa/pr94589-1.c 2021-05-10 22:36:17.789054362 +0200 @@ -0,0 +1,21 @@ +/* PR tree-optimization/94589 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-optimized" } */ + +int +foo (int x) +{ + return (x & 23) == x; +/* { dg-final { scan-tree-dump " & -24;" "optimized" } } */ +/* { dg-final { scan-tree-dump-not " & 23;" "optimized" } } */ +/* { dg-final { scan-tree-dump " == 0" "optimized" } } */ +} + +int +bar (int x) +{ + return (x | 137) != 137; +/* { dg-final { scan-tree-dump " & -138;" "optimized" } } */ +/* { dg-final { scan-tree-dump-not " \\| 137;" "optimized" } } */ +/* { dg-final { scan-tree-dump " != 0" "optimized" } } */ +} Jakub -- Marc Glisse
Re: [PATCH] phiopt: Optimize (x <=> y) cmp z [PR94589]
On Thu, 6 May 2021, Jakub Jelinek via Gcc-patches wrote: Though, (x&1) == x is equivalent to both (x&~1)==0 and to x < 2U and from the latter two it isn't obvious which one is better/more canonical. On aarch64 I don't see differences in number of insns nor in their size: 10: 13001c00 sxtb w0, w0 14: 721f781f tst w0, #0xfffe 18: 1a9f17e0 cset w0, eq // eq = none 1c: d65f03c0 ret vs. 20: 12001c00 and w0, w0, #0xff 24: 7100041f cmp w0, #0x1 28: 1a9f87e0 cset w0, ls // ls = plast 2c: d65f03c0 ret On x86_64 same number of insns, but the comparison is shorter (note, the spaceship result is a struct with signed char based enum): 10: 31 c0 xor %eax,%eax 12: 81 e7 fe 00 00 00 and $0xfe,%edi 18: 0f 94 c0 sete %al 1b: c3 retq 1c: 0f 1f 40 00 nopl 0x0(%rax) vs. 20: 31 c0 xor %eax,%eax 22: 40 80 ff 01 cmp $0x1,%dil 26: 0f 96 c0 setbe %al 29: c3 retq Generally, I'd think that the comparison should be better because it will be most common in user code that way and VRP will be able to do the best thing for it. We can probably do it in 2 steps, first something like (for cmp (eq ne) (simplify (cmp (bit_and:c @0 @1) @0) (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); }))) to get rid of the double use, and then simplify X==0 to X<=~C if C is a mask 111...000 (I thought we already had a function to detect such masks, or the 000...111, but I can't find them anymore). I agree that the comparison seems preferable, although if X is signed, the way GIMPLE represents types will add an inconvenient cast. And I think VRP already manages to use the bit test to derive a range. -- Marc Glisse
Re: [PATCH] phiopt: Optimize (x <=> y) cmp z [PR94589]
On Tue, 4 May 2021, Jakub Jelinek via Gcc-patches wrote: 2) the pr94589-2.C testcase should be matching just 12 times each, but runs into operator>=(strong_ordering, unspecified) being defined as (_M_value&1)==_M_value rather than _M_value>=0. When not honoring NaNs, the 2 case should be unreachable and so (_M_value&1)==_M_value is then equivalent to _M_value>=0, but is not a single use but two uses. I'll need to pattern match that case specially. Somewhere in RTL (_M_value&1)==_M_value is turned into (_M_value&-2)==0, that could be worth doing already in GIMPLE. -- Marc Glisse
Re: [PATCH] match.pd: Use single_use for (T)(A) + CST -> (T)(A + CST) [PR95798]
On Wed, 24 Feb 2021, Jakub Jelinek via Gcc-patches wrote: The following patch adds single_use case which restores these testcases but keeps the testcases the patch meant to improve as is. Hello, I wonder if :s would be sufficient here? I don't have an opinion on which one is better for this particular transformation (don't change the patch because of my comment), we just seem to be getting more and more uses of single_use in match.pd, maybe at some point we need to revisit the meaning of :s or introduce a stronger :S. -- Marc Glisse
Re: bug in DSE?
On Fri, 12 Feb 2021, Andrew MacLeod via Gcc wrote: I don't want to immediately open a PR, so I'll just ask about testsuite/gcc.dg/pr83609.c. the compilation string is -O2 -fno-tree-forwprop -fno-tree-ccp -fno-tree-fre -fno-tree-pre -fno-code-hoisting Which passes as is. if I however add -fno-tree-vrp as well, then it looks like dead store maybe does something wrong... with EVRP running, we translate function foo() from complex float foo () { complex float c; complex float * c.0_1; complex float _4; : c.0_1 = MEM[(long long unsigned int *)c.0_1] = 1311768467463790320; _4 = c; Isn't that a clear violation of strict aliasing? -- Marc Glisse
Re: [PATCH, libstdc++] GLIBCXX_HAVE_INT64_T
On Wed, 6 Jan 2021, David Edelsohn via Gcc-patches wrote: Currently the type of streamoff is determined at libstdc++ configure time, chosen by the definitions of _GLIBCXX_HAVE_INT64_T_LONG and _GLIBCXX_HAVE_INT64_T_LONG_LONG. For a multilib configuration, the difference is encoded in the different multilib header file paths. For "FAT" library targets that package 32 bit and 64 bit libraries together, G++ also expects a single header file directory hierarchy, causing an incorrect value for streamoff in some situations. Shouldn't we change that? I don't see why using the same directory for linking should imply using the same directory for includes. -- Marc Glisse
Re: What is the type of vector signed + vector unsigned?
On Tue, 29 Dec 2020, Richard Sandiford via Gcc wrote: Any thoughts on what f should return in the following testcase, given the usual GNU behaviour of treating signed >> as arithmetic shift right? typedef int vs4 __attribute__((vector_size(16))); typedef unsigned int vu4 __attribute__((vector_size(16))); int f (void) { vs4 x = { -1, -1, -1, -1 }; vu4 y = { 0, 0, 0, 0 }; return ((x + y) >> 1)[0]; } The C frontend takes the type of x+y from the first operand, so x+y is signed and f returns -1. Symmetry is an important property of addition in C/C++. The C++ frontend applies similar rules to x+y as it would to scalars, with unsigned T having a higher rank than signed T, so x+y is unsigned and f returns 0x7fffffff. That looks like the most natural choice. FWIW, Clang treats x+y as signed, so f returns -1 for both C and C++. I think clang follows gcc and uses the type of the first operand. -- Marc Glisse
Re: [PATCH] match.pd: Add ~(X - Y) -> ~X + Y simplification [PR96685]
On Sat, 12 Dec 2020, Jakub Jelinek via Gcc-patches wrote: On Sat, Dec 12, 2020 at 01:25:39PM +0100, Marc Glisse wrote: On Sat, 12 Dec 2020, Jakub Jelinek via Gcc-patches wrote: This patch adds the ~(X - Y) -> ~X + Y simplification requested in the PR (plus also ~(X + C) -> ~X + (-C) for constants C that can be safely negated. Would it have been wrong to produce ~X - C without caring about negating (and then extending it to non-constants)? Extending it to non-constants is what I wanted to avoid. For ~(X + Y), because + is commutative, it wouldn't be a canonicalization as it would pick more-less randomly whether to do ~X + Y or X + ~Y. ~X - Y or ~Y - X I guess. Ok, I understand. But then in the constant case, why produce ~X + -C instead of ~X - C (which I think doesn't need to care about negating), or even ~C - X (one less operation)? Or do we already have a transformation from ~X - C to ~C - X? -- Marc Glisse
Re: [PATCH] match.pd: Add ~(X - Y) -> ~X + Y simplification [PR96685]
On Sat, 12 Dec 2020, Jakub Jelinek via Gcc-patches wrote: This patch adds the ~(X - Y) -> ~X + Y simplification requested in the PR (plus also ~(X + C) -> ~X + (-C) for constants C that can be safely negated. Would it have been wrong to produce ~X - C without caring about negating (and then extending it to non-constants)? I wonder if this makes /* ~(~X - Y) -> X + Y and ~(~X + Y) -> X - Y. */ useless. -- Marc Glisse
Re: Integer division on x86 -m32
On Thu, 10 Dec 2020, Lucas de Almeida via Gcc wrote: when performing (int64_t) foo / (int32_t) bar in gcc under x86, a call to __divdi3 is always output, even though it seems the use of the idiv instruction could be faster. IIRC, idiv requires that the quotient fit in 32 bits, while your C code doesn't. (1LL << 60) / 3 would cause an error with idiv. It would be possible to use idiv in some cases, if the compiler can prove that variables are in the right range, but that's not so easy. You can use inline asm to force the use of idiv if you know it is safe for your case, the most common being modular arithmetic: if you know that uint32_t a, b, c, d are smaller than m (and m!=0), you can compute a*b+c+d in uint64_t, then use div to compute that modulo m. -- Marc Glisse
Re: The conditions when convert from double to float is permitted?
On Thu, 10 Dec 2020, Xionghu Luo via Gcc wrote: I have a maybe silly question about whether there is any *standard* or *options* (like -ffast-math) for GCC that allow double to float demotion optimization? For example, 1) from PR22326: #include <math.h> float foo(float f, float x, float y) { return (fabs(f)*x+y); } The fabs will return double result but it could be demoted to float actually since the function returns float finally. With fp-contract, this is (float)fma((double)f,(double)x,(double)y). This could almost be transformed into fmaf(f,x,y), except that the double rounding may not be strictly equivalent. Still, that seems like it would be no problem with -funsafe-math-optimizations, just like turning (float)((double)x*(double)y) into x*y, as long as it is a single operation with casts on all inputs and output. Whether there are cases that can be optimized without -funsafe-math-optimizations is harder to tell. -- Marc Glisse
Re: [PATCH] match.pd: Use ranges to optimize some x * y / y to x [PR97997]
On Thu, 26 Nov 2020, Jakub Jelinek via Gcc-patches wrote: For signed integers with undefined overflow we already optimize x * y / y into x, but for signed integers with -fwrapv or unsigned integers we don't. The following patch allows optimizing that into just x if value ranges prove that x * y will never overflow. I've long wanted a helper that checks if VRP thinks an operation could overflow, I think at some point it would make sense to move this code to some function so that it can be easily reused. Maybe also define a matcher so we can write (mult_noovf @0 @1) which would succeed if either overflow is undefined or if VRP can prove that no overflow is happening. Of course that's all ideas for later, refactoring belongs in the second or third patch using a feature, not the first one :-) -- Marc Glisse
Re: [PATCH] [tree-optimization] Optimize max/min pattern with comparison
On Tue, 1 Dec 2020, Eugene Rozenfeld via Gcc-patches wrote: Thank you for the review Jeff. I don't need to look at the opcode to know the result. The pattern will be matched only in these 4 cases: X <= MAX(X, Y) -> true X > MAX(X, Y) -> false X >= MIN(X, Y) -> true X < MIN(X, Y) -> false So, the result will be true for GE_EXPR and LE_EXPR and false otherwise. Is that true even if X is NaN? It may be hard to hit a situation where this matters though, if we honor NaN, we don't build MAX_EXPR (which is unspecified). -- Marc Glisse
Re: Reassociation and trapping operations
On Wed, 25 Nov 2020, Ilya Leoshkevich via Gcc wrote: I have a C floating point comparison (a <= b && a >= b), which test_for_singularity turns into (a <= b && a == b) and vectorizer turns into ((a <= b) & (a == b)). So far so good. eliminate_redundant_comparison, however, turns it into just (a == b). I don't think this is correct, because (a <= b) traps and (a == b) doesn't. Hello, let me just mention the old https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53805 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53806 There has been some debate about the exact meaning of -ftrapping-math, but don't let that stop you. -- Marc Glisse
Re: Installing a generated header file
On Thu, 12 Nov 2020, Bill Schmidt via Gcc wrote: Hi! I'm working on a project where it's desirable to generate a target-specific header file while building GCC, and install it with the rest of the target-specific headers (i.e., in lib/gcc/<target>/11.0.0/include). Today it appears that only those headers listed in "extra_headers" in config.gcc will be placed there, and those are assumed to be found in gcc/config/. In my case, the header file will end up in my build directory instead. Questions: * Has anyone tried something like this before? I didn't find anything. * If so, can you please point me to an example? * Otherwise, I'd be interested in advice about providing new infrastructure to support this. I'm a relative noob with respect to the configury code, and I'm sure my initial instincts will be wrong. :) Does the i386 mm_malloc.h file match your scenario? -- Marc Glisse
Re: A couple GIMPLE questions
On Sat, 5 Sep 2020, Gary Oblock via Gcc wrote: First off one of the questions just me being curious but second is quite serious. Note, this is GIMPLE coming into my optimization and not something I've modified. Here's the C code: type_t * do_comp( type_t *data, size_t len) { type_t *res; type_t *x = min_of_x( data, len); type_t *y = max_of_y( data, len); res = y; if ( x < y ) res = 0; return res; } And here's the resulting GIMPLE: ;; Function do_comp.constprop (do_comp.constprop.0, funcdef_no=5, decl_uid=4392, cgraph_uid=3, symbol_order=68) (executed once) do_comp.constprop (struct type_t * data) { struct type_t * res; struct type_t * x; struct type_t * y; size_t len; <bb 2> [local count: 1073741824]: x_2 = min_of_x (data_1(D), 1); y_3 = max_of_y (data_1(D), 1); if (x_2 < y_3) goto <bb 3>; [29.00%] else goto <bb 4>; [71.00%] <bb 3> [local count: 311385128]: <bb 4> [local count: 1073741824]: # res_4 = PHI <0B(3), y_3(2)> return res_4; } The silly question first. In the "if" stmt how does GCC get those probabilities? Which it shows as 29.00% and 71.00%. I believe they should both be 50.00%. See the profile_estimate pass dump. One branch makes the function return NULL, which makes gcc guess that it may be a bit less likely than the other. Those are heuristics, which are tuned to help on average, but of course they are sometimes wrong. The serious question is what is going on with this phi? res_4 = PHI <0B(3), y_3(2)> This makes zero sense practicality wise to me and how is it supposed to be recognized and used? Note, I really do need to transform the "0B" into something else for my structure reorganization optimization. That's not a question? Are you asking why PHIs exist at all? They are the standard way to represent merging in SSA representations. You can iterate on the PHIs of a basic block, etc.
CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and contains information that is confidential and proprietary to Ampere Computing or its subsidiaries. It is to be used solely for the purpose of furthering the parties' business relationship. Any unauthorized review, copying, or distribution of this email (or any attachments thereto) is strictly prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete the original and any copies of this email and any attachments thereto. Could you please get rid of this when posting on public mailing lists? -- Marc Glisse
Re: [PATCH] c++: Disable -frounding-math during manifestly constant evaluation [PR96862]
On Wed, 2 Sep 2020, Jason Merrill via Gcc-patches wrote: On 9/1/20 6:13 AM, Marc Glisse wrote: On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote: As discussed in the PR, fold-const.c punts on floating point constant evaluation if the result is inexact and -frounding-math is turned on. /* Don't constant fold this floating point operation if the result may dependent upon the run-time rounding mode and flag_rounding_math is set, or if GCC's software emulation is unable to accurately represent the result. */ if ((flag_rounding_math || (MODE_COMPOSITE_P (mode) && !flag_unsafe_math_optimizations)) && (inexact || !real_identical (, ))) return NULL_TREE; Jonathan said that we should be evaluating them anyway, e.g. conceptually as if they are done with the default rounding mode before user had a chance to change that, and e.g. in C in initializers it is also ignored. In fact, fold-const.c for C initializers turns off various other options: /* Perform constant folding and related simplification of initializer expression EXPR. These behave identically to "fold_buildN" but ignore potential run-time traps and exceptions that fold must preserve. 
*/

#define START_FOLD_INIT \
  int saved_signaling_nans = flag_signaling_nans;\
  int saved_trapping_math = flag_trapping_math;\
  int saved_rounding_math = flag_rounding_math;\
  int saved_trapv = flag_trapv;\
  int saved_folding_initializer = folding_initializer;\
  flag_signaling_nans = 0;\
  flag_trapping_math = 0;\
  flag_rounding_math = 0;\
  flag_trapv = 0;\
  folding_initializer = 1;

#define END_FOLD_INIT \
  flag_signaling_nans = saved_signaling_nans;\
  flag_trapping_math = saved_trapping_math;\
  flag_rounding_math = saved_rounding_math;\
  flag_trapv = saved_trapv;\
  folding_initializer = saved_folding_initializer;

So, shall cxx_eval_outermost_constant_expr instead turn off all those options (then warning_sentinel wouldn't be the right thing to use, but given the 8 or however many return stmts in cxx_eval_outermost_constant_expr, we'd need a RAII class for this)? Not sure about the folding_initializer, that one is affecting complex multiplication and division constant evaluation somehow.

I don't think we need to turn off flag_signaling_nans or flag_trapv. I think we want to turn off flag_trapping_math so we can fold 1./0 to inf (still in a context where folding is mandatory). Setting folding_initializer seems consistent with that, enabling infinite results in complex folding (it also forces folding of __builtin_constant_p, which may be redundant with force_folding_builtin_constant_p).

C++ says that division by zero has undefined behavior, and that an expression with undefined behavior is not constant, so we shouldn't fold 1./0 to inf anyway. The same is true of other trapping operations. So clearing flag_signaling_nans, flag_trapping_math, and flag_trapv seems wrong for C++. And folding_initializer seems to be used for the same sort of thing.

So we should actually force flag_trapping_math to true during constexpr evaluation? And folding_initializer to false, and never mind trapv but maybe disable wrapv?
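Jakub's point about the many return statements is the usual argument for a scope guard. A minimal sketch of such a RAII saver (the name flag_override and the stand-in flag variables are made up for illustration; GCC's actual patch in this thread ends up using warning_sentinel, which plays the same role):

```cpp
#include <cassert>

// Toy stand-ins for the global flags discussed above (names are placeholders).
int flag_rounding_math = 1;
int flag_trapping_math = 1;

// Minimal RAII saver: override a flag on entry, restore it on every exit
// path, which is what makes it safe with many return statements.
struct flag_override
{
  int &flag;
  int saved;
  flag_override (int &f, int value) : flag (f), saved (f) { f = value; }
  ~flag_override () { flag = saved; }
};

int
eval_constant_expr ()
{
  flag_override rm (flag_rounding_math, 0);
  flag_override tm (flag_trapping_math, 0);
  assert (flag_rounding_math == 0 && flag_trapping_math == 0);
  return 42; // early returns are fine; destructors restore the flags
}
```

Any return path (including exceptions) runs the destructors, so the flags are back to their saved values after the call.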
#include <limits>

constexpr double a = std::numeric_limits<double>::infinity();
constexpr double b = a + a;
constexpr double c = a - a;
constexpr double d = 1. / a;
constexpr double e = 1. / d;

clang rejects c and e. MSVC rejects e. Intel warns on c. Gcc rejects only e, and accepts the whole thing if I pass -fno-trapping-math. Almost any FP operation is possibly trapping: 1./3. sets FE_INEXACT just as 1./0. sets FE_DIVBYZERO. But the standard says

char array[1 + int(1 + 0.2 - 0.1 - 0.1)]; // Must be evaluated during translation

So it doesn't seem like it cares about that? Division by zero is the only one that gets weirdly special-cased... -- Marc Glisse
Re: [RFC] Add new flag to specify output constraint in match.pd
On Wed, 2 Sep 2020, Richard Biener via Gcc wrote: On Mon, Aug 24, 2020 at 8:20 AM Feng Xue OS via Gcc wrote: There is a match-folding issue derived from pr94234. A piece of code like:

int foo (int n)
{
  int t1 = 8 * n;
  int t2 = 8 * (n - 1);
  return t1 - t2;
}

It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C", and be folded to the constant "8". But this folding will fail if both t1 and t2 have multiple uses, as in the following code:

int foo (int n)
{
  int t1 = 8 * n;
  int t2 = 8 * (n - 1);
  use_fn (t1, t2);
  return t1 - t2;
}

Given an expression with non-single-use operands, folding it will introduce duplicated computation in most situations, and is deemed to be unprofitable. But it is always beneficial if the final result is a constant or an existing SSA value. And the rule is:

(simplify
 (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
 (if ((!ANY_INTEGRAL_TYPE_P (type)
       || TYPE_OVERFLOW_WRAPS (type)
       || (INTEGRAL_TYPE_P (type)
           && tree_expr_nonzero_p (@0)
           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
      /* If @1 +- @2 is constant require a hard single-use on either
         original operand (but not on both). */
      && (single_use (@3) || single_use (@4)))   <- control whether match or not
  (mult (plusminus @1 @2) @0)))

The current matcher only provides a way to check something before folding, but no mechanism to affect the decision after folding. If it had one, for the above case, we could let the fold go ahead when we find the result is a constant.

:s already has a counter-measure where it still folds if the output is at most one operation. So this transformation has a counter-counter-measure of checking single_use explicitly. And now we want a counter^3-measure...

Counter-measures are a key factor in matching cost. ":s" seems to be somewhat coarse-grained, and here we do need more control over it. But ideally, we could decouple these counter-measures from the definitions of the match rules, and let the gimple matcher make a more reasonable match-or-not decision based on these counters. Anyway, that is another story.

Like the flags used to describe input operands, we could also add a new flag to specify this kind of constraint on the output, namely that we expect it to be a simple gimple value. The proposed syntax is

(opcode:v{ condition } )

The char "v" stands for gimple value; if another char is more descriptive, it is preferred. "condition", enclosed by { }, is an optional C-syntax condition expression. If present, only when "condition" is met will the matcher check whether the folding result is a gimple value, using gimple_simplified_result_is_gimple_val (). Since there is no SSA concept in GENERIC, this is only for GIMPLE-match, not GENERIC-match. With this syntax, the rule is changed to

#Form 1:
(simplify
 (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
 (if ((!ANY_INTEGRAL_TYPE_P (type)
       || TYPE_OVERFLOW_WRAPS (type)
       || (INTEGRAL_TYPE_P (type)
           && tree_expr_nonzero_p (@0)
           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))
  (if (!single_use (@3) && !single_use (@4))
   (mult:v (plusminus @1 @2) @0)))
  (mult (plusminus @1 @2) @0)

That seems to match what you can do with '!' now (that's very recent). It's also what :s does, but a slight bit more "local". When any operand is marked :s and has more than a single use, we only allow simplifications that do not require insertion of extra stmts. So basically the above pattern doesn't behave any differently than if you omit your :v. Only if you placed :v on an inner expression would there be a difference. Correlating the inner expression we'd not want to insert new expressions for with a specific :s (or multiple ones) would be a more natural extension of what :s provides. Thus, for the above case (Form 1), you do not need :v at all and :s works.

Let's consider that multiplication is expensive. We have code like 5*X-3*X, which can be simplified to 2*X. However, if both 5*X and 3*X have other uses, that would increase the number of multiplications. :s would not block a simplification to 2*X, which is a single stmt.
So the existing transformation has extra explicit checks for single_use. And those extra checks block the transformation even for 5*X-4*X -> X which does not increase the number of multiplications. Which is where '!' (or :v here) comes in. Or we could decide that the extra multiplication is not that bad if it saves an addition, simplifies the expression, possibly gains more insn parallelism, etc, in which case we could just drop the existing hard single_use check... -- Marc Glisse
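For what it's worth, only cost is at stake in this discussion, never correctness: distribution holds in the wrapping ring Z/2^32 Z, so the folded and unfolded forms always agree in value. A small sketch (unsigned, so wraparound is well defined in plain C++):

```cpp
#include <cassert>
#include <cstdint>

// The fold (A*C) - (B*C) -> (A-B)*C is value-preserving even when the
// multiplications wrap, because distribution holds in Z/2^32 Z; the
// single_use restrictions discussed above are purely about cost.
uint32_t unfolded (uint32_t a, uint32_t b, uint32_t c) { return a * c - b * c; }
uint32_t folded   (uint32_t a, uint32_t b, uint32_t c) { return (a - b) * c; }
```

The 5*X-4*X -> X case from the text corresponds to folded (5, 4, x), where the fold removes multiplications instead of adding one.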
Re: [PATCH] middle-end/94301 - map V1x to x when the vector mode is not available
On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote: On Tue, Sep 01, 2020 at 11:50:32AM +0200, Richard Biener wrote: While IMHO it makes very much sense to map V1{S,D,T}F types to {S,D,T}Fmode if there's no special vector mode for it the question is whether we can make this "change" for the sake of the ABIs? The alternative is to make vector lowering adjust this, inserting V_C_Es for example. Doesn't gcc violate the x86_64 ABI document by passing V1* vectors on the stack? Clang seems to go half-way (arguments on the stack but return in register), which is strange as well. I think those vectors are relatively rare and breaking the ABI for them may be doable, but that's not based on any actual data. I'd fear about the ABI consequences, as well as anything in the FEs and middle-end that cares about vector modes, so I think lowering this in tree-vect-generic.c using V_C_Es looks much safer to me. If they weren't so rare, we could consider lowering them earlier so they benefit from more optimizations, but that doesn't seem worth the trouble. -- Marc Glisse
Re: [PATCH] c++: Disable -frounding-math during manifestly constant evaluation [PR96862]
On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote: As discussed in the PR, fold-const.c punts on floating point constant evaluation if the result is inexact and -frounding-math is turned on. /* Don't constant fold this floating point operation if the result may dependent upon the run-time rounding mode and flag_rounding_math is set, or if GCC's software emulation is unable to accurately represent the result. */ if ((flag_rounding_math || (MODE_COMPOSITE_P (mode) && !flag_unsafe_math_optimizations)) && (inexact || !real_identical (, ))) return NULL_TREE; Jonathan said that we should be evaluating them anyway, e.g. conceptually as if they are done with the default rounding mode before user had a chance to change that, and e.g. in C in initializers it is also ignored. In fact, fold-const.c for C initializers turns off various other options: /* Perform constant folding and related simplification of initializer expression EXPR. These behave identically to "fold_buildN" but ignore potential run-time traps and exceptions that fold must preserve. */ #define START_FOLD_INIT \ int saved_signaling_nans = flag_signaling_nans;\ int saved_trapping_math = flag_trapping_math;\ int saved_rounding_math = flag_rounding_math;\ int saved_trapv = flag_trapv;\ int saved_folding_initializer = folding_initializer;\ flag_signaling_nans = 0;\ flag_trapping_math = 0;\ flag_rounding_math = 0;\ flag_trapv = 0;\ folding_initializer = 1; #define END_FOLD_INIT \ flag_signaling_nans = saved_signaling_nans;\ flag_trapping_math = saved_trapping_math;\ flag_rounding_math = saved_rounding_math;\ flag_trapv = saved_trapv;\ folding_initializer = saved_folding_initializer; So, shall cxx_eval_outermost_constant_expr instead turn off all those options (then warning_sentinel wouldn't be the right thing to use, but given the 8 or how many return stmts in cxx_eval_outermost_constant_expr, we'd need a RAII class for this. 
Not sure about the folding_initializer, that one is affecting complex multiplication and division constant evaluation somehow. I don't think we need to turn off flag_signaling_nans or flag_trapv. I think we want to turn off flag_trapping_math so we can fold 1./0 to inf (still in a context where folding is mandatory). Setting folding_initializer seems consistent with that, enabling infinite results in complex folding (it also forces folding of __builtin_constant_p, which may be redundant with force_folding_builtin_constant_p). The following patch has been bootstrapped/regtested on x86_64-linux and i686-linux, but see above, maybe we want something else. 2020-09-01 Jakub Jelinek PR c++/96862 * constexpr.c (cxx_eval_outermost_constant_expr): Temporarily disable flag_rounding_math during manifestly constant evaluation. * g++.dg/cpp1z/constexpr-96862.C: New test. --- gcc/cp/constexpr.c.jj 2020-08-31 14:10:15.826921458 +0200 +++ gcc/cp/constexpr.c 2020-08-31 15:41:26.429964532 +0200 @@ -6680,6 +6680,8 @@ cxx_eval_outermost_constant_expr (tree t allow_non_constant, strict, manifestly_const_eval || !allow_non_constant }; + /* Turn off -frounding-math for manifestly constant evaluation. 
*/ + warning_sentinel rm (flag_rounding_math, ctx.manifestly_const_eval); tree type = initialized_type (t); tree r = t; bool is_consteval = false; --- gcc/testsuite/g++.dg/cpp1z/constexpr-96862.C.jj 2020-08-31 15:50:07.847473028 +0200 +++ gcc/testsuite/g++.dg/cpp1z/constexpr-96862.C2020-08-31 15:49:40.829861168 +0200 @@ -0,0 +1,20 @@ +// PR c++/96862 +// { dg-do compile { target c++17 } } +// { dg-additional-options "-frounding-math" } + +constexpr double a = 0x1.0p+100 + 0x1.0p-100; +const double b = 0x1.0p+100 + 0x1.0p-100; +const double & = 0x1.0p+100 + 0x1.0p-100; +static_assert (0x1.0p+100 + 0x1.0p-100 == 0x1.0p+100, ""); + +void +foo () +{ + constexpr double d = 0x1.0p+100 + 0x1.0p-100; + const double e = 0x1.0p+100 + 0x1.0p-100; + const double & = 0x1.0p+100 + 0x1.0p-100; + static_assert (0x1.0p+100 + 0x1.0p-100 == 0x1.0p+100, ""); +} + +const double = a; +const double = b; Jakub -- Marc Glisse
Re: [RFC] Add new flag to specify output constraint in match.pd
On Fri, 21 Aug 2020, Feng Xue OS via Gcc wrote: There is a match-folding issue derived from pr94234. A piece of code like:

int foo (int n)
{
  int t1 = 8 * n;
  int t2 = 8 * (n - 1);
  return t1 - t2;
}

It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C", and be folded to the constant "8". But this folding will fail if both t1 and t2 have multiple uses, as in the following code:

int foo (int n)
{
  int t1 = 8 * n;
  int t2 = 8 * (n - 1);
  use_fn (t1, t2);
  return t1 - t2;
}

Given an expression with non-single-use operands, folding it will introduce duplicated computation in most situations, and is deemed to be unprofitable. But it is always beneficial if the final result is a constant or an existing SSA value. And the rule is:

(simplify
 (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
 (if ((!ANY_INTEGRAL_TYPE_P (type)
       || TYPE_OVERFLOW_WRAPS (type)
       || (INTEGRAL_TYPE_P (type)
           && tree_expr_nonzero_p (@0)
           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
      /* If @1 +- @2 is constant require a hard single-use on either
         original operand (but not on both). */
      && (single_use (@3) || single_use (@4)))   <- control whether match or not
  (mult (plusminus @1 @2) @0)))

The current matcher only provides a way to check something before folding, but no mechanism to affect the decision after folding. If it had one, for the above case, we could let the fold go ahead when we find the result is a constant.

:s already has a counter-measure where it still folds if the output is at most one operation. So this transformation has a counter-counter-measure of checking single_use explicitly. And now we want a counter^3-measure...

Like the flags used to describe input operands, we could also add a new flag to specify this kind of constraint on the output, namely that we expect it to be a simple gimple value. The proposed syntax is

(opcode:v{ condition } )

The char "v" stands for gimple value; if another char is more descriptive, it is preferred. "condition", enclosed by { }, is an optional C-syntax condition expression. If present, only when "condition" is met will the matcher check whether the folding result is a gimple value, using gimple_simplified_result_is_gimple_val (). Since there is no SSA concept in GENERIC, this is only for GIMPLE-match, not GENERIC-match. With this syntax, the rule is changed to

#Form 1:
(simplify
 (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
 (if ((!ANY_INTEGRAL_TYPE_P (type)
       || TYPE_OVERFLOW_WRAPS (type)
       || (INTEGRAL_TYPE_P (type)
           && tree_expr_nonzero_p (@0)
           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))
  (if (!single_use (@3) && !single_use (@4))
   (mult:v (plusminus @1 @2) @0)))
  (mult (plusminus @1 @2) @0)

That seems to match what you can do with '!' now (that's very recent).

#Form 2:
(simplify
 (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
 (if ((!ANY_INTEGRAL_TYPE_P (type)
       || TYPE_OVERFLOW_WRAPS (type)
       || (INTEGRAL_TYPE_P (type)
           && tree_expr_nonzero_p (@0)
           && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))
  (mult:v{ !single_use (@3) && !single_use (@4 } (plusminus @1 @2) @0

Indeed, something more flexible than '!' would be nice, but I am not so sure about this version. If we are going to allow inserting code after resimplification and before validation, maybe we should go even further and let people insert arbitrary code there... -- Marc Glisse
Re: [committed] libstdc++: Make __int128 meet integer-class requirements [PR 96042]
On Sat, 22 Aug 2020, Jonathan Wakely via Gcc-patches wrote: On Sat, 22 Aug 2020 at 13:13, Jonathan Wakely wrote: On Sat, 22 Aug 2020 at 10:52, Marc Glisse wrote: is there a particular reason to handle only __int128 this way, and not all the non-standard integer types? It looks like it would be a bit simpler to avoid a special case. I have no objection to doing it for all of them, it just wasn't necessary to solve the immediate problem that the library now uses __int128 even when integral<__int128> is false. (Hmm, or is size_t an alias for __int20 on one arch, which would mean we do use it?) Oh I remember why I didn't do that now. I did actually want to do it that way initially. The macros like __GLIBCXX_TYPE_INT_N_0 are not defined in strict mode, so we have no generic way to name those types. IIRC, those macros were introduced specifically to help libstdc++. If libstdc++ wants them defined in different circumstances, it should be fine to change the condition from "!flag_iso || int_n_data[i].bitsize == POINTER_SIZE" to whatever you need. But now I understand why you did this, thanks. -- Marc Glisse
RE: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.
On Sat, 15 Aug 2020, Roger Sayle wrote: Here's version #2 of the patch to recognize bswap32 and bswap64 incorporating your suggestions and feedback. The test cases now confirm the transformation is applied when int is 32 bits and long is 64 bits, and should pass otherwise; the patterns now reuse (more) capturing groups, and the patterns have been made more generic to allow the ultimate type to be signed or unsigned (hence there are now two new gcc.dg tests). Alas my efforts to allow the input argument to be signed, and use fold_convert to coerce it to the correct type before calling __builtin_bswap failed, with the error messages: You can't use fold_convert for that (well, maybe if you restricted the transformation to GENERIC), but if I understand correctly, you are trying to do (convert (BUILT_IN_BSWAP64 (convert:uint64_type_node @1)) ? (untested) From: Marc Glisse +(simplify + (bit_ior:c +(lshift + (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 + INTEGER_CST@1 + (INTEGER_CST@2)) +(convert (BUILT_IN_BSWAP16 (convert (rshift @3 + INTEGER_CST@4) I didn't realize we kept this useless bit_and when casting to a smaller type. I was confused when I wrote that and thought we were converting from int to uint16_t, but bswap16 actually takes an int on x86_64, probably because of the calling convention, so we are converting from unsigned int to int. Having implementation details like the calling convention appear here in the intermediate language complicates things a bit. Can we assume that it is fine to build a call to bswap32/bswap64 taking uint32_t/uint64_t and that only bswap16 can be affected? Do most targets have a similar-enough calling convention that this transformation also works on them? It looks like aarch64 / powerpc64le / mips64el would like for bswap16->bswap32 a transformation of the same form as the one you wrote for bswap32->bswap64. I was wondering what would happen if I start from an int instead of an unsigned int. 
f (int x)
{
  short unsigned int _1;
  short unsigned int _2;
  short unsigned int _3;
  int _5;
  int _7;
  unsigned int _8;
  unsigned int _9;
  int _10;

  [local count: 1073741824]:
  _7 = x_4(D) & 65535;
  _1 = __builtin_bswap16 (_7);
  _8 = (unsigned int) x_4(D);
  _9 = _8 >> 16;
  _10 = (int) _9;
  _2 = __builtin_bswap16 (_10);
  _3 = _1 | _2;
  _5 = (int) _3;
  return _5;
}

Handling this in the same transformation with a pair of convert12? and some tests should be doable, but it gets complicated enough that it is fine to postpone that. -- Marc Glisse
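The shape being matched above can be checked numerically: a 32-bit byte swap is exactly two 16-bit byte swaps of the halves, OR'd together. A sketch (plain C++, with a direct bswap32 as the reference):

```cpp
#include <cassert>
#include <cstdint>

static uint16_t
bswap16 (uint16_t v)
{
  return (uint16_t) ((v << 8) | (v >> 8));
}

// Reference 32-bit byte swap.
static uint32_t
bswap32 (uint32_t x)
{
  return (x << 24) | ((x & 0xff00u) << 8) | ((x >> 8) & 0xff00u) | (x >> 24);
}

// The idiom from the GIMPLE dump: swap the low half into the high half
// and the high half into the low half.
static uint32_t
bswap32_from_halves (uint32_t x)
{
  uint32_t lo = bswap16 ((uint16_t) (x & 0xffffu));
  uint32_t hi = bswap16 ((uint16_t) (x >> 16));
  return (lo << 16) | hi;
}
```

For x = 0x12345678 both forms give 0x78563412, which is what the match.pd pattern is entitled to rely on.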
Re: [PATCH] middle-end: Simplify popcount/parity of bswap/rotate.
On Fri, 21 Aug 2020, Roger Sayle wrote: This simple patch to match.pd optimizes away bit permutation operations, specifically bswap and rotate, in calls to popcount and parity.

Good idea.

Although this patch has been developed and tested on LP64, it relies on there being no truncations or extensions to "marry up" the appropriate PARITY, PARITYL and PARITYLL forms with either BSWAP32 or BSWAP64, assuming this transformation won't fire if the integral types have different sizes.

There would be a convert_expr or similar if the sizes were different, so you are safe. I wish we had only generic builtins/ifns instead of __builtin_parity{,l,ll,imax} and, inconsistently, __builtin_bswap{16,32,64,128}; that's very inconvenient to deal with... And genmatch generates a lot of useless code because of that. I didn't try, but couldn't the transformations be merged to reduce repetition?

(for bitcnt (POPCOUNT PARITY)
 (for swap (BUILT_IN_BSWAP32 BUILT_IN_BSWAP64)
  (simplify
   (bitcnt (swap @0))
   (bitcnt @0)))
 (for rot (lrotate rrotate)
  (simplify
   (bitcnt (rot @0 @1))
   (bitcnt @0))))

I assume you skipped BUILT_IN_BSWAP16 because on 32+ bit targets there is an intermediate extension, and 16-bit targets are "rare"? And BUILT_IN_BSWAP128 because on most platforms intmax_t is only 64 bits and we don't have a 128-bit version of parity/popcount? (We have an IFN, but it seldom appears by magic.) -- Marc Glisse
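The identity behind the patch is simple: byte swaps and rotates only permute bits, so population count and parity are unchanged. A quick numeric sketch:

```cpp
#include <cassert>
#include <cstdint>

// popcount and parity are invariant under any bit permutation, in
// particular byte swaps and rotates, which is all the patterns rely on.
static int
popcount32 (uint32_t x)
{
  int n = 0;
  for (; x; x &= x - 1) // clear the lowest set bit each iteration
    n++;
  return n;
}

static uint32_t
bswap32 (uint32_t x)
{
  return (x << 24) | ((x & 0xff00u) << 8) | ((x >> 8) & 0xff00u) | (x >> 24);
}

static uint32_t
rotl32 (uint32_t x, unsigned r)
{
  r &= 31;
  return r ? (x << r) | (x >> (32 - r)) : x;
}
```

Parity is just the low bit of popcount, so the same invariance follows for it.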
Re: [committed] libstdc++: Make __int128 meet integer-class requirements [PR 96042]
On Wed, 19 Aug 2020, Jonathan Wakely via Gcc-patches wrote: Because __int128 can be used as the difference type for iota_view, we need to ensure that it meets the requirements of an integer-class type. The requirements in [iterator.concept.winc] p10 include numeric_limits being specialized and giving meaningful answers. Currently we only specialize numeric_limits for non-standard integer types in non-strict modes. However, nothing prevents us from defining an explicit specialization for any implementation-defined type, so it doesn't matter whether std::is_integral<__int128> is true or not. This patch ensures that the numeric_limits specializations for signed and unsigned __int128 are defined whenever __int128 is available. It also makes the __numeric_traits and __int_limits helpers work for __int128, via a new __gnu_cxx::__is_integer_nonstrict trait. Hello, is there a particular reason to handle only __int128 this way, and not all the non-standard integer types? It looks like it would be a bit simpler to avoid a special case. -- Marc Glisse
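The key observation in the patch, that an explicit numeric_limits specialization is permitted regardless of whether is_integral holds, can be illustrated with a program-defined type. The type wide_int_t below is a made-up stand-in for an extended type like __int128, not the libstdc++ machinery:

```cpp
#include <cassert>
#include <limits>

// Made-up stand-in for an extended integer type such as __int128.
struct wide_int_t { unsigned long long lo, hi; };

// The standard permits explicit specializations of numeric_limits for
// program-defined types (and, for the implementation itself, for
// implementation-defined types), independently of is_integral.
namespace std {
template<>
struct numeric_limits<wide_int_t>
{
  static constexpr bool is_specialized = true;
  static constexpr bool is_signed = false;
  static constexpr bool is_integer = true;
  static constexpr int digits = 128;
  static constexpr wide_int_t min () noexcept { return {0, 0}; }
  static constexpr wide_int_t max () noexcept { return {~0ull, ~0ull}; }
};
}
```

This is what lets iota_view-style code query numeric_limits of its difference type without caring whether the type is a standard integer.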
Re: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.
On Wed, 12 Aug 2020, Roger Sayle wrote: This patch is inspired by a small code fragment in comment #3 of bugzilla PR rtl-optimization/94804. That snippet appears almost unrelated to the topic of the PR, but recognizing __builtin_bswap64 from two __builtin_bswap32 calls, seems like a clever/useful trick. GCC's optabs.c contains the inverse logic to expand bswap64 by IORing two bswap32 calls, so this transformation/canonicalization is safe, even on targets without suitable optab support. But on x86_64, the swap64 of the test case becomes a single instruction. This patch has been tested on x86_64-pc-linux-gnu with a "make bootstrap" and a "make -k check" with no new failures. Ok for mainline? Your tests seem to assume that int has 32 bits and long 64. + (if (operand_equal_p (@0, @2, 0) Why not reuse @0 instead of introducing @2 in the pattern? Similarly, it may be a bit shorter to reuse @1 instead of a new @3 (I don't think the tricks with @@ will be needed here). + && types_match (TREE_TYPE (@0), uint64_type_node) that seems very specific. What goes wrong with a signed type for instance? +(simplify + (bit_ior:c +(lshift + (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 + INTEGER_CST@1 + (INTEGER_CST@2)) +(convert (BUILT_IN_BSWAP16 (convert (rshift @3 + INTEGER_CST@4) I didn't realize we kept this useless bit_and when casting to a smaller type. We probably get a different pattern on 16-bit targets, but a pattern they do not match won't hurt them. -- Marc Glisse
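The trick from the PR comment can be checked in plain code: bswap64 expressed as the IOR of two bswap32 calls on the halves, mirroring in reverse the optabs expansion mentioned above. A sketch:

```cpp
#include <cassert>
#include <cstdint>

static uint32_t
bswap32 (uint32_t x)
{
  return (x << 24) | ((x & 0xff00u) << 8) | ((x >> 8) & 0xff00u) | (x >> 24);
}

// Reference: reverse all eight bytes one at a time.
static uint64_t
bswap64_ref (uint64_t x)
{
  uint64_t r = 0;
  for (int i = 0; i < 8; i++)
    r = (r << 8) | ((x >> (8 * i)) & 0xff);
  return r;
}

// The idiom the pattern canonicalizes: swap each 32-bit half, then swap
// the halves themselves with the shift and IOR.
static uint64_t
bswap64_from_halves (uint64_t x)
{
  uint64_t lo = bswap32 ((uint32_t) x);
  uint64_t hi = bswap32 ((uint32_t) (x >> 32));
  return (lo << 32) | hi;
}
```

On x86_64 the canonicalized form then expands to a single bswap instruction, which is the point of recognizing it.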
Simplify X * C1 == C2 with wrapping overflow
Odd numbers are invertible in Z / 2^n Z, so X * C1 == C2 can be rewritten as X == C2 * inv(C1) when overflow wraps. mod_inv should probably be updated to better match the other wide_int functions, but that's a separate issue. Bootstrap+regtest on x86_64-pc-linux-gnu. 2020-08-10 Marc Glisse PR tree-optimization/95433 * match.pd (X * C1 == C2): Handle wrapping overflow. * expr.c (maybe_optimize_mod_cmp): Qualify call to mod_inv. (mod_inv): Move... * wide-int.cc (mod_inv): ... here. * wide-int.h (mod_inv): Declare it. * gcc.dg/tree-ssa/pr95433-2.c: New file. -- Marc Glissediff --git a/gcc/expr.c b/gcc/expr.c index a150fa0d3b5..ebf0c9e4797 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -11859,38 +11859,6 @@ string_constant (tree arg, tree *ptr_offset, tree *mem_size, tree *decl) return init; } -/* Compute the modular multiplicative inverse of A modulo M - using extended Euclid's algorithm. Assumes A and M are coprime. */ -static wide_int -mod_inv (const wide_int , const wide_int ) -{ - /* Verify the assumption. 
*/ - gcc_checking_assert (wi::eq_p (wi::gcd (a, b), 1)); - - unsigned int p = a.get_precision () + 1; - gcc_checking_assert (b.get_precision () + 1 == p); - wide_int c = wide_int::from (a, p, UNSIGNED); - wide_int d = wide_int::from (b, p, UNSIGNED); - wide_int x0 = wide_int::from (0, p, UNSIGNED); - wide_int x1 = wide_int::from (1, p, UNSIGNED); - - if (wi::eq_p (b, 1)) -return wide_int::from (1, p, UNSIGNED); - - while (wi::gt_p (c, 1, UNSIGNED)) -{ - wide_int t = d; - wide_int q = wi::divmod_trunc (c, d, UNSIGNED, ); - c = t; - wide_int s = x0; - x0 = wi::sub (x1, wi::mul (q, x0)); - x1 = s; -} - if (wi::lt_p (x1, 0, SIGNED)) -x1 += d; - return x1; -} - /* Optimize x % C1 == C2 for signed modulo if C1 is a power of two and C2 is non-zero and C3 ((1<<(prec-1)) | (C1 - 1)): for C2 > 0 to x & C3 == C2 @@ -12101,7 +12069,7 @@ maybe_optimize_mod_cmp (enum tree_code code, tree *arg0, tree *arg1) w = wi::lrshift (w, shift); wide_int a = wide_int::from (w, prec + 1, UNSIGNED); wide_int b = wi::shifted_mask (prec, 1, false, prec + 1); - wide_int m = wide_int::from (mod_inv (a, b), prec, UNSIGNED); + wide_int m = wide_int::from (wi::mod_inv (a, b), prec, UNSIGNED); tree c3 = wide_int_to_tree (type, m); tree c5 = NULL_TREE; wide_int d, e; diff --git a/gcc/match.pd b/gcc/match.pd index 7e5c5a6eae6..c3b88168ac4 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -3828,7 +3828,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (cmp @0 @2)) /* For integral types with undefined overflow fold - x * C1 == C2 into x == C2 / C1 or false. */ + x * C1 == C2 into x == C2 / C1 or false. + If overflow wraps and C1 is odd, simplify to x == C2 / C1 in the ring + Z / 2^n Z. 
*/ (for cmp (eq ne) (simplify (cmp (mult @0 INTEGER_CST@1) INTEGER_CST@2) @@ -3839,7 +3841,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (wi::multiple_of_p (wi::to_widest (@2), wi::to_widest (@1), TYPE_SIGN (TREE_TYPE (@0)), )) (cmp @0 { wide_int_to_tree (TREE_TYPE (@0), quot); }) - { constant_boolean_node (cmp == NE_EXPR, type); }) + { constant_boolean_node (cmp == NE_EXPR, type); })) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)) + && (wi::bit_and (wi::to_wide (@1), 1) == 1)) +(cmp @0 + { + tree itype = TREE_TYPE (@0); + int p = TYPE_PRECISION (itype); + wide_int m = wi::one (p + 1) << p; + wide_int a = wide_int::from (wi::to_wide (@1), p + 1, UNSIGNED); + wide_int i = wide_int::from (wi::mod_inv (a, m), +p, TYPE_SIGN (itype)); + wide_int_to_tree (itype, wi::mul (i, wi::to_wide (@2))); + }) /* Simplify comparison of something with itself. For IEEE floating-point, we can only do some of these simplifications. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95433-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95433-2.c new file mode 100644 index 000..c830a3d195f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95433-2.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O -fwrapv -fdump-tree-gimple" } */ + +typedef __INT32_TYPE__ int32_t; +typedef unsigned __INT32_TYPE__ uint32_t; + +int e(int32_t x){return 3*x==5;} +int f(int32_t x){return 3*x==-5;} +int g(int32_t x){return -3*x==5;} +int h(int32_t x){return 7*x==3;} +int i(uint32_t x){return 7*x==3;} + +/* { dg-final { scan-tree-dump-times "== 1431655767" 1 "gimple" } } */ +/* { dg-final { scan-tree-dump-times "== -1431655767" 2 "gimple" } } */ +/* { dg-final { scan-tree-dump-times "== 613566757" 2 "gimple" } } */ diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc index 03be0074816..f4d949c38a0 100644 --- a/gcc/wide-int.cc +++ b/gcc/wide-int.cc @@ -2223,6 +2223,39 @@ wi::round_
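The constants expected by the new testcase can be reproduced independently. The sketch below uses the Newton iteration for inverses of odd numbers modulo 2^32, a different algorithm from the extended Euclid in mod_inv, chosen here only because it fits in a few lines:

```cpp
#include <cassert>
#include <cstdint>

// Modular inverse of an odd 32-bit constant a, i.e. x with a*x == 1 (mod 2^32).
static uint32_t
inv_mod_2_32 (uint32_t a)
{
  uint32_t x = a;        // correct to 3 bits, since a*a == 1 (mod 8) for odd a
  for (int i = 0; i < 5; i++)
    x *= 2 - a * x;      // each step doubles the number of correct bits
  return x;
}
```

With it, 3*x == 5 becomes x == inv(3)*5 = 1431655767 and 7*x == 3 becomes x == inv(7)*3 = 613566757, matching the scan-tree-dump expectations in pr95433-2.c.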
Re: Simplify X * C1 == C2 with undefined overflow
On Fri, 7 Aug 2020, Jakub Jelinek wrote: On Fri, Aug 07, 2020 at 10:57:54PM +0200, Marc Glisse wrote: On Fri, 7 Aug 2020, Joern Wolfgang Rennecke wrote: On 07/08/20 19:21, Marc Glisse wrote: If we are going to handle the wrapping case, we shouldn't limit to the non-wrapping meaning of multiplicity. 3*X==5 should become X==1431655767 (for a 32 bit type), etc. Do we have an extended gcd somewhere in gcc, to help compute 1431655767? I don't quite see yet how this relates to gcd, https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm is the most common way to compute the modular multiplicative inverse of a number. For 3 and 2^32, it could tell us that 2863311531*3-2*2^32=1, so modulo 2^32 2863311531*3==1, and 3*X==5 is the same as 2863311531*3*X==2863311531*5, i.e. X==1431655767. wi::gcd ? That's the first place I looked, but this is only the regular Euclid algorithm, not the extended one. It tells you what the gcd is, but does not give you the coefficients of the Bézout identity. For 3 and 2^32, it would tell me the gcd is 1, while I want the number 2863311531 (inverse of 3). Ah, found it, there is mod_inv hidden in expr.c! -- Marc Glisse
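The extended Euclidean algorithm being discussed is short. A standalone sketch, where plain 64-bit arithmetic is wide enough for a 2^32 modulus (mod_inv in expr.c does the same computation with wide_int at precision p+1):

```cpp
#include <cassert>
#include <cstdint>

// Extended Euclid: returns the inverse of a modulo m, assuming gcd(a, m) == 1
// (here: a odd, m a power of two).
static uint64_t
mod_inverse (uint64_t a, uint64_t m)
{
  // Invariant: old_s * a == old_r (mod m) and s * a == r (mod m).
  int64_t old_r = (int64_t) a, r = (int64_t) m;
  int64_t old_s = 1, s = 0;
  while (r != 0)
    {
      int64_t q = old_r / r;
      int64_t tmp = old_r - q * r; old_r = r; r = tmp;
      tmp = old_s - q * s; old_s = s; s = tmp;
    }
  // old_r is now gcd(a, m) == 1; normalize the coefficient into [0, m).
  return (uint64_t) (old_s >= 0 ? old_s : old_s + (int64_t) m);
}
```

For a = 3 and m = 2^32 this yields 2863311531, the inverse used to rewrite 3*X == 5 as X == 1431655767.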
Re: Simplify X * C1 == C2 with undefined overflow
On Fri, 7 Aug 2020, Joern Wolfgang Rennecke wrote: On 07/08/20 19:21, Marc Glisse wrote: If we are going to handle the wrapping case, we shouldn't limit to the non-wrapping meaning of multiplicity. 3*X==5 should become X==1431655767 (for a 32 bit type), etc. Do we have an extended gcd somewhere in gcc, to help compute 1431655767? I don't quite see yet how this relates to gcd, https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm is the most common way to compute the modular multiplicative inverse of a number. For 3 and 2^32, it could tell us that 2863311531*3-2*2^32=1, so modulo 2^32 2863311531*3==1, and 3*X==5 is the same as 2863311531*3*X==2863311531*5, i.e. X==1431655767. However, there are cases were the second division will not be possible without rest. Consider : 7*X == 3 3/7 is 0 rest 3. 0x1 / 7 is 0x24924924 rest 4 Since 3 cannot be represented as an integer multiple of -4, we can conclude that the predicate is always false. 613566757*7-2^32==3 Also: 14*X == 28 has a non-zero power of two in the constant factor, so we have to factor out the power of two (if the right hand side has a lower power of two, the predicate is always false) and consider in modulo arithmetic with a number of bits less by the factored out power of two, i.e. 31 bit here: 7*X == 14 which is solved as above - but in 32 bit arithmetic - to X == 2 and to convert back to 32 bit arithmetic, we get: X & 0x7fff == 2 or 2*x == 4 Yes, we can reduce the size of the numbers a bit, but the gains won't be as nice for even numbers. I think the always-false case is already handled by CCP, tracking which bits can be 0 (I didn't check). -- Marc Glisse
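The even-factor case above can be checked directly: 14 = 2*7, so 14*X == 28 (mod 2^32) reduces to 7*X == 14 (mod 2^31), whose unique solution X == 2 lifts back to exactly two 32-bit solutions. A quick check:

```cpp
#include <cassert>
#include <cstdint>

// 14*X == 28 with wrapping 32-bit arithmetic has the two solutions
// X == 2 and X == 2 + 2^31, matching the power-of-two factoring argument.
static bool
satisfies (uint32_t x)
{
  return (uint32_t) (14u * x) == 28u;
}
```

This is why the gains are smaller for even factors: one comparison against a constant is no longer enough, and a mask (or an un-eliminated multiplication) has to survive.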
Re: Simplify X * C1 == C2 with undefined overflow
On Fri, 7 Aug 2020, Joern Wolfgang Rennecke wrote: this transformation is quite straightforward, without overflow, 3*X==15 is the same as X==5 and 3*X==5 cannot happen. Actually, with binary and decimal computers, this transformation (with these specific numbers) is also valid for wrapping overflow. More generally, it is valid for wrapping overflow if the right hand side of the comparison is divisible without rest by the constant factor, and the constant factor has no sub-factor that is a zero divisor for the ring defined by the wrapping operation. For binary computers, the latter condition can more simply be restated as: The constant factor has to be odd. (For decimal computers, it's: must not be divisible by two and/or five.) If we are going to handle the wrapping case, we shouldn't limit to the non-wrapping meaning of multiplicity. 3*X==5 should become X==1431655767 (for a 32 bit type), etc. Do we have an extended gcd somewhere in gcc, to help compute 1431655767? (Even if the variable factor is wider than the equality comparison, that is not a problem as long as the comparison is not widened by the transformation.) On the other hand, the following generalizations would work only without overflow: - handling of inequality-comparisons - merely have to account for the sign of the factor reversing the sense of the inequality, e.g. -3*X >= 15 ---> X <= -5 And again for this one, we have to be careful how we round the division, but we don't have to limit to the case where 15 is a multiple of 3. 3*X>7 can be replaced with X>2. Those are two nice suggestions. Do you intend to write a patch? Otherwise I'll try to do it eventually (no promise). -- Marc Glisse
Re: FENV_ACCESS status
n't hinder the second too much. But having the strictest version first looked reasonable. There's no such thing currently as effects on memory state only depend on arguments. This reminds me of the initialization of static/thread_local variables in functions, when Jason tried to add an attribute, but I don't think it was ever committed, and the semantics were likely too different. I _think_ we don't have to say the mem out state depends on the mem in state (FP ENV), well - it does, but the difference only depends on the actual arguments. A different rounding mode could cause different exceptions I believe. That said, tracking FENV together with memory will complicate things but explicitly tracking an (or multiple?) extra FP ENV register input/output makes the problem not go away (the second plus still has the mutated FP ENV from the first plus as input). Instead we'd have to separately track the effect of a single operation and the overall FP state, like (_3, flags_5) = .IFN_PLUS (_1, _2, 0); fpexstate = merge (flags_5, fpexstate); (_4, flags_6) = .IFN_PLUS (_1, _2, 0); fpexstate = merge (flags_6, fpexstate); We would have to be careful that lines 2 and 3 cannot be swapped (unless we keep all the merges and key expansion on those and not on the IFN? But we may end up with a use of the sum before the merge). or so and there we can CSE. And I guess we would have a transformation merge(f, merge(f, state)) --> merge(f, state) We have to track exception state separately from the FP control word for rounding-mode for this to work. Thus when we're not interested in the exception state then .IFN_PLUS would be 'pure' (only dependent on the FP CW)? So I guess we should think of somehow separating rounding mode tracking and exception state? If we make the functions affect memory anyway we can have the FP state reg(s) modeled explicitly with a fake decl(s) and pass that by reference to the IFNs? 
Then we can make use of the "fn spec" attribute to tell which function reads/writes which reg. Across unknown functions we'd then have to use the store/load "trick" to merge them with the global memory state though. Splitting the rounding mode from the exceptions certainly makes sense, since they are used quite differently. _3 = .FENV_PLUS (_1, _2, 0, _round, _except) or just _3 = .FENV_PLUS (_1, _2, 1, _round, 0) or _3 = .FENV_PLUS (_1, _2, 2, 0, _except) when we are not interested in everything. with fake global decls for fenv_round and fenv_except (so "unknown" already possibly reads/writes it) and fn specs to say it doesn't look at other memory? I was more thinking of making that implicit, through magic in a couple relevant functions (the value in flags says if the global fenv_round or fenv_except is accessed), as a refinement of just "memory". But IIUC, we would need something that does not use memory at all (not even one variable) if we wanted to avoid the big penalty in alias analysis, etc. If we consider the case without exceptions: round = get_fenv_round() _3 = .FENV_PLUS (_1, _2, opts, round) with .FENV_PLUS "const" and get_fenv_round "pure" (or even reading round from a fake global variable instead of a function call) would be tempting, but it doesn't work, since now .FENV_PLUS can migrate after a later call to fesetround. Even without exceptions we need some protection after, so it may be easier to keep the memory (fenv) read as part of .FENV_PLUS. Also, caring only about rounding doesn't match any standard #pragma, so such an option may see very little use in practice... Sorry for the incoherent brain-dump above ;) It is great to have someone to discuss this with! -- Marc Glisse
Re: VEC_COND_EXPR optimizations v2
On Fri, 7 Aug 2020, Richard Biener wrote: On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse wrote: On Fri, 7 Aug 2020, Richard Biener wrote: On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse wrote: On Thu, 6 Aug 2020, Christophe Lyon wrote: Was I on the right track configuring with --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 then compiling without any special option? Maybe you also need --with-float=hard, I don't remember if it's implied by the 'hf' target suffix Thanks! That's what I was missing to reproduce the issue. Now I can reproduce it with just typedef unsigned int vec __attribute__((vector_size(16))); typedef int vi __attribute__((vector_size(16))); vi f(vec a,vec b){ return a==5 | b==7; } with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 _1 = a_5(D) == { 5, 5, 5, 5 }; _3 = b_6(D) == { 7, 7, 7, 7 }; _9 = _1 | _3; _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); we fail to expand the equality comparison (expand_vec_cmp_expr_p returns false), while with -fdisable-tree-forwprop4 we do manage to expand _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); It doesn't make much sense to me that we can expand the more complicated form and not the simpler form of the same operation (both compare a to 5 and produce a vector of -1 or 0 of the same size), especially when the target has an instruction (vceq) that does just what we want. Introducing boolean vectors was fine, but I think they should be real types, that we can operate on, not be forced to appear only as the first argument of a vcond. I can think of 2 natural ways to improve things: either implement vector comparisons in the ARM backend (possibly by forwarding to their existing code for vcond), or in the generic expansion code try using vcond if the direct comparison opcode is not provided. We can temporarily revert my patch, but I would like it to be temporary. 
Since aarch64 seems to handle the same code just fine, maybe someone who knows arm could copy the relevant code over? Does my message make sense, do people have comments? So what complicates things now (and to some extent pre-existed when you used AVX512 which _could_ operate on boolean vectors) is that we have split out the condition from VEC_COND_EXPR to separate stmts but we do not expect backends to be able to code-generate the separate form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs to .VCOND[U] "merging" the compares again. Now that process breaks down once we have things like _9 = _1 | _3; - at some point I argued that we should handle vector compares [and operations on boolean vectors] as well in ISEL but then when it came up again for some reason I disregarded that again. Thus - we don't want to go back to fixing up the generic expansion code (which looks at one instruction at a time and is restricted by TER single-use restrictions). Here, it would only be handling comparisons left over by ISEL because they do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would be more consistent, but then we may have to teach the vector lowering pass about this. Instead we want to deal with this in ISEL which should behave more intelligently. In the above case it might involve turning the _1 and _3 defs into .VCOND [with different result type], doing _9 in that type and then somehow dealing with _7 ... but this eventually means undoing the match simplification that introduced the code? For targets that do not have compact boolean vectors, VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. Removing it to allow more simplifications makes sense, reintroducing it for expansion can make sense as well, I think it is ok if the second one reverses the first, but very locally, without having to propagate a change of type to the uses. 
So on ARM we could turn _1 and _3 into .VCOND producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or does using bool vectors like that seem bad? (maybe they aren't guaranteed to be equivalent to signed integer vectors with values 0 and -1 and we need to query the target to know if that is the case, or introduce an extra .VCOND) Also, we only want to replace comparisons with .VCOND if the target doesn't handle the comparison directly. For AVX512, we do want to produce SImode bool vectors (for k* registers) and operate on them directly, that's the whole point of introducing the bool vectors (if bool vectors were only used to feed VEC_COND_EXPR and were all turned into .VCOND before expansion, I don't see the point of specifying different types for bool vectors for AVX512 vs non-AVX512, as it would make no difference on what is passed to the backend). I think targets should provide some number of operations, including for instance bit_and and bit_ior on bool vectors, and be a bit consistent about what they provide, it becomes unmanageable in the middle-end otherwise... -- Marc Glisse
Re: FENV_ACCESS status
That said, you're the one doing the work and going with internal functions is reasonable - I'm not sure to what extent optimization for FENV access code will ever be possible (or wanted/expected). So going more precise might not have any advantage. I think some optimizations are expected. For instance, not having to re-read the same number from memory many times just because there was an addition in between (which could write to fenv but that's it). Some may still want FMA (with a consistent rounding direction). For those (like me) who usually only care about rounding and not exceptions, making the operations pure would be great, and nothing says we cannot vectorize those rounded operations! I am trying to be realistic with what I can achieve, but if you think the IFNs would paint us into a corner, then we can drop this approach. You needed to guard SQRT - will you need to guard other math functions? (round, etc.) Maybe, but probably not many. I thought I might have to guard all of them (sin, cos, etc), but IIRC Joseph's comment seemed to imply that this wouldn't be necessary. I am likely missing FMA now... If we need to keep the IFNs use memory state they will count towards walk limits of the alias oracle even if they can be disambiguated against. This will affect both compile-time and optimizations. Yes... + /* Careful not to end up with something like X - X, which could get + simplified. */ + if (!skip0 && already_protected (op1)) we're already relying on RTL not optimizing (x + 0.5) - 0.5 but since that would involve association the simple X - X case might indeed be optimized (but wouldn't that be a bug if it is not correct?) Indeed we do not currently simplify X-X without -ffinite-math-only. However, I am trying to be safe, and whether we can simplify or not is something that depends on each operation (what the pragma said at that point in the source code), while flag_finite_math_only is at best per function. -- Marc Glisse
Re: VEC_COND_EXPR optimizations v2
On Fri, 7 Aug 2020, Richard Biener wrote: On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse wrote: On Thu, 6 Aug 2020, Christophe Lyon wrote: Was I on the right track configuring with --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 then compiling without any special option? Maybe you also need --with-float=hard, I don't remember if it's implied by the 'hf' target suffix Thanks! That's what I was missing to reproduce the issue. Now I can reproduce it with just typedef unsigned int vec __attribute__((vector_size(16))); typedef int vi __attribute__((vector_size(16))); vi f(vec a,vec b){ return a==5 | b==7; } with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 _1 = a_5(D) == { 5, 5, 5, 5 }; _3 = b_6(D) == { 7, 7, 7, 7 }; _9 = _1 | _3; _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); we fail to expand the equality comparison (expand_vec_cmp_expr_p returns false), while with -fdisable-tree-forwprop4 we do manage to expand _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); It doesn't make much sense to me that we can expand the more complicated form and not the simpler form of the same operation (both compare a to 5 and produce a vector of -1 or 0 of the same size), especially when the target has an instruction (vceq) that does just what we want. Introducing boolean vectors was fine, but I think they should be real types, that we can operate on, not be forced to appear only as the first argument of a vcond. I can think of 2 natural ways to improve things: either implement vector comparisons in the ARM backend (possibly by forwarding to their existing code for vcond), or in the generic expansion code try using vcond if the direct comparison opcode is not provided. We can temporarily revert my patch, but I would like it to be temporary. Since aarch64 seems to handle the same code just fine, maybe someone who knows arm could copy the relevant code over? 
Does my message make sense, do people have comments? So what complicates things now (and to some extent pre-existed when you used AVX512 which _could_ operate on boolean vectors) is that we have split out the condition from VEC_COND_EXPR to separate stmts but we do not expect backends to be able to code-generate the separate form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs to .VCOND[U] "merging" the compares again. Now that process breaks down once we have things like _9 = _1 | _3; - at some point I argued that we should handle vector compares [and operations on boolean vectors] as well in ISEL but then when it came up again for some reason I disregarded that again. Thus - we don't want to go back to fixing up the generic expansion code (which looks at one instruction at a time and is restricted by TER single-use restrictions). Here, it would only be handling comparisons left over by ISEL because they do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would be more consistent, but then we may have to teach the vector lowering pass about this. Instead we want to deal with this in ISEL which should behave more intelligently. In the above case it might involve turning the _1 and _3 defs into .VCOND [with different result type], doing _9 in that type and then somehow dealing with _7 ... but this eventually means undoing the match simplification that introduced the code? For targets that do not have compact boolean vectors, VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. Removing it to allow more simplifications makes sense, reintroducing it for expansion can make sense as well, I think it is ok if the second one reverses the first, but very locally, without having to propagate a change of type to the uses. So on ARM we could turn _1 and _3 into .VCOND producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or does using bool vectors like that seem bad? 
(maybe they aren't guaranteed to be equivalent to signed integer vectors with values 0 and -1 and we need to query the target to know if that is the case, or introduce an extra .VCOND) Also, we only want to replace comparisons with .VCOND if the target doesn't handle the comparison directly. For AVX512, we do want to produce SImode bool vectors (for k* registers) and operate on them directly, that's the whole point of introducing the bool vectors (if bool vectors were only used to feed VEC_COND_EXPR and were all turned into .VCOND before expansion, I don't see the point of specifying different types for bool vectors for AVX512 vs non-AVX512, as it would make no difference on what is passed to the backend). I think targets should provide some number of operations, including for instance bit_and and bit_ior on bool vectors, and be a bit consistent about what they provide, it becomes unmanageable in the middle-end otherwise... -- Marc Glisse
Re: VEC_COND_EXPR optimizations v2
On Thu, 6 Aug 2020, Christophe Lyon wrote: Was I on the right track configuring with --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 then compiling without any special option? Maybe you also need --with-float=hard, I don't remember if it's implied by the 'hf' target suffix Thanks! That's what I was missing to reproduce the issue. Now I can reproduce it with just typedef unsigned int vec __attribute__((vector_size(16))); typedef int vi __attribute__((vector_size(16))); vi f(vec a,vec b){ return a==5 | b==7; } with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 -O1 _1 = a_5(D) == { 5, 5, 5, 5 }; _3 = b_6(D) == { 7, 7, 7, 7 }; _9 = _1 | _3; _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107); we fail to expand the equality comparison (expand_vec_cmp_expr_p returns false), while with -fdisable-tree-forwprop4 we do manage to expand _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 112); It doesn't make much sense to me that we can expand the more complicated form and not the simpler form of the same operation (both compare a to 5 and produce a vector of -1 or 0 of the same size), especially when the target has an instruction (vceq) that does just what we want. Introducing boolean vectors was fine, but I think they should be real types, that we can operate on, not be forced to appear only as the first argument of a vcond. I can think of 2 natural ways to improve things: either implement vector comparisons in the ARM backend (possibly by forwarding to their existing code for vcond), or in the generic expansion code try using vcond if the direct comparison opcode is not provided. We can temporarily revert my patch, but I would like it to be temporary. Since aarch64 seems to handle the same code just fine, maybe someone who knows arm could copy the relevant code over? Does my message make sense, do people have comments? -- Marc Glisse
Re: VEC_COND_EXPR optimizations v2
On Thu, 6 Aug 2020, Christophe Lyon wrote: On Thu, 6 Aug 2020 at 11:06, Marc Glisse wrote: On Thu, 6 Aug 2020, Christophe Lyon wrote: 2020-08-05 Marc Glisse PR tree-optimization/95906 PR target/70314 * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. (op (c ? a : b)): Update to match the new transformations. * gcc.dg/tree-ssa/andnot-2.c: New file. * gcc.dg/tree-ssa/pr95906.c: Likewise. * gcc.target/i386/pr70314.c: Likewise. I think this patch is causing several ICEs on arm-none-linux-gnueabihf --with-cpu cortex-a9 --with-fpu neon-fp16: Executed from: gcc.c-torture/compile/compile.exp gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error) gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) Executed from: gcc.dg/dg.exp gcc.dg/pr87746.c (internal compiler error) Executed from: gcc.dg/tree-ssa/tree-ssa.exp gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) I tried a cross from x86_64-linux with current master .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16 make it stops at some point with an error, but I have xgcc and cc1 in build/gcc. I copied 2 of the testcases and compiled ./xgcc pr87746.c -Ofast -S -B. ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B. without getting any ICE. Sorry for the delay, I had to reproduce the problem manually. Is there a machine on the compile farm where this is easy to reproduce? I don't think there is any arm machine in the compile farm. Or could you attach the .optimized dump that corresponds to the backtrace below? It looks like we end up with a comparison with an unexpected return type. I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details, here is the log. Is that what you need? Thanks. 
The one from -fdump-tree-optimized would be closer to the ICE. Though it would also be convenient to know which stmt is being expanded when we ICE, etc. Was I on the right track configuring with --target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 --with-fpu=neon-fp16 then compiling without any special option? Thanks, Christophe Executed from: gcc.dg/vect/vect.exp gcc.dg/vect/pr59591-1.c (internal compiler error) gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/pr86927.c (internal compiler error) gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/slp-cond-5.c (internal compiler error) gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-23.c (internal compiler error) gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-24.c (internal compiler error) gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal compiler error) Backtrace for gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions during RTL pass: expand /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal compiler error: in do_store_flag, at expr.c:12259 0x8feb26 do_store_flag /gcc/expr.c:12259 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9617 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) 
/gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e
Re: VEC_COND_EXPR optimizations v2
On Thu, 6 Aug 2020, Richard Biener wrote: On Thu, Aug 6, 2020 at 10:17 AM Christophe Lyon wrote: Hi, On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches wrote: On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse wrote: New version that passed bootstrap+regtest during the night. When vector comparisons were forced to use vec_cond_expr, we lost a number of optimizations (my fault for not adding enough testcases to prevent that). This patch tries to unwrap vec_cond_expr a bit so some optimizations can still happen. I wasn't planning to add all those transformations together, but adding one caused a regression, whose fix introduced a second regression, etc. Restricting to constant folding would not be sufficient, we also need at least things like X|0 or X The transformations are quite conservative with :s and folding only if everything simplifies, we may want to relax this later. And of course we are going to miss things like a?b:c + a?c:b -> b+c. In terms of number of operations, some transformations turning 2 VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look like a gain... I expect the bit_not disappears in most cases, and VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR. I am a bit confused that with avx512 we get types like "vector(4) <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not -1?), but that doesn't matter for this patch. OK. Thanks, Richard. 2020-08-05 Marc Glisse PR tree-optimization/95906 PR target/70314 * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. (op (c ? a : b)): Update to match the new transformations. * gcc.dg/tree-ssa/andnot-2.c: New file. * gcc.dg/tree-ssa/pr95906.c: Likewise. * gcc.target/i386/pr70314.c: Likewise. 
I think this patch is causing several ICEs on arm-none-linux-gnueabihf --with-cpu cortex-a9 --with-fpu neon-fp16: Executed from: gcc.c-torture/compile/compile.exp gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error) gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) Executed from: gcc.dg/dg.exp gcc.dg/pr87746.c (internal compiler error) Executed from: gcc.dg/tree-ssa/tree-ssa.exp gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) Executed from: gcc.dg/vect/vect.exp gcc.dg/vect/pr59591-1.c (internal compiler error) gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/pr86927.c (internal compiler error) gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/slp-cond-5.c (internal compiler error) gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-23.c (internal compiler error) gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-24.c (internal compiler error) gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error) gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error) gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal compiler error) Backtrace for gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions during RTL pass: expand /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal compiler error: in do_store_flag, at expr.c:12259 0x8feb26 do_store_flag /gcc/expr.c:12259 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9617 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) 
/gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool) /gcc/expr.c:10159 0x91174e expand_expr /gcc/expr.h:282 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier) /gcc/expr.c:8065 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier) /gcc/expr.c:9950 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, ex
Re: VEC_COND_EXPR optimizations v2
On Thu, 6 Aug 2020, Christophe Lyon wrote: 2020-08-05 Marc Glisse PR tree-optimization/95906 PR target/70314 * match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e), (v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations. (op (c ? a : b)): Update to match the new transformations. * gcc.dg/tree-ssa/andnot-2.c: New file. * gcc.dg/tree-ssa/pr95906.c: Likewise. * gcc.target/i386/pr70314.c: Likewise. I think this patch is causing several ICEs on arm-none-linux-gnueabihf --with-cpu cortex-a9 --with-fpu neon-fp16: Executed from: gcc.c-torture/compile/compile.exp gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error) gcc.c-torture/compile/20160205-1.c -O3 -g (internal compiler error) Executed from: gcc.dg/dg.exp gcc.dg/pr87746.c (internal compiler error) Executed from: gcc.dg/tree-ssa/tree-ssa.exp gcc.dg/tree-ssa/ifc-cd.c (internal compiler error) I tried a cross from x86_64-linux with current master .../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ --with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16 make it stops at some point with an error, but I have xgcc and cc1 in build/gcc. I copied 2 of the testcases and compiled ./xgcc pr87746.c -Ofast -S -B. ./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B. without getting any ICE. Is there a machine on the compile farm where this is easy to reproduce? Or could you attach the .optimized dump that corresponds to the backtrace below? It looks like we end up with a comparison with an unexpected return type. 
> Executed from: gcc.dg/vect/vect.exp
>   gcc.dg/vect/pr59591-1.c (internal compiler error)
>   gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
>   gcc.dg/vect/pr86927.c (internal compiler error)
>   gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
>   gcc.dg/vect/slp-cond-5.c (internal compiler error)
>   gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
>   gcc.dg/vect/vect-23.c (internal compiler error)
>   gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
>   gcc.dg/vect/vect-24.c (internal compiler error)
>   gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
>   gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
>   gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal compiler error)
>
> Backtrace for gcc.c-torture/compile/20160205-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions
> during RTL pass: expand
> /gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal compiler error: in do_store_flag, at expr.c:12259
> 0x8feb26 do_store_flag
> 	/gcc/expr.c:12259
> 0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier)
> 	/gcc/expr.c:9617
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
> 	/gcc/expr.c:10159
> 0x91174e expand_expr
> 	/gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier)
> 	/gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier)
> 	/gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
> 	/gcc/expr.c:10159
> 0x91174e expand_expr
> 	/gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier)
> 	/gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier)
> 	/gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
> 	/gcc/expr.c:10159
> 0x91174e expand_expr
> 	/gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier)
> 	/gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier)
> 	/gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
> 	/gcc/expr.c:10159
> 0x91174e expand_expr
> 	/gcc/expr.h:282
> 0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**, rtx_def**, expand_modifier)
> 	/gcc/expr.c:8065
> 0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode, expand_modifier)
> 	/gcc/expr.c:9950
> 0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode, expand_modifier, rtx_def**, bool)
> 	/gcc/expr.c:10159
> 0x91174e expand_expr
> 	/gcc/expr.h:282
>
> Christophe

-- 
Marc Glisse
FENV_ACCESS status
Hello,

I updated the patch discussed in https://patchwork.ozlabs.org/project/gcc/patch/alpine.deb.2.02.1906221743430.16...@grove.saclay.inria.fr/ and pushed it as something like refs/users/glisse/heads/fenv (first user branch in gcc's git, I hope it worked). I am also attaching the diff here. I managed to compile and run real-world code with it, which is a good sign :-)

As should be obvious looking at the diff, there is a lot of work left. Experts may also find much better ways to rewrite several parts of the patch.

The option is called -ffenv-access so it doesn't interfere with -frounding-math, at least until we have something good enough to replace -frounding-math without too much performance regression. I switched to hex float constants for DBL_MAX and others for C99+, I don't care about making fenv_access work in prehistoric modes. On the other hand, since I haven't started on fenv_round, this is probably useless for now. Several changes, in particular the constexpr stuff, were needed to parse standard headers, otherwise I would have delayed the implementation.

Currently the floating point environment is represented by "memory" in general. Refining it so the compiler knows that storing a float does not change the rounding mode (for instance) should wait, in my opinion.

Conversions look like

  .FENV_CONVERT (arg, (target_type*)0, 0)

The pointer is there so we know the target type, even if the lhs disappears at some point. The last 0 is the same as for all the others, a place to store options about the operation (do we care about rounding, about exceptions, etc), it is just a placeholder for now. I could rename it to .FENV_NOP since we seem to generate NOP usually, but it looked strange to me.

I did not do anything for templates in C++. As long as we have a constant global flag, it doesn't matter, but as soon as we have a pragma, things will get messy, we will need to remember what the mode was when parsing, so we can use it at instantiation time...
(or just declare that the pragma doesn't work with templates in a first version)

I am trying to have enough infrastructure in place so that the functionality is useful, and also so that implementing other pieces (parsing the pragma, C front-end, gimple optimizations, target hook for the asm string, opcode and target optimization, simd, etc) becomes independent and can be done by different people. It is unlikely that I can find the time to do everything. If other people want to contribute or even take over (assuming the branch does not look hopelessly bad to them), that would be great! That's also why I pushed it as a branch.

Apart from the obvious (making sure it bootstraps, running the testsuite, adding a few tests), what missing pieces do you consider a strict requirement for this to have a chance to reach master one day as an experimental option?

-- 
Marc Glisse

commit 4adb494e88323bf41ee2c0871caa2323fa2aca06
Author: Marc Glisse
Date:   Wed Aug 5 18:49:57 2020 +0200

    Introduce -ffenv-access

diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 74ecca8de8e..4c4d4f3d563 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -1635,16 +1635,23 @@ lazy_hex_fp_value (cpp_reader *, cpp_macro *macro, unsigned num)
 {
   REAL_VALUE_TYPE real;
   char dec_str[64], buf1[256];
+  size_t len;
   gcc_checking_assert (num < lazy_hex_fp_value_count);
-  real_from_string (&real, lazy_hex_fp_values[num].hex_str);
-  real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str),
-                            lazy_hex_fp_values[num].digits, 0,
-                            lazy_hex_fp_values[num].mode);
+  if (!flag_isoc99)
+    {
+      real_from_string (&real, lazy_hex_fp_values[num].hex_str);
+      real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str),
+                                lazy_hex_fp_values[num].digits, 0,
+                                lazy_hex_fp_values[num].mode);
+
+      len = sprintf (buf1, "%s%s", dec_str, lazy_hex_fp_values[num].fp_suffix);
+    }
+  else
+    len = sprintf (buf1, "%s%s", lazy_hex_fp_values[num].hex_str,
+                   lazy_hex_fp_values[num].fp_suffix);
-  size_t len
-    = sprintf (buf1,
"%s%s", dec_str, lazy_hex_fp_values[num].fp_suffix);
   gcc_assert (len < sizeof (buf1));
   for (unsigned idx = 0; idx < macro->count; idx++)
     if (macro->exp.tokens[idx].type == CPP_NUMBER)
@@ -1701,13 +1708,16 @@ builtin_define_with_hex_fp_value (const char *macro,
      it's easy to get the exact correct value), parse it as a real, then
      print it back out as decimal.  */
-  real_from_string (&real, hex_str);
-  real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str), digits, 0,
-                            TYPE_MODE (type));
+  if (!flag_isoc99)
+    {
+      real_from_string (&real, hex_str);
+      real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str), digits, 0,
+                                TYPE_MODE (type));
+    }
   /* Assemble the macro in the following fashion
      macro = fp_cast [dec_str fp_suffix] */
-  sprintf (buf2, "%s%s&
VEC_COND_EXPR optimizations v2
New version that passed bootstrap+regtest during the night.

When vector comparisons were forced to use vec_cond_expr, we lost a number of optimizations (my fault for not adding enough testcases to prevent that). This patch tries to unwrap vec_cond_expr a bit so some optimizations can still happen. I wasn't planning to add all those transformations together, but adding one caused a regression, whose fix introduced a second regression, etc. Restricting to constant folding would not be sufficient, we also need at least things like X|0 or X

The transformations are quite conservative with :s and folding only if everything simplifies, we may want to relax this later. And of course we are going to miss things like a?b:c + a?c:b -> b+c. In terms of number of operations, some transformations turning 2 VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look like a gain... I expect the bit_not disappears in most cases, and VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.

I am a bit confused that with avx512 we get types like "vector(4) <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not -1?), but that doesn't matter for this patch.

2020-08-05  Marc Glisse

	PR tree-optimization/95906
	PR target/70314
	* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
	(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
	(op (c ? a : b)): Update to match the new transformations.
	* gcc.dg/tree-ssa/andnot-2.c: New file.
	* gcc.dg/tree-ssa/pr95906.c: Likewise.
	* gcc.target/i386/pr70314.c: Likewise.

-- 
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd index a052c9e3dbc..f9297fcadbe 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -3436,20 +3436,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (integer_zerop (@0)) @2))) -/* Sink unary operations to constant branches, but only if we do fold it to - constants. */ +#if GIMPLE +/* Sink unary operations to branches, but only if we do fold both.
*/ (for op (negate bit_not abs absu) (simplify - (op (vec_cond @0 VECTOR_CST@1 VECTOR_CST@2)) - (with - { - tree cst1, cst2; - cst1 = const_unop (op, type, @1); - if (cst1) - cst2 = const_unop (op, type, @2); - } - (if (cst1 && cst2) -(vec_cond @0 { cst1; } { cst2; }) + (op (vec_cond:s @0 @1 @2)) + (vec_cond @0 (op! @1) (op! @2 + +/* Sink binary operation to branches, but only if we can fold it. */ +(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor + rdiv trunc_div ceil_div floor_div round_div + trunc_mod ceil_mod floor_mod round_mod min max) +/* (c ? a : b) op (c ? d : e) --> c ? (a op d) : (b op e) */ + (simplify + (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4)) + (vec_cond @0 (op! @1 @3) (op! @2 @4))) + +/* (c ? a : b) op d --> c ? (a op d) : (b op d) */ + (simplify + (op (vec_cond:s @0 @1 @2) @3) + (vec_cond @0 (op! @1 @3) (op! @2 @3))) + (simplify + (op @3 (vec_cond:s @0 @1 @2)) + (vec_cond @0 (op! @3 @1) (op! @3 @2 +#endif + +/* (v ? w : 0) ? a : b is just (v & w) ? a : b */ +(simplify + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_and @0 @3) @1 @2))) +(simplify + (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_ior @0 @3) @1 @2))) +(simplify + (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_ior @0 (bit_not @3)) @2 @1))) +(simplify + (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2) + (if (types_match (@0, @3)) + (vec_cond (bit_and @0 (bit_not @3)) @2 @1))) + +/* c1 ? c2 ? a : b : b --> (c1 & c2) ? 
a : b */ +(simplify + (vec_cond @0 (vec_cond:s @1 @2 @3) @3) + (if (types_match (@0, @1)) + (vec_cond (bit_and @0 @1) @2 @3))) +(simplify + (vec_cond @0 @2 (vec_cond:s @1 @2 @3)) + (if (types_match (@0, @1)) + (vec_cond (bit_ior @0 @1) @2 @3))) +(simplify + (vec_cond @0 (vec_cond:s @1 @2 @3) @2) + (if (types_match (@0, @1)) + (vec_cond (bit_ior (bit_not @0) @1) @2 @3))) +(simplify + (vec_cond @0 @3 (vec_cond:s @1 @2 @3)) + (if (types_match (@0, @1)) + (vec_cond (bit_and (bit_not @0) @1) @2 @3))) /* Simplification moved from fold_cond_expr_with_comparison. It may also be extended. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c new file mode 100644 index 000..e0955ce3ffd --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */ + +typedef long vec __attribute__((vector_size(16))); +vec f(vec x){ + vec y = x < 10; + return y & (y == 0); +} + +/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c new file m
Re: [PATCH] Amend match.pd syntax with force-simplified results
On Fri, 31 Jul 2020, Richard Biener wrote:

> This adds a ! marker to result expressions that should simplify (and if not fail the simplification). This can for example be used like
>
>  (simplify
>   (plus (vec_cond:s @0 @1 @2) @3)
>   (vec_cond @0 (plus! @1 @3) (plus! @2 @3)))
>
> to make the simplification only apply in case both plus operations in the result end up simplified to a simple operand.

(replacing plus with bit_ior)

The generated code in gimple_simplify_BIT_IOR_EXPR may look like

  {
    tree _o1[2], _r1;
    _o1[0] = captures[2];
    _o1[1] = captures[4];
    gimple_match_op tem_op (res_op->cond.any_else (), BIT_IOR_EXPR,
                            TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
    tem_op.resimplify (lseq, valueize);
    _r1 = maybe_push_res_to_seq (&tem_op, NULL);
    if (!_r1) return false;
    res_op->ops[1] = _r1;
  }

In particular, it contains this "return false" which directly exits the function, instead of just giving up on this particular transformation and trying the next one. I'll reorder my transformations to work around this, but it looks like a pre-existing limitation.

-- 
Marc Glisse
Re: Simplify X * C1 == C2 with undefined overflow
On Mon, 3 Aug 2020, Richard Biener wrote: On Sat, Aug 1, 2020 at 9:29 AM Marc Glisse wrote: Hello, this transformation is quite straightforward, without overflow, 3*X==15 is the same as X==5 and 3*X==5 cannot happen. Adding a single_use restriction for the first case didn't seem necessary, although of course it can slightly increase register pressure in some cases. Bootstrap+regtest on x86_64-pc-linux-gnu. OK with using constant_boolean_node (cmp == NE_EXPR, type). ISTR we had the x * 0 == CST simplification somewhere but maybe it was x * y == 0 ... ah, yes: /* Transform comparisons of the form X * C1 CMP 0 to X CMP 0 in the signed arithmetic case. That form is created by the compiler often enough for folding it to be of value. One example is in computing loop trip counts after Operator Strength Reduction. */ (for cmp (simple_comparison) scmp (swapped_simple_comparison) (simplify (cmp (mult@3 @0 INTEGER_CST@1) integer_zerop@2) As it is placed after your pattern it will be never matched I think (but we don't warn because of INTEGER_CST vs. integer_zerop). But I think your pattern subsumes it besides of the X * 0 == 0 compare - oh, and the other pattern also handles relational compares (those will still trigger). Maybe place the patterns next to each other? Also see whether moving yours after the above will cease the testcases to be handled because it's no longer matched - if not this might be the better order. I moved it after, it still works, so I pushed the patch. Note that the other transformation has a single_use restriction, while this one doesn't, that's not very consistent, but also hopefully not so important... -- Marc Glisse
Simplify X * C1 == C2 with undefined overflow
Hello,

this transformation is quite straightforward: without overflow, 3*X==15 is the same as X==5, and 3*X==5 cannot happen. Adding a single_use restriction for the first case didn't seem necessary, although of course it can slightly increase register pressure in some cases.

Bootstrap+regtest on x86_64-pc-linux-gnu.

2020-08-03  Marc Glisse

	PR tree-optimization/95433
	* match.pd (X * C1 == C2): New transformation.
	* gcc.c-torture/execute/pr23135.c: Add -fwrapv to avoid undefined behavior.
	* gcc.dg/tree-ssa/pr95433.c: New file.

-- 
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd index a052c9e3dbc..78fd8cf5d9e 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1578,6 +1578,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && wi::neg_p (wi::to_wide (@1), TYPE_SIGN (TREE_TYPE (@1)))) (cmp @2 @0)) +/* For integral types with undefined overflow fold + x * C1 == C2 into x == C2 / C1 or false. */ +(for cmp (eq ne) + (simplify + (cmp (mult @0 INTEGER_CST@1) INTEGER_CST@2) + (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) + && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)) + && wi::to_wide (@1) != 0) + (with { widest_int quot; } +(if (wi::multiple_of_p (wi::to_widest (@2), wi::to_widest (@1), + TYPE_SIGN (TREE_TYPE (@0)), &quot)) + (cmp @0 { wide_int_to_tree (TREE_TYPE (@0), quot); }) + { build_int_cst (type, cmp == NE_EXPR); }) + /* (X - 1U) <= INT_MAX-1U into (int) X > 0. */ (for cmp (le gt) icmp (gt le) diff --git a/gcc/testsuite/gcc.c-torture/execute/pr23135.c b/gcc/testsuite/gcc.c-torture/execute/pr23135.c index e740ff52874..ef9b7efc9c4 100644 --- a/gcc/testsuite/gcc.c-torture/execute/pr23135.c +++ b/gcc/testsuite/gcc.c-torture/execute/pr23135.c @@ -1,7 +1,7 @@ /* Based on execute/simd-1.c, modified by joern.renne...@st.com to trigger a reload bug. Verified for gcc mainline from 20050722 13:00 UTC for sh-elf -m4 -O2.
*/ -/* { dg-options "-Wno-psabi" } */ +/* { dg-options "-Wno-psabi -fwrapv" } */ /* { dg-add-options stack_size } */ #ifndef STACK_SIZE diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95433.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95433.c new file mode 100644 index 000..4e161ee26cc --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95433.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-O -fdump-tree-optimized" } */ + +int f(int x){return x*7==17;} +int g(int x){return x*3==15;} + +/* { dg-final { scan-tree-dump "return 0;" "optimized" } } */ +/* { dg-final { scan-tree-dump "== 5;" "optimized" } } */
Re: [PATCH] Amend match.pd syntax with force-simplified results
On Fri, 31 Jul 2020, Richard Biener wrote: Or we simply automatically disable those patterns for GENERIC (though that would probably be unexpected). Since the definition is not clear, whatever we do will be unexpected at least in some cases. Disabling it for GENERIC for now seems ok to me, we can always extend it later... -- Marc Glisse
Re: [PATCH] Amend match.pd syntax with force-simplified results
On Fri, 31 Jul 2020, Richard Biener wrote: This adds a ! marker to result expressions that should simplify (and if not fail the simplification). This can for example be used like (simplify (plus (vec_cond:s @0 @1 @2) @3) (vec_cond @0 (plus! @1 @3) (plus! @2 @3))) to make the simplification only apply in case both plus operations in the result end up simplified to a simple operand. Thanks. The ! syntax seems fine. If we run out, we can always introduce an attribute-like syntax: (plus[force_leaf] @1 @2), but we aren't there yet. And either this feature will see a lot of use and deserve its short syntax, or it won't and it will be easy to reclaim '!' for something else. I am wondering about GENERIC. IIUC, '!' is ignored for GENERIC. Should we protect some/most uses of this syntax with #if GIMPLE? In my case, I think resimplify might simply return non-0 because it swapped the operands, which should not be sufficient to enable the transformation. -- Marc Glisse
Re: VEC_COND_EXPR optimizations
On Fri, 31 Jul 2020, Marc Glisse wrote: On Fri, 31 Jul 2020, Richard Sandiford wrote: Marc Glisse writes: On Fri, 31 Jul 2020, Richard Sandiford wrote: Marc Glisse writes: +/* (c ? a : b) op (c ? d : e) --> c ? (a op d) : (b op e) */ + (simplify + (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4)) + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @1, @3); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @2, @4); + } + (if (rhs2 && is_gimple_val (rhs2)) +(vec_cond @0 { rhs1; } { rhs2; }) +#endif This one looks dangerous for potentially-trapping ops. I would expect the operation not to be folded if it can trap. Is that too optimistic? Not sure TBH. I was thinking of “trapping” in the sense of raising an IEEE exception, rather than in the could-throw/must-end-bb sense. That's what I understood from your message :-) I thought match.pd applied to things like FP addition as normal and it was up to individual patterns to check the appropriate properties. Yes, and in this case I am delegating that to fold_binary, which already performs this check. I tried with this C++ program typedef double vecf __attribute__((vector_size(16))); typedef long long veci __attribute__((vector_size(16))); vecf f(veci c){ return (c?1.:2.)/(c?3.:7.); } the folding happens by default, but not with -frounding-math, which seems like exactly what we want. That was for rounding. -ftrapping-math does not disable folding of typedef double vecf __attribute__((vector_size(16))); typedef long long veci __attribute__((vector_size(16))); vecf f(veci c){ vecf z={}; return (z+1)/(z+3); } despite a possible inexact flag, so it doesn't disable vec_cond_expr folding either. (not sure we want to fix this unless -fno-trapping-math becomes the default) -- Marc Glisse
Re: VEC_COND_EXPR optimizations
On Fri, 31 Jul 2020, Richard Sandiford wrote: Marc Glisse writes: On Fri, 31 Jul 2020, Richard Sandiford wrote: Marc Glisse writes: +/* (c ? a : b) op (c ? d : e) --> c ? (a op d) : (b op e) */ + (simplify + (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4)) + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @1, @3); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @2, @4); + } + (if (rhs2 && is_gimple_val (rhs2)) +(vec_cond @0 { rhs1; } { rhs2; }) +#endif This one looks dangerous for potentially-trapping ops. I would expect the operation not to be folded if it can trap. Is that too optimistic? Not sure TBH. I was thinking of “trapping” in the sense of raising an IEEE exception, rather than in the could-throw/must-end-bb sense. That's what I understood from your message :-) I thought match.pd applied to things like FP addition as normal and it was up to individual patterns to check the appropriate properties. Yes, and in this case I am delegating that to fold_binary, which already performs this check. I tried with this C++ program typedef double vecf __attribute__((vector_size(16))); typedef long long veci __attribute__((vector_size(16))); vecf f(veci c){ return (c?1.:2.)/(c?3.:7.); } the folding happens by default, but not with -frounding-math, which seems like exactly what we want. -- Marc Glisse
Re: VEC_COND_EXPR optimizations
On Fri, 31 Jul 2020, Richard Biener wrote: On Fri, Jul 31, 2020 at 1:39 PM Richard Biener wrote: On Fri, Jul 31, 2020 at 1:35 PM Richard Biener wrote: On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse wrote: When vector comparisons were forced to use vec_cond_expr, we lost a number of optimizations (my fault for not adding enough testcases to prevent that). This patch tries to unwrap vec_cond_expr a bit so some optimizations can still happen. I wasn't planning to add all those transformations together, but adding one caused a regression, whose fix introduced a second regression, etc. Using a simple fold_binary internally looks like an ok compromise to me. It remains cheap enough (not recursive, and vector instructions are not that frequent), while still allowing more than const_binop (X|0 or X for instance). The transformations are quite conservative with :s and folding only if everything simplifies, we may want to relax this later. And of course we are going to miss things like a?b:c + a?c:b -> b+c. In terms of number of operations, some transformations turning 2 VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look like a gain... I expect the bit_not disappears in most cases, and VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR. I am a bit confused that with avx512 we get types like "vector(4) " with :2 and not :1 (is it a hack so true is 1 and not -1?), but that doesn't matter for this patch. Regtest+bootstrap on x86_64-pc-linux-gnu + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @1, @3); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @2, @3); ICK. I guess a more match-and-simplify way would be I was focused on avoiding recursion, but indeed that's independent, I could have set a trivial valueize function for that. 
> (with
>   {
>     tree rhs1, rhs2;
>     gimple_match_op op (gimple_match_cond::UNCOND, op, type, @1, @3);
>     if (op.resimplify (NULL, valueize)

Oh, so you would be ok with something that recurses without limit? I thought we were avoiding this because it may allow some compile time explosion with pathological examples.

>         && gimple_simplified_result_is_gimple_val (op))
>       { rhs1 = op.ops[0]; ... other operand ... }
>
> now in theory we could invent some new syntax for this, like
>
>  (simplify
>   (op (vec_cond:s @0 @1 @2) @3)
>   (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))
>
> and pick something better instead of :x (:s is taken, would be 'simplified', :c is taken would be 'constexpr', ...). _Maybe_ just
>
>  (simplify
>   (op (vec_cond:s @0 @1 @2) @3)
>   (vec_cond:x @0 (op @1 @3) (op @2 @3)))
>
> which would have the same practical meaning as passing NULL for the seq argument to simplification - do not allow any intermediate stmt to be generated. Note I specifically do not like those if (it-simplifies) checks because we already would code-generate those anyway. For
>
>  (simplify
>   (plus (vec_cond:s @0 @1 @2) @3)
>   (vec_cond @0 (plus @1 @3) (plus @2 @3)))
>
> we get
>
>   res_op->set_op (VEC_COND_EXPR, type, 3);
>   res_op->ops[0] = captures[1];
>   res_op->ops[0] = unshare_expr (res_op->ops[0]);
>   {
>     tree _o1[2], _r1;
>     _o1[0] = captures[2];
>     _o1[1] = captures[4];
>     gimple_match_op tem_op (res_op->cond.any_else (), PLUS_EXPR,
>                             TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
>     tem_op.resimplify (lseq, valueize);
>     _r1 = maybe_push_res_to_seq (&tem_op, lseq); ()
>     if (!_r1) return false;
>     res_op->ops[1] = _r1;
>   }
>   {
>     tree _o1[2], _r1;
>     _o1[0] = captures[3];
>     _o1[1] = captures[4];
>     gimple_match_op tem_op (res_op->cond.any_else (), PLUS_EXPR,
>                             TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
>     tem_op.resimplify (lseq, valueize);
>     _r1 = maybe_push_res_to_seq (&tem_op, lseq); (***)
>     if (!_r1) return false;
>     res_op->ops[2] = _r1;
>   }
>   res_op->resimplify (lseq, valueize);
>   return true;
>
> and the only change required would be to pass NULL to maybe_push_res_to_seq here instead of lseq at the (***) marked points.
(simplify (plus (vec_cond:s @0 @1 @2) @3) (vec_cond:l @0 (plus @1 @3) (plus @2 @3))) 'l' for 'force leaf'. I'll see if I can quickly come up with a patch. Does it have a clear meaning for GENERIC? I guess that's probably not such a big problem. There are a number of transformations that we would like to perform "if simplifies", but I don't know if it will always have exactly this form
Re: VEC_COND_EXPR optimizations
On Fri, 31 Jul 2020, Richard Biener wrote: +/* (v ? w : 0) ? a : b is just (v & w) ? a : b */ +(simplify + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) + (vec_cond (bit_and @0 @3) @1 @2)) Does something check automatically that @0 and @3 have compatible types? @0 should always have a vector boolean type and thus will not be generally compatible with @3. But OTOH then when you see (vec_cond (vec_cond ... then @3 must be vector boolean as well... But in theory with AVX512 the inner vec_cond could have a SImode condition @0 producing a regular V4SImode vector mask for an outer AVX512 SSE-style vec-cond and you then would get a mismatch. Ah, I thought the SSE-style vec_cond was impossible in AVX512 mode, at least I couldn't generate one in a few tests, but I didn't try very hard. So indeed better add a type compatibility check. Ok, it can't hurt. -- Marc Glisse
Re: VEC_COND_EXPR optimizations
On Fri, 31 Jul 2020, Richard Sandiford wrote: Marc Glisse writes: +/* (c ? a : b) op (c ? d : e) --> c ? (a op d) : (b op e) */ + (simplify + (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4)) + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @1, @3); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @2, @4); + } + (if (rhs2 && is_gimple_val (rhs2)) +(vec_cond @0 { rhs1; } { rhs2; }) +#endif This one looks dangerous for potentially-trapping ops. I would expect the operation not to be folded if it can trap. Is that too optimistic? +/* (v ? w : 0) ? a : b is just (v & w) ? a : b */ +(simplify + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) + (vec_cond (bit_and @0 @3) @1 @2)) Does something check automatically that @0 and @3 have compatible types? My memory of this dates from before the avx512-related changes, but at least at the time, the type of the condition in vec_cond_expr had to have the same size and number of elements as the result, and have signed integral elements. Now the size constraint has changed, but it still looks like for a given number of elements and size (this last one ignored for avx512), there is essentially a single type that can appear as condition. Is this wrong for x86? For SVE? I could add some types_match conditions if that seems too unsafe... -- Marc Glisse
VEC_COND_EXPR optimizations
When vector comparisons were forced to use vec_cond_expr, we lost a number of optimizations (my fault for not adding enough testcases to prevent that). This patch tries to unwrap vec_cond_expr a bit so some optimizations can still happen. I wasn't planning to add all those transformations together, but adding one caused a regression, whose fix introduced a second regression, etc. Using a simple fold_binary internally looks like an ok compromise to me. It remains cheap enough (not recursive, and vector instructions are not that frequent), while still allowing more than const_binop (X|0 or X for instance).

The transformations are quite conservative with :s and folding only if everything simplifies, we may want to relax this later. And of course we are going to miss things like a?b:c + a?c:b -> b+c. In terms of number of operations, some transformations turning 2 VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look like a gain... I expect the bit_not disappears in most cases, and VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.

I am a bit confused that with avx512 we get types like "vector(4) <signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not -1?), but that doesn't matter for this patch.

Regtest+bootstrap on x86_64-pc-linux-gnu

2020-07-30  Marc Glisse

	PR tree-optimization/95906
	PR target/70314
	* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
	(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
	* gcc.dg/tree-ssa/andnot-2.c: New file.
	* gcc.dg/tree-ssa/pr95906.c: Likewise.
	* gcc.target/i386/pr70314.c: Likewise.

-- 
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd index c6ae7a7db7a..af52d56162b 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -3451,6 +3451,77 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (if (cst1 && cst2) (vec_cond @0 { cst1; } { cst2; }) +/* Sink binary operation to branches, but only if we can fold it.
*/ +#if GIMPLE +(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor + rdiv trunc_div ceil_div floor_div round_div + trunc_mod ceil_mod floor_mod round_mod min max) +/* (c ? a : b) op d --> c ? (a op d) : (b op d) */ + (simplify + (op (vec_cond:s @0 @1 @2) @3) + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @1, @3); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @2, @3); + } + (if (rhs2 && is_gimple_val (rhs2)) +(vec_cond @0 { rhs1; } { rhs2; } + (simplify + (op @3 (vec_cond:s @0 @1 @2)) + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @3, @1); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @3, @2); + } + (if (rhs2 && is_gimple_val (rhs2)) +(vec_cond @0 { rhs1; } { rhs2; } + +/* (c ? a : b) op (c ? d : e) --> c ? (a op d) : (b op e) */ + (simplify + (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4)) + (with + { + tree rhs1, rhs2 = NULL; + rhs1 = fold_binary (op, type, @1, @3); + if (rhs1 && is_gimple_val (rhs1)) + rhs2 = fold_binary (op, type, @2, @4); + } + (if (rhs2 && is_gimple_val (rhs2)) +(vec_cond @0 { rhs1; } { rhs2; }) +#endif + +/* (v ? w : 0) ? a : b is just (v & w) ? a : b */ +(simplify + (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2) + (vec_cond (bit_and @0 @3) @1 @2)) +(simplify + (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2) + (vec_cond (bit_ior @0 @3) @1 @2)) +(simplify + (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2) + (vec_cond (bit_ior @0 (bit_not @3)) @2 @1)) +(simplify + (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2) + (vec_cond (bit_and @0 (bit_not @3)) @2 @1)) + +/* c1 ? c2 ? a : b : b --> (c1 & c2) ? 
a : b */ +(simplify + (vec_cond @0 (vec_cond:s @1 @2 @3) @3) + (vec_cond (bit_and @0 @1) @2 @3)) +(simplify + (vec_cond @0 @2 (vec_cond:s @1 @2 @3)) + (vec_cond (bit_ior @0 @1) @2 @3)) +(simplify + (vec_cond @0 (vec_cond:s @1 @2 @3) @2) + (vec_cond (bit_ior (bit_not @0) @1) @2 @3)) +(simplify + (vec_cond @0 @3 (vec_cond:s @1 @2 @3)) + (vec_cond (bit_and (bit_not @0) @1) @2 @3)) + /* Simplification moved from fold_cond_expr_with_comparison. It may also be extended. */ /* This pattern implements two kinds simplification: diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c new file mode 100644 index 000..e0955ce3ffd --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */ + +typedef long vec __attribute__((vector_size(16))); +vec f(vec x){ + vec y = x < 10; + return y & (y == 0); +} + +/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gc
RE: [PATCH] middle-end: Fold popcount(x&4) to (x>>2)&1 and friends.
On Thu, 23 Jul 2020, Marc Glisse wrote: On Wed, 22 Jul 2020, Roger Sayle wrote: Many thanks for the peer review and feedback. I completely agree that POPCOUNT and PARITY iterators simplifies things and handle the IFN_ variants. Is there a reason why the iterators cannot be used for this one? +(for popcount (BUILT_IN_POPCOUNT BUILT_IN_POPCOUNTL BUILT_IN_POPCOUNTLL + BUILT_IN_POPCOUNTIMAX) + parity (BUILT_IN_PARITY BUILT_IN_PARITYL BUILT_IN_PARITYLL +BUILT_IN_PARITYIMAX) + (simplify +(bit_and (popcount @0) integer_onep) +(parity @0))) Ah, maybe because we may have platforms/modes where the optab for popcount is supported but not the one for parity, and we are not allowed to create internal calls if the optab is not supported? I think expand_parity tries to expand parity as popcount&1, so it should be fine, but I didn't actually try it... -- Marc Glisse
RE: [PATCH] middle-end: Fold popcount(x&4) to (x>>2)&1 and friends.
On Wed, 22 Jul 2020, Roger Sayle wrote:

Many thanks for the peer review and feedback. I completely agree that POPCOUNT and PARITY iterators simplify things and handle the IFN_ variants.

Is there a reason why the iterators cannot be used for this one?

+(for popcount (BUILT_IN_POPCOUNT BUILT_IN_POPCOUNTL BUILT_IN_POPCOUNTLL
+               BUILT_IN_POPCOUNTIMAX)
+     parity (BUILT_IN_PARITY BUILT_IN_PARITYL BUILT_IN_PARITYLL
+             BUILT_IN_PARITYIMAX)
+ (simplify
+  (bit_and (popcount @0) integer_onep)
+  (parity @0)))

-- Marc Glisse
Re: [PATCH v2]: Optimize a >= 0 && b >= 0 to (a | b) >= 0 [PR95731]
On Fri, 10 Jul 2020, Joe Ramsay wrote:

adds a new pattern to simplify a >= 0 && b >= 0 to (a | b) >= 0.

We should probably add the symmetric simplification of a < 0 | b < 0 to (a | b) < 0.

* match.pd: New simplification.

I think Jakub was suggesting something slightly more detailed.

It would be nice to put the transformation next to "(x == 0 & y == 0) -> (x | typeof(x)(y)) == 0" in the file, since they work similarly.

+(for and (truth_and truth_andif)

I think the most important one here is bit_and.

+ (and (ge @0 integer_zerop) (ge @1 integer_zerop))

I was expecting some :s on the ge, but I notice that we don't have them for the similar transformation, so I don't know...

+ If one type is wider then the narrower type is sign-extended

Is that necessarily a sign-extension? True, for an unsigned type we usually replace >= 0 with != 0 IIRC, but it wouldn't hurt to add an explicit check for that.

+ (if ( INTEGRAL_TYPE_P (TREE_TYPE (@0))
+      && INTEGRAL_TYPE_P (TREE_TYPE (@1)))

I like indenting to align things, but I don't think that matches the official style used in gcc.

I am always afraid that this kind of transformation may hurt if it happens too early (does VRP know how to get ranges for a and b from (a|b) >= 0?), but I guess this isn't the first transformation in this direction, so we can see all of them later if it causes trouble...

If the types are int128_t and int64_t, is extending the right thing to do? Will it eventually simplify to ((int64_t)(i128 >> 64) | i64) >= 0?

-- Marc Glisse
Re: Local optimization options
On Sun, 5 Jul 2020, Thomas König wrote:

Am 04.07.2020 um 19:11 schrieb Richard Biener:

On July 4, 2020 11:30:05 AM GMT+02:00, "Thomas König" wrote:

What could be a preferred way to achieve that? Could optimization options like -ffast-math be applied to blocks instead of functions? Could we set flags on the TREE codes to allow certain optimizations? Other things?

The middle end can handle those things on function granularity only. Richard.

OK, so that will not work (or not without a disproportionate amount of effort). Would it be possible to set something like a TREE_FAST_MATH flag on TREEs? An operation could then be optimized according to these rules iff both operands had that flag, and would also have it then.

In order to support various semantics on floating point operations, I was planning to replace some trees with internal functions, with an extra operand to specify various behaviors (rounding, exception, etc). Although at least in the beginning, I was thinking of only using those functions in safe mode, to avoid perf regressions. https://gcc.gnu.org/pipermail/gcc-patches/2019-August/527040.html This may never happen now, but it sounds similar to setting flags like TREE_FAST_MATH that you are suggesting. I was going with functions for more flexibility, and to avoid all the existing assumptions about trees. While I guess for fast-math, the worst the assumptions could do is clear the flag, which would make us optimize less than possible, not so bad.

-- Marc Glisse
Re: [RFC] rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]
On Mon, 29 Jun 2020, Richard Biener via Gcc-patches wrote: At least without -frounding-math fegetround could be constant folded to FE_TONEAREST for which we'd need the actual value of FE_TONEAREST. That will break existing code which, since -frounding-math doesn't work for that, protects all FP operations with volatile read/writes or similar asm, and then doesn't specify -frounding-math because it doesn't seem necessary. I am not saying that code is right, just that it exists. In a world where we have implemented fenv_access, this kind of folding of fegetround could only happen in "#pragma fenv_access off" regions, which seems to imply that it would be the front-end's responsibility (although it would need help from the back-end to know the default value to fold to). -- Marc Glisse
Re: [RFC] rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]
On Mon, 29 Jun 2020, Segher Boessenkool wrote:

Another question. How do these builtins prevent other FP insns from being moved (or optimised) "over" them?

At the GIMPLE level they don't. They prevent other function calls from moving across them, simply because two calls cannot be reordered when at least one of them is not pure; otherwise, fenv_access is one big missing feature in gcc. I started something last year (and postponed indefinitely for lack of time), replacing all FP operations (when the safe mode is enabled) with builtins that get expanded by default to insert asm pass-throughs on the arguments and the result.

-- Marc Glisse
Re: [PATCH] tree-ssa-threadbackward.c (profitable_jump_thread_path): Do not allow __builtin_constant_p.
On Fri, 26 Jun 2020, Jeff Law via Gcc-patches wrote:

In theory yes, but there are cases where paths converge (like you've shown) where you may have evaluated to a constant on the paths, but it's not a constant at the convergence point. You have to be very careful using b_c_p like this and it's been a regular source of kernel bugs. I'd recommend looking at the .ssa dump and walk forward from there if the .ssa dump looks correct.

Here is the last dump before thread1 (105t.mergephi2). I don't see anything incorrect in it.

ledtrig_cpu (_Bool is_active)
{
  int old;
  int iftmp.0_1;
  int _5;

  [local count: 1073741824]:
  if (is_active_2(D) != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 536870913]:

  [local count: 1073741824]:
  # iftmp.0_1 = PHI <1(2), -1(3)>
  _5 = __builtin_constant_p (iftmp.0_1);
  if (_5 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 536870913]:
  if (iftmp.0_1 >= -128)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 268435456]:
  if (iftmp.0_1 <= 127)
    goto ; [34.00%]
  else
    goto ; [66.00%]

  [local count: 91268056]:
  __asm__ __volatile__("asi %0,%1 " : "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "i" iftmp.0_1, "Q" MEM[(int *)_active_cpus] : "memory", "cc");
  goto ; [100.00%]

  [local count: 982473769]:
  __asm__ __volatile__("laa %0,%2,%1 " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "d" iftmp.0_1, "Q" MEM[(int *)_active_cpus] : "memory", "cc");

  [local count: 1073741824]:
  return;
}

There is a single _b_c_p, the immediate asm argument is exactly the argument of _b_c_p, and it is in the branch protected by _b_c_p.
Now the thread1 dump, for comparison:

ledtrig_cpu (_Bool is_active)
{
  int old;
  int iftmp.0_4;
  int iftmp.0_6;
  int _7;
  int _12;
  int iftmp.0_13;
  int iftmp.0_14;

  [local count: 1073741824]:
  if (is_active_2(D) != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 536870912]:
  # iftmp.0_6 = PHI <1(2)>
  _7 = __builtin_constant_p (iftmp.0_6);
  if (_7 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 536870912]:
  # iftmp.0_4 = PHI <-1(2)>
  _12 = __builtin_constant_p (iftmp.0_4);
  if (_12 != 0)
    goto ; [50.00%]
  else
    goto ; [50.00%]

  [local count: 268435456]:
  if (iftmp.0_4 >= -128)
    goto ; [20.00%]
  else
    goto ; [80.00%]

  [local count: 214748364]:
  if (iftmp.0_6 <= 127)
    goto ; [12.00%]
  else
    goto ; [88.00%]

  [local count: 91268056]:
  # iftmp.0_13 = PHI
  __asm__ __volatile__("asi %0,%1 " : "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "i" iftmp.0_13, "Q" MEM[(int *)_active_cpus] : "memory", "cc");
  goto ; [100.00%]

  [local count: 982473769]:
  # iftmp.0_14 = PHI
  __asm__ __volatile__("laa %0,%2,%1 " : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "d" iftmp.0_14, "Q" MEM[(int *)_active_cpus] : "memory", "cc");

  [local count: 1073741824]:
  return;
}

Thread1 decides to separate the paths is_active and !is_active (surprisingly, for one it optimizes out the comparison <= 127 and for the other the comparison >= -128, while it could optimize both in both cases). And it decides to converge after the comparisons, but before the asm. What the pass did here does seem to hurt.

It looks like if we duplicate _b_c_p, we may need to duplicate far enough to include all the blocks dominated by _b_c_p==true (including the asm, here). Otherwise, any _b_c_p could be optimized to true: for a boolean b, b is the same as b ? true : false, and __builtin_constant_p(b ? true : false) would be the same as b ? __builtin_constant_p(true) : __builtin_constant_p(false), i.e. true.
It is too bad we don't have any optimization pass using ranges between IPA and thread1, that would have gotten rid of the comparisons, and hence the temptation to thread. Adding always_inline on atomic_add (or flatten on the caller) does help: EVRP removes the comparisons. Do you see a way forward without changing what thread1 does or declaring the testcase as unsupported? -- Marc Glisse
Re: [PATCH] tree-ssa-threadbackward.c (profitable_jump_thread_path): Do not allow __builtin_constant_p.
On Sat, 27 Jun 2020, Ilya Leoshkevich via Gcc-patches wrote:

Is there something specific that a compiler user should look out for? For example, here is the kernel code from which the test was derived:

static inline void atomic_add(int i, atomic_t *v)
{
#ifdef CONFIG_HAVE_MARCH_Z196_FEATURES
	if (__builtin_constant_p(i) && (i > -129) && (i < 128)) {
		__atomic_add_const(i, &v->counter);
		return;
	}
#endif
	__atomic_add(i, &v->counter);
}

It looks very straightforward - can there still be something wrong with its usage of b_c_p?

I'd recommend looking at the .ssa dump and walk forward from there if the .ssa dump looks correct.

Well, 021t.ssa already has:

__attribute__((gnu_inline))
__atomic_add_const (intD.6 valD.2269, intD.6 * ptrD.2270)
{
  intD.6 val_3(D) = valD.2269;
  intD.6 * ptr_2(D) = ptrD.2270;

  ;; basic block 2, loop depth 0, maybe hot
  ;;   prev block 0, next block 1, flags: (NEW)
  ;;   pred: ENTRY (FALLTHRU)
  # .MEM_4 = VDEF <.MEM_1(D)>
  __asm__ __volatile__("asi %0,%1 " : "ptr" "=Q" *ptr_2(D) : "val" "i" val_3(D), "Q" *ptr_2(D) : "memory", "cc");
  # VUSE <.MEM_4>
  return;
  ;;   succ: EXIT
}

which is, strictly speaking, not correct, because val_3(D) and valD.2269 are not constant. But as far as I understand, we are willing to tolerate trees like this until a certain point. What is that point supposed to be? If I understood you right, 106t.thread1 is already too late - why is it so?

Small remark: shouldn't __atomic_add_const be marked with the always_inline attribute, since it isn't usable when it isn't inlined?

-- Marc Glisse
std::includes performance tweak
Hello,

I am proposing a small tweak to the implementation of __includes, which in my application saves 20% of the running time. I noticed it because using range-v3 was giving unexpected performance gains. The unified diff is attached, but let me first show a more readable context diff.

*** /tmp/zzm2NX_stl_algo.h	2020-06-19 10:48:58.702634366 +0200
--- libstdc++-v3/include/bits/stl_algo.h	2020-06-18 23:16:06.183427245 +0200
***************
*** 2783,2797 ****
  	   _Compare __comp)
      {
        while (__first1 != __last1 && __first2 != __last2)
! 	if (__comp(__first2, __first1))
! 	  return false;
! 	else if (__comp(__first1, __first2))
! 	  ++__first1;
! 	else
! 	  {
! 	    ++__first1;
  	    ++__first2;
! 	  }
        return __first2 == __last2;
      }
--- 2783,2795 ----
  	   _Compare __comp)
      {
        while (__first1 != __last1 && __first2 != __last2)
! 	{
! 	  if (__comp(__first2, __first1))
! 	    return false;
! 	  if (!__comp(__first1, __first2))
  	    ++__first2;
! 	  ++__first1;
! 	}
        return __first2 == __last2;
      }

As you can see, it isn't much change. Some of the gain comes from pulling the two calls to ++__first1 out of the condition so there is just one call. And most of the gain comes from replacing the resulting

  if (__comp(__first1, __first2))
    ;
  else
    ++__first2;

with

  if (!__comp(__first1, __first2))
    ++__first2;

I was very surprised that the code ended up being so different for such a small change, and I still don't really understand where the extra time is going... Anyway, while I blame the compiler for not generating very good code with the current implementation, I believe the change can be seen as a simplification and should be pushed to master. It regtests fine.

2020-06-20  Marc Glisse

	* include/bits/stl_algo.h (__includes): Simplify the code.
(as with the patch for std::optional, I still haven't worked on my ssh key issue and cannot currently push)

-- Marc Glisse

diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index fd6edd0d5f4..550a15f2b3b 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2783,15 +2783,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	 _Compare __comp)
     {
       while (__first1 != __last1 && __first2 != __last2)
-	if (__comp(__first2, __first1))
-	  return false;
-	else if (__comp(__first1, __first2))
-	  ++__first1;
-	else
-	  {
-	    ++__first1;
+	{
+	  if (__comp(__first2, __first1))
+	    return false;
+	  if (!__comp(__first1, __first2))
 	    ++__first2;
-	  }
+	  ++__first1;
+	}
 	return __first2 == __last2;
       }