Re: Question about constructing vector types in GIMPLE pass

2024-04-08 Thread Marc Glisse via Gcc

On Mon, 8 Apr 2024, Hanke Zhang via Gcc wrote:


Hi,
I've been working on strengthening auto-vectorization on intel CPUs
recently. I tried to do it in the GIMPLE pass. And I noticed that some
vector types in the GIMPLE code are confusing to me. The example code
is here:

_1 = MEM[(const __m256i_u * {ref-all})_2];

I wondered how I could construct or get the type `(const __m256i_u *
{ref-all})` in the GIMPLE pass.

If you have any ideas that can help me. I'll be so grateful! :)


I am not sure what you are asking exactly. If you already have access to 
such a MEM_REF, then the doc tells you where to look for this type:


"The first operand is the pointer being dereferenced; it will always have
pointer or reference type.  The second operand is a pointer constant
serving as constant offset applied to the pointer being dereferenced
with its type specifying the type to be used for type-based alias 
analysis.

The type of the node specifies the alignment of the access."
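
For instance, given a MEM_REF tree (call it 'ref'), an untested sketch of where those pieces live:

tree ptr  = TREE_OPERAND (ref, 0);  /* the pointer being dereferenced */
tree off  = TREE_OPERAND (ref, 1);  /* constant offset; its type is the one used for TBAA */
tree tbaa = TREE_TYPE (off);        /* the "(const __m256i_u * {ref-all})" pointer type */
tree acc  = TREE_TYPE (ref);        /* the type of the access, giving the alignment */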

If you want to create a new type similar to this one, you can build it 
with various tools:


build_vector_type or build_vector_type_for_mode
build_pointer_type_for_mode(*, VOIDmode, true) to build a pointer that can 
alias anything
build_qualified_type to add const (probably useless)
build_aligned_type to specify that it is unaligned
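
Putting those together, an untested sketch (the element type and count are placeholders, pick whatever matches __m256i_u on your target):

tree vec  = build_vector_type (long_long_integer_type_node, 4);  /* 256-bit vector */
tree unal = build_aligned_type (vec, BITS_PER_UNIT);             /* unaligned, like __m256i_u */
tree cvec = build_qualified_type (unal, TYPE_QUAL_CONST);        /* add const (probably useless) */
tree ptr  = build_pointer_type_for_mode (cvec, VOIDmode, true);  /* {ref-all} pointer */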

--
Marc Glisse


Re: Building gcc with "-O -g"?

2024-02-10 Thread Marc Glisse via Gcc

On Sat, 10 Feb 2024, Steve Kargl via Gcc wrote:


So, how does one build all parts of gcc with "-O -g"?

In my shell script, I have

CFLAGS="-O -g"
export CFLAGS

CXXFLAGS="-O -g"
export CXXFLAGS

BOOT_CFLAGS="-O -g"
export BOOT_CFLAGS

../gcc/configure --prefix=$HOME/work --enable-languages=c,c++,fortran \
 --enable-bootstrap --disable-libssp --disable-multilib

but during bootstrap I see

/home/kargl/gcc/obj/./prev-gcc/xg++ -B/home/kargl/gcc/obj/./prev-gcc/ 
-B/home/kargl/work/x86_64-unknown-freebsd15.0/bin/ -nostdinc++ 
-B/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/src/.libs 
-B/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/libsupc++/.libs
  
-I/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/include/x86_64-unknown-freebsd15.0
  -I/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/include  
-I/home/kargl/gcc/gcc/libstdc++-v3/libsupc++ 
-L/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/src/.libs 
-L/home/kargl/gcc/obj/prev-x86_64-unknown-freebsd15.0/libstdc++-v3/libsupc++/.libs
-fno-PIE -c   -g -O2 -fno-checking -gtoggle -DIN_GCC -fno-exceptions 
-fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings 
-Wcast-qual -Wmissing-format-attribute -Wconditionally-supported 
-Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros 
-Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -fno-PIE -I. -I. -I../../gcc/gcc -I../../gcc/gcc/. 
-I../../gcc/gcc/../include  -I../../gcc/gcc/../libcpp/include 
-I../../gcc/gcc/../libcody -I/usr/local/include  
-I../../gcc/gcc/../libdecnumber -I../../gcc/gcc/../libdecnumber/dpd 
-I../libdecnumber -I../../gcc/gcc/../libbacktrace   -o fibonacci_heap.o -MT 
fibonacci_heap.o -MMD -MP -MF ./.deps/fibonacci_heap.TPo 
../../gcc/gcc/fibonacci_heap.cc


Note the "-g -O2".  Why?


In addition to CFLAGS and BOOT_CFLAGS, you are missing CFLAGS_FOR_TARGET 
(plus the same 3 for CXX). I don't know if that's still sufficient, but 
that's what I used to set a few years ago.
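
(From memory, and untested: that would mean exporting all six of CFLAGS, 
CXXFLAGS, BOOT_CFLAGS, BOOT_CXXFLAGS, CFLAGS_FOR_TARGET and 
CXXFLAGS_FOR_TARGET as "-O -g" before running configure.)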


--
Marc Glisse


Re: Expected warning maybe-uninitialized does not appear using g++13.2.0?

2023-12-21 Thread Marc Glisse via Gcc

On Thu, 21 Dec 2023, David Malcolm via Gcc wrote:


On Wed, 2023-12-20 at 11:16 -0800, Eric Batchelor wrote:

Hello, I unintentionally stumbled upon some strange behaviour that
occurred due to a typo.
I reproduced the behaviour where an object (std::string in my case) can
be passed to a function by reference, uninitialized, WITHOUT a compiler
warning.
Changing the code to pass the object by value DOES emit the warning.
I don't think the compiled code is incorrect; it segfaults presumably
due to uninitialized members.
I understand there may seldom be a reason to use uninitialized objects,
so "don't do that," but as I said this was unintentional and it seems
that it should have generated a warning, which would have saved some
head-scratching.

Code to reproduce:

#include <string>
std::string f(std::string &s) {
   s.append("x");
   return s;
}
int main() {
   std::string a = f(a);
}

Compile and run (no warning):

$ g++ -o uninit_obj uninit_obj.cpp -std=c++23 -Wall -Wpedantic -Wextra && ./uninit_obj
Segmentation fault (core dumped)

No difference whether using -O0 (or 1 2 3)


As I understand it, -Wmaybe-uninitialized is purely intraprocedural
i.e. it works within each individual function, without considering the
interactions *between* functions.


If you compile

#include <string>
static std::string f(std::string &s) {
 s.append("x");
 return s;
}
void g() {
 std::string a = f(a);
}

with -O3, by the time we get to the uninit pass, function g starts with

void g ()
{
  size_type __dnew;
  struct string a;
[...]
   [local count: 1073741824]:
  _26 = a._M_string_length;
  if (_26 == 4611686018427387903)

which should not require any interprocedural logic.

--
Marc Glisse


Re: libstdc++: Speed up push_back

2023-11-24 Thread Marc Glisse

On Thu, 23 Nov 2023, Jonathan Wakely wrote:


That's why we need -fsane-operator-new


Although the standard forbids it, some of us just provide an inline 
implementation


inline void* operator new(std::size_t n){return malloc(n);}
inline void operator delete(void*p)noexcept{free(p);}
inline void operator delete(void*p,std::size_t)noexcept{free(p);}

(I could certainly add a check to abort if malloc returns 0 or other 
details, depending on what the application calls for)
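
For instance, a minimal sketch of the variant with the check:

#include <cstdlib>
#include <new>
inline void* operator new(std::size_t n)
{
  void* p = std::malloc(n);
  if (!p) std::abort();   // never return null from the throwing operator new
  return p;
}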


It used to enable a number of optimizations, for instance in gcc-9

auto f(){ return std::vector(4096); }

was optimized to just one call to calloc (someone broke that in gcc-10).

Using LTO on libsupc++ is related.

I don't know if we want to define "sane" operators new/delete, or just 
have a flag that promises that we won't try to replace the default ones.


--
Marc Glisse


Re: libstdc++: Turn memmove to memcpy in vector reallocations

2023-11-21 Thread Marc Glisse

On Tue, 21 Nov 2023, Jonathan Wakely wrote:


CC Marc Glisse who added the relocation support. He might recall why
we use memmove when all uses are for newly-allocated storage, which
cannot overlap the existing storage.


Going back a bit:

https://gcc.gnu.org/pipermail/gcc-patches/2019-April/520658.html

"I think the call to memmove in __relocate_a_1 could probably be
memcpy (I don't remember why I chose memmove)"

Going back a bit further:

https://gcc.gnu.org/pipermail/gcc-patches/2018-September/505800.html

"I had to add a special case for trivial types, using memmove, to avoid
perf regressions, since relocation takes precedence over the old path that
is specialized to call memmove."

So the reason seems to be because vector already used memmove before my 
patch. You can dig further if you want to check why that is ;-)


--
Marc Glisse


Re: [PATCH] Remove unnecessary "& 1" in year_month_day_last::day()

2023-11-05 Thread Marc Glisse

On Sun, 5 Nov 2023, Cassio Neri wrote:


When year_month_day_last::day() was implemented, Dr. Matthias Kretz realised
that the operation "& 1" wasn't necessary but we did not patch it at that
time. This patch removes the unnecessary operation.


Is there an entry in gcc's bugzilla about having the optimizer handle this 
kind of optimization?


unsigned f(unsigned x){
  if(x>=32)__builtin_unreachable();
  return 30|(x&1); // --> 30|x
}

(that optimization would come in addition to your patch, doing the 
optimization by hand is still a good idea)


It looks like the criterion would be a|(b&c) --> a|b when the possible 1 bits 
of b are included in the certainly 1 bits of a|c.


--
Marc Glisse


Re: Question about merging if-else blocks

2023-09-26 Thread Marc Glisse via Gcc

On Wed, 27 Sep 2023, Hanke Zhang via Gcc wrote:


Hi, I have recently been working on merging if-else statement blocks,
and I found a rather bizarre phenomenon that I would like to ask
about.
A rough explanation is that for two consecutive if-else blocks, if
their if statements are exactly the same, they should be merged, like
the following program:

int a = atoi(argv[1]);
if (a) {
 printf("if 1");
} else {
 printf("else 1");
}
if (a) {
 printf("if 2");
} else {
 printf("else 2");
}

After using the -O3 -flto optimization option, it can be optimized as follows:

int a = atoi(argv[1]);
if (a) {
 printf("if 1");
 printf("if 2");
} else {
 printf("else 1");
 printf("else 2");
}

But `a` here is a local variable. If I declare a as a global variable,
it cannot be optimized as above. I would like to ask why this is? And
is there any solution?


If 'a' is a global variable, how do you know 'printf' doesn't modify its 
value? (you could know it for printf, but it really depends on the 
function that is called)
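
As an illustration (mine, not from the original message), with external 
linkage any opaque call has to be assumed to possibly change the global:

int a;                      /* global, visible to other translation units */
void use (const char *);    /* opaque call, like printf; may do "a = !a;" */

void f (void)
{
  if (a)
    use ("if 1");
  else
    use ("else 1");
  if (a)                    /* 'a' may have changed inside use(), so this  */
    use ("if 2");           /* test cannot be merged with the first one    */
  else
    use ("else 2");
}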


--
Marc Glisse


Re: Different ASM for ReLU function between GCC11 and GCC12

2023-06-19 Thread Marc Glisse via Gcc

On Mon, 19 Jun 2023, André Günther via Gcc wrote:


I noticed that a simple function like
auto relu( float x ) {
   return x > 0.f ? x : 0.f;
}
compiles to different ASM using GCC11 (or lower) and GCC12 (or higher). On
-O3 -mavx2 the former compiles the above function to

relu(float):
   vmaxss xmm0, xmm0, DWORD PTR .LC0[rip]
   ret
.LC0:
   .long 0

which is what I would naively expect and what also clang essentially does
(clang actually uses an xor before the maxss to get the zero). The latter,
however, compiles the function to

relu(float):
   vxorps xmm1, xmm1, xmm1
   vcmpltss xmm2, xmm1, xmm0
   vblendvps xmm0, xmm1, xmm0, xmm2
   ret

which looks like a missed optimisation. Does anyone know if there's a
reason for the changed behaviour?


With -fno-signed-zeros -ffinite-math-only, gcc-12 still uses max instead 
of cmp+blend. So the first thing to check would be if both versions give 
the same result on negative 0 and NaN.
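
A quick harness for that check might look like this (my sketch, not from the 
thread; build it with both compilers and compare the printed values and sign 
bits):

#include <cmath>
#include <cstdio>

float relu (float x) { return x > 0.f ? x : 0.f; }

int main ()
{
  volatile float m0 = -0.0f, qn = std::nanf ("");  /* volatile so the calls are not folded away */
  float z = relu (m0), n = relu (qn);
  std::printf ("relu(-0.0f) = %g (signbit %d)\n", (double) z, std::signbit (z));
  std::printf ("relu(NaN)   = %g\n", (double) n);
}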


--
Marc Glisse


Re: [PATCH] libstdc++: Add missing constexpr to simd

2023-05-22 Thread Marc Glisse via Gcc-patches

On Mon, 22 May 2023, Jonathan Wakely via Libstdc++ wrote:


* subscripting vector builtins is not allowed in constant expressions


Is that just because nobody made it work (yet)?


Yes.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101651 and others.


* if the implementation would otherwise call SIMD intrinsics/builtins


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80517 and others.

Makes sense to work around them for now.

--
Marc Glisse


Re: [GSoC] Conflicted Built-in Trait Name

2023-03-25 Thread Marc Glisse via Gcc

On Sat, 25 Mar 2023, Ken Matsui via Gcc wrote:


Built-in trait naming simply adds two underscores (__) to the original
trait name. However, the same names are already in use for some
built-in traits, such as is_void, is_pointer, and is_signed.

For example, __is_void is used in the following files:

* gcc/testsuite/g++.dg/tm/pr46567.C


This is a testcase, you can rename __is_void to whatever in there, it 
doesn't matter.



* libstdc++-v3/include/bits/cpp_type_traits.h


This __is_void seems to be used in a single place in 
include/debug/helper_functions.h, couldn't we tweak that code so __is_void 
becomes unused and can be removed?


--
Marc Glisse


Re: [PATCH 2/2] libstdc++: use new built-in trait __add_const

2023-03-21 Thread Marc Glisse via Gcc-patches

On Tue, 21 Mar 2023, Ken Matsui via Libstdc++ wrote:


  /// add_const
+#if __has_builtin(__add_const)
+  template<typename _Tp>
+struct add_const
+{ using type = __add_const(_Tp); };
+#else
  template<typename _Tp>
struct add_const
{ using type = _Tp const; };
+#endif


Is that really better? You asked elsewhere if you should measure for each 
patch, and I think that at least for such a trivial case, you need to 
demonstrate that there is a point. The drawbacks are obvious: more code in 
libstdc++, non-standard, and more builtins in the compiler.


Using builtins makes more sense for complicated traits where you can save 
several instantiations. Now that you have done a couple simple cases to 
see how it works, I think you should concentrate on the more complicated 
cases.


--
Marc Glisse


Re: Triggering -save-temps from the front-end code

2022-11-28 Thread Marc Glisse via Gcc

On Mon, 28 Nov 2022, Florian Weimer via Gcc wrote:


* Arsen Arsenović:


Hi,

Florian Weimer via Gcc  writes:


Unfortunately, some build systems immediately delete the input source
files.  Is there some easy way I can dump the pre-processed and
non-preprocessed sources to my log file?  I tried to understand how
-save-temps for crash recovery works, but it seems that this runs
outside of the frontend, in the driver.


Would dumping unconditionally into some "side channel" -dumpdir be a
sufficient workaround?


If the file names overlap, and it seems in this case, the dump files are
just overwritten.  I don't see a way to make the file names unique.

I guess for the tough cases, I can just keep running the build under
strace.


You could override unlink with LD_PRELOAD. Use a special purpose 
filesystem (gitfs? I haven't tried it yet). Wrap gcc with a command that 
calls the true gcc with a different TMPDIR / -dumpdir each time.
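
For the LD_PRELOAD idea, an untested sketch (the file name nounlink.cc is 
made up):

/* nounlink.cc: ignore unlink() so the build system cannot delete the sources. */
#include <cstdio>

extern "C" int unlink (const char *path)
{
  std::fprintf (stderr, "ignoring unlink(%s)\n", path);
  return 0;   /* pretend it succeeded */
}

/* g++ -shared -fPIC -o nounlink.so nounlink.cc
   LD_PRELOAD=$PWD/nounlink.so make ...         */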


--
Marc Glisse


Re: Please, really, make `-masm=intel` the default for x86

2022-11-25 Thread Marc Glisse via Gcc

On Fri, 25 Nov 2022, LIU Hao via Gcc wrote:

I am a Windows developer and I have been writing x86 and amd64 assembly for 
more than ten years. One annoying thing about GCC is that, for x86, if I need 
to write a piece of inline assembly then I have to do it twice: once in AT&T 
syntax and once in Intel syntax.


The doc for -masm=dialect says:

Darwin does not support ‘intel’.

Assuming that's still true, and even with Mac Intel going away, it doesn't 
help.


--
Marc Glisse


Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

2022-11-16 Thread Marc Glisse via Gcc-patches

On Fri, 4 Nov 2022, Hongyu Wang via Gcc-patches wrote:


This is a follow-up patch for PR98167

The sequence
c1 = VEC_PERM_EXPR (a, a, mask)
c2 = VEC_PERM_EXPR (b, b, mask)
c3 = c1 op c2
can be optimized to
c = a op b
c3 = VEC_PERM_EXPR (c, c, mask)
for all integer vector operations, and float operations with
full permutation.


Hello,

I assume the "full permutation" condition is to avoid performing some 
extra operations that would raise exception flags. If so, are there 
conditions (-fno-trapping-math?) where the transformation would be safe 
with arbitrary shuffles?


--
Marc Glisse


Re: Different outputs in Gimple pass dump generated by two different architectures

2022-11-11 Thread Marc Glisse via Gcc

On Thu, 10 Nov 2022, Kevin Lee wrote:


While looking at the failure for gcc.dg/uninit-pred-9_b.c, I observed that
x86-64 and risc-v has a different output for the gimple pass since
r12-4790-g4b3a325f07acebf4
<https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=4b3a325f07acebf4>.


Probably since earlier.

What would be causing the difference? Is this intended? Link 
<https://godbolt.org/z/eWxnYsK1z> for details. Thank you!


See LOGICAL_OP_NON_SHORT_CIRCUIT in fold-const.cc (and various discussions 
on the topic in mailing lists and bugzilla).


--
Marc Glisse


Re: clarification question

2022-10-22 Thread Marc Glisse via Gcc

On Sat, 22 Oct 2022, Péntek Imre via Gcc wrote:


https://gcc.gnu.org/backends.html

by "Architecture does not have a single condition code register" do you mean 
it has none or do you mean it has multiple?


Either.

If you look at the examples below, there is a C for riscv, which has 0, 
and one for sparc, which has several.


--
Marc Glisse


Re: [PATCH] Optimize (X<

2022-09-14 Thread Marc Glisse via Gcc-patches

On Tue, 13 Sep 2022, Roger Sayle wrote:


This patch tweaks the match.pd transformation previously added to fold
(X<

In https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html , I 
read:


"Bitwise operators act on the representation of the value including both 
the sign and value bits, where the sign bit is considered immediately 
above the highest-value value bit. Signed ‘>>’ acts on negative numbers by 
sign extension.


As an extension to the C language, GCC does not use the latitude given in 
C99 and C11 only to treat certain aspects of signed ‘<<’ as undefined."


To me, this means that for instance INT_MIN<<1 is well defined and 
evaluates to 0. But with this patch we turn (INT_MIN<<1)+(INT_MIN<<1) into 
(INT_MIN+INT_MIN)<<1, which is UB.


If we decide not to support this extension anymore, I think we need to 
change the documentation first.


--
Marc Glisse


Re: Floating-point comparisons in the middle-end

2022-09-01 Thread Marc Glisse via Gcc

On Thu, 1 Sep 2022, Joseph Myers wrote:


On Thu, 1 Sep 2022, FX via Gcc wrote:


A tentative patch is attached, it seems to work well on simple examples,
but for test coverage the hard part is going to be that the comparisons
seem to be optimised away very easily into their non-signaling versions.
Basically, if I do:


Presumably that can be reproduced without depending on the new built-in
function?  In which case it's an existing bug somewhere in the optimizers.


 (simplify
  (cmp @0 REAL_CST@1)
[...]
   (if (REAL_VALUE_ISNAN (TREE_REAL_CST (@1))
&& !tree_expr_signaling_nan_p (@1)
&& !tree_expr_maybe_signaling_nan_p (@0))
{ constant_boolean_node (cmp == NE_EXPR, type); })

only tries to preserve a comparison with sNaN, but not with qNaN. There 
are probably other issues since various gcc devs used to have a different 
opinion on the meaning of -ftrapping-math.



--
Marc Glisse


Re: GCC 12.1 Release Candidate available from gcc.gnu.org

2022-05-02 Thread Marc Glisse via Gcc

On Mon, 2 May 2022, Boris Kolpackov wrote:


Jakub Jelinek  writes:


The first release candidate for GCC 12.1 is available [...]


There is an unfixed bogus warning that is a regression in 12
and that I think will have a pretty wide effect (any code
that assigns/appends a 1-char string literal to std::string):

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105329

For example, in my relatively small codebase I had about 20
instances of this warning. Seeing that it's enabled as part
of -Wall (not just -Wextra), I believe there will be a lot
of grumpy users.

There is a proposed work around in this (duplicate) bug that
looks pretty simple:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104336

Perhaps it makes sense to consider it?


Please no, that workaround looks like a fragile hack (basically writing 
a+b-a to obfuscate b) that may break if you look at it sideways and likely 
makes the generated code a bit worse. Besides, we should take it as a hint 
that user code is also likely to trigger the warning by accident.


IMO there are several ways to make progress on this in parallel:

* improve the optimizer (as investigated by Andrew)

* tweak the warning so it becomes either cleverer or less eager (maybe 
even use the fact that this special case is in a system header? that 
should be a last resort though)


* battle that has already been lost, no need to rehash it:

--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -1450,3 +1450,3 @@ Warn when a declaration has duplicate const, 
volatile, restrict or _Atomic speci

 Wrestrict
-C ObjC C++ ObjC++ Var(warn_restrict) Warning LangEnabledBy(C ObjC C++ ObjC++, 
Wall)
+C ObjC C++ ObjC++ Var(warn_restrict) Warning LangEnabledBy(C ObjC C++ ObjC++, 
Wextra)
 Warn when an argument passed to a restrict-qualified parameter aliases with

Or admit that

--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -5729,4 +5729,3 @@ by this option and not enabled by the latter and vice 
versa.
 This enables all the warnings about constructions that some users
-consider questionable, and that are easy to avoid (or modify to
-prevent the warning), even in conjunction with macros.  This also
+consider questionable.  This also
 enables some language-specific warnings described in @ref{C++ Dialect

--
Marc Glisse


Re: [PATCH] libstdc++: Implement std::unreachable() for C++23 (P0627R6)

2022-03-31 Thread Marc Glisse via Gcc-patches

On Thu, 31 Mar 2022, Jonathan Wakely wrote:


On Thu, 31 Mar 2022 at 17:03, Marc Glisse via Libstdc++
 wrote:


On Thu, 31 Mar 2022, Matthias Kretz via Gcc-patches wrote:


I like it. But I'd like it even more if we could have

#elif defined _UBSAN
   __ubsan_invoke_ub("reached std::unreachable()");

But to my knowledge UBSAN has no hooks for the library like this (yet).


-fsanitize=undefined already replaces __builtin_unreachable with its own
thing, so I was indeed going to ask if the assertion / trap provide a
better debugging experience compared to plain __builtin_unreachable, with
the possibility to get a stack trace (UBSAN_OPTIONS=print_stacktrace=1),
etc? Detecting if (the right subset of) ubsan is enabled sounds like a
good idea.


Does UBsan define a macro that we can use to detect it?


https://github.com/google/sanitizers/issues/765 seems to say no (it could 
be outdated though), but they were asking for use cases to motivate adding 
one. Apparently there is a macro for clang, although I don't think it is 
fine-grained.


Adding one to cppbuiltin.cc testing SANITIZE_UNREACHABLE looks easy, maybe 
we can do just this one, we don't need to go overboard and define macros 
for all possible suboptions of ubsan right now.


I don't think any of that prevents from pushing your patch as is for 
gcc-12.


--
Marc Glisse


Re: [PATCH] libstdc++: Implement std::unreachable() for C++23 (P0627R6)

2022-03-31 Thread Marc Glisse via Gcc-patches

On Thu, 31 Mar 2022, Matthias Kretz via Gcc-patches wrote:


I like it. But I'd like it even more if we could have

#elif defined _UBSAN
   __ubsan_invoke_ub("reached std::unreachable()");

But to my knowledge UBSAN has no hooks for the library like this (yet).


-fsanitize=undefined already replaces __builtin_unreachable with its own 
thing, so I was indeed going to ask if the assertion / trap provide a 
better debugging experience compared to plain __builtin_unreachable, with 
the possibility to get a stack trace (UBSAN_OPTIONS=print_stacktrace=1), 
etc? Detecting if (the right subset of) ubsan is enabled sounds like a 
good idea.


--
Marc Glisse


Re: [PATCH] PR tree-optimization/101895: Fold VEC_PERM to help recognize FMA.

2022-03-12 Thread Marc Glisse via Gcc-patches

On Fri, 11 Mar 2022, Roger Sayle wrote:

+(match vec_same_elem_p
+  CONSTRUCTOR@0
+  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
+? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0)) : @0

Ah, I didn't remember we needed that, we don't seem to be very consistent 
about it. Probably for this reason, the transformation "Prefer vector1 << 
scalar to vector1 << vector2" does not match


typedef int vec __attribute__((vector_size(16)));
vec f(vec a, int b){
  vec bb = { b, b, b, b };
  return a << bb;
}

which is only optimized at vector lowering time.

+/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
+(for plusminus (plus minus)
+  (simplify
+(plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
+(plusminus (mult (vec_perm @1 @1 @3) @2) @4)))

Don't you want :s on mult and vec_perm?

--
Marc Glisse


Re: [PATCH] Implement constant-folding simplifications of reductions.

2022-02-21 Thread Marc Glisse

On Mon, 21 Feb 2022, Roger Sayle wrote:


+/* Fold REDUC (@0 op VECTOR_CST) as REDUC (@0) op REDUC (VECTOR_CST).  */
+(for reduc (IFN_REDUC_PLUS IFN_REDUC_MAX IFN_REDUC_MIN IFN_REDUC_FMAX
+IFN_REDUC_FMIN IFN_REDUC_AND IFN_REDUC_IOR IFN_REDUC_XOR)
+ op (plus max min IFN_FMAX IFN_FMIN bit_and bit_ior bit_xor)
+  (simplify (reduc (op @0 VECTOR_CST@1))
+(op (reduc:type @0) (reduc:type @1


I wonder if we need to test flag_associative_math for the 'plus' case,
or if the presence of IFN_REDUC_PLUS is enough to justify the possible
loss of precision.

--
Marc Glisse


Re: "cannot convert to a pointer type" error repeated tens of times

2022-02-12 Thread Marc Glisse

On Sat, 12 Feb 2022, Andrea Monaco via Gcc wrote:


 #include <curl/curl.h>

 int
 main (void)
 {
   float a;

   curl_easy_setopt (NULL, 0, (void *) a);
 }


with "gcc -c bug.c" gives


 bug.c: In function ‘main’:
 bug.c:15:3: error: cannot convert to a pointer type
curl_easy_setopt (NULL, 0, (void *) a);
^~~~
 bug.c:15:3: error: cannot convert to a pointer type
 bug.c:15:3: error: cannot convert to a pointer type
 bug.c:15:3: error: cannot convert to a pointer type
 [...]
 bug.c:15:3: error: cannot convert to a pointer type
 In file included from /usr/include/x86_64-linux-gnu/curl/curl.h:2826,
  from bug.c:1:
 bug.c:15:3: error: cannot convert to a pointer type
curl_easy_setopt (NULL, 0, (void *) a);
^~~~
 bug.c:15:3: error: cannot convert to a pointer type
curl_easy_setopt (NULL, 0, (void *) a);
^~~~
 bug.c:15:3: error: cannot convert to a pointer type
curl_easy_setopt (NULL, 0, (void *) a);
^~~~


The error message is correct, but is repeated tens of times.
The function is declared this way in curl.h

 CURL_EXTERN CURLcode curl_easy_setopt(CURL *curl, CURLoption option,
 ...);


No, curl_easy_setopt is a macro. If you look at the preprocessed code, you 
get many statements doing the same wrong operation, and one warning for 
each of them.
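
A small illustration (mine, not curl's actual macro): if a macro expands to 
several statements each containing the bad cast, you get one copy of the 
error per statement.

#define SETOPT(x) do { (void)(void *)(x); (void)(void *)(x); (void)(void *)(x); } while (0)

void g (float a)
{
  SETOPT (a);   /* "cannot convert to a pointer type", reported three times */
}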


(wrong list, should be gcc-help, or an issue on bugzilla)

--
Marc Glisse


Re: gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'

2022-02-02 Thread Marc Glisse

On Wed, 2 Feb 2022, Toon Moene wrote:

Fascinating. My gcc directory had both gmp-6.2.1 and -6.1.0, but the symbolic 
link 'gmp' pointed to the old one. A similar problem for mpc, mpfr and isl 
...


You need to pass --force to contrib/download_prerequisites if you want 
them to be updated.
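
(Something like ./contrib/download_prerequisites --force from the top of the 
source tree, if I remember the option name correctly.)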


--
Marc Glisse


Re: gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'

2022-02-01 Thread Marc Glisse

On Tue, 1 Feb 2022, Toon Moene wrote:


I just ran a "ubsan" build on my x86_64-linux-gnu system.

See: https://gcc.gnu.org/pipermail/gcc-testresults/2022-February/754454.html

This is an interesting failure:

Executing on host: 
/home/toon/scratch/bld1142336/gcc/testsuite/gfortran29/../../gfortran 
-B/home/toon/scratch/bld1142336/gcc/testsuite/gfortran29/../../ 
-B/home/toon/scratch/bld1142336/x86_64-pc-linux-gnu/./libgfortran/ 
/home/toon/compilers/gcc/gcc/testsuite/gfortran.dg/graphite/pr39516.f 
-fdiagnostics-plain-output  -fdiagnostics-plain-output -O  -O2 
-ftree-loop-linear -S -o pr39516.s   (timeout = 300)
spawn -ignore SIGHUP 
/home/toon/scratch/bld1142336/gcc/testsuite/gfortran29/../../gfortran 
-B/home/toon/scratch/bld1142336/gcc/testsuite/gfortran29/../../ 
-B/home/toon/scratch/bld1142336/x86_64-pc-linux-gnu/./libgfortran/ 
/home/toon/compilers/gcc/gcc/testsuite/gfortran.dg/graphite/pr39516.f 
-fdiagnostics-plain-output -fdiagnostics-plain-output -O -O2 
-ftree-loop-linear -S -o pr39516.s
gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 
'long unsigned int'

FAIL: gfortran.dg/graphite/pr39516.f   -O  (test for excess errors)
Excess errors:
gcd_1.c:188:13: runtime error: shift exponent 64 is too large for 64-bit type 
'long unsigned int'


Note that the test case is pure Fortran source. The undefined error seems to 
come from a function inside the graphite library ...


Maybe try with a more recent version of GMP first? gcd_1.c has only 103 
lines in release 6.2.1.


A stack trace (UBSAN_OPTIONS=print_stacktrace=1) would make it easier to 
guess where this is coming from.


--
Marc Glisse


Re: reordering of trapping operations and volatile

2022-01-08 Thread Marc Glisse

On Sat, 8 Jan 2022, Martin Uecker via Gcc wrote:


On Saturday, 2022-01-08 at 13:41 +0100, Richard Biener wrote:

On January 8, 2022 9:32:24 AM GMT+01:00, Martin Uecker  
wrote:

Hi Richard,


thank you for your quick response!


I have a question regarding reordering of volatile
accesses and trapping operations. My initial
assumption (and  hope) was that compilers take
care to avoid creating traps that are incorrectly
ordered relative to observable behavior.

I had trouble finding examples, and my cursory
glance at the code seemed to confirm that GCC
carefully avoids this.  But then someone showed
me this example, where this can happen in GCC:


volatile int x;

int foo(int a, int b, _Bool store_to_x)
{
 if (!store_to_x)
   return a / b;
 x = b;
 return a / b;
}


https://godbolt.org/z/vq3r8vjxr

In this example a division is hoisted
before the volatile store. (the division
by zero which could trap is UB, of course).

As Martin Sebor pointed out this is done
as part of redundancy elimination
in tree-ssa-pre.c and that this might
simply be an oversight (and could then be
fixed with a small change).

Could you clarify whether such reordering
is intentional and could be exploited in
general also in other optimizations or
confirm that this is an oversight that
affects only this specific case?

If this is intentional, are there examples
where this is important for optimization?


In general there is no data flow information that
prevents traps from being reordered with respect
to volatile accesses.


Yes, although I think potentially trapping ops
are not moved before calls (as this would be
incorrect).  So do you think it would be feasible
to prevent this for volatile too?


The specific case could be
easily mitigated in PRE. Another case would be

A = c / d;
X = 1;
If (use_a)
  Bar (a);

Where we'd sink a across x into the guarded BB I suspect.


Yes. Related example:

https://godbolt.org/z/5WGhadre3

volatile int x;
void bar(int a);

void foo(int c, int d)
{
 int a = c / d;
 x = 1;
 if (d)
   bar(a);
}

foo:
   mov DWORD PTR x[rip], 1
   testesi, esi
   jne .L4
   ret
.L4:
   mov eax, edi
   cdq
   idivesi
   mov edi, eax
   jmp bar


It would be nice to prevent this too, although
I am less concerned about this direction, as
the UB has already happened so there is not
much we could guarantee about this anyway.

In the other case, it could affect correct
code before the trap.


-fnon-call-exceptions helps with the first testcase but not with the 
second one. I don't know if that's by accident, but the flag seems 
possibly relevant.


--
Marc Glisse


Re: [PATCH] tree-optimization/103514 Missing XOR-EQ-AND Optimization

2021-12-04 Thread Marc Glisse

+/* (a & b) ^ (a == b) -> !(a | b) */
+/* (a & b) == (a ^ b) -> !(a | b) */
+(for first_op (bit_xor eq)
+ second_op (eq bit_xor)
+ (simplify
+  (first_op:c (bit_and:c @0 @1) (second_op:c @0 @1))
+   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+&& types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
+(convert (bit_not (bit_ior @0 @1))

I don't think you need types_match, if both are operands of bit_and, their 
types must already match.


It isn't clear what the INTEGRAL_TYPE_P test is for. Your 2 
transformations don't seem that similar to me. The first one requires that 
a and b have the same type as the result of ==, so they are boolean-like. 
The second one makes sense for more general integers, but then it looks 
like it should produce (a|b)==0.
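
A worked example of the difference (my example, not from the patch), with 
a = 1 and b = 2:

int a = 1, b = 2;
int as_cmp = (a & b) == (a ^ b);   /* 0 == 3  -> 0 (false)                         */
int as_not = ~(a | b);             /* ~3 = -4, nonzero, i.e. "true" as a condition */
int as_eq0 = (a | b) == 0;         /* 3 == 0  -> 0, which matches as_cmp           */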


It doesn't look like we have a canonical representation between a^b and 
a!=b for booleans :-(


(sorry for the broken thread, I was automatically unsubscribed because 
mailman doesn't like greylisting)


--
Marc Glisse


Re: [RFC/PATCH] C++ constexpr vs. floating point exceptions.

2021-09-21 Thread Marc Glisse

On Tue, 21 Sep 2021, Jason Merrill via Gcc-patches wrote:


On Tue, Sep 21, 2021 at 4:30 PM Joseph Myers 
wrote:


On Tue, 21 Sep 2021, Jakub Jelinek via Gcc-patches wrote:


On Tue, Sep 21, 2021 at 02:15:59PM +0100, Roger Sayle wrote:

Can you double check?  Integer division by zero is undefined, but

isn't floating point

division by zero defined by the appropriate IEEE standards?


https://eel.is/c++draft/expr.mul#4 doesn't make the division by zero
behavior conditional on integral types.
C has similar wording.


The position for C is that Annex F semantics take precedence over all the
ways in which floating-point arithmetic has undefined behavior in the main
body of the standard.  So floating-point overflow and division by zero are
fully defined in the presence of Annex F support, while out-of-range
conversions from floating point to integer types produce an unspecified
value (not necessarily the same unspecified value for different executions
of the conversion in the abstract machine - as discussed in bug 93806, GCC
can get that wrong and act as if a single execution of such a conversion
in the abstract machine produces more than one result).

In C, as specified in Annex F, initializers for floating-point objects
with static or thread storage duration are evaluated with exceptions
discarded and the default rounding mode in effect; 7.0 / 0.0 is a fully
valid initializer for such an object to initialize it to positive
infinity.  As I understand it, the question for this thread is whether C++
constexpr should have a similar rule to C static initializers (which ought
to apply to 1.0 / 3.0, raising inexact, just as much as to 7.0 / 0.0).



The C rules seem to be

F.8.2 Translation
During translation the IEC 60559 default modes are in effect:
— The rounding direction mode is rounding to nearest.
— The rounding precision mode (if supported) is set so that results are
not shortened.
— Trapping or stopping (if supported) is disabled on all floating-point
exceptions.
Recommended practice:
The implementation should produce a diagnostic message for each
translation-time floating-point exception, other than “inexact”; the
implementation should then proceed with the translation of the program.

I think following the same rules for C++ would be appropriate in a


I agree that looking at the C standard is more interesting, C++ is very 
bad at specifying anything float related.



diagnosing context: warn and continue.  In a template argument deduction
(SFINAE) context, where errors become silent substitution failures, it's
probably better to treat them as non-constant.


I am trying to imagine a sfinae example affected by whether 1./0. is 
constant. Does that mean A<0.,__builtin_inf()> would fail to use the 
specialization in


template<double,double>struct A{};
template<double x>struct A<x,1./0.>{};

? I don't like that, I believe it should use the specialization. With 
ieee754, 1./0. is perfectly well defined as +inf, the only question is 
whether it should also set a flag at runtime, which is not relevant in a 
manifestly consteval context (fetestexcept, etc are not constexpr, that 
should be enough to catch mistakes). If some user wants to forbid 
FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW or FE_UNDERFLOW in 
compile-time operations, that looks like it could be part of a separate 
compiler flag or pragma, like C's FENV_ROUND can affect the rounding mode 
in static initializers (of course C++ templates make pragmas less 
convenient).


--
Marc Glisse


Re: unexpected result with -O2 solved via "volatile"

2021-09-19 Thread Marc Glisse

On Sun, 19 Sep 2021, Allin Cottrell via Gcc wrote:

Should this perhaps be considered a bug? Below is a minimal test case for a 
type of calculation that occurs in my real code. It works as expected when 
compiled without optimization, but produces what seems like a wrong result 
when compiled with -O2, using both gcc 10.3.1 20210422 on Fedora and gcc 
11.1.0-1 on Arch Linux. I realize there's a newer gcc release but it's not 
yet available for Arch, and looking at 
https://gcc.gnu.org/gcc-11/changes.html I didn't see anything to suggest that 
something relevant has changed.


https://gcc.gnu.org/bugs/ says that you should first try compiling your 
code with -fsanitize=undefined, which tells you at runtime that your code 
is broken.


Apart from that, bug reports should go to https://gcc.gnu.org/bugzilla/ 
and questions to gcc-h...@gcc.gnu.org.


--
Marc Glisse


Re: [EXTERNAL] Re: [PATCH] tree-optimization: Optimize division followed by multiply [PR95176]

2021-08-06 Thread Marc Glisse

On Fri, 6 Aug 2021, Victor Tong wrote:


Thanks for the feedback. Here's the updated pattern:

 /* X - (X - Y) --> Y */
 (simplify
   (minus (convert1? @0) (convert2? (minus@2 (convert3? @@0) @1)))
   (if (ANY_INTEGRAL_TYPE_P (type)
   && TYPE_OVERFLOW_UNDEFINED(type)
   && !TYPE_OVERFLOW_SANITIZED(type)
   && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@2))
   && TYPE_OVERFLOW_UNDEFINED(TREE_TYPE(@2))
   && !TYPE_OVERFLOW_SANITIZED(TREE_TYPE(@2))
   && ANY_INTEGRAL_TYPE_P (TREE_TYPE(@0))
   && TYPE_OVERFLOW_UNDEFINED(TREE_TYPE(@0))
   && !TYPE_OVERFLOW_SANITIZED(TREE_TYPE(@0))
   && TYPE_PRECISION (TREE_TYPE (@2)) <= TYPE_PRECISION (type)
   && TYPE_PRECISION (TREE_TYPE (@0)) <= TYPE_PRECISION (type))
   (convert @1)))

I kept the TYPE_OVERFLOW_SANITIZED checks because I saw other patterns that 
leverage undefined overflows check for it. I think this new pattern shouldn't 
be applied if overflow sanitizer checks are enabled.


why is this testing TREE_TYPE (@0)?


I'm checking the type of @0 because I'm concerned that there could be a case 
where @0's type isn't an integer type with undefined overflow. I tried creating 
a test case and couldn't seem to create one where this is violated but I kept 
the checks to avoid causing a regression. If I'm being overcautious and you 
feel that the type checks on @0 aren't needed, I can remove them. I think the 
precision check on TREE_TYPE(@0) is needed to avoid truncation cases though.


It doesn't matter if @0 has undefined overflow, but it can matter that it 
be signed (yes, the 2 are correlated...) if it has the same precision as 
@2. Otherwise (int64_t)(-1u)-(int64_t)((int)(-1u)-0) is definitely not 0 
and it has type:int64_t, @2:int, @0:unsigned.


Ignoring the sanitizer, the confusing double matching of constants, and 
restricting to scalars, I think the tightest condition (without vrp) that 
works is


TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@2)) ||
 TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@2)) &&
  (TYPE_PRECISION (TREE_TYPE (@0)) < TYPE_PRECISION (TREE_TYPE (@2)) ||
   TYPE_PRECISION (TREE_TYPE (@0)) == TYPE_PRECISION (TREE_TYPE (@2)) &&
   !TYPE_UNSIGNED (TREE_TYPE (@0))
  )

(where implicitly undefined => signed) but of course it is ok to handle
only a subset. It is too late for me to think about @0 vs @@0.

The patch you posted seems to accept the wrong 
(int128_t)LLONG_MAX-(int128_t)((int)LLONG_MAX-0) -> (int128_t)0


Using ANY_INTEGRAL_TYPE_P allows vectors, and TYPE_PRECISION mustn't be 
used for vectors (maybe we should add more checking to catch such uses), 
and they may not be very happy with convert either...



Once we'd "inline" nop_convert genmatch would complain about this.


Maybe the new transform could be about scalars, and we could restrict the 
old one to vectors, to simplify the code, but that wouldn't help genmatch 
with the fact that the pattern x-(x-y) would appear twice. Is it really 
that bad? It is suspicious, but can be justified.



Is someone working on inlining nop_convert? I'd like to avoid breaking someone 
else's work if that's being worked on right now.

I've also added some extra tests to cover this new pattern. I've attached a 
patch with my latest changes.


From: Richard Biener 
Sent: Wednesday, July 28, 2021 2:59 AM
To: Victor Tong 
Cc: Marc Glisse ; gcc-patches@gcc.gnu.org 

Subject: Re: [EXTERNAL] Re: [PATCH] tree-optimization: Optimize division followed by multiply [PR95176] 
 

On Tue, Jun 29, 2021 at 1:10 AM Victor Tong  wrote:


Thanks Richard and Marc.

I wrote the following test case to compare the outputs of fn1() and fn1NoOpt() 
below with my extra pattern being applied. I tested the two functions with all 
of the integers from INT_MIN to INT_MAX.

long
fn1 (int x)
{
   return 42L - (long)(42 - x);
}

#pragma GCC push_options
#pragma GCC optimize ("O0")
long
fn1NoOpt (int x)
{
   volatile int y = (42 - x);
   return 42L - (long)y;
}
#pragma GCC pop_options

int main ()
{
 for (long i=INT_MIN; i<=INT_MAX;i++)
 {
 auto valNoOpt = fn1NoOpt(i);
 auto valOpt = fn1(i);
 if (valNoOpt != valOpt)
 printf("valOpt=%ld, valNoOpt=%ld\n", valOpt, valNoOpt);
 }
 return 0;
}

I saw that the return values of fn1() and fn1NoOpt() differed when the input 
was between INT_MIN and INT_MIN+42 inclusive. When passing values in this range 
to fn1NoOpt(), a signed overflow is triggered which causes the value to differ 
(undefined behavior). This seems to go in line with what Marc described and I 
think the transformation is correct in the scenario above. I do think that type 
casts that result in truncation (i.e. from a higher precision to a lower one) 
or with unsigned types will result in an incorrect transformation so those 
scenarios need t

Re: [PATCH] Fold (X<

2021-07-26 Thread Marc Glisse

On Mon, 26 Jul 2021, Roger Sayle wrote:

The one aspect that's a little odd is that each transform is paired with 
a convert@1 variant, using the efficient match machinery to expose any 
zero extension to fold-const.c's tree_nonzero_bits functionality.


Copying the first transform for context

+(for op (bit_ior bit_xor)
+ (simplify
+  (op (mult:s@0 @1 INTEGER_CST@2)
+  (mult:s@3 @1 INTEGER_CST@4))
+  (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type)
+   && (tree_nonzero_bits (@0) & tree_nonzero_bits (@3)) == 0)
+   (mult @1
+{ wide_int_to_tree (type, wi::to_wide (@2) + wi::to_wide (@4)); })))
+ (simplify
+  (op (mult:s@0 (convert@1 @2) INTEGER_CST@3)
+  (mult:s@4 (convert@1 @2) INTEGER_CST@5))
+  (if (INTEGRAL_TYPE_P (type) && TYPE_OVERFLOW_WRAPS (type)
+   && (tree_nonzero_bits (@0) & tree_nonzero_bits (@4)) == 0)
+   (mult @1
+{ wide_int_to_tree (type, wi::to_wide (@3) + wi::to_wide (@5)); })))

Could you explain how the convert helps exactly?

--
Marc Glisse


Re: An asm constraint issue (ARM FPU)

2021-07-25 Thread Marc Glisse

On Sun, 25 Jul 2021, Zoltán Kócsi wrote:


I try to write a one-liner inline function to create a double form
a 64-bit integer, not converting it to a double but the integer
containing the bit pattern for the double (type spoofing).

The compiler is arm-eabi-gcc 8.2.0.
The target is a Cortex-A9, with NEON.

According to the info page the assembler constraint "w" denotes an FPU
double register, d0 - d31.

The code is the following:

double spoof( uint64_t x )
{
double r;

  asm volatile
  (
" vmov.64 %[d],%Q[i],%R[i] \n"


Isn't it supposed to be %P[d] for a double?
(the documentation is very lacking...)


: [d] "=w" (r)
: [i] "q" (x)
  );

  return r;
}

The command line:

arm-eabi-gcc -O0 -c -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv4 \
test.c

It compiles and the generated object code is this:

00000000 <spoof>:
  0:   e52db004push{fp}; (str fp, [sp, #-4]!)
  4:   e28db000add fp, sp, #0
  8:   e24dd014sub sp, sp, #20
  c:   e14b01f4strdr0, [fp, #-20]  ; 0xffec
 10:   e14b21d4ldrdr2, [fp, #-20]  ; 0xffec
 14:   ec432b30vmovd16, r2, r3
 18:   ed4b0b03vstrd16, [fp, #-12]
 1c:   e14b20dcldrdr2, [fp, #-12]
 20:   ec432b30vmovd16, r2, r3
 24:   eeb00b60vmov.f64d0, d16
 28:   e28bd000add sp, fp, #0
 2c:   e49db004pop {fp}; (ldr fp, [sp], #4)
 30:   e12fff1ebx  lr

which is not really efficient, but works.

However, if I specify -O1, -O2 or -Os then the compilation fails
because assembler complains. This is the assembly the compiler
generated, (comments and irrelevant stuff removed):

spoof:
  vmov.64 s0,r0,r1
  bx lr

where the problem is that 's0' is a single-precision float register and
it should be 'd0' instead.

Either I'm seriously missing something, in which case I would be most
obliged if someone sent me to the right direction; or it is a compiler
or documentation bug.

Thanks,

Zoltan


--
Marc Glisse


Re: [PATCH] Fold bswap32(x) != 0 to x != 0 (and related transforms)

2021-07-18 Thread Marc Glisse

On Sun, 18 Jul 2021, Roger Sayle wrote:


+(if (GIMPLE || !TREE_SIDE_EFFECTS (@0))


I don't think you need to worry about that, the general genmatch machinery 
is already supposed to take care of it. All the existing cases in match.pd 
are about cond_expr, where counting the occurrences of each @i is not 
reliable.


--
Marc Glisse


Re: [EXTERNAL] Re: [PATCH] tree-optimization: Optimize division followed by multiply [PR95176]

2021-06-19 Thread Marc Glisse

On Fri, 18 Jun 2021, Richard Biener wrote:


Option 2: Add a new pattern to support scenarios that the existing nop_convert 
pattern bails out on.

Existing pattern:

(simplify
   (minus (nop_convert1? @0) (nop_convert2? (minus (nop_convert3? @@0) @1)))
   (view_convert @1))


I tried to check with a program when

T3 x;
T1 y;
(T2)x-(T2)((T1)x-y)

can be safely replaced with

(T2)y

From the output, it looks like this is safe when T1 is at least as large 
as T2. It is wrong when T1 is unsigned and smaller than T2. And when T1 is 
signed and smaller than T2, it is ok if T3 is the same type as T1 (signed 
then) or has strictly less precision (any sign), and not in other cases.


Note that this is when signed implies undefined overflow and unsigned 
implies wrapping, and I wouldn't put too much faith in this recently 
dusted program. And it doesn't say how to write the match.pd pattern with 
'?', "@@", disabling it if TYPE_OVERFLOW_SANITIZED, etc.


Mostly, I wanted to say that if we are going to go handle more than 
nop_convert for more than just 1 or 2 easy transformations, I think some 
kind of computer verification would be useful, it would save a lot of time 
and headaches.


(I just check by brute force all possible precisions (from 1 to 6) and 
signedness for T1, T2 and T3, all possible values for x and y, compute 
the before and after formulas, accepting if there is UB before, rejecting 
if there is UB after (and not before), and manually try to see a pattern 
in the list of types that work)
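
For concreteness, here is a compact sketch of that kind of brute-force check 
(a reconstruction of the idea, not the actual program; 'Ty', 'conv' and 'sub' 
are just names I picked, and the UB model is simplified):

#include <cstdio>
#include <optional>

struct Ty { int prec; bool uns; };   // a toy integer type: precision + signedness

static long lo (Ty t) { return t.uns ? 0 : -(1L << (t.prec - 1)); }
static long hi (Ty t) { return t.uns ? (1L << t.prec) - 1 : (1L << (t.prec - 1)) - 1; }

// Conversion to type t: modular wrapping (never UB, at worst implementation-defined).
static long conv (long v, Ty t)
{
  long m = 1L << t.prec;
  long r = ((v % m) + m) % m;
  if (!t.uns && r > hi (t)) r -= m;
  return r;
}

// x - y computed in type t; nullopt models UB on signed overflow.
static std::optional<long> sub (long x, long y, Ty t)
{
  long r = x - y;
  if (t.uns) return conv (r, t);
  if (r < lo (t) || r > hi (t)) return std::nullopt;
  return r;
}

int main ()
{
  for (int p1 = 1; p1 <= 6; p1++) for (int u1 = 0; u1 < 2; u1++)
  for (int p2 = 1; p2 <= 6; p2++) for (int u2 = 0; u2 < 2; u2++)
  for (int p3 = 1; p3 <= 6; p3++) for (int u3 = 0; u3 < 2; u3++)
    {
      Ty T1{p1, (bool) u1}, T2{p2, (bool) u2}, T3{p3, (bool) u3};
      bool ok = true;
      for (long x = lo (T3); x <= hi (T3) && ok; x++)      // x has type T3
        for (long y = lo (T1); y <= hi (T1) && ok; y++)    // y has type T1
          {
            auto inner = sub (conv (x, T1), y, T1);        // (T1)x - y
            if (!inner) continue;                          // UB before: accept anything
            auto before = sub (conv (x, T2), conv (*inner, T2), T2);
            if (!before) continue;                         // UB before: accept anything
            if (*before != conv (y, T2)) ok = false;       // after = (T2)y, never UB
          }
      if (ok)
        std::printf ("T3=%c%d T1=%c%d T2=%c%d\n",
                     "su"[T3.uns], p3, "su"[T1.uns], p1, "su"[T2.uns], p2);
    }
}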


--
Marc Glisse


Re: [PATCH] Simplify (view_convert ~a) < 0 to (view_convert a) >= 0 [PR middle-end/100738]

2021-06-04 Thread Marc Glisse

On Fri, 4 Jun 2021, Hongtao Liu via Gcc-patches wrote:


On Tue, Jun 1, 2021 at 6:17 PM Marc Glisse  wrote:


On Tue, 1 Jun 2021, Hongtao Liu via Gcc-patches wrote:


Hi:
 This patch is about to simplify (view_convert:type ~a) < 0 to
(view_convert:type a) >= 0 when type is signed integer. Similar for
(view_convert:type ~a) >= 0.
 Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
 Ok for the trunk?

gcc/ChangeLog:

   PR middle-end/100738
   * match.pd ((view_convert ~a) < 0 --> (view_convert a) >= 0,
   (view_convert ~a) >= 0 --> (view_convert a) < 0): New GIMPLE
   simplification.


We already have

/* Fold ~X op C as X op' ~C, where op' is the swapped comparison.  */
(for cmp (simple_comparison)
  scmp (swapped_simple_comparison)
  (simplify
   (cmp (bit_not@2 @0) CONSTANT_CLASS_P@1)
   (if (single_use (@2)
&& (TREE_CODE (@1) == INTEGER_CST || TREE_CODE (@1) == VECTOR_CST))
(scmp @0 (bit_not @1)

Would it make sense to try and generalize it a bit, say with

(cmp (nop_convert1? (bit_not @0)) CONSTANT_CLASS_P)

(scmp (view_convert:XXX @0) (bit_not @1))


Thanks for your advice, it looks great.
And can I use *view_convert1?* instead of *nop_convert1?* here,
because the original case is view_convert, and nop_convert would fail
to simplify the case.


Near the top of match.pd, you can see

/* With nop_convert? combine convert? and view_convert? in one pattern
   plus conditionalize on tree_nop_conversion_p conversions.  */
(match (nop_convert @0)
 (convert @0)
 (if (tree_nop_conversion_p (type, TREE_TYPE (@0)
(match (nop_convert @0)
 (view_convert @0)
 (if (VECTOR_TYPE_P (type) && VECTOR_TYPE_P (TREE_TYPE (@0))
  && known_eq (TYPE_VECTOR_SUBPARTS (type),
   TYPE_VECTOR_SUBPARTS (TREE_TYPE (@0)))
  && tree_nop_conversion_p (TREE_TYPE (type), TREE_TYPE (TREE_TYPE (@0))

So at least the intention is that it can handle both NOP_EXPR for scalars 
and VIEW_CONVERT_EXPR for vectors, and I think we alread use it that way 
in some places in match.pd, like


(simplify
 (negate (nop_convert? (bit_not @0)))
 (plus (view_convert @0) { build_each_one_cst (type); }))

(simplify
 (bit_xor:c (nop_convert?:s (bit_not:s @0)) @1)
 (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
  (bit_not (bit_xor (view_convert @0) @1

(the 'if' seems redundant for this one)

 (simplify
  (negate (nop_convert? (negate @1)))
  (if (!TYPE_OVERFLOW_SANITIZED (type)
   && !TYPE_OVERFLOW_SANITIZED (TREE_TYPE (@1)))
   (view_convert @1)))

etc.


At some point this got some genmatch help, to handle '?' and numbers, so I 
don't remember all the details, but following these examples should work.


--
Marc Glisse


Re: [PATCH] Simplify (view_convert ~a) < 0 to (view_convert a) >= 0 [PR middle-end/100738]

2021-06-01 Thread Marc Glisse

On Tue, 1 Jun 2021, Hongtao Liu via Gcc-patches wrote:


Hi:
 This patch is about to simplify (view_convert:type ~a) < 0 to
(view_convert:type a) >= 0 when type is signed integer. Similar for
(view_convert:type ~a) >= 0.
 Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
 Ok for the trunk?

gcc/ChangeLog:

   PR middle-end/100738
   * match.pd ((view_convert ~a) < 0 --> (view_convert a) >= 0,
   (view_convert ~a) >= 0 --> (view_convert a) < 0): New GIMPLE
   simplification.


We already have

/* Fold ~X op C as X op' ~C, where op' is the swapped comparison.  */
(for cmp (simple_comparison)
 scmp (swapped_simple_comparison)
 (simplify
  (cmp (bit_not@2 @0) CONSTANT_CLASS_P@1)
  (if (single_use (@2)
   && (TREE_CODE (@1) == INTEGER_CST || TREE_CODE (@1) == VECTOR_CST))
   (scmp @0 (bit_not @1)

Would it make sense to try and generalize it a bit, say with

(cmp (nop_convert1? (bit_not @0)) CONSTANT_CLASS_P)

(scmp (view_convert:XXX @0) (bit_not @1))

(I still believe that it is a bad idea that SSA_NAMEs are strongly typed, 
encoding the type in operations would be more convenient, but I think the 
time for that choice has long gone)


--
Marc Glisse


Re: [ARM] PR66791: Replace calls to builtin in vmul_n (a, b) intrinsics with __a * __b

2021-05-26 Thread Marc Glisse

On Wed, 26 May 2021, Prathamesh Kulkarni via Gcc-patches wrote:


The attached patch removes calls to builtins in vmul_n* (a, b) with __a * __b.


I am not familiar with neon, but are __a and __b unsigned here? Otherwise, 
is vmul_n already undefined in case of overflow?


--
Marc Glisse


Re: [PATCH] Optimize x < 0 ? ~y : y to (x >> 31) ^ y in match.pd

2021-05-23 Thread Marc Glisse

On Sun, 23 May 2021, apinski--- via Gcc-patches wrote:


+(for cmp (ge lt)
+/* x < 0 ? ~y : y into (x >> (prec-1)) ^ y. */
+/* x >= 0 ? ~y : y into ~((x >> (prec-1)) ^ y). */
+ (simplify
+  (cond (cmp @0 integer_zerop) (bit_not @1) @1)
+   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+&& !TYPE_UNSIGNED (TREE_TYPE (@0))
+&& TYPE_PRECISION (TREE_TYPE (@0)) == TYPE_PRECISION (type))


Is there a risk that x is signed char (precision 8) and y is a vector with 
8 elements?


--
Marc Glisse


Re: [PATCH] phiopt: Optimize (x <=> y) cmp z [PR94589]

2021-05-15 Thread Marc Glisse

On Fri, 14 May 2021, Jakub Jelinek via Gcc-patches wrote:


On Thu, May 06, 2021 at 09:42:41PM +0200, Marc Glisse wrote:

We can probably do it in 2 steps, first something like

(for cmp (eq ne)
 (simplify
  (cmp (bit_and:c @0 @1) @0)
  (cmp (@0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); })))

to get rid of the double use, and then simplify X==0 to X<=~C if C is a
mask 111...000 (I thought we already had a function to detect such masks, or
the 000...111, but I can't find them anymore).

I agree that the comparison seems preferable, although if X is signed, the
way GIMPLE represents types will add an inconvenient cast. And I think VRP
already manages to use the bit test to derive a range.


I've tried the second step, but it unfortunately regresses
+FAIL: gcc.dg/ipa/propbits-2.c scan-tree-dump-not optimized "fail_test"
+FAIL: gcc.dg/tree-ssa/loop-42.c scan-tree-dump-not ivcanon "under assumptions "
so maybe it is better to keep these cases as the users wrote them.


Thank you for trying it, the patch looks good (it would probably also work 
on GENERIC), but indeed it is just a canonicalization, not necessarily 
worth breaking other stuff. This seems to point to another missed 
optimization. Looking at propbits-2.c, if I rewrite the condition in f1 as


  if ((unsigned)x <= 0xf)

I see later in VRP

  # RANGE [0, 4294967295] NONZERO 15
  x.2_1 = (unsigned intD.9) x_4(D);
  if (x.2_1 <= 15)

where we fail to derive a range from the nonzero bits. Maybe VRP expects 
IPA-CP should have done it already and doesn't bother recomputing it.


I understand this may not be a priority though, that's fine.

I didn't look at loop-42.c.

--
Marc Glisse


Re: [PATCH] match.pd: Optimize (x & y) == x into (x & ~y) == 0 [PR94589]

2021-05-11 Thread Marc Glisse

On Tue, 11 May 2021, Jakub Jelinek via Gcc-patches wrote:


On Thu, May 06, 2021 at 09:42:41PM +0200, Marc Glisse wrote:

We can probably do it in 2 steps, first something like

(for cmp (eq ne)
 (simplify
  (cmp (bit_and:c @0 @1) @0)
  (cmp (@0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); })))

to get rid of the double use, and then simplify X==0 to X<=~C if C is a
mask 111...000 (I thought we already had a function to detect such masks, or
the 000...111, but I can't find them anymore).


Ok, here is the first step then.
Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Or should it have cmp:c too given that == and != are commutative too?


Ah, yes, you are right, good point on the cmp:c, thank you.


2021-05-11  Jakub Jelinek  
    Marc Glisse  

PR tree-optimization/94589
* match.pd ((X & Y) == Y -> (X & ~Y) == 0,

   ^

X?


(X | Y) == Y -> (X & ~Y) == 0): New GIMPLE simplifications.

* gcc.dg/tree-ssa/pr94589-1.c: New test.

--- gcc/match.pd.jj 2021-04-27 14:46:56.583716888 +0200
+++ gcc/match.pd2021-05-10 22:31:49.726870421 +0200
@@ -4764,6 +4764,18 @@ (define_operator_list COND_TERNARY
  (cmp:c (bit_xor:c @0 @1) @0)
  (cmp @1 { build_zero_cst (TREE_TYPE (@1)); }))

+#if GIMPLE
+ /* (X & Y) == X becomes (X & ~Y) == 0.  */
+ (simplify
+  (cmp (bit_and:c @0 @1) @0)
+  (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); }))
+
+ /* (X | Y) == Y becomes (X & ~Y) == 0.  */
+ (simplify
+  (cmp (bit_ior:c @0 @1) @1)
+  (cmp (bit_and @0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); }))
+#endif
+
 /* (X ^ C1) op C2 can be rewritten as X op (C1 ^ C2).  */
 (simplify
  (cmp (convert?@3 (bit_xor @0 INTEGER_CST@1)) INTEGER_CST@2)
--- gcc/testsuite/gcc.dg/tree-ssa/pr94589-1.c.jj2021-05-10 
22:36:10.574130179 +0200
+++ gcc/testsuite/gcc.dg/tree-ssa/pr94589-1.c   2021-05-10 22:36:17.789054362 
+0200
@@ -0,0 +1,21 @@
+/* PR tree-optimization/94589 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int
+foo (int x)
+{
+  return (x & 23) == x;
+/* { dg-final { scan-tree-dump " & -24;" "optimized" } } */
+/* { dg-final { scan-tree-dump-not " & 23;" "optimized" } } */
+/* { dg-final { scan-tree-dump " == 0" "optimized" } } */
+}
+
+int
+bar (int x)
+{
+  return (x | 137) != 137;
+/* { dg-final { scan-tree-dump " & -138;" "optimized" } } */
+/* { dg-final { scan-tree-dump-not " \\| 137;" "optimized" } } */
+/* { dg-final { scan-tree-dump " != 0" "optimized" } } */
+}


Jakub


--
Marc Glisse


Re: [PATCH] phiopt: Optimize (x <=> y) cmp z [PR94589]

2021-05-06 Thread Marc Glisse

On Thu, 6 May 2021, Jakub Jelinek via Gcc-patches wrote:


Though, (x&1) == x is equivalent to both (x&~1)==0 and to x < 2U
and from the latter two it isn't obvious which one is better/more canonical.
On aarch64 I don't see differences in number of insns nor in their size:
 10:13001c00sxtbw0, w0
 14:721f781ftst w0, #0xfffe
 18:1a9f17e0csetw0, eq  // eq = none
 1c:d65f03c0ret
vs.
 20:12001c00and w0, w0, #0xff
 24:7100041fcmp w0, #0x1
 28:1a9f87e0csetw0, ls  // ls = plast
 2c:d65f03c0ret
On x86_64 same number of insns, but the comparison is shorter (note, the
spaceship result is a struct with signed char based enum):
 10:31 c0   xor%eax,%eax
 12:81 e7 fe 00 00 00   and$0xfe,%edi
 18:0f 94 c0sete   %al
 1b:c3  retq
 1c:0f 1f 40 00 nopl   0x0(%rax)
vs.
 20:31 c0   xor%eax,%eax
 22:40 80 ff 01 cmp$0x1,%dil
 26:0f 96 c0setbe  %al
 29:c3  retq
Generally, I'd think that the comparison should be better because it
will be most common in user code that way and VRP will be able to do the
best thing for it.


We can probably do it in 2 steps, first something like

(for cmp (eq ne)
 (simplify
  (cmp (bit_and:c @0 @1) @0)
  (cmp (@0 (bit_not! @1)) { build_zero_cst (TREE_TYPE (@0)); })))

to get rid of the double use, and then simplify X==0 to X<=~C if C is a 
mask 111...000 (I thought we already had a function to detect such masks, 
or the 000...111, but I can't find them anymore).


I agree that the comparison seems preferable, although if X is signed, the 
way GIMPLE represents types will add an inconvenient cast. And I think VRP 
already manages to use the bit test to derive a range.
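For reference, a tiny self-contained check of the equivalence of the three
forms being discussed (my own sketch, not from the patch; the loop is a spot
check, not a proof):

#include <assert.h>

static int f1 (int x) { return (x & 1) == x; }      /* current libstdc++ form */
static int f2 (int x) { return (x & ~1) == 0; }     /* the (x & -2) == 0 form */
static int f3 (int x) { return (unsigned) x < 2; }  /* the comparison form */

int main (void)
{
  for (int x = -1000; x <= 1000; x++)
    assert (f1 (x) == f2 (x) && f2 (x) == f3 (x));
  return 0;
}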


--
Marc Glisse


Re: [PATCH] phiopt: Optimize (x <=> y) cmp z [PR94589]

2021-05-05 Thread Marc Glisse

On Tue, 4 May 2021, Jakub Jelinek via Gcc-patches wrote:


2) the pr94589-2.C testcase should be matching just 12 times each, but runs
into operator>=(strong_ordering, unspecified) being defined as
(_M_value&1)==_M_value
rather than _M_value>=0.  When not honoring NaNs, the 2 case should be
unreachable and so (_M_value&1)==_M_value is then equivalent to _M_value>=0,
but is not a single use but two uses.  I'll need to pattern match that case
specially.


Somewhere in RTL (_M_value&1)==_M_value is turned into (_M_value&-2)==0, 
that could be worth doing already in GIMPLE.


--
Marc Glisse


Re: [PATCH] match.pd: Use single_use for (T)(A) + CST -> (T)(A + CST) [PR95798]

2021-02-24 Thread Marc Glisse

On Wed, 24 Feb 2021, Jakub Jelinek via Gcc-patches wrote:


The following patch adds single_use case which restores these testcases
but keeps the testcases the patch meant to improve as is.


Hello,

I wonder if :s would be sufficient here? I don't have an opinion on which 
one is better for this particular transformation (don't change the patch 
because of my comment), we just seem to be getting more and more uses of 
single_use in match.pd, maybe at some point we need to revisit the meaning 
of :s or introduce a stronger :S.


--
Marc Glisse


Re: bug in DSE?

2021-02-12 Thread Marc Glisse

On Fri, 12 Feb 2021, Andrew MacLeod via Gcc wrote:

I don't want to immediately open a PR, so I'll just ask about 
testsuite/gcc.dg/pr83609.c.


the compilation string  is
  -O2 -fno-tree-forwprop -fno-tree-ccp -fno-tree-fre -fno-tree-pre 
-fno-code-hoisting


Which passes as is.

if I however add   -fno-tree-vrp   as well, then it looks like dead store 
maybe does something wrong...


with EVRP running, we translate function foo() from


complex float foo ()
{
  complex float c;
  complex float * c.0_1;
  complex float _4;

   <bb 2> :
  c.0_1 = &c;
  MEM[(long long unsigned int *)c.0_1] = 1311768467463790320;
  _4 = c;


Isn't that a clear violation of strict aliasing?

--
Marc Glisse


Re: [PATCH, libstdc++] GLIBCXX_HAVE_INT64_T

2021-01-06 Thread Marc Glisse

On Wed, 6 Jan 2021, David Edelsohn via Gcc-patches wrote:


Currently the type of streamoff is determined at libstdc++ configure
time, chosen by the definitions of _GLIBCXX_HAVE_INT64_T_LONG and
_GLIBCXX_HAVE_INT64_T_LONG_LONG.  For a multilib configuration, the
difference is encoded in the different multilib header file paths.
For "FAT" library targets that package 32 bit and 64 bit libraries
together, G++ also expects a single header file directory hierarchy,
causing an incorrect value for streamoff in some situations.


Shouldn't we change that? I don't see why using the same directory for 
linking should imply using the same directory for includes.


--
Marc Glisse


Re: What is the type of vector signed + vector unsigned?

2020-12-29 Thread Marc Glisse

On Tue, 29 Dec 2020, Richard Sandiford via Gcc wrote:


Any thoughts on what f should return in the following testcase, given the
usual GNU behaviour of treating signed >> as arithmetic shift right?

   typedef int vs4 __attribute__((vector_size(16)));
   typedef unsigned int vu4 __attribute__((vector_size(16)));
   int
   f (void)
   {
 vs4 x = { -1, -1, -1, -1 };
 vu4 y = { 0, 0, 0, 0 };
 return ((x + y) >> 1)[0];
   }

The C frontend takes the type of x+y from the first operand, so x+y
is signed and f returns -1.


Symmetry is an important property of addition in C/C++.


The C++ frontend applies similar rules to x+y as it would to scalars,
with unsigned T having a higher rank than signed T, so x+y is unsigned
and f returns 0x7fff.


That looks like the most natural choice.


FWIW, Clang treats x+y as signed, so f returns -1 for both C and C++.


I think clang follows gcc and uses the type of the first operand.

--
Marc Glisse


Re: [PATCH] match.pd: Add ~(X - Y) -> ~X + Y simplification [PR96685]

2020-12-12 Thread Marc Glisse

On Sat, 12 Dec 2020, Jakub Jelinek via Gcc-patches wrote:


On Sat, Dec 12, 2020 at 01:25:39PM +0100, Marc Glisse wrote:

On Sat, 12 Dec 2020, Jakub Jelinek via Gcc-patches wrote:


This patch adds the ~(X - Y) -> ~X + Y simplification requested
in the PR (plus also ~(X + C) -> ~X + (-C) for constants C that can
be safely negated).


Would it have been wrong to produce ~X - C without caring about negating
(and then extending it to non-constants)?


Extending it to non-constants is what I wanted to avoid.
For ~(X + Y), because + is commutative, it wouldn't be a canonicalization
as it would pick more-less randomly whether to do ~X + Y or X + ~Y.


~X - Y or ~Y - X I guess.

Ok, I understand. But then in the constant case, why produce ~X + -C 
instead of ~X - C (which I think doesn't need to care about negating), or 
even ~C - X (one less operation)? Or do we already have a transformation 
from ~X - C to ~C - X?


--
Marc Glisse


Re: [PATCH] match.pd: Add ~(X - Y) -> ~X + Y simplification [PR96685]

2020-12-12 Thread Marc Glisse

On Sat, 12 Dec 2020, Jakub Jelinek via Gcc-patches wrote:


This patch adds the ~(X - Y) -> ~X + Y simplification requested
in the PR (plus also ~(X + C) -> ~X + (-C) for constants C that can
be safely negated).


Would it have been wrong to produce ~X - C without caring about negating 
(and then extending it to non-constants)?


I wonder if this makes
/* ~(~X - Y) -> X + Y and ~(~X + Y) -> X - Y.  */
useless.

--
Marc Glisse


Re: Integer division on x86 -m32

2020-12-10 Thread Marc Glisse

On Thu, 10 Dec 2020, Lucas de Almeida via Gcc wrote:


when performing (int64_t) foo / (int32_t) bar in gcc under x86, a call to
__divdi3 is always output, even though it seems the use of the idiv
instruction could be faster.


IIRC, idiv requires that the quotient fit in 32 bits, while your C code 
doesn't. (1LL << 60) / 3 would cause an error with idiv.


It would be possible to use idiv in some cases, if the compiler can prove 
that variables are in the right range, but that's not so easy. You can use 
inline asm to force the use of idiv if you know it is safe for your case, 
the most common being modular arithmetic: if you know that uint32_t a, b, 
c, d are smaller than m (and m!=0), you can compute a*b+c+d in uint64_t, 
then use div to compute that modulo m.


--
Marc Glisse


Re: The conditions when convert from double to float is permitted?

2020-12-10 Thread Marc Glisse

On Thu, 10 Dec 2020, Xionghu Luo via Gcc wrote:


I have a maybe silly question about whether there is any *standard*
or *options* (like -ffast-math) for GCC that allow double to float
demotion optimization?  For example,

1) from PR22326:

#include 

float foo(float f, float x, float y) {
return (fabs(f)*x+y);
}

The fabs will return double result but it could be demoted to float
actually since the function returns float finally.


With fp-contract, this is (float)fma((double)f,(double)x,(double)y). This 
could almost be transformed into fmaf(f,x,y), except that the double 
rounding may not be strictly equivalent. Still, that seems like it would 
be no problem with -funsafe-math-optimizations, just like turning 
(float)((double)x*(double)y) into x*y, as long as it is a single operation 
with casts on all inputs and output. Whether there are cases that can be 
optimized without -funsafe-math-optimizations is harder to tell.


--
Marc Glisse


Re: [PATCH] match.pd: Use ranges to optimize some x * y / y to x [PR97997]

2020-12-06 Thread Marc Glisse

On Thu, 26 Nov 2020, Jakub Jelinek via Gcc-patches wrote:


For signed integers with undefined overflow we already optimize x * y / y
into x, but for signed integers with -fwrapv or unsigned integers we don't.
The following patch allows optimizing that into just x if value ranges
prove that x * y will never overflow.


I've long wanted a helper that checks if VRP thinks an operation could 
overflow, I think at some point it would make sense to move this code to 
some function so that it can be easily reused. Maybe also define a matcher 
so we can write (mult_noovf @0 @1) which would succeed if either overflow 
is undefined or if VRP can prove that no overflow is happening.


Of course that's all ideas for later, refactoring belongs in the second or 
third patch using a feature, not the first one :-)
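For the record, a tiny example (mine, not from the patch) of why the range
check is needed at all in the unsigned case:

unsigned f (unsigned x, unsigned y) { return x * y / y; }

/* Assuming 32-bit unsigned: f (0x10001, 0x10000) computes 0x10001 * 0x10000,
   which wraps to 0x10000, and 0x10000 / 0x10000 is 1, not 0x10001 -- so
   x * y / y -> x is only valid when the multiplication provably does not
   overflow.  */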


--
Marc Glisse


Re: [PATCH] [tree-optimization] Optimize max/min pattern with comparison

2020-12-06 Thread Marc Glisse

On Tue, 1 Dec 2020, Eugene Rozenfeld via Gcc-patches wrote:


Thank you for the review Jeff.

I don't need to look at the opcode to know the result. The pattern will be 
matched only in these 4 cases:

X <= MAX(X, Y) -> true
X > MAX(X, Y) -> false
X >= MIN(X, Y) -> true
X < MIN(X, Y) -> false

So, the result will be true for GE_EXPR and LE_EXPR and false otherwise.


Is that true even if X is NaN?

It may be hard to hit a situation where this matters though: if we honor 
NaNs, we don't build MAX_EXPR (whose behavior for NaN operands is unspecified).
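
To illustrate the concern, a quick sketch using fmax/fmin as stand-ins for
MAX_EXPR/MIN_EXPR (not the actual pattern):

#include <math.h>
#include <stdio.h>

int main (void)
{
  double x = NAN, y = 1.0;
  /* Any ordered comparison with a NaN operand is false, so folding
     X <= MAX (X, Y) to true would be wrong for this input.  */
  printf ("%d %d\n", x <= fmax (x, y), x >= fmin (x, y));  /* prints 0 0 */
  return 0;
}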


--
Marc Glisse


Re: Reassociation and trapping operations

2020-11-24 Thread Marc Glisse

On Wed, 25 Nov 2020, Ilya Leoshkevich via Gcc wrote:


I have a C floating point comparison (a <= b && a >= b), which
test_for_singularity turns into (a <= b && a == b) and vectorizer turns
into ((a <= b) & (a == b)).  So far so good.

eliminate_redundant_comparison, however, turns it into just (a == b).
I don't think this is correct, because (a <= b) traps and (a == b)
doesn't.



Hello,

let me just mention the old
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53805
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53806

There has been some debate about the exact meaning of -ftrapping-math, but 
don't let that stop you.
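
A concrete illustration of the difference (my own sketch; it assumes an IEEE
target, no -ffast-math, and strictly speaking #pragma STDC FENV_ACCESS ON for
the flag reads to be reliable):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main (void)
{
  volatile double a = NAN, b = 1.0;
  volatile int r;
  feclearexcept (FE_INVALID);
  r = (a == b);                                      /* quiet comparison */
  printf ("%d\n", fetestexcept (FE_INVALID) != 0);   /* expect 0 */
  r = (a <= b);                                      /* ordered comparison */
  printf ("%d\n", fetestexcept (FE_INVALID) != 0);   /* expect 1 */
  return 0;
}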


--
Marc Glisse


Re: Installing a generated header file

2020-11-12 Thread Marc Glisse

On Thu, 12 Nov 2020, Bill Schmidt via Gcc wrote:

Hi!  I'm working on a project where it's desirable to generate a 
target-specific header file while building GCC, and install it with the 
rest of the target-specific headers (i.e., in 
lib/gcc//11.0.0/include).  Today it appears that only those 
headers listed in "extra_headers" in config.gcc will be placed there, 
and those are assumed to be found in gcc/config/.  In my case, 
the header file will end up in my build directory instead.


Questions:

* Has anyone tried something like this before?  I didn't find anything.
* If so, can you please point me to an example?
* Otherwise, I'd be interested in advice about providing new infrastructure to 
support
 this.  I'm a relative noob with respect to the configury code, and I'm sure my
 initial instincts will be wrong. :)


Does the i386 mm_malloc.h file match your scenario?

--
Marc Glisse


Re: A couple GIMPLE questions

2020-09-06 Thread Marc Glisse

On Sat, 5 Sep 2020, Gary Oblock via Gcc wrote:


First off one of the questions just me being curious but
second is quite serious. Note, this is GIMPLE coming
into my optimization and not something I've modified.

Here's the C code:

type_t *
do_comp( type_t *data, size_t len)
{
 type_t *res;
 type_t *x = min_of_x( data, len);
 type_t *y = max_of_y( data, len);

 res = y;
 if ( x < y ) res = 0;
 return res;
}

And here's the resulting GIMPLE:

;; Function do_comp.constprop (do_comp.constprop.0, funcdef_no=5, 
decl_uid=4392, cgraph_uid=3, symbol_order=68) (executed once)

do_comp.constprop (struct type_t * data)
{
 struct type_t * res;
 struct type_t * x;
 struct type_t * y;
 size_t len;

  <bb 2> [local count: 1073741824]:

  <bb 3> [local count: 1073741824]:
 x_2 = min_of_x (data_1(D), 1);
 y_3 = max_of_y (data_1(D), 1);
 if (x_2 < y_3)
   goto <bb 4>; [29.00%]
 else
   goto <bb 5>; [71.00%]

  <bb 4> [local count: 311385128]:

  <bb 5> [local count: 1073741824]:
 # res_4 = PHI <0B(4), y_3(3)>
 return res_4;

}

The silly question first. In the "if" stmt how does GCC
get those probabilities? Which it shows as 29.00% and
71.00%. I believe they should both be 50.00%.


See the profile_estimate pass dump. One branch makes the function return 
NULL, which makes gcc guess that it may be a bit less likely than the 
other. Those are heuristics, which are tuned to help on average, but of 
course they are sometimes wrong.



The serious question is what is going on with this phi?
   res_4 = PHI <0B(4), y_3(3)>

This makes zero sense practicality wise to me and how is
it supposed to be recognized and used? Note, I really do
need to transform the "0B" into something else for my
structure reorganization optimization.


That's not a question? Are you asking why PHIs exist at all? They are the 
standard way to represent merging in SSA representations. You can iterate 
on the PHIs of a basic block, etc.



CONFIDENTIALITY NOTICE: This e-mail message, including any attachments, is for 
the sole use of the intended recipient(s) and contains information that is 
confidential and proprietary to Ampere Computing or its subsidiaries. It is to 
be used solely for the purpose of furthering the parties' business 
relationship. Any unauthorized review, copying, or distribution of this email 
(or any attachments thereto) is strictly prohibited. If you are not the 
intended recipient, please contact the sender immediately and permanently 
delete the original and any copies of this email and any attachments thereto.


Could you please get rid of this when posting on public mailing lists?

--
Marc Glisse


Re: [PATCH] c++: Disable -frounding-math during manifestly constant evaluation [PR96862]

2020-09-02 Thread Marc Glisse

On Wed, 2 Sep 2020, Jason Merrill via Gcc-patches wrote:


On 9/1/20 6:13 AM, Marc Glisse wrote:

On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote:


As discussed in the PR, fold-const.c punts on floating point constant
evaluation if the result is inexact and -frounding-math is turned on.
 /* Don't constant fold this floating point operation if the
    result may dependent upon the run-time rounding mode and
    flag_rounding_math is set, or if GCC's software emulation
    is unable to accurately represent the result.  */
 if ((flag_rounding_math
  || (MODE_COMPOSITE_P (mode) && !flag_unsafe_math_optimizations))
  && (inexact || !real_identical (&result, &value)))
   return NULL_TREE;
Jonathan said that we should be evaluating them anyway, e.g. conceptually
as if they are done with the default rounding mode before user had a chance
to change that, and e.g. in C in initializers it is also ignored.
In fact, fold-const.c for C initializers turns off various other options:

/* Perform constant folding and related simplification of initializer
  expression EXPR.  These behave identically to "fold_buildN" but ignore
  potential run-time traps and exceptions that fold must preserve.  */

#define START_FOLD_INIT \
 int saved_signaling_nans = flag_signaling_nans;\
 int saved_trapping_math = flag_trapping_math;\
 int saved_rounding_math = flag_rounding_math;\
 int saved_trapv = flag_trapv;\
 int saved_folding_initializer = folding_initializer;\
 flag_signaling_nans = 0;\
 flag_trapping_math = 0;\
 flag_rounding_math = 0;\
 flag_trapv = 0;\
 folding_initializer = 1;

#define END_FOLD_INIT \
 flag_signaling_nans = saved_signaling_nans;\
 flag_trapping_math = saved_trapping_math;\
 flag_rounding_math = saved_rounding_math;\
 flag_trapv = saved_trapv;\
 folding_initializer = saved_folding_initializer;

So, shall cxx_eval_outermost_constant_expr instead turn off all those
options (then warning_sentinel wouldn't be the right thing to use, but given
the 8 or how many return stmts in cxx_eval_outermost_constant_expr, we'd
need a RAII class for this.  Not sure about the folding_initializer, that
one is affecting complex multiplication and division constant evaluation
somehow.


I don't think we need to turn off flag_signaling_nans or flag_trapv. I 
think we want to turn off flag_trapping_math so we can fold 1./0 to inf 
(still in a context where folding is mandatory). Setting 
folding_initializer seems consistent with that, enabling infinite results 
in complex folding (it also forces folding of __builtin_constant_p, which 
may be redundant with force_folding_builtin_constant_p).


C++ says that division by zero has undefined behavior, and that an expression 
with undefined behavior is not constant, so we shouldn't fold 1./0 to inf 
anyway.  The same is true of other trapping operations.  So clearing 
flag_signaling_nans, flag_trapping_math, and flag_trapv seems wrong for C++. 
And folding_initializer seems to be used for the same sort of thing.


So we should actually force flag_trapping_math to true during constexpr
evaluation? And folding_initializer to false, and never mind trapv but
maybe disable wrapv?

#include <limits>
constexpr double a = std::numeric_limits<double>::infinity();
constexpr double b = a + a;
constexpr double c = a - a;
constexpr double d = 1. / a;
constexpr double e = 1. / d;

clang rejects c and e. MSVC rejects e. Intel warns on c.

Gcc rejects only e, and accepts the whole thing if I pass
-fno-trapping-math.

Almost any FP operation is possibly trapping, 1./3. sets FE_INEXACT just 
as 1./0. sets FE_DIVBYZERO. But the standard says


char array[1 + int(1 + 0.2 - 0.1 - 0.1)]; // Must be evaluated during 
translation

So it doesn't seem like it cares about that? Division by zero is the only 
one that gets weirdly special-cased...


--
Marc Glisse


Re: [RFC] Add new flag to specify output constraint in match.pd

2020-09-02 Thread Marc Glisse

On Wed, 2 Sep 2020, Richard Biener via Gcc wrote:


On Mon, Aug 24, 2020 at 8:20 AM Feng Xue OS via Gcc  wrote:



  There is a match-folding issue derived from pr94234.  A piece of code like:

  int foo (int n)
  {
 int t1 = 8 * n;
 int t2 = 8 * (n - 1);

 return t1 - t2;
  }

 It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C", 
and
 be folded to constant "8". But this folding will fail if both t1 and t2 have
 multiple uses, as the following code.

  int foo (int n)
  {
 int t1 = 8 * n;
 int t2 = 8 * (n - 1);

 use_fn (t1, t2);
 return t1 - t2;
  }

 Given an expression with non-single-use operands, folding it will introduce
 duplicated computation in most situations, and is deemed to be unprofitable.
 But it is always beneficial if final result is a constant or existing SSA 
value.

 And the rule is:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if ((!ANY_INTEGRAL_TYPE_P (type)
|| TYPE_OVERFLOW_WRAPS (type)
|| (INTEGRAL_TYPE_P (type)
&& tree_expr_nonzero_p (@0)
&& expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
   /* If @1 +- @2 is constant require a hard single-use on either
  original operand (but not on both).  */
   && (single_use (@3) || single_use (@4)))   <- control whether match 
or not
(mult (plusminus @1 @2) @0)))

 Current matcher only provides a way to check something before folding,
 but no mechanism to affect decision after folding. If has, for the above
 case, we can let it go when we find result is a constant.


:s already has a counter-measure where it still folds if the output is at
most one operation. So this transformation has a counter-counter-measure
of checking single_use explicitly. And now we want a counter^3-measure...


Counter-measures are a key factor in matching cost.  ":s" seems to be somewhat
coarse-grained, and here we do need more control over it.

But ideally, we could decouple these counter-measures from the definitions of
the match rules, and let the gimple matcher make a more reasonable match-or-not
decision based on these counters. Anyway, that is another story.


 Like the way to describe input operand using flags, we could also add
 a new flag to specify this kind of constraint on output that we expect
 it is a simple gimple value.

 Proposed syntax is

  (opcode:v{ condition } )

 The char "v" stands for gimple value, if more descriptive, other char is
 preferred. "condition" enclosed by { } is an optional c-syntax condition
 expression. If present, only when "condition" is met, matcher will check
 whether folding result is a gimple value using
 gimple_simplified_result_is_gimple_val ().

 Since there is no SSA concept in GENERIC, this is only for GIMPLE-match,
 not GENERIC-match.

 With this syntax, the rule is changed to

 #Form 1:
  (simplify
   (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
   (if ((!ANY_INTEGRAL_TYPE_P (type)
|| TYPE_OVERFLOW_WRAPS (type)
|| (INTEGRAL_TYPE_P (type)
&& tree_expr_nonzero_p (@0)
&& expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))
   ( if (!single_use (@3) && !single_use (@4))
  (mult:v (plusminus @1 @2) @0)))
  (mult (plusminus @1 @2) @0)


That seems to match what you can do with '!' now (that's very recent).


It's also what :s does but a slight bit more "local".  When any operand is
marked :s and it has more than a single-use we only allow simplifications
that do not require insertion of extra stmts.  So basically the above pattern
doesn't behave any different than if you omit your :v.  Only if you'd
place :v on an inner expression there would be a difference.  Correlating
the inner expression we'd not want to insert new expressions for with
a specific :s (or multiple ones) would be a more natural extension of what
:s provides.

Thus, for the above case (Form 1), you do not need :v at all and :s works.


Let's consider that multiplication is expensive. We have code like 
5*X-3*X, which can be simplified to 2*X. However, if both 5*X and 3*X have 
other uses, that would increase the number of multiplications. :s would 
not block a simplification to 2*X, which is a single stmt. So the existing 
transformation has extra explicit checks for single_use. And those extra 
checks block the transformation even for 5*X-4*X -> X which does not 
increase the number of multiplications. Which is where '!' (or :v here) 
comes in.


Or we could decide that the extra multiplication is not that bad if it 
saves an addition, simplifies the expression, possibly gains more insn 
parallelism, etc, in which case we could just drop the existing hard 
single_use check...
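
Concretely (a small example of mine, not from the PR):

void use_fn (int, int);

int f (int x)
{
  int a = 5 * x, b = 3 * x;
  use_fn (a, b);
  return a - b;   /* folding to 2*x keeps both products alive and adds a third multiplication */
}

int g (int x)
{
  int a = 5 * x, b = 4 * x;
  use_fn (a, b);
  return a - b;   /* folding to x adds no multiplication at all */
}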


--
Marc Glisse


Re: [PATCH] middle-end/94301 - map V1x to x when the vector mode is not available

2020-09-01 Thread Marc Glisse

On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote:


On Tue, Sep 01, 2020 at 11:50:32AM +0200, Richard Biener wrote:

While IMHO it makes very much sense to map V1{S,D,T}F types
to {S,D,T}Fmode if there's no special vector mode for it the
question is whether we can make this "change" for the sake
of the ABIs?  The alternative is to make vector lowering
adjust this, inserting V_C_Es for example.


Doesn't gcc violate the x86_64 ABI document by passing V1* vectors on the 
stack? Clang seems to go half-way (arguments on the stack but return in 
register), which is strange as well.


I think those vectors are relatively rare and breaking the ABI for them 
may be doable, but that's not based on any actual data.



I'd fear about the ABI consequences, as well as anything in the FEs and
middle-end that cares about vector modes, so I think lowering this in
tree-vect-generic.c using V_C_Es looks much safer to me.


If they weren't so rare, we could consider lowering them earlier so they 
benefit from more optimizations, but that doesn't seem worth the trouble.


--
Marc Glisse


Re: [PATCH] c++: Disable -frounding-math during manifestly constant evaluation [PR96862]

2020-09-01 Thread Marc Glisse

On Tue, 1 Sep 2020, Jakub Jelinek via Gcc-patches wrote:


As discussed in the PR, fold-const.c punts on floating point constant
evaluation if the result is inexact and -frounding-math is turned on.
 /* Don't constant fold this floating point operation if the
result may dependent upon the run-time rounding mode and
flag_rounding_math is set, or if GCC's software emulation
is unable to accurately represent the result.  */
 if ((flag_rounding_math
  || (MODE_COMPOSITE_P (mode) && !flag_unsafe_math_optimizations))
  && (inexact || !real_identical (&result, &value)))
   return NULL_TREE;
Jonathan said that we should be evaluating them anyway, e.g. conceptually
as if they are done with the default rounding mode before user had a chance
to change that, and e.g. in C in initializers it is also ignored.
In fact, fold-const.c for C initializers turns off various other options:

/* Perform constant folding and related simplification of initializer
  expression EXPR.  These behave identically to "fold_buildN" but ignore
  potential run-time traps and exceptions that fold must preserve.  */

#define START_FOLD_INIT \
 int saved_signaling_nans = flag_signaling_nans;\
 int saved_trapping_math = flag_trapping_math;\
 int saved_rounding_math = flag_rounding_math;\
 int saved_trapv = flag_trapv;\
 int saved_folding_initializer = folding_initializer;\
 flag_signaling_nans = 0;\
 flag_trapping_math = 0;\
 flag_rounding_math = 0;\
 flag_trapv = 0;\
 folding_initializer = 1;

#define END_FOLD_INIT \
 flag_signaling_nans = saved_signaling_nans;\
 flag_trapping_math = saved_trapping_math;\
 flag_rounding_math = saved_rounding_math;\
 flag_trapv = saved_trapv;\
 folding_initializer = saved_folding_initializer;

So, shall cxx_eval_outermost_constant_expr instead turn off all those
options (then warning_sentinel wouldn't be the right thing to use, but given
the 8 or how many return stmts in cxx_eval_outermost_constant_expr, we'd
need a RAII class for this.  Not sure about the folding_initializer, that
one is affecting complex multiplication and division constant evaluation
somehow.


I don't think we need to turn off flag_signaling_nans or flag_trapv. I 
think we want to turn off flag_trapping_math so we can fold 1./0 to inf 
(still in a context where folding is mandatory). Setting 
folding_initializer seems consistent with that, enabling infinite results 
in complex folding (it also forces folding of __builtin_constant_p, which 
may be redundant with force_folding_builtin_constant_p).



The following patch has been bootstrapped/regtested on x86_64-linux and
i686-linux, but see above, maybe we want something else.

2020-09-01  Jakub Jelinek  

PR c++/96862
* constexpr.c (cxx_eval_outermost_constant_expr): Temporarily disable
flag_rounding_math during manifestly constant evaluation.

* g++.dg/cpp1z/constexpr-96862.C: New test.

--- gcc/cp/constexpr.c.jj   2020-08-31 14:10:15.826921458 +0200
+++ gcc/cp/constexpr.c  2020-08-31 15:41:26.429964532 +0200
@@ -6680,6 +6680,8 @@ cxx_eval_outermost_constant_expr (tree t
allow_non_constant, strict,
manifestly_const_eval || !allow_non_constant };

+  /* Turn off -frounding-math for manifestly constant evaluation.  */
+  warning_sentinel rm (flag_rounding_math, ctx.manifestly_const_eval);
  tree type = initialized_type (t);
  tree r = t;
  bool is_consteval = false;
--- gcc/testsuite/g++.dg/cpp1z/constexpr-96862.C.jj	2020-08-31 15:50:07.847473028 +0200
+++ gcc/testsuite/g++.dg/cpp1z/constexpr-96862.C	2020-08-31 15:49:40.829861168 +0200
@@ -0,0 +1,20 @@
+// PR c++/96862
+// { dg-do compile { target c++17 } }
+// { dg-additional-options "-frounding-math" }
+
+constexpr double a = 0x1.0p+100 + 0x1.0p-100;
+const double b = 0x1.0p+100 + 0x1.0p-100;
+const double & = 0x1.0p+100 + 0x1.0p-100;
+static_assert (0x1.0p+100 + 0x1.0p-100 == 0x1.0p+100, "");
+
+void
+foo ()
+{
+  constexpr double d = 0x1.0p+100 + 0x1.0p-100;
+  const double e = 0x1.0p+100 + 0x1.0p-100;
+  const double & = 0x1.0p+100 + 0x1.0p-100;
+  static_assert (0x1.0p+100 + 0x1.0p-100 == 0x1.0p+100, "");
+}
+
+const double  = a;
+const double  = b;

Jakub


--
Marc Glisse


Re: [RFC] Add new flag to specify output constraint in match.pd

2020-08-23 Thread Marc Glisse

On Fri, 21 Aug 2020, Feng Xue OS via Gcc wrote:


 There is a match-folding issue derived from pr94234.  A piece of code like:

 int foo (int n)
 {
int t1 = 8 * n;
int t2 = 8 * (n - 1);

return t1 - t2;
 }

It can be perfectly caught by the rule "(A * C) +- (B * C) -> (A +- B) * C", and
be folded to constant "8". But this folding will fail if both t1 and t2 have
multiple uses, as the following code.

 int foo (int n)
 {
int t1 = 8 * n;
int t2 = 8 * (n - 1);

use_fn (t1, t2);
return t1 - t2;
 }

Given an expression with non-single-use operands, folding it will introduce
duplicated computation in most situations, and is deemed to be unprofitable.
But it is always beneficial if final result is a constant or existing SSA value.

And the rule is:
 (simplify
  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
  (if ((!ANY_INTEGRAL_TYPE_P (type)
 || TYPE_OVERFLOW_WRAPS (type)
 || (INTEGRAL_TYPE_P (type)
 && tree_expr_nonzero_p (@0)
 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
/* If @1 +- @2 is constant require a hard single-use on either
   original operand (but not on both).  */
&& (single_use (@3) || single_use (@4)))   <- control whether match 
or not
   (mult (plusminus @1 @2) @0)))

Current matcher only provides a way to check something before folding,
but no mechanism to affect decision after folding. If has, for the above
case, we can let it go when we find result is a constant.


:s already has a counter-measure where it still folds if the output is at 
most one operation. So this transformation has a counter-counter-measure 
of checking single_use explicitly. And now we want a counter^3-measure...



Like the way to describe input operand using flags, we could also add
a new flag to specify this kind of constraint on output that we expect
it is a simple gimple value.

Proposed syntax is

 (opcode:v{ condition } )

The char "v" stands for gimple value, if more descriptive, other char is
preferred. "condition" enclosed by { } is an optional c-syntax condition
expression. If present, only when "condition" is met, matcher will check
whether folding result is a gimple value using
gimple_simplified_result_is_gimple_val ().

Since there is no SSA concept in GENERIC, this is only for GIMPLE-match,
not GENERIC-match.

With this syntax, the rule is changed to

#Form 1:
 (simplify
  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
  (if ((!ANY_INTEGRAL_TYPE_P (type)
 || TYPE_OVERFLOW_WRAPS (type)
 || (INTEGRAL_TYPE_P (type)
 && tree_expr_nonzero_p (@0)
 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))
  ( if (!single_use (@3) && !single_use (@4))
 (mult:v (plusminus @1 @2) @0)))
 (mult (plusminus @1 @2) @0)


That seems to match what you can do with '!' now (that's very recent).


#Form 2:
 (simplify
  (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
  (if ((!ANY_INTEGRAL_TYPE_P (type)
 || TYPE_OVERFLOW_WRAPS (type)
 || (INTEGRAL_TYPE_P (type)
 && tree_expr_nonzero_p (@0)
 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type))
 (mult:v{ !single_use (@3) && !single_use (@4 } (plusminus @1 @2) @0


Indeed, something more flexible than '!' would be nice, but I am not so 
sure about this version. If we are going to allow inserting code after 
resimplification and before validation, maybe we should go even further 
and let people insert arbitrary code there...


--
Marc Glisse


Re: [committed] libstdc++: Make __int128 meet integer-class requirements [PR 96042]

2020-08-22 Thread Marc Glisse

On Sat, 22 Aug 2020, Jonathan Wakely via Gcc-patches wrote:


On Sat, 22 Aug 2020 at 13:13, Jonathan Wakely  wrote:


On Sat, 22 Aug 2020 at 10:52, Marc Glisse wrote:

is there a particular reason to handle only __int128 this way, and not all
the non-standard integer types? It looks like it would be a bit simpler to
avoid a special case.


I have no objection to doing it for all of them, it just wasn't
necessary to solve the immediate problem that the library now uses
__int128 even when integral<__int128> is false. (Hmm, or is size_t  an
alias for __int20 on one arch, which would mean we do use it?)


Oh I remember why I didn't do that now. I did actually want to do it
that way initially.

The macros like __GLIBCXX_TYPE_INT_N_0 are not defined in strict mode,
so we have no generic way to name those types.


IIRC, those macros were introduced specifically to help libstdc++. If 
libstdc++ wants them defined in different circumstances, it should be fine 
to change the condition from "!flag_iso || int_n_data[i].bitsize == 
POINTER_SIZE" to whatever you need.


But now I understand why you did this, thanks.

--
Marc Glisse


RE: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

2020-08-22 Thread Marc Glisse

On Sat, 15 Aug 2020, Roger Sayle wrote:

Here's version #2 of the patch to recognize bswap32 and bswap64 
incorporating your suggestions and feedback.  The test cases now confirm 
the transformation is applied when int is 32 bits and long is 64 bits, 
and should pass otherwise; the patterns now reuse (more) capturing 
groups, and the patterns have been made more generic to allow the 
ultimate type to be signed or unsigned (hence there are now two new 
gcc.dg tests).


Alas my efforts to allow the input argument to be signed, and use 
fold_convert to coerce it to the correct type before calling 
__builtin_bswap failed, with the error messages:


You can't use fold_convert for that (well, maybe if you restricted the 
transformation to GENERIC), but if I understand correctly, you are trying 
to do


(convert (BUILT_IN_BSWAP64 (convert:uint64_type_node @1)))

? (untested)


From: Marc Glisse 

+(simplify
+  (bit_ior:c
+(lshift
+  (convert (BUILT_IN_BSWAP16 (convert (bit_and @0
+  INTEGER_CST@1
+  (INTEGER_CST@2))
+(convert (BUILT_IN_BSWAP16 (convert (rshift @3
+   INTEGER_CST@4)

I didn't realize we kept this useless bit_and when casting to a smaller
type.


I was confused when I wrote that and thought we were converting from int 
to uint16_t, but bswap16 actually takes an int on x86_64, probably because 
of the calling convention, so we are converting from unsigned int to int.


Having implementation details like the calling convention appear here in 
the intermediate language complicates things a bit. Can we assume that it 
is fine to build a call to bswap32/bswap64 taking uint32_t/uint64_t and 
that only bswap16 can be affected? Do most targets have a similar-enough 
calling convention that this transformation also works on them? It looks 
like aarch64 / powerpc64le / mips64el would like for bswap16->bswap32 a 
transformation of the same form as the one you wrote for bswap32->bswap64.



I was wondering what would happen if I start from an int instead of an 
unsigned int.


f (int x)
{
  short unsigned int _1;
  short unsigned int _2;
  short unsigned int _3;
  int _5;
  int _7;
  unsigned int _8;
  unsigned int _9;
  int _10;

   [local count: 1073741824]:
  _7 = x_4(D) & 65535;
  _1 = __builtin_bswap16 (_7);
  _8 = (unsigned int) x_4(D);
  _9 = _8 >> 16;
  _10 = (int) _9;
  _2 = __builtin_bswap16 (_10);
  _3 = _1 | _2;
  _5 = (int) _3;
  return _5;
}

Handling this in the same transformation with a pair of convert12? and 
some tests should be doable, but it gets complicated enough that it is 
fine to postpone that.


--
Marc Glisse


Re: [PATCH] middle-end: Simplify popcount/parity of bswap/rotate.

2020-08-22 Thread Marc Glisse

On Fri, 21 Aug 2020, Roger Sayle wrote:


This simple patch to match.pd optimizes away bit permutation
operations, specifically bswap and rotate, in calls to popcount and
parity.


Good idea.


Although this patch has been developed and tested on LP64,
it relies on there being no truncations or extensions to "marry up"
the appropriate PARITY, PARITYL and PARITYLL forms with either BSWAP32
or BSWAP64, assuming this transformation won't fire if the integral
types have different sizes.


There would be a convert_expr or similar if the sizes were different, so 
you are safe.


I wish we had only generic builtins/ifn instead of
__builtin_parity{,l,ll,imax}, and inconsistently __builtin_bswap{16,32,64,128},
that's very inconvenient to deal with... And genmatch generates a lot of
useless code because of that.

I didn't try but couldn't the transformations be merged to reduce repetition?

(for bitcnt (POPCOUNT PARITY)
 (for swap (BUILT_IN_BSWAP32 BUILT_IN_BSWAP64)
  (simplify
   (bitcnt (swap @0))
   (bitcnt @0)))
 (for rot (lrotate rrotate)
  (simplify
   (bitcnt (rot @0 @1))
   (bitcnt @0

I assume you skipped BUILT_IN_BSWAP16 because on 32+ bit targets, there
is an intermediate extension, and 16 bit targets are "rare"? And
BUILT_IN_BSWAP128 because on most platforms intmax_t is only 64 bits and
we don't have a 128-bit version of parity/popcount? (we have an IFN, but
it seldom appears by magic)

--
Marc Glisse


Re: [committed] libstdc++: Make __int128 meet integer-class requirements [PR 96042]

2020-08-22 Thread Marc Glisse

On Wed, 19 Aug 2020, Jonathan Wakely via Gcc-patches wrote:


Because __int128 can be used as the difference type for iota_view, we
need to ensure that it meets the requirements of an integer-class type.
The requirements in [iterator.concept.winc] p10 include numeric_limits
being specialized and giving meaningful answers. Currently we only
specialize numeric_limits for non-standard integer types in non-strict
modes.  However, nothing prevents us from defining an explicit
specialization for any implementation-defined type, so it doesn't matter
whether std::is_integral<__int128> is true or not.

This patch ensures that the numeric_limits specializations for signed
and unsigned __int128 are defined whenever __int128 is available. It
also makes the __numeric_traits and __int_limits helpers work for
__int128, via a new __gnu_cxx::__is_integer_nonstrict trait.


Hello,

is there a particular reason to handle only __int128 this way, and not all 
the non-standard integer types? It looks like it would be a bit simpler to 
avoid a special case.


--
Marc Glisse


Re: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

2020-08-12 Thread Marc Glisse

On Wed, 12 Aug 2020, Roger Sayle wrote:


This patch is inspired by a small code fragment in comment #3 of
bugzilla PR rtl-optimization/94804.  That snippet appears almost
unrelated to the topic of the PR, but recognizing __builtin_bswap64
from two __builtin_bswap32 calls, seems like a clever/useful trick.
GCC's optabs.c contains the inverse logic to expand bswap64 by
IORing two bswap32 calls, so this transformation/canonicalization
is safe, even on targets without suitable optab support.  But
on x86_64, the swap64 of the test case becomes a single instruction.


This patch has been tested on x86_64-pc-linux-gnu with a "make
bootstrap" and a "make -k check" with no new failures.
Ok for mainline?


Your tests seem to assume that int has 32 bits and long 64.

+  (if (operand_equal_p (@0, @2, 0)

Why not reuse @0 instead of introducing @2 in the pattern? Similarly, it 
may be a bit shorter to reuse @1 instead of a new @3 (I don't think the 
tricks with @@ will be needed here).


+   && types_match (TREE_TYPE (@0), uint64_type_node)

that seems very specific. What goes wrong with a signed type for instance?

+(simplify
+  (bit_ior:c
+(lshift
+  (convert (BUILT_IN_BSWAP16 (convert (bit_and @0
+  INTEGER_CST@1
+  (INTEGER_CST@2))
+(convert (BUILT_IN_BSWAP16 (convert (rshift @3
+   INTEGER_CST@4)

I didn't realize we kept this useless bit_and when casting to a smaller 
type. We probably get a different pattern on 16-bit targets, but a pattern 
they do not match won't hurt them.
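
For context, the shape of source code this is meant to catch is something
like the following (my own reduction, assuming 32-bit int and 64-bit long,
not the exact snippet from the PR):

unsigned long
swap64 (unsigned long x)
{
  unsigned int lo = (unsigned int) x;
  unsigned int hi = (unsigned int) (x >> 32);
  return ((unsigned long) __builtin_bswap32 (lo) << 32) | __builtin_bswap32 (hi);
}

/* With the new pattern this should collapse to a single __builtin_bswap64 (x).  */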


--
Marc Glisse


Simplify X * C1 == C2 with wrapping overflow

2020-08-09 Thread Marc Glisse
Odd numbers are invertible in Z / 2^n Z, so X * C1 == C2 can be rewritten 
as X == C2 * inv(C1) when overflow wraps.


mod_inv should probably be updated to better match the other wide_int 
functions, but that's a separate issue.


Bootstrap+regtest on x86_64-pc-linux-gnu.

2020-08-10  Marc Glisse  

PR tree-optimization/95433
* match.pd (X * C1 == C2): Handle wrapping overflow.
* expr.c (maybe_optimize_mod_cmp): Qualify call to mod_inv.
(mod_inv): Move...
* wide-int.cc (mod_inv): ... here.
* wide-int.h (mod_inv): Declare it.

* gcc.dg/tree-ssa/pr95433-2.c: New file.

--
Marc Glisse

diff --git a/gcc/expr.c b/gcc/expr.c
index a150fa0d3b5..ebf0c9e4797 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -11859,38 +11859,6 @@ string_constant (tree arg, tree *ptr_offset, tree *mem_size, tree *decl)
   return init;
 }
 
-/* Compute the modular multiplicative inverse of A modulo M
-   using extended Euclid's algorithm.  Assumes A and M are coprime.  */
-static wide_int
-mod_inv (const wide_int &a, const wide_int &b)
-{
-  /* Verify the assumption.  */
-  gcc_checking_assert (wi::eq_p (wi::gcd (a, b), 1));
-
-  unsigned int p = a.get_precision () + 1;
-  gcc_checking_assert (b.get_precision () + 1 == p);
-  wide_int c = wide_int::from (a, p, UNSIGNED);
-  wide_int d = wide_int::from (b, p, UNSIGNED);
-  wide_int x0 = wide_int::from (0, p, UNSIGNED);
-  wide_int x1 = wide_int::from (1, p, UNSIGNED);
-
-  if (wi::eq_p (b, 1))
-return wide_int::from (1, p, UNSIGNED);
-
-  while (wi::gt_p (c, 1, UNSIGNED))
-{
-  wide_int t = d;
-  wide_int q = wi::divmod_trunc (c, d, UNSIGNED, &d);
-  c = t;
-  wide_int s = x0;
-  x0 = wi::sub (x1, wi::mul (q, x0));
-  x1 = s;
-}
-  if (wi::lt_p (x1, 0, SIGNED))
-x1 += d;
-  return x1;
-}
-
 /* Optimize x % C1 == C2 for signed modulo if C1 is a power of two and C2
is non-zero and C3 ((1<<(prec-1)) | (C1 - 1)):
for C2 > 0 to x & C3 == C2
@@ -12101,7 +12069,7 @@ maybe_optimize_mod_cmp (enum tree_code code, tree *arg0, tree *arg1)
   w = wi::lrshift (w, shift);
   wide_int a = wide_int::from (w, prec + 1, UNSIGNED);
   wide_int b = wi::shifted_mask (prec, 1, false, prec + 1);
-  wide_int m = wide_int::from (mod_inv (a, b), prec, UNSIGNED);
+  wide_int m = wide_int::from (wi::mod_inv (a, b), prec, UNSIGNED);
   tree c3 = wide_int_to_tree (type, m);
   tree c5 = NULL_TREE;
   wide_int d, e;
diff --git a/gcc/match.pd b/gcc/match.pd
index 7e5c5a6eae6..c3b88168ac4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3828,7 +3828,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (cmp @0 @2))
 
 /* For integral types with undefined overflow fold
-   x * C1 == C2 into x == C2 / C1 or false.  */
+   x * C1 == C2 into x == C2 / C1 or false.
+   If overflow wraps and C1 is odd, simplify to x == C2 / C1 in the ring
+   Z / 2^n Z.  */
 (for cmp (eq ne)
  (simplify
   (cmp (mult @0 INTEGER_CST@1) INTEGER_CST@2)
@@ -3839,7 +3841,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (if (wi::multiple_of_p (wi::to_widest (@2), wi::to_widest (@1),
 			TYPE_SIGN (TREE_TYPE (@0)), &quot))
  (cmp @0 { wide_int_to_tree (TREE_TYPE (@0), quot); })
- { constant_boolean_node (cmp == NE_EXPR, type); })
+ { constant_boolean_node (cmp == NE_EXPR, type); }))
+   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+	&& TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+	&& (wi::bit_and (wi::to_wide (@1), 1) == 1))
+(cmp @0
+ {
+   tree itype = TREE_TYPE (@0);
+   int p = TYPE_PRECISION (itype);
+   wide_int m = wi::one (p + 1) << p;
+   wide_int a = wide_int::from (wi::to_wide (@1), p + 1, UNSIGNED);
+   wide_int i = wide_int::from (wi::mod_inv (a, m),
+p, TYPE_SIGN (itype));
+   wide_int_to_tree (itype, wi::mul (i, wi::to_wide (@2)));
+ })
 
 /* Simplify comparison of something with itself.  For IEEE
floating-point, we can only do some of these simplifications.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95433-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95433-2.c
new file mode 100644
index 000..c830a3d195f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95433-2.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fwrapv -fdump-tree-gimple" } */
+
+typedef __INT32_TYPE__ int32_t;
+typedef unsigned __INT32_TYPE__ uint32_t;
+
+int e(int32_t x){return 3*x==5;}
+int f(int32_t x){return 3*x==-5;}
+int g(int32_t x){return -3*x==5;}
+int h(int32_t x){return 7*x==3;}
+int i(uint32_t x){return 7*x==3;}
+
+/* { dg-final { scan-tree-dump-times "== 1431655767" 1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "== -1431655767" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "== 613566757" 2 "gimple" } } */
diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc
index 03be0074816..f4d949c38a0 100644
--- a/gcc/wide-int.cc
+++ b/gcc/wide-int.cc
@@ -2223,6 +2223,39 @@ wi::round_

Re: Simplify X * C1 == C2 with undefined overflow

2020-08-07 Thread Marc Glisse

On Fri, 7 Aug 2020, Jakub Jelinek wrote:


On Fri, Aug 07, 2020 at 10:57:54PM +0200, Marc Glisse wrote:

On Fri, 7 Aug 2020, Joern Wolfgang Rennecke wrote:



On 07/08/20 19:21, Marc Glisse wrote:


If we are going to handle the wrapping case, we shouldn't limit to
the non-wrapping meaning of multiplicity. 3*X==5 should become
X==1431655767 (for a 32 bit type), etc.

Do we have an extended gcd somewhere in gcc, to help compute 1431655767?

I don't quite see yet how this relates to gcd,


https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm is the most
common way to compute the modular multiplicative inverse of a number. For 3
and 2^32, it could tell us that 2863311531*3-2*2^32=1, so modulo 2^32
2863311531*3==1, and 3*X==5 is the same as 2863311531*3*X==2863311531*5,
i.e. X==1431655767.


wi::gcd ?


That's the first place I looked, but this is only the regular Euclid 
algorithm, not the extended one. It tells you what the gcd is, but does 
not give you the coefficients of the Bézout identity. For 3 and 2^32, it 
would tell me the gcd is 1, while I want the number 2863311531 (inverse of 
3).


Ah, found it, there is mod_inv hidden in expr.c!
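
For the archives, the computation needed is nothing more than this (a
standalone sketch, not the expr.c code, and assuming a 32-bit int for the
wrapping multiplication at the end):

#include <stdint.h>
#include <stdio.h>

/* Inverse of an odd c modulo 2^32 via the extended Euclidean algorithm,
   so that c*x == d (mod 2^32) becomes x == d*inv(c).  */
static uint32_t
mod_inv_u32 (uint32_t c)
{
  int64_t r0 = (int64_t) 1 << 32, r1 = c;  /* remainders */
  int64_t x0 = 0, x1 = 1;                  /* Bezout coefficients of c */
  while (r1 != 0)
    {
      int64_t q = r0 / r1, t;
      t = r0 - q * r1;  r0 = r1;  r1 = t;
      t = x0 - q * x1;  x0 = x1;  x1 = t;
    }
  return (uint32_t) x0;                    /* reduced mod 2^32 by the cast */
}

int main (void)
{
  uint32_t inv3 = mod_inv_u32 (3);
  printf ("%u\n", (unsigned) inv3);                  /* 2863311531 */
  printf ("%u\n", (unsigned) (uint32_t) (inv3 * 5)); /* 1431655767, so 3*X==5 becomes X==1431655767 */
  return 0;
}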

--
Marc Glisse


Re: Simplify X * C1 == C2 with undefined overflow

2020-08-07 Thread Marc Glisse

On Fri, 7 Aug 2020, Joern Wolfgang Rennecke wrote:



On 07/08/20 19:21, Marc Glisse wrote:


If we are going to handle the wrapping case, we shouldn't limit to the 
non-wrapping meaning of multiplicity. 3*X==5 should become X==1431655767 
(for a 32 bit type), etc.


Do we have an extended gcd somewhere in gcc, to help compute 1431655767?

I don't quite see yet how this relates to gcd,


https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm is the most 
common way to compute the modular multiplicative inverse of a number. For 
3 and 2^32, it could tell us that 2863311531*3-2*2^32=1, so modulo 2^32 
2863311531*3==1, and 3*X==5 is the same as 2863311531*3*X==2863311531*5, 
i.e. X==1431655767.


However, there are cases were the second division will not be possible 
without rest.

Consider : 7*X == 3
3/7 is 0 rest 3.
0x1 / 7 is 0x24924924 rest 4
Since 3 cannot be represented as an integer multiple of -4, we can conclude 
that the predicate

is always false.


613566757*7-2^32==3


Also: 14*X == 28
has a non-zero power of two in the constant factor, so we have to factor out 
the power of two
(if the right hand side has a lower power of two, the predicate is always 
false) and consider
in modulo arithmetic with a number of bits less by the factored out power of 
two, i.e. 31 bit here:

7*X == 14
which is solved as above - but in 32 bit arithmetic - to
X == 2
and to convert back to 32 bit arithmetic, we get:
X & 0x7fff == 2
or
2*x == 4


Yes, we can reduce the size of the numbers a bit, but the gains won't be 
as nice for even numbers. I think the always-false case is already handled 
by CCP, tracking which bits can be 0 (I didn't check).


--
Marc Glisse


Re: Simplify X * C1 == C2 with undefined overflow

2020-08-07 Thread Marc Glisse

On Fri, 7 Aug 2020, Joern Wolfgang Rennecke wrote:


this transformation is quite straightforward, without overflow, 3*X==15 is
the same as X==5 and 3*X==5 cannot happen.


Actually, with binary and decimal computers, this transformation (with 
these specific numbers) is also valid for wrapping overflow.  More 
generally, it is valid for wrapping overflow if the right hand side of 
the comparison is divisible without rest by the constant factor, and the 
constant factor has no sub-factor that is a zero divisor for the ring 
defined by the wrapping operation. For binary computers, the latter 
condition can be more simply be restated as: The constant factor has to 
be odd. (For decimal computers, it's: must not be divisible by two 
and/or five.)


If we are going to handle the wrapping case, we shouldn't limit to the 
non-wrapping meaning of multiplicity. 3*X==5 should become X==1431655767 
(for a 32 bit type), etc.


Do we have an extended gcd somewhere in gcc, to help compute 1431655767?

(Even if the variable factor is wider than equality comparison, that is 
not a problem as long as the comparison is not widened by the 
transformation.)


On the other hand, the following generalizations would work only without 
overflow:
- handling of inequality-comparisons - merely have to account for the sign of 
the factor reversing the sense of

 the inequality, e.g. -3*X >= 15 ---> X <= 5


And again for this one, we have to be careful how we round the division, 
but we don't have to limit to the case where 15 is a multiple of 3. 3*X>7 
can be replaced with X>2.



Those are two nice suggestions. Do you intend to write a patch? Otherwise 
I'll try to do it eventually (no promise).


--
Marc Glisse


Re: FENV_ACCESS status

2020-08-07 Thread Marc Glisse
n't hinder the second too much. But having the strictest version 
first looked reasonable.


There's no such thing currently as effects on memory state only depend 
on arguments.


This reminds me of the initialization of static/thread_local variables in 
functions, when Jason tried to add an attribute, but I don't think it was 
ever committed, and the semantics were likely too different.


I _think_ we don't have to say the mem out state depends on the mem in 
state (FP ENV), well - it does, but the difference only depends on the 
actual arguments.


A different rounding mode could cause different exceptions I believe.


That said, tracking FENV together with memory will complicate things
but explicitely tracking an (or multiple?) extra FP ENV register input/output
makes the problem not go away (the second plus still has the mutated
FP ENV from the first plus as input).  Instead we'd have to separately
track the effect of a single operation and the overall FP state, like

(_3, flags_5) = .IFN_PLUS (_1, _2, 0);
fpexstate = merge (flags_5, fpexstate);
(_4, flags_6) = .IFN_PLUS (_1, _2, 0);
fpexstate = merge (flage_6, fpexstate);


We would have to be careful that lines 2 and 3 cannot be swapped (unless 
we keep all the merges and key expansion on those and not on the IFN? 
But we may end up with a use of the sum before the merge).



or so and there we can CSE.


And I guess we would have a transformation
merge(f, merge(f, state)) --> merge(f, state)


We have to track exception state separately
from the FP control word for rounding-mode for this to work.  Thus when
we're not interested in the exception state then .IFN_PLUS would be 'pure'
(only dependent on the FP CW)?

So I guess we should think of somehow separating rounding mode tracking
and exception state?  If we make the functions affect memory anyway
we can have the FP state reg(s) modeled explicitely with a fake decl(s) and pass
that by reference to the IFNs?  Then we can make use of the "fn spec" attribute
to tell which function reads/writes which reg.  Across unknown functions we'd
then have to use the store/load "trick" to merge them with the global
memory state though.


Splitting the rounding mode from the exceptions certainly makes sense, 
since they are used quite differently.



_3 = .FENV_PLUS (_1, _2, 0, _round, _except)
or just
_3 = .FENV_PLUS (_1, _2, 1, _round, 0)
or 
_3 = .FENV_PLUS (_1, _2, 2, 0, _except)

when we are not interested in everything.

with fake global decls for fenv_round and fenv_except (so "unknown" 
already possibly reads/writes it) and fn specs to say it doesn't look at 
other memory? I was more thinking of making that implicit, through magic 
in a couple relevant functions (the value in flags says if the global 
fenv_round or fenv_except is accessed), as a refinement of just "memory".


But IIUC, we would need something that does not use memory at all (not 
even one variable) if we wanted to avoid the big penalty in alias 
analysis, etc.


If we consider the case without exceptions:

round = get_fenv_round()
_3 = .FENV_PLUS (_1, _2, opts, round)

with .FENV_PLUS "const" and get_fenv_round "pure" (or even reading round 
from a fake global variable instead of a function call) would be tempting, 
but it doesn't work, since now .FENV_PLUS can migrate after a later call 
to fesetround. Even without exceptions we need some protection after, so 
it may be easier to keep the memory (fenv) read as part of .FENV_PLUS.
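
A user-level sketch of the hazard (hypothetical example; it assumes
-frounding-math / #pragma STDC FENV_ACCESS semantics):

#include <fenv.h>

double f (double a, double b)
{
  double x = a + b;          /* rounded with the current (to-nearest) mode */
  fesetround (FE_UPWARD);
  double y = a + b;          /* must round upward: if the rounding-mode read were a
                                separate "pure" load, nothing would keep this addition
                                from being CSEd with x or hoisted above fesetround */
  fesetround (FE_TONEAREST);
  return y - x;
}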


Also, caring only about rounding doesn't match any standard #pragma, so 
such an option may see very little use in practice...



Sorry for the incoherent brain-dump above ;)


It is great to have someone to discuss this with!

--
Marc Glisse


Re: VEC_COND_EXPR optimizations v2

2020-08-07 Thread Marc Glisse

On Fri, 7 Aug 2020, Richard Biener wrote:


On Fri, Aug 7, 2020 at 10:33 AM Marc Glisse  wrote:


On Fri, 7 Aug 2020, Richard Biener wrote:


On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse  wrote:


On Thu, 6 Aug 2020, Christophe Lyon wrote:


Was I on the right track configuring with
--target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16
then compiling without any special option?


Maybe you also need --with-float=hard, I don't remember if it's
implied by the 'hf' target suffix


Thanks! That's what I was missing to reproduce the issue. Now I can
reproduce it with just

typedef unsigned int vec __attribute__((vector_size(16)));
typedef int vi __attribute__((vector_size(16)));
vi f(vec a,vec b){
 return a==5 | b==7;
}

with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 
-O1

   _1 = a_5(D) == { 5, 5, 5, 5 };
   _3 = b_6(D) == { 7, 7, 7, 7 };
   _9 = _1 | _3;
   _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);

we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
false), while with -fdisable-tree-forwprop4 we do manage to expand

   _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 
112);

It doesn't make much sense to me that we can expand the more complicated
form and not the simpler form of the same operation (both compare a to 5
and produce a vector of -1 or 0 of the same size), especially when the
target has an instruction (vceq) that does just what we want.

Introducing boolean vectors was fine, but I think they should be real
types, that we can operate on, not be forced to appear only as the first
argument of a vcond.

I can think of 2 natural ways to improve things: either implement vector
comparisons in the ARM backend (possibly by forwarding to their existing
code for vcond), or in the generic expansion code try using vcond if the
direct comparison opcode is not provided.

We can temporarily revert my patch, but I would like it to be temporary.
Since aarch64 seems to handle the same code just fine, maybe someone who
knows arm could copy the relevant code over?

Does my message make sense, do people have comments?


So what complicates things now (and to some extent pre-existed when you
used AVX512 which _could_ operate on boolean vectors) is that we
have split out the condition from VEC_COND_EXPR to separate stmts
but we do not expect backends to be able to code-generate the separate
form - instead we rely on the ISEL pass to trasform VEC_COND_EXPRs
to .VCOND[U] "merging" the compares again.  Now that process breaks
down once we have things like _9 = _1 | _3;  -  at some point I argued
that we should handle vector compares [and operations on boolean vectors]
as well in ISEL but then when it came up again for some reason I
disregarded that again.

Thus - we don't want to go back to fixing up the generic expansion code
(which looks at one instruction at a time and is restricted by TER single-use
restrictions).


Here, it would only be handling comparisons left over by ISEL because they
do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would
be more consistent, but then we may have to teach the vector lowering pass
about this.


Instead we want to deal with this in ISEL which should
behave more intelligently.  In the above case it might involve turning
the _1 and _3 defs into .VCOND [with different result type], doing
_9 in that type and then somehow dealing with _7 ... but this eventually
means undoing the match simplification that introduced the code?


For targets that do not have compact boolean vectors,
VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality.
Removing it to allow more simplifications makes sense, reintroducing it
for expansion can make sense as well, I think it is ok if the second one
reverses the first, but very locally, without having to propagate a change
of type to the uses. So on ARM we could turn _1 and _3 into .VCOND
producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or
does using bool vectors like that seem bad? (maybe they aren't guaranteed
to be equivalent to signed integer vectors with values 0 and -1 and we
need to query the target to know if that is the case, or introduce an
extra .VCOND)

Also, we only want to replace comparisons with .VCOND if the target
doesn't handle the comparison directly. For AVX512, we do want to produce
SImode bool vectors (for k* registers) and operate on them directly,
that's the whole point of introducing the bool vectors (if bool vectors
were only used to feed VEC_COND_EXPR and were all turned into .VCOND
before expansion, I don't see the point of specifying different types for
bool vectors for AVX512 vs non-AVX512, as it would make no difference on
what is passed to the backend).

I think targets should provide some number of operations, including for
instance bit_and and bit_ior on bool vectors, and be a bit consistent
about what they provide, it becomes unmanageable in the middle-end otherwise...

Re: FENV_ACCESS status

2020-08-07 Thread Marc Glisse
That said, you're the one doing the work and going with internal functions
is reasonable - I'm not sure to what extent optimization for FENV access
code will ever be possible (or wanted/expected).  So going more precise
might not have any advantage.


I think some optimizations are expected. For instance, not having to 
re-read the same number from memory many times just because there was an 
addition in between (which could write to fenv but that's it). Some may 
still want FMA (with a consistent rounding direction). For those (like me) 
who usually only care about rounding and not exceptions, making the 
operations pure would be great, and nothing says we cannot vectorize those 
rounded operations!
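
For instance (an illustrative example, not something from the branch), with the FP
environment modelled as plain memory, the second load of *p below cannot be reused
from the first one, because the addition is treated as clobbering memory:

  double f (const double *p, double x)
  {
    double a = *p + x;  /* at worst this writes the FP status flags */
    double b = *p;      /* we would like to CSE this with the load above */
    return a + b;
  }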


I am trying to be realistic with what I can achieve, but if you think the 
IFNs would paint us into a corner, then we can drop this approach.



You needed to guard SQRT - will you need to guard other math functions?
(round, etc.)


Maybe, but probably not many. I thought I might have to guard all of them 
(sin, cos, etc), but IIRC Joseph's comment seemed to imply that this 
wouldn't be necessary. I am likely missing FMA now...



If we need to keep the IFNs using memory state, they will count towards the
walk limits of the alias oracle even if they can be disambiguated against.
This will affect both compile time and optimizations.


Yes...


+  /* Careful not to end up with something like X - X, which could get
+ simplified.  */
+  if (!skip0 && already_protected (op1))

we're already relying on RTL not optimizing (x + 0.5) - 0.5 but since
that would involve association the simple X - X case might indeed
be optimized (but wouldn't that be a bug if it is not correct?)


Indeed we do not currently simplify X-X without -ffinite-math-only. 
However, I am trying to be safe, and whether we can simplify or not is 
something that depends on each operation (what the pragma said at that 
point in the source code), while flag_finite_math_only is at best per 
function.
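
As a small illustration (not part of the patch) of why x - x cannot be folded to
0.0 in general:

  #include <math.h>
  double g (double x) { return x - x; }
  /* g (INFINITY) and g (NAN) are both NaN, not 0.0, so the fold needs
     -ffinite-math-only (or a per-operation equivalent).  */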


--
Marc Glisse


Re: VEC_COND_EXPR optimizations v2

2020-08-07 Thread Marc Glisse

On Fri, 7 Aug 2020, Richard Biener wrote:


On Thu, Aug 6, 2020 at 8:07 PM Marc Glisse  wrote:


On Thu, 6 Aug 2020, Christophe Lyon wrote:


Was I on the right track configuring with
--target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16
then compiling without any special option?


Maybe you also need --with-float=hard, I don't remember if it's
implied by the 'hf' target suffix


Thanks! That's what I was missing to reproduce the issue. Now I can
reproduce it with just

typedef unsigned int vec __attribute__((vector_size(16)));
typedef int vi __attribute__((vector_size(16)));
vi f(vec a,vec b){
 return a==5 | b==7;
}

with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 
-O1

   _1 = a_5(D) == { 5, 5, 5, 5 };
   _3 = b_6(D) == { 7, 7, 7, 7 };
   _9 = _1 | _3;
   _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);

we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
false), while with -fdisable-tree-forwprop4 we do manage to expand

   _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 
112);

It doesn't make much sense to me that we can expand the more complicated
form and not the simpler form of the same operation (both compare a to 5
and produce a vector of -1 or 0 of the same size), especially when the
target has an instruction (vceq) that does just what we want.

Introducing boolean vectors was fine, but I think they should be real
types, that we can operate on, not be forced to appear only as the first
argument of a vcond.

I can think of 2 natural ways to improve things: either implement vector
comparisons in the ARM backend (possibly by forwarding to their existing
code for vcond), or in the generic expansion code try using vcond if the
direct comparison opcode is not provided.

We can temporarily revert my patch, but I would like it to be temporary.
Since aarch64 seems to handle the same code just fine, maybe someone who
knows arm could copy the relevant code over?

Does my message make sense, do people have comments?


So what complicates things now (and to some extent pre-existed when you
used AVX512 which _could_ operate on boolean vectors) is that we
have split out the condition from VEC_COND_EXPR to separate stmts
but we do not expect backends to be able to code-generate the separate
form - instead we rely on the ISEL pass to transform VEC_COND_EXPRs
to .VCOND[U] "merging" the compares again.  Now that process breaks
down once we have things like _9 = _1 | _3;  -  at some point I argued
that we should handle vector compares [and operations on boolean vectors]
as well in ISEL but then when it came up again for some reason I
disregarded that again.

Thus - we don't want to go back to fixing up the generic expansion code
(which looks at one instruction at a time and is restricted by TER single-use
restrictions).


Here, it would only be handling comparisons left over by ISEL because they 
do not feed a VEC_COND_EXPR, maybe not so bad. Handling it in ISEL would 
be more consistent, but then we may have to teach the vector lowering pass 
about this.



Instead we want to deal with this in ISEL which should
behave more intelligently.  In the above case it might involve turning
the _1 and _3 defs into .VCOND [with different result type], doing
_9 in that type and then somehow dealing with _7 ... but this eventually
means undoing the match simplification that introduced the code?


For targets that do not have compact boolean vectors, 
VEC_COND_EXPR<*,-1,0> does nothing, it is just there as a technicality. 
Removing it to allow more simplifications makes sense, reintroducing it 
for expansion can make sense as well, I think it is ok if the second one 
reverses the first, but very locally, without having to propagate a change 
of type to the uses. So on ARM we could turn _1 and _3 into .VCOND 
producing bool:32 vectors, and replace _7 with a VIEW_CONVERT_EXPR. Or 
does using bool vectors like that seem bad? (maybe they aren't guaranteed 
to be equivalent to signed integer vectors with values 0 and -1 and we 
need to query the target to know if that is the case, or introduce an 
extra .VCOND)


Also, we only want to replace comparisons with .VCOND if the target 
doesn't handle the comparison directly. For AVX512, we do want to produce 
SImode bool vectors (for k* registers) and operate on them directly, 
that's the whole point of introducing the bool vectors (if bool vectors 
were only used to feed VEC_COND_EXPR and were all turned into .VCOND 
before expansion, I don't see the point of specifying different types for 
bool vectors for AVX512 vs non-AVX512, as it would make no difference on 
what is passed to the backend).


I think targets should provide some number of operations, including for 
instance bit_and and bit_ior on bool vectors, and be a bit consistent 
about what they provide, it becomes unmanageable in the middle-end 
otherwise...


--
Marc Glisse


Re: VEC_COND_EXPR optimizations v2

2020-08-06 Thread Marc Glisse

On Thu, 6 Aug 2020, Christophe Lyon wrote:


Was I on the right track configuring with
--target=arm-none-linux-gnueabihf --with-cpu=cortex-a9
--with-fpu=neon-fp16
then compiling without any special option?


Maybe you also need --with-float=hard, I don't remember if it's
implied by the 'hf' target suffix


Thanks! That's what I was missing to reproduce the issue. Now I can
reproduce it with just

typedef unsigned int vec __attribute__((vector_size(16)));
typedef int vi __attribute__((vector_size(16)));
vi f(vec a,vec b){
return a==5 | b==7;
}

with -fdisable-tree-forwprop1 -fdisable-tree-forwprop2 -fdisable-tree-forwprop3 
-O1

  _1 = a_5(D) == { 5, 5, 5, 5 };
  _3 = b_6(D) == { 7, 7, 7, 7 };
  _9 = _1 | _3;
  _7 = .VCOND (_9, { 0, 0, 0, 0 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 107);

we fail to expand the equality comparison (expand_vec_cmp_expr_p returns
false), while with -fdisable-tree-forwprop4 we do manage to expand

  _2 = .VCONDU (a_5(D), { 5, 5, 5, 5 }, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }, 
112);

It doesn't make much sense to me that we can expand the more complicated
form and not the simpler form of the same operation (both compare a to 5
and produce a vector of -1 or 0 of the same size), especially when the
target has an instruction (vceq) that does just what we want.

Introducing boolean vectors was fine, but I think they should be real 
types, that we can operate on, not be forced to appear only as the first 
argument of a vcond.


I can think of 2 natural ways to improve things: either implement vector 
comparisons in the ARM backend (possibly by forwarding to their existing 
code for vcond), or in the generic expansion code try using vcond if the 
direct comparison opcode is not provided.


We can temporarily revert my patch, but I would like it to be temporary. 
Since aarch64 seems to handle the same code just fine, maybe someone who 
knows arm could copy the relevant code over?


Does my message make sense, do people have comments?

--
Marc Glisse


Re: VEC_COND_EXPR optimizations v2

2020-08-06 Thread Marc Glisse

On Thu, 6 Aug 2020, Christophe Lyon wrote:


On Thu, 6 Aug 2020 at 11:06, Marc Glisse  wrote:


On Thu, 6 Aug 2020, Christophe Lyon wrote:


2020-08-05  Marc Glisse  

PR tree-optimization/95906
PR target/70314
* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
(op (c ? a : b)): Update to match the new transformations.

* gcc.dg/tree-ssa/andnot-2.c: New file.
* gcc.dg/tree-ssa/pr95906.c: Likewise.
* gcc.target/i386/pr70314.c: Likewise.



I think this patch is causing several ICEs on arm-none-linux-gnueabihf
--with-cpu cortex-a9 --with-fpu neon-fp16:
 Executed from: gcc.c-torture/compile/compile.exp
   gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
compiler error)
   gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
 Executed from: gcc.dg/dg.exp
   gcc.dg/pr87746.c (internal compiler error)
 Executed from: gcc.dg/tree-ssa/tree-ssa.exp
   gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)


I tried a cross from x86_64-linux with current master

.../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ 
--with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16
make

it stops at some point with an error, but I have xgcc and cc1 in
build/gcc.

I copied 2 of the testcases and compiled

./xgcc pr87746.c -Ofast -S -B.
./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.

without getting any ICE.


Sorry for the delay, I had to reproduce the problem manually.


Is there a machine on the compile farm where this is easy to reproduce?

I don't think there is any arm machine in the compile farm.


Or could you attach the .optimized dump that corresponds to the
backtrace below? It looks like we end up with a comparison with an
unexpected return type.



I've compiled pr87746.c with -fdump-tree-ifcvt-details-blocks-details,
here is the log.
Is that what you need?


Thanks.
The one from -fdump-tree-optimized would be closer to the ICE.
Though it would also be convenient to know which stmt is being expanded 
when we ICE, etc.


Was I on the right track configuring with 
--target=arm-none-linux-gnueabihf --with-cpu=cortex-a9 
--with-fpu=neon-fp16

then compiling without any special option?


Thanks,

Christophe


 Executed from: gcc.dg/vect/vect.exp
   gcc.dg/vect/pr59591-1.c (internal compiler error)
   gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/pr86927.c (internal compiler error)
   gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/slp-cond-5.c (internal compiler error)
   gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/vect-23.c (internal compiler error)
   gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/vect-24.c (internal compiler error)
   gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
   gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
compiler error)

Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
-finline-functions
during RTL pass: expand
/gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
compiler error: in do_store_flag, at expr.c:12259
0x8feb26 do_store_flag
   /gcc/expr.c:12259
0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9617
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e

Re: VEC_COND_EXPR optimizations v2

2020-08-06 Thread Marc Glisse

On Thu, 6 Aug 2020, Richard Biener wrote:


On Thu, Aug 6, 2020 at 10:17 AM Christophe Lyon
 wrote:


Hi,


On Wed, 5 Aug 2020 at 16:24, Richard Biener via Gcc-patches
 wrote:


On Wed, Aug 5, 2020 at 3:33 PM Marc Glisse  wrote:


New version that passed bootstrap+regtest during the night.

When vector comparisons were forced to use vec_cond_expr, we lost a number of
optimizations (my fault for not adding enough testcases to prevent that).
This patch tries to unwrap vec_cond_expr a bit so some optimizations can
still happen.

I wasn't planning to add all those transformations together, but adding one
caused a regression, whose fix introduced a second regression, etc.

Restricting to constant folding would not be sufficient, we also need at
least things like X|0 or X The transformations are quite conservative
with :s and folding only if everything simplifies, we may want to relax
this later. And of course we are going to miss things like a?b:c + a?c:b
-> b+c.

In terms of number of operations, some transformations turning 2
VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look
like a gain... I expect the bit_not disappears in most cases, and
VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.

I am a bit confused that with avx512 we get types like "vector(4)
<signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
-1?), but that doesn't matter for this patch.


OK.

Thanks,
Richard.


2020-08-05  Marc Glisse  

PR tree-optimization/95906
PR target/70314
* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
(op (c ? a : b)): Update to match the new transformations.

* gcc.dg/tree-ssa/andnot-2.c: New file.
* gcc.dg/tree-ssa/pr95906.c: Likewise.
* gcc.target/i386/pr70314.c: Likewise.



I think this patch is causing several ICEs on arm-none-linux-gnueabihf
--with-cpu cortex-a9 --with-fpu neon-fp16:
  Executed from: gcc.c-torture/compile/compile.exp
gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
compiler error)
gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
  Executed from: gcc.dg/dg.exp
gcc.dg/pr87746.c (internal compiler error)
  Executed from: gcc.dg/tree-ssa/tree-ssa.exp
gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)
  Executed from: gcc.dg/vect/vect.exp
gcc.dg/vect/pr59591-1.c (internal compiler error)
gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
gcc.dg/vect/pr86927.c (internal compiler error)
gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
gcc.dg/vect/slp-cond-5.c (internal compiler error)
gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
gcc.dg/vect/vect-23.c (internal compiler error)
gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
gcc.dg/vect/vect-24.c (internal compiler error)
gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
compiler error)

Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
-finline-functions
during RTL pass: expand
/gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
compiler error: in do_store_flag, at expr.c:12259
0x8feb26 do_store_flag
/gcc/expr.c:12259
0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
/gcc/expr.c:9617
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/gcc/expr.c:10159
0x91174e expand_expr
/gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
/gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
/gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/gcc/expr.c:10159
0x91174e expand_expr
/gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
/gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
/gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/gcc/expr.c:10159
0x91174e expand_expr
/gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
/gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
/gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
ex

Re: VEC_COND_EXPR optimizations v2

2020-08-06 Thread Marc Glisse

On Thu, 6 Aug 2020, Christophe Lyon wrote:


2020-08-05  Marc Glisse  

PR tree-optimization/95906
PR target/70314
* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
(op (c ? a : b)): Update to match the new transformations.

* gcc.dg/tree-ssa/andnot-2.c: New file.
* gcc.dg/tree-ssa/pr95906.c: Likewise.
* gcc.target/i386/pr70314.c: Likewise.



I think this patch is causing several ICEs on arm-none-linux-gnueabihf
--with-cpu cortex-a9 --with-fpu neon-fp16:
 Executed from: gcc.c-torture/compile/compile.exp
   gcc.c-torture/compile/20160205-1.c   -O3 -fomit-frame-pointer
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal
compiler error)
   gcc.c-torture/compile/20160205-1.c   -O3 -g  (internal compiler error)
 Executed from: gcc.dg/dg.exp
   gcc.dg/pr87746.c (internal compiler error)
 Executed from: gcc.dg/tree-ssa/tree-ssa.exp
   gcc.dg/tree-ssa/ifc-cd.c (internal compiler error)


I tried a cross from x86_64-linux with current master

.../configure --target=arm-none-linux-gnueabihf --enable-languages=c,c++ 
--with-system-zlib --disable-nls --with-cpu=cortex-a9 --with-fpu=neon-fp16
make

it stops at some point with an error, but I have xgcc and cc1 in
build/gcc.

I copied 2 of the testcases and compiled

./xgcc pr87746.c -Ofast -S -B.
./xgcc -O3 -fdump-tree-ifcvt-details-blocks-details ifc-cd.c -S -B.

without getting any ICE.

Is there a machine on the compile farm where this is easy to reproduce?
Or could you attach the .optimized dump that corresponds to the
backtrace below? It looks like we end up with a comparison with an
unexpected return type.


 Executed from: gcc.dg/vect/vect.exp
   gcc.dg/vect/pr59591-1.c (internal compiler error)
   gcc.dg/vect/pr59591-1.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/pr86927.c (internal compiler error)
   gcc.dg/vect/pr86927.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/slp-cond-5.c (internal compiler error)
   gcc.dg/vect/slp-cond-5.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/vect-23.c (internal compiler error)
   gcc.dg/vect/vect-23.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/vect-24.c (internal compiler error)
   gcc.dg/vect/vect-24.c -flto -ffat-lto-objects (internal compiler error)
   gcc.dg/vect/vect-cond-reduc-6.c (internal compiler error)
   gcc.dg/vect/vect-cond-reduc-6.c -flto -ffat-lto-objects (internal
compiler error)

Backtrace for gcc.c-torture/compile/20160205-1.c   -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
-finline-functions
during RTL pass: expand
/gcc/testsuite/gcc.c-torture/compile/20160205-1.c:2:5: internal
compiler error: in do_store_flag, at expr.c:12259
0x8feb26 do_store_flag
   /gcc/expr.c:12259
0x900201 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9617
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282
0x91174e expand_operands(tree_node*, tree_node*, rtx_def*, rtx_def**,
rtx_def**, expand_modifier)
   /gcc/expr.c:8065
0x8ff543 expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
expand_modifier)
   /gcc/expr.c:9950
0x908cd0 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
   /gcc/expr.c:10159
0x91174e expand_expr
   /gcc/expr.h:282

Christophe


--
Marc Glisse


FENV_ACCESS status

2020-08-05 Thread Marc Glisse

Hello,

I updated the patch discussed in 
https://patchwork.ozlabs.org/project/gcc/patch/alpine.deb.2.02.1906221743430.16...@grove.saclay.inria.fr/ 
and pushed it as something like refs/users/glisse/heads/fenv (first user 
branch in gcc's git, I hope it worked). I am also attaching the diff here.


I managed to compile and run real-world code with it, which is a good sign 
:-)


As should be obvious looking at the diff, there is a lot of work left. 
Experts may also find much better ways to rewrite several parts of the 
patch.


The option is called -ffenv-access so it doesn't interfere with 
-frounding-math, at least until we have something good enough to replace 
-frounding-math without too much performance regression.
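
To give an idea of the kind of code the option is meant to make well defined (an
illustration only, not a test from the branch):

  #include <fenv.h>
  double add_upward (double a, double b)
  {
    /* With -ffenv-access the addition must not be constant folded, hoisted,
       or moved across the fesetround calls.  */
    fesetround (FE_UPWARD);
    double r = a + b;
    fesetround (FE_TONEAREST);
    return r;
  }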


I switched to hex float constants for DBL_MAX and others for C99+, I don't 
care about making fenv_access work in prehistoric modes. On the other 
hand, since I haven't started on fenv_round, this is probably useless for 
now.


Several changes, in particular the constexpr stuff, were needed to parse 
standard headers; otherwise I would have delayed the implementation.


Currently the floating point environment is represented by "memory" in 
general. Refining it so the compiler knows that storing a float does not 
change the rounding mode (for instance) should wait, in my opinion.


Conversions look like
.FENV_CONVERT (arg, (target_type*)0, 0)
the pointer is there so we know the target type, even if the lhs 
disappears at some point. The last 0 is the same as for all the others, a 
place to store options about the operation (do we care about rounding, 
about exceptions, etc), it is just a placeholder for now. I could rename 
it to .FENV_NOP since we seem to generate NOP usually, but it looked 
strange to me.
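
As a trivial illustration (not a testcase from the branch), a narrowing conversion
is one such operation: it rounds according to the dynamic mode and may raise
inexact, so with -ffenv-access

  float narrow (double x) { return (float) x; }

is the kind of source that goes through .FENV_CONVERT instead of a plain conversion.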


I did not do anything for templates in C++. As long as we have a constant 
global flag, it doesn't matter, but as soon as we will have a pragma, 
things will get messy, we will need to remember what the mode was when 
parsing, so we can use it at instantiation time... (or just declare that 
the pragma doesn't work with templates in a first version)


I am trying to have enough infrastructure in place so that the 
functionality is useful, and also so that implementing other pieces 
(parsing the pragma, C front-end, gimple optimizations, target hook for 
the asm string, opcode and target optimization, simd, etc) become 
independent and can be done by different people. It is unlikely that I can 
find the time to do everything. If other people want to contribute or even 
take over (assuming the branch does not look hopelessly bad to them), that 
would be great! That's also why I pushed it as a branch.


Apart from the obvious (making sure it bootstraps, running the testsuite, 
adding a few tests), what missing pieces do you consider a strict 
requirement for this to have a chance to reach master one day as an 
experimental option?


--
Marc Glisse

commit 4adb494e88323bf41ee2c0871caa2323fa2aca06
Author: Marc Glisse 
Date:   Wed Aug 5 18:49:57 2020 +0200

Introduce -ffenv-access

diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 74ecca8de8e..4c4d4f3d563 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -1635,16 +1635,23 @@ lazy_hex_fp_value (cpp_reader *, cpp_macro *macro, unsigned num)
 {
   REAL_VALUE_TYPE real;
   char dec_str[64], buf1[256];
+  size_t len;
 
   gcc_checking_assert (num < lazy_hex_fp_value_count);
 
-  real_from_string (&real, lazy_hex_fp_values[num].hex_str);
-  real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str),
-			lazy_hex_fp_values[num].digits, 0,
-			lazy_hex_fp_values[num].mode);
+  if (!flag_isoc99)
+{
+  real_from_string (&real, lazy_hex_fp_values[num].hex_str);
+  real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str),
+lazy_hex_fp_values[num].digits, 0,
+lazy_hex_fp_values[num].mode);
+
+  len = sprintf (buf1, "%s%s", dec_str, lazy_hex_fp_values[num].fp_suffix);
+}
+  else
+len = sprintf (buf1, "%s%s", lazy_hex_fp_values[num].hex_str,
+		   lazy_hex_fp_values[num].fp_suffix);
 
-  size_t len
-= sprintf (buf1, "%s%s", dec_str, lazy_hex_fp_values[num].fp_suffix);
   gcc_assert (len < sizeof (buf1));
   for (unsigned idx = 0; idx < macro->count; idx++)
 if (macro->exp.tokens[idx].type == CPP_NUMBER)
@@ -1701,13 +1708,16 @@ builtin_define_with_hex_fp_value (const char *macro,
  it's easy to get the exact correct value), parse it as a real,
  then print it back out as decimal.  */
 
-  real_from_string (&real, hex_str);
-  real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str), digits, 0,
-			TYPE_MODE (type));
+  if (!flag_isoc99)
+{
+  real_from_string (&real, hex_str);
+  real_to_decimal_for_mode (dec_str, &real, sizeof (dec_str), digits, 0,
+TYPE_MODE (type));
+}
 
   /* Assemble the macro in the following fashion
  macro = fp_cast [dec_str fp_suffix] */
-  sprintf (buf2, "%s%s&

VEC_COND_EXPR optimizations v2

2020-08-05 Thread Marc Glisse

New version that passed bootstrap+regtest during the night.

When vector comparisons were forced to use vec_cond_expr, we lost a number of 
optimizations (my fault for not adding enough testcases to prevent that). 
This patch tries to unwrap vec_cond_expr a bit so some optimizations can 
still happen.


I wasn't planning to add all those transformations together, but adding one 
caused a regression, whose fix introduced a second regression, etc.


Restricting to constant folding would not be sufficient, we also need at 
least things like X|0 or X The transformations are quite conservative 
with :s and folding only if everything simplifies, we may want to relax 
this later. And of course we are going to miss things like a?b:c + a?c:b 
-> b+c.


In terms of number of operations, some transformations turning 2 
VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not look 
like a gain... I expect the bit_not disappears in most cases, and 
VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.


I am a bit confused that with avx512 we get types like "vector(4) 
<signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not 
-1?), but that doesn't matter for this patch.


2020-08-05  Marc Glisse  

PR tree-optimization/95906
PR target/70314
* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.
(op (c ? a : b)): Update to match the new transformations.

* gcc.dg/tree-ssa/andnot-2.c: New file.
* gcc.dg/tree-ssa/pr95906.c: Likewise.
* gcc.target/i386/pr70314.c: Likewise.

--
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd
index a052c9e3dbc..f9297fcadbe 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3436,20 +3436,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (integer_zerop (@0))
@2)))
 
-/* Sink unary operations to constant branches, but only if we do fold it to
-   constants.  */
+#if GIMPLE
+/* Sink unary operations to branches, but only if we do fold both.  */
 (for op (negate bit_not abs absu)
  (simplify
-  (op (vec_cond @0 VECTOR_CST@1 VECTOR_CST@2))
-  (with
-   {
- tree cst1, cst2;
- cst1 = const_unop (op, type, @1);
- if (cst1)
-   cst2 = const_unop (op, type, @2);
-   }
-   (if (cst1 && cst2)
-    (vec_cond @0 { cst1; } { cst2; })))))
+  (op (vec_cond:s @0 @1 @2))
+  (vec_cond @0 (op! @1) (op! @2))))
+
+/* Sink binary operation to branches, but only if we can fold it.  */
+(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor
+	 rdiv trunc_div ceil_div floor_div round_div
+	 trunc_mod ceil_mod floor_mod round_mod min max)
+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (vec_cond @0 (op! @1 @3) (op! @2 @4)))
+
+/* (c ? a : b) op d  -->  c ? (a op d) : (b op d) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) @3)
+  (vec_cond @0 (op! @1 @3) (op! @2 @3)))
+ (simplify
+  (op @3 (vec_cond:s @0 @1 @2))
+  (vec_cond @0 (op! @3 @1) (op! @3 @2))))
+#endif
+
+/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_and @0 @3) @1 @2)))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_ior @0 @3) @1 @2)))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_ior @0 (bit_not @3)) @2 @1)))
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2)
+ (if (types_match (@0, @3))
+  (vec_cond (bit_and @0 (bit_not @3)) @2 @1)))
+
+/* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
+ (if (types_match (@0, @1))
+  (vec_cond (bit_and @0 @1) @2 @3)))
+(simplify
+ (vec_cond @0 @2 (vec_cond:s @1 @2 @3))
+ (if (types_match (@0, @1))
+  (vec_cond (bit_ior @0 @1) @2 @3)))
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @2)
+ (if (types_match (@0, @1))
+  (vec_cond (bit_ior (bit_not @0) @1) @2 @3)))
+(simplify
+ (vec_cond @0 @3 (vec_cond:s @1 @2 @3))
+ (if (types_match (@0, @1))
+  (vec_cond (bit_and (bit_not @0) @1) @2 @3)))
 
 /* Simplification moved from fold_cond_expr_with_comparison.  It may also
be extended.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
new file mode 100644
index 000..e0955ce3ffd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */
+
+typedef long vec __attribute__((vector_size(16)));
+vec f(vec x){
+  vec y = x < 10;
+  return y & (y == 0);
+}
+
+/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c
new file m

Re: [PATCH] Amend match.pd syntax with force-simplified results

2020-08-04 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Biener wrote:


This adds a ! marker to result expressions that should simplify
(and if not fail the simplification).  This can for example be
used like

(simplify
 (plus (vec_cond:s @0 @1 @2) @3)
 (vec_cond @0 (plus! @1 @3) (plus! @2 @3)))

to make the simplification only apply in case both plus operations
in the result end up simplified to a simple operand.


(replacing plus with bit_ior)
The generated code in gimple_simplify_BIT_IOR_EXPR may look like

  {
tree _o1[2], _r1;
_o1[0] = captures[2];
_o1[1] = captures[4];
gimple_match_op tem_op (res_op->cond.any_else (), BIT_IOR_EXPR, TREE_TYPE 
(_o1[0]), _o1[0], _o1[1]);
tem_op.resimplify (lseq, valueize);
    _r1 = maybe_push_res_to_seq (&tem_op, NULL);
if (!_r1) return false;
res_op->ops[1] = _r1;
  }

In particular, it contains this "return false" which directly exits the 
function, instead of just giving up on this particular transformation and 
trying the next one. I'll reorder my transformations to work around this, 
but it looks like a pre-existing limitation.


--
Marc Glisse


Re: Simplify X * C1 == C2 with undefined overflow

2020-08-04 Thread Marc Glisse

On Mon, 3 Aug 2020, Richard Biener wrote:


On Sat, Aug 1, 2020 at 9:29 AM Marc Glisse  wrote:


Hello,

this transformation is quite straightforward, without overflow, 3*X==15 is
the same as X==5 and 3*X==5 cannot happen. Adding a single_use restriction
for the first case didn't seem necessary, although of course it can
slightly increase register pressure in some cases.

Bootstrap+regtest on x86_64-pc-linux-gnu.


OK with using constant_boolean_node (cmp == NE_EXPR, type).

ISTR we had the x * 0 == CST simplification somewhere
but maybe it was x * y == 0 ...  ah, yes:

/* Transform comparisons of the form X * C1 CMP 0 to X CMP 0 in the
  signed arithmetic case.  That form is created by the compiler
  often enough for folding it to be of value.  One example is in
  computing loop trip counts after Operator Strength Reduction.  */
(for cmp (simple_comparison)
scmp (swapped_simple_comparison)
(simplify
 (cmp (mult@3 @0 INTEGER_CST@1) integer_zerop@2)

As it is placed after your pattern it will be never matched I think
(but we don't warn because of INTEGER_CST vs. integer_zerop).

But I think your pattern subsumes it besides of the X * 0 == 0
compare - oh, and the other pattern also handles relational compares
(those will still trigger).

Maybe place the patterns next to each other?  Also see whether
moving yours after the above will cease the testcases to be handled
because it's no longer matched - if not this might be the better
order.


I moved it after, it still works, so I pushed the patch. Note that the 
other transformation has a single_use restriction, while this one doesn't, 
that's not very consistent, but also hopefully not so important...


--
Marc Glisse


Simplify X * C1 == C2 with undefined overflow

2020-08-01 Thread Marc Glisse

Hello,

this transformation is quite straightforward, without overflow, 3*X==15 is 
the same as X==5 and 3*X==5 cannot happen. Adding a single_use restriction 
for the first case didn't seem necessary, although of course it can 
slightly increase register pressure in some cases.


Bootstrap+regtest on x86_64-pc-linux-gnu.

2020-08-03  Marc Glisse  

PR tree-optimization/95433
* match.pd (X * C1 == C2): New transformation.

* gcc.c-torture/execute/pr23135.c: Add -fwrapv to avoid
undefined behavior.
* gcc.dg/tree-ssa/pr95433.c: New file.

--
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd
index a052c9e3dbc..78fd8cf5d9e 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1578,6 +1578,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 	&& wi::neg_p (wi::to_wide (@1), TYPE_SIGN (TREE_TYPE (@1
 (cmp @2 @0))
 
+/* For integral types with undefined overflow fold
+   x * C1 == C2 into x == C2 / C1 or false.  */
+(for cmp (eq ne)
+ (simplify
+  (cmp (mult @0 INTEGER_CST@1) INTEGER_CST@2)
+  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+   && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0))
+   && wi::to_wide (@1) != 0)
+   (with { widest_int quot; }
+(if (wi::multiple_of_p (wi::to_widest (@2), wi::to_widest (@1),
+			TYPE_SIGN (TREE_TYPE (@0)), &quot))
+ (cmp @0 { wide_int_to_tree (TREE_TYPE (@0), quot); })
+ { build_int_cst (type, cmp == NE_EXPR); })))))
+
 /* (X - 1U) <= INT_MAX-1U into (int) X > 0.  */
 (for cmp (le gt)
  icmp (gt le)
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr23135.c b/gcc/testsuite/gcc.c-torture/execute/pr23135.c
index e740ff52874..ef9b7efc9c4 100644
--- a/gcc/testsuite/gcc.c-torture/execute/pr23135.c
+++ b/gcc/testsuite/gcc.c-torture/execute/pr23135.c
@@ -1,7 +1,7 @@
 /* Based on execute/simd-1.c, modified by joern.renne...@st.com to
trigger a reload bug.  Verified for gcc mainline from 20050722 13:00 UTC
for sh-elf -m4 -O2.  */
-/* { dg-options "-Wno-psabi" } */
+/* { dg-options "-Wno-psabi -fwrapv" } */
 /* { dg-add-options stack_size } */
 
 #ifndef STACK_SIZE
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95433.c b/gcc/testsuite/gcc.dg/tree-ssa/pr95433.c
new file mode 100644
index 000..4e161ee26cc
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr95433.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-optimized" } */
+
+int f(int x){return x*7==17;}
+int g(int x){return x*3==15;}
+
+/* { dg-final { scan-tree-dump "return 0;" "optimized" } } */
+/* { dg-final { scan-tree-dump "== 5;" "optimized" } } */


Re: [PATCH] Amend match.pd syntax with force-simplified results

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Biener wrote:


Or we simply automatically disable those patterns for GENERIC
(though that would probably be unexpected).


Since the definition is not clear, whatever we do will be unexpected at 
least in some cases. Disabling it for GENERIC for now seems ok to me, we 
can always extend it later...


--
Marc Glisse


Re: [PATCH] Amend match.pd syntax with force-simplified results

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Biener wrote:


This adds a ! marker to result expressions that should simplify
(and if not fail the simplification).  This can for example be
used like

(simplify
 (plus (vec_cond:s @0 @1 @2) @3)
 (vec_cond @0 (plus! @1 @3) (plus! @2 @3)))

to make the simplification only apply in case both plus operations
in the result end up simplified to a simple operand.


Thanks.

The ! syntax seems fine. If we run out, we can always introduce an 
attribute-like syntax: (plus[force_leaf] @1 @2), but we aren't there yet. 
And either this feature will see a lot of use and deserve its short 
syntax, or it won't and it will be easy to reclaim '!' for something else.


I am wondering about GENERIC. IIUC, '!' is ignored for GENERIC. Should we 
protect some/most uses of this syntax with #if GIMPLE? In my case, I think 
resimplify might simply return non-0 because it swapped the operands, 
which should not be sufficient to enable the transformation.


--
Marc Glisse


Re: VEC_COND_EXPR optimizations

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Marc Glisse wrote:


On Fri, 31 Jul 2020, Richard Sandiford wrote:


Marc Glisse  writes:

On Fri, 31 Jul 2020, Richard Sandiford wrote:


Marc Glisse  writes:

+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @1, @3);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @2, @4);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+(vec_cond @0 { rhs1; } { rhs2; })
+#endif


This one looks dangerous for potentially-trapping ops.


I would expect the operation not to be folded if it can trap. Is that too
optimistic?


Not sure TBH.  I was thinking of “trapping” in the sense of raising
an IEEE exception, rather than in the could-throw/must-end-bb sense.


That's what I understood from your message :-)


I thought match.pd applied to things like FP addition as normal and
it was up to individual patterns to check the appropriate properties.


Yes, and in this case I am delegating that to fold_binary, which already 
performs this check.


I tried with this C++ program

typedef double vecf __attribute__((vector_size(16)));
typedef long long veci __attribute__((vector_size(16)));
vecf f(veci c){
 return (c?1.:2.)/(c?3.:7.);
}

the folding happens by default, but not with -frounding-math, which seems 
like exactly what we want.


That was for rounding. -ftrapping-math does not disable folding of

typedef double vecf __attribute__((vector_size(16)));
typedef long long veci __attribute__((vector_size(16)));
vecf f(veci c){
vecf z={};
  return (z+1)/(z+3);
}

despite a possible inexact flag, so it doesn't disable vec_cond_expr 
folding either.


(not sure we want to fix this unless -fno-trapping-math becomes the 
default)


--
Marc Glisse


Re: VEC_COND_EXPR optimizations

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Sandiford wrote:


Marc Glisse  writes:

On Fri, 31 Jul 2020, Richard Sandiford wrote:


Marc Glisse  writes:

+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @1, @3);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @2, @4);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+(vec_cond @0 { rhs1; } { rhs2; })
+#endif


This one looks dangerous for potentially-trapping ops.


I would expect the operation not to be folded if it can trap. Is that too
optimistic?


Not sure TBH.  I was thinking of “trapping” in the sense of raising
an IEEE exception, rather than in the could-throw/must-end-bb sense.


That's what I understood from your message :-)


I thought match.pd applied to things like FP addition as normal and
it was up to individual patterns to check the appropriate properties.


Yes, and in this case I am delegating that to fold_binary, which already 
performs this check.


I tried with this C++ program

typedef double vecf __attribute__((vector_size(16)));
typedef long long veci __attribute__((vector_size(16)));
vecf f(veci c){
  return (c?1.:2.)/(c?3.:7.);
}

the folding happens by default, but not with -frounding-math, which seems 
like exactly what we want.


--
Marc Glisse


Re: VEC_COND_EXPR optimizations

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Biener wrote:


On Fri, Jul 31, 2020 at 1:39 PM Richard Biener
 wrote:


On Fri, Jul 31, 2020 at 1:35 PM Richard Biener
 wrote:


On Thu, Jul 30, 2020 at 9:49 AM Marc Glisse  wrote:


When vector comparisons were forced to use vec_cond_expr, we lost a number
of optimizations (my fault for not adding enough testcases to prevent
that). This patch tries to unwrap vec_cond_expr a bit so some
optimizations can still happen.

I wasn't planning to add all those transformations together, but adding
one caused a regression, whose fix introduced a second regression, etc.

Using a simple fold_binary internally looks like an ok compromise to me.
It remains cheap enough (not recursive, and vector instructions are not
that frequent), while still allowing more than const_binop (X|0 or X for
instance). The transformations are quite conservative with :s and folding
only if everything simplifies, we may want to relax this later. And of
course we are going to miss things like a?b:c + a?c:b -> b+c.

In terms of number of operations, some transformations turning 2
VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not
look like a gain... I expect the bit_not disappears in most cases, and
VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.

I am a bit confused that with avx512 we get types like "vector(4)
<signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not
-1?), but that doesn't matter for this patch.

Regtest+bootstrap on x86_64-pc-linux-gnu


+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @1, @3);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @2, @3);

ICK.  I guess a more match-and-simplify way would be


I was focused on avoiding recursion, but indeed that's independent, I 
could have set a trivial valueize function for that.



   (with
{
  tree rhs1, rhs2;
  gimple_match_op op (gimple_match_cond::UNCOND, op,
  type, @1, @3);
  if (op.resimplify (NULL, valueize)


Oh, so you would be ok with something that recurses without limit? I 
thought we were avoiding this because it may allow some compile time 
explosion with pathological examples.



  && gimple_simplified_result_is_gimple_val (op))
   {
 rhs1 = op.ops[0];
 ... other operand ...
   }

now in theory we could invent some new syntax for this, like

 (simplify
  (op (vec_cond:s @0 @1 @2) @3)
  (vec_cond @0 (op:x @1 @3) (op:x @2 @3)))

and pick something better instead of :x (:s is taken,
would be 'simplified', :c is taken would be 'constexpr', ...).

_Maybe_ just

 (simplify
  (op (vec_cond:s @0 @1 @2) @3)
  (vec_cond:x @0 (op @1 @3) (op @2 @3)))

which would have the same practical meaning as passing
NULL for the seq argument to simplification - do not allow
any intermediate stmt to be generated.


Note I specifically do not like those if (it-simplifies) checks
because we already would code-generate those anyway.  For

(simplify
  (plus (vec_cond:s @0 @1 @2) @3)
  (vec_cond @0 (plus @1 @3) (plus @2 @3)))

we get

res_op->set_op (VEC_COND_EXPR, type, 3);
res_op->ops[0] = captures[1];
res_op->ops[0] = unshare_expr (res_op->ops[0]);
{
  tree _o1[2], _r1;
  _o1[0] = captures[2];
  _o1[1] = captures[4];
  gimple_match_op tem_op (res_op->cond.any_else
(), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
  tem_op.resimplify (lseq, valueize);
  _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (***)
  if (!_r1) return false;
  res_op->ops[1] = _r1;
}
{
  tree _o1[2], _r1;
  _o1[0] = captures[3];
  _o1[1] = captures[4];
  gimple_match_op tem_op (res_op->cond.any_else
(), PLUS_EXPR, TREE_TYPE (_o1[0]), _o1[0], _o1[1]);
  tem_op.resimplify (lseq, valueize);
  _r1 = maybe_push_res_to_seq (&tem_op, lseq);  (***)
  if (!_r1) return false;
  res_op->ops[2] = _r1;
}
res_op->resimplify (lseq, valueize);
return true;

and the only change required would be to pass NULL to maybe_push_res_to_seq
here instead of lseq at the (***) marked points.


(simplify
 (plus (vec_cond:s @0 @1 @2) @3)
 (vec_cond:l @0 (plus @1 @3) (plus @2 @3)))

'l' for 'force leaf'.  I'll see if I can quickly come up with a patch.


Does it have a clear meaning for GENERIC? I guess that's probably not such 
a big problem.


There are a number of transformations that we would like to perform "if 
 simplifies", but I don't know if it will always have exactly 
this form

Re: VEC_COND_EXPR optimizations

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Biener wrote:


+/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
+ (vec_cond (bit_and @0 @3) @1 @2))


Does something check automatically that @0 and @3 have compatible types?


@0 should always have a vector boolean type and thus will not be generally
compatible with @3.  But OTOH then when you see (vec_cond (vec_cond ...
then @3 must be vector boolean as well...

But in theory with AVX512 the inner vec_cond could have a SImode
condition @0 producing a regular V4SImode vector mask for an outer
AVX512 SSE-style vec-cond and you then would get a mismatch.


Ah, I thought the SSE-style vec_cond was impossible in AVX512 mode, at 
least I couldn't generate one in a few tests, but I didn't try very hard.



So indeed better add a type compatibility check.


Ok, it can't hurt.

--
Marc Glisse


Re: VEC_COND_EXPR optimizations

2020-07-31 Thread Marc Glisse

On Fri, 31 Jul 2020, Richard Sandiford wrote:


Marc Glisse  writes:

+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @1, @3);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @2, @4);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+(vec_cond @0 { rhs1; } { rhs2; })
+#endif


This one looks dangerous for potentially-trapping ops.


I would expect the operation not to be folded if it can trap. Is that too 
optimistic?



+/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
+ (vec_cond (bit_and @0 @3) @1 @2))


Does something check automatically that @0 and @3 have compatible types?


My memory of this dates from before the avx512-related changes, but at 
least at the time, the type of the condition in vec_cond_expr had to have 
the same size and number of elements as the result, and have signed 
integral elements. Now the size constraint has changed, but it still looks 
like for a given number of elements and size (this last one ignored for 
avx512), there is essentially a single type that can appear as condition. 
Is this wrong for x86? For SVE?


I could add some types_match conditions if that seems too unsafe...

--
Marc Glisse


VEC_COND_EXPR optimizations

2020-07-30 Thread Marc Glisse
When vector comparisons were forced to use vec_cond_expr, we lost a number 
of optimizations (my fault for not adding enough testcases to prevent 
that). This patch tries to unwrap vec_cond_expr a bit so some 
optimizations can still happen.


I wasn't planning to add all those transformations together, but adding 
one caused a regression, whose fix introduced a second regression, etc.


Using a simple fold_binary internally looks like an ok compromise to me. 
It remains cheap enough (not recursive, and vector instructions are not 
that frequent), while still allowing more than const_binop (X|0 or X for 
instance). The transformations are quite conservative with :s and folding 
only if everything simplifies, we may want to relax this later. And of 
course we are going to miss things like a?b:c + a?c:b -> b+c.


In terms of number of operations, some transformations turning 2 
VEC_COND_EXPR into VEC_COND_EXPR + BIT_IOR_EXPR + BIT_NOT_EXPR might not 
look like a gain... I expect the bit_not disappears in most cases, and 
VEC_COND_EXPR looks more costly than a simpler BIT_IOR_EXPR.


I am a bit confused that with avx512 we get types like "vector(4) 
<signed-boolean:2>" with :2 and not :1 (is it a hack so true is 1 and not 
-1?), but that doesn't matter for this patch.


Regtest+bootstrap on x86_64-pc-linux-gnu

2020-07-30  Marc Glisse  

PR tree-optimization/95906
PR target/70314
* match.pd ((c ? a : b) op d, (c ? a : b) op (c ? d : e),
(v ? w : 0) ? a : b, c1 ? c2 ? a : b : b): New transformations.

* gcc.dg/tree-ssa/andnot-2.c: New file.
* gcc.dg/tree-ssa/pr95906.c: Likewise.
* gcc.target/i386/pr70314.c: Likewise.

--
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd
index c6ae7a7db7a..af52d56162b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3451,6 +3451,77 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (cst1 && cst2)
 (vec_cond @0 { cst1; } { cst2; })))))
 
+/* Sink binary operation to branches, but only if we can fold it.  */
+#if GIMPLE
+(for op (tcc_comparison plus minus mult bit_and bit_ior bit_xor
+	 rdiv trunc_div ceil_div floor_div round_div
+	 trunc_mod ceil_mod floor_mod round_mod min max)
+/* (c ? a : b) op d  -->  c ? (a op d) : (b op d) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) @3)
+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @1, @3);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @2, @3);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+    (vec_cond @0 { rhs1; } { rhs2; }))))
+ (simplify
+  (op @3 (vec_cond:s @0 @1 @2))
+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @3, @1);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @3, @2);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+    (vec_cond @0 { rhs1; } { rhs2; }))))
+
+/* (c ? a : b) op (c ? d : e)  -->  c ? (a op d) : (b op e) */
+ (simplify
+  (op (vec_cond:s @0 @1 @2) (vec_cond:s @0 @3 @4))
+  (with
+   {
+ tree rhs1, rhs2 = NULL;
+ rhs1 = fold_binary (op, type, @1, @3);
+ if (rhs1 && is_gimple_val (rhs1))
+   rhs2 = fold_binary (op, type, @2, @4);
+   }
+   (if (rhs2 && is_gimple_val (rhs2))
+    (vec_cond @0 { rhs1; } { rhs2; })))))
+#endif
+
+/* (v ? w : 0) ? a : b is just (v & w) ? a : b  */
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_zerop) @1 @2)
+ (vec_cond (bit_and @0 @3) @1 @2))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_all_onesp @3) @1 @2)
+ (vec_cond (bit_ior @0 @3) @1 @2))
+(simplify
+ (vec_cond (vec_cond:s @0 integer_zerop @3) @1 @2)
+ (vec_cond (bit_ior @0 (bit_not @3)) @2 @1))
+(simplify
+ (vec_cond (vec_cond:s @0 @3 integer_all_onesp) @1 @2)
+ (vec_cond (bit_and @0 (bit_not @3)) @2 @1))
+
+/* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
+ (vec_cond (bit_and @0 @1) @2 @3))
+(simplify
+ (vec_cond @0 @2 (vec_cond:s @1 @2 @3))
+ (vec_cond (bit_ior @0 @1) @2 @3))
+(simplify
+ (vec_cond @0 (vec_cond:s @1 @2 @3) @2)
+ (vec_cond (bit_ior (bit_not @0) @1) @2 @3))
+(simplify
+ (vec_cond @0 @3 (vec_cond:s @1 @2 @3))
+ (vec_cond (bit_and (bit_not @0) @1) @2 @3))
+
 /* Simplification moved from fold_cond_expr_with_comparison.  It may also
be extended.  */
 /* This pattern implements two kinds simplification:
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
new file mode 100644
index 000..e0955ce3ffd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/andnot-2.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop3-raw -w -Wno-psabi" } */
+
+typedef long vec __attribute__((vector_size(16)));
+vec f(vec x){
+  vec y = x < 10;
+  return y & (y == 0);
+}
+
+/* { dg-final { scan-tree-dump-not "_expr" "forwprop3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr95906.c b/gc

RE: [PATCH] middle-end: Fold popcount(x&4) to (x>>2)&1 and friends.

2020-07-23 Thread Marc Glisse

On Thu, 23 Jul 2020, Marc Glisse wrote:


On Wed, 22 Jul 2020, Roger Sayle wrote:

Many thanks for the peer review and feedback.  I completely agree that POPCOUNT
and PARITY iterators simplifies things and handle the IFN_ variants.


Is there a reason why the iterators cannot be used for this one?

+(for popcount (BUILT_IN_POPCOUNT BUILT_IN_POPCOUNTL BUILT_IN_POPCOUNTLL
+  BUILT_IN_POPCOUNTIMAX)
+ parity (BUILT_IN_PARITY BUILT_IN_PARITYL BUILT_IN_PARITYLL
+BUILT_IN_PARITYIMAX)
+  (simplify
+(bit_and (popcount @0) integer_onep)
+(parity @0)))


Ah, maybe because we may have platforms/modes where the optab for popcount 
is supported but not the one for parity, and we are not allowed to create 
internal calls if the optab is not supported? I think expand_parity tries 
to expand parity as popcount&1, so it should be fine, but I didn't 
actually try it...
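
For reference, the source-level shape this particular fold targets is roughly the following (a sketch, function name made up):

unsigned parity_via_popcount (unsigned x)
{
  return __builtin_popcount (x) & 1;   /* candidate for folding to __builtin_parity (x) */
}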


--
Marc Glisse


RE: [PATCH] middle-end: Fold popcount(x&4) to (x>>2)&1 and friends.

2020-07-23 Thread Marc Glisse

On Wed, 22 Jul 2020, Roger Sayle wrote:


Many thanks for the peer review and feedback.  I completely agree that POPCOUNT
and PARITY iterators simplifies things and handle the IFN_ variants.


Is there a reason why the iterators cannot be used for this one?

+(for popcount (BUILT_IN_POPCOUNT BUILT_IN_POPCOUNTL BUILT_IN_POPCOUNTLL
+  BUILT_IN_POPCOUNTIMAX)
+ parity (BUILT_IN_PARITY BUILT_IN_PARITYL BUILT_IN_PARITYLL
+BUILT_IN_PARITYIMAX)
+  (simplify
+(bit_and (popcount @0) integer_onep)
+(parity @0)))


--
Marc Glisse


Re: [PATCH v2]: Optimize a >= 0 && b >= 0 to (a | b) >= 0 [PR95731]

2020-07-10 Thread Marc Glisse

On Fri, 10 Jul 2020, Joe Ramsay wrote:


adds a new pattern to simplify a >= 0 && b >= 0 to (a | b) >= 0.


We should probably add the symmetric simplification of a<0|b<0 to (a|b)<0


	* match.pd: New simplification.


I think Jakub was suggesting something slightly more detailed.

It would be nice to put the transformation next to "(x == 0 & y == 0) -> 
(x | typeof(x)(y)) == 0." in the file, since they work similarly.


+(for and (truth_and truth_andif)

I think the most important one here is bit_and.

+  (and (ge @0 integer_zerop) (ge @1 integer_zerop))

I was expecting some :s on the ge, but I notice that we don't have them 
for the similar transformation, so I don't know...


+   If one type is wider, then the narrower type is sign-extended

Is that necessarily a sign-extension? True, for an unsigned type, we 
usually replace >=0 with !=0 IIRC, but it wouldn't hurt to add an explicit 
check for that.


+  (if (   INTEGRAL_TYPE_P (TREE_TYPE (@0))
+   && INTEGRAL_TYPE_P (TREE_TYPE (@1)))

I like indenting to align things, but I don't think that matches the 
official style used in gcc.


I am always afraid that this kind of transformation may hurt if it happens 
too early (does VRP know how to get ranges for a and b from (a|b)>=0?) but 
I guess this isn't the first transformation in this direction, so we can 
see all of them later if it causes trouble...


If the types are int128_t and int64_t, is extending the right thing to do? 
Will it eventually simplify to ((int64_t)(i128>>64)|i64)>=0 ?
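
For reference, a minimal source-level shape of the transformation under discussion (a sketch with a made-up function name, not taken from the patch):

bool both_nonneg (int a, int b)
{
  /* Candidate for (a | b) >= 0: for signed a and b, the sign bit of a | b
     is set iff at least one of the two sign bits is set.  */
  return a >= 0 && b >= 0;
}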


--
Marc Glisse


Re: Local optimization options

2020-07-05 Thread Marc Glisse

On Sun, 5 Jul 2020, Thomas König wrote:




On 04.07.2020 at 19:11, Richard Biener wrote:

On July 4, 2020 11:30:05 AM GMT+02:00, "Thomas König"  wrote:


What could be a preferred way to achieve that? Could optimization
options like -ffast-math be applied to blocks instead of functions?
Could we set flags on the TREE codes to allow certain optimizations?
Other things?


The middle end can handle those things on function granularity only.

Richard.


OK, so that will not work (or not without a disproportionate
amount of effort).  Would it be possible to set something like a
TREE_FAST_MATH flag on TREEs? An operation could then be
optimized according to these rules iff both operands
had that flag, and would also have it then.


In order to support various semantics on floating point operations, I was 
planning to replace some trees with internal functions, with an extra 
operand to specify various behaviors (rounding, exception, etc). Although 
at least in the beginning, I was thinking of only using those functions in 
safe mode, to avoid perf regressions.


https://gcc.gnu.org/pipermail/gcc-patches/2019-August/527040.html

This may never happen now, but it sounds similar to setting flags like 
TREE_FAST_MATH that you are suggesting. I was going with functions for 
more flexibility, and to avoid all the existing assumptions about trees. 
While I guess for fast-math, the worst the assumptions could do is clear 
the flag, which would make us optimize less than possible, not so bad.
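
As a practical illustration of that function-granularity limitation, the only per-region control available today is to outline the code into its own function, e.g. as in the sketch below (assuming the optimize attribute is acceptable for the use case; names made up):

__attribute__((optimize("fast-math")))
static double dot_fast (const double *a, const double *b, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];   /* may be reassociated/vectorized under fast-math rules */
  return s;
}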


--
Marc Glisse


Re: [RFC] rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-06-30 Thread Marc Glisse

On Mon, 29 Jun 2020, Richard Biener via Gcc-patches wrote:


At least without -frounding-math fegetround could be
constant folded to FE_TONEAREST for which we'd need the
actual value of FE_TONEAREST.


That will break existing code which, since -frounding-math doesn't work 
for that, protects all FP operations with volatile read/writes or similar 
asm, and then doesn't specify -frounding-math because it doesn't seem 
necessary. I am not saying that code is right, just that it exists.
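
The kind of code meant here looks roughly like this (a sketch of the pattern, not a recommendation; names made up):

#include <fenv.h>

double add_upward (double a, double b)
{
  fesetround (FE_UPWARD);
  volatile double va = a;   /* the volatile copy keeps the compiler from
                               folding or moving the addition */
  double r = va + b;
  fesetround (FE_TONEAREST);
  return r;
}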


In a world where we have implemented fenv_access, this kind of folding of 
fegetround could only happen in "#pragma fenv_access off" regions, which 
seems to imply that it would be the front-end's responsibility (although 
it would need help from the back-end to know the default value to fold 
to).


--
Marc Glisse


Re: [RFC] rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-06-29 Thread Marc Glisse

On Mon, 29 Jun 2020, Segher Boessenkool wrote:


Another question.  How do these builtins prevent other FP insns from
being moved (or optimised) "over" them?


At the GIMPLE level they don't. They prevent other function calls from
moving across them, simply because two calls cannot be reordered when at least
one of them is not pure, but otherwise fenv_access is one big missing feature in gcc.
I started something last year (and postponed indefinitely for lack of 
time), replacing all FP operations (when the safe mode is enabled) with 
builtins that get expanded by default to insert asm pass-through on the 
arguments and the result.
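
A minimal sketch of that idea, with a made-up helper name (the real work would insert this automatically at expansion time):

static inline double opaque (double x)
{
  __asm__ volatile ("" : "+m" (x));   /* opaque to the optimizers: blocks
                                         constant folding and code motion */
  return x;
}

/* each protected operation would then look something like
   double r = opaque (opaque (a) + opaque (b));  */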


--
Marc Glisse


Re: [PATCH] tree-ssa-threadbackward.c (profitable_jump_thread_path): Do not allow __builtin_constant_p.

2020-06-27 Thread Marc Glisse

On Fri, 26 Jun 2020, Jeff Law via Gcc-patches wrote:


In theory yes, but there are cases where paths converge (like you've shown) 
where
you may have evaluated to a constant on the paths, but it's not a constant at 
the
convergence point.  You have to be very careful using b_c_p like this and it's
been a regular source of kernel bugs.


I'd recommend looking at the .ssa dump and walk forward from there if the .ssa
dump looks correct.


Here is the last dump before thread1 (105t.mergephi2). I don't see 
anything incorrect in it.


ledtrig_cpu (_Bool is_active)
{
  int old;
  int iftmp.0_1;
  int _5;

   [local count: 1073741824]:
  if (is_active_2(D) != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870913]:

   [local count: 1073741824]:
  # iftmp.0_1 = PHI <1(2), -1(3)>
  _5 = __builtin_constant_p (iftmp.0_1);
  if (_5 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870913]:
  if (iftmp.0_1 >= -128)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 268435456]:
  if (iftmp.0_1 <= 127)
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 91268056]:
  __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "i" iftmp.0_1, "Q" MEM[(int *)_active_cpus] 
: "memory", "cc");
  goto ; [100.00%]

   [local count: 982473769]:
  __asm__ __volatile__("laa %0,%2,%1
" : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "d" iftmp.0_1, "Q" MEM[(int 
*)_active_cpus] : "memory", "cc");

   [local count: 1073741824]:
  return;

}

There is a single _b_c_p, the immediate asm argument is exactly the 
argument of _b_c_p, and it is in the branch protected by _b_c_p.


Now the thread1 dump, for comparison

ledtrig_cpu (_Bool is_active)
{
  int old;
  int iftmp.0_4;
  int iftmp.0_6;
  int _7;
  int _12;
  int iftmp.0_13;
  int iftmp.0_14;

   [local count: 1073741824]:
  if (is_active_2(D) != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870912]:
  # iftmp.0_6 = PHI <1(2)>
  _7 = __builtin_constant_p (iftmp.0_6);
  if (_7 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 536870912]:
  # iftmp.0_4 = PHI <-1(2)>
  _12 = __builtin_constant_p (iftmp.0_4);
  if (_12 != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 268435456]:
  if (iftmp.0_4 >= -128)
goto ; [20.00%]
  else
goto ; [80.00%]

   [local count: 214748364]:
  if (iftmp.0_6 <= 127)
goto ; [12.00%]
  else
goto ; [88.00%]

   [local count: 91268056]:
  # iftmp.0_13 = PHI 
  __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "i" iftmp.0_13, "Q" MEM[(int 
*)_active_cpus] : "memory", "cc");
  goto ; [100.00%]

   [local count: 982473769]:
  # iftmp.0_14 = PHI 
  __asm__ __volatile__("laa %0,%2,%1
" : "old" "=d" old_8, "ptr" "=Q" MEM[(int *)_active_cpus] : "val" "d" iftmp.0_14, "Q" MEM[(int 
*)_active_cpus] : "memory", "cc");

   [local count: 1073741824]:
  return;

}

Thread1 decides to separate the paths is_active and !is_active 
(surprisingly, for one it optimizes out the comparison <= 127 and for the 
other the comparison >= -128, while it could optimize both in both cases). 
And it decides to converge after the comparisons, but before the asm.


What the pass did does seem to hurt. It looks like if we duplicate _b_c_p, 
we may need to duplicate far enough to include all the blocks dominated by 
_b_c_p==true (including the asm, here). Otherwise, any _b_c_p can be 
optimized to true, because for a boolean


b is the same as b ? true : false, so
__builtin_constant_p(b ? true : false) would be the same as b ?
__builtin_constant_p(true) : __builtin_constant_p(false), i.e. true.
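
To make that concrete, the shape of the argument in plain C is roughly (a made-up example, not GCC output):

int fold_example (int b)
{
  /* For a boolean-like b, v below is the same as b ? 1 : -1 ...  */
  int v = b ? 1 : -1;

  /* ... so a pass that duplicates the two paths may rewrite the call as
     b ? __builtin_constant_p (1) : __builtin_constant_p (-1), where both
     arms fold to 1, even though v is not a compile-time constant.  */
  return __builtin_constant_p (v);
}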


It is too bad we don't have any optimization pass using ranges between IPA 
and thread1, that would have gotten rid of the comparisons, and hence the 
temptation to thread. Adding always_inline on atomic_add (or flatten on 
the caller) does help: EVRP removes the comparisons.


Do you see a way forward without changing what thread1 does or declaring 
the testcase as unsupported?


--
Marc Glisse


Re: [PATCH] tree-ssa-threadbackward.c (profitable_jump_thread_path): Do not allow __builtin_constant_p.

2020-06-27 Thread Marc Glisse

On Sat, 27 Jun 2020, Ilya Leoshkevich via Gcc-patches wrote:


Is there something specific that a compiler user should look out for?
For example, here is the kernel code, from which the test was derived:

static inline void atomic_add(int i, atomic_t *v)
{
#ifdef CONFIG_HAVE_MARCH_Z196_FEATURES
	if (__builtin_constant_p(i) && (i > -129) && (i < 128)) {
		__atomic_add_const(i, &v->counter);
		return;
	}
#endif
	__atomic_add(i, &v->counter);
}

It looks very straightforward - can there still be something wrong
with its usage of b_c_p?

I'd recommend looking at the .ssa dump and walk forward from there if 
the .ssa dump looks correct.


Well, 021t.ssa already has:

__attribute__((gnu_inline))
__atomic_add_const (intD.6 valD.2269, intD.6 * ptrD.2270)
{
 intD.6 val_3(D) = valD.2269;
 intD.6 * ptr_2(D) = ptrD.2270;
;;   basic block 2, loop depth 0, maybe hot
;;prev block 0, next block 1, flags: (NEW)
;;pred:   ENTRY (FALLTHRU)
 # .MEM_4 = VDEF <.MEM_1(D)>
 __asm__ __volatile__("asi %0,%1
" : "ptr" "=Q" *ptr_2(D) : "val" "i" val_3(D), "Q" *ptr_2(D) :
"memory", "cc");
 # VUSE <.MEM_4>
 return;
;;succ:   EXIT

}

which is, strictly speaking, not correct, because val_3(D) and
valD.2269 are not constant.  But as far as I understand we are willing
to tolerate trees like this until a certain point.

What is this point supposed to be?  If I understood you right,
106t.thread1 is already too late - why is it so?


Small remark: shouldn't __atomic_add_const be marked with the 
always_inline attribute, since it isn't usable when it isn't inlined?
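
In generic form, the remark amounts to something like this (a sketch only, not the s390 kernel code; the body is a placeholder):

static inline __attribute__((always_inline)) void
atomic_add_const (int i, int *counter)
{
  /* The real version uses an asm with an "i" (immediate) constraint on i;
     forcing inlining guarantees that a constant passed by the caller is
     still visible as a literal when the asm is expanded.  */
  *counter += i;   /* placeholder body to keep the sketch self-contained */
}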


--
Marc Glisse


std::includes performance tweak

2020-06-19 Thread Marc Glisse

Hello,

I am proposing a small tweak to the implementation of __includes, which in 
my application saves 20% of the running time. I noticed it because using 
range-v3 was giving unexpected performance gains.


The unified diff is attached, but let me first show a more readable 
context diff.


*** /tmp/zzm2NX_stl_algo.h  2020-06-19 10:48:58.702634366 +0200
--- libstdc++-v3/include/bits/stl_algo.h  2020-06-18 23:16:06.183427245 +0200
***************
*** 2783,2797 ****
   _Compare __comp)
  {
while (__first1 != __last1 && __first2 != __last2)
!   if (__comp(__first2, __first1))
! return false;
!   else if (__comp(__first1, __first2))
! ++__first1;
!   else
! {
!   ++__first1;
++__first2;
! }

return __first2 == __last2;
  }
--- 2783,2795 ----
   _Compare __comp)
  {
while (__first1 != __last1 && __first2 != __last2)
!   {
! if (__comp(__first2, __first1))
!   return false;
! if (!__comp(__first1, __first2))
++__first2;
! ++__first1;
!   }

return __first2 == __last2;
  }

As you can see, it isn't much change. Some of the gain comes from pulling 
the 2 calls ++__first1 out of the condition so there is just one call. And 
most of the gain comes from replacing the resulting


if (__comp(__first1, __first2))
  ;
else
  ++__first2;

with

if (!__comp(__first1, __first2))
  ++__first2;

I was very surprised that the code ended up being so different for such a 
change, and I still don't really understand where the extra time is 
going...
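
For what it is worth, a standalone equivalent of the new loop looks like this (plain iterators and operator<, a sketch rather than the libstdc++ source):

template <class It1, class It2>
bool includes_sketch (It1 first1, It1 last1, It2 first2, It2 last2)
{
  while (first1 != last1 && first2 != last2)
    {
      if (*first2 < *first1)
        return false;          // *first2 cannot appear in [first1, last1)
      if (!(*first1 < *first2))
        ++first2;              // equivalent elements: advance both ranges
      ++first1;                // exactly one increment per iteration
    }
  return first2 == last2;
}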


Anyway, while I blame the compiler for not generating very good code with 
the current implementation, I believe the change can be seen as a 
simplification and should be pushed to master. It regtests fine.


2020-06-20  Marc Glisse  

* include/bits/stl_algo.h (__includes): Simplify the code.

(as with the patch for std::optional, I still haven't worked on my ssh key 
issue and cannot currently push)


--
Marc Glisse
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index fd6edd0d5f4..550a15f2b3b 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2783,15 +2783,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	   _Compare __comp)
 {
   while (__first1 != __last1 && __first2 != __last2)
-	if (__comp(__first2, __first1))
-	  return false;
-	else if (__comp(__first1, __first2))
-	  ++__first1;
-	else
-	  {
-	++__first1;
+	{
+	  if (__comp(__first2, __first1))
+	return false;
+	  if (!__comp(__first1, __first2))
 	++__first2;
-	  }
+	  ++__first1;
+	}
 
   return __first2 == __last2;
 }

