Re: port contrib/download_prerequisites script to macOS

2017-04-12 Thread Jerry DeLisle
On 04/12/2017 05:40 PM, Martin Sebor wrote:
> On 04/12/2017 04:03 PM, Jeff Law wrote:
>> On 04/04/2017 07:10 PM, Damian Rouson wrote:
>>> All,
>>>
>>> The attached patch modifies the contrib/download_prerequisites script
>>> to work on macOS.
>>> The revised script detects the operating system and adjusts the shasum
>>> and md5 commands
>>> to their expected name and arguments on macOS.  The revised script
>>> also uses curl if
>>> wget is not present.  macOS ships with curl but not wget.
>>>
>>> Tested on macOS and Lubuntu and Fedora Linux distributions.
>>>
>>> Ok for trunk?
>>>
>>> Damian
>>>
>>>
>>> 2017-04-05  Damian Rouson  
>>>
>>> * download_prerequisites (md5_check): New function emulates Linux
'md5sum --check' on macOS.  Modified script for macOS compatibility.
>> I wonder if we should just switch to curl from wget in general rather
>> than conditionalizing the code at all.
> 
> That was going to be my suggestion as well.  It will make updating
> the script easier.
> 
> Martin
> 
>>
>> For the sums, rather than doing a check of the OS, just see if
>> sha512/md5sum exists.  If not, then fallback to the Darwin defaults.
>>
>> Jeff
> 

I did not wait long enough for your comments and committed it already. We can
certainly adjust it.

Jerry


Re: port contrib/download_prerequisites script to macOS

2017-04-12 Thread Martin Sebor

On 04/12/2017 04:03 PM, Jeff Law wrote:

On 04/04/2017 07:10 PM, Damian Rouson wrote:

All,

The attached patch modifies the contrib/download_prerequisites script
to work on macOS.
The revised script detects the operating system and adjusts the shasum
and md5 commands
to their expected name and arguments on macOS.  The revised script
also uses curl if
wget is not present.  macOS ships with curl but not wget.

Tested on macOS and Lubuntu and Fedora Linux distributions.

Ok for trunk?

Damian


2017-04-05  Damian Rouson  

* download_prerequisites (md5_check): New function emulates Linux
'md5sum --check' on macOS.  Modified script for macOS compatibility.

I wonder if we should just switch to curl from wget in general rather
than conditionalizing the code at all.


That was going to be my suggestion as well.  It will make updating
the script easier.

Martin



For the sums, rather than doing a check of the OS, just see if
sha512/md5sum exists.  If not, then fallback to the Darwin defaults.

Jeff
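
Jeff's suggestion — fetch with curl unconditionally and probe for the checksum tools rather than branching on the OS name — could look roughly like the sketch below.  This is only an illustration: the helper names and the exact Darwin `md5`/`shasum` option spellings are assumptions, not the committed script.

```shell
#!/bin/sh
# Sketch: detect tools instead of detecting the OS (hypothetical helpers).

# Always fetch with curl; both Linux distributions and macOS provide it.
fetch () {
  curl -f -L -O "$1"
}

# Prefer GNU md5sum when present; otherwise fall back to BSD/macOS 'md5'.
md5_check () {
  file="$1" sum="$2"
  if command -v md5sum >/dev/null 2>&1; then
    echo "$sum  $file" | md5sum --check --quiet -
  else
    test "$(md5 -q "$file")" = "$sum"
  fi
}
```

The same `command -v` probe extends naturally to `sha512sum` versus `shasum -a 512`.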




[PATCH,rs6000] PR80315: Add test cases to confirm ICE has been fixed

2017-04-12 Thread Kelvin Nilsen

PR80315 reported an internal compiler error when the third argument to
__builtin_crypto_vshasigmaw was an integer constant with a value
greater than 15.  The patch to correct this problem was committed
yesterday.  This patch adds four new test cases to the regression suite.

Regression testing has confirmed that these test programs reproduce the
error reported with PR80315 before yesterday's patch was applied, and
that all test programs pass following application of yesterday's patch.

Is this ok for the trunk?


gcc/testsuite/ChangeLog:

2017-04-12  Kelvin Nilsen  

* gcc.target/powerpc/pr80315-1.c: New test.
* gcc.target/powerpc/pr80315-2.c: New test.
* gcc.target/powerpc/pr80315-3.c: New test.
* gcc.target/powerpc/pr80315-4.c: New test.

Index: gcc/testsuite/gcc.target/powerpc/pr80315-1.c
===
--- gcc/testsuite/gcc.target/powerpc/pr80315-1.c(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80315-1.c(working copy)
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mcpu=power8" } */
+
+int
+main()
+{
+  __attribute__((altivec(vector__))) unsigned int test, res;
+  const int s0 = 0;
+  int mask;
+
+  /* Argument 2 must be 0 or 1.  Argument 3 must be in range 0..15.  */
+  res = __builtin_crypto_vshasigmaw (test, 1, 0xff); /* { dg-error "argument 3 must be in the range 0..15" } */
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/pr80315-2.c
===
--- gcc/testsuite/gcc.target/powerpc/pr80315-2.c(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80315-2.c(working copy)
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mcpu=power8" } */
+
+int
+main ()
+{
+  __attribute__((altivec(vector__))) unsigned long long test, res;
+  const int s0 = 0;
+  int mask;
+
+  /* Argument 2 must be 0 or 1.  Argument 3 must be in range 0..15.  */
+  res = __builtin_crypto_vshasigmad (test, 1, 0xff); /* { dg-error "argument 3 must be in the range 0..15" } */
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/pr80315-3.c
===
--- gcc/testsuite/gcc.target/powerpc/pr80315-3.c(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80315-3.c(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mcpu=power8" } */
+
+#include <altivec.h>
+
+vector unsigned int
+main ()
+{
+  vector unsigned int test, res;
+  const int s0 = 0;
+  int mask;
+
+  /* Argument 2 must be 0 or 1.  Argument 3 must be in range 0..15.  */
+  res = vec_shasigma_be (test, 1, 0xff); /* { dg-error "argument 3 must be in the range 0..15" } */
+  return res;
+}
Index: gcc/testsuite/gcc.target/powerpc/pr80315-4.c
===
--- gcc/testsuite/gcc.target/powerpc/pr80315-4.c(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80315-4.c(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mcpu=power8" } */
+
+#include <altivec.h>
+
+vector unsigned long long int
+main ()
+{
+  vector unsigned long long int test, res;
+  const int s0 = 0;
+  int mask;
+
+  /* Argument 2 must be 0 or 1.  Argument 3 must be in range 0..15.  */
+  res = vec_shasigma_be (test, 1, 0xff); /* { dg-error "argument 3 must be in the range 0..15" } */
+  return res;
+}



[PATCH] Fix PR51513, switch statement with default case containing __builtin_unreachable leads to wild branch

2017-04-12 Thread Peter Bergner
This patch is the second attempt to fix PR51513, namely the generation of
wild branches due to switch case statements that only contain calls to
__builtin_unreachable().  With the first attempt:

https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01915.html

richi said he preferred just eliminating the range check for
default case statements that contain __builtin_unreachable().
This patch implements that idea.  It also removes normal case
statement blocks that are marked as unreachable, but in those cases,
we just use a dummy label in the jump table for them.

This passed bootstrap and regtesting with no regressions on powerpc64-linux
and x86_64-linux.  Ok for trunk now or trunk during stage1?

Peter


gcc/
* tree-cfg.c (gimple_unreachable_bb_p): New function.
(assert_unreachable_fallthru_edge_p): Use it.
* tree-cfg.h: Prototype it.
* stmt.c: Include cfghooks.h and tree-cfg.h.
(emit_case_dispatch_table) <gap_label>: New local variable.
Use it to fill dispatch table gaps and unreachable cases.
Remove edges to unreachable blocks.
(expand_case): Test for unreachable default case statement and
remove its edge.  Set default_label accordingly.
(emit_case_nodes): Only emit branch to default_label if it
exists.

gcc/testsuite/
* gcc.target/powerpc/pr51513.c: New test.

Index: gcc/stmt.c
===
--- gcc/stmt.c  (revision 246661)
+++ gcc/stmt.c  (working copy)
@@ -30,6 +30,7 @@ along with GCC; see the file COPYING3.
 #include "rtl.h"
 #include "tree.h"
 #include "gimple.h"
+#include "cfghooks.h"
 #include "predict.h"
 #include "alloc-pool.h"
 #include "memmodel.h"
@@ -49,6 +50,7 @@ along with GCC; see the file COPYING3.
 #include "expr.h"
 #include "langhooks.h"
 #include "cfganal.h"
+#include "tree-cfg.h"
 #include "params.h"
 #include "dumpfile.h"
 #include "builtins.h"
@@ -989,6 +991,14 @@ emit_case_dispatch_table (tree index_exp
   labelvec = XALLOCAVEC (rtx, ncases);
   memset (labelvec, 0, ncases * sizeof (rtx));
 
+  /* The dispatch table may contain gaps and labels of unreachable case
+ statements.  Gaps can exist at the beginning of the table if we tried
+ to avoid the minval subtraction.  We fill the dispatch table slots
+ associated with the gaps and unreachable case blocks with the default
+ case label.  However, in the event the default case itself is unreachable,
+ we then use any label from one of the reachable case statements.  */
+  rtx gap_label = (default_label) ? default_label : fallback_label;
+
   for (n = case_list; n; n = n->right)
 {
   /* Compute the low and high bounds relative to the minimum
@@ -1000,42 +1010,51 @@ emit_case_dispatch_table (tree index_exp
   HOST_WIDE_INT i_high
= tree_to_uhwi (fold_build2 (MINUS_EXPR, index_type,
 n->high, minval));
-  HOST_WIDE_INT i;
 
+  basic_block case_bb = label_to_block (n->code_label);
+  rtx case_label;
+  if (gimple_unreachable_bb_p (case_bb))
+   {
+ /* We have an unreachable case statement, so replace its label
+with a dummy label and remove the edge to the unreachable block.
+The block itself will be automatically removed later.  */
+ case_label = gap_label;
+ remove_edge (find_edge (stmt_bb, case_bb));
+   }
+  else
+   case_label = label_rtx (n->code_label);
+
+  HOST_WIDE_INT i;
   for (i = i_low; i <= i_high; i ++)
-   labelvec[i]
- = gen_rtx_LABEL_REF (Pmode, label_rtx (n->code_label));
+   labelvec[i] = gen_rtx_LABEL_REF (Pmode, case_label);
 }
 
-  /* Fill in the gaps with the default.  We may have gaps at
- the beginning if we tried to avoid the minval subtraction,
- so substitute some label even if the default label was
- deemed unreachable.  */
-  if (!default_label)
-default_label = fallback_label;
+  /* Now fill in the dispatch table gaps.  */
   for (i = 0; i < ncases; i++)
 if (labelvec[i] == 0)
   {
-has_gaps = true;
-labelvec[i] = gen_rtx_LABEL_REF (Pmode, default_label);
+   has_gaps = true;
+   labelvec[i] = gen_rtx_LABEL_REF (Pmode, gap_label);
   }
 
-  if (has_gaps)
-{
-  /* There is at least one entry in the jump table that jumps
- to default label. The default label can either be reached
- through the indirect jump or the direct conditional jump
- before that. Split the probability of reaching the
- default label among these two jumps.  */
-  new_default_prob = conditional_probability (default_prob/2,
-  base);
-  default_prob /= 2;
-  base -= default_prob;
-}
-  else
+  if (default_label)
 {
-  base -= default_prob;
-  default_prob = 0;
+  if (has_gaps)
+   {
+ /* There is at least one entry in the jump 

RFC: seeking insight on store_data_bypass_p (recog.c)

2017-04-12 Thread Kelvin Nilsen


My work on PR80101 is "motivating" me to modify the implementation of
store_data_bypass_p (in gcc/recog.c).

I have a patch that bootstraps with no regressions.  However, I think
"regression" testing may not be enough to prove I got this right.  If my
new patch returns the wrong value, the outcome will be poor instruction
scheduling decisions, which will impact performance, but probably not
"correctness".

So I'd like some help understanding the existing implementation of
store_data_bypass_p.  To establish some context, here is what I think I
understand about this function:

1. As input arguments, out_insn represents an rtl expression that
potentially "produces" a store to memory and in_insn represents an rtl
expression that potentially "consumes" a value recently stored to memory.

2. If the memory store produced matches the memory fetch consumed, this
function returns true to indicate that this sequence of two instructions
qualifies for a special "bypass" latency that represents the fact that
the fetch will obtain the value out of the write buffer.  So, whereas
the instruction scheduler might normally expect that this sequence of
two instructions would experience Load-Hit-Store penalties associated
with cache coherency hardware costs, since these two instruction qualify
for the store_data_bypass optimization, the instruction scheduler counts
the latency as only 1 or 2 cycles (potentially).  [This is what I
understand, but I may be wrong, so please correct me if so.]

3. Actually, what I described above is only the "simple" case.  It may
be that the rtl for either out_insn or in_insn is really a parallel
clause with multiple rtl trees beneath it.  In this case, we compare the
subtrees in a "similar" way to see if the compound expressions qualify
for the store_data_bypass_p "optimization".  (I've got some questions
about how this is done below)  As currently implemented, special
handling is given to a CLOBBER subtree as part of either PARALLEL
expression: we ignore it.  This is because CLOBBER does not represent
any real machine instructions.  It just represents semantic information
that might be used by the compiler.

In addition to seeking confirmation of my existing understanding of the
code as outlined above, the specific questions that I am seeking help
with are:

1. In the current implementation (as I understand it), near the top of
the function body, we handle the case that the consumer (in_insn) rtl is
a single SET expression and the producer (out_insn) rtl is a PARALLEL
expression containing multiple sets.  The way I read this code, we are
requiring that every one of the producer's parallel SET instructions
produce the same value that is to be consumed in order to qualify this
sequence as a "store data bypass".  That seems wrong to me.  I would
expect that we only need "one" of the produced values to match the
consumed value in order to qualify for the "store data bypass"
optimization.  Please explain.  (The same confusing behavior happens
below in the same function, in the case that the consumer rtl is a
PARALLEL expression of multiple SETs: we require that every producer's
stored value match every consumer's fetched value.)

2. A "bigger" concern is that any time any SETs are buried within a
PARALLEL tree, I'm not sure the answer produced by this function, as
currently implemented, is at all reliable:

 a) PARALLEL does not necessarily mean all of its subtrees happen in
parallel on hardware.  It just means that there is no sequencing imposed
by the source code, so the final order in which the multiple subtrees
beneath the PARALLEL node is not known at this stage of compilation.

 b) It seems to me that it doesn't really make sense to speak of whether
a whole bunch of producers combined with a whole bunch of consumers
qualify for an optimized store data bypass latency.  If we say that they
do qualify (as a group), which pair(s) of producer and consumer machine
instructions qualify?  It seems we need to know which producer matches
with which consumer in order to know where the bypass latencies "fit"
into the schedule.

 c) Furthermore, if it turns out that the "arbitrary" order in which the
producer instructions and consumer instructions are emitted places too
much "distance" between a producer and the matching consumer, then it is
possible that by the time the hardware executes the consumer, the stored
value is no longer in the write buffer, so even though we might have
"thought" two PARALLEL rtl expressions qualified for the store bypass
optimization, we really should have returned false.

Can someone help me understand this better?

Thanks much.


-- 
Kelvin Nilsen, Ph.D.  kdnil...@linux.vnet.ibm.com
home office: 801-756-4821, cell: 520-991-6727
IBM Linux Technology Center - PPC Toolchain



[PATCH], Fix PR/target 80099 (internal error with -mno-upper-regs-sf)

2017-04-12 Thread Michael Meissner
The problem is that rs6000_expand_vector_extract did not check for SFmode
being allowed in the Altivec (upper) registers, but the insn implementing
the variable extract had it as a condition.

Looking at the variable extract code, it currently does not require SFmode
to go in the Altivec registers, but it does require DImode to go into the
Altivec registers (vec_extract of V2DFmode will require DFmode to go in Altivec
registers instead of DImode).

I have tested this patch on a little endian power8 system and there were no
regressions with either bootstrap or make check.

[gcc]
2017-04-12  Michael Meissner  

PR target/80099
* config/rs6000/rs6000.c (rs6000_expand_vector_extract): Make sure
that DFmode or DImode as appropriate can go in Altivec registers
before generating the faster sequences for variable vec_extracts.
* config/rs6000/vsx.md (vsx_extract_v4sf): Remove unneeded
TARGET_UPPER_REGS_SF condition.

[gcc/testsuite]
2017-04-12  Michael Meissner  

PR target/80099
* gcc.target/powerpc/pr80099.c: New test.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Index: gcc/config/rs6000/rs6000.c
===
--- gcc/config/rs6000/rs6000.c  (revision 246852)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -7586,15 +7586,23 @@ rs6000_expand_vector_extract (rtx target
   switch (mode)
{
case V2DFmode:
- emit_insn (gen_vsx_extract_v2df_var (target, vec, elt));
- return;
+ if (TARGET_UPPER_REGS_DF)
+   {
+ emit_insn (gen_vsx_extract_v2df_var (target, vec, elt));
+ return;
+   }
+ break;
 
case V2DImode:
- emit_insn (gen_vsx_extract_v2di_var (target, vec, elt));
- return;
+ if (TARGET_UPPER_REGS_DI)
+   {
+ emit_insn (gen_vsx_extract_v2di_var (target, vec, elt));
+ return;
+   }
+ break;
 
case V4SFmode:
- if (TARGET_UPPER_REGS_SF)
+ if (TARGET_UPPER_REGS_DI)
{
  emit_insn (gen_vsx_extract_v4sf_var (target, vec, elt));
  return;
@@ -7602,16 +7610,28 @@ rs6000_expand_vector_extract (rtx target
  break;
 
case V4SImode:
- emit_insn (gen_vsx_extract_v4si_var (target, vec, elt));
- return;
+ if (TARGET_UPPER_REGS_DI)
+   {
+ emit_insn (gen_vsx_extract_v4si_var (target, vec, elt));
+ return;
+   }
+ break;
 
case V8HImode:
- emit_insn (gen_vsx_extract_v8hi_var (target, vec, elt));
- return;
+ if (TARGET_UPPER_REGS_DI)
+   {
+ emit_insn (gen_vsx_extract_v8hi_var (target, vec, elt));
+ return;
+   }
+ break;
 
case V16QImode:
- emit_insn (gen_vsx_extract_v16qi_var (target, vec, elt));
- return;
+ if (TARGET_UPPER_REGS_DI)
+   {
+ emit_insn (gen_vsx_extract_v16qi_var (target, vec, elt));
+ return;
+   }
+ break;
 
default:
  gcc_unreachable ();
Index: gcc/config/rs6000/vsx.md
===
--- gcc/config/rs6000/vsx.md(revision 246852)
+++ gcc/config/rs6000/vsx.md(working copy)
@@ -2419,8 +2419,7 @@ (define_insn_and_split "vsx_extract_v4sf
   UNSPEC_VSX_EXTRACT))
(clobber (match_scratch:DI 3 "=r,,"))
(clobber (match_scratch:V2DI 4 "=,X,X"))]
-  "VECTOR_MEM_VSX_P (V4SFmode) && TARGET_DIRECT_MOVE_64BIT
-   && TARGET_UPPER_REGS_SF"
+  "VECTOR_MEM_VSX_P (V4SFmode) && TARGET_DIRECT_MOVE_64BIT"
   "#"
   "&& reload_completed"
   [(const_int 0)]
Index: gcc/testsuite/gcc.target/powerpc/pr80099.c
===
--- gcc/testsuite/gcc.target/powerpc/pr80099.c  (revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80099.c  (working copy)
@@ -0,0 +1,12 @@
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O2 -mno-upper-regs-di" } */
+
+/* PR target/80099: compiler internal error if -mno-upper-regs-di used.  */
+
+int a;
+int int_from_mem (vector float *c)
+{
+  return __builtin_vec_extract (*c, a);
+}


Re: [PATCH][PR sanitizer/77631] Support separate debug info in libbacktrace

2017-04-12 Thread Jeff Law

On 03/22/2017 09:28 AM, Denis Khalikov wrote:

Hello everyone,
I've fixed some issues and implemented functionality
to search debug file by build-id.
Can someone please review my patch.

Given this doesn't look like a regression, I'm going to punt to gcc-8.

jeff



Re: [PR59319] output friends in debug info

2017-04-12 Thread Jeff Law

On 03/21/2017 12:34 PM, Alexandre Oliva wrote:

On Jan 27, 2017, Alexandre Oliva  wrote:


On Oct 19, 2016, Alexandre Oliva  wrote:

On Sep 23, 2016, Alexandre Oliva  wrote:

On Aug 30, 2016, Alexandre Oliva  wrote:

Handling non-template friends is kind of easy, [...]



Regstrapped on x86_64-linux-gnu and i686-linux-gnu, I'd failed to
mention.



Ping?



Ping?  (conflicts resolved, patch refreshed and retested)



Ping?  (trivial conflicts resolved)


Ping?  https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02112.html
Going to punt to gcc-8.  Sorry, but we're just getting late in the
release process.


jeff



Re: port contrib/download_prerequisites script to macOS

2017-04-12 Thread Jeff Law

On 04/04/2017 07:10 PM, Damian Rouson wrote:

All,

The attached patch modifies the contrib/download_prerequisites script to work 
on macOS.
The revised script detects the operating system and adjusts the shasum and md5 
commands
to their expected name and arguments on macOS.  The revised script also uses 
curl if
wget is not present.  macOS ships with curl but not wget.

Tested on macOS and Lubuntu and Fedora Linux distributions.

Ok for trunk?

Damian


2017-04-05  Damian Rouson  

* download_prerequisites (md5_check): New function emulates Linux
'md5sum --check' on macOS.  Modified script for macOS compatibility.
I wonder if we should just switch to curl from wget in general rather 
than conditionalizing the code at all.


For the sums, rather than doing a check of the OS, just see if 
sha512/md5sum exists.  If not, then fallback to the Darwin defaults.


Jeff


Re: [libcp1] handle anon aggregates linkage-named by typedefs

2017-04-12 Thread Jeff Law

On 04/11/2017 03:13 PM, Alexandre Oliva wrote:

On Apr 10, 2017, Jeff Law  wrote:


On 03/21/2017 05:32 PM, Alexandre Oliva wrote:

* libcp1plugin.cc (plugin_build_decl): Propagate typedef name to
anonymous aggregate target type.

Can you put some kind of pointer in the code you copied from cp/decl.c
so that there's some chance anyone changing that code would at least
go look at the libcpplugin.cc copy and try to DTRT.


Well, if we're actually touching the g++ code, I'd much rather refactor
it a bit and export this chunk of code for libcc1 to use without
copying.  How's this instead?  (testing)


[libcp1] handle anon aggregates linkage-named by typedefs

Arrange for the first typedef to an anonymous type in the same context
to be used as the linkage name for the type.

for  gcc/cp/ChangeLog

* decl.c (name_unnamed_type): Split out of...
(grokdeclarator): ... this.
* decl.h (name_unnamed_type): Declare.

for  libcc1/ChangeLog

* libcp1plugin.cc (plugin_build_decl): Call name_unnamed_type.

OK.

jeff



Re: [PATCH] Destroy arguments for _Cilk_spawn calling in the child (PR 80038)

2017-04-12 Thread Jeff Law

On 04/07/2017 08:02 AM, Xi Ruoyao wrote:

On 2017-04-06 11:12 -0600, Jeff Law wrote:


With the likely deprecation in mind, I've only done a cursory review of
the changes -- mostly to verify that they hit Cilk+ paths only.



What's the purpose behind changing when we set the in_lto_p flag?


Without that change, GCC with my patch ICEed with _Cilk_spawn and
-flto -O3 -fcilkplus since __cilkrts_stack_frame.ctx's type (array of void *)
was not TYPE_STRUCTURAL_EQUALITY_P in lto stage.

If this change is not proper, I'll work on modifying my patch to work
without touching in_lto_p.
It would certainly be preferable not to change in_lto_p, unless Richi wants
to chime in on the safety of setting in_lto_p earlier.


I'm not familiar enough with the LTO interactions to know if movement of 
in_lto_p is safe.


Jeff


Re: [PATCH], Fix PR 80098, Add better checking for disabling features

2017-04-12 Thread Michael Meissner
On Tue, Apr 11, 2017 at 06:04:33PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Apr 11, 2017 at 05:32:41PM -0400, Michael Meissner wrote:
> > PR 80098 is an interaction between -mmodulo (ISA 3.0/power9 GPR modulo
> > instructions) and -mno-vsx where the -mmodulo option enables some of the ISA
> > 3.0 vector features, even though -mno-vsx was specified.
> > 
> > To do this, I added a table for disabling other vector options when the 
> > major
> > vector options (-mvsx, -mpower8-vector, and -mpower9-vector) are disabled.
> > 
> > We could extend this if desired for other options (for example, -mno-popcntd
> > could disable -mmodulo and perhaps the vector options).
> 
> Or we could just remove -mmodulo etc.  What good do they do?  They make
> testing infeasible: an exponential number of combinations to test.

We can't remove -mmodulo, -mpopcntd, -mcmpb, -mpopcntb as these are the basic
markers for -mcpu=power9/power7/power6/power5, and lots of other things depend
on these options.  I'm not sure we have a marker for power8 that isn't vector
related.

Yeah, it would be better to have specific ISA levels, but we don't currently
have them.


> > @@ -3967,7 +3969,7 @@ rs6000_option_override_internal (bool gl
> >  #endif
> >  #ifdef OS_MISSING_ALTIVEC
> >if (OS_MISSING_ALTIVEC)
> > -set_masks &= ~(OPTION_MASK_ALTIVEC | OPTION_MASK_VSX);
> > +set_masks &= ~NO_ALTIVEC_MASKS;
> 
> NO_ALTIVEC_MASKS isn't defined anywhere.  Did you send the wrong patch?

I originally had NO_ALTIVEC_MASKS and then I redid the code, naming them
OTHER_*_VECTOR_MASKS.  I missed fixing the section that disables Altivec
and VSX instructions if the assembler doesn't have Altivec support.

The following patch fixes this.  I reran the tests on a little endian power8
system.

I also tested the code on a big endian power7 machine by building a compiler
and then rebuilding rs6000.o, adding the OS_MISSING_ALTIVEC define.  I verified
by hand that it disables Altivec and VSX support by default.

Can I apply this to the trunk?

[gcc]
2017-04-12  Michael Meissner  

PR target/80098
* config/rs6000/rs6000-cpus.def (OTHER_P9_VECTOR_MASKS): Define
masks of options that should be turned off if the VSX vector
options are turned off.
(OTHER_P8_VECTOR_MASKS): Likewise.
(OTHER_VSX_VECTOR_MASKS): Likewise.
* config/rs6000/rs6000.c (rs6000_option_override_internal): Call
rs6000_disable_incompatible_switches to validate no type switches
like -mvsx.
(rs6000_incompatible_switch): New function to disallow turning on
other vector options if -mno-vsx, -mno-power8-vector, or
-mno-power9-vector are specified.

[gcc/testsuite]
2017-04-12  Michael Meissner  

PR target/80098
* gcc.target/powerpc/pr80098-1.c: New test.
* gcc.target/powerpc/pr80098-2.c: Likewise.
* gcc.target/powerpc/pr80098-3.c: Likewise.
* gcc.target/powerpc/pr80098-4.c: Likewise.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Index: gcc/config/rs6000/rs6000-cpus.def
===
--- gcc/config/rs6000/rs6000-cpus.def   (revision 246826)
+++ gcc/config/rs6000/rs6000-cpus.def   (working copy)
@@ -84,6 +84,30 @@
 | OPTION_MASK_UPPER_REGS_SF\
 | OPTION_MASK_VSX_SMALL_INTEGER)
 
+/* Flags that need to be turned off if -mno-power9-vector.  */
+#define OTHER_P9_VECTOR_MASKS  (OPTION_MASK_FLOAT128_HW\
+| OPTION_MASK_P9_DFORM_SCALAR  \
+| OPTION_MASK_P9_DFORM_VECTOR  \
+| OPTION_MASK_P9_MINMAX)
+
+/* Flags that need to be turned off if -mno-power8-vector.  */
+#define OTHER_P8_VECTOR_MASKS  (OTHER_P9_VECTOR_MASKS  \
+| OPTION_MASK_P9_VECTOR\
+| OPTION_MASK_DIRECT_MOVE  \
+| OPTION_MASK_CRYPTO   \
+| OPTION_MASK_UPPER_REGS_SF)   \
+
+/* Flags that need to be turned off if -mno-vsx.  */
+#define OTHER_VSX_VECTOR_MASKS (OTHER_P8_VECTOR_MASKS  \
+| OPTION_MASK_EFFICIENT_UNALIGNED_VSX  \
+| OPTION_MASK_FLOAT128_KEYWORD \
+| OPTION_MASK_FLOAT128_TYPE\
+| OPTION_MASK_P8_VECTOR\
+| OPTION_MASK_UPPER_REGS_DI\
+| OPTION_MASK_UPPER_REGS_DF\
+| OPTION_MASK_VSX_SMALL_INTEGER  

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Segher Boessenkool
On Wed, Apr 12, 2017 at 08:59:34PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> > On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > > On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > > > This is a fix for PR 80131 
> > > > Currently the code A << (B - C) is not simplified.
> > > > However at least a more specific case of 1U << (C -x) where C = 
> > > > precision(type) - 1 can be simplified to (1 << C) >> x.
> > > 
> > > Is that always a win though?
> > > Some constants have higher costs than others on various targets, some
> > > significantly higher.  This change might be beneficial only
> > > if C is as expensive as 1, then you get rid of one (typically cheap)
> > > operation.
> > > Which makes me wonder whether this should be done at GIMPLE time and not
> > > at RTL time (expansion or combine etc.) when one can compare the RTX 
> > > costs.
> > 
> > Yeah, either combine or simplify-rtx I'd say.
> > 
> > The transform A << (B - C)  --->  (A << B) >> C
> > only works if A << B does not overflow but A << (B + 1) does (and then
> 
> Is that really true?  The A << B does not overflow is obvious precondition.
> 
> But consider say A 5U, B 29 and C (not compile time known) -2.

Ah yes, A unsigned.  Bah.

> 5U << 31 is valid 0x8000U, but (5U << 29) >> (-2) is UB.
> Isn't the other condition instead that either C must be non-negative, or
> B must be number of bits in A's type - 1, i.e. that for negative C
> A << (B - C) would already be always UB?
> But then unless we know C is non-negative, A must be really just 1,
> otherwise A << B overflows.

Yeah.


Segher


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Jakub Jelinek
On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> Hi,
> 
> On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > > This is a fix for PR 80131 
> > > Currently the code A << (B - C) is not simplified.
> > > However at least a more specific case of 1U << (C -x) where C = 
> > > precision(type) - 1 can be simplified to (1 << C) >> x.
> > 
> > Is that always a win though?
> > Some constants have higher costs than others on various targets, some
> > significantly higher.  This change might be beneficial only
> > if C is as expensive as 1, then you get rid of one (typically cheap)
> > operation.
> > Which makes me wonder whether this should be done at GIMPLE time and not
> > at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> 
> Yeah, either combine or simplify-rtx I'd say.
> 
> The transform A << (B - C)  --->  (A << B) >> C
> only works if A << B does not overflow but A << (B + 1) does (and then

Is that really true?  The A << B does not overflow is obvious precondition.

But consider say A 5U, B 29 and C (not compile time known) -2.
5U << 31 is valid 0x8000U, but (5U << 29) >> (-2) is UB.
Isn't the other condition instead that either C must be non-negative, or
B must be number of bits in A's type - 1, i.e. that for negative C
A << (B - C) would already be always UB?
But then unless we know C is non-negative, A must be really just 1,
otherwise A << B overflows.

> always does work afaics).  Or if we know C is non-negative and A << B
> does not overflow.  So realistically A and B need to be constant.
> 
> > Or do this at match.pd as canonicalization and then have RTL transformation
> > to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> > shorter).
> 
> The inverse transform only works for A=1, not for the more general case.

Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Segher Boessenkool
Hi,

On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > This is a fix for PR 80131 
> > Currently the code A << (B - C) is not simplified.
> > However at least a more specific case of 1U << (C - x) where C = 
> > precision(type) - 1 can be simplified to (1 << C) >> x.
> 
> Is that always a win though?
> Some constants have higher costs than others on various targets, some
> significantly higher.  This change might be beneficial only
> if C is as expensive as 1; then you get rid of one (typically cheap)
> operation.
> Which makes me wonder whether this should be done not at GIMPLE time but
> at RTL time (expansion or combine etc.), when one can compare the RTX costs.

Yeah, either combine or simplify-rtx I'd say.

The transform A << (B - C)  --->   (A << B) >> C
only works if A << B does not overflow but A << (B + 1) does (and then
always does work afaics).  Or if we know C is non-negative and A << B
does not overflow.  So realistically A and B need to be constant.

> Or do this at match.pd as canonicalization and then have RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

The inverse transform only works for A=1, not for the more general case.


Segher


Re: [PATCH][libgcc, fuchsia]

2017-04-12 Thread Josh Conner via gcc-patches

Ping^3?

I think this should be very straightforward - it just adds fuchsia 
target support to libgcc. Please do let me know if there are any concerns...


Thanks!

- Josh


2017-04-12  Joshua Conner  

* config/arm/unwind-arm.h (_Unwind_decode_typeinfo_ptr): Use
pc-relative indirect handling for fuchsia.
* config/t-slibgcc-fuchsia: New file.
* config.host (*-*-fuchsia*, aarch64*-*-fuchsia*, arm*-*-fuchsia*,
x86_64-*-fuchsia*): Add definitions.


On 2/21/17 9:41 AM, Josh Conner wrote:

Ping^2?


On 2/2/17 11:22 AM, Josh Conner wrote:

Ping?


On 1/17/17 10:40 AM, Josh Conner wrote:

The attached patch adds fuchsia support to libgcc.

OK for trunk?

Thanks -

Josh

2017-01-17  Joshua Conner  

* config/arm/unwind-arm.h (_Unwind_decode_typeinfo_ptr): Use
pc-relative indirect handling for fuchsia.
* config/t-slibgcc-fuchsia: New file.
* config.host (*-*-fuchsia*, aarch64*-*-fuchsia*, arm*-*-fuchsia*,
x86_64-*-fuchsia*): Add definitions.







Index: config/arm/unwind-arm.h
===
--- config/arm/unwind-arm.h (revision 246880)
+++ config/arm/unwind-arm.h (working copy)
@@ -49,7 +49,7 @@
return 0;
 
 #if (defined(linux) && !defined(__uClinux__)) || defined(__NetBSD__) \
-|| defined(__FreeBSD__)
+|| defined(__FreeBSD__) || defined(__fuchsia__)
   /* Pc-relative indirect.  */
 #define _GLIBCXX_OVERRIDE_TTYPE_ENCODING (DW_EH_PE_pcrel | DW_EH_PE_indirect)
   tmp += ptr;
Index: config/t-slibgcc-fuchsia
===
--- config/t-slibgcc-fuchsia	(revision 0)
+++ config/t-slibgcc-fuchsia	(working copy)
@@ -0,0 +1,22 @@
+# Copyright (C) 2017 Free Software Foundation, Inc.
+#
+# This file is part of GCC.
+#
+# GCC is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3, or (at your option)
+# any later version.
+#
+# GCC is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with GCC; see the file COPYING3.  If not see
+# .
+
+# Fuchsia-specific shared library overrides.
+
+SHLIB_LDFLAGS = -Wl,--soname=$(SHLIB_SONAME) \
+$(LDFLAGS)
Index: config.host
===
--- config.host (revision 246880)
+++ config.host (working copy)
@@ -231,6 +231,10 @@
   ;;
   esac
   ;;
+*-*-fuchsia*)
+  tmake_file="$tmake_file t-crtstuff-pic t-libgcc-pic t-eh-dw2-dip t-slibgcc t-slibgcc-fuchsia"
+  extra_parts="crtbegin.o crtend.o"
+  ;;
 *-*-linux* | frv-*-*linux* | *-*-kfreebsd*-gnu | *-*-gnu* | *-*-kopensolaris*-gnu)
   tmake_file="$tmake_file t-crtstuff-pic t-libgcc-pic t-eh-dw2-dip t-slibgcc t-slibgcc-gld t-slibgcc-elf-ver t-linux"
   extra_parts="crtbegin.o crtbeginS.o crtbeginT.o crtend.o crtendS.o"
@@ -342,6 +346,10 @@
tmake_file="${tmake_file} ${cpu_type}/t-softfp t-softfp t-crtfm"
md_unwind_header=aarch64/freebsd-unwind.h
;;
+aarch64*-*-fuchsia*)
+   tmake_file="${tmake_file} ${cpu_type}/t-aarch64"
+   tmake_file="${tmake_file} ${cpu_type}/t-softfp t-softfp"
+   ;;
 aarch64*-*-linux*)
extra_parts="$extra_parts crtfastmath.o"
md_unwind_header=aarch64/linux-unwind.h
@@ -394,6 +402,12 @@
unwind_header=config/arm/unwind-arm.h
	tmake_file="${tmake_file} t-softfp-sfdf t-softfp-excl arm/t-softfp t-softfp"
;;
+arm*-*-fuchsia*)
+   tmake_file="${tmake_file} arm/t-arm arm/t-elf arm/t-bpabi"
+   tmake_file="${tmake_file} arm/t-softfp t-softfp"
+   tm_file="${tm_file} arm/bpabi-lib.h"
+   unwind_header=config/arm/unwind-arm.h
+   ;;
 arm*-*-netbsdelf*)
tmake_file="$tmake_file arm/t-arm arm/t-netbsd t-slibgcc-gld-nover"
;;
@@ -588,6 +602,9 @@
 x86_64-*-elf* | x86_64-*-rtems*)
tmake_file="$tmake_file i386/t-crtstuff t-crtstuff-pic t-libgcc-pic"
;;
+x86_64-*-fuchsia*)
+   tmake_file="$tmake_file t-libgcc-pic"
+   ;;
 i[34567]86-*-dragonfly*)
tmake_file="${tmake_file} i386/t-dragonfly i386/t-crtstuff"
md_unwind_header=i386/dragonfly-unwind.h


Re: [PATCH] Attempt harder to emit a cmove in emit_conditional_move (PR tree-optimization/79390)

2017-04-12 Thread Richard Biener
On April 12, 2017 6:33:19 PM GMT+02:00, Jakub Jelinek  wrote:
>Hi!
>
>As mentioned in the PR, for LU benchmark we generate worse code with
>-Ofast
>compared to -O3, because in the former we don't use a conditional move.
>
>The problem is during emit_conditional_move, while in both cases
>swap_commutative_operands_p (op2, op3) tells us it might be better to swap
>them, the difference between -Ofast and -O3 is that
>reversed_comparison_code_parts
>returns in -Ofast case LE, but in -O3 UNKNOWN.  For -O3 this means we don't
>swap o2 and o3, emit a GT and successfully emit a conditional move.
>For -Ofast, we swap the arguments and attempt to emit a LE, but the
>predicate on the cbranchdf4 expander fails in that case, because "LE" is not
>considered a trivial comparison operator and is more expensive; thus the only
>attempt we try fails and we don't emit a cmov at all.
>
>The following patch will do the same thing as previously, but if we were to
>return NULL_RTX, it will try to swap op2/op3 (either back, if we swapped
>them first, effectively returning to the original comparison, or if we
>haven't swapped them first, trying to reverse the comparison) and see if
>that leads to a usable sequence.
>
>The patch caused a FAIL in the pr70465-2.c testcase, where previously in ce1
>we weren't able to emit any cmov and now are, which a tiny bit changes the
>IL that gets through regstack and thus the testcase is no longer testing
>what it wanted - the %st(1) and %st comparison arguments are swapped and
>the fcmov operands too and thus an fxchg is seen as needed (before regstack
>it is impossible to find out which ordering is more beneficial).
>We could teach regstack to attempt to reverse the comparisons and swap
>corresponding fcmov arguments, but that does look like GCC8 material to me.
>
>Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

OK.

Richard.

>2017-04-12  Jakub Jelinek  
>
>   PR tree-optimization/79390
>   * optabs.c (emit_conditional_move): If the preferred op2/op3 operand
>   order does not result in usable sequence, retry with reversed operand
>   order.
>
>   * gcc.target/i386/pr70465-2.c: Xfail the scan-assembler-not test.
>
>--- gcc/optabs.c.jj2017-02-02 08:44:12.0 +0100
>+++ gcc/optabs.c   2017-04-12 14:30:14.905433771 +0200
>@@ -4258,12 +4258,15 @@ emit_conditional_move (rtx target, enum
>   if (cmode == VOIDmode)
> cmode = GET_MODE (op0);
> 
>+  enum rtx_code orig_code = code;
>+  bool swapped = false;
>   if (swap_commutative_operands_p (op2, op3)
> && ((reversed = reversed_comparison_code_parts (code, op0, op1, NULL))
>   != UNKNOWN))
> {
>   std::swap (op2, op3);
>   code = reversed;
>+  swapped = true;
> }
> 
>   if (mode == VOIDmode)
>@@ -4272,45 +4275,62 @@ emit_conditional_move (rtx target, enum
>   icode = direct_optab_handler (movcc_optab, mode);
> 
>   if (icode == CODE_FOR_nothing)
>-return 0;
>+return NULL_RTX;
> 
>   if (!target)
> target = gen_reg_rtx (mode);
> 
>-  code = unsignedp ? unsigned_condition (code) : code;
>-  comparison = simplify_gen_relational (code, VOIDmode, cmode, op0,
>op1);
>-
>-  /* We can get const0_rtx or const_true_rtx in some circumstances. 
>Just
>- return NULL and let the caller figure out how best to deal with
>this
>- situation.  */
>-  if (!COMPARISON_P (comparison))
>-return NULL_RTX;
>-
>-  saved_pending_stack_adjust save;
>-  save_pending_stack_adjust ();
>-  last = get_last_insn ();
>-  do_pending_stack_adjust ();
>-  prepare_cmp_insn (XEXP (comparison, 0), XEXP (comparison, 1),
>-  GET_CODE (comparison), NULL_RTX, unsignedp, OPTAB_WIDEN,
>-  , );
>-  if (comparison)
>+  for (int pass = 0; ; pass++)
> {
>-  struct expand_operand ops[4];
>+  code = unsignedp ? unsigned_condition (code) : code;
>+  comparison = simplify_gen_relational (code, VOIDmode, cmode,
>op0, op1);
> 
>-  create_output_operand ([0], target, mode);
>-  create_fixed_operand ([1], comparison);
>-  create_input_operand ([2], op2, mode);
>-  create_input_operand ([3], op3, mode);
>-  if (maybe_expand_insn (icode, 4, ops))
>+  /* We can get const0_rtx or const_true_rtx in some
>circumstances.  Just
>+   punt and let the caller figure out how best to deal with this
>+   situation.  */
>+  if (COMPARISON_P (comparison))
>   {
>-if (ops[0].value != target)
>-  convert_move (target, ops[0].value, false);
>-return target;
>+saved_pending_stack_adjust save;
>+save_pending_stack_adjust ();
>+last = get_last_insn ();
>+do_pending_stack_adjust ();
>+prepare_cmp_insn (XEXP (comparison, 0), XEXP (comparison, 1),
>+  GET_CODE (comparison), NULL_RTX, unsignedp,
>+  OPTAB_WIDEN, , );
>+if (comparison)
>+  {
>+

[Patch, GCC/ARM, gcc-5-branch] Fix PR68390 Incorrect code due to indirect tail call of varargs function with hard float ABI

2017-04-12 Thread Christophe Lyon
Hi,

It looks like we forgot to backport the fix for PR68390 to gcc-5-branch.
The patch applies cleanly, and fwiw we've had it in the linaro-5
branch for a while.

OK to apply to gcc-5-branch?

Thanks,

Christophe
2017-04-12  Christophe Lyon  

Backport from mainline
2015-11-23  Kugan Vivekanandarajah  

gcc/
PR target/68390
* config/arm/arm.c (arm_function_ok_for_sibcall): Get function type
for indirect function call.

gcc/testsuite/
PR target/68390
* gcc.c-torture/execute/pr68390.c: New test.

Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c	(revision 246880)
+++ gcc/config/arm/arm.c	(working copy)
@@ -6507,8 +6507,13 @@
 a VFP register but then need to transfer it to a core
 register.  */
   rtx a, b;
+  tree decl_or_type = decl;
 
-  a = arm_function_value (TREE_TYPE (exp), decl, false);
+  /* If it is an indirect function pointer, get the function type.  */
+  if (!decl)
+   decl_or_type = TREE_TYPE (TREE_TYPE (CALL_EXPR_FN (exp)));
+
+  a = arm_function_value (TREE_TYPE (exp), decl_or_type, false);
   b = arm_function_value (TREE_TYPE (DECL_RESULT (cfun->decl)),
  cfun->decl, false);
   if (!rtx_equal_p (a, b))
Index: gcc/testsuite/gcc.c-torture/execute/pr68390.c
===
--- gcc/testsuite/gcc.c-torture/execute/pr68390.c   (nonexistent)
+++ gcc/testsuite/gcc.c-torture/execute/pr68390.c   (working copy)
@@ -0,0 +1,27 @@
+/* { dg-do run }  */
+/* { dg-options "-O2" } */
+
+__attribute__ ((noinline))
+double direct(int x, ...)
+{
+  return x*x;
+}
+
+__attribute__ ((noinline))
+double broken(double (*indirect)(int x, ...), int v)
+{
+  return indirect(v);
+}
+
+int main ()
+{
+  double d1, d2;
+  int i = 2;
+  d1 = broken (direct, i);
+  if (d1 != i*i)
+{
+  __builtin_abort ();
+}
+  return 0;
+}
+


Re: [PATCH] Fix another fold-const.c type bug (PR sanitizer/80403)

2017-04-12 Thread Richard Biener
On April 12, 2017 6:12:57 PM GMT+02:00, Jakub Jelinek  wrote:
>Hi!
>
>Similarly to PR80349, we have other spots where we don't get the
>types right.  opN are the original args, argN is the same after
>STRIP_NOPS, so when we want an operand of type "type", we should use
>opN rather than argN (opN is a less expensive alternative to
>fold_convert-ing argN to type).
>
>Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

OK

Richard.

>2017-04-12  Jakub Jelinek  
>
>   PR sanitizer/80403
>   PR sanitizer/80404
>   PR sanitizer/80405
>   * fold-const.c (fold_ternary_loc): Use op1 instead of arg1 as argument
>   to fold_build2_loc.  Convert TREE_OPERAND (tem, 0) to type.  Use
>   op0 instead of fold_convert_loc (loc, type, arg0).
>
>   * g++.dg/ubsan/pr80403.C: New test.
>   * g++.dg/ubsan/pr80404.C: New test.
>   * g++.dg/ubsan/pr80405.C: New test.
>
>--- gcc/fold-const.c.jj2017-04-12 07:20:42.0 +0200
>+++ gcc/fold-const.c   2017-04-12 08:59:09.044260961 +0200
>@@ -11508,10 +11508,12 @@ fold_ternary_loc (location_t loc, enum t
> STRIP_NOPS (tem);
> if (TREE_CODE (tem) == RSHIFT_EXPR
> && tree_fits_uhwi_p (TREE_OPERAND (tem, 1))
>-  && (unsigned HOST_WIDE_INT) tree_log2 (arg1) ==
>-   tree_to_uhwi (TREE_OPERAND (tem, 1)))
>+  && (unsigned HOST_WIDE_INT) tree_log2 (arg1)
>+   == tree_to_uhwi (TREE_OPERAND (tem, 1)))
>   return fold_build2_loc (loc, BIT_AND_EXPR, type,
>-  TREE_OPERAND (tem, 0), arg1);
>+  fold_convert_loc (loc, type,
>+TREE_OPERAND (tem, 0)),
>+  op1);
>   }
> 
>   /* A & N ? N : 0 is simply A & N if N is a power of two.  This
>@@ -11542,7 +11544,7 @@ fold_ternary_loc (location_t loc, enum t
> && (code == VEC_COND_EXPR || !VECTOR_TYPE_P (type)))
>   return fold_build2_loc (loc, code == VEC_COND_EXPR ? BIT_AND_EXPR
>  : TRUTH_ANDIF_EXPR,
>-  type, fold_convert_loc (loc, type, arg0), arg1);
>+  type, op0, op1);
> 
> /* Convert A ? B : 1 into !A || B if A and B are truth values.  */
>if (code == VEC_COND_EXPR ? integer_all_onesp (op2) : integer_onep
>(op2)
>@@ -11558,7 +11560,7 @@ fold_ternary_loc (location_t loc, enum t
>? BIT_IOR_EXPR
>: TRUTH_ORIF_EXPR,
>   type, fold_convert_loc (loc, type, tem),
>-  arg1);
>+  op1);
>   }
> 
> /* Convert A ? 0 : B into !A && B if A and B are truth values.  */
>--- gcc/testsuite/g++.dg/ubsan/pr80403.C.jj2017-04-12
>08:52:20.954465589 +0200
>+++ gcc/testsuite/g++.dg/ubsan/pr80403.C   2017-04-12 08:52:00.0
>+0200
>@@ -0,0 +1,11 @@
>+// PR sanitizer/80403
>+// { dg-do compile }
>+// { dg-options "-fsanitize=undefined" }
>+
>+unsigned
>+foo ()
>+{
>+  unsigned a = (unsigned) (!(6044238 >> 0) >= (0 < 0)) % 0;   // {
>dg-warning "division by zero" }
>+  unsigned b = (unsigned) (!(6044238 >> 0) >= (0 < 0)) / 0;   // {
>dg-warning "division by zero" }
>+  return a + b;
>+}
>--- gcc/testsuite/g++.dg/ubsan/pr80404.C.jj2017-04-12
>09:08:43.497014011 +0200
>+++ gcc/testsuite/g++.dg/ubsan/pr80404.C   2017-04-12 09:07:26.0
>+0200
>@@ -0,0 +1,12 @@
>+// PR sanitizer/80404
>+// { dg-do compile }
>+// { dg-options "-fsanitize=undefined" }
>+
>+extern short v;
>+
>+unsigned
>+foo ()
>+{
>+  unsigned a = (0 < 0 >= (0 >= 0)) / (unsigned) v;
>+  return a;
>+}
>--- gcc/testsuite/g++.dg/ubsan/pr80405.C.jj2017-04-12
>09:08:46.663973725 +0200
>+++ gcc/testsuite/g++.dg/ubsan/pr80405.C   2017-04-12 09:07:53.0
>+0200
>@@ -0,0 +1,11 @@
>+// PR sanitizer/80405
>+// { dg-do compile }
>+// { dg-options "-fsanitize=undefined" }
>+
>+extern unsigned int v, w;
>+
>+void
>+foo ()
>+{
>+  w = (!~0 >= (unsigned) (0 < 0)) << v;
>+}
>
>
>   Jakub



Re: [wwwdocs,fortran] Update link to CHKSYS

2017-04-12 Thread Jerry DeLisle
On 04/11/2017 03:40 PM, Gerald Pfeifer wrote:
> This one has been failing for quite a while, and I found
>   http://flibs.sourceforge.net/chksys.html
> as a potential replacement link.
> 
> Thoughts?
> 

When I visit the suggested link I have to go up one level manually to get to a
page that has a link to the actual source code. Norton Safe Web, which is in my
router, declares the link to the source code there at sf.net to be unsafe since
it has found one computer virus residing out there somewhere.

I don't know how real that concern is. Regardless, it may be better to point
one level up to get to everything; one can then scroll down the page to find
the link to CHKSYS.

Just my thoughts.

Jerry


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Jakub Jelinek
On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> Hi all
> 
> This is a fix for PR 80131 
> Currently the code A << (B - C) is not simplified.
> However at least a more specific case of 1U << (C - x) where C = 
> precision(type) - 1 can be simplified to (1 << C) >> x.

Is that always a win though?
Some constants have higher costs than others on various targets, some
significantly higher.  This change might be beneficial only
if C is as expensive as 1; then you get rid of one (typically cheap)
operation.
Which makes me wonder whether this should be done not at GIMPLE time but
at RTL time (expansion or combine etc.), when one can compare the RTX costs.
Or do this at match.pd as canonicalization and then have RTL transformation
to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
shorter).

Jakub


[PATCH] Attempt harder to emit a cmove in emit_conditional_move (PR tree-optimization/79390)

2017-04-12 Thread Jakub Jelinek
Hi!

As mentioned in the PR, for LU benchmark we generate worse code with -Ofast
compared to -O3, because in the former we don't use a conditional move.

The problem is during emit_conditional_move, while in both cases
swap_commutative_operands_p (op2, op3) tells us it might be better to swap
them, the difference between -Ofast and -O3 is that 
reversed_comparison_code_parts
returns in -Ofast case LE, but in -O3 UNKNOWN.  For -O3 this means we don't
swap o2 and o3, emit a GT and successfully emit a conditional move.
For -Ofast, we swap the arguments and attempt to emit a LE, but the
predicate on the cbranchdf4 expander fails in that case, because "LE" is not
considered a trivial comparison operator and is more expensive; thus the only
attempt we try fails and we don't emit a cmov at all.

The following patch will do the same thing as previously, but if we were to
return NULL_RTX, it will try to swap op2/op3 (either back, if we swapped
them first, effectively returning to the original comparison, or if we
haven't swapped them first, trying to reverse the comparison) and see if
that leads to a usable sequence.

The patch caused a FAIL in the pr70465-2.c testcase, where previously in ce1
we weren't able to emit any cmov and now are, which a tiny bit changes the
IL that gets through regstack and thus the testcase is no longer testing
what it wanted - the %st(1) and %st comparison arguments are swapped and
the fcmov operands too and thus an fxchg is seen as needed (before regstack
it is impossible to find out which ordering is more beneficial).
We could teach regstack to attempt to reverse the comparisons and swap
corresponding fcmov arguments, but that does look like GCC8 material to me.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2017-04-12  Jakub Jelinek  

PR tree-optimization/79390
* optabs.c (emit_conditional_move): If the preferred op2/op3 operand
order does not result in usable sequence, retry with reversed operand
order.

* gcc.target/i386/pr70465-2.c: Xfail the scan-assembler-not test.

--- gcc/optabs.c.jj 2017-02-02 08:44:12.0 +0100
+++ gcc/optabs.c2017-04-12 14:30:14.905433771 +0200
@@ -4258,12 +4258,15 @@ emit_conditional_move (rtx target, enum
   if (cmode == VOIDmode)
 cmode = GET_MODE (op0);
 
+  enum rtx_code orig_code = code;
+  bool swapped = false;
   if (swap_commutative_operands_p (op2, op3)
   && ((reversed = reversed_comparison_code_parts (code, op0, op1, NULL))
   != UNKNOWN))
 {
   std::swap (op2, op3);
   code = reversed;
+  swapped = true;
 }
 
   if (mode == VOIDmode)
@@ -4272,45 +4275,62 @@ emit_conditional_move (rtx target, enum
   icode = direct_optab_handler (movcc_optab, mode);
 
   if (icode == CODE_FOR_nothing)
-return 0;
+return NULL_RTX;
 
   if (!target)
 target = gen_reg_rtx (mode);
 
-  code = unsignedp ? unsigned_condition (code) : code;
-  comparison = simplify_gen_relational (code, VOIDmode, cmode, op0, op1);
-
-  /* We can get const0_rtx or const_true_rtx in some circumstances.  Just
- return NULL and let the caller figure out how best to deal with this
- situation.  */
-  if (!COMPARISON_P (comparison))
-return NULL_RTX;
-
-  saved_pending_stack_adjust save;
-  save_pending_stack_adjust ();
-  last = get_last_insn ();
-  do_pending_stack_adjust ();
-  prepare_cmp_insn (XEXP (comparison, 0), XEXP (comparison, 1),
-   GET_CODE (comparison), NULL_RTX, unsignedp, OPTAB_WIDEN,
-   , );
-  if (comparison)
+  for (int pass = 0; ; pass++)
 {
-  struct expand_operand ops[4];
+  code = unsignedp ? unsigned_condition (code) : code;
+  comparison = simplify_gen_relational (code, VOIDmode, cmode, op0, op1);
 
-  create_output_operand ([0], target, mode);
-  create_fixed_operand ([1], comparison);
-  create_input_operand ([2], op2, mode);
-  create_input_operand ([3], op3, mode);
-  if (maybe_expand_insn (icode, 4, ops))
+  /* We can get const0_rtx or const_true_rtx in some circumstances.  Just
+punt and let the caller figure out how best to deal with this
+situation.  */
+  if (COMPARISON_P (comparison))
{
- if (ops[0].value != target)
-   convert_move (target, ops[0].value, false);
- return target;
+ saved_pending_stack_adjust save;
+ save_pending_stack_adjust ();
+ last = get_last_insn ();
+ do_pending_stack_adjust ();
+ prepare_cmp_insn (XEXP (comparison, 0), XEXP (comparison, 1),
+   GET_CODE (comparison), NULL_RTX, unsignedp,
+   OPTAB_WIDEN, , );
+ if (comparison)
+   {
+ struct expand_operand ops[4];
+
+ create_output_operand ([0], target, mode);
+ create_fixed_operand ([1], comparison);
+ create_input_operand ([2], op2, mode);
+   

Re: [PATCH] Fix fixincludes for canadian cross builds

2017-04-12 Thread Bruce Korb
I will be unable to look at this for a couple of weeks, so I leave
this to others to look at.

On Wed, Apr 12, 2017 at 8:58 AM, Yvan Roux  wrote:
> Hi,
>
> On 20 February 2017 at 18:53, Bruce Korb  wrote:
>> On 02/18/17 01:01, Bernd Edlinger wrote:
>>> On 02/18/17 00:37, Bruce Korb wrote:
 On 02/06/17 10:44, Bernd Edlinger wrote:
> I tested this change with different arm-linux-gnueabihf cross
> compilers, and verified that mkheaders still works on the host system.
>
> Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
> Is it OK for trunk?

 As long as you certify that this is correct for all systems we care about:

 +BUILD_SYSTEM_HEADER_DIR = `
 +echo $(CROSS_SYSTEM_HEADER_DIR) | \
 +sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta`

 that is pretty obtuse sed-speak to me.  I suggest a comment
 explaining what sed is supposed to be doing.  What should
 "$(CROSS_SYSTEM_HEADER_DIR)" look like?

>>>
>>> I took it just from a few lines above, so I thought that comment would
>>> sufficiently explain the syntax:
>>
>> I confess, I didn't pull a new copy of gcc, sorry.
>> So it looks good to me.
>
>
> We just noticed that this patch breaks Canadian cross builds when
> configured with --with-build-sysroot, since headers are searched in
> the target sysroot instead of the one specified for builds.
>
> Maybe there's a cleaner way to fix this and avoid the duplication, but
> I didn't find another way to test whether --with-build-sysroot is used.  The
> attached patch fixes the issue.  Tested with a full Canadian cross
> build for i686-w64-mingw32 host and arm-linux-gnueabihf.
>
> Thanks
> Yvan
>
> 2017-04-12  Yvan Roux  
>
>* Makefile.in (BUILD_SYSTEM_HEADER_DIR): Set to SYSTEM_HEADER_DIR
>when configured with --with-build-sysroot.
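The sed expression quoted earlier in this thread, `sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta`, loops (`:a` ... `ta`) deleting one `component/../` pair per iteration until none remain, canonicalizing the header path. A quick illustration with a hypothetical path:

```shell
# Collapse "dir/../" pairs the same way the gcc/Makefile.in rule does.
path='/opt/toolchain/bin/../lib/../include'
echo "$path" | sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta
# prints: /opt/toolchain/include
```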


[PATCH] Fix another fold-const.c type bug (PR sanitizer/80403)

2017-04-12 Thread Jakub Jelinek
Hi!

Similarly to PR80349, we have other spots where we don't get the
types right.  opN are the original args, argN is the same after
STRIP_NOPS, so when we want an operand of type "type", we should use
opN rather than argN (opN is a less expensive alternative to
fold_convert-ing argN to type).

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2017-04-12  Jakub Jelinek  

PR sanitizer/80403
PR sanitizer/80404
PR sanitizer/80405
* fold-const.c (fold_ternary_loc): Use op1 instead of arg1 as argument
to fold_build2_loc.  Convert TREE_OPERAND (tem, 0) to type.  Use
op0 instead of fold_convert_loc (loc, type, arg0).

* g++.dg/ubsan/pr80403.C: New test.
* g++.dg/ubsan/pr80404.C: New test.
* g++.dg/ubsan/pr80405.C: New test.

--- gcc/fold-const.c.jj 2017-04-12 07:20:42.0 +0200
+++ gcc/fold-const.c2017-04-12 08:59:09.044260961 +0200
@@ -11508,10 +11508,12 @@ fold_ternary_loc (location_t loc, enum t
  STRIP_NOPS (tem);
  if (TREE_CODE (tem) == RSHIFT_EXPR
  && tree_fits_uhwi_p (TREE_OPERAND (tem, 1))
-  && (unsigned HOST_WIDE_INT) tree_log2 (arg1) ==
-tree_to_uhwi (TREE_OPERAND (tem, 1)))
+  && (unsigned HOST_WIDE_INT) tree_log2 (arg1)
+== tree_to_uhwi (TREE_OPERAND (tem, 1)))
return fold_build2_loc (loc, BIT_AND_EXPR, type,
-   TREE_OPERAND (tem, 0), arg1);
+   fold_convert_loc (loc, type,
+ TREE_OPERAND (tem, 0)),
+   op1);
}
 
   /* A & N ? N : 0 is simply A & N if N is a power of two.  This
@@ -11542,7 +11544,7 @@ fold_ternary_loc (location_t loc, enum t
  && (code == VEC_COND_EXPR || !VECTOR_TYPE_P (type)))
return fold_build2_loc (loc, code == VEC_COND_EXPR ? BIT_AND_EXPR
   : TRUTH_ANDIF_EXPR,
-   type, fold_convert_loc (loc, type, arg0), arg1);
+   type, op0, op1);
 
   /* Convert A ? B : 1 into !A || B if A and B are truth values.  */
   if (code == VEC_COND_EXPR ? integer_all_onesp (op2) : integer_onep (op2)
@@ -11558,7 +11560,7 @@ fold_ternary_loc (location_t loc, enum t
 ? BIT_IOR_EXPR
 : TRUTH_ORIF_EXPR,
type, fold_convert_loc (loc, type, tem),
-   arg1);
+   op1);
}
 
   /* Convert A ? 0 : B into !A && B if A and B are truth values.  */
--- gcc/testsuite/g++.dg/ubsan/pr80403.C.jj 2017-04-12 08:52:20.954465589 
+0200
+++ gcc/testsuite/g++.dg/ubsan/pr80403.C2017-04-12 08:52:00.0 
+0200
@@ -0,0 +1,11 @@
+// PR sanitizer/80403
+// { dg-do compile }
+// { dg-options "-fsanitize=undefined" }
+
+unsigned
+foo ()
+{
+  unsigned a = (unsigned) (!(6044238 >> 0) >= (0 < 0)) % 0;// { dg-warning 
"division by zero" }
+  unsigned b = (unsigned) (!(6044238 >> 0) >= (0 < 0)) / 0;// { dg-warning 
"division by zero" }
+  return a + b;
+}
--- gcc/testsuite/g++.dg/ubsan/pr80404.C.jj 2017-04-12 09:08:43.497014011 
+0200
+++ gcc/testsuite/g++.dg/ubsan/pr80404.C2017-04-12 09:07:26.0 
+0200
@@ -0,0 +1,12 @@
+// PR sanitizer/80404
+// { dg-do compile }
+// { dg-options "-fsanitize=undefined" }
+
+extern short v;
+
+unsigned
+foo ()
+{
+  unsigned a = (0 < 0 >= (0 >= 0)) / (unsigned) v;
+  return a;
+}
--- gcc/testsuite/g++.dg/ubsan/pr80405.C.jj 2017-04-12 09:08:46.663973725 
+0200
+++ gcc/testsuite/g++.dg/ubsan/pr80405.C2017-04-12 09:07:53.0 
+0200
@@ -0,0 +1,11 @@
+// PR sanitizer/80405
+// { dg-do compile }
+// { dg-options "-fsanitize=undefined" }
+
+extern unsigned int v, w;
+
+void
+foo ()
+{
+  w = (!~0 >= (unsigned) (0 < 0)) << v;
+}


Jakub


Fix SH port failure in delay slot scheduling

2017-04-12 Thread Jeff Law


The SH port has a delay slot description like this:

;; Conditional branches with delay slots are available starting with SH2.
;; If zero displacement conditional branches are fast, disable the delay
;; slot if the branch jumps over only one 2-byte insn.
(define_delay
  (and (eq_attr "type" "cbranch")
   (match_test "TARGET_SH2")
   (not (and (match_test "TARGET_ZDCBRANCH")
 (match_test "sh_cbranch_distance (insn, 4) == 2"
  [(eq_attr "cond_delay_slot" "yes") (nil) (nil)])



What's interesting here is that whether or not a particular insn has a delay 
slot depends on nearby insns *and* can change within the reorg 
pass itself.  This can cause assert failures within 
write_eligible_for_delay.


It's been 20+ years since I was deep into the delay slot scheduler, but 
my recollection is this doesn't consistently work and I could argue that 
this delay slot description is fundamentally broken.


While I can fix this specific assertion failure, I would not at all be 
surprised if there are other issues lurking.  The port maintainers should 
be on notice that this description may need to be adjusted in the future 
to remove the distance test.


Addressing this specific failure can be done by verifying the given insn 
still has a delay slot in eligible_for_delay*.   That's enough to get 
the SH to build libgcc/newlib.  I've also verified other ports that use 
delay slots such as the PA & Sparc can build glibc and newlib as 
appropriate.


Installing on the trunk.

Jeff
commit 294e866ca9e7e59f5cd637e5b746828c614e6bc5
Author: law 
Date:   Wed Apr 12 16:08:18 2017 +

* genattrtab.c (write_eligible_delay): Verify DELAY_INSN still
has a delay slot in the generated code.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@246879 
138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index fc0becf..89af9cc 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,5 +1,8 @@
 2017-04-12  Jeff Law  
 
+   * genattrtab.c (write_eligible_delay): Verify DELAY_INSN still
+   has a delay slot in the generated code.
+
* config/cris/cris.md (cris_preferred_reload_class): Return
GENNONACR_REGS rather than GENERAL_REGS.
 
diff --git a/gcc/genattrtab.c b/gcc/genattrtab.c
index cd4e668..3629b5f 100644
--- a/gcc/genattrtab.c
+++ b/gcc/genattrtab.c
@@ -4416,6 +4416,9 @@ write_eligible_delay (FILE *outf, const char *kind)
   fprintf (outf, "{\n");
   fprintf (outf, "  rtx_insn *insn ATTRIBUTE_UNUSED;\n");
   fprintf (outf, "\n");
+  fprintf (outf, "  if (num_delay_slots (delay_insn) == 0)\n");
+  fprintf (outf, "return 0;");
+  fprintf (outf, "\n");
   fprintf (outf, "  gcc_assert (slot < %d);\n", max_slots);
   fprintf (outf, "\n");
   /* Allow dbr_schedule to pass labels, etc.  This can happen if try_split


Re: [PATCH] Fix fixincludes for canadian cross builds

2017-04-12 Thread Yvan Roux
Hi,

On 20 February 2017 at 18:53, Bruce Korb  wrote:
> On 02/18/17 01:01, Bernd Edlinger wrote:
>> On 02/18/17 00:37, Bruce Korb wrote:
>>> On 02/06/17 10:44, Bernd Edlinger wrote:
 I tested this change with different arm-linux-gnueabihf cross
 compilers, and verified that mkheaders still works on the host system.

 Bootstrapped and reg-tested on x86_64-pc-linux-gnu.
 Is it OK for trunk?
>>>
>>> As long as you certify that this is correct for all systems we care about:
>>>
>>> +BUILD_SYSTEM_HEADER_DIR = `
>>> +echo $(CROSS_SYSTEM_HEADER_DIR) | \
>>> +sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta`
>>>
>>> that is pretty obtuse sed-speak to me.  I suggest a comment
>>> explaining what sed is supposed to be doing.  What should
>>> "$(CROSS_SYSTEM_HEADER_DIR)" look like?
>>>
>>
>> I took it just from a few lines above, so I thought that comment would
>> sufficiently explain the syntax:
>
> I confess, I didn't pull a new copy of gcc, sorry.
> So it looks good to me.


We just noticed that this patch breaks Canadian cross builds when
configured with --with-build-sysroot, since headers are searched in
the target sysroot instead of the one specified for builds.

Maybe there's a cleaner way to fix this and avoid the duplication, but
I didn't find another way to test whether --with-build-sysroot is used.  The
attached patch fixes the issue.  Tested with a full Canadian cross
build for i686-w64-mingw32 host and arm-linux-gnueabihf.

Thanks
Yvan

2017-04-12  Yvan Roux  

   * Makefile.in (BUILD_SYSTEM_HEADER_DIR): Set to SYSTEM_HEADER_DIR
   when configured with --with-build-sysroot.
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index e38b726..7aed942 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
SYSTEM_HEADER_DIR = `echo @SYSTEM_HEADER_DIR@ | sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta`
 # Path to the system headers on the build machine
 ifeq ($(build),$(host))
 BUILD_SYSTEM_HEADER_DIR = $(SYSTEM_HEADER_DIR)
+else ifdef SYSROOT_CFLAGS_FOR_TARGET
+BUILD_SYSTEM_HEADER_DIR = $(SYSTEM_HEADER_DIR)
 else
 BUILD_SYSTEM_HEADER_DIR = `echo $(CROSS_SYSTEM_HEADER_DIR) | sed -e :a -e 
's,[^/]*/\.\.\/,,' -e ta`
 endif
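
For reference, the sed loop that both SYSTEM_HEADER_DIR and the cross
variant reuse can be illustrated stand-alone (a sketch with a made-up
path, not part of the patch): it repeatedly strips "dir/../" components
until none remain.

```shell
# Illustrative only: collapse "dir/../" components the way the
# Makefile's sed loop does (the :a label and t branch re-run the
# substitution until it stops matching).
path='/opt/sysroot/usr/../include'
collapsed=$(echo "$path" | sed -e :a -e 's,[^/]*/\.\.\/,,' -e ta)
echo "$collapsed"   # /opt/sysroot/include
```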


[wwwdocs] update primary platform name for gcc-7

2017-04-12 Thread Jonathan Wakely

The default config.guess for x86_64 GNU/Linux now uses "pc" not
"unknown" so update the release criteria accordingly.

Committed to cvs.
Index: htdocs/gcc-7/criteria.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-7/criteria.html,v
retrieving revision 1.3
diff -u -r1.3 criteria.html
--- htdocs/gcc-7/criteria.html  4 Mar 2017 16:42:50 -   1.3
+++ htdocs/gcc-7/criteria.html  12 Apr 2017 15:45:49 -
@@ -108,7 +108,7 @@
 mipsisa64-elf
 powerpc64-unknown-linux-gnu
 sparc-sun-solaris2.10
-x86_64-unknown-linux-gnu
+x86_64-pc-linux-gnu
 
 
 Secondary Platform List


[PATCH 2/2] arc: Fix for loop end detection

2017-04-12 Thread Andrew Burgess
We use a negative ID number to link together the doloop_begin and
doloop_end instructions.  This negative ID number is set up within
doloop_begin; at that point the ID is stored into the loop end
instruction (doloop_end_i) and placed into the doloop_begin_i
instruction.

In arc.c (arc_reorg) we extract the ID from the doloop_end_i
instruction in order to find the matching doloop_begin_i instruction,
though the ID is only used in some cases.

Currently, when we extract the ID in arc_reorg, we negate it.  This
negation is invalid: the ID stored in both doloop_end_i and
doloop_begin_i is already negative, so the negation in arc_reorg means
that if we need to use the ID to find the doloop_begin_i we will
never find it (the IDs will never match).

This commit removes the unneeded negation, moves the extraction of the
ID into a more appropriately scoped block and adds a new test for this
issue.

gcc/ChangeLog:

* config/arc/arc.c (arc_reorg): Move loop_end_id into a more local
block, and do not negate it, the stored id is already negative.

gcc/testsuite/ChangeLog:

* gcc.target/arc/loop-1.c: New file.
---
 gcc/ChangeLog |  6 +
 gcc/config/arc/arc.c  |  5 ++--
 gcc/testsuite/ChangeLog   |  4 
 gcc/testsuite/gcc.target/arc/loop-1.c | 45 +++
 4 files changed, 58 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arc/loop-1.c

diff --git a/gcc/config/arc/arc.c b/gcc/config/arc/arc.c
index 0563a74..39b198d 100644
--- a/gcc/config/arc/arc.c
+++ b/gcc/config/arc/arc.c
@@ -6573,8 +6573,6 @@ arc_reorg (void)
  rtx_insn *lp_simple = NULL;
  rtx_insn *next = NULL;
  rtx op0 = XEXP (XVECEXP (PATTERN (insn), 0, 1), 0);
- HOST_WIDE_INT loop_end_id
-   = -INTVAL (XEXP (XVECEXP (PATTERN (insn), 0, 4), 0));
  int seen_label = 0;
 
  for (lp = prev;
@@ -6585,6 +6583,9 @@ arc_reorg (void)
  if (!lp || !NONJUMP_INSN_P (lp)
  || dead_or_set_regno_p (lp, LP_COUNT))
{
+ HOST_WIDE_INT loop_end_id
+   = INTVAL (XEXP (XVECEXP (PATTERN (insn), 0, 4), 0));
+
  for (prev = next = insn, lp = NULL ; prev || next;)
{
  if (prev)
diff --git a/gcc/testsuite/gcc.target/arc/loop-1.c 
b/gcc/testsuite/gcc.target/arc/loop-1.c
new file mode 100644
index 000..1afe8eb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/loop-1.c
@@ -0,0 +1,45 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+/* This case would fail to make use of the zero-overhead loop
+   instruction at one time due to a bug.  */
+
+extern char a[];
+
+struct some_t
+{
+  struct
+  {
+int aaa;
+short bbb;
+char ccc;
+char ddd;
+  } ppp[8];
+
+  int www[1];
+};
+
+int b;
+
+void
+some_function ()
+{
+  struct some_t *tmp = (struct some_t *) a;
+
+  while ((*tmp).ppp[b].ccc)
+while(0);
+
+  for (; b; b++)
+{
+  if (tmp->ppp[b].ccc)
+{
+  int c = tmp->ppp[b].bbb;
+  int d = tmp->ppp[b].aaa;
+  int e = d - tmp->www[c];
+  if (e)
+tmp->ppp[b].ddd = 1;
+}
+}
+}
+
+/* { dg-final { scan-assembler "\[^\n\]+lp \\.L__GCC__" } } */
-- 
2.4.11



[PATCH 0/2] ARC Zero Overhead Loop Fixes

2017-04-12 Thread Andrew Burgess
Found two issues with the ARC loop detection.  The first generates
code that the current assembler can't handle, while the second causes
some loops to be missed.

--

Andrew Burgess (2):
  arc: Use @pcl assembler syntax instead of invalid expressions
  arc: Fix for loop end detection

 gcc/ChangeLog | 10 
 gcc/config/arc/arc.c  |  5 ++--
 gcc/config/arc/arc.md |  2 +-
 gcc/testsuite/ChangeLog   |  4 
 gcc/testsuite/gcc.target/arc/loop-1.c | 45 +++
 5 files changed, 63 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arc/loop-1.c

-- 
2.4.11



[PATCH 1/2] arc: Use @pcl assembler syntax instead of invalid expressions

2017-04-12 Thread Andrew Burgess
The old ARC assembler would accept expressions like 'LABEL-(.&-4)',
which would calculate the offset from the PCL to LABEL.  The new ARC
assembler does not accept these expressions; instead there's an @pcl
syntax, used like LABEL@pcl, which gives the offset from PCL to LABEL.

Most uses of the old expression syntax have been removed; however,
this one was missed.

gcc/ChangeLog:

* config/arc/arc.md (doloop_begin_i): Use @pcl assembler syntax.
---
 gcc/ChangeLog | 4 
 gcc/config/arc/arc.md | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md
index 88b7fca..f707bd0 100644
--- a/gcc/config/arc/arc.md
+++ b/gcc/config/arc/arc.md
@@ -5070,7 +5070,7 @@
 {
   /* ??? Can do better for when a scratch register
 is known.  But that would require extra testing.  */
-  return "push_s r0\;add r0,pcl,%4-(.&-4)\;sr r0,[2]; LP_START\;add 
r0,pcl,.L__GCC__LP%1-(.&-4)\;sr r0,[3]; LP_END\;pop_s r0";
+  return "push_s r0\;add r0,pcl,%4@pcl\;sr r0,[2]; LP_START\;add 
r0,pcl,.L__GCC__LP%1@pcl\;sr r0,[3]; LP_END\;pop_s r0";
 }
   /* Check if the loop end is in range to be set by the lp instruction.  */
   size = INTVAL (operands[3]) < 2 ? 0 : 2048;
-- 
2.4.11



Fix cris/crisv32 port

2017-04-12 Thread Jeff Law

The cris ports are unable to build newlib due to a reload failure.

What happens is that we need to reload an auto-inc memory reference.  The 
preferred class is GENERAL_REGS.  The register we happen to select is 
ACR, but ACR cannot be used in an auto-inc addressing mode.


This can be easily fixed by returning GENNONACR_REGS from the preferred 
reload class hook instead of GENERAL_REGS.


It may be advisable to also define the LIMIT_RELOAD_CLASS hook.  I 
haven't done that, but the port maintainers should seriously consider it.


With this change cris and crisv32 both build newlib successfully.

Installed on the trunk.

Jeff
diff --git a/gcc/config/cris/cris.c b/gcc/config/cris/cris.c
index 21137bd..8c134a6 100644
--- a/gcc/config/cris/cris.c
+++ b/gcc/config/cris/cris.c
@@ -1597,7 +1597,7 @@ cris_preferred_reload_class (rtx x ATTRIBUTE_UNUSED, 
reg_class_t rclass)
   && rclass != SRP_REGS
   && rclass != CC0_REGS
   && rclass != SPECIAL_REGS)
-return GENERAL_REGS;
+return GENNONACR_REGS;
 
   return rclass;
 }


[PATCH] Fix PR80359

2017-04-12 Thread Richard Biener

I am testing the follow^W^W^WJeff has tested the following,
applied to trunk.

Richard.

2017-04-12  Richard Biener  
Jeff Law  

PR tree-optimization/80359
* tree-ssa-dse.c (maybe_trim_partially_dead_store): Do not
trim stores to TARGET_MEM_REFs.

* gcc.dg/torture/pr80359.c: New testcase.

Index: gcc/tree-ssa-dse.c
===
*** gcc/tree-ssa-dse.c  (revision 246871)
--- gcc/tree-ssa-dse.c  (working copy)
*** maybe_trim_memstar_call (ao_ref *ref, sb
*** 451,457 
  static void
  maybe_trim_partially_dead_store (ao_ref *ref, sbitmap live, gimple *stmt)
  {
!   if (is_gimple_assign (stmt))
  {
switch (gimple_assign_rhs_code (stmt))
{
--- 451,458 
  static void
  maybe_trim_partially_dead_store (ao_ref *ref, sbitmap live, gimple *stmt)
  {
!   if (is_gimple_assign (stmt)
!   && TREE_CODE (gimple_assign_lhs (stmt)) != TARGET_MEM_REF)
  {
switch (gimple_assign_rhs_code (stmt))
{
Index: gcc/testsuite/gcc.dg/torture/pr80359.c
===
*** gcc/testsuite/gcc.dg/torture/pr80359.c  (nonexistent)
--- gcc/testsuite/gcc.dg/torture/pr80359.c  (working copy)
***
*** 0 
--- 1,12 
+ /* { dg-do compile } */
+ 
+ void FFT(_Complex *X, int length)
+ {
+   unsigned i, j;
+   for (; i < length; i++)
+ {
+   X[i] = 0;
+   for (j = 0; j < length; j++)
+   X[i] = X[i] / length;
+ }
+ }


[PATCH][AArch64] Improve address cost for -mcpu=generic

2017-04-12 Thread Wilco Dijkstra
All cores which add a cpu_addrcost_table use a non-zero value for
HI and TI mode shifts (a non-zero value for general indexing also 
applies to all shifts).  Given this, it makes no sense to use a
different setting in generic_addrcost_table.  So change it so that all
supported cores, including -mcpu=generic, now generate the same:

int f(short *p, short *q, long x) { return p[x] + q[x]; }

lsl x2, x2, 1
ldrsh   w3, [x0, x2]
ldrsh   w0, [x1, x2]
add w0, w3, w0
ret

Bootstrapped for AArch64. Any comments? OK for stage 1?

ChangeLog:
2017-04-12  Wilco Dijkstra  

* gcc/config/aarch64/aarch64.c (generic_addrcost_table):
Change HI/TI mode setting.

---
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
419b756efcb40e48880cd4529efc4f9f59938325..728ce7029f1e2b5161d9f317d10e564dd5a5f472
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -193,10 +193,10 @@ static const struct aarch64_flag_desc 
aarch64_tuning_flags[] =
 static const struct cpu_addrcost_table generic_addrcost_table =
 {
 {
-  0, /* hi  */
+  1, /* hi  */
   0, /* si  */
   0, /* di  */
-  0, /* ti  */
+  1, /* ti  */
 },
   0, /* pre_modify  */
   0, /* post_modify  */



Re: OpenACC 2.5 default (present) clause

2017-04-12 Thread Thomas Schwinge
Hi!

On Fri, 07 Apr 2017 17:08:55 +0200, I wrote:
> OpenACC 2.5 added a default (present) clause, which "causes all arrays or
> variables of aggregate data type used in the compute construct that have
> implicitly determined data attributes to be treated as if they appeared
> in a present clause".  Preceded by the following cleanup patch (see
>  for its
> origin), OK for trunk in next stage 1?

(Jakub asked me to ping this in next stage 1.)

> For now committed to gomp-4_0-branch in r246763:

> --- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c

> +#pragma acc enter data create (a)
> +
> +#pragma acc parallel default (present)
> +{
> +  for (int j = 0; j < N; ++j)
> + a[j] = j - 1;
> +}
> +
> +#pragma acc update host (a)
> +
> +for (i = 0; i < N; ++i)
> +  {
> +if (a[i] != i - 1)
> +   abort ();
> +  }
> +
> +#pragma acc kernels default (present)
> +{
> +  for (int j = 0; j < N; ++j)
> + a[j] = j - 2;
> +}
> +
> +#pragma acc exit data copyout (a)
> +
> +for (i = 0; i < N; ++i)
> +  {
> +if (a[i] != i - 2)
> +   abort ();
> +  }

In our PowerPC testing, this change triggered linking failures (for host
compilation) and ICEs (for offloading compilation).
 "DCE vs. offloading" filed (here, we got:
"char a[N]", causing GCC to figure out that "(char) -1 != (int) -1", thus
unconditionally "abort", thus the following OpenACC kernels getting DCEd,
thus PR80411), and committed to gomp-4_0-branch in r246871:

commit 6cf38f5ba09e4152336cf94d0eb9c80db9ffc024
Author: tschwinge 
Date:   Wed Apr 12 12:54:10 2017 +

Fix libgomp.oacc-c-c++-common/nested-2.c: "char" might mean "unsigned char"

libgomp/
* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Respect that
"char" might mean "unsigned char".

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@246871 
138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp | 5 +
 libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c | 8 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 873668e..32cb7e8 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2017-04-12  Thomas Schwinge  
+
+   * testsuite/libgomp.oacc-c-c++-common/nested-2.c: Respect that
+   "char" might mean "unsigned char".
+
 2017-04-07  Thomas Schwinge  
 
* testsuite/libgomp.oacc-c++/template-reduction.C: Update.
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c 
libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
index e8ead3d..51b3b18 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
@@ -143,28 +143,28 @@ main (int argc, char *argv[])
 #pragma acc parallel default (present)
 {
   for (int j = 0; j < N; ++j)
-   a[j] = j - 1;
+   a[j] = j + 1;
 }
 
 #pragma acc update host (a)
 
 for (i = 0; i < N; ++i)
   {
-if (a[i] != i - 1)
+if (a[i] != i + 1)
  abort ();
   }
 
 #pragma acc kernels default (present)
 {
   for (int j = 0; j < N; ++j)
-   a[j] = j - 2;
+   a[j] = j + 2;
 }
 
 #pragma acc exit data copyout (a)
 
 for (i = 0; i < N; ++i)
   {
-if (a[i] != i - 2)
+if (a[i] != i + 2)
  abort ();
   }
 


Regards
 Thomas


[PATCH][ARM] Update max_cond_insns settings

2017-04-12 Thread Wilco Dijkstra
The existing setting of max_cond_insns for most cores is non-optimal.
Thumb-2 IT has a maximum limit of 4 instructions, so a value of 5 means
emitting 2 IT sequences.  Such long sequences of conditional instructions
can also increase the number of executed instructions significantly, so
5 is a poor choice for max_cond_insns.

Previous benchmarking showed that setting max_cond_insn to 2 was the best value
for Cortex-A15 and Cortex-A57.  All ARMv8-A cores use 2 - apart from Cortex-A35
and Cortex-A53.  Given that using 5 is worse, set it to 2.  This also has the
advantage of producing more uniform code.

Bootstrap and regress OK on arm-none-linux-gnueabihf.

OK for stage 1?

ChangeLog:
2017-04-12  Wilco Dijkstra  

* gcc/config/arm/arm.c (arm_cortex_a53_tune): Set max_cond_insns to 2.
(arm_cortex_a35_tune): Likewise.
---

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 
29e8d1d07d918fbb2a627a653510dfc8587ee01a..1a6d552aa322114795acbb3667c6ea36963bf193
 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -1967,7 +1967,7 @@ const struct tune_params arm_cortex_a35_tune =
   arm_default_branch_cost,
   _default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   1,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
@@ -1990,7 +1990,7 @@ const struct tune_params arm_cortex_a53_tune =
   arm_default_branch_cost,
   _default_vec_cost,
   1,   /* Constant limit.  */
-  5,   /* Max cond insns.  */
+  2,   /* Max cond insns.  */
   8,   /* Memset max inline.  */
   2,   /* Issue rate.  */
   ARM_PREFETCH_NOT_BENEFICIAL,

[PATCH][AArch64] Update alignment for -mcpu=generic

2017-04-12 Thread Wilco Dijkstra
With -mcpu=generic the loop alignment is currently 4.  All but one of the
supported cores use 8 or higher.  Since using 8 provides performance gains
on several cores, it is best to use that by default.  As discussed in [1],
the jump alignment has no effect on performance, yet has a relatively high
codesize cost [2], so setting it to 4 is best.  This gives a 0.2% overall 
codesize improvement as well as performance gains in several benchmarks.
Any objections?

Bootstrap OK on AArch64, OK for stage 1?

ChangeLog:
2017-04-12  Wilco Dijkstra  

* config/aarch64/aarch64.c (generic_tunings): Set jump alignment to 4.
Set loop alignment to 8.

[1] https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00574.html
[2] https://gcc.gnu.org/ml/gcc-patches/2016-06/msg02075.html

---
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
c8cf7169a5d387de336920b50c83761dc0c96f3a..8b729b1b1f87316e940d7fc657f235a935ffa93e
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -538,8 +538,8 @@ static const struct tune_params generic_tunings =
   2, /* issue_rate  */
   (AARCH64_FUSE_AES_AESMC), /* fusible_ops  */
   8,   /* function_align.  */
-  8,   /* jump_align.  */
-  4,   /* loop_align.  */
+  4,   /* jump_align.  */
+  8,   /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
   1,   /* vec_reassoc_width.  */


[PATCH][AArch64] Set jump alignment to 4 for Cortex cores

2017-04-12 Thread Wilco Dijkstra
Set jump alignment to 4 for Cortex cores as it reduces codesize by 0.4% on
average with no obvious performance difference.  See the original discussion
of the overheads of various alignments:
https://gcc.gnu.org/ml/gcc-patches/2016-06/msg02075.html

Bootstrap OK, OK for stage 1?

ChangeLog:
2017-04-12  Wilco Dijkstra  

* config/aarch64/aarch64.c (cortexa35_tunings): Set jump alignment to 4.
(cortexa53_tunings): Likewise.
(cortexa57_tunings): Likewise.
(cortexa72_tunings): Likewise.
(cortexa73_tunings): Likewise.

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 
a6004e6e283ba7157e65b678cf668f8a47e21abb..a8b3a29dd2e242a35f37b8c6a6fb30699ace5e01
 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -564,7 +564,7 @@ static const struct tune_params cortexa35_tunings =
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
| AARCH64_FUSE_MOVK_MOVK | AARCH64_FUSE_ADRP_LDR), /* fusible_ops  */
   16,  /* function_align.  */
-  8,   /* jump_align.  */
+  4,   /* jump_align.  */
   8,   /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
@@ -590,7 +590,7 @@ static const struct tune_params cortexa53_tunings =
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
| AARCH64_FUSE_MOVK_MOVK | AARCH64_FUSE_ADRP_LDR), /* fusible_ops  */
   16,  /* function_align.  */
-  8,   /* jump_align.  */
+  4,   /* jump_align.  */
   8,   /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
@@ -616,7 +616,7 @@ static const struct tune_params cortexa57_tunings =
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
| AARCH64_FUSE_MOVK_MOVK), /* fusible_ops  */
   16,  /* function_align.  */
-  8,   /* jump_align.  */
+  4,   /* jump_align.  */
   8,   /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
@@ -642,7 +642,7 @@ static const struct tune_params cortexa72_tunings =
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
| AARCH64_FUSE_MOVK_MOVK), /* fusible_ops  */
   16,  /* function_align.  */
-  8,   /* jump_align.  */
+  4,   /* jump_align.  */
   8,   /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */
@@ -668,7 +668,7 @@ static const struct tune_params cortexa73_tunings =
   (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
| AARCH64_FUSE_MOVK_MOVK | AARCH64_FUSE_ADRP_LDR), /* fusible_ops  */
   16,  /* function_align.  */
-  8,   /* jump_align.  */
+  4,   /* jump_align.  */
   8,   /* loop_align.  */
   2,   /* int_reassoc_width.  */
   4,   /* fp_reassoc_width.  */


Re: [PR 80293] Don't totally-sRA char arrays

2017-04-12 Thread Richard Biener
On Wed, 12 Apr 2017, Martin Jambor wrote:

> Hi,
> 
> the patch below is an attempt to deal with PR 80293 as non-invasively
> as possible.  Basically, it switches off total SRA scalarization of
> any local aggregates which contains an array of elements that have one
> byte (or less).
> 
> The logic behind this is that accessing such arrays element-wise
> usually results in poor code and that such char arrays are often used
> for non-statically-typed content anyway, and we do not want to copy
> that byte per byte.
> 
> Alan, do you think this could impact your constant pool scalarization
> too severely?

Hmm, isn't one of the issues that we have

if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var)))
  {
if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
<= max_scalarization_size)
  {
create_total_scalarization_access (var);

which limits the size of scalarizable vars but not the number
of accesses we create for total scalarization?

Is scalarizable_type_p only used in contexts where we have no hint
of the actual accesses?  That is, for the constant pool case we
usually have

  x = .LC0;
  .. = x[2];

so we have a "hint" that accesses on x are those we'd want to
optimize to accesses to .LC0.  If we have no accesses on x
then we can as well scalarize using word_mode for example?

> Richi, if you or Alan does not object in a few days, I'd like to
> commit this in time for gcc7.  It has passed bootstrap and testing on
> x86_64-linux (but the constant pool SRA work was aimed primarily at
> ARM).

Maybe we can -- if this is the case here -- not completely scalarize
in case we don't know how the destination of the aggregate copy
is used?

> Thanks,
> 
> Martin
> 
> 
> 2017-04-10  Martin Jambor  
> 
>   * tree-sra.c (scalarizable_type_p): Make char arrays not totally
>   scalarizable.
> 
> testsuite/
>   * g++.dg/tree-ssa/pr80293.C: New test.
> ---
>  gcc/testsuite/g++.dg/tree-ssa/pr80293.C | 45 
> +
>  gcc/tree-sra.c  |  2 +-
>  2 files changed, 46 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/g++.dg/tree-ssa/pr80293.C
> 
> diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr80293.C 
> b/gcc/testsuite/g++.dg/tree-ssa/pr80293.C
> new file mode 100644
> index 000..7faf35ae983
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/tree-ssa/pr80293.C
> @@ -0,0 +1,45 @@
> +// { dg-do compile }
> +// { dg-options "-O2 -std=gnu++11 -fdump-tree-optimized" } */
> +
> +#include 
> +
> +// Return a copy of the underlying memory of an arbitrary value.
> +template <
> +typename T,
> +typename = typename 
> std::enable_if::type
> +>
> +auto getMem(
> +T const & value
> +) -> std::array {
> +auto ret = std::array{};
> +__builtin_memcpy(ret.data(), , sizeof(T));
> +return ret;
> +}
> +
> +template <
> +typename T,
> +typename = typename 
> std::enable_if::type
> +>
> +auto fromMem(
> +std::array const & buf
> +) -> T {
> +auto ret = T{};
> +__builtin_memcpy(, buf.data(), sizeof(T));
> +return ret;
> +}
> +
> +double foo1(std::uint64_t arg) {
> +return fromMem(getMem(arg));
> +}
> +
> +double foo2(std::uint64_t arg) {
> +return *reinterpret_cast();
> +}
> +
> +double foo3(std::uint64_t arg) {
> +double ret;
> +__builtin_memcpy(, , sizeof(arg));
> +return ret;
> +}
> +
> +// { dg-final { scan-tree-dump-not "BIT_FIELD_REF" "optimized" } }
> diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> index 02453d3ed9a..cbe9e862a2f 100644
> --- a/gcc/tree-sra.c
> +++ b/gcc/tree-sra.c
> @@ -981,7 +981,7 @@ scalarizable_type_p (tree type)
>if (TYPE_DOMAIN (type) == NULL_TREE
> || !tree_fits_shwi_p (TYPE_SIZE (type))
> || !tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (type)))
> -   || (tree_to_shwi (TYPE_SIZE (TREE_TYPE (type))) <= 0)
> +   || (tree_to_shwi (TYPE_SIZE (TREE_TYPE (type))) <= BITS_PER_UNIT)
> || !tree_fits_shwi_p (TYPE_MIN_VALUE (TYPE_DOMAIN (type
>   return false;
>if (tree_to_shwi (TYPE_SIZE (type)) == 0
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


[PR 80293] Don't totally-sRA char arrays

2017-04-12 Thread Martin Jambor
Hi,

the patch below is an attempt to deal with PR 80293 as non-invasively
as possible.  Basically, it switches off total SRA scalarization of
any local aggregate which contains an array of elements that are one
byte (or less) in size.

The logic behind this is that accessing such arrays element-wise
usually results in poor code, and that such char arrays are often used
for non-statically-typed content anyway, which we do not want to copy
byte by byte.

Alan, do you think this could impact your constant pool scalarization
too severely?

Richi, if you or Alan do not object in a few days, I'd like to
commit this in time for GCC 7.  It has passed bootstrap and testing on
x86_64-linux (but the constant pool SRA work was aimed primarily at
ARM).

Thanks,

Martin


2017-04-10  Martin Jambor  

* tree-sra.c (scalarizable_type_p): Make char arrays not totally
scalarizable.

testsuite/
* g++.dg/tree-ssa/pr80293.C: New test.
---
 gcc/testsuite/g++.dg/tree-ssa/pr80293.C | 45 +
 gcc/tree-sra.c  |  2 +-
 2 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.dg/tree-ssa/pr80293.C

diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr80293.C 
b/gcc/testsuite/g++.dg/tree-ssa/pr80293.C
new file mode 100644
index 000..7faf35ae983
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr80293.C
@@ -0,0 +1,45 @@
+// { dg-do compile }
+// { dg-options "-O2 -std=gnu++11 -fdump-tree-optimized" } */
+
+#include 
+
+// Return a copy of the underlying memory of an arbitrary value.
+template <
+typename T,
+typename = typename 
std::enable_if::type
+>
+auto getMem(
+T const & value
+) -> std::array {
+auto ret = std::array{};
+__builtin_memcpy(ret.data(), , sizeof(T));
+return ret;
+}
+
+template <
+typename T,
+typename = typename 
std::enable_if::type
+>
+auto fromMem(
+std::array const & buf
+) -> T {
+auto ret = T{};
+__builtin_memcpy(, buf.data(), sizeof(T));
+return ret;
+}
+
+double foo1(std::uint64_t arg) {
+return fromMem(getMem(arg));
+}
+
+double foo2(std::uint64_t arg) {
+return *reinterpret_cast();
+}
+
+double foo3(std::uint64_t arg) {
+double ret;
+__builtin_memcpy(, , sizeof(arg));
+return ret;
+}
+
+// { dg-final { scan-tree-dump-not "BIT_FIELD_REF" "optimized" } }
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 02453d3ed9a..cbe9e862a2f 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -981,7 +981,7 @@ scalarizable_type_p (tree type)
   if (TYPE_DOMAIN (type) == NULL_TREE
  || !tree_fits_shwi_p (TYPE_SIZE (type))
  || !tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (type)))
- || (tree_to_shwi (TYPE_SIZE (TREE_TYPE (type))) <= 0)
+ || (tree_to_shwi (TYPE_SIZE (TREE_TYPE (type))) <= BITS_PER_UNIT)
  || !tree_fits_shwi_p (TYPE_MIN_VALUE (TYPE_DOMAIN (type
return false;
   if (tree_to_shwi (TYPE_SIZE (type)) == 0
-- 
2.12.0



[PATCH] Fix PR79390 (partly)

2017-04-12 Thread Richard Biener

This avoids another case of path splitting which gets in the way
of RTL if conversion.  With this patch -O3 performance gets back
to GCC 6 levels (with -Ofast we still regress as RTL if conversion
doesn't catch the case).

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

Richard.

2017-04-12  Richard Biener  

PR tree-optimization/79390
* gimple-ssa-split-paths.c (is_feasible_trace): Restrict
threading case even more.

Index: gcc/gimple-ssa-split-paths.c
===
--- gcc/gimple-ssa-split-paths.c(revision 246803)
+++ gcc/gimple-ssa-split-paths.c(working copy)
@@ -249,13 +249,17 @@ is_feasible_trace (basic_block bb)
  imm_use_iterator iter2;
  FOR_EACH_IMM_USE_FAST (use2_p, iter2, gimple_phi_result 
(stmt))
{
- if (is_gimple_debug (USE_STMT (use2_p)))
+ gimple *use_stmt = USE_STMT (use2_p);
+ if (is_gimple_debug (use_stmt))
continue;
- basic_block use_bb = gimple_bb (USE_STMT (use2_p));
+ basic_block use_bb = gimple_bb (use_stmt);
  if (use_bb != bb
  && dominated_by_p (CDI_DOMINATORS, bb, use_bb))
{
- found_useful_phi = true;
+ if (gcond *cond = dyn_cast  (use_stmt))
+   if (gimple_cond_code (cond) == EQ_EXPR
+   || gimple_cond_code (cond) == NE_EXPR)
+ found_useful_phi = true;
  break;
}
}


[PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Sudi Das
Hi all

This is a fix for PR 80131.
Currently the code A << (B - C) is not simplified.
However, at least the more specific case of 1U << (C - x), where
C = precision(type) - 1, can be simplified to (1 << C) >> x.

This is done by adding a new simplification rule in match.pd

So for a test case :

unsigned int f1(unsigned int i)
{
  return 1U << (31 - i);
}

We see a gimple dump of 

f1 (unsigned int i)
{
  unsigned int D.3121;

  D.3121 = 2147483648 >> i;
  return D.3121;
}

instead of 

f1 (unsigned int i)
{
  unsigned int D.3121;

  _1 = 31 - i;
  D.3121 = 1 << _1;
  return D.3121;
}


Added a new test case and checked for regressions on a bootstrapped 
aarch64-none-linux-gnu.
Ok for stage 1?

Thanks
Sudi

2017-03-23  Sudakshina Das  

PR middle-end/80131
* match.pd: Simplify 1 << (C - x) where C = precision (type) - 1.

2017-03-23  Sudakshina Das  

PR middle-end/80131
* testsuite/gcc.dg/pr80131-1.c: New Test.diff --git a/gcc/match.pd b/gcc/match.pd
index 7b96800..be20fb7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -508,6 +508,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
&& tree_nop_conversion_p (type, TREE_TYPE (@1)))
(lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
+  (if (INTEGRAL_TYPE_P (type)
+   && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+   && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
+   (if (TYPE_UNSIGNED(type))
+ (rshift (lshift @0 @1) @2)
+   (with
+{ tree utype = unsigned_type_for (type); }
+(convert:type (rshift (lshift (convert:utype @0) @1) @2))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 000..2bb6ff3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are. */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -"  1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */


[PATCH, GCC/ARM, stage4] Set mode for success result of atomic compare and swap

2017-04-12 Thread Thomas Preudhomme

Hi,

Currently the atomic_compare_and_swap_1 define_insn patterns do not
have a mode set for the destination of the set indicating the success
result of the instruction.  This is because the operand can be either a
CC_Z register (for 32-bit targets) or an SI register (for 16-bit Thumb
targets).  This results in a lack of mode checking.

This commit uses a new CCSI mode iterator to solve this issue while
avoiding duplication of the patterns.  The insn names are kept unique by
using attributes tied to the iterator (SIDI:mode and CCSI:arch) instead
of using the builtin mode attribute.  The expander
arm_expand_compare_and_swap is also adapted accordingly.

ChangeLog entry is as follows:

*** gcc/ChangeLog ***

2017-04-11  Thomas Preud'homme  

* config/arm/iterators.md (CCSI): New mode iterator.
(arch): New mode attribute.
* config/arm/sync.md (atomic_compare_and_swap_1): Rename into ...
(atomic_compare_and_swap_1): This and ...
(atomic_compare_and_swap_1): This.  Use CCSI
code iterator for success result mode.
* config/arm/arm.c (arm_expand_compare_and_swap): Adapt code to use
the corresponding new insn generators.

Testing: arm-none-eabi cross-compiler built successfully for ARMv8-M
Mainline and Baseline without the missing destination mode warning in
sync.md.  The testsuite shows no regressions.

Is this ok for stage4?

Best regards,

Thomas
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index b24143e32e2f10f3b150f7ed0df4fabb3cc8..cf628714507efd2b5a5ab5de97ef32fd45987d1f 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28190,17 +28190,32 @@ arm_expand_compare_and_swap (rtx operands[])
   gcc_unreachable ();
 }
 
-  switch (mode)
+  if (TARGET_THUMB1)
 {
-case QImode: gen = gen_atomic_compare_and_swapqi_1; break;
-case HImode: gen = gen_atomic_compare_and_swaphi_1; break;
-case SImode: gen = gen_atomic_compare_and_swapsi_1; break;
-case DImode: gen = gen_atomic_compare_and_swapdi_1; break;
-default:
-  gcc_unreachable ();
+  switch (mode)
+	{
+	case QImode: gen = gen_atomic_compare_and_swapt1qi_1; break;
+	case HImode: gen = gen_atomic_compare_and_swapt1hi_1; break;
+	case SImode: gen = gen_atomic_compare_and_swapt1si_1; break;
+	case DImode: gen = gen_atomic_compare_and_swapt1di_1; break;
+	default:
+	  gcc_unreachable ();
+	}
+}
+  else
+{
+  switch (mode)
+	{
+	case QImode: gen = gen_atomic_compare_and_swap32qi_1; break;
+	case HImode: gen = gen_atomic_compare_and_swap32hi_1; break;
+	case SImode: gen = gen_atomic_compare_and_swap32si_1; break;
+	case DImode: gen = gen_atomic_compare_and_swap32di_1; break;
+	default:
+	  gcc_unreachable ();
+	}
 }
 
-  bdst = TARGET_THUMB1 ? bval : gen_rtx_REG (CCmode, CC_REGNUM);
+  bdst = TARGET_THUMB1 ? bval : gen_rtx_REG (CC_Zmode, CC_REGNUM);
   emit_insn (gen (bdst, rval, mem, oldval, newval, is_weak, mod_s, mod_f));
 
   if (mode == QImode || mode == HImode)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index e2e588688eb04c158d1c146bca12d84cfb5ff130..48992879a8eecc66eba913c2b9a7c5989c5c7bc6 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -45,6 +45,9 @@
 ;; A list of the 32bit and 64bit integer modes
 (define_mode_iterator SIDI [SI DI])
 
+;; A list of atomic compare and swap success return modes
+(define_mode_iterator CCSI [(CC_Z "TARGET_32BIT") (SI "TARGET_THUMB1")])
+
 ;; A list of modes which the VFP unit can handle
 (define_mode_iterator SDF [(SF "") (DF "TARGET_VFP_DOUBLE")])
 
@@ -411,6 +414,10 @@
 ;; Mode attributes
 ;;
 
+;; Determine name of atomic compare and swap from success result mode.  This
+;; distinguishes between 16-bit Thumb and 32-bit Thumb/ARM.
+(define_mode_attr arch [(CC_Z "32") (SI "t1")])
+
 ;; Determine element size suffix from vector mode.
 (define_mode_attr MMX_char [(V8QI "b") (V4HI "h") (V2SI "w") (DI "d")])
 
diff --git a/gcc/config/arm/sync.md b/gcc/config/arm/sync.md
index 1f91b7364d5689145a10bbb193d54a0677b2fd36..b4b4f2e6815e7c31c9874c19af31e908107e6a62 100644
--- a/gcc/config/arm/sync.md
+++ b/gcc/config/arm/sync.md
@@ -191,9 +191,9 @@
 
 ;; Constraints of this pattern must be at least as strict as those of the
 ;; cbranchsi operations in thumb1.md and aim to be as permissive.
-(define_insn_and_split "atomic_compare_and_swap<mode>_1"
-  [(set (match_operand 0 "cc_register_operand" "=&c,&l,&l,&l")		;; bool out
-	(unspec_volatile:CC_Z [(const_int 0)] VUNSPEC_ATOMIC_CAS))
+(define_insn_and_split "atomic_compare_and_swap<CCSI:arch><NARROW:mode>_1"
+  [(set (match_operand:CCSI 0 "cc_register_operand" "=&c,&l,&l,&l")	;; bool out
+	(unspec_volatile:CCSI [(const_int 0)] VUNSPEC_ATOMIC_CAS))
    (set (match_operand:SI 1 "s_register_operand" "=&r,&l,&0,&l*h")	;; val out
 	(zero_extend:SI
 	  (match_operand:NARROW 2 "mem_noofs_operand" "+Ua,Ua,Ua,Ua")))	;; memory
@@ -223,9 +223,9 @@
 
 ;; Constraints of this pattern 

[PATCH] rs6000: Enforce quad_address_p in TImode atomic_load/store (PR80382)

2017-04-12 Thread Segher Boessenkool
Whatever expand expands to should be valid instructions.  The defined
instructions here have a quad_memory_operand predicate, which boils
down to quad_address_p on the address, so let's test for that instead
of only disallowing indexed addresses.

Tested on powerpc64-linux, applying to trunk.


Segher


2017-04-12  Segher Boessenkool  

* config/rs6000/sync.md (atomic_load<mode>, atomic_store<mode>

Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-12 Thread Richard Biener
On Wed, Apr 12, 2017 at 9:50 AM, Robin Dapp  wrote:
>> Note I was very conservative here to allow store bandwidth starved
>> CPUs to benefit from aligning a store.
>>
>> I think it would be reasonable to apply the same heuristic to the
>> store case that we only peel for same cost if peeling would at least
>> align two refs.
>
> Do you mean checking if peeling aligns >= 2 refs for sure? (i.e. with a
> known misalignment) Or the same as currently via
> STMT_VINFO_SAME_ALIGN_REFS just for stores and .length() >= 2?

The latter.

> Is checking via vect_peeling_hash_choose_best_peeling () too costly or
> simply unnecessary if we already know the costs for aligned and
> unaligned are the same?

This one only works for known misalignment, otherwise it's overkill.

OTOH if with some refactoring we can end up using a single cost model
that would be great.  That is for the SAME_ALIGN_REFS we want to
choose the unknown misalignment with the maximum number of
SAME_ALIGN_REFS.  And if we know the misalignment of a single
ref then we still may want to align an unknown-misalignment ref if that
has more SAME_ALIGN_REFS (I think we currently always choose the
known-misalignment one).

Richard.

> Regards
>  Robin
>


Re: [RFC] S/390: Alignment peeling prolog generation

2017-04-12 Thread Robin Dapp
> Note I was very conservative here to allow store bandwidth starved
> CPUs to benefit from aligning a store.
> 
> I think it would be reasonable to apply the same heuristic to the
> store case that we only peel for same cost if peeling would at least
> align two refs.

Do you mean checking if peeling aligns >= 2 refs for sure? (i.e. with a
known misalignment) Or the same as currently via
STMT_VINFO_SAME_ALIGN_REFS just for stores and .length() >= 2?

Is checking via vect_peeling_hash_choose_best_peeling () too costly or
simply unnecessary if we already know the costs for aligned and
unaligned are the same?

Regards
 Robin



Re: One more path to fix PR70478

2017-04-12 Thread Christophe Lyon
On 11 April 2017 at 21:43, Vladimir Makarov  wrote:
>
>
> On 04/11/2017 03:30 AM, Christophe Lyon wrote:
>>
>> Hi Vladimir,
>>
>> On 10 April 2017 at 17:05, Vladimir Makarov  wrote:
>>>
>>>This is the second try to fix
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70478
>>>
>>>The first try patch triggered a latent bug and broke one Fortran
>>> testcase
>>> on x86-64.
>>>
>>>The patch was successfully bootstrapped on x86-64 and tested on
>>> x86-64,
>>> ppc64, and aarch64.
>>>
>>>Committed as rev. 246808.
>>>
>>>
>> This patch causes regression on arm*hf configurations:
>>Executed from: gcc.target/arm/arm.exp
>>  gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times
>> ldrh\\tr[0-9]+ 2
>>  gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times
>> strh\\tr[0-9]+ 2
>>  gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times
>> vld1\\.16\\t{d[0-9]+\\[[0-9]+\\]}, \\[r[0-9]+\\] 2
>>  gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times
>> vmov\\.f16\\tr[0-9]+, s[0-9]+ 4
>>  gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times
>> vmov\\.f16\\ts[0-9]+, r[0-9]+ 4
>>  gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times
>> vst1\\.16\\t{d[0-9]+\\[[0-9]+\\]}, \\[r[0-9]+\\] 2
>>
>>
> I've committed a patch which is supposed to fix the regression.
>

I confirm it's now OK. Thanks for the prompt fix!

Christophe