Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-15 Thread Eric Botcazou
 Whoops, I forgot to commit that last patch.  Check now.

The warning is there on the 4.7 branch now.

-- 
Eric Botcazou


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-15 Thread Aldy Hernandez

On 06/15/12 06:40, Eric Botcazou wrote:

Whoops, I forgot to commit that last patch.  Check now.


The warning is there on the 4.7 branch now.



Arghhh, that's the second time.  I wonder why the warning doesn't show 
up on my bootstraps.


Anyway, committed the attached patch to branch.
Backport from mainline:
2012-05-31  Aldy Hernandez  al...@redhat.com
* tree-ssa-loop-im.c (execute_sm): Do not check flag_tm.
* gimple.h (block_in_transaction): Check for flag_tm.

Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 188631)
+++ tree-ssa-loop-im.c  (working copy)
@@ -2154,7 +2154,7 @@ execute_sm (struct loop *loop, VEC (edge
   fmt_data.orig_loop = loop;
   for_each_index (&ref->mem, force_move_till, &fmt_data);
 
-  if ((flag_tm && block_in_transaction (loop_preheader_edge (loop)->src))
+  if (block_in_transaction (loop_preheader_edge (loop)->src)
   || !PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
 multi_threaded_model_p = true;
 
Index: gimple.h
===
--- gimple.h(revision 188631)
+++ gimple.h(working copy)
@@ -1587,7 +1587,7 @@ gimple_set_has_volatile_ops (gimple stmt
 static inline bool
 block_in_transaction (basic_block bb)
 {
-  return bb->flags & BB_IN_TRANSACTION;
+  return flag_tm && bb->flags & BB_IN_TRANSACTION;
 }
 
 /* Return true if STMT is in a transaction.  */


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-01 Thread Tobias Burnus

Aldy Hernandez wrote:

PR tree-optimization/52558
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* basic-block.h (alloc_aux_for_edge): Protoize.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(tree_ssa_lim_initialize): Call alloc_aux_for_edges.
(tree_ssa_lim_finalize): Call free_aux_for_edges.
* gimple.h (block_in_transaction): New.
(gimple_in_transaction): Use block_in_transaction.



Index: gimple.h
===
--- gimple.h(revision 188080)
+++ gimple.h(working copy)



+/* Return true if BB is in a transaction.  */
+
+static inline bool
+block_in_transaction (basic_block bb)
+{
+  return bb->flags & BB_IN_TRANSACTION;


The compiler does not seem to like the latter very much and thus warns 
for every file that includes gimple.h:


gcc/gimple.h: In function 'block_in_transaction':
gcc/gimple.h:1596:20: warning: overflow in implicit constant conversion 
[-Woverflow]

   return bb->flags & BB_IN_TRANSACTION;
^

Tobias


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-01 Thread Aldy Hernandez

On 06/01/12 01:22, Tobias Burnus wrote:


gcc/gimple.h: In function 'block_in_transaction':
gcc/gimple.h:1596:20: warning: overflow in implicit constant conversion
[-Woverflow]
return bb->flags & BB_IN_TRANSACTION;
^


Is this still the case with the code currently in mainline:

  return flag_tm && bb->flags & BB_IN_TRANSACTION;

??


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-06-01 Thread Aldy Hernandez

On 06/01/12 10:11, Aldy Hernandez wrote:

On 06/01/12 01:22, Tobias Burnus wrote:


gcc/gimple.h: In function 'block_in_transaction':
gcc/gimple.h:1596:20: warning: overflow in implicit constant conversion
[-Woverflow]
return bb->flags & BB_IN_TRANSACTION;
^


Is this still the case with the code currently in mainline:

return flag_tm && bb->flags & BB_IN_TRANSACTION;

??


Whoops, I forgot to commit that last patch.  Check now.


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-31 Thread Aldy Hernandez

On 05/29/12 06:13, Richard Guenther wrote:

On Mon, 21 May 2012, Aldy Hernandez wrote:


On 05/16/12 07:53, Richard Guenther wrote:

On Mon, 7 May 2012, Aldy Hernandez wrote:




(flag_tm && loop_preheader_edge (loop)->src->flags & BB_IN_TRANSACTION)

can you encapsulate this into a predicate?  Like block_in_transaction ()
that also checks flag_tm?


Done.. whoops, forgot to check for flag_tm.  I will move this into the 
predicate after I do another round of testing.  I missed this bit, and 
I've just committed.
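
For reference, the predicate with the flag_tm check folded in ends up looking 
like the backport shown at the top of this thread (reproduced here as a sketch 
for convenience):

  /* Return true if BB is in a transaction.  */

  static inline bool
  block_in_transaction (basic_block bb)
  {
    return flag_tm && bb->flags & BB_IN_TRANSACTION;
  }

which lets execute_sm drop its own flag_tm guard and simply test 
block_in_transaction (loop_preheader_edge (loop)->src).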




+  /* ?? FIXME TESTING TESTING ?? */
+  multi_threaded_model_p=true;
+  /* ?? FIXME TESTING TESTING ?? */

that of course needs fixing ;)  (and re-testing)


This was here on purpose, so you'd see how I was testing.  I have 
committed the attached patch, not before testing with _and_ without the 
above FIXME.


Thanks so much for the review.  I will follow up with the flag_tm 
abstraction.


Closing the PR...

Aldy
PR tree-optimization/52558
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* basic-block.h (alloc_aux_for_edge): Protoize.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(tree_ssa_lim_initialize): Call alloc_aux_for_edges.
(tree_ssa_lim_finalize): Call free_aux_for_edges.
* gimple.h (block_in_transaction): New.
(gimple_in_transaction): Use block_in_transaction.

Index: testsuite/gcc.dg/pr52558-1.c
===
--- testsuite/gcc.dg/pr52558-1.c(revision 0)
+++ testsuite/gcc.dg/pr52558-1.c(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "--param allow-store-data-races=0 -O2 -fdump-tree-lim1" } */
+
+/* Test that `count' is not written to unless p->data > 0.  */
+
+int count;
+
+struct obj {
+int data;
+struct obj *next;
+} *q;
+
+void func()
+{
+  struct obj *p;
+  for (p = q; p; p = p->next)
+    if (p->data > 0)
+      count++;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM count_lsm.. count_lsm_flag" 1 "lim1" } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: testsuite/gcc.dg/pr52558-2.c
===
--- testsuite/gcc.dg/pr52558-2.c(revision 0)
+++ testsuite/gcc.dg/pr52558-2.c(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "--param allow-store-data-races=0 -O2 -fdump-tree-lim1" } */
+
+/* Test that g_2 is not written to unless !g_1.  */
+
+int g_1 = 1;
+int g_2 = 0;
+
+int func_1(void)
+{
+ int l;
+ for (l = 0; l < 1234; l++)
+ {
+   if (g_1)
+ return l;
+   else
+ g_2 = 0;
+ }
+ return 999;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM.*g_2_lsm_flag" 1 "lim1" } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: testsuite/gcc.dg/tm/reg-promotion.c
===
--- testsuite/gcc.dg/tm/reg-promotion.c (revision 0)
+++ testsuite/gcc.dg/tm/reg-promotion.c (revision 0)
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-fgnu-tm -O2 -fdump-tree-lim1" } */
+
+/* Test that `count' is not written to unless p->data > 0.  */
+
+int count;
+
+struct obj {
+int data;
+struct obj *next;
+} *q;
+
+void func()
+{
+  struct obj *p;
+  __transaction_atomic {
+    for (p = q; p; p = p->next)
+      if (p->data > 0)
+        count++;
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "MEM count_lsm.. count_lsm_flag" 1 "lim1" } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 188080)
+++ tree-ssa-loop-im.c  (working copy)
@@ -52,7 +52,7 @@ along with GCC; see the file COPYING3.
 }
  }
 
-   Where COND and INV are is invariants, but evaluating INV may trap or be
+   Where COND and INV are invariants, but evaluating INV may trap or be
invalid from some other reason if !COND.  This may be transformed to
 
if (cond)
@@ -1626,6 +1626,7 @@ gather_mem_refs_stmt (struct loop *loop,
   fprintf (dump_file, "\n");
}
 }
+
   if (is_stored)
 mark_ref_stored (ref, loop);
 
@@ -1956,6 +1957,173 @@ get_lsm_tmp_name (tree ref, unsigned n)
   return lsm_tmp_name;
 }
 
+struct prev_flag_edges {
+  /* Edge to insert new flag comparison code.  */
+  edge append_cond_position;
+
+  /* Edge for fall through from previous flag comparison.  */
+  edge last_cond_fallthru;
+};
+
+/* Helper function for execute_sm.  Emit code to store TMP_VAR into
+   MEM along edge EX.
+
+   The store is only done if MEM has changed.  We do this so no
+   changes to MEM occur on code paths that did not originally store
+   into it.
+
+   The common case for execute_sm will transform:
+
+ for (...) {
+   if (foo)
+ stuff;
+   else
+ 

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-29 Thread Richard Guenther
On Mon, 21 May 2012, Aldy Hernandez wrote:

 On 05/16/12 07:53, Richard Guenther wrote:
  On Mon, 7 May 2012, Aldy Hernandez wrote:
 
 [Sorry for the delay; I was on vacation.]
 
 I am forgoing the load avoidance code altogether to simplify things. Thanks.
 
  +  /* Emit the load code into the latch, so that we are sure it will
  + be processed after all dependencies.  */
  +  latch_edge = loop_latch_edge (loop);
 
  inserting code into the latch block is bad - the vectorizer will
  be confused by non-empty latches.
 
 The code as is on trunk inserts the load on the latch.

Does it?  Ok ...

  That's why I also
 inserted the supporting flag checking code there.  Do you want me to put the
 supporting code somewhere else?

Technically there isn't a good other place that does not add a
partial redundancy.

  Your ChangeLog mentions changes that are no longer necessary
  (the target hook).
 
 Whoops.  Fixed.
 
 
  I didn't quite get the store order issue - we _do_ preserve store
  ordering right now, do we?  So how come introducing the flags
  change that?
 
 The current scheme on trunk works because it inserts one load with
 gsi_insert_on_edge() and subsequent ones get appended correctly, whereas my
 patch has to split the edge (which happens at the top of the block), so
 subsequent code inserts happen in reverse order (at the top).  If I remember
 correctly, that is... I can look again and report if it's still unclear.

Hmm, the old code (well, and the new one ...) walks stores to move
by walking a bitmap.  I suppose we rely on computing uids in the
original program order here then.

(flag_tm && loop_preheader_edge (loop)->src->flags & BB_IN_TRANSACTION)

can you encapsulate this into a predicate?  Like block_in_transaction ()
that also checks flag_tm?

+  /* ?? FIXME TESTING TESTING ?? */
+  multi_threaded_model_p=true;
+  /* ?? FIXME TESTING TESTING ?? */

that of course needs fixing ;)  (and re-testing)


 New patch attached.  Tested on x86-64 Linux.  No regressions.

Ok with the new predicate in basic-block.h and re-testing without the
fixme above.

Thanks,
Richard.


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-16 Thread Richard Guenther
On Mon, 7 May 2012, Aldy Hernandez wrote:

 Hi.  Sorry for the delay.  There were various tricky hiccups along the way to
 bootstrappability and regression cleanliness...
 
 On 04/26/12 04:51, Richard Guenther wrote:
  On Wed, 25 Apr 2012, Aldy Hernandez wrote:
 
   On 04/25/12 06:45, Richard Guenther wrote:
On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez al...@redhat.com
wrote:
 On 04/13/12 03:46, Richard Guenther wrote:
 
On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com
  wrote:
 
   Speak of loads, I am keeping the information as an additional bitmap in
   `memory_accesses', as -refs_in_loop was set for stores as well, so I
   couldn't
   depend on it.  Let me know if you have another idea.
 
  Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of
 
  +  bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
  +                            loop->num);
  +  ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);
 
  ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p
  (stores, ref->id);
 
  ?  But maybe that doesn't work if a ref is both read and stored to.
  Btw, rather than adding a bitmap to memory_accesses I'd rather add
  a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
  both into one) and a bitmap to struct mem_ref.
 
 I have done so as mark_ref_loaded().  I first tried to merge both into a
 generic mark_ref(), to avoid iterating twice, but I was concerned with the
 early loop exit of !bitmap_bit_p (ref->stored, loop->num), not knowing if I
 should exit the loop altogether in the merged case or what.  In either case,
 the code was somewhat convoluted, so I opted for a separate clean function.
 
 In scanning the gimple stream for the loads I noticed that instances of two
 component refs (say foo.bar) were not pointer comparable, so I am asking the
 alias oracle with refs_may_alias_p.  It also seemed to make more sense wrt
 memory chunks that may alias.  What do you think?

No, it asks for equal must aliases (it will replace them after all!).
You should use operand_equal_p to compare reference trees.

But why do all the following dance

+
+  if (gimple_assign_single_p (stmt))
+    {
+      tree t = gimple_assign_rhs1 (stmt);
+      /* ?? For some reason two exact COMPONENT_REFs cannot be
+         compared with pointer equality, so ask the alias oracle.  */
+      if (ref->mem == t
+          || ((TREE_CODE (t) == SSA_NAME
+               || DECL_P (t)
+               || handled_component_p (t)
+               || TREE_CODE (t) == MEM_REF
+               || TREE_CODE (t) == TARGET_MEM_REF)
+              && refs_may_alias_p (t, ref->mem)))
+        mark_ref_loaded (ref, loop);
+    }

at all and not just

   if (is_stored)
 mark_ref_stored (ref, loop);
   else
 mark_ref_loaded (ref, loop);

and similar in the !mem case mark the ref loaded unconditionally:

   mark_ref_loaded (ref, loop);
   if (gimple_vdef (stmt))
 mark_ref_stored (ref, loop);

 This patch is far more robust and fully tested.  Two wrenches to complicate
 things though...
 
 First, while inserting the code writing the _LSM_ temporary writes into the
 original variables, I found a handful of tests where the order of the writes
 mattered (aliased locations, inlined code, etc, IIRC).  [This is in loops
 where multiple stores are moved.]  So iterating and inserting on the exit
 edges caused the stores to be saved in reverse.  I am now saving the edges and
 threading the code, so they happen in the correct order:
 
   if (foo_flag_lsm)
   foo = foo_lsm;
   if (foo2_flag_lsm)
   foo2 = foo2_lsm;
   actual_exit:
 
 This required some edge redirection so we didn't jump to the original actual
 exit prematurely when handling multiple moved stores.  The dominator tree
 needed to be updated, and there is one instance where I couldn't find a clever
 way of saving the order without jumping through an inordinate amount of hoops.
 It seemed easier to call recompute_dominator.
 
 Second, avoiding the load, unless it happens in the loop like you have
 suggested, brought about some bogus unused warnings with the LSM temporaries.
 What happens is that compute_ssa() creates uses of the LSM temporaries that
 are not yet defined, so we end up with PHI nodes with a definition entry (D).
 Since we are sure the eventual uses will be gated by their corresponding flag,
 I created phi nodes with an initial value of 0 in the preheader of the loops
 in which they are used.  This solved the problem.  I also played with
 initializing to 0 at the entry block, but that seemed a bit too invasive.

Hm.  Btw, see also PR39612 which you could solve with that?

 Andrew suggested the correct fix was to add a new pass that was able to do
 some ?? flow sensitive data flow analysis ?? that could discover these
 unreachable paths and insert the 0 phis at the start of the blocks
 automatically.  But that seemed like far too 

PING: Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-15 Thread Aldy Hernandez

PING.


Hi. Sorry for the delay. There were various tricky hiccups along the way
to bootstrappability and regression cleanliness...

On 04/26/12 04:51, Richard Guenther wrote:

On Wed, 25 Apr 2012, Aldy Hernandez wrote:


On 04/25/12 06:45, Richard Guenther wrote:

On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez al...@redhat.com
wrote:

On 04/13/12 03:46, Richard Guenther wrote:


On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com
wrote:



Speak of loads, I am keeping the information as an additional bitmap in
`memory_accesses', as ->refs_in_loop was set for stores as well, so I
couldn't
depend on it. Let me know if you have another idea.


Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of

+ bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
+                           loop->num);
+ ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);

ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p
(stores, ref->id);

? But maybe that doesn't work if a ref is both read and stored to.
Btw, rather than adding a bitmap to memory_accesses I'd rather add
a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
both into one) and a bitmap to struct mem_ref.


I have done so as mark_ref_loaded(). I first tried to merge both into a
generic mark_ref(), to avoid iterating twice, but I was concerned with
the early loop exit of !bitmap_bit_p (ref->stored, loop->num), not
knowing if I should exit the loop altogether in the merged case or what.
In either case, the code was somewhat convoluted, so I opted for a
separate clean function.

In scanning the gimple stream for the loads I noticed that instances of
two component refs (say foo.bar) were not pointer comparable, so I am
asking the alias oracle with refs_may_alias_p. It also seemed to make
more sense wrt memory chunks that may alias. What do you think?

This patch is far more robust and fully tested. Two wrenches to
complicate things though...

First, while inserting the code writing the _LSM_ temporary writes into
the original variables, I found a handful of tests where the order of
the writes mattered (aliased locations, inlined code, etc, IIRC). [This
is in loops where multiple stores are moved.] So iterating and inserting
on the exit edges caused the stores to be saved in reverse. I am now
saving the edges and threading the code, so they happen in the correct
order:

if (foo_flag_lsm)
foo = foo_lsm;
if (foo2_flag_lsm)
foo2 = foo2_lsm;
actual_exit:

This required some edge redirection so we didn't jump to the original
actual exit prematurely when handling multiple moved stores. The
dominator tree needed to be updated, and there is one instance where I
couldn't find a clever way of saving the order without jumping through
an inordinate amount of hoops. It seemed easier to call
recompute_dominator.

Second, avoiding the load, unless it happens in the loop like you have
suggested, brought about some bogus unused warnings with the LSM
temporaries. What happens is that compute_ssa() creates uses of the LSM
temporaries that are not yet defined, so we end up with PHI nodes with a
definition entry (D). Since we are sure the eventual uses will be gated
by their corresponding flag, I created phi nodes with an initial value
of 0 in the preheader of the loops in which they are used. This solved
the problem. I also played with initializing to 0 at the entry block,
but that seemed a bit too invasive.

Andrew suggested the correct fix was to add a new pass that was able to
do some ?? flow sensitive data flow analysis ?? that could discover
these unreachable paths and insert the 0 phis at the start of the blocks
automatically. But that seemed like far too much work, considering how
long it's taken me to get this far ;-).

Bootstrapped and fully tested with multi_threaded_model_p=true
hard-coded. This is the revision I'd like to contribute, sans the
hardcoded value.

What do you think?




Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-08 Thread Aldy Hernandez

On 05/07/12 19:11, Andrew MacLeod wrote:

On 05/07/2012 07:04 PM, Aldy Hernandez wrote:



Andrew suggested the correct fix was to add a new pass that was able
to do some ?? flow sensitive data flow analysis ?? that could discover
these unreachable paths and insert the 0 phis at the start of the
blocks automatically. But that seemed like far too much work,
considering how long it's taken me to get this far ;-).



Wait, no. I didn't suggest writing an entire generic pass was the
correct fix for you... I said that this was a technique that a generic
pass which identified this sort of flow sensitive data flow info could
use to work within the CFG. Simply zeroing out uses of the load in PHI
nodes on paths it has determined are not actually reachable by that
value, and you were suppose to integrate just the bits of that technique
that you required.. just replace any uses of your LSM temporary with 0.
I never intended you should write an entire pass...


Ah, well in that case, I've already done that.  Well I don't do it on 
any path, just in the loop header, and things propagate down from there 
quite nicely :).


Aldy


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-07 Thread Aldy Hernandez
Hi.  Sorry for the delay.  There were various tricky hiccups along the 
way to bootstrappability and regression cleanliness...


On 04/26/12 04:51, Richard Guenther wrote:

On Wed, 25 Apr 2012, Aldy Hernandez wrote:


On 04/25/12 06:45, Richard Guenther wrote:

On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez al...@redhat.com wrote:

On 04/13/12 03:46, Richard Guenther wrote:


On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com
wrote:



Speak of loads, I am keeping the information as an additional bitmap in
`memory_accesses', as ->refs_in_loop was set for stores as well, so I couldn't
depend on it.  Let me know if you have another idea.


Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of

+  bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
+                            loop->num);
+  ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);

ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p
(stores, ref->id);

?  But maybe that doesn't work if a ref is both read and stored to.
Btw, rather than adding a bitmap to memory_accesses I'd rather add
a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
both into one) and a bitmap to struct mem_ref.


I have done so as mark_ref_loaded().  I first tried to merge both into a 
generic mark_ref(), to avoid iterating twice, but I was concerned with 
the early loop exit of !bitmap_bit_p (ref->stored, loop->num), not 
knowing if I should exit the loop altogether in the merged case or what. 
 In either case, the code was somewhat convoluted, so I opted for a 
separate clean function.
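
A plausible shape for such a separate mark_ref_loaded, mirroring the existing 
mark_ref_stored in tree-ssa-loop-im.c and assuming a `loaded' bitmap added to 
struct mem_ref (a sketch for readers of the archive, not the code from the 
attached patch):

  /* Mark REF as loaded in LOOP and in all enclosing loops, stopping as
     soon as an enclosing loop already has the bit set (the same early
     exit mark_ref_stored uses for its `stored' bitmap).  */

  static void
  mark_ref_loaded (mem_ref_p ref, struct loop *loop)
  {
    for (;
         loop != current_loops->tree_root
         && !bitmap_bit_p (ref->loaded, loop->num);
         loop = loop_outer (loop))
      bitmap_set_bit (ref->loaded, loop->num);
  }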


In scanning the gimple stream for the loads I noticed that instances of 
two component refs (say foo.bar) were not pointer comparable, so I am 
asking the alias oracle with refs_may_alias_p.  It also seemed to make 
more sense wrt memory chunks that may alias.  What do you think?


This patch is far more robust and fully tested.  Two wrenches to 
complicate things though...


First, while inserting the code writing the _LSM_ temporary writes into 
the original variables, I found a handful of tests where the order of 
the writes mattered (aliased locations, inlined code, etc, IIRC).  [This 
is in loops where multiple stores are moved.]  So iterating and 
inserting on the exit edges caused the stores to be saved in reverse.  I 
am now saving the edges and threading the code, so they happen in the 
correct order:


if (foo_flag_lsm)
foo = foo_lsm;
if (foo2_flag_lsm)
foo2 = foo2_lsm;
actual_exit:

This required some edge redirection so we didn't jump to the original 
actual exit prematurely when handling multiple moved stores.  The 
dominator tree needed to be updated, and there is one instance where I 
couldn't find a clever way of saving the order without jumping through 
an inordinate amount of hoops.  It seemed easier to call 
recompute_dominator.


Second, avoiding the load, unless it happens in the loop like you have 
suggested, brought about some bogus unused warnings with the LSM 
temporaries.  What happens is that compute_ssa() creates uses of the LSM 
temporaries that are not yet defined, so we end up with PHI nodes with a 
definition entry (D).  Since we are sure the eventual uses will be gated 
by their corresponding flag, I created phi nodes with an initial value 
of 0 in the preheader of the loops in which they are used.  This solved 
the problem.  I also played with initializing to 0 at the entry block, 
but that seemed a bit too invasive.
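
Roughly, this gives the following shape (a hand-written sketch using the 
g_2_lsm name from the testcase, not an actual dump):

  <bb preheader>:
    g_2_lsm_1 = 0;

  <bb loop header>:
    # g_2_lsm_2 = PHI <g_2_lsm_1(preheader), g_2_lsm_3(latch)>
    ...

so every use of the LSM temporary is dominated by a real definition instead of 
a (D) default definition, and the bogus warnings disappear.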


Andrew suggested the correct fix was to add a new pass that was able to 
do some ?? flow sensitive data flow analysis ?? that could discover 
these unreachable paths and insert the 0 phis at the start of the blocks 
automatically.  But that seemed like far too much work, considering how 
long it's taken me to get this far ;-).


Bootstrapped and fully tested with multi_threaded_model_p=true 
hard-coded.  This is the revision I'd like to contribute, sans the 
hardcoded value.


What do you think?
* cfg.c (alloc_aux_for_edge): Fix comment.
(alloc_aux_for_edge): Remove static.
* gimple.h (gimple_set_in_transaction): Remove.
(gimple_in_transaction): Look in BB instead.
(gimple_statement_base): Remove in_transaction field.
* basic-block.h (enum bb_flags): Add BB_IN_TRANSACTION.
(alloc_aux_for_edge): Protoize.
* trans-mem.c (compute_transaction_bits): Place transaction bit
information into basic blocks.
* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
* doc/tm.texi: Regenerate.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
(mark_ref): Rename from mark_ref_stored.  Set loaded bit.
(mem_ref_alloc): Initialize loaded field.
  

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-05-07 Thread Andrew MacLeod

On 05/07/2012 07:04 PM, Aldy Hernandez wrote:



Andrew suggested the correct fix was to add a new pass that was able 
to do some ?? flow sensitive data flow analysis ?? that could discover 
these unreachable paths and insert the 0 phis at the start of the 
blocks automatically.  But that seemed like far too much work, 
considering how long it's taken me to get this far ;-).




Wait, no. I didn't suggest writing an entire generic pass was the 
correct fix for you... I said that this was a technique that a generic 
pass  which identified this sort of flow sensitive data flow info could 
use to work within the CFG.  Simply zeroing out uses of the load in PHI 
nodes on paths it has determined are not actually reachable by that 
value, and you were supposed to integrate just the bits of that technique 
that you required.. just replace any uses of your LSM temporary with 
0.   I never intended you should write an entire pass...


Andrew



Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-26 Thread Richard Guenther
On Wed, 25 Apr 2012, Aldy Hernandez wrote:

 On 04/25/12 06:45, Richard Guenther wrote:
  On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez al...@redhat.com  wrote:
   On 04/13/12 03:46, Richard Guenther wrote:
   
On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com
wrote:
 
   +  /* ?? Perhaps we should cache this somewhere in the BB, or are
   +     multiple levels of empty BB's common. ??  */
   +  FOR_EACH_EDGE (e, ei, bb->preds)
   +    {
   +      int res = bb_in_transaction_1 (e->src);
   +      if (res != -1)
   +        return (bool) res;
   +    }
   +  return false;
 
  that's weird code - if predecessors are not all in or not in a transaction
  you return a random value.  So it seems this is somewhat fragile.
 
  If bb status can be derived from looking at a single stmt why is the
  transaction-ness not recorded as a basic-block flag then?  Thus,
  why do we have a bit in gimple stmts?
 
 How about we get rid of the GIMPLE bit altogether and keep it in the BB flags
 like you mention?  See patch.

Yeah, that looks good.

   +  bool single_exit_p = single_pred_p (ex->dest);
 
  that's a strange variable name ;)
 
 How about loop_has_only_one_exit? :)

Fine ;)

 
  +/* Helper function for execute_sm.  On every path that sets REF, set
  +   an appropriate flag indicating the store.  */
  +
  +static tree
  +execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
  +{
  ...
   +  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
   +    {
   +      gimple_stmt_iterator gsi;
   +      gimple stmt;
   +
   +      gsi = gsi_start_bb (gimple_bb (loc->stmt));
   +      for (; !gsi_end_p (gsi); gsi_next (&gsi))
   +        if (gsi_stmt (gsi) == loc->stmt)
 
  does not seem to do it on every path but on every REF setter.  And instead
   of the innermost loop just use gsi_for_stmt (loc->stmt).
 
 Updated comment.  Fixed.
 
 
   +  /* Emit the load code into the latch, so that we are sure it will
   +     be processed after all dependencies.  */
   +  latch_edge = loop_latch_edge (loop);
   +  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
      lim_data = init_lim_data (load);
      lim_data->max_loop = loop;
      lim_data->tgt_loop = loop;
   +  gsi_insert_on_edge (latch_edge, load);
   +  load = gimple_build_assign (tmp_var, mem_ssa);
   +  lim_data = init_lim_data (load);
   +  lim_data->max_loop = loop;
   +  lim_data->tgt_loop = loop;
   +  gsi_insert_on_edge (latch_edge, load);
 
  the 2nd assign is no longer a load, so I suppose you don't need any
  lim_data
  for it?
 
 Re-did all this.  See patch.
 
 
   +  else if (store == STORE_IF_CHANGED_FLAG)
   +    execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
   +                           store_flag);
 
  and this can pass NULL instead of mem_ssa?
 
 Got rid of mem_ssa altogether, as well as the if-changed approach which proved
 inferior to the flags based approach.

Good.

  Overall this looks reasonable.  With also handling trapping things I meant
  that we should be able to apply store-motion to
 
  int a[256];
  void foo (int *p)
  {
 int i;
      for (i = 0; i < 256; i++)
   if (flag)
 *p = a[i];
  }
 
  but we cannot, because the transform
 
 lsm = *p;
      for (i = 0; i < 256; ++i)
   if (flag)
 lsm = a[i];
 *p = lsm;
 
  even when the store is properly conditional the load is not (and you do not
  change that).  The unconditional load may trap here.  What I meant to say
  is that we do not need the load at all unless there is also a load involved
  inside the loop - if we use the flag-based approach.  If there is also a
  load
  inside the loop we'd need to perform the load there (or not handle this
  case)
 
 Ahh!  Yes, this sounds very profitable.  Would you mind me doing this in a
 separate patch?  I am already handling loads in this incarnation, so this
 should be straightforward.

No, it's fine to do it separately.

 Speak of loads, I am keeping the information as an additional bitmap in
 `memory_accesses', as ->refs_in_loop was set for stores as well, so I couldn't
 depend on it.  Let me know if you have another idea.

Hmm, refs_in_loop & ~all_refs_stored_in_loop, so instead of

+  bitmap reads = VEC_index (bitmap, memory_accesses.reads_in_loop,
+                            loop->num);
+  ref_read_in_loop_p = bitmap_bit_p (reads, ref->id);

ref_read_in_loop_p = bitmap_bit_p (refs, ref->id) && !bitmap_bit_p 
(stores, ref->id);

?  But maybe that doesn't work if a ref is both read and stored to.
Btw, rather than adding a bitmap to memory_accesses I'd rather add
a mark_ref_loaded corresponding to mark_ref_stored (or rather merge
both into one) and a bitmap to struct mem_ref.

 Mildly tested.  Wadaya think?

Looks good overall, minus the detail of remembering refs loaded.

Richard.


 Thanks again.
 Aldy
 

-- 
Richard Guenther rguent...@suse.de
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-25 Thread Richard Guenther
On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez al...@redhat.com wrote:
 On 04/13/12 03:46, Richard Guenther wrote:

 On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com  wrote:


 Richard.  Thanks so much for reviewing and providing an alternative
 approach, which AFAICT provides superior results.


 A similar effect could be obtained by keeping a flag whether we entered
 the
 path that stored the value before the transform.  Thus, do

   lsm = g2;  // technically not needed, if you also handle loads
   flag = false;
   for (...)
     {
        if (g1)
          {
             if (flag)
               g2 = lsm;
          }
        else
          {
             lsm = 0;
             flag = true;
          }
     }
   if (flag)
     g2 = lsm;


 I have implemented this in the attached patch, and it seems to be generating
 better code than my other if-changed approach.  It also avoids generating
 unnecessary loads that may trap.

 For the original testcase:


 int g_1 = 1;
 int g_2 = 0;

 int func_1(void)
 {
  int l;
 for (l = 0; l < 1234; l++)
  {
   if (g_1)
     return l;
   else
     g_2 = 0;
  }
  return 999;
 }

 After all optimizations are done, we are now generating the following for
 the flags approach:

 new:
 func_1:
        movl    g_1(%rip), %edx
        xorl    %eax, %eax
        testl   %edx, %edx
        je      .L5
        rep
        ret
 .L5:
        movl    $0, g_2(%rip)
        movw    $999, %ax
        ret

 Which is significantly better than the unmodified trunk of:

 func_1:
        movl    g_1(%rip), %edx
        movl    g_2(%rip), %eax
        testl   %edx, %edx
        je      .L5
        movl    %eax, g_2(%rip)
        xorl    %eax, %eax
        ret
 .L5:
        movl    $0, g_2(%rip)
        movl    $999, %eax
        ret

 And in other less contrived cases, we generate correct code that avoids
 writes on code paths that did not have it.  For example, in Hans register
 promotion example:

 int count;

 struct obj {
    int data;
    struct obj *next;
 } *q;

 void func()
 {
  struct obj *p;
  for (p = q; p; p = p->next)
        if (p->data > 0)
                count++;
 }

 Under the new memory model we should avoid promoting count to a register
 and unilaterally writing to it upon exiting the loop.

 With the attached patch, we now generate:

 func:
 .LFB0:
        movq    q(%rip), %rax
        xorl    %ecx, %ecx
        movl    count(%rip), %edx
        testq   %rax, %rax
        je      .L1
 .L9:
        movl    (%rax), %esi
        testl   %esi, %esi
        jle     .L3
        addl    $1, %edx
        movl    $1, %ecx
 .L3:
        movq    8(%rax), %rax
        testq   %rax, %rax
        jne     .L9
        testb   %cl, %cl
        jne     .L12
 .L1:
        rep
        ret
 .L12:
        movl    %edx, count(%rip)
        ret

 Which is as expected.

 I don't understand your suggestion of:


    lsm = g2;  // technically not needed, if you also handle loads

 that would allow to extend this to cover the cases where the access may

 eventually trap (if you omit the load before the loop).


 Can't I just remove the lsm=g2?  What's this about also handling loads?



 Not sure which is more efficient (you can eventually combine flag
 variables
 for different store locations, combining the if-changed compare is not so
 easy
 I guess).


 So far, I see better code generated with the flags approach, so...


 1. No pass can figure out the equality (or inequality) of g_2_lsm and
 g_2.
  In the PR, Richard mentions that both FRE/PRE can figure this out, but
 they
 are not run after store motion.  DOM should also be able to, but
 apparently
 does not :(.  Tips?


 Well.  Schedule FRE after loop opts ...

 DOM is not prepared to handle loads/stores with differing VOPs - it was
 never
 updated really after the alias-improvements merge.

 2. The GIMPLE_CONDs I create in this patch are currently causing problems
 with complex floats/doubles.  do_jump is complaining that it can't
 compare
 them, so I assume it is too late (in tree-ssa-loop-im.c) to compare
 complex
 values since complex lowering has already happened (??). Is there a more
 canonical way of creating a GIMPLE_COND that may contain complex floats
 at
 this stage?


 As you are handling redundant stores you want a bitwise comparison anyway,
 but yes, complex compares need to be lowered.  But as said, you want a
 bitwise comparison, not a value-comparison.  You can try using


 I can ignore all this.  I have implemented both alternatives, with a target
 hook as suggested, but I'm thinking of removing it altogether and just
 leaving the flags approach.

 Few comments on the patch itself.

 +         new_bb = split_edge (ex);
 +         then_bb = create_empty_bb (new_bb);
 +         if (current_loops && new_bb->loop_father)
 +           add_bb_to_loop (then_bb, new_bb->loop_father);
 +
 +         gsi = gsi_start_bb (new_bb);
 +         t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);

 Hmm, why do you need to re-load the 

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-25 Thread Aldy Hernandez

On 04/25/12 06:45, Richard Guenther wrote:

On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez al...@redhat.com  wrote:

On 04/13/12 03:46, Richard Guenther wrote:


On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com wrote:



+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ??  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+        return (bool) res;
+    }
+  return false;

that's weird code - if predecessors are not all in or not in a transaction
you return a random value.  So it seems this is somewhat fragile.

If bb status can be derived from looking at a single stmt why is the
transaction-ness not recorded as a basic-block flag then?  Thus,
why do we have a bit in gimple stmts?


How about we get rid of the GIMPLE bit altogether and keep it in the BB 
flags like you mention?  See patch.



+  bool single_exit_p = single_pred_p (ex->dest);

that's a strange variable name ;)


How about loop_has_only_one_exit? :)



+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+        if (gsi_stmt (gsi) == loc->stmt)

does not seem to do it on every path but on every REF setter.  And instead
of the innermost loop just use gsi_for_stmt (loc->stmt).


Updated comment.  Fixed.
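
The fixed loop then looks roughly like this (`locs' and `flag' come from the 
surrounding execute_sm_if_changed_flag_set; the snippet is an illustration 
rather than the committed hunk):

  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
    {
      /* Locate the REF setter directly and append the flag set after it.  */
      gimple_stmt_iterator gsi = gsi_for_stmt (loc->stmt);
      gimple stmt = gimple_build_assign (flag, boolean_true_node);

      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
    }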



+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);

the 2nd assign is no longer a load, so I suppose you don't need any lim_data
for it?


Re-did all this.  See patch.



+  else if (store == STORE_IF_CHANGED_FLAG)
+    execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+                           store_flag);

and this can pass NULL instead of mem_ssa?


Got rid of mem_ssa altogether, as well as the if-changed approach which 
proved inferior to the flags based approach.



Overall this looks reasonable.  With also handling trapping things I meant
that we should be able to apply store-motion to

int a[256];
void foo (int *p)
{
   int i;
   for (i = 0; i < 256; i++)
 if (flag)
   *p = a[i];
}

but we cannot, because the transform

   lsm = *p;
  for (i = 0; i < 256; ++i)
 if (flag)
   lsm = a[i];
   *p = lsm;

even when the store is properly conditional the load is not (and you do not
change that).  The unconditional load may trap here.  What I meant to say
is that we do not need the load at all unless there is also a load involved
inside the loop - if we use the flag-based approach.  If there is also a load
inside the loop we'd need to perform the load there (or not handle this
case)


Ahh!  Yes, this sounds very profitable.  Would you mind me doing this in 
a separate patch?  I am already handling loads in this incarnation, so 
this should be straightforward.


Speak of loads, I am keeping the information as an additional bitmap in 
`memory_accesses', as ->refs_in_loop was set for stores as well, so I 
couldn't depend on it.  Let me know if you have another idea.


Mildly tested.  Wadaya think?

Thanks again.
Aldy
* gimple.h (gimple_set_in_transaction): Remove.
(gimple_in_transaction): Look in BB instead.
(gimple_statement_base): Remove in_transaction field.
* basic-block.h (enum bb_flags): Add BB_IN_TRANSACTION.
* trans-mem.c (compute_transaction_bits): Place transaction bit
information into basic blocks.
* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
* doc/tm.texi: Regenerate.
* tree-ssa-loop-im.c (execute_sm_if_changed): New.
(execute_sm_if_changed_flag): New.
(execute_sm_if_changed_flag_set): New.
(execute_sm): Do not generate data races unless requested.
* targhooks.c (default_can_compare_bitwise_p): New.
* targhooks.h (default_can_compare_bitwise_p): Protoize.
* target.def (can_compare_bitwise_p): New hook.
* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
conditional.
* opts.c (finish_options): Do not allow store or load data races
with -fgnu-tm.

Index: trans-mem.c
===
--- trans-mem.c (revision 

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-24 Thread Aldy Hernandez

On 04/13/12 03:46, Richard Guenther wrote:

On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com  wrote:


Richard.  Thanks so much for reviewing and providing an alternative 
approach, which AFAICT provides superior results.



A similar effect could be obtained by keeping a flag whether we entered the
path that stored the value before the transform.  Thus, do

   lsm = g2;  // technically not needed, if you also handle loads
   flag = false;
   for (...)
 {
if (g1)
  {
 if (flag)
   g2 = lsm;
  }
else
  {
 lsm = 0;
 flag = true;
  }
 }
   if (flag)
 g2 = lsm;


I have implemented this in the attached patch, and it seems to be 
generating better code than my other if-changed approach.  It also 
avoids generating unnecessary loads that may trap.


For the original testcase:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
 int l;
 for (l = 0; l < 1234; l++)
 {
   if (g_1)
 return l;
   else
 g_2 = 0;
 }
 return 999;
}

After all optimizations are done, we are now generating the following 
for the flags approach:


new:
func_1:
        movl    g_1(%rip), %edx
        xorl    %eax, %eax
        testl   %edx, %edx
        je      .L5
        rep
        ret
.L5:
        movl    $0, g_2(%rip)
        movw    $999, %ax
        ret

Which is significantly better than the unmodified trunk of:

func_1:
        movl    g_1(%rip), %edx
        movl    g_2(%rip), %eax
        testl   %edx, %edx
        je      .L5
        movl    %eax, g_2(%rip)
        xorl    %eax, %eax
        ret
.L5:
        movl    $0, g_2(%rip)
        movl    $999, %eax
        ret

And in other less contrived cases, we generate correct code that avoids 
writes on code paths that did not have it.  For example, in Hans 
register promotion example:


int count;

struct obj {
int data;
struct obj *next;
} *q;

void func()
{
  struct obj *p;
  for (p = q; p; p = p->next)
    if (p->data > 0)
      count++;
}

Under the new memory model we should avoid promoting count to a 
register and unilaterally writing to it upon exiting the loop.


With the attached patch, we now generate:

func:
.LFB0:
        movq    q(%rip), %rax
        xorl    %ecx, %ecx
        movl    count(%rip), %edx
        testq   %rax, %rax
        je      .L1
.L9:
        movl    (%rax), %esi
        testl   %esi, %esi
        jle     .L3
        addl    $1, %edx
        movl    $1, %ecx
.L3:
        movq    8(%rax), %rax
        testq   %rax, %rax
        jne     .L9
        testb   %cl, %cl
        jne     .L12
.L1:
        rep
        ret
.L12:
        movl    %edx, count(%rip)
        ret

Which is as expected.

I don't understand your suggestion of:

lsm = g2;  // technically not needed, if you also handle loads

that would allow to extend this to cover the cases where the access may
eventually trap (if you omit the load before the loop).


Can't I just remove the lsm=g2?  What's this about also handling loads?



Not sure which is more efficient (you can eventually combine flag variables
for different store locations, combining the if-changed compare is not so easy
I guess).


So far, I see better code generated with the flags approach, so...


1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
  In the PR, Richard mentions that both FRE/PRE can figure this out, but they
are not run after store motion.  DOM should also be able to, but apparently
does not :(.  Tips?


Well.  Schedule FRE after loop opts ...

DOM is not prepared to handle loads/stores with differing VOPs - it was never
updated really after the alias-improvements merge.


2. The GIMPLE_CONDs I create in this patch are currently causing problems
with complex floats/doubles.  do_jump is complaining that it can't compare
them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
values since complex lowering has already happened (??). Is there a more
canonical way of creating a GIMPLE_COND that may contain complex floats at
this stage?


As you are handling redundant stores you want a bitwise comparison anyway,
but yes, complex compares need to be lowered.  But as said, you want a
bitwise comparison, not a value-comparison.  You can try using


I can ignore all this.  I have implemented both alternatives, with a 
target hook as suggested, but I'm thinking of removing it altogether and 
just leaving the flags approach.



Few comments on the patch itself.

+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+   add_bb_to_loop (then_bb, new_bb->loop_father);

+ gsi = gsi_start_bb (new_bb);
+ t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);

Hmm, why do you need to re-load the value?  Don't you have both
the value as it was loaded before the loop and the new value (the lsm tmp reg)
already?  Why not simply remember the SSA 

Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-17 Thread Aldy Hernandez

On 04/13/12 18:22, Boehm, Hans wrote:




-Original Message-
From: Aldy Hernandez [mailto:al...@redhat.com]
Sent: Thursday, April 12, 2012 3:12 PM
To: Richard Guenther
Cc: Andrew MacLeod; Boehm, Hans; gcc-patches; Torvald Riegel
Subject: [PR tree-optimization/52558]: RFC: questions on store data
race

Here we have a testcase that affects both the C++ memory model and
transactional memory.

[Hans, this is caused by the same problem that is causing the
speculative register promotion issue you and Torvald pointed me at].




In the following testcase (adapted from the PR), the loop invariant
motion pass caches a pre-existing value for g_2, and then performs a
store to g_2 on every path, causing a store data race:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
int l;
  for (l = 0; l < 1234; l++)
  {
    if (g_1)
      return l;
    else
      g_2 = 0;    <-- Store to g_2 should only happen if !g_1
}
return 999;
}

This gets transformed into something like:

g_2_lsm = g_2;
if (g_1) {
g_2 = g_2_lsm;  // boo! hiss!
return 0;
} else {
g_2_lsm = 0;
g_2 = g_2_lsm;
}

The spurious write to g_2 could cause a data race.

Presumably the g_2_lsm = g_2 is actually outside the loop?

Why does the second g_2 = g_2_lsm; get introduced?  I would have expected it 
before the return.  Was the example just over-abbreviated?


There is some abbreviation going on :).  To be exact, this is what -O2 
currently produces for the lim1 pass.


<bb 2>:
  pretmp.4_1 = g_1;
  g_2_lsm.6_12 = g_2;

<bb 3>:
  # l_13 = PHI <l_6(5), 0(2)>
  # g_2_lsm.6_10 = PHI <g_2_lsm.6_11(5), g_2_lsm.6_12(2)>
  g_1.0_4 = pretmp.4_1;
  if (g_1.0_4 != 0)
    goto <bb 7>;
  else
    goto <bb 4>;

<bb 4>:
  g_2_lsm.6_11 = 0;
  l_6 = l_13 + 1;
  if (l_6 != 1234)
    goto <bb 5>;
  else
    goto <bb 8>;

<bb 8>:
  # g_2_lsm.6_18 = PHI <g_2_lsm.6_11(4)>
  g_2 = g_2_lsm.6_18;
  goto <bb 6>;

<bb 5>:
  goto <bb 3>;

<bb 7>:
  # g_2_lsm.6_17 = PHI <g_2_lsm.6_10(3)>
  # l_19 = PHI <l_13(3)>
  g_2 = g_2_lsm.6_17;

<bb 6>:
  # l_2 = PHI <l_19(7), 999(8)>
  return l_2;

So yes, there seems to be another write to g_2 inside the else, but 
probably because we have merged some basic blocks along the way.




Other than that, this sounds right to me.  So does Richard's flag-based 
version, which is the approach I would have originally expected.  But that 
clearly costs you a register.  It would be interesting to see how they compare.


I am working on the flag based approach.

Thanks to both of you.


Re: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-13 Thread Richard Guenther
On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez al...@redhat.com wrote:
 Here we have a testcase that affects both the C++ memory model and
 transactional memory.

 [Hans, this is caused by the same problem that is causing the speculative
 register promotion issue you and Torvald pointed me at].

 In the following testcase (adapted from the PR), the loop invariant motion
 pass caches a pre-existing value for g_2, and then performs a store to g_2
 on every path, causing a store data race:

 int g_1 = 1;
 int g_2 = 0;

 int func_1(void)
 {
  int l;
  for (l = 0; l < 1234; l++)
  {
    if (g_1)
      return l;
    else
      g_2 = 0;  <-- Store to g_2 should only happen if !g_1
  }
  return 999;
 }

 This gets transformed into something like:

        g_2_lsm = g_2;
        if (g_1) {
                g_2 = g_2_lsm;  // boo! hiss!
                return 0;
        } else {
                g_2_lsm = 0;
                g_2 = g_2_lsm;
        }

 The spurious write to g_2 could cause a data race.

 Andrew has suggested verifying that the store to g_2 was actually different
 than on entry, and letting PHI copy propagation optimize the redundant
 comparisons.  Like this:

        g_2_lsm = g_2;
        if (g_1) {
                if (g_2_lsm != g_2)     // hopefully optimized away
                        g_2 = g_2_lsm;  // hopefully optimized away
                return 0;
        } else {
                g_2_lsm = 0;
                if (g_2_lsm != g_2)     // hopefully optimized away
                        g_2 = g_2_lsm;
        }

 ...which PHI copy propagation and/or friends should be able to optimize.

 For that matter, regardless of the memory model, the proposed solution would
 improve the existing code by removing the extra store (in this contrived
 test case anyhow).

 Anyone see a problem with this approach (see attached patch)?

Can we _remove_ a store we perceive as redundant (with a single-threaded
view) with the memory model?  If so, then this sounds plausible (minus
the optimization that if the variable is written to unconditionally we avoid
this comparison -- see the places where we check whether we can apply
store motion to possibly trapping locations).

A similar effect could be obtained by keeping a flag whether we entered the
path that stored the value before the transform.  Thus, do

  lsm = g2;  // technically not needed, if you also handle loads
  flag = false;
  for (...)
{
   if (g1)
 {
if (flag)
  g2 = lsm;
 }
   else
 {
lsm = 0;
flag = true;
 }
}
  if (flag)
g2 = lsm;

that would allow to extend this to cover the cases where the access may
eventually trap (if you omit the load before the loop).

Not sure which is more efficient (you can eventually combine flag variables
for different store locations, combining the if-changed compare is not so easy
I guess).

 Assuming the unlikely scenario that everyone likes this :), I have two
 problems that I would like answered.  But feel free to ignore if the
 approach is a no go.

 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
  In the PR, Richard mentions that both FRE/PRE can figure this out, but they
 are not run after store motion.  DOM should also be able to, but apparently
 does not :(.  Tips?

Well.  Schedule FRE after loop opts ...

DOM is not prepared to handle loads/stores with differing VOPs - it was never
updated really after the alias-improvements merge.

 2. The GIMPLE_CONDs I create in this patch are currently causing problems
 with complex floats/doubles.  do_jump is complaining that it can't compare
 them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
 values since complex lowering has already happened (??). Is there a more
 canonical way of creating a GIMPLE_COND that may contain complex floats at
 this stage?

As you are handling redundant stores you want a bitwise comparison anyway,
but yes, complex compares need to be lowered.  But as said, you want a
bitwise comparison, not a value-comparison.  You can try using

  if (VIEW_CONVERT_EXPR <int_type_for_mode (mode_for_size (...))> (lsm_tmp_reg)
      != VIEW_CONVERT_EXPR <...> (orig_value))
    ...

that can of course result in really awful codegen (think of fp - gp register
moves, or even stack spills).  Which maybe hints at that the flag approach
would be better in some cases?  (you need to cache the original value
in a register anyway, the only thing you save is updating it when you would
update the flag value)

Maybe you can combine both, eventually dependent on a target hook
can_compare_bitwise (mode) that would tell you if the target can efficiently
compare two registers in mode for bit equality.
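
A possible default for such a hook, purely as a sketch (the hook only shows up 
later in the posted patch's ChangeLog as default_can_compare_bitwise_p; the 
body below is an assumption, not the actual patch):

  /* Sketch: assume any scalar integer mode can be compared for bit
     equality cheaply; targets would override this as needed.  */

  bool
  default_can_compare_bitwise_p (enum machine_mode mode)
  {
    return SCALAR_INT_MODE_P (mode);
  }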

Few comments on the patch itself.

+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+   add_bb_to_loop (then_bb, new_bb->loop_father);

+ gsi 

RE: [PR tree-optimization/52558]: RFC: questions on store data race

2012-04-13 Thread Boehm, Hans


 -Original Message-
 From: Aldy Hernandez [mailto:al...@redhat.com]
 Sent: Thursday, April 12, 2012 3:12 PM
 To: Richard Guenther
 Cc: Andrew MacLeod; Boehm, Hans; gcc-patches; Torvald Riegel
 Subject: [PR tree-optimization/52558]: RFC: questions on store data
 race
 
 Here we have a testcase that affects both the C++ memory model and
 transactional memory.
 
 [Hans, this is caused by the same problem that is causing the
 speculative register promotion issue you and Torvald pointed me at].
 
 In the following testcase (adapted from the PR), the loop invariant
 motion pass caches a pre-existing value for g_2, and then performs a
 store to g_2 on every path, causing a store data race:
 
 int g_1 = 1;
 int g_2 = 0;
 
 int func_1(void)
 {
int l;
    for (l = 0; l < 1234; l++)
    {
      if (g_1)
        return l;
      else
        g_2 = 0;   <-- Store to g_2 should only happen if !g_1
}
return 999;
 }
 
 This gets transformed into something like:
 
   g_2_lsm = g_2;
   if (g_1) {
   g_2 = g_2_lsm;  // boo! hiss!
   return 0;
   } else {
   g_2_lsm = 0;
   g_2 = g_2_lsm;
   }
 
 The spurious write to g_2 could cause a data race.
Presumably the g_2_lsm = g_2 is actually outside the loop?

Why does the second g_2 = g_2_lsm; get introduced?  I would have expected it 
before the return.  Was the example just over-abbreviated?

Other than that, this sounds right to me.  So does Richard's flag-based 
version, which is the approach I would have originally expected.  But that 
clearly costs you a register.  It would be interesting to see how they compare.

Hans

 
 Andrew has suggested verifying that the store to g_2 was actually
 different than on entry, and letting PHI copy propagation optimize the
 redundant comparisons.  Like this:
 
   g_2_lsm = g_2;
   if (g_1) {
   if (g_2_lsm != g_2) // hopefully optimized away
   g_2 = g_2_lsm;  // hopefully optimized away
   return 0;
   } else {
   g_2_lsm = 0;
   if (g_2_lsm != g_2) // hopefully optimized away
   g_2 = g_2_lsm;
   }
 
 ...which PHI copy propagation and/or friends should be able to
 optimize.
 
 For that matter, regardless of the memory model, the proposed solution
 would improve the existing code by removing the extra store (in this
 contrived test case anyhow).
 
 Anyone see a problem with this approach (see attached patch)?
 
 Assuming the unlikely scenario that everyone likes this :), I have two
 problems that I would like answered.  But feel free to ignore if the
 approach is a no go.
 
 1. No pass can figure out the equality (or inequality) of g_2_lsm and
 g_2.  In the PR, Richard mentions that both FRE/PRE can figure this
 out, but they are not run after store motion.  DOM should also be able
 to, but apparently does not :(.  Tips?
 
 2. The GIMPLE_CONDs I create in this patch are currently causing
 problems with complex floats/doubles.  do_jump is complaining that it
 can't compare them, so I assume it is too late (in tree-ssa-loop-im.c)
 to compare complex values since complex lowering has already happened
 (??). Is there a more canonical way of creating a GIMPLE_COND that may
 contain complex floats at this stage?
 
 Thanks folks.


[PR tree-optimization/52558]: RFC: questions on store data race

2012-04-12 Thread Aldy Hernandez
Here we have a testcase that affects both the C++ memory model and 
transactional memory.


[Hans, this is caused by the same problem that is causing the 
speculative register promotion issue you and Torvald pointed me at].


In the following testcase (adapted from the PR), the loop invariant 
motion pass caches a pre-existing value for g_2, and then performs a 
store to g_2 on every path, causing a store data race:


int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
  int l;
  for (l = 0; l < 1234; l++)
  {
    if (g_1)
      return l;
    else
      g_2 = 0;  <-- Store to g_2 should only happen if !g_1
  }
  return 999;
}

This gets transformed into something like:

g_2_lsm = g_2;
if (g_1) {
g_2 = g_2_lsm;  // boo! hiss!
return 0;
} else {
g_2_lsm = 0;
g_2 = g_2_lsm;
}

The spurious write to g_2 could cause a data race.

Andrew has suggested verifying that the store to g_2 was actually 
different than on entry, and letting PHI copy propagation optimize the 
redundant comparisons.  Like this:


g_2_lsm = g_2;
if (g_1) {
if (g_2_lsm != g_2) // hopefully optimized away
g_2 = g_2_lsm;  // hopefully optimized away
return 0;
} else {
g_2_lsm = 0;
if (g_2_lsm != g_2) // hopefully optimized away
g_2 = g_2_lsm;
}

...which PHI copy propagation and/or friends should be able to optimize.

For that matter, regardless of the memory model, the proposed solution 
would improve the existing code by removing the extra store (in this 
contrived test case anyhow).


Anyone see a problem with this approach (see attached patch)?

Assuming the unlikely scenario that everyone likes this :), I have two 
problems that I would like answered.  But feel free to ignore if the 
approach is a no go.


1. No pass can figure out the equality (or inequality) of g_2_lsm and 
g_2.  In the PR, Richard mentions that both FRE/PRE can figure this out, 
but they are not run after store motion.  DOM should also be able to, 
but apparently does not :(.  Tips?


2. The GIMPLE_CONDs I create in this patch are currently causing 
problems with complex floats/doubles.  do_jump is complaining that it 
can't compare them, so I assume it is too late (in tree-ssa-loop-im.c) 
to compare complex values since complex lowering has already happened 
(??). Is there a more canonical way of creating a GIMPLE_COND that may 
contain complex floats at this stage?


Thanks folks.
* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
conditional.
* opts.c (finish_options): Do not allow store or load data races
with -fgnu-tm.
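
A hedged sketch of what the finish_options change could amount to (the opts.c 
hunk is not part of the excerpt below; the parameter names are assumed from 
the --param flags used elsewhere in this thread, so treat it as illustrative):

  /* Sketch: under -fgnu-tm, do not allow store or load data races
     unless the user set the params explicitly.  */
  if (opts->x_flag_tm)
    {
      maybe_set_param_value (PARAM_ALLOW_STORE_DATA_RACES, 0,
                             opts->x_param_values, opts_set->x_param_values);
      maybe_set_param_value (PARAM_ALLOW_LOAD_DATA_RACES, 0,
                             opts->x_param_values, opts_set->x_param_values);
    }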

Index: tree-ssa-loop-im.c
===
--- tree-ssa-loop-im.c  (revision 186108)
+++ tree-ssa-loop-im.c  (working copy)
@@ -1999,8 +1999,59 @@ execute_sm (struct loop *loop, VEC (edge
 
   FOR_EACH_VEC_ELT (edge, exits, i, ex)
 {
-  store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
-  gsi_insert_on_edge (ex, store);
+  basic_block new_bb, then_bb, old_dest;
+  edge then_old_edge;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  tree t1;
+
+  if (PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
+   {
+ store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+ gsi_insert_on_edge (ex, store);
+   }
+  else
+   {
+ old_dest = ex->dest;
+ new_bb = split_edge (ex);
+ then_bb = create_empty_bb (new_bb);
+ if (current_loops && new_bb->loop_father)
+   add_bb_to_loop (then_bb, new_bb->loop_father);
+
+ gsi = gsi_start_bb (new_bb);
+ t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
+ stmt = gimple_build_assign (t1, unshare_expr (ref->mem));
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+ stmt = gimple_build_cond (NE_EXPR, t1, tmp_var,
+   NULL_TREE, NULL_TREE);
+ gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+ gsi = gsi_start_bb (then_bb);
+ store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+ gsi_insert_after (&gsi, store, GSI_CONTINUE_LINKING);
+
+ make_edge (new_bb, then_bb, EDGE_TRUE_VALUE);
+ make_edge (new_bb, old_dest, EDGE_FALSE_VALUE);
+ then_old_edge = make_edge (then_bb, old_dest, EDGE_FALLTHRU);
+ set_immediate_dominator (CDI_DOMINATORS, then_bb, new_bb);
+
+ for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
+   {
+ gimple phi = gsi_stmt (gsi);
+ unsigned i;
+
+ for (i = 0; i < gimple_phi_num_args (phi); i++)
+   if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+ {
+   tree arg = gimple_phi_arg_def (phi, i);
+