[og7] backported "[nvptx, PR84041] Add memory_barrier insn"

2018-04-11 Thread Tom de Vries

On 04/09/2018 03:19 PM, Tom de Vries wrote:

Hi,

we've been having hanging OpenMP tests for nvptx offloading: 
for-{3,5,6}.c and the corresponding C++ test-cases.


The failures have now been analyzed down to gomp_ptrlock_get in 
libgomp/config/nvptx/ptrlock.h:

...
static inline void *
gomp_ptrlock_get (gomp_ptrlock_t *ptrlock)
{
  uintptr_t v = (uintptr_t) __atomic_load_n (ptrlock, MEMMODEL_ACQUIRE);
  if (v > 2)
    return (void *) v;

  if (v == 0
      && __atomic_compare_exchange_n (ptrlock, &v, 1, false,
				      MEMMODEL_ACQUIRE,
				      MEMMODEL_ACQUIRE))
    return NULL;

  while (v == 1)
    v = (uintptr_t) __atomic_load_n (ptrlock, MEMMODEL_ACQUIRE);

  return (void *) v;
}
...

There's no atomic load insn defined for nvptx, and no memory barrier
insn either, so the atomic load ends up being emitted as a normal load.
The PTX JIT compiler performs loop-invariant code motion and moves the
load out of the loop, which turns the while loop into an infinite loop.



Fix this conservatively by defining the memory_barrier insn. It could
possibly be fixed more optimally by implementing an atomic load
operation for nvptx.


Built on x86_64 with nvptx accelerator and reg-tested libgomp.

Committed to stage4 trunk.



And back-ported to og7 branch.

Thanks,
- Tom


0001-nvptx-Add-memory_barrier-insn.patch


[nvptx] Add memory_barrier insn

2018-04-09  Tom de Vries  

	PR target/84041
	* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add UNSPECV_MEMBAR.
	(define_expand "memory_barrier"): New define_expand.
	(define_insn "*memory_barrier"): New insn.

---
 gcc/config/nvptx/nvptx.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 4f4453d..68bba36 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -55,6 +55,7 @@
    UNSPECV_CAS
    UNSPECV_XCHG
    UNSPECV_BARSYNC
+   UNSPECV_MEMBAR
    UNSPECV_DIM_POS
 
    UNSPECV_FORK

@@ -1459,6 +1460,27 @@
   "\\tbar.sync\\t%0;"
   [(set_attr "predicable" "false")])
 
+(define_expand "memory_barrier"
+  [(set (match_dup 0)
+	(unspec_volatile:BLK [(match_dup 0)] UNSPECV_MEMBAR))]
+  ""
+{
+  operands[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
+  MEM_VOLATILE_P (operands[0]) = 1;
+})
+
+;; Ptx defines the memory barriers membar.cta, membar.gl and membar.sys
+;; (corresponding to cuda functions threadfence_block, threadfence and
+;; threadfence_system).  For the insn memory_barrier we use membar.sys.  This
+;; may be overconservative, but before using membar.gl instead we'll need to
+;; explain in detail why it's safe to use.  For now, use membar.sys.
+(define_insn "*memory_barrier"
+  [(set (match_operand:BLK 0 "" "")
+	(unspec_volatile:BLK [(match_dup 0)] UNSPECV_MEMBAR))]
+  ""
+  "\\tmembar.sys;"
+  [(set_attr "predicable" "false")])
+
 (define_insn "nvptx_nounroll"
   [(unspec_volatile [(const_int 0)] UNSPECV_NOUNROLL)]
   ""




