Re: [PATCH] Add faster HTM fastpath for libitm TSX v2

2013-01-30 Thread Torvald Riegel
On Tue, 2013-01-29 at 20:19 +0100, Andi Kleen wrote:
  next time please reply to the points raised in a review if you disagree
  with them, and don't just ignore them.  That speeds up the review.
 
 It was all in the previous email on the topic.

This v2 patch did not incorporate all the changes that I requested, nor
did you explicitly object to those you didn't incorporate.  Maybe you
don't care what's in libitm's comments, but I do.

// See gtm_thread::begin_transaction.
   -uint32_t GTM::htm_fastpath = 0;
   +uint32_t GTM::htm_fastpath asm("__gtm_htm_fastpath") = 0;
   +
   +uint32_t *GTM::global_lock asm("__gtm_global_lock");
  
  Rearrange the serial lock's fields (see below).
 
 To my knowledge C++ classes have no guaranteed layout,
 so that's not safe because there is no guarantee where
 the vtable pointers are.  It would only be safe with plain old structs.

And gtm_rwlock doesn't have any virtual members, right?  So it's like a
plain old struct (i.e., it's a POD type).

 +  // pr_tryHTM can be set by an assembler fast path when it already tried
 +  // a hardware transaction once. In this case we do one retry less.
 
 pr_tryHTM is already documented.

And I asked you to simply reference this in a comment.  What's the
problem with that?  I do not want people working on libitm to have to
grep through the code for uses of pr_tryHTM and look for comments that
might be related to it just because you don't want to put a simple
reference into the comment at the declaration of pr_tryHTM.  Can't you
see that adding the reference is just plain useful for everybody else?


Torvald



Re: [PATCH] Add faster HTM fastpath for libitm TSX v2

2013-01-29 Thread Torvald Riegel
On Thu, 2013-01-24 at 19:30 -0800, Andi Kleen wrote:
 From: Andi Kleen a...@linux.intel.com
 
 The libitm TSX hardware transaction fast path currently does quite a bit of
 unnecessary work (saving registers etc.) before even trying to start
 a hardware transaction. This patch moves the initial attempt at a
 transaction early into the assembler stub. Complicated work like retries
 is still done in C. So this is essentially a fast path for the fast
 path.
 
 The assembler code doesn't understand the layout of serial_lock, but
 it needs to check that serial_lock is free. We export just the lock
 variable as a separate pointer for this.
 
 The assembler code controls the HTM fast path with a new pr_tryHTM flag.
 
 I had to reorganize the retry loop slightly so that the waiting for the
 lock to be free again is done first (because the first transaction
 is already done elsewhere). This makes the patch look larger than
 it really is. This just moves one basic block around.
 
 Passes bootstrap/testsuite on x86_64
 
 v2: Use asm() to supply assembler name of namespaced variables
 Make comments more verbose.

Andi,

next time please reply to the points raised in a review if you disagree
with them, and don't just ignore them.  That speeds up the review.

 libitm/:
 2013-01-24  Andi Kleen  a...@linux.intel.com
 
   * beginend.cc (GTM::htm_fastpath): Add asm alias to
   __gtm_htm_fastpath
   (GTM::global_lock): Add.
   (begin_transaction): Check pr_tryHTM. Remove one iteration
   from HTM retry loop. Move lock free check to the beginning of
   loop body.
   * config/linux/rwlock.h (gtm_rwlock::get_lock_var): Add.
   * config/posix/rwlock.h (gtm_rwlock::get_lock_var): Add.
   * config/x86/sjlj.S: Include config.h.
   (pr_tryHTM, pr_uninstrumentedCode, pr_hasNoAbort,
   a_runInstrumentedCode, a_runUninstrumentedCode,
   _XABORT_EXPLICIT, _XABORT_RETRY): Add defines.
   (_ITM_beginTransaction): Add HTM fast path.
   * libitm.h (pr_tryHTM): Add.
   * libitm_i.h (GTM::htm_fastpath): Add asm alias.
   (GTM::gtm_global_lock): Add.
   * method-serial.cc (method_group::init, fini): Initialize
   GTM::global_lock and GTM::htm_fastpath.
 ---
  libitm/beginend.cc   |   46 +++--
  libitm/config/linux/rwlock.h |5 +
  libitm/config/posix/rwlock.h |5 +
  libitm/config/x86/sjlj.S |   47 ++
  libitm/libitm.h  |7 +--
  libitm/libitm_i.h|3 ++-
  libitm/method-serial.cc  |2 ++
  7 files changed, 92 insertions(+), 23 deletions(-)
 
 diff --git a/libitm/beginend.cc b/libitm/beginend.cc
 index 4369946..118920b 100644
 --- a/libitm/beginend.cc
 +++ b/libitm/beginend.cc
 @@ -55,7 +55,9 @@ static pthread_key_t thr_release_key;
  static pthread_once_t thr_release_once = PTHREAD_ONCE_INIT;
  
  // See gtm_thread::begin_transaction.
 -uint32_t GTM::htm_fastpath = 0;
 +uint32_t GTM::htm_fastpath asm("__gtm_htm_fastpath") = 0;
 +
 +uint32_t *GTM::global_lock asm("__gtm_global_lock");

Rearrange the serial lock's fields (see below).

  /* Allocate a transaction structure.  */
  void *
 @@ -187,27 +189,13 @@ GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
// indeed in serial mode, and HW transactions should never need serial mode
// for any internal changes (e.g., they never abort visibly to the STM code
// and thus do not trigger the standard retry handling).

Add a comment as second paragraph in the previous block of comments,
briefly explaining when pr_tryHTM is set (ie, under which condition),
and that this happens in the assembler implementation of
_ITM_beginTransaction.

 -  if (likely(htm_fastpath && (prop & pr_hasNoAbort)))
 +  // pr_tryHTM can be set by an assembler fast path when it already tried
 +  // a hardware transaction once. In this case we do one retry less.
 +  if (likely(prop & pr_tryHTM))
  {
 -  for (uint32_t t = htm_fastpath; t; t--)
 +  // One transaction has been already tried by the assembler fast path.
 +  for (uint32_t t = htm_fastpath - 1; t; t--)
   {
 -   uint32_t ret = htm_begin();
 -   if (htm_begin_success(ret))
 - {
 -   // We are executing a transaction now.
 -   // Monitor the writer flag in the serial-mode lock, and abort
 -   // if there is an active or waiting serial-mode transaction.
 -   if (unlikely(serial_lock.is_write_locked()))
 - htm_abort();
 -   else
 - // We do not need to set a_saveLiveVariables because of HTM.
 - return (prop & pr_uninstrumentedCode) ?
 - a_runUninstrumentedCode : a_runInstrumentedCode;
 - }
 -   // The transaction has aborted.  Don't retry if it's unlikely that
 -   // retrying the transaction will be successful.
 -   if (!htm_abort_should_retry(ret))
 - break;

Re: [PATCH] Add faster HTM fastpath for libitm TSX v2

2013-01-29 Thread Andi Kleen
 next time please reply to the points raised in a review if you disagree
 with them, and don't just ignore them.  That speeds up the review.

It was all in the previous email on the topic.

   // See gtm_thread::begin_transaction.
  -uint32_t GTM::htm_fastpath = 0;
  +uint32_t GTM::htm_fastpath asm("__gtm_htm_fastpath") = 0;
  +
  +uint32_t *GTM::global_lock asm("__gtm_global_lock");
 
 Rearrange the serial lock's fields (see below).

To my knowledge C++ classes have no guaranteed layout,
so that's not safe because there is no guarantee where
the vtable pointers are.  It would only be safe with plain old structs.

+  // pr_tryHTM can be set by an assembler fast path when it already tried
+  // a hardware transaction once. In this case we do one retry less.

pr_tryHTM is already documented.

-Andi


Re: [PATCH] Add faster HTM fastpath for libitm TSX v2

2013-01-25 Thread Uros Bizjak
Hello!

 The libitm TSX hardware transaction fast path currently does quite a bit of
 unnecessary work (saving registers etc.) before even trying to start
 a hardware transaction. This patch moves the initial attempt at a
 transaction early into the assembler stub. Complicated work like retries
 is still done in C. So this is essentially a fast path for the fast
 path.

 The assembler code doesn't understand the layout of serial_lock, but
 it needs to check that serial_lock is free. We export just the lock
 variable as a separate pointer for this.

Probably the attached (RFC) patch can be useful in this case. The
patch allows to specify the label for xbegin, so it is possible to
implement code like following (non-sensical) example:

--cut here--
extern int lock1;
extern int lock2;

int test ()
{
  register unsigned int res asm ("eax");

  __builtin_ia32_xbegin_label (&&__txn_abort);

  lock1 = 0;

 __txn_abort:
  if (res & 0x10)
lock2 = 1;

  return 0;
}
--cut here--

gcc -O2 -mrtm:

test:
xbegin  .L2
movl$0, lock1(%rip)
.L2:
testb   $16, %al
je  .L3
movl$1, lock2(%rip)
.L3:
xorl%eax, %eax
ret

Please note that the edge from __builtin_ia32_xbegin_label is not
known to tree optimizers, so someone familiar with this part of the
compiler should enhance/fix the patch. With the (fixed) patch, you can
implement your assembly using plain C++, probably also using ifunc
relocations.

Uros.
Index: i386.md
===
--- i386.md (revision 195460)
+++ i386.md (working copy)
@@ -18093,6 +18093,17 @@
   DONE;
 })
 
+(define_expand "xbegin_label"
+  [(parallel
+[(set (pc)
+ (if_then_else (ne (unspec [(const_int 0)] UNSPEC_XBEGIN_ABORT)
+   (const_int 0))
+   (match_operand 1)
+   (pc)))
+ (set (match_operand:SI 0 register_operand)
+ (unspec_volatile:SI [(match_dup 0)] UNSPECV_XBEGIN))])]
+  "TARGET_RTM")
+
 (define_insn "xbegin_1"
   [(set (pc)
(if_then_else (ne (unspec [(const_int 0)] UNSPEC_XBEGIN_ABORT)
Index: i386.c
===
--- i386.c  (revision 195460)
+++ i386.c  (working copy)
@@ -26570,6 +26570,7 @@ enum ix86_builtins
 
   /* RTM */
   IX86_BUILTIN_XBEGIN,
+  IX86_BUILTIN_XBEGIN_LABEL,
   IX86_BUILTIN_XEND,
   IX86_BUILTIN_XABORT,
   IX86_BUILTIN_XTEST,
@@ -26942,6 +26943,7 @@ static const struct builtin_description bdesc_spec
 
   /* RTM */
   { OPTION_MASK_ISA_RTM, CODE_FOR_xbegin, "__builtin_ia32_xbegin", IX86_BUILTIN_XBEGIN, UNKNOWN, (int) UNSIGNED_FTYPE_VOID },
+  { OPTION_MASK_ISA_RTM, CODE_FOR_xbegin_label, "__builtin_ia32_xbegin_label", IX86_BUILTIN_XBEGIN_LABEL, UNKNOWN, (int) VOID_FTYPE_PCVOID },
   { OPTION_MASK_ISA_RTM, CODE_FOR_xend, "__builtin_ia32_xend", IX86_BUILTIN_XEND, UNKNOWN, (int) VOID_FTYPE_VOID },
   { OPTION_MASK_ISA_RTM, CODE_FOR_xtest, "__builtin_ia32_xtest", IX86_BUILTIN_XTEST, UNKNOWN, (int) INT_FTYPE_VOID },
 };
@@ -32239,6 +32241,17 @@ addcarryx:
 
   return target;
 
+case IX86_BUILTIN_XBEGIN_LABEL:
+  arg0 = CALL_EXPR_ARG (exp, 0);
+  op0 = expand_normal (arg0);
+  if (GET_CODE (op0) != LABEL_REF)
+   {
+ error ("the xbegin's argument must be a label");
+ return const0_rtx;
+   }
+  emit_jump_insn (gen_xbegin_label (gen_rtx_REG (SImode, AX_REG), op0));
+  return 0;
+
 case IX86_BUILTIN_XABORT:
   icode = CODE_FOR_xabort;
   arg0 = CALL_EXPR_ARG (exp, 0);


Re: [PATCH] Add faster HTM fastpath for libitm TSX v2

2013-01-25 Thread Andi Kleen
 Probably the attached (RFC) patch can be useful in this case. The
 patch allows to specify the label for xbegin, so it is possible to
 implement code like following (non-sensical) example:

It can be actually implemented using asm goto. I have some macros
for this.  And the tree optimizers should even support it.

#define XBEGIN(label) \
 asm volatile goto(".byte 0xc7,0xf8 ; .long %l0-1f\n1:" ::: "eax" : label)
#define XEND() asm volatile(".byte 0x0f,0x01,0xd5")
#define XFAIL(label) label: asm volatile("" ::: "eax")
#define XFAIL_STATUS(label, status) label: asm volatile("" : "=a" (status))

But the assembler code is still needed for this because the non-TSX path needs
to save all registers (it's like a setjmp/longjmp), and that cannot
be implemented in C.

The goal of the assembler code was not to use the label, but to move
the initial transaction before the setjmp code.

-Andi


[PATCH] Add faster HTM fastpath for libitm TSX v2

2013-01-24 Thread Andi Kleen
From: Andi Kleen a...@linux.intel.com

The libitm TSX hardware transaction fast path currently does quite a bit of
unnecessary work (saving registers etc.) before even trying to start
a hardware transaction. This patch moves the initial attempt at a
transaction early into the assembler stub. Complicated work like retries
is still done in C. So this is essentially a fast path for the fast
path.

The assembler code doesn't understand the layout of serial_lock, but
it needs to check that serial_lock is free. We export just the lock
variable as a separate pointer for this.

The assembler code controls the HTM fast path with a new pr_tryHTM flag.

I had to reorganize the retry loop slightly so that the waiting for the
lock to be free again is done first (because the first transaction
is already done elsewhere). This makes the patch look larger than
it really is. This just moves one basic block around.

Passes bootstrap/testsuite on x86_64

v2: Use asm() to supply assembler name of namespaced variables
Make comments more verbose.

libitm/:
2013-01-24  Andi Kleen  a...@linux.intel.com

* beginend.cc (GTM::htm_fastpath): Add asm alias to
__gtm_htm_fastpath
(GTM::global_lock): Add.
(begin_transaction): Check pr_tryHTM. Remove one iteration
from HTM retry loop. Move lock free check to the beginning of
loop body.
* config/linux/rwlock.h (gtm_rwlock::get_lock_var): Add.
* config/posix/rwlock.h (gtm_rwlock::get_lock_var): Add.
* config/x86/sjlj.S: Include config.h.
(pr_tryHTM, pr_uninstrumentedCode, pr_hasNoAbort,
 a_runInstrumentedCode, a_runUninstrumentedCode,
 _XABORT_EXPLICIT, _XABORT_RETRY): Add defines.
(_ITM_beginTransaction): Add HTM fast path.
* libitm.h (pr_tryHTM): Add.
* libitm_i.h (GTM::htm_fastpath): Add asm alias.
(GTM::gtm_global_lock): Add.
* method-serial.cc (method_group::init, fini): Initialize
GTM::global_lock and GTM::htm_fastpath.
---
 libitm/beginend.cc   |   46 +++--
 libitm/config/linux/rwlock.h |5 +
 libitm/config/posix/rwlock.h |5 +
 libitm/config/x86/sjlj.S |   47 ++
 libitm/libitm.h  |7 +--
 libitm/libitm_i.h|3 ++-
 libitm/method-serial.cc  |2 ++
 7 files changed, 92 insertions(+), 23 deletions(-)

diff --git a/libitm/beginend.cc b/libitm/beginend.cc
index 4369946..118920b 100644
--- a/libitm/beginend.cc
+++ b/libitm/beginend.cc
@@ -55,7 +55,9 @@ static pthread_key_t thr_release_key;
 static pthread_once_t thr_release_once = PTHREAD_ONCE_INIT;
 
 // See gtm_thread::begin_transaction.
-uint32_t GTM::htm_fastpath = 0;
+uint32_t GTM::htm_fastpath asm("__gtm_htm_fastpath") = 0;
+
+uint32_t *GTM::global_lock asm("__gtm_global_lock");
 
 /* Allocate a transaction structure.  */
 void *
@@ -187,27 +189,13 @@ GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb)
   // indeed in serial mode, and HW transactions should never need serial mode
   // for any internal changes (e.g., they never abort visibly to the STM code
   // and thus do not trigger the standard retry handling).
-  if (likely(htm_fastpath && (prop & pr_hasNoAbort)))
+  // pr_tryHTM can be set by an assembler fast path when it already tried
+  // a hardware transaction once. In this case we do one retry less.
+  if (likely(prop & pr_tryHTM))
 {
-  for (uint32_t t = htm_fastpath; t; t--)
+  // One transaction has been already tried by the assembler fast path.
+  for (uint32_t t = htm_fastpath - 1; t; t--)
{
- uint32_t ret = htm_begin();
- if (htm_begin_success(ret))
-   {
- // We are executing a transaction now.
- // Monitor the writer flag in the serial-mode lock, and abort
- // if there is an active or waiting serial-mode transaction.
- if (unlikely(serial_lock.is_write_locked()))
-   htm_abort();
- else
-   // We do not need to set a_saveLiveVariables because of HTM.
-   return (prop & pr_uninstrumentedCode) ?
-   a_runUninstrumentedCode : a_runInstrumentedCode;
-   }
- // The transaction has aborted.  Don't retry if it's unlikely that
- // retrying the transaction will be successful.
- if (!htm_abort_should_retry(ret))
-   break;
  // Wait until any concurrent serial-mode transactions have finished.
  // This is an empty critical section, but won't be elided.
  if (serial_lock.is_write_locked())
@@ -225,6 +213,24 @@ GTM::gtm_thread::begin_transaction (uint32_t prop, const 
gtm_jmpbuf *jb)
  // we have retried so often that we should go serial to avoid
  // starvation.
}
+
+ uint32_t ret = htm_begin();
+ if (htm_begin_success(ret))
+   {
+