Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-09 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 08, 2012 at 02:48:27PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote:

* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
[...]
 Just to make sure I understand -- the reason that the del functions
 say no memory barrier instead of acts like rcu_dereference() is
 that the del functions don't return anything.

[...]
  @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
  cds_lfht_node *node);
* function.
* Call with rcu_read_lock held.
* Threads calling this API need to be registered RCU read-side 
  threads.
  + * This function does not issue any memory barrier.
*/

One more question about the del memory ordering semantic. Following
commit

commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Date:   Mon Dec 19 16:45:51 2011 -0500

rculfhash: Relax atomicity guarantees required by removal operation

The atomicity guarantees for the removal operation do not need to 
be as
strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
introduced REMOVAL_OWNER_FLAG to perform removal.

Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com


The del operation is performed in two steps:

1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
unconditionally into the node's next pointer.
2 - uatomic_xchg(), which atomically exchanges the old pointer with
its current value (read) or'd with the REMOVAL_OWNER flag. The trick
is that if the xchg returns a pointer with the REMOVAL_OWNER flag
set, it means we are not the first thread to set this flag, so we
should not free the node. However, if xchg returns a node without
the REMOVAL_OWNER flag set, we are indeed the first to set it, so
we should call free.
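
For illustration, here is a minimal sketch of that two-step removal, using a hypothetical node layout in which the flags live in the low bits of the next field (the real rculfhash.c differs in its types and helpers):

#include <stdint.h>
#include <urcu/uatomic.h>

/* Hypothetical layout: low bits of 'next' carry the removal flags. */
struct node {
        uintptr_t next;
};

#define REMOVED_FLAG            ((uintptr_t) 1 << 0)  /* tombstone seen by readers */
#define REMOVAL_OWNER_FLAG      ((uintptr_t) 1 << 1)  /* who gets to free the node */

/* Returns nonzero if the caller owns the removal and must free the node
 * (after a grace period), zero if another thread already owns it. */
static int del_is_owner_sketch(struct node *node)
{
        uintptr_t old;

        /* Step 1: set the tombstone; RCU traversals skip the node from now on. */
        uatomic_or(&node->next, REMOVED_FLAG);

        /*
         * Step 2: race for removal ownership.  Every deleter stores the
         * current value or'd with REMOVAL_OWNER_FLAG; only the thread whose
         * xchg returns a value with the flag still clear wins the race.
         */
        old = uatomic_xchg(&node->next,
                           uatomic_read(&node->next) | REMOVAL_OWNER_FLAG);
        return !(old & REMOVAL_OWNER_FLAG);
}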

Now regarding memory ordering semantics, should we consider the atomic
action of del to apply when the or is called, or when the xchg is
called ? Or should we simply document that the del effect on the node
happens in two separate steps ?

The way I see it, the actual effect of removal, as seen from RCU read
traversal and lookup point of view, is observed as soon as the REMOVED
tombstone is set, so I would think that the atomic publication of the
removal is performed by the or.

However, we ensure full memory barriers around xchg, but not usually 
around or. Therefore, the current implementation does not issue a 
memory barrier before the or, so we should either change our planned
memory barrier documentation, or the implementation, to match. This
would probably require creation of a cmm_smp_mb__before_uatomic_or(), so
x86 does not end up issuing a useless memory barrier.
   
   My kneejerk reaction is that the or is really doing the deletion.
   Readers and other updaters care about the deletion, not about which CPU
   is going to do the free.
   
   Or did I misunderstand how this works?
  
  You got it right, this is how I see it too.
  
  However, in order to provide a full memory barrier before the or, we'd
  need to add a cmm_smp_mb() before the or (I don't think we want to
  presume that our or operation issues full memory barriers on all
  architectures).
  
However, on x86, the "lock; or" does issue a full memory barrier. So I
  think we should introduce a macro that can translate into a memory
  barrier on architectures that need it, and to nothing on x86.
  
  Thoughts ?
 
 Makes sense to me!

All right, committed the following. I ended up using
cmm_smp_mb__before_uatomic_*.

commit 7b783f818175cd92d98f78e761331f306ff406a5
Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Date:   Tue May 8 17:12:20 2012 -0400

rculfhash: document implied memory barriers

We choose to provide full memory barriers before and after successful
hash table update operations. Eventually, new API with weaker semantic
can be added, but let's make the basic API as fool-proof as possible.

Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com

commit 196f4fab9bf26c48bc318ac2ff985469c4f62c7e
Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Date:   Tue May 8 17:09:46 2012 -0400

rculfhash: Ensure future-proof memory barrier semantic consistency

Use cmm_smp_mb__before_uatomic_or() prior to the uatomic_or() in
_rcu_lfht_del() to ensure correct memory barrier semantic when we relax
(in the future) the barrier implementation of some architectures.
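
For reference, a sketch of how such a macro can be wired up (the real definitions live in urcu's per-architecture uatomic headers; CONFIG_SKETCH_X86 below is a made-up switch standing in for the architecture selection):

#include <urcu/compiler.h>      /* cmm_barrier() */
#include <urcu/arch.h>          /* cmm_smp_mb() */

#ifdef CONFIG_SKETCH_X86
/* x86: the lock-prefixed or already implies a full barrier, only stop the compiler. */
#define cmm_smp_mb__before_uatomic_or()         cmm_barrier()
#else
/* Generic: the atomic or is not assumed to order anything, emit a real barrier. */
#define cmm_smp_mb__before_uatomic_or()         cmm_smp_mb()
#endif

The removal path can then invoke cmm_smp_mb__before_uatomic_or() immediately before the uatomic_or() that sets the REMOVED flag, which is what the commit above does.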


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-08 Thread Paul E. McKenney
On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote:
 
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
 [...]
  Just to make sure I understand -- the reason that the del functions
  say no memory barrier instead of acts like rcu_dereference() is
  that the del functions don't return anything.
 
 [...]
   @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
   cds_lfht_node *node);
 * function.
 * Call with rcu_read_lock held.
 * Threads calling this API need to be registered RCU read-side threads.
   + * This function does not issue any memory barrier.
 */
 
 One more question about the del memory ordering semantic. Following
 commit
 
 commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
 Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 Date:   Mon Dec 19 16:45:51 2011 -0500
 
 rculfhash: Relax atomicity guarantees required by removal operation
 
 The atomicity guarantees for the removal operation do not need to be as
 strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
 introduced REMOVAL_OWNER_FLAG to perform removal.
 
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 
 
 The del operation is performed in two steps:
 
 1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
 unconditionally into the node's next pointer.
 2 - uatomic_xchg(), which atomically exchanges the old pointer with
 its current value (read) or'd with the REMOVAL_OWNER flag. The trick
 is that if the xchg returns a pointer with the REMOVAL_OWNER flag
 set, it means we are not the first thread to set this flag, so we
 should not free the node. However, if xchg returns a node without
 the REMOVAL_OWNER flag set, we are indeed the first to set it, so
 we should call free.
 
 Now regarding memory ordering semantics, should we consider the atomic
 action of del to apply when the or is called, or when the xchg is
 called ? Or should we simply document that the del effect on the node
 happens in two separate steps ?
 
 The way I see it, the actual effect of removal, as seen from RCU read
 traversal and lookup point of view, is observed as soon as the REMOVED
 tombstone is set, so I would think that the atomic publication of the
 removal is performed by the or.
 
 However, we ensure full memory barriers around xchg, but not usually 
 around or. Therefore, the current implementation does not issue a 
 memory barrier before the or, so we should either change our planned
 memory barrier documentation, or the implementation, to match. This
 would probably require creation of a cmm_smp_mb__before_uatomic_or(), so
 x86 does not end up issuing a useless memory barrier.

My kneejerk reaction is that the or is really doing the deletion.
Readers and other updaters care about the deletion, not about which CPU
is going to do the free.

Or did I misunderstand how this works?

Thanx, Paul

 Thoughts ?
 
 Thanks,
 
 Mathieu
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 
 ___
 rp mailing list
 r...@svcs.cs.pdx.edu
 http://svcs.cs.pdx.edu/mailman/listinfo/rp
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-08 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote:
  
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
  [...]
   Just to make sure I understand -- the reason that the del functions
   say no memory barrier instead of acts like rcu_dereference() is
   that the del functions don't return anything.
  
  [...]
@@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
cds_lfht_node *node);
  * function.
  * Call with rcu_read_lock held.
  * Threads calling this API need to be registered RCU read-side 
threads.
+ * This function does not issue any memory barrier.
  */
  
  One more question about the del memory ordering semantic. Following
  commit
  
  commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
  Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  Date:   Mon Dec 19 16:45:51 2011 -0500
  
  rculfhash: Relax atomicity guarantees required by removal operation
  
  The atomicity guarantees for the removal operation do not need to be as
  strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
  introduced REMOVAL_OWNER_FLAG to perform removal.
  
  Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  
  
  The del operation is performed in two steps:
  
  1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
  unconditionally into the node's next pointer.
  2 - uatomic_xchg(), which atomically exchanges the old pointer with
  its current value (read) or'd with the REMOVAL_OWNER flag. The trick
  is that if the xchg returns a pointer with the REMOVAL_OWNER flag
  set, it means we are not the first thread to set this flag, so we
  should not free the node. However, if xchg returns a node without
  the REMOVAL_OWNER flag set, we are indeed the first to set it, so
  we should call free.
  
  Now regarding memory ordering semantics, should we consider the atomic
  action of del to apply when the or is called, or when the xchg is
  called ? Or should we simply document that the del effect on the node
  happens in two separate steps ?
  
  The way I see it, the actual effect of removal, as seen from RCU read
  traversal and lookup point of view, is observed as soon as the REMOVED
  tombstone is set, so I would think that the atomic publication of the
  removal is performed by the or.
  
  However, we ensure full memory barriers around xchg, but not usually 
  around or. Therefore, the current implementation does not issue a 
  memory barrier before the or, so we should either change our planned
  memory barrier documentation, or the implementation, to match. This
  would probably require creation of a cmm_smp_mb__before_uatomic_or(), so
   x86 does not end up issuing a useless memory barrier.
 
 My kneejerk reaction is that the or is really doing the deletion.
 Readers and other updaters care about the deletion, not about which CPU
 is going to do the free.
 
 Or did I misunderstand how this works?

You got it right, this is how I see it too.

However, in order to provide a full memory barrier before the or, we'd
need to add a cmm_smp_mb() before the or (I don't think we want to
presume that our or operation issues full memory barriers on all
architectures).

However, on x86, the "lock; or" does issue a full memory barrier. So I
think we should introduce a macro that can translate into a memory
barrier on architectures that need it, and to nothing on x86.

Thoughts ?

Thanks,

Mathieu

 
   Thanx, Paul
 
  Thoughts ?
  
  Thanks,
  
  Mathieu
  
  -- 
  Mathieu Desnoyers
  Operating System Efficiency R&D Consultant
  EfficiOS Inc.
  http://www.efficios.com
  
  ___
  rp mailing list
  r...@svcs.cs.pdx.edu
  http://svcs.cs.pdx.edu/mailman/listinfo/rp
  
 
 
 ___
 lttng-dev mailing list
 lttng-dev@lists.lttng.org
 http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-08 Thread Paul E. McKenney
On Tue, May 08, 2012 at 02:48:27PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote:
   
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
   [...]
Just to make sure I understand -- the reason that the del functions
say no memory barrier instead of acts like rcu_dereference() is
that the del functions don't return anything.
   
   [...]
 @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
 cds_lfht_node *node);
   * function.
   * Call with rcu_read_lock held.
   * Threads calling this API need to be registered RCU read-side 
 threads.
 + * This function does not issue any memory barrier.
   */
   
   One more question about the del memory ordering semantic. Following
   commit
   
   commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
   Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   Date:   Mon Dec 19 16:45:51 2011 -0500
   
   rculfhash: Relax atomicity guarantees required by removal operation
   
   The atomicity guarantees for the removal operation do not need to be 
   as
   strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
   introduced REMOVAL_OWNER_FLAG to perform removal.
   
   Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   
   
   The del operation is performed in two steps:
   
   1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
   unconditionally into the node's next pointer.
   2 - uatomic_xchg(), which atomically exchanges the old pointer with
   its current value (read) or'd with the REMOVAL_OWNER flag. The trick
   is that if the xchg returns a pointer with the REMOVAL_OWNER flag
   set, it means we are not the first thread to set this flag, so we
   should not free the node. However, if xchg returns a node without
   the REMOVAL_OWNER flag set, we are indeed the first to set it, so
   we should call free.
   
   Now regarding memory ordering semantics, should we consider the atomic
   action of del to apply when the or is called, or when the xchg is
   called ? Or should we simply document that the del effect on the node
   happens in two separate steps ?
   
   The way I see it, the actual effect of removal, as seen from RCU read
   traversal and lookup point of view, is observed as soon as the REMOVED
   tombstone is set, so I would think that the atomic publication of the
   removal is performed by the or.
   
   However, we ensure full memory barriers around xchg, but not usually 
   around or. Therefore, the current implementation does not issue a 
   memory barrier before the or, so we should either change our planned
   memory barrier documentation, or the implementation, to match. This
   would probably require creation of a cmm_smp_mb__before_uatomic_or(), so
    x86 does not end up issuing a useless memory barrier.
  
  My kneejerk reaction is that the or is really doing the deletion.
  Readers and other updaters care about the deletion, not about which CPU
  is going to do the free.
  
  Or did I misunderstand how this works?
 
 You got it right, this is how I see it too.
 
 However, in order to provide a full memory barrier before the or, we'd
 need to add a cmm_smp_mb() before the or (I don't think we want to
 presume that our or operation issues full memory barriers on all
 architectures).
 
 However, on x86, the "lock; or" does issue a full memory barrier. So I
 think we should introduce a macro that can translate into a memory
 barrier on architectures that need it, and to nothing on x86.
 
 Thoughts ?

Makes sense to me!

Thanx, Paul

 Thanks,
 
 Mathieu
 
  
  Thanx, Paul
  
   Thoughts ?
   
   Thanks,
   
   Mathieu
   
   -- 
   Mathieu Desnoyers
    Operating System Efficiency R&D Consultant
   EfficiOS Inc.
   http://www.efficios.com
   
   ___
   rp mailing list
   r...@svcs.cs.pdx.edu
   http://svcs.cs.pdx.edu/mailman/listinfo/rp
   
  
  
  ___
  lttng-dev mailing list
  lttng-dev@lists.lttng.org
  http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-07 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Sat, May 05, 2012 at 02:22:12PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
[...]
   Just to make sure I understand -- the reason that the del functions
   say no memory barrier instead of acts like rcu_dereference() is
   that the del functions don't return anything.
  
  Hrm, you bring an interesting point. I think we should change the two
  rcu_dereference() in _cds_lfht_del for CMM_LOAD_SHARED(). The
  difference between the two is that CMM_LOAD_SHARED() does not imply a
  read barrier between the read and following uses of the data pointed to
  by the pointer read.
  
  Same thing for cds_lfht_is_node_deleted: the rcu_dereference() should
  be changed for a CMM_LOAD_SHARED(), because we never use the loaded
  pointer as a pointer to other data. There are a few other locations
  where the pointer is only used for its flags.
  
  Here is what I propose:
  
  diff --git a/rculfhash.c b/rculfhash.c
  index b9f795f..6e27436 100644
  --- a/rculfhash.c
  +++ b/rculfhash.c
  @@ -923,7 +923,7 @@ int _cds_lfht_replace(struct cds_lfht *ht, unsigned 
  long size,
  bucket = lookup_bucket(ht, size, 
  bit_reverse_ulong(old_node->reverse_hash));
  _cds_lfht_gc_bucket(bucket, new_node);
  
  -   assert(is_removed(rcu_dereference(old_node->next)));
  +   assert(is_removed(CMM_LOAD_SHARED(old_node->next)));
 
 This is good.
 
  return 0;
   }
  
  @@ -1061,7 +1061,7 @@ int _cds_lfht_del(struct cds_lfht *ht, unsigned long 
  size,
   * logical removal flag). Return -ENOENT if the node had
   * previously been removed.
   */
  -   next = rcu_dereference(node->next);
  +   next = CMM_LOAD_SHARED(node->next);
  if (caa_unlikely(is_removed(next)))
  return -ENOENT;
  assert(!is_bucket(next));
 
 As long as next is not dereferenced anywhere, this is good.  Which
 appears to be the case.
 
  @@ -1082,7 +1082,7 @@ int _cds_lfht_del(struct cds_lfht *ht, unsigned long 
  size,
  bucket = lookup_bucket(ht, size, bit_reverse_ulong(node->reverse_hash));
  _cds_lfht_gc_bucket(bucket, node);
  
  -   assert(is_removed(rcu_dereference(node->next)));
  +   assert(is_removed(CMM_LOAD_SHARED(node->next)));
 
 This one is fine as well.
 
  /*
   * Last phase: atomically exchange node->next with a version
   * having REMOVAL_OWNER_FLAG set. If the returned node->next
  @@ -1510,7 +1510,7 @@ void cds_lfht_lookup(struct cds_lfht *ht, unsigned 
  long hash,
  }
  node = clear_flag(next);
  }
  -   assert(!node || !is_bucket(rcu_dereference(node->next)));
  +   assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));
 
 As is this one.
 
  iter->node = node;
  iter->next = next;
   }
  @@ -1543,7 +1543,7 @@ void cds_lfht_next_duplicate(struct cds_lfht *ht, 
  cds_lfht_match_fct match,
  }
  node = clear_flag(next);
  }
  -   assert(!node || !is_bucket(rcu_dereference(node->next)));
  +   assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));
 
 Same here.
 
  iter->node = node;
  iter->next = next;
   }
  @@ -1565,7 +1565,7 @@ void cds_lfht_next(struct cds_lfht *ht, struct 
  cds_lfht_iter *iter)
  }
  node = clear_flag(next);
  }
  -   assert(!node || !is_bucket(rcu_dereference(node->next)));
  +   assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));
 
 And here.
 
  iter->node = node;
  iter->next = next;
   }
  @@ -1668,7 +1668,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
  cds_lfht_node *node)
  
   int cds_lfht_is_node_deleted(struct cds_lfht_node *node)
   {
  -   return is_removed(rcu_dereference(node->next));
  +   return is_removed(CMM_LOAD_SHARED(node->next));
 
 And also here.
 
   }
  
   static
  
  If it's ok for you, I will first commit this change, and then commit the
  memory barrier documentation with your reviewed-by.
 
 Looks good!

OK, this change is committed as:

commit a85eff522c253434a9c2b53d6c3a702842fb1d5d
Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Date:   Mon May 7 11:18:14 2012 -0400

rculfhash: replace unneeded rcu_dereference by CMM_LOAD_SHARED

The difference between the two is that CMM_LOAD_SHARED() does not imply
a read barrier between the read and following uses of the data pointed
to by the pointer read.

All sites that only use the pointer load for its bits (never
dereference) don't need the read barrier implied by rcu_dereference.

Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
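
To make the rule concrete, a small sketch contrasting the two kinds of loads (struct item, REMOVED_FLAG and the helpers below are illustrative, not part of the rculfhash API):

#include <stdint.h>
#include <urcu.h>               /* rcu_dereference() */
#include <urcu/system.h>        /* CMM_LOAD_SHARED() */

#define REMOVED_FLAG    ((uintptr_t) 1)         /* illustrative tombstone bit */

struct item {
        struct item *next;      /* low bit doubles as the REMOVED flag */
        int key;
};

/* Flag-only read: the pointer is never followed, so a plain volatile load
 * is enough and no read barrier / dependency ordering is needed. */
static int item_is_removed(struct item *it)
{
        return ((uintptr_t) CMM_LOAD_SHARED(it->next)) & REMOVED_FLAG;
}

/* Dereferencing read: the pointed-to data is accessed, so the pointer load
 * must go through rcu_dereference() to order it before the dependent load. */
static int next_key(struct item *it)
{
        struct item *next = rcu_dereference(it->next);

        next = (struct item *) ((uintptr_t) next & ~REMOVED_FLAG);
        return next ? next->key : -1;
}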

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-07 Thread Mathieu Desnoyers

* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
[...]
 Just to make sure I understand -- the reason that the del functions
 say no memory barrier instead of acts like rcu_dereference() is
 that the del functions don't return anything.

[...]
  @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
  cds_lfht_node *node);
* function.
* Call with rcu_read_lock held.
* Threads calling this API need to be registered RCU read-side threads.
  + * This function does not issue any memory barrier.
*/

One more question about the del memory ordering semantic. Following
commit

commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Date:   Mon Dec 19 16:45:51 2011 -0500

rculfhash: Relax atomicity guarantees required by removal operation

The atomicity guarantees for the removal operation do not need to be as
strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
introduced REMOVAL_OWNER_FLAG to perform removal.

Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com


The del operation is performed in two steps:

1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
unconditionally into the node's next pointer.
2 - uatomic_xchg(), which atomically exchanges the old pointer with
its current value (read) or'd with the REMOVAL_OWNER flag. The trick
is that if the xchg returns a pointer with the REMOVAL_OWNER flag
set, it means we are not the first thread to set this flag, so we
should not free the node. However, if xchg returns a node without
the REMOVAL_OWNER flag set, we are indeed the first to set it, so
we should call free.

Now regarding memory ordering semantics, should we consider the atomic
action of del to apply when the or is called, or when the xchg is
called ? Or should we simply document that the del effect on the node
happens in two separate steps ?

The way I see it, the actual effect of removal, as seen from RCU read
traversal and lookup point of view, is observed as soon as the REMOVED
tombstone is set, so I would think that the atomic publication of the
removal is performed by the or.

However, we ensure full memory barriers around xchg, but not usually 
around or. Therefore, the current implementation does not issue a 
memory barrier before the or, so we should either change our planned
memory barrier documentation, or the implementation, to match. This
would probably require creation of a cmm_smp_mb__before_uatomic_or(), so
x86 does not end up issuing a useless memory barrier.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-05 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote:
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  [...]
 
 A write barrier would be sufficient in the case where there were only
 two threads observing each other.  A full memory barrier would be 
 needed
 to prevent the assertion from firing in this sort of case (not sure 
 that
 this is exactly right, but something like this):
 
 Initial contents: B, C
 
 T0: add A; del B
 T1: if (!lookup B) { add B; del C }
 T2: r1 = lookup C; smp_mb(); r2 = lookup A
 
 assert(lookup C || lookup A);

What you are bringing here as counter-example is, I think, transitivity.
   
   Yep!
   
I'm trying to figure out how your example could fail, and I cannot see
how. The closest-to-failing scenario I can construct is detailed below,
but it does not fail. After that, I'm proposing a different
scenario, which I think will be more appropriate for the current topic.

Attempted detail of your scenario:

T0: add A
T0: wmb
T0: (add A globally observable)
   
   Almost.  All that this really guarantees is that if someone sees
   the add B, they will also see the add A.
   
T0: del B
T0: (del B globally observable)
T1: (add A NOT brought into cache)
T1: (del B brought into cache)
T1: read B
T1: (test false)
T1: add B
T1: wmb
T1: (add B globally observable)
T1: del C
T1: (del C globally observable)
   
   Here, if someone sees the del C, they will see the add B, and
   they also will have lost the opportunity to modify B before T1
   reads from it and modifies it.
   
T2: (add A NOT brought into cache)
T2: (del C brought into cache)
T2: read C - not there.
T2: mb
T2: (add A brought into cache)
T2: read A - there - success.
   
   So we see that C is not there.  We know that B would be there if
   we looked at it.  But we don't look at B, we look at A.  But the
   ordering back to T0's add A requires transitivity, which wmb
   does not guarantee.
  
  OK, got it!
  
   
If I look at the transitivity section in Linux memory-barriers.txt, I
notice that the example is mainly around using read barrier around loads
rather than general barrier. Let's see if I can modify that example to
come up with an example error case:

Initial content: empty

T0: add X
T1: r1 = lookup X; smp_mb; r2 = lookup Y
T2: add Y; r3 = lookup X

assert( !(r1 && !r2 && !r3) )

The key thing here is that if the barrier in T2 after add Y is a
smp_wmb rather than a smp_mb, this could allow the r3 = lookup X to be
reordered before add Y, thus allowing the assertion to fail.
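
A minimal sketch of this scenario over plain shared variables (it models each add as a flagged store and each lookup as a flagged load, so it only illustrates the ordering argument, not the hash table itself):

#include <urcu/arch.h>          /* cmm_smp_mb() */
#include <urcu/system.h>        /* CMM_LOAD_SHARED(), CMM_STORE_SHARED() */

static int x, y;                /* 1 == "present in the table" */
static int r1, r2, r3;

static void t0(void)            /* T0: add X */
{
        CMM_STORE_SHARED(x, 1);
}

static void t1(void)            /* T1: r1 = lookup X; smp_mb; r2 = lookup Y */
{
        r1 = CMM_LOAD_SHARED(x);
        cmm_smp_mb();
        r2 = CMM_LOAD_SHARED(y);
}

static void t2(void)            /* T2: add Y; r3 = lookup X */
{
        CMM_STORE_SHARED(y, 1);
        cmm_smp_mb();           /* if this were only cmm_smp_wmb(), the load
                                 * below could be reordered before the store
                                 * to y, allowing the assertion to fail */
        r3 = CMM_LOAD_SHARED(x);
}

/* run t0, t1, t2 concurrently, then: assert(!(r1 && !r2 && !r3)); */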
   
   Your example is simpler, and demonstrates the need just as well, so
   let's go with your example.
   
I think it would be more intuitive for users if lookups vs updates
performed on the same thread are ordered with full memory barriers.
Given that we don't want to add extra barriers in read operations, it
would make sense to guarantee full memory barriers before and after
updates.

So how about we use full memory barriers before and after each of: add,
del (success), add_unique (success), replace, and add_replace ? If we
ever want to relax those ordering guarantees, then we can always add new
update APIs with a weaker ordering.

Thoughts ?
   
   That line of reasoning makes a lot of sense to me!
  
  Sounds good.
  
  Here is what I propose, thoughts ?
 
 Just to make sure I understand -- the reason that the del functions
 say no memory barrier instead of acts like rcu_dereference() is
 that the del functions don't return anything.

Hrm, you bring an interesting point. I think we should change the two
rcu_dereference() in _cds_lfht_del for CMM_LOAD_SHARED(). The
difference between the two is that CMM_LOAD_SHARED() does not imply a
read barrier between the read and following uses of the data pointed to
by the pointer read.

Same thing for cds_lfht_is_node_deleted: the rcu_dereference() should
be changed for a CMM_LOAD_SHARED(), because we never use 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-04 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
[...]
   
   A write barrier would be sufficient in the case where there were only
   two threads observing each other.  A full memory barrier would be needed
   to prevent the assertion from firing in this sort of case (not sure that
   this is exactly right, but something like this):
   
   Initial contents: B, C
   
   T0: add A; del B
   T1: if (!lookup B) { add B; del C }
   T2: r1 = lookup C; smp_mb(); r2 = lookup A
   
   assert(lookup C || lookup A);
  
  What you are bringing here as counter-example is, I think, transitivity.
 
 Yep!
 
  I'm trying to figure out how your example could fail, and I cannot see
 how. The closest-to-failing scenario I can construct is detailed below,
 but it does not fail. After that, I'm proposing a different
  scenario, which I think will be more appropriate for the current topic.
  
  Attempted detail of your scenario:
  
  T0: add A
  T0: wmb
  T0: (add A globally observable)
 
 Almost.  All that this really guarantees is that if someone sees
 the add B, they will also see the add A.
 
  T0: del B
  T0: (del B globally observable)
  T1: (add A NOT brought into cache)
  T1: (del B brought into cache)
  T1: read B
  T1: (test false)
  T1: add B
  T1: wmb
  T1: (add B globally observable)
  T1: del C
  T1: (del C globally observable)
 
 Here, if someone sees the del C, they will see the add B, and
 they also will have lost the opportunity to modify B before T1
 reads from it and modifies it.
 
 T2: (add A NOT brought into cache)
 T2: (del C brought into cache)
 T2: read C - not there.
 T2: mb
 T2: (add A brought into cache)
 T2: read A - there - success.
 
 So we see that C is not there.  We know that B would be there if
 we looked at it.  But we don't look at B, we look at A.  But the
 ordering back to T0's add A requires transitivity, which wmb
 does not guarantee.

OK, got it!

 
  If I look at the transitivity section in Linux memory-barriers.txt, I
  notice that the example is mainly around using read barrier around loads
  rather than general barrier. Let's see if I can modify that example to
  come up with an example error case:
  
  Initial content: empty
  
  T0: add X
  T1: r1 = lookup X; smp_mb; r2 = lookup Y
  T2: add Y; r3 = lookup X
  
  assert( !(r1 && !r2 && !r3) )
  
  The key thing here is that if the barrier in T2 after add Y is a
  smp_wmb rather than a smp_mb, this could allow the r3 = lookup X to be
  reordered before add Y, thus allowing the assertion to fail.
 
 Your example is simpler, and demonstrates the need just as well, so
 let's go with your example.
 
  I think it would be more intuitive for users if lookups vs updates
  performed on the same thread are ordered with full memory barriers.
  Given that we don't want to add extra barriers in read operations, it
  would make sense to guarantee full memory barriers before and after
  updates.
  
  So how about we use full memory barriers before and after each of: add,
  del (success), add_unique (success), replace, and add_replace ? If we
  ever want to relax those ordering guarantees, then we can always add new
  update APIs with a weaker ordering.
  
  Thoughts ?
 
 That line of reasoning makes a lot of sense to me!

Sounds good.

Here is what I propose, thoughts ?


diff --git a/urcu/rculfhash.h b/urcu/rculfhash.h
index 2d8a310..2938e5e 100644
--- a/urcu/rculfhash.h
+++ b/urcu/rculfhash.h
@@ -203,6 +203,7 @@ void cds_lfht_count_nodes(struct cds_lfht *ht,
  *
  * Call with rcu_read_lock held.
  * Threads calling this API need to be registered RCU read-side threads.
+ * This function acts as a rcu_dereference() to read the node pointer.
  */
 void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
cds_lfht_match_fct match, const void *key,
@@ -226,6 +227,7 @@ void cds_lfht_lookup(struct cds_lfht *ht, unsigned long 
hash,
  * node returned by a previous cds_lfht_next.
  * Call with rcu_read_lock held.
  * Threads calling this API need to be registered RCU read-side threads.
+ * This function acts as a rcu_dereference() to read the node pointer.
  */
 void cds_lfht_next_duplicate(struct cds_lfht *ht,
cds_lfht_match_fct match, const void *key,
@@ -239,6 +241,7 @@ void 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-04 Thread Paul E. McKenney
On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 [...]

A write barrier would be sufficient in the case where there were only
two threads observing each other.  A full memory barrier would be needed
to prevent the assertion from firing in this sort of case (not sure that
this is exactly right, but something like this):

Initial contents: B, C

T0: add A; del B
T1: if (!lookup B) { add B; del C }
T2: r1 = lookup C; smp_mb(); r2 = lookup A

assert(lookup C || lookup A);
   
   What you are bringing here as counter-example is, I think, transitivity.
  
  Yep!
  
   I'm trying to figure out how your example could fail, and I cannot see
    how. The closest-to-failing scenario I can construct is detailed below,
    but it does not fail. After that, I'm proposing a different
   scenario, which I think will be more appropriate for the current topic.
   
   Attempted detail of your scenario:
   
   T0: add A
   T0: wmb
   T0: (add A globally observable)
  
  Almost.  All that this really guarantees is that if someone sees
  the add B, they will also see the add A.
  
   T0: del B
   T0: (del B globally observable)
   T1: (add A NOT brought into cache)
   T1: (del B brought into cache)
   T1: read B
   T1: (test false)
   T1: add B
   T1: wmb
   T1: (add B globally observable)
   T1: del C
   T1: (del C globally observable)
  
  Here, if someone sees the del C, they will see the add B, and
  they also will have lost the opportunity to modify B before T1
  reads from it and modifies it.
  
 T2: (add A NOT brought into cache)
 T2: (del C brought into cache)
 T2: read C - not there.
 T2: mb
 T2: (add A brought into cache)
 T2: read A - there - success.
  
  So we see that C is not there.  We know that B would be there if
  we looked at it.  But we don't look at B, we look at A.  But the
  ordering back to T0's add A requires transitivity, which wmb
  does not guarantee.
 
 OK, got it!
 
  
   If I look at the transitivity section in Linux memory-barriers.txt, I
   notice that the example is mainly around using read barrier around loads
   rather than general barrier. Let's see if I can modify that example to
   come up with an example error case:
   
   Initial content: empty
   
   T0: add X
   T1: r1 = lookup X; smp_mb; r2 = lookup Y
   T2: add Y; r3 = lookup X
   
   assert( !(r1 && !r2 && !r3) )
   
   The key thing here is that if the barrier in T2 after add Y is a
   smp_wmb rather than a smp_mb, this could allow the r3 = lookup X to be
   reordered before add Y, thus allowing the assertion to fail.
  
  Your example is simpler, and demonstrates the need just as well, so
  let's go with your example.
  
   I think it would be more intuitive for users if lookups vs updates
   performed on the same thread are ordered with full memory barriers.
   Given that we don't want to add extra barriers in read operations, it
   would make sense to guarantee full memory barriers before and after
   updates.
   
   So how about we use full memory barriers before and after each of: add,
   del (success), add_unique (success), replace, and add_replace ? If we
   ever want to relax those ordering guarantees, then we can always add new
   update APIs with a weaker ordering.
   
   Thoughts ?
  
  That line of reasoning makes a lot of sense to me!
 
 Sounds good.
 
 Here is what I propose, thoughts ?

Just to make sure I understand -- the reason that the del functions
say no memory barrier instead of acts like rcu_dereference() is
that the del functions don't return anything.

Assuming my understanding is correct:

Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com

Thanx, Paul

 diff --git a/urcu/rculfhash.h b/urcu/rculfhash.h
 index 2d8a310..2938e5e 100644
 --- a/urcu/rculfhash.h
 +++ b/urcu/rculfhash.h
 @@ -203,6 +203,7 @@ void cds_lfht_count_nodes(struct cds_lfht *ht,
   *
   * Call with rcu_read_lock held.
   * Threads calling this API need to be registered RCU read-side threads.
 + * This function acts as a rcu_dereference() to read the node pointer.
   */
  void cds_lfht_lookup(struct cds_lfht *ht, unsigned 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-03 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Wed, May 02, 2012 at 12:16:39AM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers 
  wrote:
   Hi!
   
   After 1 year of development, with the last 6-7 months 
   spent polishing
   the API and testing the implementation, I think it is 
   getting about time
   to release the RCU lock-free hash table in a new 
   Userspace RCU version
   (0.7).
   
   I recently described the guarantees provided by the hash 
   table in more
   detail, and created tests for the uniqueness guarantee 
   for traversals
   performed concurrently with add_unique and add_replace 
   operations. I
   also added test modes that create long hash chains, to 
   test corner-cases
   of the hash table.
   
   One small thing I wonder is whether we should document 
   that the hash
   table update operations imply full memory barriers ?
  
  The Linux kernel's rule seems good here -- if a hash-table 
  operation is
  atomic and returns a value, it should imply a full barrier. 
   So:
  
  cds_lfht_new(): No point in claiming barriers -- publishing 
  the
  pointer to the hash table is where the barriers are 
  important.
  
  cds_lfht_destroy(): Ditto.
  
  cds_lfht_lookup(): Not an update (let alone an atomic 
  update), no barriers.
  
  cds_lfht_next_duplicate(): Ditto.
  
  cds_lfht_first(): Ditto.
  
  cds_lfht_next(): Ditto.
  
  cds_lfht_add(): Atomic update, but no return value, so no 
  barrier
  implied.
 
 Yep, makes sense. We use cmpxchg internally to perform the 
 update, but
 it could make sense to eventually use a cmpxchg that has no 
 memory
 barriers to perform this update. So I agree on not providing 
 a memory
 barrier guarantee on the add operation, since it does not 
 return any
 value.
 
  
  cds_lfht_add_unique(): Atomic update that returns a value, 
  so should
  imply a full memory barrier.
 
 add_unique is a bit special:
 
 - if it returns the node received as parameter, it means the 
 add
   succeeded, which imply an update, and thus a memory barrier.

Hrm, thinking further: if we make the add operation not act 
as a
full memory barrier, then the add_unique success should not act 
as a
full mb neither.
   
   Think of it as being similar to the Linux kernel's atomic_inc() 
   and
   atomic_add_return() primitives.  The latter guarantees memory 
   barriers
   and the former does not.
  
  I think add/add_unique vs atomic_inc/atomic_add_return are 
  fundamentally
  different.
  
  atomic_inc:
   - input: increment
   - output: none
   - effect: increment target address value
  
  atomic_add_return:
   - input: increment
   - output: incremented value
   - effect: increment target address value, atomically reading the
 resulting value.
  
  hash table add:
   - input: new node to add
   - output: none
   - effect: atomically add the new node into the table
  
  hash table add_unique (success):
   - input: new node to add
   - output: (we just return whether the operation has succeeded)
   - effect: atomically add the new node into the table
  
  hash table add_unique (failure):
   - input: new node to try adding
   - output: (we just return whether the operation has succeeded)
   - effect: simple lookup (read)
  
  So as we see, the add_unique failure only performs a read. Adding a
  memory barrier before this read would require us to add a memory 
  barrier
  also on the success path, which would degrade performance. The 
  success
  path does: lookup failure, cmpxchg to add the node, retry if 
  changed.
  Adding a memory barrier before the lookup would add an extra memory
  barrier in 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-03 Thread Paul E. McKenney
On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Wed, May 02, 2012 at 12:16:39AM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers 
wrote:
 * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu 
   Desnoyers wrote:
Hi!

After 1 year of development, with the last 6-7 months 
spent polishing
the API and testing the implementation, I think it is 
getting about time
to release the RCU lock-free hash table in a new 
Userspace RCU version
(0.7).

I recently described the guarantees provided by the 
hash table in more
detail, and created tests for the uniqueness guarantee 
for traversals
performed concurrently with add_unique and add_replace 
operations. I
also added test modes that create long hash chains, to 
test corner-cases
of the hash table.

One small thing I wonder is whether we should document 
that the hash
table update operations imply full memory barriers ?
   
   The Linux kernel's rule seems good here -- if a 
   hash-table operation is
   atomic and returns a value, it should imply a full 
   barrier.  So:
   
   cds_lfht_new(): No point in claiming barriers -- 
   publishing the
 pointer to the hash table is where the barriers are 
   important.
   
   cds_lfht_destroy(): Ditto.
   
   cds_lfht_lookup(): Not an update (let alone an atomic 
   update), no barriers.
   
   cds_lfht_next_duplicate(): Ditto.
   
   cds_lfht_first(): Ditto.
   
   cds_lfht_next(): Ditto.
   
   cds_lfht_add(): Atomic update, but no return value, so no 
   barrier
 implied.
  
  Yep, makes sense. We use cmpxchg internally to perform the 
  update, but
  it could make sense to eventually use a cmpxchg that has no 
  memory
  barriers to perform this update. So I agree on not 
  providing a memory
  barrier guarantee on the add operation, since it does not 
  return any
  value.
  
   
   cds_lfht_add_unique(): Atomic update that returns a 
   value, so should
 imply a full memory barrier.
  
  add_unique is a bit special:
  
  - if it returns the node received as parameter, it means 
  the add
succeeded, which imply an update, and thus a memory 
  barrier.
 
 Hrm, thinking further: if we make the add operation not act 
 as a
 full memory barrier, then the add_unique success should not 
 act as a
 full mb neither.

Think of it as being similar to the Linux kernel's atomic_inc() 
and
atomic_add_return() primitives.  The latter guarantees memory 
barriers
and the former does not.
   
   I think add/add_unique vs atomic_inc/atomic_add_return are 
   fundamentally
   different.
   
   atomic_inc:
- input: increment
- output: none
- effect: increment target address value
   
   atomic_add_return:
- input: increment
- output: incremented value
- effect: increment target address value, atomically reading the
  resulting value.
   
   hash table add:
- input: new node to add
- output: none
- effect: atomically add the new node into the table
   
   hash table add_unique (success):
- input: new node to add
- output: (we just return whether the operation has succeeded)
- effect: atomically add the new node into the table
   
   hash table add_unique (failure):
- input: new node to try adding
- output: (we just return whether the operation has succeeded)
- effect: simple lookup (read)
   
   So as we see, the add_unique failure only performs a read. Adding 
   a
   memory barrier before this read would require us to add a memory 
   barrier
   also on the success path, which would 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-02 Thread Paul E. McKenney
On Wed, May 02, 2012 at 12:16:39AM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
   * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers 
 wrote:
  Hi!
  
  After 1 year of development, with the last 6-7 months spent 
  polishing
  the API and testing the implementation, I think it is 
  getting about time
  to release the RCU lock-free hash table in a new Userspace 
  RCU version
  (0.7).
  
  I recently described the guarantees provided by the hash 
  table in more
  detail, and created tests for the uniqueness guarantee for 
  traversals
  performed concurrently with add_unique and add_replace 
  operations. I
  also added test modes that create long hash chains, to test 
  corner-cases
  of the hash table.
  
  One small thing I wonder is whether we should document that 
  the hash
  table update operations imply full memory barriers ?
 
 The Linux kernel's rule seems good here -- if a hash-table 
 operation is
 atomic and returns a value, it should imply a full barrier.  
 So:
 
 cds_lfht_new(): No point in claiming barriers -- publishing 
 the
   pointer to the hash table is where the barriers are 
 important.
 
 cds_lfht_destroy(): Ditto.
 
 cds_lfht_lookup(): Not an update (let alone an atomic 
 update), no barriers.
 
 cds_lfht_next_duplicate(): Ditto.
 
 cds_lfht_first(): Ditto.
 
 cds_lfht_next(): Ditto.
 
 cds_lfht_add(): Atomic update, but no return value, so no 
 barrier
   implied.

Yep, makes sense. We use cmpxchg internally to perform the 
update, but
it could make sense to eventually use a cmpxchg that has no 
memory
barriers to perform this update. So I agree on not providing a 
memory
barrier guarantee on the add operation, since it does not 
return any
value.

 
 cds_lfht_add_unique(): Atomic update that returns a value, so 
 should
   imply a full memory barrier.

add_unique is a bit special:

- if it returns the node received as parameter, it means the add
  succeeded, which imply an update, and thus a memory barrier.
   
   Hrm, thinking further: if we make the add operation not act as a
   full memory barrier, then the add_unique success should not act 
   as a
   full mb neither.
  
  Think of it as being similar to the Linux kernel's atomic_inc() and
  atomic_add_return() primitives.  The latter guarantees memory 
  barriers
  and the former does not.
 
 I think add/add_unique vs atomic_inc/atomic_add_return are 
 fundamentally
 different.
 
 atomic_inc:
  - input: increment
  - output: none
  - effect: increment target address value
 
 atomic_add_return:
  - input: increment
  - output: incremented value
  - effect: increment target address value, atomically reading the
resulting value.
 
 hash table add:
  - input: new node to add
  - output: none
  - effect: atomically add the new node into the table
 
 hash table add_unique (success):
  - input: new node to add
  - output: (we just return whether the operation has succeeded)
  - effect: atomically add the new node into the table
 
 hash table add_unique (failure):
  - input: new node to try adding
  - output: (we just return whether the operation has succeeded)
  - effect: simple lookup (read)
 
 So as we see, the add_unique failure only performs a read. Adding a
 memory barrier before this read would require us to add a memory 
 barrier
 also on the success path, which would degrade performance. The success
 path does: lookup failure, cmpxchg to add the node, retry if changed.
 Adding a memory barrier before the lookup would add an extra memory
 barrier in addition to the one located in the cmpxchg, and I don't 
 think
 we want that overhead.

Perhaps a better analogy is cmpxchg() in the Linux kernel.  Some
architectures place the memory barrier before 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-01 Thread Paul E. McKenney
On Tue, May 01, 2012 at 11:12:15AM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
   Hi!
   
   After 1 year of development, with the last 6-7 months spent polishing
   the API and testing the implementation, I think it is getting about time
   to release the RCU lock-free hash table in a new Userspace RCU version
   (0.7).
   
   I recently described the guarantees provided by the hash table in more
   detail, and created tests for the uniqueness guarantee for traversals
   performed concurrently with add_unique and add_replace operations. I
   also added test modes that create long hash chains, to test corner-cases
   of the hash table.
   
   One small thing I wonder is whether we should document that the hash
   table update operations imply full memory barriers ?
  
  The Linux kernel's rule seems good here -- if a hash-table operation is
  atomic and returns a value, it should imply a full barrier.  So:
  
  cds_lfht_new(): No point in claiming barriers -- publishing the
  pointer to the hash table is where the barriers are important.
  
  cds_lfht_destroy(): Ditto.
  
  cds_lfht_lookup(): Not an update (let alone an atomic update), no barriers.
  
  cds_lfht_next_duplicate(): Ditto.
  
  cds_lfht_first(): Ditto.
  
  cds_lfht_next(): Ditto.
  
  cds_lfht_add(): Atomic update, but no return value, so no barrier
  implied.
 
 Yep, makes sense. We use cmpxchg internally to perform the update, but
 it could make sense to eventually use a cmpxchg that has no memory
 barriers to perform this update. So I agree on not providing a memory
 barrier guarantee on the add operation, since it does not return any
 value.
 
  cds_lfht_add_unique(): Atomic update that returns a value, so should
  imply a full memory barrier.
 
 add_unique is a bit special:
 
 - if it returns the node received as parameter, it means the add
   succeeded, which imply an update, and thus a memory barrier.
 
 - if it returns a different node than the one received as parameter, it
   failed, and thus means that it only performed a lookup, so there is no
   guarantee that a memory barrier has been executed.

Good point!  But I would suggest handling this like some architectures
handle a failing cmpxchg() -- there is a memory barrier beforehand,
but not afterwards.  This works well because although the caller can
easily supply a memory barrier after a failing cmpxchg(), there is no
way to supply one beforehand except by supplying a redundant barrier in
the case of a successful cmpxchg().

So I suggest that a failing add_unique() guarantee full memory barrier
beforehand, but not afterwards.
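
As a usage sketch of that rule from the caller's side (the container struct, key type and error convention are illustrative; cds_lfht_add_unique() and the match function follow the existing rculfhash API, and the calling thread is assumed to be a registered RCU read-side thread):

#include <errno.h>
#include <stdint.h>
#include <urcu.h>
#include <urcu/arch.h>          /* cmm_smp_mb() */
#include <urcu/rculfhash.h>

struct mynode {
        struct cds_lfht_node lfht_node;
        uint64_t key;
};

/* Returns 0 if 'new' was inserted, -EEXIST if the key already exists. */
static int insert_unique(struct cds_lfht *ht, unsigned long hash,
                         cds_lfht_match_fct match, struct mynode *new)
{
        struct cds_lfht_node *ret;

        rcu_read_lock();
        ret = cds_lfht_add_unique(ht, hash, match, &new->key, &new->lfht_node);
        rcu_read_unlock();

        if (ret != &new->lfht_node) {
                /*
                 * Add failed: 'ret' is the pre-existing node.  Under the
                 * semantic suggested above, a full barrier is implied before
                 * the failing add_unique but not after it, so the caller
                 * issues one here if its later accesses must be ordered
                 * against the lookup.
                 */
                cmm_smp_mb();
                return -EEXIST;
        }
        return 0;       /* success: behaves as a full-barrier update */
}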

Thanx, Paul

  cds_lfht_add_replace(): Ditto.
  
  cds_lfht_replace(): Ditto.
  
  cds_lfht_del(): Ditto.
 
 Yep, makes sense.
 
 I'll add this documentation in the API.
 
 Thanks!
 
 Mathieu
 
  
  cds_lfht_is_node_deleted(): Not an update (let alone an atomic update),
  no barriers.
  
  cds_lfht_resize(): Atomic update, but no return value, so no barrier
  implied.
  
  cds_lfht_for_each(): Not an update (let alone an atomic update),
  no barriers.
  
  cds_lfht_for_each_duplicate(): Ditto.
  
  cds_lfht_for_each_entry(): Ditto.
  
  cds_lfht_for_each_entry_duplicate(): Ditto.
  
  Thanx, Paul
  
   I'm personally getting confident that the hash table API is clean
   enough, and the implementation well tested, but I'd like to have your
   thoughts on the readiness of the hash table for production use.
   
   Thanks,
   
   Mathieu
   
   -- 
   Mathieu Desnoyers
    Operating System Efficiency R&D Consultant
   EfficiOS Inc.
   http://www.efficios.com
   
   ___
   rp mailing list
   r...@svcs.cs.pdx.edu
   http://svcs.cs.pdx.edu/mailman/listinfo/rp
   
  
  
  ___
  lttng-dev mailing list
  lttng-dev@lists.lttng.org
  http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-01 Thread Paul E. McKenney
On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
 * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
Hi!

After 1 year of development, with the last 6-7 months spent polishing
the API and testing the implementation, I think it is getting about time
to release the RCU lock-free hash table in a new Userspace RCU version
(0.7).

I recently described the guarantees provided by the hash table in more
detail, and created tests for the uniqueness guarantee for traversals
performed concurrently with add_unique and add_replace operations. I
also added test modes that create long hash chains, to test corner-cases
of the hash table.

One small thing I wonder is whether we should document that the hash
table update operations imply full memory barriers ?
   
   The Linux kernel's rule seems good here -- if a hash-table operation is
   atomic and returns a value, it should imply a full barrier.  So:
   
   cds_lfht_new(): No point in claiming barriers -- publishing the
 pointer to the hash table is where the barriers are important.
   
   cds_lfht_destroy(): Ditto.
   
   cds_lfht_lookup(): Not an update (let alone an atomic update), no 
   barriers.
   
   cds_lfht_next_duplicate(): Ditto.
   
   cds_lfht_first(): Ditto.
   
   cds_lfht_next(): Ditto.
   
   cds_lfht_add(): Atomic update, but no return value, so no barrier
 implied.
  
  Yep, makes sense. We use cmpxchg internally to perform the update, but
  it could make sense to eventually use a cmpxchg that has no memory
  barriers to perform this update. So I agree on not providing a memory
  barrier guarantee on the add operation, since it does not return any
  value.
  
   
   cds_lfht_add_unique(): Atomic update that returns a value, so should
 imply a full memory barrier.
  
  add_unique is a bit special:
  
  - if it returns the node received as parameter, it means the add
succeeded, which imply an update, and thus a memory barrier.
 
 Hrm, thinking further: if we make the add operation not act as a
 full memory barrier, then the add_unique success should not act as a
 full mb neither.

Think of it as being similar to the Linux kernel's atomic_inc() and
atomic_add_return() primitives.  The latter guarantees memory barriers
and the former does not.

  - if it returns a different node than the one received as parameter, it
failed, and thus means that it only performed a lookup, so there is no
guarantee that a memory barrier has been executed.
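  
  To illustrate the two cases from the caller's side, here is a minimal
  sketch (assuming the current cds_lfht_add_unique() prototype; struct
  mynode, match_fct and the hash value are hypothetical application-side
  names):
  
      #include <urcu.h>
      #include <urcu/rculfhash.h>
  
      struct mynode {
              int key;                        /* hypothetical payload */
              struct cds_lfht_node node;      /* embedded hash table node */
      };
  
      static struct mynode *insert_unique(struct cds_lfht *ht,
                      unsigned long hash, cds_lfht_match_fct match_fct,
                      const void *key, struct mynode *new)
      {
              struct cds_lfht_node *ret;
  
              cds_lfht_node_init(&new->node);
              rcu_read_lock();
              ret = cds_lfht_add_unique(ht, hash, match_fct, key,
                              &new->node);
              rcu_read_unlock();
              if (ret == &new->node)
                      return new;     /* added: behaves as an update */
              /* Lost to an existing node: only a lookup was performed,
               * so no memory barrier would be implied here. */
              return caa_container_of(ret, struct mynode, node);
      }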
  
  
   
   cds_lfht_add_replace(): Ditto.
   
   cds_lfht_replace(): Ditto.
   
   cds_lfht_del(): Ditto.
 
 One more point: del is similar to add_unique: if it succeeds, we
 execute a full memory barrier. If it fails, no memory barrier is
 guaranteed. But if we want to relax the guarantees, we might not
 want to guarantee that a memory barrier is present in any of the del
 cases.
 
 In the end, the only primitives for which I think it really makes sense
 to provide memory barrier semantics are add_replace and replace: they
 actually act as a higher-level cmpxchg over the hash table nodes.
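 
 For instance, a sketch of the add_replace case (current prototype
 assumed; match_fct, hash and defer_free() are hypothetical, the latter
 standing in for call_rcu-based reclaim):
 
     struct cds_lfht_node *old;
 
     rcu_read_lock();
     old = cds_lfht_add_replace(ht, hash, match_fct, key, &new->node);
     rcu_read_unlock();
     if (old) {
             /* A node was atomically displaced: this is the cmpxchg-like
              * update path, where full barrier semantics would apply. */
             defer_free(old);
     }
     /* old == NULL: nothing matched, the new node was simply added. */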

I believe that this should do the same -- memory barrier before, but no
memory barrier after on failure.

Another approach is C++11, in which there are a couple of arguments to
the compare-and-swap primitive specifying the memory ordering constraints
for the success and failure cases, respectively.
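
For reference, the C11 counterpart looks roughly like this (a sketch, not
urcu code; the wrapper name is made up):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Full ordering when the CAS succeeds, acquire-only when it fails:
     * the failure path is then "just a read", as discussed above. */
    static bool cas_full_on_success(_Atomic(void *) *slot,
                                    void **expected, void *desired)
    {
            return atomic_compare_exchange_strong_explicit(slot,
                            expected, desired,
                            memory_order_seq_cst,    /* success ordering */
                            memory_order_acquire);   /* failure ordering */
    }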

Unless you expect use cases with lots of failing cds_lfht_del() and
add_unique() calls, the performance difference should not be significant.

Thanx, Paul

 Thoughts ?
 
 Thanks,
 
 Mathieu
 
  
  Yep, makes sense.
  
  I'll add this documentation in the API.
  
  Thanks!
  
  Mathieu
  
   
   cds_lfht_is_node_deleted(): Not an update (let alone an atomic update),
 no barriers.
   
   cds_lfht_resize(): Atomic update, but no return value, so no barrier
 implied.
   
   cds_lfht_for_each(): Not an update (let alone an atomic update),
 no barriers.
   
   cds_lfht_for_each_duplicate(): Ditto.
   
   cds_lfht_for_each_entry(): Ditto.
   
   cds_lfht_for_each_entry_duplicate(): Ditto.
   
 Thanx, Paul
   
I'm personally getting confident that the hash table API is clean
enough, and the implementation well tested, but I'd like to have your
thoughts on the readiness of the hash table for production use.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-01 Thread Paul E. McKenney
On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
   * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
  Hi!
  
  After 1 year of development, with the last 6-7 months spent 
  polishing
  the API and testing the implementation, I think it is getting about 
  time
  to release the RCU lock-free hash table in a new Userspace RCU 
  version
  (0.7).
  
  I recently described the guarantees provided by the hash table in 
  more
  detail, and created tests for the uniqueness guarantee for 
  traversals
  performed concurrently with add_unique and add_replace operations. I
  also added test modes that create long hash chains, to test 
  corner-cases
  of the hash table.
  
  One small thing I wonder is whether we should document that the hash
  table update operations imply full memory barriers ?
 
 The Linux kernel's rule seems good here -- if a hash-table operation 
 is
 atomic and returns a value, it should imply a full barrier.  So:
 
 cds_lfht_new(): No point in claiming barriers -- publishing the
   pointer to the hash table is where the barriers are important.
 
 cds_lfht_destroy(): Ditto.
 
 cds_lfht_lookup(): Not an update (let alone an atomic update), no 
 barriers.
 
 cds_lfht_next_duplicate(): Ditto.
 
 cds_lfht_first(): Ditto.
 
 cds_lfht_next(): Ditto.
 
 cds_lfht_add(): Atomic update, but no return value, so no barrier
   implied.

Yep, makes sense. We use cmpxchg internally to perform the update, but
it could make sense to eventually use a cmpxchg that has no memory
barriers to perform this update. So I agree on not providing a memory
barrier guarantee on the add operation, since it does not return any
value.

 
 cds_lfht_add_unique(): Atomic update that returns a value, so should
   imply a full memory barrier.

add_unique is a bit special:

- if it returns the node received as parameter, it means the add
  succeeded, which implies an update, and thus a memory barrier.
   
   Hrm, thinking further: if we make the add operation not act as a
   full memory barrier, then the add_unique success should not act as a
   full mb either.
  
  Think of it as being similar to the Linux kernel's atomic_inc() and
  atomic_add_return() primitives.  The latter guarantees memory barriers
  and the former does not.
 
 I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally
 different.
 
 atomic_inc:
  - input: increment
  - output: none
  - effect: increment target address value
 
 atomic_add_return:
  - input: increment
  - output: incremented value
  - effect: increment target address value, atomically reading the
resulting value.
 
 hash table add:
  - input: new node to add
  - output: none
  - effect: atomically add the new node into the table
 
 hash table add_unique (success):
  - input: new node to add
  - output: (we just return whether the operation has succeeded)
  - effect: atomically add the new node into the table
 
 hash table add_unique (failure):
  - input: new node to try adding
  - output: (we just return whether the operation has succeeded)
  - effect: simple lookup (read)
 
 So as we see, the add_unique failure only performs a read. Adding a
 memory barrier before this read would require us to add a memory barrier
 also on the success path, which would degrade performance. The success
 path does: lookup failure, cmpxchg to add the node, retry if changed.
 Adding a memory barrier before the lookup would add an extra memory
 barrier in addition to the one located in the cmpxchg, and I don't think
 we want that overhead.
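 
 In other words, the success path is conceptually the following loop (a
 much-simplified sketch of the idea; bucket_lookup() is a hypothetical
 helper standing in for the real split-ordered traversal, and flag bits
 are ignored):
 
     for (;;) {
             struct cds_lfht_node *prev, *next;
 
             /* Read-only traversal of the bucket chain. */
             if (bucket_lookup(ht, hash, match, key, &prev, &next))
                     return next;    /* duplicate found: lookup only */
             node->next = next;
             if (uatomic_cmpxchg(&prev->next, next, node) == next)
                     return node;    /* added: cmpxchg provides the barrier */
             /* The chain changed under us: retry. */
     }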

Perhaps a better analogy is cmpxchg() in the Linux kernel.  Some
architectures place the memory barrier before unconditionally, but place
a memory barrier after only if the cmpxchg() succeeds.  Of course,
architectures for which the cmpxchg instruction implies a barrier
don't need any explicit memory-barrier instructions.
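
Schematically, such an architecture's cmpxchg() ends up looking like this
(a sketch only; arch_cmpxchg_relaxed() stands in for the raw atomic
instruction sequence):

    static unsigned long cmpxchg_sketch(unsigned long *ptr,
                                        unsigned long old, unsigned long new)
    {
            unsigned long prev;

            smp_mb();                  /* unconditional leading barrier */
            prev = arch_cmpxchg_relaxed(ptr, old, new);
            if (prev == old)
                    smp_mb();          /* trailing barrier only on success */
            return prev;
    }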

I do see your point about the memory barrier in the failure case requiring
an additional memory barrier in the success case, however.

- if it returns a different node than the one received as parameter, it
  failed, and thus means that it only performed a lookup, so there is no
  guarantee that a memory barrier has been executed.


 
 cds_lfht_add_replace(): Ditto.
 
 cds_lfht_replace(): Ditto.
 
 cds_lfht_del(): Ditto.
   
   One more point: del is similar to add_unique: if it succeeds, we
   execute a full memory barrier. If it fails, no memory barrier is
   guaranteed. But if 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-01 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
   Hi!
   
   After 1 year of development, with the last 6-7 months spent 
   polishing
   the API and testing the implementation, I think it is getting 
   about time
   to release the RCU lock-free hash table in a new Userspace RCU 
   version
   (0.7).
   
   I recently described the guarantees provided by the hash table in 
   more
   detail, and created tests for the uniqueness guarantee for 
   traversals
   performed concurrently with add_unique and add_replace 
   operations. I
   also added test modes that create long hash chains, to test 
   corner-cases
   of the hash table.
   
   One small thing I wonder is whether we should document that the 
   hash
   table update operations imply full memory barriers ?
  
  The Linux kernel's rule seems good here -- if a hash-table 
  operation is
  atomic and returns a value, it should imply a full barrier.  So:
  
  cds_lfht_new(): No point in claiming barriers -- publishing the
  pointer to the hash table is where the barriers are important.
  
  cds_lfht_destroy(): Ditto.
  
  cds_lfht_lookup(): Not an update (let alone an atomic update), no 
  barriers.
  
  cds_lfht_next_duplicate(): Ditto.
  
  cds_lfht_first(): Ditto.
  
  cds_lfht_next(): Ditto.
  
  cds_lfht_add(): Atomic update, but no return value, so no barrier
  implied.
 
 Yep, makes sense. We use cmpxchg internally to perform the update, but
 it could make sense to eventually use a cmpxchg that has no memory
 barriers to perform this update. So I agree on not providing a memory
 barrier guarantee on the add operation, since it does not return any
 value.
 
  
  cds_lfht_add_unique(): Atomic update that returns a value, so should
  imply a full memory barrier.
 
 add_unique is a bit special:
 
 - if it returns the node received as parameter, it means the add
   succeeded, which implies an update, and thus a memory barrier.

Hrm, thinking further: if we make the add operation not act as a
full memory barrier, then the add_unique success should not act as a
full mb either.
   
   Think of it as being similar to the Linux kernel's atomic_inc() and
   atomic_add_return() primitives.  The latter guarantees memory barriers
   and the former does not.
  
  I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally
  different.
  
  atomic_inc:
   - input: increment
   - output: none
   - effect: increment target address value
  
  atomic_add_return:
   - input: increment
   - output: incremented value
   - effect: increment target address value, atomically reading the
 resulting value.
  
  hash table add:
   - input: new node to add
   - output: none
   - effect: atomically add the new node into the table
  
  hash table add_unique (success):
   - input: new node to add
   - output: (we just return whether the operation has succeeded)
   - effect: atomically add the new node into the table
  
  hash table add_unique (failure):
   - input: new node to try adding
   - output: (we just return whether the operation has succeeded)
   - effect: simple lookup (read)
  
  So as we see, the add_unique failure only performs a read. Adding a
  memory barrier before this read would require us to add a memory barrier
  also on the success path, which would degrade performance. The success
  path does: lookup failure, cmpxchg to add the node, retry if changed.
  Adding a memory barrier before the lookup would add an extra memory
  barrier in addition to the one located in the cmpxchg, and I don't think
  we want that overhead.
 
 Perhaps a better analogy is cmpxchg() in the Linux kernel.  Some
 architectures place the memory barrier before unconditionally, but place
 a memory barrier after only if the cmpxchg() succeeds.  Of course,
 architectures for which the cmpxchg instruction implies a barrier
 don't need any explicit memory-barrier instructions.
 
 I do see your point about the memory barrier in the failure case requiring
 an additional memory barrier in the success case, however.
 
 - if it returns a different node than the one received as parameter, 
 it
   failed, and thus means that it only performed a lookup, so there is 
 no
   guarantee that a memory barrier has been executed.
 
 
  
  cds_lfht_add_replace(): Ditto.
  
  cds_lfht_replace(): 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-01 Thread Paul E. McKenney
On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
 * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:
Hi!

After 1 year of development, with the last 6-7 months spent 
polishing
the API and testing the implementation, I think it is getting 
about time
to release the RCU lock-free hash table in a new Userspace RCU 
version
(0.7).

I recently described the guarantees provided by the hash table 
in more
detail, and created tests for the uniqueness guarantee for 
traversals
performed concurrently with add_unique and add_replace 
operations. I
also added test modes that create long hash chains, to test 
corner-cases
of the hash table.

One small thing I wonder is whether we should document that the 
hash
table update operations imply full memory barriers ?
   
   The Linux kernel's rule seems good here -- if a hash-table 
   operation is
   atomic and returns a value, it should imply a full barrier.  So:
   
   cds_lfht_new(): No point in claiming barriers -- publishing the
 pointer to the hash table is where the barriers are important.
   
   cds_lfht_destroy(): Ditto.
   
   cds_lfht_lookup(): Not an update (let alone an atomic update), no 
   barriers.
   
   cds_lfht_next_duplicate(): Ditto.
   
   cds_lfht_first(): Ditto.
   
   cds_lfht_next(): Ditto.
   
   cds_lfht_add(): Atomic update, but no return value, so no barrier
 implied.
  
  Yep, makes sense. We use cmpxchg internally to perform the update, 
  but
  it could make sense to eventually use a cmpxchg that has no memory
  barriers to perform this update. So I agree on not providing a 
  memory
  barrier guarantee on the add operation, since it does not return 
  any
  value.
  
   
   cds_lfht_add_unique(): Atomic update that returns a value, so 
   should
 imply a full memory barrier.
  
  add_unique is a bit special:
  
  - if it returns the node received as parameter, it means the add
succeeded, which implies an update, and thus a memory barrier.
 
 Hrm, thinking further: if we make the add operation not act as a
 full memory barrier, then the add_unique success should not act as a
 full mb either.

Think of it as being similar to the Linux kernel's atomic_inc() and
atomic_add_return() primitives.  The latter guarantees memory barriers
and the former does not.
   
   I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally
   different.
   
   atomic_inc:
- input: increment
- output: none
- effect: increment target address value
   
   atomic_add_return:
- input: increment
- output: incremented value
- effect: increment target address value, atomically reading the
  resulting value.
   
   hash table add:
- input: new node to add
- output: none
- effect: atomically add the new node into the table
   
   hash table add_unique (success):
- input: new node to add
- output: (we just return whether the operation has succeeded)
- effect: atomically add the new node into the table
   
   hash table add_unique (failure):
- input: new node to try adding
- output: (we just return whether the operation has succeeded)
- effect: simple lookup (read)
   
   So as we see, the add_unique failure only performs a read. Adding a
   memory barrier before this read would require us to add a memory barrier
   also on the success path, which would degrade performance. The success
   path does: lookup failure, cmpxchg to add the node, retry if changed.
   Adding a memory barrier before the lookup would add an extra memory
   barrier in addition to the one located in the cmpxchg, and I don't think
   we want that overhead.
  
  Perhaps a better analogy is cmpxchg() in the Linux kernel.  Some
  architectures place the memory barrier before unconditionally, but place
  a memory barrier after only if the cmpxchg() succeeds.  Of course,
  architectures for which the cmpxchg instruction implies a barrier
  don't need any explicit memory-barrier instructions.
  
  I do see your point about the memory barrier in the failure case requiring
  an additional memory barrier in the success case, however.
  
  - if it returns a different node than the one received as 
  parameter, it
failed, 

Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-01 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 01, 2012 at 05:02:06PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote:
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote:
  * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers 
wrote:
 Hi!
 
 After 1 year of development, with the last 6-7 months spent 
 polishing
 the API and testing the implementation, I think it is getting 
 about time
 to release the RCU lock-free hash table in a new Userspace 
 RCU version
 (0.7).
 
 I recently described the guarantees provided by the hash 
 table in more
 detail, and created tests for the uniqueness guarantee for 
 traversals
 performed concurrently with add_unique and add_replace 
 operations. I
 also added test modes that create long hash chains, to test 
 corner-cases
 of the hash table.
 
 One small thing I wonder is whether we should document that 
 the hash
 table update operations imply full memory barriers ?

The Linux kernel's rule seems good here -- if a hash-table 
operation is
atomic and returns a value, it should imply a full barrier.  So:

cds_lfht_new(): No point in claiming barriers -- publishing the
pointer to the hash table is where the barriers are 
important.

cds_lfht_destroy(): Ditto.

cds_lfht_lookup(): Not an update (let alone an atomic update), 
no barriers.

cds_lfht_next_duplicate(): Ditto.

cds_lfht_first(): Ditto.

cds_lfht_next(): Ditto.

cds_lfht_add(): Atomic update, but no return value, so no 
barrier
implied.
   
   Yep, makes sense. We use cmpxchg internally to perform the 
   update, but
   it could make sense to eventually use a cmpxchg that has no memory
   barriers to perform this update. So I agree on not providing a 
   memory
   barrier guarantee on the add operation, since it does not 
   return any
   value.
   

cds_lfht_add_unique(): Atomic update that returns a value, so 
should
imply a full memory barrier.
   
   add_unique is a bit special:
   
   - if it returns the node received as parameter, it means the add
 succeeded, which implies an update, and thus a memory barrier.
  
  Hrm, thinking further: if we make the add operation not act as a
  full memory barrier, then the add_unique success should not act as a
  full mb either.
 
 Think of it as being similar to the Linux kernel's atomic_inc() and
 atomic_add_return() primitives.  The latter guarantees memory barriers
 and the former does not.

I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally
different.

atomic_inc:
 - input: increment
 - output: none
 - effect: increment target address value

atomic_add_return:
 - input: increment
 - output: incremented value
 - effect: increment target address value, atomically reading the
   resulting value.

hash table add:
 - input: new node to add
 - output: none
 - effect: atomically add the new node into the table

hash table add_unique (success):
 - input: new node to add
 - output: (we just return whether the operation has succeeded)
 - effect: atomically add the new node into the table

hash table add_unique (failure):
 - input: new node to try adding
 - output: (we just return whether the operation has succeeded)
 - effect: simple lookup (read)

So as we see, the add_unique failure only performs a read. Adding a
memory barrier before this read would require us to add a memory barrier
also on the success path, which would degrade performance. The success
path does: lookup failure, cmpxchg to add the node, retry if changed.
Adding a memory barrier before the lookup would add an extra memory
barrier in addition to the one located in the cmpxchg, and I don't think
we want that overhead.
   
   Perhaps a better analogy is cmpxchg() in the Linux kernel.  Some
   architectures place the memory barrier before unconditionally, but place
   a memory barrier after only if the cmpxchg() succeeds.  Of course,
   architectures for which the cmpxchg instruction implies a barrier
   don't need any explicit memory-barrier instructions.
   
   I do see your