Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Tue, May 08, 2012 at 02:48:27PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:

[...] Just to make sure I understand -- the reason that the del functions say "no memory barrier" instead of "acts like rcu_dereference()" is that the del functions don't return anything. [...]

@@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct cds_lfht_node *node);
  * function.
  * Call with rcu_read_lock held.
  * Threads calling this API need to be registered RCU read-side threads.
+ * This function does not issue any memory barrier.
  */

One more question about the del memory ordering semantics. Following commit

commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Date:   Mon Dec 19 16:45:51 2011 -0500

    rculfhash: Relax atomicity guarantees required by removal operation

    The atomicity guarantees for the removal operation do not need to be as
    strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
    introduced REMOVAL_OWNER_FLAG to perform removal.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

the del operation is performed in two steps:

1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone) unconditionally into the node's next pointer.
2 - uatomic_xchg(), which atomically exchanges the old pointer with its current value (read) or'd with the REMOVAL_OWNER flag.

The trick is that if the xchg returns a pointer with the REMOVAL_OWNER flag set, it means we are not the first thread to set this flag, so we should not free the node. However, if xchg returns a node without the REMOVAL_OWNER flag set, we are indeed the first to set it, so we should call free.

Now, regarding memory ordering semantics, should we consider the atomic action of del to apply when the or is called, or when the xchg is called? Or should we simply document that the del effect on the node happens in two separate steps?

The way I see it, the actual effect of removal, as seen from the RCU read traversal and lookup point of view, is observed as soon as the REMOVED tombstone is set, so I would think that the atomic publication of the removal is performed by the or. However, we ensure full memory barriers around xchg, but not usually around or. Therefore, the current implementation does not issue a memory barrier before the or, so we should either change our planned memory barrier documentation, or the implementation, to match. This would probably require creation of a cmm_smp_mb__before_uatomic_or(), so x86 does not end up issuing a useless memory barrier.

My kneejerk reaction is that the or is really doing the deletion. Readers and other updaters care about the deletion, not about which CPU is going to do the free. Or did I misunderstand how this works?

You got it right; this is how I see it too. However, in order to provide a full memory barrier before the or, we'd need to add a cmm_smp_mb() before the or (I don't think we want to presume that our or operation issues full memory barriers on all architectures). However, on x86, the "lock; or" does issue a full memory barrier. So I think we should introduce a macro that can translate into a memory barrier on architectures that need it, and to nothing on x86. Thoughts?

Makes sense to me!

All right, committed the following.
I ended up using cmm_smp_mb__before_uatomic_*.

commit 7b783f818175cd92d98f78e761331f306ff406a5
Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Date:   Tue May 8 17:12:20 2012 -0400

    rculfhash: document implied memory barriers

    We choose to provide full memory barriers before and after successful
    hash table update operations. Eventually, a new API with weaker
    semantics can be added, but let's make the basic API as fool-proof as
    possible.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
    Reviewed-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

commit 196f4fab9bf26c48bc318ac2ff985469c4f62c7e
Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Date:   Tue May 8 17:09:46 2012 -0400

    rculfhash: Ensure future-proof memory barrier semantic consistency

    Use cmm_smp_mb__before_uatomic_or() prior to the uatomic_or() in
    _cds_lfht_del() to ensure correct memory barrier semantics when we
    relax (in the future) the barrier implementation of some architectures.
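To make the two-step removal concrete, here is a minimal sketch of the scheme described above, written against the urcu uatomic API. The node layout, flag values, and helper are simplified stand-ins for the real rculfhash.c code, not the actual implementation; a real caller would still have to defer the free with call_rcu() or until after synchronize_rcu().

#include <errno.h>
#include <urcu/uatomic.h>

#define REMOVED_FLAG            (1UL << 0)  /* tombstone, skipped by readers */
#define REMOVAL_OWNER_FLAG      (1UL << 1)  /* decides which thread may free */

struct node {
        unsigned long next;     /* pointer value; low bits carry the flags */
};

static int node_del(struct node *node)
{
        unsigned long old_next;

        /*
         * uatomic_or() is not guaranteed to imply a full barrier on all
         * architectures; this macro provides one where needed and compiles
         * to nothing on x86, where "lock; or" already fences.
         */
        cmm_smp_mb__before_uatomic_or();

        /* Step 1: publish the tombstone; readers now observe the removal. */
        uatomic_or(&node->next, REMOVED_FLAG);

        /*
         * Step 2: race for ownership; uatomic_xchg() implies full barriers.
         * The plain read of node->next for the new value is tolerable in
         * this sketch because the REMOVED flag can no longer go away.
         */
        old_next = uatomic_xchg(&node->next, node->next | REMOVAL_OWNER_FLAG);

        /* Only the first thread to set REMOVAL_OWNER_FLAG may free. */
        return (old_next & REMOVAL_OWNER_FLAG) ? -ENOENT : 0;
}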
Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Sat, May 05, 2012 at 02:22:12PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: [...]

Just to make sure I understand -- the reason that the del functions say "no memory barrier" instead of "acts like rcu_dereference()" is that the del functions don't return anything.

Hrm, you bring an interesting point. I think we should change the two rcu_dereference() in _cds_lfht_del for CMM_LOAD_SHARED(). The difference between the two is that CMM_LOAD_SHARED() does not imply a read barrier between the read and following uses of the data pointed to by the pointer read. Same thing for cds_lfht_is_node_deleted: the rcu_dereference() should be changed for a CMM_LOAD_SHARED(), because we never use the loaded pointer as a pointer to other data. There are a few other locations where the pointer is only used for its flags. Here is what I propose:

diff --git a/rculfhash.c b/rculfhash.c
index b9f795f..6e27436 100644
--- a/rculfhash.c
+++ b/rculfhash.c
@@ -923,7 +923,7 @@ int _cds_lfht_replace(struct cds_lfht *ht, unsigned long size,
 	bucket = lookup_bucket(ht, size, bit_reverse_ulong(old_node->reverse_hash));
 	_cds_lfht_gc_bucket(bucket, new_node);
-	assert(is_removed(rcu_dereference(old_node->next)));
+	assert(is_removed(CMM_LOAD_SHARED(old_node->next)));

This is good.

 	return 0;
 }
@@ -1061,7 +1061,7 @@ int _cds_lfht_del(struct cds_lfht *ht, unsigned long size,
 	 * logical removal flag). Return -ENOENT if the node had
 	 * previously been removed.
 	 */
-	next = rcu_dereference(node->next);
+	next = CMM_LOAD_SHARED(node->next);
 	if (caa_unlikely(is_removed(next)))
 		return -ENOENT;
 	assert(!is_bucket(next));

As long as next is not dereferenced anywhere, this is good. Which appears to be the case.

@@ -1082,7 +1082,7 @@ int _cds_lfht_del(struct cds_lfht *ht, unsigned long size,
 	bucket = lookup_bucket(ht, size, bit_reverse_ulong(node->reverse_hash));
 	_cds_lfht_gc_bucket(bucket, node);
-	assert(is_removed(rcu_dereference(node->next)));
+	assert(is_removed(CMM_LOAD_SHARED(node->next)));

This one is fine as well.

 	/*
 	 * Last phase: atomically exchange node->next with a version
 	 * having REMOVAL_OWNER_FLAG set. If the returned node->next
@@ -1510,7 +1510,7 @@ void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
 		}
 		node = clear_flag(next);
 	}
-	assert(!node || !is_bucket(rcu_dereference(node->next)));
+	assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));

As is this one.

 	iter->node = node;
 	iter->next = next;
 }
@@ -1543,7 +1543,7 @@ void cds_lfht_next_duplicate(struct cds_lfht *ht, cds_lfht_match_fct match,
 		}
 		node = clear_flag(next);
 	}
-	assert(!node || !is_bucket(rcu_dereference(node->next)));
+	assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));

Same here.

 	iter->node = node;
 	iter->next = next;
 }
@@ -1565,7 +1565,7 @@ void cds_lfht_next(struct cds_lfht *ht, struct cds_lfht_iter *iter)
 		}
 		node = clear_flag(next);
 	}
-	assert(!node || !is_bucket(rcu_dereference(node->next)));
+	assert(!node || !is_bucket(CMM_LOAD_SHARED(node->next)));

And here.

 	iter->node = node;
 	iter->next = next;
 }
@@ -1668,7 +1668,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct cds_lfht_node *node)
 int cds_lfht_is_node_deleted(struct cds_lfht_node *node)
 {
-	return is_removed(rcu_dereference(node->next));
+	return is_removed(CMM_LOAD_SHARED(node->next));

And also here.

 }
 static

If it's OK with you, I will first commit this change, and then commit the memory barrier documentation with your Reviewed-by.

Looks good!
OK, this change is committed as:

commit a85eff522c253434a9c2b53d6c3a702842fb1d5d
Author: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Date:   Mon May 7 11:18:14 2012 -0400

    rculfhash: replace unneeded rcu_dereference by CMM_LOAD_SHARED

    The difference between the two is that CMM_LOAD_SHARED() does not imply
    a read barrier between the read and following uses of the data pointed
    to by the pointer read. All sites that only use the pointer load for
    its bits (never dereference) don't need the read barrier implied by
    rcu_dereference.

    Reviewed-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>

Thanks, Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
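The distinction this commit relies on can be shown with a small sketch. The item type and helpers below are hypothetical; only CMM_LOAD_SHARED() and rcu_dereference() are real urcu primitives. When just the flag bits of the loaded pointer are inspected, a plain shared load is enough; as soon as the pointed-to memory is accessed, the load must carry rcu_dereference()'s ordering guarantee.

#include <urcu/system.h>        /* CMM_LOAD_SHARED() */
#include <urcu/pointer.h>       /* rcu_dereference() */

#define REMOVED_FLAG    1UL

struct item {
        struct item *next;      /* low bit used as a removal flag */
        int value;
};

static int is_removed_ptr(struct item *p)
{
        return (unsigned long)p & REMOVED_FLAG;
}

/* Only the flag bits of the loaded pointer are tested, never the
 * pointed-to memory, so no read barrier is needed after the load. */
static int item_is_deleted(struct item *node)
{
        return is_removed_ptr(CMM_LOAD_SHARED(node->next));
}

/* Here the loaded pointer is dereferenced, so the load must be an
 * rcu_dereference() (and run within an RCU read-side critical section). */
static int item_next_value(struct item *node)
{
        struct item *next = rcu_dereference(node->next);

        return next ? next->value : -1;
}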
Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: [...]

A write barrier would be sufficient in the case where there were only two threads observing each other. A full memory barrier would be needed to prevent the assertion from firing in this sort of case (not sure that this is exactly right, but something like this):

Initial contents: B, C
T0: add A; del B
T1: if (!lookup B) { add B; del C }
T2: r1 = lookup C; smp_mb(); r2 = lookup A
assert(lookup C || lookup A);

What you are bringing here as a counter-example is, I think, transitivity.

Yep!

I'm trying to figure out how your example could fail, and I cannot see how. The closest scenario to failing that I can get follows in detail, but it does not fail. After that, I'm proposing a different scenario, which I think will be more appropriate for the current topic.

Attempted detail of your scenario:

T0: add A
T0: wmb (add A globally observable)

Almost. All that this really guarantees is that if someone sees the add B, they will also see the add A.

T0: del B (del B globally observable)
T1: (add A NOT brought into cache) (del B brought into cache)
T1: read B (test false)
T1: add B
T1: wmb (add B globally observable)
T1: del C (del C globally observable)

Here, if someone sees the del C, they will see the add B, and they also will have lost the opportunity to modify B before T1 reads from it and modifies it.

T2: (add A NOT brought into cache) (del C brought into cache)
T2: read C - not there.
T2: mb
T2: (add A brought into cache)
T2: read A - there - success.

So we see that C is not there. We know that B would be there if we looked at it. But we don't look at B, we look at A. But the ordering back to T0's add A requires transitivity, which wmb does not guarantee.

OK, got it! If I look at the transitivity section in Linux memory-barriers.txt, I notice that the example is mainly around using read barriers around loads rather than general barriers. Let's see if I can modify that example to come up with an example error case:

Initial content: empty
T0: add X
T1: r1 = lookup X; smp_mb(); r2 = lookup Y
T2: add Y; r3 = lookup X
assert( !(r1 && !r2 && !r3) )

The key thing here is that if the barrier in T2 after "add Y" is a smp_wmb rather than a smp_mb, this could allow the "r3 = lookup X" to be reordered before "add Y", thus allowing the assertion to fail.

Your example is simpler, and demonstrates the need just as well, so let's go with your example. I think it would be more intuitive for users if lookups vs updates performed on the same thread are ordered with full memory barriers. Given that we don't want to add extra barriers in read operations, it would make sense to guarantee full memory barriers before and after updates. So how about we use full memory barriers before and after each of: add, del (success), add_unique (success), replace, and add_replace? If we ever want to relax those ordering guarantees, then we can always add new update APIs with weaker ordering.

Thoughts?

That line of reasoning makes a lot of sense to me!

Sounds good. Here is what I propose, thoughts?

Just to make sure I understand -- the reason that the del functions say "no memory barrier" instead of "acts like rcu_dereference()" is that the del functions don't return anything.

Hrm, you bring an interesting point.
I think we should change the two rcu_dereference() in _cds_lfht_del for CMM_LOAD_SHARED(). The difference between the two is that CMM_LOAD_SHARED() does not imply a read barrier between the read and following uses of the data pointed to by the pointer read. Same thing for cds_lfht_is_node_deleted: the rcu_dereference() should be changed for a CMM_LOAD_SHARED(), because we never use the loaded pointer as a pointer to other data.
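As a rough illustration of the transitivity argument above, here is a litmus-style sketch in C, with the hash-table adds and lookups modeled as plain stores and loads and __sync_synchronize() standing in for smp_mb(). It models the reasoning under those assumptions; a single run of course proves nothing about the memory model.

#include <assert.h>
#include <pthread.h>

static volatile int X, Y;       /* "add X" modeled as X = 1 */
static int r1, r2, r3;

static void *t0(void *arg) { X = 1; return NULL; }

static void *t1(void *arg)
{
        r1 = X;
        __sync_synchronize();   /* smp_mb() */
        r2 = Y;
        return NULL;
}

static void *t2(void *arg)
{
        Y = 1;
        __sync_synchronize();   /* must be a FULL barrier: with only a write
                                 * barrier here, the load of X below could be
                                 * reordered before the store to Y, letting
                                 * the assertion fail. */
        r3 = X;
        return NULL;
}

int main(void)
{
        pthread_t a, b, c;

        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_create(&c, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_join(c, NULL);

        /* If T1 saw X (r1) but not Y (!r2), then T2's store to Y came after
         * T1's barrier, so T2's later load must see X (r3). */
        assert(!(r1 && !r2 && !r3));
        return 0;
}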
Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Thu, May 03, 2012 at 01:13:30PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: [...]

A write barrier would be sufficient in the case where there were only two threads observing each other. A full memory barrier would be needed to prevent the assertion from firing in this sort of case (not sure that this is exactly right, but something like this):

Initial contents: B, C
T0: add A; del B
T1: if (!lookup B) { add B; del C }
T2: r1 = lookup C; smp_mb(); r2 = lookup A
assert(lookup C || lookup A);

What you are bringing here as a counter-example is, I think, transitivity.

Yep!

I'm trying to figure out how your example could fail, and I cannot see how. The closest scenario to failing that I can get follows in detail, but it does not fail. After that, I'm proposing a different scenario, which I think will be more appropriate for the current topic.

Attempted detail of your scenario:

T0: add A
T0: wmb (add A globally observable)

Almost. All that this really guarantees is that if someone sees the add B, they will also see the add A.

T0: del B (del B globally observable)
T1: (add A NOT brought into cache) (del B brought into cache)
T1: read B (test false)
T1: add B
T1: wmb (add B globally observable)
T1: del C (del C globally observable)

Here, if someone sees the del C, they will see the add B, and they also will have lost the opportunity to modify B before T1 reads from it and modifies it.

T2: (add A NOT brought into cache) (del C brought into cache)
T2: read C - not there.
T2: mb
T2: (add A brought into cache)
T2: read A - there - success.

So we see that C is not there. We know that B would be there if we looked at it. But we don't look at B, we look at A. But the ordering back to T0's add A requires transitivity, which wmb does not guarantee.

OK, got it! If I look at the transitivity section in Linux memory-barriers.txt, I notice that the example is mainly around using read barriers around loads rather than general barriers. Let's see if I can modify that example to come up with an example error case:

Initial content: empty
T0: add X
T1: r1 = lookup X; smp_mb(); r2 = lookup Y
T2: add Y; r3 = lookup X
assert( !(r1 && !r2 && !r3) )

The key thing here is that if the barrier in T2 after "add Y" is a smp_wmb rather than a smp_mb, this could allow the "r3 = lookup X" to be reordered before "add Y", thus allowing the assertion to fail.

Your example is simpler, and demonstrates the need just as well, so let's go with your example. I think it would be more intuitive for users if lookups vs updates performed on the same thread are ordered with full memory barriers. Given that we don't want to add extra barriers in read operations, it would make sense to guarantee full memory barriers before and after updates. So how about we use full memory barriers before and after each of: add, del (success), add_unique (success), replace, and add_replace? If we ever want to relax those ordering guarantees, then we can always add new update APIs with weaker ordering.

Thoughts?

That line of reasoning makes a lot of sense to me!

Sounds good. Here is what I propose, thoughts?

diff --git a/urcu/rculfhash.h b/urcu/rculfhash.h
index 2d8a310..2938e5e 100644
--- a/urcu/rculfhash.h
+++ b/urcu/rculfhash.h
@@ -203,6 +203,7 @@ void cds_lfht_count_nodes(struct cds_lfht *ht,
  *
  * Call with rcu_read_lock held.
  * Threads calling this API need to be registered RCU read-side threads.
+ * This function acts as a rcu_dereference() to read the node pointer.
  */
 void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
 		cds_lfht_match_fct match, const void *key,
@@ -226,6 +227,7 @@ void cds_lfht_lookup(struct cds_lfht *ht, unsigned long hash,
  * node returned by a previous cds_lfht_next.
  * Call with rcu_read_lock held.
  * Threads calling this API need to be registered RCU read-side threads.
+ * This function acts as a rcu_dereference() to read the node pointer.
  */
 void cds_lfht_next_duplicate(struct cds_lfht *ht,
 		cds_lfht_match_fct match, const void *key,
@@ -239,6 +241,7 @@ void
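For illustration, here is a sketch of what the documented guarantee buys a caller: because cds_lfht_lookup() acts as an rcu_dereference() to read the node pointer, the returned node may be dereferenced directly within the read-side critical section. The mynode type and match function are hypothetical; only the cds_lfht_* calls are the real API.

#include <urcu.h>               /* registered RCU read-side threads */
#include <urcu/rculfhash.h>

struct mynode {
        struct cds_lfht_node node;      /* embedded hash table node */
        int key;
        int payload;
};

/* Hypothetical match function: compares the embedded key. */
static int match(struct cds_lfht_node *ht_node, const void *key)
{
        struct mynode *n = caa_container_of(ht_node, struct mynode, node);

        return n->key == *(const int *)key;
}

static int lookup_payload(struct cds_lfht *ht, unsigned long hash, int key)
{
        struct cds_lfht_iter iter;
        struct cds_lfht_node *found;
        int ret = -1;

        rcu_read_lock();
        cds_lfht_lookup(ht, hash, match, &key, &iter);
        found = cds_lfht_iter_get_node(&iter);
        /* The rcu_dereference() semantics above make it safe to read the
         * node's fields here, inside the read-side critical section. */
        if (found)
                ret = caa_container_of(found, struct mynode, node)->payload;
        rcu_read_unlock();
        return ret;
}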
Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
On Tue, May 01, 2012 at 11:12:15AM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:

Hi! After 1 year of development, with the last 6-7 months spent polishing the API and testing the implementation, I think it is getting about time to release the RCU lock-free hash table in a new Userspace RCU version (0.7). I recently described the guarantees provided by the hash table in more detail, and created tests for the uniqueness guarantee for traversals performed concurrently with add_unique and add_replace operations. I also added test modes that create long hash chains, to test corner cases of the hash table. One small thing I wonder is whether we should document that the hash table update operations imply full memory barriers?

The Linux kernel's rule seems good here -- if a hash-table operation is atomic and returns a value, it should imply a full barrier. So:

cds_lfht_new(): No point in claiming barriers -- publishing the pointer to the hash table is where the barriers are important.
cds_lfht_destroy(): Ditto.
cds_lfht_lookup(): Not an update (let alone an atomic update), no barriers.
cds_lfht_next_duplicate(): Ditto.
cds_lfht_first(): Ditto.
cds_lfht_next(): Ditto.
cds_lfht_add(): Atomic update, but no return value, so no barrier implied.

Yep, makes sense. We use cmpxchg internally to perform the update, but it could make sense to eventually use a cmpxchg that has no memory barriers to perform this update. So I agree on not providing a memory barrier guarantee on the add operation, since it does not return any value.

cds_lfht_add_unique(): Atomic update that returns a value, so should imply a full memory barrier.

add_unique is a bit special:
- if it returns the node received as parameter, it means the add succeeded, which implies an update, and thus a memory barrier.
- if it returns a different node than the one received as parameter, it failed, and thus means that it only performed a lookup, so there is no guarantee that a memory barrier has been executed.

Good point! But I would suggest handling this like some architectures handle a failing cmpxchg() -- there is a memory barrier beforehand, but not afterwards. This works well because although the caller can easily supply a memory barrier after a failing cmpxchg(), there is no way to supply one beforehand except by supplying a redundant barrier in the case of a successful cmpxchg(). So I suggest that a failing add_unique() guarantee a full memory barrier beforehand, but not afterwards.

Thanx, Paul

cds_lfht_add_replace(): Ditto.
cds_lfht_replace(): Ditto.
cds_lfht_del(): Ditto.

Yep, makes sense. I'll add this documentation in the API. Thanks! Mathieu

cds_lfht_is_node_deleted(): Not an update (let alone an atomic update), no barriers.
cds_lfht_resize(): Atomic update, but no return value, so no barrier implied.
cds_lfht_for_each(): Not an update (let alone an atomic update), no barriers.
cds_lfht_for_each_duplicate(): Ditto.
cds_lfht_for_each_entry(): Ditto.
cds_lfht_for_each_entry_duplicate(): Ditto.

Thanx, Paul

I'm personally getting confident that the hash table API is clean enough, and the implementation well tested, but I'd like to have your thoughts on the readiness of the hash table for production use.

Thanks, Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
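A sketch of the caller-side consequence of the rule Paul proposes (full barrier before a failing add_unique, but none after): a caller that needs ordering after the failure path must supply its own barrier. The wrapper below is hypothetical and assumes the caller is a registered RCU read-side thread holding rcu_read_lock().

#include <urcu.h>
#include <urcu/rculfhash.h>

static struct cds_lfht_node *add_unique_with_mb(struct cds_lfht *ht,
                unsigned long hash, cds_lfht_match_fct match,
                const void *key, struct cds_lfht_node *node)
{
        struct cds_lfht_node *ret;

        /* Success (ret == node): full barriers before and after implied. */
        ret = cds_lfht_add_unique(ht, hash, match, key, node);
        if (ret != node) {
                /* Failure acted as a plain lookup: a barrier before the
                 * operation is implied, but none after, so emit one if
                 * later accesses must not be reordered before the lookup. */
                cmm_smp_mb();
        }
        return ret;
}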
Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote: * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:

Hi! After 1 year of development, with the last 6-7 months spent polishing the API and testing the implementation, I think it is getting about time to release the RCU lock-free hash table in a new Userspace RCU version (0.7). I recently described the guarantees provided by the hash table in more detail, and created tests for the uniqueness guarantee for traversals performed concurrently with add_unique and add_replace operations. I also added test modes that create long hash chains, to test corner cases of the hash table. One small thing I wonder is whether we should document that the hash table update operations imply full memory barriers?

The Linux kernel's rule seems good here -- if a hash-table operation is atomic and returns a value, it should imply a full barrier. So:

cds_lfht_new(): No point in claiming barriers -- publishing the pointer to the hash table is where the barriers are important.
cds_lfht_destroy(): Ditto.
cds_lfht_lookup(): Not an update (let alone an atomic update), no barriers.
cds_lfht_next_duplicate(): Ditto.
cds_lfht_first(): Ditto.
cds_lfht_next(): Ditto.
cds_lfht_add(): Atomic update, but no return value, so no barrier implied.

Yep, makes sense. We use cmpxchg internally to perform the update, but it could make sense to eventually use a cmpxchg that has no memory barriers to perform this update. So I agree on not providing a memory barrier guarantee on the add operation, since it does not return any value.

cds_lfht_add_unique(): Atomic update that returns a value, so should imply a full memory barrier.

add_unique is a bit special:
- if it returns the node received as parameter, it means the add succeeded, which implies an update, and thus a memory barrier.

Hrm, thinking further: if we make the add operation not act as a full memory barrier, then the add_unique success should not act as a full mb either. Think of it as being similar to the Linux kernel's atomic_inc() and atomic_add_return() primitives. The latter guarantees memory barriers and the former does not.

- if it returns a different node than the one received as parameter, it failed, and thus means that it only performed a lookup, so there is no guarantee that a memory barrier has been executed.

cds_lfht_add_replace(): Ditto.
cds_lfht_replace(): Ditto.
cds_lfht_del(): Ditto.

One more point: del is similar to add_unique: if it succeeds, we execute a full memory barrier. If it fails, no memory barrier is guaranteed. But if we want to relax the guarantees, we might not want to guarantee that a memory barrier is present in any of the del cases. In the end, the only primitives for which I think it really makes sense to provide memory barrier semantics are add_replace and replace: they actually act as a higher-level cmpxchg over the hash table nodes.

I believe that this should do the same -- memory barrier before, but no memory barrier after on failure. Another approach is C++11, in which there are a couple of arguments to the compare-and-swap primitive specifying the memory ordering constraints for the success and failure cases, respectively. Unless you expect use cases with lots of failing cds_lfht_del() and add_unique() calls, the performance difference should not be significant.

Thanx, Paul

Thoughts? Thanks, Mathieu

Yep, makes sense.
I'll add this documentation in the API. Thanks! Mathieu

cds_lfht_is_node_deleted(): Not an update (let alone an atomic update), no barriers.
cds_lfht_resize(): Atomic update, but no return value, so no barrier implied.
cds_lfht_for_each(): Not an update (let alone an atomic update), no barriers.
cds_lfht_for_each_duplicate(): Ditto.
cds_lfht_for_each_entry(): Ditto.
cds_lfht_for_each_entry_duplicate(): Ditto.

Thanx, Paul

I'm personally getting confident that the hash table API is clean enough, and the implementation well tested, but I'd like to have your thoughts on the readiness of the hash table for production use.

Thanks, Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
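The C++11-style interface Paul mentions also exists in C11's <stdatomic.h>, where compare-exchange takes separate memory orderings for the success and failure cases. A small sketch, unrelated to the urcu API itself:

#include <stdatomic.h>
#include <stdbool.h>

static bool try_claim(_Atomic int *owner, int free_value, int me)
{
        int expected = free_value;

        /* Full ordering if we win the slot; acquire-only if we lose,
         * mirroring the "barrier before, none after" failure semantics
         * discussed above. */
        return atomic_compare_exchange_strong_explicit(owner, &expected, me,
                        memory_order_acq_rel, memory_order_acquire);
}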
Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table
On Tue, May 01, 2012 at 01:41:36PM -0400, Mathieu Desnoyers wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Tue, May 01, 2012 at 11:21:44AM -0400, Mathieu Desnoyers wrote: * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote: On Tue, May 01, 2012 at 10:16:09AM -0400, Mathieu Desnoyers wrote:

Hi! After 1 year of development, with the last 6-7 months spent polishing the API and testing the implementation, I think it is getting about time to release the RCU lock-free hash table in a new Userspace RCU version (0.7). I recently described the guarantees provided by the hash table in more detail, and created tests for the uniqueness guarantee for traversals performed concurrently with add_unique and add_replace operations. I also added test modes that create long hash chains, to test corner cases of the hash table. One small thing I wonder is whether we should document that the hash table update operations imply full memory barriers?

The Linux kernel's rule seems good here -- if a hash-table operation is atomic and returns a value, it should imply a full barrier. So:

cds_lfht_new(): No point in claiming barriers -- publishing the pointer to the hash table is where the barriers are important.
cds_lfht_destroy(): Ditto.
cds_lfht_lookup(): Not an update (let alone an atomic update), no barriers.
cds_lfht_next_duplicate(): Ditto.
cds_lfht_first(): Ditto.
cds_lfht_next(): Ditto.
cds_lfht_add(): Atomic update, but no return value, so no barrier implied.

Yep, makes sense. We use cmpxchg internally to perform the update, but it could make sense to eventually use a cmpxchg that has no memory barriers to perform this update. So I agree on not providing a memory barrier guarantee on the add operation, since it does not return any value.

cds_lfht_add_unique(): Atomic update that returns a value, so should imply a full memory barrier.

add_unique is a bit special:
- if it returns the node received as parameter, it means the add succeeded, which implies an update, and thus a memory barrier.

Hrm, thinking further: if we make the add operation not act as a full memory barrier, then the add_unique success should not act as a full mb either. Think of it as being similar to the Linux kernel's atomic_inc() and atomic_add_return() primitives. The latter guarantees memory barriers and the former does not.

I think add/add_unique vs atomic_inc/atomic_add_return are fundamentally different.

atomic_inc:
- input: increment
- output: none
- effect: increment target address value

atomic_add_return:
- input: increment
- output: incremented value
- effect: increment target address value, atomically reading the resulting value.

hash table add:
- input: new node to add
- output: none
- effect: atomically add the new node into the table

hash table add_unique (success):
- input: new node to add
- output: (we just return whether the operation has succeeded)
- effect: atomically add the new node into the table

hash table add_unique (failure):
- input: new node to try adding
- output: (we just return whether the operation has succeeded)
- effect: simple lookup (read)

So as we see, the add_unique failure only performs a read. Adding a memory barrier before this read would require us to add a memory barrier also on the success path, which would degrade performance. The success path does: lookup failure, cmpxchg to add the node, retry if changed.
Adding a memory barrier before the lookup would add an extra memory barrier in addition to the one located in the cmpxchg, and I don't think we want that overhead.

Perhaps a better analogy is cmpxchg() in the Linux kernel. Some architectures place a memory barrier before it unconditionally, but place a memory barrier after it only if the cmpxchg() succeeds. Of course, architectures for which the cmpxchg instruction itself implies a barrier don't need any explicit memory-barrier instructions. I do see your point about the memory barrier in the failure case requiring an additional memory barrier in the success case, however.

- if it returns a different node than the one received as parameter, the add failed, which means only a lookup was performed, so there is no guarantee that a memory barrier has been executed.

cds_lfht_add_replace(): Ditto.
cds_lfht_replace(): Ditto.
cds_lfht_del(): Ditto.

One more point: del is similar to add_unique: if it succeeds, we execute a full memory barrier. If it fails, no memory barrier is guaranteed. But if
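[Editorial sketch] del has the same success/failure asymmetry as add_unique, and the usual removal pattern makes that visible to callers. Another hedged sketch: remove_entry() and free_node_cb() are hypothetical, and it assumes struct mynode additionally embeds a struct rcu_head rcu_head member (plus <stdlib.h> for free()). cds_lfht_del() returns 0 on success and nonzero (-ENOENT) if the node was already logically removed.

static void free_node_cb(struct rcu_head *head)
{
	struct mynode *node =
		caa_container_of(head, struct mynode, rcu_head);

	free(node);
}

static void remove_entry(struct cds_lfht *ht, struct mynode *node)
{
	int ret;

	rcu_read_lock();
	ret = cds_lfht_del(ht, &node->lfht_node);
	rcu_read_unlock();

	if (!ret) {
		/*
		 * Success: this thread performed the atomic removal, so it
		 * is the one allowed to reclaim the node after a grace
		 * period.
		 */
		call_rcu(&node->rcu_head, free_node_cb);
	}
	/*
	 * Nonzero: another thread already removed the node; only a read
	 * was performed, so no memory barrier may be assumed.
	 */
}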