Re: [lttng-dev] [RFC] Deprecating RCU signal flavor

2023-08-23 Thread Paul E. McKenney via lttng-dev
On Mon, Aug 21, 2023 at 11:43:32AM -0400, Mathieu Desnoyers wrote:
> On 8/15/23 08:38, Mathieu Desnoyers via lttng-dev wrote:
> > On 8/14/23 17:05, Olivier Dion via lttng-dev wrote:
> > > 
> > > After discussing it with Mathieu, we agree on the following 3 phases for
> > > deprecating the signal flavor:
> > > 
> > >   1) liburcu-signal will be implemented in terms of liburcu-mb. The only
> > >   difference between the two flavors will be the public header files,
> > >   linked symbols and library name.  Note that this adds a performance
> > >   regression, since the implementation of liburcu-mb adds memory
> > >   barriers on the reader side which are not present in the original
> > >   liburcu-signal implementation.
> > > 
> > >   2) Adding the deprecated attribute to every public function exposed by
> > >   the liburcu-signal flavor.  At this point, tests for liburcu-signal
> > >   will also be removed from the project.  There will be no more support
> > >   for this flavor.
> > > 
> > >   3) Removing the liburcu-signal flavor completely from the project.
> > > 
> > > Finally, here is my tentative release version for each phase:
> > > 
> > >   1) 0.15.0 [October 2023] (also TSAN support yay!)
> > > 
> > >   2) 0.15.1
> > > 
> > >   3) 0.16.0 || 1.0.0 (maybe a major bump since this is an API breaking
> > >   change)
> > 
> > There is a distinction between the version number of the liburcu project
> > (0.14) and the ABI soname for the shared objects. We may be able to do
> > step (3) without going to 1.0.0 (I don't see removal of the urcu-signal
> > flavor as a strong enough motivation for hitting 1.0.0 yet).
> > 
> > Technically speaking, given that we would be removing the entire
> > liburcu-signal.so shared object, we would not be changing _symbols_
> > within an existing shared object, therefore I'm not even sure we need to
> > bump the soname for all the other remaining shared objects.
> 
> So after merging this commit:
> 
> Phase 1 of deprecating liburcu-signal
> The first phase of liburcu-signal deprecation consists of implementing
> it in term of liburcu-mb. In other words, liburcu-signal is identical to
> liburcu-mb at the exception of the function symbols and public header
> files.
> This is done by:
>   1) Removing the RCU_SIGNAL specific code in urcu.c
>   2) Making the RCU_MB specific code also specific to RCU_SIGNAL in
>   urcu.c
>   3) Rewriting _urcu_signal_read_unlock_update_and_wakeup to use an
>   atomic store with CMM_SEQ_CST instead of a CMM_RELAXED store with
>   cmm_barrier() around it. We could keep the explicit barriers, but that
>   would require adding some cmm_annotate annotations. Therefore, to be
>   less intrusive in a public header file, we simply use CMM_SEQ_CST,
>   as in the mb flavor.
> 
> I notice that an application previously built against urcu-signal with
> _LGPL_SOURCE defined would have to be rebuilt, which would require a
> soname bump of urcu-signal.
> 
> So considering that this phase 1 is not really a "drop-in" replacement,
> I favor removing the urcu-signal flavor entirely before the next release.
> 
> Thoughts ?
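
For illustration, the store rewrite described in step (3) of the quoted
commit amounts to roughly the following sketch (the uatomic_store() form and
the TLS variable name here are assumptions, not the exact patch):

    /* Before: relaxed store bracketed by explicit barriers. */
    cmm_barrier();
    uatomic_store(&URCU_TLS(rcu_reader).ctr, newval, CMM_RELAXED);
    cmm_barrier();

    /* After: a single sequentially-consistent atomic store, avoiding
     * cmm_annotate annotations in the public header file. */
    uatomic_store(&URCU_TLS(rcu_reader).ctr, newval, CMM_SEQ_CST);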

The replacement is liburcu-mb, correct?

I will need to change perfbook, but that should be an easy change,
plus sys_membarrier() is widely available by now.

Thanx, Paul


Re: [lttng-dev] [PATCH v2 04/12] urcu/system: Use atomic builtins if configured

2023-07-05 Thread Paul E. McKenney via lttng-dev
On Wed, Jul 05, 2023 at 03:03:21PM -0400, Olivier Dion wrote:
> On Wed, 05 Jul 2023, "Paul E. McKenney"  wrote:
> > On Tue, Jul 04, 2023 at 10:43:21AM -0400, Olivier Dion wrote:
> >> On Wed, 21 Jun 2023, "Paul E. McKenney"  wrote:
> >> > On Wed, Jun 07, 2023 at 02:53:51PM -0400, Olivier Dion wrote:
> >> >
> >> > Same question here on loss of volatile semantics.
> >> 
> This applies to all reviews on volatile semantics.  I added a
> cmm_cast_volatile() macro/template for C/C++ that adds the volatile
> qualifier to pointers passed to every atomic builtin call.
> >
> > Sounds very good, thank you!
> 
> Maybe a case of synchronicity here, but I just stumbled upon this
> <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0062r1.html>
> where you seem to express the same concerns :-)

Just for completeness, my response to Hans's concern about volatile is
addressed by an empty memory-clobber asm, similar to barrier() in the
Linux kernel.
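
For concreteness, barrier() in the Linux kernel is essentially the following
compiler-only fence; the empty asm with a "memory" clobber costs nothing at
run time but forbids the compiler from moving, fusing, or eliding memory
accesses across it:

    #define barrier() __asm__ __volatile__("" : : : "memory")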

But yes, this has seen significant discussion over the years.  ;-)

Thanx, Paul


Re: [lttng-dev] [PATCH v2 04/12] urcu/system: Use atomic builtins if configured

2023-07-05 Thread Paul E. McKenney via lttng-dev
On Tue, Jul 04, 2023 at 10:43:21AM -0400, Olivier Dion wrote:
> On Wed, 21 Jun 2023, "Paul E. McKenney"  wrote:
> > On Wed, Jun 07, 2023 at 02:53:51PM -0400, Olivier Dion wrote:
> >
> > Same question here on loss of volatile semantics.
> 
> This applies to all reviews on volatile semantics.  I added a
> cmm_cast_volatile() macro/template for C/C++ that adds the volatile
> qualifier to pointers passed to every atomic builtin call.

Sounds very good, thank you!
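
For C, a macro along these lines could do the job (an illustrative sketch
only, not the actual liburcu implementation; the C++ template variant would
be analogous):

    /* Add a volatile qualifier to the pointee before handing the
     * pointer to an atomic builtin, preserving volatile semantics. */
    #define cmm_cast_volatile(ptr) \
            __extension__ ((volatile __typeof__(*(ptr)) *) (ptr))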

Thanx, Paul


Re: [lttng-dev] [PATCH v2 05/12] urcu/uatomic: Add CMM memory model

2023-06-29 Thread Paul E. McKenney via lttng-dev
On Thu, Jun 29, 2023 at 12:49:00PM -0400, Olivier Dion wrote:
> On Wed, 21 Jun 2023, "Paul E. McKenney"  wrote:
> > On Wed, Jun 07, 2023 at 02:53:52PM -0400, Olivier Dion wrote:
> >> -#ifdef __URCU_DEREFERENCE_USE_ATOMIC_CONSUME
> >> -# define _rcu_dereference(p) __extension__ ({			\
> >> -	__typeof__(__extension__ ({					\
> >> -		__typeof__(p) __attribute__((unused)) _p0 = { 0 };	\
> >> -		_p0;							\
> >> -	})) _p1;							\
> >> -	__atomic_load(&(p), &_p1, __ATOMIC_CONSUME);			\
> >
> > There is talk of getting rid of memory_order_consume.  But for the moment,
> > it is what there is.  Another alternative is to use a volatile load,
> > similar to old-style CMM_LOAD_SHARED() or in-kernel READ_ONCE().
> 
> I think we can stick to __ATOMIC_CONSUME for now.  Hopefully getting rid
> of it means it will be an alias for __ATOMIC_ACQUIRE forever.

That seems eminently reasonable to me!

Thanx, Paul


Re: [lttng-dev] [PATCH 02/11] urcu/uatomic: Use atomic builtins if configured

2023-06-22 Thread Paul E. McKenney via lttng-dev
On Thu, Jun 22, 2023 at 03:53:33PM -0400, Olivier Dion wrote:
> On Thu, 22 Jun 2023, "Paul E. McKenney"  wrote:
> 
> > I suggest C11 volatile atomic load/store.  Load/store fusing is permitted
> > for non-volatile atomic loads and stores, and such fusing can ruin your
> > code's entire day.  ;-)
> 
> Good catch.  This seems not to be a problem on GCC (yet), but Clang is
> extremely aggressive and seems to do store fusing in some corner cases [0].
> 
> However, I have not found any simple reproducer of load/store fusing.  Do
> you have an example of such fusing, or is this a precaution?  In the
> meantime, back to reading the standard to be certain :-)
> 
>  [0] https://godbolt.org/z/odKG9a75a

I certainly have heard a number of compiler writers thinking in terms
of doing load/store fusing, some of whom were trying to get rid of the
volatile variants in order to remove an impediment to their mission of
optimizing all programs out of existence.  ;-)

I therefore suggest taking this possibility quite seriously.
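
To make the failure mode concrete, here is the classic load-fusing hazard
(illustrative only; whether a given compiler performs this transformation
today is exactly the open question above):

    /* Intended: poll a flag set by another thread. */
    while (!__atomic_load_n(&stop, __ATOMIC_RELAXED))
            do_work();

    /* A conforming compiler may fuse the repeated non-volatile relaxed
     * loads into a single load, effectively producing: */
    if (!__atomic_load_n(&stop, __ATOMIC_RELAXED))
            for (;;)
                    do_work();  /* stop is never re-read */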

Thanx, Paul


Re: [lttng-dev] [PATCH 02/11] urcu/uatomic: Use atomic builtins if configured

2023-06-22 Thread Paul E. McKenney via lttng-dev
On Thu, Jun 22, 2023 at 11:55:55AM -0400, Mathieu Desnoyers wrote:
> On 6/21/23 19:19, Paul E. McKenney wrote:
> [...]
> > > diff --git a/include/urcu/uatomic/builtins-generic.h 
> > > b/include/urcu/uatomic/builtins-generic.h
> > > new file mode 100644
> > > index 000..8e6a9b5
> > > --- /dev/null
> > > +++ b/include/urcu/uatomic/builtins-generic.h
> > > @@ -0,0 +1,85 @@
> > > +/*
> > > + * urcu/uatomic/builtins-generic.h
> > > + *
> > > + * Copyright (c) 2023 Olivier Dion 
> > > + *
> > > + * This library is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU Lesser General Public
> > > + * License as published by the Free Software Foundation; either
> > > + * version 2.1 of the License, or (at your option) any later version.
> > > + *
> > > + * This library is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > + * Lesser General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU Lesser General Public
> > > + * License along with this library; if not, write to the Free Software
> > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> > > 02110-1301 USA
> > > + */
> > > +
> > > +#ifndef _URCU_UATOMIC_BUILTINS_GENERIC_H
> > > +#define _URCU_UATOMIC_BUILTINS_GENERIC_H
> > > +
> > > +#include 
> > > +
> > > +#define uatomic_set(addr, v) __atomic_store_n(addr, v, __ATOMIC_RELAXED)
> > > +
> > > +#define uatomic_read(addr) __atomic_load_n(addr, __ATOMIC_RELAXED)
> > 
> > Does this lose the volatile semantics that the old-style definitions
> > had?
> > 
> 
> Yes.
> 
> [...]
> 
> > > +++ b/include/urcu/uatomic/builtins-x86.h
> > > @@ -0,0 +1,85 @@
> > > +/*
> > > + * urcu/uatomic/builtins-x86.h
> > > + *
> > > + * Copyright (c) 2023 Olivier Dion 
> > > + *
> > > + * This library is free software; you can redistribute it and/or
> > > + * modify it under the terms of the GNU Lesser General Public
> > > + * License as published by the Free Software Foundation; either
> > > + * version 2.1 of the License, or (at your option) any later version.
> > > + *
> > > + * This library is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > + * Lesser General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU Lesser General Public
> > > + * License along with this library; if not, write to the Free Software
> > > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 
> > > 02110-1301 USA
> > > + */
> > > +
> > > +#ifndef _URCU_UATOMIC_BUILTINS_X86_H
> > > +#define _URCU_UATOMIC_BUILTINS_X86_H
> > > +
> > > +#include 
> > > +
> > > +#define uatomic_set(addr, v) __atomic_store_n(addr, v, __ATOMIC_RELAXED)
> > > +
> > > +#define uatomic_read(addr) __atomic_load_n(addr, __ATOMIC_RELAXED)
> > 
> > And same question here.
> 
> Yes, this opens interesting questions:
> 
> * what semantic do we want for uatomic_read/set ?
> 
> * what semantic do we want for CMM_LOAD_SHARED/CMM_STORE_SHARED ?
> 
> * do we want to allow load/store-shared to work on variables larger than a
> word ? (e.g. on a uint64_t on a 32-bit architecture, or on a structure)
> 
> * what are the guarantees of a volatile type ?
> 
> * what are the guarantees of a load/store relaxed in C11 ?
> 
> Does the delta between volatile and C11 relaxed guarantees matter ?
> 
> Is there an advantage to use C11 load/store relaxed over volatile ? Should
> we combine both C11 load/store _and_ volatile ? Should we use
> atomic_signal_fence instead ?

I suggest C11 volatile atomic load/store.  Load/store fusing is permitted
for non-volatile atomic loads and stores, and such fusing can ruin your
code's entire day.  ;-)
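
In code, that suggestion amounts to something like the following sketch
(hypothetical macro bodies; GCC and Clang atomic builtins do accept
volatile-qualified pointers):

    #define uatomic_read(addr)                                          \
            __atomic_load_n((volatile __typeof__(*(addr)) *) (addr),    \
                            __ATOMIC_RELAXED)

    #define uatomic_set(addr, v)                                        \
            __atomic_store_n((volatile __typeof__(*(addr)) *) (addr),   \
                             v, __ATOMIC_RELAXED)

The volatile qualification removes the compiler's license to fuse or elide
the accesses, while the builtin keeps the accesses visible to TSAN as atomic.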

Thanx, Paul


Re: [lttng-dev] [PATCH 04/11] urcu/arch/generic: Use atomic builtins if configured

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Wed, Jun 21, 2023 at 09:48:10PM -0400, Mathieu Desnoyers wrote:
> On 6/21/23 20:53, Olivier Dion wrote:
> > On Wed, 21 Jun 2023, "Paul E. McKenney"  wrote:
> > > On Mon, May 15, 2023 at 04:17:11PM -0400, Olivier Dion wrote:
> > > >   #ifndef cmm_mb
> > > >   #define cmm_mb()__sync_synchronize()
> > > 
> > > Just out of curiosity, why not also implement cmm_mb() in terms of
> > > __atomic_thread_fence(__ATOMIC_SEQ_CST)?  (Or is that a later patch?)
> > 
> > IIRC, Mathieu and I agree that the definition of a thread fence -- acts
> > as a synchronization fence between threads -- is too weak for what we
> > want here.  For example, with I/O devices.
> > 
> > Although __sync_synchronize() is probably an alias for a SEQ_CST thread
> > fence, its definition -- issues a full memory barrier -- is stronger.
> > 
> > We do not want to rely on this assumption (alias) and prefer to rely on
> > the documented definition instead.
> 
> We should document this rationale with a new comment near the #define,
> in case anyone mistakenly decides to use a thread fence there to make it
> similar to the rest of the code in the future.

That would be good, thank you!

Ah, and I did not find any issues with the rest of the patchset.
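
A possible form for that comment (wording illustrative only):

    /*
     * Keep cmm_mb() implemented with __sync_synchronize(), which is
     * documented as issuing a full memory barrier, rather than with
     * __atomic_thread_fence(__ATOMIC_SEQ_CST): a thread fence only
     * orders accesses with respect to other threads, which may be too
     * weak when ordering against I/O devices is also required.
     */
    #ifndef cmm_mb
    #define cmm_mb()    __sync_synchronize()
    #endif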

Thanx, Paul


Re: [lttng-dev] [PATCH v2 08/12] benchmark: Use uatomic for accessing global states

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Wed, Jun 07, 2023 at 02:53:55PM -0400, Olivier Dion wrote:
> Global states accesses were protected via memory barriers. Use the
> uatomic API with the CMM memory model so that TSAN can understand the
> ordering imposed by the synchronization flags.
> 
> Change-Id: I1bf5702c5ac470f308c478effe39e424a3158060
> Co-authored-by: Mathieu Desnoyers 
> Signed-off-by: Olivier Dion 

This does look more organized!

Thanx, Paul

> ---
>  tests/benchmark/Makefile.am | 91 +
>  tests/benchmark/common-states.c |  1 +
>  tests/benchmark/common-states.h | 51 ++
>  tests/benchmark/test_mutex.c| 32 +
>  tests/benchmark/test_perthreadlock.c| 32 +
>  tests/benchmark/test_rwlock.c   | 32 +
>  tests/benchmark/test_urcu.c | 33 +
>  tests/benchmark/test_urcu_assign.c  | 33 +
>  tests/benchmark/test_urcu_bp.c  | 33 +
>  tests/benchmark/test_urcu_defer.c   | 33 +
>  tests/benchmark/test_urcu_gc.c  | 34 ++---
>  tests/benchmark/test_urcu_hash.c|  6 +-
>  tests/benchmark/test_urcu_hash.h| 15 
>  tests/benchmark/test_urcu_hash_rw.c | 10 +--
>  tests/benchmark/test_urcu_hash_unique.c | 10 +--
>  tests/benchmark/test_urcu_lfq.c | 20 ++
>  tests/benchmark/test_urcu_lfs.c | 20 ++
>  tests/benchmark/test_urcu_lfs_rcu.c | 20 ++
>  tests/benchmark/test_urcu_qsbr.c| 33 +
>  tests/benchmark/test_urcu_qsbr_gc.c | 34 ++---
>  tests/benchmark/test_urcu_wfcq.c| 22 +++---
>  tests/benchmark/test_urcu_wfq.c | 20 ++
>  tests/benchmark/test_urcu_wfs.c | 22 +++---
>  23 files changed, 177 insertions(+), 460 deletions(-)
>  create mode 100644 tests/benchmark/common-states.c
>  create mode 100644 tests/benchmark/common-states.h
> 
> diff --git a/tests/benchmark/Makefile.am b/tests/benchmark/Makefile.am
> index c53e025..a7f91c2 100644
> --- a/tests/benchmark/Makefile.am
> +++ b/tests/benchmark/Makefile.am
> @@ -1,4 +1,5 @@
>  AM_CPPFLAGS += -I$(top_srcdir)/src -I$(top_srcdir)/tests/common
> +AM_CPPFLAGS += -include $(top_srcdir)/tests/benchmark/common-states.h
>  
>  TEST_EXTENSIONS = .tap
>  TAP_LOG_DRIVER_FLAGS = --merge --comments
> @@ -7,6 +8,8 @@ TAP_LOG_DRIVER = env AM_TAP_AWK='$(AWK)' \
>   URCU_TESTS_BUILDDIR='$(abs_top_builddir)/tests' \
>   $(SHELL) $(top_srcdir)/tests/utils/tap-driver.sh
>  
> +noinst_HEADERS = common-states.h
> +
>  SCRIPT_LIST = \
>   runpaul-phase1.sh \
>   runpaul-phase2.sh \
> @@ -61,163 +64,163 @@ URCU_CDS_LIB=$(top_builddir)/src/liburcu-cds.la
>  
>  DEBUG_YIELD_LIB=$(builddir)/../common/libdebug-yield.la
>  
> -test_urcu_SOURCES = test_urcu.c
> +test_urcu_SOURCES = test_urcu.c common-states.c
>  test_urcu_LDADD = $(URCU_LIB)
>  
> -test_urcu_dynamic_link_SOURCES = test_urcu.c
> +test_urcu_dynamic_link_SOURCES = test_urcu.c common-states.c
>  test_urcu_dynamic_link_LDADD = $(URCU_LIB)
>  test_urcu_dynamic_link_CFLAGS = -DDYNAMIC_LINK_TEST $(AM_CFLAGS)
>  
> -test_urcu_timing_SOURCES = test_urcu_timing.c
> +test_urcu_timing_SOURCES = test_urcu_timing.c common-states.c
>  test_urcu_timing_LDADD = $(URCU_LIB)
>  
> -test_urcu_yield_SOURCES = test_urcu.c
> +test_urcu_yield_SOURCES = test_urcu.c common-states.c
>  test_urcu_yield_LDADD = $(URCU_LIB) $(DEBUG_YIELD_LIB)
>  test_urcu_yield_CFLAGS = -DDEBUG_YIELD $(AM_CFLAGS)
>  
>  
> -test_urcu_qsbr_SOURCES = test_urcu_qsbr.c
> +test_urcu_qsbr_SOURCES = test_urcu_qsbr.c common-states.c
>  test_urcu_qsbr_LDADD = $(URCU_QSBR_LIB)
>  
> -test_urcu_qsbr_timing_SOURCES = test_urcu_qsbr_timing.c
> +test_urcu_qsbr_timing_SOURCES = test_urcu_qsbr_timing.c common-states.c
>  test_urcu_qsbr_timing_LDADD = $(URCU_QSBR_LIB)
>  
>  
> -test_urcu_mb_SOURCES = test_urcu.c
> +test_urcu_mb_SOURCES = test_urcu.c common-states.c
>  test_urcu_mb_LDADD = $(URCU_MB_LIB)
>  test_urcu_mb_CFLAGS = -DRCU_MB $(AM_CFLAGS)
>  
>  
> -test_urcu_signal_SOURCES = test_urcu.c
> +test_urcu_signal_SOURCES = test_urcu.c common-states.c
>  test_urcu_signal_LDADD = $(URCU_SIGNAL_LIB)
>  test_urcu_signal_CFLAGS = -DRCU_SIGNAL $(AM_CFLAGS)
>  
> -test_urcu_signal_dynamic_link_SOURCES = test_urcu.c
> +test_urcu_signal_dynamic_link_SOURCES = test_urcu.c common-states.c
>  test_urcu_signal_dynamic_link_LDADD = $(URCU_SIGNAL_LIB)
>  test_urcu_signal_dynamic_link_CFLAGS = -DRCU_SIGNAL -DDYNAMIC_LINK_TEST \
>   $(AM_CFLAGS)
>  
> -test_urcu_signal_timing_SOURCES = test_urcu_timing.c
> +test_urcu_signal_timing_SOURCES = test_urcu_timing.c common-states.c
>  test_urcu_signal_timing_LDADD = $(URCU_SIGNAL_LIB)
>  test_urcu_signal_timing_CFLAGS= -DRCU_SIGNAL $(AM_CFLAGS)
>  
> -test_urcu_signal_yield_SOURCES = test_urcu.c
> +test_urcu_signal_yield_SOURCES = test_urcu.c common-states.c
>  test_urcu_signal_yield_LDADD = 

Re: [lttng-dev] [PATCH v2 07/12] tests: Use uatomic for accessing global states

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Wed, Jun 07, 2023 at 02:53:54PM -0400, Olivier Dion wrote:
> Global states accesses were protected via memory barriers. Use the
> uatomic API with the CMM memory model so that TSAN does not warns about

"does not warn", for whatever that is worth.

> none atomic concurrent accesses.
> 
> Also, the thread id map mutex must be unlocked after setting the newly
> created thread id in the map. Otherwise, the new thread could observe an
> unset id.
> 
> Change-Id: I1ecdc387b3f510621cbc116ad3b95c676f5d659a
> Co-authored-by: Mathieu Desnoyers 
> Signed-off-by: Olivier Dion 
> ---
>  tests/common/api.h|  12 ++--
>  tests/regression/rcutorture.h | 106 +++---
>  2 files changed, 80 insertions(+), 38 deletions(-)
> 
> diff --git a/tests/common/api.h b/tests/common/api.h
> index a260463..9d22b0f 100644
> --- a/tests/common/api.h
> +++ b/tests/common/api.h
> @@ -26,6 +26,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * Machine parameters.
> @@ -135,7 +136,7 @@ static int __smp_thread_id(void)
>   thread_id_t tid = pthread_self();
>  
>   for (i = 0; i < NR_THREADS; i++) {
> - if (__thread_id_map[i] == tid) {
> + if (uatomic_read(&__thread_id_map[i]) == tid) {
>   long v = i + 1;  /* must be non-NULL. */
>  
>   if (pthread_setspecific(thread_id_key, (void *)v) != 0) 
> {
> @@ -184,12 +185,13 @@ static thread_id_t create_thread(void *(*func)(void *), 
> void *arg)
>   exit(-1);
>   }
>   __thread_id_map[i] = __THREAD_ID_MAP_WAITING;
> - spin_unlock(&__thread_id_map_mutex);
> +
>   if (pthread_create(, NULL, func, arg) != 0) {
>   perror("create_thread:pthread_create");
>   exit(-1);
>   }
> - __thread_id_map[i] = tid;
> + uatomic_set(&__thread_id_map[i], tid);
> + spin_unlock(&__thread_id_map_mutex);
>   return tid;
>  }
>  
> @@ -199,7 +201,7 @@ static void *wait_thread(thread_id_t tid)
>   void *vp;
>  
>   for (i = 0; i < NR_THREADS; i++) {
> - if (__thread_id_map[i] == tid)
> + if (uatomic_read(&__thread_id_map[i]) == tid)
>   break;
>   }
>   if (i >= NR_THREADS){
> @@ -211,7 +213,7 @@ static void *wait_thread(thread_id_t tid)
>   perror("wait_thread:pthread_join");
>   exit(-1);
>   }
> - __thread_id_map[i] = __THREAD_ID_MAP_EMPTY;
> + uatomic_set(&__thread_id_map[i], __THREAD_ID_MAP_EMPTY);
>   return vp;
>  }
>  
> diff --git a/tests/regression/rcutorture.h b/tests/regression/rcutorture.h
> index bc394f9..5835b8f 100644
> --- a/tests/regression/rcutorture.h
> +++ b/tests/regression/rcutorture.h
> @@ -44,6 +44,14 @@
>   * data.  A correct RCU implementation will have all but the first two
>   * numbers non-zero.
>   *
> + * rcu_stress_count: Histogram of "ages" of structures seen by readers.  If any
> + * entries past the first two are non-zero, RCU is broken. The age of a newly
> + * allocated structure is zero, it becomes one when removed from reader
> + * visibility, and is incremented once per grace period subsequently -- and is
> + * freed after passing through (RCU_STRESS_PIPE_LEN-2) grace periods.  Since
> + * this test only has one true writer (there are fake writers), only buckets at
> + * indexes 0 and 1 should be non-zero.
> + *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
>   * the Free Software Foundation; either version 2 of the License, or
> @@ -68,6 +76,8 @@
>  #include 
>  #include "tap.h"
>  
> +#include 
> +
>  #define NR_TESTS 1
>  
>  DEFINE_PER_THREAD(long long, n_reads_pt);
> @@ -145,10 +155,10 @@ void *rcu_read_perf_test(void *arg)
>   run_on(me);
>   uatomic_inc();
>   put_thread_offline();
> - while (goflag == GOFLAG_INIT)
> + while (uatomic_read() == GOFLAG_INIT)
>   (void) poll(NULL, 0, 1);
>   put_thread_online();
> - while (goflag == GOFLAG_RUN) {
> + while (uatomic_read() == GOFLAG_RUN) {
>   for (i = 0; i < RCU_READ_RUN; i++) {
>   rcu_read_lock();
>   /* rcu_read_lock_nest(); */
> @@ -180,9 +190,9 @@ void *rcu_update_perf_test(void *arg 
> __attribute__((unused)))
>   }
>   }
>   uatomic_inc();
> - while (goflag == GOFLAG_INIT)
> + while (uatomic_read() == GOFLAG_INIT)
>   (void) poll(NULL, 0, 1);
> - while (goflag == GOFLAG_RUN) {
> + while (uatomic_read() == GOFLAG_RUN) {
>   synchronize_rcu();
>   n_updates_local++;
>   }
> @@ -211,15 +221,11 @@ int perftestrun(int nthreads, int nreaders, int 
> nupdaters)
>   int t;
>   int duration = 1;
>  
> - cmm_smp_mb();
>   while (uatomic_read() < nthreads)
>   (void) poll(NULL, 0, 1);
> - goflag = GOFLAG_RUN;
> - 

Re: [lttng-dev] [PATCH 02/11] urcu/uatomic: Use atomic builtins if configured

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Mon, May 15, 2023 at 04:17:09PM -0400, Olivier Dion wrote:
> Implement uatomic in terms of atomic builtins if configured to do so.
> 
> Change-Id: I5814494c62ee507fd5d381c3ba4ccd0a80c4f4e3
> Co-authored-by: Mathieu Desnoyers 
> Signed-off-by: Olivier Dion 
> ---
>  include/Makefile.am |  3 +
>  include/urcu/uatomic.h  |  5 +-
>  include/urcu/uatomic/builtins-generic.h | 85 +
>  include/urcu/uatomic/builtins-x86.h | 85 +
>  include/urcu/uatomic/builtins.h | 83 
>  5 files changed, 260 insertions(+), 1 deletion(-)
>  create mode 100644 include/urcu/uatomic/builtins-generic.h
>  create mode 100644 include/urcu/uatomic/builtins-x86.h
>  create mode 100644 include/urcu/uatomic/builtins.h
> 
> diff --git a/include/Makefile.am b/include/Makefile.am
> index ba1fe60..fac941f 100644
> --- a/include/Makefile.am
> +++ b/include/Makefile.am
> @@ -63,6 +63,9 @@ nobase_include_HEADERS = \
>   urcu/uatomic/alpha.h \
>   urcu/uatomic_arch.h \
>   urcu/uatomic/arm.h \
> + urcu/uatomic/builtins.h \
> + urcu/uatomic/builtins-generic.h \
> + urcu/uatomic/builtins-x86.h \
>   urcu/uatomic/gcc.h \
>   urcu/uatomic/generic.h \
>   urcu/uatomic.h \
> diff --git a/include/urcu/uatomic.h b/include/urcu/uatomic.h
> index 2fb5fd4..6b57c5f 100644
> --- a/include/urcu/uatomic.h
> +++ b/include/urcu/uatomic.h
> @@ -22,8 +22,11 @@
>  #define _URCU_UATOMIC_H
>  
>  #include 
> +#include 
>  
> -#if defined(URCU_ARCH_X86)
> +#if defined(CONFIG_RCU_USE_ATOMIC_BUILTINS)
> +#include 
> +#elif defined(URCU_ARCH_X86)
>  #include 
>  #elif defined(URCU_ARCH_PPC)
>  #include 
> diff --git a/include/urcu/uatomic/builtins-generic.h 
> b/include/urcu/uatomic/builtins-generic.h
> new file mode 100644
> index 000..8e6a9b5
> --- /dev/null
> +++ b/include/urcu/uatomic/builtins-generic.h
> @@ -0,0 +1,85 @@
> +/*
> + * urcu/uatomic/builtins-generic.h
> + *
> + * Copyright (c) 2023 Olivier Dion 
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
> USA
> + */
> +
> +#ifndef _URCU_UATOMIC_BUILTINS_GENERIC_H
> +#define _URCU_UATOMIC_BUILTINS_GENERIC_H
> +
> +#include 
> +
> +#define uatomic_set(addr, v) __atomic_store_n(addr, v, __ATOMIC_RELAXED)
> +
> +#define uatomic_read(addr) __atomic_load_n(addr, __ATOMIC_RELAXED)

Does this lose the volatile semantics that the old-style definitions
had?
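
For reference, the old-style definition performed the access through a
volatile cast (a sketch based on the CMM_ACCESS_ONCE() form quoted later in
this thread):

    #define CMM_ACCESS_ONCE(x)  (*(__volatile__ __typeof__(x) *)&(x))
    #define CMM_LOAD_SHARED(p)  CMM_ACCESS_ONCE(p)
    #define uatomic_read(addr)  CMM_LOAD_SHARED(*(addr))  /* volatile load */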

> +
> +#define uatomic_cmpxchg(addr, old, new)				\
> + __extension__   \
> + ({  \
> + __typeof__(*(addr)) _old = (__typeof__(*(addr)))old;\
> + __atomic_compare_exchange_n(addr, &_old, new, 0,\
> + __ATOMIC_SEQ_CST,   \
> + __ATOMIC_SEQ_CST);  \
> + _old;   \
> + })
> +
> +#define uatomic_xchg(addr, v)\
> + __atomic_exchange_n(addr, v, __ATOMIC_SEQ_CST)
> +
> +#define uatomic_add_return(addr, v)  \
> + __atomic_add_fetch(addr, v, __ATOMIC_SEQ_CST)
> +
> +#define uatomic_sub_return(addr, v)  \
> + __atomic_sub_fetch(addr, v, __ATOMIC_SEQ_CST)
> +
> +#define uatomic_and(addr, mask)  \
> + (void)__atomic_and_fetch(addr, mask, __ATOMIC_RELAXED)
> +
> +#define uatomic_or(addr, mask)   \
> + (void)__atomic_or_fetch(addr, mask, __ATOMIC_RELAXED)
> +
> +#define uatomic_add(addr, v) \
> + (void)__atomic_add_fetch(addr, v, __ATOMIC_RELAXED)
> +
> +#define uatomic_sub(addr, v) \
> + (void)__atomic_sub_fetch(addr, v, __ATOMIC_RELAXED)
> +
> +#define uatomic_inc(addr)\
> + (void)__atomic_add_fetch(addr, 1, __ATOMIC_RELAXED)
> +
> +#define uatomic_dec(addr)\
> + (void)__atomic_sub_fetch(addr, 1, __ATOMIC_RELAXED)
> +
> +#define cmm_smp_mb__before_uatomic_and() 

Re: [lttng-dev] [PATCH v2 05/12] urcu/uatomic: Add CMM memory model

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Wed, Jun 07, 2023 at 02:53:52PM -0400, Olivier Dion wrote:
> Introducing the CMM memory model with the following new primitives:
> 
>   - uatomic_load(addr, memory_order)
> 
>   - uatomic_store(addr, value, memory_order)
>   - uatomic_and_mo(addr, mask, memory_order)
>   - uatomic_or_mo(addr, mask, memory_order)
>   - uatomic_add_mo(addr, value, memory_order)
>   - uatomic_sub_mo(addr, value, memory_order)
>   - uatomic_inc_mo(addr, memory_order)
>   - uatomic_dec_mo(addr, memory_order)
> 
>   - uatomic_add_return_mo(addr, value, memory_order)
>   - uatomic_sub_return_mo(addr, value, memory_order)
> 
>   - uatomic_xchg_mo(addr, value, memory_order)
> 
>   - uatomic_cmpxchg_mo(addr, old, new,
>memory_order_success,
>memory_order_failure)
> 
> The CMM memory model reflects the C11 memory model with an additional
> CMM_SEQ_CST_FENCE memory order. The memory order can be selected through
> the enum cmm_memorder.
> 
> * With Atomic Builtins
> 
> If configured with atomic builtins, the correspondence between the CMM
> memory model and the C11 memory model is one-to-one, with the exception
> of the CMM_SEQ_CST_FENCE memory order, which implies the memory order
> CMM_SEQ_CST and a thread fence after the operation.
> 
> * Without Atomic Builtins
> 
> However, if not configured with atomic builtins, the following stipulates
> the memory model.
> 
> For load operations with uatomic_load(), the memory orders CMM_RELAXED,
> CMM_CONSUME, CMM_ACQUIRE, CMM_SEQ_CST and CMM_SEQ_CST_FENCE are
> allowed. A barrier may be inserted before and after the load from memory
> depending on the memory order:
> 
>   - CMM_RELAXED: No barrier
>   - CMM_CONSUME: Memory barrier after read
>   - CMM_ACQUIRE: Memory barrier after read
>   - CMM_SEQ_CST: Memory barriers before and after read
>   - CMM_SEQ_CST_FENCE: Memory barriers before and after read
> 
> For store operations with uatomic_store(), the memory orders
> CMM_RELAXED, CMM_RELEASE, CMM_SEQ_CST and CMM_SEQ_CST_FENCE are
> allowed. A barrier may be inserted before and after the store to memory
> depending on the memory order:
> 
>   - CMM_RELAXED: No barrier
>   - CMM_RELEASE: Memory barrier before operation
>   - CMM_SEQ_CST: Memory barriers before and after operation
>   - CMM_SEQ_CST_FENCE: Memory barriers before and after operation
> 
> For load/store operations with uatomic_and_mo(), uatomic_or_mo(),
> uatomic_add_mo(), uatomic_sub_mo(), uatomic_inc_mo(), uatomic_dec_mo(),
> uatomic_add_return_mo() and uatomic_sub_return_mo(), all memory orders
> are allowed. A barrier may be inserted before and after the operation
> depending on the memory order:
> 
>   - CMM_RELAXED: No barrier
>   - CMM_ACQUIRE: Memory barrier after operation
>   - CMM_CONSUME: Memory barrier after operation
>   - CMM_RELEASE: Memory barrier before operation
>   - CMM_ACQ_REL: Memory barriers before and after operation
>   - CMM_SEQ_CST: Memory barriers before and after operation
>   - CMM_SEQ_CST_FENCE: Memory barriers before and after operation
> 
> For the exchange operation uatomic_xchg_mo(), any memory order is
> valid. A barrier may be inserted before and after the exchange to memory
> depending on the memory order:
> 
>   - CMM_RELAXED: No barrier
>   - CMM_ACQUIRE: Memory barrier after operation
>   - CMM_CONSUME: Memory barrier after operation
>   - CMM_RELEASE: Memory barrier before operation
>   - CMM_ACQ_REL: Memory barriers before and after operation
>   - CMM_SEQ_CST: Memory barriers before and after operation
>   - CMM_SEQ_CST_FENCE: Memory barriers before and after operation
> 
> For the compare exchange operation uatomic_cmpxchg_mo(), the success
> memory order can be anything while the failure memory order cannot be
> CMM_RELEASE nor CMM_ACQ_REL and cannot be stronger than the success
> memory order. A barrier may be inserted before and after the store to
> memory depending on the memory orders:
> 
>  Success memory order:
> 
>   - CMM_RELAXED: No barrier
>   - CMM_ACQUIRE: Memory barrier after operation
>   - CMM_CONSUME: Memory barrier after operation
>   - CMM_RELEASE: Memory barrier before operation
>   - CMM_ACQ_REL: Memory barriers before and after operation
>   - CMM_SEQ_CST: Memory barriers before and after operation
>   - CMM_SEQ_CST_FENCE: Memory barriers before and after operation
> 
>   Barriers after the operations are only emitted if the compare exchange
>   succeeds.
> 
>  Failure memory order:
>   - CMM_RELAXED: No barrier
>   - CMM_ACQUIRE: Memory barrier after operation
>   - CMM_CONSUME: Memory barrier after operation
>   - CMM_SEQ_CST: Memory barriers before and after operation
>   - CMM_SEQ_CST_FENCE: Memory barriers before and after operation
> 
>   Barriers after the operations are only emitted if the compare exchange
>   fails.  Barriers before the operation are never emitted by this
>   memory order.
> 
> Change-Id: I213ba19c84e82a63083f00143a3142ffbdab1d52
> Co-authored-by: Mathieu Desnoyers 
> Signed-off-by: 
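
A short usage sketch of the primitives listed above (illustrative; the calls
follow the signatures given in the commit message, but the variable names
are made up):

    uatomic_store(&ready, 1, CMM_RELEASE);      /* barrier before store */
    v = uatomic_load(&ready, CMM_ACQUIRE);      /* barrier after load */
    old = uatomic_cmpxchg_mo(&lock, 0, 1,
                             CMM_SEQ_CST,       /* success ordering */
                             CMM_RELAXED);      /* failure ordering */
    n = uatomic_add_return_mo(&count, 1, CMM_SEQ_CST_FENCE);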

Re: [lttng-dev] [PATCH v2 04/12] urcu/system: Use atomic builtins if configured

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Wed, Jun 07, 2023 at 02:53:51PM -0400, Olivier Dion wrote:
> If configured to use atomic builtins, use them for implementing the
> CMM_LOAD_SHARED and CMM_STORE_SHARED macros.
> 
> Change-Id: I3ef0d26c47aced6e94b40fd59c7b8baa6272
> Co-authored-by: Mathieu Desnoyers 
> Signed-off-by: Olivier Dion 
> ---
>  include/urcu/system.h | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/include/urcu/system.h b/include/urcu/system.h
> index faae390..f184aad 100644
> --- a/include/urcu/system.h
> +++ b/include/urcu/system.h
> @@ -19,9 +19,28 @@
>   * all copies or substantial portions of the Software.
>   */
>  
> +#include 
>  #include 
>  #include 
>  
> +#ifdef CONFIG_RCU_USE_ATOMIC_BUILTINS
> +
> +#define CMM_LOAD_SHARED(x)   \
> + __atomic_load_n(&(x), __ATOMIC_RELAXED)
> +
> +#define _CMM_LOAD_SHARED(x) CMM_LOAD_SHARED(x)
> +
> +#define CMM_STORE_SHARED(x, v)   \
> + __extension__   \
> + ({  \
> + __typeof__(v) _v = (v); \
> + __atomic_store_n(&(x), _v, __ATOMIC_RELAXED);   \
> + _v; \
> + })
> +
> +#define _CMM_STORE_SHARED(x, v) CMM_STORE_SHARED(x, v)

Same question here on loss of volatile semantics.

Thanx, Paul

> +
> +#else
>  /*
>   * Identify a shared load. A cmm_smp_rmc() or cmm_smp_mc() should come
>   * before the load.
> @@ -56,4 +75,6 @@
>   _v = _v;/* Work around clang "unused result" */ \
>   })
>  
> +#endif   /* CONFIG_RCU_USE_ATOMIC_BUILTINS */
> +
>  #endif /* _URCU_SYSTEM_H */
> -- 
> 2.40.1
> 


Re: [lttng-dev] [PATCH 04/11] urcu/arch/generic: Use atomic builtins if configured

2023-06-21 Thread Paul E. McKenney via lttng-dev
On Mon, May 15, 2023 at 04:17:11PM -0400, Olivier Dion wrote:
> If configured to use atomic builtins, implement SMP memory barriers in
> terms of atomic builtins if the architecture does not implement its own
> version.
> 
> Change-Id: Iddc4283606e0fce572e104d2d3f03b5c0d9926fb
> Co-authored-by: Mathieu Desnoyers 
> Signed-off-by: Olivier Dion 
> ---
>  include/urcu/arch/generic.h | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/urcu/arch/generic.h b/include/urcu/arch/generic.h
> index be6e41e..e292c70 100644
> --- a/include/urcu/arch/generic.h
> +++ b/include/urcu/arch/generic.h
> @@ -43,6 +43,14 @@ extern "C" {
>   * GCC builtins) as well as cmm_rmb and cmm_wmb (defaulting to cmm_mb).
>   */
>  
> +#ifdef CONFIG_RCU_USE_ATOMIC_BUILTINS
> +
> +# ifndef cmm_smp_mb
> +#  define cmm_smp_mb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
> +# endif
> +
> +#endif   /* CONFIG_RCU_USE_ATOMIC_BUILTINS */
> +
>  #ifndef cmm_mb
>  #define cmm_mb()__sync_synchronize()

Just out of curiosity, why not also implement cmm_mb() in terms of
__atomic_thread_fence(__ATOMIC_SEQ_CST)?  (Or is that a later patch?)

Thanx, Paul

>  #endif
> -- 
> 2.39.2
> 


Re: [lttng-dev] [RFC] Deprecating RCU signal flavor

2023-05-16 Thread Paul E. McKenney via lttng-dev
On Wed, May 10, 2023 at 05:10:27PM -0400, Olivier Dion wrote:
> Hi all,
> 
> We have the intention of deprecating the urcu-signal flavor in the
> future.  We are asking users of URCU for _feedback_ on this before
> going any further.
> 
> Part of this decision is that we are adding support for TSAN in URCU and
> the signal flavor deadlocks with TSAN.  It is also my understanding that
> the urcu-signal flavor was historically made as a fallback for system
> lacking the membarrier(2) system call.  Nowadays, most systems have
> support for that system call, making the urcu-signal an artifact of the
> past.

No objections here.  I will need to do a little rework of my book, but
I can do that.  ;-)

> The following is a proposed timeline of events:
> 
>   1. Asking for feedback from users of URCU (this message)
>   2. Disabling the signal flavor by default and adding --enable-flavor-signal
>   3. Removing the signal flavor

Makes sense to me!

Thanx, Paul


Re: [lttng-dev] RCU API usage from call_rcu callbacks?

2023-03-22 Thread Paul E. McKenney via lttng-dev
On Wed, Mar 22, 2023 at 09:57:25AM -0400, Mathieu Desnoyers wrote:
> On 2023-03-22 07:08, Ondřej Surý via lttng-dev wrote:
> > Hi,
> > 
> > the documentation is pretty silent on this, and asking here is probably 
> > going to be faster
> > than me trying to use the source to figure this out.
> > 
> > Is it legal to call_rcu() from within the call_rcu() callback?
> 
> Yes. call_rcu callbacks can be chained.
> 
> Note that you'll need to issue rcu_barrier() on program exit as many times as 
> you chained call_rcu callbacks if you intend to make sure no queued callbacks 
> still exist on program clean shutdown. See this comment above 
> urcu_call_rcu_exit():
> 
>  * Teardown the default call_rcu worker thread if there are no queued
>  * callbacks on process exit. This prevents leaking memory.
>  *
>  * Here is how an application can ensure graceful teardown of this
>  * worker thread:
>  *
>  * - An application queuing call_rcu callbacks should invoke
>  *   rcu_barrier() before it exits.
>  * - When chaining call_rcu callbacks, the number of calls to
>  *   rcu_barrier() on application exit must match at least the maximum
>  *   number of chained callbacks.
>  * - If an application chains callbacks endlessly, it would have to be
>  *   modified to stop chaining callbacks when it detects an application
>  *   exit (e.g. with a flag), and wait for quiescence with rcu_barrier()
>  *   after setting that flag.

This trick can also be used to gracefully shut down in the presence
of bounded chaining using but one rcu_barrier() call.

Thanx, Paul

>  * - The statements above apply to a library which queues call_rcu
>  *   callbacks, only it needs to invoke rcu_barrier in its library
>  *   destructor.
> 
> 
> > 
> > What about the other RCU (and CDS) API calls?
> 
> They can be unless stated otherwise. For instance, rcu_barrier() cannot be 
> called from a call_rcu worker thread.
> 
> > 
> > How does that interact with create_call_rcu_data()?  I have  event loops 
> > and I am
> > initializing  1:1 call_rcu helper threads as I need to do some 
> > per-thread initialization
> > as some of the destroy-like functions use random numbers (don't ask).
> 
> As I recall, set_thread_call_rcu_data() will associate a call_rcu worker
> instance with the current thread. So all following call_rcu() invocations from
> that thread will be queued into this per-thread call_rcu queue, and handled 
> by the call_rcu worker thread.
> 
> But I wonder why you inherently need this 1:1 mapping, rather than using the 
> content of the structure containing the rcu_head to figure out which 
> per-thread data should be used ?
> 
> If you manage to separate the context from the worker thread instances, then 
> you could use per-cpu call_rcu worker threads, which will eventually scale 
> even better when I integrate the liburcu call_rcu API with sys_rseq 
> concurrency ids [1].
> 
> > 
> > If it's legal to call_rcu() from call_rcu thread, which thread is going to 
> > be used?
> 
> The call_rcu invoked from the call_rcu worker thread will queue the call_rcu 
> callback onto the queue handled by that worker thread. It does so by setting
> 
>   URCU_TLS(thread_call_rcu_data) = crdp;
> 
> early in call_rcu_thread(). So any chained call_rcu is handled by the same 
> call_rcu worker thread doing the chaining, with the exception of teardown 
> where the pending callbacks are moved to the default worker thread.
> 
> Thanks,
> 
> Mathieu
> 
> [1] 
> https://lore.kernel.org/lkml/20221122203932.231377-1-mathieu.desnoy...@efficios.com/
> 
> 
> > 
> > Thank you,
> > Ondrej
> > --
> > Ondřej Surý (He/Him)
> > ond...@sury.org
> > 
> > ___
> > lttng-dev mailing list
> > lttng-dev@lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 


Re: [lttng-dev] [PATCH] QSBR: Use xor operation to replace add operation when changing rcu_gp.ctr value

2022-02-16 Thread Paul E. McKenney via lttng-dev
On Wed, Feb 16, 2022 at 03:53:20PM -0500, Mathieu Desnoyers wrote:
> - On Feb 16, 2022, at 2:35 AM, lttng-dev lttng-dev@lists.lttng.org wrote:
> 
> > It is enough to have three values of rcu_gp.ctr: 00 for INACTIVE,
> > 01 or 11 for ACTIVE. So it is possible to replace the add operation
> > with an xor operation when changing the rcu_gp.ctr value.
> 
> What is missing here is a description justifying why this change is useful.
> 
> What is inherently better about XOR compared to ADD or even binary-OR ?
> 
> If it's about performance, then a benchmark on relevant architectures
> would be useful. But I suspect that if end users care that much about the
> performance of urcu_qsbr_synchronize_rcu(), they might be doing something
> wrong.

Plus having the full counter can be extremely helpful when debugging.

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Signed-off-by: yaowenbin 
> > ---
> > src/urcu-qsbr.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/src/urcu-qsbr.c b/src/urcu-qsbr.c
> > index 3709412..46135f9 100644
> > --- a/src/urcu-qsbr.c
> > +++ b/src/urcu-qsbr.c
> > @@ -391,7 +391,7 @@ void urcu_qsbr_synchronize_rcu(void)
> > goto out;
> > 
> > /* Increment current G.P. */
> > -   CMM_STORE_SHARED(urcu_qsbr_gp.ctr, urcu_qsbr_gp.ctr + URCU_QSBR_GP_CTR);
> > +   CMM_STORE_SHARED(urcu_qsbr_gp.ctr, urcu_qsbr_gp.ctr ^ URCU_QSBR_GP_CTR);
> > 
> > /*
> >  * Must commit urcu_qsbr_gp.ctr update to memory before waiting for
> > --
> > 2.27.0
> > ___
> > lttng-dev mailing list
> > lttng-dev@lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [lttng-dev] User-space RCU: call rcu_barrier() before dissociating helper thread?

2021-05-05 Thread Paul E. McKenney via lttng-dev
On Wed, May 05, 2021 at 10:46:58AM -0400, Mathieu Desnoyers wrote:
> - On May 5, 2021, at 3:54 AM, Martin Wilck mwi...@suse.com wrote:
> 
> > On Fri, 2021-04-30 at 14:41 -0400, Mathieu Desnoyers wrote:
> >> - On Apr 29, 2021, at 9:49 AM, lttng-dev
> >> lttng-dev@lists.lttng.org wrote:
> >> 
> >> > In multipath-tools, we are using a custom RCU helper thread, which
> >> > is cleaned
> >> > out
> >> > on exit:
> >> > 
> >> > https://github.com/opensvc/multipath-tools/blob/23a01fa679481ff1144139222fbd2c4c863b78f8/multipathd/main.c#L3058
> >> > 
> >> > I put a call to rcu_barrier() there in order to make sure all
> >> > callbacks had
> >> > finished
> >> > before detaching the helper thread.
> >> > 
> >> > Now we got a report that rcu_barrier() isn't available before user-
> >> > space RCU 0.8
> >> > (https://github.com/opensvc/multipath-tools/issues/5) (and RHEL7 /
> >> > Centos7
> >> > still has 0.7.16).
> >> > 
> >> > Question: was it over-cautious or otherwise wrong to call
> >> > rcu_barrier() before
> >> > set_thread_call_rcu_data(NULL)? Can we maybe just skip this call?
> >> > If no, what
> >> > would be the recommended way for liburcu < 0.8 to dissociate a
> >> > helper thread?
> >> > 
> >> > (Note: I'm not currently subscribed to lttng-dev).
> >> 
> >> First of all, there is a significant reason why liburcu does not free
> >> the "default"
> >> call_rcu worker thread data structures at process exit. This is
> >> caused by the fact that
> >> a call_rcu callback may very well invoke call_rcu() to re-enqueue
> >> more work.
> >> 
> >> AFAIU this is somewhat similar to what happens to the Linux kernel
> >> RCU implementation
> >> when the machine needs to be shutdown or rebooted: there may indeed
> >> never be any point
> >> in time where it is safe to free the call_rcu worker thread data
> >> structures without leaks,
> >> due to the fact that a call_rcu callback may re-enqueue further work
> >> indefinitely.
> >> 
> >> So my understanding is that you implement your own call rcu worker
> >> thread because the
> >> one provided by liburcu leaks data structure on process exit, and you
> >> expect that
> >> call rcu_barrier once will suffice to ensure quiescence of the call
> >> rcu worker thread
> >> data structures. Unfortunately, this does not cover the scenario
> >> where a call_rcu
> >> callback re-enqueues additional work.
> > 
> > I understand. In multipath-tools, we only have one callback, which
> > doesn't re-enqueue any work. Our callback really just calls free() on a
> > data structure. And it's unlikely that we'll get more RCU callbacks any
> > time soon.
> > 
> > So, to clarify my question: Does it make sense to call rcu_barrier()
> > before set_thread_call_rcu_data(NULL) in this case?
> 
> Yes, it would ensure that all pending callbacks are executed prior to
> removing the worker thread. And considering that you don't have chained
> callbacks, it makes sense to invoke rcu_barrier() only once.

If you do have chained callbacks, one trick is to:

1.  Prevent your application from doing any more new invocations
of call_rcu().

2.  Set a flag that prevents any future callbacks from chaining.

3.  Do two calls to rcu_barrier(), one to wait for pre-existing
callbacks and another to wait for any additional chained
callbacks that happened concurrently with #2 above.
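
A minimal sketch of that sequence (the flag, structure, and callback names
are hypothetical):

    static int stop_chaining;

    static void obj_free_cb(struct rcu_head *head)
    {
            struct obj *p = caa_container_of(head, struct obj, rcu_head);

            if (!uatomic_read(&stop_chaining)) {
                    /* Chain further deferred work. */
                    call_rcu(&p->rcu_head, obj_free_cb);
                    return;
            }
            free(p);
    }

    static void app_shutdown(void)
    {
            /* 1. The application stops invoking call_rcu() elsewhere. */
            uatomic_set(&stop_chaining, 1); /* 2. Forbid chaining. */
            rcu_barrier();  /* 3a. Wait for pre-existing callbacks. */
            rcu_barrier();  /* 3b. Wait for callbacks chained
                             *     concurrently with step 2. */
    }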

Thanx, Paul


Re: [lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?

2021-04-16 Thread Paul E. McKenney via lttng-dev
On Fri, Apr 16, 2021 at 03:30:53PM -0400, Mathieu Desnoyers wrote:
> - On Apr 16, 2021, at 3:02 PM, paulmck paul...@kernel.org wrote:
> [...]
> > 
> > If it can be done reasonably, I suggest also having some way for the
> > person building userspace RCU to say "I know what I am doing, so do
> > it with volatile rather than memory_order_consume."
> 
> Like so ?
> 
> #define CMM_ACCESS_ONCE(x) (*(__volatile__  __typeof__(x) *)&(x))
> #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p)
> 
> /*
>  * By defining URCU_DEREFERENCE_USE_VOLATILE, the user requires use of
>  * volatile access to implement rcu_dereference rather than
>  * memory_order_consume load from the C11/C++11 standards.
>  *
>  * This may improve performance on weakly-ordered architectures where
>  * the compiler implements memory_order_consume as a
>  * memory_order_acquire, which is stricter than required by the
>  * standard.
>  *
>  * Note that using volatile accesses for rcu_dereference may cause
>  * LTO to generate incorrectly ordered code starting from C11/C++11.
>  */
> 
> #ifdef URCU_DEREFERENCE_USE_VOLATILE
> # define rcu_dereference(x) CMM_LOAD_SHARED(x)
> #else
> # if defined (__cplusplus)
> #  if __cplusplus >= 201103L
> #   include <atomic>
> #   define rcu_dereference(x)   ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume)
> #  else
> #   define rcu_dereference(x)   CMM_LOAD_SHARED(x)
> #  endif
> # else
> #  if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L)
> #   include <stdatomic.h>
> #   define rcu_dereference(x)   atomic_load_explicit(&(x), memory_order_consume)
> #  else
> #   define rcu_dereference(x)   CMM_LOAD_SHARED(x)
> #  endif
> # endif
> #endif

Looks good to me!

Thanx, Paul


Re: [lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?

2021-04-16 Thread Paul E. McKenney via lttng-dev
On Fri, Apr 16, 2021 at 02:40:08PM -0400, Mathieu Desnoyers wrote:
> - On Apr 16, 2021, at 12:01 PM, paulmck paul...@kernel.org wrote:
> 
> > On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote:
> >> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote:
> >> > Hi Paul, Will, Peter,
> >> > 
> >> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
> >> > is able to break rcu_dereference. This seems to be taken care of by
> >> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
> >> > 
> >> > In the liburcu user-space library, we have this comment near 
> >> > rcu_dereference()
> >> > in
> >> > include/urcu/static/pointer.h:
> >> > 
> >> >  * The compiler memory barrier in CMM_LOAD_SHARED() ensures that
> >> >  value-speculative
> >> >  * optimizations (e.g. VSS: Value Speculation Scheduling) does not 
> >> > perform the
> >> >  * data read before the pointer read by speculating the value of the 
> >> > pointer.
> >> >  * Correct ordering is ensured because the pointer is read as a volatile 
> >> > access.
> >> >  * This acts as a global side-effect operation, which forbids reordering 
> >> > of
> >> >  * dependent memory operations. Note that such concern about 
> >> > dependency-breaking
> >> >  * optimizations will eventually be taken care of by the 
> >> > "memory_order_consume"
> >> >  * addition to forthcoming C++ standard.
> >> > 
> >> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was 
> >> > introduced in
> >> > liburcu as a public API before READ_ONCE() existed in the Linux kernel)
> >> > 
> >> > Peter tells me the "memory_order_consume" is not something which can be 
> >> > used
> >> > today.
> >> > Any information on its status at C/C++ standard levels and 
> >> > implementation-wise ?
> > 
> > Actually, you really can use memory_order_consume.  All current
> > implementations will compile it as if it was memory_order_acquire.
> > This will work correctly, but may be slower than you would like on ARM,
> > PowerPC, and so on.
> > 
> > On things like x86, the penalty is forgone optimizations, so less
> > of a problem there.
> 
> OK
> 
> > 
> >> > Pragmatically speaking, what should we change in liburcu to ensure we 
> >> > don't
> >> > generate
> >> > broken code when LTO is enabled ? I suspect there are a few options here:
> >> > 
> >> > 1) Fail to build if LTO is enabled,
> >> > 2) Generate slower code for rcu_dereference, either on all architectures 
> >> > or only
> >> >on weakly-ordered architectures,
> >> > 3) Generate different code depending on whether LTO is enabled or not. 
> >> > AFAIU
> >> > this would only
> >> >work if every compile unit is aware that it will end up being 
> >> > optimized with
> >> >LTO. Not sure
> >> >how this could be done in the context of user-space.
> >> > 4) [ Insert better idea here. ]
> > 
> > Use memory_order_consume if LTO is enabled.  That will work now, and
> > might generate good code in some hoped-for future.
> 
> In the context of a user-space library, how does one check whether LTO is 
> enabled with
> preprocessor directives ? A quick test with gcc seems to show that both with 
> and without
> -flto cannot be distinguished from a preprocessor POV, e.g. the output of both
> 
> gcc --std=c11 -O2 -dM -E - < /dev/null
> and
> gcc --std=c11 -O2 -flto -dM -E - < /dev/null
> 
> is exactly the same. Am I missing something here ?

No idea.  ;-)

> If we accept to use memory_order_consume all the time in both C and C++ code 
> starting from
> C11 and C++11, the following code snippet could do the trick:
> 
> #define CMM_ACCESS_ONCE(x) (*(__volatile__  __typeof__(x) *)&(x))
> #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p)
> 
> #if defined (__cplusplus)
> # if __cplusplus >= 201103L
> #  include <atomic>
> #  define rcu_dereference(x)    ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume)
> # else
> #  define rcu_dereference(x)    CMM_LOAD_SHARED(x)
> # endif
> #else
> # if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L)
> #  include <stdatomic.h>
> #  define rcu_dereference(x)    atomic_load_explicit(&(x), memory_order_consume)
> # else
> #  define rcu_dereference(x)    CMM_LOAD_SHARED(x)
> # endif
> #endif
> 
> This uses the volatile approach prior to C11/C++11, and moves to 
> memory_order_consume
> afterwards. This will bring a performance penalty on weakly-ordered 
> architectures even
> when -flto is not specified though.
> 
> Then the burden is pushed on the compiler people to eventually implement an 
> efficient
> memory_order_consume.
> 
> Is that acceptable ?

That makes sense to me!

If it can be done reasonably, I suggest also having some way for the
person building userspace RCU to say "I know what I am doing, so do
it with volatile rather than memory_order_consume."

Thanx, Paul

Re: [lttng-dev] liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?

2021-04-16 Thread Paul E. McKenney via lttng-dev
On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote:
> > Hi Paul, Will, Peter,
> > 
> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
> > is able to break rcu_dereference. This seems to be taken care of by
> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
> > 
> > In the liburcu user-space library, we have this comment near 
> > rcu_dereference() in
> > include/urcu/static/pointer.h:
> > 
> >  * The compiler memory barrier in CMM_LOAD_SHARED() ensures that 
> > value-speculative
> >  * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform 
> > the
> >  * data read before the pointer read by speculating the value of the 
> > pointer.
> >  * Correct ordering is ensured because the pointer is read as a volatile 
> > access.
> >  * This acts as a global side-effect operation, which forbids reordering of
> >  * dependent memory operations. Note that such concern about 
> > dependency-breaking
> >  * optimizations will eventually be taken care of by the 
> > "memory_order_consume"
> >  * addition to forthcoming C++ standard.
> > 
> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was 
> > introduced in
> > liburcu as a public API before READ_ONCE() existed in the Linux kernel)
> > 
> > Peter tells me the "memory_order_consume" is not something which can be 
> > used today.
> > Any information on its status at C/C++ standard levels and 
> > implementation-wise ?

Actually, you really can use memory_order_consume.  All current
implementations will compile it as if it was memory_order_acquire.
This will work correctly, but may be slower than you would like on ARM,
PowerPC, and so on.

On things like x86, the penalty is forgone optimizations, so less
of a problem there.

> > Pragmatically speaking, what should we change in liburcu to ensure we don't 
> > generate
> > broken code when LTO is enabled ? I suspect there are a few options here:
> > 
> > 1) Fail to build if LTO is enabled,
> > 2) Generate slower code for rcu_dereference, either on all architectures or 
> > only
> >on weakly-ordered architectures,
> > 3) Generate different code depending on whether LTO is enabled or not. 
> > AFAIU this would only
> >work if every compile unit is aware that it will end up being optimized 
> > with LTO. Not sure
> >how this could be done in the context of user-space.
> > 4) [ Insert better idea here. ]

Use memory_order_consume if LTO is enabled.  That will work now, and
might generate good code in some hoped-for future.

> > Thoughts ?
> 
> Using memory_order_acquire is safe; and is basically what Will did for
> ARM64.
> 
> The problematic tranformations are possible even without LTO, although
> less likely due to less visibility, but everybody agrees they're
> possible and allowed.
> 
> OTOH we do not have a positive sighting of it actually happening (I
> think), we're all just being cautious and not willing to debug the
> resulting wreckage if it does indeed happen.

And yes, you can also use memory_order_acquire.

Thanx, Paul


Re: [lttng-dev] [PATCH urcu 4/4] Don't force a target and optimization level on ARMv7

2020-12-15 Thread Paul E. McKenney via lttng-dev
On Tue, Dec 15, 2020 at 11:28:50AM -0500, Michael Jeanson wrote:
> We shouldn't force a specific target CPU on the compiler for ARMv7, but
> let the system or the user choose it. If some of our code depends on
> specific target CPU features, it should be compile-tested.
> 
> Also remove the default optimisation level of -O1: it is potentially a
> workaround for an early ARMv7 compiler performance problem, and in any
> case most builds will have an optimisation level flag set in CFLAGS,
> which will override this one.

Indeed, the original was based on advice from ARM that has undoubtedly
changed over time, so...

> Signed-off-by: Michael Jeanson 
> Cc: Paul E. McKenney 

Acked-by: Paul E. McKenney 

> Change-Id: I1d1bb5cc0fa0be8f8b1d6a9ad7bf063809be1aef
> ---
>  configure.ac | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/configure.ac b/configure.ac
> index daa967a..f477425 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -119,10 +119,6 @@ AS_CASE([$host],[*-cygwin*],
>   [AM_CONDITIONAL(USE_CYGWIN, false)]
>  )
>  
> -AS_IF([test "$host_cpu" = "armv7l"],[
> - AM_CFLAGS="$AM_CFLAGS -mcpu=cortex-a9 -mtune=cortex-a9 -O1"
> -])
> -
>  # Search for clock_gettime
>  AC_SEARCH_LIBS([clock_gettime], [rt], [
>   AC_DEFINE([CONFIG_RCU_HAVE_CLOCK_GETTIME], [1])
> -- 
> 2.29.2
> 
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH urcu] fix: bump tests thread limit to 256

2020-12-09 Thread Paul E. McKenney via lttng-dev
On Wed, Dec 09, 2020 at 01:29:47PM -0500, Mathieu Desnoyers wrote:
> Hi Paul,
> 
> Should I merge this temporary fix for liburcu tests, or should we go
> for dynamic allocation of the array right away instead ?

Getting something running now is a good thing.  I have occasional access
to a system that could use 512, though.  (448 suffices, but powers of
two and all that.)

Longer term, I agree with dynamic allocation.
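A minimal sketch of that dynamic-allocation direction (hypothetical
names, not the eventual liburcu test code):

	#include <stdlib.h>
	#include <unistd.h>
	#include <pthread.h>

	typedef pthread_t thread_id_t;

	static thread_id_t *__thread_id_map;	/* replaces the fixed array */
	static long nr_threads;

	static void thread_id_map_init(void)
	{
		long n = sysconf(_SC_NPROCESSORS_CONF);

		/* Leave headroom above the CPU count (powers of two and
		 * all that); fall back to 512 if sysconf() fails. */
		nr_threads = (n > 0) ? 4 * n : 512;
		__thread_id_map = calloc(nr_threads, sizeof(*__thread_id_map));
		if (!__thread_id_map)
			abort();
	}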

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> - On Dec 9, 2020, at 1:15 PM, Michael Jeanson mjean...@efficios.com wrote:
> 
> > > Machines with more than 128 CPUs are becoming more common. The proper
> > > fix here would be to dynamically allocate the array, which we will do,
> > > but in the meantime bump the limit to 256 to fix the problem on a
> > > 160-CPU ppc64el system where this was reported.
> > 
> > Signed-off-by: Michael Jeanson 
> > Cc: Paul E. McKenney 
> > Change-Id: Ib3cb5d8cb4515e6f626be33c2685fa38cb081782
> > ---
> > tests/common/api.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tests/common/api.h b/tests/common/api.h
> > index 2b72ec5..b15e588 100644
> > --- a/tests/common/api.h
> > +++ b/tests/common/api.h
> > @@ -108,7 +108,7 @@ static void spin_unlock(spinlock_t *sp)
> > 
> > typedef pthread_t thread_id_t;
> > 
> > -#define NR_THREADS 128
> > +#define NR_THREADS 256
> > 
> > #define __THREAD_ID_MAP_EMPTY ((thread_id_t) 0)
> > #define __THREAD_ID_MAP_WAITING ((thread_id_t) 1)
> > --
> > 2.29.2
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH] call_rcu: Fix race between rcu_barrier() and call_rcu_data_free()

2020-10-26 Thread Paul E. McKenney via lttng-dev
On Mon, Oct 26, 2020 at 03:58:11PM -0400, Mathieu Desnoyers wrote:
> - On Oct 22, 2020, at 6:30 PM, paulmck paul...@kernel.org wrote:
> 
> > The current code can lose RCU callbacks at shutdown time, which can
> > result in hangs.  This lossage can happen as follows:
> > 
> > o   A thread invokes call_rcu_data_free(), which executes up through
> >the wake_call_rcu_thread().  At this point, the call_rcu_data
> >structure has been drained of callbacks, but is still on the
> >call_rcu_data_list.  Note that this thread does not hold the
> >call_rcu_mutex.
> > 
> > o   Another thread invokes rcu_barrier(), which traverses the
> >call_rcu_data_list under the protection of call_rcu_mutex,
> >a list which still includes the above newly drained structure.
> >This thread therefore adds a callback to the newly drained
> >call_rcu_data structure.  It then releases call_rcu_mutex and
> >enters a mystifying loop that does futex stuff.
> > 
> > o   The first thread finishes executing call_rcu_data_free(),
> >which acquires call_rcu_mutex just long enough to remove the
> >newly drained call_rcu_data structure from call_rcu_data_list.
> >Which causes one of the rcu_barrier() invocation's callbacks to
> >be leaked.
> > 
> > o   The second thread's rcu_barrier() invocation never returns
> >resulting in a hang.
> > 
> > This commit therefore changes call_rcu_data_free() to acquire
> > call_rcu_mutex before checking the call_rcu_data structure for callbacks.
> > In the case where there are no callbacks, call_rcu_mutex is held across
> > both the check and the removal from call_rcu_data_list, thus preventing
> > rcu_barrier() from adding a callback in the meantime.  In the case where
> > there are callbacks, call_rcu_mutex must be momentarily dropped across
> > the call to get_default_call_rcu_data(), which can itself acquire
> > call_rcu_mutex.  This momentary drop is not a problem because any
> > callbacks that rcu_barrier() might queue during that period of time will
> > be moved to the default call_rcu_data structure, and the lock will be
> > held across the full time including moving those callbacks and removing
> > the call_rcu_data structure that was passed into call_rcu_data_free()
> > from call_rcu_data_list.
> > 
> > With this fix, a several-hundred-CPU test successfully completes more
> > than 5,000 executions.  Without this fix, it fails within a few tens
> > of executions.  Although the failures happen more quickly on larger
> > systems, in theory this could happen on a single-CPU system, courtesy
> > of preemption.
> 
> I agree with this fix, will merge in liburcu master, stable-0.12, and 
> stable-2.11.
> Out of curiosity, which test is hanging ?  Is it a test which is part of the 
> liburcu
> tree or some out-of-tree test ? I wonder why we did not catch it in our CI 
> [1].

The hung test was from perfbook [1] in the CodeSamples/datastruct/hash
directory.  A repeat-by is as follows:

# Have userspace RCU preinstalled as you wish.
git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
cd CodeSamples
make pthreads
cd datastruct/hash
make
time for ((i = 0; i < 2000; i++)); do echo $i; ./hash_bkt_rcu --schroedinger \
	--nreaders 444 --nupdaters 4 --duration 1000 --updatewait 1 \
	--nbuckets 262144 --elems/writer 65536; done

This normally hangs within a few tens of iterations.  With this patch,
the test passes more than 6,000 iterations.

I have smaller tests that produce this same hang on my 12-CPU laptop,
but with much lower probability.  Here is one example that did hang on
my laptop, and which could be placed into a similar bash loop as above:

hash_bkt_rcu --schroedinger --nreaders 10 --nupdaters 2 --duration 1000 \
	--updatewait 1 --nbuckets 8192 --elems/writer 4096

But I don't have a good estimate of the hang probability, except a
suspicion that it is lower than would be convenient for a CI test.
Attaching to the hung process using gdb did confirm the type of hang,
however.

It might be possible to create a focused test that races rcu_barrier()
against thread exit, where threads are created and exit repeatedly, each
making use of a per-thread call_rcu() worker in the meantime, along the
lines sketched below.
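A minimal sketch of such a focused test (hypothetical test code, not
from this thread; assumes liburcu's legacy urcu.h and urcu/call-rcu.h
interface):

	#include <pthread.h>
	#include <stdlib.h>
	#include <urcu.h>
	#include <urcu/call-rcu.h>

	static void test_cb(struct rcu_head *rhp)
	{
		free(rhp);
	}

	static void *exiting_thread(void *arg)
	{
		struct call_rcu_data *crdp;
		struct rcu_head *rhp = malloc(sizeof(*rhp));

		(void)arg;
		rcu_register_thread();
		crdp = create_call_rcu_data(0, -1);	/* per-thread worker */
		if (!crdp || !rhp)
			abort();
		set_thread_call_rcu_data(crdp);
		call_rcu(rhp, test_cb);			/* queue one callback */
		set_thread_call_rcu_data(NULL);
		call_rcu_data_free(crdp);	/* races with rcu_barrier() */
		rcu_unregister_thread();
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;
		int i;

		for (i = 0; i < 100000; i++) {
			if (pthread_create(&tid, NULL, exiting_thread, NULL))
				abort();
			rcu_barrier();		/* must neither hang nor leak CBs */
			if (pthread_join(tid, NULL))
				abort();
		}
		return 0;
	}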

Thoughts?

Thanx, Paul

[1] git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

> Thanks,
> 
> Mathieu
> 
> [1] https://ci.lttng.org/view/Liburcu/
> 
> > 
> > Signed-off-by: Paul E. McKenney 
> > Cc: Mathieu Desnoyers 
> > Cc: Stephen Hemminger 
> > Cc: Alan Stern 
> > Cc: Lai Jiangshan 

[lttng-dev] [PATCH] call_rcu: Fix race between rcu_barrier() and call_rcu_data_free()

2020-10-22 Thread Paul E. McKenney via lttng-dev
The current code can lose RCU callbacks at shutdown time, which can
result in hangs.  This lossage can happen as follows:

o   A thread invokes call_rcu_data_free(), which executes up through
the wake_call_rcu_thread().  At this point, the call_rcu_data
structure has been drained of callbacks, but is still on the
call_rcu_data_list.  Note that this thread does not hold the
call_rcu_mutex.

o   Another thread invokes rcu_barrier(), which traverses the
call_rcu_data_list under the protection of call_rcu_mutex,
a list which still includes the above newly drained structure.
This thread therefore adds a callback to the newly drained
call_rcu_data structure.  It then releases call_rcu_mutex and
enters a mystifying loop that does futex stuff.

o   The first thread finishes executing call_rcu_data_free(),
which acquires call_rcu_mutex just long enough to remove the
newly drained call_rcu_data structure from call_rcu_data_list.
Which causes one of the rcu_barrier() invocation's callbacks to
be leaked.

o   The second thread's rcu_barrier() invocation never returns
resulting in a hang.

This commit therefore changes call_rcu_data_free() to acquire
call_rcu_mutex before checking the call_rcu_data structure for callbacks.
In the case where there are no callbacks, call_rcu_mutex is held across
both the check and the removal from call_rcu_data_list, thus preventing
rcu_barrier() from adding a callback in the meantime.  In the case where
there are callbacks, call_rcu_mutex must be momentarily dropped across
the call to get_default_call_rcu_data(), which can itself acquire
call_rcu_mutex.  This momentary drop is not a problem because any
callbacks that rcu_barrier() might queue during that period of time will
be moved to the default call_rcu_data structure, and the lock will be
held across the full time including moving those callbacks and removing
the call_rcu_data structure that was passed into call_rcu_data_free()
from call_rcu_data_list.

With this fix, a several-hundred-CPU test successfully completes more
than 5,000 executions.  Without this fix, it fails within a few tens
of executions.  Although the failures happen more quickly on larger
systems, in theory this could happen on a single-CPU system, courtesy
of preemption.

Signed-off-by: Paul E. McKenney 
Cc: Mathieu Desnoyers 
Cc: Stephen Hemminger 
Cc: Alan Stern 
Cc: Lai Jiangshan 
Cc: 
Cc: 

---

 urcu-call-rcu-impl.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/urcu-call-rcu-impl.h b/src/urcu-call-rcu-impl.h
index b6ec6ba..18fd65a 100644
--- a/src/urcu-call-rcu-impl.h
+++ b/src/urcu-call-rcu-impl.h
@@ -772,9 +772,13 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
 		while ((uatomic_read(&crdp->flags) & URCU_CALL_RCU_STOPPED) == 0)
 			(void) poll(NULL, 0, 1);
 	}
+	call_rcu_lock(&call_rcu_mutex);
 	if (!cds_wfcq_empty(&crdp->cbs_head, &crdp->cbs_tail)) {
-		/* Create default call rcu data if need be */
+		call_rcu_unlock(&call_rcu_mutex);
+		/* Create default call rcu data if need be. */
+		/* CBs queued here will be handed to the default list. */
 		(void) get_default_call_rcu_data();
+		call_rcu_lock(&call_rcu_mutex);
 		__cds_wfcq_splice_blocking(&default_call_rcu_data->cbs_head,
 			&default_call_rcu_data->cbs_tail,
 			&crdp->cbs_head, &crdp->cbs_tail);
@@ -783,7 +787,6 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
 		wake_call_rcu_thread(default_call_rcu_data);
 	}
 
-	call_rcu_lock(&call_rcu_mutex);
 	cds_list_del(&crdp->list);
 	call_rcu_unlock(&call_rcu_mutex);
 
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] call_rcu seems inefficient without futex

2020-01-27 Thread Paul E. McKenney
On Mon, Jan 27, 2020 at 10:38:05AM -0500, Mathieu Desnoyers wrote:
> - On Jan 23, 2020, at 7:19 PM, lttng-dev lttng-dev@lists.lttng.org wrote:
> 
> > Hi,
> > 
> > I recently installed knot dns for a very small FreeBSD server. I noticed
> > that it uses a surprising amount of CPU, even when there is no load:
> > about 0.25%. That's not huge, but it seems unnecessarily high when my
> > QPS is less than 0.01.
> > 
> > After some profiling, I came to the conclusion that this is caused by
> > call_rcu_wait using futex_async to repeatedly wait. Since there is no
> > futex on FreeBSD (without the Linux compatibility layer), this
> > effectively turns into a permanent busy waiting loop.
> > 
> > I think futex_noasync can be used here instead. call_rcu_wait is only
> > supposed to be called from call_rcu_thread, never from a signal context.
> > call_rcu calls get_call_rcu_data, which may call
> > get_default_call_rcu_data, which calls pthread_mutex_lock through
> > call_rcu_lock. Therefore, call_rcu is already not async-signal-safe.
> 
> call_rcu() is meant to be async-signal-safe and lock-free after that
> initialization has been performed on first use. Paul, do you know where
> we have documented this in liburcu ?

Lock freedom is the goal, but when not in real-time mode, call_rcu()
does invoke futex_async(), which can acquire locks within the Linux
kernel.

Should BSD instead use POSIX condvars for the call_rcu() waits and
wakeups?
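A minimal sketch of such a condvar-based fallback (hypothetical names,
not liburcu code; note that, unlike a futex, this pair is not
async-signal-safe):

	#include <pthread.h>
	#include <stdint.h>

	static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t wait_cond = PTHREAD_COND_INITIALIZER;

	/* Block until *uaddr no longer contains val. */
	static void condvar_wait(int32_t *uaddr, int32_t val)
	{
		pthread_mutex_lock(&wait_lock);
		while (__atomic_load_n(uaddr, __ATOMIC_SEQ_CST) == val)
			pthread_cond_wait(&wait_cond, &wait_lock);
		pthread_mutex_unlock(&wait_lock);
	}

	/* Callers must update *uaddr before waking, so that waiters
	 * re-checking under the mutex cannot miss the wakeup. */
	static void condvar_wake(void)
	{
		pthread_mutex_lock(&wait_lock);
		pthread_cond_broadcast(&wait_cond);
		pthread_mutex_unlock(&wait_lock);
	}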

> > Also, I think it only makes sense to use call_rcu around an RCU write,
> > which contradicts the README saying that only RCU reads are allowed in
> > signal handlers.

I do not believe that it is always safe to invoke call_rcu() from within
a signal handler.  If you made sure to invoke it outside a signal handler
the first time, and then used real-time mode, that should work.  But in
that case, you aren't invoking the futex code.

> Not sure what you mean by "use call_rcu around an RCU write" ?

I confess to some curiosity on this point as well.  Maybe what is meant
is "around a RCU write" as in "near to an RCU write" as in "in place of
using synchronize_rcu()"?

> Is there anything similar to sys_futex on FreeBSD ?
> 
> It would be good to look into alternative ways to fix this that do not
> involve changing the guarantees provided by call_rcu() for that fallback
> scenario (no futex available). Perhaps in your use-case you may want to
> tweak the retry delay for compat_futex_async(). Currently
> src/compat_futex.c:compat_futex_async() has a 10ms delay. Would 100ms
> be more acceptable ?

If this works for knot dns, it would of course be simpler.

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > I applied "sed -i -e 's/futex_async/futex_noasync/'
> > src/urcu-call-rcu-impl.h" and knot seems to work correctly with only
> > 0.01% CPU now. I also ran tests/unit and tests/regression with default
> > and signal backends and all completed successfully.
> > 
> > I think that the other two usages of futex_async are also a little
> > suspicious, but I didn't look too closely.
> > 
> > Thanks,
> > Alex.
> > ___
> > lttng-dev mailing list
> > lttng-dev@lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] RCU consistency guarantees

2019-12-15 Thread Paul E. McKenney
On Sun, Dec 15, 2019 at 05:10:11PM -0500, Yuxin Ren wrote:
> On Sun, Dec 15, 2019 at 3:30 PM Paul E. McKenney  wrote:
> 
> > On Sat, Dec 14, 2019 at 01:31:31AM -0500, Yuxin Ren wrote:
> > > Hi Paul
> > >
> > > On Sat, Dec 7, 2019 at 5:42 PM Paul E. McKenney 
> > wrote:
> > >
> > > > On Sat, Dec 07, 2019 at 03:04:42PM -0500, Yuxin Ren wrote:
> > > > > Thanks a lot for your help. I have some questions below.
> > > > >
> > > > > On Sat, Dec 7, 2019 at 1:37 AM Paul E. McKenney 
> > > > wrote:
> > > > >
> > > > > > On Fri, Dec 06, 2019 at 07:00:13PM -0500, Yuxin Ren wrote:
> > > > > > > Thanks so much for your great help.
> > > > > > > I definitely will look at those resources and papers!
> > > > > > >
> > > > > > > One more thing that I am confused
> > > > > > > As I mentioned earlier, someone said One key distinction is that
> > both
> > > > > > MVCC
> > > > > > > and RLU provide much stronger consistency guarantees to readers
> > than
> > > > does
> > > > > > > RCU ...) (https://lwn.net/Articles/777036/).
> > > > > >
> > > > > > That someone was in fact me.  ;-)
> > > > > >
> > > > > > > I am not sure if the above statement is correct or not. But in
> > > > general,
> > > > > > > How can we compare RCU consistency guarantees to other techniques
> > > > (such
> > > > > > as
> > > > > > > RLU)?
> > > > > > > How to reason about which one has stronger or weaker guarantees?
> > > > > >
> > > > > > I suggest starting from the use case.  For concreteness, let's
> > assume
> > > > > > that we are using a hash table.  At one extreme, imagine a use
> > case in
> > > > > > which each event makes exactly one hash-table operation.  No
> > > > information
> > > > > > is carried from one event to the next.  (This might well be the
> > case
> > > > > > for simple web server.)  Such a use case cannot tell the difference
> > > > > > between RCU on the one hand and MVCC/RLU on the other.
> > > > > >
> > > > > > At the other extreme, suppose that each event either accesses or
> > > > updates
> > > > > > multiple entries in the hash table.  In this case, MVCC/RLU will
> > rule
> > > > > > out outcomes that RCU would permit.  For example, suppose we had
> > four
> > > > > > events accessing two different elements in different buckets of the
> > > > > > hash table:
> > > > > >
> > > > > > E1: Adds 32 to the hash table.
> > > > > > E2: Adds 1729 to the hash table.
> > > > > > E3: Within a read-side critical section, looks up 32 then
> > 1729.
> > > > > > E4: Within a read-side critical section, looks up 1729
> > then 32.
> > > > > >
> > > > > > Given either MVCC or RLU, it will not be possible for E3 to find
> > 32 but
> > > > > > not 1729 and at the same time for E4 to find 1729 but not 32.
> > Given
> > > > RCU,
> > > > > > this outcome is possible.
> > > > > >
> > > > > When you say "Within a read-side section", do you mean within a
> > single
> > > > same
> > > > > read section? such as
> > > > >
> > > > > > read_lock()
> > > > > > lookup(32)
> > > > > > lookup(1729)
> > > > > > read_unlock()
> > > > >
> > > > >
> > > > > How about putting two lookups into two read-side sections? Do we
> > still
> > > > have
> > > > > the problem, then?
> > > > >
> > > > > > read_lock()
> > > > > > lookup(32)
> > > > > > read_unlock()
> > > > > > read_lock()
> > > > > > lookup(1729)
> > > > > > read_unlock()
> > > >
> > > > Without in any way agreeing with your characterization of this as a
> > > > problem, because rcu_read_lock() and rcu_read_unlock() provide
> > > > absolutely no ordering guarantees in the absence of a grace period,
> > > > any non-grace-period-related reordering that can happen with a single
> > > > RCU read-side critical section can also happen when that critical
> > > > section is split in two as you have done above.

Re: [lttng-dev] RCU consistency guarantees

2019-12-15 Thread Paul E. McKenney
On Sat, Dec 14, 2019 at 01:31:31AM -0500, Yuxin Ren wrote:
> Hi Paul
> 
> On Sat, Dec 7, 2019 at 5:42 PM Paul E. McKenney  wrote:
> 
> > On Sat, Dec 07, 2019 at 03:04:42PM -0500, Yuxin Ren wrote:
> > > Thanks a lot for your help. I have some questions below.
> > >
> > > On Sat, Dec 7, 2019 at 1:37 AM Paul E. McKenney 
> > wrote:
> > >
> > > > On Fri, Dec 06, 2019 at 07:00:13PM -0500, Yuxin Ren wrote:
> > > > > Thanks so much for your great help.
> > > > > I definitely will look at those resources and papers!
> > > > >
> > > > > One more thing that I am confused
> > > > > As I mentioned earlier, someone said One key distinction is that both
> > > > MVCC
> > > > > and RLU provide much stronger consistency guarantees to readers than
> > does
> > > > > RCU ...) (https://lwn.net/Articles/777036/).
> > > >
> > > > That someone was in fact me.  ;-)
> > > >
> > > > > I am not sure if the above statement is correct or not. But in
> > general,
> > > > > How can we compare RCU consistency guarantees to other techniques
> > (such
> > > > as
> > > > > RLU)?
> > > > > How to reason about which one has stronger or weaker guarantees?
> > > >
> > > > I suggest starting from the use case.  For concreteness, let's assume
> > > > that we are using a hash table.  At one extreme, imagine a use case in
> > > > which each event makes exactly one hash-table operation.  No
> > information
> > > > is carried from one event to the next.  (This might well be the case
> > > > for simple web server.)  Such a use case cannot tell the difference
> > > > between RCU on the one hand and MVCC/RLU on the other.
> > > >
> > > > At the other extreme, suppose that each event either accesses or
> > updates
> > > > multiple entries in the hash table.  In this case, MVCC/RLU will rule
> > > > out outcomes that RCU would permit.  For example, suppose we had four
> > > > events accessing two different elements in different buckets of the
> > > > hash table:
> > > >
> > > > E1: Adds 32 to the hash table.
> > > > E2: Adds 1729 to the hash table.
> > > > E3: Within a read-side critical section, looks up 32 then 1729.
> > > > E4: Within a read-side critical section, looks up 1729 then 32.
> > > >
> > > > Given either MVCC or RLU, it will not be possible for E3 to find 32 but
> > > > not 1729 and at the same time for E4 to find 1729 but not 32.  Given
> > RCU,
> > > > this outcome is possible.
> > > >
> > > When you say "Within a read-side section", do you mean within a single
> > same
> > > read section? such as
> > >
> > > > read_lock()
> > > > lookup(32)
> > > > lookup(1729)
> > > > read_unlock()
> > >
> > >
> > > How about putting two lookups into two read-side sections? Do we still
> > have
> > > the problem, then?
> > >
> > > > read_lock()
> > > > lookup(32)
> > > > read_unlock()
> > > > read_lock()
> > > > lookup(1729)
> > > > read_unlock()
> >
> > Without in any way agreeing with your characterization of this as a
> > problem, because rcu_read_lock() and rcu_read_unlock() provide
> > absolutely no ordering guarantees in the absence of a grace period,
> > any non-grace-period-related reordering that can happen with a single
> > RCU read-side critical section can also happen when that critical
> > section is split in two as you have done above.
> >
> > > Could you kindly give me more clues why RCU can see such reorder, while
> > RLU
> > > can prevent it?
> >
> > Here are minimal C-language implementations for RCU that can (and are)
> > actually used:
> >
> Great. We use the same thing in our real-time work [1]

It has been a popular choice for 40 years now.  ;-)

> > #define rcu_read_lock()
> > #define rcu_read_unlock()
> >
> > Please compare these to the read-side markers presented in the RLU paper,
> > and then tell me your thoughts on the answer to your question.  ;-)
> >
> I submit my homework here, but I do not think I did it well.
> 1. I believe that in the default URCU implementation, there are memory
> barriers inside read_lock / read_unlock.

It certainly wa

Re: [lttng-dev] RCU consistency guarantees

2019-12-09 Thread Paul E. McKenney
On Sat, Dec 07, 2019 at 03:04:42PM -0500, Yuxin Ren wrote:
> Thanks a lot for your help. I have some questions below.
> 
> On Sat, Dec 7, 2019 at 1:37 AM Paul E. McKenney  wrote:
> 
> > On Fri, Dec 06, 2019 at 07:00:13PM -0500, Yuxin Ren wrote:
> > > Thanks so much for your great help.
> > > I definitely will look at those resources and papers!
> > >
> > > One more thing that I am confused
> > > As I mentioned earlier, someone said One key distinction is that both
> > MVCC
> > > and RLU provide much stronger consistency guarantees to readers than does
> > > RCU ...) (https://lwn.net/Articles/777036/).
> >
> > That someone was in fact me.  ;-)
> >
> > > I am not sure if the above statement is correct or not. But in general,
> > > How can we compare RCU consistency guarantees to other techniques (such
> > as
> > > RLU)?
> > > How to reason about which one has stronger or weaker guarantees?
> >
> > I suggest starting from the use case.  For concreteness, let's assume
> > that we are using a hash table.  At one extreme, imagine a use case in
> > which each event makes exactly one hash-table operation.  No information
> > is carried from one event to the next.  (This might well be the case
> > for simple web server.)  Such a use case cannot tell the difference
> > between RCU on the one hand and MVCC/RLU on the other.
> >
> > At the other extreme, suppose that each event either accesses or updates
> > multiple entries in the hash table.  In this case, MVCC/RLU will rule
> > out outcomes that RCU would permit.  For example, suppose we had four
> > events accessing two different elements in different buckets of the
> > hash table:
> >
> > E1: Adds 32 to the hash table.
> > E2: Adds 1729 to the hash table.
> > E3: Within a read-side critical section, looks up 32 then 1729.
> > E4: Within a read-side critical section, looks up 1729 then 32.
> >
> > Given either MVCC or RLU, it will not be possible for E3 to find 32 but
> > not 1729 and at the same time for E4 to find 1729 but not 32.  Given RCU,
> > this outcome is possible.
> >
> When you say "Within a read-side section", do you mean within a single same
> read section? such as
> 
> > read_lock()
> > lookup(32)
> > lookup(1729)
> > read_unlock()
> 
> 
> How about putting two lookups into two read-side sections? Do we still have
> the problem, then?
> 
> > read_lock()
> > lookup(32)
> > read_unlock()
> > read_lock()
> > lookup(1729)
> > read_unlock()

Without in any way agreeing with your characterization of this as a
problem, because rcu_read_lock() and rcu_read_unlock() provide
absolutely no ordering guarantees in the absence of a grace period,
any non-grace-period-related reordering that can happen with a single
RCU read-side critical section can also happen when that critical
section is split in two as you have done above.

> Could you kindly give me more clues why RCU can see such reorder, while RLU
> can prevent it?

Here are minimal C-language implementations for RCU that can (and are)
actually used:

#define rcu_read_lock()
#define rcu_read_unlock()

Please compare these to the read-side markers presented in the RLU paper,
and then tell me your thoughts on the answer to your question.  ;-)

> > This is because MVCC and RLU provide readers a consistent view of
> > the updates, and RCU does not.  Of course, it is often the case that a
> > consistent view is not needed, in which case the MVCC and RLU guarantees
> > are incurring read-side overhead for no reason.  But if the use case
> > requires consistent readers, RCU is not an option.
> >
> > The reason a consistent view is not always needed is that speed-of-light
> > delays make it impossible to provide a consistent view of the outside
> > world.  In the common case where the use case interacts with the
> > outside world, the algorithms absolutely must be designed to tolerate
> > inconsistency, which opens the door to things like RCU.
> 
> I am confused here. I think speed-of-light delays happen everywhere, not
> only bound to RCU, but also  any other synchronization approach (such RLU).
> If so, how do others (RLU) provide consistent views?

You just stated the answer.  Now it is only necessary for you to invest
the time, effort, and thought to fully understand it.  To help with this,
the following paragraph provides another hint:

Yes, you are quite right, speed-of-light delays between the
outside world and the computer affect RLU just as surely as they
do RCU.  This means that the additi

Re: [lttng-dev] RCU consistency guarantees

2019-12-09 Thread Paul E. McKenney
On Fri, Dec 06, 2019 at 07:00:13PM -0500, Yuxin Ren wrote:
> Thanks so much for your great help.
> I definitely will look at those resources and papers!
> 
> One more thing that I am confused
> As I mentioned earlier, someone said One key distinction is that both MVCC
> and RLU provide much stronger consistency guarantees to readers than does
> RCU ...) (https://lwn.net/Articles/777036/).

That someone was in fact me.  ;-)

> I am not sure if the above statement is correct or not. But in general,
> How can we compare RCU consistency guarantees to other techniques (such as
> RLU)?
> How to reason about which one has stronger or weaker guarantees?

I suggest starting from the use case.  For concreteness, let's assume
that we are using a hash table.  At one extreme, imagine a use case in
which each event makes exactly one hash-table operation.  No information
is carried from one event to the next.  (This might well be the case
for a simple web server.)  Such a use case cannot tell the difference
between RCU on the one hand and MVCC/RLU on the other.

At the other extreme, suppose that each event either accesses or updates
multiple entries in the hash table.  In this case, MVCC/RLU will rule
out outcomes that RCU would permit.  For example, suppose we had four
events accessing two different elements in different buckets of the
hash table:

E1: Adds 32 to the hash table.
E2: Adds 1729 to the hash table.
E3: Within a read-side critical section, looks up 32 then 1729.
E4: Within a read-side critical section, looks up 1729 then 32.

Given either MVCC or RLU, it will not be possible for E3 to find 32 but
not 1729 and at the same time for E4 to find 1729 but not 32.  Given RCU,
this outcome is possible.
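A minimal sketch of E1-E4 using liburcu's public primitives, with two
hypothetical global pointers standing in for the two hash buckets
(caller threads are assumed to be registered with rcu_register_thread()):

	#include <urcu.h>

	int *slot32, *slot1729;		/* hypothetical bucket stand-ins */

	void e1(int *p) { rcu_assign_pointer(slot32, p); }
	void e2(int *p) { rcu_assign_pointer(slot1729, p); }

	void e3(int **r32, int **r1729)	/* looks up 32 then 1729 */
	{
		rcu_read_lock();
		*r32 = rcu_dereference(slot32);
		*r1729 = rcu_dereference(slot1729);
		rcu_read_unlock();
	}

	void e4(int **r32, int **r1729)	/* looks up 1729 then 32 */
	{
		rcu_read_lock();
		*r1729 = rcu_dereference(slot1729);
		*r32 = rcu_dereference(slot32);
		rcu_read_unlock();
	}

	/* Under RCU, E3 may see slot32 set but slot1729 still NULL while
	 * E4 concurrently sees slot1729 set but slot32 still NULL; MVCC
	 * and RLU forbid that joint outcome. */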

This is because MVCC and RLU provide readers a consistent view of
the updates, and RCU does not.  Of course, it is often the case that a
consistent view is not needed, in which case the MVCC and RLU guarantees
are incurring read-side overhead for no reason.  But if the use case
requires consistent readers, RCU is not an option.

The reason a consistent view is not always needed is that speed-of-light
delays make it impossible to provide a consistent view of the outside
world.  In the common case where the use case interacts with the
outside world, the algorithms absolutely must be designed to tolerate
inconsistency, which opens the door to things like RCU.

Thanx, Paul

> Thanks
> Yuxin
> 
> On Fri, Dec 6, 2019 at 11:30 AM Paul E. McKenney  wrote:
> 
> > On Fri, Dec 06, 2019 at 10:59:05AM -0500, Mathieu Desnoyers wrote:
> > > - On Dec 6, 2019, at 3:51 PM, Yuxin Ren  wrote:
> > >
> > > > On Fri, Dec 6, 2019 at 5:49 AM Mathieu Desnoyers
> > > > <mathieu.desnoy...@efficios.com> wrote:
> > >
> > > >> - On Dec 5, 2019, at 8:17 PM, Yuxin Ren <r...@gwmail.gwu.edu> wrote:
> > >
> > > >>> Hi,
> > > >>> I am a student, and learning RCU now, but still know very little
> > about it.
> > > >>> Are there any documents/papers/materials which (in)formally define
> > and explain
> > > >>> RCU consistency guarantees?
> > >
> > > >> You may want to have a look at
> > >
> > > >> User-Level Implementations of Read-Copy Update
> > > >> Article in IEEE Transactions on Parallel and Distributed Systems
> > 23(2):375 - 382
> > > >> · March 2012
> > >
> > > > Thanks for your info.
> > > > However, I do not think URCU talks about any consistency model
> > formally.
> > >
> > > > From previous communication with Paul, he said RCU is not designed for
> > > > linearizability, and it is totally acceptable that RCU is not
> > linearizable.
> > > > However, I am curious how to accurately/formally Characterize RCU
> > consistency
> > > > model/guarantees
> > >
> > > Adding Paul E. McKenney in CC.
> > >
> > > I am referring to the section "Overview of RCU semantics" in the paper.
> > Not sure it has the level of
> > > formality you are looking for though. Paul, do you have pointers to
> > additional material ?
> >
> > Indeed I do!  The Linux kernel memory model (LKMM) includes RCU.  It is
> > in tools/memory-model in recent kernel source trees, which includes
> > documentation.  This is an executable model, which means that you
> > can create litmus tests and have the model formally and automatically
> > evaluate them.

Re: [lttng-dev] RCU consistency guarantees

2019-12-06 Thread Paul E. McKenney
On Fri, Dec 06, 2019 at 10:59:05AM -0500, Mathieu Desnoyers wrote:
> - On Dec 6, 2019, at 3:51 PM, Yuxin Ren  wrote: 
> 
> > On Fri, Dec 6, 2019 at 5:49 AM Mathieu Desnoyers
> > <mathieu.desnoy...@efficios.com> wrote:
> 
> >> - On Dec 5, 2019, at 8:17 PM, Yuxin Ren <r...@gwmail.gwu.edu> wrote:
> 
> >>> Hi,
> >>> I am a student, and learning RCU now, but still know very little about it.
> >>> Are there any documents/papers/materials which (in)formally define and 
> >>> explain
> >>> RCU consistency guarantees?
> 
> >> You may want to have a look at
> 
> >> User-Level Implementations of Read-Copy Update
> >> Article in IEEE Transactions on Parallel and Distributed Systems 23(2):375 
> >> - 382
> >> · March 2012
> 
> > Thanks for your info.
> > However, I do not think URCU talks about any consistency model formally.
> 
> > From previous communication with Paul, he said RCU is not designed for
> > linearizability, and it is totally acceptable that RCU is not linearizable.
> > However, I am curious how to accurately/formally Characterize RCU 
> > consistency
> > model/guarantees
> 
> Adding Paul E. McKenney in CC. 
> 
> I am referring to the section "Overview of RCU semantics" in the paper. Not 
> sure it has the level of 
> formality you are looking for though. Paul, do you have pointers to 
> additional material ? 

Indeed I do!  The Linux kernel memory model (LKMM) includes RCU.  It is
in tools/memory-model in recent kernel source trees, which includes
documentation.  This is an executable model, which means that you
can create litmus tests and have the model formally and automatically
evaluate them.

There are also a number of publications covering LKMM:

o   A formal kernel memory-ordering model
https://lwn.net/Articles/718628/
https://lwn.net/Articles/720550/

These cover the release stores and dependency ordering that
provide RCU's publish-subscribe guarantees.

Backup material here:


https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/

With these two likely being of particular interest:


https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/RCUguarantees.html

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/srcu.html

o   Frightening Small Children and Disconcerting Grown-ups: Concurrency in 
the Linux Kernel
https://dl.acm.org/citation.cfm?id=3177156
http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/asplos18.pdf

Backup material:

http://diy.inria.fr/linux/

o   Who's afraid of a big bad optimizing compiler?
https://lwn.net/Articles/793253/

o   Calibrating your fear of big bad optimizing compilers
https://lwn.net/Articles/799218/

These last two justify use of normal C-language assignment
statements to initialize and access data referenced by
RCU-protected pointers.

There is a large body of litmus tests (thousands of them) here:

https://github.com/paulmckrcu/litmus

Many of these litmus tests involve RCU, and these can be located by
searching for files containing rcu_read_lock(), rcu_read_unlock(),
synchronize_rcu(), and so on.
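For example, a canonical RCU litmus test in LKMM's C-litmus format,
which the model evaluates as never-satisfiable:

	C RCU-sketch

	(* A reader cannot span a grace period: if it sees the write
	 * following synchronize_rcu(), it must also see the write that
	 * preceded it. *)

	{}

	P0(int *x, int *y)
	{
		int r0;
		int r1;

		rcu_read_lock();
		r0 = READ_ONCE(*x);
		r1 = READ_ONCE(*y);
		rcu_read_unlock();
	}

	P1(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		synchronize_rcu();
		WRITE_ONCE(*y, 1);
	}

	exists (0:r0=0 /\ 0:r1=1)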

Or were you looking for something else?

Thanx, Paul

> Thanks, 
> 
> Mathieu 
> 
> >> as a starting point.
> 
> >> Thanks,
> 
> >> Mathieu
> 
> >>> I know there are some consistency models in the database area
> >>> (such as PRAM, Read Uncommitted, etc.) from
> >>> https://jepsen.io/consistency and [1].
> >>> How does RCU relate to those consistency models?
> 
> >>> I also found some comments online (One key distinction is that
> >>> both MVCC and RLU provide much stronger consistency guarantees to
> >>> readers than does RCU ...) ( https://lwn.net/Articles/777036/ ).
> >>> I do not understand how we reason about/describe/compare the
> >>> consistency guarantees.  (I do not even know, formally, what
> >>> consistency guarantees are provided by RCU.)
> >>> Could someone explain this to me?
> 
> >>> [1] Bailis, P., Davidson, A., Fekete, A., Ghodsi, A., Hellerstein,
> >>> J. M., & Stoica, I. (2013). Highly available transactions: Virtues
> >>> and limitations. Proc. VLDB Endowment 7(3).

Re: [lttng-dev] large liblttng-ust startup overhead (ust_lock)

2017-09-06 Thread Paul E. McKenney
BTW, your expedited commit hit mainline earlier this week.  Here is
hoping!  ;-)

Thanx, Paul

On Wed, Sep 06, 2017 at 08:23:40PM +0000, Mathieu Desnoyers wrote:
> - On Sep 6, 2017, at 3:57 PM, Mathieu Desnoyers 
> mathieu.desnoy...@efficios.com wrote:
> 
> > - On Sep 6, 2017, at 3:35 AM, Milian Wolff milian.wo...@kdab.com wrote:
> > 
> >> On Dienstag, 5. September 2017 20:11:58 CEST Mathieu Desnoyers wrote:
> >>> - On Sep 5, 2017, at 11:08 AM, Milian Wolff milian.wo...@kdab.com 
> >>> wrote:
> >>> > On Tuesday, September 5, 2017 4:51:42 PM CEST Mathieu Desnoyers wrote:
> >>> >> - On Sep 5, 2017, at 10:34 AM, Milian Wolff milian.wo...@kdab.com
> >> wrote:
> >>> >> > Hey all,
> >>> >> > 
> >>> >> > I have noticed a very large overhead when linking against 
> >>> >> > liblttng-ust:
> >>> >> > 
> >>> >> > ~
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ cat lttng-test.c
> >>> >> > int main()
> >>> >> > {
> >>> >> > 
> >>> >> >  return 0;
> >>> >> > 
> >>> >> > }
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl lttng-test.c
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> > 
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> > 
> >>> >> >       0.209587  task-clock (msec)  # 0.596 CPUs utilized    ( +-  8.76% )
> >>> >> >              0  context-switches   # 0.000 K/sec
> >>> >> >              0  cpu-migrations     # 0.000 K/sec
> >>> >> >             49  page-faults        # 0.235 M/sec            ( +-  1.19% )
> >>> >> >        706,854  cycles             # 3.373 GHz              ( +-  8.82% )
> >>> >> >        773,603  instructions       # 1.09  insn per cycle   ( +-  0.75% )
> >>> >> >        147,128  branches           # 701.987 M/sec          ( +-  0.70% )
> >>> >> >          4,096  branch-misses      # 2.78% of all branches  ( +-  5.27% )
> >>> >> > 
> >>> >> >    0.000351422 seconds time elapsed                         ( +- 11.85% )
> >>> >> > 
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ gcc -O2 -g -ldl -llttng-ust lttng-test.c
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ perf stat -r 5 ./a.out
> >>> >> > 
> >>> >> > Performance counter stats for './a.out' (5 runs):
> >>> >> > 
> >>> >> >       2.063040  task-clock (msec)  # 0.009 CPUs utilized    ( +-  1.37% )
> >>> >> >             44  context-switches   # 0.021 M/sec            ( +-  1.95% )
> >>> >> >              2  cpu-migrations     # 0.776 K/sec            ( +- 25.00% )
> >>> >> >            209  page-faults        # 0.101 M/sec            ( +-  0.34% )
> >>> >> >      7,053,686  cycles             # 3.419 GHz              ( +-  2.03% )
> >>> >> >      6,893,783  instructions       # 0.98  insn per cycle   ( +-  0.25% )
> >>> >> >      1,342,492  branches           # 650.735 M/sec          ( +-  0.20% )
> >>> >> >         29,390  branch-misses      # 2.19% of all branches  ( +-  0.61% )
> >>> >> > 
> >>> >> >    0.225597302 seconds time elapsed                         ( +-  6.68% )
> >>> >> > ~
> >>> >> > 
> >>> >> > This is without any LTTng session configured. If I enable
> >>> >> > LTTng kernel and userspace events, this becomes even worse:
> >>> >> > 
> >>> >> > ~
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ cat $(which run_lttng_trace.sh)
> >>> >> > #!/bin/sh
> >>> >> > 
> >>> >> > if [ -z "$(pidof lttng-sessiond)" ]; then
> >>> >> > 
> >>> >> >sudo lttng-sessiond --daemonize
> >>> >> > 
> >>> >> > fi
> >>> >> > 
> >>> >> > sudo lttng create -o ~/lttng-traces/$(date -Iseconds)
> >>> >> > sudo lttng enable-channel kernel -k --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c kernel -k -a
> >>> >> > sudo lttng enable-channel ust -u --subbuf-size 16M --num-subbuf 8
> >>> >> > sudo lttng enable-event -c ust -u lttng_ust_tracef:*
> >>> >> > sudo lttng start
> >>> >> > $@
> >>> >> > sudo lttng stop
> >>> >> > 
> >>> >> > sudo chmod a+rx -R ~/lttng-traces
> >>> >> > ┌milian@milian-kdab2:/tmp
> >>> >> > └$ run_lttng_trace.sh perf stat -r 5 ./a.out
> >>> >> > Session auto-20170905-162818 created.
> >>> >> > Traces will be written in
> >>> >> > 

Re: [lttng-dev] [RFC PATCH liburcu 0/2] Remove RCU requirements on hash table destroy

2017-06-05 Thread Paul E. McKenney
On Tue, May 30, 2017 at 05:10:18PM -0400, Mathieu Desnoyers wrote:
> The RCU lock-free hash table currently requires that the destroy
> function should not be called from within RCU read-side critical
> sections. This is caused by the lazy resize, which uses the call_rcu
> worker thread, even though all it really needs is a workqueue/worker
> thread scheme.
> 
> Implement an internal workqueue API in liburcu, and use it instead of
> call_rcu in rculfhash to overcome this limitation.

Took a quick look, and it appears plausible.

Some opportunity to share CPU-affinity code between this and the
call_rcu() code, FWIW.  Two of the system-call stubs look to be identical
other than the system call (EINTR checks and soforth), but I am not sure
that it is worth combining them.

Thanx, Paul

> Mathieu Desnoyers (2):
>   Implement urcu workqueues internal API
>   Use workqueue in rculfhash
> 
>  include/urcu/rculfhash.h |  15 +-
>  src/Makefile.am  |   2 +-
>  src/rculfhash-internal.h |   2 +-
>  src/rculfhash.c  | 124 ++--
>  src/workqueue.c  | 507 
> +++
>  src/workqueue.h  | 104 ++
>  6 files changed, 686 insertions(+), 68 deletions(-)
>  create mode 100644 src/workqueue.c
>  create mode 100644 src/workqueue.h
> 
> -- 
> 2.1.4
> 

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] question about the RCU variant in CITRUS tree paper

2017-05-12 Thread Paul E. McKenney
On Fri, May 12, 2017 at 02:41:23PM -0400, Yuxin Ren wrote:
> Hi Paul,
> 
> Thank you for your reply.
> 
> If I understand your reply correctly, the update-side lock you
> mentioned is the lock used in the tree deletion algorithm.

Yes.

> But their urcu_synchronize contains no lock.
> So I think the lock is kind of problem caused by their usage of RCU,
> not from their urcu_synchronize implementation.

Yes again.  They are worried about things like two different reader
threads disagreeing about the order in which two different updater threads
added elements to a linked RCU-protected data structure.  In theory,
this sort of disagreement is a huge problem, but in practice almost no
one cares.

If you do care, one way to avoid the problem is to hold your update-side
lock (in this case, the one in their deletion algorithm) across the
grace period.  If you don't care, which is almost always the case in
practice, you release the lock first, and only then wait for the
grace period.
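A minimal sketch of the two orderings (hypothetical names throughout):

	#include <pthread.h>
	#include <stdlib.h>
	#include <urcu.h>

	struct node;				/* hypothetical */
	extern pthread_mutex_t update_lock;	/* hypothetical */
	extern void remove_node(struct node *);	/* hypothetical */

	/* As in the paper: lock held across the grace period, giving
	 * linearizable updates at a large cost in update throughput. */
	void delete_linearizable(struct node *np)
	{
		pthread_mutex_lock(&update_lock);
		remove_node(np);
		synchronize_rcu();		/* lock still held */
		free(np);
		pthread_mutex_unlock(&update_lock);
	}

	/* Conventional RCU style: release first, then wait. */
	void delete_conventional(struct node *np)
	{
		pthread_mutex_lock(&update_lock);
		remove_node(np);
		pthread_mutex_unlock(&update_lock);
		synchronize_rcu();
		free(np);
	}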

> I want to compare their RCU implementation with the U-RCU
> implementation, because the authors argued their implementation
> performs better than U-RCU.
> Is it possible to use their new RCU implementation as a drop-in
> replacement for U-RCU?

Probably, but you should trust actually trying it more than you trust
my answer to that question.  ;-)

Thanx, Paul

> I am relative new to RCU, so my question could be stupid.
> Many thanks for your time
> Yuxin
> 
> On Thu, May 11, 2017 at 4:23 PM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> > On Thu, May 11, 2017 at 04:05:45PM -0400, Yuxin Ren wrote:
> >> Hi,
> >>
> >> I am learning U-RCU now.
> >> And I read paper Concurrent Updates with RCU: Search Tree as an Example
> >> ( 
> >> https://pdfs.semanticscholar.org/73e4/cd29273cf9d98d35bc184330e694ba798987.pdf
> >> )
> >>
> >> In this paper, the authors present a variant RCU implementation, and
> >> argued their new RCU has better performance than default U-RCU.
> >>
> >> Do you think their argument and implementation are correct in all cases?
> >> If they are right, will you want to integrate their improvement into
> >> the U-RCU implementation?
> >>
> >> For your convenience, I paste the related text from the paper here.
> >> "In our implementation, each thread has a counter and flag, the
> >> counter counts the number of critical sections executed by the thread
> >> and a flag indicates if the thread is currently inside its read-side
> >> critical section. The rcu_read_lock operation increments the counter
> >> and sets the flag to true, while the rcu_read_unlock operation sets
> >> the flag to false. When a thread executes a synchronize_rcu operation,
> >> it waits for every other thread, until one of two things occurs:
> >> either the thread has increased its counter or the thread’s flag is
> >> set to false. "
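A minimal sketch of the quoted scheme (hypothetical names; memory
ordering is only roughed in with seq_cst fences, so treat it as
annotated pseudocode rather than the paper's actual code):

	#include <stdbool.h>

	#define MAX_THREADS 128

	static struct {
		volatile unsigned long ctr;	/* sections entered */
		volatile bool inside;		/* in a read side now? */
	} readers[MAX_THREADS];

	static void variant_read_lock(int tid)
	{
		readers[tid].ctr++;
		readers[tid].inside = true;
		__atomic_thread_fence(__ATOMIC_SEQ_CST);
	}

	static void variant_read_unlock(int tid)
	{
		__atomic_thread_fence(__ATOMIC_SEQ_CST);
		readers[tid].inside = false;
	}

	static void variant_synchronize_rcu(int self)
	{
		for (int t = 0; t < MAX_THREADS; t++) {
			unsigned long snap = readers[t].ctr;

			if (t == self)
				continue;
			/* Wait until t exits its section or enters a new one. */
			while (readers[t].inside && readers[t].ctr == snap)
				;	/* a real version would yield */
		}
	}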
> >>
> >> One its implementation can be found from synchrobench
> >> https://github.com/gramoli/synchrobench/blob/master/c-cpp/src/trees/tree-lock/new_urcu.c
> >
> > I covered this one here:  https://lwn.net/Articles/667593/
> >
> > The short version is that they are working around what I consider to
> > be a design bug in their algorithm, namely that they are holding the
> > update-side lock across RCU grace periods.  They do this to achieve
> > linearizability, which is prized by many conference referees/reviewers,
> > but not as useful in practice as is commonly supposed.
> >
> > But it does have a broken URL to the paper, so I will send your working
> > version to the LWN editors CCing you.  Thank you for that!
> >
> > Thanx, Paul
> >
> 

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] question about the RCU variant in CITRUS tree paper

2017-05-11 Thread Paul E. McKenney
On Thu, May 11, 2017 at 04:05:45PM -0400, Yuxin Ren wrote:
> Hi,
> 
> I am learning U-RCU now.
> And I read paper Concurrent Updates with RCU: Search Tree as an Example
> ( 
> https://pdfs.semanticscholar.org/73e4/cd29273cf9d98d35bc184330e694ba798987.pdf
> )
> 
> In this paper, the authors present a variant RCU implementation, and
> argued their new RCU has better performance than default U-RCU.
> 
> Do you think their argument and implementation are correct in all cases?
> If they are right, will you want to integrate their improvement into the
> U-RCU implementation?
> 
> For your convenience, I paste the related text from the paper here.
> "In our implementation, each thread has a counter and flag, the
> counter counts the number of critical sections executed by the thread
> and a flag indicates if the thread is currently inside its read-side
> critical section. The rcu_read_lock operation increments the counter
> and sets the flag to true, while the rcu_read_unlock operation sets
> the flag to false. When a thread executes a synchronize_rcu operation,
> it waits for every other thread, until one of two things occurs:
> either the thread has increased its counter or the thread’s flag is
> set to false. "
> 
> One its implementation can be found from synchrobench
> https://github.com/gramoli/synchrobench/blob/master/c-cpp/src/trees/tree-lock/new_urcu.c

I covered this one here:  https://lwn.net/Articles/667593/

The short version is that they are working around what I consider to
be a design bug in their algorithm, namely that they are holding the
update-side lock across RCU grace periods.  They do this to achieve
linearizability, which is prized by many conference referees/reviewers,
but not as useful in practice as is commonly supposed.

But it does have a broken URL to the paper, so I will send your working
version to the LWN editors CCing you.  Thank you for that!

Thanx, Paul

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] liburcu rcu_xchg_pointer and rcu_cmpxchg_pointer ARM32 barriers

2016-12-05 Thread Paul E. McKenney
On Mon, Dec 05, 2016 at 11:01:10PM +0000, Mathieu Desnoyers wrote:
> - On Dec 5, 2016, at 5:35 PM, Paul E. McKenney paul...@linux.vnet.ibm.com 
> wrote:
> 
> > On Mon, Dec 05, 2016 at 02:14:47PM +0000, Mathieu Desnoyers wrote:
> >> Hi Paul,
> >> 
> >> So about the liburcu rcu_xchg_pointer() barriers, here is the current
> >> situation:
> >> 
> >> rcu_xchg_pointer is implemented as:
> >> 
> >> #define _rcu_xchg_pointer(p, v) \
> >> __extension__   \
> >> ({  \
> >> __typeof__(*p) _pv = (v);   \
> >> if (!__builtin_constant_p(v) || \
> >> ((v) != NULL))  \
> >> cmm_wmb();  \
> >> uatomic_xchg(p, _pv);   \
> >> })
> >> 
> >> So we actually add a write barrier before the uatomic_xchg(),
> >> which should not be required if we consider that uatomic_xchg()
> >> *should* imply a full barrier before/after.
> >> 
> >> But in reality, it's ARM32 uatomic_xchg() which does not fulfill
> >> its contract, due to __sync_lock_test_and_set being only
> >> an acquire barrier [1]. So the extra cmm_wmb() is what saved
> >> us here for rcu_xchg_pointer().
> >> 
> >> The code currently generated by rcu_xchg_pointer() looks like:
> >> 
> >>11000:   f3bf 8f5f       dmb sy
> >>11004:   e857 ef00       ldrex   lr, [r7]
> >>11008:   e847 0300       strex   r3, r0, [r7]
> >>1100c:   2b00            cmp     r3, #0
> >>1100e:   d1f9            bne.n   11004 <thr_writer+0x70>
> >>11010:   f3bf 8f5b       dmb ish
> >> 
> >> 
> >> Looking at the cmpxchg variant:
> >> 
> >> #define _rcu_cmpxchg_pointer(p, old, _new)  \
> >> __extension__   \
> >> ({  \
> >> __typeof__(*p) _pold = (old);   \
> >> __typeof__(*p) _pnew = (_new);  \
> >> if (!__builtin_constant_p(_new) ||  \
> >> ((_new) != NULL))   \
> >> cmm_wmb();  \
> >> uatomic_cmpxchg(p, _pold, _pnew);   \
> >> })
> >> 
> >> We also notice a cmm_wmb() before what should imply a full barrier
> >> (uatomic_cmpxchg). The latter is implemented with 
> >> __sync_val_compare_and_swap_N,
> >> which should imply a full barrier based on [1] (which is as vague as it
> >> gets). Looking at the generated code, we indeed have two barriers before:
> >> 
> >>11000:   f3bf 8f5f       dmb sy
> >>11004:   f3bf 8f5b       dmb ish
> >>11008:   e857 ef00       ldrex   lr, [r7]
> >>1100c:   45c6            cmp     lr, r8
> >>1100e:   d103            bne.n   11018 <thr_writer+0x84>
> >>11010:   e847 0300       strex   r3, r0, [r7]
> >>11014:   2b00            cmp     r3, #0
> >>11016:   d1f7            bne.n   11008 <thr_writer+0x74>
> >>11018:   f3bf 8f5b       dmb ish
> >> 
> >> So for stable-0.8 and stable-0.9, I would be tempted to err on
> >> the safe side and simply add the missing cmm_smp_mb() within
> >> uatomic_xchg() before the __sync_lock_test_and_set().
> >> 
> >> For the master branch, in addition to adding the missing cmm_smp_mb()
> >> to uatomic_xchg(), we could remove the redundant cmm_wmb() in
> >> rcu_cmpxchg_pointer and rcu_xchg_pointer.
> >> 
> >> Thoughts ?
> > 
> > Seems reasonable to me.  It is the x86 guys who might have objections,
> > given that the extra barrier costs them but has no effect.  ;-)
> 
> This barrier is only added to the asm-specific code of uatomic_xchg and 
> uatomic_cmpxchg(),
> and has no impact on x86, so we should be good.
> 
> Actually, removing the explicit wmb() from rcu_cmpxchg_pointer() and 
> rcu_xchg_pointer()
> will even speed up those operations on x86.

Even better!

Thanx, Paul

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] liburcu rcu_xchg_pointer and rcu_cmpxchg_pointer ARM32 barriers

2016-12-05 Thread Paul E. McKenney
On Mon, Dec 05, 2016 at 02:14:47PM +0000, Mathieu Desnoyers wrote:
> Hi Paul,
> 
> So about the liburcu rcu_xchg_pointer() barriers, here is the current
> situation:
> 
> rcu_xchg_pointer is implemented as:
> 
> #define _rcu_xchg_pointer(p, v) \
> __extension__   \
> ({  \
> __typeof__(*p) _pv = (v);   \
> if (!__builtin_constant_p(v) || \
> ((v) != NULL))  \
> cmm_wmb();  \
> uatomic_xchg(p, _pv);   \
> })
> 
> So we actually add a write barrier before the uatomic_xchg(),
> which should not be required if we consider that uatomic_xchg()
> *should* imply a full barrier before/after.
> 
> But in reality, it's ARM32 uatomic_xchg() which does not fulfill
> its contract, due to __sync_lock_test_and_set being only
> an acquire barrier [1]. So the extra cmm_wmb() is what saved
> us here for rcu_xchg_pointer().
> 
> The code currently generated by rcu_xchg_pointer() looks like:
> 
>11000:   f3bf 8f5f       dmb sy
>11004:   e857 ef00       ldrex   lr, [r7]
>11008:   e847 0300       strex   r3, r0, [r7]
>1100c:   2b00            cmp     r3, #0
>1100e:   d1f9            bne.n   11004 <thr_writer+0x70>
>11010:   f3bf 8f5b       dmb ish
> 
> 
> Looking at the cmpxchg variant:
> 
> #define _rcu_cmpxchg_pointer(p, old, _new)  \
> __extension__   \
> ({  \
> __typeof__(*p) _pold = (old);   \
> __typeof__(*p) _pnew = (_new);  \
> if (!__builtin_constant_p(_new) ||  \
> ((_new) != NULL))   \
> > cmm_wmb();   \
> uatomic_cmpxchg(p, _pold, _pnew);   \
> })
> 
> We also notice a cmm_wmb() before what should imply a full barrier
> (uatomic_cmpxchg). The latter is implemented with 
> __sync_val_compare_and_swap_N,
> which should imply a full barrier based on [1] (which is as vague as it
> gets). Looking at the generated code, we indeed have two barriers before:
> 
>11000:   f3bf 8f5f       dmb sy
>11004:   f3bf 8f5b       dmb ish
>11008:   e857 ef00       ldrex   lr, [r7]
>1100c:   45c6            cmp     lr, r8
>1100e:   d103            bne.n   11018 <thr_writer+0x84>
>11010:   e847 0300       strex   r3, r0, [r7]
>11014:   2b00            cmp     r3, #0
>11016:   d1f7            bne.n   11008 <thr_writer+0x74>
>11018:   f3bf 8f5b       dmb ish
> 
> So for stable-0.8 and stable-0.9, I would be tempted to err on
> the safe side and simply add the missing cmm_smp_mb() within
> uatomic_xchg() before the __sync_lock_test_and_set().
> 
> For the master branch, in addition to adding the missing cmm_smp_mb()
> to uatomic_xchg(), we could remove the redundant cmm_wmb() in
> rcu_cmpxchg_pointer and rcu_xchg_pointer.
> 
> Thoughts ?

Seems reasonable to me.  It is the x86 guys who might have objections,
given that the extra barrier costs them but has no effect.  ;-)
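A minimal sketch of that stable-branch fix, reduced to the ARM32
uatomic_xchg() path (the real macro dispatches on operand size, which
is elided here):

	/* __sync_lock_test_and_set() is only an acquire barrier, so
	 * add the missing full fence before it. */
	#define uatomic_xchg_sketch(addr, v)			\
		({						\
			cmm_smp_mb();				\
			__sync_lock_test_and_set(addr, v);	\
		})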

Thanx, Paul

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] High memory consumption issue on RCU side

2016-09-24 Thread Paul E. McKenney
On Sat, Sep 24, 2016 at 03:34:47PM +0000, Mathieu Desnoyers wrote:
> - On Sep 24, 2016, at 11:22 AM, Paul E. McKenney 
> paul...@linux.vnet.ibm.com wrote:
> 
> > On Sat, Sep 24, 2016 at 10:42:24AM +0300, Evgeniy Ivanov wrote:
> >> Hi Mathieu,
> >> 
> >> On Sat, Sep 24, 2016 at 12:59 AM, Mathieu Desnoyers
> >> <mathieu.desnoy...@efficios.com> wrote:
> >> > - On Sep 22, 2016, at 3:14 PM, Evgeniy Ivanov lolkaanti...@gmail.com 
> >> > wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> I'm investigating the high memory usage of my program: RSS varies between
> >> executions in the range of 20-50 GB, though it should be deterministic. I've
> >> >> found that all the memory is allocated in this stack:
> >> >>
> >> >> Allocated 17673781248 bytes in 556 allocations
> >> >>cds_lfht_alloc_bucket_table3 from liburcu-cds.so.2.0.0
> >> >>_do_cds_lfht_resize  from liburcu-cds.so.2.0.0
> >> >>do_resize_cb from liburcu-cds.so.2.0.0
> >> >>call_rcu_thread  from liburcu-qsbr.so.2.0.0
> >> >>start_thread from libpthread-2.12.so
> >> >>clonefrom libc-2.12.so
> >> >>
> >> >> According pstack it should be quiescent state.  Call thread waits on 
> >> >> syscall:
> >> >> syscall
> >> >> call_rcu_thread
> >> >> start_thread
> >> >> clone
> >> >>
> >> We use urcu-0.8.7, only rculfhash (QSBR). Is it some kind of leak in
> >> RCU, or is there a chance I am misusing it? What would you recommend
> >> for troubleshooting the situation?
> >> >
> >> > urcu-qsbr is the fastest flavor of urcu, but it is rather tricky to use 
> >> > well.
> >> > Make sure that:
> >> >
> >> > - Each registered thread periodically reaches a quiescent state, by:
> >> >   - Invoking rcu_quiescent_state periodically, and
> >> >   - Making sure to surround any blocking for a relatively large
> >> > amount of time with rcu_thread_offline()/rcu_thread_online().
> >> >
> >> > In urcu-qsbr, the "default" state of threads is to be within a RCU 
> >> > read-side.
> >> > Therefore, if you omit any of the two advice above, you end up in a 
> >> > situation
> >> > where grace periods never complete, and therefore no call_rcu() 
> >> > callbacks can
> >> > be processed. This effectively acts like a big memory leak.
> >> 
> >> It was the original assumption, but in memory stacks I don't see such
> >> allocations for my data. Instead huge allocations happen right in
> >> call_rcu_thread. Memory footprint for my app is about 20 GB, erasing
> >> RCU data is a rare operation, so almost 20 GB in rcu thread looks
> >> suspecios. I'll try to not erase any RCU protected data and reproduce
> >> the issue (complicated thing is that under memory tracer it happens
> >> not so often).
> > 
> > Interesting.  Trying to figure out why your call_rcu_thread() would
> > ever allocate memory.
> > 
> > Ah!  Do your RCU callbacks allocate memory?
> 
> In this case yes: rculfhash allocates memory within a call_rcu worker
> thread when a hash-table resize is performed.

Is this then expected behavior?

Though I must admit that 20GB sounds like some serious resizing...

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Thanx, Paul
> > 
> >> > Hoping this helps,
> >> >
> >> > Thanks,
> >> >
> >> > Mathieu
> >> >
> >> >
> >> > --
> >> > Mathieu Desnoyers
> >> > EfficiOS Inc.
> >> > http://www.efficios.com
> >> 
> >> 
> >> 
> >> --
> >> Cheers,
> >> Evgeniy
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 



Re: [lttng-dev] High memory consumption issue on RCU side

2016-09-24 Thread Paul E. McKenney
On Sat, Sep 24, 2016 at 10:42:24AM +0300, Evgeniy Ivanov wrote:
> Hi Mathieu,
> 
> On Sat, Sep 24, 2016 at 12:59 AM, Mathieu Desnoyers
>  wrote:
> > - On Sep 22, 2016, at 3:14 PM, Evgeniy Ivanov lolkaanti...@gmail.com 
> > wrote:
> >
> >> Hi all,
> >>
> >> I'm investigating high memory usage of my program: RSS varies between
> >> executions in the range 20-50 GB, though it should be deterministic. I've
> >> found that all the memory is allocated in this stack:
> >>
> >> Allocated 17673781248 bytes in 556 allocations
> >>cds_lfht_alloc_bucket_table3 from liburcu-cds.so.2.0.0
> >>_do_cds_lfht_resize  from liburcu-cds.so.2.0.0
> >>do_resize_cb from liburcu-cds.so.2.0.0
> >>call_rcu_thread  from liburcu-qsbr.so.2.0.0
> >>start_thread from libpthread-2.12.so
> >>clone    from libc-2.12.so
> >>
> >> According to pstack it should be in a quiescent state.  The call_rcu thread waits on 
> >> syscall:
> >> syscall
> >> call_rcu_thread
> >> start_thread
> >> clone
> >>
> >> We use urcu-0.8.7, only rculfhash (QSBR). Is it some kind of leak in
> >> RCU, or is there a chance I misuse it? What would you recommend to
> >> troubleshoot the situation?
> >
> > urcu-qsbr is the fastest flavor of urcu, but it is rather tricky to use 
> > well.
> > Make sure that:
> >
> > - Each registered thread periodically reaches a quiescent state, by:
> >   - Invoking rcu_quiescent_state() periodically, and
> >   - Making sure to surround any blocking for a relatively large amount of
> > time with rcu_thread_offline()/rcu_thread_online().
> >
> > In urcu-qsbr, the "default" state of threads is to be within a RCU 
> > read-side.
> > Therefore, if you omit either of the two pieces of advice above, you end up in a 
> > situation
> > where grace periods never complete, and therefore no call_rcu() callbacks 
> > can
> > be processed. This effectively acts like a big memory leak.
> 
> It was the original assumption, but in the memory stacks I don't see such
> allocations for my data. Instead, huge allocations happen right in
> call_rcu_thread. The memory footprint of my app is about 20 GB, and erasing
> RCU data is a rare operation, so almost 20 GB in the RCU thread looks
> suspicious. I'll try not to erase any RCU-protected data and reproduce
> the issue (the complication is that under the memory tracer it happens
> less often).

Interesting.  Trying to figure out why your call_rcu_thread() would
ever allocate memory.

Ah!  Do your RCU callbacks allocate memory?

Thanx, Paul

> > Hoping this helps,
> >
> > Thanks,
> >
> > Mathieu
> >
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
> 
> 
> 
> -- 
> Cheers,
> Evgeniy
> 



Re: [lttng-dev] RCU on non-cache-coherent memory

2016-08-04 Thread Paul E. McKenney
On Tue, Aug 02, 2016 at 10:58:57PM +, Mathieu Desnoyers wrote:
> - On Aug 1, 2016, at 8:30 PM, Yuxin Ren r...@gwmail.gwu.edu wrote:
> 
> > Hi all,
> > 
> > Is there any research or publications about RCU on top of
> > non-cache-coherent multi-core architecture?
> > Not only RCU; any other synchronization technique on top of
> > non-cache-coherent multi-core systems is also helpful.
> 
> CCing Paul E. McKenney, who might know more on this topic.
> 
> Back in 2009 when I started the liburcu.org project, I
> planned to eventually add support for such architectures,
> e.g. Blackfin, which is why I initially added the cmm_mc(),
> cmm_rmc() and cmm_wmc() macros in the library (see
> CONFIG_HAVE_MEM_COHERENCY). However, all currently implemented
> architectures have mem coherency, so it's always defined as a
> simple compiler barrier. See include/urcu/arch/generic.h in liburcu
> for details.

Mathieu pretty much covered it.  We only have theoretical experience
with non-cache-coherent systems.  You might want to contact the authors
of these papers:

http://www.sigops.org/sosp/sosp11/posters/summaries/sosp11-final7.pdf
http://arxiv.org/pdf/1301.4490.pdf

There are probably others, as this was a hot topic a couple of years ago.

Thanx, Paul



[lttng-dev] FYI, another paper checking URCU correctness

2016-06-13 Thread Paul E. McKenney
Hello!

On the off chance that this is new news of interest...

https://arxiv.org/pdf/1606.01400v1.pdf

"Operational Aspects of C/C++ Concurrency", Anton Podkopaev, Ilya Sergey,
Aleksandar Nanevski.

At first glance, they seem to be using a combination of formal
verification and testing, using a simple linked-list usage of URCU.
The state space was too large for their formal technique, so they added
a random state-space sampling approach, sort of like what Promela/spin
does for large problems.  They did promote rcu_dereference() to an
acquire load, but list real rcu_dereference() as future work.

They didn't find any bugs, but their technique did find an injected bug
that resulted in too-short grace periods.

Thanx, Paul



Re: [lttng-dev] question about rcu_bp_exit()

2016-05-19 Thread Paul E. McKenney
On Wed, May 18, 2016 at 06:40:03PM +, Mathieu Desnoyers wrote:
> - On May 18, 2016, at 5:44 AM, songxin  wrote: 
> 
> > Hi,
> > Now I get a crash because of receiving signal SIGSEGV, as below.
> 
> > #0 arena_alloc (arena=) at
> > /usr/src/debug/liburcu/0.9.1+git5fd33b1e5003ca316bd314ec3fd1447f6199a282-r0/git/urcu-bp.c:432
> > #1 add_thread () at
> > /usr/src/debug/liburcu/0.9.1+git5fd33b1e5003ca316bd314ec3fd1447f6199a282-r0/git/urcu-bp.c:462
> > #2 rcu_bp_register () at
> > /usr/src/debug/liburcu/0.9.1+git5fd33b1e5003ca316bd314ec3fd1447f6199a282-r0/git/urcu-bp.c:541
> 
> > I read the code of urcu-bp.c and found that "if (chunk->data_len -
> > chunk->used < len)" is at line 432. So I guess that the chunk is an
> > illegal pointer.
> > Below is the function rcu_bp_exit().
> 
> > static
> > void rcu_bp_exit(void)
> > {
> > mutex_lock(&init_lock);
> > if (!--rcu_bp_refcount) {
> > struct registry_chunk *chunk, *tmp;
> > int ret;
> 
> > cds_list_for_each_entry_safe(chunk, tmp,
> > &registry_arena.chunk_list, node) {
> > munmap(chunk, chunk->data_len
> > + sizeof(struct registry_chunk));
> > }
> > ret = pthread_key_delete(urcu_bp_key);
> > if (ret)
> > abort();
> > }
> > mutex_unlock(&init_lock);
> > }
> 
> > My question is below.
> > Why is the chunk not deleted from registry_arena.chunk_list before
> > munmapping it?
> 
> It is not expected that any thread would be created after the execution of 
> rcu_bp_exit() as a library destructor. Does re-initializing the chunk_list 
> after 
> iterating on it within rcu_bp_exit() fix your issue ? 
> 
> I'm curious about your use-case for creating threads after the library 
> destructor 
> has run. 

I am with Mathieu on this -- not much good can be expected using things
after their cleanup.  Though I suppose that, given a sufficient use case,
there could at least in theory be an option for manual control of cleanup.
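
For reference, the re-initialization Mathieu suggests would look roughly
like this sketch (not the committed fix), so that the list head no longer
points into unmapped memory after the loop:

	cds_list_for_each_entry_safe(chunk, tmp,
			&registry_arena.chunk_list, node) {
		munmap(chunk, chunk->data_len
				+ sizeof(struct registry_chunk));
	}
	/* The chunks were unmapped above; reset the head so any later
	 * (unexpected) registration does not walk freed memory. */
	CDS_INIT_LIST_HEAD(&registry_arena.chunk_list);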

Thanx, Paul



Re: [lttng-dev] Question about lock in synchronize_rcu implementation of URCU

2016-04-28 Thread Paul E. McKenney
On Thu, Apr 28, 2016 at 02:38:40PM +, Mathieu Desnoyers wrote:
> - On Apr 28, 2016, at 9:47 AM, Yuxin Ren r...@gwmail.gwu.edu wrote:
> 
> > Hi Boqun and Paul,
> > 
> > Thank you so much for your help.
> > 
> > I found one reason to use that lock.
> > In the slow path, a thread will move all waiters to a local queue.
> > https://github.com/urcu/userspace-rcu/blob/master/urcu.c#L406
> > After this, following thread can also perform grace period, as the
> > global waiter queue is empty.
> > Thus the rcu_gp_lock is used to ensure mutual exclusion.
> > 
> > However, from real time aspect, such blocking will cause priority
> > inversion: higher priority writer can be blocked by low priority
> > writer.
> > Is there a way to modify the code to allow multiple threads to perform
> > grace period concurrently?
> 
> Before we redesign urcu for RT, would it be possible to simply
> use pi-mutexes (priority inheritance) instead to protect grace periods
> from each other with the current urcu scheme ?

Given that priority inversion can happen with low-priority readers
blocking a grace period that a high-priority updater is waiting on,
I stand by my earlier advice:  Don't let high-priority updaters block
waiting for grace periods.  ;-)

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> 
> > 
> > Thanks again!!
> > Yuxin
> > 
> > On Thu, Apr 28, 2016 at 8:44 AM, Boqun Feng <boqun.f...@gmail.com> wrote:
> >> Hi Paul and Yuxin,
> >>
> >> On Wed, Apr 27, 2016 at 09:23:27PM -0700, Paul E. McKenney wrote:
> >>> Try building without it and see what happens when you run the tests.
> >>>
> >>
> >> I've run a 'regtest' with the following modification out of curiosity:
> >>
> >> diff --git a/urcu.c b/urcu.c
> >> index a5568bdbd075..9dc3c9feae56 100644
> >> --- a/urcu.c
> >> +++ b/urcu.c
> >> @@ -398,8 +398,6 @@ void synchronize_rcu(void)
> >> /* We won't need to wake ourself up */
> >> urcu_wait_set_state(&wait, URCU_WAIT_RUNNING);
> >>
> >> -   mutex_lock(&rcu_gp_lock);
> >> -
> >> /*
> >>  * Move all waiters into our local queue.
> >>  */
> >> @@ -480,7 +478,6 @@ void synchronize_rcu(void)
> >> smp_mb_master();
> >>  out:
> >> mutex_unlock(&rcu_registry_lock);
> >> -   mutex_unlock(&rcu_gp_lock);
> >>
> >> /*
> >>  * Wakeup waiters only after we have completed the grace period
> >>
> >>
> >> And guess what, the result of the test was:
> >>
> >> Test Summary Report
> >> ---
> >> ./run-urcu-tests.sh 1 (Wstat: 0 Tests: 979 Failed: 18)
> >>   Failed tests:  30, 45, 60, 75, 90, 103, 105, 120, 135
> >>   150, 165, 180, 195, 210, 225, 240, 253
> >>   255
> >> Files=2, Tests=996, 1039 wallclock secs ( 0.55 usr  0.04 sys + 8981.02 cusr
> >> 597.64 csys = 9579.25 CPU)
> >> Result: FAIL
> >>
> >> And test case 30 for example is something like:
> >>
> >> tests/benchmark/test_urcu_mb 4 4 1 -d 0 -b 32768
> >>
> >> and it failed because:
> >>
> >> lt-test_urcu_mb: test_urcu.c:183: thr_reader: Assertion `*local_ptr == 8'
> >> failed.
> >> zsh: abort (core dumped)  ~/userspace-rcu/tests/benchmark/test_urcu_mb 4 4 
> >> 1 -d
> >> 0 -b 32768
> >>
> >> So I think what was going on here was:
> >>
> >> CPU 0 (reader)                  CPU 1 (writer)                  CPU 2 (writer)
> >> ==============                  ==============                  ==============
> >> rcu_read_lock();
> >> new = malloc(sizeof(int));
> >> local_ptr = rcu_dereference(test_rcu_pointer); // local_ptr == old
> >> *new = 8;
> >>
> >>  old = rcu_xchg_pointer(&test_rcu_pointer, new);
> >> synchronize_rcu():
> >>   urcu_wait_add(); // return 0
> >>   urcu_move_waiters(); // @gp_waiters is empty,
> >>                        // the next urcu_wait_add() will return 0
> >>
> >>   

Re: [lttng-dev] Question about lock in synchronize_rcu implementation of URCU

2016-04-28 Thread Paul E. McKenney
On Thu, Apr 28, 2016 at 09:47:23AM -0400, Yuxin Ren wrote:
> Hi Boqun and Paul,
> 
> Thank you so much for your help.
> 
> I found one reason to use that lock.
> In the slow path, a thread will move all waiters to a local queue.
> https://github.com/urcu/userspace-rcu/blob/master/urcu.c#L406
> After this, following thread can also perform grace period, as the
> global waiter queue is empty.
> Thus the rcu_gp_lock is used to ensure mutual exclusion.
> 
> However, from real time aspect, such blocking will cause priority
> inversion: higher priority writer can be blocked by low priority
> writer.
> Is there a way to modify the code to allow multiple threads to perform
> grace period concurrently?

If a thread has real-time requirements, you shouldn't have it do
synchronous grace periods, just as you shouldn't have it do (say)
sleep(10).

You should instead either (1) have some other non-realtime thread do
the cleanup activities involving synchronize_rcu() or (2) have the
real-time thread use the asynchronous call_rcu().
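
Option (2) in a minimal sketch; struct item and rt_remove() are
hypothetical, but call_rcu() is the real asynchronous API:

#include <stdlib.h>
#include <urcu.h>	/* any flavor provides call_rcu() */

struct item {
	int data;
	struct rcu_head rcu_head;
};

static void free_item(struct rcu_head *head)
{
	struct item *p = caa_container_of(head, struct item, rcu_head);

	free(p);
}

static void rt_remove(struct item *p)
{
	/* ... unpublish p from the data structure first ... */
	call_rcu(&p->rcu_head, free_item);	/* never waits for a GP */
}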

Thanx, Paul

> Thanks again!!
> Yuxin
> 
> On Thu, Apr 28, 2016 at 8:44 AM, Boqun Feng <boqun.f...@gmail.com> wrote:
> > Hi Paul and Yuxin,
> >
> > On Wed, Apr 27, 2016 at 09:23:27PM -0700, Paul E. McKenney wrote:
> >> Try building without it and see what happens when you run the tests.
> >>
> >
> > I've run a 'regtest' with the following modification out of curiosity:
> >
> > diff --git a/urcu.c b/urcu.c
> > index a5568bdbd075..9dc3c9feae56 100644
> > --- a/urcu.c
> > +++ b/urcu.c
> > @@ -398,8 +398,6 @@ void synchronize_rcu(void)
> > /* We won't need to wake ourself up */
> > urcu_wait_set_state(&wait, URCU_WAIT_RUNNING);
> >
> > -   mutex_lock(&rcu_gp_lock);
> > -
> > /*
> >  * Move all waiters into our local queue.
> >  */
> > @@ -480,7 +478,6 @@ void synchronize_rcu(void)
> > smp_mb_master();
> >  out:
> > mutex_unlock(&rcu_registry_lock);
> > -   mutex_unlock(&rcu_gp_lock);
> >
> > /*
> >  * Wakeup waiters only after we have completed the grace period
> >
> >
> > And guess what, the result of the test was:
> >
> > Test Summary Report
> > ---
> > ./run-urcu-tests.sh 1 (Wstat: 0 Tests: 979 Failed: 18)
> >   Failed tests:  30, 45, 60, 75, 90, 103, 105, 120, 135
> >   150, 165, 180, 195, 210, 225, 240, 253
> >   255
> > Files=2, Tests=996, 1039 wallclock secs ( 0.55 usr  0.04 sys + 8981.02 cusr 
> > 597.64 csys = 9579.25 CPU)
> > Result: FAIL
> >
> > And test case 30 for example is something like:
> >
> > tests/benchmark/test_urcu_mb 4 4 1 -d 0 -b 32768
> >
> > and it failed because:
> >
> > lt-test_urcu_mb: test_urcu.c:183: thr_reader: Assertion `*local_ptr == 8' 
> > failed.
> > zsh: abort (core dumped)  ~/userspace-rcu/tests/benchmark/test_urcu_mb 4 4 
> > 1 -d 0 -b 32768
> >
> > So I think what was going on here was:
> >
> > CPU 0 (reader)                  CPU 1 (writer)                  CPU 2 (writer)
> > ==============                  ==============                  ==============
> > rcu_read_lock();
> > new = malloc(sizeof(int));
> > local_ptr = rcu_dereference(test_rcu_pointer); // local_ptr == old  
> > *new = 8;
> > 
> > old = rcu_xchg_pointer(&test_rcu_pointer, new);
> > synchronize_rcu():
> >   urcu_wait_add(); // return 0
> >   urcu_move_waiters(); // @gp_waiters is empty,
> >                        // the next urcu_wait_add() will return 0
> >
> > 
> > synchronize_rcu():
> > 
> >   urcu_wait_add(); // return 0
> >
> >   mutex_lock(&rcu_registry_lock);
> >   wait_for_readers(); // remove registered reader from @registry,
> >                       // release rcu_registry_lock and wait via poll()
> >
> >

Re: [lttng-dev] Question about lock in synchronize_rcu implementation of URCU

2016-04-28 Thread Paul E. McKenney
On Thu, Apr 28, 2016 at 08:44:01PM +0800, Boqun Feng wrote:
> Hi Paul and Yuxin,
> 
> On Wed, Apr 27, 2016 at 09:23:27PM -0700, Paul E. McKenney wrote:
> > Try building without it and see what happens when you run the tests.
> > 
> 
> I've run a 'regtest' with the following modification out of curiosity:
> 
> diff --git a/urcu.c b/urcu.c
> index a5568bdbd075..9dc3c9feae56 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -398,8 +398,6 @@ void synchronize_rcu(void)
>   /* We won't need to wake ourself up */
>   urcu_wait_set_state(&wait, URCU_WAIT_RUNNING);
>  
> - mutex_lock(&rcu_gp_lock);
> -
>   /*
>* Move all waiters into our local queue.
>*/
> @@ -480,7 +478,6 @@ void synchronize_rcu(void)
>   smp_mb_master();
>  out:
>   mutex_unlock(&rcu_registry_lock);
> - mutex_unlock(&rcu_gp_lock);
>  
>   /*
>* Wakeup waiters only after we have completed the grace period
> 
> 
> And guess what, the result of the test was:
> 
> Test Summary Report
> ---
> ./run-urcu-tests.sh 1 (Wstat: 0 Tests: 979 Failed: 18)
>   Failed tests:  30, 45, 60, 75, 90, 103, 105, 120, 135
>   150, 165, 180, 195, 210, 225, 240, 253
> 255
> Files=2, Tests=996, 1039 wallclock secs ( 0.55 usr  0.04 sys + 8981.02 cusr 
> 597.64 csys = 9579.25 CPU)
> Result: FAIL
> 
> And test case 30 for example is something like:
> 
> tests/benchmark/test_urcu_mb 4 4 1 -d 0 -b 32768
> 
> and it failed because:
> 
> lt-test_urcu_mb: test_urcu.c:183: thr_reader: Assertion `*local_ptr == 8' 
> failed.
> zsh: abort (core dumped)  ~/userspace-rcu/tests/benchmark/test_urcu_mb 4 4 1 
> -d 0 -b 32768
> 
> So I think what was going on here was:
> 
> CPU 0 (reader)                  CPU 1 (writer)                  CPU 2 (writer)
> ==============                  ==============                  ==============
> rcu_read_lock();  
> new = malloc(sizeof(int));
> local_ptr = rcu_dereference(test_rcu_pointer); // local_ptr == old
> *new = 8;
>   
> old = rcu_xchg_pointer(&test_rcu_pointer, new);
>   synchronize_rcu():
> urcu_wait_add(); // return 0
> urcu_move_waiters(); // @gp_waiters is empty,
>                      // the next urcu_wait_add() will return 0
> 
>   
> synchronize_rcu():
>   
>   urcu_wait_add(); // return 0
> 
> mutex_lock(&rcu_registry_lock);
> wait_for_readers(); // remove registered reader from @registry,
>                     // release rcu_registry_lock and wait via poll()
> 
>   
>   mutex_lock(&rcu_registry_lock);
>   
>   wait_for_readers(); // @registry is empty! we are so lucky
>   
>   return;
> 
>   
> if (old)
> 	free(old);
> rcu_read_unlock();
> assert(*local_ptr == 8); // but local_ptr (i.e. old) is already freed.
> 
> 
> So the point is there could be two writers calling synchronize_rcu() but
> not returning early (both of them enter the slow path to perform a grace
> period), so the rcu_gp_lock is necessary in this case.
> 
> (Cc  Mathieu)
> 
> But this is only my understanding and I'm learning the URCU code too ;-)

Nothing quite like actually trying it and seeing what happens!  One of
the best learning methods that I know of.

Assuming the act of actually trying it is non-fatal, of course.  ;-)

Thanx, Paul

> Regards,
> Boqun
> 
> 
> > Might well be that it is unnecessary, but I will defer to Mathieu
> > on that point.
> > 
> > Thanx, Paul
> > 
> > On Wed, Apr 27, 2016 at 10:18:04PM -0400, Yuxin Ren wrote:
> > > As they don't concurrently perform grace periods, why do we use the 
> > > rcu_gp_lock?
> > > 

Re: [lttng-dev] Question about lock in synchronize_rcu implementation of URCU

2016-04-27 Thread Paul E. McKenney
Try building without it and see what happens when you run the tests.

Might well be that it is unnecessary, but I will defer to Mathieu
on that point.

Thanx, Paul

On Wed, Apr 27, 2016 at 10:18:04PM -0400, Yuxin Ren wrote:
> As they don't concurrently perform grace periods, why do we use the rcu_gp_lock?
> 
> Thank you.
> Yuxin
> 
> On Wed, Apr 27, 2016 at 10:08 PM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> > On Wed, Apr 27, 2016 at 09:34:16PM -0400, Yuxin Ren wrote:
> >> Hi,
> >>
> >> I am learning the URCU code.
> >>
> >> Why do we need rcu_gp_lock in synchronize_rcu?
> >> https://github.com/urcu/userspace-rcu/blob/master/urcu.c#L401
> >>
> >> In the comment, it says this lock ensures mutual exclusion between
> >> threads calling synchronize_rcu().
> >> But only the first thread added to the waiter queue can proceed to detect
> >> the grace period.
> >> How can multiple threads concurrently perform the grace period?
> >
> > They don't concurrently perform grace periods, and it would be wasteful
> > for them to do so.  Instead, the first one performs the grace period,
> > and all that were waiting at the time it started get the benefit of that
> > same grace period.
> >
> > Any that arrived after the first grace period performs the first
> > grace period are served by whichever of them performs the second
> > grace period.
> >
> > Thanx, Paul
> >
> 



Re: [lttng-dev] Question about lock in synchronize_rcu implementation of URCU

2016-04-27 Thread Paul E. McKenney
On Wed, Apr 27, 2016 at 09:34:16PM -0400, Yuxin Ren wrote:
> Hi,
> 
> I am learning the URCU code.
> 
> Why do we need rcu_gp_lock in synchronize_rcu?
> https://github.com/urcu/userspace-rcu/blob/master/urcu.c#L401
> 
> In the comment, it says this lock ensures mutual exclusion between
> threads calling synchronize_rcu().
> But only the first thread added to the waiter queue can proceed to detect
> the grace period.
> How can multiple threads concurrently perform the grace period?

They don't concurrently perform grace periods, and it would be wasteful
for them to do so.  Instead, the first one performs the grace period,
and all that were waiting at the time it started get the benefit of that
same grace period.

Any that arrived after the first grace period performs the first
grace period are served by whichever of them performs the second
grace period.
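
In hypothetical pseudocode (these names are illustrative, not the actual
liburcu internals), the batching looks roughly like:

void synchronize_rcu_sketch(void)
{
	if (enqueue_self_on_waiter_queue()) {	/* queue was non-empty */
		wait_to_be_woken();		/* piggy-back on the leader's GP */
		return;
	}
	/* First in queue: become the grace-period leader. */
	splice_all_waiters_to_local_list();
	run_one_grace_period();
	wake_spliced_waiters();			/* they all share this one GP */
}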

Thanx, Paul



Re: [lttng-dev] real time Userspace RCU

2016-03-31 Thread Paul E. McKenney
On Thu, Mar 31, 2016 at 09:20:07AM +0800, Yuxin Ren wrote:
> Thank you all!!
> 
> I agree URCU does timing quite well.
> But is there any formal response-time analysis for URCU/RCU (both
> the read and update paths)?

Not that I know of.  You could be the first!

> Or could anyone guide me on how to do RTA for URCU/RCU?

Compile something with a simple RCU read-side critical section, and then
count the instructions.  QSBR will of course work best, but MB will also
have good bounds.  Signal-based will be a bit more complicated.
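
For instance, one could compile a function like this sketch with -S and
count the read-side instructions; struct foo and global_ptr are stand-ins
for application data:

#include <urcu.h>	/* or <urcu-qsbr.h> for the QSBR flavor */

struct foo {
	int value;
};

extern struct foo *global_ptr;

int read_side(void)
{
	struct foo *p;
	int v;

	rcu_read_lock();
	p = rcu_dereference(global_ptr);
	v = p ? p->value : -1;
	rcu_read_unlock();
	return v;
}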

Much depends on what type of RTA you want to do.  Brandenburg's
dissertation has quite a bit of info and many citations:

http://www.cs.unc.edu/~bbb/diss/

You can also take an experimental approach, though a great many
runs are required.  OSADL (https://www.osadl.org/) does quite a
bit of this work on -rt Linux.

Thanx, Paul

> Thanks again.
> 
> On Fri, Mar 11, 2016 at 10:00 PM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
> > - On Mar 11, 2016, at 6:45 AM, Paul E. McKenney 
> > paul...@linux.vnet.ibm.com wrote:
> >
> >> On Thu, Mar 10, 2016 at 08:53:05PM +, Mathieu Desnoyers wrote:
> >>> - On Mar 10, 2016, at 3:33 PM, Yuxin Ren r...@gwmail.gwu.edu wrote:
> >>>
> >>> > Thank you for your reply.
> >>> >
> >>> > I want to generally understand how to apply urcu to real-time systems.
> >>> > I know real-time systems focus on predictability of both timing and
> >>> > memory consumption.
> >>> > So how does real-time urcu support predictability?
> >>> > Could you provide me some papers, documents or any materials about any
> >>> > aspect of real time urcu?
> >>>
> >>> Adding Paul E. McKenney in CC, who may have some thoughts on this
> >>> topic.
> >>
> >> URCU does timing quite well, given that the read-side primitives each
> >> execute a fixed sequence of instructions.  Updates using call_rcu()
> >> can be used to minimize update-side latency, but if you need to bound
> >> memory overhead, the best way to do that is to make sure that updates
> >> are not on the critical path, and then use synchronize_rcu() instead
> >> of call_rcu().  In that case, the total amount of memory waiting for
> >> reclamation is bounded by the maximum size of an RCU-protected memory
> >> block times the number of threads.
> >
> > An intermediate solution if both update throughput and bounded-memory
> > are required (but the application would not have real-time constraints
> > on updates) would be to use the defer_rcu() API in liburcu. It amortizes
> > the cost of synchronize_rcu() over many defer_rcu() calls with a worker
> > thread, but only up to an upper bound. When the upper bound is reached,
> > the defer_rcu() call empties the defer queue itself.
> >
> > Thanks,
> >
> > Mathieu
> >
> >>
> >> So can you design your application so that updates are off the critical
> >> path?  If so, you can get both bounded read-side accesses and bounded
> >> memory footprint.
> >>
> >> This of course assumes that your data structures are simple enough
> >> that readers don't need to use retry techniques.
> >>
> >> The following info might be helpful:
> >>
> >> http://www2.rdrop.com/users/paulmck/realtime/paper/DetSyncRCU.2009.08.18a.pdf
> >> http://www2.rdrop.com/users/paulmck/realtime/paper/DetSyncRCU.2009.09.29a.pdf
> >>
> >> http://www2.rdrop.com/users/paulmck/realtime/paper/RTLWS2012occcRT.2012.10.19e.pdf
> >>
> >> It also depends on your timeframe.  Microseconds?  Life is hard.
> >> Milliseconds?  Care is required, but you have a fair amount of freedom.
> >> Seconds?  Life is not so hard.  Unless you need to do two seconds of
> >> computation in one second or some such.  ;-)
> >>
> >>   Thanx, Paul
> >>
> >>> Thanks,
> >>>
> >>> Mathieu
> >>>
> >>> >
> >>> > Thanks again!
> >>> >
> >>> > On Thu, Mar 10, 2016 at 1:52 PM, Michel Dagenais
> >>> > <michel.dagen...@polymtl.ca> wrote:
> >>> >> Real-time and embedded systems are an important current focus for the 
> >>> >> LTTng
> >>> >> toolchain research. Do you have specific needs for userspace RCU?
> >>> >>
> >>> 

Re: [lttng-dev] real time Userspace RCU

2016-03-11 Thread Paul E. McKenney
On Thu, Mar 10, 2016 at 08:53:05PM +, Mathieu Desnoyers wrote:
> - On Mar 10, 2016, at 3:33 PM, Yuxin Ren r...@gwmail.gwu.edu wrote:
> 
> > Thank you for your reply.
> > 
> > I want to generally understand how to apply urcu to real-time systems.
> > I know real-time systems focus on predictability of both timing and
> > memory consumption.
> > So how does real-time urcu support predictability?
> > Could you provide me some papers, documents or any materials about any
> > aspect of real time urcu?
> 
> Adding Paul E. McKenney in CC, who may have some thoughts on this
> topic.

URCU does timing quite well, given that the read-side primitives each
execute a fixed sequence of instructions.  Updates using call_rcu()
can be used to minimize update-side latency, but if you need to bound
memory overhead, the best way to do that is to make sure that updates
are not on the critical path, and then use synchronize_rcu() instead
of call_rcu().  In that case, the total amount of memory waiting for
reclamation is bounded by the maximum size of an RCU-protected memory
block times the number of threads.

So can you design your application so that updates are off the critical
path?  If so, you can get both bounded read-side accesses and bounded
memory footprint.

This of course assumes that your data structures are simple enough
that readers don't need to use retry techniques.
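
A sketch of keeping updates off the critical path, assuming a single
writer; hand_off()/take_handed_off() stand for a wait-free queue of the
application's choosing:

#include <stdlib.h>
#include <urcu.h>

struct foo;					/* application data */
extern struct foo *global_ptr;			/* RCU-protected pointer */
extern void hand_off(struct foo *p);		/* hypothetical queue ops */
extern struct foo *take_handed_off(void);	/* (blocking dequeue) */

/* RT path: constant-time publish, no grace-period wait. */
void rt_update(struct foo *newp)
{
	struct foo *oldp = rcu_xchg_pointer(&global_ptr, newp);

	hand_off(oldp);
}

/* Non-RT reclaimer: at most one block awaits reclamation per loop
 * iteration, which is what bounds the memory footprint. */
void *reclaimer(void *arg)
{
	for (;;) {
		struct foo *oldp = take_handed_off();

		synchronize_rcu();
		free(oldp);
	}
}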

The following info might be helpful:

http://www2.rdrop.com/users/paulmck/realtime/paper/DetSyncRCU.2009.08.18a.pdf
http://www2.rdrop.com/users/paulmck/realtime/paper/DetSyncRCU.2009.09.29a.pdf

http://www2.rdrop.com/users/paulmck/realtime/paper/RTLWS2012occcRT.2012.10.19e.pdf

It also depends on your timeframe.  Microseconds?  Life is hard.
Milliseconds?  Care is required, but you have a fair amount of freedom.
Seconds?  Life is not so hard.  Unless you need to do two seconds of
computation in one second or some such.  ;-)

Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Thanks again!
> > 
> > On Thu, Mar 10, 2016 at 1:52 PM, Michel Dagenais
> > <michel.dagen...@polymtl.ca> wrote:
> >> Real-time and embedded systems are an important current focus for the LTTng
> >> toolchain research. Do you have specific needs for userspace RCU?
> >>
> >> - Mail original -
> >>> Hi,
> >>>
> >>>  Is there any work or research about Userspace RCU on real time or
> >>> embedded systems?
> >>> Any information is welcome.
> >>>
> >>> Thanks a lot!
> >>> Yuxin
> >>> ___
> >>> lttng-dev mailing list
> >>> lttng-dev@lists.lttng.org
> >>> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> >>>
> > ___
> > lttng-dev mailing list
> > lttng-dev@lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 



Re: [lttng-dev] [rp] [RFC PATCH urcu] urcu_ref_get: API change: return boolean

2016-01-21 Thread Paul E. McKenney
On Thu, Jan 21, 2016 at 05:06:09PM +, Mathieu Desnoyers wrote:
> - On Jan 21, 2016, at 11:59 AM, Josh Triplett j...@joshtriplett.org wrote:
> 
> > On Thu, Jan 21, 2016 at 04:45:20PM +, Mathieu Desnoyers wrote:
> >> - On Jan 19, 2016, at 3:57 PM, Mathieu Desnoyers
> >> mathieu.desnoy...@efficios.com wrote:
> >> 
> >> > This is a RFC of a follow up patch based on urcu commit 7d7c5d467 "Fix:
> >> > handle reference count overflow".
> >> > 
> >> > Change the urcu_ref_get prototype to return a boolean, which takes the
> >> > value false if a LONG_MAX overflow would occur (get has not been
> >> > performed), or true otherwise.
> >> > 
> >> > This interface change also introduces a "warn_unused_result" gcc
> >> > function attribute, which will show warnings if users don't handle the
> >> > return value.
> >> > 
> >> > I'm wondering whether this change is useful enough to justify breaking
> >> > the API (need to bump the major library version), or if introducing a
> >> > new "urcu_ref_get_safe()" or such would be a better option ?
> >> 
> >> After some thinking, I will go for adding a new urcu_ref_get_safe() API,
> >> thus not requiring to bump the library major version.
> > 
> > You may want to add a deprecated __attribute__ to the unsafe version in
> > the header.
> 
> The "unsafe" version now does a "abort()" in case of detected overflow,
> so it's not a security concern per se, but could theoretically lead to
> an application denial of service.
> 
> I'm tempted to keep urcu_ref_get() as it is (not deprecate it) because
> there appears to be valid use-cases for it: for instance, if an application
> fully controls the reference counting (refcount values don't depend on
> external inputs), urcu_ref_get would still seem like a good API to use.

This seems to me to be a good tradeoff.  If experience shows that
urcu_ref_get() is "unsafe at any speed", then Josh's suggestion of
deprecating it would make a lot of sense.
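
Usage of the overflow-checked variant would look like this sketch
(struct my_obj is hypothetical; the boolean result must be checked,
which is what warn_unused_result enforces):

#include <stdbool.h>
#include <urcu/ref.h>

struct my_obj {
	struct urcu_ref ref;
	/* ... payload ... */
};

static bool grab(struct my_obj *obj)
{
	if (!urcu_ref_get_safe(&obj->ref))
		return false;	/* LONG_MAX reached: no reference taken */
	return true;
}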

Thanx, Paul




Re: [lttng-dev] [RFC PATCH urcu] Fix: dynamic fallback to compat futex on sys_futex ENOSYS

2015-09-13 Thread Paul E. McKenney
On Fri, Sep 11, 2015 at 10:48:38AM -0400, Mathieu Desnoyers wrote:
> Some MIPS processors (e.g. Cavium Octeon II) dynamically check if the
> CPU supports ll/sc within sys_futex, and return a ENOSYS errno if they
> don't, even though the architecture implements sys_futex.
> 
> Handle this situation by always building the sys_futex compatibility
> layer, and falling back on it if sys_futex returns an ENOSYS errno. This is
> a tiny compat layer which adds very little space overhead.
> 
> This adds an unlikely branch on return from sys_futex, which should
> not be an issue performance-wise (we've already taken a system call).
> 
> Since this is a fall-back mode, don't try to be clever, and don't cache
> the result, so that the common cases (architectures with a properly
> working sys_futex) don't get two conditional branches, just one.

Looks like a reasonable approach to me.

Acked-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
> CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> CC: Michael Jeanson <mjean...@efficios.com>
> CC: Jon Bernard <jbern...@debian.org>
> ---
>  Makefile.am  |  2 --
>  urcu/futex.h | 70 
> +---
>  2 files changed, 57 insertions(+), 15 deletions(-)
> 
> diff --git a/Makefile.am b/Makefile.am
> index 752510d..f9a 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -41,9 +41,7 @@ else
>  COMPAT=
>  endif
> 
> -if COMPAT_FUTEX
>  COMPAT+=compat_futex.c
> -endif
> 
>  RCULFHASH = rculfhash.c rculfhash-mm-order.c rculfhash-mm-chunk.c \
>   rculfhash-mm-mmap.c
> diff --git a/urcu/futex.h b/urcu/futex.h
> index 2be3bb6..13d2b1a 100644
> --- a/urcu/futex.h
> +++ b/urcu/futex.h
> @@ -47,22 +47,66 @@ extern "C" {
>   * (returns EINTR).
>   */
> 
> +extern int compat_futex_noasync(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3);
> +extern int compat_futex_async(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3);
> +
>  #ifdef CONFIG_RCU_HAVE_FUTEX
> +
> +#include <unistd.h>
> +#include <errno.h>
> +#include <urcu/compiler.h>
>  #include <sys/syscall.h>
> -#define futex(...)   syscall(__NR_futex, __VA_ARGS__)
> -#define futex_noasync(uaddr, op, val, timeout, uaddr2, val3) \
> - futex(uaddr, op, val, timeout, uaddr2, val3)
> -#define futex_async(uaddr, op, val, timeout, uaddr2, val3)   \
> - futex(uaddr, op, val, timeout, uaddr2, val3)
> +
> +static inline int futex(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3)
> +{
> + return syscall(__NR_futex, uaddr, op, val, timeout,
> + uaddr2, val3);
> +}
> +
> +static inline int futex_noasync(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3)
> +{
> + int ret;
> +
> + ret = futex(uaddr, op, val, timeout, uaddr2, val3);
> + if (caa_unlikely(ret < 0 && errno == ENOSYS)) {
> + return compat_futex_noasync(uaddr, op, val, timeout,
> + uaddr2, val3);
> + }
> + return ret;
> +
> +}
> +
> +static inline int futex_async(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3)
> +{
> + int ret;
> +
> + ret = futex(uaddr, op, val, timeout, uaddr2, val3);
> + if (caa_unlikely(ret < 0 && errno == ENOSYS)) {
> + return compat_futex_async(uaddr, op, val, timeout,
> + uaddr2, val3);
> + }
> + return ret;
> +}
> +
>  #else
> -extern int compat_futex_noasync(int32_t *uaddr, int op, int32_t val,
> - const struct timespec *timeout, int32_t *uaddr2, int32_t val3);
> -#define futex_noasync(uaddr, op, val, timeout, uaddr2, val3) \
> - compat_futex_noasync(uaddr, op, val, timeout, uaddr2, val3)
> -extern int compat_futex_async(int32_t *uaddr, int op, int32_t val,
> - const struct timespec *timeout, int32_t *uaddr2, int32_t val3);
> -#define futex_async(uaddr, op, val, timeout, uaddr2, val3)   \
> - compat_futex_async(uaddr, op, val, timeout, uaddr2, val3)
> +
> +static inline int futex_noasync(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3)
> +{
> + return compat_futex_noasync(uaddr, op, val, timeout, uaddr2, val3);
> +}
> +
> +static inline int futex_async(int32_t *uaddr, int op, int32_t val,
> + const struct timespec *timeout, int32_t *uaddr2, int32_t val3)
> +{
> + return compat_futex_async(uaddr, op, val, timeout, uaddr2, val3);
> +}
> +
>  #endif
> 
>  #ifdef __cplusplus 
> -- 
> 2.1.4
> 




Re: [lttng-dev] [PATCH] Fix: call_rcu_thread() affinity failure

2015-06-29 Thread Paul E. McKenney
On Mon, Jun 29, 2015 at 06:56:34PM -0400, Mathieu Desnoyers wrote:
 Make call_rcu_thread() affine itself more persistently
 
 Currently, URCU simply fails if a call_rcu_thread() fails to affine
 itself. This is problematic when execution is constrained by cgroup
 and hotunplugged CPUs. This commit therefore makes call_rcu_thread()
 retry setting its affinity every 256 grace periods, but only if it
 detects that it migrated to a different CPU. Since sched_getcpu() is
 cheap on many architectures, this check is less costly than going
 through a system call.
 
 Reported-by: Michael Jeanson mjean...@efficios.com
 Suggested-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com

A couple of issues, but otherwise good.  (They might even be issues
with your code rather than my eyes, you never know!)

Thanx, Paul

 ---
  urcu-call-rcu-impl.h | 36 +++-
  1 file changed, 31 insertions(+), 5 deletions(-)
 
 diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
 index 5cc02d9..b82a59b 100644
 --- a/urcu-call-rcu-impl.h
 +++ b/urcu-call-rcu-impl.h
 @@ -45,6 +45,9 @@
  #include urcu/ref.h
  #include urcu-die.h
 
 +#define SET_AFFINITY_CHECK_PERIOD	(1U << 8)	/* 256 */
 +#define SET_AFFINITY_CHECK_PERIOD_MASK	(SET_AFFINITY_CHECK_PERIOD - 1)
 +
  /* Data structure that identifies a call_rcu thread. */
 
  struct call_rcu_data {
 @@ -62,6 +65,7 @@ struct call_rcu_data {
   unsigned long qlen; /* maintained for debugging. */
   pthread_t tid;
   int cpu_affinity;
 + unsigned long gp_count;
   struct cds_list_head list;
  } __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
 
 @@ -203,22 +207,42 @@ static void call_rcu_unlock(pthread_mutex_t *pmp)
   urcu_die(ret);
  }
 
 +/*
 + * Periodically retry setting CPU affinity if we migrate.
 + * Losing affinity can be caused by CPU hotunplug/hotplug, or by
 + * cpuset(7).
 + */
  #if HAVE_SCHED_SETAFFINITY
  static
  int set_thread_cpu_affinity(struct call_rcu_data *crdp)
  {
   cpu_set_t mask;
 + int ret;
 
 	if (crdp->cpu_affinity < 0)
 		return 0;
 +	if (++crdp->gp_count & SET_AFFINITY_CHECK_PERIOD_MASK)
 +		return 0;
 +	if (urcu_sched_getcpu() != crdp->cpu_affinity)

Don't we want == here instead of !=?

 +		return 0;
 
 	CPU_ZERO(&mask);
 	CPU_SET(crdp->cpu_affinity, &mask);
  #if SCHED_SETAFFINITY_ARGS == 2
 -	return sched_setaffinity(0, &mask);
 +	ret = sched_setaffinity(0, &mask);
  #else
 -	return sched_setaffinity(0, sizeof(mask), &mask);
 +	ret = sched_setaffinity(0, sizeof(mask), &mask);
  #endif
 + /*
 +  * EINVAL is fine: can be caused by hotunplugged CPUs, or by
 +  * cpuset(7). This is why we should always retry is we detect

s/is we detect/if we detect/

 +  * migration.
 +  */
 +	if (ret && errno == EINVAL) {
 + ret = 0;
 + errno = 0;
 + }
 + return ret;
  }
  #else
  static
 @@ -275,10 +299,8 @@ static void *call_rcu_thread(void *arg)
   unsigned long cbcount;
   struct call_rcu_data *crdp = (struct call_rcu_data *) arg;
 	int rt = !!(uatomic_read(&crdp->flags) & URCU_CALL_RCU_RT);
 - int ret;
 
 - ret = set_thread_cpu_affinity(crdp);
 - if (ret)
 + if (set_thread_cpu_affinity(crdp))
   urcu_die(errno);
 
   /*
 @@ -298,6 +320,9 @@ static void *call_rcu_thread(void *arg)
   struct cds_wfcq_node *cbs, *cbs_tmp_n;
   enum cds_wfcq_ret splice_ret;
 
 + if (set_thread_cpu_affinity(crdp))
 + urcu_die(errno);
 +
 	if (uatomic_read(&crdp->flags) & URCU_CALL_RCU_PAUSE) {
   /*
* Pause requested. Become quiescent: remove
 @@ -391,6 +416,7 @@ static void call_rcu_data_init(struct call_rcu_data 
 **crdpp,
 	crdp->flags = flags;
 	cds_list_add(&crdp->list, &call_rcu_data_list);
 	crdp->cpu_affinity = cpu_affinity;
 +	crdp->gp_count = 0;
 	cmm_smp_mb();  /* Structure initialized before pointer is planted. */
 	*crdpp = crdp;
 	ret = pthread_create(&crdp->tid, NULL, call_rcu_thread, crdp);
 -- 
 2.1.4
 




Re: [lttng-dev] [PATCH] Fix: call_rcu_thread() affinity failure

2015-06-29 Thread Paul E. McKenney
On Mon, Jun 29, 2015 at 11:06:15PM +, Mathieu Desnoyers wrote:
 - On Jun 29, 2015, at 7:01 PM, Paul E. McKenney 
 paul...@linux.vnet.ibm.com wrote:
 
  On Mon, Jun 29, 2015 at 06:56:34PM -0400, Mathieu Desnoyers wrote:
  Make call_rcu_thread() affine itself more persistently
  
  Currently, URCU simply fails if a call_rcu_thread() fails to affine
  itself. This is problematic when execution is constrained by cgroup
  and hotunplugged CPUs. This commit therefore makes call_rcu_thread()
  retry setting its affinity every 256 grace periods, but only if it
  detects that it migrated to a different CPU. Since sched_getcpu() is
  cheap on many architectures, this check is less costly than going
  through a system call.
  
  Reported-by: Michael Jeanson mjean...@efficios.com
  Suggested-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  
  A couple of issues, but otherwise good.  (They might even be issues
  with your code rather than my eyes, you never know!)
 
 Your eyes are very good indeed. Fixing those 2 nits, and adding
 your Acked-by.

Very good!

That said, I would feel better about my eyes had they not inserted the
bug being fixed.  ;-)

Thanx, Paul

 Thanks!
 
 Mathieu
 
  
  Thanx, Paul
  
  ---
   urcu-call-rcu-impl.h | 36 +++-
   1 file changed, 31 insertions(+), 5 deletions(-)
  
  diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
  index 5cc02d9..b82a59b 100644
  --- a/urcu-call-rcu-impl.h
  +++ b/urcu-call-rcu-impl.h
  @@ -45,6 +45,9 @@
   #include urcu/ref.h
   #include urcu-die.h
  
  +#define SET_AFFINITY_CHECK_PERIOD	(1U << 8)	/* 256 */
  +#define SET_AFFINITY_CHECK_PERIOD_MASK	(SET_AFFINITY_CHECK_PERIOD - 1)
  +
   /* Data structure that identifies a call_rcu thread. */
  
   struct call_rcu_data {
  @@ -62,6 +65,7 @@ struct call_rcu_data {
 unsigned long qlen; /* maintained for debugging. */
 pthread_t tid;
 int cpu_affinity;
  +  unsigned long gp_count;
 struct cds_list_head list;
   } __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
  
  @@ -203,22 +207,42 @@ static void call_rcu_unlock(pthread_mutex_t *pmp)
 urcu_die(ret);
   }
  
  +/*
  + * Periodically retry setting CPU affinity if we migrate.
  + * Losing affinity can be caused by CPU hotunplug/hotplug, or by
  + * cpuset(7).
  + */
   #if HAVE_SCHED_SETAFFINITY
   static
   int set_thread_cpu_affinity(struct call_rcu_data *crdp)
   {
 cpu_set_t mask;
  +  int ret;
  
  	if (crdp->cpu_affinity < 0)
  		return 0;
  +	if (++crdp->gp_count & SET_AFFINITY_CHECK_PERIOD_MASK)
  +		return 0;
  +	if (urcu_sched_getcpu() != crdp->cpu_affinity)
  
  Don't we want == here instead of !=?
  
  +		return 0;
  
  	CPU_ZERO(&mask);
  	CPU_SET(crdp->cpu_affinity, &mask);
   #if SCHED_SETAFFINITY_ARGS == 2
  -	return sched_setaffinity(0, &mask);
  +	ret = sched_setaffinity(0, &mask);
   #else
  -	return sched_setaffinity(0, sizeof(mask), &mask);
  +	ret = sched_setaffinity(0, sizeof(mask), &mask);
   #endif
  +  /*
  +   * EINVAL is fine: can be caused by hotunplugged CPUs, or by
  +   * cpuset(7). This is why we should always retry is we detect
  
  s/is we detect/if we detect/
  
  +   * migration.
  +   */
  +	if (ret && errno == EINVAL) {
  +  ret = 0;
  +  errno = 0;
  +  }
  +  return ret;
   }
   #else
   static
  @@ -275,10 +299,8 @@ static void *call_rcu_thread(void *arg)
 unsigned long cbcount;
 struct call_rcu_data *crdp = (struct call_rcu_data *) arg;
  	int rt = !!(uatomic_read(&crdp->flags) & URCU_CALL_RCU_RT);
  -  int ret;
  
  -  ret = set_thread_cpu_affinity(crdp);
  -  if (ret)
  +  if (set_thread_cpu_affinity(crdp))
 urcu_die(errno);
  
 /*
  @@ -298,6 +320,9 @@ static void *call_rcu_thread(void *arg)
 struct cds_wfcq_node *cbs, *cbs_tmp_n;
 enum cds_wfcq_ret splice_ret;
  
  +  if (set_thread_cpu_affinity(crdp))
  +  urcu_die(errno);
  +
  	if (uatomic_read(&crdp->flags) & URCU_CALL_RCU_PAUSE) {
 /*
  * Pause requested. Become quiescent: remove
  @@ -391,6 +416,7 @@ static void call_rcu_data_init(struct call_rcu_data 
  **crdpp,
  	crdp->flags = flags;
  	cds_list_add(&crdp->list, &call_rcu_data_list);
  	crdp->cpu_affinity = cpu_affinity;
  +	crdp->gp_count = 0;
  	cmm_smp_mb();  /* Structure initialized before pointer is planted. */
  	*crdpp = crdp;
  	ret = pthread_create(&crdp->tid, NULL, call_rcu_thread, crdp);
  --
  2.1.4
 
 -- 
 Mathieu Desnoyers
 EfficiOS Inc.
 http://www.efficios.com
 




Re: [lttng-dev] [PATCH] Fix: deadlock when thread join is issued in read-side C.S. (v2)

2015-04-25 Thread Paul E. McKenney
On Sat, Apr 25, 2015 at 11:52:29AM -0400, Mathieu Desnoyers wrote:
 The transitive dependency between:
 
 RCU read-side C.S. -> synchronize_rcu -> rcu_gp_lock -> rcu_register_thread
 
 and the dependency:
 
 pthread_join -> awaiting for thread completion
 
 Can block a thread on join, and thus have the side-effect of deadlocking
 a thread doing a pthread_join while within a RCU read-side critical
 section. This join would be awaiting for completion of register_thread or
 rcu_unregister_thread, which may never complete because the rcu_gp_lock
 is held by synchronize_rcu executed from another thread.
 
 One solution to fix this is to add a new lock, rcu_registry_lock. This
 lock now protects the thread registry. It is released between iterations
 on the registry by synchronize_rcu, thus allowing thread
 registration/unregistration to complete even though synchronize_rcu is
 awaiting for RCU read-side critical sections to complete.
 
 Changes since v1:
 - Hold both rcu_gp_lock and rcu_registry_lock across fork in urcu-bp.
 
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com

Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com

 CC: Eugene Ivanov eugene.iva...@orc-group.com
 CC: Paul E. McKenney paul...@linux.vnet.ibm.com
 CC: Lai Jiangshan la...@cn.fujitsu.com
 CC: Stephen Hemminger step...@networkplumber.org
 ---
  urcu-bp.c   | 49 +++
  urcu-qsbr.c | 38 +
  urcu.c  | 63 
 -
  3 files changed, 125 insertions(+), 25 deletions(-)
 
 diff --git a/urcu-bp.c b/urcu-bp.c
 index 6b2875d..4dc4028 100644
 --- a/urcu-bp.c
 +++ b/urcu-bp.c
 @@ -99,7 +99,21 @@ void __attribute__((constructor)) rcu_bp_init(void);
  static
  void __attribute__((destructor)) rcu_bp_exit(void);
 
 +/*
 + * rcu_gp_lock ensures mutual exclusion between threads calling
 + * synchronize_rcu().
 + */
  static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;
 +/*
 + * rcu_registry_lock ensures mutual exclusion between threads
 + * registering and unregistering themselves to/from the registry, and
 + * with threads reading that registry from synchronize_rcu(). However,
 + * this lock is not held all the way through the completion of awaiting
 + * for the grace period. It is sporadically released between iterations
 + * on the registry.
 + * rcu_registry_lock may nest inside rcu_gp_lock.
 + */
 +static pthread_mutex_t rcu_registry_lock = PTHREAD_MUTEX_INITIALIZER;
 
  static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
  static int initialized;
 @@ -160,6 +174,10 @@ static void mutex_unlock(pthread_mutex_t *mutex)
   urcu_die(ret);
  }
 
 +/*
 + * Always called with rcu_registry lock held. Releases this lock between
 + * iterations and grabs it again. Holds the lock when it returns.
 + */
  static void wait_for_readers(struct cds_list_head *input_readers,
   struct cds_list_head *cur_snap_readers,
   struct cds_list_head *qsreaders)
 @@ -202,10 +220,14 @@ static void wait_for_readers(struct cds_list_head 
 *input_readers,
   if (cds_list_empty(input_readers)) {
   break;
   } else {
 + /* Temporarily unlock the registry lock. */
 +	mutex_unlock(&rcu_registry_lock);
   if (wait_loops = RCU_QS_ACTIVE_ATTEMPTS)
   (void) poll(NULL, 0, RCU_SLEEP_DELAY_MS);
   else
   caa_cpu_relax();
 + /* Re-lock the registry lock before the next loop. */
 +	mutex_lock(&rcu_registry_lock);
   }
   }
  }
 @@ -224,6 +246,8 @@ void synchronize_rcu(void)
 
 	mutex_lock(&rcu_gp_lock);
 
 +	mutex_lock(&rcu_registry_lock);
 +
 	if (cds_list_empty(&registry))
   goto out;
 
 @@ -234,6 +258,8 @@ void synchronize_rcu(void)
 
   /*
* Wait for readers to observe original parity or be quiescent.
 +  * wait_for_readers() can release and grab again rcu_registry_lock
 +	 * internally.
*/
 	wait_for_readers(&registry, &cur_snap_readers, &qsreaders);
 
 @@ -263,6 +289,8 @@ void synchronize_rcu(void)
 
   /*
* Wait for readers to observe new parity or be quiescent.
 +  * wait_for_readers() can release and grab again rcu_registry_lock
 +	 * internally.
*/
 	wait_for_readers(&cur_snap_readers, NULL, &qsreaders);
 
 @@ -277,6 +305,7 @@ void synchronize_rcu(void)
*/
   cmm_smp_mb();
  out:
 +	mutex_unlock(&rcu_registry_lock);
 	mutex_unlock(&rcu_gp_lock);
 	ret = pthread_sigmask(SIG_SETMASK, &oldmask, NULL);
   assert(!ret);
 @@ -485,9 +514,9 @@ void rcu_bp_register(void)
*/
   rcu_bp_init();
 
 -	mutex_lock(&rcu_gp_lock);
 +	mutex_lock(&rcu_registry_lock);
 	add_thread();
 -	mutex_unlock(&rcu_gp_lock

Re: [lttng-dev] Deadlock between call_rcu thread and RCU-bp thread doing registration in rcu_read_lock()

2015-04-17 Thread Paul E. McKenney
On Fri, Apr 17, 2015 at 12:23:46PM +0300, Eugene Ivanov wrote:
 Hi Mathieu,
 
 On 04/10/2015 11:26 PM, Mathieu Desnoyers wrote:
 - Original Message -
 Hi,
 
 I use rcu-bp (0.8.6) and get deadlock between call_rcu thread and
 threads willing to do rcu_read_lock():
 1. Some thread is in read-side critical section.
 2. call_rcu thread waits for readers in stack of rcu_bp_register(), i.e.
 holds mutex.
 3. Another thread enters into critical section via rcu_read_lock() and
 blocks on the mutex taken by thread 2.
 
 Such deadlock is quite unexpected for me. Especially if RCU is used for
 reference counting.
 Hi Eugene,
 
 Let's have a look at the reproducer below,
 
 Originally it happened with rculfhash, below is minimized reproducer:
 
 #include <pthread.h>
 #include <urcu-bp.h>
 
 struct Node
 {
   struct rcu_head rcu_head;
 };
 
 static void free_node(struct rcu_head * head)
 {
   struct Node *node = caa_container_of(head, struct Node, rcu_head);
   free(node);
 }
 
 static void * reader_thread(void * arg)
 {
   rcu_read_lock();
   rcu_read_unlock();
   return NULL;
 }
 
 int main(int argc, char * argv[])
 {
   rcu_read_lock();
   struct Node * node = malloc(sizeof(*node));
   call_rcu(&node->rcu_head, free_node);
 
   pthread_t read_thread_info;
   pthread_create(&read_thread_info, NULL, reader_thread, NULL);
   pthread_join(read_thread_info, NULL);
 This pthread_join blocks until reader_thread exits. It blocks
 while holding the RCU read-side lock. Quoting README.md:
 
 ### Interaction with mutexes
 
 One must be careful to do not cause deadlocks due to interaction of
 `synchronize_rcu()` and RCU read-side with mutexes. If `synchronize_rcu()`
 is called with a mutex held, this mutex (or any mutex which has this
 mutex in its dependency chain) should not be acquired from within a RCU
 read-side critical section.
 
 This is especially important to understand in the context of the
 QSBR flavor: a registered reader thread being online by
 default should be considered as within a RCU read-side critical
 section unless explicitly put offline. Therefore, if
 `synchronize_rcu()` is called with a mutex held, this mutex, as
 well as any mutex which has this mutex in its dependency chain
 should only be taken when the RCU reader thread is offline
 (this can be performed by calling `rcu_thread_offline()`).
 
 So what appears to happen here is that urcu-bp lazy registration
 grabs the rcu_gp_lock when the first rcu_read_lock is encountered.
 This mutex is also held when synchronize_rcu() is awaiting on
 reader thread's completion. So synchronize_rcu() of the call_rcu
 thread can block on the read-side lock held by main() (awaiting
 on pthread_join), which blocks the lazy registration of reader_thread,
 because it needs to grab that same lock.
 
 So this issue here is caused by holding the RCU read-side lock
 while calling pthread_join.
 
 For the QSBR flavor, you will want to put the main() thread
 offline before awaiting on pthread_join.
 
 Does it answer your question ?
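
Concretely, the reproducer stops deadlocking once main() leaves the
read-side critical section before blocking, along these lines:

	rcu_read_lock();
	struct Node *node = malloc(sizeof(*node));
	call_rcu(&node->rcu_head, free_node);
	rcu_read_unlock();	/* end the C.S. first... */

	pthread_t read_thread_info;
	pthread_create(&read_thread_info, NULL, reader_thread, NULL);
	pthread_join(read_thread_info, NULL);	/* ...then it is safe to block */
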
 
 Thank you for the thorough explanation. The thing I still don't get is
 the case where a thread wants to hold the read lock for an arbitrarily
 long time to do some complicated data processing, e.g. walk through a
 huge hash table and send some network responses related to the data in
 the table. pthread_join() can be moved out of the C.S., and instead in
 the C.S. we can have sleep(1000) or just a long loop to demonstrate the
 case. Thread creation can be done somewhere else as well. Do I
 understand it correctly that if synchronize_rcu() is executed at the
 same time by the call_rcu thread, no other threads can be registered or
 unregistered until the reader has finished? Per the documentation this
 looks like correct RCU usage: I don't hold any mutexes, just one of the
 threads stays in the C.S. for a very long time, and the only mutex
 involved is rcu_gp_lock.

Hmmm...

One possible way to allow this use case (if desired) is to make
thread registration use trylock on rcu_gp_lock.  If this fails, they
unconditionally acquire an rcu_gp_fallback_lock, and add the thread
to a secondary list.  Then, while still holding rcu_gp_fallback_lock,
again trylock rcu_gp_lock.  If this succeeds, move the thread(s) to the
real list and release both locks, otherwise, release rcu_gp_fallback_lock
and leave.

In addition, just after the grace-period machinery releases rcu_gp_lock,
it acquires rcu_gp_fallback_lock.  If the secondary list is non-empty,
it then re-acquires rcu_gp_lock and moves the threads to the real list.
Finally, of course, it releases all the locks that it acquired.

The reason that this works is that a given grace period need not wait on
threads that didn't exist before that grace period started.  Note that
this relies on trylock never having spurious failures, which is guaranteed
by POSIX (but sadly not C/C++'s shiny new part-of-language locks).

Seem reasonable?
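
In hypothetical code (none of these names exist in liburcu), the scheme
could look like:

#include <pthread.h>
#include <urcu/list.h>

static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t rcu_gp_fallback_lock = PTHREAD_MUTEX_INITIALIZER;
static CDS_LIST_HEAD(registry);
static CDS_LIST_HEAD(fallback_registry);

struct rcu_reader_sketch {
	struct cds_list_head node;
};

void register_thread(struct rcu_reader_sketch *me)
{
	if (!pthread_mutex_trylock(&rcu_gp_lock)) {
		cds_list_add(&me->node, &registry);	/* fast path */
		pthread_mutex_unlock(&rcu_gp_lock);
		return;
	}
	/* Grace period in progress: park on the fallback list. */
	pthread_mutex_lock(&rcu_gp_fallback_lock);
	cds_list_add(&me->node, &fallback_registry);
	if (!pthread_mutex_trylock(&rcu_gp_lock)) {	/* retry once */
		cds_list_splice(&fallback_registry, &registry);
		CDS_INIT_LIST_HEAD(&fallback_registry);
		pthread_mutex_unlock(&rcu_gp_lock);
	}
	pthread_mutex_unlock(&rcu_gp_fallback_lock);
}

/* Run by the grace-period machinery just after it drops rcu_gp_lock. */
void flush_fallback_registrations(void)
{
	pthread_mutex_lock(&rcu_gp_fallback_lock);
	if (!cds_list_empty(&fallback_registry)) {
		pthread_mutex_lock(&rcu_gp_lock);
		cds_list_splice(&fallback_registry, &registry);
		CDS_INIT_LIST_HEAD(&fallback_registry);
		pthread_mutex_unlock(&rcu_gp_lock);
	}
	pthread_mutex_unlock(&rcu_gp_fallback_lock);
}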

   

Re: [lttng-dev] Deadlock between call_rcu thread and RCU-bp thread doing registration in rcu_read_lock()

2015-04-17 Thread Paul E. McKenney
On Fri, Apr 17, 2015 at 07:18:18AM -0700, Paul E. McKenney wrote:
 On Fri, Apr 17, 2015 at 12:23:46PM +0300, Eugene Ivanov wrote:
  Hi Mathieu,
  
  On 04/10/2015 11:26 PM, Mathieu Desnoyers wrote:
  - Original Message -
  Hi,
  
  I use rcu-bp (0.8.6) and get deadlock between call_rcu thread and
  threads willing to do rcu_read_lock():
  1. Some thread is in read-side critical section.
  2. call_rcu thread waits for readers in stack of rcu_bp_register(), i.e.
  holds mutex.
  3. Another thread enters into critical section via rcu_read_lock() and
  blocks on the mutex taken by thread 2.
  
  Such deadlock is quite unexpected for me. Especially if RCU is used for
  reference counting.
  Hi Eugene,
  
  Let's have a look at the reproducer below,
  
  Originally it happened with rculfhash, below is minimized reproducer:
  
  #include <pthread.h>
  #include <urcu-bp.h>
  
  struct Node
  {
struct rcu_head rcu_head;
  };
  
  static void free_node(struct rcu_head * head)
  {
struct Node *node = caa_container_of(head, struct Node, 
   rcu_head);
free(node);
  }
  
  static void * reader_thread(void * arg)
  {
rcu_read_lock();
rcu_read_unlock();
return NULL;
  }
  
  int main(int argc, char * argv[])
  {
rcu_read_lock();
struct Node * node = malloc(sizeof(*node));
call_rcu(&node->rcu_head, free_node);
  
pthread_t read_thread_info;
pthread_create(&read_thread_info, NULL, reader_thread, NULL);
pthread_join(read_thread_info, NULL);
  This pthread_join blocks until reader_thread exits. It blocks
  while holding the RCU read-side lock. Quoting README.md:
  
  ### Interaction with mutexes
  
  One must be careful to do not cause deadlocks due to interaction of
  `synchronize_rcu()` and RCU read-side with mutexes. If `synchronize_rcu()`
  is called with a mutex held, this mutex (or any mutex which has this
  mutex in its dependency chain) should not be acquired from within a RCU
  read-side critical section.
  
  This is especially important to understand in the context of the
  QSBR flavor: a registered reader thread being online by
  default should be considered as within a RCU read-side critical
  section unless explicitly put offline. Therefore, if
  `synchronize_rcu()` is called with a mutex held, this mutex, as
  well as any mutex which has this mutex in its dependency chain
  should only be taken when the RCU reader thread is offline
  (this can be performed by calling `rcu_thread_offline()`).
  
  So what appears to happen here is that urcu-bp lazy registration
  grabs the rcu_gp_lock when the first rcu_read_lock is encountered.
  This mutex is also held when synchronize_rcu() is awaiting on
  reader thread's completion. So synchronize_rcu() of the call_rcu
  thread can block on the read-side lock held by main() (awaiting
  on pthread_join), which blocks the lazy registration of reader_thread,
  because it needs to grab that same lock.
  
  So this issue here is caused by holding the RCU read-side lock
  while calling pthread_join.
  
  For the QSBR flavor, you will want to put the main() thread
  offline before awaiting on pthread_join.
  
  Does it answer your question ?
  
  Thank you for the thorough explanation. The thing I still don't get is
  the case where a thread wants to hold the read lock for an arbitrarily
  long time to do some complicated data processing, e.g. walk through a
  huge hash table and send some network responses related to the data in
  the table. pthread_join() can be moved out of the C.S., and instead in
  the C.S. we can have sleep(1000) or just a long loop to demonstrate the
  case. Thread creation can be done somewhere else as well. Do I
  understand it correctly that if synchronize_rcu() is executed at the
  same time by the call_rcu thread, no other threads can be registered or
  unregistered until the reader has finished? Per the documentation this
  looks like correct RCU usage: I don't hold any mutexes, just one of the
  threads stays in the C.S. for a very long time, and the only mutex
  involved is rcu_gp_lock.
 
 Hmmm...
 
 One possible way to allow this use case (if desired) is to make
 thread registration use trylock on rcu_gp_lock.  If this fails, they
 unconditionally acquire an rcu_gp_fallback_lock, and add the thread
 to a secondary list.  Then, while still holding rcu_gp_fallback_lock,
 again trylock rcu_gp_lock.  If this succeeds, move the thread(s) to the
 real list and release both locks, otherwise, release rcu_gp_fallback_lock
 and leave.
 
 In addition, just after the grace-period machinery releases rcu_gp_lock,
 it acquires rcu_gp_fallback_lock.  If the secondary list is non-empty,
 it then re-acquires rcu_gp_lock and moves the threads to the real list.
 Finally, of course, it releases all the locks that it acquired.
 
 The reason that this works is that a given grace period need not wait on
 threads that didn't exist before that grace period started.  Note
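 
 A sketch of the scheme just described, with illustrative names (the
 locks and lists below are stand-ins, not actual liburcu internals):
 
 #include <pthread.h>
 #include <stddef.h>
 
 struct reader {
 	struct reader *next;
 	/* per-thread reader state would live here */
 };
 
 static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;
 static pthread_mutex_t rcu_gp_fallback_lock = PTHREAD_MUTEX_INITIALIZER;
 static struct reader *registry;		/* readers the grace period waits on */
 static struct reader *fallback_registry;	/* deferred registrations */
 
 /* Caller holds both locks. */
 static void move_fallback_to_registry(void)
 {
 	while (fallback_registry) {
 		struct reader *r = fallback_registry;
 
 		fallback_registry = r->next;
 		r->next = registry;
 		registry = r;
 	}
 }
 
 static void register_reader(struct reader *r)
 {
 	if (pthread_mutex_trylock(&rcu_gp_lock) == 0) {
 		r->next = registry;
 		registry = r;
 		pthread_mutex_unlock(&rcu_gp_lock);
 		return;
 	}
 	/* Grace period in progress: park on the secondary list. */
 	pthread_mutex_lock(&rcu_gp_fallback_lock);
 	r->next = fallback_registry;
 	fallback_registry = r;
 	/* Retry, in case the grace period completed meanwhile. */
 	if (pthread_mutex_trylock(&rcu_gp_lock) == 0) {
 		move_fallback_to_registry();
 		pthread_mutex_unlock(&rcu_gp_lock);
 	}
 	pthread_mutex_unlock(&rcu_gp_fallback_lock);
 }
 
 /* Called by the grace-period machinery just after it drops rcu_gp_lock. */
 static void gp_drain_fallback(void)
 {
 	pthread_mutex_lock(&rcu_gp_fallback_lock);
 	if (fallback_registry) {
 		pthread_mutex_lock(&rcu_gp_lock);
 		move_fallback_to_registry();
 		pthread_mutex_unlock(&rcu_gp_lock);
 	}
 	pthread_mutex_unlock(&rcu_gp_fallback_lock);
 }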

Re: [lttng-dev] Alternative to signals/sys_membarrier() in liburcu

2015-03-13 Thread Paul E. McKenney
On Fri, Mar 13, 2015 at 09:07:43AM +0100, Ingo Molnar wrote:
 
 * Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote:
 
  - Original Message -
   From: Linus Torvalds torva...@linux-foundation.org
   To: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   Cc: Michael Sullivan su...@msully.net, lttng-dev@lists.lttng.org, 
   LKML linux-ker...@vger.kernel.org, Paul E.
   McKenney paul...@linux.vnet.ibm.com, Peter Zijlstra 
   pet...@infradead.org, Ingo Molnar mi...@kernel.org,
   Thomas Gleixner t...@linutronix.de, Steven Rostedt 
   rost...@goodmis.org
   Sent: Thursday, March 12, 2015 5:47:05 PM
   Subject: Re: Alternative to signals/sys_membarrier() in liburcu
   
   On Thu, Mar 12, 2015 at 1:53 PM, Mathieu Desnoyers
   mathieu.desnoy...@efficios.com wrote:
   
So the question as it stands appears to be: would you be comfortable
having users abuse mprotect(), relying on its side-effect of issuing
a smp_mb() on each targeted CPU for the TLB shootdown, as
an effective implementation of process-wide memory barrier ?
   
   Be *very* careful.
   
   Just yesterday, in another thread (discussing the auto-numa TLB 
   performance regression), we were discussing skipping the TLB 
   invalidates entirely if the mprotect relaxes the protections.
 
 We have such code already in mm/mprotect.c, introduced in:
 
   10c1045f28e8 mm: numa: avoid unnecessary TLB flushes when setting NUMA 
 hinting entries
 
 which does:
 
 /* Avoid TLB flush if possible */
 if (pte_protnone(oldpte))
 continue;
 
   Because if you *used* to be read-only, and then mprotect() 
   something so that it is read-write, there really is no need to 
   send a TLB invalidate, at least on x86. You can just change the 
   page tables, and *if* any entries are stale in the TLB they'll 
   take a microfault on access and then just reload the TLB.
   
   So mprotect() to a more permissive mode is not necessarily 
   serializing.
  
  The idea here is to always mprotect() to a more restrictive mode, 
  which should trigger the TLB shootdown.
 
 So what happens if a CPU comes around that integrates TLB shootdown 
 management into its cache coherency protocol? In such a case IPI 
 traffic can be skipped: the memory bus messages take care of TLB 
 flushes in most cases.
 
 It's a natural optimization IMHO, because TLB flushes are conceptually 
 pretty close to the synchronization mechanisms inherent in data cache 
 coherency protocols:
 
 This could be implemented for example by a CPU that knows about ptes 
 and handles their modification differently: when a pte is modified it 
 will broadcast a MESI invalidation message not just for the cacheline 
 belonging to the pte's physical address, but also an 'invalidate TLB' 
 MESI message for the pte value's page.
 
 The TLB shootdown would either be guaranteed within the MESI 
 transaction, or there would be a deterministic timing guarantee, or 
 some explicit synchronization mechanism (a new instruction) to make 
 sure the remote TLB(s) got shot down.
 
 Every form of this would be way faster than sending interrupts. New 
 OSs could support this by the hardware telling them in which cases the 
 TLBs are 'auto-flushed', while old OSs would still be compatible by 
 sending (now pointless) TLB shootdown IPIs.
 
 So it's a relatively straightforward hardware optimization IMHO: 
 assuming TLB flushes are considered important enough to complicate the 
 cacheline state machine (which I think they currently aren't).
 
 So in this case there's no interrupt and no other interruption of the 
 remote CPU's flow of execution in any fashion that could advance the 
 RCU state machine.
 
 What do you think?

I agree -- there really have been systems able to flush remote TLBs
without interrupting the remote CPU.

So, given the fact that the userspace RCU library does now see
some real-world use, is it now time for Mathieu to resubmit his
sys_membarrier() patch?

Thanx, Paul




Re: [lttng-dev] Alternative to signals/sys_membarrier() in liburcu

2015-03-12 Thread Paul E. McKenney
On Thu, Mar 12, 2015 at 08:56:00PM +, Mathieu Desnoyers wrote:
 (sorry for re-send, my mail client tricked me into posting HTML
 to lkml)
 
 Hi, 
 
 Michael Sullivan proposed a clever hack abusing mprotect() to 
 achieve the same effect as the sys_membarrier() patch I submitted a few 
 years ago ( https://lkml.org/lkml/2010/4/18/15 ). 
 
 At that time, the sys_membarrier implementation was deemed 
 technically sound, but there were not enough users of the system call 
 to justify its inclusion. 
 
 So far, the number of users of liburcu has increased, but liburcu 
 still appears to be the only direct user of sys_membarrier. On this 
 front, we could argue that many other system calls have only 
 one user: glibc. In that respect, liburcu is quite similar to glibc. 
 
 So the question as it stands appears to be: would you be comfortable 
 having users abuse mprotect(), relying on its side-effect of issuing 
 a smp_mb() on each targeted CPU for the TLB shootdown, as 
 an effective implementation of process-wide memory barrier ? 
 
 Thoughts ? 
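 
 For reference, the hack in question looks roughly like this (a sketch
 only; it depends on unspecified kernel behavior, namely that downgrading
 page protections triggers a TLB-shootdown IPI that implies a full memory
 barrier on every CPU running the process):
 
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <unistd.h>
 
 static void *dummy_page;
 
 static void force_mb_all_threads(void)
 {
 	long page = sysconf(_SC_PAGESIZE);
 
 	if (!dummy_page) {
 		dummy_page = mmap(NULL, page, PROT_READ | PROT_WRITE,
 				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 		if (dummy_page == MAP_FAILED)
 			abort();
 	}
 	/* Dirty the page so there is a TLB entry to shoot down. */
 	memset(dummy_page, 1, 1);
 	/* Tighten protections: the kernel must flush remote TLBs. */
 	if (mprotect(dummy_page, page, PROT_READ))
 		abort();
 	/* Restore write access for the next round. */
 	if (mprotect(dummy_page, page, PROT_READ | PROT_WRITE))
 		abort();
 }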

Are there any architectures left that use hardware-assisted global
TLB invalidation?  On such an architecture, you might not get a memory
barrier except on the CPU executing the mprotect() or munmap().

(Here is hoping that no one does -- it is a cute abuse^Whack otherwise!)

Thanx, Paul

 Thanks! 
 
 Mathieu 
 
 
 
 
 
 From: Michael Sullivan su...@msully.net 
 To: Mathieu Desnoyers mathieu.desnoy...@efficios.com 
 Cc: lttng-dev@lists.lttng.org 
 Sent: Thursday, March 12, 2015 12:04:07 PM 
 Subject: Re: [lttng-dev] Alternative to signals/sys_membarrier() in liburcu 
 
 On Thu, Mar 12, 2015 at 10:57 AM, Mathieu Desnoyers  
 mathieu.desnoy...@efficios.com  wrote: 
 
 
 
 
 Even though it depends on internal behavior not currently specified by 
 mprotect, 
 I'd very much like to see the prototype you have, 
 
 
 I ended up posting my code at 
 https://github.com/msullivan/userspace-rcu/tree/msync-barrier . 
 The interesting patch is 
 https://github.com/msullivan/userspace-rcu/commit/04656b468d418efbc5d934ab07954eb8395a7ab0
  . 
 
 Quick blog post I wrote about it at 
 http://www.msully.net/blog/2015/02/24/forcing-memory-barriers-on-other-cpus-with-mprotect2/
  . 
 (I talked briefly about sys_membarrier in the post as best as I could piece 
 together from LKML; if my comment on it is inaccurate I can edit the post.) 
 
 -Michael Sullivan 
 
 
 
 -- 
 Mathieu Desnoyers 
 EfficiOS Inc. 
 http://www.efficios.com 
 
 




Re: [lttng-dev] Xeon Phi memory barriers

2013-12-06 Thread Paul E. McKenney
On Fri, Dec 06, 2013 at 08:15:38PM +, Mathieu Desnoyers wrote:
 - Original Message -
  From: Simon Marchi simon.mar...@polymtl.ca
  To: lttng-dev@lists.lttng.org
  Sent: Tuesday, November 19, 2013 4:26:06 PM
  Subject: [lttng-dev] Xeon Phi memory barriers
  
  Hello there,
 
 Hi Simon,
 
 While reading this reply, please keep in mind that I'm in a
 mindset where I've been in a full week of meetings, and it's late on
 Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can
 debunk my answer :)
 
  
  liburcu does not build on the Intel Xeon Phi, because the chip is
  recognized as x86_64, but lacks the {s,l,m}fence instructions found on
  usual x86_64 processors. The following is taken from the Xeon Phi dev
  guide:
 
 Let's have a look:
 
  
  The Intel® Xeon Phi™ coprocessor memory model is the same as that of
  the Intel® Pentium processor. The reads and writes always appear in
  programmed order at the system bus (or the ring interconnect in the
  case of the Intel® Xeon Phi™ coprocessor); the exception being that
  read misses are permitted to go ahead of buffered writes on the system
  bus when all the buffered writes are cached hits and are, therefore,
  not directed to the same address being accessed by the read miss.
 
 OK, so reads can be reordered with respect to following writes.

That would be -preceding- writes, correct?

  As a consequence of its stricter memory ordering model, the Intel®
  Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
  instructions that provide a more efficient way of controlling memory
  ordering on other Intel processors.
 
 I guess sfence and lfence are indeed completely useless, because we only
 can ever care about ordering reads vs writes (mfence). But even the mfence
 is not there.

The usual approach is an atomic operation to a dummy location on the
stack.  Is that the recommendation for Xeon Phi?

Either way, what should userspace RCU do to detect that it is being built
on a Xeon Phi?  I am sure that Mathieu would welcome the relevant patches
for this.  ;-)

  While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
  program order on the system bus,
 
 This part of the sentence seems misleading to me. Didn't the first
 sentence state the opposite? "the exception being that
 read misses are permitted to go ahead of buffered writes on the system
 bus when all the buffered writes are cached hits and are, therefore,
 not directed to the same address being accessed by the read miss."
 
 I'm probably missing something.

The trick might be that read misses are only allowed to pass write
-hits-, which would mean that the system bus would have already seen
the invalidate corresponding to the delayed write, and thus would
have no evidence of any misordering.

  the compiler can still reorder
  unrelated memory operations while maintaining program order on a
  single Intel® Xeon Phi™ coprocessor (hardware thread). If software
  running on an Intel® Xeon Phi™ coprocessor is dependent on the order
  of memory operations on another Intel® Xeon Phi™ coprocessor then a
  serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
  between the memory operations is required to guarantee completion of
  all memory accesses issued prior to the serializing instruction before
  any subsequent memory operations are started.

OK, sounds like my guess of atomic instruction to dummy stack location
is correct, or perhaps carrying out a nearby assignment using an
xchg instruction.

  (end of quote)
  
  From what I understand, it is safe to leave out any run-time memory
  barriers, but we still need barriers that prevent the compiler from
  reordering (using __asm__ __volatile__ ("" ::: "memory")). In
  urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
  memory barriers result in both compile-time and run-time memory
  barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory").
  I guess this would work for the Phi, but the lock instruction does not
  seem necessary.
 
 Actually, either a cpuid (core serializing) instruction or a lock-prefixed
 instruction (which serializes memory accesses as a side-effect) seems required.

It would certainly be safe.  One approach would be to keep it that way
unless/until someone showed it to be unnecessary.

  So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
  for the Phi and go on with our lives, or should we add a specific
  config for this case?
 
 I _think_ we could get away with this mapping:
 
 smp_wmb() -> barrier()
   reasoning: write vs write are not reordered by the processor.
 
 smp_rmb() -> barrier()
   reasoning: read vs read not reordered by processor.
 
 smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory")
   (or a cpuid instruction)
   reasoning: cpu can reorder reads vs later writes.
 
 smp_read_barrier_depends() -> nothing at all (not needed at any level).

This should be safe, though I would argue for do { } while (0).
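
Expressed as arch-header macros, the proposed mapping would look
something like this (a sketch; the CONFIG_URCU_ARCH_K1OM guard and the
64-bit %%rsp spelling are assumptions, not actual urcu code):

#define cmm_barrier()	__asm__ __volatile__ ("" : : : "memory")

#ifdef CONFIG_URCU_ARCH_K1OM	/* hypothetical Xeon Phi guard */
/* Stores are not reordered vs. stores, loads are not reordered vs.
 * loads, so compiler barriers suffice... */
#define cmm_smp_wmb()	cmm_barrier()
#define cmm_smp_rmb()	cmm_barrier()
/* ...but loads may pass earlier buffered stores, and there is no
 * mfence: use a serializing lock-prefixed no-op instead. */
#define cmm_smp_mb()	\
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
#endif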

Re: [lttng-dev] bug in urcu

2013-11-04 Thread Paul E. McKenney
On Sun, Nov 03, 2013 at 02:13:52PM +, Mathieu Desnoyers wrote:
 - Original Message -
  From: Paul E. McKenney paul...@linux.vnet.ibm.com
  To: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  Cc: Vladimir Nikulichev n...@tbricks.com, lttng-dev@lists.lttng.org
  Sent: Sunday, November 3, 2013 9:03:59 AM
  Subject: Re: [lttng-dev] bug in urcu
  
  On Fri, Nov 01, 2013 at 08:18:59PM +, Mathieu Desnoyers wrote:
   - Original Message -
From: Mathieu Desnoyers mathieu.desnoy...@efficios.com
To: Vladimir Nikulichev n...@tbricks.com
Cc: lttng-dev@lists.lttng.org, Paul E. McKenney
paul...@linux.vnet.ibm.com
Sent: Friday, November 1, 2013 9:55:14 AM
Subject: Re: [lttng-dev] bug in urcu

- Original Message -
 From: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 To: Vladimir Nikulichev n...@tbricks.com
 Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
 Sent: Friday, November 1, 2013 9:42:16 AM
 Subject: Re: bug in urcu
 
 - Original Message -
  From: Vladimir Nikulichev n...@tbricks.com
  To: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
  Sent: Tuesday, October 22, 2013 1:32:11 PM
  Subject: Re: bug in urcu
  
   
   It looks like an issue with the thread-local storage (TLS)
   compatibility
   layer.
   
   Can you show me the output of ./configure on that machine ? I'm
   especially
   interested in the output of:
   
   Thread Local Storage (TLS): __thread. (example on my machine)
  
  It uses pthread_getspecific() by default:
 
 Good catch!
 
 looking at the output of
 
 nm tests/unit/.libs/test_urcu_multiflavor :
 
  U __tls_access_rcu_reader
 
 seems to be the issue. We're missing macro expansion in tls-compat.h.
 With
 the patch attached:
 
  U __tls_access_rcu_reader_bp
  U __tls_access_rcu_reader_mb
  U __tls_access_rcu_reader_memb
  U __tls_access_rcu_reader_sig
 
 which should fix your issue. Can you try it out and let me know if it
 fixes
 your problem ?

Extra question (Paul ? Adding lttng-dev in CC):

Please note that this affects an unusual configuration of userspace RCU
(with TLS pthread key fallback), needed for some BSD that don't support
compiler TLS. Strictly speaking, this should require bumping the URCU
library soname version major number, because it breaks the ABI presented
to applications on those unusual configurations. However, since this is
only for unusual configurations, I wonder if we should bump the soname
version major number or not ? If we do need to bump the soname, can we
really do this in a stable version fix (0.7, 0.8), or do we need to push
a 0.9 out and document the limitation for 0.7 and 0.8 ?
   
   I just found a way to keep ABI compatibility for 0.7 and 0.8, and abort()
   the application if it's using the old ABI (with symbol clash) when there
   are multiple instances of this symbol loaded. This involves tricks with
   weak symbols, a constructor, a reference count, and a wrapper around the
   bogus symbol, but it works !! Note that it only affects users of urcu that
   have _LGPL_SOURCE defined.
  
  Sounds good to me!  ;-)
  
  I trust that you have also added copious comments...
 
 Yes, especially within the 0.7 and 0.8 branches, within urcu/tls-compat.h:
 
 /*
  * The *_1() macros ensure macro parameters are expanded.
  *
  * __DEFINE_URCU_TLS_GLOBAL and __URCU_TLS_CALL exist for the sole
  * purpose of notifying applications compiled against non-fixed 0.7 and
  * 0.8 userspace RCU headers and using multiple flavors concurrently to
  * recompile against fixed userspace RCU headers.
  */
 
 as well as
 
 /*
  * Define with and without macro expansion to handle erroneous callers.
  * Trigger an abort() if the caller application uses the clashing symbol
  * if a weak symbol is overridden.
  */
 
 For the master branch, I added a much simpler comment:
 
 /*
  * The *_1() macros ensure macro parameters are expanded.
  */
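 
 The idiom being referred to is the classic two-level expansion; an
 illustrative sketch (not the exact tls-compat.h code):
 
 #define DEFINE_URCU_TLS_1(type, name)	\
 	type __tls_access_ ## name
 #define DEFINE_URCU_TLS(type, name)	\
 	DEFINE_URCU_TLS_1(type, name)
 
 /* With the per-flavor symbol mapping (e.g. #define rcu_reader
  * rcu_reader_bp), the extra level lets `name' be macro-expanded before
  * the ## paste, yielding __tls_access_rcu_reader_bp rather than the
  * clashing __tls_access_rcu_reader. */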

OK, those should cover it.  ;-)

Thanx, Paul

 Thanks,
 
 Mathieu
 
  
  Thanx, Paul
  
  checking whether gcc accepts -g... (cached) yes
  checking for gcc option to accept ISO C89... (cached) none needed
  checking whether gcc understands -c and -o together... (cached) yes
  checking dependency style of gcc... (cached) gcc3
  checking whether make sets $(MAKE)... (cached) yes
  checking how to print strings... printf
  checking for a sed that does not truncate output... /usr/bin/sed
  checking for grep that handles long lines and -e... /usr/bin/grep
  checking for egrep

Re: [lttng-dev] bug in urcu

2013-11-03 Thread Paul E. McKenney
On Fri, Nov 01, 2013 at 08:18:59PM +, Mathieu Desnoyers wrote:
 - Original Message -
  From: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  To: Vladimir Nikulichev n...@tbricks.com
  Cc: lttng-dev@lists.lttng.org, Paul E. McKenney 
  paul...@linux.vnet.ibm.com
  Sent: Friday, November 1, 2013 9:55:14 AM
  Subject: Re: [lttng-dev] bug in urcu
  
  - Original Message -
   From: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   To: Vladimir Nikulichev n...@tbricks.com
   Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
   Sent: Friday, November 1, 2013 9:42:16 AM
   Subject: Re: bug in urcu
   
   - Original Message -
From: Vladimir Nikulichev n...@tbricks.com
To: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Sent: Tuesday, October 22, 2013 1:32:11 PM
Subject: Re: bug in urcu

 
 It looks like an issue with the thread-local storage (TLS)
 compatibility
 layer.
 
 Can you show me the output of ./configure on that machine ? I'm
 especially
 interested in the output of:
 
 Thread Local Storage (TLS): __thread. (example on my machine)

It uses pthread_getspecific() by default:
   
   Good catch!
   
   looking at the output of
   
   nm tests/unit/.libs/test_urcu_multiflavor :
   
U __tls_access_rcu_reader
   
   seems to be the issue. We're missing macro expansion in tls-compat.h. With
   the patch attached:
   
U __tls_access_rcu_reader_bp
U __tls_access_rcu_reader_mb
U __tls_access_rcu_reader_memb
U __tls_access_rcu_reader_sig
   
   which should fix your issue. Can you try it out and let me know if it 
   fixes
   your problem ?
  
  Extra question (Paul ? Adding lttng-dev in CC):
  
  Please note that this affects an unusual configuration of userspace RCU
  (with TLS pthread key fallback), needed for some BSD that don't support
  compiler TLS. Strictly speaking, this should require bumping the URCU
  library soname version major number, because it breaks the ABI presented
  to applications on those unusual configurations. However, since this is
  only for unusual configurations, I wonder if we should bump the soname
  version major number or not ? If we do need to bump the soname, can we
  really do this in a stable version fix (0.7, 0.8), or do we need to push
  a 0.9 out and document the limitation for 0.7 and 0.8 ?
 
 I just found a way to keep ABI compatibility for 0.7 and 0.8, and abort() the 
 application if it's using the old ABI (with symbol clash) when there are 
 multiple instances of this symbol loaded. This involves tricks with weak 
 symbols, a constructor, a reference count, and a wrapper around the bogus 
 symbol, but it works !! Note that it only affects users of urcu that have 
 _LGPL_SOURCE defined.

Sounds good to me!  ;-)

I trust that you have also added copious comments...

Thanx, Paul

checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking whether gcc understands -c and -o together... (cached) yes
checking dependency style of gcc... (cached) gcc3
checking whether make sets $(MAKE)... (cached) yes
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc...
/usr/llvm-gcc-4.2/libexec/gcc/i686-apple-darwin11/4.2.1/ld
checking if the linker
(/usr/llvm-gcc-4.2/libexec/gcc/i686-apple-darwin11/4.2.1/ld) is GNU 
ld...
no
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm
checking the name lister (/usr/bin/nm) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 196608
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands +=... yes
checking how to convert x86_64-apple-darwin12.5.0 file names to
x86_64-apple-darwin12.5.0 format... func_convert_file_noop
checking how to convert x86_64-apple-darwin12.5.0 file names to 
toolchain
format... func_convert_file_noop
checking for /usr/llvm-gcc-4.2/libexec/gcc/i686-apple-darwin11/4.2.1/ld
option to reload object files... -r
checking for objdump... no
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @FILE support... no
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm output from gcc object... ok

Re: [lttng-dev] [RFC PATCH] timekeeping: introduce timekeeping_is_busy()

2013-09-11 Thread Paul E. McKenney
On Wed, Sep 11, 2013 at 02:54:41PM -0400, Mathieu Desnoyers wrote:
 * John Stultz (john.stu...@linaro.org) wrote:
  On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
 [...]
 
 Now focusing on features (the fix discussion is in a separate
 sub-thread):
 
  
   LTTng uses ktime to have the same time-base across kernel and
   user-space, so traces gathered from LTTng-modules and LTTng-UST can be
   correlated. We plan on using ktime until a fast, scalable, and
   fine-grained time-source for tracing that can be used across kernel and
   user-space, and which does not rely on read seqlock for kernel-level
   synchronization, makes its way into the kernel.
  
  So my fix for the issue aside, I could see cases where using timekeeping
  for tracing could run into similar issues, so something like your
  timekeeping_is_busy() check sounds reasonable.
 
 Yep, it would certainly make those use of ktime_get() more robust
 against internal changes.
 
  I might suggest we wrap
  the timekeeper locking in a helper function so we don't have the
  spinlock(); set_owner(); write_seqcount(); pattern all over the place
  (and it avoids me forgetting to set the owner in some future change,
  further mucking things up :).
 
 Good idea.
 
  As for your waiting for fast, scalable, and fine-grained time-source
  for tracing that can be used across kernel and user-space, and which
  does not rely on read seqlock for kernel-level synchronization wish,
  I'd be interested in hearing ideas if anyone has them.
 
 So far, the best I could come up with is this: using a RCU (or RCU-like)
 scheme to protect in-kernel timestamp reads (possibly with RCU sched,
 which is based on preemption disabling), and use a sequence lock to
 protect reads from user-space.
 
 Time updates within the kernel would have to deal with both RCU pointer
 update and track quiescent state, and would need to hold a write seqlock
 to synchronize against concurrent user-space reads.
 
  After getting the recent lock-hold reduction work merged in 3.10, I had
  some thoughts that maybe we could do some sort of rcu style timekeeper
  switch. The down side is that there really is a time bound in which the
  timekeeper state is valid for, so there would have to be some sort of
  seqcount style retry if we didn't finish the calculation within the
  valid bound (which can run into similar deadlock problems if the
  updater is delayed by a reader spinning waiting for an update).
 
 What could make a reader fail to finish the calculation within the valid
 time bound ? Besides preemption ? If it's caused by a too long
 interrupt, this will have an effect on the entire timekeeping, because
 the timer interrupt will likely be delayed, and therefore the periodical
 update changing the write seqlock value will be delayed too. So for the
 interrupt case, it looks like a too long interrupt (or interrupt
 disable section) will already disrupt timekeeping with the current
 design.
 
  
  Also there is the memory issue of having N timekeeper structures hanging
  around, since there could be many readers delayed mid-calculation, but
  that could probably be bound by falling back to a seqcount (and again,
  that opens up deadlock possibilities). Anyway, it all gets pretty
  complicated pretty quickly, which makes ensuring correctness even harder. :(
  
  But yea, I'd be interested in other ideas and approaches.
 
 If we can afford a synchronize_rcu_sched() wherever the write seqlock is
 needed, we could go with the following. Please note that I use
 synchronize_rcu_sched() rather than call_rcu_sched() here because I try
 to avoid having too many timekeeper structures hanging around, and I
 think it can be generally a good thing to ensure the timekeeping core does
 not depend on the memory allocator (but I could very well be wrong).

The issue called out with this the last time I remember it being put
forward was that grace periods can be delayed for longer than is an
acceptable gap between timekeeping updates.  But maybe something has
changed since then -- that was a few years ago.

Thanx, Paul

 In kernel/time/timekeeper.c:
 
 static DEFINE_MUTEX(timekeeper_mutex);
 static seqcount_t timekeeper_user_seq;
 
 struct timekeeper_rcu {
 struct timekeeper a[2];
 struct timekeeper *p;   /* current */
 };
 
 /* Timekeeper structure for kernel readers */
 static struct timekeeper_rcu timekeeper_rcu;
 
 /* Timekeeper structure for userspace readers */
 static struct timekeeper timekeeper_user;
 
 /* for updates */
 update_time()
 {
 struct timekeeper *next_p;
 
 mutex_lock(&timekeeper_mutex);
 
 /* RCU update, for kernel readers */
 if (timekeeper_rcu.p == &timekeeper_rcu.a[0])
 next_p = &timekeeper_rcu.a[1];
 else
 next_p = &timekeeper_rcu.a[0];
 
 timekeeper_copy(next_p, timekeeper_rcu.p);
 timekeeper_do_update(next_p, ...);
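 
 Presumably the update would then conclude by publishing the new
 snapshot, roughly as follows (a sketch extrapolating from the
 description above; timekeeper_copy() is the same hypothetical helper):
 
 /* Publish to kernel-side RCU readers. */
 rcu_assign_pointer(timekeeper_rcu.p, next_p);
 synchronize_rcu_sched();	/* old snapshot no longer referenced */
 
 /* Publish to userspace readers under the write seqlock. */
 write_seqcount_begin(&timekeeper_user_seq);
 timekeeper_copy(&timekeeper_user, next_p);
 write_seqcount_end(&timekeeper_user_seq);
 
 mutex_unlock(&timekeeper_mutex);
 }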
 
 

Re: [lttng-dev] [RFC] adding into middle of RCU list

2013-09-01 Thread Paul E. McKenney
On Sun, Sep 01, 2013 at 01:42:10PM -0700, Josh Triplett wrote:
 On Sat, Aug 31, 2013 at 02:32:28PM -0700, Paul E. McKenney wrote:
  On Thu, Aug 29, 2013 at 07:16:37PM -0700, Josh Triplett wrote:
   On Thu, Aug 29, 2013 at 05:57:33PM -0700, Paul E. McKenney wrote:
On Fri, Aug 23, 2013 at 02:08:22PM -0700, Paul E. McKenney wrote:
 On Fri, Aug 23, 2013 at 01:16:53PM -0400, Mathieu Desnoyers wrote:
  #define __rcu_assign_pointer(p, v, space) \
  do { \
  smp_wmb(); \
  (p) = (typeof(*v) __force space *)(v); \
  } while (0)
 
 Or I need to fix this one as well.  ;-)

In that vein...  Is there anything like typeof() that also preserves
sparse's notion of address space?  Wrapping an ACCESS_ONCE() around
p in the assignment above results in sparse errors.
   
   typeof() will preserve sparse's notion of address space as long as you
   do typeof(p), not typeof(*p):
   
   $ cat test.c
   #define as(n) __attribute__((address_space(n),noderef))
   #define __force __attribute__((force))
   
   int main(void)
   {
   int target = 0;
    int as(1) *foo = (__force typeof(target) as(1) *) &target;
   typeof(foo) bar = foo;
   return *bar;
   }
   $ sparse test.c
   test.c:9:13: warning: dereference of noderef expression
   
   Notice that sparse didn't warn on the assignment of foo to bar (because
   typeof propagated the address space of 1), and warned on the dereference
   of bar (because typeof propagated noderef).
  
  Thank you for the info!
  
  Suppose that I want to do something like this:
  
  #define __rcu_assign_pointer(p, v, space) \
  do { \
  smp_wmb(); \
  ACCESS_ONCE(p) = (typeof(*v) __force space *)(v); \
  } while (0)
  
  Now, this does typeof(*p), so as you noted above sparse complains about
  address-space mismatches.  Thus far, I haven't been able to come up with
  something that (1) does sparse address-space checking, (2) does C type
  checking, and (3) forces the assignment to be volatile.
  
  Any thoughts on how to do this?
 
 First of all, if p and v had compatible types *including* address
 spaces, you wouldn't need the space argument; the following
 self-contained test case passes both sparse and GCC typechecking:
 
 #define as(n) __attribute__((address_space(n),noderef))
 #define __force __attribute__((force))
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
 extern void smp_wmb(void);
 
 #define rcu_assign_pointer(p, v) \
 do { \
 smp_wmb(); \
 ACCESS_ONCE(p) = (v); \
 } while (0)
 
 struct foo;
 
 int main(void)
 {
 struct foo as(1) *dest;
 struct foo as(1) *src = (void *)0;
 
 rcu_assign_pointer(dest, src);
 
 return 0;
 }
 
 
 
 But in this case, you want dest and src to have compatible types except
 that dest must have the __rcu address space and src might not.  So,
 let's change the types of dest and src, and add the appropriate cast.
 The following also passes both GCC and sparse:
 
 #define __rcu __attribute__((address_space(4),noderef))
 #define __force __attribute__((force))
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
 extern void smp_wmb(void);
 
 #define rcu_assign_pointer(p, v) \
 do { \
 smp_wmb(); \
 ACCESS_ONCE(p) = (typeof(*(v)) __rcu __force *)(v); \
 } while (0)
 
 struct foo { int x; };
 
 int main(void)
 {
 struct foo __rcu *dest;
 struct foo *src = (void *)0;
 
 rcu_assign_pointer(dest, src);
 
 return 0;
 }
 
 
 However, that cast forces the source to have the __rcu address space
 without checking what address space it started out with.  If you want to
 verify that the source has the kernel address space, you can cast to
 that address space first, *without* __force, which will warn if the
 source doesn't start out with that address space:
 
 #define __kernel __attribute__((address_space(0)))
 #define __user __attribute__((address_space(1),noderef))
 #define __rcu __attribute__((address_space(4),noderef))
 #define __force __attribute__((force))
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
 extern void smp_wmb(void);
 
 #define rcu_assign_pointer(p, v) \
 do { \
 smp_wmb(); \
 ACCESS_ONCE(p) = (typeof(*(v)) __rcu __force *)(typeof(*(v)) __kernel *)(v); \
 } while (0)
 
 struct foo { int x; };
 
 int main(void)
 {
 struct foo __rcu *dest;
 struct foo *src = (void *)0;
 struct foo __user *badsrc = (void *)0;
 
 rcu_assign_pointer(dest, src);
 rcu_assign_pointer(dest, badsrc);
 
 return 0;
 }
 
 
 This produces a warning on the line using badsrc:
 
 test.c:23:5: warning: cast removes address space of expression
 
 However, that doesn't seem like the most obvious warning, since
 rcu_assign_pointer doesn't look like a cast, and since it doesn't print
 the full types involved like most address space warnings do.  So,
 instead, let's add and use a __chk_kernel_ptr

Re: [lttng-dev] [PATCH] rcu: Make rcu_assign_pointer's assignment volatile and type-safe

2013-09-01 Thread Paul E. McKenney
On Sun, Sep 01, 2013 at 04:42:52PM -0700, Josh Triplett wrote:
 rcu_assign_pointer needs to use ACCESS_ONCE to make the assignment to
 the destination pointer volatile, to protect against compilers too
 clever for their own good.
 
 In addition, since rcu_assign_pointer force-casts the source pointer to
 add the __rcu address space (overriding any existing address space), add
 an explicit check that the source pointer has the __kernel address space
 to start with.
 
 This new check produces warnings like this, when attempting to assign
 from a __user pointer:
 
 test.c:25:9: warning: incorrect type in argument 2 (different address spaces)
 test.c:25:9:    expected struct foo *noident
 test.c:25:9:    got struct foo [noderef] <asn:1> *badsrc
 
 Signed-off-by: Josh Triplett j...@joshtriplett.org

Queued for 3.13, thank you very much!

Thanx, Paul

 ---
  include/linux/rcupdate.h | 12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
 index 4b14bdc..3f62def 100644
 --- a/include/linux/rcupdate.h
 +++ b/include/linux/rcupdate.h
 @@ -510,8 +510,17 @@ static inline void rcu_preempt_sleep_check(void)
  #ifdef __CHECKER__
  #define rcu_dereference_sparse(p, space) \
   ((void)(((typeof(*p) space *)p) == p))
 +/* The dummy first argument in __rcu_assign_pointer_typecheck makes the
 + * typechecked pointer the second argument, matching rcu_assign_pointer
 + * itself; this avoids confusion about argument numbers in warning
 + * messages. */
 +#define __rcu_assign_pointer_check_kernel(v) \
 + do { \
 + extern void __rcu_assign_pointer_typecheck(int, typeof(*(v)) __kernel *); \
 + __rcu_assign_pointer_typecheck(0, v); \
 + } while (0)
  #else /* #ifdef __CHECKER__ */
  #define rcu_dereference_sparse(p, space)
 +#define __rcu_assign_pointer_check_kernel(v) do { } while (0)
  #endif /* #else #ifdef __CHECKER__ */
 
  #define __rcu_access_pointer(p, space) \
 @@ -555,7 +564,8 @@ static inline void rcu_preempt_sleep_check(void)
  #define __rcu_assign_pointer(p, v, space) \
   do { \
   smp_wmb(); \
 - (p) = (typeof(*v) __force space *)(v); \
 + __rcu_assign_pointer_check_kernel(v); \
 + ACCESS_ONCE(p) = (typeof(*(v)) __force space *)(v); \
   } while (0)
 
 




Re: [lttng-dev] [RFC] adding into middle of RCU list

2013-08-31 Thread Paul E. McKenney
On Thu, Aug 29, 2013 at 07:16:37PM -0700, Josh Triplett wrote:
 On Thu, Aug 29, 2013 at 05:57:33PM -0700, Paul E. McKenney wrote:
  On Fri, Aug 23, 2013 at 02:08:22PM -0700, Paul E. McKenney wrote:
   On Fri, Aug 23, 2013 at 01:16:53PM -0400, Mathieu Desnoyers wrote:
#define __rcu_assign_pointer(p, v, space) \
do { \
smp_wmb(); \
(p) = (typeof(*v) __force space *)(v); \
} while (0)
   
   Or I need to fix this one as well.  ;-)
  
  In that vein...  Is there anything like typeof() that also preserves
  sparse's notion of address space?  Wrapping an ACCESS_ONCE() around
  p in the assignment above results in sparse errors.
 
 typeof() will preserve sparse's notion of address space as long as you
 do typeof(p), not typeof(*p):
 
 $ cat test.c
 #define as(n) __attribute__((address_space(n),noderef))
 #define __force __attribute__((force))
 
 int main(void)
 {
 int target = 0;
 int as(1) *foo = (__force typeof(target) as(1) *) &target;
 typeof(foo) bar = foo;
 return *bar;
 }
 $ sparse test.c
 test.c:9:13: warning: dereference of noderef expression
 
 Notice that sparse didn't warn on the assignment of foo to bar (because
 typeof propagated the address space of 1), and warned on the dereference
 of bar (because typeof propagated noderef).

Thank you for the info!

Suppose that I want to do something like this:

#define __rcu_assign_pointer(p, v, space) \
do { \
smp_wmb(); \
ACCESS_ONCE(p) = (typeof(*v) __force space *)(v); \
} while (0)

Now, this does typeof(*p), so as you noted above sparse complains about
address-space mismatches.  Thus far, I haven't been able to come up with
something that (1) does sparse address-space checking, (2) does C type
checking, and (3) forces the assignment to be volatile.

Any thoughts on how to do this?

Thanx, Paul




Re: [lttng-dev] [RFC] adding into middle of RCU list

2013-08-29 Thread Paul E. McKenney
On Fri, Aug 23, 2013 at 02:08:22PM -0700, Paul E. McKenney wrote:
 On Fri, Aug 23, 2013 at 01:16:53PM -0400, Mathieu Desnoyers wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Thu, Aug 22, 2013 at 09:33:18PM -0700, Stephen Hemminger wrote:

[ . . . ]

+
+/**
+ * Splice an RCU-protected list into an existing list.
+ *
+ * Note that this function blocks in synchronize_rcu()
+ *
+ * Important note: this function is not called concurrently
+ *   with other updates to the list.
+ */
+static inline void caa_list_splice_init_rcu(struct cds_list_head *list,
+   struct cds_list_head *head)
+{
+   struct cds_list_head *first = list->next;
+   struct cds_list_head *last = list->prev;
+   struct cds_list_head *at = head->next;
+
+   if (cds_list_empty(list))
+   return;
+
+   /* first and last tracking list, so initialize it. */
+   CDS_INIT_LIST_HEAD(list);
   
   This change is happening in the presence of readers on the list, right?
   For this to work reliably in the presence of mischievous compilers,
   wouldn't CDS_INIT_LIST_HEAD() need to use CMM_ACCESS_ONCE() for its
   pointer accesses?
  
  Actually, we have rcu_assign_pointer()/rcu_set_pointer() exactly for
  this. They even skip the memory barrier if they store a NULL pointer.
  
   Hmmm...  The kernel version seems to have the same issue...
  
  The compiler memory model of the Linux kernel AFAIK does not require an
  ACCESS_ONCE() for stores to word-aligned, word-sized integers/pointers,
  even if those are expected to be read concurrently. For reference, see:
  
  #define __rcu_assign_pointer(p, v, space) \
  do { \
  smp_wmb(); \
  (p) = (typeof(*v) __force space *)(v); \
  } while (0)
 
 Or I need to fix this one as well.  ;-)

In that vein...  Is there anything like typeof() that also preserves
sparse's notion of address space?  Wrapping an ACCESS_ONCE() around
p in the assignment above results in sparse errors.

Thanx, Paul

Thanx, Paul




Re: [lttng-dev] [RFC] adding into middle of RCU list

2013-08-23 Thread Paul E. McKenney
On Thu, Aug 22, 2013 at 09:33:18PM -0700, Stephen Hemminger wrote:
 I needed to add into the middle of an RCU list, does this make sense.
 
 
 
 From a45892b0d49ac5fe449ba7e19c646cb17f7cee57 Mon Sep 17 00:00:00 2001
 From: Stephen Hemminger step...@networkplumber.org
 Date: Thu, 22 Aug 2013 21:27:04 -0700
 Subject: [PATCH] Add list_splice_init_rcu to allow insertion into a RCU list
 
 Simplified version of the version in kernel.
 ---
  urcu/rculist.h |   32 
  1 file changed, 32 insertions(+)
 
 diff --git a/urcu/rculist.h b/urcu/rculist.h
 index 1fd2df3..2e8a5a0 100644
 --- a/urcu/rculist.h
 +++ b/urcu/rculist.h
 @@ -72,6 +72,38 @@ void cds_list_del_rcu(struct cds_list_head *elem)
  CMM_STORE_SHARED(elem->prev->next, elem->next);
  }
  
 +
 +/**
 + * Splice an RCU-protected list into an existing list.
 + *
 + * Note that this function blocks in synchronize_rcu()
 + *
 + * Important note: this function is not called concurrently
 + *   with other updates to the list.
 + */
 +static inline void caa_list_splice_init_rcu(struct cds_list_head *list,
 + struct cds_list_head *head)
 +{
 + struct cds_list_head *first = list->next;
 + struct cds_list_head *last = list->prev;
 + struct cds_list_head *at = head->next;
 +
 + if (cds_list_empty(list))
 + return;
 +
 + /* first and last tracking list, so initialize it. */
 + CDS_INIT_LIST_HEAD(list);

This change is happening in the presence of readers on the list, right?
For this to work reliably in the presence of mischievous compilers,
wouldn't CDS_INIT_LIST_HEAD() need to use CMM_ACCESS_ONCE() for its
pointer accesses?

Hmmm...  The kernel version seems to have the same issue...
Patch below, FWIW.

Thanx, Paul

 +
 + /* Wait for any readers to finish using the list before splicing */
 + synchronize_rcu();
 +
 + /* Readers are finished with the source list, so perform splice. */
 + last->next = at;
 + rcu_assign_pointer(head->next, first);
 + first->prev = head;
 + at->prev = last;
 +}
 +
  /*
   * Iteration through all elements of the list must be done while 
 rcu_read_lock()
   * is held.
 -- 
 1.7.10.4

rcu: Make list_splice_init_rcu() account for RCU readers

The list_splice_init_rcu() function allows a list visible to RCU readers
to be spliced into another list visible to RCU readers.  This is OK,
except for the use of INIT_LIST_HEAD(), which does pointer updates
without doing anything to make those updates safe for concurrent readers.

Of course, most of the time INIT_LIST_HEAD() is being used in reader-free
contexts, such as initialization or cleanup, so it is OK for it to update
pointers in an unsafe-for-RCU-readers manner.  This commit therefore
creates an INIT_LIST_HEAD_RCU() that uses ACCESS_ONCE() to make the updates
reader-safe.  The reason that we can use ACCESS_ONCE() instead of the more
typical rcu_assign_pointer() is that list_splice_init_rcu() is updating the
pointers to reference something that is already visible to readers, so
that there is no problem with pre-initialized values.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 4106721..45a0a9e 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -19,6 +19,21 @@
  */
 
 /*
+ * INIT_LIST_HEAD_RCU - Initialize a list_head visible to RCU readers
+ * @list: list to be initialized
+ *
+ * You should instead use INIT_LIST_HEAD() for normal initialization and
+ * cleanup tasks, when readers have no access to the list being initialized.
+ * However, if the list being initialized is visible to readers, you
+ * need to keep the compiler from being too mischievous.
+ */
+static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
+{
+   ACCESS_ONCE(list->next) = list;
+   ACCESS_ONCE(list->prev) = list;
+}
+
+/*
 * return the ->next pointer of a list_head in an rcu safe
  * way, we must not access it directly
  */
@@ -191,9 +206,13 @@ static inline void list_splice_init_rcu(struct list_head 
*list,
if (list_empty(list))
return;
 
-   /* first and last tracking list, so initialize it. */
+   /*
+* first and last tracking list, so initialize it.  RCU readers
+* have access to this list, so we must use INIT_LIST_HEAD_RCU()
+* instead of INIT_LIST_HEAD().
+*/
 
-   INIT_LIST_HEAD(list);
+   INIT_LIST_HEAD_RCU(list);
 
/*
 * At this point, the list body still points to the source list.
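
A urcu-flavored equivalent of the same idea might look like this (a
hypothetical helper, not something liburcu ships):

#include <urcu/system.h>	/* CMM_STORE_SHARED() */
#include <urcu/list.h>

/* Reader-safe re-initialization of a list head that RCU readers may
 * still be traversing. */
static inline void cds_init_list_head_rcu(struct cds_list_head *list)
{
	CMM_STORE_SHARED(list->next, list);
	CMM_STORE_SHARED(list->prev, list);
}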




Re: [lttng-dev] [RFC] adding into middle of RCU list

2013-08-23 Thread Paul E. McKenney
On Fri, Aug 23, 2013 at 12:09:56PM -0700, Stephen Hemminger wrote:
 On Fri, 23 Aug 2013 13:16:53 -0400
 Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote:
 
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Thu, Aug 22, 2013 at 09:33:18PM -0700, Stephen Hemminger wrote:
I needed to add into the middle of an RCU list, does this make sense.



From a45892b0d49ac5fe449ba7e19c646cb17f7cee57 Mon Sep 17 00:00:00 2001
From: Stephen Hemminger step...@networkplumber.org
Date: Thu, 22 Aug 2013 21:27:04 -0700
Subject: [PATCH] Add list_splice_init_rcu to allow insertion into a RCU 
list

Simplified version of the version in kernel.
---
 urcu/rculist.h |   32 
 1 file changed, 32 insertions(+)

diff --git a/urcu/rculist.h b/urcu/rculist.h
index 1fd2df3..2e8a5a0 100644
--- a/urcu/rculist.h
+++ b/urcu/rculist.h
@@ -72,6 +72,38 @@ void cds_list_del_rcu(struct cds_list_head *elem)
CMM_STORE_SHARED(elem->prev->next, elem->next);
 }
 
+
+/**
+ * Splice an RCU-protected list into an existing list.
+ *
+ * Note that this function blocks in synchronize_rcu()
+ *
+ * Important note: this function is not called concurrently
+ *   with other updates to the list.
+ */
+static inline void caa_list_splice_init_rcu(struct cds_list_head *list,
+   struct cds_list_head *head)
+{
+   struct cds_list_head *first = list->next;
+   struct cds_list_head *last = list->prev;
+   struct cds_list_head *at = head->next;
+
+   if (cds_list_empty(list))
+   return;
+
+   /* first and last tracking list, so initialize it. */
+   CDS_INIT_LIST_HEAD(list);
   
   This change is happening in the presence of readers on the list, right?
   For this to work reliably in the presence of mischievous compilers,
   wouldn't CDS_INIT_LIST_HEAD() need to use CMM_ACCESS_ONCE() for its
   pointer accesses?
  
  Actually, we have rcu_assign_pointer()/rcu_set_pointer() exactly for
  this. They even skip the memory barrier if they store a NULL pointer.
  
   
   Hmmm...  The kernel version seems to have the same issue...
  
  The compiler memory model of the Linux kernel AFAIK does not require an
  ACCESS_ONCE() for stores to word-aligned, word-sized integers/pointers,
  even if those are expected to be read concurrently. For reference, see:
  
  #define __rcu_assign_pointer(p, v, space) \
  do { \
  smp_wmb(); \
  (p) = (typeof(*v) __force space *)(v); \
  } while (0)
  
  In userspace RCU, we require to match CMM_LOAD_SHARED() with
  CMM_STORE_SHARED() (which are used by
  rcu_dereference()/rcu_{set,assign}_pointer) whenever we concurrently
  access a variable shared between threads.
  
  So I recommend using rcu_set_pointer() in userspace RCU, but I don't
  think your patch is needed for Linux, given the Linux kernel compiler
  memory model that is less strict than userspace RCU's model.
  
  Thanks,
  
  Mathieu
  
  
   Patch below, FWIW.
   
 Thanx, Paul
   
+
+   /* Wait for any readers to finish using the list before 
splicing */
+   synchronize_rcu();
+
+   /* Readers are finished with the source list, so perform 
splice. */
+   last->next = at;
+   rcu_assign_pointer(head->next, first);
+   first->prev = head;
+   at->prev = last;
+}
+
 /*
  * Iteration through all elements of the list must be done while 
rcu_read_lock()
  * is held.
-- 
1.7.10.4
   
   rcu: Make list_splice_init_rcu() account for RCU readers
   
   The list_splice_init_rcu() function allows a list visible to RCU readers
   to be spliced into another list visible to RCU readers.  This is OK,
   except for the use of INIT_LIST_HEAD(), which does pointer updates
   without doing anything to make those updates safe for concurrent readers.
   
   Of course, most of the time INIT_LIST_HEAD() is being used in reader-free
   contexts, such as initialization or cleanup, so it is OK for it to update
   pointers in an unsafe-for-RCU-readers manner.  This commit therefore
   creates an INIT_LIST_HEAD_RCU() that uses ACCESS_ONCE() to make the 
   updates
   reader-safe.  The reason that we can use ACCESS_ONCE() instead of the more
   typical rcu_assign_pointer() is that list_splice_init_rcu() is updating 
   the
   pointers to reference something that is already visible to readers, so
   that there is no problem with pre-initialized values.
   
   Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
   
   diff --git a/include/linux/rculist.h b/include/linux/rculist.h
   index 4106721..45a0a9e 100644
   --- a/include/linux/rculist.h
   +++ b/include/linux/rculist.h
   @@ -19,6 +19,21

Re: [lttng-dev] [RFC] adding into middle of RCU list

2013-08-23 Thread Paul E. McKenney
On Fri, Aug 23, 2013 at 01:16:53PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Thu, Aug 22, 2013 at 09:33:18PM -0700, Stephen Hemminger wrote:
   I needed to add into the middle of an RCU list, does this make sense.
   
   
   
   From a45892b0d49ac5fe449ba7e19c646cb17f7cee57 Mon Sep 17 00:00:00 2001
   From: Stephen Hemminger step...@networkplumber.org
   Date: Thu, 22 Aug 2013 21:27:04 -0700
   Subject: [PATCH] Add list_splice_init_rcu to allow insertion into a RCU 
   list
   
   Simplified version of the version in kernel.
   ---
urcu/rculist.h |   32 
1 file changed, 32 insertions(+)
   
   diff --git a/urcu/rculist.h b/urcu/rculist.h
   index 1fd2df3..2e8a5a0 100644
   --- a/urcu/rculist.h
   +++ b/urcu/rculist.h
   @@ -72,6 +72,38 @@ void cds_list_del_rcu(struct cds_list_head *elem)
 CMM_STORE_SHARED(elem->prev->next, elem->next);
}

   +
   +/**
   + * Splice an RCU-protected list into an existing list.
   + *
   + * Note that this function blocks in synchronize_rcu()
   + *
   + * Important note: this function is not called concurrently
   + *   with other updates to the list.
   + */
   +static inline void caa_list_splice_init_rcu(struct cds_list_head *list,
   + struct cds_list_head *head)
   +{
    + struct cds_list_head *first = list->next;
    + struct cds_list_head *last = list->prev;
    + struct cds_list_head *at = head->next;
   +
   + if (cds_list_empty(list))
   + return;
   +
   + /* first and last tracking list, so initialize it. */
   + CDS_INIT_LIST_HEAD(list);
  
  This change is happening in the presence of readers on the list, right?
  For this to work reliably in the presence of mischievous compilers,
  wouldn't CDS_INIT_LIST_HEAD() need to use CMM_ACCESS_ONCE() for its
  pointer accesses?
 
 Actually, we have rcu_assign_pointer()/rcu_set_pointer() exactly for
 this. They even skip the memory barrier if they store a NULL pointer.
 
  
  Hmmm...  The kernel version seems to have the same issue...
 
 The compiler memory model of the Linux kernel AFAIK does not require an
 ACCESS_ONCE() for stores to word-aligned, word-sized integers/pointers,
 even if those are expected to be read concurrently. For reference, see:
 
 #define __rcu_assign_pointer(p, v, space) \
 do { \
 smp_wmb(); \
 (p) = (typeof(*v) __force space *)(v); \
 } while (0)

Or I need to fix this one as well.  ;-)

 In userspace RCU, we require to match CMM_LOAD_SHARED() with
 CMM_STORE_SHARED() (which are used by
 rcu_dereference()/rcu_{set,assign}_pointer) whenever we concurrently
 access a variable shared between threads.
 
 So I recommend using rcu_set_pointer() in userspace RCU, but I don't
 think your patch is needed for Linux, given the Linux kernel compiler
 memory model that is less strict than userspace RCU's model.

Me, I trust compilers a lot less than I did some years back.  ;-)

Thanx, Paul

 Thanks,
 
 Mathieu
 
 
  Patch below, FWIW.
  
  Thanx, Paul
  
   +
   + /* Wait for any readers to finish using the list before splicing */
   + synchronize_rcu();
   +
   + /* Readers are finished with the source list, so perform splice. */
    + last->next = at;
    + rcu_assign_pointer(head->next, first);
    + first->prev = head;
    + at->prev = last;
   +}
   +
/*
 * Iteration through all elements of the list must be done while 
   rcu_read_lock()
 * is held.
   -- 
   1.7.10.4
  
  rcu: Make list_splice_init_rcu() account for RCU readers
  
  The list_splice_init_rcu() function allows a list visible to RCU readers
  to be spliced into another list visible to RCU readers.  This is OK,
  except for the use of INIT_LIST_HEAD(), which does pointer updates
  without doing anything to make those updates safe for concurrent readers.
  
  Of course, most of the time INIT_LIST_HEAD() is being used in reader-free
  contexts, such as initialization or cleanup, so it is OK for it to update
  pointers in an unsafe-for-RCU-readers manner.  This commit therefore
  creates an INIT_LIST_HEAD_RCU() that uses ACCESS_ONCE() to make the updates
  reader-safe.  The reason that we can use ACCESS_ONCE() instead of the more
  typical rcu_assign_pointer() is that list_splice_init_rcu() is updating the
  pointers to reference something that is already visible to readers, so
  that there is no problem with pre-initialized values.
  
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  
  diff --git a/include/linux/rculist.h b/include/linux/rculist.h
  index 4106721..45a0a9e 100644
  --- a/include/linux/rculist.h
  +++ b/include/linux/rculist.h
  @@ -19,6 +19,21 @@
*/
   
   /*
  + * INIT_LIST_HEAD_RCU - Initialize a list_head visible to RCU readers
  + * @list: list to be initialized
  + *
  + * You should instead use

Re: [lttng-dev] [RFC PATCH urcu] Implement rcu_barrier()

2013-06-05 Thread Paul E. McKenney
On Fri, May 31, 2013 at 11:35:17AM -0400, Mathieu Desnoyers wrote:
 Waits for all in-flight call_rcu handlers to complete execution before
 returning.
 
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com

One suggestion below, looks good in general.

Thanx, Paul

 ---
 diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
 index f7f0f71..fb3568f 100644
 --- a/urcu-call-rcu-impl.h
 +++ b/urcu-call-rcu-impl.h
 @@ -64,6 +64,16 @@ struct call_rcu_data {
   struct cds_list_head list;
  } __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
 
 +struct call_rcu_completion {
 + int barrier_count;
 + int32_t futex;
 +};
 +
 +struct call_rcu_completion_work {
 + struct rcu_head head;
 + struct call_rcu_completion *completion;
 +};
 +
  /*
   * List of all call_rcu_data structures to keep valgrind happy.
   * Protected by call_rcu_mutex.
 @@ -236,6 +246,26 @@ static void call_rcu_wake_up(struct call_rcu_data *crdp)
   }
  }
 
 +static void call_rcu_completion_wait(struct call_rcu_completion *completion)
 +{
 + /* Read completion barrier count before read futex */
 + cmm_smp_mb();
 + if (uatomic_read(&completion->futex) == -1)
 + futex_async(&completion->futex, FUTEX_WAIT, -1,
 +   NULL, NULL, 0);
 +}
 +
 +static void call_rcu_completion_wake_up(struct call_rcu_completion 
 *completion)
 +{
 + /* Write to completion barrier count before reading/writing futex */
 + cmm_smp_mb();
 + if (caa_unlikely(uatomic_read(&completion->futex) == -1)) {
 + uatomic_set(&completion->futex, 0);
 + futex_async(&completion->futex, FUTEX_WAKE, 1,
 +   NULL, NULL, 0);
 + }
 +}
 +
  /* This is the code run by each call_rcu thread. */
 
  static void *call_rcu_thread(void *arg)
 @@ -604,6 +634,17 @@ static void wake_call_rcu_thread(struct call_rcu_data 
 *crdp)
   call_rcu_wake_up(crdp);
  }
 
 +static void _call_rcu(struct rcu_head *head,
 +   void (*func)(struct rcu_head *head),
 +   struct call_rcu_data *crdp)
 +{
 + cds_wfcq_node_init(&head->next);
 + head->func = func;
 + cds_wfcq_enqueue(&crdp->cbs_head, &crdp->cbs_tail, &head->next);
 + uatomic_inc(&crdp->qlen);
 + wake_call_rcu_thread(crdp);
 +}
 +
  /*
   * Schedule a function to be invoked after a following grace period.
   * This is the only function that must be called -- the others are
 @@ -618,20 +659,15 @@ static void wake_call_rcu_thread(struct call_rcu_data 
 *crdp)
   *
   * call_rcu must be called by registered RCU read-side threads.
   */
 -
  void call_rcu(struct rcu_head *head,
 void (*func)(struct rcu_head *head))
  {
   struct call_rcu_data *crdp;
 
 - cds_wfcq_node_init(&head->next);
 - head->func = func;
   /* Holding rcu read-side lock across use of per-cpu crdp */
   rcu_read_lock();
   crdp = get_call_rcu_data();
 - cds_wfcq_enqueue(&crdp->cbs_head, &crdp->cbs_tail, &head->next);
 - uatomic_inc(&crdp->qlen);
 - wake_call_rcu_thread(crdp);
 + _call_rcu(head, func, crdp);
   rcu_read_unlock();
  }
 
 @@ -730,6 +766,89 @@ void free_all_cpu_call_rcu_data(void)
   free(crdp);
  }
 
 +static
 +void _rcu_barrier_complete(struct rcu_head *head)
 +{
 + struct call_rcu_completion_work *work;
 + struct call_rcu_completion *completion;
 +
 + work = caa_container_of(head, struct call_rcu_completion_work, head);
 + completion = work->completion;
 + uatomic_dec(&completion->barrier_count);
 + call_rcu_completion_wake_up(completion);
 + free(work);
 +}
 +
 +/*
 + * Wait for all in-flight call_rcu callbacks to complete execution.
 + */
 +void rcu_barrier(void)
 +{
 + struct call_rcu_data *crdp;
 + struct call_rcu_completion completion;
 + int count = 0, work_count = 0;
 + int was_online;
 +
 + /* Put in offline state in QSBR. */
 + was_online = rcu_read_ongoing();
 + if (was_online)
 + rcu_thread_offline();
 + /*
 +  * Calling a rcu_barrier() within a RCU read-side critical
 +  * section is an error.
 +  */
 + if (rcu_read_ongoing()) {
 + static int warned = 0;
 +
 + if (!warned) {
 + fprintf(stderr, "[error] liburcu: rcu_barrier() called from within RCU read-side critical section.\n");
 + }
 + warned = 1;
 + goto online;
 + }
 +
 + call_rcu_lock(&call_rcu_mutex);
 + cds_list_for_each_entry(crdp, &call_rcu_data_list, list)
 + count++;
 +
 + completion.barrier_count = count;
 +
 + cds_list_for_each_entry(crdp, &call_rcu_data_list, list) {
 + struct call_rcu_completion_work *work;
 +
 + work = calloc(sizeof(*work), 1);
 + if (!work) {
 + static int warned = 0;
 +
 + if (!warned) {
 + fprintf(stderr, "[error] 

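For illustration, a minimal usage sketch of the rcu_barrier() API introduced
by the patch above. The node type and free_node() callback are hypothetical,
and the exact headers to include vary across liburcu versions:

#include <stdlib.h>
#include <urcu.h>		/* brings in call_rcu() and rcu_barrier() */

struct mynode {
	int value;
	struct rcu_head rcu_head;	/* storage for call_rcu() */
};

static void free_node(struct rcu_head *head)
{
	struct mynode *node =
		caa_container_of(head, struct mynode, rcu_head);

	free(node);
}

static void retire_node(struct mynode *node)
{
	call_rcu(&node->rcu_head, free_node);	/* deferred free */
}

static void drain_callbacks(void)
{
	/*
	 * Block until every callback passed to call_rcu() so far has
	 * executed, e.g. before unloading a plugin or exiting.
	 */
	rcu_barrier();
}
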
Re: [lttng-dev] Quick questions about liburcu and RCU in general

2013-05-07 Thread Paul E. McKenney
On Tue, May 07, 2013 at 07:59:14AM -0400, Mathieu Desnoyers wrote:
 Hi Richard,
 
 * Richard Braun (rbr...@sceen.net) wrote:
  Hello,
  
  I'm currently studying RCU/URCU, and I have a few questions that I wasn't
  sure where to ask.
  
  1/ Why use poll instead of sched_yield in e.g. force_mb_all_readers ?
  (I guess it's about portability and the effect is expected to be the same,
  but is there another reason ?)
 
 poll() allows us to do a millisecond-level wait (timer-based).
 sched_yield() is pretty much a scheduler hack that just says "be nice to
 other scheduled processes here". Quoting sched_yield(2):
 
If the calling thread is the only thread in the highest  priority list
at that time, it will continue to run after a call to sched_yield().
 
 This is a kind of behavior we don't want.
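 
 For illustration, the bounded wait described above boils down to something
 like this (a sketch, not the actual force_mb_all_readers() code):
 
 #include <poll.h>
 
 /* Sleep roughly one millisecond, whatever the scheduler priorities are. */
 static void wait_a_bit(void)
 {
 	(void) poll(NULL, 0, 1);	/* no fds to watch, 1 ms timeout */
 }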
 
  2/ What was the conclusion of the discussion regarding sys_membarrier ?
  (I couldn't find it in the main mail thread, and it looks quite interesting,
  even though I expect most carefully written applications not to exceed
  one thread per processor too much)
 
 The conclusion so far was that:
 
 - the implementation was fine,
 - we had to show there were enough users of this new ABI to justify its
   inclusion and maintenance cost. Currently, it's pretty much just
   liburcu that uses it. As the number of liburcu users grows, this
   helps, but what would help even more would be to have other libraries
   and applications using sys_membarrier() (other people interested and
   weighing in).
 
  
  3/ Do you know if IBM allows the use of patented RCU techniques in GPLv3+
  code as well ? (GPL is mentioned in the Linux documentation, and
  apparently liburcu is covered by LGPLv2+ so I expect that to be the case,
  just looking for a confirmation)
 
 AFAIK, LGPLv2+ code can migrate into GPLv3+, it's the other way around
 that is not permitted. Therefore, I would expect it is allowed, but I
 will let Paul answer to this one on behalf of IBM.

Yes, LGPLv2+ allows you to migrate to LGPLv3+, which is compatible with
GPLv3+.  So as long as you derive the code from lttng's userspace-rcu
library, you are set.

That said, you could also just link userspace-rcu into your GPLv3+
application.

Thanx, Paul

 Thanks,
 
 Mathieu
 
 
  
  Thanks for your answers.
  
  -- 
  Richard Braun
 
 -- 
 Mathieu Desnoyers
 EfficiOS Inc.
 http://www.efficios.com
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [PATCH urcu] rculfhash: add assertions on node alignment

2013-02-14 Thread Paul E. McKenney
On Thu, Feb 14, 2013 at 11:19:33AM -0500, Mathieu Desnoyers wrote:
 I've had a report of someone running into issues with the RCU lock-free
 hash table by embedding the struct cds_lfht_node into a packed structure
 by mistake, thus not respecting alignment requirements stated in
 urcu/rculfhash.h. Assertions on replace and add operations should
 catch this, but I notice that we should add assertions on the
 REMOVAL_OWNER_FLAG to cover all possible misalignments.
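 
 For context, a sketch of why the alignment requirement exists: rculfhash
 keeps its flags in the low-order bits of the next pointers, so a node
 embedded at a misaligned address aliases those flags (flag values shown
 here follow the scheme in rculfhash.c, but treat them as illustrative):
 
 #define REMOVED_FLAG		(1UL << 0)
 #define BUCKET_FLAG		(1UL << 1)
 #define REMOVAL_OWNER_FLAG	(1UL << 2)
 
 static int is_removal_owner(struct cds_lfht_node *node)
 {
 	return ((unsigned long) node) & REMOVAL_OWNER_FLAG;
 }
 
 /*
  * A cds_lfht_node placed in a packed structure can land on an address
  * with non-zero low bits; those bits then look like flags, which is
  * exactly the misuse these assertions catch.
  */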
 
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com

Makes sense to me!

Thanx, Paul

 ---
 diff --git a/rculfhash.c b/rculfhash.c
 index 8c3d621..e0be7df 100644
 --- a/rculfhash.c
 +++ b/rculfhash.c
 @@ -833,13 +833,16 @@ void _cds_lfht_gc_bucket(struct cds_lfht_node *bucket, struct cds_lfht_node *node
 
   assert(!is_bucket(bucket));
   assert(!is_removed(bucket));
 + assert(!is_removal_owner(bucket));
   assert(!is_bucket(node));
   assert(!is_removed(node));
 + assert(!is_removal_owner(node));
   for (;;) {
   iter_prev = bucket;
   /* We can always skip the bucket node initially */
   iter = rcu_dereference(iter_prev->next);
   assert(!is_removed(iter));
 + assert(!is_removal_owner(iter));
   assert(iter_prev->reverse_hash <= node->reverse_hash);
   /*
* We should never be called with bucket (start of chain)
 @@ -860,6 +863,7 @@ void _cds_lfht_gc_bucket(struct cds_lfht_node *bucket, struct cds_lfht_node *node
   iter = next;
   }
   assert(!is_removed(iter));
 + assert(!is_removal_owner(iter));
   if (is_bucket(iter))
   new_next = flag_bucket(clear_flag(next));
   else
 @@ -880,8 +884,10 @@ int _cds_lfht_replace(struct cds_lfht *ht, unsigned long size,
   return -ENOENT;
 
   assert(!is_removed(old_node));
 + assert(!is_removal_owner(old_node));
   assert(!is_bucket(old_node));
   assert(!is_removed(new_node));
 + assert(!is_removal_owner(new_node));
   assert(!is_bucket(new_node));
   assert(new_node != old_node);
   for (;;) {
 @@ -956,6 +962,7 @@ void _cds_lfht_add(struct cds_lfht *ht,
 
   assert(!is_bucket(node));
   assert(!is_removed(node));
 + assert(!is_removal_owner(node));
   bucket = lookup_bucket(ht, size, hash);
   for (;;) {
   uint32_t chain_len = 0;
 @@ -1016,7 +1023,9 @@ void _cds_lfht_add(struct cds_lfht *ht,
   insert:
   assert(node != clear_flag(iter));
   assert(!is_removed(iter_prev));
 + assert(!is_removal_owner(iter_prev));
   assert(!is_removed(iter));
 + assert(!is_removal_owner(iter));
   assert(iter_prev != node);
   if (!bucket_flag)
   node->next = clear_flag(iter);
 @@ -1036,6 +1045,7 @@ void _cds_lfht_add(struct cds_lfht *ht,
 
   gc_node:
   assert(!is_removed(iter));
 + assert(!is_removal_owner(iter));
   if (is_bucket(iter))
   new_next = flag_bucket(clear_flag(next));
   else
 @@ -1700,6 +1710,7 @@ int cds_lfht_delete_bucket(struct cds_lfht *ht)
   if (!is_bucket(node))
   return -EPERM;
   assert(!is_removed(node));
 + assert(!is_removal_owner(node));
   } while (!is_end(node));
   /*
* size accessed without rcu_dereference because hash table is
 
 -- 
 Mathieu Desnoyers
 EfficiOS Inc.
 http://www.efficios.com
 
 ___
 rp mailing list
 r...@svcs.cs.pdx.edu
 http://svcs.cs.pdx.edu/mailman/listinfo/rp
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [PATCH] Add ACCESS_ONCE() to avoid compiler splitting assignments

2013-01-19 Thread Paul E. McKenney
On Wed, Jan 16, 2013 at 07:50:54AM -0500, Mathieu Desnoyers wrote:
 * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   As noted by Konstantin Khlebnikov, gcc can split assignment of
   constants to long variables (https://lkml.org/lkml/2013/1/15/141),
   though assignment of NULL (0) is OK.  Assuming that a gcc bug is
    fixed (http://gcc.gnu.org/bugzilla/attachment.cgi?id=29169&action=diff
   has a patch), making the store be volatile keeps gcc from splitting.
   
   This commit therefore applies ACCESS_ONCE() to CMM_STORE_SHARED(),
   which is the underlying primitive used by rcu_assign_pointer().
  
  Hi Paul,
  
  I recognise that this is an issue in the Linux kernel, since a simple
  store is used and expected to be performed atomically when aligned.
  However, I think this does not affect liburcu, see below:
 
 Side question: what gcc versions may issue non-atomic volatile stores ?
 I think we should at least document those. Bug
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981 seems to target gcc
 4.7.2, but I wonder when this issue first appeared on x86 and x86-64
 (and if it affects other architectures as well).

I have no idea which versions are affected.  The bug is in the x86
backend, so is specific to x86, but there might well be similar bugs
in other architectures.

 Thanks,
 
 Mathieu
 
  
   
   Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
   
   diff --git a/urcu/system.h b/urcu/system.h
   index 2a45f22..7a1887e 100644
   --- a/urcu/system.h
   +++ b/urcu/system.h
   @@ -49,7 +49,7 @@
 */
#define CMM_STORE_SHARED(x, v)   \
 ({  \
   - __typeof__(x) _v = _CMM_STORE_SHARED(x, v); \
   + __typeof__(x) CMM_ACCESS_ONCE(_v) = _CMM_STORE_SHARED(x, v);
   \
  
  Here, the macro _CMM_STORE_SHARED(x, v) is doing the actual store.
  It stores v into x. So adding a CMM_ACCESS_ONCE(_v), as you propose
  here, is really only making sure the return value (usually unused),
  located on the stack, is accessed with a volatile access, which does not
  make much sense.
  
  What really matters is the _CMM_STORE_SHARED() macro:
  
  #define _CMM_STORE_SHARED(x, v) ({ CMM_ACCESS_ONCE(x) = (v); })
  
  which already uses a volatile access for the store. So this seems to be
  a case where our preemptive use of volatile for stores in addition to
  loads made us bug-free for a gcc behavior unexpected at the time we
  implemented this macro. Just a touch of paranoia seems to be a good
  thing sometimes. ;-)
  
  Thoughts ?
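 
 For illustration, the failure mode under discussion, sketched assuming a
 64-bit x86 target and the buggy gcc backend referenced above:
 
 long shared;
 
 void writer_bad(void)
 {
 	/* The buggy backend may emit two 32-bit immediate stores here,
 	 * letting a concurrent reader observe a torn value: */
 	shared = 0x0123456789abcdefL;
 }
 
 void writer_good(void)
 {
 	/* A volatile access keeps that gcc from splitting the store,
 	 * per the bugzilla patch referenced above: */
 	*(volatile long *) &shared = 0x0123456789abcdefL;
 }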

Here is my thought:  You should ignore my fix.  Please accept my
apologies for my confusion!

Thanx, Paul

  Thanks,
  
  Mathieu
  
 cmm_smp_wmc();  \
 _v; \
 })
   
   
   ___
   lttng-dev mailing list
   lttng-dev@lists.lttng.org
   http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
  
  -- 
  Mathieu Desnoyers
  EfficiOS Inc.
  http://www.efficios.com
  
  ___
  lttng-dev mailing list
  lttng-dev@lists.lttng.org
  http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
 
 -- 
 Mathieu Desnoyers
 EfficiOS Inc.
 http://www.efficios.com
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [rp] [RFC PATCH urcu] Add last output parameter to pop/dequeue

2013-01-15 Thread Paul E. McKenney
[Sorry for the delay, finally getting back to this.]

On Mon, Dec 17, 2012 at 09:40:09AM -0500, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Thu, Dec 13, 2012 at 06:44:56AM -0500, Mathieu Desnoyers wrote:
   I noticed that in addition to having:
   
   - push/enqueue returning whether the stack/queue was empty prior to the
 operation,
   - pop_all/splice, by nature, emptying the stack/queue,
   
   it can be interesting to make pop/dequeue operations return whether they
   are returning the last element of the stack/queue (therefore emptying
   it). This allow extending the test-cases covering the number of empty
   stack/queue encountered by both push/enqueuer and pop/dequeuer threads
   not only to push/enqueue paired with pop_all/splice, but also to
   pop/dequeue.
   
   In the case of wfstack, this unfortunately requires to modify an already
   exposed API. As a RFC, one question we should answer is how we want to
   handle the way forward: should we add new functions to the wfstack API
   and leave the existing ones alone ? 
   
   Thoughts ?
  
  Hmmm...  What is the use case, given that a push might happen immediately
  after the pop said that the stack/queue was empty?  Of course, if we
  somehow know that there are no concurrent pushes, we could instead
  check for empty.
  
  So what am I missing here?
 
 The setup for those use-cases is the following (I'm using the stack as
 example, but the same applies to queue):
 
 - we have N threads doing push and using the push return value that
   states whether it pushed into an empty stack.
 - we have M threads doing pop, using the return value to know if it
   pops a stack into an empty-stack-state. Following the locking
    requirements, we protect those M threads' pop by a mutex, but they
   don't need to be protected against push.
 
 Just to help understanding where the idea comes from, let's start with
 another use-case that is similar (push/pop_all). Knowing whether we
 pushed into an empty stack along with pop_all become very useful when
 you want to combine the stack with a higher level batching semantic
 linked to the elements present within the stack.
 
 In the case of grace period batching, for instance, I used
 push/pop_all to provide this kind of semantic: if we push into an
  empty stack, we know we will have to go through the grace period. If we
  push into a non-empty stack, we just wait to be awakened by the first
  thread, which pushed into the empty stack. This requires that we use
  pop_all before going through the grace period.
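 
 For illustration, that batching pattern in sketch form. cds_wfs_push()
 returns 0 when it pushed into an empty stack; the wait/wake helpers and
 the pop_all wrapper are hypothetical:
 
 static struct cds_wfs_stack gp_waiters;	/* shared stack of waiters */
 
 struct gp_waiter {
 	struct cds_wfs_node node;
 	/* per-thread wakeup state, e.g. a futex, goes here */
 };
 
 void batched_synchronize_rcu(struct gp_waiter *me)
 {
 	if (cds_wfs_push(&gp_waiters, &me->node) != 0) {
 		/* Non-empty stack: the thread that pushed into the empty
 		 * stack runs the grace period; just wait to be woken. */
 		wait_until_awakened(me);		/* hypothetical */
 		return;
 	}
 	/* Pushed into an empty stack: pop_all first (as noted above),
 	 * then run the grace period on behalf of the whole batch. */
 	struct cds_wfs_node *batch = stack_pop_all(&gp_waiters);	/* hypothetical */
 	do_grace_period();			/* hypothetical */
 	wake_waiters(batch);			/* hypothetical */
 }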
 
 Now more specifically about pop, one use-case I have in mind is
 energy-efficient handling of empty stacks. With M threads executing
 pop, let's suppose we want them to be blocked on a futex when there is
 nothing to do. Now the tricky part is: how can we do this without adding
 overhead (extra load/stores) to the stack ?
 
 If we have the ability to know whether we are popping the last element
 of a stack, we can use this information to go into a futex wait state
 after having handled the last element. Since the threads doing push
 would monitor whether they push into an empty stack, they would wake us
 whenever needed.
 
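 A sketch of that dequeuer-side idea, using the proposed "last" output
 parameter (the pop and wakeup helpers are hypothetical):
 
 #include <pthread.h>
 #include <stdbool.h>
 
 static struct cds_wfs_stack s;
 static pthread_mutex_t pop_mutex = PTHREAD_MUTEX_INITIALIZER;
 
 static void *pop_worker(void *arg)
 {
 	for (;;) {
 		struct cds_wfs_node *node;
 		bool last;
 
 		pthread_mutex_lock(&pop_mutex);	/* M poppers: mutual exclusion */
 		node = wfs_pop_last(&s, &last);	/* hypothetical "last" API */
 		pthread_mutex_unlock(&pop_mutex);
 
 		if (node)
 			process(node);		/* hypothetical */
 		if (!node || last)
 			wait_for_push();	/* hypothetical futex wait; a
 						   push into an empty stack
 						   wakes us up */
 	}
 }
 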
 If instead we choose to simply wait until one of the M threads discovers
 that the stack is actually empty, we are issuing extra pop (which
  fails) each time the stack is empty. In the worst case, if a queue
  always flips between 0 and 1 elements, we double the number of pops
  needed to handle the same amount of nodes.
 
 Otherwise, if we choose to add an explicit check to see whether the
 stack is empty, we are adding an extra load of the head node for every
 pop.
 
 Another use-case I see is low-overhead monitoring of stack usage
 efficiency. For this kind of use-case, we might want to know, both
 within push and pop threads, if we are underutilizing our system
 resources. Having the ability to know that we are reaching empty state
 without any extra overhead to stack memory traffic gives us this
 ability.
 
 I must admit that the use-cases for returning whether pop takes the last
  element are not as strong as the batching case with push/pop_all, mainly
 because AFAIU, we can achieve the same result by doing an extra check of
 stack emptiness state (either by an explicit empty() check, or by
 issuing an extra pop that will see an empty stack). What we are saving
  here is the extra overhead on stack cache-lines caused by this extra
 check.
 
 Another use-case, although maybe less compelling, is for validation.
 With concurrent threads doing push/pop/pop_all operations on the stack,
 we can perform the following check: If we empty the stack at the end of
 test execution, the
 
   number of push-to-empty-stack
 
   must be equal to the
 
   number of pop_all-from-non-empty-stack
+ number of pop-last-element-from-non-empty-stack
 
 We should note that this validation could not be performed if pop is
 not returning whether it popped the last

Re: [lttng-dev] [RFC PATCH urcu] Add last output parameter to pop/dequeue

2012-12-14 Thread Paul E. McKenney
On Thu, Dec 13, 2012 at 06:44:56AM -0500, Mathieu Desnoyers wrote:
 I noticed that in addition to having:
 
 - push/enqueue returning whether the stack/queue was empty prior to the
   operation,
 - pop_all/splice, by nature, emptying the stack/queue,
 
 it can be interesting to make pop/dequeue operations return whether they
 are returning the last element of the stack/queue (therefore emptying
  it). This allows extending the test-cases covering the number of empty
 stack/queue encountered by both push/enqueuer and pop/dequeuer threads
 not only to push/enqueue paired with pop_all/splice, but also to
 pop/dequeue.
 
 In the case of wfstack, this unfortunately requires to modify an already
 exposed API. As a RFC, one question we should answer is how we want to
 handle the way forward: should we add new functions to the wfstack API
 and leave the existing ones alone ? 
 
 Thoughts ?

Hmmm...  What is the use case, given that a push might happen immediately
after the pop said that the stack/queue was empty?  Of course, if we
somehow know that there are no concurrent pushes, we could instead
check for empty.

So what am I missing here?

Thanx, Paul

 Thanks,
 
 Mathieu
 
 ---
 diff --git a/tests/test_urcu_wfcq.c b/tests/test_urcu_wfcq.c
 index 91285a5..de9566d 100644
 --- a/tests/test_urcu_wfcq.c
 +++ b/tests/test_urcu_wfcq.c
 @@ -168,6 +168,7 @@ static DEFINE_URCU_TLS(unsigned long long, 
 nr_successful_dequeues);
  static DEFINE_URCU_TLS(unsigned long long, nr_successful_enqueues);
  static DEFINE_URCU_TLS(unsigned long long, nr_empty_dest_enqueues);
  static DEFINE_URCU_TLS(unsigned long long, nr_splice);
 +static DEFINE_URCU_TLS(unsigned long long, nr_dequeue_last);
 
  static unsigned int nr_enqueuers;
  static unsigned int nr_dequeuers;
 @@ -228,11 +229,15 @@ fail:
  static void do_test_dequeue(enum test_sync sync)
  {
   struct cds_wfcq_node *node;
 + bool last;
 
   if (sync == TEST_SYNC_MUTEX)
 - node = cds_wfcq_dequeue_blocking(&head, &tail);
 + node = cds_wfcq_dequeue_blocking(&head, &tail, &last);
   else
 - node = __cds_wfcq_dequeue_blocking(&head, &tail);
 + node = __cds_wfcq_dequeue_blocking(&head, &tail, &last);
 +
 + if (last)
 + URCU_TLS(nr_dequeue_last)++;
 
   if (node) {
   free(node);
 @@ -263,6 +268,7 @@ static void do_test_splice(enum test_sync sync)
   break;
   case CDS_WFCQ_RET_DEST_EMPTY:
   URCU_TLS(nr_splice)++;
 + URCU_TLS(nr_dequeue_last)++;
   /* ok */
   break;
   case CDS_WFCQ_RET_DEST_NON_EMPTY:
 @@ -325,16 +331,21 @@ static void *thr_dequeuer(void *_count)
   count[0] = URCU_TLS(nr_dequeues);
   count[1] = URCU_TLS(nr_successful_dequeues);
   count[2] = URCU_TLS(nr_splice);
 + count[3] = URCU_TLS(nr_dequeue_last);
   return ((void*)2);
  }
 
 -static void test_end(unsigned long long *nr_dequeues)
 +static void test_end(unsigned long long *nr_dequeues,
 + unsigned long long *nr_dequeue_last)
  {
   struct cds_wfcq_node *node;
 + bool last;
 
   do {
 - node = cds_wfcq_dequeue_blocking(&head, &tail);
 + node = cds_wfcq_dequeue_blocking(&head, &tail, &last);
   if (node) {
 + if (last)
 + (*nr_dequeue_last)++;
   free(node);
   (*nr_dequeues)++;
   }
 @@ -367,7 +378,7 @@ int main(int argc, char **argv)
   unsigned long long tot_successful_enqueues = 0,
  tot_successful_dequeues = 0,
  tot_empty_dest_enqueues = 0,
 -tot_splice = 0;
 +tot_splice = 0, tot_dequeue_last = 0;
   unsigned long long end_dequeues = 0;
   int i, a, retval = 0;
 
 @@ -480,7 +491,7 @@ int main(int argc, char **argv)
   tid_enqueuer = malloc(sizeof(*tid_enqueuer) * nr_enqueuers);
   tid_dequeuer = malloc(sizeof(*tid_dequeuer) * nr_dequeuers);
   count_enqueuer = malloc(3 * sizeof(*count_enqueuer) * nr_enqueuers);
 - count_dequeuer = malloc(3 * sizeof(*count_dequeuer) * nr_dequeuers);
 + count_dequeuer = malloc(4 * sizeof(*count_dequeuer) * nr_dequeuers);
   cds_wfcq_init(&head, &tail);
 
   next_aff = 0;
 @@ -493,7 +504,7 @@ int main(int argc, char **argv)
   }
   for (i = 0; i < nr_dequeuers; i++) {
   err = pthread_create(&tid_dequeuer[i], NULL, thr_dequeuer,
 -  &count_dequeuer[3 * i]);
 +  &count_dequeuer[4 * i]);
   if (err != 0)
   exit(1);
   }
 @@ -533,34 +544,37 @@ int main(int argc, char **argv)
   err = pthread_join(tid_dequeuer[i], &tret);
   if (err != 0)
   exit(1);
 - tot_dequeues += count_dequeuer[3 * i];
 -   

Re: [lttng-dev] [PATCH] urcu: avoid false sharing for rcu_gp_ctr

2012-12-10 Thread Paul E. McKenney
On Fri, Dec 07, 2012 at 12:22:52PM -0500, Mathieu Desnoyers wrote:
 * Lai Jiangshan (eag0...@gmail.com) wrote:
  On Saturday, December 8, 2012, Mathieu Desnoyers wrote:
  
   * Lai Jiangshan (eag0...@gmail.com) wrote:
 we can define rcu_gp_ctr and registry with aligned attribute, but it is
 not a reliable way
   
We need only this:
unsigned long rcu_gp_ctr __attribute((aligned and padded(don't put other
var next to it except the futex)))
  
   In which situation would this be unreliable ?
  
  
  
  int a;
  int b __attribute__((aligned));
  int c;
  
  b and c will be in the same cache line; even if we define c as aligned too, the
  compiler/linker may put a next to b, thus a and b end up in the same line
 
 So if our goal is to have rcu_gp_ctr and rcu_gp_futex on the same cache
 line, which is different from that of the registry, we could do:
 
 typeA rcu_gp_ctr __attribute__((aligned(...)));
 typeB rcu_gp_futex;
 typeC registry __attribute__((aligned(...)));
 
 I would expect the compiler won't typically reorder rcu_gp_futex and
 registry. But I guess there is no guarantee it is going to be always
 true given by the C99 standard.
 
 I guess this is a case where we could bump the library version number
 and do things properly.
 
 Let's think a bit more about it, anyone else has comments on this ?

If you really want the alignment and padding, they should go into a
structure.  ;-)
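 
For illustration, a sketch of that structure-based layout (types and
names illustrative):

struct urcu_gp {
	/*
	 * The grace period counter and its futex share this cache
	 * line; the aligned attribute keeps unrelated variables,
	 * including the registry, off of it.
	 */
	unsigned long ctr;
	int32_t futex;
} __attribute__((aligned(CAA_CACHE_LINE_SIZE)));

static struct urcu_gp rcu_gp;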

Thanx, Paul

 Thanks,
 
 Mathieu
 
  
  
  
   Thanks,
  
   Mathieu
  
   
On Saturday, December 8, 2012, Mathieu Desnoyers wrote:
   
  * Lai Jiangshan (la...@cn.fujitsu.com) wrote:
   @rcu_gp_ctr and @registry share the same cache line, which causes
   false sharing and slows down both the read site and the update site.
 
  Fix: Use different cache line for them.
 
   Although rcu_gp_futex is updated less often than rcu_gp_ctr,
   they are always accessed at almost the same time, so we also move
   rcu_gp_futex to the cache line of rcu_gp_ctr to reduce the cache-line
   usage and cache misses of the read site.

 Hi Lai,

 I agree on the goal: placing registry and rcu_gp_ctr on different
 cache-lines. And yes, it makes sense to put rcu_gp_ctr and 
 rcu_gp_futex
 on the same cache-line. I agree that the next patch is fine too
   (keeping
 qsbr and other urcu similar). This is indeed what I try to ensure
 myself.

  I'm just concerned that this patch seems to break ABI compatibility for
 liburcu: the read-side, within applications, would have to be
 recompiled. So I guess we should decide if we do this change in a way
 that does not break the ABI (e.g. not introducing a structure), or if
   we
 choose to bump the library version number.

 Thoughts ?

 Thanks,

 Mathieu

 
 
  test: (4X6=24 CPUs)
 
  Before patch:
 
  [root@localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
  SUMMARY ./tests/test_urcu_mb  testdur   20 nr_readers  20 rdur
  0 wdur  0 nr_writers   1 wdelay  0 nr_reads   2100285330
   nr_writes
  3390219 nr_ops   2103675549
  [root@localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
  SUMMARY ./tests/test_urcu_mb  testdur   20 nr_readers  20 rdur
  0 wdur  0 nr_writers   1 wdelay  0 nr_reads   1619868562
   nr_writes
  3529478 nr_ops   1623398040
  [root@localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
  SUMMARY ./tests/test_urcu_mb  testdur   20 nr_readers  20 rdur
  0 wdur  0 nr_writers   1 wdelay  0 nr_reads   1949067038
   nr_writes
  3469334 nr_ops   1952536372
 
 
  after patch:
 
  [root@localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
  SUMMARY ./tests/test_urcu_mb  testdur   20 nr_readers  20 rdur
  0 wdur  0 nr_writers   1 wdelay  0 nr_reads   3380191848
   nr_writes
  4903248 nr_ops   3385095096
  [root@localhost userspace-rcu]# ./tests/test_urcu_mb 20 1 20
  SUMMARY ./tests/test_urcu_mb  testdur   20 nr_readers  20 rdur
  0 wdur  0 nr_writers   1 wdelay  0 nr_reads   3397637486
   nr_writes
  4129809 nr_ops   3401767295
 
   Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
  ---
  diff --git a/urcu.c b/urcu.c
  index 15def09..436d71c 100644
  --- a/urcu.c
  +++ b/urcu.c
  @@ -83,16 +83,7 @@ void __attribute__((destructor)) rcu_exit(void);
   #endif
 
   static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;
  -
  -int32_t rcu_gp_futex;
  -
  -/*
  - * Global grace period counter.
  - * Contains the current RCU_GP_CTR_PHASE.
   - * Also has a RCU_GP_COUNT of 1, to accelerate the reader fast path.
  - * Written to only by writer with mutex taken. Read by both 

Re: [lttng-dev] [PATCH 14/16] urcu-qsbr: batch concurrent synchronize_rcu()

2012-11-22 Thread Paul E. McKenney
On Thu, Nov 22, 2012 at 05:04:47PM +0800, Lai Jiangshan wrote:
 On 11/22/2012 02:33 AM, Mathieu Desnoyers wrote:
  * Lai Jiangshan (la...@cn.fujitsu.com) wrote:
  Could you delay 14~16 for 40 days if I don't implement it in 40 days?
  
  I'm curious to know more about the changes you are planning. Is that
  another way to implement grace periods that would allow multiple threads
  to execute synchronize_rcu() concurrently ?
 
  synchronize_rcu()s in this implementation share a coarse-grain step (1 GP)
  to achieve concurrency. My implementation will use fine-grain steps (1 check
  or 1 flip) like SRCU, and call_rcu() is also considered in this
  implementation to avoid unneeded waits.
 
  
  Please note that changes in these algorithms will need to go through
  very strict review/validation/verification. So I expect that if it takes
  40 days to implement, we can plan at least 3-4 months of validation work.
 
  I mean I don't have time. If I can't steal some time within the next 40 days,
  this code is OK for me.

Why don't we take the current code, which would allow some academic
projects to test on large systems in the next few months, and then
replace it with your code when available and if appropriate?

Thanx, Paul

  With that in mind, would it make sense to merge the batching approach in
  the meantime ? The advantage of the batching approach is that it does
  not touch the core of the synchronization algorithm.
  
  Thoughts ?
  
  Thanks,
  
  Mathieu
  
 
  On 11/21/2012 03:40 AM, Mathieu Desnoyers wrote:
  Here are benchmarks on batching of synchronize_rcu(), and it leads to
  very interesting scalability improvement and speedups, e.g., on a
  24-core AMD, with a write-heavy scenario (4 readers threads, 20 updater
  threads, each updater using synchronize_rcu()):
 
  * Serialized grace periods :
 
  ./test_urcu_qsbr 4 20 20
  SUMMARY ./test_urcu_qsbr  testdur   20 nr_readers   4
  rdur  0 wdur  0 nr_writers  20 wdelay  0
  nr_reads  20251412728 nr_writes  1826331 nr_ops  20253239059
 
  * Batched grace periods :
 
  ./test_urcu_qsbr 4 20 20
  SUMMARY ./test_urcu_qsbr  testdur   20 nr_readers   4
  rdur  0 wdur  0 nr_writers  20 wdelay  0
  nr_reads  15141994746 nr_writes  9382515 nr_ops  15151377261
 
  For a 9382515/1826331 = 5.13 speedup for 20 updaters.
 
  Of course, we can see that readers have slowed down, probably due to
  increased update traffic, given there is no change to the read-side code
  whatsoever.
 
   Now let's see the penalty of managing the stack for single-updater.
  With 4 readers, single updater:
 
  * Serialized grace periods :
 
  ./test_urcu_qsbr 4 1 20
  SUMMARY ./test_urcu_qsbr  testdur   20 nr_readers   4
  rdur  0 wdur  0 nr_writers   1 wdelay  0
  nr_reads  19240784755 nr_writes  2130839 nr_ops  19242915594
 
  * Batched grace periods :
 
  ./test_urcu_qsbr 4 1 20
  SUMMARY ./test_urcu_qsbr  testdur   20 nr_readers   4
  rdur  0 wdur  0 nr_writers   1 wdelay  0
   nr_reads  19160162768 nr_writes  2253068 nr_ops  19162415836
 
   2253068 vs 2137036 - a couple of runs show that this difference is lost in
  the noise for single updater.
 
  CC: Paul E. McKenney paul...@linux.vnet.ibm.com
  CC: Lai Jiangshan la...@cn.fujitsu.com
  CC: Alan Stern st...@rowland.harvard.edu
  Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  ---
   urcu-qsbr.c |  151 
  +++
   1 file changed, 151 insertions(+)
 
  diff --git a/urcu-qsbr.c b/urcu-qsbr.c
  index 5b341b5..7f747ed 100644
  --- a/urcu-qsbr.c
  +++ b/urcu-qsbr.c
  @@ -36,6 +36,7 @@
    #include <poll.h>
    
    #include "urcu/wfcqueue.h"
   +#include "urcu/wfstack.h"
    #include "urcu/map/urcu-qsbr.h"
    #define BUILD_QSBR_LIB
    #include "urcu/static/urcu-qsbr.h"
  @@ -78,6 +79,35 @@ DEFINE_URCU_TLS(unsigned int, rcu_rand_yield);
   
   static CDS_LIST_HEAD(registry);
   
  +/*
  + * Number of busy-loop attempts before waiting on futex for grace period
  + * batching.
  + */
  +#define RCU_AWAKE_ATTEMPTS 1000
  +
  +enum adapt_wakeup_state {
  + /* AWAKE_WAITING is compared directly (futex compares it). */
  + AWAKE_WAITING = 0,
  + /* non-zero are used as masks. */
   + AWAKE_WAKEUP =  (1 << 0),
   + AWAKE_AWAKENED =(1 << 1),
   + AWAKE_TEARDOWN =(1 << 2),
  +};
  +
  +struct gp_waiters_thread {
  + struct cds_wfs_node node;
  + int32_t wait_futex;
  +};
  +
  +/*
  + * Stack keeping threads awaiting to wait for a grace period. Contains
  + * struct gp_waiters_thread objects.
  + */
  +static struct cds_wfs_stack gp_waiters = {
  + .head = CDS_WFS_END,
  + .lock = PTHREAD_MUTEX_INITIALIZER,
  +};
  +
   static void mutex_lock(pthread_mutex_t *mutex)
   {
int ret;
  @@ -116,6 +146,58 @@ static void wait_gp(void)
  NULL, NULL, 0);
   }
   
  +/*
  + * Note: urcu_adaptative_wake_up needs value to stay

[lttng-dev] [PATCH] wfcqueue: Fix lock and unlock functions

2012-11-15 Thread Paul E. McKenney
The current implementation of cds_wfcq_dequeue_lock() and
cds_wfcq_dequeue_unlock() entails mutually assured recursion.
Redirect to _cds_wfcq_dequeue_lock() and _cds_wfcq_dequeue_unlock(),
respectively.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/wfcqueue.c b/wfcqueue.c
index 3474ee0..90b810e 100644
--- a/wfcqueue.c
+++ b/wfcqueue.c
@@ -57,13 +57,13 @@ void cds_wfcq_enqueue(struct cds_wfcq_head *head,
 void cds_wfcq_dequeue_lock(struct cds_wfcq_head *head,
struct cds_wfcq_tail *tail)
 {
-   cds_wfcq_dequeue_lock(head, tail);
+   _cds_wfcq_dequeue_lock(head, tail);
 }
 
 void cds_wfcq_dequeue_unlock(struct cds_wfcq_head *head,
struct cds_wfcq_tail *tail)
 {
-   cds_wfcq_dequeue_unlock(head, tail);
+   _cds_wfcq_dequeue_unlock(head, tail);
 }
 
 struct cds_wfcq_node *cds_wfcq_dequeue_blocking(


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


[lttng-dev] Fw: Re: [PATCH v2] epoll: Support for disabling items, and a self-test app.

2012-10-29 Thread Paul E. McKenney
FYI, userspace RCU was proposed to solve an issue with epoll.

Thanx, Paul

- Forwarded message from Matt Helsley matth...@linux.vnet.ibm.com -

Date: Fri, 26 Oct 2012 14:52:42 -0700
From: Matt Helsley matth...@linux.vnet.ibm.com
To: Michael Kerrisk (man-pages) mtk.manpa...@gmail.com
Cc: Paton J. Lewis pale...@adobe.com, Alexander Viro
v...@zeniv.linux.org.uk, Andrew Morton a...@linux-foundation.org,
Jason Baron jba...@redhat.com, linux-fsde...@vger.kernel.org
linux-fsde...@vger.kernel.org, linux-ker...@vger.kernel.org
linux-ker...@vger.kernel.org, Paul Holland pholl...@adobe.com,
Davide Libenzi davi...@xmailserver.org, libc-al...@sourceware.org
libc-al...@sourceware.org, Linux API linux-...@vger.kernel.org,
Paul McKenney paul...@us.ibm.com
Subject: Re: [PATCH v2] epoll: Support for disabling items, and a self-test
app.

On Thu, Oct 25, 2012 at 12:23:24PM +0200, Michael Kerrisk (man-pages) wrote:
 Hi Pat,
 
 
  I suppose that I have a concern that goes in the other direction. Is
  there not some other solution possible that doesn't require the use of
  EPOLLONESHOT? It seems overly restrictive to require that the caller
  must employ this flag, and imposes the burden that the caller must
  re-enable monitoring after each event.
 
  Does a solution like the following (with no requirement for EPOLLONESHOT)
  work?
 
  0. Implement an epoll_ctl() operation EPOLL_CTL_XXX
 where the name XXX might be chosen based on the decision
 in 4(a).
  1. EPOLL_CTL_XXX employs a private flag, EPOLLUSED, in the
 per-fd events mask in the ready list. By default,
 that flag is off.
  2. epoll_wait() always clears the EPOLLUSED flag if a
 file descriptor is found to be ready.
  3. If an epoll_ctl(EPOLL_CTL_XXX) discovers that the EPOLLUSED
 flag is NOT set, then
  a) it sets the EPOLLUSED flag
  b) It disables I/O events (as per EPOLL_CTL_DISABLE)
  (I'm not 100% sure if this is necessary).
  c) it returns EBUSY to the caller
  4. If an epoll_ctl(EPOLL_CTL_XXX) discovers that the EPOLLUSED
 flag IS set, then it
  a) either deletes the fd or disables events for the fd
 (the choice here is a matter of design taste, I think;
 deletion has the virtue of simplicity; disabling provides
 the option to re-enable the fd later, if desired)
  b) returns 0 to the caller.
 
  All of the above with suitable locking around the user-space cache.
 
  Cheers,
 
  Michael
 
 
  I don't believe that proposal will solve the problem. Consider the case
  where a worker thread has just executed epoll_wait and is about to execute
  the next line of code (which will access the data associated with the fd
  receiving the event). If the deletion thread manages to call
  epoll_ctl(EPOLL_CTL_XXX) for that fd twice in a row before the worker thread
  is able to execute the next statement, then the deletion thread will
  mistakenly conclude that it is safe to destroy the data that the worker
  thread is about to access.
 
 Okay -- I had the idea there might be a hole in my proposal ;-).
 
 By the way, have you been reading the comments in the two LWN articles
 on EPOLL_CTL_DISABLE?
 https://lwn.net/Articles/520012/
 http://lwn.net/SubscriberLink/520198/fd81ba0ecb1858a2/
 
 There's some interesting proposals there--some suggesting that an
 entirely user-space solution might be possible. I haven't looked
 deeply into the ideas though.

Yeah, I became quite interested so I wrote a crude epoll + urcu test.
Since it's RCU, review to ensure I've not made any serious mistakes could
be quite helpful:

#define _LGPL_SOURCE 1
#define _GNU_SOURCE 1

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>

#include <sys/epoll.h>

/*
 * Locking Voodoo:
 *
 * The globals prefixed by _ require special care because they will be
 * accessed from multiple threads.
 *
 * The precise locking scheme we use varies depending on whether READERS_USE_MUTEX is defined.
 * When we're using userspace RCU the mutex only gets acquired for writes
 * to _-prefixed globals. Reads are done inside RCU read side critical
 * sections.
 * Otherwise the epmutex covers reads and writes to them all and the test
 * is not very scalable.
 */
static pthread_mutex_t epmutex = PTHREAD_MUTEX_INITIALIZER;
static int _p[2]; /* Send dummy data from one thread to another */
static int _epfd; /* Threads wait to read/write on epfd */
static int _nepitems = 0;

#ifdef READERS_USE_MUTEX
#define init_lock() do {} while(0)
#define init_thread() do {} while(0)
#define read_lock pthread_mutex_lock
#define read_unlock pthread_mutex_unlock
#define fini_thread() do {} while(0)
/* Because readers use the mutex synchronize_rcu() is a no-op */
#define synchronize_rcu() do {} while(0)
#else
#include <urcu.h>
#define 

Re: [lttng-dev] urcu stack and queues updates and documentation

2012-10-22 Thread Paul E. McKenney
On Wed, Oct 17, 2012 at 11:19:46AM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Sun, Oct 14, 2012 at 01:53:32PM -0400, Mathieu Desnoyers wrote:
   Hi Paul!
   
   I know you are currently looking at documentation of urcu data
   structures. I did quite a bit of work in that area these past days. Here
   is my plan:
  
  Actually, I diverted to the atomic operations, given that the stack/queue
  API seems to be in flux.  ;-)
 
 That sounds like a wise decision ;-)
 
   1) I would like to deprecate, at some point, rculfqueue, wfqueue, and
  rculfstack.
   
   2) For wfqueue, we replace it by wfcqueue, currently in the urcu master
  branch.
   
   3) For rculfstack, we replace it by lfstack available here (volatile
  branch):
   
   git://git.dorsal.polymtl.ca/~compudj/userspace-rcu
   branch: urcu/lfstack
  
  I probably have to document them to have any chance of having an opinion,
  other than my usual advice to avoid disrupting users of the old interfaces.
 
 My general plan is to leave the old interfaces in place, marking them as
  deprecated by adding a __attribute__((deprecated("This interface is
  deprecated. Please refer to urcu/xxxqueue.h for its replacement."))).
 Then we'll be able to drop the deprecated interfaces in a couple of
 versions.

Fair enough.  Should enough users protest, we can of course leave them
in place.

   4) I did documentation improvements (and implemented pop_all as well as
  empty, and iterators) for wfstack here (volatile branch too):
   
   git://git.dorsal.polymtl.ca/~compudj/userspace-rcu
   branch: urcu/wfstack
  
  I will be very happy to take advantage of this.  ;-)
 
 I wonder how we should move forward with these ? I could pull the
 urcu/wfstack, urcu/lfstack commits into master with your approval, and
 mark rculfstack and wfqueue as deprecated. wfstack is simply extended. I
 would wait a bit before deciding anything wrt rculfqueue. Thoughts ?

I would be in favor of pulling them in -- we can fix if need be.
That said, I am not so sure that getting rid of wfqueue is a good idea,
given your analysis below.

   5) The last one to look into would be rculfqueue. I'd really like to
  create a lfcqueue derived from wfcqueue if possible. It's the next
  item on my todo list this weekend.
  
  The piece I am missing is ABA avoidance.  Or is this the approach
  that assumes a single dequeuer?
 
 If we look at the big picture, the main difference between the wf and
 lf approaches, both for stack and queue, is that wf requires
 traversal to busy-wait when it sees the intermediate NULL pointer state.
 This allows wait-free push/enqueue with xchg. The lf approach ensures
 that a simple traversal can be done on the structures, at the expense of
 requiring a cmpxchg on the enqueue/push.
 
 Luckily, for stacks, the nature of stacks makes push ABA-proof (see
 the documentation in the code), even if we use cmpxchg.
 
 Unluckily, for queues, using cmpxchg on enqueue is ABA-prone. dequeue
  is ABA-prone too. Moreover, we need to have existence guarantees, so an
 enqueue does not attempt to do a cmpxchg on the next pointer of a node
 that has already been dequeued and reallocated. So, one approach is to
 always rely on RCU, and require the RCU read-side lock to be held around
 enqueue, and around dequeue. Now, the question is: can we rely on other,
  non-rcu techniques, to protect lfqueue against ABA and offer existence
 guarantees ?
 
 A single-dequeuer approach would unfortunately not be sufficient,
  because enqueue is ABA-prone, and due to lack of existence guarantees
 for the node we are about to append after: if we have multiple enqueuers
 and a single dequeuer, one enqueue could suffer from ABA, and try to
 touch reallocated memory, due to dequeue+reallocation of a node.
 
 Even forcing single-enqueuer/single-dequeuer would not suffice: if,
 between the moment we get the tail node we plan to append after, and the
 moment we perform the cmpxchg to that node next pointer, the node is
 dequeued and freed, we would be touching freed memory (corruption).
 
 Therefore, that would require a single mutex on _both_ enqueue and
 dequeue operations, which really defeats the purpose of a lock-free
 queue.
 
 So my current understanding is that we might have to stay with a RCU
 lfcqueue, requiring RCU read-side lock to be held for enqueue and
 dequeue, and requiring to wait for a grace period to elapse before
 freeing the memory returned by dequeue. The benefit of using rculfcqueue
 over wfcqueue is that traversal of the nodes, and dequeue, don't need to
 busy-loop on NULL next pointers.
 
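 For illustration, the usage contract that paragraph implies, in sketch
 form (the lfcq_* queue API is hypothetical, since the lfcqueue discussed
 here does not exist yet):
 
 struct mynode {
 	struct lfcq_node qnode;		/* hypothetical queue node */
 	struct rcu_head rcu_head;
 };
 
 void produce(struct lfcq *q, struct mynode *node)	/* hypothetical */
 {
 	rcu_read_lock();	/* existence guarantee for the tail node
 				   the enqueue cmpxchg targets */
 	lfcq_enqueue(q, &node->qnode);			/* hypothetical */
 	rcu_read_unlock();
 }
 
 void consume(struct lfcq *q)
 {
 	struct lfcq_node *qnode;
 
 	rcu_read_lock();
 	qnode = lfcq_dequeue(q);			/* hypothetical */
 	rcu_read_unlock();
 	if (qnode) {
 		struct mynode *node =
 			caa_container_of(qnode, struct mynode, qnode);
 		call_rcu(&node->rcu_head, free_node);	/* free only after
 							   a grace period */
 	}
 }
 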
 Thoughts ?

Heh! It would indeed seem that we didn't think through the conversion
from wfqueue as thoroughly as we might have.  ;-)

Thanx, Paul

 Thanks!
 
 Mathieu
 
  
  Thanx, Paul
  
   Thoughts ?
   
   Thanks,
   
   Mathieu

Re: [lttng-dev] urcu stack and queues updates and documentation

2012-10-16 Thread Paul E. McKenney
On Sun, Oct 14, 2012 at 01:53:32PM -0400, Mathieu Desnoyers wrote:
 Hi Paul!
 
 I know you are currently looking at documentation of urcu data
 structures. I did quite a bit of work in that area these past days. Here
 is my plan:

Actually, I diverted to the atomic operations, given that the stack/queue
API seems to be in flux.  ;-)

 1) I would like to deprecate, at some point, rculfqueue, wfqueue, and
rculfstack.
 
 2) For wfqueue, we replace it by wfcqueue, currently in the urcu master
branch.
 
 3) For rculfstack, we replace it by lfstack available here (volatile
branch):
 
 git://git.dorsal.polymtl.ca/~compudj/userspace-rcu
 branch: urcu/lfstack

I probably have to document them to have any chance of having an opinion,
other than my usual advice to avoid disrupting users of the old interfaces.

 4) I did documentation improvements (and implemented pop_all as well as
empty, and iterators) for wfstack here (volatile branch too):
 
 git://git.dorsal.polymtl.ca/~compudj/userspace-rcu
 branch: urcu/wfstack

I will be very happy to take advantage of this.  ;-)

 5) The last one to look into would be rculfqueue. I'd really like to
create a lfcqueue derived from wfcqueue if possible. It's the next
item on my todo list this weekend.

The piece I am missing is ABA avoidance.  Or is this the approach
that assumes a single dequeuer?

Thanx, Paul

 Thoughts ?
 
 Thanks,
 
 Mathieu
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency RD Consultant
 EfficiOS Inc.
 http://www.efficios.com
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] rculfstack bug

2012-10-10 Thread Paul E. McKenney
On Wed, Oct 10, 2012 at 07:42:15AM -0400, Mathieu Desnoyers wrote:
 * Lai Jiangshan (la...@cn.fujitsu.com) wrote:
  test code:
  ./tests/test_urcu_lfs 100 10 10
  
  bug produce rate  60%
  
  {{{
  I didn't see any bug when ./tests/test_urcu_lfs 10 10 10 Or 
  ./tests/test_urcu_lfs 100 100 10
  But I just test it about 5 times
  }}}
  
  4cores*1threads: Intel(R) Core(TM) i5 CPU 760
  RCU_MB (no time to test for other rcu type)
  test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
  
  I didn't see any bug when ./tests/test_urcu_mb 10 100 10
  
  Sorry, I tried, but I failed to find out the root cause currently.
 
 I think I managed to narrow down the issue:
 
 1) the master branch does not reproduce it, but commit
    768fba83676f49eb73fd1d8ad452016a84c5ec2a reproduces it about 50% of the
time.
 
 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
moving to wfcqueue.
 
 3) the bug always arise, for me, at the end of the 10 seconds.
However, it might be simply due to the fact that most of the memory
 gets freed at the end of program execution.
 
 4) I've been able to get a backtrace, and it looks like we have some
call_rcu callback-invocation threads still working while
call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free()
is nicely waiting for the next thread to stop, and during that time,
two callback-invocation threads are invoking callbacks (and one of
them triggers the segfault).

Do any of the callbacks reference __thread variables from some other
thread?  If so, those threads must refrain from exiting until after
such callbacks complete.

Thanx, Paul

 So I expect that commit 
 
 commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
 Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 Date:   Tue Sep 25 10:50:49 2012 -0500
 
 call_rcu: use wfcqueue, eliminate false-sharing
 
 Eliminate false-sharing between call_rcu (enqueuer) and worker threads
 on the queue head and tail.
 
 Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 
 Could have managed to fix the issue, or change the timing enough that it
 does not reproduces. I'll continue investigating.
 
 Thanks,
 
 Mathieu
 
 
  
  *** glibc detected *** 
  /home/laijs/work/userspace-rcu/tests/.libs/lt-test_urcu_lfs: double free or 
  corruption (out): 0x7f20955dfbb0 ***
  === Backtrace: =
  /lib64/libc.so.6[0x37ee676d63]
  /home/laijs/work/userspace-rcu/tests/.libs/lt-test_urcu_lfs[0x4024f5]
  /lib64/libpthread.so.0[0x37eda06ccb]
  /lib64/libc.so.6(clone+0x6d)[0x37ee6e0c2d]
  === Memory map: 
  0040-00405000 r-xp  08:08 6031723
  /home/laijs/work/userspace-rcu/tests/.libs/lt-test_urcu_lfs
  00605000-00606000 rw-p 5000 08:08 6031723
  /home/laijs/work/userspace-rcu/tests/.libs/lt-test_urcu_lfs
  00606000-00616000 rw-p  00:00 0 
  00e9c000-03482000 rw-p  00:00 0  
  [heap]
  37ed60-37ed61f000 r-xp  08:01 1507421
  /lib64/ld-2.13.so
  37ed81e000-37ed81f000 r--p 0001e000 08:01 1507421
  /lib64/ld-2.13.so
  37ed81f000-37ed82 rw-p 0001f000 08:01 1507421
  /lib64/ld-2.13.so
  37ed82-37ed821000 rw-p  00:00 0 
  37eda0-37eda17000 r-xp  08:01 1507427
  /lib64/libpthread-2.13.so
  37eda17000-37edc16000 ---p 00017000 08:01 1507427
  /lib64/libpthread-2.13.so
  37edc16000-37edc17000 r--p 00016000 08:01 1507427
  /lib64/libpthread-2.13.so
  37edc17000-37edc18000 rw-p 00017000 08:01 1507427
  /lib64/libpthread-2.13.so
  37edc18000-37edc1c000 rw-p  00:00 0 
  37ee60-37ee791000 r-xp  08:01 1507423
  /lib64/libc-2.13.so
  37ee791000-37ee991000 ---p 00191000 08:01 1507423
  /lib64/libc-2.13.so
  37ee991000-37ee995000 r--p 00191000 08:01 1507423
  /lib64/libc-2.13.so
  37ee995000-37ee996000 rw-p 00195000 08:01 1507423
  /lib64/libc-2.13.so
  37ee996000-37ee99c000 rw-p  00:00 0 
  37f0e0-37f0e15000 r-xp  08:01 1507437
  /lib64/libgcc_s-4.5.1-20100924.so.1
  37f0e15000-37f1014000 ---p 00015000 08:01 1507437
  /lib64/libgcc_s-4.5.1-20100924.so.1
  37f1014000-37f1015000 rw-p 00014000 08:01 1507437
  /lib64/libgcc_s-4.5.1-20100924.so.1
  7f1ee400-7f1ee4029000 rw-p  00:00 0 
  7f1ee4029000-7f1ee800 ---p  00:00 0 
  7f1eec00-7f1eee039000 rw-p  00:00 0 
  7f1eee039000-7f1ef000

Re: [lttng-dev] [RFC] re-document rculfstack and even rename it

2012-10-10 Thread Paul E. McKenney
On Wed, Oct 10, 2012 at 03:52:08PM +0800, Lai Jiangshan wrote:
 rculfstack does not really require RCU only.
 
 1) cds_lfs_push_rcu() doesn't need any synchronization: neither RCU nor locks.
 
 2) cds_lfs_pop_rcu() needs only one of the following synchronization
 mechanisms (not only RCU):
   A) use rcu_read_lock() to protect cds_lfs_pop_rcu() and use
  synchronize_rcu() or call_rcu() to free the popped node. (The current
  comments say we need this synchronization, and thus we named this
  struct with an rcu prefix. But actually, the following are OK too,
  and are more popular/friendly.)
   B) use mutexes/locks to protect cds_lfs_pop_rcu(); we are then free to
  free/modify the popped node at any time, without any synchronization
  when freeing it.
   C) only ONE thread can call cds_lfs_pop_rcu() (multiple producers,
  single consumer).
   D) others, like read-write locks.
 
 I consider B) and C) the more popular ones. In the Linux kernel,
 kernel/task_work.c uses a hybrid of B) and C).
 
 I suggest renaming it, or at least documenting B) and C).
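 
 For illustration, option B) in sketch form: with pop serialized by a
 mutex, the popped node can be freed immediately, with no RCU involved
 (type and function names per urcu/rculfstack.h; treat this as a sketch):
 
 static pthread_mutex_t pop_mutex = PTHREAD_MUTEX_INITIALIZER;
 static struct cds_lfs_stack_rcu s;
 
 static struct cds_lfs_node_rcu *pop_owned(void)
 {
 	struct cds_lfs_node_rcu *node;
 
 	pthread_mutex_lock(&pop_mutex);	/* serializes pop against pop only */
 	node = cds_lfs_pop_rcu(&s);	/* push stays lock-free */
 	pthread_mutex_unlock(&pop_mutex);
 	return node;	/* caller owns the node and may free it right away */
 }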

Good timing -- stacks and queues are next on my list for documentation.  ;-)

Thanx, Paul


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] rculfstack bug

2012-10-10 Thread Paul E. McKenney
On Wed, Oct 10, 2012 at 01:53:04PM -0400, Mathieu Desnoyers wrote:
 * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
   On Wed, Oct 10, 2012 at 07:42:15AM -0400, Mathieu Desnoyers wrote:
* Lai Jiangshan (la...@cn.fujitsu.com) wrote:
 test code:
 ./tests/test_urcu_lfs 100 10 10
 
 bug produce rate  60%
 
 {{{
 I didn't see any bug when ./tests/test_urcu_lfs 10 10 10 Or 
 ./tests/test_urcu_lfs 100 100 10
 But I just test it about 5 times
 }}}
 
 4cores*1threads: Intel(R) Core(TM) i5 CPU 760
 RCU_MB (no time to test for other rcu type)
 test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
 
 I didn't see any bug when ./tests/test_urcu_mb 10 100 10
 
 Sorry, I tried, but I failed to find out the root cause currently.

I think I managed to narrow down the issue:

1) the master branch does not reproduce it, but commit
    768fba83676f49eb73fd1d8ad452016a84c5ec2a reproduces it about 50% of 
the
   time.

2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
   current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
   moving to wfcqueue.

3) the bug always arise, for me, at the end of the 10 seconds.
   However, it might be simply due to the fact that most of the memory
   get freed at the end of program execution.

4) I've been able to get a backtrace, and it looks like we have some
    call_rcu callback-invocation threads still working while
   call_rcu_data_free() is invoked. In the backtrace, 
call_rcu_data_free()
   is nicely waiting for the next thread to stop, and during that time,
    two callback-invocation threads are invoking callbacks (and one of
   them triggers the segfault).
   
   Do any of the callbacks reference __thread variables from some other
   thread?  If so, those threads must refrain from exiting until after
   such callbacks complete.
  
  The callback is a simple caa_container_of + free, usual stuff, nothing
  fancy.
 
 Here is the fix: the bug was in call rcu. It is not required for master,
 because we fixed it while moving to wfcqueue.
 
 We were erroneously writing to the head field of the default
 call_rcu_data rather than tail.

Ouch!!!  I have no idea why that would have passed my testing.  :-(

 I wonder if we should simply do a new release with call_rcu using
 wfcqueue and tell people to upgrade, or if we should somehow create a
 stable branch with this fix.
 
 Thoughts ?

Under what conditions does this bug appear?  It is necessary to not just
use call_rcu(), but also to explicitly call call_rcu_data_free(), right?

My guess is that a stable branch would be good -- there will be other
bugs, after all.  :-/

Thanx, Paul

 Thanks,
 
 Mathieu
 
 ---
 diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
 index 13b24ff..b205229 100644
 --- a/urcu-call-rcu-impl.h
 +++ b/urcu-call-rcu-impl.h
 @@ -647,8 +647,9 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
   /* Create default call rcu data if need be */
   (void) get_default_call_rcu_data();
   cbs_endprev = (struct cds_wfq_node **)
  - uatomic_xchg(&default_call_rcu_data, &cbs_tail);
  - *cbs_endprev = cbs;
  + uatomic_xchg(&default_call_rcu_data->cbs.tail,
  + &cbs_tail);
  + _CMM_STORE_SHARED(*cbs_endprev, cbs);
    uatomic_add(&default_call_rcu_data->qlen,
    uatomic_read(&crdp->qlen));
   wake_call_rcu_thread(default_call_rcu_data);
 
 
  
  Thanks,
  
  Mathieu
  
   
 Thanx, Paul
   
So I expect that commit 

commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
Date:   Tue Sep 25 10:50:49 2012 -0500

call_rcu: use wfcqueue, eliminate false-sharing

Eliminate false-sharing between call_rcu (enqueuer) and worker 
threads
on the queue head and tail.

Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com

Could have managed to fix the issue, or change the timing enough that it
does not reproduces. I'll continue investigating.

Thanks,

Mathieu


 
 *** glibc detected *** 
 /home/laijs/work/userspace-rcu/tests/.libs/lt-test_urcu_lfs: double 
 free or corruption (out): 0x7f20955dfbb0 ***
 === Backtrace: =
 /lib64/libc.so.6[0x37ee676d63]
 /home/laijs/work/userspace-rcu/tests/.libs/lt-test_urcu_lfs[0x4024f5]
 /lib64/libpthread.so.0[0x37eda06ccb]
 /lib64/libc.so.6(clone+0x6d)[0x37ee6e0c2d

Re: [lttng-dev] rculfstack bug

2012-10-10 Thread Paul E. McKenney
On Thu, Oct 11, 2012 at 09:31:01AM +0800, Lai Jiangshan wrote:
 On 10/11/2012 03:50 AM, Paul E. McKenney wrote:
  On Wed, Oct 10, 2012 at 01:53:04PM -0400, Mathieu Desnoyers wrote:
  * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
  * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Wed, Oct 10, 2012 at 07:42:15AM -0400, Mathieu Desnoyers wrote:
  * Lai Jiangshan (la...@cn.fujitsu.com) wrote:
  test code:
  ./tests/test_urcu_lfs 100 10 10
 
  bug produce rate  60%
 
  {{{
  I didn't see any bug when ./tests/test_urcu_lfs 10 10 10 Or 
  ./tests/test_urcu_lfs 100 100 10
  But I just test it about 5 times
  }}}
 
  4cores*1threads: Intel(R) Core(TM) i5 CPU 760
  RCU_MB (no time to test for other rcu type)
  test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
 
  I didn't see any bug when ./tests/test_urcu_mb 10 100 10
 
  Sorry, I tried, but I failed to find out the root cause currently.
 
  I think I managed to narrow down the issue:
 
  1) the master branch does not reproduce it, but commit
  768fba83676f49eb73fd1d8ad452016a84c5ec2a reproduces it about 50% of 
  the
 time.
 
  2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
 current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
 moving to wfcqueue.
 
  3) the bug always arise, for me, at the end of the 10 seconds.
 However, it might be simply due to the fact that most of the memory
 get freed at the end of program execution.
 
  4) I've been able to get a backtrace, and it looks like we have some
  call_rcu callback-invocation threads still working while
 call_rcu_data_free() is invoked. In the backtrace, 
  call_rcu_data_free()
 is nicely waiting for the next thread to stop, and during that time,
  two callback-invocation threads are invoking callbacks (and one of
 them triggers the segfault).
 
  Do any of the callbacks reference __thread variables from some other
  thread?  If so, those threads must refrain from exiting until after
  such callbacks complete.
 
  The callback is a simple caa_container_of + free, usual stuff, nothing
  fancy.
 
  Here is the fix: the bug was in call rcu. It is not required for master,
  because we fixed it while moving to wfcqueue.
 
  We were erroneously writing to the head field of the default
  call_rcu_data rather than tail.
  
  Ouch!!!  I have no idea why that would have passed my testing.  :-(
 
  It's one of the reasons that I rewrote wfqueue and introduced delete_all()
  (Mathieu uses splice instead) to replace the open-coded wfqueue in
  urcu-call-rcu-impl.h.

Good catch!!!

Thanx, Paul

  I wonder if we should simply do a new release with call_rcu using
  wfcqueue and tell people to upgrade, or if we should somehow create a
  stable branch with this fix.
 
  Thoughts ?
  
  Under what conditions does this bug appear?  It is necessary to not just
  use call_rcu(), but also to explicitly call call_rcu_data_free(), right?
  
  My guess is that a stable branch would be good -- there will be other
  bugs, after all.  :-/
  
  Thanx, Paul
  
  Thanks,
 
  Mathieu
 
  ---
  diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
  index 13b24ff..b205229 100644
  --- a/urcu-call-rcu-impl.h
  +++ b/urcu-call-rcu-impl.h
  @@ -647,8 +647,9 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
 /* Create default call rcu data if need be */
 (void) get_default_call_rcu_data();
 cbs_endprev = (struct cds_wfq_node **)
   -  uatomic_xchg(&default_call_rcu_data, &cbs_tail);
   -  *cbs_endprev = cbs;
   +  uatomic_xchg(&default_call_rcu_data->cbs.tail,
   +  &cbs_tail);
   +  _CMM_STORE_SHARED(*cbs_endprev, cbs);
  uatomic_add(&default_call_rcu_data->qlen,
  uatomic_read(&crdp->qlen));
 wake_call_rcu_thread(default_call_rcu_data);
 
 
 
  Thanks,
 
  Mathieu
 
 
   Thanx, Paul
 
  So I expect that commit 
 
  commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
  Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
  Date:   Tue Sep 25 10:50:49 2012 -0500
 
  call_rcu: use wfcqueue, eliminate false-sharing
  
  Eliminate false-sharing between call_rcu (enqueuer) and worker 
  threads
  on the queue head and tail.
  
 
 I think the changelog of this commit is too short.
 
 Thanks,
 Lai
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] [URCU PATCH 3/3] call_rcu: use wfcqueue, eliminate false-sharing

2012-10-08 Thread Paul E. McKenney
On Mon, Oct 08, 2012 at 10:49:16AM -0400, Mathieu Desnoyers wrote:
 * Lai Jiangshan (la...@cn.fujitsu.com) wrote:
  On 10/02/2012 10:16 PM, Mathieu Desnoyers wrote:
   Eliminate false-sharing between call_rcu (enqueuer) and worker threads
   on the queue head and tail.
   
   Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   ---
   diff --git a/tests/Makefile.am b/tests/Makefile.am
   index 81718bb..c92bbe6 100644
   --- a/tests/Makefile.am
   +++ b/tests/Makefile.am
   @@ -30,14 +30,14 @@ if COMPAT_FUTEX
COMPAT+=$(top_srcdir)/compat_futex.c
endif

    -URCU=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfqueue.c $(COMPAT)
    -URCU_QSBR=$(top_srcdir)/urcu-qsbr.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfqueue.c $(COMPAT)
    +URCU=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfcqueue.c $(COMPAT)
    +URCU_QSBR=$(top_srcdir)/urcu-qsbr.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfcqueue.c $(COMPAT)
     # URCU_MB uses urcu.c but -DRCU_MB must be defined
    -URCU_MB=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfqueue.c $(COMPAT)
    +URCU_MB=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfcqueue.c $(COMPAT)
     # URCU_SIGNAL uses urcu.c but -DRCU_SIGNAL must be defined
    -URCU_SIGNAL=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfqueue.c $(COMPAT)
    -URCU_BP=$(top_srcdir)/urcu-bp.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfqueue.c $(COMPAT)
    -URCU_DEFER=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfqueue.c $(COMPAT)
    +URCU_SIGNAL=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfcqueue.c $(COMPAT)
    +URCU_BP=$(top_srcdir)/urcu-bp.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfcqueue.c $(COMPAT)
    +URCU_DEFER=$(top_srcdir)/urcu.c $(top_srcdir)/urcu-pointer.c $(top_srcdir)/wfcqueue.c $(COMPAT)

URCU_COMMON_LIB=$(top_builddir)/liburcu-common.la
URCU_LIB=$(top_builddir)/liburcu.la
   diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
   index 13b24ff..cf65992 100644
   --- a/urcu-call-rcu-impl.h
   +++ b/urcu-call-rcu-impl.h
    @@ -21,6 +21,7 @@
      */
     
     #define _GNU_SOURCE
    +#define _LGPL_SOURCE
     #include <stdio.h>
     #include <pthread.h>
     #include <signal.h>
    @@ -35,7 +36,7 @@
     #include <sched.h>
     
     #include "config.h"
    -#include <urcu/wfqueue.h>
    +#include <urcu/wfcqueue.h>
     #include "urcu-call-rcu.h"
     #include "urcu-pointer.h"
     #include <urcu/list.h>
    @@ -46,7 +47,14 @@
     /* Data structure that identifies a call_rcu thread. */
     
     struct call_rcu_data {
    -	struct cds_wfq_queue cbs;
    +	/*
    +	 * Align the tail on cache line size to eliminate false-sharing
    +	 * with head.
    +	 */
    +	struct cds_wfcq_tail __attribute__((aligned(CAA_CACHE_LINE_SIZE))) cbs_tail;
    +	/* Alignment on cache line size will add padding here */
    +
    +	struct cds_wfcq_head cbs_head;
  
  
  wrong here. In this code, cbs_tail and cbs_head are in the same cache line.
  
  ---
  
   struct call_rcu_data {
   	struct cds_wfcq_tail cbs_tail;
   	struct cds_wfcq_head __attribute__((aligned(CAA_CACHE_LINE_SIZE))) cbs_head;
   	/* other fields, can move some fields up to use the room between tail
   	   and head */
   };
   
   # cat test.c
   
   #include <stdio.h>
   
   struct a {
   	int __attribute__((aligned(64))) i;
   	int j;
   };
   struct b {
   	int i;
   	int __attribute__((aligned(64))) j;
   };
   
   int main(void)
   {
   	printf("%d,%d\n", (int) sizeof(struct a), (int) sizeof(struct b));
   	return 0;
   }
  
  # ./a.out
  64,128
  
 
 Good point! While we are there, I notice that the qlen count, kept for
 debugging, is causing false-sharing too. I wonder if we should split
 this counter into two counters: nr_enqueue and nr_dequeue, which would sit
 on two different cache lines ? It's mainly Paul who cares about this
 counter. Thoughts ?

Works for me, as long as nr_enqueue and nr_dequeue are both unsigned long
to avoid issues with overflow.
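
To make the idea concrete, here is a sketch of what the split might look
like (illustrative only, not a patch; uatomic_inc(), uatomic_read() and
CAA_CACHE_LINE_SIZE are existing urcu primitives). Unsigned arithmetic is
what keeps the derived length sane across counter wraparound:

	#include <urcu/compiler.h>	/* CAA_CACHE_LINE_SIZE */
	#include <urcu/uatomic.h>

	struct crdp_counts {
		/* Touched only by enqueuers. */
		unsigned long nr_enqueue
			__attribute__((aligned(CAA_CACHE_LINE_SIZE)));
		/* Touched only by the worker thread, on its own cache line. */
		unsigned long nr_dequeue
			__attribute__((aligned(CAA_CACHE_LINE_SIZE)));
	};

	static inline void count_enqueue(struct crdp_counts *c)
	{
		uatomic_inc(&c->nr_enqueue);
	}

	static inline void count_dequeue(struct crdp_counts *c)
	{
		uatomic_inc(&c->nr_dequeue);
	}

	/* Approximate queue length, for debugging only. */
	static inline unsigned long count_qlen(struct crdp_counts *c)
	{
		return uatomic_read(&c->nr_enqueue) - uatomic_read(&c->nr_dequeue);
	}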

Thanx, Paul

 Here is the fix to the problem you noticed above:
 
 commit b9f893b69fbc31baea418794938f4eb74cc4923a
 Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 Date:   Mon Oct 8 10:44:38 2012 -0400
 
 Fix urcu-call-rcu-impl.h: false-sharing
 
struct call_rcu_data {
   -   struct cds_wfq_queue cbs;
   +   /*
   +* Align the tail on cache line size to eliminate false-sharing
   +* with head.
   +*/
   +   struct cds_wfcq_tail 
 __attribute__((aligned(CAA_CACHE_LINE_SIZE))) cbs_tail;
   +   /* Alignment on cache line size will add padding here */
   +
   +   struct cds_wfcq_head cbs_head;
 
 
  wrong here. In this code, cbs_tail and cbs_head are in the same cache 
 line.
 
 Reported-by: Lai Jiangshan la...@cn.fujitsu.com
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 
 Thanks!
 
 Mathieu
 
 

Re: [lttng-dev] [rp] [URCU PATCH 0/3] wait-free concurrent queues (wfcqueue)

2012-10-04 Thread Paul E. McKenney
On Wed, Oct 03, 2012 at 05:04:36PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, Oct 02, 2012 at 10:13:07AM -0400, Mathieu Desnoyers wrote:
   Implement wait-free concurrent queues, with a new API different from
   wfqueue.h, which is already provided by Userspace RCU. The advantage of
   splitting the head and tail objects of the queue into different
   arguments is to allow these to sit on different cache-lines, thus
   eliminating false-sharing, leading to a 2.3x speed increase.
   
   This API also introduces a splice operation, which moves all nodes
   from one queue into another, and postpones the synchronization to either
   dequeue or iteration on the list. The splice operation does not need to
   touch every single node of the queue it moves them from. Moreover, the
   splice operation only needs to ensure mutual exclusion with other
   dequeuers, iterations, and splice operations from the list it splices
   from, but acts as a simple enqueuer on the list it splices into (no
   mutual exclusion needed for that list).
   
   Feedback is welcome,
  
  These look sane to me, though I must confess that the tail pointer
  referencing the node rather than the node's next pointer did throw
  me for a bit.  ;-)
 
 Yes, this was originally introduced with Lai's original patch to
 wfqueue, which I think is a nice simplification: it's pretty much the
 same thing to use the last node address as the tail rather than the address
 of its first member (its next pointer address (_not_ value)). It ends up
 being the same address in this case, but more interestingly, we don't
 have to use a struct cds_wfcq_node ** type: a simple struct
 cds_wfcq_node * suffices.
 
 Thanks Paul, I will therefore merge these 3 patches with your Acked-by.

Good point -- just confirming:

Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com

 Lai, you are welcome to provide improvements to this code against the
 master branch. I will gladly consider any change you propose.
 
 Thanks!
 
 Mathieu
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 
 




Re: [lttng-dev] [rp] [URCU PATCH 0/3] wait-free concurrent queues (wfcqueue)

2012-10-03 Thread Paul E. McKenney
On Tue, Oct 02, 2012 at 10:13:07AM -0400, Mathieu Desnoyers wrote:
 Implement wait-free concurrent queues, with a new API different from
 wfqueue.h, which is already provided by Userspace RCU. The advantage of
 splitting the head and tail objects of the queue into different
 arguments is to allow these to sit on different cache-lines, thus
 eliminating false-sharing, leading to a 2.3x speed increase.
 
 This API also introduces a splice operation, which moves all nodes
 from one queue into another, and postpones the synchronization to either
 dequeue or iteration on the list. The splice operation does not need to
 touch every single node of the queue it moves them from. Moreover, the
 splice operation only needs to ensure mutual exclusion with other
 dequeuers, iterations, and splice operations from the list it splices
 from, but acts as a simple enqueuer on the list it splices into (no
 mutual exclusion needed for that list).
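 
 For illustration, a sketch of the intended calling pattern (based on the
 cds_wfcq_* names introduced by this patch set; treat the exact details as
 approximate, urcu/wfcqueue.h is authoritative):
 
	#include <urcu/wfcqueue.h>

	struct item {
		struct cds_wfcq_node node;
		int value;
	};

	static struct cds_wfcq_head src_head, dst_head;
	static struct cds_wfcq_tail src_tail, dst_tail;

	static void example(struct item *it)
	{
		cds_wfcq_init(&src_head, &src_tail);
		cds_wfcq_init(&dst_head, &dst_tail);

		cds_wfcq_node_init(&it->node);
		cds_wfcq_enqueue(&src_head, &src_tail, &it->node);	/* wait-free */

		/*
		 * Move everything from src to dst without walking the nodes:
		 * mutual exclusion is needed only against src dequeuers; on
		 * the dst side this acts as a plain enqueue.
		 */
		cds_wfcq_splice_blocking(&dst_head, &dst_tail,
				&src_head, &src_tail);
	}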
 
 Feedback is welcome,

These look sane to me, though I must confess that the tail pointer
referencing the node rather than the node's next pointer did throw
me for a bit.  ;-)

Thanx, Paul




Re: [lttng-dev] [PATCH] Ensure that read-side functions meet 10-line LGPL criterion

2012-09-04 Thread Paul E. McKenney
On Mon, Sep 03, 2012 at 02:03:00PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  This commit ensures that all read-side functions meet the 10-line LGPL
  criterion that permits them to be expanded directly into non-LGPL code,
  without function-call instructions.  It also documents this as the intent.
  
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
  
  diff --git a/urcu/static/urcu-bp.h b/urcu/static/urcu-bp.h
  index e7b2eda..881b4a4 100644
  --- a/urcu/static/urcu-bp.h
  +++ b/urcu/static/urcu-bp.h
  @@ -6,8 +6,8 @@
*
* Userspace RCU header.
*
  - * TO BE INCLUDED ONLY IN LGPL-COMPATIBLE CODE. See urcu.h for linking
  - * dynamically with the userspace rcu library.
  + * TO BE INCLUDED ONLY IN CODE THAT IS TO BE RECOMPILED ON EACH LIBURCU
  + * RELEASE. See urcu.h for linking dynamically with the userspace rcu 
  library.
*
* Copyright (c) 2009 Mathieu Desnoyers mathieu.desnoy...@efficios.com
* Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
  @@ -162,32 +162,48 @@ static inline int rcu_old_gp_ongoing(long *value)
   ((v ^ rcu_gp_ctr)  RCU_GP_CTR_PHASE);
   }
   
  +/*
  + * Helper for _rcu_read_lock().  The format of rcu_gp_ctr (as well as
  + * the per-thread rcu_reader.ctr) has the upper bits containing a count of
  + * _rcu_read_lock() nesting, and a lower-order bit that contains either 
  zero
  + * or RCU_GP_CTR_PHASE.  The smp_mb_slave() ensures that the accesses in
  + * _rcu_read_lock() happen before the subsequent read-side critical 
  section.
  + */
  +static inline void _rcu_read_lock_help(unsigned long tmp)
 
 could we rename the _rcu_read_lock_help to _rcu_read_lock_update ?
 
 I think it would fit better the role of this function in the algorithm.
 
  As Josh pointed out, directloy -> directly below,
 
 The rest looks good. I'll wait for an updated version.

Here you go!

Thanx, Paul



Ensure that read-side functions meet 10-line LGPL criterion

This commit ensures that all read-side functions meet the 10-line LGPL
criterion that permits them to be expanded directly into non-LGPL code,
without function-call instructions.  It also documents this as the intent.

[ paulmck: Spelling fixes called out by Josh Triplett and name
change called out by Mathieu Desnoyers (_rcu_read_lock_help() ->
_rcu_read_lock_update()). ]

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/urcu/static/urcu-bp.h b/urcu/static/urcu-bp.h
index e7b2eda..a2f7368 100644
--- a/urcu/static/urcu-bp.h
+++ b/urcu/static/urcu-bp.h
@@ -6,8 +6,8 @@
  *
  * Userspace RCU header.
  *
- * TO BE INCLUDED ONLY IN LGPL-COMPATIBLE CODE. See urcu.h for linking
- * dynamically with the userspace rcu library.
+ * TO BE INCLUDED ONLY IN CODE THAT IS TO BE RECOMPILED ON EACH LIBURCU
+ * RELEASE. See urcu.h for linking dynamically with the userspace rcu library.
  *
  * Copyright (c) 2009 Mathieu Desnoyers mathieu.desnoy...@efficios.com
  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
@@ -162,32 +162,48 @@ static inline int rcu_old_gp_ongoing(long *value)
 ((v ^ rcu_gp_ctr) & RCU_GP_CTR_PHASE);
 }
 
+/*
+ * Helper for _rcu_read_lock().  The format of rcu_gp_ctr (as well as
+ * the per-thread rcu_reader.ctr) has the upper bits containing a count of
+ * _rcu_read_lock() nesting, and a lower-order bit that contains either zero
+ * or RCU_GP_CTR_PHASE.  The smp_mb_slave() ensures that the accesses in
+ * _rcu_read_lock() happen before the subsequent read-side critical section.
+ */
+static inline void _rcu_read_lock_update(unsigned long tmp)
+{
+	if (caa_likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
+		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr,
+			_CMM_LOAD_SHARED(rcu_gp_ctr));
+		cmm_smp_mb();
+	} else
+		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr, tmp +
+			RCU_GP_COUNT);
+}
+
+/*
+ * Enter an RCU read-side critical section.
+ *
+ * The first cmm_barrier() call ensures that the compiler does not reorder
+ * the body of _rcu_read_lock() with a mutex.
+ *
+ * This function and its helper are both less than 10 lines long.  The
+ * intent is that this function meets the 10-line criterion in LGPL,
+ * allowing this function to be invoked directly from non-LGPL code.
+ */
 static inline void _rcu_read_lock(void)
 {
long tmp;
 
-   /* Check if registered */
if (caa_unlikely(!URCU_TLS(rcu_reader)))
-   rcu_bp_register();
-
+   rcu_bp_register(); /* If not yet registered. */
cmm_barrier();  /* Ensure the compiler does not reorder us with mutex */
	tmp = URCU_TLS(rcu_reader)->ctr;
-	/*
-	 * rcu_gp_ctr is
-	 *   RCU_GP_COUNT | (~RCU_GP_CTR_PHASE or RCU_GP_CTR_PHASE)
-	 */
-	if (caa_likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
-		_CMM_STORE_SHARED

Re: [lttng-dev] [rp] [PATCH] Ensure that read-side functions meet 10-line LGPL criterion

2012-09-02 Thread Paul E. McKenney
On Sat, Sep 01, 2012 at 10:13:55PM -0700, Josh Triplett wrote:
 On Sat, Sep 01, 2012 at 05:59:11PM -0700, Paul E. McKenney wrote:
  This commit ensures that all read-side functions meet the 10-line LGPL
  criterion that permits them to be expanded directly into non-LGPL code,
  without function-call instructions.  It also documents this as the intent.
  
  Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 
 s/directloy/directly/g in the comments.

Good catch, fixed.

 Also, this seems inordinately silly. :)
 
 Assuming you don't plan to copy other LGPLed code into this library (or
 more specifically the header file), you might consider just adding an
 explicit exception at the top, saying that the inline functions in this
 file may be assumed to qualify for the relevant clause of the LGPL,
 regardless of their length.  (You'd probably want to limit that
 exception to only the code in the header, not any other code in the
 library, so someone couldn't just copy the whole library into the
 headers.)

I believe that it is important to allow LGPL code to flow easily between
these headers and other LGPL projects.  This commit represents a trivial
change, admittedly, but one that could save a large amount of bookkeeping
and license-compatibility hassle down the road.

Thanx, Paul

  diff --git a/urcu/static/urcu-bp.h b/urcu/static/urcu-bp.h
  index e7b2eda..881b4a4 100644
  --- a/urcu/static/urcu-bp.h
  +++ b/urcu/static/urcu-bp.h
  @@ -6,8 +6,8 @@
*
* Userspace RCU header.
*
  - * TO BE INCLUDED ONLY IN LGPL-COMPATIBLE CODE. See urcu.h for linking
  - * dynamically with the userspace rcu library.
  + * TO BE INCLUDED ONLY IN CODE THAT IS TO BE RECOMPILED ON EACH LIBURCU
  + * RELEASE. See urcu.h for linking dynamically with the userspace rcu 
  library.
*
* Copyright (c) 2009 Mathieu Desnoyers mathieu.desnoy...@efficios.com
* Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
  @@ -162,32 +162,48 @@ static inline int rcu_old_gp_ongoing(long *value)
    ((v ^ rcu_gp_ctr) & RCU_GP_CTR_PHASE);
   }
   
  +/*
  + * Helper for _rcu_read_lock().  The format of rcu_gp_ctr (as well as
  + * the per-thread rcu_reader.ctr) has the upper bits containing a count of
  + * _rcu_read_lock() nesting, and a lower-order bit that contains either 
  zero
  + * or RCU_GP_CTR_PHASE.  The smp_mb_slave() ensures that the accesses in
  + * _rcu_read_lock() happen before the subsequent read-side critical 
  section.
  + */
  +static inline void _rcu_read_lock_help(unsigned long tmp)
  +{
  +	if (caa_likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
  +		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr,
  +			_CMM_LOAD_SHARED(rcu_gp_ctr));
  +		cmm_smp_mb();
  +	} else
  +		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr, tmp +
  +			RCU_GP_COUNT);
  +}
  +
  +/*
  + * Enter an RCU read-side critical section.
  + *
  + * The first cmm_barrier() call ensures that the compiler does not reorder
  + * the body of _rcu_read_lock() with a mutex.
  + *
  + * This function and its helper are both less than 10 lines long.  The
  + * intent is that this function meets the 10-line criterion in LGPL,
  + * allowing this function to be invoked directly from non-LGPL code.
  + */
   static inline void _rcu_read_lock(void)
   {
  long tmp;
   
  -   /* Check if registered */
  if (caa_unlikely(!URCU_TLS(rcu_reader)))
  -   rcu_bp_register();
  -
  +   rcu_bp_register(); /* If not yet registered. */
  cmm_barrier();  /* Ensure the compiler does not reorder us with mutex */
  	tmp = URCU_TLS(rcu_reader)->ctr;
  -	/*
  -	 * rcu_gp_ctr is
  -	 *   RCU_GP_COUNT | (~RCU_GP_CTR_PHASE or RCU_GP_CTR_PHASE)
  -	 */
  -	if (caa_likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
  -		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr,
  -			_CMM_LOAD_SHARED(rcu_gp_ctr));
  -		/*
  -		 * Set active readers count for outermost nesting level before
  -		 * accessing the pointer.
  -		 */
  -		cmm_smp_mb();
  -	} else {
  -		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr, tmp +
  -			RCU_GP_COUNT);
  -	}
  +   _rcu_read_lock_help(tmp);
   }
   
  +/*
  + * Exit an RCU read-side critical section.  This function is less than
  + * 10 lines of code, and is intended to be usable by non-LGPL code, as
  + * called out in LGPL.
  + */
   static inline void _rcu_read_unlock(void)
   {
  /*
  diff --git a/urcu/static/urcu-pointer.h b/urcu/static/urcu-pointer.h
  index 48dc5bf..0ddf6a1 100644
  --- a/urcu/static/urcu-pointer.h
  +++ b/urcu/static/urcu-pointer.h
  @@ -6,8 +6,8 @@
*
* Userspace RCU header. Operations on pointers.
*
  - * TO BE INCLUDED ONLY IN LGPL-COMPATIBLE CODE. See urcu-pointer.h for
  - * linking dynamically with the userspace rcu library.
  + * TO BE INCLUDED ONLY IN CODE THAT IS TO BE RECOMPILED ON EACH LIBURCU
  + * RELEASE. See urcu.h for linking

[lttng-dev] [PATCH] Ensure that read-side functions meet 10-line LGPL criterion

2012-09-01 Thread Paul E. McKenney
This commit ensures that all read-side functions meet the 10-line LGPL
criterion that permits them to be expanded directly into non-LGPL code,
without function-call instructions.  It also documents this as the intent.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/urcu/static/urcu-bp.h b/urcu/static/urcu-bp.h
index e7b2eda..881b4a4 100644
--- a/urcu/static/urcu-bp.h
+++ b/urcu/static/urcu-bp.h
@@ -6,8 +6,8 @@
  *
  * Userspace RCU header.
  *
- * TO BE INCLUDED ONLY IN LGPL-COMPATIBLE CODE. See urcu.h for linking
- * dynamically with the userspace rcu library.
+ * TO BE INCLUDED ONLY IN CODE THAT IS TO BE RECOMPILED ON EACH LIBURCU
+ * RELEASE. See urcu.h for linking dynamically with the userspace rcu library.
  *
  * Copyright (c) 2009 Mathieu Desnoyers mathieu.desnoy...@efficios.com
  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
@@ -162,32 +162,48 @@ static inline int rcu_old_gp_ongoing(long *value)
 ((v ^ rcu_gp_ctr) & RCU_GP_CTR_PHASE);
 }
 
+/*
+ * Helper for _rcu_read_lock().  The format of rcu_gp_ctr (as well as
+ * the per-thread rcu_reader.ctr) has the upper bits containing a count of
+ * _rcu_read_lock() nesting, and a lower-order bit that contains either zero
+ * or RCU_GP_CTR_PHASE.  The smp_mb_slave() ensures that the accesses in
+ * _rcu_read_lock() happen before the subsequent read-side critical section.
+ */
+static inline void _rcu_read_lock_help(unsigned long tmp)
+{
+	if (caa_likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
+		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr,
+			_CMM_LOAD_SHARED(rcu_gp_ctr));
+		cmm_smp_mb();
+	} else
+		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr, tmp +
+			RCU_GP_COUNT);
+}
+
+/*
+ * Enter an RCU read-side critical section.
+ *
+ * The first cmm_barrier() call ensures that the compiler does not reorder
+ * the body of _rcu_read_lock() with a mutex.
+ *
+ * This function and its helper are both less than 10 lines long.  The
+ * intent is that this function meets the 10-line criterion in LGPL,
+ * allowing this function to be invoked directly from non-LGPL code.
+ */
 static inline void _rcu_read_lock(void)
 {
long tmp;
 
-   /* Check if registered */
if (caa_unlikely(!URCU_TLS(rcu_reader)))
-   rcu_bp_register();
-
+   rcu_bp_register(); /* If not yet registered. */
cmm_barrier();  /* Ensure the compiler does not reorder us with mutex */
	tmp = URCU_TLS(rcu_reader)->ctr;
-	/*
-	 * rcu_gp_ctr is
-	 *   RCU_GP_COUNT | (~RCU_GP_CTR_PHASE or RCU_GP_CTR_PHASE)
-	 */
-	if (caa_likely(!(tmp & RCU_GP_CTR_NEST_MASK))) {
-		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr,
-			_CMM_LOAD_SHARED(rcu_gp_ctr));
-		/*
-		 * Set active readers count for outermost nesting level before
-		 * accessing the pointer.
-		 */
-		cmm_smp_mb();
-	} else {
-		_CMM_STORE_SHARED(URCU_TLS(rcu_reader)->ctr, tmp +
-			RCU_GP_COUNT);
-	}
+   _rcu_read_lock_help(tmp);
 }
 
+/*
+ * Exit an RCU read-side critical section.  This function is less than
+ * 10 lines of code, and is intended to be usable by non-LGPL code, as
+ * called out in LGPL.
+ */
 static inline void _rcu_read_unlock(void)
 {
/*
diff --git a/urcu/static/urcu-pointer.h b/urcu/static/urcu-pointer.h
index 48dc5bf..0ddf6a1 100644
--- a/urcu/static/urcu-pointer.h
+++ b/urcu/static/urcu-pointer.h
@@ -6,8 +6,8 @@
  *
  * Userspace RCU header. Operations on pointers.
  *
- * TO BE INCLUDED ONLY IN LGPL-COMPATIBLE CODE. See urcu-pointer.h for
- * linking dynamically with the userspace rcu library.
+ * TO BE INCLUDED ONLY IN CODE THAT IS TO BE RECOMPILED ON EACH LIBURCU
+ * RELEASE. See urcu.h for linking dynamically with the userspace rcu library.
  *
  * Copyright (c) 2009 Mathieu Desnoyers mathieu.desnoy...@efficios.com
  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
@@ -59,8 +59,11 @@ extern "C" {
  * addition to forthcoming C++ standard.
  *
  * Should match rcu_assign_pointer() or rcu_xchg_pointer().
+ *
+ * This macro is less than 10 lines long.  The intent is that this macro
+ * meets the 10-line criterion in LGPL, allowing this function to be
+ * expanded directloy in non-LGPL code.
  */
-
 #define _rcu_dereference(p) ({ \
__typeof__(p) _p1 = CMM_LOAD_SHARED(p); 
\
cmm_smp_read_barrier_depends(); \
@@ -73,8 +76,11 @@ extern "C" {
  * data structure, which can be safely freed after waiting for a quiescent 
state
  * using synchronize_rcu(). If fails (unexpected value), returns old (which
  * should not be freed !).
+ *
+ * This macro is less than 10 lines long.  The intent is that this macro
+ * meets the 10-line criterion in LGPL, allowing this function to be
+ * expanded directloy in non-LGPL code.
  */
-
 #define

Re: [lttng-dev] [PATCH 2/2] urcu: new wfqueue implementation

2012-08-10 Thread Paul E. McKenney
 **cbs_tail;
 - struct cds_wfq_node **cbs_endprev;
 + struct cds_wfq_node *head, *tail;
 
   if (crdp == NULL || crdp == default_call_rcu_data) {
   return;
   }
 +
 	if ((uatomic_read(&crdp->flags) & URCU_CALL_RCU_STOPPED) == 0) {
 		uatomic_or(&crdp->flags, URCU_CALL_RCU_STOP);
 		wake_call_rcu_thread(crdp);
 		while ((uatomic_read(&crdp->flags) & URCU_CALL_RCU_STOPPED) == 0)
 			poll(NULL, 0, 1);
 	}
 -	if (crdp->cbs.head != _CMM_LOAD_SHARED(crdp->cbs.tail)) {
 -		while ((cbs = _CMM_LOAD_SHARED(crdp->cbs.head)) == NULL)
 -			poll(NULL, 0, 1);
 -		_CMM_STORE_SHARED(crdp->cbs.head, NULL);
 -		cbs_tail = (struct cds_wfq_node **)
 -			uatomic_xchg(&crdp->cbs.tail, &crdp->cbs.head);
 +
 +	if (!cds_wfq_empty(&crdp->cbs)) {
 +		head = __cds_wfq_dequeue_all_blocking(&crdp->cbs, &tail);
 +		assert(head);
 +
 		/* Create default call rcu data if need be */
 		(void) get_default_call_rcu_data();
 -		cbs_endprev = (struct cds_wfq_node **)
 -			uatomic_xchg(&default_call_rcu_data, cbs_tail);
 -		*cbs_endprev = cbs;
 +
 +		__cds_wfq_append_list(&default_call_rcu_data->cbs, head, tail);
 +
 		uatomic_add(&default_call_rcu_data->qlen,
 			uatomic_read(&crdp->qlen));
 +
 		wake_call_rcu_thread(default_call_rcu_data);
 	}
 
 diff --git a/urcu/static/wfqueue.h b/urcu/static/wfqueue.h
 index 636e1af..15ea9fc 100644
 --- a/urcu/static/wfqueue.h
 +++ b/urcu/static/wfqueue.h
 @@ -10,6 +10,7 @@
   * dynamically with the userspace rcu library.
   *
   * Copyright 2010 - Mathieu Desnoyers mathieu.desnoy...@efficios.com
 + * Copyright 2011-2012 - Lai Jiangshan la...@cn.fujitsu.com
   *
   * This library is free software; you can redistribute it and/or
   * modify it under the terms of the GNU Lesser General Public
 @@ -29,6 +30,7 @@
  #include <pthread.h>
  #include <assert.h>
  #include <poll.h>
 +#include <stdbool.h>
  #include <urcu/compiler.h>
  #include <urcu/uatomic.h>
 
 @@ -38,8 +40,6 @@ extern "C" {
 
  /*
   * Queue with wait-free enqueue/blocking dequeue.
 - * This implementation adds a dummy head node when the queue is empty to 
 ensure
 - * we can always update the queue locklessly.
   *
   * Inspired from half-wait-free/half-blocking queue implementation done by
   * Paul E. McKenney.
 @@ -57,31 +57,43 @@ static inline void _cds_wfq_init(struct cds_wfq_queue *q)
  {
   int ret;
 
 -	_cds_wfq_node_init(&q->dummy);
 	/* Set queue head and tail */
 -	q->head = &q->dummy;
 -	q->tail = &q->dummy.next;
 +	_cds_wfq_node_init(&q->head);
 +	q->tail = &q->head;
 	ret = pthread_mutex_init(&q->lock, NULL);
 	assert(!ret);
  }
 
 -static inline void _cds_wfq_enqueue(struct cds_wfq_queue *q,
 -		struct cds_wfq_node *node)
 +static inline bool _cds_wfq_empty(struct cds_wfq_queue *q)
  {
 -	struct cds_wfq_node **old_tail;
 +	/*
 +	 * Queue is empty if no node is pointed by q->head.next nor q->tail.
 +	 */
 +	return q->head.next == NULL && CMM_LOAD_SHARED(q->tail) == &q->head;
 +}
 
 +static inline void ___cds_wfq_append_list(struct cds_wfq_queue *q,
 + struct cds_wfq_node *head, struct cds_wfq_node *tail)
 +{
   /*
* uatomic_xchg() implicit memory barrier orders earlier stores to data
 	 * structure containing node and setting node->next to NULL before
 	 * publication.
 	 */
 -	old_tail = uatomic_xchg(&q->tail, &node->next);
 +	tail = uatomic_xchg(&q->tail, tail);
 +
 	/*
 -	 * At this point, dequeuers see a NULL old_tail->next, which indicates
 +	 * At this point, dequeuers see a NULL tail->next, which indicates
 	 * that the queue is being appended to. The following store will append
 	 * node to the queue from a dequeuer perspective.
 	 */
 -	CMM_STORE_SHARED(*old_tail, node);
 +	CMM_STORE_SHARED(tail->next, head);
 +}
 +
 +static inline void _cds_wfq_enqueue(struct cds_wfq_queue *q,
 + struct cds_wfq_node *node)
 +{
 + ___cds_wfq_append_list(q, node, node);
  }
 
  /*
 @@ -120,27 +132,46 @@ ___cds_wfq_dequeue_blocking(struct cds_wfq_queue *q)
  {
   struct cds_wfq_node *node, *next;
 
 - /*
 -  * Queue is empty if it only contains the dummy node.
 -  */
 -	if (q->head == &q->dummy && CMM_LOAD_SHARED(q->tail) == &q->dummy.next)
 +	if (_cds_wfq_empty(q))
 		return NULL;
 -	node = q->head;
 
 -	next = ___cds_wfq_node_sync_next(node);
 +	node = ___cds_wfq_node_sync_next(&q->head);
 +
 +	if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
 +		if (CMM_LOAD_SHARED(q->tail) == node) {
 +			/*
 +			 * @node is the only node in the queue.
 +			 * Try to move the tail to &q->head

Re: [lttng-dev] [RFC] Userspace RCU library internal error handling

2012-06-21 Thread Paul E. McKenney
On Thu, Jun 21, 2012 at 12:41:13PM -0400, Mathieu Desnoyers wrote:
 Hi,
 
 Currently, liburcu calls exit(-1) upon internal consistency error.
 This is not pretty, and usually frowned upon in libraries.
 
 One example of failure path where we use this is if pthread_mutex_lock()
 would happen to fail within synchronize_rcu(). Clearly, this should
 _never_ happen: it would typically be triggered only by memory
 corruption (or other terrible things like that). That being said, we
 clearly don't want to make synchronize_rcu() return errors like that
 to the application, because it would complexify the application error
 handling needlessly.
 
 So instead of calling exit(-1), one possibility would be to do something
 like this:
 
 #include <signal.h>
 #include <pthread.h>
 #include <stdio.h>
 
 #define urcu_die(fmt, ...)				\
 	do {						\
 		fprintf(stderr, fmt, ##__VA_ARGS__);	\
 		(void) pthread_kill(pthread_self(), SIGBUS); \
 	} while (0)
 
 and call urcu_die(); in those unrecoverable error cases, instead of
 calling exit(-1). Therefore, if an application chooses to trap those
 signals, it can, which is otherwise not possible with a direct call to
 exit().
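 
 To illustrate, an application that wants to trap this could install a
 plain POSIX handler (a sketch, independent of liburcu; handler names are
 made up for the example):
 
	#include <signal.h>
	#include <string.h>
	#include <unistd.h>

	static void on_sigbus(int sig)
	{
		/* Only async-signal-safe calls are allowed here. */
		static const char msg[] = "liburcu internal error\n";

		(void) sig;
		(void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
		_exit(1);
	}

	static void install_handler(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = on_sigbus;
		(void) sigaction(SIGBUS, &sa, NULL);
	}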

This approach makes a lot of sense to me.

Thanx, Paul




Re: [lttng-dev] [rp] [RFC PATCH urcu] Document uatomic operations

2012-05-17 Thread Paul E. McKenney
On Thu, May 17, 2012 at 06:04:13PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Thu, May 17, 2012 at 01:59:43PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Wed, May 16, 2012 at 11:45:39AM -0700, Josh Triplett wrote:
 On Wed, May 16, 2012 at 11:32:38AM -0700, Paul E. McKenney wrote:
  On Wed, May 16, 2012 at 02:17:42PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 15, 2012 at 08:10:03AM -0400, Mathieu Desnoyers 
wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Mon, May 14, 2012 at 10:39:01PM -0400, Mathieu Desnoyers 
  wrote:
   Document each atomic operation provided by 
   urcu/uatomic.h, along with
   their memory barrier guarantees.
  
  Great to see the documentation!!!  Some comments below.
  
  Thanx, 
  Paul
  
   Signed-off-by: Mathieu Desnoyers 
   mathieu.desnoy...@efficios.com
   ---
   diff --git a/doc/Makefile.am b/doc/Makefile.am
   index bec1d7c..db9811c 100644
   --- a/doc/Makefile.am
   +++ b/doc/Makefile.am
   @@ -1 +1 @@
   -dist_doc_DATA = rcu-api.txt
   +dist_doc_DATA = rcu-api.txt uatomic-api.txt
   diff --git a/doc/uatomic-api.txt b/doc/uatomic-api.txt
   new file mode 100644
   index 000..3605acf
   --- /dev/null
   +++ b/doc/uatomic-api.txt
   @@ -0,0 +1,80 @@
   +Userspace RCU Atomic Operations API
   +by Mathieu Desnoyers and Paul E. McKenney
   +
   +
   +This document describes the urcu/uatomic.h API. Those 
   are the atomic
   +operations provided by the Userspace RCU library. The 
   general rule
   +regarding memory barriers is that only uatomic_xchg(),
   +uatomic_cmpxchg(), uatomic_add_return(), and 
   uatomic_sub_return() imply
   +full memory barriers before and after the atomic 
   operation. Other
   +primitives don't guarantee any memory barrier.
   +
   +Only atomic operations performed on integers (int and 
   long, signed
   +and unsigned) are supported on all architectures. Some 
   architectures
   +also support 1-byte and 2-byte atomic operations. Those 
   respectively
   +have UATOMIC_HAS_ATOMIC_BYTE and 
   UATOMIC_HAS_ATOMIC_SHORT defined when
   +uatomic.h is included. An architecture trying to perform 
   an atomic write
   +to a type size not supported by the architecture will 
   trigger an illegal
   +instruction.
   +
   +In the description below, type is a type that can be 
   atomically
   +written to by the architecture. It needs to be at most 
   word-sized, and
   +its alignment needs to greater or equal to its size.
   +
   +type uatomic_set(type *addr, type v)
   +
   + Atomically write @v into @addr.
  
  Wouldn't this instead be void uatomic_set(type *addr, type 
  v)?
 
 Well, in that case, we'd need to change the macro. Currently,
 _uatomic_set maps directly to:
 
 #define _uatomic_set(addr, v)   CMM_STORE_SHARED(*(addr), (v))
 
 and CMM_STORE_SHARED returns v. The question becomes: should 
 we change
 _uatomic_set or CMM_STORE_SHARED so they don't return 
 anything, or
 document that they return something ?
 
 One thing I noticed is that linters often complain that the 
 return value
 of CMM_STORE_SHARED is never used. One thing we could look 
 into is to
 try some gcc attributes and/or linter annotations to flag 
 this return
 value as possibly unused. Thoughts ?

Hmmm...

Does the following work?

#define _uatomic_set(addr, v)   
((void)CMM_STORE_SHARED(*(addr), (v)))
   
   Well, it would work, yes, but then we would not be consistent 
   between
   return values or no return values of:
   
   uatomic_set()
   rcu_assign_pointer()
   rcu_set_pointer()
   
   if you notice, in the Linux kernel, rcu_assign_pointer returns the
   new pointer value. But you are right that atomic_set() does not 
   return
   anything. So which consistency would be best to keep ?
  
  Hmmm...  I wonder how many people actually use 
  rcu_assign_pointer()'s
  return value?  If no one, I should make it a do-while(0).  Although
  cscope does not show anything, I should probably put

Re: [lttng-dev] [RFC PATCH urcu] Document uatomic operations

2012-05-16 Thread Paul E. McKenney
On Tue, May 15, 2012 at 08:10:03AM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Mon, May 14, 2012 at 10:39:01PM -0400, Mathieu Desnoyers wrote:
   Document each atomic operation provided by urcu/uatomic.h, along with
   their memory barrier guarantees.
  
  Great to see the documentation!!!  Some comments below.
  
  Thanx, Paul
  
   Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   ---
   diff --git a/doc/Makefile.am b/doc/Makefile.am
   index bec1d7c..db9811c 100644
   --- a/doc/Makefile.am
   +++ b/doc/Makefile.am
   @@ -1 +1 @@
   -dist_doc_DATA = rcu-api.txt
   +dist_doc_DATA = rcu-api.txt uatomic-api.txt
   diff --git a/doc/uatomic-api.txt b/doc/uatomic-api.txt
   new file mode 100644
   index 000..3605acf
   --- /dev/null
   +++ b/doc/uatomic-api.txt
   @@ -0,0 +1,80 @@
   +Userspace RCU Atomic Operations API
   +by Mathieu Desnoyers and Paul E. McKenney
   +
   +
   +This document describes the urcu/uatomic.h API. Those are the atomic
   +operations provided by the Userspace RCU library. The general rule
   +regarding memory barriers is that only uatomic_xchg(),
   +uatomic_cmpxchg(), uatomic_add_return(), and uatomic_sub_return() imply
   +full memory barriers before and after the atomic operation. Other
   +primitives don't guarantee any memory barrier.
   +
   +Only atomic operations performed on integers (int and long, signed
   +and unsigned) are supported on all architectures. Some architectures
   +also support 1-byte and 2-byte atomic operations. Those respectively
   +have UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT defined when
   +uatomic.h is included. An architecture trying to perform an atomic write
   +to a type size not supported by the architecture will trigger an illegal
   +instruction.
   +
   +In the description below, type is a type that can be atomically
   +written to by the architecture. It needs to be at most word-sized, and
    +its alignment needs to be greater than or equal to its size.
   +
   +type uatomic_set(type *addr, type v)
   +
   + Atomically write @v into @addr.
  
  Wouldn't this instead be void uatomic_set(type *addr, type v)?
 
 Well, in that case, we'd need to change the macro. Currently,
 _uatomic_set maps directly to:
 
 #define _uatomic_set(addr, v)   CMM_STORE_SHARED(*(addr), (v))
 
 and CMM_STORE_SHARED returns v. The question becomes: should we change
 _uatomic_set or CMM_STORE_SHARED so they don't return anything, or
 document that they return something ?
 
 One thing I noticed is that linters often complain that the return value
 of CMM_STORE_SHARED is never used. One thing we could look into is to
 try some gcc attributes and/or linter annotations to flag this return
 value as possibly unused. Thoughts ?

Hmmm...

Does the following work?

#define _uatomic_set(addr, v)   ((void)CMM_STORE_SHARED(*(addr), (v)))

  By Atomically write @v into @addr, what is meant is that no concurrent
  operation that reads from addr will see partial effects of uatomic_set(),
  correct?  In other words, the concurrent read will either see v or
  the old value, not a mush of the two.
 
 yep. I added that clarification.
 
  
   +
   +type uatomic_read(type *addr)
   +
   + Atomically read @v from @addr.
  
  Similar comments on the meaning of atomically.  This may sound picky,
  but people coming from an x86 environment might otherwise assume that
  there is lock prefix involved...
 
 same.
 
  
   +
   +type uatomic_cmpxchg(type *addr, type old, type new)
   +
   + Atomically check if @addr contains @old. If true, then replace
   + the content of @addr by @new. Return the value previously
   + contained by @addr. This function imply a full memory barrier
   + before and after the atomic operation.
  
  Suggest then atomically replace or some such.  It might not hurt
  to add that this is an atomic read-modify-write operation.
 
 Updated to:
 
 type uatomic_cmpxchg(type *addr, type old, type new)
 
 	An atomic read-modify-write operation that performs this 
 	sequence of operations atomically: check if @addr contains @old.
 	If true, then replace the content of @addr by @new. Return the
 	value previously contained by @addr. This function implies a full
 	memory barrier before and after the atomic operation.
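 
 As a usage illustration (a sketch, not part of the patch), the classic
 compare-and-swap retry loop that this full-barrier guarantee supports:
 
	#include <urcu/uatomic.h>

	static unsigned long counter;

	/* Atomically increment counter, saturating at limit. */
	static unsigned long add_bounded(unsigned long limit)
	{
		unsigned long old, ret;

		do {
			old = uatomic_read(&counter);
			if (old >= limit)
				return old;	/* saturate instead of overflowing */
			ret = uatomic_cmpxchg(&counter, old, old + 1);
		} while (ret != old);	/* lost the race: retry with fresh value */
		return old + 1;
	}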
 
  
  Similar comments on the other value-returning atomics.
 
 Will do something similar.
 
  
   +
   +type uatomic_xchg(type *addr, type new)
   +
   + Atomically replace the content of @addr by @new, and return the
   + value previously contained by @addr. This function imply a full
   + memory barrier before and after the atomic operation.
   +
   +type uatomic_add_return(type *addr, type v)
   +type uatomic_sub_return(type *addr, type v)
   +
   + Atomically increment/decrement the content of @addr by @v, and
   + return the resulting value. This function imply a full memory
   + barrier before and after

Re: [lttng-dev] [RFC PATCH urcu] Implement urcu/tls-compat.h

2012-05-16 Thread Paul E. McKenney
On Wed, May 16, 2012 at 09:56:58AM -0400, Mathieu Desnoyers wrote:
 Suggested-by: Marek Vavruša marek.vavr...@nic.cz
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com

Interesting!  Jeffrey Yasskin of Google was suggesting use of
pthread_get_specific() over __thread even where __thread was available.
His justification is that pthread_get_specific() allows destructors.
Not sure I agree with him in the case of userspace RCU due to performance
issues, but thought I would pass his advice along.
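
As a usage sketch (derived from the macros in the patch below), the same
source compiles whether the platform provides a compiler TLS keyword or
only pthread keys:

	#include <urcu/tls-compat.h>

	DECLARE_URCU_TLS(int, my_counter);	/* in a header */
	DEFINE_URCU_TLS(int, my_counter);	/* in exactly one .c file */

	static void bump(void)
	{
		/*
		 * Expands to a plain __thread variable access, or to
		 * (*__tls_access_my_counter()) with the pthread_key fallback.
		 */
		URCU_TLS(my_counter)++;
	}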

Thanx, Paul

 ---
 diff --git a/Makefile.am b/Makefile.am
 index 0a369fd..6263057 100644
 --- a/Makefile.am
 +++ b/Makefile.am
 @@ -18,7 +18,8 @@ nobase_dist_include_HEADERS = urcu/compiler.h urcu/hlist.h 
 urcu/list.h \
   urcu/ref.h urcu/cds.h urcu/urcu_ref.h urcu/urcu-futex.h \
   urcu/uatomic_arch.h urcu/rculfhash.h \
   $(top_srcdir)/urcu/map/*.h \
 - $(top_srcdir)/urcu/static/*.h
 + $(top_srcdir)/urcu/static/*.h \
 + urcu/tls-compat.h
  nobase_nodist_include_HEADERS = urcu/arch.h urcu/uatomic.h urcu/config.h
  
  EXTRA_DIST = $(top_srcdir)/urcu/arch/*.h $(top_srcdir)/urcu/uatomic/*.h \
 diff --git a/urcu/tls-compat.h b/urcu/tls-compat.h
 new file mode 100644
 index 000..d7c7537
 --- /dev/null
 +++ b/urcu/tls-compat.h
 @@ -0,0 +1,99 @@
 +#ifndef _URCU_TLS_COMPAT_H
 +#define _URCU_TLS_COMPAT_H
 +
 +/*
 + * urcu/tls-compat.h
 + *
 + * Userspace RCU library - Thread-Local Storage Compatibility Header
 + *
 + * Copyright 2012 - Mathieu Desnoyers mathieu.desnoy...@efficios.com
 + *
 + * This library is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU Lesser General Public
 + * License as published by the Free Software Foundation; either
 + * version 2.1 of the License, or (at your option) any later version.
 + *
 + * This library is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * Lesser General Public License for more details.
 + *
 + * You should have received a copy of the GNU Lesser General Public
 + * License along with this library; if not, write to the Free Software
 + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 
 USA
 + */
 +
 +#include <stdlib.h>
 +#include <urcu/config.h>
 +#include <urcu/compiler.h>
 +#include <urcu/arch.h>
 +
 +#ifdef __cplusplus
 +extern "C" {
 +#endif
 +
 +#ifdef CONFIG_RCU_TLS	/* Based on ax_tls.m4 */
 +
 +# define DECLARE_URCU_TLS(type, name)\
 + CONFIG_RCU_TLS type __tls_ ## name
 +
 +# define DEFINE_URCU_TLS(type, name) \
 + CONFIG_RCU_TLS type __tls_ ## name
 +
 +# define URCU_TLS(name)  (__tls_ ## name)
 +
 +#else /* #ifndef CONFIG_RCU_TLS */
 +
 +# include <pthread.h>
 +
 +struct urcu_tls {
 + pthread_key_t key;
 + pthread_mutex_t init_mutex;
 + int init_done;
 +};
 +
 +# define DECLARE_URCU_TLS(type, name)\
 + type *__tls_access_ ## name(void)
 +
 +/*
 + * Note: we don't free memory at process exit, since it will be dealt
 + * with by the OS.
 + */
 +# define DEFINE_URCU_TLS(type, name) \
 + type *__tls_access_ ## name(void)   \
 + {   \
 + static struct urcu_tls __tls_ ## name = {   \
 + .init_mutex = PTHREAD_MUTEX_INITIALIZER,\
 + .init_done = 0, \
 + };  \
 + void *__tls_p;  \
 +	if (!__tls_ ## name.init_done) {		\
 +		/* Mutex to protect concurrent init */	\
 +		pthread_mutex_lock(&__tls_ ## name.init_mutex); \
 +		if (!__tls_ ## name.init_done) {	\
 +			(void) pthread_key_create(&__tls_ ## name.key, \
 +				free);			\
 +			cmm_smp_wmb();	/* create key before write init_done */ \
 +			__tls_ ## name.init_done = 1;	\
 +		}					\
 +		pthread_mutex_unlock(&__tls_ ## name.init_mutex); \
 +	}						\
 + cmm_smp_rmb();  /* read init_done before getting key */ \
 + __tls_p = pthread_getspecific(__tls_ ## name.key); \
 + if (caa_unlikely(__tls_p == NULL)) {\
 + __tls_p = calloc(1, sizeof(type));  \
 + (void) pthread_setspecific(__tls_ ## name.key,  \
 + __tls_p);   \
 + }   

Re: [lttng-dev] [rp] [RFC PATCH urcu] Document uatomic operations

2012-05-16 Thread Paul E. McKenney
On Wed, May 16, 2012 at 02:17:42PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Tue, May 15, 2012 at 08:10:03AM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Mon, May 14, 2012 at 10:39:01PM -0400, Mathieu Desnoyers wrote:
 Document each atomic operation provided by urcu/uatomic.h, along with
 their memory barrier guarantees.

Great to see the documentation!!!  Some comments below.

Thanx, Paul

 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 ---
 diff --git a/doc/Makefile.am b/doc/Makefile.am
 index bec1d7c..db9811c 100644
 --- a/doc/Makefile.am
 +++ b/doc/Makefile.am
 @@ -1 +1 @@
 -dist_doc_DATA = rcu-api.txt
 +dist_doc_DATA = rcu-api.txt uatomic-api.txt
 diff --git a/doc/uatomic-api.txt b/doc/uatomic-api.txt
 new file mode 100644
 index 000..3605acf
 --- /dev/null
 +++ b/doc/uatomic-api.txt
 @@ -0,0 +1,80 @@
 +Userspace RCU Atomic Operations API
 +by Mathieu Desnoyers and Paul E. McKenney
 +
 +
 +This document describes the urcu/uatomic.h API. Those are the 
 atomic
 +operations provided by the Userspace RCU library. The general rule
 +regarding memory barriers is that only uatomic_xchg(),
 +uatomic_cmpxchg(), uatomic_add_return(), and uatomic_sub_return() 
 imply
 +full memory barriers before and after the atomic operation. Other
 +primitives don't guarantee any memory barrier.
 +
 +Only atomic operations performed on integers (int and long, 
 signed
 +and unsigned) are supported on all architectures. Some architectures
 +also support 1-byte and 2-byte atomic operations. Those respectively
 +have UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT defined 
 when
 +uatomic.h is included. An architecture trying to perform an atomic 
 write
 +to a type size not supported by the architecture will trigger an 
 illegal
 +instruction.
 +
 +In the description below, type is a type that can be atomically
 +written to by the architecture. It needs to be at most word-sized, 
 and
 +its alignment needs to greater or equal to its size.
 +
 +type uatomic_set(type *addr, type v)
 +
 + Atomically write @v into @addr.

Wouldn't this instead be void uatomic_set(type *addr, type v)?
   
   Well, in that case, we'd need to change the macro. Currently,
   _uatomic_set maps directly to:
   
   #define _uatomic_set(addr, v)   CMM_STORE_SHARED(*(addr), (v))
   
   and CMM_STORE_SHARED returns v. The question becomes: should we change
   _uatomic_set or CMM_STORE_SHARED so they don't return anything, or
   document that they return something ?
   
   One thing I noticed is that linters often complain that the return value
   of CMM_STORE_SHARED is never used. One thing we could look into is to
   try some gcc attributes and/or linter annotations to flag this return
   value as possibly unused. Thoughts ?
  
  Hmmm...
  
  Does the following work?
  
  #define _uatomic_set(addr, v)   ((void)CMM_STORE_SHARED(*(addr), (v)))
 
 Well, it would work, yes, but then we would not be consistent between
 return values or no return values of:
 
 uatomic_set()
 rcu_assign_pointer()
 rcu_set_pointer()
 
 if you notice, in the Linux kernel, rcu_assign_pointer returns the
 new pointer value. But you are right that atomic_set() does not return
 anything. So which consistency would be best to keep ?

Hmmm...  I wonder how many people actually use rcu_assign_pointer()'s
return value?  If no one, I should make it a do-while(0).  Although
cscope does not show anything, I should probably put together a
coccinelle script.

Thanx, Paul

 Thanks,
 
 Mathieu
 
 
  
By Atomically write @v into @addr, what is meant is that no concurrent
operation that reads from addr will see partial effects of 
uatomic_set(),
correct?  In other words, the concurrent read will either see v or
the old value, not a mush of the two.
   
   yep. I added that clarification.
   

 +
 +type uatomic_read(type *addr)
 +
 + Atomically read @v from @addr.

Similar comments on the meaning of atomically.  This may sound picky,
but people coming from an x86 environment might otherwise assume that
there is lock prefix involved...
   
   same.
   

 +
 +type uatomic_cmpxchg(type *addr, type old, type new)
 +
 + Atomically check if @addr contains @old. If true, then replace
 + the content of @addr by @new. Return the value previously
 + contained by @addr. This function imply a full memory barrier
 + before and after the atomic operation.

Suggest then atomically replace or some such.  It might not hurt
to add

Re: [lttng-dev] [rp] [RFC PATCH urcu] Document uatomic operations

2012-05-16 Thread Paul E. McKenney
On Wed, May 16, 2012 at 11:45:39AM -0700, Josh Triplett wrote:
 On Wed, May 16, 2012 at 11:32:38AM -0700, Paul E. McKenney wrote:
  On Wed, May 16, 2012 at 02:17:42PM -0400, Mathieu Desnoyers wrote:
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Tue, May 15, 2012 at 08:10:03AM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Mon, May 14, 2012 at 10:39:01PM -0400, Mathieu Desnoyers wrote:
   Document each atomic operation provided by urcu/uatomic.h, along 
   with
   their memory barrier guarantees.
  
  Great to see the documentation!!!  Some comments below.
  
  Thanx, Paul
  
   Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   ---
   diff --git a/doc/Makefile.am b/doc/Makefile.am
   index bec1d7c..db9811c 100644
   --- a/doc/Makefile.am
   +++ b/doc/Makefile.am
   @@ -1 +1 @@
   -dist_doc_DATA = rcu-api.txt
   +dist_doc_DATA = rcu-api.txt uatomic-api.txt
   diff --git a/doc/uatomic-api.txt b/doc/uatomic-api.txt
   new file mode 100644
   index 000..3605acf
   --- /dev/null
   +++ b/doc/uatomic-api.txt
   @@ -0,0 +1,80 @@
   +Userspace RCU Atomic Operations API
   +by Mathieu Desnoyers and Paul E. McKenney
   +
   +
   +This document describes the urcu/uatomic.h API. Those are the 
   atomic
   +operations provided by the Userspace RCU library. The general 
   rule
   +regarding memory barriers is that only uatomic_xchg(),
   +uatomic_cmpxchg(), uatomic_add_return(), and 
   uatomic_sub_return() imply
   +full memory barriers before and after the atomic operation. Other
   +primitives don't guarantee any memory barrier.
   +
   +Only atomic operations performed on integers (int and long, 
   signed
   +and unsigned) are supported on all architectures. Some 
   architectures
   +also support 1-byte and 2-byte atomic operations. Those 
   respectively
   +have UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT 
   defined when
   +uatomic.h is included. An architecture trying to perform an 
   atomic write
   +to a type size not supported by the architecture will trigger an 
   illegal
   +instruction.
   +
   +In the description below, type is a type that can be atomically
   +written to by the architecture. It needs to be at most 
   word-sized, and
   +its alignment needs to greater or equal to its size.
   +
   +type uatomic_set(type *addr, type v)
   +
   + Atomically write @v into @addr.
  
  Wouldn't this instead be void uatomic_set(type *addr, type v)?
 
 Well, in that case, we'd need to change the macro. Currently,
 _uatomic_set maps directly to:
 
 #define _uatomic_set(addr, v)   CMM_STORE_SHARED(*(addr), (v))
 
 and CMM_STORE_SHARED returns v. The question becomes: should we change
 _uatomic_set or CMM_STORE_SHARED so they don't return anything, or
 document that they return something ?
 
 One thing I noticed is that linters often complain that the return 
 value
 of CMM_STORE_SHARED is never used. One thing we could look into is to
 try some gcc attributes and/or linter annotations to flag this return
 value as possibly unused. Thoughts ?

Hmmm...

Does the following work?

#define _uatomic_set(addr, v)   ((void)CMM_STORE_SHARED(*(addr), (v)))
   
   Well, it would work, yes, but then we would not be consistent between
   return values or no return values of:
   
   uatomic_set()
   rcu_assign_pointer()
   rcu_set_pointer()
   
   if you notice, in the Linux kernel, rcu_assign_pointer returns the
   new pointer value. But you are right that atomic_set() does not return
   anything. So which consistency would be best to keep ?
  
  Hmmm...  I wonder how many people actually use rcu_assign_pointer()'s
  return value?  If no one, I should make it a do-while(0).  Although
  cscope does not show anything, I should probably put together a
  coccinelle script.
 
 I just searched the entire Linux kernel with git grep
 '\S.*rcu_assign_pointer' (any non-whitespace preceding
 rcu_assign_pointer), and found no instances of anything assuming a
 return value from rcu_assign_pointer.  I'd recommend making it void.

Woo-hoo!!!  Thank you both!!!

Thanx, Paul




Re: [lttng-dev] [rp] [RFC PATCH urcu] document concurrent data structures

2012-05-14 Thread Paul E. McKenney
On Mon, May 14, 2012 at 11:36:04PM -0400, Mathieu Desnoyers wrote:
 Document the concurrent data structures provided by the userspace RCU
 library.

Looks good to me!

Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com

 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 ---
 diff --git a/doc/cds-api.txt b/doc/cds-api.txt
 new file mode 100644
 index 000..7a3c6e0
 --- /dev/null
 +++ b/doc/cds-api.txt
 @@ -0,0 +1,59 @@
 +Userspace RCU Concurrent Data Structures (CDS) API
 +by Mathieu Desnoyers and Paul E. McKenney
 +
 +
 +This document describes briefly the data structures contained with the
 +userspace RCU library.
 +
 +urcu/list.h:
 +
 + Doubly-linked list, which requires mutual exclusion on updates
 + and reads.
 +
 +urcu/rculist.h:
 +
 + Doubly-linked list, which requires mutual exclusion on updates,
 + allows RCU read traversals.
 + 
 +urcu/hlist.h:
 +
 + Doubly-linked list, with single pointer list head. Requires
 + mutual exclusion on updates and reads. Useful for implementing
 + hash tables. Downside over list.h: lookup of tail in O(n).
 +
 +urcu/rcuhlist.h:
 +
 + Doubly-linked list, with single pointer list head. Requires
 + mutual exclusion on updates, allows RCU read traversals. Useful
 + for implementing hash tables. Downside over rculist.h: lookup of
 + tail in O(n).
 +
 +urcu/rculfqueue.h:
 +
 	RCU queue with lock-free enqueue, lock-free dequeue. RCU used to
 	provide existence guarantees.
 +
 +urcu/wfqueue.h:
 +
 + Queue with wait-free enqueue, blocking dequeue. This queue does
 + _not_ use RCU.
 +
 +urcu/rculfstack.h:
 +
 	RCU stack with lock-free push, lock-free dequeue. RCU used to
 	provide existence guarantees.
 +
 +urcu/wfstack.h:
 +
 + Stack with wait-free enqueue, blocking dequeue. This stack does
 + _not_ use RCU.
 +
 +urcu/rculfhash.h:
 +
 	Lock-Free Resizable RCU Hash Table. RCU used to provide
 	existence guarantees. Provides scalable updates, and scalable
 	RCU read-side lookups and traversals. Unique and duplicate keys
 + are supported. Provides uniquify add and replace add
 + operations, along with associated read-side traversal uniqueness
 + guarantees. Automatic hash table resize based on number of
 + elements is supported. See the API for more details.
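 
 As a rough usage sketch for the hash table (simplified; urcu/rculfhash.h
 is authoritative, and the key type, hash value and helper names here are
 illustrative assumptions):
 
	#include <stdlib.h>
	#include <urcu.h>
	#include <urcu/rculfhash.h>

	struct entry {
		int key;
		struct cds_lfht_node node;
		struct rcu_head head;
	};

	static void free_entry(struct rcu_head *head)
	{
		free(caa_container_of(head, struct entry, head));
	}

	static int match(struct cds_lfht_node *node, const void *key)
	{
		struct entry *e = caa_container_of(node, struct entry, node);

		return e->key == *(const int *) key;
	}

	static void lookup_and_del(struct cds_lfht *ht, int key, unsigned long hash)
	{
		struct cds_lfht_iter iter;
		struct cds_lfht_node *node;

		rcu_read_lock();
		cds_lfht_lookup(ht, hash, match, &key, &iter);
		node = cds_lfht_iter_get_node(&iter);
		if (node && !cds_lfht_del(ht, node)) {
			struct entry *e = caa_container_of(node, struct entry, node);

			call_rcu(&e->head, free_entry);	/* deferred free */
		}
		rcu_read_unlock();
	}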
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 
 




Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-08 Thread Paul E. McKenney
On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote:
 
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
 [...]
  Just to make sure I understand -- the reason that the del functions
  say no memory barrier instead of acts like rcu_dereference() is
  that the del functions don't return anything.
 
 [...]
   @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct 
   cds_lfht_node *node);
 * function.
 * Call with rcu_read_lock held.
 * Threads calling this API need to be registered RCU read-side threads.
   + * This function does not issue any memory barrier.
 */
 
 One more question about the del memory ordering semantic. Following
 commit
 
 commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
 Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 Date:   Mon Dec 19 16:45:51 2011 -0500
 
 rculfhash: Relax atomicity guarantees required by removal operation
 
 The atomicity guarantees for the removal operation do not need to be as
 strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
 introduced REMOVAL_OWNER_FLAG to perform removal.
 
 Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 
 
 The del operation is performed in two steps:
 
 1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
 unconditionally into the node's next pointer.
 2 - uatomic_xchg(), which atomically exchanges the old pointer with
 its current value (read) or'd with the REMOVAL_OWNER flag. The trick
 is that if the xchg returns a pointer with the REMOVAL_OWNER flag
 set, it means we are not the first thread to set this flag, so we
 should not free the node. However, if xchg returns a node without
 the REMOVAL_OWNER flag set, we are indeed the first to set it, so
 we should call free.
 
 Now regarding memory ordering semantics, should we consider the atomic
 action of del to apply when the or is called, or when the xchg is
 called ? Or should we simply document that the del effect on the node
 happens in two separate steps ?
 
 The way I see it, the actual effect of removal, as seen from RCU read
 traversal and lookup point of view, is observed as soon as the REMOVED
 tombstone is set, so I would think that the atomic publication of the
 removal is performed by the or.
 
 However, we ensure full memory barriers around xchg, but not usually 
 around or. Therefore, the current implementation does not issue a 
 memory barrier before the or, so we should either change our planned
 memory barrier documentation, or the implementation, to match. This
 would probably require creation of a cmm_smp_mb__before_uatomic_or(), so
 x86 does not end up issuing a useless memory barrier.
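 
 Schematically, the removal path then looks like this (a sketch with
 illustrative flag bits and a made-up wrapper type;
 cmm_smp_mb__before_uatomic_or() is the proposed primitive, not an
 existing one):
 
	#include <urcu/uatomic.h>

	#define REMOVED_FLAG		(1UL << 0)	/* illustrative bit values */
	#define REMOVAL_OWNER_FLAG	(1UL << 1)

	struct node_wrapper {
		unsigned long next;	/* pointer bits + low-order flags, as in rculfhash */
	};

	/* Returns nonzero if the caller became the removal owner (and must free). */
	static int del_node(struct node_wrapper *node)
	{
		unsigned long old;

		cmm_smp_mb__before_uatomic_or();	/* proposed: no-op on x86 */
		uatomic_or(&node->next, REMOVED_FLAG);	/* step 1: publish tombstone */
		old = uatomic_xchg(&node->next,
			uatomic_read(&node->next) | REMOVAL_OWNER_FLAG);
		return !(old & REMOVAL_OWNER_FLAG);	/* step 2: first setter frees */
	}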

My kneejerk reaction is that the or is really doing the deletion.
Readers and other updaters care about the deletion, not about which CPU
is going to do the free.

Or did I misunderstand how this works?

Thanx, Paul

 Thoughts ?
 
 Thanks,
 
 Mathieu
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 
 




Re: [lttng-dev] [rp] [RFC] Readiness for URCU release with RCU lock-free hash table

2012-05-08 Thread Paul E. McKenney
On Tue, May 08, 2012 at 02:48:27PM -0400, Mathieu Desnoyers wrote:
 * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
  On Mon, May 07, 2012 at 12:10:55PM -0400, Mathieu Desnoyers wrote:
   
   * Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
On Fri, May 04, 2012 at 12:53:12PM -0400, Mathieu Desnoyers wrote:
   [...]
Just to make sure I understand -- the reason that the del functions
say no memory barrier instead of acts like rcu_dereference() is
that the del functions don't return anything.
   
   [...]
 @@ -391,6 +413,7 @@ int cds_lfht_del(struct cds_lfht *ht, struct cds_lfht_node *node);
   * function.
   * Call with rcu_read_lock held.
   * Threads calling this API need to be registered RCU read-side threads.
 + * This function does not issue any memory barrier.
   */
   
   One more question about the del memory ordering semantic. Following
   commit
   
   commit db00ccc36e7fb04ce8044fb1be7964acd1de6ae0
   Author: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   Date:   Mon Dec 19 16:45:51 2011 -0500
   
   rculfhash: Relax atomicity guarantees required by removal operation
   
   The atomicity guarantees for the removal operation do not need to be 
   as
   strict as a cmpxchg. Use a uatomic_set followed by a xchg on a newly
   introduced REMOVAL_OWNER_FLAG to perform removal.
   
   Signed-off-by: Mathieu Desnoyers mathieu.desnoy...@efficios.com
   
   
   The del operation is performed in two steps:
   
   1 - uatomic_or(), which sets the REMOVED flag (the actual tombstone)
   unconditionally into the node's next pointer.
   2 - uatomic_xchg(), which atomically exchanges the node's next pointer
   with its just-read value or'd with the REMOVAL_OWNER flag. The trick
   is that if the xchg returns a pointer with the REMOVAL_OWNER flag
   already set, it means we are not the first thread to set this flag, so
   we should not free the node. However, if the xchg returns a value
   without the REMOVAL_OWNER flag set, we are indeed the first to set it,
   so we should call free.
   
   Now regarding memory ordering semantics, should we consider the atomic
   action of del to apply when the or is called, or when the xchg is
   called ? Or should we simply document that the del effect on the node
   happens in two separate steps ?
   
   The way I see it, the actual effect of removal, as seen from RCU read
   traversal and lookup point of view, is observed as soon as the REMOVED
   tombstone is set, so I would think that the atomic publication of the
   removal is performed by the or.
   
   However, we ensure full memory barriers around xchg, but not usually 
   around or. Therefore, the current implementation does not issue a 
   memory barrier before the or, so we should either change our planned
   memory barrier documentation, or the implementation, to match. This
   would probably require creating a cmm_smp_mb__before_uatomic_or(), so
   that x86 does not end up issuing a useless memory barrier.
  
  My kneejerk reaction is that the or is really doing the deletion.
  Readers and other updaters care about the deletion, not about which CPU
  is going to do the free.
  
  Or did I misunderstand how this works?
 
 You got it right, this is how I see it too.
 
 However, in order to provide a full memory barrier before the or, we'd
 need to add a cmm_smp_mb() before the or (I don't think we want to
 presume that our or operation issues full memory barriers on all
 architectures).
 
 However, on x86, the "lock; or" does issue a full memory barrier. So I
 think we should introduce a macro that can translate into a memory
 barrier on architectures that need it, and to nothing on x86.
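 
 As a sketch, such a macro could look like the following (hypothetical
 naming and architecture test, not the actual liburcu implementation;
 the real definition would live in the per-architecture uatomic
 headers):
 
 #if defined(__i386__) || defined(__x86_64__)
 /* On x86, "lock; or" already implies a full barrier, so a compiler
  * barrier suffices. */
 #define cmm_smp_mb__before_uatomic_or()   cmm_barrier()
 #else
 #define cmm_smp_mb__before_uatomic_or()   cmm_smp_mb()
 #endif
 
 /* Usage in the removal path: */
 cmm_smp_mb__before_uatomic_or();
 uatomic_or(&node->next, REMOVED_FLAG);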
 
 Thoughts ?

Makes sense to me!

Thanx, Paul

 Thanks,
 
 Mathieu
 
  
  Thanx, Paul
  
   Thoughts ?
   
   Thanks,
   
   Mathieu
   
   -- 
   Mathieu Desnoyers
   Operating System Efficiency R&D Consultant
   EfficiOS Inc.
   http://www.efficios.com
   
   ___
   rp mailing list
   r...@svcs.cs.pdx.edu
   http://svcs.cs.pdx.edu/mailman/listinfo/rp
   
  
  
  ___
  lttng-dev mailing list
  lttng-dev@lists.lttng.org
  http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
 
 -- 
 Mathieu Desnoyers
 Operating System Efficiency R&D Consultant
 EfficiOS Inc.
 http://www.efficios.com
 


___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

