from:"Mathieu Desnoyers"

- On May 3, 2018, at 12:22 PM, Daniel Colascione dan...@google.com wrote:

> On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers <
> mathieu.desnoy...@efficios.com> wrote:
>> By the way, if we eventually find a way to enhance user-space mutexes in
> the
>> fashion you describe here, it would belong to another TLS area, and would
>> be registered by another system call than rseq. I proposed a more generic
>> "TLS area registration" system call a few years ago, but Linus told me he
>> wanted a system call that was specific to rseq. If we need to implement
>> other use-cases in a TLS area shared between kernel and user-space in a
>> similar fashion, the plan is to do it in a distinct system call.
> 
> If we proliferate TLS areas; we'd have to register each one upon thread
> creation, adding to the overall thread creation path. There's already a
> provision for versioning the TLS area. What's the benefit of splitting the
> registration over multiple system calls?

See the original discussion thread at

https://lkml.org/lkml/2016/4/7/502

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

- On May 3, 2018, at 12:22 PM, Daniel Colascione dan...@google.com wrote:

> On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers <
> mathieu.desnoy...@efficios.com> wrote:
>> By the way, if we eventually find a way to enhance user-space mutexes in
> the
>> fashion you describe here, it would belong to another TLS area, and would
>> be registered by another system call than rseq. I proposed a more generic
>> "TLS area registration" system call a few years ago, but Linus told me he
>> wanted a system call that was specific to rseq. If we need to implement
>> other use-cases in a TLS area shared between kernel and user-space in a
>> similar fashion, the plan is to do it in a distinct system call.
> 
> If we proliferate TLS areas; we'd have to register each one upon thread
> creation, adding to the overall thread creation path. There's already a
> provision for versioning the TLS area. What's the benefit of splitting the
> registration over multiple system calls?

See the original discussion thread at

https://lkml.org/lkml/2016/4/7/502

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

- On May 2, 2018, at 12:07 PM, Daniel Colascione dan...@google.com wrote:

> On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <
> mathieu.desnoy...@efficios.com> wrote:
> 
>> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com
> wrote:
>> [...]
>> >
>> > I think a small enhancement to rseq would let us build a perfect
> userspace
>> > mutex, one that spins on lock-acquire only when the lock owner is
> running
>> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc
>> > spin counts and from trying to detect situations in which spinning is
>> > generally pointless.
>> >
>> > It'd work like this: in the per-thread rseq data structure, we'd
> include a
>> > description of a futex operation for the kernel would perform (in the
>> > context of the preempted thread) upon preemption, immediately before
>> > schedule(). If the futex operation itself sleeps, that's no problem: we
>> > will have still accomplished our goal of running some other thread
> instead
>> > of the preempted thread.
> 
>> Hi Daniel,
> 
>> I agree that the problem you are aiming to solve is important. Let's see
>> what prevents the proposed rseq implementation from doing what you
> envision.
> 
>> The main issue here is touching userspace immediately before schedule().
>> At that specific point, it's not possible to take a page fault. In the
> proposed
>> rseq implementation, we get away with it by raising a task struct flag,
> and using
>> it in a return to userspace notifier (where we can actually take a
> fault), where
>> we touch the userspace TLS area.
> 
>> If we can find a way to solve this limitation, then the rest of your
> design
>> makes sense to me.
> 
> Thanks for taking a look!
> 
> Why couldn't we take a page fault just before schedule? The reason we can't
> take a page fault in atomic context is that doing so might call schedule.
> Here, we're about to call schedule _anyway_, so what harm does it do to
> call something that might call schedule? If we schedule via that call, we
> can skip the manual schedule we were going to perform.

By the way, if we eventually find a way to enhance user-space mutexes in the
fashion you describe here, it would belong to another TLS area, and would
be registered by another system call than rseq. I proposed a more generic
"TLS area registration" system call a few years ago, but Linus told me he
wanted a system call that was specific to rseq. If we need to implement
other use-cases in a TLS area shared between kernel and user-space in a
similar fashion, the plan is to do it in a distinct system call.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

- On May 2, 2018, at 12:07 PM, Daniel Colascione dan...@google.com wrote:

> On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <
> mathieu.desnoy...@efficios.com> wrote:
> 
>> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com
> wrote:
>> [...]
>> >
>> > I think a small enhancement to rseq would let us build a perfect
> userspace
>> > mutex, one that spins on lock-acquire only when the lock owner is
> running
>> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc
>> > spin counts and from trying to detect situations in which spinning is
>> > generally pointless.
>> >
>> > It'd work like this: in the per-thread rseq data structure, we'd
> include a
>> > description of a futex operation for the kernel would perform (in the
>> > context of the preempted thread) upon preemption, immediately before
>> > schedule(). If the futex operation itself sleeps, that's no problem: we
>> > will have still accomplished our goal of running some other thread
> instead
>> > of the preempted thread.
> 
>> Hi Daniel,
> 
>> I agree that the problem you are aiming to solve is important. Let's see
>> what prevents the proposed rseq implementation from doing what you
> envision.
> 
>> The main issue here is touching userspace immediately before schedule().
>> At that specific point, it's not possible to take a page fault. In the
> proposed
>> rseq implementation, we get away with it by raising a task struct flag,
> and using
>> it in a return to userspace notifier (where we can actually take a
> fault), where
>> we touch the userspace TLS area.
> 
>> If we can find a way to solve this limitation, then the rest of your
> design
>> makes sense to me.
> 
> Thanks for taking a look!
> 
> Why couldn't we take a page fault just before schedule? The reason we can't
> take a page fault in atomic context is that doing so might call schedule.
> Here, we're about to call schedule _anyway_, so what harm does it do to
> call something that might call schedule? If we schedule via that call, we
> can skip the manual schedule we were going to perform.

By the way, if we eventually find a way to enhance user-space mutexes in the
fashion you describe here, it would belong to another TLS area, and would
be registered by another system call than rseq. I proposed a more generic
"TLS area registration" system call a few years ago, but Linus told me he
wanted a system call that was specific to rseq. If we need to implement
other use-cases in a TLS area shared between kernel and user-space in a
similar fashion, the plan is to do it in a distinct system call.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Mathieu Desnoyers

- On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com wrote:
[...]
> 
> I think a small enhancement to rseq would let us build a perfect userspace
> mutex, one that spins on lock-acquire only when the lock owner is running
> and that sleeps otherwise, freeing userspace from both specifying ad-hoc
> spin counts and from trying to detect situations in which spinning is
> generally pointless.
> 
> It'd work like this: in the per-thread rseq data structure, we'd include a
> description of a futex operation for the kernel would perform (in the
> context of the preempted thread) upon preemption, immediately before
> schedule(). If the futex operation itself sleeps, that's no problem: we
> will have still accomplished our goal of running some other thread instead
> of the preempted thread.

Hi Daniel,

I agree that the problem you are aiming to solve is important. Let's see
what prevents the proposed rseq implementation from doing what you envision.

The main issue here is touching userspace immediately before schedule().
At that specific point, it's not possible to take a page fault. In the proposed
rseq implementation, we get away with it by raising a task struct flag, and 
using
it in a return to userspace notifier (where we can actually take a fault), where
we touch the userspace TLS area.

If we can find a way to solve this limitation, then the rest of your design
makes sense to me.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Mathieu Desnoyers

- On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com wrote:
[...]
> 
> I think a small enhancement to rseq would let us build a perfect userspace
> mutex, one that spins on lock-acquire only when the lock owner is running
> and that sleeps otherwise, freeing userspace from both specifying ad-hoc
> spin counts and from trying to detect situations in which spinning is
> generally pointless.
> 
> It'd work like this: in the per-thread rseq data structure, we'd include a
> description of a futex operation for the kernel would perform (in the
> context of the preempted thread) upon preemption, immediately before
> schedule(). If the futex operation itself sleeps, that's no problem: we
> will have still accomplished our goal of running some other thread instead
> of the preempted thread.

Hi Daniel,

I agree that the problem you are aiming to solve is important. Let's see
what prevents the proposed rseq implementation from doing what you envision.

The main issue here is touching userspace immediately before schedule().
At that specific point, it's not possible to take a page fault. In the proposed
rseq implementation, we get away with it by raising a task struct flag, and 
using
it in a return to userspace notifier (where we can actually take a fault), where
we touch the userspace TLS area.

If we can find a way to solve this limitation, then the rest of your design
makes sense to me.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[PATCH 04/14] arm: Wire up restartable sequences system call

Wire up the rseq system call on 32-bit ARM.

This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.

TODO: wire up rseq_syscall() on return from system call. It is used with
CONFIG_DEBUG_RSEQ=y to ensure system calls are not issued within rseq critical
section

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0bb0e9c6376c..fbc74b5fa3ed 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -412,3 +412,4 @@
 395common  pkey_alloc  sys_pkey_alloc
 396common  pkey_free   sys_pkey_free
 397common  statx   sys_statx
+398common  rseqsys_rseq
-- 
2.11.0

[PATCH 04/14] arm: Wire up restartable sequences system call

Wire up the rseq system call on 32-bit ARM.

This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.

TODO: wire up rseq_syscall() on return from system call. It is used with
CONFIG_DEBUG_RSEQ=y to ensure system calls are not issued within rseq critical
section

Signed-off-by: Mathieu Desnoyers 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0bb0e9c6376c..fbc74b5fa3ed 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -412,3 +412,4 @@
 395common  pkey_alloc  sys_pkey_alloc
 396common  pkey_free   sys_pkey_free
 397common  statx   sys_statx
+398common  rseqsys_rseq
-- 
2.11.0

[RFC PATCH for 4.18 00/14] Restartable Sequences

Hi,

Here is an updated RFC round of the Restartable Sequences patchset
based on kernel 4.17-rc3. Based on feedback from Linus, I'm introducing
only the rseq system call, keeping the rest for later.

This already enables speeding up the Facebook jemalloc and arm64 PMC
read from user-space use-cases, as well as speedup of use-cases relying
on getting the current cpu number from user-space. We'll have to wait
until a more complete solution is introduced before the LTTng-UST
tracer can replace its ring buffer atomic instructions with rseq
though. But let's proceed one step at a time.

The main change introduced by the removal of cpu_opv from this series
in terms of library use from user-space is that APIs that previously
took a CPU number as argument now only act on the current CPU.

So for instance, this turns:

  int cpu = rseq_per_cpu_lock(lock, target_cpu);
  [...]
  rseq_per_cpu_unlock(lock, cpu);

into

  int cpu = rseq_this_cpu_lock(lock);
  [...]
  rseq_per_cpu_unlock(lock, cpu);

and:

  per_cpu_list_push(list, node, target_cpu);
  [...]
  per_cpu_list_pop(list, node, target_cpu);

into

  this_cpu_list_push(list, node, );  /* cpu is an output parameter. */
  [...]
  node = this_cpu_list_pop(list, );  /* cpu is an output parameter. */

Eventually integrating cpu_opv or some alternative will allow passing
the cpu number as parameter rather than requiring the algorithm to work
on the current CPU.

The second effect of not having the cpu_opv fallback is that
line and instruction single-stepping with a debugger transforms rseq
critical sections based on retry loops into never-ending loops.
Debuggers need to use the __rseq_table section to skip those critical
sections in order to correctly behave when single-stepping a thread
which uses rseq in a retry loop. However, applications which use an
alternative fallback method rather than retrying on rseq fast-path abort
won't be affected by this kind of single-stepping issue.

Feedback is welcome!

Thanks,

Mathieu

Boqun Feng (2):
  powerpc: Add support for restartable sequences
  powerpc: Wire up restartable sequences system call

Mathieu Desnoyers (12):
  uapi headers: Provide types_32_64.h (v2)
  rseq: Introduce restartable sequences system call (v13)
  arm: Add restartable sequences support
  arm: Wire up restartable sequences system call
  x86: Add support for restartable sequences (v2)
  x86: Wire up restartable sequence system call
  selftests: lib.mk: Introduce OVERRIDE_TARGETS
  rseq: selftests: Provide rseq library (v5)
  rseq: selftests: Provide basic test
  rseq: selftests: Provide basic percpu ops test (v2)
  rseq: selftests: Provide parametrized tests (v2)
  rseq: selftests: Provide Makefile, scripts, gitignore (v2)

 MAINTAINERS|   12 +
 arch/Kconfig   |7 +
 arch/arm/Kconfig   |1 +
 arch/arm/kernel/signal.c   |7 +
 arch/arm/tools/syscall.tbl |1 +
 arch/powerpc/Kconfig   |1 +
 arch/powerpc/include/asm/systbl.h  |1 +
 arch/powerpc/include/asm/unistd.h  |2 +-
 arch/powerpc/include/uapi/asm/unistd.h |1 +
 arch/powerpc/kernel/signal.c   |3 +
 arch/x86/Kconfig   |1 +
 arch/x86/entry/common.c|3 +
 arch/x86/entry/syscalls/syscall_32.tbl |1 +
 arch/x86/entry/syscalls/syscall_64.tbl |1 +
 arch/x86/kernel/signal.c   |6 +
 fs/exec.c  |1 +
 include/linux/sched.h  |  134 +++
 include/linux/syscalls.h   |4 +-
 include/trace/events/rseq.h|   56 +
 include/uapi/linux/rseq.h  |  150 +++
 include/uapi/linux/types_32_64.h   |   67 ++
 init/Kconfig   |   23 +
 kernel/Makefile|1 +
 kernel/fork.c  |2 +
 kernel/rseq.c  |  366 ++
 kernel/sched/core.c|2 +
 kernel/sys_ni.c|3 +
 tools/testing/selftests/Makefile   |1 +
 tools/testing/selftests/lib.mk |4 +
 tools/testing/selftests/rseq/.gitignore|6 +
 tools/testing/selftests/rseq/Makefile  |   29 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  312 +
 tools/testing/selftests/rseq/basic_test.c  |   55 +
 tools/testing/selftests/rseq/param_test.c  | 1259 
 tools/testing/selftests/rseq/rseq-arm.h|  732 
 tools/testing/selftests/rseq/rseq-ppc.h|  688 +++
 tools/testing

[RFC PATCH for 4.18 00/14] Restartable Sequences

Hi,

Here is an updated RFC round of the Restartable Sequences patchset
based on kernel 4.17-rc3. Based on feedback from Linus, I'm introducing
only the rseq system call, keeping the rest for later.

This already enables speeding up the Facebook jemalloc and arm64 PMC
read from user-space use-cases, as well as speedup of use-cases relying
on getting the current cpu number from user-space. We'll have to wait
until a more complete solution is introduced before the LTTng-UST
tracer can replace its ring buffer atomic instructions with rseq
though. But let's proceed one step at a time.

The main change introduced by the removal of cpu_opv from this series
in terms of library use from user-space is that APIs that previously
took a CPU number as argument now only act on the current CPU.

So for instance, this turns:

  int cpu = rseq_per_cpu_lock(lock, target_cpu);
  [...]
  rseq_per_cpu_unlock(lock, cpu);

into

  int cpu = rseq_this_cpu_lock(lock);
  [...]
  rseq_per_cpu_unlock(lock, cpu);

and:

  per_cpu_list_push(list, node, target_cpu);
  [...]
  per_cpu_list_pop(list, node, target_cpu);

into

  this_cpu_list_push(list, node, );  /* cpu is an output parameter. */
  [...]
  node = this_cpu_list_pop(list, );  /* cpu is an output parameter. */

Eventually integrating cpu_opv or some alternative will allow passing
the cpu number as parameter rather than requiring the algorithm to work
on the current CPU.

The second effect of not having the cpu_opv fallback is that
line and instruction single-stepping with a debugger transforms rseq
critical sections based on retry loops into never-ending loops.
Debuggers need to use the __rseq_table section to skip those critical
sections in order to correctly behave when single-stepping a thread
which uses rseq in a retry loop. However, applications which use an
alternative fallback method rather than retrying on rseq fast-path abort
won't be affected by this kind of single-stepping issue.

Feedback is welcome!

Thanks,

Mathieu

Boqun Feng (2):
  powerpc: Add support for restartable sequences
  powerpc: Wire up restartable sequences system call

Mathieu Desnoyers (12):
  uapi headers: Provide types_32_64.h (v2)
  rseq: Introduce restartable sequences system call (v13)
  arm: Add restartable sequences support
  arm: Wire up restartable sequences system call
  x86: Add support for restartable sequences (v2)
  x86: Wire up restartable sequence system call
  selftests: lib.mk: Introduce OVERRIDE_TARGETS
  rseq: selftests: Provide rseq library (v5)
  rseq: selftests: Provide basic test
  rseq: selftests: Provide basic percpu ops test (v2)
  rseq: selftests: Provide parametrized tests (v2)
  rseq: selftests: Provide Makefile, scripts, gitignore (v2)

 MAINTAINERS|   12 +
 arch/Kconfig   |7 +
 arch/arm/Kconfig   |1 +
 arch/arm/kernel/signal.c   |7 +
 arch/arm/tools/syscall.tbl |1 +
 arch/powerpc/Kconfig   |1 +
 arch/powerpc/include/asm/systbl.h  |1 +
 arch/powerpc/include/asm/unistd.h  |2 +-
 arch/powerpc/include/uapi/asm/unistd.h |1 +
 arch/powerpc/kernel/signal.c   |3 +
 arch/x86/Kconfig   |1 +
 arch/x86/entry/common.c|3 +
 arch/x86/entry/syscalls/syscall_32.tbl |1 +
 arch/x86/entry/syscalls/syscall_64.tbl |1 +
 arch/x86/kernel/signal.c   |6 +
 fs/exec.c  |1 +
 include/linux/sched.h  |  134 +++
 include/linux/syscalls.h   |4 +-
 include/trace/events/rseq.h|   56 +
 include/uapi/linux/rseq.h  |  150 +++
 include/uapi/linux/types_32_64.h   |   67 ++
 init/Kconfig   |   23 +
 kernel/Makefile|1 +
 kernel/fork.c  |2 +
 kernel/rseq.c  |  366 ++
 kernel/sched/core.c|2 +
 kernel/sys_ni.c|3 +
 tools/testing/selftests/Makefile   |1 +
 tools/testing/selftests/lib.mk |4 +
 tools/testing/selftests/rseq/.gitignore|6 +
 tools/testing/selftests/rseq/Makefile  |   29 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  312 +
 tools/testing/selftests/rseq/basic_test.c  |   55 +
 tools/testing/selftests/rseq/param_test.c  | 1259 
 tools/testing/selftests/rseq/rseq-arm.h|  732 
 tools/testing/selftests/rseq/rseq-ppc.h|  688 +++
 tools/testing

[PATCH 06/14] x86: Wire up restartable sequence system call

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index d6b27dab1b30..db346da64947 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -396,3 +396,4 @@
 382i386pkey_free   sys_pkey_free   
__ia32_sys_pkey_free
 383i386statx   sys_statx   
__ia32_sys_statx
 384i386arch_prctl  sys_arch_prctl  
__ia32_compat_sys_arch_prctl
+385i386rseqsys_rseq
__ia32_sys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 4dfe42666d0c..41b082b125c3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -341,6 +341,7 @@
 330common  pkey_alloc  __x64_sys_pkey_alloc
 331common  pkey_free   __x64_sys_pkey_free
 332common  statx   __x64_sys_statx
+333common  rseq__x64_sys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0

[PATCH 02/14] rseq: Introduce restartable sequences system call (v13)

hreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.: 41.37 s
std.dev.: 0.36 s

* CONFIG_RSEQ=y

avg.: 40.46 s
std.dev.: 0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2]
http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Link:
http://lkml.kernel.org/r/20151027235635.16059.11630.st...@pjt-glaptop.roam.corp.google.com
Link:
http://lkml.kernel.org/r/20150624222609.6116.86035.st...@kitami.mtv.corp.google.com
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Michael Kerrisk <mtk.manpa...@gmail.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: linux-...@vger.kernel.org
---

Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
sizeof(int32_t).
- Update man page to describe the pointer alignement requirements and
update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.

Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
defining this enumeration.
- Split resume notifier architecture implementation from the system call
wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
set the current cpu cache pointer before doing the cache update, and
set it back to NULL if the update fails. Setting it back to NULL on
error ensures that no resume notifier will trigger a SIGSEGV if a
migration happened concurrently.

Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.

Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
to change log.

Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", allowing to extend
this system call to cover future features such as restartable critical
sections. Generalizing this system call ensures that we can add
features similar to the cpu_id field within the same cache-line
without having to track one pointer per feature within the task
struct.
- Add a tlabi_nr parameter to the system call, thus allowing to extend
the ABI beyond the initial 64-byte structure by registering structures
with tlabi_nr greater than 0. The initial ABI structure is associated
with tlabi_nr 0.
- Rebased on kernel v4.5.

Changes since v6:
- Integrate "restartable sequences" v2 patchset from Paul Turner.
- Add handling of single-stepping purely in user-space, with a
fallback to locking after 2 rseq failures to ensure progress, and
by exposing a __rseq_table section to debuggers so they know where
to put breakpoints when dealing with rseq assembly blocks which
can be aborted at any point.
- make the code and ABI generic: porting the kernel implementation
simply requires to wire up the signal handler and return to user-space
hooks, and allocate the syscall number.
- extend testing with a fully configurable test program. See
param_spinlock_test -h for details.
- handling of rseq ENOSYS in user-space, also with a fallback
to locking.
- modify Paul Turner's rseq ABI to only require a single TLS store on
the user-space fast-path, removing the need to populate two additional
registers. This is made possible by introduc

[PATCH 06/14] x86: Wire up restartable sequence system call

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Thomas Gleixner 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index d6b27dab1b30..db346da64947 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -396,3 +396,4 @@
 382i386pkey_free   sys_pkey_free   
__ia32_sys_pkey_free
 383i386statx   sys_statx   
__ia32_sys_statx
 384i386arch_prctl  sys_arch_prctl  
__ia32_compat_sys_arch_prctl
+385i386rseqsys_rseq
__ia32_sys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 4dfe42666d0c..41b082b125c3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -341,6 +341,7 @@
 330common  pkey_alloc  __x64_sys_pkey_alloc
 331common  pkey_free   __x64_sys_pkey_free
 332common  statx   __x64_sys_statx
+333common  rseq__x64_sys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0

[PATCH 02/14] rseq: Introduce restartable sequences system call (v13)

* CONFIG_RSEQ=n

avg.: 41.37 s
std.dev.: 0.36 s

* CONFIG_RSEQ=y

avg.: 40.46 s
std.dev.: 0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2]
http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Link:
http://lkml.kernel.org/r/20151027235635.16059.11630.st...@pjt-glaptop.roam.corp.google.com
Link:
http://lkml.kernel.org/r/20150624222609.6116.86035.st...@kitami.mtv.corp.google.com
Signed-off-by: Mathieu Desnoyers
CC: Thomas Gleixner
CC: Paul Turner
CC: Andrew Hunter
CC: Peter Zijlstra
CC: Andy Lutomirski
CC: Andi Kleen
CC: Dave Watson
CC: Chris Lameter
CC: Ingo Molnar
CC: "H. Peter Anvin"
CC: Ben Maurer
CC: Steven Rostedt
CC: "Paul E. McKenney"
CC: Josh Triplett
CC: Linus Torvalds
CC: Andrew Morton
CC: Russell King
CC: Catalin Marinas
CC: Will Deacon
CC: Michael Kerrisk
CC: Boqun Feng
CC: Alexander Viro
CC: linux-...@vger.kernel.org
---

Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.

Changes since v7:
- Documentation updates.
- Integrated powerpc architecture support.
- Compare rseq critical section start_ip, allows shriking the user-space
fast-path code size.
- Added Peter Zijlstra, Paul E. McKenney and Boqun Feng as
co-maintainers.
- Added do_rseq2 and do_rseq_memcpy to test program helper library.
- Code cleanup based on review from Peter Zijlstra, Andy Lutomirski and
Boqun Feng.
- Rebase on kernel v4.8-rc2.

Changes since v8:
- clear rseq_cs even if non-nested. Spe

[PATCH 07/14] powerpc: Add support for restartable sequences

From: Boqun Feng <boqun.f...@gmail.com>

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..ed21a777e8c6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
user_enter();
-- 
2.11.0

[PATCH 08/14] powerpc: Wire up restartable sequences system call

From: Boqun Feng <boqun.f...@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

TODO: wire up rseq_syscall() on return from system call. It is used with
CONFIG_DEBUG_RSEQ=y to ensure system calls are not issued within rseq critical
section

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls387
+#define NR_syscalls388
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
+#define __NR_rseq  387
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0

[PATCH 07/14] powerpc: Add support for restartable sequences

From: Boqun Feng 

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.

Signed-off-by: Boqun Feng 
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..ed21a777e8c6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
user_enter();
-- 
2.11.0

[PATCH 08/14] powerpc: Wire up restartable sequences system call

From: Boqun Feng 

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

TODO: wire up rseq_syscall() on return from system call. It is used with
CONFIG_DEBUG_RSEQ=y to ensure system calls are not issued within rseq critical
section

Signed-off-by: Boqun Feng 
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls387
+#define NR_syscalls388
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
+#define __NR_rseq  387
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0

[PATCH 11/14] rseq: selftests: Provide basic test

"basic_test" only asserts that RSEQ works moderately correctly. E.g.
that the CPUID pointer works.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
 tools/testing/selftests/rseq/basic_test.c | 56 +++
 1 file changed, 56 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/basic_test.c

diff --git a/tools/testing/selftests/rseq/basic_test.c 
b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index ..d8efbfb89193
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rseq.h"
+
+void test_cpu_pointer(void)
+{
+   cpu_set_t affinity, test_affinity;
+   int i;
+
+   sched_getaffinity(0, sizeof(affinity), );
+   CPU_ZERO(_affinity);
+   for (i = 0; i < CPU_SETSIZE; i++) {
+   if (CPU_ISSET(i, )) {
+   CPU_SET(i, _affinity);
+   sched_setaffinity(0, sizeof(test_affinity),
+   _affinity);
+   assert(sched_getcpu() == i);
+   assert(rseq_current_cpu() == i);
+   assert(rseq_current_cpu_raw() == i);
+   assert(rseq_cpu_start() == i);
+   CPU_CLR(i, _affinity);
+   }
+   }
+   sched_setaffinity(0, sizeof(affinity), );
+}
+
+int main(int argc, char **argv)
+{
+   if (rseq_register_current_thread()) {
+   fprintf(stderr, "Error: rseq_register_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   goto init_thread_error;
+   }
+   printf("testing current cpu\n");
+   test_cpu_pointer();
+   if (rseq_unregister_current_thread()) {
+   fprintf(stderr, "Error: rseq_unregister_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   goto init_thread_error;
+   }
+   return 0;
+
+init_thread_error:
+   return -1;
+}
-- 
2.11.0

[PATCH 09/14] selftests: lib.mk: Introduce OVERRIDE_TARGETS

Introduce OVERRIDE_TARGETS to allow tests to express dependencies on
header files and .so, which require to override the selftests lib.mk
targets.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Acked-by: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
 tools/testing/selftests/lib.mk | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 195e9d4739a9..9fd57efae439 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -106,6 +106,9 @@ COMPILE.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c
 LINK.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH)
 endif
 
+# Selftest makefiles can override those targets by setting
+# OVERRIDE_TARGETS = 1.
+ifeq ($(OVERRIDE_TARGETS),)
 $(OUTPUT)/%:%.c
$(LINK.c) $^ $(LDLIBS) -o $@
 
@@ -114,5 +117,6 @@ $(OUTPUT)/%.o:%.S
 
 $(OUTPUT)/%:%.S
$(LINK.S) $^ $(LDLIBS) -o $@
+endif
 
 .PHONY: run_tests all clean install emit_tests
-- 
2.11.0

[PATCH 11/14] rseq: selftests: Provide basic test

"basic_test" only asserts that RSEQ works moderately correctly. E.g.
that the CPUID pointer works.

Signed-off-by: Mathieu Desnoyers 
CC: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
 tools/testing/selftests/rseq/basic_test.c | 56 +++
 1 file changed, 56 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/basic_test.c

diff --git a/tools/testing/selftests/rseq/basic_test.c 
b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index ..d8efbfb89193
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rseq.h"
+
+void test_cpu_pointer(void)
+{
+   cpu_set_t affinity, test_affinity;
+   int i;
+
+   sched_getaffinity(0, sizeof(affinity), );
+   CPU_ZERO(_affinity);
+   for (i = 0; i < CPU_SETSIZE; i++) {
+   if (CPU_ISSET(i, )) {
+   CPU_SET(i, _affinity);
+   sched_setaffinity(0, sizeof(test_affinity),
+   _affinity);
+   assert(sched_getcpu() == i);
+   assert(rseq_current_cpu() == i);
+   assert(rseq_current_cpu_raw() == i);
+   assert(rseq_cpu_start() == i);
+   CPU_CLR(i, _affinity);
+   }
+   }
+   sched_setaffinity(0, sizeof(affinity), );
+}
+
+int main(int argc, char **argv)
+{
+   if (rseq_register_current_thread()) {
+   fprintf(stderr, "Error: rseq_register_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   goto init_thread_error;
+   }
+   printf("testing current cpu\n");
+   test_cpu_pointer();
+   if (rseq_unregister_current_thread()) {
+   fprintf(stderr, "Error: rseq_unregister_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   goto init_thread_error;
+   }
+   return 0;
+
+init_thread_error:
+   return -1;
+}
-- 
2.11.0

[PATCH 09/14] selftests: lib.mk: Introduce OVERRIDE_TARGETS

Introduce OVERRIDE_TARGETS to allow tests to express dependencies on
header files and .so, which require to override the selftests lib.mk
targets.

Signed-off-by: Mathieu Desnoyers 
Acked-by: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
 tools/testing/selftests/lib.mk | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 195e9d4739a9..9fd57efae439 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -106,6 +106,9 @@ COMPILE.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c
 LINK.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH)
 endif
 
+# Selftest makefiles can override those targets by setting
+# OVERRIDE_TARGETS = 1.
+ifeq ($(OVERRIDE_TARGETS),)
 $(OUTPUT)/%:%.c
$(LINK.c) $^ $(LDLIBS) -o $@
 
@@ -114,5 +117,6 @@ $(OUTPUT)/%.o:%.S
 
 $(OUTPUT)/%:%.S
$(LINK.S) $^ $(LDLIBS) -o $@
+endif
 
 .PHONY: run_tests all clean install emit_tests
-- 
2.11.0

[PATCH 12/14] rseq: selftests: Provide basic percpu ops test (v2)

"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Use only rseq, remove use of cpu_opv system call.
---
 .../testing/selftests/rseq/basic_percpu_ops_test.c | 313 +
 1 file changed, 313 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c

diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c 
b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index ..96ef27905879
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rseq.h"
+
+#define ARRAY_SIZE(arr)(sizeof(arr) / sizeof((arr)[0]))
+
+struct percpu_lock_entry {
+   intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+   struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+   intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+   struct percpu_lock lock;
+   struct test_data_entry c[CPU_SETSIZE];
+   int reps;
+};
+
+struct percpu_list_node {
+   intptr_t data;
+   struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+   struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+   struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_this_cpu_lock(struct percpu_lock *lock)
+{
+   int cpu;
+
+   for (;;) {
+   int ret;
+
+   cpu = rseq_cpu_start();
+   ret = rseq_cmpeqv_storev(>c[cpu].v,
+0, 1, cpu);
+   if (rseq_likely(!ret))
+   break;
+   /* Retry if comparison fails or rseq aborts. */
+   }
+   /*
+* Acquire semantic when taking lock after control dependency.
+* Matches rseq_smp_store_release().
+*/
+   rseq_smp_acquire__after_ctrl_dep();
+   return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+   assert(lock->c[cpu].v == 1);
+   /*
+* Release lock, with release semantic. Matches
+* rseq_smp_acquire__after_ctrl_dep().
+*/
+   rseq_smp_store_release(>c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+   struct spinlock_test_data *data = arg;
+   int i, cpu;
+
+   if (rseq_register_current_thread()) {
+   fprintf(stderr, "Error: rseq_register_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   abort();
+   }
+   for (i = 0; i < data->reps; i++) {
+   cpu = rseq_this_cpu_lock(>lock);
+   data->c[cpu].count++;
+   rseq_percpu_unlock(>lock, cpu);
+   }
+   if (rseq_unregister_current_thread()) {
+   fprintf(stderr, "Error: rseq_unregister_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   abort();
+   }
+
+   return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+   const int num_threads = 200;
+   int i;
+   uint64_t sum;
+   pthread_t test_threads[num_threads];
+   struct spinlock_test_

[PATCH 12/14] rseq: selftests: Provide basic percpu ops test (v2)

"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.

Signed-off-by: Mathieu Desnoyers 
CC: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Use only rseq, remove use of cpu_opv system call.
---
 .../testing/selftests/rseq/basic_percpu_ops_test.c | 313 +
 1 file changed, 313 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c

diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c 
b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index ..96ef27905879
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "rseq.h"
+
+#define ARRAY_SIZE(arr)(sizeof(arr) / sizeof((arr)[0]))
+
+struct percpu_lock_entry {
+   intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+   struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+   intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+   struct percpu_lock lock;
+   struct test_data_entry c[CPU_SETSIZE];
+   int reps;
+};
+
+struct percpu_list_node {
+   intptr_t data;
+   struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+   struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+   struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock.  Returns the cpu lock was acquired on. */
+int rseq_this_cpu_lock(struct percpu_lock *lock)
+{
+   int cpu;
+
+   for (;;) {
+   int ret;
+
+   cpu = rseq_cpu_start();
+   ret = rseq_cmpeqv_storev(>c[cpu].v,
+0, 1, cpu);
+   if (rseq_likely(!ret))
+   break;
+   /* Retry if comparison fails or rseq aborts. */
+   }
+   /*
+* Acquire semantic when taking lock after control dependency.
+* Matches rseq_smp_store_release().
+*/
+   rseq_smp_acquire__after_ctrl_dep();
+   return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+   assert(lock->c[cpu].v == 1);
+   /*
+* Release lock, with release semantic. Matches
+* rseq_smp_acquire__after_ctrl_dep().
+*/
+   rseq_smp_store_release(>c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+   struct spinlock_test_data *data = arg;
+   int i, cpu;
+
+   if (rseq_register_current_thread()) {
+   fprintf(stderr, "Error: rseq_register_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   abort();
+   }
+   for (i = 0; i < data->reps; i++) {
+   cpu = rseq_this_cpu_lock(>lock);
+   data->c[cpu].count++;
+   rseq_percpu_unlock(>lock, cpu);
+   }
+   if (rseq_unregister_current_thread()) {
+   fprintf(stderr, "Error: rseq_unregister_current_thread(...) 
failed(%d): %s\n",
+   errno, strerror(errno));
+   abort();
+   }
+
+   return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock.  Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+   const int num_threads = 200;
+   int i;
+   uint64_t sum;
+   pthread_t test_threads[num_threads];
+   struct spinlock_test_data data;
+
+   memset(, 0, sizeof(data));
+   data.reps = 5000;
+
+   for (i = 0; i < num_threads; i++)
+   pthread_create(_threads[i], NULL,
+  test_percpu_spinlock_thread, );
+
+   for (i = 0; i < num_threads; i++)
+   pthread_join(test_threads[i], NULL);
+
+   sum = 0;
+   for (i = 0; i < CPU_SETSIZE; i++)
+   sum += data.c[i].count;
+
+   assert(sum == (uint64_t)data.reps * num_threads);
+}
+
+void this_cpu_list_push(struct percpu_list *list,
+   struct percpu_list_node *node,
+

[PATCH 01/14] uapi headers: Provide types_32_64.h (v2)

Provide helper macros for fields which represent pointers in
kernel-userspace ABI. This facilitates handling of 32-bit
user-space by 64-bit kernels by defining those fields as
32-bit 0-padding and 32-bit integer on 32-bit architectures,
which allows the kernel to treat those as 64-bit integers.
The order of padding and 32-bit integer depends on the
endianness.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul Turner <p...@google.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Andrew Hunter <a...@google.com>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Michael Kerrisk <mtk.manpa...@gmail.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org

---

Changes since v1:
- Public uapi headers use __u32 and __u64 rather than uint32_t and
  uint64_t.
---
 include/uapi/linux/types_32_64.h | 50 
 1 file changed, 50 insertions(+)
 create mode 100644 include/uapi/linux/types_32_64.h

diff --git a/include/uapi/linux/types_32_64.h b/include/uapi/linux/types_32_64.h
new file mode 100644
index ..0a87ace34a57
--- /dev/null
+++ b/include/uapi/linux/types_32_64.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_TYPES_32_64_H
+#define _UAPI_LINUX_TYPES_32_64_H
+
+/*
+ * linux/types_32_64.h
+ *
+ * Integer type declaration for pointers across 32-bit and 64-bit systems.
+ *
+ * Copyright (c) 2015-2018 Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
+ */
+
+#ifdef __KERNEL__
+# include 
+#else
+# include 
+#endif
+
+#include 
+
+#ifdef __BYTE_ORDER
+# if (__BYTE_ORDER == __BIG_ENDIAN)
+#  define LINUX_BYTE_ORDER_BIG_ENDIAN
+# else
+#  define LINUX_BYTE_ORDER_LITTLE_ENDIAN
+# endif
+#else
+# ifdef __BIG_ENDIAN
+#  define LINUX_BYTE_ORDER_BIG_ENDIAN
+# else
+#  define LINUX_BYTE_ORDER_LITTLE_ENDIAN
+# endif
+#endif
+
+#ifdef __LP64__
+# define LINUX_FIELD_u32_u64(field)__u64 field
+# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)field = (intptr_t)v
+#else
+# ifdef LINUX_BYTE_ORDER_BIG_ENDIAN
+#  define LINUX_FIELD_u32_u64(field)   __u32 field ## _padding, field
+#  define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
+   field ## _padding = 0, field = (intptr_t)v
+# else
+#  define LINUX_FIELD_u32_u64(field)   __u32 field, field ## _padding
+#  define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
+   field = (intptr_t)v, field ## _padding = 0
+# endif
+#endif
+
+#endif /* _UAPI_LINUX_TYPES_32_64_H */
-- 
2.11.0

[PATCH 01/14] uapi headers: Provide types_32_64.h (v2)

Provide helper macros for fields which represent pointers in
kernel-userspace ABI. This facilitates handling of 32-bit
user-space by 64-bit kernels by defining those fields as
32-bit 0-padding and 32-bit integer on 32-bit architectures,
which allows the kernel to treat those as 64-bit integers.
The order of padding and 32-bit integer depends on the
endianness.

Signed-off-by: Mathieu Desnoyers 
CC: "Paul E. McKenney" 
CC: Peter Zijlstra 
CC: Paul Turner 
CC: Thomas Gleixner 
CC: Andrew Hunter 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Michael Kerrisk 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org

---

Changes since v1:
- Public uapi headers use __u32 and __u64 rather than uint32_t and
  uint64_t.
---
 include/uapi/linux/types_32_64.h | 50 
 1 file changed, 50 insertions(+)
 create mode 100644 include/uapi/linux/types_32_64.h

diff --git a/include/uapi/linux/types_32_64.h b/include/uapi/linux/types_32_64.h
new file mode 100644
index ..0a87ace34a57
--- /dev/null
+++ b/include/uapi/linux/types_32_64.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_TYPES_32_64_H
+#define _UAPI_LINUX_TYPES_32_64_H
+
+/*
+ * linux/types_32_64.h
+ *
+ * Integer type declaration for pointers across 32-bit and 64-bit systems.
+ *
+ * Copyright (c) 2015-2018 Mathieu Desnoyers 
+ */
+
+#ifdef __KERNEL__
+# include 
+#else
+# include 
+#endif
+
+#include 
+
+#ifdef __BYTE_ORDER
+# if (__BYTE_ORDER == __BIG_ENDIAN)
+#  define LINUX_BYTE_ORDER_BIG_ENDIAN
+# else
+#  define LINUX_BYTE_ORDER_LITTLE_ENDIAN
+# endif
+#else
+# ifdef __BIG_ENDIAN
+#  define LINUX_BYTE_ORDER_BIG_ENDIAN
+# else
+#  define LINUX_BYTE_ORDER_LITTLE_ENDIAN
+# endif
+#endif
+
+#ifdef __LP64__
+# define LINUX_FIELD_u32_u64(field)__u64 field
+# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)field = (intptr_t)v
+#else
+# ifdef LINUX_BYTE_ORDER_BIG_ENDIAN
+#  define LINUX_FIELD_u32_u64(field)   __u32 field ## _padding, field
+#  define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
+   field ## _padding = 0, field = (intptr_t)v
+# else
+#  define LINUX_FIELD_u32_u64(field)   __u32 field, field ## _padding
+#  define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
+   field = (intptr_t)v, field ## _padding = 0
+# endif
+#endif
+
+#endif /* _UAPI_LINUX_TYPES_32_64_H */
-- 
2.11.0

[PATCH 10/14] rseq: selftests: Provide rseq library (v5)

This rseq helper library provides a user-space API to the rseq()
system call.

The rseq fast-path exposes the instruction pointer addresses where the
rseq assembly blocks begin and end, as well as the associated abort
instruction pointer, in the __rseq_table section. This section allows
debuggers may know where to place breakpoints when single-stepping
through assembly blocks which may be aborted at any point by the kernel.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Provide abort-ip signature: The abort-ip signature is located just
  before the abort-ip target. It is currently hardcoded, but a
  user-space application could use the __rseq_table to iterate on all
  abort-ip targets and use a random value as signature if needed in the
  future.
- Add rseq_prepare_unload(): Libraries and JIT code using rseq critical
  sections need to issue rseq_prepare_unload() on each thread at least
  once before reclaim of struct rseq_cs.
- Use initial-exec TLS model, non-weak symbol: The initial-exec model is
  signal-safe, whereas the global-dynamic model is not.  Remove the
  "weak" symbol attribute from the __rseq_abi in rseq.c. The rseq.so
  library will have ownership of that symbol, and there is not reason for
  an application or user library to try to define that symbol.
  The expected use is to link against libreq.so, which owns and provide
  that symbol.
- Set cpu_id to -2 on register error
- Add rseq_len syscall parameter, rseq_cs version
- Ensure disassember-friendly signature: x86 32/64 disassembler have a
  hard time decoding the instruction stream after a bad instruction. Use
  a nopl instruction to encode the signature. Suggested by Andy Lutomirski.
- Exercise parametrized tests variants in a shell scripts.
- Restartable sequences selftests: Remove use of event counter.
- Use cpu_id_start field:  With the cpu_id_start field, the C
  preparation phase of the fast-path does not need to compare cpu_id < 0
  anymore.
- Signal-safe registration and refcounting: Allow libraries using
  librseq.so to register it from signal handlers.
- Use OVERRIDE_TARGETS in makefile.
- Use "m" constraints for rseq_cs field.

Changes since v2:
- Update based on Thomas Gleixner's comments.

Changes since v3:
- Generate param_test_skip_fastpath and param_test_benchmark with
  -DSKIP_FASTPATH and -DBENCHMARK (respectively). Add param_test_fastpath
  to run_param_test.sh.

Changes since v4:
- Fold arm: workaround gcc asm size guess,
- Namespace barrier() -> rseq_barrier() in library header,
- Take into account coding style feedback from Peter Zijlstra,
- Split rseq selftests into logical commits.
---
 tools/testing/selftests/rseq/rseq-arm.h  |  715 +++
 tools/testing/selftests/rseq/rseq-ppc.h  |  671 ++
 tools/testing/selftests/rseq/rseq-skip.h |   65 ++
 tools/testing/selftests/rseq/rseq-x86.h  | 1132 ++
 tools/testing/selftests/rseq/rseq.c  |  117 +++
 tools/testing/selftests/rseq/rseq.h  |  147 
 6 files changed, 2847 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-skip.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/rseq-arm.h 
b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index ..3b055f9aeaab
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,715 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016-2018 - Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
+ */
+
+#define RSEQ_SIG   0x53053053
+
+#define rseq_smp_mb()  __asm__ __volatile

[PATCH 10/14] rseq: selftests: Provide rseq library (v5)

This rseq helper library provides a user-space API to the rseq()
system call.

The rseq fast-path exposes the instruction pointer addresses where the
rseq assembly blocks begin and end, as well as the associated abort
instruction pointer, in the __rseq_table section. This section allows
debuggers may know where to place breakpoints when single-stepping
through assembly blocks which may be aborted at any point by the kernel.

Signed-off-by: Mathieu Desnoyers 
CC: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Provide abort-ip signature: The abort-ip signature is located just
  before the abort-ip target. It is currently hardcoded, but a
  user-space application could use the __rseq_table to iterate on all
  abort-ip targets and use a random value as signature if needed in the
  future.
- Add rseq_prepare_unload(): Libraries and JIT code using rseq critical
  sections need to issue rseq_prepare_unload() on each thread at least
  once before reclaim of struct rseq_cs.
- Use initial-exec TLS model, non-weak symbol: The initial-exec model is
  signal-safe, whereas the global-dynamic model is not.  Remove the
  "weak" symbol attribute from the __rseq_abi in rseq.c. The rseq.so
  library will have ownership of that symbol, and there is not reason for
  an application or user library to try to define that symbol.
  The expected use is to link against libreq.so, which owns and provide
  that symbol.
- Set cpu_id to -2 on register error
- Add rseq_len syscall parameter, rseq_cs version
- Ensure disassember-friendly signature: x86 32/64 disassembler have a
  hard time decoding the instruction stream after a bad instruction. Use
  a nopl instruction to encode the signature. Suggested by Andy Lutomirski.
- Exercise parametrized tests variants in a shell scripts.
- Restartable sequences selftests: Remove use of event counter.
- Use cpu_id_start field:  With the cpu_id_start field, the C
  preparation phase of the fast-path does not need to compare cpu_id < 0
  anymore.
- Signal-safe registration and refcounting: Allow libraries using
  librseq.so to register it from signal handlers.
- Use OVERRIDE_TARGETS in makefile.
- Use "m" constraints for rseq_cs field.

Changes since v2:
- Update based on Thomas Gleixner's comments.

Changes since v3:
- Generate param_test_skip_fastpath and param_test_benchmark with
  -DSKIP_FASTPATH and -DBENCHMARK (respectively). Add param_test_fastpath
  to run_param_test.sh.

Changes since v4:
- Fold arm: workaround gcc asm size guess,
- Namespace barrier() -> rseq_barrier() in library header,
- Take into account coding style feedback from Peter Zijlstra,
- Split rseq selftests into logical commits.
---
 tools/testing/selftests/rseq/rseq-arm.h  |  715 +++
 tools/testing/selftests/rseq/rseq-ppc.h  |  671 ++
 tools/testing/selftests/rseq/rseq-skip.h |   65 ++
 tools/testing/selftests/rseq/rseq-x86.h  | 1132 ++
 tools/testing/selftests/rseq/rseq.c  |  117 +++
 tools/testing/selftests/rseq/rseq.h  |  147 
 6 files changed, 2847 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-skip.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/rseq-arm.h 
b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index ..3b055f9aeaab
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,715 @@
+/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016-2018 - Mathieu Desnoyers 
+ */
+
+#define RSEQ_SIG   0x53053053
+
+#define rseq_smp_mb()  __asm__ __volatile__ ("dmb" ::: "memory", "cc")
+#define rseq_smp_rmb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
+#define rseq_smp_wmb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
+
+#define rseq_smp_load_acquire(p)   \
+__extension__ ({   \
+   __typeof(*p) p1 = RSEQ_READ_ONCE(*p);   \
+   rseq_smp_mb();  \
+   p1; \
+})
+
+#defi

[PATCH 14/14] rseq: selftests: Provide Makefile, scripts, gitignore (v2)

A run_param_test.sh script runs many variants of the parametrizable
tests.

Wire up the rseq Makefile, add directory entry into MAINTAINERS file.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Use only rseq, remove use of cpu_opv.
---
 MAINTAINERS|   1 +
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/rseq/.gitignore|   6 ++
 tools/testing/selftests/rseq/Makefile  |  30 ++
 tools/testing/selftests/rseq/run_param_test.sh | 121 +
 5 files changed, 159 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100755 tools/testing/selftests/rseq/run_param_test.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 4d61ce154dfc..5e8968b3ccae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11991,6 +11991,7 @@ S:  Supported
 F: kernel/rseq.c
 F: include/uapi/linux/rseq.h
 F: include/trace/events/rseq.h
+F: tools/testing/selftests/rseq/
 
 RFKILL
 M: Johannes Berg <johan...@sipsolutions.net>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32aafa92074c..593fb44c9cd4 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -28,6 +28,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += rseq
 TARGETS += seccomp
 TARGETS += sigaltstack
 TARGETS += size
diff --git a/tools/testing/selftests/rseq/.gitignore 
b/tools/testing/selftests/rseq/.gitignore
new file mode 100644
index ..cc610da7e369
--- /dev/null
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -0,0 +1,6 @@
+basic_percpu_ops_test
+basic_test
+basic_rseq_op_test
+param_test
+param_test_benchmark
+param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile 
b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index ..c30c52e1d0d2
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,30 @@
+# SPDX-License-Identifier: GPL-2.0+ OR MIT
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+LDLIBS += -lpthread
+
+# Own dependencies because we only want to build against 1st prerequisite, but
+# still track changes to header files and depend on shared object.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test \
+   param_test_benchmark param_test_compare_twice
+
+TEST_GEN_PROGS_EXTENDED = librseq.so
+
+TEST_PROGS = run_param_test.sh
+
+include ../lib.mk
+
+$(OUTPUT)/librseq.so: rseq.c rseq.h rseq-*.h
+   $(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+   $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/param_test_benchmark: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+   rseq.h rseq-*.h
+   $(CC) $(CFLAGS) -DBENCHMARK $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/param_test_compare_twice: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+   rseq.h rseq-*.h
+   $(CC) $(CFLAGS) -DRSEQ_COMPARE_TWICE $< $(LDLIBS) -lrseq -o $@
diff --git a/tools/testing/selftests/rseq/run_param_test.sh 
b/tools/testing/selftests/rseq/run_param_test.sh
new file mode 100755
index ..3acd6d75ff9f
--- /dev/null
+++ b/tools/testing/selftests/rseq/run_param_test.sh
@@ -0,0 +1,121 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+ or MIT
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+   "-T s"
+   "-T l"
+   "-T b"
+   "-T b -M"
+   "-T m"
+   "-T m -M"
+   "-T i"
+)
+
+TEST_NAME=(
+   "spinlock"
+   "list"
+   "buffer"
+

[PATCH 14/14] rseq: selftests: Provide Makefile, scripts, gitignore (v2)

A run_param_test.sh script runs many variants of the parametrizable
tests.

Wire up the rseq Makefile, add directory entry into MAINTAINERS file.

Signed-off-by: Mathieu Desnoyers 
CC: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Use only rseq, remove use of cpu_opv.
---
 MAINTAINERS|   1 +
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/rseq/.gitignore|   6 ++
 tools/testing/selftests/rseq/Makefile  |  30 ++
 tools/testing/selftests/rseq/run_param_test.sh | 121 +
 5 files changed, 159 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/.gitignore
 create mode 100644 tools/testing/selftests/rseq/Makefile
 create mode 100755 tools/testing/selftests/rseq/run_param_test.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 4d61ce154dfc..5e8968b3ccae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11991,6 +11991,7 @@ S:  Supported
 F: kernel/rseq.c
 F: include/uapi/linux/rseq.h
 F: include/trace/events/rseq.h
+F: tools/testing/selftests/rseq/
 
 RFKILL
 M: Johannes Berg 
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32aafa92074c..593fb44c9cd4 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -28,6 +28,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += rseq
 TARGETS += seccomp
 TARGETS += sigaltstack
 TARGETS += size
diff --git a/tools/testing/selftests/rseq/.gitignore 
b/tools/testing/selftests/rseq/.gitignore
new file mode 100644
index ..cc610da7e369
--- /dev/null
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -0,0 +1,6 @@
+basic_percpu_ops_test
+basic_test
+basic_rseq_op_test
+param_test
+param_test_benchmark
+param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile 
b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index ..c30c52e1d0d2
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,30 @@
+# SPDX-License-Identifier: GPL-2.0+ OR MIT
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+LDLIBS += -lpthread
+
+# Own dependencies because we only want to build against 1st prerequisite, but
+# still track changes to header files and depend on shared object.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test param_test \
+   param_test_benchmark param_test_compare_twice
+
+TEST_GEN_PROGS_EXTENDED = librseq.so
+
+TEST_PROGS = run_param_test.sh
+
+include ../lib.mk
+
+$(OUTPUT)/librseq.so: rseq.c rseq.h rseq-*.h
+   $(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+   $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/param_test_benchmark: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+   rseq.h rseq-*.h
+   $(CC) $(CFLAGS) -DBENCHMARK $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/param_test_compare_twice: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+   rseq.h rseq-*.h
+   $(CC) $(CFLAGS) -DRSEQ_COMPARE_TWICE $< $(LDLIBS) -lrseq -o $@
diff --git a/tools/testing/selftests/rseq/run_param_test.sh 
b/tools/testing/selftests/rseq/run_param_test.sh
new file mode 100755
index ..3acd6d75ff9f
--- /dev/null
+++ b/tools/testing/selftests/rseq/run_param_test.sh
@@ -0,0 +1,121 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+ or MIT
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+   "-T s"
+   "-T l"
+   "-T b"
+   "-T b -M"
+   "-T m"
+   "-T m -M"
+   "-T i"
+)
+
+TEST_NAME=(
+   "spinlock"
+   "list"
+   "buffer"
+   "buffer with barrier"
+   "memcpy"
+   "memcpy with barrier"
+   "increment"
+)
+IFS="$OLDIFS"
+
+REPS=1000
+SLOW_REPS=100
+
+function do_tests()
+{
+   local i=0
+   while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
+   echo "Running test ${TEST_NAME[$i]}"
+   ./param_test ${TEST_LIST[$i]} -r ${REPS} ${@} ${EXTRA_ARGS} || 
exit 1
+   echo "Running compare-twice test ${TEST_NAME[$i]}"
+   ./param_test_compare_twice ${TEST_LIST[$i]} -r ${REPS} ${@} 
${EXTRA_ARGS} || exit 1
+   let &quo

[PATCH 03/14] arm: Add restartable sequences support

Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org
---
 arch/arm/Kconfig | 1 +
 arch/arm/kernel/signal.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..4f5c386631d4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -91,6 +91,7 @@ config ARM
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16
select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index bd8810d4acb3..5879ab3f53c1 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -541,6 +541,12 @@ static void handle_signal(struct ksignal *ksig, struct 
pt_regs *regs)
int ret;
 
/*
+* Increment event counter and perform fixup for the pre-signal
+* frame.
+*/
+   rseq_signal_deliver(regs);
+
+   /*
 * Set up the stack frame
 */
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -660,6 +666,7 @@ do_work_pending(struct pt_regs *regs, unsigned int 
thread_flags, int syscall)
} else {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
}
local_irq_disable();
-- 
2.11.0

[PATCH 03/14] arm: Add restartable sequences support

Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.

Signed-off-by: Mathieu Desnoyers 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org
---
 arch/arm/Kconfig | 1 +
 arch/arm/kernel/signal.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..4f5c386631d4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -91,6 +91,7 @@ config ARM
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16
select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index bd8810d4acb3..5879ab3f53c1 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -541,6 +541,12 @@ static void handle_signal(struct ksignal *ksig, struct 
pt_regs *regs)
int ret;
 
/*
+* Increment event counter and perform fixup for the pre-signal
+* frame.
+*/
+   rseq_signal_deliver(regs);
+
+   /*
 * Set up the stack frame
 */
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -660,6 +666,7 @@ do_work_pending(struct pt_regs *regs, unsigned int 
thread_flags, int syscall)
} else {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
}
local_irq_disable();
-- 
2.11.0

[PATCH 13/14] rseq: selftests: Provide parametrized tests (v2)

"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.

"param_test_benchmark" is the same as "param_test", but it removes
testing book-keeping code to allow accurate benchmarks.

"param_test_compare_twice" is the same as "param_test", but it performs
each comparison within rseq critical section twice, thus validating
invariants. If any of the second comparisons fails, an error message
is printed and the test aborts.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Use only rseq, remove use of cpu_opv.
---
 tools/testing/selftests/rseq/param_test.c | 1260 +
 1 file changed, 1260 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/param_test.c

diff --git a/tools/testing/selftests/rseq/param_test.c 
b/tools/testing/selftests/rseq/param_test.c
new file mode 100644
index ..6a9f602a8718
--- /dev/null
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -0,0 +1,1260 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline pid_t gettid(void)
+{
+   return syscall(__NR_gettid);
+}
+
+#define NR_INJECT  9
+static int loop_cnt[NR_INJECT + 1];
+
+static int loop_cnt_1 asm("asm_loop_cnt_1") __attribute__((used));
+static int loop_cnt_2 asm("asm_loop_cnt_2") __attribute__((used));
+static int loop_cnt_3 asm("asm_loop_cnt_3") __attribute__((used));
+static int loop_cnt_4 asm("asm_loop_cnt_4") __attribute__((used));
+static int loop_cnt_5 asm("asm_loop_cnt_5") __attribute__((used));
+static int loop_cnt_6 asm("asm_loop_cnt_6") __attribute__((used));
+
+static int opt_modulo, verbose;
+
+static int opt_yield, opt_signal, opt_sleep,
+   opt_disable_rseq, opt_threads = 200,
+   opt_disable_mod = 0, opt_test = 's', opt_mb = 0;
+
+#ifndef RSEQ_SKIP_FASTPATH
+static long long opt_reps = 5000;
+#else
+static long long opt_reps = 100;
+#endif
+
+static __thread __attribute__((tls_model("initial-exec")))
+unsigned int signals_delivered;
+
+#ifndef BENCHMARK
+
+static __thread __attribute__((tls_model("initial-exec"), unused))
+unsigned int yield_mod_cnt, nr_abort;
+
+#define printf_verbose(fmt, ...)   \
+   do {\
+   if (verbose)\
+   printf(fmt, ## __VA_ARGS__);\
+   } while (0)
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define INJECT_ASM_REG "eax"
+
+#define RSEQ_INJECT_CLOBBER \
+   , INJECT_ASM_REG
+
+#ifdef __i386__
+
+#define RSEQ_INJECT_ASM(n) \
+   "mov asm_loop_cnt_" #n ", %%" INJECT_ASM_REG "\n\t" \
+   "test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+   "jz 333f\n\t" \
+   "222:\n\t" \
+   "dec %%" INJECT_ASM_REG "\n\t" \
+   "jnz 222b\n\t" \
+   "333:\n\t"
+
+#elif defined(__x86_64__)
+
+#define RSEQ_INJECT_ASM(n) \
+   "lea asm_loop_cnt_" #n "(%%rip), %%" INJECT_ASM_REG "\n\t" \
+   "mov (%%" INJECT_ASM_REG "), %%" INJECT_ASM_REG "\n\t" \
+   "test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+   "jz 333f\n\t" \
+   "222:\n\t" \
+   "dec %%" INJECT_ASM_REG "\n\t" \
+   "jnz 222b\n\t" \
+   "333:\n\t"
+
+#else
+#error "Unsupported architecture"
+#endif
+
+#elif defined(__ARMEL__)
+
+#define RSEQ_IN

[PATCH 13/14] rseq: selftests: Provide parametrized tests (v2)

"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.

"param_test_benchmark" is the same as "param_test", but it removes
testing book-keeping code to allow accurate benchmarks.

"param_test_compare_twice" is the same as "param_test", but it performs
each comparison within rseq critical section twice, thus validating
invariants. If any of the second comparisons fails, an error message
is printed and the test aborts.

Signed-off-by: Mathieu Desnoyers 
CC: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Use only rseq, remove use of cpu_opv.
---
 tools/testing/selftests/rseq/param_test.c | 1260 +
 1 file changed, 1260 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/param_test.c

diff --git a/tools/testing/selftests/rseq/param_test.c 
b/tools/testing/selftests/rseq/param_test.c
new file mode 100644
index ..6a9f602a8718
--- /dev/null
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -0,0 +1,1260 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline pid_t gettid(void)
+{
+   return syscall(__NR_gettid);
+}
+
+#define NR_INJECT  9
+static int loop_cnt[NR_INJECT + 1];
+
+static int loop_cnt_1 asm("asm_loop_cnt_1") __attribute__((used));
+static int loop_cnt_2 asm("asm_loop_cnt_2") __attribute__((used));
+static int loop_cnt_3 asm("asm_loop_cnt_3") __attribute__((used));
+static int loop_cnt_4 asm("asm_loop_cnt_4") __attribute__((used));
+static int loop_cnt_5 asm("asm_loop_cnt_5") __attribute__((used));
+static int loop_cnt_6 asm("asm_loop_cnt_6") __attribute__((used));
+
+static int opt_modulo, verbose;
+
+static int opt_yield, opt_signal, opt_sleep,
+   opt_disable_rseq, opt_threads = 200,
+   opt_disable_mod = 0, opt_test = 's', opt_mb = 0;
+
+#ifndef RSEQ_SKIP_FASTPATH
+static long long opt_reps = 5000;
+#else
+static long long opt_reps = 100;
+#endif
+
+static __thread __attribute__((tls_model("initial-exec")))
+unsigned int signals_delivered;
+
+#ifndef BENCHMARK
+
+static __thread __attribute__((tls_model("initial-exec"), unused))
+unsigned int yield_mod_cnt, nr_abort;
+
+#define printf_verbose(fmt, ...)   \
+   do {\
+   if (verbose)\
+   printf(fmt, ## __VA_ARGS__);\
+   } while (0)
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define INJECT_ASM_REG "eax"
+
+#define RSEQ_INJECT_CLOBBER \
+   , INJECT_ASM_REG
+
+#ifdef __i386__
+
+#define RSEQ_INJECT_ASM(n) \
+   "mov asm_loop_cnt_" #n ", %%" INJECT_ASM_REG "\n\t" \
+   "test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+   "jz 333f\n\t" \
+   "222:\n\t" \
+   "dec %%" INJECT_ASM_REG "\n\t" \
+   "jnz 222b\n\t" \
+   "333:\n\t"
+
+#elif defined(__x86_64__)
+
+#define RSEQ_INJECT_ASM(n) \
+   "lea asm_loop_cnt_" #n "(%%rip), %%" INJECT_ASM_REG "\n\t" \
+   "mov (%%" INJECT_ASM_REG "), %%" INJECT_ASM_REG "\n\t" \
+   "test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+   "jz 333f\n\t" \
+   "222:\n\t" \
+   "dec %%" INJECT_ASM_REG "\n\t" \
+   "jnz 222b\n\t" \
+   "333:\n\t"
+
+#else
+#error "Unsupported architecture"
+#endif
+
+#elif defined(__ARMEL__)
+
+#define RSEQ_INJECT_INPUT \
+   , [loop_cnt_1]"m"(loop_cnt[1]) \
+   , [loop_cnt_2]"m"(loop_cnt[2]) \
+   , [loop_cnt_3]"m"(loop_cnt[3]) \
+   , [loop_cnt_4]"m"(loop_cnt[4]) \
+   , [loop_cnt_5]"m"(loop_cnt[5]) \
+   , [loop_cnt_6]"m"(loop_cnt[6])
+
+#define INJECT_ASM_REG "r4"
+
+#define RSEQ_INJECT_CLOBBER \
+   , INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+   "ldr " INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+   "cmp " INJECT_ASM_REG ", #0\n\t" \
+   &

[PATCH 05/14] x86: Add support for restartable sequences (v2)

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.

Check that system calls are not invoked from within rseq critical
sections by invoking rseq_signal() from syscall_return_slowpath().
With CONFIG_DEBUG_RSEQ, such behavior results in termination of the
process with SIGSEGV.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org

---

Changes since v1:
- Call rseq_signal() when returning from a system call.
---
 arch/x86/Kconfig | 1 +
 arch/x86/entry/common.c  | 3 +++
 arch/x86/kernel/signal.c | 6 ++
 3 files changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c07f492b871a..62e00a1a7cf7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -180,6 +180,7 @@ config X86
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
UNWINDER_FRAME_POINTER && STACK_VALIDATION
select HAVE_STACK_VALIDATIONif X86_64
+   select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index fbf6a6c3fd2d..92190879b228 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -164,6 +164,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 
cached_flags)
if (cached_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
@@ -254,6 +255,8 @@ __visible inline void syscall_return_slowpath(struct 
pt_regs *regs)
WARN(irqs_disabled(), "syscall %ld left IRQs disabled", 
regs->orig_ax))
local_irq_enable();
 
+   rseq_syscall(regs);
+
/*
 * First do one-time work.  If these work items are enabled, we
 * want to run them exactly once per syscall exit with IRQs on.
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index da270b95fe4d..445ca11ff863 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -688,6 +688,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
sigset_t *set = sigmask_to_save();
compat_sigset_t *cset = (compat_sigset_t *) set;
 
+   /*
+* Increment event counter and perform fixup for the pre-signal
+* frame.
+*/
+   rseq_signal_deliver(regs);
+
/* Set up the stack frame */
if (is_ia32_frame(ksig)) {
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
-- 
2.11.0

[PATCH 05/14] x86: Add support for restartable sequences (v2)

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.

Check that system calls are not invoked from within rseq critical
sections by invoking rseq_signal() from syscall_return_slowpath().
With CONFIG_DEBUG_RSEQ, such behavior results in termination of the
process with SIGSEGV.

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Thomas Gleixner 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org

---

Changes since v1:
- Call rseq_signal() when returning from a system call.
---
 arch/x86/Kconfig | 1 +
 arch/x86/entry/common.c  | 3 +++
 arch/x86/kernel/signal.c | 6 ++
 3 files changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c07f492b871a..62e00a1a7cf7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -180,6 +180,7 @@ config X86
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && 
UNWINDER_FRAME_POINTER && STACK_VALIDATION
select HAVE_STACK_VALIDATIONif X86_64
+   select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index fbf6a6c3fd2d..92190879b228 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -164,6 +164,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 
cached_flags)
if (cached_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
@@ -254,6 +255,8 @@ __visible inline void syscall_return_slowpath(struct 
pt_regs *regs)
WARN(irqs_disabled(), "syscall %ld left IRQs disabled", 
regs->orig_ax))
local_irq_enable();
 
+   rseq_syscall(regs);
+
/*
 * First do one-time work.  If these work items are enabled, we
 * want to run them exactly once per syscall exit with IRQs on.
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index da270b95fe4d..445ca11ff863 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -688,6 +688,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
sigset_t *set = sigmask_to_save();
compat_sigset_t *cset = (compat_sigset_t *) set;
 
+   /*
+* Increment event counter and perform fixup for the pre-signal
+* frame.
+*/
+   rseq_signal_deliver(regs);
+
/* Set up the stack frame */
if (is_ia32_frame(ksig)) {
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
-- 
2.11.0

[PATCH 4.17-rc2] selftests: Fix lib.mk run_tests target shell script

Within run_tests target, the whole script needs to be executed within
the same shell and not as separate subshells, so the initial test_num
variable set to 0 is still present when executing "test_num=`echo
$$test_num+1 | bc`;".

Demonstration of the issue (make run_tests):

TAP version 13
(standard_in) 1: syntax error
selftests: basic_test

ok 1.. selftests: basic_test [PASS]
(standard_in) 1: syntax error
selftests: basic_percpu_ops_test

ok 1.. selftests: basic_percpu_ops_test [PASS]
(standard_in) 1: syntax error
selftests: param_test

ok 1.. selftests: param_test [PASS]

With fix applied:

TAP version 13
selftests: basic_test

ok 1..1 selftests: basic_test [PASS]
selftests: basic_percpu_ops_test

ok 1..2 selftests: basic_percpu_ops_test [PASS]
selftests: param_test

ok 1..3 selftests: param_test [PASS]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Fixes: 1f87c7c15d7 ("selftests: lib.mk: change RUN_TESTS to print messages in 
TAP13 format")
CC: Shuah Khan <shua...@osg.samsung.com>
CC: linux-kselft...@vger.kernel.org
---
 tools/testing/selftests/lib.mk | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 195e9d4739a9..c1b1a4dc6a96 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -20,10 +20,10 @@ all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) 
$(TEST_GEN_FILES)
 
 .ONESHELL:
 define RUN_TESTS
-   @export KSFT_TAP_LEVEL=`echo 1`;
-   @test_num=`echo 0`;
-   @echo "TAP version 13";
-   @for TEST in $(1); do   \
+   @export KSFT_TAP_LEVEL=`echo 1`;\
+   test_num=`echo 0`;  \
+   echo "TAP version 13";  \
+   for TEST in $(1); do\
BASENAME_TEST=`basename $$TEST`;\
test_num=`echo $$test_num+1 | bc`;  \
echo "selftests: $$BASENAME_TEST";  \
-- 
2.11.0

[PATCH 4.17-rc2] selftests: Fix lib.mk run_tests target shell script

Within run_tests target, the whole script needs to be executed within
the same shell and not as separate subshells, so the initial test_num
variable set to 0 is still present when executing "test_num=`echo
$$test_num+1 | bc`;".

Demonstration of the issue (make run_tests):

TAP version 13
(standard_in) 1: syntax error
selftests: basic_test

ok 1.. selftests: basic_test [PASS]
(standard_in) 1: syntax error
selftests: basic_percpu_ops_test

ok 1.. selftests: basic_percpu_ops_test [PASS]
(standard_in) 1: syntax error
selftests: param_test

ok 1.. selftests: param_test [PASS]

With fix applied:

TAP version 13
selftests: basic_test

ok 1..1 selftests: basic_test [PASS]
selftests: basic_percpu_ops_test

ok 1..2 selftests: basic_percpu_ops_test [PASS]
selftests: param_test

ok 1..3 selftests: param_test [PASS]

Signed-off-by: Mathieu Desnoyers 
Fixes: 1f87c7c15d7 ("selftests: lib.mk: change RUN_TESTS to print messages in 
TAP13 format")
CC: Shuah Khan 
CC: linux-kselft...@vger.kernel.org
---
 tools/testing/selftests/lib.mk | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 195e9d4739a9..c1b1a4dc6a96 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -20,10 +20,10 @@ all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) 
$(TEST_GEN_FILES)
 
 .ONESHELL:
 define RUN_TESTS
-   @export KSFT_TAP_LEVEL=`echo 1`;
-   @test_num=`echo 0`;
-   @echo "TAP version 13";
-   @for TEST in $(1); do   \
+   @export KSFT_TAP_LEVEL=`echo 1`;\
+   test_num=`echo 0`;  \
+   echo "TAP version 13";  \
+   for TEST in $(1); do\
BASENAME_TEST=`basename $$TEST`;\
test_num=`echo $$test_num+1 | bc`;  \
echo "selftests: $$BASENAME_TEST";  \
-- 
2.11.0

Re: [PATCH 1/1] selftests: Fix lib.mk run_tests target shell script

- On Apr 27, 2018, at 5:05 PM, shuah sh...@kernel.org wrote:

> On 04/27/2018 02:42 PM, Shuah Khan wrote:
>> On 04/27/2018 02:17 PM, Mathieu Desnoyers wrote:
>>> - On Nov 1, 2017, at 6:28 PM, Shuah Khan shua...@osg.samsung.com wrote:
>>>
>>>> On 11/01/2017 04:24 PM, Mathieu Desnoyers wrote:
>>>>> - On Nov 1, 2017, at 6:22 PM, Mathieu Desnoyers
>>>>> mathieu.desnoy...@efficios.com wrote:
>>>>>
>>>>>> - On Nov 1, 2017, at 5:33 PM, Shuah Khan shua...@osg.samsung.com 
>>>>>> wrote:
>>>>>>
>>>>>>> On 10/28/2017 07:46 AM, Mathieu Desnoyers wrote:
>>>>>>>> Within run_tests target, the whole script needs to be executed within
>>>>>>>> the same shell and not as separate subshells, so the initial test_num
>>>>>>>> variable set to 0 is still present when executing "test_num=`echo
>>>>>>>> $$test_num+1 | bc`;".
>>>>>>>>
>>>>>>>> Demonstration of the issue (make run_tests):
>>>>>>>>
>>>>>>>> TAP version 13
>>>>>>>> (standard_in) 1: syntax error
>>>>>>>> selftests: basic_test
>>>>>>>> 
>>>>>>>> ok 1.. selftests: basic_test [PASS]
>>>>>>>> (standard_in) 1: syntax error
>>>>>>>> selftests: basic_percpu_ops_test
>>>>>>>> 
>>>>>>>> ok 1.. selftests: basic_percpu_ops_test [PASS]
>>>>>>>> (standard_in) 1: syntax error
>>>>>>>> selftests: param_test
>>>>>>>> 
>>>>>>>> ok 1.. selftests: param_test [PASS]
>>>>>>>
>>>>>>> Hi Mathieu,
>>>>>>>
>>>>>>> Odd. I don't see the error. I am curious if this specific to
>>>>>>> env. Can you reproduce this with one of the existing tests,
>>>>>>> kcmp or breakpoints
>>>>>>
>>>>>> Yes, it reproduces:
>>>>>>
>>>>>> cd tools/testing/selftests/kcmp
>>>>>> make run_tests
>>>>>> gcc -I../../../../usr/include/kcmp_test.c  -o
>>>>>> /home/efficios/git/linux-rseq/tools/testing/selftests/kcmp/kcmp_test
>>>>>> TAP version 13
>>>>>> (standard_in) 1: syntax error
>>>>>> selftests: kcmp_test
>>>>>> 
>>>>>> ok 1.. selftests: kcmp_test [PASS]
>>>>>>
>>>>>> cd tools/testing/selftests/breakpoints
>>>>>> make run_tests
>>>>>> gcc step_after_suspend_test.c  -o
>>>>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/step_after_suspend_test
>>>>>> gcc breakpoint_test.c  -o
>>>>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/breakpoint_test
>>>>>> TAP version 13
>>>>>> (standard_in) 1: syntax error
>>>>>> selftests: step_after_suspend_test
>>>>>> 
>>>>>> not ok 1.. selftests:  step_after_suspend_test [FAIL]
>>>>>> (standard_in) 1: syntax error
>>>>>> selftests: breakpoint_test
>>>>>> 
>>>>>> ok 1.. selftests: breakpoint_test [PASS]
>>>>>>
>>>>>
>>>>> The version of "make" on that machine is:
>>>>>
>>>>> make --version
>>>>> GNU Make 3.81
>>>>> Copyright (C) 2006  Free Software Foundation, Inc.
>>>>> This is free software; see the source for copying conditions.
>>>>> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>>>>> PARTICULAR PURPOSE.
>>>>>
>>>>> This program built for x86_64-pc-linux-gnu
>>>>>
>>>>> (if it helps reproducing)
>>>>>
>>>>
>>>> Yup that's it. I have
>>>>
>>>> GNU Make 4.1
>>>> Built for x86_64-pc-linux-gnu
>>>> Copyright (C) 1988-2014 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL versi

Re: [PATCH 1/1] selftests: Fix lib.mk run_tests target shell script

- On Apr 27, 2018, at 5:05 PM, shuah sh...@kernel.org wrote:

> On 04/27/2018 02:42 PM, Shuah Khan wrote:
>> On 04/27/2018 02:17 PM, Mathieu Desnoyers wrote:
>>> - On Nov 1, 2017, at 6:28 PM, Shuah Khan shua...@osg.samsung.com wrote:
>>>
>>>> On 11/01/2017 04:24 PM, Mathieu Desnoyers wrote:
>>>>> - On Nov 1, 2017, at 6:22 PM, Mathieu Desnoyers
>>>>> mathieu.desnoy...@efficios.com wrote:
>>>>>
>>>>>> - On Nov 1, 2017, at 5:33 PM, Shuah Khan shua...@osg.samsung.com 
>>>>>> wrote:
>>>>>>
>>>>>>> On 10/28/2017 07:46 AM, Mathieu Desnoyers wrote:
>>>>>>>> Within run_tests target, the whole script needs to be executed within
>>>>>>>> the same shell and not as separate subshells, so the initial test_num
>>>>>>>> variable set to 0 is still present when executing "test_num=`echo
>>>>>>>> $$test_num+1 | bc`;".
>>>>>>>>
>>>>>>>> Demonstration of the issue (make run_tests):
>>>>>>>>
>>>>>>>> TAP version 13
>>>>>>>> (standard_in) 1: syntax error
>>>>>>>> selftests: basic_test
>>>>>>>> 
>>>>>>>> ok 1.. selftests: basic_test [PASS]
>>>>>>>> (standard_in) 1: syntax error
>>>>>>>> selftests: basic_percpu_ops_test
>>>>>>>> 
>>>>>>>> ok 1.. selftests: basic_percpu_ops_test [PASS]
>>>>>>>> (standard_in) 1: syntax error
>>>>>>>> selftests: param_test
>>>>>>>> 
>>>>>>>> ok 1.. selftests: param_test [PASS]
>>>>>>>
>>>>>>> Hi Mathieu,
>>>>>>>
>>>>>>> Odd. I don't see the error. I am curious if this specific to
>>>>>>> env. Can you reproduce this with one of the existing tests,
>>>>>>> kcmp or breakpoints
>>>>>>
>>>>>> Yes, it reproduces:
>>>>>>
>>>>>> cd tools/testing/selftests/kcmp
>>>>>> make run_tests
>>>>>> gcc -I../../../../usr/include/kcmp_test.c  -o
>>>>>> /home/efficios/git/linux-rseq/tools/testing/selftests/kcmp/kcmp_test
>>>>>> TAP version 13
>>>>>> (standard_in) 1: syntax error
>>>>>> selftests: kcmp_test
>>>>>> 
>>>>>> ok 1.. selftests: kcmp_test [PASS]
>>>>>>
>>>>>> cd tools/testing/selftests/breakpoints
>>>>>> make run_tests
>>>>>> gcc step_after_suspend_test.c  -o
>>>>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/step_after_suspend_test
>>>>>> gcc breakpoint_test.c  -o
>>>>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/breakpoint_test
>>>>>> TAP version 13
>>>>>> (standard_in) 1: syntax error
>>>>>> selftests: step_after_suspend_test
>>>>>> 
>>>>>> not ok 1.. selftests:  step_after_suspend_test [FAIL]
>>>>>> (standard_in) 1: syntax error
>>>>>> selftests: breakpoint_test
>>>>>> 
>>>>>> ok 1.. selftests: breakpoint_test [PASS]
>>>>>>
>>>>>
>>>>> The version of "make" on that machine is:
>>>>>
>>>>> make --version
>>>>> GNU Make 3.81
>>>>> Copyright (C) 2006  Free Software Foundation, Inc.
>>>>> This is free software; see the source for copying conditions.
>>>>> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>>>>> PARTICULAR PURPOSE.
>>>>>
>>>>> This program built for x86_64-pc-linux-gnu
>>>>>
>>>>> (if it helps reproducing)
>>>>>
>>>>
>>>> Yup that's it. I have
>>>>
>>>> GNU Make 4.1
>>>> Built for x86_64-pc-linux-gnu
>>>> Copyright (C) 1988-2014 Free Software Foundation, Inc.
>>>> License GPLv3+: GNU GPL versi

Re: [PATCH 1/1] selftests: Fix lib.mk run_tests target shell script

- On Nov 1, 2017, at 6:28 PM, Shuah Khan shua...@osg.samsung.com wrote:

> On 11/01/2017 04:24 PM, Mathieu Desnoyers wrote:
>> - On Nov 1, 2017, at 6:22 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>>> - On Nov 1, 2017, at 5:33 PM, Shuah Khan shua...@osg.samsung.com wrote:
>>>
>>>> On 10/28/2017 07:46 AM, Mathieu Desnoyers wrote:
>>>>> Within run_tests target, the whole script needs to be executed within
>>>>> the same shell and not as separate subshells, so the initial test_num
>>>>> variable set to 0 is still present when executing "test_num=`echo
>>>>> $$test_num+1 | bc`;".
>>>>>
>>>>> Demonstration of the issue (make run_tests):
>>>>>
>>>>> TAP version 13
>>>>> (standard_in) 1: syntax error
>>>>> selftests: basic_test
>>>>> 
>>>>> ok 1.. selftests: basic_test [PASS]
>>>>> (standard_in) 1: syntax error
>>>>> selftests: basic_percpu_ops_test
>>>>> 
>>>>> ok 1.. selftests: basic_percpu_ops_test [PASS]
>>>>> (standard_in) 1: syntax error
>>>>> selftests: param_test
>>>>> 
>>>>> ok 1.. selftests: param_test [PASS]
>>>>
>>>> Hi Mathieu,
>>>>
>>>> Odd. I don't see the error. I am curious if this specific to
>>>> env. Can you reproduce this with one of the existing tests,
>>>> kcmp or breakpoints
>>>
>>> Yes, it reproduces:
>>>
>>> cd tools/testing/selftests/kcmp
>>> make run_tests
>>> gcc -I../../../../usr/include/kcmp_test.c  -o
>>> /home/efficios/git/linux-rseq/tools/testing/selftests/kcmp/kcmp_test
>>> TAP version 13
>>> (standard_in) 1: syntax error
>>> selftests: kcmp_test
>>> 
>>> ok 1.. selftests: kcmp_test [PASS]
>>>
>>> cd tools/testing/selftests/breakpoints
>>> make run_tests
>>> gcc step_after_suspend_test.c  -o
>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/step_after_suspend_test
>>> gcc breakpoint_test.c  -o
>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/breakpoint_test
>>> TAP version 13
>>> (standard_in) 1: syntax error
>>> selftests: step_after_suspend_test
>>> 
>>> not ok 1.. selftests:  step_after_suspend_test [FAIL]
>>> (standard_in) 1: syntax error
>>> selftests: breakpoint_test
>>> 
>>> ok 1.. selftests: breakpoint_test [PASS]
>>>
>> 
>> The version of "make" on that machine is:
>> 
>> make --version
>> GNU Make 3.81
>> Copyright (C) 2006  Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions.
>> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>> PARTICULAR PURPOSE.
>> 
>> This program built for x86_64-pc-linux-gnu
>> 
>> (if it helps reproducing)
>> 
> 
> Yup that's it. I have
> 
> GNU Make 4.1
> Built for x86_64-pc-linux-gnu
> Copyright (C) 1988-2014 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> 
> I will test with your patch and see what happens in my env.

Hi,

I still see the problem with v4.17-rc2. Did you have time to
consider merging my fix ?

Thanks,

Mathieu


> 
> thanks,
> -- Shuah

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH 1/1] selftests: Fix lib.mk run_tests target shell script

- On Nov 1, 2017, at 6:28 PM, Shuah Khan shua...@osg.samsung.com wrote:

> On 11/01/2017 04:24 PM, Mathieu Desnoyers wrote:
>> - On Nov 1, 2017, at 6:22 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>>> - On Nov 1, 2017, at 5:33 PM, Shuah Khan shua...@osg.samsung.com wrote:
>>>
>>>> On 10/28/2017 07:46 AM, Mathieu Desnoyers wrote:
>>>>> Within run_tests target, the whole script needs to be executed within
>>>>> the same shell and not as separate subshells, so the initial test_num
>>>>> variable set to 0 is still present when executing "test_num=`echo
>>>>> $$test_num+1 | bc`;".
>>>>>
>>>>> Demonstration of the issue (make run_tests):
>>>>>
>>>>> TAP version 13
>>>>> (standard_in) 1: syntax error
>>>>> selftests: basic_test
>>>>> 
>>>>> ok 1.. selftests: basic_test [PASS]
>>>>> (standard_in) 1: syntax error
>>>>> selftests: basic_percpu_ops_test
>>>>> 
>>>>> ok 1.. selftests: basic_percpu_ops_test [PASS]
>>>>> (standard_in) 1: syntax error
>>>>> selftests: param_test
>>>>> 
>>>>> ok 1.. selftests: param_test [PASS]
>>>>
>>>> Hi Mathieu,
>>>>
>>>> Odd. I don't see the error. I am curious if this specific to
>>>> env. Can you reproduce this with one of the existing tests,
>>>> kcmp or breakpoints
>>>
>>> Yes, it reproduces:
>>>
>>> cd tools/testing/selftests/kcmp
>>> make run_tests
>>> gcc -I../../../../usr/include/kcmp_test.c  -o
>>> /home/efficios/git/linux-rseq/tools/testing/selftests/kcmp/kcmp_test
>>> TAP version 13
>>> (standard_in) 1: syntax error
>>> selftests: kcmp_test
>>> 
>>> ok 1.. selftests: kcmp_test [PASS]
>>>
>>> cd tools/testing/selftests/breakpoints
>>> make run_tests
>>> gcc step_after_suspend_test.c  -o
>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/step_after_suspend_test
>>> gcc breakpoint_test.c  -o
>>> /home/efficios/git/linux-rseq/tools/testing/selftests/breakpoints/breakpoint_test
>>> TAP version 13
>>> (standard_in) 1: syntax error
>>> selftests: step_after_suspend_test
>>> 
>>> not ok 1.. selftests:  step_after_suspend_test [FAIL]
>>> (standard_in) 1: syntax error
>>> selftests: breakpoint_test
>>> 
>>> ok 1.. selftests: breakpoint_test [PASS]
>>>
>> 
>> The version of "make" on that machine is:
>> 
>> make --version
>> GNU Make 3.81
>> Copyright (C) 2006  Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions.
>> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>> PARTICULAR PURPOSE.
>> 
>> This program built for x86_64-pc-linux-gnu
>> 
>> (if it helps reproducing)
>> 
> 
> Yup that's it. I have
> 
> GNU Make 4.1
> Built for x86_64-pc-linux-gnu
> Copyright (C) 1988-2014 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> 
> I will test with your patch and see what happens in my env.

Hi,

I still see the problem with v4.17-rc2. Did you have time to
consider merging my fix ?

Thanks,

Mathieu


> 
> thanks,
> -- Shuah

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 2:11 PM, Joel Fernandes joe...@google.com wrote:

> On Fri, Apr 27, 2018 at 9:37 AM, Steven Rostedt <rost...@goodmis.org> wrote:
>> On Fri, 27 Apr 2018 09:30:05 -0700
>> Joel Fernandes <joe...@google.com> wrote:
>>
>>> On Fri, Apr 27, 2018 at 7:47 AM, Steven Rostedt <rost...@goodmis.org> wrote:
>>> > On Fri, 27 Apr 2018 10:26:29 -0400 (EDT)
>>> > Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
>>> >
>>> >> The general approach and the implementation look fine, except for
>>> >> one small detail: I would be tempted to explicitly disable preemption
>>> >> around the call to the tracepoint callback for the rcuidle variant,
>>> >> unless we plan to audit every tracer right away to remove any assumption
>>> >> that preemption is disabled in the callback implementation.
>>> >
>>> > I'm thinking that we do that audit. There shouldn't be many instances
>>> > of it. I like the idea that a tracepoint callback gets called with
>>> > preemption enabled.
>>>
>>> Here is the list of all callers of the _rcuidle :
>>
>> I was thinking of auditing who registers callbacks to any tracepoints.
> 
> Ok. If you feel strongly about this, I think for now I could also just
> wrap the callback execution with preempt_disable_notrace. And, when/if
> we get to doing the blocking callbacks work, we can considering
> keeping preempts on.

My main point here is to introduce the minimal change (keeping preemption
disabled) needed for the rcuidle variant, and only tackle the work of
dealing with preemptible callbacks when we really need it and when we can
properly test it (e.g. by using it for syscall entry/exit tracing).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 2:11 PM, Joel Fernandes joe...@google.com wrote:

> On Fri, Apr 27, 2018 at 9:37 AM, Steven Rostedt  wrote:
>> On Fri, 27 Apr 2018 09:30:05 -0700
>> Joel Fernandes  wrote:
>>
>>> On Fri, Apr 27, 2018 at 7:47 AM, Steven Rostedt  wrote:
>>> > On Fri, 27 Apr 2018 10:26:29 -0400 (EDT)
>>> > Mathieu Desnoyers  wrote:
>>> >
>>> >> The general approach and the implementation look fine, except for
>>> >> one small detail: I would be tempted to explicitly disable preemption
>>> >> around the call to the tracepoint callback for the rcuidle variant,
>>> >> unless we plan to audit every tracer right away to remove any assumption
>>> >> that preemption is disabled in the callback implementation.
>>> >
>>> > I'm thinking that we do that audit. There shouldn't be many instances
>>> > of it. I like the idea that a tracepoint callback gets called with
>>> > preemption enabled.
>>>
>>> Here is the list of all callers of the _rcuidle :
>>
>> I was thinking of auditing who registers callbacks to any tracepoints.
> 
> Ok. If you feel strongly about this, I think for now I could also just
> wrap the callback execution with preempt_disable_notrace. And, when/if
> we get to doing the blocking callbacks work, we can considering
> keeping preempts on.

My main point here is to introduce the minimal change (keeping preemption
disabled) needed for the rcuidle variant, and only tackle the work of
dealing with preemptible callbacks when we really need it and when we can
properly test it (e.g. by using it for syscall entry/exit tracing).

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 11:40 AM, rostedt rost...@goodmis.org wrote:

> On Fri, 27 Apr 2018 08:38:26 -0700
> "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> 
>> On Fri, Apr 27, 2018 at 10:47:47AM -0400, Steven Rostedt wrote:
>> > On Fri, 27 Apr 2018 10:26:29 -0400 (EDT)
>> > Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
>> >   
>> > > The general approach and the implementation look fine, except for
>> > > one small detail: I would be tempted to explicitly disable preemption
>> > > around the call to the tracepoint callback for the rcuidle variant,
>> > > unless we plan to audit every tracer right away to remove any assumption
>> > > that preemption is disabled in the callback implementation.
>> > 
>> > I'm thinking that we do that audit. There shouldn't be many instances
>> > of it. I like the idea that a tracepoint callback gets called with
>> > preemption enabled.
>> 
>> Are you really sure you want to increase your state space that much?
> 
> Why not? The code I have in callbacks already deals with all sorts of
> context - normal, softirq, irq, NMI, preemption disabled, irq
> disabled.

It does so by disabling preemption in the callbacks, even when it's
redundant with the guarantees already provided by tracepoint-sched-rcu
and by kprobes. It's not that great for a fast-path.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 11:40 AM, rostedt rost...@goodmis.org wrote:

> On Fri, 27 Apr 2018 08:38:26 -0700
> "Paul E. McKenney"  wrote:
> 
>> On Fri, Apr 27, 2018 at 10:47:47AM -0400, Steven Rostedt wrote:
>> > On Fri, 27 Apr 2018 10:26:29 -0400 (EDT)
>> > Mathieu Desnoyers  wrote:
>> >   
>> > > The general approach and the implementation look fine, except for
>> > > one small detail: I would be tempted to explicitly disable preemption
>> > > around the call to the tracepoint callback for the rcuidle variant,
>> > > unless we plan to audit every tracer right away to remove any assumption
>> > > that preemption is disabled in the callback implementation.
>> > 
>> > I'm thinking that we do that audit. There shouldn't be many instances
>> > of it. I like the idea that a tracepoint callback gets called with
>> > preemption enabled.
>> 
>> Are you really sure you want to increase your state space that much?
> 
> Why not? The code I have in callbacks already deals with all sorts of
> context - normal, softirq, irq, NMI, preemption disabled, irq
> disabled.

It does so by disabling preemption in the callbacks, even when it's
redundant with the guarantees already provided by tracepoint-sched-rcu
and by kprobes. It's not that great for a fast-path.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 10:47 AM, rostedt rost...@goodmis.org wrote:

> On Fri, 27 Apr 2018 10:26:29 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> 
>> The general approach and the implementation look fine, except for
>> one small detail: I would be tempted to explicitly disable preemption
>> around the call to the tracepoint callback for the rcuidle variant,
>> unless we plan to audit every tracer right away to remove any assumption
>> that preemption is disabled in the callback implementation.
> 
> I'm thinking that we do that audit. There shouldn't be many instances
> of it. I like the idea that a tracepoint callback gets called with
> preemption enabled.

I see that ftrace explicitly disables preemption in its ring buffer
code. FWIW, this is redundant when called from sched-rcu tracepoints
and from kprobes which adds unnecessary performance overhead.

LTTng expects preemption to be disabled when invoked. I can adapt on my
side as needed, but would prefer not to have redundant preemption disabling
for probes hooking on sched-rcu tracepoints (which is the common case).

Do perf callbacks expect preemption to be disabled ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 10:47 AM, rostedt rost...@goodmis.org wrote:

> On Fri, 27 Apr 2018 10:26:29 -0400 (EDT)
> Mathieu Desnoyers  wrote:
> 
>> The general approach and the implementation look fine, except for
>> one small detail: I would be tempted to explicitly disable preemption
>> around the call to the tracepoint callback for the rcuidle variant,
>> unless we plan to audit every tracer right away to remove any assumption
>> that preemption is disabled in the callback implementation.
> 
> I'm thinking that we do that audit. There shouldn't be many instances
> of it. I like the idea that a tracepoint callback gets called with
> preemption enabled.

I see that ftrace explicitly disables preemption in its ring buffer
code. FWIW, this is redundant when called from sched-rcu tracepoints
and from kprobes which adds unnecessary performance overhead.

LTTng expects preemption to be disabled when invoked. I can adapt on my
side as needed, but would prefer not to have redundant preemption disabling
for probes hooking on sched-rcu tracepoints (which is the common case).

Do perf callbacks expect preemption to be disabled ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 12:26 AM, Joel Fernandes joe...@google.com wrote:

> In recent tests with IRQ on/off tracepoints, a large performance
> overhead ~10% is noticed when running hackbench. This is root caused to
> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that srcu is a better alternative for use during rcu idle.
> Although it does involve extra barriers, its lighter than the sched-rcu
> version which has to do additional RCU calls to notify RCU idle about
> entry into RCU sections.
> 
> In this patch, we change the underlying implementation of the
> trace_*_rcuidle API to use SRCU. This has shown to improve performance
> alot for the high frequency irq enable/disable tracepoints.

The general approach and the implementation look fine, except for
one small detail: I would be tempted to explicitly disable preemption
around the call to the tracepoint callback for the rcuidle variant,
unless we plan to audit every tracer right away to remove any assumption
that preemption is disabled in the callback implementation.

That would be the main difference between an eventual "may_sleep" tracepoint
and a rcuidle tracepoint: "may_sleep" would use SRCU and leave preemption
enabled when invoking the callback. rcuidle uses SRCU too, but would
disable preemption when invoking the callback.

Thoughts ?

Thanks,

Mathieu


> 
> In the future, we can add a new may_sleep API which can use this
> infrastructure for callbacks that actually can sleep which will support
> Mathieu's usecase of blocking probes.
> 
> Test: Tested idle and preempt/irq tracepoints.
> 
> [1] https://patchwork.kernel.org/patch/10344297/
> 
> Cc: Steven Rostedt <rost...@goodmis.org>
> Cc: Peter Zilstra <pet...@infradead.org>
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
> Cc: Tom Zanussi <tom.zanu...@linux.intel.com>
> Cc: Namhyung Kim <namhy...@kernel.org>
> Cc: Thomas Glexiner <t...@linutronix.de>
> Cc: Boqun Feng <boqun.f...@gmail.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Randy Dunlap <rdun...@infradead.org>
> Cc: Masami Hiramatsu <mhira...@kernel.org>
> Cc: Fenguang Wu <fengguang...@intel.com>
> Cc: Baohong Liu <baohong@intel.com>
> Cc: Vedang Patel <vedang.pa...@intel.com>
> Cc: kernel-t...@android.com
> Signed-off-by: Joel Fernandes <joe...@google.com>
> ---
> include/linux/tracepoint.h | 37 +
> kernel/tracepoint.c| 10 +-
> 2 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..a1c1987de423 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -15,6 +15,7 @@
>  */
> 
> #include 
> +#include 
> #include 
> #include 
> #include 
> @@ -33,6 +34,8 @@ struct trace_eval_map {
> 
> #define TRACEPOINT_DEFAULT_PRIO   10
> 
> +extern struct srcu_struct tracepoint_srcu;
> +
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> extern int
> @@ -77,6 +80,7 @@ int unregister_tracepoint_module_notifier(struct
> notifier_block *nb)
>  */
> static inline void tracepoint_synchronize_unregister(void)
> {
> + synchronize_srcu(_srcu);
>   synchronize_sched();
> }
> 
> @@ -129,18 +133,26 @@ extern void syscall_unregfunc(void);
>  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
>  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, 
> proto".
>  */
> -#define __DO_TRACE(tp, proto, args, cond, rcucheck)  \
> +#define __DO_TRACE(tp, proto, args, cond, preempt_on)
> \
>   do {\
>   struct tracepoint_func *it_func_ptr;\
>   void *it_func;  \
>   void *__data;   \
> + int __maybe_unused idx = 0; \
>   \
>   if (!(cond))\
>   return; \
> - if (rcucheck)   \
> - rcu_irq_enter_irqson(); \
> -

Re: [PATCH RFC] tracepoint: Introduce tracepoint callbacks executing with preempt on

- On Apr 27, 2018, at 12:26 AM, Joel Fernandes joe...@google.com wrote:

> In recent tests with IRQ on/off tracepoints, a large performance
> overhead ~10% is noticed when running hackbench. This is root caused to
> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that srcu is a better alternative for use during rcu idle.
> Although it does involve extra barriers, its lighter than the sched-rcu
> version which has to do additional RCU calls to notify RCU idle about
> entry into RCU sections.
> 
> In this patch, we change the underlying implementation of the
> trace_*_rcuidle API to use SRCU. This has shown to improve performance
> alot for the high frequency irq enable/disable tracepoints.

The general approach and the implementation look fine, except for
one small detail: I would be tempted to explicitly disable preemption
around the call to the tracepoint callback for the rcuidle variant,
unless we plan to audit every tracer right away to remove any assumption
that preemption is disabled in the callback implementation.

That would be the main difference between an eventual "may_sleep" tracepoint
and a rcuidle tracepoint: "may_sleep" would use SRCU and leave preemption
enabled when invoking the callback. rcuidle uses SRCU too, but would
disable preemption when invoking the callback.

Thoughts ?

Thanks,

Mathieu


> 
> In the future, we can add a new may_sleep API which can use this
> infrastructure for callbacks that actually can sleep which will support
> Mathieu's usecase of blocking probes.
> 
> Test: Tested idle and preempt/irq tracepoints.
> 
> [1] https://patchwork.kernel.org/patch/10344297/
> 
> Cc: Steven Rostedt 
> Cc: Peter Zilstra 
> Cc: Ingo Molnar 
> Cc: Mathieu Desnoyers 
> Cc: Tom Zanussi 
> Cc: Namhyung Kim 
> Cc: Thomas Glexiner 
> Cc: Boqun Feng 
> Cc: Paul McKenney 
> Cc: Frederic Weisbecker 
> Cc: Randy Dunlap 
> Cc: Masami Hiramatsu 
> Cc: Fenguang Wu 
> Cc: Baohong Liu 
> Cc: Vedang Patel 
> Cc: kernel-t...@android.com
> Signed-off-by: Joel Fernandes 
> ---
> include/linux/tracepoint.h | 37 +
> kernel/tracepoint.c| 10 +-
> 2 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..a1c1987de423 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -15,6 +15,7 @@
>  */
> 
> #include 
> +#include 
> #include 
> #include 
> #include 
> @@ -33,6 +34,8 @@ struct trace_eval_map {
> 
> #define TRACEPOINT_DEFAULT_PRIO   10
> 
> +extern struct srcu_struct tracepoint_srcu;
> +
> extern int
> tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
> extern int
> @@ -77,6 +80,7 @@ int unregister_tracepoint_module_notifier(struct
> notifier_block *nb)
>  */
> static inline void tracepoint_synchronize_unregister(void)
> {
> + synchronize_srcu(_srcu);
>   synchronize_sched();
> }
> 
> @@ -129,18 +133,26 @@ extern void syscall_unregfunc(void);
>  * as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
>  * "void *data", where as the DECLARE_TRACE() will pass in "void *data, 
> proto".
>  */
> -#define __DO_TRACE(tp, proto, args, cond, rcucheck)  \
> +#define __DO_TRACE(tp, proto, args, cond, preempt_on)
> \
>   do {\
>   struct tracepoint_func *it_func_ptr;\
>   void *it_func;  \
>   void *__data;   \
> + int __maybe_unused idx = 0; \
>   \
>   if (!(cond))\
>   return; \
> - if (rcucheck)   \
> - rcu_irq_enter_irqson(); \
> - rcu_read_lock_sched_notrace();  \
> - it_func_ptr = rcu_dereference_sched((tp)->funcs);   \
> + if (preempt_on) {   \
> + WARN_ON_ONCE(in_nmi()); /* no srcu from nmi */  \
> + idx = srcu_read_lock(_srcu); \
> + it_func_ptr = srcu_dereference((tp)->funcs, \
>

Re: [PATCH] NFS: Avoid quadratic search when freeing delegations.

s_sb_active(server->super))
> - continue;
> + break;
> +
> + if (prev) {
> + struct inode *tmp;

missing whiteline, or the assignment can be done on the variable definition.

> + tmp = nfs_delegation_grab_inode(prev);
> + if (tmp) {
> + to_put = place_holder;
> + place_holder = tmp;
> + }
> + }
> +
>   inode = nfs_delegation_grab_inode(delegation);
>   if (inode == NULL) {
>   rcu_read_unlock();
> @@ -505,16 +542,26 @@ int nfs_client_return_marked_delegations(struct 
> nfs_client
> *clp)
>   delegation = 
> nfs_start_delegation_return_locked(NFS_I(inode));
>   rcu_read_unlock();
> 
> + if (to_put) {
> + iput(to_put);
> + to_put = NULL;

considering the scope of to_put, I think assigning it to NULL is redundant
with the variable definition.

> + }
> +
>   err = nfs_end_delegation_return(inode, delegation, 0);
>   iput(inode);
>   nfs_sb_deactive(server->super);
> + cond_resched();
>   if (!err)
>   goto restart;
>   set_bit(NFS4CLNT_DELEGRETURN, >cl_state);
> + if (place_holder)
> + iput(place_holder);
>   return err;
>   }
>   }
>   rcu_read_unlock();
> + if (place_holder)
> + iput(place_holder);
>   return 0;
> }
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 127f534fec94..2d86f9869842 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -403,6 +403,16 @@ static inline void list_splice_tail_init_rcu(struct
> list_head *list,
>>member != (head);\
>pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
> 
> +/**
> + * list_for_each_entry_from_rcu - iterate over a list continuing from current
> point

There is some documentation missing about the requirements for continuing
a RCU list iteration.

The simple case is that both original iteration and continued iteration
need to be in the same RCU read-side critical section.

A more complex case like yours is along those lines:

rcu_read_lock()
list_iteration
   break;
grab inode ref (prohibit removal of node from list)
rcu_read_unlock()

rcu_read_lock()
put inode ref
continue list iteration
rcu_read_unlock()

I'm thinking these allowed patterns should be documented somehow,
else we'll see list iteration being continued across different
RCU read-side critical sections without any reference counting
(or locking) to ensure protection against removal.

Thanks,

Mathieu

> + * @pos: the type * to use as a loop cursor.
> + * @head:the head for your list.
> + * @member:  the name of the list_node within the struct.
> + */
> +#define list_for_each_entry_from_rcu(pos, head, member)  
> \
> + for (; &(pos)->member != (head);
> \
> + pos = list_entry_rcu(pos->member.next, typeof(*(pos)), member))
> +
> /**
>  * hlist_del_rcu - deletes entry from hash list without re-initialization
>  * @n: the element to delete from the hash list.
> --
> 2.14.0.rc0.dirty

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [PATCH] NFS: Avoid quadratic search when freeing delegations.

ctive(server->super))
> - continue;
> + break;
> +
> + if (prev) {
> + struct inode *tmp;

missing whiteline, or the assignment can be done on the variable definition.

> + tmp = nfs_delegation_grab_inode(prev);
> + if (tmp) {
> + to_put = place_holder;
> + place_holder = tmp;
> + }
> + }
> +
>   inode = nfs_delegation_grab_inode(delegation);
>   if (inode == NULL) {
>   rcu_read_unlock();
> @@ -505,16 +542,26 @@ int nfs_client_return_marked_delegations(struct 
> nfs_client
> *clp)
>   delegation = 
> nfs_start_delegation_return_locked(NFS_I(inode));
>   rcu_read_unlock();
> 
> + if (to_put) {
> + iput(to_put);
> + to_put = NULL;

considering the scope of to_put, I think assigning it to NULL is redundant
with the variable definition.

> + }
> +
>   err = nfs_end_delegation_return(inode, delegation, 0);
>   iput(inode);
>   nfs_sb_deactive(server->super);
> + cond_resched();
>   if (!err)
>   goto restart;
>   set_bit(NFS4CLNT_DELEGRETURN, >cl_state);
> + if (place_holder)
> + iput(place_holder);
>   return err;
>   }
>   }
>   rcu_read_unlock();
> + if (place_holder)
> + iput(place_holder);
>   return 0;
> }
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 127f534fec94..2d86f9869842 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -403,6 +403,16 @@ static inline void list_splice_tail_init_rcu(struct
> list_head *list,
>>member != (head);\
>pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
> 
> +/**
> + * list_for_each_entry_from_rcu - iterate over a list continuing from current
> point

There is some documentation missing about the requirements for continuing
a RCU list iteration.

The simple case is that both original iteration and continued iteration
need to be in the same RCU read-side critical section.

A more complex case like yours is along those lines:

rcu_read_lock()
list_iteration
   break;
grab inode ref (prohibit removal of node from list)
rcu_read_unlock()

rcu_read_lock()
put inode ref
continue list iteration
rcu_read_unlock()

I'm thinking these allowed patterns should be documented somehow,
else we'll see list iteration being continued across different
RCU read-side critical sections without any reference counting
(or locking) to ensure protection against removal.

Thanks,

Mathieu

> + * @pos: the type * to use as a loop cursor.
> + * @head:the head for your list.
> + * @member:  the name of the list_node within the struct.
> + */
> +#define list_for_each_entry_from_rcu(pos, head, member)  
> \
> + for (; &(pos)->member != (head);
> \
> + pos = list_entry_rcu(pos->member.next, typeof(*(pos)), member))
> +
> /**
>  * hlist_del_rcu - deletes entry from hash list without re-initialization
>  * @n: the element to delete from the hash list.
> --
> 2.14.0.rc0.dirty

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[RFC PATCH manpages] clock_getres.2: document CLOCK_MONOTONIC suspend behavior

Considering that user-space expects CLOCK_MONOTONIC not to account time
during which system is suspended, explicitly document this behavior as
ABI.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Michael Kerrisk <mtk.manpa...@gmail.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: John Stultz <john.stu...@linaro.org>
CC: Stephen Boyd <sb...@kernel.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
---
 man2/clock_getres.2 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man2/clock_getres.2 b/man2/clock_getres.2
index 0812d159a..ca5ca87c8 100644
--- a/man2/clock_getres.2
+++ b/man2/clock_getres.2
@@ -144,7 +144,7 @@ This clock is not affected by discontinuous jumps in the 
system time
 (e.g., if the system administrator manually changes the clock),
 but is affected by the incremental adjustments performed by
 .BR adjtime (3)
-and NTP.
+and NTP. It does not include time during which the system is suspended.
 .TP
 .BR CLOCK_MONOTONIC_COARSE " (since Linux 2.6.32; Linux-specific)"
 .\" Added in commit da15cfdae03351c689736f8d142618592e3cebc3
-- 
2.11.0

[RFC PATCH manpages] clock_getres.2: document CLOCK_MONOTONIC suspend behavior

Considering that user-space expects CLOCK_MONOTONIC not to account time
during which system is suspended, explicitly document this behavior as
ABI.

Signed-off-by: Mathieu Desnoyers 
CC: Michael Kerrisk 
CC: Thomas Gleixner 
CC: John Stultz 
CC: Stephen Boyd 
CC: Linus Torvalds 
---
 man2/clock_getres.2 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man2/clock_getres.2 b/man2/clock_getres.2
index 0812d159a..ca5ca87c8 100644
--- a/man2/clock_getres.2
+++ b/man2/clock_getres.2
@@ -144,7 +144,7 @@ This clock is not affected by discontinuous jumps in the 
system time
 (e.g., if the system administrator manually changes the clock),
 but is affected by the incremental adjustments performed by
 .BR adjtime (3)
-and NTP.
+and NTP. It does not include time during which the system is suspended.
 .TP
 .BR CLOCK_MONOTONIC_COARSE " (since Linux 2.6.32; Linux-specific)"
 .\" Added in commit da15cfdae03351c689736f8d142618592e3cebc3
-- 
2.11.0

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 26, 2018, at 11:03 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Apr 25, 2018, at 6:51 PM, rostedt rost...@goodmis.org wrote:
> 
>> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
>> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
>> 
>>> One problem with your approach is that you can have multiple callers
>>> for the same tracepoint name, where some could be non-preemptible and
>>> others blocking. Also, there is then no clear way for the callback
>>> registration API to enforce whether the callback expects the tracepoint
>>> to be blocking or non-preemptible. This can introduce hard to diagnose
>>> issues in a kernel without debug options enabled.
>> 
>> I agree that it should not be tied to an implementation name. But
>> "blocking" is confusing. I would say "can_sleep" or some such name that
>> states that the trace point caller is indeed something that can sleep.
> 
> "trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
> me.
> 
>> 
>>> 
>>> Regarding the name, I'm OK with having something along the lines of
>>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>>> that is explicitly tied to the underlying mechanism used internally
>>> however: what we want to convey is that this specific tracepoint probe
>>> can be preempted and block. The underlying implementation could move to
>>> a different RCU flavor brand in the future, and it should not impact
>>> users of the tracepoint APIs.
>>> 
>>> In order to ensure that probes that may block only register themselves
>>> to tracepoints that allow blocking, we should introduce new tracepoint
>>> declaration/definition *and* registration APIs also contain the
>>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>>> allowed to block.
>> 
>> I'd really don't want to add more declaration/definitions, as we
>> already have too many as is, and with different meanings and the number
>> is of incarnations is n! in growth.
>> 
>> I'd say we just stick with a trace__can_sleep() call, and make
>> sure that if that is used that no trace_() call is also used, and
>> enforce this with linker or compiler tricks.
> 
> My main concern is not about having both trace__can_sleep() mixed
> with trace_() calls. It's more about having a registration API allowing
> modules registering probes that may need to sleep to explicitly declare it,
> and enforce that tracepoint never connects a probe that may need to sleep
> with an instrumentation site which cannot sleep.
> 
> I'm unsure what's the best way to achieve this goal though. We could possibly
> extend the tracepoint_probe_register_* APIs to introduce e.g.
> tracepoint_probe_register_prio_flags() and provide a 
> TRACEPOINT_PROBE_CAN_SLEEP
> as parameter upon registration. If this flag is provided, then we could figure
> out
> an way to iterate on all callers, and ensure they are all "can_sleep" type of
> callers.

Iteration on all callers would require that we add some separate section data
for each caller, which we don't have currently. At the moment, the only data
we need is at the tracepoint definition. If we have tons of callers for a given
tracepoint (which might be the case for lockdep), we'd end up consuming a lot of
useless space.

This is one reason why I would prefer to have separate tracepoint definitions
for each of rcuidle, can_sleep, and nonpreemptible (nmi-safe) tracepoints.

Thanks,

Mathieu


> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> 
> 
>> 
>> -- Steve
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 26, 2018, at 11:03 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Apr 25, 2018, at 6:51 PM, rostedt rost...@goodmis.org wrote:
> 
>> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
>> Mathieu Desnoyers  wrote:
>> 
>>> One problem with your approach is that you can have multiple callers
>>> for the same tracepoint name, where some could be non-preemptible and
>>> others blocking. Also, there is then no clear way for the callback
>>> registration API to enforce whether the callback expects the tracepoint
>>> to be blocking or non-preemptible. This can introduce hard to diagnose
>>> issues in a kernel without debug options enabled.
>> 
>> I agree that it should not be tied to an implementation name. But
>> "blocking" is confusing. I would say "can_sleep" or some such name that
>> states that the trace point caller is indeed something that can sleep.
> 
> "trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
> me.
> 
>> 
>>> 
>>> Regarding the name, I'm OK with having something along the lines of
>>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>>> that is explicitly tied to the underlying mechanism used internally
>>> however: what we want to convey is that this specific tracepoint probe
>>> can be preempted and block. The underlying implementation could move to
>>> a different RCU flavor brand in the future, and it should not impact
>>> users of the tracepoint APIs.
>>> 
>>> In order to ensure that probes that may block only register themselves
>>> to tracepoints that allow blocking, we should introduce new tracepoint
>>> declaration/definition *and* registration APIs also contain the
>>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>>> allowed to block.
>> 
>> I'd really don't want to add more declaration/definitions, as we
>> already have too many as is, and with different meanings and the number
>> is of incarnations is n! in growth.
>> 
>> I'd say we just stick with a trace__can_sleep() call, and make
>> sure that if that is used that no trace_() call is also used, and
>> enforce this with linker or compiler tricks.
> 
> My main concern is not about having both trace__can_sleep() mixed
> with trace_() calls. It's more about having a registration API allowing
> modules registering probes that may need to sleep to explicitly declare it,
> and enforce that tracepoint never connects a probe that may need to sleep
> with an instrumentation site which cannot sleep.
> 
> I'm unsure what's the best way to achieve this goal though. We could possibly
> extend the tracepoint_probe_register_* APIs to introduce e.g.
> tracepoint_probe_register_prio_flags() and provide a 
> TRACEPOINT_PROBE_CAN_SLEEP
> as parameter upon registration. If this flag is provided, then we could figure
> out
> an way to iterate on all callers, and ensure they are all "can_sleep" type of
> callers.

Iteration on all callers would require that we add some separate section data
for each caller, which we don't have currently. At the moment, the only data
we need is at the tracepoint definition. If we have tons of callers for a given
tracepoint (which might be the case for lockdep), we'd end up consuming a lot of
useless space.

This is one reason why I would prefer to have separate tracepoint definitions
for each of rcuidle, can_sleep, and nonpreemptible (nmi-safe) tracepoints.

Thanks,

Mathieu


> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> 
> 
>> 
>> -- Steve
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 25, 2018, at 7:13 PM, Joel Fernandes joe...@google.com wrote:

> Hi Mathieu,
> 
> On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
>> - On Apr 25, 2018, at 5:27 PM, Joel Fernandes joe...@google.com wrote:
>>
>>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
>>> <paul...@linux.vnet.ibm.com> wrote:
>>> [..]
>>>>> >
>>>>> > Sounds good, thanks.
>>>>> >
>>>>> > Also I found the reason for my boot issue. It was because the
>>>>> > init_srcu_struct in the prototype was being done in an initcall.
>>>>> > Instead if I do it in start_kernel before the tracepoint is used, it
>>>>> > fixes it (although I don't know if this is dangerous to do like this
>>>>> > but I can get it to boot atleast.. Let me know if this isn't the
>>>>> > right way to do it, or if something else could go wrong)
>>>>> >
>>>>> > diff --git a/init/main.c b/init/main.c
>>>>> > index 34823072ef9e..ecc88319c6da 100644
>>>>> > --- a/init/main.c
>>>>> > +++ b/init/main.c
>>>>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
>>>>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
>>>>> > early_boot_irqs_disabled = false;
>>>>> >
>>>>> > +   init_srcu_struct(_srcu);
>>>>> > lockdep_init_early();
>>>>> >
>>>>> > local_irq_enable();
>>>>> > --
>>>>> >
>>>>> > I benchmarked it and the performance also looks quite good compared
>>>>> > to the rcu tracepoint version.
>>>>> >
>>>>> > If you, Paul and other think doing the init_srcu_struct like this
>>>>> > should be Ok, then I can try to work more on your srcu prototype and
>>>>> > roll into my series and post them in the next RFC series (or let me
>>>>> > know if you wanted to work your srcu stuff in a separate series..).
>>>>>
>>>>> That is definitely not what I was expecting, but let's see if it works
>>>>> anyway...  ;-)
>>>>>
>>>>> But first, I was instead expecting something like this:
>>>>>
>>>>> DEFINE_SRCU(tracepoint_srcu);
>>>>>
>>>>> With this approach, some of the initialization happens at compile time
>>>>> and the rest happens at the first call_srcu().
>>>>>
>>>>> This will work -only- if the first call_srcu() doesn't happen until after
>>>>> workqueue_init_early() has been invoked.  Which I believe must have been
>>>>> the case in your testing, because otherwise it looks like __call_srcu()
>>>>> would have complained bitterly.
>>>>>
>>>>> On the other hand, if you need to invoke call_srcu() before the call
>>>>> to workqueue_init_early(), then you need the patch that I am beating
>>>>> into shape.  Plus you would need to use DEFINE_SRCU() and to avoid
>>>>> invoking init_srcu_struct().
>>>>
>>>> And here is the patch.  I do not intend to send it upstream unless it
>>>> actually proves necessary, and it appears that current SRCU does what
>>>> you need.
>>>>
>>>> You would only need this patch if you wanted to invoke call_srcu()
>>>> before workqueue_init_early() was called, which does not seem likely.
>>>
>>> Cool. So I was chatting with Paul and just to update everyone as well,
>>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
>>> can make it past boot too (thanks Paul!). Also I don't see a reason we
>>> need the RCU callback to execute early and its fine if it runs later.
>>>
>>> Also, I was thinking of introducing a separate trace_*event*_srcu API
>>> as a replacement to the _rcuidle API. Then I can make use of it for my
>>> tracepoints, and then later can use it for the other tracepoints
>>> needing _rcuidle. After that we can finally get rid of the _rcuidle
>>> API if there are no other users of it. This is just a rough plan, but
>>> let me know if there's any issue with this plan that you can think
>>> off.
>>> IMO, I believe its simpler if the caller worries about whether it can
>>> tole

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 25, 2018, at 7:13 PM, Joel Fernandes joe...@google.com wrote:

> Hi Mathieu,
> 
> On Wed, Apr 25, 2018 at 2:40 PM, Mathieu Desnoyers
>  wrote:
>> - On Apr 25, 2018, at 5:27 PM, Joel Fernandes joe...@google.com wrote:
>>
>>> On Tue, Apr 24, 2018 at 9:20 PM, Paul E. McKenney
>>>  wrote:
>>> [..]
>>>>> >
>>>>> > Sounds good, thanks.
>>>>> >
>>>>> > Also I found the reason for my boot issue. It was because the
>>>>> > init_srcu_struct in the prototype was being done in an initcall.
>>>>> > Instead if I do it in start_kernel before the tracepoint is used, it
>>>>> > fixes it (although I don't know if this is dangerous to do like this
>>>>> > but I can get it to boot atleast.. Let me know if this isn't the
>>>>> > right way to do it, or if something else could go wrong)
>>>>> >
>>>>> > diff --git a/init/main.c b/init/main.c
>>>>> > index 34823072ef9e..ecc88319c6da 100644
>>>>> > --- a/init/main.c
>>>>> > +++ b/init/main.c
>>>>> > @@ -631,6 +631,7 @@ asmlinkage __visible void __init start_kernel(void)
>>>>> > WARN(!irqs_disabled(), "Interrupts were enabled early\n");
>>>>> > early_boot_irqs_disabled = false;
>>>>> >
>>>>> > +   init_srcu_struct(_srcu);
>>>>> > lockdep_init_early();
>>>>> >
>>>>> > local_irq_enable();
>>>>> > --
>>>>> >
>>>>> > I benchmarked it and the performance also looks quite good compared
>>>>> > to the rcu tracepoint version.
>>>>> >
>>>>> > If you, Paul and other think doing the init_srcu_struct like this
>>>>> > should be Ok, then I can try to work more on your srcu prototype and
>>>>> > roll into my series and post them in the next RFC series (or let me
>>>>> > know if you wanted to work your srcu stuff in a separate series..).
>>>>>
>>>>> That is definitely not what I was expecting, but let's see if it works
>>>>> anyway...  ;-)
>>>>>
>>>>> But first, I was instead expecting something like this:
>>>>>
>>>>> DEFINE_SRCU(tracepoint_srcu);
>>>>>
>>>>> With this approach, some of the initialization happens at compile time
>>>>> and the rest happens at the first call_srcu().
>>>>>
>>>>> This will work -only- if the first call_srcu() doesn't happen until after
>>>>> workqueue_init_early() has been invoked.  Which I believe must have been
>>>>> the case in your testing, because otherwise it looks like __call_srcu()
>>>>> would have complained bitterly.
>>>>>
>>>>> On the other hand, if you need to invoke call_srcu() before the call
>>>>> to workqueue_init_early(), then you need the patch that I am beating
>>>>> into shape.  Plus you would need to use DEFINE_SRCU() and to avoid
>>>>> invoking init_srcu_struct().
>>>>
>>>> And here is the patch.  I do not intend to send it upstream unless it
>>>> actually proves necessary, and it appears that current SRCU does what
>>>> you need.
>>>>
>>>> You would only need this patch if you wanted to invoke call_srcu()
>>>> before workqueue_init_early() was called, which does not seem likely.
>>>
>>> Cool. So I was chatting with Paul and just to update everyone as well,
>>> I tried the DEFINE_SRCU instead of the late init_srcu_struct call and
>>> can make it past boot too (thanks Paul!). Also I don't see a reason we
>>> need the RCU callback to execute early and its fine if it runs later.
>>>
>>> Also, I was thinking of introducing a separate trace_*event*_srcu API
>>> as a replacement to the _rcuidle API. Then I can make use of it for my
>>> tracepoints, and then later can use it for the other tracepoints
>>> needing _rcuidle. After that we can finally get rid of the _rcuidle
>>> API if there are no other users of it. This is just a rough plan, but
>>> let me know if there's any issue with this plan that you can think
>>> off.
>>> IMO, I believe its simpler if the caller worries about whether it can
>>> tolerate if tracepoint probes can block or not, than making it a
>>> property

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 25, 2018, at 6:51 PM, rostedt rost...@goodmis.org wrote:

> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> 
>> One problem with your approach is that you can have multiple callers
>> for the same tracepoint name, where some could be non-preemptible and
>> others blocking. Also, there is then no clear way for the callback
>> registration API to enforce whether the callback expects the tracepoint
>> to be blocking or non-preemptible. This can introduce hard to diagnose
>> issues in a kernel without debug options enabled.
> 
> I agree that it should not be tied to an implementation name. But
> "blocking" is confusing. I would say "can_sleep" or some such name that
> states that the trace point caller is indeed something that can sleep.

"trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
me.

> 
>> 
>> Regarding the name, I'm OK with having something along the lines of
>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>> that is explicitly tied to the underlying mechanism used internally
>> however: what we want to convey is that this specific tracepoint probe
>> can be preempted and block. The underlying implementation could move to
>> a different RCU flavor brand in the future, and it should not impact
>> users of the tracepoint APIs.
>> 
>> In order to ensure that probes that may block only register themselves
>> to tracepoints that allow blocking, we should introduce new tracepoint
>> declaration/definition *and* registration APIs also contain the
>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>> allowed to block.
> 
> I'd really don't want to add more declaration/definitions, as we
> already have too many as is, and with different meanings and the number
> is of incarnations is n! in growth.
> 
> I'd say we just stick with a trace__can_sleep() call, and make
> sure that if that is used that no trace_() call is also used, and
> enforce this with linker or compiler tricks.

My main concern is not about having both trace__can_sleep() mixed
with trace_() calls. It's more about having a registration API allowing
modules registering probes that may need to sleep to explicitly declare it,
and enforce that tracepoint never connects a probe that may need to sleep
with an instrumentation site which cannot sleep.

I'm unsure what's the best way to achieve this goal though. We could possibly
extend the tracepoint_probe_register_* APIs to introduce e.g.
tracepoint_probe_register_prio_flags() and provide a TRACEPOINT_PROBE_CAN_SLEEP
as parameter upon registration. If this flag is provided, then we could figure 
out
an way to iterate on all callers, and ensure they are all "can_sleep" type of
callers.

Thoughts ?

Thanks,

Mathieu



> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 25, 2018, at 6:51 PM, rostedt rost...@goodmis.org wrote:

> On Wed, 25 Apr 2018 17:40:56 -0400 (EDT)
> Mathieu Desnoyers  wrote:
> 
>> One problem with your approach is that you can have multiple callers
>> for the same tracepoint name, where some could be non-preemptible and
>> others blocking. Also, there is then no clear way for the callback
>> registration API to enforce whether the callback expects the tracepoint
>> to be blocking or non-preemptible. This can introduce hard to diagnose
>> issues in a kernel without debug options enabled.
> 
> I agree that it should not be tied to an implementation name. But
> "blocking" is confusing. I would say "can_sleep" or some such name that
> states that the trace point caller is indeed something that can sleep.

"trace_*event*_{can,might,may}_sleep" are all acceptable candidates for
me.

> 
>> 
>> Regarding the name, I'm OK with having something along the lines of
>> trace_*event*_blocking or such. Please don't use "srcu" or other naming
>> that is explicitly tied to the underlying mechanism used internally
>> however: what we want to convey is that this specific tracepoint probe
>> can be preempted and block. The underlying implementation could move to
>> a different RCU flavor brand in the future, and it should not impact
>> users of the tracepoint APIs.
>> 
>> In order to ensure that probes that may block only register themselves
>> to tracepoints that allow blocking, we should introduce new tracepoint
>> declaration/definition *and* registration APIs also contain the
>> "BLOCKING/blocking" keywords (or such), so we can ensure that a
>> tracepoint probe being registered to a "blocking" tracepoint is indeed
>> allowed to block.
> 
> I'd really don't want to add more declaration/definitions, as we
> already have too many as is, and with different meanings and the number
> is of incarnations is n! in growth.
> 
> I'd say we just stick with a trace__can_sleep() call, and make
> sure that if that is used that no trace_() call is also used, and
> enforce this with linker or compiler tricks.

My main concern is not about having both trace__can_sleep() mixed
with trace_() calls. It's more about having a registration API allowing
modules registering probes that may need to sleep to explicitly declare it,
and enforce that tracepoint never connects a probe that may need to sleep
with an instrumentation site which cannot sleep.

I'm unsure what's the best way to achieve this goal though. We could possibly
extend the tracepoint_probe_register_* APIs to introduce e.g.
tracepoint_probe_register_prio_flags() and provide a TRACEPOINT_PROBE_CAN_SLEEP
as parameter upon registration. If this flag is provided, then we could figure 
out
an way to iterate on all callers, and ensure they are all "can_sleep" type of
callers.

Thoughts ?

Thanks,

Mathieu



> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-25 Thread Mathieu Desnoyers

 is explicitly tied to the underlying mechanism used internally
however: what we want to convey is that this specific tracepoint probe
can be preempted and block. The underlying implementation could move to
a different RCU flavor brand in the future, and it should not impact
users of the tracepoint APIs.

In order to ensure that probes that may block only register themselves
to tracepoints that allow blocking, we should introduce new tracepoint
declaration/definition *and* registration APIs also contain the
"BLOCKING/blocking" keywords (or such), so we can ensure that a
tracepoint probe being registered to a "blocking" tracepoint is indeed
allowed to block.

Thanks,

Mathieu


> 
> Thanks,
> 
>  - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-25 Thread Mathieu Desnoyers

ing mechanism used internally
however: what we want to convey is that this specific tracepoint probe
can be preempted and block. The underlying implementation could move to
a different RCU flavor brand in the future, and it should not impact
users of the tracepoint APIs.

In order to ensure that probes that may block only register themselves
to tracepoints that allow blocking, we should introduce new tracepoint
declaration/definition *and* registration APIs also contain the
"BLOCKING/blocking" keywords (or such), so we can ensure that a
tracepoint probe being registered to a "blocking" tracepoint is indeed
allowed to block.

Thanks,

Mathieu


> 
> Thanks,
> 
>  - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-24 Thread Mathieu Desnoyers

- On Apr 24, 2018, at 2:59 PM, Joel Fernandes joe...@google.com wrote:

> Hi Paul,
> 
> On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
>> On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
>>> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
>>> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
>>> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
>>> > > <paul...@linux.vnet.ibm.com> wrote:
>>> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>>> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>>> > > >> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
>>> > > >>
>>> > > >>
>>> > > >> > I'm inclined to explicitly declare the tracepoints with their given
>>> > > >> > synchronization method. Tracepoint probe callback functions for 
>>> > > >> > currently
>>> > > >> > existing tracepoints expect to have preemption disabled when 
>>> > > >> > invoked.
>>> > > >> > This assumption will not be true anymore for srcu-tracepoints.
>>> > > >>
>>> > > >> Actually, why not have a flag attached to the tracepoint_func that
>>> > > >> states if it expects preemption to be enabled or not? If a
>>> > > >> trace_##event##_srcu() is called, then simply disable preemption 
>>> > > >> before
>>> > > >> calling the callbacks for it. That way if a callback is fine for use
>>> > > >> with srcu, then it would require calling
>>> > > >>
>>> > > >>   register_trace_##event##_may_sleep();
>>> > > >>
>>> > > >> Then if someone uses this on a tracepoint where preemption is 
>>> > > >> disabled,
>>> > > >> we simply do not call it.
>>> > > >
>>> > > > One more stupid question...  If we are having to trace so much stuff
>>> > > > in the idle loop, are we perhaps grossly overstating the extent of 
>>> > > > that
>>> > > > "idle" loop?  For being called "idle", this code seems quite busy!
>>> > >
>>> > > ;-)
>>> > > The performance hit I am observing is when running a heavy workload,
>>> > > like hackbench or something like that. That's what I am trying to
>>> > > correct.
>>> > > By the way is there any limitation on using SRCU too early during
>>> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
>>> > > hangs pretty early in the boot. I register lockdep probes in
>>> > > start_kernel. I am hoping that's not why.
>>> > >
>>> > > I could also have just screwed up the backporting... may be for my
>>> > > testing, I will just replace the rcu API with the srcu instead of all
>>> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>>> > > do right now is measure the performance of my patches with SRCU.
>>> >
>>> > Gah, yes, there is an entry on my capacious todo list on making SRCU
>>> > grace periods work during early boot and mid-boot.  Let me see what
>>> > I can do...
>>>
>>> OK, just need to verify that you are OK with call_srcu()'s callbacks
>>> not being invoked until sometime during core_initcall() time.  (If you
>>> really do need them to be invoked before that, in theory it is possible,
>>> but in practice it is weird, even for RCU.)
>>
>> Oh, and that early at boot, you will need to use DEFINE_SRCU() or
>> DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>>
>> Thanx, Paul
>>
> 
> Oh ok.
> 
> About call_rcu, calling it later may be an issue since we register the
> probes in start_kernel, for the first probe call_rcu will be sched,
> but for the second one I think it'll try to call_rcu to get rid of the
> first one.
> 
> This is the relevant code that gets called when probes are added:
> 
> static inline void release_probes(struct tracepoint_func *old)
> {
>if (old) {
>    struct tp_probes *tp_probes = container_of(old,
>struct tp_probes, probes[0]);
>call_rcu_sched(_probes->rcu, rcu_free_old_probes);
>}
> }
> 
> Maybe we can somehow defer the call_srcu until later? Would that be possible?
> 
> also Mathieu, you didn't modify the call_rcu_sched in your prototype
> to be changed to use call_srcu, should you be doing that?

You're right, I think I should have introduced a call_srcu in there.
It's missing in my prototype.

However, in the prototype I did, we need to wait for *both* sched-rcu
and SRCU grace periods, because we don't track which site is using which
rcu flavor.

So you could achieve this relatively easily by means of two chained
RCU callbacks, e.g.:

release_probes() calls call_rcu_sched(... , rcu_free_old_probes)

and then in rcu_free_old_probes() do:

call_srcu(... , srcu_free_old_probes)

and perform kfree(container_of(head, struct tp_probes, rcu));
within srcu_free_old_probes.

It is somewhat a hack, but should work.

Thanks,

Mathieu

> 
> thanks,
> 
>  - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

2018-04-24 Thread Mathieu Desnoyers

- On Apr 24, 2018, at 2:59 PM, Joel Fernandes joe...@google.com wrote:

> Hi Paul,
> 
> On Tue, Apr 24, 2018 at 11:26 AM, Paul E. McKenney
>  wrote:
>> On Tue, Apr 24, 2018 at 11:23:02AM -0700, Paul E. McKenney wrote:
>>> On Tue, Apr 24, 2018 at 10:26:58AM -0700, Paul E. McKenney wrote:
>>> > On Tue, Apr 24, 2018 at 09:01:34AM -0700, Joel Fernandes wrote:
>>> > > On Tue, Apr 24, 2018 at 8:56 AM, Paul E. McKenney
>>> > >  wrote:
>>> > > > On Mon, Apr 23, 2018 at 05:22:44PM -0400, Steven Rostedt wrote:
>>> > > >> On Mon, 23 Apr 2018 13:12:21 -0400 (EDT)
>>> > > >> Mathieu Desnoyers  wrote:
>>> > > >>
>>> > > >>
>>> > > >> > I'm inclined to explicitly declare the tracepoints with their given
>>> > > >> > synchronization method. Tracepoint probe callback functions for 
>>> > > >> > currently
>>> > > >> > existing tracepoints expect to have preemption disabled when 
>>> > > >> > invoked.
>>> > > >> > This assumption will not be true anymore for srcu-tracepoints.
>>> > > >>
>>> > > >> Actually, why not have a flag attached to the tracepoint_func that
>>> > > >> states if it expects preemption to be enabled or not? If a
>>> > > >> trace_##event##_srcu() is called, then simply disable preemption 
>>> > > >> before
>>> > > >> calling the callbacks for it. That way if a callback is fine for use
>>> > > >> with srcu, then it would require calling
>>> > > >>
>>> > > >>   register_trace_##event##_may_sleep();
>>> > > >>
>>> > > >> Then if someone uses this on a tracepoint where preemption is 
>>> > > >> disabled,
>>> > > >> we simply do not call it.
>>> > > >
>>> > > > One more stupid question...  If we are having to trace so much stuff
>>> > > > in the idle loop, are we perhaps grossly overstating the extent of 
>>> > > > that
>>> > > > "idle" loop?  For being called "idle", this code seems quite busy!
>>> > >
>>> > > ;-)
>>> > > The performance hit I am observing is when running a heavy workload,
>>> > > like hackbench or something like that. That's what I am trying to
>>> > > correct.
>>> > > By the way is there any limitation on using SRCU too early during
>>> > > boot? I backported Mathieu's srcu tracepoint patches but the kernel
>>> > > hangs pretty early in the boot. I register lockdep probes in
>>> > > start_kernel. I am hoping that's not why.
>>> > >
>>> > > I could also have just screwed up the backporting... may be for my
>>> > > testing, I will just replace the rcu API with the srcu instead of all
>>> > > of Mathieu's new TRACE_EVENT macros for SRCU, since all I am trying to
>>> > > do right now is measure the performance of my patches with SRCU.
>>> >
>>> > Gah, yes, there is an entry on my capacious todo list on making SRCU
>>> > grace periods work during early boot and mid-boot.  Let me see what
>>> > I can do...
>>>
>>> OK, just need to verify that you are OK with call_srcu()'s callbacks
>>> not being invoked until sometime during core_initcall() time.  (If you
>>> really do need them to be invoked before that, in theory it is possible,
>>> but in practice it is weird, even for RCU.)
>>
>> Oh, and that early at boot, you will need to use DEFINE_SRCU() or
>> DEFINE_STATIC_SRCU() rather than dynamic allocation and initialization.
>>
>> Thanx, Paul
>>
> 
> Oh ok.
> 
> About call_rcu, calling it later may be an issue since we register the
> probes in start_kernel, for the first probe call_rcu will be sched,
> but for the second one I think it'll try to call_rcu to get rid of the
> first one.
> 
> This is the relevant code that gets called when probes are added:
> 
> static inline void release_probes(struct tracepoint_func *old)
> {
>if (old) {
>    struct tp_probes *tp_probes = container_of(old,
>struct tp_probes, probes[0]);
>call_rcu_sched(_probes->rcu, rcu_free_old_probes);
>}
> }
> 
> Maybe we can somehow defer the call_srcu until later? Would that be possible?
> 
> also Mathieu, you didn't modify the call_rcu_sched in your prototype
> to be changed to use call_srcu, should you be doing that?

You're right, I think I should have introduced a call_srcu in there.
It's missing in my prototype.

However, in the prototype I did, we need to wait for *both* sched-rcu
and SRCU grace periods, because we don't track which site is using which
rcu flavor.

So you could achieve this relatively easily by means of two chained
RCU callbacks, e.g.:

release_probes() calls call_rcu_sched(... , rcu_free_old_probes)

and then in rcu_free_old_probes() do:

call_srcu(... , srcu_free_old_probes)

and perform kfree(container_of(head, struct tp_probes, rcu));
within srcu_free_old_probes.

It is somewhat a hack, but should work.

Thanks,

Mathieu

> 
> thanks,
> 
>  - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 23, 2018, at 12:18 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 23 Apr 2018 10:59:43 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> 
>> The main open question here is whether we want one SRCU grace period
>> domain per SRCU tracepoint definition, or just one SRCU domain for all
>> SRCU tracepoints would be fine.
>> 
>> I'm not sure what we would gain by having the extra granularity provided
>> by one SRCU grace period domain per tracepoint, and having a single SRCU
>> domain for all SRCU tracepoints makes it easy to batch grace period after
>> bulk tracepoint modifications.
>> 
>> Thoughts ?
> 
> I didn't think too much depth in this. It was more of just a brain
> storming idea. Yeah, one singe RCU domain may be good enough. I was
> thinking more of how to know when a tracepoint required the SRCU domain
> as supposed to a preempt disabled domain, and wanted to just suggest
> the linker script approach.
> 
> This is how I detect if trace_printk() is used anywhere in the kernel
> (and do that big warning if it is). That way the trace events don't
> need to be created any special way. You just use the trace_##event##_X
> flavor and it automatically detects what to do. But we need to make
> sure the same event isn't used for multiple flavors (SRCU vs schedule),
> or maybe we can, and any change would have to do both synchronizations.

The approach I used for synchronize rcu a few years ago when I did a srcu
tracepoint prototype [1] was simply this:

 static inline void tracepoint_synchronize_unregister(void)
 {
+   synchronize_srcu(_srcu);
synchronize_sched();
 }

So whenever we synchronize after tracepoint unregistration, the tracepoint
code always issue both synchronize_sched() and SRCU synchronize. This way,
tracepoint API users don't have to care about the kind of tracepoint they
are updating.

I'm inclined to explicitly declare the tracepoints with their given
synchronization method. Tracepoint probe callback functions for currently
existing tracepoints expect to have preemption disabled when invoked.
This assumption will not be true anymore for srcu-tracepoints.

Thanks,

Mathieu

[1] https://github.com/compudj/linux-dev/commits/tracepoint-srcu

> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 23, 2018, at 12:18 PM, rostedt rost...@goodmis.org wrote:

> On Mon, 23 Apr 2018 10:59:43 -0400 (EDT)
> Mathieu Desnoyers  wrote:
> 
>> The main open question here is whether we want one SRCU grace period
>> domain per SRCU tracepoint definition, or just one SRCU domain for all
>> SRCU tracepoints would be fine.
>> 
>> I'm not sure what we would gain by having the extra granularity provided
>> by one SRCU grace period domain per tracepoint, and having a single SRCU
>> domain for all SRCU tracepoints makes it easy to batch grace period after
>> bulk tracepoint modifications.
>> 
>> Thoughts ?
> 
> I didn't think too much depth in this. It was more of just a brain
> storming idea. Yeah, one singe RCU domain may be good enough. I was
> thinking more of how to know when a tracepoint required the SRCU domain
> as supposed to a preempt disabled domain, and wanted to just suggest
> the linker script approach.
> 
> This is how I detect if trace_printk() is used anywhere in the kernel
> (and do that big warning if it is). That way the trace events don't
> need to be created any special way. You just use the trace_##event##_X
> flavor and it automatically detects what to do. But we need to make
> sure the same event isn't used for multiple flavors (SRCU vs schedule),
> or maybe we can, and any change would have to do both synchronizations.

The approach I used for synchronize rcu a few years ago when I did a srcu
tracepoint prototype [1] was simply this:

 static inline void tracepoint_synchronize_unregister(void)
 {
+   synchronize_srcu(_srcu);
synchronize_sched();
 }

So whenever we synchronize after tracepoint unregistration, the tracepoint
code always issue both synchronize_sched() and SRCU synchronize. This way,
tracepoint API users don't have to care about the kind of tracepoint they
are updating.

I'm inclined to explicitly declare the tracepoints with their given
synchronization method. Tracepoint probe callback functions for currently
existing tracepoints expect to have preemption disabled when invoked.
This assumption will not be true anymore for srcu-tracepoints.

Thanks,

Mathieu

[1] https://github.com/compudj/linux-dev/commits/tracepoint-srcu

> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 23, 2018, at 10:53 AM, rostedt rost...@goodmis.org wrote:

> On Mon, 23 Apr 2018 10:31:28 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> 
>> I've been wanting to introduce an alternative tracepoint instrumentation
>> "flavor" for e.g. system call entry/exit which rely on SRCU rather than
>> sched-rcu (preempt-off). This would allow taking faults within the
>> instrumentation
>> probe, which makes lots of things easier when fetching data from user-space
>> upon system call entry/exit. This could also be used to cleanly instrument
>> the idle loop.
> 
> I'd be OK with such an approach. And I don't think it would be that
> hard to implement. It could be similar to the rcu_idle() tracepoints,
> where each flavor simply passes in what protection it uses for
> DO_TRACE(). We could do linker tricks to tell the tracepoint.c code how
> the tracepoint is protected (add section code, that could be read to
> update flags in the tracepoint). Of course modules that have
> tracepoints could only use the standard preempt ones.
> 
> That is, if trace_##event##_srcu(trace_##event##_sp, PARAMS), is used,
> then the trace_##event##_sp would need to be created somewhere. The use
> of trace_##event##_srcu() would create a section entry, and on boot up
> we can see that the use of this tracepoint requires srcu protection
> with a pointer to the trace_##event##_sp srcu_struct. This could be
> used to make sure that trace_#event() call isn't done multiple times
> that uses two different protection flavors.
> 
> I'm just brain storming the idea, and I'm sure I screwed up something
> above, but I do believe it is feasible.

The main open question here is whether we want one SRCU grace period
domain per SRCU tracepoint definition, or just one SRCU domain for all
SRCU tracepoints would be fine.

I'm not sure what we would gain by having the extra granularity provided
by one SRCU grace period domain per tracepoint, and having a single SRCU
domain for all SRCU tracepoints makes it easy to batch grace period after
bulk tracepoint modifications.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

- On Apr 23, 2018, at 10:53 AM, rostedt rost...@goodmis.org wrote:

> On Mon, 23 Apr 2018 10:31:28 -0400 (EDT)
> Mathieu Desnoyers  wrote:
> 
>> I've been wanting to introduce an alternative tracepoint instrumentation
>> "flavor" for e.g. system call entry/exit which rely on SRCU rather than
>> sched-rcu (preempt-off). This would allow taking faults within the
>> instrumentation
>> probe, which makes lots of things easier when fetching data from user-space
>> upon system call entry/exit. This could also be used to cleanly instrument
>> the idle loop.
> 
> I'd be OK with such an approach. And I don't think it would be that
> hard to implement. It could be similar to the rcu_idle() tracepoints,
> where each flavor simply passes in what protection it uses for
> DO_TRACE(). We could do linker tricks to tell the tracepoint.c code how
> the tracepoint is protected (add section code, that could be read to
> update flags in the tracepoint). Of course modules that have
> tracepoints could only use the standard preempt ones.
> 
> That is, if trace_##event##_srcu(trace_##event##_sp, PARAMS), is used,
> then the trace_##event##_sp would need to be created somewhere. The use
> of trace_##event##_srcu() would create a section entry, and on boot up
> we can see that the use of this tracepoint requires srcu protection
> with a pointer to the trace_##event##_sp srcu_struct. This could be
> used to make sure that trace_#event() call isn't done multiple times
> that uses two different protection flavors.
> 
> I'm just brain storming the idea, and I'm sure I screwed up something
> above, but I do believe it is feasible.

The main open question here is whether we want one SRCU grace period
domain per SRCU tracepoint definition, or just one SRCU domain for all
SRCU tracepoints would be fine.

I'm not sure what we would gain by having the extra granularity provided
by one SRCU grace period domain per tracepoint, and having a single SRCU
domain for all SRCU tracepoints makes it easy to batch grace period after
bulk tracepoint modifications.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

so adds other complications such as
>> that it can't be used from the idle path, that's why the
>> rcu_irq_enter_* was added in the first place. Would be nice if we can
>> just avoid these RCU calls for the preempt/irq tracepoints... Any
>> thoughts about this or any other ideas to solve this?
> 
> In theory, the tracepoint code could use SRCU instead of RCU, given that
> SRCU readers can be in the idle loop, although at the expense of a couple
> of smp_mb() calls in each tracepoint.  In practice, I must defer to the
> people who know the tracepoint code better than I.

I've been wanting to introduce an alternative tracepoint instrumentation
"flavor" for e.g. system call entry/exit which rely on SRCU rather than
sched-rcu (preempt-off). This would allow taking faults within the 
instrumentation
probe, which makes lots of things easier when fetching data from user-space
upon system call entry/exit. This could also be used to cleanly instrument
the idle loop.

I would be tempted to proceed carefully and introduce a new kind of SRCU
tracepoint rather than changing all existing ones from sched-rcu to SRCU
though.

So the lockdep stuff could use the SRCU tracepoint flavor, which I guess
would be faster than the rcu_irq_enter_*().

Thanks,

Mathieu


> 
>   Thanx, Paul
> 
>> Meanwhile I'll also do some performance testing with Matsami's idea as well..
>> 
>> thanks,
>> 
>> - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

e path, that's why the
>> rcu_irq_enter_* was added in the first place. Would be nice if we can
>> just avoid these RCU calls for the preempt/irq tracepoints... Any
>> thoughts about this or any other ideas to solve this?
> 
> In theory, the tracepoint code could use SRCU instead of RCU, given that
> SRCU readers can be in the idle loop, although at the expense of a couple
> of smp_mb() calls in each tracepoint.  In practice, I must defer to the
> people who know the tracepoint code better than I.

I've been wanting to introduce an alternative tracepoint instrumentation
"flavor" for e.g. system call entry/exit which rely on SRCU rather than
sched-rcu (preempt-off). This would allow taking faults within the 
instrumentation
probe, which makes lots of things easier when fetching data from user-space
upon system call entry/exit. This could also be used to cleanly instrument
the idle loop.

I would be tempted to proceed carefully and introduce a new kind of SRCU
tracepoint rather than changing all existing ones from sched-rcu to SRCU
though.

So the lockdep stuff could use the SRCU tracepoint flavor, which I guess
would be faster than the rcu_irq_enter_*().

Thanks,

Mathieu


> 
>   Thanx, Paul
> 
>> Meanwhile I'll also do some performance testing with Matsami's idea as well..
>> 
>> thanks,
>> 
>> - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 16, 2018, at 3:26 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
>>
>> And I try very hard to avoid being told I'm the one breaking
>> user-space. ;-)
> 
> You *can't* be breaking user space. User space doesn't use this yet.
> 
> That's actually why I'd like to start with the minimal set - to make
> sure we don't introduce features that will come back to bite us later.
> 
> The one compelling use case I saw was a memory allocator that used
> this for getting per-CPU (vs per-thread) memory scaling.
> 
> That code didn't need the cpu_opv system call at all.
> 
> And if somebody does a ldload of a malloc library, and then wants to
> analyze the behavior of a program, maybe they should ldload their own
> malloc routines first? That's pretty much par for the course for those
> kinds of projects.
> 
> So I'd much rather we first merge the non-contentious parts that
> actually have some numbers for "this improves performance and makes a
> nice fancy malloc possible".
> 
> As it is, the cpu_opv seems to be all about theory, not about actual need.

I fully get your point about getting the minimal feature in. So let's focus
on rseq only.

I will rework the patchset so the rseq selftests don't depend on cpu_opv,
and remove the cpu_opv stuff. I think it would be a good start for the
Facebook guys (jemalloc), given that just rseq seems to be enough for them
for now. It should be enough for the arm64 performance counters as well.

Then we'll figure out what is needed to make other projects use it based on
their needs (e.g. lttng-ust, liburcu, glibc malloc), and whether jemalloc
end up requiring cpu_opv for memory migration between per-cpu pools after all.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 16, 2018, at 3:26 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers
>  wrote:
>>
>> And I try very hard to avoid being told I'm the one breaking
>> user-space. ;-)
> 
> You *can't* be breaking user space. User space doesn't use this yet.
> 
> That's actually why I'd like to start with the minimal set - to make
> sure we don't introduce features that will come back to bite us later.
> 
> The one compelling use case I saw was a memory allocator that used
> this for getting per-CPU (vs per-thread) memory scaling.
> 
> That code didn't need the cpu_opv system call at all.
> 
> And if somebody does a ldload of a malloc library, and then wants to
> analyze the behavior of a program, maybe they should ldload their own
> malloc routines first? That's pretty much par for the course for those
> kinds of projects.
> 
> So I'd much rather we first merge the non-contentious parts that
> actually have some numbers for "this improves performance and makes a
> nice fancy malloc possible".
> 
> As it is, the cpu_opv seems to be all about theory, not about actual need.

I fully get your point about getting the minimal feature in. So let's focus
on rseq only.

I will rework the patchset so the rseq selftests don't depend on cpu_opv,
and remove the cpu_opv stuff. I think it would be a good start for the
Facebook guys (jemalloc), given that just rseq seems to be enough for them
for now. It should be enough for the arm64 performance counters as well.

Then we'll figure out what is needed to make other projects use it based on
their needs (e.g. lttng-ust, liburcu, glibc malloc), and whether jemalloc
end up requiring cpu_opv for memory migration between per-cpu pools after all.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 16, 2018, at 2:39 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Mon, Apr 16, 2018 at 11:35 AM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
>> Specifically for single-stepping, the __rseq_table section introduced
>> at user-level will allow newer debuggers and tools which do line and
>> instruction-level single-stepping to skip over rseq critical sections.
>> However, this breaks existing debuggers and tools.
> 
> I really don't think single-stepping is a valid argument.
> 
> Even if the cpu_opv() allows you to "single step", you're not actually
> single stepping the same thing that you're using. So you are literally
> debugging something else than the real code.
> 
> At that point, you don't need "cpu_opv()", you need to just load
> /dev/urandom in a buffer, and single-step that. Ta-daa! No new kernel
> functionality needed.
> 
> So if the main argument for cpu_opv is single-stepping, then just rip
> it out. It's not useful.

No, single-stepping is not the only use-case. Accessing remote cpu
data is another use-case fulfilled by cpu_opv, which I think is more
compelling.

> 
> Anybody who cares deeply about single-stepping shouldn't be using
> optimistic algorithms, and they shouldn't be doing multi-threaded
> stuff either. They won't be able to use things like transactional
> memory either.
> 
> You can't single-step into the kernel to see what the kernel does
> either when you're debugging something.
> 
> News at 11: "single stepping isn't always viable".

I don't mind if people cannot stop the program with a debugger and
observe the state of registers manually at each step though a rseq
critical section.

I do mind breaking existing tools that rely on single-stepping
approaches to automatically analyze program behavior [1,2].
Introducing a rseq critical section into a library (e.g. glibc
memory allocator) would cause existing programs being analyzed
with existing tools to hang.

And I try very hard to avoid being told I'm the one breaking
user-space. ;-)

Thanks,

Mathieu

[1] http://rr-project.org/
[2] https://www.gnu.org/software/gdb/news/reversible.html

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 16, 2018, at 2:39 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Mon, Apr 16, 2018 at 11:35 AM, Mathieu Desnoyers
>  wrote:
>> Specifically for single-stepping, the __rseq_table section introduced
>> at user-level will allow newer debuggers and tools which do line and
>> instruction-level single-stepping to skip over rseq critical sections.
>> However, this breaks existing debuggers and tools.
> 
> I really don't think single-stepping is a valid argument.
> 
> Even if the cpu_opv() allows you to "single step", you're not actually
> single stepping the same thing that you're using. So you are literally
> debugging something else than the real code.
> 
> At that point, you don't need "cpu_opv()", you need to just load
> /dev/urandom in a buffer, and single-step that. Ta-daa! No new kernel
> functionality needed.
> 
> So if the main argument for cpu_opv is single-stepping, then just rip
> it out. It's not useful.

No, single-stepping is not the only use-case. Accessing remote cpu
data is another use-case fulfilled by cpu_opv, which I think is more
compelling.

> 
> Anybody who cares deeply about single-stepping shouldn't be using
> optimistic algorithms, and they shouldn't be doing multi-threaded
> stuff either. They won't be able to use things like transactional
> memory either.
> 
> You can't single-step into the kernel to see what the kernel does
> either when you're debugging something.
> 
> News at 11: "single stepping isn't always viable".

I don't mind if people cannot stop the program with a debugger and
observe the state of registers manually at each step though a rseq
critical section.

I do mind breaking existing tools that rely on single-stepping
approaches to automatically analyze program behavior [1,2].
Introducing a rseq critical section into a library (e.g. glibc
memory allocator) would cause existing programs being analyzed
with existing tools to hang.

And I try very hard to avoid being told I'm the one breaking
user-space. ;-)

Thanks,

Mathieu

[1] http://rr-project.org/
[2] https://www.gnu.org/software/gdb/news/reversible.html

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 14, 2018, at 6:44 PM, Andy Lutomirski l...@amacapital.net wrote:

> On Thu, Apr 12, 2018 at 12:43 PM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
>> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
>> <mathieu.desnoy...@efficios.com> wrote:
>>> The cpu_opv system call executes a vector of operations on behalf of
>>> user-space on a specific CPU with preemption disabled. It is inspired
>>> by readv() and writev() system calls which take a "struct iovec"
>>> array as argument.
>>
>> Do we really want the page pinning?
>>
>> This whole cpu_opv thing is the most questionable part of the series,
>> and the page pinning is the most questionable part of cpu_opv for me.
>>
>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
>>
>> I think that would make Andy happier too.
>>
> 
> It only makes me happier if the userspace code involved is actually
> going to work when single-stepped, which might actually be the case
> (fingers crossed).

Specifically for single-stepping, the __rseq_table section introduced
at user-level will allow newer debuggers and tools which do line and
instruction-level single-stepping to skip over rseq critical sections.
However, this breaks existing debuggers and tools.

For a userspace tracer tool such as LTTng-UST, requiring upgrade to newer
debugger versions would limit its adoption in the field. So if using rseq
breaks current debugger tools, lttng-ust won't use rseq until
single-stepping can be done in a non-breaking way, or will have to wait
until most end-user deployments (distributions used in the field) include
debugger versions that skip over the code identified by the __rseq_table
section, which will take many years.

> That being said, I'm not really convinced that
> cpu_opv() makes much difference here, since I'm not entirely convinced
> that user code will actually use it or that user code will actually be
> that well tested.  C'est la vie.

For the use-case of cpu_opv invoked as single-stepping fall-back, this path
will indeed not be executed often enough to be well-tested. I'm considering
the following approach to allow user-space to test cpu_opv more thoroughly:
we can introduce an environment variable, e.g.:

- RSEQ_DISABLE=1: Disable rseq thread registration,
- RSEQ_DISABLE=random: Randomly disable rseq thread registration (some threads
  use rseq, other threads end up using the cpu_opv fallback)

which would disable the rseq fast-path for all or some threads, and thus allow
thorough testing of cpu_opv used as single-stepping fallback.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 14, 2018, at 6:44 PM, Andy Lutomirski l...@amacapital.net wrote:

> On Thu, Apr 12, 2018 at 12:43 PM, Linus Torvalds
>  wrote:
>> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
>>  wrote:
>>> The cpu_opv system call executes a vector of operations on behalf of
>>> user-space on a specific CPU with preemption disabled. It is inspired
>>> by readv() and writev() system calls which take a "struct iovec"
>>> array as argument.
>>
>> Do we really want the page pinning?
>>
>> This whole cpu_opv thing is the most questionable part of the series,
>> and the page pinning is the most questionable part of cpu_opv for me.
>>
>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
>>
>> I think that would make Andy happier too.
>>
> 
> It only makes me happier if the userspace code involved is actually
> going to work when single-stepped, which might actually be the case
> (fingers crossed).

Specifically for single-stepping, the __rseq_table section introduced
at user-level will allow newer debuggers and tools which do line and
instruction-level single-stepping to skip over rseq critical sections.
However, this breaks existing debuggers and tools.

For a userspace tracer tool such as LTTng-UST, requiring upgrade to newer
debugger versions would limit its adoption in the field. So if using rseq
breaks current debugger tools, lttng-ust won't use rseq until
single-stepping can be done in a non-breaking way, or will have to wait
until most end-user deployments (distributions used in the field) include
debugger versions that skip over the code identified by the __rseq_table
section, which will take many years.

> That being said, I'm not really convinced that
> cpu_opv() makes much difference here, since I'm not entirely convinced
> that user code will actually use it or that user code will actually be
> that well tested.  C'est la vie.

For the use-case of cpu_opv invoked as single-stepping fall-back, this path
will indeed not be executed often enough to be well-tested. I'm considering
the following approach to allow user-space to test cpu_opv more thoroughly:
we can introduce an environment variable, e.g.:

- RSEQ_DISABLE=1: Disable rseq thread registration,
- RSEQ_DISABLE=random: Randomly disable rseq thread registration (some threads
  use rseq, other threads end up using the cpu_opv fallback)

which would disable the rseq fast-path for all or some threads, and thus allow
thorough testing of cpu_opv used as single-stepping fallback.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 12, 2018, at 4:23 PM, Andi Kleen a...@firstfloor.org wrote:

>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
> 
> That would be the right way to go. I doubt anybody really needs cpu_opv.
> We already have other code (e.g. vgettimeofday) which cannot
> be single stepped, and so far it never was a problem.

Single-stepping is only a subset of the rseq limitations addressed
by cpu_opv. Anoher major limitation is algorithms requiring data
migration between per-cpu data structures safely against CPU hotplug,
and without having to change the cpu affinity mask. This is the case
for memory allocators and userspace task schedulers which require
cpu_opv for migration between per-cpu memory pools and scheduler
runqueues.

About the vgettimeofday and general handling of vDSO by gdb, gdb's
approach only takes care of line-by-line single-stepping by hiding
Linux' vdso mapping so users cannot target source code lines within
that shared object. However, it breaks instruction-level single-stepping.
I reported this issue to you back in Nov. 2017:
https://lkml.org/lkml/2017/11/20/803

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 12, 2018, at 4:23 PM, Andi Kleen a...@firstfloor.org wrote:

>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
> 
> That would be the right way to go. I doubt anybody really needs cpu_opv.
> We already have other code (e.g. vgettimeofday) which cannot
> be single stepped, and so far it never was a problem.

Single-stepping is only a subset of the rseq limitations addressed
by cpu_opv. Anoher major limitation is algorithms requiring data
migration between per-cpu data structures safely against CPU hotplug,
and without having to change the cpu affinity mask. This is the case
for memory allocators and userspace task schedulers which require
cpu_opv for migration between per-cpu memory pools and scheduler
runqueues.

About the vgettimeofday and general handling of vDSO by gdb, gdb's
approach only takes care of line-by-line single-stepping by hiding
Linux' vdso mapping so users cannot target source code lines within
that shared object. However, it breaks instruction-level single-stepping.
I reported this issue to you back in Nov. 2017:
https://lkml.org/lkml/2017/11/20/803

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 13, 2018, at 12:37 PM, Linus Torvalds 
torva...@linux-foundation.org wrote:

> On Fri, Apr 13, 2018 at 5:16 AM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
>> The vmalloc space needed by cpu_opv is bound by the number of pages
>> a cpu_opv call can touch.
> 
> No it's not.
> 
> You can have a thousand different processes doing cpu_opv at the same time.
> 
> A *single* cpu_opv may me limited toi "only" a megabyte, but I'm not
> seeing any global limit anywhere.
> 
> In short, this looks like a guaranteed DoS approach to me.

Right, so one simple approach to solve this is to limit to the number
of concurrent cpu_opv executed at any given time.

Considering that cpu_opv is a slow path, we can limit the number
of concurrent cpu_opv executions by protecting this with a global
mutex, or a semaphore if we want the number of concurrent executions
to be greater than 1.

Another approach if we want to be fancier is to keep track of the
amount of vma address space currently used by all in-flight cpu_opv.
Beyond a given threshold, further execution of additional cpu_opv
instances would block, awaiting to be woken up when vmalloc address
space is freed when in-flight cpu_opv complete.

What global vmalloc address-space budget should we aim for ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 13, 2018, at 12:37 PM, Linus Torvalds 
torva...@linux-foundation.org wrote:

> On Fri, Apr 13, 2018 at 5:16 AM, Mathieu Desnoyers
>  wrote:
>> The vmalloc space needed by cpu_opv is bound by the number of pages
>> a cpu_opv call can touch.
> 
> No it's not.
> 
> You can have a thousand different processes doing cpu_opv at the same time.
> 
> A *single* cpu_opv may me limited toi "only" a megabyte, but I'm not
> seeing any global limit anywhere.
> 
> In short, this looks like a guaranteed DoS approach to me.

Right, so one simple approach to solve this is to limit to the number
of concurrent cpu_opv executed at any given time.

Considering that cpu_opv is a slow path, we can limit the number
of concurrent cpu_opv executions by protecting this with a global
mutex, or a semaphore if we want the number of concurrent executions
to be greater than 1.

Another approach if we want to be fancier is to keep track of the
amount of vma address space currently used by all in-flight cpu_opv.
Beyond a given threshold, further execution of additional cpu_opv
instances would block, awaiting to be woken up when vmalloc address
space is freed when in-flight cpu_opv complete.

What global vmalloc address-space budget should we aim for ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 12, 2018, at 4:07 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Thu, Apr 12, 2018 at 12:59 PM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
>>
>> What are your concerns about page pinning ?
> 
> Pretty much everything.
> 
> It's the most complex part by far, and the vmalloc space is a limited
> resource on 32-bit architectures.

The vmalloc space needed by cpu_opv is bound by the number of pages
a cpu_opv call can touch. On architectures with virtually aliased
dcache, we also need to add a few extra pages worth of address space
to account for SHMLBA alignment.

So on ARM32, with SHMLBA=4 pages, this means at most 1 MB of virtual
address space temporarily needed for a cpu_opv system call in the very
worst case scenario: 16 ops * 2 uaddr * 8 pages per uaddr
(if we're unlucky and find ourselves aligned across two SHMLBA) * 4096 bytes 
per page.

If this amount of vmalloc space happens to be our limiting factor, we can
change the max cpu_opv ops array size supported, e.g. bringing it from 16 down
to 4. The largest number of operations I currently need in the cpu-opv library
is 4. With 4 ops, the worse case vmalloc space used by a cpu_opv system call
becomes 256 kB.

> 
>> Do you have an alternative approach in mind ?
> 
> Do everything in user space.

I wish we could disable preemption and cpu hotplug in user-space.
Unfortunately, that does not seem to be a viable solution for many
technical reasons, starting with page fault handling.

> 
> And even if you absolutely want cpu_opv at all, why not do it in the
> user space *mapping* without the aliasing into kernel space?

That's because cpu_opv need to execute the entire array of operations
with preemption disabled, and we cannot take a page fault with preemption
off.

Page pinning and aliasing user-space pages in the kernel linear mapping
ensure that we don't end up in trouble in page fault scenarios, such as
having the pages we need to touch swapped out under our feet.

> 
> The cpu_opv approach isn't even fast. It's *really* slow if it has to
> do VM crap.
> 
> The whole rseq thing was billed as "faster than atomics". I
> *guarantee* that the cpu_opv's aren't faster than atomics.

Yes, and here is the good news: cpu_opv speed does not even matter. rseq 
assember instruction sequences are very fast, but cannot deal with infrequent 
corner-cases.
cpu_opv is slow, but is guaranteed to deal with the occasional corner-case
situations.

This is similar to pthread mutex/futex fast/slow paths. The common case is fast
(rseq), and the speed of the infrequent case (cpu_opv) does not matter as long
as it's used infrequently enough, which is the case here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 12, 2018, at 4:07 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Thu, Apr 12, 2018 at 12:59 PM, Mathieu Desnoyers
>  wrote:
>>
>> What are your concerns about page pinning ?
> 
> Pretty much everything.
> 
> It's the most complex part by far, and the vmalloc space is a limited
> resource on 32-bit architectures.

The vmalloc space needed by cpu_opv is bound by the number of pages
a cpu_opv call can touch. On architectures with virtually aliased
dcache, we also need to add a few extra pages worth of address space
to account for SHMLBA alignment.

So on ARM32, with SHMLBA=4 pages, this means at most 1 MB of virtual
address space temporarily needed for a cpu_opv system call in the very
worst case scenario: 16 ops * 2 uaddr * 8 pages per uaddr
(if we're unlucky and find ourselves aligned across two SHMLBA) * 4096 bytes 
per page.

If this amount of vmalloc space happens to be our limiting factor, we can
change the max cpu_opv ops array size supported, e.g. bringing it from 16 down
to 4. The largest number of operations I currently need in the cpu-opv library
is 4. With 4 ops, the worse case vmalloc space used by a cpu_opv system call
becomes 256 kB.

> 
>> Do you have an alternative approach in mind ?
> 
> Do everything in user space.

I wish we could disable preemption and cpu hotplug in user-space.
Unfortunately, that does not seem to be a viable solution for many
technical reasons, starting with page fault handling.

> 
> And even if you absolutely want cpu_opv at all, why not do it in the
> user space *mapping* without the aliasing into kernel space?

That's because cpu_opv need to execute the entire array of operations
with preemption disabled, and we cannot take a page fault with preemption
off.

Page pinning and aliasing user-space pages in the kernel linear mapping
ensure that we don't end up in trouble in page fault scenarios, such as
having the pages we need to touch swapped out under our feet.

> 
> The cpu_opv approach isn't even fast. It's *really* slow if it has to
> do VM crap.
> 
> The whole rseq thing was billed as "faster than atomics". I
> *guarantee* that the cpu_opv's aren't faster than atomics.

Yes, and here is the good news: cpu_opv speed does not even matter. rseq 
assember instruction sequences are very fast, but cannot deal with infrequent 
corner-cases.
cpu_opv is slow, but is guaranteed to deal with the occasional corner-case
situations.

This is similar to pthread mutex/futex fast/slow paths. The common case is fast
(rseq), and the speed of the infrequent case (cpu_opv) does not matter as long
as it's used infrequently enough, which is the case here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 12, 2018, at 3:43 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
> <mathieu.desnoy...@efficios.com> wrote:
>> The cpu_opv system call executes a vector of operations on behalf of
>> user-space on a specific CPU with preemption disabled. It is inspired
>> by readv() and writev() system calls which take a "struct iovec"
>> array as argument.
> 
> Do we really want the page pinning?
> 
> This whole cpu_opv thing is the most questionable part of the series,
> and the page pinning is the most questionable part of cpu_opv for me.

What are your concerns about page pinning ?

Do you have an alternative approach in mind ?

> Can we plan on merging just the plain rseq parts *without* this all
> first, and then see the cpu_opv thing as a "maybe future expansion"
> part.

The main problem with the incremental approach is that it won't deal
with remote CPU data accesses, and won't deal with cpu hotplug in
non-racy ways. For *some* of the use-cases, the other issues solved by
cpu_opv can be worked-around in user-space, at the cost of making
the userspace code a mess, and in many cases slower than if we can rely
on cpu_opv for the fallback.

All the rseq test-cases depend on cpu_opv as they stand now. Without
cpu_opv to handle the corner-cases, things become much more messy on the
user-space side.

Thanks,

Mathieu

> 
> I think that would make Andy happier too.
> 
>  Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

- On Apr 12, 2018, at 3:43 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
>  wrote:
>> The cpu_opv system call executes a vector of operations on behalf of
>> user-space on a specific CPU with preemption disabled. It is inspired
>> by readv() and writev() system calls which take a "struct iovec"
>> array as argument.
> 
> Do we really want the page pinning?
> 
> This whole cpu_opv thing is the most questionable part of the series,
> and the page pinning is the most questionable part of cpu_opv for me.

What are your concerns about page pinning ?

Do you have an alternative approach in mind ?

> Can we plan on merging just the plain rseq parts *without* this all
> first, and then see the cpu_opv thing as a "maybe future expansion"
> part.

The main problem with the incremental approach is that it won't deal
with remote CPU data accesses, and won't deal with cpu hotplug in
non-racy ways. For *some* of the use-cases, the other issues solved by
cpu_opv can be worked-around in user-space, at the cost of making
the userspace code a mess, and in many cases slower than if we can rely
on cpu_opv for the fallback.

All the rseq test-cases depend on cpu_opv as they stand now. Without
cpu_opv to handle the corner-cases, things become much more messy on the
user-space side.

Thanks,

Mathieu

> 
> I think that would make Andy happier too.
> 
>  Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

[RFC PATCH for 4.18 15/23] arm: Wire up cpu_opv system call

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fbc74b5fa3ed..213ccfc2c437 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -413,3 +413,4 @@
 396common  pkey_free   sys_pkey_free
 397common  statx   sys_statx
 398common  rseqsys_rseq
+399common  cpu_opv sys_cpu_opv
-- 
2.11.0

[RFC PATCH for 4.18 15/23] arm: Wire up cpu_opv system call

Signed-off-by: Mathieu Desnoyers 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org
---
 arch/arm/tools/syscall.tbl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fbc74b5fa3ed..213ccfc2c437 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -413,3 +413,4 @@
 396common  pkey_free   sys_pkey_free
 397common  statx   sys_statx
 398common  rseqsys_rseq
+399common  cpu_opv sys_cpu_opv
-- 
2.11.0

[RFC PATCH for 4.18 14/23] powerpc: Wire up cpu_opv system call

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 45d4d37495fd..4131825b5a05 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -393,3 +393,4 @@ SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 1e9708632dce..c19379f0a32e 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls388
+#define NR_syscalls389
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index ac5ba55066dd..f7a221bdb5df 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -399,5 +399,6 @@
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
 #define __NR_rseq  387
+#define __NR_cpu_opv   388
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0

[RFC PATCH for 4.18 14/23] powerpc: Wire up cpu_opv system call

Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Boqun Feng 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-...@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 45d4d37495fd..4131825b5a05 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -393,3 +393,4 @@ SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 1e9708632dce..c19379f0a32e 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls388
+#define NR_syscalls389
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index ac5ba55066dd..f7a221bdb5df 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -399,5 +399,6 @@
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
 #define __NR_rseq  387
+#define __NR_cpu_opv   388
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0

[RFC PATCH for 4.18 06/23] x86: Wire up restartable sequence system call

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Reviewed-by: Thomas Gleixner <t...@linutronix.de>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 2a5e99cff859..b76cbd25854f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382i386pkey_free   sys_pkey_free
 383i386statx   sys_statx
 384i386arch_prctl  sys_arch_prctl  
compat_sys_arch_prctl
+385i386rseqsys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..3ad03495bbb9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330common  pkey_alloc  sys_pkey_alloc
 331common  pkey_free   sys_pkey_free
 332common  statx   sys_statx
+333common  rseqsys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0

[RFC PATCH for 4.18 06/23] x86: Wire up restartable sequence system call

Wire up the rseq system call on x86 32/64.

This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Thomas Gleixner 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 2a5e99cff859..b76cbd25854f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382i386pkey_free   sys_pkey_free
 383i386statx   sys_statx
 384i386arch_prctl  sys_arch_prctl  
compat_sys_arch_prctl
+385i386rseqsys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..3ad03495bbb9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330common  pkey_alloc  sys_pkey_alloc
 331common  pkey_free   sys_pkey_free
 332common  statx   sys_statx
+333common  rseqsys_rseq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.11.0

[RFC PATCH for 4.18 18/23] rseq: selftests: Provide rseq library (v5)

This rseq helper library provides a user-space API to the rseq()
system call.

The rseq fast-path exposes the instruction pointer addresses where the
rseq assembly blocks begin and end, as well as the associated abort
instruction pointer, in the __rseq_table section. This section allows
debuggers may know where to place breakpoints when single-stepping
through assembly blocks which may be aborted at any point by the kernel.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Shuah Khan <shua...@osg.samsung.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Provide abort-ip signature: The abort-ip signature is located just
  before the abort-ip target. It is currently hardcoded, but a
  user-space application could use the __rseq_table to iterate on all
  abort-ip targets and use a random value as signature if needed in the
  future.
- Add rseq_prepare_unload(): Libraries and JIT code using rseq critical
  sections need to issue rseq_prepare_unload() on each thread at least
  once before reclaim of struct rseq_cs.
- Use initial-exec TLS model, non-weak symbol: The initial-exec model is
  signal-safe, whereas the global-dynamic model is not.  Remove the
  "weak" symbol attribute from the __rseq_abi in rseq.c. The rseq.so
  library will have ownership of that symbol, and there is not reason for
  an application or user library to try to define that symbol.
  The expected use is to link against libreq.so, which owns and provide
  that symbol.
- Set cpu_id to -2 on register error
- Add rseq_len syscall parameter, rseq_cs version
- Ensure disassember-friendly signature: x86 32/64 disassembler have a
  hard time decoding the instruction stream after a bad instruction. Use
  a nopl instruction to encode the signature. Suggested by Andy Lutomirski.
- Exercise parametrized tests variants in a shell scripts.
- Restartable sequences selftests: Remove use of event counter.
- Use cpu_id_start field:  With the cpu_id_start field, the C
  preparation phase of the fast-path does not need to compare cpu_id < 0
  anymore.
- Signal-safe registration and refcounting: Allow libraries using
  librseq.so to register it from signal handlers.
- Use OVERRIDE_TARGETS in makefile.
- Use "m" constraints for rseq_cs field.

Changes since v2:
- Update based on Thomas Gleixner's comments.

Changes since v3:
- Generate param_test_skip_fastpath and param_test_benchmark with
  -DSKIP_FASTPATH and -DBENCHMARK (respectively). Add param_test_fastpath
  to run_param_test.sh.

Changes since v4:
- Fold arm: workaround gcc asm size guess,
- Namespace barrier() -> rseq_barrier() in library header,
- Take into account coding style feedback from Peter Zijlstra,
- Split rseq selftests into logical commits.
---
 tools/testing/selftests/rseq/rseq-arm.h  |  732 +++
 tools/testing/selftests/rseq/rseq-ppc.h  |  688 ++
 tools/testing/selftests/rseq/rseq-skip.h |   82 +++
 tools/testing/selftests/rseq/rseq-x86.h  | 1149 ++
 tools/testing/selftests/rseq/rseq.c  |  116 +++
 tools/testing/selftests/rseq/rseq.h  |  164 +
 6 files changed, 2931 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-skip.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/rseq-arm.h 
b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index ..adcaa6cbbd01
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,732 @@
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated document

[RFC PATCH for 4.18 18/23] rseq: selftests: Provide rseq library (v5)

This rseq helper library provides a user-space API to the rseq()
system call.

The rseq fast-path exposes the instruction pointer addresses where the
rseq assembly blocks begin and end, as well as the associated abort
instruction pointer, in the __rseq_table section. This section allows
debuggers may know where to place breakpoints when single-stepping
through assembly blocks which may be aborted at any point by the kernel.

Signed-off-by: Mathieu Desnoyers 
CC: Shuah Khan 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: "H. Peter Anvin" 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-kselft...@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Provide abort-ip signature: The abort-ip signature is located just
  before the abort-ip target. It is currently hardcoded, but a
  user-space application could use the __rseq_table to iterate on all
  abort-ip targets and use a random value as signature if needed in the
  future.
- Add rseq_prepare_unload(): Libraries and JIT code using rseq critical
  sections need to issue rseq_prepare_unload() on each thread at least
  once before reclaim of struct rseq_cs.
- Use initial-exec TLS model, non-weak symbol: The initial-exec model is
  signal-safe, whereas the global-dynamic model is not.  Remove the
  "weak" symbol attribute from the __rseq_abi in rseq.c. The rseq.so
  library will have ownership of that symbol, and there is not reason for
  an application or user library to try to define that symbol.
  The expected use is to link against libreq.so, which owns and provide
  that symbol.
- Set cpu_id to -2 on register error
- Add rseq_len syscall parameter, rseq_cs version
- Ensure disassember-friendly signature: x86 32/64 disassembler have a
  hard time decoding the instruction stream after a bad instruction. Use
  a nopl instruction to encode the signature. Suggested by Andy Lutomirski.
- Exercise parametrized tests variants in a shell scripts.
- Restartable sequences selftests: Remove use of event counter.
- Use cpu_id_start field:  With the cpu_id_start field, the C
  preparation phase of the fast-path does not need to compare cpu_id < 0
  anymore.
- Signal-safe registration and refcounting: Allow libraries using
  librseq.so to register it from signal handlers.
- Use OVERRIDE_TARGETS in makefile.
- Use "m" constraints for rseq_cs field.

Changes since v2:
- Update based on Thomas Gleixner's comments.

Changes since v3:
- Generate param_test_skip_fastpath and param_test_benchmark with
  -DSKIP_FASTPATH and -DBENCHMARK (respectively). Add param_test_fastpath
  to run_param_test.sh.

Changes since v4:
- Fold arm: workaround gcc asm size guess,
- Namespace barrier() -> rseq_barrier() in library header,
- Take into account coding style feedback from Peter Zijlstra,
- Split rseq selftests into logical commits.
---
 tools/testing/selftests/rseq/rseq-arm.h  |  732 +++
 tools/testing/selftests/rseq/rseq-ppc.h  |  688 ++
 tools/testing/selftests/rseq/rseq-skip.h |   82 +++
 tools/testing/selftests/rseq/rseq-x86.h  | 1149 ++
 tools/testing/selftests/rseq/rseq.c  |  116 +++
 tools/testing/selftests/rseq/rseq.h  |  164 +
 6 files changed, 2931 insertions(+)
 create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
 create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
 create mode 100644 tools/testing/selftests/rseq/rseq-skip.h
 create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
 create mode 100644 tools/testing/selftests/rseq/rseq.c
 create mode 100644 tools/testing/selftests/rseq/rseq.h

diff --git a/tools/testing/selftests/rseq/rseq-arm.h 
b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index ..adcaa6cbbd01
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,732 @@
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNE

[RFC PATCH for 4.18 03/23] arm: Add restartable sequences support

Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Russell King <li...@arm.linux.org.uk>
CC: Catalin Marinas <catalin.mari...@arm.com>
CC: Will Deacon <will.dea...@arm.com>
CC: Thomas Gleixner <t...@linutronix.de>
CC: Paul Turner <p...@google.com>
CC: Andrew Hunter <a...@google.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Andy Lutomirski <l...@amacapital.net>
CC: Andi Kleen <a...@firstfloor.org>
CC: Dave Watson <davejwat...@fb.com>
CC: Chris Lameter <c...@linux.com>
CC: Ingo Molnar <mi...@redhat.com>
CC: Ben Maurer <bmau...@fb.com>
CC: Steven Rostedt <rost...@goodmis.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: Josh Triplett <j...@joshtriplett.org>
CC: Linus Torvalds <torva...@linux-foundation.org>
CC: Andrew Morton <a...@linux-foundation.org>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: linux-...@vger.kernel.org
---
 arch/arm/Kconfig | 1 +
 arch/arm/kernel/signal.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7e3d53575486..1897d40ddd87 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -90,6 +90,7 @@ config ARM
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16
select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index bd8810d4acb3..5879ab3f53c1 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -541,6 +541,12 @@ static void handle_signal(struct ksignal *ksig, struct 
pt_regs *regs)
int ret;
 
/*
+* Increment event counter and perform fixup for the pre-signal
+* frame.
+*/
+   rseq_signal_deliver(regs);
+
+   /*
 * Set up the stack frame
 */
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -660,6 +666,7 @@ do_work_pending(struct pt_regs *regs, unsigned int 
thread_flags, int syscall)
} else {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
}
local_irq_disable();
-- 
2.11.0

[RFC PATCH for 4.18 03/23] arm: Add restartable sequences support

Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal frame when a signal is delivered on top
of a restartable sequence critical section.

Signed-off-by: Mathieu Desnoyers 
CC: Russell King 
CC: Catalin Marinas 
CC: Will Deacon 
CC: Thomas Gleixner 
CC: Paul Turner 
CC: Andrew Hunter 
CC: Peter Zijlstra 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Dave Watson 
CC: Chris Lameter 
CC: Ingo Molnar 
CC: Ben Maurer 
CC: Steven Rostedt 
CC: "Paul E. McKenney" 
CC: Josh Triplett 
CC: Linus Torvalds 
CC: Andrew Morton 
CC: Boqun Feng 
CC: linux-...@vger.kernel.org
---
 arch/arm/Kconfig | 1 +
 arch/arm/kernel/signal.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7e3d53575486..1897d40ddd87 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -90,6 +90,7 @@ config ARM
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16
select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index bd8810d4acb3..5879ab3f53c1 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -541,6 +541,12 @@ static void handle_signal(struct ksignal *ksig, struct 
pt_regs *regs)
int ret;
 
/*
+* Increment event counter and perform fixup for the pre-signal
+* frame.
+*/
+   rseq_signal_deliver(regs);
+
+   /*
 * Set up the stack frame
 */
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -660,6 +666,7 @@ do_work_pending(struct pt_regs *regs, unsigned int 
thread_flags, int syscall)
} else {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
}
local_irq_disable();
-- 
2.11.0

[RFC PATCH for 4.18 21/23] rseq: selftests: Provide basic percpu ops test