Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-06 Thread Thomas Gleixner
On Wed, 2 May 2018, Peter Zijlstra wrote:

> On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote:
> > On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra  wrote:
> > >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > > > Suppose we make a userspace mutex implemented with a lock word having
> > > > three bits: acquired, sleep_mode, and wait_pending, with the rest of
> > > > the word not being relevant at the moment.
> > 
> > > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> > > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> > > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> > > owner TID.
> > 
> > That doesn't work if you want to use the rest of the word for something
> > else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
> > a lock with two bits.
> 
> Recursive locks are the most horrible crap ever. And having the TID in
> the word allows things like kernel based optimistic spins and possibly
> PI related things.

FWIW, robust futexes also have the TID requirement.
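As a minimal sketch (not the kernel's implementation) of what the TID requirement above implies: a PI-style or robust futex word reserves the low 30 bits for the owner's thread ID, with the top two bits used as flags, so no room is left for user data such as a recursion count. The constants are the real ones from `<linux/futex.h>`; the helper name is illustrative.

```c
#include <linux/futex.h>   /* FUTEX_TID_MASK, FUTEX_WAITERS, FUTEX_OWNER_DIED */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* PI/robust-style lock word: 0 means unlocked; otherwise the low 30
 * bits (FUTEX_TID_MASK) hold the owner TID, and the top two bits are
 * the FUTEX_WAITERS and FUTEX_OWNER_DIED flags.  This is the layout
 * that rules out stashing a recursion count in the word. */
static inline int try_lock_pi_style(_Atomic uint32_t *word)
{
    uint32_t expected = 0;                      /* 0 == unlocked */
    uint32_t tid = (uint32_t)syscall(SYS_gettid);
    return atomic_compare_exchange_strong(word, &expected,
                                          tid & FUTEX_TID_MASK);
}
```

On contention or unlock, the real protocol hands the word to the kernel via FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, which is what lets the kernel identify the owner for PI boosting and robust-list cleanup.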

Thanks,

tglx


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-04 Thread Ben Maurer
Hey -

I think the ideas Daniel brings up here are interesting -- specifically the 
notion that a thread could set a "pre-sleep wish" to signal that it's sleeping. 
As this conversation shows, I think there's a fair bit of depth to that. For 
example, FUTEX_LOCK is an alternative approach. Another idea might be using 
the "currently running cpu" area of rseq to tell whether the process in question 
was sleeping (assuming the kernel were modified to set this field to -1 every 
time a process was unscheduled).
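As a hypothetical sketch of that last idea (the kernel does not currently clear cpu_id on deschedule; the struct layout, the shared-pointer arrangement, and the -1 convention are all assumptions for illustration), a waiter could decide whether to keep spinning like this:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical: each thread shares a pointer to its rseq area with its
 * peers, and the kernel (in this imagined extension -- it does not do
 * this today) stores -1 into cpu_id whenever the thread is descheduled.
 * A waiter then spins only while the lock owner is actually on a CPU
 * and falls back to sleeping otherwise. */
struct rseq_area {
    _Atomic int32_t cpu_id;   /* >= 0: running on that CPU; -1: off-CPU */
};

static int owner_is_running(struct rseq_area *owner_rseq)
{
    return atomic_load_explicit(&owner_rseq->cpu_id,
                                memory_order_relaxed) >= 0;
}
```

The waiter's loop would spin while `owner_is_running()` returns true and issue a FUTEX_WAIT otherwise, which is the spin/sleep split this thread is debating.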

The idea discussed here seems orthogonal to the core thesis of rseq. I'm 
wondering if we can focus on getting rseq in, maybe with an eye toward making 
sure this use case could be handled long term. My sense is that this is 
possible. We could use the flags setting in the per-thread rseq area, or maybe 
extend the meaning of the structure that rseq_cs points to so that it carries 
information about how to signal the sleeping of the current process. It seems 
to me this would be a natural way to add the functionality Daniel talks about, 
if desired, in the future.
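Concretely, Daniel's "pre-sleep wish" amounts to a per-thread descriptor along these lines. This is entirely hypothetical -- no such field exists in the upstream rseq ABI, and the struct and field names are invented for illustration -- but it shows the shape of what the kernel would execute on the thread's behalf just before schedule():

```c
#include <stdint.h>

/* Hypothetical extension to the per-thread rseq TLS area: a futex
 * operation the kernel would perform in the preempted thread's context
 * immediately before schedule().  A lock owner could use this to flip
 * its lock word into an "owner sleeping" state and wake a waiter, so
 * that waiters spin only while the owner is on a CPU. */
struct rseq_presleep_op {
    uint64_t uaddr;     /* userspace futex word to operate on */
    uint32_t op;        /* e.g. a FUTEX_WAKE-style opcode */
    uint32_t val;       /* operand, e.g. number of waiters to wake */
};
```

If the operation itself sleeps, that is harmless for this purpose: the thread was about to be descheduled anyway.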

Daniel - do you think there's anything we should do in the patch set today that 
would make it easier to implement your idea in the future without expanding the 
scope of the patch today? I.e., is there anything else we need to do to lay the 
framework for your idea?

I'd really love to see us get this patch in. There's a whole host of primitives 
this unlocks (more efficient RCU, better malloc implementations, fast 
reader-writer locks). I'm sure we'll have more ideas about APIs to provide once 
we've explored these use cases, but to me this patch is the MVP we need to ship 
to get that feedback. It's a solid framework that we can build upon, e.g. with 
the "opv" syscall or the idea in this thread, if user feedback shows those 
things are necessary.

-b

Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Mathieu Desnoyers
- On May 3, 2018, at 12:22 PM, Daniel Colascione dan...@google.com wrote:

> On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers <
> mathieu.desnoy...@efficios.com> wrote:
>> By the way, if we eventually find a way to enhance user-space mutexes in the
>> fashion you describe here, it would belong to another TLS area, and would
>> be registered by another system call than rseq. I proposed a more generic
>> "TLS area registration" system call a few years ago, but Linus told me he
>> wanted a system call that was specific to rseq. If we need to implement
>> other use-cases in a TLS area shared between kernel and user-space in a
>> similar fashion, the plan is to do it in a distinct system call.
> 
> If we proliferate TLS areas, we'd have to register each one upon thread
> creation, adding to the overall thread creation path. There's already a
> provision for versioning the TLS area. What's the benefit of splitting the
> registration over multiple system calls?

See the original discussion thread at

https://lkml.org/lkml/2016/4/7/502

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Joel Fernandes
Hi Daniel,
Nice to have this healthy discussion about pros/cons. Adding Waiman to the
discussion as well. Curious to hear what Waiman and Peter think about all
this. Some more comments inline.

On Thu, May 3, 2018 at 10:19 AM Daniel Colascione  wrote:

> On Thu, May 3, 2018 at 9:48 AM Joel Fernandes  wrote:
> > > > can skip the manual schedule we were going to perform.

> > > By the way, if we eventually find a way to enhance user-space mutexes
> > > in the fashion you describe here, it would belong to another TLS area,
> > > and would be registered by another system call than rseq. I proposed a
> > > more generic

> > Right. Also I still don't see any good reason why optimistic spinning in
> > the kernel with FUTEX_LOCK, as Peter described, can't be used instead of
> > using the rseq implementation and spinning in userspace, for such a case.
> > I don't really fully buy that we need to design this interface assuming
> > any privilege transition level time.

> > If privilege level transitions are slow,
> > we're going to have bad performance anyway.

> That's not the case. There's a large class of programs that does useful
> work while seldom entering the kernel: just ask the user-space network
> stack people.

Yes, I am aware of that. I was just saying that, in general, a system such as
an Android embedded system (as opposed to an HPC-based system) does make a lot
of system calls. I am not arguing that doing more things in userspace is good
or bad here. I am just asking why we should do something else, for no good
reason (see below), when work has already been done in this area.

> It's not wise to design interfaces around system calls being cheap. Even
> if system calls are currently cheap enough on some architectures some of
> the time, there's no guarantee that they'll stay that way, especially
> relative to straight-line user-mode execution. A pure user-space approach,
> on the other hand, involves no work in the kernel, and doing nothing is
> always the optimal strategy. Besides, there are environments where system
> calls end up being more expensive than you might think: consider strace or
> rr. If the kernel needs to get involved on some path, it's best that its
> involvement be as light as possible.

Of course, but I think we shouldn't make a premature optimization here without
real data, from typical Android devices, on the cost of system call
entry/exit versus spin time. I am not against a userspace lock based on rseq
if there is data and good reason, before investing significant time on
reinventing the wheel.

> > we should really stick to using FUTEX_LOCK and
> > reuse all the work that went into that area for Android and otherwise
> > (and work with Waiman and others on improving that if there are any
> > problems with it).

> FUTEX_LOCK is a return to the bad old days when systems gave you a fixed
> list of synchronization primitives and if you wanted something else, tough.

I am not saying we should fix the set of sync primitives made available to
userspace, or anything like that. I am talking about your/our use case and
whether another sync primitive interface is needed. For example, having
another syscall to register a TLS area is a new interface, versus using the
existing futex interface. Linus is also against adding new syscalls
unnecessarily.

> That the latest version of the FUTEX_LOCK patch includes a separate
> FUTEX_LOCK_SHARED mode is concerning. The functionality the kernel
> provides

Why? That's just for reader locks. What's the concern there? I know you had
something in mind about efficient userspace rwlocks, but either way I am
curious what you have in mind.

> to userspace should be more general-purpose and allow more experimentation
> without changes in the kernel. I see no reason to force userspace into 1)
> reserving 30 bits of its lockword for a TID and 2) adopting the kernel's

Based on our offline chat, this is only for 32-bit systems though, right?
Also, based on Peter's idea of putting the recursion counter outside the
word, there shouldn't be a space issue?

> idea of spin time heuristics and lock stealing when the same basic
> functionality can be provided in a generic way while reserving only one
> bit. That this mechanism happens to be more efficient as well is a bonus.

And also probably easy to get wrong. Heuristics are hard, and it would be
good to work with the community on finding the best approach and improving
the existing code. Also, regarding "generic way": that's even more reason, in
my view, to do it in the kernel.

> "Mechanism not policy" is still a good design principle.

Again, I am not advocating forcing any interfaces on anyone, but I'm against
reinventing the wheel and am all for spending time on improving existing
things.

thanks!

- Joel


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Daniel Colascione
On Thu, May 3, 2018 at 9:48 AM Joel Fernandes  wrote:
> > > can skip the manual schedule we were going to perform.

> > By the way, if we eventually find a way to enhance user-space mutexes in
> > the fashion you describe here, it would belong to another TLS area, and
> > would be registered by another system call than rseq. I proposed a more
> > generic

> Right. Also I still don't see any good reason why optimistic spinning in
> the kernel with FUTEX_LOCK, as Peter described, can't be used instead of
> using the rseq implementation and spinning in userspace, for such a case.
> I don't really fully buy that we need to design this interface assuming
> any privilege transition level time.

> If privilege level transitions are slow,
> we're going to have bad performance anyway.

That's not the case. There's a large class of programs that does useful work
while seldom entering the kernel: just ask the user-space network stack
people.

It's not wise to design interfaces around system calls being cheap. Even if
system calls are currently cheap enough on some architectures some of the
time, there's no guarantee that they'll stay that way, especially relative
to straight-line user-mode execution. A pure user-space approach, on the
other hand, involves no work in the kernel, and doing nothing is always the
optimal strategy. Besides, there are environments where system calls end up
being more expensive than you might think: consider strace or rr. If the
kernel needs to get involved on some path, it's best that its involvement
be as light as possible.

> we should really stick to using FUTEX_LOCK and
> reuse all the work that went into that area for Android and otherwise (and
> work with Waiman and others on improving that if there are any problems
> with it).

FUTEX_LOCK is a return to the bad old days when systems gave you a fixed
list of synchronization primitives and if you wanted something else, tough.
That the latest version of the FUTEX_LOCK patch includes a separate
FUTEX_LOCK_SHARED mode is concerning. The functionality the kernel provides
to userspace should be more general-purpose and allow more experimentation
without changes in the kernel. I see no reason to force userspace into 1)
reserving 30 bits of its lockword for a TID and 2) adopting the kernel's
idea of spin time heuristics and lock stealing when the same basic
functionality can be provided in a generic way while reserving only one
bit. That this mechanism happens to be more efficient as well is a bonus.
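As a sketch of the kind of thing Daniel means by a generic, bit-frugal lock (a classic two-bit futex mutex built on plain FUTEX_WAIT/FUTEX_WAKE; the bit positions and helper names are illustrative, not from any patch in this thread), where the other 30 bits of the word stay free for user data such as a recursion count:

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LOCKED  0x1u   /* bit 0: lock held */
#define WAITERS 0x2u   /* bit 1: someone is (or may be) sleeping */

static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock(_Atomic uint32_t *w)
{
    uint32_t v = atomic_load(w);
    for (;;) {
        if (!(v & LOCKED)) {
            if (atomic_compare_exchange_weak(w, &v, v | LOCKED))
                return;
            continue;           /* v was reloaded by the failed CAS */
        }
        /* Advertise a waiter, then sleep until the word changes. */
        if (!(v & WAITERS) &&
            !atomic_compare_exchange_weak(w, &v, v | WAITERS))
            continue;
        futex(w, FUTEX_WAIT, v | WAITERS);
        v = atomic_load(w);
    }
}

static void unlock(_Atomic uint32_t *w)
{
    uint32_t v = atomic_fetch_and(w, ~(LOCKED | WAITERS));
    if (v & WAITERS)
        futex(w, FUTEX_WAKE, 1);   /* a woken waiter re-sets WAITERS */
}
```

Because only bits 0 and 1 are claimed by the protocol, userspace can pack whatever it likes into the remaining bits, which is exactly what a TID-in-word scheme forecloses.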

"Mechanism not policy" is still a good design principle.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Joel Fernandes
On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers <
mathieu.desnoy...@efficios.com> wrote:

> - On May 2, 2018, at 12:07 PM, Daniel Colascione dan...@google.com wrote:

> > On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <
> > mathieu.desnoy...@efficios.com> wrote:
> >
> >> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com wrote:
> >> [...]
> >> >
> >> > I think a small enhancement to rseq would let us build a perfect
> >> > userspace mutex, one that spins on lock-acquire only when the lock
> >> > owner is running and that sleeps otherwise, freeing userspace from
> >> > both specifying ad-hoc spin counts and from trying to detect
> >> > situations in which spinning is generally pointless.
> >> >
> >> > It'd work like this: in the per-thread rseq data structure, we'd
> >> > include a description of a futex operation for the kernel to perform
> >> > (in the context of the preempted thread) upon preemption, immediately
> >> > before schedule(). If the futex operation itself sleeps, that's no
> >> > problem: we will have still accomplished our goal of running some
> >> > other thread instead of the preempted thread.
> >
> >> Hi Daniel,
> >
> >> I agree that the problem you are aiming to solve is important. Let's
> >> see what prevents the proposed rseq implementation from doing what you
> >> envision.
> >
> >> The main issue here is touching userspace immediately before
> >> schedule(). At that specific point, it's not possible to take a page
> >> fault. In the proposed rseq implementation, we get away with it by
> >> raising a task struct flag, and using it in a return to userspace
> >> notifier (where we can actually take a fault), where we touch the
> >> userspace TLS area.
> >
> >> If we can find a way to solve this limitation, then the rest of your
> >> design makes sense to me.
> >
> > Thanks for taking a look!
> >
> > Why couldn't we take a page fault just before schedule? The reason we
> > can't take a page fault in atomic context is that doing so might call
> > schedule. Here, we're about to call schedule _anyway_, so what harm does
> > it do to call something that might call schedule? If we schedule via
> > that call, we can skip the manual schedule we were going to perform.

> By the way, if we eventually find a way to enhance user-space mutexes in the
> fashion you describe here, it would belong to another TLS area, and would
> be registered by another system call than rseq. I proposed a more generic

Right. Also I still don't see any good reason why optimistic spinning in
the kernel with FUTEX_LOCK, as Peter described, can't be used instead of
using the rseq implementation and spinning in userspace, for such a case. I
don't really fully buy that we need to design this interface assuming any
particular privilege-transition cost. If privilege level transitions are
slow, we're going to have bad performance anyway. Unless there's some data
to show that it's better to optimistically spin in userspace rather than in
the kernel, we should really stick to using FUTEX_LOCK and reuse all the
work that went into that area for Android and otherwise (and work with
Waiman and others on improving it if there are any problems).

I am excited, though, about the synchronization designs other than lock
implementations that rseq can help with.

thanks!

- Joel


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Daniel Colascione
On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers <
mathieu.desnoy...@efficios.com> wrote:
> By the way, if we eventually find a way to enhance user-space mutexes in the
> fashion you describe here, it would belong to another TLS area, and would
> be registered by another system call than rseq. I proposed a more generic
> "TLS area registration" system call a few years ago, but Linus told me he
> wanted a system call that was specific to rseq. If we need to implement
> other use-cases in a TLS area shared between kernel and user-space in a
> similar fashion, the plan is to do it in a distinct system call.

If we proliferate TLS areas, we'd have to register each one upon thread
creation, adding to the overall thread creation path. There's already a
provision for versioning the TLS area. What's the benefit of splitting the
registration over multiple system calls?


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Mathieu Desnoyers
- On May 2, 2018, at 12:07 PM, Daniel Colascione dan...@google.com wrote:

> On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <
> mathieu.desnoy...@efficios.com> wrote:
> 
>> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com
> wrote:
>> [...]
>> >
>> > I think a small enhancement to rseq would let us build a perfect userspace
>> > mutex, one that spins on lock-acquire only when the lock owner is running
>> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc
>> > spin counts and from trying to detect situations in which spinning is
>> > generally pointless.
>> >
>> > It'd work like this: in the per-thread rseq data structure, we'd include a
>> > description of a futex operation for the kernel to perform (in the
>> > context of the preempted thread) upon preemption, immediately before
>> > schedule(). If the futex operation itself sleeps, that's no problem: we
>> > will have still accomplished our goal of running some other thread instead
>> > of the preempted thread.
> 
>> Hi Daniel,
> 
>> I agree that the problem you are aiming to solve is important. Let's see
>> what prevents the proposed rseq implementation from doing what you
> envision.
> 
>> The main issue here is touching userspace immediately before schedule().
>> At that specific point, it's not possible to take a page fault. In the proposed
>> rseq implementation, we get away with it by raising a task struct flag, and using
>> it in a return to userspace notifier (where we can actually take a fault), where
>> we touch the userspace TLS area.
> 
>> If we can find a way to solve this limitation, then the rest of your design
>> makes sense to me.
> 
> Thanks for taking a look!
> 
> Why couldn't we take a page fault just before schedule? The reason we can't
> take a page fault in atomic context is that doing so might call schedule.
> Here, we're about to call schedule _anyway_, so what harm does it do to
> call something that might call schedule? If we schedule via that call, we
> can skip the manual schedule we were going to perform.

By the way, if we eventually find a way to enhance user-space mutexes in the
fashion you describe here, it would belong to another TLS area, and would
be registered by another system call than rseq. I proposed a more generic
"TLS area registration" system call a few years ago, but Linus told me he
wanted a system call that was specific to rseq. If we need to implement
other use-cases in a TLS area shared between kernel and user-space in a
similar fashion, the plan is to do it in a distinct system call.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-03 Thread Peter Zijlstra
On Wed, May 02, 2018 at 08:37:13PM +, Daniel Colascione wrote:
> > Recursive locks are the most horrible crap ever. And having the tid in
> 
> What happened to providing mechanism, not policy?
> 
> You can't wish away recursive locking. It's baked into Java and the CLR,
> and it's enshrined in POSIX. It's not going away, and there's no reason not
> to support it efficiently.

You can implement recursive locks just fine with a TID based word, just
keep the recursion counter external to the futex word. If owner==self,
increment etc..
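For illustration, that scheme might look like the sketch below: the lock word holds the owner TID (0 == unlocked) and the recursion depth lives outside the word. Since the FUTEX_LOCK/FUTEX_UNLOCK operations discussed above are not in mainline, the slow path here falls back to plain FUTEX_WAIT/FUTEX_WAKE; names and layout are invented for this example, and error handling is elided.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Recursive mutex over a TID-valued word, recursion count external. */
struct rec_mutex {
    _Atomic uint32_t word;   /* 0 or owner TID */
    uint32_t depth;          /* protected by the lock itself */
};

static uint32_t self_tid(void)
{
    return (uint32_t)syscall(SYS_gettid);
}

static void rec_mutex_lock(struct rec_mutex *m)
{
    uint32_t self = self_tid();

    /* If owner == self, just bump the external recursion counter. */
    if (atomic_load_explicit(&m->word, memory_order_relaxed) == self) {
        m->depth++;
        return;
    }
    for (;;) {
        uint32_t expected = 0;
        if (atomic_compare_exchange_weak(&m->word, &expected, self))
            break;                       /* acquired: word == our TID */
        /* Sleep until the word changes from the observed owner TID. */
        syscall(SYS_futex, &m->word, FUTEX_WAIT_PRIVATE, expected,
                NULL, NULL, 0);
    }
    m->depth = 1;
}

static void rec_mutex_unlock(struct rec_mutex *m)
{
    if (--m->depth == 0) {
        atomic_store(&m->word, 0);
        syscall(SYS_futex, &m->word, FUTEX_WAKE_PRIVATE, 1,
                NULL, NULL, 0);
    }
}
```

Note the recursion support costs nothing on the futex word itself, which is the point being made: the word stays a pure owner TID, compatible with kernel-side spinning or PI.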

> > the word allows things like kernel based optimistic spins and possibly
> > PI related things.
> 
> Sure. A lot of people don't want PI though, or at least they want to opt
> into it. And we shouldn't require an entry into the kernel for what we can
> in principle do efficiently in userspace.

Any additional PI would certainly be opt-in, but the kernel based
spinning might make sense unconditionally.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Steven Rostedt
On Wed, 02 May 2018 20:37:13 +
Daniel Colascione  wrote:

> On Wed, May 2, 2018 at 1:23 PM Peter Zijlstra  wrote:
> 
> > On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote:  
> > > On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra   
> wrote:
> > > >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > > > > Suppose we make a userspace mutex implemented with a lock word having three
> > > > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> > > > > being relevant at the moment.
> > >  
> > > > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> > > > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> > > > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> > > > owner TID.
> > >
> > > That doesn't work if you want to use the rest of the word for something
> > > else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
> > > a lock with two bits.
> 
> > Recursive locks are the most horrible crap ever. And having the tid in
> 
> What happened to providing mechanism, not policy?
> 
> You can't wish away recursive locking. It's baked into Java and the CLR,
> and it's enshrined in POSIX. It's not going away, and there's no reason not
> to support it efficiently.
> 
> > the word allows things like kernel based optimistic spins and possibly
> > PI related things.  
> 
> Sure. A lot of people don't want PI though, or at least they want to opt
> into it. And we shouldn't require an entry into the kernel for what we can
> in principle do efficiently in userspace.
> 
> > > > As brought up in the last time we talked about spin loops, why do we
> > > > care if the spin loop is in userspace or not? Aside from the whole PTI
> > > > thing, the syscall cost was around 150 cycles or so, while a LOCK CMPXCHG
> > > > is around 20 cycles. So ~7 spins gets you the cost of entry.

What about exit?
 
> > >
> > > That's pre-KPTI, isn't it?  
> 
> > Yes, and once the hardware gets sorted, we'll be there again. I don't
> > think we should design interfaces for 'broken' hardware.  
> 
> It would be a mistake to design interfaces under the assumption that
> everyone has fast permission level transitions.

Note, Robert Haas told me a few years ago at a plumbers conference that
postgresql implements their own user space spin locks because anything
that goes into the kernel has killed the performance. And they tried to
use futex but that still didn't beat out plain userspace locks.

-- Steve


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Daniel Colascione
On Wed, May 2, 2018 at 1:23 PM Peter Zijlstra  wrote:

> On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote:
> > On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra 
wrote:
> > >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > > > Suppose we make a userspace mutex implemented with a lock word having three
> > > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> > > > being relevant at the moment.
> >
> > > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> > > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> > > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> > > owner TID.
> >
> > That doesn't work if you want to use the rest of the word for something
> > else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
> > a lock with two bits.

> Recursive locks are the most horrible crap ever. And having the tid in

What happened to providing mechanism, not policy?

You can't wish away recursive locking. It's baked into Java and the CLR,
and it's enshrined in POSIX. It's not going away, and there's no reason not
to support it efficiently.

> the word allows things like kernel based optimistic spins and possibly
> PI related things.

Sure. A lot of people don't want PI though, or at least they want to opt
into it. And we shouldn't require an entry into the kernel for what we can
in principle do efficiently in userspace.

> > > As brought up in the last time we talked about spin loops, why do we
> > > care if the spin loop is in userspace or not? Aside from the whole PTI
> > > thing, the syscall cost was around 150 cycles or so, while a LOCK CMPXCHG
> > > is around 20 cycles. So ~7 spins gets you the cost of entry.
> >
> > That's pre-KPTI, isn't it?

> Yes, and once the hardware gets sorted, we'll be there again. I don't
> think we should design interfaces for 'broken' hardware.

It would be a mistake to design interfaces under the assumption that
everyone has fast permission level transitions.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Peter Zijlstra
On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote:
> On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra  wrote:
> >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > > Suppose we make a userspace mutex implemented with a lock word having three
> > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> > > being relevant at the moment.
> 
> > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> > owner TID.
> 
> That doesn't work if you want to use the rest of the word for something
> else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
> a lock with two bits.

Recursive locks are the most horrible crap ever. And having the tid in
the word allows things like kernel based optimistic spins and possibly
PI related things.

> > As brought up in the last time we talked about spin loops, why do we
> > care if the spin loop is in userspace or not? Aside from the whole PTI
> > thing, the syscall cost was around 150 cycles or so, while a LOCK CMPXCHG
> > is around 20 cycles. So ~7 spins gets you the cost of entry.
> 
> That's pre-KPTI, isn't it?

Yes, and once the hardware gets sorted, we'll be there again. I don't
think we should design interfaces for 'broken' hardware.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Daniel Colascione
On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra  wrote:
>> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > Suppose we make a userspace mutex implemented with a lock word having three
> > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> > being relevant at the moment.

> So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> owner TID.

That doesn't work if you want to use the rest of the word for something
else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
a lock with two bits.
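One well-known shape for the two-bit lock Daniel is describing is the protocol from Ulrich Drepper's "Futexes Are Tricky": word states 0 (unlocked), 1 (locked, no waiters), 2 (locked, waiters pending), i.e. only two bits of state, leaving the rest of the word free. A minimal sketch (masking for coexisting state and error handling elided):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static long sys_futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void mutex_lock(_Atomic uint32_t *w)
{
    uint32_t c = 0;

    /* Fast path: 0 -> 1 (locked, no waiters). */
    if (atomic_compare_exchange_strong(w, &c, 1))
        return;
    /* Slow path: advertise a waiter (state 2), sleep on that exact
     * value, and reacquire in state 2 so our own unlock will wake
     * the next waiter. */
    if (c != 2)
        c = atomic_exchange(w, 2);
    while (c != 0) {
        sys_futex(w, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(w, 2);
    }
}

void mutex_unlock(_Atomic uint32_t *w)
{
    if (atomic_fetch_sub(w, 1) != 1) {   /* was 2: waiters present */
        atomic_store(w, 0);
        sys_futex(w, FUTEX_WAKE_PRIVATE, 1);
    }
}
```

The contested-release wake only happens when the waiters state was observed, which is what keeps the uncontended paths to a single atomic each.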

> As brought up in the last time we talked about spin loops, why do we
> care if the spin loop is in userspace or not? Aside from the whole PTI
> thing, the syscall cost was around 150 cycles or so, while a LOCK CMPXCHG
> is around 20 cycles. So ~7 spins gets you the cost of entry.

That's pre-KPTI, isn't it?


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Peter Zijlstra
On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> Suppose we make a userspace mutex implemented with a lock word having three
> bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> being relevant at the moment.

So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
owner TID.

As brought up in the last time we talked about spin loops, why do we
care if the spin loop is in userspace or not? Aside from the whole PTI
thing, the syscall cost was around 150 cycles or so, while a LOCK CMPXCHG
is around 20 cycles. So ~7 spins gets you the cost of entry.
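Peter's arithmetic translates directly into a fixed spin budget before entering the kernel: once roughly seven CMPXCHG attempts have failed, the spinning has cost about as much as one (pre-PTI) syscall would have. A sketch using the numbers from the thread; SPIN_BUDGET and the function name are invented here, and the matching unlock (store 0, then FUTEX_WAKE) is omitted:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Rough pre-PTI figures quoted in the thread. */
#define SYSCALL_CYCLES 150
#define CMPXCHG_CYCLES  20
#define SPIN_BUDGET (SYSCALL_CYCLES / CMPXCHG_CYCLES)   /* == 7 */

/* Spin up to SPIN_BUDGET times, then fall back to sleeping in the
 * kernel: past that point the failed spins have already cost about
 * one kernel entry. */
static void spin_then_sleep_lock(_Atomic uint32_t *w)
{
    for (;;) {
        for (int i = 0; i < SPIN_BUDGET; i++) {
            uint32_t zero = 0;
            if (atomic_compare_exchange_strong(w, &zero, 1))
                return;                  /* acquired while spinning */
        }
        /* Budget exhausted: sleep until the word changes from 1. */
        syscall(SYS_futex, w, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
    }
}
```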


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Daniel Colascione
On Wed, May 2, 2018 at 9:42 AM Steven Rostedt  wrote:

> On Wed, 02 May 2018 16:07:48 +
> Daniel Colascione  wrote:

> > Why couldn't we take a page fault just before schedule? The reason we can't
> > take a page fault in atomic context is that doing so might call schedule.
> > Here, we're about to call schedule _anyway_, so what harm does it do to
> > call something that might call schedule? If we schedule via that call, we
> > can skip the manual schedule we were going to perform.

> Another issue is slowing down something that is considered a fast path.

There are two questions: 1) does this feature slow down schedule when
you're not using it? and 2) is schedule unacceptably slow when you are
using this feature?

The answer to #1 is no; rseq already tests current->rseq during task switch
(via rseq_set_notify_resume), so adding a single further branch (which we'd
only test when we follow the current->rseq path anyway) isn't a problem.

Regarding #2: yes, a futex operation will increase path length for that one
task switch, but in the no-page-fault case, only by a tiny amount. We can
run benchmarks of course, but I don't see any reason to suspect that the
proposal would make task switching unacceptably slow. If we *do* take a
page fault, we won't have done much additional work overall, since
*somebody* is going to take that page fault anyway when the lock is
released, and the latency of the task switch shouldn't increase, since the
futex code will very quickly realize that it needs to sleep and call
schedule anyway.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Steven Rostedt
On Wed, 02 May 2018 16:07:48 +
Daniel Colascione  wrote:

> Why couldn't we take a page fault just before schedule? The reason we can't
> take a page fault in atomic context is that doing so might call schedule.
> Here, we're about to call schedule _anyway_, so what harm does it do to
> call something that might call schedule? If we schedule via that call, we
> can skip the manual schedule we were going to perform.

Another issue is slowing down something that is considered a fast path.

-- Steve


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Daniel Colascione
On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <
mathieu.desnoy...@efficios.com> wrote:

> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com
wrote:
> [...]
> >
> > I think a small enhancement to rseq would let us build a perfect userspace
> > mutex, one that spins on lock-acquire only when the lock owner is running
> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc
> > spin counts and from trying to detect situations in which spinning is
> > generally pointless.
> >
> > It'd work like this: in the per-thread rseq data structure, we'd include a
> > description of a futex operation for the kernel to perform (in the
> > context of the preempted thread) upon preemption, immediately before
> > schedule(). If the futex operation itself sleeps, that's no problem: we
> > will have still accomplished our goal of running some other thread instead
> > of the preempted thread.

> Hi Daniel,

> I agree that the problem you are aiming to solve is important. Let's see
> what prevents the proposed rseq implementation from doing what you envision.

> The main issue here is touching userspace immediately before schedule().
> At that specific point, it's not possible to take a page fault. In the proposed
> rseq implementation, we get away with it by raising a task struct flag, and using
> it in a return to userspace notifier (where we can actually take a fault), where
> we touch the userspace TLS area.

> If we can find a way to solve this limitation, then the rest of your design
> makes sense to me.

Thanks for taking a look!

Why couldn't we take a page fault just before schedule? The reason we can't
take a page fault in atomic context is that doing so might call schedule.
Here, we're about to call schedule _anyway_, so what harm does it do to
call something that might call schedule? If we schedule via that call, we
can skip the manual schedule we were going to perform.
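For concreteness, the extension being debated could be pictured as a small per-thread descriptor alongside the registered rseq TLS area. To be clear, nothing like this exists in the rseq ABI; every name and field below is invented purely for discussion:

```c
#include <stdint.h>

/* Entirely hypothetical layout: one futex operation the kernel would
 * carry out on the preempted thread's behalf, in that thread's
 * context, immediately before schedule(). If the operation itself
 * sleeps, the goal of running some other thread is still met. */
struct rseq_preempt_op {
    uint64_t uaddr;      /* userspace futex word to operate on */
    uint32_t futex_op;   /* e.g. a FUTEX_WAKE-style opcode */
    uint32_t val;        /* value argument for that opcode */
};
```

The open question in the thread is exactly where such a descriptor would live (Mathieu argues for a separate TLS area and system call) and how the kernel could safely touch it at that point.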


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Daniel Colascione
On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers <
mathieu.desnoy...@efficios.com> wrote:

> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com
wrote:
> [...]
> >
> > I think a small enhancement to rseq would let us build a perfect
> > userspace mutex, one that spins on lock-acquire only when the lock
> > owner is running and that sleeps otherwise, freeing userspace from
> > both specifying ad-hoc spin counts and from trying to detect
> > situations in which spinning is generally pointless.
> >
> > It'd work like this: in the per-thread rseq data structure, we'd
> > include a description of a futex operation for the kernel to perform
> > (in the context of the preempted thread) upon preemption, immediately
> > before schedule(). If the futex operation itself sleeps, that's no
> > problem: we will have still accomplished our goal of running some
> > other thread instead of the preempted thread.

> Hi Daniel,

> I agree that the problem you are aiming to solve is important. Let's
> see what prevents the proposed rseq implementation from doing what you
> envision.

> The main issue here is touching userspace immediately before
> schedule(). At that specific point, it's not possible to take a page
> fault. In the proposed rseq implementation, we get away with it by
> raising a task struct flag, and using it in a return to userspace
> notifier (where we can actually take a fault), where we touch the
> userspace TLS area.

> If we can find a way to solve this limitation, then the rest of your
> design makes sense to me.

Thanks for taking a look!

Why couldn't we take a page fault just before schedule? The reason we can't
take a page fault in atomic context is that doing so might call schedule.
Here, we're about to call schedule _anyway_, so what harm does it do to
call something that might call schedule? If we schedule via that call, we
can skip the manual schedule we were going to perform.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Mathieu Desnoyers
- On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com wrote:
[...]
> 
> I think a small enhancement to rseq would let us build a perfect userspace
> mutex, one that spins on lock-acquire only when the lock owner is running
> and that sleeps otherwise, freeing userspace from both specifying ad-hoc
> spin counts and from trying to detect situations in which spinning is
> generally pointless.
> 
> It'd work like this: in the per-thread rseq data structure, we'd include a
> description of a futex operation for the kernel to perform (in the
> context of the preempted thread) upon preemption, immediately before
> schedule(). If the futex operation itself sleeps, that's no problem: we
> will have still accomplished our goal of running some other thread instead
> of the preempted thread.

Hi Daniel,

I agree that the problem you are aiming to solve is important. Let's see
what prevents the proposed rseq implementation from doing what you envision.

The main issue here is touching userspace immediately before schedule().
At that specific point, it's not possible to take a page fault. In the
proposed rseq implementation, we get away with it by raising a task
struct flag, and using it in a return to userspace notifier (where we
can actually take a fault), where we touch the userspace TLS area.

If we can find a way to solve this limitation, then the rest of your design
makes sense to me.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-02 Thread Peter Zijlstra
On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> The usual approach to "better" is an "adaptive mutex". Such a thing, when
> it attempts to acquire a lock another thread owns, spins for some number of
> iterations, then falls back to futex. I guess that's a little better than
> immediately jumping to futex, but it's not optimal. We can still spin when
> the lock owner isn't scheduled, and the spin count is usually some guess
> (either specified manually or estimated statistically) that's not
> guaranteed to produce decent results. Even if we do pick a good spin count,
> we run a very good chance of under- or over-spinning on any given
> lock-acquire. We always want to sleep when spinning would be pointless.

Look for the FUTEX_LOCK patches from Waiman.


Re: [RFC PATCH for 4.18 00/14] Restartable Sequences

2018-05-01 Thread Daniel Colascione
Hi Mathieu: this work looks very cool. See inline.

On Mon, Apr 30, 2018 at 3:48 PM Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:

> Hi,

> Here is an updated RFC round of the Restartable Sequences patchset
> based on kernel 4.17-rc3. Based on feedback from Linus, I'm introducing
> only the rseq system call, keeping the rest for later.

> This already enables speeding up the Facebook jemalloc and arm64 PMC
> read from user-space use-cases, as well as speedup of use-cases relying
> on getting the current cpu number from user-space. We'll have to wait
> until a more complete solution is introduced before the LTTng-UST
> tracer can replace its ring buffer atomic instructions with rseq
> though. But let's proceed one step at a time.

I like the general theme of the kernel using its "superpowers" (in this
case, knowledge of preemption) to help userspace do a better job without
userspace code needing to enter the kernel to benefit. The per-CPU data
structures this patch enables help in a lot of use cases, but I think
there's another use case that you might not have considered, one that can
benefit from an extension to your proposed API.

Consider mutexes: in the kernel, for mutual exclusion, we can use a
spinlock, which in the kernel ends up being simpler and (in a lot of
scenarios) more efficient than a mutex: a core that takes a spinlock
promises to keep the lock held for only a very short time, and it disables
interrupts to delay asynchronous work that might unexpectedly lengthen the
critical section. A different core that wants to grab that spinlock can
just spin on the lock word, confident that its spin will be short because
any core owning the lock is guaranteed to release it very quickly. (Long
spins would be very bad for power.) The overall result is a lock that's
much lighter than a mutex. (A spinlock can also be used in places we can't
sleep, but this ability isn't relevant to the discussion below.)

Userspace doesn't have a good equivalent to a lightweight spinlock. While
you can build a spinlock in userspace, the result ends up having serious
problems because of preemption: first, a thread owning such a lock can be
preempted in its critical section, lengthening the critical section
arbitrarily. Second, a thread spinning on a lock will keep spinning even
when the owning thread isn't scheduled anywhere.

Userspace can just implement a mutex as a try-acquire and a FUTEX_WAIT on
failure. This approach works fine when there's no contention, but a system
call is a pretty heavy operation. Why pay for a system call for occasional
light contention on a short critical section? Can we do better?

The usual approach to "better" is an "adaptive mutex". Such a thing, when
it attempts to acquire a lock another thread owns, spins for some number of
iterations, then falls back to futex. I guess that's a little better than
immediately jumping to futex, but it's not optimal. We can still spin when
the lock owner isn't scheduled, and the spin count is usually some guess
(either specified manually or estimated statistically) that's not
guaranteed to produce decent results. Even if we do pick a good spin count,
we run a very good chance of under- or over-spinning on any given
lock-acquire. We always want to sleep when spinning would be pointless.

One important case of the spin-while-not-scheduled problem is operation on
a uniprocessor system: on such a system, only a single task can be
scheduled at a time, making all spins maximally pointless. The usual
approach to avoiding wasted spins for adaptive mutexes is for the adaptive
mutex library to ask upon initialization "How many cores are in this
system?", and if the answer comes back as "1", disable spinning. This
approach is inadequate: CPU affinity can change at arbitrary times, and CPU
affinity can produce the same spin pessimization that a uniprocessor system
does.

I think a small enhancement to rseq would let us build a perfect userspace
mutex, one that spins on lock-acquire only when the lock owner is running
and that sleeps otherwise, freeing userspace from both specifying ad-hoc
spin counts and from trying to detect situations in which spinning is
generally pointless.

It'd work like this: in the per-thread rseq data structure, we'd include a
description of a futex operation for the kernel to perform (in the
context of the preempted thread) upon preemption, immediately before
schedule(). If the futex operation itself sleeps, that's no problem: we
will have still accomplished our goal of running some other thread instead
of the preempted thread.

Suppose we make a userspace mutex implemented with a lock word having three
bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
being relevant at the moment.

We'd implement lock-acquire the usual way, CASing the acquired bit into the
lock and deeming the lock taken if we're successful. Except that before
attempting the CAS, we'd configure the current thread's rseq area to


[RFC PATCH for 4.18 00/14] Restartable Sequences

2018-04-30 Thread Mathieu Desnoyers
Hi,

Here is an updated RFC round of the Restartable Sequences patchset
based on kernel 4.17-rc3. Based on feedback from Linus, I'm introducing
only the rseq system call, keeping the rest for later.

This already enables speeding up the Facebook jemalloc and arm64 PMC
read from user-space use-cases, as well as speedup of use-cases relying
on getting the current cpu number from user-space. We'll have to wait
until a more complete solution is introduced before the LTTng-UST
tracer can replace its ring buffer atomic instructions with rseq
though. But let's proceed one step at a time.

The main change introduced by the removal of cpu_opv from this series
in terms of library use from user-space is that APIs that previously
took a CPU number as argument now only act on the current CPU.

So for instance, this turns:

  int cpu = rseq_per_cpu_lock(lock, target_cpu);
  [...]
  rseq_per_cpu_unlock(lock, cpu);

into

  int cpu = rseq_this_cpu_lock(lock);
  [...]
  rseq_per_cpu_unlock(lock, cpu);

and:

  per_cpu_list_push(list, node, target_cpu);
  [...]
  per_cpu_list_pop(list, node, target_cpu);

into

  this_cpu_list_push(list, node, &cpu);  /* cpu is an output parameter. */
  [...]
  node = this_cpu_list_pop(list, &cpu);  /* cpu is an output parameter. */

Eventually integrating cpu_opv or some alternative will allow passing
the cpu number as parameter rather than requiring the algorithm to work
on the current CPU.

The second effect of not having the cpu_opv fallback is that
line and instruction single-stepping with a debugger transforms rseq
critical sections based on retry loops into never-ending loops.
Debuggers need to use the __rseq_table section to skip those critical
sections in order to correctly behave when single-stepping a thread
which uses rseq in a retry loop. However, applications which use an
alternative fallback method rather than retrying on rseq fast-path abort
won't be affected by this kind of single-stepping issue.

Feedback is welcome!

Thanks,

Mathieu

Boqun Feng (2):
  powerpc: Add support for restartable sequences
  powerpc: Wire up restartable sequences system call

Mathieu Desnoyers (12):
  uapi headers: Provide types_32_64.h (v2)
  rseq: Introduce restartable sequences system call (v13)
  arm: Add restartable sequences support
  arm: Wire up restartable sequences system call
  x86: Add support for restartable sequences (v2)
  x86: Wire up restartable sequence system call
  selftests: lib.mk: Introduce OVERRIDE_TARGETS
  rseq: selftests: Provide rseq library (v5)
  rseq: selftests: Provide basic test
  rseq: selftests: Provide basic percpu ops test (v2)
  rseq: selftests: Provide parametrized tests (v2)
  rseq: selftests: Provide Makefile, scripts, gitignore (v2)

 MAINTAINERS|   12 +
 arch/Kconfig   |7 +
 arch/arm/Kconfig   |1 +
 arch/arm/kernel/signal.c   |7 +
 arch/arm/tools/syscall.tbl |1 +
 arch/powerpc/Kconfig   |1 +
 arch/powerpc/include/asm/systbl.h  |1 +
 arch/powerpc/include/asm/unistd.h  |2 +-
 arch/powerpc/include/uapi/asm/unistd.h |1 +
 arch/powerpc/kernel/signal.c   |3 +
 arch/x86/Kconfig   |1 +
 arch/x86/entry/common.c|3 +
 arch/x86/entry/syscalls/syscall_32.tbl |1 +
 arch/x86/entry/syscalls/syscall_64.tbl |1 +
 arch/x86/kernel/signal.c   |6 +
 fs/exec.c  |1 +
 include/linux/sched.h  |  134 +++
 include/linux/syscalls.h   |4 +-
 include/trace/events/rseq.h|   56 +
 include/uapi/linux/rseq.h  |  150 +++
 include/uapi/linux/types_32_64.h   |   67 ++
 init/Kconfig   |   23 +
 kernel/Makefile|1 +
 kernel/fork.c  |2 +
 kernel/rseq.c  |  366 ++
 kernel/sched/core.c|2 +
 kernel/sys_ni.c|3 +
 tools/testing/selftests/Makefile   |1 +
 tools/testing/selftests/lib.mk |4 +
 tools/testing/selftests/rseq/.gitignore|6 +
 tools/testing/selftests/rseq/Makefile  |   29 +
 .../testing/selftests/rseq/basic_percpu_ops_test.c |  312 +
 tools/testing/selftests/rseq/basic_test.c  |   55 +
 tools/testing/selftests/rseq/param_test.c  | 1259 
 tools/testing/selftests/rseq/rseq-arm.h|  732 
 tools/testing/selftests/rseq/rseq-ppc.h|  688 +++
 
