Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-28 Thread Mathieu Desnoyers
- On Jun 27, 2015, at 12:25 PM, Andy Lutomirski l...@amacapital.net wrote:

> Let me try to summarize some of the approaches with their pros and cons:
> 

I can try summarizing the desiderata I gather from this
thread so far:

- *very fast* accesses for per-cpu push/pop, per-cpu lock
  acquisition, and per-cpu counters,
- guaranteed progress (don't loop forever when single-stepped),
- critical section implementation flexibility:
  - can sections be written only in assembly, or also in C,
  - is a single building block provided (e.g. cmpxchg), or can
    applications/libraries write arbitrary critical sections,
- portability to non-x86 architectures,
- low intrusiveness for injection into applications (no signal
  handler, no segment selector used by pre-existing applications),
- can be used by either a single or many shared objects per process,
- don't disturb real-time scheduling,
- minimal slowdown of kernel scheduling execution.

More comments on each approach below:

> --- percpu segment ---
> 
> This is probably the simplest and might make sense regardless.
> cmpxchg can be used to do an atomic push onto a linked list.  I think
> that unlocked cmpxchg16b can be used to get an atomic pop.  (You'd
> have the list head pointer next to an auxiliary pointer to the second
> element in the list, perhaps.)

Based on http://www.agner.org/optimize/instruction_tables.pdf
On Intel Haswell:

instruction   latency (cycles)
cmpxchg:  8
lock cmpxchg: 19
cmpxchg16b:   15

cmpxchg16b does not appear to be particularly fast (twice the latency of
cmpxchg).
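As a rough illustration of the push side, a sketch only: the CPU index is
passed explicitly here, standing in for gs-relative addressing, and the
names (percpu_push, NCPUS) are made up for this example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    int val;
};

/* One list head per CPU.  The proposal addresses this through a per-cpu
 * gs segment; here the CPU index is an explicit parameter instead (it
 * would come from the segment base in the real scheme). */
#define NCPUS 64
static _Atomic(struct node *) percpu_head[NCPUS];

/* Atomic push: a single plain (unlocked) cmpxchg on this CPU's head.
 * Correct even across migration, because the CAS remains atomic against
 * threads still running on the original CPU. */
static void percpu_push(int cpu, struct node *n)
{
    struct node *old = atomic_load(&percpu_head[cpu]);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&percpu_head[cpu], &old, n));
}
```

Pop is the harder half: a single-width CAS on the head races with a
concurrent pop reusing the node, which is why the double-width
cmpxchg16b (head pointer plus auxiliary pointer) comes up above.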

> 
> You can also use this for limited forms of speculative locking.
> Aborting cleanly if your lock is stolen might require the kernel's
> help, though (you're now on the wrong cpu, so you can't atomically
> poke the lock variable any more).
> 
> The ABI is straightforward, and the only limitation on multiple users
> in the same process is that they need to coordinate their offsets into
> the percpu segment.

One more downside of this approach: some applications already use
the gs segment (AFAIK the Wine emulator uses it), so it can be
prohibitive to use it from tracing code injected into pre-existing
applications.

Another downside is that it is x86-specific.

> 
> --- vdso-provided atomic ops ---
> 
> This could be quite flexible.  The upside is that the ABI would be
> straightforward (call a function with clearly-specified behavior).

Following the same reference as above, the call/ret pair alone would
cost about 5 cycles.

> The downside is that implementing it well might require percpu
> segments and a certain amount of coordination, and it requires a
> function call.

Same downside as above about gs segment being already used by some
applications.

> 
> One nice thing about doing it in the vdso is that we can change the
> implementation down the road.

Yes, this is clearly an advantage over letting applications inline
their cmpxchg on gs:.

Same downside as above about being x86-specific.

> 
> --- kernel preemption hooks ---
> 
> I'm defining a preemption hook as an action taken by the kernel when a
> user task is preempted during a critical section.
> 
> As an upside, we get extremely efficient, almost arbitrary percpu
> operations.  We don't need to worry about memory ordering at all,
> because the whole sequence aborts if anything else might run on the
> same cpu.  Push and pop are both easy.
> 
> One con is that actually defining where the critical section is might
> be nasty.  If there's a single IP range, then two libraries could
> fight over it.  We could have a variable somewhere that you write to
> arm the critical section, but that's a bit slower.

My current understanding is that we have two means to tell the kernel
"we are in a critical section": either through register content or
through a per-thread memory area. Paul's implementation uses the
instruction pointer, but it could perhaps use another reserved register
state, which might help us write the critical sections in C code rather
than assembly. It might be tricky to find a register that is guaranteed
not to be used, though, hence the per-thread memory area.

The per-thread memory area has the advantage of allowing the critical
sections to be implemented in C code rather than assembly, but, as you
say, its downside is that we need at the very least to set/clear a TLS
flag (or have a nesting counter) surrounding the critical section. This
approach is quite similar to preempt_disable()/preempt_enable() in the
Linux kernel.
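A minimal sketch of that per-thread flag, assuming a structure layout and
field names that are purely illustrative (the kernel-side behavior, e.g. a
hypothetical need_restart flag set on preemption, is an assumption, not an
existing ABI):

```c
#include <assert.h>

/* Per-thread critical-section descriptor the kernel would inspect on
 * preemption.  Layout and names are illustrative only. */
struct rseq_state {
    int cs_nesting;     /* non-zero => inside a critical section */
    int need_restart;   /* hypothetically set by the kernel on preemption */
};

static __thread struct rseq_state rseq_state;

/* Compiler barrier: keep the flag updates ordered against the section
 * body, much like preempt_disable()/preempt_enable() in the kernel. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static inline void percpu_cs_enter(void)
{
    rseq_state.cs_nesting++;
    barrier();          /* flag write happens before the section body */
}

/* Returns non-zero if the kernel flagged a restart while we were in. */
static inline int percpu_cs_exit(void)
{
    barrier();          /* section body completes before clearing flag */
    rseq_state.cs_nesting--;
    return rseq_state.cs_nesting == 0 && rseq_state.need_restart;
}
```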

Another advantage of preemption hooks over migration hooks
is that the critical section can assume it has mutually exclusive
access, which is not the case for migration hooks, because the
section can be preempted and continue execution afterward. This means
that what can be achieved with e.g. "load+test+store" under preemption
hooks needs to be performed with "cmpxchg" under migration hooks. This
might not be a huge issue for x86, but can become more expensive on
other architectures.
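The load+test+store vs. cmpxchg distinction can be sketched on a per-cpu
counter (function names are made up for this example):

```c
#include <assert.h>
#include <stdatomic.h>

/* With preemption hooks, the whole sequence aborts if anything else
 * runs on this CPU, so a plain read-modify-write is safe: */
static void inc_with_preempt_hooks(long *counter)
{
    *counter += 1;      /* load + add + store, no atomics needed */
}

/* With migration hooks only, another thread may run on the same CPU in
 * the middle of our sequence, so the update must be a local cmpxchg: */
static void inc_with_migration_hooks(_Atomic long *counter)
{
    long old = atomic_load_explicit(counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(counter, &old, old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;   /* retry; a failed CAS reloads old for us */
}
```

On x86 the unlocked cmpxchg is cheap; on architectures where the local
atomic is a ll/sc loop, the second variant is where the extra cost shows up.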

> 


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-27 Thread Andy Lutomirski
Let me try to summarize some of the approaches with their pros and cons:

--- percpu segment ---

This is probably the simplest and might make sense regardless.
cmpxchg can be used to do an atomic push onto a linked list.  I think
that unlocked cmpxchg16b can be used to get an atomic pop.  (You'd
have the list head pointer next to an auxiliary pointer to the second
element in the list, perhaps.)

You can also use this for limited forms of speculative locking.
Aborting cleanly if your lock is stolen might require the kernel's
help, though (you're now on the wrong cpu, so you can't atomically
poke the lock variable any more).

The ABI is straightforward, and the only limitation on multiple users
in the same process is that they need to coordinate their offsets into
the percpu segment.

--- vdso-provided atomic ops ---

This could be quite flexible.  The upside is that the ABI would be
straightforward (call a function with clearly-specified behavior).
The downside is that implementing it well might require percpu
segments and a certain amount of coordination, and it requires a
function call.

One nice thing about doing it in the vdso is that we can change the
implementation down the road.

--- kernel preemption hooks ---

I'm defining a preemption hook as an action taken by the kernel when a
user task is preempted during a critical section.

As an upside, we get extremely efficient, almost arbitrary percpu
operations.  We don't need to worry about memory ordering at all,
because the whole sequence aborts if anything else might run on the
same cpu.  Push and pop are both easy.

One con is that actually defining where the critical section is might
be nasty.  If there's a single IP range, then two libraries could
fight over it.  We could have a variable somewhere that you write to
arm the critical section, but that's a bit slower.

Another con is that you can't single-step through this type of
critical section.  It will be preempted every time.

--- kernel migration hooks ---

I'm not sure either Paul or Mathieu discussed this, but another option
would be to have some special handling if a task is migrated during a
critical section or to allow a task to prevent migration entirely
during a critical section.  From the user's point of view, this is
weaker than preemption hooks: it's possible to start your critical
section, be preempted, and have another thread enter its own critical
section, then get rescheduled on the same cpu without aborting.  Users
would have to use local atomics (like cmpxchg) to make it useful.

As a major advantage, single-stepping still works.

This shares the coordination downside with preemption hooks (users
have to tell the kernel about their critical sections somehow).

Push can certainly be implemented using cmpxchg.  The gs prefix isn't
even needed.  Pop might be harder to implement directly without
resorting to cmpxchg16b or similar.

--- Unnamed trick ---

On entry to a critical section, try to take a per-cpu lock that stores
the holder's tid.  This might require percpu segments.

If you get the lock, then start doing your thing.  For example, you
could pop by reading head->next and writing it back to head.

If, however, you miss the lock, then you need to either wait or
forcibly abort the lock holder.  You could do the latter by sending a
signal or possibly using a new syscall that atomically aborts the lock
holder and takes the lock.  You don't need to wait, though -- all you
need to do is queue the signal and, if the lock holder is actually
running, wait for signal delivery to start.
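The lock word for that trick might look like the following sketch; the tid
is passed explicitly (standing in for gettid()), the names are invented,
and the forcible-abort path would need the new syscall or signal described
above, which is not shown:

```c
#include <assert.h>
#include <stdatomic.h>
#include <sys/types.h>

/* Per-cpu lock word holding the owner's tid; 0 means free.  A real
 * version would address it through a per-cpu segment. */
static _Atomic pid_t percpu_lock;

static int percpu_trylock(pid_t tid)
{
    pid_t free_val = 0;
    return atomic_compare_exchange_strong(&percpu_lock, &free_val, tid);
}

static void percpu_unlock(pid_t tid)
{
    /* Only succeeds if we still own the lock; it may have been
     * forcibly stolen from us while we were preempted. */
    atomic_compare_exchange_strong(&percpu_lock, &tid, 0);
}
```

Storing the tid is what makes the abort path possible at all: a contender
can see exactly which thread to signal or to abort via the new syscall.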


Thoughts?  I personally like the other options better than preemption
hooks.  I prefer solutions that don't interfere with debugging.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-25 Thread Paul Turner
On Thu, Jun 25, 2015 at 6:15 PM, Mathieu Desnoyers
 wrote:
> - On Jun 24, 2015, at 10:54 PM, Paul Turner p...@google.com wrote:
>
>> On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski  wrote:
>>> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner  wrote:
>>>> This is a fairly small series demonstrating a feature we've found
>>>> to be quite powerful in practice, "restartable sequences".

>>>
>>> On an extremely short glance, I'm starting to think that the right
>>> approach, at least for x86, is to implement per-cpu gsbase.  Then you
>>> could do cmpxchg with a gs prefix to atomically take a percpu lock and
>>> atomically release a percpu lock and check whether someone else stole
>>> the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
>>> fast.)
>>>
>>> This is totally useless for other architectures, but I think it would
>>> be reasonably clean on x86.  Thoughts?
>>
>> So this gives semantics that are obviously similar to this_cpu().
>> This provides allows reasonable per-cpu counters (which is alone
>> almost sufficient for a strong user-space RCU implementation giving
>> this some legs).
>>
>> However, unless there's a nice implementation trick I'm missing, the
>> thing that stands out to me for locks (or other primitives) is that
>> this forces a two-phase commit.  There's no way (short of say,
>> cmpxchg16b) to perform a write conditional on the lock not having been
>> stolen from us (and subsequently release the lock).
>>
>> e.g.
>> 1) We take the operation in some sort of speculative mode, that
>> another thread on the same cpu is still allowed to steal from us
>> 2) We prepare what we want to commit
>> 3) At this point we have to promote the lock taken in (1) to perform
>> our actual commit, or see that someone else has stolen (1)
>> 4) Release the promoted lock in (3)
>>
>> However, this means that if we're preempted at (3) then no other
>> thread on that cpu can make progress until we've been rescheduled and
>> released the lock; a nice property of the model we have today is that
>> threads sharing a cpu can not impede each other beyond what the
>> scheduler allows.
>>
>> A lesser concern, but worth mentioning, is that there are also
>> potential pitfalls in the interaction with signal handlers,
>> particularly if a 2-phase commit is used.
>
> Assuming we have a gs segment we can use to address per-cpu locks
> in userspace, would the following scheme take care of some of your
> concerns ?
>
> per-cpu int32_t: each lock initialized to "cpu_nr" value
>
> per-cpu lock:
>   get current cpu number. Remember this value as "CPU lock nr".
>   use cmpxchg on gs:lock to grab the lock.
>   - Expect old value to be "CPU lock nr".
>   - Update with a lock flag in most significant bit, "CPU lock nr"
> in lower bits.
>   - Retry if fails. Can be caused by migration or lock being already
> held.

So this is equivalent to just taking the lock in a non-speculative
mode (e.g. non-stealable) from the start.

The challenge remains exactly the same, as soon as an exclusive mode
exists you have some 'critical' critical inner section about the
commit, which impedes the progress of other threads if interrupted.
[The only reason you'd take the lock in any sort of speculative mode
above would be to reduce the length of this.]

I definitely agree that this gets us locks and counters, there's lots
of good things we can do with those and those alone may be sufficient
to implement it this way but I feel there is a lot of power compared
to what can be done with full lockless semantics that this leaves on
the table.  As a concrete example, consider what this does to the
complexity and usefulness of Pop().

>
> per-cpu unlock:
>   clear lock flag within the "CPU lock nr" lock.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-25 Thread Mathieu Desnoyers
- On Jun 24, 2015, at 10:54 PM, Paul Turner p...@google.com wrote:

> On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski  wrote:
>> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner  wrote:
>>> This is a fairly small series demonstrating a feature we've found to be 
>>> quite
>>> powerful in practice, "restartable sequences".
>>>
>>
>> On an extremely short glance, I'm starting to think that the right
>> approach, at least for x86, is to implement per-cpu gsbase.  Then you
>> could do cmpxchg with a gs prefix to atomically take a percpu lock and
>> atomically release a percpu lock and check whether someone else stole
>> the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
>> fast.)
>>
>> This is totally useless for other architectures, but I think it would
>> be reasonably clean on x86.  Thoughts?
> 
> So this gives semantics that are obviously similar to this_cpu().
> This allows reasonable per-cpu counters (which is alone
> almost sufficient for a strong user-space RCU implementation giving
> this some legs).
> 
> However, unless there's a nice implementation trick I'm missing, the
> thing that stands out to me for locks (or other primitives) is that
> this forces a two-phase commit.  There's no way (short of say,
> cmpxchg16b) to perform a write conditional on the lock not having been
> stolen from us (and subsequently release the lock).
> 
> e.g.
> 1) We take the operation in some sort of speculative mode, that
> another thread on the same cpu is still allowed to steal from us
> 2) We prepare what we want to commit
> 3) At this point we have to promote the lock taken in (1) to perform
> our actual commit, or see that someone else has stolen (1)
> 4) Release the promoted lock in (3)
> 
> However, this means that if we're preempted at (3) then no other
> thread on that cpu can make progress until we've been rescheduled and
> released the lock; a nice property of the model we have today is that
> threads sharing a cpu can not impede each other beyond what the
> scheduler allows.
> 
> A lesser concern, but worth mentioning, is that there are also
> potential pitfalls in the interaction with signal handlers,
> particularly if a 2-phase commit is used.

Assuming we have a gs segment we can use to address per-cpu locks
in userspace, would the following scheme take care of some of your
concerns ?

per-cpu int32_t: each lock initialized to "cpu_nr" value

per-cpu lock:
  get current cpu number. Remember this value as "CPU lock nr".
  use cmpxchg on gs:lock to grab the lock.
  - Expect old value to be "CPU lock nr".
  - Update with a lock flag in most significant bit, "CPU lock nr"
in lower bits.
  - Retry if fails. Can be caused by migration or lock being already
held.

per-cpu unlock:
  clear lock flag within the "CPU lock nr" lock.
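In C-like form, the scheme above might look like this sketch, assuming a
32-bit lock word and an explicit cpu_nr argument in place of the
gs-relative cmpxchg (function names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_FLAG 0x80000000u   /* most significant bit */

/* One lock word per CPU, each initialized to its cpu_nr (unlocked). */
static int percpu_lock_acquire(_Atomic uint32_t *lock, uint32_t cpu_nr)
{
    uint32_t expected = cpu_nr;   /* expect unlocked, on our CPU */
    return atomic_compare_exchange_strong(lock, &expected,
                                          cpu_nr | LOCK_FLAG);
    /* On failure the caller re-reads its CPU number and retries:
     * failure means either migration or the lock already held. */
}

static void percpu_lock_release(_Atomic uint32_t *lock, uint32_t cpu_nr)
{
    atomic_store(lock, cpu_nr);   /* clear the lock flag */
}
```

Seeding the word with cpu_nr is what makes migration detectable: after
moving to another CPU, the cmpxchg fails because the expected value no
longer matches.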

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-25 Thread Paul Turner
On Thu, Jun 25, 2015 at 6:15 PM, Mathieu Desnoyers
mathieu.desnoy...@efficios.com wrote:
 - On Jun 24, 2015, at 10:54 PM, Paul Turner p...@google.com wrote:

 On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner p...@google.com wrote:
 This is a fairly small series demonstrating a feature we've found to be 
 quite
 powerful in practice, restartable sequences.


 On an extremely short glance, I'm starting to think that the right
 approach, at least for x86, is to implement per-cpu gsbase.  Then you
 could do cmpxchg with a gs prefix to atomically take a percpu lock and
 atomically release a percpu lock and check whether someone else stole
 the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
 fast.)

 This is totally useless for other architectures, but I think it would
 be reasonable clean on x86.  Thoughts?

 So this gives semantics that are obviously similar to this_cpu().
 This provides allows reasonable per-cpu counters (which is alone
 almost sufficient for a strong user-space RCU implementation giving
 this some legs).

 However, unless there's a nice implementation trick I'm missing, the
 thing that stands out to me for locks (or other primitives) is that
 this forces a two-phase commit.  There's no way (short of say,
 cmpxchg16b) to perform a write conditional on the lock not having been
 stolen from us (and subsequently release the lock).

 e.g.
 1) We take the operation in some sort of speculative mode, that
 another thread on the same cpu is stilled allowed to steal from us
 2) We prepare what we want to commit
 3) At this point we have to promote the lock taken in (1) to perform
 our actual commit, or see that someone else has stolen (1)
 4) Release the promoted lock in (3)

 However, this means that if we're preempted at (3) then no other
 thread on that cpu can make progress until we've been rescheduled and
 released the lock; a nice property of the model we have today is that
 threads sharing a cpu can not impede each other beyond what the
 scheduler allows.

 A lesser concern, but worth mentioning, is that there are also
 potential pitfalls in the interaction with signal handlers,
 particularly if a 2-phase commit is used.

 Assuming we have a gs segment we can use to address per-cpu locks
 in userspace, would the following scheme take care of some of your
 concerns ?

 per-cpu int32_t: each lock initialized to cpu_nr value

 per-cpu lock:
   get current cpu number. Remember this value as CPU lock nr.
   use cmpxchg on gs:lock to grab the lock.
   - Expect old value to be CPU lock nr.
   - Update with a lock flag in most significant bit, CPU lock nr
 in lower bits.
   - Retry if fails. Can be caused by migration or lock being already
 held.

So this is equivalent to just taking the lock in a non-speculative
mode (e.g. non-stealable) from the start.

The challenge remains exactly the same, as soon as an exclusive mode
exists you have some 'critical' critical inner section about the
commit, which impedes the progress of other threads if interrupted.
[The only reason you'd take the lock in any sort of speculative mode
above would be to reduce the length of this.]

I definitely agree that this gets us locks and counters, there's lots
of good things we can do with those and those alone may be sufficient
to implement it this way but I feel there is a lot of power compared
to what can be done with full lockless semantics that this leaves on
the table.  As a concrete example, consider what this does to the
complexity and usefulness of Pop().


 per-cpu unlock:
   clear lock flag within the CPU lock nr lock.

 Thanks,

 Mathieu

 --
 Mathieu Desnoyers
 EfficiOS Inc.
 http://www.efficios.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-25 Thread Mathieu Desnoyers
- On Jun 24, 2015, at 10:54 PM, Paul Turner p...@google.com wrote:

> On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski l...@amacapital.net wrote:
>> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner p...@google.com wrote:
>>> This is a fairly small series demonstrating a feature we've found to be
>>> quite powerful in practice, "restartable sequences".
>>
>> On an extremely short glance, I'm starting to think that the right
>> approach, at least for x86, is to implement per-cpu gsbase.  Then you
>> could do cmpxchg with a gs prefix to atomically take a percpu lock and
>> atomically release a percpu lock and check whether someone else stole
>> the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
>> fast.)
>>
>> This is totally useless for other architectures, but I think it would
>> be reasonably clean on x86.  Thoughts?
>
> So this gives semantics that are obviously similar to this_cpu().
> This allows reasonable per-cpu counters (which is alone almost
> sufficient for a strong user-space RCU implementation, giving this
> some legs).
>
> However, unless there's a nice implementation trick I'm missing, the
> thing that stands out to me for locks (or other primitives) is that
> this forces a two-phase commit.  There's no way (short of, say,
> cmpxchg16b) to perform a write conditional on the lock not having been
> stolen from us (and subsequently release the lock).
>
> e.g.
>  1) We take the lock in some sort of speculative mode, which another
>     thread on the same cpu is still allowed to steal from us
>  2) We prepare what we want to commit
>  3) At this point we have to promote the lock taken in (1) to perform
>     our actual commit, or see that someone else has stolen (1)
>  4) Release the promoted lock in (3)
>
> However, this means that if we're preempted at (3), no other thread
> on that cpu can make progress until we've been rescheduled and
> released the lock; a nice property of the model we have today is that
> threads sharing a cpu cannot impede each other beyond what the
> scheduler allows.
>
> A lesser concern, but worth mentioning, is that there are also
> potential pitfalls in the interaction with signal handlers,
> particularly if a 2-phase commit is used.

Assuming we have a gs segment we can use to address per-cpu locks
in userspace, would the following scheme take care of some of your
concerns?

per-cpu int32_t: each lock initialized to its cpu_nr value

per-cpu lock:
  get the current cpu number. Remember this value as the CPU lock nr.
  use cmpxchg on gs:lock to grab the lock.
  - Expect the old value to be the CPU lock nr.
  - Update with a lock flag in the most significant bit, CPU lock nr
    in the lower bits.
  - Retry if it fails. Failure can be caused by migration or by the
    lock already being held.

per-cpu unlock:
  clear the lock flag within the CPU lock nr lock.
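For concreteness, the scheme above might look as follows in user space.
This is only a sketch: a plain array indexed by sched_getcpu() stands in
for the proposed gs-addressed per-cpu words, and the GCC __atomic
builtins stand in for a bare (unlocked) cmpxchg.

```c
#define _GNU_SOURCE             /* for sched_getcpu() on glibc */
#include <stdint.h>
#include <stdbool.h>
#include <sched.h>

#define NCPUS     256
#define LOCK_FLAG 0x80000000u   /* most significant bit = lock held */

/* Hypothetical stand-in for the gs-addressed per-cpu lock words: each
 * word is initialized to its own cpu number when unlocked. */
static uint32_t percpu_lock[NCPUS];

static void percpu_lock_init(void)
{
    for (uint32_t i = 0; i < NCPUS; i++)
        percpu_lock[i] = i;
}

/* Returns the CPU lock nr we grabbed. */
static uint32_t percpu_lock_acquire(void)
{
    for (;;) {
        /* get the current cpu number; remember it as the CPU lock nr */
        uint32_t cpu = (uint32_t)sched_getcpu();
        uint32_t expected = cpu;                 /* unlocked value */
        uint32_t locked   = cpu | LOCK_FLAG;     /* flag + lock nr */

        /* cmpxchg on the lock word (gs:lock in the real proposal) */
        if (__atomic_compare_exchange_n(&percpu_lock[cpu], &expected,
                                        locked, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return cpu;
        /* retry: we migrated, or the lock is already held */
    }
}

static void percpu_lock_release(uint32_t cpu)
{
    /* clear the lock flag: the word reverts to the plain cpu nr */
    __atomic_store_n(&percpu_lock[cpu], cpu, __ATOMIC_RELEASE);
}
```

Note that unlock here simply stores the unlocked value; a real
implementation would need to decide what happens if the thread migrated
while holding the lock, which is exactly the kernel-assist question
raised elsewhere in this thread.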

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-24 Thread Paul Turner
On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski  wrote:
> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner  wrote:
>> This is a fairly small series demonstrating a feature we've found to be quite
>> powerful in practice, "restartable sequences".
>>
>
> On an extremely short glance, I'm starting to think that the right
> approach, at least for x86, is to implement per-cpu gsbase.  Then you
> could do cmpxchg with a gs prefix to atomically take a percpu lock and
> atomically release a percpu lock and check whether someone else stole
> the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
> fast.)
>
> This is totally useless for other architectures, but I think it would
> be reasonably clean on x86.  Thoughts?

So this gives semantics that are obviously similar to this_cpu().
This allows reasonable per-cpu counters (which is alone almost
sufficient for a strong user-space RCU implementation, giving this
some legs).

However, unless there's a nice implementation trick I'm missing, the
thing that stands out to me for locks (or other primitives) is that
this forces a two-phase commit.  There's no way (short of, say,
cmpxchg16b) to perform a write conditional on the lock not having been
stolen from us (and subsequently release the lock).

e.g.
 1) We take the lock in some sort of speculative mode, which another
    thread on the same cpu is still allowed to steal from us
 2) We prepare what we want to commit
 3) At this point we have to promote the lock taken in (1) to perform
    our actual commit, or see that someone else has stolen (1)
 4) Release the promoted lock in (3)

However, this means that if we're preempted at (3) then no other
thread on that cpu can make progress until we've been rescheduled and
released the lock; a nice property of the model we have today is that
threads sharing a cpu can not impede each other beyond what the
scheduler allows.

A lesser concern, but worth mentioning, is that there are also
potential pitfalls in the interaction with signal handlers,
particularly if a 2-phase commit is used.
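Sketched concretely, the four steps above might look like this
(hypothetical names and flag values; a plain atomic array stands in for
the proposed gs-addressed per-cpu word, and no steal or kernel assist is
modeled):

```c
#define _GNU_SOURCE             /* for sched_getcpu() on glibc */
#include <stdint.h>
#include <stdbool.h>
#include <sched.h>

#define NCPUS     256
#define SPEC_FLAG 0x40000000u   /* step (1): speculative, stealable hold */
#define EXCL_FLAG 0x80000000u   /* step (3): promoted, exclusive hold    */

/* Stand-in for the gs-addressed per-cpu words; each word holds its own
 * cpu number when unlocked. */
static uint32_t lockword[NCPUS];

static void lockword_init(void)
{
    for (uint32_t i = 0; i < NCPUS; i++)
        lockword[i] = i;
}

static bool cas(uint32_t *p, uint32_t old, uint32_t new)
{
    return __atomic_compare_exchange_n(p, &old, new, false,
                                       __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);
}

/* One two-phase update; buf/val model "what we want to commit". */
static void two_phase_update(uint32_t *buf, uint32_t val)
{
    for (;;) {
        uint32_t cpu = (uint32_t)sched_getcpu();

        /* (1) take the lock speculatively */
        if (!cas(&lockword[cpu], cpu, cpu | SPEC_FLAG))
            continue;               /* held, or we migrated: retry */

        /* (2) prepare what we want to commit */
        uint32_t prepared = val;

        /* (3) promote; failure means the speculative hold was stolen
         *     by another thread on this cpu */
        if (!cas(&lockword[cpu], cpu | SPEC_FLAG, cpu | EXCL_FLAG))
            continue;               /* stolen: restart from (1) */

        *buf = prepared;            /* the actual commit */

        /* (4) release the promoted lock */
        __atomic_store_n(&lockword[cpu], cpu, __ATOMIC_RELEASE);
        return;
    }
}
```

The problem Paul describes is visible in the window between (3) and (4):
a thread preempted there holds an EXCL word that nobody else on that cpu
can steal, so they spin until it is rescheduled.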

- Paul


Re: [RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-24 Thread Andy Lutomirski
On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner  wrote:
> This is a fairly small series demonstrating a feature we've found to be quite
> powerful in practice, "restartable sequences".
>

On an extremely short glance, I'm starting to think that the right
approach, at least for x86, is to implement per-cpu gsbase.  Then you
could do cmpxchg with a gs prefix to atomically take a percpu lock and
atomically release a percpu lock and check whether someone else stole
the lock from you.  (Note: cmpxchg, unlike lock cmpxchg, is very
fast.)

This is totally useless for other architectures, but I think it would
be reasonably clean on x86.  Thoughts?

I can elaborate if necessary.

--Andy


[RFC PATCH 0/3] restartable sequences: fast user-space percpu critical sections

2015-06-24 Thread Paul Turner
This is a fairly small series demonstrating a feature we've found to be quite
powerful in practice, "restartable sequences".

Most simply: these sequences comprise small snippets of user-code that are
guaranteed to be (effectively) executed serially, with support for restart (or
other handling) in the event of preemption or interruption.

This (when combined with an in-memory cpu-id; potentially optional on some
architectures) can be used to implement extremely fast per-cpu operations that
do not rely on the use of actual processor atomics.  We've used this to back
performance critical code such as our malloc implementation with good results.
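As a rough illustration only (this is not the ABI from this series, and
the names here are made up), a restartable per-cpu counter increment has
the following shape: everything from the restart point up to the final
store is the sequence, and the kernel would rewind an interrupted thread
back to the restart point, so the read-modify-write needs no atomics.

```c
#define _GNU_SOURCE             /* for sched_getcpu() on glibc */
#include <stdint.h>
#include <sched.h>

#define NCPUS 256

static uint64_t counter[NCPUS];

/* Conceptual shape of a restartable per-cpu increment.  With kernel
 * support, preemption or signal delivery anywhere before the final
 * (committing) store restarts the sequence from the top.  Without that
 * support, as here, this is just a plain per-cpu increment. */
static void percpu_inc(void)
{
    /* --- restart point: the kernel would resume here if interrupted --- */
    int cpu = sched_getcpu();        /* an in-memory cpu-id in the real ABI */
    if (cpu < 0 || cpu >= NCPUS)
        cpu = 0;                     /* fallback for this sketch only */
    uint64_t v = counter[cpu];       /* read                                */
    counter[cpu] = v + 1;            /* commit: the single store ends the
                                        restartable sequence                */
}
```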

We previously discussed this feature at LPC:
  
http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Mathieu Desnoyers posted an alternate implementation based on the ideas above
at:
  https://lkml.org/lkml/2015/5/21/472

This RFC is based on the approach we currently use internally.  However, I'll
likely posit that neither this approach, nor the one posted above, is what we
should ultimately adopt (provided sufficient interest exists).

The implementation in this series can be summarized as follows:
- We allow a single (per-address-space) critical section (and associated
  handler) to be defined.
- When a thread with RSEQ configured (via a new syscall) is interrupted within
  a critical section, we modify its return address, either within signal
  delivery or in the notify-resume path.  The scheduler touchpoint is only a
  preempt_notifier which (potentially, dependent on config) examines the
  kernel copy of pt_regs.

There are a few core requirements which motivate the approach taken here:
1) We must not interfere with regular scheduling.  Unlike preemption protection
   (or similar), there are no special considerations for code running within a
   critical region beyond that we must move to the restart handler if
   interrupted.
2) The code executing in scheduler context must be tightly constrained, both in
   terms of overhead and that it must not require access to user memory.
3) The fast-path must be fast.  That is, user entry/execution/exit of a
   non-interrupted critical section is the most important case.  The restart
   handler is a 'slow' path that should only happen comparatively infrequently.
   We're strongly motivated here by high-performance, low-level primitives:
   think malloc, or rcu_read_lock.

While the contained implementation performs well under these constraints, it
has some notable limitations which we should consider for more general usage:

1) The definition of a single region works well for statically linked binaries
   but can be challenging when shared libraries want to use this feature.  This
   is partially mitigated by our experience that a library of primitives is
   generally more useful than a myriad of custom sequences, but it's still a
   major concern.
2) Due to the nature of restart and the explicit location requirements, it's
   only really reasonable to write this critical section in assembly, which
   makes porting and maintenance less convenient.  (One of the included tests
   shows a reasonable example of what this looks like.)
3) TLS spreads the entrails of its implementation all over compile _and_ link.
   This makes properly handling it within the critical section cumbersome in
   the shared-binary case.

We've been thinking about how to address these issues and are considering some
alternate ABIs that still satisfy (1)-(3), but have a more convenient
programming model.  I'll send this as a follow-up, but wanted to first share
the approach we've used to date.

Thanks,

- Paul




