Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
There are some interesting things one could do with a similar system at
the kernel level. If we had a table of IP ranges in the kernel that
specify critical sections with the restart points defined, then the
kernel could consult that when preempting kernel threads and set the IP
to the restart point if the IP happened to be in one of the ranges.

If we had something like that, then per-cpu atomic sections would become
much simpler (avoid the preempt accounting and other preempt overhead
there?) and the per-cpu atomic instructions would also be easier to
generalize for various architectures. It solves a lot of issues
introduced by preempt logic.

I think this could lead to a further simplification of allocator
fastpaths and would get rid of a lot of strange code. Look at the
simplification of the fastpath due to the ability to guarantee that a
section of code is executed without preemption.
mm/slub.c::slab_alloc_node():

redo:
        do {
                tid = this_cpu_read(s->cpu_slab->tid);
                c = raw_cpu_ptr(s->cpu_slab);
        } while (IS_ENABLED(CONFIG_PREEMPT) &&
                 unlikely(tid != READ_ONCE(c->tid)));

        barrier();

        object = c->freelist;
        page = c->page;
        if (unlikely(!object || !node_match(page, node))) {
                ... slow path ...
        } else {
                void *next_object = get_freepointer_safe(s, object);

                if (unlikely(!this_cpu_cmpxchg_double(
                                s->cpu_slab->freelist, s->cpu_slab->tid,
                                object, tid,
                                next_object, next_tid(tid)))) {
                        note_cmpxchg_failure("slab_alloc", s, tid);
                        goto redo;
                }
        }

Using the restart point system:

start_critical_section:
restart_point:

        c = this_cpu_ptr(s->cpu_slab);
        object = c->freelist;
        page = c->page;
        if (unlikely(!object || !node_match(page, node))) {
                ... slow path ...
        } else {
                void *next_object = get_freepointer_safe(s, object);
end_critical_section:
                c->freelist = next_object;
        }
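As an illustration only, a restart-range table of the kind suggested
above might look like the sketch below. The descriptor layout, the
section names, and the hook name are all hypothetical (not from any
posted patch), and the linear scan stands in for what would presumably
be a binary search over a sorted, build-time-generated table:

/* Hypothetical descriptor for one kernel-level critical section. */
struct restart_range {
        unsigned long start;    /* first insn of the section */
        unsigned long end;      /* first insn past the commit store */
        unsigned long restart;  /* where to resume after preemption */
};

/* Assumed to be emitted into a dedicated linker section at build time. */
extern const struct restart_range __restart_ranges_start[];
extern const struct restart_range __restart_ranges_end[];

/* Hypothetical hook, called with the preempted task's register state
 * (regs->ip is the x86-64 field name). */
static void fixup_restart_ip(struct pt_regs *regs)
{
        const struct restart_range *r;

        for (r = __restart_ranges_start; r < __restart_ranges_end; r++) {
                if (regs->ip >= r->start && regs->ip < r->end) {
                        regs->ip = r->restart;
                        break;
                }
        }
}

With a table like this, the fastpath itself needs no preempt accounting
at all; the cost moves entirely to the (rare) preemption path.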
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Tue, May 26, 2015 at 2:18 PM, Andy Lutomirski wrote:
> On Tue, May 26, 2015 at 2:04 PM, Mathieu Desnoyers wrote:
>>
>>> It's too bad that not all architectures have a single-instruction
>>> unlocked compare-and-exchange.
>>
>> Based on my benchmarks, it's not clear that single-instruction
>> unlocked CAS is actually faster than doing the same with many
>> instructions.
>
> True, but with a single instruction the user can't get preempted in the
> middle.
>
> Looking at your code, it looks like percpu_user_sched_in has some
> potentially nasty issues with page faults. Avoiding touching user
> memory from the scheduler would be quite nice from an implementation
> POV, and the x86-specific gs hack wins in that regard.

ARM has "TLB lockdown entries" which could, I think, be used to
implement per-cpu or per-thread mappings. I'm actually rather surprised
that Linux doesn't already use a TLB lockdown entry for TLS. (Hmm.
Maybe it's because the interface to write the entries requires actually
touching the page. Maybe not -- the ARM docs, in general, seem to be
much less clear than the Intel and AMD docs.)

ARM doesn't seem to have any single-instruction compare-exchange or
similar instruction, though, so this might not be all that useful. On
the other hand, ARM can probably do reasonably efficient per-cpu memory
allocation and such with a single ldrex/strex pair.

--Andy
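For reference, a sketch of the kind of single ldrex/strex pair meant
here, modeled on the kernel's ARM atomic ops (ARMv6+ only; the function
name is illustrative). Note the deliberate absence of dmb barriers: the
sequence only needs to be restartable against local preemption, not
ordered against other CPUs:

/* Add 'i' to the per-cpu counter at 'p' with one ldrex/strex pair.
 * The strex fails (and we retry) if the exclusive reservation was
 * lost, e.g. because of an intervening context switch. */
static inline void percpu_add(int *p, int i)
{
        int result;
        unsigned long tmp;

        __asm__ __volatile__(
"1:     ldrex   %0, [%3]\n"     /* load-exclusive the counter */
"       add     %0, %0, %4\n"
"       strex   %1, %0, [%3]\n" /* store fails if reservation lost */
"       teq     %1, #0\n"
"       bne     1b"
        : "=&r" (result), "=&r" (tmp), "+Qo" (*p)
        : "r" (p), "Ir" (i)
        : "cc");
}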
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Tue, May 26, 2015 at 2:20 PM, Andi Kleen wrote:
>
> I haven't thought too much about it, but doing it in a vdso seems
> reasonable. It would mean that the scheduler hot path would
> need to know about its address though.
>
[...]
>
>> IIRC some other OS's use gs as a userspace per-cpu pointer instead of
>> a per-thread pointer. Would we want to consider enabling that?
>> This may interact oddly with the WIP wrgsbase stuff. (Grr, Intel, why
>> couldn't you have made wrgsbase reset the gs selector to zero? That
>> would have been so much nicer.)
>
> Applications really want the extra address registers. For example
> OpenJDK could free a GPR. And it's needed for the TLS of user space
> threads, which are becoming more and more popular. So there are
> already more important users queued.

This isn't necessarily the end of the world. They could turn it off or
hopefully repurpose or extend their FS usage instead of requiring GS.

IOW, I think that ARCH_SET_GS should continue working, as should
wrgsbase. We could allow a program, strictly as an alternative, to do
ARCH_SET_GS_PERCPU or something.

(If we do that, we should wire up arch_prctl for x86_32 as well. We
could further consider restricting the new per-cpu mode to require a
specific *nonzero* value for the selector. Also, it's really annoying
that we'll be forced to assign some meaning to a non-zero selector with
unexpected gsbase during context switch.)

--Andy
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
I haven't thought too much about it, but doing it in a vdso seems
reasonable. It would mean that the scheduler hot path would
need to know about its address though.

> > You appear to summarize the two things that are added to the kernel
> > with this RFC patch:
> > - faster getcpu(): 12.7 ns -> 0.3 ns
>
> Yeah, I figured that out the second time I read your email and re-read
> the original message, but I apparently forgot to fix up the email I
> was typing.

The high speed users seem to have all switched to using LSL directly,
which is even faster. Since there are already users, it cannot be
changed anymore. Should probably just document this as an official ABI,
and provide some standard include file with inline code.

> IIRC some other OS's use gs as a userspace per-cpu pointer instead of
> a per-thread pointer. Would we want to consider enabling that?
> This may interact oddly with the WIP wrgsbase stuff. (Grr, Intel, why
> couldn't you have made wrgsbase reset the gs selector to zero? That
> would have been so much nicer.)

Applications really want the extra address registers. For example,
OpenJDK could free a GPR. And it's needed for the TLS of user-space
threads, which are becoming more and more popular. So there are
already more important users queued.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
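The LSL trick referred to here relies on the kernel encoding the CPU
and node numbers in the segment limit of a per-cpu GDT entry, which
userspace can read back with the unprivileged lsl instruction. A
sketch, assuming the historical x86-64 layout (cpu in bits 0-11, node
above) and the historical selector value (GDT_ENTRY_PER_CPU * 8 | 3 =
0x7b); treat both as assumptions about the kernel's GDT layout:

#include <stdio.h>

static inline int lsl_getcpu(unsigned *cpu, unsigned *node)
{
        unsigned limit;
        unsigned char ok;

        /* lsl loads the segment limit; ZF is set iff the selector is
         * valid, so check it before trusting the result. */
        asm volatile("lsl %2, %0; setz %1"
                     : "=r" (limit), "=q" (ok)
                     : "r" (0x7bU)
                     : "cc");
        if (!ok)
                return -1;
        *cpu = limit & 0xfff;
        *node = limit >> 12;
        return 0;
}

int main(void)
{
        unsigned cpu, node;

        if (lsl_getcpu(&cpu, &node) == 0)
                printf("cpu=%u node=%u\n", cpu, node);
        return 0;
}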
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Tue, May 26, 2015 at 2:04 PM, Mathieu Desnoyers wrote:
> - Original Message -
>> On May 25, 2015 11:54 AM, "Andy Lutomirski" wrote:
[...]
>> Your RFC design certainly works (in principle -- I haven't looked at
>> the code in detail), but I can't shake the feeling that it's overkill
>
> If we can find a simpler way to achieve the same result, I'm all ears :)
>
>> and that it could be improved to avoid unnecessary aborts every time
>> the lock holder is scheduled out.
>
> The restart is only performed if preemption occurs when the thread is
> trying to grab the lock. Once the thread has the lock (it's now the
> lock holder), it will not be aborted even if it is migrated.

Then maybe this does have the same acquire/release issue after all.

>> This isn't a problem in your RFC design, but if we wanted to come up
>> with tighter primitives, we'd have to be quite careful to document
>> exactly what memory ordering guarantees they come with.
>
> Indeed.
>
>> It may be that all architectures for which you care about the
>> performance boost already have efficient acquire and release
>> operations. Certainly x86 does, and I don't know how fast the new ARM
>> instructions are, but I imagine they're pretty good.
>
> We may even need memory barriers around lock and unlock on ARM. :/ ARM
> smp_store_release() and smp_load_acquire() currently implement those
> with smp_mb().

I would have expected it to use LDA/STL on new enough cpus:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/CIHDEACA.html

>> It's too bad that not all architectures have a single-instruction
>> unlocked compare-and-exchange.
>
> Based on my benchmarks, it's not clear that single-instruction
> unlocked CAS is actually faster than doing the same with many
> instructions.

True, but with a single instruction the user can't get preempted in the
middle.

Looking at your code, it looks like percpu_user_sched_in has some
potentially nasty issues with page faults. Avoiding touching user
memory from the scheduler would be quite nice from an implementation
POV, and the x86-specific gs hack wins in that regard.

--Andy
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message -
> On May 25, 2015 11:54 AM, "Andy Lutomirski" wrote:
[...]
> I might be guilty of being too x86-centric here. On x86, as long as
> the lock and unlock primitives are sufficiently atomic, everything
> should be okay. On other architectures, though, a primitive that
> gives lock, unlock, and abort of a per-cpu lock without checking that
> you're still on the right cpu at unlock time may not be sufficient.
> If the primitive is implemented purely with loads and stores, then
> even if you take the lock, migrate, finish your work, and unlock
> without anyone else contending for the lock (and hence don't abort),
> the next thread to take the same lock will end up unsynchronized
> unless there's appropriate memory ordering. For example, if taking
> the lock were an acquire and unlocking were a release, we'd be fine.

If we look at x86-TSO, where stores reordered after loads is pretty
much the only thing we need to care about, we have:

CPU 0                              CPU 1

                                   load, test lock[1]
                                   store lock[1]
                                   loads/stores to data[1]
                                   [preempted, migrate to cpu 0]
                                   [run other thread]
                                   load, test lock[1] (loop)
loads/stores to data[1] (A)
store 0 to lock[1] (B)
                                   load, test lock[1] (C)
                                   store lock[1]
                                   loads/stores to data[1] (D)

What we want to ensure is that loads/stores to data[1] from CPU 0 and
CPU 1 don't get interleaved.

On CPU 0 doing the unlock, loads and stores to data[1] (A) cannot be
reordered after the store to lock[1] (B) due to TSO (all we care about
is stores reordered after loads).

On CPU 1 grabbing the 2nd lock, we care about loads/stores from/to
data[1] (D) being reordered before the load, test of lock[1] (C).
Again, thanks to TSO, loads/stores from/to data[1] (D) cannot be
reordered wrt (C).

So on x86-TSO we should be OK, but as you say, we need to be careful
to have the right barriers in place on other architectures.

>
> Your RFC design certainly works (in principle -- I haven't looked at
> the code in detail), but I can't shake the feeling that it's overkill

If we can find a simpler way to achieve the same result, I'm all ears :)

> and that it could be improved to avoid unnecessary aborts every time
> the lock holder is scheduled out.

The restart is only performed if preemption occurs when the thread is
trying to grab the lock. Once the thread has the lock (it's now the
lock holder), it will not be aborted even if it is migrated.

>
> This isn't a problem in your RFC design, but if we wanted to come up
> with tighter primitives, we'd have to be quite careful to document
> exactly what memory ordering guarantees they come with.

Indeed.

>
> It may be that all architectures for which you care about the
> performance boost already have efficient acquire and release
> operations. Certainly x86 does, and I don't know how fast the new ARM
> instructions are, but I imagine they're pretty good.

We may even need memory barriers around lock and unlock on ARM. :/ ARM
smp_store_release() and smp_load_acquire() currently implement those
with smp_mb().

>
> It's too bad that not all architectures have a single-instruction
> unlocked compare-and-exchange.

Based on my benchmarks, it's not clear that single-instruction
unlocked CAS is actually faster than doing the same with many
instructions.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
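To make the ordering requirement concrete, here is a minimal C11 sketch
of a lock/unlock pair with exactly the acquire/release semantics
discussed above. The CAS shown is a real (lock-prefixed) atomic for
simplicity; the whole point of the restart scheme is to replace it with
plain loads and stores, but whatever replaces it must preserve the same
ordering:

#include <stdatomic.h>
#include <stdbool.h>

struct percpu_lock {
        atomic_int word;        /* 0 = free, 1 = held */
};

/* One attempt to take the cpu's lock; the caller loops or restarts
 * according to the protocol. Acquire: accesses to data[cpu] after a
 * successful return cannot be reordered before the lock word test. */
static inline bool percpu_trylock(struct percpu_lock *l)
{
        int expected = 0;

        return atomic_compare_exchange_strong_explicit(&l->word,
                        &expected, 1,
                        memory_order_acquire, memory_order_relaxed);
}

/* Release: accesses to data[cpu] made before this cannot be reordered
 * after the store that frees the lock word. */
static inline void percpu_unlock(struct percpu_lock *l)
{
        atomic_store_explicit(&l->word, 0, memory_order_release);
}

On x86-TSO both orderings come for free, which is the argument made
above; on ARM they compile to barriers (or LDA/STL on new enough CPUs).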
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message -
> On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers wrote:
> > Expose a new system call allowing userspace threads to register
> > a TLS area used as an ABI between the kernel and userspace to
> > share information required to create efficient per-cpu critical
> > sections in user-space.
>
> Not a way in hell.
>
> You do user accesses from preempt notifiers.
>
> That's so ludicrously broken that it's not even funny.
>
> This patch needs to die, and never *ever* be resurrected. Put a stake
> in it, and bury it 6ft deep.

There is no need to fetch your vampire hunting kit just yet. ;-)

As an RFC patch, this should be considered a prototype, nothing more.
I was mainly expecting feedback on the general approach, and trying to
put some pressure on Paul Turner, so he would post the patches he
talked about back in 2013.

Peter Zijlstra already pointed out that I should move those user
accesses before return to userspace.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
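A sketch of what moving the user accesses out of the scheduler could
look like; the struct and field names are hypothetical (taken loosely
from the RFC), but the mechanism -- flag the task from the sched_in
notifier, then do the put_user() on the return-to-userspace path --
is the standard TIF_NOTIFY_RESUME pattern:

/* No user access from the scheduler: only flag the task. */
static void percpu_user_sched_in(struct preempt_notifier *n, int cpu)
{
        set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
}

/* Hypothetical handler called from the arch's return-to-userspace
 * path (do_notify_resume), where taking a page fault is harmless. */
static void percpu_user_notify_resume(struct task_struct *t)
{
        struct thread_percpu_user __user *tp = t->percpu_user; /* assumed field */

        if (!tp)
                return;
        /* May fault and sleep here, which is fine in this context.
         * Error handling (e.g. killing the task) elided. */
        put_user(raw_smp_processor_id(), &tp->current_cpu);
}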
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers wrote:
> Expose a new system call allowing userspace threads to register
> a TLS area used as an ABI between the kernel and userspace to
> share information required to create efficient per-cpu critical
> sections in user-space.

Not a way in hell.

You do user accesses from preempt notifiers.

That's so ludicrously broken that it's not even funny.

This patch needs to die, and never *ever* be resurrected. Put a stake
in it, and bury it 6ft deep.

                  Linus
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message -
> On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers wrote:
> > - Original Message -
> >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk wrote:
> >> > [CC += linux-api@]
[...]
> >> Is there a reason that a scheme like this doesn't work?
> >
> > What do you mean exactly by "atomically check that lock is not
> > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> > atomic operation ?
>
> Yes.
>
> > The goal of this whole restart section approach is to allow grabbing
> > a lock (or doing other sequences of operations ending with a single
> > store) on per-cpu data without having to use slow lock-prefixed
> > atomic operations.
>
> Ah, ok, I assumed it was to allow multiple threads to work in parallel.
>
> How arch-specific are you willing to be?

I'd want this to be usable on every major architecture.

> On x86, it might be possible
> to play some GDT games so that an unlocked xchg relative

AFAIK, there is no such thing as an unlocked xchg. xchg always implies
the lock prefix on x86. I guess you mean cmpxchg here.

> to gs would
> work if you arranged for gs to refer to the GDT. On 32-bit
[...]
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers wrote:
> - Original Message -
>> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk wrote:
>> > [CC += linux-api@]
[...]
>> Is there a reason that a scheme like this doesn't work?
>
> What do you mean exactly by "atomically check that lock is not
> held and, if so, mark it held" ? Do you imply using a lock-prefixed
> atomic operation ?

Yes.

> The goal of this whole restart section approach is to allow grabbing
> a lock (or doing other sequences of operations ending with a single
> store) on per-cpu data without having to use slow lock-prefixed
> atomic operations.

Ah, ok, I assumed it was to allow multiple threads to work in parallel.

How arch-specific are you willing to be? On x86, it might be possible
to play some GDT games so that an unlocked xchg relative to gs would
work if you arranged for gs to refer to the GDT. On 32-bit userspace,
you can do this with set_thread_area and on 64-bit userspace you can do
it with arch_prctl. You'd be relying on nothing else in the process
using gs, but your proposal already relies on nothing else in the
process using your signal number.

I still dislike anything that tells programs when they're preempted,
since it prevents any kind of single-stepping. It would be nicer if you
could somehow only get notif [...]
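The 64-bit half of the gs idea can be sketched as follows, assuming
nothing else in the process uses gs. Note that ARCH_SET_GS as used here
yields a per-thread base; turning gs into a per-cpu base is precisely
the part that would need kernel help (the hypothetical
ARCH_SET_GS_PERCPU from elsewhere in this thread):

#include <asm/prctl.h>          /* ARCH_SET_GS */
#include <sys/syscall.h>
#include <unistd.h>

static unsigned long area[64];  /* data to reach via %gs */

static long set_gs_base(unsigned long base)
{
        return syscall(SYS_arch_prctl, ARCH_SET_GS, base);
}

static inline unsigned long gs_load(unsigned long offset)
{
        unsigned long val;

        /* gs-relative load: one instruction, no extra GPR for the base */
        asm volatile("movq %%gs:(%1), %0" : "=r" (val) : "r" (offset));
        return val;
}

int main(void)
{
        if (set_gs_base((unsigned long)area))
                return 1;
        area[3] = 42;
        return gs_load(3 * sizeof(area[0])) == 42 ? 0 : 1;
}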
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Fri, May 22, 2015 at 1:53 PM, Andy Lutomirski wrote: > Create an array of user-managed locks, one per cpu. Call them lock[i] > for 0 <= i < ncpus. > > To acquire, look up your CPU number. Then, atomically, check that > lock[cpu] isn't held and, if so, mark it held and record both your tid > and your lock acquisition count. If you learn that the lock *was* > held after all, signal the holder (with kill or your favorite other > mechanism), telling it which lock acquisition count is being aborted. > Then atomically steal the lock, but only if the lock acquisition count > hasn't changed. > We had to deploy the userspace percpu API (percpu sharded locks, {double,}compare-and-swap, atomic-increment, etc) universally to the fleet without waiting for 100% kernel penetration, not to mention wanting to disable the kernel acceleration in case of kernel bugs. (Since this is mostly used in core infrastructure--malloc, various statistics platforms, etc--in userspace, checking for availability isn't feasible. The primitives have to work 100% of the time or it would be too complex for our developers to bother using them.) So we did basically this (without the lock stealing...): we have a single per-cpu spin lock manipulated with atomics, which we take very briefly to implement (e.g.) compare-and-swap. The performance is hugely worse; typical overheads are in the 10x range _without_ any on-cpu contention. Uncontended atomics are much cheaper than they were on pre-Nehalem chips, but they still can't hold a candle to unsynchronized instructions. As a fallback path for userspace, this is fine--if 5% of binaries on busted kernels aren't quite as fast, we can work with that in exchange for being able to write a percpu op without worrying about what to do on -ENOSYS. But it's just not fast enough to compete as the intended way to do things. AHH
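The fallback AHH describes, a briefly-held per-cpu spinlock under which the "percpu compare-and-swap" is plain loads and stores, might look like the following sketch. The names and layout are illustrative, not Google's actual implementation; the lock-prefixed CAS on entry is exactly where the ~10x overhead comes from.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct percpu_slot {
	_Atomic int lock;		/* 0 = free, 1 = held */
} __attribute__((aligned(64)));		/* avoid false sharing */

/* Caller picks slot[cpu] by its current CPU number. */
static bool percpu_cas(struct percpu_slot *slot, intptr_t *addr,
		       intptr_t expected, intptr_t desired)
{
	bool ok;
	int unlocked = 0;

	/* A real lock-prefixed atomic on every operation: the slow part. */
	while (!atomic_compare_exchange_weak(&slot->lock, &unlocked, 1))
		unlocked = 0;

	ok = (*addr == expected);	/* plain, unsynchronized ops */
	if (ok)
		*addr = desired;

	atomic_store(&slot->lock, 0);	/* release the per-cpu lock */
	return ok;
}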
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message - > On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk > wrote: > > [CC += linux-api@] > > > > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers > > wrote: > >> Expose a new system call allowing userspace threads to register > >> a TLS area used as an ABI between the kernel and userspace to > >> share information required to create efficient per-cpu critical > >> sections in user-space. > >> > >> This ABI consists of a thread-local structure containing: > >> > >> - a nesting count surrounding the critical section, > >> - a signal number to be sent to the thread when preempting a thread > >> with non-zero nesting count, > >> - a flag indicating whether the signal has been sent within the > >> critical section, > >> - an integer where to store the current CPU number, updated whenever > >> the thread is preempted. This CPU number cache is not strictly > >> needed, but performs better than getcpu vdso. > >> > >> This approach is inspired by Paul Turner and Andrew Hunter's work > >> on percpu atomics, which lets the kernel handle restart of critical > >> sections, ref. > >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf > >> > >> What is done differently here compared to percpu atomics: we track > >> a single nesting counter per thread rather than many ranges of > >> instruction pointer values. We deliver a signal to user-space and > >> let the logic of restart be handled in user-space, thus moving > >> the complexity out of the kernel. The nesting counter approach > >> allows us to skip the complexity of interacting with signals that > >> would be otherwise needed with the percpu atomics approach, which > >> needs to know which instruction pointers are preempted, including > >> when preemption occurs on a signal handler nested over an instruction > >> pointer of interest. > >> > > I talked about this kind of thing with PeterZ at LSF/MM, and I was > unable to convince myself that the kernel needs to help at all. To do > this without kernel help, I want to relax the requirements slightly. > With true per-cpu atomic sections, you have a guarantee that you are > either really running on the same CPU for the entire duration of the > atomic section or you abort. I propose a weaker primitive: you > acquire one of an array of locks (probably one per cpu), and you are > guaranteed that, if you don't abort, no one else acquires the same > lock while you hold it. In my proof of concept (https://github.com/compudj/percpu-dev) I actually implement an array of per-cpu locks. The issue here boils down to grabbing this per-cpu lock efficiently. Once the lock is taken, the thread has exclusive access to that per-cpu lock, even if it migrates. > Here's how: > > Create an array of user-managed locks, one per cpu. Call them lock[i] > for 0 <= i < ncpus. > > To acquire, look up your CPU number. Then, atomically, check that > lock[cpu] isn't held and, if so, mark it held and record both your tid > and your lock acquisition count. If you learn that the lock *was* > held after all, signal the holder (with kill or your favorite other > mechanism), telling it which lock acquisition count is being aborted. > Then atomically steal the lock, but only if the lock acquisition count > hasn't changed. > > This has a few benefits over the in-kernel approach: > > 1. No kernel patch. > > 2. No unnecessary abort if you are preempted in favor of a thread that > doesn't contend for your lock. > > 3. Greatly improved debuggability. > > 4.
With long critical sections and heavy load, you can improve > performance by having several locks per cpu and choosing one at > random. > > Is there a reason that a scheme like this doesn't work? What do you mean exactly by "atomically check that lock is not held and, if so, mark it held" ? Do you imply using a lock-prefixed atomic operation ? The goal of this whole restart section approach is to allow grabbing a lock (or doing other sequences of operations ending with a single store) on per-cpu data without having to use slow lock-prefixed atomic operations. > > >> Benchmarking sched_getcpu() vs tls cache approach. Getting the > >> current CPU number: > >> > >> - With Linux vdso: 12.7 ns > >> - With TLS-cached cpu number: 0.3 ns > > Slightly off-topic: try this again on a newer kernel. The vdso should > have gotten a bit faster in 3.19 or 4.0 IIRC. Those benchmarks were done on Linux 4.0. Thanks! Mathieu > > --Andy > -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
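Putting Mathieu's last point in code: the point of the nesting counter is that a per-cpu operation like the push below commits with one plain store. This is a minimal single-level sketch assuming the TLS ABI from the patch; the real proof of concept at https://github.com/compudj/percpu-dev handles restart via page protection and a SIGSEGV handler, whereas this sketch substitutes a siglongjmp-based restart to keep it short.

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>

struct node { struct node *next; };

struct thread_percpu_user {		/* layout from the RFC */
	int32_t nesting;
	int32_t signal_sent;
	int32_t signo;
	int32_t current_cpu;
};

static __thread struct thread_percpu_user tpu;	/* registered with the kernel */
static __thread sigjmp_buf restart_point;

/* handler for tpu.signo, installed with sigaction() elsewhere */
static void restart_handler(int sig)
{
	tpu.signal_sent = 0;
	tpu.nesting = 0;
	siglongjmp(restart_point, 1);	/* re-run the whole section */
}

/* Push n onto this CPU's list using only plain loads and stores. */
static void percpu_push(struct node **head_percpu, struct node *n)
{
	sigsetjmp(restart_point, 1);	/* a restart lands here */
	tpu.nesting++;			/* kernel signals us if preempted
					   anywhere past this point */
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier */
	n->next = head_percpu[tpu.current_cpu];
	head_percpu[tpu.current_cpu] = n;	/* single committing store */
	__asm__ __volatile__("" ::: "memory");
	tpu.nesting--;
}

The key property: the kernel queues the restart signal at preemption time, so when the thread resumes, the handler runs before any further instruction of the interrupted section, and the committing store can never land on a stale CPU's list.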
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk wrote: > [CC += linux-api@] > > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers > wrote: >> Expose a new system call allowing userspace threads to register >> a TLS area used as an ABI between the kernel and userspace to >> share information required to create efficient per-cpu critical >> sections in user-space. >> >> This ABI consists of a thread-local structure containing: >> >> - a nesting count surrounding the critical section, >> - a signal number to be sent to the thread when preempting a thread >> with non-zero nesting count, >> - a flag indicating whether the signal has been sent within the >> critical section, >> - an integer where to store the current CPU number, updated whenever >> the thread is preempted. This CPU number cache is not strictly >> needed, but performs better than getcpu vdso. >> >> This approach is inspired by Paul Turner and Andrew Hunter's work >> on percpu atomics, which lets the kernel handle restart of critical >> sections, ref. >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf >> >> What is done differently here compared to percpu atomics: we track >> a single nesting counter per thread rather than many ranges of >> instruction pointer values. We deliver a signal to user-space and >> let the logic of restart be handled in user-space, thus moving >> the complexity out of the kernel. The nesting counter approach >> allows us to skip the complexity of interacting with signals that >> would be otherwise needed with the percpu atomics approach, which >> needs to know which instruction pointers are preempted, including >> when preemption occurs on a signal handler nested over an instruction >> pointer of interest. >> I talked about this kind of thing with PeterZ at LSF/MM, and I was unable to convince myself that the kernel needs to help at all. To do this without kernel help, I want to relax the requirements slightly. With true per-cpu atomic sections, you have a guarantee that you are either really running on the same CPU for the entire duration of the atomic section or you abort. I propose a weaker primitive: you acquire one of an array of locks (probably one per cpu), and you are guaranteed that, if you don't abort, no one else acquires the same lock while you hold it. Here's how: Create an array of user-managed locks, one per cpu. Call them lock[i] for 0 <= i < ncpus. To acquire, look up your CPU number. Then, atomically, check that lock[cpu] isn't held and, if so, mark it held and record both your tid and your lock acquisition count. If you learn that the lock *was* held after all, signal the holder (with kill or your favorite other mechanism), telling it which lock acquisition count is being aborted. Then atomically steal the lock, but only if the lock acquisition count hasn't changed. This has a few benefits over the in-kernel approach: 1. No kernel patch. 2. No unnecessary abort if you are preempted in favor of a thread that doesn't contend for your lock. 3. Greatly improved debuggability. 4. With long critical sections and heavy load, you can improve performance by having several locks per cpu and choosing one at random. Is there a reason that a scheme like this doesn't work? >> Benchmarking sched_getcpu() vs tls cache approach. Getting the >> current CPU number: >> >> - With Linux vdso: 12.7 ns >> - With TLS-cached cpu number: 0.3 ns Slightly off-topic: try this again on a newer kernel.
The vdso should have gotten a bit faster in 3.19 or 4.0 IIRC. --Andy
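In application terms, the 12.7 ns vs 0.3 ns comparison collapses to the choice below; HAVE_PERCPU_TLS_ABI is a made-up configuration switch for this sketch.

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>

extern __thread struct thread_percpu_user {
	int32_t nesting;
	int32_t signal_sent;
	int32_t signo;
	int32_t current_cpu;
} tpu;				/* the registered TLS area */

static inline int my_cpu(void)
{
#ifdef HAVE_PERCPU_TLS_ABI
	return tpu.current_cpu;	/* one TLS load: the ~0.3 ns figure */
#else
	return sched_getcpu();	/* vdso call: the ~12.7 ns figure */
#endif
}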
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
[CC += linux-api@] On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers wrote: > Expose a new system call allowing userspace threads to register > a TLS area used as an ABI between the kernel and userspace to > share information required to create efficient per-cpu critical > sections in user-space. > > This ABI consists of a thread-local structure containing: > > - a nesting count surrounding the critical section, > - a signal number to be sent to the thread when preempting a thread > with non-zero nesting count, > - a flag indicating whether the signal has been sent within the > critical section, > - an integer where to store the current CPU number, updated whenever > the thread is preempted. This CPU number cache is not strictly > needed, but performs better than getcpu vdso. > > This approach is inspired by Paul Turner and Andrew Hunter's work > on percpu atomics, which lets the kernel handle restart of critical > sections, ref. > http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf > > What is done differently here compared to percpu atomics: we track > a single nesting counter per thread rather than many ranges of > instruction pointer values. We deliver a signal to user-space and > let the logic of restart be handled in user-space, thus moving > the complexity out of the kernel. The nesting counter approach > allows us to skip the complexity of interacting with signals that > would be otherwise needed with the percpu atomics approach, which > needs to know which instruction pointers are preempted, including > when preemption occurs on a signal handler nested over an instruction > pointer of interest. > > Advantages of this approach over percpu atomics: > - kernel code is relatively simple: complexity of restart sections > is in user-space, > - easy to port to other architectures: just need to reserve a new > system call, > - for threads which have registered a TLS structure, the fast-path > at preemption is only a nesting counter check, along with the > optional store of the current CPU number, rather than comparing > instruction pointer with possibly many registered ranges, > > Caveats of this approach compared to the percpu atomics: > - We need a signal number for this, so it cannot be done without > designing the application accordingly, > - Handling restart in user-space is currently performed with page > protection, for which we install a SIGSEGV signal handler. Again, > this requires designing the application accordingly, especially > if the application installs its own segmentation fault handler, > - It cannot be used for tracing of processes by injection of code > into their address space, due to interactions with application > signal handlers. > > The user-space proof of concept code implementing the restart section > can be found here: https://github.com/compudj/percpu-dev > > Benchmarking sched_getcpu() vs tls cache approach. Getting the > current CPU number: > > - With Linux vdso:12.7 ns > - With TLS-cached cpu number: 0.3 ns > > We will use the TLS-cached cpu number for the following > benchmarks. > > On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison > with a baseline running very few load/stores (no locking, > no getcpu, assuming one thread per CPU with affinity), > against locking scheme based on "lock; cmpxchg", "cmpxchg" > (using restart signal), load-store (using restart signal). 
> This is performed with 32 threads on a 16-core, hyperthread > system:
>
>                  ns/loop   overhead (ns)
> Baseline:            3.7             0.0
> lock; cmpxchg:      22.0            18.3
> cmpxchg:            11.1             7.4
> load-store:          9.4             5.7
>
> Therefore, the load-store scheme has a speedup of 3.2x over the > "lock; cmpxchg" scheme if both are using the tls-cache for the > CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg" > we reach a speedup of 5.4x for load-store+tls-cache vs > "lock; cmpxchg"+vdso-getcpu. > > I'm sending this out to trigger discussion, and hopefully to see > Paul and Andrew's patches being posted publicly at some point, so > we can compare our approaches.
>
> Signed-off-by: Mathieu Desnoyers
> CC: Paul Turner
> CC: Andrew Hunter
> CC: Peter Zijlstra
> CC: Ingo Molnar
> CC: Ben Maurer
> CC: Steven Rostedt
> CC: "Paul E. McKenney"
> CC: Josh Triplett
> CC: Lai Jiangshan
> CC: Linus Torvalds
> CC: Andrew Morton
> ---
>  arch/x86/syscalls/syscall_64.tbl  |   1 +
>  fs/exec.c                         |   1 +
>  include/linux/sched.h             |  18 ++
>  include/uapi/asm-generic/unistd.h |   4 +-
>  init/Kconfig                      |  10 +++
>  kernel/Makefile                   |   1 +
>  kernel/fork.c                     |   2 +
>  kernel/percpu-user.c              | 126 ++
>  kernel/sys_ni.c                   |   3 +
>  9 files changed, 165 insertions(+), 1 deletion(-)
>  create
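From the application side, registration would look something like the sketch below. __NR_percpu stands in for whatever syscall number the patch reserves (the value here is hypothetical), the second argument anticipates the flags parameter requested elsewhere in the thread, and the 16-byte alignment reflects the natural-alignment requirement raised in review.

#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_percpu
#define __NR_percpu 323		/* hypothetical; use the reserved slot */
#endif

struct thread_percpu_user {
	int32_t nesting;
	int32_t signal_sent;
	int32_t signo;
	int32_t current_cpu;
} __attribute__((aligned(16)));	/* 16-byte struct: never straddles a cacheline */

static __thread struct thread_percpu_user tpu;

static int percpu_register(int signo)
{
	tpu.nesting = 0;
	tpu.signal_sent = 0;
	tpu.signo = signo;	/* e.g. an application-reserved SIGRTMIN+n */
	tpu.current_cpu = -1;	/* filled in by the kernel */
	return syscall(__NR_percpu, &tpu, 0UL);	/* flags = 0 */
}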
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message - > Very nice! > > Some inline notes: > > On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers > wrote: > > Expose a new system call allowing userspace threads to register > > a TLS area used as an ABI between the kernel and userspace to > > share information required to create efficient per-cpu critical > > sections in user-space. > > > > This ABI consists of a thread-local structure containing: > > > > - a nesting count surrounding the critical section, > > - a signal number to be sent to the thread when preempting a thread > > with non-zero nesting count, > > - a flag indicating whether the signal has been sent within the > > critical section, > > - an integer where to store the current CPU number, updated whenever > > the thread is preempted. This CPU number cache is not strictly > > needed, but performs better than getcpu vdso. > > While it is fractionally faster, this is actually something we've > strongly considered dropping. The complexity of correct TLS > addressing is non-trivial. Do you mean that crafting the TLS addressing in assembly is non-trivial ? This is a constraint specific to the instruction pointer based solution that I side-track with my approach. In my implementation, I don't need to perform the TLS addressing in assembly. By using the nesting counter increment/decrement, as well as a data protection and restart with page fault handler, I can do the getcpu TLS read directly in C, no special assembly crafting is needed. 12 ns -> 0.3 ns acceleration for getting the CPU number seems to be more than a marginal speedup in this context. > > > > > This approach is inspired by Paul Turner and Andrew Hunter's work > > on percpu atomics, which lets the kernel handle restart of critical > > sections, ref. > > http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf > > > > What is done differently here compared to percpu atomics: we track > > a single nesting counter per thread rather than many ranges of > > instruction pointer values. We deliver a signal to user-space and > > let the logic of restart be handled in user-space, thus moving > > the complexity out of the kernel. The nesting counter approach > > allows us to skip the complexity of interacting with signals that > > would be otherwise needed with the percpu atomics approach, which > > needs to know which instruction pointers are preempted, including > > when preemption occurs on a signal handler nested over an instruction > > pointer of interest. > > > > Advantages of this approach over percpu atomics: > > - kernel code is relatively simple: complexity of restart sections > > is in user-space, > > I think delivering a signal is actually much more complex than the > address check here. Delivering a signal indeed uses a complex delivery mechanism, but it already exists, so we're not inventing anything new here. > > - easy to port to other architectures: just need to reserve a new > > system call, > > Not having an ABI is a potential nice advantage here. Not sure what you mean by "not having an ABI" ? In some sense, both your approach and mine need to add some sort of ABI. > > > - for threads which have registered a TLS structure, the fast-path > > at preemption is only a nesting counter check, along with the > > optional store of the current CPU number, rather than comparing > > instruction pointer with possibly many registered ranges, > > I'd argue this is not actually an advantage: The preemption is the > _slow path_. 
For a reasonably behaving application this is something > that should be happening on O(ms) boundaries. Whereas entry/exit of > the critical section may be occurring much more frequently. This is > also coming at the cost of a slower fast path (nesting updates). I agree that entry/exit of c.s. is indeed the uttermost fast path we care about here. However, since not all applications will use it, other (badly designed) applications may be relying on very fast scheduler execution due to frequent blocking, so in those pre-existing cases, preemption would actually be a fast-path. I would not want to impact those pre-existing applications too much. Now looking more specifically at the critical section entry/exit: since we're talking about a per-cpu critical section, doing anything useful implies: - entering the critical section - getting the current cpu number - doing something on per-cpu data (e.g. grabbing a lock) - exiting the critical section My benchmarks show that by implementing sched_getcpu() with a TLS cache rather than a vdso, we can save about 12 ns. Explicit nesting counter increment/decrement around the critical section takes much less than that. If the nesting counter approach allows us to use the getcpu TLS cache, it seems like a win all over. > > It's also worth noting that the range comparison can be done in logN > (binary search) or even constant time (bound max sequence size and > mask
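The logN variant pjt refers to is an ordinary binary search over the registered ranges at preemption time; a hedged sketch follows (ip_range and its fields are illustrative, not from either patch set).

#include <stddef.h>
#include <stdint.h>

struct ip_range {
	uintptr_t start;	/* first byte of the critical sequence */
	uintptr_t end;		/* one past the last byte */
	uintptr_t restart;	/* where execution resumes on abort */
};

/* ranges[] sorted by start, non-overlapping */
static const struct ip_range *find_range(const struct ip_range *ranges,
					 size_t n, uintptr_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ip < ranges[mid].start)
			hi = mid;
		else if (ip >= ranges[mid].end)
			lo = mid + 1;
		else
			return &ranges[mid];	/* ip is inside a section */
	}
	return NULL;	/* not in any critical section */
}

The constant-time variant instead pads every sequence to a fixed power-of-two size, so the candidate range is just (ip & ~(size - 1)) and the check collapses to one table lookup.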
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
Very nice! Some inline notes: On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers wrote: > Expose a new system call allowing userspace threads to register > a TLS area used as an ABI between the kernel and userspace to > share information required to create efficient per-cpu critical > sections in user-space. > > This ABI consists of a thread-local structure containing: > > - a nesting count surrounding the critical section, > - a signal number to be sent to the thread when preempting a thread > with non-zero nesting count, > - a flag indicating whether the signal has been sent within the > critical section, > - an integer where to store the current CPU number, updated whenever > the thread is preempted. This CPU number cache is not strictly > needed, but performs better than getcpu vdso. While it is fractionally faster, this is actually something we've strongly considered dropping. The complexity of correct TLS addressing is non-trivial. > > This approach is inspired by Paul Turner and Andrew Hunter's work > on percpu atomics, which lets the kernel handle restart of critical > sections, ref. > http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf > > What is done differently here compared to percpu atomics: we track > a single nesting counter per thread rather than many ranges of > instruction pointer values. We deliver a signal to user-space and > let the logic of restart be handled in user-space, thus moving > the complexity out of the kernel. The nesting counter approach > allows us to skip the complexity of interacting with signals that > would be otherwise needed with the percpu atomics approach, which > needs to know which instruction pointers are preempted, including > when preemption occurs on a signal handler nested over an instruction > pointer of interest. > > Advantages of this approach over percpu atomics: > - kernel code is relatively simple: complexity of restart sections > is in user-space, I think delivering a signal is actually much more complex than the address check here. > - easy to port to other architectures: just need to reserve a new > system call, Not having an ABI is a potential nice advantage here. > - for threads which have registered a TLS structure, the fast-path > at preemption is only a nesting counter check, along with the > optional store of the current CPU number, rather than comparing > instruction pointer with possibly many registered ranges, I'd argue this is not actually an advantage: The preemption is the _slow path_. For a reasonably behaving application this is something that should be happening on O(ms) boundaries. Whereas entry/exit of the critical section may be occurring much more frequently. This is also coming at the cost of a slower fast path (nesting updates). It's also worth noting that the range comparison can be done in logN (binary search) or even constant time (bound max sequence size and mask out bits on address, or use a lookup table). That said there are two very specific advantages: 1) CONFIG_PREEMPT greatly complicates correct inspection of the user address. 2) One region per address space is a difficult model for binaries with shared linkage. [ Note: Compatibility of (1) is considered the primary problem with our current patch-set. I'll publish it as is, then try to follow up with some alternatives we've been considering to get around this. 
] > > Caveats of this approach compared to the percpu atomics: > - We need a signal number for this, so it cannot be done without > designing the application accordingly, > - Handling restart in user-space is currently performed with page > protection, for which we install a SIGSEGV signal handler. Again, > this requires designing the application accordingly, especially > if the application installs its own segmentation fault handler, > - It cannot be used for tracing of processes by injection of code > into their address space, due to interactions with application > signal handlers. > > The user-space proof of concept code implementing the restart section > can be found here: https://github.com/compudj/percpu-dev > > Benchmarking sched_getcpu() vs tls cache approach. Getting the > current CPU number: > > - With Linux vdso:12.7 ns > - With TLS-cached cpu number: 0.3 ns > Have you tried also a naked read against the segment limit? This is more comparable to TLS. However, it's a definite ABI challenge. > We will use the TLS-cached cpu number for the following > benchmarks. > > On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison > with a baseline running very few load/stores (no locking, > no getcpu, assuming one thread per CPU with affinity), > against locking scheme based on "lock; cmpxchg", "cmpxchg" > (using restart signal), load-store (using restart signal). > This is performed with 32 threads on a 16-core, hyperthread > system: > > ns/loop overhead (ns) > B
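The "naked read against the segment limit" pjt mentions refers to the trick the x86_64 vdso itself uses: the kernel encodes the CPU (and node) number into the limit field of a dedicated GDT segment, which userspace can fetch with a single unprivileged lsl instruction. A sketch follows; the selector value and the 12-bit CPU encoding match what Linux used around this era, but they are kernel ABI details and should be treated as assumptions.

#include <stdint.h>

/* GDT_ENTRY_PER_CPU (15) << 3, RPL 3; kernel-version dependent */
#define PER_CPU_SELECTOR ((15 << 3) | 3)

static inline unsigned int getcpu_lsl(void)
{
	unsigned int p;

	/* lsl loads the segment limit without faulting at CPL 3;
	 * the cpu number lives in the low 12 bits, the node above. */
	__asm__ ("lsl %1, %0" : "=r" (p) : "r" (PER_CPU_SELECTOR));
	return p & 0xfff;
}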
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Thu, May 21, 2015 at 12:08 PM, Mathieu Desnoyers wrote: > - Original Message - >> On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote: >> >> > +struct thread_percpu_user { >> > + int32_t nesting; >> > + int32_t signal_sent; >> > + int32_t signo; >> > + int32_t current_cpu; >> > +}; >> >> I would require this thing be naturally aligned, such that it does not >> cross cacheline boundaries. > > Good point. Adding a comment into the code to that effect. > >> >> > + >> > +static void percpu_user_sched_in(struct preempt_notifier *notifier, int >> > cpu) >> > +{ >> > + struct thread_percpu_user __user *tpu_user; >> > + struct thread_percpu_user tpu; >> > + struct task_struct *t = current; >> > + >> > + tpu_user = t->percpu_user; >> > + if (tpu_user == NULL) >> > + return; >> > + if (unlikely(t->flags & PF_EXITING)) >> > + return; >> > + /* >> > +* access_ok() of tpu_user has already been checked by sys_percpu(). >> > +*/ >> > + if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> >> This seems a waste; you already read the number unconditionally, might >> as well double check and avoid the store. >> >> > + if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> >> if (tpu.current_cpu != smp_processor_id()) >> __put_user(); > > Yep, and I could even use the "cpu" parameter received by the > function rather than smp_processor_id(). > >> >> >> >> > + if (!tpu.nesting || tpu.signal_sent) >> > + return; >> > + if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> > + tpu.signal_sent = 1; >> > + if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) { >> > + WARN_ON_ONCE(1); >> > + return; >> > + } >> > +} >> >> Please do not use preempt notifiers for this. > > Do you recommend we issue a function call from the scheduler > finish_task_switch() ? > >> >> Second, this all is done with preemption disabled, this means that all >> that user access can fail. > > OK, this is one part I was worried about. > >> >> You print useless WARNs and misbehave. If you detect a desire to fault, >> you could delay until return to userspace and try again there. But it >> all adds complexity. > > We could keep a flag, and then call the function again if we detect a > desire to fault. > >> >> The big advantage pjt's scheme had is that we have the instruction >> pointer, we do not need to go read userspace memory that might not be >> there. And it being limited to a single range, while inconvenient, >> simplifies the entire kernel side to: >> >> if ((unsigned long)(ip - offset) < size) >> do_magic(); >> >> Which is still simpler than the above. Yeah, we've thought about this part of the API in some depth. There are obvious pros and cons. The advantage of this approach is that: a) All of the demuxing can be done in userspace, the kernel check is as you say, wholly trivial. B) We do not need to do anything complicated, e.g. signal delivery The largest drawback is that it does require centralization of your percpu regions. It's been our experience that most features are best implemented using a small library of operations rather than long-form-atomic-sequences. This is a notable challenge for binaries with shared linkage. > > There is one big aspect of pjt's approach that I still don't grasp > after all this time that makes me worry. How does it interact with > the following scenario ? 
> > Userspace thread > - within the code region that needs to be restarted > - signal handler nested on top > - running within the signal handler code > - preempted by kernel > - checking instruction pointer misses the userspace stack > underneath the signal handler. > > Given this scenario, is the kernel code really as simple as a pointer check > on pt_regs, or do we need a stack walk over all signal frames ? Another way > would be to check for the pt_regs instruction pointer whenever we receive > a signal, but then it would require per-architecture modifications, and > suddenly becomes less straightforward.

This is actually straightforward to handle: On signal delivery, as you are setting up the signal stack, you modify the interrupted state so that the signal handler will return into the restart section rather than the original code.
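Sketched at the kernel level for the single-range model PeterZ described earlier in the thread, pjt's fixup amounts to the following (restart_range and the hook point are assumptions; this is not code from the actual patch set):

struct restart_range {
	unsigned long start;	/* base of the critical section */
	unsigned long size;	/* length in bytes */
	unsigned long restart;	/* abort/restart entry point */
};

/*
 * Called while a signal frame is being set up over interrupted user
 * state: if the interrupted IP sits inside the registered range,
 * rewrite it so that sigreturn resumes at the restart point instead
 * of in the middle of the sequence. The unsigned subtraction makes
 * the range check a single compare, as in PeterZ's formula.
 */
static void fixup_critical_section(struct pt_regs *regs,
				   const struct restart_range *r)
{
	if (regs->ip - r->start < r->size)
		regs->ip = r->restart;
}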
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message - > On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote: > > > +struct thread_percpu_user { > > + int32_t nesting; > > + int32_t signal_sent; > > + int32_t signo; > > + int32_t current_cpu; > > +}; > > I would require this thing be naturally aligned, such that it does not > cross cacheline boundaries. Good point. Adding a comment into the code to that effect. > > > + > > +static void percpu_user_sched_in(struct preempt_notifier *notifier, int > > cpu) > > +{ > > + struct thread_percpu_user __user *tpu_user; > > + struct thread_percpu_user tpu; > > + struct task_struct *t = current; > > + > > + tpu_user = t->percpu_user; > > + if (tpu_user == NULL) > > + return; > > + if (unlikely(t->flags & PF_EXITING)) > > + return; > > + /* > > +* access_ok() of tpu_user has already been checked by sys_percpu(). > > +*/ > > + if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > This seems a waste; you already read the number unconditionally, might > as well double check and avoid the store. > > > + if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > if (tpu.current_cpu != smp_processor_id()) > __put_user(); Yep, and I could even use the "cpu" parameter received by the function rather than smp_processor_id(). > > > > > + if (!tpu.nesting || tpu.signal_sent) > > + return; > > + if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > + tpu.signal_sent = 1; > > + if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) { > > + WARN_ON_ONCE(1); > > + return; > > + } > > +} > > Please do not use preempt notifiers for this. Do you recommend we issue a function call from the scheduler finish_task_switch() ? > > Second, this all is done with preemption disabled, this means that all > that user access can fail. OK, this is one part I was worried about. > > You print useless WARNs and misbehave. If you detect a desire to fault, > you could delay until return to userspace and try again there. But it > all adds complexity. We could keep a flag, and then call the function again if we detect a desire to fault. > > The big advantage pjt's scheme had is that we have the instruction > pointer, we do not need to go read userspace memory that might not be > there. And it being limited to a single range, while inconvenient, > simplifies the entire kernel side to: > > if ((unsigned long)(ip - offset) < size) > do_magic(); > > Which is still simpler than the above. There is one big aspect of pjt's approach that I still don't grasp after all this time that makes me worry. How does it interact with the following scenario ? Userspace thread - within the code region that needs to be restarted - signal handler nested on top - running within the signal handler code - preempted by kernel - checking instruction pointer misses the userspace stack underneath the signal handler. Given this scenario, is the kernel code really as simple as a pointer check on pt_regs, or do we need a stack walk over all signal frames ? Another way would be to check for the pt_regs instruction pointer whenever we receive a signal, but then it would require per-architectures modifications, and suddenly becomes less straightforward. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. 
http://www.efficios.com
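The "keep a flag and try again" idea from this exchange could take the following shape: do nothing that can fault from the scheduler, and move all user-memory traffic to the return-to-userspace path. TIF_PERCPU_NOTIFY and the hook names are invented for the sketch; only the do_send_sig_info() call mirrors the posted patch.

/* scheduler side: runs with preemption disabled, never touches
 * user memory, so it can never fault */
static void percpu_user_sched_in(struct task_struct *t)
{
	if (t->percpu_user)
		set_tsk_thread_flag(t, TIF_PERCPU_NOTIFY);
}

/* exit-to-userspace side: preemptible, faults are harmless here */
void percpu_user_notify_resume(void)
{
	struct thread_percpu_user __user *tpu_user = current->percpu_user;
	struct thread_percpu_user tpu;
	int cpu = raw_smp_processor_id();	/* we may still migrate;
						   that just re-sets the flag */

	clear_tsk_thread_flag(current, TIF_PERCPU_NOTIFY);
	if (copy_from_user(&tpu, tpu_user, sizeof(tpu)))
		return;			/* userspace unmapped it; ignore */
	if (tpu.current_cpu != cpu)
		put_user(cpu, &tpu_user->current_cpu);
	if (tpu.nesting && !tpu.signal_sent &&
	    !do_send_sig_info(tpu.signo, SEND_SIG_PRIV, current, 0))
		put_user(1, &tpu_user->signal_sent);
}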
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
- Original Message - > On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote: > > Expose a new system call allowing userspace threads to register > > a TLS area used as an ABI between the kernel and userspace to > > share information required to create efficient per-cpu critical > > sections in user-space. > > > > This ABI consists of a thread-local structure containing: > > > > - a nesting count surrounding the critical section, > > - a signal number to be sent to the thread when preempting a thread > > with non-zero nesting count, > > - a flag indicating whether the signal has been sent within the > > critical section, > > - an integer where to store the current CPU number, updated whenever > > the thread is preempted. This CPU number cache is not strictly > > needed, but performs better than getcpu vdso. > > > > This approach is inspired by Paul Turner and Andrew Hunter's work > > on percpu atomics, which lets the kernel handle restart of critical > > sections, ref. > > http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf > > > > What is done differently here compared to percpu atomics: we track > > a single nesting counter per thread rather than many ranges of > > instruction pointer values. We deliver a signal to user-space and > > let the logic of restart be handled in user-space, thus moving > > the complexity out of the kernel. The nesting counter approach > > allows us to skip the complexity of interacting with signals that > > would be otherwise needed with the percpu atomics approach, which > > needs to know which instruction pointers are preempted, including > > when preemption occurs on a signal handler nested over an instruction > > pointer of interest. > > > > Advantages of this approach over percpu atomics: > > - kernel code is relatively simple: complexity of restart sections > > is in user-space, > > - easy to port to other architectures: just need to reserve a new > > system call, > > - for threads which have registered a TLS structure, the fast-path > > at preemption is only a nesting counter check, along with the > > optional store of the current CPU number, rather than comparing > > instruction pointer with possibly many registered ranges, > > > > Caveats of this approach compared to the percpu atomics: > > - We need a signal number for this, so it cannot be done without > > designing the application accordingly, > > - Handling restart in user-space is currently performed with page > > protection, for which we install a SIGSEGV signal handler. Again, > > this requires designing the application accordingly, especially > > if the application installs its own segmentation fault handler, > > - It cannot be used for tracing of processes by injection of code > > into their address space, due to interactions with application > > signal handlers. > > > > The user-space proof of concept code implementing the restart section > > can be found here: https://github.com/compudj/percpu-dev > > > > Benchmarking sched_getcpu() vs tls cache approach. Getting the > > current CPU number: > > > > - With Linux vdso:12.7 ns > > - With TLS-cached cpu number: 0.3 ns > > > > We will use the TLS-cached cpu number for the following > > benchmarks. 
> >
> > On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison > > with a baseline running very few load/stores (no locking, > > no getcpu, assuming one thread per CPU with affinity), > > against locking scheme based on "lock; cmpxchg", "cmpxchg" > > (using restart signal), load-store (using restart signal). > > This is performed with 32 threads on a 16-core, hyperthread > > system:
> >
> >                  ns/loop   overhead (ns)
> > Baseline:            3.7             0.0
> > lock; cmpxchg:      22.0            18.3
> > cmpxchg:            11.1             7.4
> > load-store:          9.4             5.7
> >
> > Therefore, the load-store scheme has a speedup of 3.2x over the > > "lock; cmpxchg" scheme if both are using the tls-cache for the > > CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg" > > we reach a speedup of 5.4x for load-store+tls-cache vs > > "lock; cmpxchg"+vdso-getcpu. > > > > I'm sending this out to trigger discussion, and hopefully to see > > Paul and Andrew's patches being posted publicly at some point, so > > we can compare our approaches. > > The idea seems sensible. One quick comment: as with any new syscall, > please include a flags argument. Sure, I'll add it right away. Thanks for the feedback! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote:
> +struct thread_percpu_user {
> +	int32_t nesting;
> +	int32_t signal_sent;
> +	int32_t signo;
> +	int32_t current_cpu;
> +};

I would require this thing be naturally aligned, such that it does not cross cacheline boundaries.

> +
> +static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
> +{
> +	struct thread_percpu_user __user *tpu_user;
> +	struct thread_percpu_user tpu;
> +	struct task_struct *t = current;
> +
> +	tpu_user = t->percpu_user;
> +	if (tpu_user == NULL)
> +		return;
> +	if (unlikely(t->flags & PF_EXITING))
> +		return;
> +	/*
> +	 * access_ok() of tpu_user has already been checked by sys_percpu().
> +	 */
> +	if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
> +		WARN_ON_ONCE(1);
> +		return;
> +	}

This seems a waste; you already read the number unconditionally, might as well double check and avoid the store.

> +	if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
> +		WARN_ON_ONCE(1);
> +		return;
> +	}

	if (tpu.current_cpu != smp_processor_id())
		__put_user();

> +	if (!tpu.nesting || tpu.signal_sent)
> +		return;
> +	if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
> +		WARN_ON_ONCE(1);
> +		return;
> +	}
> +	tpu.signal_sent = 1;
> +	if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
> +		WARN_ON_ONCE(1);
> +		return;
> +	}
> +}

Please do not use preempt notifiers for this.

Second, this all is done with preemption disabled, this means that all that user access can fail.

You print useless WARNs and misbehave. If you detect a desire to fault, you could delay until return to userspace and try again there. But it all adds complexity.

The big advantage pjt's scheme had is that we have the instruction pointer, we do not need to go read userspace memory that might not be there. And it being limited to a single range, while inconvenient, simplifies the entire kernel side to:

	if ((unsigned long)(ip - offset) < size)
		do_magic();

Which is still simpler than the above.
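Folding PeterZ's two local comments into the function (read the struct once, reuse the cpu argument, store the CPU number only when it changed, and write back a single field on the way out) gives roughly the following. His bigger objections, the preempt notifier and faulting with preemption disabled, still stand, so treat this as a sketch of the micro-fixes only.

static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
{
	struct thread_percpu_user __user *tpu_user = current->percpu_user;
	struct thread_percpu_user tpu;

	if (tpu_user == NULL)
		return;
	if (unlikely(current->flags & PF_EXITING))
		return;
	if (__copy_from_user(&tpu, tpu_user, sizeof(tpu)))
		return;
	if (tpu.current_cpu != cpu)	/* avoid dirtying the cacheline */
		__put_user(cpu, &tpu_user->current_cpu);
	if (!tpu.nesting || tpu.signal_sent)
		return;
	if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, current, 0))
		return;
	__put_user(1, &tpu_user->signal_sent);	/* one field, not the
						   whole struct */
}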
Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections
On Thu, May 21, 2015 at 10:44:47AM -0400, Mathieu Desnoyers wrote: > Expose a new system call allowing userspace threads to register > a TLS area used as an ABI between the kernel and userspace to > share information required to create efficient per-cpu critical > sections in user-space. > > This ABI consists of a thread-local structure containing: > > - a nesting count surrounding the critical section, > - a signal number to be sent to the thread when preempting a thread > with non-zero nesting count, > - a flag indicating whether the signal has been sent within the > critical section, > - an integer where to store the current CPU number, updated whenever > the thread is preempted. This CPU number cache is not strictly > needed, but performs better than getcpu vdso. > > This approach is inspired by Paul Turner and Andrew Hunter's work > on percpu atomics, which lets the kernel handle restart of critical > sections, ref. > http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf > > What is done differently here compared to percpu atomics: we track > a single nesting counter per thread rather than many ranges of > instruction pointer values. We deliver a signal to user-space and > let the logic of restart be handled in user-space, thus moving > the complexity out of the kernel. The nesting counter approach > allows us to skip the complexity of interacting with signals that > would be otherwise needed with the percpu atomics approach, which > needs to know which instruction pointers are preempted, including > when preemption occurs on a signal handler nested over an instruction > pointer of interest. > > Advantages of this approach over percpu atomics: > - kernel code is relatively simple: complexity of restart sections > is in user-space, > - easy to port to other architectures: just need to reserve a new > system call, > - for threads which have registered a TLS structure, the fast-path > at preemption is only a nesting counter check, along with the > optional store of the current CPU number, rather than comparing > instruction pointer with possibly many registered ranges, > > Caveats of this approach compared to the percpu atomics: > - We need a signal number for this, so it cannot be done without > designing the application accordingly, > - Handling restart in user-space is currently performed with page > protection, for which we install a SIGSEGV signal handler. Again, > this requires designing the application accordingly, especially > if the application installs its own segmentation fault handler, > - It cannot be used for tracing of processes by injection of code > into their address space, due to interactions with application > signal handlers. > > The user-space proof of concept code implementing the restart section > can be found here: https://github.com/compudj/percpu-dev > > Benchmarking sched_getcpu() vs tls cache approach. Getting the > current CPU number: > > - With Linux vdso:12.7 ns > - With TLS-cached cpu number: 0.3 ns > > We will use the TLS-cached cpu number for the following > benchmarks. > > On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison > with a baseline running very few load/stores (no locking, > no getcpu, assuming one thread per CPU with affinity), > against locking scheme based on "lock; cmpxchg", "cmpxchg" > (using restart signal), load-store (using restart signal). 
> This is performed with 32 threads on a 16-core, hyperthread > system:
>
>                  ns/loop   overhead (ns)
> Baseline:            3.7             0.0
> lock; cmpxchg:      22.0            18.3
> cmpxchg:            11.1             7.4
> load-store:          9.4             5.7
>
> Therefore, the load-store scheme has a speedup of 3.2x over the > "lock; cmpxchg" scheme if both are using the tls-cache for the > CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg" > we reach a speedup of 5.4x for load-store+tls-cache vs > "lock; cmpxchg"+vdso-getcpu. > > I'm sending this out to trigger discussion, and hopefully to see > Paul and Andrew's patches being posted publicly at some point, so > we can compare our approaches.

The idea seems sensible. One quick comment: as with any new syscall, please include a flags argument.

- Josh Triplett
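What the register path might look like with the flags argument Josh asks for, plus the natural-alignment requirement from PeterZ's review elsewhere in the thread; the prototype is a guess at the final form, not the code as posted.

SYSCALL_DEFINE2(percpu, struct thread_percpu_user __user *, tpu,
		unsigned long, flags)
{
	if (flags)
		return -EINVAL;	/* reject unknown flags, keep room to grow */
	if (tpu) {
		/* natural alignment: the 16-byte struct must not
		 * straddle a cacheline */
		if ((unsigned long)tpu & (sizeof(*tpu) - 1))
			return -EINVAL;
		if (!access_ok(VERIFY_WRITE, tpu, sizeof(*tpu)))
			return -EFAULT;
	}
	current->percpu_user = tpu;	/* NULL unregisters */
	return 0;
}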
[RFC PATCH] percpu system call: fast userspace percpu critical sections
Expose a new system call allowing userspace threads to register a TLS area used as an ABI between the kernel and userspace to share information required to create efficient per-cpu critical sections in user-space.

This ABI consists of a thread-local structure containing:

- a nesting count surrounding the critical section,
- a signal number to be sent to the thread when preempting a thread with non-zero nesting count,
- a flag indicating whether the signal has been sent within the critical section,
- an integer where to store the current CPU number, updated whenever the thread is preempted. This CPU number cache is not strictly needed, but performs better than getcpu vdso.

This approach is inspired by Paul Turner and Andrew Hunter's work on percpu atomics, which lets the kernel handle restart of critical sections, ref. http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

What is done differently here compared to percpu atomics: we track a single nesting counter per thread rather than many ranges of instruction pointer values. We deliver a signal to user-space and let the logic of restart be handled in user-space, thus moving the complexity out of the kernel. The nesting counter approach allows us to skip the complexity of interacting with signals that would be otherwise needed with the percpu atomics approach, which needs to know which instruction pointers are preempted, including when preemption occurs on a signal handler nested over an instruction pointer of interest.

Advantages of this approach over percpu atomics:

- kernel code is relatively simple: complexity of restart sections is in user-space,
- easy to port to other architectures: just need to reserve a new system call,
- for threads which have registered a TLS structure, the fast-path at preemption is only a nesting counter check, along with the optional store of the current CPU number, rather than comparing instruction pointer with possibly many registered ranges,

Caveats of this approach compared to the percpu atomics:

- We need a signal number for this, so it cannot be done without designing the application accordingly,
- Handling restart in user-space is currently performed with page protection, for which we install a SIGSEGV signal handler. Again, this requires designing the application accordingly, especially if the application installs its own segmentation fault handler,
- It cannot be used for tracing of processes by injection of code into their address space, due to interactions with application signal handlers.

The user-space proof of concept code implementing the restart section can be found here: https://github.com/compudj/percpu-dev

Benchmarking sched_getcpu() vs tls cache approach. Getting the current CPU number:

- With Linux vdso: 12.7 ns
- With TLS-cached cpu number: 0.3 ns

We will use the TLS-cached cpu number for the following benchmarks.

On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison with a baseline running very few load/stores (no locking, no getcpu, assuming one thread per CPU with affinity), against locking scheme based on "lock; cmpxchg", "cmpxchg" (using restart signal), load-store (using restart signal). This is performed with 32 threads on a 16-core, hyperthread system:

                 ns/loop   overhead (ns)
Baseline:            3.7             0.0
lock; cmpxchg:      22.0            18.3
cmpxchg:            11.1             7.4
load-store:          9.4             5.7

Therefore, the load-store scheme has a speedup of 3.2x over the "lock; cmpxchg" scheme if both are using the tls-cache for the CPU number.
If we use Linux sched_getcpu() for "lock; cmpxchg" we reach a speedup of 5.4x for load-store+tls-cache vs "lock; cmpxchg"+vdso-getcpu.

I'm sending this out to trigger discussion, and hopefully to see Paul and Andrew's patches being posted publicly at some point, so we can compare our approaches.

Signed-off-by: Mathieu Desnoyers
CC: Paul Turner
CC: Andrew Hunter
CC: Peter Zijlstra
CC: Ingo Molnar
CC: Ben Maurer
CC: Steven Rostedt
CC: "Paul E. McKenney"
CC: Josh Triplett
CC: Lai Jiangshan
CC: Linus Torvalds
CC: Andrew Morton
---
 arch/x86/syscalls/syscall_64.tbl  |   1 +
 fs/exec.c                         |   1 +
 include/linux/sched.h             |  18 ++
 include/uapi/asm-generic/unistd.h |   4 +-
 init/Kconfig                      |  10 +++
 kernel/Makefile                   |   1 +
 kernel/fork.c                     |   2 +
 kernel/percpu-user.c              | 126 ++
 kernel/sys_ni.c                   |   3 +
 9 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 kernel/percpu-user.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..0499703 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec