Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-05-04 Thread Mathieu Desnoyers
- On Apr 16, 2018, at 4:58 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Apr 16, 2018, at 3:26 PM, Linus Torvalds 
> torva...@linux-foundation.org
> wrote:
> 
>> On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers
>>  wrote:
>>>
>>> And I try very hard to avoid being told I'm the one breaking
>>> user-space. ;-)
>> 
>> You *can't* be breaking user space. User space doesn't use this yet.
>> 
>> That's actually why I'd like to start with the minimal set - to make
>> sure we don't introduce features that will come back to bite us later.
>> 
>> The one compelling use case I saw was a memory allocator that used
>> this for getting per-CPU (vs per-thread) memory scaling.
>> 
>> That code didn't need the cpu_opv system call at all.
>> 
>> And if somebody does a ldload of a malloc library, and then wants to
>> analyze the behavior of a program, maybe they should ldload their own
>> malloc routines first? That's pretty much par for the course for those
>> kinds of projects.
>> 
>> So I'd much rather we first merge the non-contentious parts that
>> actually have some numbers for "this improves performance and makes a
>> nice fancy malloc possible".
>> 
>> As it is, the cpu_opv seems to be all about theory, not about actual need.
> 
> I fully get your point about getting the minimal feature in. So let's focus
> on rseq only.
> 
> I will rework the patchset so the rseq selftests don't depend on cpu_opv,
> and remove the cpu_opv stuff. I think it would be a good start for the
> Facebook guys (jemalloc), given that just rseq seems to be enough for them
> for now. It should be enough for the arm64 performance counters as well.
> 
> Then we'll figure out what is needed to make other projects use it based on
> their needs (e.g. lttng-ust, liburcu, glibc malloc), and whether jemalloc
> ends up requiring cpu_opv for memory migration between per-cpu pools after all.

So, having done this, I find myself in need of advice regarding smoothly
transitioning existing user-space programs/libraries to rseq. Let's consider
a situation where only rseq (without cpu_opv) eventually gets merged into
4.18.

The proposed rseq implementation presents the following constraints:

- Only a single rseq TLS can be registered per thread, therefore rseq needs
  to be "owned" by a single library (let's say it's librseq.so),
- User-space rseq critical sections need to be inlined into applications and
  libraries for performance reasons (extra branches and calls significantly
  degrade performance of those fast-paths).

I have a ring buffer "space reservation" use-case in my user-space tracer
which requires both rseq and cpu_opv.

My original plan to transition this fast-path to rseq was to test the
@cpu_id field value from the rseq TLS and use a fallback based on
atomic instructions if it is negative. rseq is already designed to ensure
we can compare @cpu_id against @cpu_id_start and detect both migration
(cpu id differs) and rseq ENOSYS with a single branch in the fast path.
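
Concretely, the single-branch test could look like the following sketch
(field names follow the proposed struct rseq layout; reserve_space_atomic()
is a hypothetical fallback based on atomic instructions):

#include <stddef.h>
#include <stdint.h>

struct rseq {
        uint32_t cpu_id_start;  /* reads as 0 when rseq is unregistered */
        uint32_t cpu_id;        /* reads as (uint32_t)-1 when unregistered */
};

extern __thread volatile struct rseq __rseq_abi;  /* owned by librseq.so */
extern int reserve_space_atomic(size_t len);      /* hypothetical fallback */

static int reserve_space(size_t len)
{
        uint32_t cpu = __rseq_abi.cpu_id_start;

        /*
         * A single branch covers both migration (cpu_id differs from the
         * cpu_id_start value loaded above) and rseq ENOSYS (cpu_id reads
         * as -1, which can never match cpu_id_start).
         */
        if (__rseq_abi.cpu_id != cpu)
                return reserve_space_atomic(len);
        /* ... inlined rseq critical section for per-cpu ring buffer @cpu ... */
        return 0;
}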

Once rseq gets merged and deployed into kernels, this means librseq.so
will actually populate the rseq TLS, and this @cpu_id field will be >= 0.
If kernels are released with rseq but without cpu_opv, then I cannot use
this @cpu_id field to detect whether *both* rseq and cpu_opv are available.

I see a few possible ways to handle this, none of which are particularly
great:

1) Duplicate the entire implementation of the user-space functions in which
   the rseq critical sections are inlined, detect at runtime whether
   cpu_opv is available, and select the right function accordingly. If those
   functions are relatively small, this could be acceptable,

2) Code patching based on asm goto. There is no user-space library for
   this at the moment AFAIK, and patching user-space code triggers COW,
   which is bad for TLB and cache locality,

3) Add an extra branch in the rseq fast-path. I would like to avoid this
   especially on arm32, where the cost of an extra branch is significant
   enough to outweigh the benefit of rseq compared to ll/sc.

So far, only option (1) seems relatively acceptable from my perspective,
but that's only because my functions using rseq are relatively small.
If this code bloat is not seen as acceptable, then we should revisit
merging both rseq and cpu_opv at the same time, and make sure CONFIG_RSEQ
selects CONFIG_CPU_OPV.
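
To make option (1) concrete, here is a minimal sketch, assuming a
hypothetical cpu_opv_available() probe (e.g. a cpu_opv call checked for
-ENOSYS) run once at library initialization:

#include <stddef.h>

extern int cpu_opv_available(void);     /* hypothetical ENOSYS probe */

/* Variant inlining rseq critical sections, with cpu_opv as slow path. */
extern int reserve_space_rseq_opv(size_t len);
/* Variant using atomic instructions only (no rseq, no cpu_opv). */
extern int reserve_space_atomic_only(size_t len);

static int (*reserve_space_fn)(size_t len);

__attribute__((constructor))
static void reserve_space_init(void)
{
        reserve_space_fn = cpu_opv_available() ?
                           reserve_space_rseq_opv : reserve_space_atomic_only;
}

The indirect call sits at the boundary of the duplicated functions, so the
inlined rseq fast-paths themselves stay free of extra branches.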

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Mathieu Desnoyers
- On Apr 16, 2018, at 3:26 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers
>  wrote:
>>
>> And I try very hard to avoid being told I'm the one breaking
>> user-space. ;-)
> 
> You *can't* be breaking user space. User space doesn't use this yet.
> 
> That's actually why I'd like to start with the minimal set - to make
> sure we don't introduce features that will come back to bite us later.
> 
> The one compelling use case I saw was a memory allocator that used
> this for getting per-CPU (vs per-thread) memory scaling.
> 
> That code didn't need the cpu_opv system call at all.
> 
> And if somebody does a ldload of a malloc library, and then wants to
> analyze the behavior of a program, maybe they should ldload their own
> malloc routines first? That's pretty much par for the course for those
> kinds of projects.
> 
> So I'd much rather we first merge the non-contentious parts that
> actually have some numbers for "this improves performance and makes a
> nice fancy malloc possible".
> 
> As it is, the cpu_opv seems to be all about theory, not about actual need.

I fully get your point about getting the minimal feature in. So let's focus
on rseq only.

I will rework the patchset so the rseq selftests don't depend on cpu_opv,
and remove the cpu_opv stuff. I think it would be a good start for the
Facebook guys (jemalloc), given that just rseq seems to be enough for them
for now. It should be enough for the arm64 performance counters as well.

Then we'll figure out what is needed to make other projects use it based on
their needs (e.g. lttng-ust, liburcu, glibc malloc), and whether jemalloc
ends up requiring cpu_opv for memory migration between per-cpu pools after all.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Linus Torvalds
On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers
 wrote:
>
> And I try very hard to avoid being told I'm the one breaking
> user-space. ;-)

You *can't* be breaking user space. User space doesn't use this yet.

That's actually why I'd like to start with the minimal set - to make
sure we don't introduce features that will come back to bite us later.

The one compelling use case I saw was a memory allocator that used
this for getting per-CPU (vs per-thread) memory scaling.

That code didn't need the cpu_opv system call at all.

And if somebody does a ldload of a malloc library, and then wants to
analyze the behavior of a program, maybe they should ldload their own
malloc routines first? That's pretty much par for the course for those
kinds of projects.

So I'd much rather we first merge the non-contentious parts that
actually have some numbers for "this improves performance and makes a
nice fancy malloc possible".

As it is, the cpu_opv seems to be all about theory, not about actual need.

  Linus


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Mathieu Desnoyers
- On Apr 16, 2018, at 2:39 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Mon, Apr 16, 2018 at 11:35 AM, Mathieu Desnoyers
>  wrote:
>> Specifically for single-stepping, the __rseq_table section introduced
>> at user-level will allow newer debuggers and tools which do line and
>> instruction-level single-stepping to skip over rseq critical sections.
>> However, this breaks existing debuggers and tools.
> 
> I really don't think single-stepping is a valid argument.
> 
> Even if the cpu_opv() allows you to "single step", you're not actually
> single stepping the same thing that you're using. So you are literally
> debugging something else than the real code.
> 
> At that point, you don't need "cpu_opv()", you need to just load
> /dev/urandom in a buffer, and single-step that. Ta-daa! No new kernel
> functionality needed.
> 
> So if the main argument for cpu_opv is single-stepping, then just rip
> it out. It's not useful.

No, single-stepping is not the only use-case. Accessing remote cpu
data is another use-case fulfilled by cpu_opv, which I think is more
compelling.

> 
> Anybody who cares deeply about single-stepping shouldn't be using
> optimistic algorithms, and they shouldn't be doing multi-threaded
> stuff either. They won't be able to use things like transactional
> memory either.
> 
> You can't single-step into the kernel to see what the kernel does
> either when you're debugging something.
> 
> News at 11: "single stepping isn't always viable".

I don't mind if people cannot stop the program with a debugger and
observe the state of registers manually at each step through a rseq
critical section.

I do mind breaking existing tools that rely on single-stepping
approaches to automatically analyze program behavior [1,2].
Introducing a rseq critical section into a library (e.g. glibc
memory allocator) would cause existing programs being analyzed
with existing tools to hang.

And I try very hard to avoid being told I'm the one breaking
user-space. ;-)

Thanks,

Mathieu

[1] http://rr-project.org/
[2] https://www.gnu.org/software/gdb/news/reversible.html

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Linus Torvalds
On Mon, Apr 16, 2018 at 11:35 AM, Mathieu Desnoyers
 wrote:
> Specifically for single-stepping, the __rseq_table section introduced
> at user-level will allow newer debuggers and tools which do line and
> instruction-level single-stepping to skip over rseq critical sections.
> However, this breaks existing debuggers and tools.

I really don't think single-stepping is a valid argument.

Even if the cpu_opv() allows you to "single step", you're not actually
single stepping the same thing that you're using. So you are literally
debugging something else than the real code.

At that point, you don't need "cpu_opv()", you need to just load
/dev/urandom in a buffer, and single-step that. Ta-daa! No new kernel
functionality needed.

So if the main argument for cpu_opv is single-stepping, then just rip
it out. It's not useful.

Anybody who cares deeply about single-stepping shouldn't be using
optimistic algorithms, and they shouldn't be doing multi-threaded
stuff either. They won't be able to use things like transactional
memory either.

You can't single-step into the kernel to see what the kernel does
either when you're debugging something.

News at 11: "single stepping isn't always viable".

Linus


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Mathieu Desnoyers
- On Apr 14, 2018, at 6:44 PM, Andy Lutomirski l...@amacapital.net wrote:

> On Thu, Apr 12, 2018 at 12:43 PM, Linus Torvalds
>  wrote:
>> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
>>  wrote:
>>> The cpu_opv system call executes a vector of operations on behalf of
>>> user-space on a specific CPU with preemption disabled. It is inspired
>>> by readv() and writev() system calls which take a "struct iovec"
>>> array as argument.
>>
>> Do we really want the page pinning?
>>
>> This whole cpu_opv thing is the most questionable part of the series,
>> and the page pinning is the most questionable part of cpu_opv for me.
>>
>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
>>
>> I think that would make Andy happier too.
>>
> 
> It only makes me happier if the userspace code involved is actually
> going to work when single-stepped, which might actually be the case
> (fingers crossed).

Specifically for single-stepping, the __rseq_table section introduced
at user-level will allow newer debuggers and tools which do line and
instruction-level single-stepping to skip over rseq critical sections.
However, this breaks existing debuggers and tools.

For a userspace tracer tool such as LTTng-UST, requiring an upgrade to newer
debugger versions would limit its adoption in the field. So if using rseq
breaks current debugger tools, lttng-ust won't use rseq until
single-stepping can be done in a non-breaking way, or will have to wait
until most end-user deployments (distributions used in the field) include
debugger versions that skip over the code identified by the __rseq_table
section, which will take many years.

> That being said, I'm not really convinced that
> cpu_opv() makes much difference here, since I'm not entirely convinced
> that user code will actually use it or that user code will actually be
> that well tested.  C'est la vie.

For the use-case of cpu_opv invoked as a single-stepping fallback, this path
will indeed not be executed often enough to be well-tested. I'm considering
the following approach to allow user-space to test cpu_opv more thoroughly:
we can introduce an environment variable, e.g.:

- RSEQ_DISABLE=1: Disable rseq thread registration,
- RSEQ_DISABLE=random: Randomly disable rseq thread registration (some threads
  use rseq, other threads end up using the cpu_opv fallback)

which would disable the rseq fast-path for all or some threads, and thus allow
thorough testing of cpu_opv used as single-stepping fallback.
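
A rough sketch of how librseq.so's per-thread registration could honor
such a variable (struct layout follows the proposed ABI; sys_rseq() stands
for the registration syscall wrapper, and the signature value is
illustrative):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct rseq {                           /* proposed ABI layout */
        uint32_t cpu_id_start, cpu_id;
        uint64_t rseq_cs;
        uint32_t flags;
};

extern __thread volatile struct rseq __rseq_abi;
extern int sys_rseq(volatile struct rseq *rseq, uint32_t rseq_len,
                    int flags, uint32_t sig);  /* registration wrapper */

#define RSEQ_SIG        0x53053053      /* illustrative signature */

int rseq_register_current_thread(void)
{
        const char *env = getenv("RSEQ_DISABLE");

        if (env && !strcmp(env, "1"))
                return 0;       /* cpu_id stays -1: all threads fall back */
        if (env && !strcmp(env, "random") && (rand() & 1))
                return 0;       /* this thread exercises the cpu_opv fallback */
        return sys_rseq(&__rseq_abi, sizeof(__rseq_abi), 0, RSEQ_SIG);
}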

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Andi Kleen
> Single-stepping is only a subset of the rseq limitations addressed
> by cpu_opv. Another major limitation is algorithms requiring data
> migration between per-cpu data structures safely against CPU hotplug,
> and without having to change the cpu affinity mask. This is the case

And how many people are going to implement such a complex separate
path just for CPU hotplug? And even if they implement it how long
before it bitrots? Seems more like a checkbox item than a realistic
approach.

> for memory allocators and userspace task schedulers which require
> cpu_opv for migration between per-cpu memory pools and scheduler
> runqueues.

Not sure about that. Is that common?

> 
> About the vgettimeofday and general handling of vDSO by gdb, gdb's
> approach only takes care of line-by-line single-stepping by hiding
> Linux' vdso mapping so users cannot target source code lines within
> that shared object. However, it breaks instruction-level single-stepping.
> I reported this issue to you back in Nov. 2017:
> https://lkml.org/lkml/2017/11/20/803

It was known from day 1, but afaik never a problem.

-Andi


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-16 Thread Mathieu Desnoyers
- On Apr 12, 2018, at 4:23 PM, Andi Kleen a...@firstfloor.org wrote:

>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
> 
> That would be the right way to go. I doubt anybody really needs cpu_opv.
> We already have other code (e.g. vgettimeofday) which cannot
> be single stepped, and so far it never was a problem.

Single-stepping is only a subset of the rseq limitations addressed
by cpu_opv. Another major limitation is algorithms requiring data
migration between per-cpu data structures safely against CPU hotplug,
and without having to change the cpu affinity mask. This is the case
for memory allocators and userspace task schedulers which require
cpu_opv for migration between per-cpu memory pools and scheduler
runqueues.

About the vgettimeofday and general handling of vDSO by gdb, gdb's
approach only takes care of line-by-line single-stepping by hiding
Linux' vdso mapping so users cannot target source code lines within
that shared object. However, it breaks instruction-level single-stepping.
I reported this issue to you back in Nov. 2017:
https://lkml.org/lkml/2017/11/20/803

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-14 Thread Andy Lutomirski
On Thu, Apr 12, 2018 at 12:43 PM, Linus Torvalds
 wrote:
> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
>  wrote:
>> The cpu_opv system call executes a vector of operations on behalf of
>> user-space on a specific CPU with preemption disabled. It is inspired
>> by readv() and writev() system calls which take a "struct iovec"
>> array as argument.
>
> Do we really want the page pinning?
>
> This whole cpu_opv thing is the most questionable part of the series,
> and the page pinning is the most questionable part of cpu_opv for me.
>
> Can we plan on merging just the plain rseq parts *without* this all
> first, and then see the cpu_opv thing as a "maybe future expansion"
> part.
>
> I think that would make Andy happier too.
>

It only makes me happier if the userspace code involved is actually
going to work when single-stepped, which might actually be the case
(fingers crossed).  That being said, I'm not really convinced that
cpu_opv() makes much difference here, since I'm not entirely convinced
that user code will actually use it or that user code will actually be
that well tested.  C'est la vie.


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-13 Thread Mathieu Desnoyers
- On Apr 13, 2018, at 12:37 PM, Linus Torvalds 
torva...@linux-foundation.org wrote:

> On Fri, Apr 13, 2018 at 5:16 AM, Mathieu Desnoyers
>  wrote:
>> The vmalloc space needed by cpu_opv is bound by the number of pages
>> a cpu_opv call can touch.
> 
> No it's not.
> 
> You can have a thousand different processes doing cpu_opv at the same time.
> 
> A *single* cpu_opv may be limited to "only" a megabyte, but I'm not
> seeing any global limit anywhere.
> 
> In short, this looks like a guaranteed DoS approach to me.

Right, so one simple approach to solve this is to limit the number
of concurrent cpu_opv executions at any given time.

Considering that cpu_opv is a slow path, we can limit the number
of concurrent cpu_opv executions by protecting this with a global
mutex, or a semaphore if we want the number of concurrent executions
to be greater than 1.

Another approach, if we want to be fancier, is to keep track of the
amount of vmalloc address space currently used by all in-flight cpu_opv.
Beyond a given threshold, further execution of additional cpu_opv
instances would block, waiting to be woken up as vmalloc address
space is freed by completing cpu_opv instances.
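
For the simple variant, a minimal kernel-side sketch (the concurrency
bound and helper names are illustrative, not taken from the posted patch):

#include <linux/errno.h>
#include <linux/semaphore.h>

#define CPU_OPV_MAX_IN_FLIGHT   4       /* illustrative bound */

struct cpu_op;                          /* operation descriptor */
extern long do_cpu_opv(struct cpu_op *ops, int opcnt, int cpu);

static struct semaphore cpu_opv_sem =
        __SEMAPHORE_INITIALIZER(cpu_opv_sem, CPU_OPV_MAX_IN_FLIGHT);

static long cpu_opv_throttled(struct cpu_op *ops, int opcnt, int cpu)
{
        long ret;

        if (down_interruptible(&cpu_opv_sem))
                return -ERESTARTSYS;
        ret = do_cpu_opv(ops, opcnt, cpu);  /* pin, map, run preempt-off */
        up(&cpu_opv_sem);
        return ret;
}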

What global vmalloc address-space budget should we aim for ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-13 Thread Linus Torvalds
On Fri, Apr 13, 2018 at 5:16 AM, Mathieu Desnoyers
 wrote:
> The vmalloc space needed by cpu_opv is bound by the number of pages
> a cpu_opv call can touch.

No it's not.

You can have a thousand different processes doing cpu_opv at the same time.

A *single* cpu_opv may be limited to "only" a megabyte, but I'm not
seeing any global limit anywhere.

In short, this looks like a guaranteed DoS approach to me.

   Linus


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-13 Thread Mathieu Desnoyers
- On Apr 12, 2018, at 4:07 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Thu, Apr 12, 2018 at 12:59 PM, Mathieu Desnoyers
>  wrote:
>>
>> What are your concerns about page pinning ?
> 
> Pretty much everything.
> 
> It's the most complex part by far, and the vmalloc space is a limited
> resource on 32-bit architectures.

The vmalloc space needed by cpu_opv is bound by the number of pages
a cpu_opv call can touch. On architectures with virtually aliased
dcache, we also need to add a few extra pages worth of address space
to account for SHMLBA alignment.

So on ARM32, with SHMLBA=4 pages, this means at most 1 MB of virtual
address space temporarily needed for a cpu_opv system call in the very
worst case scenario: 16 ops * 2 uaddr * 8 pages per uaddr
(if we're unlucky and find ourselves aligned across two SHMLBA) * 4096 bytes 
per page.

If this amount of vmalloc space happens to be our limiting factor, we can
change the max cpu_opv ops array size supported, e.g. bringing it from 16 down
to 4. The largest number of operations I currently need in the cpu-opv library
is 4. With 4 ops, the worst-case vmalloc space used by a cpu_opv system call
becomes 256 kB.
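
Spelled out, those two worst-case bounds are:

  16 ops * 2 uaddr * 8 pages * 4096 bytes/page = 1048576 bytes = 1 MB
   4 ops * 2 uaddr * 8 pages * 4096 bytes/page =  262144 bytes = 256 kB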

> 
>> Do you have an alternative approach in mind ?
> 
> Do everything in user space.

I wish we could disable preemption and cpu hotplug in user-space.
Unfortunately, that does not seem to be a viable solution for many
technical reasons, starting with page fault handling.

> 
> And even if you absolutely want cpu_opv at all, why not do it in the
> user space *mapping* without the aliasing into kernel space?

That's because cpu_opv needs to execute the entire array of operations
with preemption disabled, and we cannot take a page fault with preemption
off.

Page pinning and aliasing user-space pages in the kernel linear mapping
ensure that we don't end up in trouble in page fault scenarios, such as
having the pages we need to touch swapped out under our feet.
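
Roughly, the pattern looks like the following sketch (helper name and
fixed-size page array are illustrative; SHMLBA handling and op decoding
are omitted):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/preempt.h>
#include <linux/topology.h>
#include <linux/vmalloc.h>

extern void run_op_vector(void *kaddr); /* stand-in for the op interpreter */

static int cpu_opv_run_pinned(unsigned long uaddr, int nr_pages)
{
        struct page *pages[8];
        void *kaddr;
        int ret, i;

        /* Fault in and pin the user pages touched by the operations. */
        ret = get_user_pages_fast(uaddr, nr_pages, 1, pages);
        if (ret < nr_pages) {
                for (i = 0; i < ret; i++)
                        put_page(pages[i]);
                return -EFAULT;
        }
        /* Alias them into the kernel's vmalloc address space. */
        kaddr = vm_map_ram(pages, nr_pages, numa_node_id(), PAGE_KERNEL);
        if (!kaddr) {
                for (i = 0; i < nr_pages; i++)
                        put_page(pages[i]);
                return -ENOMEM;
        }

        preempt_disable();
        run_op_vector(kaddr);   /* no page fault can occur here */
        preempt_enable();

        vm_unmap_ram(kaddr, nr_pages);
        for (i = 0; i < nr_pages; i++)
                put_page(pages[i]);
        return 0;
}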

> 
> The cpu_opv approach isn't even fast. It's *really* slow if it has to
> do VM crap.
> 
> The whole rseq thing was billed as "faster than atomics". I
> *guarantee* that the cpu_opv's aren't faster than atomics.

Yes, and here is the good news: cpu_opv speed does not even matter. rseq
assembler instruction sequences are very fast, but cannot deal with infrequent
corner-cases. cpu_opv is slow, but is guaranteed to handle the occasional
corner-case situation.

This is similar to pthread mutex/futex fast/slow paths. The common case is fast
(rseq), and the speed of the infrequent case (cpu_opv) does not matter as long
as it's used infrequently enough, which is the case here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-12 Thread Andi Kleen
> Can we plan on merging just the plain rseq parts *without* this all
> first, and then see the cpu_opv thing as a "maybe future expansion"
> part.

That would be the right way to go. I doubt anybody really needs cpu_opv.
We already have other code (e.g. vgettimeofday) which cannot 
be single stepped, and so far it never was a problem.

-Andi


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-12 Thread Linus Torvalds
On Thu, Apr 12, 2018 at 12:59 PM, Mathieu Desnoyers
 wrote:
>
> What are your concerns about page pinning ?

Pretty much everything.

It's the most complex part by far, and the vmalloc space is a limited
resource on 32-bit architectures.

> Do you have an alternative approach in mind ?

Do everything in user space.

And even if you absolutely want cpu_opv at all, why not do it in the
user space *mapping* without the aliasing into kernel space?

The cpu_opv approach isn't even fast. It's *really* slow if it has to
do VM crap.

The whole rseq thing was billed as "faster than atomics". I
*guarantee* that the cpu_opv's aren't faster than atomics.

 Linus


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-12 Thread Mathieu Desnoyers
- On Apr 12, 2018, at 3:43 PM, Linus Torvalds torva...@linux-foundation.org 
wrote:

> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
>  wrote:
>> The cpu_opv system call executes a vector of operations on behalf of
>> user-space on a specific CPU with preemption disabled. It is inspired
>> by readv() and writev() system calls which take a "struct iovec"
>> array as argument.
> 
> Do we really want the page pinning?
> 
> This whole cpu_opv thing is the most questionable part of the series,
> and the page pinning is the most questionable part of cpu_opv for me.

What are your concerns about page pinning ?

Do you have an alternative approach in mind ?

> Can we plan on merging just the plain rseq parts *without* this all
> first, and then see the cpu_opv thing as a "maybe future expansion"
> part.

The main problem with the incremental approach is that it won't deal
with remote CPU data accesses, and won't deal with cpu hotplug in
non-racy ways. For *some* of the use-cases, the other issues solved by
cpu_opv can be worked-around in user-space, at the cost of making
the userspace code a mess, and in many cases slower than if we can rely
on cpu_opv for the fallback.

All the rseq test-cases depend on cpu_opv as they stand now. Without
cpu_opv to handle the corner-cases, things become much more messy on the
user-space side.

Thanks,

Mathieu

> 
> I think that would make Andy happier too.
> 
>  Linus

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-12 Thread Linus Torvalds
On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers
 wrote:
> The cpu_opv system call executes a vector of operations on behalf of
> user-space on a specific CPU with preemption disabled. It is inspired
> by readv() and writev() system calls which take a "struct iovec"
> array as argument.

Do we really want the page pinning?

This whole cpu_opv thing is the most questionable part of the series,
and the page pinning is the most questionable part of cpu_opv for me.

Can we plan on merging just the plain rseq parts *without* this all
first, and then see the cpu_opv thing as a "maybe future expansion"
part.

I think that would make Andy happier too.

 Linus


[RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)

2018-04-12 Thread Mathieu Desnoyers
The cpu_opv system call executes a vector of operations on behalf of
user-space on a specific CPU with preemption disabled. It is inspired
by readv() and writev() system calls which take a "struct iovec"
array as argument.

The operations available are: comparison, memcpy, add, or, and, xor,
left shift, right shift, and memory barrier. The system call receives
a CPU number from user-space as argument, which is the CPU on which
those operations need to be performed.  All pointers in the ops must
have been set up to point to the per CPU memory of the CPU on which
the operations should be executed. The "comparison" operation can be
used to check that the data used in the preparation step did not
change between preparation of system call inputs and operation
execution within the preempt-off critical section.

The reason why we require all pointer offsets to be calculated by
user-space beforehand is because we need to use get_user_pages()
to first pin all pages touched by each operation. This takes care of
faulting-in the pages. Then, preemption is disabled, and the
operations are performed atomically with respect to other thread
execution on that CPU, without generating any page fault.

An overall maximum of 4216 bytes is enforced on the sum of operation
length within an operation vector, so user-space cannot generate a
too long preempt-off critical section (cache cold critical section
duration measured as 4.7µs on x86-64). Each operation is also limited to
a length of 4096 bytes, meaning that an operation can touch a
maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
destination if addresses are not aligned on page boundaries).

If the thread is not running on the requested CPU, it is migrated to
it.
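
As an illustration, a two-operation vector (compare, then memcpy)
migrating an item between per-cpu free-lists could be built along these
lines (struct layout, opcode names and the syscall wrapper are
illustrative, not the exact ABI of this patch):

#include <stdint.h>

struct cpu_op {
        int32_t op;             /* e.g. CPU_COMPARE_EQ_OP, CPU_MEMCPY_OP */
        uint32_t len;           /* bytes touched, <= 4096 per operation */
        uint64_t a, b;          /* operand addresses, resolved per-cpu */
};

enum { CPU_COMPARE_EQ_OP, CPU_MEMCPY_OP };      /* illustrative opcodes */

extern int sys_cpu_opv(struct cpu_op *ops, int nr_ops, int cpu, int flags);

int migrate_item(void *dst, void *src, void *expect, uint32_t len, int cpu)
{
        struct cpu_op ops[] = {
                /* Fail if @src no longer holds the value seen during prep. */
                { .op = CPU_COMPARE_EQ_OP, .len = len,
                  .a = (uint64_t)(uintptr_t)src,
                  .b = (uint64_t)(uintptr_t)expect },
                /* Then move the item, preemption disabled, on @cpu. */
                { .op = CPU_MEMCPY_OP, .len = len,
                  .a = (uint64_t)(uintptr_t)dst,
                  .b = (uint64_t)(uintptr_t)src },
        };

        /* The kernel migrates us to @cpu and runs the vector there. */
        return sys_cpu_opv(ops, 2, cpu, 0);
}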

 Justification for cpu_opv 

Here are a few reasons justifying why the cpu_opv system call is
needed in addition to rseq:

1) Allow algorithms to perform per-cpu data migration without relying on
   sched_setaffinity()

The use-cases are migrating memory between per-cpu memory free-lists, or
stealing tasks from other per-cpu work queues: each requires that
accesses to remote per-cpu data structures are performed.

Just rseq is not enough to cover those use-cases without additionally
relying on sched_setaffinity, which is unfortunately not
CPU-hotplug-safe.

The cpu_opv system call receives a CPU number as argument, and migrates
the current task to the right CPU to perform the operation sequence. If
the requested CPU is offline, it performs the operations from the
current CPU while preventing CPU hotplug, and with a mutex held.

2) Handling single-stepping from tools

Tools like debuggers and simulators use single-stepping to run through
existing programs. If core libraries start to use restartable sequences
for e.g. memory allocation, this means pre-existing programs cannot be
single-stepped, simply because the underlying glibc or jemalloc has
changed.

The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they
want.  However, this requires upgrading tools, and still breaks
single-stepping in cases where glibc or jemalloc is updated but not the
tooling.

Having a performance-related library improvement break tooling is likely
to cause a big push-back against wide adoption of rseq.

3) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to external
conditions is pretty bad. Developers are used to thinking of fast-path and
slow-path (e.g. for locking), where the contended vs uncontended cases
have different performance characteristics, but each needs to provide
some level of progress guarantee.

There are concerns about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for
trouble when real life happens: page faults, uprobes, and other
unforeseen conditions that can, however seldom, cause a rseq fast-path
to never progress.

4) Handling page faults

It's pretty easy to come up with corner-case scenarios where rseq does
not progress without the help from cpu_opv. For instance, a system with
swap enabled which is under high memory pressure could trigger page
faults at pretty much every rseq attempt. Although this scenario
is extremely unlikely, rseq becomes the weak link of the chain.

5) Comparison with LL/SC

The layman versed in the load-link/store-conditional instructions in
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.

First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern
in all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had
to 
