Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 16, 2018, at 4:58 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote:

> - On Apr 16, 2018, at 3:26 PM, Linus Torvalds torva...@linux-foundation.org wrote:
>
>> On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers wrote:
>>>
>>> And I try very hard to avoid being told I'm the one breaking
>>> user-space. ;-)
>>
>> You *can't* be breaking user space. User space doesn't use this yet.
>>
>> That's actually why I'd like to start with the minimal set - to make
>> sure we don't introduce features that will come back to bite us later.
>>
>> The one compelling use case I saw was a memory allocator that used
>> this for getting per-CPU (vs per-thread) memory scaling.
>>
>> That code didn't need the cpu_opv system call at all.
>>
>> And if somebody does an LD_PRELOAD of a malloc library, and then wants to
>> analyze the behavior of a program, maybe they should LD_PRELOAD their own
>> malloc routines first? That's pretty much par for the course for those
>> kinds of projects.
>>
>> So I'd much rather we first merge the non-contentious parts that
>> actually have some numbers for "this improves performance and makes a
>> nice fancy malloc possible".
>>
>> As it is, the cpu_opv seems to be all about theory, not about actual need.
>
> I fully get your point about getting the minimal feature in. So let's focus
> on rseq only.
>
> I will rework the patchset so the rseq selftests don't depend on cpu_opv,
> and remove the cpu_opv stuff. I think it would be a good start for the
> Facebook guys (jemalloc), given that just rseq seems to be enough for them
> for now. It should be enough for the arm64 performance counters as well.
>
> Then we'll figure out what is needed to make other projects use it based on
> their needs (e.g. lttng-ust, liburcu, glibc malloc), and whether jemalloc
> ends up requiring cpu_opv for memory migration between per-cpu pools after all.
So, having done this, I find myself in need of advice regarding smoothly
transitioning existing user-space programs/libraries to rseq. Let's consider a
situation where only rseq (without cpu_opv) eventually gets merged into 4.18.

The proposed rseq implementation presents the following constraints:

- Only a single rseq TLS can be registered per thread, therefore rseq needs
  to be "owned" by a single library (let's say it's librseq.so),
- User-space rseq critical sections need to be inlined into applications and
  libraries for performance reasons (extra branches and calls significantly
  degrade performance of those fast-paths).

I have a ring buffer "space reservation" use-case in my user-space tracer
which requires both rseq and cpu_opv. My original plan to transition this
fast-path to rseq was to test the @cpu_id field value from the rseq TLS and
use a fallback based on atomic instructions if it is negative. rseq is
already designed to ensure we can compare @cpu_id against @cpu_id_start and
detect both migration (cpu id differs) and rseq ENOSYS with a single branch
in the fast path.

Once rseq gets merged and deployed into kernels, librseq.so will actually
populate the rseq TLS, and this @cpu_id field will be >= 0. If kernels are
released with rseq but without cpu_opv, then I cannot use this @cpu_id field
to detect whether *both* rseq and cpu_opv are available.

I see a few possible ways to handle this, none of which are particularly
great:

1) Duplicate the entire implementation of the user-space functions where the
   rseq critical sections are inlined, dynamically detect whether cpu_opv is
   available, and select the right function at runtime. If those functions
   are relatively small this could be acceptable,

2) Code patching based on asm goto. There is no user-space library for this
   at the moment AFAIK, and patching user-space code triggers COW, which is
   bad for TLB and cache locality,

3) Add an extra branch in the rseq fast-path. I would like to avoid this,
   especially on arm32, where the cost of an extra branch is significant
   enough to outweigh the benefit of rseq compared to ll/sc.

So far, only option (1) seems relatively acceptable from my perspective, but
that's only because my functions using rseq are relatively small. If this
code bloat is not seen as acceptable, then we should revisit merging both
rseq and cpu_opv at the same time, and make sure CONFIG_RSEQ selects
CONFIG_CPU_OPV.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
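[Editor's note: the single-branch @cpu_id/@cpu_id_start detection described above can be sketched roughly as follows. This is a minimal user-space illustration with a mocked rseq TLS struct; the real fast path is inline assembly against the kernel-populated struct rseq, and the field layout here is only the relevant subset.]

```c
#include <stdatomic.h>
#include <stdint.h>

/* Mocked subset of the rseq TLS ABI relevant to this discussion. */
struct rseq_abi {
	uint32_t cpu_id_start;	/* always a valid cpu index, even unregistered */
	int32_t cpu_id;		/* negative until the kernel populates it */
};

static _Thread_local struct rseq_abi rseq_tls = {
	.cpu_id_start = 0,
	.cpu_id = -1,		/* -1: rseq unavailable (e.g. ENOSYS) */
};

static atomic_long fallback_counter;
static long percpu_counter[64];

/*
 * Increment a per-cpu counter. A single comparison of @cpu_id against
 * @cpu_id_start covers both the "rseq unregistered" and the "migrated
 * since cpu_id_start was read" cases, because a negative cpu_id can
 * never compare equal to a valid cpu_id_start.
 */
static void counter_inc(void)
{
	int32_t cpu = rseq_tls.cpu_id;

	if ((uint32_t)cpu != rseq_tls.cpu_id_start) {
		/* Slow path: fall back to atomic instructions. */
		atomic_fetch_add(&fallback_counter, 1);
		return;
	}
	/* Fast path: would be an inlined rseq critical section. */
	percpu_counter[cpu]++;
}
```

The point of the layout is that no second branch is needed to distinguish "no rseq" from "migrated": both funnel into the same slow path.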
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 16, 2018, at 3:26 PM, Linus Torvalds torva...@linux-foundation.org wrote:

> On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers wrote:
>>
>> And I try very hard to avoid being told I'm the one breaking
>> user-space. ;-)
>
> You *can't* be breaking user space. User space doesn't use this yet.
>
> That's actually why I'd like to start with the minimal set - to make
> sure we don't introduce features that will come back to bite us later.
>
> The one compelling use case I saw was a memory allocator that used
> this for getting per-CPU (vs per-thread) memory scaling.
>
> That code didn't need the cpu_opv system call at all.
>
> And if somebody does an LD_PRELOAD of a malloc library, and then wants to
> analyze the behavior of a program, maybe they should LD_PRELOAD their own
> malloc routines first? That's pretty much par for the course for those
> kinds of projects.
>
> So I'd much rather we first merge the non-contentious parts that
> actually have some numbers for "this improves performance and makes a
> nice fancy malloc possible".
>
> As it is, the cpu_opv seems to be all about theory, not about actual need.

I fully get your point about getting the minimal feature in. So let's focus
on rseq only.

I will rework the patchset so the rseq selftests don't depend on cpu_opv,
and remove the cpu_opv stuff. I think it would be a good start for the
Facebook guys (jemalloc), given that just rseq seems to be enough for them
for now. It should be enough for the arm64 performance counters as well.

Then we'll figure out what is needed to make other projects use it based on
their needs (e.g. lttng-ust, liburcu, glibc malloc), and whether jemalloc
ends up requiring cpu_opv for memory migration between per-cpu pools after all.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
On Mon, Apr 16, 2018 at 12:21 PM, Mathieu Desnoyers wrote:
>
> And I try very hard to avoid being told I'm the one breaking
> user-space. ;-)

You *can't* be breaking user space. User space doesn't use this yet.

That's actually why I'd like to start with the minimal set - to make
sure we don't introduce features that will come back to bite us later.

The one compelling use case I saw was a memory allocator that used
this for getting per-CPU (vs per-thread) memory scaling.

That code didn't need the cpu_opv system call at all.

And if somebody does an LD_PRELOAD of a malloc library, and then wants to
analyze the behavior of a program, maybe they should LD_PRELOAD their own
malloc routines first? That's pretty much par for the course for those
kinds of projects.

So I'd much rather we first merge the non-contentious parts that
actually have some numbers for "this improves performance and makes a
nice fancy malloc possible".

As it is, the cpu_opv seems to be all about theory, not about actual need.

Linus
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 16, 2018, at 2:39 PM, Linus Torvalds torva...@linux-foundation.org wrote:

> On Mon, Apr 16, 2018 at 11:35 AM, Mathieu Desnoyers wrote:
>> Specifically for single-stepping, the __rseq_table section introduced
>> at user-level will allow newer debuggers and tools which do line and
>> instruction-level single-stepping to skip over rseq critical sections.
>> However, this breaks existing debuggers and tools.
>
> I really don't think single-stepping is a valid argument.
>
> Even if the cpu_opv() allows you to "single step", you're not actually
> single stepping the same thing that you're using. So you are literally
> debugging something else than the real code.
>
> At that point, you don't need "cpu_opv()", you need to just load
> /dev/urandom in a buffer, and single-step that. Ta-daa! No new kernel
> functionality needed.
>
> So if the main argument for cpu_opv is single-stepping, then just rip
> it out. It's not useful.

No, single-stepping is not the only use-case. Accessing remote cpu data is
another use-case fulfilled by cpu_opv, which I think is more compelling.

> Anybody who cares deeply about single-stepping shouldn't be using
> optimistic algorithms, and they shouldn't be doing multi-threaded
> stuff either. They won't be able to use things like transactional
> memory either.
>
> You can't single-step into the kernel to see what the kernel does
> either when you're debugging something.
>
> News at 11: "single stepping isn't always viable".

I don't mind if people cannot stop the program with a debugger and observe
the state of registers manually at each step through a rseq critical
section. I do mind breaking existing tools that rely on single-stepping
approaches to automatically analyze program behavior [1,2]. Introducing a
rseq critical section into a library (e.g. the glibc memory allocator) would
cause existing programs being analyzed with existing tools to hang.

And I try very hard to avoid being told I'm the one breaking user-space. ;-)

Thanks,

Mathieu

[1] http://rr-project.org/
[2] https://www.gnu.org/software/gdb/news/reversible.html

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
On Mon, Apr 16, 2018 at 11:35 AM, Mathieu Desnoyers wrote:
> Specifically for single-stepping, the __rseq_table section introduced
> at user-level will allow newer debuggers and tools which do line and
> instruction-level single-stepping to skip over rseq critical sections.
> However, this breaks existing debuggers and tools.

I really don't think single-stepping is a valid argument.

Even if the cpu_opv() allows you to "single step", you're not actually
single stepping the same thing that you're using. So you are literally
debugging something else than the real code.

At that point, you don't need "cpu_opv()", you need to just load
/dev/urandom in a buffer, and single-step that. Ta-daa! No new kernel
functionality needed.

So if the main argument for cpu_opv is single-stepping, then just rip
it out. It's not useful.

Anybody who cares deeply about single-stepping shouldn't be using
optimistic algorithms, and they shouldn't be doing multi-threaded
stuff either. They won't be able to use things like transactional
memory either.

You can't single-step into the kernel to see what the kernel does
either when you're debugging something.

News at 11: "single stepping isn't always viable".

Linus
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 14, 2018, at 6:44 PM, Andy Lutomirski l...@amacapital.net wrote:

> On Thu, Apr 12, 2018 at 12:43 PM, Linus Torvalds wrote:
>> On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers wrote:
>>> The cpu_opv system call executes a vector of operations on behalf of
>>> user-space on a specific CPU with preemption disabled. It is inspired
>>> by readv() and writev() system calls which take a "struct iovec"
>>> array as argument.
>>
>> Do we really want the page pinning?
>>
>> This whole cpu_opv thing is the most questionable part of the series,
>> and the page pinning is the most questionable part of cpu_opv for me.
>>
>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
>>
>> I think that would make Andy happier too.
>
> It only makes me happier if the userspace code involved is actually
> going to work when single-stepped, which might actually be the case
> (fingers crossed).

Specifically for single-stepping, the __rseq_table section introduced at
user-level will allow newer debuggers and tools which do line and
instruction-level single-stepping to skip over rseq critical sections.
However, this breaks existing debuggers and tools.

For a userspace tracer tool such as LTTng-UST, requiring an upgrade to newer
debugger versions would limit its adoption in the field. So if using rseq
breaks current debugger tools, lttng-ust won't use rseq until single-stepping
can be done in a non-breaking way, or will have to wait until most end-user
deployments (distributions used in the field) include debugger versions that
skip over the code identified by the __rseq_table section, which will take
many years.

> That being said, I'm not really convinced that
> cpu_opv() makes much difference here, since I'm not entirely convinced
> that user code will actually use it or that user code will actually be
> that well tested. C'est la vie.

For the use-case of cpu_opv invoked as a single-stepping fall-back, this
path will indeed not be executed often enough to be well-tested. I'm
considering the following approach to allow user-space to test cpu_opv more
thoroughly: we can introduce an environment variable, e.g.:

- RSEQ_DISABLE=1: Disable rseq thread registration,
- RSEQ_DISABLE=random: Randomly disable rseq thread registration (some
  threads use rseq, other threads end up using the cpu_opv fallback)

which would disable the rseq fast-path for all or some threads, and thus
allow thorough testing of cpu_opv used as a single-stepping fallback.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
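[Editor's note: the environment-variable gating proposed above could look roughly like this inside a librseq-style registration path. The RSEQ_DISABLE variable and the helper name are the hypothetical interface being discussed, not an existing API.]

```c
#include <stdlib.h>
#include <string.h>

/*
 * Decide, at thread start, whether this thread should register its rseq
 * TLS with the kernel. Returning 0 forces the thread onto the fallback
 * path (cpu_opv in the proposal), exercising code that would otherwise
 * almost never run in production.
 */
static int rseq_registration_enabled(void)
{
	const char *env = getenv("RSEQ_DISABLE");

	if (!env)
		return 1;		/* default: register rseq */
	if (!strcmp(env, "1"))
		return 0;		/* disable for all threads */
	if (!strcmp(env, "random"))
		return rand() % 2;	/* disable for roughly half the threads */
	return 1;			/* unrecognized value: ignore */
}
```

With "random", a mix of rseq-fast-path and fallback threads run side by side in the same process, so the fallback gets tested against concurrent rseq users rather than only in isolation.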
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
> Single-stepping is only a subset of the rseq limitations addressed
> by cpu_opv. Another major limitation is algorithms requiring data
> migration between per-cpu data structures safely against CPU hotplug,
> and without having to change the cpu affinity mask. This is the case

And how many people are going to implement such a complex separate path just
for CPU hotplug? And even if they implement it, how long before it bitrots?
Seems more like a checkbox item than a realistic approach.

> for memory allocators and userspace task schedulers which require
> cpu_opv for migration between per-cpu memory pools and scheduler
> runqueues.

Not sure about that. Is that common?

> About the vgettimeofday and general handling of vDSO by gdb, gdb's
> approach only takes care of line-by-line single-stepping by hiding
> Linux' vdso mapping so users cannot target source code lines within
> that shared object. However, it breaks instruction-level single-stepping.
> I reported this issue to you back in Nov. 2017:
> https://lkml.org/lkml/2017/11/20/803

It was known from day 1, but afaik never a problem.

-Andi
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 12, 2018, at 4:23 PM, Andi Kleen a...@firstfloor.org wrote:

>> Can we plan on merging just the plain rseq parts *without* this all
>> first, and then see the cpu_opv thing as a "maybe future expansion"
>> part.
>
> That would be the right way to go. I doubt anybody really needs cpu_opv.
> We already have other code (e.g. vgettimeofday) which cannot
> be single stepped, and so far it never was a problem.

Single-stepping is only a subset of the rseq limitations addressed by
cpu_opv. Another major limitation is algorithms requiring data migration
between per-cpu data structures safely against CPU hotplug, and without
having to change the cpu affinity mask. This is the case for memory
allocators and userspace task schedulers which require cpu_opv for migration
between per-cpu memory pools and scheduler runqueues.

About vgettimeofday and the general handling of the vDSO by gdb: gdb's
approach only takes care of line-by-line single-stepping by hiding Linux'
vdso mapping so users cannot target source code lines within that shared
object. However, it breaks instruction-level single-stepping. I reported
this issue to you back in Nov. 2017:
https://lkml.org/lkml/2017/11/20/803

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
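[Editor's note: to make the "operation vector" idea concrete, here is a user-space stand-in for the kind of batch cpu_opv executes. The struct fields, enum names, and interpreter are illustrative assumptions in the spirit of the RFC's readv()-style description, not the proposed kernel ABI.]

```c
#include <stdint.h>
#include <string.h>

/* Illustrative cpu_opv-style operation types: compare-and-abort, memcpy. */
enum opv_type { OPV_COMPARE_EQ, OPV_MEMCPY };

struct opv_op {
	enum opv_type op;
	void *dst;
	const void *src;
	uint32_t len;
};

/*
 * Stand-in for the kernel's interpreter: run all ops as one batch that,
 * in the real system call, would execute on a chosen CPU with preemption
 * disabled. Returns 0 on success, 1 if a compare failed (batch aborted).
 */
static int opv_run(struct opv_op *ops, int nr_ops)
{
	for (int i = 0; i < nr_ops; i++) {
		switch (ops[i].op) {
		case OPV_COMPARE_EQ:
			if (memcmp(ops[i].dst, ops[i].src, ops[i].len))
				return 1;	/* expected value changed: abort */
			break;
		case OPV_MEMCPY:
			memcpy(ops[i].dst, ops[i].src, ops[i].len);
			break;
		}
	}
	return 0;
}
```

Migrating an item between per-cpu pools, as described above, would then be a compare (verify the source pool head is unchanged) followed by the copies updating both pools, executed as a single preemption-free batch regardless of which CPU the caller is running on.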
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
On Thu, Apr 12, 2018 at 12:43 PM, Linus Torvalds wrote: > On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers > wrote: >> The cpu_opv system call executes a vector of operations on behalf of >> user-space on a specific CPU with preemption disabled. It is inspired >> by readv() and writev() system calls which take a "struct iovec" >> array as argument. > > Do we really want the page pinning? > > This whole cpu_opv thing is the most questionable part of the series, > and the page pinning is the most questionable part of cpu_opv for me. > > Can we plan on merging just the plain rseq parts *without* this all > first, and then see the cpu_opv thing as a "maybe future expansion" > part. > > I think that would make Andy happier too. > It only makes me happier if the userspace code involved is actually going to work when single-stepped, which might actually be the case (fingers crossed). That being said, I'm not really convinced that cpu_opv() makes much difference here, since I'm not entirely convinced that user code will actually use it or that user code will actually be that well tested. C'est la vie.
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 13, 2018, at 12:37 PM, Linus Torvalds torva...@linux-foundation.org wrote: > On Fri, Apr 13, 2018 at 5:16 AM, Mathieu Desnoyers > wrote: >> The vmalloc space needed by cpu_opv is bound by the number of pages >> a cpu_opv call can touch. > > No it's not. > > You can have a thousand different processes doing cpu_opv at the same time. > > A *single* cpu_opv may be limited to "only" a megabyte, but I'm not > seeing any global limit anywhere. > > In short, this looks like a guaranteed DoS approach to me. Right, so one simple approach to solve this is to limit the number of concurrent cpu_opv calls executing at any given time. Considering that cpu_opv is a slow path, we can limit the number of concurrent cpu_opv executions by protecting this with a global mutex, or a semaphore if we want the number of concurrent executions to be greater than 1. Another approach, if we want to be fancier, is to keep track of the amount of vmalloc address space currently used by all in-flight cpu_opv calls. Beyond a given threshold, further execution of additional cpu_opv instances would block, waiting to be woken up as vmalloc address space is freed when in-flight cpu_opv calls complete. What global vmalloc address-space budget should we aim for? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
On Fri, Apr 13, 2018 at 5:16 AM, Mathieu Desnoyers wrote: > The vmalloc space needed by cpu_opv is bound by the number of pages > a cpu_opv call can touch. No it's not. You can have a thousand different processes doing cpu_opv at the same time. A *single* cpu_opv may be limited to "only" a megabyte, but I'm not seeing any global limit anywhere. In short, this looks like a guaranteed DoS approach to me. Linus
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 12, 2018, at 4:07 PM, Linus Torvalds torva...@linux-foundation.org wrote: > On Thu, Apr 12, 2018 at 12:59 PM, Mathieu Desnoyers > wrote: >> >> What are your concerns about page pinning ? > > Pretty much everything. > > It's the most complex part by far, and the vmalloc space is a limited > resource on 32-bit architectures. The vmalloc space needed by cpu_opv is bound by the number of pages a cpu_opv call can touch. On architectures with a virtually aliased dcache, we also need to add a few extra pages worth of address space to account for SHMLBA alignment. So on ARM32, with SHMLBA = 4 pages, this means at most 1 MB of virtual address space temporarily needed for a cpu_opv system call in the very worst-case scenario: 16 ops * 2 uaddr * 8 pages per uaddr (if we're unlucky and find ourselves aligned across two SHMLBA boundaries) * 4096 bytes per page. If this amount of vmalloc space happens to be our limiting factor, we can change the maximum cpu_opv ops array size supported, e.g. bringing it from 16 down to 4. The largest number of operations I currently need in the cpu-opv library is 4. With 4 ops, the worst-case vmalloc space used by a cpu_opv system call becomes 256 kB. > >> Do you have an alternative approach in mind ? > > Do everything in user space. I wish we could disable preemption and cpu hotplug in user-space. Unfortunately, that does not seem to be a viable solution for many technical reasons, starting with page fault handling. > > And even if you absolutely want cpu_opv at all, why not do it in the > user space *mapping* without the aliasing into kernel space? That's because cpu_opv needs to execute the entire array of operations with preemption disabled, and we cannot take a page fault with preemption off. Page pinning and aliasing user-space pages in the kernel linear mapping ensure that we don't end up in trouble in page fault scenarios, such as having the pages we need to touch swapped out under our feet. > > The cpu_opv approach isn't even fast. 
It's *really* slow if it has to > do VM crap. > > The whole rseq thing was billed as "faster than atomics". I > *guarantee* that the cpu_opv's aren't faster than atomics. Yes, and here is the good news: cpu_opv speed does not even matter. rseq assembler instruction sequences are very fast, but cannot deal with infrequent corner cases. cpu_opv is slow, but is guaranteed to deal with the occasional corner-case situations. This is similar to pthread mutex/futex fast/slow paths: the common case is fast (rseq), and the speed of the infrequent case (cpu_opv) does not matter as long as it's used infrequently enough, which is the case here. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
> Can we plan on merging just the plain rseq parts *without* this all > first, and then see the cpu_opv thing as a "maybe future expansion" > part. That would be the right way to go. I doubt anybody really needs cpu_opv. We already have other code (e.g. vgettimeofday) which cannot be single stepped, and so far it never was a problem. -Andi
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
On Thu, Apr 12, 2018 at 12:59 PM, Mathieu Desnoyers wrote: > > What are your concerns about page pinning ? Pretty much everything. It's the most complex part by far, and the vmalloc space is a limited resource on 32-bit architectures. > Do you have an alternative approach in mind ? Do everything in user space. And even if you absolutely want cpu_opv at all, why not do it in the user space *mapping* without the aliasing into kernel space? The cpu_opv approach isn't even fast. It's *really* slow if it has to do VM crap. The whole rseq thing was billed as "faster than atomics". I *guarantee* that the cpu_opv's aren't faster than atomics. Linus
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
- On Apr 12, 2018, at 3:43 PM, Linus Torvalds torva...@linux-foundation.org wrote: > On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers > wrote: >> The cpu_opv system call executes a vector of operations on behalf of >> user-space on a specific CPU with preemption disabled. It is inspired >> by readv() and writev() system calls which take a "struct iovec" >> array as argument. > > Do we really want the page pinning? > > This whole cpu_opv thing is the most questionable part of the series, > and the page pinning is the most questionable part of cpu_opv for me. What are your concerns about page pinning ? Do you have an alternative approach in mind ? > Can we plan on merging just the plain rseq parts *without* this all > first, and then see the cpu_opv thing as a "maybe future expansion" > part. The main problem with the incremental approach is that it won't deal with remote CPU data accesses, and won't deal with cpu hotplug in non-racy ways. For *some* of the use-cases, the other issues solved by cpu_opv can be worked around in user-space, at the cost of making the userspace code a mess, and in many cases slower than if we can rely on cpu_opv for the fallback. All the rseq test-cases depend on cpu_opv as they stand now. Without cpu_opv to handle the corner-cases, things become much messier on the user-space side. Thanks, Mathieu > > I think that would make Andy happier too. > > Linus -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
On Thu, Apr 12, 2018 at 12:27 PM, Mathieu Desnoyers wrote: > The cpu_opv system call executes a vector of operations on behalf of > user-space on a specific CPU with preemption disabled. It is inspired > by readv() and writev() system calls which take a "struct iovec" > array as argument. Do we really want the page pinning? This whole cpu_opv thing is the most questionable part of the series, and the page pinning is the most questionable part of cpu_opv for me. Can we plan on merging just the plain rseq parts *without* this all first, and then see the cpu_opv thing as a "maybe future expansion" part. I think that would make Andy happier too. Linus
[RFC PATCH for 4.18 12/23] cpu_opv: Provide cpu_opv system call (v7)
The cpu_opv system call executes a vector of operations on behalf of user-space on a specific CPU with preemption disabled. It is inspired by the readv() and writev() system calls, which take a "struct iovec" array as argument. The operations available are: comparison, memcpy, add, or, and, xor, left shift, right shift, and memory barrier. The system call receives a CPU number from user-space as argument, which is the CPU on which those operations need to be performed. All pointers in the ops must have been set up to point to the per-CPU memory of the CPU on which the operations should be executed. The "comparison" operation can be used to check that the data used in the preparation step did not change between preparation of the system call inputs and operation execution within the preempt-off critical section. The reason why we require all pointer offsets to be calculated by user-space beforehand is that we need to use get_user_pages() to first pin all pages touched by each operation. This takes care of faulting-in the pages. Then, preemption is disabled, and the operations are performed atomically with respect to other thread execution on that CPU, without generating any page fault. An overall maximum of 4216 bytes is enforced on the sum of operation lengths within an operation vector, so user-space cannot generate an overly long preempt-off critical section (cache-cold critical section duration measured as 4.7µs on x86-64). Each operation is also limited to a length of 4096 bytes, meaning that an operation can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages for destination if addresses are not aligned on page boundaries). If the thread is not running on the requested CPU, it is migrated to it. 
Justification for cpu_opv Here are a few reasons justifying why the cpu_opv system call is needed in addition to rseq: 1) Allow algorithms to perform per-cpu data migration without relying on sched_setaffinity() The use-cases are migrating memory between per-cpu memory free-lists, or stealing tasks from other per-cpu work queues: each requires accesses to remote per-cpu data structures. rseq alone is not enough to cover those use-cases without additionally relying on sched_setaffinity, which is unfortunately not CPU-hotplug-safe. The cpu_opv system call receives a CPU number as argument, and migrates the current task to the right CPU to perform the operation sequence. If the requested CPU is offline, it performs the operations from the current CPU while preventing CPU hotplug, and with a mutex held. 2) Handling single-stepping from tools Tools like debuggers and simulators use single-stepping to run through existing programs. If core libraries start to use restartable sequences for e.g. memory allocation, this means pre-existing programs cannot be single-stepped, simply because the underlying glibc or jemalloc has changed. The rseq user-space does expose a __rseq_table section for the sake of debuggers, so they can skip over the rseq critical sections if they want. However, this requires upgrading tools, and still breaks single-stepping in cases where glibc or jemalloc is updated, but not the tooling. Having a performance-related library improvement break tooling is likely to cause a big push-back against wide adoption of rseq. 3) Forward-progress guarantee Having a piece of user-space code that stops progressing due to external conditions is pretty bad. Developers are used to thinking in terms of fast-path and slow-path (e.g. for locking), where the contended vs uncontended cases have different performance characteristics, but each needs to provide some level of progress guarantee. 
There are concerns about proposing just "rseq" without the associated slow-path (cpu_opv) that guarantees progress. It's just asking for trouble when real life happens: page faults, uprobes, and other unforeseen conditions that could, however seldom, cause an rseq fast-path to never progress. 4) Handling page faults It's pretty easy to come up with corner-case scenarios where rseq does not progress without the help of cpu_opv. For instance, a system with swap enabled which is under high memory pressure could trigger page faults at pretty much every rseq attempt. Although this scenario is extremely unlikely, rseq becomes the weak link of the chain. 5) Comparison with LL/SC The layman versed in the load-link/store-conditional instructions of RISC architectures will notice the similarity between rseq and LL/SC critical sections. The comparison can even be pushed further: since debuggers can handle those LL/SC critical sections, they should be able to handle rseq critical sections in the same way. First, the way gdb recognises LL/SC c.s. patterns is very fragile: it's limited to specific common patterns, and will miss the pattern in all other cases. But fear not, having the rseq c.s. expose a __rseq_table to debuggers removes that guessing part. The main difference between LL/SC and rseq is that debuggers had to