Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, 2 May 2018, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote:
> > On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra wrote:
> > > On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > > > Suppose we make a userspace mutex implemented with a lock word having three
> > > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> > > > being relevant at the moment.
> > >
> > > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> > > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> > > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> > > owner TID.
> >
> > That doesn't work if you want to use the rest of the word for something
> > else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
> > a lock with two bits.
>
> Recursive locks are the most horrible crap ever. And having the tid in
> the word allows things like kernel based optimistic spins and possibly
> PI related things.

FWIW, robust futexes also have the TID requirement.

Thanks,

	tglx
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
Hey - I think the ideas Daniel brings up here are interesting -- specifically the notion that a thread could set a "pre-sleep wish" to signal that it's sleeping. As this conversation shows, I think there's a fair bit of depth to that. For example, FUTEX_LOCK is an alternative approach. Another idea might be using the "currently running cpu" area of rseq to tell whether the process in question was sleeping (assuming that the kernel were modified to set this to -1 every time a process was unscheduled).

The idea discussed here seems orthogonal to the core thesis of rseq. I'm wondering if we can focus on getting rseq in, maybe with an eye toward making sure this use case could be handled long term. My sense is that this is possible. We could use the flags setting in the per-thread rseq area, or maybe extend the meaning of the structure rseq_cs points to, to signal that there is information about how to signal the sleeping of the current process. It seems to me this would be a natural way to add the functionality Daniel talks about if desired in the future.

Daniel - do you think there's anything we should do in the patch set today that would make it easier to implement your idea in the future without expanding the scope of the patch today? I.e., is there anything else we need to do to lay the framework for your idea?

I'd really love to see us get this patch in. There's a whole host of primitives this unlocks (more efficient RCU, better malloc implementations, fast reader-writer locks). I'm sure we'll have more ideas about APIs to provide once we've explored these use cases, but to me this patch is the MVP we need to ship to get that feedback. It's a solid framework that we can build upon, e.g. with the "opv" syscall or the idea in this thread if user feedback shows those things are necessary.

-b
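[Editor's note: a minimal sketch of the "currently running cpu" idea above. It is purely hypothetical: today's kernel does not write -1 into the rseq cpu_id field when a thread is descheduled, and `struct owner_state` is an invented stand-in for the owner's published rseq area, not the real `struct rseq`.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Invented stand-in for the lock owner's rseq area. The real struct rseq
 * has a cpu_id field, but today's kernel does NOT store -1 there when the
 * thread is descheduled -- that is the hypothetical extension above. */
struct owner_state {
    _Atomic int cpu_id;  /* >= 0: on that CPU; -1: descheduled (hypothetical) */
};

/* A waiter would spin only while the owner is actually on a CPU. */
static bool owner_is_running(const struct owner_state *o)
{
    return atomic_load_explicit(&o->cpu_id, memory_order_acquire) >= 0;
}
```

A lock slow path could then spin only while owner_is_running() returns true and fall back to FUTEX_WAIT as soon as it returns false, replacing guessed spin counts with an exact signal.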
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
- On May 3, 2018, at 12:22 PM, Daniel Colascione dan...@google.com wrote: > On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers < > mathieu.desnoy...@efficios.com> wrote: >> By the way, if we eventually find a way to enhance user-space mutexes in > the >> fashion you describe here, it would belong to another TLS area, and would >> be registered by another system call than rseq. I proposed a more generic >> "TLS area registration" system call a few years ago, but Linus told me he >> wanted a system call that was specific to rseq. If we need to implement >> other use-cases in a TLS area shared between kernel and user-space in a >> similar fashion, the plan is to do it in a distinct system call. > > If we proliferate TLS areas; we'd have to register each one upon thread > creation, adding to the overall thread creation path. There's already a > provision for versioning the TLS area. What's the benefit of splitting the > registration over multiple system calls? See the original discussion thread at https://lkml.org/lkml/2016/4/7/502 Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
Hi Daniel, Nice to have this healthy discussion about pros/cons. Adding Waiman to the discussion as well. Curious to hear what Waiman and Peter think about all this. Some more comments inline.

On Thu, May 3, 2018 at 10:19 AM Daniel Colascione wrote:
> On Thu, May 3, 2018 at 9:48 AM Joel Fernandes wrote:
> > > > can skip the manual schedule we were going to perform.
> > > By the way, if we eventually find a way to enhance user-space mutexes in the
> > > fashion you describe here, it would belong to another TLS area, and would
> > > be registered by another system call than rseq. I proposed a more generic
> > Right. Also I still don't see any good reason why optimistic spinning in
> > the kernel with FUTEX_LOCK, as Peter described, can't be used instead of
> > using the rseq implementation and spinning in userspace, for such a case. I
> > don't really fully buy that we need to design this interface assuming any
> > privilege transition level time. If privilege level transitions are slow,
> > we're going to have bad performance anyway.
>
> That's not the case. There's a large class of programs that does useful work
> while seldom entering the kernel: just ask the user-space network stack
> people.

Yes, I am aware of that. I was just saying that, in general, a system such as an Android embedded system (as opposed to an HPC system) does make a lot of system calls. I am not arguing that doing more things in userspace is good or bad here. I am just asking why we should do something else for no good reason (see below) when work has already been done in this area.

> It's not wise to design interfaces around system calls being cheap. Even if
> system calls are currently cheap enough on some architectures some of the
> time, there's no guarantee that they'll stay that way, especially relative
> to straight-line user-mode execution. A pure user-space approach, on the
> other hand, involves no work in the kernel, and doing nothing is always the
> optimal strategy.
> Besides, there are environments where system calls end up
> being more expensive than you might think: consider strace or rr. If the
> kernel needs to get involved on some path, it's best that its involvement
> be as light as possible.

Of course, but I think we shouldn't do a premature optimization here without real data, on typical Android devices, about the cost of system call entry/exit vs. spin time. I am not against a userspace lock based on rseq if there is data and good reason, before investing significant time on reinventing the wheel.

> > we should really stick to using FUTEX_LOCK and
> > reuse all the work that went into that area for Android and otherwise (and
> > work with Waiman and others on improving that if there are any problems
> > with it).
>
> FUTEX_LOCK is a return to the bad old days when systems gave you a fixed
> list of synchronization primitives and if you wanted something else, tough.

I am not saying we should fix the set of sync primitives made available to userspace, or anything like that. I am talking about your/our use case and whether another sync primitive interface is needed. For example, having another syscall to register a TLS area is a new interface, vs. using the existing futex interface. Linus is also against adding new syscalls unnecessarily.

> That the latest version of the FUTEX_LOCK patch includes a separate
> FUTEX_LOCK_SHARED mode is concerning.

Why? That's just for reader-locks. What's the concern there? I know you had something in mind about efficient userspace rw locks, but I am curious either way what you have in mind.

> The functionality the kernel provides
> to userspace should be more general-purpose and allow more experimentation
> without changes in the kernel. I see no reason to force userspace into 1)
> reserving 30 bits of its lockword for a TID and 2) adopting the kernel's

Based on our offline chat, this is only for 32-bit systems though, right?
Also, based on Peter's idea of putting the recursion counter outside, there shouldn't be a space issue?

> idea of spin time heuristics and lock stealing when the same basic
> functionality can be provided in a generic way while reserving only one
> bit. That this mechanism happens to be more efficient as well is a bonus.

And also probably easy to get wrong. Heuristics are hard, and it would be good to work with the community on getting the best approach for that and improving existing code. Also, about "generic way": that's even more reason, in my view, to do it in the kernel.

> "Mechanism not policy" is still a good design principle.

Again, I am not advocating forcing of interfaces or anything, but I'm against reinventing the wheel and am all for spending time on improving existing things.

thanks!

- Joel
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Thu, May 3, 2018 at 9:48 AM Joel Fernandes wrote:
> > > can skip the manual schedule we were going to perform.
> > By the way, if we eventually find a way to enhance user-space mutexes in the
> > fashion you describe here, it would belong to another TLS area, and would
> > be registered by another system call than rseq. I proposed a more generic
>
> Right. Also I still don't see any good reason why optimistic spinning in
> the kernel with FUTEX_LOCK, as Peter described, can't be used instead of
> using the rseq implementation and spinning in userspace, for such a case. I
> don't really fully buy that we need to design this interface assuming any
> privilege transition level time. If privilege level transitions are slow,
> we're going to have bad performance anyway.

That's not the case. There's a large class of programs that does useful work while seldom entering the kernel: just ask the user-space network stack people.

It's not wise to design interfaces around system calls being cheap. Even if system calls are currently cheap enough on some architectures some of the time, there's no guarantee that they'll stay that way, especially relative to straight-line user-mode execution. A pure user-space approach, on the other hand, involves no work in the kernel, and doing nothing is always the optimal strategy.

Besides, there are environments where system calls end up being more expensive than you might think: consider strace or rr. If the kernel needs to get involved on some path, it's best that its involvement be as light as possible.

> we should really stick to using FUTEX_LOCK and
> reuse all the work that went into that area for Android and otherwise (and
> work with Waiman and others on improving that if there are any problems
> with it).

FUTEX_LOCK is a return to the bad old days when systems gave you a fixed list of synchronization primitives and if you wanted something else, tough.
That the latest version of the FUTEX_LOCK patch includes a separate FUTEX_LOCK_SHARED mode is concerning. The functionality the kernel provides to userspace should be more general-purpose and allow more experimentation without changes in the kernel. I see no reason to force userspace into 1) reserving 30 bits of its lockword for a TID and 2) adopting the kernel's idea of spin time heuristics and lock stealing when the same basic functionality can be provided in a generic way while reserving only one bit. That this mechanism happens to be more efficient as well is a bonus. "Mechanism not policy" is still a good design principle.
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers < mathieu.desnoy...@efficios.com> wrote: > - On May 2, 2018, at 12:07 PM, Daniel Colascione dan...@google.com wrote: > > On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers < > > mathieu.desnoy...@efficios.com> wrote: > > > >> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com > > wrote: > >> [...] > >> > > >> > I think a small enhancement to rseq would let us build a perfect > > userspace > >> > mutex, one that spins on lock-acquire only when the lock owner is > > running > >> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc > >> > spin counts and from trying to detect situations in which spinning is > >> > generally pointless. > >> > > >> > It'd work like this: in the per-thread rseq data structure, we'd > > include a > >> > description of a futex operation for the kernel would perform (in the > >> > context of the preempted thread) upon preemption, immediately before > >> > schedule(). If the futex operation itself sleeps, that's no problem: we > >> > will have still accomplished our goal of running some other thread > > instead > >> > of the preempted thread. > > > >> Hi Daniel, > > > >> I agree that the problem you are aiming to solve is important. Let's see > >> what prevents the proposed rseq implementation from doing what you > > envision. > > > >> The main issue here is touching userspace immediately before schedule(). > >> At that specific point, it's not possible to take a page fault. In the > > proposed > >> rseq implementation, we get away with it by raising a task struct flag, > > and using > >> it in a return to userspace notifier (where we can actually take a > > fault), where > >> we touch the userspace TLS area. > > > >> If we can find a way to solve this limitation, then the rest of your > > design > >> makes sense to me. > > > > Thanks for taking a look! > > > > Why couldn't we take a page fault just before schedule? 
The reason we can't > > take a page fault in atomic context is that doing so might call schedule. > > Here, we're about to call schedule _anyway_, so what harm does it do to > > call something that might call schedule? If we schedule via that call, we > > can skip the manual schedule we were going to perform. > By the way, if we eventually find a way to enhance user-space mutexes in the > fashion you describe here, it would belong to another TLS area, and would > be registered by another system call than rseq. I proposed a more generic Right. Also I still don't see any good reason why optimistic spinning in the kernel with FUTEX_LOCK, as Peter described, can't be used instead of using the rseq implementation and spinning in userspace, for such a case. I don't really fully buy that we need to design this interface assuming any privilege transition level time. If privilege level transitions are slow, we're going to have bad performance anyway. Unless there's some data to show that we have to optimistically spin in userspace than the kernel because its better to do so, we should really stick to using FUTEX_LOCK and reuse all the work that went into that area for Android and otherwise (and work with Waiman and others on improving that if there are any problems with it). I am excited though about the other synchronization design other than lock implementation that rseq can help in. thanks! - Joel
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Thu, May 3, 2018 at 9:12 AM Mathieu Desnoyers < mathieu.desnoy...@efficios.com> wrote:
> By the way, if we eventually find a way to enhance user-space mutexes in the
> fashion you describe here, it would belong to another TLS area, and would
> be registered by another system call than rseq. I proposed a more generic
> "TLS area registration" system call a few years ago, but Linus told me he
> wanted a system call that was specific to rseq. If we need to implement
> other use-cases in a TLS area shared between kernel and user-space in a
> similar fashion, the plan is to do it in a distinct system call.

If we proliferate TLS areas, we'd have to register each one upon thread creation, adding to the overall thread-creation path. There's already a provision for versioning the TLS area. What's the benefit of splitting the registration over multiple system calls?
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
- On May 2, 2018, at 12:07 PM, Daniel Colascione dan...@google.com wrote: > On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers < > mathieu.desnoy...@efficios.com> wrote: > >> - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com > wrote: >> [...] >> > >> > I think a small enhancement to rseq would let us build a perfect > userspace >> > mutex, one that spins on lock-acquire only when the lock owner is > running >> > and that sleeps otherwise, freeing userspace from both specifying ad-hoc >> > spin counts and from trying to detect situations in which spinning is >> > generally pointless. >> > >> > It'd work like this: in the per-thread rseq data structure, we'd > include a >> > description of a futex operation for the kernel would perform (in the >> > context of the preempted thread) upon preemption, immediately before >> > schedule(). If the futex operation itself sleeps, that's no problem: we >> > will have still accomplished our goal of running some other thread > instead >> > of the preempted thread. > >> Hi Daniel, > >> I agree that the problem you are aiming to solve is important. Let's see >> what prevents the proposed rseq implementation from doing what you > envision. > >> The main issue here is touching userspace immediately before schedule(). >> At that specific point, it's not possible to take a page fault. In the > proposed >> rseq implementation, we get away with it by raising a task struct flag, > and using >> it in a return to userspace notifier (where we can actually take a > fault), where >> we touch the userspace TLS area. > >> If we can find a way to solve this limitation, then the rest of your > design >> makes sense to me. > > Thanks for taking a look! > > Why couldn't we take a page fault just before schedule? The reason we can't > take a page fault in atomic context is that doing so might call schedule. > Here, we're about to call schedule _anyway_, so what harm does it do to > call something that might call schedule? 
If we schedule via that call, we > can skip the manual schedule we were going to perform. By the way, if we eventually find a way to enhance user-space mutexes in the fashion you describe here, it would belong to another TLS area, and would be registered by another system call than rseq. I proposed a more generic "TLS area registration" system call a few years ago, but Linus told me he wanted a system call that was specific to rseq. If we need to implement other use-cases in a TLS area shared between kernel and user-space in a similar fashion, the plan is to do it in a distinct system call. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 02, 2018 at 08:37:13PM +, Daniel Colascione wrote:
> > Recursive locks are the most horrible crap ever. And having the tid in
>
> What happened to providing mechanism, not policy?
>
> You can't wish away recursive locking. It's baked into Java and the CLR,
> and it's enshrined in POSIX. It's not going away, and there's no reason not
> to support it efficiently.

You can implement recursive locks just fine with a TID based word, just keep the recursion counter external to the futex word. If owner == self, increment, etc.

> > the word allows things like kernel based optimistic spins and possibly
> > PI related things.
>
> Sure. A lot of people don't want PI though, or at least they want to opt
> into it. And we shouldn't require an entry into the kernel for what we can
> in principle do efficiently in userspace.

Any additional PI would certainly be opt-in, but the kernel based spinning might make sense unconditionally.
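[Editor's note: a sketch of the scheme Peter describes -- owner TID in the futex word, recursion count kept outside it. Names are invented for illustration; for brevity the unlock path wakes unconditionally, where a real implementation would track a waiters bit in the word.]

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

struct rec_mutex {
    atomic_uint owner;   /* futex word: 0 == free, otherwise holder's TID */
    unsigned depth;      /* recursion count, external to the futex word */
};

static unsigned self_tid(void)
{
    return (unsigned)syscall(SYS_gettid);
}

static void rec_lock(struct rec_mutex *m)
{
    unsigned tid = self_tid();
    if (atomic_load(&m->owner) == tid) {
        m->depth++;          /* owner == self: just bump the counter */
        return;
    }
    for (;;) {
        unsigned expected = 0;
        if (atomic_compare_exchange_strong(&m->owner, &expected, tid))
            break;
        /* Held by `expected`; sleep until the word changes. If it already
         * changed, FUTEX_WAIT returns EAGAIN and we simply retry the CAS. */
        syscall(SYS_futex, &m->owner, FUTEX_WAIT, expected, NULL, NULL, 0);
    }
    m->depth = 1;
}

static void rec_unlock(struct rec_mutex *m)
{
    if (--m->depth > 0)
        return;              /* still recursively held */
    atomic_store(&m->owner, 0);
    syscall(SYS_futex, &m->owner, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

Because the whole word is the TID, the kernel can also tell who holds the lock, which is what enables the optimistic-spin and PI possibilities mentioned above.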
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, 02 May 2018 20:37:13 + Daniel Colascione wrote: > On Wed, May 2, 2018 at 1:23 PM Peter Zijlstra wrote: > > > On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote: > > > On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra > wrote: > > > >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote: > > > > > Suppose we make a userspace mutex implemented with a lock word > having > > > three > > > > > bits: acquired, sleep_mode, and wait_pending, with the rest of the > word > > > not > > > > > being relevant at the moment. > > > > > > > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and > go > > > > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the > > > > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the > > > > owner TID. > > > > > > That doesn't work if you want to use the rest of the word for something > > > else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can > make > > > a lock with two bits. > > > Recursive locks are teh most horrible crap ever. And having the tid in > > What happened to providing mechanism, not policy? > > You can't wish away recursive locking. It's baked into Java and the CLR, > and it's enshrined in POSIX. It's not going away, and there's no reason not > to support it efficiently. > > > the word allows things like kernel based optimistic spins and possibly > > PI related things. > > Sure. A lot of people don't want PI though, or at least they want to opt > into it. And we shouldn't require an entry into the kernel for what we can > in principle do efficiently in userspace. > > > > > As brought up in the last time we talked about spin loops, why do we > > > > care if the spin loop is in userspace or not? Aside from the whole PTI > > > > thing, the syscall cost was around 150 cycle or so, while a LOCK > CMPXCHG > > > > is around 20 cycles. So ~7 spins gets you the cost of entry. What about exit? > > > > > > That's pre-KPTI, isn't it? 
> > > Yes, and once the hardware gets sorted, we'll be there again. I don't > > think we should design interfaces for 'broken' hardware. > > It would be a mistake to design interfaces under the assumption that > everyone has fast permission level transitions. Note, Robert Haas told me a few years ago at a plumbers conference that postgresql implements their own user space spin locks because anything that goes into the kernel has killed the performance. And they tried to use futex but that still didn't beat out plain userspace locks. -- Steve
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 2, 2018 at 1:23 PM Peter Zijlstra wrote: > On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote: > > On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra wrote: > > >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote: > > > > Suppose we make a userspace mutex implemented with a lock word having > > three > > > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word > > not > > > > being relevant at the moment. > > > > > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go > > > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the > > > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the > > > owner TID. > > > > That doesn't work if you want to use the rest of the word for something > > else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make > > a lock with two bits. > Recursive locks are teh most horrible crap ever. And having the tid in What happened to providing mechanism, not policy? You can't wish away recursive locking. It's baked into Java and the CLR, and it's enshrined in POSIX. It's not going away, and there's no reason not to support it efficiently. > the word allows things like kernel based optimistic spins and possibly > PI related things. Sure. A lot of people don't want PI though, or at least they want to opt into it. And we shouldn't require an entry into the kernel for what we can in principle do efficiently in userspace. > > > As brought up in the last time we talked about spin loops, why do we > > > care if the spin loop is in userspace or not? Aside from the whole PTI > > > thing, the syscall cost was around 150 cycle or so, while a LOCK CMPXCHG > > > is around 20 cycles. So ~7 spins gets you the cost of entry. > > > > That's pre-KPTI, isn't it? > Yes, and once the hardware gets sorted, we'll be there again. I don't > think we should design interfaces for 'broken' hardware. 
It would be a mistake to design interfaces under the assumption that everyone has fast permission level transitions.
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 02, 2018 at 06:27:22PM +, Daniel Colascione wrote:
> On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra wrote:
> > On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote:
> > > Suppose we make a userspace mutex implemented with a lock word having three
> > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
> > > being relevant at the moment.
> >
> > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go
> > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the
> > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the
> > owner TID.
>
> That doesn't work if you want to use the rest of the word for something
> else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make
> a lock with two bits.

Recursive locks are the most horrible crap ever. And having the tid in the word allows things like kernel based optimistic spins and possibly PI related things.

> > As brought up in the last time we talked about spin loops, why do we
> > care if the spin loop is in userspace or not? Aside from the whole PTI
> > thing, the syscall cost was around 150 cycles or so, while a LOCK CMPXCHG
> > is around 20 cycles. So ~7 spins gets you the cost of entry.
>
> That's pre-KPTI, isn't it?

Yes, and once the hardware gets sorted, we'll be there again. I don't think we should design interfaces for 'broken' hardware.
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 2, 2018 at 10:22 AM Peter Zijlstra wrote: >> On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote: > > Suppose we make a userspace mutex implemented with a lock word having three > > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not > > being relevant at the moment. > So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go > with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the > existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the > owner TID. That doesn't work if you want to use the rest of the word for something else, like a recursion count. With FUTEX_WAIT and FUTEX_WAKE, you can make a lock with two bits. > As brought up in the last time we talked about spin loops, why do we > care if the spin loop is in userspace or not? Aside from the whole PTI > thing, the syscall cost was around 150 cycle or so, while a LOCK CMPXCHG > is around 20 cycles. So ~7 spins gets you the cost of entry. That's pre-KPTI, isn't it?
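[Editor's note: Daniel's claim above -- that FUTEX_WAIT/FUTEX_WAKE let you build a mutex out of just two bits of the word -- can be sketched as below, in the spirit of Drepper's "Futexes Are Tricky" and not taken from any patch. Bit 0 is `acquired`, bit 1 is `wait_pending`. This simple version exchanges the whole word, so actually sharing the remaining 30 bits (for a recursion count, say) would need CAS loops that preserve them; the point here is only that two bits of protocol state suffice.]

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ACQUIRED     1u   /* bit 0 */
#define WAIT_PENDING 2u   /* bit 1 */

static long sys_futex(atomic_uint *uaddr, int op, unsigned val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock2(atomic_uint *w)
{
    /* Fast path: set ACQUIRED; if it was previously clear, we own it. */
    if (!(atomic_fetch_or(w, ACQUIRED) & ACQUIRED))
        return;
    /* Slow path: advertise a waiter, then sleep until woken. We may leave
     * a stale WAIT_PENDING behind, costing at most one spurious wake. */
    while (atomic_exchange(w, ACQUIRED | WAIT_PENDING) & ACQUIRED)
        sys_futex(w, FUTEX_WAIT, ACQUIRED | WAIT_PENDING);
}

static void unlock2(atomic_uint *w)
{
    if (atomic_exchange(w, 0) & WAIT_PENDING)
        sys_futex(w, FUTEX_WAKE, 1);
}
```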
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 02, 2018 at 03:53:47AM +, Daniel Colascione wrote: > Suppose we make a userspace mutex implemented with a lock word having three > bits: acquired, sleep_mode, and wait_pending, with the rest of the word not > being relevant at the moment. So ideally we'd kill FUTEX_WAIT/FUTEX_WAKE for mutexes entirely, and go with FUTEX_LOCK/FUTEX_UNLOCK that have the same semantics as the existing FUTEX_LOCK_PI/FUTEX_UNLOCK_PI, namely, the word contains the owner TID. As brought up in the last time we talked about spin loops, why do we care if the spin loop is in userspace or not? Aside from the whole PTI thing, the syscall cost was around 150 cycle or so, while a LOCK CMPXCHG is around 20 cycles. So ~7 spins gets you the cost of entry.
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 2, 2018 at 9:42 AM Steven Rostedt wrote: > On Wed, 02 May 2018 16:07:48 + > Daniel Colascione wrote: > > Why couldn't we take a page fault just before schedule? The reason we can't > > take a page fault in atomic context is that doing so might call schedule. > > Here, we're about to call schedule _anyway_, so what harm does it do to > > call something that might call schedule? If we schedule via that call, we > > can skip the manual schedule we were going to perform. > Another issue is slowing down something that is considered a fast path. There are two questions: 1) does this feature slow down schedule when you're not using it? and 2) is schedule unacceptably slow when you are using this feature? The answer to #1 is no; rseq already tests current->rseq during task switch (via rseq_set_notify_resume), so adding a single further branch (which we'd only test when we follow the current->rseq path anyway) isn't a problem. Regarding #2: yes, a futex operation will increase path length for that one task switch, but in the no-page-fault case, only by a tiny amount. We can run benchmarks of course, but I don't see any reason to suspect that the proposal would make task switching unacceptably slow. If we *do* take a page fault, we won't have done much additional work overall, since *somebody* is going to take that page fault anyway when the lock is released, and the latency of the task switch shouldn't increase, since the futex code will very quickly realize that it needs to sleep and call schedule anyway.
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, 02 May 2018 16:07:48 + Daniel Colascione wrote: > Why couldn't we take a page fault just before schedule? The reason we can't > take a page fault in atomic context is that doing so might call schedule. > Here, we're about to call schedule _anyway_, so what harm does it do to > call something that might call schedule? If we schedule via that call, we > can skip the manual schedule we were going to perform. Another issue is slowing down something that is considered a fast path. -- Steve
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 2, 2018 at 9:03 AM Mathieu Desnoyers < mathieu.desnoy...@efficios.com> wrote: > - On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com wrote: > [...] > > > > I think a small enhancement to rseq would let us build a perfect userspace > > mutex, one that spins on lock-acquire only when the lock owner is running > > and that sleeps otherwise, freeing userspace from both specifying ad-hoc > > spin counts and from trying to detect situations in which spinning is > > generally pointless. > > > > It'd work like this: in the per-thread rseq data structure, we'd include a > > description of a futex operation for the kernel would perform (in the > > context of the preempted thread) upon preemption, immediately before > > schedule(). If the futex operation itself sleeps, that's no problem: we > > will have still accomplished our goal of running some other thread instead > > of the preempted thread. > Hi Daniel, > I agree that the problem you are aiming to solve is important. Let's see > what prevents the proposed rseq implementation from doing what you envision. > The main issue here is touching userspace immediately before schedule(). > At that specific point, it's not possible to take a page fault. In the proposed > rseq implementation, we get away with it by raising a task struct flag, and using > it in a return to userspace notifier (where we can actually take a fault), where > we touch the userspace TLS area. > If we can find a way to solve this limitation, then the rest of your design > makes sense to me. Thanks for taking a look! Why couldn't we take a page fault just before schedule? The reason we can't take a page fault in atomic context is that doing so might call schedule. Here, we're about to call schedule _anyway_, so what harm does it do to call something that might call schedule? If we schedule via that call, we can skip the manual schedule we were going to perform.
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
- On May 1, 2018, at 11:53 PM, Daniel Colascione dan...@google.com wrote:
[...]
>
> I think a small enhancement to rseq would let us build a perfect userspace
> mutex, one that spins on lock-acquire only when the lock owner is running
> and that sleeps otherwise, freeing userspace from both specifying ad-hoc
> spin counts and from trying to detect situations in which spinning is
> generally pointless.
>
> It'd work like this: in the per-thread rseq data structure, we'd include a
> description of a futex operation for the kernel to perform (in the
> context of the preempted thread) upon preemption, immediately before
> schedule(). If the futex operation itself sleeps, that's no problem: we
> will have still accomplished our goal of running some other thread instead
> of the preempted thread.

Hi Daniel,

I agree that the problem you are aiming to solve is important. Let's see what prevents the proposed rseq implementation from doing what you envision.

The main issue here is touching userspace immediately before schedule(). At that specific point, it's not possible to take a page fault. In the proposed rseq implementation, we get away with it by raising a task struct flag, and using it in a return to userspace notifier (where we can actually take a fault), where we touch the userspace TLS area.

If we can find a way to solve this limitation, then the rest of your design makes sense to me.

Thanks!

Mathieu

--
Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
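[Editor's note: to make Daniel's proposal concrete, the per-thread descriptor might look something like the struct below. Everything here is invented for illustration -- no such fields exist in the proposed rseq ABI -- and it deliberately ignores the page-fault-before-schedule() problem Mathieu raises.]

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>

/* Invented descriptor: what a thread might publish so the kernel could
 * run one futex operation on its behalf just before schedule() when the
 * thread is preempted. Not part of any real or proposed ABI. */
struct rseq_preempt_futex_op {
    uint64_t uaddr;   /* user address of the futex word to operate on */
    uint32_t op;      /* e.g. FUTEX_WAKE */
    uint32_t val;     /* operand, e.g. how many waiters to wake */
};

/* Userspace would arm this while holding a lock, so that losing the CPU
 * automatically wakes a waiter instead of leaving it spinning. */
static void arm_preempt_op(struct rseq_preempt_futex_op *d,
                           uint32_t *word, uint32_t op, uint32_t val)
{
    d->uaddr = (uint64_t)(uintptr_t)word;
    d->op = op;
    d->val = val;
}
```

Disarming on unlock would just zero the descriptor; the interesting design questions (touching user memory at preempt time, what ops to allow) are exactly the ones debated in this thread.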
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
On Wed, May 02, 2018 at 03:53:47AM +0000, Daniel Colascione wrote:
> The usual approach to "better" is an "adaptive mutex". Such a thing, when
> it attempts to acquire a lock another thread owns, spins for some number of
> iterations, then falls back to futex. I guess that's a little better than
> immediately jumping to futex, but it's not optimal. We can still spin when
> the lock owner isn't scheduled, and the spin count is usually some guess
> (either specified manually or estimated statistically) that's not
> guaranteed to produce decent results. Even if we do pick a good spin count,
> we run a very good chance of under- or over-spinning on any given
> lock-acquire. We always want to sleep when spinning would be pointless.

Look for the FUTEX_LOCK patches from Waiman.
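For readers following along, the "adaptive mutex" pattern under discussion
(spin a bounded number of iterations, then fall back to FUTEX_WAIT) looks
roughly like the sketch below. It is illustrative only, and SPIN_LIMIT is
exactly the kind of ad-hoc guess the quoted text criticizes.

```c
/* Illustrative sketch of a spin-then-futex "adaptive" lock, as discussed
 * in the quoted text. SPIN_LIMIT is an arbitrary guess: there is no
 * principled value, since userspace can't tell whether the lock owner is
 * actually running. Linux-specific (raw futex syscall). */
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define SPIN_LIMIT 100 /* ad-hoc: will under- or over-spin in practice */

static void adaptive_lock(atomic_int *word) /* 0 = free, 1 = held */
{
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(word, &expected, 1))
                return; /* acquired during the spin phase */
        }
        /* Spin budget exhausted: sleep until the word stops being 1. */
        syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
    }
}

static void adaptive_unlock(atomic_int *word)
{
    atomic_store(word, 0);
    syscall(SYS_futex, word, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}
```

The FUTEX_LOCK approach Peter mentions moves this decision into the kernel,
where the owner's run state is actually known.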
Re: [RFC PATCH for 4.18 00/14] Restartable Sequences
Hi Mathieu: this work looks very cool. See inline.

On Mon, Apr 30, 2018 at 3:48 PM Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> Hi,
>
> Here is an updated RFC round of the Restartable Sequences patchset
> based on kernel 4.17-rc3. Based on feedback from Linus, I'm introducing
> only the rseq system call, keeping the rest for later.
>
> This already enables speeding up the Facebook jemalloc and arm64 PMC
> read from user-space use-cases, as well as speedup of use-cases relying
> on getting the current cpu number from user-space. We'll have to wait
> until a more complete solution is introduced before the LTTng-UST
> tracer can replace its ring buffer atomic instructions with rseq
> though. But let's proceed one step at a time.

I like the general theme of the kernel using its "superpowers" (in this case,
knowledge of preemption) to help userspace do a better job without userspace
code needing to enter the kernel to benefit. The per-CPU data structures this
patch enables help in a lot of use cases, but I think there's another use
case that you might not have considered, one that can benefit from an
extension to your proposed API.

Consider mutexes: in the kernel, for mutual exclusion, we can use a spinlock,
which in the kernel ends up being simpler and (in a lot of scenarios) more
efficient than a mutex: a core that takes a spinlock promises to keep the
lock held for only a very short time, and it disables interrupts to delay
asynchronous work that might unexpectedly lengthen the critical section. A
different core that wants to grab that spinlock can just spin on the lock
word, confident that its spin will be short because any core owning the lock
is guaranteed to release it very quickly. (Long spins would be very bad for
power.) The overall result is a lock that's much lighter than a mutex. (A
spinlock can also be used in places we can't sleep, but this ability isn't
relevant to the discussion below.)
Userspace doesn't have a good equivalent to a lightweight spinlock. While you
can build a spinlock in userspace, the result ends up having serious problems
because of preemption: first, a thread owning such a lock can be preempted in
its critical section, lengthening the critical section arbitrarily. Second, a
thread spinning on a lock will keep spinning even when the owning thread
isn't scheduled anywhere.

Userspace can just implement a mutex as a try-acquire and a FUTEX_WAIT on
failure. This approach works fine when there's no contention, but a system
call is a pretty heavy operation. Why pay for a system call for occasional
light contention on a short critical section? Can we do better?

The usual approach to "better" is an "adaptive mutex". Such a thing, when it
attempts to acquire a lock another thread owns, spins for some number of
iterations, then falls back to futex. I guess that's a little better than
immediately jumping to futex, but it's not optimal. We can still spin when
the lock owner isn't scheduled, and the spin count is usually some guess
(either specified manually or estimated statistically) that's not guaranteed
to produce decent results. Even if we do pick a good spin count, we run a
very good chance of under- or over-spinning on any given lock-acquire. We
always want to sleep when spinning would be pointless.

One important case of the spin-while-not-scheduled problem is operation on a
uniprocessor system: on such a system, only a single task can be scheduled at
a time, making all spins maximally pointless. The usual approach to avoiding
wasted spins for adaptive mutexes is for the adaptive mutex library to ask
upon initialization "How many cores are in this system?", and if the answer
comes back as "1", disable spinning. This approach is inadequate: CPU
affinity can change at arbitrary times, and CPU affinity can produce the
same spin pessimization that a uniprocessor system does.
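The plain try-acquire-plus-FUTEX_WAIT mutex described above can be written in
a handful of lines. The sketch below follows the well-known three-state
design from Ulrich Drepper's "Futexes Are Tricky" (0 = free, 1 = held,
2 = held with possible waiters); it illustrates the baseline being discussed
and is not a production lock.

```c
/* Sketch of the baseline futex mutex described above: try-acquire with a
 * CAS, FUTEX_WAIT on failure. Three states, after Ulrich Drepper's
 * "Futexes Are Tricky": 0 = free, 1 = held, 2 = held + possible waiters.
 * Illustrative only; Linux-specific (raw futex syscall). */
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

typedef struct { atomic_int state; } futex_mutex;

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void fm_lock(futex_mutex *m)
{
    int c = 0;
    /* Fast path: uncontended acquire is a single CAS, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void fm_unlock(futex_mutex *m)
{
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        /* Possible waiters: fully release, then wake one of them. */
        atomic_store(&m->state, 0);
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
    }
}
```

Note the fast path: with no contention, lock and unlock are one atomic
operation each and never enter the kernel, which is exactly why this approach
"works fine when there's no contention".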
I think a small enhancement to rseq would let us build a perfect userspace
mutex, one that spins on lock-acquire only when the lock owner is running and
that sleeps otherwise, freeing userspace from both specifying ad-hoc spin
counts and from trying to detect situations in which spinning is generally
pointless.

It'd work like this: in the per-thread rseq data structure, we'd include a
description of a futex operation that the kernel would perform (in the
context of the preempted thread) upon preemption, immediately before
schedule(). If the futex operation itself sleeps, that's no problem: we will
have still accomplished our goal of running some other thread instead of the
preempted thread.

Suppose we make a userspace mutex implemented with a lock word having three
bits: acquired, sleep_mode, and wait_pending, with the rest of the word not
being relevant at the moment. We'd implement lock-acquire the usual way,
CASing the acquired bit into the lock and deeming the lock taken if we're
successful. Except that before attempting the CAS, we'd configure the current
thread's rseq area to bitwise-o