Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/23/2016 09:02 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
> > Locking was done mostly by lock stealing. This is where most of the
> > performance benefit comes from, not optimistic spinning.
>
> How does the lock latency distribution of all this look like and how
> fair is the whole thing?
>
> > The TO futexes are unfair as can be seen from the min/max thread times
> > listed above. It took the fastest thread 0.07s to complete all the
> > locking operations, whereas the slowest one needed 2.65s. However, the
> > situation reverses when I changed the critical section to a 1us sleep.
>
> In this case, 1us sleep is going to add another syscall and therefore
> scheduling, so what? Or did you just extend the critical section busy
> time?

The 1us sleep will cause the spinning to stop and make all the waiters
sleep. This is to simulate the extreme case where TO futexes may not have
the performance advantage.

> > In this case, there will be no optimistic spinning. The performance
> > results for 100k locking operations were listed below.
> >
> >                 wait-wake futex   PI futex   TO futex
> >                 ---------------   --------   --------
> > max time             0.06s          9.32s      4.76s
> > min time             5.59s          9.36s      5.62s
> > average time         3.25s          9.35s      5.41s

Yes, the wait-wake futex is the unfair one in this case.

> > In this case, the TO futexes are fairer but perform worse than the
> > wait-wake futexes. That is because the lock handoff mechanism limits
> > the amount of lock stealing in the TO futexes while the wait-wake
> > futexes have no such restriction. When I disabled lock handoff, the TO
> > futexes would then perform similar to the wait-wake futexes.
>
> So the benefit of these new fangled futexes is only there for extremely
> short critical sections and a gazillion of threads fighting for the same
> futex, right?

Not really. Lock stealing will help performance when a gazillion of
threads are fighting for the same futex. Optimistic spinning will help to
reduce the lock transfer latency because the waiter isn't sleeping, no
matter the number of threads.
One set of data that I haven't shown so far is that the performance delta
between wait-wake and TO futexes actually increases as the critical
section is lengthened. This is because for a short critical section, the
waiters of a wait-wake futex may not actually go to sleep because of the
latency introduced by the code that has to be run before they do a final
check to see if the futex value has changed before going to sleep. The
longer the critical section, the higher the chance that they actually
sleep, and hence the worse their performance relative to the TO futexes.
For example, with a critical section of 50 pause instructions instead of
5, the performance gain is about 5X instead of the roughly 1.6X seen with
the shorter critical section.

> I really wonder how the average programmer should pick the right
> flavour, not to talk about any useful decision for something like glibc
> to pick the proper one.

I would say that TO futexes will have better performance in most cases.
Of course, I still need to run some real-world benchmarks to quantify the
effect of the new futexes. I am hoping to get suggestions of what would
be a good set of benchmarks to run.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> > >                 wait-wake futex   PI futex   TO futex
> > >                 ---------------   --------   --------
> > > max time             3.49s         50.91s      2.65s
> > > min time             3.24s         50.84s      0.07s
> > > average time         3.41s         50.90s      1.84s
> > > sys time           7m22.4s         55.73s    2m32.9s
> >
> > That's really interesting. Do you have any explanation for this
> > massive system time difference?
>
> For the wait-wake futexes, the waiters will be kicked out with EAGAIN
> without sleeping as long as the futex value changes before they acquire
> the hash bucket spinlock. These waiters will attempt to acquire the lock
> again in user space. If that fails, they will call FUTEX_WAIT again.
>
> In the above test case, the critical section is pretty short and about
> 4/5 of the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a
> lot more user-space/kernel transitions than with the TO futexes. I think
> the difference in the number of FUTEX_WAIT calls vs. the FUTEX_LOCK_TO
> calls causes the difference between their sys times.

Ok.

> As for the PI futex, it is a strictly sleeping lock and so won't use
> that much sys time.

That's clear.

> > > Locking was done mostly by lock stealing. This is where most of the
> > > performance benefit comes from, not optimistic spinning.
> >
> > How does the lock latency distribution of all this look like and how
> > fair is the whole thing?
>
> The TO futexes are unfair as can be seen from the min/max thread times
> listed above. It took the fastest thread 0.07s to complete all the
> locking operations, whereas the slowest one needed 2.65s. However, the
> situation reverses when I changed the critical section to a 1us sleep.

In this case, 1us sleep is going to add another syscall and therefore
scheduling, so what? Or did you just extend the critical section busy
time?

> there will be no optimistic spinning. The performance results for 100k
> locking operations were listed below.
>                 wait-wake futex   PI futex   TO futex
>                 ---------------   --------   --------
> max time             0.06s          9.32s      4.76s
> min time             5.59s          9.36s      5.62s
> average time         3.25s          9.35s      5.41s
>
> In this case, the TO futexes are fairer but perform worse than the
> wait-wake futexes. That is because the lock handoff mechanism limits the
> amount of lock stealing in the TO futexes while the wait-wake futexes
> have no such restriction. When I disabled lock handoff, the TO futexes
> would then perform similar to the wait-wake futexes.

So the benefit of these new fangled futexes is only there for extremely
short critical sections and a gazillion of threads fighting for the same
futex, right?

I really wonder how the average programmer should pick the right flavour,
not to talk about any useful decision for something like glibc to pick
the proper one.

Thanks,

	tglx
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 05:41 PM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> > On Thu, 22 Sep 2016, Waiman Long wrote:
> > > BTW, my initial attempt for the new futex was to use the same
> > > workflow as the PI futexes, but use mutex which has optimistic
> > > spinning instead of rt_mutex.
> >
> > Btw, Thomas, do you still have any interest pursuing this for
> > rtmutexes from -rt into mainline? If so I can resend the patches from
> > a while ago.
>
> Certainly yes. My faint memory tells me that there was some potential
> issue due to boosting the owner only if it gets scheduled out, but I
> might be wrong.

It is tricky to add optimistic spinning to rtmutexes because of the need
to observe process priorities. It is certainly possible to make the top
waiter spin, but then I am not sure how much performance gain we would
get with just that.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
> > BTW, my initial attempt for the new futex was to use the same workflow
> > as the PI futexes, but use mutex which has optimistic spinning instead
> > of rt_mutex. That version can double the throughput compared with PI
> > futexes but still falls far short of what can be achieved with
> > wait-wake futexes. Looking at the performance figures from the patch:
> >
> >                 wait-wake futex   PI futex   TO futex
> >                 ---------------   --------   --------
> > max time             3.49s         50.91s      2.65s
> > min time             3.24s         50.84s      0.07s
> > average time         3.41s         50.90s      1.84s
> > sys time           7m22.4s         55.73s    2m32.9s
>
> That's really interesting. Do you have any explanation for this massive
> system time difference?

For the wait-wake futexes, the waiters will be kicked out with EAGAIN
without sleeping as long as the futex value changes before they acquire
the hash bucket spinlock. These waiters will attempt to acquire the lock
again in user space. If that fails, they will call FUTEX_WAIT again.

In the above test case, the critical section is pretty short and about
4/5 of the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a
lot more user-space/kernel transitions than with the TO futexes. I think
the difference in the number of FUTEX_WAIT calls vs. the FUTEX_LOCK_TO
calls causes the difference between their sys times.

As for the PI futex, it is a strictly sleeping lock and so won't use that
much sys time.

> > lock count       3,090,294      9,999,813    698,318
> > unlock count     3,268,896      9,999,814        134
> >
> > The problem with a PI-futex-like version is that almost all the
> > lock/unlock operations were done in the kernel, which added overhead
> > and latency. Now looking at the numbers for the TO futexes, less than
> > 1/10 of the lock operations were done in the kernel, and the number of
> > unlocks was insignificant. Locking was done mostly by lock stealing.
> > This is where most of the performance benefit comes from, not
> > optimistic spinning.
>
> How does the lock latency distribution of all this look like and how
> fair is the whole thing?
The TO futexes are unfair as can be seen from the min/max thread times
listed above. It took the fastest thread 0.07s to complete all the
locking operations, whereas the slowest one needed 2.65s. However, the
situation reverses when I changed the critical section to a 1us sleep. In
this case, there will be no optimistic spinning. The performance results
for 100k locking operations were listed below.

                wait-wake futex   PI futex   TO futex
                ---------------   --------   --------
max time             0.06s          9.32s      4.76s
min time             5.59s          9.36s      5.62s
average time         3.25s          9.35s      5.41s

In this case, the TO futexes are fairer but perform worse than the
wait-wake futexes. That is because the lock handoff mechanism limits the
amount of lock stealing in the TO futexes while the wait-wake futexes
have no such restriction. When I disabled lock handoff, the TO futexes
would then perform similar to the wait-wake futexes. This is also the
reason that a lock handoff mechanism is implemented to prevent lock
starvation, which is likely to happen without one.

> Where is that lock handoff mechanism?

In the futex_state object, there is a new field handoff_pid. It is set
when the threshold count in futex_spin_on_owner() becomes negative. When
this field is set, the unlocker will change the futex word to that value
instead of clearing it to 0, which would let others steal the lock. I
will separate out the lock handoff in a separate patch in the next
revision to highlight it.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
> > BTW, my initial attempt for the new futex was to use the same workflow
> > as the PI futexes, but use mutex which has optimistic spinning instead
> > of rt_mutex.
>
> Btw, Thomas, do you still have any interest pursuing this for rtmutexes
> from -rt into mainline? If so I can resend the patches from a while ago.

Certainly yes. My faint memory tells me that there was some potential
issue due to boosting the owner only if it gets scheduled out, but I
might be wrong.

Thanks,

	tglx
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Waiman Long wrote:
> BTW, my initial attempt for the new futex was to use the same workflow
> as the PI futexes, but use mutex which has optimistic spinning instead
> of rt_mutex.

Btw, Thomas, do you still have any interest pursuing this for rtmutexes
from -rt into mainline? If so I can resend the patches from a while ago.

Thanks,
Davidlohr
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 04:26 PM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
> > On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
> > > On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> > > > I'd leave out the TO part entirely (or only mention it in
> > > > changelogs). That is, I'd call the futex ops: FUTEX_LOCK and
> > > > FUTEX_UNLOCK.
> > >
> > > That brings me to a different question:
> > >
> > > How is user space going to support this, i.e. is this some extra
> > > magic for code which implements its own locking primitives or is
> > > there going to be a wide use via e.g. glibc.
> >
> > There are some applications that use futex(2) directly to implement
> > their synchronization primitives. For those applications, they will
> > need to modify their code to detect the presence of the new futexes.
> > They can then use the new futexes if available and use wait-wake
> > futexes if not.
>
> That's what I suspected. Did you talk to the other folks who complain
> about futex performance (database, JVM, etc.) and play their own games
> with user space spinlocks and whatever?

I am also part of the team that helps large application vendors tune
their application performance on our large SMP systems. Those application
vendors tend to use futexes directly instead of relying on glibc. We have
seen that spinlock contention in the futex code could sometimes be a
significant portion of the CPU cycles consumed, depending on the
workloads being run. We have been providing suggestions on the best
practice of how to use futexes, but there is only so much you can do by
tuning their locking code implementation. That is why I am also looking
for ways to improve the performance of the futex code in the kernel.

> > I am also planning to take a look at the pthread_mutex* APIs to see if
> > they can be modified to use the new futexes later on when the patch
> > becomes more mature.
>
> Please involve glibc people who are interested in the futex stuff early
> and discuss the concept before it's set in stone for your particular
> usecase.
Sure, I will start to do some prototyping and performance testing with
glibc and then engage those folks about that.

> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?

The main reason is that a FUTEX_WAIT waiter has no idea who the owner of
the futex is. We usually do spinning when the lock owner is running and
abort when it goes to sleep. We can't do that for FUTEX_WAIT.

> Fair enough. This wants to be spelled out in the changelog and explained
> a bit more detailed.

I will.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 04:08 PM, Waiman Long wrote:
> On 09/22/2016 11:11 AM, Davidlohr Bueso wrote:
> > On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> > > On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> > > > On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> > > > > Also what's the reason that we can't do probabilistic spinning
> > > > > for FUTEX_WAIT and have to add yet another specialized variant
> > > > > of futexes?
> > > >
> > > > Where would this leave the respective FUTEX_WAKE? A nop? Probably
> > > > have to differentiate the fact that the queue was empty, but there
> > > > was a spinning, instead of straightforward returning 0.
> > >
> > > Sorry, but I really can't parse this answer. Can you folks please
> > > communicate with proper and coherent explanations instead of
> > > throwing a few gnawed off bones in my direction?
> >
> > I actually think that FUTEX_WAIT is the better/nicer approach. But my
> > immediate question above was how to handle the FUTEX_WAKE
> > counter-part. If we want to maintain current FIFO ordering for
> > wakeups, now with WAIT spinners this will create lock stealing
> > scenarios (including if we even guard against starvation). Or we could
> > reduce the scope of spinners, due to the restrictions, similar to the
> > top-waiter only being able to spin for rtmutexes. This of course will
> > hurt the effectiveness of spinning in FUTEX_WAIT in the first place.
>
> Actually, there can be a lot of lock stealing going on with the
> wait-wake futexes. If the critical section is short enough, many of the
> lock waiters can be waiting in the hash bucket spinlock queue and not
> sleeping yet while the futex value changes. As a result, they will exit
> the futex syscall and go back to user space with EAGAIN, where one of
> them may get the lock. So we can't assume that they will get the lock in
> FIFO order anyway.
>
> BTW, my initial attempt for the new futex was to use the same workflow
> as the PI futexes, but use mutex which has optimistic spinning instead
> of rt_mutex. That version can double the throughput compared with PI
> futexes but still falls far short of what can be achieved with wait-wake
> futexes.
Looking at the performance figures from the patch:

                wait-wake futex   PI futex   TO futex
                ---------------   --------   --------
max time             3.49s         50.91s      2.65s
min time             3.24s         50.84s      0.07s
average time         3.41s         50.90s      1.84s
sys time           7m22.4s         55.73s    2m32.9s
lock count       3,090,294      9,999,813    698,318
unlock count     3,268,896      9,999,814        134

The problem with a PI-futex-like version is that almost all the
lock/unlock operations were done in the kernel, which added overhead and
latency. Now looking at the numbers for the TO futexes, less than 1/10 of
the lock operations were done in the kernel, and the number of unlocks
was insignificant. Locking was done mostly by lock stealing. This is
where most of the performance benefit comes from, not optimistic
spinning. This is also the reason that a lock handoff mechanism is
implemented to prevent lock starvation, which is likely to happen without
one.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Waiman Long wrote:
> BTW, my initial attempt for the new futex was to use the same workflow
> as the PI futexes, but use mutex which has optimistic spinning instead
> of rt_mutex. That version can double the throughput compared with PI
> futexes but still falls far short of what can be achieved with wait-wake
> futexes. Looking at the performance figures from the patch:
>
>                 wait-wake futex   PI futex   TO futex
>                 ---------------   --------   --------
> max time             3.49s         50.91s      2.65s
> min time             3.24s         50.84s      0.07s
> average time         3.41s         50.90s      1.84s
> sys time           7m22.4s         55.73s    2m32.9s

That's really interesting. Do you have any explanation for this massive
system time difference?

> lock count       3,090,294      9,999,813    698,318
> unlock count     3,268,896      9,999,814        134
>
> The problem with a PI-futex-like version is that almost all the
> lock/unlock operations were done in the kernel, which added overhead and
> latency. Now looking at the numbers for the TO futexes, less than 1/10
> of the lock operations were done in the kernel, and the number of
> unlocks was insignificant. Locking was done mostly by lock stealing.
> This is where most of the performance benefit comes from, not optimistic
> spinning.

How does the lock latency distribution of all this look like and how fair
is the whole thing?

> This is also the reason that a lock handoff mechanism is implemented to
> prevent lock starvation which is likely to happen without one.

Where is that lock handoff mechanism?

Thanks,

	tglx
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
> > On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> > > I'd leave out the TO part entirely (or only mention it in
> > > changelogs).
> > >
> > > That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
> >
> > That brings me to a different question:
> >
> > How is user space going to support this, i.e. is this some extra magic
> > for code which implements its own locking primitives or is there going
> > to be a wide use via e.g. glibc.
>
> There are some applications that use futex(2) directly to implement
> their synchronization primitives. For those applications, they will need
> to modify their code to detect the presence of the new futexes. They can
> then use the new futexes if available and use wait-wake futexes if not.

That's what I suspected. Did you talk to the other folks who complain
about futex performance (database, JVM, etc.) and play their own games
with user space spinlocks and whatever?

> I am also planning to take a look at the pthread_mutex* APIs to see if
> they can be modified to use the new futexes later on when the patch
> becomes more mature.

Please involve glibc people who are interested in the futex stuff early
and discuss the concept before it's set in stone for your particular
usecase.

> > Also what's the reason that we can't do probabilistic spinning for
> > FUTEX_WAIT and have to add yet another specialized variant of futexes?
>
> The main reason is that a FUTEX_WAIT waiter has no idea who the owner of
> the futex is. We usually do spinning when the lock owner is running and
> abort when it goes to sleep. We can't do that for FUTEX_WAIT.

Fair enough. This wants to be spelled out in the changelog and explained
a bit more detailed.

Thanks,

	tglx
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 11:11 AM, Davidlohr Bueso wrote:
> On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> > On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> > > On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> > > > Also what's the reason that we can't do probabilistic spinning for
> > > > FUTEX_WAIT and have to add yet another specialized variant of
> > > > futexes?
> > >
> > > Where would this leave the respective FUTEX_WAKE? A nop? Probably
> > > have to differentiate the fact that the queue was empty, but there
> > > was a spinning, instead of straightforward returning 0.
> >
> > Sorry, but I really can't parse this answer. Can you folks please
> > communicate with proper and coherent explanations instead of throwing
> > a few gnawed off bones in my direction?
>
> I actually think that FUTEX_WAIT is the better/nicer approach. But my
> immediate question above was how to handle the FUTEX_WAKE counter-part.
> If we want to maintain current FIFO ordering for wakeups, now with WAIT
> spinners this will create lock stealing scenarios (including if we even
> guard against starvation). Or we could reduce the scope of spinners, due
> to the restrictions, similar to the top-waiter only being able to spin
> for rtmutexes. This of course will hurt the effectiveness of spinning in
> FUTEX_WAIT in the first place.

Actually, there can be a lot of lock stealing going on with the wait-wake
futexes. If the critical section is short enough, many of the lock
waiters can be waiting in the hash bucket spinlock queue and not sleeping
yet while the futex value changes. As a result, they will exit the futex
syscall and go back to user space with EAGAIN, where one of them may get
the lock. So we can't assume that they will get the lock in FIFO order
anyway.

> Another immediate thought was situations where we have spinner(s) and
> the wait queue is empty; the WAKE should also have to acknowledge that
> situation, as just returning 0 would indicate that there are actually no
> waiters on the futex.
I would say that adding optimistic spinning to FUTEX_WAIT would be a
major change, and I don't think it would be less complex than adding a
new futex type like the TO futexes, while introducing a fair amount of
risk of breaking existing applications.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 09:34 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> > I'd leave out the TO part entirely (or only mention it in changelogs).
> >
> > That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
>
> That brings me to a different question:
>
> How is user space going to support this, i.e. is this some extra magic
> for code which implements its own locking primitives or is there going
> to be a wide use via e.g. glibc.

There are some applications that use futex(2) directly to implement their
synchronization primitives. For those applications, they will need to
modify their code to detect the presence of the new futexes. They can
then use the new futexes if available and use wait-wake futexes if not.

I am also planning to take a look at the pthread_mutex* APIs to see if
they can be modified to use the new futexes later on when the patch
becomes more mature.

> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?

The main reason is that a FUTEX_WAIT waiter has no idea who the owner of
the futex is. We usually do spinning when the lock owner is running and
abort when it goes to sleep. We can't do that for FUTEX_WAIT.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 09:23 AM, Peter Zijlstra wrote:
> On Tue, Sep 20, 2016 at 09:42:41AM -0400, Waiman Long wrote:
> > +/*
> > + * Spinning threshold before enabling lock handoff.
> > + * Each sleep will decrement the threshold by 1/32 of the start value.
> > + */
> > +#define TO_SPIN_THRESHOLD	(1 << 13)
> > +#define TO_SLEEP_DECREMENT	(TO_SPIN_THRESHOLD/32)
>
> Argh, we should really get rid of those stupid numbers. Wasn't there a
> patch set working on implementing paravirt functions that would make all
> this fixable in a sane way?

These thresholds are not paravirt related. They are there to determine
when we should activate explicit lock handoff when a waiter has been
waiting for too long. Yes, they are arbitrary. The numbers are on the
high side as setting them too low will limit lock stealing and hence
overall performance. The FUTEX_WAITERS bit is only turned on when either
the top waiter is sleeping or a lock handoff is needed. Otherwise, it is
off to encourage lock stealing in user space as well as in the kernel.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Davidlohr Bueso wrote:
> > On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> > > Also what's the reason that we can't do probabilistic spinning for
> > > FUTEX_WAIT and have to add yet another specialized variant of
> > > futexes?
> >
> > Where would this leave the respective FUTEX_WAKE? A nop? Probably have
> > to differentiate the fact that the queue was empty, but there was a
> > spinning, instead of straightforward returning 0.
>
> Sorry, but I really can't parse this answer. Can you folks please
> communicate with proper and coherent explanations instead of throwing a
> few gnawed off bones in my direction?

I actually think that FUTEX_WAIT is the better/nicer approach. But my
immediate question above was how to handle the FUTEX_WAKE counter-part.
If we want to maintain current FIFO ordering for wakeups, now with WAIT
spinners this will create lock stealing scenarios (including if we even
guard against starvation). Or we could reduce the scope of spinners, due
to the restrictions, similar to the top-waiter only being able to spin
for rtmutexes. This of course will hurt the effectiveness of spinning in
FUTEX_WAIT in the first place.

Another immediate thought was situations where we have spinner(s) and the
wait queue is empty; the WAKE should also have to acknowledge that
situation, as just returning 0 would indicate that there are actually no
waiters on the futex.

Thanks,
Davidlohr
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> > On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> > > On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> > > > On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > > > > This patch introduces a new futex implementation called
> > > > > throughput-optimized (TO) futexes.
> > > >
> > > > nit: 'TO' sounds way too much like timeout... TP? You even use
> > > > 'to' as shorthand for timeout in the next patch.
> > >
> > > I agree. I am not that satisfied with the TO name. So I will change
> > > it to TP in my next revision of the patch. Thanks for the
> > > suggestion.
> >
> > I'd leave out the TO part entirely (or only mention it in changelogs).
> >
> > That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
>
> That brings me to a different question:
>
> How is user space going to support this, i.e. is this some extra magic
> for code which implements its own locking primitives or is there going
> to be a wide use via e.g. glibc.
>
> Also what's the reason that we can't do probabilistic spinning for
> FUTEX_WAIT and have to add yet another specialized variant of futexes?

Where would this leave the respective FUTEX_WAKE? A nop? Probably have to
differentiate the fact that the queue was empty, but there was a
spinning, instead of straightforward returning 0.

Thanks,
Davidlohr
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Thu, 22 Sep 2016, Peter Zijlstra wrote:
> On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> > On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> > > On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > > > This patch introduces a new futex implementation called
> > > > throughput-optimized (TO) futexes.
> > >
> > > nit: 'TO' sounds way too much like timeout... TP? You even use 'to'
> > > as shorthand for timeout in the next patch.
> >
> > I agree. I am not that satisfied with the TO name. So I will change it
> > to TP in my next revision of the patch. Thanks for the suggestion.
>
> I'd leave out the TO part entirely (or only mention it in changelogs).
>
> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

That brings me to a different question:

How is user space going to support this, i.e. is this some extra magic
for code which implements its own locking primitives or is there going to
be a wide use via e.g. glibc.

Also what's the reason that we can't do probabilistic spinning for
FUTEX_WAIT and have to add yet another specialized variant of futexes?

Thanks,

	tglx
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/22/2016 03:49 AM, Peter Zijlstra wrote:
> On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> > On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> > > On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > > > This patch introduces a new futex implementation called
> > > > throughput-optimized (TO) futexes.
> > >
> > > nit: 'TO' sounds way too much like timeout... TP? You even use 'to'
> > > as shorthand for timeout in the next patch.
> >
> > I agree. I am not that satisfied with the TO name. So I will change it
> > to TP in my next revision of the patch. Thanks for the suggestion.
>
> I'd leave out the TO part entirely (or only mention it in changelogs).
>
> That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.

I was following the convention for the PI futexes where we have
FUTEX_LOCK_PI and FUTEX_UNLOCK_PI. If you think it is OK to promote this
new futex as the default for mutex-type locks and use the PI futexes only
when needed, I am certainly OK to drop the suffix.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Wed, Sep 21, 2016 at 07:37:34PM -0400, Waiman Long wrote:
> On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> > On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > > This patch introduces a new futex implementation called
> > > throughput-optimized (TO) futexes.
> > nit: 'TO' sounds way too much like timeout... TP? You even use 'to' as
> > shorthand for timeout in the next patch.
>
> I agree. I am not that satisfied with the TO name. So I will change it
> to TP in my next revision of the patch. Thanks for the suggestion.

I'd leave out the TO part entirely (or only mention it in changelogs).

That is, I'd call the futex ops: FUTEX_LOCK and FUTEX_UNLOCK.
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/21/2016 02:59 AM, Mike Galbraith wrote:
> On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> > This patch introduces a new futex implementation called
> > throughput-optimized (TO) futexes.
> nit: 'TO' sounds way too much like timeout... TP? You even use 'to' as
> shorthand for timeout in the next patch.

I agree. I am not that satisfied with the TO name. So I will change it
to TP in my next revision of the patch. Thanks for the suggestion.

Cheers,
Longman
Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On Tue, 2016-09-20 at 09:42 -0400, Waiman Long wrote:
> This patch introduces a new futex implementation called
> throughput-optimized (TO) futexes.

nit: 'TO' sounds way too much like timeout... TP? You even use 'to' as
shorthand for timeout in the next patch.

> 	/*
> -	 * Wake robust non-PI futexes here. The wakeup of
> -	 * PI futexes happens in exit_pi_state():
> +	 * Wake robust non-PI/TO futexes here. The wakeup of
> +	 * PI/TO futexes happens in exit_pi_state():
> 	 */

nit: comment attracted my eye during high speed scroll-by.

	-Mike
[RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
This patch introduces a new futex implementation called
throughput-optimized (TO) futexes. The goal of this new futex type is
to maximize locking throughput at the expense of fairness and
deterministic latency. Its throughput is higher than that of the
wait-wake futexes, especially on systems with a large number of CPUs
where the lock owners are unlikely to sleep. The downside is an
increase in response time variance. It also implements a lock hand-off
mechanism to make sure that lock starvation won't happen.

Using a futex locking microbenchmark to do 10 million locking
operations with 5 pause instructions as the critical section by 256
threads (10M/256 locking ops each) on a 4-socket 72-core 144-thread
Haswell-EX system, the results of the benchmark runs were as follows:

                wait-wake futex    PI futex     TO futex
                ---------------    ---------    ---------
  max time            3.49s          50.91s        2.65s
  min time            3.24s          50.84s        0.07s
  average time        3.41s          50.90s        1.84s
  sys time           7m22.4s         55.73s      2m32.9s
  lock count        3,090,294     9,999,813      698,318
  unlock count      3,268,896     9,999,814          134

The lock and unlock counts above show the actual numbers of futex(2)
lock and unlock syscalls that were being issued.
Signed-off-by: Waiman Long
---
 include/uapi/linux/futex.h |   4 +
 kernel/futex.c             | 634 +++-
 2 files changed, 627 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..e7deaf3 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_LOCK_TO		13
+#define FUTEX_UNLOCK_TO		14
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -39,6 +41,8 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_TO_PRIVATE	(FUTEX_LOCK_TO | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_TO_PRIVATE	(FUTEX_UNLOCK_TO | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index f8bb93f..7daba56 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -191,6 +191,11 @@ int __read_mostly futex_cmpxchg_enabled;
 #define FLAGS_CLOCKRT		0x02
 #define FLAGS_HAS_TIMEOUT	0x04
 
+enum futex_type {
+	TYPE_PI = 0,
+	TYPE_TO,
+};
+
 /*
  * Futex state object:
  * - Priority Inheritance state
@@ -203,13 +208,30 @@ struct futex_state {
 	struct list_head list;
 
 	/*
-	 * The PI object:
+	 * Linking into the owning hashed bucket (TO futexes only)
+	 */
+	struct list_head hb_list;
+
+	/*
+	 * The PI or mutex object:
 	 */
-	struct rt_mutex pi_mutex;
+	union {
+		struct rt_mutex pi_mutex;
+		struct mutex mutex;
+	};
 
+	/*
+	 * For PI futex, owner is the task that owns the futex.
+	 * For TO futex, owner is the mutex lock holder that is either
+	 * spinning on the futex owner or sleeping.
+	 */
 	struct task_struct *owner;
 	atomic_t refcount;
 
+	enum futex_type type;
+
+	u32 handoff_pid;	/* TO only, PID for lock hand-off */
+
 	union futex_key key;
 };
 
@@ -262,6 +284,7 @@ struct futex_hash_bucket {
 	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
+	struct list_head fs_list;	/* List of futex state objects */
 } cacheline_aligned_in_smp;
 
 /*
@@ -801,8 +824,11 @@ static int refill_futex_state_cache(void)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&state->list);
+	INIT_LIST_HEAD(&state->hb_list);
+
 	/* pi_mutex gets initialized later */
 	state->owner = NULL;
+	state->handoff_pid = 0;
 	atomic_set(&state->refcount, 1);
 	state->key = FUTEX_KEY_INIT;
@@ -836,10 +862,10 @@ static void put_futex_state(struct futex_state *state)
 		return;
 
 	/*
-	 * If state->owner is NULL, the owner is most probably dying
-	 * and has cleaned up the futex state already
+	 * If state->owner is NULL and the type is TYPE_PI, the owner
+	 * is most probably dying and has cleaned up the state already
 	 */
-	if (state->owner) {
+	if (state->owner && (state->type == TYPE_PI)) {
 		raw_spin_lock_irq(&state->owner->pi_lock);
 		list_del_init(&state->list);
 		raw_spin_unlock_irq(&state->owner->pi_lock);
@@ -847,6 +873,10 @@ static void put_futex_state(struct futex_state *state)