Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, 21 Feb 2008 22:24:20 +0100 Ingo Molnar [EMAIL PROTECTED] wrote:

> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we haven't
> done them. It's definitely not an unambiguously win-win concept.

When ext3 was converted from sleeping locks to spinlocks, dbench-on-numaq throughput went up by a factor of ten. I'd expect that what -rt has done is a truly awful change for lots of workloads on lots of machines. Yes, there's the don't-enable-it-if-you're-doing-that option, but that adds the we-just-doubled-the-number-of-kernels-distros-need-to-ship problem.

Does -rt also make bit_spin_lock preemptible? If not, I'd have thought the change was of little benefit to ext3/jbd, and it might as well go back to spinning locks.

- To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH [RT] 00/14] RFC - adaptive real-time locks
The Real-Time patches to the Linux kernel convert the architecture-specific SMP synchronization primitives commonly referred to as spinlocks into an RT-mutex implementation that supports a priority-inheritance protocol and priority-ordered wait queues. The RT-mutex implementation allows tasks that would otherwise busy-wait for a contended lock to be preempted by higher-priority tasks without compromising the integrity of the critical sections protected by the lock. The unintended side effect is that the -rt kernel suffers a significant degradation of I/O throughput (disk and net) due to the extra overhead of managing PI lists and context switching. This has generally been accepted as the price to pay for low-latency preemption.

Our research indicates that it doesn't necessarily have to be this way. This patch series introduces an adaptive technique that retains both the priority-inheritance protocol and the preemptible nature of the locks, while adding a 300+% throughput increase to the Linux real-time kernel. It applies to 2.6.24-rt1. These performance increases apply to disk I/O as well as netperf UDP benchmarks, without compromising RT preemption latency. For more complex applications, overall I/O throughput seems to approach that of a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP kernel, as shipped by most distros.

Essentially, the RT mutex has been modified to busy-wait under contention for a limited (and configurable) time. This works because most locks are typically held for very short time spans; often, by the time a task would go to sleep on a mutex, the mutex is already being released on another CPU. The effect (on SMP) is that by polling a mutex for a limited time we reduce context-switch overhead by up to 90%, thereby reclaiming CPU cycles and eliminating massive hot spots in the scheduler and other bottlenecks in the kernel, even though we burn CPU cycles polling the lock.
We have put together some data from different types of benchmarks for this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock 2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-locks (2.6.24-rt1-al) (PREEMPT_RT) kernel. The machine is a 4-way (dual-core, dual-socket) 2GHz Xeon 5130 (Core2/Woodcrest) Dell Precision 490. Some tests show a marked improvement (for instance, dbench and hackbench), whereas for others (make -j 128) the results were less profound but still net positive. In all cases we have also verified, using cyclictest, that deterministic latency is not impacted.

This patch series also includes some rework of the raw_spinlock infrastructure, including Nick Piggin's x86 ticket locks. We found that the increased pressure on the lock->wait_lock could cause rare but serious latency spikes, which are fixed by a FIFO raw_spinlock_t implementation. Nick was gracious enough to allow us to reuse his work (which is already accepted for 2.6.25). Note that we also have a C version of his protocol available if other architectures need FIFO-lock support as well; we will gladly post it upon request.

Special thanks go to the many people who were instrumental to this project, including:

*) the -rt team here at Novell, for research, development, and testing;
*) Nick Piggin, for his invaluable consultation/feedback and the use of his x86 ticket locks;
*) the reviewers/testers at SuSE and MontaVista, and Bill Huey, for their time and feedback on early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 10:26 AM, in message [EMAIL PROTECTED], Gregory Haskins [EMAIL PROTECTED] wrote:

> We have put together some data from different types of benchmarks for
> this patch series, which you can find here:
> ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also placed a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
hm. Why is the ticket spinlock patch included in this patchset? It just skews your performance results unnecessarily. Ticket spinlocks are conceptually independent, they are already upstream in 2.6.25-rc2, and -rt will have them automatically once we rebase to .25.

And if we take the ticket spinlock patch out of your series, the size of the patchset shrinks by half and touches only 200-300 lines of code ;-) Considering the total size of the -rt patchset:

652 files changed, 23830 insertions(+), 4636 deletions(-)

we can regard it a routine optimization ;-)

Regarding the concept: adaptive mutexes have been talked about in the past, but their advantage is not at all clear; that's why we haven't done them. It's definitely not an unambiguously win-win concept.

So let's get some real, marketing-free benchmarking done. We are not just interested in the workloads where a bit of polling on contended locks helps; we are also interested in workloads where the polling hurts ... And let's please do the comparisons without the ticket spinlock patch ...

Ingo
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
It would also help to get the lockdep/lockstat output for those runs so that more discussion can happen. That framework can be extended to do all sorts of contention tracking, which is why I implemented it in the first place: to track the viability of adaptive spins. My initial results were that about 10 percent of contentions (heavier against the BKL) were suitable for adaptive spins, and roughly another 10 percent for ownership stealing. The current implementation doesn't seem to have those measurements coded in just yet (unlike my private implementation).

I came to the original conclusion that it wasn't worth it, but the dbench numbers published say otherwise. It would be good to get both a precise chunk of instrumentation showing where the problems are (these numbers could be skewed by a number of things) as well as the specific locks that are contended against.

See ya
bill

On Thu, Feb 21, 2008 at 1:24 PM, Ingo Molnar [EMAIL PROTECTED] wrote:

> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we haven't
> done them. It's definitely not an unambiguously win-win concept. So
> let's get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And let's please do the comparisons without the ticket
> spinlock patch ...
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 4:24 PM, in message [EMAIL PROTECTED], Ingo Molnar [EMAIL PROTECTED] wrote:

> hm. Why is the ticket spinlock patch included in this patchset? It
> just skews your performance results unnecessarily. Ticket spinlocks
> are independent conceptually, they are already upstream in 2.6.25-rc2
> and -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous. I included them because we found that the patch series without them can cause spikes due to the newly introduced pressure on the (raw_spinlock_t) lock->wait_lock. You can run the adaptive-spin patches without them just fine (in fact, in many cases things run faster without them; dbench *thrives* on chaos), but you may also measure a cyclictest spike if you do so. So I included them to present a complete package without spikes. I tried to explain that detail in the prologue, but most people probably fell asleep before they got to the end ;)

> and if we take the ticket spinlock patch out of your series, the size
> of the patchset shrinks by half and touches only 200-300 lines of
> code ;-) Considering the total size of the -rt patchset: 652 files
> changed, 23830 insertions(+), 4636 deletions(-) we can regard it a
> routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we haven't
> done them. It's definitely not an unambiguously win-win concept. So
> let's get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And let's please do the comparisons without the ticket
> spinlock patch ...

I'm open to suggestions, and this was just a sample of the testing we have done.

We have thrown plenty of workloads at this patch series beyond the slides I prepared at that URL, and they all seem to indicate a net positive improvement so far. Some of those results I cannot share due to NDA, and some I didn't share simply because I never formally collected the data as I did for these tests. If there is something you would like to see, please let me know and I will arrange for it to be run if at all possible.

Regards,
-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 4:42 PM, in message [EMAIL PROTECTED], Ingo Molnar [EMAIL PROTECTED] wrote:

> * Bill Huey (hui) [EMAIL PROTECTED] wrote:
>
>> I came to the original conclusion that it wasn't originally worth
>> it, but the dbench number published say otherwise. [...]
>
> dbench is a notoriously unreliable and meaningless workload. It's
> being frowned upon by the VM and the IO folks.

I agree, it's a pretty weak benchmark. BUT, it does pound on the dcache_lock and therefore was a good demonstration of the benefits of lower contention overhead. Also note that we threw other tests into that PDF, if you scroll to the subsequent pages.

> If that's the only workload where spin-mutexes help, and if it's only
> a 3% improvement [of which it is unclear how much of that improvement
> was due to ticket spinlocks], then adaptive mutexes are probably not
> worth it.

Note that the 3% figure being thrown around was from a single patch within the series. We are actually getting a net average gain of 443% in dbench, and that number goes *up* when you remove the ticket locks. The ticket locks are there to prevent latency spikes, not to improve throughput. Also take a look at the hackbench numbers, which are particularly promising: we get a net average gain of 493% for RT10-based hackbench runs. The kernel build showed only a small gain, but it was all gain nonetheless. We see similar results for any other workloads we throw at this thing. I will gladly run any requested test that I have the ability to run, and I would encourage third-party results as well.

> I'd not exclude them fundamentally though, it's really the numbers
> that matter. The code is certainly simple enough (albeit the .config
> and sysctl controls are quite ugly and unacceptable - adaptive
> mutexes should really be ... adaptive, with no magic constants in
> .configs or else).

We can clean this up, per your suggestions.

> But ... i'm somewhat sceptic, after having played with spin-a-bit
> mutexes before.

It's very subtle to get this concept to work. The first few weeks, we were getting 90% regressions ;) Then we had a breakthrough and started to get this thing humming along quite nicely.

Regards,
-Greg
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, 2008-02-21 at 22:24 +0100, Ingo Molnar wrote:

> hm. Why is the ticket spinlock patch included in this patchset? It
> just skews your performance results unnecessarily. Ticket spinlocks
> are independent conceptually, they are already upstream in 2.6.25-rc2
> and -rt will have them automatically once we rebase to .25.

True, the ticket spinlock certainly adds to the throughput results we have seen. However, the results without the ticket patch are still very significant (IIRC, 500-600MB/s instead of the ~730MB/s advertised). We can easily re-generate the previous results for an apples-to-apples comparison.

Without the ticket locks we see spikes in cyclictest that have been mapped (via latency tracing) to the non-deterministic behavior of the raw rt-lock wait_lock (which is used in the rt-lock code). These spikes disappear with the ticket lock patch installed. Note that we see these spikes while running cyclictest concurrently with dbench/hackbench/kernel builds.

> and if we take the ticket spinlock patch out of your series, the size
> of the patchset shrinks by half and touches only 200-300 lines of
> code ;-) Considering the total size of the -rt patchset: 652 files
> changed, 23830 insertions(+), 4636 deletions(-) we can regard it a
> routine optimization ;-)
>
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear, that's why we haven't
> done them. It's definitely not an unambiguously win-win concept.

Can you elaborate? Where did you previously see that this would be problematic?

> So lets get some real marketing-free benchmarking done, and we are
> not just interested in the workloads where a bit of polling on
> contended locks helps, but we are also interested in workloads where
> the polling hurts ... And lets please do the comparisons without the
> ticket spinlock patch ...
>
> Ingo

Certainly. One of the things we have struggled with is finding real-world loads that prove/disprove the implementation.

Thus far, everything we have come up with has only shown improvement. The worst case we have found so far is running dbench/hackbench under SCHED_FIFO, where multiple tasks at the same priority contend on the same locks. This should give worst-case performance, since the tasks cannot preempt each other and the nesting of locks should produce significant latencies. However, with the addition of the lateral-steal patch we do not witness that at all; rather, the throughput is on par with the CFS results. Higher-priority FIFO tasks will preempt the polling tasks, and the lower-priority tasks will wait, as they should.

Are there some suggestions on workloads that would show how the polling hurts?

Thanks,
-PWM
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, 2008-02-21 at 15:12 -0700, Peter W. Morreale wrote:

> True, the ticket spinlock certainly adds to the throughput results we
> have seen. However, the results without the ticket patch are still
> very significant. (IIRC, 500-600MB/s instead of the ~730MB/s
> advertised) We can easily re-gen the previous results for an
> apples-to-apples comparison.

I misspoke in the above. Greg is correct: without the ticket lock patch we see an *improvement* in throughput, not a decrease. Sorry for any confusion.

Best,
-PWM
Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks
On Thu, Feb 21, 2008 at 1:42 PM, Ingo Molnar [EMAIL PROTECTED] wrote:

> I'd not exclude them fundamentally though, it's really the numbers
> that matter. The code is certainly simple enough (albeit the .config
> and sysctl controls are quite ugly and unacceptable - adaptive
> mutexes should really be ... adaptive, with no magic constants in
> .configs or else). But ... i'm somewhat sceptic, after having played
> with spin-a-bit mutexes before.

Yeah, it's yet another reason to get better instrumentation into -rt for this purpose, and why it's important to log things only in the slow path, so that the front end of the rtmutex doesn't see instrumentation artifacts in the measurement. If I get a chance tonight, I'll extend lockdep/lockstat for this purpose, if peterz doesn't beat me to it. I won't have access to the test harness to replicate this, though.

bill