Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-23 Thread Andrew Morton
On Thu, 21 Feb 2008 22:24:20 +0100 Ingo Molnar [EMAIL PROTECTED] wrote:

> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear; that's why we haven't done
> them. It's definitely not an unambiguously win-win concept.

When ext3 was converted from sleeping locks to spinlocks, dbench-on-numaq
throughput went up by a factor of ten.  I'd expect that what RT has done
was a truly awful change for lots of workloads on lots of machines.

Yeah, there's the dont-enable-it-if-you're-doing-that option, but that adds
the we-just-doubled-the-number-of-kernels-distros-need-to-ship problem.

Does -rt also make bit_spin_lock preemptible?  If not, I'd have thought the
change was of little benefit to ext3/jbd and it might as well go back to
spinning locks.
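
For reference, the stock bit_spin_lock() is a pure busy-wait on a bit
of an existing word (jbd's buffer-head state locks are built on it).
A rough sketch of its SMP core, paraphrased from
include/linux/bit_spinlock.h (the _sketch suffix is mine):

/*
 * Paraphrase of the stock bit_spin_lock() SMP path: preemption is
 * disabled and we spin on a single bit, never sleeping.  A sleeping-
 * lock conversion would have to treat these callers specially.
 */
static inline void bit_spin_lock_sketch(int bitnum, unsigned long *addr)
{
	preempt_disable();
	while (test_and_set_bit(bitnum, addr)) {
		preempt_enable();
		do {
			cpu_relax();		/* pure busy-wait */
		} while (test_bit(bitnum, addr));
		preempt_disable();
	}
}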



[PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
The Real Time patches to the Linux kernel convert the
architecture-specific SMP synchronization primitives commonly referred
to as "spinlocks" to an RT mutex implementation that supports a
priority inheritance protocol and priority-ordered wait queues.  The RT
mutex implementation allows tasks that would otherwise busy-wait for a
contended lock to be preempted by higher priority tasks without
compromising the integrity of critical sections protected by the lock.
The unintended side-effect is that the -rt kernel suffers from
significant degradation of IO throughput (disk and net) due to the
extra overhead associated with managing pi-lists and context switching.
This has been generally accepted as a price to pay for low-latency
preemption.

Our research indicates that it doesn't necessarily have to be this
way.  This patch set introduces an adaptive technology that retains
both the priority inheritance protocol and the preemptive nature of
spinlocks and mutexes, and yields a 300+% throughput increase in the
Linux Real Time kernel.  It applies to 2.6.24-rt1.

These performance increases apply to disk IO as well as netperf UDP
benchmarks, without compromising RT preemption latency.  For more
complex applications, overall I/O throughput seems to approach that of
a PREEMPT_VOLUNTARY or PREEMPT_DESKTOP kernel, as shipped by most
distros.

Essentially, the RT Mutex has been modified to busy-wait under
contention for a limited (and configurable) time.  This works because
most locks are held for very short time spans; quite often, by the time
a task goes to sleep on a mutex, the mutex is already being released on
another CPU.

The effect (on SMP) is that by polling a mutex for a limited time we
reduce context-switch overhead by up to 90%, and thereby eliminate
wasted CPU cycles as well as massive hot-spots in the scheduler and
other bottlenecks in the kernel - even though we busy-wait (using CPU
cycles) to poll the lock.
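
A minimal sketch of the concept (the idea only, not the actual patch;
adaptive_timeout_jiffies and rt_mutex_sleep_on() are hypothetical
stand-ins for the configurable limit and the existing
enqueue-and-schedule slow path):

/*
 * Sketch: poll a contended rtmutex for a bounded interval before
 * falling back to the classic sleep path.
 */
static void adaptive_wait_sketch(struct rt_mutex *lock)
{
	unsigned long deadline = jiffies + adaptive_timeout_jiffies;

	while (!rt_mutex_trylock(lock)) {
		if (time_after(jiffies, deadline)) {
			rt_mutex_sleep_on(lock);	/* classic behavior */
			return;
		}
		cpu_relax();	/* owner usually releases within this window */
	}
	/* lock acquired without a context switch */
}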

We have put together some data from different types of benchmarks for
this patch series, which you can find here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

It compares a stock kernel.org 2.6.24 (PREEMPT_DESKTOP), a stock
2.6.24-rt1 (PREEMPT_RT), and a 2.6.24-rt1 + adaptive-lock
(2.6.24-rt1-al) (PREEMPT_RT) kernel.  The machine is a 4-way (dual-core,
dual-socket) 2GHz Xeon 5130 (Core 2 "Woodcrest") Dell Precision 490.

Some tests show a marked improvement (for instance, dbench and
hackbench), whereas for others (make -j 128) the results were not as
profound, but they were still net positive.  In all cases we have also
verified, using cyclictest, that deterministic latency is not impacted.

This patch series also includes some re-work of the raw_spinlock
infrastructure, including Nick Piggin's x86-ticket-locks.  We found that
the increased pressure on the lock->wait_lock could cause rare but
serious latency spikes, which are fixed by a FIFO raw_spinlock_t
implementation.  Nick was gracious enough to allow us to re-use his
work (which is already accepted in 2.6.25).  Note that we also have a
C version of his protocol available if other architectures need
FIFO-lock support as well, which we will gladly post upon request.
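
For the curious, here is a minimal C sketch of the FIFO ticket protocol
(roughly what Nick's x86 implementation does in assembly; the names are
illustrative, not his):

/*
 * Each waiter atomically takes a ticket; the lock is granted in strict
 * ticket order, so no waiter can starve and handoff order is FIFO.
 */
struct ticket_spinlock {
	atomic_t next;			/* next ticket to hand out */
	atomic_t serving;		/* ticket currently being served */
};

static void ticket_spin_lock(struct ticket_spinlock *lock)
{
	int ticket = atomic_inc_return(&lock->next) - 1;

	while (atomic_read(&lock->serving) != ticket)
		cpu_relax();		/* wait until it is our turn */
}

static void ticket_spin_unlock(struct ticket_spinlock *lock)
{
	atomic_inc(&lock->serving);	/* hand the lock to the next waiter */
}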

Special thanks go to many people who were instrumental to this project,
including:
  *) The -rt team here at Novell for research, development, and testing.
  *) Nick Piggin for his invaluable consultation/feedback and use of his
 x86-ticket-locks.
  *) The reviewers/testers at SUSE and MontaVista, and Bill Huey, for
 their time and feedback on the early versions of these patches.

As always, comments/feedback/bug-fixes are welcome.

Regards,
-Greg



Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 10:26 AM, in message
[EMAIL PROTECTED], Gregory Haskins
[EMAIL PROTECTED] wrote:

> We have put together some data from different types of benchmarks for
> this patch series, which you can find here:
>
> ftp://ftp.novell.com/dev/ghaskins/adaptive-locks.pdf

For convenience, I have also placed a tarball of the entire series here:

ftp://ftp.novell.com/dev/ghaskins/adaptive-locks-v1.tar.bz2

Regards,
-Greg



Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Ingo Molnar

hm. Why is the ticket spinlock patch included in this patchset? It just
skews your performance results unnecessarily. Ticket spinlocks are
conceptually independent; they are already upstream in 2.6.25-rc2, and
-rt will have them automatically once we rebase to .25.

and if we take the ticket spinlock patch out of your series, the
size of the patchset shrinks by half and touches only 200-300 lines of
code ;-) Considering the total size of the -rt patchset:

   652 files changed, 23830 insertions(+), 4636 deletions(-)

we can regard it as a routine optimization ;-)

regarding the concept: adaptive mutexes have been talked about in the
past, but their advantage is not at all clear; that's why we haven't done
them. It's definitely not an unambiguously win-win concept.

So let's get some real marketing-free benchmarking done, and we are not
just interested in the workloads where a bit of polling on contended
locks helps, but we are also interested in workloads where the polling
hurts ... And let's please do the comparisons without the ticket spinlock
patch ...

Ingo


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Bill Huey (hui)
It would also help to get the lockdep/lockstat output for those runs
so that more discussion can happen. That framework can be extended to
do all sorts of contention tracking, which is why I implemented it in
the first place: to track the viability of adaptive spins. My initial
results were that about 10 percent of contention events (weighted
heavily against the BKL) were suitable for adaptive spins, and roughly
10 percent for ownership stealing. The current implementation doesn't
seem to have those measurements coded in just yet (unlike my private
implementation).

I came to the original conclusion that it wasn't worth it, but the
dbench numbers published here say otherwise. It would be a good
thing to get both a precise chunk of instrumentation showing where the
problems are (these numbers could be skewed by a number of things) as
well as the specific locks that are contended against.

See ya

bill

On Thu, Feb 21, 2008 at 1:24 PM, Ingo Molnar [EMAIL PROTECTED] wrote:
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear; that's why we haven't done
> them. It's definitely not an unambiguously win-win concept.
>
> So let's get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And let's please do the comparisons without the ticket spinlock
> patch ...


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 4:24 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote:

> hm. Why is the ticket spinlock patch included in this patchset? It just
> skews your performance results unnecessarily. Ticket spinlocks are
> conceptually independent; they are already upstream in 2.6.25-rc2, and
> -rt will have them automatically once we rebase to .25.

Sorry if it was ambiguous.  I included them because we found that the
patch series without them can cause spikes due to the newly introduced
pressure on the (raw_spinlock_t)lock->wait_lock.  You can run the
adaptive-spin patches without them just fine (in fact, in many cases
things run faster without them ... dbench *thrives* on chaos).  But you
may also measure a cyclictest spike if you do so.  So I included them
to present a complete package, without spikes.  I tried to explain that
detail in the prologue, but most people probably fell asleep before
they got to the end ;)

>
> and if we take the ticket spinlock patch out of your series, the
> size of the patchset shrinks by half and touches only 200-300 lines of
> code ;-) Considering the total size of the -rt patchset:
>
>    652 files changed, 23830 insertions(+), 4636 deletions(-)
>
> we can regard it as a routine optimization ;-)

It's not the size of your LOC, but what you do with it :)

>
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear; that's why we haven't done
> them. It's definitely not an unambiguously win-win concept.
>
> So let's get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And let's please do the comparisons without the ticket spinlock
> patch ...

I'm open to suggestions, and this was just a sample of the testing we
have done.  We have thrown plenty of workloads at this patch series,
far beyond the slides I prepared at that URL, and they all seem to
indicate a net positive improvement so far.  Some of those results I
cannot share due to NDA, and some I didn't share simply because I never
formally collected the data as I did for these tests.  If there is
something you would like to see, please let me know and I will arrange
for it to be run if at all possible.

Regards,
-Greg



Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Gregory Haskins
>>> On Thu, Feb 21, 2008 at 4:42 PM, in message [EMAIL PROTECTED],
Ingo Molnar [EMAIL PROTECTED] wrote:

> * Bill Huey (hui) [EMAIL PROTECTED] wrote:
>
>> I came to the original conclusion that it wasn't worth it, but the
>> dbench numbers published here say otherwise. [...]
>
> dbench is a notoriously unreliable and meaningless workload. It's being
> frowned upon by the VM and the IO folks.

I agree ... it's a pretty weak benchmark.  BUT, it does pound on
dcache_lock and therefore was a good demonstration of the benefits of
lower contention overhead.  Also note that we threw other tests into
that PDF; see the subsequent pages.

> If that's the only workload
> where spin-mutexes help, and if it's only a 3% improvement [of which it
> is unclear how much of that improvement was due to ticket spinlocks],
> then adaptive mutexes are probably not worth it.

Note that the 3% figure being thrown around was from a single patch
within the series.  We are actually getting a net average gain of 443%
in dbench.  And note that the number goes *up* when you remove the
ticketlocks.  The ticketlocks are there to prevent latency spikes, not
to improve throughput.

Also take a look at the hackbench numbers, which are particularly
promising.  We get a net average gain of 493% for RT10-based hackbench
runs.  The kernel build showed only a small gain, but it was all gain
nonetheless.  We see similar results for any other workload we throw at
this thing.  I will gladly run any requested test that I am able to,
and I would encourage third-party results as well.

>
> I'd not exclude them fundamentally, though; it's really the numbers that
> matter. The code is certainly simple enough (albeit the .config and
> sysctl controls are quite ugly and unacceptable - adaptive mutexes
> should really be ... adaptive, with no magic constants in .configs or
> else).

We can clean this up, per your suggestions.
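
One possible direction (a sketch of where this could go, not code from
the series; rt_mutex_sleep_on() is a hypothetical stand-in as before)
is to drop the constant entirely and poll only while the owner is
actually on a CPU:

/*
 * Constant-free heuristic: once the lock owner itself sleeps, spinning
 * can no longer help, so block immediately.
 */
static void adaptive_wait_ownertrack(struct rt_mutex *lock)
{
	struct task_struct *owner;

	while (!rt_mutex_trylock(lock)) {
		owner = rt_mutex_owner(lock);
		if (!owner || !task_curr(owner)) {
			rt_mutex_sleep_on(lock);	/* owner not running */
			return;
		}
		cpu_relax();	/* owner live on another CPU; keep polling */
	}
}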

 
> But ... I'm somewhat sceptical, after having played with spin-a-bit
> mutexes before.

It's very subtle to get this concept to work.  The first few weeks, we
were getting 90% regressions ;)  Then we had a breakthrough and started
to get this thing humming along quite nicely.

Regards,
-Greg


Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Peter W. Morreale
On Thu, 2008-02-21 at 22:24 +0100, Ingo Molnar wrote:
> hm. Why is the ticket spinlock patch included in this patchset? It just
> skews your performance results unnecessarily. Ticket spinlocks are
> conceptually independent; they are already upstream in 2.6.25-rc2, and
> -rt will have them automatically once we rebase to .25.
>

True, the ticket spinlock certainly adds to the throughput results we
have seen.  However, the results without the ticket patch are still
very significant.  (IIRC, 500-600MB/s instead of the ~730MB/s
advertised.)  We can easily re-gen the previous results for an
apples-to-apples comparison.

Without the ticket locks we see spikes in cyclictest that have been
mapped (via latency tracing) to the non-deterministic behavior of the
raw rt-lock wait_lock (which is used in the rt lock code).  These
spikes disappear with the ticket lock patch installed.

Note that we see these spikes while running cyclictest concurrently with
dbench/hackbench/kernel builds.  

> and if we take the ticket spinlock patch out of your series, the
> size of the patchset shrinks by half and touches only 200-300 lines of
> code ;-) Considering the total size of the -rt patchset:
>
>    652 files changed, 23830 insertions(+), 4636 deletions(-)
>
> we can regard it as a routine optimization ;-)
>
> regarding the concept: adaptive mutexes have been talked about in the
> past, but their advantage is not at all clear; that's why we haven't done
> them. It's definitely not an unambiguously win-win concept.
>

Can you elaborate?  Where did you previously see that this would be
problematic?

> So let's get some real marketing-free benchmarking done, and we are not
> just interested in the workloads where a bit of polling on contended
> locks helps, but we are also interested in workloads where the polling
> hurts ... And let's please do the comparisons without the ticket spinlock
> patch ...
>
>    Ingo

Certainly.  One of the things we have struggled with is finding
real-world loads that prove/disprove the implementation.  Thus far,
everything we have come up with has only shown improvement.

The worst case we have come up with so far is running dbench/hackbench
under FIFO.  Here we have multiple tasks at the same priority contending
on the same locks.  This should give worst-case performance, since the
tasks cannot preempt each other and the nesting of locks should produce
significant latencies.  However, with the addition of the lateral-steal
patch (see the sketch below), we do not witness that at all; rather,
the throughput is on par with the CFS results.

Higher priority FIFO tasks will preempt the polling tasks, and the lower
priority tasks will wait, as they should.
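
For reference, the lateral-steal behavior referred to above amounts to
roughly this (assumed semantics, sketched for illustration; not the
patch text):

/*
 * A lock handed to a woken "pending" owner may be stolen before that
 * task actually runs.  Classically only a strictly higher-priority
 * task may steal; lateral steal also permits an equal-priority task,
 * which keeps same-priority FIFO workloads from serializing on wakeup
 * latency.  (A lower ->prio value means higher priority.)
 */
static int may_steal_lock(struct task_struct *thief,
			  struct task_struct *pending_owner,
			  int lateral_steal_enabled)
{
	if (thief->prio < pending_owner->prio)
		return 1;				/* classic priority steal */

	return lateral_steal_enabled &&
	       thief->prio == pending_owner->prio;	/* lateral steal */
}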

Are there suggestions for workloads that would show where the polling
hurts?

Thanks,
-PWM




Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Peter W. Morreale
On Thu, 2008-02-21 at 15:12 -0700, Peter W. Morreale wrote:

> True, the ticket spinlock certainly adds to the throughput results we
> have seen.  However, the results without the ticket patch are still
> very significant.  (IIRC, 500-600MB/s instead of the ~730MB/s
> advertised.)  We can easily re-gen the previous results for an
> apples-to-apples comparison.
>

I misspoke in the above.  Greg is correct.  Without the ticket lock
patch we see an *improvement* in throughput, not a decrease.  

Sorry for any confusion.

Best,
-PWM




Re: [PATCH [RT] 00/14] RFC - adaptive real-time locks

2008-02-21 Thread Bill Huey (hui)
On Thu, Feb 21, 2008 at 1:42 PM, Ingo Molnar [EMAIL PROTECTED] wrote:
> I'd not exclude them fundamentally, though; it's really the numbers that
> matter. The code is certainly simple enough (albeit the .config and
> sysctl controls are quite ugly and unacceptable - adaptive mutexes
> should really be ... adaptive, with no magic constants in .configs or
> else).
>
> But ... I'm somewhat sceptical, after having played with spin-a-bit
> mutexes before.

Yeah, it's yet another reason to get better instrumentation into -rt
for this purpose, and why it's important to only log things in the slow
path, so that the front end of the rtmutex doesn't see instrumentation
artifacts in the measurements.  If I get a chance tonight, I'll extend
lockdep/lockstat for this purpose if peterz doesn't beat me to it.  I
won't have access to the test harness to replicate this, though.

bill