Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use

2012-12-01 Thread Rik van Riel

On 12/01/2012 01:38 PM, Linus Torvalds wrote:
> On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar wrote:
>>
>> So as a quick concept hack I wrote the patch attached below.
>> (It's not signed off, see the patch description text for the
>> reason.)
>
> Well, it confirms that anon_vma locking is a big problem, but as
> outlined in my other email it's completely incorrect from an actual
> behavior standpoint.
>
> Btw, I think the anon_vma lock could be made a spinlock

The anon_vma lock used to be a spinlock, and was turned into a
mutex by Peter, as part of an effort to make more of the VM
preemptible.

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use

2012-12-01 Thread Linus Torvalds
On Sat, Dec 1, 2012 at 10:41 AM, Ingo Molnar wrote:
>
> I'll try the rwsem and see how it goes?

Yeah. That should be an easy conversion (just convert everything to
use the write-lock first, and then you can make one or two migration
places use the read version).

Side note: the mutex code tends to generate slightly faster
uncontended locks than rwsems, and it has the MUTEX_SPIN_ON_OWNER
feature, which often makes the contended case *much* better, so
rw-semaphores do have real downsides.

But for this load, it does seem like the scalability advantages of an
rwsem *might* be worth it.

Side note: in contrast, the rwlock spinning reader-writer locks are
basically never a win - the downsides just about always negate any
theoretical scalability advantage. rwsems can work well: we already
use one for mmap_sem, for example, to allow concurrent page faults, and
it was a *big* scalability win there. Although then we did the "drop
mmap_sem over IO and retry", and that might have negated many of the
advantages of the mmap_sem.

> Hm, indeed. For performance runs I typically disable lock
> debugging - which might have made me not directly notice some of
> the performance problems.

Yeah, lock debugging really tends to make anything that is close to
contended be absolutely *horribly* contended. Doubly so for the
mutexes because it disables the spinning code, but it's true in
general too.

Linus


Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use

2012-12-01 Thread Ingo Molnar

* Linus Torvalds wrote:

> On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar wrote:
> >
> >
> > So as a quick concept hack I wrote the patch attached below.
> > (It's not signed off, see the patch description text for the
> > reason.)
> 
> Well, it confirms that anon_vma locking is a big problem, but 
> as outlined in my other email it's completely incorrect from 
> an actual behavior standpoint.

Yeah.

> Btw, I think the anon_vma lock could be made a spinlock 
> instead of a mutex or rwsem, but that would probably take more 
> work. We *shouldn't* be doing anything that needs IO inside 
> the anon_vma lock, though, so it *should* be doable. But there 
> are probably quite a bit of allocations inside the lock, and I 
> know it covers huge areas, so a spinlock might not only be 
> hard to convert to, it quite likely has latency issues too.

I'll try the rwsem and see how it goes?

> Oh, btw, MUTEX_SPIN_ON_OWNER may well improve performance too, 
> but it gets disabled by DEBUG_MUTEXES. So some of the 
> performance impact of the vma locking may be *very* 
> kernel-config dependent.

Hm, indeed. For performance runs I typically disable lock 
debugging - which might have made me not directly notice some of 
the performance problems.

Thanks,

Ingo


Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use

2012-12-01 Thread Linus Torvalds
On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar wrote:
>
>
> So as a quick concept hack I wrote the patch attached below.
> (It's not signed off, see the patch description text for the
> reason.)

Well, it confirms that anon_vma locking is a big problem, but as
outlined in my other email it's completely incorrect from an actual
behavior standpoint.

Btw, I think the anon_vma lock could be made a spinlock instead of a
mutex or rwsem, but that would probably take more work. We *shouldn't*
be doing anything that needs IO inside the anon_vma lock, though, so
it *should* be doable. But there are probably quite a bit of
allocations inside the lock, and I know it covers huge areas, so a
spinlock might not only be hard to convert to, it quite likely has
latency issues too.

Oh, btw, MUTEX_SPIN_ON_OWNER may well improve performance too, but it
gets disabled by DEBUG_MUTEXES. So some of the performance impact of
the vma locking may be *very* kernel-config dependent.

  Linus


[RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use

2012-12-01 Thread Ingo Molnar

* Ingo Molnar wrote:

> 1)
> 
> This patch might solve the remapping 
> (remove_migration_ptes()), but does not solve the anon-vma 
> locking done in the first, unmapping step of pte-migration - 
> which is done via try_to_unmap(): which is a generic VM 
> function used by swapout too, so callers do not necessarily 
> hold the mmap_sem.
> 
> A new TTU flag might solve it although I detest flag-driven 
> locking semantics with a passion:
> 
> Splitting out unlocked versions of try_to_unmap_anon(), 
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an 
> unlocked try_to_unmap() out of them, to be used by the 
> migration code, would be the cleaner option.

So as a quick concept hack I wrote the patch attached below. 
(It's not signed off, see the patch description text for the 
reason.)

With this applied I get the same good 4x JVM performance:

 spec1.txt:   throughput = 157471.10 SPECjbb2005 bops 
 spec2.txt:   throughput = 157817.09 SPECjbb2005 bops 
 spec3.txt:   throughput = 157581.79 SPECjbb2005 bops 
 spec4.txt:   throughput = 157890.26 SPECjbb2005 bops 
   --
   SUM:   throughput = 630760.24 SPECjbb2005 bops

... because the JVM workload did not trigger the migration 
scalability threshold to begin with.

Mainline 4xJVM SPECjbb performance:

 spec1.txt:   throughput = 128575.47 SPECjbb2005 bops
 spec2.txt:   throughput = 125767.24 SPECjbb2005 bops
 spec3.txt:   throughput = 130042.30 SPECjbb2005 bops
 spec4.txt:   throughput = 128155.32 SPECjbb2005 bops
   --
   SUM:   throughput = 512540.33 SPECjbb2005 bops

 # (32 CPUs, 4 instances, 8 warehouses each, 240 seconds runtime, !THP)

But !THP/4K numa02 performance went through the roof!

Mainline !THP numa02 performance:

 40.918 secs runtime/thread
 26.051 secs fastest (min) thread time
 59.229 secs elapsed (max) thread time [ spread: -28.0% ]
 26.844 GB data processed, per thread
858.993 GB data processed, total
  2.206 nsecs/byte/thread
  0.453 GB/sec/thread
 14.503 GB/sec total

numa/core v18 + migration-locking-enhancements, !THP:

 18.543 secs runtime/thread
 17.721 secs fastest (min) thread time
 19.262 secs elapsed (max) thread time [ spread: -4.0% ]
 26.844 GB data processed, per thread
858.993 GB data processed, total
  0.718 nsecs/byte/thread
  1.394 GB/sec/thread
 44.595 GB/sec total

As you can see, the performance of each of the 32 threads is
within a tight bound:

 17.721 secs fastest (min) thread time
 19.262 secs elapsed (max) thread time [ spread: -4.0% ]

... with very little spread between them.

So this is roughly as good as it can get without hard binding - 
and according to my limited testing the numa02 workload is 
20-30% faster than the AutoNUMA or balancenuma kernels on the 
same hardware/kernel combo. The above numa02 result now also 
gets reasonably close to the numa/core +THP numa02 numbers (to 
within 10%).

As expected there's a lot of TLB flushing going on, but, and 
this was unexpected to me, even maximally pushing the migration 
code does not trigger anything pathological on this 4-node 
system - so while the TLB optimization will be a welcome 
enhancement, it's not a must-have at this stage.

I'll do a cleaner version of this patch and I'll test on a 
larger system with a large NUMA factor too to make sure we don't 
need the TLB optimization on !THP.

So I think (assuming that I have not overlooked something 
critical in these patches!), with these two fixes all the 
difficult known regressions in numa/core are fixed.

I'll do more testing with broader workloads and on more systems 
to ascertain this.

Thanks,

Ingo

>
Subject: mm/migration: Remove anon vma locking from try_to_unmap() use
From: Ingo Molnar 
Date: Sat Dec 1 11:22:09 CET 2012

As outlined in:

mm/migration: Don't lock anon vmas in rmap_walk_anon()

the process-global anon vma mutex locking of the page migration
code can be very expensive.

This removes the second (and last) use of that mutex from the
migration code: try_to_unmap().

Since try_to_unmap() is used by swapout and filesystem code
as well, which does not hold the mmap_sem, we only want to
do this optimization from the migration path.

This patch is ugly and should be replaced via a
try_to_unmap_locked() variant instead which offers us the
unlocked codepath, but it's good enough for testing purposes.

Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Thomas Gleixner 
Cc: Hugh Dickins 
Not-Signed-off-by: Ingo Molnar 
---
 include/linux/rmap.h |2 +-
 mm/hu