Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
On 12/01/2012 01:38 PM, Linus Torvalds wrote:
> On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar wrote:
>>
>> So as a quick concept hack I wrote the patch attached below.
>> (It's not signed off, see the patch description text for the
>> reason.)
>
> Well, it confirms that anon_vma locking is a big problem, but as
> outlined in my other email it's completely incorrect from an actual
> behavior standpoint.
>
> Btw, I think the anon_vma lock could be made a spinlock

The anon_vma lock used to be a spinlock, and was turned into a mutex
by Peter, as part of an effort to make more of the VM preemptible.

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
On Sat, Dec 1, 2012 at 10:41 AM, Ingo Molnar wrote:
>
> I'll try the rwsem and see how it goes?

Yeah. That should be an easy conversion (just convert everything to
use the write-lock first, and then you can make one or two migration
places use the read version).

Side note: The mutex code tends to potentially generate slightly
faster noncontended locks than rwsems, and it does have the
MUTEX_SPIN_ON_OWNER feature that makes the contention case often
*much* better, so there are real downsides to rw-semaphores. But for
this load, it does seem like the scalability advantages of an rwsem
*might* be worth it.

Side note: in contrast, the rwlock spinning reader-writer locks are
basically never a win - the downsides just about always negate any
theoretical scalability advantage. rwsem's can work well, we already
use it for mmap_sem, for example, to allow concurrent page faults, and
it was a *big* scalability win there. Although then we did the "drop
mmap_sem over IO and retry", and that might have negated many of the
advantages of the mmap_sem.

> Hm, indeed. For performance runs I typically disable lock
> debugging - which might have made me not directly notice some of
> the performance problems.

Yeah, lock debugging really tends to make anything that is close to
contended be absolutely *horribly* contended. Doubly so for the
mutexes because it disables the spinning code, but it's true in
general too.

            Linus
Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
* Linus Torvalds wrote:

> On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar wrote:
> >
> > So as a quick concept hack I wrote the patch attached below.
> > (It's not signed off, see the patch description text for the
> > reason.)
>
> Well, it confirms that anon_vma locking is a big problem, but
> as outlined in my other email it's completely incorrect from
> an actual behavior standpoint.

Yeah.

> Btw, I think the anon_vma lock could be made a spinlock
> instead of a mutex or rwsem, but that would probably take more
> work. We *shouldn't* be doing anything that needs IO inside
> the anon_vma lock, though, so it *should* be doable. But there
> are probably quite a bit of allocations inside the lock, and I
> know it covers huge areas, so a spinlock might not only be
> hard to convert to, it quite likely has latency issues too.

I'll try the rwsem and see how it goes?

> Oh, btw, MUTEX_SPIN_ON_OWNER may well improve performance too,
> but it gets disabled by DEBUG_MUTEXES. So some of the
> performance impact of the vma locking may be *very*
> kernel-config dependent.

Hm, indeed. For performance runs I typically disable lock
debugging - which might have made me not directly notice some of
the performance problems.

Thanks,

	Ingo
Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar wrote:
>
> So as a quick concept hack I wrote the patch attached below.
> (It's not signed off, see the patch description text for the
> reason.)

Well, it confirms that anon_vma locking is a big problem, but as
outlined in my other email it's completely incorrect from an actual
behavior standpoint.

Btw, I think the anon_vma lock could be made a spinlock instead of a
mutex or rwsem, but that would probably take more work. We *shouldn't*
be doing anything that needs IO inside the anon_vma lock, though, so
it *should* be doable. But there are probably quite a bit of
allocations inside the lock, and I know it covers huge areas, so a
spinlock might not only be hard to convert to, it quite likely has
latency issues too.

Oh, btw, MUTEX_SPIN_ON_OWNER may well improve performance too, but it
gets disabled by DEBUG_MUTEXES. So some of the performance impact of
the vma locking may be *very* kernel-config dependent.

            Linus
[RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
* Ingo Molnar wrote:

> 1)
>
> This patch might solve the remapping
> (remove_migration_ptes()), but does not solve the anon-vma
> locking done in the first, unmapping step of pte-migration -
> which is done via try_to_unmap(): which is a generic VM
> function used by swapout too, so callers do not necessarily
> hold the mmap_sem.
>
> A new TTU flag might solve it although I detest flag-driven
> locking semantics with a passion:
>
> Splitting out unlocked versions of try_to_unmap_anon(),
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an
> unlocked try_to_unmap() out of them, to be used by the
> migration code, would be the cleaner option.

So as a quick concept hack I wrote the patch attached below.
(It's not signed off, see the patch description text for the
reason.)

With this applied I get the same good 4x JVM performance:

     spec1.txt:           throughput =     157471.10 SPECjbb2005 bops
     spec2.txt:           throughput =     157817.09 SPECjbb2005 bops
     spec3.txt:           throughput =     157581.79 SPECjbb2005 bops
     spec4.txt:           throughput =     157890.26 SPECjbb2005 bops
                                      --
           SUM:           throughput =     630760.24 SPECjbb2005 bops

... because the JVM workload did not trigger the migration
scalability threshold to begin with.

Mainline 4xJVM SPECjbb performance:

     spec1.txt:           throughput =     128575.47 SPECjbb2005 bops
     spec2.txt:           throughput =     125767.24 SPECjbb2005 bops
     spec3.txt:           throughput =     130042.30 SPECjbb2005 bops
     spec4.txt:           throughput =     128155.32 SPECjbb2005 bops
                                      --
           SUM:           throughput =     512540.33 SPECjbb2005 bops

     # (32 CPUs, 4 instances, 8 warehouses each, 240 seconds runtime, !THP)

But !THP/4K numa02 performance went through the roof!
Mainline !THP numa02 performance:

     40.918 secs runtime/thread
     26.051 secs fastest (min) thread time
     59.229 secs elapsed (max) thread time [ spread: -28.0% ]
     26.844 GB data processed, per thread
    858.993 GB data processed, total
      2.206 nsecs/byte/thread
      0.453 GB/sec/thread
     14.503 GB/sec total

numa/core v18 + migration-locking-enhancements, !THP:

     18.543 secs runtime/thread
     17.721 secs fastest (min) thread time
     19.262 secs elapsed (max) thread time [ spread: -4.0% ]
     26.844 GB data processed, per thread
    858.993 GB data processed, total
      0.718 nsecs/byte/thread
      1.394 GB/sec/thread
     44.595 GB/sec total

As you can see the performance of each of the 32 threads is
within a tight bound:

     17.721 secs fastest (min) thread time
     19.262 secs elapsed (max) thread time [ spread: -4.0% ]

... with very little spread between them.

So this is roughly as good as it can get without hard binding -
and according to my limited testing the numa02 workload is
20-30% faster than the AutoNUMA or balancenuma kernels on the
same hardware/kernel combo.

The above numa02 result now also gets reasonably close to the
numa/core +THP numa02 numbers (to within 10%).

As expected there's a lot of TLB flushing going on, but, and
this was unexpected to me, even maximally pushing the migration
code does not trigger anything pathological on this 4-node
system - so while the TLB optimization will be a welcome
enhancement, it's not a must-have at this stage.

I'll do a cleaner version of this patch and I'll test on a
larger system with a large NUMA factor too to make sure we don't
need the TLB optimization on !THP.

So I think (assuming that I have not overlooked something
critical in these patches!), with these two fixes all the
difficult known regressions in numa/core are fixed.

I'll do more testing with broader workloads and on more systems
to ascertain this.
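For scale, the ratios implied by the figures quoted in this mail can be
recomputed with a few lines of C. This is only a sanity-check sketch:
the helper names throughput_gain() and time_speedup() are made up for
illustration, and the inputs are the numbers reported above, not new
measurements.

```c
#include <assert.h>

/* Ratio of a patched throughput-style metric to its mainline baseline. */
double throughput_gain(double patched, double mainline)
{
	return patched / mainline;
}

/* Speedup for a time-style metric, where lower is better. */
double time_speedup(double mainline_secs, double patched_secs)
{
	return mainline_secs / patched_secs;
}
```

With the reported values, throughput_gain(44.595, 14.503) comes to
roughly 3.07x for numa02 total bandwidth, time_speedup(40.918, 18.543)
to roughly 2.21x for numa02 runtime per thread, and
throughput_gain(630760.24, 512540.33) to roughly 1.23x for the 4x JVM
SPECjbb sum.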
Thanks,

	Ingo

--------------------->
Subject: mm/migration: Remove anon vma locking from try_to_unmap() use
From: Ingo Molnar
Date: Sat Dec 1 11:22:09 CET 2012

As outlined in:

    mm/migration: Don't lock anon vmas in rmap_walk_anon()

the process-global anon vma mutex locking of the page migration
code can be very expensive.

This removes the second (and last) use of that mutex from the
migration code: try_to_unmap().

Since try_to_unmap() is used by swapout and filesystem code
as well, which does not hold the mmap_sem, we only want to
do this optimization from the migration path.

This patch is ugly and should be replaced via a
try_to_unmap_locked() variant instead which offers us the
unlocked codepath, but it's good enough for testing purposes.

Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Hugh Dickins
Not-Signed-off-by: Ingo Molnar
---
 include/linux/rmap.h |2 +-
 mm/hu