Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
* Linus Torvalds wrote:

> On Sat, Dec 1, 2012 at 1:49 AM, Ingo Molnar wrote:
> >
> > I *think* you are right that for this type of migration that
> > we are using here we indeed don't need to take an exclusive
> > vma lock - in fact I think we don't need to take it at all:
>
> I'm pretty sure we do need at least a read-side reference.

Ok.

> Even if no other MM can contain that particular pte, the vma
> lock protects the chain that is created by fork and exit and
> vma splitting etc.
>
> So it's enough that another thread does a fork() at the same
> time. Or a partial unmap of the vma (that splits it in two),
> for the rmap chain to be modified.
>
> Besides, there's absolutely nothing that protects that vma
> from being part of the same vma chain in entirely unrelated
> processes. The vma chain can get quite long over multiple
> forks (it's even a performance problem under some extreme
> loads).
>
> And we do walk the rmap chain - so we need the lock.
>
> But we walk it read-only afaik, which is why I think the
> semaphore could be an rwsem.
>
> Now, there *are* likely cases where we could avoid anon_vma
> locking entirely, but they are very specialized. They'd be
> along the lines of
>
>  - we hold the page table lock
>  - we see that vma->anon_vma == vma->anon_vma->root
>  - we see that vma->anon_vma->refcount == 1
>
> or similar, because then we can guarantee that the anon-vma
> chain has a length of one without even locking, and holding
> the page table lock means that any concurrent fork or
> mmap/munmap from another thread will block on this particular
> pte.

Hm. These conditions may be true for some pretty common cases,
but it's difficult to discover that information from the
migration code, due to the way we discover all anon vmas and
walk the anon vma list: we first lock the anon vma, and only
then do we get to iterate over the individual vmas and do the
pte changes. So it's the wrong way around.

I think your rwsem suggestion is a lot more generic and more
robust as well.

> So I suspect that in the above kind of special case (which
> might be a somewhat common case for normal page faults, for
> example) we could make a "we have exclusive pte access to this
> page" argument. But quite frankly, I completely made the above
> rules up in my head, they may be bogus too.
>
> For the general migration case, it's definitely not possible
> to just drop the anon_vma lock.

Ok, I see - I'll redo this part then and try out how an rwsem
fares. I suspect it would give a small speedup to a fair number
of workloads, so it's worthwhile to spend some time on it.

Thanks for the suggestions!

	Ingo
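[ For illustration only - a minimal sketch of what the rwsem
  conversion discussed above could look like, assuming the
  anon_vma mutex is simply replaced by an rw_semaphore field;
  the field and helper names below are illustrative, not taken
  from an actual patch: ]

	#include <linux/rwsem.h>

	struct anon_vma {
		struct anon_vma *root;		/* root of this anon_vma tree */
		struct rw_semaphore rwsem;	/* was: struct mutex mutex */
		atomic_t refcount;
		/* ... */
	};

	/* Writers - fork, unmap, vma splitting - keep full exclusion: */
	static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
	{
		down_write(&anon_vma->root->rwsem);
	}

	/* Read-only rmap walkers, such as migration, can then share: */
	static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
	{
		down_read(&anon_vma->root->rwsem);
	}

	static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
	{
		up_read(&anon_vma->root->rwsem);
	}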
Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
On Sat, Dec 1, 2012 at 1:49 AM, Ingo Molnar wrote:
>
> I *think* you are right that for this type of migration that we
> are using here we indeed don't need to take an exclusive vma
> lock - in fact I think we don't need to take it at all:

I'm pretty sure we do need at least a read-side reference.

Even if no other MM can contain that particular pte, the vma
lock protects the chain that is created by fork and exit and
vma splitting etc.

So it's enough that another thread does a fork() at the same
time. Or a partial unmap of the vma (that splits it in two), for
the rmap chain to be modified.

Besides, there's absolutely nothing that protects that vma from
being part of the same vma chain in entirely unrelated
processes. The vma chain can get quite long over multiple forks
(it's even a performance problem under some extreme loads).

And we do walk the rmap chain - so we need the lock.

But we walk it read-only afaik, which is why I think the
semaphore could be an rwsem.

Now, there *are* likely cases where we could avoid anon_vma
locking entirely, but they are very specialized. They'd be along
the lines of

 - we hold the page table lock
 - we see that vma->anon_vma == vma->anon_vma->root
 - we see that vma->anon_vma->refcount == 1

or similar, because then we can guarantee that the anon-vma
chain has a length of one without even locking, and holding the
page table lock means that any concurrent fork or mmap/munmap
from another thread will block on this particular pte.

So I suspect that in the above kind of special case (which might
be a somewhat common case for normal page faults, for example)
we could make a "we have exclusive pte access to this page"
argument. But quite frankly, I completely made the above rules
up in my head, they may be bogus too.

For the general migration case, it's definitely not possible to
just drop the anon_vma lock.

	Linus
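[ As a sketch only - the self-confessedly made-up rules above,
  turned into code; is_exclusive_anon_vma() is a hypothetical
  helper, not an existing kernel function, and the caller is
  assumed to hold the page table lock: ]

	/*
	 * Can we skip anon_vma locking entirely? Only if the anon-vma
	 * chain provably has length one: the anon_vma is its own root
	 * and nobody else holds a reference. With the page table lock
	 * held, a concurrent fork or mmap/munmap from another thread
	 * blocks on this particular pte.
	 */
	static bool is_exclusive_anon_vma(struct vm_area_struct *vma)
	{
		struct anon_vma *anon_vma = vma->anon_vma;

		return anon_vma &&
		       anon_vma == anon_vma->root &&
		       atomic_read(&anon_vma->refcount) == 1;
	}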
Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
On 12/01/2012 04:49 AM, Ingo Molnar wrote:
> * Linus Torvalds wrote:
>
>> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar wrote:
>>>
>>> When pushed hard enough via threaded workloads (for example
>>> via the numa02 test) then the upstream page migration code
>>> in mm/migration.c becomes unscalable, resulting in a lot of
>>> scheduling on the anon vma mutex and a subsequent drop in
>>> performance.
>>
>> Ugh.
>>
>> I wonder if migration really needs that thing to be a mutex?
>> I may be wrong, but the anon_vma lock only protects the
>> actual rmap chains, and migration only ever changes the pte
>> *contents*, not the actual chains of pte's themselves, right?
>>
>> So if this is a migration-specific scalability issue, then it
>> might be possible to solve by making the mutex be a rwsem
>> instead, and have migration only take it for reading.
>>
>> Of course, I'm quite possibly wrong, and the code depends on
>> full mutual exclusion.
>>
>> Just a thought, in case it makes somebody go "Hmm.."
>
> I *think* you are right that for this type of migration that
> we are using here we indeed don't need to take an exclusive
> vma lock - in fact I think we don't need to take it at all:
>
> The main goal in the migration code is to unmap the pte from
> all threads' MMU visibility, before we copy its contents into
> another page [located on another node] and map that page into
> the page tables instead of the old page.
>
> No other thread must have a write reference to the old page
> when the copying [migrate_page_copy()] is performed, or we
> corrupt user-space memory subtly via copying a slightly older
> version of user-space memory.
>
> rmap_walk() OTOH appears to have been written as a general
> purpose function, to be usable without holding the mmap_sem as
> well, so it is written to protect against the disappearance of
> anon vmas.
>
> But ... in all upstream and NUMA-migration codepaths I could
> find - and AFAICS in all other page-migration codepaths as
> well, including sys_move_pages() - anon vmas cannot disappear
> from under us, because we are already holding the mmap_sem.
>
> [ Initially I assumed that swapout or filesystem code could
>   somehow call this without holding the mmap sem - but could
>   not find any such code path. ]
>
> So I think we could get away rather simply, with something
> like the (entirely and utterly untested!) patch below.
>
> But ... judging from the code, my feeling is this can only be
> the first (and easiest) step:
>
> 1) This patch might solve the remapping
>    (remove_migration_ptes()), but does not solve the anon-vma
>    locking done in the first, unmapping step of pte-migration
>    - which is done via try_to_unmap(): which is a generic VM
>    function used by swapout too, so callers do not necessarily
>    hold the mmap_sem.
>
>    A new TTU flag might solve it, although I detest
>    flag-driven locking semantics with a passion: splitting out
>    unlocked versions of try_to_unmap_anon(),
>    try_to_unmap_ksm() and try_to_unmap_file(), and
>    constructing an unlocked try_to_unmap() out of them, to be
>    used by the migration code, would be the cleaner option.
>
> 2) Taking a process-global mutex 1024 times per 2MB was indeed
>    very expensive - and let's assume that we manage to sort
>    that out - but then we are AFAICS exposed to the next
>    layer: the fine-grained migrate_pages() model where the
>    migration code flushes the TLB 512 times per 2MB to unmap
>    and remap it again and again at 4K granularity ...
> Assuming the simpler patch goes fine, I'll try to come up with
> something intelligent for the TLB flushing sub-problem too: we
> could in theory batch the migration TLB flushes as well, by
> first doing an array of 2MB granular unmaps, then copying up
> to 512x 4K pages, then doing the final 2MB granular [but still
> 4K-represented in the page tables] remap.
>
> 2MB granular TLB flushing is OK for these workloads, I can see
> that in +THP tests.
>
> I will keep you updated about how far I manage to get down
> this road.
>
> Thanks,
>
> 	Ingo
>
> Subject: mm/migration: Don't lock anon vmas in rmap_walk_anon()
> From: Ingo Molnar
> Date: Thu Nov 22 14:16:26 CET 2012
>
> rmap_walk_anon() appears to be too careful about locking the
> anon vma for its own good - since all callers are holding the
> mmap_sem, no vma can go away from under us:
>
> - sys_move_pages() is doing down_read(&mm->mmap_sem) in the
>   sys_move_pages() -> do_pages_move() ->
>   do_move_page_to_node_array() code path, which then calls
>   migrate_pages(pagelist), which then does unmap_and_move()
>   for every page in the list, which does
>   remove_migration_ptes() which calls rmap.c::try_to_unmap().
>
> - the NUMA migration code's migrate_misplaced_page(), which
>   calls migrate_pages() ... try_to_unmap(), is holding the
>   mm->mmap_sem read-locked by virtue of the low level page
>   fault handler taking it before calling handle_mm_fault().
>
> Removing this lock removes a global mutex from the hot path of
> migration-happy threaded workloads, which can cause
> pathological performance like this:
>
>     96.43%  process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
>             |
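[ For illustration only - the batching idea quoted above,
  reduced to its core: one ranged flush per 2MB chunk instead of
  512 individual 4K flushes. migrate_chunk_unmap() and
  set_migration_pte() are hypothetical helpers, not existing
  kernel functions: ]

	/*
	 * Unmap a 2MB-aligned chunk of 4K ptes for migration, then
	 * flush the TLB once for the whole range instead of once
	 * per pte:
	 */
	static void migrate_chunk_unmap(struct vm_area_struct *vma,
					unsigned long start)
	{
		unsigned long addr;

		/* Replace each pte with a migration entry, no per-page flush: */
		for (addr = start; addr < start + HPAGE_PMD_SIZE; addr += PAGE_SIZE)
			set_migration_pte(vma, addr);	/* hypothetical per-pte helper */

		/* One 2MB-granular flush covers all 512 pages: */
		flush_tlb_range(vma, start, start + HPAGE_PMD_SIZE);
	}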
[RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
* Linus Torvalds wrote:

> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar wrote:
> >
> > When pushed hard enough via threaded workloads (for example
> > via the numa02 test) then the upstream page migration code
> > in mm/migration.c becomes unscalable, resulting in a lot of
> > scheduling on the anon vma mutex and a subsequent drop in
> > performance.
>
> Ugh.
>
> I wonder if migration really needs that thing to be a mutex? I
> may be wrong, but the anon_vma lock only protects the actual
> rmap chains, and migration only ever changes the pte
> *contents*, not the actual chains of pte's themselves, right?
>
> So if this is a migration-specific scalability issue, then it
> might be possible to solve by making the mutex be a rwsem
> instead, and have migration only take it for reading.
>
> Of course, I'm quite possibly wrong, and the code depends on
> full mutual exclusion.
>
> Just a thought, in case it makes somebody go "Hmm.."

I *think* you are right that for this type of migration that we
are using here we indeed don't need to take an exclusive vma
lock - in fact I think we don't need to take it at all:

The main goal in the migration code is to unmap the pte from all
threads' MMU visibility, before we copy its contents into
another page [located on another node] and map that page into
the page tables instead of the old page.

No other thread must have a write reference to the old page when
the copying [migrate_page_copy()] is performed, or we corrupt
user-space memory subtly via copying a slightly older version of
user-space memory.

rmap_walk() OTOH appears to have been written as a general
purpose function, to be usable without holding the mmap_sem as
well, so it is written to protect against the disappearance of
anon vmas.

But ... in all upstream and NUMA-migration codepaths I could
find - and AFAICS in all other page-migration codepaths as well,
including sys_move_pages() - anon vmas cannot disappear from
under us, because we are already holding the mmap_sem.

[ Initially I assumed that swapout or filesystem code could
  somehow call this without holding the mmap sem - but could not
  find any such code path. ]

So I think we could get away rather simply, with something like
the (entirely and utterly untested!) patch below.

But ... judging from the code, my feeling is this can only be
the first (and easiest) step:

1) This patch might solve the remapping
   (remove_migration_ptes()), but does not solve the anon-vma
   locking done in the first, unmapping step of pte-migration -
   which is done via try_to_unmap(): which is a generic VM
   function used by swapout too, so callers do not necessarily
   hold the mmap_sem.

   A new TTU flag might solve it, although I detest flag-driven
   locking semantics with a passion: splitting out unlocked
   versions of try_to_unmap_anon(), try_to_unmap_ksm() and
   try_to_unmap_file(), and constructing an unlocked
   try_to_unmap() out of them, to be used by the migration code,
   would be the cleaner option - see the sketch after this list.

2) Taking a process-global mutex 1024 times per 2MB was indeed
   very expensive - and let's assume that we manage to sort that
   out - but then we are AFAICS exposed to the next layer: the
   fine-grained migrate_pages() model where the migration code
   flushes the TLB 512 times per 2MB to unmap and remap it again
   and again at 4K granularity ...
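[ For illustration only - a minimal sketch of the split-out
  unlocked variants mentioned in point 1), assuming each
  existing try_to_unmap_*() helper gains an unlocked sibling;
  none of these *_unlocked names exist in the tree: ]

	/*
	 * Variant of try_to_unmap() for callers such as migration
	 * that already hold the mmap_sem: the anon vma cannot go
	 * away from under us, so the per-type helpers can skip the
	 * anon_vma locking entirely.
	 */
	static int try_to_unmap_unlocked(struct page *page,
					 enum ttu_flags flags)
	{
		if (PageKsm(page))
			return try_to_unmap_ksm_unlocked(page, flags);
		else if (PageAnon(page))
			return try_to_unmap_anon_unlocked(page, flags);
		else
			return try_to_unmap_file_unlocked(page, flags);
	}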
Assuming the simpler patch goes fine, I'll try to come up with
something intelligent for the TLB flushing sub-problem too: we
could in theory batch the migration TLB flushes as well, by
first doing an array of 2MB granular unmaps, then copying up to
512x 4K pages, then doing the final 2MB granular [but still
4K-represented in the page tables] remap.

2MB granular TLB flushing is OK for these workloads, I can see
that in +THP tests.

I will keep you updated about how far I manage to get down this
road.

Thanks,

	Ingo

Subject: mm/migration: Don't lock anon vmas in rmap_walk_anon()
From: Ingo Molnar
Date: Thu Nov 22 14:16:26 CET 2012

rmap_walk_anon() appears to be too careful about locking the
anon vma for its own good - since all callers are holding the
mmap_sem, no vma can go away from under us:

- sys_move_pages() is doing down_read(&mm->mmap_sem) in the
  sys_move_pages() -> do_pages_move() ->
  do_move_page_to_node_array() code path, which then calls
  migrate_pages(pagelist), which then does unmap_and_move() for
  every page in the list, which does remove_migration_ptes()
  which calls rmap.c::try_to_unmap().

- the NUMA migration code's migrate_misplaced_page(), which
  calls migrate_pages() ... try_to_unmap(), is holding the
  mm->mmap_sem read-locked by virtue of the low level page fault
  handler taking it before calling handle_mm_fault().

Removing this lock removes a global mutex from the hot path of
migration-happy threaded workloads, which can cause pathological
performance like this:

    96.43%  process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
            |
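[ The patch body itself is cut off in this archive. For
  illustration only - not Ingo's actual patch - a sketch of the
  change the changelog describes, i.e. rmap_walk_anon() walking
  the anon vma chain without taking the anon_vma lock, roughly
  following the v3.7-era shape of the function: ]

	static int rmap_walk_anon(struct page *page,
				  int (*rmap_one)(struct page *,
						  struct vm_area_struct *,
						  unsigned long, void *),
				  void *arg)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		struct anon_vma *anon_vma;
		struct anon_vma_chain *avc;
		int ret = SWAP_AGAIN;

		anon_vma = page_anon_vma(page);
		if (!anon_vma)
			return ret;

		/*
		 * No anon_vma_lock()/anon_vma_unlock() around the walk:
		 * all callers hold the mmap_sem, so the anon vma chain
		 * cannot go away from under us.
		 */
		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
					       pgoff, pgoff) {
			struct vm_area_struct *vma = avc->vma;
			unsigned long address = vma_address(page, vma);

			ret = rmap_one(page, vma, address, arg);
			if (ret != SWAP_AGAIN)
				break;
		}
		return ret;
	}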