Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On 03/04/2018 02:11, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
>
>> [...]
>>
>> When free_vma() is called, the file associated with the VMA is closed
>> immediately, but the policy and the file structure remain in use until
>> the VMA's use count reaches 0, which may happen later when an
>> in-progress speculative page fault exits.
>>
>> [1] https://patchwork.kernel.org/patch/5108281/
>>
>> Cc: Peter Zijlstra (Intel)
>> Cc: Matthew Wilcox
>> Signed-off-by: Laurent Dufour
>
> Can __free_vma() be generalized for mm/nommu.c's delete_vma() and
> do_mmap()?

Good question!

I guess that if there is no MMU there is no page fault, hence no speculative page fault, and this patch is only required by the speculative page fault handler. By the way, I should probably make CONFIG_SPECULATIVE_PAGE_FAULT depend on CONFIG_MMU (see the sketch below this message).

This being said, if your idea is to extend the mm_rb tree rwlocking to the nommu case, then this is another story, and I am wondering if there is a real need in such a case. But I have to admit I'm not so familiar with kernels built for MMU-less systems.

Am I missing something?

Thanks,
Laurent.
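For the CONFIG_MMU dependency mentioned above, a hypothetical mm/Kconfig fragment could look like the following. This is an illustrative sketch only; the actual symbol definition and its architecture dependencies live elsewhere in the series and may differ:

# Sketch: tie the speculative page fault handler to MMU (and plausibly SMP)
config SPECULATIVE_PAGE_FAULT
	bool "Speculative page fault handler"
	depends on MMU && SMP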
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On 03/04/2018 02:11, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
>
>> [...]
>
> Can __free_vma() be generalized for mm/nommu.c's delete_vma() and
> do_mmap()?

To be honest, I didn't look at mm/nommu.c, assuming that such architectures would probably be single-threaded. Am I wrong?
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> This change is inspired by Peter's proposal patch [1], which protected
> the VMA using SRCU. Unfortunately, SRCU does not scale well in that
> particular case and introduces major performance degradation due to
> excessive scheduling operations.
>
> To allow access to the mm_rb tree without grabbing the mmap_sem, this
> patch protects access to it with a rwlock. As a lookup in the mm_rb
> tree is an O(log n) search, it is safe to protect it with such a lock.
> The VMA cache is not protected by the new rwlock and must not be used
> without holding the mmap_sem.
>
> To allow a VMA structure picked from the tree to be used once the
> rwlock is released, a use count is added to the VMA structure. When the
> VMA is allocated, the count is set to 1. Each time the VMA is picked
> with the rwlock held, its use count is incremented, and each time the
> VMA is released it is decremented. When the use count hits zero, the
> VMA is no longer used and should be freed.
>
> This patch prepares for 2 kinds of VMA access:
> - as usual, under the control of the mmap_sem,
> - without holding the mmap_sem, for the speculative page fault handler.
>
> Access done under the control of the mmap_sem does not require grabbing
> the rwlock for read access to the mm_rb tree, but write access must be
> done under the protection of the rwlock too. This affects inserting and
> removing elements in the RB tree.
>
> The patch introduces 2 new functions:
> - get_vma() to find a VMA based on an address while holding the new
>   rwlock,
> - put_vma() to release the VMA when it is no longer used.
> These services are designed to be used when accesses are made to the RB
> tree without holding the mmap_sem.
>
> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared,
> and we rely on the WMB done when releasing the rwlock to serialize that
> write with the RMB done in a later patch to check the VMA's validity.
>
> When free_vma() is called, the file associated with the VMA is closed
> immediately, but the policy and the file structure remain in use until
> the VMA's use count reaches 0, which may happen later when an
> in-progress speculative page fault exits.
>
> [1] https://patchwork.kernel.org/patch/5108281/
>
> Cc: Peter Zijlstra (Intel)
> Cc: Matthew Wilcox
> Signed-off-by: Laurent Dufour

Can __free_vma() be generalized for mm/nommu.c's delete_vma() and do_mmap()?
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On 14/03/2018 09:48, Peter Zijlstra wrote:
> On Tue, Mar 13, 2018 at 06:59:47PM +0100, Laurent Dufour wrote:
>> This change is inspired by Peter's proposal patch [1], which protected
>> the VMA using SRCU. Unfortunately, SRCU does not scale well in that
>> particular case and introduces major performance degradation due to
>> excessive scheduling operations.
>
> Do you happen to have a little more detail on that?

This has been reported by Kemi, who found bad performance when running some benchmarks on top of the v5 series: https://patchwork.kernel.org/patch/687/

It appears that SRCU generates a lot of additional scheduling to manage the freeing of the VMA structures. SRCU deals with per-CPU resources, but since we are handling a per-process resource (the VMA) through a global resource (SRCU), this leads to a lot of overhead when scheduling the SRCU callbacks.

>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 34fde7111e88..28c763ea1036 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -335,6 +335,7 @@ struct vm_area_struct {
>>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>  	seqcount_t vm_sequence;
>> +	atomic_t vm_ref_count;		/* see get_vma(), put_vma() */
>>  #endif
>>  } __randomize_layout;
>>
>> @@ -353,6 +354,9 @@ struct kioctx_table;
>>  struct mm_struct {
>>  	struct vm_area_struct *mmap;		/* list of VMAs */
>>  	struct rb_root mm_rb;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +	rwlock_t mm_rb_lock;
>> +#endif
>>  	u32 vmacache_seqnum;		/* per-thread vmacache */
>>  #ifdef CONFIG_MMU
>>  	unsigned long (*get_unmapped_area) (struct file *filp,
>
> When I tried this, it simply traded contention on mmap_sem for
> contention on these two cachelines.
>
> This was for the concurrent fault benchmark, where mmap_sem is only
> ever acquired for reading (so no blocking ever happens) and the
> bottleneck was really pure cacheline access.

I'd say that this is expected if multiple threads are dealing with the same VMA, but if the VMAs differ then this contention disappears, whereas it remains when using the mmap_sem.

This being said, the tests I did on PowerPC using will-it-scale/page_fault1_threads showed that the number of cache misses generated in get_vma() is very low (less than 5%). Am I missing something?

> Only by using RCU can you avoid that thrashing.

I agree, but this kind of test is the best use case for SRCU, because there are not that many updates, so not a lot of calls to the SRCU asynchronous callback.

Honestly, I can't see an ideal solution here: RCU is not optimal when there is a high number of updates, and using a rwlock may introduce a bottleneck there. I get better results with the rwlock than with SRCU in that case, but if you have another proposal, please advise, I'll give it a try.

> Also note that if your database allocates the one giant mapping, it'll
> be _one_ VMA and that vm_ref_count gets _very_ hot indeed.

In the case of the database product I mentioned in the series header, it is the opposite: the number of VMAs is very high, so this doesn't happen. But in the case of a single VMA, it's clear that there will be contention on vm_ref_count, although that would still be better than blocking on the mmap_sem.

Laurent.
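To illustrate the scheduling overhead described above: the SRCU scheme of [1] freed VMAs through a single global SRCU domain, roughly along the following lines. This is a hypothetical sketch of that approach, not code from this series; the vm_rcu field and the vma_srcu domain are assumed names.

/* Sketch: every process's VMA frees funnel through one global SRCU
 * domain, so the grace-period machinery and its callback scheduling
 * become a shared, cross-process bottleneck. */
DEFINE_SRCU(vma_srcu);

static void vma_srcu_free(struct rcu_head *head)
{
	struct vm_area_struct *vma;

	vma = container_of(head, struct vm_area_struct, vm_rcu);
	kmem_cache_free(vm_area_cachep, vma);
}

static void free_vma_srcu(struct vm_area_struct *vma)
{
	/* Deferred free: runs only after an SRCU grace period elapses,
	 * which is where the extra scheduling work shows up. */
	call_srcu(&vma_srcu, &vma->vm_rcu, vma_srcu_free);
}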
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On Tue, Mar 13, 2018 at 06:59:47PM +0100, Laurent Dufour wrote:
> This change is inspired by Peter's proposal patch [1], which protected
> the VMA using SRCU. Unfortunately, SRCU does not scale well in that
> particular case and introduces major performance degradation due to
> excessive scheduling operations.

Do you happen to have a little more detail on that?

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 34fde7111e88..28c763ea1036 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -335,6 +335,7 @@ struct vm_area_struct {
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>  	seqcount_t vm_sequence;
> +	atomic_t vm_ref_count;		/* see get_vma(), put_vma() */
>  #endif
>  } __randomize_layout;
>
> @@ -353,6 +354,9 @@ struct kioctx_table;
>  struct mm_struct {
>  	struct vm_area_struct *mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	rwlock_t mm_rb_lock;
> +#endif
>  	u32 vmacache_seqnum;		/* per-thread vmacache */
>  #ifdef CONFIG_MMU
>  	unsigned long (*get_unmapped_area) (struct file *filp,

When I tried this, it simply traded contention on mmap_sem for contention on these two cachelines.

This was for the concurrent fault benchmark, where mmap_sem is only ever acquired for reading (so no blocking ever happens) and the bottleneck was really pure cacheline access.

Only by using RCU can you avoid that thrashing.

Also note that if your database allocates the one giant mapping, it'll be _one_ VMA and that vm_ref_count gets _very_ hot indeed.
[PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
This change is inspired by Peter's proposal patch [1], which protected the VMA using SRCU. Unfortunately, SRCU does not scale well in that particular case and introduces major performance degradation due to excessive scheduling operations.

To allow access to the mm_rb tree without grabbing the mmap_sem, this patch protects access to it with a rwlock. As a lookup in the mm_rb tree is an O(log n) search, it is safe to protect it with such a lock. The VMA cache is not protected by the new rwlock and must not be used without holding the mmap_sem.

To allow a VMA structure picked from the tree to be used once the rwlock is released, a use count is added to the VMA structure. When the VMA is allocated, the count is set to 1. Each time the VMA is picked with the rwlock held, its use count is incremented, and each time the VMA is released it is decremented. When the use count hits zero, the VMA is no longer used and should be freed.

This patch prepares for 2 kinds of VMA access:
- as usual, under the control of the mmap_sem,
- without holding the mmap_sem, for the speculative page fault handler.

Access done under the control of the mmap_sem does not require grabbing the rwlock for read access to the mm_rb tree, but write access must be done under the protection of the rwlock too. This affects inserting and removing elements in the RB tree.

The patch introduces 2 new functions:
- get_vma() to find a VMA based on an address while holding the new rwlock,
- put_vma() to release the VMA when it is no longer used.
These services are designed to be used when accesses are made to the RB tree without holding the mmap_sem.

When a VMA is removed from the RB tree, its vma->vm_rb field is cleared, and we rely on the WMB done when releasing the rwlock to serialize that write with the RMB done in a later patch to check the VMA's validity.

When free_vma() is called, the file associated with the VMA is closed immediately, but the policy and the file structure remain in use until the VMA's use count reaches 0, which may happen later when an in-progress speculative page fault exits.
[1] https://patchwork.kernel.org/patch/5108281/

Cc: Peter Zijlstra (Intel)
Cc: Matthew Wilcox
Signed-off-by: Laurent Dufour
---
 include/linux/mm_types.h |   4 ++
 kernel/fork.c            |   3 ++
 mm/init-mm.c             |   3 ++
 mm/internal.h            |   6 +++
 mm/mmap.c                | 122 +++++++++++++++++++++++++++++++++----------
 5 files changed, 106 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 34fde7111e88..28c763ea1036 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
 	seqcount_t vm_sequence;
+	atomic_t vm_ref_count;		/* see get_vma(), put_vma() */
 #endif
 } __randomize_layout;

@@ -353,6 +354,9 @@ struct kioctx_table;
 struct mm_struct {
 	struct vm_area_struct *mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	rwlock_t mm_rb_lock;
+#endif
 	u32 vmacache_seqnum;		/* per-thread vmacache */
 #ifdef CONFIG_MMU
 	unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index a32e1c4311b2..9ecac4f725b9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -889,6 +889,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
 	mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	rwlock_init(&mm->mm_rb_lock);
+#endif
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d15ebc0..e71ac37a98c4 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -17,6 +17,9 @@
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	.mm_rb_lock	= __RW_LOCK_UNLOCKED(init_mm.mm_rb_lock),
+#endif
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..fb2667b20f0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,12 @@ void page_writeback_init(void);

 int do_swap_page(struct vm_fault *vmf);

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern struct vm_area_struct *get_vma(struct mm_struct *mm,
+				      unsigned long addr);
+extern void put_vma(struct vm_area_struct *vma);
+#endif
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);

diff --git a/mm/mmap.c b/mm/mmap.c
index ac32b577a0c9..182359a5445c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -160,6 +160,27
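Since the mm/mmap.c hunk is truncated above, here is a hedged sketch of the helpers the changelog describes: get_vma(), put_vma(), the deferred __free_vma(), and the write-side erase. Field and function names follow the patch; the bodies are illustrative reconstructions from the changelog, not the actual hunk.

/* Sketch: lookup without the mmap_sem. Mirrors find_vma()'s rbtree
 * descent but deliberately skips the vmacache, which must not be used
 * without holding the mmap_sem. */
struct vm_area_struct *get_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *ret = NULL, *vma;
	struct rb_node *rb_node;

	read_lock(&mm->mm_rb_lock);
	rb_node = mm->mm_rb.rb_node;
	while (rb_node) {
		vma = rb_entry(rb_node, struct vm_area_struct, vm_rb);
		if (vma->vm_end > addr) {
			ret = vma;
			if (vma->vm_start <= addr)
				break;
			rb_node = rb_node->rb_left;
		} else {
			rb_node = rb_node->rb_right;
		}
	}
	if (ret)
		atomic_inc(&ret->vm_ref_count);	/* pin it past the unlock */
	read_unlock(&mm->mm_rb_lock);

	return ret;
}

/* Sketch: deferred part of freeing. The file was already closed at
 * unmap time; the mempolicy and the file structure are only dropped
 * once the last user is gone. */
static void __free_vma(struct vm_area_struct *vma)
{
	if (vma->vm_file)
		fput(vma->vm_file);
	mpol_put(vma_policy(vma));
	kmem_cache_free(vm_area_cachep, vma);
}

void put_vma(struct vm_area_struct *vma)
{
	if (atomic_dec_and_test(&vma->vm_ref_count))
		__free_vma(vma);
}

/* Sketch: write side. Removal takes the rwlock in write mode and
 * clears vm_rb; the unlock's release semantics provide the WMB that
 * the later validity-check RMB pairs with. */
static void vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
{
	write_lock(&mm->mm_rb_lock);
	rb_erase_augmented(&vma->vm_rb, &mm->mm_rb, &vma_gap_callbacks);
	RB_CLEAR_NODE(&vma->vm_rb);
	write_unlock(&mm->mm_rb_lock);
}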