Re: [PATCH v10 09/25] mm: protect VMA modifications using VMA sequence count
On 23/04/2018 09:19, Minchan Kim wrote:
> On Tue, Apr 17, 2018 at 04:33:15PM +0200, Laurent Dufour wrote:
>> The VMA sequence count has been introduced to allow fast detection of
>> VMA modification when running a page fault handler without holding
>> the mmap_sem.
>>
>> This patch provides protection against the VMA modifications done in:
>>  - madvise()
>>  - mpol_rebind_policy()
>>  - vma_replace_policy()
>>  - change_prot_numa()
>>  - mlock(), munlock()
>>  - mprotect()
>>  - mmap_region()
>>  - collapse_huge_page()
>>  - userfaultfd registering services
>>
>> In addition, VMA fields which will be read during the speculative fault
>> path need to be written using WRITE_ONCE to prevent writes from being
>> split and intermediate values from being made visible to other CPUs.
>>
>> Signed-off-by: Laurent Dufour
>> ---
>>  fs/proc/task_mmu.c |  5 -
>>  fs/userfaultfd.c   | 17 +
>>  mm/khugepaged.c    |  3 +++
>>  mm/madvise.c       |  6 +-
>>  mm/mempolicy.c     | 51 ++-
>>  mm/mlock.c         | 13 -
>>  mm/mmap.c          | 22 +-
>>  mm/mprotect.c      |  4 +++-
>>  mm/swap_state.c    |  8 ++--
>>  9 files changed, 89 insertions(+), 40 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index c486ad4b43f0..aeb417f28839 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>>  				goto out_mm;
>>  			}
>>  			for (vma = mm->mmap; vma; vma = vma->vm_next) {
>> -				vma->vm_flags &= ~VM_SOFTDIRTY;
>> +				vm_write_begin(vma);
>> +				WRITE_ONCE(vma->vm_flags,
>> +					   vma->vm_flags & ~VM_SOFTDIRTY);
>>  				vma_set_page_prot(vma);
>> +				vm_write_end(vma);
>
> trivial:
>
> I think it's tricky to maintain the rule that VMA fields read during SPF
> must be accessed with READ_ONCE/WRITE_ONCE. I think we need some accessor
> to read/write them rather than raw accesses like the one in
> vma_set_page_prot. Maybe an spf prefix would be helpful.
>
> 	vma_spf_set_value(vma, vm_flags, val);
>
> We could also add some markers in vm_area_struct's fields to indicate
> that people shouldn't access those fields directly.
>
> Just a thought.

At the beginning I liked that idea, but I'm not sure it would change the
code much: most of the time the vm_write_begin()/end() pair surrounds a
section of code larger than a single VMA field update. An accessor would
be applicable for this particular case and a few others, but that's not
the majority.

Thanks,
Laurent.

>>  			}
>>  			downgrade_write(&mm->mmap_sem);
>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index fe079756bb18..8a8a402ed59f 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -575,6 +575,10 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>>   * the readahead.
>>   *
>>   * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
>> + * This is needed to ensure the VMA will not be freed behind our back. In
>> + * the case of the speculative page fault handler, this cannot happen,
>> + * even if we don't hold the mmap_sem. Callees are assumed to take care
>> + * of reading the VMA's fields
>
> I guess the reader will be curious about *why* this is safe with SPF.
> A comment about the why would be helpful for reviewers.
>
>> + * using READ_ONCE() to read consistent values.
>>   */
>>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>  				struct vm_fault *vmf)
>> @@ -668,9 +672,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>  				      unsigned long *start,
>>  				      unsigned long *end)
>>  {
>> -	*start = max3(lpfn, PFN_DOWN(vma->vm_start),
>> +	*start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>  		      PFN_DOWN(faddr & PMD_MASK));
>> -	*end = min3(rpfn, PFN_DOWN(vma->vm_end),
>> +	*end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>  		    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>  }
>>
>> --
>> 2.7.4
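For context when reading the hunks above: vm_write_begin()/vm_write_end()
are introduced earlier in this series as thin wrappers around a per-VMA
sequence count. A minimal sketch of that write-side pattern, assuming a
seqcount_t field named vm_sequence in struct vm_area_struct (the names
follow this series; the exact implementation there may differ):

	/* Sketch only; assumes vma->vm_sequence (seqcount_t) exists. */
	static inline void vm_write_begin(struct vm_area_struct *vma)
	{
		/* Caller holds mmap_sem for writing, so writers are serialized. */
		write_seqcount_begin(&vma->vm_sequence);
	}

	static inline void vm_write_end(struct vm_area_struct *vma)
	{
		/* Close the write section; readers that raced notice and retry. */
		write_seqcount_end(&vma->vm_sequence);
	}

This is also why the begin/end pair typically brackets a whole update
section rather than a single store: every field written between the two
calls is invalidated at once for speculative readers.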
Re: [PATCH v10 09/25] mm: protect VMA modifications using VMA sequence count
On Tue, Apr 17, 2018 at 04:33:15PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
>
> This patch provides protection against the VMA modifications done in:
>  - madvise()
>  - mpol_rebind_policy()
>  - vma_replace_policy()
>  - change_prot_numa()
>  - mlock(), munlock()
>  - mprotect()
>  - mmap_region()
>  - collapse_huge_page()
>  - userfaultfd registering services
>
> In addition, VMA fields which will be read during the speculative fault
> path need to be written using WRITE_ONCE to prevent writes from being
> split and intermediate values from being made visible to other CPUs.
>
> Signed-off-by: Laurent Dufour
> ---
>  fs/proc/task_mmu.c |  5 -
>  fs/userfaultfd.c   | 17 +
>  mm/khugepaged.c    |  3 +++
>  mm/madvise.c       |  6 +-
>  mm/mempolicy.c     | 51 ++-
>  mm/mlock.c         | 13 -
>  mm/mmap.c          | 22 +-
>  mm/mprotect.c      |  4 +++-
>  mm/swap_state.c    |  8 ++--
>  9 files changed, 89 insertions(+), 40 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c486ad4b43f0..aeb417f28839 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  				goto out_mm;
>  			}
>  			for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -				vma->vm_flags &= ~VM_SOFTDIRTY;
> +				vm_write_begin(vma);
> +				WRITE_ONCE(vma->vm_flags,
> +					   vma->vm_flags & ~VM_SOFTDIRTY);
>  				vma_set_page_prot(vma);
> +				vm_write_end(vma);

trivial:

I think it's tricky to maintain the rule that VMA fields read during SPF
must be accessed with READ_ONCE/WRITE_ONCE. I think we need some accessor
to read/write them rather than raw accesses like the one in
vma_set_page_prot. Maybe an spf prefix would be helpful.

	vma_spf_set_value(vma, vm_flags, val);

We could also add some markers in vm_area_struct's fields to indicate
that people shouldn't access those fields directly.

Just a thought.

>  			}
>  			downgrade_write(&mm->mmap_sem);

> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fe079756bb18..8a8a402ed59f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -575,6 +575,10 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>   * the readahead.
>   *
>   * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
> + * This is needed to ensure the VMA will not be freed behind our back. In
> + * the case of the speculative page fault handler, this cannot happen,
> + * even if we don't hold the mmap_sem. Callees are assumed to take care
> + * of reading the VMA's fields

I guess the reader will be curious about *why* this is safe with SPF.
A comment about the why would be helpful for reviewers.

> + * using READ_ONCE() to read consistent values.
>   */
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  				struct vm_fault *vmf)
> @@ -668,9 +672,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>  				      unsigned long *start,
>  				      unsigned long *end)
> {
> -	*start = max3(lpfn, PFN_DOWN(vma->vm_start),
> +	*start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>  		      PFN_DOWN(faddr & PMD_MASK));
> -	*end = min3(rpfn, PFN_DOWN(vma->vm_end),
> +	*end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>  		    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>  }
>
> --
> 2.7.4
>
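To make the accessor suggestion above concrete: a hypothetical
vma_spf_set_value() (the name is proposed in the review above; this
sketch is not code from the series) could bundle the sequence-count
bracket with the WRITE_ONCE() so callers cannot forget either half:

	/* Hypothetical accessor, per the suggestion above. */
	#define vma_spf_set_value(vma, field, val)		\
		do {						\
			vm_write_begin(vma);			\
			WRITE_ONCE((vma)->field, (val));	\
			vm_write_end(vma);			\
		} while (0)

	/* Usage, e.g.: vma_spf_set_value(vma, vm_flags, new_flags); */

As Laurent notes in his reply above, such an accessor only fits updates
of a single field, while most begin/end sections in this patch protect
several field updates (and calls like vma_set_page_prot()) at once.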
[PATCH v10 09/25] mm: protect VMA modifications using VMA sequence count
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
 - madvise()
 - mpol_rebind_policy()
 - vma_replace_policy()
 - change_prot_numa()
 - mlock(), munlock()
 - mprotect()
 - mmap_region()
 - collapse_huge_page()
 - userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE to prevent writes from being
split and intermediate values from being made visible to other CPUs.

Signed-off-by: Laurent Dufour
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c    |  3 +++
 mm/madvise.c       |  6 +-
 mm/mempolicy.c     | 51 ++-
 mm/mlock.c         | 13 -
 mm/mmap.c          | 22 +-
 mm/mprotect.c      |  4 +++-
 mm/swap_state.c    |  8 ++--
 9 files changed, 89 insertions(+), 40 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c486ad4b43f0..aeb417f28839 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				goto out_mm;
 			}
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
-				vma->vm_flags &= ~VM_SOFTDIRTY;
+				vm_write_begin(vma);
+				WRITE_ONCE(vma->vm_flags,
+					   vma->vm_flags & ~VM_SOFTDIRTY);
 				vma_set_page_prot(vma);
+				vm_write_end(vma);
 			}
 			downgrade_write(&mm->mmap_sem);
 			break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cec550c8468f..b8212ba17695 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+		vm_write_begin(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		WRITE_ONCE(vma->vm_flags,
+			   vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+		vm_write_end(vma);
 		return 0;
 	}
 
@@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 			vma = prev;
 		else
 			prev = vma;
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vm_write_end(vma);
 	}
 	up_write(&mm->mmap_sem);
 	mmput(mm);
@@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx.ctx = ctx;
+		vm_write_end(vma);
 
 	skip:
 		prev = vma;
@@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
-		vma->vm_flags = new_flags;
+		vm_write_begin(vma);
+		WRITE_ONCE(vma->vm_flags, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vm_write_end(vma);
 
 	skip:
 		prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7b2a4bf8671..0b28af4b950d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (mm_find_pmd(mm, address) != pmd)
 		goto out;
 
+	vm_write_begin(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
 	pte = pte_offset_map(pmd, address);
@@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
+		vm_write_end(vma);
 		result = SCAN_FAIL;
 		goto
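For reference, the WRITE_ONCE() requirement in the commit message pairs
with the speculative read side. An illustrative reader-side sketch (not
code from this patch; the series aborts to the regular, mmap_sem-protected
fault path on a lost race rather than looping, but the validation idea is
the same):

	/* Illustrative only: snapshot VMA bounds without holding mmap_sem. */
	static bool vma_snapshot_bounds(struct vm_area_struct *vma,
					unsigned long *start,
					unsigned long *end)
	{
		unsigned int seq;

		do {
			seq = read_seqcount_begin(&vma->vm_sequence);
			/*
			 * READ_ONCE() pairs with the writers' WRITE_ONCE():
			 * the loads cannot be torn or silently re-issued by
			 * the compiler.
			 */
			*start = READ_ONCE(vma->vm_start);
			*end = READ_ONCE(vma->vm_end);
		} while (read_seqcount_retry(&vma->vm_sequence, seq));

		return *start < *end;
	}

Without WRITE_ONCE() on the update side the compiler could split a store
into pieces, and a lockless reader sampling the field outside a validated
section (as swap_ra_clamp_pfn() does in the hunk quoted earlier in the
thread) could observe a value that was never actually stored.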