Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Thu, Aug 31, 2017 at 12:25:16PM +0530, Anshuman Khandual wrote:
> On 08/30/2017 03:02 PM, Laurent Dufour wrote:
> > On 30/08/2017 07:58, Peter Zijlstra wrote:
> >> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
> >>> diff --git a/mm/filemap.c b/mm/filemap.c
> >>> index a497024..08f3042 100644
> >>> --- a/mm/filemap.c
> >>> +++ b/mm/filemap.c
> >>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
> >>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> >>> 			unsigned int flags)
> >>>  {
> >>> +	if (flags & FAULT_FLAG_SPECULATIVE) {
> >>> +		if (flags & FAULT_FLAG_KILLABLE) {
> >>> +			int ret;
> >>> +
> >>> +			ret = __lock_page_killable(page);
> >>> +			if (ret)
> >>> +				return 0;
> >>> +		} else
> >>> +			__lock_page(page);
> >>> +		return 1;
> >>> +	}
> >>> +
> >>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>  		/*
> >>>  		 * CAUTION! In this case, mmap_sem is not released
> >>
> >> Yeah, that looks right.
> >
> > Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
> > speculative path in that case to match the semantics of
> > __lock_page_or_retry().
>
> Doing that would force us to have another retry through the classic fault
> path, wasting all the work done so far through SPF. Hence it may be
> better to just wait, take the lock here and complete the fault. Peterz,
> would you agree? Or should we do as suggested by Laurent? Moreover,
> forcing FAULT_FLAG_RETRY_NOWAIT on FAULT_FLAG_SPECULATIVE at this point
> would look like a hack.

Is there ever a situation where SPECULATIVE and NOWAIT are used together?
That seems like something to avoid.

A git-grep seems to suggest gup() can set it, but gup() will not be doing
speculative faults. s390 also sets it, but then again, they don't have
speculative fault support yet, and when they do they can avoid setting
them together.

So maybe put in a WARN_ON_ONCE() on having both of them. It is not a
combination that makes sense to me, but maybe someone sees a rationale
for it?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/30/2017 03:02 PM, Laurent Dufour wrote:
> On 30/08/2017 07:58, Peter Zijlstra wrote:
>> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index a497024..08f3042 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>>> 			unsigned int flags)
>>>  {
>>> +	if (flags & FAULT_FLAG_SPECULATIVE) {
>>> +		if (flags & FAULT_FLAG_KILLABLE) {
>>> +			int ret;
>>> +
>>> +			ret = __lock_page_killable(page);
>>> +			if (ret)
>>> +				return 0;
>>> +		} else
>>> +			__lock_page(page);
>>> +		return 1;
>>> +	}
>>> +
>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>  		/*
>>>  		 * CAUTION! In this case, mmap_sem is not released
>>
>> Yeah, that looks right.
>
> Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
> speculative path in that case to match the semantics of
> __lock_page_or_retry().

Doing that would force us to have another retry through the classic fault
path, wasting all the work done so far through SPF. Hence it may be better
to just wait, take the lock here and complete the fault. Peterz, would you
agree? Or should we do as suggested by Laurent? Moreover, forcing
FAULT_FLAG_RETRY_NOWAIT on FAULT_FLAG_SPECULATIVE at this point would look
like a hack.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 30/08/2017 07:03, Anshuman Khandual wrote: > On 08/29/2017 07:15 PM, Peter Zijlstra wrote: >> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote: >>> On 29/08/2017 14:04, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote: > On 27/08/2017 02:18, Kirill A. Shutemov wrote: >>> + >>> + if (unlikely(!vma->anon_vma)) >>> + goto unlock; >> >> It deserves a comment. > > You're right I'll add it in the next version. > For the record, the root cause is that __anon_vma_prepare() requires the > mmap_sem to be held because vm_next and vm_prev must be safe. But should that test not be: if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) goto unlock; Because !anon vmas will never have ->anon_vma set and you don't want to exclude those. >>> >>> Yes in the case we later allow non anonymous vmas to be handled. >>> Currently only anonymous vmas are supported so the check is good enough, >>> isn't it ? >> >> That wasn't at all clear from reading the code. This makes it clear >> ->anon_vma is only ever looked at for anonymous. >> >> And like Kirill says, we _really_ should start allowing some (if not >> all) vm_ops. Large file based mappings aren't particularly rare. >> >> I'm not sure we want to introduce a white-list or just bite the bullet >> and audit all ->fault() implementations. But either works and isn't >> terribly difficult, auditing all is more work though. > > filemap_fault() is used as vma-vm_ops->fault() for most of the file > systems. Changing it can enable speculative fault support for all of > them. It will still exclude other driver based vma-vm_ops->fault() > implementation. AFAICS, __lock_page_or_retry() function can drop > mm->mmap_sem if the page could not be locked right away. As suggested > by Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good > enough. The patch is lightly tested for file mappings on top of this > series. 
Hi Anshuman,

This sounds pretty good, except for the FAULT_FLAG_RETRY_NOWAIT case I
mentioned in another mail.

The next step would be to find a way to discriminate between the
vm_ops->fault() implementations. Any idea?

Thanks,
Laurent.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> 			unsigned int flags)
>  {
> +	if (flags & FAULT_FLAG_SPECULATIVE) {
> +		if (flags & FAULT_FLAG_KILLABLE) {
> +			int ret;
> +
> +			ret = __lock_page_killable(page);
> +			if (ret)
> +				return 0;
> +		} else
> +			__lock_page(page);
> +		return 1;
> +	}
> +
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>  		/*
>  		 * CAUTION! In this case, mmap_sem is not released
> diff --git a/mm/memory.c b/mm/memory.c
> index 549d235..02347f3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  	if (!vmf->pte) {
>  		if (vma_is_anonymous(vmf->vma))
>  			return do_anonymous_page(vmf);
> -		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> -			return VM_FAULT_RETRY;
>  		else
>  			return do_fault(vmf);
>  	}
> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
> unsigned long address,
>  		goto unlock;
>  	}
>
> -	/*
> -	 * Can't call vm_ops service has we don't know what they would do
> -	 * with the VMA.
> -	 * This include huge page from hugetlbfs.
> -	 */
> -	if (vma->vm_ops) {
> -		trace_spf_vma_notsup(_RET_IP_, vma, address);
> -		goto unlock;
> -	}
> -
> -	if (unlikely(!vma->anon_vma)) {
> +	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>  		trace_spf_vma_notsup(_RET_IP_, vma, address);
>  		goto unlock;
>  	}
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 30/08/2017 07:58, Peter Zijlstra wrote:
> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a497024..08f3042 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>> 			unsigned int flags)
>>  {
>> +	if (flags & FAULT_FLAG_SPECULATIVE) {
>> +		if (flags & FAULT_FLAG_KILLABLE) {
>> +			int ret;
>> +
>> +			ret = __lock_page_killable(page);
>> +			if (ret)
>> +				return 0;
>> +		} else
>> +			__lock_page(page);
>> +		return 1;
>> +	}
>> +
>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>  		/*
>>  		 * CAUTION! In this case, mmap_sem is not released
>
> Yeah, that looks right.

Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
speculative path in that case to match the semantics of
__lock_page_or_retry().

>> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
>> unsigned long address,
>>  		goto unlock;
>>  	}
>>
>> +	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>>  		trace_spf_vma_notsup(_RET_IP_, vma, address);
>>  		goto unlock;
>>  	}
>
> As riel pointed out on IRC slightly later, private file maps also need
> ->anon_vma and those actually have ->vm_ops IIRC so the condition needs
> to be slightly more complicated.

Yes, I read the code again and came to the same conclusion.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
>> +/*
>> + * vm_normal_page() adds some processing which should be done while
>> + * hodling the mmap_sem.
>> + */
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> +			     unsigned int flags)
>> +{
>> +	struct vm_fault vmf = {
>> +		.address = address,
>> +	};
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *pol;
>> +
>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> +	flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> +	idx = srcu_read_lock(&vma_srcu);
>> +	vma = find_vma_srcu(mm, address);
>> +	if (!vma)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Validate the VMA found by the lockless lookup.
>> +	 */
>> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>> +	if ((seq & 1) || dead)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Can't call vm_ops service has we don't know what they would do
>> +	 * with the VMA.
>> +	 * This include huge page from hugetlbfs.
>> +	 */
>> +	if (vma->vm_ops)
>> +		goto unlock;
>
> I think we need to have a way to white-list safe ->vm_ops.
>
>> +
>> +	if (unlikely(!vma->anon_vma))
>> +		goto unlock;
>
> It deserves a comment.
>
>> +
>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> +	/* Can't call userland page fault handler in the speculative path */
>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> +		goto unlock;
>> +
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
>> +
>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> +		/*
>> +		 * This could be detected by the check address against VMA's
>> +		 * boundaries but we want to trace it as not supported instead
>> +		 * of changed.
>> +		 */
>> +		goto unlock;
>> +
>> +	if (address < READ_ONCE(vma->vm_start)
>> +	    || READ_ONCE(vma->vm_end) <= address)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * The three following checks are copied from access_error from
>> +	 * arch/x86/mm/fault.c
>> +	 */
>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> +				       flags & FAULT_FLAG_INSTRUCTION,
>> +				       flags & FAULT_FLAG_REMOTE))
>> +		goto unlock;
>> +
>> +	/* This is one is required to check that the VMA has write access set */
>> +	if (flags & FAULT_FLAG_WRITE) {
>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
>> +			goto unlock;
>> +	} else {
>> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
>> +			goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * Do a speculative lookup of the PTE entry.
>> +	 */
>> +	local_irq_disable();
>> +	pgd = pgd_offset(mm, address);
>> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +		goto out_walk;
>> +
>> +	p4d = p4d_alloc(mm, pgd, address);
>> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +		goto out_walk;
>> +
>> +	pud = pud_alloc(mm, p4d, address);
>> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +		goto out_walk;
>> +
>> +	pmd = pmd_offset(pud, address);
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		goto out_walk;
>> +
>> +	/*
>> +	 * The above does not allocate/instantiate page-tables because doing so
>> +	 * would lead to the possibility of instantiating page-tables after
>> +	 * free_pgtables() -- and consequently leaking them.
>> +	 *
>> +	 * The result is that we take at least one !speculative fault per PMD
>> +	 * in order to instantiate it.
>> +	 */
>
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?
>
>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
>
> That's looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Wed, Aug 30, 2017 at 07:19:30AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> > mpe helped me out and explained that is the PWC hint to TLBIE.
> >
> > So, you set need_flush_all when you unhook pud/pmd/pte, which you then
> > use to set PWC. So free_pgtables() will do the PWC when it unhooks
> > higher level pages.
> >
> > But you're right that there's some issues, free_pgtables() itself
> > doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> > pages.
> >
> > If it were to do that, things should work fine since those locks would
> > then serialize against the speculative faults, we would never install a
> > page if the VMA would be under tear-down and it would thus not be
> > visible to your caches either.
>
> That's one case. I don't remember *all* the cases to be honest, but
> I do remember several times over the past few years thinking "ah we are
> fine because the mm sem taken for writing protects us from any
> concurrent tree structure change" :-)

Well, installing always seems to use the locks (it needs to, because it's
always done with down_read()), which only leaves removal, and the only
place I know of that removes stuff is free_pgtables().

But I think I found another fun place: copy_page_range(). While it
(pointlessly) takes all the PTLs on the dst mm, it walks the src page
tables without any PTLs.

This means that if we have a multi-threaded process doing fork(), a thread
of the src mm could instantiate page-tables that will not be copied over.

Of course, this is highly dubious behaviour to begin with, and I don't
think there's anything fundamentally wrong with missing those pages, but
we should document this stuff.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> 			unsigned int flags)
>  {
> +	if (flags & FAULT_FLAG_SPECULATIVE) {
> +		if (flags & FAULT_FLAG_KILLABLE) {
> +			int ret;
> +
> +			ret = __lock_page_killable(page);
> +			if (ret)
> +				return 0;
> +		} else
> +			__lock_page(page);
> +		return 1;
> +	}
> +
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>  		/*
>  		 * CAUTION! In this case, mmap_sem is not released

Yeah, that looks right.

> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
> unsigned long address,
>  		goto unlock;
>  	}
>
> +	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>  		trace_spf_vma_notsup(_RET_IP_, vma, address);
>  		goto unlock;
>  	}

As riel pointed out on IRC slightly later, private file maps also need
->anon_vma and those actually have ->vm_ops IIRC so the condition needs
to be slightly more complicated.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/27/2017 05:48 AM, Kirill A. Shutemov wrote:
>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
>
> That's looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?

IIUC we would have to reattempt once for each PMD-level fault because of
the lack of a page table entry there. Besides, do we want to support huge
pages in general as part of the speculative page fault path? The number
of faults will be much lower (256 times lower on POWER and 512 times
lower on x86). So is it worth it?

BTW, calling hugetlb_fault() after figuring out the VMA works correctly
inside handle_speculative_fault() last time I checked.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/29/2017 07:15 PM, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
>> On 29/08/2017 14:04, Peter Zijlstra wrote:
>>> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>>>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>>>> +
>>>>>> +	if (unlikely(!vma->anon_vma))
>>>>>> +		goto unlock;
>>>>>
>>>>> It deserves a comment.
>>>>
>>>> You're right I'll add it in the next version.
>>>> For the record, the root cause is that __anon_vma_prepare() requires the
>>>> mmap_sem to be held because vm_next and vm_prev must be safe.
>>>
>>> But should that test not be:
>>>
>>>	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>>>		goto unlock;
>>>
>>> Because !anon vmas will never have ->anon_vma set and you don't want to
>>> exclude those.
>>
>> Yes in the case we later allow non anonymous vmas to be handled.
>> Currently only anonymous vmas are supported so the check is good enough,
>> isn't it?
>
> That wasn't at all clear from reading the code. This makes it clear
> ->anon_vma is only ever looked at for anonymous.
>
> And like Kirill says, we _really_ should start allowing some (if not
> all) vm_ops. Large file based mappings aren't particularly rare.
>
> I'm not sure we want to introduce a white-list or just bite the bullet
> and audit all ->fault() implementations. But either works and isn't
> terribly difficult, auditing all is more work though.

filemap_fault() is used as vma->vm_ops->fault() by most of the file
systems. Changing it can enable speculative fault support for all of
them. It will still exclude other driver-based vma->vm_ops->fault()
implementations. AFAICS, the __lock_page_or_retry() function can drop
mm->mmap_sem if the page could not be locked right away. As suggested by
Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good
enough. The patch below is lightly tested for file mappings on top of
this series.
diff --git a/mm/filemap.c b/mm/filemap.c
index a497024..08f3042 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
			unsigned int flags)
 {
+	if (flags & FAULT_FLAG_SPECULATIVE) {
+		if (flags & FAULT_FLAG_KILLABLE) {
+			int ret;
+
+			ret = __lock_page_killable(page);
+			if (ret)
+				return 0;
+		} else
+			__lock_page(page);
+		return 1;
+	}
+
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
 		/*
 		 * CAUTION! In this case, mmap_sem is not released
diff --git a/mm/memory.c b/mm/memory.c
index 549d235..02347f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (!vmf->pte) {
 		if (vma_is_anonymous(vmf->vma))
 			return do_anonymous_page(vmf);
-		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
-			return VM_FAULT_RETRY;
 		else
 			return do_fault(vmf);
 	}
@@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
unsigned long address,
 		goto unlock;
 	}

-	/*
-	 * Can't call vm_ops service has we don't know what they would do
-	 * with the VMA.
-	 * This include huge page from hugetlbfs.
-	 */
-	if (vma->vm_ops) {
-		trace_spf_vma_notsup(_RET_IP_, vma, address);
-		goto unlock;
-	}
-
-	if (unlikely(!vma->anon_vma)) {
+	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
 		trace_spf_vma_notsup(_RET_IP_, vma, address);
 		goto unlock;
 	}
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/29/2017 05:34 PM, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>> +
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto unlock;
>>>
>>> It deserves a comment.
>>
>> You're right I'll add it in the next version.
>> For the record, the root cause is that __anon_vma_prepare() requires the
>> mmap_sem to be held because vm_next and vm_prev must be safe.
>
> But should that test not be:
>
>	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>		goto unlock;

This makes more sense. We are backing off from the speculative path
because a struct anon_vma has not been created for this anonymous vma,
and we cannot do that without holding mmap_sem. This should have nothing
to do with vma->vm_ops availability.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> mpe helped me out and explained that is the PWC hint to TLBIE.
>
> So, you set need_flush_all when you unhook pud/pmd/pte, which you then
> use to set PWC. So free_pgtables() will do the PWC when it unhooks
> higher level pages.
>
> But you're right that there's some issues, free_pgtables() itself
> doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> pages.
>
> If it were to do that, things should work fine since those locks would
> then serialize against the speculative faults, we would never install a
> page if the VMA would be under tear-down and it would thus not be
> visible to your caches either.

That's one case. I don't remember *all* the cases to be honest, but I do
remember several times over the past few years thinking "ah we are fine
because the mm sem taken for writing protects us from any concurrent tree
structure change" :-)

Cheers,
Ben.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
> On 29/08/2017 14:04, Peter Zijlstra wrote:
> > On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> >> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> >>>> +
> >>>> +	if (unlikely(!vma->anon_vma))
> >>>> +		goto unlock;
> >>>
> >>> It deserves a comment.
> >>
> >> You're right I'll add it in the next version.
> >> For the record, the root cause is that __anon_vma_prepare() requires the
> >> mmap_sem to be held because vm_next and vm_prev must be safe.
> >
> > But should that test not be:
> >
> >	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> >		goto unlock;
> >
> > Because !anon vmas will never have ->anon_vma set and you don't want to
> > exclude those.
>
> Yes in the case we later allow non anonymous vmas to be handled.
> Currently only anonymous vmas are supported so the check is good enough,
> isn't it?

That wasn't at all clear from reading the code. This makes it clear
->anon_vma is only ever looked at for anonymous.

And like Kirill says, we _really_ should start allowing some (if not
all) vm_ops. Large file based mappings aren't particularly rare.

I'm not sure we want to introduce a white-list or just bite the bullet
and audit all ->fault() implementations. But either works and isn't
terribly difficult, auditing all is more work though.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 29/08/2017 14:04, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>> +
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto unlock;
>>>
>>> It deserves a comment.
>>
>> You're right I'll add it in the next version.
>> For the record, the root cause is that __anon_vma_prepare() requires the
>> mmap_sem to be held because vm_next and vm_prev must be safe.
>
> But should that test not be:
>
>	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>		goto unlock;
>
> Because !anon vmas will never have ->anon_vma set and you don't want to
> exclude those.

Yes, in case we later allow non-anonymous vmas to be handled.
Currently only anonymous vmas are supported, so the check is good enough,
isn't it?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> >> +
> >> +	if (unlikely(!vma->anon_vma))
> >> +		goto unlock;
> >
> > It deserves a comment.
>
> You're right I'll add it in the next version.
> For the record, the root cause is that __anon_vma_prepare() requires the
> mmap_sem to be held because vm_next and vm_prev must be safe.

But should that test not be:

	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
		goto unlock;

Because !anon vmas will never have ->anon_vma set and you don't want to
exclude those.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 10:33:52AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> > We'd have to audit archs closely. Things like the page walk cache
> > flushing on power etc...
>
> If you point me where to look, I'll have a poke around. I'm not
> quite sure what you mean with pagewalk cache flushing. Your hash thing
> flushes everything inside the PTL IIRC and the radix code appears fairly
> 'normal'.

mpe helped me out and explained that is the PWC hint to TLBIE.

So, you set need_flush_all when you unhook pud/pmd/pte, which you then
use to set PWC. So free_pgtables() will do the PWC when it unhooks
higher level pages.

But you're right that there are some issues: free_pgtables() itself
doesn't seem to use mm->page_table_lock or pmd->lock _AT_ALL_ to unhook
the pages.

If it were to do that, things should work fine, since those locks would
then serialize against the speculative faults; we would never install a
page if the VMA were under tear-down, and it would thus not be visible
to your caches either.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > > Doing all this job and just give up because we cannot allocate page tables
> > > looks very wasteful to me.
> > >
> > > Have you considered to look how we can hand over from speculative to
> > > non-speculative path without starting from scratch (when possible)?
> >
> > So we _can_ in fact allocate and install page-tables, but we have to be
> > very careful about it. The interesting case is where we race with
> > free_pgtables() and install a page that was just taken out.
> >
> > But since we already have the VMA I think we can do something like:
>
> That makes me extremely nervous... there could be all sort of
> assumptions esp. in arch code about the fact that we never populate the
> tree without the mm sem.

That _would_ be somewhat dodgy, because that means it needs to rely on
taking mmap_sem for _writing_ to undo things, and arch/powerpc/ doesn't
have many down_write.*mmap_sem:

$ git grep "down_write.*mmap_sem" arch/powerpc/
arch/powerpc/kernel/vdso.c:	if (down_write_killable(&mm->mmap_sem))
arch/powerpc/kvm/book3s_64_vio.c:	down_write(&current->mm->mmap_sem);
arch/powerpc/mm/mmu_context_iommu.c:	down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:	down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:	down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:	down_write(&mm->mmap_sem);

Then again, I suppose it could be relying on the implicit down_write from
things like munmap() and the like.

And things _ought_ to be ordered by the various PTLs (mm->page_table_lock
and pmd->lock), which of course doesn't mean something accidentally snuck
through.

> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

If you point me to where to look, I'll have a poke around. I'm not quite
sure what you mean by page walk cache flushing. Your hash thing flushes
everything inside the PTL IIRC, and the radix code appears fairly
'normal'.

> I don't mind the "retry"... we've brought stuff into the L1 cache
> already, which I would expect to be the bulk of the overhead, and the
> allocation case isn't that common. Do we have numbers to show how
> detrimental this is today?

No numbers, afaik. And like I said, I didn't consider this an actual
problem when I did these patches. But since Kirill asked ;-)
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Mon, Aug 28, 2017 at 03:35:11PM -0700, Andi Kleen wrote:
> Yes the whole thing is quite risky. Probably will need some
> kind of per architecture opt-in scheme?

See patch 19/20; is that not enough for you?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 27/08/2017 02:18, Kirill A. Shutemov wrote: > On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote: >> +/* >> + * vm_normal_page() adds some processing which should be done while >> + * hodling the mmap_sem. >> + */ >> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address, >> + unsigned int flags) >> +{ >> +struct vm_fault vmf = { >> +.address = address, >> +}; >> +pgd_t *pgd; >> +p4d_t *p4d; >> +pud_t *pud; >> +pmd_t *pmd; >> +int dead, seq, idx, ret = VM_FAULT_RETRY; >> +struct vm_area_struct *vma; >> +struct mempolicy *pol; >> + >> +/* Clear flags that may lead to release the mmap_sem to retry */ >> +flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE); >> +flags |= FAULT_FLAG_SPECULATIVE; >> + >> +idx = srcu_read_lock(&vma_srcu); >> +vma = find_vma_srcu(mm, address); >> +if (!vma) >> +goto unlock; >> + >> +/* >> + * Validate the VMA found by the lockless lookup. >> + */ >> +dead = RB_EMPTY_NODE(&vma->vm_rb); >> +seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> >> seqlock,vma_rb_erase() */ >> +if ((seq & 1) || dead) >> +goto unlock; >> + >> +/* >> + * Can't call vm_ops service has we don't know what they would do >> + * with the VMA. >> + * This include huge page from hugetlbfs. >> + */ >> +if (vma->vm_ops) >> +goto unlock; > > I think we need to have a way to white-list safe ->vm_ops. Hi Kirill, Yes this would be a good optimization done in a next step. >> + >> +if (unlikely(!vma->anon_vma)) >> +goto unlock; > > It deserves a comment. You're right I'll add it in the next version. For the record, the root cause is that __anon_vma_prepare() requires the mmap_sem to be held because vm_next and vm_prev must be safe. 
>> + >> +vmf.vma_flags = READ_ONCE(vma->vm_flags); >> +vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot); >> + >> +/* Can't call userland page fault handler in the speculative path */ >> +if (unlikely(vmf.vma_flags & VM_UFFD_MISSING)) >> +goto unlock; >> + >> +/* >> + * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which >> + * are not compatible with the speculative page fault processing. >> + */ >> +pol = __get_vma_policy(vma, address); >> +if (!pol) >> +pol = get_task_policy(current); >> +if (pol && pol->mode == MPOL_INTERLEAVE) >> +goto unlock; >> + >> +if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP) >> +/* >> + * This could be detected by the check address against VMA's >> + * boundaries but we want to trace it as not supported instead >> + * of changed. >> + */ >> +goto unlock; >> + >> +if (address < READ_ONCE(vma->vm_start) >> +|| READ_ONCE(vma->vm_end) <= address) >> +goto unlock; >> + >> +/* >> + * The three following checks are copied from access_error from >> + * arch/x86/mm/fault.c >> + */ >> +if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, >> + flags & FAULT_FLAG_INSTRUCTION, >> + flags & FAULT_FLAG_REMOTE)) >> +goto unlock; >> + >> +/* This is one is required to check that the VMA has write access set */ >> +if (flags & FAULT_FLAG_WRITE) { >> +if (unlikely(!(vmf.vma_flags & VM_WRITE))) >> +goto unlock; >> +} else { >> +if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE >> +goto unlock; >> +} >> + >> +/* >> + * Do a speculative lookup of the PTE entry. 
>> +	 */
>> +	local_irq_disable();
>> +	pgd = pgd_offset(mm, address);
>> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +		goto out_walk;
>> +
>> +	p4d = p4d_alloc(mm, pgd, address);
>> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +		goto out_walk;
>> +
>> +	pud = pud_alloc(mm, p4d, address);
>> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +		goto out_walk;
>> +
>> +	pmd = pmd_offset(pud, address);
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		goto out_walk;
>> +
>> +	/*
>> +	 * The above does not allocate/instantiate page-tables because doing so
>> +	 * would lead to the possibility of instantiating page-tables after
>> +	 * free_pgtables() -- and consequently leaking them.
>> +	 *
>> +	 * The result is that we take at least one !speculative fault per PMD
>> +	 * in order to instantiate it.
>> +	 */
>
> Doing all this job and just giving up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered looking at how we can hand over from the speculative to
> the non-speculative path without starting from scratch (when possible)?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
> That makes me extremely nervous... there could be all sorts of
> assumptions, esp. in arch code, about the fact that we never populate the
> tree without the mmap sem.
>
> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

Yes, the whole thing is quite risky. It will probably need some kind of
per-architecture opt-in scheme?

-Andi
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > Doing all this job and just giving up because we cannot allocate page tables
> > looks very wasteful to me.
> >
> > Have you considered looking at how we can hand over from the speculative to
> > the non-speculative path without starting from scratch (when possible)?
>
> So we _can_ in fact allocate and install page-tables, but we have to be
> very careful about it. The interesting case is where we race with
> free_pgtables() and install a page that was just taken out.
>
> But since we already have the VMA I think we can do something like:

That makes me extremely nervous... there could be all sorts of
assumptions, esp. in arch code, about the fact that we never populate the
tree without the mmap sem.

We'd have to audit archs closely. Things like the page walk cache
flushing on power etc...

I don't mind the "retry"... we've brought the stuff into the L1 cache
already, which I would expect to be the bulk of the overhead, and the
allocation case isn't that common. Do we have numbers to show how
detrimental this is today?

Cheers,
Ben.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Sun, Aug 27, 2017 at 03:18:23AM +0300, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
> > +	/*
> > +	 * Can't call vm_ops services as we don't know what they would do
> > +	 * with the VMA.
> > +	 * This includes huge pages from hugetlbfs.
> > +	 */
> > +	if (vma->vm_ops)
> > +		goto unlock;
>
> I think we need to have a way to white-list safe ->vm_ops.

Either that, or simply teach all ->fault() callbacks about speculative
faults. Shouldn't be too hard, just 'work'.

> > +
> > +	if (unlikely(!vma->anon_vma))
> > +		goto unlock;
>
> It deserves a comment.

Yes, that was very much not intended. It wrecks most of the fun. This
really _should_ work for file maps too.

> > +	/*
> > +	 * Do a speculative lookup of the PTE entry.
> > +	 */
> > +	local_irq_disable();
> > +	pgd = pgd_offset(mm, address);
> > +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > +		goto out_walk;
> > +
> > +	p4d = p4d_alloc(mm, pgd, address);
> > +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> > +		goto out_walk;
> > +
> > +	pud = pud_alloc(mm, p4d, address);
> > +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > +		goto out_walk;
> > +
> > +	pmd = pmd_offset(pud, address);
> > +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > +		goto out_walk;
> > +
> > +	/*
> > +	 * The above does not allocate/instantiate page-tables because doing so
> > +	 * would lead to the possibility of instantiating page-tables after
> > +	 * free_pgtables() -- and consequently leaking them.
> > +	 *
> > +	 * The result is that we take at least one !speculative fault per PMD
> > +	 * in order to instantiate it.
> > +	 */
>
> Doing all this job and just giving up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered looking at how we can hand over from the speculative to
> the non-speculative path without starting from scratch (when possible)?

So we _can_ in fact allocate and install page-tables, but we have to be
very careful about it.
The interesting case is where we race with free_pgtables() and install a
page that was just taken out. But since we already have the VMA I think
we can do something like:

	if (p*d_none()) {
		p*d_t *new = p*d_alloc_one(mm, address);

		spin_lock(&mm->page_table_lock);
		if (!vma_changed_or_dead(vma, seq)) {
			if (p*d_none())
				p*d_populate(mm, p*d, new);
			else
				p*d_free(new);
			new = NULL;
		}
		spin_unlock(&mm->page_table_lock);

		if (new) {
			p*d_free(new);
			goto out_walk;
		}
	}

I just never bothered with that; I figured we ought to get the basics
working before trying to be clever.

> > +	/* Transparent huge pages are not supported. */
> > +	if (unlikely(pmd_trans_huge(*pmd)))
> > +		goto out_walk;
>
> That looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?

Not that I can remember, but I never really looked at THP; I don't think
we even had that when I did the first versions.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
> +/*
> + * vm_normal_page() adds some processing which should be done while
> + * holding the mmap_sem.
> + */
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +			     unsigned int flags)
> +{
> +	struct vm_fault vmf = {
> +		.address = address,
> +	};
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +	struct mempolicy *pol;
> +
> +	/* Clear flags that may lead to releasing the mmap_sem to retry */
> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +	flags |= FAULT_FLAG_SPECULATIVE;
> +
> +	idx = srcu_read_lock(&vma_srcu);
> +	vma = find_vma_srcu(mm, address);
> +	if (!vma)
> +		goto unlock;
> +
> +	/*
> +	 * Validate the VMA found by the lockless lookup.
> +	 */
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if ((seq & 1) || dead)
> +		goto unlock;
> +
> +	/*
> +	 * Can't call vm_ops services as we don't know what they would do
> +	 * with the VMA.
> +	 * This includes huge pages from hugetlbfs.
> +	 */
> +	if (vma->vm_ops)
> +		goto unlock;

I think we need to have a way to white-list safe ->vm_ops.

> +
> +	if (unlikely(!vma->anon_vma))
> +		goto unlock;

It deserves a comment.

> +
> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +	/* Can't call userland page fault handler in the speculative path */
> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +		goto unlock;
> +
> +	/*
> +	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
> +	 * are not compatible with the speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto unlock;
> +
> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +		/*
> +		 * This could be detected by checking the address against the
> +		 * VMA's boundaries, but we want to trace it as not supported
> +		 * instead of changed.
> +		 */
> +		goto unlock;
> +
> +	if (address < READ_ONCE(vma->vm_start)
> +	    || READ_ONCE(vma->vm_end) <= address)
> +		goto unlock;
> +
> +	/*
> +	 * The three following checks are copied from access_error() in
> +	 * arch/x86/mm/fault.c
> +	 */
> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +				       flags & FAULT_FLAG_INSTRUCTION,
> +				       flags & FAULT_FLAG_REMOTE))
> +		goto unlock;
> +
> +	/* This one is required to check that the VMA has write access set */
> +	if (flags & FAULT_FLAG_WRITE) {
> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
> +			goto unlock;
> +	} else {
> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
> +			goto unlock;
> +	}
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_walk;
> +
> +	p4d = p4d_alloc(mm, pgd, address);
> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> +		goto out_walk;
> +
> +	pud = pud_alloc(mm, p4d, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_walk;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto out_walk;
> +
> +	/*
> +	 * The above does not allocate/instantiate page-tables because doing so
> +	 * would lead to the possibility of instantiating page-tables after
> +	 * free_pgtables() -- and consequently leaking them.
> +	 *
> +	 * The result is that we take at least one !speculative fault per PMD
> +	 * in order to instantiate it.
> +	 */

Doing all this job and just giving up because we cannot allocate page tables
looks very wasteful to me.

Have you considered looking at how we can hand over from the speculative to
the non-speculative path without starting from scratch (when possible)?

> +	/* Transparent huge pages are not supported. */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		goto out_walk;

That looks like a blocker to me.

Is there any problem with making it supported (besides plain coding)?

> +
> +	vmf.vma = vma;
> +	vmf.pmd = pmd;
> +	vmf.pgoff = linear_page_index(vma, address);
> +
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 20/08/2017 14:11, Sergey Senozhatsky wrote:
> On (08/18/17 00:05), Laurent Dufour wrote:
> [..]
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
>
> include/linux/mempolicy.h defines
>
> 	struct mempolicy *get_task_policy(struct task_struct *p);
> 	struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> 					   unsigned long addr);
>
> only for CONFIG_NUMA configs.

Thanks Sergey, I'll add #ifdef around this block.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On (08/18/17 00:05), Laurent Dufour wrote:
[..]
> +	/*
> +	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
> +	 * are not compatible with the speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto unlock;

include/linux/mempolicy.h defines

	struct mempolicy *get_task_policy(struct task_struct *p);
	struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
					   unsigned long addr);

only for CONFIG_NUMA configs.

-ss
[PATCH v2 14/20] mm: Provide speculative fault infrastructure
From: Peter Zijlstra

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid; if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel)
[Manage the newly introduced pte_spinlock() for speculative page fault
 to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
[Set fe.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Introduce __HAVE_ARCH_CALL_SPF to declare the SPF handler only when
 architecture is supporting it]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory
policy as we can't support the MPOL_INTERLEAVE case due to the
 processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]

Signed-off-by: Laurent Dufour
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h             |   5 +
 include/linux/pagemap.h        |   4 +-
 mm/internal.h                  |  14 +++
 mm/memory.c                    | 237 +++++++++++++++++++++++++++++++-
 5 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index a4e7ca0f3585..6cfdfca4cc2a 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -7,7 +7,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-	return !!(vma->vm_flags & VM_HUGETLB);
+	return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4ddd72b172..0fe0811d304f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
@@ -1297,6 +1298,10 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+#ifdef __HAVE_ARCH_CALL_SPF
+extern int handle_speculative_fault(struct mm_struct *mm,
+		unsigned long address, unsigned int flags);
+#endif /* __HAVE_ARCH_CALL_SPF */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 79b36f57c3ba..3a9735dfa6b6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -443,8 +443,8 @@ static inline pgoff_t
linear_page_index(struct vm_area_struct *vma,
 	pgoff_t pgoff;
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return linear_hugepage_index(vma, address);
-	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
-	pgoff += vma->vm_pgoff;
+	pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
+	pgoff += READ_ONCE(vma->vm_pgoff);
 	return pgoff;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 736540f15936..9d6347e35747 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -45,6 +45,20 @@
 extern struct srcu_struct vma_srcu;
 
 extern struct v