Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Thu, Aug 31, 2017 at 12:25:16PM +0530, Anshuman Khandual wrote:
> On 08/30/2017 03:02 PM, Laurent Dufour wrote:
> > On 30/08/2017 07:58, Peter Zijlstra wrote:
> >> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
> >>> diff --git a/mm/filemap.c b/mm/filemap.c
> >>> index a497024..08f3042 100644
> >>> --- a/mm/filemap.c
> >>> +++ b/mm/filemap.c
> >>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
> >>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> >>> 			unsigned int flags)
> >>>  {
> >>> +	if (flags & FAULT_FLAG_SPECULATIVE) {
> >>> +		if (flags & FAULT_FLAG_KILLABLE) {
> >>> +			int ret;
> >>> +
> >>> +			ret = __lock_page_killable(page);
> >>> +			if (ret)
> >>> +				return 0;
> >>> +		} else
> >>> +			__lock_page(page);
> >>> +		return 1;
> >>> +	}
> >>> +
> >>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>  		/*
> >>>  		 * CAUTION! In this case, mmap_sem is not released
> >>
> >> Yeah, that looks right.
> >
> > Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
> > speculative path in that case to match the semantics of
> > __lock_page_or_retry().
>
> Doing that would force us to have another retry through the classic fault
> path, wasting all the work done so far through SPF. Hence it may be
> better to just wait, take the lock here and complete the fault. Peterz,
> would you agree? Or should we do as suggested by Laurent? Moreover,
> forcing FAULT_FLAG_RETRY_NOWAIT on FAULT_FLAG_SPECULATIVE at this point
> would look like a hack.

Is there ever a situation where SPECULATIVE and NOWAIT are used together?
That seems like something to avoid.

A git-grep seems to suggest gup() can set it, but gup() will not be doing
speculative faults. s390 also sets it, but then again, they don't have
speculative fault support yet, and when they do they can avoid setting
them together.

So maybe put in a WARN_ON_ONCE() on having both of them. It is not a
combination that makes sense to me, but maybe someone sees a rationale
for it?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/30/2017 03:02 PM, Laurent Dufour wrote:
> On 30/08/2017 07:58, Peter Zijlstra wrote:
>> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index a497024..08f3042 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>>> 			unsigned int flags)
>>>  {
>>> +	if (flags & FAULT_FLAG_SPECULATIVE) {
>>> +		if (flags & FAULT_FLAG_KILLABLE) {
>>> +			int ret;
>>> +
>>> +			ret = __lock_page_killable(page);
>>> +			if (ret)
>>> +				return 0;
>>> +		} else
>>> +			__lock_page(page);
>>> +		return 1;
>>> +	}
>>> +
>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>  		/*
>>>  		 * CAUTION! In this case, mmap_sem is not released
>>
>> Yeah, that looks right.
>
> Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
> speculative path in that case to match the semantics of
> __lock_page_or_retry().

Doing that would force us to have another retry through the classic fault
path, wasting all the work done so far through SPF. Hence it may be better
to just wait, take the lock here and complete the fault. Peterz, would you
agree? Or should we do as suggested by Laurent? Moreover, forcing
FAULT_FLAG_RETRY_NOWAIT on FAULT_FLAG_SPECULATIVE at this point would look
like a hack.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 30/08/2017 07:03, Anshuman Khandual wrote: > On 08/29/2017 07:15 PM, Peter Zijlstra wrote: >> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote: >>> On 29/08/2017 14:04, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote: > On 27/08/2017 02:18, Kirill A. Shutemov wrote: >>> + >>> + if (unlikely(!vma->anon_vma)) >>> + goto unlock; >> >> It deserves a comment. > > You're right I'll add it in the next version. > For the record, the root cause is that __anon_vma_prepare() requires the > mmap_sem to be held because vm_next and vm_prev must be safe. But should that test not be: if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) goto unlock; Because !anon vmas will never have ->anon_vma set and you don't want to exclude those. >>> >>> Yes in the case we later allow non anonymous vmas to be handled. >>> Currently only anonymous vmas are supported so the check is good enough, >>> isn't it ? >> >> That wasn't at all clear from reading the code. This makes it clear >> ->anon_vma is only ever looked at for anonymous. >> >> And like Kirill says, we _really_ should start allowing some (if not >> all) vm_ops. Large file based mappings aren't particularly rare. >> >> I'm not sure we want to introduce a white-list or just bite the bullet >> and audit all ->fault() implementations. But either works and isn't >> terribly difficult, auditing all is more work though. > > filemap_fault() is used as vma-vm_ops->fault() for most of the file > systems. Changing it can enable speculative fault support for all of > them. It will still exclude other driver based vma-vm_ops->fault() > implementation. AFAICS, __lock_page_or_retry() function can drop > mm->mmap_sem if the page could not be locked right away. As suggested > by Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good > enough. The patch is lightly tested for file mappings on top of this > series. 
Hi Anshuman,

This sounds pretty good, except for the FAULT_FLAG_RETRY_NOWAIT case I
mentioned in another mail.

The next step would be to find a way to discriminate between the
vm_ops->fault() implementations. Any idea?

Thanks,
Laurent.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> 			unsigned int flags)
>  {
> +	if (flags & FAULT_FLAG_SPECULATIVE) {
> +		if (flags & FAULT_FLAG_KILLABLE) {
> +			int ret;
> +
> +			ret = __lock_page_killable(page);
> +			if (ret)
> +				return 0;
> +		} else
> +			__lock_page(page);
> +		return 1;
> +	}
> +
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>  		/*
>  		 * CAUTION! In this case, mmap_sem is not released
> diff --git a/mm/memory.c b/mm/memory.c
> index 549d235..02347f3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
>  	if (!vmf->pte) {
>  		if (vma_is_anonymous(vmf->vma))
>  			return do_anonymous_page(vmf);
> -		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> -			return VM_FAULT_RETRY;
>  		else
>  			return do_fault(vmf);
>  	}
> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
> unsigned long address,
>  		goto unlock;
>  	}
>
> -	/*
> -	 * Can't call vm_ops service has we don't know what they would do
> -	 * with the VMA.
> -	 * This include huge page from hugetlbfs.
> -	 */
> -	if (vma->vm_ops) {
> -		trace_spf_vma_notsup(_RET_IP_, vma, address);
> -		goto unlock;
> -	}
> -
> -	if (unlikely(!vma->anon_vma)) {
> +	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>  		trace_spf_vma_notsup(_RET_IP_, vma, address);
>  		goto unlock;
>  	}
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 30/08/2017 07:58, Peter Zijlstra wrote:
> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a497024..08f3042 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>> 			unsigned int flags)
>>  {
>> +	if (flags & FAULT_FLAG_SPECULATIVE) {
>> +		if (flags & FAULT_FLAG_KILLABLE) {
>> +			int ret;
>> +
>> +			ret = __lock_page_killable(page);
>> +			if (ret)
>> +				return 0;
>> +		} else
>> +			__lock_page(page);
>> +		return 1;
>> +	}
>> +
>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>  		/*
>>  		 * CAUTION! In this case, mmap_sem is not released
>
> Yeah, that looks right.

Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
speculative path in that case to match the semantics of
__lock_page_or_retry().

>> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
>> unsigned long address,
>>  		goto unlock;
>>  	}
>>
>> +	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>>  		trace_spf_vma_notsup(_RET_IP_, vma, address);
>>  		goto unlock;
>>  	}
>
> As riel pointed out on IRC slightly later, private file maps also need
> ->anon_vma and those actually have ->vm_ops IIRC so the condition needs
> to be slightly more complicated.

Yes, I read the code again and came to the same conclusion.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
>> +/*
>> + * vm_normal_page() adds some processing which should be done while
>> + * hodling the mmap_sem.
>> + */
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> +			     unsigned int flags)
>> +{
>> +	struct vm_fault vmf = {
>> +		.address = address,
>> +	};
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
>> +	struct vm_area_struct *vma;
>> +	struct mempolicy *pol;
>> +
>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> +	flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> +	idx = srcu_read_lock(&vma_srcu);
>> +	vma = find_vma_srcu(mm, address);
>> +	if (!vma)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Validate the VMA found by the lockless lookup.
>> +	 */
>> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>> +	if ((seq & 1) || dead)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Can't call vm_ops service has we don't know what they would do
>> +	 * with the VMA.
>> +	 * This include huge page from hugetlbfs.
>> +	 */
>> +	if (vma->vm_ops)
>> +		goto unlock;
>
> I think we need to have a way to white-list safe ->vm_ops.
>
>> +
>> +	if (unlikely(!vma->anon_vma))
>> +		goto unlock;
>
> It deserves a comment.
>
>> +
>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> +	/* Can't call userland page fault handler in the speculative path */
>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> +		goto unlock;
>> +
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
>> +
>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> +		/*
>> +		 * This could be detected by the check address against VMA's
>> +		 * boundaries but we want to trace it as not supported instead
>> +		 * of changed.
>> +		 */
>> +		goto unlock;
>> +
>> +	if (address < READ_ONCE(vma->vm_start)
>> +	    || READ_ONCE(vma->vm_end) <= address)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * The three following checks are copied from access_error from
>> +	 * arch/x86/mm/fault.c
>> +	 */
>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> +				       flags & FAULT_FLAG_INSTRUCTION,
>> +				       flags & FAULT_FLAG_REMOTE))
>> +		goto unlock;
>> +
>> +	/* This is one is required to check that the VMA has write access set */
>> +	if (flags & FAULT_FLAG_WRITE) {
>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
>> +			goto unlock;
>> +	} else {
>> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
>> +			goto unlock;
>> +	}
>> +
>> +	/*
>> +	 * Do a speculative lookup of the PTE entry.
>> +	 */
>> +	local_irq_disable();
>> +	pgd = pgd_offset(mm, address);
>> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +		goto out_walk;
>> +
>> +	p4d = p4d_alloc(mm, pgd, address);
>> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +		goto out_walk;
>> +
>> +	pud = pud_alloc(mm, p4d, address);
>> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +		goto out_walk;
>> +
>> +	pmd = pmd_offset(pud, address);
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		goto out_walk;
>> +
>> +	/*
>> +	 * The above does not allocate/instantiate page-tables because doing so
>> +	 * would lead to the possibility of instantiating page-tables after
>> +	 * free_pgtables() -- and consequently leaking them.
>> +	 *
>> +	 * The result is that we take at least one !speculative fault per PMD
>> +	 * in order to instantiate it.
>> +	 */
>
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?
>
>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
>
> That's looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Wed, Aug 30, 2017 at 07:19:30AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> > mpe helped me out and explained that is the PWC hint to TLBIE.
> >
> > So, you set need_flush_all when you unhook pud/pmd/pte, which you then
> > use to set PWC. So free_pgtables() will do the PWC when it unhooks
> > higher level pages.
> >
> > But you're right that there's some issues, free_pgtables() itself
> > doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> > pages.
> >
> > If it were to do that, things should work fine since those locks would
> > then serialize against the speculative faults, we would never install a
> > page if the VMA would be under tear-down and it would thus not be
> > visible to your caches either.
>
> That's one case. I don't remember *all* the cases to be honest, but
> I do remember several times over the past few years thinking "ah we are
> fine because the mm sem taken for writing protects us from any
> concurrent tree structure change" :-)

Well, installing always seems to use the locks (it needs to, because it's
always done with down_read()), which only leaves removal, and the only
place I know of that removes stuff is free_pgtables().

But I think I found another fun place: copy_page_range(). While it
(pointlessly) takes all the PTLs on the dst mm, it walks the src page
tables without any PTLs.

This means that if we have a multi-threaded process doing fork(), a thread
of the src mm could instantiate page-tables that will not be copied over.

Of course, this is highly dubious behaviour to begin with, and I don't
think there's anything fundamentally wrong with missing those pages, but
we should document this stuff.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> 			unsigned int flags)
>  {
> +	if (flags & FAULT_FLAG_SPECULATIVE) {
> +		if (flags & FAULT_FLAG_KILLABLE) {
> +			int ret;
> +
> +			ret = __lock_page_killable(page);
> +			if (ret)
> +				return 0;
> +		} else
> +			__lock_page(page);
> +		return 1;
> +	}
> +
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>  		/*
>  		 * CAUTION! In this case, mmap_sem is not released

Yeah, that looks right.

> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
> unsigned long address,
>  		goto unlock;
>  	}
>
> +	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>  		trace_spf_vma_notsup(_RET_IP_, vma, address);
>  		goto unlock;
>  	}

As riel pointed out on IRC slightly later, private file maps also need
->anon_vma and those actually have ->vm_ops IIRC so the condition needs
to be slightly more complicated.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/27/2017 05:48 AM, Kirill A. Shutemov wrote:
>> +	/* Transparent huge pages are not supported. */
>> +	if (unlikely(pmd_trans_huge(*pmd)))
>> +		goto out_walk;
>
> That's looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?

IIUC we would have to reattempt once for each PMD-level fault because of
the lack of a page table entry there. Besides, do we want to support huge
pages in general as part of the speculative page fault path? The number
of faults will be much lower (256 times lower on POWER and 512 times
lower on x86). So is it worth it?

BTW, calling hugetlb_fault() after figuring out the VMA works correctly
inside handle_speculative_fault() last time I checked.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/29/2017 07:15 PM, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
>> On 29/08/2017 14:04, Peter Zijlstra wrote:
>>> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>>>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>>>> +
>>>>>> +	if (unlikely(!vma->anon_vma))
>>>>>> +		goto unlock;
>>>>>
>>>>> It deserves a comment.
>>>>
>>>> You're right I'll add it in the next version.
>>>> For the record, the root cause is that __anon_vma_prepare() requires the
>>>> mmap_sem to be held because vm_next and vm_prev must be safe.
>>>
>>> But should that test not be:
>>>
>>>	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>>>		goto unlock;
>>>
>>> Because !anon vmas will never have ->anon_vma set and you don't want to
>>> exclude those.
>>
>> Yes in the case we later allow non anonymous vmas to be handled.
>> Currently only anonymous vmas are supported so the check is good enough,
>> isn't it?
>
> That wasn't at all clear from reading the code. This makes it clear
> ->anon_vma is only ever looked at for anonymous.
>
> And like Kirill says, we _really_ should start allowing some (if not
> all) vm_ops. Large file based mappings aren't particularly rare.
>
> I'm not sure we want to introduce a white-list or just bite the bullet
> and audit all ->fault() implementations. But either works and isn't
> terribly difficult, auditing all is more work though.

filemap_fault() is used as vma->vm_ops->fault() by most of the file
systems. Changing it can enable speculative fault support for all of
them. It will still exclude other driver-based vma->vm_ops->fault()
implementations. AFAICS, the __lock_page_or_retry() function can drop
mm->mmap_sem if the page could not be locked right away. As suggested by
Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good
enough. The patch below is lightly tested for file mappings on top of
this series.
diff --git a/mm/filemap.c b/mm/filemap.c
index a497024..08f3042 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
			unsigned int flags)
 {
+	if (flags & FAULT_FLAG_SPECULATIVE) {
+		if (flags & FAULT_FLAG_KILLABLE) {
+			int ret;
+
+			ret = __lock_page_killable(page);
+			if (ret)
+				return 0;
+		} else
+			__lock_page(page);
+		return 1;
+	}
+
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
 		/*
 		 * CAUTION! In this case, mmap_sem is not released
diff --git a/mm/memory.c b/mm/memory.c
index 549d235..02347f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (!vmf->pte) {
 		if (vma_is_anonymous(vmf->vma))
 			return do_anonymous_page(vmf);
-		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
-			return VM_FAULT_RETRY;
 		else
 			return do_fault(vmf);
 	}
@@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm,
unsigned long address,
 		goto unlock;
 	}

-	/*
-	 * Can't call vm_ops service has we don't know what they would do
-	 * with the VMA.
-	 * This include huge page from hugetlbfs.
-	 */
-	if (vma->vm_ops) {
-		trace_spf_vma_notsup(_RET_IP_, vma, address);
-		goto unlock;
-	}
-
-	if (unlikely(!vma->anon_vma)) {
+	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
 		trace_spf_vma_notsup(_RET_IP_, vma, address);
 		goto unlock;
 	}
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 08/29/2017 05:34 PM, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>> +
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto unlock;
>>>
>>> It deserves a comment.
>>
>> You're right I'll add it in the next version.
>> For the record, the root cause is that __anon_vma_prepare() requires the
>> mmap_sem to be held because vm_next and vm_prev must be safe.
>
> But should that test not be:
>
>	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>		goto unlock;

This makes more sense. We are backing off from the speculative path
because a struct anon_vma has not been created for this anonymous vma,
and we cannot do that without holding mmap_sem. This should have nothing
to do with vma->vm_ops availability.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> mpe helped me out and explained that is the PWC hint to TLBIE.
>
> So, you set need_flush_all when you unhook pud/pmd/pte, which you then
> use to set PWC. So free_pgtables() will do the PWC when it unhooks
> higher level pages.
>
> But you're right that there's some issues, free_pgtables() itself
> doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> pages.
>
> If it were to do that, things should work fine since those locks would
> then serialize against the speculative faults, we would never install a
> page if the VMA would be under tear-down and it would thus not be
> visible to your caches either.

That's one case. I don't remember *all* the cases to be honest, but I do
remember several times over the past few years thinking "ah we are fine
because the mm sem taken for writing protects us from any concurrent tree
structure change" :-)

Cheers,
Ben.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
> On 29/08/2017 14:04, Peter Zijlstra wrote:
> > On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> >> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> >>>> +
> >>>> +	if (unlikely(!vma->anon_vma))
> >>>> +		goto unlock;
> >>>
> >>> It deserves a comment.
> >>
> >> You're right I'll add it in the next version.
> >> For the record, the root cause is that __anon_vma_prepare() requires the
> >> mmap_sem to be held because vm_next and vm_prev must be safe.
> >
> > But should that test not be:
> >
> >	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> >		goto unlock;
> >
> > Because !anon vmas will never have ->anon_vma set and you don't want to
> > exclude those.
>
> Yes in the case we later allow non anonymous vmas to be handled.
> Currently only anonymous vmas are supported so the check is good enough,
> isn't it?

That wasn't at all clear from reading the code. This makes it clear
->anon_vma is only ever looked at for anonymous.

And like Kirill says, we _really_ should start allowing some (if not
all) vm_ops. Large file based mappings aren't particularly rare.

I'm not sure we want to introduce a white-list or just bite the bullet
and audit all ->fault() implementations. But either works and isn't
terribly difficult, auditing all is more work though.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 29/08/2017 14:04, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
>> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>>> +
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto unlock;
>>>
>>> It deserves a comment.
>>
>> You're right I'll add it in the next version.
>> For the record, the root cause is that __anon_vma_prepare() requires the
>> mmap_sem to be held because vm_next and vm_prev must be safe.
>
> But should that test not be:
>
>	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
>		goto unlock;
>
> Because !anon vmas will never have ->anon_vma set and you don't want to
> exclude those.

Yes, in case we later allow non-anonymous vmas to be handled.
Currently only anonymous vmas are supported, so the check is good enough,
isn't it?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> >> +
> >> +	if (unlikely(!vma->anon_vma))
> >> +		goto unlock;
> >
> > It deserves a comment.
>
> You're right I'll add it in the next version.
> For the record, the root cause is that __anon_vma_prepare() requires the
> mmap_sem to be held because vm_next and vm_prev must be safe.

But should that test not be:

	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
		goto unlock;

Because !anon vmas will never have ->anon_vma set and you don't want to
exclude those.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 10:33:52AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> > We'd have to audit archs closely. Things like the page walk cache
> > flushing on power etc...
>
> If you point me where to look, I'll have a poke around. I'm not
> quite sure what you mean with pagewalk cache flushing. Your hash thing
> flushes everything inside the PTL IIRC and the radix code appears fairly
> 'normal'.

mpe helped me out and explained that is the PWC hint to TLBIE.

So, you set need_flush_all when you unhook pud/pmd/pte, which you then
use to set PWC. So free_pgtables() will do the PWC when it unhooks
higher level pages.

But you're right that there are some issues: free_pgtables() itself
doesn't seem to use mm->page_table_lock or pmd->lock _AT_ALL_ to unhook
the pages.

If it were to do that, things should work fine, since those locks would
then serialize against the speculative faults; we would never install a
page if the VMA were under tear-down, and it would thus not be visible
to your caches either.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Tue, Aug 29, 2017 at 07:14:37AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > > Doing all this job and just give up because we cannot allocate page tables
> > > looks very wasteful to me.
> > >
> > > Have you considered to look how we can hand over from speculative to
> > > non-speculative path without starting from scratch (when possible)?
> >
> > So we _can_ in fact allocate and install page-tables, but we have to be
> > very careful about it. The interesting case is where we race with
> > free_pgtables() and install a page that was just taken out.
> >
> > But since we already have the VMA I think we can do something like:
>
> That makes me extremely nervous... there could be all sort of
> assumptions esp. in arch code about the fact that we never populate the
> tree without the mm sem.

That _would_ be somewhat dodgy, because that means it needs to rely on
taking mmap_sem for _writing_ to undo things, and arch/powerpc/ doesn't
have many down_write.*mmap_sem:

$ git grep "down_write.*mmap_sem" arch/powerpc/
arch/powerpc/kernel/vdso.c:	if (down_write_killable(&mm->mmap_sem))
arch/powerpc/kvm/book3s_64_vio.c:	down_write(&current->mm->mmap_sem);
arch/powerpc/mm/mmu_context_iommu.c:	down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:	down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:	down_write(&mm->mmap_sem);
arch/powerpc/mm/subpage-prot.c:	down_write(&mm->mmap_sem);

Then again, I suppose it could be relying on the implicit down_write from
things like munmap() and the like.

And things _ought_ to be ordered by the various PTLs (mm->page_table_lock
and pmd->lock), which of course doesn't mean something accidentally snuck
through.

> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

If you point me to where to look, I'll have a poke around. I'm not quite
sure what you mean by page walk cache flushing. Your hash thing flushes
everything inside the PTL IIRC, and the radix code appears fairly
'normal'.

> I don't mind the "retry"... we've brought stuff into the L1 cache
> already, which I would expect to be the bulk of the overhead, and the
> allocation case isn't that common. Do we have numbers to show how
> detrimental this is today?

No numbers, afaik. And like I said, I didn't consider this an actual
problem when I did these patches. But since Kirill asked ;-)
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Mon, Aug 28, 2017 at 03:35:11PM -0700, Andi Kleen wrote:
> Yes the whole thing is quite risky. Probably will need some
> kind of per architecture opt-in scheme?

See patch 19/20; is that not enough for you?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 27/08/2017 02:18, Kirill A. Shutemov wrote: > On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote: >> +/* >> + * vm_normal_page() adds some processing which should be done while >> + * hodling the mmap_sem. >> + */ >> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address, >> + unsigned int flags) >> +{ >> +struct vm_fault vmf = { >> +.address = address, >> +}; >> +pgd_t *pgd; >> +p4d_t *p4d; >> +pud_t *pud; >> +pmd_t *pmd; >> +int dead, seq, idx, ret = VM_FAULT_RETRY; >> +struct vm_area_struct *vma; >> +struct mempolicy *pol; >> + >> +/* Clear flags that may lead to release the mmap_sem to retry */ >> +flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE); >> +flags |= FAULT_FLAG_SPECULATIVE; >> + >> +idx = srcu_read_lock(&vma_srcu); >> +vma = find_vma_srcu(mm, address); >> +if (!vma) >> +goto unlock; >> + >> +/* >> + * Validate the VMA found by the lockless lookup. >> + */ >> +dead = RB_EMPTY_NODE(&vma->vm_rb); >> +seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> >> seqlock,vma_rb_erase() */ >> +if ((seq & 1) || dead) >> +goto unlock; >> + >> +/* >> + * Can't call vm_ops service has we don't know what they would do >> + * with the VMA. >> + * This include huge page from hugetlbfs. >> + */ >> +if (vma->vm_ops) >> +goto unlock; > > I think we need to have a way to white-list safe ->vm_ops. Hi Kirill, Yes this would be a good optimization done in a next step. >> + >> +if (unlikely(!vma->anon_vma)) >> +goto unlock; > > It deserves a comment. You're right I'll add it in the next version. For the record, the root cause is that __anon_vma_prepare() requires the mmap_sem to be held because vm_next and vm_prev must be safe. 
>> + >> +vmf.vma_flags = READ_ONCE(vma->vm_flags); >> +vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot); >> + >> +/* Can't call userland page fault handler in the speculative path */ >> +if (unlikely(vmf.vma_flags & VM_UFFD_MISSING)) >> +goto unlock; >> + >> +/* >> + * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which >> + * are not compatible with the speculative page fault processing. >> + */ >> +pol = __get_vma_policy(vma, address); >> +if (!pol) >> +pol = get_task_policy(current); >> +if (pol && pol->mode == MPOL_INTERLEAVE) >> +goto unlock; >> + >> +if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP) >> +/* >> + * This could be detected by the check address against VMA's >> + * boundaries but we want to trace it as not supported instead >> + * of changed. >> + */ >> +goto unlock; >> + >> +if (address < READ_ONCE(vma->vm_start) >> +|| READ_ONCE(vma->vm_end) <= address) >> +goto unlock; >> + >> +/* >> + * The three following checks are copied from access_error from >> + * arch/x86/mm/fault.c >> + */ >> +if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, >> + flags & FAULT_FLAG_INSTRUCTION, >> + flags & FAULT_FLAG_REMOTE)) >> +goto unlock; >> + >> +/* This is one is required to check that the VMA has write access set */ >> +if (flags & FAULT_FLAG_WRITE) { >> +if (unlikely(!(vmf.vma_flags & VM_WRITE))) >> +goto unlock; >> +} else { >> +if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE >> +goto unlock; >> +} >> + >> +/* >> + * Do a speculative lookup of the PTE entry. 
>> +	 */
>> +	local_irq_disable();
>> +	pgd = pgd_offset(mm, address);
>> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +		goto out_walk;
>> +
>> +	p4d = p4d_alloc(mm, pgd, address);
>> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +		goto out_walk;
>> +
>> +	pud = pud_alloc(mm, p4d, address);
>> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +		goto out_walk;
>> +
>> +	pmd = pmd_offset(pud, address);
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		goto out_walk;
>> +
>> +	/*
>> +	 * The above does not allocate/instantiate page-tables because doing so
>> +	 * would lead to the possibility of instantiating page-tables after
>> +	 * free_pgtables() -- and consequently leaking them.
>> +	 *
>> +	 * The result is that we take at least one !speculative fault per PMD
>> +	 * in order to instantiate it.
>> +	 */
>
> Doing all this job and just giving up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered looking at how we can hand over from the speculative to
> the non-speculative path without starting from scratch (when possible)?
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
> That makes me extremely nervous... there could be all sorts of
> assumptions, esp. in arch code, about the fact that we never populate the
> tree without the mmap sem.
>
> We'd have to audit archs closely. Things like the page walk cache
> flushing on power etc...

Yes, the whole thing is quite risky. It will probably need some kind of
per-architecture opt-in scheme?

-Andi
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Mon, 2017-08-28 at 11:37 +0200, Peter Zijlstra wrote:
> > Doing all this job and just giving up because we cannot allocate page tables
> > looks very wasteful to me.
> >
> > Have you considered looking at how we can hand over from the speculative to
> > the non-speculative path without starting from scratch (when possible)?
>
> So we _can_ in fact allocate and install page-tables, but we have to be
> very careful about it. The interesting case is where we race with
> free_pgtables() and install a page that was just taken out.
>
> But since we already have the VMA I think we can do something like:

That makes me extremely nervous... there could be all sorts of
assumptions, esp. in arch code, about the fact that we never populate the
tree without the mmap sem.

We'd have to audit archs closely. Things like the page walk cache
flushing on power etc...

I don't mind the "retry"... we've brought the stuff into the L1 cache
already, which I would expect to be the bulk of the overhead, and the
allocation case isn't that common. Do we have numbers to show how
detrimental this is today?

Cheers,
Ben.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Sun, Aug 27, 2017 at 03:18:23AM +0300, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
> > +	/*
> > +	 * Can't call vm_ops services as we don't know what they would do
> > +	 * with the VMA.
> > +	 * This includes huge pages from hugetlbfs.
> > +	 */
> > +	if (vma->vm_ops)
> > +		goto unlock;
>
> I think we need to have a way to white-list safe ->vm_ops.

Either that, or simply teach all ->fault() callbacks about speculative
faults. Shouldn't be too hard, just 'work'.

> > +
> > +	if (unlikely(!vma->anon_vma))
> > +		goto unlock;
>
> It deserves a comment.

Yes, that was very much not intended. It wrecks most of the fun. This
really _should_ work for file maps too.

> > +	/*
> > +	 * Do a speculative lookup of the PTE entry.
> > +	 */
> > +	local_irq_disable();
> > +	pgd = pgd_offset(mm, address);
> > +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > +		goto out_walk;
> > +
> > +	p4d = p4d_alloc(mm, pgd, address);
> > +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> > +		goto out_walk;
> > +
> > +	pud = pud_alloc(mm, p4d, address);
> > +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > +		goto out_walk;
> > +
> > +	pmd = pmd_offset(pud, address);
> > +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > +		goto out_walk;
> > +
> > +	/*
> > +	 * The above does not allocate/instantiate page-tables because doing so
> > +	 * would lead to the possibility of instantiating page-tables after
> > +	 * free_pgtables() -- and consequently leaking them.
> > +	 *
> > +	 * The result is that we take at least one !speculative fault per PMD
> > +	 * in order to instantiate it.
> > +	 */
>
> Doing all this job and just giving up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered looking at how we can hand over from the speculative to
> the non-speculative path without starting from scratch (when possible)?

So we _can_ in fact allocate and install page-tables, but we have to be
very careful about it.
The interesting case is where we race with free_pgtables() and install a
page that was just taken out. But since we already have the VMA I think
we can do something like:

	if (p*d_none()) {
		p*d_t *new = p*d_alloc_one(mm, address);

		spin_lock(&mm->page_table_lock);
		if (!vma_changed_or_dead(vma, seq)) {
			if (p*d_none())
				p*d_populate(mm, p*d, new);
			else
				p*d_free(new);
			new = NULL;
		}
		spin_unlock(&mm->page_table_lock);

		if (new) {
			p*d_free(new);
			goto out_walk;
		}
	}

I just never bothered with that; I figured we ought to get the basics
working before trying to be clever.

> > +	/* Transparent huge pages are not supported. */
> > +	if (unlikely(pmd_trans_huge(*pmd)))
> > +		goto out_walk;
>
> That looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?

Not that I can remember, but I never really looked at THP; I don't think
we even had that when I did the first versions.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
> +/*
> + * vm_normal_page() adds some processing which should be done while
> + * holding the mmap_sem.
> + */
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +			     unsigned int flags)
> +{
> +	struct vm_fault vmf = {
> +		.address = address,
> +	};
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	int dead, seq, idx, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +	struct mempolicy *pol;
> +
> +	/* Clear flags that may lead to releasing the mmap_sem to retry */
> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +	flags |= FAULT_FLAG_SPECULATIVE;
> +
> +	idx = srcu_read_lock(&vma_srcu);
> +	vma = find_vma_srcu(mm, address);
> +	if (!vma)
> +		goto unlock;
> +
> +	/*
> +	 * Validate the VMA found by the lockless lookup.
> +	 */
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if ((seq & 1) || dead)
> +		goto unlock;
> +
> +	/*
> +	 * Can't call vm_ops services as we don't know what they would do
> +	 * with the VMA.
> +	 * This includes huge pages from hugetlbfs.
> +	 */
> +	if (vma->vm_ops)
> +		goto unlock;

I think we need to have a way to white-list safe ->vm_ops.

> +
> +	if (unlikely(!vma->anon_vma))
> +		goto unlock;

It deserves a comment.

> +
> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +	/* Can't call userland page fault handler in the speculative path */
> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +		goto unlock;
> +
> +	/*
> +	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
> +	 * are not compatible with the speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto unlock;
> +
> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +		/*
> +		 * This could be detected by checking the address against the
> +		 * VMA's boundaries, but we want to trace it as not supported
> +		 * instead of changed.
> +		 */
> +		goto unlock;
> +
> +	if (address < READ_ONCE(vma->vm_start)
> +	    || READ_ONCE(vma->vm_end) <= address)
> +		goto unlock;
> +
> +	/*
> +	 * The three following checks are copied from access_error() in
> +	 * arch/x86/mm/fault.c
> +	 */
> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +				       flags & FAULT_FLAG_INSTRUCTION,
> +				       flags & FAULT_FLAG_REMOTE))
> +		goto unlock;
> +
> +	/* This one is required to check that the VMA has write access set */
> +	if (flags & FAULT_FLAG_WRITE) {
> +		if (unlikely(!(vmf.vma_flags & VM_WRITE)))
> +			goto unlock;
> +	} else {
> +		if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
> +			goto unlock;
> +	}
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out_walk;
> +
> +	p4d = p4d_alloc(mm, pgd, address);
> +	if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
> +		goto out_walk;
> +
> +	pud = pud_alloc(mm, p4d, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out_walk;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto out_walk;
> +
> +	/*
> +	 * The above does not allocate/instantiate page-tables because doing so
> +	 * would lead to the possibility of instantiating page-tables after
> +	 * free_pgtables() -- and consequently leaking them.
> +	 *
> +	 * The result is that we take at least one !speculative fault per PMD
> +	 * in order to instantiate it.
> +	 */

Doing all this job and just giving up because we cannot allocate page tables
looks very wasteful to me.

Have you considered looking at how we can hand over from the speculative to
the non-speculative path without starting from scratch (when possible)?

> +	/* Transparent huge pages are not supported. */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		goto out_walk;

That looks like a blocker to me.

Is there any problem with making it supported (besides plain coding)?

> +
> +	vmf.vma = vma;
> +	vmf.pmd = pmd;
> +	vmf.pgoff = linear_page_index(vma, address);
> +
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On 20/08/2017 14:11, Sergey Senozhatsky wrote:
> On (08/18/17 00:05), Laurent Dufour wrote:
> [..]
>> +	/*
>> +	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
>> +	 * are not compatible with the speculative page fault processing.
>> +	 */
>> +	pol = __get_vma_policy(vma, address);
>> +	if (!pol)
>> +		pol = get_task_policy(current);
>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>> +		goto unlock;
>
> include/linux/mempolicy.h defines
>
> 	struct mempolicy *get_task_policy(struct task_struct *p);
> 	struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> 					   unsigned long addr);
>
> only for CONFIG_NUMA configs.

Thanks Sergey, I'll add #ifdef around this block.
Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
On (08/18/17 00:05), Laurent Dufour wrote:
[..]
> +	/*
> +	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
> +	 * are not compatible with the speculative page fault processing.
> +	 */
> +	pol = __get_vma_policy(vma, address);
> +	if (!pol)
> +		pol = get_task_policy(current);
> +	if (pol && pol->mode == MPOL_INTERLEAVE)
> +		goto unlock;

include/linux/mempolicy.h defines

	struct mempolicy *get_task_policy(struct task_struct *p);
	struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
					   unsigned long addr);

only for CONFIG_NUMA configs.

-ss
[PATCH v2 14/20] mm: Provide speculative fault infrastructure
From: Peter Zijlstra

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid; if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel)
[Manage the newly introduced pte_spinlock() for speculative page fault
 to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Call p4d_alloc() as it is safe since pgd is valid]
[Call pud_alloc() as it is safe since p4d is valid]
[Set fe.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Introduce __HAVE_ARCH_CALL_SPF to declare the SPF handler only when
 architecture is supporting it]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory
policy as we can't support the MPOL_INTERLEAVE case due to the
 processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]

Signed-off-by: Laurent Dufour
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h             |   5 +
 include/linux/pagemap.h        |   4 +-
 mm/internal.h                  |  14 +++
 mm/memory.c                    | 237 +++++++++++++++++++++++++++++++-
 5 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index a4e7ca0f3585..6cfdfca4cc2a 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -7,7 +7,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-	return !!(vma->vm_flags & VM_HUGETLB);
+	return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4ddd72b172..0fe0811d304f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ struct vm_fault {
 	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	unsigned long address;		/* Faulting virtual address */
+	unsigned int sequence;
 	pmd_t *pmd;			/* Pointer to pmd entry matching
 					 * the 'address' */
 	pud_t *pud;			/* Pointer to pud entry matching
@@ -1297,6 +1298,10 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags);
+#ifdef __HAVE_ARCH_CALL_SPF
+extern int handle_speculative_fault(struct mm_struct *mm,
+		unsigned long address, unsigned int flags);
+#endif /* __HAVE_ARCH_CALL_SPF */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 79b36f57c3ba..3a9735dfa6b6 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -443,8 +443,8 @@ static inline pgoff_t
linear_page_index(struct vm_area_struct *vma,
 	pgoff_t pgoff;
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return linear_hugepage_index(vma, address);
-	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
-	pgoff += vma->vm_pgoff;
+	pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
+	pgoff += READ_ONCE(vma->vm_pgoff);
 	return pgoff;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 736540f15936..9d6347e35747 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -45,6 +45,20 @@
 extern struct srcu_struct vma_srcu;
 
 extern struct v