Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-06-09 Thread Vlastimil Babka
On 05/23/2017 02:42 PM, Vlastimil Babka wrote:
> On 05/16/2017 10:29 PM, Andrea Arcangeli wrote:
>> On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
>>>
>>> pmdp_invalidate() does:
>>>
>>> pmd_t entry = *pmdp;
>>> set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>>>
>>> so it's not atomic, and if the CPU sets dirty or accessed in the middle of
>>> this, they will be lost?
>>
>> I agree it looks like the dirty bit can be lost. Furthermore this also
>> loses an MMU notifier invalidate, which will lead to corruption at the
>> secondary MMU level (which will keep using the old protection
>> permissions, potentially continuing to write to a wrprotected page).
> 
> Oh, I didn't paste the whole function, just the pmd manipulation.
> There's also a third line:
> 
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> 
> so there's no missing invalidate, AFAICS? Sorry for the confusion.

Oh, a TLB flush is not an MMU notifier invalidate...

>>>
>>> But I don't see how the other invalidate caller
>>> __split_huge_pmd_locked() deals with this either. Andrea, any idea?
>>
>> The original code I wrote did this in __split_huge_page_map to create
>> the "entry" to establish in the pte pagetables:
>>
>> entry = mk_pte(page + i, vma->vm_page_prot);
>> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>
>> For anonymous memory the dirty bit is only meaningful for swapping,
>> and THP couldn't be swapped, so setting it unconditionally avoided any
>> issue with the pmdp_invalidate; pmdp_establish sequence.
> 
> Yeah, but now we are going to swap THPs, and we have shmem THPs...
> 
>> pmdp_invalidate is needed primarily to avoid aliasing of two different
>> TLB translations pointing from the same virtual address to the same
>> physical address, which triggered machine checks (while needing to keep
>> the pmd huge at all times; back then it was also splitting huge.
>> Splitting is a software bit, so userland could still access the data;
>> the splitting bit only blocked kernel code from manipulating it, similar
>> to what the migration entry does right now upstream, except those prevent
>> userland from accessing the page during split, which is less efficient
>> than the splitting bit, but at least it's only used for the physical
>> split. Back then there was no difference between virtual and physical
>> split, and the physical split is less frequent than the virtual one right
>> now).
> 
> This took me a while to grasp, but I think I understand now :)
> 
>> It looks like this needs a pmdp_populate that atomically grabs the
>> value of the pmd and returns it like pmdp_huge_get_and_clear_notify
>> does
> 
> pmdp_huge_get_and_clear_notify() is now gone...
> 
>> and a _notify variant to use when "freeze" is false (if freeze is true
>> the MMU notifier invalidate must have happened when the pmd was set to
>> a migration entry). If pmdp_populate_notify (freeze==true) /
>> pmdp_populate (freeze==false) would return the old pmd value
>> atomically with xchg() (just that instead of setting it to 0 we should
>> set it to the mknotpresent one), then we can set the dirty bit on the
>> ptes (__split_huge_pmd_locked) or in the pmd itself in change_huge_pmd
>> accordingly.

I was trying to look into this yesterday, but I know next to nothing
about MMU notifiers (see above :), so I would probably get it wrong.
But it should get fixed, so... Kirill?
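
A minimal sketch of the _notify wiring Andrea describes, for the
freeze==false path only (with freeze==true the invalidate already happened
when the pmd was set to a migration entry). pmdp_populate() stands for the
proposed xchg()-based primitive (sketched after Andrea's 2017-05-16 mail
below) and is not an existing upstream function; only
mmu_notifier_invalidate_range() exists:

	/* kernel context; pmdp_populate() is the proposed primitive */
	static inline pmd_t pmdp_populate_notify(struct vm_area_struct *vma,
						 unsigned long address,
						 pmd_t *pmdp, pmd_t pmd)
	{
		/* atomically install the new pmd and grab the old one */
		pmd_t old = pmdp_populate(vma, address, pmdp, pmd);

		/* tell secondary MMUs their cached translation is stale */
		mmu_notifier_invalidate_range(vma->vm_mm, address,
					      address + HPAGE_PMD_SIZE);
		return old;
	}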

> I think the confusion was partially caused by the comment at the
> original caller of pmdp_invalidate():
> 
> we first mark the
> * current pmd notpresent (atomically because here the pmd_trans_huge
> * and pmd_trans_splitting must remain set at all times on the pmd
> * until the split is complete for this pmd),
> 
> It says "atomically", but in fact that only means that the pmd_trans_huge
> and pmd_trans_splitting flags are not temporarily cleared at any point
> in time, right? It's not a true atomic swap.
> 
>> Obtaining the "dirty" flag information from the pmd read before
>> calling pmdp_invalidate is not ok (losing the _notify is also not ok).
> 
> Right.
> 
>> Thanks!
>> Andrea



Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-05-23 Thread Vlastimil Babka
On 05/16/2017 10:29 PM, Andrea Arcangeli wrote:
> On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
>>
>> pmdp_invalidate() does:
>>
>> pmd_t entry = *pmdp;
>> set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>>
>> so it's not atomic, and if the CPU sets dirty or accessed in the middle of
>> this, they will be lost?
> 
> I agree it looks like the dirty bit can be lost. Furthermore this also
> loses an MMU notifier invalidate, which will lead to corruption at the
> secondary MMU level (which will keep using the old protection
> permissions, potentially continuing to write to a wrprotected page).

Oh, I didn't paste the whole function, just the pmd manipulation.
There's also a third line:

flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

so there's no missing invalidate, AFAICS? Sorry for the confusion.

>>
>> But I don't see how the other invalidate caller
>> __split_huge_pmd_locked() deals with this either. Andrea, any idea?
> 
> The original code I wrote did this in __split_huge_page_map to create
> the "entry" to establish in the pte pagetables:
> 
>  entry = mk_pte(page + i, vma->vm_page_prot);
>  entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> 
> For anonymous memory the dirty bit is only meaningful for swapping,
> and THP couldn't be swapped, so setting it unconditionally avoided any
> issue with the pmdp_invalidate; pmdp_establish sequence.

Yeah, but now we are going to swap THPs, and we have shmem THPs...

> pmdp_invalidate is needed primarily to avoid aliasing of two different
> TLB translations pointing from the same virtual address to the same
> physical address, which triggered machine checks (while needing to keep
> the pmd huge at all times; back then it was also splitting huge.
> Splitting is a software bit, so userland could still access the data;
> the splitting bit only blocked kernel code from manipulating it, similar
> to what the migration entry does right now upstream, except those prevent
> userland from accessing the page during split, which is less efficient
> than the splitting bit, but at least it's only used for the physical
> split. Back then there was no difference between virtual and physical
> split, and the physical split is less frequent than the virtual one right
> now).

This took me a while to grasp, but I think I understand now :)

> It looks like this needs a pmdp_populate that atomically grabs the
> value of the pmd and returns it like pmdp_huge_get_and_clear_notify
> does

pmdp_huge_get_and_clear_notify() is now gone...

> and a _notify variant to use when "freeze" is false (if freeze is true
> the MMU notifier invalidate must have happened when the pmd was set to
> a migration entry). If pmdp_populate_notify (freeze==true) /
> pmdp_populate (freeze==false) would return the old pmd value
> atomically with xchg() (just that instead of setting it to 0 we should
> set it to the mknotpresent one), then we can set the dirty bit on the
> ptes (__split_huge_pmd_locked) or in the pmd itself in change_huge_pmd
> accordingly.

I think the confusion was partially caused by the comment at the
original caller of pmdp_invalidate():

we first mark the
* current pmd notpresent (atomically because here the pmd_trans_huge
* and pmd_trans_splitting must remain set at all times on the pmd
* until the split is complete for this pmd),

It says "atomically", but in fact that only means that the pmd_trans_huge
and pmd_trans_splitting flags are not temporarily cleared at any point
in time, right? It's not a true atomic swap.

> Obtaining the "dirty" flag information from the pmd read before
> calling pmdp_invalidate is not ok (losing the _notify is also not ok).

Right.

> Thanks!
> Andrea



Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-05-16 Thread Andrea Arcangeli
On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> > In the prot_numa case we are under down_read(mmap_sem). It's critical
> > not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
> > which is also under down_read(mmap_sem):
> > 
> > CPU0:   CPU1:
> > change_huge_pmd(prot_numa=1)
> >  pmdp_huge_get_and_clear_notify()
> > madvise_dontneed()
> >  zap_pmd_range()
> >   pmd_trans_huge(*pmd) == 0 (without ptl)
> >   // skip the pmd
> >  set_pmd_at();
> >  // pmd is re-established
> > 
> > The race makes MADV_DONTNEED miss the huge pmd and not clear it,
> > which may break userspace.
> > 
> > Found by code analysis; never saw it triggered.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  mm/huge_memory.c | 34 +-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e7ce73b2b208..bb2b3646bd78 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > if (prot_numa && pmd_protnone(*pmd))
> > goto unlock;
> >  
> > -   entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> > +   /*
> > +* In the prot_numa case we are under down_read(mmap_sem). It's critical
> > +* not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
> > +* which is also under down_read(mmap_sem):
> > +*
> > +*  CPU0:   CPU1:
> > +*  change_huge_pmd(prot_numa=1)
> > +*   pmdp_huge_get_and_clear_notify()
> > +* madvise_dontneed()
> > +*  zap_pmd_range()
> > +*   pmd_trans_huge(*pmd) == 0 (without ptl)
> > +*   // skip the pmd
> > +*   set_pmd_at();
> > +*   // pmd is re-established
> > +*
> > +* The race makes MADV_DONTNEED miss the huge pmd and not clear it,
> > +* which may break userspace.
> > +*
> > +* pmdp_invalidate() is required to make sure we don't miss
> > +* dirty/young flags set by hardware.
> > +*/
> > +   entry = *pmd;
> > +   pmdp_invalidate(vma, addr, pmd);
> > +
> > +   /*
> > +* Recover dirty/young flags.  It relies on pmdp_invalidate to not
> > +* corrupt them.
> > +*/
> 
> pmdp_invalidate() does:
> 
> pmd_t entry = *pmdp;
> set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 
> so it's not atomic, and if the CPU sets dirty or accessed in the middle of
> this, they will be lost?

I agree it looks like the dirty bit can be lost. Furthermore this also
loses an MMU notifier invalidate, which will lead to corruption at the
secondary MMU level (which will keep using the old protection
permissions, potentially continuing to write to a wrprotected page).

> 
> But I don't see how the other invalidate caller
> __split_huge_pmd_locked() deals with this either. Andrea, any idea?

The original code I wrote did this in __split_huge_page_map to create
the "entry" to establish in the pte pagetables:

	entry = mk_pte(page + i, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

For anonymous memory the dirty bit is only meaningful for swapping,
and THP couldn't be swapped, so setting it unconditionally avoided any
issue with the pmdp_invalidate; pmdp_establish sequence.

pmdp_invalidate is needed primarily to avoid aliasing of two different
TLB translations pointing from the same virtual address to the same
physical address, which triggered machine checks (while needing to keep
the pmd huge at all times; back then it was also splitting huge.
Splitting is a software bit, so userland could still access the data;
the splitting bit only blocked kernel code from manipulating it, similar
to what the migration entry does right now upstream, except those prevent
userland from accessing the page during split, which is less efficient
than the splitting bit, but at least it's only used for the physical
split. Back then there was no difference between virtual and physical
split, and the physical split is less frequent than the virtual one right
now).

It looks like this needs a pmdp_populate that atomically grabs the
value of the pmd and returns it like pmdp_huge_get_and_clear_notify
does, and a _notify variant to use when "freeze" is false (if freeze is true
the MMU notifier invalidate must have happened when the pmd was set to
a migration entry). If pmdp_populate_notify (freeze==true) /
pmdp_populate (freeze==false) would return the old pmd value
atomically with xchg() (just that instead of setting it to 0 we should
set it to the mknotpresent one), then we can set the dirty bit on the
ptes (__split_huge_pmd_locked) or in the pmd itself in change_huge_pmd
accordingly.

Obtaining the "dirty" flag information from the pmd read before
calling pmdp_invalidate is not ok (losing the _notify is also not ok).

Thanks!
Andrea
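
The xchg()-based primitive Andrea proposes might look like the following
sketch (kernel context, assuming an x86-style layout where the whole pmd
fits in one machine word so xchg() can swap it atomically; the
pmdp_populate name is his proposal, not an existing function):

	static inline pmd_t pmdp_populate(struct vm_area_struct *vma,
					  unsigned long address,
					  pmd_t *pmdp, pmd_t pmd)
	{
		/*
		 * xchg() installs the new entry and returns the value the
		 * hardware saw last, so any dirty/young bit set up to this
		 * instant lands in the return value instead of being
		 * overwritten by a stale snapshot.
		 */
		return __pmd(xchg(&pmdp->pmd, pmd_val(pmd)));
	}

	/* drop-in replacement for the non-atomic pmdp_invalidate() */
	static inline pmd_t pmdp_invalidate_atomic(struct vm_area_struct *vma,
						   unsigned long address,
						   pmd_t *pmdp)
	{
		pmd_t old = pmdp_populate(vma, address, pmdp,
					  pmd_mknotpresent(*pmdp));

		/* once not present, hardware can set no further A/D bits */
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return old;
	}

The caller can then transfer pmd_dirty()/pmd_young() from the returned old
value with no window in between.
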

Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-05-16 Thread Vlastimil Babka
On 04/12/2017 03:33 PM, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
>> In the prot_numa case we are under down_read(mmap_sem). It's critical
>> not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
>> which is also under down_read(mmap_sem):
>>
>>  CPU0:   CPU1:
>>  change_huge_pmd(prot_numa=1)
>>   pmdp_huge_get_and_clear_notify()
>> madvise_dontneed()
>>  zap_pmd_range()
>>   pmd_trans_huge(*pmd) == 0 (without ptl)
>>   // skip the pmd
>>   set_pmd_at();
>>   // pmd is re-established
>>
>> The race makes MADV_DONTNEED miss the huge pmd and not clear it,
>> which may break userspace.
>>
>> Found by code analysis; never saw it triggered.
>>
>> Signed-off-by: Kirill A. Shutemov 
>> ---
>>  mm/huge_memory.c | 34 +-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e7ce73b2b208..bb2b3646bd78 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  if (prot_numa && pmd_protnone(*pmd))
>>  goto unlock;
>>  
>> -entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
>> +/*
>> + * In the prot_numa case we are under down_read(mmap_sem). It's critical
>> + * not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
>> + * which is also under down_read(mmap_sem):
>> + *
>> + *  CPU0:   CPU1:
>> + *  change_huge_pmd(prot_numa=1)
>> + *   pmdp_huge_get_and_clear_notify()
>> + * madvise_dontneed()
>> + *  zap_pmd_range()
>> + *   pmd_trans_huge(*pmd) == 0 (without ptl)
>> + *   // skip the pmd
>> + *   set_pmd_at();
>> + *   // pmd is re-established
>> + *
>> + * The race makes MADV_DONTNEED miss the huge pmd and not clear it,
>> + * which may break userspace.
>> + *
>> + * pmdp_invalidate() is required to make sure we don't miss
>> + * dirty/young flags set by hardware.
>> + */
>> +entry = *pmd;
>> +pmdp_invalidate(vma, addr, pmd);
>> +
>> +/*
>> + * Recover dirty/young flags.  It relies on pmdp_invalidate to not
>> + * corrupt them.
>> + */
> 
> pmdp_invalidate() does:
> 
> pmd_t entry = *pmdp;
> set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 
> so it's not atomic, and if the CPU sets dirty or accessed in the middle of
> this, they will be lost?
> 
> But I don't see how the other invalidate caller
> __split_huge_pmd_locked() deals with this either. Andrea, any idea?

Looks like we didn't resolve this, and meanwhile the patch is in mainline
as ced108037c2aa. CC Andy, who deals with TLB a lot these days.

> Vlastimil
> 
>> +if (pmd_dirty(*pmd))
>> +entry = pmd_mkdirty(entry);
>> +if (pmd_young(*pmd))
>> +entry = pmd_mkyoung(entry);
>> +
>>  entry = pmd_modify(entry, newprot);
>>  if (preserve_write)
>>  entry = pmd_mk_savedwrite(entry);
>>
> 



Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-04-12 Thread Vlastimil Babka
On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> In the prot_numa case we are under down_read(mmap_sem). It's critical
> not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
> which is also under down_read(mmap_sem):
> 
>   CPU0:   CPU1:
>   change_huge_pmd(prot_numa=1)
>pmdp_huge_get_and_clear_notify()
> madvise_dontneed()
>  zap_pmd_range()
>   pmd_trans_huge(*pmd) == 0 (without ptl)
>   // skip the pmd
>set_pmd_at();
>// pmd is re-established
> 
> The race makes MADV_DONTNEED miss the huge pmd and not clear it,
> which may break userspace.
> 
> Found by code analysis; never saw it triggered.
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  mm/huge_memory.c | 34 +-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e7ce73b2b208..bb2b3646bd78 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   if (prot_numa && pmd_protnone(*pmd))
>   goto unlock;
>  
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> + /*
> +  * In the prot_numa case we are under down_read(mmap_sem). It's critical
> +  * not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
> +  * which is also under down_read(mmap_sem):
> +  *
> +  *  CPU0:   CPU1:
> +  *  change_huge_pmd(prot_numa=1)
> +  *   pmdp_huge_get_and_clear_notify()
> +  * madvise_dontneed()
> +  *  zap_pmd_range()
> +  *   pmd_trans_huge(*pmd) == 0 (without ptl)
> +  *   // skip the pmd
> +  *   set_pmd_at();
> +  *   // pmd is re-established
> +  *
> +  * The race makes MADV_DONTNEED miss the huge pmd and not clear it,
> +  * which may break userspace.
> +  *
> +  * pmdp_invalidate() is required to make sure we don't miss
> +  * dirty/young flags set by hardware.
> +  */
> + entry = *pmd;
> + pmdp_invalidate(vma, addr, pmd);
> +
> + /*
> +  * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> +  * corrupt them.
> +  */

pmdp_invalidate() does:

pmd_t entry = *pmdp;
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

so it's not atomic, and if the CPU sets dirty or accessed in the middle of
this, they will be lost?
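
Spelled out against the patch's recovery logic (a sketch; one racing CPU,
generic pmdp_invalidate() as quoted above):

	entry = *pmd;			/* caller snapshot A */

	/* inside pmdp_invalidate(): */
	pmd_t e = *pmdp;		/* internal snapshot B, B >= A */
	/*
	 * <-- a hardware walker can set _PAGE_DIRTY/_PAGE_ACCESSED here;
	 *     those bits end up in neither B nor the value stored below,
	 *     so they are lost for good
	 */
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(e));
	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

	/* back in change_huge_pmd(): recovery sees only bits from B */
	if (pmd_dirty(*pmd))
		entry = pmd_mkdirty(entry);
	if (pmd_young(*pmd))
		entry = pmd_mkyoung(entry);

Bits set between A and B do survive (they are still in the not-present
value that *pmd holds afterwards), so the recovery narrows the window but
cannot close it.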

But I don't see how the other invalidate caller
__split_huge_pmd_locked() deals with this either. Andrea, any idea?

Vlastimil

> + if (pmd_dirty(*pmd))
> + entry = pmd_mkdirty(entry);
> + if (pmd_young(*pmd))
> + entry = pmd_mkyoung(entry);
> +
>   entry = pmd_modify(entry, newprot);
>   if (preserve_write)
>   entry = pmd_mk_savedwrite(entry);
> 



Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-03-03 Thread Dave Hansen
On 03/02/2017 07:10 AM, Kirill A. Shutemov wrote:
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   if (prot_numa && pmd_protnone(*pmd))
>   goto unlock;
>  
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);

Are there any remaining call sites for pmdp_huge_get_and_clear_notify()
after this?


[PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

2017-03-02 Thread Kirill A. Shutemov
In the prot_numa case we are under down_read(mmap_sem). It's critical
not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
which is also under down_read(mmap_sem):

CPU0:   CPU1:
change_huge_pmd(prot_numa=1)
 pmdp_huge_get_and_clear_notify()
madvise_dontneed()
 zap_pmd_range()
  pmd_trans_huge(*pmd) == 0 (without ptl)
  // skip the pmd
 set_pmd_at();
 // pmd is re-established

The race makes MADV_DONTNEED miss the huge pmd and not clear it,
which may break userspace.

Found by code analysis; never saw it triggered.

Signed-off-by: Kirill A. Shutemov 
---
 mm/huge_memory.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e7ce73b2b208..bb2b3646bd78 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (prot_numa && pmd_protnone(*pmd))
goto unlock;
 
-   entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+   /*
+* In the prot_numa case we are under down_read(mmap_sem). It's critical
+* not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
+* which is also under down_read(mmap_sem):
+*
+*  CPU0:   CPU1:
+*  change_huge_pmd(prot_numa=1)
+*   pmdp_huge_get_and_clear_notify()
+* madvise_dontneed()
+*  zap_pmd_range()
+*   pmd_trans_huge(*pmd) == 0 (without ptl)
+*   // skip the pmd
+*   set_pmd_at();
+*   // pmd is re-established
+*
+* The race makes MADV_DONTNEED miss the huge pmd and not clear it,
+* which may break userspace.
+*
+* pmdp_invalidate() is required to make sure we don't miss
+* dirty/young flags set by hardware.
+*/
+   entry = *pmd;
+   pmdp_invalidate(vma, addr, pmd);
+
+   /*
+* Recover dirty/young flags.  It relies on pmdp_invalidate to not
+* corrupt them.
+*/
+   if (pmd_dirty(*pmd))
+   entry = pmd_mkdirty(entry);
+   if (pmd_young(*pmd))
+   entry = pmd_mkyoung(entry);
+
entry = pmd_modify(entry, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
-- 
2.11.0
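
For reference, the userspace contract the race would break: after
madvise(MADV_DONTNEED) on private anonymous memory, the next read of the
range must return zeroes. A single-threaded sketch of that contract (it
cannot hit the race by itself, which additionally needs NUMA balancing's
change_huge_pmd(prot_numa=1) running on another CPU):

	#include <assert.h>
	#include <stddef.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t sz = 2UL << 20;	/* one pmd-mapped THP on x86-64 */
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 0xaa, sz);		/* populate and dirty it */
		madvise(p, sz, MADV_DONTNEED);	/* zap the range */

		/*
		 * Guaranteed zero for anonymous memory -- unless
		 * zap_pmd_range() skipped the transiently cleared pmd.
		 */
		assert(p[0] == 0);
		return 0;
	}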


