Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-23 Thread Xiao Guangrong
On 08/23/2012 03:50 AM, Andrea Arcangeli wrote:
> Hi Andrew,
> 
> On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
>> On Wed, 22 Aug 2012 18:29:55 +0200
>> Andrea Arcangeli  wrote:
>>
>>> On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
 On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> CPU0  CPU1
>   oldpage[1] == 0 (both guest & host)
> oldpage[0] = 1
> trigger do_wp_page

 We always do ptep_clear_flush before set_pte_at_notify(),
 at this point, we have done:
   pte = 0 and flush all tlbs
> mmu_notifier_change_pte
> spte = newpage + writable
>   guest does newpage[1] = 1
>   vmexit
>   host read oldpage[1] == 0

   It can not happen, at this point pte = 0, host can not
  access oldpage anymore, host read can generate #PF, it
   will be blocked on page table lock until CPU 0 release 
 the lock.
>>>
>>> Agreed, this is why your fix is safe.
>>>
>>> ...
>>>
>>> Thanks a lot for fixing this subtle race!
>>
>> I'll take that as an ack.
> 
> Yes thanks!
> 

Andrew, Andrea,

Thanks for your time to review the patch.

> I'd also like a comment that explains why in that case the order is
> reversed. The reverse order immediately rings an alarm bell otherwise
> ;). But the comment can be added with an incremental patch.
> 
>> Unfortunately we weren't told the user-visible effects of the bug,
>> which often makes it hard to determine which kernel versions should be
>> patched.  Please do always provide this information when fixing a bug.

Okay, i will pay more attention to this.

> 
> This is best answered by Xiao who said it's a testcase triggering
> this.
> 
> It requires the guest reading memory on CPU0 while the host writes to
> the same memory on CPU1, while CPU2 triggers the copy on write fault
> on another part of the same page (slightly before CPU1 writes). The
> host writes of CPU1 would need to happen in a microsecond window, and
> they wouldn't be immediately propagated to the guest in CPU0. They
> would still appear in the guest but with a microsecond delay (the
> guest has the spte mapped readonly when this happens so it's only a
> guest "microsecond delayed reading" problem as far as I can tell). I
> guess most of the time it would fall into the undefined by timing
> scenario so it's hard to tell how the side effect could escalate.

Yes, i agree. :)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-23 Thread Xiao Guangrong
On 08/23/2012 12:37 AM, Andrea Arcangeli wrote:
> On Wed, Aug 22, 2012 at 11:51:17AM +0800, Xiao Guangrong wrote:
>> Hmm, in KSM code, i found this code in replace_page:
>>
>> set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
>>
>> It is possible to establish a writable pte, no?
> 
> Hugh already answered this thanks. Further details on the vm_page_prot
> are in top of mmap.c, and KSM never scans MAP_SHARED vmas.
> 

Yes, i see that, thank you, Andrea!

>> Unfortunately, all these bugs are triggered by test cases.
> 
> Sure, I've seen the very Oops for the other one, and this one also can
> trigger if unlucky.
> 
> This one can trigger with KVM but only if KSM is enabled or with live
> migration or with device hotplug or some other event that triggers a
> fork in qemu.
> 
> My curiosity about the other one in the exit/unregister/release paths
> is if it really ever triggered with KVM. Because I can't easily see
> how it could trigger. By the time kvm_destroy_vm or exit_mmap() runs,
> no vcpu can be in guest mode anymore, so it cannot matter whatever the
> status of any leftover spte at that time.
> 

vcpu is not in guest mode, but the memory can be still hold in KVM MMU.

Consider this case:

   CPU 0  CPU 1

   create kvm
   create vcpu thread

   [ Guest is happily running ]

  send kill signal to the process

   call do_exit
  mmput mm
  exit_mmap, then
 delete mmu_notify

 reclaim the memory of these threads
!!
 Now, the page has been reclaimed but
 it is still hold in KVM MMU

mmu_notify->release
 !!
   delete spte, and call
   mark_page_accessed/mark_page_dirty for
   the page which has already been freed on CPU 1


 exit_files
release kvm/vcpu file, then
kvm and vcpu are destroyed.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-23 Thread Xiao Guangrong
On 08/23/2012 12:37 AM, Andrea Arcangeli wrote:
 On Wed, Aug 22, 2012 at 11:51:17AM +0800, Xiao Guangrong wrote:
 Hmm, in KSM code, i found this code in replace_page:

 set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma-vm_page_prot));

 It is possible to establish a writable pte, no?
 
 Hugh already answered this thanks. Further details on the vm_page_prot
 are in top of mmap.c, and KSM never scans MAP_SHARED vmas.
 

Yes, i see that, thank you, Andrea!

 Unfortunately, all these bugs are triggered by test cases.
 
 Sure, I've seen the very Oops for the other one, and this one also can
 trigger if unlucky.
 
 This one can trigger with KVM but only if KSM is enabled or with live
 migration or with device hotplug or some other event that triggers a
 fork in qemu.
 
 My curiosity about the other one in the exit/unregister/release paths
 is if it really ever triggered with KVM. Because I can't easily see
 how it could trigger. By the time kvm_destroy_vm or exit_mmap() runs,
 no vcpu can be in guest mode anymore, so it cannot matter whatever the
 status of any leftover spte at that time.
 

vcpu is not in guest mode, but the memory can be still hold in KVM MMU.

Consider this case:

   CPU 0  CPU 1

   create kvm
   create vcpu thread

   [ Guest is happily running ]

  send kill signal to the process

   call do_exit
  mmput mm
  exit_mmap, then
 delete mmu_notify

 reclaim the memory of these threads
!!
 Now, the page has been reclaimed but
 it is still hold in KVM MMU

mmu_notify-release
 !!
   delete spte, and call
   mark_page_accessed/mark_page_dirty for
   the page which has already been freed on CPU 1


 exit_files
release kvm/vcpu file, then
kvm and vcpu are destroyed.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-23 Thread Xiao Guangrong
On 08/23/2012 03:50 AM, Andrea Arcangeli wrote:
 Hi Andrew,
 
 On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
 On Wed, 22 Aug 2012 18:29:55 +0200
 Andrea Arcangeli aarca...@redhat.com wrote:

 On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
 On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
 CPU0  CPU1
   oldpage[1] == 0 (both guest  host)
 oldpage[0] = 1
 trigger do_wp_page

 We always do ptep_clear_flush before set_pte_at_notify(),
 at this point, we have done:
   pte = 0 and flush all tlbs
 mmu_notifier_change_pte
 spte = newpage + writable
   guest does newpage[1] = 1
   vmexit
   host read oldpage[1] == 0

   It can not happen, at this point pte = 0, host can not
  access oldpage anymore, host read can generate #PF, it
   will be blocked on page table lock until CPU 0 release 
 the lock.

 Agreed, this is why your fix is safe.

 ...

 Thanks a lot for fixing this subtle race!

 I'll take that as an ack.
 
 Yes thanks!
 

Andrew, Andrea,

Thanks for your time to review the patch.

 I'd also like a comment that explains why in that case the order is
 reversed. The reverse order immediately rings an alarm bell otherwise
 ;). But the comment can be added with an incremental patch.
 
 Unfortunately we weren't told the user-visible effects of the bug,
 which often makes it hard to determine which kernel versions should be
 patched.  Please do always provide this information when fixing a bug.

Okay, i will pay more attention to this.

 
 This is best answered by Xiao who said it's a testcase triggering
 this.
 
 It requires the guest reading memory on CPU0 while the host writes to
 the same memory on CPU1, while CPU2 triggers the copy on write fault
 on another part of the same page (slightly before CPU1 writes). The
 host writes of CPU1 would need to happen in a microsecond window, and
 they wouldn't be immediately propagated to the guest in CPU0. They
 would still appear in the guest but with a microsecond delay (the
 guest has the spte mapped readonly when this happens so it's only a
 guest microsecond delayed reading problem as far as I can tell). I
 guess most of the time it would fall into the undefined by timing
 scenario so it's hard to tell how the side effect could escalate.

Yes, i agree. :)


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
On Wed, Aug 22, 2012 at 12:58:05PM -0700, Andrew Morton wrote:
> If you can suggest some text I'll type it in right now.

Ok ;), I tried below:

This is safe to start by updating the secondary MMUs, because the
relevant primary MMU pte invalidate must have already happened with a
ptep_clear_flush before set_pte_at_notify has been invoked. Updating
the secondary MMUs first is required when we change both the
protection of the mapping from read-only to read-write and the pfn
(like during copy on write page faults). Otherwise the old page would
remain mapped readonly in the secondary MMUs after the new page is
already writable by some CPU through the primary MMU."

Thanks!
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrew Morton
On Wed, 22 Aug 2012 21:50:43 +0200
Andrea Arcangeli  wrote:

> Hi Andrew,
> 
> On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
> > On Wed, 22 Aug 2012 18:29:55 +0200
> > Andrea Arcangeli  wrote:
> > 
> > > On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
> > > > On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> > > > > CPU0  CPU1
> > > > >   oldpage[1] == 0 (both guest & host)
> > > > > oldpage[0] = 1
> > > > > trigger do_wp_page
> > > > 
> > > > We always do ptep_clear_flush before set_pte_at_notify(),
> > > > at this point, we have done:
> > > >   pte = 0 and flush all tlbs
> > > > > mmu_notifier_change_pte
> > > > > spte = newpage + writable
> > > > >   guest does newpage[1] = 1
> > > > >   vmexit
> > > > >   host read oldpage[1] == 0
> > > > 
> > > >   It can not happen, at this point pte = 0, host can not
> > > >   access oldpage anymore, host read can generate #PF, it
> > > >   will be blocked on page table lock until CPU 0 
> > > > release the lock.
> > > 
> > > Agreed, this is why your fix is safe.
> > > 
> > > ...
> > >
> > > Thanks a lot for fixing this subtle race!
> > 
> > I'll take that as an ack.
> 
> Yes thanks!
> 
> I'd also like a comment that explains why in that case the order is
> reversed. The reverse order immediately rings an alarm bell otherwise
> ;). But the comment can be added with an incremental patch.

If you can suggest some text I'll type it in right now.

> > Unfortunately we weren't told the user-visible effects of the bug,
> > which often makes it hard to determine which kernel versions should be
> > patched.  Please do always provide this information when fixing a bug.
> 
> This is best answered by Xiao who said it's a testcase triggering
> this.
> 
> It requires the guest reading memory on CPU0 while the host writes to
> the same memory on CPU1, while CPU2 triggers the copy on write fault
> on another part of the same page (slightly before CPU1 writes). The
> host writes of CPU1 would need to happen in a microsecond window, and
> they wouldn't be immediately propagated to the guest in CPU0. They
> would still appear in the guest but with a microsecond delay (the
> guest has the spte mapped readonly when this happens so it's only a
> guest "microsecond delayed reading" problem as far as I can tell). I
> guess most of the time it would fall into the undefined by timing
> scenario so it's hard to tell how the side effect could escalate.

OK, thanks.  For this sort of thing I am dependent upon KVM people to
suggest whether the fix should be in 3.6 and whether -stable needs it.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
Hi Andrew,

On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
> On Wed, 22 Aug 2012 18:29:55 +0200
> Andrea Arcangeli  wrote:
> 
> > On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
> > > On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> > > > CPU0CPU1
> > > > oldpage[1] == 0 (both guest & host)
> > > > oldpage[0] = 1
> > > > trigger do_wp_page
> > > 
> > > We always do ptep_clear_flush before set_pte_at_notify(),
> > > at this point, we have done:
> > >   pte = 0 and flush all tlbs
> > > > mmu_notifier_change_pte
> > > > spte = newpage + writable
> > > > guest does newpage[1] = 1
> > > > vmexit
> > > > host read oldpage[1] == 0
> > > 
> > >   It can not happen, at this point pte = 0, host can not
> > > access oldpage anymore, host read can generate #PF, it
> > >   will be blocked on page table lock until CPU 0 release 
> > > the lock.
> > 
> > Agreed, this is why your fix is safe.
> > 
> > ...
> >
> > Thanks a lot for fixing this subtle race!
> 
> I'll take that as an ack.

Yes thanks!

I'd also like a comment that explains why in that case the order is
reversed. The reverse order immediately rings an alarm bell otherwise
;). But the comment can be added with an incremental patch.

> Unfortunately we weren't told the user-visible effects of the bug,
> which often makes it hard to determine which kernel versions should be
> patched.  Please do always provide this information when fixing a bug.

This is best answered by Xiao who said it's a testcase triggering
this.

It requires the guest reading memory on CPU0 while the host writes to
the same memory on CPU1, while CPU2 triggers the copy on write fault
on another part of the same page (slightly before CPU1 writes). The
host writes of CPU1 would need to happen in a microsecond window, and
they wouldn't be immediately propagated to the guest in CPU0. They
would still appear in the guest but with a microsecond delay (the
guest has the spte mapped readonly when this happens so it's only a
guest "microsecond delayed reading" problem as far as I can tell). I
guess most of the time it would fall into the undefined by timing
scenario so it's hard to tell how the side effect could escalate.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrew Morton
On Wed, 22 Aug 2012 18:29:55 +0200
Andrea Arcangeli  wrote:

> On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
> > On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> > > CPU0  CPU1
> > >   oldpage[1] == 0 (both guest & host)
> > > oldpage[0] = 1
> > > trigger do_wp_page
> > 
> > We always do ptep_clear_flush before set_pte_at_notify(),
> > at this point, we have done:
> >   pte = 0 and flush all tlbs
> > > mmu_notifier_change_pte
> > > spte = newpage + writable
> > >   guest does newpage[1] = 1
> > >   vmexit
> > >   host read oldpage[1] == 0
> > 
> >   It can not happen, at this point pte = 0, host can not
> >   access oldpage anymore, host read can generate #PF, it
> >   will be blocked on page table lock until CPU 0 release 
> > the lock.
> 
> Agreed, this is why your fix is safe.
> 
> ...
>
> Thanks a lot for fixing this subtle race!

I'll take that as an ack.

Unfortunately we weren't told the user-visible effects of the bug,
which often makes it hard to determine which kernel versions should be
patched.  Please do always provide this information when fixing a bug.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
On Wed, Aug 22, 2012 at 11:51:17AM +0800, Xiao Guangrong wrote:
> Hmm, in KSM code, i found this code in replace_page:
> 
> set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> 
> It is possible to establish a writable pte, no?

Hugh already answered this thanks. Further details on the vm_page_prot
are in top of mmap.c, and KSM never scans MAP_SHARED vmas.

> Unfortunately, all these bugs are triggered by test cases.

Sure, I've seen the very Oops for the other one, and this one also can
trigger if unlucky.

This one can trigger with KVM but only if KSM is enabled or with live
migration or with device hotplug or some other event that triggers a
fork in qemu.

My curiosity about the other one in the exit/unregister/release paths
is if it really ever triggered with KVM. Because I can't easily see
how it could trigger. By the time kvm_destroy_vm or exit_mmap() runs,
no vcpu can be in guest mode anymore, so it cannot matter whatever the
status of any leftover spte at that time.

The process in the oops certainly wasn't qemu*. This is what I meant
in the previous email about this. Of course the fix was certainly good
and needed for other mmu notifier users, great fix.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
> On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> > CPU0CPU1
> > oldpage[1] == 0 (both guest & host)
> > oldpage[0] = 1
> > trigger do_wp_page
> 
> We always do ptep_clear_flush before set_pte_at_notify(),
> at this point, we have done:
>   pte = 0 and flush all tlbs
> > mmu_notifier_change_pte
> > spte = newpage + writable
> > guest does newpage[1] = 1
> > vmexit
> > host read oldpage[1] == 0
> 
>   It can not happen, at this point pte = 0, host can not
> access oldpage anymore, host read can generate #PF, it
>   will be blocked on page table lock until CPU 0 release the 
> lock.

Agreed, this is why your fix is safe.

So the thing is, it is never safe to mangle the secondary MMU before
the primary MMU. This is why your patch triggered all sort of alarm
bells to me and I was tempted to suggest an obviously safe
alternative.

The reason why your patch is safe, is that the required primary MMU
pte mangling happens before the set_pte_at_notify is invoked.

Other details about change_pte:

1) it is only safe to use on an already readonly pte if the pfn is
   being altered

2) it is only safe to run on a read-write mapping to convert it to a
   readonly mapping if the pfn doesn't change

KSM uses it as 2) in page_write_protect.

KSM uses it as 1) in replace_page and do_wp_page uses it as 1) too.

The new constraint for its safety after your fix is that it must
always be preceded by a ptep_clear_flush.

Of course it's quite natural that it is preceded by a
ptep_clear_flush, other things would risk to go wrong if
ptep_clear_flush wasn't done, so there's little risk of getting it
wrong.

I thought, maybe it would be clearer to do it as
ptep_clear_flush_notify_at(pte). That would avoid having methods that
rings the alarm bells. But the O_DIRECT check of KSM in
page_write_protect prevents such a change (there we need to run a
standalone ptep_clear_flush).

I suggest only adding a comment to mention the real primary MMU pte
update happens before set_pte_at_notify is invoked and we're not
really doing secondary MMU updates before primary MMU updates which
wouldn't never be safe.

It never would be safe because the secondary MMU can be just a TLB and
even the KSM sptes can be dropped at any time and refilled through
secondary MMU page faults running gup_fast. The PT lock won't stop it.

Thanks a lot for fixing this subtle race!
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Xiao Guangrong
On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
>> There has a bug in set_pte_at_notify which always set the pte to the
>> new page before release the old page in secondary MMU, at this time,
>> the process will access on the new page, but the secondary MMU still
>> access on the old page, the memory is inconsistent between them
>>
>> Below scenario shows the bug more clearly:
>>
>> at the beginning: *p = 0, and p is write-protected by KSM or shared with
>> parent process
>>
>> CPU 0   CPU 1
>> write 1 to p to trigger COW,
>> set_pte_at_notify will be called:
>>   *pte = new_page + W; /* The W bit of pte is set */
>>
>>  *p = 1; /* pte is valid, so no #PF */
>>
>>  return back to secondary MMU, then
>>  the secondary MMU read p, but get:
>>  *p == 0;
>>
>>  /*
>>   * !!
>>   * the host has already set p to 1, but the 
>> secondary
>>   * MMU still get the old value 0
>>   */
>>
>>   call mmu_notifier_change_pte to release
>>   old page in secondary MMU
> 
> The KSM usage of it looks safe because it will only establish readonly
> ptes with it.
> 
> It seems a problem only for do_wp_page. It wasn't safe to setup
> writable ptes with it. I guess we first introduced it for KSM and then
> we added it to do_wp_page too by mistake.
> 
> The race window is really tiny, it's unlikely it has ever triggered,
> however this one seem to be possible so it's slightly more serious
> than the other race you recently found (the previous one in the exit
> path I think it was impossible to trigger with KVM).
> 
>> We can fix it by release old page first, then set the pte to the new
>> page.
>>
>> Note, the new page will be firstly used in secondary MMU before it is
>> mapped into the page table of the process, but this is safe because it
>> is protected by the page table lock, there is no race to change the pte
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  include/linux/mmu_notifier.h |2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 1d1b1e1..8c7435a 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct 
>> mm_struct *mm)
>>  unsigned long ___address = __address;   \
>>  pte_t ___pte = __pte;   \
>>  \
>> -set_pte_at(___mm, ___address, __ptep, ___pte);  \
>>  mmu_notifier_change_pte(___mm, ___address, ___pte); \
>> +set_pte_at(___mm, ___address, __ptep, ___pte);  \
>>  })
> 
> If we establish the spte on the new page, what will happen is the same
> race in reverse. The fundamental problem is that the first guy that
> writes to the "newpage" (guest or host) won't fault again and so it
> will fail to serialize against the PT lock.
> 
> CPU0  CPU1
>   oldpage[1] == 0 (both guest & host)
> oldpage[0] = 1
> trigger do_wp_page

We always do ptep_clear_flush before set_pte_at_notify(),
at this point, we have done:
  pte = 0 and flush all tlbs

> mmu_notifier_change_pte
> spte = newpage + writable
>   guest does newpage[1] = 1
>   vmexit
>   host read oldpage[1] == 0

  It can not happen, at this point pte = 0, host can not
  access oldpage anymore, host read can generate #PF, it
  will be blocked on page table lock until CPU 0 release the 
lock.

> pte = newpage + writable (too late)
> 




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Xiao Guangrong
On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
 On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
 There has a bug in set_pte_at_notify which always set the pte to the
 new page before release the old page in secondary MMU, at this time,
 the process will access on the new page, but the secondary MMU still
 access on the old page, the memory is inconsistent between them

 Below scenario shows the bug more clearly:

 at the beginning: *p = 0, and p is write-protected by KSM or shared with
 parent process

 CPU 0   CPU 1
 write 1 to p to trigger COW,
 set_pte_at_notify will be called:
   *pte = new_page + W; /* The W bit of pte is set */

  *p = 1; /* pte is valid, so no #PF */

  return back to secondary MMU, then
  the secondary MMU read p, but get:
  *p == 0;

  /*
   * !!
   * the host has already set p to 1, but the 
 secondary
   * MMU still get the old value 0
   */

   call mmu_notifier_change_pte to release
   old page in secondary MMU
 
 The KSM usage of it looks safe because it will only establish readonly
 ptes with it.
 
 It seems a problem only for do_wp_page. It wasn't safe to setup
 writable ptes with it. I guess we first introduced it for KSM and then
 we added it to do_wp_page too by mistake.
 
 The race window is really tiny, it's unlikely it has ever triggered,
 however this one seem to be possible so it's slightly more serious
 than the other race you recently found (the previous one in the exit
 path I think it was impossible to trigger with KVM).
 
 We can fix it by release old page first, then set the pte to the new
 page.

 Note, the new page will be firstly used in secondary MMU before it is
 mapped into the page table of the process, but this is safe because it
 is protected by the page table lock, there is no race to change the pte

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  include/linux/mmu_notifier.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
 index 1d1b1e1..8c7435a 100644
 --- a/include/linux/mmu_notifier.h
 +++ b/include/linux/mmu_notifier.h
 @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct 
 mm_struct *mm)
  unsigned long ___address = __address;   \
  pte_t ___pte = __pte;   \
  \
 -set_pte_at(___mm, ___address, __ptep, ___pte);  \
  mmu_notifier_change_pte(___mm, ___address, ___pte); \
 +set_pte_at(___mm, ___address, __ptep, ___pte);  \
  })
 
 If we establish the spte on the new page, what will happen is the same
 race in reverse. The fundamental problem is that the first guy that
 writes to the newpage (guest or host) won't fault again and so it
 will fail to serialize against the PT lock.
 
 CPU0  CPU1
   oldpage[1] == 0 (both guest  host)
 oldpage[0] = 1
 trigger do_wp_page

We always do ptep_clear_flush before set_pte_at_notify(),
at this point, we have done:
  pte = 0 and flush all tlbs

 mmu_notifier_change_pte
 spte = newpage + writable
   guest does newpage[1] = 1
   vmexit
   host read oldpage[1] == 0

  It can not happen, at this point pte = 0, host can not
  access oldpage anymore, host read can generate #PF, it
  will be blocked on page table lock until CPU 0 release the 
lock.

 pte = newpage + writable (too late)
 




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
 On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
  CPU0CPU1
  oldpage[1] == 0 (both guest  host)
  oldpage[0] = 1
  trigger do_wp_page
 
 We always do ptep_clear_flush before set_pte_at_notify(),
 at this point, we have done:
   pte = 0 and flush all tlbs
  mmu_notifier_change_pte
  spte = newpage + writable
  guest does newpage[1] = 1
  vmexit
  host read oldpage[1] == 0
 
   It can not happen, at this point pte = 0, host can not
 access oldpage anymore, host read can generate #PF, it
   will be blocked on page table lock until CPU 0 release the 
 lock.

Agreed, this is why your fix is safe.

So the thing is, it is never safe to mangle the secondary MMU before
the primary MMU. This is why your patch triggered all sort of alarm
bells to me and I was tempted to suggest an obviously safe
alternative.

The reason why your patch is safe, is that the required primary MMU
pte mangling happens before the set_pte_at_notify is invoked.

Other details about change_pte:

1) it is only safe to use on an already readonly pte if the pfn is
   being altered

2) it is only safe to run on a read-write mapping to convert it to a
   readonly mapping if the pfn doesn't change

KSM uses it as 2) in page_write_protect.

KSM uses it as 1) in replace_page and do_wp_page uses it as 1) too.

The new constraint for its safety after your fix is that it must
always be preceded by a ptep_clear_flush.

Of course it's quite natural that it is preceded by a
ptep_clear_flush, other things would risk to go wrong if
ptep_clear_flush wasn't done, so there's little risk of getting it
wrong.

I thought, maybe it would be clearer to do it as
ptep_clear_flush_notify_at(pte). That would avoid having methods that
rings the alarm bells. But the O_DIRECT check of KSM in
page_write_protect prevents such a change (there we need to run a
standalone ptep_clear_flush).

I suggest only adding a comment to mention the real primary MMU pte
update happens before set_pte_at_notify is invoked and we're not
really doing secondary MMU updates before primary MMU updates which
wouldn't never be safe.

It never would be safe because the secondary MMU can be just a TLB and
even the KSM sptes can be dropped at any time and refilled through
secondary MMU page faults running gup_fast. The PT lock won't stop it.

Thanks a lot for fixing this subtle race!
Andrea
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
On Wed, Aug 22, 2012 at 11:51:17AM +0800, Xiao Guangrong wrote:
 Hmm, in KSM code, i found this code in replace_page:
 
 set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma-vm_page_prot));
 
 It is possible to establish a writable pte, no?

Hugh already answered this thanks. Further details on the vm_page_prot
are in top of mmap.c, and KSM never scans MAP_SHARED vmas.

 Unfortunately, all these bugs are triggered by test cases.

Sure, I've seen the very Oops for the other one, and this one also can
trigger if unlucky.

This one can trigger with KVM but only if KSM is enabled or with live
migration or with device hotplug or some other event that triggers a
fork in qemu.

My curiosity about the other one in the exit/unregister/release paths
is if it really ever triggered with KVM. Because I can't easily see
how it could trigger. By the time kvm_destroy_vm or exit_mmap() runs,
no vcpu can be in guest mode anymore, so it cannot matter whatever the
status of any leftover spte at that time.

The process in the oops certainly wasn't qemu*. This is what I meant
in the previous email about this. Of course the fix was certainly good
and needed for other mmu notifier users, great fix.

Thanks,
Andrea
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrew Morton
On Wed, 22 Aug 2012 18:29:55 +0200
Andrea Arcangeli aarca...@redhat.com wrote:

 On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
  On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
   CPU0  CPU1
 oldpage[1] == 0 (both guest  host)
   oldpage[0] = 1
   trigger do_wp_page
  
  We always do ptep_clear_flush before set_pte_at_notify(),
  at this point, we have done:
pte = 0 and flush all tlbs
   mmu_notifier_change_pte
   spte = newpage + writable
 guest does newpage[1] = 1
 vmexit
 host read oldpage[1] == 0
  
It can not happen, at this point pte = 0, host can not
access oldpage anymore, host read can generate #PF, it
will be blocked on page table lock until CPU 0 release 
  the lock.
 
 Agreed, this is why your fix is safe.
 
 ...

 Thanks a lot for fixing this subtle race!

I'll take that as an ack.

Unfortunately we weren't told the user-visible effects of the bug,
which often makes it hard to determine which kernel versions should be
patched.  Please do always provide this information when fixing a bug.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
Hi Andrew,

On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
 On Wed, 22 Aug 2012 18:29:55 +0200
 Andrea Arcangeli aarca...@redhat.com wrote:
 
  On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
   On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
CPU0CPU1
oldpage[1] == 0 (both guest  host)
oldpage[0] = 1
trigger do_wp_page
   
   We always do ptep_clear_flush before set_pte_at_notify(),
   at this point, we have done:
 pte = 0 and flush all tlbs
mmu_notifier_change_pte
spte = newpage + writable
guest does newpage[1] = 1
vmexit
host read oldpage[1] == 0
   
 It can not happen, at this point pte = 0, host can not
   access oldpage anymore, host read can generate #PF, it
 will be blocked on page table lock until CPU 0 release 
   the lock.
  
  Agreed, this is why your fix is safe.
  
  ...
 
  Thanks a lot for fixing this subtle race!
 
 I'll take that as an ack.

Yes thanks!

I'd also like a comment that explains why in that case the order is
reversed. The reverse order immediately rings an alarm bell otherwise
;). But the comment can be added with an incremental patch.

 Unfortunately we weren't told the user-visible effects of the bug,
 which often makes it hard to determine which kernel versions should be
 patched.  Please do always provide this information when fixing a bug.

This is best answered by Xiao who said it's a testcase triggering
this.

It requires the guest reading memory on CPU0 while the host writes to
the same memory on CPU1, while CPU2 triggers the copy on write fault
on another part of the same page (slightly before CPU1 writes). The
host writes of CPU1 would need to happen in a microsecond window, and
they wouldn't be immediately propagated to the guest in CPU0. They
would still appear in the guest but with a microsecond delay (the
guest has the spte mapped readonly when this happens so it's only a
guest microsecond delayed reading problem as far as I can tell). I
guess most of the time it would fall into the undefined by timing
scenario so it's hard to tell how the side effect could escalate.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrew Morton
On Wed, 22 Aug 2012 21:50:43 +0200
Andrea Arcangeli aarca...@redhat.com wrote:

 Hi Andrew,
 
 On Wed, Aug 22, 2012 at 12:15:35PM -0700, Andrew Morton wrote:
  On Wed, 22 Aug 2012 18:29:55 +0200
  Andrea Arcangeli aarca...@redhat.com wrote:
  
   On Wed, Aug 22, 2012 at 02:03:41PM +0800, Xiao Guangrong wrote:
On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
 CPU0  CPU1
   oldpage[1] == 0 (both guest  host)
 oldpage[0] = 1
 trigger do_wp_page

We always do ptep_clear_flush before set_pte_at_notify(),
at this point, we have done:
  pte = 0 and flush all tlbs
 mmu_notifier_change_pte
 spte = newpage + writable
   guest does newpage[1] = 1
   vmexit
   host read oldpage[1] == 0

  It can not happen, at this point pte = 0, host can not
  access oldpage anymore, host read can generate #PF, it
  will be blocked on page table lock until CPU 0 
release the lock.
   
   Agreed, this is why your fix is safe.
   
   ...
  
   Thanks a lot for fixing this subtle race!
  
  I'll take that as an ack.
 
 Yes thanks!
 
 I'd also like a comment that explains why in that case the order is
 reversed. The reverse order immediately rings an alarm bell otherwise
 ;). But the comment can be added with an incremental patch.

If you can suggest some text I'll type it in right now.

  Unfortunately we weren't told the user-visible effects of the bug,
  which often makes it hard to determine which kernel versions should be
  patched.  Please do always provide this information when fixing a bug.
 
 This is best answered by Xiao who said it's a testcase triggering
 this.
 
 It requires the guest reading memory on CPU0 while the host writes to
 the same memory on CPU1, while CPU2 triggers the copy on write fault
 on another part of the same page (slightly before CPU1 writes). The
 host writes of CPU1 would need to happen in a microsecond window, and
 they wouldn't be immediately propagated to the guest in CPU0. They
 would still appear in the guest but with a microsecond delay (the
 guest has the spte mapped readonly when this happens so it's only a
 guest microsecond delayed reading problem as far as I can tell). I
 guess most of the time it would fall into the undefined by timing
 scenario so it's hard to tell how the side effect could escalate.

OK, thanks.  For this sort of thing I am dependent upon KVM people to
suggest whether the fix should be in 3.6 and whether -stable needs it.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-22 Thread Andrea Arcangeli
On Wed, Aug 22, 2012 at 12:58:05PM -0700, Andrew Morton wrote:
 If you can suggest some text I'll type it in right now.

Ok ;), I tried below:

This is safe to start by updating the secondary MMUs, because the
relevant primary MMU pte invalidate must have already happened with a
ptep_clear_flush before set_pte_at_notify has been invoked. Updating
the secondary MMUs first is required when we change both the
protection of the mapping from read-only to read-write and the pfn
(like during copy on write page faults). Otherwise the old page would
remain mapped readonly in the secondary MMUs after the new page is
already writable by some CPU through the primary MMU.

Thanks!
Andrea
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Xiao Guangrong
On 08/22/2012 12:12 PM, Hugh Dickins wrote:
> On Wed, 22 Aug 2012, Xiao Guangrong wrote:
>> On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
>>>
>>> The KSM usage of it looks safe because it will only establish readonly
>>> ptes with it.
>>
>> Hmm, in KSM code, i found this code in replace_page:
>>
>> set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
>>
>> It is possible to establish a writable pte, no?
> 
> No: we only do KSM in private vmas (!VM_SHARED), and because of the
> need to CopyOnWrite in those, vm_page_prot excludes write permission:
> write permission has to be added on COW fault.

After read the code carefully, yes, you are right. Thank you very much
for your explanation, Hugh! :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Hugh Dickins
On Wed, 22 Aug 2012, Xiao Guangrong wrote:
> On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> > 
> > The KSM usage of it looks safe because it will only establish readonly
> > ptes with it.
> 
> Hmm, in KSM code, i found this code in replace_page:
> 
> set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> 
> It is possible to establish a writable pte, no?

No: we only do KSM in private vmas (!VM_SHARED), and because of the
need to CopyOnWrite in those, vm_page_prot excludes write permission:
write permission has to be added on COW fault.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Xiao Guangrong
On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
> On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
>> There has a bug in set_pte_at_notify which always set the pte to the
>> new page before release the old page in secondary MMU, at this time,
>> the process will access on the new page, but the secondary MMU still
>> access on the old page, the memory is inconsistent between them
>>
>> Below scenario shows the bug more clearly:
>>
>> at the beginning: *p = 0, and p is write-protected by KSM or shared with
>> parent process
>>
>> CPU 0   CPU 1
>> write 1 to p to trigger COW,
>> set_pte_at_notify will be called:
>>   *pte = new_page + W; /* The W bit of pte is set */
>>
>>  *p = 1; /* pte is valid, so no #PF */
>>
>>  return back to secondary MMU, then
>>  the secondary MMU read p, but get:
>>  *p == 0;
>>
>>  /*
>>   * !!
>>   * the host has already set p to 1, but the 
>> secondary
>>   * MMU still get the old value 0
>>   */
>>
>>   call mmu_notifier_change_pte to release
>>   old page in secondary MMU
> 
> The KSM usage of it looks safe because it will only establish readonly
> ptes with it.

Hmm, in KSM code, i found this code in replace_page:

set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));

It is possible to establish a writable pte, no?

> 
> It seems a problem only for do_wp_page. It wasn't safe to setup
> writable ptes with it. I guess we first introduced it for KSM and then
> we added it to do_wp_page too by mistake.
> 
> The race window is really tiny, it's unlikely it has ever triggered,
> however this one seem to be possible so it's slightly more serious
> than the other race you recently found (the previous one in the exit
> path I think it was impossible to trigger with KVM).

Unfortunately, all these bugs are triggered by test cases.

> 
>> We can fix it by release old page first, then set the pte to the new
>> page.
>>
>> Note, the new page will be firstly used in secondary MMU before it is
>> mapped into the page table of the process, but this is safe because it
>> is protected by the page table lock, there is no race to change the pte
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  include/linux/mmu_notifier.h |2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 1d1b1e1..8c7435a 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct 
>> mm_struct *mm)
>>  unsigned long ___address = __address;   \
>>  pte_t ___pte = __pte;   \
>>  \
>> -set_pte_at(___mm, ___address, __ptep, ___pte);  \
>>  mmu_notifier_change_pte(___mm, ___address, ___pte); \
>> +set_pte_at(___mm, ___address, __ptep, ___pte);  \
>>  })
> 
> If we establish the spte on the new page, what will happen is the same
> race in reverse. The fundamental problem is that the first guy that
> writes to the "newpage" (guest or host) won't fault again and so it
> will fail to serialize against the PT lock.
> 
> CPU0  CPU1
>   oldpage[1] == 0 (both guest & host)
> oldpage[0] = 1
> trigger do_wp_page
> mmu_notifier_change_pte
> spte = newpage + writable
>   guest does newpage[1] = 1
>   vmexit
>   host read oldpage[1] == 0
> pte = newpage + writable (too late)
> 
> I think the fix is to use ptep_clear_flush_notify whenever
> set_pte_at_notify will establish a writable pte/spte. If the pte/spte
> established by set_pte_at_notify/change_pte is readonly we don't need
> to do the ptep_clear_flush_notify instead because when the host will
> write to the page that will fault and serialize against the
> PT lock (set_pte_at_notify must always run under the PT lock of course).
> 
> How about this:
> 
> =
>>From 160a0b1b2be9bf96c45b30d9423f8196ecebe351 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli 
> Date: Tue, 21 Aug 2012 16:48:11 +0200
> Subject: [PATCH] mmu_notifier: fix race in set_pte_at_notify usage
> 
> Whenever we establish a writable spte with set_pte_at_notify the
> ptep_clear_flush before it must be a _notify one that clears the spte
> too.
> 
> The fundamental problem is that if the primary MMU that writes to the
> "newpage" won't fault again if the pte established by
> set_pte_at_notify is writable. And so it will fail to serialize
> against the PT lock to wait the 

Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Andrea Arcangeli
On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
> There has a bug in set_pte_at_notify which always set the pte to the
> new page before release the old page in secondary MMU, at this time,
> the process will access on the new page, but the secondary MMU still
> access on the old page, the memory is inconsistent between them
> 
> Below scenario shows the bug more clearly:
> 
> at the beginning: *p = 0, and p is write-protected by KSM or shared with
> parent process
> 
> CPU 0   CPU 1
> write 1 to p to trigger COW,
> set_pte_at_notify will be called:
>   *pte = new_page + W; /* The W bit of pte is set */
> 
>  *p = 1; /* pte is valid, so no #PF */
> 
>  return back to secondary MMU, then
>  the secondary MMU read p, but get:
>  *p == 0;
> 
>  /*
>   * !!
>   * the host has already set p to 1, but the secondary
>   * MMU still get the old value 0
>   */
> 
>   call mmu_notifier_change_pte to release
>   old page in secondary MMU

The KSM usage of it looks safe because it will only establish readonly
ptes with it.

It seems a problem only for do_wp_page. It wasn't safe to setup
writable ptes with it. I guess we first introduced it for KSM and then
we added it to do_wp_page too by mistake.

The race window is really tiny, it's unlikely it has ever triggered,
however this one seem to be possible so it's slightly more serious
than the other race you recently found (the previous one in the exit
path I think it was impossible to trigger with KVM).

> We can fix it by release old page first, then set the pte to the new
> page.
> 
> Note, the new page will be firstly used in secondary MMU before it is
> mapped into the page table of the process, but this is safe because it
> is protected by the page table lock, there is no race to change the pte
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  include/linux/mmu_notifier.h |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 1d1b1e1..8c7435a 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct 
> mm_struct *mm)
>   unsigned long ___address = __address;   \
>   pte_t ___pte = __pte;   \
>   \
> - set_pte_at(___mm, ___address, __ptep, ___pte);  \
>   mmu_notifier_change_pte(___mm, ___address, ___pte); \
> + set_pte_at(___mm, ___address, __ptep, ___pte);  \
>  })

If we establish the spte on the new page, what will happen is the same
race in reverse. The fundamental problem is that the first guy that
writes to the "newpage" (guest or host) won't fault again and so it
will fail to serialize against the PT lock.

CPU0CPU1
oldpage[1] == 0 (both guest & host)
oldpage[0] = 1
trigger do_wp_page
mmu_notifier_change_pte
spte = newpage + writable
guest does newpage[1] = 1
vmexit
host read oldpage[1] == 0
pte = newpage + writable (too late)

I think the fix is to use ptep_clear_flush_notify whenever
set_pte_at_notify will establish a writable pte/spte. If the pte/spte
established by set_pte_at_notify/change_pte is readonly we don't need
to do the ptep_clear_flush_notify instead because when the host will
write to the page that will fault and serialize against the
PT lock (set_pte_at_notify must always run under the PT lock of course).

How about this:

=
>From 160a0b1b2be9bf96c45b30d9423f8196ecebe351 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Tue, 21 Aug 2012 16:48:11 +0200
Subject: [PATCH] mmu_notifier: fix race in set_pte_at_notify usage

Whenever we establish a writable spte with set_pte_at_notify the
ptep_clear_flush before it must be a _notify one that clears the spte
too.

The fundamental problem is that if the primary MMU that writes to the
"newpage" won't fault again if the pte established by
set_pte_at_notify is writable. And so it will fail to serialize
against the PT lock to wait the set_pte_at_notify to finish
updating all secondary MMUs before the write hits the newpage.

CPU0CPU1
oldpage[1] == 0 (all MMUs)
oldpage[0] = 1
trigger do_wp_page
take PT lock
ptep_clear_flush (secondary MMUs
still have read access to oldpage)
mmu_notifier_change_pte
pte = newpage + writable (primary MMU can write to
newpage)
host write 

[PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Xiao Guangrong
There has a bug in set_pte_at_notify which always set the pte to the
new page before release the old page in secondary MMU, at this time,
the process will access on the new page, but the secondary MMU still
access on the old page, the memory is inconsistent between them

Below scenario shows the bug more clearly:

at the beginning: *p = 0, and p is write-protected by KSM or shared with
parent process

CPU 0   CPU 1
write 1 to p to trigger COW,
set_pte_at_notify will be called:
  *pte = new_page + W; /* The W bit of pte is set */

 *p = 1; /* pte is valid, so no #PF */

 return back to secondary MMU, then
 the secondary MMU read p, but get:
 *p == 0;

 /*
  * !!
  * the host has already set p to 1, but the secondary
  * MMU still get the old value 0
  */

  call mmu_notifier_change_pte to release
  old page in secondary MMU

We can fix it by release old page first, then set the pte to the new
page.

Note, the new page will be firstly used in secondary MMU before it is
mapped into the page table of the process, but this is safe because it
is protected by the page table lock, there is no race to change the pte

Signed-off-by: Xiao Guangrong 
---
 include/linux/mmu_notifier.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1d1b1e1..8c7435a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
unsigned long ___address = __address;   \
pte_t ___pte = __pte;   \
\
-   set_pte_at(___mm, ___address, __ptep, ___pte);  \
mmu_notifier_change_pte(___mm, ___address, ___pte); \
+   set_pte_at(___mm, ___address, __ptep, ___pte);  \
 })

 #else /* CONFIG_MMU_NOTIFIER */
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Xiao Guangrong
On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
 On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
 There has a bug in set_pte_at_notify which always set the pte to the
 new page before release the old page in secondary MMU, at this time,
 the process will access on the new page, but the secondary MMU still
 access on the old page, the memory is inconsistent between them

 Below scenario shows the bug more clearly:

 at the beginning: *p = 0, and p is write-protected by KSM or shared with
 parent process

 CPU 0   CPU 1
 write 1 to p to trigger COW,
 set_pte_at_notify will be called:
   *pte = new_page + W; /* The W bit of pte is set */

  *p = 1; /* pte is valid, so no #PF */

  return back to secondary MMU, then
  the secondary MMU read p, but get:
  *p == 0;

  /*
   * !!
   * the host has already set p to 1, but the 
 secondary
   * MMU still get the old value 0
   */

   call mmu_notifier_change_pte to release
   old page in secondary MMU
 
 The KSM usage of it looks safe because it will only establish readonly
 ptes with it.

Hmm, in KSM code, i found this code in replace_page:

set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma-vm_page_prot));

It is possible to establish a writable pte, no?

 
 It seems a problem only for do_wp_page. It wasn't safe to setup
 writable ptes with it. I guess we first introduced it for KSM and then
 we added it to do_wp_page too by mistake.
 
 The race window is really tiny, it's unlikely it has ever triggered,
 however this one seem to be possible so it's slightly more serious
 than the other race you recently found (the previous one in the exit
 path I think it was impossible to trigger with KVM).

Unfortunately, all these bugs are triggered by test cases.

 
 We can fix it by release old page first, then set the pte to the new
 page.

 Note, the new page will be firstly used in secondary MMU before it is
 mapped into the page table of the process, but this is safe because it
 is protected by the page table lock, there is no race to change the pte

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  include/linux/mmu_notifier.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
 index 1d1b1e1..8c7435a 100644
 --- a/include/linux/mmu_notifier.h
 +++ b/include/linux/mmu_notifier.h
 @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct 
 mm_struct *mm)
  unsigned long ___address = __address;   \
  pte_t ___pte = __pte;   \
  \
 -set_pte_at(___mm, ___address, __ptep, ___pte);  \
  mmu_notifier_change_pte(___mm, ___address, ___pte); \
 +set_pte_at(___mm, ___address, __ptep, ___pte);  \
  })
 
 If we establish the spte on the new page, what will happen is the same
 race in reverse. The fundamental problem is that the first guy that
 writes to the newpage (guest or host) won't fault again and so it
 will fail to serialize against the PT lock.
 
 CPU0  CPU1
   oldpage[1] == 0 (both guest  host)
 oldpage[0] = 1
 trigger do_wp_page
 mmu_notifier_change_pte
 spte = newpage + writable
   guest does newpage[1] = 1
   vmexit
   host read oldpage[1] == 0
 pte = newpage + writable (too late)
 
 I think the fix is to use ptep_clear_flush_notify whenever
 set_pte_at_notify will establish a writable pte/spte. If the pte/spte
 established by set_pte_at_notify/change_pte is readonly we don't need
 to do the ptep_clear_flush_notify instead because when the host will
 write to the page that will fault and serialize against the
 PT lock (set_pte_at_notify must always run under the PT lock of course).
 
 How about this:
 
 =
From 160a0b1b2be9bf96c45b30d9423f8196ecebe351 Mon Sep 17 00:00:00 2001
 From: Andrea Arcangeli aarca...@redhat.com
 Date: Tue, 21 Aug 2012 16:48:11 +0200
 Subject: [PATCH] mmu_notifier: fix race in set_pte_at_notify usage
 
 Whenever we establish a writable spte with set_pte_at_notify the
 ptep_clear_flush before it must be a _notify one that clears the spte
 too.
 
 The fundamental problem is that if the primary MMU that writes to the
 newpage won't fault again if the pte established by
 set_pte_at_notify is writable. And so it will fail to serialize
 against the PT lock to wait the set_pte_at_notify to finish
 updating all secondary MMUs before the write hits the newpage.
 
 CPU0  

Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Hugh Dickins
On Wed, 22 Aug 2012, Xiao Guangrong wrote:
 On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:
  
  The KSM usage of it looks safe because it will only establish readonly
  ptes with it.
 
 Hmm, in KSM code, i found this code in replace_page:
 
 set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma-vm_page_prot));
 
 It is possible to establish a writable pte, no?

No: we only do KSM in private vmas (!VM_SHARED), and because of the
need to CopyOnWrite in those, vm_page_prot excludes write permission:
write permission has to be added on COW fault.

Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Xiao Guangrong
On 08/22/2012 12:12 PM, Hugh Dickins wrote:
 On Wed, 22 Aug 2012, Xiao Guangrong wrote:
 On 08/21/2012 11:06 PM, Andrea Arcangeli wrote:

 The KSM usage of it looks safe because it will only establish readonly
 ptes with it.

 Hmm, in KSM code, i found this code in replace_page:

 set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma-vm_page_prot));

 It is possible to establish a writable pte, no?
 
 No: we only do KSM in private vmas (!VM_SHARED), and because of the
 need to CopyOnWrite in those, vm_page_prot excludes write permission:
 write permission has to be added on COW fault.

After read the code carefully, yes, you are right. Thank you very much
for your explanation, Hugh! :)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Xiao Guangrong
There has a bug in set_pte_at_notify which always set the pte to the
new page before release the old page in secondary MMU, at this time,
the process will access on the new page, but the secondary MMU still
access on the old page, the memory is inconsistent between them

Below scenario shows the bug more clearly:

at the beginning: *p = 0, and p is write-protected by KSM or shared with
parent process

CPU 0   CPU 1
write 1 to p to trigger COW,
set_pte_at_notify will be called:
  *pte = new_page + W; /* The W bit of pte is set */

 *p = 1; /* pte is valid, so no #PF */

 return back to secondary MMU, then
 the secondary MMU read p, but get:
 *p == 0;

 /*
  * !!
  * the host has already set p to 1, but the secondary
  * MMU still get the old value 0
  */

  call mmu_notifier_change_pte to release
  old page in secondary MMU

We can fix it by release old page first, then set the pte to the new
page.

Note, the new page will be firstly used in secondary MMU before it is
mapped into the page table of the process, but this is safe because it
is protected by the page table lock, there is no race to change the pte

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 include/linux/mmu_notifier.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1d1b1e1..8c7435a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct 
*mm)
unsigned long ___address = __address;   \
pte_t ___pte = __pte;   \
\
-   set_pte_at(___mm, ___address, __ptep, ___pte);  \
mmu_notifier_change_pte(___mm, ___address, ___pte); \
+   set_pte_at(___mm, ___address, __ptep, ___pte);  \
 })

 #else /* CONFIG_MMU_NOTIFIER */
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: mmu_notifier: fix inconsistent memory between secondary MMU and host

2012-08-21 Thread Andrea Arcangeli
On Tue, Aug 21, 2012 at 05:46:39PM +0800, Xiao Guangrong wrote:
 There has a bug in set_pte_at_notify which always set the pte to the
 new page before release the old page in secondary MMU, at this time,
 the process will access on the new page, but the secondary MMU still
 access on the old page, the memory is inconsistent between them
 
 Below scenario shows the bug more clearly:
 
 at the beginning: *p = 0, and p is write-protected by KSM or shared with
 parent process
 
 CPU 0   CPU 1
 write 1 to p to trigger COW,
 set_pte_at_notify will be called:
   *pte = new_page + W; /* The W bit of pte is set */
 
  *p = 1; /* pte is valid, so no #PF */
 
  return back to secondary MMU, then
  the secondary MMU read p, but get:
  *p == 0;
 
  /*
   * !!
   * the host has already set p to 1, but the secondary
   * MMU still get the old value 0
   */
 
   call mmu_notifier_change_pte to release
   old page in secondary MMU

The KSM usage of it looks safe because it will only establish readonly
ptes with it.

It seems a problem only for do_wp_page. It wasn't safe to setup
writable ptes with it. I guess we first introduced it for KSM and then
we added it to do_wp_page too by mistake.

The race window is really tiny, it's unlikely it has ever triggered,
however this one seem to be possible so it's slightly more serious
than the other race you recently found (the previous one in the exit
path I think it was impossible to trigger with KVM).

 We can fix it by release old page first, then set the pte to the new
 page.
 
 Note, the new page will be firstly used in secondary MMU before it is
 mapped into the page table of the process, but this is safe because it
 is protected by the page table lock, there is no race to change the pte
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  include/linux/mmu_notifier.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
 index 1d1b1e1..8c7435a 100644
 --- a/include/linux/mmu_notifier.h
 +++ b/include/linux/mmu_notifier.h
 @@ -317,8 +317,8 @@ static inline void mmu_notifier_mm_destroy(struct 
 mm_struct *mm)
   unsigned long ___address = __address;   \
   pte_t ___pte = __pte;   \
   \
 - set_pte_at(___mm, ___address, __ptep, ___pte);  \
   mmu_notifier_change_pte(___mm, ___address, ___pte); \
 + set_pte_at(___mm, ___address, __ptep, ___pte);  \
  })

If we establish the spte on the new page, what will happen is the same
race in reverse. The fundamental problem is that the first guy that
writes to the newpage (guest or host) won't fault again and so it
will fail to serialize against the PT lock.

CPU0CPU1
oldpage[1] == 0 (both guest  host)
oldpage[0] = 1
trigger do_wp_page
mmu_notifier_change_pte
spte = newpage + writable
guest does newpage[1] = 1
vmexit
host read oldpage[1] == 0
pte = newpage + writable (too late)

I think the fix is to use ptep_clear_flush_notify whenever
set_pte_at_notify will establish a writable pte/spte. If the pte/spte
established by set_pte_at_notify/change_pte is readonly we don't need
to do the ptep_clear_flush_notify instead because when the host will
write to the page that will fault and serialize against the
PT lock (set_pte_at_notify must always run under the PT lock of course).

How about this:

=
From 160a0b1b2be9bf96c45b30d9423f8196ecebe351 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli aarca...@redhat.com
Date: Tue, 21 Aug 2012 16:48:11 +0200
Subject: [PATCH] mmu_notifier: fix race in set_pte_at_notify usage

Whenever we establish a writable spte with set_pte_at_notify the
ptep_clear_flush before it must be a _notify one that clears the spte
too.

The fundamental problem is that if the primary MMU that writes to the
newpage won't fault again if the pte established by
set_pte_at_notify is writable. And so it will fail to serialize
against the PT lock to wait the set_pte_at_notify to finish
updating all secondary MMUs before the write hits the newpage.

CPU0CPU1
oldpage[1] == 0 (all MMUs)
oldpage[0] = 1
trigger do_wp_page
take PT lock
ptep_clear_flush (secondary MMUs
still have read access to oldpage)
mmu_notifier_change_pte
pte = newpage + writable (primary MMU can write to
newpage)
host write