Re: set_pte_at_notify regression

2014-01-15 Thread Haggai Eran
On 12/01/2014 19:50, Andrea Arcangeli wrote:
 On Sun, Jan 12, 2014 at 01:56:12PM +0200, Haggai Eran wrote:
 Hi,

 On 10/01/2014 18:57, Andrea Arcangeli wrote:
 Hi!

 On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote:
 It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
 semantic of set_pte_at_notify.
 The change of calling first to mmu_notifier_invalidate_range_start, then 
 to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
 not only increase the amount of locks kvm have to take and release by 
 factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
 the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
 have any spte to set and it acctuly get called for nothing, the result is
 increasing of vmexits for kvm from both do_wp_page and replace_page, and 
 broken semantic of set_pte_at_notify.

 Agreed.

 I would suggest to change set_pte_at_notify to return if change_pte
 was missing in some mmu notifier attached to this mm, so we can do
 something like:

ptep = page_check_address(page, mm, addr, ptl, 0);
[..]
notify_missing = false;
if (... ) {
 entry = ptep_clear_flush(...);
 [..]
 notify_missing = set_pte_at_notify(mm, addr, ptep, entry);
}
pte_unmap_unlock(ptep, ptl);
if (notify_missing)
 mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr);

 and drop the range calls. This will provide sleepability and at the
 same time it won't screw the ability of change_pte to update sptes (by
 leaving those established by the time change_pte runs).

 I think it would be better for notifiers that do not support change_pte
 to keep getting both range_start and range_end notifiers. Otherwise, the
 invalidate_page notifier might end up marking the old page as dirty
 after it was already replaced in the primary page table.
 
 Ok but why would that be a problem? If the secondary pagetable mapping
 is found dirty, the old page shall be marked dirty as it means it was
 modified through the secondary mmu and is on-disk version may need to
 be updated before discarding the in-ram copy. What the difference
 would be to mark the page dirty in the range_start while the primary
 page table is still established, or after?
 
 ...
 
 But in places like ksm merging and do_wp_page we hold a page reference
 before we start the primary pagetable updating, until after the mmu
 notifier invalidate.

Right. I missed that page locking.

Another possible issue is with reads from the secondary page table.
Given a read-only page, suppose one host thread writes to that page, and
performs COW, but before it calls the
mmu_notifier_invalidate_page_if_missing_change_pte function another host
thread writes to the same page (this time without a page fault). Then we
have a valid entry in the secondary page table to a stale page, and
someone may read stale data from there.

Do you agree?

Thanks,
Haggai
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: set_pte_at_notify regression

2014-01-12 Thread Haggai Eran
Hi,

On 10/01/2014 18:57, Andrea Arcangeli wrote:
 Hi!

 On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote:
 It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
 semantic of set_pte_at_notify.
 The change of calling first to mmu_notifier_invalidate_range_start, then 
 to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
 not only increase the amount of locks kvm have to take and release by 
 factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
 the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
 have any spte to set and it acctuly get called for nothing, the result is
 increasing of vmexits for kvm from both do_wp_page and replace_page, and 
 broken semantic of set_pte_at_notify.

 Agreed.

 I would suggest to change set_pte_at_notify to return if change_pte
 was missing in some mmu notifier attached to this mm, so we can do
 something like:

ptep = page_check_address(page, mm, addr, ptl, 0);
[..]
notify_missing = false;
if (... ) {
   entry = ptep_clear_flush(...);
 [..]
   notify_missing = set_pte_at_notify(mm, addr, ptep, entry);
}
pte_unmap_unlock(ptep, ptl);
if (notify_missing)
   mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr);

 and drop the range calls. This will provide sleepability and at the
 same time it won't screw the ability of change_pte to update sptes (by
 leaving those established by the time change_pte runs).

I think it would be better for notifiers that do not support change_pte
to keep getting both range_start and range_end notifiers. Otherwise, the
invalidate_page notifier might end up marking the old page as dirty
after it was already replaced in the primary page table.

Perhaps we can have a flag in the mmu_notifier, similar to the
notify_missing returned value here, that determines in these cases
whether to call the invalidate_range_start/end pair, or just the
set_pte_at_notify.

Thanks,
Haggai
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: set_pte_at_notify regression

2014-01-12 Thread Andrea Arcangeli
On Sun, Jan 12, 2014 at 01:56:12PM +0200, Haggai Eran wrote:
 Hi,
 
 On 10/01/2014 18:57, Andrea Arcangeli wrote:
  Hi!
 
  On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote:
  It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
  semantic of set_pte_at_notify.
  The change of calling first to mmu_notifier_invalidate_range_start, then 
  to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
  not only increase the amount of locks kvm have to take and release by 
  factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
  the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
  have any spte to set and it acctuly get called for nothing, the result is
  increasing of vmexits for kvm from both do_wp_page and replace_page, and 
  broken semantic of set_pte_at_notify.
 
  Agreed.
 
  I would suggest to change set_pte_at_notify to return if change_pte
  was missing in some mmu notifier attached to this mm, so we can do
  something like:
 
 ptep = page_check_address(page, mm, addr, ptl, 0);
 [..]
 notify_missing = false;
 if (... ) {
  entry = ptep_clear_flush(...);
  [..]
  notify_missing = set_pte_at_notify(mm, addr, ptep, entry);
 }
 pte_unmap_unlock(ptep, ptl);
 if (notify_missing)
  mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr);
 
  and drop the range calls. This will provide sleepability and at the
  same time it won't screw the ability of change_pte to update sptes (by
  leaving those established by the time change_pte runs).
 
 I think it would be better for notifiers that do not support change_pte
 to keep getting both range_start and range_end notifiers. Otherwise, the
 invalidate_page notifier might end up marking the old page as dirty
 after it was already replaced in the primary page table.

Ok but why would that be a problem? If the secondary pagetable mapping
is found dirty, the old page shall be marked dirty as it means it was
modified through the secondary mmu and is on-disk version may need to
be updated before discarding the in-ram copy. What the difference
would be to mark the page dirty in the range_start while the primary
page table is still established, or after?

Here the docs too:

/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
 * pte because the page hasn't been freed yet and it won't be
 * freed until this returns. If required set_page_dirty has to
 * be called internally to this method.
 */
void (*invalidate_page)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address);

Why the range_start/end is needed, is only to solve the mess with the
freeing of the page in those cases were we hold no individual
reference on the pages and we do tlb gather freeing.

But in places like ksm merging and do_wp_page we hold a page reference
before we start the primary pagetable updating, until after the mmu
notifier invalidate.

Thanks!
Andrea
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


set_pte_at_notify regression

2014-01-10 Thread Izik Eidus

Hi,

It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
semantic of set_pte_at_notify.
The change of calling first to mmu_notifier_invalidate_range_start, then 
to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
not only increase the amount of locks kvm have to take and release by 
factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
have any spte to set and it acctuly get called for nothing, the result is
increasing of vmexits for kvm from both do_wp_page and replace_page, and 
broken semantic of set_pte_at_notify.


Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: set_pte_at_notify regression

2014-01-10 Thread Andrea Arcangeli
Hi!

On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote:
 It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the 
 semantic of set_pte_at_notify.
 The change of calling first to mmu_notifier_invalidate_range_start, then 
 to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end
 not only increase the amount of locks kvm have to take and release by 
 factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping
 the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t 
 have any spte to set and it acctuly get called for nothing, the result is
 increasing of vmexits for kvm from both do_wp_page and replace_page, and 
 broken semantic of set_pte_at_notify.

Agreed.

I would suggest to change set_pte_at_notify to return if change_pte
was missing in some mmu notifier attached to this mm, so we can do
something like:

   ptep = page_check_address(page, mm, addr, ptl, 0);
   [..]
   notify_missing = false;
   if (... ) {
entry = ptep_clear_flush(...);
[..]
notify_missing = set_pte_at_notify(mm, addr, ptep, entry);
   }
   pte_unmap_unlock(ptep, ptl);
   if (notify_missing)
mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr);

and drop the range calls. This will provide sleepability and at the
same time it won't screw the ability of change_pte to update sptes (by
leaving those established by the time change_pte runs).

This assuming the mutex are going to stay mutex for anon_vma lock and
i_mmap_mutex as I hope. Otherwise the commit could be as well
reverted, it would be pointless then to try to keep the
invalidate_page call outside the PT lock if all other invalidate_page
calls are inside rmap spinlocks.

I think giving a runtime or compiler option to switch the locks to
spinlocks is just fine, cellphones I think would be better off with
those locks as spinlocks for example, but completely removing the
ability to run those locks as mutex even on server setups, doesn't
look a too attractive development to me. A build option especially
wouldn't be too painful to maintain. So I'd be positive for an update
like above to retain the sleeability feature but without harming
change_pte users like KVM anymore.

Thanks!
Andrea
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html