Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-28 Thread Xiao Guangrong
On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>> reader indefinitely, restarting the lockless walk every time).
>>>>>
>>>>> Hmm, that can be avoided by checking the dirty bitmap before the rewalk;
>>>>> that means, if the dirty bitmap has been set during the lockless
>>>>> write-protection, it's unnecessary to write-protect its sptes. What do you
>>>>> think?
>>>> This idea is based on the fact that the number of rmap entries is limited by
>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
>>>> we can break out of the rewalk at once; in the case of deleting, we can
>>>> rewalk at most RMAP_RECYCLE_THRESHOLD times.
>>>
>>> Please explain in more detail.
>>
>> Okay.
>>
>> My proposal is like this:
>>
>> pte_list_walk_lockless()
>> {
>> restart:
>>
>> +if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>> +return;
>>
>>  code-doing-lockless-walking;
>>  ..
>> }
>>
>> Before doing the lockless walk, we check the dirty bitmap first; if
>> it is set, we can simply skip write-protection for the gfn. That
>> is the case where a new spte is being added into the rmap while we
>> access the rmap locklessly.
> 
> The dirty bit could be set after the check.
> 
>> For the case of deleting a spte from the rmap, the number of entries is
>> limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
> 
> It can shrink and grow while lockless walk is performed.

Yes, indeed.

Hmmm, another idea in my mind to fix this is to encode the position into
the reserved bits of the desc->more pointer, for example:

          +--------+    +--------+    +--------+
rmapp ->  | Desc 0 | -> | Desc 1 | -> | Desc 2 |
          +--------+    +--------+    +--------+

There are 3 descs on the rmap, and:
rmapp = &desc0 | 1UL | 3UL << 50;
desc0->more = desc1 | 2UL << 50;
desc1->more = desc2 | 1UL << 50;
desc2->more = &rmapp | 1UL; (The nulls pointer)

We will walk to the next desc only if the "position" of the current desc
is >= the position of the next desc. That makes sure we can always reach
the last desc.

And in order to avoid doing too many "rewalks", we will go to the
slow path (walk while holding the lock) instead when we have retried the
walk more than N times.
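
To make the idea concrete, here is a toy sketch of the position check; the bit
layout, the 0-as-nulls convention, and the helper names are illustrative
assumptions, not the patchset's code:

#include <stdbool.h>
#include <stdint.h>

#define POS_SHIFT	50
#define PTR_MASK	((1ULL << POS_SHIFT) - 1)

struct desc {
	uint64_t more;	/* next desc | (position << POS_SHIFT); 0 stands for the nulls */
};

static struct desc *more_ptr(uint64_t more)
{
	return (struct desc *)(uintptr_t)(more & PTR_MASK);
}

static unsigned long more_pos(uint64_t more)
{
	return (unsigned long)(more >> POS_SHIFT);
}

/* Returns false when the walk has to be restarted (or fall back to mmu-lock). */
static bool walk_check_positions(uint64_t head)
{
	uint64_t cur = head;

	while (more_ptr(cur)) {
		uint64_t next = more_ptr(cur)->more;

		/* ... process the sptes packed into more_ptr(cur) here ... */

		/* positions must not increase, otherwise the desc moved under us */
		if (more_ptr(next) && more_pos(next) > more_pos(cur))
			return false;
		cur = next;
	}
	return true;
}

In the real rmap the nulls would be the encoded rmapp pointer rather than 0, and
the retry counter described above would bound how many times this walk is
restarted before falling back to the mmu-lock.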

Thanks to all of you on Thanksgiving day. :)



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-28 Thread Xiao Guangrong
On 11/27/2013 03:58 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
>> On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>>>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>>>>  wrote:
>>>>>
>>>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti  wrote:
>>>>
>>>> 
>>>>
>>>> I'm not really following, but note that parent_pte predates EPT (and
>>>> the use of rcu in kvm), so all the complexity that is the result of
>>>> trying to pack as many list entries into a cache line can be dropped.
>>>> Most setups now would have exactly one list entry, which is handled
>>>> specially anyway.
>>>>
>>>> Alternatively, the trick of storing multiple entries in one list entry
>>>> can be moved to generic code, it may be useful to others.
>>>
>>> Yes, can the lockless list walking code be transformed into generic
>>> single-linked list walking? So the correctness can be verified
>>> independently, and KVM becomes a simple user of that interface.
>>
>> I'm afraid the single-entry list is not as good as we expected. In my
>> experience, there are too many entries on the rmap, sometimes more than 300
>> (consider the case of a lib shared by all processes).
> 
> single linked list was about moving singly-linked lockless walking
> to generic code.
> 
> http://www.spinics.net/lists/linux-usb/msg39643.html
> http://marc.info/?l=linux-kernel&m=103305635013575&w=3
> 

Oh, I confused "single linked" with "single entry", sorry about that.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>> Also, there is no guarantee of termination (as long as sptes are
>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>> termination for rculist nulls either (a writer can race with a lockless
>>>> reader indefinitely, restarting the lockless walk every time).
>>>
>>> Hmm, that can be avoided by checking the dirty bitmap before the rewalk;
>>> that means, if the dirty bitmap has been set during the lockless
>>> write-protection, it's unnecessary to write-protect its sptes. What do you
>>> think?
>> This idea is based on the fact that the number of rmap entries is limited by
>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
>> we can break out of the rewalk at once; in the case of deleting, we can
>> rewalk at most RMAP_RECYCLE_THRESHOLD times.
> 
> Please explain in more detail.

Okay.

My proposal is like this:

pte_list_walk_lockless()
{
restart:

+   if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
+   return;

code-doing-lockless-walking;
..
}

Before doing the lockless walk, we check the dirty bitmap first; if
it is set, we can simply skip write-protection for the gfn. That
is the case where a new spte is being added into the rmap while we
access the rmap locklessly.

For the case of deleting a spte from the rmap, the number of entries is
limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
>> On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
>>  wrote:
>>>
>>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti  wrote:
>>
>> 
>>
>> I'm not really following, but note that parent_pte predates EPT (and
>> the use of rcu in kvm), so all the complexity that is the result of
>> trying to pack as many list entries into a cache line can be dropped.
>> Most setups now would have exactly one list entry, which is handled
>> specially anyway.
>>
>> Alternatively, the trick of storing multiple entries in one list entry
>> can be moved to generic code, it may be useful to others.
> 
> Yes, can the lockless list walking code be transformed into generic
> single-linked list walking? So the correctness can be verified
> independently, and KVM becomes a simple user of that interface.

I'm afraid the single-entry list is not as good as we expected. In my
experience, there are too many entries on the rmap, sometimes more than 300
(consider the case of a lib shared by all processes).

> 
> The simpler version is to maintain lockless walk on depth-1 rmap entries
> (and grab the lock once depth-2 entry is found).

I still think the lockless rmap walk is more graceful: the soft mmu can
also benefit from it, and it is promising to be used in some mmu-notifier
functions. :)




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/25/2013 10:08 PM, Marcelo Tosatti wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>
>> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti  wrote:
>>
>>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>>> It is like a nulls list: we use the pte-list as the nulls, which can help us
>>>> detect whether the "desc" has been moved to another rmap, in which case we
>>>> re-walk the rmap
>>>>
>>>> kvm->slots_lock is held when we do the lockless walking, which prevents the
>>>> rmap from being reused (freeing an rmap needs to hold that lock), so we
>>>> cannot see the same nulls used on different rmaps
>>>>
>>>> Signed-off-by: Xiao Guangrong 
>>>
>>> How about simplified lockless walk on the slot while rmapp entry
>>> contains a single spte? (which should be the case with two-dimensional
>>> paging).
>>>
>>> That is, grab the lock when finding a rmap with more than one spte in
>>> it (and then keep it locked until the end).
>>
>> Hmm… that isn't straightforward and is more complex than the approach
>> in this patchset. Also, it would drop the improvement for the shadow mmu,
>> which gains a lot from this patchset.
> 
> It is not more complex, since it would remove list lockless walk. Only
> the spte pointer at rmap[spte] is accessed without a lock. Its much much
> simpler.
> 
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case: for parent_ptes, the nulls points
>> to the "shadow page", but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the "rmap" is used as a shadow page while the slots lock is held.
> 
> The SLAB cache is the same, so entries can be reused. What prevents
> a desc entry living in slot.arch.rmap to be freed and reused by a
> parent_ptes desc?
> 

We will check is_last_spte(); all the sptes on parent_ptes should fail that check.
And Gleb suggested using a separate slab for the rmap descs, which should be excellent.
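
For reference, a sketch of the check being described; the shape follows the
patchset's __rmap_write_protect_lockless() (quoted later, in patch 13/15), but
the body after the test is an assumption, not the exact patch code:

static void __rmap_write_protect_lockless(u64 *sptep)
{
	u64 spte;

retry:
	spte = *rcu_dereference(sptep);

	/*
	 * If the lockless walk strayed onto a parent_ptes desc, the entries
	 * there are not last-level sptes, so is_last_spte() rejects them;
	 * already-readonly sptes are skipped as well.
	 */
	if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
		return;

	/* only clear the writable bit; retry if the spte changed under us */
	if (cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) != spte)
		goto retry;
}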



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong

Hi Peter,

On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
>> reader indefinitely, restarting the lockless walk every time).
> 
> What's an rculist null? 

I guess Marcelo was talking about rculist_nulls.h
(Documentation/RCU/rculist_nulls.txt).

> rculists have regular termination conditions,
> they'll reach the end (also head, due to circular etc..) in N steps,
> where N is the number of elements.
> 
> Of course you can keep adding elements to protract this situation, but
> you'll run out of memory eventually -- you also have to make sure to
> insert them after the element being read, otherwise the iteration will
> simply miss them.
> 
> Deleting an element preserves the element itself -- it has to be
> RCU-freed to be part of an rculist, and the element also preserves its
> fwd link, so any iterator will either not see the element, or if they're
> at the element, they'll continue their iteration as normal (rculist
> doesn't have backward iterators).
> 
> A deleted element may not be re-used before an RCU grace period has
> lapsed. Same as for freeing such an element. So if you want to move an
> rculist element you need to:
> 
>   list_del_rcu()
>   synchronize_rcu();
>   list_add_rcu();
> 
> Or use the list_splice_init_rcu() primitive which also explicitly takes
> a @sync argument.

Thanks for your detailed explanation, Peter!

What if the element is allocated from a SLAB_DESTROY_BY_RCU slab? That
means the element may be reused while we are doing the iteration.
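
For context, the usual pattern with SLAB_DESTROY_BY_RCU is that the reader must
re-validate the object after grabbing it, because the memory may have been
recycled into a new object of the same type; a generic sketch adapted from
Documentation/RCU/rculist_nulls.txt (identifiers are illustrative, not KVM code):

#include <linux/rculist_nulls.h>
#include <linux/spinlock.h>

struct item {
	int			key;
	spinlock_t		lock;
	struct hlist_nulls_node	node;
};

/*
 * "slot" is the hash bucket index, which is also the nulls value stored
 * at the end of that bucket's list.
 */
static struct item *lookup(struct hlist_nulls_head *bucket, unsigned long slot,
			   int key)
{
	struct hlist_nulls_node *pos;
	struct item *it;

	rcu_read_lock();
begin:
	hlist_nulls_for_each_entry_rcu(it, pos, bucket, node) {
		if (it->key != key)
			continue;
		spin_lock(&it->lock);
		if (it->key != key) {
			/* the object was freed and reused for another key */
			spin_unlock(&it->lock);
			goto begin;
		}
		rcu_read_unlock();
		return it;	/* returned with it->lock held */
	}
	/*
	 * Reaching a nulls value that does not match our bucket means the
	 * object we were on was moved to another chain: restart the walk.
	 */
	if (get_nulls_value(pos) != slot)
		goto begin;
	rcu_read_unlock();
	return NULL;
}

The desc walker in this patchset plays the same game: the nulls stored in the
last desc is compared against the rmap it should belong to, and a mismatch
forces a rewalk.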




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/25/2013 06:19 PM, Gleb Natapov wrote:
> On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
>>>
>>> For example, nothing prevents lockless walker to move into some
>>> parent_ptes chain, right?
>>
>> No.
>>
>> The nulls can help us to detect this case: for parent_ptes, the nulls points
>> to the "shadow page", but for rmaps, the nulls points to slot.arch.rmap. There
>> is no chance that the "rmap" is used as a shadow page while the slots lock is held.
>>
> But meanwhile we will write protect non-last level sptes, no? Better to

It will meet the non-last sptes but will not write-protect them, since
we do an is_last_spte() check before the cmpxchg.

> create separate slab caches for rmap and parent_ptes lists.

Yes, this is a good idea. Thank you, Gleb!
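
To illustrate the suggestion, separate caches might be set up roughly like this
(cache names and error handling are illustrative, not taken from the patchset):

static struct kmem_cache *rmap_desc_cache;	/* descs reachable from rmaps */
static struct kmem_cache *parent_desc_cache;	/* descs used for parent_ptes */

static int pte_list_desc_caches_init(void)
{
	/*
	 * rmap descs are walked locklessly, so their memory must stay
	 * type-stable across reuse (SLAB_DESTROY_BY_RCU); parent_ptes descs
	 * come from a different pool, so a recycled rmap desc can never show
	 * up inside a parent_ptes chain, and vice versa.
	 */
	rmap_desc_cache = kmem_cache_create("kvm_rmap_desc",
					    sizeof(struct pte_list_desc),
					    0, SLAB_DESTROY_BY_RCU, NULL);
	parent_desc_cache = kmem_cache_create("kvm_parent_ptes_desc",
					      sizeof(struct pte_list_desc),
					      0, 0, NULL);
	if (!rmap_desc_cache || !parent_desc_cache)
		return -ENOMEM;
	return 0;
}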



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-24 Thread Xiao Guangrong
On 11/25/2013 02:11 PM, Xiao Guangrong wrote:
> 
> On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti  wrote:
> 
>> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>>> It is like a nulls list: we use the pte-list as the nulls, which can help us
>>> detect whether the "desc" has been moved to another rmap, in which case we
>>> re-walk the rmap
>>>
>>> kvm->slots_lock is held when we do the lockless walking, which prevents the
>>> rmap from being reused (freeing an rmap needs to hold that lock), so we
>>> cannot see the same nulls used on different rmaps
>>>
>>> Signed-off-by: Xiao Guangrong 
>>
>> How about simplified lockless walk on the slot while rmapp entry
>> contains a single spte? (which should be the case with two-dimensional
>> paging).
>>
>> That is, grab the lock when finding a rmap with more than one spte in
>> it (and then keep it locked until the end).
> 
> Hmm… that isn't straightforward and is more complex than the approach
> in this patchset. Also, it would drop the improvement for the shadow mmu,
> which gains a lot from this patchset.
> 
>>
>> For example, nothing prevents lockless walker to move into some
>> parent_ptes chain, right?
> 
> No.
> 
> The nulls can help us to detect this case: for parent_ptes, the nulls points
> to the "shadow page", but for rmaps, the nulls points to slot.arch.rmap. There
> is no chance that the "rmap" is used as a shadow page while the slots lock is held.
> 
>>
>> Also, there is no guarantee of termination (as long as sptes are
>> deleted with the correct timing). BTW, can't see any guarantee of
>> termination for rculist nulls either (a writer can race with a lockless
> reader indefinitely, restarting the lockless walk every time).
> 
> Hmm, that can be avoided by checking the dirty bitmap before the rewalk;
> that means, if the dirty bitmap has been set during the lockless write-protection,
> it's unnecessary to write-protect its sptes. What do you think?

This idea is based on the fact that the number of rmap entries is limited by
RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
we can break out of the rewalk at once; in the case of deleting, we can
rewalk at most RMAP_RECYCLE_THRESHOLD times.




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-24 Thread Xiao Guangrong

On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti  wrote:

> On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
>> It is like a nulls list: we use the pte-list as the nulls, which can help us
>> detect whether the "desc" has been moved to another rmap, in which case we
>> re-walk the rmap
>> 
>> kvm->slots_lock is held when we do the lockless walking, which prevents the
>> rmap from being reused (freeing an rmap needs to hold that lock), so we
>> cannot see the same nulls used on different rmaps
>> 
>> Signed-off-by: Xiao Guangrong 
> 
> How about simplified lockless walk on the slot while rmapp entry
> contains a single spte? (which should be the case with two-dimensional
> paging).
> 
> That is, grab the lock when finding a rmap with more than one spte in
> it (and then keep it locked until the end).

Hmm… that isn't straightforward and is more complex than the approach
in this patchset. Also, it would drop the improvement for the shadow mmu,
which gains a lot from this patchset.

> 
> For example, nothing prevents lockless walker to move into some
> parent_ptes chain, right?

No.

The nulls can help us to detect this case: for parent_ptes, the nulls points
to the "shadow page", but for rmaps, the nulls points to slot.arch.rmap. There
is no chance that the "rmap" is used as a shadow page while the slots lock is held.

> 
> Also, there is no guarantee of termination (as long as sptes are
> deleted with the correct timing). BTW, can't see any guarantee of
> termination for rculist nulls either (a writer can race with a lockless
> reader indefinitely, restarting the lockless walk every time).

Hmm, that can be avoided by checking the dirty bitmap before the rewalk;
that means, if the dirty bitmap has been set during the lockless write-protection,
it's unnecessary to write-protect its sptes. What do you think?

But… do we really need to care about it? :(








Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-11-20 Thread Xiao Guangrong

On Nov 21, 2013, at 3:47 AM, Marcelo Tosatti  wrote:

> On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
>>> But what guarantee does userspace require, from GET_DIRTY_LOG, while vcpus 
>>> are
>>> executing? 
>> 
>> Aha. A single call to GET_DIRTY_LOG is useless since new dirty pages can be
>> generated while GET_DIRTY_LOG is returning. If the user wants to get the exact
>> dirty pages, the vcpus should be stopped.
>> 
>>> 
>>> With fast page fault:
>>> 
>>> if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>>> /* The window in here... */
>>> mark_page_dirty(vcpu->kvm, gfn);
>>> 
>>> And the $SUBJECT set_spte reordering, the rule becomes
>>> 
>>> A call to GET_DIRTY_LOG guarantees to return correct information about 
>>> dirty pages before invocation of the previous GET_DIRTY_LOG call.
>>> 
>>> (see example 1: the next GET_DIRTY_LOG will return the dirty information
>>> there).
>>> 
>> 
>> It seems not.
>> 
>> The first GET_DIRTY_LOG can happen before fast-page-fault,
>> the second GET_DIRTY_LOG happens in the window between cmpxchg()
>> and mark_page_dirty(), for the second one, the information is still 
>> “incorrect”.
> 
> Its correct for the previous GET_DIRTY_LOG call.

Oh, yes.

> 
>>> The rule for sptes that is, because kvm_write_guest does not match the
>>> documentation at all.
>> 
>> You mean the case of “kvm_write_guest” is valid (I do not know why it is)?
>> Or anything else?
>> 
>>> 
>>> So before example 1 and this patch, the rule (well for sptes at least) was
>>> 
>>> "Given a memory slot, return a bitmap containing any pages dirtied
>>> since the last call to this ioctl.  Bit 0 is the first page in the
>>> memory slot.  Ensure the entire structure is cleared to avoid padding
>>> issues."
>>> 
>>> Can you explain why it is OK to relax this rule?
>> 
>> It’s because:
>> 1) it doesn't break current use cases, i.e. live migration and FB-flushing.
>> 2) the current code, like kvm_write_guest, has already broken the documentation
>>    (the guest page has been written but missed in the dirty bitmap).
>> 3) it's needless to implement an exact get-dirty-pages since the dirty pages
>>    cannot be exactly obtained without stopping the vcpus.
>> 
>> So I think we'd better document this case instead. No?
> 
> Lets figure out the requirements, then. I don't understand why
> FB-flushing is safe (think kvm-autotest: one pixel off the entire
> test fails).

I did not read the FB-flushing code; I guess the reason why it can work is:
FB-flushing does get-dirty-page periodically and flushes it. After the guest writes
the page, the page will be flushed in the next GET_DIRTY_LOG.

> 
> Before fast page fault: Pages are either write protected or the
> corresponding dirty bitmap bit is set. Any write faults to dirty logged
> sptes while GET_DIRTY log is executing in the protected section are
> allowed to instantiate writeable spte after GET_DIRTY log is finished.
> 
> After fast page fault: Pages can be writeable and the dirty bitmap not
> set. Therefore data can be dirty before GET_DIRTY executes and still
> fail to be returned in the bitmap.

That's right. The current GET_DIRTY fails to get the dirty page, but the
next GET_DIRTY can get it properly since the current GET_DIRTY
needs to flush all TLBs, which waits for the fast page fault to finish.

I do not think it's a big problem since a single GET_DIRTY is useless, as
I explained in the previous mail.

> 
> Since this patchset does not introduce change in behaviour (fast pf
> did), no reason to avoid merging this.

Yes, thank you, Marcelo! :)

> 
> BTW, since GET_DIRTY log does not require to report concurrent (to
> GET_DIRTY_LOG) write faults, it is not necessary to rewalk the spte
> list, is it?

You mean the "rewalk" we introduced in pte_list_walk_lockless() in this
patchset?
I think this rewalk is needed because it is triggered by meeting an unexpected
nulls, which means we are walking on an unexpected rmap. If we do not do this,
some writable sptes will be missed.
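
A minimal sketch of that rewalk, with desc_is_a_nulls() and nulls_to_rmap() as
illustrative helper names (the patchset's exact identifiers may differ):

static void pte_list_walk_lockless(unsigned long *rmapp)
{
	struct pte_list_desc *desc;
	unsigned long pte_list;
	int i;

restart:
	pte_list = ACCESS_ONCE(*rmapp);
	if (!pte_list)
		return;				/* nothing mapped */
	if (!(pte_list & 1)) {
		/* a single spte is stored directly in the rmap head */
		__rmap_write_protect_lockless((u64 *)pte_list);
		return;
	}

	for (desc = (struct pte_list_desc *)(pte_list & ~1ul);
	     !desc_is_a_nulls(desc);
	     desc = rcu_dereference(desc->more))
		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; i++)
			__rmap_write_protect_lockless(desc->sptes[i]);

	/*
	 * The nulls stored in the last desc must point back at the rmap we
	 * started from; if it does not, the desc was recycled onto another
	 * rmap while we walked and some sptes may have been skipped.
	 */
	if (nulls_to_rmap(desc) != rmapp)
		goto restart;
}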






Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-11-20 Thread Xiao Guangrong

On Nov 20, 2013, at 8:29 AM, Marcelo Tosatti  wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>>>> Make sure we can see the writable spte before the dirt bitmap is visible
>>>> 
>>>> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte 
>>>> based
>>>> on the dirty bitmap, we should ensure the writable spte can be found in 
>>>> rmap
>>>> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap 
>>>> and
>>>> failed to write-protect the page
>>>> 
>>>> Signed-off-by: Xiao Guangrong 
>>>> ---
>>>> arch/x86/kvm/mmu.c | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>> 
>>> Can you explain why this is safe, with regard to the rule 
>>> at edde99ce05290e50 ?
>> 
>> BTW, this log fixed this case:
>> 
>>    VCPU 0                          KVM migration control
>>
>>                                    write-protects all pages
>>    #PF happens, then the page
>>    becomes writable; set the dirty
>>    bit on the bitmap
>>
>>                                    swap the bitmap, current bitmap is empty
>>
>>    write the page (no dirty log)
>>
>>                                    stop the guest and push
>>                                    the remaining dirty pages
>>    Stopped
>>                                    See current bitmap is empty, which means
>>                                    no page is dirty.
>>> 
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>> 
>> Actually, this rule is not completely true; there's a third case:
>> the window between writing the guest page and setting the dirty bitmap is valid.
>> In that window, the page is freely writable but not dirty-logged.
>> 
>> This case is based on the fact that at the final step of live migration,
>> kvm should stop the guest and push the remaining dirty pages to the
>> destination.
>> 
>> They're some examples in the current code:
>> example 1, in fast_pf_fix_direct_spte():
>>  if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>>  /* The window in here... */
>>  mark_page_dirty(vcpu->kvm, gfn);
>> 
>> example 2, in kvm_write_guest_page():
>>  r = __copy_to_user((void __user *)addr + offset, data, len);
>>  if (r)
>>  return -EFAULT;
>>  /*
>>   * The window is here, the page is dirty but not logged in
>>* The bitmap.
>>   */
>>  mark_page_dirty(kvm, gfn);
>>  return 0;
> 

Hi Marcelo,

> Why is this valid ? That is, the obviously correct rule is
> 
> "that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."
> 
> With the window above, GET_DIRTY_LOG can be called 100 times while the 
> page is dirty, but the corresponding bit not set in the dirty bitmap.
> 
> It violates the documentation:
> 
> /* for KVM_GET_DIRTY_LOG */
> struct kvm_dirty_log {
>   __u32 slot;
>   __u32 padding;
>   union {
>   void __user *dirty_bitmap; /* one bit per page */
>   __u64 padding;
>   };
> };
> 
> Given a memory slot, return a bitmap containing any pages dirtied
> since the last call to this ioctl.  Bit 0 is the first page in the
> memory slot.  Ensure the entire structure is cleared to avoid padding
> issues.
> 
> The point about migration, is that GET_DIRTY_LOG is strictly correct
> because it stops vcpus.
> 
> But what guarantee does userspace require, from GET_DIRTY_LOG, while vcpus are
> executing? 

Aha. A single call to GET_DIRTY_LOG is useless since new dirty pages can be
generated while GET_DIRTY_LOG is returning. If the user wants to get the exact
dirty pages, the vcpus should be stopped.

> 
> With fast page fault:
> 
>  if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>  /* The window in here... */
>  mark_page_dirty(vcpu->kvm, gfn);
> 
> And the $SUBJECT set_spte reordering, the rule becomes
> 
> A call to GET_DIRTY_LOG guarantees to return correct information about 
> dirty pages before invocation of the previous GET_DIRTY_LOG call.
> 
> (see example 1: the next GET_DIRTY_LOG will return the dirty information
> there).
> 

It seems not.

The first GET_DIRTY_LOG can happen before fast-page-fault,
the second GET_DIRTY_LOG happens in the

Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-11-14 Thread Xiao Guangrong
On 11/15/2013 02:39 AM, Marcelo Tosatti wrote:
> On Thu, Nov 14, 2013 at 01:15:24PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:
>>
>>>
>>> Any code location which reads the writable bit in the spte and assumes if 
>>> its not
>>> set, that the translation which the spte refers to is not cached in a
>>> remote CPU's TLB can become buggy. (*)
>>>
>>> It might be the case that now its not an issue, but its so subtle that
>>> it should be improved.
>>>
>>> Can you add a fat comment on top of is_writeable_bit describing this?
>>> (and explain why is_writable_pte users do not make an assumption
>>> about (*). 
>>>
>>> "Writeable bit of locklessly modifiable sptes might be cleared
>>> but TLBs not flushed: so whenever reading locklessly modifiable sptes
>>> you cannot assume TLBs are flushed".
>>>
>>> For example this one is unclear:
>>>
>>> if (!can_unsync && is_writable_pte(*sptep))
>>> goto set_pte;
>>> And:
>>>
>>> if (!is_writable_pte(spte) &&
>>>   !(pt_protect && spte_is_locklessly_modifiable(spte)))
>>> return false;
>>>
>>> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
>>> serialized by a single mutex (if there were two mutexes, it would not be
>>> safe). Can you add an assert to both
>>> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
>>> for (slots_lock) is locked, and explain?
>>>
>>> So just improve the comments please, thanks (no need to resend whole
>>> series).
>>
>> Thank you very much for your time reviewing it; I really appreciate
>> that you detailed the issue so clearly for me.
>>
>> I will do it on top of this patchset or after it is merged
>> (if that's possible).
> 
> Ok, can you explain why every individual caller of is_writable_pte have
> no such assumption now? (the one mentioned above is not clear to me for
> example, should explain all of them).

Okay.

Generally speaking: 1) we needn't care about readonly sptes too much since they
cannot be locklessly write-protected, and 2) if is_writable_pte() is used
to check the mmu-mode's state, we can check SPTE_MMU_WRITEABLE instead.
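
For reference, the fat comment Marcelo asked for could be worded roughly like
this on top of is_writable_pte(); this is a sketch of the wording only, the
final comment would come in the follow-up patch:

/*
 * Writable bit of locklessly modifiable sptes can be cleared out of
 * mmu-lock while the TLBs have not been flushed yet: when reading such
 * an spte, never assume that !is_writable_pte() means the translation
 * is no longer cached in some remote CPU's TLB.
 */
static int is_writable_pte(unsigned long pte)
{
	return pte & PT_WRITABLE_MASK;
}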

These are the places where is_writable_pte() is used:
1) in spte_has_volatile_bits():
 527 static bool spte_has_volatile_bits(u64 spte)
 528 {
 529 /*
 530  * Always atomicly update spte if it can be updated
 531  * out of mmu-lock, it can ensure dirty bit is not lost,
 532  * also, it can help us to get a stable is_writable_pte()
 533  * to ensure tlb flush is not missed.
 534  */
 535 if (spte_is_locklessly_modifiable(spte))
 536 return true;
 537
 538 if (!shadow_accessed_mask)
 539 return false;
 540
 541 if (!is_shadow_present_pte(spte))
 542 return false;
 543
 544 if ((spte & shadow_accessed_mask) &&
 545   (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
 546 return false;
 547
 548 return true;
 549 }

This path is not broken since any spte that can be locklessly modified will do
the lockless update (we always return 'true' at line 536).

2) in mmu_spte_update():
594 /*
 595  * For the spte updated out of mmu-lock is safe, since
 596  * we always atomicly update it, see the comments in
 597  * spte_has_volatile_bits().
 598  */
 599 if (spte_is_locklessly_modifiable(old_spte) &&
 600   !is_writable_pte(new_spte))
 601 ret = true;

The new_spte is a temporary value that cannot be fetched by the lockless
write-protection, and !is_writable_pte() is stable enough (it cannot be
locklessly write-protected).

3) in spte_write_protect()
1368 if (!is_writable_pte(spte) &&
1369   !spte_is_locklessly_modifiable(spte))
1370 return false;
1371

It always does write-protection if the spte is locklessly modifiable.
(This is how the code looks after applying the whole patchset; the code is also safe
before patch "[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect" since
the lockless write-protection path is serialized by a single lock.)

4) in set_spte()
2690 /*
2691  * Optimization: for pte sync, if spte was writable the 
hash
2692  * lookup is unnecessary (and expensive). Write protection
2693  * is responsibility of mmu_get_page / kvm_sync_page.
2694

Re: [PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-11-13 Thread Xiao Guangrong

Hi Marcelo,

On 11/14/2013 08:36 AM, Marcelo Tosatti wrote:

> 
> Any code location which reads the writable bit in the spte and assumes if its 
> not
> set, that the translation which the spte refers to is not cached in a
> remote CPU's TLB can become buggy. (*)
> 
> It might be the case that now its not an issue, but its so subtle that
> it should be improved.
> 
> Can you add a fat comment on top of is_writeable_bit describing this?
> (and explain why is_writable_pte users do not make an assumption
> about (*). 
> 
> "Writeable bit of locklessly modifiable sptes might be cleared
> but TLBs not flushed: so whenever reading locklessly modifiable sptes
> you cannot assume TLBs are flushed".
> 
> For example this one is unclear:
> 
> if (!can_unsync && is_writable_pte(*sptep))
> goto set_pte;
> And:
> 
> if (!is_writable_pte(spte) &&
>   !(pt_protect && spte_is_locklessly_modifiable(spte)))
> return false;
> 
> This is safe because get_dirty_log/kvm_mmu_slot_remove_write_access are
> serialized by a single mutex (if there were two mutexes, it would not be
> safe). Can you add an assert to both
> kvm_mmu_slot_remove_write_access/kvm_vm_ioctl_get_dirty_log 
> for (slots_lock) is locked, and explain?
> 
> So just improve the comments please, thanks (no need to resend whole
> series).

Thank you very much for your time reviewing it; I really appreciate
that you detailed the issue so clearly for me.

I will do it on top of this patchset or after it is merged
(if that's possible).






Re: [PATCH v3 00/15] KVM: MMU: locklessly write-protect

2013-11-10 Thread Xiao Guangrong
On 11/03/2013 08:29 PM, Gleb Natapov wrote:
> Marcelo can you review it please?
> 

Ping..

> On Wed, Oct 23, 2013 at 09:29:18PM +0800, Xiao Guangrong wrote:
>> Changelog v3:
>> - the changes from Gleb's review:
>>   1) drop the patch which fixed the count of spte number in rmap since it
>>  can not be easily fixed and it has gone after applying this patchset
>>
>> - ideas from Gleb and discussion with Marcelo is also very appreciated:
>>   2) change the way to locklessly access shadow page - use 
>> SLAB_DESTROY_BY_RCU
>>  to protect shadow page instead of conditionally using call_rcu()
>>   3) improve is_last_spte() that checks last spte by only using some bits on
>>  the spte, then it is safely used when we locklessly write-protect the
>>  shadow page table
>>
>> Changelog v2:
>>
>> - the changes from Gleb's review:
>>   1) fix calculating the number of spte in the pte_list_add()
>>   2) set iter->desc to NULL if meet a nulls desc to cleanup the code of
>>  rmap_get_next()
>>   3) fix hlist corruption due to accessing sp->hlist out of mmu-lock
>>   4) use rcu functions to access the rcu protected pointer
>>   5) spte will be missed in lockless walker if the spte is moved in a desc
>>  (remove a spte from the rmap using only one desc). Fix it by bottom-up
>>  walking the desc
>>
>> - the changes from Paolo's review
>>   1) make the order and memory barriers between update spte / add spte into
>>  rmap and dirty-log more clear
>>   
>> - the changes from Marcelo's review:
>>   1) let fast page fault only fix the spts on the last level (level = 1)
>>   2) improve some changelogs and comments
>>
>> - the changes from Takuya's review:
>>   move the patch "flush tlb if the spte can be locklessly modified" forward
>>   to make it's more easily merged
>>
>> Thank all of you very much for your time and patience on this patchset!
>>   
>> Since we use rcu_assign_pointer() to update the points in desc even if dirty
>> log is disabled, i have measured the performance:
>> Host: Intel(R) Xeon(R) CPU   X5690  @ 3.47GHz * 12 + 36G memory
>>
>> - migrate-perf (benchmark the time of get-dirty-log)
>>   before: Run 10 times, Avg time:9009483 ns.
>>   after: Run 10 times, Avg time:4807343 ns.
>>
>> - kernbench
>>   Guest: 12 VCPUs + 8G memory
>>   before:
>> EPT is enabled:
>> # cat 09-05-origin-ept | grep real   
>> real 85.58
>> real 83.47
>> real 82.95
>>
>> EPT is disabled:
>> # cat 09-05-origin-shadow | grep real
>> real 138.77
>> real 138.99
>> real 139.55
>>
>>   after:
>> EPT is enabled:
>> # cat 09-05-lockless-ept | grep real
>> real 83.40
>> real 82.81
>> real 83.39
>>
>> EPT is disabled:
>> # cat 09-05-lockless-shadow | grep real
>> real 138.91
>> real 139.71
>> real 138.94
>>
>> No performance regression!
>>
>>
>>
>> Background
>> ==
>> Currently, when we mark a memslot as dirty-logged or get its dirty pages, we
>> need to write-protect a large amount of guest memory. That is heavy work;
>> especially, we need to hold the mmu-lock, which is also required by vcpus to
>> fix their page table faults and by the mmu-notifier when a host page is being
>> changed. In a guest with extreme cpu / memory usage, it becomes a scalability issue.
>>
>> This patchset introduces a way to locklessly write-protect guest memory.
>>
>> Idea
>> ==
>> There are the challenges we meet and the ideas to resolve them.
>>
>> 1) How to locklessly walk rmap?
>> The first idea we got to prevent "desc" being freed when we are walking the
>> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
>> it updates the rmap really frequently.
>>
>> So we use SLAB_DESTROY_BY_RCU to manage "desc" instead; it allows the object
>> to be reused more quickly. We also store a "nulls" in the last "desc"
>> (desc->more), which can help us to detect whether the "desc" has been moved to
>> another rmap, in which case we re-walk the rmap. I learned this idea from
>> nulls-list.
>>
>> Another issue is, when a spte is deleted from the "desc", another spte in the
>> last "desc" will be moved to this position to replace the deleted one. If the
>> deleted one has been accessed and we do not access the replaced one, the
>> replace

Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

2013-10-27 Thread Xiao Guangrong
On 10/24/2013 08:32 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 07:01:49PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 06:39 PM, Gleb Natapov wrote:
>>> On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
>>>> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
>>>>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>>>>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>>>>>
>>>>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page 
>>>>>>>> *page_header(hpa_t shadow_page)
>>>>>>>>  {
>>>>>>>>struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>>>>>  
>>>>>>>> -  return (struct kvm_mmu_page *)page_private(page);
>>>>>>>> +  return (struct kvm_mmu_page *)(page->mapping);
>>>>>>> Why?
>>>>>>
>>>>>> That's because page->private has been used by slab:
>>>>>>
>>>>> But does lockless path actually looks at it?
>>>>
>>>> Lockless path does not use it, however, it is used by kvm_mmu_page():
>>>>
>>>> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>>>> {
>>>>struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>
>>>>return (struct kvm_mmu_page *)(page->mapping);
>>>> }
>>>>
>>>> which is used in the common code.
>>> Ah, so the pointer is not available even after object is allocated.
>>> Make sense since we allocate object, not page here, but is it safe to
>>> use mapping like that?
>>
>> The comment says:
>>
>>  struct address_space *mapping;  /* If low bit clear, points to
>>   * inode address_space, or NULL.
>>   * If page mapped as anonymous
>>   * memory, low bit is set, and
>>   * it points to anon_vma object:
>>   * see PAGE_MAPPING_ANON below.
>>
>> It seems 'mapping' is used for address_space or anonymous memory; in
>> our case, the page is used by slab, so I guess it is OK. And the BUG_ON()
>> I put in set_page_header() was not triggered on either slab or slub.
>>
> Yeah, I also think so. I asked Andrea (copied) and he thinks that it is
> safe too currently, but things changes fast in this area. Andrea?
> Another option is too save slab_cache pointer and reset it before
> freeing the object but it looks ugly.

It's ugly but it isn't too bad. :)

Since kvm is currently used extensively to test/measure the Linux kernel,
and the BUG_ON() in set_page_header() can help us to detect the potential
issue, it is easy for us to fix the possible bug in the development cycle
if 'mapping' ever gets used by slab. If that really happens, maybe we can
switch to your way instead.




Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

2013-10-24 Thread Xiao Guangrong
On 10/24/2013 06:39 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 06:10:46PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 05:52 PM, Gleb Natapov wrote:
>>> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>>>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>>>
>>>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t 
>>>>>> shadow_page)
>>>>>>  {
>>>>>>  struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>>>  
>>>>>> -return (struct kvm_mmu_page *)page_private(page);
>>>>>> +return (struct kvm_mmu_page *)(page->mapping);
>>>>> Why?
>>>>
>>>> That's because page->private has been used by slab:
>>>>
>>> But does lockless path actually looks at it?
>>
>> Lockless path does not use it, however, it is used by kvm_mmu_page():
>>
>> static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
>> {
>>  struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>
>>  return (struct kvm_mmu_page *)(page->mapping);
>> }
>>
>> which is used in the common code.
> Ah, so the pointer is not available even after object is allocated.
> Make sense since we allocate object, not page here, but is it safe to
> use mapping like that?

The comment says:

struct address_space *mapping;  /* If low bit clear, points to
 * inode address_space, or NULL.
 * If page mapped as anonymous
 * memory, low bit is set, and
 * it points to anon_vma object:
 * see PAGE_MAPPING_ANON below.

It seems 'mapping' is used for address_space or anonymous memory; in
our case, the page is used by slab, so I guess it is OK. And the BUG_ON()
I put in set_page_header() was not triggered on either slab or slub.
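
For reference, a sketch of what set_page_header() with that BUG_ON() might look
like; the exact body is an assumption based on this discussion, not the posted
patch:

static void set_page_header(struct kvm_mmu_page *sp)
{
	struct page *page = virt_to_page(sp->spt);

	/* catch anyone (e.g. the slab allocator) starting to use ->mapping */
	BUG_ON(page->mapping);
	page->mapping = (struct address_space *)sp;
}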




Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

2013-10-24 Thread Xiao Guangrong
On 10/24/2013 05:52 PM, Gleb Natapov wrote:
> On Thu, Oct 24, 2013 at 05:29:44PM +0800, Xiao Guangrong wrote:
>> On 10/24/2013 05:19 PM, Gleb Natapov wrote:
>>
>>>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t 
>>>> shadow_page)
>>>>  {
>>>>struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>>>  
>>>> -  return (struct kvm_mmu_page *)page_private(page);
>>>> +  return (struct kvm_mmu_page *)(page->mapping);
>>> Why?
>>
>> That's because page->private has been used by slab:
>>
> But does lockless path actually looks at it?

Lockless path does not use it, however, it is used by kvm_mmu_page():

static inline struct kvm_mmu_page *page_header(hpa_t shadow_page)
{
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);

return (struct kvm_mmu_page *)(page->mapping);
}

which is used in the common code.



Re: [PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

2013-10-24 Thread Xiao Guangrong
On 10/24/2013 05:17 PM, Gleb Natapov wrote:

>>  
>> -/**
>> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
>> +static void __rmap_write_protect_lockless(u64 *sptep)
>> +{
>> +u64 spte;
>> +
>> +retry:
>> +/*
>> + * Note we may partly read the sptep on 32bit host, however, we
>> + * allow this case because:
>> + * - we do not access the page got from the sptep.
>> + * - cmpxchg64 can detect that case and avoid setting a wrong value
>> + *   to the sptep.
>> + */
>> +spte = *rcu_dereference(sptep);
>> +if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
> is_last_spte gets two parameters.

In patch [PATCH v3 12/15] "KVM: MMU: check last spte with unawareness of mapping
level", we have removed 'level' from the parameter list.



Re: [PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

2013-10-24 Thread Xiao Guangrong
On 10/24/2013 05:19 PM, Gleb Natapov wrote:

>> @@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t 
>> shadow_page)
>>  {
>>  struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
>>  
>> -return (struct kvm_mmu_page *)page_private(page);
>> +return (struct kvm_mmu_page *)(page->mapping);
> Why?

That's because page->private has been used by slab:

/* Remainder is not double word aligned */
union {
unsigned long private;  /* Mapping-private opaque data:
 * usually used for buffer_heads
 * if PagePrivate set; used for
 * swp_entry_t if PageSwapCache;
 * indicates order in the buddy
 * system if PG_buddy is set.
 */
#if USE_SPLIT_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
struct page *first_page;/* Compound tail pages */
};



[PATCH v3 05/15] KVM: MMU: update spte and add it into rmap before dirty log

2013-10-23 Thread Xiao Guangrong
kvm_vm_ioctl_get_dirty_log() write-protects the spte based on its dirty
bitmap, so we should ensure the writable spte can be found in the rmap before the
dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
write-protect the page, which is detailed in the comments in this patch.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 84 ++
 arch/x86/kvm/x86.c | 10 +++
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 337d173..e85eed6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2427,6 +2427,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 {
u64 spte;
int ret = 0;
+   bool remap = is_rmap_spte(*sptep);
 
if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2488,12 +2489,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
 
-   if (pte_access & ACC_WRITE_MASK)
-   mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+   if (!remap) {
+   if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+   rmap_recycle(vcpu, sptep, gfn);
+
+   if (level > PT_PAGE_TABLE_LEVEL)
+   ++vcpu->kvm->stat.lpages;
+   }
+
+   /*
+* The orders we require are:
+* 1) set spte to writable __before__ set the dirty bitmap.
+*It makes sure that dirty-logging is not missed when do
+*live migration at the final step where kvm should stop
+*the guest and push the remaining dirty pages got from
+*dirty-bitmap to the destination. The similar cases are
+*in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+*
+* 2) add the spte into rmap __before__ set the dirty bitmap.
+*
+* They can ensure we can find the writable spte on the rmap
+* when we do lockless write-protection since
+* kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+* on its dirty-bitmap, otherwise these cases will happen:
+*
+*  CPU 0 CPU 1
+*  kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*  mask = xchg(dirty_bitmap, 0)
+*
+*  try to write-protect gfns which
+*  are set on "mask" then walk then
+*  rmap, see no spte on that rmap
+* add the spte into rmap
+*
+* !! Then the page can be freely wrote but not recorded in
+* the dirty bitmap.
+*
+* And:
+*
+*  VCPU 0CPU 1
+*kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*
+* add spte into rmap
+*mask = xchg(dirty_bitmap, 0)
+*
+*try to write-protect gfns which
+*are set on "mask" then walk then
+*rmap, see spte is on the ramp
+*but it is readonly or nonpresent
+* Mark spte writable
+*
+* !! Then the page can be freely wrote but not recorded in the
+* dirty bitmap.
+*
+* See the comments in kvm_vm_ioctl_get_dirty_log().
+*/
+   smp_wmb();
+
+   if (pte_access & ACC_WRITE_MASK)
+   mark_page_dirty(vcpu->kvm, gfn);
 done:
return ret;
 }
@@ -2503,9 +2565,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 int level, gfn_t gfn, pfn_t pfn, bool speculative,
 bool host_writable)
 {
-   int was_rmapped = 0;
-   int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 *sptep, write_fault, gfn);
 
@@ -2527,8 +2586,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
-   } else
-   was_rmapped = 1;
+   }
}
 
if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2546,16 +2604,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 is_large_pte(*sptep)? "2MB" : "4kB",

[PATCH v3 02/15] KVM: MMU: lazily drop large spte

2013-10-23 Thread Xiao Guangrong
Currently, kvm zaps the large spte if write-protection is needed, so a later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making it non-present, so the page fault caused by read access can
be avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version has fixed the issue reported in 6b73a9606; the reason for that
issue is that fast_page_fault() directly sets the readonly large spte to
writable but only dirties the first page in the dirty bitmap, which means
other pages are missed. Fix it by allowing only the normal sptes (on the
PT_PAGE_TABLE_LEVEL level) to be fast fixed.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 36 
 arch/x86/kvm/x86.c |  8 ++--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d2aacc2..8739208 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1176,8 +1176,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1186,10 +1185,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
u64 spte = *sptep;
 
@@ -1199,17 +1197,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
*flush, bool pt_protect)
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (__drop_large_spte(kvm, sptep)) {
-   *flush |= true;
-   return true;
-   }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
 
-   *flush |= mmu_spte_update(sptep, spte);
-   return false;
+   return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1221,11 +1213,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
 
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
-   if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-   sptep = rmap_get_first(*rmapp, &iter);
-   continue;
-   }
 
+   flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}
 
@@ -2669,6 +2658,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
break;
}
 
+   drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;
 
@@ -2870,6 +2861,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
 
/*
+* Do not fix write-permission on the large spte since we only dirty
+* the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+* that means other pages are missed if its slot is dirty-logged.
+*
+* Instead, we let the slow page fault path create a normal spte to
+* fix the access.
+*
+* See the comments in kvm_arch_commit_memory_region().
+*/
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   goto exit;
+
+   /*
 * Currently, fast page fault only works for direct mapping since
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index edf2a07..b3aa650 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7223,8 +7223,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
 * Write protect all pages for dirty logging.
-* Existing largepage mappings are destroyed here and new ones will
-* not be created until the end of the logging.
+*
+* All the sptes including the large sptes which point to this
+* slot are set to readonly. We can not create any new 

[PATCH v3 01/15] KVM: MMU: properly check last spte in fast_page_fault()

2013-10-23 Thread Xiao Guangrong
Use sp->role.level instead of @level since @level is not obtained from the
page table hierarchy.

There is no issue in the current code since the fast page fault currently only
fixes the fault caused by dirty-log, which is always on the last level
(level = 1).

This patch makes the code more readable and avoids a potential issue in
further development.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40772ef..d2aacc2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2798,9 +2798,9 @@ static bool page_fault_can_be_fast(u32 error_code)
 }
 
 static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+   u64 *sptep, u64 spte)
 {
-   struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
 
WARN_ON(!sp->role.direct);
@@ -2826,6 +2826,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
u32 error_code)
 {
struct kvm_shadow_walk_iterator iterator;
+   struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;
 
@@ -2846,7 +2847,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
}
 
-   if (!is_last_spte(spte, level))
+   sp = page_header(__pa(iterator.sptep));
+   if (!is_last_spte(spte, sp->role.level))
goto exit;
 
/*
@@ -2872,7 +2874,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
 */
-   ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+   ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
 exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
  spte, ret);
-- 
1.8.1.4



[PATCH v3 03/15] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-10-23 Thread Xiao Guangrong
Relax the tlb flush condition since we will write-protect the spte out of mmu
lock. Note that lockless write-protection only marks the writable spte as readonly,
and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which is tested by spte_is_locklessly_modifiable).

This patch is used to avoid this kind of race:

      VCPU 0                          VCPU 1
lockless write protection:
  set spte.w = 0
                                 lock mmu-lock

                                 write-protect the spte to sync the shadow page,
                                 see spte.w = 0, then skip the tlb flush

                                 unlock mmu-lock

                                 !!! At this point, the shadow page can still be
                                 writable due to the corrupt tlb entry
  Flush all TLB

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8739208..62f18ec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 * we always atomicly update it, see the comments in
 * spte_has_volatile_bits().
 */
-   if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+   if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;
 
if (!shadow_accessed_mask)
-- 
1.8.1.4



[PATCH v3 00/15] KVM: MMU: locklessly write-protect

2013-10-23 Thread Xiao Guangrong
ar a bit
but the latter is hard since we need to remove the spte from the rmap, so we
unify the two behaviors and only make the spte readonly. Making a large spte
readonly instead of nonpresent is also good for reducing jitter.

We also need to pay attention to the ordering of making the spte writable,
adding the spte into the rmap and setting the corresponding bit in the dirty
bitmap: since kvm_vm_ioctl_get_dirty_log() write-protects sptes based on the
dirty bitmap, we must ensure the writable spte can be found in the rmap before
the dirty bitmap becomes visible. Otherwise we clear the dirty bitmap but fail
to write-protect the page.
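
As an illustration of that ordering, a hedged sketch of the fault-path
sequence (the helper names are those used elsewhere in this series plus the
generic KVM mark_page_dirty(); the explicit smp_wmb() placement here is only
for the example, the real patches establish the ordering inside the existing
rmap/mmu-lock code):

	/* fault path sketch: make the writable spte reachable via the
	 * rmap before GET_DIRTY_LOG can observe the dirty bit. */
	mmu_spte_update(sptep, new_spte);	/* spte becomes writable  */
	pte_list_add(vcpu, sptep, rmapp);	/* spte visible in rmap   */
	smp_wmb();				/* rmap before dirty bit  */
	mark_page_dirty(vcpu->kvm, gfn);	/* dirty bit visible      */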

Performance result
====
The performance result and the benchmark can be found at:
  http://permalink.gmane.org/gmane.linux.kernel/1534876

Xiao Guangrong (15):
  KVM: MMU: properly check last spte in fast_page_fault()
  KVM: MMU: lazily drop large spte
  KVM: MMU: flush tlb if the spte can be locklessly modified
  KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  KVM: MMU: update spte and add it into rmap before dirty log
  KVM: MMU: redesign the algorithm of pte_list
  KVM: MMU: introduce nulls desc
  KVM: MMU: introduce pte-list lockless walker
  KVM: MMU: initialize the pointers in pte_list_desc properly
  KVM: MMU: allocate shadow pages from slab
  KVM: MMU: locklessly access shadow page under rcu protection
  KVM: MMU: check last spte with unawareness of mapping level
  KVM: MMU: locklessly write-protect the page
  KVM: MMU: clean up spte_write_protect
  KVM: MMU: use rcu functions to access the pointer

 arch/x86/include/asm/kvm_host.h |   7 +-
 arch/x86/kvm/mmu.c  | 586 ++--
 arch/x86/kvm/mmu.h  |   6 +
 arch/x86/kvm/mmu_audit.c|   6 +-
 arch/x86/kvm/paging_tmpl.h  |   6 +-
 arch/x86/kvm/x86.c  |  34 ++-
 6 files changed, 475 insertions(+), 170 deletions(-)

-- 
1.8.1.4



[PATCH v3 04/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-10-23 Thread Xiao Guangrong
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
  only change the spte from writable to readonly, so the only case we need to
  care about is changing a spte from present to present (changing a spte from
  present to nonpresent flushes all the TLBs immediately); in other words,
  the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we have checked
  SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
  means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 18 ++
 arch/x86/kvm/x86.c |  9 +++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 62f18ec..337d173 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4273,15 +4273,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
 
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-   kvm_flush_remote_tlbs(kvm);
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
-   }
}
}
 
-   kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+   /*
+* We can flush all the TLBs out of the mmu lock without TLB
+* corruption since we just change the spte from writable to
+* readonly so that we only need to care the case of changing
+* spte from present to present (changing the spte from present
+* to nonpresent will flush all the TLBs immediately), in other
+* words, the only case we care is mmu_spte_update() where we
+* have checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+* instead of PT_WRITABLE_MASK, that means it does not depend
+* on PT_WRITABLE_MASK anymore.
+*/
+   kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES 10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b3aa650..573c6b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3571,11 +3571,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
-   if (is_dirty)
-   kvm_flush_remote_tlbs(kvm);
 
spin_unlock(&kvm->mmu_lock);
 
+   /*
+* All the TLBs can be flushed out of mmu lock, see the comments in
+* kvm_mmu_slot_remove_write_access().
+*/
+   if (is_dirty)
+   kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
-- 
1.8.1.4



[PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-10-23 Thread Xiao Guangrong
It is like a nulls list: we use the pte-list head as the nulls value, which
helps us detect whether the "desc" has been moved to another rmap, in which
case we re-walk the rmap.

kvm->slots_lock is held while we do the lockless walking, which prevents the
rmap from being reused (freeing an rmap needs to hold that lock), so we can
not see the same nulls value used on different rmaps
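
For readers unfamiliar with nulls lists, a sketch of the layout assumed here
(the struct is the existing pte_list_desc from mmu.c; PTE_LIST_EXT is 3 in
the current code):

	struct pte_list_desc {
		u64 *sptes[PTE_LIST_EXT];
		/*
		 * Either a pointer to the next desc, or the "nulls" value:
		 * the address of the owning pte_list/rmap head with bit 0
		 * set.  A lockless walker that reaches the nulls can verify
		 * it ended up on the same rmap it started from.
		 */
		struct pte_list_desc *more;
	};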

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cce039..4687329 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
+{
+   unsigned long marker;
+
+   marker = (unsigned long)pte_list | 1UL;
+   desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+   return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+   return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
 
first_free = __find_first_free(desc);
 
-   for (desc_num = 0; desc->more; desc = desc->more)
+   for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;
 
return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+   desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
 * Only one entry existing but still use a desc to store it?
 */
-   WARN_ON(!next_desc);
+   WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Only one entry in this desc, move the entry to the head
 * then the desc can be freed.
 */
-   if (!first_desc->sptes[1] && !first_desc->more) {
+   if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
 
rmap_printk("pte_list_remove:  %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
return fn((u64 *)*pte_list);
 
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+   WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
rmap_iterator *iter)
 
iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
+   WARN_ON(desc_is_a_nulls(iter->desc));
return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return sptep;
}
 
-   iter->desc = iter->desc->more;
+   iter->desc = desc_is_a_nulls(iter->desc->more) ?
+   NULL : iter->desc->more;
 
if (iter->desc) {
iter->pos = 0;
-- 
1.8.1.4



[PATCH v3 08/15] KVM: MMU: introduce pte-list lockless walker

2013-10-23 Thread Xiao Guangrong
The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has been moved to a different pte-list.

Note, we should do a bottom-up walk in the desc since we always move
the bottom entry to the deleted position

Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
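
A hedged usage sketch of the new walker (the callback shown is the lockless
write-protect helper introduced later in this series; any pte_list_walk_fn
would do):

	/* Walk one rmap without taking mmu_lock. */
	rcu_read_lock();
	pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
	rcu_read_unlock();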

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 57 ++
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4687329..a864140 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+   (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
-   *pte_list = (unsigned long)spte;
+   rcu_assign_pte_list(pte_list, spte);
return 0;
}
 
@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
}
 
@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
new_desc = mmu_alloc_pte_list_desc(vcpu);
new_desc->more = desc;
desc = new_desc;
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+  pte_list_walk_fn fn)
+{
+   struct pte_list_desc *desc;
+   unsigned long pte_list_value;
+   int i;
+
+restart:
+   /*
+* Force the pte_list to be reloaded.
+*
+* See the comments in hlist_nulls_for_each_entry_rcu().
+*/
+   barrier();
+   pte_list_value = *rcu_dereference(pte_list);
+   if (!pte_list_value)
+   return;
+
+   if (!(pte_list_value & 1))
+   return fn((u64 *)pte_list_value);
+
+   desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+   while (!desc_is_a_nulls(desc)) {
+   /*
+* We should do top-down walk since we always use the higher
+* indices to replace the deleted entry if only one desc is
+* used in the rmap when a spte is removed. Otherwise the
+* moved entry will be missed.
+*/
+   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+   if (desc->sptes[i])
+   fn(desc->sptes[i]);
+
+   desc = rcu_dereference(desc->more);
+
+   /* It is being initialized. */
+   if (unlikely(!desc))
+   goto restart;
+   }
+
+   if (unlikely(desc_get_nulls_value(desc) != pte_list))
+   goto restart;
+}
+
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
 {
@@ -4615,7 +4664,7 @@ int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
-   0, 0, NULL);
+   0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v3 11/15] KVM: MMU: locklessly access shadow page under rcu protection

2013-10-23 Thread Xiao Guangrong
Use SLAB_DESTROY_BY_RCU to prevent the shadow page from being freed from the
slab, so that it can be locklessly accessed while holding the rcu read lock

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1bcc8c8..5b42858 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4710,8 +4710,8 @@ int kvm_mmu_module_init(void)
goto nomem;
 
mmu_shadow_page_cache = kmem_cache_create("mmu_shadow_page_cache",
- PAGE_SIZE, PAGE_SIZE, 0,
- init_shadow_page_table);
+  PAGE_SIZE, PAGE_SIZE, SLAB_DESTROY_BY_RCU,
+  init_shadow_page_table);
if (!mmu_shadow_page_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v3 12/15] KVM: MMU: check last spte with unawareness of mapping level

2013-10-23 Thread Xiao Guangrong
The sptes on the middle levels should obey these rules:
- they are always writable
- they do not point to a process page, so SPTE_HOST_WRITEABLE has
  no chance to be set

So we can check whether a spte is a last spte using PT_WRITABLE_MASK and
SPTE_HOST_WRITEABLE, both of which can be read from the spte itself, so that
is_last_spte() no longer depends on the mapping level

This is important for implementing lockless write-protection since only the
spte is available at that time

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 25 -
 arch/x86/kvm/mmu_audit.c   |  6 +++---
 arch/x86/kvm/paging_tmpl.h |  6 ++
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5b42858..8b96d96 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -337,13 +337,13 @@ static int is_rmap_spte(u64 pte)
return is_shadow_present_pte(pte);
 }
 
-static int is_last_spte(u64 pte, int level)
+static int is_last_spte(u64 pte)
 {
-   if (level == PT_PAGE_TABLE_LEVEL)
-   return 1;
-   if (is_large_pte(pte))
-   return 1;
-   return 0;
+   /*
+* All the sptes on the middle level are writable but
+* SPTE_HOST_WRITEABLE is not set.
+*/
+   return !(is_writable_pte(pte) && !(pte & SPTE_HOST_WRITEABLE));
 }
 
 static pfn_t spte_to_pfn(u64 pte)
@@ -2203,7 +2203,7 @@ static bool shadow_walk_okay(struct 
kvm_shadow_walk_iterator *iterator)
 static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
   u64 spte)
 {
-   if (is_last_spte(spte, iterator->level)) {
+   if (is_last_spte(spte)) {
iterator->level = 0;
return;
}
@@ -2255,15 +2255,14 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
}
 }
 
-static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
-u64 *spte)
+static bool mmu_page_zap_pte(struct kvm *kvm, u64 *spte)
 {
u64 pte;
struct kvm_mmu_page *child;
 
pte = *spte;
if (is_shadow_present_pte(pte)) {
-   if (is_last_spte(pte, sp->role.level)) {
+   if (is_last_spte(pte)) {
drop_spte(kvm, spte);
if (is_large_pte(pte))
--kvm->stat.lpages;
@@ -2286,7 +2285,7 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
unsigned i;
 
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-   mmu_page_zap_pte(kvm, sp, sp->spt + i);
+   mmu_page_zap_pte(kvm, sp->spt + i);
 }
 
 static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
@@ -3068,7 +3067,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
}
 
sp = page_header(__pa(iterator.sptep));
-   if (!is_last_spte(spte, sp->role.level))
+   if (!is_last_spte(spte))
goto exit;
 
/*
@@ -4316,7 +4315,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
local_flush = true;
while (npte--) {
entry = *spte;
-   mmu_page_zap_pte(vcpu->kvm, sp, spte);
+   mmu_page_zap_pte(vcpu->kvm, spte);
if (gentry &&
  !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
  & mask.word) && rmap_can_add(vcpu))
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index daff69e..d54e2ad 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -45,7 +45,7 @@ static void __mmu_spte_walk(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
fn(vcpu, ent + i, level);
 
if (is_shadow_present_pte(ent[i]) &&
- !is_last_spte(ent[i], level)) {
+ !is_last_spte(ent[i])) {
struct kvm_mmu_page *child;
 
child = page_header(ent[i] & PT64_BASE_ADDR_MASK);
@@ -110,7 +110,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 
*sptep, int level)
}
}
 
-   if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep, level))
+   if (!is_shadow_present_pte(*sptep) || !is_last_spte(*sptep))
return;
 
gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
@@ -158,7 +158,7 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 
*sptep)
 
 static void audit_sptes_have_rmaps(struct kvm_vcpu *vcpu, u64 *sptep, int 
level)
 {
-   if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep, level))
+   if (is_shadow_present_pte(*sptep) && is_last_spte(*sptep))
inspect_spte_has_rmap(vcpu->kvm, sptep);
 }
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index

[PATCH v3 10/15] KVM: MMU: allocate shadow pages from slab

2013-10-23 Thread Xiao Guangrong
Allocate shadow pages from a slab instead of the page allocator, so that
frequent shadow page allocations and frees can be served from the slab cache,
which is very useful for the shadow mmu
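
A sketch of the bookkeeping this relies on (as set up by set_page_header()
and page_header() in the diff below): since sp->spt no longer comes from the
page allocator, the struct page's unused ->mapping field is borrowed to find
the kvm_mmu_page from a spte pointer.  The helper name below is illustrative
only:

	/* sketch: recover the shadow page that contains a given spte */
	static struct kvm_mmu_page *sp_of_spte(u64 *sptep)
	{
		struct page *page = virt_to_page(sptep);

		/* set_page_header() stored the sp here at allocation time */
		return (struct kvm_mmu_page *)page->mapping;
	}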

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c  | 46 ++---
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5cbf316..df9ae10 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu *walk_mmu;
 
struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
+   struct kvm_mmu_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
@@ -946,7 +947,7 @@ static inline struct kvm_mmu_page *page_header(hpa_t 
shadow_page)
 {
struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
 
-   return (struct kvm_mmu_page *)page_private(page);
+   return (struct kvm_mmu_page *)(page->mapping);
 }
 
 static inline u16 kvm_read_ldt(void)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3ae74e6..1bcc8c8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -178,6 +178,7 @@ struct kvm_shadow_walk_iterator {
 __shadow_walk_next(&(_walker), spte))
 
 static struct kmem_cache *pte_list_desc_cache;
+static struct kmem_cache *mmu_shadow_page_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
@@ -746,7 +747,14 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
   GFP_KERNEL);
if (r)
goto out;
-   r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
+
+   r = mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+  mmu_shadow_page_cache, 4,
+  GFP_KERNEL);
+   if (r)
+   goto out;
+
+   r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 4);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
@@ -760,6 +768,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
pte_list_desc_cache);
+   mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+   mmu_shadow_page_cache);
mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
mmu_page_header_cache);
@@ -1675,12 +1685,28 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
+static void set_page_header(struct kvm_mmu_page *sp)
+{
+   struct page *page = virt_to_page(sp->spt);
+
+   WARN_ON(page->mapping);
+   page->mapping = (struct address_space *)sp;
+}
+
+static void clear_page_header(struct kvm_mmu_page *sp)
+{
+   struct page *page = virt_to_page(sp->spt);
+
+   page->mapping = NULL;
+}
+
 static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
ASSERT(is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
list_del(&sp->link);
-   free_page((unsigned long)sp->spt);
+   clear_page_header(sp);
+   kmem_cache_free(mmu_shadow_page_cache, sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
kmem_cache_free(mmu_page_header_cache, sp);
@@ -1719,10 +1745,10 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
struct kvm_mmu_page *sp;
 
sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-   sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
+   sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
if (!direct)
sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
-   set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+   set_page_header(sp);
 
/*
 * The active_mmu_pages list is the FIFO list, do not move the
@@ -2046,12 +2072,13 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
}
 }
 
-static void init_shadow_page_table(struct kvm_mmu_page *sp)
+static void init_shadow_page_table(void *p)
 {
+   u64 *sptp = (u64 *)p;
int i;
 
for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
-   sp->spt[i] = 0ull;
+   sptp[i] = 0ull;
 }
 
 static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
@@ -2137,7 +2164,6 @@ sta

[PATCH v3 14/15] KVM: MMU: clean up spte_write_protect

2013-10-23 Thread Xiao Guangrong
Now the only user of spte_write_protect is rmap_write_protect, which always
calls it with pt_protect = true, so drop that parameter together with the
unused @kvm parameter

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d82bbec..3e4b941 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1340,8 +1340,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 }
 
 /*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1352,25 +1351,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
 {
u64 spte = *sptep;
 
if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (pt_protect)
-   spte &= ~SPTE_MMU_WRITEABLE;
-   spte = spte & ~PT_WRITABLE_MASK;
+   spte &= ~SPTE_MMU_WRITEABLE;
+   spte &= ~PT_WRITABLE_MASK;
 
return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
-bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1379,7 +1376,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
 
-   flush |= spte_write_protect(kvm, sptep, pt_protect);
+   flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}
 
@@ -1454,7 +1451,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(kvm, rmapp, true);
+   write_protected |= __rmap_write_protect(rmapp);
}
 
return write_protected;
-- 
1.8.1.4



[PATCH v3 09/15] KVM: MMU: initialize the pointers in pte_list_desc properly

2013-10-23 Thread Xiao Guangrong
Since pte_list_desc will be accessed locklessly, we need to atomically
initialize its pointers so that the lockless walker cannot observe a
partially written pointer

In this patch we initialize the pointers with plain pointer assignment,
which is always atomic, instead of using kmem_cache_zalloc
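
A minimal before/after sketch of the difference (fragment, illustrative only;
the real constructor is pte_list_desc_ctor() in the diff below):

	/* before: zeroing via memset() inside kmem_cache_zalloc() gives no
	 * per-pointer atomicity guarantee for a concurrent lockless reader */
	desc = kmem_cache_zalloc(pte_list_desc_cache, GFP_KERNEL);

	/* after: plain stores of aligned pointers are atomic, so a lockless
	 * walker sees either NULL or the final value, never a torn pointer */
	desc = kmem_cache_alloc(pte_list_desc_cache, GFP_KERNEL);
	for (i = 0; i < PTE_LIST_EXT; i++)
		desc->sptes[i] = NULL;
	desc->more = NULL;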

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a864140..f3ae74e6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu 
*vcpu)
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+ struct kmem_cache *base_cache, int min,
+ gfp_t flags)
 {
void *obj;
 
if (cache->nobjs >= min)
return 0;
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-   obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+   obj = kmem_cache_alloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
int r;
 
r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+  GFP_KERNEL);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
-  mmu_page_header_cache, 4);
+  mmu_page_header_cache, 4,
+  GFP_KERNEL | __GFP_ZERO);
 out:
return r;
 }
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void pte_list_desc_ctor(void *p)
+{
+   struct pte_list_desc *desc = p;
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   desc->sptes[i] = NULL;
+
+   desc->more = NULL;
+}
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
+   first_desc->sptes[0] = NULL;
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -4663,8 +4678,8 @@ static void mmu_destroy_caches(void)
 int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
-   sizeof(struct pte_list_desc),
-   0, SLAB_DESTROY_BY_RCU, NULL);
+   sizeof(struct pte_list_desc),
+   0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v3 13/15] KVM: MMU: locklessly write-protect the page

2013-10-23 Thread Xiao Guangrong
Currently, when we mark a memslot dirty-logged or get the dirty log, we need
to write-protect a large amount of guest memory. This is heavy work and,
worse, we need to hold the mmu-lock, which is also required by the vcpu to
fix its page table faults and by the mmu-notifier when a host page is being
changed. For a guest with heavy cpu/memory usage, this becomes a scalability
issue

This patch introduces a way to locklessly write-protect guest memory

Now that lockless rmap walking, lockless shadow page table access and
lockless spte write-protection are ready, it is time to implement page
write-protection out of the mmu-lock
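
The overall flow, as a hedged sketch (this mirrors
kvm_mmu_slot_remove_write_access_lockless() in the diff below; the rmap
lookup per gfn/level and other details are omitted):

	rcu_read_lock();
	for (i = PT_PAGE_TABLE_LEVEL;
	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
		/* ... for every rmap of the slot on this level ... */
		rmap_write_protect_lockless(rmapp);
	}
	rcu_read_unlock();

	/* TLBs are flushed after dropping the rcu read lock, see patch 04 */
	kvm_flush_remote_tlbs(kvm);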

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  4 ---
 arch/x86/kvm/mmu.c  | 59 ++---
 arch/x86/kvm/mmu.h  |  6 +
 arch/x86/kvm/x86.c  | 11 
 4 files changed, 55 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index df9ae10..cdb6f29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -793,10 +793,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8b96d96..d82bbec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1386,8 +1386,37 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
return flush;
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+   u64 spte;
+
+retry:
+   /*
+* Note we may partly read the sptep on 32bit host, however, we
+* allow this case because:
+* - we do not access the page got from the sptep.
+* - cmpxchg64 can detect that case and avoid setting a wrong value
+*   to the sptep.
+*/
+   spte = *rcu_dereference(sptep);
+   if (unlikely(!is_last_spte(spte) || !is_writable_pte(spte)))
+   return;
+
+   if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+   return;
+
+   goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+   pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1396,16 +1425,17 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset, unsigned long mask)
 {
unsigned long *rmapp;
 
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_write_protect(kvm, rmapp, false);
+   rmap_write_protect_lockless(rmapp);
 
/* clear the first set bit */
mask &= mask - 1;
@@ -4477,7 +4507,7 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
 {
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4486,8 +4516,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
 
-   spin_lock(&kvm->mmu_lock);
-
+   rcu_read_lock();
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4497,15 +4526,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
 

[PATCH v3 15/15] KVM: MMU: use rcu functions to access the pointer

2013-10-23 Thread Xiao Guangrong
Use rcu_assign_pointer() to update all the pointers in the desc
and use rcu_dereference() to locklessly read them
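
A hedged sketch of the writer/reader pairing this gives us (both sides appear
in the diff below; fn stands for the walk callback):

	/* writer side, under mmu_lock: publish the spte pointer */
	rcu_assign_spte(desc->sptes[free_pos], spte);

	/* lockless reader side, under rcu_read_lock() */
	sptep = rcu_dereference(desc->sptes[i]);
	if (sptep)
		fn(sptep);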

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 46 --
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e4b941..68dac26 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -937,12 +937,23 @@ static void pte_list_desc_ctor(void *p)
desc->more = NULL;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+ (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)&morep, \
+ (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value)  \
+   rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
 
marker = (unsigned long)pte_list | 1UL;
-   desc->more = (struct pte_list_desc *)marker;
+   rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
 }
 
 static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -999,10 +1010,6 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
-#define rcu_assign_pte_list(pte_list_p, value) \
-   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
-   (unsigned long *)(value))
-
 /*
  * Pte mapping structures:
  *
@@ -1029,8 +1036,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
-   desc->sptes[0] = (u64 *)*pte_list;
-   desc->sptes[1] = spte;
+   rcu_assign_spte(desc->sptes[0], *pte_list);
+   rcu_assign_spte(desc->sptes[1], spte);
desc_mark_nulls(pte_list, desc);
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
@@ -1043,13 +1050,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 
*spte,
if (desc->sptes[PTE_LIST_EXT - 1]) {
struct pte_list_desc *new_desc;
new_desc = mmu_alloc_pte_list_desc(vcpu);
-   new_desc->more = desc;
+   rcu_assign_desc_more(new_desc->more, desc);
desc = new_desc;
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
-   desc->sptes[free_pos] = spte;
+   rcu_assign_spte(desc->sptes[free_pos], spte);
return count_spte_number(desc) - 1;
 }
 
@@ -1067,8 +1074,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Move the entry from the first desc to this position we want
 * to remove.
 */
-   desc->sptes[i] = first_desc->sptes[last_used];
-   first_desc->sptes[last_used] = NULL;
+   rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+   rcu_assign_spte(first_desc->sptes[last_used], NULL);
 
/* No valid entry in this desc, we can free this desc now. */
if (!first_desc->sptes[0]) {
@@ -1080,7 +1087,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
-   *pte_list = (unsigned long)next_desc | 1ul;
+   rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
return;
}
 
@@ -1089,8 +1096,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * then the desc can be freed.
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
-   *pte_list = (unsigned long)first_desc->sptes[0];
-   first_desc->sptes[0] = NULL;
+   rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+   rcu_assign_spte(first_desc->sptes[0], NULL);
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -1112,7 +1119,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
pr_err("pte_list_remove:  %p 1->BUG\n", spte);
BUG();
}
-   *pte_list = 0;
+   rcu_assign_pte_list(pte_list, 0);
return;
}
 
@@ -1184,9 +1191,12 @@ restart:
 * used in the rmap when a spte is removed. Otherwise the
 * moved entry will be miss

[PATCH v3 06/15] KVM: MMU: redesign the algorithm of pte_list

2013-10-23 Thread Xiao Guangrong
Change the algorithm to:
1) always add the new desc as the first desc (pointed to by parent_ptes/rmap),
   which is good for implementing rcu-nulls-list-like lockless rmap
   walking

2) always move an entry from the first desc to the position we want to
   remove when deleting a spte from the parent_ptes/rmap.
   This is good for implementing the lockless rmap walk since in the current
   code, when a spte is deleted from a "desc", another spte in the last
   "desc" is moved to this position to replace the deleted one. If the
   deleted one has already been visited and we never visit the replaced
   one, the replaced one is missed during the lockless walk.
   To fix this case, we do not move the spte backward; instead, we move
   the entry forward: when a spte is deleted, we move the entry in the
   first desc to that position (see the sketch after this list)

Both of these also reduce cache misses
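
A hedged sketch of the deletion rule in 2) (fragment; the real code is
pte_list_desc_remove_entry() below, and find_last_used() returns the highest
used index in the first desc):

	/* delete desc->sptes[i]: fill the hole from the FIRST desc, which a
	 * head-to-tail lockless walker has either already visited or will
	 * simply see twice - it is never missed. */
	first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
	last_used = find_last_used(first_desc);

	desc->sptes[i] = first_desc->sptes[last_used];
	first_desc->sptes[last_used] = NULL;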

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 179 -
 1 file changed, 123 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e85eed6..5cce039 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static int __find_first_free(struct pte_list_desc *desc)
+{
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   if (!desc->sptes[i])
+   break;
+   return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+   int free = __find_first_free(desc);
+
+   WARN_ON(free >= PTE_LIST_EXT);
+   return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+   int used = __find_first_free(desc) - 1;
+
+   WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+   return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+   int first_free, desc_num;
+
+   first_free = __find_first_free(desc);
+
+   for (desc_num = 0; desc->more; desc = desc->more)
+   desc_num++;
+
+   return first_free + desc_num * PTE_LIST_EXT;
+}
+
 /*
  * Pte mapping structures:
  *
@@ -923,98 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
  *
  * Returns the number of pte entries before the spte was added or zero if
  * the spte was not added.
- *
  */
 static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
 {
struct pte_list_desc *desc;
-   int i, count = 0;
+   int free_pos;
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
-   } else if (!(*pte_list & 1)) {
+   return 0;
+   }
+
+   if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
-   ++count;
-   } else {
-   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
-   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
-   desc = desc->more;
-   count += PTE_LIST_EXT;
-   }
-   if (desc->sptes[PTE_LIST_EXT-1]) {
-   desc->more = mmu_alloc_pte_list_desc(vcpu);
-   desc = desc->more;
-   }
-   for (i = 0; desc->sptes[i]; ++i)
-   ++count;
-   desc->sptes[i] = spte;
+   return 1;
}
-   return count;
+
+   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+   /* No empty entry in the desc. */
+   if (desc->sptes[PTE_LIST_EXT - 1]) {
+   struct pte_list_desc *new_desc;
+   new_desc = mmu_alloc_pte_list_desc(vcpu);
+   new_desc->more = desc;
+   desc = new_desc;
+   *pte_list = (unsigned long)desc | 1;
+   }
+
+   free_pos = find_first_free(desc);
+   desc->sptes[free_pos] = spte;
+   return count_spte_number(desc) - 1;
 }
 
 static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
-  int i, struct pte_list_desc *pr

Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-15 Thread Xiao Guangrong

On Oct 16, 2013, at 6:21 AM, Marcelo Tosatti  wrote:

> On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
>>> 
>>> Why is it safe to allow access, by the lockless page write protect
>>> side, to spt pointer for shadow page A that can change to a shadow page 
>>> pointer of shadow page B?
>>> 
>>> Write protect spte of any page at will? Or verify that in fact thats the
>>> shadow you want to write protect?
>>> 
>>> Note that spte value might be the same for different shadow pages, 
>>> so cmpxchg succeeding does not guarantees its the same shadow page that
>>> has been protected.
>>> 
>> Two things can happen: spte that we accidentally write protect is some
>> other last level spte - this is benign, it will be unprotected on next
>> fault.  
> 
> Nothing forbids two identical writable sptes to point to a same pfn. How
> do you know you are write protecting the correct one? (the proper gfn).
> 
> Lockless walk sounds interesting. By the time you get to the lower
> level, that might be a different spte.

That's safe. Since get-dirty-log is serialized by the slots lock, the dirty
bit cannot be lost - even if we write-protect a spte on a different memslot
(the dirty bit is still set). The worst case is that we write-protect an
unnecessary spte and cause an extra #PF, but that is a really rare case.

And the lockless rmap-walker can detect the new spte so that
write-protection on the memslot is not missed.



Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-09 Thread Xiao Guangrong
On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti  wrote:
>>
>>>>
>>>> +  if (kvm->arch.rcu_free_shadow_page) {
>>>> +  kvm_mmu_isolate_pages(invalid_list);
>>>> +  sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>>>> +  list_del_init(invalid_list);
>>>> +  call_rcu(&sp->rcu, free_pages_rcu);
>>>> +  return;
>>>> +  }
>>>
>>> This is unbounded (there was a similar problem with early fast page fault
>>> implementations):
>>>
>>> From RCU/checklist.txt:
>>>
>>> "An especially important property of the synchronize_rcu()
>>>primitive is that it automatically self-limits: if grace periods
>>>are delayed for whatever reason, then the synchronize_rcu()
>>>primitive will correspondingly delay updates.  In contrast,
>>>code using call_rcu() should explicitly limit update rate in
>>>cases where grace periods are delayed, as failing to do so can
>>>result in excessive realtime latencies or even OOM conditions.
>>> "
>>
>> I understand what you are worrying about… Hmm, can it be avoided by
>> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
>> there are slight chance that the page need to be freed by call_rcu.
> 
> The point that must be addressed is that you cannot allow an unlimited
> number of sp's to be freed via call_rcu between two grace periods.
> 
> So something like:
> 
> - For every 17MB worth of shadow pages.
> - Guarantee a grace period has passed.

Hmm, the 'qhimark' in rcu is 1, that means rcu allows call_rcu
to pend 1 times in a grace-period without slowdown. Can we really
reach this number while rcu_free_shadow_page is set? Anyway, if it can,
we can use rcu tech-break to avoid it, can't we?

> 
> If you control kvm->arch.rcu_free_shadow_page, you could periodically
> verify how many MBs worth of shadow pages are in the queue for RCU
> freeing and force grace period after a certain number.

I have no idea how to force a grace period for a cpu which is running
on the rcu read side. IIUC, only dyntick-idle and offline CPUs can be forced,
see rcu_gp_fqs().

> 
>>> Moreover, freeing pages differently depending on some state should 
>>> be avoided.
>>>
>>> Alternatives:
>>>
>>> - Disable interrupts at write protect sites.
>>
>> The write-protection can be triggered by KVM ioctl that is not in the VCPU
>> context, if we do this, we also need to send IPI to the KVM thread when do
>> TLB flush.
> 
> Yes. However for the case being measured, simultaneous page freeing by vcpus 
> should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).

I agree, but write-protection will cost lots of time, so we would need to
either:
1) do the write-protection with irqs disabled, which is not good for devices,
Or
2) do the work in pieces, then enable irqs and disable them again to continue
   the work (see the sketch below). Enabling and disabling irqs many times is
   not cheap on x86.

no?
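
For concreteness, a rough sketch of option 2) (entirely hypothetical code, not
from any patch; rmap_chunk_write_protect(), more_work and cursor are made-up
names standing in for a bounded slice of the write-protection work):

	bool more_work = true;

	while (more_work) {
		local_irq_disable();
		/* do a bounded slice of the write-protection work */
		more_work = rmap_chunk_write_protect(kvm, slot, &cursor);
		local_irq_enable();
		cond_resched();
	}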

> 
>> And we can not do much work while interrupt is disabled due to
>> interrupt latency.
>>
>>> - Rate limit the number of pages freed via call_rcu
>>> per grace period.
>>
>> Seems complex. :(
>>
>>> - Some better alternative.
>>
>> Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
>> and encodes the page-level into the spte (since we need to check if the spte
>> is the last-spte. ).  How about this?
> 
> Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> regards to limitation? (maybe it is).

In my experience, freeing shadow pages and allocating shadow pages are
balanced; we can check it with (make -j12 on a guest with 4 vcpus and):

# echo > trace
[root@eric-desktop tracing]# cat trace > ~/log | sleep 3
[root@eric-desktop tracing]# cat ~/log | grep new | wc -l
10816
[root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
10656
[root@eric-desktop tracing]# cat set_event
kvmmmu:kvm_mmu_get_page
kvmmmu:kvm_mmu_prepare_zap_page

alloc VS. free = 10816 : 10656

So almost all allocating and freeing is done in the slab's cache, and the
slab frees shadow pages very slowly, so there is no rcu issue.

> 
>> I planned to do it after this patchset merged, if you like it and if you 
>> think
>> that "using kvm->arch.rcu_free_shadow_page in a small window" can not avoid
>> the issue, i

Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-07 Thread Xiao Guangrong

Hi Marcelo,

On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti  wrote:

>> 
>> +if (kvm->arch.rcu_free_shadow_page) {
>> +kvm_mmu_isolate_pages(invalid_list);
>> +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> +list_del_init(invalid_list);
>> +call_rcu(&sp->rcu, free_pages_rcu);
>> +return;
>> +}
> 
> This is unbounded (there was a similar problem with early fast page fault
> implementations):
> 
> From RCU/checklist.txt:
> 
> "An especially important property of the synchronize_rcu()
>primitive is that it automatically self-limits: if grace periods
>are delayed for whatever reason, then the synchronize_rcu()
>primitive will correspondingly delay updates.  In contrast,
>code using call_rcu() should explicitly limit update rate in
>cases where grace periods are delayed, as failing to do so can
>result in excessive realtime latencies or even OOM conditions.
> "

I understand what you are worrying about… Hmm, can it be avoided by
just using kvm->arch.rcu_free_shadow_page in a small window? - Then
there is only a slight chance that the page needs to be freed by call_rcu.

> 
> Moreover, freeing pages differently depending on some state should 
> be avoided.
> 
> Alternatives:
> 
> - Disable interrupts at write protect sites.

The write-protection can be triggered by a KVM ioctl that is not in the VCPU
context; if we do this, we also need to send an IPI to the KVM thread when
doing the TLB flush. And we cannot do much work while interrupts are disabled
due to interrupt latency.

> - Rate limit the number of pages freed via call_rcu
> per grace period.

Seems complex. :(

> - Some better alternative.

Gleb has an idea that uses SLAB_DESTROY_BY_RCU to protect the shadow page
table and encodes the page level into the spte (since we need to check whether
the spte is the last spte). How about this?

I planned to do it after this patchset is merged; if you like it and if you
think that "using kvm->arch.rcu_free_shadow_page in a small window" cannot
avoid the issue, I am happy to do it in the next version. :)

Thanks, Marcelo!




Re: [PATCH v2 05/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-10-02 Thread Xiao Guangrong

On Oct 1, 2013, at 7:05 AM, Marcelo Tosatti  wrote:

> On Thu, Sep 05, 2013 at 06:29:08PM +0800, Xiao Guangrong wrote:
>> Now we can flush all the TLBs out of the mmu lock without TLB corruption when
>> write-proect the sptes, it is because:
>> - we have marked large sptes readonly instead of dropping them that means we
>>  just change the spte from writable to readonly so that we only need to care
>>  the case of changing spte from present to present (changing the spte from
>>  present to nonpresent will flush all the TLBs immediately), in other words,
>>  the only case we need to care is mmu_spte_update()
>> 
>> - in mmu_spte_update(), we have checked
>>  SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
>>  means it does not depend on PT_WRITABLE_MASK anymore
>> 
>> Signed-off-by: Xiao Guangrong 
>> ---
>> arch/x86/kvm/mmu.c | 18 ++
>> arch/x86/kvm/x86.c |  9 +++--
>> 2 files changed, 21 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 7488229..a983570 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm 
>> *kvm, int slot)
>>  if (*rmapp)
>>  __rmap_write_protect(kvm, rmapp, false);
>> 
>> -if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>> -kvm_flush_remote_tlbs(kvm);
>> +if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>>  cond_resched_lock(&kvm->mmu_lock);
>> -}
>>  }
>>  }
>> 
>> -kvm_flush_remote_tlbs(kvm);
>>  spin_unlock(&kvm->mmu_lock);
>> +
>> +/*
>> + * We can flush all the TLBs out of the mmu lock without TLB
>> + * corruption since we just change the spte from writable to
>> + * readonly so that we only need to care the case of changing
>> + * spte from present to present (changing the spte from present
>> + * to nonpresent will flush all the TLBs immediately), in other
>> + * words, the only case we care is mmu_spte_update() where we
>> + * haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
>> + * instead of PT_WRITABLE_MASK, that means it does not depend
>> + * on PT_WRITABLE_MASK anymore.
>> + */
>> +kvm_flush_remote_tlbs(kvm);
>> }
> 
> What about need_remote_flush? 

It is safe because, before need_remote_flush is called, mmu_pte_write_new_pte
is called to update the spte, which finally calls set_spte(), where the tlb
flush has been properly checked.




Re: [PATCH v2 03/15] KVM: MMU: lazily drop large spte

2013-10-02 Thread Xiao Guangrong

On Oct 1, 2013, at 6:39 AM, Marcelo Tosatti  wrote:

> On Thu, Sep 05, 2013 at 06:29:06PM +0800, Xiao Guangrong wrote:
>> Currently, kvm zaps the large spte if write-protected is needed, the later
>> read can fault on that spte. Actually, we can make the large spte readonly
>> instead of making them un-present, the page fault caused by read access can
>> be avoided
>> 
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter.  This removes the need for the return value.
>> 
>> This version has fixed the issue reported in 6b73a9606, the reason of that
>> issue is that fast_page_fault() directly sets the readonly large spte to
>> writable but only dirty the first page into the dirty-bitmap that means
>> other pages are missed. Fixed it by only the normal sptes (on the
>> PT_PAGE_TABLE_LEVEL level) can be fast fixed
>> 
>> Signed-off-by: Xiao Guangrong 
>> ---
>> arch/x86/kvm/mmu.c | 36 
>> arch/x86/kvm/x86.c |  8 ++--
>> 2 files changed, 26 insertions(+), 18 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 869f1db..88107ee 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
>> *sptep)
>> 
>> /*
>>  * Write-protect on the specified @sptep, @pt_protect indicates whether
>> - * spte writ-protection is caused by protecting shadow page table.
>> - * @flush indicates whether tlb need be flushed.
>> + * spte write-protection is caused by protecting shadow page table.
>>  *
>>  * Note: write protection is difference between drity logging and spte
>>  * protection:
>> @@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, 
>> u64 *sptep)
>>  * - for spte protection, the spte can be writable only after unsync-ing
>>  *   shadow page.
>>  *
>> - * Return true if the spte is dropped.
>> + * Return true if tlb need be flushed.
>>  */
>> -static bool
>> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool 
>> pt_protect)
>> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>> {
>>  u64 spte = *sptep;
>> 
>> @@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
>> *flush, bool pt_protect)
>> 
>>  rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
>> 
>> -if (__drop_large_spte(kvm, sptep)) {
>> -*flush |= true;
>> -return true;
>> -}
>> -
>>  if (pt_protect)
>>  spte &= ~SPTE_MMU_WRITEABLE;
>>  spte = spte & ~PT_WRITABLE_MASK;
>> 
>> -*flush |= mmu_spte_update(sptep, spte);
>> -return false;
>> +return mmu_spte_update(sptep, spte);
>> }
> 
> Is it necessary for kvm_mmu_unprotect_page to search for an entire range 
> large 
> page range now, instead of a 4k page?

It is unnecessary. kvm_mmu_unprotect_page is used to delete the gfn's shadow
pages, then the vcpu will re-fault. If any gfn in the large range has a shadow
page, it will stop using the large mapping, so the mapping will be split into
small mappings when the vcpu re-faults.




Re: [PATCH v2 02/15] KVM: MMU: properly check last spte in fast_page_fault()

2013-10-02 Thread Xiao Guangrong

On Oct 1, 2013, at 5:23 AM, Marcelo Tosatti  wrote:

>> 
> 
> Unrelated to this patch:
> 
> If vcpu->mode = OUTSIDE_GUEST_MODE, no IPI is sent
> by kvm_flush_remote_tlbs.

Yes.

> 
> So how is this supposed to work again?
> 
>/*
> * Wait for all vcpus to exit guest mode and/or lockless shadow
> * page table walks.
> */

On the lockless walking path, we change vcpu->mode to
READING_SHADOW_PAGE_TABLES, so that an IPI is sent.
Or did I miss your question?


>kvm_flush_remote_tlbs(kvm);
> 
> Patch looks fine.

Thank you, Marcelo!



Re: [PATCH v2 09/15] KVM: MMU: introduce pte-list lockless walker

2013-09-16 Thread Xiao Guangrong
Hi Gleb,

On 09/16/2013 08:42 PM, Gleb Natapov wrote:

>>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>>  struct kvm_memory_slot *slot)
>>  {
>> @@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
>>  {
>>  pte_list_desc_cache = kmem_cache_create("pte_list_desc",
>>  sizeof(struct pte_list_desc),
>> -0, 0, NULL);
>> +0, SLAB_DESTROY_BY_RCU, NULL);
> Haven't we agreed that constructor is needed for the cache?

Yes. I've made it as a separate patch:
[PATCH v2 10/15] KVM: MMU: initialize the pointers in pte_list_desc properly




Re: [PATCH v2 01/15] KVM: MMU: fix the count of spte number

2013-09-08 Thread Xiao Guangrong

On Sep 8, 2013, at 10:01 PM, Gleb Natapov  wrote:

> On Sun, Sep 08, 2013 at 09:55:04PM +0800, Xiao Guangrong wrote:
>> 
>> On Sep 8, 2013, at 8:19 PM, Gleb Natapov  wrote:
>> 
>>> On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
>>>> If the desc is the last one and it is full, its sptes is not counted
>>>> 
>>> Hmm, if desc is not full but it is not the last one all sptes after the
>>> desc are not counted too.
>> 
>> But the desc must be the last one if it's not full since we always add
>> new entry or delete entry from the last desc.
>> 
> Why do we alway delete entries from last desc? We delete them from the
> desc we found them in. Current code does not try to move entries between
> descs, only inside a desc.

Oh, yes. Sorry, my memory is wrong… :(

So, currently there are some gaps in the descs and that wastes memory. They can
not be fixed with a simple change, and I think it is not worthwhile to fix them
separately since after my new algorithm they should all be fixed… so how about
just dropping this fix?





Re: [PATCH v2 01/15] KVM: MMU: fix the count of spte number

2013-09-08 Thread Xiao Guangrong

On Sep 8, 2013, at 8:19 PM, Gleb Natapov  wrote:

> On Thu, Sep 05, 2013 at 06:29:04PM +0800, Xiao Guangrong wrote:
>> If the desc is the last one and it is full, its sptes is not counted
>> 
> Hmm, if desc is not full but it is not the last one all sptes after the
> desc are not counted too.

But the desc must be the last one if it is not full, since we always add a
new entry to, or delete an entry from, the last desc.




Re: [PATCH v2 09/15] KVM: MMU: introduce pte-list lockless walker

2013-09-08 Thread Xiao Guangrong

On Sep 5, 2013, at 6:29 PM, Xiao Guangrong  
wrote:

> 
> A desc only has 3 entries
> in the current code so it is not a problem now, but the issue will
> be triggered if we expend the size of desc in the further development

Sorry, this description is obviously wrong; the bug exists even with 3 entries
in the desc…
I do not know what I was thinking about when I wrote this down...


Re: [PATCH v2] KVM: mmu: allow page tables to be in read-only slots

2013-09-05 Thread Xiao Guangrong
On 09/05/2013 08:21 PM, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault when running with shadow paging, because the page walker uses
> gfn_to_hva and it fails on such a slot.
> 
> TianoCore uses such a page table.  The idea is that, on real hardware,
> the firmware can already run in 64-bit flat mode when setting up the
> memory controller.  Real hardware seems to be fine with that as long as
> the accessed/dirty bits are set.  Thus, this patch saves whether the
> slot is readonly, and later checks it when updating the accessed and
> dirty bits.
> 
> Note that this scenario is not supported by NPT at all, as explained by
> comments in the code.

Reviewed-by: Xiao Guangrong 



[PATCH v2 06/15] KVM: MMU: update spte and add it into rmap before dirty log

2013-09-05 Thread Xiao Guangrong
kvm_vm_ioctl_get_dirty_log() write-protects sptes based on its dirty
bitmap, so we should ensure the writable spte can be found in the rmap before
the dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
write-protect the page, which is detailed in the comments in this patch

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 84 ++
 arch/x86/kvm/x86.c | 10 +++
 2 files changed, 76 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a983570..8ea54d9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2428,6 +2428,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 {
u64 spte;
int ret = 0;
+   bool remap = is_rmap_spte(*sptep);
 
if (set_mmio_spte(vcpu->kvm, sptep, gfn, pfn, pte_access))
return 0;
@@ -2489,12 +2490,73 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
 
-   if (pte_access & ACC_WRITE_MASK)
-   mark_page_dirty(vcpu->kvm, gfn);
-
 set_pte:
if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);
+
+   if (!remap) {
+   if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
+   rmap_recycle(vcpu, sptep, gfn);
+
+   if (level > PT_PAGE_TABLE_LEVEL)
+   ++vcpu->kvm->stat.lpages;
+   }
+
+   /*
+* The orders we require are:
+* 1) set spte to writable __before__ set the dirty bitmap.
+*It makes sure that dirty-logging is not missed when do
+*live migration at the final step where kvm should stop
+*the guest and push the remaining dirty pages got from
+*dirty-bitmap to the destination. The similar cases are
+*in fast_pf_fix_direct_spte() and kvm_write_guest_page().
+*
+* 2) add the spte into rmap __before__ set the dirty bitmap.
+*
+* They can ensure we can find the writable spte on the rmap
+* when we do lockless write-protection since
+* kvm_vm_ioctl_get_dirty_log() write-protects the pages based
+* on its dirty-bitmap, otherwise these cases will happen:
+*
+*  CPU 0 CPU 1
+*  kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*  mask = xchg(dirty_bitmap, 0)
+*
+*  try to write-protect gfns which
+*  are set on "mask" then walk then
+*  rmap, see no spte on that rmap
+* add the spte into rmap
+*
+* !! Then the page can be freely wrote but not recorded in
+* the dirty bitmap.
+*
+* And:
+*
+*  VCPU 0CPU 1
+*kvm ioctl doing get-dirty-pages
+* mark_page_dirty(gfn) which
+* set the gfn on the dirty maps
+*
+* add spte into rmap
+*mask = xchg(dirty_bitmap, 0)
+*
+*try to write-protect gfns which
+*are set on "mask" then walk then
+*rmap, see spte is on the ramp
+*but it is readonly or nonpresent
+* Mark spte writable
+*
+* !! Then the page can be freely wrote but not recorded in the
+* dirty bitmap.
+*
+* See the comments in kvm_vm_ioctl_get_dirty_log().
+*/
+   smp_wmb();
+
+   if (pte_access & ACC_WRITE_MASK)
+   mark_page_dirty(vcpu->kvm, gfn);
 done:
return ret;
 }
@@ -2504,9 +2566,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 int level, gfn_t gfn, pfn_t pfn, bool speculative,
 bool host_writable)
 {
-   int was_rmapped = 0;
-   int rmap_count;
-
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 *sptep, write_fault, gfn);
 
@@ -2528,8 +2587,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 spte_to_pfn(*sptep), pfn);
drop_spte(vcpu->kvm, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
-   } else
-   was_rmapped = 1;
+   }
}
 
if (set_spte(vcpu, sptep, pte_access, level, gfn, pfn, speculative,
@@ -2547,16 +2605,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
 is_large_pte(*sptep)? "2MB" : "4kB",

[PATCH v2 05/15] KVM: MMU: flush tlb out of mmu lock when write-protect the sptes

2013-09-05 Thread Xiao Guangrong
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-protecting the sptes, because:
- we have marked large sptes readonly instead of dropping them, which means we
  just change the spte from writable to readonly, so we only need to care about
  the case of changing a spte from present to present (changing a spte from
  present to nonpresent flushes all the TLBs immediately); in other words,
  the only case we need to care about is mmu_spte_update()

- in mmu_spte_update(), we check
  SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, which
  means it does not depend on PT_WRITABLE_MASK anymore

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 18 ++
 arch/x86/kvm/x86.c |  9 +++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7488229..a983570 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4320,15 +4320,25 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
if (*rmapp)
__rmap_write_protect(kvm, rmapp, false);
 
-   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
-   kvm_flush_remote_tlbs(kvm);
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
cond_resched_lock(&kvm->mmu_lock);
-   }
}
}
 
-   kvm_flush_remote_tlbs(kvm);
spin_unlock(&kvm->mmu_lock);
+
+   /*
+* We can flush all the TLBs out of the mmu lock without TLB
+* corruption since we just change the spte from writable to
+* readonly so that we only need to care the case of changing
+* spte from present to present (changing the spte from present
+* to nonpresent will flush all the TLBs immediately), in other
+* words, the only case we care is mmu_spte_update() where we
+* haved checked SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE
+* instead of PT_WRITABLE_MASK, that means it does not depend
+* on PT_WRITABLE_MASK anymore.
+*/
+   kvm_flush_remote_tlbs(kvm);
 }
 
 #define BATCH_ZAP_PAGES10
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6ad0c07..72f1487 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3560,11 +3560,16 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
offset = i * BITS_PER_LONG;
kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
}
-   if (is_dirty)
-   kvm_flush_remote_tlbs(kvm);
 
spin_unlock(&kvm->mmu_lock);
 
+   /*
+* All the TLBs can be flushed out of mmu lock, see the comments in
+* kvm_mmu_slot_remove_write_access().
+*/
+   if (is_dirty)
+   kvm_flush_remote_tlbs(kvm);
+
r = -EFAULT;
if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
goto out;
-- 
1.8.1.4



[PATCH v2 07/15] KVM: MMU: redesign the algorithm of pte_list

2013-09-05 Thread Xiao Guangrong
Change the algorithm to:
1) always add the new desc to the first desc (pointed to by parent_ptes/rmap);
   that is good for implementing rcu-nulls-list-like lockless rmap
   walking

2) always move an entry from the first desc to the position we want
   to remove when deleting a spte from the parent_ptes/rmap (forward-move).
   This is good for implementing the lockless rmap walk since in the current
   code, when a spte is deleted from a "desc", another spte in the last
   "desc" is moved to this position to replace the deleted one. If the
   deleted one has already been visited and we do not visit the replaced one,
   the replaced one is missed when we do the lockless walk.
   To fix this case, we do not backward-move the spte; instead, we forward-move
   the entry: when a spte is deleted, we move an entry from the first
   desc to that position (see the sketch below)

Both of these can also reduce cache misses
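
A tiny user-space sketch of the forward-move delete described in 2)
(illustrative only: PTE_LIST_EXT hard-coded to 3, no locking/RCU; the real
logic is in pte_list_desc_remove_entry() below):

#include <stdio.h>

#define PTE_LIST_EXT 3

struct desc {
	unsigned long sptes[PTE_LIST_EXT];
	struct desc *more;
};

/* Index of the last used entry in @d, -1 if the desc is empty. */
static int last_used(struct desc *d)
{
	int i;

	for (i = PTE_LIST_EXT - 1; i >= 0; i--)
		if (d->sptes[i])
			return i;
	return -1;
}

/* Remove @spte and fill the hole from the head desc (forward-move). */
static void list_remove(struct desc *head, unsigned long spte)
{
	struct desc *d;
	int i, used;

	for (d = head; d; d = d->more)
		for (i = 0; i < PTE_LIST_EXT; i++)
			if (d->sptes[i] == spte) {
				used = last_used(head);
				d->sptes[i] = head->sptes[used];
				head->sptes[used] = 0;
				return;
			}
}

int main(void)
{
	struct desc tail = { { 40, 50, 0 }, NULL };	/* oldest entries */
	struct desc head = { { 10, 20, 30 }, &tail };	/* newest entries */
	struct desc *d;
	int i;

	/* The hole left by 50 is filled by 30, taken from the head desc. */
	list_remove(&head, 50);

	for (d = &head; d; d = d->more)
		for (i = 0; i < PTE_LIST_EXT; i++)
			if (d->sptes[i])
				printf("%lu\n", d->sptes[i]);
	return 0;
}

Since the hole is always filled from the head desc (and, with the walk order
used by the lockless walker later in the series, from an entry it has already
visited), an entry that is merely moved is never missed; only a really removed
entry can be skipped.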

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 180 -
 1 file changed, 123 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8ea54d9..08fb4e2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static int __find_first_free(struct pte_list_desc *desc)
+{
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   if (!desc->sptes[i])
+   break;
+   return i;
+}
+
+static int find_first_free(struct pte_list_desc *desc)
+{
+   int free = __find_first_free(desc);
+
+   WARN_ON(free >= PTE_LIST_EXT);
+   return free;
+}
+
+static int find_last_used(struct pte_list_desc *desc)
+{
+   int used = __find_first_free(desc) - 1;
+
+   WARN_ON(used < 0 || used >= PTE_LIST_EXT);
+   return used;
+}
+
+/*
+ * TODO: we can encode the desc number into the rmap/parent_ptes
+ * since at least 10 physical/virtual address bits are reserved
+ * on x86. It is worthwhile if it shows that the desc walking is
+ * a performance issue.
+ */
+static int count_spte_number(struct pte_list_desc *desc)
+{
+   int first_free, desc_num;
+
+   first_free = __find_first_free(desc);
+
+   for (desc_num = 0; desc->more; desc = desc->more)
+   desc_num++;
+
+   return first_free + desc_num * PTE_LIST_EXT;
+}
+
 /*
  * Pte mapping structures:
  *
@@ -923,99 +967,121 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
  *
  * Returns the number of pte entries before the spte was added or zero if
  * the spte was not added.
- *
  */
 static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
unsigned long *pte_list)
 {
struct pte_list_desc *desc;
-   int i, count = 0;
+   int free_pos;
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
*pte_list = (unsigned long)spte;
-   } else if (!(*pte_list & 1)) {
+   return 0;
+   }
+
+   if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
*pte_list = (unsigned long)desc | 1;
-   ++count;
-   } else {
-   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
-   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
-   desc = desc->more;
-   count += PTE_LIST_EXT;
-   }
-   if (desc->sptes[PTE_LIST_EXT-1]) {
-   count += PTE_LIST_EXT;
-   desc->more = mmu_alloc_pte_list_desc(vcpu);
-   desc = desc->more;
-   }
-   for (i = 0; desc->sptes[i]; ++i)
-   ++count;
-   desc->sptes[i] = spte;
+   return 1;
}
-   return count;
+
+   rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
+   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+
+   /* No empty entry in the desc. */
+   if (desc->sptes[PTE_LIST_EXT - 1]) {
+   struct pte_list_desc *new_desc;
+   new_desc = mmu_alloc_pte_list_desc(vcpu);
+   new_desc->more = desc;
+   desc = new_desc;
+   *pte_list = (unsigned long)desc | 1;
+   }
+
+   free_pos = find_first_free(desc);
+   desc->sptes[free_pos] = spte;
+   return count_spte_number(desc) - 1;
 }
 
 static void
-pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
-

[PATCH v2 03/15] KVM: MMU: lazily drop large spte

2013-09-05 Thread Xiao Guangrong
Currently, kvm zaps the large spte if write-protected is needed, the later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making them un-present, the page fault caused by read access can
be avoided

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version has fixed the issue reported in 6b73a9606, the reason of that
issue is that fast_page_fault() directly sets the readonly large spte to
writable but only dirty the first page into the dirty-bitmap that means
other pages are missed. Fixed it by only the normal sptes (on the
PT_PAGE_TABLE_LEVEL level) can be fast fixed

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 36 
 arch/x86/kvm/x86.c |  8 ++--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 869f1db..88107ee 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1177,8 +1177,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 
 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1187,10 +1186,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
u64 spte = *sptep;
 
@@ -1200,17 +1198,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
*flush, bool pt_protect)
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (__drop_large_spte(kvm, sptep)) {
-   *flush |= true;
-   return true;
-   }
-
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
 
-   *flush |= mmu_spte_update(sptep, spte);
-   return false;
+   return mmu_spte_update(sptep, spte);
 }
 
 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1222,11 +1214,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
 
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
-   if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-   sptep = rmap_get_first(*rmapp, &iter);
-   continue;
-   }
 
+   flush |= spte_write_protect(kvm, sptep, pt_protect);
sptep = rmap_get_next(&iter);
}
 
@@ -2675,6 +2664,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
int write,
break;
}
 
+   drop_large_spte(vcpu, iterator.sptep);
+
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;
 
@@ -2876,6 +2867,19 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
 
/*
+* Do not fix write-permission on the large spte since we only dirty
+* the first page into the dirty-bitmap in fast_pf_fix_direct_spte()
+* that means other pages are missed if its slot is dirty-logged.
+*
+* Instead, we let the slow page fault path create a normal spte to
+* fix the access.
+*
+* See the comments in kvm_arch_commit_memory_region().
+*/
+   if (sp->role.level > PT_PAGE_TABLE_LEVEL)
+   goto exit;
+
+   /*
 * Currently, fast page fault only works for direct mapping since
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5ca72a..6ad0c07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7208,8 +7208,12 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
/*
 * Write protect all pages for dirty logging.
-* Existing largepage mappings are destroyed here and new ones will
-* not be created until the end of the logging.
+*
+* All the sptes including the large sptes which point to this
+* slot are set to readonly. We can not create any new 

[PATCH v2 08/15] KVM: MMU: introduce nulls desc

2013-09-05 Thread Xiao Guangrong
It is like a nulls list and we use the pte-list as the nulls value, which can
help us detect whether the "desc" has been moved to another rmap, so we can
re-walk the rmap if that happened

kvm->slots_lock is held when we do the lockless walking; that prevents the rmap
from being reused (freeing an rmap needs to hold that lock), so we can not see
the same nulls value used on different rmaps

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 08fb4e2..c5f1b27 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
+{
+   unsigned long marker;
+
+   marker = (unsigned long)pte_list | 1UL;
+   desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+   return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+   return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
 
first_free = __find_first_free(desc);
 
-   for (desc_num = 0; desc->more; desc = desc->more)
+   for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
desc_num++;
 
return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc = mmu_alloc_pte_list_desc(vcpu);
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
+   desc_mark_nulls(pte_list, desc);
*pte_list = (unsigned long)desc | 1;
return 1;
}
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
/*
 * Only one entry existing but still use a desc to store it?
 */
-   WARN_ON(!next_desc);
+   WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
*pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Only one entry in this desc, move the entry to the head
 * then the desc can be freed.
 */
-   if (!first_desc->sptes[1] && !first_desc->more) {
+   if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
mmu_free_pte_list_desc(first_desc);
}
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
 
rmap_printk("pte_list_remove:  %p many->many\n", spte);
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
if (desc->sptes[i] == spte) {
pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
return fn((u64 *)*pte_list);
 
desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-   while (desc) {
+   while (!desc_is_a_nulls(desc)) {
for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
fn(desc->sptes[i]);
desc = desc->more;
}
+
+   WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
rmap_iterator *iter)
 
iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
iter->pos = 0;
+   WARN_ON(desc_is_a_nulls(iter->desc));
return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
return sptep;
}
 
-   iter->desc = iter->desc->more;
+   iter->desc = desc_is_a_nulls(iter->desc->more) ?
+   NULL : iter->desc->more;
 
if (iter->desc) {
iter->pos = 0;
-- 
1.8.1.4



[PATCH v2 01/15] KVM: MMU: fix the count of spte number

2013-09-05 Thread Xiao Guangrong
If the desc is the last one and it is full, its sptes are not counted

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6e2d2c8..7714fd8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -948,6 +948,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
count += PTE_LIST_EXT;
}
if (desc->sptes[PTE_LIST_EXT-1]) {
+   count += PTE_LIST_EXT;
desc->more = mmu_alloc_pte_list_desc(vcpu);
desc = desc->more;
}
-- 
1.8.1.4



[PATCH v2 14/15] KVM: MMU: clean up spte_write_protect

2013-09-05 Thread Xiao Guangrong
Now, the only user of spte_write_protect is rmap_write_protect, which
always calls spte_write_protect with pt_protect = true, so drop
@pt_protect and the unused parameter @kvm

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 44b7822..f3f17a0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1330,8 +1330,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
 }
 
 /*
- * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte write-protection is caused by protecting shadow page table.
+ * Write-protect on the specified @sptep.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1342,25 +1341,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
*sptep)
  *
  * Return true if tlb need be flushed.
  */
-static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
+static bool spte_write_protect(u64 *sptep)
 {
u64 spte = *sptep;
 
if (!is_writable_pte(spte) &&
- !(pt_protect && spte_is_locklessly_modifiable(spte)))
+ !spte_is_locklessly_modifiable(spte))
return false;
 
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
 
-   if (pt_protect)
-   spte &= ~SPTE_MMU_WRITEABLE;
-   spte = spte & ~PT_WRITABLE_MASK;
+   spte &= ~SPTE_MMU_WRITEABLE;
+   spte &= ~PT_WRITABLE_MASK;
 
return mmu_spte_update(sptep, spte);
 }
 
-static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
-bool pt_protect)
+static bool __rmap_write_protect(unsigned long *rmapp)
 {
u64 *sptep;
struct rmap_iterator iter;
@@ -1369,7 +1366,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
 
-   flush |= spte_write_protect(kvm, sptep, pt_protect);
+   flush |= spte_write_protect(sptep);
sptep = rmap_get_next(&iter);
}
 
@@ -1438,7 +1435,7 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
rmapp = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(kvm, rmapp, true);
+   write_protected |= __rmap_write_protect(rmapp);
}
 
return write_protected;
-- 
1.8.1.4



[PATCH v2 09/15] KVM: MMU: introduce pte-list lockless walker

2013-09-05 Thread Xiao Guangrong
The basic idea is from the nulls list, which uses a nulls value to indicate
whether the desc has been moved to a different pte-list

Note, we should do a bottom-up walk in the desc since we always move
the bottom entry to the deleted position. A desc only has 3 entries
in the current code so it is not a problem now, but the issue will
be triggered if we expand the size of the desc in further development

Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 57 ++
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c5f1b27..3e1432f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -975,6 +975,10 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+   (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -994,7 +998,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 
if (!*pte_list) {
rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
-   *pte_list = (unsigned long)spte;
+   rcu_assign_pte_list(pte_list, spte);
return 0;
}
 
@@ -1004,7 +1008,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[0] = (u64 *)*pte_list;
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
}
 
@@ -1017,7 +1021,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
new_desc = mmu_alloc_pte_list_desc(vcpu);
new_desc->more = desc;
desc = new_desc;
-   *pte_list = (unsigned long)desc | 1;
+   rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
@@ -1125,6 +1129,51 @@ static void pte_list_walk(unsigned long *pte_list, 
pte_list_walk_fn fn)
WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
+/* The caller should hold rcu lock. */
+static void pte_list_walk_lockless(unsigned long *pte_list,
+  pte_list_walk_fn fn)
+{
+   struct pte_list_desc *desc;
+   unsigned long pte_list_value;
+   int i;
+
+restart:
+   /*
+* Force the pte_list to be reloaded.
+*
+* See the comments in hlist_nulls_for_each_entry_rcu().
+*/
+   barrier();
+   pte_list_value = *rcu_dereference(pte_list);
+   if (!pte_list_value)
+   return;
+
+   if (!(pte_list_value & 1))
+   return fn((u64 *)pte_list_value);
+
+   desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
+   while (!desc_is_a_nulls(desc)) {
+   /*
+* We should do top-down walk since we always use the higher
+* indices to replace the deleted entry if only one desc is
+* used in the rmap when a spte is removed. Otherwise the
+* moved entry will be missed.
+*/
+   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
+   if (desc->sptes[i])
+   fn(desc->sptes[i]);
+
+   desc = rcu_dereference(desc->more);
+
+   /* It is being initialized. */
+   if (unlikely(!desc))
+   goto restart;
+   }
+
+   if (unlikely(desc_get_nulls_value(desc) != pte_list))
+   goto restart;
+}
+
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
struct kvm_memory_slot *slot)
 {
@@ -4651,7 +4700,7 @@ int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
-   0, 0, NULL);
+   0, SLAB_DESTROY_BY_RCU, NULL);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v2 04/15] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-09-05 Thread Xiao Guangrong
Relax the tlb flush condition since we will write-protect the spte out of the
mmu lock. Note that lockless write-protection only marks a writable spte
readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (which are tested by spte_is_locklessly_modifiable)

This patch is used to avoid this kind of race:

  VCPU 0 VCPU 1
lockless write protection:
  set spte.w = 0
 lock mmu-lock

 write-protect the spte to sync the shadow page,
 see spte.w = 0, so do not flush the tlb

 unlock mmu-lock

 !!! At this point, the shadow page can still be
 writable due to the stale tlb entry
 Flush all TLBs

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 88107ee..7488229 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -595,7 +595,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 * we always atomicly update it, see the comments in
 * spte_has_volatile_bits().
 */
-   if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
+   if (spte_is_locklessly_modifiable(old_spte) &&
+ !is_writable_pte(new_spte))
ret = true;
 
if (!shadow_accessed_mask)
-- 
1.8.1.4



[PATCH v2 10/15] KVM: MMU: initialize the pointers in pte_list_desc properly

2013-09-05 Thread Xiao Guangrong
Since pte_list_desc will be locklessly accessed, we need to atomically
initialize its pointers so that the lockless walker can not see a partial
value of a pointer

In this patch we initialize the pointers by plain pointer assignment, which
is always atomic, instead of using kmem_cache_zalloc

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3e1432f..fe80019 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,14 +687,15 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu 
*vcpu)
 }
 
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
+ struct kmem_cache *base_cache, int min,
+ gfp_t flags)
 {
void *obj;
 
if (cache->nobjs >= min)
return 0;
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-   obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
+   obj = kmem_cache_alloc(base_cache, flags);
if (!obj)
return -ENOMEM;
cache->objects[cache->nobjs++] = obj;
@@ -741,14 +742,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
int r;
 
r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
+  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM,
+  GFP_KERNEL);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
if (r)
goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
-  mmu_page_header_cache, 4);
+  mmu_page_header_cache, 4,
+  GFP_KERNEL | __GFP_ZERO);
 out:
return r;
 }
@@ -913,6 +916,17 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
return level - 1;
 }
 
+static void pte_list_desc_ctor(void *p)
+{
+   struct pte_list_desc *desc = p;
+   int i;
+
+   for (i = 0; i < PTE_LIST_EXT; i++)
+   desc->sptes[i] = NULL;
+
+   desc->more = NULL;
+}
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
@@ -1066,6 +1080,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
*pte_list = (unsigned long)first_desc->sptes[0];
+   first_desc->sptes[0] = NULL;
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -4699,8 +4714,8 @@ static void mmu_destroy_caches(void)
 int kvm_mmu_module_init(void)
 {
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
-   sizeof(struct pte_list_desc),
-   0, SLAB_DESTROY_BY_RCU, NULL);
+   sizeof(struct pte_list_desc),
+   0, SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);
if (!pte_list_desc_cache)
goto nomem;
 
-- 
1.8.1.4



[PATCH v2 11/15] KVM: MMU: reintroduce kvm_mmu_isolate_page()

2013-09-05 Thread Xiao Guangrong
It was removed by commit 834be0d83. Now we will need it to do lockless shadow
page walking protected by rcu, so reintroduce it

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fe80019..2bf450a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1675,14 +1675,30 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+/*
+ * Remove the sp from shadow page cache, after call it,
+ * we can not find this sp from the cache, and the shadow
+ * page table is still valid.
+ *
+ * It should be under the protection of mmu lock.
+ */
+static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
 {
ASSERT(is_empty_shadow_page(sp->spt));
+
hlist_del(&sp->hash_link);
-   list_del(&sp->link);
-   free_page((unsigned long)sp->spt);
if (!sp->role.direct)
free_page((unsigned long)sp->gfns);
+}
+
+/*
+ * Free the shadow page table and the sp, we can do it
+ * out of the protection of mmu lock.
+ */
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
+{
+   list_del(&sp->link);
+   free_page((unsigned long)sp->spt);
kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2361,6 +2377,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
+   kvm_mmu_isolate_page(sp);
kvm_mmu_free_page(sp);
}
 }
-- 
1.8.1.4



[PATCH v2 13/15] KVM: MMU: locklessly write-protect the page

2013-09-05 Thread Xiao Guangrong
Currently, when we mark a memslot dirty-logged or get its dirty pages, we need
to write-protect a large amount of guest memory, which is heavy work;
especially, we need to hold the mmu-lock, which is also required by vcpus to
fix their page table faults and by the mmu-notifier when a host page is being
changed. On guests with extreme cpu / memory usage, it becomes a scalability
issue

This patch introduces a way to locklessly write-protect guest memory

Now that the lockless rmap walk, lockless shadow page table access and lockless
spte write-protection are ready, it is time to implement page
write-protection out of the mmu-lock

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  4 
 arch/x86/kvm/mmu.c  | 53 +
 arch/x86/kvm/mmu.h  |  6 +
 arch/x86/kvm/x86.c  | 11 -
 4 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e4ca0d..00b44b1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -789,10 +789,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f551fc7..44b7822 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1376,8 +1376,31 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
return flush;
 }
 
-/**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+static void __rmap_write_protect_lockless(u64 *sptep)
+{
+   u64 spte;
+   int level = page_header(__pa(sptep))->role.level;
+
+retry:
+   spte = mmu_spte_get_lockless(sptep);
+   if (unlikely(!is_last_spte(spte, level) || !is_writable_pte(spte)))
+   return;
+
+   if (likely(cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK) == spte))
+   return;
+
+   goto retry;
+}
+
+static void rmap_write_protect_lockless(unsigned long *rmapp)
+{
+   pte_list_walk_lockless(rmapp, __rmap_write_protect_lockless);
+}
+
+/*
+ * kvm_mmu_write_protect_pt_masked_lockless - write protect selected PT level
+ * pages out of mmu-lock.
+ *
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1386,16 +1409,17 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask)
+void
+kvm_mmu_write_protect_pt_masked_lockless(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset, unsigned long mask)
 {
unsigned long *rmapp;
 
while (mask) {
rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
  PT_PAGE_TABLE_LEVEL, slot);
-   __rmap_write_protect(kvm, rmapp, false);
+   rmap_write_protect_lockless(rmapp);
 
/* clear the first set bit */
mask &= mask - 1;
@@ -4547,7 +4571,7 @@ int kvm_mmu_setup(struct kvm_vcpu *vcpu)
return init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access_lockless(struct kvm *kvm, int slot)
 {
struct kvm_memory_slot *memslot;
gfn_t last_gfn;
@@ -4556,8 +4580,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
memslot = id_to_memslot(kvm->memslots, slot);
last_gfn = memslot->base_gfn + memslot->npages - 1;
 
-   spin_lock(&kvm->mmu_lock);
-
+   kvm_mmu_rcu_free_page_begin(kvm);
for (i = PT_PAGE_TABLE_LEVEL;
 i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
unsigned long *rmapp;
@@ -4567,15 +4590,15 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
int slot)
last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
 
for (index = 0; index <= last_index; ++index, ++rmapp) {
-   if (*rmapp)
-   __rmap_write_protect(kvm, rmapp, false);
+ 

[PATCH v2 02/15] KVM: MMU: properly check last spte in fast_page_fault()

2013-09-05 Thread Xiao Guangrong
Use sp->role.level instead of @level since @level is not taken from the
page table hierarchy

There is no issue in the current code since the fast page fault currently only
fixes faults caused by dirty-log, which are always on the last level
(level = 1)

This patch makes the code more readable and avoids a potential issue in
further development

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7714fd8..869f1db 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2804,9 +2804,9 @@ static bool page_fault_can_be_fast(u32 error_code)
 }
 
 static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 spte)
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+   u64 *sptep, u64 spte)
 {
-   struct kvm_mmu_page *sp = page_header(__pa(sptep));
gfn_t gfn;
 
WARN_ON(!sp->role.direct);
@@ -2832,6 +2832,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
u32 error_code)
 {
struct kvm_shadow_walk_iterator iterator;
+   struct kvm_mmu_page *sp;
bool ret = false;
u64 spte = 0ull;
 
@@ -2852,7 +2853,8 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
goto exit;
}
 
-   if (!is_last_spte(spte, level))
+   sp = page_header(__pa(iterator.sptep));
+   if (!is_last_spte(spte, sp->role.level))
goto exit;
 
/*
@@ -2878,7 +2880,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t 
gva, int level,
 * the gfn is not stable for indirect shadow page.
 * See Documentation/virtual/kvm/locking.txt to get more detail.
 */
-   ret = fast_pf_fix_direct_spte(vcpu, iterator.sptep, spte);
+   ret = fast_pf_fix_direct_spte(vcpu, sp, iterator.sptep, spte);
 exit:
trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
  spte, ret);
-- 
1.8.1.4



[PATCH v2 15/15] KVM: MMU: use rcu functions to access the pointer

2013-09-05 Thread Xiao Guangrong
Use rcu_assign_pointer() to update all the pointers in the desc
and use rcu_dereference() to locklessly read the pointers

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 46 --
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f3f17a0..808c2d9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -927,12 +927,23 @@ static void pte_list_desc_ctor(void *p)
desc->more = NULL;
 }
 
+#define rcu_assign_pte_list(pte_list_p, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
+ (unsigned long *)(value))
+
+#define rcu_assign_desc_more(morep, value) \
+   rcu_assign_pointer(*(unsigned long __rcu **)&morep, \
+ (unsigned long *)value)
+
+#define rcu_assign_spte(sptep, value)  \
+   rcu_assign_pointer(*(u64 __rcu **)&sptep, (u64 *)value)
+
 static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
*desc)
 {
unsigned long marker;
 
marker = (unsigned long)pte_list | 1UL;
-   desc->more = (struct pte_list_desc *)marker;
+   rcu_assign_desc_more(desc->more, (struct pte_list_desc *)marker);
 }
 
 static bool desc_is_a_nulls(struct pte_list_desc *desc)
@@ -989,10 +1000,6 @@ static int count_spte_number(struct pte_list_desc *desc)
return first_free + desc_num * PTE_LIST_EXT;
 }
 
-#define rcu_assign_pte_list(pte_list_p, value) \
-   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p),   \
-   (unsigned long *)(value))
-
 /*
  * Pte mapping structures:
  *
@@ -1019,8 +1026,8 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
if (!(*pte_list & 1)) {
rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
desc = mmu_alloc_pte_list_desc(vcpu);
-   desc->sptes[0] = (u64 *)*pte_list;
-   desc->sptes[1] = spte;
+   rcu_assign_spte(desc->sptes[0], *pte_list);
+   rcu_assign_spte(desc->sptes[1], spte);
desc_mark_nulls(pte_list, desc);
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
return 1;
@@ -1033,13 +1040,13 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 
*spte,
if (desc->sptes[PTE_LIST_EXT - 1]) {
struct pte_list_desc *new_desc;
new_desc = mmu_alloc_pte_list_desc(vcpu);
-   new_desc->more = desc;
+   rcu_assign_desc_more(new_desc->more, desc);
desc = new_desc;
rcu_assign_pte_list(pte_list, (unsigned long)desc | 1);
}
 
free_pos = find_first_free(desc);
-   desc->sptes[free_pos] = spte;
+   rcu_assign_spte(desc->sptes[free_pos], spte);
return count_spte_number(desc) - 1;
 }
 
@@ -1057,8 +1064,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * Move the entry from the first desc to this position we want
 * to remove.
 */
-   desc->sptes[i] = first_desc->sptes[last_used];
-   first_desc->sptes[last_used] = NULL;
+   rcu_assign_spte(desc->sptes[i], first_desc->sptes[last_used]);
+   rcu_assign_spte(first_desc->sptes[last_used], NULL);
 
/* No valid entry in this desc, we can free this desc now. */
if (!first_desc->sptes[0]) {
@@ -1070,7 +1077,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
WARN_ON(desc_is_a_nulls(next_desc));
 
mmu_free_pte_list_desc(first_desc);
-   *pte_list = (unsigned long)next_desc | 1ul;
+   rcu_assign_pte_list(pte_list, (unsigned long)next_desc | 1ul);
return;
}
 
@@ -1079,8 +1086,8 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
 * then the desc can be freed.
 */
if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
-   *pte_list = (unsigned long)first_desc->sptes[0];
-   first_desc->sptes[0] = NULL;
+   rcu_assign_pte_list(pte_list, first_desc->sptes[0]);
+   rcu_assign_spte(first_desc->sptes[0], NULL);
mmu_free_pte_list_desc(first_desc);
}
 }
@@ -1102,7 +1109,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
pr_err("pte_list_remove:  %p 1->BUG\n", spte);
BUG();
}
-   *pte_list = 0;
+   rcu_assign_pte_list(pte_list, 0);
return;
}
 
@@ -1174,9 +1181,12 @@ restart:
 * used in the rmap when a spte is removed. Otherwise the
 * moved entry will be miss

[PATCH v2 00/15] KVM: MMU: locklessly wirte-protect

2013-09-05 Thread Xiao Guangrong
d to write-protect
the page.

Performance result
====
The performance result and the benchmark can be found at:
  http://permalink.gmane.org/gmane.linux.kernel/1534876

Xiao Guangrong (15):
  KVM: MMU: fix the count of spte number
  KVM: MMU: properly check last spte in fast_page_fault()
  KVM: MMU: lazily drop large spte
  KVM: MMU: flush tlb if the spte can be locklessly modified
  KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
  KVM: MMU: update spte and add it into rmap before dirty log
  KVM: MMU: redesign the algorithm of pte_list
  KVM: MMU: introduce nulls desc
  KVM: MMU: introduce pte-list lockless walker
  KVM: MMU: initialize the pointers in pte_list_desc properly
  KVM: MMU: reintroduce kvm_mmu_isolate_page()
  KVM: MMU: allow locklessly access shadow page table out of vcpu thread
  KVM: MMU: locklessly write-protect the page
  KVM: MMU: clean up spte_write_protect
  KVM: MMU: use rcu functions to access the pointer

 arch/x86/include/asm/kvm_host.h |  10 +-
 arch/x86/kvm/mmu.c  | 566 ++--
 arch/x86/kvm/mmu.h  |  28 ++
 arch/x86/kvm/x86.c  |  34 ++-
 4 files changed, 491 insertions(+), 147 deletions(-)

-- 
1.8.1.4



[PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-09-05 Thread Xiao Guangrong
It is easy if the handler is in the vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages from being freed. But we are in the
ioctl context and the paths we are optimizing for have a heavy workload;
disabling interrupts is not good for system performance

We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then
use call_rcu() to free the shadow pages if that indicator is set.
Setting/clearing the indicator is protected by the slots lock, so it need not
be atomic and does not hurt performance or scalability
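
For reference, the intended bracketing on the write-protection side looks
roughly like this (a simplified sketch; rmap_write_protect_lockless() is added
by a later patch in this series):

	kvm_mmu_rcu_free_page_begin(kvm);

	/* walk the rmaps of the memslot and clear PT_WRITABLE_MASK locklessly */
	rmap_write_protect_lockless(rmapp);

	kvm_mmu_rcu_free_page_end(kvm);

Any shadow page zapped while the indicator is set is freed via call_rcu(), so
the lockless walker inside the begin/end section can not see its page tables
freed under it.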

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  6 +-
 arch/x86/kvm/mmu.c  | 32 
 arch/x86/kvm/mmu.h  | 22 ++
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c76ff74..8e4ca0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -226,7 +226,10 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   union {
+   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   struct rcu_head rcu;
+   };
 
 #ifdef CONFIG_X86_32
/*
@@ -554,6 +557,7 @@ struct kvm_arch {
 */
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+   bool rcu_free_shadow_page;
 
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2bf450a..f551fc7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
return ret;
 }
 
+static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
+{
+   struct kvm_mmu_page *sp;
+
+   list_for_each_entry(sp, invalid_list, link)
+   kvm_mmu_isolate_page(sp);
+}
+
+static void free_pages_rcu(struct rcu_head *head)
+{
+   struct kvm_mmu_page *next, *sp;
+
+   sp = container_of(head, struct kvm_mmu_page, rcu);
+   while (sp) {
+   if (!list_empty(&sp->link))
+   next = list_first_entry(&sp->link,
+ struct kvm_mmu_page, link);
+   else
+   next = NULL;
+   kvm_mmu_free_page(sp);
+   sp = next;
+   }
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
 {
@@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 */
kvm_flush_remote_tlbs(kvm);
 
+   if (kvm->arch.rcu_free_shadow_page) {
+   kvm_mmu_isolate_pages(invalid_list);
+   sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+   list_del_init(invalid_list);
+   call_rcu(&sp->rcu, free_pages_rcu);
+   return;
+   }
+
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
kvm_mmu_isolate_page(sp);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 77e044a..61217f3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,4 +117,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, 
unsigned pte_access,
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+/*
+ * The caller should ensure that these two functions should be
+ * serially called.
+ */
+static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
+{
+   rcu_read_lock();
+
+   kvm->arch.rcu_free_shadow_page = true;
+   /* Set the indicator before access shadow page. */
+   smp_mb();
+}
+
+static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
+{
+   /* Make sure that access shadow page has finished. */
+   smp_mb();
+   kvm->arch.rcu_free_shadow_page = false;
+
+   rcu_read_unlock();
+}
 #endif
-- 
1.8.1.4



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 09/02/2013 05:49 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
>> On 09/01/2013 05:17 PM, Gleb Natapov wrote:
>>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>>>> Page tables in a read-only memory slot will currently cause a triple
>>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>>
>>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>>> that as long as the accessed/dirty bits are set.  Save whether the slot
>>>> is readonly, and later check it when updating the accessed and dirty bits.
>>>>
>>> The fix looks OK to me, but some comment below.
>>>
>>>> Cc: sta...@vger.kernel.org
>>>> Cc: g...@redhat.com
>>>> Cc: Xiao Guangrong 
>>>> Signed-off-by: Paolo Bonzini 
>>>> ---
>>>>CCing to stable@ since the regression was introduced with
>>>>support for readonly memory slots.
>>>>
>>>>  arch/x86/kvm/paging_tmpl.h |  7 ++-
>>>>  include/linux/kvm_host.h   |  1 +
>>>>  virt/kvm/kvm_main.c| 14 +-
>>>>  3 files changed, 16 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>>> index 0433301..dadc5c0 100644
>>>> --- a/arch/x86/kvm/paging_tmpl.h
>>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>>> @@ -99,6 +99,7 @@ struct guest_walker {
>>>>pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
>>>>gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
>>>>pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
>>>> +  bool pte_writable[PT_MAX_FULL_LEVELS];
>>>>unsigned pt_access;
>>>>unsigned pte_access;
>>>>gfn_t gfn;
>>>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct 
>>>> kvm_vcpu *vcpu,
>>>>if (pte == orig_pte)
>>>>continue;
>>>>  
>>>> +  if (unlikely(!walker->pte_writable[level - 1]))
>>>> +  return -EACCES;
>>>> +
>>>>ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, 
>>>> orig_pte, pte);
>>>>if (ret)
>>>>return ret;
>>>> @@ -309,7 +313,8 @@ retry_walk:
>>>>goto error;
>>>>real_gfn = gpa_to_gfn(real_gfn);
>>>>  
>>>> -  host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
>>>> +  host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
>>>> +  &walker->pte_writable[walker->level 
>>>> - 1]);
>>> The use of gfn_to_hva_read is misleading. The code can still write into
>>> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
>>> to gfn_to_hva_write().
>>
>> Yes. I agreed.
>>
>>>
>>> This makes me think are there other places where gfn_to_hva() was
>>> used, but gfn_to_hva_prot() should have been?
>>>  - kvm_host_page_size() looks incorrect. We never use huge page to map
>>>read only memory slots currently.
>>
>> It only checks whether gfn have been mapped, I think we can use
>> gfn_to_hva_read() instead, the real permission will be checked when we 
>> translate
>> the gfn to pfn.
>>
> Yes, all the cases I listed should be changed to use function that looks
> at both regular and RO slots.
> 
>>>  - kvm_handle_bad_page() also looks incorrect and may cause incorrect
>>>address to be reported to userspace.
>>
>> I have no idea on this point. kvm_handle_bad_page() is called when it failed 
>> to
>> translate the target gfn to pfn, then the emulator can detect the error on 
>> target gfn
>> properly. no? Or i misunderstood your meaning?
>>
> I am talking about the following code:
> 
> if (pfn == KVM_PFN_ERR_HWPOISON) {
> kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
> return 0;
> }
> 
> pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need
> to report the liner address of the faulty memory to a userspace here,
> but if gfn is in a RO slot gfn_to_hva() will not return correct address
> here.

Got it, thanks for your explanation.

BTW, if you and Paolo are busy with other things, I am happy to fix these issues. :)



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 09/02/2013 05:25 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>> Page tables in a read-only memory slot will currently cause a triple
>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>
>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>> that as long as the accessed/dirty bits are set.  Save whether the slot
>>> is readonly, and later check it when updating the accessed and dirty bits.
>>
>> Paolo, do you know why OVMF is using readonly memory like this?
>>
> Just a guess, but perhaps they want to move to paging mode as early as
> possible even before memory controller is fully initialized.
> 
>> AFAIK, The fault trigged by this kind of access can hardly be fixed by
>> userspace since the fault is trigged by pagetable walking not by the current
>> instruction. Do you have any idea to let uerspace emulate it properly?
> Not sure what userspace you mean here, but there shouldn't be a fault in the

I just wonder how to fix this kind of fault. The current patch returns -EACCES,
but that will crash the guest. I think we'd better let userspace fix this
error (i.e. let userspace set the D/A bit).

> first place if ROM page tables have access/dirty bit set and they do.

Yes, so we cannot call x86_emulate_instruction() to fix this fault (that function
emulates the access in the first place). Should we directly return an MMIO exit to
userspace when we meet this fault? What happens if this fault on pagetable walking
is triggered inside x86_emulate_instruction()?



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 09/01/2013 05:17 PM, Gleb Natapov wrote:
> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>> Page tables in a read-only memory slot will currently cause a triple
>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>
>> OVMF uses such a page table; however, real hardware seems to be fine with
>> that as long as the accessed/dirty bits are set.  Save whether the slot
>> is readonly, and later check it when updating the accessed and dirty bits.
>>
> The fix looks OK to me, but some comment below.
> 
>> Cc: sta...@vger.kernel.org
>> Cc: g...@redhat.com
>> Cc: Xiao Guangrong 
>> Signed-off-by: Paolo Bonzini 
>> ---
>>  CCing to stable@ since the regression was introduced with
>>  support for readonly memory slots.
>>
>>  arch/x86/kvm/paging_tmpl.h |  7 ++-
>>  include/linux/kvm_host.h   |  1 +
>>  virt/kvm/kvm_main.c| 14 +-
>>  3 files changed, 16 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 0433301..dadc5c0 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -99,6 +99,7 @@ struct guest_walker {
>>  pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
>>  gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
>>  pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
>> +bool pte_writable[PT_MAX_FULL_LEVELS];
>>  unsigned pt_access;
>>  unsigned pte_access;
>>  gfn_t gfn;
>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct 
>> kvm_vcpu *vcpu,
>>  if (pte == orig_pte)
>>  continue;
>>  
>> +if (unlikely(!walker->pte_writable[level - 1]))
>> +return -EACCES;
>> +
>>  ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, 
>> orig_pte, pte);
>>  if (ret)
>>  return ret;
>> @@ -309,7 +313,8 @@ retry_walk:
>>  goto error;
>>  real_gfn = gpa_to_gfn(real_gfn);
>>  
>> -host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
>> +host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
>> +&walker->pte_writable[walker->level 
>> - 1]);
> The use of gfn_to_hva_read is misleading. The code can still write into
> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
> to gfn_to_hva_write().

Yes, agreed.
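
For illustration, the renamed helper might look roughly like this (a sketch only,
assuming the existing __gfn_to_hva_many()/memslot_is_readonly() internals in
virt/kvm/kvm_main.c, not the final patch):

unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable)
{
        struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
        unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);

        /* Report whether the slot allows writes, but always return the hva. */
        if (!kvm_is_error_hva(hva) && writable)
                *writable = !memslot_is_readonly(slot);

        return hva;
}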

> 
> This makes me think are there other places where gfn_to_hva() was
> used, but gfn_to_hva_prot() should have been?
>  - kvm_host_page_size() looks incorrect. We never use huge page to map
>read only memory slots currently.

It only checks whether the gfn has been mapped. I think we can use
gfn_to_hva_read() instead; the real permission will be checked when we translate
the gfn to a pfn.

>  - kvm_handle_bad_page() also looks incorrect and may cause incorrect
>address to be reported to userspace.

I have no idea on this point. kvm_handle_bad_page() is called when it fails to
translate the target gfn to a pfn; the emulator can then detect the error on the
target gfn properly, no? Or have I misunderstood your meaning?

>  - kvm_setup_async_pf() also incorrect. Makes all page fault on read
>only slot to be sync.
>  - kvm_vm_fault() one looks OK since function assumes write only slots,
>but it is obsolete and should be deleted anyway.




Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault because the page walker uses gfn_to_hva and it fails on such a slot.
> 
> OVMF uses such a page table; however, real hardware seems to be fine with
> that as long as the accessed/dirty bits are set.  Save whether the slot
> is readonly, and later check it when updating the accessed and dirty bits.

Paolo, do you know why OVMF is using readonly memory like this?

AFAIK, the fault triggered by this kind of access can hardly be fixed by
userspace since the fault is triggered by pagetable walking, not by the current
instruction. Do you have any idea how to let userspace emulate it properly?



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-09-02 Thread Xiao Guangrong
On 08/30/2013 07:44 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 08:02:30PM +0800, Xiao Guangrong wrote:
>> On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
>>> On 08/29/2013 05:31 PM, Gleb Natapov wrote:
>>>> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>>>>> After more thinking, I still think rcu_assign_pointer() is unneeded when 
>>>>> a entry
>>>>> is removed. The remove-API does not care the order between unlink the 
>>>>> entry and
>>>>> the changes to its fields. It is the caller's responsibility:
>>>>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), 
>>>>> etc to
>>>>>   enforce all lookups exit and the later change on that entry is 
>>>>> invisible to the
>>>>>   lookups.
>>>>>
>>>>> - In the case of rculist_nulls, it seems refcounter is used to guarantee 
>>>>> the order
>>>>>   (see the example from Documentation/RCU/rculist_nulls.txt).
>>>>>
>>>>> - In our case, we allow the lookup to see the deleted desc even if it is 
>>>>> in slab cache
>>>>>   or its is initialized or it is re-added.
>>>>>
>>>> BTW is it a good idea? We can access deleted desc while it is allocated
>>>> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
>>>> see partially initialized desc->sptes[] entry? On related note what about
>>>> 32 bit systems, they do not have atomic access to desc->sptes[].
>>
>> Ah... wait. desc is a array of pointers:
>>
>> struct pte_list_desc {
>>  u64 *sptes[PTE_LIST_EXT];
>>  struct pte_list_desc *more;
>> };
>>
> Yep, I misread it to be u64 bits and wondered why do we use u64 to store
> pointers.
> 
>> assigning a pointer is aways aotomic, but we should carefully initialize it
>> as you said. I will introduce a constructor for desc slab cache which 
>> initialize
>> the struct like this:
>>
>> for (i = 0; i < PTE_LIST_EXT; i++)
>>  desc->sptes[i] = NULL;
>>
>> It is okay.
>>
> I hope slab does not write anything into allocated memory internally if
> constructor is present. 

If only the constructor is present (no SLAB_DESTROY_BY_RCU), it'll temporarily
write the "poison" value into the memory and then call the constructor to initialize
it again, e.g. in slab.c:

static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep,
gfp_t flags, void *objp, unsigned long caller)
{
if (cachep->flags & SLAB_POISON) {
..
poison_obj(cachep, objp, POISON_INUSE);
}
..
if (cachep->ctor && cachep->flags & SLAB_POISON)
cachep->ctor(objp);
}

But SLAB_DESTROY_BY_RCU forces the allocator not to touch
the memory, which is true in our case.

> BTW do you know what happens when SLAB debug is enabled
> and SLAB_DESTROY_BY_RCU is set? 

When SLAB debug is enabled, these 3 flags may be set:
#define SLAB_DEBUG_FREE 0x0100UL/* DEBUG: Perform (expensive) 
checks on free */
#define SLAB_RED_ZONE   0x0400UL/* DEBUG: Red zone objs in a 
cache */
#define SLAB_POISON 0x0800UL/* DEBUG: Poison objects */

Only SLAB_POISON can write something into the memory, and ...

> Does poison value is written into freed
> object (freed to slab, but not yet to page allocator)?

SLAB_POISON is cleared if SLAB_DESTROY_BY_RCU is used.
- In slab, there is this code in __kmem_cache_create():
if (flags & SLAB_DESTROY_BY_RCU)
BUG_ON(flags & SLAB_POISON);

- In slub, the code is in calculate_sizes():
/*
 * Determine if we can poison the object itself. If the user of
 * the slab may touch the object after free or before allocation
 * then we should never poison the object itself.
 */
if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
!s->ctor)
s->flags |= __OBJECT_POISON;
else
s->flags &= ~__OBJECT_POISON;

- In slob, it seems SLAB debug is not supported.
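
For illustration, hooking a constructor into the desc cache under
SLAB_DESTROY_BY_RCU could look roughly like this (a sketch of the idea discussed
in this thread, not the exact patch):

static void pte_list_desc_ctor(void *p)
{
        struct pte_list_desc *desc = p;
        int i;

        /*
         * A freshly constructed desc must never expose garbage sptes[] to a
         * lockless walker; reused descs keep their old (well-formed) contents.
         */
        for (i = 0; i < PTE_LIST_EXT; i++)
                desc->sptes[i] = NULL;
}

        pte_list_desc_cache = kmem_cache_create("pte_list_desc",
                                        sizeof(struct pte_list_desc), 0,
                                        SLAB_DESTROY_BY_RCU, pte_list_desc_ctor);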



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-09-02 Thread Xiao Guangrong
On 08/30/2013 07:38 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 07:26:40PM +0800, Xiao Guangrong wrote:
>> On 08/29/2013 05:51 PM, Gleb Natapov wrote:
>>> On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
>>>>> As Documentation/RCU/whatisRCU.txt says:
>>>>>
>>>>> As with rcu_assign_pointer(), an important function of
>>>>> rcu_dereference() is to document which pointers are protected by
>>>>> RCU, in particular, flagging a pointer that is subject to changing
>>>>> at any time, including immediately after the rcu_dereference().
>>>>> And, again like rcu_assign_pointer(), rcu_dereference() is
>>>>> typically used indirectly, via the _rcu list-manipulation
>>>>> primitives, such as list_for_each_entry_rcu().
>>>>>
>>>>> The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
>>>>> important. The code is complicated, so self documentation will not hurt.
>>>>> I want to see what is actually protected by rcu here. Freeing shadow
>>>>> pages with call_rcu() further complicates matters: does it mean that
>>>>> shadow pages are also protected by rcu? 
>>>>
>>>> Yes, it stops shadow page to be freed when we do write-protection on
>>>> it.
>>>>
>>> Yeah, I got the trick, what I am saying that we have a data structure
>>> here protected by RCU, but we do not use RCU functions to access it...
>>
>> Yes, they are not used when insert a spte into rmap and get the rmap from
>> the entry... but do we need to use these functions to guarantee the order?
>>
>> The worst case is, we fetch the spte from the desc but the spte is not
>> updated yet, we can happily skip this spte since it will set the
>> dirty-bitmap later, this is guaranteed by the barrier between 
>> mmu_spte_update()
>> and mark_page_dirty(), the code is:
>>
>> set_spte():
>>
>>  if (mmu_spte_update(sptep, spte))
>>  kvm_flush_remote_tlbs(vcpu->kvm);
>>
>>  if (!remap) {
>>  if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
>>  rmap_recycle(vcpu, sptep, gfn);
>>
>>  if (level > PT_PAGE_TABLE_LEVEL)
>>  ++vcpu->kvm->stat.lpages;
>>  }
>>
>>  smp_wmb();
>>
>>  if (pte_access & ACC_WRITE_MASK)
>>  mark_page_dirty(vcpu->kvm, gfn);
>>
>> So, i guess if we can guaranteed the order by ourself, we do not need
>> to call the rcu functions explicitly...
>>
>> But, the memory barres in the rcu functions are really light on x86 (store
>> can not be reordered with store), so i do not mind to explicitly use them
>> if you think this way is more safe. :)
>>
> I think the self documentation aspect of using rcu function is also
> important.

Okay. I will use these rcu functions and measure them to see whether they
cause a performance issue.

> 
>>> BTW why not allocate sp->spt from SLAB_DESTROY_BY_RCU cache too? We may
>>> switch write protection on a random spt occasionally if page is deleted
>>> and reused for another spt though. For last level spt it should not be a
>>> problem and for non last level we have is_last_spte() check in
>>> __rmap_write_protect_lockless(). Can it work?
>>
>> Yes, i also considered this way. It can work if we handle is_last_spte()
>> properly. Since the sp->spte can be reused, we can not get the mapping
>> level from sp. We need to encode the mapping level into spte so that
>> cmpxhg can understand if the page table has been moved to another mapping
>> level.
> Isn't one bit that says that spte is the last one enough? IIRC we
> have one more ignored bit to spare in spte.

Right. But I also want to use this approach in fast_page_fault, where the mapping
level is needed.

> 
>> Could you allow me to make this optimization separately after this
>> patchset be merged?
>>
> If you think it will complicate the initial version I am fine with
> postponing it for later.

Thank you, Gleb!




Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-29 Thread Xiao Guangrong
On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
> On 08/29/2013 05:31 PM, Gleb Natapov wrote:
>> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>>> After more thinking, I still think rcu_assign_pointer() is unneeded when a 
>>> entry
>>> is removed. The remove-API does not care the order between unlink the entry 
>>> and
>>> the changes to its fields. It is the caller's responsibility:
>>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), 
>>> etc to
>>>   enforce all lookups exit and the later change on that entry is invisible 
>>> to the
>>>   lookups.
>>>
>>> - In the case of rculist_nulls, it seems refcounter is used to guarantee 
>>> the order
>>>   (see the example from Documentation/RCU/rculist_nulls.txt).
>>>
>>> - In our case, we allow the lookup to see the deleted desc even if it is in 
>>> slab cache
>>>   or its is initialized or it is re-added.
>>>
>> BTW is it a good idea? We can access deleted desc while it is allocated
>> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
>> see partially initialized desc->sptes[] entry? On related note what about
>> 32 bit systems, they do not have atomic access to desc->sptes[].

Ah... wait. desc is an array of pointers:

struct pte_list_desc {
u64 *sptes[PTE_LIST_EXT];
struct pte_list_desc *more;
};

assigning a pointer is always atomic, but we should carefully initialize it
as you said. I will introduce a constructor for the desc slab cache which initializes
the struct like this:

for (i = 0; i < PTE_LIST_EXT; i++)
desc->sptes[i] = NULL;

It is okay.

> 
> Good eyes. This is a bug here.
> 
> It seems we do not have a good to fix this. How disable this optimization on
> 32 bit host, small changes:
> 
>  static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
>  {
> +#ifdef CONFIG_X86_64
> rcu_read_lock();
> 
> kvm->arch.rcu_free_shadow_page = true;
> /* Set the indicator before access shadow page. */
> smp_mb();
> +#else
> +   spin_lock(kvm->mmu_lock);
> +#endif
>  }
> 
>  static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
>  {
> +#ifdef CONFIG_X86_64
> /* Make sure that access shadow page has finished. */
> smp_mb();
> kvm->arch.rcu_free_shadow_page = false;
> 
> rcu_read_unlock();
> +#else
> +   spin_unlock(kvm->mmu_lock);
> +#endif
>  }
> 
> 



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-29 Thread Xiao Guangrong
On 08/29/2013 05:31 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>> After more thinking, I still think rcu_assign_pointer() is unneeded when a 
>> entry
>> is removed. The remove-API does not care the order between unlink the entry 
>> and
>> the changes to its fields. It is the caller's responsibility:
>> - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc 
>> to
>>   enforce all lookups exit and the later change on that entry is invisible 
>> to the
>>   lookups.
>>
>> - In the case of rculist_nulls, it seems refcounter is used to guarantee the 
>> order
>>   (see the example from Documentation/RCU/rculist_nulls.txt).
>>
>> - In our case, we allow the lookup to see the deleted desc even if it is in 
>> slab cache
>>   or its is initialized or it is re-added.
>>
> BTW is it a good idea? We can access deleted desc while it is allocated
> and initialized to zero by kmem_cache_zalloc(), are we sure we cannot
> see partially initialized desc->sptes[] entry? On related note what about
> 32 bit systems, they do not have atomic access to desc->sptes[].

Good eyes. This is a bug here.

It seems we do not have a good way to fix this. How about disabling this optimization
on 32-bit hosts with a small change:

 static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
 {
+#ifdef CONFIG_X86_64
rcu_read_lock();

kvm->arch.rcu_free_shadow_page = true;
/* Set the indicator before access shadow page. */
smp_mb();
+#else
+   spin_lock(kvm->mmu_lock);
+#endif
 }

 static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
 {
+#ifdef CONFIG_X86_64
/* Make sure that access shadow page has finished. */
smp_mb();
kvm->arch.rcu_free_shadow_page = false;

rcu_read_unlock();
+#else
+   spin_unlock(kvm->mmu_lock);
+#endif
 }




Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-29 Thread Xiao Guangrong
On 08/29/2013 05:51 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
>>> As Documentation/RCU/whatisRCU.txt says:
>>>
>>> As with rcu_assign_pointer(), an important function of
>>> rcu_dereference() is to document which pointers are protected by
>>> RCU, in particular, flagging a pointer that is subject to changing
>>> at any time, including immediately after the rcu_dereference().
>>> And, again like rcu_assign_pointer(), rcu_dereference() is
>>> typically used indirectly, via the _rcu list-manipulation
>>> primitives, such as list_for_each_entry_rcu().
>>>
>>> The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
>>> important. The code is complicated, so self documentation will not hurt.
>>> I want to see what is actually protected by rcu here. Freeing shadow
>>> pages with call_rcu() further complicates matters: does it mean that
>>> shadow pages are also protected by rcu? 
>>
>> Yes, it stops shadow page to be freed when we do write-protection on
>> it.
>>
> Yeah, I got the trick, what I am saying that we have a data structure
> here protected by RCU, but we do not use RCU functions to access it...

Yes, they are not used when inserting a spte into the rmap and getting the rmap from
the entry... but do we need to use these functions to guarantee the order?

The worst case is that we fetch the spte from the desc but the spte is not
updated yet; we can happily skip this spte since the dirty-bitmap will be set
later. This is guaranteed by the barrier between mmu_spte_update()
and mark_page_dirty(); the code is:

set_spte():

if (mmu_spte_update(sptep, spte))
kvm_flush_remote_tlbs(vcpu->kvm);

if (!remap) {
if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
rmap_recycle(vcpu, sptep, gfn);

if (level > PT_PAGE_TABLE_LEVEL)
++vcpu->kvm->stat.lpages;
}

smp_wmb();

if (pte_access & ACC_WRITE_MASK)
mark_page_dirty(vcpu->kvm, gfn);

So, I guess if we can guarantee the order by ourselves, we do not need
to call the rcu functions explicitly...

But the memory barriers in the rcu functions are really light on x86 (a store
cannot be reordered with another store), so I do not mind using them explicitly
if you think this way is safer. :)

> BTW why not allocate sp->spt from SLAB_DESTROY_BY_RCU cache too? We may
> switch write protection on a random spt occasionally if page is deleted
> and reused for another spt though. For last level spt it should not be a
> problem and for non last level we have is_last_spte() check in
> __rmap_write_protect_lockless(). Can it work?

Yes, I also considered this approach. It can work if we handle is_last_spte()
properly. Since the sp->spt page can be reused, we cannot get the mapping
level from the sp. We need to encode the mapping level into the spte so that
cmpxchg can tell whether the page table has been moved to another mapping
level. Could you allow me to make this optimization separately after this
patchset is merged?






Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-29 Thread Xiao Guangrong
On 08/29/2013 05:08 PM, Gleb Natapov wrote:
> On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
>>>>> BTW I do not see
>>>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>>>
>>>> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
>>>> p = v to assign a pointer to a pointer. But in our case, we need:
>>>>*pte_list = (unsigned long)desc | 1;
>>> >From Documentation/RCU/whatisRCU.txt:
>>>
>>> The updater uses this function to assign a new value to an RCU-protected 
>>> pointer.
>>>
>>> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects 
>>> here)
>>> The fact that the value is not correct pointer should not matter.
>>>
>>
>> Okay. Will change that code to:
>>
>> +
>> +#define rcu_assign_head_desc(pte_list_p, value)\
>> +   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned 
>> long *)(value))
>> +
>>  /*
>>   * Pte mapping structures:
>>   *
>> @@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 
>> *spte,
>> desc->sptes[1] = spte;
>> desc_mark_nulls(pte_list, desc);
>>
>> -   /*
>> -* Esure the old spte has been updated into desc, so
>> -* that the another side can not get the desc from pte_list
>> -* but miss the old spte.
>> -*/
>> -   smp_wmb();
>> -
>> -   *pte_list = (unsigned long)desc | 1;
>> +   rcu_assign_head_desc(pte_list, (unsigned long)desc | 1);
>>
>>>>
>>>> So i add the smp_wmb() by myself:
>>>>/*
>>>> * Esure the old spte has been updated into desc, so
>>>> * that the another side can not get the desc from pte_list
>>>> * but miss the old spte.
>>>> */
>>>>smp_wmb();
>>>>
>>>>*pte_list = (unsigned long)desc | 1;
>>>>
>>>> But i missed it when inserting a empty desc, in that case, we need the 
>>>> barrier
>>>> too since we should make desc->more visible before assign it to pte_list to
>>>> avoid the lookup side seeing the invalid "nulls".
>>>>
>>>> I also use own code instead of rcu_dereference():
>>>> pte_list_walk_lockless():
>>>>pte_list_value = ACCESS_ONCE(*pte_list);
>>>>if (!pte_list_value)
>>>>return;
>>>>
>>>>if (!(pte_list_value & 1))
>>>>return fn((u64 *)pte_list_value);
>>>>
>>>>/*
>>>> * fetch pte_list before read sptes in the desc, see the comments
>>>> * in pte_list_add().
>>>> *
>>>> * There is the data dependence since the desc is got from pte_list.
>>>> */
>>>>smp_read_barrier_depends();
>>>>
>>>> That part can be replaced by rcu_dereference().
>>>>
>>> Yes please, also see commit c87a124a5d5e8cf8e21c4363c3372bcaf53ea190 for
>>> kind of scary bugs we can get here.
>>
>> Right, it is likely trigger-able in our case, will fix it.
>>
>>>
>>>>> incorrect usage of RCU. I think any access to slab pointers will need to
>>>>> use those.
>>>>
>>>> Remove desc is not necessary i think since we do not mind to see the old
>>>> info. (hlist_nulls_del_rcu() does not use rcu_dereference() too)
>>>>
>>> May be a bug. I also noticed that rculist_nulls uses rcu_dereference()
>>
>> But list_del_rcu() does not use rcu_assign_pointer() too.
>>
> This also suspicious.
> 
>>> to access ->next, but it does not use rcu_assign_pointer() pointer to
>>> assign it.
>>
>> You mean rcu_dereference() is used in hlist_nulls_for_each_entry_rcu()? I 
>> think
>> it's because we should validate the prefetched data before entry->next is
>> accessed, it is paired with the barrier in rcu_assign_pointer() when add a
>> new entry into the list. rcu_assign_pointer() make other fields in the entry
>> be visible before linking entry to the list. Otherwise, the lookup can access
>> that entry but get the invalid fields.
>>
>> After more thinking, I still think rcu_assign_pointer() is unneeded when a 

Re: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-08-29 Thread Xiao Guangrong
On 08/29/2013 05:10 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:08PM +0800, Xiao Guangrong wrote:
>> It is easy if the handler is in the vcpu context, in that case we can use
>> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end() that
>> disable interrupt to stop shadow page be freed. But we are on the ioctl 
>> context
>> and the paths we are optimizing for have heavy workload, disabling interrupt 
>> is
>> not good for the system performance
>>
>> We add a indicator into kvm struct (kvm->arch.rcu_free_shadow_page), then use
>> call_rcu() to free the shadow page if that indicator is set. Set/Clear the
>> indicator are protected by slot-lock, so it need not be atomic and does not
>> hurt the performance and the scalability
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/include/asm/kvm_host.h |  6 +-
>>  arch/x86/kvm/mmu.c  | 23 +++
>>  arch/x86/kvm/mmu.h  | 22 ++
>>  3 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h 
>> b/arch/x86/include/asm/kvm_host.h
>> index 531f47c..dc842b6 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -226,7 +226,10 @@ struct kvm_mmu_page {
>>  /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
>>  unsigned long mmu_valid_gen;
>>  
>> -DECLARE_BITMAP(unsync_child_bitmap, 512);
>> +union {
>> +DECLARE_BITMAP(unsync_child_bitmap, 512);
>> +struct rcu_head rcu;
>> +};
>>  
>>  #ifdef CONFIG_X86_32
>>  /*
>> @@ -545,6 +548,7 @@ struct kvm_arch {
>>   */
>>  struct list_head active_mmu_pages;
>>  struct list_head zapped_obsolete_pages;
>> +bool rcu_free_shadow_page;
>>  
>>  struct list_head assigned_dev_head;
>>  struct iommu_domain *iommu_domain;
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index f8fc0cc..7f3391f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2322,6 +2322,22 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
>> struct kvm_mmu_page *sp,
>>  return ret;
>>  }
>>  
>> +static void free_pages_rcu(struct rcu_head *head)
>> +{
>> +struct kvm_mmu_page *next, *sp;
>> +
>> +sp = container_of(head, struct kvm_mmu_page, rcu);
>> +while (sp) {
>> +if (!list_empty(&sp->link))
>> +next = list_first_entry(&sp->link,
>> +  struct kvm_mmu_page, link);
>> +else
>> +next = NULL;
>> +kvm_mmu_free_page(sp);
> So here we are calling kvm_mmu_free_page() without holding mmu lock, why
> is it safe?

Oops. :(

I should move "hlist_del(&sp->hash_link);" from this function to
kvm_mmu_prepare_zap_page(), after that kvm_mmu_free_page() will not
touch global resource anymore.
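
For illustration, the zap-side choice that the kvm->arch.rcu_free_shadow_page
indicator enables could be sketched like this (illustrative, based on the
description in this patch rather than its exact code):

        /* Free directly unless a lockless walker may still be using the page. */
        if (kvm->arch.rcu_free_shadow_page)
                call_rcu(&sp->rcu, free_pages_rcu);
        else
                kvm_mmu_free_page(sp);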




Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 09:36 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>>>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>>>>>> Or what if desc is moved to another rmap, but then it
>>>>>>> is moved back to initial rmap (but another place in the desc list) so
>>>>>>> the check here will not catch that we need to restart walking?
>>>>>>
>>>>>> It is okay. We always add the new desc to the head, then we will walk
>>>>>> all the entires under this case.
>>>>>>
>>>>> Which races another question: What if desc is added in front of the list
>>>>> behind the point where lockless walker currently is?
>>>>
>>>> That case is new spte is being added into the rmap. We need not to care the
>>>> new sptes since it will set the dirty-bitmap then they can be 
>>>> write-protected
>>>> next time.
>>>>
>>> OK.
>>>
>>>>>
>>>>>> Right?
>>>>> Not sure. While lockless walker works on a desc rmap can be completely
>>>>> destroyed and recreated again. It can be any order.
>>>>
>>>> I think the thing is very similar as include/linux/rculist_nulls.h
>>> include/linux/rculist_nulls.h is for implementing hash tables, so they
>>> may not care about add/del/lookup race for instance, but may be we are
>>> (you are saying above that we are not), so similarity does not prove
>>> correctness for our case. 
>>
>> We do not care the "add" and "del" too when lookup the rmap. Under the "add"
>> case, it is okay, the reason i have explained above. Under the "del" case,
>> the spte becomes unpresent and flush all tlbs immediately, so it is also 
>> okay.
>>
>> I always use a stupid way to check the correctness, that is enumerating
>> all cases we may meet, in this patch, we may meet these cases:
>>
>> 1) kvm deletes the desc before we are current on
>>that descs have been checked, do not need to care it.
>>
>> 2) kvm deletes the desc after we are currently on
>>Since we always add/del the head desc, we can sure the current desc has 
>> been
>>deleted, then we will meet case 3).
>>
>> 3) kvm deletes the desc that we are currently on
>>3.a): the desc stays in slab cache (do not be reused).
>>  all spte entires are empty, then the fn() will skip the nonprsent 
>> spte,
>>  and desc->more is
>>  3.a.1) still pointing to next-desc, then we will continue the lookup
>>  3.a.2) or it is the "nulls list", that means we reach the last one,
>> then finish the walk.
>>
>>3.b): the desc is alloc-ed from slab cache and it's being initialized.
>>  we will see "desc->more == NULL" then restart the walking. It's 
>> okay.
>>
>>3.c): the desc is added to rmap or pte_list again.
>>  3.c.1): the desc is added to the current rmap again.
>>   the new desc always acts as the head desc, then we will walk
>>  all entries, some entries are double checked and not entry
>>  can be missed. It is okay.
>>
>>  3.c.2): the desc is added to another rmap or pte_list
>>  since kvm_set_memory_region() and get_dirty are serial by 
>> slots-lock.
>>  so the "nulls" can not be reused during lookup. Then we we 
>> will
>>  meet the different "nulls" at the end of walking that will 
>> cause
>>  rewalk.
>>
>> I know check the algorithm like this is really silly, do you have other idea?
>>
> Not silly, but easy to miss cases. For instance 3.c.3 can be:
>  the desc is added to another rmap then we move to another desc on the
>  wrong rmap, this other desc is also deleted and reinserted into
>  original rmap. Seams like justification from 3.c.1 applies to that to
>  though.
> 
>>> BTW I do not see
>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>
>> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
>> p = v to assign a pointer to a pointer. But in our case, we need:
>>

Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>>>> Or what if desc is moved to another rmap, but then it
>>>>> is moved back to initial rmap (but another place in the desc list) so
>>>>> the check here will not catch that we need to restart walking?
>>>>
>>>> It is okay. We always add the new desc to the head, then we will walk
>>>> all the entires under this case.
>>>>
>>> Which races another question: What if desc is added in front of the list
>>> behind the point where lockless walker currently is?
>>
>> That case is new spte is being added into the rmap. We need not to care the
>> new sptes since it will set the dirty-bitmap then they can be write-protected
>> next time.
>>
> OK.
> 
>>>
>>>> Right?
>>> Not sure. While lockless walker works on a desc rmap can be completely
>>> destroyed and recreated again. It can be any order.
>>
>> I think the thing is very similar as include/linux/rculist_nulls.h
> include/linux/rculist_nulls.h is for implementing hash tables, so they
> may not care about add/del/lookup race for instance, but may be we are
> (you are saying above that we are not), so similarity does not prove
> correctness for our case. 

We do not care the "add" and "del" too when lookup the rmap. Under the "add"
case, it is okay, the reason i have explained above. Under the "del" case,
the spte becomes unpresent and flush all tlbs immediately, so it is also okay.

I always use a stupid way to check the correctness, that is enumerating
all cases we may meet, in this patch, we may meet these cases:

1) kvm deletes the desc before we are current on
   that descs have been checked, do not need to care it.

2) kvm deletes the desc after we are currently on
   Since we always add/del the head desc, we can sure the current desc has been
   deleted, then we will meet case 3).

3) kvm deletes the desc that we are currently on
   3.a): the desc stays in slab cache (do not be reused).
 all spte entires are empty, then the fn() will skip the nonprsent spte,
 and desc->more is
 3.a.1) still pointing to next-desc, then we will continue the lookup
 3.a.2) or it is the "nulls list", that means we reach the last one,
then finish the walk.

   3.b): the desc is alloc-ed from slab cache and it's being initialized.
 we will see "desc->more == NULL" then restart the walking. It's okay.

   3.c): the desc is added to rmap or pte_list again.
 3.c.1): the desc is added to the current rmap again.
 the new desc always acts as the head desc, then we will walk
 all entries, some entries are double checked and not entry
 can be missed. It is okay.

 3.c.2): the desc is added to another rmap or pte_list
 since kvm_set_memory_region() and get_dirty are serial by 
slots-lock.
 so the "nulls" can not be reused during lookup. Then we we will
 meet the different "nulls" at the end of walking that will 
cause
 rewalk.

I know check the algorithm like this is really silly, do you have other idea?

> BTW I do not see
> rcu_assign_pointer()/rcu_dereference() in your patches which hints on

IIUC, we cannot directly use rcu_assign_pointer(); that is something like
p = v, assigning a pointer to a pointer. But in our case, we need:
   *pte_list = (unsigned long)desc | 1;

So I added the smp_wmb() myself:
	/*
	 * Ensure the old spte has been updated into the desc, so
	 * that the other side can not get the desc from pte_list
	 * but miss the old spte.
	 */
	smp_wmb();

	*pte_list = (unsigned long)desc | 1;

But I missed it when inserting an empty desc; in that case, we need the barrier
too, since we should make desc->more visible before assigning it to pte_list to
avoid the lookup side seeing the invalid "nulls".

I also used my own code instead of rcu_dereference():
pte_list_walk_lockless():
pte_list_value = ACCESS_ONCE(*pte_list);
if (!pte_list_value)
return;

if (!(pte_list_value & 1))
return fn((u64 *)pte_list_value);

/*
 * fetch pte_list before read sptes in the desc, see the comments
 * in pte_list_add().
 *
 * There is the data dependence since the desc is got from pte_list.
 */
smp_read_barrier_depends();

That part can be replaced by rcu_dereference().

Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>> Or what if desc is moved to another rmap, but then it
>>> is moved back to initial rmap (but another place in the desc list) so
>>> the check here will not catch that we need to restart walking?
>>
>> It is okay. We always add the new desc to the head, then we will walk
>> all the entires under this case.
>>
> Which races another question: What if desc is added in front of the list
> behind the point where lockless walker currently is?

That is the case where a new spte is being added into the rmap. We need not care about
the new sptes, since the dirty-bitmap will be set and they can be write-protected
next time.

> 
>> Right?
> Not sure. While lockless walker works on a desc rmap can be completely
> destroyed and recreated again. It can be any order.

I think the situation is very similar to include/linux/rculist_nulls.h.



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 05:20 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote:
>> The basic idea is from nulls list which uses a nulls to indicate
>> whether the desc is moved to different pte-list
>>
>> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c | 51 ++-
>>  1 file changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 36caf6a..f8fc0cc 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 
>> *spte,
>>  desc->sptes[0] = (u64 *)*pte_list;
>>  desc->sptes[1] = spte;
>>  desc_mark_nulls(pte_list, desc);
>> +
>> +/*
>> + * Esure the old spte has been updated into desc, so
>> + * that the another side can not get the desc from pte_list
>> + * but miss the old spte.
>> + */
>> +smp_wmb();
>> +
>>  *pte_list = (unsigned long)desc | 1;
>>  return 1;
>>  }
>> @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, 
>> pte_list_walk_fn fn)
>>  WARN_ON(desc_get_nulls_value(desc) != pte_list);
>>  }
>>  
>> +/* The caller should hold rcu lock. */
>> +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level);
>> +static void pte_list_walk_lockless(unsigned long *pte_list,
>> +   pte_list_walk_lockless_fn fn, int level)
>> +{
>> +struct pte_list_desc *desc;
>> +unsigned long pte_list_value;
>> +int i;
>> +
>> +restart:
>> +pte_list_value = ACCESS_ONCE(*pte_list);
>> +if (!pte_list_value)
>> +return;
>> +
>> +if (!(pte_list_value & 1))
>> +return fn((u64 *)pte_list_value, level);
>> +
>> +/*
>> + * fetch pte_list before read sptes in the desc, see the comments
>> + * in pte_list_add().
>> + *
>> + * There is the data dependence since the desc is got from pte_list.
>> + */
>> +smp_read_barrier_depends();
>> +
>> +desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>> +while (!desc_is_a_nulls(desc)) {
>> +for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
>> +fn(desc->sptes[i], level);
>> +
>> +desc = ACCESS_ONCE(desc->more);
>> +
>> +/* It is being initialized. */
>> +if (unlikely(!desc))
>> +goto restart;
>> +}
>> +
>> +if (unlikely(desc_get_nulls_value(desc) != pte_list))
>> +goto restart;
> So is it really enough to guaranty safety and correctness? What if desc
> is moved to another rmap while we walking it so fn() is called on
> incorrect sptes? 

Then fn() will do needless write-protection. It is unnecessary work,
but it is acceptable for this rare case.

There is a bug in that we cannot detect the mapping level from the rmap since
the desc can be moved, as you said; it can cause us to do write-protection
on a middle-level spte. We can fix it by getting the mapping level from sp->role.level,
since the sp cannot be reused while rcu is held.

> Or what if desc is moved to another rmap, but then it
> is moved back to initial rmap (but another place in the desc list) so
> the check here will not catch that we need to restart walking?

It is okay. We always add the new desc to the head, so we will walk
all the entries in this case.

Right?
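
For illustration, the nulls encoding that the quoted patch relies on could look
roughly like this (the helper names follow the patch, but these bodies are only an
assumed encoding that reuses bit 0 as the tag):

static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
{
        /* The last desc points back at its pte_list head, tagged by bit 0. */
        desc->more = (struct pte_list_desc *)((unsigned long)pte_list | 1ul);
}

static bool desc_is_a_nulls(struct pte_list_desc *desc)
{
        return (unsigned long)desc & 1ul;
}

static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
{
        return (unsigned long *)((unsigned long)desc & ~1ul);
}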



Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 04:58 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
>>
>>>> +
>>>> +  rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>>>> +  desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +
>>>> +  /* No empty position in the desc. */
>>>> +  if (desc->sptes[PTE_LIST_EXT - 1]) {
>>>> +  struct pte_list_desc *new_desc;
>>>> +  new_desc = mmu_alloc_pte_list_desc(vcpu);
>>>> +  new_desc->more = desc;
>>>> +  desc = new_desc;
>>>> +  *pte_list = (unsigned long)desc | 1;
>>>>}
>>>> -  return count;
>>>> +
>>>> +  free_pos = find_first_free(desc);
>>>> +  desc->sptes[free_pos] = spte;
>>>> +  return count_spte_number(desc);
>>> Should it be count_spte_number(desc) - 1? The function should returns
>>> the number of pte entries before the spte was added.
>>
>> Yes. We have handled it count_spte_number(), we count the number like this:
>>
>>  return first_free + desc_num * PTE_LIST_EXT;
>>
>> The first_free is indexed from 0.
>>
> Suppose when pte_list_add() is called there is one full desc, so the
> number that should be returned is PTE_LIST_EXT, correct? But since
> before calling count_spte_number() one more desc will be added and
> desc->sptes[0] will be set in it the first_free in count_spte_number
> will be 1 and PTE_LIST_EXT + 1 will be returned.

Oh, yes, you are right. Will fix it in the next version; thanks for
pointing it out.

> 
>> Maybe it is clearer to let count_spte_number() return the real number.
>>
>>>
>>>>  }
>>>>  
>>>>  static void
>>>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc 
>>>> *desc,
>>>> - int i, struct pte_list_desc *prev_desc)
>>>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>>>> + struct pte_list_desc *desc, int i)
>>>>  {
>>>> -  int j;
>>>> +  struct pte_list_desc *first_desc;
>>>> +  int last_used;
>>>> +
>>>> +  first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +  last_used = find_last_used(first_desc);
>>>>  
>>>> -  for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>>>> -  ;
>>>> -  desc->sptes[i] = desc->sptes[j];
>>>> -  desc->sptes[j] = NULL;
>>>> -  if (j != 0)
>>>> +  /*
>>>> +   * Move the entry from the first desc to this position we want
>>>> +   * to remove.
>>>> +   */
>>>> +  desc->sptes[i] = first_desc->sptes[last_used];
>>>> +  first_desc->sptes[last_used] = NULL;
>>>> +
>>> What if desc == first_desc and i < last_used. You still move spte
>>> backwards so lockless walk may have already examined entry at i and
>>> will miss spte that was moved there from last_used position, no?
>>
>> Right. I noticed it too and fixed in the v2 which is being tested.
>> I fixed it by bottom-up walk desc, like this:
>>
>> pte_list_walk_lockless():
>>
>>  desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>>  while (!desc_is_a_nulls(desc)) {
>>  /*
>>   * We should do bottom-up walk since we always use the
>>   * bottom entry to replace the deleted entry if only
>>   * one desc is used in the rmap when a spte is removed.
>>   * Otherwise the moved entry will be missed.
>>   */
> I would call it top-down walk since we are walking from big indices to
> smaller once.

Okay, will fix the comments.

> 
>>  for (i = PTE_LIST_EXT - 1; i >= 0; i--)
>>  fn(desc->sptes[i]);
>>
>>  desc = ACCESS_ONCE(desc->more);
>>
>>  /* It is being initialized. */
>>  if (unlikely(!desc))
>>  goto restart;
>>  }
>>
>> How about this?
>>
> Tricky, very very tricky :)
> 
>>>
>>>> +  /* No valid entry in this desc, we can free this desc now. */
>>>> +  if (!first_desc->sptes[0]) {
>>>> +  struct pte_list_desc *next_desc = first_desc->more;
>>>> +
>>>> +  

Re: [PATCH 08/12] KVM: MMU: introduce nulls desc

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 04:40 PM, Gleb Natapov wrote:

>>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>> @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
>> rmap_iterator *iter)
>>   */
>>  static u64 *rmap_get_next(struct rmap_iterator *iter)
>>  {
>> -if (iter->desc) {
>> +if (iter->desc && !desc_is_a_nulls(iter->desc)) {
>>  if (iter->pos < PTE_LIST_EXT - 1) {
>>  u64 *sptep;
>>  
>> @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>>  
>>  iter->desc = iter->desc->more;
>>  
> I'd rather do:
>   iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc;
> here and drop two desc_is_a_nulls() checks in this function.

Nice, will do it in the next version. Thanks!



Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 04:12 PM, Gleb Natapov wrote:

>> +
>> +rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>> +desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>> +
>> +/* No empty position in the desc. */
>> +if (desc->sptes[PTE_LIST_EXT - 1]) {
>> +struct pte_list_desc *new_desc;
>> +new_desc = mmu_alloc_pte_list_desc(vcpu);
>> +new_desc->more = desc;
>> +desc = new_desc;
>> +*pte_list = (unsigned long)desc | 1;
>>  }
>> -return count;
>> +
>> +free_pos = find_first_free(desc);
>> +desc->sptes[free_pos] = spte;
>> +return count_spte_number(desc);
> Should it be count_spte_number(desc) - 1? The function should returns
> the number of pte entries before the spte was added.

Yes. We have handled it in count_spte_number(); we count the number like this:

return first_free + desc_num * PTE_LIST_EXT;

The first_free is indexed from 0.

Maybe it would be clearer to let count_spte_number() return the real number.
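
For illustration, the counting could be sketched like this (assuming, as
discussed elsewhere in the thread, that only the head desc can be partially
filled and that find_first_free() returns the index of the first empty slot;
this is not the exact patch):

static int count_spte_number(struct pte_list_desc *desc)
{
        int first_free, desc_num = 0;

        /* Only the head desc may contain empty slots. */
        first_free = find_first_free(desc);

        for (desc = desc->more; !desc_is_a_nulls(desc); desc = desc->more)
                desc_num++;

        return first_free + desc_num * PTE_LIST_EXT;
}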

> 
>>  }
>>  
>>  static void
>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc 
>> *desc,
>> -   int i, struct pte_list_desc *prev_desc)
>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>> +   struct pte_list_desc *desc, int i)
>>  {
>> -int j;
>> +struct pte_list_desc *first_desc;
>> +int last_used;
>> +
>> +first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>> +last_used = find_last_used(first_desc);
>>  
>> -for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>> -;
>> -desc->sptes[i] = desc->sptes[j];
>> -desc->sptes[j] = NULL;
>> -if (j != 0)
>> +/*
>> + * Move the entry from the first desc to this position we want
>> + * to remove.
>> + */
>> +desc->sptes[i] = first_desc->sptes[last_used];
>> +first_desc->sptes[last_used] = NULL;
>> +
> What if desc == first_desc and i < last_used. You still move spte
> backwards so lockless walk may have already examined entry at i and
> will miss spte that was moved there from last_used position, no?

Right. I noticed it too and fixed it in v2, which is being tested.
I fixed it by walking the desc bottom-up, like this:

pte_list_walk_lockless():

desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
while (!desc_is_a_nulls(desc)) {
/*
 * We should do bottom-up walk since we always use the
 * bottom entry to replace the deleted entry if only
 * one desc is used in the rmap when a spte is removed.
 * Otherwise the moved entry will be missed.
 */
for (i = PTE_LIST_EXT - 1; i >= 0; i--)
fn(desc->sptes[i]);

desc = ACCESS_ONCE(desc->more);

/* It is being initialized. */
if (unlikely(!desc))
goto restart;
}

How about this?

> 
>> +/* No valid entry in this desc, we can free this desc now. */
>> +if (!first_desc->sptes[0]) {
>> +struct pte_list_desc *next_desc = first_desc->more;
>> +
>> +/*
>> + * Only one entry existing but still use a desc to store it?
>> + */
>> +WARN_ON(!next_desc);
>> +
>> +mmu_free_pte_list_desc(first_desc);
>> +first_desc = next_desc;
>> +*pte_list = (unsigned long)first_desc | 1ul;
>>  return;
>> -if (!prev_desc && !desc->more)
>> -*pte_list = (unsigned long)desc->sptes[0];
>> -else
>> -if (prev_desc)
>> -prev_desc->more = desc->more;
>> -else
>> -*pte_list = (unsigned long)desc->more | 1;
>> -mmu_free_pte_list_desc(desc);
>> +}
>> +
>> +WARN_ON(!first_desc->sptes[0]);
>> +
>> +/*
>> + * Only one entry in this desc, move the entry to the head
>> + * then the desc can be freed.
>> + */
>> +if (!first_desc->sptes[1] && !first_desc->more) {
>> +*pte_list = (unsigned long)first_desc->sptes[0];
>> +mmu_free_pte_list_desc(first_desc);
>> +}
>>  }
>>  
>>  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>>  {
>>  struct pte_list_desc *desc;
>> -struct pte_list_desc *prev_desc;
>>  int i;
>>  
>>  if (!*pte_list) {
>> -printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>> -BUG();
>> -} else if (!(*pte_list & 1)) {
>> +WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
> Why change BUG() to WARN() here and below?

WARN(1, "xxx") can replace two lines in the origin code. And personally,
i prefer WARN() to BUG() since sometimes BUG() can stop my box and i need to
get the full log by using kdump.

If you object it, i will change it back in the next version. :)

> 
>> +return;
>> +}
>> +
>> +if (!(*pte_list & 1)) {

Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 03:23 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
>> Relax the tlb flush condition since we will write-protect the spte out of mmu
>> lock. Note lockless write-protection only marks the writable spte to readonly
>> and the spte can be writable only if both SPTE_HOST_WRITEABLE and
>> SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
>>
>> This patch is used to avoid this kind of race:
>>
>>   VCPU 0 VCPU 1
>> lockless wirte protection:
>>   set spte.w = 0
>>  lock mmu-lock
>>
>>  write protection the spte to sync shadow 
>> page,
>>  see spte.w = 0, then without flush tlb
>>
>>   unlock mmu-lock
>>
>>  !!! At this point, the shadow page can 
>> still be
>>  writable due to the corrupt tlb entry
>>  Flush all TLB
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 58283bf..5a40564 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>>   * we always atomicly update it, see the comments in
>>   * spte_has_volatile_bits().
>>   */
>> -if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
>> +if (spte_is_locklessly_modifiable(old_spte) &&
>> +  !is_writable_pte(new_spte))
>>  ret = true;
> This will needlessly flush tlbs when dirty login is not in use (common
> case) and old spte is non writable. Can you estimate how serious the
> performance hit is?

If the spte was not write-protected due to dirty logging, it is always writable
when SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set. In other words,
spte_is_locklessly_modifiable(old_spte) is the same as
is_writable_pte(old_spte) in the common case.
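
For reference, the check spte_is_locklessly_modifiable() performs, per the
description above, is simply that both software-writable bits are still set
(a sketch, not a quote of the kernel source):

static bool spte_is_locklessly_modifiable(u64 spte)
{
        return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
                (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
}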

There are two cases that cause an unnecessary TLB flush:
1) the guest read-faults on a spte write-protected by dirty logging and uses a
   readonly host pfn to fix it.
   This is really rare since a read access on a readonly spte cannot trigger
   a #PF.

2) the guest requires write-protection caused by syncing a shadow page.
   This is only needed if EPT is disabled, and in most cases the guest
   has many sptes that need to be write-protected. An unnecessary TLB flush is
   rare here too.



Re: [RFC PATCH 00/12] KVM: MMU: locklessly wirte-protect

2013-08-08 Thread Xiao Guangrong
On 08/09/2013 01:38 AM, Paolo Bonzini wrote:
> Il 06/08/2013 15:16, Xiao Guangrong ha scritto:
>> Hi Gleb, Paolo, Marcelo, Takuya,
>>
>> Any comments or further comments? :)
> 
> It's not the easiest patch to review.  I've looked at it (beyond the
> small comments I have already posted), but it will take some time to
> digest it...

Thanks for your time, Paolo!

> 
> By the way, both I and Gleb will be on vacation next week.  I will read
> email, but I will not be able to apply patches or do pull requests.

Enjoy your vacation. :)



Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-08-08 Thread Xiao Guangrong
[ Posting again after adjusting the format, since the mailing list refused to
deliver my previous one. ]

On Aug 8, 2013, at 11:06 PM, Marcelo Tosatti  wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>>>> Make sure we can see the writable spte before the dirt bitmap is visible
>>>> 
>>>> We do this is for kvm_vm_ioctl_get_dirty_log() write-protects the spte 
>>>> based
>>>> on the dirty bitmap, we should ensure the writable spte can be found in 
>>>> rmap
>>>> before the dirty bitmap is visible. Otherwise, we cleared the dirty bitmap 
>>>> and
>>>> failed to write-protect the page
>>>> 
>>>> Signed-off-by: Xiao Guangrong 
>>>> ---
>>>> arch/x86/kvm/mmu.c | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>> 
>>> Can you explain why this is safe, with regard to the rule 
>>> at edde99ce05290e50 ?
>> 
>> BTW, this log fixed this case:
>> 
>> VCPU 0                            KVM migration control
>>
>>                                   write-protects all pages
>> #PF happens, then the page
>> becomes writable; set the dirty
>> bit on the bitmap
>>
>>                                   swap the bitmap, current bitmap is empty
>>
>> write the page (no dirty log)
>>
>>                                   stop the guest and push
>>                                   the remaining dirty pages
>> Stopped
>>                                   See current bitmap is empty, that means
>>                                   no page is dirty.
>>> 
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>> 
>> Actually, this rule is not complete true, there's the 3th case:
>> the window between write guest page and set dirty bitmap is valid.
>> In that window, page is write-free and not dirty logged.
>> 
>> This case is based on the fact that at the final step of live migration,
>> kvm should stop the guest and push the remaining dirty pages to the
>> destination.
>> 
>> They're some examples in the current code:
>> example 1, in fast_pf_fix_direct_spte():
>>  if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>>  /* The window in here... */
>>  mark_page_dirty(vcpu->kvm, gfn);
>> 
>> example 2, in kvm_write_guest_page():
>>  r = __copy_to_user((void __user *)addr + offset, data, len);
>>  if (r)
>>  return -EFAULT;
>>  /*
>>   * The window is here, the page is dirty but not logged in
>> * The bitmap.
>>   */
>>  mark_page_dirty(kvm, gfn);
>>  return 0;
>> 
>>> 
>>> Overall, please document what is the required order of operations for
>>> both set_spte and get_dirty_log and why this order is safe (right on top
>>> of the code).
>> 
>> Okay.
>> 
>> The order we required here is, we should 1) set spte to writable __before__
>> set the dirty bitmap and 2) add the spte into rmap __before__ set the dirty
>> bitmap.
>> 
>> The point 1) is the same as fast_pf_fix_direct_spte(), which i explained 
>> above.
>> The point 1) and 2) can ensure we can find the spte on rmap and see the spte 
>> is
>> writable when we do lockless write-protection, otherwise these cases will 
>> happen
>> 
>> VCPU 0   kvm ioctl doing get-dirty-pages
>> 
>> mark_page_dirty(gfn) which
>> set the gfn on the dirty maps
>>  mask = xchg(dirty_bitmap, 0)
>> 
>>  walk all gfns which set on "mask" and
>>  locklessly write-protect the gfn,
>>  then walk rmap, see no spte on that rmap
>>  
>> 
>> add the spte into rmap
>> 
>> !! Then the page can be freely wrote but not recorded in the dirty 
>> bitmap.
>> 
>> Or:
>> 
>> VCPU 0   kvm ioctl doing get-dirty-pages
>> 
>> mark_page_dirty(gfn) which
>> set the gfn on the dirty maps
>> 
>> add spte into rmap
>>  mask = xchg(dirty_bitmap, 0)
>> 
>>  walk all gfns which set on "mask&

Re: [PATCH 10/12] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-08-07 Thread Xiao Guangrong
On 08/07/2013 09:09 PM, Takuya Yoshikawa wrote:
> On Tue, 30 Jul 2013 21:02:08 +0800
> Xiao Guangrong  wrote:
> 
>> @@ -2342,6 +2358,13 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>>   */
>>  kvm_flush_remote_tlbs(kvm);
>>  
>> +if (kvm->arch.rcu_free_shadow_page) {
>> +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> +list_del_init(invalid_list);
>> +call_rcu(&sp->rcu, free_pages_rcu);
>> +return;
>> +}
>> +
>>  list_for_each_entry_safe(sp, nsp, invalid_list, link) {
>>  WARN_ON(!sp->role.invalid || sp->root_count);
>>  kvm_mmu_free_page(sp);
> 
> Shouldn't we avoid calling call_rcu() when we are holding mmu_lock?

Using call_rcu() to free pages is a rare case that happens only when lockless
write-protection overlaps with zapping shadow pages, so I think we do not need
to worry about this case too much.



Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-08-06 Thread Xiao Guangrong
On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
>> Make sure we can see the writable spte before the dirty bitmap is visible
>>
>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>> on the dirty bitmap, so we should ensure the writable spte can be found in the
>> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
>> but fail to write-protect the page
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Can you explain why this is safe, with regard to the rule 
> at edde99ce05290e50 ?

BTW, this patch fixes the following case:

VCPU 0                               KVM migration control

                                     write-protects all pages
#PF happens, then the page
becomes writable; set the dirty
bit on the bitmap

                                     swap the bitmap, current bitmap is empty

write the page (no dirty log)

                                     stop the guest and push
                                     the remaining dirty pages
Stopped
                                     See current bitmap is empty, that means
                                     no page is dirty.
> 
> "The rule is that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."

Actually, this rule is not completely true; there is a third case:
the window between writing the guest page and setting the dirty bitmap.
In that window, the page is freely writable but not yet dirty logged.

This case is based on the fact that at the final step of live migration,
kvm should stop the guest and push the remaining dirty pages to the
destination.

There are some examples in the current code:
example 1, in fast_pf_fix_direct_spte():
	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		/* The window in here... */
		mark_page_dirty(vcpu->kvm, gfn);

example 2, in kvm_write_guest_page():
	r = __copy_to_user((void __user *)addr + offset, data, len);
	if (r)
		return -EFAULT;
	/*
	 * The window is here, the page is dirty but not logged in
	 * the bitmap.
	 */
	mark_page_dirty(kvm, gfn);
	return 0;

> 
> Overall, please document what is the required order of operations for
> both set_spte and get_dirty_log and why this order is safe (right on top
> of the code).

Okay.

The order required here is: we should 1) set the spte to writable __before__
setting the dirty bitmap and 2) add the spte into the rmap __before__ setting
the dirty bitmap.

Point 1) is the same as in fast_pf_fix_direct_spte(), which I explained above.
Points 1) and 2) ensure we can find the spte in the rmap and see that it is
writable when we do the lockless write-protection; otherwise these cases can
happen:

VCPU 0                               kvm ioctl doing get-dirty-pages

mark_page_dirty(gfn), which
sets the gfn in the dirty bitmap

                                     mask = xchg(dirty_bitmap, 0)

                                     walk all gfns set in "mask" and
                                     locklessly write-protect each gfn,
                                     then walk its rmap, see no spte on that rmap

add the spte into the rmap

!! Then the page can be freely written but not recorded in the dirty bitmap.

Or:

VCPU 0                               kvm ioctl doing get-dirty-pages

mark_page_dirty(gfn), which
sets the gfn in the dirty bitmap

add the spte into the rmap

                                     mask = xchg(dirty_bitmap, 0)

                                     walk all gfns set in "mask" and
                                     locklessly write-protect each gfn,
                                     then walk its rmap, see the spte is in the
                                     rmap but it is readonly or nonpresent

mark the spte writable

!! Then the page can be freely written but not recorded in the dirty bitmap.

Hopefully I have clarified it; if you have any questions, please let me know.
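
To put the two rules in one place, here is a schematic of the ordering (a
simplified sketch only; the real code lives in set_spte()/rmap_add() on the vcpu
side and kvm_vm_ioctl_get_dirty_log() on the ioctl side, and the calls below
merely stand in for those steps):

	/* vcpu side: make the spte writable and reachable, then publish dirty */
	mmu_spte_update(sptep, new_spte);	/* 1) writable spte          */
	rmap_add(vcpu, sptep, gfn);		/* 2) and visible in rmap    */
	smp_wmb();				/* pairs with xchg() below   */
	mark_page_dirty(vcpu->kvm, gfn);	/* 3) only now set the bit   */

	/* ioctl side: xchg() is a full barrier, so any gfn seen in "mask"
	 * already has its writable spte visible to the rmap walk */
	mask = xchg(&dirty_bitmap[i], 0);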

> 
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 35d4b50..0fe56ad 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
>> *sptep,
>>  }
>>  }
>>  
>> -if (pte_access & ACC_WRITE_MASK)
>> -mark_page_dirty(vcpu->kvm, gfn);
>> -
>>  set_pte:
>>  if (mmu_spte_update(sptep, spte))
>>  kvm_flush_remote_tlbs(vcpu->kvm);
> 
> Here, there is a modified guest page without dirty log bit set (think
> another vcpu writing to the page via this spte).

This is okay since we call mark_page_dirty(vcpu->kvm, gfn) aft

Re: [RFC PATCH 00/12] KVM: MMU: locklessly write-protect

2013-08-06 Thread Xiao Guangrong
Hi Gleb, Paolo, Marcelo, Takuya,

Any comments or further comments? :)

On 07/30/2013 09:01 PM, Xiao Guangrong wrote:
> Background
> ==
> Currently, when mark memslot dirty logged or get dirty page, we need to
> write-protect large guest memory, it is the heavy work, especially, we need to
> hold mmu-lock which is also required by vcpu to fix its page table fault and
> mmu-notifier when host page is being changed. In the extreme cpu / memory used
> guest, it becomes a scalability issue.
> 
> This patchset introduces a way to locklessly write-protect guest memory.
> 
> Idea
> ==
> There are the challenges we meet and the ideas to resolve them.
> 
> 1) How to locklessly walk rmap?
> The first idea we got to prevent "desc" being freed when we are walking the
> rmap is using RCU. But when vcpu runs on shadow page mode or nested mmu mode,
> it updates the rmap really frequently.
> 
> So we use SLAB_DESTROY_BY_RCU to manage "desc" instead; it allows the object
> to be reused more quickly. We also store a "nulls" in the last "desc"
> (desc->more), which helps us detect whether the "desc" has been moved to another
> rmap, in which case we re-walk the rmap. I learned this idea from
> nulls-list.
> 
> Another issue is, when a spte is deleted from the "desc", another spte in the
> last "desc" will be moved to this position to replace the deleted one. If the
> deleted one has been accessed and we do not access the replaced one, the
> replaced one is missed when we do the lockless walk.
> To fix this case, we do not move the spte backward; instead, we move the entry
> forward: when a spte is deleted, we move the entry in the first desc to that
> position.
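
For illustration, a schematic of such a lockless walk follows; desc_is_nulls(),
nulls_to_rmapp() and handle_spte() are names assumed here only for the sketch,
and it assumes *rmapp already points to a desc chain (low bit set):

	static void walk_rmap_lockless(unsigned long *rmapp)
	{
		struct pte_list_desc *desc;
		int i;

		rcu_read_lock();
	restart:
		desc = (struct pte_list_desc *)(ACCESS_ONCE(*rmapp) & ~1ul);
		for (; !desc_is_nulls(desc); desc = ACCESS_ONCE(desc->more))
			for (i = 0; i < PTE_LIST_EXT; i++)
				if (desc->sptes[i])
					handle_spte(desc->sptes[i]);

		/*
		 * The descs are SLAB_DESTROY_BY_RCU, so the chain we just
		 * walked may have been recycled into another rmap under us;
		 * the nulls value in the last desc tells us where we really
		 * ended up, and a mismatch forces a restart.
		 */
		if (nulls_to_rmapp(desc) != rmapp)
			goto restart;
		rcu_read_unlock();
	}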
> 
> 2) How to locklessly access the shadow page table?
> It is easy if the handler is in the vcpu context; in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
> disable interrupts to stop shadow pages from being freed. But we are in the
> ioctl context and the paths we are optimizing for have a heavy workload;
> disabling interrupts is not good for system performance.
>
> We add an indicator to the kvm struct (kvm->arch.rcu_free_shadow_page), then use
> call_rcu() to free the shadow page if that indicator is set. Setting/clearing the
> indicator is protected by the slots lock, so it need not be atomic and does not
> hurt performance or scalability.
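
A sketch of that indicator as a pair of helpers (the signatures are assumed
here; the actual patch may shape this differently):

	/* both callers hold slots_lock, so plain stores are enough */
	static void kvm_use_rcu_free_page_begin(struct kvm *kvm)
	{
		/* from now on, zapped shadow pages must be freed via call_rcu() */
		kvm->arch.rcu_free_shadow_page = true;
	}

	static void kvm_use_rcu_free_page_end(struct kvm *kvm)
	{
		/* pages already queued with call_rcu() still wait for a grace period */
		kvm->arch.rcu_free_shadow_page = false;
	}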
> 
> 3) How to locklessly write-protect guest memory?
> Currently, there are two behaviors when we write-protect guest memory: one is
> clearing the Writable bit on the spte and the other is dropping the spte when it
> points to a large page. The former is easy, we only need to atomically clear a
> bit, but the latter is hard since we need to remove the spte from the rmap. So we
> unify these two behaviors to only make the spte readonly. Making a large spte
> readonly instead of nonpresent is also good for reducing jitter.
>
> And we need to pay more attention to the order of making the spte writable,
> adding the spte into the rmap and setting the corresponding bit in the dirty
> bitmap, since kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the
> dirty bitmap: we should ensure the writable spte can be found in the rmap before
> the dirty bitmap is visible. Otherwise, we clear the dirty bitmap but fail to
> write-protect the page.
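
As a reading aid, here is the consumer side of that ordering, sketched against
the shape of kvm_vm_ioctl_get_dirty_log() at the time (heavily simplified: error
handling and the copy to user space are omitted, and in the lockless scheme the
write-protect step no longer needs mmu-lock):

	for (i = 0; i < n / sizeof(long); i++) {
		if (!dirty_bitmap[i])
			continue;

		mask = xchg(&dirty_bitmap[i], 0);	/* full barrier */

		/* every gfn set in "mask" is write-protected via its rmap;
		 * thanks to the ordering above, its writable spte is already
		 * visible to this walk */
		kvm_mmu_write_protect_pt_masked(kvm, memslot,
						i * BITS_PER_LONG, mask);
	}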
> 
> Performance result
> 
> Host: CPU: Intel(R) Xeon(R) CPU   X5690  @ 3.47GHz x 12
> Mem: 36G
> 
> The benchmark i used and will be attached:
> a) kernbench
> b) migrate-perf
>it emulates guest migration
> c) mmtest
>it repeatedly writes the memory and measures the time and is used to
>generate memory access in the guest which is being migrated
> d) Qemu monitor command to implement guest live migration
>the script can be found in migrate-perf.
>   
> 
> 1) First, we use kernbench to benchmark the performance with 
> non-write-protection
>   case to detect the possible regression:
> 
>   EPT enabled:  Base: 84.05  After the patch: 83.53
>   EPT disabled: Base: 142.57 After the patch: 141.70
> 
>   No regression and the optimization may come from lazily drop large spte.
> 
> 2) Benchmark the performance of get dirty page
>(./migrate-perf -c 12 -m 3000 -t 20)
> 
>Base: Run 20 times, Avg time:24813809 ns.
>After the patch: Run 20 times, Avg time:8371577 ns.
>
>It improves +196%
>   
> 3) There is the result of Live Migration:
>3.1) Less vcpus, less memory and less dirty page generated
> (
>   Guest config: MEM_S

Re: [PATCH 0/5] perf kvm live - latest round take 4

2013-08-05 Thread Xiao Guangrong
On 08/06/2013 09:41 AM, David Ahern wrote:
> Hi Arnaldo:
> 
> This round addresses all of Xiao's comments. It also includes a small
> change in the live mode introduction to improve ordered samples
> processing. For that a change in perf-session functions is needed.

Reviewed-by: Xiao Guangrong 

David, could you please update the documentation? It can be a separate
patch.



Re: [PATCH 9/9] perf kvm stat report: Add option to analyze specific VM

2013-08-04 Thread Xiao Guangrong
On 08/03/2013 04:05 AM, David Ahern wrote:
> Add an option to analyze a specific VM within a data file. This
> allows the collection of kvm events for all VMs and then analyze
> data for each VM (or set of VMs) individually.

Interesting.

But how can we know which pid is the guest's pid after collecting the info,
especially if the .data file is moved to another box for offline analysis?



Re: [PATCH 8/9] perf kvm: debug for missing vmexit/vmentry event

2013-08-04 Thread Xiao Guangrong
On 08/03/2013 04:05 AM, David Ahern wrote:
> Expected to have missing events for each vcpu when perf is
> started. After that should not have missing events.

Sorry, I cannot understand what this info is used for.

> 
> Signed-off-by: David Ahern 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Ingo Molnar 
> Cc: Frederic Weisbecker 
> Cc: Peter Zijlstra 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Xiao Guangrong 
> Cc: Runzhen Wang 
> ---
>  tools/perf/builtin-kvm.c |   15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> index 28afc05d..41dd25a 100644
> --- a/tools/perf/builtin-kvm.c
> +++ b/tools/perf/builtin-kvm.c
> @@ -568,11 +568,22 @@ static bool handle_kvm_event(struct perf_kvm_stat *kvm,
>   (kvm->trace_vcpu != vcpu_record->vcpu_id))
>   return true;
> 
> - if (kvm->events_ops->is_begin_event(evsel, sample, &key))
> + if (kvm->events_ops->is_begin_event(evsel, sample, &key)) {
> + if (vcpu_record->start_time) {
> + pr_debug("consecutive begin events (%s) for pid %d, 
> vcpu %d\n",
> +  evsel->name, sample->pid, 
> vcpu_record->vcpu_id);
> + }

This is not true: ->start_time is set in handle_begin_event() even for a newly
allocated vcpu_record.

If you just want to know the point where the first event is lost, you can track
it in:

static bool handle_end_event(struct perf_kvm_stat *kvm,
 struct vcpu_event_record *vcpu_record,
 struct event_key *key,
 u64 timestamp)
{
..

/* The begin event is not caught. */
if (!time_begin)   <==
return true;
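
For example, a counter could be bumped there instead (missing_begin_events is a
hypothetical field, shown only to illustrate the idea):

	/* The begin event is not caught. */
	if (!time_begin) {
		kvm->missing_begin_events++;
		return true;
	}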



Re: [PATCH 7/9] perf kvm: option to print events that exceed a threshold

2013-08-04 Thread Xiao Guangrong
On 08/03/2013 04:05 AM, David Ahern wrote:
> This is useful to spot high latency blips.

Yes, it is a good idea.

> 
> Signed-off-by: David Ahern 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Ingo Molnar 
> Cc: Frederic Weisbecker 
> Cc: Peter Zijlstra 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Xiao Guangrong 
> Cc: Runzhen Wang 
> ---
>  tools/perf/builtin-kvm.c |   25 +
>  tools/perf/perf.h|3 +++
>  2 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> index 5c6e3cd..28afc05d 100644
> --- a/tools/perf/builtin-kvm.c
> +++ b/tools/perf/builtin-kvm.c
> @@ -106,6 +106,7 @@ struct perf_kvm_stat {
>   u64 total_time;
>   u64 total_count;
>   u64 lost_events;
> + u64 threshold;
> 
>   struct rb_root result;
> 
> @@ -470,7 +471,7 @@ static bool update_kvm_event(struct kvm_event *event, int 
> vcpu_id,
>  static bool handle_end_event(struct perf_kvm_stat *kvm,
>struct vcpu_event_record *vcpu_record,
>struct event_key *key,
> -  u64 timestamp)
> +  struct perf_sample *sample)
>  {
>   struct kvm_event *event;
>   u64 time_begin, time_diff;
> @@ -507,12 +508,24 @@ static bool handle_end_event(struct perf_kvm_stat *kvm,
>   vcpu_record->start_time = 0;
> 
>   /* seems to happen once in a while during live mode */
> - if (timestamp < time_begin) {
> + if (sample->time < time_begin) {
>   pr_debug("End time before begin time; skipping event.\n");
>   return true;
>   }
> 
> - time_diff = timestamp - time_begin;
> + time_diff = sample->time - time_begin;
> +
> + if (kvm->threshold && time_diff > kvm->threshold) {
> + char decode[32];
> +
> + kvm->events_ops->decode_key(kvm, &event->key, decode);
> + if (strcmp(decode, "HLT")) {
> + pr_info("%" PRIu64 " VM %d, vcpu %d: %s event took %" 
> PRIu64 "usec\n",
> +  sample->time, sample->pid, 
> vcpu_record->vcpu_id,
> +  decode, time_diff/1000);
> + }

Any reason to filter HLT out? Is it too frequent? The info will be missed if
we're really interested in this event.

It shows the info when the event is parsed, which is not very readable or
analyzable; can we only record and print the events that take a long time
(handle-time > threshold)?



Re: [PATCH 6/9] perf kvm: add min and max stats to display

2013-08-04 Thread Xiao Guangrong
On 08/03/2013 04:05 AM, David Ahern wrote:
> Signed-off-by: David Ahern 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Ingo Molnar 
> Cc: Frederic Weisbecker 
> Cc: Peter Zijlstra 
> Cc: Jiri Olsa 
> Cc: Namhyung Kim 
> Cc: Xiao Guangrong 
> Cc: Runzhen Wang 
> ---
>  tools/perf/builtin-kvm.c |   21 ++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
> index 12f7593..5c6e3cd 100644
> --- a/tools/perf/builtin-kvm.c
> +++ b/tools/perf/builtin-kvm.c
> @@ -337,14 +337,19 @@ static void clear_events_cache_stats(struct list_head 
> *kvm_events_cache)
>   struct list_head *head;
>   struct kvm_event *event;
>   unsigned int i;
> + int j;
> 
>   for (i = 0; i < EVENTS_CACHE_SIZE; i++) {
>   head = &kvm_events_cache[i];
>   list_for_each_entry(event, head, hash_entry) {
>   /* reset stats for event */
> - memset(&event->total, 0, sizeof(event->total));
> - memset(event->vcpu, 0,
> -event->max_vcpu * sizeof(*event->vcpu));
> + event->total.time = 0;
> + init_stats(&event->total.stats);
> +
> + for (j = 0; j < event->max_vcpu; ++j) {
> + event->vcpu[j].time = 0;
> + init_stats(&event->vcpu[j].stats);
> + }
>   }
>   }
>  }
> @@ -718,6 +723,7 @@ static void print_result(struct perf_kvm_stat *kvm)
>   char decode[20];
>   struct kvm_event *event;
>   int vcpu = kvm->trace_vcpu;
> + struct kvm_event_stats *kvm_stats;
> 
>   if (kvm->live) {
>   puts(CONSOLE_CLEAR);
> @@ -731,6 +737,8 @@ static void print_result(struct perf_kvm_stat *kvm)
>   pr_info("%9s ", "Samples%");
> 
>   pr_info("%9s ", "Time%");
> + pr_info("%10s ", "Min Time");
> + pr_info("%10s ", "Max Time");
>   pr_info("%16s ", "Avg time");
>   pr_info("\n\n");
> 
> @@ -740,11 +748,18 @@ static void print_result(struct perf_kvm_stat *kvm)
>   ecount = get_event_count(event, vcpu);
>   etime = get_event_time(event, vcpu);
> 
> + if (vcpu == -1)
> + kvm_stats = &event->total;
> + else
> + kvm_stats = &event->vcpu[vcpu];

Can you introduce a function to do this, like get_event_count()/get_event_time()?
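
For example (a hypothetical helper mirroring get_event_count()/get_event_time()):

	static struct kvm_event_stats *get_event_stats(struct kvm_event *event,
						       int vcpu)
	{
		return (vcpu == -1) ? &event->total : &event->vcpu[vcpu];
	}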

And the min/max values are updated in patch 2, so the rest looks good to me:
Reviewed-by: Xiao Guangrong 



Re: [PATCH 2/9] perf stats: add max and min stats

2013-08-04 Thread Xiao Guangrong
On 08/03/2013 04:05 AM, David Ahern wrote:
> Need an initialization function to set min to -1 to
> differentiate from an actual min of 0.

Reviewed-by: Xiao Guangrong 



Re: [PATCH 5/9] perf kvm: add live mode - v3

2013-08-04 Thread Xiao Guangrong
Hi David,

Thanks for your nice job! I got some questions.

On 08/03/2013 04:05 AM, David Ahern wrote:

>  static int kvm_events_hash_fn(u64 key)
>  {
>   return key & (EVENTS_CACHE_SIZE - 1);
> @@ -472,7 +501,11 @@ static bool handle_end_event(struct perf_kvm_stat *kvm,
>   vcpu_record->last_event = NULL;
>   vcpu_record->start_time = 0;
> 
> - BUG_ON(timestamp < time_begin);
> + /* seems to happen once in a while during live mode */
> + if (timestamp < time_begin) {
> + pr_debug("End time before begin time; skipping event.\n");
> + return true;
> + }

No idea why it can happen. :(

> +static bool verify_vcpu(int vcpu)
> +{
> + int nr_cpus;
> +
> + if (vcpu != -1 && vcpu < 0) {
> + pr_err("Invalid vcpu:%d.\n", vcpu);
> + return false;
> + }
> +
> + nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
> + if ((nr_cpus > 0) && (vcpu > nr_cpus - 1)) {
> + pr_err("Invalid vcpu:%d.\n", vcpu);
> + return false;
> + }

Hmm, KVM can use more vcpus than there are cpus on the host.

> +static int kvm_events_live(struct perf_kvm_stat *kvm,
> +int argc, const char **argv)
> +{
> + char errbuf[BUFSIZ];
> + int err;
> +
> + const struct option live_options[] = {
> + OPT_STRING('p', "pid", &kvm->opts.target.pid, "pid",
> + "record events on existing process id"),
> + OPT_STRING('t', "tid", &kvm->opts.target.tid, "tid",
> + "record events on existing thread id"),
> + OPT_STRING('C', "cpu", &kvm->opts.target.cpu_list, "cpu",
> + "list of host cpus to monitor"),
> + OPT_UINTEGER('m', "mmap-pages", &kvm->opts.mmap_pages,
> + "number of mmap data pages"),
> + OPT_INCR('v', "verbose", &verbose,
> + "be more verbose (show counter open errors, etc)"),
> + OPT_BOOLEAN('a', "all-cpus", &kvm->opts.target.system_wide,
> + "system-wide collection from all CPUs"),
> + OPT_UINTEGER('d', "display", &kvm->display_time,
> + "time in seconds between display updates"),
> + OPT_STRING(0, "event", &kvm->report_event, "report event",
> + "event for reporting: vmexit, mmio, ioport"),
> + OPT_INTEGER(0, "vcpu", &kvm->trace_vcpu,
> + "vcpu id to report"),
> + OPT_STRING('k', "key", &kvm->sort_key, "sort-key",
> + "key for sorting: sample(sort by samples number)"
> + " time (sort by avg time)"),

Why do we have so many parameters for tracking? For KVM, we only need to know
1) which guest is tracked, 2) which vcpu in the guest is tracked, and 3) what
kind of events. No?

Others look good to me. :)



Re: [PATCH 4/9] perf kvm: split out tracepoints from record args

2013-08-04 Thread Xiao Guangrong
On 08/03/2013 04:05 AM, David Ahern wrote:
> Needed by kvm live command. Make record_args a local while we are
> messing with the args.

Reviewed-by: Xiao Guangrong 



[PATCH] KVM: MMU: fix check the reserved bits on the gpte of L2

2013-08-04 Thread Xiao Guangrong
The current code always uses arch.mmu to check the reserved bits on a guest gpte,
which is valid only for the L1 guest; we should use arch.nested_mmu instead when
we translate gva to gpa for the L2 guest

Fix it by using @mmu instead, since it is adapted to the current mmu mode
automatically

The bug can be triggered when nested NPT is used and the L1 guest and the L2
guest use different mmu modes

Reported-by: Jan Kiszka 
Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/paging_tmpl.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7769699..3a75828 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -218,8 +218,7 @@ retry_walk:
if (unlikely(!is_present_gpte(pte)))
goto error;

-   if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
- walker->level))) {
+   if (unlikely(is_rsvd_bits_set(mmu, pte, walker->level))) {
errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
goto error;
}
-- 
1.8.1.4



Re: [RFC PATCH 00/12] KVM: MMU: locklessly write-protect

2013-08-04 Thread Xiao Guangrong

On Aug 3, 2013, at 1:09 PM, Takuya Yoshikawa  wrote:

> On Tue, 30 Jul 2013 21:01:58 +0800
> Xiao Guangrong  wrote:
> 
>> Background
>> ==
>> Currently, when mark memslot dirty logged or get dirty page, we need to
>> write-protect large guest memory, it is the heavy work, especially, we need 
>> to
>> hold mmu-lock which is also required by vcpu to fix its page table fault and
>> mmu-notifier when host page is being changed. In the extreme cpu / memory 
>> used
>> guest, it becomes a scalability issue.
>> 
>> This patchset introduces a way to locklessly write-protect guest memory.
> 
> Nice improvements!

Thank you!

> 
> If I read the patch set correctly, this work contains the following changes:
> 
> Cleanups:
>Patch 1 and patch 12.
> 
> Lazy large page dropping for dirty logging:
>Patch 2-3.
>Patch 2 is preparatory to patch 3.
> 
>This does not look like an RFC if you address Marcelo's comment.
>Any reason to include this in an RFC patch set?

Right, these two patches are not really RFC since you guys have reviewed the
idea.

The reason I put them into this patchset is that they are also preparatory work
for implementing lockless write-protection: after them, we no longer need to
remove a spte from the rmap (we only need to write-protect the spte).

> 
> Making remote TLBs flushable outside of mmu_lock for dirty logging:
>Patch 6.
> 
>This is nice.  I'm locally using a similar patch for my work, but yours
>is much cleaner and better.  I hope this will get merged soon.

Thanks!

> 
> New Pte-list handling:
>Patch 7-9.
> 
>Still reading the details.
> 
> RCU-based lockless write protection.
>Patch 10-11.
> 
>If I understand RCU correctly, the current implementation has a 
> problem:
>read-side critical sections can become too long.
> 
>See the following LWN's article:
>"Sleepable RCU"
>https://lwn.net/Articles/202847/
> 
>Especially, kvm_mmu_slot_remove_write_access() can take hundreds of
>milliseconds, or even a few seconds for guests using shadow paging.
>Is it possible to break the read-side critical section after protecting
>some pages? -- I guess so.

Yes, we can break up the read-side critical section in the code if needed, like this:

if (need_resched()) {
kvm_use_rcu_free_page_end();
kvm_use_rcu_free_page_begin();
}

> 
> Anyway, I want to see the following non-RFC quality patches get merged first:
>- Lazy large page dropping for dirty logging:
>- Making remote TLBs flushable outside of mmu_lock for dirty logging
> 
> As you are doing in patch 11, the latter can eliminate the TLB flushes before
> cond_resched_lock().  So this alone is an optimization, and since my work is
> based on this TLB flush-less lock breaking, I would appriciate if you make 
> this
> change first in your clean way.

Okay, I will move these patches to the front so that the maintainers can merge
them easily.

> 
> The remaining patches, pte-list refactoring and lock-less ones, also look
> interesting, but I need to read more to understand them.
> 
> Thanks for the nice work!

Thanks for your review and the comments! :)



Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Xiao Guangrong

On Aug 3, 2013, at 4:27 AM, Marcelo Tosatti  wrote:

> On Fri, Aug 02, 2013 at 11:42:19PM +0800, Xiao Guangrong wrote:
>> 
>> On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti  wrote:
>> 
>>> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>>>> Currently, kvm zaps the large spte if write-protected is needed, the later
>>>> read can fault on that spte. Actually, we can make the large spte readonly
>>>> instead of making them un-present, the page fault caused by read access can
>>>> be avoided
>>>> 
>>>> The idea is from Avi:
>>>> | As I mentioned before, write-protecting a large spte is a good idea,
>>>> | since it moves some work from protect-time to fault-time, so it reduces
>>>> | jitter.  This removes the need for the return value.
>>>> 
>>>> [
>>>> It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>>>> marking the large spte to writable
>>>> ]
>>> 
>>> Xiao,
>>> 
>>> Can you please write a comment explaining why are the problems 
>>> with shadow vs large read-only sptes (can't recall anymore),
>>> and then why it is now safe to do it.
>> 
>> Hi Marcelo,
>> 
>> Thanks for your review.  Yes. The bug reported in  6b73a9606 is, in this 
>> patch,
>> we mark the large spte as readonly when the pages are dirt logged and the
>> readonly spte can be set to writable by fast page fault, but on that path, 
>> it failed
>> to check dirty logging, so it will set the large spte to writable but only 
>> set the first
>> page to the dirty bitmap.
>> 
>> For example:
>> 
>> 1): KVM maps 0 ~ 2M memory to guest which is pointed by SPTE and SPTE
>> is writable.
>> 
>> 2): KVM dirty log 0 ~ 2M,  then set SPTE to readonly
>> 
>> 3): fast page fault set SPTE to writable and set page 0 to the dirty bitmap.
>> 
>> Then 4K ~ 2M memory is not dirty logged.
> 
> Ok can you write a self contained summary of read-only large sptes (when
> they are created, when destroyed, from which point they can't be created,
> etc), and the interaction with shadow write protection and creation of
> writeable sptes?
> Its easy to get lost.

Okay, will do.



Re: [PATCH 03/12] KVM: MMU: lazily drop large spte

2013-08-02 Thread Xiao Guangrong

On Aug 2, 2013, at 10:55 PM, Marcelo Tosatti  wrote:

> On Tue, Jul 30, 2013 at 09:02:01PM +0800, Xiao Guangrong wrote:
>> Currently, kvm zaps the large spte if write-protected is needed, the later
>> read can fault on that spte. Actually, we can make the large spte readonly
>> instead of making them un-present, the page fault caused by read access can
>> be avoided
>> 
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter.  This removes the need for the return value.
>> 
>> [
>>  It has fixed the issue reported in 6b73a9606 by stopping fast page fault
>>  marking the large spte to writable
>> ]
> 
> Xiao,
> 
> Can you please write a comment explaining why are the problems 
> with shadow vs large read-only sptes (can't recall anymore),
> and then why it is now safe to do it.

Hi Marcelo,

Thanks for your review. Yes. The bug reported in 6b73a9606 is: in this patch,
we mark the large spte as readonly when its pages are dirty logged, and the
readonly spte can be set to writable by fast page fault, but that path fails
to check dirty logging, so it sets the large spte to writable but only sets
the first page in the dirty bitmap.

For example:

1): KVM maps 0 ~ 2M of guest memory with a single large SPTE, and the SPTE
    is writable.

2): KVM dirty-logs 0 ~ 2M, then sets the SPTE to readonly.

3): fast page fault sets the SPTE to writable and sets only page 0 in the dirty bitmap.

Then 4K ~ 2M memory is not dirty logged.
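
Schematically, the problem is in the fast_pf_fix_direct_spte() fixup (the two
comments below are added here to describe the large-spte case; the code lines
are the ones quoted elsewhere in the thread):

	/* flips the whole 2M mapping back to writable ... */
	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
		/* ... but logs only the single 4K gfn that faulted */
		mark_page_dirty(vcpu->kvm, gfn);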

In this version, we make fast page fault not mark a large spte writable if its
pages are dirty logged. But it is still not safe, as you pointed out.

>> 
>> 
>>  /*
>> + * Can not map the large spte to writable if the page is dirty
>> + * logged.
>> + */
>> +if (sp->role.level > PT_PAGE_TABLE_LEVEL && force_pt_level)
>> +goto exit;
>> +
> 
> It is not safe to derive slot->dirty_bitmap like this: 
> since dirty log is enabled via RCU update, "is dirty bitmap enabled"
> info could be stale by the time you check it here via the parameter,
> so you can instantiate a large spte (because force_pt_level == false),
> while you should not.

Good catch! This is true even if we enable dirty logging under the protection
of the mmu lock.

How about letting fast page fault only fix small sptes, that is, changing
the code to:

	if (sp->role.level > PT_PAGE_TABLE_LEVEL)
		goto exit;
?




Re: [PATCH 05/12] KVM: MMU: add spte into rmap before logging dirty page

2013-07-31 Thread Xiao Guangrong
On 07/30/2013 09:27 PM, Paolo Bonzini wrote:
> Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
>> kvm_vm_ioctl_get_dirty_log() write-protects the spte based on the dirty
>> bitmap, we should ensure the writable spte can be found in rmap before the
>> dirty bitmap is visible. Otherwise, we cleared the dirty bitmap and failed
>> to write-protect the page
>>
>> It need the memory barrier to prevent out-of-order that will be added in the
>> later patch
> 
> Do you mean that the later patch will also introduce a memory barrier?

No. Sorry for the confusion. I mean we are missing the memory barrier in this
patch and will add it in the later patch where we introduce the lockless
write-protection.

The memory barrier is added in
[PATCH 11/12] KVM: MMU: locklessly write-protect the page:

+   /*
+* We should put the sptep into rmap before dirty log
+* otherwise the lockless spte write-protect path will
+* clear the dirty bit map but fail to find the spte.
+*
+* See the comments in kvm_vm_ioctl_get_dirty_log().
+*/
+   smp_wmb();
+
if (pte_access & ACC_WRITE_MASK)

and the barrier on the other side is:
+   /*
+* xchg acts as a full barrier that ensures
+* clearing dirty bitmap before read rmap.
+*
+* See the comments in set_spte().
+*/
mask = xchg(&dirty_bitmap[i], 0);



Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-07-31 Thread Xiao Guangrong
On 07/30/2013 09:26 PM, Paolo Bonzini wrote:
> Il 30/07/2013 15:02, Xiao Guangrong ha scritto:
>> Make sure we can see the writable spte before the dirty bitmap is visible
>>
>> We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
>> on the dirty bitmap, so we should ensure the writable spte can be found in the
>> rmap before the dirty bitmap is visible. Otherwise, we clear the dirty bitmap
>> but fail to write-protect the page
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 35d4b50..0fe56ad 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2486,12 +2486,12 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 
>> *sptep,
>>  }
>>  }
>>  
>> -if (pte_access & ACC_WRITE_MASK)
>> -mark_page_dirty(vcpu->kvm, gfn);
>> -
>>  set_pte:
>>  if (mmu_spte_update(sptep, spte))
>>  kvm_flush_remote_tlbs(vcpu->kvm);
>> +
>> +if (pte_access & ACC_WRITE_MASK)
>> +mark_page_dirty(vcpu->kvm, gfn);
>>  done:
>>  return ret;
>>  }
>>
> 
> What about this comment above:
> 
> /*
>  * Optimization: for pte sync, if spte was writable the hash
>  * lookup is unnecessary (and expensive). Write protection
>  * is responsibility of mmu_get_page / kvm_sync_page.

This comment means that no sync shadow page is created if the spte is still
writable, because adding a sync page requires write-protecting all sptes that
point to this page. So we can keep the spte writable.

I think it is better to check the SPTE_MMU_WRITEABLE bit instead of
PT_WRITABLE_MASK, since the latter bit can be cleared by dirty logging; that can
be a separate patch, I think.
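
i.e. something like this sketch (just illustrating the suggested check, not a
tested change):

	if (!can_unsync && (*sptep & SPTE_MMU_WRITEABLE))
		goto set_pte;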

>  * Same reasoning can be applied to dirty page accounting.

This comment means that if the spte is writable, the corresponding bit in the
dirty bitmap should already have been set.

Thanks to your reminder, I think this comment should be dropped: now we need to
call mark_page_dirty() whenever the spte is updated to writable. Otherwise this
will happen:

VCPU 0                               VCPU 1

Clear dirty bit on the bitmap
                                     Read the spte, it is writable
write the spte
                                     update the spte, keep it as writable
                                     and do not call mark_page_dirty().
Flush tlb

Then vcpu 1 can continue to write the page but fail to set the bit on the 
bitmap.

>  */
> if (!can_unsync && is_writable_pte(*sptep))
> goto set_pte;
> 
> if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
> 
> 
> ?
> 
> Should it be changed to
> 
> if (!can_unsync && is_writable_pte(*sptep))
> pte_access &= ~ACC_WRITE_MASK; /* do not mark dirty */

Yes, this can avoid the issue above.

But there is only a small window between syncing the spte and locklessly
write-protecting it (since the sptep is already writable), so I think we'd
better keep the spte writable to speed up the normal case. :)


