Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-12-09 Thread Xiao Guangrong
On 12/06/2013 08:22 AM, Marcelo Tosatti wrote:
 On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
 In some cases, the lockless walker will do endless walking on the descs
 without any rewalk; consider this case:

 there are two descs, desc1 and desc2, where desc2 is pointed to by desc1->next:
 desc1->next = desc2.

 CPU 0                                  CPU 1

 lockless walk on desc1
                                        delete desc1 from the rmap
 lockless walk on desc2 (desc1->next)
                                        delete desc2 from the rmap
                                        add desc1
                                        add desc2, then desc2->next = desc1
 lockless walk on desc1
                                        delete desc2
                                        delete desc1
                                        add desc2
                                        add desc1; then desc1->next = desc2
 lockless walk on desc2

 ……

 Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.
 
 The counter can be local to the walker. If it's more than the maximum rmap
 size, break.
 
 (but still, please consider carefully whether lockless list walking is
 really necessary or can be avoided).

Yep, Marcelo, you're right.

After thinking more, I do not have any idea to simplify this. Your approach
(lockless on the first level) seems a better solution. Will do it based on
that approach.

Thanks!




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-12-05 Thread Marcelo Tosatti
On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote:
 On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
  On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
  On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
  On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
  On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
  Also, there is no guarantee of termination (as long as sptes are
  deleted with the correct timing). BTW, can't see any guarantee of
  termination for rculist nulls either (a writer can race with a lockless
  reader indefinitely, restarting the lockless walk every time).
 
  Hmm, that can be avoided by checking dirty-bitmap before rewalk,
  that means, if the dirty-bitmap has been set during lockless write-protection,
  it's unnecessary to write-protect its sptes. Your idea?
  This idea is based on the fact that the number of rmap entries is limited by
  RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
  we can break the rewalk at once; in the case of deleting, we can only
  rewalk RMAP_RECYCLE_THRESHOLD times.
 
  Please explain in more detail.
 
  Okay.
 
  My proposal is like this:
 
  pte_list_walk_lockless()
  {
  restart:
 
  + if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
  + return;
 
code-doing-lockless-walking;
..
  }
 
  Before doing the lockless walking, we check the dirty-bitmap first; if
  it is set we can simply skip write-protection for the gfn. That is the
  case where a new spte is being added into the rmap while we access the
  rmap locklessly.
 
  The dirty bit could be set after the check.
 
  For the case of deleting an spte from the rmap, the number of entries is
  limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
 
  It can shrink and grow while lockless walk is performed.
  
  Yes, indeed.
  
  Hmmm, another idea in my mind to fix this is encoding the position into
  the reserved bits of the desc->more pointer, for example:
  
            +------+     +------+     +------+
  rmapp --> |Desc 0| --> |Desc 1| --> |Desc 2|
            +------+     +------+     +------+
  
  There are 3 descs on the rmap, and:
  rmapp = desc0 | 1UL | 3UL << 50;
  desc0->more = desc1 | 2UL << 50;
  desc1->more = desc2 | 1UL << 50;
  desc2->more = rmapp | 1UL; (The nulls pointer)
  
  We will walk to the next desc only if the position of the current desc
  is >= the position of the next desc. That can make sure we can reach the
  last desc anyway.
  
  And in order to avoid doing too many rewalks, we will go to the slow
  path (do the walk while holding the lock) instead when we retry the walk
  more than N times.
 
 How about this idea? Or do you guys still prefer the idea of lockless on
 the first level?

Xiao,

Is it not the case that simply moving to the slow path once a maximum number
of rewalks has been reached is enough? (looks like a good solution).

Please move lockless rcu walking code to generic code where it belongs.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-12-05 Thread Xiao Guangrong

On Dec 5, 2013, at 9:50 PM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Tue, Dec 03, 2013 at 03:10:48PM +0800, Xiao Guangrong wrote:
 On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
 On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
 On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
 On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).
 
 Hmm, that can be avoided by checking dirty-bitmap before rewalk,
 that means, if the dirty-bitmap has been set during lockless write-protection,
 it's unnecessary to write-protect its sptes. Your idea?
 This idea is based on the fact that the number of rmap entries is limited by
 RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
 we can break the rewalk at once; in the case of deleting, we can only
 rewalk RMAP_RECYCLE_THRESHOLD times.
 
 Please explain in more detail.
 
 Okay.
 
 My proposal is like this:
 
 pte_list_walk_lockless()
 {
 restart:
 
 + if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
 + return;
 
   code-doing-lockless-walking;
   ..
 }
 
 Before doing the lockless walking, we check the dirty-bitmap first; if
 it is set we can simply skip write-protection for the gfn. That is the
 case where a new spte is being added into the rmap while we access the
 rmap locklessly.
 
 The dirty bit could be set after the check.
 
 For the case of deleting an spte from the rmap, the number of entries is
 limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
 
 It can shrink and grow while lockless walk is performed.
 
 Yes, indeed.
 
 Hmmm, another idea in my mind to fix this is encoding the position into
 the reserved bits of the desc->more pointer, for example:
 
           +------+     +------+     +------+
 rmapp --> |Desc 0| --> |Desc 1| --> |Desc 2|
           +------+     +------+     +------+
 
 There are 3 descs on the rmap, and:
 rmapp = desc0 | 1UL | 3UL << 50;
 desc0->more = desc1 | 2UL << 50;
 desc1->more = desc2 | 1UL << 50;
 desc2->more = rmapp | 1UL; (The nulls pointer)
 
 We will walk to the next desc only if the position of the current desc
 is >= the position of the next desc. That can make sure we can reach the
 last desc anyway.
 
 And in order to avoid doing too many rewalks, we will go to the slow
 path (do the walk while holding the lock) instead when we retry the walk
 more than N times.
 
 How about this idea? Or do you guys still prefer the idea of lockless on
 the first level?
 
 Xiao,
 
 Is it not the case that simply moving to the slow path once a maximum number
 of rewalks has been reached is enough? (looks like a good solution).

In some cases, the lockless walker will do endless walking on the descs
without any rewalk; consider this case:

there are two descs, desc1 and desc2, where desc2 is pointed to by desc1->next:
desc1->next = desc2.

CPU 0                                  CPU 1

lockless walk on desc1
                                       delete desc1 from the rmap
lockless walk on desc2 (desc1->next)
                                       delete desc2 from the rmap
                                       add desc1
                                       add desc2, then desc2->next = desc1
lockless walk on desc1
                                       delete desc2
                                       delete desc1
                                       add desc2
                                       add desc1; then desc1->next = desc2
lockless walk on desc2

……

Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.


 
 Please move lockless rcu walking code to generic code where it belongs.

Okay.





Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-12-05 Thread Marcelo Tosatti
On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
  Is it not the case that simply moving to the slow path once a maximum number
  of rewalks has been reached is enough? (looks like a good solution).
 
 In some cases, the lockless walker will do endless walking on the descs
 without any rewalk; consider this case:

 there are two descs, desc1 and desc2, where desc2 is pointed to by desc1->next:
 desc1->next = desc2.

 CPU 0                                  CPU 1

 lockless walk on desc1
                                        delete desc1 from the rmap
 lockless walk on desc2 (desc1->next)
                                        delete desc2 from the rmap
                                        add desc1
                                        add desc2, then desc2->next = desc1
 lockless walk on desc1
                                        delete desc2
                                        delete desc1
                                        add desc2
                                        add desc1; then desc1->next = desc2
 lockless walk on desc2

 ……

 Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.

Ouch, this is the sort of thing that is worrisome. Do you still think
that having the benefit for shadow is important enough to justify this
complexity?

Also, another separate possibility is to use the dirty bit (which recent
Intel processors support). Then there are no faults at all to handle.




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-12-05 Thread Marcelo Tosatti
On Thu, Dec 05, 2013 at 11:30:27PM +0800, Xiao Guangrong wrote:
 In some cases, the lockless walker will do endless walking on the descs
 without any rewalk; consider this case:

 there are two descs, desc1 and desc2, where desc2 is pointed to by desc1->next:
 desc1->next = desc2.

 CPU 0                                  CPU 1

 lockless walk on desc1
                                        delete desc1 from the rmap
 lockless walk on desc2 (desc1->next)
                                        delete desc2 from the rmap
                                        add desc1
                                        add desc2, then desc2->next = desc1
 lockless walk on desc1
                                        delete desc2
                                        delete desc1
                                        add desc2
                                        add desc1; then desc1->next = desc2
 lockless walk on desc2

 ……

 Then, the walker is endlessly walking on desc1 and desc2 without any rewalk.

The counter can be local to the walker. If it's more than the maximum rmap
size, break.

(but still, please consider carefully whether lockless list walking is
really necessary or can be avoided).
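
A minimal sketch of the walker-local counter suggested above, reusing the
desc_is_a_nulls()/desc_get_nulls_value() helpers from this patch;
MAX_LOCKLESS_WALK and kvm_rmap_write_protect_slow() are illustrative names,
not the series' actual code:

/*
 * Sketch only: bound the lockless walk with a local counter and fall
 * back to a locked slow path (hypothetical helper) once it exceeds the
 * largest rmap we can meet.  Assumes *rmapp already points to a desc
 * chain (low bit set).
 */
#define MAX_LOCKLESS_WALK       (RMAP_RECYCLE_THRESHOLD * 2)

static void rmap_write_protect_lockless(struct kvm *kvm, unsigned long *rmapp)
{
        struct pte_list_desc *desc;
        int walked = 0;

restart:
        desc = (struct pte_list_desc *)(*rmapp & ~1ul);
        while (!desc_is_a_nulls(desc)) {
                if (++walked > MAX_LOCKLESS_WALK) {
                        /* too much churn: take mmu_lock and do it safely */
                        kvm_rmap_write_protect_slow(kvm, rmapp);
                        return;
                }
                /* ... write-protect desc->sptes[] here ... */
                desc = desc->more;
        }

        /* the tail's nulls value names the rmap it belongs to */
        if (desc_get_nulls_value(desc) != rmapp)
                goto restart;
}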



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-12-02 Thread Xiao Guangrong
On 11/28/2013 04:53 PM, Xiao Guangrong wrote:
 On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
 On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
 On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).

 Hmm, that can be avoided by checking dirty-bitmap before rewalk,
 that means, if the dirty-bitmap has been set during lockless write-protection,
 it's unnecessary to write-protect its sptes. Your idea?
 This idea is based on the fact that the number of rmap entries is limited by
 RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
 we can break the rewalk at once; in the case of deleting, we can only
 rewalk RMAP_RECYCLE_THRESHOLD times.

 Please explain in more detail.

 Okay.

 My proposal is like this:

 pte_list_walk_lockless()
 {
 restart:

 +   if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
 +   return;

 code-doing-lockless-walking;
 ..
 }

 Before doing the lockless walking, we check the dirty-bitmap first; if
 it is set we can simply skip write-protection for the gfn. That is the
 case where a new spte is being added into the rmap while we access the
 rmap locklessly.

 The dirty bit could be set after the check.

 For the case of deleting an spte from the rmap, the number of entries is
 limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.

 It can shrink and grow while lockless walk is performed.
 
 Yes, indeed.
 
 Hmmm, another idea in my mind to fix this is encoding the position into
 the reserved bits of the desc->more pointer, for example:
 
           +------+     +------+     +------+
 rmapp --> |Desc 0| --> |Desc 1| --> |Desc 2|
           +------+     +------+     +------+
 
 There are 3 descs on the rmap, and:
 rmapp = desc0 | 1UL | 3UL << 50;
 desc0->more = desc1 | 2UL << 50;
 desc1->more = desc2 | 1UL << 50;
 desc2->more = rmapp | 1UL; (The nulls pointer)
 
 We will walk to the next desc only if the position of the current desc
 is >= the position of the next desc. That can make sure we can reach the
 last desc anyway.
 
 And in order to avoid doing too many rewalks, we will go to the slow
 path (do the walk while holding the lock) instead when we retry the walk
 more than N times.

How about this idea? Or do you guys still prefer the idea of lockless on
the first level?




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-28 Thread Xiao Guangrong
On 11/27/2013 03:58 AM, Marcelo Tosatti wrote:
 On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
 On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
 On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:

 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

 snip complicated stuff about parent_pte

 I'm not really following, but note that parent_pte predates EPT (and
 the use of rcu in kvm), so all the complexity that is the result of
 trying to pack as many list entries into a cache line can be dropped.
 Most setups now would have exactly one list entry, which is handled
 specially anyway.

 Alternatively, the trick of storing multiple entries in one list entry
 can be moved to generic code, it may be useful to others.

 Yes, can the lockless list walking code be transformed into generic
 single-linked list walking? So the correctness can be verified
 independently, and KVM becomes a simple user of that interface.

 I'm afraid the single-entry list is not as good as we expected. In my
 experience, there are too many entries on the rmap, more than 300 sometimes
 (consider the case of a lib shared by all processes).
 
 single linked list was about moving singly-linked lockless walking
 to generic code.
 
 http://www.spinics.net/lists/linux-usb/msg39643.html
 http://marc.info/?l=linux-kernel&m=103305635013575&w=3
 

Oh, I confused single-linked and single-entry, sorry about that.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-28 Thread Xiao Guangrong
On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
 On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
 On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).

 Hmm, that can be avoided by checking dirty-bitmap before rewalk,
 that means, if the dirty-bitmap has been set during lockless write-protection,
 it's unnecessary to write-protect its sptes. Your idea?
 This idea is based on the fact that the number of rmap entries is limited by
 RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
 we can break the rewalk at once; in the case of deleting, we can only
 rewalk RMAP_RECYCLE_THRESHOLD times.

 Please explain in more detail.

 Okay.

 My proposal is like this:

 pte_list_walk_lockless()
 {
 restart:

 +  if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
 +return;

  code-doing-lockless-walking;
  ..
 }

 Before doing the lockless walking, we check the dirty-bitmap first; if
 it is set we can simply skip write-protection for the gfn. That is the
 case where a new spte is being added into the rmap while we access the
 rmap locklessly.
 
 The dirty bit could be set after the check.
 
 For the case of deleting an spte from the rmap, the number of entries is
 limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
 
 It can shrink and grow while lockless walk is performed.

Yes, indeed.

Hmmm, another idea in my mind to fix this is encoding the position into
the reserved bits of the desc->more pointer, for example:

          +------+     +------+     +------+
rmapp --> |Desc 0| --> |Desc 1| --> |Desc 2|
          +------+     +------+     +------+

There are 3 descs on the rmap, and:
rmapp = desc0 | 1UL | 3UL << 50;
desc0->more = desc1 | 2UL << 50;
desc1->more = desc2 | 1UL << 50;
desc2->more = rmapp | 1UL; (The nulls pointer)

We will walk to the next desc only if the position of the current desc
is >= the position of the next desc. That can make sure we can reach the
last desc anyway.

And in order to avoid doing too many rewalks, we will go to the slow
path (do the walk while holding the lock) instead when we retry the walk
more than N times.
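
For what it's worth, a rough sketch of that encoding (illustrative only, not
posted code; DESC_POS_SHIFT and the helper names are made up, and a real
kernel pointer would need a different choice of free bits since its high bits
are not actually zero):

#define DESC_POS_SHIFT  50

/* store the position of the pointed-to desc in the pointer's high bits */
static unsigned long encode_desc(struct pte_list_desc *desc, unsigned long pos)
{
        return (unsigned long)desc | (pos << DESC_POS_SHIFT);
}

static unsigned long desc_position(unsigned long encoded)
{
        return encoded >> DESC_POS_SHIFT;
}

static struct pte_list_desc *decode_desc(unsigned long encoded)
{
        return (struct pte_list_desc *)(encoded & ((1UL << DESC_POS_SHIFT) - 1));
}

/*
 * Walk step: positions decrease toward the tail, so only follow ->more
 * when the position does not increase; a walker that has been moved
 * onto a looping chain stops here and can retry or take the slow path.
 */
static bool may_follow(unsigned long cur_encoded, unsigned long next_encoded)
{
        return desc_position(cur_encoded) >= desc_position(next_encoded);
}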

Thanks to all you guys on Thanksgiving day. :)



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-26 Thread Gleb Natapov
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
 On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
  On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
  Also, there is no guarantee of termination (as long as sptes are
  deleted with the correct timing). BTW, can't see any guarantee of
  termination for rculist nulls either (a writer can race with a lockless
  reader indefinitely, restarting the lockless walk every time).
 
  Hmm, that can be avoided by checking dirty-bitmap before rewalk,
  that means, if the dirty-bitmap has been set during lockless write-protection,
  it's unnecessary to write-protect its sptes. Your idea?
  This idea is based on the fact that the number of rmap entries is limited by
  RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
  we can break the rewalk at once; in the case of deleting, we can only
  rewalk RMAP_RECYCLE_THRESHOLD times.
  
  Please explain in more detail.
 
 Okay.
 
 My proposal is like this:
 
 pte_list_walk_lockless()
 {
 restart:
 
 + if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
 + return;
 
   code-doing-lockless-walking;
   ..
 }
 
 Before doing the lockless walking, we check the dirty-bitmap first; if
 it is set we can simply skip write-protection for the gfn. That is the
 case where a new spte is being added into the rmap while we access the
 rmap locklessly.
 
 For the case of deleting an spte from the rmap, the number of entries is
 limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
The point is that the rmap entry that you are inspecting can be constantly
deleted and added to the beginning of some other list, so the code that
traverses the list will never reach the end.

--
Gleb.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-26 Thread Gleb Natapov
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
 On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
  On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
  On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
  xiaoguangr...@linux.vnet.ibm.com wrote:
 
  On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  snip complicated stuff about parent_pte
 
  I'm not really following, but note that parent_pte predates EPT (and
  the use of rcu in kvm), so all the complexity that is the result of
  trying to pack as many list entries into a cache line can be dropped.
  Most setups now would have exactly one list entry, which is handled
  specially anyway.
 
  Alternatively, the trick of storing multiple entries in one list entry
  can be moved to generic code, it may be useful to others.
  
  Yes, can the lockless list walking code be transformed into generic
  single-linked list walking? So the correctness can be verified
  independently, and KVM becomes a simple user of that interface.
 
 I'm afraid the single-entry list is not as good as we expected. In my
 experience, there are too many entries on the rmap, more than 300 sometimes
 (consider the case of a lib shared by all processes).
 
This is without EPT though, and non-EPT HW is not a performance king
anyway. Nested EPT uses shadow paging too, but VMs hardly share any
pages. With KSM they may, though.

--
Gleb.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-26 Thread Marcelo Tosatti
On Tue, Nov 26, 2013 at 11:10:19AM +0800, Xiao Guangrong wrote:
 On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
  On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
  On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
  xiaoguangr...@linux.vnet.ibm.com wrote:
 
  On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  snip complicated stuff about parent_pte
 
  I'm not really following, but note that parent_pte predates EPT (and
  the use of rcu in kvm), so all the complexity that is the result of
  trying to pack as many list entries into a cache line can be dropped.
  Most setups now would have exactly one list entry, which is handled
  specially anyway.
 
  Alternatively, the trick of storing multiple entries in one list entry
  can be moved to generic code, it may be useful to others.
  
  Yes, can the lockless list walking code be transformed into generic
  single-linked list walking? So the correctness can be verified
  independently, and KVM becomes a simple user of that interface.
 
 I'm afraid the single-entry list is not as good as we expected. In my
 experience, there are too many entries on the rmap, more than 300 sometimes
 (consider the case of a lib shared by all processes).

single linked list was about moving singly-linked lockless walking
to generic code.

http://www.spinics.net/lists/linux-usb/msg39643.html
http://marc.info/?l=linux-kernel&m=103305635013575&w=3

  The simpler version is to maintain lockless walk on depth-1 rmap entries
  (and grab the lock once depth-2 entry is found).
 
 I still think rmap-lockless is more graceful: the soft mmu can get benefit
 from it, and it is also promising to be used in some mmu-notify functions. :)

OK.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-26 Thread Marcelo Tosatti
On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
 On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
  On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
  Also, there is no guarantee of termination (as long as sptes are
  deleted with the correct timing). BTW, can't see any guarantee of
  termination for rculist nulls either (a writer can race with a lockless
  reader indefinitely, restarting the lockless walk every time).
 
  Hmm, that can be avoided by checking dirty-bitmap before rewalk,
  that means, if the dirty-bitmap has been set during lockless write-protection,
  it's unnecessary to write-protect its sptes. Your idea?
  This idea is based on the fact that the number of rmap entries is limited by
  RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
  we can break the rewalk at once; in the case of deleting, we can only
  rewalk RMAP_RECYCLE_THRESHOLD times.
  
  Please explain in more detail.
 
 Okay.
 
 My proposal is like this:
 
 pte_list_walk_lockless()
 {
 restart:
 
 + if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
 + return;
 
   code-doing-lockless-walking;
   ..
 }
 
  Before doing the lockless walking, we check the dirty-bitmap first; if
  it is set we can simply skip write-protection for the gfn. That is the
  case where a new spte is being added into the rmap while we access the
  rmap locklessly.

The dirty bit could be set after the check.

 For the case of deleting an spte from the rmap, the number of entries is
 limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.

It can shrink and grow while lockless walk is performed.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Peter Zijlstra
On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).

What's an rculist null? rculists have regular termination conditions,
they'll reach the end (also head, due to circular etc..) in N steps,
where N is the number of elements.

Of course you can keep adding elements to protract this situation, but
you'll run out of memory eventually -- you also have to make sure to
insert them after the element being read, otherwise the iteration will
simply miss them.

Deleting an element preserves the element itself -- it has to be
RCU-freed to be part of an rculist, and the element also preserves its
fwd link, so any iterator will either not see the element, or if they're
at the element, they'll continue their iteration as normal (rculist
doesn't have backward iterators).

A deleted element may not be re-used before an RCU grace period has
lapsed. Same as for freeing such an element. So if you want to move an
rculist element you need to:

  list_del_rcu()
  synchronize_rcu();
  list_add_rcu();

Or use the list_splice_init_rcu() primitive which also explicitly takes
a @sync argument.
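
Spelled out as a (sketchy) helper with made-up struct and function names, that
move sequence is:

#include <linux/rculist.h>

struct my_obj {
        struct list_head node;
        int key;
};

/* move an element from one RCU-protected list to another */
static void my_obj_move(struct my_obj *obj, struct list_head *dst)
{
        list_del_rcu(&obj->node);       /* unlink; readers may still see it */
        synchronize_rcu();              /* wait until no reader can hold it */
        list_add_rcu(&obj->node, dst);  /* now safe to re-link elsewhere */
}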




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Gleb Natapov
On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
  
  For example, nothing prevents lockless walker to move into some
  parent_ptes chain, right?
 
 No.
 
 The nulls can help us to detect this case: for parent_ptes, the nulls points
 to the shadow page, but for rmaps, the nulls points to slot.arch.rmap. There
 is no chance that the rmap is used as a shadow page while slots_lock is held.
 
But meanwhile we will write protect non-last level sptes, no? Better to
create separate slab caches for rmap and parent_ptes lists.

--
Gleb.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/25/2013 06:19 PM, Gleb Natapov wrote:
 On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:

 For example, nothing prevents lockless walker to move into some
 parent_ptes chain, right?

 No.

 The nulls can help us to detect this case: for parent_ptes, the nulls points
 to the shadow page, but for rmaps, the nulls points to slot.arch.rmap. There
 is no chance that the rmap is used as a shadow page while slots_lock is held.

 But meanwhile we will write protect non-last level sptes, no? Better to

It will meet the non-last sptes but does not write-protect them since
we do the is_last_spte() check before the cmpxchg.
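
Roughly, that check looks like the sketch below (the function name is made up
and this is not the series' exact code):

static void spte_write_protect_lockless(u64 *sptep, int level)
{
        u64 spte = ACCESS_ONCE(*sptep);

        /* parent_ptes point to non-last-level sptes: they fail this test */
        if (!is_shadow_present_pte(spte) || !is_last_spte(spte, level))
                return;

        if (!(spte & PT_WRITABLE_MASK))
                return;

        /* only clear the W bit if the spte did not change under us */
        cmpxchg64(sptep, spte, spte & ~PT_WRITABLE_MASK);
}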

 create separate slab caches for rmap and parent_ptes lists.

Yes, this is a good idea. Thank you, Gleb!



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong

Hi Peter,

On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
 On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
  reader indefinitely, restarting the lockless walk every time).
 
 What's an rculist null? 

I guess Marcelo was talking about rculist_nulls.h
(Documentation/RCU/rculist_nulls.txt).
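
For reference, the lookup pattern from that document looks roughly like this
(my_obj/my_lookup are made-up names; a real user would also take a reference
on the object before dropping the RCU read lock):

#include <linux/rculist_nulls.h>

struct my_obj {
        struct hlist_nulls_node node;
        unsigned int key;
};

static struct my_obj *my_lookup(struct hlist_nulls_head *table,
                                unsigned int slot, unsigned int key)
{
        struct my_obj *obj;
        struct hlist_nulls_node *pos;

        rcu_read_lock();
begin:
        hlist_nulls_for_each_entry_rcu(obj, pos, &table[slot], node)
                if (obj->key == key)
                        goto out;
        obj = NULL;
        /* chain ended on a foreign nulls value: the object we were on
         * was moved to another chain, so restart the lookup */
        if (get_nulls_value(pos) != slot)
                goto begin;
out:
        rcu_read_unlock();
        return obj;
}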

 rculists have regular termination conditions,
 they'll reach the end (also head, due to circular etc..) in N steps,
 where N is the number of elements.
 
 Of course you can keep adding elements to protract this situation, but
 you'll run out of memory eventually -- you also have to make sure to
 insert them after the element being read, otherwise the iteration will
 simply miss them.
 
 Deleting an element preserves the element itself -- it has to be
 RCU-freed to be part of an rculist, and the element also preserves its
 fwd link, so any iterator will either not see the element, or if they're
 at the element, they'll continue their iteration as normal (rculist
 doesn't have backward iterators).
 
 A deleted element may not be re-used before an RCU grace period has
 lapsed. Same as for freeing such an element. So if you want to move an
 rculist element you need to:
 
   list_del_rcu()
   synchronize_rcu();
   list_add_rcu();
 
 Or use the list_splice_init_rcu() primitive which also explicitly takes
 a @sync argument.

Thanks for your detailed explanation, Peter!

What about if the element is allocated from a SLAB_DESTROY_BY_RCU slab? That
means the element may be reused while we do the iteration.




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
 
 Hi Peter,
 
 On 11/25/2013 05:31 PM, Peter Zijlstra wrote:
  On Fri, Nov 22, 2013 at 05:14:29PM -0200, Marcelo Tosatti wrote:
  Also, there is no guarantee of termination (as long as sptes are
  deleted with the correct timing). BTW, can't see any guarantee of
  termination for rculist nulls either (a writer can race with a lockless
   reader indefinitely, restarting the lockless walk every time).
  
  What's an rculist null? 
 
 I guess Marcelo was talking about rculist_nulls.h
 (Documentation/RCU/rculist_nulls.txt).

Oh, let me have a look, I don't think I've really looked at that yet.

  rculists have regular termination conditions,
  they'll reach the end (also head, due to circular etc..) in N steps,
  where N is the number of elements.
  
  Of course you can keep adding elements to protract this situation, but
  you'll run out of memory eventually -- you also have to make sure to
  insert them after the element being read, otherwise the iteration will
  simply miss them.
  
  Deleting an element preserves the element itself -- it has to be
  RCU-freed to be part of an rculist, and the element also preserves its
  fwd link, so any iterator will either not see the element, or if they're
  at the element, they'll continue their iteration as normal (rculist
  doesn't have backward iterators).
  
  A deleted element may not be re-used before an RCU grace period has
  lapsed. Same as for freeing such an element. So if you want to move an
  rculist element you need to:
  
list_del_rcu()
synchronize_rcu();
list_add_rcu();
  
  Or use the list_splice_init_rcu() primitive which also explicitly takes
  a @sync argument.
 
 Thanks for your detailed explanation, Peter!
 
 What about if the element is allocated from a SLAB_DESTROY_BY_RCU slab? That
 means the element may be reused while we do the iteration.

Then it's broken; SLAB_DESTROY_BY_RCU rarely does what people expect it
to, which is why I wrote that honkin' huge comment right next to it.
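
(For readers following along: the slab.h comment Peter refers to boils down
to something like the sketch below, with made-up names and a hypothetical
lookup_lockless() helper. The memory stays a valid object of the same type
during the grace period, but it may be recycled for a different instance at
any moment, so the object has to be stabilised and re-checked after lookup.)

struct my_obj {
        spinlock_t lock;
        unsigned int key;
};

static struct my_obj *get_stable_obj(unsigned int key)
{
        struct my_obj *obj;

        rcu_read_lock();
        obj = lookup_lockless(key);             /* hypothetical hash lookup */
        if (obj) {
                spin_lock(&obj->lock);          /* stabilise the object */
                if (obj->key != key) {          /* reused for another instance? */
                        spin_unlock(&obj->lock);
                        obj = NULL;             /* caller must retry the lookup */
                }
        }
        rcu_read_unlock();
        return obj;                             /* locked on success */
}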




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Peter Zijlstra
On Mon, Nov 25, 2013 at 12:05:00PM +0100, Peter Zijlstra wrote:
 On Mon, Nov 25, 2013 at 06:59:24PM +0800, Xiao Guangrong wrote:
  I guess Marcelo was talking about rculist_nulls.h
  (Documentation/RCU/rculist_nulls.txt).
 
 Oh, let me have a look, I don't think I've really looked at that yet.

Egads, that's far too clever ;-)

Yes, if that's what Marcelo was referring to, he's right: it doesn't
have a proper termination condition in the face of unlimited
modification.

And it can indeed use SLAB_DESTROY_BY_RCU as you alluded to -- although
the documentation is a tad scarce explaining why this is so.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Avi Kivity
On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:

 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

snip complicated stuff about parent_pte

I'm not really following, but note that parent_pte predates EPT (and
the use of rcu in kvm), so all the complexity that is the result of
trying to pack as many list entries into a cache line can be dropped.
Most setups now would have exactly one list entry, which is handled
specially anyway.

Alternatively, the trick of storing multiple entries in one list entry
can be moved to generic code, it may be useful to others.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Marcelo Tosatti
On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
 On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 
  On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
 snip complicated stuff about parent_pte
 
 I'm not really following, but note that parent_pte predates EPT (and
 the use of rcu in kvm), so all the complexity that is the result of
 trying to pack as many list entries into a cache line can be dropped.
 Most setups now would have exactly one list entry, which is handled
 specially anyway.
 
 Alternatively, the trick of storing multiple entries in one list entry
 can be moved to generic code, it may be useful to others.

Yes, can the lockless list walking code be transformed into generic
single-linked list walking? So the correctness can be verified
independently, and KVM becomes a simple user of that interface.

The simpler version is to maintain lockless walk on depth-1 rmap entries
(and grab the lock once depth-2 entry is found).
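
In other words, something along these lines (a sketch with invented helper
names, not a real implementation): the rmap word either holds a single spte
pointer or, with the low bit set, a pointer to a desc chain, and only the
single-spte case is touched without the lock.

static void write_protect_rmap(struct kvm *kvm, unsigned long *rmapp, int level)
{
        unsigned long rmap = ACCESS_ONCE(*rmapp);

        if (!rmap)
                return;

        if (!(rmap & 1)) {
                /* depth 1: a single spte, handled locklessly */
                spte_write_protect_lockless((u64 *)rmap, level);
                return;
        }

        /* depth 2: a desc chain, fall back to mmu_lock for the walk */
        spin_lock(&kvm->mmu_lock);
        /* ... locked walk over the pte_list_desc chain ... */
        spin_unlock(&kvm->mmu_lock);
}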



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Marcelo Tosatti
On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:
 
 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
  It is like a nulls list and we use the pte-list as the nulls, which can help
  us to detect whether the desc is moved to another rmap; then we can re-walk
  the rmap if that happened
  
  kvm->slots_lock is held when we do the lockless walking; that prevents the
  rmap from being reused (freeing an rmap needs to hold that lock) so that we
  can not see the same nulls used on different rmaps
  
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  
  How about simplified lockless walk on the slot while rmapp entry
  contains a single spte? (which should be the case with two-dimensional
  paging).
  
  That is, grab the lock when finding a rmap with more than one spte in
  it (and then keep it locked until the end).
 
  Hmm… that isn't straightforward and is more complex than the approach
  in this patchset. Also it would drop the improvement for the shadow mmu,
  which gets a great improvement from this patchset.

It is not more complex, since it would remove list lockless walk. Only
  the spte pointer at rmap[spte] is accessed without a lock. It's much, much
simpler.

  For example, nothing prevents lockless walker to move into some
  parent_ptes chain, right?
 
 No.
 
  The nulls can help us to detect this case: for parent_ptes, the nulls points
  to the shadow page, but for rmaps, the nulls points to slot.arch.rmap. There
  is no chance that the rmap is used as a shadow page while slots_lock is held.

  The SLAB cache is the same, so entries can be reused. What prevents
  a desc entry living in slot.arch.rmap from being freed and reused by a
  parent_ptes desc?

  Also, there is no guarantee of termination (as long as sptes are
  deleted with the correct timing). BTW, can't see any guarantee of
  termination for rculist nulls either (a writer can race with a lockless
  reader indefinitely, restarting the lockless walk every time).
 
 Hmm, that can be avoided by checking dirty-bitmap before rewalk,
 that means, if the dirty-bitmap has been set during lockless write-protection,
 it’s unnecessary to write-protect its sptes. Your idea?
 
 But… do we really need to care about it? :(

See my reply to Avi.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Gleb Natapov
On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
  On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
  xiaoguangr...@linux.vnet.ibm.com wrote:
  
   On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  
  snip complicated stuff about parent_pte
  
  I'm not really following, but note that parent_pte predates EPT (and
  the use of rcu in kvm), so all the complexity that is the result of
  trying to pack as many list entries into a cache line can be dropped.
  Most setups now would have exactly one list entry, which is handled
   specially anyway.
  
  Alternatively, the trick of storing multiple entries in one list entry
  can be moved to generic code, it may be useful to others.
 
 Yes, can the lockless list walking code be transformed into generic
 single-linked list walking? So the correctness can be verified
 independently, and KVM becomes a simple user of that interface.
 
The code will become simpler but the problem of never ending walk of
rculist_nulls will remain.

 The simpler version is to maintain lockless walk on depth-1 rmap entries
 (and grab the lock once depth-2 entry is found).
And release it between each rmap walk or at the very end of write
protect?

--
Gleb.


Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Marcelo Tosatti
On Mon, Nov 25, 2013 at 04:29:28PM +0200, Gleb Natapov wrote:
 On Mon, Nov 25, 2013 at 12:23:51PM -0200, Marcelo Tosatti wrote:
  On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
   On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
   xiaoguangr...@linux.vnet.ibm.com wrote:
   
 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
   
   snip complicated stuff about parent_pte
   
   I'm not really following, but note that parent_pte predates EPT (and
   the use of rcu in kvm), so all the complexity that is the result of
   trying to pack as many list entries into a cache line can be dropped.
   Most setups now would have exactly one list entry, which is handled
    specially anyway.
   
   Alternatively, the trick of storing multiple entries in one list entry
   can be moved to generic code, it may be useful to others.
  
  Yes, can the lockless list walking code be transformed into generic
  single-linked list walking? So the correctness can be verified
  independently, and KVM becomes a simple user of that interface.
  
 The code will become simpler but the problem of never ending walk of
 rculist_nulls will remain.

Yes. Hopefully it can be fixed in some way.

  The simpler version is to maintain lockless walk on depth-1 rmap entries
  (and grab the lock once depth-2 entry is found).
 And release it between each rmap walk or at the very end of write
 protect?

Or keep it locked until 10 consecutive sptes with depth 1 are found.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Marcelo Tosatti
On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
  Also, there is no guarantee of termination (as long as sptes are
  deleted with the correct timing). BTW, can't see any guarantee of
  termination for rculist nulls either (a writer can race with a lockless
   reader indefinitely, restarting the lockless walk every time).
  
  Hmm, that can be avoided by checking dirty-bitmap before rewalk,
  that means, if the dirty-bitmap has been set during lockless write-protection,
  it's unnecessary to write-protect its sptes. Your idea?
 This idea is based on the fact that the number of rmap entries is limited by
 RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
 we can break the rewalk at once; in the case of deleting, we can only
 rewalk RMAP_RECYCLE_THRESHOLD times.

Please explain in more detail.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/25/2013 10:08 PM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:11:31PM +0800, Xiao Guangrong wrote:

 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
 It is like a nulls list and we use the pte-list as the nulls, which can help
 us to detect whether the desc is moved to another rmap; then we can re-walk
 the rmap if that happened
 
 kvm->slots_lock is held when we do the lockless walking; that prevents the
 rmap from being reused (freeing an rmap needs to hold that lock) so that we
 can not see the same nulls used on different rmaps

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

 How about simplified lockless walk on the slot while rmapp entry
 contains a single spte? (which should be the case with two-dimensional
 paging).

 That is, grab the lock when finding a rmap with more than one spte in
 it (and then keep it locked until the end).

 Hmm… that isn't straightforward and is more complex than the approach
 in this patchset. Also it would drop the improvement for the shadow mmu,
 which gets a great improvement from this patchset.
 
 It is not more complex, since it would remove list lockless walk. Only
  the spte pointer at rmap[spte] is accessed without a lock. It's much, much
 simpler.
 
 For example, nothing prevents lockless walker to move into some
 parent_ptes chain, right?

 No.

 The nulls can help us to detect this case: for parent_ptes, the nulls points
 to the shadow page, but for rmaps, the nulls points to slot.arch.rmap. There
 is no chance that the rmap is used as a shadow page while slots_lock is held.
 
  The SLAB cache is the same, so entries can be reused. What prevents
  a desc entry living in slot.arch.rmap from being freed and reused by a
  parent_ptes desc?
 

We will check is_last_spte(): all the sptes on parent_ptes should fail it.
And Gleb suggested using a separate slab for rmap; that should be excellent.



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/25/2013 10:23 PM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:48:37PM +0200, Avi Kivity wrote:
 On Mon, Nov 25, 2013 at 8:11 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:

 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

 snip complicated stuff about parent_pte

 I'm not really following, but note that parent_pte predates EPT (and
 the use of rcu in kvm), so all the complexity that is the result of
 trying to pack as many list entries into a cache line can be dropped.
 Most setups now would have exactly one list entry, which is handled
 specially anyway.

 Alternatively, the trick of storing multiple entries in one list entry
 can be moved to generic code, it may be useful to others.
 
 Yes, can the lockless list walking code be transformed into generic
 single-linked list walking? So the correctness can be verified
 independently, and KVM becomes a simple user of that interface.

I'm afraid the single-entry list is not as good as we expected. In my
experience, there are too many entries on the rmap, more than 300 sometimes
(consider the case of a lib shared by all processes).

 
 The simpler version is to maintain lockless walk on depth-1 rmap entries
 (and grab the lock once depth-2 entry is found).

I still think rmap-lockless is more graceful: the soft mmu can get benefit
from it, and it is also promising to be used in some mmu-notify functions. :)




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-25 Thread Xiao Guangrong
On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
 On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).

 Hmm, that can be avoided by checking dirty-bitmap before rewalk,
 that means, if the dirty-bitmap has been set during lockless write-protection,
 it's unnecessary to write-protect its sptes. Your idea?
 This idea is based on the fact that the number of rmap entries is limited by
 RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
 we can break the rewalk at once; in the case of deleting, we can only
 rewalk RMAP_RECYCLE_THRESHOLD times.
 
 Please explain in more detail.

Okay.

My proposal is like this:

pte_list_walk_lockless()
{
restart:

+   if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
+   return;

code-doing-lockless-walking;
..
}

Before doing the lockless walking, we check the dirty-bitmap first; if
it is set we can simply skip write-protection for the gfn. That is the
case where a new spte is being added into the rmap while we access the
rmap locklessly.

For the case of deleting an spte from the rmap, the number of entries is
limited by RMAP_RECYCLE_THRESHOLD, so it is not endless.
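
Fleshed out slightly (still only a sketch; everything except test_bit() and
the slot fields is a hypothetical name), the idea is:

static void rmap_write_protect_lockless(struct kvm_memory_slot *slot,
                                        unsigned long *rmapp, gfn_t gfn)
{
        unsigned long gfn_offset = gfn - slot->base_gfn;

restart:
        /* someone made the gfn writable and dirty again: give up */
        if (test_bit(gfn_offset, slot->dirty_bitmap))
                return;

        /* ... lockless walk over *rmapp, write-protecting last-level sptes ... */

        if (desc_moved_to_another_rmap())       /* hypothetical nulls check */
                goto restart;
}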



Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-24 Thread Xiao Guangrong

On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:

 On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
 It is like a nulls list and we use the pte-list as the nulls, which can help
 us to detect whether the desc is moved to another rmap; then we can re-walk
 the rmap if that happened
 
 kvm->slots_lock is held when we do the lockless walking; that prevents the
 rmap from being reused (freeing an rmap needs to hold that lock) so that we
 can not see the same nulls used on different rmaps
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 
 How about simplified lockless walk on the slot while rmapp entry
 contains a single spte? (which should be the case with two-dimensional
 paging).
 
 That is, grab the lock when finding a rmap with more than one spte in
 it (and then keep it locked until the end).

Hmm… that isn't straightforward and is more complex than the approach
in this patchset. Also it would drop the improvement for the shadow mmu,
which gets a great improvement from this patchset.

 
 For example, nothing prevents lockless walker to move into some
 parent_ptes chain, right?

No.

The nulls can help us to detect this case: for parent_ptes, the nulls points
to the shadow page, but for rmaps, the nulls points to slot.arch.rmap. There
is no chance that the rmap is used as a shadow page while slots_lock is held.

 
 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).

Hmm, that can be avoided by checking dirty-bitmap before rewalk,
that means, if the dirty-bitmap has been set during lockless write-protection,
it’s unnecessary to write-protect its sptes. Your idea?

But… do we really need to care about it? :(








Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-24 Thread Xiao Guangrong
On 11/25/2013 02:11 PM, Xiao Guangrong wrote:
 
 On Nov 23, 2013, at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 
 On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
 It is like a nulls list and we use the pte-list as the nulls, which can help
 us to detect whether the desc is moved to another rmap; then we can re-walk
 the rmap if that happened
 
 kvm->slots_lock is held when we do the lockless walking; that prevents the
 rmap from being reused (freeing an rmap needs to hold that lock) so that we
 can not see the same nulls used on different rmaps

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

 How about simplified lockless walk on the slot while rmapp entry
 contains a single spte? (which should be the case with two-dimensional
 paging).

 That is, grab the lock when finding a rmap with more than one spte in
 it (and then keep it locked until the end).
 
 Hmm... that isn't straightforward and is more complex than the approach
 in this patchset. Also it would drop the improvement for the shadow mmu,
 which gets a great improvement from this patchset.
 

 For example, nothing prevents lockless walker to move into some
 parent_ptes chain, right?
 
 No.
 
 The nulls can help us to detect this case: for parent_ptes, the nulls points
 to the shadow page, but for rmaps, the nulls points to slot.arch.rmap. There
 is no chance that the rmap is used as a shadow page while slots_lock is held.
 

 Also, there is no guarantee of termination (as long as sptes are
 deleted with the correct timing). BTW, can't see any guarantee of
 termination for rculist nulls either (a writer can race with a lockless
 reader indefinitely, restarting the lockless walk every time).
 
 Hmm, that can be avoided by checking dirty-bitmap before rewalk,
 that means, if the dirty-bitmap has been set during lockless write-protection,
 it's unnecessary to write-protect its sptes. Your idea?

This idea is based on the fact that the number of rmap entries is limited by
RMAP_RECYCLE_THRESHOLD. So, in the case of adding a new spte into the rmap,
we can break the rewalk at once; in the case of deleting, we can only
rewalk RMAP_RECYCLE_THRESHOLD times.




Re: [PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-11-22 Thread Marcelo Tosatti
On Wed, Oct 23, 2013 at 09:29:25PM +0800, Xiao Guangrong wrote:
 It is like a nulls list and we use the pte-list as the nulls, which can help
 us to detect whether the desc is moved to another rmap; then we can re-walk
 the rmap if that happened
 
 kvm->slots_lock is held when we do the lockless walking; that prevents the
 rmap from being reused (freeing an rmap needs to hold that lock) so that we
 can not see the same nulls used on different rmaps
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

How about simplified lockless walk on the slot while rmapp entry
contains a single spte? (which should be the case with two-dimensional
paging).

That is, grab the lock when finding a rmap with more than one spte in
it (and then keep it locked until the end).

For example, nothing prevents lockless walker to move into some
parent_ptes chain, right?

Also, there is no guarantee of termination (as long as sptes are
deleted with the correct timing). BTW, can't see any guarantee of
termination for rculist nulls either (a writer can race with a lockless
reader indefinitely, restarting the lockless walk every time).



[PATCH v3 07/15] KVM: MMU: introduce nulls desc

2013-10-23 Thread Xiao Guangrong
It is like a nulls list and we use the pte-list as the nulls, which can help
us to detect whether the desc is moved to another rmap; then we can re-walk
the rmap if that happened

kvm->slots_lock is held when we do the lockless walking; that prevents the
rmap from being reused (freeing an rmap needs to hold that lock) so that we
can not see the same nulls used on different rmaps

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5cce039..4687329 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -913,6 +913,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
         return level - 1;
 }
 
+static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc *desc)
+{
+        unsigned long marker;
+
+        marker = (unsigned long)pte_list | 1UL;
+        desc->more = (struct pte_list_desc *)marker;
+}
+
+static bool desc_is_a_nulls(struct pte_list_desc *desc)
+{
+        return (unsigned long)desc & 1;
+}
+
+static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
+{
+        return (unsigned long *)((unsigned long)desc & ~1);
+}
+
 static int __find_first_free(struct pte_list_desc *desc)
 {
         int i;
@@ -951,7 +969,7 @@ static int count_spte_number(struct pte_list_desc *desc)
 
         first_free = __find_first_free(desc);
 
-        for (desc_num = 0; desc->more; desc = desc->more)
+        for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
                 desc_num++;
 
         return first_free + desc_num * PTE_LIST_EXT;
@@ -985,6 +1003,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
                 desc = mmu_alloc_pte_list_desc(vcpu);
                 desc->sptes[0] = (u64 *)*pte_list;
                 desc->sptes[1] = spte;
+                desc_mark_nulls(pte_list, desc);
                 *pte_list = (unsigned long)desc | 1;
                 return 1;
         }
@@ -1030,7 +1049,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
         /*
          * Only one entry existing but still use a desc to store it?
          */
-        WARN_ON(!next_desc);
+        WARN_ON(desc_is_a_nulls(next_desc));
 
         mmu_free_pte_list_desc(first_desc);
         *pte_list = (unsigned long)next_desc | 1ul;
@@ -1041,7 +1060,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
          * Only one entry in this desc, move the entry to the head
          * then the desc can be freed.
          */
-        if (!first_desc->sptes[1] && !first_desc->more) {
+        if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
                 *pte_list = (unsigned long)first_desc->sptes[0];
                 mmu_free_pte_list_desc(first_desc);
         }
@@ -1070,7 +1089,7 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 
                 rmap_printk("pte_list_remove:  %p many->many\n", spte);
                 desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-                while (desc) {
+                while (!desc_is_a_nulls(desc)) {
                         for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
                                 if (desc->sptes[i] == spte) {
                                         pte_list_desc_remove_entry(pte_list,
@@ -1097,11 +1116,13 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
                 return fn((u64 *)*pte_list);
 
         desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-        while (desc) {
+        while (!desc_is_a_nulls(desc)) {
                 for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
                         fn(desc->sptes[i]);
                 desc = desc->more;
         }
+
+        WARN_ON(desc_get_nulls_value(desc) != pte_list);
 }
 
 static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
@@ -1184,6 +1205,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
 
         iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
         iter->pos = 0;
+        WARN_ON(desc_is_a_nulls(iter->desc));
         return iter->desc->sptes[iter->pos];
 }
 
@@ -1204,7 +1226,8 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
                         return sptep;
                 }
 
-        iter->desc = iter->desc->more;
+        iter->desc = desc_is_a_nulls(iter->desc->more) ?
+                        NULL : iter->desc->more;
 
         if (iter->desc) {
                 iter->pos = 0;
-- 
1.8.1.4
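
(Not part of this patch: the lockless user added later in the series looks
roughly like the sketch below, which is how the nulls value is meant to be
consumed — when the tail's nulls does not name the rmap we started from, the
desc was moved under us and the walk restarts. The actual series' function
may differ.)

static void pte_list_walk_lockless(unsigned long *pte_list, pte_list_walk_fn fn)
{
        struct pte_list_desc *desc;
        unsigned long pte_list_value;
        int i;

restart:
        pte_list_value = ACCESS_ONCE(*pte_list);
        if (!pte_list_value)
                return;

        if (!(pte_list_value & 1)) {
                fn((u64 *)pte_list_value);      /* single spte, no desc */
                return;
        }

        desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
        while (!desc_is_a_nulls(desc)) {
                for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
                        fn(desc->sptes[i]);
                desc = ACCESS_ONCE(desc->more);
        }

        /* we ended up on another rmap's chain: rewalk from the start */
        if (desc_get_nulls_value(desc) != pte_list)
                goto restart;
}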
