Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-16 Thread Marcelo Tosatti
On Wed, Oct 16, 2013 at 12:12:11PM +0300, Gleb Natapov wrote:
> On Tue, Oct 15, 2013 at 07:21:19PM -0300, Marcelo Tosatti wrote:
> > On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
> > > > 
> > > > Why is it safe to allow access, by the lockless page write protect
> > > > side, to the spt pointer for shadow page A that can change to the spt
> > > > pointer of shadow page B?
> > > > 
> > > > Write protect the spte of any page at will? Or verify that it is in fact
> > > > the shadow page you want to write protect?
> > > > 
> > > > Note that the spte value might be the same for different shadow pages,
> > > > so a cmpxchg succeeding does not guarantee it is the same shadow page that
> > > > has been protected.
> > > > 
> > > Two things can happen: the spte that we accidentally write protect is some
> > > other last-level spte - this is benign, it will be unprotected on the next
> > > fault.
> > 
> > Nothing forbids two identical writable sptes from pointing to the same pfn.
> > How do you know you are write protecting the correct one (the proper gfn)?
> > 
> I am not sure I understand what you mean. If the spt was freed and reallocated
> while the lockless shadow page walk happened we are definitely write protecting
> the incorrect one, but this is not a problem unless the spte that is write
> protected is not last level, and there are ways to solve that.

I was assuming that a cmpxchg success on the wrong spte would be problematic,
but Xiao says it is detectable on the lockless rmap path.

> > Lockless walk sounds interesting. By the time you get to the lower
> > level, that might be a different spte.
> > 
> > All of this to avoid throttling - is it worthwhile?
> >
> I do not like kvm->arch.rcu_free_shadow_page, it feels like a hack. I
> proposed a slab RCU solution before, but it needs some more work to encode
> the pt level into the spte, so Xiao wanted to do it on top. But since
> something, either throttling or slab RCU, needs to be implemented anyway,
> I prefer the latter.

Yes, seems OK to me.



Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-16 Thread Gleb Natapov
On Tue, Oct 15, 2013 at 07:21:19PM -0300, Marcelo Tosatti wrote:
> On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
> > > 
> > > Why is it safe to allow access, by the lockless page write protect
> > > side, to the spt pointer for shadow page A that can change to the spt
> > > pointer of shadow page B?
> > > 
> > > Write protect the spte of any page at will? Or verify that it is in fact
> > > the shadow page you want to write protect?
> > > 
> > > Note that the spte value might be the same for different shadow pages,
> > > so a cmpxchg succeeding does not guarantee it is the same shadow page that
> > > has been protected.
> > > 
> > Two things can happen: the spte that we accidentally write protect is some
> > other last-level spte - this is benign, it will be unprotected on the next
> > fault.
> 
> Nothing forbids two identical writable sptes from pointing to the same pfn.
> How do you know you are write protecting the correct one (the proper gfn)?
> 
I am not sure I understand what you mean. If the spt was freed and reallocated
while the lockless shadow page walk happened we are definitely write protecting
the incorrect one, but this is not a problem unless the spte that is write
protected is not last level, and there are ways to solve that.

> Lockless walk sounds interesting. By the time you get to the lower
> level, that might be a different spte.
> 
> > All of this to avoid throttling - is it worthwhile?
>
I do not like kvm->arch.rcu_free_shadow_page, it feels like a hack. I
proposed a slab RCU solution before, but it needs some more work to encode
the pt level into the spte, so Xiao wanted to do it on top. But since
something, either throttling or slab RCU, needs to be implemented anyway,
I prefer the latter.

--
Gleb.


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-15 Thread Xiao Guangrong

On Oct 16, 2013, at 6:21 AM, Marcelo Tosatti  wrote:

> On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
>>> 
>>> Why is it safe to allow access, by the lockless page write protect
>>> side, to the spt pointer for shadow page A that can change to the spt
>>> pointer of shadow page B?
>>> 
>>> Write protect the spte of any page at will? Or verify that it is in fact
>>> the shadow page you want to write protect?
>>> 
>>> Note that the spte value might be the same for different shadow pages,
>>> so a cmpxchg succeeding does not guarantee it is the same shadow page that
>>> has been protected.
>>> 
>> Two things can happen: the spte that we accidentally write protect is some
>> other last-level spte - this is benign, it will be unprotected on the next
>> fault.
> 
> Nothing forbids two identical writable sptes from pointing to the same pfn.
> How do you know you are write protecting the correct one (the proper gfn)?
> 
> Lockless walk sounds interesting. By the time you get to the lower
> level, that might be a different spte.

That's safe. Since get-dirty-log is serialized by the slots lock, the dirty
bit cannot be lost - even if we write-protect on a different memslot, the
dirty bit is still set. The worst case is that we write-protect an
unnecessary spte and cause an extra #PF, but that race is really rare.

And the lockless rmap walker can detect the new spte, so write protection
on the memslot is not missed.
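
A minimal sketch of the lockless write protect this describes, with
illustrative helper names (this is not the actual patch code):

    /*
     * Clear the writable bit of one spte without holding mmu_lock.
     * If the cmpxchg loses a race the spte changed under us, so re-read
     * and re-check: at worst the guest takes one spurious #PF, and the
     * dirty bit was already handled under the slots lock.
     */
    static void spte_write_protect_lockless(u64 *sptep)
    {
            u64 spte = ACCESS_ONCE(*sptep);

            while (is_writable_pte(spte)) {
                    u64 old = cmpxchg64(sptep, spte,
                                        spte & ~PT_WRITABLE_MASK);
                    if (old == spte)
                            break;          /* write protected it */
                    spte = old;             /* changed under us, retry */
            }
    }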



Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-15 Thread Marcelo Tosatti
On Tue, Oct 15, 2013 at 06:57:05AM +0300, Gleb Natapov wrote:
> > 
> > Why is it safe to allow access, by the lockless page write protect
> > side, to the spt pointer for shadow page A that can change to the spt
> > pointer of shadow page B?
> > 
> > Write protect the spte of any page at will? Or verify that it is in fact
> > the shadow page you want to write protect?
> > 
> > Note that the spte value might be the same for different shadow pages,
> > so a cmpxchg succeeding does not guarantee it is the same shadow page that
> > has been protected.
> > 
> Two things can happen: the spte that we accidentally write protect is some
> other last-level spte - this is benign, it will be unprotected on the next
> fault.

Nothing forbids two identical writable sptes from pointing to the same pfn.
How do you know you are write protecting the correct one (the proper gfn)?

Lockless walk sounds interesting. By the time you get to the lower
level, that might be a different spte.

All of this to avoid throttling - is it worthwhile?

> If the spte is not last level this is a problem, and Xiao proposes to
> fix it by encoding the spte level into the spte itself. Another way to fix it
> is to handle the fault caused by write-protected middle sptes in KVM -
> just unprotect them and go back to the guest.
> 
> --
>   Gleb.


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-14 Thread Gleb Natapov
On Mon, Oct 14, 2013 at 04:29:45PM -0300, Marcelo Tosatti wrote:
> On Sat, Oct 12, 2013 at 08:53:56AM +0300, Gleb Natapov wrote:
> > On Fri, Oct 11, 2013 at 05:30:17PM -0300, Marcelo Tosatti wrote:
> > > On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > > > > n_max_mmu_pages is not a suitable limit to throttle freeing of pages
> > > > > via RCU (it is too large). If the free memory watermarks are smaller
> > > > > than n_max_mmu_pages for all guests, OOM is possible.
> > > > > 
> > > > Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> > > > that the slab size will be bounded, so hopefully the shrinker will touch
> > > > it rarely.
> > > > 
> > > > > > > > and, in addition, a page released to the slab is immediately
> > > > > > > > available for allocation, no need to wait for a grace period.
> > > > > > > 
> > > > > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > > > > 
> > > > > > This comment is exactly what I was referring to in the code you
> > > > > > quoted. Do you see anything problematic in what the comment describes?
> > > > > 
> > > > > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > > > > delay object freeing." The page is not available for allocation.
> > > > By "page" I mean "spt page" which is a slab object. So "spt page"
> > > > AKA slab object will be available fo allocation immediately.
> > > 
> > > The object is reusable within that SLAB cache only, not the
> > > entire system (therefore it does not prevent an OOM condition).
> > > 
> > Since the object is allocatable immediately by the shadow paging code, the
> > number of SLAB objects is bounded by n_max_mmu_pages. If there is not enough
> > memory for n_max_mmu_pages an OOM condition can happen anyway, since the
> > shadow paging code will usually have exactly n_max_mmu_pages allocated.
> > 
> > > OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling 
> > > is still necessary, as described in the RCU documentation.
> > > 
> I do not see what should be throttled if we use SLAB_DESTROY_BY_RCU. RCU
> comes into play only when the SLAB cache is shrunk, and that happens far
> from kvm code.
> 
> You are right.
> 
> Why is it safe to allow access, by the lockless page write protect
> side, to the spt pointer for shadow page A that can change to the spt
> pointer of shadow page B?
> 
> Write protect the spte of any page at will? Or verify that it is in fact
> the shadow page you want to write protect?
> 
> Note that the spte value might be the same for different shadow pages,
> so a cmpxchg succeeding does not guarantee it is the same shadow page that
> has been protected.
> 
Two things can happen: the spte that we accidentally write protect is some
other last-level spte - this is benign, it will be unprotected on the next
fault.  If the spte is not last level this is a problem, and Xiao proposes to
fix it by encoding the spte level into the spte itself. Another way to fix it
is to handle the fault caused by write-protected middle sptes in KVM -
just unprotect them and go back to the guest.

--
Gleb.
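
For reference, a sketch of what encoding the level into the spte could look
like, so the lockless walker can tell last-level sptes from middle ones. The
bit positions here are illustrative assumptions, not what the patch chose:

    /* Hypothetical: stash the level in software-available spte bits. */
    #define SPTE_LEVEL_SHIFT        52
    #define SPTE_LEVEL_MASK         (7ULL << SPTE_LEVEL_SHIFT)

    static u64 spte_encode_level(u64 spte, int level)
    {
            return (spte & ~SPTE_LEVEL_MASK) |
                   ((u64)level << SPTE_LEVEL_SHIFT);
    }

    static bool spte_is_last_level(u64 spte)
    {
            /* PT_PAGE_TABLE_LEVEL == 1 on x86 */
            return ((spte & SPTE_LEVEL_MASK) >> SPTE_LEVEL_SHIFT) ==
                   PT_PAGE_TABLE_LEVEL;
    }

With that, the lockless side can simply skip anything that does not claim to
be a last-level spte, which keeps the accidental case benign.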


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-14 Thread Marcelo Tosatti
On Sat, Oct 12, 2013 at 08:53:56AM +0300, Gleb Natapov wrote:
> On Fri, Oct 11, 2013 at 05:30:17PM -0300, Marcelo Tosatti wrote:
> > On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > > > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > > > RCU (it is too large). If the free memory watermarks are smaller than
> > > > n_max_mmu_pages for all guests, OOM is possible.
> > > > 
> > > Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> > > that the slab size will be bounded, so hopefully the shrinker will touch it
> > > rarely.
> > > 
> > > > > > > and, in addition, a page released to the slab is immediately
> > > > > > > available for allocation, no need to wait for a grace period.
> > > > > > 
> > > > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > > > 
> > > > > This comment is exactly what I was referring to in the code you
> > > > > quoted. Do you see anything problematic in what the comment describes?
> > > > 
> > > > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > > > delay object freeing." The page is not available for allocation.
> > > By "page" I mean "spt page" which is a slab object. So "spt page"
> > > AKA slab object will be available fo allocation immediately.
> > 
> > The object is reusable within that SLAB cache only, not the
> > entire system (therefore it does not prevent an OOM condition).
> > 
> Since the object is allocatable immediately by the shadow paging code, the
> number of SLAB objects is bounded by n_max_mmu_pages. If there is not enough
> memory for n_max_mmu_pages an OOM condition can happen anyway, since the
> shadow paging code will usually have exactly n_max_mmu_pages allocated.
> 
> > OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling 
> > is still necessary, as described in the RCU documentation.
> > 
> I do not see what should be throttled if we use SLAB_DESTROY_BY_RCU. RCU
> comes into play only when the SLAB cache is shrunk, and that happens far
> from kvm code.

You are right.

Why is it safe to allow access, by the lockless page write protect
side, to the spt pointer for shadow page A that can change to the spt
pointer of shadow page B?

Write protect the spte of any page at will? Or verify that it is in fact
the shadow page you want to write protect?

Note that the spte value might be the same for different shadow pages,
so a cmpxchg succeeding does not guarantee it is the same shadow page that
has been protected.




Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-11 Thread Gleb Natapov
On Fri, Oct 11, 2013 at 05:30:17PM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > > RCU (it is too large). If the free memory watermarks are smaller than
> > > n_max_mmu_pages for all guests, OOM is possible.
> > > 
> > Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> > that the slab size will be bounded, so hopefully the shrinker will touch it
> > rarely.
> > 
> > > > > > and, in addition, a page released to the slab is immediately
> > > > > > available for allocation, no need to wait for a grace period.
> > > > > 
> > > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > > 
> > > > This comment is exactly what I was referring to in the code you quoted.
> > > > Do you see anything problematic in what the comment describes?
> > > 
> > > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > > delay object freeing." The page is not available for allocation.
> > By "page" I mean "spt page" which is a slab object. So "spt page"
> > AKA slab object will be available fo allocation immediately.
> 
> The object is reusable within that SLAB cache only, not the
> entire system (therefore it does not prevent an OOM condition).
> 
Since the object is allocatable immediately by the shadow paging code, the
number of SLAB objects is bounded by n_max_mmu_pages. If there is not enough
memory for n_max_mmu_pages an OOM condition can happen anyway, since the
shadow paging code will usually have exactly n_max_mmu_pages allocated.

> OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling 
> is still necessary, as described in the RCU documentation.
> 
I do not see what should be throttled if we use SLAB_DESTROY_BY_RCU. RCU
comes into play only when the SLAB cache is shrunk, and that happens far
from kvm code.

--
Gleb.
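
For illustration, a sketch of the slab RCU direction; the cache name and
parameters are assumptions, not the actual proposal:

    static struct kmem_cache *mmu_spt_cache;

    static int mmu_spt_cache_init(void)
    {
            /*
             * With SLAB_DESTROY_BY_RCU an object freed with
             * kmem_cache_free() is immediately reusable within this
             * cache; only the backing pages wait for a grace period,
             * and only when the cache itself is shrunk.
             */
            mmu_spt_cache = kmem_cache_create("kvm_mmu_spt", PAGE_SIZE,
                                              PAGE_SIZE,
                                              SLAB_DESTROY_BY_RCU, NULL);
            return mmu_spt_cache ? 0 : -ENOMEM;
    }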


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-11 Thread Marcelo Tosatti
On Fri, Oct 11, 2013 at 08:38:31AM +0300, Gleb Natapov wrote:
> > n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> > RCU (it is too large). If the free memory watermarks are smaller than
> > n_max_mmu_pages for all guests, OOM is possible.
> > 
> Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
> that the slab size will be bounded, so hopefully the shrinker will touch it
> rarely.
> 
> > > > > and, in addition, a page released to the slab is immediately
> > > > > available for allocation, no need to wait for a grace period.
> > > > 
> > > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > > 
> > > This comment is exactly what I was referring to in the code you quoted. Do
> > > you see anything problematic in what the comment describes?
> > 
> > "This delays freeing the SLAB page by a grace period, it does _NOT_
> > delay object freeing." The page is not available for allocation.
> By "page" I mean "spt page" which is a slab object. So "spt page"
> AKA slab object will be available fo allocation immediately.

The object is reusable within that SLAB cache only, not the
entire system (therefore it does not prevent an OOM condition).

OK, perhaps it is useful to use SLAB_DESTROY_BY_RCU, but throttling 
is still necessary, as described in the RCU documentation.




Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Gleb Natapov
On Thu, Oct 10, 2013 at 06:03:01PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 10, 2013 at 10:16:46PM +0300, Gleb Natapov wrote:
> > On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> > > > On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > > > > >> Gleb has an idea that uses RCU_DESTROY to protect the shadow
> > > > > > >> page table and encodes the page level into the spte (since we
> > > > > > >> need to check if the spte is the last spte). How about this?
> > > > > > > 
> > > > > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu
> > > > > > > with regard to limitation? (maybe it is).
> > > > > > 
> > > > > > From my experience, freeing shadow pages and allocating shadow
> > > > > > pages are balanced; we can check it by (make -j12 on a guest with
> > > > > > 4 vcpus and):
> > > > > > 
> > > > > > # echo > trace
> > > > > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > > > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > > > > 10816
> > > > > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > > > > 10656
> > > > > > [root@eric-desktop tracing]# cat set_event
> > > > > > kvmmmu:kvm_mmu_get_page
> > > > > > kvmmmu:kvm_mmu_prepare_zap_page
> > > > > > 
> > > > > > alloc VS. free = 10816 : 10656
> > > > > > 
> > > > > > So mostly all allocating and freeing is done in the slab's cache
> > > > > > and the slab frees shadow pages very slowly; there is no rcu
> > > > > > issue.
> > > > > 
> > > > > A more detailed test case would be:
> > > > > 
> > > > > - cpu0-vcpu0 releasing pages as fast as possible
> > > > > - cpu1 executing get_dirty_log
> > > > > 
> > > > > Think of a very large guest.
> > > > > 
> > > > The number of shadow pages allocated from slab will be bounded by
> > > > n_max_mmu_pages, 
> > > 
> > > Correct, but that limit is not suitable (the maximum number of mmu pages
> > > should be larger than the number of mmu pages freeable in an rcu grace
> > > period).
> > > 
> > I am not sure I understand what you mean here. What I was saying is that if
> > we change the code to allocate sp->spt from slab, this slab will never have
> > more than n_max_mmu_pages objects in it.
> 
> n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
> RCU (it is too large). If the free memory watermarks are smaller than
> n_max_mmu_pages for all guests, OOM is possible.
> 
Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
that the slab size will be bounded, so hopefully the shrinker will touch it
rarely.

> > > > and, in addition, a page released to the slab is immediately
> > > > available for allocation, no need to wait for a grace period.
> > > 
> > > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > > 
> > This comment is exactly what I was referring to in the code you quoted. Do
> > you see anything problematic in what the comment describes?
> 
> "This delays freeing the SLAB page by a grace period, it does _NOT_
> delay object freeing." The page is not available for allocation.
By "page" I mean "spt page" which is a slab object. So "spt page"
AKA slab object will be available fo allocation immediately.
 
--
Gleb.
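
A sketch of what "spt page as a slab object" could mean in practice, assuming
a SLAB_DESTROY_BY_RCU cache like the one sketched earlier in the thread
(function names are illustrative):

    /*
     * Allocate sp->spt from the RCU-safe cache instead of
     * __get_free_page().  A freed spt page can be handed out again
     * immediately - but only as another spt page, so a lockless walker
     * still reads something spte-shaped and can validate what it sees.
     */
    static u64 *mmu_alloc_spt(void)
    {
            return kmem_cache_zalloc(mmu_spt_cache, GFP_KERNEL);
    }

    static void mmu_free_spt(u64 *spt)
    {
            kmem_cache_free(mmu_spt_cache, spt);
    }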


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Marcelo Tosatti
On Thu, Oct 10, 2013 at 10:16:46PM +0300, Gleb Natapov wrote:
> On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> > > On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > > > >> Gleb has an idea that uses RCU_DESTROY to protect the shadow
> > > > > >> page table and encodes the page level into the spte (since we
> > > > > >> need to check if the spte is the last spte). How about this?
> > > > > > 
> > > > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > > > > regard to limitation? (maybe it is).
> > > > > 
> > > > > From my experience, freeing shadow pages and allocating shadow
> > > > > pages are balanced; we can check it by (make -j12 on a guest with
> > > > > 4 vcpus and):
> > > > > 
> > > > > # echo > trace
> > > > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > > > 10816
> > > > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > > > 10656
> > > > > [root@eric-desktop tracing]# cat set_event
> > > > > kvmmmu:kvm_mmu_get_page
> > > > > kvmmmu:kvm_mmu_prepare_zap_page
> > > > > 
> > > > > alloc VS. free = 10816 : 10656
> > > > > 
> > > > > So mostly all allocating and freeing is done in the slab's cache
> > > > > and the slab frees shadow pages very slowly; there is no rcu
> > > > > issue.
> > > > 
> > > > A more detailed test case would be:
> > > > 
> > > > - cpu0-vcpu0 releasing pages as fast as possible
> > > > - cpu1 executing get_dirty_log
> > > > 
> > > > Think of a very large guest.
> > > > 
> > > The number of shadow pages allocated from slab will be bounded by
> > > n_max_mmu_pages, 
> > 
> > Correct, but that limit is not suitable (the maximum number of mmu pages
> > should be larger than the number of mmu pages freeable in an rcu grace
> > period).
> > 
> I am not sure I understand what you mean here. What I was saying is that if
> we change the code to allocate sp->spt from slab, this slab will never have
> more than n_max_mmu_pages objects in it.

n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
RCU (it is too large). If the free memory watermarks are smaller than
n_max_mmu_pages for all guests, OOM is possible.

> > > and, in addition, a page released to the slab is immediately
> > > available for allocation, no need to wait for a grace period.
> > 
> > See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> > 
> This comment is exactly what I was referring to in the code you quoted. Do
> you see anything problematic in what the comment describes?

"This delays freeing the SLAB page by a grace period, it does _NOT_
delay object freeing." The page is not available for allocation.



Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Gleb Natapov
On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
> On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> > On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > > >> Gleb has an idea that uses RCU_DESTROY to protect the shadow page
> > > > >> table and encodes the page level into the spte (since we need to
> > > > >> check if the spte is the last spte). How about this?
> > > > > 
> > > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > > > regard to limitation? (maybe it is).
> > > > 
> > > > From my experience, freeing shadow pages and allocating shadow pages
> > > > are balanced; we can check it by (make -j12 on a guest with 4 vcpus
> > > > and):
> > > > 
> > > > # echo > trace
> > > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > > 10816
> > > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > > 10656
> > > > [root@eric-desktop tracing]# cat set_event
> > > > kvmmmu:kvm_mmu_get_page
> > > > kvmmmu:kvm_mmu_prepare_zap_page
> > > > 
> > > > alloc VS. free = 10816 : 10656
> > > > 
> > > > So mostly all allocating and freeing is done in the slab's cache
> > > > and the slab frees shadow pages very slowly; there is no rcu
> > > > issue.
> > > 
> > > A more detailed test case would be:
> > > 
> > > - cpu0-vcpu0 releasing pages as fast as possible
> > > - cpu1 executing get_dirty_log
> > > 
> > > Think of a very large guest.
> > > 
> > The number of shadow pages allocated from slab will be bounded by
> > n_max_mmu_pages, 
> 
> Correct, but that limit is not suitable (the maximum number of mmu pages
> should be larger than the number of mmu pages freeable in an rcu grace
> period).
> 
I am not sure I understand what you mean here. What I was saying is that if
we change the code to allocate sp->spt from slab, this slab will never have
more than n_max_mmu_pages objects in it.

> > and, in addition, a page released to the slab is immediately
> > available for allocation, no need to wait for a grace period.
> 
> See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
> 
This comment is exactly what I was referring to in the code you quoted. Do
you see anything problematic in what the comment describes?

--
Gleb.


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Marcelo Tosatti
On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
> On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > > >> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
> > > >> and encodes the page level into the spte (since we need to check if
> > > >> the spte is the last spte). How about this?
> > > > 
> > > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > > regard to limitation? (maybe it is).
> > > 
> > > From my experience, freeing shadow pages and allocating shadow pages
> > > are balanced; we can check it by (make -j12 on a guest with 4 vcpus
> > > and):
> > > 
> > > # echo > trace
> > > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > > 10816
> > > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > > 10656
> > > [root@eric-desktop tracing]# cat set_event
> > > kvmmmu:kvm_mmu_get_page
> > > kvmmmu:kvm_mmu_prepare_zap_page
> > > 
> > > alloc VS. free = 10816 : 10656
> > > 
> > > So mostly all allocating and freeing is done in the slab's cache and
> > > the slab frees shadow pages very slowly; there is no rcu issue.
> > 
> > A more detailed test case would be:
> > 
> > - cpu0-vcpu0 releasing pages as fast as possible
> > - cpu1 executing get_dirty_log
> > 
> > Think of a very large guest.
> > 
> The number of shadow pages allocated from slab will be bounded by
> n_max_mmu_pages, 

Correct, but that limit is not suitable (the maximum number of mmu pages
should be larger than the number of mmu pages freeable in an rcu grace
period).

> and, in addition, a page released to the slab is immediately
> available for allocation, no need to wait for a grace period.

See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.

> RCU comes
> into play only when the slab is shrunk, which should be almost never. If
> a SLAB_DESTROY_BY_RCU slab does not rate-limit how it frees its pages, that
> is for slab to fix, not for its users.

Agree.



Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Gleb Natapov
On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
> > >> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
> > >> and encodes the page level into the spte (since we need to check if the
> > >> spte is the last spte). How about this?
> > > 
> > > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > > regard to limitation? (maybe it is).
> > 
> > From my experience, freeing shadow pages and allocating shadow pages are
> > balanced; we can check it by (make -j12 on a guest with 4 vcpus and):
> > 
> > # echo > trace
> > [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> > [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> > 10816
> > [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> > 10656
> > [root@eric-desktop tracing]# cat set_event
> > kvmmmu:kvm_mmu_get_page
> > kvmmmu:kvm_mmu_prepare_zap_page
> > 
> > alloc VS. free = 10816 : 10656
> > 
> > So mostly all allocating and freeing is done in the slab's cache and
> > the slab frees shadow pages very slowly; there is no rcu issue.
> 
> A more detailed test case would be:
> 
> - cpu0-vcpu0 releasing pages as fast as possible
> - cpu1 executing get_dirty_log
> 
> Think of a very large guest.
> 
The number of shadow pages allocated from slab will be bounded by
n_max_mmu_pages, and, in addition, a page released to the slab is immediately
available for allocation, no need to wait for a grace period. RCU comes
into play only when the slab is shrunk, which should be almost never. If
a SLAB_DESTROY_BY_RCU slab does not rate-limit how it frees its pages, that
is for slab to fix, not for its users.

--
Gleb.
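
The usage pattern the slab.h comment prescribes for such caches is, roughly
(paraphrasing the example in include/linux/slab.h; lockless_lookup() and
try_get_ref() are placeholders):

    rcu_read_lock();
    again:
    obj = lockless_lookup(key);
    if (obj) {
            if (!try_get_ref(obj))          /* object may have been freed */
                    goto again;
            if (obj->key != key) {          /* object was reused, retry */
                    put_ref(obj);
                    goto again;
            }
    }
    rcu_read_unlock();

The memory stays valid for the whole grace period, but the object itself may
be recycled at any time, so the reader must re-validate after the lookup.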


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Gleb Natapov
On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
   Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
   and encodes the page-level into the spte (since we need to check if the 
   spte
   is the last-spte. ).  How about this?
   
   Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
   regards to limitation? (maybe it is).
  
  For my experience, freeing shadow page and allocing shadow page are 
  balanced,
  we can check it by (make -j12 on a guest with 4 vcpus and):
  
  # echo  trace
  [root@eric-desktop tracing]# cat trace  ~/log | sleep 3
  [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
  10816
  [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
  10656
  [root@eric-desktop tracing]# cat set_event
  kvmmmu:kvm_mmu_get_page
  kvmmmu:kvm_mmu_prepare_zap_page
  
  alloc VS. free = 10816 : 10656
  
  So that, mostly all allocing and freeing are done in the slab's
  cache and the slab frees shdadow pages very slowly, there is no rcu issue.
 
 A more detailed test case would be:
 
 - cpu0-vcpu0 releasing pages as fast as possible
 - cpu1 executing get_dirty_log
 
 Think of a very large guest.
 
The number of shadow pages allocated from slab will be bounded by
n_max_mmu_pages, and, in addition, page released to slab is immediately
available for allocation, no need to wait for grace period. RCU comes
into play only when slab is shrunk, which should be almost never. If
SLAB_DESTROY_BY_RCU slab does not rate limit how it frees its pages this
is for slab to fix, not for its users.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Marcelo Tosatti
On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
 On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
Gleb has a idea that uses RCU_DESTORY to protect the shadow page table
and encodes the page-level into the spte (since we need to check if 
the spte
is the last-spte. ).  How about this?

Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
regards to limitation? (maybe it is).
   
   For my experience, freeing shadow page and allocing shadow page are 
   balanced,
   we can check it by (make -j12 on a guest with 4 vcpus and):
   
   # echo  trace
   [root@eric-desktop tracing]# cat trace  ~/log | sleep 3
   [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
   10816
   [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
   10656
   [root@eric-desktop tracing]# cat set_event
   kvmmmu:kvm_mmu_get_page
   kvmmmu:kvm_mmu_prepare_zap_page
   
   alloc VS. free = 10816 : 10656
   
   So that, mostly all allocing and freeing are done in the slab's
   cache and the slab frees shdadow pages very slowly, there is no rcu issue.
  
  A more detailed test case would be:
  
  - cpu0-vcpu0 releasing pages as fast as possible
  - cpu1 executing get_dirty_log
  
  Think of a very large guest.
  
 The number of shadow pages allocated from slab will be bounded by
 n_max_mmu_pages, 

Correct, but that limit is not suitable (maximum number of mmu pages
should be larger than number of mmu pages freeable in a rcu grace
period).

 and, in addition, page released to slab is immediately
 available for allocation, no need to wait for grace period. 

See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.

 RCU comes
 into play only when slab is shrunk, which should be almost never. If
 SLAB_DESTROY_BY_RCU slab does not rate limit how it frees its pages this
 is for slab to fix, not for its users.

Agree.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Gleb Natapov
On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
 On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
  On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
 Gleb has a idea that uses RCU_DESTORY to protect the shadow page 
 table
 and encodes the page-level into the spte (since we need to check if 
 the spte
 is the last-spte. ).  How about this?
 
 Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
 regards to limitation? (maybe it is).

For my experience, freeing shadow page and allocing shadow page are 
balanced,
we can check it by (make -j12 on a guest with 4 vcpus and):

# echo  trace
[root@eric-desktop tracing]# cat trace  ~/log | sleep 3
[root@eric-desktop tracing]# cat ~/log | grep new | wc -l
10816
[root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
10656
[root@eric-desktop tracing]# cat set_event
kvmmmu:kvm_mmu_get_page
kvmmmu:kvm_mmu_prepare_zap_page

alloc VS. free = 10816 : 10656

So that, mostly all allocing and freeing are done in the slab's
cache and the slab frees shdadow pages very slowly, there is no rcu 
issue.
   
   A more detailed test case would be:
   
   - cpu0-vcpu0 releasing pages as fast as possible
   - cpu1 executing get_dirty_log
   
   Think of a very large guest.
   
  The number of shadow pages allocated from slab will be bounded by
  n_max_mmu_pages, 
 
 Correct, but that limit is not suitable (maximum number of mmu pages
 should be larger than number of mmu pages freeable in a rcu grace
 period).
 
I am not sure I understand what you mean here. What I was sating is that if
we change code to allocate sp-spt from slab, this slab will never have
more then n_max_mmu_pages objects in it.

  and, in addition, page released to slab is immediately
  available for allocation, no need to wait for grace period. 
 
 See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
 
This comment is exactly what I was referring to in the code you quoted. Do
you see anything problematic in what comment describes?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Marcelo Tosatti
On Thu, Oct 10, 2013 at 10:16:46PM +0300, Gleb Natapov wrote:
 On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
  On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
   On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
  Gleb has a idea that uses RCU_DESTORY to protect the shadow page 
  table
  and encodes the page-level into the spte (since we need to check 
  if the spte
  is the last-spte. ).  How about this?
  
  Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
  regards to limitation? (maybe it is).
 
 In my experience, freeing shadow pages and allocating shadow pages are
 balanced; we can check it by (make -j12 on a guest with 4 vcpus and):
 
 # echo > trace
 [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
 [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
 10816
 [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
 10656
 [root@eric-desktop tracing]# cat set_event
 kvmmmu:kvm_mmu_get_page
 kvmmmu:kvm_mmu_prepare_zap_page
 
 alloc VS. free = 10816 : 10656
 
 So mostly all allocation and freeing is done in the slab's cache, and
 the slab frees shadow pages very slowly; there is no rcu issue.

A more detailed test case would be:

- cpu0-vcpu0 releasing pages as fast as possible
- cpu1 executing get_dirty_log

Think of a very large guest.

   The number of shadow pages allocated from slab will be bounded by
   n_max_mmu_pages, 
  
  Correct, but that limit is not suitable (maximum number of mmu pages
  should be larger than number of mmu pages freeable in a rcu grace
  period).
  
 I am not sure I understand what you mean here. What I was saying is that if
 we change the code to allocate sp->spt from slab, this slab will never have
 more than n_max_mmu_pages objects in it.

n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
RCU (it's too large). If the free memory watermarks are smaller than
n_max_mmu_pages for all guests, OOM is possible.

   and, in addition, a page released to the slab is immediately
   available for allocation, with no need to wait for a grace period.
  
  See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
  
 This comment is exactly what I was referring to in the code you quoted. Do
 you see anything problematic in what the comment describes?

This delays freeing of the SLAB page by a grace period; it does _NOT_
delay object freeing. The page is not available for allocation.


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-10 Thread Gleb Natapov
On Thu, Oct 10, 2013 at 06:03:01PM -0300, Marcelo Tosatti wrote:
 On Thu, Oct 10, 2013 at 10:16:46PM +0300, Gleb Natapov wrote:
  On Thu, Oct 10, 2013 at 01:42:22PM -0300, Marcelo Tosatti wrote:
   On Thu, Oct 10, 2013 at 03:08:45PM +0300, Gleb Natapov wrote:
On Wed, Oct 09, 2013 at 10:47:10PM -0300, Marcelo Tosatti wrote:
   Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
   and encodes the page-level into the spte (since we need to check if the
   spte is the last-spte).  How about this?
   
   Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
   regards to limitation? (maybe it is).
  
  In my experience, freeing shadow pages and allocating shadow pages are
  balanced; we can check it by (make -j12 on a guest with 4 vcpus and):
  
  # echo > trace
  [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
  [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
  10816
  [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
  10656
  [root@eric-desktop tracing]# cat set_event
  kvmmmu:kvm_mmu_get_page
  kvmmmu:kvm_mmu_prepare_zap_page
  
  alloc VS. free = 10816 : 10656
  
  So mostly all allocation and freeing is done in the slab's cache, and
  the slab frees shadow pages very slowly; there is no rcu issue.
 
 A more detailed test case would be:
 
 - cpu0-vcpu0 releasing pages as fast as possible
 - cpu1 executing get_dirty_log
 
 Think of a very large guest.
 
The number of shadow pages allocated from slab will be bounded by
n_max_mmu_pages, 
   
   Correct, but that limit is not suitable (maximum number of mmu pages
   should be larger than number of mmu pages freeable in a rcu grace
   period).
   
  I am not sure I understand what you mean here. What I was saying is that if
  we change the code to allocate sp->spt from slab, this slab will never have
  more than n_max_mmu_pages objects in it.
 
 n_max_mmu_pages is not a suitable limit to throttle freeing of pages via
 RCU (its too large). If the free memory watermarks are smaller than 
 n_max_mmu_pages for all guests, OOM is possible.
 
Ah, yes. I am not saying n_max_mmu_pages will throttle RCU, just saying
that the slab size will be bounded, so hopefully the shrinker will touch
it rarely.

and, in addition, a page released to the slab is immediately
available for allocation, with no need to wait for a grace period.
   
   See SLAB_DESTROY_BY_RCU comment at include/linux/slab.h.
   
  This comment is exactly what I was referring to in the code you quoted. Do
  you see anything problematic in what the comment describes?
 
 This delays freeing of the SLAB page by a grace period; it does _NOT_
 delay object freeing. The page is not available for allocation.
By "page" I mean the spt page, which is a slab object. So the spt page,
AKA the slab object, will be available for allocation immediately.
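
In other words, the usual SLAB_DESTROY_BY_RCU reader pattern applies; a
minimal sketch, with hypothetical lookup/validate helpers rather than real
KVM code:

        static bool try_write_protect_lockless(struct kvm *kvm, gfn_t gfn)
        {
                struct kvm_mmu_page *sp;
                bool done = false;

                rcu_read_lock();
                sp = lookup_shadow_page(kvm, gfn); /* races with free+realloc */
                if (sp && shadow_page_matches(sp, gfn)) {
                        /*
                         * The memory is type-stable inside the grace period,
                         * so the dereference above cannot fault; but the
                         * object may have been recycled, hence the
                         * revalidation before it is used.
                         */
                        write_protect_sp(kvm, sp);
                        done = true;
                }
                rcu_read_unlock();
                return done;
        }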
 
--
Gleb.


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-09 Thread Marcelo Tosatti
On Wed, Oct 09, 2013 at 06:45:47PM +0800, Xiao Guangrong wrote:
> On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> > On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
> >>
> >> Hi Marcelo,
> >>
> >> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti  wrote:
> >>
> 
>  +if (kvm->arch.rcu_free_shadow_page) {
>  +kvm_mmu_isolate_pages(invalid_list);
>  +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>  +list_del_init(invalid_list);
>  +call_rcu(&sp->rcu, free_pages_rcu);
>  +return;
>  +}
> >>>
> >>> This is unbounded (there was a similar problem with early fast page fault
> >>> implementations):
> >>>
> >>> From RCU/checklist.txt:
> >>>
> >>> "An especially important property of the synchronize_rcu()
> >>>primitive is that it automatically self-limits: if grace periods
> >>>are delayed for whatever reason, then the synchronize_rcu()
> >>>primitive will correspondingly delay updates.  In contrast,
> >>>code using call_rcu() should explicitly limit update rate in
> >>>cases where grace periods are delayed, as failing to do so can
> >>>result in excessive realtime latencies or even OOM conditions.
> >>> "
> >>
> >> I understand what you are worrying about… Hmm, can it be avoided by
> >> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
> >> there is only a slight chance that the page needs to be freed by call_rcu.
> > 
> > The point that must be addressed is that you cannot allow an unlimited
> > number of sp's to be freed via call_rcu between two grace periods.
> > 
> > So something like:
> > 
> > - For every 17MB worth of shadow pages.
> > - Guarantee a grace period has passed.
> 
> Hmm, the 'qhimark' in rcu is 10000, that means rcu allows call_rcu
> to pend 10000 times in a grace-period without slowdown. Can we really
> reach this number while rcu_free_shadow_page is set? Anyway, if it can,
> we can use rcu tech-break to avoid it, can't we?

Yes.

> > If you control kvm->arch.rcu_free_shadow_page, you could periodically
> > verify how many MBs worth of shadow pages are in the queue for RCU
> > freeing and force a grace period after a certain number.
> 
> I have no idea how to force a grace-period for the cpu which is running
> in the rcu-read side. IIUC, only dyntick-idle and offline CPUs can be forced,
> see rcu_gp_fqs().

synchronize_rcu.

> >>> Moreover, freeing pages differently depending on some state should 
> >>> be avoided.
> >>>
> >>> Alternatives:
> >>>
> >>> - Disable interrupts at write protect sites.
> >>
> >> The write-protection can be triggered by a KVM ioctl that is not in the
> >> VCPU context; if we do this, we also need to send an IPI to the KVM
> >> thread when doing a TLB flush.
> > 
> > Yes. However for the case being measured, simultaneous page freeing by
> > vcpus should be minimal (therefore not affecting the latency of
> > GET_DIRTY_LOG).
> 
> I agree, but write-protection will cost lots of time; we need to either:
> 1) do write-protection with irqs disabled, which is not good for devices,
> or
> 2) do pieces of the work, then enable irqs and disable them again to
>    continue the work. Enabling and disabling irqs many times is not cheap
>    for x86.
> 
> no?

It's getting cheaper (see the performance optimization guide).

> >> And we cannot do much work while interrupts are disabled due to
> >> interrupt latency.
> >>
> >>> - Rate limit the number of pages freed via call_rcu
> >>> per grace period.
> >>
> >> Seems complex. :(
> >>
> >>> - Some better alternative.
> >>
> >> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
> >> and encodes the page-level into the spte (since we need to check if the
> >> spte is the last-spte).  How about this?
> > 
> > Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> > regards to limitation? (maybe it is).
> 
> In my experience, freeing shadow pages and allocating shadow pages are
> balanced; we can check it by (make -j12 on a guest with 4 vcpus and):
> 
> # echo > trace
> [root@eric-desktop tracing]# cat trace > ~/log | sleep 3
> [root@eric-desktop tracing]# cat ~/log | grep new | wc -l
> 10816
> [root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
> 10656
> [root@eric-desktop tracing]# cat set_event
> kvmmmu:kvm_mmu_get_page
> kvmmmu:kvm_mmu_prepare_zap_page
> 
> alloc VS. free = 10816 : 10656
> 
> So mostly all allocation and freeing is done in the slab's cache, and
> the slab frees shadow pages very slowly; there is no rcu issue.

A more detailed test case would be:

- cpu0-vcpu0 releasing pages as fast as possible
- cpu1 executing get_dirty_log

Think of a very large guest.

> >> I planned to do it after this patchset merged, if you like it and if you
> >> think that "using kvm->arch.rcu_free_shadow_page in a small window" cannot
> >> avoid the issue, I am happy to do it in the next version. :)

Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-09 Thread Xiao Guangrong
On 10/09/2013 09:56 AM, Marcelo Tosatti wrote:
> On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
>>
>> Hi Marcelo,
>>
>> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti  wrote:
>>

 +  if (kvm->arch.rcu_free_shadow_page) {
 +  kvm_mmu_isolate_pages(invalid_list);
 +  sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 +  list_del_init(invalid_list);
 +  call_rcu(&sp->rcu, free_pages_rcu);
 +  return;
 +  }
>>>
>>> This is unbounded (there was a similar problem with early fast page fault
>>> implementations):
>>>
>>> From RCU/checklist.txt:
>>>
>>> "An especially important property of the synchronize_rcu()
>>>primitive is that it automatically self-limits: if grace periods
>>>are delayed for whatever reason, then the synchronize_rcu()
>>>primitive will correspondingly delay updates.  In contrast,
>>>code using call_rcu() should explicitly limit update rate in
>>>cases where grace periods are delayed, as failing to do so can
>>>result in excessive realtime latencies or even OOM conditions.
>>> "
>>
>> I understand what you are worrying about… Hmm, can it be avoided by
>> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
>> there is only a slight chance that the page needs to be freed by call_rcu.
> 
> The point that must be addressed is that you cannot allow an unlimited
> number of sp's to be freed via call_rcu between two grace periods.
> 
> So something like:
> 
> - For every 17MB worth of shadow pages.
> - Guarantee a grace period has passed.

Hmm, the 'qhimark' in rcu is 10000, that means rcu allows call_rcu
to pend 10000 times in a grace-period without slowdown. Can we really
reach this number while rcu_free_shadow_page is set? Anyway, if it can,
we can use rcu tech-break to avoid it, can't we?

> 
> If you control kvm->arch.rcu_free_shadow_page, you could periodically
> verify how many MBs worth of shadow pages are in the queue for RCU
> freeing and force a grace period after a certain number.

I have no idea how to force a grace-period for the cpu which is running
in the rcu-read side. IIUC, only dyntick-idle and offline CPUs can be forced,
see rcu_gp_fqs().

> 
>>> Moreover, freeing pages differently depending on some state should 
>>> be avoided.
>>>
>>> Alternatives:
>>>
>>> - Disable interrupts at write protect sites.
>>
>> The write-protection can be triggered by a KVM ioctl that is not in the
>> VCPU context; if we do this, we also need to send an IPI to the KVM thread
>> when doing a TLB flush.
> 
> Yes. However for the case being measured, simultaneous page freeing by vcpus 
> should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).

I agree, but write-protection will cost lots of time; we need to either:
1) do write-protection with irqs disabled, which is not good for devices,
or
2) do pieces of the work, then enable irqs and disable them again to
   continue the work. Enabling and disabling irqs many times is not cheap
   for x86.

no?

> 
>> And we cannot do much work while interrupts are disabled due to
>> interrupt latency.
>>
>>> - Rate limit the number of pages freed via call_rcu
>>> per grace period.
>>
>> Seems complex. :(
>>
>>> - Some better alternative.
>>
>> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
>> and encodes the page-level into the spte (since we need to check if the spte
>> is the last-spte).  How about this?
> 
> Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
> regards to limitation? (maybe it is).

In my experience, freeing shadow pages and allocating shadow pages are
balanced; we can check it by (make -j12 on a guest with 4 vcpus and):

# echo > trace
[root@eric-desktop tracing]# cat trace > ~/log | sleep 3
[root@eric-desktop tracing]# cat ~/log | grep new | wc -l
10816
[root@eric-desktop tracing]# cat ~/log | grep prepare | wc -l
10656
[root@eric-desktop tracing]# cat set_event
kvmmmu:kvm_mmu_get_page
kvmmmu:kvm_mmu_prepare_zap_page

alloc VS. free = 10816 : 10656

So mostly all allocation and freeing is done in the slab's cache, and
the slab frees shadow pages very slowly; there is no rcu issue.

> 
>> I planned to do it after this patchset merged, if you like it and if you
>> think that "using kvm->arch.rcu_free_shadow_page in a small window" cannot
>> avoid the issue, I am happy to do it in the next version. :)
> 
> Unfortunately the window can be large (as it depends on the size of the
> memslot), so it would be best if this problem could be addressed before
> merging. What is your idea for reducing the rcu_free_shadow_page=1 window?
> 

I mean something like:

for (i = 0; i <= page_nr; i++) {
        rcu_free_shadow_page = 1;
        write_protect(&rmap[i]);
        rcu_free_shadow_page = 0;
}

we write-protect fewer entries at a time rather than a whole memslot.
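
A sketch of that loop using the helpers from this patch (write_protect()
stands in for the real rmap write-protect routine):

        for (i = 0; i <= page_nr; i++) {
                /*
                 * Open a short RCU window per rmap entry, so only the
                 * shadow pages zapped inside one window are queued to
                 * call_rcu before the window closes again.
                 */
                kvm_mmu_rcu_free_page_begin(kvm);
                write_protect(&rmap[i]);
                kvm_mmu_rcu_free_page_end(kvm);
        }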

BTW, I think rcu_need_break() would be useful for this case, so that the
rcu_read side does not delay synchronize_rcu() too much.

> Thank you for the good work.

I really appreciate your, Gleb's, and the other guys' time and inspiration
on it. :)


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-08 Thread Marcelo Tosatti
On Tue, Oct 08, 2013 at 12:02:32PM +0800, Xiao Guangrong wrote:
> 
> Hi Marcelo,
> 
> On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti  wrote:
> 
> >> 
> >> +  if (kvm->arch.rcu_free_shadow_page) {
> >> +  kvm_mmu_isolate_pages(invalid_list);
> >> +  sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> >> +  list_del_init(invalid_list);
> >> +  call_rcu(&sp->rcu, free_pages_rcu);
> >> +  return;
> >> +  }
> > 
> > This is unbounded (there was a similar problem with early fast page fault
> > implementations):
> > 
> > From RCU/checklist.txt:
> > 
> > "An especially important property of the synchronize_rcu()
> >primitive is that it automatically self-limits: if grace periods
> >are delayed for whatever reason, then the synchronize_rcu()
> >primitive will correspondingly delay updates.  In contrast,
> >code using call_rcu() should explicitly limit update rate in
> >cases where grace periods are delayed, as failing to do so can
> >result in excessive realtime latencies or even OOM conditions.
> > "
> 
> I understand what you are worrying about… Hmm, can it be avoided by
> just using kvm->arch.rcu_free_shadow_page in a small window? - Then
> there is only a slight chance that the page needs to be freed by call_rcu.

The point that must be addressed is that you cannot allow an unlimited
number of sp's to be freed via call_rcu between two grace periods.

So something like:

- For every 17MB worth of shadow pages.
- Guarantee a grace period has passed.

If you control kvm->arch.rcu_free_shadow_page, you could periodically
verify how many MBs worth of shadow pages are in the queue for RCU
freeing and force a grace period after a certain number.
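
A hedged sketch of such a throttle (the helper and the batch constant are
illustrative, not an existing interface, and callers are assumed to be
serialized, e.g. by the mmu lock):

        #define RCU_FREE_BATCH  ((17 << 20) >> PAGE_SHIFT) /* ~17MB of sp's */

        static unsigned long rcu_queued;

        static void queue_sp_for_rcu_free(struct kvm_mmu_page *sp)
        {
                call_rcu(&sp->rcu, free_pages_rcu);
                if (++rcu_queued >= RCU_FREE_BATCH) {
                        synchronize_rcu(); /* force a grace period ...    */
                        rcu_queued = 0;    /* ... so queued sp's get freed */
                }
        }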

> > Moreover, freeing pages differently depending on some state should 
> > be avoided.
> > 
> > Alternatives:
> > 
> > - Disable interrupts at write protect sites.
> 
> The write-protection can be triggered by a KVM ioctl that is not in the
> VCPU context; if we do this, we also need to send an IPI to the KVM thread
> when doing a TLB flush.

Yes. However for the case being measured, simultaneous page freeing by vcpus 
should be minimal (therefore not affecting the latency of GET_DIRTY_LOG).

> And we cannot do much work while interrupts are disabled due to
> interrupt latency.
> 
> > - Rate limit the number of pages freed via call_rcu
> > per grace period.
> 
> Seems complex. :(
> 
> > - Some better alternative.
> 
> Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
> and encodes the page-level into the spte (since we need to check if the spte
> is the last-spte).  How about this?

Pointer please? Why is DESTROY_SLAB_RCU any safer than call_rcu with
regards to limitation? (maybe it is).

> I planned to do it after this patchset merged, if you like it and if you think
> that "using kvm->arch.rcu_free_shadow_page in a small window" cannot avoid
> the issue, I am happy to do it in the next version. :)

Unfortunately the window can be large (as it depends on the size of the
memslot), so it would be best if this problem could be addressed before
merging. What is your idea for reducing the rcu_free_shadow_page=1 window?

Thank you for the good work.


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-07 Thread Xiao Guangrong

Hi Marcelo,

On Oct 8, 2013, at 9:23 AM, Marcelo Tosatti  wrote:

>> 
>> +if (kvm->arch.rcu_free_shadow_page) {
>> +kvm_mmu_isolate_pages(invalid_list);
>> +sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
>> +list_del_init(invalid_list);
>> +call_rcu(&sp->rcu, free_pages_rcu);
>> +return;
>> +}
> 
> This is unbounded (there was a similar problem with early fast page fault
> implementations):
> 
> From RCU/checklist.txt:
> 
> "An especially important property of the synchronize_rcu()
>primitive is that it automatically self-limits: if grace periods
>are delayed for whatever reason, then the synchronize_rcu()
>primitive will correspondingly delay updates.  In contrast,
>code using call_rcu() should explicitly limit update rate in
>cases where grace periods are delayed, as failing to do so can
>result in excessive realtime latencies or even OOM conditions.
> "

I understand what you are worrying about… Hmm, can it be avoided by
just using kvm->arch.rcu_free_shadow_page in a small window? - Then
there is only a slight chance that the page needs to be freed by call_rcu.

> 
> Moreover, freeing pages differently depending on some state should 
> be avoided.
> 
> Alternatives:
> 
> - Disable interrupts at write protect sites.

The write-protection can be triggered by a KVM ioctl that is not in the VCPU
context; if we do this, we also need to send an IPI to the KVM thread when
doing a TLB flush. And we cannot do much work while interrupts are disabled,
due to interrupt latency.
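
For contrast, the vcpu-context pattern from the changelog
(walk_shadow_page_lockless_begin/end) looks roughly like this simplified
sketch:

        static u64 get_last_spte(struct kvm_vcpu *vcpu, u64 addr)
        {
                struct kvm_shadow_walk_iterator iterator;
                u64 spte = 0ull;

                /*
                 * Disabling irqs blocks the TLB-flush IPI that the zap
                 * side waits for before freeing shadow pages, so this
                 * walk cannot observe a freed page table.
                 */
                walk_shadow_page_lockless_begin(vcpu);
                for_each_shadow_entry(vcpu, addr, iterator)
                        spte = *iterator.sptep;
                walk_shadow_page_lockless_end(vcpu);

                return spte;
        }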

> - Rate limit the number of pages freed via call_rcu
> per grace period.

Seems complex. :(

> - Some better alternative.

Gleb has an idea that uses RCU_DESTROY to protect the shadow page table
and encodes the page-level into the spte (since we need to check if the spte
is the last-spte).  How about this?
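
A hypothetical sketch of that encoding, assuming a few software-available
spte bits can be spared (the bit position is illustrative only):

        #define SPTE_LEVEL_SHIFT        52      /* software-available bits */
        #define SPTE_LEVEL_MASK         (7ULL << SPTE_LEVEL_SHIFT)

        static u64 set_spte_level(u64 spte, int level)
        {
                return (spte & ~SPTE_LEVEL_MASK) |
                       ((u64)level << SPTE_LEVEL_SHIFT);
        }

        /*
         * A lockless walker can then recognize a last-level spte from the
         * spte value alone, without touching the (possibly recycled)
         * shadow page that contains it.
         */
        static bool spte_is_last_level(u64 spte)
        {
                return ((spte & SPTE_LEVEL_MASK) >> SPTE_LEVEL_SHIFT) ==
                       PT_PAGE_TABLE_LEVEL;
        }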

I planned to do it after this patchset merged, if you like it and if you think
that "using kvm->arch.rcu_free_shadow_page in a small window" cannot avoid
the issue, I am happy to do it in the next version. :)

Thanks, Marcelo!


Re: [PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-10-07 Thread Marcelo Tosatti
On Thu, Sep 05, 2013 at 06:29:15PM +0800, Xiao Guangrong wrote:
> It is easy if the handler is in the vcpu context; in that case we can use
> walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
> disable interrupts to stop shadow pages being freed. But we are in the ioctl
> context and the paths we are optimizing for have a heavy workload; disabling
> interrupts is not good for system performance.
> 
> We add an indicator into the kvm struct (kvm->arch.rcu_free_shadow_page), then
> use call_rcu() to free the shadow page if that indicator is set. Setting/clearing
> the indicator is protected by slots-lock, so it need not be atomic and does not
> hurt performance or scalability.
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_host.h |  6 +-
>  arch/x86/kvm/mmu.c  | 32 
>  arch/x86/kvm/mmu.h  | 22 ++
>  3 files changed, 59 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c76ff74..8e4ca0d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -226,7 +226,10 @@ struct kvm_mmu_page {
>   /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
>   unsigned long mmu_valid_gen;
>  
> - DECLARE_BITMAP(unsync_child_bitmap, 512);
> + union {
> + DECLARE_BITMAP(unsync_child_bitmap, 512);
> + struct rcu_head rcu;
> + };
>  
>  #ifdef CONFIG_X86_32
>   /*
> @@ -554,6 +557,7 @@ struct kvm_arch {
>*/
>   struct list_head active_mmu_pages;
>   struct list_head zapped_obsolete_pages;
> + bool rcu_free_shadow_page;
>  
>   struct list_head assigned_dev_head;
>   struct iommu_domain *iommu_domain;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2bf450a..f551fc7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
> struct kvm_mmu_page *sp,
>   return ret;
>  }
>  
> +static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
> +{
> + struct kvm_mmu_page *sp;
> +
> + list_for_each_entry(sp, invalid_list, link)
> + kvm_mmu_isolate_page(sp);
> +}
> +
> +static void free_pages_rcu(struct rcu_head *head)
> +{
> + struct kvm_mmu_page *next, *sp;
> +
> + sp = container_of(head, struct kvm_mmu_page, rcu);
> + while (sp) {
> + if (!list_empty(&sp->link))
> + next = list_first_entry(&sp->link,
> +   struct kvm_mmu_page, link);
> + else
> + next = NULL;
> + kvm_mmu_free_page(sp);
> + sp = next;
> + }
> +}
> +
>  static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>   struct list_head *invalid_list)
>  {
> @@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>*/
>   kvm_flush_remote_tlbs(kvm);
>  
> + if (kvm->arch.rcu_free_shadow_page) {
> + kvm_mmu_isolate_pages(invalid_list);
> + sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
> + list_del_init(invalid_list);
> + call_rcu(&sp->rcu, free_pages_rcu);
> + return;
> + }

This is unbounded (there was a similar problem with early fast page fault
implementations):

From RCU/checklist.txt:

"An especially important property of the synchronize_rcu()
primitive is that it automatically self-limits: if grace periods
are delayed for whatever reason, then the synchronize_rcu()
primitive will correspondingly delay updates.  In contrast,
code using call_rcu() should explicitly limit update rate in
cases where grace periods are delayed, as failing to do so can
result in excessive realtime latencies or even OOM conditions.
"

Moreover, freeing pages differently depending on some state should 
be avoided.

Alternatives:

- Disable interrupts at write protect sites.
- Rate limit the number of pages freed via call_rcu
per grace period.
- Some better alternative.


[PATCH v2 12/15] KVM: MMU: allow locklessly access shadow page table out of vcpu thread

2013-09-05 Thread Xiao Guangrong
It is easy if the handler is in the vcpu context; in that case we can use
walk_shadow_page_lockless_begin() and walk_shadow_page_lockless_end(), which
disable interrupts to stop shadow pages being freed. But we are in the ioctl
context and the paths we are optimizing for have a heavy workload; disabling
interrupts is not good for system performance.

We add an indicator into the kvm struct (kvm->arch.rcu_free_shadow_page), then
use call_rcu() to free the shadow page if that indicator is set. Setting/clearing
the indicator is protected by slots-lock, so it need not be atomic and does not
hurt performance or scalability.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |  6 +-
 arch/x86/kvm/mmu.c  | 32 
 arch/x86/kvm/mmu.h  | 22 ++
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c76ff74..8e4ca0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -226,7 +226,10 @@ struct kvm_mmu_page {
/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
unsigned long mmu_valid_gen;
 
-   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   union {
+   DECLARE_BITMAP(unsync_child_bitmap, 512);
+   struct rcu_head rcu;
+   };
 
 #ifdef CONFIG_X86_32
/*
@@ -554,6 +557,7 @@ struct kvm_arch {
 */
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
+   bool rcu_free_shadow_page;
 
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2bf450a..f551fc7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2355,6 +2355,30 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
return ret;
 }
 
+static void kvm_mmu_isolate_pages(struct list_head *invalid_list)
+{
+   struct kvm_mmu_page *sp;
+
+   list_for_each_entry(sp, invalid_list, link)
+   kvm_mmu_isolate_page(sp);
+}
+
+static void free_pages_rcu(struct rcu_head *head)
+{
+   struct kvm_mmu_page *next, *sp;
+
+   sp = container_of(head, struct kvm_mmu_page, rcu);
+   while (sp) {
+   if (!list_empty(&sp->link))
+   next = list_first_entry(&sp->link,
+ struct kvm_mmu_page, link);
+   else
+   next = NULL;
+   kvm_mmu_free_page(sp);
+   sp = next;
+   }
+}
+
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
 {
@@ -2375,6 +2399,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 */
kvm_flush_remote_tlbs(kvm);
 
+   if (kvm->arch.rcu_free_shadow_page) {
+   kvm_mmu_isolate_pages(invalid_list);
+   sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+   list_del_init(invalid_list);
+   call_rcu(&sp->rcu, free_pages_rcu);
+   return;
+   }
+
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
kvm_mmu_isolate_page(sp);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 77e044a..61217f3 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,4 +117,26 @@ static inline bool permission_fault(struct kvm_mmu *mmu, 
unsigned pte_access,
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
+
+/*
+ * The caller should ensure that these two functions are
+ * called serially.
+ */
+static inline void kvm_mmu_rcu_free_page_begin(struct kvm *kvm)
+{
+   rcu_read_lock();
+
+   kvm->arch.rcu_free_shadow_page = true;
+   /* Set the indicator before accessing shadow pages. */
+   smp_mb();
+}
+
+static inline void kvm_mmu_rcu_free_page_end(struct kvm *kvm)
+{
+   /* Make sure that accessing shadow pages has finished. */
+   smp_mb();
+   kvm->arch.rcu_free_shadow_page = false;
+
+   rcu_read_unlock();
+}
 #endif
-- 
1.8.1.4
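
For reference, a dirty-log-side caller might bracket its lockless pass
roughly as below; slot_write_protect_lockless() is a hypothetical sketch,
relying on slots-lock to serialize begin/end as the changelog states:

        static void slot_write_protect_lockless(struct kvm *kvm,
                                                struct kvm_memory_slot *slot)
        {
                mutex_lock(&kvm->slots_lock);     /* serializes begin/end */
                kvm_mmu_rcu_free_page_begin(kvm); /* zaps now use call_rcu */

                /*
                 * Walk the slot's rmaps here and clear spte W bits with
                 * cmpxchg; any shadow page zapped meanwhile is freed only
                 * after this RCU read-side section ends.
                 */

                kvm_mmu_rcu_free_page_end(kvm);   /* back to synchronous free */
                mutex_unlock(&kvm->slots_lock);
        }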
