subject:"\[PATCH 3\/4\] KVM\: Switch to srcu\-less get_dirty

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-19 Thread Xiao Guangrong

On 03/16/2012 05:44 PM, Takuya Yoshikawa wrote:

 On Fri, 16 Mar 2012 16:28:56 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
 Thanks for your explanation, maybe you are right, i do not know migration
 much.

 What i worried about is, you have changed the behaviour of GET_DIRTY_LOG,
 in the current one, it can get all the dirty pages when it is called; after
 your change, GET_DIRTY_LOG can get a empty dirty bitmap but dirty page 
 exists.
 
 The current code also see the same situation because nothing prevents the
 guest from writing to pages before GET_DIRTY_LOG returns and the userspace
 checks the bitmap.  Everything is running.
 


The current code is under the protection of s-rcu:
IIRC, it always holds s-rcu when write guest page and set dirty bit,
that mean the dirty page is logged either in the old dirty_bitmap or in the
current memslot-dirty_bitmap. Yes?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-19 Thread Takuya Yoshikawa

On Mon, 19 Mar 2012 17:34:49 +0800
Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:

 The current code is under the protection of s-rcu:
 IIRC, it always holds s-rcu when write guest page and set dirty bit,
 that mean the dirty page is logged either in the old dirty_bitmap or in the
 current memslot-dirty_bitmap. Yes?


Yes.

I just wanted to explain that getting clear dirty bitmap by GET_DIRTY_LOG
does not mean there is no dirty page: it just means that there was nothing
logged when we updated the bitmap in get_dirty_log().

We cannot know if anything happend between the bitmap update and result
check in the userspace.

So even when we get a clear bitmap, we need to stop the VCPU threads and
then do GET_DIRTY_LOG once more for live migration.


The important thing is that every bit set by mark_page_dirty() can be
found at some time in the future, including the final GET_DIRTY_LOG.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-16 Thread Takuya Yoshikawa

On Fri, 16 Mar 2012 13:03:48 +0800
Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:

 For my quickly review, mmu_lock can not protect everything, if the guest page

Yes and ...

 is written out of the shadow page/ept table, dirty page will be lost.

No.

 
 There is a example:
 
  CPU A   CPU  B
 guest page is written by write-emulation
 
   hold mmu-lock and see 
 dirty-bitmap
   is not be changed, then 
 migration is
   completed.

We do not allow this break.

 
 call mark_page_dirty() to set dirty_bit map
 
 
 Right?


As you pointed out, we cannot assume mutual exclusion by mmu_lock.
That is why we are using atomic bitmap operations: xchg and set_bit.

In this sense we are at least guaranteed to get the dirty page
information in dirty_bitmap - the current one or next one.

So what we should care about is to not miss the information written in
the next bitmap at the time we actually migrate the guest.

Actually the userspace stops the guest at the final stage and then send the
remaining pages found in the bitmap.  So the above break between write and
mark_page_dirty() cannot happen IIUC.


Thanks,
Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-16 Thread Xiao Guangrong

On 03/16/2012 02:55 PM, Takuya Yoshikawa wrote:

 On Fri, 16 Mar 2012 13:03:48 +0800
 Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:
 
 For my quickly review, mmu_lock can not protect everything, if the guest page
 
 Yes and ...
 
 is written out of the shadow page/ept table, dirty page will be lost.
 
 No.
 

 There is a example:

  CPU A   CPU  B
 guest page is written by write-emulation

   hold mmu-lock and see 
 dirty-bitmap
   is not be changed, then 
 migration is
   completed.
 
 We do not allow this break.
 


Hmm? what can avoid this? Could you please point it out?



 call mark_page_dirty() to set dirty_bit map


 Right?
 
 
 As you pointed out, we cannot assume mutual exclusion by mmu_lock.
 That is why we are using atomic bitmap operations: xchg and set_bit.
 
 In this sense we are at least guaranteed to get the dirty page
 information in dirty_bitmap - the current one or next one.
 


The problem is the guest page is written before dirty-bitmap is set,
we may log the dirty page in this window like above case...

 So what we should care about is to not miss the information written in
 the next bitmap at the time we actually migrate the guest.
 


Actually, the way log dirty page in MMU page-table is tricky:

set dirty-bitmap

allow spte to be writeable

page can be written

That means we always set dirty-bitmap _before_ page become dirty that is
the reason why your bitmap-way can work.

 Actually the userspace stops the guest at the final stage and then send the
 remaining pages found in the bitmap.  So the above break between write and
 mark_page_dirty() cannot happen IIUC.
 


Maybe i'd better firstly understand why We do not allow this break :)


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-16 Thread Takuya Yoshikawa

On Fri, 16 Mar 2012 15:30:45 +0800
Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:

  There is a example:
 
   CPU A   CPU  B
  guest page is written by write-emulation
 
hold mmu-lock and see 
  dirty-bitmap
is not be changed, then 
  migration is
completed.
  
  We do not allow this break.
  
 
 
 Hmm? what can avoid this? Could you please point it out?

Stopping the guest before actualy migrating the guest means VCPU threads
must be back in the userspace at the moment, no?

So when the final GET_DIRTY_LOG is being executed, thread A cannot be
in KVM.

 The problem is the guest page is written before dirty-bitmap is set,
 we may log the dirty page in this window like above case...

Exactly, but the next GET_DIRTY_LOG call can take that because, as I
wrote above, at this time the GET_DIRTY_LOG must not be the final one.

Makes sense?

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-16 Thread Takuya Yoshikawa

On Fri, 16 Mar 2012 16:28:56 +0800
Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com wrote:

 Thanks for your explanation, maybe you are right, i do not know migration
 much.
 
 What i worried about is, you have changed the behaviour of GET_DIRTY_LOG,
 in the current one, it can get all the dirty pages when it is called; after
 your change, GET_DIRTY_LOG can get a empty dirty bitmap but dirty page exists.

The current code also see the same situation because nothing prevents the
guest from writing to pages before GET_DIRTY_LOG returns and the userspace
checks the bitmap.  Everything is running.

 Migration may work correctly depends on the final GET_DIRTY_LOG, in that time,
 guest is stopped. But i am not sure whether other components using 
 GET_DIRTY_LOG
 are happy, e.g. frame-buffer.

Ah, you are probably worrying about what I discussed with Avi before.

In the end we cannot get a complete snapshot without stopping the guest
like migration does.  So that cannot be guaranteed.

The only thing it can promise is to make it possible to get the log after
mark_page_dirty().  Even when the bit is not marked at Nth GET_DIRTY_LOG
time, we should be able to get it at (N+1)th call - maybe N+2.

For VGA, the display continues to update endlessly and each page will be
updated at some time: of course there may be a bit of time lag.

BTW marking framebuffer pages is through MMU and mmu_lock protected.


Thanks,
Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-15 Thread Xiao Guangrong

On 03/01/2012 06:33 PM, Takuya Yoshikawa wrote:


 + spin_lock(kvm-mmu_lock);
 
 - r = -ENOMEM;
 - slots = kmemdup(kvm-memslots, sizeof(*kvm-memslots), 
 GFP_KERNEL);
 - if (!slots)
 - goto out;
 + for (i = 0; i  n / sizeof(long); i++) {
 + unsigned long mask;
 + gfn_t offset;
 
 - memslot = id_to_memslot(slots, log-slot);
 - memslot-nr_dirty_pages = 0;
 - memslot-dirty_bitmap = dirty_bitmap_head;
 - update_memslots(slots, NULL);
 + if (!dirty_bitmap[i])
 + continue;
 
 - old_slots = kvm-memslots;
 - rcu_assign_pointer(kvm-memslots, slots);
 - synchronize_srcu_expedited(kvm-srcu);
 - kfree(old_slots);


For my quickly review, mmu_lock can not protect everything, if the guest page
is written out of the shadow page/ept table, dirty page will be lost.

There is a example:

 CPU A   CPU  B
guest page is written by write-emulation

  hold mmu-lock and see 
dirty-bitmap
  is not be changed, then 
migration is
  completed.

call mark_page_dirty() to set dirty_bit map


Right?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-03-01 Thread Takuya Yoshikawa

We have seen some problems of the current implementation of
get_dirty_log() which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
order of latency when we use VGA displays.

Furthermore the recent discussion on the following thread
srcu: Implement call_srcu()
http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves this goal without sacrificing the performance of
both VGA and live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of xchg() operation for
clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about the problem of using the atomic
memory instruction many times to the concurrently accessible bitmap,
it is usually accessed with mmu_lock held and we rarely see concurrent
accesses: so what we need to care about is the pure xchg() overheads.

Another point to note is that we do not use for_each_set_bit() to check
which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() in a loop.  This is much faster than repeatedly call
find_next_bit().

Performance:

The dirty-log-perf unit test showed nice improvements, some times faster
than before, when the number of dirty pages was below 8K.  For other
cases we saw a bit of regression but still enough fast compared to the
processing of these dirty pages in the userspace.

For real workloads, both VGA and live migration, we have observed pure
improvements: when the guest was reading a file, we originally saw a few
ms of latency, but with the new method the latency was less than 200us.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/x86.c |  116 +++
 1 files changed, 43 insertions(+), 73 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3bc1922..0748bab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 }
 
 /**
- * write_protect_slot - write protect a slot for dirty logging
- * @kvm: the kvm instance
- * @memslot: the slot we protect
- * @dirty_bitmap: the bitmap indicating which pages are dirty
- * @nr_dirty_pages: the number of dirty pages
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
  *
- * We have two ways to find all sptes to protect:
- * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
- *checks ones that have a spte mapping a page in the slot.
- * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
  *
- * Generally speaking, if there are not so many dirty pages compared to the
- * number of shadow pages, we should use the latter.
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Flush TLB's if needed.
+ *   4. Copy the snapshot to the userspace.
  *
- * Note that letting others write into a page marked dirty in the old bitmap
- * by using the remaining tlb entry is not a problem.  That page will become
- * write protected again when we flush the tlb and then be reported dirty to
- * the user space by copying the old bitmap.
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
  */
-static void write_protect_slot(struct kvm *kvm,
-  struct kvm_memory_slot *memslot,
-  unsigned long *dirty_bitmap,
-  unsigned long nr_dirty_pages)
-{
-   spin_lock(kvm-mmu_lock);
-
-   /* Not many dirty pages compared to # of shadow pages. */
-   if (nr_dirty_pages  kvm-arch.n_used_mmu_pages) {
-   gfn_t offset;
-
-   for_each_set_bit(offset, dirty_bitmap, memslot-npages)
-   kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 
1);
-
-   kvm_flush_remote_tlbs(kvm);
-   } else
-   kvm_mmu_slot_remove_write_access(kvm, memslot-id);
-
-   spin_unlock(kvm-mmu_lock);
-}
-
-/*
- * Get (and clear) the dirty memory log for a memory slot.
- */
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
- struct kvm_dirty_log *log)
+int

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Avi Kivity

On 02/29/2012 09:49 AM, Takuya Yoshikawa wrote:
 Avi Kivity a...@redhat.com wrote:

   The key part of the implementation is the use of xchg() operation for
   clearing dirty bits atomically.  Since this allows us to update only
   BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
   until every dirty bit is cleared again for the next call.
  
  What about using cmpxchg16b?  That should reduce locked ops by a factor
  of 2 (but note it needs 16 bytes alignment).

 I tried cmpxchg16b first: the implementation could not be naturally
 combined with the for loop over the unsigned long array.

   Extra if not zero, alignement check and ... it was ugly
   and I guessed it would be slow.

 Taking it into account that cmpxchg16b needs more cycles than others,
 I think this should be tried carefully with more measurement later.

 How about concentrating on xchg now?

Okay.  But note you don't need the alignment check; simply allocate the
array aligned, and a multiple of 16 bytes, in the first place.


 The implementation is simple and gives us enough improvement for now.
 At least, I want to see whether xchg-based implementation works well
 for one release.

   GET_DIRTY_LOG can be easily tuned to one particular case and
   it is really hard to check whether the implementation works well
   for every important case.  I really want feedback from users
   before adding non-obvious optimization.

 In addition, we should care about the new API.  It is not decided about
 what kind of range can be ordered.  I think restricting the range to be
 long size aligned is natural.  Do you have any plan?

Not really.  But the current changes are all internal and don't affect
the user API.


   Another point to note is that we do not use for_each_set_bit() to check
   which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
   simply use __ffs() and __fls() and pass the range in between the two
   positions found by them to kvm_mmu_write_protect_pt_range().
  
  This seems artificial.

 OK, then I want to pass the bits (unsingned long) as a mask.

 Non-NPT machines may gain some.

   Even though the passed range may include clean pages, it is much faster
   than repeatedly call find_next_bit() due to the locality of dirty pages.
  
  Perhaps this is due to the implementation of find_next_bit()?  would
  using bsf improve things?

 I need to explain what I did in the past.

 Before srcu-less work, I had already noticed the slowness of
 for_each_set_bit() and replaced it with simple for loop like now: the
 improvement was significant.

 Yes, find_next_bit() is for generic use and not at all good when there
 are many consecutive bits set: it cannot assume anything so needs to check
 a lot of cases - we have long size aligned bitmap and bits is already
 known to be non-zero after the first check of the for loop.

 Of course, doing 64 function calls alone should be avoided in our case.
 I also do not want to call kvm_mmu_* for each bit.

 So, above, I proposed just passing bits to kvm_mmu_*: we can check
 each bit i in a register before using rmap[i] if needed.

 __ffs is really fast compared to other APIs.

You could do something like this

for (i = 0; i  bitmap_size_in_longs; ++i) {
mask = bitmap[i];
if (!mask)
   continue;
for (j = __ffs(mask); mask; mask = mask - 1, j = __ffs(mask))
handle i * BITS_PER_LONG + j;
}

This gives you the speed of __ffs() but without working on zero bits.


 One note is that we will lose in cases like bits = 0x..

 2271171.412064.9   138.6  16K -45
 3375866.214743.3   103.0  32K -55
 4408395.610720.067.2  64K -51
 5915336.226538.145.1 128K -44
 8497356.416441.032.4 256K -29

 So the last one will become worse.  For other 4 patterns I am not sure.

 I thought that we should tune to the last case for gaining a lot from
 the locality of WWS.  What do you think about this point?

   - } else {
   - r = -EFAULT;
   - if (clear_user(log-dirty_bitmap, n))
   - goto out;
   + kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
  
  If indeed the problem is find_next_bit(), then we could hanve
  kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
  a parameter.  This would allow covering just this function with the
  spinlock, not the xchg loop.

 We may see partial display updates if we do not hold the mmu_lock during
 xchg loop: it is possible that pages near the end of the framebuffer alone
 gets updated sometimes - I noticed this problem when I fixed the TLB flush
 issue.

I don't understand why this happens.


 Not a big problem but still maybe-noticeable change, so I think we should
 do it separately with some comments if needed.

Well if it's noticable on the framebuffer it's also noticable with live
migration.  We could do

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Takuya Yoshikawa

Avi Kivity a...@redhat.com wrote:

 Okay.  But note you don't need the alignment check; simply allocate the
 array aligned, and a multiple of 16 bytes, in the first place.

OKay, then we can do something like:

for each (x = bitmap[i], y = bitmap[i+1])
if (!x  !y)
continue
else if ...
cmpxchg16b or 8b
...

I will try later.

  In addition, we should care about the new API.  It is not decided about
  what kind of range can be ordered.  I think restricting the range to be
  long size aligned is natural.  Do you have any plan?
 
 Not really.  But the current changes are all internal and don't affect
 the user API.

Then I will do my best to improve the performance now!

  __ffs is really fast compared to other APIs.
 
 You could do something like this
 
 for (i = 0; i  bitmap_size_in_longs; ++i) {
 mask = bitmap[i];
 if (!mask)
continue;
 for (j = __ffs(mask); mask; mask = mask - 1, j = __ffs(mask))
 handle i * BITS_PER_LONG + j;
 }
 
 This gives you the speed of __ffs() but without working on zero bits.

Yes, I will do this in v2.

  We may see partial display updates if we do not hold the mmu_lock during
  xchg loop: it is possible that pages near the end of the framebuffer alone
  gets updated sometimes - I noticed this problem when I fixed the TLB flush
  issue.
 
 I don't understand why this happens.

Because only mmu_lock protects the bitmap for VGA.

xchg i = 1
xchg i = 2
...
xchg i = N

We cannot get a complete snapshot without mmu_lock; if the guest faults on
the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
become newer.

But others will be updated by the next call, so the problem is restricted:
maybe not noticeable.

  Not a big problem but still maybe-noticeable change, so I think we should
  do it separately with some comments if needed.
 
 Well if it's noticable on the framebuffer it's also noticable with live
 migration.  We could do it later, but we need to really understand it first.

About live migration, we do not mind whether the bitmap is a complete snapshot.
In addition, we cannot do anything because the emulator can access the bitmap
without mmu_lock.

What we are doing is calling GET_DIRTY_LOG slot by slot: so already the
result is not a snapshot at all.

In the end, at the last stage, we will stop the guest and get a complete 
snapshot.

  In addition, we do not want to scan the dirty bitmap twice.  Using the
  bits value soon after it is read into a register seems to be the fastest.
 
 Probably.
 
  BTW, I also want to decide the design of the new API at this chance.
 
 Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
 everything we can out of the current APIs.

I agree with you of course.

At the same time, we cannot say anything without actually implementing
sample userspace programs.

So I want to see how much improvement the proposed API can achieve.

I thought this might be a good GSoC project but ...


Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Avi Kivity

On 02/29/2012 12:16 PM, Takuya Yoshikawa wrote:
   We may see partial display updates if we do not hold the mmu_lock during
   xchg loop: it is possible that pages near the end of the framebuffer alone
   gets updated sometimes - I noticed this problem when I fixed the TLB flush
   issue.
  
  I don't understand why this happens.

 Because only mmu_lock protects the bitmap for VGA.

 xchg i = 1
 xchg i = 2
 ...
 xchg i = N

 We cannot get a complete snapshot without mmu_lock; if the guest faults on
 the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
 become newer.

Ah, so there is no data corruption (missed dirty bits), just the display
is updated inconsistently?

I don't think we can get a consistent snapshot anyway, since the guest
can update the framebuffer while userspace is processing it.


 But others will be updated by the next call, so the problem is restricted:
 maybe not noticeable.

   Not a big problem but still maybe-noticeable change, so I think we should
   do it separately with some comments if needed.
  
  Well if it's noticable on the framebuffer it's also noticable with live
  migration.  We could do it later, but we need to really understand it first.

 About live migration, we do not mind whether the bitmap is a complete 
 snapshot.
 In addition, we cannot do anything because the emulator can access the bitmap
 without mmu_lock.

 What we are doing is calling GET_DIRTY_LOG slot by slot: so already the
 result is not a snapshot at all.

 In the end, at the last stage, we will stop the guest and get a complete 
 snapshot.

Understood.  I don't think we can get a consistent vga snapshot without
stopping the guest, and even then, it depends on how the guest updates
the framebuffer.


   In addition, we do not want to scan the dirty bitmap twice.  Using the
   bits value soon after it is read into a register seems to be the fastest.
  
  Probably.
  
   BTW, I also want to decide the design of the new API at this chance.
  
  Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
  everything we can out of the current APIs.

 I agree with you of course.

 At the same time, we cannot say anything without actually implementing
 sample userspace programs.

 So I want to see how much improvement the proposed API can achieve.

 I thought this might be a good GSoC project but ...

It may be too involved for GSoC, the issues are difficult.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Takuya Yoshikawa

Avi Kivity a...@redhat.com wrote:

  We cannot get a complete snapshot without mmu_lock; if the guest faults on
  the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
  become newer.
 
 Ah, so there is no data corruption (missed dirty bits), just the display
 is updated inconsistently?

Yes, no data corruption and just a matter of ... feeling.

The basic rule I wrote in the comment in this patch should be enough to
not lose dirty bits.

 I don't think we can get a consistent snapshot anyway, since the guest
 can update the framebuffer while userspace is processing it.

Yes, nothing will be broken.  I was just not sure what the API should promise
to the userspace.

To some extent the inconsistency may be felt more than before.

 Understood.  I don't think we can get a consistent vga snapshot without
 stopping the guest, and even then, it depends on how the guest updates
 the framebuffer.

OKay, I will not care about this from now.

  So I want to see how much improvement the proposed API can achieve.
 
  I thought this might be a good GSoC project but ...
 
 It may be too involved for GSoC, the issues are difficult.

I am now checking QEMU code closely - rather readable than before!

Though I can make experimental KVM patches for the student, just changing
QEMU's live migration code seems to be difficult - but not sure now.

Anyway, we should try this before deciding the new API: I mean if I cannot
find anyone who want to try this, no need to be a student, I may do instead
at some point.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-28 Thread Avi Kivity

On 02/23/2012 01:35 PM, Takuya Yoshikawa wrote:
 We have seen some problems of the current implementation of
 get_dirty_log() which uses synchronize_srcu_expedited() for updating
 dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
 order of latency when we use VGA displays.

 Furthermore the recent discussion on the following thread
 srcu: Implement call_srcu()
 http://lkml.org/lkml/2012/1/31/211
 also motivated us to implement get_dirty_log() without SRCU.

 This patch achieves this goal without sacrificing the performance of
 both VGA and live migration: in practice the new code is much faster
 than the old one unless we have too many dirty pages.

 Implementation:

 The key part of the implementation is the use of xchg() operation for
 clearing dirty bits atomically.  Since this allows us to update only
 BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
 until every dirty bit is cleared again for the next call.

What about using cmpxchg16b?  That should reduce locked ops by a factor
of 2 (but note it needs 16 bytes alignment).


 Although some people may worry about the problem of using the atomic
 memory instruction many times to the concurrently accessible bitmap,
 it is usually accessed with mmu_lock held and we rarely see concurrent
 accesses: so what we need to care about is the pure xchg() overheads.

 Another point to note is that we do not use for_each_set_bit() to check
 which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
 simply use __ffs() and __fls() and pass the range in between the two
 positions found by them to kvm_mmu_write_protect_pt_range().

This seems artificial.

 Even though the passed range may include clean pages, it is much faster
 than repeatedly call find_next_bit() due to the locality of dirty pages.

Perhaps this is due to the implementation of find_next_bit()?  would
using bsf improve things?

 Performance:

 The dirty-log-perf unit test showed nice improvement, some times faster
 than before, when the number of dirty pages was below 8K.  For other
 cases we saw a bit of regression but still enough fast compared to the
 processing of these dirty pages in the userspace.

 For real workloads, both VGA and live migration, we have observed pure
 improvement: when the guest was reading a file, we originally saw a few
 ms of latency, but with the new method the latency was 50us to 300us.

  
  /**
 - * write_protect_slot - write protect a slot for dirty logging
 - * @kvm: the kvm instance
 - * @memslot: the slot we protect
 - * @dirty_bitmap: the bitmap indicating which pages are dirty
 - * @nr_dirty_pages: the number of dirty pages
 + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a 
 slot
 + * @kvm: kvm instance
 + * @log: slot id and address to which we copy the log
   *
 - * We have two ways to find all sptes to protect:
 - * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
 - *checks ones that have a spte mapping a page in the slot.
 - * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
 + * We need to keep it in mind that VCPU threads can write to the bitmap
 + * concurrently.  So, to avoid losing data, we keep the following order for
 + * each bit:
   *
 - * Generally speaking, if there are not so many dirty pages compared to the
 - * number of shadow pages, we should use the latter.
 + *   1. Take a snapshot of the bit and clear it if needed.
 + *   2. Write protect the corresponding page.
 + *   3. Flush TLB's if needed.
 + *   4. Copy the snapshot to the userspace.
   *
 - * Note that letting others write into a page marked dirty in the old bitmap
 - * by using the remaining tlb entry is not a problem.  That page will become
 - * write protected again when we flush the tlb and then be reported dirty to
 - * the user space by copying the old bitmap.
 + * Between 2 and 3, the guest may write to the page using the remaining TLB
 + * entry.  This is not a problem because the page will be reported dirty at
 + * step 4 using the snapshot taken before and step 3 ensures that successive
 + * writes will be logged for the next call.
   */
 -static void write_protect_slot(struct kvm *kvm,
 -struct kvm_memory_slot *memslot,
 -unsigned long *dirty_bitmap,
 -unsigned long nr_dirty_pages)
 -{
 - spin_lock(kvm-mmu_lock);
 -
 - /* Not many dirty pages compared to # of shadow pages. */
 - if (nr_dirty_pages  kvm-arch.n_used_mmu_pages) {
 - gfn_t offset;
 -
 - for_each_set_bit(offset, dirty_bitmap, memslot-npages)
 - kvm_mmu_write_protect_pt_range(kvm, memslot, offset, 
 offset);
 -
 - kvm_flush_remote_tlbs(kvm);
 - } else
 - kvm_mmu_slot_remove_write_access(kvm, memslot-id);
 -
 - spin_unlock(kvm-mmu_lock);
 -}
 -
 -/*
 - * Get (and clear) the dirty memory log for a memory slot.

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-28 Thread Takuya Yoshikawa

Avi Kivity a...@redhat.com wrote:

  The key part of the implementation is the use of xchg() operation for
  clearing dirty bits atomically.  Since this allows us to update only
  BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
  until every dirty bit is cleared again for the next call.
 
 What about using cmpxchg16b?  That should reduce locked ops by a factor
 of 2 (but note it needs 16 bytes alignment).

I tried cmpxchg16b first: the implementation could not be naturally
combined with the for loop over the unsigned long array.

Extra if not zero, alignement check and ... it was ugly
and I guessed it would be slow.

Taking it into account that cmpxchg16b needs more cycles than others,
I think this should be tried carefully with more measurement later.

How about concentrating on xchg now?

The implementation is simple and gives us enough improvement for now.
At least, I want to see whether xchg-based implementation works well
for one release.

GET_DIRTY_LOG can be easily tuned to one particular case and
it is really hard to check whether the implementation works well
for every important case.  I really want feedback from users
before adding non-obvious optimization.

In addition, we should care about the new API.  It is not decided about
what kind of range can be ordered.  I think restricting the range to be
long size aligned is natural.  Do you have any plan?

  Another point to note is that we do not use for_each_set_bit() to check
  which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
  simply use __ffs() and __fls() and pass the range in between the two
  positions found by them to kvm_mmu_write_protect_pt_range().
 
 This seems artificial.

OK, then I want to pass the bits (unsingned long) as a mask.

Non-NPT machines may gain some.

  Even though the passed range may include clean pages, it is much faster
  than repeatedly call find_next_bit() due to the locality of dirty pages.
 
 Perhaps this is due to the implementation of find_next_bit()?  would
 using bsf improve things?

I need to explain what I did in the past.

Before srcu-less work, I had already noticed the slowness of
for_each_set_bit() and replaced it with simple for loop like now: the
improvement was significant.

Yes, find_next_bit() is for generic use and not at all good when there
are many consecutive bits set: it cannot assume anything so needs to check
a lot of cases - we have long size aligned bitmap and bits is already
known to be non-zero after the first check of the for loop.

Of course, doing 64 function calls alone should be avoided in our case.
I also do not want to call kvm_mmu_* for each bit.

So, above, I proposed just passing bits to kvm_mmu_*: we can check
each bit i in a register before using rmap[i] if needed.

__ffs is really fast compared to other APIs.

One note is that we will lose in cases like bits = 0x..

2271171.412064.9   138.6  16K -45
3375866.214743.3   103.0  32K -55
4408395.610720.067.2  64K -51
5915336.226538.145.1 128K -44
8497356.416441.032.4 256K -29

So the last one will become worse.  For other 4 patterns I am not sure.

I thought that we should tune to the last case for gaining a lot from
the locality of WWS.  What do you think about this point?

  -   } else {
  -   r = -EFAULT;
  -   if (clear_user(log-dirty_bitmap, n))
  -   goto out;
  +   kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
 
 If indeed the problem is find_next_bit(), then we could hanve
 kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
 a parameter.  This would allow covering just this function with the
 spinlock, not the xchg loop.

We may see partial display updates if we do not hold the mmu_lock during
xchg loop: it is possible that pages near the end of the framebuffer alone
gets updated sometimes - I noticed this problem when I fixed the TLB flush
issue.

Not a big problem but still maybe-noticeable change, so I think we should
do it separately with some comments if needed.

In addition, we do not want to scan the dirty bitmap twice.  Using the
bits value soon after it is read into a register seems to be the fastest.


BTW, I also want to decide the design of the new API at this chance.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-23 Thread Takuya Yoshikawa

We have seen some problems of the current implementation of
get_dirty_log() which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
order of latency when we use VGA displays.

Furthermore the recent discussion on the following thread
srcu: Implement call_srcu()
http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves this goal without sacrificing the performance of
both VGA and live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of xchg() operation for
clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about the problem of using the atomic
memory instruction many times to the concurrently accessible bitmap,
it is usually accessed with mmu_lock held and we rarely see concurrent
accesses: so what we need to care about is the pure xchg() overheads.

Another point to note is that we do not use for_each_set_bit() to check
which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() and __fls() and pass the range in between the two
positions found by them to kvm_mmu_write_protect_pt_range().

Even though the passed range may include clean pages, it is much faster
than repeatedly call find_next_bit() due to the locality of dirty pages.

Performance:

The dirty-log-perf unit test showed nice improvement, some times faster
than before, when the number of dirty pages was below 8K.  For other
cases we saw a bit of regression but still enough fast compared to the
processing of these dirty pages in the userspace.

For real workloads, both VGA and live migration, we have observed pure
improvement: when the guest was reading a file, we originally saw a few
ms of latency, but with the new method the latency was 50us to 300us.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/x86.c |  117 +++-
 1 files changed, 43 insertions(+), 74 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b3d1eb..be4c52b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 }
 
 /**
- * write_protect_slot - write protect a slot for dirty logging
- * @kvm: the kvm instance
- * @memslot: the slot we protect
- * @dirty_bitmap: the bitmap indicating which pages are dirty
- * @nr_dirty_pages: the number of dirty pages
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
  *
- * We have two ways to find all sptes to protect:
- * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
- *checks ones that have a spte mapping a page in the slot.
- * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
  *
- * Generally speaking, if there are not so many dirty pages compared to the
- * number of shadow pages, we should use the latter.
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Flush TLB's if needed.
+ *   4. Copy the snapshot to the userspace.
  *
- * Note that letting others write into a page marked dirty in the old bitmap
- * by using the remaining tlb entry is not a problem.  That page will become
- * write protected again when we flush the tlb and then be reported dirty to
- * the user space by copying the old bitmap.
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
  */
-static void write_protect_slot(struct kvm *kvm,
-  struct kvm_memory_slot *memslot,
-  unsigned long *dirty_bitmap,
-  unsigned long nr_dirty_pages)
-{
-   spin_lock(kvm-mmu_lock);
-
-   /* Not many dirty pages compared to # of shadow pages. */
-   if (nr_dirty_pages  kvm-arch.n_used_mmu_pages) {
-   gfn_t offset;
-
-   for_each_set_bit(offset, dirty_bitmap, memslot-npages)
-   kvm_mmu_write_protect_pt_range(kvm, memslot, offset, 
offset);
-
-   kvm_flush_remote_tlbs(kvm);
-   } else
-   kvm_mmu_slot_remove_write_access(kvm, memslot-id);
-
-   spin_unlock(kvm-mmu_lock);
-}
-
-/*
- *

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

[PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

[PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

15 matches

Site Navigation

Mail list logo

Footer information