Re:[RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.

2010-04-23 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---

Michael,
Can't vhost supply a kiocb completion callback that will handle the list?
Yes, thanks. And with it I also remove the vq-receivr finally.
Thanks
Xiaohui

Nice progress. I commented on some minor issues below.
Thanks!

The updated patch addressed your comments on the minor issues.
Thanks!

Thanks
Xiaohui  

 drivers/vhost/net.c   |  236 +++-
 drivers/vhost/vhost.c |  120 ++---
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 314 insertions(+), 56 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 38989d1..18f6c41 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -23,6 +23,8 @@
 #include linux/if_arp.h
 #include linux/if_tun.h
 #include linux/if_macvlan.h
+#include linux/mpassthru.h
+#include linux/aio.h
 
 #include net/sock.h
 
@@ -48,6 +50,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -92,11 +95,138 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net-tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(vq-notify_lock, flags);
+   if (!list_empty(vq-notifier)) {
+   iocb = list_first_entry(vq-notifier,
+   struct kiocb, ki_list);
+   list_del(iocb-ki_list);
+   }
+   spin_unlock_irqrestore(vq-notify_lock, flags);
+   return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+   struct vhost_virtqueue *vq = iocb-private;
+   unsigned long flags;
+
+   spin_lock_irqsave(vq-notify_lock, flags);
+   list_add_tail(iocb-ki_list, vq-notifier);
+   spin_unlock_irqrestore(vq-notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+   return (vq-link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ struct socket *sock)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   unsigned int head, log, in, out;
+   int size;
+
+   if (!is_async_vq(vq))
+   return;
+
+   if (sock-sk-sk_data_ready)
+   sock-sk-sk_data_ready(sock-sk, 0);
+
+   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
+   vq-log : NULL;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(net-dev, vq,
+   iocb-ki_pos, iocb-ki_nbytes);
+   size = iocb-ki_nbytes;
+   head = iocb-ki_pos;
+   rx_total_len += iocb-ki_nbytes;
+
+   if (iocb-ki_dtor)
+   iocb-ki_dtor(iocb);
+   kmem_cache_free(net-cache, iocb);
+
+   /* when log is enabled, recomputing the log info is needed,
+* since these buffers are in async queue, and may not get
+* the log info before.
+*/
+   if (unlikely(vq_log)) {
+   if (!log)
+   __vhost_get_vq_desc(net-dev, vq, vq-iov,
+   ARRAY_SIZE(vq-iov),
+   out, in, vq_log,
+   log, head);
+   vhost_log_write(vq, vq_log, log, size);
+   }
+   if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(vq-poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   int tx_total_len = 0;
+
+   if (!is_async_vq(vq))
+   return;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(net-dev, vq,
+   iocb-ki_pos, 0);
+   tx_total_len += iocb-ki_nbytes;
+
+   if (iocb-ki_dtor)
+   iocb-ki_dtor(iocb);
+
+   kmem_cache_free(net-cache, iocb);
+   if 

Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Fernando Luis Vázquez Cao
On 04/23/2010 02:17 PM, Yoshiaki Tamura wrote:
 Dor Laor wrote:
[...]
 Second, even if it wasn't the case, the tsc delta and kvmclock are
 synchronized as part of the VM state so there is no use of trapping it
 in the middle.
 
 I should study the clock in KVM, but won't tsc get updated by the HW
 after migration?
 I was wondering the following case for example:
 
 1. The application on the guest calls rdtsc on host A.
 2. The application uses rdtsc value for something.
 3. Failover to host B.
 4. The application on the guest replays the rdtsc call on host B.
 5. If the rdtsc value is different between A and B, the application may
 get into trouble because of it.

Regarding the TSC, we need to guarantee that the guest sees a monotonic
TSC after migration, which can be achieved by adjusting the TSC offset properly.
Besides, we also need a trapping TSC, so that we can tackle the case where the
primary node and the standby node have different TSC frequencies.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM MMU: Take sp level into account when calculating quadran

2010-04-23 Thread Gui Jianfeng
Gui Jianfeng wrote:
 Take sp level into account when calculating quadrant, because only when
 level == PT_PAGE_TABLE_LEVEL, quadrant is needed.

Please ignore this patch, Sorry for the noise.

Gui

 
 Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 640b82d..2a35a65 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1324,7 +1324,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
 kvm_vcpu *vcpu,
   if (role.direct)
   role.cr4_pae = 0;
   role.access = access;
 - if (vcpu-arch.mmu.root_level = PT32_ROOT_LEVEL) {
 + if (vcpu-arch.mmu.root_level = PT32_ROOT_LEVEL 
 + level == PT_PAGE_TABLE_LEVEL) {
   quadrant = gaddr  (PAGE_SHIFT + (PT64_PT_BITS * level));
   quadrant = (1  ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
   role.quadrant = quadrant;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stefan Hajnoczi
On Fri, Apr 23, 2010 at 1:45 AM, Stuart Sheldon s...@actusa.net wrote:
 Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
 to netboot a guest, it appears that the pxe rom is timing out on dhcp
 before the bridge has enough time to come up.

 Is there a command line switch to set the dhcp timeout, or a build
 option that can be changed to set the timeout to a longer value, or
 disable it entirely?

The bridge shouldn't need significant amounts of time to come up.  Can
you describe the networking setup?  Are you using libvirt and with
what network config?

If you have a bridge configured, can you show the output of:

$ sudo brctl showstp $bridge_name

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: no need to test bit before setting dirty bit these days

2010-04-23 Thread Takuya Yoshikawa
As Avi pointed out, testing bit part in mark_page_dirty() was important
in the days of shadow paging, but currently EPT and NPT has already become
common and the chance of faulting a page more that once per iteration is
small. So let's remove the test bit to avoid extra access.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 virt/kvm/kvm_main.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6dc9404..9ab1a77 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1192,9 +1192,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
if (memslot  memslot-dirty_bitmap) {
unsigned long rel_gfn = gfn - memslot-base_gfn;
 
-   /* avoid RMW */
-   if (!generic_test_le_bit(rel_gfn, memslot-dirty_bitmap))
-   generic___set_le_bit(rel_gfn, memslot-dirty_bitmap);
+   generic___set_le_bit(rel_gfn, memslot-dirty_bitmap);
}
 }
 
-- 
1.6.3.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to tweak kernel to get the best out of kvm?

2010-04-23 Thread Alec Istomin
Hi there!
I'm trying to understand kvm performance bottleneck and I was hoping to
get sharing statistics form ksm. Was not able to locate it so far, the
below mentioned /sys entry is missing, while ksm module is loaded.

I'm using latest from rhel 5 x64: 2.6.18-194.el5, kmod-kvm-83-164.el5 and
I'm sorry if this is the wrong place for a redhat-specific question.

Thanks,
 Alec
 
On Wed, 10 Mar 2010 08:01:12 -0800, Avi Kivity wrote:
  See /sys/kernel/mm/ksm


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:44 AM, Zachary Amsden wrote:

   Or apply this patch.
time-warp.patch


diff -rup a/time-warp-test.c b/time-warp-test.c
--- a/time-warp-test.c  2010-04-15 16:30:13.955981607 -1000
+++ b/time-warp-test.c  2010-04-15 16:35:37.777982377 -1000
@@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
  {
DECLARE_ARGS(val, low, high);

-   asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high));
+   asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high) :: ebx, 
ecx);

   


Plus, replace cpuid by lfence/mfence.  cpuid will trap.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Linux Plumbers Conference 2010 - call for tracks

2010-04-23 Thread Jes Sorensen
Hi,

I am organizing a Virtualization track at LPC 2010 (Linux Plumbers
Conference 2010). Please see the official call for tracks below.

LPC is particular well suited for work in progress and subjects that
needs discussion and collaboration between communities, for example
between KVM/QEMU and the Linux Kernel community. We expect strong
participation of kernel developers at LPC, and likely also from the
desktop community (GNOME/KDE/Xorg).

If you have a virtualization related project proposal, please submit it
using the information below. Note this is focused on general Linux
Virtualization, it is not limited to a specific hypervisor.

Subjects could be in the area of (but not limited to):
 - Kernel - KVM/QEMU interaction
 - IO Performance improvements
 - NUMA awareness
 - Migration
 - Support for new hardware features, and/or provide guest access to
   these features.
 - Virtualization management, user interfaces, and desktop integration

If you have questions related to this track, feel free to ask
lpc-plann...@linuxplumbersconf.org or email me directly.

Please feel free to forward this message to other mailing lists or
people for whom you feel it would be relevant.

Best regards,
Jes


Linux Plumbers Conference 2010
Call for Tracks

This year, Linux Plumbers Conference will take place in Cambridge, MA
on November 3-5, 2010.  Unlike more traditional conferences, the
Plumbers conference is not structured around presentations of
completed work, or problems and solutions confined to a single
subsystem or layer of the Linux ecosystem.  Rather the Plumbers
Conference encourages BOFs type meetings and brainstorming sessions
where technical experts from different areas and leaders in the Linux
and Open Source world can get together and discuss how to make
progress towards the solution of interdisciplinary multifaceted
problems spanning multiple components of the Linux system.  In some
sense, the Plumbers Conference is really more of a workshop.

The program committee for the Linux Plumbers Conference is looking for
proposals for the tracks that will be run during the Plumbers
Conference.

To do that, we are looking for problem statements: things that could
be improved in Linux that cross multiple interfaces or other project
boundaries (if you can solve it yourself inside a single project,
please, don't let us stop you --- get hacking!).  We are looking for
problems that require collaboration and face-to-face communication
across multiple teams and open source projects.  These problems could
apply to anywhere Linux is used: Linux on the Desktop, Linux on Mobile
devices, Linux on servers, etc.

For example, if in order to get better performance, we need to get
better information about low-level devices from the kernel, and that
needs to be utilized by file system utilities, and the user needs to
be able to involved by exposing options at the UI level in control
panels and distribution installers --- the Plumbers conference might
be a great place to get everyone in the same room for half a day to
solve this particular problem.

Along with your problem statements or track ideas, please list the
projects which and/or key individuals who ideally should be present,
and who might be a good person or persons to run such a conference
track.

If you have any thoughts or contributions, you can either

   * discuss them on this Linux Weekly News page:
http://lwn.net/Articles/lpc2010-cfi/
   * add the proposed topic to the Topics wiki page:
http://wiki.linuxplumbersconf.org/2010:topics
   * send e-mail to: lpc-plann...@linuxplumbersconf.org

Many thanks for helping to make Linux an even better platform!

The 2010 LPC Committee

Note: The event will be co-located with the Linux Kernel Summit which
will be held earlier that week.

Please feel free to forward this announcements to any communities or
mailing lists where you think it would be appropriate.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Avi Kivity

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.

   


Why is this needed?  The real buffering is in the kernel anyways; this 
is only used to reduce the number of write() syscalls.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Yoshiaki Tamura

Avi Kivity wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB. It would be useful if it could
be flexible.



Why is this needed? The real buffering is in the kernel anyways; this is
only used to reduce the number of write() syscalls.


This was introduced to buffer the transfered guests image transaction ally on 
the receiver side.  The sender doesn't use it.

In case of intermediate state, we just discard this buffer.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Fernando Luis Vázquez Cao
On 04/23/2010 08:29 AM, Alexander Graf wrote:
 
 On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote:
 
 On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
 On 04/21/2010 06:41 PM, Alexander Graf wrote:
 On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:

 On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
 @@ -318,7 +318,7 @@ struct kvm_dirty_log {
  __u32 padding1;
  union {
  void __user *dirty_bitmap; /* one bit per page */
 -__u64 padding2;
 +__u64 addr;

 This can break on x86_32 and x86_64-compat. addr is a long not a __u64.

 So the high 32 bits are zero. Where's the problem?

 If we are careful enough to cast the addr appropriately we should be fine,
 even if we keep the padding field in the union. I am not saying that it
 breaks 32 architectures but that it can potentially be problematic.

 +case KVM_SWITCH_DIRTY_LOG: {
 +struct kvm_dirty_log log;
 +
 +r = -EFAULT;
 +if (copy_from_user(log, argp, sizeof log))
 +goto out;
 +r = kvm_vm_ioctl_switch_dirty_log(kvm, log);
 +if (r)
 +goto out;
 +r = -EFAULT;
 +if (copy_to_user(argp, log, sizeof log))
 +goto out;
 +r = 0;
 +break;
 +}

 In x86_64-compat mode we are handling 32bit user-space addresses
 so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.

 The compat code just forwards everything to the generic ioctls.

 The compat code uses struct compat_kvm_dirty_log instead of
 struct kvm_dirty_log to communicate with user space so
 the necessary conversions needs to be done before invoking
 the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).

 By the way we probable should move the definition of struct
 compat_kvm_dirty_log to a header file.

 It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
 Are you considering a different approach to tackle the issues that we
 have with a big-endian userspace?
 
 IIRC the issue was a pointer inside of a nested structure, no?

I would say the reason is that if we did not convert the user-space pointer to
a void * kvm_get_dirty_log() would end up copying the dirty log to

(log-dirty_bitmap  32) | 0x

Am I missing something?

Thanks,

Fernando
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 12:17, Fernando Luis Vázquez Cao wrote:

 On 04/23/2010 08:29 AM, Alexander Graf wrote:
 
 On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote:
 
 On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
 On 04/21/2010 06:41 PM, Alexander Graf wrote:
 On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:
 
 On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
 @@ -318,7 +318,7 @@ struct kvm_dirty_log {
 __u32 padding1;
 union {
 void __user *dirty_bitmap; /* one bit per page */
 -   __u64 padding2;
 +   __u64 addr;
 
 This can break on x86_32 and x86_64-compat. addr is a long not a __u64.
 
 So the high 32 bits are zero. Where's the problem?
 
 If we are careful enough to cast the addr appropriately we should be fine,
 even if we keep the padding field in the union. I am not saying that it
 breaks 32 architectures but that it can potentially be problematic.
 
 +   case KVM_SWITCH_DIRTY_LOG: {
 +   struct kvm_dirty_log log;
 +
 +   r = -EFAULT;
 +   if (copy_from_user(log, argp, sizeof log))
 +   goto out;
 +   r = kvm_vm_ioctl_switch_dirty_log(kvm, log);
 +   if (r)
 +   goto out;
 +   r = -EFAULT;
 +   if (copy_to_user(argp, log, sizeof log))
 +   goto out;
 +   r = 0;
 +   break;
 +   }
 
 In x86_64-compat mode we are handling 32bit user-space addresses
 so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.
 
 The compat code just forwards everything to the generic ioctls.
 
 The compat code uses struct compat_kvm_dirty_log instead of
 struct kvm_dirty_log to communicate with user space so
 the necessary conversions needs to be done before invoking
 the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).
 
 By the way we probable should move the definition of struct
 compat_kvm_dirty_log to a header file.
 
 It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
 Are you considering a different approach to tackle the issues that we
 have with a big-endian userspace?
 
 IIRC the issue was a pointer inside of a nested structure, no?
 
 I would say the reason is that if we did not convert the user-space pointer to
 a void * kvm_get_dirty_log() would end up copying the dirty log to
 
 (log-dirty_bitmap  32) | 0x

Well yes, that was the problem. If we always set the __u64 value to the pointer 
we're safe though.

union {
  void *p;
  __u64 q;
}

void x(void *r)
{
  // breaks:
  p = r;

  // works:
  q = (ulong)r;
}

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Document mmu

2010-04-23 Thread Avi Kivity

On 04/22/2010 11:16 PM, Marcelo Tosatti wrote:


Looks good otherwise. Perhaps add a pointer to Joerg's NPT slides,
although they're AMD specific.
   


Fixed all the comments, added a Further reading section and applied.  
Note that this is still complete (example: large pages); patches 
(especially to Documentation/kvm/locking.txt) more than welcome.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Document mmu

2010-04-23 Thread Avi Kivity

On 04/23/2010 10:12 AM, Gui Jianfeng wrote:



+Guest memory (gpa) is part of user address space of the process that is using
+kvm.  Userspace defines the translation between guest addresses and user
+addresses (gpa-gva); note that two gpas may alias to the same gva, but not
 

Do you mean (gpa-hva)?
   


I do.  Fixed.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 3/6] KVM: introduce a wrapper function to copy dirty bitmaps to user space

2010-04-23 Thread Avi Kivity

On 04/22/2010 11:57 AM, Takuya Yoshikawa wrote:


And about x86_32 copy_in_user().

They are using copy_user_generic() which is implemented only for 64bit.

So I'll show them current copy_from_user() and copy_to_user() version and
see their response.

If rejected, though I hope not, please give me another option, OK?



If we can justify it, I doubt it will be rejected, but in case it does, 
we'll work something out.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Avi Kivity

On 04/22/2010 12:07 PM, Takuya Yoshikawa wrote:





+ slots-memslots[i].dirty_bitmap = NULL;
+ slots-memslots[i].dirty_bitmap_old = NULL;
kvm_free_physmem_slot(slots-memslots[i], NULL);
+ }


+/*
+ * Please use generic *_user bitops once they become available.
+ * Be careful setting the bit won't be done atomically.
+ */


Please introduce the user bitops as part of this patchset.



OK, I will do in the next version. In this RFC, I would be happy if I can
know the overall design is right or not.



Everything looks reasonable to me.

Do you have performance numbers?  I'm interested in both measurements of 
KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for 
example, total guest throughput improvement under Kemari).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: no need to test bit before setting dirty bit these days

2010-04-23 Thread Avi Kivity

On 04/23/2010 11:48 AM, Takuya Yoshikawa wrote:

As Avi pointed out, testing bit part in mark_page_dirty() was important
in the days of shadow paging, but currently EPT and NPT has already become
common and the chance of faulting a page more that once per iteration is
small. So let's remove the test bit to avoid extra access.
   


Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-23 Thread Avi Kivity

On 04/22/2010 01:27 PM, Liu Yu-B13201 wrote:


I met this error when built kernel. Anything wrong?

   CC  init/main.o
In file included from include/linux/ftrace_event.h:8,
  from include/trace/syscall.h:6,
  from include/linux/syscalls.h:75,
  from init/main.c:16:
include/linux/perf_event.h: In function 'perf_register_guest_info_callbacks':
include/linux/perf_event.h:1019: error: parameter name omitted
include/linux/perf_event.h: In function 'perf_unregister_guest_info_callbacks':
include/linux/perf_event.h:1021: error: parameter name omitted
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
   


I merged tip/perf/code which may fix this.  Find it in kvm.git next branch.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-23 Thread Avi Kivity

On 04/22/2010 10:55 PM, Gleb Natapov wrote:




What about converting PIC/IOAPIC mutexes into spinlocks?
   

Works for me, but on large guests the spinning will be noticeable.
I believe.
 

For interrupts going through IOPIC, but we know this is not scalable
anyway.
   


Yes.  We also wanted to convert the ioapic/pic to spinlocks so we could 
queue the interrupt from the PIT directly instead of using 
KVM_REQ_PENDING_TIMER which keeps confusing me.  Chris Lalancette posted 
a patchset for this a while back but it was never completed.


I'm not really happy with adding lots of spin_lock_irqsave()s though, 
especially on the ioapic which may iterate over all vcpus (not worried 
about scaling, but about a malicious guest hurting host latency).


An alternative is make kvm_set_irq() irq safe: if msi, do things 
immediately, otherwise post a work item.  So we can call kvm_set_irq() 
directly from the interrupt.


Alternative alternative (perhaps better for short term): switch 
assigned_dev_lock to a mutex (we're already in a work handler, no need 
for spinlock).  The race between the irq and removal of an assigned 
device is closed by free_irq():



 lock
 mark assigned device as going away
 unlock
 free_irq()
 actually kill it

like invalid mmu pages.

--

Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes

2010-04-23 Thread Avi Kivity

On 04/23/2010 12:15 AM, Eric Northup wrote:

I've been reading the x86's mmu.c recently and had been wondering
about something.  Avi's recent mmu documentation (thanks!) seems to
have confirmed my understanding of how the shadow paging is supposed
to be working.


Wasn't it a lot more fun to understand it from the code?  reading the 
documentation is way to easy.



In TDP mode, when mmu_alloc_roots() calls
kvm_mmu_get_page(), why does it pass (vcpu-arch.cr3  PAGE_SHIFT) or
(vcpu-arch.mmu.pae_root[i]) as gfn?
   


Historical accident.  Luckily, no harmful effects other than wasting a 
bit of cpu cycles every now and then.



It seems to me that in TDP mode, gfn should be either zero for the
root page table, or 0/1GB/2GB/3GB (for PAE page tables).

The existing behavior can lead to multiple, semantically-identical TDP
roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at
the time that mmu_alloc_roots was called.  But the nested page tables
should be* independent of the VCPU state. That wastes some memory and
causes extra page faults while populating the extra copies of the page
tables.
   


Yeah.


*assuming that we aren't modeling per-VCPU state that might change the
physical address map as seen by that VCPU, such as setting the APIC
base to an address overlapping RAM.
   


We don't actually support that.


All feedback would be welcome, since I'm new to this system!  A
strawman patch follows.

   


Patch is correct, but I have already fixed this in a more extensive 
patch set (that also folds the 32-bit and 64-bit cases together, etc.) 
and I'm too lazy to rebase on top of yours.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Takuya Yoshikawa

(2010/04/23 19:28), Avi Kivity wrote:



OK, I will do in the next version. In this RFC, I would be happy if I can
know the overall design is right or not.



Everything looks reasonable to me.



Thank you!


Do you have performance numbers? I'm interested in both measurements of
KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for
example, total guest throughput improvement under Kemari).



Currently, I'm just checking the performance visually.
  - on laptops, we can feel the speed directly: my favorite is installing
ubuntu or debian by alternate installers

  - live migration with heavy work load

Now that I've got the overall design, I want to measure the performance by 
numbers.


About performance under Kemari: we(oss.ntt.co.jp staffs: me and Fernando) are
now concentrating on improving the basic live-migration infrastructures and
they(lab.ntt.co.jp staffs) are working hard for building Kemari itself.

  - We are also interested in using live-migration with HA software and this 
needs
light, stable live-migration: same as Kemari!


So measuring Kemari's performance with our patch-set will need some more time,
** How about Tamura-san? **


So at this stage, I'll show you performance improvement(I hope to improve) by
another test cases.

If possible next week!

Thanks,
  Takuya

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:12 AM, Xiao Guangrong wrote:

If the guest is 32-bit, we should use 'quadrant' to adjust gpa
offset

   


Good catch.  Only affects kvm_mmu_pte_write(), so I don't think this had 
ill effects other than not prefetching the correct address?



@@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
((level == PT_DIRECTORY_LEVEL  is_large_pte(*sptep))) ||
((level == PT_PDPE_LEVEL  is_large_pte(*sptep {
struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   int offset = 0;
+
+   if (PTTYPE == 32)
+   offset = sp-role.quadrant  PT64_LEVEL_BITS;;
   


Wrong for PT_DIRECTORY_LEVEL (should be q  8).  Also, too many 
semicolons.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:12 AM, Xiao Guangrong wrote:

Using '!sp-role.cr4_pae' replaces 'PTTYPE == 32' and using
'pte_size = sp-role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t)

Then no need compile twice for this code

Signed-off-by: Xiao Guangrongxiaoguangr...@cn.fujitsu.com
---
  arch/x86/kvm/mmu.c |   60 ++-
  arch/x86/kvm/paging_tmpl.h |   56 -
  2 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index abf8bd4..fac7c09 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2256,6 +2256,62 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 
gpte, int level)
return (gpte  vcpu-arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
  }

+static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+   struct kvm_shadow_walk_iterator iterator;
+   gpa_t pte_gpa = -1;
+   int level;
+   u64 *sptep;
+   int need_flush = 0;
+   unsigned pte_size = 0;
+
+   spin_lock(vcpu-kvm-mmu_lock);
+
+   for_each_shadow_entry(vcpu, gva, iterator) {
+   level = iterator.level;
+   sptep = iterator.sptep;
+
+   if (level == PT_PAGE_TABLE_LEVEL  ||
+   ((level == PT_DIRECTORY_LEVEL  is_large_pte(*sptep))) ||
+   ((level == PT_PDPE_LEVEL  is_large_pte(*sptep {
+   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   int offset = 0;
+
+   if (!sp-role.cr4_pae)
+   offset = sp-role.quadrant  PT64_LEVEL_BITS;;
+   pte_size = sp-role.cr4_pae ? 8 : 4;
+   pte_gpa = (sp-gfn  PAGE_SHIFT);
+   pte_gpa += (sptep - sp-spt + offset) * pte_size;
+
+   if (is_shadow_present_pte(*sptep)) {
+   rmap_remove(vcpu-kvm, sptep);
+   if (is_large_pte(*sptep))
+   --vcpu-kvm-stat.lpages;
+   need_flush = 1;
+   }
+   __set_spte(sptep, shadow_trap_nonpresent_pte);
+   break;
+   }
+
+   if (!is_shadow_present_pte(*sptep))
+   break;
+   }
+
+   if (need_flush)
+   kvm_flush_remote_tlbs(vcpu-kvm);
+
+   atomic_inc(vcpu-kvm-arch.invlpg_counter);
+
+   spin_unlock(vcpu-kvm-mmu_lock);
+
+   if (pte_gpa == -1)
+   return;
+
+   if (mmu_topup_memory_caches(vcpu))
+   return;
+   kvm_mmu_pte_write(vcpu, pte_gpa, NULL, pte_size, 0);
+}
+
   


I think we should keep it in - kvm_mmu_pte_write() calls back to 
FNAME(update_pte), we could make the call directly from here speed 
things up, since we already have the spte and don't need to look it up.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Yoshiaki Tamura
2010/4/23 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp:
 (2010/04/23 19:28), Avi Kivity wrote:


 OK, I will do in the next version. In this RFC, I would be happy if I can
 know the overall design is right or not.


 Everything looks reasonable to me.


 Thank you!

 Do you have performance numbers? I'm interested in both measurements of
 KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for
 example, total guest throughput improvement under Kemari).


 Currently, I'm just checking the performance visually.
  - on laptops, we can feel the speed directly: my favorite is installing
    ubuntu or debian by alternate installers

  - live migration with heavy work load

 Now that I've got the overall design, I want to measure the performance by
 numbers.


 About performance under Kemari: we(oss.ntt.co.jp staffs: me and Fernando)
 are
 now concentrating on improving the basic live-migration infrastructures and
 they(lab.ntt.co.jp staffs) are working hard for building Kemari itself.

  - We are also interested in using live-migration with HA software and this
 needs
    light, stable live-migration: same as Kemari!


 So measuring Kemari's performance with our patch-set will need some more
 time,
 ** How about Tamura-san? **

You can measure your patch quite easy.

1. Run a simple program that malloc huge size (2GB for example), then
go into infinite loop that touches each in a really bad manner, like
only touch odd page numbers.
2. Start live migration which usually won't finish. Similar with Kemari.
3. Measure the response time of your ioctl().

I used 1 and 2 to measure dirty physmap speed up in QEMU.

Thanks,

Yoshi

 So at this stage, I'll show you performance improvement(I hope to improve)
 by
 another test cases.

 If possible next week!

 Thanks,
  Takuya

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:13 AM, Xiao Guangrong wrote:

If have new mapping to the unsync page(i.e, add a new parent), just
update the page from sp-gfn but not write-protect gfn, and if need
create new shadow page form sp-gfn, we should sync it
   


Sorry, I don't understand this patch.  Can you clarify?  for example, 
the situation before adding the new parent, what the guest action was, 
and the new situation.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Avi Kivity

On 04/23/2010 02:27 PM, Avi Kivity wrote:

On 04/22/2010 09:12 AM, Xiao Guangrong wrote:

Using '!sp-role.cr4_pae' replaces 'PTTYPE == 32' and using
'pte_size = sp-role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t)

Then no need compile twice for this code

I think we should keep it in - kvm_mmu_pte_write() calls back to 
FNAME(update_pte), we could make the call directly from here speed 
things up, since we already have the spte and don't need to look it up.




I see you do this in patches 9, 10 - but is it possible to use 
update_pte directly?  I think we'll need to make 
guess_page_from_pte_write() part of paging_tmpl.h (in general anything 
that depends on pte size is better off in paging_tmpl.h).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Avi Kivity

On 04/23/2010 02:14 PM, Takuya Yoshikawa wrote:



Do you have performance numbers? I'm interested in both measurements of
KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for
example, total guest throughput improvement under Kemari).



Currently, I'm just checking the performance visually.
  - on laptops, we can feel the speed directly: my favorite is installing
ubuntu or debian by alternate installers

  - live migration with heavy work load

Now that I've got the overall design, I want to measure the 
performance by numbers.



About performance under Kemari: we(oss.ntt.co.jp staffs: me and 
Fernando) are
now concentrating on improving the basic live-migration 
infrastructures and

they(lab.ntt.co.jp staffs) are working hard for building Kemari itself.

  - We are also interested in using live-migration with HA software 
and this needs

light, stable live-migration: same as Kemari!



General live migration improvements are also interesting, I just the 
improvement would be more pronounced under Kemari.  Looking forward to 
seeing the results.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 01:20 PM, Alexander Graf wrote:



I would say the reason is that if we did not convert the user-space pointer to
a void * kvm_get_dirty_log() would end up copying the dirty log to

(log-dirty_bitmap  32) | 0x
 

Well yes, that was the problem. If we always set the __u64 value to the pointer 
we're safe though.

union {
   void *p;
   __u64 q;
}

void x(void *r)
{
   // breaks:
   p = r;

   // works:
   q = (ulong)r;
}
   


In that case it's better to avoid p altogether, since users will 
naturally assign to the pointer.


Using a 64-bit integer avoids the problem (though perhaps not sufficient 
for s390, Arnd?)


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/22/2010 12:34 PM, Takuya Yoshikawa wrote:

Thanks, I can know basic rules about kvm/api .

(2010/04/21 20:46), Avi Kivity wrote:




+Type: vm ioctl
+Parameters: struct kvm_dirty_log (in/out)
+Returns: 0 on success, -1 on error
+
+/* for KVM_SWITCH_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;


Please put a flags field here (and verify it is zero in the ioctl
implementation). This allows us to extend it later.


+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 addr;
+ };


Just make dirty_bitmap a __u64. With the union there is the risk that
someone forgets to zero the structure so we get some random bits in the
pointer, and also issues with big endian and s390 compat.

Reserve some extra space here for future expansion.

Hm. I see that this is the existing kvm_dirty_log structure. So we can't
play with it, ignore my comments about it.


So, introducing a new structure to export(and get) two bitmap addresses
with a flag is the thing?

1. Qemu calls ioctl to get the two addresses.
2. Qemu calls ioctl to make KVM switch the dirty bitmaps.
   -- in this case, qemu have to change the address got in step 1.
   ...
3. Qemu calls ioctl(we can reuse 1. with different command flag) to 
free the bitmaps.



In my plan, we don't let qemu malloc() dirty bitmaps: at least, I want 
to make that
another patch set because this patch set already has some dependencies 
to other issues.


But, yes, I can make the struct considering the future expansion if it 
needed.


I guess it's better, since this patch is a future expansion of 
KVN_GET_DIRTY_LOG.  If we had a flags field and some room there, we 
could keep the old ioctl.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-23 Thread Xiao Guangrong


Avi Kivity wrote:
 On 04/22/2010 09:12 AM, Xiao Guangrong wrote:
 If the guest is 32-bit, we should use 'quadrant' to adjust gpa
 offset


 
 Good catch.  Only affects kvm_mmu_pte_write(), so I don't think this had
 ill effects other than not prefetching the correct address?
 

Yes

 @@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
 gva_t gva)
   ((level == PT_DIRECTORY_LEVEL  is_large_pte(*sptep))) ||
   ((level == PT_PDPE_LEVEL  is_large_pte(*sptep {
   struct kvm_mmu_page *sp = page_header(__pa(sptep));
 +int offset = 0;
 +
 +if (PTTYPE == 32)
 +offset = sp-role.quadrant  PT64_LEVEL_BITS;;

 
 Wrong for PT_DIRECTORY_LEVEL (should be q  8).  Also, too many
 semicolons.
 

I guess you mean 'PT64_LEVEL_BITS' not 'PT_DIRECTORY_LEVEL' here :-)

It should be q  8 here? it hardly understand, take leve = 1 for example,
32-bit guest PTE page table mapping range is 2^(10+12), PAE's PTE page table
mapping range is 2^(9+12), so, i think it's quadrant  9 here, and other
function like FNAME(prefetch_page), FNAME(sync_page) also are q  9

Sorry for the double semicolons here, will fix it

Thanks,
Xiao
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/10] KVM MMU: allow more page become unsync at getting sp time

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:13 AM, Xiao Guangrong wrote:

Allow more page become asynchronous at getting sp time, if need create new
shadow page for gfn but it not allow unsync(level  0), we should unsync all
gfn's unsync page
   


This is something I wanted for a long time.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:05 PM, Xiao Guangrong wrote:




@@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
gva_t gva)
   ((level == PT_DIRECTORY_LEVEL   is_large_pte(*sptep))) ||
   ((level == PT_PDPE_LEVEL   is_large_pte(*sptep {
   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+int offset = 0;
+
+if (PTTYPE == 32)
+offset = sp-role.quadrant   PT64_LEVEL_BITS;;

   

Wrong for PT_DIRECTORY_LEVEL (should be q  8).  Also, too many
semicolons.

 

I guess you mean 'PT64_LEVEL_BITS' not 'PT_DIRECTORY_LEVEL' here :-)
   


No, I mean if level == PT_DIRECTORY_LEVEL, then we want role.quadrant  
8, not 9.



It should be q  8 here? it hardly understand, take leve = 1 for example,
32-bit guest PTE page table mapping range is 2^(10+12), PAE's PTE page table
mapping range is 2^(9+12),


For level == PT_DIRECTORY_LEVEL, quadrant is in the range 0..3.  Each sp 
maps 1GB, while the guest page table maps 4GB.  So the upper two bits 
become the quadrant.



  so, i think it's quadrant  9 here, and other
function like FNAME(prefetch_page), FNAME(sync_page) also are q  9
   


They only work for PT_PAGE_TABLE_LEVEL, so for them 9 is correct.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 13:57, Avi Kivity wrote:

 On 04/23/2010 01:20 PM, Alexander Graf wrote:
 
 I would say the reason is that if we did not convert the user-space pointer 
 to
 a void * kvm_get_dirty_log() would end up copying the dirty log to
 
 (log-dirty_bitmap  32) | 0x
 
 Well yes, that was the problem. If we always set the __u64 value to the 
 pointer we're safe though.
 
 union {
   void *p;
   __u64 q;
 }
 
 void x(void *r)
 {
   // breaks:
   p = r;
 
   // works:
   q = (ulong)r;
 }
   
 
 In that case it's better to avoid p altogether, since users will naturally 
 assign to the pointer.
 
 Using a 64-bit integer avoids the problem (though perhaps not sufficient for 
 s390, Arnd?)

We only support 64 bit on S390, so we should be safe here. Even if not, compat 
mode has 31 bits pointers, so padding them to 64 bit should work too.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Arnd Bergmann
On Friday 23 April 2010, Avi Kivity wrote:
 On 04/23/2010 01:20 PM, Alexander Graf wrote:
 
  I would say the reason is that if we did not convert the user-space 
  pointer to
  a void * kvm_get_dirty_log() would end up copying the dirty log to
 
  (log-dirty_bitmap  32) | 0x
   
  Well yes, that was the problem. If we always set the __u64 value to the 
  pointer we're safe though.
 
  union {
 void *p;
 __u64 q;
  }
 
  void x(void *r)
  {
 // breaks:
 p = r;
 
 // works:
 q = (ulong)r;
  }
 
 
 In that case it's better to avoid p altogether, since users will 
 naturally assign to the pointer.

Right.
 
 Using a 64-bit integer avoids the problem (though perhaps not sufficient 
 for s390, Arnd?)

When there is only a __u64 for the address, it will work on s390 as well,
gcc is smart enough to clear the upper bit on a cast from long to pointer.

The simple rule is to never put any 'long' or pointer into data structures
that you pass to an ioctl, and to add padding to multiples of 64 bit to
align the data structure for the x86 alignment problem.

Arnd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:27 PM, Arnd Bergmann wrote:



Using a 64-bit integer avoids the problem (though perhaps not sufficient
for s390, Arnd?)
 

When there is only a __u64 for the address, it will work on s390 as well,
gcc is smart enough to clear the upper bit on a cast from long to pointer.
   


Ah, much more convenient than compat_ioctl.  I assume it part of the 
ABI, not a gccism?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Arnd Bergmann
On Friday 23 April 2010, Avi Kivity wrote:
 On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
 
  Using a 64-bit integer avoids the problem (though perhaps not sufficient
  for s390, Arnd?)
   
  When there is only a __u64 for the address, it will work on s390 as well,
  gcc is smart enough to clear the upper bit on a cast from long to pointer.
 
 
 Ah, much more convenient than compat_ioctl.  I assume it part of the 
 ABI, not a gccism?

I don't think it's part of the ABI, but it's required to guarantee
that code like this works:

int compare_pointer(void *a, void *b)
{
unsigned long ai = (unsigned long)a, bi = (unsigned long)b;

return ai == bi; /* true if a and b point to the same object */
}

We certainly rely on this already.

Arnd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:46 PM, Arnd Bergmann wrote:

On Friday 23 April 2010, Avi Kivity wrote:
   

On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
 
   

Using a 64-bit integer avoids the problem (though perhaps not sufficient
for s390, Arnd?)

 

When there is only a __u64 for the address, it will work on s390 as well,
gcc is smart enough to clear the upper bit on a cast from long to pointer.

   

Ah, much more convenient than compat_ioctl.  I assume it part of the
ABI, not a gccism?
 

I don't think it's part of the ABI, but it's required to guarantee
that code like this works:

int compare_pointer(void *a, void *b)
{
unsigned long ai = (unsigned long)a, bi = (unsigned long)b;

return ai == bi; /* true if a and b point to the same object */
}

We certainly rely on this already.
   


Ah so the 31st bit is optional as far as userspace is concerned?  What 
does it mean? (just curious)


What happens on the opposite conversion?  is it restored?

What about

int compare_pointer(void *a, void *b)
{
unsigned long ai = (unsigned long)a;
void *aia = (void *)ai;

return a == b; /* true if a and b point to the same object */
}


Does gcc mask the big in pointer comparisons as well?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 14:53, Avi Kivity wrote:

 On 04/23/2010 03:46 PM, Arnd Bergmann wrote:
 On Friday 23 April 2010, Avi Kivity wrote:
   
 On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
 
   
 Using a 64-bit integer avoids the problem (though perhaps not sufficient
 for s390, Arnd?)
 
 
 When there is only a __u64 for the address, it will work on s390 as well,
 gcc is smart enough to clear the upper bit on a cast from long to pointer.
 
   
 Ah, much more convenient than compat_ioctl.  I assume it part of the
 ABI, not a gccism?
 
 I don't think it's part of the ABI, but it's required to guarantee
 that code like this works:
 
 int compare_pointer(void *a, void *b)
 {
  unsigned long ai = (unsigned long)a, bi = (unsigned long)b;
 
  return ai == bi; /* true if a and b point to the same object */
 }
 
 We certainly rely on this already.
   
 
 Ah so the 31st bit is optional as far as userspace is concerned?  What does 
 it mean? (just curious)

The 0x8000 bit declares that a pointer is in 24-bit mode, so that 
applications can use the spare upper bits for random data.

See http://en.wikipedia.org/wiki/31-bit for an explanation.

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-23 Thread Chris Lalancette
On 04/23/2010 07:05 AM, Avi Kivity wrote:
 On 04/22/2010 10:55 PM, Gleb Natapov wrote:


 What about converting PIC/IOAPIC mutexes into spinlocks?

 Works for me, but on large guests the spinning will be noticeable.
 I believe.
  
 For interrupts going through IOPIC, but we know this is not scalable
 anyway.

 
 Yes.  We also wanted to convert the ioapic/pic to spinlocks so we could
 queue the interrupt from the PIT directly instead of using
 KVM_REQ_PENDING_TIMER which keeps confusing me.  Chris Lalancette posted
 a patchset for this a while back but it was never completed.

Yeah, I'm sorry I never completed it.  It turns out that with the HPET
changes that went in around the time I was looking at it, that set of
patches wasn't really required to fix the problem I was seeing with kdump.

That being said, if it's useful to somebody, I can repost the patches
(though they are woefully out-of-date now).  Let me know if you want
to see them again.

-- 
Chris Lalancette
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Anthony Liguori

On 04/22/2010 07:45 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:


I think it would make sense to separate out the things that are actually
optimizations (like the dirty bitmap changes and the writev/readv
changes) and to attempt to justify them with actual performance data.


I agree with the separation plan.

For dirty bitmap change, Avi and I discussed on patchset for upsream 
QEMU while you were offline (Sorry, if I was wrong).  Could you also 
take a look?


Yes, I've seen it and I don't disagree.  That said, there ought to be 
perf data in the commit log so that down the road, the justification is 
understood.



http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed with actual data, 
otherwise it should be removed.  We attemped to do everything that may 
reduce the overhead of the transaction.



I'd prefer not to modify the live migration protocol ABI and it doesn't
seem to be necessary if we're willing to add options to the -incoming
flag. We also want to be a bit more generic with respect to IO.


I totally agree with your approach not to change the protocol ABI.  
Can we add an option to -incoming?  Like, -incoming ft_mode, for example

Regarding the IO, let me reply to the next message.


Otherwise, the series looks very close to being mergable.


Thank you for your comment on each patch.

To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU 
and struggled for how to implement in an acceptable way.


The series looks very good.  I'm eager to see this functionality merged.

Regards,

Anthony Liguori


Thanks,

Yoshi



Regards,

Anthony Liguori







--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Arnd Bergmann
On Friday 23 April 2010, Avi Kivity wrote:
 Ah so the 31st bit is optional as far as userspace is concerned?  What 
 does it mean? (just curious)

On data pointers it's ignored. When you branch to a function, this bit
determines whether the target function is run in 24 or 31 bit mode.
This allows linking to legacy code on older operating systems that
also support 24 bit libraries.

 What happens on the opposite conversion?  is it restored?
 
 What about
 
 int compare_pointer(void *a, void *b)
 {
 unsigned long ai = (unsigned long)a;
 void *aia = (void *)ai;
 
 return a == b; /* true if a and b point to the same object */
 }

Some instructions set the bit, others clear it, so aia and a may not
be bitwise identical.

 Does gcc mask the big in pointer comparisons as well?

Yes. To stay in the above example:

a == aia;   /* true */
(unsigned long)a == (unsigned long)aia; /* true */
*(unsigned long *)a == *(unsigned long *)aia; /* undefined on s390 */

Arnd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Avi Kivity

On 04/23/2010 12:59 PM, Yoshiaki Tamura wrote:

Avi Kivity wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB. It would be useful if it could
be flexible.



Why is this needed? The real buffering is in the kernel anyways; this is
only used to reduce the number of write() syscalls.


This was introduced to buffer the transfered guests image transaction 
ally on the receiver side.  The sender doesn't use it.

In case of intermediate state, we just discard this buffer.


How large can it grow?

What's wrong with applying it (perhaps partially) to the guest state?  
The next state transfer will overwrite it completely, no?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Anthony Liguori

On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:

On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laordl...@redhat.com:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and 
we're

sending
this message to share what we have now and TODO lists. 
Hopefully, we

would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The 
current

code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control
logic is
implemented in QEMU. However, we needed a hack in KVM to prevent 
rip

from
proceeding before synchronizing VMs. It may also need some
plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.

[ snap]

The rest of this message describes TODO lists grouped by each 
topic.


=== event tapping ===

Event tapping is the core component of Kemari, and it decides on
which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true
for
disk I/O
and reliable network protocols such as TCP.

IMO any type of network even should be stalled too. What if the VM
runs
non tcp protocol and the packet that the master node sent reached 
some

remote client and before the sync to the slave the master failed?
In current implementation, it is actually stalling any type of 
network

that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovering mechanism, or it should be completely
stateless.

Why do you treat tcp differently? You can damage the entire VM this
way -
think of dhcp request that was dropped on the moment you switched
between
the master and the slave?

I'm not trying to say that we should treat tcp differently, but just
it's severe.
In case of dhcp request, the client would have a chance to retry after
failover, correct?
BTW, in current implementation,


I'm slightly confused about the current implementation vs. my
recollection of the original paper with Xen. I had thought that all disk
and network I/O was buffered in such a way that at each checkpoint, the
I/O operations would be released in a burst. Otherwise, you would have
to synchronize after every I/O operation which is what it seems the
current implementation does.


Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.


If NodeA is the master and NodeB is the slave, if NodeA sends a network 
packet, you'll checkpoint before the packet is actually sent, and then 
if a failure occurs before the next checkpoint, won't that result in 
both NodeA and NodeB sending out a duplicate version of the packet?


Regards,

Anthony Liguori

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:59 PM, Alexander Graf wrote:



Ah so the 31st bit is optional as far as userspace is concerned?  What does it 
mean? (just curious)
 

The 0x8000 bit declares that a pointer is in 24-bit mode, so that 
applications can use the spare upper bits for random data.

See http://en.wikipedia.org/wiki/31-bit for an explanation.
   


Interesting.  Luckily AMD made the top 16 bits of pointers reserved in 
x86-64.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-23 Thread Anthony Liguori

On 04/22/2010 10:37 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

QEMUFile currently doesn't support writev(). For sending multiple
data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp


Is there performance data that backs this up? Since QEMUFile uses a
linear buffer for most operations that's limited to 16k, I suspect you
wouldn't be able to observe a difference in practice.


I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
2. Calling write() everytime even when we want to send multiple pages 
at once.


I think 2 may be neglectable.
But 1 seems to be problematic if we want make to the latency as small 
as possible, no?


Copying often has strange CPU characteristics depending on whether the 
data is already in cache.  It's better to drive these sort of 
optimizations through performance measurement because changes are not 
always obvious.


Regards,

Anthony Liguori

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.

2010-04-23 Thread Anthony Liguori

On 04/22/2010 11:02 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

For fool proof purpose, qemu_put_vector_parepare should be called
before qemu_put_vector. Then, if qemu_put_* functions except this is
called after qemu_put_vector_prepare, program will abort().

Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp


I don't get it. What's this protecting against?


This was introduced to prevent mixing the order of normal write and 
vector write, and flush QEMUFile buffer before handling vectors.
While qemu_put_buffer copies data to QEMUFile buffer, 
qemu_put_vector() will bypass that buffer.


It's just fool proof purpose for what we encountered at beginning, and 
if the user of qemu_put_vector() is careful enough, we can remove 
qemu_put_vectore_prepare().  While writing this message, I started to 
think that just calling qemu_fflush() in qemu_put_vector() would be 
enough...


I definitely think removing the vector stuff in the first version would 
simplify the process of getting everything merged.  I'd prefer not to 
have two apis so if vector operations were important from a performance 
perspective, I'd want to see everything converted to a vector API.


Regards,

Anthony Liguori

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Avi Kivity

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch VCPU state is already proceeded before
synchronization, and after failover to the VM on the receiver, it
hangs because of this.
   


We discussed moving the barrier to the actual output device, instead of 
the I/O port.  This allows you to complete the I/O transaction before 
starting synchronization.


Does it not work for some reason?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Xiao Guangrong


Avi Kivity wrote:
 On 04/23/2010 02:27 PM, Avi Kivity wrote:
 On 04/22/2010 09:12 AM, Xiao Guangrong wrote:
 Using '!sp-role.cr4_pae' replaces 'PTTYPE == 32' and using
 'pte_size = sp-role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t)

 Then no need compile twice for this code

 I think we should keep it in - kvm_mmu_pte_write() calls back to
 FNAME(update_pte), we could make the call directly from here speed
 things up, since we already have the spte and don't need to look it up.

 
 I see you do this in patches 9, 10 - but is it possible to use
 update_pte directly?  I think we'll need to make
 guess_page_from_pte_write() part of paging_tmpl.h (in general anything
 that depends on pte size is better off in paging_tmpl.h).
 

OK, i'll keep invlpg code in paging_tmpl.h and directly call FNAME(update_pte).

But, i don't see mmu_guess_page_from_pte_write() code depends on pte size. :-(

Xiao
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Anthony Liguori

On 04/23/2010 04:53 AM, Avi Kivity wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.



Why is this needed?  The real buffering is in the kernel anyways; this 
is only used to reduce the number of write() syscalls.


With vmstate, we really shouldn't need to do this magic anymore as an 
optimization.


Regards,

Anthony Liguori

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:21 PM, Xiao Guangrong wrote:


OK, i'll keep invlpg code in paging_tmpl.h and directly call FNAME(update_pte).

But, i don't see mmu_guess_page_from_pte_write() code depends on pte size. :-(
   


It doesn't indeed, I misremembered.  It's mmu_pte_write_new_pte() (which 
is no longer needed).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:02 PM, Chris Lalancette wrote:

On 04/23/2010 07:05 AM, Avi Kivity wrote:
   

On 04/22/2010 10:55 PM, Gleb Natapov wrote:
 


   

What about converting PIC/IOAPIC mutexes into spinlocks?

   

Works for me, but on large guests the spinning will be noticeable.
I believe.

 

For interrupts going through IOPIC, but we know this is not scalable
anyway.

   

Yes.  We also wanted to convert the ioapic/pic to spinlocks so we could
queue the interrupt from the PIT directly instead of using
KVM_REQ_PENDING_TIMER which keeps confusing me.  Chris Lalancette posted
a patchset for this a while back but it was never completed.
 

Yeah, I'm sorry I never completed it.  It turns out that with the HPET
changes that went in around the time I was looking at it, that set of
patches wasn't really required to fix the problem I was seeing with kdump.

That being said, if it's useful to somebody, I can repost the patches
(though they are woefully out-of-date now).  Let me know if you want
to see them again.
   


Let's see if one of the alternatives works out.  I prefer to keep the 
critical sections short.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] More fixes for nested svm

2010-04-23 Thread Alexander Graf
Hey Joerg,

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 Hi Avi, Marcelo,
 
 here is another set of nested svm fixes. They fix NMI code to make UP Hyper-V
 root domain booting. The patches also add better handling for nested entry
 failures and mce intercepts.
 Also in this patchset are the fixes for the supported cpuid reporting for svm
 features. These patches were taken from the nested-npt patchset and slightly
 modified. These patches are also marked for -stable backporting.
 The probably most important fix is about exception reinjection. This didn't
 work reliably before and is fixed with the patch in this series now. This fix
 also touches common x86 code but that should be ok because it could be reused
 by nested-vmx later.
 Please review and give comments (or apply ;-).


Nice work. Please remember to keep me CC'ed when posting nested SVM patches.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 The patch introducing nested nmi handling had a bug. The
 check does not belong to enable_nmi_window but must be in
 nmi_allowed. This patch fixes this.
 
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
 arch/x86/kvm/svm.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index ab78eb8..ec20584 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
   struct vcpu_svm *svm = to_svm(vcpu);
   struct vmcb *vmcb = svm-vmcb;
 - return !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
 - !(svm-vcpu.arch.hflags  HF_NMI_MASK);
 + int ret;
 + ret = !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
 +   !(svm-vcpu.arch.hflags  HF_NMI_MASK);
 + ret = ret  gif_set(svm)  nested_svm_nmi(svm);
 +
 + return ret;
 }
 
 static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
 @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
* Something prevents NMI from been injected. Single step over possible
* problem (IRET or exception injection or interrupt shadow)
*/
 - if (gif_set(svm)  nested_svm_nmi(svm)) {
 - svm-nmi_singlestep = true;
 - svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 - update_db_intercept(vcpu);
 - }
 + svm-nmi_singlestep = true;
 + svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 + update_db_intercept(vcpu);

So we're always messing with the nested guest state when the host wants to 
inject an nmi into the l1 guest? Is that safe?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] More fixes for nested svm

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 13:04, Avi Kivity wrote:

 On 04/22/2010 01:33 PM, Joerg Roedel wrote:
 Hi Avi, Marcelo,
 
 here is another set of nested svm fixes. They fix NMI code to make UP Hyper-V
 root domain booting. The patches also add better handling for nested entry
 failures and mce intercepts.
 Also in this patchset are the fixes for the supported cpuid reporting for svm
 features. These patches were taken from the nested-npt patchset and slightly
 modified. These patches are also marked for -stable backporting.
 The probably most important fix is about exception reinjection. This didn't
 work reliably before and is fixed with the patch in this series now. This fix
 also touches common x86 code but that should be ok because it could be reused
 by nested-vmx later.
 Please review and give comments (or apply ;-).
   
 
 Looks good.  I'll wait a day or two for more reviews (esp. Alex).


Looking over it, but can't promise I'll be able to fully go through the 
feedback. Flying off for 2 weeks of vacation tomorrow morning.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:22 PM, Anthony Liguori wrote:

I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
2. Calling write() everytime even when we want to send multiple pages 
at once.


I think 2 may be neglectable.
But 1 seems to be problematic if we want make to the latency as small 
as possible, no?



Copying often has strange CPU characteristics depending on whether the 
data is already in cache.  It's better to drive these sort of 
optimizations through performance measurement because changes are not 
always obvious.


Copying always introduces more cache pollution, so even if the data is 
in the cache, it is worthwhile (not disagreeing with the need to measure).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 This patch syncs cr0 and cr3 from the vmcb to the kvm state
 before nested intercept handling is done. This allows to
 simplify the vmexit path.
 
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
 arch/x86/kvm/svm.c |   15 ++-
 1 files changed, 6 insertions(+), 9 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index c480d7f..5ad9d80 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
   nested_vmcb-save.gdtr   = vmcb-save.gdtr;
   nested_vmcb-save.idtr   = vmcb-save.idtr;
   nested_vmcb-save.cr0= kvm_read_cr0(svm-vcpu);
 - if (npt_enabled)
 - nested_vmcb-save.cr3= vmcb-save.cr3;
 - else
 - nested_vmcb-save.cr3= svm-vcpu.arch.cr3;
 + nested_vmcb-save.cr3= svm-vcpu.arch.cr3;


Why don't we need this anymore again?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] KVM: SVM: Propagate nested entry failure into guest hypervisor

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 This patch implements propagation of a failes guest vmrun
 back into the guest instead of killing the whole guest.

Oh, nice one.

Ack.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 This patch adds the get_supported_cpuid callback to
 kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
 decission about some supported cpuid bits to the
 architecture modules.
 
 Cc: sta...@kernel.org

Please don't CC stable. There is a KVM stable tag now, so every stable 
submission goes through Avi/Marcelo.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 This patch implements the reporting of the emulated SVM
 features to userspace instead of the real hardware
 capabilities. Every real hardware capability needs emulation
 in nested svm so the old behavior was broken.
 
 Cc: sta...@kernel.org

Again, please don't CC stable directly.

 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
 arch/x86/kvm/svm.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 0fa2035..65fc114 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -3154,6 +3154,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 
 static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
 + switch (func) {
 + case 0x800A:
 + entry-eax = 1; /* SVM revision 1 */
 + entry-ebx = 8; /* Lets support 8 ASIDs in case we add proper
 +ASID emulation to nested SVM */

I completely forgot what we do now. What do we do? It shouldn't be too hard to 
keep a table around and just assign 8 host ASIDs to the guest. If possible 
lazily, so we can just flush the whole thing when we run out of entries.

It's basically the same as my VSID mapping on ppc64 actually. See 
arch/powerpc/kvm/book3s_64_mmu_host.c and search for slb.

 + entry-ecx = 0; /* Reserved */
 + entry-edx = 0; /* Do not support any additional features */

What about nnpt? Hasn't that been accepted yet?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 This patch adds logic to kvm/x86 which allows to mark an
 injected exception as reinjected. This allows to remove an
 ugly hack from svm_complete_interrupts that prevented
 exceptions from being reinjected at all in the nested case.
 The hack was necessary because an reinjected exception into
 the nested guest could cause a nested vmexit emulation. But
 reinjected exceptions must not intercept. The downside of
 the hack is that a exception that in injected could get
 lost.
 This patch fixes the problem and puts the code for it into
 generic x86 files because. Nested-VMX will likely have the
 same problem and could reuse the code.

So we always handle the reinjection from KVM code? Shouldn't the l1 hypervisor 
do this?

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

 This patch prevents MCE intercepts from being propagated
 into the L1 guest if they happened in an L2 guest.

Good catch. How did you stumble over this?

Ack.

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:52 PM, Alexander Graf wrote:

On 22.04.2010, at 12:33, Joerg Roedel wrote:

   

This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decission about some supported cpuid bits to the
architecture modules.

Cc: sta...@kernel.org
 

Please don't CC stable. There is a KVM stable tag now, so every stable 
submission goes through Avi/Marcelo.


   


It's not a problem.  stable@ will ignore kvm patches not coming from the 
maintainers, and we will recognize cc: stable as a stable request.


(the new approach is to collect the patches in kvm-updates/2.6.x, 
autotest, and forward to stable@).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
  The patch introducing nested nmi handling had a bug. The
  check does not belong to enable_nmi_window but must be in
  nmi_allowed. This patch fixes this.
  
  Signed-off-by: Joerg Roedel joerg.roe...@amd.com
  ---
  arch/x86/kvm/svm.c |   16 +---
  1 files changed, 9 insertions(+), 7 deletions(-)
  
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index ab78eb8..ec20584 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
  {
  struct vcpu_svm *svm = to_svm(vcpu);
  struct vmcb *vmcb = svm-vmcb;
  -   return !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
  -   !(svm-vcpu.arch.hflags  HF_NMI_MASK);
  +   int ret;
  +   ret = !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
  + !(svm-vcpu.arch.hflags  HF_NMI_MASK);
  +   ret = ret  gif_set(svm)  nested_svm_nmi(svm);
  +
  +   return ret;
  }
  
  static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
  @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
   * Something prevents NMI from been injected. Single step over possible
   * problem (IRET or exception injection or interrupt shadow)
   */
  -   if (gif_set(svm)  nested_svm_nmi(svm)) {
  -   svm-nmi_singlestep = true;
  -   svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
  -   update_db_intercept(vcpu);
  -   }
  +   svm-nmi_singlestep = true;
  +   svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
  +   update_db_intercept(vcpu);
 
 So we're always messing with the nested guest state when the host
 wants to inject an nmi into the l1 guest? Is that safe?

Why not? We can't inject an NMI directly into L2 if the nested
hypervisor intercepts it.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:50:20PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
  This patch syncs cr0 and cr3 from the vmcb to the kvm state
  before nested intercept handling is done. This allows to
  simplify the vmexit path.
  
  Signed-off-by: Joerg Roedel joerg.roe...@amd.com
  ---
  arch/x86/kvm/svm.c |   15 ++-
  1 files changed, 6 insertions(+), 9 deletions(-)
  
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index c480d7f..5ad9d80 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
  nested_vmcb-save.gdtr   = vmcb-save.gdtr;
  nested_vmcb-save.idtr   = vmcb-save.idtr;
  nested_vmcb-save.cr0= kvm_read_cr0(svm-vcpu);
  -   if (npt_enabled)
  -   nested_vmcb-save.cr3= vmcb-save.cr3;
  -   else
  -   nested_vmcb-save.cr3= svm-vcpu.arch.cr3;
  +   nested_vmcb-save.cr3= svm-vcpu.arch.cr3;
 
 
 Why don't we need this anymore again?

Because arch.cr3 contains always the current l2-cr3. This wasn't the
case before because the sync was done after the nested handling. But now
we do it before a vmexit can happen and are safe to just use arch.cr3.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:13, Joerg Roedel wrote:

 On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
 The patch introducing nested nmi handling had a bug. The
 check does not belong to enable_nmi_window but must be in
 nmi_allowed. This patch fixes this.
 
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
 arch/x86/kvm/svm.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index ab78eb8..ec20584 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
 struct vcpu_svm *svm = to_svm(vcpu);
 struct vmcb *vmcb = svm-vmcb;
 -   return !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
 -   !(svm-vcpu.arch.hflags  HF_NMI_MASK);
 +   int ret;
 +   ret = !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
 + !(svm-vcpu.arch.hflags  HF_NMI_MASK);
 +   ret = ret  gif_set(svm)  nested_svm_nmi(svm);
 +
 +   return ret;
 }
 
 static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
 @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
  * Something prevents NMI from been injected. Single step over possible
  * problem (IRET or exception injection or interrupt shadow)
  */
 -   if (gif_set(svm)  nested_svm_nmi(svm)) {
 -   svm-nmi_singlestep = true;
 -   svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 -   update_db_intercept(vcpu);
 -   }
 +   svm-nmi_singlestep = true;
 +   svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 +   update_db_intercept(vcpu);
 
 So we're always messing with the nested guest state when the host
 wants to inject an nmi into the l1 guest? Is that safe?
 
 Why not? We can't inject an NMI directly into L2 if the nested
 hypervisor intercepts it.

So where did the code go that does the #vmexit in case the nested hypervisor 
does intercept it? It used to be nested_svm_nmi(), right?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:17, Joerg Roedel wrote:

 On Fri, Apr 23, 2010 at 03:50:20PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
 This patch syncs cr0 and cr3 from the vmcb to the kvm state
 before nested intercept handling is done. This allows to
 simplify the vmexit path.
 
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
 arch/x86/kvm/svm.c |   15 ++-
 1 files changed, 6 insertions(+), 9 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index c480d7f..5ad9d80 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
 nested_vmcb-save.gdtr   = vmcb-save.gdtr;
 nested_vmcb-save.idtr   = vmcb-save.idtr;
 nested_vmcb-save.cr0= kvm_read_cr0(svm-vcpu);
 -   if (npt_enabled)
 -   nested_vmcb-save.cr3= vmcb-save.cr3;
 -   else
 -   nested_vmcb-save.cr3= svm-vcpu.arch.cr3;
 +   nested_vmcb-save.cr3= svm-vcpu.arch.cr3;
 
 
 Why don't we need this anymore again?
 
 Because arch.cr3 contains always the current l2-cr3. This wasn't the
 case before because the sync was done after the nested handling. But now
 we do it before a vmexit can happen and are safe to just use arch.cr3.

I see.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:55:15PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
  static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 
  *entry)
  {
  +   switch (func) {
  +   case 0x800A:
  +   entry-eax = 1; /* SVM revision 1 */
  +   entry-ebx = 8; /* Lets support 8 ASIDs in case we add proper
  +  ASID emulation to nested SVM */
 
 I completely forgot what we do now. What do we do? It shouldn't be too
 hard to keep a table around and just assign 8 host ASIDs to the guest.
 If possible lazily, so we can just flush the whole thing when we run
 out of entries.

Currently we have no ASID emulation. We just assign a new asid at every
vmrun/vmexit. But I have a rough idea of how we can emulate ASIDs for
the L2 guest with an per-vcpu ASID cache.

 It's basically the same as my VSID mapping on ppc64 actually. See
 arch/powerpc/kvm/book3s_64_mmu_host.c and search for slb.

Thanks, will have a look there.

  +   entry-ecx = 0; /* Reserved */
  +   entry-edx = 0; /* Do not support any additional features */
 
 What about nnpt? Hasn't that been accepted yet?

I am currently working on it. If all goes well I can submit a new
version next week.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 04:19:40PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:13, Joerg Roedel wrote:
 
  On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
  
  On 22.04.2010, at 12:33, Joerg Roedel wrote:
  
  The patch introducing nested nmi handling had a bug. The
  check does not belong to enable_nmi_window but must be in
  nmi_allowed. This patch fixes this.
  
  Signed-off-by: Joerg Roedel joerg.roe...@amd.com
  ---
  arch/x86/kvm/svm.c |   16 +---
  1 files changed, 9 insertions(+), 7 deletions(-)
  
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index ab78eb8..ec20584 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
  {
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb *vmcb = svm-vmcb;
  - return !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
  - !(svm-vcpu.arch.hflags  HF_NMI_MASK);
  + int ret;
  + ret = !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
  +   !(svm-vcpu.arch.hflags  HF_NMI_MASK);
  + ret = ret  gif_set(svm)  nested_svm_nmi(svm);
  +
  + return ret;
  }
  
  static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
  @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu 
  *vcpu)
 * Something prevents NMI from been injected. Single step over possible
 * problem (IRET or exception injection or interrupt shadow)
 */
  - if (gif_set(svm)  nested_svm_nmi(svm)) {
  - svm-nmi_singlestep = true;
  - svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
  - update_db_intercept(vcpu);
  - }
  + svm-nmi_singlestep = true;
  + svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
  + update_db_intercept(vcpu);
  
  So we're always messing with the nested guest state when the host
  wants to inject an nmi into the l1 guest? Is that safe?
  
  Why not? We can't inject an NMI directly into L2 if the nested
  hypervisor intercepts it.
 
 So where did the code go that does the #vmexit in case the nested
 hypervisor does intercept it? It used to be nested_svm_nmi(), right?

No, nested_svm_nmi runs in atomic context where we can't emulate a
vmexit. We set exit_required and emulate the vmexit later.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:22, Joerg Roedel wrote:

 On Fri, Apr 23, 2010 at 04:19:40PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:13, Joerg Roedel wrote:
 
 On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
 The patch introducing nested nmi handling had a bug. The
 check does not belong to enable_nmi_window but must be in
 nmi_allowed. This patch fixes this.
 
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com
 ---
 arch/x86/kvm/svm.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index ab78eb8..ec20584 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
   struct vcpu_svm *svm = to_svm(vcpu);
   struct vmcb *vmcb = svm-vmcb;
 - return !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
 - !(svm-vcpu.arch.hflags  HF_NMI_MASK);
 + int ret;
 + ret = !(vmcb-control.int_state  SVM_INTERRUPT_SHADOW_MASK) 
 +   !(svm-vcpu.arch.hflags  HF_NMI_MASK);
 + ret = ret  gif_set(svm)  nested_svm_nmi(svm);
 +
 + return ret;
 }
 
 static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
 @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu 
 *vcpu)
* Something prevents NMI from been injected. Single step over possible
* problem (IRET or exception injection or interrupt shadow)
*/
 - if (gif_set(svm)  nested_svm_nmi(svm)) {
 - svm-nmi_singlestep = true;
 - svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 - update_db_intercept(vcpu);
 - }
 + svm-nmi_singlestep = true;
 + svm-vmcb-save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 + update_db_intercept(vcpu);
 
 So we're always messing with the nested guest state when the host
 wants to inject an nmi into the l1 guest? Is that safe?
 
 Why not? We can't inject an NMI directly into L2 if the nested
 hypervisor intercepts it.
 
 So where did the code go that does the #vmexit in case the nested
 hypervisor does intercept it? It used to be nested_svm_nmi(), right?
 
 No, nested_svm_nmi runs in atomic context where we can't emulate a
 vmexit. We set exit_required and emulate the vmexit later.

So we modify the L2 rflags and then trigger a #vmexit, leaving the l2 state 
broken?

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:57:32PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
  This patch adds logic to kvm/x86 which allows to mark an
  injected exception as reinjected. This allows to remove an
  ugly hack from svm_complete_interrupts that prevented
  exceptions from being reinjected at all in the nested case.
  The hack was necessary because an reinjected exception into
  the nested guest could cause a nested vmexit emulation. But
  reinjected exceptions must not intercept. The downside of
  the hack is that a exception that in injected could get
  lost.
  This patch fixes the problem and puts the code for it into
  generic x86 files because. Nested-VMX will likely have the
  same problem and could reuse the code.
 
 So we always handle the reinjection from KVM code? Shouldn't the l1
 hypervisor do this?

No. We only have the problem if we need to handle a nested intercept on
the host level instead of reinjecting it. So the nested hypervisor
couldn't be involved in the reinjection.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:58:18PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
  This patch prevents MCE intercepts from being propagated
  into the L1 guest if they happened in an L2 guest.
 
 Good catch. How did you stumble over this?

While thinking about another issue which I can't disclose
right now ;-) Stay tuned...

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:22, Joerg Roedel wrote:

  No, nested_svm_nmi runs in atomic context where we can't emulate a
  vmexit. We set exit_required and emulate the vmexit later.
 
 So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
 state broken?

No, the rflags are changed in enable_nmi_window which isn't called when
we run nested and the nested hypervisor intercepts nmi. So it only runs
in the !nested case where it can't corrupt L2 state.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:27, Joerg Roedel wrote:

 On Fri, Apr 23, 2010 at 03:57:32PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
 This patch adds logic to kvm/x86 which allows to mark an
 injected exception as reinjected. This allows to remove an
 ugly hack from svm_complete_interrupts that prevented
 exceptions from being reinjected at all in the nested case.
 The hack was necessary because an reinjected exception into
 the nested guest could cause a nested vmexit emulation. But
 reinjected exceptions must not intercept. The downside of
 the hack is that a exception that in injected could get
 lost.
 This patch fixes the problem and puts the code for it into
 generic x86 files because. Nested-VMX will likely have the
 same problem and could reuse the code.
 
 So we always handle the reinjection from KVM code? Shouldn't the l1
 hypervisor do this?
 
 No. We only have the problem if we need to handle a nested intercept on
 the host level instead of reinjecting it. So the nested hypervisor
 couldn't be involved in the reinjection.

Hrm, makes sense. Not pretty.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:31, Joerg Roedel wrote:

 On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:22, Joerg Roedel wrote:
 
 No, nested_svm_nmi runs in atomic context where we can't emulate a
 vmexit. We set exit_required and emulate the vmexit later.
 
 So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
 state broken?
 
 No, the rflags are changed in enable_nmi_window which isn't called when
 we run nested and the nested hypervisor intercepts nmi. So it only runs
 in the !nested case where it can't corrupt L2 state.

Last time I checked the code enable_nmi_window was the function triggering the 
#vmexit, so it should run in that exact scenario. If what you say is true, 
where do we #vmexit instead then?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


kvm apparently still has issues with blockdev 1Tb

2010-04-23 Thread Michael Tokarev
Not many details, but JFYI: at least 0.12 still has issues
(leads to silent data corruption) with block devices of
size = 1Tb.  There were several such cases reported on
#...@oftc in last two months.  Usually reducing the device
size to something less than 1Tb solves that issue (i.e.,
stops new data corruption from happening).

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 04:42:52PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:31, Joerg Roedel wrote:
 
  On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
  
  On 23.04.2010, at 16:22, Joerg Roedel wrote:
  
  No, nested_svm_nmi runs in atomic context where we can't emulate a
  vmexit. We set exit_required and emulate the vmexit later.
  
  So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
  state broken?
  
  No, the rflags are changed in enable_nmi_window which isn't called when
  we run nested and the nested hypervisor intercepts nmi. So it only runs
  in the !nested case where it can't corrupt L2 state.
 
 Last time I checked the code enable_nmi_window was the function
 triggering the #vmexit,

Yes, thats the bug which this patch fixes :-)

so it should run in that exact scenario. If what you say is true, where
do we #vmexit instead then?

After setting exit_required we run into svm.c:svm_vcpu_run. There the
exit_required flag is checked and if set, the function immediatly
returns without doing a vmrun. A few cycles later we run into
svm.c:handle_exit() where at the beginning exit_required is checked, and
if set the vmexit is emulated.

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Jamie Lokier
Yoshiaki Tamura wrote:
 Jamie Lokier wrote:
 Yoshiaki Tamura wrote:
 Dor Laor wrote:
 On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
 Event tapping is the core component of Kemari, and it decides on which
 event the
 primary should synchronize with the secondary. The basic assumption
 here is
 that outgoing I/O operations are idempotent, which is usually true for
 disk I/O
 and reliable network protocols such as TCP.
 
 IMO any type of network even should be stalled too. What if the VM runs
 non tcp protocol and the packet that the master node sent reached some
 remote client and before the sync to the slave the master failed?
 
 In current implementation, it is actually stalling any type of network
 that goes through virtio-net.
 
 However, if the application was using unreliable protocols, it should have
 its own recovering mechanism, or it should be completely stateless.
 
 Even with unreliable protocols, if slave takeover causes the receiver
 to have received a packet that the sender _does not think it has ever
 sent_, expect some protocols to break.
 
 If the slave replaying master's behaviour since the last sync means it
 will definitely get into the same state of having sent the packet,
 that works out.
 
 That's something we're expecting now.
 
 But you still have to be careful that the other end's responses to
 that packet are not seen by the slave too early during that replay.
 Otherwise, for example, the slave may observe a TCP ACK to a packet
 that it hasn't yet sent, which is an error.
 
 Even current implementation syncs just before network output, what you 
 pointed out could happen.  In this case, would the connection going to be 
 lost, or would client/server recover from it?  If latter, it would be fine, 
 otherwise I wonder how people doing similar things are handling this 
 situation.

In the case of TCP in a synchronised state, I think it will recover
according to the rules in RFC793.  In an unsynchronised state
(during connection), I'm not sure if it recovers or if it looks like a
Connection reset error.  I suspect it does recover but I'm not certain.

But that's TCP.  Other protocols, such as over UDP, may behave
differently, because this is not an anticipated behaviour of a
network.

 However there is one respect in which they're not idempotent:
 
 The TTL field should be decreased if packets are delayed.  Packets
 should not appear to live in the network for longer than TTL seconds.
 If they do, some protocols (like TCP) can react to the delayed ones
 differently, such as sending a RST packet and breaking a connection.
 
 It is acceptable to reduce TTL faster than the minimum.  After all, it
 is reduced by 1 on every forwarding hop, in addition to time delays.
 
 So the problem is, when the slave takes over, it sends a packet with same 
 TTL which client may have received.

Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.

Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Mount and unmount CD

2010-04-23 Thread Matt Burkhardt
I'm having a problem with a virtual machine running under RHEL 5.4
64-bit.  I take out the CD / insert a new and the main machine sees the
new cd and makes it available.  However, the virtual machines still see
the old CD.  I've tried mounting the new CD, but it just keeps mounting
what it thinks is in there - the old one.

Any ideas?


Matt Burkhardt
Impari Systems, Inc.

m...@imparisystems.com
http://www.imparisystems.com 
http://www.linkedin.com/in/mlburkhardt 
http://www.twitter.com/matthewboh
502 Fairview Avenue
Frederick, MD  21701
work (301) 682-7901
cell   (301) 802-3235



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stuart Sheldon
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Stefan Hajnoczi wrote:
 On Fri, Apr 23, 2010 at 1:45 AM, Stuart Sheldon s...@actusa.net wrote:
 Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
 to netboot a guest, it appears that the pxe rom is timing out on dhcp
 before the bridge has enough time to come up.

 Is there a command line switch to set the dhcp timeout, or a build
 option that can be changed to set the timeout to a longer value, or
 disable it entirely?
 
 The bridge shouldn't need significant amounts of time to come up.  Can
 you describe the networking setup?  Are you using libvirt and with
 what network config?
 
 If you have a bridge configured, can you show the output of:
 
 $ sudo brctl showstp $bridge_name

The network is fairly straight forward. PC with single NIC directly
attached to switch with a local DHCP server.

We do not use libvirt, we are using qemu with command switches.

Interestingly it appears that the -option-rom command switch doesn't do
anything any longer. It is using gPXE by default only.

Here is our current command line we are testing with:

export QEMU_AUDIO_DRV=alsa
qemu-system-x86_64 \
-m 1024 \
-soundhw es1370 \
-name 'Net Boot Disk' \
-net nic,macaddr=52:54:00:12:31:06,model=virtio \
-net tap,ifname=tap6,script=/etc/kvm/kvm-ifup-br0 \
-option-rom /usr/local/share/qemu/pxe-virtio.bin \
-boot n \
-daemonize


Here is the exact chain of events:

1) Start guest from command line
2) tap adapter is added to br0 and enters learning state on host while
gPXE starts DHCP request on guest.
3) after about 15 seconds, tap device enters promiscuous mode on host,
and at same time, gPXE reports No bootable device. on guest
4) Drop to guest console and signal a system_reset, guest gets DHCP
address and netboots normally.

Here is the info you requested, both before and after guest startup.

Before guest startup:

linus:~# brctl show br0
bridge name bridge id   STP enabled interfaces
br0 8000.001cc092ba91   no  eth0

linus:~# brctl showstp br0
br0
 bridge id  8000.001cc092ba91
 designated root8000.001cc092ba91
 root port 0path cost  0
 max age  20.00 bridge max age20.00
 hello time2.00 bridge hello time  2.00
 forward delay15.00 bridge forward delay  15.00
 ageing time 300.00
 hello timer   1.87 tcn timer  0.00
 topology change timer 0.00 gc timer   8.87
 flags  


eth0 (1)
 port id8001stateforwarding
 designated root8000.001cc092ba91   path cost 19
 designated bridge  8000.001cc092ba91   message age timer  0.00
 designated port8001forward delay timer0.00
 designated cost   0hold timer 0.87
 flags  

After guest startup:

linus:~# brctl show br0
bridge name bridge id   STP enabled interfaces
br0 8000.001cc092ba91   no  eth0
tap6
linus:~# brctl showstp br0
br0
 bridge id  8000.001cc092ba91
 designated root8000.001cc092ba91
 root port 0path cost  0
 max age  20.00 bridge max age20.00
 hello time2.00 bridge hello time  2.00
 forward delay15.00 bridge forward delay  15.00
 ageing time 300.00
 hello timer   1.00 tcn timer  0.00
 topology change timer 0.00 gc timer   2.00
 flags  


eth0 (1)
 port id8001stateforwarding
 designated root8000.001cc092ba91   path cost 19
 designated bridge  8000.001cc092ba91   message age timer  0.00
 designated port8001forward delay timer0.00
 designated cost   0hold timer 0.00
 flags  

tap6 (2)
 port id8002stateforwarding
 designated root8000.001cc092ba91   path cost100
 designated bridge  8000.001cc092ba91   message age timer  0.00
 designated port8002forward delay timer0.00
 designated cost   0hold timer 0.00
 flags  

Let me know if you need anything else...

Stu


- 

RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern
After a few days of debugging I think kvmclock is the source of lockups
for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
locks up on another.

Server 1 - VM locks up repeatedly
-- DL580 G5
-- 4 quad-core X7350 processors at 2.93GHz
-- 48GB RAM

Server 2 - VM works just fine
-- DL380 G6
-- 2 quad-core E5540 processors at 2.53GHz
-- 24GB RAM

Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
kernel. I have tried various versions of qemu-kvm -- the version in
FC-12 and the version for FC-12 in virt-preview. In both cases the
qemu-kvm command line is identical.

VM
- RHEL5.5, PAE kernel (also tried standard 32-bit)
- 2 vcpus
- 3GB RAM
- virtio network and disk

When the VM locks up both vcpu threads are spinning at 100%. Changing
the clocksource to jiffies appears to have addressed the problem.

David
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[uq/master patch 1/5] vga: fix typo in length passed to kvm_log_stop

2010-04-23 Thread Marcelo Tosatti
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm/hw/vga.c
===
--- qemu-kvm.orig/hw/vga.c
+++ qemu-kvm/hw/vga.c
@@ -1613,8 +1613,8 @@ void vga_dirty_log_stop(VGACommonState *
kvm_log_stop(s-map_addr, s-map_end - s-map_addr);
 
 if (kvm_enabled()  s-lfb_vram_mapped) {
-   kvm_log_stop(isa_mem_base + 0xa, 0x8);
-   kvm_log_stop(isa_mem_base + 0xa8000, 0x8);
+   kvm_log_stop(isa_mem_base + 0xa, 0x8000);
+   kvm_log_stop(isa_mem_base + 0xa8000, 0x8000);
 }
 
 #ifdef CONFIG_BOCHS_VBE


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[uq/master patch 0/5] prepare for qemu-kvm's usage of upstream memslot code

2010-04-23 Thread Marcelo Tosatti

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stuart Sheldon
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Jim Paris wrote:
 Stuart Sheldon wrote:
 Stefan Hajnoczi wrote:
 On Fri, Apr 23, 2010 at 1:45 AM, Stuart Sheldon s...@actusa.net wrote:
 Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
 to netboot a guest, it appears that the pxe rom is timing out on dhcp
 before the bridge has enough time to come up.
 
 3) after about 15 seconds, tap device enters promiscuous mode on host,
 and at same time, gPXE reports No bootable device. on guest
 ...
  forward delay 15.00 bridge forward delay
   15.00
 
 I've run into this issue as well and lowering the forwarding delay
 with brctl setfd br0 5 fixes it.
 
 -jim
 

Thanks Jim! I think that has it...

Stu


- --
Hey nineteen Thats retha franklin, She dont remember The queen of soul,
Its hard times befallen The sole survivors, She thinks Im crazy,
But Im just growing old
   -- Steely Dan - Hey 19 - Lyrics
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iQIcBAEBCAAGBQJL0dkFAAoJEFKVLITDJSGSwJAQAIBL16jFmTavFWX0dU50syP+
VtK3EqQcv02bJdpAXdbfKuO1QDdjmucG2zEbNK4lMXKEwBsGcKtxDhUOVaNS07fK
OdfvsO+/wX9mC8JmqCah6r8Pwppd3F07nzP3HkPnswV/rgtDVrSWB2JInDy+d3R7
sd0DisPOeaHoXSQqJ8REGElVR2adzsdmRWUKpFmtu9tJCz48wsbkDLpcr5iMt7jA
sAPtTm1ENUEF3X3l6mlGB4kuYsLum93hPpgrxWmMvGO3uokVzRJYm3x2Yf5J++3v
yc8reMPPsQe1tI9r+ZtCKXE2XLqI/X4Pdap0IIjtid2245dBOMKys+aIXYZ1d7jL
YwngcgFDdse6B3Gt9/Tk5RhqqYQeYLsZckMOhy89DN2bftYxsX70xIWKKVxPEkNg
GniSN8pCg/Rx4ZOFkygwXRUayOx5Hi5nR6hetKyBqRM6mWYZvfvzRYH3O05JAOIz
3qKRObyuKHYFGkdWZ1LaTBoQoOdq+iWyuPK53kvpHFbdks6mZZg3YO70M6+ziqyT
zy9UvuumjzquuFwRu91R0qi02SZlNifvTQZ9aUpsPvRAhijroidqIzsH/5CoDuDe
AyQEKJgq64/ViA2QrYgpo/IFLAUbhCpMRIr7Tt5Te2TT+s43Lefyg6aY2XA2UwY0
6LwIWdBWwITjvP7XUU5B
=hwfS
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] Shared memory uio_pci driver

2010-04-23 Thread Cam Macdonell
On Mon, Apr 12, 2010 at 2:57 PM, Avi Kivity a...@redhat.com wrote:

 There is work now to bring msi to the generic pci 2.3 driver, perhaps we can
 use that instead.  From a quick look it looks fine.


I'd be interested to follow this development.  I can't find anything
on LKML, is it being discussed anywhere?

Thanks,
Cam
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mount and unmount CD

2010-04-23 Thread David S. Ahern
I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
so often. See attached.

David


On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
 I'm having a problem with a virtual machine running under RHEL 5.4
 64-bit.  I take out the CD / insert a new and the main machine sees the
 new cd and makes it available.  However, the virtual machines still see
 the old CD.  I've tried mounting the new CD, but it just keeps mounting
 what it thinks is in there - the old one.
 
 Any ideas?
 
 
 Matt Burkhardt
 Impari Systems, Inc.
 
 m...@imparisystems.com
 http://www.imparisystems.com 
 http://www.linkedin.com/in/mlburkhardt 
 http://www.twitter.com/matthewboh
 502 Fairview Avenue
 Frederick, MD  21701
 work (301) 682-7901
 cell   (301) 802-3235
 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
--- qemu/block-raw-posix.c.orig 2010-01-06 22:27:56.0 -0700
+++ qemu/block-raw-posix.c  2010-01-06 22:29:51.0 -0700
@@ -193,20 +193,40 @@
 static int raw_pread_aligned(BlockDriverState *bs, int64_t offset,
  uint8_t *buf, int count)
 {
 BDRVRawState *s = bs-opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret  0)
 return ret;
 
+/* media changes are only detected at the host layer when
+ * somethin reopens the cdrom device. Without an event 
+ * notice, we need a heuristic. Try the following which mimics
+ * what is done for floppy drives. Here we reopen the cdrom
+ * after 3 seconds of elapsed time - this should be short
+ * enough to cover a user inserting a new disk and then accessing
+ * it via the CLI/GUI.
+ */
+if (bs-type == BDRV_TYPE_CDROM) {
+static int64_t last = 0;
+int64_t now = qemu_get_clock(rt_clock);
+if ((now - last)  3000)
+ret = cdrom_reopen(bs);
+else
+   ret = 0;
+last = now;
+if (ret  0)
+   return ret;
+}
+
 if (offset = 0  lseek(s-fd, offset, SEEK_SET) == (off_t)-1) {
 ++(s-lseek_err_cnt);
 if(s-lseek_err_cnt = 10) {
 DEBUG_BLOCK_PRINT(raw_pread(%d:%s, % PRId64 , %p, %d) [% PRId64
   ] lseek failed : %d = %s\n,
   s-fd, bs-filename, offset, buf, count,
   bs-total_sectors, errno, strerror(errno));
 }
 return -1;
 }
--- qemu/hw/ide.c.orig  2010-01-06 22:28:02.0 -0700
+++ qemu/hw/ide.c   2010-01-06 22:30:45.0 -0700
@@ -1456,20 +1456,28 @@
 s-cd_sector_size = sector_size;
 
 /* XXX: check if BUSY_STAT should be set */
 s-status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
 ide_dma_start(s, ide_atapi_cmd_read_dma_cb);
 }
 
 static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
int sector_size)
 {
+if (s-is_cdrom) {
+static int64_t last = 0;
+int64_t now = qemu_get_clock(rt_clock);
+if ((now - last)  3000)
+(void) cdrom_reopen(s-bs);
+last = now;
+}
+
 #ifdef DEBUG_IDE_ATAPI
 printf(read %s: LBA=%d nb_sectors=%d\n, s-atapi_dma ? dma : pio,
lba, nb_sectors);
 #endif
 if (s-atapi_dma) {
 ide_atapi_cmd_read_dma(s, lba, nb_sectors, sector_size);
 } else {
 ide_atapi_cmd_read_pio(s, lba, nb_sectors, sector_size);
 }
 }


Re: Mount and unmount CD

2010-04-23 Thread David S. Ahern
oops. the previous patch rides on top of this one.

David


On 04/23/2010 12:18 PM, David S. Ahern wrote:
 I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
 so often. See attached.
 
 David
 
 
 On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
 I'm having a problem with a virtual machine running under RHEL 5.4
 64-bit.  I take out the CD / insert a new and the main machine sees the
 new cd and makes it available.  However, the virtual machines still see
 the old CD.  I've tried mounting the new CD, but it just keeps mounting
 what it thinks is in there - the old one.

 Any ideas?


 Matt Burkhardt
 Impari Systems, Inc.

 m...@imparisystems.com
 http://www.imparisystems.com 
 http://www.linkedin.com/in/mlburkhardt 
 http://www.twitter.com/matthewboh
 502 Fairview Avenue
 Frederick, MD  21701
 work (301) 682-7901
 cell   (301) 802-3235



 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--- qemu/block-raw-posix.c.orig 2010-01-06 21:46:31.0 -0700
+++ qemu/block-raw-posix.c  2010-01-06 21:54:22.0 -0700
@@ -107,20 +107,24 @@
 int fd_got_error;
 int fd_media_changed;
 #endif
 uint8_t* aligned_buf;
 } BDRVRawState;
 
 static int posix_aio_init(void);
 
 static int fd_open(BlockDriverState *bs);
 
+#if defined(__linux__)
+int cdrom_reopen(BlockDriverState *bs);
+#endif
+
 static int raw_open(BlockDriverState *bs, const char *filename, int flags)
 {
 BDRVRawState *s = bs-opaque;
 int fd, open_flags, ret;
 
 posix_aio_init();
 
 s-lseek_err_cnt = 0;
 
 open_flags = O_BINARY;
@@ -212,29 +216,32 @@
 if (ret == count)
 goto label__raw_read__success;
 
 DEBUG_BLOCK_PRINT(raw_pread(%d:%s, % PRId64 , %p, %d) [% PRId64
   ] read failed %d : %d = %s\n,
   s-fd, bs-filename, offset, buf, count,
   bs-total_sectors, ret, errno, strerror(errno));
 
 /* Try harder for CDrom. */
 if (bs-type == BDRV_TYPE_CDROM) {
-lseek(s-fd, offset, SEEK_SET);
-ret = read(s-fd, buf, count);
-if (ret == count)
-goto label__raw_read__success;
-lseek(s-fd, offset, SEEK_SET);
-ret = read(s-fd, buf, count);
-if (ret == count)
+int i;
+for (i = 0; i  2; ++i) {
+#if defined(__linux__)
+ret = cdrom_reopen(bs);
+if (ret  0)
 goto label__raw_read__success;
-
+#endif
+lseek(s-fd, offset, SEEK_SET);
+ret = read(s-fd, buf, count);
+if (ret == count)
+goto label__raw_read__success;
+}
 DEBUG_BLOCK_PRINT(raw_pread(%d:%s, % PRId64 , %p, %d) [% PRId64
   ] retry read failed %d : %d = %s\n,
   s-fd, bs-filename, offset, buf, count,
   bs-total_sectors, ret, errno, strerror(errno));
 }
 
 label__raw_read__success:
 
 return ret;
 }
@@ -1025,20 +1032,27 @@
 printf(Floppy opened\n);
 #endif
 }
 if (!last_media_present)
 s-fd_media_changed = 1;
 s-fd_open_time = qemu_get_clock(rt_clock);
 s-fd_got_error = 0;
 return 0;
 }
 
+int cdrom_reopen(BlockDriverState *bs)
+{
+/* mimics a 'change' monitor command - without the eject */
+bdrv_close(bs);
+return bdrv_open2(bs, bs-filename, 0, bs-drv);
+}
+
 static int raw_is_inserted(BlockDriverState *bs)
 {
 BDRVRawState *s = bs-opaque;
 int ret;
 
 switch(s-type) {
 case FTYPE_CD:
 ret = ioctl(s-fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
 if (ret == CDS_DISC_OK)
 return 1;
--- qemu/hw/ide.c.orig  2010-01-06 21:54:33.0 -0700
+++ qemu/hw/ide.c   2010-01-06 21:56:16.0 -0700
@@ -29,20 +29,24 @@
 #include pcmcia.h
 #include block.h
 #include block_int.h
 #include qemu-timer.h
 #include sysemu.h
 #include ppc_mac.h
 #include sh.h
 #include console.h
 #include syslog.h
 
+#if defined(__linux__)
+int cdrom_reopen(BlockDriverState *bs);
+#endif
+
 /* debug IDE devices */
 //#define DEBUG_IDE
 //#define DEBUG_IDE_ATAPI
 //#define DEBUG_AIO
 #define USE_DMA_CDROM
 
 /* Bits of HD_STATUS */
 #define ERR_STAT   0x01
 #define INDEX_STAT 0x02
 #define ECC_STAT   0x04/* Corrected error */
@@ -1363,20 +1367,25 @@
 /* ATAPI DMA support */
 
 /* XXX: handle read errors */
 static void ide_atapi_cmd_read_dma_cb(void *opaque, int ret)
 {
 BMDMAState *bm = opaque;
 IDEState *s = bm-ide_if;
 int data_offset, n;
 
 if (ret  0) {
+#if defined(__linux__)
+/* on EIO failure try re-opening file */
+if (ret == -EIO)
+(void) cdrom_reopen(s-bs);
+#endif
 ide_atapi_io_error(s, ret);
 goto eot;
 }
 
 if (s-io_buffer_size  0) {
/*
 * For a cdrom read sector command (s-lba != -1),

Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:51, Joerg Roedel wrote:

 On Fri, Apr 23, 2010 at 04:42:52PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:31, Joerg Roedel wrote:
 
 On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
 
 On 23.04.2010, at 16:22, Joerg Roedel wrote:
 
 No, nested_svm_nmi runs in atomic context where we can't emulate a
 vmexit. We set exit_required and emulate the vmexit later.
 
 So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
 state broken?
 
 No, the rflags are changed in enable_nmi_window which isn't called when
 we run nested and the nested hypervisor intercepts nmi. So it only runs
 in the !nested case where it can't corrupt L2 state.
 
 Last time I checked the code enable_nmi_window was the function
 triggering the #vmexit,
 
 Yes, thats the bug which this patch fixes :-)
 
 so it should run in that exact scenario. If what you say is true, where
 do we #vmexit instead then?
 
 After setting exit_required we run into svm.c:svm_vcpu_run. There the
 exit_required flag is checked and if set, the function immediatly
 returns without doing a vmrun. A few cycles later we run into
 svm.c:handle_exit() where at the beginning exit_required is checked, and
 if set the vmexit is emulated.

Oh well, I guess I have to trust you on this one :)


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] More fixes for nested svm

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 13:04, Avi Kivity wrote:

 On 04/22/2010 01:33 PM, Joerg Roedel wrote:
 Hi Avi, Marcelo,
 
 here is another set of nested svm fixes. They fix NMI code to make UP Hyper-V
 root domain booting. The patches also add better handling for nested entry
 failures and mce intercepts.
 Also in this patchset are the fixes for the supported cpuid reporting for svm
 features. These patches were taken from the nested-npt patchset and slightly
 modified. These patches are also marked for -stable backporting.
 The probably most important fix is about exception reinjection. This didn't
 work reliably before and is fixed with the patch in this series now. This fix
 also touches common x86 code but that should be ok because it could be reused
 by nested-vmx later.
 Please review and give comments (or apply ;-).
   
 
 Looks good.  I'll wait a day or two for more reviews (esp. Alex).

No complaints from me.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Jeremy Fitzhardinge
On 04/23/2010 02:34 AM, Avi Kivity wrote:

 diff -rup a/time-warp-test.c b/time-warp-test.c
 --- a/time-warp-test.c2010-04-15 16:30:13.955981607 -1000
 +++ b/time-warp-test.c2010-04-15 16:35:37.777982377 -1000
 @@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
   {
   DECLARE_ARGS(val, low, high);

 -asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high));
 +asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high) ::
 ebx, ecx);




 Plus, replace cpuid by lfence/mfence.  cpuid will trap.

I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?

J

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stefan Hajnoczi
For reference, my libvirt managed virbr0 has forwarding delay 0.  This
is the default:

http://libvirt.org/formatnetwork.html#elementsConnect

I know that my VM is a leaf node (it only has one NIC and isn't going
to create a loop in the network) and therefore it makes sense to
eliminate the forwarding delay entirely.

You might be interested in this link about STP delay on physical networks:

http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00800b1500.shtml

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Avi Kivity

On 04/23/2010 10:22 PM, Jeremy Fitzhardinge wrote:

On 04/23/2010 02:34 AM, Avi Kivity wrote:
   

diff -rup a/time-warp-test.c b/time-warp-test.c
--- a/time-warp-test.c2010-04-15 16:30:13.955981607 -1000
+++ b/time-warp-test.c2010-04-15 16:35:37.777982377 -1000
@@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
   {
   DECLARE_ARGS(val, low, high);

-asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high));
+asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high) ::
ebx, ecx);


   


Plus, replace cpuid by lfence/mfence.  cpuid will trap.
 

I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?

   


Right.  But I'm not sure lfence/mfence are sufficient - from what I 
understand rdtsc can pass right through them.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv5] add mergeable receiver buffers support to vhost

2010-04-23 Thread David L Stevens
This patch adds mergeable receive buffers support to vhost.

Signed-off-by: David L Stevens dlstev...@us.ibm.com

diff -ruNp net-next-v0/drivers/vhost/net.c net-next-v5/drivers/vhost/net.c
--- net-next-v0/drivers/vhost/net.c 2010-04-22 11:31:57.0 -0700
+++ net-next-v5/drivers/vhost/net.c 2010-04-22 12:41:17.0 -0700
@@ -109,7 +109,7 @@ static void handle_tx(struct vhost_net *
};
size_t len, total_len = 0;
int err, wmem;
-   size_t hdr_size;
+   size_t vhost_hlen;
struct socket *sock = rcu_dereference(vq-private_data);
if (!sock)
return;
@@ -128,13 +128,13 @@ static void handle_tx(struct vhost_net *
 
if (wmem  sock-sk-sk_sndbuf / 2)
tx_poll_stop(net);
-   hdr_size = vq-hdr_size;
+   vhost_hlen = vq-vhost_hlen;
 
for (;;) {
-   head = vhost_get_vq_desc(net-dev, vq, vq-iov,
-ARRAY_SIZE(vq-iov),
-out, in,
-NULL, NULL);
+   head = vhost_get_desc(net-dev, vq, vq-iov,
+ ARRAY_SIZE(vq-iov),
+ out, in,
+ NULL, NULL);
/* Nothing new?  Wait for eventfd to tell us they refilled. */
if (head == vq-num) {
wmem = atomic_read(sock-sk-sk_wmem_alloc);
@@ -155,20 +155,20 @@ static void handle_tx(struct vhost_net *
break;
}
/* Skip header. TODO: support TSO. */
-   s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
+   s = move_iovec_hdr(vq-iov, vq-hdr, vhost_hlen, out);
msg.msg_iovlen = out;
len = iov_length(vq-iov, out);
/* Sanity check */
if (!len) {
vq_err(vq, Unexpected header len for TX: 
   %zd expected %zd\n,
-  iov_length(vq-hdr, s), hdr_size);
+  iov_length(vq-hdr, s), vhost_hlen);
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock-ops-sendmsg(NULL, sock, msg, len);
if (unlikely(err  0)) {
-   vhost_discard_vq_desc(vq);
+   vhost_discard_desc(vq, 1);
tx_poll_start(net, sock);
break;
}
@@ -187,12 +187,25 @@ static void handle_tx(struct vhost_net *
unuse_mm(net-dev.mm);
 }
 
+static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
+{
+   struct sk_buff *head;
+   int len = 0;
+
+   lock_sock(sk);
+   head = skb_peek(sk-sk_receive_queue);
+   if (head)
+   len = head-len + vq-sock_hlen;
+   release_sock(sk);
+   return len;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_RX];
-   unsigned head, out, in, log, s;
+   unsigned in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -203,14 +216,14 @@ static void handle_rx(struct vhost_net *
.msg_flags = MSG_DONTWAIT,
};
 
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
+   struct virtio_net_hdr_mrg_rxbuf hdr = {
+   .hdr.flags = 0,
+   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
};
 
size_t len, total_len = 0;
-   int err;
-   size_t hdr_size;
+   int err, headcount, datalen;
+   size_t vhost_hlen;
struct socket *sock = rcu_dereference(vq-private_data);
if (!sock || skb_queue_empty(sock-sk-sk_receive_queue))
return;
@@ -218,18 +231,18 @@ static void handle_rx(struct vhost_net *
use_mm(net-dev.mm);
mutex_lock(vq-mutex);
vhost_disable_notify(vq);
-   hdr_size = vq-hdr_size;
+   vhost_hlen = vq-vhost_hlen;
 
vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
vq-log : NULL;
 
-   for (;;) {
-   head = vhost_get_vq_desc(net-dev, vq, vq-iov,
-ARRAY_SIZE(vq-iov),
-out, in,
-vq_log, log);
+   while ((datalen = vhost_head_len(vq, sock-sk))) {
+   headcount = vhost_get_desc_n(vq, vq-heads, datalen+vhost_hlen,
+in, vq_log, log);
+   if (headcount  0)
+   break;
/* OK, now we need to know about 

Don't confine mouse on grab

2010-04-23 Thread Harald Braumann
Hi,

I'd like to propose a small change in the grab behaviour of the SDL
window:

Currently, when the mouse is moved over the window, the guest mouse is
moved as well and the guest receives keyboard input (using absolute
mouse here and focus-follow-mouse). However, the keyboard is not
grabbed, and so the guest doesn't receive all key combinations (very
bad, when you try to close a window in the guest and instead the whole
SDL window is closed). You have to click or use a key combination to
grab. Now, when the grab is active, also the pointer is confined
inside the SDL window. You have to again use a key combination to get
out.

The better behaviour, I think would be the following:
Whenever the SDL window gets the focus, it should grab the keyboard
without confining the pointer. Then you could use all the key
combinations in the guest and you could still change the focus by
moving the mouse out and to another window. VMware does it like this,
VirtualBox, I think, as well, and also the rdesktop client. 

However, I've checked the SDL documentation and it seems it only
provides SDL_WM_GrabInput which grabs the keyboard *and* confines the
mouse. So maybe SDL would need to be extended for this to work. If
that is in fact the case, a work-around would be, to never do a
grab. My window manager can be configured to deliver (almost) all key
combinations to certain windows, so I could still get the desired
behaviour.

Cheers,
harry
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


questions about implementing a zero-copy solution to transfer data between VMs

2010-04-23 Thread JinShan Xiong
Hi all,

Probably this is a stale question because I heard of we've already had
some this kind of patches.

I want to implement a zero-copy mechanism to transfer data between
VMs. From what I've investigated, it's quite easy at sender side,
because virtio is good at sharing data pages from VM to HOST; however,
problem arises at receiver side, we have to invent a mechanism to
share host pages to VM. There're two solutions I can think of, but I'm
not sure if they really work:
1. modify the shadow page table of the receiver in the HOST, and
notify the receiver, so that the receiver can access the data pages
directly;
2. implement a pci-character device in receiver. Before the receiver
is notified, HOST has to rearrange the data pages from sender, and set
the base address space of that pci device. In this way, after receiver
mapping the pci memory, it may access the data directly.

folks, would you please help me make sure if the schemes work, better one?

Thanks,
Jay
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread Brian Jackson
On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
 After a few days of debugging I think kvmclock is the source of lockups
 for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
 locks up on another.
 
 Server 1 - VM locks up repeatedly
 -- DL580 G5
 -- 4 quad-core X7350 processors at 2.93GHz
 -- 48GB RAM
 
 Server 2 - VM works just fine
 -- DL380 G6
 -- 2 quad-core E5540 processors at 2.53GHz
 -- 24GB RAM
 
 Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
 kernel. I have tried various versions of qemu-kvm -- the version in
 FC-12 and the version for FC-12 in virt-preview. In both cases the
 qemu-kvm command line is identical.
 
 VM
 - RHEL5.5, PAE kernel (also tried standard 32-bit)
 - 2 vcpus
 - 3GB RAM
 - virtio network and disk
 
 When the VM locks up both vcpu threads are spinning at 100%. Changing
 the clocksource to jiffies appears to have addressed the problem.


Does changing the guest to -smp 1 help?


 
 David
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Zachary Amsden

On 04/22/2010 11:34 PM, Avi Kivity wrote:

On 04/23/2010 04:44 AM, Zachary Amsden wrote:

   Or apply this patch.
time-warp.patch


diff -rup a/time-warp-test.c b/time-warp-test.c
--- a/time-warp-test.c2010-04-15 16:30:13.955981607 -1000
+++ b/time-warp-test.c2010-04-15 16:35:37.777982377 -1000
@@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
  {
  DECLARE_ARGS(val, low, high);

-asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high));
+asm volatile(cpuid; rdtsc : EAX_EDX_RET(val, low, high) :: 
ebx, ecx);




Plus, replace cpuid by lfence/mfence.  cpuid will trap.



Does lfence / mfence actually serialize?  I thought there was some great 
confusion about that not being the case on all AMD processors, and 
possibly not at all on Intel.


A trap, however is a great way to serialize.

I think, there is no serializing instruction which can be used from 
userspace which does not trap, at least, I don't know one off the top of 
my head.


Zach
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Jeremy Fitzhardinge
On 04/23/2010 02:31 PM, Zachary Amsden wrote:
 Does lfence / mfence actually serialize?  I thought there was some
 great confusion about that not being the case on all AMD processors,
 and possibly not at all on Intel.

 A trap, however is a great way to serialize.

 I think, there is no serializing instruction which can be used from
 userspace which does not trap, at least, I don't know one off the top
 of my head.

rsm is not technically privileged... but not quite usable from usermode ;)

J
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread Zachary Amsden

On 04/23/2010 10:39 AM, Brian Jackson wrote:

On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
   

After a few days of debugging I think kvmclock is the source of lockups
for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
locks up on another.

Server 1 - VM locks up repeatedly
-- DL580 G5
-- 4 quad-core X7350 processors at 2.93GHz
-- 48GB RAM

Server 2 - VM works just fine
-- DL380 G6
-- 2 quad-core E5540 processors at 2.53GHz
-- 24GB RAM

Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
kernel. I have tried various versions of qemu-kvm -- the version in
FC-12 and the version for FC-12 in virt-preview. In both cases the
qemu-kvm command line is identical.

VM
- RHEL5.5, PAE kernel (also tried standard 32-bit)
- 2 vcpus
- 3GB RAM
- virtio network and disk

When the VM locks up both vcpu threads are spinning at 100%. Changing
the clocksource to jiffies appears to have addressed the problem.
 


Does changing the guest to -smp 1 help?

   


Based on our current understanding of the problem, it should help, but 
it may not prevent the problem entirely.


There are three issues with kvmclock due to sampling:

1) smp clock alignment may be slightly off due to timing conditions
2) kvmclock is resampled at each switch of vcpu to another pcpu
3) kvmclock granularity exceeds that of kernel timespec, which means 
sampling errors may show even on UP


Recommend using a different clocksource (tsc is great if you have stable 
TSC and don't migrate across different-speed machines) until we have all 
the fixes in place.


Zach
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Zachary Amsden

On 04/23/2010 11:35 AM, Jeremy Fitzhardinge wrote:

On 04/23/2010 02:31 PM, Zachary Amsden wrote:
   

Does lfence / mfence actually serialize?  I thought there was some
great confusion about that not being the case on all AMD processors,
and possibly not at all on Intel.

A trap, however is a great way to serialize.

I think, there is no serializing instruction which can be used from
userspace which does not trap, at least, I don't know one off the top
of my head.
 

rsm is not technically privileged... but not quite usable from usermode ;)
   


rsm under hardware virtualization makes my head hurt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern


On 04/23/2010 03:39 PM, Zachary Amsden wrote:
 On 04/23/2010 10:39 AM, Brian Jackson wrote:
 On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
   
 After a few days of debugging I think kvmclock is the source of lockups
 for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
 locks up on another.

 Server 1 - VM locks up repeatedly
 -- DL580 G5
 -- 4 quad-core X7350 processors at 2.93GHz
 -- 48GB RAM

 Server 2 - VM works just fine
 -- DL380 G6
 -- 2 quad-core E5540 processors at 2.53GHz
 -- 24GB RAM

 Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
 kernel. I have tried various versions of qemu-kvm -- the version in
 FC-12 and the version for FC-12 in virt-preview. In both cases the
 qemu-kvm command line is identical.

 VM
 - RHEL5.5, PAE kernel (also tried standard 32-bit)
 - 2 vcpus
 - 3GB RAM
 - virtio network and disk

 When the VM locks up both vcpu threads are spinning at 100%. Changing
 the clocksource to jiffies appears to have addressed the problem.
  

 Does changing the guest to -smp 1 help?


 
 Based on our current understanding of the problem, it should help, but
 it may not prevent the problem entirely.
 
 There are three issues with kvmclock due to sampling:
 
 1) smp clock alignment may be slightly off due to timing conditions
 2) kvmclock is resampled at each switch of vcpu to another pcpu
 3) kvmclock granularity exceeds that of kernel timespec, which means
 sampling errors may show even on UP
 
 Recommend using a different clocksource (tsc is great if you have stable
 TSC and don't migrate across different-speed machines) until we have all
 the fixes in place.

That's my plan for now. As I recall jiffies was the default in early
RHEL5 versions. Not sure what that means hardware wise.

The biggest problem for me is that RHEL5.5 defaults to kvmclock; I'll
find some workaround for it.

David

 
 Zach
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >