Re: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-21 Thread Kevin Wolf
Am 20.11.2011 11:59, schrieb Pekka Enberg:
 On Sun, 2011-11-20 at 14:14 +0800, Lan, Tianyu wrote:
 OK. Thx.
 But fsync is too slow. I'm trying to find a way to sync a range of a file. 
 Are there any solutions for this?
 
 On Sun, 2011-11-20 at 08:23 +0200, Sasha Levin wrote:
 fdatasync() is as good as it'll get.

 tbh, maybe we should just consider opening QCOW images with O_SYNC and
 just get it over with?
 
 No, let's not do that. It's easier to improve the performance of correct
 code that doesn't use O_SYNC.

Yes, O_SYNC gives you horrible performance. With explicit fsyncs cluster
allocation is somewhat slow (you can try to speed it up by batching
updates like qemu does), but at least rewrites will perform reasonably
close to raw.

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM device assignment and user privileges

2011-11-21 Thread Avi Kivity
On 11/21/2011 06:42 AM, Chris Wright wrote:
 * Avi Kivity (a...@redhat.com) wrote:
  On 11/20/2011 04:58 PM, Sasha Levin wrote:
   Hi all,
  
   I've been working on adding device assignment to KVM tools, and started
   with the basics of just getting a device assigned using the
   KVM_ASSIGN_PCI_DEVICE ioctl.
  
   What I've found is that an unprivileged user can request any PCI device
   to be assigned to him, including devices which he shouldn't be touching.
  
   In my case, it happened with the VGA card, where an unprivileged user
   simply called KVM_ASSIGN_PCI_DEVICE with the bus, seg and fn of the VGA
   card and caused the display on the host to go apeshit.
  
   Was it supposed to work this way? 
  
  No, of course not.

 Indeed.  A device is typically owned by a host OS driver which precludes
 device assignment from working.  If it's not, the unprivileged guest
 will not have access to the device's config space or resource bars as
 they are only rw for a privileged user.  And similarly, /dev/kvm was
 typically left as 0644.  As you can see, it's fragile.

   I couldn't find any security checks in
   the code paths of KVM_ASSIGN_PCI_DEVICE and it looks like any user can
   invoke it with any parameters he'd want - enabling him to kill the host.
  
  Alex, Chris?

 The security checks were removed some time back.  The expectation was
 that there was nothing an unprivileged user could usefully do w/ the
 assign device ioctl, and the assign irq ioctl only works after assign
 device.  It's built on an overly fragile set of assumptions, however.
 Avi, the simplest short term thing to do now might be simply revert:

 48bb09e KVM: remove CAP_SYS_RAWIO requirement from kvm_vm_ioctl_assign_irq

That does nothing for the KVM_ASSIGN_PCI_DEVICE ioctl.

 While it's a regression for existing unprivileged users, it's better than
 a hole.  And in the meantime, we can come up w/ something better to
 replace it with.

How about doing an access(2) equivalent check for one of the sysfs files
used to represent the device?  If libvirt already chowns them to the
right user, then we have no regression, at least for that use case.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-21 Thread Avi Kivity
On 11/21/2011 02:54 AM, Takuya Yoshikawa wrote:
 (2011/11/20 20:21), Avi Kivity wrote:
 On 11/18/2011 11:18 AM, Xiao Guangrong wrote:
 index bb8728e..10524c0 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
(vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
idx++)

 +#define kvm_for_each_memslot(slots, memslot, i)\
 +	for (i = 0; i < (slots)->nmemslots &&\
 +	     ({ memslot = &(slots)->memslots[i]; 1; }); i++)
 +


 Statement expression not needed, you can use the comma operator:

	i < (slots)->nmemslots && (memslot = &(slots)->memslots[i], true)

 or even

	memslot = &(slots)->memslots[i], i < (slots)->nmemslots

 or just kill i and make memslot the loop variable.


 Do you have any preference for the arguments ordering?

 I think placing the target one, memslot in this case, first is
 conventional in
 the kernel code, except when we want to place kvm or something like
 that.

 But in kvm code, there seems to be some difference.

You mean for the macro?  Yes, making memslot the first argument is a
good idea.  Any difference in kvm code is not intentional.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-21 Thread Takuya Yoshikawa

(2011/11/21 17:34), Avi Kivity wrote:

Do you have any preference for the arguments ordering?

I think placing the target one, memslot in this case, first is
conventional in
the kernel code, except when we want to place kvm or something like
that.

But in kvm code, there seems to be some difference.


You mean for the macro?  Yes, making memslot the first argument is a
good idea.  Any difference in kvm code is not intentional.



Yes.

Xiao, please change the order if you have no problem.

Thanks,
Takuya


Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro

2011-11-21 Thread Xiao Guangrong
On 11/21/2011 04:40 PM, Takuya Yoshikawa wrote:

 (2011/11/21 17:34), Avi Kivity wrote:
 Do you have any preference for the arguments ordering?

 I think placing the target one, memslot in this case, first is
 conventional in
 the kernel code, except when we want to place kvm or something like
 that.

 But in kvm code, there seems to be some difference.

 You mean for the macro?  Yes, making memslot the first argument is a
 good idea.  Any difference in kvm code is not intentional.

 
 Yes.
 
 Xiao, please change the order if you have no problem.
 


OK, will change it in the next version, thanks!



Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-21 Thread Kevin Wolf
Am 21.11.2011 08:12, schrieb Lan Tianyu:
 When a write request targets a cluster without the copied flag, allocate
 a new cluster and write the original data, with the modification, to the
 new cluster. This also adds support for writing to qcow2 compressed
 images. After testing, the image file passes qemu-img check. Performance
 still needs to be improved.
 
 Signed-off-by: Lan Tianyu tianyu@intel.com
 ---
 tools/kvm/disk/qcow.c        |  411 +-
 tools/kvm/include/kvm/qcow.h |    2 +
  2 files changed, 285 insertions(+), 128 deletions(-)

 @@ -766,122 +872,166 @@ static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src_len)
 	if (l2t_idx >= l2t_size)
 		return -1;
 
 -	clust_off = get_cluster_offset(q, offset);
 -	if (clust_off >= clust_sz)
 -		return -1;
 -
 -	len = clust_sz - clust_off;
 -	if (len > src_len)
 -		len = src_len;
 -
 -	mutex_lock(&q->mutex);
 -
 	l2t_offset = be64_to_cpu(l1t->l1_table[l1t_idx]);
 -	if (l2t_offset & QCOW2_OFLAG_COMPRESSED) {
 -		pr_warning("compressed clusters are not supported");
 -		goto error;
 -	}
 -	if (!(l2t_offset & QCOW2_OFLAG_COPIED)) {
 -		pr_warning("L2 copy-on-write clusters are not supported");
 -		goto error;
 -	}
 -
 -	l2t_offset &= QCOW2_OFFSET_MASK;
 -	if (l2t_offset) {
 -		/* read and cache l2 table */
 +	if (l2t_offset & QCOW2_OFLAG_COPIED) {
 +		l2t_offset &= ~QCOW2_OFLAG_COPIED;
 		l2t = qcow_read_l2_table(q, l2t_offset);
 		if (!l2t)
 			goto error;
 	} else {
 -		l2t = new_cache_table(q, l2t_offset);
 -		if (!l2t)
 +		l2t_new_offset = qcow_alloc_clusters(q, l2t_size*sizeof(u64));
 +		if (l2t_new_offset < 0)
 			goto error;
 
 -		/* Capture the state of the consistent QCOW image */
 -		f_sz = file_size(q->fd);
 -		if (!f_sz)
 -			goto free_cache;
 +		l2t = new_cache_table(q, l2t_new_offset);
 +		if (!l2t)
 +			goto free_cluster;
 
 -		/* Write the l2 table of 0's at the end of the file */
 -		l2t_offset = qcow_write_l2_table(q, l2t->table);
 -		if (!l2t_offset)
 +		if (l2t_offset) {
 +			l2t = qcow_read_l2_table(q, l2t_offset);
 +			if (!l2t)
 +				goto free_cache;
 +		} else
 +			memset(l2t->table, 0x00, l2t_size * sizeof(u64));
 +
 +		/*write l2 table*/
 +		l2t->dirty = 1;
 +		if (qcow_l2_cache_write(q, l2t) < 0)
 			goto free_cache;
 
 -		if (cache_table(q, l2t) < 0) {
 -			if (ftruncate(q->fd, f_sz) < 0)
 -				goto free_cache;
 +		/*cache l2 table*/
 +		cache_table(q, l2t);

You're ignoring the return value here.

Otherwise I didn't find any obvious problem in a quick scan.

Kevin


Re: [PATCH 2/3] kvm: use this_cpu_xxx replace percpu_xxx funcs

2011-11-21 Thread Alex,Shi
On Mon, 2011-10-24 at 19:05 +0800, Avi Kivity wrote:
 On 10/24/2011 04:50 AM, Alex,Shi wrote:
  On Thu, 2011-10-20 at 15:34 +0800, Alex,Shi wrote:
   percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
   for further code clean up.
   
   And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
   performance since __this_cpu_xxx has no redundant preempt_disable()
   
 
  Avi: 
  Would you like to give some comments of this? 
 
 
 Sorry, was travelling:
 
 Acked-by: Avi Kivity a...@redhat.com
 

And this one as well; picking it up or comments are appreciated. :) 



Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-21 Thread Pekka Enberg
On Mon, Nov 21, 2011 at 9:12 AM, Lan Tianyu tianyu@intel.com wrote:
 +/*Allocate clusters according to the size. Find a postion that
 + *can satisfy the size. free_clust_idx is initialized to zero and
 + *Record last position.
 +*/

Can you please fix up your comments to use the following standard kernel style:

/*
 * Allocate clusters according to the size. Find a postion that
 * can satisfy the size. free_clust_idx is initialized to zero and
 * Record last position.
 */

[ Whitespace after asterisk and starting line doesn't have text. ]

 -               rfb = qcow_read_refcount_block(q, clust_idx);
 -               if (!rfb) {
 -                       pr_warning("L1: error while reading refcount table");
 +               clust_new_start = qcow_alloc_clusters(q, q->cluster_size);
 +               if (clust_new_start < 0) {
 +                       pr_warning("Cluster alloc error!");

Please drop the exclamation marks from pr_warning() and pr_error()
messages. It just adds pointless noise.


Re: [PATCH 3/3] Code clean up for percpu_xxx() functions

2011-11-21 Thread Alex,Shi
I've refreshed the patch against the latest upstream kernel. Any comments,
or picking it up, would be appreciated. 

===
From 0dce61dc88b8ed2687b4d5c0633aa54d1f66fdc0 Mon Sep 17 00:00:00 2001
From: Alex Shi alex@intel.com
Date: Tue, 22 Nov 2011 00:05:37 +0800
Subject: [PATCH 3/3] Code clean up for percpu_xxx() functions

The percpu_xxx() family of functions duplicates this_cpu_xxx(). Remove
the percpu_xxx() definitions and replace their uses in the code with
this_cpu_xxx().

Furthermore, as Christoph Lameter requested, I use __this_cpu_xxx in
place of this_cpu_xxx where the context is already preempt safe. The
preempt-safe scenarios include:
1. in an irq/softirq/NMI handler
2. protected by preempt_disable
3. protected by spin_lock
4. the surrounding code context implies preemption is already disabled

I left the Xen code unchanged, since I'm not familiar with it.

BTW, this_cpu_xxx is in fact the same as __this_cpu_xxx on x86, since
each of these functions is implemented as a single instruction there.
That may differ on other platforms, so making the distinction should help
performance on those platforms.

Signed-off-by: Alex Shi alex@intel.com
Acked-by: Christoph Lameter c...@gentwo.org
---
 arch/x86/include/asm/current.h|2 +-
 arch/x86/include/asm/hardirq.h|9 +++--
 arch/x86/include/asm/irq_regs.h   |4 +-
 arch/x86/include/asm/mmu_context.h|   12 
 arch/x86/include/asm/percpu.h |   24 ++-
 arch/x86/include/asm/smp.h|4 +-
 arch/x86/include/asm/stackprotector.h |4 +-
 arch/x86/include/asm/thread_info.h|2 +-
 arch/x86/include/asm/tlbflush.h   |4 +-
 arch/x86/kernel/cpu/common.c  |2 +-
 arch/x86/kernel/cpu/mcheck/mce.c  |4 +-
 arch/x86/kernel/paravirt.c|   12 
 arch/x86/kernel/process_32.c  |2 +-
 arch/x86/kernel/process_64.c  |   12 
 arch/x86/mm/tlb.c |   10 +++---
 arch/x86/xen/enlighten.c  |6 ++--
 arch/x86/xen/irq.c|8 ++--
 arch/x86/xen/mmu.c|   20 ++--
 arch/x86/xen/multicalls.h |2 +-
 arch/x86/xen/smp.c|2 +-
 include/linux/percpu.h|   53 -
 include/linux/topology.h  |4 +-
 22 files changed, 73 insertions(+), 129 deletions(-)

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 4d447b7..9476c04 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -11,7 +11,7 @@ DECLARE_PER_CPU(struct task_struct *, current_task);
 
 static __always_inline struct task_struct *get_current(void)
 {
-   return percpu_read_stable(current_task);
+   return this_cpu_read_stable(current_task);
 }
 
 #define current get_current()
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 55e4de6..2890444 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -35,14 +35,15 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
 #define __ARCH_IRQ_STAT
 
-#define inc_irq_stat(member)   percpu_inc(irq_stat.member)
+#define inc_irq_stat(member)   __this_cpu_inc(irq_stat.member)
 
-#define local_softirq_pending()	percpu_read(irq_stat.__softirq_pending)
+#define local_softirq_pending()	__this_cpu_read(irq_stat.__softirq_pending)
 
 #define __ARCH_SET_SOFTIRQ_PENDING
 
-#define set_softirq_pending(x) percpu_write(irq_stat.__softirq_pending, (x))
-#define or_softirq_pending(x)  percpu_or(irq_stat.__softirq_pending, (x))
+#define set_softirq_pending(x) \
+   __this_cpu_write(irq_stat.__softirq_pending, (x))
+#define or_softirq_pending(x)  __this_cpu_or(irq_stat.__softirq_pending, (x))
 
 extern void ack_bad_irq(unsigned int irq);
 
diff --git a/arch/x86/include/asm/irq_regs.h b/arch/x86/include/asm/irq_regs.h
index 7784322..15639ed 100644
--- a/arch/x86/include/asm/irq_regs.h
+++ b/arch/x86/include/asm/irq_regs.h
@@ -15,7 +15,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_regs);
 
 static inline struct pt_regs *get_irq_regs(void)
 {
-   return percpu_read(irq_regs);
+   return __this_cpu_read(irq_regs);
 }
 
 static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
@@ -23,7 +23,7 @@ static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
struct pt_regs *old_regs;
 
old_regs = get_irq_regs();
-   percpu_write(irq_regs, new_regs);
+   __this_cpu_write(irq_regs, new_regs);
 
return old_regs;
 }
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6902152..02ca533 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -25,8 +25,8 @@ void destroy_context(struct mm_struct *mm);
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
 #ifdef CONFIG_SMP
-   if 

RE: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

2011-11-21 Thread Lan, Tianyu
Yeah. I will fix them in the next version.

-Original Message-
From: penb...@gmail.com [mailto:penb...@gmail.com] On Behalf Of Pekka Enberg
Sent: Monday, November 21, 2011 5:06 PM
To: Lan, Tianyu
Cc: kvm@vger.kernel.org; asias.he...@gmail.com; levinsasha...@gmail.com; 
prasadjoshi...@gmail.com; kw...@redhat.com
Subject: Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters

On Mon, Nov 21, 2011 at 9:12 AM, Lan Tianyu tianyu@intel.com wrote:
 +/*Allocate clusters according to the size. Find a postion that
 + *can satisfy the size. free_clust_idx is initialized to zero and
 + *Record last position.
 +*/

Can you please fix up your comments to use the following standard kernel style:

/*
 * Allocate clusters according to the size. Find a postion that
 * can satisfy the size. free_clust_idx is initialized to zero and
 * Record last position.
 */

[ Whitespace after asterisk and starting line doesn't have text. ]

 -               rfb = qcow_read_refcount_block(q, clust_idx);
 -               if (!rfb) {
 -                       pr_warning("L1: error while reading refcount table");
 +               clust_new_start = qcow_alloc_clusters(q, q->cluster_size);
 +               if (clust_new_start < 0) {
 +                       pr_warning("Cluster alloc error!");

Please drop the exclamation marks from pr_warning() and pr_error()
messages. It just adds pointless noise.


Re: [PATCH 2/3] kvm: use this_cpu_xxx replace percpu_xxx funcs

2011-11-21 Thread Avi Kivity
On 11/21/2011 11:02 AM, Alex,Shi wrote:
 On Mon, 2011-10-24 at 19:05 +0800, Avi Kivity wrote:
  On 10/24/2011 04:50 AM, Alex,Shi wrote:
   On Thu, 2011-10-20 at 15:34 +0800, Alex,Shi wrote:
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.

And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
performance since __this_cpu_xxx has no redundant preempt_disable()

  
   Avi: 
   Would you like to give some comments of this? 
  
  
  Sorry, was travelling:
  
  Acked-by: Avi Kivity a...@redhat.com
  

 And this one, picking up or comments are all appreciated. :) 

Just to be clear, you want this applied in kvm.git?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 2/3] kvm: use this_cpu_xxx replace percpu_xxx funcs

2011-11-21 Thread Alex,Shi
 percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace 
 them
 for further code clean up.
 
 And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
 performance since __this_cpu_xxx has no redundant preempt_disable()
 
   
Avi: 
Would you like to give some comments of this? 
   
   
   Sorry, was travelling:
   
   Acked-by: Avi Kivity a...@redhat.com
   
 
  And this one, picking up or comments are all appreciated. :) 
 
 Just to be clear, you want this applied in kvm.git?
 


Glad to be there! :) 



[Qemu-devel] [PATCH] ivshmem: fix PCI BAR2 registration during initialization

2011-11-21 Thread zanghongyong
From: Hongyong Zang zanghongy...@huawei.com

Ivshmem cannot work, and lspci cannot show the ivshmem BAR2 in the guest.
In pci_register_bar(), the MemoryRegion parameter should be s->bar
instead of s->ivshmem.

Signed-off-by: Hongyong Zang zanghongy...@huawei.com
---
 hw/ivshmem.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 242fbea..2ecf658 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -694,7 +694,7 @@ static int pci_ivshmem_init(PCIDevice *dev)
     s->peers = g_malloc0(s->nb_peers * sizeof(Peer));
 
     pci_register_bar(&s->dev, 2,
-                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->ivshmem);
+                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar);
 
     s->eventfd_chr = g_malloc0(s->vectors * sizeof(CharDriverState *));
 
-- 
1.7.1



Re: [PATCH 5 of 5] virtio: expose added descriptors immediately

2011-11-21 Thread Michael S. Tsirkin
On Mon, Nov 21, 2011 at 12:18:45PM +1030, Rusty Russell wrote:
 On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin m...@redhat.com 
 wrote:
  My unlocked kick patches will trip this warning: they make
  virtio-net do add + get without kick.
 
 Heh, it's a good sign if they do, since that means you're running really
 well :)

They don't in fact, in my testing :(. But I think they can with luck.

  I think block with unlocked kick can trip it too:
  add, lock is dropped and then an interrupt can get.
  
  We also don't need a kick each num - each 2^15 is enough.
  Why don't we do this at start of add_buf:
 	if (vq->num_added >= 0x7fff)
  return -ENOSPC;
 
 The warning was there in case a driver is never doing a kick, and
 getting away with it (mostly) because the device is polling.  Let's not
 penalize good drivers to catch bad ones.
 
 How about we do this properly, like so:

Absolutely. But I think we also need to handle num_added
overflow of a 15 bit counter, no? Otherwise the
vring_need_event logic might give us false negatives 
I'm guessing we can just assume we need a kick in that case.

 From: Rusty Russell ru...@rustcorp.com.au
 Subject: virtio: add debugging if driver doesn't kick.
 
 Under the existing #ifdef DEBUG, check that they don't have more than
 1/10 of a second between an add_buf() and a
 virtqueue_notify()/virtqueue_kick_prepare() call.
 
 We could get false positives on a really busy system, but good for
 development.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---
  drivers/virtio/virtio_ring.c |   31 +++
  1 file changed, 31 insertions(+)
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -22,6 +22,7 @@
  #include <linux/device.h>
  #include <linux/slab.h>
  #include <linux/module.h>
 +#include <linux/hrtimer.h>
  
  /* virtio guest is communicating with a virtual device that actually runs on
   * a host processor.  Memory barriers are used to control SMP effects. */
 @@ -102,6 +103,10 @@ struct vring_virtqueue
  #ifdef DEBUG
   /* They're supposed to lock for us. */
   unsigned int in_use;
 +
 + /* Figure out if their kicks are too delayed. */
 + bool last_add_time_valid;
 + ktime_t last_add_time;
  #endif
  
   /* Tokens for callbacks. */
 @@ -192,6 +197,19 @@ int virtqueue_add_buf(struct virtqueue *
  
   BUG_ON(data == NULL);
  
 +#ifdef DEBUG
 +	{
 +		ktime_t now = ktime_get();
 +
 +		/* No kick or get, with .1 second between?  Warn. */
 +		if (vq->last_add_time_valid)
 +			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
 +				> 100);
 +		vq->last_add_time = now;
 +		vq->last_add_time_valid = true;
 +	}
 +#endif
 +
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && (out + in) > 1 && vq->num_free) {
 @@ -291,6 +309,14 @@ bool virtqueue_kick_prepare(struct virtq
 	new = vq->vring.avail->idx;
 	vq->num_added = 0;
 
 +#ifdef DEBUG
 +	if (vq->last_add_time_valid) {
 +		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
 +					      vq->last_add_time)) > 100);
 +	}
 +	vq->last_add_time_valid = false;
 +#endif
 +
 	if (vq->event) {
 		needs_kick = vring_need_event(vring_avail_event(&vq->vring),
 					      new, old);
 @@ -428,6 +454,10 @@ void *virtqueue_get_buf(struct virtqueue
   virtio_mb();
   }
  
 +#ifdef DEBUG
 +	vq->last_add_time_valid = false;
 +#endif
 +
   END_USE(vq);
   return ret;
  }
 @@ -611,6 +641,7 @@ struct virtqueue *vring_new_virtqueue(un
 	list_add_tail(&vq->vq.list, &vdev->vqs);
  #ifdef DEBUG
 	vq->in_use = false;
 +	vq->last_add_time_valid = false;
  #endif
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);


[RFC] Consistent Snapshots Idea

2011-11-21 Thread Richard Laager
I'm not an expert on the architecture of KVM, so perhaps this is a QEMU
question. If so, please let me know and I'll ask on a different list.

Background:

Assuming the block layer can make instantaneous snapshots of a guest's
disk (e.g. lvcreate -s), one can get crash consistent (i.e. as if the
guest crashed) snapshots. To get a fully consistent snapshot, you need
to shutdown the guest. For production VMs, this is obviously not ideal.

Idea:

What if KVM/QEMU was to fork() the guest and shutdown one copy?

KVM/QEMU would momentarily halt the execution of the guest and take a
writable, instantaneous snapshot of each block device. Then it would
fork(). The parent would resume execution as normal. The child would
redirect disk writes to the snapshot(s). The RAM should have
copy-on-write behavior as with any other fork()ed process. Other
resources like the network, display, sound, serial, etc. would simply be
disconnected/bit-bucketed. Finally, the child would resume guest
execution and send the guest an ACPI power button press event. This
would cause the guest OS to perform an orderly shutdown.

I believe this would provide consistent snapshots in the vast majority
of real-world scenarios in a guest OS and application-independent way.

Implementation Nits:

  * A timeout on the child process would likely be a good idea.
  * It'd probably be best to disconnect the network (i.e. tell the
guest the cable is unplugged) to avoid long timeouts. Likewise
for the hardware flow-control lines on the serial port.
  * For correctness, fdatasync()ing or similar might be necessary
after halting execution and before creating the snapshots.

Richard




Re: [Qemu-devel] [PATCH] ivshmem: fix PCI BAR2 registration during initialization

2011-11-21 Thread Avi Kivity
On 11/21/2011 12:56 PM, zanghongy...@huawei.com wrote:
 From: Hongyong Zang zanghongy...@huawei.com

 Ivshmem cannot work, and lspci cannot show the ivshmem BAR2 in the guest.
 In pci_register_bar(), the MemoryRegion parameter should be s->bar
 instead of s->ivshmem.

 Signed-off-by: Hongyong Zang zanghongy...@huawei.com
 ---
  hw/ivshmem.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/hw/ivshmem.c b/hw/ivshmem.c
 index 242fbea..2ecf658 100644
 --- a/hw/ivshmem.c
 +++ b/hw/ivshmem.c
 @@ -694,7 +694,7 @@ static int pci_ivshmem_init(PCIDevice *dev)
      s->peers = g_malloc0(s->nb_peers * sizeof(Peer));
  
      pci_register_bar(&s->dev, 2,
 -                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->ivshmem);
 +                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar);
  
      s->eventfd_chr = g_malloc0(s->vectors * sizeof(CharDriverState *));
  

Reviewed-by: Avi Kivity a...@redhat.com

This is 1.0 worthy.

-- 
error compiling committee.c: too many arguments to function



Re: [RFC] Consistent Snapshots Idea

2011-11-21 Thread Avi Kivity
On 11/21/2011 02:01 PM, Richard Laager wrote:
 I'm not an expert on the architecture of KVM, so perhaps this is a QEMU
 question. If so, please let me know and I'll ask on a different list.

It is a qemu question, yes (though fork()ing a guest also relates to kvm).

 Background:

 Assuming the block layer can make instantaneous snapshots of a guest's
 disk (e.g. lvcreate -s), one can get crash consistent (i.e. as if the
 guest crashed) snapshots. To get a fully consistent snapshot, you need
 to shutdown the guest. For production VMs, this is obviously not ideal.

 Idea:

 What if KVM/QEMU was to fork() the guest and shutdown one copy?

 KVM/QEMU would momentarily halt the execution of the guest and take a
 writable, instantaneous snapshot of each block device. Then it would
 fork(). The parent would resume execution as normal. The child would
 redirect disk writes to the snapshot(s). The RAM should have
 copy-on-write behavior as with any other fork()ed process. Other
 resources like the network, display, sound, serial, etc. would simply be
 disconnected/bit-bucketed. Finally, the child would resume guest
 execution and send the guest an ACPI power button press event. This
 would cause the guest OS to perform an orderly shutdown.

 I believe this would provide consistent snapshots in the vast majority
 of real-world scenarios in a guest OS and application-independent way.

Interesting idea.  Will the guest actually shut down nicely without a
network?  Things like NFS mounts will break.

 Implementation Nits:

   * A timeout on the child process would likely be a good idea.
   * It'd probably be best to disconnect the network (i.e. tell the
 guest the cable is unplugged) to avoid long timeouts. Likewise
 for the hardware flow-control lines on the serial port.

This is actually critical, otherwise the guest will shutdown(2) all
sockets and confuse the clients.

   * For correctness, fdatasync()ing or similar might be necessary
 after halting execution and before creating the snapshots.

Microsoft guests have an API to quiesce storage prior to a snapshot, and
I think there is work to bring this to Linux guests.  So it should be
possible to get consistent snapshots even without this, but it takes
more integration.

-- 
error compiling committee.c: too many arguments to function



[PATCH v2] kvm tools, qcow: Add the support for copy-on-write cluster

2011-11-21 Thread Lan Tianyu
When a write request targets a cluster without the copied flag, allocate
a new cluster and write the original data, with the modification, to the
new cluster. This also adds support for writing to qcow2 compressed
images. After testing, the image file passes qemu-img check. Performance
still needs to be improved.

Signed-off-by: Lan Tianyu tianyu@intel.com
---
 tools/kvm/disk/qcow.c        |  421 +-
 tools/kvm/include/kvm/qcow.h |    2 +
 2 files changed, 292 insertions(+), 131 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..99b13e7 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+   void *buf, size_t count, off_t offset)
+{
+   if (pwrite_in_full(fd, buf, count, offset) < 0)
+   return -1;
+
+   return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct 
qcow_l2_table *c)
 
size = 1  header-l2_bits;
 
-   if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+   if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
return -1;
 
c-dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 	 */
 	lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-   if (qcow_l2_cache_write(q, lru)  0)
-   goto error;
-
/* Remove the node from the cache */
rb_erase(lru-node, r);
list_del_init(lru-list);
@@ -518,14 +526,6 @@ static inline u64 file_size(int fd)
 	return st.st_size;
 }
 
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
-
 /* Writes a level 2 table at the end of the file. */
 static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
 {
@@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +704,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
+
rfb = refcount_block_search(q, rfb_offset);
if (rfb)
return rfb;
@@ -728,35 +731,140 @@ error_free_rfb:
return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & ((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1);
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & ((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1);
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;

Re: [Qemu-devel] [RFC] Consistent Snapshots Idea

2011-11-21 Thread shu ming

On 2011-11-21 20:31, Avi Kivity wrote:

On 11/21/2011 02:01 PM, Richard Laager wrote:

I'm not an expert on the architecture of KVM, so perhaps this is a QEMU
question. If so, please let me know and I'll ask on a different list.

It is a qemu question, yes (though fork()ing a guest also relates to kvm).


Background:

Assuming the block layer can make instantaneous snapshots of a guest's
disk (e.g. lvcreate -s), one can get crash consistent (i.e. as if the
guest crashed) snapshots. To get a fully consistent snapshot, you need
to shutdown the guest. For production VMs, this is obviously not ideal.

Idea:

What if KVM/QEMU was to fork() the guest and shutdown one copy?

KVM/QEMU would momentarily halt the execution of the guest and take a
writable, instantaneous snapshot of each block device. Then it would
fork(). The parent would resume execution as normal. The child would
redirect disk writes to the snapshot(s). The RAM should have
copy-on-write behavior as with any other fork()ed process. Other
resources like the network, display, sound, serial, etc. would simply be
disconnected/bit-bucketed. Finally, the child would resume guest
execution and send the guest an ACPI power button press event. This
would cause the guest OS to perform an orderly shutdown.

I believe this would provide consistent snapshots in the vast majority
of real-world scenarios in a guest OS and application-independent way.
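The memory side of the proposal is just the copy-on-write semantics of
fork(2): the child sees a frozen copy of the address space and its writes
stay private. A self-contained illustration of that property (plain C, no
QEMU involved):

```c
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent's view of "guest RAM" survives the child's
 * writes unchanged, i.e. fork() gave us a copy-on-write snapshot. */
static int cow_snapshot_demo(void)
{
    char *ram = malloc(4096);            /* stands in for guest RAM */
    strcpy(ram, "pre-snapshot state");

    pid_t child = fork();
    if (child == 0) {
        /* Child: a frozen copy of the guest. Its writes (the orderly
         * shutdown scribbling over memory) stay private to it. */
        strcpy(ram, "child shutdown scribbles");
        _exit(0);
    }
    waitpid(child, NULL, 0);

    int intact = strcmp(ram, "pre-snapshot state") == 0;
    free(ram);
    return intact;
}
```

In the real scheme the child would additionally redirect disk writes to the
block snapshot and disconnect the devices before delivering the ACPI
power-button event.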

Interesting idea.  Will the guest actually shut down nicely without a
network?  Things like NFS mounts will break.


Do the child and parent processes run in parallel?  What will happen if 
the parent process tries to access the block device? It looks like the 
child process will write to a snapshot file, but where will the 
parent process write to?





Implementation Nits:

   * A timeout on the child process would likely be a good idea.
   * It'd probably be best to disconnect the network (i.e. tell the
 guest the cable is unplugged) to avoid long timeouts. Likewise
 for the hardware flow-control lines on the serial port.

This is actually critical, otherwise the guest will shutdown(2) all
sockets and confuse the clients.


   * For correctness, fdatasync()ing or similar might be necessary
 after halting execution and before creating the snapshots.

Microsoft guests have an API to quiesce storage prior to a snapshot, and
I think there is work to bring this to Linux guests.  So it should be
possible to get consistent snapshots even without this, but it takes
more integration.






Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Bharata B Rao
On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote:
 * Alexander Graf (ag...@suse.de) wrote:
  On 29.10.2011, at 20:45, Bharata B Rao wrote:
   As guests become NUMA aware, it becomes important for the guests to
   have correct NUMA policies when they run on NUMA aware hosts.
   Currently limited support for NUMA binding is available via libvirt
   where it is possible to apply a NUMA policy to the guest as a whole.
   However multinode guests would benefit if guest memory belonging to
   different guest nodes are mapped appropriately to different host NUMA 
   nodes.
   
   To achieve this we would need QEMU to expose information about
   guest RAM ranges (Guest Physical Address - GPA) and their host virtual
   address mappings (Host Virtual Address - HVA). Using GPA and HVA, any 
   external
   tool like libvirt would be able to divide the guest RAM as per the guest 
   NUMA
   node geometry and bind guest memory nodes to corresponding host memory 
   nodes
   using HVA. This needs both QEMU (and libvirt) changes as well as changes
   in the kernel.
  
  Ok, let's take a step back here. You are basically growing libvirt into a 
  memory resource manager that know how much memory is available on which 
  nodes and how these nodes would possibly fit into the host's memory layout.
  
  Shouldn't that be the kernel's job? It seems to me that architecturally the 
  kernel is the place I would want my memory resource controls to be in.
 
 I think that both Peter and Andrea are looking at this.  Before we commit
 an API to QEMU that has a different semantic than a possible new kernel
 interface (that perhaps QEMU could use directly to inform kernel of the
 binding/relationship between vcpu thread and it's memory at VM startuup)
 it would be useful to see what these guys are working on...

I looked at Peter's recent work in this area.
(https://lkml.org/lkml/2011/11/17/204)

It introduces two interfaces:

1. ms_tbind() to bind a thread to a memsched(*) group
2. ms_mbind() to bind a memory region to memsched group

I assume the 2nd interface could be used by QEMU to create
memsched groups for each of guest NUMA node memory regions.

In the past, Anthony has said that NUMA binding should be done from outside
of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)
Though that was in a different context, may be we should re-look at that
and see if QEMU still sticks to that. I know its a bit early, but if needed
we should ask Peter to consider extending ms_mbind() to take a tid parameter
too instead of working on current task by default.

(*) memsched: An abstraction for representing coupling of threads with virtual
address ranges. Threads and virtual address ranges of a memsched group are
guaranteed (?) to be located on the same node.

Regards,
Bharata.



Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Peter Zijlstra
On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:

 I looked at Peter's recent work in this area.
 (https://lkml.org/lkml/2011/11/17/204)
 
 It introduces two interfaces:
 
 1. ms_tbind() to bind a thread to a memsched(*) group
 2. ms_mbind() to bind a memory region to memsched group
 
 I assume the 2nd interface could be used by QEMU to create
 memsched groups for each of guest NUMA node memory regions.

No, you would need both, you'll need to group vcpu threads _and_ some
vaddress space together.

I understood QEMU currently uses a single big anonymous mmap() to
allocate the guest memory, using this you could either use multiple or
carve up the big alloc into virtual nodes by assigning different parts
to different ms groups.

Example: suppose you want to create a 2 node guest with 8 vcpus, create
2 ms groups, each with 4 vcpu threads and assign half the total guest
mmap to either.
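Peter's example could look roughly like the following from QEMU's side.
This is pseudocode in C syntax: ms_create(), ms_tbind() and ms_mbind() are
the *proposed* memsched interfaces from the RFC, and their exact names and
signatures here are assumptions, not a merged API:

```c
/* Hypothetical sketch: a 2-node guest with 8 vcpus.  Two memsched
 * groups, 4 vcpu threads each, and the single big guest mmap carved
 * into two halves, one per virtual node. */
void setup_virtual_nodes(void *guest_ram, size_t ram_size,
                         pid_t vcpu_tids[8])
{
    size_t half = ram_size / 2;

    int g0 = ms_create();            /* one group per virtual node */
    int g1 = ms_create();

    for (int i = 0; i < 4; i++) {
        ms_tbind(vcpu_tids[i],     g0);   /* vcpus 0-3 -> node 0 */
        ms_tbind(vcpu_tids[i + 4], g1);   /* vcpus 4-7 -> node 1 */
    }

    /* carve the big anonymous mmap into virtual nodes */
    ms_mbind(guest_ram,                half, g0);
    ms_mbind((char *)guest_ram + half, half, g1);
}
```

The kernel is then free to place each group's threads and pages on the same
host node, which is exactly the coupling the discussion is about.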

 In the past, Anthony has said that NUMA binding should be done from outside
 of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)

If you want to expose a sense of virtual NUMA to your guest you really
have no choice there. The only thing you can do externally is run whole
VMs inside one particular node.

 Though that was in a different context, may be we should re-look at that
 and see if QEMU still sticks to that. I know its a bit early, but if needed
 we should ask Peter to consider extending ms_mbind() to take a tid parameter
 too instead of working on current task by default.

Uh, what for? ms_mbind() works on the current process, not task.

 (*) memsched: An abstraction for representing coupling of threads with virtual
 address ranges. Threads and virtual address ranges of a memsched group are
 guaranteed (?) to be located on the same node.

Yeah, more or less so. We could relax that slightly to allow tasks to
run away from the node for very short periods of time, but basically
that's the provided guarantee.


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Bharata B Rao
On Mon, Nov 21, 2011 at 04:25:26PM +0100, Peter Zijlstra wrote:
 On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:
 
  I looked at Peter's recent work in this area.
  (https://lkml.org/lkml/2011/11/17/204)
  
  It introduces two interfaces:
  
  1. ms_tbind() to bind a thread to a memsched(*) group
  2. ms_mbind() to bind a memory region to memsched group
  
  I assume the 2nd interface could be used by QEMU to create
  memsched groups for each of guest NUMA node memory regions.
 
 No, you would need both, you'll need to group vcpu threads _and_ some
 vaddress space together.
 
 I understood QEMU currently uses a single big anonymous mmap() to
 allocate the guest memory, using this you could either use multiple or
 carve up the big alloc into virtual nodes by assigning different parts
 to different ms groups.
 
 Example: suppose you want to create a 2 node guest with 8 vcpus, create
 2 ms groups, each with 4 vcpu threads and assign half the total guest
 mmap to either.
 
  In the past, Anthony has said that NUMA binding should be done from outside
  of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)
 
 If you want to expose a sense of virtual NUMA to your guest you really
 have no choice there. The only thing you can do externally is run whole
 VMs inside one particular node.
 
  Though that was in a different context, may be we should re-look at that
  and see if QEMU still sticks to that. I know its a bit early, but if needed
  we should ask Peter to consider extending ms_mbind() to take a tid parameter
  too instead of working on current task by default.
 
 Uh, what for? ms_mbind() works on the current process, not task.

In the original post of this mail thread, I proposed a way to export
guest RAM ranges (Guest Physical Address-GPA) and their corresponding
host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
The idea was to use this GPA to HVA mappings from tools like libvirt to bind
specific parts of the guest RAM to different host nodes. This needed an
extension to existing mbind() to allow binding memory of a process (QEMU)
from a different process (libvirt). This was needed since we wanted to do
all this from libvirt.

Hence I was coming from that background when I asked for extending
ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
binding should all be done from outside of QEMU, it is needed, otherwise
what you have should be sufficient.

Regards,
Bharata.



KVM call agenda for November 22

2011-11-21 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Later, Juan.


Re: Memory sync algorithm during migration

2011-11-21 Thread Oliver Hookins
On Tue, Nov 15, 2011 at 11:47:58AM +0100, ext Juan Quintela wrote:
 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp wrote:
  Adding qemu-devel ML to CC.
 
  Your question should have been sent to qemu-devel ML because the logic
  is implemented in QEMU, not KVM.
 
  (2011/11/11 1:35), Oliver Hookins wrote:
  Hi,
 
  I am performing some benchmarks on KVM migration on two different types of 
  VM.
  One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM takes 
  about 20
  seconds to migrate on our hardware while the 32GB VM takes about a minute.
 
  With a reasonable amount of memory activity going on (in the hundreds of 
  MB per
  second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never
  completes. Intuitively this tells me there is some watermarking of dirty 
  pages
  going on that is not particularly efficient when the dirty pages ratio is 
  high
  compared to total memory, but I may be completely incorrect.
 
  You can change the ratio IIRC.
  Hopefully, someone who knows well about QEMU will tell you better ways.
 
  Takuya
 
 
  Could anybody fill me in on what might be going on here? We're using 
  libvirt
  0.8.2 and kvm-83-224.el5.centos.1
 
 This is pretty old qemu/kvm code base.
 In principle, it makes no sense that with 32GB RAM migration finishes,
 and with 4GB RAM it is unable (intuitively it should be, if ever, the
 other way around).
 
 Do you have an easy test that makes the problem easily reproducible?
 Have you tried upstream qemu.git? (some improvements in that department).

I've just tried the qemu-kvm 0.14.1 tag which seems to be the latest that builds
on my platform. For some strange reason migrations always seem to fail in one
direction with "Unknown savevm section or instance 'hpet' 0" messages.

This seems to point to different migration protocols on either end but they are
both running the same version of qemu-kvm I built. Does this ring any bells for
anyone?


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Peter Zijlstra
On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:
 
 In the original post of this mail thread, I proposed a way to export
 guest RAM ranges (Guest Physical Address-GPA) and their corresponding
 host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
 The idea was to use this GPA to HVA mappings from tools like libvirt to bind
 specific parts of the guest RAM to different host nodes. This needed an
 extension to existing mbind() to allow binding memory of a process (QEMU)
 from a different process (libvirt). This was needed since we wanted to do
 all this from libvirt.
 
 Hence I was coming from that background when I asked for extending
 ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
 binding should all be done from outside of QEMU, it is needed, otherwise
 what you have should be sufficient. 

That's just retarded, and no you won't get such extensions. Poking at
another process's virtual address space is just daft. Esp. if there's no
actual reason for it.

Furthermore, it would make libvirt a required part of qemu, and since I
don't think I've ever use libvirt that's another reason to object, I
don't need that stinking mess.


[PATCH for v1.0] qemu-kvm: msix: Fire mask notifier on global mask changes

2011-11-21 Thread Michael S. Tsirkin
From: Jan Kiszka jan.kis...@siemens.com

Also invoke the mask notifier if the global MSI-X mask is modified. For
this purpose, we push the notifier call from the per-vector mask update
to the central msix_handle_mask_update.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This needs to be applied on top of:
msix: avoid mask updates if mask is unchanged
I'll send a reminder once it's merged.

 hw/msix.c |   11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 56422c6..e50074f 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -231,6 +231,12 @@ static void msix_handle_mask_update(PCIDevice *dev, int vector, bool was_masked)
         return;
     }
 
+    if (dev->msix_mask_notifier) {
+        int ret;
+        ret = dev->msix_mask_notifier(dev, vector, is_masked);
+        assert(ret >= 0);
+    }
+
     if (!is_masked && msix_is_pending(dev, vector)) {
         msix_clr_pending(dev, vector);
         msix_notify(dev, vector);
@@ -292,11 +298,6 @@ static void msix_mmio_write(void *opaque, target_phys_addr_t addr,
     if (kvm_enabled() && kvm_irqchip_in_kernel()) {
         kvm_msix_update(dev, vector, was_masked, msix_is_masked(dev, vector));
     }
-    if (was_masked != msix_is_masked(dev, vector) && dev->msix_mask_notifier) {
-        int r = dev->msix_mask_notifier(dev, vector,
-                                        msix_is_masked(dev, vector));
-        assert(r >= 0);
-    }
     msix_handle_mask_update(dev, vector, was_masked);
 }
 
-- 
1.7.5.53.gc233e


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-21 Thread Greg Rose

On 11/18/2011 9:40 AM, Ben Hutchings wrote:

On Fri, 2011-11-18 at 08:58 -0800, Greg Rose wrote:

On 11/17/2011 4:44 PM, Ben Hutchings wrote:

On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:

On 11/17/2011 4:15 PM, Ben Hutchings wrote:

Sorry to come to this rather late.

On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
[...]

v2 -> v3
- Moved set and get filter ops from rtnl_link_ops to netdev_ops
- Support for SRIOV VFs.
   [Note: The get filters msg (in the way current get rtnetlink handles
   it) might get too big for SRIOV vfs. This patch follows existing sriov
   vf get code and tries to accommodate filters for all VF's in a PF.
   And for the SRIOV case I have only tested the fact that the VF
   arguments are getting delivered to rtnetlink correctly. The code
   follows existing sriov vf handling code so rest of it should work fine]

[...]

This is already broken for large numbers of VFs, and increasing the
amount of information per VF is going to make the situation worse.  I am
no netlink expert but I think that the current approach of bundling all
information about an interface in a single message may not be
sustainable.

Also, I'm unclear on why this interface is to be used to set filtering
for the (PF) net device as well as for related VFs.  Doesn't that
duplicate the functionality of ndo_set_rx_mode and
ndo_vlan_rx_{add,kill}_vid?


Functionally yes but contextually no.  This allows the PF driver to know
that it is setting these filters in the context of the existence of VFs,
allowing it to take appropriate action.  The other two functions may be
called without the presence of SR-IOV enablement and the existence of VFs.

Anyway, that's why I asked Roopa to add that capability.


I don't follow.  The PF driver already knows whether it has enabled VFs.

How do filters set this way interact with filters set through the
existing operations?  Should they override promiscuous mode?  None of
this has been specified.


Promiscuous mode is exactly the issue this feature is intended for.  I'm
not familiar with the solarflare device but Intel HW promiscuous mode is
only promiscuous on the physical port, not on the VEB.  So a packet sent
from a VF will not be captured by the PF across the VEB unless the MAC
and VLAN filters have been programmed into the HW.


Yes, I get it.  The hardware bridge needs to know more about the address
configuration on the host than the driver is getting at the moment.


So you may not need
the feature for your devices but it is required for Intel devices.


Well we don't have the hardware bridge but that means each VF driver
needs to know whether to fall back to the software bridge.  The net
driver needs much the same additional information.


And
it's a fairly simple request, just allow -1 to indicate that the target
of the filter requests is for the PF itself.  Using the already existing
set_rx_mode function won't work because the PF driver will look at it
and figure it's in promiscuous mode anyway, so it won't set the filters
into the HW.  At least that is how it is in the case of our HW and
driver.  Again, the behavior of your HW and driver is unknown to me and
thus you may not require this feature.


What concerns me is that this seems to be a workaround rather than a fix
for over-use of promiscuous mode, and it changes the semantics of
filtering modes in ways that haven't been well-specified.


I feel the opposite is true.  It allows a known set of receive filters 
so that you don't have to use promiscuous mode, which cuts down on 
overhead from processing packets the upper layer stack isn't really 
interested in.




What if there's a software bridge between two net devices corresponding
to separate physical ports, so that they really need to be promiscuous?
What if the administrator runs tcpdump and really wants the (PF) net
device to be promiscuous?


I don't believe there is anything in this patch set that removes 
promiscuous mode operation as it is commonly used.  Perhaps I've missed 
something.




These cases shouldn't break because of VF acceleration.


I don't see how they would in the case of Intel HW.  And these new ops 
to set rx filters are something that Intel HW needs because it 
implements a VEB that is *not* a learning bridge and we don't want it to 
be a learning bridge because of security concerns.  It is better for our 
requirements to be allowed to set the MAC/VLAN filters that we want a VF 
to be able to see.



Or at least we
should make a conscious and documented decision that 'promiscuous'
doesn't mean that if you enable it on your network adapter.


Again, I don't know of any plans to change anything relating to putting 
a device in promiscuous mode or changing the semantics of what 
promiscuous mode means.


- Greg


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Avi Kivity
On 11/21/2011 05:25 PM, Peter Zijlstra wrote:
 On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:

  I looked at Peter's recent work in this area.
  (https://lkml.org/lkml/2011/11/17/204)
  
  It introduces two interfaces:
  
  1. ms_tbind() to bind a thread to a memsched(*) group
  2. ms_mbind() to bind a memory region to memsched group
  
  I assume the 2nd interface could be used by QEMU to create
  memsched groups for each of guest NUMA node memory regions.

 No, you would need both, you'll need to group vcpu threads _and_ some
 vaddress space together.

 I understood QEMU currently uses a single big anonymous mmap() to
 allocate the guest memory, using this you could either use multiple or
 carve up the big alloc into virtual nodes by assigning different parts
 to different ms groups.

 Example: suppose you want to create a 2 node guest with 8 vcpus, create
 2 ms groups, each with 4 vcpu threads and assign half the total guest
 mmap to either.


Does ms_mbind() require that its vmas in its area be completely
contained in the region, or does it split vmas on demand?  I suggest the
latter to avoid exposing implementation details.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Peter Zijlstra
On Mon, 2011-11-21 at 20:03 +0200, Avi Kivity wrote:
 
 Does ms_mbind() require that its vmas in its area be completely
 contained in the region, or does it split vmas on demand?  I suggest the
 latter to avoid exposing implementation details. 

as implemented (which is still rather incomplete) it does the split on
demand like all other memory interfaces.


Re: [PATCHv4 0/3] acpi: DSDT/SSDT runtime patching

2011-11-21 Thread Michael S. Tsirkin
On Sun, Nov 20, 2011 at 04:08:59PM -0500, Kevin O'Connor wrote:
 On Sun, Nov 20, 2011 at 07:56:43PM +0200, Michael S. Tsirkin wrote:
  Here's an updated revision of acpi runtime patching patchset.
  Lightly tested.
 
 It looks good to me.
 
 -Kevin

Run some linux and windows tests, things seem to work
smoothly. Pls apply.

-- 
MST


Re: [Qemu-devel] KVM call agenda for November 22

2011-11-21 Thread Anthony Liguori

On 11/21/2011 10:00 AM, Juan Quintela wrote:


Hi

Please send in any agenda items you are interested in covering.


I'm technically on holiday this week so I won't be attending.

But as an FYI, I ran across seccomp-nurse[1] this weekend.  It more or less 
lets you write a python program to implement a userspace syscall whitelist.


I haven't looked at the code closely enough yet, but I think the technique it uses 
is to create a companion thread alongside the sandbox thread.  This thread only 
runs code in an area mapped read-only and presumably only uses thread local storage.


The companion thread isn't running in the sandbox, but has the same resources as 
the sandbox thread so it can essentially invoke syscalls on behalf of the 
sandbox thread.


It's seriously non-portable.  In fact, it only works on 32-bit x86 Linux right 
now.  But it's worth looking into.


[1] http://chdir.org/~nico/seccomp-nurse/
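The proxy pattern described above — sandboxed code marshalling syscall
requests to a trusted companion thread that performs them on its behalf —
can be sketched with plain pthreads. This is an illustration of the
technique only, not seccomp-nurse's actual code: it omits the seccomp
filter itself, and in the real tool the channel is set up before the
sandbox is entered so that read()/write() on it remain permitted:

```c
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Request the sandboxed thread sends instead of invoking the
 * syscall directly (only write() is proxied in this sketch). */
struct sys_req {
    int  fd;
    int  len;
    char buf[64];
};

static int chan[2];   /* [0] = sandbox end, [1] = companion end */

/* Companion thread: conceptually outside the syscall whitelist,
 * performing write() on behalf of the sandboxed thread. */
static void *companion(void *arg)
{
    struct sys_req req;
    long ret;

    while (read(chan[1], &req, sizeof(req)) == sizeof(req)) {
        ret = write(req.fd, req.buf, req.len);
        if (write(chan[1], &ret, sizeof(ret)) != sizeof(ret))
            break;                    /* channel gone, stop serving */
    }
    return NULL;
}

/* What sandboxed code calls instead of write(2): marshal the
 * request, hand it to the companion, wait for the result. */
static long proxied_write(int fd, const char *buf, int len)
{
    struct sys_req req = { .fd = fd, .len = len };
    long ret = -1;

    memcpy(req.buf, buf, (size_t)len);
    if (write(chan[0], &req, sizeof(req)) != sizeof(req))
        return -1;
    if (read(chan[0], &ret, sizeof(ret)) != sizeof(ret))
        return -1;
    return ret;
}
```

Fixed-size request/response records over a socketpair keep the protocol
trivial; a real implementation would also pass file descriptors and vet
each request against policy before executing it.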

Regards,

Anthony Liguori



Later, Juan.





Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Chris Wright
* Peter Zijlstra (a.p.zijls...@chello.nl) wrote:
 On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:
  
  In the original post of this mail thread, I proposed a way to export
  guest RAM ranges (Guest Physical Address-GPA) and their corresponding host
  host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU 
  monitor).
  The idea was to use this GPA to HVA mappings from tools like libvirt to bind
  specific parts of the guest RAM to different host nodes. This needed an
  extension to existing mbind() to allow binding memory of a process(QEMU) 
  from a
  different process(libvirt). This was needed since we wanted to do all this 
  from
  libvirt.
  
  Hence I was coming from that background when I asked for extending
  ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
  binding should all be done from outside of QEMU, it is needed, otherwise
  what you have should be sufficient. 
 
 That's just retarded, and no you won't get such extentions. Poking at
 another process's virtual address space is just daft. Esp. if there's no
 actual reason for it.

Need to separate the binding vs the policy mgmt.  The policy mgmt could
still be done outside, whereas the binding could still be done from w/in
QEMU.  A simple monitor interface to rebalance vcpu memory allocations
to different nodes could very well schedule vcpu thread work in QEMU.

So, I agree, even if there is some external policy mgmt, it could still
easily work w/ QEMU to use Peter's proposed interface.

thanks,
-chris


Re: [PATCH 3/3] Code clean up for percpu_xxx() functions

2011-11-21 Thread t...@kernel.org
(cc'ing hpa and quoting whole body)

On Mon, Nov 21, 2011 at 05:10:12PM +0800, Alex,Shi wrote:
 refreshed the patch on latest upstream kernel. Any comments or picking
 up are appreciated. 
 
 ===
 From 0dce61dc88b8ed2687b4d5c0633aa54d1f66fdc0 Mon Sep 17 00:00:00 2001
 From: Alex Shi alex@intel.com
 Date: Tue, 22 Nov 2011 00:05:37 +0800
 Subject: [PATCH 3/3] Code clean up for percpu_xxx() functions
 
 Since the percpu_xxx() family of functions duplicates this_cpu_xxx(),
 remove the percpu_xxx() definitions and replace their uses with
 this_cpu_xxx() in the code.
 
 Furthermore, per Christoph Lameter's request, I try to use
 __this_cpu_xxx to replace this_cpu_xxx where the context is preempt safe.
 The preempt safe scenarios include:
 1, in irq/softirq/nmi handler
 2, protected by preempt_disable
 3, protected by spin_lock
 4, if the code context imply that it is preempt safe, like the code is
 follows or be followed a preempt safe code.
 
 I left the xen code unchanged, since no idea of them.
 
 BTW, on x86 this_cpu_xxx is in fact the same as __this_cpu_xxx, since
 each of these functions is implemented as a single instruction. But they
 may differ on other platforms, so making this distinction helps those
 platforms' performance.
 
 Signed-off-by: Alex Shi alex@intel.com
 Acked-by: Christoph Lameter c...@gentwo.org

 Acked-by: Tejun Heo t...@kernel.org

hpa, I suppose this should go through x86?  The original patch can be
accessed at

  http://article.gmane.org/gmane.linux.kernel/1218055/raw

Thanks.

  arch/x86/include/asm/current.h|2 +-
  arch/x86/include/asm/hardirq.h|9 +++--
  arch/x86/include/asm/irq_regs.h   |4 +-
  arch/x86/include/asm/mmu_context.h|   12 
  arch/x86/include/asm/percpu.h |   24 ++-
  arch/x86/include/asm/smp.h|4 +-
  arch/x86/include/asm/stackprotector.h |4 +-
  arch/x86/include/asm/thread_info.h|2 +-
  arch/x86/include/asm/tlbflush.h   |4 +-
  arch/x86/kernel/cpu/common.c  |2 +-
  arch/x86/kernel/cpu/mcheck/mce.c  |4 +-
  arch/x86/kernel/paravirt.c|   12 
  arch/x86/kernel/process_32.c  |2 +-
  arch/x86/kernel/process_64.c  |   12 
  arch/x86/mm/tlb.c |   10 +++---
  arch/x86/xen/enlighten.c  |6 ++--
  arch/x86/xen/irq.c|8 ++--
  arch/x86/xen/mmu.c|   20 ++--
  arch/x86/xen/multicalls.h |2 +-
  arch/x86/xen/smp.c|2 +-
  include/linux/percpu.h|   53 
 -
  include/linux/topology.h  |4 +-
  22 files changed, 73 insertions(+), 129 deletions(-)
 
 diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
 index 4d447b7..9476c04 100644
 --- a/arch/x86/include/asm/current.h
 +++ b/arch/x86/include/asm/current.h
 @@ -11,7 +11,7 @@ DECLARE_PER_CPU(struct task_struct *, current_task);
  
  static __always_inline struct task_struct *get_current(void)
  {
 - return percpu_read_stable(current_task);
 + return this_cpu_read_stable(current_task);
  }
  
  #define current get_current()
 diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
 index 55e4de6..2890444 100644
 --- a/arch/x86/include/asm/hardirq.h
 +++ b/arch/x86/include/asm/hardirq.h
 @@ -35,14 +35,15 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
  
  #define __ARCH_IRQ_STAT
  
 -#define inc_irq_stat(member) percpu_inc(irq_stat.member)
 +#define inc_irq_stat(member) __this_cpu_inc(irq_stat.member)
  
-#define local_softirq_pending()	percpu_read(irq_stat.__softirq_pending)
+#define local_softirq_pending()	__this_cpu_read(irq_stat.__softirq_pending)
 
 #define __ARCH_SET_SOFTIRQ_PENDING
 
-#define set_softirq_pending(x)	percpu_write(irq_stat.__softirq_pending, (x))
-#define or_softirq_pending(x)	percpu_or(irq_stat.__softirq_pending, (x))
+#define set_softirq_pending(x)	\
+	__this_cpu_write(irq_stat.__softirq_pending, (x))
+#define or_softirq_pending(x)	__this_cpu_or(irq_stat.__softirq_pending, (x))
  
  extern void ack_bad_irq(unsigned int irq);
  
 diff --git a/arch/x86/include/asm/irq_regs.h b/arch/x86/include/asm/irq_regs.h
 index 7784322..15639ed 100644
 --- a/arch/x86/include/asm/irq_regs.h
 +++ b/arch/x86/include/asm/irq_regs.h
 @@ -15,7 +15,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_regs);
  
  static inline struct pt_regs *get_irq_regs(void)
  {
 - return percpu_read(irq_regs);
 + return __this_cpu_read(irq_regs);
  }
  
  static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
 @@ -23,7 +23,7 @@ static inline struct pt_regs *set_irq_regs(struct pt_regs 
 *new_regs)
   struct pt_regs *old_regs;
  
   old_regs = get_irq_regs();
 - percpu_write(irq_regs, new_regs);
 + __this_cpu_write(irq_regs, new_regs);
  

Re: [Qemu-devel] [PATCH] ivshmem: fix PCI BAR2 registration during initialization

2011-11-21 Thread Anthony Liguori

On 11/21/2011 04:56 AM, zanghongy...@huawei.com wrote:

From: Hongyong Zang zanghongy...@huawei.com

Ivshmem cannot work, and the lspci command cannot show the ivshmem BAR2 in the 
guest.
In pci_register_bar(), the MemoryRegion parameter should be s->bar instead of 
s->ivshmem.

Signed-off-by: Hongyong Zang zanghongy...@huawei.com


Applied.  Thanks.

Regards,

Anthony Liguori


---
  hw/ivshmem.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 242fbea..2ecf658 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -694,7 +694,7 @@ static int pci_ivshmem_init(PCIDevice *dev)
 s->peers = g_malloc0(s->nb_peers * sizeof(Peer));

 pci_register_bar(&s->dev, 2,
- PCI_BASE_ADDRESS_SPACE_MEMORY, &s->ivshmem);
+ PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar);

 s->eventfd_chr = g_malloc0(s->vectors * sizeof(CharDriverState *));





[FYI] Need to do a full rebuild if you are on Linux x86 host

2011-11-21 Thread Anthony Liguori

Due to this commit:

commit 40d6444e91c6ab17e5e8ab01d4eece90cbc4afed
Author: Avi Kivity a...@redhat.com
Date:   Tue Nov 15 20:12:17 2011 +0200

configure: build position independent executables on x86-Linux hosts

PIE binaries cannot be linked with non-PIE binaries and make is not smart enough 
to rebuild when the CFLAGS have changed.


Regards,

Anthony Liguori


Re: [PATCH 5 of 5] virtio: expose added descriptors immediately

2011-11-21 Thread Rusty Russell
On Mon, 21 Nov 2011 13:57:04 +0200, Michael S. Tsirkin m...@redhat.com 
wrote:
 On Mon, Nov 21, 2011 at 12:18:45PM +1030, Rusty Russell wrote:
  On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin m...@redhat.com 
  wrote:
   My unlocked kick patches will trip this warning: they make
   virtio-net do add + get without kick.
  
  Heh, it's a good sign if they do, since that means you're running really
  well :)
 
 They don't in fact, in my testing :(. But I think they can with luck.
 
   I think block with unlocked kick can trip it too:
   add, lock is dropped and then an interrupt can get in.
   
   We also don't need a kick each num - each 2^15 is enough.
   Why don't we do this at start of add_buf:
   if (vq->num_added >= 0x7fff)
 		return -ENOSPC;
  
  The warning was there in case a driver is never doing a kick, and
  getting away with it (mostly) because the device is polling.  Let's not
  penalize good drivers to catch bad ones.
  
  How about we do this properly, like so:
 
 Absolutely. But I think we also need to handle num_added
 overflow of a 15 bit counter, no? Otherwise the
 vring_need_event logic might give us false negatives 
 I'm guessing we can just assume we need a kick in that case.

You're right.  Thank you.  My immediate reaction of "make it an unsigned
long" doesn't work.

Here's the diff to what I posted before:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -254,9 +254,10 @@ add_head:
 	vq->vring.avail->idx++;
 	vq->num_added++;
 
-	/* If you haven't kicked in this long, you're probably doing something
-	 * wrong. */
-	WARN_ON(vq->num_added > vq->vring.num);
+	/* This is very unlikely, but theoretically possible.  Kick
+	 * just in case. */
+	if (unlikely(vq->num_added == 65535))
+		virtqueue_kick(_vq);
 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);


Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Anthony Liguori

On 11/21/2011 04:50 PM, Chris Wright wrote:

* Peter Zijlstra (a.p.zijls...@chello.nl) wrote:

On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:


In the original post of this mail thread, I proposed a way to export
guest RAM ranges (Guest Physical Address-GPA) and their corresponding host
host virtual mappings (Host Virtual Address-HVA) from QEMU (via QEMU monitor).
The idea was to use this GPA to HVA mappings from tools like libvirt to bind
specific parts of the guest RAM to different host nodes. This needed an
extension to existing mbind() to allow binding memory of a process(QEMU) from a
different process(libvirt). This was needed since we wanted to do all this from
libvirt.

Hence I was coming from that background when I asked for extending
ms_mbind() to take a tid parameter. If QEMU community thinks that NUMA
binding should all be done from outside of QEMU, it is needed, otherwise
what you have should be sufficient.


That's just retarded, and no you won't get such extensions. Poking at
another process's virtual address space is just daft. Esp. if there's no
actual reason for it.


Need to separate the binding vs the policy mgmt.  The policy mgmt could
still be done outside, whereas the binding could still be done from w/in
QEMU.  A simple monitor interface to rebalance vcpu memory allocations
to different nodes could very well schedule vcpu thread work in QEMU.


I really would prefer to avoid having such an interface.  It's a shotgun that 
will only result in many poor feet being maimed.  I can't tell you the number of 
times I've encountered people using CPU pinning when they have absolutely no 
business doing CPU pinning.


If we really believe such an interface should exist, then the interface should 
really be from the kernel.  Once we have memgroups, there's no reason to involve 
QEMU at all.  QEMU can define the memgroups based on the NUMA nodes and then 
it's up to the kernel as to whether it exposes controls to explicitly bind 
memgroups within a process or not.


Regards,

Anthony Liguori


So, I agree, even if there is some external policy mgmt, it could still
easily work w/ QEMU to use Peter's proposed interface.

thanks,
-chris





[PATCH 0/7] KVM: x86 emulator: Use opcode::execute for some instructions

2011-11-21 Thread Takuya Yoshikawa
This patch set eats the remaining spaghetti in the emulator and
cleans up the large plate.

After this, only trivial cases will be there.

Passed emulator.flat test:
SUMMARY: 90 tests, 0 failures

Thanks,
Takuya


[PATCH 1/7] KVM: x86 emulator: Use opcode::execute for IN/OUT

2011-11-21 Thread Takuya Yoshikawa
IN : E4, E5, EC, ED
OUT: E6, E7, EE, EF

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   54 ---
 1 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8547958..8ba4ea8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2776,6 +2776,24 @@ static int em_jcxz(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_in(struct x86_emulate_ctxt *ctxt)
+{
+	if (!pio_in_emulated(ctxt, ctxt->dst.bytes, ctxt->src.val,
+			     &ctxt->dst.val))
+		return X86EMUL_IO_NEEDED;
+
+	return X86EMUL_CONTINUE;
+}
+
+static int em_out(struct x86_emulate_ctxt *ctxt)
+{
+	ctxt->ops->pio_out_emulated(ctxt, ctxt->src.bytes, ctxt->dst.val,
+				    &ctxt->src.val, 1);
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_cli(struct x86_emulate_ctxt *ctxt)
 {
if (emulator_bad_iopl(ctxt))
@@ -3004,6 +3022,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 #define D2bv(_f)  D((_f) | ByteOp), D(_f)
 #define D2bvIP(_f, _i, _p) DIP((_f) | ByteOp, _i, _p), DIP(_f, _i, _p)
 #define I2bv(_f, _e)  I((_f) | ByteOp, _e), I(_f, _e)
+#define I2bvIP(_f, _e, _i, _p) \
+   IIP((_f) | ByteOp, _e, _i, _p), IIP(_f, _e, _i, _p)
 
 #define I6ALU(_f, _e) I2bv((_f) | DstMem | SrcReg | ModRM, _e),
\
I2bv(((_f) | DstReg | SrcMem | ModRM)  ~Lock, _e), \
@@ -3217,13 +3237,13 @@ static struct opcode opcode_table[256] = {
/* 0xE0 - 0xE7 */
X3(I(SrcImmByte, em_loop)),
I(SrcImmByte, em_jcxz),
-   D2bvIP(SrcImmUByte | DstAcc, in,  check_perm_in),
-   D2bvIP(SrcAcc | DstImmUByte, out, check_perm_out),
+   I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
+   I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
/* 0xE8 - 0xEF */
D(SrcImm | Stack), D(SrcImm | ImplicitOps),
I(SrcImmFAddr | No64, em_jmp_far), D(SrcImmByte | ImplicitOps),
-   D2bvIP(SrcDX | DstAcc, in,  check_perm_in),
-   D2bvIP(SrcAcc | DstDX, out, check_perm_out),
+   I2bvIP(SrcDX | DstAcc, em_in,  in,  check_perm_in),
+   I2bvIP(SrcAcc | DstDX, em_out, out, check_perm_out),
/* 0xF0 - 0xF7 */
N, DI(ImplicitOps, icebp), N, N,
DI(ImplicitOps | Priv, hlt), D(ImplicitOps),
@@ -3325,6 +3345,7 @@ static struct opcode twobyte_table[256] = {
 #undef D2bv
 #undef D2bvIP
 #undef I2bv
+#undef I2bvIP
 #undef I6ALU
 
 static unsigned imm_size(struct x86_emulate_ctxt *ctxt)
@@ -3867,11 +3888,12 @@ special_insn:
case 0x6c:  /* insb */
case 0x6d:  /* insw/insd */
ctxt-src.val = ctxt-regs[VCPU_REGS_RDX];
-   goto do_io_in;
+   rc = em_in(ctxt);
+   break;
case 0x6e:  /* outsb */
case 0x6f:  /* outsw/outsd */
ctxt-dst.val = ctxt-regs[VCPU_REGS_RDX];
-   goto do_io_out;
+   rc = em_out(ctxt);
break;
case 0x70 ... 0x7f: /* jcc (short) */
if (test_cc(ctxt-b, ctxt-eflags))
@@ -3915,12 +3937,6 @@ special_insn:
ctxt-src.val = ctxt-regs[VCPU_REGS_RCX];
rc = em_grp2(ctxt);
break;
-   case 0xe4:  /* inb */
-   case 0xe5:  /* in */
-   goto do_io_in;
-   case 0xe6: /* outb */
-   case 0xe7: /* out */
-   goto do_io_out;
case 0xe8: /* call (near) */ {
long int rel = ctxt-src.val;
ctxt-src.val = (unsigned long) ctxt-_eip;
@@ -3933,20 +3949,6 @@ special_insn:
jmp_rel(ctxt, ctxt-src.val);
ctxt-dst.type = OP_NONE; /* Disable writeback. */
break;
-   case 0xec: /* in al,dx */
-   case 0xed: /* in (e/r)ax,dx */
-   do_io_in:
-   if (!pio_in_emulated(ctxt, ctxt-dst.bytes, ctxt-src.val,
-ctxt-dst.val))
-   goto done; /* IO is needed */
-   break;
-   case 0xee: /* out dx,al */
-   case 0xef: /* out dx,(e/r)ax */
-   do_io_out:
-   ops-pio_out_emulated(ctxt, ctxt-src.bytes, ctxt-dst.val,
- ctxt-src.val, 1);
-   ctxt-dst.type = OP_NONE;   /* Disable writeback. */
-   break;
case 0xf4:  /* hlt */
ctxt-ops-halt(ctxt);
break;
-- 
1.7.5.4



[PATCH 2/7] KVM: x86 emulator: Use opcode::execute for BT family

2011-11-21 Thread Takuya Yoshikawa
BT : 0F A3
BTS: 0F AB
BTR: 0F B3
BTC: 0F BB

Group 8: 0F BA

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   77 +++
 1 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8ba4ea8..7a9ce6d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2813,6 +2813,35 @@ static int em_sti(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_bt(struct x86_emulate_ctxt *ctxt)
+{
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	/* only subword offset */
+	ctxt->src.val &= (ctxt->dst.bytes << 3) - 1;
+
+	emulate_2op_SrcV_nobyte(ctxt, "bt");
+	return X86EMUL_CONTINUE;
+}
+
+static int em_bts(struct x86_emulate_ctxt *ctxt)
+{
+	emulate_2op_SrcV_nobyte(ctxt, "bts");
+	return X86EMUL_CONTINUE;
+}
+
+static int em_btr(struct x86_emulate_ctxt *ctxt)
+{
+	emulate_2op_SrcV_nobyte(ctxt, "btr");
+	return X86EMUL_CONTINUE;
+}
+
+static int em_btc(struct x86_emulate_ctxt *ctxt)
+{
+	emulate_2op_SrcV_nobyte(ctxt, "btc");
+	return X86EMUL_CONTINUE;
+}
+
 static bool valid_cr(int nr)
 {
switch (nr) {
@@ -3117,10 +3146,10 @@ static struct group_dual group7 = { {
 
 static struct opcode group8[] = {
N, N, N, N,
-   D(DstMem | SrcImmByte | ModRM),
-   D(DstMem | SrcImmByte | ModRM | Lock | PageTable),
-   D(DstMem | SrcImmByte | ModRM | Lock),
-   D(DstMem | SrcImmByte | ModRM | Lock | PageTable),
+   I(DstMem | SrcImmByte | ModRM, em_bt),
+   I(DstMem | SrcImmByte | ModRM | Lock | PageTable, em_bts),
+   I(DstMem | SrcImmByte | ModRM | Lock, em_btr),
+   I(DstMem | SrcImmByte | ModRM | Lock | PageTable, em_btc),
 };
 
 static struct group_dual group9 = { {
@@ -3299,26 +3328,27 @@ static struct opcode twobyte_table[256] = {
X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
/* 0xA0 - 0xA7 */
I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
-   DI(ImplicitOps, cpuid), D(DstMem | SrcReg | ModRM | BitOp),
+   DI(ImplicitOps, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM), N, N,
/* 0xA8 - 0xAF */
I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
DI(ImplicitOps, rsm),
-   D(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable),
+   I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM),
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
D2bv(DstMem | SrcReg | ModRM | Lock | PageTable),
I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-   D(DstMem | SrcReg | ModRM | BitOp | Lock),
+   I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM 
| Mov),
/* 0xB8 - 0xBF */
N, N,
-   G(BitOp, group8), D(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable),
+   G(BitOp, group8),
+   I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
D(DstReg | SrcMem | ModRM), D(DstReg | SrcMem | ModRM),
D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM 
| Mov),
/* 0xC0 - 0xCF */
@@ -4103,21 +4133,10 @@ twobyte_insn:
case 0x90 ... 0x9f: /* setcc r/m8 */
ctxt-dst.val = test_cc(ctxt-b, ctxt-eflags);
break;
-   case 0xa3:
- bt:   /* bt */
-   ctxt-dst.type = OP_NONE;
-   /* only subword offset */
-   ctxt-src.val = (ctxt-dst.bytes  3) - 1;
-   emulate_2op_SrcV_nobyte(ctxt, bt);
-   break;
case 0xa4: /* shld imm8, r, r/m */
case 0xa5: /* shld cl, r, r/m */
emulate_2op_cl(ctxt, shld);
break;
-   case 0xab:
- bts:  /* bts */
-   emulate_2op_SrcV_nobyte(ctxt, bts);
-   break;
case 0xac: /* shrd imm8, r, r/m */
case 0xad: /* shrd cl, r, r/m */
emulate_2op_cl(ctxt, shrd);
@@ -4141,31 +4160,11 @@ twobyte_insn:
ctxt-dst.addr.reg = (unsigned long 
*)ctxt-regs[VCPU_REGS_RAX];
}
break;
-   case 0xb3:
- btr:  /* btr */
-   emulate_2op_SrcV_nobyte(ctxt, btr);
-   break;
case 0xb6 ... 0xb7: /* movzx */
ctxt-dst.bytes = ctxt-op_bytes;
ctxt-dst.val = (ctxt-d  ByteOp) ? (u8) ctxt-src.val
   : (u16) 

[PATCH 3/7] KVM: x86 emulator: Use opcode::execute for CALL

2011-11-21 Thread Takuya Yoshikawa
CALL: E8

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7a9ce6d..6b7a03b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2482,6 +2482,15 @@ static int em_das(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_call(struct x86_emulate_ctxt *ctxt)
+{
+	long rel = ctxt->src.val;
+
+	ctxt->src.val = (unsigned long)ctxt->_eip;
+	jmp_rel(ctxt, rel);
+	return em_push(ctxt);
+}
+
 static int em_call_far(struct x86_emulate_ctxt *ctxt)
 {
u16 sel, old_cs;
@@ -3269,7 +3278,7 @@ static struct opcode opcode_table[256] = {
I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
/* 0xE8 - 0xEF */
-   D(SrcImm | Stack), D(SrcImm | ImplicitOps),
+   I(SrcImm | Stack, em_call), D(SrcImm | ImplicitOps),
I(SrcImmFAddr | No64, em_jmp_far), D(SrcImmByte | ImplicitOps),
I2bvIP(SrcDX | DstAcc, em_in,  in,  check_perm_in),
I2bvIP(SrcAcc | DstDX, em_out, out, check_perm_out),
@@ -3967,13 +3976,6 @@ special_insn:
ctxt-src.val = ctxt-regs[VCPU_REGS_RCX];
rc = em_grp2(ctxt);
break;
-   case 0xe8: /* call (near) */ {
-   long int rel = ctxt-src.val;
-   ctxt-src.val = (unsigned long) ctxt-_eip;
-   jmp_rel(ctxt, rel);
-   rc = em_push(ctxt);
-   break;
-   }
case 0xe9: /* jmp rel */
case 0xeb: /* jmp rel short */
jmp_rel(ctxt, ctxt-src.val);
-- 
1.7.5.4



[PATCH 4/7] KVM: x86 emulator: Use opcode::execute for MOV to cr/dr

2011-11-21 Thread Takuya Yoshikawa
MOV: 0F 22 (move to control registers)
MOV: 0F 23 (move to debug registers)

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   52 +++
 1 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6b7a03b..7fe5ed1 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2638,6 +2638,34 @@ static int em_mov(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_cr_write(struct x86_emulate_ctxt *ctxt)
+{
+	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
+		return emulate_gp(ctxt, 0);
+
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	return X86EMUL_CONTINUE;
+}
+
+static int em_dr_write(struct x86_emulate_ctxt *ctxt)
+{
+	unsigned long val;
+
+	if (ctxt->mode == X86EMUL_MODE_PROT64)
+		val = ctxt->src.val & ~0ULL;
+	else
+		val = ctxt->src.val & ~0U;
+
+	/* #UD condition is already handled. */
+	if (ctxt->ops->set_dr(ctxt, ctxt->modrm_reg, val) < 0)
+		return emulate_gp(ctxt, 0);
+
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_mov_rm_sreg(struct x86_emulate_ctxt *ctxt)
 {
if (ctxt-modrm_reg  VCPU_SREG_GS)
@@ -3304,8 +3332,8 @@ static struct opcode twobyte_table[256] = {
/* 0x20 - 0x2F */
DIP(ModRM | DstMem | Priv | Op3264, cr_read, check_cr_read),
DIP(ModRM | DstMem | Priv | Op3264, dr_read, check_dr_read),
-   DIP(ModRM | SrcMem | Priv | Op3264, cr_write, check_cr_write),
-   DIP(ModRM | SrcMem | Priv | Op3264, dr_write, check_dr_write),
+   IIP(ModRM | SrcMem | Priv | Op3264, em_cr_write, cr_write, 
check_cr_write),
+   IIP(ModRM | SrcMem | Priv | Op3264, em_dr_write, dr_write, 
check_dr_write),
N, N, N, N,
N, N, N, N, N, N, N, N,
/* 0x30 - 0x3F */
@@ -4080,26 +4108,6 @@ twobyte_insn:
case 0x21: /* mov from dr to reg */
ops-get_dr(ctxt, ctxt-modrm_reg, ctxt-dst.val);
break;
-   case 0x22: /* mov reg, cr */
-   if (ops-set_cr(ctxt, ctxt-modrm_reg, ctxt-src.val)) {
-   emulate_gp(ctxt, 0);
-   rc = X86EMUL_PROPAGATE_FAULT;
-   goto done;
-   }
-   ctxt-dst.type = OP_NONE;
-   break;
-   case 0x23: /* mov from reg to dr */
-   if (ops-set_dr(ctxt, ctxt-modrm_reg, ctxt-src.val 
-   ((ctxt-mode == X86EMUL_MODE_PROT64) ?
-~0ULL : ~0U))  0) {
-   /* #UD condition is already handled by the code above */
-   emulate_gp(ctxt, 0);
-   rc = X86EMUL_PROPAGATE_FAULT;
-   goto done;
-   }
-
-   ctxt-dst.type = OP_NONE;   /* no writeback */
-   break;
case 0x30:
/* wrmsr */
msr_data = (u32)ctxt-regs[VCPU_REGS_RAX]
-- 
1.7.5.4



[PATCH 5/7] KVM: x86 emulator: Use opcode::execute for WRMSR/RDMSR

2011-11-21 Thread Takuya Yoshikawa
WRMSR: 0F 30
RDMSR: 0F 32

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   52 
 1 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7fe5ed1..906c5eb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2666,6 +2666,30 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
+{
+	u64 msr_data;
+
+	msr_data = (u32)ctxt->regs[VCPU_REGS_RAX]
+		| ((u64)ctxt->regs[VCPU_REGS_RDX] << 32);
+	if (ctxt->ops->set_msr(ctxt, ctxt->regs[VCPU_REGS_RCX], msr_data))
+		return emulate_gp(ctxt, 0);
+
+	return X86EMUL_CONTINUE;
+}
+
+static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
+{
+	u64 msr_data;
+
+	if (ctxt->ops->get_msr(ctxt, ctxt->regs[VCPU_REGS_RCX], &msr_data))
+		return emulate_gp(ctxt, 0);
+
+	ctxt->regs[VCPU_REGS_RAX] = (u32)msr_data;
+	ctxt->regs[VCPU_REGS_RDX] = msr_data >> 32;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_mov_rm_sreg(struct x86_emulate_ctxt *ctxt)
 {
if (ctxt-modrm_reg  VCPU_SREG_GS)
@@ -3337,9 +3361,9 @@ static struct opcode twobyte_table[256] = {
N, N, N, N,
N, N, N, N, N, N, N, N,
/* 0x30 - 0x3F */
-   DI(ImplicitOps | Priv, wrmsr),
+   II(ImplicitOps | Priv, em_wrmsr, wrmsr),
IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
-   DI(ImplicitOps | Priv, rdmsr),
+   II(ImplicitOps | Priv, em_rdmsr, rdmsr),
DIP(ImplicitOps | Priv, rdpmc, check_rdpmc),
I(ImplicitOps | VendorSpecific, em_sysenter),
I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
@@ -3818,7 +3842,6 @@ static bool string_insn_completed(struct x86_emulate_ctxt 
*ctxt)
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 {
struct x86_emulate_ops *ops = ctxt-ops;
-   u64 msr_data;
int rc = X86EMUL_CONTINUE;
int saved_dst_type = ctxt-dst.type;
 
@@ -4108,29 +4131,6 @@ twobyte_insn:
case 0x21: /* mov from dr to reg */
ops-get_dr(ctxt, ctxt-modrm_reg, ctxt-dst.val);
break;
-   case 0x30:
-   /* wrmsr */
-   msr_data = (u32)ctxt-regs[VCPU_REGS_RAX]
-   | ((u64)ctxt-regs[VCPU_REGS_RDX]  32);
-   if (ops-set_msr(ctxt, ctxt-regs[VCPU_REGS_RCX], msr_data)) {
-   emulate_gp(ctxt, 0);
-   rc = X86EMUL_PROPAGATE_FAULT;
-   goto done;
-   }
-   rc = X86EMUL_CONTINUE;
-   break;
-   case 0x32:
-   /* rdmsr */
-   if (ops-get_msr(ctxt, ctxt-regs[VCPU_REGS_RCX], msr_data)) {
-   emulate_gp(ctxt, 0);
-   rc = X86EMUL_PROPAGATE_FAULT;
-   goto done;
-   } else {
-   ctxt-regs[VCPU_REGS_RAX] = (u32)msr_data;
-   ctxt-regs[VCPU_REGS_RDX] = msr_data  32;
-   }
-   rc = X86EMUL_CONTINUE;
-   break;
case 0x40 ... 0x4f: /* cmov */
ctxt-dst.val = ctxt-dst.orig_val = ctxt-src.val;
if (!test_cc(ctxt-b, ctxt-eflags))
-- 
1.7.5.4



[PATCH 6/7] KVM: x86 emulator: Use opcode::execute for CMPXCHG

2011-11-21 Thread Takuya Yoshikawa
CMPXCHG: 0F B0, 0F B1

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   37 +++--
 1 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 906c5eb..799000d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1832,6 +1832,24 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
return rc;
 }
 
+static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
+{
+	/* Save real source value, then compare EAX against destination. */
+	ctxt->src.orig_val = ctxt->src.val;
+	ctxt->src.val = ctxt->regs[VCPU_REGS_RAX];
+	emulate_2op_SrcV(ctxt, "cmp");
+
+	if (ctxt->eflags & EFLG_ZF) {
+		/* Success: write back to memory. */
+		ctxt->dst.val = ctxt->src.orig_val;
+	} else {
+		/* Failure: write the value we saw to EAX. */
+		ctxt->dst.type = OP_REG;
+		ctxt->dst.addr.reg = (unsigned long *)&ctxt->regs[VCPU_REGS_RAX];
+	}
+	return X86EMUL_CONTINUE;
+}
+
 static int em_lseg(struct x86_emulate_ctxt *ctxt)
 {
int seg = ctxt-src2.val;
@@ -3400,7 +3418,7 @@ static struct opcode twobyte_table[256] = {
D(DstMem | SrcReg | Src2CL | ModRM),
D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
/* 0xB0 - 0xB7 */
-   D2bv(DstMem | SrcReg | ModRM | Lock | PageTable),
+   I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
@@ -4153,23 +4171,6 @@ twobyte_insn:
break;
case 0xae:  /* clflush */
break;
-   case 0xb0 ... 0xb1: /* cmpxchg */
-   /*
-* Save real source value, then compare EAX against
-* destination.
-*/
-   ctxt-src.orig_val = ctxt-src.val;
-   ctxt-src.val = ctxt-regs[VCPU_REGS_RAX];
-   emulate_2op_SrcV(ctxt, cmp);
-   if (ctxt-eflags  EFLG_ZF) {
-   /* Success: write back to memory. */
-   ctxt-dst.val = ctxt-src.orig_val;
-   } else {
-   /* Failure: write the value we saw to EAX. */
-   ctxt-dst.type = OP_REG;
-   ctxt-dst.addr.reg = (unsigned long 
*)ctxt-regs[VCPU_REGS_RAX];
-   }
-   break;
case 0xb6 ... 0xb7: /* movzx */
ctxt-dst.bytes = ctxt-op_bytes;
ctxt-dst.val = (ctxt-d  ByteOp) ? (u8) ctxt-src.val
-- 
1.7.5.4



[PATCH 7/7] KVM: x86 emulator: Use opcode::execute for BSF/BSR

2011-11-21 Thread Takuya Yoshikawa
BSF: 0F BC
BSR: 0F BD

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   60 
 1 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 799000d..4cd3313 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2921,6 +2921,40 @@ static int em_btc(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_bsf(struct x86_emulate_ctxt *ctxt)
+{
+	u8 zf;
+
+	__asm__ ("bsf %2, %0; setz %1"
+		 : "=r"(ctxt->dst.val), "=q"(zf)
+		 : "r"(ctxt->src.val));
+
+	ctxt->eflags &= ~X86_EFLAGS_ZF;
+	if (zf) {
+		ctxt->eflags |= X86_EFLAGS_ZF;
+		/* Disable writeback. */
+		ctxt->dst.type = OP_NONE;
+	}
+	return X86EMUL_CONTINUE;
+}
+
+static int em_bsr(struct x86_emulate_ctxt *ctxt)
+{
+	u8 zf;
+
+	__asm__ ("bsr %2, %0; setz %1"
+		 : "=r"(ctxt->dst.val), "=q"(zf)
+		 : "r"(ctxt->src.val));
+
+	ctxt->eflags &= ~X86_EFLAGS_ZF;
+	if (zf) {
+		ctxt->eflags |= X86_EFLAGS_ZF;
+		/* Disable writeback. */
+		ctxt->dst.type = OP_NONE;
+	}
+	return X86EMUL_CONTINUE;
+}
+
 static bool valid_cr(int nr)
 {
switch (nr) {
@@ -3428,7 +3462,7 @@ static struct opcode twobyte_table[256] = {
N, N,
G(BitOp, group8),
I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
-   D(DstReg | SrcMem | ModRM), D(DstReg | SrcMem | ModRM),
+   I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr),
D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM 
| Mov),
/* 0xC0 - 0xCF */
D2bv(DstMem | SrcReg | ModRM | Lock),
@@ -4176,30 +4210,6 @@ twobyte_insn:
ctxt-dst.val = (ctxt-d  ByteOp) ? (u8) ctxt-src.val
   : (u16) ctxt-src.val;
break;
-   case 0xbc: {/* bsf */
-   u8 zf;
-   __asm__ (bsf %2, %0; setz %1
-: =r(ctxt-dst.val), =q(zf)
-: r(ctxt-src.val));
-   ctxt-eflags = ~X86_EFLAGS_ZF;
-   if (zf) {
-   ctxt-eflags |= X86_EFLAGS_ZF;
-   ctxt-dst.type = OP_NONE;   /* Disable writeback. */
-   }
-   break;
-   }
-   case 0xbd: {/* bsr */
-   u8 zf;
-   __asm__ (bsr %2, %0; setz %1
-: =r(ctxt-dst.val), =q(zf)
-: r(ctxt-src.val));
-   ctxt-eflags = ~X86_EFLAGS_ZF;
-   if (zf) {
-   ctxt-eflags |= X86_EFLAGS_ZF;
-   ctxt-dst.type = OP_NONE;   /* Disable writeback. */
-   }
-   break;
-   }
case 0xbe ... 0xbf: /* movsx */
ctxt-dst.bytes = ctxt-op_bytes;
ctxt-dst.val = (ctxt-d  ByteOp) ? (s8) ctxt-src.val :
-- 
1.7.5.4



Re: [PATCH 5 of 5] virtio: expose added descriptors immediately

2011-11-21 Thread Michael S. Tsirkin
On Tue, Nov 22, 2011 at 11:03:04AM +1030, Rusty Russell wrote:
 On Mon, 21 Nov 2011 13:57:04 +0200, Michael S. Tsirkin m...@redhat.com 
 wrote:
  On Mon, Nov 21, 2011 at 12:18:45PM +1030, Rusty Russell wrote:
   On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin 
   m...@redhat.com wrote:
My unlocked kick patches will trip this warning: they make
virtio-net do add + get without kick.
   
   Heh, it's a good sign if they do, since that means you're running really
   well :)
  
  They don't in fact, in my testing :(. But I think they can with luck.
  
I think block with unlocked kick can trip it too:
add, lock is dropped and then an interrupt can get in.

We also don't need a kick each num - each 2^15 is enough.
Why don't we do this at start of add_buf:
if (vq->num_added >= 0x7fff)
return -ENOSPC;
   
   The warning was there in case a driver is never doing a kick, and
   getting away with it (mostly) because the device is polling.  Let's not
   penalize good drivers to catch bad ones.
   
   How about we do this properly, like so:
  
  Absolutely. But I think we also need to handle num_added
  overflow of a 15 bit counter, no? Otherwise the
  vring_need_event logic might give us false negatives 
  I'm guessing we can just assume we need a kick in that case.
 
 You're right.  Thankyou.  My immediate reaction of "make it an unsigned
 long" doesn't work.
 
 Here's the diff to what I posted before:
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -254,9 +254,10 @@ add_head:
   vq->vring.avail->idx++;
   vq->num_added++;
  
  - /* If you haven't kicked in this long, you're probably doing something
  -  * wrong. */
  - WARN_ON(vq->num_added > vq->vring.num);
  + /* This is very unlikely, but theoretically possible.  Kick
  +  * just in case. */
  + if (unlikely(vq->num_added == 65535))

This is 0xffff but why use the decimal notation?

  + virtqueue_kick(_vq);
  
   pr_debug("Added buffer head %i to %p\n", head, vq);
   END_USE(vq);

We also still need to reset vq->num_added, right?

-- 
MST


Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation

2011-11-21 Thread Paul Mackerras
On Sun, Nov 20, 2011 at 02:23:52PM +0200, Avi Kivity wrote:
 
 There is no the VMA.  There could be multiple VMAs, or none (with the
 mmap() coming afterwards).  You could do all the checks you want here,
 only to have host userspace remap it under your feet.  This needs to be
 done on a per-page basis at fault time.

OK, so that's a somewhat different mental model than I had in mind.  I
can change the code to do almost everything on a per-page basis at
fault time on POWER7.  There is one thing we can't do at fault time,
which is to tell the hardware the page size for the virtual real mode
area (VRMA), which is a mapping of the memory starting at guest
physical address zero.  We can however work out that pagesize the
first time we run a vcpu.  By that stage we must have some memory
mapped at gpa 0, otherwise the vcpu is not going to get very far.  We
will need to look for the page size of whatever is mapped at gpa 0 at
that point and give it to the hardware.

On PPC970, which is a much older processor, we can't intercept the
page faults (at least not without running the whole guest in user mode
and emulating all the privileged instructions), so once we have given
the guest access to a page, we can't revoke it.  We will have to take
and keep a reference to the page so it doesn't go away underneath us,
which of course doesn't guarantee that userland can continue to see
it, but does at least mean we won't corrupt memory.

  +   /*
  +* We require read & write permission as we cannot yet
  +* enforce guest read-only protection or no access.
  +*/
  +   if ((vma->vm_flags & (VM_READ | VM_WRITE)) !=
  +   (VM_READ | VM_WRITE))
  +   goto err_unlock;
 
 This, too, must be done at get_user_pages() time.
 
 What happens if mmu notifiers tell you to write protect a page?

On POWER7, we have to remove access to the page, and then when we get
a fault on the page, request write access when we do
get_user_pages_fast.

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation

2011-11-21 Thread Avi Kivity
On 11/21/2011 01:03 PM, Paul Mackerras wrote:
 On Sun, Nov 20, 2011 at 02:23:52PM +0200, Avi Kivity wrote:
  
  There is no the VMA.  There could be multiple VMAs, or none (with the
  mmap() coming afterwards).  You could do all the checks you want here,
  only to have host userspace remap it under your feet.  This needs to be
  done on a per-page basis at fault time.

 OK, so that's a somewhat different mental model than I had in mind.  I
 can change the code to do almost everything on a per-page basis at
 fault time on POWER7.  There is one thing we can't do at fault time,
 which is to tell the hardware the page size for the virtual real mode
 area (VRMA), which is a mapping of the memory starting at guest
 physical address zero.  We can however work out that pagesize the
 first time we run a vcpu.  By that stage we must have some memory
 mapped at gpa 0, otherwise the vcpu is not going to get very far.  We
 will need to look for the page size of whatever is mapped at gpa 0 at
 that point and give it to the hardware.

Ok.  Do you need to check at fault time that your assumptions haven't
changed, then?

 On PPC970, which is a much older processor, we can't intercept the
 page faults (at least not without running the whole guest in user mode
 and emulating all the privileged instructions), so once we have given
 the guest access to a page, we can't revoke it.  We will have to take
 and keep a reference to the page so it doesn't go away underneath us,
 which of course doesn't guarantee that userland can continue to see
 it, but does at least mean we won't corrupt memory.

Yes, this is similar to kvm/x86 before mmu notifiers were added.


   + /*
   +  * We require read & write permission as we cannot yet
   +  * enforce guest read-only protection or no access.
   +  */
   + if ((vma->vm_flags & (VM_READ | VM_WRITE)) !=
   + (VM_READ | VM_WRITE))
   + goto err_unlock;
  
  This, too, must be done at get_user_pages() time.
  
  What happens if mmu notifiers tell you to write protect a page?

 On POWER7, we have to remove access to the page, and then when we get
 a fault on the page, request write access when we do
 get_user_pages_fast.

Ok, so no ksm for you.  Does this apply to kvm-internal write
protection, like we do for the framebuffer and live migration?  I guess
you don't care much about the framebuffer (and anyway it doesn't need
read-only access), but removing access for live migration is painful.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation

2011-11-21 Thread Paul Mackerras
On Mon, Nov 21, 2011 at 02:22:58PM +0200, Avi Kivity wrote:
 On 11/21/2011 01:03 PM, Paul Mackerras wrote:
  OK, so that's a somewhat different mental model than I had in mind.  I
  can change the code to do almost everything on a per-page basis at
  fault time on POWER7.  There is one thing we can't do at fault time,
  which is to tell the hardware the page size for the virtual real mode
  area (VRMA), which is a mapping of the memory starting at guest
  physical address zero.  We can however work out that pagesize the
  first time we run a vcpu.  By that stage we must have some memory
  mapped at gpa 0, otherwise the vcpu is not going to get very far.  We
  will need to look for the page size of whatever is mapped at gpa 0 at
  that point and give it to the hardware.
 
 Ok.  Do you need to check at fault time that your assumptions haven't
 changed, then?

At fault time, if we are expecting a large page and we find a small
page, pretty much all we can do is return from the vcpu_run ioctl with
an EFAULT error -- so yes we do check the page-size assumption at
fault time.  The other way around isn't a problem (expecting small
page and find large page), of course.

   What happens if mmu notifiers tell you to write protect a page?
 
  On POWER7, we have to remove access to the page, and then when we get
  a fault on the page, request write access when we do
  get_user_pages_fast.
 
 Ok, so no ksm for you.  Does this apply to kvm-internal write
 protection, like we do for the framebuffer and live migration?  I guess
 you don't care much about the framebuffer (and anyway it doesn't need
 read-only access), but removing access for live migration is painful.

For the framebuffer, we can use the hardware 'C' (changed) bit to
detect dirty pages without having to make them read-only.

On further thought, we can in fact make pages read-only when the guest
thinks they're read/write, at the cost of making the real protection
faults in the guest a little slower.  I'll look into it.

Paul.