Re: [RFC v2 PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Am 20.11.2011 11:59, schrieb Pekka Enberg: On Sun, 2011-11-20 at 14:14 +0800, Lan, Tianyu wrote: OK. Thx. But fsync is too slow. I try to find a way to sync a range of file. Are there any solutions to meet my purpose? On Sun, 2011-11-20 at 08:23 +0200, Sasha Levin wrote: fdatasync() is as good as it'll get. tbh, maybe we should just consider opening QCOW images with O_SYNC and just get it over with? No, lets not do that. It's easier to improve the performance of correct code that doesn't use O_SYNC. Yes, O_SYNC gives you horrible performance. With explicit fsyncs cluster allocation is somewhat slow (you can try to speed it up by batching updates like qemu does), but at least rewrites will perform reasonably close to raw. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
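[Editor's note: Tianyu asks for "a way to sync a range of file"; Linux does have one, sync_file_range(2), though the thread doesn't name it. The sketch below (helper name is made up, not from the thread) shows why it is not a drop-in replacement for fdatasync(): sync_file_range() only starts/waits for data writeback of a byte range and does not flush file metadata or the disk write cache, so qcow2 metadata ordering still needs fdatasync().]

```c
/* Sketch: syncing a byte range vs. the whole file.
 * sync_file_range(2) is Linux-specific and, unlike fdatasync(2),
 * does NOT flush file metadata or the device write cache, so on its
 * own it is not enough for qcow2 metadata ordering. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write one cluster, push just that range toward
 * the disk, then make it durable. */
static int write_and_sync_range(int fd, const void *buf, size_t len, off_t off)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;

	/* Start writeback for [off, off+len) and wait for it to finish. */
	if (sync_file_range(fd, off, len,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
		return -1;

	/* For real durability (metadata + volatile caches), fdatasync()
	 * is still the portable answer - which is why the thread settles
	 * on it despite the cost. */
	return fdatasync(fd);
}
```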
Re: KVM device assignment and user privileges
On 11/21/2011 06:42 AM, Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: On 11/20/2011 04:58 PM, Sasha Levin wrote: Hi all, I've been working on adding device assignment to KVM tools, and started with the basics of just getting a device assigned using the KVM_ASSIGN_PCI_DEVICE ioctl. What I've figured is that unprivileged users can request any PCI device to be assigned to them, including devices which they shouldn't be touching. In my case, it happened with the VGA card, where an unprivileged user simply called KVM_ASSIGN_PCI_DEVICE with the bus, seg and fn of the VGA card and caused the display on the host to go apeshit. Was it supposed to work this way? No, of course not. Indeed. A device is typically owned by a host OS driver, which precludes device assignment from working. If it's not, the unprivileged guest will not have access to the device's config space or resource BARs, as they are only rw for a privileged user. And similarly, /dev/kvm was typically left as 0644. As you can see, it's fragile. I couldn't find any security checks in the code paths of KVM_ASSIGN_PCI_DEVICE and it looks like any user can invoke it with any parameters he'd want - enabling him to kill the host. Alex, Chris? The security checks were removed some time back. The expectation was that there was nothing an unprivileged user could usefully do w/ the assign device ioctl, and the assign irq ioctl only works after assign device. It's built on an overly fragile set of assumptions, however. Avi, the simplest short term thing to do now might be simply revert: 48bb09e KVM: remove CAP_SYS_RAWIO requirement from kvm_vm_ioctl_assign_irq That does nothing for the KVM_ASSIGN_PCI_DEVICE ioctl. While it's a regression for existing unprivileged users, it's better than a hole. And in the meantime, we can come up w/ something better to replace it with. How about doing an access(2) equivalent check for one of the sysfs files used to represent the device? 
If libvirt already chowns them to the right user, then we have no regression, at least for that use case. -- error compiling committee.c: too many arguments to function
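[Editor's note: a minimal userspace-flavored sketch of Avi's access(2) suggestion. The path layout is the real PCI sysfs layout, but the helper name and the idea of doing the check from userspace (rather than the equivalent permission check in the kernel ioctl path) are illustrative assumptions, not the eventual fix.]

```c
/* Sketch: before honoring KVM_ASSIGN_PCI_DEVICE, check that the caller
 * could open the device's sysfs config file read-write. The config file
 * is only rw for privileged users (or users libvirt chowned it to), so
 * this reuses the existing file permission model. Helper name is
 * hypothetical. */
#include <stdio.h>
#include <unistd.h>

static int may_assign_device(unsigned int seg, unsigned int bus,
			     unsigned int dev, unsigned int fn)
{
	char path[128];

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%04x:%02x:%02x.%x/config",
		 seg, bus, dev, fn);

	/* access(2) checks against the caller's real uid/gid, which is
	 * what we want for an ioctl issued by an unprivileged user. */
	return access(path, R_OK | W_OK) == 0;
}
```

A nonexistent or unreadable device simply fails the check, so an unprivileged user can no longer grab, say, the host's VGA card.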
Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro
On 11/21/2011 02:54 AM, Takuya Yoshikawa wrote: (2011/11/20 20:21), Avi Kivity wrote: On 11/18/2011 11:18 AM, Xiao Guangrong wrote: index bb8728e..10524c0 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -307,6 +307,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
 	     idx++)
 
+#define kvm_for_each_memslot(slots, memslot, i)		\
+	for (i = 0; i < (slots)->nmemslots &&		\
+	     ({ memslot = &(slots)->memslots[i]; 1; }); i++)
+
Statement expression not needed, you can use the comma operator: i < (slots)->nmemslots && (memslot = &(slots)->memslots[i], true) or even memslot = &(slots)->memslots[i], i < (slots)->nmemslots or just kill i and make memslot the loop variable. Do you have any preference for the arguments ordering? I think placing the target one, memslot in this case, first is conventional in the kernel code, except when we want to place kvm or something like that. But in kvm code, there seems to be some difference. You mean for the macro? Yes, making memslot the first argument is a good idea. Any difference in kvm code is not intentional. -- error compiling committee.c: too many arguments to function
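[Editor's note: a standalone sketch of the two macro shapes discussed above, using a mock slots structure (field names mirror the patch, not kvm.git). It also adopts the argument order the thread converges on: memslot first.]

```c
/* Mock types so both macro variants compile outside the kernel. */
#include <assert.h>

struct memslot { unsigned long npages; };
struct slots {
	int nmemslots;
	struct memslot memslots[8];
};

/* Statement-expression version from the patch (GNU C extension). */
#define kvm_for_each_memslot_stmt(slots, memslot, i)		\
	for (i = 0; i < (slots)->nmemslots &&			\
	     ({ memslot = &(slots)->memslots[i]; 1; }); i++)

/* Avi's comma-operator version: same effect, no extension needed.
 * The assignment's value is discarded; the constant 1 keeps the
 * condition true until i reaches nmemslots. */
#define kvm_for_each_memslot(memslot, slots, i)			\
	for (i = 0; i < (slots)->nmemslots &&			\
	     (memslot = &(slots)->memslots[i], 1); i++)
```

Short-circuiting of && means the assignment is never evaluated once i goes out of range, so memslot never points past the array.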
Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro
(2011/11/21 17:34), Avi Kivity wrote: Do you have any preference for the arguments ordering? I think placing the target one, memslot in this case, first is conventional in the kernel code, except when we want to place kvm or something like that. But in kvm code, there seems to be some difference. You mean for the macro? Yes, making memslot the first argument is a good idea. Any difference in kvm code is not intentional. Yes. Xiao, please change the order if you have no problem. Thanks, Takuya
Re: [PATCH v2 3/6] KVM: introduce kvm_for_each_memslot macro
On 11/21/2011 04:40 PM, Takuya Yoshikawa wrote: (2011/11/21 17:34), Avi Kivity wrote: Do you have any preference for the arguments ordering? I think placing the target one, memslot in this case, first is conventional in the kernel code, except when we want to place kvm or something like that. But in kvm code, there seems to be some difference. You mean for the macro? Yes, making memslot the first argument is a good idea. Any difference in kvm code is not intentional. Yes. Xiao, please change the order if you have no problem. OK, will change it in the next version, thanks!
Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Am 21.11.2011 08:12, schrieb Lan Tianyu: When a write request touches a cluster without the copied flag, allocate a new cluster and write the original data, with the modification, to the new cluster. This also adds support for the write path of qcow2 compressed images. After testing, the image file passes qemu-img check. The performance still needs to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c | 411 +- tools/kvm/include/kvm/qcow.h | 2 + 2 files changed, 285 insertions(+), 128 deletions(-)

@@ -766,122 +872,166 @@ static ssize_t qcow_write_cluster(struct qcow *q, u64 offset, void *buf, u32 src
 	if (l2t_idx >= l2t_size)
 		return -1;
 
-	clust_off = get_cluster_offset(q, offset);
-	if (clust_off >= clust_sz)
-		return -1;
-
-	len = clust_sz - clust_off;
-	if (len > src_len)
-		len = src_len;
-
-	mutex_lock(&q->mutex);
-	l2t_offset = be64_to_cpu(l1t->l1_table[l1t_idx]);
-	if (l2t_offset & QCOW2_OFLAG_COMPRESSED) {
-		pr_warning("compressed clusters are not supported");
-		goto error;
-	}
-	if (!(l2t_offset & QCOW2_OFLAG_COPIED)) {
-		pr_warning("L2 copy-on-write clusters are not supported");
-		goto error;
-	}
-
-	l2t_offset &= QCOW2_OFFSET_MASK;
-	if (l2t_offset) {
-		/* read and cache l2 table */
+	if (l2t_offset & QCOW2_OFLAG_COPIED) {
+		l2t_offset &= ~QCOW2_OFLAG_COPIED;
 		l2t = qcow_read_l2_table(q, l2t_offset);
 		if (!l2t)
 			goto error;
 	} else {
-		l2t = new_cache_table(q, l2t_offset);
-		if (!l2t)
+		l2t_new_offset = qcow_alloc_clusters(q, l2t_size*sizeof(u64));
+		if (l2t_new_offset < 0)
 			goto error;
 
-		/* Capture the state of the consistent QCOW image */
-		f_sz = file_size(q->fd);
-		if (!f_sz)
-			goto free_cache;
+		l2t = new_cache_table(q, l2t_new_offset);
+		if (!l2t)
+			goto free_cluster;
 
-		/* Write the l2 table of 0's at the end of the file */
-		l2t_offset = qcow_write_l2_table(q, l2t->table);
-		if (!l2t_offset)
+		if (l2t_offset) {
+			l2t = qcow_read_l2_table(q, l2t_offset);
+			if (!l2t)
+				goto free_cache;
+		} else
+			memset(l2t->table, 0x00, l2t_size * sizeof(u64));
+
+		/* write l2 table */
+		l2t->dirty = 1;
+		if (qcow_l2_cache_write(q, l2t) < 0)
 			goto free_cache;
 
-		if (cache_table(q, l2t) < 0) {
-			if (ftruncate(q->fd, f_sz) < 0)
-				goto free_cache;
+		/* cache l2 table */
+		cache_table(q, l2t);

You're ignoring the return value here. Otherwise I didn't find any obvious problem in a quick scan. Kevin
Re: [PATCH 2/3] kvm: use this_cpu_xxx replace percpu_xxx funcs
On Mon, 2011-10-24 at 19:05 +0800, Avi Kivity wrote: On 10/24/2011 04:50 AM, Alex,Shi wrote: On Thu, 2011-10-20 at 15:34 +0800, Alex,Shi wrote: percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs has a bit better performance since __this_cpu_xxx has no redundant preempt_disable() Avi: Would you like to give some comments of this? Sorry, was travelling: Acked-by: Avi Kivity a...@redhat.com And this one, picking up or comments are all appreciated. :)
Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
On Mon, Nov 21, 2011 at 9:12 AM, Lan Tianyu tianyu@intel.com wrote:

+/*Allocate clusters according to the size. Find a postion that
+ *can satisfy the size. free_clust_idx is initialized to zero and
+ *Record last position.
+*/

Can you please fix up your comments to use the following standard kernel style:

/*
 * Allocate clusters according to the size. Find a postion that
 * can satisfy the size. free_clust_idx is initialized to zero and
 * Record last position.
 */

[ Whitespace after asterisk and starting line doesn't have text. ]

-	rfb = qcow_read_refcount_block(q, clust_idx);
-	if (!rfb) {
-		pr_warning("L1: error while reading refcount table");
+	clust_new_start = qcow_alloc_clusters(q, q->cluster_size);
+	if (clust_new_start < 0) {
+		pr_warning("Cluster alloc error!");

Please drop the exclamation marks from pr_warning() and pr_error() messages. It just adds pointless noise.
Re: [PATCH 3/3] Code clean up for percpu_xxx() functions
I refreshed the patch on the latest upstream kernel. Any comments, or picking it up, would be appreciated. === From 0dce61dc88b8ed2687b4d5c0633aa54d1f66fdc0 Mon Sep 17 00:00:00 2001 From: Alex Shi alex@intel.com Date: Tue, 22 Nov 2011 00:05:37 +0800 Subject: [PATCH 3/3] Code clean up for percpu_xxx() functions The percpu_xxx() family of functions duplicates this_cpu_xxx(). Remove the percpu_xxx() definitions and replace their uses with this_cpu_xxx(). Furthermore, at Christoph Lameter's request, I use __this_cpu_xxx in place of this_cpu_xxx wherever the context is preempt-safe. The preempt-safe scenarios include: 1, in irq/softirq/nmi handler 2, protected by preempt_disable 3, protected by spin_lock 4, if the code context implies that it is preempt-safe, like code that follows or is followed by preempt-safe code. I left the xen code unchanged, since I have no idea about it. BTW, on x86 this_cpu_xxx is in fact the same as __this_cpu_xxx, since each is implemented as a single instruction. But it may be different on other platforms, so keeping this distinction helps other platforms' performance. 
Signed-off-by: Alex Shi alex@intel.com
Acked-by: Christoph Lameter c...@gentwo.org
---
 arch/x86/include/asm/current.h        |  2 +-
 arch/x86/include/asm/hardirq.h        |  9 +++--
 arch/x86/include/asm/irq_regs.h       |  4 +-
 arch/x86/include/asm/mmu_context.h    | 12
 arch/x86/include/asm/percpu.h         | 24 ++-
 arch/x86/include/asm/smp.h            |  4 +-
 arch/x86/include/asm/stackprotector.h |  4 +-
 arch/x86/include/asm/thread_info.h    |  2 +-
 arch/x86/include/asm/tlbflush.h       |  4 +-
 arch/x86/kernel/cpu/common.c          |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c      |  4 +-
 arch/x86/kernel/paravirt.c            | 12
 arch/x86/kernel/process_32.c          |  2 +-
 arch/x86/kernel/process_64.c          | 12
 arch/x86/mm/tlb.c                     | 10 +++---
 arch/x86/xen/enlighten.c              |  6 ++--
 arch/x86/xen/irq.c                    |  8 ++--
 arch/x86/xen/mmu.c                    | 20 ++--
 arch/x86/xen/multicalls.h             |  2 +-
 arch/x86/xen/smp.c                    |  2 +-
 include/linux/percpu.h                | 53 -
 include/linux/topology.h              |  4 +-
 22 files changed, 73 insertions(+), 129 deletions(-)

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 4d447b7..9476c04 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -11,7 +11,7 @@ DECLARE_PER_CPU(struct task_struct *, current_task);
 
 static __always_inline struct task_struct *get_current(void)
 {
-	return percpu_read_stable(current_task);
+	return this_cpu_read_stable(current_task);
 }
 
 #define current get_current()
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 55e4de6..2890444 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -35,14 +35,15 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
 #define __ARCH_IRQ_STAT
 
-#define inc_irq_stat(member)	percpu_inc(irq_stat.member)
+#define inc_irq_stat(member)	__this_cpu_inc(irq_stat.member)
 
-#define local_softirq_pending()	percpu_read(irq_stat.__softirq_pending)
+#define local_softirq_pending()	__this_cpu_read(irq_stat.__softirq_pending)
 
 #define __ARCH_SET_SOFTIRQ_PENDING
 
-#define set_softirq_pending(x)	percpu_write(irq_stat.__softirq_pending, (x))
-#define or_softirq_pending(x)	percpu_or(irq_stat.__softirq_pending, (x))
+#define set_softirq_pending(x)	\
+		__this_cpu_write(irq_stat.__softirq_pending, (x))
+#define or_softirq_pending(x)	__this_cpu_or(irq_stat.__softirq_pending, (x))
 
 extern void ack_bad_irq(unsigned int irq);
diff --git a/arch/x86/include/asm/irq_regs.h b/arch/x86/include/asm/irq_regs.h
index 7784322..15639ed 100644
--- a/arch/x86/include/asm/irq_regs.h
+++ b/arch/x86/include/asm/irq_regs.h
@@ -15,7 +15,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_regs);
 
 static inline struct pt_regs *get_irq_regs(void)
 {
-	return percpu_read(irq_regs);
+	return __this_cpu_read(irq_regs);
 }
 
 static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
@@ -23,7 +23,7 @@ static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
 	struct pt_regs *old_regs;
 
 	old_regs = get_irq_regs();
-	percpu_write(irq_regs, new_regs);
+	__this_cpu_write(irq_regs, new_regs);
 
 	return old_regs;
 }
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6902152..02ca533 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -25,8 +25,8 @@ void destroy_context(struct mm_struct *mm);
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
 #ifdef CONFIG_SMP
-	if
RE: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Yeah. I will fix them in the next version. -----Original Message----- From: penb...@gmail.com [mailto:penb...@gmail.com] On Behalf Of Pekka Enberg Sent: Monday, November 21, 2011 5:06 PM To: Lan, Tianyu Cc: kvm@vger.kernel.org; asias.he...@gmail.com; levinsasha...@gmail.com; prasadjoshi...@gmail.com; kw...@redhat.com Subject: Re: [PATCH] kvm tools, qcow: Add the support for copy-on-write clusters
Re: [PATCH 2/3] kvm: use this_cpu_xxx replace percpu_xxx funcs
On 11/21/2011 11:02 AM, Alex,Shi wrote: On Mon, 2011-10-24 at 19:05 +0800, Avi Kivity wrote: On 10/24/2011 04:50 AM, Alex,Shi wrote: On Thu, 2011-10-20 at 15:34 +0800, Alex,Shi wrote: percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs has a bit better performance since __this_cpu_xxx has no redundant preempt_disable() Avi: Would you like to give some comments of this? Sorry, was travelling: Acked-by: Avi Kivity a...@redhat.com And this one, picking up or comments are all appreciated. :) Just to be clear, you want this applied in kvm.git? -- error compiling committee.c: too many arguments to function
Re: [PATCH 2/3] kvm: use this_cpu_xxx replace percpu_xxx funcs
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs has a bit better performance since __this_cpu_xxx has no redundant preempt_disable() Avi: Would you like to give some comments of this? Sorry, was travelling: Acked-by: Avi Kivity a...@redhat.com And this one, picking up or comments are all appreciated. :) Just to be clear, you want this applied in kvm.git? Glad to be there! :)
[Qemu-devel] [PATCH] ivshmem: fix PCI BAR2 registration during initialization
From: Hongyong Zang zanghongy...@huawei.com Ivshmem cannot work, and the command lspci cannot show ivshmem BAR2 in the guest. As for pci_register_bar(), the MemoryRegion parameter should be &s->bar instead of &s->ivshmem. Signed-off-by: Hongyong Zang zanghongy...@huawei.com --- hw/ivshmem.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 242fbea..2ecf658 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -694,7 +694,7 @@ static int pci_ivshmem_init(PCIDevice *dev)
     s->peers = g_malloc0(s->nb_peers * sizeof(Peer));
 
     pci_register_bar(&s->dev, 2,
-                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->ivshmem);
+                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar);
 
     s->eventfd_chr = g_malloc0(s->vectors * sizeof(CharDriverState *));
-- 1.7.1
Re: [PATCH 5 of 5] virtio: expose added descriptors immediately
On Mon, Nov 21, 2011 at 12:18:45PM +1030, Rusty Russell wrote: On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin m...@redhat.com wrote: My unlocked kick patches will trip this warning: they make virtio-net do add + get without kick. Heh, it's a good sign if they do, since that means you're running really well :) They don't in fact, in my testing :(. But I think they can with luck. I think block with unlocked kick can trip it too: add, lock is dropped and then an interrupt can get. We also don't need a kick each num - each 2^15 is enough. Why don't we do this at start of add_buf: if (vq->num_added >= 0x7fff) return -ENOSPC; The warning was there in case a driver is never doing a kick, and getting away with it (mostly) because the device is polling. Let's not penalize good drivers to catch bad ones. How about we do this properly, like so: Absolutely. But I think we also need to handle num_added overflow of a 15 bit counter, no? Otherwise the vring_need_event logic might give us false negatives. I'm guessing we can just assume we need a kick in that case.

From: Rusty Russell ru...@rustcorp.com.au Subject: virtio: add debugging if driver doesn't kick. Under the existing #ifdef DEBUG, check that they don't have more than 1/10 of a second between an add_buf() and a virtqueue_notify()/virtqueue_kick_prepare() call. We could get false positives on a really busy system, but good for development. Signed-off-by: Rusty Russell ru...@rustcorp.com.au --- drivers/virtio/virtio_ring.c | 31 +++ 1 file changed, 31 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -22,6 +22,7 @@
 #include <linux/device.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/hrtimer.h>
 
 /* virtio guest is communicating with a virtual device that actually runs on
  * a host processor. Memory barriers are used to control SMP effects. */
@@ -102,6 +103,10 @@ struct vring_virtqueue
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
+
+	/* Figure out if their kicks are too delayed. */
+	bool last_add_time_valid;
+	ktime_t last_add_time;
 #endif
 
 	/* Tokens for callbacks. */
@@ -192,6 +197,19 @@ int virtqueue_add_buf(struct virtqueue *
 	BUG_ON(data == NULL);
 
+#ifdef DEBUG
+	{
+		ktime_t now = ktime_get();
+
+		/* No kick or get, with .1 second between?  Warn. */
+		if (vq->last_add_time_valid)
+			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+				> 100);
+		vq->last_add_time = now;
+		vq->last_add_time_valid = true;
+	}
+#endif
+
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && (out + in) > 1 && vq->num_free) {
@@ -291,6 +309,14 @@ bool virtqueue_kick_prepare(struct virtq
 	new = vq->vring.avail->idx;
 	vq->num_added = 0;
 
+#ifdef DEBUG
+	if (vq->last_add_time_valid) {
+		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
+					      vq->last_add_time)) > 100);
+	}
+	vq->last_add_time_valid = false;
+#endif
+
 	if (vq->event) {
 		needs_kick = vring_need_event(vring_avail_event(&vq->vring),
 					      new, old);
@@ -428,6 +454,10 @@ void *virtqueue_get_buf(struct virtqueue
 		virtio_mb();
 	}
 
+#ifdef DEBUG
+	vq->last_add_time_valid = false;
+#endif
+
 	END_USE(vq);
 	return ret;
 }
@@ -611,6 +641,7 @@ struct virtqueue *vring_new_virtqueue(un
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
 	vq->in_use = false;
+	vq->last_add_time_valid = false;
 #endif
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
[RFC] Consistent Snapshots Idea
I'm not an expert on the architecture of KVM, so perhaps this is a QEMU question. If so, please let me know and I'll ask on a different list. Background: Assuming the block layer can make instantaneous snapshots of a guest's disk (e.g. lvcreate -s), one can get crash consistent (i.e. as if the guest crashed) snapshots. To get a fully consistent snapshot, you need to shutdown the guest. For production VMs, this is obviously not ideal. Idea: What if KVM/QEMU was to fork() the guest and shutdown one copy? KVM/QEMU would momentarily halt the execution of the guest and take a writable, instantaneous snapshot of each block device. Then it would fork(). The parent would resume execution as normal. The child would redirect disk writes to the snapshot(s). The RAM should have copy-on-write behavior as with any other fork()ed process. Other resources like the network, display, sound, serial, etc. would simply be disconnected/bit-bucketed. Finally, the child would resume guest execution and send the guest an ACPI power button press event. This would cause the guest OS to perform an orderly shutdown. I believe this would provide consistent snapshots in the vast majority of real-world scenarios in a guest OS and application-independent way. Implementation Nits: * A timeout on the child process would likely be a good idea. * It'd probably be best to disconnect the network (i.e. tell the guest the cable is unplugged) to avoid long timeouts. Likewise for the hardware flow-control lines on the serial port. * For correctness, fdatasync()ing or similar might be necessary after halting execution and before creating the snapshots. Richard
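[Editor's note: the proposal's control flow can be sketched as below. Every vm_* helper is a hypothetical placeholder for the QEMU/KVM work the idea would require; only the fork()/process handling is concrete. This is a sketch of the proposal, not an implementation.]

```c
/* Control-flow sketch of the fork()-based snapshot idea. */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical placeholders for the real QEMU/KVM operations. */
static void vm_pause(void)                  { /* stop vcpus */ }
static void vm_resume(void)                 { /* restart vcpus */ }
static void vm_snapshot_disks(void)         { /* e.g. lvcreate -s per disk */ }
static void vm_use_snapshot_disks(void)     { /* redirect writes to snapshots */ }
static void vm_unplug_io(void)              { /* drop net/display/sound/serial */ }
static void vm_send_acpi_power_button(void) { /* ask guest for orderly shutdown */ }

static pid_t snapshot_and_shutdown_clone(void)
{
	pid_t child;

	vm_pause();
	vm_snapshot_disks();	/* writable, instantaneous */
	child = fork();		/* guest RAM becomes copy-on-write */
	if (child == 0) {
		vm_use_snapshot_disks();
		vm_unplug_io();
		vm_send_acpi_power_button();
		vm_resume();	/* child runs until the guest shuts down */
		_exit(0);
	}
	vm_resume();		/* parent continues serving as normal */
	return child;		/* caller waitpid()s, with a timeout */
}
```

The timeout nit from the proposal maps naturally onto the caller: waitpid() with WNOHANG in a loop, killing the child if the guest never completes its shutdown.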
Re: [Qemu-devel] [PATCH] ivshmem: fix PCI BAR2 registration during initialization
On 11/21/2011 12:56 PM, zanghongy...@huawei.com wrote: From: Hongyong Zang zanghongy...@huawei.com Ivshmem cannot work, and the command lspci cannot show ivshmem BAR2 in the guest. As for pci_register_bar(), the MemoryRegion parameter should be &s->bar instead of &s->ivshmem. Signed-off-by: Hongyong Zang zanghongy...@huawei.com Reviewed-by: Avi Kivity a...@redhat.com This is 1.0 worthy. -- error compiling committee.c: too many arguments to function
Re: [RFC] Consistent Snapshots Idea
On 11/21/2011 02:01 PM, Richard Laager wrote: I'm not an expert on the architecture of KVM, so perhaps this is a QEMU question. If so, please let me know and I'll ask on a different list. It is a qemu question, yes (though fork()ing a guest also relates to kvm). Background: Assuming the block layer can make instantaneous snapshots of a guest's disk (e.g. lvcreate -s), one can get crash consistent (i.e. as if the guest crashed) snapshots. To get a fully consistent snapshot, you need to shutdown the guest. For production VMs, this is obviously not ideal. Idea: What if KVM/QEMU was to fork() the guest and shutdown one copy? KVM/QEMU would momentarily halt the execution of the guest and take a writable, instantaneous snapshot of each block device. Then it would fork(). The parent would resume execution as normal. The child would redirect disk writes to the snapshot(s). The RAM should have copy-on-write behavior as with any other fork()ed process. Other resources like the network, display, sound, serial, etc. would simply be disconnected/bit-bucketed. Finally, the child would resume guest execution and send the guest an ACPI power button press event. This would cause the guest OS to perform an orderly shutdown. I believe this would provide consistent snapshots in the vast majority of real-world scenarios in a guest OS and application-independent way. Interesting idea. Will the guest actually shut down nicely without a network? Things like NFS mounts will break. Implementation Nits: * A timeout on the child process would likely be a good idea. * It'd probably be best to disconnect the network (i.e. tell the guest the cable is unplugged) to avoid long timeouts. Likewise for the hardware flow-control lines on the serial port. This is actually critical, otherwise the guest will shutdown(2) all sockets and confuse the clients. * For correctness, fdatasync()ing or similar might be necessary after halting execution and before creating the snapshots. 
Microsoft guests have an API to quiesce storage prior to a snapshot, and I think there is work to bring this to Linux guests. So it should be possible to get consistent snapshots even without this, but it takes more integration. -- error compiling committee.c: too many arguments to function
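[Editor's note: Avi is referring to guest-side quiesce hooks (Windows VSS and its Linux counterparts). The Linux kernel primitive such integration can build on is the FIFREEZE/FITHAW ioctl pair used by fsfreeze(8); a sketch, with hypothetical helper names, is below. This is an illustration of the mechanism, not code from the thread.]

```c
/* Sketch: freezing a mounted filesystem so a block-level snapshot taken
 * in between is filesystem-consistent. FIFREEZE flushes dirty data and
 * blocks new writes until FITHAW; it requires CAP_SYS_ADMIN. */
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

/* Returns an fd to keep open until thaw, or -1 on error. */
static int freeze_fs(const char *mountpoint)
{
	int fd = open(mountpoint, O_RDONLY);
	if (fd < 0)
		return -1;
	if (ioctl(fd, FIFREEZE, 0) < 0) {	/* flush + block writers */
		close(fd);
		return -1;
	}
	return fd;
}

static int thaw_fs(int fd)
{
	int ret = ioctl(fd, FITHAW, 0);		/* let writers proceed */
	close(fd);
	return ret;
}
```

Freeze, snapshot the block device, thaw: the snapshot then captures a quiesced filesystem rather than a merely crash-consistent one.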
[PATCH v2] kvm tools, qcow: Add the support for copy-on-write cluster
When a write request touches a cluster without the copied flag, allocate a new cluster and write the original data, with the modification, to the new cluster. This also adds support for the write path of qcow2 compressed images. After testing, the image file passes qemu-img check. The performance still needs to be improved. Signed-off-by: Lan Tianyu tianyu@intel.com --- tools/kvm/disk/qcow.c | 421 +- tools/kvm/include/kvm/qcow.h | 2 + 2 files changed, 292 insertions(+), 131 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index 680b37d..99b13e7 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -20,6 +20,16 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 
+
+static inline int qcow_pwrite_sync(int fd,
+		void *buf, size_t count, off_t offset)
+{
+	if (pwrite_in_full(fd, buf, count, offset) < 0)
+		return -1;
+
+	return fdatasync(fd);
+}
+
 static int l2_table_insert(struct rb_root *root, struct qcow_l2_table *new)
 {
 	struct rb_node **link = &(root->rb_node), *parent = NULL;
@@ -101,7 +111,8 @@ static int qcow_l2_cache_write(struct qcow *q, struct qcow_l2_table *c)
 
 	size = 1 << header->l2_bits;
 
-	if (pwrite_in_full(q->fd, c->table, size * sizeof(u64), c->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, c->table,
+			size * sizeof(u64), c->offset) < 0)
 		return -1;
 
 	c->dirty = 0;
@@ -122,9 +133,6 @@ static int cache_table(struct qcow *q, struct qcow_l2_table *c)
 		 */
 		lru = list_first_entry(&l1t->lru_list, struct qcow_l2_table, list);
 
-		if (qcow_l2_cache_write(q, lru) < 0)
-			goto error;
-
 		/* Remove the node from the cache */
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
@@ -518,14 +526,6 @@ static inline u64 file_size(int fd)
 	return st.st_size;
 }
 
-static inline int qcow_pwrite_sync(int fd, void *buf, size_t count, off_t offset)
-{
-	if (pwrite_in_full(fd, buf, count, offset) < 0)
-		return -1;
-
-	return fdatasync(fd);
-}
-
 /* Writes a level 2 table at the end of the file. */
 static u64 qcow_write_l2_table(struct qcow *q, u64 *table)
 {
@@ -601,7 +601,8 @@ static int write_refcount_block(struct qcow *q, struct qcow_refcount_block *rfb)
 	if (!rfb->dirty)
 		return 0;
 
-	if (pwrite_in_full(q->fd, rfb->entries, rfb->size * sizeof(u16), rfb->offset) < 0)
+	if (qcow_pwrite_sync(q->fd, rfb->entries,
+			rfb->size * sizeof(u16), rfb->offset) < 0)
 		return -1;
 
 	rfb->dirty = 0;
@@ -618,9 +619,6 @@ static int cache_refcount_block(struct qcow *q, struct qcow_refcount_block *c)
 	if (rft->nr_cached == MAX_CACHE_NODES) {
 		lru = list_first_entry(&rft->lru_list, struct qcow_refcount_block, list);
 
-		if (write_refcount_block(q, lru) < 0)
-			goto error;
-
 		rb_erase(&lru->node, r);
 		list_del_init(&lru->list);
 		rft->nr_cached--;
@@ -706,6 +704,11 @@ static struct qcow_refcount_block *qcow_read_refcount_block(struct qcow *q, u64
 	rfb_offset = be64_to_cpu(rft->rf_table[rft_idx]);
 
+	if (!rfb_offset) {
+		pr_warning("Don't support to grow refcount table");
+		return NULL;
+	}
+
 	rfb = refcount_block_search(q, rfb_offset);
 	if (rfb)
 		return rfb;
@@ -728,35 +731,140 @@
 error_free_rfb:
 	return NULL;
 }
 
+static u16 qcow_get_refcount(struct qcow *q, u64 clust_idx)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("Error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+
+	if (rfb_idx >= rfb->size) {
+		pr_warning("L1: refcount block index out of bounds");
+		return -1;
+	}
+
+	return be16_to_cpu(rfb->entries[rfb_idx]);
+}
+
+static int update_cluster_refcount(struct qcow *q, u64 clust_idx, u16 append)
+{
+	struct qcow_refcount_block *rfb = NULL;
+	struct qcow_header *header = q->header;
+	u16 refcount;
+	u64 rfb_idx;
+
+	rfb = qcow_read_refcount_block(q, clust_idx);
+	if (!rfb) {
+		pr_warning("error while reading refcount table");
+		return -1;
+	}
+
+	rfb_idx = clust_idx & (((1ULL <<
+		(header->cluster_bits - QCOW_REFCOUNT_BLOCK_SHIFT)) - 1));
+	if (rfb_idx >= rfb->size) {
+		pr_warning("refcount block index out of bounds");
+		return -1;
+	}
+
+	refcount = be16_to_cpu(rfb->entries[rfb_idx]) + append;
+
Re: [Qemu-devel] [RFC] Consistent Snapshots Idea
On 2011-11-21 20:31, Avi Kivity wrote: On 11/21/2011 02:01 PM, Richard Laager wrote: I'm not an expert on the architecture of KVM, so perhaps this is a QEMU question. If so, please let me know and I'll ask on a different list. It is a qemu question, yes (though fork()ing a guest also relates to kvm). Background: Assuming the block layer can make instantaneous snapshots of a guest's disk (e.g. lvcreate -s), one can get crash-consistent (i.e. as if the guest crashed) snapshots. To get a fully consistent snapshot, you need to shut down the guest. For production VMs, this is obviously not ideal. Idea: What if KVM/QEMU were to fork() the guest and shut down one copy? KVM/QEMU would momentarily halt the execution of the guest and take a writable, instantaneous snapshot of each block device. Then it would fork(). The parent would resume execution as normal. The child would redirect disk writes to the snapshot(s). The RAM would have copy-on-write behavior as with any other fork()ed process. Other resources like the network, display, sound, serial, etc. would simply be disconnected/bit-bucketed. Finally, the child would resume guest execution and send the guest an ACPI power button press event. This would cause the guest OS to perform an orderly shutdown. I believe this would provide consistent snapshots in the vast majority of real-world scenarios in a guest-OS- and application-independent way. Interesting idea. Will the guest actually shut down nicely without a network? Things like NFS mounts will break. Do the child and parent processes run in parallel? What will happen if the parent process tries to access the block device? It looks like the child process will write to a snapshot file, but where will the parent process write? Implementation nits: * A timeout on the child process would likely be a good idea. * It'd probably be best to disconnect the network (i.e. tell the guest the cable is unplugged) to avoid long timeouts.
Likewise for the hardware flow-control lines on the serial port. This is actually critical, otherwise the guest will shutdown(2) all sockets and confuse the clients. * For correctness, fdatasync()ing or similar might be necessary after halting execution and before creating the snapshots. Microsoft guests have an API to quiesce storage prior to a snapshot, and I think there is work to bring this to Linux guests. So it should be possible to get consistent snapshots even without this, but it takes more integration. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote: * Alexander Graf (ag...@suse.de) wrote: On 29.10.2011, at 20:45, Bharata B Rao wrote: As guests become NUMA-aware, it becomes important for the guests to have correct NUMA policies when they run on NUMA-aware hosts. Currently limited support for NUMA binding is available via libvirt where it is possible to apply a NUMA policy to the guest as a whole. However multinode guests would benefit if guest memory belonging to different guest nodes is mapped appropriately to different host NUMA nodes. To achieve this we would need QEMU to expose information about guest RAM ranges (Guest Physical Address - GPA) and their host virtual address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external tool like libvirt would be able to divide the guest RAM as per the guest NUMA node geometry and bind guest memory nodes to corresponding host memory nodes using HVA. This needs both QEMU (and libvirt) changes as well as changes in the kernel. Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in. I think that both Peter and Andrea are looking at this. Before we commit an API to QEMU that has a different semantic than a possible new kernel interface (that perhaps QEMU could use directly to inform the kernel of the binding/relationship between a vcpu thread and its memory at VM startup) it would be useful to see what these guys are working on... I looked at Peter's recent work in this area. (https://lkml.org/lkml/2011/11/17/204) It introduces two interfaces: 1. ms_tbind() to bind a thread to a memsched(*) group 2.
ms_mbind() to bind a memory region to a memsched group I assume the 2nd interface could be used by QEMU to create memsched groups for each of the guest NUMA node memory regions. In the past, Anthony has said that NUMA binding should be done from outside of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041) Though that was in a different context, maybe we should revisit that and see if QEMU still sticks to that. I know it's a bit early, but if needed we should ask Peter to consider extending ms_mbind() to take a tid parameter too instead of working on the current task by default. (*) memsched: An abstraction for representing coupling of threads with virtual address ranges. Threads and virtual address ranges of a memsched group are guaranteed (?) to be located on the same node. Regards, Bharata.
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote: I looked at Peter's recent work in this area. (https://lkml.org/lkml/2011/11/17/204) It introduces two interfaces: 1. ms_tbind() to bind a thread to a memsched(*) group 2. ms_mbind() to bind a memory region to memsched group I assume the 2nd interface could be used by QEMU to create memsched groups for each of guest NUMA node memory regions. No, you would need both, you'll need to group vcpu threads _and_ some vaddress space together. I understood QEMU currently uses a single big anonymous mmap() to allocate the guest memory, using this you could either use multiple or carve up the big alloc into virtual nodes by assigning different parts to different ms groups. Example: suppose you want to create a 2 node guest with 8 vcpus, create 2 ms groups, each with 4 vcpu threads and assign half the total guest mmap to either. In the past, Anthony has said that NUMA binding should be done from outside of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041) If you want to expose a sense of virtual NUMA to your guest you really have no choice there. The only thing you can do externally is run whole VMs inside one particular node. Though that was in a different context, may be we should re-look at that and see if QEMU still sticks to that. I know its a bit early, but if needed we should ask Peter to consider extending ms_mbind() to take a tid parameter too instead of working on current task by default. Uh, what for? ms_mbind() works on the current process, not task. (*) memsched: An abstraction for representing coupling of threads with virtual address ranges. Threads and virtual address ranges of a memsched group are guaranteed (?) to be located on the same node. Yeah, more or less so. We could relax that slightly to allow tasks to run away from the node for very short periods of time, but basically that's the provided guarantee. 
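Peter's 2-node / 8-vcpu example can be sketched as follows. This is a hypothetical sketch only: ms_tbind()/ms_mbind() were an RFC proposal (https://lkml.org/lkml/2011/11/17/204) that was never merged, so the signatures, the group-id convention, and the helper below are all assumptions made purely for illustration.

```c
/* Hypothetical sketch - ms_tbind()/ms_mbind() do not exist as real
 * syscalls; names, signatures and group ids are assumed. */

#define GUEST_RAM  (2UL << 30)          /* 2 GiB total guest memory */
#define NODE0_ID   1                    /* assumed group identifiers */
#define NODE1_ID   2

void setup_two_virtual_nodes(void *guest_ram, pid_t vcpu_tid[8])
{
	int i;

	/* Couple vcpus 0-3 with the first half of the guest mmap... */
	for (i = 0; i < 4; i++)
		ms_tbind(vcpu_tid[i], NODE0_ID);
	ms_mbind(guest_ram, GUEST_RAM / 2, NODE0_ID);

	/* ...and vcpus 4-7 with the second half.  The kernel is then
	 * free to place each group on whichever host node it likes, as
	 * long as a group's threads and memory stay together. */
	for (i = 0; i < 4; i++)
		ms_tbind(vcpu_tid[i + 4], NODE1_ID);
	ms_mbind((char *)guest_ram + GUEST_RAM / 2, GUEST_RAM / 2, NODE1_ID);
}
```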
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Mon, Nov 21, 2011 at 04:25:26PM +0100, Peter Zijlstra wrote: On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote: I looked at Peter's recent work in this area. (https://lkml.org/lkml/2011/11/17/204) It introduces two interfaces: 1. ms_tbind() to bind a thread to a memsched(*) group 2. ms_mbind() to bind a memory region to a memsched group I assume the 2nd interface could be used by QEMU to create memsched groups for each of the guest NUMA node memory regions. No, you would need both, you'll need to group vcpu threads _and_ some vaddress space together. I understood QEMU currently uses a single big anonymous mmap() to allocate the guest memory; using this you could either use multiple mmaps or carve up the big alloc into virtual nodes by assigning different parts to different ms groups. Example: suppose you want to create a 2-node guest with 8 vcpus, create 2 ms groups, each with 4 vcpu threads and assign half the total guest mmap to either. In the past, Anthony has said that NUMA binding should be done from outside of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041) If you want to expose a sense of virtual NUMA to your guest you really have no choice there. The only thing you can do externally is run whole VMs inside one particular node. Though that was in a different context, maybe we should revisit that and see if QEMU still sticks to that. I know it's a bit early, but if needed we should ask Peter to consider extending ms_mbind() to take a tid parameter too instead of working on the current task by default. Uh, what for? ms_mbind() works on the current process, not task. In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address - GPA) and their corresponding host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). The idea was to use these GPA to HVA mappings from tools like libvirt to bind specific parts of the guest RAM to different host nodes.
This needed an extension to the existing mbind() to allow binding memory of a process (QEMU) from a different process (libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA binding should all be done from outside of QEMU, such an extension is needed; otherwise what you have should be sufficient. Regards, Bharata.
KVM call agenda for November 22
Hi, Please send in any agenda items you are interested in covering. Later, Juan.
Re: Memory sync algorithm during migration
On Tue, Nov 15, 2011 at 11:47:58AM +0100, ext Juan Quintela wrote: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp wrote: Adding qemu-devel ML to CC. Your question should have been sent to the qemu-devel ML because the logic is implemented in QEMU, not KVM. (2011/11/11 1:35), Oliver Hookins wrote: Hi, I am performing some benchmarks on KVM migration on two different types of VM. One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM takes about 20 seconds to migrate on our hardware while the 32GB VM takes about a minute. With a reasonable amount of memory activity going on (in the hundreds of MB per second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never completes. Intuitively this tells me there is some watermarking of dirty pages going on that is not particularly efficient when the dirty-page ratio is high compared to total memory, but I may be completely incorrect. You can change the ratio IIRC. Hopefully, someone who knows QEMU well will tell you better ways. Takuya Could anybody fill me in on what might be going on here? We're using libvirt 0.8.2 and kvm-83-224.el5.centos.1 This is a pretty old qemu/kvm code base. In principle, it makes no sense that migration finishes with 32GB RAM but never with 4GB (intuitively, if anything, it should be the other way around). Do you have an easy test that makes the problem easily reproducible? Have you tried upstream qemu.git? (There are some improvements in that department.) I've just tried the qemu-kvm 0.14.1 tag, which seems to be the latest that builds on my platform. For some strange reason migrations always seem to fail in one direction with "Unknown savevm section or instance 'hpet' 0" messages. This seems to point to different migration protocols on either end, but they are both running the same version of qemu-kvm I built. Does this ring any bells for anyone?
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address - GPA) and their corresponding host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). The idea was to use these GPA to HVA mappings from tools like libvirt to bind specific parts of the guest RAM to different host nodes. This needed an extension to the existing mbind() to allow binding memory of a process (QEMU) from a different process (libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA binding should all be done from outside of QEMU, such an extension is needed; otherwise what you have should be sufficient. That's just retarded, and no you won't get such extensions. Poking at another process's virtual address space is just daft. Esp. if there's no actual reason for it. Furthermore, it would make libvirt a required part of qemu, and since I don't think I've ever used libvirt that's another reason to object, I don't need that stinking mess.
[PATCH for v1.0] qemu-kvm: msix: Fire mask notifier on global mask changes
From: Jan Kiszka jan.kis...@siemens.com

Also invoke the mask notifier if the global MSI-X mask is modified. For this purpose, we push the notifier call from the per-vector mask update to the central msix_handle_mask_update.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This needs to be applied on top of: "msix: avoid mask updates if mask is unchanged". I'll send a reminder once it's merged.

 hw/msix.c | 11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index 56422c6..e50074f 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -231,6 +231,12 @@ static void msix_handle_mask_update(PCIDevice *dev, int vector, bool was_masked)
         return;
     }

+    if (dev->msix_mask_notifier) {
+        int ret;
+        ret = dev->msix_mask_notifier(dev, vector, is_masked);
+        assert(ret >= 0);
+    }
+
     if (!is_masked && msix_is_pending(dev, vector)) {
         msix_clr_pending(dev, vector);
         msix_notify(dev, vector);
@@ -292,11 +298,6 @@ static void msix_mmio_write(void *opaque, target_phys_addr_t addr,
     if (kvm_enabled() && kvm_irqchip_in_kernel()) {
         kvm_msix_update(dev, vector, was_masked, msix_is_masked(dev, vector));
     }
-    if (was_masked != msix_is_masked(dev, vector) && dev->msix_mask_notifier) {
-        int r = dev->msix_mask_notifier(dev, vector,
-                                        msix_is_masked(dev, vector));
-        assert(r >= 0);
-    }
     msix_handle_mask_update(dev, vector, was_masked);
 }
--
1.7.5.53.gc233e
Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
On 11/18/2011 9:40 AM, Ben Hutchings wrote: On Fri, 2011-11-18 at 08:58 -0800, Greg Rose wrote: On 11/17/2011 4:44 PM, Ben Hutchings wrote: On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote: On 11/17/2011 4:15 PM, Ben Hutchings wrote: Sorry to come to this rather late. On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote: [...] v2 -> v3 - Moved set and get filter ops from rtnl_link_ops to netdev_ops - Support for SRIOV VFs. [Note: The get filters msg (in the way current get rtnetlink handles it) might get too big for SRIOV VFs. This patch follows existing SRIOV VF get code and tries to accommodate filters for all VFs in a PF. And for the SRIOV case I have only tested the fact that the VF arguments are getting delivered to rtnetlink correctly. The code follows existing SRIOV VF handling code so the rest of it should work fine] [...] This is already broken for large numbers of VFs, and increasing the amount of information per VF is going to make the situation worse. I am no netlink expert but I think that the current approach of bundling all information about an interface in a single message may not be sustainable. Also, I'm unclear on why this interface is to be used to set filtering for the (PF) net device as well as for related VFs. Doesn't that duplicate the functionality of ndo_set_rx_mode and ndo_vlan_rx_{add,kill}_vid? Functionally yes, but contextually no. This allows the PF driver to know that it is setting these filters in the context of the existence of VFs, allowing it to take appropriate action. The other two functions may be called without the presence of SR-IOV enablement and the existence of VFs. Anyway, that's why I asked Roopa to add that capability. I don't follow. The PF driver already knows whether it has enabled VFs. How do filters set this way interact with filters set through the existing operations? Should they override promiscuous mode? None of this has been specified. Promiscuous mode is exactly the issue this feature is intended for.
I'm not familiar with the Solarflare device, but Intel HW promiscuous mode is only promiscuous on the physical port, not on the VEB. So a packet sent from a VF will not be captured by the PF across the VEB unless the MAC and VLAN filters have been programmed into the HW. Yes, I get it. The hardware bridge needs to know more about the address configuration on the host than the driver is getting at the moment. So you may not need the feature for your devices but it is required for Intel devices. Well we don't have the hardware bridge but that means each VF driver needs to know whether to fall back to the software bridge. The net driver needs much the same additional information. And it's a fairly simple request, just allow -1 to indicate that the target of the filter requests is the PF itself. Using the already existing set_rx_mode function won't work because the PF driver will look at it and figure it's in promiscuous mode anyway, so it won't set the filters into the HW. At least that is how it is in the case of our HW and driver. Again, the behavior of your HW and driver is unknown to me and thus you may not require this feature. What concerns me is that this seems to be a workaround rather than a fix for over-use of promiscuous mode, and it changes the semantics of filtering modes in ways that haven't been well-specified. I feel the opposite is true. It allows a known set of receive filters so that you don't have to use promiscuous mode, which cuts down on overhead from processing packets the upper-layer stack isn't really interested in. What if there's a software bridge between two net devices corresponding to separate physical ports, so that they really need to be promiscuous? What if the administrator runs tcpdump and really wants the (PF) net device to be promiscuous? I don't believe there is anything in this patch set that removes promiscuous mode operation as it is commonly used. Perhaps I've missed something.
These cases shouldn't break because of VF acceleration. I don't see how they would in the case of Intel HW. And these new ops to set rx filters are something that Intel HW needs because it implements a VEB that is *not* a learning bridge and we don't want it to be a learning bridge because of security concerns. It is better for our requirements to be allowed to set the MAC/VLAN filters that we want a VF to be able to see. Or at least we should make a conscious and documented decision that 'promiscuous' doesn't mean that if you enable it on your network adapter. Again, I don't know of any plans to change anything relating to putting a device in promiscuous mode or changing the semantics of what promiscuous mode means. - Greg
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On 11/21/2011 05:25 PM, Peter Zijlstra wrote: On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote: I looked at Peter's recent work in this area. (https://lkml.org/lkml/2011/11/17/204) It introduces two interfaces: 1. ms_tbind() to bind a thread to a memsched(*) group 2. ms_mbind() to bind a memory region to memsched group I assume the 2nd interface could be used by QEMU to create memsched groups for each of guest NUMA node memory regions. No, you would need both, you'll need to group vcpu threads _and_ some vaddress space together. I understood QEMU currently uses a single big anonymous mmap() to allocate the guest memory, using this you could either use multiple or carve up the big alloc into virtual nodes by assigning different parts to different ms groups. Example: suppose you want to create a 2 node guest with 8 vcpus, create 2 ms groups, each with 4 vcpu threads and assign half the total guest mmap to either. Does ms_mbind() require that its vmas in its area be completely contained in the region, or does it split vmas on demand? I suggest the latter to avoid exposing implementation details. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Mon, 2011-11-21 at 20:03 +0200, Avi Kivity wrote: Does ms_mbind() require that its vmas in its area be completely contained in the region, or does it split vmas on demand? I suggest the latter to avoid exposing implementation details. As implemented (which is still rather incomplete) it does the split on demand like all other memory interfaces.
Re: [PATCHv4 0/3] acpi: DSDT/SSDT runtime patching
On Sun, Nov 20, 2011 at 04:08:59PM -0500, Kevin O'Connor wrote: On Sun, Nov 20, 2011 at 07:56:43PM +0200, Michael S. Tsirkin wrote: Here's an updated revision of the acpi runtime patching patchset. Lightly tested. It looks good to me. -Kevin Ran some Linux and Windows tests; things seem to work smoothly. Please apply. -- MST
Re: [Qemu-devel] KVM call agenda for November 22
On 11/21/2011 10:00 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. I'm technically on holiday this week so I won't be attending. But as an FYI, I ran across seccomp-nurse[1] this weekend. It more or less lets you write a Python program to implement a userspace syscall whitelist. I haven't looked at the code closely enough yet, but I think the technique it uses is to create a companion thread alongside the sandbox thread. This thread only runs code in an area mapped read-only and presumably only uses thread-local storage. The companion thread isn't running in the sandbox, but has the same resources as the sandbox thread so it can essentially invoke syscalls on behalf of the sandbox thread. It's seriously non-portable. In fact, it only works on 32-bit x86 Linux right now. But it's worth looking into. [1] http://chdir.org/~nico/seccomp-nurse/ Regards, Anthony Liguori Later, Juan.
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
* Peter Zijlstra (a.p.zijls...@chello.nl) wrote: On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote: In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address - GPA) and their corresponding host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). The idea was to use these GPA to HVA mappings from tools like libvirt to bind specific parts of the guest RAM to different host nodes. This needed an extension to the existing mbind() to allow binding memory of a process (QEMU) from a different process (libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA binding should all be done from outside of QEMU, such an extension is needed; otherwise what you have should be sufficient. That's just retarded, and no you won't get such extensions. Poking at another process's virtual address space is just daft. Esp. if there's no actual reason for it. Need to separate the binding vs the policy mgmt. The policy mgmt could still be done outside, whereas the binding could still be done from w/in QEMU. A simple monitor interface to rebalance vcpu memory allocations to different nodes could very well schedule vcpu thread work in QEMU. So, I agree, even if there is some external policy mgmt, it could still easily work w/ QEMU to use Peter's proposed interface. thanks, -chris
Re: [PATCH 3/3] Code clean up for percpu_xxx() functions
(cc'ing hpa and quoting whole body) On Mon, Nov 21, 2011 at 05:10:12PM +0800, Alex,Shi wrote: refreshed the patch on latest upstream kernel. Any comments or picking up are appreciated. === From 0dce61dc88b8ed2687b4d5c0633aa54d1f66fdc0 Mon Sep 17 00:00:00 2001 From: Alex Shi alex@intel.com Date: Tue, 22 Nov 2011 00:05:37 +0800 Subject: [PATCH 3/3] Code clean up for percpu_xxx() functions Since percpu_xxx() serial functions are duplicate with this_cpu_xxx(). Removing percpu_xxx() definition and replacing them by this_cpu_xxx() in code. And further more, as Christoph Lameter's requirement, I try to use __this_cpu_xx to replace this_cpu_xxx if it is in preempt safe scenario. The preempt safe scenarios include: 1, in irq/softirq/nmi handler 2, protected by preempt_disable 3, protected by spin_lock 4, if the code context imply that it is preempt safe, like the code is follows or be followed a preempt safe code. I left the xen code unchanged, since no idea of them. BTW, In fact, this_cpu_xxx are same as __this_cpu_xxx since all funcs implement in a single instruction for x86 machine. But it maybe different for other platforms, so, doing this distinguish is helpful for other platforms' performance. Signed-off-by: Alex Shi alex@intel.com Acked-by: Christoph Lameter c...@gentwo.org Acked-by: Tejun Heo t...@kernel.org hpa, I suppose this should go through x86? The original patch can be accessed at http://article.gmane.org/gmane.linux.kernel/1218055/raw Thanks. 
arch/x86/include/asm/current.h|2 +- arch/x86/include/asm/hardirq.h|9 +++-- arch/x86/include/asm/irq_regs.h |4 +- arch/x86/include/asm/mmu_context.h| 12 arch/x86/include/asm/percpu.h | 24 ++- arch/x86/include/asm/smp.h|4 +- arch/x86/include/asm/stackprotector.h |4 +- arch/x86/include/asm/thread_info.h|2 +- arch/x86/include/asm/tlbflush.h |4 +- arch/x86/kernel/cpu/common.c |2 +- arch/x86/kernel/cpu/mcheck/mce.c |4 +- arch/x86/kernel/paravirt.c| 12 arch/x86/kernel/process_32.c |2 +- arch/x86/kernel/process_64.c | 12 arch/x86/mm/tlb.c | 10 +++--- arch/x86/xen/enlighten.c |6 ++-- arch/x86/xen/irq.c|8 ++-- arch/x86/xen/mmu.c| 20 ++-- arch/x86/xen/multicalls.h |2 +- arch/x86/xen/smp.c|2 +- include/linux/percpu.h| 53 - include/linux/topology.h |4 +- 22 files changed, 73 insertions(+), 129 deletions(-) diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index 4d447b7..9476c04 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -11,7 +11,7 @@ DECLARE_PER_CPU(struct task_struct *, current_task); static __always_inline struct task_struct *get_current(void) { - return percpu_read_stable(current_task); + return this_cpu_read_stable(current_task); } #define current get_current() diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index 55e4de6..2890444 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -35,14 +35,15 @@ DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); #define __ARCH_IRQ_STAT -#define inc_irq_stat(member) percpu_inc(irq_stat.member) +#define inc_irq_stat(member) __this_cpu_inc(irq_stat.member) -#define local_softirq_pending() percpu_read(irq_stat.__softirq_pending) +#define local_softirq_pending() __this_cpu_read(irq_stat.__softirq_pending) #define __ARCH_SET_SOFTIRQ_PENDING -#define set_softirq_pending(x) percpu_write(irq_stat.__softirq_pending, (x)) -#define or_softirq_pending(x)percpu_or(irq_stat.__softirq_pending, (x)) +#define 
set_softirq_pending(x) \ + __this_cpu_write(irq_stat.__softirq_pending, (x)) +#define or_softirq_pending(x) __this_cpu_or(irq_stat.__softirq_pending, (x)) extern void ack_bad_irq(unsigned int irq); diff --git a/arch/x86/include/asm/irq_regs.h b/arch/x86/include/asm/irq_regs.h index 7784322..15639ed 100644 --- a/arch/x86/include/asm/irq_regs.h +++ b/arch/x86/include/asm/irq_regs.h @@ -15,7 +15,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_regs); static inline struct pt_regs *get_irq_regs(void) { - return percpu_read(irq_regs); + return __this_cpu_read(irq_regs); } static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs) @@ -23,7 +23,7 @@ static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs) struct pt_regs *old_regs; old_regs = get_irq_regs(); - percpu_write(irq_regs, new_regs); + __this_cpu_write(irq_regs, new_regs);
Re: [Qemu-devel] [PATCH] ivshmem: fix PCI BAR2 registration during initialization
On 11/21/2011 04:56 AM, zanghongy...@huawei.com wrote:

From: Hongyong Zang zanghongy...@huawei.com

Ivshmem cannot work, and the command lspci cannot show ivshmem BAR2 in the guest. As for pci_register_bar(), the MemoryRegion parameter should be s->bar instead of s->ivshmem.

Signed-off-by: Hongyong Zang zanghongy...@huawei.com

Applied. Thanks.

Regards,
Anthony Liguori

---
 hw/ivshmem.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 242fbea..2ecf658 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -694,7 +694,7 @@ static int pci_ivshmem_init(PCIDevice *dev)
     s->peers = g_malloc0(s->nb_peers * sizeof(Peer));

     pci_register_bar(&s->dev, 2,
-                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->ivshmem);
+                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar);

     s->eventfd_chr = g_malloc0(s->vectors * sizeof(CharDriverState *));
[FYI] Need to do a full rebuild if you are on Linux x86 host
Due to this commit:

commit 40d6444e91c6ab17e5e8ab01d4eece90cbc4afed
Author: Avi Kivity a...@redhat.com
Date:   Tue Nov 15 20:12:17 2011 +0200

    configure: build position independent executables on x86-Linux hosts

PIE binaries cannot be linked with non-PIE binaries, and make is not smart enough to rebuild when the CFLAGS have changed.

Regards, Anthony Liguori
Re: [PATCH 5 of 5] virtio: expose added descriptors immediately
On Mon, 21 Nov 2011 13:57:04 +0200, Michael S. Tsirkin m...@redhat.com wrote:
On Mon, Nov 21, 2011 at 12:18:45PM +1030, Rusty Russell wrote:
On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin m...@redhat.com wrote:

My unlocked kick patches will trip this warning: they make virtio-net do add + get without kick.

Heh, it's a good sign if they do, since that means you're running really well :)

They don't in fact, in my testing :(. But I think they can, with luck. I think block with unlocked kick can trip it too: add, the lock is dropped, and then an interrupt can do a get. We also don't need a kick each num - each 2^15 is enough. Why don't we do this at the start of add_buf:

	if (vq->num_added >= 0x7fff)
		return -ENOSPC;

The warning was there in case a driver is never doing a kick, and getting away with it (mostly) because the device is polling. Let's not penalize good drivers to catch bad ones. How about we do this properly, like so:

Absolutely. But I think we also need to handle num_added overflow of a 15 bit counter, no? Otherwise the vring_need_event logic might give us false negatives. I'm guessing we can just assume we need a kick in that case.

You're right. Thank you. My immediate reaction of "make it an unsigned long" doesn't work. Here's the diff to what I posted before:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -254,9 +254,10 @@ add_head:
 	vq->vring.avail->idx++;
 	vq->num_added++;

-	/* If you haven't kicked in this long, you're probably doing something
-	 * wrong. */
-	WARN_ON(vq->num_added > vq->vring.num);
+	/* This is very unlikely, but theoretically possible.  Kick
+	 * just in case. */
+	if (unlikely(vq->num_added == 65535))
+		virtqueue_kick(_vq);

 	pr_debug("Added buffer head %i to %p\n", head, vq);

 	END_USE(vq);
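The overflow concern above can be illustrated with a toy model (not the kernel's virtio_ring code; the struct and helper names here are hypothetical): the driver counts buffers added since the last kick in a 16-bit field, and forces a kick just before the counter would wrap, since a wrapped counter could make the event-index math give a false negative.

```c
#include <stdint.h>

/*
 * Toy stand-in for a virtqueue: only the pieces relevant to the
 * "kick on 16-bit counter overflow" discussion are modeled.
 */
struct toy_vq {
	uint16_t num_added;	/* adds since the last kick */
	int kicks;		/* how many kicks were forced */
};

static void toy_kick(struct toy_vq *vq)
{
	vq->kicks++;
	vq->num_added = 0;	/* a kick resets the counter */
}

static void toy_add_buf(struct toy_vq *vq)
{
	vq->num_added++;
	/* Very unlikely, but theoretically possible: kick just in case. */
	if (vq->num_added == 65535)
		toy_kick(vq);
}
```

Adding 65535 buffers without any explicit kick triggers exactly one forced kick and resets the counter, which is the behavior the diff above is after.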
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On 11/21/2011 04:50 PM, Chris Wright wrote: * Peter Zijlstra (a.p.zijls...@chello.nl) wrote: On Mon, 2011-11-21 at 21:30 +0530, Bharata B Rao wrote:

In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address - GPA) and their corresponding host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). The idea was to use this GPA to HVA mapping from tools like libvirt to bind specific parts of the guest RAM to different host nodes. This needed an extension to the existing mbind() to allow binding memory of one process (QEMU) from a different process (libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA binding should all be done from outside of QEMU, it is needed; otherwise what you have should be sufficient.

That's just retarded, and no you won't get such extensions. Poking at another process's virtual address space is just daft. Esp. if there's no actual reason for it.

Need to separate the binding vs the policy mgmt. The policy mgmt could still be done outside, whereas the binding could still be done from w/in QEMU. A simple monitor interface to rebalance vcpu memory allocations to different nodes could very well schedule vcpu thread work in QEMU.

I really would prefer to avoid having such an interface. It's a shotgun that will only result in many poor feet being maimed. I can't tell you the number of times I've encountered people using CPU pinning when they have absolutely no business doing CPU pinning. If we really believe such an interface should exist, then the interface should really be from the kernel. Once we have memgroups, there's no reason to involve QEMU at all. QEMU can define the memgroups based on the NUMA nodes, and then it's up to the kernel as to whether it exposes controls to explicitly bind memgroups within a process or not.
Regards, Anthony Liguori

So, I agree, even if there is some external policy mgmt, it could still easily work w/ QEMU to use Peter's proposed interface.

thanks,
-chris
[PATCH 0/7] KVM: x86 emulator: Use opcode::execute for some instructions
This patch set eats the remaining spaghetti in the emulator and cleans up the large plate. After this, only trivial cases will remain.

Passed emulator.flat test:
  SUMMARY: 90 tests, 0 failures

Thanks,
	Takuya
[PATCH 1/7] KVM: x86 emulator: Use opcode::execute for IN/OUT
IN : E4, E5, EC, ED
OUT: E6, E7, EE, EF

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   54 ---
 1 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8547958..8ba4ea8 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2776,6 +2776,24 @@ static int em_jcxz(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }

+static int em_in(struct x86_emulate_ctxt *ctxt)
+{
+	if (!pio_in_emulated(ctxt, ctxt->dst.bytes, ctxt->src.val,
+			     &ctxt->dst.val))
+		return X86EMUL_IO_NEEDED;
+
+	return X86EMUL_CONTINUE;
+}
+
+static int em_out(struct x86_emulate_ctxt *ctxt)
+{
+	ctxt->ops->pio_out_emulated(ctxt, ctxt->src.bytes, ctxt->dst.val,
+				    &ctxt->src.val, 1);
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_cli(struct x86_emulate_ctxt *ctxt)
 {
 	if (emulator_bad_iopl(ctxt))
@@ -3004,6 +3022,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 #define D2bv(_f)	D((_f) | ByteOp), D(_f)
 #define D2bvIP(_f, _i, _p) DIP((_f) | ByteOp, _i, _p), DIP(_f, _i, _p)
 #define I2bv(_f, _e)	I((_f) | ByteOp, _e), I(_f, _e)
+#define I2bvIP(_f, _e, _i, _p) \
+	IIP((_f) | ByteOp, _e, _i, _p), IIP(_f, _e, _i, _p)

 #define I6ALU(_f, _e)	I2bv((_f) | DstMem | SrcReg | ModRM, _e),	\
 		I2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e),	\
@@ -3217,13 +3237,13 @@ static struct opcode opcode_table[256] = {
 	/* 0xE0 - 0xE7 */
 	X3(I(SrcImmByte, em_loop)),
 	I(SrcImmByte, em_jcxz),
-	D2bvIP(SrcImmUByte | DstAcc, in,  check_perm_in),
-	D2bvIP(SrcAcc | DstImmUByte, out, check_perm_out),
+	I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
+	I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
 	/* 0xE8 - 0xEF */
 	D(SrcImm | Stack), D(SrcImm | ImplicitOps),
 	I(SrcImmFAddr | No64, em_jmp_far), D(SrcImmByte | ImplicitOps),
-	D2bvIP(SrcDX | DstAcc, in,  check_perm_in),
-	D2bvIP(SrcAcc | DstDX, out, check_perm_out),
+	I2bvIP(SrcDX | DstAcc, em_in,  in,  check_perm_in),
+	I2bvIP(SrcAcc | DstDX, em_out, out, check_perm_out),
 	/* 0xF0 - 0xF7 */
 	N, DI(ImplicitOps, icebp), N, N,
 	DI(ImplicitOps | Priv, hlt), D(ImplicitOps),
@@ -3325,6 +3345,7 @@ static struct opcode twobyte_table[256] = {
 #undef D2bv
 #undef D2bvIP
 #undef I2bv
+#undef I2bvIP
 #undef I6ALU

 static unsigned imm_size(struct x86_emulate_ctxt *ctxt)
@@ -3867,11 +3888,12 @@ special_insn:
 	case 0x6c:		/* insb */
 	case 0x6d:		/* insw/insd */
 		ctxt->src.val = ctxt->regs[VCPU_REGS_RDX];
-		goto do_io_in;
+		rc = em_in(ctxt);
+		break;
 	case 0x6e:		/* outsb */
 	case 0x6f:		/* outsw/outsd */
 		ctxt->dst.val = ctxt->regs[VCPU_REGS_RDX];
-		goto do_io_out;
+		rc = em_out(ctxt);
 		break;
 	case 0x70 ... 0x7f: /* jcc (short) */
 		if (test_cc(ctxt->b, ctxt->eflags))
@@ -3915,12 +3937,6 @@ special_insn:
 		ctxt->src.val = ctxt->regs[VCPU_REGS_RCX];
 		rc = em_grp2(ctxt);
 		break;
-	case 0xe4:	/* inb */
-	case 0xe5:	/* in */
-		goto do_io_in;
-	case 0xe6: /* outb */
-	case 0xe7: /* out */
-		goto do_io_out;
 	case 0xe8: /* call (near) */ {
 		long int rel = ctxt->src.val;
 		ctxt->src.val = (unsigned long) ctxt->_eip;
@@ -3933,20 +3949,6 @@ special_insn:
 		jmp_rel(ctxt, ctxt->src.val);
 		ctxt->dst.type = OP_NONE; /* Disable writeback. */
 		break;
-	case 0xec: /* in al,dx */
-	case 0xed: /* in (e/r)ax,dx */
-	do_io_in:
-		if (!pio_in_emulated(ctxt, ctxt->dst.bytes, ctxt->src.val,
-				     &ctxt->dst.val))
-			goto done; /* IO is needed */
-		break;
-	case 0xee: /* out dx,al */
-	case 0xef: /* out dx,(e/r)ax */
-	do_io_out:
-		ops->pio_out_emulated(ctxt, ctxt->src.bytes, ctxt->dst.val,
-				      &ctxt->src.val, 1);
-		ctxt->dst.type = OP_NONE;	/* Disable writeback. */
-		break;
 	case 0xf4:		/* hlt */
 		ctxt->ops->halt(ctxt);
 		break;
--
1.7.5.4
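The `I2bvIP` macro added by this patch follows the emulator's existing "2bv" convention: one invocation emits two opcode-table entries, a ByteOp variant and a word/dword variant. A simplified, self-contained sketch of that pair-expansion idea (the structs, flags, and handler names here are toys, not the kernel's):

```c
/* Simplified stand-in for the emulator's opcode-table machinery. */
#define BYTE_OP 0x1

struct toy_opcode {
	int flags;
	int (*execute)(void);
};

#define I(_f, _e)	{ (_f), (_e) }
/* Like the kernel's I2bv(): one ByteOp entry plus one word entry. */
#define I2BV(_f, _e)	I((_f) | BYTE_OP, _e), I(_f, _e)

static int toy_in(void)  { return 1; }
static int toy_out(void) { return 2; }

static struct toy_opcode table[] = {
	I2BV(0x10, toy_in),	/* expands to entries 0 and 1 */
	I2BV(0x20, toy_out),	/* expands to entries 2 and 3 */
};
```

Each `I2BV` line occupies two consecutive table slots, which is why the opcode table can cover byte and word encodings of IN/OUT with a single macro invocation per pair.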
[PATCH 2/7] KVM: x86 emulator: Use opcode::execute for BT family
BT : 0F A3
BTS: 0F AB
BTR: 0F B3
BTC: 0F BB
Group 8: 0F BA

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   77 +++
 1 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8ba4ea8..7a9ce6d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2813,6 +2813,35 @@ static int em_sti(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }

+static int em_bt(struct x86_emulate_ctxt *ctxt)
+{
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	/* only subword offset */
+	ctxt->src.val &= (ctxt->dst.bytes << 3) - 1;
+
+	emulate_2op_SrcV_nobyte(ctxt, "bt");
+	return X86EMUL_CONTINUE;
+}
+
+static int em_bts(struct x86_emulate_ctxt *ctxt)
+{
+	emulate_2op_SrcV_nobyte(ctxt, "bts");
+	return X86EMUL_CONTINUE;
+}
+
+static int em_btr(struct x86_emulate_ctxt *ctxt)
+{
+	emulate_2op_SrcV_nobyte(ctxt, "btr");
+	return X86EMUL_CONTINUE;
+}
+
+static int em_btc(struct x86_emulate_ctxt *ctxt)
+{
+	emulate_2op_SrcV_nobyte(ctxt, "btc");
+	return X86EMUL_CONTINUE;
+}
+
 static bool valid_cr(int nr)
 {
 	switch (nr) {
@@ -3117,10 +3146,10 @@ static struct group_dual group7 = { {

 static struct opcode group8[] = {
 	N, N, N, N,
-	D(DstMem | SrcImmByte | ModRM),
-	D(DstMem | SrcImmByte | ModRM | Lock | PageTable),
-	D(DstMem | SrcImmByte | ModRM | Lock),
-	D(DstMem | SrcImmByte | ModRM | Lock | PageTable),
+	I(DstMem | SrcImmByte | ModRM, em_bt),
+	I(DstMem | SrcImmByte | ModRM | Lock | PageTable, em_bts),
+	I(DstMem | SrcImmByte | ModRM | Lock, em_btr),
+	I(DstMem | SrcImmByte | ModRM | Lock | PageTable, em_btc),
 };

 static struct group_dual group9 = { {
@@ -3299,26 +3328,27 @@ static struct opcode twobyte_table[256] = {
 	X16(D(ByteOp | DstMem | SrcNone | ModRM | Mov)),
 	/* 0xA0 - 0xA7 */
 	I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
-	DI(ImplicitOps, cpuid), D(DstMem | SrcReg | ModRM | BitOp),
+	DI(ImplicitOps, cpuid), I(DstMem | SrcReg | ModRM | BitOp, em_bt),
 	D(DstMem | SrcReg | Src2ImmByte | ModRM),
 	D(DstMem | SrcReg | Src2CL | ModRM), N, N,
 	/* 0xA8 - 0xAF */
 	I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
 	DI(ImplicitOps, rsm),
-	D(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable),
+	I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
 	D(DstMem | SrcReg | Src2ImmByte | ModRM),
 	D(DstMem | SrcReg | Src2CL | ModRM),
 	D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
 	/* 0xB0 - 0xB7 */
 	D2bv(DstMem | SrcReg | ModRM | Lock | PageTable),
 	I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
-	D(DstMem | SrcReg | ModRM | BitOp | Lock),
+	I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
 	I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
 	I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
 	D(ByteOp | DstReg | SrcMem | ModRM | Mov),
 	D(DstReg | SrcMem16 | ModRM | Mov),
 	/* 0xB8 - 0xBF */
 	N, N,
-	G(BitOp, group8), D(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable),
+	G(BitOp, group8),
+	I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
 	D(DstReg | SrcMem | ModRM), D(DstReg | SrcMem | ModRM),
 	D(ByteOp | DstReg | SrcMem | ModRM | Mov),
 	D(DstReg | SrcMem16 | ModRM | Mov),
 	/* 0xC0 - 0xCF */
@@ -4103,21 +4133,10 @@ twobyte_insn:
 	case 0x90 ... 0x9f:     /* setcc r/m8 */
 		ctxt->dst.val = test_cc(ctxt->b, ctxt->eflags);
 		break;
-	case 0xa3:
-	      bt:		/* bt */
-		ctxt->dst.type = OP_NONE;
-		/* only subword offset */
-		ctxt->src.val &= (ctxt->dst.bytes << 3) - 1;
-		emulate_2op_SrcV_nobyte(ctxt, "bt");
-		break;
 	case 0xa4: /* shld imm8, r, r/m */
 	case 0xa5: /* shld cl, r, r/m */
 		emulate_2op_cl(ctxt, "shld");
 		break;
-	case 0xab:
-	      bts:		/* bts */
-		emulate_2op_SrcV_nobyte(ctxt, "bts");
-		break;
 	case 0xac: /* shrd imm8, r, r/m */
 	case 0xad: /* shrd cl, r, r/m */
 		emulate_2op_cl(ctxt, "shrd");
@@ -4141,31 +4160,11 @@ twobyte_insn:
 		ctxt->dst.addr.reg = (unsigned long *)&ctxt->regs[VCPU_REGS_RAX];
 		}
 		break;
-	case 0xb3:
-	      btr:		/* btr */
-		emulate_2op_SrcV_nobyte(ctxt, "btr");
-		break;
 	case 0xb6 ... 0xb7:	/* movzx */
 		ctxt->dst.bytes = ctxt->op_bytes;
 		ctxt->dst.val = (ctxt->d & ByteOp) ? (u8) ctxt->src.val
						   : (u16)
[PATCH 3/7] KVM: x86 emulator: Use opcode::execute for CALL
CALL: E8

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7a9ce6d..6b7a03b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2482,6 +2482,15 @@ static int em_das(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }

+static int em_call(struct x86_emulate_ctxt *ctxt)
+{
+	long rel = ctxt->src.val;
+
+	ctxt->src.val = (unsigned long)ctxt->_eip;
+	jmp_rel(ctxt, rel);
+	return em_push(ctxt);
+}
+
 static int em_call_far(struct x86_emulate_ctxt *ctxt)
 {
 	u16 sel, old_cs;
@@ -3269,7 +3278,7 @@ static struct opcode opcode_table[256] = {
 	I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
 	I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
 	/* 0xE8 - 0xEF */
-	D(SrcImm | Stack), D(SrcImm | ImplicitOps),
+	I(SrcImm | Stack, em_call), D(SrcImm | ImplicitOps),
 	I(SrcImmFAddr | No64, em_jmp_far), D(SrcImmByte | ImplicitOps),
 	I2bvIP(SrcDX | DstAcc, em_in,  in,  check_perm_in),
 	I2bvIP(SrcAcc | DstDX, em_out, out, check_perm_out),
@@ -3967,13 +3976,6 @@ special_insn:
 		ctxt->src.val = ctxt->regs[VCPU_REGS_RCX];
 		rc = em_grp2(ctxt);
 		break;
-	case 0xe8: /* call (near) */ {
-		long int rel = ctxt->src.val;
-		ctxt->src.val = (unsigned long) ctxt->_eip;
-		jmp_rel(ctxt, rel);
-		rc = em_push(ctxt);
-		break;
-	}
 	case 0xe9: /* jmp rel */
 	case 0xeb: /* jmp rel short */
 		jmp_rel(ctxt, ctxt->src.val);
--
1.7.5.4
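For reference, near CALL does exactly two things, which is what em_call() mirrors: push the address of the next instruction, then add the signed displacement to the instruction pointer. A toy model (the struct and names are illustrative, not the emulator's real data structures):

```c
#include <stdint.h>

/* Toy CPU with just enough state to model CALL rel32. */
struct toy_cpu {
	uint32_t eip;
	uint32_t stack[16];
	int sp;			/* index of next free stack slot */
};

static void toy_call_near(struct toy_cpu *c, int32_t rel)
{
	c->stack[c->sp++] = c->eip;	/* push return address */
	c->eip += (uint32_t)rel;	/* relative jump (rel may be negative) */
}
```

Note the order matters: the pushed value is the address of the instruction after the CALL, not the jump target, which is why em_call() stashes `_eip` in src.val before calling jmp_rel().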
[PATCH 4/7] KVM: x86 emulator: Use opcode::execute for MOV to cr/dr
MOV: 0F 22 (move to control registers)
MOV: 0F 23 (move to debug registers)

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   52 +++
 1 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6b7a03b..7fe5ed1 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2638,6 +2638,34 @@ static int em_mov(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }

+static int em_cr_write(struct x86_emulate_ctxt *ctxt)
+{
+	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
+		return emulate_gp(ctxt, 0);
+
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	return X86EMUL_CONTINUE;
+}
+
+static int em_dr_write(struct x86_emulate_ctxt *ctxt)
+{
+	unsigned long val;
+
+	if (ctxt->mode == X86EMUL_MODE_PROT64)
+		val = ctxt->src.val & ~0ULL;
+	else
+		val = ctxt->src.val & ~0U;
+
+	/* #UD condition is already handled. */
+	if (ctxt->ops->set_dr(ctxt, ctxt->modrm_reg, val) < 0)
+		return emulate_gp(ctxt, 0);
+
+	/* Disable writeback. */
+	ctxt->dst.type = OP_NONE;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_mov_rm_sreg(struct x86_emulate_ctxt *ctxt)
 {
 	if (ctxt->modrm_reg > VCPU_SREG_GS)
@@ -3304,8 +3332,8 @@ static struct opcode twobyte_table[256] = {
 	/* 0x20 - 0x2F */
 	DIP(ModRM | DstMem | Priv | Op3264, cr_read, check_cr_read),
 	DIP(ModRM | DstMem | Priv | Op3264, dr_read, check_dr_read),
-	DIP(ModRM | SrcMem | Priv | Op3264, cr_write, check_cr_write),
-	DIP(ModRM | SrcMem | Priv | Op3264, dr_write, check_dr_write),
+	IIP(ModRM | SrcMem | Priv | Op3264, em_cr_write, cr_write, check_cr_write),
+	IIP(ModRM | SrcMem | Priv | Op3264, em_dr_write, dr_write, check_dr_write),
 	N, N, N, N,
 	N, N, N, N, N, N, N, N,
 	/* 0x30 - 0x3F */
@@ -4080,26 +4108,6 @@ twobyte_insn:
 	case 0x21: /* mov from dr to reg */
 		ops->get_dr(ctxt, ctxt->modrm_reg, &ctxt->dst.val);
 		break;
-	case 0x22: /* mov reg, cr */
-		if (ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val)) {
-			emulate_gp(ctxt, 0);
-			rc = X86EMUL_PROPAGATE_FAULT;
-			goto done;
-		}
-		ctxt->dst.type = OP_NONE;
-		break;
-	case 0x23: /* mov from reg to dr */
-		if (ops->set_dr(ctxt, ctxt->modrm_reg, ctxt->src.val &
-				((ctxt->mode == X86EMUL_MODE_PROT64) ?
-				 ~0ULL : ~0U)) < 0) {
-			/* #UD condition is already handled by the code above */
-			emulate_gp(ctxt, 0);
-			rc = X86EMUL_PROPAGATE_FAULT;
-			goto done;
-		}
-
-		ctxt->dst.type = OP_NONE;	/* no writeback */
-		break;
 	case 0x30: /* wrmsr */
 		msr_data = (u32)ctxt->regs[VCPU_REGS_RAX]
--
1.7.5.4
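The masking in em_dr_write() is worth spelling out: outside 64-bit mode the value moved into a debug register is truncated to 32 bits (`src & ~0U`), while in 64-bit mode it passes through unchanged (`src & ~0ULL`). A small sketch of just that expression (the helper name is hypothetical):

```c
#include <stdint.h>

/*
 * Mirrors the mode-dependent masking em_dr_write() applies before
 * handing the value to set_dr(): ~0U promotes to 0x00000000ffffffff,
 * so the upper 32 bits are dropped outside 64-bit mode.
 */
static uint64_t toy_dr_value(uint64_t src, int mode64)
{
	return mode64 ? (src & ~0ULL) : (src & ~0U);
}
```

The subtlety is C's integer promotion: `~0U` is a 32-bit all-ones value that zero-extends to 64 bits, which is exactly the truncation the instruction needs.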
[PATCH 5/7] KVM: x86 emulator: Use opcode::execute for WRMSR/RDMSR
WRMSR: 0F 30
RDMSR: 0F 32

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   52
 1 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7fe5ed1..906c5eb 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2666,6 +2666,30 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }

+static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
+{
+	u64 msr_data;
+
+	msr_data = (u32)ctxt->regs[VCPU_REGS_RAX]
+		| ((u64)ctxt->regs[VCPU_REGS_RDX] << 32);
+	if (ctxt->ops->set_msr(ctxt, ctxt->regs[VCPU_REGS_RCX], msr_data))
+		return emulate_gp(ctxt, 0);
+
+	return X86EMUL_CONTINUE;
+}
+
+static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
+{
+	u64 msr_data;
+
+	if (ctxt->ops->get_msr(ctxt, ctxt->regs[VCPU_REGS_RCX], &msr_data))
+		return emulate_gp(ctxt, 0);
+
+	ctxt->regs[VCPU_REGS_RAX] = (u32)msr_data;
+	ctxt->regs[VCPU_REGS_RDX] = msr_data >> 32;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_mov_rm_sreg(struct x86_emulate_ctxt *ctxt)
 {
 	if (ctxt->modrm_reg > VCPU_SREG_GS)
@@ -3337,9 +3361,9 @@ static struct opcode twobyte_table[256] = {
 	N, N, N, N,
 	N, N, N, N, N, N, N, N,
 	/* 0x30 - 0x3F */
-	DI(ImplicitOps | Priv, wrmsr),
+	II(ImplicitOps | Priv, em_wrmsr, wrmsr),
 	IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
-	DI(ImplicitOps | Priv, rdmsr),
+	II(ImplicitOps | Priv, em_rdmsr, rdmsr),
 	DIP(ImplicitOps | Priv, rdpmc, check_rdpmc),
 	I(ImplicitOps | VendorSpecific, em_sysenter),
 	I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
@@ -3818,7 +3842,6 @@ static bool string_insn_completed(struct x86_emulate_ctxt *ctxt)
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 {
 	struct x86_emulate_ops *ops = ctxt->ops;
-	u64 msr_data;
 	int rc = X86EMUL_CONTINUE;
 	int saved_dst_type = ctxt->dst.type;
@@ -4108,29 +4131,6 @@ twobyte_insn:
 	case 0x21: /* mov from dr to reg */
 		ops->get_dr(ctxt, ctxt->modrm_reg, &ctxt->dst.val);
 		break;
-	case 0x30:
-		/* wrmsr */
-		msr_data = (u32)ctxt->regs[VCPU_REGS_RAX]
-			| ((u64)ctxt->regs[VCPU_REGS_RDX] << 32);
-		if (ops->set_msr(ctxt, ctxt->regs[VCPU_REGS_RCX], msr_data)) {
-			emulate_gp(ctxt, 0);
-			rc = X86EMUL_PROPAGATE_FAULT;
-			goto done;
-		}
-		rc = X86EMUL_CONTINUE;
-		break;
-	case 0x32:
-		/* rdmsr */
-		if (ops->get_msr(ctxt, ctxt->regs[VCPU_REGS_RCX], &msr_data)) {
-			emulate_gp(ctxt, 0);
-			rc = X86EMUL_PROPAGATE_FAULT;
-			goto done;
-		} else {
-			ctxt->regs[VCPU_REGS_RAX] = (u32)msr_data;
-			ctxt->regs[VCPU_REGS_RDX] = msr_data >> 32;
-		}
-		rc = X86EMUL_CONTINUE;
-		break;
 	case 0x40 ... 0x4f:	/* cmov */
 		ctxt->dst.val = ctxt->dst.orig_val = ctxt->src.val;
 		if (!test_cc(ctxt->b, ctxt->eflags))
--
1.7.5.4
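The core of em_wrmsr()/em_rdmsr() is just moving a 64-bit MSR value through the EDX:EAX register pair: WRMSR combines the two 32-bit halves, RDMSR splits them back. A self-contained sketch of those two expressions (helper names are illustrative):

```c
#include <stdint.h>

/* WRMSR: low 32 bits come from EAX, high 32 bits from EDX. */
static uint64_t toy_wrmsr_value(uint64_t rax, uint64_t rdx)
{
	return (uint32_t)rax | ((uint64_t)rdx << 32);
}

/* RDMSR: split the MSR value back into EAX (low) and EDX (high). */
static void toy_rdmsr_regs(uint64_t msr, uint64_t *rax, uint64_t *rdx)
{
	*rax = (uint32_t)msr;
	*rdx = msr >> 32;
}
```

The `(uint64_t)` cast before the shift matters: shifting a 32-bit value left by 32 would be undefined, which is why the kernel code casts RDX to u64 first.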
[PATCH 6/7] KVM: x86 emulator: Use opcode::execute for CMPXCHG
CMPXCHG: 0F B0, 0F B1

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   37 +++--
 1 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 906c5eb..799000d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1832,6 +1832,24 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
 	return rc;
 }

+static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
+{
+	/* Save real source value, then compare EAX against destination. */
+	ctxt->src.orig_val = ctxt->src.val;
+	ctxt->src.val = ctxt->regs[VCPU_REGS_RAX];
+	emulate_2op_SrcV(ctxt, "cmp");
+
+	if (ctxt->eflags & EFLG_ZF) {
+		/* Success: write back to memory. */
+		ctxt->dst.val = ctxt->src.orig_val;
+	} else {
+		/* Failure: write the value we saw to EAX. */
+		ctxt->dst.type = OP_REG;
+		ctxt->dst.addr.reg = (unsigned long *)&ctxt->regs[VCPU_REGS_RAX];
+	}
+	return X86EMUL_CONTINUE;
+}
+
 static int em_lseg(struct x86_emulate_ctxt *ctxt)
 {
 	int seg = ctxt->src2.val;
@@ -3400,7 +3418,7 @@ static struct opcode twobyte_table[256] = {
 	D(DstMem | SrcReg | Src2CL | ModRM),
 	D(ModRM), I(DstReg | SrcMem | ModRM, em_imul),
 	/* 0xB0 - 0xB7 */
-	D2bv(DstMem | SrcReg | ModRM | Lock | PageTable),
+	I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg),
 	I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
 	I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
 	I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
@@ -4153,23 +4171,6 @@ twobyte_insn:
 		break;
 	case 0xae:		/* clflush */
 		break;
-	case 0xb0 ... 0xb1:	/* cmpxchg */
-		/*
-		 * Save real source value, then compare EAX against
-		 * destination.
-		 */
-		ctxt->src.orig_val = ctxt->src.val;
-		ctxt->src.val = ctxt->regs[VCPU_REGS_RAX];
-		emulate_2op_SrcV(ctxt, "cmp");
-		if (ctxt->eflags & EFLG_ZF) {
-			/* Success: write back to memory. */
-			ctxt->dst.val = ctxt->src.orig_val;
-		} else {
-			/* Failure: write the value we saw to EAX. */
-			ctxt->dst.type = OP_REG;
-			ctxt->dst.addr.reg = (unsigned long *)&ctxt->regs[VCPU_REGS_RAX];
-		}
-		break;
 	case 0xb6 ... 0xb7:	/* movzx */
 		ctxt->dst.bytes = ctxt->op_bytes;
 		ctxt->dst.val = (ctxt->d & ByteOp) ? (u8) ctxt->src.val
--
1.7.5.4
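The semantics em_cmpxchg() implements can be modeled compactly: compare the accumulator with the destination; on a match, write the source to the destination and set ZF; on a mismatch, load the current destination value into the accumulator and clear ZF. A toy version (names are illustrative, not the emulator's):

```c
#include <stdint.h>

/*
 * Toy CMPXCHG: returns the resulting ZF (1 = success).
 * dst plays the role of the memory operand, rax the accumulator.
 */
static int toy_cmpxchg(uint64_t *dst, uint64_t *rax, uint64_t src)
{
	if (*dst == *rax) {
		*dst = src;	/* success: write back, ZF = 1 */
		return 1;
	}
	*rax = *dst;		/* failure: expose current value, ZF = 0 */
	return 0;
}
```

This is also why the emulator flips the writeback target: on success the destination (memory) is written, on failure the accumulator register is.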
[PATCH 7/7] KVM: x86 emulator: Use opcode::execute for BSF/BSR
BSF: 0F BC
BSR: 0F BD

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/emulate.c |   60
 1 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 799000d..4cd3313 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2921,6 +2921,40 @@ static int em_btc(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }

+static int em_bsf(struct x86_emulate_ctxt *ctxt)
+{
+	u8 zf;
+
+	__asm__ ("bsf %2, %0; setz %1"
+		 : "=r"(ctxt->dst.val), "=q"(zf)
+		 : "r"(ctxt->src.val));
+
+	ctxt->eflags &= ~X86_EFLAGS_ZF;
+	if (zf) {
+		ctxt->eflags |= X86_EFLAGS_ZF;
+		/* Disable writeback. */
+		ctxt->dst.type = OP_NONE;
+	}
+	return X86EMUL_CONTINUE;
+}
+
+static int em_bsr(struct x86_emulate_ctxt *ctxt)
+{
+	u8 zf;
+
+	__asm__ ("bsr %2, %0; setz %1"
+		 : "=r"(ctxt->dst.val), "=q"(zf)
+		 : "r"(ctxt->src.val));
+
+	ctxt->eflags &= ~X86_EFLAGS_ZF;
+	if (zf) {
+		ctxt->eflags |= X86_EFLAGS_ZF;
+		/* Disable writeback. */
+		ctxt->dst.type = OP_NONE;
+	}
+	return X86EMUL_CONTINUE;
+}
+
 static bool valid_cr(int nr)
 {
 	switch (nr) {
@@ -3428,7 +3462,7 @@ static struct opcode twobyte_table[256] = {
 	N, N,
 	G(BitOp, group8),
 	I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
-	D(DstReg | SrcMem | ModRM), D(DstReg | SrcMem | ModRM),
+	I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr),
 	D(ByteOp | DstReg | SrcMem | ModRM | Mov),
 	D(DstReg | SrcMem16 | ModRM | Mov),
 	/* 0xC0 - 0xCF */
 	D2bv(DstMem | SrcReg | ModRM | Lock),
@@ -4176,30 +4210,6 @@ twobyte_insn:
 		ctxt->dst.val = (ctxt->d & ByteOp) ? (u8) ctxt->src.val
						   : (u16) ctxt->src.val;
 		break;
-	case 0xbc: {		/* bsf */
-		u8 zf;
-		__asm__ ("bsf %2, %0; setz %1"
-			 : "=r"(ctxt->dst.val), "=q"(zf)
-			 : "r"(ctxt->src.val));
-		ctxt->eflags &= ~X86_EFLAGS_ZF;
-		if (zf) {
-			ctxt->eflags |= X86_EFLAGS_ZF;
-			ctxt->dst.type = OP_NONE;	/* Disable writeback. */
-		}
-		break;
-	}
-	case 0xbd: {		/* bsr */
-		u8 zf;
-		__asm__ ("bsr %2, %0; setz %1"
-			 : "=r"(ctxt->dst.val), "=q"(zf)
-			 : "r"(ctxt->src.val));
-		ctxt->eflags &= ~X86_EFLAGS_ZF;
-		if (zf) {
-			ctxt->eflags |= X86_EFLAGS_ZF;
-			ctxt->dst.type = OP_NONE;	/* Disable writeback. */
-		}
-		break;
-	}
 	case 0xbe ... 0xbf:	/* movsx */
 		ctxt->dst.bytes = ctxt->op_bytes;
 		ctxt->dst.val = (ctxt->d & ByteOp) ? (s8) ctxt->src.val :
--
1.7.5.4
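The inline asm above relies on BSF/BSR behavior that can also be expressed portably with compiler builtins: find the lowest/highest set bit; if the source is zero, ZF is set and the destination is left unmodified. A sketch (function names are hypothetical; the return value models ZF, 1 meaning the source was zero):

```c
#include <stdint.h>

static int toy_bsf(uint64_t src, uint64_t *dst)
{
	if (src == 0)
		return 1;			/* ZF set, no writeback */
	*dst = (uint64_t)__builtin_ctzll(src);	/* index of lowest set bit */
	return 0;
}

static int toy_bsr(uint64_t src, uint64_t *dst)
{
	if (src == 0)
		return 1;			/* ZF set, no writeback */
	*dst = 63 - (uint64_t)__builtin_clzll(src); /* highest set bit */
	return 0;
}
```

The "no writeback on zero source" branch mirrors why em_bsf()/em_bsr() set dst.type to OP_NONE when ZF is set: BSF/BSR leave the destination undefined (in practice unmodified) for a zero source.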
Re: [PATCH 5 of 5] virtio: expose added descriptors immediately
On Tue, Nov 22, 2011 at 11:03:04AM +1030, Rusty Russell wrote:
On Mon, 21 Nov 2011 13:57:04 +0200, Michael S. Tsirkin m...@redhat.com wrote:
On Mon, Nov 21, 2011 at 12:18:45PM +1030, Rusty Russell wrote:
On Wed, 16 Nov 2011 09:18:38 +0200, Michael S. Tsirkin m...@redhat.com wrote:

My unlocked kick patches will trip this warning: they make virtio-net do add + get without kick.

Heh, it's a good sign if they do, since that means you're running really well :)

They don't in fact, in my testing :(. But I think they can, with luck. I think block with unlocked kick can trip it too: add, the lock is dropped, and then an interrupt can do a get. We also don't need a kick each num - each 2^15 is enough. Why don't we do this at the start of add_buf:

	if (vq->num_added >= 0x7fff)
		return -ENOSPC;

The warning was there in case a driver is never doing a kick, and getting away with it (mostly) because the device is polling. Let's not penalize good drivers to catch bad ones. How about we do this properly, like so:

Absolutely. But I think we also need to handle num_added overflow of a 15 bit counter, no? Otherwise the vring_need_event logic might give us false negatives. I'm guessing we can just assume we need a kick in that case.

You're right. Thank you. My immediate reaction of "make it an unsigned long" doesn't work. Here's the diff to what I posted before:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -254,9 +254,10 @@ add_head:
 	vq->vring.avail->idx++;
 	vq->num_added++;

-	/* If you haven't kicked in this long, you're probably doing something
-	 * wrong. */
-	WARN_ON(vq->num_added > vq->vring.num);
+	/* This is very unlikely, but theoretically possible.  Kick
+	 * just in case. */
+	if (unlikely(vq->num_added == 65535))

This is 0xffff, but why use the decimal notation?

+		virtqueue_kick(_vq);

 	pr_debug("Added buffer head %i to %p\n", head, vq);

 	END_USE(vq);

We also still need to reset vq->num_added, right?

-- MST
Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation
On Sun, Nov 20, 2011 at 02:23:52PM +0200, Avi Kivity wrote:

There is no "the VMA". There could be multiple VMAs, or none (with the mmap() coming afterwards). You could do all the checks you want here, only to have host userspace remap it under your feet. This needs to be done on a per-page basis at fault time.

OK, so that's a somewhat different mental model than I had in mind. I can change the code to do almost everything on a per-page basis at fault time on POWER7. There is one thing we can't do at fault time, which is to tell the hardware the page size for the virtual real mode area (VRMA), which is a mapping of the memory starting at guest physical address zero. We can however work out that page size the first time we run a vcpu. By that stage we must have some memory mapped at gpa 0, otherwise the vcpu is not going to get very far. We will need to look for the page size of whatever is mapped at gpa 0 at that point and give it to the hardware.

On PPC970, which is a much older processor, we can't intercept the page faults (at least not without running the whole guest in user mode and emulating all the privileged instructions), so once we have given the guest access to a page, we can't revoke it. We will have to take and keep a reference to the page so it doesn't go away underneath us, which of course doesn't guarantee that userland can continue to see it, but does at least mean we won't corrupt memory.

+	/*
+	 * We require read write permission as we cannot yet
+	 * enforce guest read-only protection or no access.
+	 */
+	if ((vma->vm_flags & (VM_READ | VM_WRITE)) !=
+	    (VM_READ | VM_WRITE))
+		goto err_unlock;

This, too, must be done at get_user_pages() time. What happens if mmu notifiers tell you to write protect a page?

On POWER7, we have to remove access to the page, and then when we get a fault on the page, request write access when we do get_user_pages_fast.

Paul.
Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation
On 11/21/2011 01:03 PM, Paul Mackerras wrote: On Sun, Nov 20, 2011 at 02:23:52PM +0200, Avi Kivity wrote:

There is no "the VMA". There could be multiple VMAs, or none (with the mmap() coming afterwards). You could do all the checks you want here, only to have host userspace remap it under your feet. This needs to be done on a per-page basis at fault time.

OK, so that's a somewhat different mental model than I had in mind. I can change the code to do almost everything on a per-page basis at fault time on POWER7. There is one thing we can't do at fault time, which is to tell the hardware the page size for the virtual real mode area (VRMA), which is a mapping of the memory starting at guest physical address zero. We can however work out that page size the first time we run a vcpu. By that stage we must have some memory mapped at gpa 0, otherwise the vcpu is not going to get very far. We will need to look for the page size of whatever is mapped at gpa 0 at that point and give it to the hardware.

Ok. Do you need to check at fault time that your assumptions haven't changed, then?

On PPC970, which is a much older processor, we can't intercept the page faults (at least not without running the whole guest in user mode and emulating all the privileged instructions), so once we have given the guest access to a page, we can't revoke it. We will have to take and keep a reference to the page so it doesn't go away underneath us, which of course doesn't guarantee that userland can continue to see it, but does at least mean we won't corrupt memory.

Yes, this is similar to kvm/x86 before mmu notifiers were added.

+	/*
+	 * We require read write permission as we cannot yet
+	 * enforce guest read-only protection or no access.
+	 */
+	if ((vma->vm_flags & (VM_READ | VM_WRITE)) !=
+	    (VM_READ | VM_WRITE))
+		goto err_unlock;

This, too, must be done at get_user_pages() time. What happens if mmu notifiers tell you to write protect a page?
On POWER7, we have to remove access to the page, and then when we get a fault on the page, request write access when we do get_user_pages_fast.

Ok, so no ksm for you. Does this apply to kvm-internal write protection, like we do for the framebuffer and live migration? I guess you don't care much about the framebuffer (and anyway it doesn't need read-only access), but removing access for live migration is painful.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
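Avi's two points above -- that permission checks belong at fault time, not at setup time, and that an mmu-notifier invalidation must force a re-fault -- can be sketched in miniature. This is a hedged, user-space illustration only, not the actual KVM code; the names (shadow_fault, shadow_invalidate, guest_access) and the toy data structures are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Toy model of the host's view of a page's permissions. */
struct host_page {
	bool readable;
	bool writable;
};

/* Toy model of a translation the hypervisor has cached for the guest. */
struct shadow_entry {
	bool valid;          /* do we have a cached translation? */
	bool writable;       /* permission validated at fault time */
};

static struct host_page host[NPAGES];
static struct shadow_entry shadow[NPAGES];

/* Fault handler: validate permissions *now*, then cache the mapping.
 * In real code this is where get_user_pages_fast() would be called. */
static bool shadow_fault(size_t gfn, bool want_write)
{
	if (!host[gfn].readable || (want_write && !host[gfn].writable))
		return false;    /* real code would return -EFAULT */
	shadow[gfn].valid = true;
	shadow[gfn].writable = want_write;
	return true;
}

/* mmu-notifier-style callback: host mapping changed, drop the cache
 * so the next guest access re-faults against current permissions. */
static void shadow_invalidate(size_t gfn)
{
	shadow[gfn].valid = false;
}

/* Guest access path: use the cached translation if it still suffices,
 * otherwise take a fault and revalidate. */
static bool guest_access(size_t gfn, bool is_write)
{
	if (shadow[gfn].valid && (!is_write || shadow[gfn].writable))
		return true;
	return shadow_fault(gfn, is_write);
}
```

The point of the sketch is the ordering: a check done once against a VMA at setup time would go stale the moment the host remapped or write-protected the page, whereas the invalidate-then-refault cycle always validates against the current mapping.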
Re: [PATCH 01/11] KVM: PPC: Add memory-mapping support for PCI passthrough and emulation
On Mon, Nov 21, 2011 at 02:22:58PM +0200, Avi Kivity wrote:

On 11/21/2011 01:03 PM, Paul Mackerras wrote: OK, so that's a somewhat different mental model than I had in mind. I can change the code to do almost everything on a per-page basis at fault time on POWER7. There is one thing we can't do at fault time, which is to tell the hardware the page size for the virtual real mode area (VRMA), which is a mapping of the memory starting at guest physical address zero. We can however work out that page size the first time we run a vcpu. By that stage we must have some memory mapped at gpa 0, otherwise the vcpu is not going to get very far. We will need to look for the page size of whatever is mapped at gpa 0 at that point and give it to the hardware.

Ok. Do you need to check at fault time that your assumptions haven't changed, then?

At fault time, if we are expecting a large page and we find a small page, pretty much all we can do is return from the vcpu_run ioctl with an EFAULT error -- so yes, we do check the page-size assumption at fault time. The other way around (expecting a small page and finding a large page) isn't a problem, of course.

What happens if mmu notifiers tell you to write protect a page?

On POWER7, we have to remove access to the page, and then when we get a fault on the page, request write access when we do get_user_pages_fast.

Ok, so no ksm for you. Does this apply to kvm-internal write protection, like we do for the framebuffer and live migration? I guess you don't care much about the framebuffer (and anyway it doesn't need read-only access), but removing access for live migration is painful.

For the framebuffer, we can use the hardware 'C' (changed) bit to detect dirty pages without having to make them read-only. On further thought, we can in fact make pages read-only when the guest thinks they're read/write, at the cost of making the real protection faults in the guest a little slower. I'll look into it.

Paul.
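Paul's suggestion of using the hardware 'C' (changed) bit means repeated guest writes stay fast: the bit is set as a side effect of the store, and the dirty log is harvested by a test-and-clear sweep rather than by write-protecting pages and taking a fault per write. A minimal user-space simulation of that harvesting model (plain C, not the actual KVM dirty-log code; guest_write and get_dirty_log are invented names for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy hashed-page-table entry: only the Changed bit matters here. */
struct hpte {
	bool changed;        /* hardware sets this on a store to the page */
};

static struct hpte pt[NPAGES];

/* Guest store: the hardware sets the C bit as a side effect.
 * No fault is taken, so repeated writes to the same page are cheap. */
static void guest_write(size_t gfn)
{
	pt[gfn].changed = true;
}

/* Harvest the dirty log: test-and-clear each C bit, so each harvesting
 * round reports only pages written since the previous round. */
static size_t get_dirty_log(bool *log, size_t n)
{
	size_t ndirty = 0;

	for (size_t i = 0; i < n; i++) {
		log[i] = pt[i].changed;
		if (pt[i].changed) {
			pt[i].changed = false;   /* rearm for the next round */
			ndirty++;
		}
	}
	return ndirty;
}
```

Compare this with the write-protect scheme Avi describes for live migration, where every first write to a clean page costs a protection fault; the C-bit approach trades that per-write cost for a sweep over the page table at harvest time.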