Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On Tue, Aug 31, 2010 at 11:32 PM, Alex Williamson wrote:
> On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
>> On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
>> > The bottom half handler shows big improvements over the timer
>> > with few downsides; default to it when the iothread is enabled.
>> >
>> > Using the following tests, with the guest and host connected
>> > via tap+bridge:
>> >
>> > guest> netperf -t TCP_STREAM -H $HOST
>> > host> netperf -t TCP_STREAM -H $GUEST
>> > guest> netperf -t UDP_STREAM -H $HOST
>> > host> netperf -t UDP_STREAM -H $GUEST
>> > guest> netperf -t TCP_RR -H $HOST
>> >
>> > Results: base throughput, exits/throughput ->
>> >          patched throughput, exits/throughput
>> >
>> > --enable-io-thread
>> > TCP guest->host 2737.77, 47.82 -> 6767.09, 29.15 = 247%, 61%
>> > TCP host->guest 2231.33, 74.00 -> 4125.80, 67.61 = 185%, 91%
>> > UDP guest->host 6281.68, 14.66 -> 12569.27, 1.98 = 200%, 14%
>> > UDP host->guest 275.91, 289.22 -> 264.80, 293.53 = 96%, 101%
>> > iterations/s 1949.65, 82.97 -> 7417.56, 84.31 = 380%, 102%
>> >
>> > No --enable-io-thread
>> > TCP guest->host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939%
>> > TCP host->guest 2416.03, 76.67 -> 5655.92, 55.52 = 234%, 72%
>> > UDP guest->host 12255.82, 6.11 -> 7775.87, 31.32 = 63%, 513%
>> > UDP host->guest 587.92, 245.95 -> 611.88, 239.92 = 104%, 98%
>> > iterations/s 1975.59, 83.21 -> 8935.50, 88.18 = 452%, 106%
>> >
>> > Signed-off-by: Alex Williamson
>>
>> A parameter having different settings based on config
>> options might surprise some users. I don't think
>> we really need a parameter here ...
>
> I'm not a big fan of this either, but I'd also prefer not to introduce a
> regression for a performance difference we know about in advance. It
> gets even more complicated when we factor in qemu-kvm, as it doesn't
> build with iothread enabled, but seems to get an even better boost in
> performance across the board thanks largely to the kvm-irqchip.
> Should we instead make this a configure option? --enable-virtio-net-txbh?
> Thanks,
>
> Alex

qemu-kvm uses its own iothread implementation by default. It doesn't
need --enable-io-thread because it already uses a similar model.

Stefan

[remainder of quoted patch snipped]
[KVM timekeeping] Revert getnsboottime() kernel API
Turns out this doesn't actually save any math or locking, the name is
chosen rather poorly, it doesn't match the existing kernel APIs, and it
requires kvm-kmod changes.

Signed-off-by: Zachary Amsden
---
 arch/x86/kvm/x86.c        |   18 ++
 include/linux/time.h      |    1 -
 kernel/time/timekeeping.c |   28 +---
 3 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4014d6c..03605b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -919,6 +919,16 @@ static inline u64 nsec_to_cycles(u64 nsec)
 	return ret;
 }
 
+static inline u64 get_kernel_ns(void)
+{
+	struct timespec ts;
+
+	WARN_ON(preemptible());
+	ktime_get_ts(&ts);
+	monotonic_to_bootbased(&ts);
+	return timespec_to_ns(&ts);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -928,7 +938,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 	spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 	offset = data - native_read_tsc();
-	ns = getnsboottime();
+	ns = get_kernel_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 	sdiff = data - kvm->arch.last_tsc_write;
 	if (sdiff < 0)
@@ -981,7 +991,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
 	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
-	kernel_ns = getnsboottime();
+	kernel_ns = get_kernel_ns();
 	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 	local_irq_restore(flags);
@@ -3327,7 +3337,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 
 		r = 0;
-		now_ns = getnsboottime();
+		now_ns = get_kernel_ns();
 		delta = user_ns.clock - now_ns;
 		kvm->arch.kvmclock_offset = delta;
 		break;
@@ -3336,7 +3346,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		struct kvm_clock_data user_ns;
 		u64 now_ns;
 
-		now_ns = getnsboottime();
+		now_ns = get_kernel_ns();
 		user_ns.clock = kvm->arch.kvmclock_offset + now_ns;
 		user_ns.flags = 0;
diff --git a/include/linux/time.h b/include/linux/time.h
index 909e62a..9f15ac7 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -160,7 +160,6 @@ extern void getnstimeofday(struct timespec *tv);
 extern void getrawmonotonic(struct timespec *ts);
 extern void getboottime(struct timespec *ts);
 extern void monotonic_to_bootbased(struct timespec *ts);
-extern s64 getnsboottime(void);
 extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
 extern int timekeeping_valid_for_hres(void);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 77e930d..dfbe271 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -284,33 +284,6 @@ void ktime_get_ts(struct timespec *ts)
 }
 EXPORT_SYMBOL_GPL(ktime_get_ts);
 
-
-/**
- * getnsboottime - get the bootbased clock in nsec format
- *
- * The function calculates the bootbased clock from the realtime
- * clock and the wall_to_monotonic offset and stores the result
- * in normalized timespec format in the variable pointed to by @ts.
- */
-s64 getnsboottime(void)
-{
-	unsigned int seq;
-	s64 secs, nsecs;
-
-	WARN_ON(timekeeping_suspended);
-
-	do {
-		seq = read_seqbegin(&xtime_lock);
-		secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
-		secs += total_sleep_time.tv_sec;
-		nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
-		nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
-
-	} while (read_seqretry(&xtime_lock, seq));
-	return nsecs + (secs * NSEC_PER_SEC);
-}
-EXPORT_SYMBOL_GPL(getnsboottime);
-
 /**
  * do_gettimeofday - Returns the time of day in a timeval
  * @tv: pointer to the timeval to be set
@@ -327,6 +300,7 @@ void do_gettimeofday(struct timeval *tv)
 }
 EXPORT_SYMBOL(do_gettimeofday);
 
+
 /**
  * do_settimeofday - Sets the time of day
  * @tv: pointer to the timespec variable containing the new time
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> > > On 08/23/2010 04:16 PM, Andre Przywara wrote:
> > > > Anthony Liguori wrote:
> > > > > On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> > > > > > On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> > > > > > > According to the user-provided assignment, bind the
> > > > > > > respective part of the guest's memory to the given host
> > > > > > > node. This uses Linux' mbind syscall (which is wrapped only
> > > > > > > in libnuma) to realize the pinning right after the
> > > > > > > allocation. Failures are not fatal, but produce a warning.
> > > > > > >
> > > > > > > Signed-off-by: Andre Przywara
> > > > > > > ...
> > > > > >
> > > > > > Why is it not possible (or perhaps not desired) to change the
> > > > > > binding after the guest is started? Sounds inflexible.
> > > > The solution is to introduce a monitor interface to later adjust
> > > > the pinning, allowing both changing the affinity only (only valid
> > > > for future fault-ins) and actually copying the memory (more
> > > > costly).
> > > This is just duplicating numactl.
> > > > Actually this is the next item on my list, but I wanted to bring
> > > > up the basics first to avoid recoding parts afterwards. Also I am
> > > > not (yet) familiar with the QMP protocol.
> > > > > We really need a solution that lets a user use a tool like
> > > > > numactl outside of the QEMU instance.
> > > > I fear that is not how it's meant to work with the Linux NUMA
> > > > API. In contrast to the VCPU threads, which are externally
> > > > visible entities (PIDs), the memory should be private to the QEMU
> > > > process. While you can change the NUMA allocation policy of the
> > > > _whole_ process, there is no way to externally distinguish parts
> > > > of the process' memory. Although you could later (and externally)
> > > > migrate already faulted pages (via move_pages(2) and by looking
> > > > in /proc/$$/numa_maps), you would let an external tool interfere
> > > > with QEMU's internal memory management. Take for instance the
> > > > change of the allocation policy regarding the 1MB and 3.5-4GB
> > > > holes. An external tool would have to either track such changes
> > > > or you simply could not change such things in QEMU.
> > > It's extremely likely that if you're doing NUMA pinning, you're
> > > also doing large pages via hugetlbfs. numactl can already set
> > > policies for files in hugetlbfs, so all you need to do is have a
> > > separate hugetlbfs file for each numa node.
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> If you care about NUMA pinning, I can't believe you don't want
> guaranteed large page allocation, which THP does not provide.

I personally want a more automatic approach to placing VMs in NUMA
nodes (not directed by the qemu process itself), but I'd also like to
support a user's desire to pin and place cpus and memory, especially
for large VMs that need to be defined as multi-node. For user-defined
pinning, libhugetlbfs will probably be fine, but for most VMs, I'd like
to ensure we can do things like ballooning well, and I am not so sure
that will be easy with libhugetlbfs.

> The general point though is that we should find a way to partition
> memory in qemu such that an external process can control the actual
> NUMA placement. This gives us maximum flexibility.
>
> Otherwise, what do we implement in QEMU? Direct pinning of memory to
> nodes? Can we migrate memory between nodes? Should we support
> interleaving memory between two virtual nodes? Why pick and choose
> when we can have it all.

If there were a better way to do this than hugetlbfs, then I don't
think I would shy away from this. Is there another way to change NUMA
policies on mappings from a user tool? We can already inspect with
/proc//numamaps. Is this something that could be added to numactl?

> > FWIW, large apps like databases have set a precedent for managing
> > their own NUMA policies.
>
> Of course, because they know what their NUMA policy should be. They
> live in a simple world where they assume they're the only application
> in the system; they read the distance tables, figure they'll use XX%
> of all physical memory, and then pin how they see fit.
>
> But an individual QEMU process lives in a complex world. It's almost
> never the only thing on the system, and it's only allowed to use a
> subset of resources. It's not sure what set of resources it can and
> can't use, and that's often times changing. The topology chosen for a
> guest is static, but its host topology may be dynamic due to things
> like live migration.

True, that's why this would require support to change in the monitor.

> In short, QEMU absolutely cannot imp
Re: Degrading Network performance as KVM/kernel version increases
On 8/31/2010 6:00 PM, matthew.r.roh...@l-3com.com wrote:
> I have been getting degrading network performance with newer versions
> of KVM and was wondering if this was expected? It seems like a bug,
> but I am new to this and maybe I am doing something wrong, so I
> thought I would ask.
>
> KVM Host OS: Fedora 12 x86_64
> KVM Guest OS: Tiny Core Linux, 2.6.33.3 kernel
>
> I have tried multiple host kernels (2.6.31.5, 2.6.31.6, 2.6.32.19 and
> 2.6.35.4) along with qemu-kvm 11.0 and qemu-system-x86_64 12.5
> compiled from the qemu-kvm repo.

I can't say anything about the kernel version making things worse. At
least for the qemu-kvm version, you should be using -device and -netdev
instead of -net nic -net tap (see
http://git.qemu.org/qemu.git/tree/docs/qdev-device-use.txt since it's
not in the 0.12 tree).

> Setup is: 2 hosts with 1 guest on each, connected by a 10 Gb nic. I am
> using virtio and have checked that hardware acceleration is working.
> Processor usage is less than 50% on host and guests. Here is what I am
> seeing; I will just include guest-to-guest statistics, I do have more
> (host to guest, etc.) if interested:
>
> My goal is to get as much bandwidth as I can between the 2 guests
> running on separate hosts. The most I have been able to get is ~4 Gb/s
> running 4 threads on iperf from guest A to guest B. I cannot seem to
> get much over 1.5 Gb/s from guest to guest with a single iperf thread.
> Is there some sort of known send limit per thread? Is it expected that
> the latest version of the kernel and modules perform worse than
> earlier versions in the area of network performance (I am guessing
> not; am I doing something wrong?)? I am using virtio and have checked
> that hardware acceleration is working. 4 iperf threads host to host
> yields ~9.5 Gb/s.
>
> Any ideas on how I can get better performance with newer versions? I
> have tried using vhost in 2.6.35 but I get the "vhost could not be
> initialized" error. The only thing I could find on the vhost error is
> that selinux should be off, which it is.
>
> I am looking for ideas on increasing the bandwidth between guests and
> thoughts on the degrading performance.

Vhost-net is probably your best bet for maximizing throughput. You
might try a separate post just for the vhost error if nobody chimes in
about it here.

> Thanks for your help!
>
> --Matt
Degrading Network performance as KVM/kernel version increases
I have been getting degrading network performance with newer versions
of KVM and was wondering if this was expected? It seems like a bug, but
I am new to this and maybe I am doing something wrong, so I thought I
would ask.

KVM Host OS: Fedora 12 x86_64
KVM Guest OS: Tiny Core Linux, 2.6.33.3 kernel

I have tried multiple host kernels (2.6.31.5, 2.6.31.6, 2.6.32.19 and
2.6.35.4) along with qemu-kvm 11.0 and qemu-system-x86_64 12.5 compiled
from the qemu-kvm repo.

Setup is: 2 hosts with 1 guest on each, connected by a 10 Gb nic. I am
using virtio and have checked that hardware acceleration is working.
Processor usage is less than 50% on host and guests.

Here is what I am seeing; I will just include guest-to-guest
statistics, I do have more (host to guest, etc.) if interested:

- kernel 2.6.31.5, qemu-kvm 11.0: 1.57 Gb/s (guest 1 to guest 2), then
  1.37 Gb/s (guest 2 to guest 1), with a single iperf thread.
- kernel 2.6.31.5, qemu-kvm 11.0: 3.16 Gb/s (guest 1 to guest 2), then
  4.29 Gb/s (guest 2 to guest 1), with 4 (P4) iperf threads.
- kernel 2.6.31.5, qemu-system 12.5: 1.02 Gb/s (guest 1 to guest 2),
  then .420 Gb/s (guest 2 to guest 1), with a single iperf thread.
- kernel 2.6.31.5, qemu-system 12.5: 1.30 Gb/s (guest 1 to guest 2),
  then .655 Gb/s (guest 2 to guest 1), with 4 (P4) iperf threads.
- kernel 2.6.31.5 on host 1 and 2.6.32.19 on host 2, qemu-kvm 11.0:
  .580 Gb/s (guest 1 to guest 2), then 1.32 Gb/s (guest 2 to guest 1),
  with a single iperf thread.
- kernel 2.6.32.19, qemu-kvm 11.0: .548 Gb/s (guest 1 to guest 2), then
  .603 Gb/s (guest 2 to guest 1), with a single iperf thread.
- kernel 2.6.32.19, qemu-kvm 11.0: .569 Gb/s (guest 1 to guest 2), then
  .478 Gb/s (guest 2 to guest 1), with 4 (P4) iperf threads.
- kernel 2.6.32.19, qemu-system 12.5: .571 Gb/s (guest 1 to guest 2),
  then .500 Gb/s (guest 2 to guest 1), with a single iperf thread.
- kernel 2.6.32.19, qemu-system 12.5: .633 Gb/s (guest 1 to guest 2),
  then .705 Gb/s (guest 2 to guest 1), with 4 (P4) iperf threads.
- kernel 2.6.35.4, qemu-system 12.5: .418 Gb/s (guest 1 to guest 2),
  and then I gave up.

My goal is to get as much bandwidth as I can between the 2 guests
running on separate hosts. The most I have been able to get is ~4 Gb/s
running 4 threads on iperf from guest A to guest B. I cannot seem to
get much over 1.5 Gb/s from guest to guest with a single iperf thread.
Is there some sort of known send limit per thread?

Is it expected that the latest version of the kernel and modules
perform worse than earlier versions in the area of network performance
(I am guessing not; am I doing something wrong?)? I am using virtio and
have checked that hardware acceleration is working. 4 iperf threads
host to host yields ~9.5 Gb/s.

Any ideas on how I can get better performance with newer versions? I
have tried using vhost in 2.6.35 but I get the "vhost could not be
initialized" error. The only thing I could find on the vhost error is
that selinux should be off, which it is.

I am looking for ideas on increasing the bandwidth between guests and
thoughts on the degrading performance.

Thanks for your help!

--Matt
Re: [PATCH 1/7] pci: expand tabs to spaces in pci_regs.h
On Tue, Aug 31, 2010 at 11:29:53PM +0300, Michael S. Tsirkin wrote:
> On Sat, Aug 28, 2010 at 05:54:52PM +0300, Eduard - Gabriel Munteanu wrote:
> > The conversion was done using the GNU 'expand' tool (default settings)
> > to make it obey the QEMU coding style.
> >
> > Signed-off-by: Eduard - Gabriel Munteanu
>
> I'm not really interested in this: we copied pci_regs.h from linux to
> help non-linux hosts, and keeping the code consistent with the
> original makes detecting bugs and adding new stuff from
> linux/pci_regs.h easier.
>
> > ---
> >  hw/pci_regs.h | 1330
> >  1 files changed, 665 insertions(+), 665 deletions(-)
> >  rewrite hw/pci_regs.h (90%)

Ok, I'll drop it. The only reason I did it was that one of my additions
to this file made the patch look awkwardly indented. I'll use tabs and
merge it into Linux as well.

Thanks,
Eduard
Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On 08/31/2010 05:32 PM, Alex Williamson wrote:
> On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
> > On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
> > > The bottom half handler shows big improvements over the timer
> > > with few downsides; default to it when the iothread is enabled.
> > > [benchmark details snipped]
> >
> > A parameter having different settings based on config options might
> > surprise some users. I don't think we really need a parameter here ...
>
> I'm not a big fan of this either, but I'd also prefer not to introduce
> a regression for a performance difference we know about in advance. It
> gets even more complicated when we factor in qemu-kvm, as it doesn't
> build with iothread enabled, but seems to get an even better boost in
> performance across the board thanks largely to the kvm-irqchip. Should
> we instead make this a configure option? --enable-virtio-net-txbh?

No, at this stage, we should ignore no --enable-io-thread with
-enable-kvm.

Regards,

Anthony Liguori

[remainder of quoted patch snipped]
Re: windows 2008 installation doesn't recognize cdrom
On Mon, Aug 30, 2010 at 05:29:45PM +0200, Martin Kraus wrote:
> On Mon, Aug 30, 2010 at 04:12:05PM +0200, Jes Sorensen wrote:
> > On 08/30/10 14:45, Martin Kraus wrote:
> > > By the installer I've meant the windows installer, which complains
> > > that it can't find a driver for the cdrom (emulated by kvm patched
> > > qemu). I can boot from the cd image by using -drive
> > > file=...,media=cdrom or -cdrom ... and the installation starts, but
> > > it stops at the second installation window asking to browse
> > > manually for an appropriate driver.
> > >
> > > I've just been wondering if anyone came across this problem or if I
> > > should blame it on Windows.
> >
> > That is very odd, I have certainly managed to install win2k8 here
> > from an ISO image without problems. If you launch it with -monitor
> > stdio, what do you get when you type 'info block'?
>
> ide0-hd0: type=hd removable=0 file=/dev/mapper/virtual-pinda ro=0
> drv=host_device encrypted=0
> ide1-cd0: type=cdrom removable=1 locked=0
> file=/root/pub/iso/SW_DVD5_Windows_Svr_2008_R2_x64_ENG_1_ProdAct_X15-50365(2).ISO
> ro=0 drv=raw encrypted=0
> floppy0: type=floppy removable=1 locked=0 [not inserted]
> sd0: type=floppy removable=1 locked=0 [not inserted]
>
> I'll try to download the iso again if I manage to get past the
> microsoft signon.

Oh well. I've managed to persuade someone capable of signing on to the
microsoft site to get the hash of the dvd, and it's been corrupted. I
apologize for this blunder.

thank you for the help
mk
Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
> > The bottom half handler shows big improvements over the timer
> > with few downsides; default to it when the iothread is enabled.
> >
> > Using the following tests, with the guest and host connected
> > via tap+bridge:
> >
> > guest> netperf -t TCP_STREAM -H $HOST
> > host> netperf -t TCP_STREAM -H $GUEST
> > guest> netperf -t UDP_STREAM -H $HOST
> > host> netperf -t UDP_STREAM -H $GUEST
> > guest> netperf -t TCP_RR -H $HOST
> >
> > Results: base throughput, exits/throughput ->
> >          patched throughput, exits/throughput
> >
> > --enable-io-thread
> > TCP guest->host 2737.77, 47.82 -> 6767.09, 29.15 = 247%, 61%
> > TCP host->guest 2231.33, 74.00 -> 4125.80, 67.61 = 185%, 91%
> > UDP guest->host 6281.68, 14.66 -> 12569.27, 1.98 = 200%, 14%
> > UDP host->guest 275.91, 289.22 -> 264.80, 293.53 = 96%, 101%
> > iterations/s 1949.65, 82.97 -> 7417.56, 84.31 = 380%, 102%
> >
> > No --enable-io-thread
> > TCP guest->host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939%
> > TCP host->guest 2416.03, 76.67 -> 5655.92, 55.52 = 234%, 72%
> > UDP guest->host 12255.82, 6.11 -> 7775.87, 31.32 = 63%, 513%
> > UDP host->guest 587.92, 245.95 -> 611.88, 239.92 = 104%, 98%
> > iterations/s 1975.59, 83.21 -> 8935.50, 88.18 = 452%, 106%
> >
> > Signed-off-by: Alex Williamson
>
> A parameter having different settings based on config options might
> surprise some users. I don't think we really need a parameter here ...

I'm not a big fan of this either, but I'd also prefer not to introduce a
regression for a performance difference we know about in advance. It
gets even more complicated when we factor in qemu-kvm, as it doesn't
build with iothread enabled, but seems to get an even better boost in
performance across the board thanks largely to the kvm-irqchip. Should
we instead make this a configure option? --enable-virtio-net-txbh?
Thanks,

Alex

> > ---
> >
> >  hw/s390-virtio-bus.c |    3 ++-
> >  hw/syborg_virtio.c   |    3 ++-
> >  hw/virtio-pci.c      |    3 ++-
> >  hw/virtio.h          |    6 ++++++
> >  4 files changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
> > index 1483362..985f99a 100644
> > --- a/hw/s390-virtio-bus.c
> > +++ b/hw/s390-virtio-bus.c
> > @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
> >      .qdev.size = sizeof(VirtIOS390Device),
> >      .qdev.props = (Property[]) {
> >          DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
> > -        DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
> > +        DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer,
> > +                           TXTIMER_DEFAULT),
> >          DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
> >          DEFINE_PROP_END_OF_LIST(),
> >      },
> > diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
> > index 7b76972..ee5746d 100644
> > --- a/hw/syborg_virtio.c
> > +++ b/hw/syborg_virtio.c
> > @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
> >      .qdev.props = (Property[]) {
> >          DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
> >          DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
> > -        DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
> > +        DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer,
> > +                           TXTIMER_DEFAULT),
> >          DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
> >          DEFINE_PROP_END_OF_LIST(),
> >      }
> > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> > index e025c09..9740f57 100644
> > --- a/hw/virtio-pci.c
> > +++ b/hw/virtio-pci.c
> > @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
> >          DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
> >          DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
> >          DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
> > -        DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1),
> > +        DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer,
> > +                           TXTIMER_DEFAULT),
> >          DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256),
> >          DEFINE_PROP_END_OF_LIST(),
> >      },
> > diff --git a/hw/virtio.h b/hw/virtio.h
> > index 4051889..a1a17a2 100644
> > --- a/hw/virtio.h
> > +++ b/hw/virtio.h
> > @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
> >  void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
> >  			void *opaque);
> >
> > +#ifdef CONFIG_IOTHREAD
> > +    #define TXTIMER_DEFAULT 0
> > +#else
> > +    #define TXTIMER_DEFAULT 1
> > +#endif
> > +
>
> Add a comment explaining that this is just a performance optimization?
>
> >  /* Base devices. */
> >  VirtIODevice *virtio_blk_init(De
Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx
On Tue, 2010-08-31 at 16:33 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:27 PM, Michael S. Tsirkin wrote:
> > On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote:
> > > Add the ability to configure the tx_timer timeout and add a bottom
> > > half tx handler that typically shows a nice perf boost over the
> > > time-based approach. See the last patch for perf details. Make
> > > this the new default when the iothread is enabled. Thanks,
> > >
> > > Alex
> >
> > As a further thought, maybe it would help to have a separate
> > namespace for unsupported, developer-only parameters. Thoughts?
>
> We already have an undocumented one; just prefix with 'x-' to indicate
> that it's experimental.

Any objection to x-txburst and x-txtimer, then? Easy enough to rename.

Thanks,

Alex
[patch 0/2] SVM: fix cpu onlining
[patch 2/2] KVM: SVM: init_vmcb should reset vcpu->efer
Otherwise the EFER_LMA bit is retained across a SIPI reset. Fixes guest cpu onlining. Signed-off-by: Marcelo Tosatti Index: kvm/arch/x86/kvm/svm.c === --- kvm.orig/arch/x86/kvm/svm.c +++ kvm/arch/x86/kvm/svm.c @@ -816,7 +816,7 @@ static void init_vmcb(struct vcpu_svm *s init_sys_seg(&save->ldtr, SEG_TYPE_LDT); init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16); - save->efer = EFER_SVME; + svm_set_efer(&svm->vcpu, 0); save->dr6 = 0x0ff0; save->dr7 = 0x400; save->rflags = 2;
[patch 1/2] KVM: SVM: reset mmu context in init_vmcb
Since commit aad827034e419fa no mmu reinitialization is performed via init_vmcb. Zero vcpu->arch.cr0 and pass the reset value as a parameter to kvm_set_cr0. Signed-off-by: Marcelo Tosatti Index: kvm/arch/x86/kvm/svm.c === --- kvm.orig/arch/x86/kvm/svm.c +++ kvm/arch/x86/kvm/svm.c @@ -827,8 +827,8 @@ static void init_vmcb(struct vcpu_svm *s * This is the guest-visible cr0 value. * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0. */ - svm->vcpu.arch.cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET; - (void)kvm_set_cr0(&svm->vcpu, svm->vcpu.arch.cr0); + svm->vcpu.arch.cr0 = 0; + (void)kvm_set_cr0(&svm->vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET); save->cr4 = X86_CR4_PAE; /* rdx = ?? */
Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
On 08/31/2010 03:54 PM, Andrew Theurer wrote: On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote: On 08/23/2010 04:16 PM, Andre Przywara wrote: Anthony Liguori wrote: On 08/23/2010 01:59 PM, Marcelo Tosatti wrote: On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote: According to the user-provided assignment bind the respective part of the guest's memory to the given host node. This uses Linux' mbind syscall (which is wrapped only in libnuma) to realize the pinning right after the allocation. Failures are not fatal, but produce a warning. Signed-off-by: Andre Przywara ... Why is it not possible (or perhaps not desired) to change the binding after the guest is started? Sounds inflexible. The solution is to introduce a monitor interface to later adjust the pinning, allowing both changing the affinity only (only valid for future fault-ins) and actually copying the memory (more costly). This is just duplicating numactl. Actually this is the next item on my list, but I wanted to bring up the basics first to avoid recoding parts afterwards. Also I am not (yet) familiar with the QMP protocol. We really need a solution that lets a user use a tool like numactl outside of the QEMU instance. I fear that is not how it's meant to work with the Linux' NUMA API. In contrast to the VCPU threads, which are externally visible entities (PIDs), the memory should be private to the QEMU process. While you can change the NUMA allocation policy of the _whole_ process, there is no way to externally distinguish parts of the process' memory. Although you could later (and externally) migrate already faulted pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you would let an external tool interfere with QEMU's internal memory management. Take for instance the change of the allocation policy regarding the 1MB and 3.5-4GB holes. An external tool would have to either track such changes or you simply could not change such things in QEMU. 
It's extremely likely that if you're doing NUMA pinning, you're also doing large pages via hugetlbfs. numactl can already set policies for files in hugetlbfs so all you need to do is have a separate hugetlbfs file for each numa node. Why would we resort to hugetlbfs when we have transparent hugepages? If you care about NUMA pinning, I can't believe you don't want guaranteed large page allocation which THP does not provide. The general point though is that we should find a way to partition memory in qemu such that an external process can control the actual NUMA placement. This gives us maximum flexibility. Otherwise, what do we implement in QEMU? Direct pinning of memory to nodes? Can we migrate memory between nodes? Should we support interleaving memory between two virtual nodes? Why pick and choose when we can have it all. FWIW, large apps like databases have set a precedent for managing their own NUMA policies. Of course because they know what their NUMA policy should be. They live in a simple world where they assume they're the only application in the system, they read the distance tables, figure they'll use XX% of all physical memory, and then pin how they see fit. But an individual QEMU process lives in a complex world. It's almost never the only thing on the system and it's only allowed to use a subset of resources. It's not sure what set of resources it can and can't use and that's often times changing. The topology chosen for a guest is static but its host topology may be dynamic due to things like live migration. In short, QEMU absolutely cannot implement a NUMA policy in a vacuum. Instead, it needs to let something with a larger view of the system determine a NUMA policy that makes sense overall. There are two ways we can do this. We can implement monitor commands that attempt to expose every single NUMA tunable possible. 
Or, we can tie into the existing commands which guarantee that we support every possible tunable and that as NUMA support in Linux evolves, we get all the new features for free. And, since numactl already supports setting policies on files in hugetlbfs, all we need is a simple change to qemu to allow -mem-path to work per-node instead of globally. And it's useful to implement other types of things like having one node be guaranteed large pages and another node THP or some other fanciness. Sounds awfully appealing to me. I don't see why qemu should be any different. Numactl is great for small apps that need to be pinned in one node, or spread evenly on all nodes. Having to get hugetlbfs involved just to work around a shortcoming of numactl just seems like a bad idea. You seem to be asserting that we should implement a full NUMA policy in QEMU. What should it be when we don't (in QEMU) know what else is running on the system? Regards, Anthony Liguori
Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx
On 08/31/2010 03:27 PM, Michael S. Tsirkin wrote: On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote: Add the ability to configure the tx_timer timeout and add a bottom half tx handler that typically shows a nice perf boost over the time based approach. See last patch for perf details. Make this the new default when the iothread is enabled. Thanks, Alex As a further thought, maybe it would help to have a separate namespace for unsupported, developer-only parameters. Thoughts? We already have an undocumented one, just prefix with 'x-' to indicate that it's experimental. Regards, Anthony Liguori --- Alex Williamson (5): virtio-net: Switch default to new bottom half TX handler for iothread virtio-net: Introduce a new bottom half packet TX virtio-net: Rename tx_timer_active to tx_waiting virtio-net: Limit number of packets sent per TX flush virtio-net: Make tx_timer timeout configurable hw/s390-virtio-bus.c |6 ++ hw/s390-virtio-bus.h |4 ++ hw/syborg_virtio.c | 10 +++- hw/virtio-net.c | 129 -- hw/virtio-net.h |2 - hw/virtio-pci.c | 10 +++- hw/virtio.h |9 +++ 7 files changed, 137 insertions(+), 33 deletions(-)
Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote: > On 08/23/2010 04:16 PM, Andre Przywara wrote: > > Anthony Liguori wrote: > >> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote: > >>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote: > According to the user-provided assignment bind the respective part > of the guest's memory to the given host node. This uses Linux' > mbind syscall (which is wrapped only in libnuma) to realize the > pinning right after the allocation. > Failures are not fatal, but produce a warning. > > Signed-off-by: Andre Przywara > > >>> ... > >>> Why is it not possible (or perhaps not desired) to change the binding > >>> after the guest is started? > >>> > >>> Sounds inflexible. > > The solution is to introduce a monitor interface to later adjust the > > pinning, allowing both changing the affinity only (only valid for > > future fault-ins) and actually copying the memory (more costly). > > This is just duplicating numactl. > > > Actually this is the next item on my list, but I wanted to bring up > > the basics first to avoid recoding parts afterwards. Also I am not > > (yet) familiar with the QMP protocol. > >> > >> We really need a solution that lets a user use a tool like numactl > >> outside of the QEMU instance. > > I fear that is not how it's meant to work with the Linux' NUMA API. In > > contrast to the VCPU threads, which are externally visible entities > > (PIDs), the memory should be private to the QEMU process. While you > > can change the NUMA allocation policy of the _whole_ process, there is > > no way to externally distinguish parts of the process' memory. > > Although you could later (and externally) migrate already faulted > > pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you > > would let an external tool interfere with QEMU's internal memory > > management. Take for instance the change of the allocation policy > > regarding the 1MB and 3.5-4GB holes. 
An external tool would have to > > either track such changes or you simply could not change such things > > in QEMU. > > It's extremely likely that if you're doing NUMA pinning, you're also > doing large pages via hugetlbfs. numactl can already set policies for > files in hugetlbfs so all you need to do is have a separate hugetlbfs > file for each numa node. Why would we resort to hugetlbfs when we have transparent hugepages? FWIW, large apps like databases have set a precedent for managing their own NUMA policies. I don't see why qemu should be any different. Numactl is great for small apps that need to be pinned in one node, or spread evenly on all nodes. Having to get hugetlbfs involved just to work around a shortcoming of numactl just seems like a bad idea. > > Then you have all the flexibility of numactl and you can implement node > migration external to QEMU if you so desire. > > > So what is wrong with keeping that code in QEMU, which knows best > > about the internals and already has flexible and mighty ways (command > > line and QMP) of manipulating its behavior? > > NUMA is a last-mile optimization. For the audience that cares about > this level of optimization, only providing an interface that allows a > small set of those optimizations to be used is unacceptable. > > There's a very simple way to do this right and that's by adding > interfaces to QEMU that let's us work with existing tooling instead of > inventing new interfaces. > > Regards, > > Anthony Liguori > > > Regards, > > Andre. -Andrew Theurer
Re: [PATCH 4/5] virtio-net: Introduce a new bottom half packet TX
On Tue, 2010-08-31 at 23:14 +0300, Michael S. Tsirkin wrote: > On Fri, Aug 27, 2010 at 04:37:36PM -0600, Alex Williamson wrote: > > Based on a patch from Mark McLoughlin, this patch introduces a new > > bottom half packet transmitter that avoids the latency imposed by > > the tx_timer approach. Rather than scheduling a timer when a TX > > packet comes in, schedule a bottom half to be run from the iothread. > > The bottom half handler first attempts to flush the queue with > > notification disabled (this is where we could race with a guest > > without txburst). If we flush a full burst, reschedule immediately. > > If we send short of a full burst, try to re-enable notification. > > To avoid a race with TXs that may have occurred, we must then > > flush again. If we find some packets to send, the guest is probably > > active, so we can reschedule again. > > > > tx_timer and tx_bh are mutually exclusive, so we can re-use the > > tx_waiting flag to indicate one or the other needs to be set up. > > This allows us to seamlessly migrate between timer and bh TX > > handling. 
> > > > Signed-off-by: Alex Williamson > > --- > > > > hw/virtio-net.c | 81 > > ++- > > 1 files changed, 68 insertions(+), 13 deletions(-) > > > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > > index 8b652f2..3288c77 100644 > > --- a/hw/virtio-net.c > > +++ b/hw/virtio-net.c > > @@ -36,6 +36,7 @@ typedef struct VirtIONet > > VirtQueue *ctrl_vq; > > NICState *nic; > > QEMUTimer *tx_timer; > > +QEMUBH *tx_bh; > > uint32_t tx_timeout; > > int32_t tx_burst; > > int tx_waiting; > > @@ -704,16 +705,25 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, > > VirtQueue *vq) > > { > > VirtIONet *n = to_virtio_net(vdev); > > > > -if (n->tx_waiting) { > > -virtio_queue_set_notification(vq, 1); > > -qemu_del_timer(n->tx_timer); > > -n->tx_waiting = 0; > > -virtio_net_flush_tx(n, vq); > > +if (n->tx_timer) { > > +if (n->tx_waiting) { > > +virtio_queue_set_notification(vq, 1); > > +qemu_del_timer(n->tx_timer); > > +n->tx_waiting = 0; > > +virtio_net_flush_tx(n, vq); > > +} else { > > +qemu_mod_timer(n->tx_timer, > > + qemu_get_clock(vm_clock) + n->tx_timeout); > > +n->tx_waiting = 1; > > +virtio_queue_set_notification(vq, 0); > > +} > > } else { > > -qemu_mod_timer(n->tx_timer, > > - qemu_get_clock(vm_clock) + n->tx_timeout); > > +if (unlikely(n->tx_waiting)) { > > +return; > > +} > > +virtio_queue_set_notification(n->tx_vq, 0); > > +qemu_bh_schedule(n->tx_bh); > > n->tx_waiting = 1; > > -virtio_queue_set_notification(vq, 0); > > } > > } > > > > @@ -731,6 +741,41 @@ static void virtio_net_tx_timer(void *opaque) > > virtio_net_flush_tx(n, n->tx_vq); > > } > > > > +static void virtio_net_tx_bh(void *opaque) > > +{ > > +VirtIONet *n = opaque; > > +int32_t ret; > > + > > +n->tx_waiting = 0; > > + > > +/* Just in case the driver is not ready any more */ > > +if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))) > > +return; > > + > > +ret = virtio_net_flush_tx(n, n->tx_vq); > > +if (ret == -EBUSY) { > > +return; /* Notification re-enable handled by tx_complete */ > > +} 
> > + > > +/* If we flush a full burst of packets, assume there are > > + * more coming and immediately reschedule */ > > +if (ret >= n->tx_burst) { > > +qemu_bh_schedule(n->tx_bh); > > +n->tx_waiting = 1; > > +return; > > +} > > + > > +/* If less than a full burst, re-enable notification and flush > > + * anything that may have come in while we weren't looking. If > > + * we find something, assume the guest is still active and reschedule > > */ > > +virtio_queue_set_notification(n->tx_vq, 1); > > +if (virtio_net_flush_tx(n, n->tx_vq) > 0) { > > Shouldn't this be virtio_net_flush_tx(n, n->tx_vq) >= n->tx_burst? > If we get less than tx_burst, the ring is empty now so no need to > reschedule. > Right? I suppose it depends on how aggressive we want to be. If the guest put something on the queue between the first flush and this one, then it might be actively transmitting, and if we want to optimize latency, we anticipate that it might continue to transmit and re-schedule. This is taken straight from markmc's rhel5 patch. I wouldn't argue that it's wrong to not reschedule here, but it's clearly less aggressive. Thanks, Alex > > +virtio_queue_set_notification(n->tx_vq, 0); > > +qemu_bh_schedule(n->tx_bh); > > +n->tx_waiting = 1; > > +} > > +} > > + > > static void virtio_net_save(QEMUFile *f, void *opaque) > > { > > VirtIONet *n = opaque; > > @@ -850,8
Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx
On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote: > Add the ability to configure the tx_timer timeout and add a bottom > half tx handler that typically shows a nice perf boost over the > time based approach. See last patch for perf details. Make this > the new default when the iothread is enabled. Thanks, > > Alex As a further thought, maybe it would help to have a separate namespace for unsupported, developer-only parameters. Thoughts? > --- > > Alex Williamson (5): > virtio-net: Switch default to new bottom half TX handler for iothread > virtio-net: Introduce a new bottom half packet TX > virtio-net: Rename tx_timer_active to tx_waiting > virtio-net: Limit number of packets sent per TX flush > virtio-net: Make tx_timer timeout configurable > > > hw/s390-virtio-bus.c |6 ++ > hw/s390-virtio-bus.h |4 ++ > hw/syborg_virtio.c | 10 +++- > hw/virtio-net.c | 129 > -- > hw/virtio-net.h |2 - > hw/virtio-pci.c | 10 +++- > hw/virtio.h |9 +++ > 7 files changed, 137 insertions(+), 33 deletions(-)
Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote: > The bottom half handler shows big improvements over the timer > with few downsides, default to it when the iothread is enabled. > > Using the following tests, with the guest and host connected > via tap+bridge: > > guest> netperf -t TCP_STREAM -H $HOST > host> netperf -t TCP_STREAM -H $GUEST > guest> netperf -t UDP_STREAM -H $HOST > host> netperf -t UDP_STREAM -H $GUEST > guest> netperf -t TCP_RR -H $HOST > > Results: base throughput, exits/throughput -> >patched throughput, exits/throughput > > --enable-io-thread > TCP guest->host 2737.77, 47.82 -> 6767.09, 29.15 = 247%, 61% > TCP host->guest 2231.33, 74.00 -> 4125.80, 67.61 = 185%, 91% > UDP guest->host 6281.68, 14.66 -> 12569.27, 1.98 = 200%, 14% > UDP host->guest 275.91, 289.22 -> 264.80, 293.53 = 96%, 101% > iterations/s 1949.65, 82.97 -> 7417.56, 84.31 = 380%, 102% > > No --enable-io-thread > TCP guest->host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939% > TCP host->guest 2416.03, 76.67 -> 5655.92, 55.52 = 234%, 72% > UDP guest->host 12255.82, 6.11 -> 7775.87, 31.32 = 63%, 513% > UDP host->guest 587.92, 245.95 -> 611.88, 239.92 = 104%, 98% > iterations/s 1975.59, 83.21 -> 8935.50, 88.18 = 452%, 106% > > Signed-off-by: Alex Williamson A parameter having different settings based on config options might surprise some users. I don't think we really need a parameter here ... 
> --- > > hw/s390-virtio-bus.c |3 ++- > hw/syborg_virtio.c |3 ++- > hw/virtio-pci.c |3 ++- > hw/virtio.h |6 ++ > 4 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c > index 1483362..985f99a 100644 > --- a/hw/s390-virtio-bus.c > +++ b/hw/s390-virtio-bus.c > @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = { > .qdev.size = sizeof(VirtIOS390Device), > .qdev.props = (Property[]) { > DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic), > -DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1), > +DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, > + TXTIMER_DEFAULT), > DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256), > DEFINE_PROP_END_OF_LIST(), > }, > diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c > index 7b76972..ee5746d 100644 > --- a/hw/syborg_virtio.c > +++ b/hw/syborg_virtio.c > @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = { > .qdev.props = (Property[]) { > DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic), > DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features), > -DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1), > +DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, > + TXTIMER_DEFAULT), > DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256), > DEFINE_PROP_END_OF_LIST(), > } > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c > index e025c09..9740f57 100644 > --- a/hw/virtio-pci.c > +++ b/hw/virtio-pci.c > @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = { > DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3), > DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features), > DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic), > -DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1), > +DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, > + TXTIMER_DEFAULT), > DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256), > DEFINE_PROP_END_OF_LIST(), > }, > diff --git a/hw/virtio.h b/hw/virtio.h > index 
4051889..a1a17a2 100644 > --- a/hw/virtio.h > +++ b/hw/virtio.h > @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev); > void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding, > void *opaque); > > +#ifdef CONFIG_IOTHREAD > + #define TXTIMER_DEFAULT 0 > +#else > + #define TXTIMER_DEFAULT 1 > +#endif > + Add a comment explaining that this is just a performance optimization? > /* Base devices. */ > VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf); > VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf, >
Re: [PATCH 2/5] virtio-net: Limit number of packets sent per TX flush
On Fri, Aug 27, 2010 at 04:37:18PM -0600, Alex Williamson wrote: > If virtio_net_flush_tx is called with notification disabled, we can > race with the guest, processing packets at the same rate as they > get produced. The trouble is that this means we have no guaranteed > exit condition from the function and can spend minutes in there. > Currently flush_tx is only called with notification on, which seems > to limit us to one pass through the queue per call. An upcoming > patch changes this. > > One pass through the queue (256) seems to be a good default value > for this, balancing latency with throughput. We use a signed int > for txburst because 2^31 packets in a burst would take many, many > minutes to process and it allows us to easily return a negative > value from virtio_net_flush_tx() to indicate a back-off > or error condition. > > Signed-off-by: Alex Williamson It might be better not to make this configurable, and simply set txburst = vq->num in code in a single place, rather than the magic 256 constant in multiple places. Alternatively, let's put a macro in a .h file. It might also be a good idea to put the above motivation in comments in the code. 
> --- > > hw/s390-virtio-bus.c |4 +++- > hw/s390-virtio-bus.h |2 ++ > hw/syborg_virtio.c |6 +- > hw/virtio-net.c | 23 --- > hw/virtio-pci.c |6 +- > hw/virtio.h |2 +- > 6 files changed, 32 insertions(+), 11 deletions(-) > > diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c > index 4da0b40..1483362 100644 > --- a/hw/s390-virtio-bus.c > +++ b/hw/s390-virtio-bus.c > @@ -110,7 +110,8 @@ static int s390_virtio_net_init(VirtIOS390Device *dev) > { > VirtIODevice *vdev; > > -vdev = virtio_net_init((DeviceState *)dev, &dev->nic, dev->txtimer); > +vdev = virtio_net_init((DeviceState *)dev, &dev->nic, > + dev->txtimer, dev->txburst); > if (!vdev) { > return -1; > } > @@ -328,6 +329,7 @@ static VirtIOS390DeviceInfo s390_virtio_net = { > .qdev.props = (Property[]) { > DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic), > DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1), > +DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256), > DEFINE_PROP_END_OF_LIST(), > }, > }; > diff --git a/hw/s390-virtio-bus.h b/hw/s390-virtio-bus.h > index 922daf2..808aea0 100644 > --- a/hw/s390-virtio-bus.h > +++ b/hw/s390-virtio-bus.h > @@ -45,6 +45,8 @@ typedef struct VirtIOS390Device { > uint32_t max_virtserial_ports; > /* Timeout value for virtio-net txtimer, 1 = default, >1 = ns timeout */ > uint32_t txtimer; > +/* Max tx packets for virtio-net to burst at a time */ > +int32_t txburst; > } VirtIOS390Device; > > typedef struct VirtIOS390Bus { > diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c > index c8d731a..7b76972 100644 > --- a/hw/syborg_virtio.c > +++ b/hw/syborg_virtio.c > @@ -70,6 +70,8 @@ typedef struct { > uint32_t host_features; > /* Timeout value for virtio-net txtimer, 1 = default, >1 = ns timeout */ > uint32_t txtimer; > +/* Max tx packets for virtio-net to burst at a time */ > +int32_t txburst; > } SyborgVirtIOProxy; > > static uint32_t syborg_virtio_readl(void *opaque, target_phys_addr_t offset) > @@ -286,7 +288,8 @@ static int 
syborg_virtio_net_init(SysBusDevice *dev) > VirtIODevice *vdev; > SyborgVirtIOProxy *proxy = FROM_SYSBUS(SyborgVirtIOProxy, dev); > > -vdev = virtio_net_init(&dev->qdev, &proxy->nic, proxy->txtimer); > +vdev = virtio_net_init(&dev->qdev, &proxy->nic, > + proxy->txtimer, proxy->txburst); > return syborg_virtio_init(proxy, vdev); > } > > @@ -298,6 +301,7 @@ static SysBusDeviceInfo syborg_virtio_net_info = { > DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic), > DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features), > DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1), > +DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256), > DEFINE_PROP_END_OF_LIST(), > } > }; > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > index 9ef29f0..ac4aa8f 100644 > --- a/hw/virtio-net.c > +++ b/hw/virtio-net.c > @@ -37,6 +37,7 @@ typedef struct VirtIONet > NICState *nic; > QEMUTimer *tx_timer; > uint32_t tx_timeout; > +int32_t tx_burst; > int tx_timer_active; > uint32_t has_vnet_hdr; > uint8_t has_ufo; > @@ -620,7 +621,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, > const uint8_t *buf, size_ > return size; > } > > -static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq); > +static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq); > > static void virtio_net_tx_complete(VLANClientState *nc, ssize_t len) > { > @@ -636,16 +637,18 @@ static void virtio_net_tx_complete(VLANClientState *nc, > ssize_t len) > } >
Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx
On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote: > Add the ability to configure the tx_timer timeout and add a bottom > half tx handler that typically shows a nice perf boost over the > time based approach. See last patch for perf details. Make this > the new default when the iothread is enabled. Thanks, > > Alex Looks good overall, but do we really want to make these parameters qdev properties? It really looks like exposing an implementation detail that we'll have a hard time supporting in the future and users will have a hard time understanding. OTOH developers should have no trouble tweaking a define at the top of the file. > --- > > Alex Williamson (5): > virtio-net: Switch default to new bottom half TX handler for iothread > virtio-net: Introduce a new bottom half packet TX > virtio-net: Rename tx_timer_active to tx_waiting > virtio-net: Limit number of packets sent per TX flush > virtio-net: Make tx_timer timeout configurable > > > hw/s390-virtio-bus.c |6 ++ > hw/s390-virtio-bus.h |4 ++ > hw/syborg_virtio.c | 10 +++- > hw/virtio-net.c | 129 > -- > hw/virtio-net.h |2 - > hw/virtio-pci.c | 10 +++- > hw/virtio.h |9 +++ > 7 files changed, 137 insertions(+), 33 deletions(-)
Re: [PATCH 4/5] virtio-net: Introduce a new bottom half packet TX
On Fri, Aug 27, 2010 at 04:37:36PM -0600, Alex Williamson wrote: > Based on a patch from Mark McLoughlin, this patch introduces a new > bottom half packet transmitter that avoids the latency imposed by > the tx_timer approach. Rather than scheduling a timer when a TX > packet comes in, schedule a bottom half to be run from the iothread. > The bottom half handler first attempts to flush the queue with > notification disabled (this is where we could race with a guest > without txburst). If we flush a full burst, reschedule immediately. > If we send short of a full burst, try to re-enable notification. > To avoid a race with TXs that may have occurred, we must then > flush again. If we find some packets to send, the guest is probably > active, so we can reschedule again. > > tx_timer and tx_bh are mutually exclusive, so we can re-use the > tx_waiting flag to indicate one or the other needs to be set up. > This allows us to seamlessly migrate between timer and bh TX > handling. > > Signed-off-by: Alex Williamson > --- > > hw/virtio-net.c | 81 > ++- > 1 files changed, 68 insertions(+), 13 deletions(-) > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > index 8b652f2..3288c77 100644 > --- a/hw/virtio-net.c > +++ b/hw/virtio-net.c > @@ -36,6 +36,7 @@ typedef struct VirtIONet > VirtQueue *ctrl_vq; > NICState *nic; > QEMUTimer *tx_timer; > +QEMUBH *tx_bh; > uint32_t tx_timeout; > int32_t tx_burst; > int tx_waiting; > @@ -704,16 +705,25 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, > VirtQueue *vq) > { > VirtIONet *n = to_virtio_net(vdev); > > -if (n->tx_waiting) { > -virtio_queue_set_notification(vq, 1); > -qemu_del_timer(n->tx_timer); > -n->tx_waiting = 0; > -virtio_net_flush_tx(n, vq); > +if (n->tx_timer) { > +if (n->tx_waiting) { > +virtio_queue_set_notification(vq, 1); > +qemu_del_timer(n->tx_timer); > +n->tx_waiting = 0; > +virtio_net_flush_tx(n, vq); > +} else { > +qemu_mod_timer(n->tx_timer, > + qemu_get_clock(vm_clock) + n->tx_timeout); > +n->tx_waiting = 
1; > +virtio_queue_set_notification(vq, 0); > +} > } else { > -qemu_mod_timer(n->tx_timer, > - qemu_get_clock(vm_clock) + n->tx_timeout); > +if (unlikely(n->tx_waiting)) { > +return; > +} > +virtio_queue_set_notification(n->tx_vq, 0); > +qemu_bh_schedule(n->tx_bh); > n->tx_waiting = 1; > -virtio_queue_set_notification(vq, 0); > } > } > > @@ -731,6 +741,41 @@ static void virtio_net_tx_timer(void *opaque) > virtio_net_flush_tx(n, n->tx_vq); > } > > +static void virtio_net_tx_bh(void *opaque) > +{ > +VirtIONet *n = opaque; > +int32_t ret; > + > +n->tx_waiting = 0; > + > +/* Just in case the driver is not ready any more */ > +if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))) > +return; > + > +ret = virtio_net_flush_tx(n, n->tx_vq); > +if (ret == -EBUSY) { > +return; /* Notification re-enable handled by tx_complete */ > +} > + > +/* If we flush a full burst of packets, assume there are > + * more coming and immediately reschedule */ > +if (ret >= n->tx_burst) { > +qemu_bh_schedule(n->tx_bh); > +n->tx_waiting = 1; > +return; > +} > + > +/* If less than a full burst, re-enable notification and flush > + * anything that may have come in while we weren't looking. If > + * we find something, assume the guest is still active and reschedule */ > +virtio_queue_set_notification(n->tx_vq, 1); > +if (virtio_net_flush_tx(n, n->tx_vq) > 0) { Shouldn't this be virtio_net_flush_tx(n, n->tx_vq) >= n->tx_burst? If we get less than tx_burst, the ring is empty now so no need to reschedule. Right? 
> +virtio_queue_set_notification(n->tx_vq, 0); > +qemu_bh_schedule(n->tx_bh); > +n->tx_waiting = 1; > +} > +} > + > static void virtio_net_save(QEMUFile *f, void *opaque) > { > VirtIONet *n = opaque; > @@ -850,8 +895,12 @@ static int virtio_net_load(QEMUFile *f, void *opaque, > int version_id) > n->mac_table.first_multi = i; > > if (n->tx_waiting) { > -qemu_mod_timer(n->tx_timer, > - qemu_get_clock(vm_clock) + n->tx_timeout); > +if (n->tx_timer) { > +qemu_mod_timer(n->tx_timer, > + qemu_get_clock(vm_clock) + n->tx_timeout); > +} else { > +qemu_bh_schedule(n->tx_bh); > +} > } > return 0; > } > @@ -939,9 +988,9 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf > *conf, > > qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a); > > -n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer,
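The control flow of the bottom-half handler above can be modeled outside QEMU. The sketch below is a hypothetical Python model (TxQueue, tx_bh, flush and reschedule are invented names, not QEMU APIs); it follows the posted patch in rescheduling whenever the post-notification flush sends anything at all, which is exactly the behavior Michael's inline comment questions. The tx_waiting bookkeeping and the -EBUSY path are omitted for brevity.

```python
class TxQueue:
    """Toy stand-in for the virtio TX virtqueue (names invented)."""
    def __init__(self, tx_burst):
        self.pending = []     # packets the "guest" has queued
        self.notify = True    # whether guest kicks cause an exit
        self.tx_burst = tx_burst

    def flush(self):
        """Send at most tx_burst packets; return how many went out."""
        n = min(len(self.pending), self.tx_burst)
        del self.pending[:n]
        return n

def tx_bh(q, reschedule):
    # Mirror of virtio_net_tx_bh's structure (tx_waiting and the
    # -EBUSY path omitted).
    sent = q.flush()
    if sent >= q.tx_burst:
        reschedule()          # full burst: assume more is coming
        return 'rescheduled-full-burst'
    q.notify = True           # re-enable notification, then flush once
    if q.flush() > 0:         # more to close the race with new packets
        q.notify = False
        reschedule()
        return 'rescheduled-race'
    return 'idle'
```

With a burst limit of 4 and 10 queued packets, the model reschedules twice on full bursts, then drains the remainder and goes idle; the race branch only fires if something is queued between the two flushes.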
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On Tue, Aug 31, 2010 at 06:33:34PM +0200, Jes Sorensen wrote: > On 08/31/10 18:28, Avi Kivity wrote: > > On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote: > >> From: Jes Sorensen > >> > >> We regularly see bug reports over this one, however it is a write to > >> a read-only register which some operating systems (including Linux) > >> tend to write to once in a while. > >> > >> Ignore the writes since they do no harm. > >> > > > > Does Linux write it with wrmsr_safe()? If not, I don't see how it > > works. If it does, then we shouldn't ignore the write, instead issue > > the #GP as usual, but be silent about it. > > To be honest, I am having a hard time determining where the write > happens. I looked at this with Gleb and the rip obtained by > kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles > a wrmsr(). > > It was either a > 560 outb(0xfe, 0x64); /* pulse reset low */ > or > 49asm volatile("sti; hlt": : :"memory"); > > which makes no sense to me, but given it's x86, I am not sure if it > could have come from the BIOS or something during reboot? > > Jes kvm_reset_msrs in qemu-kvm-x86.c. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/10 11:04, Jes Sorensen wrote: >> Just grep for the msr name in a guest kernel source that's known to >> trigger the message. > > Been there, done that! This happens with an F13 kernel during reboot. > Ran the search on the expanded 2.6.32.8-149 tree and found no reference > to anything trying to write it, except for KVM backing up the flag, but > that shouldn't happen in the guest. I've been seeing the messages with a 32-bit Fedora 10 guest running a 2.6.27 variant. The wrmsr messages are generated by the time the grub menu appears. ie., pre-OS. David
Re: [Qemu-devel] [PATCH 1/5] virtio-net: Make tx_timer timeout configurable
* Alex Williamson (alex.william...@redhat.com) wrote: > On Tue, 2010-08-31 at 11:00 -0700, Chris Wright wrote: > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > > > index 075f72d..9ef29f0 100644 > > > --- a/hw/virtio-net.c > > > +++ b/hw/virtio-net.c > > > @@ -36,6 +36,7 @@ typedef struct VirtIONet > > > VirtQueue *ctrl_vq; > > > NICState *nic; > > > QEMUTimer *tx_timer; > > > +uint32_t tx_timeout; > > > int tx_timer_active; > > > uint32_t has_vnet_hdr; > > > uint8_t has_ufo; > > > @@ -702,7 +703,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, > > > VirtQueue *vq) > > > virtio_net_flush_tx(n, vq); > > > } else { > > > qemu_mod_timer(n->tx_timer, > > > - qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); > > > + qemu_get_clock(vm_clock) + n->tx_timeout); > > > n->tx_timer_active = 1; > > > virtio_queue_set_notification(vq, 0); > > > } > > > @@ -842,7 +843,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, > > > int version_id) > > > > > > if (n->tx_timer_active) { > > > qemu_mod_timer(n->tx_timer, > > > - qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); > > > + qemu_get_clock(vm_clock) + n->tx_timeout); > > > > I think I'm missing where this is stored? Looks like migration > > would revert a changed tx_timeout back to 150us. > > It's not stored, it can be instantiated on the migration target any way > you please and we can migrate between different values or even different > TX mitigation strategies. If a non-default value is used on the source > and you want to maintain the same behavior, the target needs to be > started the same way. heh, IOW, I did miss how it's stored...on cmdline ;) thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 1/5] virtio-net: Make tx_timer timeout configurable
* Alex Williamson (alex.william...@redhat.com) wrote: > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > index 075f72d..9ef29f0 100644 > --- a/hw/virtio-net.c > +++ b/hw/virtio-net.c > @@ -36,6 +36,7 @@ typedef struct VirtIONet > VirtQueue *ctrl_vq; > NICState *nic; > QEMUTimer *tx_timer; > +uint32_t tx_timeout; > int tx_timer_active; > uint32_t has_vnet_hdr; > uint8_t has_ufo; > @@ -702,7 +703,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, > VirtQueue *vq) > virtio_net_flush_tx(n, vq); > } else { > qemu_mod_timer(n->tx_timer, > - qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); > + qemu_get_clock(vm_clock) + n->tx_timeout); > n->tx_timer_active = 1; > virtio_queue_set_notification(vq, 0); > } > @@ -842,7 +843,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int > version_id) > > if (n->tx_timer_active) { > qemu_mod_timer(n->tx_timer, > - qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); > + qemu_get_clock(vm_clock) + n->tx_timeout); I think I'm missing where this is stored? Looks like migration would revert a changed tx_timeout back to 150us. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 1/5] virtio-net: Make tx_timer timeout configurable
On Tue, 2010-08-31 at 11:00 -0700, Chris Wright wrote: > * Alex Williamson (alex.william...@redhat.com) wrote: > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > > index 075f72d..9ef29f0 100644 > > --- a/hw/virtio-net.c > > +++ b/hw/virtio-net.c > > @@ -36,6 +36,7 @@ typedef struct VirtIONet > > VirtQueue *ctrl_vq; > > NICState *nic; > > QEMUTimer *tx_timer; > > +uint32_t tx_timeout; > > int tx_timer_active; > > uint32_t has_vnet_hdr; > > uint8_t has_ufo; > > @@ -702,7 +703,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, > > VirtQueue *vq) > > virtio_net_flush_tx(n, vq); > > } else { > > qemu_mod_timer(n->tx_timer, > > - qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); > > + qemu_get_clock(vm_clock) + n->tx_timeout); > > n->tx_timer_active = 1; > > virtio_queue_set_notification(vq, 0); > > } > > @@ -842,7 +843,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, > > int version_id) > > > > if (n->tx_timer_active) { > > qemu_mod_timer(n->tx_timer, > > - qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL); > > + qemu_get_clock(vm_clock) + n->tx_timeout); > > I think I'm missing where this is stored? Looks like migration > would revert a changed tx_timeout back to 150us. It's not stored, it can be instantiated on the migration target any way you please and we can migrate between different values or even different TX mitigation strategies. If a non-default value is used on the source and you want to maintain the same behavior, the target needs to be started the same way. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: "KVM internal error. Suberror: 1" with ancient 2.4 kernel as guest
On 08/31/2010 06:49 PM, Avi Kivity wrote: On 08/31/2010 05:32 PM, Jiri Kosina wrote: (qemu) x/5i $eip 0xc027a841: movq (%esi),%mm0 0xc027a844: movq 0x8(%esi),%mm1 0xc027a848: movq 0x10(%esi),%mm2 0xc027a84c: movq 0x18(%esi),%mm3 0xc027a850: movq %mm0,(%edx) === Is there any issue with emulating MMX? Yes. MMX is not currently emulated. If there's a command line option to disable the use of MMX you can try it, otherwise wait for it to be implemented (or implement it yourself). I'll try to do it for 2.6.37, but can't promise anything. You can also run qemu with -cpu qemu32,-mmx. That will expose a cpu without mmx support; hopefully the guest kernel will see that and avoid mmx instructions. -- error compiling committee.c: too many arguments to function
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/10 18:44, Avi Kivity wrote: > On 08/31/2010 07:33 PM, Jes Sorensen wrote: >> On 08/31/10 18:28, Avi Kivity wrote: >>> On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote: From: Jes Sorensen We regularly see bug reports over this one, however it is a write to a read-only register which some operating systems (including Linux) tend to write to once in a while. Ignore the writes since they do no harm. >>> Does Linux write it with wrmsr_safe()? If not, I don't see how it >>> works. If it does, then we shouldn't ignore the write, instead issue >>> the #GP as usual, but be silent about it. >> To be honest, I am having a hard time determining where the write >> happens. I looked at this with Gleb and the rip obtained by >> kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles >> a wrmsr(). >> >> It was either a >> 560outb(0xfe, 0x64); /* pulse reset low */ >> or >> 49asm volatile("sti; hlt": : :"memory"); >> >> which makes no sense to me, > > Just grep for the msr name in a guest kernel source that's known to > trigger the message. Been there, done that! This happens with an F13 kernel during reboot. Ran the search on the expanded 2.6.32.8-149 tree and found no reference to anything trying to write it, except for KVM backing up the flag, but that shouldn't happen in the guest. >> but given it's x86, I am not sure if it >> could have come from the BIOS or something during reboot? > > The bios is the same for all kernels (and is unlikely to mess with > performance counter msrs anyway). I was fooled by this too, it's not a performance counter MSR, it's a CPU frequency scaling MSR. Cheers, Jes -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote: From: Jes Sorensen We regularly see bug reports over this one, however it is a write to a read-only register which some operating systems (including Linux) tend to write to once in a while. Ignore the writes since they do no harm. Does Linux write it with wrmsr_safe()? If not, I don't see how it works. If it does, then we shouldn't ignore the write, instead issue the #GP as usual, but be silent about it.
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/10 18:28, Avi Kivity wrote: > On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote: >> From: Jes Sorensen >> >> We regularly see bug reports over this one, however it is a write to >> a read-only register which some operating systems (including Linux) >> tend to write to once in a while. >> >> Ignore the writes since they do no harm. >> > > Does Linux write it with wrmsr_safe()? If not, I don't see how it > works. If it does, then we shouldn't ignore the write, instead issue > the #GP as usual, but be silent about it. To be honest, I am having a hard time determining where the write happens. I looked at this with Gleb and the rip obtained by kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles a wrmsr(). It was either a 560 outb(0xfe, 0x64); /* pulse reset low */ or 49 asm volatile("sti; hlt": : :"memory"); which makes no sense to me, but given it's x86, I am not sure if it could have come from the BIOS or something during reboot? Jes -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/2010 07:33 PM, Jes Sorensen wrote: On 08/31/10 18:28, Avi Kivity wrote: On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote: From: Jes Sorensen We regularly see bug reports over this one, however it is a write to a read-only register which some operating systems (including Linux) tend to write to once in a while. Ignore the writes since they do no harm. Does Linux write it with wrmsr_safe()? If not, I don't see how it works. If it does, then we shouldn't ignore the write, instead issue the #GP as usual, but be silent about it. To be honest, I am having a hard time determining where the write happens. I looked at this with Gleb and the rip obtained by kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles a wrmsr(). It was either a 560 outb(0xfe, 0x64); /* pulse reset low */ or 49 asm volatile("sti; hlt": : :"memory"); which makes no sense to me, Just grep for the msr name in a guest kernel source that's known to trigger the message. but given it's x86, I am not sure if it could have come from the BIOS or something during reboot? The bios is the same for all kernels (and is unlikely to mess with performance counter msrs anyway). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] *** SUBJECT HERE ***
On 08/31/2010 07:13 PM, Anthony Liguori wrote: Just as an aside, does anyone have a good way to maintain the 00 patches in series with repeated submissions? I tried to store it in git as an empty commit but most of the git tooling doesn't work well with that. I keep each posting in a -v1/ etc directory and copy-paste from that.
Re: [PATCH 0/6] *** SUBJECT HERE ***
On 08/31/2010 11:10 AM, Avi Kivity wrote: On 08/31/2010 05:43 PM, Gerd Hoffmann wrote: You better should pass '00*.patch' instead of '00*' to git send-email so it doesn't mail out the editor backup file ;) That's what I usually do - guess I slipped this time. Just as an aside, does anyone have a good way to maintain the 00 patches in series with repeated submissions? I tried to store it in git as an empty commit but most of the git tooling doesn't work well with that. Regards, Anthony Liguori
Re: [PATCH 0/6] *** SUBJECT HERE ***
On 08/31/2010 05:43 PM, Gerd Hoffmann wrote: You better should pass '00*.patch' instead of '00*' to git send-email so it doesn't mail out the editor backup file ;) That's what I usually do - guess I slipped this time.
Re: "KVM internal error. Suberror: 1" with ancient 2.4 kernel as guest
On 08/31/2010 05:32 PM, Jiri Kosina wrote: (qemu) x/5i $eip 0xc027a841: movq (%esi),%mm0 0xc027a844: movq 0x8(%esi),%mm1 0xc027a848: movq 0x10(%esi),%mm2 0xc027a84c: movq 0x18(%esi),%mm3 0xc027a850: movq %mm0,(%edx) === Is there any issue with emulating MMX? Yes. MMX is not currently emulated. If there's a command line option to disable the use of MMX you can try it, otherwise wait for it to be implemented (or implement it yourself). I'll try to do it for 2.6.37, but can't promise anything.
Re: Re: How kvm handle extern interrupt when guest os is running
I understand, thank you. I found this in the Intel system programming guide: Acknowledge interrupt on exit. The “acknowledge interrupt on exit” VM-exit control in the controlling VMCS controls processor behavior for external interrupt acknowledgement. If the control is 1, the processor acknowledges the interrupt controller to acquire the interrupt vector upon VM exit, and stores the vector in the VM-exit interruption-information field. If the control is 0, the external interrupt is not acknowledged during VM exit. Since RFLAGS.IF is automatically cleared on VM exits due to external interrupts, VMM re-enabling of interrupts (setting RFLAGS.IF = 1) initiates the external interrupt acknowledgement and vectoring of the external interrupt through the monitor/host IDT. And I find that Xen sets “acknowledge interrupt on exit” to 1. 2010-08-31 jemmy858585 From: Avi Kivity Sent: 2010-08-29 22:24:13 To: jemmy858585 Cc: kvm Subject: Re: How kvm handle extern interrupt when guest os is running On 08/29/2010 05:16 PM, jemmy858585 wrote: > I don't understand how KVM handles external interrupts while the guest OS is running. > I think the external interrupt will cause a VM exit, and then KVM should pass this > interrupt to the host OS, which handles it. > But I didn't find the relevant code. > > I think the vmx_complete_interrupts and vmx_handle_exit functions will be executed, > but I didn't find how the external interrupt is passed to the host OS. > > If anyone knows, please tell me, thanks. > The interrupt causes a vmexit but is not consumed by the processor. Instead, it remains pending. As soon as the host enables interrupts the interrupt will be delivered.
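The SDM text quoted above gives the layout of the VM-exit interruption-information field that the processor fills in when "acknowledge interrupt on exit" is set: the vector in bits 0-7, the interruption type in bits 8-10 (0 meaning external interrupt), and a valid flag in bit 31. The decoder below is an illustrative helper written for this thread, not KVM or Xen code:

```python
def decode_intr_info(info):
    """Decode a VM-exit interruption-information field value."""
    return {'valid': bool(info & (1 << 31)),   # bit 31: info is valid
            'type': (info >> 8) & 0x7,         # bits 8-10: 0 = external intr
            'vector': info & 0xff}             # bits 0-7: interrupt vector

# An acknowledged external interrupt on vector 0x20 would carry
# the value 0x80000020 in this field.
```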
Re: [PATCH] cgroups: fix API thinko
On Wed, Aug 25, 2010 at 02:35:20PM -0700, Andrew Morton wrote: > On Fri, 06 Aug 2010 10:38:24 -0600 > Alex Williamson wrote: > > > On Fri, 2010-08-06 at 09:34 -0700, Sridhar Samudrala wrote: > > > On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote: > > > > cgroup_attach_task_current_cg API that have upstream is backwards: we > > > > really need an API to attach to the cgroups from another process A to > > > > the current one. > > > > > > > > In our case (vhost), a priveledged user wants to attach it's task to > > > > cgroups > > > > from a less priveledged one, the API makes us run it in the other > > > > task's context, and this fails. > > > > > > > > So let's make the API generic and just pass in 'from' and 'to' tasks. > > > > Add an inline wrapper for cgroup_attach_task_current_cg to avoid > > > > breaking bisect. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > > > > > Paul, Li, Sridhar, could you please review the following > > > > patch? > > > > > > > > I only compile-tested it due to travel, but looks > > > > straight-forward to me. > > > > Alex Williamson volunteered to test and report the results. > > > > Sending out now for review as I might be offline for a bit. > > > > Will only try to merge when done, obviously. > > > > > > > > If OK, I would like to merge this through -net tree, > > > > together with the patch fixing vhost-net. > > > > Let me know if that sounds ok. > > > > > > > > Thanks! > > > > > > > > This patch is on top of net-next, it is needed for fix > > > > vhost-net regression in net-next, where a non-priveledged > > > > process can't enable the device anymore: > > > > > > > > when qemu uses vhost, inside the ioctl call it > > > > creates a thread, and tries to add > > > > this thread to the groups of current, and it fails. > > > > But we control the thread, so to solve the problem, > > > > we really should tell it 'connect to out cgroups'. 
> > So am I correct to assume that this change is now needed in 2.6.36, and > unneeded in 2.6.35? Yes, I think so. Unless there are objections, I intend to merge this (with the review fixes) through net-2.6 together with a vhost-net patch that depends on this fix. > Can it affect the userspace<->kernel API in amy manner? If so, it > should be backported into earlier kernels to reduce the number of > incompatible kernels out there. I think it doesn't affect anything except 2.6.36-rcX, earlier kernels didn't use this API. > Paul, did you have any comments? > > I didn't see any update in response to the minor review comments, so... > > > include/linux/cgroup.h |1 + > kernel/cgroup.c|6 +++--- > 2 files changed, 4 insertions(+), 3 deletions(-) > > diff -puN include/linux/cgroup.h~cgroups-fix-api-thinko-fix > include/linux/cgroup.h > --- a/include/linux/cgroup.h~cgroups-fix-api-thinko-fix > +++ a/include/linux/cgroup.h > @@ -579,6 +579,7 @@ void cgroup_iter_end(struct cgroup *cgrp > int cgroup_scan_tasks(struct cgroup_scanner *scan); > int cgroup_attach_task(struct cgroup *, struct task_struct *); > int cgroup_attach_task_all(struct task_struct *from, struct task_struct *); > + > static inline int cgroup_attach_task_current_cg(struct task_struct *tsk) > { > return cgroup_attach_task_all(current, tsk); > diff -puN kernel/cgroup.c~cgroups-fix-api-thinko-fix kernel/cgroup.c > --- a/kernel/cgroup.c~cgroups-fix-api-thinko-fix > +++ a/kernel/cgroup.c > @@ -1798,13 +1798,13 @@ out: > int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > { > struct cgroupfs_root *root; > - struct cgroup *cur_cg; > int retval = 0; > > cgroup_lock(); > for_each_active_root(root) { > - struct cgroup *from_cg = task_cgroup_from_root(from, root); > + > + retval = cgroup_attach_task(from_cg, tsk); > if (retval) > break; > } > _
KVM call minutes for August 31
QMP/QPI - declare what's in 0.13 supported - means reasonable effort to avoid breaking something (deprecation is possible) - things that are left, shallow patch or human monitor conversion - how to move forward? - need to improve interfaces (no argument there) - internal vs. external interfaces - QMP == external, stable, extensible, discoverable, fwd/back compat - internal == C, data structures, changeable, non-stable - redefine internal interfaces and work up? - this addresses concern that most changes are in monitor.c - and addresses the concern that we aren't defining proper extensible top level interfaces - work top down from external? - map to internal details...fix internals when external is hard/impossible based on internals - need to focus on QMP command addition in the face of internal details - no disagreement there - decouple monitor and QMP - sane interfaces (proposal for migration from Anthony) - error issues... 0.13 schedule - rc1 tagged locally and under test, once testing completes, upload, then one week to fix any outstanding issues - will push tag later today and upload, announce once propagates (24hrs-ish) qemu-kvm integration - not getting a lot of cycles - nothing major that anthony won't pull - extboot still - performance - i/o threading model (merge both and fix in-tree) - in-kernel apic - device assignment (vfio against qemu tree) - disable ia64 - avi will look at doing the pull request
Re: [PATCH 0/6] *** SUBJECT HERE ***
On 08/31/10 15:30, Avi Kivity wrote: On 08/31/2010 04:25 PM, Avi Kivity wrote: *** BLURB HERE *** That was supposed to be: [PATCH 0/6] kvm_stat tracepoint support This patchset allows kvm_stat to display the information exposed by kvm tracepoints. That was there too. You better should pass '00*.patch' instead of '00*' to git send-email so it doesn't mail out the editor backup file ;) cheers, Gerd
"KVM internal error. Suberror: 1" with ancient 2.4 kernel as guest
Hi, when migrating ancient machine to become KVM guest, I am facing a problem that KVM gives me the error below when being passed oldish vmlinuz image: === # qemu-kvm -kernel vmlinuz-2.4.33 KVM internal error. Suberror: 1 rax rbx 003e rcx rdx c1485180 rsi c00b8000 rdi c1485180 rsp c0305f70 rbp 0fa0 r8 r9 r10 r11 r12 r13 r14 r15 rip c027a841 rflags 0006 cs 0010 (/ p 1 dpl 0 db 1 s 1 type b l 0 g 1 avl 0) ds 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0) es 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0) ss 0018 (/bfff p 1 dpl 0 db 1 s 1 type 7 l 0 g 1 avl 0) fs 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0) gs 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0) tr 0060 (c0301d80/00eb p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0) ldt 0068 (c0289020/0027 p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0) gdt c0288860/7f idt c0288000/7ff cr0 80050033 cr2 0 cr3 287000 cr4 90 cr8 0 efer 0 emulation failure, check dmesg for details === There is no message emitted in dmesg. The instructions around eip look reasonably valid from a quick look (this is monitor output from the time it crashes): === (qemu) x/5i $eip 0xc027a841: movq (%esi),%mm0 0xc027a844: movq 0x8(%esi),%mm1 0xc027a848: movq 0x10(%esi),%mm2 0xc027a84c: movq 0x18(%esi),%mm3 0xc027a850: movq %mm0,(%edx) === Is there any issue with emulating MMX? This is with 0.12.3 KVM module on 2.6.32. The kernel image which triggers this can be downloaded from http://www.jikos.cz/jikos/junk/vmlinuz-2.4.33 Thanks, -- Jiri Kosina SUSE Labs, Novell Inc. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VIA Nano processor U2250 (1.6GHz Capable)
On 08/30/2010 08:35 PM, Pierre-Philipp Braun wrote: Hi, I'm having a little issue starting a KVM guest. Is that a known Issue on VIA Nano vmx capable processors ? It's a bug in the cpu. It's possible that 2.6.37 will not trigger that bug, though.
Re: [PATCH 0/6] *** SUBJECT HERE ***
On 08/31/2010 04:25 PM, Avi Kivity wrote: *** BLURB HERE *** That was supposed to be: [PATCH 0/6] kvm_stat tracepoint support This patchset allows kvm_stat to display the information exposed by kvm tracepoints.
Re: KVM call agenda for August 31
On 08/31/2010 01:21 AM, Avi Kivity wrote: On 08/31/2010 12:54 AM, Juan Quintela wrote: Please send in any agenda items you are interested in covering. zero (dot|point) thirteen schedule -rc1 is tagged locally and undergoing regression testing. Once that finishes (later today), I'll push and upload. Once it's announced (~24 hours later) we've got one week until 0.13.0 to bring in additional stable fixes. Regards, Anthony Liguori
[PATCH 3/6] kvm_stat: provide detailed kvm_exit:exit_reason display
In addition to displaying kvm_exit as an aggregate counter, use perf_event's filter capability to count individual exit reasons. Signed-off-by: Avi Kivity --- kvm/kvm_stat | 156 +- 1 files changed, 155 insertions(+), 1 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index 9b0392b..677683a 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -16,6 +16,143 @@ class DebugfsProvider(object): return int(file(self.base + '/' + key).read()) return dict([(key, val(key)) for key in self._fields]) +vmx_exit_reasons = { +0: 'EXCEPTION_NMI', +1: 'EXTERNAL_INTERRUPT', +2: 'TRIPLE_FAULT', +7: 'PENDING_INTERRUPT', +8: 'NMI_WINDOW', +9: 'TASK_SWITCH', +10: 'CPUID', +12: 'HLT', +14: 'INVLPG', +15: 'RDPMC', +16: 'RDTSC', +18: 'VMCALL', +19: 'VMCLEAR', +20: 'VMLAUNCH', +21: 'VMPTRLD', +22: 'VMPTRST', +23: 'VMREAD', +24: 'VMRESUME', +25: 'VMWRITE', +26: 'VMOFF', +27: 'VMON', +28: 'CR_ACCESS', +29: 'DR_ACCESS', +30: 'IO_INSTRUCTION', +31: 'MSR_READ', +32: 'MSR_WRITE', +33: 'INVALID_STATE', +36: 'MWAIT_INSTRUCTION', +39: 'MONITOR_INSTRUCTION', +40: 'PAUSE_INSTRUCTION', +41: 'MCE_DURING_VMENTRY', +43: 'TPR_BELOW_THRESHOLD', +44: 'APIC_ACCESS', +48: 'EPT_VIOLATION', +49: 'EPT_MISCONFIG', +54: 'WBINVD', +55: 'XSETBV', +} + +svm_exit_reasons = { +0x000: 'READ_CR0', +0x003: 'READ_CR3', +0x004: 'READ_CR4', +0x008: 'READ_CR8', +0x010: 'WRITE_CR0', +0x013: 'WRITE_CR3', +0x014: 'WRITE_CR4', +0x018: 'WRITE_CR8', +0x020: 'READ_DR0', +0x021: 'READ_DR1', +0x022: 'READ_DR2', +0x023: 'READ_DR3', +0x024: 'READ_DR4', +0x025: 'READ_DR5', +0x026: 'READ_DR6', +0x027: 'READ_DR7', +0x030: 'WRITE_DR0', +0x031: 'WRITE_DR1', +0x032: 'WRITE_DR2', +0x033: 'WRITE_DR3', +0x034: 'WRITE_DR4', +0x035: 'WRITE_DR5', +0x036: 'WRITE_DR6', +0x037: 'WRITE_DR7', +0x040: 'EXCP_BASE', +0x060: 'INTR', +0x061: 'NMI', +0x062: 'SMI', +0x063: 'INIT', +0x064: 'VINTR', +0x065: 'CR0_SEL_WRITE', +0x066: 'IDTR_READ', +0x067: 'GDTR_READ', +0x068: 'LDTR_READ', +0x069: 'TR_READ', +0x06a: 'IDTR_WRITE', +0x06b: 'GDTR_WRITE', +0x06c: 
'LDTR_WRITE', +0x06d: 'TR_WRITE', +0x06e: 'RDTSC', +0x06f: 'RDPMC', +0x070: 'PUSHF', +0x071: 'POPF', +0x072: 'CPUID', +0x073: 'RSM', +0x074: 'IRET', +0x075: 'SWINT', +0x076: 'INVD', +0x077: 'PAUSE', +0x078: 'HLT', +0x079: 'INVLPG', +0x07a: 'INVLPGA', +0x07b: 'IOIO', +0x07c: 'MSR', +0x07d: 'TASK_SWITCH', +0x07e: 'FERR_FREEZE', +0x07f: 'SHUTDOWN', +0x080: 'VMRUN', +0x081: 'VMMCALL', +0x082: 'VMLOAD', +0x083: 'VMSAVE', +0x084: 'STGI', +0x085: 'CLGI', +0x086: 'SKINIT', +0x087: 'RDTSCP', +0x088: 'ICEBP', +0x089: 'WBINVD', +0x08a: 'MONITOR', +0x08b: 'MWAIT', +0x08c: 'MWAIT_COND', +0x400: 'NPF', +} + +vendor_exit_reasons = { +'vmx': vmx_exit_reasons, +'svm': svm_exit_reasons, +} + +exit_reasons = None + +for line in file('/proc/cpuinfo').readlines(): +if line.startswith('flags'): +for flag in line.split(): +if flag in vendor_exit_reasons: +exit_reasons = vendor_exit_reasons[flag] + +filters = { +'kvm_exit': ('exit_reason', exit_reasons) +} + +def invert(d): +return dict((x[1], x[0]) for x in d.iteritems()) + +for f in filters: +filters[f] = (filters[f][0], invert(filters[f][1])) + import ctypes, struct, array libc = ctypes.CDLL('libc.so.6') @@ -62,6 +199,7 @@ PERF_FORMAT_TOTAL_TIME_RUNNING = 1 << 1 PERF_FORMAT_ID = 1 << 2 PERF_FORMAT_GROUP = 1 << 3 +import re class TracepointProvider(object): def __init__(self): @@ -69,6 +207,13 @@ class TracepointProvider(object): fields = [f for f in os.listdir(self.base) if os.path.isdir(self.base + '/' + f)] +extra = [] +for f in fields: +if f in filters: +subfield, values = filters[f] +for name, number in values.iteritems(): +extra.append(f + '(' + name + ')') +fields += extra self.select(fields) def fields(self): return self._fields @@ -80,10 +225,14 @@ class TracepointProvider(object): for cpu in self.cpus: group_leader = -1 for f in _fields: +fbase, sub = f, None +m = re.match(r'(.*)\((.*)\)', f) +if m: +fbase, sub = m.groups() attr = perf_event_attr() attr.type = PERF_TYPE_TRACEPOINT attr.size = ctypes.sizeof(attr) -id = 
int(file(self.base + f + '/id').rea
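The name-to-number plumbing this patch adds can be sketched in isolation — the editor's illustration below (Python 3; the original script targets Python 2 and uses `iteritems`) shows how an exit-reason table is inverted for filter programming and how synthetic per-reason fields such as `kvm_exit(HLT)` are generated. The three-entry table is a hypothetical subset of the full `vmx_exit_reasons` dict:

```python
# Subset of the patch's vmx_exit_reasons table: number -> name.
vmx_exit_reasons = {0: 'EXCEPTION_NMI', 12: 'HLT', 30: 'IO_INSTRUCTION'}

def invert(d):
    # Flip number -> name into name -> number, so a filter can be
    # programmed from a human-readable reason name.
    return dict((name, number) for number, name in d.items())

filters = {'kvm_exit': ('exit_reason', invert(vmx_exit_reasons))}

# Synthesize one extra field per exit reason, as the patch does when
# extending the tracepoint field list.
fields = ['kvm_exit', 'kvm_entry']
extra = []
for f in fields:
    if f in filters:
        subfield, values = filters[f]
        for name in values:
            extra.append(f + '(' + name + ')')
fields += extra
```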
[PATCH 6/6] kvm_stat: be slower
More information is displayed, so more time is needed to process it. Signed-off-by: Avi Kivity --- kvm/kvm_stat |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index b7bc846..e68ca4e 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -327,7 +327,7 @@ def tui(screen, stats): while True: refresh() -curses.halfdelay(10) +curses.halfdelay(30) try: c = screen.getkey() if c == 'q': -- 1.7.1
[PATCH 2/6] kvm_stat: implement tracepoint stats provider
Signed-off-by: Avi Kivity --- kvm/kvm_stat | 101 +- 1 files changed, 100 insertions(+), 1 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index 75424fc..9b0392b 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -16,6 +16,100 @@ class DebugfsProvider(object): return int(file(self.base + '/' + key).read()) return dict([(key, val(key)) for key in self._fields]) +import ctypes, struct, array + +libc = ctypes.CDLL('libc.so.6') +syscall = libc.syscall +class perf_event_attr(ctypes.Structure): +_fields_ = [('type', ctypes.c_uint32), +('size', ctypes.c_uint32), +('config', ctypes.c_uint64), +('sample_freq', ctypes.c_uint64), +('sample_type', ctypes.c_uint64), +('read_format', ctypes.c_uint64), +('flags', ctypes.c_uint64), +('wakeup_events', ctypes.c_uint32), +('bp_type', ctypes.c_uint32), +('bp_addr', ctypes.c_uint64), +('bp_len', ctypes.c_uint64), +] +def _perf_event_open(attr, pid, cpu, group_fd, flags): +return syscall(298, ctypes.pointer(attr), ctypes.c_int(pid), + ctypes.c_int(cpu), ctypes.c_int(group_fd), + ctypes.c_long(flags)) + +PERF_TYPE_HARDWARE = 0 +PERF_TYPE_SOFTWARE = 1 +PERF_TYPE_TRACEPOINT = 2 +PERF_TYPE_HW_CACHE = 3 +PERF_TYPE_RAW = 4 +PERF_TYPE_BREAKPOINT = 5 + +PERF_SAMPLE_IP = 1 << 0 +PERF_SAMPLE_TID= 1 << 1 +PERF_SAMPLE_TIME = 1 << 2 +PERF_SAMPLE_ADDR = 1 << 3 +PERF_SAMPLE_READ = 1 << 4 +PERF_SAMPLE_CALLCHAIN = 1 << 5 +PERF_SAMPLE_ID = 1 << 6 +PERF_SAMPLE_CPU= 1 << 7 +PERF_SAMPLE_PERIOD = 1 << 8 +PERF_SAMPLE_STREAM_ID = 1 << 9 +PERF_SAMPLE_RAW= 1 << 10 + +PERF_FORMAT_TOTAL_TIME_ENABLED = 1 << 0 +PERF_FORMAT_TOTAL_TIME_RUNNING = 1 << 1 +PERF_FORMAT_ID = 1 << 2 +PERF_FORMAT_GROUP = 1 << 3 + + +class TracepointProvider(object): +def __init__(self): +self.base = '/sys/kernel/debug/tracing/events/kvm/' +fields = [f + for f in os.listdir(self.base) + if os.path.isdir(self.base + '/' + f)] +self.select(fields) +def fields(self): +return self._fields +def select(self, _fields): +self._fields = _fields +self.cpus = [0, 1] +fds = [] +self.group_leaders = [] 
+for cpu in self.cpus: +group_leader = -1 +for f in _fields: +attr = perf_event_attr() +attr.type = PERF_TYPE_TRACEPOINT +attr.size = ctypes.sizeof(attr) +id = int(file(self.base + f + '/id').read()) +attr.config = id +attr.sample_type = (PERF_SAMPLE_RAW +| PERF_SAMPLE_TIME +| PERF_SAMPLE_CPU) +attr.sample_period = 1 +attr.read_format = PERF_FORMAT_GROUP +fd = _perf_event_open(attr, -1, cpu, group_leader, 0) +if fd == -1: +raise Exception('perf_event_open failed') +if group_leader == -1: +group_leader = fd +fds.append(fd) +self.group_leaders.append(group_leader) +self.fds = fds +self.files = [os.fdopen(group_leader) + for group_leader in self.group_leaders] +def read(self): +ret = dict([(f, 0) for f in self._fields]) +bytes = 8 * (1 + len(self._fields)) +fmt = '' + 'q' * len(self._fields) +for file in self.files: +a = struct.unpack(fmt, file.read(bytes)) +for field, val in zip(self._fields, a): +ret[field] += val +return ret + class Stats: def __init__(self, provider, fields = None): def wanted(key): @@ -133,7 +227,12 @@ options.add_option('-f', '--fields', ) (options, args) = options.parse_args(sys.argv) -stats = Stats(provider = DebugfsProvider(), fields = options.fields) +try: +provider = TracepointProvider() +except: +provider = DebugfsProvider() + +stats = Stats(provider, fields = options.fields) if options.log: log(stats) -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
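Per perf_event_open(2), a `read()` on a group leader opened with `read_format = PERF_FORMAT_GROUP` returns `{ u64 nr; u64 values[nr]; }`. Note that the posted `read()` sizes its buffer for the leading `nr` word (`8 * (1 + len(fields))`) but its unpack format covers only the values, so the count word has to be accounted for. An editor's sketch of decoding that documented layout, against a synthetic buffer:

```python
import struct

def decode_group_read(buf, field_names):
    # Layout with PERF_FORMAT_GROUP (and no time fields enabled):
    #   u64 nr; u64 values[nr];
    nr = struct.unpack_from('q', buf, 0)[0]
    values = struct.unpack_from('q' * nr, buf, 8)
    return dict(zip(field_names, values))

# Hypothetical kernel reply: nr=2 followed by two counter values.
buf = struct.pack('qqq', 2, 100, 7)
counts = decode_group_read(buf, ['kvm_exit', 'kvm_entry'])
```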
[PATCH 5/6] kvm_stat: increase label width
With the kvm_exit drill-down, labels are pretty large. Signed-off-by: Avi Kivity --- kvm/kvm_stat |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index c55bf88..b7bc846 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -295,7 +295,7 @@ if not os.access('/sys/kernel/debug/kvm', os.F_OK): print "and ensure the kvm modules are loaded" sys.exit(1) -label_width = 20 +label_width = 40 number_width = 10 def tui(screen, stats): -- 1.7.1
[PATCH 4/6] kvm_stat: sort tui output according to highest occurrence
With plenty of counters and wide, short screens it's hard to see what's on the top. So sort the output. Signed-off-by: Avi Kivity --- kvm/kvm_stat |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index 677683a..c55bf88 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -306,7 +306,12 @@ def tui(screen, stats): screen.addstr(0, 0, 'kvm statistics') row = 2 s = stats.get() -for key in sorted(s.keys()): +def sortkey(x): +if s[x][1]: +return -s[x][1] +else: +return x +for key in sorted(s.keys(), key = sortkey): if row >= screen.getmaxyx()[0]: break values = s[key] -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
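The `sortkey` above returns a negated delta for active counters and the string key for quiescent ones — mixed-type comparison that works under the script's Python 2 but would raise TypeError on Python 3. An editor's type-stable sketch of the same ordering (sample stats dict is hypothetical; values are `(total, delta)` pairs as in the script):

```python
# (total, delta) per counter; delta is None before the second sample.
s = {'kvm_exit': (5000, 120), 'kvm_entry': (5000, 300), 'kvm_fpu': (1, None)}

def sortkey(x):
    delta = s[x][1]
    if delta:
        # Active counters first, largest per-interval delta on top.
        return (0, -delta)
    # Quiescent counters trail, sorted by name.
    return (1, x)

ordered = sorted(s.keys(), key=sortkey)
```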
[PATCH 0/6] kvm_stat tracepoint support
This patchset allows kvm_stat to display the information exposed by kvm tracepoints. Avi Kivity (6): kvm_stat: refactor to separate stats provider from difference engine kvm_stat: implement tracepoint stats provider kvm_stat: provide detailed kvm_exit:exit_reason display kvm_stat: sort tui output according to highest occurrence kvm_stat: increase label width kvm_stat: be slower kvm/kvm_stat | 297 +++--- 1 files changed, 285 insertions(+), 12 deletions(-)
[PATCH 1/6] kvm_stat: refactor to separate stats provider from difference engine
Signed-off-by: Avi Kivity --- kvm/kvm_stat | 33 - 1 files changed, 24 insertions(+), 9 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index 21aff5b..75424fc 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -3,21 +3,36 @@ import curses import sys, os, time, optparse +class DebugfsProvider(object): +def __init__(self): +self.base = '/sys/kernel/debug/kvm' +self._fields = os.listdir(self.base) +def fields(self): +return self._fields +def select(self, fields): +self._fields = fields +def read(self): +def val(key): +return int(file(self.base + '/' + key).read()) +return dict([(key, val(key)) for key in self._fields]) + class Stats: -def __init__(self, fields = None): +def __init__(self, provider, fields = None): def wanted(key): import re if not fields: return True return re.match(fields, key) != None -self.base = '/sys/kernel/debug/kvm' -self.values = {} -for key in os.listdir(self.base): -if wanted(key): -self.values[key] = None +self.provider = provider +self.values = dict([(key, None) +for key in provider.fields() +if wanted(key)]) +self.provider.select(self.values.keys()) def get(self): -for key, oldval in self.values.iteritems(): -newval = int(file(self.base + '/' + key).read()) +new = self.provider.read() +for key in self.provider.fields(): +oldval = self.values[key] +newval = new[key] newdelta = None if oldval is not None: newdelta = newval - oldval[0] @@ -118,7 +133,7 @@ options.add_option('-f', '--fields', ) (options, args) = options.parse_args(sys.argv) -stats = Stats(fields = options.fields) +stats = Stats(provider = DebugfsProvider(), fields = options.fields) if options.log: log(stats) -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
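The refactor's contract is small: any object with `fields()`, `select()`, and `read()` can feed the difference engine, which keeps `(value, delta)` pairs. A minimal editor's model with a hypothetical fake provider (not part of the patch) shows the two halves interacting:

```python
class FakeProvider(object):
    # Stand-in for DebugfsProvider/TracepointProvider: each read()
    # returns monotonically growing counters.
    def __init__(self):
        self._fields = ['kvm_exit']
        self._count = 0
    def fields(self):
        return self._fields
    def select(self, fields):
        self._fields = fields
    def read(self):
        self._count += 100
        return {'kvm_exit': self._count}

class Stats(object):
    # Difference engine: remembers the last value to compute deltas.
    def __init__(self, provider):
        self.provider = provider
        self.values = dict((k, None) for k in provider.fields())
    def get(self):
        new = self.provider.read()
        for key in self.provider.fields():
            old = self.values[key]
            delta = new[key] - old[0] if old is not None else None
            self.values[key] = (new[key], delta)
        return self.values

stats = Stats(FakeProvider())
first = dict(stats.get())    # no delta yet
second = dict(stats.get())   # delta since previous sample
```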
Re: Is it possible to live migrate guest OS'es between different versions of kvm/qemu-kvm?
On 08/30/2010 11:16 PM, Brian Jackson wrote: On Monday, August 30, 2010 07:04:51 am Nils Cant wrote: Hey guys, next try is without libvirt, but still no joy. After issuing 'migrate -d' on my sending host (qemu-kvm 0.11.0), I get the following output on the receiving host (qemu-kvm 0.12.4): A quick search of this and/or the qemu mailing lists would have told you this is unsupported. It should work between 0.12 and 0.13, though. -- error compiling committee.c: too many arguments to function
[PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
From: Jes Sorensen We regularly see bug reports about this one; however, it is a write to a read-only register that some operating systems (including Linux) tend to write to once in a while. Ignore the writes since they do no harm. Signed-off-by: Jes Sorensen --- arch/x86/kvm/x86.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4014d6c..2e1ae7f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1439,6 +1439,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) pr_unimpl(vcpu, "unimplemented perfctr wrmsr: " "0x%x data 0x%llx\n", msr, data); break; + case MSR_IA32_PERF_STATUS: + /* +* Write to read-only register, has no effect, so +* lets not bother users with a warning about this. +*/ + break; case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15: if (kvm_hv_msr_partition_wide(msr)) { int r; -- 1.7.1
[PATCH kvm-unit-tests v2 7/8] Add a simple kvmclock driver
Most of the code was borrowed from arch/x86/kernel/kvmclock.c. A special bit: PV_CLOCK_CYCLE_RAW_TEST_BIT is used to notify the driver to return unadjusted cycles. Signed-off-by: Jason Wang --- x86/kvmclock.c | 279 x86/kvmclock.h | 60 2 files changed, 339 insertions(+), 0 deletions(-) create mode 100644 x86/kvmclock.c create mode 100644 x86/kvmclock.h diff --git a/x86/kvmclock.c b/x86/kvmclock.c new file mode 100644 index 000..0624da3 --- /dev/null +++ b/x86/kvmclock.c @@ -0,0 +1,279 @@ +#include "libcflat.h" +#include "smp.h" +#include "atomic.h" +#include "processor.h" +#include "kvmclock.h" + +#define unlikely(x)__builtin_expect(!!(x), 0) +#define likely(x) __builtin_expect(!!(x), 1) + + +struct pvclock_vcpu_time_info __attribute__((aligned(4))) hv_clock[MAX_CPU]; +struct pvclock_wall_clock wall_clock; +static unsigned char valid_flags = 0; +static atomic64_t last_value = ATOMIC64_INIT(0); + +/* + * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, + * yielding a 64-bit result. + */ +static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift) +{ + u64 product; +#ifdef __i386__ + u32 tmp1, tmp2; +#endif + + if (shift < 0) + delta >>= -shift; + else + delta <<= shift; + +#ifdef __i386__ + __asm__ ( + "mul %5 ; " + "mov %4,%%eax ; " + "mov %%edx,%4 ; " + "mul %5 ; " + "xor %5,%5; " + "add %4,%%eax ; " + "adc %5,%%edx ; " + : "=A" (product), "=r" (tmp1), "=r" (tmp2) + : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) ); +#elif defined(__x86_64__) + __asm__ ( + "mul %%rdx ; shrd $32,%%rdx,%%rax" + : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) ); +#else +#error implement me!
+#endif + + return product; +} + +#ifdef __i386__ +# define do_div(n,base) ({ \ + u32 __base = (base);\ + u32 __rem; \ + __rem = ((u64)(n)) % __base;\ + (n) = ((u64)(n)) / __base; \ + __rem; \ + }) +#else +u32 __attribute__((weak)) __div64_32(u64 *n, u32 base) +{ + u64 rem = *n; + u64 b = base; + u64 res, d = 1; + u32 high = rem >> 32; + + /* Reduce the thing a bit first */ + res = 0; + if (high >= base) { + high /= base; + res = (u64) high << 32; + rem -= (u64) (high*base) << 32; + } + + while ((s64)b > 0 && b < rem) { + b = b+b; + d = d+d; + } + + do { + if (rem >= b) { + rem -= b; + res += d; + } + b >>= 1; + d >>= 1; + } while (d); + + *n = res; + return rem; +} + +# define do_div(n,base) ({ \ + u32 __base = (base);\ + u32 __rem; \ + (void)(((typeof((n)) *)0) == ((u64 *)0)); \ + if (likely(((n) >> 32) == 0)) { \ + __rem = (u32)(n) % __base; \ + (n) = (u32)(n) / __base;\ + } else \ + __rem = __div64_32(&(n), __base); \ + __rem; \ + }) +#endif + +/** + * set_normalized_timespec - set timespec sec and nsec parts and normalize + * + * @ts:pointer to timespec variable to be set + * @sec: seconds to set + * @nsec: nanoseconds to set + * + * Set seconds and nanoseconds field of a timespec variable and + * normalize to the timespec storage format + * + * Note: The tv_nsec part is always in the range of + * 0 <= tv_nsec < NSEC_PER_SEC + * For negative values only the tv_sec field is negative ! + */ +void set_normalized_timespec(struct timespec *ts, long sec, s64 nsec) +{ + while (nsec >= NSEC_PER_SEC) { + /* +* The following asm() prevents the compiler from +* optimising this loop into a modulo operation. See +* also __iter_div_u64_rem() in include/linux/time.h +*/ + asm("" : "+rm"(nsec)); + nsec -= NSEC_PER_SEC; + ++sec; + } + while (nsec < 0) { + asm("" : "+rm"(nsec)); + nsec += NSEC_PER_SEC; + --sec; + } + ts->tv_sec = sec; + ts->tv_nsec = nsec; +} + +static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) +{ + u64
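The inline assembly in `scale_delta()` is a 64x32 fixed-point multiply that cannot overflow a plain 64-bit product in C. With Python's unbounded integers the same arithmetic can be written directly — an editor's model of the conversion `ns = ((delta << shift) * mul_frac) >> 32`, using hypothetical TSC parameters:

```python
def scale_delta(delta, mul_frac, shift):
    # Negative shift means divide (right shift) before scaling, exactly
    # as in the patch's C version.
    if shift < 0:
        delta >>= -shift
    else:
        delta <<= shift
    return (delta * mul_frac) >> 32

# Hypothetical parameters: a 1 GHz TSC is 1 ns/cycle -> mul_frac = 2**32;
# a 2 GHz TSC is 0.5 ns/cycle -> mul_frac = 2**31 (shift 0 in both).
ns_1ghz = scale_delta(1000, 1 << 32, 0)
ns_2ghz = scale_delta(2000, 1 << 31, 0)
```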
[PATCH kvm-unit-tests v2 8/8] Add tests for kvm-clock
This patch implements three tests for kvmclock. The first checks whether the time of day returned by kvmclock matches the value obtained from the host. The second checks whether the kvmclock cycle count grows monotonically in an SMP guest, also recording the number of warps and stalls. The last is a performance test measuring the guest cycles spent reading the kvmclock cycle count. The test accepts three parameters: the number of test loops, the seconds since 1970-01-01 00:00:00 UTC (easily obtained with date +%s), and the maximum accepted offset between the guest and host time of day. Signed-off-by: Jason Wang --- config-x86-common.mak |6 +- x86/README|1 x86/kvmclock_test.c | 166 + x86/unittests.cfg |5 + 4 files changed, 177 insertions(+), 1 deletions(-) create mode 100644 x86/kvmclock_test.c diff --git a/config-x86-common.mak b/config-x86-common.mak index b8ca859..b541c1c 100644 --- a/config-x86-common.mak +++ b/config-x86-common.mak @@ -26,7 +26,8 @@ FLATLIBS = lib/libcflat.a $(libgcc) tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \ $(TEST_DIR)/smptest.flat $(TEST_DIR)/port80.flat \ $(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \ - $(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat + $(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \ + $(TEST_DIR)/kvmclock_test.flat tests_and_config = $(TEST_DIR)/*.flat $(TEST_DIR)/unittests.cfg @@ -70,6 +71,9 @@ $(TEST_DIR)/rmap_chain.flat: $(cstart.o) $(TEST_DIR)/rmap_chain.o \ $(TEST_DIR)/svm.flat: $(cstart.o) $(TEST_DIR)/vm.o +$(TEST_DIR)/kvmclock_test.flat: $(cstart.o) $(TEST_DIR)/kvmclock.o \ +$(TEST_DIR)/kvmclock_test.o + arch_clean: $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat \ $(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o diff --git a/x86/README b/x86/README index ab5a2ae..215fb2f 100644 --- a/x86/README +++ b/x86/README @@ -12,3 +12,4 @@ sieve: heavy memory access with no paging and with paging static and with paging smptest: run smp_id() on every cpu and compares return value to number tsc: write
to tsc(0) and write to tsc(1000) and read it back vmexit: long loops for each: cpuid, vmcall, mov_from_cr8, mov_to_cr8, inl_pmtimer, ipi, ipi+halt +kvmclock_test: test of wallclock, monotonic cycle and performance of kvmclock diff --git a/x86/kvmclock_test.c b/x86/kvmclock_test.c new file mode 100644 index 000..5b14ae2 --- /dev/null +++ b/x86/kvmclock_test.c @@ -0,0 +1,166 @@ +#include "libcflat.h" +#include "smp.h" +#include "atomic.h" +#include "processor.h" +#include "kvmclock.h" + +#define DEFAULT_TEST_LOOPS 1L +#define DEFAULT_THRESHOLD 5L + +struct test_info { +struct spinlock lock; +long loops; /* test loops */ +u64 warps;/* warp count */ +u64 stalls; /* stall count */ +long long worst; /* worst warp */ +volatile cycle_t last;/* last cycle seen by test */ +atomic_t ncpus; /* number of cpu in the test*/ +int check;/* check cycle ? */ +}; + +struct test_info ti[4]; + +static int wallclock_test(long sec, long threshold) +{ +long ksec, offset; +struct timespec ts; + +printf("Wallclock test, threshold %ld\n", threshold); +kvm_get_wallclock(&ts); +ksec = ts.tv_sec; + +offset = ksec - sec; +printf("Seconds get from host: %ld\n", sec); +printf("Seconds get from kvmclock: %ld\n", ksec); +printf("Offset:%ld\n", offset); + +if (offset > threshold || offset < -threshold) { +printf("offset too large!\n"); +return 1; +} + +return 0; +} + +static void kvm_clock_test(void *data) +{ +struct test_info *hv_test_info = (struct test_info *)data; +long i, check = hv_test_info->check; + +for (i = 0; i < hv_test_info->loops; i++){ +cycle_t t0, t1; +long long delta; + +if (check == 0) { +kvm_clock_read(); +continue; +} + +spin_lock(&hv_test_info->lock); +t1 = kvm_clock_read(); +t0 = hv_test_info->last; +hv_test_info->last = kvm_clock_read(); +spin_unlock(&hv_test_info->lock); + +delta = t1 - t0; +if (delta < 0) { +spin_lock(&hv_test_info->lock); +++hv_test_info->warps; +if (delta < hv_test_info->worst){ +hv_test_info->worst = delta; +printf("Worst warp %lld %\n", 
hv_test_info->worst); +} +spin_unloc
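The monotonicity check in `kvm_clock_test()` flags a "warp" whenever a reading goes backwards relative to the last value seen, tracking the most negative delta. An editor's model of that bookkeeping over a hypothetical sample sequence:

```python
def count_warps(samples):
    # A warp is a negative delta between consecutive readings; 'worst'
    # is the most negative delta observed (0 if none).
    warps, worst, last = 0, 0, None
    for t in samples:
        if last is not None:
            delta = t - last
            if delta < 0:
                warps += 1
                worst = min(worst, delta)
        last = t
    return warps, worst

# Two backwards steps: 250 -> 240 (-10) and 300 -> 180 (-120).
warps, worst = count_warps([100, 250, 240, 300, 180])
```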
[PATCH kvm-unit-tests v2 5/8] Export tsc related helpers
Signed-off-by: Jason Wang --- lib/x86/processor.h | 22 ++ x86/tsc.c | 16 +--- x86/vmexit.c| 15 --- 3 files changed, 23 insertions(+), 30 deletions(-) diff --git a/lib/x86/processor.h b/lib/x86/processor.h index 67d7ca5..c348808 100644 --- a/lib/x86/processor.h +++ b/lib/x86/processor.h @@ -258,4 +258,26 @@ static inline void sti(void) asm volatile ("sti"); } +static inline unsigned long long rdtsc() +{ + long long r; + +#ifdef __x86_64__ + unsigned a, d; + + asm volatile ("rdtsc" : "=a"(a), "=d"(d)); + r = a | ((long long)d << 32); +#else + asm volatile ("rdtsc" : "=A"(r)); +#endif + return r; +} + +static inline void wrtsc(u64 tsc) +{ + unsigned a = tsc, d = tsc >> 32; + + asm volatile("wrmsr" : : "a"(a), "d"(d), "c"(0x10)); +} + #endif diff --git a/x86/tsc.c b/x86/tsc.c index 394a9c6..58f332d 100644 --- a/x86/tsc.c +++ b/x86/tsc.c @@ -1,19 +1,5 @@ #include "libcflat.h" - -u64 rdtsc(void) -{ - unsigned a, d; - - asm volatile("rdtsc" : "=a"(a), "=d"(d)); - return a | (u64)d << 32; -} - -void wrtsc(u64 tsc) -{ - unsigned a = tsc, d = tsc >> 32; - - asm volatile("wrmsr" : : "a"(a), "d"(d), "c"(0x10)); -} +#include "processor.h" void test_wrtsc(u64 t1) { diff --git a/x86/vmexit.c b/x86/vmexit.c index 84c384d..551083d 100644 --- a/x86/vmexit.c +++ b/x86/vmexit.c @@ -3,21 +3,6 @@ #include "smp.h" #include "processor.h" -static inline unsigned long long rdtsc() -{ - long long r; - -#ifdef __x86_64__ - unsigned a, d; - - asm volatile ("rdtsc" : "=a"(a), "=d"(d)); - r = a | ((long long)d << 32); -#else - asm volatile ("rdtsc" : "=A"(r)); -#endif - return r; -} - static unsigned int inl(unsigned short port) { unsigned int val; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
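The exported helpers only shuttle a 64-bit value through the EDX:EAX register pair — `rdtsc` combines the two 32-bit halves, `wrtsc` splits them back for the `wrmsr` to MSR 0x10. The arithmetic, modeled in Python for illustration:

```python
def combine(eax, edx):
    # rdtsc: 64-bit counter = high half in EDX, low half in EAX.
    return eax | (edx << 32)

def split(tsc):
    # wrtsc: feed wrmsr with EAX = low 32 bits, EDX = high 32 bits.
    return tsc & 0xffffffff, (tsc >> 32) & 0xffffffff

t = combine(0xdeadbeef, 0x1234)
```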
[PATCH kvm-unit-tests v2 6/8] Introduce atol()
Signed-off-by: Jason Wang --- lib/libcflat.h |1 + lib/string.c | 31 +++ 2 files changed, 32 insertions(+), 0 deletions(-) diff --git a/lib/libcflat.h b/lib/libcflat.h index 9cb90cf..d4ee2e0 100644 --- a/lib/libcflat.h +++ b/lib/libcflat.h @@ -50,6 +50,7 @@ extern void puts(const char *s); extern void *memset(void *s, int c, size_t n); +extern long atol(const char *ptr); #define ARRAY_SIZE(_a) (sizeof(_a)/sizeof((_a)[0])) #endif diff --git a/lib/string.c b/lib/string.c index acac3c0..1f19f5c 100644 --- a/lib/string.c +++ b/lib/string.c @@ -30,3 +30,34 @@ void *memset(void *s, int c, size_t n) return s; } + +long atol(const char *ptr) +{ +long acc = 0; +const char *s = ptr; +int neg, c; + +while (*s == ' ' || *s == '\t') +s++; +if (*s == '-'){ +neg = 1; +s++; +} else { +neg = 0; +if (*s == '+') +s++; +} + +while (*s) { +if (*s < '0' || *s > '9') +break; +c = *s - '0'; +acc = acc * 10 + c; +s++; +} + +if (neg) +acc = -acc; + +return acc; +} -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
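The new `atol()` follows the classic C behaviour: skip spaces and tabs, take an optional sign, accumulate digits, and stop silently at the first non-digit. A behavioural model in Python (editor's illustration) that mirrors the C loop:

```python
def atol(s):
    i, neg = 0, False
    # Skip leading whitespace, as the C version does.
    while i < len(s) and s[i] in ' \t':
        i += 1
    # Optional sign.
    if i < len(s) and s[i] in '+-':
        neg = s[i] == '-'
        i += 1
    # Accumulate digits; stop at the first non-digit.
    acc = 0
    while i < len(s) and s[i].isdigit():
        acc = acc * 10 + int(s[i])
        i += 1
    return -acc if neg else acc
```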
[PATCH kvm-unit-tests v2 4/8] Introduce atomic operations
Signed-off-by: Jason Wang --- config-x86-common.mak |1 lib/x86/atomic.c | 37 +++ lib/x86/atomic.h | 164 + 3 files changed, 202 insertions(+), 0 deletions(-) create mode 100644 lib/x86/atomic.c create mode 100644 lib/x86/atomic.h diff --git a/config-x86-common.mak b/config-x86-common.mak index fdb9e3e..b8ca859 100644 --- a/config-x86-common.mak +++ b/config-x86-common.mak @@ -10,6 +10,7 @@ cflatobjs += \ cflatobjs += lib/x86/fwcfg.o cflatobjs += lib/x86/apic.o +cflatobjs += lib/x86/atomic.o $(libcflat): LDFLAGS += -nostdlib $(libcflat): CFLAGS += -ffreestanding -I lib diff --git a/lib/x86/atomic.c b/lib/x86/atomic.c new file mode 100644 index 000..da74ff2 --- /dev/null +++ b/lib/x86/atomic.c @@ -0,0 +1,37 @@ +#include +#include "atomic.h" + +#ifdef __i386__ + +u64 atomic64_cmpxchg(atomic64_t *v, u64 old, u64 new) +{ +u32 low = new; +u32 high = new >> 32; + +asm volatile("lock cmpxchg8b %1\n" + : "+A" (old), + "+m" (*(volatile long long *)&v->counter) + : "b" (low), "c" (high) + : "memory" + ); + +return old; +} + +#else + +u64 atomic64_cmpxchg(atomic64_t *v, u64 old, u64 new) +{ +u64 ret; +u64 _old = old; +u64 _new = new; + +asm volatile("lock cmpxchgq %2,%1" + : "=a" (ret), "+m" (*(volatile long *)&v->counter) + : "r" (_new), "0" (_old) + : "memory" + ); +return ret; +} + +#endif diff --git a/lib/x86/atomic.h b/lib/x86/atomic.h new file mode 100644 index 000..de2f033 --- /dev/null +++ b/lib/x86/atomic.h @@ -0,0 +1,164 @@ +#ifndef __ATOMIC_H +#define __ATOMIC_H + +typedef struct { + volatile int counter; +} atomic_t; + +#ifdef __i386__ + +/** + * atomic_read - read atomic variable + * @v: pointer of type atomic_t + * + * Atomically reads the value of @v. + */ +static inline int atomic_read(const atomic_t *v) +{ + return v->counter; +} + +/** + * atomic_set - set atomic variable + * @v: pointer of type atomic_t + * @i: required value + * + * Atomically sets the value of @v to @i. 
+ */ +static inline void atomic_set(atomic_t *v, int i) +{ + v->counter = i; +} + +/** + * atomic_inc - increment atomic variable + * @v: pointer of type atomic_t + * + * Atomically increments @v by 1. + */ +static inline void atomic_inc(atomic_t *v) +{ + asm volatile("lock incl %0" +: "+m" (v->counter)); +} + +/** + * atomic_dec - decrement atomic variable + * @v: pointer of type atomic_t + * + * Atomically decrements @v by 1. + */ +static inline void atomic_dec(atomic_t *v) +{ + asm volatile("lock decl %0" +: "+m" (v->counter)); +} + +typedef struct { + u64 __attribute__((aligned(8))) counter; +} atomic64_t; + +#define ATOMIC64_INIT(val) { (val) } + +/** + * atomic64_read - read atomic64 variable + * @ptr: pointer to type atomic64_t + * + * Atomically reads the value of @ptr and returns it. + */ +static inline u64 atomic64_read(atomic64_t *ptr) +{ + u64 res; + + /* +* Note, we inline this atomic64_t primitive because +* it only clobbers EAX/EDX and leaves the others +* untouched. We also (somewhat subtly) rely on the +* fact that cmpxchg8b returns the current 64-bit value +* of the memory location we are touching: +*/ + asm volatile("mov %%ebx, %%eax\n\t" + "mov %%ecx, %%edx\n\t" + "lock cmpxchg8b %1\n" + : "=&A" (res) + : "m" (*ptr) + ); + return res; +} + +u64 atomic64_cmpxchg(atomic64_t *v, u64 old, u64 new); + +#elif defined(__x86_64__) + +/** + * atomic_read - read atomic variable + * @v: pointer of type atomic_t + * + * Atomically reads the value of @v. + */ +static inline int atomic_read(const atomic_t *v) +{ + return v->counter; +} + +/** + * atomic_set - set atomic variable + * @v: pointer of type atomic_t + * @i: required value + * + * Atomically sets the value of @v to @i. + */ +static inline void atomic_set(atomic_t *v, int i) +{ + v->counter = i; +} + +/** + * atomic_inc - increment atomic variable + * @v: pointer of type atomic_t + * + * Atomically increments @v by 1. 
+ */ +static inline void atomic_inc(atomic_t *v) +{ + asm volatile("lock incl %0" +: "=m" (v->counter) +: "m" (v->counter)); +} + +/** + * atomic_dec - decrement atomic variable + * @v: pointer of type atomic_t + * + * Atomically decrements @v by 1. + */ +static inline void atomic_dec(atomic_t *v) +{ + asm volatile("lock decl %0" +: "=m" (v->counter) +: "m" (v->counter)); +} + +typedef struct { + long long counter; +} atomic64_t; + +#define ATOMIC
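`atomic64_cmpxchg()` has the standard compare-and-swap contract: the new value is stored only if the current value equals the expected one, and the value that was in memory beforehand is always returned, so callers can detect a lost race. An editor's single-threaded model of just those semantics (the hardware `lock cmpxchg` makes it atomic; Python obviously does not):

```python
class Atomic64(object):
    def __init__(self, v=0):
        self.counter = v
    def cmpxchg(self, old, new):
        # Store new only if the current value matches old; always
        # return the prior contents, as lock cmpxchg does in EAX/RAX.
        cur = self.counter
        if cur == old:
            self.counter = new
        return cur

v = Atomic64(5)
r1 = v.cmpxchg(5, 9)   # succeeds: returns 5, counter becomes 9
r2 = v.cmpxchg(5, 7)   # fails: returns 9, counter unchanged
```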
[PATCH kvm-unit-tests v2 3/8] Introduce memory barriers.
Signed-off-by: Jason Wang --- lib/x86/smp.h |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/lib/x86/smp.h b/lib/x86/smp.h index c2e7350..df5fdba 100644 --- a/lib/x86/smp.h +++ b/lib/x86/smp.h @@ -1,6 +1,10 @@ #ifndef __SMP_H #define __SMP_H +#define mb() asm volatile("mfence":::"memory") +#define rmb() asm volatile("lfence":::"memory") +#define wmb() asm volatile("sfence" ::: "memory") + struct spinlock { int v; };
[PATCH kvm-unit-tests v2 2/8] Increase max_cpu to 64
Signed-off-by: Jason Wang --- x86/cstart.S |2 +- x86/cstart64.S |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/x86/cstart.S b/x86/cstart.S index ae1fdbb..bc8d563 100644 --- a/x86/cstart.S +++ b/x86/cstart.S @@ -6,7 +6,7 @@ boot_idt = 0 ipi_vector = 0x20 -max_cpus = 4 +max_cpus = 64 .bss diff --git a/x86/cstart64.S b/x86/cstart64.S index ef09d82..71014d8 100644 --- a/x86/cstart64.S +++ b/x86/cstart64.S @@ -6,7 +6,7 @@ boot_idt = 0 ipi_vector = 0x20 -max_cpus = 4 +max_cpus = 64 .bss
[PATCH kvm-unit-tests v2 1/8] Introduce some type definitions.
Signed-off-by: Jason Wang --- lib/libcflat.h |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/lib/libcflat.h b/lib/libcflat.h index d0d3df2..9cb90cf 100644 --- a/lib/libcflat.h +++ b/lib/libcflat.h @@ -23,10 +23,14 @@ #include typedef unsigned char u8; +typedef signed char s8; typedef unsigned short u16; +typedef signed short s16; typedef unsigned u32; +typedef signed s32; typedef unsigned long ulong; typedef unsigned long long u64; +typedef signed long long s64; typedef unsigned long size_t; typedef _Bool bool;
[PATCH kvm-unit-tests v2 0/8] Tests for kvmclock
The following series implements three tests for kvmclock: Wall clock test: check whether kvmclock returns a correct TOD, by passing the correct one to the test and comparing it with the value returned by kvmclock. Monotonic cycle test: check whether the cycles of kvmclock grow monotonically in SMP guests. Performance test: measure the performance (in guest TSC cycles) of the kvmclock driver, both with and without the guest-side adjustment. Changes since V1: - Change the max cpus to 64 in order to test on larger systems. - Make the per-cpu structure hv_clock 4-byte aligned. - Fix the bug in the clobber list in atomic64_cmpxchg(). - Add a pure performance test based on guest TSC cycles. - Record the count of stalls when doing the monotonic cycle test. - Introduce some new type definitions. - Fix the bug in the wallclock calculation and test. Here is a test result I got from a guest with 64 vcpus on a 64-core Intel X7550: The test command is qemu-system-x86_64 -device testdev,chardev=testlog -chardev file,id=testlog,path=msr.out -kernel ./x86/kvmclock_test.flat -smp 64 --append 10 For 64-bit guests: the guest cycles spent on getting raw cycles are about 21543417; the guest cycles spent on getting adjusted cycles are 1067264988, which means it takes about 50x more time to correct the cycles than to just return the cycles supplied by the hypervisor. For 32-bit guests: the guest cycles spent on getting raw cycles are about 26916562; the guest cycles spent on getting adjusted cycles are 2020119174, which means it takes about 75x more time to correct the cycles than to just return the cycles supplied by the hypervisor. And for 4-core guests we only take about 10x-15x more time on correcting. Note: 1. The performance test just reads the cycle and does not check anything else. 2. I've run the test several times; the results are similar. 3. The cycles were measured from the guest, so they may not be accurate. Here is a sample test result of a 64-bit guest with 64 vcpus: ..
Check the stability of raw cycle ... Worst warp -10 Worst warp -6537 Worst warp -11170 Worst warp -11370 Worst warp -11372 Worst warp -31754 Worst warp -31876 Worst warp -31904 Worst warp -36771 Worst warp -37382 Worst warp -37688 Worst warp -37707 Total vcpus: 64 Test loops: 10 Total warps: 53811 Total stalls: 408 Worst warp: -37707 Raw cycle is not stable Monotonic cycle test: Total vcpus: 64 Test loops: 10 Total warps: 0 Total stalls: 214 Worst warp: 0 Measure the performance of raw cycle ... Total vcpus: 64 Test loops: 10 TSC cycles: 21543417 Measure the performance of adjusted cycle ... Total vcpus: 64 Test loops: 10 TSC cycles: 1067264988 --- Jason Wang (8): Introduce some type definitions. Increase max_cpu to 64 Introduce memory barriers. Introduce atomic operations Export tsc related helpers Introduce atol() Add a simple kvmclock driver Add tests for kvm-clock config-x86-common.mak |7 + lib/libcflat.h|5 + lib/string.c | 31 + lib/x86/atomic.c | 37 ++ lib/x86/atomic.h | 164 + lib/x86/processor.h | 22 lib/x86/smp.h |4 + x86/README|1 x86/cstart.S |2 x86/cstart64.S|2 x86/kvmclock.c| 279 + x86/kvmclock.h| 60 +++ x86/kvmclock_test.c | 166 + x86/tsc.c | 16 --- x86/unittests.cfg |5 + x86/vmexit.c | 15 --- 16 files changed, 783 insertions(+), 33 deletions(-) create mode 100644 lib/x86/atomic.c create mode 100644 lib/x86/atomic.h create mode 100644 x86/kvmclock.c create mode 100644 x86/kvmclock.h create mode 100644 x86/kvmclock_test.c -- Jason Wang
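The cover letter's "about 50x" and "about 75x" slowdown figures follow directly from the reported TSC cycle counts; a quick check of the arithmetic:

```python
# Reported guest TSC cycles from the cover letter.
raw64, adj64 = 21543417, 1067264988    # 64-bit guest
raw32, adj32 = 26916562, 2020119174    # 32-bit guest

ratio64 = adj64 / raw64   # ~49.5, i.e. "about 50x"
ratio32 = adj32 / raw32   # ~75.1, i.e. "about 75x"
```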
Re: [RFC v2 4/7] change kernel accounting to include steal time
On Mon, 2010-08-30 at 19:03 -0400, Rik van Riel wrote: > > > I think it basically comes down to adding "sched_clock_unstolen()" which > > the scheduler can use to measure time a process spends running, and > > sched_clock() for measuring sleep times. In the normal case, > > sched_clock_unstolen() would be the same as sched_clock(). > > That requires the host to export (any time the guest is scheduled > in), the amount of CPU time the VCPU thread has used, and the time > the VCPU was scheduled in. > > Since the VCPU must be running when it is examining these variables, > it can calculate the additional time (since it was last scheduled) > to account to the task, and remember the currently calculated time > in its own per-vcpu variable, so next time it can get a delta again. I think its easier (and sufficient) for the host to tell the guest how long it was _not_ running. That can simply be passed in when you start the vcpu again and doesn't need a fancy communication channel. The guests sched_clock() will measure wall time, the guests sched_clock_stolen() will report the accumulation of these stolen times. Then you can make sched_clock_unstolen() be sched_clock() - sched_clock_stolen(). And like Jeremy said, if you make the sched_fair stuff use sched_clock_unstolen() things should more or less work. The problem with all that is that you'll start to schedule on unstolen time instead of wall-time, which might not give the best results for things like latencies etc.. but I guess that's one of the prices you pay for using virt. Also, like said yesterday, you need some factor in update_cpu_power(), a quick hack might be to add all stolen time to sched_rt_avg_update(). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
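The identity proposed above — sched_clock_unstolen() = sched_clock() - sched_clock_stolen(), with the host reporting how long the vcpu was not running each time it is scheduled back in — can be modeled in a few lines (editor's illustration in Python, not kernel code; the class and method names are hypothetical):

```python
class VcpuClock(object):
    def __init__(self):
        self.wall = 0      # wall time as the guest sees it
        self.stolen = 0    # accumulated time the host ran someone else
    def run(self, ns):
        # The vcpu actually executes: only wall time advances.
        self.wall += ns
    def steal(self, ns):
        # Wall time passes while the vcpu is descheduled; the host
        # reports this interval as stolen when the vcpu resumes.
        self.wall += ns
        self.stolen += ns
    def sched_clock(self):
        return self.wall
    def sched_clock_stolen(self):
        return self.stolen
    def sched_clock_unstolen(self):
        return self.sched_clock() - self.sched_clock_stolen()

c = VcpuClock()
c.run(700)
c.steal(300)
```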