Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On Tue, Aug 31, 2010 at 11:32 PM, Alex Williamson alex.william...@redhat.com wrote: On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote: On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote: The bottom half handler shows big improvements over the timer with few downsides, default to it when the iothread is enabled. Using the following tests, with the guest and host connected via tap+bridge: guest netperf -t TCP_STREAM -H $HOST host netperf -t TCP_STREAM -H $GUEST guest netperf -t UDP_STREAM -H $HOST host netperf -t UDP_STREAM -H $GUEST guest netperf -t TCP_RR -H $HOST Results: base throughput, exits/throughput - patched throughput, exits/throughput --enable-io-thread TCP guest-host 2737.77, 47.82 - 6767.09, 29.15 = 247%, 61% TCP host-guest 2231.33, 74.00 - 4125.80, 67.61 = 185%, 91% UDP guest-host 6281.68, 14.66 - 12569.27, 1.98 = 200%, 14% UDP host-guest 275.91, 289.22 - 264.80, 293.53 = 96%, 101% interations/s 1949.65, 82.97 - 7417.56, 84.31 = 380%, 102% No --enable-io-thread TCP guest-host 3041.57, 55.11 - 1038.93, 517.57 = 34%, 939% TCP host-guest 2416.03, 76.67 - 5655.92, 55.52 = 234%, 72% UDP guest-host 12255.82, 6.11 - 7775.87, 31.32 = 63%, 513% UDP host-guest 587.92, 245.95 - 611.88, 239.92 = 104%, 98% interations/s 1975.59, 83.21 - 8935.50, 88.18 = 452%, 106% Signed-off-by: Alex Williamson alex.william...@redhat.com parameter having different settings based on config options might surprise some users. I don't think we really need a parameter here ... I'm not a bit fan of this either, but I'd also prefer not to introduce a regression for a performance difference we know about in advance. It gets even more complicated when we factor in qemu-kvm, as it doesn't build with iothread enabled, but seems to get and even better boost in performance across the board thanks largely to the kvm-irqchip. Should we instead make this a configure option? --enable-virtio-net-txbh? Thanks, Alex qemu-kvm uses its own iothread implementation by default. It doesn't need --enable-io-thread because it already uses a similar model. Stefan --- hw/s390-virtio-bus.c | 3 ++- hw/syborg_virtio.c | 3 ++- hw/virtio-pci.c | 3 ++- hw/virtio.h | 6 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c index 1483362..985f99a 100644 --- a/hw/s390-virtio-bus.c +++ b/hw/s390-virtio-bus.c @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = { .qdev.size = sizeof(VirtIOS390Device), .qdev.props = (Property[]) { DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic), - DEFINE_PROP_UINT32(txtimer, VirtIOS390Device, txtimer, 1), + DEFINE_PROP_UINT32(txtimer, VirtIOS390Device, txtimer, + TXTIMER_DEFAULT), DEFINE_PROP_INT32(txburst, VirtIOS390Device, txburst, 256), DEFINE_PROP_END_OF_LIST(), }, diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c index 7b76972..ee5746d 100644 --- a/hw/syborg_virtio.c +++ b/hw/syborg_virtio.c @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = { .qdev.props = (Property[]) { DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic), DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features), - DEFINE_PROP_UINT32(txtimer, SyborgVirtIOProxy, txtimer, 1), + DEFINE_PROP_UINT32(txtimer, SyborgVirtIOProxy, txtimer, + TXTIMER_DEFAULT), DEFINE_PROP_INT32(txburst, SyborgVirtIOProxy, txburst, 256), DEFINE_PROP_END_OF_LIST(), } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index e025c09..9740f57 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = { DEFINE_PROP_UINT32(vectors, VirtIOPCIProxy, nvectors, 3), DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features), DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic), - DEFINE_PROP_UINT32(txtimer, VirtIOPCIProxy, txtimer, 1), + DEFINE_PROP_UINT32(txtimer, VirtIOPCIProxy, txtimer, + TXTIMER_DEFAULT), DEFINE_PROP_INT32(txburst, VirtIOPCIProxy, txburst, 256), DEFINE_PROP_END_OF_LIST(), }, diff --git a/hw/virtio.h b/hw/virtio.h index 4051889..a1a17a2 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev); void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding, void *opaque); +#ifdef CONFIG_IOTHREAD + #define TXTIMER_DEFAULT 0 +#else + #define TXTIMER_DEFAULT 1 +#endif + Add a comment explaning that this is just a performance optimization? /* Base devices. */ VirtIODevice
[PATCH 0/2] Improve kvm_stat delta display
Make the first sleep shorter, so numbers come up quicker, and scale the delta so it is a rate-of-change instead of a meaningless number. Avi Kivity (2): kvm_stat: make the initial sleep shorter kvm_stat: scale delta column to make it a rate kvm/kvm_stat | 10 ++ 1 files changed, 6 insertions(+), 4 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] kvm_stat: make the initial sleep shorter
So we can see results faster Signed-off-by: Avi Kivity a...@redhat.com --- kvm/kvm_stat |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index e68ca4e..b031d48 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -325,9 +325,11 @@ def tui(screen, stats): row += 1 screen.refresh() +sleeptime = 0.25 while True: refresh() -curses.halfdelay(30) +curses.halfdelay(int(sleeptime * 10)) +sleeptime = 3 try: c = screen.getkey() if c == 'q': -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] kvm_stat: scale delta column to make it a rate
Scale the delta column by 1/sleeptime, so its units are events per second. Signed-off-by: Avi Kivity a...@redhat.com --- kvm/kvm_stat |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index b031d48..e32df14 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -301,7 +301,7 @@ number_width = 10 def tui(screen, stats): curses.use_default_colors() curses.noecho() -def refresh(): +def refresh(sleeptime): screen.erase() screen.addstr(0, 0, 'kvm statistics') row = 2 @@ -321,13 +321,13 @@ def tui(screen, stats): screen.addstr(row, col, '%10d' % (values[0],)) col += number_width if values[1] is not None: -screen.addstr(row, col, '%8d' % (values[1],)) +screen.addstr(row, col, '%8d' % (values[1] / sleeptime,)) row += 1 screen.refresh() sleeptime = 0.25 while True: -refresh() +refresh(sleeptime) curses.halfdelay(int(sleeptime * 10)) sleeptime = 3 try: -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] SVM: fix cpu onlining
On 09/01/2010 01:13 AM, Marcelo Tosatti wrote: applied, thanks. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm_stat: use total count as a secondary sort key
This shows events with history in preference to events with none. Signed-off-by: Avi Kivity a...@redhat.com --- kvm/kvm_stat |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index e32df14..4a16277 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -308,9 +308,9 @@ def tui(screen, stats): s = stats.get() def sortkey(x): if s[x][1]: -return -s[x][1] +return (-s[x][1], -s[x][0]) else: -return x +return (0, -s[x][0]) for key in sorted(s.keys(), key = sortkey): if row = screen.getmaxyx()[0]: break -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm_stat: ignore events that have never occured
Less cluttered display. Signed-off-by: Avi Kivity a...@redhat.com --- kvm/kvm_stat |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/kvm/kvm_stat b/kvm/kvm_stat index 4a16277..d373c60 100755 --- a/kvm/kvm_stat +++ b/kvm/kvm_stat @@ -315,6 +315,8 @@ def tui(screen, stats): if row = screen.getmaxyx()[0]: break values = s[key] +if not values[0] and not values[1]: +break col = 1 screen.addstr(row, col, key) col += label_width -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] *** SUBJECT HERE ***
On 08/31/10 18:13, Anthony Liguori wrote: Just as an aside, does anyone have a good way to maintain the 00 patches in series with repeated submissions? /me uses cut+paste from email folder or list archive. I suspect you are looking for something better though ... cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: Fix bogus value returned for MSR_IA32_PERF_STATUS
The returned value is completely bogus, and sets reserved bits. Return zero instead. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |5 + 1 files changed, 1 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1cbf168..a2c03f1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1641,10 +1641,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) data = vcpu-arch.ia32_misc_enable_msr; break; case MSR_IA32_PERF_STATUS: - /* TSC increment by tick */ - data = 1000ULL; - /* CPU multiplier */ - data |= (((uint64_t)4ULL) 40); + data = 0; break; case MSR_EFER: data = vcpu-arch.efer; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: Don't save/restore MSR_IA32_PERF_STATUS
It is read/only; restoring it only results in annoying messages. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cc6d6cd..1cbf168 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -739,7 +739,7 @@ static u32 msrs_to_save[] = { #ifdef CONFIG_X86_64 MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR, #endif - MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA + MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA }; static unsigned num_msrs_to_save; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Fix MSR_IA32_PERF_STATUS
MSR_IA32_PERF_STATUS is read-only, yet exposed for save/restore. This generates annoying printk()s when userspace attempts to initialize it. The value returned is also bogus, the comments appear to refer to another msr. Fix both issues. Alex, please test with a guest Mac. Avi Kivity (2): KVM: Don't save/restore MSR_IA32_PERF_STATUS KVM: Fix bogus value returned for MSR_IA32_PERF_STATUS arch/x86/kvm/x86.c |7 ++- 1 files changed, 2 insertions(+), 5 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/2010 11:02 PM, Marcelo Tosatti wrote: kvm_reset_msrs in qemu-kvm-x86.c. I posted a patch to fix this. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PULL 00/35] KVM: PPC: End-August patch queue
On 08/31/2010 05:31 AM, Alexander Graf wrote: Howdy, This is my local patch queue with stuff that has accumulated over the last weeks on KVM for PPC with some last minute fixes, speedups and debugging help that I needed for the KVM Forum ;-). The highlights of this set are: - Converted most important debug points to tracepoints - Flush less PTEs (speedup) - Go back to our own hash (less duplicates) - Make SRs guest settable (speedup for 32 bit guests) - Remove r30/r31 restrictions from PV hooks (speedup!) - Fix random breakages - Fix random guest stalls - 440GP host support (Thanks Hollis!) - Reliable interrupt injection Keep in mind that this is the first version that is stable on PPC32 hosts. All versions prior to this could occupy otherwise used segment entries and thus crash your machine :-). It is also the first version that is stable with PPC64 guests, because they require more sophisticated interrupt injection logic for which qemu patches are also required. Please pull this tree from: git://github.com/agraf/linux-2.6.git kvm-ppc-next Have fun with more accurate, faster and less buggy KVM on PowerPC! Pulled, thanks. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!
On 08/31/10 22:02, Marcelo Tosatti wrote: On Tue, Aug 31, 2010 at 06:33:34PM +0200, Jes Sorensen wrote: which makes no sense to me, but given it's x86, I am not sure if it could have come from the BIOS or something during reboot? kvm_reset_msrs in qemu-kvm-x86.c. Interesting, I hadn't thought KVM would be able to cause itself to trap writes like this. Avi's patch looks good though :) Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ftrace/perf_event leak
I recently added perf_event support to kvm_stat, to display kvm tracepoints as statistics (I'd like to fold this to tools/perf eventually, but that's another story). However I'm seeing a resource leak - after I quit the tool, there are quite a few references into the kvm module: kvm_intel 43655 0 kvm 272984 269 kvm_intel The tool is just a python script that reads /sys/kernel/debug/tracing/events/kvm to find out which events are available, uses perf_event_open() to create one group per cpu to which a lot of events are attached. The only special thing I can think of is that we use an ioctl to attach a filter to many perf_event descriptors. You can find the source at http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/kvm_stat;hb=5bd5f131b50cb373ff4e2a3632c6dad00a1f0b55. All it needs are the kvm modules loaded; no need to actually run a guest. Run as root. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: Fix bogus value returned for MSR_IA32_PERF_STATUS
On 01.09.2010, at 09:34, Avi Kivity wrote: The returned value is completely bogus, and sets reserved bits. Return zero instead. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c |5 + 1 files changed, 1 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1cbf168..a2c03f1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1641,10 +1641,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) data = vcpu-arch.ia32_misc_enable_msr; break; case MSR_IA32_PERF_STATUS: - /* TSC increment by tick */ - data = 1000ULL; - /* CPU multiplier */ - data |= (((uint64_t)4ULL) 40); + data = 0; break; case MSR_EFER: data = vcpu-arch.efer; This is the respective code snippet from xnu: /* * Get the TSC increment. The TSC is incremented by this * on every bus tick. Calculate the TSC conversion factors * to and from nano-seconds. */ if (cpuid_info()-cpuid_family == CPU_FAMILY_PENTIUM_M) { uint64_tprfsts; prfsts = rdmsr64(IA32_PERF_STS); tscGranularity = (uint32_t)bitfield(prfsts, 44, 40); N_by_2_bus_ratio = prfsts bit(46); } else { panic(rtclock_init: unknown CPU family: 0x%X\n, cpuid_info()-cpuid_family); } if (N_by_2_bus_ratio) tscFCvtt2n = busFCvtt2n * 2 / (uint64_t)tscGranularity; else tscFCvtt2n = busFCvtt2n / (uint64_t)tscGranularity; tscFreq = ((1 * Giga) 32) / tscFCvtt2n; tscFCvtn2t = 0xULL / tscFCvtt2n; So by passing in 0 here, you effectively make that code divide something by 0 which results in a panic. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread
On Tue, Aug 31, 2010 at 04:32:54PM -0600, Alex Williamson wrote: On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote: On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote: The bottom half handler shows big improvements over the timer with few downsides, default to it when the iothread is enabled. Using the following tests, with the guest and host connected via tap+bridge: guest netperf -t TCP_STREAM -H $HOST host netperf -t TCP_STREAM -H $GUEST guest netperf -t UDP_STREAM -H $HOST host netperf -t UDP_STREAM -H $GUEST guest netperf -t TCP_RR -H $HOST Results: base throughput, exits/throughput - patched throughput, exits/throughput --enable-io-thread TCP guest-host 2737.77, 47.82 - 6767.09, 29.15 = 247%, 61% TCP host-guest 2231.33, 74.00 - 4125.80, 67.61 = 185%, 91% UDP guest-host 6281.68, 14.66 - 12569.27, 1.98 = 200%, 14% UDP host-guest 275.91, 289.22 - 264.80, 293.53 = 96%, 101% interations/s 1949.65, 82.97 - 7417.56, 84.31 = 380%, 102% No --enable-io-thread TCP guest-host 3041.57, 55.11 - 1038.93, 517.57 = 34%, 939% TCP host-guest 2416.03, 76.67 - 5655.92, 55.52 = 234%, 72% UDP guest-host 12255.82, 6.11 - 7775.87, 31.32 = 63%, 513% UDP host-guest 587.92, 245.95 - 611.88, 239.92 = 104%, 98% interations/s 1975.59, 83.21 - 8935.50, 88.18 = 452%, 106% Signed-off-by: Alex Williamson alex.william...@redhat.com parameter having different settings based on config options might surprise some users. I don't think we really need a parameter here ... I'm not a bit fan of this either, but I'd also prefer not to introduce a regression for a performance difference we know about in advance. I agree. The comment was mainly about a parameter. If we make it a hidden parameter (mainbe with x- prefix), this won't be an issue. It gets even more complicated when we factor in qemu-kvm, as it doesn't build with iothread enabled, but seems to get and even better boost in performance across the board thanks largely to the kvm-irqchip. Should we instead make this a configure option? --enable-virtio-net-txbh? That'll work too. Thanks, Alex --- hw/s390-virtio-bus.c |3 ++- hw/syborg_virtio.c |3 ++- hw/virtio-pci.c |3 ++- hw/virtio.h |6 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c index 1483362..985f99a 100644 --- a/hw/s390-virtio-bus.c +++ b/hw/s390-virtio-bus.c @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = { .qdev.size = sizeof(VirtIOS390Device), .qdev.props = (Property[]) { DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic), -DEFINE_PROP_UINT32(txtimer, VirtIOS390Device, txtimer, 1), +DEFINE_PROP_UINT32(txtimer, VirtIOS390Device, txtimer, + TXTIMER_DEFAULT), DEFINE_PROP_INT32(txburst, VirtIOS390Device, txburst, 256), DEFINE_PROP_END_OF_LIST(), }, diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c index 7b76972..ee5746d 100644 --- a/hw/syborg_virtio.c +++ b/hw/syborg_virtio.c @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = { .qdev.props = (Property[]) { DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic), DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features), -DEFINE_PROP_UINT32(txtimer, SyborgVirtIOProxy, txtimer, 1), +DEFINE_PROP_UINT32(txtimer, SyborgVirtIOProxy, txtimer, + TXTIMER_DEFAULT), DEFINE_PROP_INT32(txburst, SyborgVirtIOProxy, txburst, 256), DEFINE_PROP_END_OF_LIST(), } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index e025c09..9740f57 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = { DEFINE_PROP_UINT32(vectors, VirtIOPCIProxy, nvectors, 3), DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features), DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic), -DEFINE_PROP_UINT32(txtimer, VirtIOPCIProxy, txtimer, 1), +DEFINE_PROP_UINT32(txtimer, VirtIOPCIProxy, txtimer, + TXTIMER_DEFAULT), DEFINE_PROP_INT32(txburst, VirtIOPCIProxy, txburst, 256), DEFINE_PROP_END_OF_LIST(), }, diff --git a/hw/virtio.h b/hw/virtio.h index 4051889..a1a17a2 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev); void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding, void *opaque); +#ifdef CONFIG_IOTHREAD + #define TXTIMER_DEFAULT 0 +#else + #define TXTIMER_DEFAULT 1 +#endif + Add a
Re: [PATCH 2/2] KVM: Fix bogus value returned for MSR_IA32_PERF_STATUS
On 09/01/2010 11:49 AM, Alexander Graf wrote: On 01.09.2010, at 09:34, Avi Kivity wrote: The returned value is completely bogus, and sets reserved bits. Return zero instead. Signed-off-by: Avi Kivitya...@redhat.com --- arch/x86/kvm/x86.c |5 + 1 files changed, 1 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1cbf168..a2c03f1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1641,10 +1641,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) data = vcpu-arch.ia32_misc_enable_msr; break; case MSR_IA32_PERF_STATUS: - /* TSC increment by tick */ - data = 1000ULL; - /* CPU multiplier */ - data |= (((uint64_t)4ULL) 40); + data = 0; break; case MSR_EFER: data = vcpu-arch.efer; This is the respective code snippet from xnu: /* * Get the TSC increment. The TSC is incremented by this * on every bus tick. Calculate the TSC conversion factors * to and from nano-seconds. */ if (cpuid_info()-cpuid_family == CPU_FAMILY_PENTIUM_M) { uint64_tprfsts; prfsts = rdmsr64(IA32_PERF_STS); tscGranularity = (uint32_t)bitfield(prfsts, 44, 40); N_by_2_bus_ratio = prfsts bit(46); } else { panic(rtclock_init: unknown CPU family: 0x%X\n, cpuid_info()-cpuid_family); } if (N_by_2_bus_ratio) tscFCvtt2n = busFCvtt2n * 2 / (uint64_t)tscGranularity; else tscFCvtt2n = busFCvtt2n / (uint64_t)tscGranularity; tscFreq = ((1 * Giga) 32) / tscFCvtt2n; tscFCvtn2t = 0xULL / tscFCvtt2n; So by passing in 0 here, you effectively make that code divide something by 0 which results in a panic. Right. Searching again I find that this is actually documented, (under MSR_PERF_STATU s instead of IA32_PERF_STATUS). I'll just drop this patch. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On Wed, 2010-09-01 at 11:42 +0300, Avi Kivity wrote: I recently added perf_event support to kvm_stat, to display kvm tracepoints as statistics (I'd like to fold this to tools/perf eventually, but that's another story). However I'm seeing a resource leak - after I quit the tool, there are quite a few references into the kvm module: kvm_intel 43655 0 kvm 272984 269 kvm_intel The tool is just a python script that reads /sys/kernel/debug/tracing/events/kvm to find out which events are available, uses perf_event_open() to create one group per cpu to which a lot of events are attached. The only special thing I can think of is that we use an ioctl to attach a filter to many perf_event descriptors. You can find the source at http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=blob_plain;f=kvm/kvm_stat;hb=5bd5f131b50cb373ff4e2a3632c6dad00a1f0b55. All it needs are the kvm modules loaded; no need to actually run a guest. Run as root. Does something like the below cure that? I seem to remember C doesn't make any promises about the order of logic statements, hence we need to explicitly pull out that try_module_get() so that it evaluates after the rest of the conditions. --- kernel/trace/trace_event_perf.c | 10 ++ 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 000e6e8..35051f2 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -88,10 +88,11 @@ int perf_trace_init(struct perf_event *p_event) mutex_lock(event_mutex); list_for_each_entry(tp_event, ftrace_events, list) { if (tp_event-event.type == event_id - tp_event-class tp_event-class-reg - try_module_get(tp_event-mod)) { - ret = perf_trace_event_init(tp_event, p_event); - break; + tp_event-class tp_event-class-reg) { + if (try_module_get(tp_event-mod)) { + ret = perf_trace_event_init(tp_event, p_event); + break; + } } } mutex_unlock(event_mutex); @@ -138,6 +139,7 @@ void perf_trace_destroy(struct perf_event *p_event) free_percpu(tp_event-perf_events); tp_event-perf_events = NULL; + module_put(tp_event-mod); if (!--total_ref_count) { for (i = 0; i 4; i++) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] KVM: MMU: lower the aduit frequency
On 08/31/2010 05:27 AM, Xiao Guangrong wrote: I've always seen progress from the guest while running with audit enabled (its slow, but its not supposed to be fast anyway). Did you experience a freeze? There is a simply test in the guest if it's not rate limit: # time ls anaconda-ks.cfg Documents install.log Music Public Videos Desktop Downloads install.log.syslog Pictures Templates real1m26.053s user0m0.311s sys 0m1.813s 'ls' command cost about 1.5 minute, if we run the memory test program, i think the time/delay is unacceptable.. :-( Marcelo, would making the ratelimit optional help? personally I think without ratelimit audit is useless, and with ratelimit it is a lot less useful but can still point out problems. But I haven't used it in a while. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On 09/01/2010 12:04 PM, Peter Zijlstra wrote: Does something like the below cure that? Unfortunately not. I seem to remember C doesn't make any promises about the order of logic statements, hence we need to explicitly pull out that try_module_get() so that it evaluates after the rest of the conditions. and || are executed in order (so you can write if (p p-x) ). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
Avi Kivity wrote: On 09/01/2010 12:04 PM, Peter Zijlstra wrote: Does something like the below cure that? Unfortunately not. Then try this: The bug should be caused by commit 1c024eca51fdc965290acf342ae16a476c2189d0. --- diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 8a2b73f..9d1d1f2 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -161,6 +161,7 @@ void perf_trace_destroy(struct perf_event *p_event) } } out: + module_put(tp_event-mod); mutex_unlock(event_mutex); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On Wed, 2010-09-01 at 12:26 +0300, Avi Kivity wrote: On 09/01/2010 12:04 PM, Peter Zijlstra wrote: Does something like the below cure that? Unfortunately not. Hrmm,.. bugger, don't use modules is my motto.. still I'll have another look. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On Wed, 2010-09-01 at 17:38 +0800, Li Zefan wrote: --- diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 8a2b73f..9d1d1f2 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -161,6 +161,7 @@ void perf_trace_destroy(struct perf_event *p_event) } } out: + module_put(tp_event-mod); mutex_unlock(event_mutex); } Ah, indeed, we so that try_module_get() for each reference to the tracepoint, I guess we also should do the below, in case the registration fails. diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 000e6e8..31cc4cb 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p_event) tp_event-class tp_event-class-reg try_module_get(tp_event-mod)) { ret = perf_trace_event_init(tp_event, p_event); + if (ret) + module_put(tp_event-mod); break; } } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/1] Fix guest kernel crash when hitting AMD errata
From: Jes Sorensen jes.soren...@redhat.com This fix gets around the case where a user specifies a CPU model triggering an old Opteron errata. The patch returns the clock timing the errata expects to see and ignores writes to make sure a guest doesn't get an exception from reading or writing an MSR that is not supported, since the kernel doesn't use safe MSR access for this errata workaround. Jes Sorensen (1): Fix guest kernel crash when hitting AMD errata arch/x86/kvm/x86.c | 22 ++ 1 files changed, 22 insertions(+), 0 deletions(-) -- 1.7.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] Fix guest kernel crash when hitting AMD errata
From: Jes Sorensen jes.soren...@redhat.com MSR_K7_CLK_CTL is a no longer documented MSR, which is only relevant on said old AMD CPU models. This change returns the expected value, which the Linux kernel is expecting to avoid writing back the MSR, plus it ignores all writes to the MSR. Signed-off-by: Jes Sorensen jes.soren...@redhat.com --- arch/x86/kvm/x86.c | 22 ++ 1 files changed, 22 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4014d6c..1810a80 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1439,6 +1439,16 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) pr_unimpl(vcpu, unimplemented perfctr wrmsr: 0x%x data 0x%llx\n, msr, data); break; + case MSR_K7_CLK_CTL: + /* +* Ignore all writes to this no longer documented MSR. +* Writes are only relevant for old K7 processors, +* all pre-dating SVM, but a recommended workaround from +* AMD for these chips. It is possible to speicify the +* affected processor models on the command line, hence +* the need to ignore the workaround. +*/ + break; case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15: if (kvm_hv_msr_partition_wide(msr)) { int r; @@ -1664,6 +1674,18 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) case MSR_IA32_MCG_STATUS: case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1: return get_msr_mce(vcpu, msr, pdata); + case MSR_K7_CLK_CTL: + /* +* Provide expected ramp-up count for K7. All other +* are set to zero, indicating minimum divisors for +* every field. +* +* This prevents guest kernels on AMD host with CPU +* type 6, model 8 and higher from exploding due to +* the rdmsr failing. +*/ + data = 0x2000; + break; case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15: if (kvm_hv_msr_partition_wide(msr)) { int r; -- 1.7.2.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] virtio-net: Introduce a new bottom half packet TX
On Tue, Aug 31, 2010 at 02:33:46PM -0600, Alex Williamson wrote: On Tue, 2010-08-31 at 23:14 +0300, Michael S. Tsirkin wrote: On Fri, Aug 27, 2010 at 04:37:36PM -0600, Alex Williamson wrote: Based on a patch from Mark McLoughlin, this patch introduces a new bottom half packet transmitter that avoids the latency imposed by the tx_timer approach. Rather than scheduling a timer when a TX packet comes in, schedule a bottom half to be run from the iothread. The bottom half handler first attempts to flush the queue with notification disabled (this is where we could race with a guest without txburst). If we flush a full burst, reschedule immediately. If we send short of a full burst, try to re-enable notification. To avoid a race with TXs that may have occurred, we must then flush again. If we find some packets to send, the guest it probably active, so we can reschedule again. tx_timer and tx_bh are mutually exclusive, so we can re-use the tx_waiting flag to indicate one or the other needs to be setup. This allows us to seamlessly migrate between timer and bh TX handling. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/virtio-net.c | 81 ++- 1 files changed, 68 insertions(+), 13 deletions(-) diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 8b652f2..3288c77 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -36,6 +36,7 @@ typedef struct VirtIONet VirtQueue *ctrl_vq; NICState *nic; QEMUTimer *tx_timer; +QEMUBH *tx_bh; uint32_t tx_timeout; int32_t tx_burst; int tx_waiting; @@ -704,16 +705,25 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq) { VirtIONet *n = to_virtio_net(vdev); -if (n-tx_waiting) { -virtio_queue_set_notification(vq, 1); -qemu_del_timer(n-tx_timer); -n-tx_waiting = 0; -virtio_net_flush_tx(n, vq); +if (n-tx_timer) { +if (n-tx_waiting) { +virtio_queue_set_notification(vq, 1); +qemu_del_timer(n-tx_timer); +n-tx_waiting = 0; +virtio_net_flush_tx(n, vq); +} else { +qemu_mod_timer(n-tx_timer, + qemu_get_clock(vm_clock) + n-tx_timeout); +n-tx_waiting = 1; +virtio_queue_set_notification(vq, 0); +} } else { -qemu_mod_timer(n-tx_timer, - qemu_get_clock(vm_clock) + n-tx_timeout); +if (unlikely(n-tx_waiting)) { +return; +} +virtio_queue_set_notification(n-tx_vq, 0); +qemu_bh_schedule(n-tx_bh); n-tx_waiting = 1; -virtio_queue_set_notification(vq, 0); } } @@ -731,6 +741,41 @@ static void virtio_net_tx_timer(void *opaque) virtio_net_flush_tx(n, n-tx_vq); } +static void virtio_net_tx_bh(void *opaque) +{ +VirtIONet *n = opaque; +int32_t ret; + +n-tx_waiting = 0; + +/* Just in case the driver is not ready on more */ +if (unlikely(!(n-vdev.status VIRTIO_CONFIG_S_DRIVER_OK))) +return; + +ret = virtio_net_flush_tx(n, n-tx_vq); +if (ret == -EBUSY) { +return; /* Notification re-enable handled by tx_complete */ +} + +/* If we flush a full burst of packets, assume there are + * more coming and immediately reschedule */ +if (ret = n-tx_burst) { +qemu_bh_schedule(n-tx_bh); +n-tx_waiting = 1; +return; +} + +/* If less than a full burst, re-enable notification and flush + * anything that may have come in while we weren't looking. If + * we find something, assume the guest is still active and reschedule */ +virtio_queue_set_notification(n-tx_vq, 1); +if (virtio_net_flush_tx(n, n-tx_vq) 0) { Shouldn't this be virtio_net_flush_tx(n, n-tx_vq) = n-tx_burst? If we get less than tx_burst, the ring is empty now so no need to reschedule. Right? I suppose it depends on how aggressive we want to be. If the guest put something on the queue between the first flush and this one, then it might be actively transmitting, and if we want to optimize latency, we anticipate that it might continue to transmit and re-schedule. This is taken straight from markmc's rhel5 patch. I wouldn't argue that it's wrong to not reschedule here, but it's clearly less aggressive. Thanks, Alex I'm a bit concerned that we are aggressive but not consistently aggressive. For example if the guest adds a packet before we do disable notification, we do not reschedule bh, but if it adds a packet after this, we do. If we get 255 packets, then another 255 packets, we poll without rescheduling an
Re: ftrace/perf_event leak
On 09/01/2010 12:38 PM, Li Zefan wrote: Then try this: Tested-by: Avi Kivity a...@redhat.com -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] pci: expand tabs to spaces in pci_regs.h
On Wed, Sep 01, 2010 at 01:58:30AM +0300, Eduard - Gabriel Munteanu wrote: On Tue, Aug 31, 2010 at 11:29:53PM +0300, Michael S. Tsirkin wrote: On Sat, Aug 28, 2010 at 05:54:52PM +0300, Eduard - Gabriel Munteanu wrote: The conversion was done using the GNU 'expand' tool (default settings) to make it obey the QEMU coding style. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro I'm not really interested in this: we copied pci_regs.h from linux to help non-linux hosts, and keeping the code consistent with the original makes detecting bugs and adding new stuff from linux/pci_regs.h easier. --- hw/pci_regs.h | 1330 1 files changed, 665 insertions(+), 665 deletions(-) rewrite hw/pci_regs.h (90%) Ok, I'll drop it. The only reason I did it was one of my additions to this file made the patch look indented awkwardly. I'll use tabs and merge it into Linux as well. Thanks, Eduard Good idea, this way more people with pci knowledge check it. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx
On Tue, Aug 31, 2010 at 04:26:07PM -0600, Alex Williamson wrote: On Tue, 2010-08-31 at 16:33 -0500, Anthony Liguori wrote: On 08/31/2010 03:27 PM, Michael S. Tsirkin wrote: On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote: Add the ability to configure the tx_timer timeout and add a bottom half tx handler that typically shows a nice perf boost over the time based approach. See last patch for perf details. Make this the new default when the iothread is enabled. Thanks, Alex As a further thought, maybe it would help to have a separate namespace for unsupported, developer-only parameters. Thoughts? We already have an undocumented one, just prefix with 'x-' to indicate that it's experimental. Any objection then to x-txburst and x-txtimer then? Easy enough to rename. Thanks, Alex No objection. Let's put the 256 constant in a header, with a comment explaining where it came from? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On Wed, 2010-09-01 at 13:38 +0300, Avi Kivity wrote: On 09/01/2010 12:38 PM, Li Zefan wrote: Then try this: Tested-by: Avi Kivity a...@redhat.com Thanks, queued as: --- Subject: perf, trace: Fix module leak From: Li Zefan l...@cn.fujitsu.com Date: Wed Sep 01 12:58:43 CEST 2010 Commit 1c024eca (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak. Tested-by: Avi Kivity a...@redhat.com Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl LKML-Reference: 4c7e1f12.8030...@cn.fujitsu.com --- kernel/trace/trace_event_perf.c |3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6/kernel/trace/trace_event_perf.c === --- linux-2.6.orig/kernel/trace/trace_event_perf.c +++ linux-2.6/kernel/trace/trace_event_perf.c @@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p tp_event-class tp_event-class-reg try_module_get(tp_event-mod)) { ret = perf_trace_event_init(tp_event, p_event); + if (ret) + module_put(tp_event-mod); break; } } @@ -147,6 +149,7 @@ void perf_trace_destroy(struct perf_even } } out: + module_put(tp_event-mod); mutex_unlock(event_mutex); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
* Peter Zijlstra pet...@infradead.org wrote: On Wed, 2010-09-01 at 13:38 +0300, Avi Kivity wrote: On 09/01/2010 12:38 PM, Li Zefan wrote: Then try this: Tested-by: Avi Kivity a...@redhat.com Thanks, queued as: --- Subject: perf, trace: Fix module leak From: Li Zefan l...@cn.fujitsu.com Date: Wed Sep 01 12:58:43 CEST 2010 Commit 1c024eca (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak. Tested-by: Avi Kivity a...@redhat.com Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl LKML-Reference: 4c7e1f12.8030...@cn.fujitsu.com Do we need this for -stable too? Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On Wed, Sep 01, 2010 at 01:02:57PM +0200, Peter Zijlstra wrote: On Wed, 2010-09-01 at 13:38 +0300, Avi Kivity wrote: On 09/01/2010 12:38 PM, Li Zefan wrote: Then try this: Tested-by: Avi Kivity a...@redhat.com Thanks, queued as: --- Subject: perf, trace: Fix module leak From: Li Zefan l...@cn.fujitsu.com Date: Wed Sep 01 12:58:43 CEST 2010 Commit 1c024eca (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak. Tested-by: Avi Kivity a...@redhat.com Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl LKML-Reference: 4c7e1f12.8030...@cn.fujitsu.com --- kernel/trace/trace_event_perf.c |3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6/kernel/trace/trace_event_perf.c === --- linux-2.6.orig/kernel/trace/trace_event_perf.c +++ linux-2.6/kernel/trace/trace_event_perf.c @@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p tp_event-class tp_event-class-reg try_module_get(tp_event-mod)) { ret = perf_trace_event_init(tp_event, p_event); + if (ret) + module_put(tp_event-mod); break; } } @@ -147,6 +149,7 @@ void perf_trace_destroy(struct perf_even } } out: + module_put(tp_event-mod); mutex_unlock(event_mutex); } Thanks for fixing this. However, can we split this in two patches to ease the backport? The lack of a module_put() after perf_trace_init() failure is there for a while (the backport needs to start in 2.6.32). But the lack of a module_put in the destroy path needs a .35 backport only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] 10Gb Ethernet Adapters (with TOE ) port
On 08/18/10 08:12, Onkar Mahajan wrote: Is there any benefit gained by porting 10Gb Ethernet adapters to Qemu/KVM ? The question may be naive. Please forgive me if this doesn't make any sense. Creating a new QEMU driver to mimic one of the 10Gbit adapters is probably a waste of time, you should be able to get the speed by using virtio. Alternatively you can use SRIOV to get direct access to the 10Gbit adapter in the guest if you want. Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
On Wed, 2010-09-01 at 14:15 +0200, Frederic Weisbecker wrote: Thanks for fixing this. However, can we split this in two patches to ease the backport? The lack of a module_put() after perf_trace_init() failure is there for a while (the backport needs to start in 2.6.32). But the lack of a module_put in the destroy path needs a .35 backport only. I don't think it really needs two patches. Just notify stable (and Greg KH in particular) about the backport requirements. Greg can handle it ;) -- Steve -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] KVM: MMU: lower the aduit frequency
On Wed, Sep 01, 2010 at 12:06:37PM +0300, Avi Kivity wrote: On 08/31/2010 05:27 AM, Xiao Guangrong wrote: I've always seen progress from the guest while running with audit enabled (its slow, but its not supposed to be fast anyway). Did you experience a freeze? There is a simply test in the guest if it's not rate limit: # time ls anaconda-ks.cfg Documents install.log Music Public Videos Desktop Downloads install.log.syslog Pictures Templates real1m26.053s user0m0.311s sys 0m1.813s 'ls' command cost about 1.5 minute, if we run the memory test program, i think the time/delay is unacceptable.. :-( Marcelo, would making the ratelimit optional help? personally I think without ratelimit audit is useless, and with ratelimit it is a lot less useful but can still point out problems. But I haven't used it in a while. Was just wondering whether there was a freeze. Patchset looks good (ratelimit can be disabled manually). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
* Steven Rostedt rost...@goodmis.org wrote: On Wed, 2010-09-01 at 14:15 +0200, Frederic Weisbecker wrote: Thanks for fixing this. However, can we split this in two patches to ease the backport? The lack of a module_put() after perf_trace_init() failure is there for a while (the backport needs to start in 2.6.32). But the lack of a module_put in the destroy path needs a .35 backport only. I don't think it really needs two patches. Just notify stable (and Greg KH in particular) about the backport requirements. Greg can handle it ;) Well, Greg certainly has more than enough to handle, so if there's different chunks with different -stable vectors then it would be most helpful to him to split things up! Manually trying to split up patches is both error-prone and stress-inducing. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/7] pci: memory access API and IOMMU support
Please see my comments at the end of this mail. Am 30.08.2010 00:08, schrieb Eduard - Gabriel Munteanu: PCI devices should access memory through pci_memory_*() instead of cpu_physical_memory_*(). This also provides support for translation and access checking in case an IOMMU is emulated. Memory maps are treated as remote IOTLBs (that is, translation caches belonging to the IOMMU-aware device itself). Clients (devices) must provide callbacks for map invalidation in case these maps are persistent beyond the current I/O context, e.g. AIO DMA transfers. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- hw/pci.c | 191 +++- hw/pci.h | 69 +++ hw/pci_internals.h | 12 +++ qemu-common.h | 1 + 4 files changed, 272 insertions(+), 1 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 2dc1577..afcb33c 100644 --- a/hw/pci.c +++ b/hw/pci.c ... diff --git a/hw/pci.h b/hw/pci.h index c551f96..c95863a 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -172,6 +172,8 @@ struct PCIDevice { char *romfile; ram_addr_t rom_offset; uint32_t rom_bar; + + QLIST_HEAD(, PCIMemoryMap) memory_maps; }; PCIDevice *pci_register_device(PCIBus *bus, const char *name, @@ -391,4 +393,71 @@ static inline int ranges_overlap(uint64_t first1, uint64_t len1, return !(last2 first1 || last1 first2); } +/* + * Memory I/O and PCI IOMMU definitions. + */ + +#define IOMMU_PERM_READ (1 0) +#define IOMMU_PERM_WRITE (1 1) +#define IOMMU_PERM_RW (IOMMU_PERM_READ | IOMMU_PERM_WRITE) + +typedef int PCIInvalidateMapFunc(void *opaque); +typedef int PCITranslateFunc(PCIDevice *iommu, + PCIDevice *dev, + pcibus_t addr, + target_phys_addr_t *paddr, + target_phys_addr_t *len, + unsigned perms); + +void pci_memory_rw(PCIDevice *dev, + pcibus_t addr, + uint8_t *buf, + pcibus_t len, + int is_write); +void *pci_memory_map(PCIDevice *dev, + PCIInvalidateMapFunc *cb, + void *opaque, + pcibus_t addr, + target_phys_addr_t *len, + int is_write); +void pci_memory_unmap(PCIDevice *dev, + void *buffer, + target_phys_addr_t len, + int is_write, + target_phys_addr_t access_len); +void pci_register_iommu(PCIDevice *dev, PCITranslateFunc *translate); +void pci_memory_invalidate_range(PCIDevice *dev, pcibus_t addr, pcibus_t len); + +#define DECLARE_PCI_LD(suffix, size) \ +uint##size##_t pci_ld##suffix(PCIDevice *dev, pcibus_t addr); + +#define DECLARE_PCI_ST(suffix, size) \ +void pci_st##suffix(PCIDevice *dev, pcibus_t addr, uint##size##_t val); + +DECLARE_PCI_LD(ub, 8) +DECLARE_PCI_LD(uw, 16) +DECLARE_PCI_LD(l, 32) +DECLARE_PCI_LD(q, 64) + +DECLARE_PCI_ST(b, 8) +DECLARE_PCI_ST(w, 16) +DECLARE_PCI_ST(l, 32) +DECLARE_PCI_ST(q, 64) + +static inline void pci_memory_read(PCIDevice *dev, + pcibus_t addr, + uint8_t *buf, + pcibus_t len) +{ + pci_memory_rw(dev, addr, buf, len, 0); +} + +static inline void pci_memory_write(PCIDevice *dev, + pcibus_t addr, + const uint8_t *buf, + pcibus_t len) +{ + pci_memory_rw(dev, addr, (uint8_t *) buf, len, 1); +} + #endif The functions pci_memory_read and pci_memory_write not only read or write byte data but many different data types which leads to a lot of type casts in your other patches. I'd prefer void *buf and const void *buf in the argument lists. Then all those type casts could be removed. Regards Stefan Weil -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Degrading Network performance as KVM/kernel version increases
I can't say anything about the kernel version making things worse. At least for the qemu-kvm version, you should be using -device and -netdev instead of -net nic -net tap (see *http://git.qemu.org/qemu.git/tree/docs/qdev-device-use.txt since it's not in the 0.12 tree).* Thanks for your suggestion Brian I was not doing this correctly. After changing to -device and -netdev I did get a significant performance increase although overall performance is still much worse than what I was getting in the 2.3.31.5 kernel with version 11. Below is part of my startup script. Any chance you notice anything else wrong? Before your suggestion (what was working well with version 11 2.6.31): modprobe kvm modprobe kvm_intel modprobe tun echo -e Setting up bridge device br0 \r brctl addbr br0 ifconfig br0 192.168.100.254 netmask 255.255.255.0 up brctl addif br0 eth7 ifconfig eth7 down ifconfig eth7 0.0.0.0 for ((i=0; i NUM_OF_DEVICES ; i++)); do echo -e Setting up \r tunctl -b -g ${KVMNET_GID} -t kvmnet$i brctl addif br0 kvmnet$i ifconfig kvmnet$i up 0.0.0.0 promisc done echo 1 /proc/sys/net/ipv4/ip_forward iptables -t nat -A POSTROUTING -s 192.168.100.0/24 -o eth7 -j MASQUERADE for ((i=0; i NUM_OF_DEVICES ; i++)); do echo -e Creating Virtual disk $i \r qemu-img create -f qcow2 vdisk_$i.img $VDISK_SIZE echo -e Starting Virtual Machine $i \r /root/bin/qemu-system-x86_64 -cpu host -drive file=./vdisk_$i.img,if=virtio,boot=on -cdrom ./$2 -boot d \ -net nic,model=virtio,macaddr=52:54:00:12:34:3$i \ -net tap,ifname=kvmnet$i,script=no \ -m 1024 \ -smp 2 \ -usb \ -usbdevice tablet \ -localtime \ -daemonize \ -vga std After changing to -device and -netdev(working better with the latest stuff but still much worse than 2.6.31): modprobe kvm modprobe kvm_intel modprobe tun echo -e Setting up bridge device br0 \r brctl addbr br0 ifconfig br0 192.168.100.254 netmask 255.255.255.0 up brctl addif br0 eth7 ifconfig eth7 down ifconfig eth7 0.0.0.0 for ((i=0; i NUM_OF_DEVICES ; i++)); do echo -e Setting up \r tunctl -b -g ${KVMNET_GID} -t kvmnet$i brctl addif br0 kvmnet$i ifconfig kvmnet$i up 0.0.0.0 promisc done echo 1 /proc/sys/net/ipv4/ip_forward iptables -t nat -A POSTROUTING -s 192.168.100.0/24 -o eth7 -j MASQUERADE for ((i=0; i NUM_OF_DEVICES ; i++)); do echo -e Creating Virtual disk $i \r qemu-img create -f qcow2 vdisk_$i.img $VDISK_SIZE echo -e Starting Virtual Machine $i \r /root/bin/qemu-system-x86_64 -cpu host -drive file=./vdisk_$i.img,if=virtio,boot=on -cdrom ./$2 -boot d \ -netdev type=tap,id=tap.0,script=no,ifname=kvmnet$i \ -device virtio-net-pci,netdev=tap.0,mac=52:54:00:12:34:3$i \ -m 1024 \ -smp 2 \ -usb \ -usbdevice tablet \ -localtime \ -daemonize \ -vga std Vhost-net is probably your best bet for maximizing throughput. You might try a separate post just for the vhost error if nobody chimes in about it here. I will do that. Thanks again for the input! --Matt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/8] Implement getnsboottime kernel API
On 08/30/2010 06:06 AM, Glauber Costa wrote: From: Zachary Amsdenzams...@redhat.com Add a kernel call to get the number of nanoseconds since boot. This is generally useful enough to make it a generic call. Signed-off-by: Zachary Amsdenzams...@redhat.com --- include/linux/time.h |1 + kernel/time/timekeeping.c | 27 +++ 2 files changed, 28 insertions(+), 0 deletions(-) diff --git a/include/linux/time.h b/include/linux/time.h index ea3559f..5d04108 100644 --- a/include/linux/time.h +++ b/include/linux/time.h @@ -145,6 +145,7 @@ extern void getnstimeofday(struct timespec *tv); extern void getrawmonotonic(struct timespec *ts); extern void getboottime(struct timespec *ts); extern void monotonic_to_bootbased(struct timespec *ts); +extern s64 getnsboottime(void); extern struct timespec timespec_trunc(struct timespec t, unsigned gran); extern int timekeeping_valid_for_hres(void); diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index caf8d4d..d250f0a 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -285,6 +285,33 @@ void ktime_get_ts(struct timespec *ts) } EXPORT_SYMBOL_GPL(ktime_get_ts); + +/** + * getnsboottime - get the bootbased clock in nsec format + * + * The function calculates the bootbased clock from the realtime + * clock and the wall_to_monotonic offset and stores the result + * in normalized timespec format in the variable pointed to by @ts. + */ +s64 getnsboottime(void) +{ + unsigned int seq; + s64 secs, nsecs; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqbegin(xtime_lock); + secs = xtime.tv_sec + wall_to_monotonic.tv_sec; + secs += total_sleep_time.tv_sec; + nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec; + nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns(); + + } while (read_seqretry(xtime_lock, seq)); + return nsecs + (secs * NSEC_PER_SEC); +} +EXPORT_SYMBOL_GPL(getnsboottime); + /** * do_gettimeofday - Returns the time of day in a timeval * @tv: pointer to the timeval to be set FWIW, I sent a patch to drop this yesterday, in favor of this approach: +static inline u64 get_kernel_ns(void) +{ +struct timespec ts; + +WARN_ON(preemptible()); +ktime_get_ts(ts); +monotonic_to_bootbased(ts); +return timespec_to_ns(ts); +} + -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ftrace/perf_event leak
Subject: perf, trace: Fix module leak From: Li Zefan l...@cn.fujitsu.com Date: Wed Sep 01 12:58:43 CEST 2010 Commit 1c024eca (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak. Tested-by: Avi Kivity a...@redhat.com Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl LKML-Reference: 4c7e1f12.8030...@cn.fujitsu.com --- kernel/trace/trace_event_perf.c |3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6/kernel/trace/trace_event_perf.c === --- linux-2.6.orig/kernel/trace/trace_event_perf.c +++ linux-2.6/kernel/trace/trace_event_perf.c @@ -91,6 +91,8 @@ int perf_trace_init(struct perf_event *p tp_event-class tp_event-class-reg try_module_get(tp_event-mod)) { ret = perf_trace_event_init(tp_event, p_event); +if (ret) +module_put(tp_event-mod); break; } } @@ -147,6 +149,7 @@ void perf_trace_destroy(struct perf_even } } out: +module_put(tp_event-mod); mutex_unlock(event_mutex); } Thanks for fixing this. However, can we split this in two patches to ease the backport? The lack of a module_put() after perf_trace_init() failure is there for a while (the backport needs to start in 2.6.32). The failure should be a rare case, I don't think this has to be backported? But the lack of a module_put in the destroy path needs a .35 backport only. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM test: virtio_console test v2
From: Lukáš Doktor ldok...@redhat.com 1) Starts VMs with the specified number of virtio console devices 2) Start smoke test 3) Start loopback test 4) Start performance test This test uses an auxiliary script, console_switch.py, that is copied to guests. This script has functions to send and write data to virtio console ports. Details of each test can be found on the docstrings for the test_* functions. Changes from v1: * Style fixes * Whitespace cleanup * Docstring and message fixes * ID for char devices can't be a simple number, fix it Tested with Fedora 13 guest/host, -smp 1. Signed-off-by: Lukas Doktor ldok...@redhat.com Signed-off-by: Jiri Zupka jzu...@redhat.com Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/scripts/console_switch.py | 218 client/tests/kvm/tests/virtio_console.py | 774 client/tests/kvm/tests_base.cfg.sample | 11 + 3 files changed, 1003 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/scripts/console_switch.py create mode 100644 client/tests/kvm/tests/virtio_console.py diff --git a/client/tests/kvm/scripts/console_switch.py b/client/tests/kvm/scripts/console_switch.py new file mode 100644 index 000..05edbc2 --- /dev/null +++ b/client/tests/kvm/scripts/console_switch.py @@ -0,0 +1,218 @@ +#!/usr/bin/python +# -*- coding: utf-8 -*- + +Auxiliary script used to send data between ports on guests. + +...@copyright: 2008-2009 Red Hat Inc. +...@author: Jiri Zupka (jzu...@redhat.com) +...@author: Lukas Doktor (ldok...@redhat.com) + +import threading +from threading import Thread +import os,time,select,re,random,sys,array + +files = dict() +ev = threading.Event() +threads = [] + +DEBUGPATH=/sys/kernel/debug + + +class Switch(Thread): + +Create a thread which sends data between ports. + +def __init__(self, exitevent, in_files, out_files, cachesize=1): + +@param exitevent: Event to end switch. +@param in_files: Array of input files. +@param out_files: Array of output files. +@param cachesize: Block to receive and send. + +Thread.__init__(self) + +self.in_files = in_files +self.out_files = out_files + +self.cachesize = cachesize +self.exitevent = exitevent + + +def run(self): +while not self.exitevent.isSet(): +#TODO: Why select causes trouble? :-( +#ret = select.select(self.in_files,[],[],1.0) +data = +#if not ret[0] == []: +for desc in self.in_files: +data += os.read(desc, self.cachesize) +for desc in self.out_files: +os.write(desc, data) + + +class Sender(Thread): + +Creates thread which sends random blocks of data to the destination port. + +def __init__(self, port, length): + +@param port: Destination port. +@param length: Length of the random data block. + +Thread.__init__(self) +self.port = port +self.data = array.array('L') +for i in range(max(length/self.data.itemsize, 1)): +self.data.append(random.randrange(sys.maxint)) + + +def run(self): +while True: +os.write(self.port, self.data) +del threads[:] + + +def get_port_status(): + +Get info about ports from kernel debugfs. + +@return: ports dictionary of port properties + +ports = {} + +not_present_msg = FAIL: There's no virtio-ports dir in debugfs +if not os.path.ismount(DEBUGPATH): +os.system('mount -t debugfs none %s' % DEBUGPATH) +try: +if not os.path.isdir('%s/virtio-ports' % DEBUGPATH): +print not_present_msg +except: +print not_present_msg +else: +viop_names = os.listdir('%s/virtio-ports' % DEBUGPATH) +for name in viop_names: +f = open(%s/virtio-ports/%s % (DEBUGPATH, name), 'r') +port = {} +for line in iter(f): +m = re.match((\S+): (\S+),line) +port[m.group(1)] = m.group(2) + +if (port['is_console'] == yes): +port[path] = /dev/hvc%s % port[console_vtermno] +# Console works like a serialport +else: +port[path] = /dev/%s % name +ports[port['name']] = port +f.close() + +return ports + + +def open_device(in_files, ports): + +Open devices and return array of descriptors + +@param in_files: files array +@return: array of descriptor + +f = [] + +for item in in_files: +name = ports[item[0]][path] +if (not item[1] == ports[item[0]][is_console]): +print ports +print FAIL: Host console is not like console on guest side\n + +if (name in files): +f.append(files[name]) +else: +try: +files[name] = os.open(name,
[PATCH v2] KVM-TEST: Use cdrom for Linux guests in unattended_install
Bug 615839 - kernel panic when install guest with -fda There exists a bug of floppy driver, it blocked our installation tests. So replace floppy with cdrom for Linux guestsin unattended_install. Changes from v1: - Format floppy before mount - Mount install ISO to self.cd2_iso Signed-off-by: Amos Kong ak...@redhat.com --- client/tests/kvm/scripts/unattended.py | 127 ++-- client/tests/kvm/tests_base.cfg.sample | 88 -- 2 files changed, 119 insertions(+), 96 deletions(-) diff --git a/client/tests/kvm/scripts/unattended.py b/client/tests/kvm/scripts/unattended.py index a0f96e7..edb0abf 100755 --- a/client/tests/kvm/scripts/unattended.py +++ b/client/tests/kvm/scripts/unattended.py @@ -81,8 +81,14 @@ class UnattendedInstall(object): self.kernel_args = os.environ.get('KVM_TEST_kernel_args', '') self.finish_program = os.environ.get('KVM_TEST_finish_program', '') -cdrom_iso = os.environ.get('KVM_TEST_cdrom_cd1') self.unattended_file = os.environ.get('KVM_TEST_unattended_file') +cdrom_cd1 = os.environ.get('KVM_TEST_cdrom_cd1', '') +cdrom_cd2 = os.environ.get('KVM_TEST_cdrom_cd2', '') +self.cd1_iso = os.path.join(kvm_test_dir, cdrom_cd1) +self.cd2_iso = os.path.join(kvm_test_dir, cdrom_cd2) +cd1_dir = os.path.dirname(self.cd1_iso) +if not os.path.isdir(cd1_dir): +os.makedirs(cd1_dir) install_virtio = os.environ.get('KVM_TEST_install_virtio', 'no') if install_virtio == 'yes': @@ -120,15 +126,19 @@ class UnattendedInstall(object): self.qemu_img_bin = os.environ.get('KVM_TEST_qemu_img_binary') if not os.path.isabs(self.qemu_img_bin): self.qemu_img_bin = os.path.join(kvm_test_dir, self.qemu_img_bin) -self.cdrom_iso = os.path.join(kvm_test_dir, cdrom_iso) -self.floppy_mount = tempfile.mkdtemp(prefix='floppy_', dir='/tmp') -self.cdrom_mount = tempfile.mkdtemp(prefix='cdrom_', dir='/tmp') + +self.cd1_mount = tempfile.mkdtemp(prefix='cd1_', dir='/tmp') +self.cd2_mount = tempfile.mkdtemp(prefix='cd2_', dir='/tmp') +self.boot_img_mount = tempfile.mkdtemp(prefix='boot_img_', dir='/tmp') self.nfs_mount = tempfile.mkdtemp(prefix='nfs_', dir='/tmp') -floppy_name = os.environ['KVM_TEST_floppy'] -self.floppy_img = os.path.join(kvm_test_dir, floppy_name) -floppy_dir = os.path.dirname(self.floppy_img) -if not os.path.isdir(floppy_dir): -os.makedirs(floppy_dir) + +floppy_name = os.environ.get('KVM_TEST_floppy', '') +self.floppy_img = '' +if floppy_name: +self.floppy_img = os.path.join(kvm_test_dir, floppy_name) +floppy_dir = os.path.dirname(self.floppy_img) +if not os.path.isdir(floppy_dir): +os.makedirs(floppy_dir) self.pxe_dir = os.environ.get('KVM_TEST_pxe_dir', '') self.pxe_image = os.environ.get('KVM_TEST_pxe_image', '') @@ -162,7 +172,7 @@ class UnattendedInstall(object): path_list = glob.glob('*') for path in path_list: src = os.path.join(self.virtio_floppy_mount, path) -dst = os.path.join(self.floppy_mount, path) +dst = os.path.join(self.boot_img_mount, path) if os.path.isdir(path): shutil.copytree(src, dst) elif os.path.isfile(path): @@ -205,7 +215,7 @@ class UnattendedInstall(object): 4) Re-write the config file to the disk self.copy_virtio_drivers_floppy() -txtsetup_oem = os.path.join(self.floppy_mount, 'txtsetup.oem') +txtsetup_oem = os.path.join(self.boot_img_mount, 'txtsetup.oem') if not os.path.isfile(txtsetup_oem): raise SetupError('File txtsetup.oem not found on the install ' 'floppy. Please verify if your floppy virtio ' @@ -223,37 +233,39 @@ class UnattendedInstall(object): fp.close() -def create_boot_floppy(self): +def create_boot_img(self): -Prepares a boot floppy by creating a floppy image file, mounting it and -copying an answer file (kickstarts for RH based distros, answer files -for windows) to it. After that the image is umounted. +Prepares a boot floppy/iso image by creating a floppy/iso image file, +it contains an answer file (kickstarts for RH based distros, answer +files for windows). After that the image is umounted. -print Creating boot floppy +try: +if self.floppy_img: +print Creating boot floppy -if os.path.exists(self.floppy_img): -os.remove(self.floppy_img) +if os.path.exists(self.floppy_img): +os.remove(self.floppy_img) +c_cmd = '%s create -f raw %s 1440k' %
Re: [PATCH 4/7] ide: use the PCI memory access interface
On Sat, Aug 28, 2010 at 05:54:55PM +0300, Eduard - Gabriel Munteanu wrote: Emulated PCI IDE controllers now use the memory access interface. This also allows an emulated IOMMU to translate and check accesses. Map invalidation results in cancelling DMA transfers. Since the guest OS can't properly recover the DMA results in case the mapping is changed, this is a fairly good approximation. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro --- dma-helpers.c | 46 +- dma.h | 21 - hw/ide/core.c | 15 --- hw/ide/internal.h | 39 +++ hw/ide/macio.c|4 ++-- hw/ide/pci.c |7 +++ 6 files changed, 117 insertions(+), 15 deletions(-) diff --git a/dma-helpers.c b/dma-helpers.c index 712ed89..a0dcdb8 100644 --- a/dma-helpers.c +++ b/dma-helpers.c @@ -10,12 +10,36 @@ #include dma.h #include block_int.h -void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint) +static void *qemu_sglist_default_map(void *opaque, + QEMUSGInvalMapFunc *inval_cb, + void *inval_opaque, + target_phys_addr_t addr, + target_phys_addr_t *len, + int is_write) +{ +return cpu_physical_memory_map(addr, len, is_write); +} + +static void qemu_sglist_default_unmap(void *opaque, + void *buffer, + target_phys_addr_t len, + int is_write, + target_phys_addr_t access_len) +{ +cpu_physical_memory_unmap(buffer, len, is_write, access_len); +} + +void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, + QEMUSGMapFunc *map, QEMUSGUnmapFunc *unmap, void *opaque) { qsg-sg = qemu_malloc(alloc_hint * sizeof(ScatterGatherEntry)); qsg-nsg = 0; qsg-nalloc = alloc_hint; qsg-size = 0; + +qsg-map = map ? map : qemu_sglist_default_map; +qsg-unmap = unmap ? unmap : qemu_sglist_default_unmap; +qsg-opaque = opaque; } void qemu_sglist_add(QEMUSGList *qsg, target_phys_addr_t base, @@ -73,12 +97,23 @@ static void dma_bdrv_unmap(DMAAIOCB *dbs) int i; for (i = 0; i dbs-iov.niov; ++i) { -cpu_physical_memory_unmap(dbs-iov.iov[i].iov_base, - dbs-iov.iov[i].iov_len, !dbs-is_write, - dbs-iov.iov[i].iov_len); +dbs-sg-unmap(dbs-sg-opaque, + dbs-iov.iov[i].iov_base, + dbs-iov.iov[i].iov_len, !dbs-is_write, + dbs-iov.iov[i].iov_len); } } +static void dma_bdrv_cancel(void *opaque) +{ +DMAAIOCB *dbs = opaque; + +bdrv_aio_cancel(dbs-acb); +dma_bdrv_unmap(dbs); +qemu_iovec_destroy(dbs-iov); +qemu_aio_release(dbs); +} + static void dma_bdrv_cb(void *opaque, int ret) { DMAAIOCB *dbs = (DMAAIOCB *)opaque; @@ -100,7 +135,8 @@ static void dma_bdrv_cb(void *opaque, int ret) while (dbs-sg_cur_index dbs-sg-nsg) { cur_addr = dbs-sg-sg[dbs-sg_cur_index].base + dbs-sg_cur_byte; cur_len = dbs-sg-sg[dbs-sg_cur_index].len - dbs-sg_cur_byte; -mem = cpu_physical_memory_map(cur_addr, cur_len, !dbs-is_write); +mem = dbs-sg-map(dbs-sg-opaque, dma_bdrv_cancel, dbs, + cur_addr, cur_len, !dbs-is_write); if (!mem) break; qemu_iovec_add(dbs-iov, mem, cur_len); diff --git a/dma.h b/dma.h index f3bb275..d48f35c 100644 --- a/dma.h +++ b/dma.h @@ -15,6 +15,19 @@ #include hw/hw.h #include block.h +typedef void QEMUSGInvalMapFunc(void *opaque); +typedef void *QEMUSGMapFunc(void *opaque, +QEMUSGInvalMapFunc *inval_cb, +void *inval_opaque, +target_phys_addr_t addr, +target_phys_addr_t *len, +int is_write); +typedef void QEMUSGUnmapFunc(void *opaque, + void *buffer, + target_phys_addr_t len, + int is_write, + target_phys_addr_t access_len); + typedef struct { target_phys_addr_t base; target_phys_addr_t len; @@ -25,9 +38,15 @@ typedef struct { int nsg; int nalloc; target_phys_addr_t size; + +QEMUSGMapFunc *map; +QEMUSGUnmapFunc *unmap; +void *opaque; } QEMUSGList; -void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint); +void qemu_sglist_init(QEMUSGList *qsg, int alloc_hint, + QEMUSGMapFunc *map, QEMUSGUnmapFunc *unmap, +
Re: [PATCH 2/7] pci: memory access API and IOMMU support
On Sat, Aug 28, 2010 at 05:54:53PM +0300, Eduard - Gabriel Munteanu wrote: PCI devices should access memory through pci_memory_*() instead of cpu_physical_memory_*(). This also provides support for translation and access checking in case an IOMMU is emulated. Memory maps are treated as remote IOTLBs (that is, translation caches belonging to the IOMMU-aware device itself). Clients (devices) must provide callbacks for map invalidation in case these maps are persistent beyond the current I/O context, e.g. AIO DMA transfers. Signed-off-by: Eduard - Gabriel Munteanu eduard.munte...@linux360.ro I am concerned about adding more pointer chaising on data path. Could we have 1. an iommu pointer in a device, inherited by secondary buses when they are created and by devices from buses when they are attached. 2. translation pointer in the iommu instead of the bus 3. pci_memory_XX functions inline, doing fast path for non-iommu case: if (__builtin_expect(!dev-iommu, 1) return cpu_memory_rw --- hw/pci.c | 185 +++- hw/pci.h | 74 + hw/pci_internals.h | 12 qemu-common.h |1 + 4 files changed, 271 insertions(+), 1 deletions(-) Almost nothing here is PCI specific. Can this code go into dma.c/dma.h? We would have struct DMADevice, APIs like device_dma_write etc. This would help us get rid of the void * stuff as well? diff --git a/hw/pci.c b/hw/pci.c index 2dc1577..b460905 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -158,6 +158,19 @@ static void pci_device_reset(PCIDevice *dev) pci_update_mappings(dev); } +static int pci_no_translate(PCIDevice *iommu, +PCIDevice *dev, +pcibus_t addr, +target_phys_addr_t *paddr, +target_phys_addr_t *len, +unsigned perms) +{ +*paddr = addr; +*len = -1; + +return 0; +} + static void pci_bus_reset(void *opaque) { PCIBus *bus = opaque; @@ -220,7 +233,10 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent, { qbus_create_inplace(bus-qbus, pci_bus_info, parent, name); assert(PCI_FUNC(devfn_min) == 0); -bus-devfn_min = devfn_min; + +bus-devfn_min = devfn_min; +bus-iommu = NULL; +bus-translate = pci_no_translate; /* host bridge */ QLIST_INIT(bus-child); @@ -1789,3 +1805,170 @@ static char *pcibus_get_dev_path(DeviceState *dev) return strdup(path); } +void pci_register_iommu(PCIDevice *iommu, +PCITranslateFunc *translate) +{ +iommu-bus-iommu = iommu; +iommu-bus-translate = translate; +} + The above seems broken for secondary buses, right? Also, can we use qdev for initialization in some way, instead of adding more APIs? E.g. I think it would be nice if we could just use qdev command line flags to control which bus is behind iommu and which isn't. +void pci_memory_rw(PCIDevice *dev, + pcibus_t addr, + uint8_t *buf, + pcibus_t len, + int is_write) +{ +int err; +unsigned perms; +PCIDevice *iommu = dev-bus-iommu; +target_phys_addr_t paddr, plen; + +perms = is_write ? IOMMU_PERM_WRITE : IOMMU_PERM_READ; + +while (len) { +err = dev-bus-translate(iommu, dev, addr, paddr, plen, perms); +if (err) +return; + +/* The translation might be valid for larger regions. */ +if (plen len) +plen = len; + +cpu_physical_memory_rw(paddr, buf, plen, is_write); + +len -= plen; +addr += plen; +buf += plen; +} +} + +static void pci_memory_register_map(PCIDevice *dev, +pcibus_t addr, +pcibus_t len, +target_phys_addr_t paddr, +PCIInvalidateMapFunc *invalidate, +void *invalidate_opaque) +{ +PCIMemoryMap *map; + +map = qemu_malloc(sizeof(PCIMemoryMap)); +map-addr = addr; +map-len= len; +map-paddr = paddr; +map-invalidate = invalidate; +map-invalidate_opaque = invalidate_opaque; + +QLIST_INSERT_HEAD(dev-memory_maps, map, list); +} + +static void pci_memory_unregister_map(PCIDevice *dev, + target_phys_addr_t paddr, + target_phys_addr_t len) +{ +PCIMemoryMap *map; + +QLIST_FOREACH(map, dev-memory_maps, list) { +if (map-paddr == paddr map-len == len) { +QLIST_REMOVE(map, list); +free(map); +} +} +} + +void
Re: [PULL 00/35] KVM: PPC: End-August patch queue
On 08/31/2010 05:31 AM, Alexander Graf wrote: Howdy, This is my local patch queue with stuff that has accumulated over the last weeks on KVM for PPC with some last minute fixes, speedups and debugging help that I needed for the KVM Forum ;-). The highlights of this set are: - Converted most important debug points to tracepoints - Flush less PTEs (speedup) - Go back to our own hash (less duplicates) - Make SRs guest settable (speedup for 32 bit guests) - Remove r30/r31 restrictions from PV hooks (speedup!) - Fix random breakages - Fix random guest stalls - 440GP host support (Thanks Hollis!) - Reliable interrupt injection Keep in mind that this is the first version that is stable on PPC32 hosts. All versions prior to this could occupy otherwise used segment entries and thus crash your machine :-). It is also the first version that is stable with PPC64 guests, because they require more sophisticated interrupt injection logic for which qemu patches are also required. Please pull this tree from: git://github.com/agraf/linux-2.6.git kvm-ppc-next Have fun with more accurate, faster and less buggy KVM on PowerPC! Pulled, thanks. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html