Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread

2010-08-31 Thread Stefan Hajnoczi
On Tue, Aug 31, 2010 at 11:32 PM, Alex Williamson
 wrote:
> On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
>> On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
>> > The bottom half handler shows big improvements over the timer
>> > with few downsides, default to it when the iothread is enabled.
>> >
>> > Using the following tests, with the guest and host connected
>> > via tap+bridge:
>> >
>> > guest> netperf -t TCP_STREAM -H $HOST
>> > host> netperf -t TCP_STREAM -H $GUEST
>> > guest> netperf -t UDP_STREAM -H $HOST
>> > host> netperf -t UDP_STREAM -H $GUEST
>> > guest> netperf -t TCP_RR -H $HOST
>> >
>> > Results: base throughput, exits/throughput ->
>> >                    patched throughput, exits/throughput
>> >
>> > --enable-io-thread
>> > TCP guest->host 2737.77, 47.82  -> 6767.09, 29.15 = 247%, 61%
>> > TCP host->guest 2231.33, 74.00  -> 4125.80, 67.61 = 185%, 91%
>> > UDP guest->host 6281.68, 14.66  -> 12569.27, 1.98 = 200%, 14%
>> > UDP host->guest 275.91,  289.22 -> 264.80, 293.53 = 96%, 101%
>> > interations/s   1949.65, 82.97  -> 7417.56, 84.31 = 380%, 102%
>> >
>> > No --enable-io-thread
>> > TCP guest->host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939%
>> > TCP host->guest 2416.03, 76.67 -> 5655.92, 55.52  = 234%, 72%
>> > UDP guest->host 12255.82, 6.11 -> 7775.87, 31.32  = 63%, 513%
>> > UDP host->guest 587.92, 245.95 -> 611.88, 239.92  = 104%, 98%
>> > interations/s   1975.59, 83.21 -> 8935.50, 88.18  = 452%, 106%
>> >
>> > Signed-off-by: Alex Williamson 
>>
>> parameter having different settings based on config
>> options might surprise some users. I don't think
>> we really need a parameter here ...
>
> I'm not a bit fan of this either, but I'd also prefer not to introduce a
> regression for a performance difference we know about in advance.  It
> gets even more complicated when we factor in qemu-kvm, as it doesn't
> build with iothread enabled, but seems to get and even better boost in
> performance across the board thanks largely to the kvm-irqchip.  Should
> we instead make this a configure option?  --enable-virtio-net-txbh?
> Thanks,
>
> Alex

qemu-kvm uses its own iothread implementation by default.  It doesn't
need --enable-io-thread because it already uses a similar model.

Stefan

>
>> > ---
>> >
>> >  hw/s390-virtio-bus.c |    3 ++-
>> >  hw/syborg_virtio.c   |    3 ++-
>> >  hw/virtio-pci.c      |    3 ++-
>> >  hw/virtio.h          |    6 ++
>> >  4 files changed, 12 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
>> > index 1483362..985f99a 100644
>> > --- a/hw/s390-virtio-bus.c
>> > +++ b/hw/s390-virtio-bus.c
>> > @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
>> >      .qdev.size = sizeof(VirtIOS390Device),
>> >      .qdev.props = (Property[]) {
>> >          DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
>> > -        DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
>> > +        DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer,
>> > +                           TXTIMER_DEFAULT),
>> >          DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
>> >          DEFINE_PROP_END_OF_LIST(),
>> >      },
>> > diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
>> > index 7b76972..ee5746d 100644
>> > --- a/hw/syborg_virtio.c
>> > +++ b/hw/syborg_virtio.c
>> > @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
>> >      .qdev.props = (Property[]) {
>> >          DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
>> >          DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
>> > -        DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
>> > +        DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer,
>> > +                           TXTIMER_DEFAULT),
>> >          DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
>> >          DEFINE_PROP_END_OF_LIST(),
>> >      }
>> > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
>> > index e025c09..9740f57 100644
>> > --- a/hw/virtio-pci.c
>> > +++ b/hw/virtio-pci.c
>> > @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
>> >              DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
>> >              DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
>> >              DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
>> > -            DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1),
>> > +            DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer,
>> > +                               TXTIMER_DEFAULT),
>> >              DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256),
>> >              DEFINE_PROP_END_OF_LIST(),
>> >          },
>> > diff --git a/hw/virtio.h b/hw/virtio.h
>> > index 4051889..a1a17a2 100644
>> > --- a/hw/virtio.h
>> > +++ b/hw/virtio.h
>> > @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
>> >  void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
>> >  

[KVM timekeeping] Revert getnsboottime() kernel API

2010-08-31 Thread Zachary Amsden
Turns out this doesn't actually save any math or locking, name
is chosen rather poorly, it doesn't match the existing kernel
APIs, and requires kvm-kmod changes.

Signed-off-by: Zachary Amsden 
---
 arch/x86/kvm/x86.c|   18 ++
 include/linux/time.h  |1 -
 kernel/time/timekeeping.c |   28 +---
 3 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4014d6c..03605b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -919,6 +919,16 @@ static inline u64 nsec_to_cycles(u64 nsec)
return ret;
 }
 
+static inline u64 get_kernel_ns(void)
+{
+   struct timespec ts;
+
+   WARN_ON(preemptible());
+   ktime_get_ts(&ts);
+   monotonic_to_bootbased(&ts);
+   return timespec_to_ns(&ts);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
struct kvm *kvm = vcpu->kvm;
@@ -928,7 +938,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 
spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = data - native_read_tsc();
-   ns = getnsboottime();
+   ns = get_kernel_ns();
elapsed = ns - kvm->arch.last_tsc_nsec;
sdiff = data - kvm->arch.last_tsc_write;
if (sdiff < 0)
@@ -981,7 +991,7 @@ static int kvm_write_guest_time(struct kvm_vcpu *v)
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
-   kernel_ns = getnsboottime();
+   kernel_ns = get_kernel_ns();
this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
local_irq_restore(flags);
 
@@ -3327,7 +3337,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
goto out;
 
r = 0;
-   now_ns = getnsboottime();
+   now_ns = get_kernel_ns();
delta = user_ns.clock - now_ns;
kvm->arch.kvmclock_offset = delta;
break;
@@ -3336,7 +3346,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
struct kvm_clock_data user_ns;
u64 now_ns;
 
-   now_ns = getnsboottime();
+   now_ns = get_kernel_ns();
user_ns.clock = kvm->arch.kvmclock_offset + now_ns;
user_ns.flags = 0;
 
diff --git a/include/linux/time.h b/include/linux/time.h
index 909e62a..9f15ac7 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -160,7 +160,6 @@ extern void getnstimeofday(struct timespec *tv);
 extern void getrawmonotonic(struct timespec *ts);
 extern void getboottime(struct timespec *ts);
 extern void monotonic_to_bootbased(struct timespec *ts);
-extern s64 getnsboottime(void);
 
 extern struct timespec timespec_trunc(struct timespec t, unsigned gran);
 extern int timekeeping_valid_for_hres(void);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 77e930d..dfbe271 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -284,33 +284,6 @@ void ktime_get_ts(struct timespec *ts)
 }
 EXPORT_SYMBOL_GPL(ktime_get_ts);
 
-
-/**
- * getnsboottime - get the bootbased clock in nsec format
- *
- * The function calculates the bootbased clock from the realtime
- * clock and the wall_to_monotonic offset and stores the result
- * in normalized timespec format in the variable pointed to by @ts.
- */
-s64 getnsboottime(void)
-{
-   unsigned int seq;
-   s64 secs, nsecs;
-
-   WARN_ON(timekeeping_suspended);
-
-   do {
-   seq = read_seqbegin(&xtime_lock);
-   secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
-   secs += total_sleep_time.tv_sec;
-   nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
-   nsecs += total_sleep_time.tv_nsec + timekeeping_get_ns();
-
-   } while (read_seqretry(&xtime_lock, seq));
-   return nsecs + (secs * NSEC_PER_SEC);
-}
-EXPORT_SYMBOL_GPL(getnsboottime);
-
 /**
  * do_gettimeofday - Returns the time of day in a timeval
  * @tv:pointer to the timeval to be set
@@ -327,6 +300,7 @@ void do_gettimeofday(struct timeval *tv)
 }
 
 EXPORT_SYMBOL(do_gettimeofday);
+
 /**
  * do_settimeofday - Sets the time of day
  * @tv:pointer to the timespec variable containing the new time
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] NUMA: realize NUMA memory pinning

2010-08-31 Thread Andrew Theurer
On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>  
> >>> Anthony Liguori wrote:
> >>>
>  On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>   
> > On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >
> >> According to the user-provided assignment bind the respective part
> >> of the guest's memory to the given host node. This uses Linux'
> >> mbind syscall (which is wrapped only in libnuma) to realize the
> >> pinning right after the allocation.
> >> Failures are not fatal, but produce a warning.
> >>
> >> Signed-off-by: Andre Przywara
> >> ...
> >>  
> > Why is it not possible (or perhaps not desired) to change the binding
> > after the guest is started?
> >
> > Sounds unflexible.
> >
> >>> The solution is to introduce a monitor interface to later adjust the
> >>> pinning, allowing both changing the affinity only (only valid for
> >>> future fault-ins) and actually copying the memory (more costly).
> >>>
> >> This is just duplicating numactl.
> >>
> >>  
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also I am not
> >>> (yet) familiar with the QMP protocol.
> >>>
>  We really need a solution that lets a user use a tool like numactl
>  outside of the QEMU instance.
>   
> >>> I fear that is not how it's meant to work with the Linux' NUMA API. In
> >>> opposite to the VCPU threads, which are externally visible entities
> >>> (PIDs), the memory should be private to the QEMU process. While you
> >>> can change the NUMA allocation policy of the _whole_ process, there is
> >>> no way to externally distinguish parts of the process' memory.
> >>> Although you could later (and externally) migrate already faulted
> >>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
> >>> would let an external tool interfere with QEMUs internal memory
> >>> management. Take for instance the change of the allocation policy
> >>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
> >>> either track such changes or you simply could not change such things
> >>> in QEMU.
> >>>
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs.  numactl can already set policies for
> >> files in hugetlbfs so all you need to do is have a separate hugetlbfs
> >> file for each numa node.
> >>  
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >
> 
> If you care about NUMA pinning, I can't believe you don't want 
> guaranteed large page allocation which THP does not provide.

I personally want a more automatic approach to placing VMs in NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place cpus and memory, especially for large
VMs that need to be defined as multi-node.  For user defined pinning,
libhugetlbfs will probably be fine, but for most VMs, I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.  

> The general point though is that we should find a way to partition 
> memory in qemu such that an external process can control the actual NUMA 
> placement.  This gives us maximum flexibility.
> 
> Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
> nodes?  Can we migrate memory between nodes?  Should we support 
> interleaving memory between two virtual nodes?  Why pick and choose when 
> we can have it all.

If there were a better way to do this than hugetlbfs, then I don't think
I would shy away from this.  Is there another way to change NUMA
policies on mappings from a user tool?  We can already inspect
with /proc//numamaps.  Is this something that could be added to
numactl?

> 
> > FWIW, large apps like databases have set a precedent for managing their
> > own NUMA policies.
> 
> Of course because they know what their NUMA policy should be.  They live 
> in a simple world where they assume they're the only application in the 
> system, they read the distance tables, figure they'll use XX% of all 
> physical memory, and then pin how they see fit.
> 
> But an individual QEMU process lives in a complex world.  It's almost 
> never the only thing on the system and it's only allowed to use a subset 
> of resources.  It's not sure what set of resources it can and can't use 
> and that's often times changing.  The topology chosen for a guest is 
> static but it's host topology may be dynamic due to thinks like live 
> migration.

True, that's why this would require support to change in the monitor.

> In short, QEMU absolutely cannot imp

Re: Degrading Network performance as KVM/kernel version increases

2010-08-31 Thread Brian Jackson

 On 8/31/2010 6:00 PM, matthew.r.roh...@l-3com.com wrote:

I have been getting degrading network performance with newer versions of
KVM and was wondering if this was expected?  It seems like a bug, but I
am new to this and maybe I am doing something wrong so I thought I would
ask.

KVM Host OS: Fedora 12 x86_64
KVM Guest OS Tiny Core Linux 2.6.33.3 kernel

I have tried multiple host kernels 2.6.31.5, 2.6.31.6, 2.6.32.19 and
2.6.35.4 along with versions qemu-kvm 11.0 and qemu-system-x86_64 12.5
compiled from from qemu-kvm repo.



I can't say anything about the kernel version making things worse. At 
least for the qemu-kvm version, you should be using -device and -netdev 
instead of -net nic -net tap (see 
*http://git.qemu.org/qemu.git/tree/docs/qdev-device-use.txt since it's 
not in the 0.12 tree).*




Setup is: 2 hosts with 1 guest on each connected by 10 Gb nic.

I am using virtio and have checked that hardware acceleration is
working.

Processor usage is less than 50% on host and guests.

Here is what I am seeing, I will just include guest to guest statistics,
I do have more (host to guest, etc.) if interested:






My goal is to get as much bandwidth as I can between the 2 guests
running on separate hosts.  The most I have been able to get is ~4 Gb/s
running 4 threads on iperf from guest A to guest B.  I cannot seem to
get much over 1.5Gb/s from guest to guest with a single iperf thread.
Is there some sort of know send limit per thread?  Is it expected that
the latest version of the kernel and modules perform worse than earlier
versions in the area of network performance ( I am guessing not, am I
doing something wrong?)?  I am using virtio and have checked that
hardware acceleration is working.  4 iperf threads host to host yields
~9.5 Gb/s.  Any ideas on how I can get better performance with newer
versions?  I have tried using vhost in 2.6.35 but I get the vhost could
not be initialized error.  The only thing I could find on the vhost
error is that selinux should be off which it is.

I am looking for ideas on increasing the bandwidth between guests and
thoughts on the degrading performance.



Vhost-net is probably your best bet for maximizing throughput. You might 
try a separate post just for the vhost error if nobody chimes in about 
it here.




Thanks for your help! --Matt


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Degrading Network performance as KVM/kernel version increases

2010-08-31 Thread matthew . r . rohrer
I have been getting degrading network performance with newer versions of
KVM and was wondering if this was expected?  It seems like a bug, but I
am new to this and maybe I am doing something wrong so I thought I would
ask.

KVM Host OS: Fedora 12 x86_64
KVM Guest OS Tiny Core Linux 2.6.33.3 kernel

I have tried multiple host kernels 2.6.31.5, 2.6.31.6, 2.6.32.19 and
2.6.35.4 along with versions qemu-kvm 11.0 and qemu-system-x86_64 12.5
compiled from from qemu-kvm repo.

Setup is: 2 hosts with 1 guest on each connected by 10 Gb nic.

I am using virtio and have checked that hardware acceleration is
working.

Processor usage is less than 50% on host and guests. 

Here is what I am seeing, I will just include guest to guest statistics,
I do have more (host to guest, etc.) if interested:

With kernel 2.6.31.5 and usign qemu-kvm 11.0  1.57 Gb/s (guest 1 to
guest 2)  then 1.37 Gb/s (guest 2 to guest 1) with a single iperf
thread.
With kernel 2.6.31.5 and usign qemu-kvm 11.0  3.16 Gb/s (guest 1 to
guest 2)  then 4.29 Gb/s (guest 2 to guest 1) with 4 (P4) iperf threads.

With kernel 2.6.31.5 and usign qemu-system 12.5  1.02 Gb/s (guest 1 to
guest 2) then .420 Gb/s (guest 2 to guest 1) with a single iperf thread.
With kernel 2.6.31.5 and usign qemu-system 12.5  1.30 Gb/s (guest 1 to
guest 2)  then .655 Gb/s (guest 2 to guest 1) with 4 (P4) iperf threads.

With kernel 2.6.31.5 on host 1 and 2.6.32.19 on host 2 and usign
qemu-kvm 11.0  .580 Gb/s (guest 1 to guest 2)  then 1.32 Gb/s(guest 2 to
guest 1) with a single iperf thread.

With kernel 2.6.32.19 and usign qemu-kvm 11.0  .548 Gb/s (guest 1 to
guest 2) then .603 Gb/s (guest 2 to guest 1) with a single iperf thread.
With kernel 2.6.32.19 and usign qemu-kvm 11.0  .569 Gb/s (guest 1 to
guest 2)  then .478 Gb/s (guest 2 to guest 1) with 4 (P4) iperf threads.

With kernel 2.6.32.19 and usign qemu-system 12.5  .571 Gb/s (guest 1 to
guest 2) then .500 Gb/s (guest 2 to guest 1) with a single iperf thread.
With kernel 2.6.32.19 and usign qemu-system 12.5  .633 Gb/s (guest 1 to
guest 2)  then .705 Gb/s (guest 2 to guest 1) with 4 (P4) iperf threads.

With kernel 2.6.35.4 and usign qemu-system 12.5  .418 Gb/s (guest 1 to
guest 2) and then I gave up.


My goal is to get as much bandwidth as I can between the 2 guests
running on separate hosts.  The most I have been able to get is ~4 Gb/s
running 4 threads on iperf from guest A to guest B.  I cannot seem to
get much over 1.5Gb/s from guest to guest with a single iperf thread.
Is there some sort of know send limit per thread?  Is it expected that
the latest version of the kernel and modules perform worse than earlier
versions in the area of network performance ( I am guessing not, am I
doing something wrong?)?  I am using virtio and have checked that
hardware acceleration is working.  4 iperf threads host to host yields
~9.5 Gb/s.  Any ideas on how I can get better performance with newer
versions?  I have tried using vhost in 2.6.35 but I get the vhost could
not be initialized error.  The only thing I could find on the vhost
error is that selinux should be off which it is.

I am looking for ideas on increasing the bandwidth between guests and
thoughts on the degrading performance.

Thanks for your help! --Matt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] pci: expand tabs to spaces in pci_regs.h

2010-08-31 Thread Eduard - Gabriel Munteanu
On Tue, Aug 31, 2010 at 11:29:53PM +0300, Michael S. Tsirkin wrote:
> On Sat, Aug 28, 2010 at 05:54:52PM +0300, Eduard - Gabriel Munteanu wrote:
> > The conversion was done using the GNU 'expand' tool (default settings)
> > to make it obey the QEMU coding style.
> > 
> > Signed-off-by: Eduard - Gabriel Munteanu 
> 
> I'm not really interested in this: we copied pci_regs.h from linux
> to help non-linux hosts, and keeping the code consistent
> with the original makes detecting bugs and adding new stuff
> from linux/pci_regs.h easier.
> 
> > ---
> >  hw/pci_regs.h | 1330 
> > 
> >  1 files changed, 665 insertions(+), 665 deletions(-)
> >  rewrite hw/pci_regs.h (90%)

Ok, I'll drop it. The only reason I did it was one of my additions to
this file made the patch look indented awkwardly.

I'll use tabs and merge it into Linux as well.


Thanks,
Eduard

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread

2010-08-31 Thread Anthony Liguori

On 08/31/2010 05:32 PM, Alex Williamson wrote:

On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
   

On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
 

The bottom half handler shows big improvements over the timer
with few downsides, default to it when the iothread is enabled.

Using the following tests, with the guest and host connected
via tap+bridge:

guest>  netperf -t TCP_STREAM -H $HOST
host>  netperf -t TCP_STREAM -H $GUEST
guest>  netperf -t UDP_STREAM -H $HOST
host>  netperf -t UDP_STREAM -H $GUEST
guest>  netperf -t TCP_RR -H $HOST

Results: base throughput, exits/throughput ->
patched throughput, exits/throughput

--enable-io-thread
TCP guest->host 2737.77, 47.82  ->  6767.09, 29.15 = 247%, 61%
TCP host->guest 2231.33, 74.00  ->  4125.80, 67.61 = 185%, 91%
UDP guest->host 6281.68, 14.66  ->  12569.27, 1.98 = 200%, 14%
UDP host->guest 275.91,  289.22 ->  264.80, 293.53 = 96%, 101%
interations/s   1949.65, 82.97  ->  7417.56, 84.31 = 380%, 102%

No --enable-io-thread
TCP guest->host 3041.57, 55.11 ->  1038.93, 517.57 = 34%, 939%
TCP host->guest 2416.03, 76.67 ->  5655.92, 55.52  = 234%, 72%
UDP guest->host 12255.82, 6.11 ->  7775.87, 31.32  = 63%, 513%
UDP host->guest 587.92, 245.95 ->  611.88, 239.92  = 104%, 98%
interations/s   1975.59, 83.21 ->  8935.50, 88.18  = 452%, 106%

Signed-off-by: Alex Williamson
   

parameter having different settings based on config
options might surprise some users. I don't think
we really need a parameter here ...
 

I'm not a bit fan of this either, but I'd also prefer not to introduce a
regression for a performance difference we know about in advance.  It
gets even more complicated when we factor in qemu-kvm, as it doesn't
build with iothread enabled, but seems to get and even better boost in
performance across the board thanks largely to the kvm-irqchip.  Should
we instead make this a configure option?  --enable-virtio-net-txbh?
   


No, at this stage, we should ignore no --enable-io-thread with -enable-kvm.

Regards,

Anthony Liguori


Thanks,

Alex

   

---

  hw/s390-virtio-bus.c |3 ++-
  hw/syborg_virtio.c   |3 ++-
  hw/virtio-pci.c  |3 ++-
  hw/virtio.h  |6 ++
  4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
index 1483362..985f99a 100644
--- a/hw/s390-virtio-bus.c
+++ b/hw/s390-virtio-bus.c
@@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
  .qdev.size = sizeof(VirtIOS390Device),
  .qdev.props = (Property[]) {
  DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
-DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
+DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer,
+   TXTIMER_DEFAULT),
  DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
  DEFINE_PROP_END_OF_LIST(),
  },
diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
index 7b76972..ee5746d 100644
--- a/hw/syborg_virtio.c
+++ b/hw/syborg_virtio.c
@@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
  .qdev.props = (Property[]) {
  DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
  DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
-DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
+DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer,
+   TXTIMER_DEFAULT),
  DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
  DEFINE_PROP_END_OF_LIST(),
  }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index e025c09..9740f57 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
  DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
  DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
  DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
-DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1),
+DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer,
+   TXTIMER_DEFAULT),
  DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256),
  DEFINE_PROP_END_OF_LIST(),
  },
diff --git a/hw/virtio.h b/hw/virtio.h
index 4051889..a1a17a2 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
  void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
  void *opaque);

+#ifdef CONFIG_IOTHREAD
+ #define TXTIMER_DEFAULT 0
+#else
+ #define TXTIMER_DEFAULT 1
+#endif
+
   

Add a comment explaning that this is just a performance optimization?

 

  /* Base devices.  */
  VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf);
  VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,

--
To unsubscribe from this list: send the line "unsu

Re: windows 2008 installation doesn't recognize cdrom

2010-08-31 Thread Martin Kraus
On Mon, Aug 30, 2010 at 05:29:45PM +0200, Martin Kraus wrote:
> On Mon, Aug 30, 2010 at 04:12:05PM +0200, Jes Sorensen wrote:
> > On 08/30/10 14:45, Martin Kraus wrote:
> > > By the installer I've meant the windows installer which complains that it 
> > > can't
> > > find a driver for cdrom (emulated by kvm patched qemu). I can boot from 
> > > the cd
> > > image by using -drive file=.media=cdrom or -cdrom ... and the 
> > > installation starts but it stops at the second installation window asking 
> > > to
> > > browse manualy for an appropriate driver.
> > > 
> > > I've been just wondering if anyone came across this problem or if I should
> > > blame it on Windows. 
> > 
> > That is very odd, I have certainly managed to install win2k8 here from
> > an ISO image without problems. If you launch it with -monitor stdio what
> > do you get when you type 'info block' ?
> 
> ide0-hd0: type=hd removable=0 file=/dev/mapper/virtual-pinda ro=0 
> drv=host_device encrypted=0
> ide1-cd0: type=cdrom removable=1 locked=0 
> file=/root/pub/iso/SW_DVD5_Windows_Svr_2008_R2_x64_ENG_1_ProdAct_X15-50365(2).ISO
>  ro=0 drv=raw encrypted=0 
> floppy0: type=floppy removable=1 locked=0 [not inserted]
> sd0: type=floppy removable=1 locked=0 [not inserted]
> 
> I'll try to download the iso again if I manage to get past the microsoft
> signon. 

Oh well. I've managed to persuade someone capable of signing on to the
microsoft site to get the hash of the dvd and it's been corrupted. 
I appologize for this blunder.

thank you for help
mk
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread

2010-08-31 Thread Alex Williamson
On Tue, 2010-08-31 at 23:25 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
> > The bottom half handler shows big improvements over the timer
> > with few downsides, default to it when the iothread is enabled.
> > 
> > Using the following tests, with the guest and host connected
> > via tap+bridge:
> > 
> > guest> netperf -t TCP_STREAM -H $HOST
> > host> netperf -t TCP_STREAM -H $GUEST
> > guest> netperf -t UDP_STREAM -H $HOST
> > host> netperf -t UDP_STREAM -H $GUEST
> > guest> netperf -t TCP_RR -H $HOST
> > 
> > Results: base throughput, exits/throughput ->
> >patched throughput, exits/throughput
> > 
> > --enable-io-thread
> > TCP guest->host 2737.77, 47.82  -> 6767.09, 29.15 = 247%, 61%
> > TCP host->guest 2231.33, 74.00  -> 4125.80, 67.61 = 185%, 91%
> > UDP guest->host 6281.68, 14.66  -> 12569.27, 1.98 = 200%, 14%
> > UDP host->guest 275.91,  289.22 -> 264.80, 293.53 = 96%, 101%
> > interations/s   1949.65, 82.97  -> 7417.56, 84.31 = 380%, 102%
> > 
> > No --enable-io-thread
> > TCP guest->host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939%
> > TCP host->guest 2416.03, 76.67 -> 5655.92, 55.52  = 234%, 72%
> > UDP guest->host 12255.82, 6.11 -> 7775.87, 31.32  = 63%, 513%
> > UDP host->guest 587.92, 245.95 -> 611.88, 239.92  = 104%, 98%
> > interations/s   1975.59, 83.21 -> 8935.50, 88.18  = 452%, 106%
> > 
> > Signed-off-by: Alex Williamson 
> 
> parameter having different settings based on config
> options might surprise some users. I don't think
> we really need a parameter here ...

I'm not a bit fan of this either, but I'd also prefer not to introduce a
regression for a performance difference we know about in advance.  It
gets even more complicated when we factor in qemu-kvm, as it doesn't
build with iothread enabled, but seems to get and even better boost in
performance across the board thanks largely to the kvm-irqchip.  Should
we instead make this a configure option?  --enable-virtio-net-txbh?
Thanks,

Alex

> > ---
> > 
> >  hw/s390-virtio-bus.c |3 ++-
> >  hw/syborg_virtio.c   |3 ++-
> >  hw/virtio-pci.c  |3 ++-
> >  hw/virtio.h  |6 ++
> >  4 files changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
> > index 1483362..985f99a 100644
> > --- a/hw/s390-virtio-bus.c
> > +++ b/hw/s390-virtio-bus.c
> > @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
> >  .qdev.size = sizeof(VirtIOS390Device),
> >  .qdev.props = (Property[]) {
> >  DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
> > -DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
> > +DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer,
> > +   TXTIMER_DEFAULT),
> >  DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
> >  DEFINE_PROP_END_OF_LIST(),
> >  },
> > diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
> > index 7b76972..ee5746d 100644
> > --- a/hw/syborg_virtio.c
> > +++ b/hw/syborg_virtio.c
> > @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
> >  .qdev.props = (Property[]) {
> >  DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
> >  DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
> > -DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
> > +DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer,
> > +   TXTIMER_DEFAULT),
> >  DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
> >  DEFINE_PROP_END_OF_LIST(),
> >  }
> > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> > index e025c09..9740f57 100644
> > --- a/hw/virtio-pci.c
> > +++ b/hw/virtio-pci.c
> > @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
> >  DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
> >  DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
> >  DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
> > -DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1),
> > +DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer,
> > +   TXTIMER_DEFAULT),
> >  DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256),
> >  DEFINE_PROP_END_OF_LIST(),
> >  },
> > diff --git a/hw/virtio.h b/hw/virtio.h
> > index 4051889..a1a17a2 100644
> > --- a/hw/virtio.h
> > +++ b/hw/virtio.h
> > @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
> >  void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
> >  void *opaque);
> >  
> > +#ifdef CONFIG_IOTHREAD
> > + #define TXTIMER_DEFAULT 0
> > +#else
> > + #define TXTIMER_DEFAULT 1
> > +#endif
> > +
> 
> Add a comment explaning that this is just a performance optimization?
> 
> >  /* Base devices.  */
> >  VirtIODevice *virtio_blk_init(De

Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx

2010-08-31 Thread Alex Williamson
On Tue, 2010-08-31 at 16:33 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:27 PM, Michael S. Tsirkin wrote:
> > On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote:
> >
> >> Add the ability to configure the tx_timer timeout and add a bottom
> >> half tx handler that typically shows a nice perf boost over the
> >> time based approach.  See last patch for perf details.  Make this
> >> the new default when the iothread is enabled.  Thanks,
> >>
> >> Alex
> >>  
> > As a further thought, maybe it would help to have a
> > separate namespace for unsupported, developer-only parameters.
> > Thoughts?
> >
> 
> We already have an undocumented one, just prefix with 'x-' to indicate 
> that it's experimental.

Any objection then to x-txburst and x-txtimer then?  Easy enough to
rename.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 0/2] SVM: fix cpu onlining

2010-08-31 Thread Marcelo Tosatti

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/2] KVM: SVM: init_vmcb should reset vcpu->efer

2010-08-31 Thread Marcelo Tosatti
Otherwise EFER_LMA bit is retained across a SIPI reset.

Fixes guest cpu onlining.

Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/kvm/svm.c
===
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -816,7 +816,7 @@ static void init_vmcb(struct vcpu_svm *s
init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
 
-   save->efer = EFER_SVME;
+   svm_set_efer(&svm->vcpu, 0);
save->dr6 = 0x0ff0;
save->dr7 = 0x400;
save->rflags = 2;


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/2] KVM: SVM: reset mmu context in init_vmcb

2010-08-31 Thread Marcelo Tosatti
Since commit aad827034e419fa no mmu reinitialization is performed
via init_vmcb.

Zero vcpu->arch.cr0 and pass the reset value as a parameter to 
kvm_set_cr0.

Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/kvm/svm.c
===
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -827,8 +827,8 @@ static void init_vmcb(struct vcpu_svm *s
 * This is the guest-visible cr0 value.
 * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
 */
-   svm->vcpu.arch.cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
-   (void)kvm_set_cr0(&svm->vcpu, svm->vcpu.arch.cr0);
+   svm->vcpu.arch.cr0 = 0;
+   (void)kvm_set_cr0(&svm->vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
 
save->cr4 = X86_CR4_PAE;
/* rdx = ?? */


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] NUMA: realize NUMA memory pinning

2010-08-31 Thread Anthony Liguori

On 08/31/2010 03:54 PM, Andrew Theurer wrote:

On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
   

On 08/23/2010 04:16 PM, Andre Przywara wrote:
 

Anthony Liguori wrote:
   

On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
 

On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
   

According to the user-provided assignment bind the respective part
of the guest's memory to the given host node. This uses Linux'
mbind syscall (which is wrapped only in libnuma) to realize the
pinning right after the allocation.
Failures are not fatal, but produce a warning.

Signed-off-by: Andre Przywara
...
 

Why is it not possible (or perhaps not desired) to change the binding
after the guest is started?

Sounds unflexible.
   

The solution is to introduce a monitor interface to later adjust the
pinning, allowing both changing the affinity only (only valid for
future fault-ins) and actually copying the memory (more costly).
   

This is just duplicating numactl.

 

Actually this is the next item on my list, but I wanted to bring up
the basics first to avoid recoding parts afterwards. Also I am not
(yet) familiar with the QMP protocol.
   

We really need a solution that lets a user use a tool like numactl
outside of the QEMU instance.
 

I fear that is not how it's meant to work with the Linux' NUMA API. In
opposite to the VCPU threads, which are externally visible entities
(PIDs), the memory should be private to the QEMU process. While you
can change the NUMA allocation policy of the _whole_ process, there is
no way to externally distinguish parts of the process' memory.
Although you could later (and externally) migrate already faulted
pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
would let an external tool interfere with QEMUs internal memory
management. Take for instance the change of the allocation policy
regarding the 1MB and 3.5-4GB holes. An external tool would have to
either track such changes or you simply could not change such things
in QEMU.
   

It's extremely likely that if you're doing NUMA pinning, you're also
doing large pages via hugetlbfs.  numactl can already set policies for
files in hugetlbfs so all you need to do is have a separate hugetlbfs
file for each numa node.
 

Why would we resort to hugetlbfs when we have transparent hugepages?
   


If you care about NUMA pinning, I can't believe you don't want 
guaranteed large page allocation which THP does not provide.


The general point though is that we should find a way to partition 
memory in qemu such that an external process can control the actual NUMA 
placement.  This gives us maximum flexibility.


Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
nodes?  Can we migrate memory between nodes?  Should we support 
interleaving memory between two virtual nodes?  Why pick and choose when 
we can have it all.



FWIW, large apps like databases have set a precedent for managing their
own NUMA policies.


Of course because they know what their NUMA policy should be.  They live 
in a simple world where they assume they're the only application in the 
system, they read the distance tables, figure they'll use XX% of all 
physical memory, and then pin how they see fit.


But an individual QEMU process lives in a complex world.  It's almost 
never the only thing on the system and it's only allowed to use a subset 
of resources.  It's not sure what set of resources it can and can't use 
and that's often times changing.  The topology chosen for a guest is 
static but it's host topology may be dynamic due to thinks like live 
migration.


In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.  
Instead, it needs to let something with a larger view of the system 
determine a NUMA policy that makes sense overall.


There are two ways we can do this.  We can implement monitor commands 
that attempt to expose every single NUMA tunable possible.  Or, we can 
tie into the existing commands which guarantee that we support every 
possible tunable and that as NUMA support in Linux evolves, we get all 
the new features for free.


And, since numactl already supports setting policies on files in 
hugetlbfs, all we need is a simple change to qemu to allow -mem-path to 
work per-node instead of globally.  And it's useful to implement other 
types of things like having one node be guaranteed large pages and 
another node THP or some other fanciness.


Sounds awfully appealing to me.


   I don't see why qemu should be any different.
Numactl is great for small apps that need to be pinned in one node, or
spread evenly on all nodes.  Having to get hugetlbfs involved just to
workaround a shortcoming of numactl just seems like a bad idea.
   


You seem to be asserting that we should implement a full NUMA policy in 
QEMU.  What should it be when we don't (in QEMU) know what else is 
running on the system?


Regards,

Anthony Li

Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx

2010-08-31 Thread Anthony Liguori

On 08/31/2010 03:27 PM, Michael S. Tsirkin wrote:

On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote:
   

Add the ability to configure the tx_timer timeout and add a bottom
half tx handler that typically shows a nice perf boost over the
time based approach.  See last patch for perf details.  Make this
the new default when the iothread is enabled.  Thanks,

Alex
 

As a further thought, maybe it would help to have a
separate namespace for unsupported, developer-only parameters.
Thoughts?
   


We already have an undocumented one, just prefix with 'x-' to indicate 
that it's experimental.


Regards,

Anthony Liguori

   

---

Alex Williamson (5):
   virtio-net: Switch default to new bottom half TX handler for iothread
   virtio-net: Introduce a new bottom half packet TX
   virtio-net: Rename tx_timer_active to tx_waiting
   virtio-net: Limit number of packets sent per TX flush
   virtio-net: Make tx_timer timeout configurable


  hw/s390-virtio-bus.c |6 ++
  hw/s390-virtio-bus.h |4 ++
  hw/syborg_virtio.c   |   10 +++-
  hw/virtio-net.c  |  129 --
  hw/virtio-net.h  |2 -
  hw/virtio-pci.c  |   10 +++-
  hw/virtio.h  |9 +++
  7 files changed, 137 insertions(+), 33 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] NUMA: realize NUMA memory pinning

2010-08-31 Thread Andrew Theurer
On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> > Anthony Liguori wrote:
> >> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>  According to the user-provided assignment bind the respective part
>  of the guest's memory to the given host node. This uses Linux'
>  mbind syscall (which is wrapped only in libnuma) to realize the
>  pinning right after the allocation.
>  Failures are not fatal, but produce a warning.
> 
>  Signed-off-by: Andre Przywara
> > >>> ...
> >>> Why is it not possible (or perhaps not desired) to change the binding
> >>> after the guest is started?
> >>>
> >>> Sounds unflexible.
> > The solution is to introduce a monitor interface to later adjust the 
> > pinning, allowing both changing the affinity only (only valid for 
> > future fault-ins) and actually copying the memory (more costly).
> 
> This is just duplicating numactl.
> 
> > Actually this is the next item on my list, but I wanted to bring up 
> > the basics first to avoid recoding parts afterwards. Also I am not 
> > (yet) familiar with the QMP protocol.
> >>
> >> We really need a solution that lets a user use a tool like numactl 
> >> outside of the QEMU instance.
> > I fear that is not how it's meant to work with the Linux' NUMA API. In 
> > opposite to the VCPU threads, which are externally visible entities 
> > (PIDs), the memory should be private to the QEMU process. While you 
> > can change the NUMA allocation policy of the _whole_ process, there is 
> > no way to externally distinguish parts of the process' memory. 
> > Although you could later (and externally) migrate already faulted 
> > pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you 
> > would let an external tool interfere with QEMUs internal memory 
> > management. Take for instance the change of the allocation policy 
> > regarding the 1MB and 3.5-4GB holes. An external tool would have to 
> > either track such changes or you simply could not change such things 
> > in QEMU.
> 
> It's extremely likely that if you're doing NUMA pinning, you're also 
> doing large pages via hugetlbfs.  numactl can already set policies for 
> files in hugetlbfs so all you need to do is have a separate hugetlbfs 
> file for each numa node.

Why would we resort to hugetlbfs when we have transparent hugepages?

FWIW, large apps like databases have set a precedent for managing their
own NUMA policies.  I don't see why qemu should be any different.
Numactl is great for small apps that need to be pinned in one node, or
spread evenly on all nodes.  Having to get hugetlbfs involved just to
workaround a shortcoming of numactl just seems like a bad idea.   
> 
> Then you have all the flexibility of numactl and you can implement node 
> migration external to QEMU if you so desire.
> 
> > So what is wrong with keeping that code in QEMU, which knows best 
> > about the internals and already has flexible and mighty ways (command 
> > line and QMP) of manipulating its behavior?
> 
> NUMA is a last-mile optimization.  For the audience that cares about 
> this level of optimization, only providing an interface that allows a 
> small set of those optimizations to be used is unacceptable.
> 
> There's a very simple way to do this right and that's by adding 
> interfaces to QEMU that let's us work with existing tooling instead of 
> inventing new interfaces.
> 
> Regards,
> 
> Anthony Liguori
> 
> > Regards,
> > Andre.

-Andrew Theurer

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] virtio-net: Introduce a new bottom half packet TX

2010-08-31 Thread Alex Williamson
On Tue, 2010-08-31 at 23:14 +0300, Michael S. Tsirkin wrote:
> On Fri, Aug 27, 2010 at 04:37:36PM -0600, Alex Williamson wrote:
> > Based on a patch from Mark McLoughlin, this patch introduces a new
> > bottom half packet transmitter that avoids the latency imposed by
> > the tx_timer approach.  Rather than scheduling a timer when a TX
> > packet comes in, schedule a bottom half to be run from the iothread.
> > The bottom half handler first attempts to flush the queue with
> > notification disabled (this is where we could race with a guest
> > without txburst).  If we flush a full burst, reschedule immediately.
> > If we send short of a full burst, try to re-enable notification.
> > To avoid a race with TXs that may have occurred, we must then
> > flush again.  If we find some packets to send, the guest it probably
> > active, so we can reschedule again.
> > 
> > tx_timer and tx_bh are mutually exclusive, so we can re-use the
> > tx_waiting flag to indicate one or the other needs to be setup.
> > This allows us to seamlessly migrate between timer and bh TX
> > handling.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> > 
> >  hw/virtio-net.c |   81 
> > ++-
> >  1 files changed, 68 insertions(+), 13 deletions(-)
> > 
> > diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> > index 8b652f2..3288c77 100644
> > --- a/hw/virtio-net.c
> > +++ b/hw/virtio-net.c
> > @@ -36,6 +36,7 @@ typedef struct VirtIONet
> >  VirtQueue *ctrl_vq;
> >  NICState *nic;
> >  QEMUTimer *tx_timer;
> > +QEMUBH *tx_bh;
> >  uint32_t tx_timeout;
> >  int32_t tx_burst;
> >  int tx_waiting;
> > @@ -704,16 +705,25 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, 
> > VirtQueue *vq)
> >  {
> >  VirtIONet *n = to_virtio_net(vdev);
> >  
> > -if (n->tx_waiting) {
> > -virtio_queue_set_notification(vq, 1);
> > -qemu_del_timer(n->tx_timer);
> > -n->tx_waiting = 0;
> > -virtio_net_flush_tx(n, vq);
> > +if (n->tx_timer) {
> > +if (n->tx_waiting) {
> > +virtio_queue_set_notification(vq, 1);
> > +qemu_del_timer(n->tx_timer);
> > +n->tx_waiting = 0;
> > +virtio_net_flush_tx(n, vq);
> > +} else {
> > +qemu_mod_timer(n->tx_timer,
> > +   qemu_get_clock(vm_clock) + n->tx_timeout);
> > +n->tx_waiting = 1;
> > +virtio_queue_set_notification(vq, 0);
> > +}
> >  } else {
> > -qemu_mod_timer(n->tx_timer,
> > -   qemu_get_clock(vm_clock) + n->tx_timeout);
> > +if (unlikely(n->tx_waiting)) {
> > +return;
> > +}
> > +virtio_queue_set_notification(n->tx_vq, 0);
> > +qemu_bh_schedule(n->tx_bh);
> >  n->tx_waiting = 1;
> > -virtio_queue_set_notification(vq, 0);
> >  }
> >  }
> >  
> > @@ -731,6 +741,41 @@ static void virtio_net_tx_timer(void *opaque)
> >  virtio_net_flush_tx(n, n->tx_vq);
> >  }
> >  
> > +static void virtio_net_tx_bh(void *opaque)
> > +{
> > +VirtIONet *n = opaque;
> > +int32_t ret;
> > +
> > +n->tx_waiting = 0;
> > +
> > +/* Just in case the driver is not ready on more */
> > +if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
> > +return;
> > +
> > +ret = virtio_net_flush_tx(n, n->tx_vq);
> > +if (ret == -EBUSY) {
> > +return; /* Notification re-enable handled by tx_complete */
> > +}
> > +
> > +/* If we flush a full burst of packets, assume there are
> > + * more coming and immediately reschedule */
> > +if (ret >= n->tx_burst) {
> > +qemu_bh_schedule(n->tx_bh);
> > +n->tx_waiting = 1;
> > +return;
> > +}
> > +
> > +/* If less than a full burst, re-enable notification and flush
> > + * anything that may have come in while we weren't looking.  If
> > + * we find something, assume the guest is still active and reschedule 
> > */
> > +virtio_queue_set_notification(n->tx_vq, 1);
> > +if (virtio_net_flush_tx(n, n->tx_vq) > 0) {
> 
> Shouldn't this be virtio_net_flush_tx(n, n->tx_vq) >= n->tx_burst?
> If we get less than tx_burst, the ring is empty now so no need to
> reschedule.
> Right?

I suppose it depends on how aggressive we want to be.  If the guest put
something on the queue between the first flush and this one, then it
might be actively transmitting, and if we want to optimize latency, we
anticipate that it might continue to transmit and re-schedule.  This is
taken straight from markmc's rhel5 patch.  I wouldn't argue that it's
wrong to not reschedule here, but it's clearly less aggressive.  Thanks,

Alex

> > +virtio_queue_set_notification(n->tx_vq, 0);
> > +qemu_bh_schedule(n->tx_bh);
> > +n->tx_waiting = 1;
> > +}
> > +}
> > +
> >  static void virtio_net_save(QEMUFile *f, void *opaque)
> >  {
> >  VirtIONet *n = opaque;
> > @@ -850,8 

Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx

2010-08-31 Thread Michael S. Tsirkin
On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote:
> Add the ability to configure the tx_timer timeout and add a bottom
> half tx handler that typically shows a nice perf boost over the
> time based approach.  See last patch for perf details.  Make this
> the new default when the iothread is enabled.  Thanks,
> 
> Alex

As a further thought, maybe it would help to have a
separate namespace for unsupported, developer-only parameters.
Thoughts?

> ---
> 
> Alex Williamson (5):
>   virtio-net: Switch default to new bottom half TX handler for iothread
>   virtio-net: Introduce a new bottom half packet TX
>   virtio-net: Rename tx_timer_active to tx_waiting
>   virtio-net: Limit number of packets sent per TX flush
>   virtio-net: Make tx_timer timeout configurable
> 
> 
>  hw/s390-virtio-bus.c |6 ++
>  hw/s390-virtio-bus.h |4 ++
>  hw/syborg_virtio.c   |   10 +++-
>  hw/virtio-net.c  |  129 
> --
>  hw/virtio-net.h  |2 -
>  hw/virtio-pci.c  |   10 +++-
>  hw/virtio.h  |9 +++
>  7 files changed, 137 insertions(+), 33 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] virtio-net: Switch default to new bottom half TX handler for iothread

2010-08-31 Thread Michael S. Tsirkin
On Fri, Aug 27, 2010 at 04:37:45PM -0600, Alex Williamson wrote:
> The bottom half handler shows big improvements over the timer
> with few downsides, default to it when the iothread is enabled.
> 
> Using the following tests, with the guest and host connected
> via tap+bridge:
> 
> guest> netperf -t TCP_STREAM -H $HOST
> host> netperf -t TCP_STREAM -H $GUEST
> guest> netperf -t UDP_STREAM -H $HOST
> host> netperf -t UDP_STREAM -H $GUEST
> guest> netperf -t TCP_RR -H $HOST
> 
> Results: base throughput, exits/throughput ->
>patched throughput, exits/throughput
> 
> --enable-io-thread
> TCP guest->host 2737.77, 47.82  -> 6767.09, 29.15 = 247%, 61%
> TCP host->guest 2231.33, 74.00  -> 4125.80, 67.61 = 185%, 91%
> UDP guest->host 6281.68, 14.66  -> 12569.27, 1.98 = 200%, 14%
> UDP host->guest 275.91,  289.22 -> 264.80, 293.53 = 96%, 101%
> interations/s   1949.65, 82.97  -> 7417.56, 84.31 = 380%, 102%
> 
> No --enable-io-thread
> TCP guest->host 3041.57, 55.11 -> 1038.93, 517.57 = 34%, 939%
> TCP host->guest 2416.03, 76.67 -> 5655.92, 55.52  = 234%, 72%
> UDP guest->host 12255.82, 6.11 -> 7775.87, 31.32  = 63%, 513%
> UDP host->guest 587.92, 245.95 -> 611.88, 239.92  = 104%, 98%
> interations/s   1975.59, 83.21 -> 8935.50, 88.18  = 452%, 106%
> 
> Signed-off-by: Alex Williamson 

parameter having different settings based on config
options might surprise some users. I don't think
we really need a parameter here ...

> ---
> 
>  hw/s390-virtio-bus.c |3 ++-
>  hw/syborg_virtio.c   |3 ++-
>  hw/virtio-pci.c  |3 ++-
>  hw/virtio.h  |6 ++
>  4 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
> index 1483362..985f99a 100644
> --- a/hw/s390-virtio-bus.c
> +++ b/hw/s390-virtio-bus.c
> @@ -328,7 +328,8 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
>  .qdev.size = sizeof(VirtIOS390Device),
>  .qdev.props = (Property[]) {
>  DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
> -DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
> +DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer,
> +   TXTIMER_DEFAULT),
>  DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
>  DEFINE_PROP_END_OF_LIST(),
>  },
> diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
> index 7b76972..ee5746d 100644
> --- a/hw/syborg_virtio.c
> +++ b/hw/syborg_virtio.c
> @@ -300,7 +300,8 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
>  .qdev.props = (Property[]) {
>  DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
>  DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
> -DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
> +DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer,
> +   TXTIMER_DEFAULT),
>  DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
>  DEFINE_PROP_END_OF_LIST(),
>  }
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index e025c09..9740f57 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -695,7 +695,8 @@ static PCIDeviceInfo virtio_info[] = {
>  DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
>  DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
>  DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
> -DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer, 1),
> +DEFINE_PROP_UINT32("txtimer", VirtIOPCIProxy, txtimer,
> +   TXTIMER_DEFAULT),
>  DEFINE_PROP_INT32("txburst", VirtIOPCIProxy, txburst, 256),
>  DEFINE_PROP_END_OF_LIST(),
>  },
> diff --git a/hw/virtio.h b/hw/virtio.h
> index 4051889..a1a17a2 100644
> --- a/hw/virtio.h
> +++ b/hw/virtio.h
> @@ -183,6 +183,12 @@ void virtio_update_irq(VirtIODevice *vdev);
>  void virtio_bind_device(VirtIODevice *vdev, const VirtIOBindings *binding,
>  void *opaque);
>  
> +#ifdef CONFIG_IOTHREAD
> + #define TXTIMER_DEFAULT 0
> +#else
> + #define TXTIMER_DEFAULT 1
> +#endif
> +

Add a comment explaning that this is just a performance optimization?

>  /* Base devices.  */
>  VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf);
>  VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] virtio-net: Limit number of packets sent per TX flush

2010-08-31 Thread Michael S. Tsirkin
On Fri, Aug 27, 2010 at 04:37:18PM -0600, Alex Williamson wrote:
> If virtio_net_flush_tx is called with notification disabled, we can
> race with the guest, processing packets at the same rate as they
> get produced.  The trouble is that this means we have no guaranteed
> exit condition from the function and can spend minutes in there.
> Currently flush_tx is only called with notification on, which seems
> to limit us to one pass through the queue per call.  An upcoming
> patch changes this.
> 
> One pass through the queue (256) seems to be a good default value
> for this, balancing latency with throughput.  We use a signed int
> for txburst because 2^31 packets in a burst would take many, many
> minutes to process and it allows us to easily return a negative
> value value from virtio_net_flush_tx() to indicate a back-off
> or error condition.
> 
> Signed-off-by: Alex Williamson 

It might be better not to make this configurable, and simply set
txburst = vq->num in code in a single place, than the magic
256 constant in multiuple places.
Alternatively, let's put a macro in a .h file.

It might also be a good idea to put the above motivation in comments in the 
code.


> ---
> 
>  hw/s390-virtio-bus.c |4 +++-
>  hw/s390-virtio-bus.h |2 ++
>  hw/syborg_virtio.c   |6 +-
>  hw/virtio-net.c  |   23 ---
>  hw/virtio-pci.c  |6 +-
>  hw/virtio.h  |2 +-
>  6 files changed, 32 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/s390-virtio-bus.c b/hw/s390-virtio-bus.c
> index 4da0b40..1483362 100644
> --- a/hw/s390-virtio-bus.c
> +++ b/hw/s390-virtio-bus.c
> @@ -110,7 +110,8 @@ static int s390_virtio_net_init(VirtIOS390Device *dev)
>  {
>  VirtIODevice *vdev;
>  
> -vdev = virtio_net_init((DeviceState *)dev, &dev->nic, dev->txtimer);
> +vdev = virtio_net_init((DeviceState *)dev, &dev->nic,
> +   dev->txtimer, dev->txburst);
>  if (!vdev) {
>  return -1;
>  }
> @@ -328,6 +329,7 @@ static VirtIOS390DeviceInfo s390_virtio_net = {
>  .qdev.props = (Property[]) {
>  DEFINE_NIC_PROPERTIES(VirtIOS390Device, nic),
>  DEFINE_PROP_UINT32("txtimer", VirtIOS390Device, txtimer, 1),
> +DEFINE_PROP_INT32("txburst", VirtIOS390Device, txburst, 256),
>  DEFINE_PROP_END_OF_LIST(),
>  },
>  };
> diff --git a/hw/s390-virtio-bus.h b/hw/s390-virtio-bus.h
> index 922daf2..808aea0 100644
> --- a/hw/s390-virtio-bus.h
> +++ b/hw/s390-virtio-bus.h
> @@ -45,6 +45,8 @@ typedef struct VirtIOS390Device {
>  uint32_t max_virtserial_ports;
>  /* Timeout value for virtio-net txtimer, 1 = default, >1 = ns timeout */
>  uint32_t txtimer;
> +/* Max tx packets for virtio-net to burst at a time */
> +int32_t txburst;
>  } VirtIOS390Device;
>  
>  typedef struct VirtIOS390Bus {
> diff --git a/hw/syborg_virtio.c b/hw/syborg_virtio.c
> index c8d731a..7b76972 100644
> --- a/hw/syborg_virtio.c
> +++ b/hw/syborg_virtio.c
> @@ -70,6 +70,8 @@ typedef struct {
>  uint32_t host_features;
>  /* Timeout value for virtio-net txtimer, 1 = default, >1 = ns timeout */
>  uint32_t txtimer;
> +/* Max tx packets for virtio-net to burst at a time */
> +int32_t txburst;
>  } SyborgVirtIOProxy;
>  
>  static uint32_t syborg_virtio_readl(void *opaque, target_phys_addr_t offset)
> @@ -286,7 +288,8 @@ static int syborg_virtio_net_init(SysBusDevice *dev)
>  VirtIODevice *vdev;
>  SyborgVirtIOProxy *proxy = FROM_SYSBUS(SyborgVirtIOProxy, dev);
>  
> -vdev = virtio_net_init(&dev->qdev, &proxy->nic, proxy->txtimer);
> +vdev = virtio_net_init(&dev->qdev, &proxy->nic,
> +   proxy->txtimer, proxy->txburst);
>  return syborg_virtio_init(proxy, vdev);
>  }
>  
> @@ -298,6 +301,7 @@ static SysBusDeviceInfo syborg_virtio_net_info = {
>  DEFINE_NIC_PROPERTIES(SyborgVirtIOProxy, nic),
>  DEFINE_VIRTIO_NET_FEATURES(SyborgVirtIOProxy, host_features),
>  DEFINE_PROP_UINT32("txtimer", SyborgVirtIOProxy, txtimer, 1),
> +DEFINE_PROP_INT32("txburst", SyborgVirtIOProxy, txburst, 256),
>  DEFINE_PROP_END_OF_LIST(),
>  }
>  };
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index 9ef29f0..ac4aa8f 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -37,6 +37,7 @@ typedef struct VirtIONet
>  NICState *nic;
>  QEMUTimer *tx_timer;
>  uint32_t tx_timeout;
> +int32_t tx_burst;
>  int tx_timer_active;
>  uint32_t has_vnet_hdr;
>  uint8_t has_ufo;
> @@ -620,7 +621,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, 
> const uint8_t *buf, size_
>  return size;
>  }
>  
> -static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq);
> +static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq);
>  
>  static void virtio_net_tx_complete(VLANClientState *nc, ssize_t len)
>  {
> @@ -636,16 +637,18 @@ static void virtio_net_tx_complete(VLANClientState *nc, 
> ssize_t len)
>  }
>  

Re: [PATCH 0/5] virtio-net: More configurability and bh handling for tx

2010-08-31 Thread Michael S. Tsirkin
On Fri, Aug 27, 2010 at 04:36:59PM -0600, Alex Williamson wrote:
> Add the ability to configure the tx_timer timeout and add a bottom
> half tx handler that typically shows a nice perf boost over the
> time based approach.  See last patch for perf details.  Make this
> the new default when the iothread is enabled.  Thanks,
> 
> Alex

Looks good overall, but do we really want to make these parameters qdev 
properties?
It really looks like exposing an implementation detail that we'll have hard
time supporting in the future and users will have hard time
understanding. OTOH developers should have no trouble tweaking
a define on top of the file.



> ---
> 
> Alex Williamson (5):
>   virtio-net: Switch default to new bottom half TX handler for iothread
>   virtio-net: Introduce a new bottom half packet TX
>   virtio-net: Rename tx_timer_active to tx_waiting
>   virtio-net: Limit number of packets sent per TX flush
>   virtio-net: Make tx_timer timeout configurable
> 
> 
>  hw/s390-virtio-bus.c |6 ++
>  hw/s390-virtio-bus.h |4 ++
>  hw/syborg_virtio.c   |   10 +++-
>  hw/virtio-net.c  |  129 
> --
>  hw/virtio-net.h  |2 -
>  hw/virtio-pci.c  |   10 +++-
>  hw/virtio.h  |9 +++
>  7 files changed, 137 insertions(+), 33 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] virtio-net: Introduce a new bottom half packet TX

2010-08-31 Thread Michael S. Tsirkin
On Fri, Aug 27, 2010 at 04:37:36PM -0600, Alex Williamson wrote:
> Based on a patch from Mark McLoughlin, this patch introduces a new
> bottom half packet transmitter that avoids the latency imposed by
> the tx_timer approach.  Rather than scheduling a timer when a TX
> packet comes in, schedule a bottom half to be run from the iothread.
> The bottom half handler first attempts to flush the queue with
> notification disabled (this is where we could race with a guest
> without txburst).  If we flush a full burst, reschedule immediately.
> If we send short of a full burst, try to re-enable notification.
> To avoid a race with TXs that may have occurred, we must then
> flush again.  If we find some packets to send, the guest it probably
> active, so we can reschedule again.
> 
> tx_timer and tx_bh are mutually exclusive, so we can re-use the
> tx_waiting flag to indicate one or the other needs to be setup.
> This allows us to seamlessly migrate between timer and bh TX
> handling.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
>  hw/virtio-net.c |   81 
> ++-
>  1 files changed, 68 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index 8b652f2..3288c77 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -36,6 +36,7 @@ typedef struct VirtIONet
>  VirtQueue *ctrl_vq;
>  NICState *nic;
>  QEMUTimer *tx_timer;
> +QEMUBH *tx_bh;
>  uint32_t tx_timeout;
>  int32_t tx_burst;
>  int tx_waiting;
> @@ -704,16 +705,25 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, 
> VirtQueue *vq)
>  {
>  VirtIONet *n = to_virtio_net(vdev);
>  
> -if (n->tx_waiting) {
> -virtio_queue_set_notification(vq, 1);
> -qemu_del_timer(n->tx_timer);
> -n->tx_waiting = 0;
> -virtio_net_flush_tx(n, vq);
> +if (n->tx_timer) {
> +if (n->tx_waiting) {
> +virtio_queue_set_notification(vq, 1);
> +qemu_del_timer(n->tx_timer);
> +n->tx_waiting = 0;
> +virtio_net_flush_tx(n, vq);
> +} else {
> +qemu_mod_timer(n->tx_timer,
> +   qemu_get_clock(vm_clock) + n->tx_timeout);
> +n->tx_waiting = 1;
> +virtio_queue_set_notification(vq, 0);
> +}
>  } else {
> -qemu_mod_timer(n->tx_timer,
> -   qemu_get_clock(vm_clock) + n->tx_timeout);
> +if (unlikely(n->tx_waiting)) {
> +return;
> +}
> +virtio_queue_set_notification(n->tx_vq, 0);
> +qemu_bh_schedule(n->tx_bh);
>  n->tx_waiting = 1;
> -virtio_queue_set_notification(vq, 0);
>  }
>  }
>  
> @@ -731,6 +741,41 @@ static void virtio_net_tx_timer(void *opaque)
>  virtio_net_flush_tx(n, n->tx_vq);
>  }
>  
> +static void virtio_net_tx_bh(void *opaque)
> +{
> +VirtIONet *n = opaque;
> +int32_t ret;
> +
> +n->tx_waiting = 0;
> +
> +/* Just in case the driver is not ready on more */
> +if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
> +return;
> +
> +ret = virtio_net_flush_tx(n, n->tx_vq);
> +if (ret == -EBUSY) {
> +return; /* Notification re-enable handled by tx_complete */
> +}
> +
> +/* If we flush a full burst of packets, assume there are
> + * more coming and immediately reschedule */
> +if (ret >= n->tx_burst) {
> +qemu_bh_schedule(n->tx_bh);
> +n->tx_waiting = 1;
> +return;
> +}
> +
> +/* If less than a full burst, re-enable notification and flush
> + * anything that may have come in while we weren't looking.  If
> + * we find something, assume the guest is still active and reschedule */
> +virtio_queue_set_notification(n->tx_vq, 1);
> +if (virtio_net_flush_tx(n, n->tx_vq) > 0) {

Shouldn't this be virtio_net_flush_tx(n, n->tx_vq) >= n->tx_burst?
If we get less than tx_burst, the ring is empty now so no need to
reschedule.
Right?

> +virtio_queue_set_notification(n->tx_vq, 0);
> +qemu_bh_schedule(n->tx_bh);
> +n->tx_waiting = 1;
> +}
> +}
> +
>  static void virtio_net_save(QEMUFile *f, void *opaque)
>  {
>  VirtIONet *n = opaque;
> @@ -850,8 +895,12 @@ static int virtio_net_load(QEMUFile *f, void *opaque, 
> int version_id)
>  n->mac_table.first_multi = i;
>  
>  if (n->tx_waiting) {
> -qemu_mod_timer(n->tx_timer,
> -   qemu_get_clock(vm_clock) + n->tx_timeout);
> +if (n->tx_timer) {
> +qemu_mod_timer(n->tx_timer,
> +   qemu_get_clock(vm_clock) + n->tx_timeout);
> +} else {
> +qemu_bh_schedule(n->tx_bh);
> +}
>  }
>  return 0;
>  }
> @@ -939,9 +988,9 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf 
> *conf,
>  
>  qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
>  
> -n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, 

Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread Marcelo Tosatti
On Tue, Aug 31, 2010 at 06:33:34PM +0200, Jes Sorensen wrote:
> On 08/31/10 18:28, Avi Kivity wrote:
> >  On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote:
> >> From: Jes Sorensen
> >>
> >> We regularly see bug reports over this one, however it is a write to
> >> a read-only register which some operating systems (including Linux)
> >> tend to write to once in a while.
> >>
> >> Ignore the writes since they do no harm.
> >>
> > 
> > Does Linux write it with wrmsr_safe()?  If not, I don't see how it
> > works.  If it does, then we shouldn't ignore the write, instead issue
> > the #GP as usual, but be silent about it.
> 
> To be honest, I am having a hard time determining where the write
> happens. I looked at this with Gleb and the rip obtained by
> kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles
> a wrmsr().
> 
> It was either a
> 560   outb(0xfe, 0x64); /* pulse reset low */
> or
> 49asm volatile("sti; hlt": : :"memory");
> 
> which makes no sense to me, but given it's x86, I am not sure if it
> could have come from the BIOS or something during reboot?
> 
> Jes

kvm_reset_msrs in qemu-kvm-x86.c.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread David S. Ahern


On 08/31/10 11:04, Jes Sorensen wrote:
>> Just grep for the msr name in a guest kernel source that's known to
>> trigger the message.
> 
> Been there, done that! This happens with an F13 kernel during reboot.
> Ran the search on the expanded 2.6.32.8-149 tree and found no reference
> to anything trying to write it, except for KVM backing up the flag, but
> that shouldn't happen in the guest.

I've been seeing the messages with a 32-bit Fedora 10 guest running a
2.6.27 variant. The wrmsr messages are generated by the time the grub
menu appears. ie., pre-OS.

David

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 1/5] virtio-net: Make tx_timer timeout configurable

2010-08-31 Thread Chris Wright
* Alex Williamson (alex.william...@redhat.com) wrote:
> On Tue, 2010-08-31 at 11:00 -0700, Chris Wright wrote:
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> > > index 075f72d..9ef29f0 100644
> > > --- a/hw/virtio-net.c
> > > +++ b/hw/virtio-net.c
> > > @@ -36,6 +36,7 @@ typedef struct VirtIONet
> > >  VirtQueue *ctrl_vq;
> > >  NICState *nic;
> > >  QEMUTimer *tx_timer;
> > > +uint32_t tx_timeout;
> > >  int tx_timer_active;
> > >  uint32_t has_vnet_hdr;
> > >  uint8_t has_ufo;
> > > @@ -702,7 +703,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, 
> > > VirtQueue *vq)
> > >  virtio_net_flush_tx(n, vq);
> > >  } else {
> > >  qemu_mod_timer(n->tx_timer,
> > > -   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> > > +   qemu_get_clock(vm_clock) + n->tx_timeout);
> > >  n->tx_timer_active = 1;
> > >  virtio_queue_set_notification(vq, 0);
> > >  }
> > > @@ -842,7 +843,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, 
> > > int version_id)
> > >  
> > >  if (n->tx_timer_active) {
> > >  qemu_mod_timer(n->tx_timer,
> > > -   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> > > +   qemu_get_clock(vm_clock) + n->tx_timeout);
> > 
> > I think I'm missing where this is stored?  Looks like migration
> > would revert a changed tx_timeout back to 150us.
> 
> It's not stored, it can be instantiated on the migration target any way
> you please and we can migrate between different values or even different
> TX mitigation strategies.  If a non-default value is used on the source
> and you want to maintain the same behavior, the target needs to be
> started the same way.

heh, IOW, I did miss how it's stored...on cmdline ;)

thanks,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 1/5] virtio-net: Make tx_timer timeout configurable

2010-08-31 Thread Chris Wright
* Alex Williamson (alex.william...@redhat.com) wrote:
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index 075f72d..9ef29f0 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -36,6 +36,7 @@ typedef struct VirtIONet
>  VirtQueue *ctrl_vq;
>  NICState *nic;
>  QEMUTimer *tx_timer;
> +uint32_t tx_timeout;
>  int tx_timer_active;
>  uint32_t has_vnet_hdr;
>  uint8_t has_ufo;
> @@ -702,7 +703,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, 
> VirtQueue *vq)
>  virtio_net_flush_tx(n, vq);
>  } else {
>  qemu_mod_timer(n->tx_timer,
> -   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> +   qemu_get_clock(vm_clock) + n->tx_timeout);
>  n->tx_timer_active = 1;
>  virtio_queue_set_notification(vq, 0);
>  }
> @@ -842,7 +843,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int 
> version_id)
>  
>  if (n->tx_timer_active) {
>  qemu_mod_timer(n->tx_timer,
> -   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> +   qemu_get_clock(vm_clock) + n->tx_timeout);

I think I'm missing where this is stored?  Looks like migration
would revert a changed tx_timeout back to 150us.

thanks,
-chris
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 1/5] virtio-net: Make tx_timer timeout configurable

2010-08-31 Thread Alex Williamson
On Tue, 2010-08-31 at 11:00 -0700, Chris Wright wrote:
> * Alex Williamson (alex.william...@redhat.com) wrote:
> > diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> > index 075f72d..9ef29f0 100644
> > --- a/hw/virtio-net.c
> > +++ b/hw/virtio-net.c
> > @@ -36,6 +36,7 @@ typedef struct VirtIONet
> >  VirtQueue *ctrl_vq;
> >  NICState *nic;
> >  QEMUTimer *tx_timer;
> > +uint32_t tx_timeout;
> >  int tx_timer_active;
> >  uint32_t has_vnet_hdr;
> >  uint8_t has_ufo;
> > @@ -702,7 +703,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, 
> > VirtQueue *vq)
> >  virtio_net_flush_tx(n, vq);
> >  } else {
> >  qemu_mod_timer(n->tx_timer,
> > -   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> > +   qemu_get_clock(vm_clock) + n->tx_timeout);
> >  n->tx_timer_active = 1;
> >  virtio_queue_set_notification(vq, 0);
> >  }
> > @@ -842,7 +843,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, 
> > int version_id)
> >  
> >  if (n->tx_timer_active) {
> >  qemu_mod_timer(n->tx_timer,
> > -   qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
> > +   qemu_get_clock(vm_clock) + n->tx_timeout);
> 
> I think I'm missing where this is stored?  Looks like migration
> would revert a changed tx_timeout back to 150us.

It's not stored, it can be instantiated on the migration target any way
you please and we can migrate between different values or even different
TX mitigation strategies.  If a non-default value is used on the source
and you want to maintain the same behavior, the target needs to be
started the same way.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "KVM internal error. Suberror: 1" with ancient 2.4 kernel as guest

2010-08-31 Thread Avi Kivity

 On 08/31/2010 06:49 PM, Avi Kivity wrote:

 On 08/31/2010 05:32 PM, Jiri Kosina wrote:

(qemu) x/5i $eip
0xc027a841:  movq   (%esi),%mm0
0xc027a844:  movq   0x8(%esi),%mm1
0xc027a848:  movq   0x10(%esi),%mm2
0xc027a84c:  movq   0x18(%esi),%mm3
0xc027a850:  movq   %mm0,(%edx)
===

Is there any issue with emulating MMX?



Yes.  MMX is not currently emulated.

If there's a command line option to disable the use of MMX you can try 
it, otherwise wait for it to be implemented (or implement it 
yourself).  I'll try to do it for 2.6.37, but can't promise anything.


You can also run qemu with -cpu qemu32,-mmx.  That will expose a cpu 
without mmx support; hopefully the guest kernel will see that and avoid 
mmx instructions.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread Jes Sorensen
On 08/31/10 18:44, Avi Kivity wrote:
>  On 08/31/2010 07:33 PM, Jes Sorensen wrote:
>> On 08/31/10 18:28, Avi Kivity wrote:
>>>   On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote:
 From: Jes Sorensen

 We regularly see bug reports over this one, however it is a write to
 a read-only register which some operating systems (including Linux)
 tend to write to once in a while.

 Ignore the writes since they do no harm.

>>> Does Linux write it with wrmsr_safe()?  If not, I don't see how it
>>> works.  If it does, then we shouldn't ignore the write, instead issue
>>> the #GP as usual, but be silent about it.
>> To be honest, I am having a hard time determining where the write
>> happens. I looked at this with Gleb and the rip obtained by
>> kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles
>> a wrmsr().
>>
>> It was either a
>> 560outb(0xfe, 0x64); /* pulse reset low */
>> or
>> 49asm volatile("sti; hlt": : :"memory");
>>
>> which makes no sense to me,
> 
> Just grep for the msr name in a guest kernel source that's known to
> trigger the message.

Been there, done that! This happens with an F13 kernel during reboot.
Ran the search on the expanded 2.6.32.8-149 tree and found no reference
to anything trying to write it, except for KVM backing up the flag, but
that shouldn't happen in the guest.

>>   but given it's x86, I am not sure if it
>> could have come from the BIOS or something during reboot?
> 
> The bios is the same for all kernels (and is unlikely to mess with
> performance counter msrs anyway).

I was fooled by this too, it's not a performance counter MSR, it's a CPU
frequency scaling MSR.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread Avi Kivity

 On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote:

From: Jes Sorensen

We regularly see bug reports over this one, however it is a write to
a read-only register which some operating systems (including Linux)
tend to write to once in a while.

Ignore the writes since they do no harm.



Does Linux write it with wrmsr_safe()?  If not, I don't see how it 
works.  If it does, then we shouldn't ignore the write, instead issue 
the #GP as usual, but be silent about it.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread Jes Sorensen
On 08/31/10 18:28, Avi Kivity wrote:
>  On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote:
>> From: Jes Sorensen
>>
>> We regularly see bug reports over this one, however it is a write to
>> a read-only register which some operating systems (including Linux)
>> tend to write to once in a while.
>>
>> Ignore the writes since they do no harm.
>>
> 
> Does Linux write it with wrmsr_safe()?  If not, I don't see how it
> works.  If it does, then we shouldn't ignore the write, instead issue
> the #GP as usual, but be silent about it.

To be honest, I am having a hard time determining where the write
happens. I looked at this with Gleb and the rip obtained by
kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles
a wrmsr().

It was either a
560 outb(0xfe, 0x64); /* pulse reset low */
or
49  asm volatile("sti; hlt": : :"memory");

which makes no sense to me, but given it's x86, I am not sure if it
could have come from the BIOS or something during reboot?

Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread Avi Kivity

 On 08/31/2010 07:33 PM, Jes Sorensen wrote:

On 08/31/10 18:28, Avi Kivity wrote:

  On 08/31/2010 03:17 PM, jes.soren...@redhat.com wrote:

From: Jes Sorensen

We regularly see bug reports over this one, however it is a write to
a read-only register which some operating systems (including Linux)
tend to write to once in a while.

Ignore the writes since they do no harm.


Does Linux write it with wrmsr_safe()?  If not, I don't see how it
works.  If it does, then we shouldn't ignore the write, instead issue
the #GP as usual, but be silent about it.

To be honest, I am having a hard time determining where the write
happens. I looked at this with Gleb and the rip obtained by
kvm_read_rip(vcpu) in the code doesn't indicate anything that resembles
a wrmsr().

It was either a
560 outb(0xfe, 0x64); /* pulse reset low */
or
49  asm volatile("sti; hlt": : :"memory");

which makes no sense to me,


Just grep for the msr name in a guest kernel source that's known to 
trigger the message.



  but given it's x86, I am not sure if it
could have come from the BIOS or something during reboot?


The bios is the same for all kernels (and is unlikely to mess with 
performance counter msrs anyway).



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] *** SUBJECT HERE ***

2010-08-31 Thread Avi Kivity

 On 08/31/2010 07:13 PM, Anthony Liguori wrote:


Just as an aside, does anyone have a good way to maintain the 00 
patches in series with repeated submissions?


I tried to store it in git as an empty commit but most of the git 
tooling doesn't work well with that.


I keep each posting in a -v1/ etc directory and copy-paste from that.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] *** SUBJECT HERE ***

2010-08-31 Thread Anthony Liguori

On 08/31/2010 11:10 AM, Avi Kivity wrote:

 On 08/31/2010 05:43 PM, Gerd Hoffmann wrote:


You better should pass '00*.patch'  instead of '00*' to git 
send-email so it doesn't mail out the editor backup file ;)




That's what I usually do - guess I slipped this time.


Just as an aside, does anyone have a good way to maintain the 00 patches 
in series with repeated submissions?


I tried to store it in git as an empty commit but most of the git 
tooling doesn't work well with that.


Regards,

Anthony Liguori


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] *** SUBJECT HERE ***

2010-08-31 Thread Avi Kivity

 On 08/31/2010 05:43 PM, Gerd Hoffmann wrote:


You better should pass '00*.patch'  instead of '00*' to git send-email 
so it doesn't mail out the editor backup file ;)




That's what I usually do - guess I slipped this time.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "KVM internal error. Suberror: 1" with ancient 2.4 kernel as guest

2010-08-31 Thread Avi Kivity

 On 08/31/2010 05:32 PM, Jiri Kosina wrote:

(qemu) x/5i $eip
0xc027a841:  movq   (%esi),%mm0
0xc027a844:  movq   0x8(%esi),%mm1
0xc027a848:  movq   0x10(%esi),%mm2
0xc027a84c:  movq   0x18(%esi),%mm3
0xc027a850:  movq   %mm0,(%edx)
===

Is there any issue with emulating MMX?



Yes.  MMX is not currently emulated.

If there's a command line option to disable the use of MMX you can try 
it, otherwise wait for it to be implemented (or implement it yourself).  
I'll try to do it for 2.6.37, but can't promise anything.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: How kvm handle extern interrupt when guest os is running

2010-08-31 Thread jemmy858585
I understand ,thank you.

i find this in intel system programming guide:
Acknowledge interrupt on exit. The “acknowledge interrupt on exit” VM-exit
control in the controlling VMCS controls processor behavior for external 
interrupt
acknowledgement. If the control is 1, the processor acknowledges the interrupt
controller to acquire the interrupt vector upon VM exit, and stores the vector 
in
the VM-exit interruption-information field. If the control is 0, the external
interrupt is not acknowledged during VM exit. Since RFLAGS.IF is automatically
cleared on VM exits due to external interrupts, VMM re-enabling of interrupts
(setting RFLAGS.IF = 1) initiates the external interrupt acknowledgement and
vectoring of the external interrupt through the monitor/host IDT.

and i find xen set  “acknowledge interrupt on exit” to 1. 


2010-08-31 



jemmy858585 



发件人: Avi Kivity 
发送时间: 2010-08-29  22:24:13 
收件人: jemmy858585 
抄送: kvm 
主题: Re: How kvm handle extern interrupt when guest os is running 
 
  On 08/29/2010 05:16 PM, jemmy858585 wrote:
> i didn't understand how kvm handle extern interrupt when guest os running.
> i think the extern interrupt will cause vm exit, then kvm should pass this 
> interrupt to host os , then the host os handle it.
> but i didn't found the correlative code.
>
> I think vmx_complete_interrupts and vmx_handle_exit function will be execute, 
> but i didn't found how pass the extern interrupt to host os.
>
> If anyone knows , please tell me,thanks.
>
The interrupt causes a vmexit but is not consumed by the processor.  
Instead, it remains pending.  As soon as the host enables interrupts the 
interrupt will be delivered.
-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] cgroups: fix API thinko

2010-08-31 Thread Michael S. Tsirkin
On Wed, Aug 25, 2010 at 02:35:20PM -0700, Andrew Morton wrote:
> On Fri, 06 Aug 2010 10:38:24 -0600
> Alex Williamson  wrote:
> 
> > On Fri, 2010-08-06 at 09:34 -0700, Sridhar Samudrala wrote:
> > > On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote:
> > > > cgroup_attach_task_current_cg API that have upstream is backwards: we
> > > > really need an API to attach to the cgroups from another process A to
> > > > the current one.
> > > >
> > > > In our case (vhost), a priveledged user wants to attach it's task to 
> > > > cgroups
> > > > from a less priveledged one, the API makes us run it in the other
> > > > task's context, and this fails.
> > > >
> > > > So let's make the API generic and just pass in 'from' and 'to' tasks.
> > > > Add an inline wrapper for cgroup_attach_task_current_cg to avoid
> > > > breaking bisect.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin
> > > > ---
> > > >
> > > > Paul, Li, Sridhar, could you please review the following
> > > > patch?
> > > >
> > > > I only compile-tested it due to travel, but looks
> > > > straight-forward to me.
> > > > Alex Williamson volunteered to test and report the results.
> > > > Sending out now for review as I might be offline for a bit.
> > > > Will only try to merge when done, obviously.
> > > >
> > > > If OK, I would like to merge this through -net tree,
> > > > together with the patch fixing vhost-net.
> > > > Let me know if that sounds ok.
> > > >
> > > > Thanks!
> > > >
> > > > This patch is on top of net-next, it is needed for fix
> > > > vhost-net regression in net-next, where a non-priveledged
> > > > process can't enable the device anymore:
> > > >
> > > > when qemu uses vhost, inside the ioctl call it
> > > > creates a thread, and tries to add
> > > > this thread to the groups of current, and it fails.
> > > > But we control the thread, so to solve the problem,
> > > > we really should tell it 'connect to out cgroups'.
> 
> So am I correct to assume that this change is now needed in 2.6.36, and
> unneeded in 2.6.35?

Yes, I think so. Unless there are objections, I intend to merge this
(with the review fixes) through net-2.6 together with a vhost-net patch
that depends on this fix.

> Can it affect the userspace<->kernel API in amy manner?  If so, it
> should be backported into earlier kernels to reduce the number of
> incompatible kernels out there.

I think it doesn't affect anything except 2.6.36-rcX,
earlier kernels didn't use this API.

> Paul, did you have any comments?
> 
> I didn't see any update in response to the minor review comments, so...
> 
> 
>  include/linux/cgroup.h |1 +
>  kernel/cgroup.c|6 +++---
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff -puN include/linux/cgroup.h~cgroups-fix-api-thinko-fix 
> include/linux/cgroup.h
> --- a/include/linux/cgroup.h~cgroups-fix-api-thinko-fix
> +++ a/include/linux/cgroup.h
> @@ -579,6 +579,7 @@ void cgroup_iter_end(struct cgroup *cgrp
>  int cgroup_scan_tasks(struct cgroup_scanner *scan);
>  int cgroup_attach_task(struct cgroup *, struct task_struct *);
>  int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
> +
>  static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
>  {
>   return cgroup_attach_task_all(current, tsk);
> diff -puN kernel/cgroup.c~cgroups-fix-api-thinko-fix kernel/cgroup.c
> --- a/kernel/cgroup.c~cgroups-fix-api-thinko-fix
> +++ a/kernel/cgroup.c
> @@ -1798,13 +1798,13 @@ out:
>  int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>  {
>   struct cgroupfs_root *root;
> - struct cgroup *cur_cg;
>   int retval = 0;
>  
>   cgroup_lock();
>   for_each_active_root(root) {
> - cur_cg = task_cgroup_from_root(from, root);
> - retval = cgroup_attach_task(cur_cg, tsk);
> + struct cgroup *from_cg = task_cgroup_from_root(from, root);
> +
> + retval = cgroup_attach_task(from_cg, tsk);
>   if (retval)
>   break;
>   }
> _
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM call minutes for August 31

2010-08-31 Thread Chris Wright
QMP/QPI
- declare what's in 0.13 supported
  - means reasonable effort to avoid breaking something (deprecation is
possible)
- things that are left, shallow patch or human monitor conversion
- how to move forward?
  - need to improve interfaces (no argument there)
  - internal vs. external interfaces
  - QMP == external, stable, extensible, discoverable, fwd/back compat
  - internal == C, data structures, changeable, non-stable
  - redefine internal interfaces and work up?
- this addresses concern that most changes are in monitor.c
- and addresses the concern that we aren't defining proper
  extensible top level interfaces
  - work top down from external?
- map to internal details...fix internals when external is
  hard/impossible based on internals
- need to focus on QMP command addition in the face of internal details
  - no disagreement there
- decouple monitor and QMP
- sane interfaces (proposal for migration from Anthony)
- error issues...

0.13 schedule
- rc1 tagged locally and under test, once testing completes, upload,
  then one week to fix any outstanding issues
  - will push tag later today and upload, announce once propagates (24hrs-ish)

qemu-kvm integration
- not getting a lot of cycles
- nothing major that anthony won't pull
  - extboot still
- performance
  - i/o threading model (merge both and fix in-tree)
- in-kernel apic
- device assignement (vfio against qemu tree)
- disable ia64
- avi will look at doing the pull request
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] *** SUBJECT HERE ***

2010-08-31 Thread Gerd Hoffmann

On 08/31/10 15:30, Avi Kivity wrote:

On 08/31/2010 04:25 PM, Avi Kivity wrote:

*** BLURB HERE ***



That was supposed to be:

[PATCH 0/6] kvm_stat tracepoint support

This patchset allows kvm_stat to display the information exposed by kvm
tracepoints.


That was there too.

You better should pass '00*.patch'  instead of '00*' to git send-email 
so it doesn't mail out the editor backup file ;)


cheers,
  Gerd

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


"KVM internal error. Suberror: 1" with ancient 2.4 kernel as guest

2010-08-31 Thread Jiri Kosina
Hi,

when migrating ancient machine to become KVM guest, I am facing a problem 
that KVM gives me the error below when being passed oldish vmlinuz image:

=== 
# qemu-kvm -kernel vmlinuz-2.4.33
KVM internal error. Suberror: 1
rax  rbx 003e rcx  rdx 
c1485180
rsi c00b8000 rdi c1485180 rsp c0305f70 rbp 
0fa0
r8   r9   r10  r11 

r12  r13  r14  r15 

rip c027a841 rflags 0006
cs 0010 (/ p 1 dpl 0 db 1 s 1 type b l 0 g 1 avl 0)
ds 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0)
es 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0)
ss 0018 (/bfff p 1 dpl 0 db 1 s 1 type 7 l 0 g 1 avl 0)
fs 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0)
gs 0018 (/ p 1 dpl 0 db 1 s 1 type 3 l 0 g 1 avl 0)
tr 0060 (c0301d80/00eb p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt 0068 (c0289020/0027 p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt c0288860/7f
idt c0288000/7ff
cr0 80050033 cr2 0 cr3 287000 cr4 90 cr8 0 efer 0
emulation failure, check dmesg for details
===

There is no message emitted in dmesg. 
The instructions around eip look reasonably valid from a quick look (this 
is monitor output from the time it crashes):

===
(qemu) x/5i $eip
0xc027a841:  movq   (%esi),%mm0
0xc027a844:  movq   0x8(%esi),%mm1
0xc027a848:  movq   0x10(%esi),%mm2
0xc027a84c:  movq   0x18(%esi),%mm3
0xc027a850:  movq   %mm0,(%edx)
===

Is there any issue with emulating MMX?

This is with 0.12.3 KVM module on 2.6.32.

The kernel image which triggers this can be downloaded from

http://www.jikos.cz/jikos/junk/vmlinuz-2.4.33

Thanks,

-- 
Jiri Kosina
SUSE Labs, Novell Inc.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: VIA Nano processor U2250 (1.6GHz Capable)

2010-08-31 Thread Avi Kivity

 On 08/30/2010 08:35 PM, Pierre-Philipp Braun wrote:

Hi,

I'm having a little issue starting a KVM guest.  Is that a known Issue on
VIA Nano vmx capable processors ?


It's a bug in the cpu.  It's possible that 2.6.37 will not trigger that 
bug, though.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] *** SUBJECT HERE ***

2010-08-31 Thread Avi Kivity

 On 08/31/2010 04:25 PM, Avi Kivity wrote:

*** BLURB HERE ***



That was supposed to be:

[PATCH 0/6] kvm_stat tracepoint support

This patchset allows kvm_stat to display the information exposed by kvm
tracepoints.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for August 31

2010-08-31 Thread Anthony Liguori

On 08/31/2010 01:21 AM, Avi Kivity wrote:

 On 08/31/2010 12:54 AM, Juan Quintela wrote:

Please send in any agenda items you are interested in covering.



zero (dot|point) thirteen schedule


-rc1 is tagged locally and undergoing regression testing.  Once that 
finishes (later today), I'll push and upload.  Once it's announced (~24 
hours later) we've got one week until 0.13.0 to bring in additional 
stable fixes.


Regards,

Anthony Liguori


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] kvm_stat: provide detailed kvm_exit:exit_reason display

2010-08-31 Thread Avi Kivity
In addition to displaying kvm_exit as an aggregate counter, use perf_event's
filter capability to count individual exit reasons.

Signed-off-by: Avi Kivity 
---
 kvm/kvm_stat |  156 +-
 1 files changed, 155 insertions(+), 1 deletions(-)

diff --git a/kvm/kvm_stat b/kvm/kvm_stat
index 9b0392b..677683a 100755
--- a/kvm/kvm_stat
+++ b/kvm/kvm_stat
@@ -16,6 +16,143 @@ class DebugfsProvider(object):
 return int(file(self.base + '/' + key).read())
 return dict([(key, val(key)) for key in self._fields])
 
+vmx_exit_reasons = {
+0: 'EXCEPTION_NMI',
+1: 'EXTERNAL_INTERRUPT',
+2: 'TRIPLE_FAULT',
+7: 'PENDING_INTERRUPT',
+8: 'NMI_WINDOW',
+9: 'TASK_SWITCH',
+10: 'CPUID',
+12: 'HLT',
+14: 'INVLPG',
+15: 'RDPMC',
+16: 'RDTSC',
+18: 'VMCALL',
+19: 'VMCLEAR',
+20: 'VMLAUNCH',
+21: 'VMPTRLD',
+22: 'VMPTRST',
+23: 'VMREAD',
+24: 'VMRESUME',
+25: 'VMWRITE',
+26: 'VMOFF',
+27: 'VMON',
+28: 'CR_ACCESS',
+29: 'DR_ACCESS',
+30: 'IO_INSTRUCTION',
+31: 'MSR_READ',
+32: 'MSR_WRITE',
+33: 'INVALID_STATE',
+36: 'MWAIT_INSTRUCTION',
+39: 'MONITOR_INSTRUCTION',
+40: 'PAUSE_INSTRUCTION',
+41: 'MCE_DURING_VMENTRY',
+43: 'TPR_BELOW_THRESHOLD',
+44: 'APIC_ACCESS',
+48: 'EPT_VIOLATION',
+49: 'EPT_MISCONFIG',
+54: 'WBINVD',
+55: 'XSETBV',
+}
+
+svm_exit_reasons = {
+0x000: 'READ_CR0',
+0x003: 'READ_CR3',
+0x004: 'READ_CR4',
+0x008: 'READ_CR8',
+0x010: 'WRITE_CR0',
+0x013: 'WRITE_CR3',
+0x014: 'WRITE_CR4',
+0x018: 'WRITE_CR8',
+0x020: 'READ_DR0',
+0x021: 'READ_DR1',
+0x022: 'READ_DR2',
+0x023: 'READ_DR3',
+0x024: 'READ_DR4',
+0x025: 'READ_DR5',
+0x026: 'READ_DR6',
+0x027: 'READ_DR7',
+0x030: 'WRITE_DR0',
+0x031: 'WRITE_DR1',
+0x032: 'WRITE_DR2',
+0x033: 'WRITE_DR3',
+0x034: 'WRITE_DR4',
+0x035: 'WRITE_DR5',
+0x036: 'WRITE_DR6',
+0x037: 'WRITE_DR7',
+0x040: 'EXCP_BASE',
+0x060: 'INTR',
+0x061: 'NMI',
+0x062: 'SMI',
+0x063: 'INIT',
+0x064: 'VINTR',
+0x065: 'CR0_SEL_WRITE',
+0x066: 'IDTR_READ',
+0x067: 'GDTR_READ',
+0x068: 'LDTR_READ',
+0x069: 'TR_READ',
+0x06a: 'IDTR_WRITE',
+0x06b: 'GDTR_WRITE',
+0x06c: 'LDTR_WRITE',
+0x06d: 'TR_WRITE',
+0x06e: 'RDTSC',
+0x06f: 'RDPMC',
+0x070: 'PUSHF',
+0x071: 'POPF',
+0x072: 'CPUID',
+0x073: 'RSM',
+0x074: 'IRET',
+0x075: 'SWINT',
+0x076: 'INVD',
+0x077: 'PAUSE',
+0x078: 'HLT',
+0x079: 'INVLPG',
+0x07a: 'INVLPGA',
+0x07b: 'IOIO',
+0x07c: 'MSR',
+0x07d: 'TASK_SWITCH',
+0x07e: 'FERR_FREEZE',
+0x07f: 'SHUTDOWN',
+0x080: 'VMRUN',
+0x081: 'VMMCALL',
+0x082: 'VMLOAD',
+0x083: 'VMSAVE',
+0x084: 'STGI',
+0x085: 'CLGI',
+0x086: 'SKINIT',
+0x087: 'RDTSCP',
+0x088: 'ICEBP',
+0x089: 'WBINVD',
+0x08a: 'MONITOR',
+0x08b: 'MWAIT',
+0x08c: 'MWAIT_COND',
+0x400: 'NPF',
+}
+
+vendor_exit_reasons = {
+'vmx': vmx_exit_reasons,
+'svm': svm_exit_reasons,
+}
+
+exit_reasons = None
+
+for line in file('/proc/cpuinfo').readlines():
+if line.startswith('flags'):
+for flag in line.split():
+if flag in vendor_exit_reasons:
+exit_reasons = vendor_exit_reasons[flag]
+
+filters = {
+'kvm_exit': ('exit_reason', exit_reasons)
+}
+
+def invert(d):
+return dict((x[1], x[0]) for x in d.iteritems())
+
+for f in filters:
+filters[f] = (filters[f][0], invert(filters[f][1]))
+
 import ctypes, struct, array
 
 libc = ctypes.CDLL('libc.so.6')
@@ -62,6 +199,7 @@ PERF_FORMAT_TOTAL_TIME_RUNNING   = 1 << 1
 PERF_FORMAT_ID = 1 << 2
 PERF_FORMAT_GROUP  = 1 << 3
 
+import re
 
 class TracepointProvider(object):
 def __init__(self):
@@ -69,6 +207,13 @@ class TracepointProvider(object):
 fields = [f
   for f in os.listdir(self.base)
   if os.path.isdir(self.base + '/' + f)]
+extra = []
+for f in fields:
+if f in filters:
+subfield, values = filters[f]
+for name, number in values.iteritems():
+extra.append(f + '(' + name + ')')
+fields += extra
 self.select(fields)
 def fields(self):
 return self._fields
@@ -80,10 +225,14 @@ class TracepointProvider(object):
 for cpu in self.cpus:
 group_leader = -1
 for f in _fields:
+fbase, sub = f, None
+m = re.match(r'(.*)\((.*)\)', f)
+if m:
+fbase, sub = m.groups()
 attr = perf_event_attr()
 attr.type = PERF_TYPE_TRACEPOINT
 attr.size = ctypes.sizeof(attr)
-id = int(file(self.base + f + '/id').rea

[PATCH 0/6] *** SUBJECT HERE ***

2010-08-31 Thread Avi Kivity
*** BLURB HERE ***

Avi Kivity (6):
  kvm_stat: refactor to separate stats provider from difference engine
  kvm_stat: implement tracepoint stats provider
  kvm_stat: provide detailed kvm_exit:exit_reason display
  kvm_stat: sort tui output according to highest occurence
  kvm_stat: increase label width
  kvm_stat: be slower

 kvm/kvm_stat |  297 +++---
 1 files changed, 285 insertions(+), 12 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] kvm_stat: be slower

2010-08-31 Thread Avi Kivity
More information is displayed, so more time is need to process the information.

Signed-off-by: Avi Kivity 
---
 kvm/kvm_stat |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kvm/kvm_stat b/kvm/kvm_stat
index b7bc846..e68ca4e 100755
--- a/kvm/kvm_stat
+++ b/kvm/kvm_stat
@@ -327,7 +327,7 @@ def tui(screen, stats):
 
 while True:
 refresh()
-curses.halfdelay(10)
+curses.halfdelay(30)
 try:
 c = screen.getkey()
 if c == 'q':
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] kvm_stat: implement tracepoint stats provider

2010-08-31 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 kvm/kvm_stat |  101 +-
 1 files changed, 100 insertions(+), 1 deletions(-)

diff --git a/kvm/kvm_stat b/kvm/kvm_stat
index 75424fc..9b0392b 100755
--- a/kvm/kvm_stat
+++ b/kvm/kvm_stat
@@ -16,6 +16,100 @@ class DebugfsProvider(object):
 return int(file(self.base + '/' + key).read())
 return dict([(key, val(key)) for key in self._fields])
 
+import ctypes, struct, array
+
+libc = ctypes.CDLL('libc.so.6')
+syscall = libc.syscall
+class perf_event_attr(ctypes.Structure):
+_fields_ = [('type', ctypes.c_uint32),
+('size', ctypes.c_uint32),
+('config', ctypes.c_uint64),
+('sample_freq', ctypes.c_uint64),
+('sample_type', ctypes.c_uint64),
+('read_format', ctypes.c_uint64),
+('flags', ctypes.c_uint64),
+('wakeup_events', ctypes.c_uint32),
+('bp_type', ctypes.c_uint32),
+('bp_addr', ctypes.c_uint64),
+('bp_len', ctypes.c_uint64),
+]
+def _perf_event_open(attr, pid, cpu, group_fd, flags):
+return syscall(298, ctypes.pointer(attr), ctypes.c_int(pid),
+   ctypes.c_int(cpu), ctypes.c_int(group_fd),
+   ctypes.c_long(flags))
+
+PERF_TYPE_HARDWARE = 0
+PERF_TYPE_SOFTWARE = 1
+PERF_TYPE_TRACEPOINT   = 2
+PERF_TYPE_HW_CACHE = 3
+PERF_TYPE_RAW  = 4
+PERF_TYPE_BREAKPOINT   = 5
+
+PERF_SAMPLE_IP = 1 << 0
+PERF_SAMPLE_TID= 1 << 1
+PERF_SAMPLE_TIME   = 1 << 2
+PERF_SAMPLE_ADDR   = 1 << 3
+PERF_SAMPLE_READ   = 1 << 4
+PERF_SAMPLE_CALLCHAIN  = 1 << 5
+PERF_SAMPLE_ID = 1 << 6
+PERF_SAMPLE_CPU= 1 << 7
+PERF_SAMPLE_PERIOD = 1 << 8
+PERF_SAMPLE_STREAM_ID  = 1 << 9
+PERF_SAMPLE_RAW= 1 << 10
+
+PERF_FORMAT_TOTAL_TIME_ENABLED = 1 << 0
+PERF_FORMAT_TOTAL_TIME_RUNNING = 1 << 1
+PERF_FORMAT_ID = 1 << 2
+PERF_FORMAT_GROUP  = 1 << 3
+
+
+class TracepointProvider(object):
+def __init__(self):
+self.base = '/sys/kernel/debug/tracing/events/kvm/'
+fields = [f
+  for f in os.listdir(self.base)
+  if os.path.isdir(self.base + '/' + f)]
+self.select(fields)
+def fields(self):
+return self._fields
+def select(self, _fields):
+self._fields = _fields
+self.cpus = [0, 1]
+fds = []
+self.group_leaders = []
+for cpu in self.cpus:
+group_leader = -1
+for f in _fields:
+attr = perf_event_attr()
+attr.type = PERF_TYPE_TRACEPOINT
+attr.size = ctypes.sizeof(attr)
+id = int(file(self.base + f + '/id').read())
+attr.config = id
+attr.sample_type = (PERF_SAMPLE_RAW
+| PERF_SAMPLE_TIME
+| PERF_SAMPLE_CPU)
+attr.sample_period = 1
+attr.read_format = PERF_FORMAT_GROUP
+fd = _perf_event_open(attr, -1, cpu, group_leader, 0)
+if fd == -1:
+raise Exception('perf_event_open failed')
+if group_leader == -1:
+group_leader = fd
+fds.append(fd)
+self.group_leaders.append(group_leader)
+self.fds = fds
+self.files = [os.fdopen(group_leader)
+  for group_leader in self.group_leaders]
+def read(self):
+ret = dict([(f, 0) for f in self._fields])
+bytes = 8 * (1 + len(self._fields))
+fmt = '' + 'q' * len(self._fields)
+for file in self.files:
+a = struct.unpack(fmt, file.read(bytes))
+for field, val in zip(self._fields, a):
+ret[field] += val
+return ret
+
 class Stats:
 def __init__(self, provider, fields = None):
 def wanted(key):
@@ -133,7 +227,12 @@ options.add_option('-f', '--fields',
)
 (options, args) = options.parse_args(sys.argv)
 
-stats = Stats(provider = DebugfsProvider(), fields = options.fields)
+try:
+provider = TracepointProvider()
+except:
+provider = DebugfsProvider()
+
+stats = Stats(provider, fields = options.fields)
 
 if options.log:
 log(stats)
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6] kvm_stat: increase label width

2010-08-31 Thread Avi Kivity
With kvm_exit drill down, labels are pretty large.

Signed-off-by: Avi Kivity 
---
 kvm/kvm_stat |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kvm/kvm_stat b/kvm/kvm_stat
index c55bf88..b7bc846 100755
--- a/kvm/kvm_stat
+++ b/kvm/kvm_stat
@@ -295,7 +295,7 @@ if not os.access('/sys/kernel/debug/kvm', os.F_OK):
 print "and ensure the kvm modules are loaded"
 sys.exit(1)
 
-label_width = 20
+label_width = 40
 number_width = 10
 
 def tui(screen, stats):
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6] kvm_stat: sort tui output according to highest occurence

2010-08-31 Thread Avi Kivity
With plenty of counters and wide, short screens it's hard to see what's on
the top.  So sort the output.

Signed-off-by: Avi Kivity 
---
 kvm/kvm_stat |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kvm/kvm_stat b/kvm/kvm_stat
index 677683a..c55bf88 100755
--- a/kvm/kvm_stat
+++ b/kvm/kvm_stat
@@ -306,7 +306,12 @@ def tui(screen, stats):
 screen.addstr(0, 0, 'kvm statistics')
 row = 2
 s = stats.get()
-for key in sorted(s.keys()):
+def sortkey(x):
+if s[x][1]:
+return -s[x][1]
+else:
+return x
+for key in sorted(s.keys(), key = sortkey):
 if row >= screen.getmaxyx()[0]:
 break
 values = s[key]
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6] kvm_stat tracepoint support

2010-08-31 Thread Avi Kivity
This patchset allows kvm_stat to display the information exposed by kvm
tracepoints.

Avi Kivity (6):
  kvm_stat: refactor to separate stats provider from difference engine
  kvm_stat: implement tracepoint stats provider
  kvm_stat: provide detailed kvm_exit:exit_reason display
  kvm_stat: sort tui output according to highest occurence
  kvm_stat: increase label width
  kvm_stat: be slower

 kvm/kvm_stat |  297 +++---
 1 files changed, 285 insertions(+), 12 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] kvm_stat: refactor to separate stats provider from difference engine

2010-08-31 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 kvm/kvm_stat |   33 -
 1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/kvm/kvm_stat b/kvm/kvm_stat
index 21aff5b..75424fc 100755
--- a/kvm/kvm_stat
+++ b/kvm/kvm_stat
@@ -3,21 +3,36 @@
 import curses
 import sys, os, time, optparse
 
+class DebugfsProvider(object):
+def __init__(self):
+self.base = '/sys/kernel/debug/kvm'
+self._fields = os.listdir(self.base)
+def fields(self):
+return self._fields
+def select(self, fields):
+self._fields = fields
+def read(self):
+def val(key):
+return int(file(self.base + '/' + key).read())
+return dict([(key, val(key)) for key in self._fields])
+
 class Stats:
-def __init__(self, fields = None):
+def __init__(self, provider, fields = None):
 def wanted(key):
 import re
 if not fields:
 return True
 return re.match(fields, key) != None
-self.base = '/sys/kernel/debug/kvm'
-self.values = {}
-for key in os.listdir(self.base):
-if wanted(key):
-self.values[key] = None
+self.provider = provider
+self.values = dict([(key, None)
+for key in provider.fields()
+if wanted(key)])
+self.provider.select(self.values.keys())
 def get(self):
-for key, oldval in self.values.iteritems():
-newval = int(file(self.base + '/' + key).read())
+new = self.provider.read()
+for key in self.provider.fields():
+oldval = self.values[key]
+newval = new[key]
 newdelta = None
 if oldval is not None:
 newdelta = newval - oldval[0]
@@ -118,7 +133,7 @@ options.add_option('-f', '--fields',
)
 (options, args) = options.parse_args(sys.argv)
 
-stats = Stats(fields = options.fields)
+stats = Stats(provider = DebugfsProvider(), fields = options.fields)
 
 if options.log:
 log(stats)
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it possible to live migrate guest OS'es between different versions of kvm/qemu-kvm?

2010-08-31 Thread Avi Kivity

 On 08/30/2010 11:16 PM, Brian Jackson wrote:

On Monday, August 30, 2010 07:04:51 am Nils Cant wrote:

Hey guys,

next try is without libvirt, but still no joy.

After issuing 'migrate -d' on my sending host (qemu-kvm 0.11.0), I
get the following output on the receiving host (qemu-kvm 0.12.4):


A quick search of this and/or the qemu mailing lists would have told you this
is unsupported.



It should work between 0.12 and 0.13, though.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread Jes . Sorensen
From: Jes Sorensen 

We regularly see bug reports over this one, however it is a write to
a read-only register which some operating systems (including Linux)
tend to write to once in a while.

Ignore the writes since they do no harm.

Signed-off-by: Jes Sorensen 
---
 arch/x86/kvm/x86.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4014d6c..2e1ae7f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1439,6 +1439,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
pr_unimpl(vcpu, "unimplemented perfctr wrmsr: "
"0x%x data 0x%llx\n", msr, data);
break;
+   case MSR_IA32_PERF_STATUS:
+   /*
+* Write to read-only register, has no effect, so
+* lets not bother users with a warning about this.
+*/
+   break;
case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
if (kvm_hv_msr_partition_wide(msr)) {
int r;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests v2 7/8] Add a simple kvmclock driver

2010-08-31 Thread Jason Wang
Most of the codes were borrowed from arxh/x86/kernel/kvmclock.c. A
special bit: PV_CLOCK_CYCLE_RAW_TEST_BIT is used to notify the driver
to return unadjusted cycles.

Signed-off-by: Jason Wang 
---
 x86/kvmclock.c |  279 
 x86/kvmclock.h |   60 
 2 files changed, 339 insertions(+), 0 deletions(-)
 create mode 100644 x86/kvmclock.c
 create mode 100644 x86/kvmclock.h

diff --git a/x86/kvmclock.c b/x86/kvmclock.c
new file mode 100644
index 000..0624da3
--- /dev/null
+++ b/x86/kvmclock.c
@@ -0,0 +1,279 @@
+#include "libcflat.h"
+#include "smp.h"
+#include "atomic.h"
+#include "processor.h"
+#include "kvmclock.h"
+
+#define unlikely(x)__builtin_expect(!!(x), 0)
+#define likely(x)  __builtin_expect(!!(x), 1)
+
+
+struct pvclock_vcpu_time_info __attribute__((aligned(4))) hv_clock[MAX_CPU];
+struct pvclock_wall_clock wall_clock;
+static unsigned char valid_flags = 0;
+static atomic64_t last_value = ATOMIC64_INIT(0);
+
+/*
+ * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
+ * yielding a 64-bit result.
+ */
+static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
+{
+   u64 product;
+#ifdef __i386__
+   u32 tmp1, tmp2;
+#endif
+
+   if (shift < 0)
+   delta >>= -shift;
+   else
+   delta <<= shift;
+
+#ifdef __i386__
+   __asm__ (
+   "mul  %5   ; "
+   "mov  %4,%%eax ; "
+   "mov  %%edx,%4 ; "
+   "mul  %5   ; "
+   "xor  %5,%5; "
+   "add  %4,%%eax ; "
+   "adc  %5,%%edx ; "
+   : "=A" (product), "=r" (tmp1), "=r" (tmp2)
+   : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif defined(__x86_64__)
+   __asm__ (
+   "mul %%rdx ; shrd $32,%%rdx,%%rax"
+   : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+   return product;
+}
+
+#ifdef __i386__
+# define do_div(n,base) ({ \
+   u32 __base = (base);\
+   u32 __rem;  \
+   __rem = ((u64)(n)) % __base;\
+   (n) = ((u64)(n)) / __base;  \
+   __rem;  \
+ })
+#else
+u32 __attribute__((weak)) __div64_32(u64 *n, u32 base)
+{
+   u64 rem = *n;
+   u64 b = base;
+   u64 res, d = 1;
+   u32 high = rem >> 32;
+
+   /* Reduce the thing a bit first */
+   res = 0;
+   if (high >= base) {
+   high /= base;
+   res = (u64) high << 32;
+   rem -= (u64) (high*base) << 32;
+   }
+
+   while ((s64)b > 0 && b < rem) {
+   b = b+b;
+   d = d+d;
+   }
+
+   do {
+   if (rem >= b) {
+   rem -= b;
+   res += d;
+   }
+   b >>= 1;
+   d >>= 1;
+   } while (d);
+
+   *n = res;
+   return rem;
+}
+
+# define do_div(n,base) ({ \
+   u32 __base = (base);\
+   u32 __rem;  \
+   (void)(((typeof((n)) *)0) == ((u64 *)0));   \
+   if (likely(((n) >> 32) == 0)) { \
+   __rem = (u32)(n) % __base;  \
+   (n) = (u32)(n) / __base;\
+   } else  \
+   __rem = __div64_32(&(n), __base);   \
+   __rem;  \
+ })
+#endif
+
+/**
+ * set_normalized_timespec - set timespec sec and nsec parts and normalize
+ *
+ * @ts:pointer to timespec variable to be set
+ * @sec:   seconds to set
+ * @nsec:  nanoseconds to set
+ *
+ * Set seconds and nanoseconds field of a timespec variable and
+ * normalize to the timespec storage format
+ *
+ * Note: The tv_nsec part is always in the range of
+ * 0 <= tv_nsec < NSEC_PER_SEC
+ * For negative values only the tv_sec field is negative !
+ */
+void set_normalized_timespec(struct timespec *ts, long sec, s64 nsec)
+{
+   while (nsec >= NSEC_PER_SEC) {
+   /*
+* The following asm() prevents the compiler from
+* optimising this loop into a modulo operation. See
+* also __iter_div_u64_rem() in include/linux/time.h
+*/
+   asm("" : "+rm"(nsec));
+   nsec -= NSEC_PER_SEC;
+   ++sec;
+   }
+   while (nsec < 0) {
+   asm("" : "+rm"(nsec));
+   nsec += NSEC_PER_SEC;
+   --sec;
+   }
+   ts->tv_sec = sec;
+   ts->tv_nsec = nsec;
+}
+
+static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+{
+   u64 

[PATCH kvm-unit-tests v2 8/8] Add tests for kvm-clock

2010-08-31 Thread Jason Wang
This patch implements three tests for kvmclock. First one check whether
the date of time returned by kvmclock matches the value got from
host. Second one check whether the cycle of kvmclock grows
monotonically in smp guest and the count of warps and stalls are also
recorded. The last test is a performance test to measure the guest
cycles cost when trying to read cycle of kvmclock.

Three parameters were accepted by the test: test loops, seconds
since 1970-01-01 00:00:00 UTC which could be easily get through date
+%s and the max accepted offset value between the tod of guest and
host.

Signed-off-by: Jason Wang 
---
 config-x86-common.mak |6 +-
 x86/README|1 
 x86/kvmclock_test.c   |  166 +
 x86/unittests.cfg |5 +
 4 files changed, 177 insertions(+), 1 deletions(-)
 create mode 100644 x86/kvmclock_test.c

diff --git a/config-x86-common.mak b/config-x86-common.mak
index b8ca859..b541c1c 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -26,7 +26,8 @@ FLATLIBS = lib/libcflat.a $(libgcc)
 tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/smptest.flat  $(TEST_DIR)/port80.flat \
$(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
-   $(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat
+   $(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
+   $(TEST_DIR)/kvmclock_test.flat
 
 tests_and_config = $(TEST_DIR)/*.flat $(TEST_DIR)/unittests.cfg
 
@@ -70,6 +71,9 @@ $(TEST_DIR)/rmap_chain.flat: $(cstart.o) 
$(TEST_DIR)/rmap_chain.o \
 
 $(TEST_DIR)/svm.flat: $(cstart.o) $(TEST_DIR)/vm.o
 
+$(TEST_DIR)/kvmclock_test.flat: $(cstart.o) $(TEST_DIR)/kvmclock.o \
+$(TEST_DIR)/kvmclock_test.o
+
 arch_clean:
$(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat \
$(TEST_DIR)/.*.d $(TEST_DIR)/lib/.*.d $(TEST_DIR)/lib/*.o
diff --git a/x86/README b/x86/README
index ab5a2ae..215fb2f 100644
--- a/x86/README
+++ b/x86/README
@@ -12,3 +12,4 @@ sieve: heavy memory access with no paging and with paging 
static and with paging
 smptest: run smp_id() on every cpu and compares return value to number
 tsc: write to tsc(0) and write to tsc(1000) and read it back
 vmexit: long loops for each: cpuid, vmcall, mov_from_cr8, mov_to_cr8, 
inl_pmtimer, ipi, ipi+halt
+kvmclock_test: test of wallclock, monotonic cycle and performance of kvmclock
diff --git a/x86/kvmclock_test.c b/x86/kvmclock_test.c
new file mode 100644
index 000..5b14ae2
--- /dev/null
+++ b/x86/kvmclock_test.c
@@ -0,0 +1,166 @@
+#include "libcflat.h"
+#include "smp.h"
+#include "atomic.h"
+#include "processor.h"
+#include "kvmclock.h"
+
+#define DEFAULT_TEST_LOOPS 1L
+#define DEFAULT_THRESHOLD  5L
+
+struct test_info {
+struct spinlock lock;
+long loops;   /* test loops */
+u64 warps;/* warp count */
+u64 stalls;   /* stall count */
+long long worst;  /* worst warp */
+volatile cycle_t last;/* last cycle seen by test */
+atomic_t ncpus;   /* number of cpu in the test*/
+int check;/* check cycle ? */
+};
+
+struct test_info ti[4];
+
+static int wallclock_test(long sec, long threshold)
+{
+long ksec, offset;
+struct timespec ts;
+
+printf("Wallclock test, threshold %ld\n", threshold);
+kvm_get_wallclock(&ts);
+ksec = ts.tv_sec;
+
+offset = ksec - sec;
+printf("Seconds get from host: %ld\n", sec);
+printf("Seconds get from kvmclock: %ld\n", ksec);
+printf("Offset:%ld\n", offset);
+
+if (offset > threshold || offset < -threshold) {
+printf("offset too large!\n");
+return 1;
+}
+
+return 0;
+}
+
+static void kvm_clock_test(void *data)
+{
+struct test_info *hv_test_info = (struct test_info *)data;
+long i, check = hv_test_info->check;
+
+for (i = 0; i < hv_test_info->loops; i++){
+cycle_t t0, t1;
+long long delta;
+
+if (check == 0) {
+kvm_clock_read();
+continue;
+}
+
+spin_lock(&hv_test_info->lock);
+t1 = kvm_clock_read();
+t0 = hv_test_info->last;
+hv_test_info->last = kvm_clock_read();
+spin_unlock(&hv_test_info->lock);
+
+delta = t1 - t0;
+if (delta < 0) {
+spin_lock(&hv_test_info->lock);
+++hv_test_info->warps;
+if (delta < hv_test_info->worst){
+hv_test_info->worst = delta;
+printf("Worst warp %lld %\n", 
hv_test_info->worst);
+}
+spin_unloc

[PATCH kvm-unit-tests v2 5/8] Export tsc related helpers

2010-08-31 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 lib/x86/processor.h |   22 ++
 x86/tsc.c   |   16 +---
 x86/vmexit.c|   15 ---
 3 files changed, 23 insertions(+), 30 deletions(-)

diff --git a/lib/x86/processor.h b/lib/x86/processor.h
index 67d7ca5..c348808 100644
--- a/lib/x86/processor.h
+++ b/lib/x86/processor.h
@@ -258,4 +258,26 @@ static inline void sti(void)
 asm volatile ("sti");
 }
 
+static inline unsigned long long rdtsc()
+{
+   long long r;
+
+#ifdef __x86_64__
+   unsigned a, d;
+
+   asm volatile ("rdtsc" : "=a"(a), "=d"(d));
+   r = a | ((long long)d << 32);
+#else
+   asm volatile ("rdtsc" : "=A"(r));
+#endif
+   return r;
+}
+
+static inline void wrtsc(u64 tsc)
+{
+   unsigned a = tsc, d = tsc >> 32;
+
+   asm volatile("wrmsr" : : "a"(a), "d"(d), "c"(0x10));
+}
+
 #endif
diff --git a/x86/tsc.c b/x86/tsc.c
index 394a9c6..58f332d 100644
--- a/x86/tsc.c
+++ b/x86/tsc.c
@@ -1,19 +1,5 @@
 #include "libcflat.h"
-
-u64 rdtsc(void)
-{
-   unsigned a, d;
-
-   asm volatile("rdtsc" : "=a"(a), "=d"(d));
-   return a | (u64)d << 32;
-}
-
-void wrtsc(u64 tsc)
-{
-   unsigned a = tsc, d = tsc >> 32;
-
-   asm volatile("wrmsr" : : "a"(a), "d"(d), "c"(0x10));
-}
+#include "processor.h"
 
 void test_wrtsc(u64 t1)
 {
diff --git a/x86/vmexit.c b/x86/vmexit.c
index 84c384d..551083d 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -3,21 +3,6 @@
 #include "smp.h"
 #include "processor.h"
 
-static inline unsigned long long rdtsc()
-{
-   long long r;
-
-#ifdef __x86_64__
-   unsigned a, d;
-
-   asm volatile ("rdtsc" : "=a"(a), "=d"(d));
-   r = a | ((long long)d << 32);
-#else
-   asm volatile ("rdtsc" : "=A"(r));
-#endif
-   return r;
-}
-
 static unsigned int inl(unsigned short port)
 {
 unsigned int val;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests v2 6/8] Introduce atol()

2010-08-31 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 lib/libcflat.h |1 +
 lib/string.c   |   31 +++
 2 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 9cb90cf..d4ee2e0 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -50,6 +50,7 @@ extern void puts(const char *s);
 
 extern void *memset(void *s, int c, size_t n);
 
+extern long atol(const char *ptr);
 #define ARRAY_SIZE(_a)  (sizeof(_a)/sizeof((_a)[0]))
 
 #endif
diff --git a/lib/string.c b/lib/string.c
index acac3c0..1f19f5c 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -30,3 +30,34 @@ void *memset(void *s, int c, size_t n)
 
 return s;
 }
+
+long atol(const char *ptr)
+{
+long acc = 0;
+const char *s = ptr;
+int neg, c;
+
+while (*s == ' ' || *s == '\t')
+s++;
+if (*s == '-'){
+neg = 1;
+s++;
+} else {
+neg = 0;
+if (*s == '+')
+s++;
+}
+
+while (*s) {
+if (*s < '0' || *s > '9')
+break;
+c = *s - '0';
+acc = acc * 10 + c;
+s++;
+}
+
+if (neg)
+acc = -acc;
+
+return acc;
+}

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests v2 4/8] Introduce atomic operations

2010-08-31 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 config-x86-common.mak |1 
 lib/x86/atomic.c  |   37 +++
 lib/x86/atomic.h  |  164 +
 3 files changed, 202 insertions(+), 0 deletions(-)
 create mode 100644 lib/x86/atomic.c
 create mode 100644 lib/x86/atomic.h

diff --git a/config-x86-common.mak b/config-x86-common.mak
index fdb9e3e..b8ca859 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -10,6 +10,7 @@ cflatobjs += \
 
 cflatobjs += lib/x86/fwcfg.o
 cflatobjs += lib/x86/apic.o
+cflatobjs += lib/x86/atomic.o
 
 $(libcflat): LDFLAGS += -nostdlib
 $(libcflat): CFLAGS += -ffreestanding -I lib
diff --git a/lib/x86/atomic.c b/lib/x86/atomic.c
new file mode 100644
index 000..da74ff2
--- /dev/null
+++ b/lib/x86/atomic.c
@@ -0,0 +1,37 @@
+#include 
+#include "atomic.h"
+
+#ifdef __i386__
+
+u64 atomic64_cmpxchg(atomic64_t *v, u64 old, u64 new)
+{
+u32 low = new;
+u32 high = new >> 32;
+
+asm volatile("lock cmpxchg8b %1\n"
+ : "+A" (old),
+   "+m" (*(volatile long long *)&v->counter)
+ : "b" (low), "c" (high)
+ : "memory"
+ );
+
+return old;
+}
+
+#else
+
+u64 atomic64_cmpxchg(atomic64_t *v, u64 old, u64 new)
+{
+u64 ret;
+u64 _old = old;
+u64 _new = new;
+
+asm volatile("lock cmpxchgq %2,%1"
+ : "=a" (ret), "+m" (*(volatile long *)&v->counter)
+ : "r" (_new), "0" (_old)
+ : "memory"
+ );
+return ret;
+}
+
+#endif
diff --git a/lib/x86/atomic.h b/lib/x86/atomic.h
new file mode 100644
index 000..de2f033
--- /dev/null
+++ b/lib/x86/atomic.h
@@ -0,0 +1,164 @@
+#ifndef __ATOMIC_H
+#define __ATOMIC_H
+
+typedef struct {
+   volatile int counter;
+} atomic_t;
+
+#ifdef __i386__
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically reads the value of @v.
+ */
+static inline int atomic_read(const atomic_t *v)
+{
+   return v->counter;
+}
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer of type atomic_t
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static inline void atomic_set(atomic_t *v, int i)
+{
+   v->counter = i;
+}
+
+/**
+ * atomic_inc - increment atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1.
+ */
+static inline void atomic_inc(atomic_t *v)
+{
+   asm volatile("lock incl %0"
+: "+m" (v->counter));
+}
+
+/**
+ * atomic_dec - decrement atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically decrements @v by 1.
+ */
+static inline void atomic_dec(atomic_t *v)
+{
+   asm volatile("lock decl %0"
+: "+m" (v->counter));
+}
+
+typedef struct {
+   u64 __attribute__((aligned(8))) counter;
+} atomic64_t;
+
+#define ATOMIC64_INIT(val) { (val) }
+
+/**
+ * atomic64_read - read atomic64 variable
+ * @ptr:  pointer to type atomic64_t
+ *
+ * Atomically reads the value of @ptr and returns it.
+ */
+static inline u64 atomic64_read(atomic64_t *ptr)
+{
+   u64 res;
+
+   /*
+* Note, we inline this atomic64_t primitive because
+* it only clobbers EAX/EDX and leaves the others
+* untouched. We also (somewhat subtly) rely on the
+* fact that cmpxchg8b returns the current 64-bit value
+* of the memory location we are touching:
+*/
+   asm volatile("mov %%ebx, %%eax\n\t"
+ "mov %%ecx, %%edx\n\t"
+ "lock cmpxchg8b %1\n"
+ : "=&A" (res)
+ : "m" (*ptr)
+ );
+   return res;
+}
+
+u64 atomic64_cmpxchg(atomic64_t *v, u64 old, u64 new);
+
+#elif defined(__x86_64__)
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically reads the value of @v.
+ */
+static inline int atomic_read(const atomic_t *v)
+{
+   return v->counter;
+}
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer of type atomic_t
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static inline void atomic_set(atomic_t *v, int i)
+{
+   v->counter = i;
+}
+
+/**
+ * atomic_inc - increment atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically increments @v by 1.
+ */
+static inline void atomic_inc(atomic_t *v)
+{
+   asm volatile("lock incl %0"
+: "=m" (v->counter)
+: "m" (v->counter));
+}
+
+/**
+ * atomic_dec - decrement atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically decrements @v by 1.
+ */
+static inline void atomic_dec(atomic_t *v)
+{
+   asm volatile("lock decl %0"
+: "=m" (v->counter)
+: "m" (v->counter));
+}
+
+typedef struct {
+   long long counter;
+} atomic64_t;
+
+#define ATOMIC

[PATCH kvm-unit-tests v2 3/8] Introduce memory barriers.

2010-08-31 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 lib/x86/smp.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/lib/x86/smp.h b/lib/x86/smp.h
index c2e7350..df5fdba 100644
--- a/lib/x86/smp.h
+++ b/lib/x86/smp.h
@@ -1,6 +1,10 @@
 #ifndef __SMP_H
 #define __SMP_H
 
+#define mb()   asm volatile("mfence":::"memory")
+#define rmb()  asm volatile("lfence":::"memory")
+#define wmb()  asm volatile("sfence" ::: "memory")
+
 struct spinlock {
 int v;
 };

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests v2 2/8] Increase max_cpu to 64

2010-08-31 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 x86/cstart.S   |2 +-
 x86/cstart64.S |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/x86/cstart.S b/x86/cstart.S
index ae1fdbb..bc8d563 100644
--- a/x86/cstart.S
+++ b/x86/cstart.S
@@ -6,7 +6,7 @@ boot_idt = 0
 
 ipi_vector = 0x20
 
-max_cpus = 4
+max_cpus = 64
 
 .bss
 
diff --git a/x86/cstart64.S b/x86/cstart64.S
index ef09d82..71014d8 100644
--- a/x86/cstart64.S
+++ b/x86/cstart64.S
@@ -6,7 +6,7 @@ boot_idt = 0
 
 ipi_vector = 0x20
 
-max_cpus = 4
+max_cpus = 64
 
 .bss
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests v2 1/8] Introduce some type definiations.

2010-08-31 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 lib/libcflat.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index d0d3df2..9cb90cf 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -23,10 +23,14 @@
 #include 
 
 typedef unsigned char u8;
+typedef signed char s8;
 typedef unsigned short u16;
+typedef signed short s16;
 typedef unsigned u32;
+typedef signed s32;
 typedef unsigned long ulong;
 typedef unsigned long long u64;
+typedef signed long long s64;
 typedef unsigned long size_t;
 typedef _Bool bool;
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests v2 0/8] Tests for kvmclock

2010-08-31 Thread Jason Wang
The following series implements three tests for kvmclock:

Wall clock test: check whether the kvmclock could return a correct TOD
by pass the correct one to test and check it with the value returned
by kvmclock.

Monotonic cycle test: check whether the cycles of kvmclock grows
monotonically in smp guests.

Performance test: Measure the performance ( through guest TSC cycles )
of kvmclock driver for both with and without the guest side
adjustment.

Change since V1:

- Change the max cpu to 64 in order to test in larger systems.
- Make the per-cpu structure hv_clock 4 byte aligned.
- Fix the bug of the clobber list in atomic64_cmpxchg().
- Add the test of pure performance test based on guest TSC cycle.
- Record the count of stall when doing monotonic cycle test.
- Introduce some new type definitions.
- Fix the bug of the wallclock calculation and test.

Here is a test result I got form a guest with 64 vcpus on a 64 cores
intel X7550:

The test command is qemu-system-x86_64 -device testdev,chardev=testlog
-chardev file,id=testlog,path=msr.out -kernel ./x86/kvmclock_test.flat
-smp 64 --append 10

For 64bit guests:
The guest cycles spent on geting raw cycles is about 21543417.
The guest cycles spent on geting adjusted cycles is 1067264988.

Which means it takes about 50x more time on correcting the cycles
than just returning cycles supplied by hypervisor.

For 32bit guests:
The guest cycles spent on geting raw cycles is about 26916562.
The guest cycles spent on geting adjusted cycles is 2020119174.

Which means it taks about 75x more time on correcting the cycles than
just returning cycles supplied by hypervisor.

And for 4 core guests: we only take about 10x-15x more time on
correcting.

Pay attention:
1 The performance test just read cycle and do not check anything else.
2 I've run the test for serverl times, the result are similar.
3 The cycles were measured from guest, so maybe not accurate.

Here is a sample test results of 64bit guest with 64 vcpus:

..
Check the stability of raw cycle ...
Worst warp -10 
Worst warp -6537 
Worst warp -11170 
Worst warp -11370 
Worst warp -11372 
Worst warp -31754 
Worst warp -31876 
Worst warp -31904 
Worst warp -36771 
Worst warp -37382 
Worst warp -37688 
Worst warp -37707 
Total vcpus: 64
Test  loops: 10
Total warps:  53811
Total stalls: 408
Worst warp:   -37707
Raw cycle is not stable
Monotonic cycle test:
Total vcpus: 64
Test  loops: 10
Total warps:  0
Total stalls: 214
Worst warp:   0
Measure the performance of raw cycle ...
Total vcpus: 64
Test  loops: 10
TSC cycles:  21543417
Measure the performance of adjusted cycle ...
Total vcpus: 64
Test  loops: 10
TSC cycles:  1067264988

---

Jason Wang (8):
  Introduce some type definiations.
  Increase max_cpu to 64
  Introduce memory barriers.
  Introduce atomic operations
  Export tsc related helpers
  Introduce atol()
  Add a simple kvmclock driver
  Add tests for kvm-clock


 config-x86-common.mak |7 +
 lib/libcflat.h|5 +
 lib/string.c  |   31 +
 lib/x86/atomic.c  |   37 ++
 lib/x86/atomic.h  |  164 +
 lib/x86/processor.h   |   22 
 lib/x86/smp.h |4 +
 x86/README|1 
 x86/cstart.S  |2 
 x86/cstart64.S|2 
 x86/kvmclock.c|  279 +
 x86/kvmclock.h|   60 +++
 x86/kvmclock_test.c   |  166 +
 x86/tsc.c |   16 ---
 x86/unittests.cfg |5 +
 x86/vmexit.c  |   15 ---
 16 files changed, 783 insertions(+), 33 deletions(-)
 create mode 100644 lib/x86/atomic.c
 create mode 100644 lib/x86/atomic.h
 create mode 100644 x86/kvmclock.c
 create mode 100644 x86/kvmclock.h
 create mode 100644 x86/kvmclock_test.c

-- 
Jason Wang
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2 4/7] change kernel accounting to include steal time

2010-08-31 Thread Peter Zijlstra
On Mon, 2010-08-30 at 19:03 -0400, Rik van Riel wrote:
> 
> > I think it basically comes down to adding "sched_clock_unstolen()" which
> > the scheduler can use to measure time a process spends running, and
> > sched_clock() for measuring sleep times.  In the normal case,
> > sched_clock_unstolen() would be the same as sched_clock().
> 
> That requires the host to export (any time the guest is scheduled
> in), the amount of CPU time the VCPU thread has used, and the time
> the VCPU was scheduled in.
> 
> Since the VCPU must be running when it is examining these variables,
> it can calculate the additional time (since it was last scheduled)
> to account to the task, and remember the currently calculated time
> in its own per-vcpu variable, so next time it can get a delta again. 

I think its easier (and sufficient) for the host to tell the guest how
long it was _not_ running. That can simply be passed in when you start
the vcpu again and doesn't need a fancy communication channel.

The guests sched_clock() will measure wall time, the guests
sched_clock_stolen() will report the accumulation of these stolen times.

Then you can make sched_clock_unstolen() be sched_clock() -
sched_clock_stolen(). And like Jeremy said, if you make the sched_fair
stuff use sched_clock_unstolen() things should more or less work.

The problem with all that is that you'll start to schedule on unstolen
time instead of wall-time, which might not give the best results for
things like latencies etc.. but I guess that's one of the prices you pay
for using virt.

Also, like said yesterday, you need some factor in update_cpu_power(), a
quick hack might be to add all stolen time to sched_rt_avg_update().


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html