Re: performance trouble

2012-02-03 Thread David Cure
Hello,

On Thu, Feb 02, 2012 at 11:41:48AM +0100, David Cure wrote:
> 
> 	For kvm_stats, I mean it's better to have only one VM with one
> user for the test, so I'll send it this evening or tomorrow morning.

I've attached two snapshots of kvm_stat: the first taken just before
running the function and the second at the end.

Here is the vmstat 1 output taken while the function is running:


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  0  0 32601320  21360 1987760049  1123 14577 21601  8  4 88  0
 4  0  0 32601196  21360 19877600 0   141 16559 21315 12  3 84  0
 2  0  0 32601212  21360 19877600 0   128 17377 24949 10  4 86  0
 2  0  0 32601128  21360 19877600 0   120 13064 18986 13  3 85  0
 4  0  0 32601216  21360 19877600 0   284 16052 21569 11  4 85  0
 3  0  0 32601132  21360 19877600 0   192 15846 23323 10  3 87  0
 3  0  0 32601492  21360 19877600 0   564 19116 27478 10  4 85  0
 4  0  0 32601492  21360 19877600 0 0 16996 24286 14  5 81  0
 2  0  0 32601492  21360 19877600 0   661 15754 23371  9  4 87  0
 3  0  0 32601268  21360 1987760036 0 9808 13849 11  1 88  0
 2  0  0 32601304  21360 19877600 036 11846 13826 14  1 85  0
 3  0  0 32601304  21360 198776006437 9445 12263 18  1 81  0
 4  0  0 32601604  21360 19877600   29020 9959 14786 10  2 88  0

I hope you can see something strange ;)

David.
attachment: kvm_stat1.png
attachment: kvm_stat2.png

KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread Marcelo Tosatti

Increase recommended max vcpus from 64 to 160 (tested internally
at Red Hat).

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4610166..782d973 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -29,7 +29,7 @@
 #include <asm/msr-index.h>
 
 #define KVM_MAX_VCPUS 254
-#define KVM_SOFT_MAX_VCPUS 64
+#define KVM_SOFT_MAX_VCPUS 160
 #define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4



Re: KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread Sasha Levin
On Fri, 2012-02-03 at 12:28 -0200, Marcelo Tosatti wrote:
> Increase recommended max vcpus from 64 to 160 (tested internally
> at Red Hat).
> 
> Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Awesome!

Would it be possible to share the benchmarks?



Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-03 Thread Roopa Prabhu



On 2/2/12 10:58 AM, John Fastabend john.r.fastab...@intel.com wrote:

 On 2/2/2012 10:07 AM, Roopa Prabhu wrote:

snip..
 
 My patches were trying to do just this (unless I am missing something).
 
 
 Right I was trying enumerate the cases. Your patches 5,6 seem to use
 dev_{uc|mc}_{add|del} like this.
 
Ok

 
 I think this has some real advantages to the above scheme. First
 we get rid of _all_ the drivers having to add a bunch of new
 net_device ops and do it once in the layer above. This is nice
 for driver implementers but also because your feature becomes usable
 immediately and we don't have to wait for driver developers to implement
 it.
 
 Yes my patches were targeting towards this too. I had macvlan implement the
 netlink ops and macvlan internally was using the dev_uc_add and del routines
 to pass the addr lists to lower device.
 
 Yes. But I am missing why the VF ops and netlink extensions are useful. Or
 even the op/netlink extension into the PF for that matter.
 

This was to support something that Intel wanted. I think Ben described that
in another email. This was done in the v3 version of the patch. It is not
needed for the macvlan problem that this patch is trying to solve. We were
trying to make the new netlink interface fit other use cases if possible.

I believe at this point we are convinced that the hybrid acceleration with
PF/VF that Ben described can possibly be solved in other ways.

 
 
 Also it prunes down the number of netlink extensions being added
 here. 
 
 Additionally the existing semantics seem a bit strange to me on the
 netlink message side. Taking a quick look at the macvlan implementation
 it looks like every set has to have a complete list of address. But
 the dev_uc_add and dev_uc_del seem to be using a refcnt scheme so
 if I want to add a second address and then latter a third address
 how does that work?
 
 Every set has a complete list of addresses because, for macvlan non-passthru
 modes, in future we might want to have macvlan driver do the filtering (This
 is for the case when we have a single lower device and multiple macvlans)
 
 
 hmm but lists seem problematic when hooked up to the netdev uc and mc addr
 lists. Consider this case
 
 read uc_list  --- thread1: dumps unicast table
 add vlan  --- thread2: adds a vlan maybe inserting a uc addr
 write uc_list --- thread1: writes the table back + 1 addr
 
 Does the uc addr of the vlan get deleted? And this case

It should.

 
 read uc_list   --- dump table
 write uc_list  --- add a new filter A to the uc list
 read uc_list   --- dump table
 write uc_list  --- add a second filter B to the uc list
 
 Now based on your patch 4,5 it looks like the refcnt on the address A is
 two so to remove it I have to call set filters twice without the A addr.

I think this depends on the implementation of the handler. The macvtap
handler will remove A and add B if A is not part of the second filter.


 
 read  uc_list   --- dump table
 write uc_list   --- list without A
 write uc_list   --- list without A
 
 This seems really easy to get screwed up and it doesn't look like user
 space can learn the refcnt (at least in this series).
 
 


The sequences you are describing above are possible, but I think they depend
on how you implement the filter handler.
For macvlan, this filter op could just populate the internal filter.
Except that for passthru mode it tries to program the lower dev hw filter
using the uc/mc lists. And in passthru mode it assumes that it owns the lower
device, and tries to make sure that it adds or deletes only addresses it
knows about. I think if you have another thread also adding/deleting addresses
on the lowerdev while it is assigned to macvtap in passthru mode, the other
thread might see inconsistent results.

The netlink filter interface which the patch was trying to add was not a
replacement for existing uc/mc lists. It was really targeted to virtual
devices that want to do filtering on their own.
I believe uc/mc lists are used to set/unset hw filters and they are not used
in fwding lookups in the kernel (pls correct me if I am wrong). The filter
that this netlink msg was trying to set was for devices that want to do
filtering/fwding lookups like the hw.
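
For illustration, the refcount behaviour discussed above can be modelled in a
few lines of userspace C. This is only a sketch of the semantics John
describes for dev_uc_add()/dev_uc_del(), not kernel code; model_uc_add() and
model_uc_del() are made-up stand-ins. The point is that two full-list writes
containing the same address need two matching deletes before the entry
actually disappears.

/*
 * Userspace model (not kernel code) of the refcount behaviour described
 * above for dev_uc_add()/dev_uc_del(): adding an address that is already
 * on the list only bumps a reference count, and the entry only goes away
 * once it has been deleted the same number of times.
 */
#include <stdio.h>
#include <string.h>

struct hw_addr { char addr[18]; int refcount; int used; };
static struct hw_addr uc_list[8];

static void model_uc_add(const char *addr)
{
	for (int i = 0; i < 8; i++)
		if (uc_list[i].used && !strcmp(uc_list[i].addr, addr)) {
			uc_list[i].refcount++;	/* duplicate add: refcount only */
			return;
		}
	for (int i = 0; i < 8; i++)
		if (!uc_list[i].used) {
			strcpy(uc_list[i].addr, addr);
			uc_list[i].refcount = 1;
			uc_list[i].used = 1;
			return;
		}
}

static void model_uc_del(const char *addr)
{
	for (int i = 0; i < 8; i++)
		if (uc_list[i].used && !strcmp(uc_list[i].addr, addr))
			if (--uc_list[i].refcount == 0)
				uc_list[i].used = 0;	/* removed only at zero */
}

int main(void)
{
	/* Two "write the full list" operations that both contain address A... */
	model_uc_add("52:54:00:12:34:56");
	model_uc_add("52:54:00:12:34:56");
	/* ...so a single delete is not enough to remove it. */
	model_uc_del("52:54:00:12:34:56");
	printf("still present: %d (refcount %d)\n",
	       uc_list[0].used, uc_list[0].refcount);
	return 0;
}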

 
 Is the expected flow from user space 'read uc_list - write uc_list'?
 This seems risky because with two adders in user space you might
 lose addresses unless they are somehow kept in sync. IMHO it is likely
 easier to implement an ADD and DEL attribute rather than a table
 approach.
 
 The ADD and DEL will work for macvlan passthru mode because it maps 1-1 with
 the lowerdev uc and mc list. The table was for non passthru modes when
 macvlan driver might need to do filtering. So my patchset started with
 macvlan filter table for all macvlan modes (hopefully) with passthru mode as
 a specific case of offloading everything to the lowerdevice.
 
 
 Still this doesn't require a table right. Repeated ADD/DEL should work
 correct?
 
Yes that's correct. You 

[patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden)

2012-02-03 Thread Marcelo Tosatti
Rebase Zachary's last TSC series. From his original email:

This is the remaining bulk of work I have related to TSC emulation.
In summary, I believe this fixes all known issues with TSC.  A few
rather subtle issues are cleaned up, S4 suspend is fixed, and the
API for adjust_tsc_offset is expanded to take arguments in either
guest cycles or optionally host cycles.  The final patch adds
software TSC emulation, with a configurable default setting.

Note that TSC trapping will only be used when other methods fail
due to unstable TSC or lack of scaling hardware.  With the improved
offset matching, I was even able to get SMP guests to boot with a
TSC clock; cycle testing showed a maximum backwards leap of around
0.25 ms, which is actually fairly good.  With the last patch applied,
software TSC emulation kicks in and the backwards TSC count even on
my broken hardware dropped to zero.

Some of this code (the S4 suspend compensation) has already been
ported into RHEL to fix various bugs - however upstream had diverged
a bit from the original course I planned; I had to add a few things
that had been optimized out of upstream back in (last_host_tsc).

In the course of doing this, I think the new code looks much
cleaner, with well documented and easy to understand variables.
Yes, there are a lot of tracking variables to maintain this whole
infrastructure - and yes, some of them appear to be redundant or
easily computable from others - but in actuality, the information
they provide is easy to understand, and the resulting code is much
easier to verify than a complex system where some quantities may
be undefined or computed on the fly, causing subtle races.


The last patch which introduces TSC trapping has not been included: 
it unconditionally enables TSC trapping if the guest frequency 
differs from the host or the host TSC is unstable. It should instead
use tsc catchup if the guest frequency is higher than the host. Also, 
there is no need for trapping if the guest is using kvmclock.





Re: KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread Marcelo Tosatti
On Fri, Feb 03, 2012 at 10:28:57AM -0500, Sasha Levin wrote:
> On Fri, 2012-02-03 at 12:28 -0200, Marcelo Tosatti wrote:
> > Increase recommended max vcpus from 64 to 160 (tested internally
> > at Red Hat).
> > 
> > Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
> 
> Awesome!
> 
> Would it be possible to share the benchmarks?

Not sure. Shak?



[patch 7/8] Dont mark TSC unstable due to S4 suspend

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but we
need to be adjusting the TSC offsets in such a case.

Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running.  On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -423,6 +423,7 @@ struct kvm_vcpu_arch {
u64 last_tsc_nsec;
u64 last_tsc_write;
u64 last_host_tsc;
+   u64 tsc_offset_adjustment;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2238,6 +2238,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
}
 
	kvm_x86_ops->vcpu_load(vcpu, cpu);
+
+	/* Apply any externally detected TSC adjustments (due to suspend) */
+	if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
+		adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+		vcpu->arch.tsc_offset_adjustment = 0;
+		set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+	}
+
	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
			native_read_tsc() - vcpu->arch.last_host_tsc;
@@ -5950,13 +5958,88 @@ int kvm_arch_hardware_enable(void *garba
struct kvm *kvm;
struct kvm_vcpu *vcpu;
int i;
+   int ret;
+   u64 local_tsc;
+   u64 max_tsc = 0;
+   bool stable, backwards_tsc = false;
 
kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
-			if (vcpu->cpu == smp_processor_id())
-				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-	return kvm_x86_ops->hardware_enable(garbage);
+	ret = kvm_x86_ops->hardware_enable(garbage);
+	if (ret != 0)
+		return ret;
+
+	local_tsc = native_read_tsc();
+	stable = !check_tsc_unstable();
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			if (!stable && vcpu->cpu == smp_processor_id())
+				set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+			if (stable && vcpu->arch.last_host_tsc > local_tsc) {
+				backwards_tsc = true;
+				if (vcpu->arch.last_host_tsc > max_tsc)
+					max_tsc = vcpu->arch.last_host_tsc;
+			}
+		}
+	}
+
+   /*
+* Sometimes, even reliable TSCs go backwards.  This happens on
+* platforms that reset TSC during suspend or hibernate actions, but
+* maintain synchronization.  We must compensate.  Fortunately, we can
+* detect that condition here, which happens early in CPU bringup,
+* before any KVM threads can be running.  Unfortunately, we can't
+* bring the TSCs fully up to date with real time, as we aren't yet far
+* enough into CPU bringup that we know how much real time has actually
+* elapsed; our helper function, get_kernel_ns() will be using boot
+* variables that haven't been updated yet.
+*
+* So we simply find the maximum observed TSC above, then record the
+* adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
+* the adjustment will be applied.  Note that we accumulate
+* adjustments, in case multiple suspend cycles happen before some VCPU
+* gets a chance to run again.  In the event that no KVM threads get a
+* chance to run, we will miss the entire elapsed period, as we'll have
+* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
+* loose cycle time.  This isn't too big a deal, since the loss will be
+* uniform across all VCPUs (not to mention the scenario is extremely
+* unlikely). It is possible that a second hibernate recovery happens
+* much faster than a first, causing the observed TSC here to be
+* smaller; this would require additional padding adjustment, 

[patch 2/8] Improve TSC offset matching

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

There are a few improvements that can be made to the TSC offset
matching code.  First, we don't need to call the 128-bit multiply
(especially on a constant number), the code works much nicer to
do computation in nanosecond units.

Second, the way everything is setup with software TSC rate scaling,
we currently have per-cpu rates.  Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.

Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -513,6 +513,7 @@ struct kvm_arch {
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
+   u32 last_tsc_khz;
 
struct kvm_xen_hvm_config xen_hvm_config;
 
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1018,33 +1018,39 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
	struct kvm *kvm = vcpu->kvm;
	u64 offset, ns, elapsed;
	unsigned long flags;
-	s64 sdiff;
+	s64 nsdiff;
 
	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
	offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
	ns = get_kernel_ns();
	elapsed = ns - kvm->arch.last_tsc_nsec;
-	sdiff = data - kvm->arch.last_tsc_write;
-	if (sdiff < 0)
-		sdiff = -sdiff;
+
+	/* n.b - signed multiplication and division required */
+	nsdiff = data - kvm->arch.last_tsc_write;
+	nsdiff = (nsdiff * 1000) / vcpu->arch.virtual_tsc_khz;
+	nsdiff -= elapsed;
+	if (nsdiff < 0)
+		nsdiff = -nsdiff;
 
/*
-* Special case: close write to TSC within 5 seconds of
-* another CPU is interpreted as an attempt to synchronize
-* The 5 seconds is to accommodate host load / swapping as
-* well as any reset of TSC during the boot process.
-*
-* In that case, for a reliable TSC, we can match TSC offsets,
-* or make a best guest using elapsed value.
-*/
-	if (sdiff < nsec_to_cycles(vcpu, 5ULL * NSEC_PER_SEC) &&
-	    elapsed < 5ULL * NSEC_PER_SEC) {
+* Special case: TSC write with a small delta (1 second) of virtual
+* cycle time against real time is interpreted as an attempt to
+* synchronize the CPU.
+ *
+* For a reliable TSC, we can match TSC offsets, and for an unstable
+* TSC, we add elapsed time in this computation.  We could let the
+* compensation code attempt to catch up if we fall behind, but
+* it's better to try to match offsets from the beginning.
+ */
+	if (nsdiff < NSEC_PER_SEC &&
+	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		if (!check_tsc_unstable()) {
			offset = kvm->arch.last_tsc_offset;
			pr_debug("kvm: matched tsc offset for %llu\n", data);
		} else {
			u64 delta = nsec_to_cycles(vcpu, elapsed);
-			offset += delta;
+			data += delta;
+			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
		}
		ns = kvm->arch.last_tsc_nsec;
@@ -1052,6 +1058,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
	kvm->arch.last_tsc_nsec = ns;
	kvm->arch.last_tsc_write = data;
	kvm->arch.last_tsc_offset = offset;
+	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
	kvm_x86_ops->write_tsc_offset(vcpu, offset);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 




[patch 4/8] Fix last_guest_tsc / tsc_offset semantics

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

The variable last_guest_tsc was being used as an ad-hoc indicator
that guest TSC has been initialized and recorded correctly.  However,
it may not have been; it could be that the guest TSC has been set to some
large value, then back to a small value (by, say, a software reboot).

This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.

In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset.  This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.

Doing this, and also recording last_guest_tsc when the TSC is written,
allows us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset or an uninitialized value.

Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1065,6 +1065,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
	vcpu->arch.hv_clock.tsc_timestamp = 0;
	vcpu->arch.last_tsc_write = data;
	vcpu->arch.last_tsc_nsec = ns;
+	vcpu->arch.last_guest_tsc = data;
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
@@ -1133,7 +1134,7 @@ static int kvm_guest_time_update(struct 
 * observed by the guest and ensure the new system time is greater.
 */
max_kernel_ns = 0;
-	if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
+	if (vcpu->hv_clock.tsc_timestamp) {
		max_kernel_ns = vcpu->last_guest_tsc -
			vcpu->hv_clock.tsc_timestamp;
max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
@@ -2243,13 +2244,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
u64 tsc;
 
		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
-		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
-			     tsc - vcpu->arch.last_guest_tsc;
+		tsc_delta = tsc - vcpu->arch.last_guest_tsc;
 
		if (tsc_delta < 0)
			mark_tsc_unstable("KVM discovered backwards TSC");
		if (check_tsc_unstable()) {
-			kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+			u64 offset = kvm_x86_ops->compute_tsc_offset(vcpu,
+						vcpu->arch.last_guest_tsc);
+			kvm_x86_ops->write_tsc_offset(vcpu, offset);
			vcpu->arch.tsc_catchup = 1;
		}
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);




Re: Need advice how to fix an access to uninitialized per_cpu clock

2012-02-03 Thread Marcelo Tosatti
On Thu, Feb 02, 2012 at 12:43:00PM +0100, Igor Mammedov wrote:
 While playing with kvm cpu hot-plug, I've probably stumbled on general
 kernel bug. So I'm looking for advice on the approach to fix it.
 
 When kvm guest uses kvmclock, it may hang on cpu hot-plug at
 
 BSP:
  smp_apic_timer_interrupt
...
  - do_timer
- update_wall_time
 
 and the CPU being onlined is waiting to sync with the BSP at:
  start_secondary:
while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
 cpu_relax();
 
 because of an unusually big value returned by clock->read(clock) in
 update_wall_time.
 This happens due to overflow in pvclock_get_nsec_offset
 
 u64 delta = tsc - shadow->tsc_timestamp;
 
 when shadow->tsc_timestamp is bigger than tsc.
 And since pvclock_clocksource_read remembers and returns largest
 value of any clock that ever was returned, clock for affected guest
 virtually freezes.
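 
 The wraparound is easy to demonstrate in isolation. The following is a
 standalone illustration of the underflow described above (not the pvclock
 code itself; the numbers are invented):
 
 /* If the per-cpu area is uninitialized and tsc_timestamp happens to be
  * larger than the current TSC read, the unsigned subtraction wraps to a
  * huge delta.  Illustration only, not the pvclock code. */
 #include <stdio.h>
 #include <stdint.h>
 
 int main(void)
 {
 	uint64_t tsc = 1000;                  /* current TSC on the new CPU */
 	uint64_t shadow_tsc_timestamp = 5000; /* garbage from uninitialized hv_clock */
 
 	uint64_t delta = tsc - shadow_tsc_timestamp;   /* wraps around */
 	printf("delta = %llu\n", (unsigned long long)delta);
 	/* Prints 18446744073709547616; scaled into nanoseconds this is an
 	 * enormous offset, which is why the guest clock appears to freeze
 	 * at that huge value. */
 	return 0;
 }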
 
 Overflow happens due to reading undefined values from uninitialized
 per_cpu variable hv_clock. In case of cpu hot-plug, clock is read at
 least once on being on-lined cpu before it is initialized for this
 cpu:
 
   start_secondary
 |- smp_callin
 |  -  smp_store_cpu_info
 |- identify_secondary_cpu
 |  - mtrr_ap_init
 |- mtrr_restore
 |  - stop_machine_from_inactive_cpu
 |- queue_stop_cpus_work
 |  ...
 |- kvm_clock_read
 |...
 |x86_cpuinit.setup_percpu_clockev
 
 full call chain is below:
 [0.002999]  [8102fc74] pvclock_clocksource_read+0x9f/0x275
 [0.002999]  [81071f9e] ? check_preempt_wakeup+0x11c/0x1c2
 [0.002999]  [8102f05d] kvm_clock_read+0x21/0xd3
 [0.002999]  [810154b1] sched_clock+0x9/0xd
 [0.002999]  [81070509] sched_clock_local+0x12/0x75
 [0.002999]  [8107065e] sched_clock_cpu+0x84/0xc6
 [0.002999]  [8106bfd3] update_rq_clock+0x28/0x108
 [0.002999]  [8106c15e] enqueue_task+0x1d/0x64
 [0.002999]  [8106c500] activate_task+0x22/0x24
 [0.002999]  [8106c90c] ttwu_do_activate.constprop.39+0x32/0x61
 [0.002999]  [8106cd0c] try_to_wake_up+0x17e/0x1e1
 [0.002999]  [8106cd98] wake_up_process+0x15/0x17
 [0.002999]  [8109ebbf] cpu_stop_queue_work+0x3d/0x5f
 [0.002999]  [8109ec6c] queue_stop_cpus_work+0x8b/0xb6
 [0.002999]  [8109f0e3] stop_machine_from_inactive_cpu+0xb4/0xed
 [0.002999]  [81023896] ? mtrr_restore+0x4a/0x4a
 [0.002999]  [81023eb3] mtrr_ap_init+0x5a/0x5c
 [0.002999]  [814b95f7] identify_secondary_cpu+0x19/0x1b
 [0.002999]  [814bc0d3] smp_store_cpu_info+0x3c/0x3e
 [0.002999]  [814bc4ed] start_secondary+0x122/0x263
 
 Looking at native_smp_prepare_cpus in arch/x86/kernel/smpboot.c
 I see that it unconditionally calls set_mtrr_aps_delayed_init which
 in turn effectively converts calls to mtrr_ap_init to nop ops.
 And later call to native_smp_cpus_done - mtrr_aps_init completes
 mtrr initialization.
 
 The same pattern might be noticed in suspend/hibernate handlers, that
 call enable_nonboot_cpus.
 
   enable_nonboot_cpus
 - arch_enable_nonboot_cpus_begin
  - set_mtrr_aps_delayed_init
 
 - boot secondary cpus
 
 - arch_enable_nonboot_cpus_end
  - mtrr_aps_init
 
 So question is if the calling mtrr_ap_init so early in
 identify_secondary_cpu is really necessary?
 From current code it looks like it is never called at normal smp
 boot/resume time. And only path that triggers it is cpu hot-plug one.
 
 I see following possible solutions:
 
 1. Could the call to mtrr_ap_init be just moved from identify_secondary_cpu
 to start_secondary right after x86_cpuinit.setup_percpu_clockev call?

 This will prevent an access to uninitialized per_cpu clock. Or move
 x86_cpuinit.setup_percpu_clockev before smp_callin?
 
 2. Another way to prevent access to uninitialized per_cpu clock is to
 introduce hook.early_percpu_clock_init that will be called before
 smp_callin or right before mtrr_ap_init.
 
 I hope to see opinions about this matter from people more experienced than
 me.

It would be more robust to introduce x86_cpuinit.early_percpu_clock_init
and use it for KVM, replacing setup_percpu_clockev hook.

setup_percpu_clockev initializes the APIC timer, but clearly sched_clock
should be functional before that, for example for ftrace (in addition to
sched_clock from scheduler as in the trace above).
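
To make the suggestion concrete, here is a small userspace model of the idea;
it is only a sketch, not the actual arch/x86 code, and the hook and function
names are stand-ins. The early per-cpu clock hook runs before smp_callin() /
mtrr_ap_init(), so a paravirt clock can register its per-cpu area before
anything reads it, while bare metal keeps a no-op default.

/* Userspace model of the proposed x86_cpuinit.early_percpu_clock_init hook. */
#include <stdio.h>

struct x86_cpuinit_ops {
	void (*early_percpu_clock_init)(void);
	void (*setup_percpu_clockev)(void);
};

static void noop(void) { }
static void kvm_setup_secondary_clock(void)
{
	printf("kvmclock: per-cpu clock area registered\n");
}
static void setup_apic_timer(void)
{
	printf("APIC timer set up\n");
}

/* defaults: the early hook is a no-op on bare metal */
static struct x86_cpuinit_ops x86_cpuinit = { noop, setup_apic_timer };

static void start_secondary(void)
{
	x86_cpuinit.early_percpu_clock_init();  /* before smp_callin()/mtrr_ap_init() */
	printf("smp_callin / mtrr_ap_init may now safely read the clock\n");
	x86_cpuinit.setup_percpu_clockev();
}

int main(void)
{
	/* a KVM guest would override the early hook at boot */
	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
	start_secondary();
	return 0;
}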



[patch 1/8] Infrastructure for software and hardware based TSC rate scaling

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
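
For a sense of scale, a small standalone calculation of what a 250 ppm
tolerance means (the frequency below is just an example, not anything from
the patch):

/* Illustration of the 250 ppm tolerance described above. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t tsc_tolerance_ppm = 250;
	uint32_t user_tsc_khz = 2400000;   /* user asks for a 2.4 GHz guest TSC */

	/* parts-per-million window around the requested rate */
	uint64_t slack_khz = (uint64_t)user_tsc_khz * tsc_tolerance_ppm / 1000000;
	printf("scaling is skipped while the measured rate stays within "
	       "%u +/- %llu kHz\n", user_tsc_khz, (unsigned long long)slack_khz);
	/* 2400000 * 250 / 1000000 = 600 kHz of slack either way */
	return 0;
}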

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -422,10 +422,11 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
-   u32 virtual_tsc_khz;
bool tsc_catchup;
-   u32  tsc_catchup_mult;
-   s8   tsc_catchup_shift;
+   bool tsc_always_catchup;
+   s8 virtual_tsc_shift;
+   u32 virtual_tsc_mult;
+   u32 virtual_tsc_khz;
 
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
@@ -651,7 +652,7 @@ struct kvm_x86_ops {
 
bool (*has_wbinvd_exit)(void);
 
-   void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz);
+	void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
Index: kvm/arch/x86/kvm/svm.c
===
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -964,20 +964,25 @@ static u64 svm_scale_tsc(struct kvm_vcpu
return _tsc;
 }
 
-static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 {
struct vcpu_svm *svm = to_svm(vcpu);
u64 ratio;
u64 khz;
 
-	/* TSC scaling supported? */
-	if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR))
+	/* Guest TSC same frequency as host TSC? */
+	if (!scale) {
+		svm->tsc_ratio = TSC_RATIO_DEFAULT;
		return;
+	}
 
-	/* TSC-Scaling disabled or guest TSC same frequency as host TSC? */
-	if (user_tsc_khz == 0) {
-		vcpu->arch.virtual_tsc_khz = 0;
-		svm->tsc_ratio = TSC_RATIO_DEFAULT;
+	/* TSC scaling supported? */
+	if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+		if (user_tsc_khz > tsc_khz) {
+			vcpu->arch.tsc_catchup = 1;
+			vcpu->arch.tsc_always_catchup = 1;
+		} else
+			WARN(1, "user requested TSC rate below hardware speed\n");
		return;
	}
 
@@ -992,7 +997,6 @@ static void svm_set_tsc_khz(struct kvm_v
			user_tsc_khz);
		return;
	}
-	vcpu->arch.virtual_tsc_khz = user_tsc_khz;
	svm->tsc_ratio = ratio;
 }
 
Index: kvm/arch/x86/kvm/vmx.c
===
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -1817,13 +1817,19 @@ u64 vmx_read_l1_tsc(struct kvm_vcpu *vcp
 }
 
 /*
- * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
- * ioctl. In this case the call-back should update internal vmx state to make
- * the changes effective.
+ * Engage any workarounds for mis-matched TSC rates.  Currently limited to
+ * software catchup for faster rates on slower CPUs.
  */
-static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 {
-   /* Nothing to do here */
+   if (!scale)
+   return;
+
+	if (user_tsc_khz > tsc_khz) {
+		vcpu->arch.tsc_catchup = 1;
+		vcpu->arch.tsc_always_catchup = 1;
+	} else
+   WARN(1, user 

[patch 8/8] Track TSC synchronization in generations

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest.  This gets us perfect
offset matching for stable TSC systems, and perfect software
computed TSC matching for machines with unstable TSC.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -420,10 +420,11 @@ struct kvm_vcpu_arch {
 
u64 last_guest_tsc;
u64 last_kernel_ns;
-   u64 last_tsc_nsec;
-   u64 last_tsc_write;
u64 last_host_tsc;
u64 tsc_offset_adjustment;
+   u64 this_tsc_nsec;
+   u64 this_tsc_write;
+   u8  this_tsc_generation;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
@@ -513,9 +514,12 @@ struct kvm_arch {
s64 kvmclock_offset;
raw_spinlock_t tsc_write_lock;
u64 last_tsc_nsec;
-   u64 last_tsc_offset;
u64 last_tsc_write;
u32 last_tsc_khz;
+   u64 cur_tsc_nsec;
+   u64 cur_tsc_write;
+   u64 cur_tsc_offset;
+   u8  cur_tsc_generation;
 
struct kvm_xen_hvm_config xen_hvm_config;
 
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1006,10 +1006,10 @@ static void kvm_set_tsc_khz(struct kvm_v
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
+	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
				      vcpu->arch.virtual_tsc_mult,
				      vcpu->arch.virtual_tsc_shift);
-	tsc += vcpu->arch.last_tsc_write;
+	tsc += vcpu->arch.this_tsc_write;
return tsc;
 }
 
@@ -1045,7 +1045,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
	if (nsdiff < NSEC_PER_SEC &&
	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		if (!check_tsc_unstable()) {
-			offset = kvm->arch.last_tsc_offset;
+			offset = kvm->arch.cur_tsc_offset;
			pr_debug("kvm: matched tsc offset for %llu\n", data);
		} else {
			u64 delta = nsec_to_cycles(vcpu, elapsed);
@@ -1053,20 +1053,45 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
+	} else {
+		/*
+		 * We split periods of matched TSC writes into generations.
+		 * For each generation, we track the original measured
+		 * nanosecond time, offset, and write, so if TSCs are in
+		 * sync, we can match exact offset, and if not, we can match
+		 * exact software computation in compute_guest_tsc()
+		 *
+		 * These values are tracked in kvm->arch.cur_xxx variables.
+		 */
+		kvm->arch.cur_tsc_generation++;
+		kvm->arch.cur_tsc_nsec = ns;
+		kvm->arch.cur_tsc_write = data;
+		kvm->arch.cur_tsc_offset = offset;
+		pr_debug("kvm: new tsc generation %u, clock %llu\n",
+			 kvm->arch.cur_tsc_generation, data);
}
+
+   /*
+	 * We also track the most recent recorded KHZ, write and time to
+	 * allow the matching interval to be extended at each write.
+	 */
	kvm->arch.last_tsc_nsec = ns;
	kvm->arch.last_tsc_write = data;
-	kvm->arch.last_tsc_offset = offset;
	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-	kvm_x86_ops->write_tsc_offset(vcpu, offset);
-	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
	/* Reset of TSC must disable overshoot protection below */
	vcpu->arch.hv_clock.tsc_timestamp = 0;
-	vcpu->arch.last_tsc_write = data;
-	vcpu->arch.last_tsc_nsec = ns;
	vcpu->arch.last_guest_tsc = data;
+
+	/* Keep track of which generation this VCPU has synchronized to */
+	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
+	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+
+	kvm_x86_ops->write_tsc_offset(vcpu, offset);
+	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 }
+
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)




[patch 6/8] Allow adjust_tsc_offset to be in host or guest cycles

2012-02-03 Thread Marcelo Tosatti
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -646,7 +646,7 @@ struct kvm_x86_ops {
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
-   void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+	void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool host);
 
void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
 
@@ -676,6 +676,17 @@ struct kvm_arch_async_pf {
 
 extern struct kvm_x86_ops *kvm_x86_ops;
 
+static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
+					   s64 adjustment)
+{
+	kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, false);
+}
+
+static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+	kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, true);
+}
+
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
 
Index: kvm/arch/x86/kvm/svm.c
===
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -1016,10 +1016,14 @@ static void svm_write_tsc_offset(struct 
	mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
-static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
 {
	struct vcpu_svm *svm = to_svm(vcpu);
 
+	WARN_ON(adjustment < 0);
+	if (host)
+		adjustment = svm_scale_tsc(vcpu, adjustment);
+
	svm->vmcb->control.tsc_offset += adjustment;
	if (is_guest_mode(vcpu))
		svm->nested.hsave->control.tsc_offset += adjustment;
Index: kvm/arch/x86/kvm/vmx.c
===
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -1856,7 +1856,7 @@ static void vmx_write_tsc_offset(struct 
}
 }
 
-static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
 {
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1102,7 +1102,7 @@ static int kvm_guest_time_update(struct 
	if (vcpu->tsc_catchup) {
		u64 tsc = compute_guest_tsc(v, kernel_ns);
		if (tsc > tsc_timestamp) {
-			kvm_x86_ops->adjust_tsc_offset(v, tsc - tsc_timestamp);
+			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
			tsc_timestamp = tsc;
}
}




[patch 3/8] Leave TSC synchronization window open with each new sync

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

Currently, when the TSC is written by the guest, the variable
ns is updated to force the current write to appear to have taken
place at the time of the first write in this sync phase.  This
leaves a cliff at the end of the match window where updates will
fall off the end.  There are two scenarios where this can be a
problem in practice - first, on a system with a large number of
VCPUs, the sync period may last for an extended period of time.

The second way this can happen is if the VM reboots very rapidly
and we catch a VCPU TSC synchronization just around the edge.
We may be unaware of the reboot, and thus the first VCPU might
synchronize with an old set of the timer (at, say 0.97 seconds
ago, when first powered on).  The second VCPU can come in 0.04
seconds later to try to synchronize, but it misses the window
because it is just over the threshold.

Instead, stop doing this artificial setback of the ns variable
and just update it with every write of the TSC.

It may be observed that doing so causes values computed by
compute_guest_tsc to diverge slightly across CPUs - note that
the last_tsc_ns and last_tsc_write variables are used here, and
now last_tsc_ns will be different for each VCPU, reflecting
the actual time of the update.

However, compute_guest_tsc is used only for guests which already
have TSC stability issues, and further, note that the previous
patch has caused last_tsc_write to be incremented by the difference
in nanoseconds, converted back into guest cycles.  As such, only
boundary rounding errors should be visible, which given the
resolution in nanoseconds, is going to only be a few cycles and
only visible in cross-CPU consistency tests.  The problem can be
fixed by adding a new set of variables to track the start offset
and start write value for the current sync cycle.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1053,7 +1053,6 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
			offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
		}
-		ns = kvm->arch.last_tsc_nsec;
	}
	kvm->arch.last_tsc_nsec = ns;
	kvm->arch.last_tsc_write = data;




[patch 5/8] Add last_host_tsc tracking back to KVM

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden zams...@gmail.com

The variable last_host_tsc was removed from upstream code.  I am adding
it back for two reasons.  First, it is unnecessary to use guest TSC
computation to conclude information about the host TSC.  The guest may
set the TSC backwards (this case is handled by the previous patch), but
the computation of guest TSC (and fetching an MSR) is significantly more
work and complexity than simply reading the hardware counter.  In addition,
we don't actually need the guest TSC for any part of the computation,
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.

The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.

Signed-off-by: Zachary Amsden zams...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -422,6 +422,7 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
+   u64 last_host_tsc;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2239,13 +2239,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 
	kvm_x86_ops->vcpu_load(vcpu, cpu);
	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
-		/* Make sure TSC doesn't go backwards */
-		s64 tsc_delta;
-		u64 tsc;
-
-		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
-		tsc_delta = tsc - vcpu->arch.last_guest_tsc;
-
+		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+			native_read_tsc() - vcpu->arch.last_host_tsc;
		if (tsc_delta < 0)
			mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
@@ -2268,7 +2263,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
 {
	kvm_x86_ops->vcpu_put(vcpu);
	kvm_put_guest_fpu(vcpu);
-	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+	vcpu->arch.last_host_tsc = native_read_tsc();
 }
 
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,




Re: [RFC] Next gen kvm api

2012-02-03 Thread Eric Northup
On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity a...@redhat.com wrote:
[...]

> Moving to syscalls avoids these problems, but introduces new ones:
> 
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


[PATCH v3 0/4] Fix task switches into/out of VM86

2012-02-03 Thread Kevin Wolf
Kevin Wolf (4):
  KVM: x86 emulator: Fix task switch privilege checks
  KVM: x86 emulator: VM86 segments must have DPL 3
  KVM: SVM: Fix CPL updates
  KVM: x86 emulator: Allow PM/VM86 switch during task switch

 arch/x86/include/asm/kvm_emulate.h |3 +-
 arch/x86/include/asm/kvm_host.h|4 +-
 arch/x86/kvm/emulate.c |   75 ---
 arch/x86/kvm/svm.c |   25 ++--
 arch/x86/kvm/vmx.c |8 ++-
 arch/x86/kvm/x86.c |   12 -
 6 files changed, 107 insertions(+), 20 deletions(-)

-- 
1.7.6.5



[PATCH v3 1/4] KVM: x86 emulator: Fix task switch privilege checks

2012-02-03 Thread Kevin Wolf
Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this task gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |2 +-
 arch/x86/include/asm/kvm_host.h|4 +-
 arch/x86/kvm/emulate.c |   53 +++-
 arch/x86/kvm/svm.c |5 +++-
 arch/x86/kvm/vmx.c |8 +++--
 arch/x86/kvm/x86.c |6 ++--
 6 files changed, 61 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index ab4092e..c8a9cf3 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -372,7 +372,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_INTERCEPTED 2
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
-u16 tss_selector, int reason,
+u16 tss_selector, int idt_index, int reason,
 bool has_error_code, u32 error_code);
 int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq);
 #endif /* _ASM_X86_KVM_X86_EMULATE_H */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 52d6640..0533fc4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -741,8 +741,8 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
 void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
 int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
 
-int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason,
-   bool has_error_code, u32 error_code);
+int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
+   int reason, bool has_error_code, u32 error_code);
 
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 05a562b..7097ca9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1151,6 +1151,22 @@ static int pio_in_emulated(struct x86_emulate_ctxt *ctxt,
return 1;
 }
 
+static int read_interrupt_descriptor(struct x86_emulate_ctxt *ctxt,
+u16 index, struct kvm_desc_struct *desc)
+{
+   struct kvm_desc_ptr dt;
+   ulong addr;
+
+	ctxt->ops->get_idt(ctxt, &dt);
+
+	if (dt.size < index * 8 + 7)
+		return emulate_gp(ctxt, index << 3 | 0x2);
+
+	addr = dt.address + index * 8;
+	return ctxt->ops->read_std(ctxt, addr, desc, sizeof *desc,
+				   &ctxt->exception);
+}
+
 static void get_descriptor_table_ptr(struct x86_emulate_ctxt *ctxt,
 u16 selector, struct desc_ptr *dt)
 {
@@ -2350,7 +2366,7 @@ static int task_switch_32(struct x86_emulate_ctxt *ctxt,
 }
 
 static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
-  u16 tss_selector, int reason,
+  u16 tss_selector, int idt_index, int reason,
   bool has_error_code, u32 error_code)
 {
	struct x86_emulate_ops *ops = ctxt->ops;
@@ -2372,12 +2388,35 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
 
/* FIXME: check that next_tss_desc is tss */
 
-	if (reason != TASK_SWITCH_IRET) {
-		if ((tss_selector & 3) > next_tss_desc.dpl ||
-		    ops->cpl(ctxt) > next_tss_desc.dpl)
-			return emulate_gp(ctxt, 0);
+	/*
+	 * Check privileges. The three cases are task switch caused by...
+	 *
+	 * 1. jmp/call/int to task gate: Check against DPL of the task gate
+	 * 2. Exception/IRQ/iret: No check is performed
+	 * 3. jmp/call to TSS: Check against DPL of the TSS
+	 */
+	if (reason == TASK_SWITCH_GATE) {
+		if (idt_index != -1) {
+			/* Software interrupts */
+			struct kvm_desc_struct task_gate_desc;
+			int dpl;
+
+			ret = read_interrupt_descriptor(ctxt, idt_index,
+							&task_gate_desc);
+			if (ret != X86EMUL_CONTINUE)
+				return ret;
+
+			dpl = task_gate_desc.dpl;
+			if ((tss_selector & 3) > dpl || ops->cpl(ctxt) > dpl)
+				return emulate_gp(ctxt, (idt_index << 3) | 0x2);
+		}
+	} else if (reason != TASK_SWITCH_IRET) {
+		int dpl = next_tss_desc.dpl;
+   if 

[PATCH v3 2/4] KVM: x86 emulator: VM86 segments must have DPL 3

2012-02-03 Thread Kevin Wolf
Setting the segment DPL to 0 for at least the VM86 code segment makes
the VM entry fail on VMX.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 arch/x86/kvm/emulate.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7097ca9..144a203 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1243,6 +1243,8 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
seg_desc.type = 3;
seg_desc.p = 1;
seg_desc.s = 1;
+	if (ctxt->mode == X86EMUL_MODE_VM86)
+   seg_desc.dpl = 3;
goto load;
}
 
-- 
1.7.6.5



[PATCH v3 3/4] KVM: SVM: Fix CPL updates

2012-02-03 Thread Kevin Wolf
Keep CPL at 0 in real mode and at 3 in VM86. In protected/long mode, use
RPL rather than DPL of the code segment.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 arch/x86/kvm/svm.c |   19 ---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6a977c1..4124a7e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1263,6 +1263,21 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
wrmsrl(host_save_user_msrs[i], svm-host_user_msrs[i]);
 }
 
+static void svm_update_cpl(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+   int cpl;
+
+   if (!is_protmode(vcpu))
+   cpl = 0;
+	else if (svm->vmcb->save.rflags & X86_EFLAGS_VM)
+		cpl = 3;
+	else
+		cpl = svm->vmcb->save.cs.selector & 0x3;
+
+	svm->vmcb->save.cpl = cpl;
+}
+
 static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
 {
	return to_svm(vcpu)->vmcb->save.rflags;
@@ -1538,9 +1553,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
		s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
}
if (seg == VCPU_SREG_CS)
-		svm->vmcb->save.cpl
-			= (svm->vmcb->save.cs.attrib
-				>> SVM_SELECTOR_DPL_SHIFT) & 3;
+		svm_update_cpl(vcpu);
 
	mark_dirty(svm->vmcb, VMCB_SEG);
 }
-- 
1.7.6.5



[PATCH v3 4/4] KVM: x86 emulator: Allow PM/VM86 switch during task switch

2012-02-03 Thread Kevin Wolf
Task switches can switch between Protected Mode and VM86. The current
mode must be updated during the task switch emulation so that the new
segment selectors are interpreted correctly.

In order to let privilege checks succeed, rflags needs to be updated in
the vcpu struct as this causes a CPL update.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |1 +
 arch/x86/kvm/emulate.c |   20 
 arch/x86/kvm/svm.c |1 +
 arch/x86/kvm/x86.c |6 ++
 4 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index c8a9cf3..4a21c7d 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -176,6 +176,7 @@ struct x86_emulate_ops {
void (*set_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr);
int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val);
+   void (*set_rflags)(struct x86_emulate_ctxt *ctxt, ulong val);
int (*cpl)(struct x86_emulate_ctxt *ctxt);
int (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong *dest);
int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 144a203..a9fc21d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2273,6 +2273,8 @@ static int load_state_from_tss32(struct x86_emulate_ctxt *ctxt,
		return emulate_gp(ctxt, 0);
	ctxt->_eip = tss->eip;
	ctxt->eflags = tss->eflags | 2;
+
+   /* General purpose registers */
	ctxt->regs[VCPU_REGS_RAX] = tss->eax;
	ctxt->regs[VCPU_REGS_RCX] = tss->ecx;
	ctxt->regs[VCPU_REGS_RDX] = tss->edx;
@@ -2295,6 +2297,24 @@ static int load_state_from_tss32(struct x86_emulate_ctxt *ctxt,
	set_segment_selector(ctxt, tss->gs, VCPU_SREG_GS);
 
/*
+* If we're switching between Protected Mode and VM86, we need to make
+* sure to update the mode before loading the segment descriptors so
+* that the selectors are interpreted correctly.
+*
+* Need to get rflags to the vcpu struct immediately because it
+* influences the CPL which is checked at least when loading the segment
+* descriptors and when pushing an error code to the new kernel stack.
+*
+	 * TODO Introduce a separate ctxt->ops->set_cpl callback
+	 */
+	if (ctxt->eflags & X86_EFLAGS_VM)
+		ctxt->mode = X86EMUL_MODE_VM86;
+	else
+		ctxt->mode = X86EMUL_MODE_PROT32;
+
+	ctxt->ops->set_rflags(ctxt, ctxt->eflags);
+
+   /*
 * Now load segment descriptors. If fault happenes at this stage
 * it is handled in a context of new task
 */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4124a7e..38d5ef3 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1286,6 +1286,7 @@ static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
 static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
	to_svm(vcpu)->vmcb->save.rflags = rflags;
+   svm_update_cpl(vcpu);
 }
 
 static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc3e945..502b5c3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4040,6 +4040,11 @@ static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val)
return res;
 }
 
+static void emulator_set_rflags(struct x86_emulate_ctxt *ctxt, ulong val)
+{
+   kvm_set_rflags(emul_to_vcpu(ctxt), val);
+}
+
 static int emulator_get_cpl(struct x86_emulate_ctxt *ctxt)
 {
	return kvm_x86_ops->get_cpl(emul_to_vcpu(ctxt));
@@ -4199,6 +4204,7 @@ static struct x86_emulate_ops emulate_ops = {
.set_idt = emulator_set_idt,
.get_cr  = emulator_get_cr,
.set_cr  = emulator_set_cr,
+   .set_rflags  = emulator_set_rflags,
.cpl = emulator_get_cpl,
.get_dr  = emulator_get_dr,
.set_dr  = emulator_set_dr,
-- 
1.7.6.5



[PATCH] kvm tools: Fix test for mmap failure

2012-02-03 Thread Cyrill Gorcunov
On error mmap returns MAP_FAILED so we
need a proper test here.

Signed-off-by: Cyrill Gorcunov gorcu...@gmail.com
---
 tools/kvm/hw/pci-shmem.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.git/tools/kvm/hw/pci-shmem.c
===
--- linux-2.6.git.orig/tools/kvm/hw/pci-shmem.c
+++ linux-2.6.git/tools/kvm/hw/pci-shmem.c
@@ -209,7 +209,7 @@ static void *setup_shmem(const char *key
   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
close(fd);
 
-   if (mem == NULL)
+   if (mem == MAP_FAILED)
		pr_warning("Failed to mmap shared memory file");
 
return mem;
@@ -259,8 +259,9 @@ int pci_shmem__init(struct kvm *kvm)
/* Open shared memory and plug it into the guest */
	mem = setup_shmem(shmem_region->handle, shmem_region->size,
			shmem_region->create);
-	if (mem == NULL)
+	if (mem == MAP_FAILED)
		return 0;
+
	kvm__register_mem(kvm, shmem_region->phys_addr, shmem_region->size,
			  mem);
return 1;
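
For reference, the userspace idiom this patch moves to, as a self-contained
example separate from the kvm tools code paths: mmap() reports failure with
MAP_FAILED ((void *)-1), never NULL.

#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/zero", O_RDWR);
	if (fd < 0)
		return 1;

	void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	close(fd);

	if (mem == MAP_FAILED) {        /* not: mem == NULL */
		perror("mmap");
		return 1;
	}

	munmap(mem, 4096);
	return 0;
}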


Re: [PATCH] kvm tools: Fix test for mmap failure

2012-02-03 Thread Cyrill Gorcunov
On Fri, Feb 03, 2012 at 11:15:41PM +0400, Cyrill Gorcunov wrote:
> On error mmap returns MAP_FAILED so we
> need a proper test here.


Pekka, pick this one instead -- a caller is expecting null/not-null
only.

Cyrill
---
kvm tools: Fix test for mmap failure

On error mmap returns MAP_FAILED so we
need a proper test here.

Signed-off-by: Cyrill Gorcunov gorcu...@gmail.com
---
 tools/kvm/hw/pci-shmem.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6.git/tools/kvm/hw/pci-shmem.c
===
--- linux-2.6.git.orig/tools/kvm/hw/pci-shmem.c
+++ linux-2.6.git/tools/kvm/hw/pci-shmem.c
@@ -207,10 +207,11 @@ static void *setup_shmem(const char *key
}
mem = mmap(NULL, len,
   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
-   close(fd);
-
-   if (mem == NULL)
+   if (mem == MAP_FAILED) {
		pr_warning("Failed to mmap shared memory file");
+   mem = NULL;
+   }
+   close(fd);
 
return mem;
 }


[PATCH] kvm tool: Make kvm structure to carry name copy

2012-02-03 Thread Cyrill Gorcunov
If a guest name is used (which is the default case), the kvm structure
might end up carrying a pointer to a name that is allocated on the stack.

kvm_cmd_run_init
  (on stack) default_name
  kvm__init(..., default_name)
    kvm->name = default_name

So I think it is better to let kvm carry its own copy
of the guest name. 64 characters should be more than enough.
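
(Illustration only, not part of the patch: a minimal userspace sketch of the
pattern the change adopts, using a hypothetical struct vm and vm_set_name();
the point is that a caller's stack buffer must be copied, not referenced.)

#include <string.h>

struct vm {
	char name[64];	/* owned copy: stays valid for the object's lifetime */
};

static void vm_set_name(struct vm *vm, const char *name)
{
	/* copy instead of storing the pointer; 'name' may live on the caller's stack */
	strncpy(vm->name, name, sizeof(vm->name) - 1);
	vm->name[sizeof(vm->name) - 1] = '\0';	/* strncpy alone may not NUL-terminate */
}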

Signed-off-by: Cyrill Gorcunov gorcu...@gmail.com
---

I hope I didn't miss anything?

 tools/kvm/kvm.c  |2 +-
 tools/kvm/powerpc/include/kvm/kvm-arch.h |2 +-
 tools/kvm/x86/include/kvm/kvm-arch.h |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.git/tools/kvm/kvm.c
===
--- linux-2.6.git.orig/tools/kvm/kvm.c
+++ linux-2.6.git/tools/kvm/kvm.c
@@ -384,7 +384,7 @@ struct kvm *kvm__init(const char *kvm_de
 
kvm__arch_init(kvm, hugetlbfs_path, ram_size);
 
-   kvm->name = name;
+   strncpy(kvm->name, name, sizeof(kvm->name));
 
kvm_ipc__start(kvm__create_socket(kvm));
kvm_ipc__register_handler(KVM_IPC_PID, kvm__pid);
Index: linux-2.6.git/tools/kvm/powerpc/include/kvm/kvm-arch.h
===
--- linux-2.6.git.orig/tools/kvm/powerpc/include/kvm/kvm-arch.h
+++ linux-2.6.git/tools/kvm/powerpc/include/kvm/kvm-arch.h
@@ -64,7 +64,7 @@ struct kvm {
unsigned long   fdt_gra;
unsigned long   initrd_gra;
unsigned long   initrd_size;
-   const char  *name;
+   charname[64];
int vm_state;
 };
 
Index: linux-2.6.git/tools/kvm/x86/include/kvm/kvm-arch.h
===
--- linux-2.6.git.orig/tools/kvm/x86/include/kvm/kvm-arch.h
+++ linux-2.6.git/tools/kvm/x86/include/kvm/kvm-arch.h
@@ -48,7 +48,7 @@ struct kvm {
struct disk_image   **disks;
int nr_disks;
 
-   const char  *name;
+   charname[64];
 
int vm_state;
 };


Re: KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread shak

On 02/03/2012 10:55 AM, Marcelo Tosatti wrote:

On Fri, Feb 03, 2012 at 10:28:57AM -0500, Sasha Levin wrote:
   

On Fri, 2012-02-03 at 12:28 -0200, Marcelo Tosatti wrote:
 

Increase recommended max vcpus from 64 to 160 (tested internally
at Red Hat).

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
   

Awesome!

Would it be possible to share the benchmarks?
 

Not sure. Shak?

   
We have been using Linpack NxN but always hit HW barriers even on bare metal.


http://software.intel.com/en-us/articles/running-the-hpl-benchmark-over-intel-mpi/?wapkw=(LINPACK+MPI)
http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download

Shak




Re: [PATCH 1/2] virtio-pci: add setup_vqs flag in vp_try_to_find_vqs

2012-02-03 Thread Rusty Russell
On Wed, 1 Feb 2012 11:54:09 +0200, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Jan 12, 2012 at 09:20:06AM +0800, zanghongy...@huawei.com wrote:
  From: Hongyong Zang zanghongy...@huawei.com
  
  changes in vp_try_to_find_vqs:
  Virtio-serial's probe() calls it to request irqs and setup vqs of port0 and
  controls; add_port() calls it to set up vqs of io_port.
  it will not create virtqueue if the name is null.
  
  Signed-off-by: Hongyong Zang zanghongy...@huawei.com
 
 This looks like a fragile hack, to me.
 
 virtio spec also implied that VQ initialization is done
 before status is set to OK (during probe) and
 devices might have relied on that. So if we want to change
 that, I think we need a feature bit.

If virtio_serial doesn't use those virtqueues, I think it's OK.

I think I'd rather have vp_find_vqs take an offset, however.

 Besides, what about documentation? non pci transports?

I think this is a device-specific issue, but re-calling find_vqs will
have to be audited.  Which adding an offset arg should lead to anyway.

Thanks,
Rusty.


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-03 Thread Anthony Liguori

On 02/03/2012 12:07 PM, Eric Northup wrote:

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity a...@redhat.com wrote:
[...]


Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


Why should it be regulated?

It's not a finite or privileged resource.

Regards,

Anthony Liguori







Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-03 Thread Takuya Yoshikawa
Hope to get comments from live migration developers,

Anthony Liguori anth...@codemonkey.ws wrote:

  Guest memory management
  ---
  Instead of managing each memory slot individually, a single API will be
  provided that replaces the entire guest physical memory map atomically.
  This matches the implementation (using RCU) and plugs holes in the
  current API, where you lose the dirty log in the window between the last
  call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
  that removes the slot.
 
  Slot-based dirty logging will be replaced by range-based and work-based
  dirty logging; that is what pages are dirty in this range, which may be
  smaller than a slot and don't return more than N pages.
 
  We may want to place the log in user memory instead of kernel memory, to
  reduce pinned memory and increase flexibility.
 
 Since we really only support 64-bit hosts, what about just pointing the
 kernel at a address/size pair and rely on userspace to mmap() the range
 appropriately?
 

Seems reasonable but the real problem is not how to set up the memory:
the problem is how to set a bit in user-space.

We need two things:
- introducing set_bit_user()
- changing mmu_lock from spin_lock to mutex_lock
  (mark_page_dirty() can be called with mmu_lock held)

The former is straightforward and I sent a patch last year.
The latter needs a fundamental change:  I heard (from Avi) that we can
change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
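
(For illustration only -- this is not the patch referred to above, just one
possible shape such a set_bit_user() helper could take, ignoring atomicity,
which a real version would have to handle since several writers may touch
the same byte.)

#include <linux/types.h>
#include <linux/uaccess.h>

static inline int set_bit_user(long nr, void __user *addr)
{
	u8 __user *p = (u8 __user *)addr + nr / 8;
	u8 val;

	if (get_user(val, p))
		return -EFAULT;
	val |= 1u << (nr % 8);
	return put_user(val, p) ? -EFAULT : 0;
}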

So I was planning to restart this work when Peter's
mm: Preemptibility
http://lkml.org/lkml/2011/4/1/141
gets finished.

But even if we cannot achieve this without pinned memory, we may also want
to let user-space know how many pages are getting dirty.

For example, think about the last step of live migration.  We stop the
guest and send the remaining pages.  For this we do not need to write
protect them any more; we just want to know which ones are dirty.

If user-space can read the bitmap, it does not need to do GET_DIRTY_LOG
because the guest is already stopped, so we can reduce the downtime.

Is this correct?


So I think we can do this in two steps:
1. just move the bitmap to user-space (and pin it)
2. un-pin it when the time comes

I can start 1 after srcu-less dirty logging gets finished.


Takuya


Re: [PATCH] kvm tools: Fix test for mmap failure

2012-02-03 Thread Pekka Enberg

On Fri, Feb 03, 2012 at 11:15:41PM +0400, Cyrill Gorcunov wrote:

On error mmap returns MAP_FAILED so we
need a proper test here.


On Fri, 3 Feb 2012, Cyrill Gorcunov wrote:

Pekka, pick this one instead -- a caller is expecting null/not-null
only.


Applied, thanks!


[PATCH] KVM: PPC: Book3S HV: Fix kvm_alloc_linear in case where no linears exist

2012-02-03 Thread Paul Mackerras
In kvm_alloc_linear we were using and dereferencing ri after the
list_for_each_entry had come to the end of the list.  In that
situation, ri is not really defined and probably points to the
list head.  This will happen every time if the free_linears list
is empty, for instance.  This led to a NULL pointer dereference
crash in memset on POWER7 while trying to allocate an HPT in the
case where no HPTs were preallocated.

This fixes it by using a separate variable for the return value
from the loop iterator.
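
(Illustration only, not kernel code: a self-contained toy showing the same
pattern the fix adopts -- remember the match in a separate variable instead
of trusting the loop cursor after the loop.  With the kernel's
list_for_each_entry the cursor is even worse off: after a full pass it
points at the container_of of the list head, which is not a real element.)

#include <stddef.h>

struct item { int type; struct item *next; };

static struct item *find_item(struct item *head, int type)
{
	struct item *it, *found = NULL;	/* separate result, like 'ret' in the patch */

	for (it = head; it != NULL; it = it->next) {
		if (it->type != type)
			continue;
		found = it;		/* remember the match explicitly */
		break;
	}
	return found;			/* NULL when nothing matched -- caller must check */
}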

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv_builtin.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index bed1279..e1b60f5 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -173,9 +173,9 @@ static void __init kvm_linear_init_one(ulong size, int 
count, int type)
 
 static struct kvmppc_linear_info *kvm_alloc_linear(int type)
 {
-   struct kvmppc_linear_info *ri;
+   struct kvmppc_linear_info *ri, *ret;
 
-   ri = NULL;
+   ret = NULL;
	spin_lock(&linear_lock);
	list_for_each_entry(ri, &free_linears, list) {
		if (ri->type != type)
@@ -183,11 +183,12 @@ static struct kvmppc_linear_info *kvm_alloc_linear(int 
type)
 
		list_del(&ri->list);
		atomic_inc(&ri->use_count);
+		memset(ri->base_virt, 0, ri->npages << PAGE_SHIFT);
+   ret = ri;
break;
}
	spin_unlock(&linear_lock);
-	memset(ri->base_virt, 0, ri->npages << PAGE_SHIFT);
-   return ri;
+   return ret;
 }
 
 static void kvm_release_linear(struct kvmppc_linear_info *ri)
-- 
1.7.8.3



[PATCH 1/4] KVM: PPC: Book3S HV: Save and restore CR in __kvmppc_vcore_entry

2012-02-03 Thread Paul Mackerras
The ABI specifies that CR fields CR2--CR4 are nonvolatile across function
calls.  Currently __kvmppc_vcore_entry doesn't save and restore the CR,
leading to CR2--CR4 getting corrupted with guest values, possibly leading
to incorrect behaviour in its caller.  This adds instructions to save
and restore CR at the points where we save and restore the nonvolatile
GPRs.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv_interrupts.S |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S 
b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 3f7b674..d3fb4df 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -46,8 +46,10 @@ _GLOBAL(__kvmppc_vcore_entry)
/* Save host state to the stack */
stdur1, -SWITCH_FRAME_SIZE(r1)
 
-   /* Save non-volatile registers (r14 - r31) */
+   /* Save non-volatile registers (r14 - r31) and CR */
SAVE_NVGPRS(r1)
+   mfcrr3
+   std r3, _CCR(r1)
 
/* Save host DSCR */
 BEGIN_FTR_SECTION
@@ -157,8 +159,10 @@ kvmppc_handler_highmem:
 * R13  = PACA
 */
 
-   /* Restore non-volatile host registers (r14 - r31) */
+   /* Restore non-volatile host registers (r14 - r31) and CR */
REST_NVGPRS(r1)
+   ld  r4, _CCR(r1)
+   mtcrr4
 
addir1, r1, SWITCH_FRAME_SIZE
ld  r0, PPC_LR_STKOFF(r1)
-- 
1.7.8.3



[PATCH 2/4] KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs

2012-02-03 Thread Paul Mackerras
Currently on POWER7, if we are running the guest on a core and we don't
need all the hardware threads, we do nothing to ensure that the unused
threads aren't executing in the kernel (other than checking that they
are offline).  We just assume they're napping and we don't do anything
to stop them trying to enter the kernel while the guest is running.
This means that a stray IPI can wake up the hardware thread and it will
then try to enter the kernel, but since the core is in guest context,
it will execute code from the guest in hypervisor mode once it turns the
MMU on, which tends to lead to crashes or hangs in the host.

This fixes the problem by adding two new one-byte flags in the
kvmppc_host_state structure in the PACA which are used to interlock
between the primary thread and the unused secondary threads when entering
the guest.  With these flags, the primary thread can ensure that the
unused secondaries are not already in kernel mode (i.e. handling a stray
IPI) and then indicate that they should not try to enter the kernel
if they do get woken for any reason.  Instead they will go into KVM code,
find that there is no vcpu to run, acknowledge and clear the IPI and go
back to nap mode.
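
(Illustration only, using C11 atomics in place of the real code's sync
instructions and nap sequence: the shape of the two-flag handshake described
above; the names are simplified from the patch.)

#include <stdatomic.h>
#include <stdbool.h>

enum { IN_KERNEL, IN_NAP, IN_KVM };

static atomic_int hwthread_req;		/* primary asks: stay out of the kernel */
static atomic_int hwthread_state;	/* secondary publishes where it is */

/* primary thread, before running the guest on the core */
static bool grab_secondary(void)
{
	atomic_store(&hwthread_req, 1);
	for (int i = 0; i < 1000; i++)		/* bounded wait, as in the patch */
		if (atomic_load(&hwthread_state) != IN_KERNEL)
			return true;
	return false;
}

/* secondary thread, waking up (e.g. from a stray IPI) */
static void secondary_wakeup(void)
{
	atomic_store(&hwthread_state, IN_KERNEL);
	if (atomic_load(&hwthread_req)) {
		/* no vcpu to run: ack/clear the IPI and go back to nap */
		atomic_store(&hwthread_state, IN_NAP);
		return;
	}
	/* otherwise: normal kernel entry */
}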

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_book3s_asm.h |8 +++
 arch/powerpc/kernel/asm-offsets.c |2 +
 arch/powerpc/kernel/exceptions-64s.S  |   12 ++--
 arch/powerpc/kernel/idle_power7.S |7 ++
 arch/powerpc/kvm/book3s_hv.c  |   49 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |   92 +
 6 files changed, 124 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 1f2f5b6..88609b2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -79,6 +79,9 @@ struct kvmppc_host_state {
u8 napping;
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
+   u8 hwthread_req;
+   u8 hwthread_state;
+
struct kvm_vcpu *kvm_vcpu;
struct kvmppc_vcore *kvm_vcore;
unsigned long xics_phys;
@@ -122,4 +125,9 @@ struct kvmppc_book3s_shadow_vcpu {
 
 #endif /*__ASSEMBLY__ */
 
+/* Values for kvm_state */
+#define KVM_HWTHREAD_IN_KERNEL 0
+#define KVM_HWTHREAD_IN_NAP1
+#define KVM_HWTHREAD_IN_KVM2
+
 #endif /* __ASM_KVM_BOOK3S_ASM_H__ */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 8e0db0b..1914818 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -549,6 +549,8 @@ int main(void)
HSTATE_FIELD(HSTATE_IN_GUEST, in_guest);
HSTATE_FIELD(HSTATE_RESTORE_HID5, restore_hid5);
HSTATE_FIELD(HSTATE_NAPPING, napping);
+   HSTATE_FIELD(HSTATE_HWTHREAD_REQ, hwthread_req);
+   HSTATE_FIELD(HSTATE_HWTHREAD_STATE, hwthread_state);
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
HSTATE_FIELD(HSTATE_KVM_VCPU, kvm_vcpu);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ca9b733..1726a48 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -62,11 +62,13 @@ BEGIN_FTR_SECTION
GET_PACA(r13)
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
-   lbz r0,PACAPROCSTART(r13)
-   cmpwi   r0,0x80
-   bne 1f
-   li  r0,1
-   stb r0,PACAPROCSTART(r13)
+   li  r0,KVM_HWTHREAD_IN_KERNEL
+   stb r0,HSTATE_HWTHREAD_STATE(r13)
+   /* Order setting hwthread_state vs. testing hwthread_req */
+   sync
+   lbz r0,HSTATE_HWTHREAD_REQ(r13)
+   cmpwi   r0,0
+   beq 1f
b   kvm_start_guest
 1:
 #endif
diff --git a/arch/powerpc/kernel/idle_power7.S 
b/arch/powerpc/kernel/idle_power7.S
index fcdff19..95eb8c5 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -15,6 +15,7 @@
 #include <asm/ppc_asm.h>
 #include <asm/asm-offsets.h>
 #include <asm/ppc-opcode.h>
+#include <asm/kvm_book3s_asm.h>
 
 #undef DEBUG
 
@@ -64,6 +65,12 @@ _GLOBAL(power7_idle)
std r9,_MSR(r1)
std r1,PACAR1(r13)
 
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+   /* Tell KVM we're napping */
+   li  r4,KVM_HWTHREAD_IN_NAP
+   stb r4,HSTATE_HWTHREAD_STATE(r13)
+#endif
+
/* Magic NAP mode enter sequence */
std r0,0(r1)
ptesync
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ce1cac7..966e54c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -569,6 +569,45 @@ static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
	list_del(&vcpu->arch.run_list);
 }
 
+static int kvmppc_grab_hwthread(int cpu)
+{
+   struct paca_struct *tpaca;
+   long timeout = 1000;
+
+	tpaca = &paca[cpu];
+
+   /* Ensure the thread won't go into the kernel if it wakes */
+	tpaca->kvm_hstate.hwthread_req = 1;
+
+   /*
+

[PATCH 3/4] KVM: PPC: Book3S HV: Make virtual processor area registration more robust

2012-02-03 Thread Paul Mackerras
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU.  Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration.  If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.

This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used.  Each area now has a struct kvmppc_vpa which is
used to manage these updates.  There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields.  (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)

This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
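
(Illustration only, with hypothetical names: the deferral pattern the patch
uses -- the hypercall handler only records the request, and the target vcpu
applies it at a point where the old pointer is known not to be in use.)

#include <stddef.h>

struct vpa_slot {
	void *pinned_addr;		/* pointer the vcpu actually uses */
	unsigned long next_gpa;		/* requested new guest physical address */
	int update_pending;		/* set by the hypercall handler */
};

/* hypothetical stub standing in for pinning/mapping the guest page */
static void *pin_guest_area(unsigned long gpa)
{
	(void)gpa;
	return NULL;
}

/* run by the target vcpu itself, e.g. on its next guest entry */
static void apply_pending_update(struct vpa_slot *s)
{
	if (!s->update_pending)
		return;
	s->pinned_addr = s->next_gpa ? pin_guest_area(s->next_gpa) : NULL;
	s->update_pending = 0;
}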

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/hvcall.h   |   10 ++
 arch/powerpc/include/asm/kvm_host.h |   27 -
 arch/powerpc/kernel/asm-offsets.c   |2 +-
 arch/powerpc/kvm/book3s_hv.c|  226 ---
 4 files changed, 189 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 1c324ff..318bac9 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -114,6 +114,16 @@
 #define H_PP1  (1UL<<(63-62))
 #define H_PP2  (1UL<<(63-63))
 
+/* Flags for H_REGISTER_VPA subfunction field */
+#define H_VPA_FUNC_SHIFT   (63-18) /* Bit posn of subfunction code */
+#define H_VPA_FUNC_MASK    7UL
+#define H_VPA_REG_VPA  1UL /* Register Virtual Processor Area */
+#define H_VPA_REG_DTL  2UL /* Register Dispatch Trace Log */
+#define H_VPA_REG_SLB  3UL /* Register SLB shadow buffer */
+#define H_VPA_DEREG_VPA    5UL /* Deregister Virtual Processor Area */
+#define H_VPA_DEREG_DTL    6UL /* Deregister Dispatch Trace Log */
+#define H_VPA_DEREG_SLB    7UL /* Deregister SLB shadow buffer */
+
 /* VASI States */
 #define H_VASI_INVALID  0
 #define H_VASI_ENABLED  1
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1843d5d..0ab60ac 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -82,7 +82,7 @@ struct kvm_vcpu;
 
 struct lppaca;
 struct slb_shadow;
-struct dtl;
+struct dtl_entry;
 
 struct kvm_vm_stat {
u32 remote_tlb_flush;
@@ -271,6 +271,19 @@ struct kvmppc_vcore {
 #define VCORE_EXITING  2
 #define VCORE_SLEEPING 3
 
+/*
+ * Struct used to manage memory for a virtual processor area
+ * registered by a PAPR guest.  There are three types of area
+ * that a guest can register.
+ */
+struct kvmppc_vpa {
+   void *pinned_addr;  /* Address in kernel linear mapping */
+   void *pinned_end;   /* End of region */
+   unsigned long next_gpa; /* Guest phys addr for update */
+   unsigned long len;  /* Number of bytes required */
+   u8 update_pending;  /* 1 = update pinned_addr from next_gpa */
+};
+
 struct kvmppc_pte {
ulong eaddr;
u64 vpage;
@@ -450,11 +463,6 @@ struct kvm_vcpu_arch {
u8 prodded;
u32 last_inst;
 
-   struct lppaca *vpa;
-   struct slb_shadow *slb_shadow;
-   struct dtl *dtl;
-   struct dtl *dtl_end;
-
wait_queue_head_t *wqp;
struct kvmppc_vcore *vcore;
int ret;
@@ -479,6 +487,13 @@ struct kvm_vcpu_arch {
struct task_struct *run_task;
struct kvm_run *kvm_run;
pgd_t *pgdir;
+
+   spinlock_t vpa_update_lock;
+   struct kvmppc_vpa vpa;
+   struct kvmppc_vpa dtl;
+   struct dtl_entry *dtl_ptr;
+   unsigned long dtl_index;
+   struct kvmppc_vpa slb_shadow;
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 1914818..f8d9e16 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -475,7 +475,7 @@ int main(void)
DEFINE(VCPU_PENDING_EXC, offsetof(struct kvm_vcpu, 
arch.pending_exceptions));
DEFINE(VCPU_CEDED, offsetof(struct kvm_vcpu, arch.ceded));
DEFINE(VCPU_PRODDED, offsetof(struct kvm_vcpu, arch.prodded));
-   DEFINE(VCPU_VPA, offsetof(struct kvm_vcpu, arch.vpa));
+   DEFINE(VCPU_VPA, offsetof(struct kvm_vcpu, arch.vpa.pinned_addr));
DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr));
DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc));
DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb));
diff --git