date:20100616

Re: [Qemu-devel] Re: KVM call minutes for June 15

2010-06-16 Thread Markus Armbruster

Anthony Liguori anth...@codemonkey.ws writes:

 On 06/15/2010 10:41 AM, Christoph Hellwig wrote:
 On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote:

 KVM/qemu patches
 - patch rate is high, documentation is low, review is low
 - patches need to include better descriptions and documentation
- will slow down patch writers
- will make it easier for patch reviewers
  
 What is the qemu patch review policy anyway?

 We don't really have a coherent policy.  Suggestions for improvement
 are always appreciated.

There are no
 Reviewed-by: included in the actual commits,

 Reviewed-by/Ack-by's are pretty helpful for me.  In terms of including
 them in commit messages, if there's a strong feeling that that would
 be helpful then it's something I can look at doing but it also
 requires a fair bit of manual work during commit.

Can't hurt reviewer motivation.  Could it be automated?  Find replies,
extract tags.  If you want your acks to be picked up, you better make
sure your References header works, and your tags are formatted
correctly.

[...]
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity


On 06/13/2010 06:03 PM, Avi Kivity wrote:

Currently fpu management is only lazy in one direction.  When we switch into
a task, we may avoid loading the fpu state in the hope that the task will
never use it.  If we guess right we save an fpu load/save cycle; if not,
a Device not Available exception will remind us to load the fpu.

However, in the other direction, fpu management is eager.  When we switch out
of an fpu-using task, we always save its fpu state.

This is wasteful if the task(s) that run until we switch back in all don't use
the fpu, since we could have kept the task's fpu on the cpu all this time
and saved an fpu save/load cycle.  This can be quite common with threaded
interrupts, but will also happen with normal kernel threads and even normal
user tasks.

This patch series converts task fpu management to be fully lazy.  When
switching out of a task, we keep its fpu state on the cpu, only flushing it
if some other task needs the fpu.
   


Ingo, Peter, any feedback on this?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread H. Peter Anvin

On 06/16/2010 12:24 AM, Avi Kivity wrote:
 
 Ingo, Peter, any feedback on this?
 

Conceptually, this makes sense to me.  However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/24] Implement VMXON and VMXOFF

2010-06-16 Thread Nadav Har'El

On Tue, Jun 15, 2010, Marcelo Tosatti wrote about Re: [PATCH 3/24] Implement 
VMXON and VMXOFF:
  +   if (!(vcpu-arch.cr4  X86_CR4_VMXE) ||
  +   !(vcpu-arch.cr0  X86_CR0_PE) ||
 
 kvm_read_cr0_bits, kvm_read_cr4_bits.

Thanks. I'll change that.

-- 
Nadav Har'El|Wednesday, Jun 16 2010, 4 Tammuz 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |In Fortran, God is real unless declared
http://nadav.harel.org.il   |an integer.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Andi Kleen

On Tue, Jun 15, 2010 at 02:22:06PM +0300, Avi Kivity wrote:
 Too much duplication.  How about putting the tail end of the function in a 
 common helper (with an inatomic flag)?

 btw, is_hwpoison_address() is racy.  While it looks up the address, some 
 other task can unmap the page tables under us.

Where is is_hwpoison_address() coming from? I can't find it anywhere.

Anyways hwpoison will not remove the page as long as there
is a reference to it, so as long as you get the reference
race free against another task you're ok. Of course there might 
be always an error in between and the hardware may poison
the data, but we don't try to handle all kernel errors.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity


On 06/16/2010 10:32 AM, H. Peter Anvin wrote:

On 06/16/2010 12:24 AM, Avi Kivity wrote:
   

Ingo, Peter, any feedback on this?
 

Conceptually, this makes sense to me.  However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
   


Right, this optimization isn't free.

I think the tradeoff is favourable since task migrations are much less 
frequent than context switches within the same cpu, can the scheduler 
experts comment?


We can also mitigate some of the IPIs if we know that we're migrating on 
the cpu we're migrating from (i.e. we're pushing tasks to another cpu, 
not pulling them from their cpu).  Is that a common case, and if so, 
where can I hook a call to unlazy_fpu() (or its new equivalent)?


Note that kvm on intel has exactly the same issue (the VMPTR and VMCS 
are on-chip registers that are expensive to load and save, so we keep 
them loaded even while not scheduled, and IPI if we notice we've 
migrated; note that architecturally the cpu can cache multiple VMCSs 
simultaneously (though I doubt they cache multiple VMCSs 
microarchitecturally at this point)).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 03/17] Unify vendor TSC logic

2010-06-16 Thread Jason Wang

Zachary Amsden wrote:
 Move the TSC control logic from the vendor backends into x86.c
 by adding adjust_tsc_offset to x86 ops.  Now all TSC decisions
 can be done in one place.

 Also, rename some variable in the VCPU structure to more accurately
 reflect their actual content.

 VMX backend would record last observed TSC very late (possibly in an
 IPI to clear the VMCS from a remote CPU), now it is done earlier.

 Note this is still not yet perfect.  We adjust TSC in both
 directions when the TSC is unstable, resulting in desynchronized
 TSCs for SMP guests.  A better choice would be to compensate based
 on a master reference clock.

 Signed-off-by: Zachary Amsden zams...@redhat.com
 ---
  arch/x86/include/asm/kvm_host.h |5 +++--
  arch/x86/kvm/svm.c  |   25 +++--
  arch/x86/kvm/vmx.c  |   20 
  arch/x86/kvm/x86.c  |   16 +++-
  4 files changed, 33 insertions(+), 33 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 91631b8..94f6ce8 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -253,7 +253,6 @@ struct kvm_mmu {
  };
  
  struct kvm_vcpu_arch {
 - u64 host_tsc;
   /*
* rip and regs accesses must go through
* kvm_{register,rip}_{read,write} functions.
 @@ -334,9 +333,10 @@ struct kvm_vcpu_arch {
  
   gpa_t time;
   struct pvclock_vcpu_time_info hv_clock;
 - unsigned int hv_clock_tsc_khz;
 + unsigned int hw_tsc_khz;
   unsigned int time_offset;
   struct page *time_page;
 + u64 last_host_tsc;
  
   bool nmi_pending;
   bool nmi_injected;
 @@ -530,6 +530,7 @@ struct kvm_x86_ops {
   u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
   int (*get_lpage_level)(void);
   bool (*rdtscp_supported)(void);
 + void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
  
   void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
  
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 09b2145..ee2cf30 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -948,18 +948,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
   int i;
  
   if (unlikely(cpu != vcpu-cpu)) {
 - u64 delta;
 -
 - if (check_tsc_unstable()) {
 - /*
 -  * Make sure that the guest sees a monotonically
 -  * increasing TSC.
 -  */
 - delta = vcpu-arch.host_tsc - native_read_tsc();
 - svm-vmcb-control.tsc_offset += delta;
 - if (is_nested(svm))
 - svm-nested.hsave-control.tsc_offset += delta;
 - }
   svm-asid_generation = 0;
   }
  
 @@ -975,8 +963,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
   ++vcpu-stat.host_state_reload;
   for (i = 0; i  NR_HOST_SAVE_USER_MSRS; i++)
   wrmsrl(host_save_user_msrs[i], svm-host_user_msrs[i]);
 -
 - vcpu-arch.host_tsc = native_read_tsc();
  }
  
  static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
 @@ -3422,6 +3408,15 @@ static bool svm_rdtscp_supported(void)
   return false;
  }
  
 +static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 +{
 + struct vcpu_svm *svm = to_svm(vcpu);
 +
 + svm-vmcb-control.tsc_offset += adjustment;
 + if (is_nested(svm))
 + svm-nested.hsave-control.tsc_offset += adjustment;
 +}
 +
  static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
  {
   struct vcpu_svm *svm = to_svm(vcpu);
 @@ -3506,6 +3501,8 @@ static struct kvm_x86_ops svm_x86_ops = {
   .rdtscp_supported = svm_rdtscp_supported,
  
   .set_supported_cpuid = svm_set_supported_cpuid,
 +
 + .adjust_tsc_offset = svm_adjust_tsc_offset,
  };
  
  static int __init svm_init(void)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index a657e08..a993e67 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -498,7 +498,6 @@ static void __vcpu_clear(void *arg)
   vmcs_clear(vmx-vmcs);
   if (per_cpu(current_vmcs, cpu) == vmx-vmcs)
   per_cpu(current_vmcs, cpu) = NULL;
 - rdtscll(vmx-vcpu.arch.host_tsc);
   list_del(vmx-local_vcpus_link);
   vmx-vcpu.cpu = -1;
   vmx-launched = 0;
 @@ -881,7 +880,6 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
  static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
  {
   struct vcpu_vmx *vmx = to_vmx(vcpu);
 - u64 tsc_this, delta, new_offset;
   u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
  
   if (!vmm_exclusive)
 @@ -914,16 +912,6 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
  
   rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
   vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
 -
 - /*
 -

Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

2010-06-16 Thread Jason Wang

Zachary Amsden wrote:
 When CPUs with unstable TSCs enter deep C-state, TSC may stop
 running.  This causes us to require resynchronization.  Since
 we can't tell when this may potentially happen, we assume the
 worst by forcing re-compensation for it at every point the VCPU
 task is descheduled.

   
Is there any method to do the compensation on host TSC when recovering
from C-state? It should be simpler than do it on guest TSC.
 Signed-off-by: Zachary Amsden zams...@redhat.com
 ---
  arch/x86/kvm/x86.c |   11 +++
  1 files changed, 11 insertions(+), 0 deletions(-)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index c8289d0..618c435 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1822,7 +1822,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  {
   kvm_x86_ops-vcpu_put(vcpu);
   kvm_put_guest_fpu(vcpu);
 +
   vcpu-arch.last_host_tsc = native_read_tsc();
 +
 + /*
 +  * When potentially leaving a CPU with unstable TSCs, we risk
 +  * that the CPU enters deep C-state.  If it does, the TSC may
 +  * go out of sync but we will not recalibrate because the test
 +  * vcpu-cpu != cpu can not detect this condition.  So set
 +  * vcpu-cpu = -1 to force the recalibration above.
 +  */
 + if (check_tsc_unstable())
 + vcpu-cpu = -1;
  }
  
  static int is_efer_nx(void)
   

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

2010-06-16 Thread Jason Wang

Zachary Amsden wrote:
 SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
 local offset delta from last measurement.  This does not take into account how
 long it has been since the measurement, leading to drift.  Minimize the drift
 by accounting for any time difference the kernel has observed.

 Signed-off-by: Zachary Amsden zams...@redhat.com
 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/x86.c  |   20 +++-
  2 files changed, 20 insertions(+), 1 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 94f6ce8..1afecd7 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -337,6 +337,7 @@ struct kvm_vcpu_arch {
   unsigned int time_offset;
   struct page *time_page;
   u64 last_host_tsc;
 + u64 last_host_ns;
  
   bool nmi_pending;
   bool nmi_injected;
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 618c435..b1bdf05 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
   /* Make sure TSC doesn't go backwards */
   s64 tsc_delta = !vcpu-arch.last_host_tsc ? 0 : 
   native_read_tsc() - vcpu-arch.last_host_tsc;
 +
 + /* Subtract elapsed cycle time from the delta computation */
 + if (check_tsc_unstable()  vcpu-arch.last_host_ns) {
 + s64 delta;
 + struct timespec ts;
 + ktime_get_ts(ts);
 + monotonic_to_bootbased(ts);
 + delta = timespec_to_ns(ts) - vcpu-arch.last_host_ns;
 + delta = delta * per_cpu(cpu_tsc_khz, cpu);
   
This seems not work well on a cpu w/o CONSTANT_TSC.
 + delta = delta / USEC_PER_SEC;
 + tsc_delta -= delta;
 + }
 +
   if (tsc_delta  0 || check_tsc_unstable())
   kvm_x86_ops-adjust_tsc_offset(vcpu, -tsc_delta);
   kvm_migrate_timers(vcpu);
 @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
* vcpu-cpu != cpu can not detect this condition.  So set
* vcpu-cpu = -1 to force the recalibration above.
*/
 - if (check_tsc_unstable())
 + if (check_tsc_unstable()) {
 + struct timespec ts;
 + ktime_get_ts(ts);
 + monotonic_to_bootbased(ts);
 + vcpu-arch.last_host_ns = timespec_to_ns(ts);
   vcpu-cpu = -1;
 + }
  }
  
  static int is_efer_nx(void)
   

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

2010-06-16 Thread Jason Wang

Zachary Amsden wrote:
 Kernel time, which advances in discrete steps may progress much slower
 than TSC.  As a result, when kvmclock is adjusted to a new base, the
 apparent time to the guest, which runs at a much higher, nsec scaled
 rate based on the current TSC, may have already been observed to have
 a larger value (kernel_ns + scaled tsc) than the value to which we are
 setting it (kernel_ns + 0).
   
This is one issue of kvmclock which tries to supply a clocksource whose
precision may even higher than host.
 We must instead compute the clock as potentially observed by the guest
 for kernel_ns to make sure it does not go backwards.

 Signed-off-by: Zachary Amsden zams...@redhat.com
 ---
  arch/x86/include/asm/kvm_host.h |4 ++
  arch/x86/kvm/x86.c  |   79 
 +--
  2 files changed, 71 insertions(+), 12 deletions(-)

 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 1afecd7..7ec2472 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -338,6 +338,8 @@ struct kvm_vcpu_arch {
   struct page *time_page;
   u64 last_host_tsc;
   u64 last_host_ns;
 + u64 last_guest_tsc;
 + u64 last_kernel_ns;
  
   bool nmi_pending;
   bool nmi_injected;
 @@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
   u32 hypercalls;
   u32 irq_injections;
   u32 nmi_injections;
 + u32 tsc_overshoot;
 + u32 tsc_ahead;
  };
  
  struct kvm_x86_ops {
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 52d7d34..703ea43 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -138,6 +138,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
   { insn_emulation_fail, VCPU_STAT(insn_emulation_fail) },
   { irq_injections, VCPU_STAT(irq_injections) },
   { nmi_injections, VCPU_STAT(nmi_injections) },
 + { tsc_overshoot, VCPU_STAT(tsc_overshoot) },
 + { tsc_ahead, VCPU_STAT(tsc_ahead) },
   { mmu_shadow_zapped, VM_STAT(mmu_shadow_zapped) },
   { mmu_pte_write, VM_STAT(mmu_pte_write) },
   { mmu_pte_updated, VM_STAT(mmu_pte_updated) },
 @@ -927,33 +929,84 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
   struct kvm_vcpu_arch *vcpu = v-arch;
   void *shared_kaddr;
   unsigned long this_tsc_khz;
 + s64 kernel_ns, max_kernel_ns;
 + u64 tsc_timestamp;
  
   if ((!vcpu-time_page))
   return 0;
  
 - this_tsc_khz = get_cpu_var(cpu_tsc_khz);
 - put_cpu_var(cpu_tsc_khz);
 + /*
 +  * The protection we require is simple: we must not be preempted from
 +  * the CPU between our read of the TSC khz and our read of the TSC.
 +  * Interrupt protection is not strictly required, but it does result in
 +  * greater accuracy for the TSC / kernel_ns measurement.
 +  */
 + local_irq_save(flags);
 + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
 + kvm_get_msr(v, MSR_IA32_TSC, tsc_timestamp);
 + ktime_get_ts(ts);
 + monotonic_to_bootbased(ts);
 + kernel_ns = timespec_to_ns(ts);
 + local_irq_restore(flags);
 +
   if (unlikely(this_tsc_khz == 0)) {
   kvm_request_guest_time_update(v);
   return 1;
   }
  
 + /*
 +  * Time as measured by the TSC may go backwards when resetting the base
 +  * tsc_timestamp.  The reason for this is that the TSC resolution is
 +  * higher than the resolution of the other clock scales.  Thus, many
 +  * possible measurments of the TSC correspond to one measurement of any
 +  * other clock, and so a spread of values is possible.  This is not a
 +  * problem for the computation of the nanosecond clock; with TSC rates
 +  * around 1GHZ, there can only be a few cycles which correspond to one
 +  * nanosecond value, and any path through this code will inevitably
 +  * take longer than that.  However, with the kernel_ns value itself,
 +  * the precision may be much lower, down to HZ granularity.  If the
 +  * first sampling of TSC against kernel_ns ends in the low part of the
 +  * range, and the second in the high end of the range, we can get:
 +  *
 +  * (TSC - offset_low) * S + kns_old  (TSC - offset_high) * S + kns_new
 +  *
 +  * As the sampling errors potentially range in the thousands of cycles,
 +  * it is possible such a time value has already been observed by the
 +  * guest.  To protect against this, we must compute the system time as
 +  * observed by the guest and ensure the new system time is greater.
 +  */
 + max_kernel_ns = 0;
 + if (vcpu-hv_clock.tsc_timestamp) {
 + max_kernel_ns = vcpu-last_guest_tsc -
   
Since you do the comparison with kernel_ns, so what you need here is
tsc_timestamp which looks more like the 'last' tsc seen by guest.
 + vcpu-hv_clock.tsc_timestamp;
 + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
 +

Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Avi Kivity


On 06/16/2010 10:59 AM, Andi Kleen wrote:

On Tue, Jun 15, 2010 at 02:22:06PM +0300, Avi Kivity wrote:
   

Too much duplication.  How about putting the tail end of the function in a
common helper (with an inatomic flag)?

btw, is_hwpoison_address() is racy.  While it looks up the address, some
other task can unmap the page tables under us.
 

Where is is_hwpoison_address() coming from? I can't find it anywhere.

   


kvm.git master mm/memory-failure.c (19564281fe).


Anyways hwpoison will not remove the page as long as there
is a reference to it, so as long as you get the reference
race free against another task you're ok. Of course there might
be always an error in between and the hardware may poison
the data, but we don't try to handle all kernel errors.
   


The page is fine, the page tables are not.  Another task can munmap() 
the thing while is_hwpoison_address() is running.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 12/17] Add helper function get_kernel_ns

2010-06-16 Thread Jason Wang

Zachary Amsden wrote:
 On 06/14/2010 10:41 PM, Avi Kivity wrote:
   
 On 06/15/2010 10:34 AM, Zachary Amsden wrote:
 
 Add a helper function for the multiple places this is used.  Note 
 that it
 must not be called in preemptible context, as that would mean the kernel
 could enter software suspend state, which would cause non-atomic 
 operation
 of the monotonic_to_bootbased computation.

 Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
 as well?  Currently, they are not bootbased (but perhaps should be).

 Signed-off-by: Zachary Amsdenzams...@redhat.com
 ---
   arch/x86/kvm/x86.c |   26 +-
   1 files changed, 13 insertions(+), 13 deletions(-)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 703ea43..15c7317 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t 
 scaled_khz, uint32_t base_khz,
__func__, base_khz, scaled_khz, shift, *pmultiplier);
   }

 +static inline u64 get_kernel_ns(void)
 +{
 +struct timespec ts;
 +
 +WARN_ON(preemptible());
 +ktime_get_ts(ts);
 +monotonic_to_bootbased(ts);
 +return timespec_to_ns(ts);
 +}
 +
   static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);

   
 Isn't something like this a candidate for the time infrastructure?

 

 Should it be?  It certainly seems reasonable.


 Also, should we be using it for the cases KVM_GET_CLOCK / 
 KVM_SET_CLOCK?  It seems global kvmclock_offset for the VM ignores the 
 bootbased conversion.

   
Yes we should.
 Zach
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
   

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC] KVM: busy-spin detector

2010-06-16 Thread Avi Kivity


On 06/11/2010 05:25 AM, Marcelo Tosatti wrote:

The following patch implements a simple busy-spin detector. It considers
a vcpu as busy-spinning if there are two consecutive exits due to
external interrupt on the same RIP, and sleeps for 100us in that case.

It is very likely that if the vcpu is making progress it will either
exit for other reasons or change RIP.

The percentage numbers below represent improvement in kernel build
time in comparison with mainline (RHEL 5.4 guest).

make -j16, 8 cpu host:
smp 16: 3%
smp 18: 10%
smp 32: 14%

smp 4, make -j4, pinned to 2 cpus:
4%

smp 8, make -j8, pinned to 2 cpus:
5%

   


The big problem here is not in the patch itself, but in 
kvm_vcpu_on_spin().  An un-overcommitted guest will trigger this on long 
spinlocks and go to sleep, and if the lock is released while we're still 
sleeping, we lose performance.



@@ -4654,6 +4665,8 @@ static int vcpu_enter_guest(struct kvm_v
kvm_lapic_sync_from_vapic(vcpu);

r = kvm_x86_ops-handle_exit(vcpu);
+   if (r == 1)
+   vcpu-arch.last_rip = ~(0UL);
  out:
return r;
  }
   


!= 2, no?

It's a little ugly to overload the return code like that.  Perhaps:

+vcpu-arch.save_last_rip = vcpu-arch.last_rip
+vcpu-arch.last_rip = ~0UL;
 r = kvm_x86_ops-handle_exit(vcpu);

And kvm_detect_spin() can restore last_rip.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-16 Thread Markus Armbruster

Alex Williamson alex.william...@redhat.com writes:

 On Tue, 2010-06-15 at 12:28 +0100, Paul Brook wrote:
   Alex proposed to disambiguate by adding identified properties of the
   immediate parent bus and device to the path component.  For PCI, these
   are dev.fn.  Likewise for any other bus where devices have unambigous
   bus address.  The driver name carries no information!
  
  From user POV, driver names are very handly to address a device
  intuitively - except for the case you have tones of devices on the same
  bus that are handled by the same driver. For that case we need to
  augment the device name with a useful per-bus ID, derived from the bus
  address where available, otherwise based on instance numbers.
 
 This is where I think you're missing a trick. We don't need to augment the 
 name, we just need to allow the bus id to be used instead.

 For the case of a hot remove, I agree.  If the user specifies pci_del
 pci.0/03.0, that's completely sufficient because we don't care what's
 in that slot, just remove it.  However, I still see some use cases for
 device names in the path.  Take for example:

 (A): /i440FX-pcihost/pci.0/e1000.05.0

 vs

 (B): /pci.0/05.0

 (removing both the root bridge driver name and the device driver name)

/ is the main system bus.  System bus defines no bus address at the
moment.  Therefore, you have to use the driver name i440FX-pcihost.

/i440FX-pcihost/pci.0 is the PCI bus.  PCI bus defines a bus address:
dev.fn.  Therefore, you can either use the bus address @05.0, or the
driver name e1000.

We have /i440FX-pcihost/pci.0/e1000 vs. /i440FX-pcihost/pci.0/@05.0.

 If we attach a pci option rom to the device, create some ram to store
 the option rom and name the ram block $(PATH)/rom, with (A) we know more
 about the hierarchy to get to the actual devfn device, and we know the
 type of device that's in the slot.  With (B), there's no robustness.  If
 we migrated using (B), we could be stuffing a pc e1000 option rom into a
 ppc lsi895, just because it happened to live that the same place and
 have a ram block named rom.  Including driver names increases the
 uniqueness of the path.

 Another example; if we have two drivers that create a vmstate with name
 foo, plug one driver into slot 03.0 on the migration source, the other
 into slot 03.0 on the migration destination, what happens?  It seems
 likely that the destination will try to load the vmstate from a
 different driver and fail in wonderful and bizarre ways.  If we use (A),
 each path automatically has it's own namespace.

 ISA is also a good example even though it doesn't do hotplug.  Given
 this set:

 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/mc146818rtc
 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-serial.0x3f8
 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-parallel.0x378
 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/i8042
 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-fdc
 /i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/ne2k_isa.0x300

 versus this set:

 /pci.0.01.0/isa.0
 /pci.0.01.0/isa.0/0x3f8
 /pci.0.01.0/isa.0/0x378
 /pci.0.01.0/isa.0
 /pci.0.01.0/isa.0
 /pci.0.01.0/isa.0/0x300

 Which one has devices that are easier to uniquely identify?
  
   For other buses, we need to make something up.
   
   Note that addressing by bus address rather than name is generally
   useful, not just in the context of savevm.  For instance, I'd appreciate
   being able to say something like device_del pci.0/04.0.
  
  And I prefer device_del [.../]pci.0/e1000. Otherwise you need to dump
  the bus first before you can identify which device you want to remove.
 
 We can allow both.
 
 A bus address is sufficient to uniquely identify a device.

 A bus address (assuming it exists) is sufficient to uniquely identify a
 device, on a given VM.  A bus address only identifies a location when
 comparing two separate VMs.

Identifying a device on a given VM is all a qdev path does.

If you want to check two VMs have the same device in the same slot, then
you need to compare *devices*, not their names.  You propose to encode
*one* property of the device in the name, namely its driver.  This is
far from sufficient.  If you tell me you need it anyway for migration,
I'll have to take that at face value (I'm not an expert there).  But
please do not call it qdev path, because it ain't.

   I see no reason to 
 require the driver name,  or to include it in the canonical device address.

 Migration.  Including the driver name extends our ability to uniquely
 identify a device across separate VMs.  It's then up to the vmstate code
 to figure out whether the device are compatible for migrate state.

Migration needs to recreate the same qdev tree on the destination.
Driver name is just *one* property of a device.  Migration needs to
transfer *all* properties.  Why encode driver name in the path, but not
the rest?  Why can't you put the driver name wherever you put the rest?

   An easy way to get that is to reserve part of the name space for bus
   addresses.  If

Re: [Qemu-devel] Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string

2010-06-16 Thread Markus Armbruster

Alex Williamson alex.william...@redhat.com writes:

 On Tue, 2010-06-15 at 10:53 +0200, Markus Armbruster wrote:
 Alex Williamson alex.william...@redhat.com writes:
 
  On Mon, 2010-06-14 at 09:02 +0200, Gerd Hoffmann wrote:
  Hi,
  
   My premise with this attempt is that we walk the hierarchy and use the
   names to create the base of the path.  As we get to the device,
   particularly to the parent bus of the device, we need to start looking 
   at
   properties to ensure uniqueness.
  
  You'll need that for every bus along the way down to the device.  Create 
  a virtual machine with two lsi scsi host adapters, then attach a disk 
  with scsi id 0 to each.  Just the scsi id isn't good enougth to identify 
  the device.  You'll need the lsi pci address too.
 
  Yep, see below.
 
   For now, the only properties I've tagged as path
   properties are PCI bus addresses and MAC addresses.
  
  mac address isn't needed here.  You need the property which specifies 
  the bus address.  For PCI this obviously is the PCI address.  For scsi 
  the scsi id.  For ISA you can use the I/O port base.  virtio-serial the 
  port number, ...
 
  PCI: addr
  SCSI: scsi-id
  ISA: serial/parallel = iobase, others??
 
 If there's no iobase (pathological case), require ID.
 
  ide-drive: unit
 
 Bus name is IDE, but it's clear enough what you mean :)

 I put ide-drive here because the unit is a property of the device, not
 the bus.

I consider that a (very minor) bug.

  I2C: address
 
  virtio-serial doesn't seem to make a DeviceState per port, so I think it
  can be skipped.
 
 Really?
 
 Anyway, its port number should do as bus address.

 Maybe I'm not specifying it correctly.  I see a max_nr_ports property,
 but I don't see that each port is a separate qdev.

I see property nr in virtconsole_info and virtserialport_info.  I
can't see any other virtio-serial devices.

   I'm sure I'm still missing some...
 
 s390-virtio
 SSI
 System

 I'll need some help coming up with useful properties to key on for
 these.  I had hoped there's only one System bus.

 USB

 usb-storage seems to have a useful drive property that lets me
 distinguish these devices:

 /i440FX-pcihost/pci.0/piix3-usb-uhci.01.2/usb.0/usb-storage.usb0/scsi.0/scsi-disk.0
 /i440FX-pcihost/pci.0/piix3-usb-uhci.01.2/usb.0/usb-storage.usb1/scsi.0/scsi-disk.0
  drive

 But otherwise USB is disappointingly devoid of useful properties at the
 bus level.

Paul suggested physical ports.  Doesn't look like we have them, but that
should be fixable.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string

2010-06-16 Thread Markus Armbruster

Markus Armbruster arm...@redhat.com writes:

 Alex Williamson alex.william...@redhat.com writes:

 On Tue, 2010-06-15 at 10:53 +0200, Markus Armbruster wrote:
 Alex Williamson alex.william...@redhat.com writes:
[...]
  virtio-serial doesn't seem to make a DeviceState per port, so I think it
  can be skipped.
 
 Really?
 
 Anyway, its port number should do as bus address.

 Maybe I'm not specifying it correctly.  I see a max_nr_ports property,
 but I don't see that each port is a separate qdev.

 I see property nr in virtconsole_info and virtserialport_info.  I
 can't see any other virtio-serial devices.

Same issue as with ide-drive's unit: should be a bus property.

[...]
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code

2010-06-16 Thread Avi Kivity


On 06/15/2010 04:55 PM, Dave Hansen wrote:

This is a big RFC for the moment.  These need a bunch more
runtime testing.

--

We've seen contention in the mmu_shrink() function.


First of all, that's surprising.  I tried to configure the shrinker so 
it would stay away from kvm unless memory was really tight.  The reason 
is that kvm mmu pages can cost as much as 1-2 ms of cpu time to build, 
perhaps even more, so we shouldn't drop them lightly.


It's certainly a neglected area that needs attention, though.


This patch
set reworks it to hopefully be more scalable to large numbers
of CPUs, as well as large numbers of running VMs.

The patches are ordered with increasing invasiveness.

These seem to boot and run fine.  I'm running about 40 VMs at
once, while doing echo 3  /proc/sys/vm/drop_caches, and
killing/restarting VMs constantly.
   


Will drop_caches actually shrink the kvm caches too?  If so we probably 
need to add that to autotest since it's a really good stress test for 
the mmu.



Seems to be relatively stable, and seems to keep the numbers
of kvm_mmu_page_header objects down.
   


That's no necessarily a good thing, those things are expensive to 
recreate.  Of course, when we do need to reclaim them, that should be 
efficient.


We also do a very bad job of selecting which page to reclaim.  We need 
to start using the accessed bit on sptes that point to shadow page 
tables, and then look those up and reclaim unreferenced pages sooner.  
With shadow paging there can be tons of unsync pages that are basically 
unused and can be reclaimed at no cost to future runtime.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Ingo Molnar


(Cc:-ed various performance/optimization folks)

* Avi Kivity a...@redhat.com wrote:

 On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
 On 06/16/2010 12:24 AM, Avi Kivity wrote:
 Ingo, Peter, any feedback on this?
  Conceptually, this makes sense to me.  However, I have a concern what
  happens when a task is scheduled on another CPU, while its FPU state is
  still in registers in the original CPU.  That would seem to require
  expensive IPIs to spill the state in order for the rescheduling to
  proceed, and this could really damage performance.
 
 Right, this optimization isn't free.
 
 I think the tradeoff is favourable since task migrations are much
 less frequent than context switches within the same cpu, can the
 scheduler experts comment?

This cannot be stated categorically without precise measurements of 
known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
various lmbench components, X benchmarks, tiobench - you name it. Combined 
with the fact that most micro-benchmarks wont be using the FPU, while in the 
long run most processes will be using the FPU due to SIMM instructions. So 
even a positive result might be skewed in practice. Has to be measured 
carefully IMO - and i havent seen a _single_ performance measurement in the 
submission mail. This is really essential.

So this does not look like a patch-set we could apply without gathering a 
_ton_ of hard data about advantages and disadvantages.

 We can also mitigate some of the IPIs if we know that we're migrating on the 
 cpu we're migrating from (i.e. we're pushing tasks to another cpu, not 
 pulling them from their cpu).  Is that a common case, and if so, where can I 
 hook a call to unlazy_fpu() (or its new equivalent)?

When the system goes from idle to less idle then most of the 'fast' migrations 
happen on a 'push' model - on a busy CPU we wake up a new task and push it out 
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably 
little cost.

But on busy servers where most wakeups are IRQ based the chance of being on 
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of 
CPUs.

If there's some sucky corner case in theory we could approach it statistically 
and measure the ratio of fast vs. slow migration vs. local context switches - 
but that looks a bit complex.

Dunno.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 1/9] abstract kvm x86 mmu-n_free_mmu_pages

2010-06-16 Thread Avi Kivity


On 06/15/2010 04:55 PM, Dave Hansen wrote:

First of all, I think free is a poor name for this value.  In
this context, it means, the number of mmu pages which this kvm
instance should be able to allocate.  To me, free implies
much more that the objects are there and ready for use.  I think
available is a much better description, especially when you
see how it is calculated.

   


Agreed.  Note that if reclaim is improved, we can drop it completely and 
let the kernel trim the cache when it grows too large.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value

2010-06-16 Thread Avi Kivity


On 06/15/2010 04:55 PM, Dave Hansen wrote:

Note: this is the real meat of the patch set.  It can be applied up
to this point, and everything will probably be improved, at least
a bit.

Of slab shrinkers, the VM code says:

  * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
  * querying the cache size, so a fastpath for that case is appropriate.

and it *means* it.  Look at how it calls the shrinkers:

nr_before = (*shrinker-shrink)(0, gfp_mask);
shrink_ret = (*shrinker-shrink)(this_scan, gfp_mask);

So, if you do anything stupid in your shrinker, the VM will doubly
punish you.
   


Ouch.


The mmu_shrink() function takes the global kvm_lock, then acquires
every VM's kvm-mmu_lock in sequence.  If we have 100 VMs, then
we're going to take 101 locks.  We do it twice, so each call takes
202 locks.  If we're under memory pressure, we can have each cpu
trying to do this.  It can get really hairy, and we've seen lock
spinning in mmu_shrink() be the dominant entry in profiles.

This is guaranteed to optimize at least half of those lock
aquisitions away.  It removes the need to take any of the locks
when simply trying to count objects.


Signed-off-by: Dave Hansend...@linux.vnet.ibm.com
---

  linux-2.6.git-dave/arch/x86/kvm/mmu.c |   33 +++--
  1 file changed, 23 insertions(+), 10 deletions(-)

diff -puN arch/x86/kvm/mmu.c~make_global_used_value arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~make_global_used_value 2010-06-09 
15:14:30.0 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c   2010-06-09 15:14:30.0 
-0700
@@ -891,6 +891,19 @@ static int is_empty_shadow_page(u64 *spt
  }
  #endif

+/*
+ * This value is the sum of all of the kvm instances's
+ * kvm-arch.n_used_mmu_pages values.  We need a global,
+ * aggregate version in order to make the slab shrinker
+ * faster
+ */
+static unsigned int kvm_total_used_mmu_pages;
   


The variable needs to be at the top of the file.


+static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
+{
+   kvm-arch.n_used_mmu_pages += nr;
+   kvm_total_used_mmu_pages += nr;
   


Needs an atomic operation, since there's no global lock here.  To avoid 
bouncing this cacheline around, make the variable percpu and make 
readers take a sum across all cpus.  Side benefit is that you no longer 
need an atomic but a local_t, which is considerably cheaper.



+}
+
   



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Andi Kleen

 The page is fine, the page tables are not.  Another task can munmap() the 
 thing while is_hwpoison_address() is running.

Ok  that boils down to me not seeing that source.

If it accesses the page tables yes then it's racy. But whoever
looked up the page tables in the first place should have
returned an -EFAULT. There's no useful address attached
to poison.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 8/9] reduce kvm_lock hold times in mmu_skrink()

2010-06-16 Thread Avi Kivity


On 06/15/2010 04:55 PM, Dave Hansen wrote:

mmu_shrink() is effectively single-threaded since the global
kvm_lock is held over the entire function.

I beleive its only use here is for synchronization of the
vm_list.  Instead of using the kvm_lock to ensure
consistency of the list, we instead obtain a kvm_get_kvm()
reference.  This keeps the kvm object on the vm_list while
we shrink it.

Since we don't need the lock to maintain the list any more,
we can drop it.  We'll reacquire it if we need to get another
object off.

This leads to a larger number of atomic ops, but reduces
lock hold times: the typical latency vs. throughput debate.

   


snip content


diff -puN kernel/profile.c~optimize_shrinker-3 kernel/profile.c
--- linux-2.6.git/kernel/profile.c~optimize_shrinker-3  2010-06-11 
09:09:43.0 -0700
+++ linux-2.6.git-dave/kernel/profile.c 2010-06-11 09:12:24.0 -0700
@@ -314,6 +314,8 @@ void profile_hits(int type, void *__pc,
if (prof_on != type || !prof_buffer)
return;
pc = min((pc - (unsigned long)_stext)  prof_shift, prof_len - 1);
+   if ((pc == prof_len - 1)  printk_ratelimit())
+   printk(profile_hits(%d, %p, %d)\n, type, __pc, nr_hits);
i = primary = (pc  (NR_PROFILE_GRP - 1))  PROFILE_GRPSHIFT;
secondary = (~(pc  1)  (NR_PROFILE_GRP - 1))  PROFILE_GRPSHIFT;
cpu = get_cpu();
_
   


Unrelated gunk...

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: KVM call minutes for June 15

2010-06-16 Thread Yoshiaki Tamura

2010/6/16 Markus Armbruster arm...@redhat.com:
 Anthony Liguori anth...@codemonkey.ws writes:

 On 06/15/2010 10:41 AM, Christoph Hellwig wrote:
 On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote:

 KVM/qemu patches
 - patch rate is high, documentation is low, review is low
 - patches need to include better descriptions and documentation
    - will slow down patch writers
    - will make it easier for patch reviewers

 What is the qemu patch review policy anyway?

 We don't really have a coherent policy.  Suggestions for improvement
 are always appreciated.

    There are no
 Reviewed-by: included in the actual commits,

 Reviewed-by/Ack-by's are pretty helpful for me.  In terms of including
 them in commit messages, if there's a strong feeling that that would
 be helpful then it's something I can look at doing but it also
 requires a fair bit of manual work during commit.

 Can't hurt reviewer motivation.  Could it be automated?  Find replies,
 extract tags.  If you want your acks to be picked up, you better make
 sure your References header works, and your tags are formatted
 correctly.

How about letting the submitter to include acked-by or reviewed-by
manually and repost?
It wouldn't make the maintainers busy.  Although the traffic would
increase, it would show the gratitude from submitter to the reviewer.

Thanks,

Yoshi


 [...]
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM call minutes for June 15

2010-06-16 Thread Paolo Bonzini


On 06/15/2010 05:18 PM, Chris Wright wrote:

- size for each section would be useful (breaks protocol)
   - while size is possibly useful, breaks protocol


It is not necessary to break the protocol.  If you're okay with only 
having the size information when the migration data has been saved to a 
file, you can put the directory at the end of the migration data, after 
the EOF section.  Something like


   QEMU_VM_SECTION_EOF
   QEMU_VM_SECTION_DIRECTORY
  copy of the migration data, with the actual data replaced
  by a single 8-byte pointer to the beginning of the section:

  QEMU_VM_SECTION_START
  section id
  5 block
  instance id
  version id
  8-byte pointer

  QEMU_VM_SECTION_START
  section id
  3 ram
  instance id
  version id
  8-byte pointer

  ...
  QEMU_VM_SECTION_FULL
  section id
  10 cpu_common
  instance id
  version id
  8-byte pointer

  ...
  QEMU_VM_SECTION_EOF
  8-byte pointer

  QEMU_VM_SECTION_DIRECTORY
  8-byte pointer

Note that by definition the last 8 bytes will point to the beginning of 
the directory.  You can read the last 18 bytes to reduce (to almost 
zero) the possibility of a false positive.


The directory table can be built at save time and streamed after the EOF 
without causing an error if the receiver closes its connection during 
the streaming of the directory.


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Nick Piggin

On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
 
 (Cc:-ed various performance/optimization folks)
 
 * Avi Kivity a...@redhat.com wrote:
 
  On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
  On 06/16/2010 12:24 AM, Avi Kivity wrote:
  Ingo, Peter, any feedback on this?
   Conceptually, this makes sense to me.  However, I have a concern what
   happens when a task is scheduled on another CPU, while its FPU state is
   still in registers in the original CPU.  That would seem to require
   expensive IPIs to spill the state in order for the rescheduling to
   proceed, and this could really damage performance.
  
  Right, this optimization isn't free.
  
  I think the tradeoff is favourable since task migrations are much
  less frequent than context switches within the same cpu, can the
  scheduler experts comment?
 
 This cannot be stated categorically without precise measurements of 
 known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
 these workloads have different characteristics.
 
 I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
 various lmbench components, X benchmarks, tiobench - you name it. Combined 
 with the fact that most micro-benchmarks wont be using the FPU, while in the 
 long run most processes will be using the FPU due to SIMM instructions. So 
 even a positive result might be skewed in practice. Has to be measured 
 carefully IMO - and i havent seen a _single_ performance measurement in the 
 submission mail. This is really essential.

It can be nice to code an absolute worst-case microbenchmark too.

Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contented resource (IO or mutex or whatever).
I suspect the main issues in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Samuel Thibault

Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit :
 in the long run most processes will be using the FPU due to SIMM  
 instructions.

I believe glibc already uses SIMM instructions for e.g. memcpy and
friends, i.e. basically all applications...

Samuel
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ kvm-Bugs-2933400 ] virtio-blk io errors / data corruption on raw drives 1 TB

2010-06-16 Thread SourceForge.net

Bugs item #2933400, was opened at 2010-01-16 15:35
Message generated for change (Comment added) made by masc82
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2933400group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Closed
Resolution: Fixed
Priority: 9
Private: No
Submitted By: MaSc82 (masc82)
Assigned to: Nobody/Anonymous (nobody)
Summary: virtio-blk io errors / data corruption on raw drives  1 TB

Initial Comment:
When attaching raw drives  1 TB, buffer io errors will most likely occur, 
filesystems get corrupted. Processes (like mkfs.ext4) might freeze completely 
when filesystems are created on the guest.

Here's a typical log excerpt when using mkfs.ext4 on a 1.5 TB drive on a Ubuntu 
9.10 guest:
(kern.log)
Jan 15 20:40:44 q kernel: [  677.076602] Buffer I/O error on device vde, 
logical block 366283764
Jan 15 20:40:44 q kernel: [  677.076607] Buffer I/O error on device vde, 
logical block 366283765
Jan 15 20:40:44 q kernel: [  677.076611] Buffer I/O error on device vde, 
logical block 366283766
Jan 15 20:40:44 q kernel: [  677.076616] Buffer I/O error on device vde, 
logical block 366283767
Jan 15 20:40:44 q kernel: [  677.076621] Buffer I/O error on device vde, 
logical block 366283768
Jan 15 20:40:44 q kernel: [  677.076626] Buffer I/O error on device vde, 
logical block 366283769
(messages)
Jan 15 20:40:44 q kernel: [  677.076534] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076541] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076546] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076599] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076604] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076609] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076613] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076618] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076623] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076628] lost page write due to I/O error on vde
Jan 15 20:45:55 q Backgrounding to notify hosts...
(The following entries will repeat infinitely, mkfs.ext4 will not exit and 
cannot be killed)
Jan 15 20:49:27 q kernel: [ 1200.520096] mkfs.ext4 D  0 
 1839   1709 0x
Jan 15 20:49:27 q kernel: [ 1200.520101]  88004e157cb8 0082 
88004e157c58 00015880
Jan 15 20:49:27 q kernel: [ 1200.520115]  88004ef6c7c0 00015880 
00015880 00015880
Jan 15 20:49:27 q kernel: [ 1200.520118]  00015880 88004ef6c7c0 
00015880 00015880
Jan 15 20:49:27 q kernel: [ 1200.520123] Call Trace:
Jan 15 20:49:27 q kernel: [ 1200.520157]  [810da0f0] ? 
sync_page+0x0/0x50
Jan 15 20:49:27 q kernel: [ 1200.520178]  [815255f8] 
io_schedule+0x28/0x40
Jan 15 20:49:27 q kernel: [ 1200.520182]  [810da12d] 
sync_page+0x3d/0x50
Jan 15 20:49:27 q kernel: [ 1200.520185]  [81525b17] 
__wait_on_bit+0x57/0x80
Jan 15 20:49:27 q kernel: [ 1200.520192]  [810da29e] 
wait_on_page_bit+0x6e/0x80
Jan 15 20:49:27 q kernel: [ 1200.520205]  [81078650] ? 
wake_bit_function+0x0/0x40
Jan 15 20:49:27 q kernel: [ 1200.520210]  [810e44e0] ? 
pagevec_lookup_tag+0x20/0x30
Jan 15 20:49:27 q kernel: [ 1200.520213]  [810da745] 
wait_on_page_writeback_range+0xf5/0x190
Jan 15 20:49:27 q kernel: [ 1200.520217]  [810da807] 
filemap_fdatawait+0x27/0x30
Jan 15 20:49:27 q kernel: [ 1200.520220]  [810dacb4] 
filemap_write_and_wait+0x44/0x50
Jan 15 20:49:27 q kernel: [ 1200.520235]  [8114ba9f] 
__sync_blockdev+0x1f/0x40
Jan 15 20:49:27 q kernel: [ 1200.520239]  [8114bace] 
sync_blockdev+0xe/0x10
Jan 15 20:49:27 q kernel: [ 1200.520241]  [8114baea] 
block_fsync+0x1a/0x20
Jan 15 20:49:27 q kernel: [ 1200.520249]  [81142f26] 
vfs_fsync+0x86/0xf0
Jan 15 20:49:27 q kernel: [ 1200.520252]  [81142fc9] 
do_fsync+0x39/0x60
Jan 15 20:49:27 q kernel: [ 1200.520255]  [8114301b] 
sys_fsync+0xb/0x10
Jan 15 20:49:27 q kernel: [ 1200.520271]  [81011fc2] 
system_call_fastpath+0x16/0x1b

In my case I was switching to virtio at one point, but the behaviour didn't 
show until there was  1 TB data on the filesystem. very dangerous.

Tested using 2 different SATA controllers, 1.5 TB lvm/mdraid, single 1.5 TB 
drive and 2 TB lvm/mdraid.
The behaviour does not occur with if=scsi or if=ide.

#2914397 might be related: 
https://sourceforge.net/tracker/?func=detailaid=2914397group_id=180599atid=893831
This blog post might also relate: 
http://www.neuhalfen.name/2009/08/05/OpenSolaris_KVM_and_large_IDE_drives/

CPU: Intel

Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive

2010-06-16 Thread Avi Kivity


On 06/15/2010 04:55 PM, Dave Hansen wrote:

In a previous patch, we removed the 'nr_to_scan' tracking.
It was not being used to track the number of objects
scanned, so we stopped using it entirely.  Here, we
strart using it again.

The theory here is simple; if we already have the refcount
and the kvm-mmu_lock, then we should do as much work as
possible under the lock.  The downside is that we're less
fair about the KVM instances from which we reclaim.  Each
call to mmu_shrink() will tend to pick on one instance,
after which it gets moved to the end of the list and left
alone for a while.
   


That also increases the latency hit, as well as a potential fault storm, 
on that instance.  Spreading out is less efficient, but smoother.



If mmu_shrink() has already done a significant amount of
scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
will also ensure that we do not over-reclaim when we have
already done a lot of work in this call.

In the end, this patch defines a scan as:
1. An attempt to acquire a refcount on a 'struct kvm'
2. freeing a kvm mmu page

This would probably be most ideal if we can expose some
of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
as also counting as scanning, but I think we have churned
enough for the moment.
   


It usually removes one page.


Signed-off-by: Dave Hansend...@linux.vnet.ibm.com
---

  linux-2.6.git-dave/arch/x86/kvm/mmu.c |   11 ++-
  1 file changed, 6 insertions(+), 5 deletions(-)

diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive arch/x86/kvm/mmu.c
--- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive  
2010-06-14 11:30:44.0 -0700
+++ linux-2.6.git-dave/arch/x86/kvm/mmu.c   2010-06-14 11:38:04.0 
-0700
@@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv

idx = srcu_read_lock(kvm-srcu);
spin_lock(kvm-mmu_lock);
-   if (kvm-arch.n_used_mmu_pages  0)
-   freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
+   while (nr_to_scan  0  kvm-arch.n_used_mmu_pages  0) {
+   freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
+   nr_to_scan--;
+   }
   


What tree are you patching?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity


On 06/16/2010 11:39 AM, Ingo Molnar wrote:

(Cc:-ed various performance/optimization folks)

* Avi Kivitya...@redhat.com  wrote:

   

On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
 

On 06/16/2010 12:24 AM, Avi Kivity wrote:
   

Ingo, Peter, any feedback on this?
 

Conceptually, this makes sense to me.  However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
   

Right, this optimization isn't free.

I think the tradeoff is favourable since task migrations are much
less frequent than context switches within the same cpu, can the
scheduler experts comment?
 

This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combined
with the fact that most micro-benchmarks wont be using the FPU, while in the
long run most processes will be using the FPU due to SIMM instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and i havent seen a _single_ performance measurement in the
submission mail. This is really essential.
   


I have really no idea what to measure.  Which would you most like to see?


So this does not look like a patch-set we could apply without gathering a
_ton_ of hard data about advantages and disadvantages.
   


I agree (not to mention that I'm not really close to having an applyable 
patchset).


Note some of the advantages will not be in throughput but in latency 
(making kernel_fpu_begin() preemptible, and reducing context switch time 
for event threads).



We can also mitigate some of the IPIs if we know that we're migrating on the
cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
pulling them from their cpu).  Is that a common case, and if so, where can I
hook a call to unlazy_fpu() (or its new equivalent)?
 

When the system goes from idle to less idle then most of the 'fast' migrations
happen on a 'push' model - on a busy CPU we wake up a new task and push it out
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
little cost.
   


Can you point me to the code which does this?


But on busy servers where most wakeups are IRQ based the chance of being on
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
CPUs.
   


But don't we usually avoid pulls due to NUMA and cache considerations?


If there's some sucky corner case in theory we could approach it statistically
and measure the ratio of fast vs. slow migration vs. local context switches -
but that looks a bit complex.

   


I certainly wouldn't want to start with it.


Dunno.
   


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity


On 06/16/2010 12:10 PM, Nick Piggin wrote:



This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combined
with the fact that most micro-benchmarks wont be using the FPU, while in the
long run most processes will be using the FPU due to SIMM instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and i havent seen a _single_ performance measurement in the
submission mail. This is really essential.
 

It can be nice to code an absolute worst-case microbenchmark too.
   


Sure.


Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contented resource (IO or mutex or whatever).
I suspect the main issues in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
   


If it's just cpu oversubscription then the IPIs will be limited by the 
rebalance rate and the time slice, so as you say it has to involve 
contention and frequent wakeups as well as heavy cpu usage.  That won't 
be easy to code.  Can you suggest an existing benchmark to run?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Avi Kivity


On 06/16/2010 11:49 AM, Andi Kleen wrote:

The page is fine, the page tables are not.  Another task can munmap() the
thing while is_hwpoison_address() is running.
 

Ok  that boils down to me not seeing that source.

If it accesses the page tables yes then it's racy. But whoever
looked up the page tables in the first place should have
returned an -EFAULT. There's no useful address attached
to poison.
   


We need to distinguish between genuine -EFAULT and poisoned address.

That's why I suggested get_user_pages_ptes_fast.  You can return page = 
NULL (-EFAULT) and the pte in the same go.  No race, and useful for 
other cases.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity


On 06/16/2010 12:01 PM, Samuel Thibault wrote:

Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit :
   

in the long run most processes will be using the FPU due to SIMM
instructions.
 

I believe glibc already uses SIMM instructions for e.g. memcpy and
friends, i.e. basically all applications...
   


I think they ought to be using 'rep movs' on newer processors, but yes 
you're right.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RFC qdev path semantics (was: [Qemu-devel] [RFC PATCH 0/5] Introduce canonical device hierarchy string)

2010-06-16 Thread Markus Armbruster

A number of changes to qdev paths have been proposed in various threads.
It's becoming harder to keep track of them, so let me sum them up in one
place.  Please correct me if I misrepresent your ideas.

I'm going to describe the current state of things, and the proposed
changes (marked with ###).


The device tree has the main system bus as root.  A child of a bus is a
device.  A child of a device is a bus.

A qdev path consists of qdev path components separated by '/'.  It
resolves to a node in the device tree, either bus or device.

The qdev path / resolves to the root, i.e. the main system bus.

The qdev path IDENT, where IDENT is an identifier, resolves to the
device whose id is IDENT.

If PATH resolves to device DEV, and a child of DEV has the name IDENT,
then we resolve to that bus.

Bus names are chosen by the system as follows:

* If the driver of the parent device model provides a name, use that.

* Else, if the parent device has id ID, use ID.NUM, where NUM is the bus
  number, counting from zero in creation order.

* Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM
  is the bus number, as above.

### Paul proposes to drop ID.NUM.

### Paul proposes to either drop TYPE.NUM (and require drivers to
provide bus names), or make NUM count separately for each bus type.

If PATH resolves to bus BUS, then we resolve PATH/IDENT as follows:

* If a child of BUS has id IDENT, then we resolve to that device.

  ### Jan proposes to drop this rule.

* Else, if a child of BUS has a driver with name IDENT, then we resolve
  to that device.

  If more than one exist, resolve to the first one.  This assumes
  children are ordered.  Order is the same as in info qtree.
  Currently, it's reverse creation order.

  This is *not* a stable address.

* Else, if a child of BUS has a driver with alias name IDENT, then we
  resolve to that device.

  If more than one exist, resolve to the first one.  This assumes
  children are ordered.  Order is the same as in info qtree.
  Currently, it's reverse creation order.

  This is *not* a stable address.

### I propose: we resolve PATH/@BUS-ADDR to the child of BUS with bus
address BUS-ADDR, if devices on this type of bus have bus addresses.
The format of BUS-ADDR depends on the bus.

### Paul proposes to require all buses to define bus addresses.  Make
one up if necessary.

### Jan proposes: we resolve PATH/IDENT.NUM as follows:

* If a child of BUS has a driver with name IDENT and an instance
  number NUM, then we resolve to that device.

  Need a suitable definition of instance number.

  Jan proposes to number devices with the same driver on the same
  bus.  This assumes children are ordered, see above.

  This is *not* a stable address if the bus supports hot-plug.

  I propose to us bus-address as instance number.  Works best
  together with Paul's proposal to define bus addresses.  Syntax
  id...@bus-addr makes more sense then.

* Else, same with driver alias name.

### Here's a possible synthesis of the above three proposals: require
bus addresses, and permit any of

PATH/IDENT
PATH/@BUS-ADDR
PATH/id...@bus-addr

PATH/IDENT can't address instances that don't come first.

IDENT in PATH/id...@bus-addr is redundant.  Therefore, it can't be
the canonical qdev path.  That's fine, PATH/@BUS-ADDR serves.

If the above rules resolve PATH to a device in a context where we expect
a bus, and the device has exactly one bus, resolve it to that bus
instead.

### Jan and I propose to drop this rule.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

kvm_queue_exception - what is going wrong here?

2010-06-16 Thread Sebastian Hetze

Hi *,

we recently have encountered kvm_queue_exception events in dmesg.
What is causing these exceptions? It appears that the guest is
at least suffering performance loss. Is the qemu-system-x86 hang
event linked to the exception? Is there anything we can do
to prevent these exceptions to happen?

Best regards,

  Sebastian

-
[1811816.496620] set_cr3: #GP, reserved bits
[1811816.496624] [ cut here ]
[1811816.496640] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 
kvm_queue_exception_e+0x61/0x70 [kvm
]()
[1811816.496643] Hardware name: S5520HC
[1811816.496644] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv
4 xt_state nf_conntrack ipt_REJECT xt_tcpudp ppdev iptable_filter ip_tables lp 
parport x_tables kvm_intel kvm
bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt 
iTCO_vendor_support pcspkr joydev ses encl
osure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit font bitblit softcursor
[1811816.496669] Pid: 26170, comm: qemu-system-x86 Not tainted 2.6.31-14-server 
#48
[1811816.496672] Call Trace:
[1811816.496682]  [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm]
[1811816.496689]  [8105f458] warn_slowpath_common+0x78/0xd0
[1811816.496692]  [8105f4bf] warn_slowpath_null+0xf/0x20
[1811816.496701]  [a0140651] kvm_queue_exception_e+0x61/0x70 [kvm]
[1811816.496711]  [a01406ff] load_guest_segment_descriptor+0x9f/0xb0 
[kvm]
[1811816.496721]  [a0140753] kvm_load_segment_descriptor+0x43/0x110 
[kvm]
[1811816.496730]  [a0140a20] ? kvm_inject_gp+0x10/0x20 [kvm]
[1811816.496740]  [a0140bed] ? kvm_set_cr3+0x15d/0x170 [kvm]
[1811816.496749]  [a0140d67] load_state_from_tss32+0x167/0x240 [kvm]
[1811816.496758]  [a0142a69] kvm_task_switch_32+0x119/0x130 [kvm]
[1811816.496768]  [a0142f93] kvm_task_switch+0x153/0x300 [kvm]
[1811816.496777]  [a0137df1] ? gfn_to_hva+0x11/0x90 [kvm]
[1811816.496786]  [a0138122] ? kvm_read_guest_page+0x62/0x70 [kvm]
[1811816.496792]  [a016cbab] handle_task_switch+0x6b/0x160 [kvm_intel]
[1811816.496796]  [a0169b45] vmx_handle_exit+0xf5/0x280 [kvm_intel]
[1811816.496801]  [a016d822] ? vmx_vcpu_run+0x292/0x856 [kvm_intel]
[1811816.496811]  [a013ddbf] vcpu_enter_guest+0x2cf/0x600 [kvm]
[1811816.496814]  [8107a12e] ? finish_wait+0x5e/0x80
[1811816.496819]  [81541710] ? _spin_lock_irq+0x10/0x20
[1811816.496822]  [81541603] ? __down_read+0xc3/0xce
[1811816.496825]  [81079fa0] ? autoremove_wake_function+0x0/0x40
[1811816.496834]  [a013e153] __vcpu_run+0x63/0x330 [kvm]
[1811816.496844]  [a01440ab] kvm_arch_vcpu_ioctl_run+0x8b/0x1f0 [kvm]
[1811816.496853]  [a01367f2] kvm_vcpu_ioctl+0x2e2/0x630 [kvm]
[1811816.496856]  [8106e99f] ? dequeue_signal+0x9f/0x180
[1811816.496859]  [81131d41] vfs_ioctl+0x31/0xa0
[1811816.496862]  [811321b3] do_vfs_ioctl+0x373/0x400
[1811816.496864]  [811322d9] sys_ioctl+0x99/0xa0
[1811816.496862]  [811321b3] do_vfs_ioctl+0x373/0x400
[1811816.496864]  [811322d9] sys_ioctl+0x99/0xa0
[1811816.496867]  [8106d3e9] ? do_sigpending+0xa9/0xc0
[1811816.496870]  [81012082] system_call_fastpath+0x16/0x1b
[1811816.496872] ---[ end trace c6ea0da2c3c9d0e4 ]---
[1811816.496887] set_cr3: #GP, pdptrs reserved bits
[1811816.496889] [ cut here ]
[1811816.496898] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 
kvm_queue_exception_e+0x61/0x70 [kvm]()
[1811816.496900] Hardware name: S5520HC
[1811816.496901] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp 
ppdev iptable_filter ip_tables lp parport x_tables kvm_intel kvm bridge stp 
bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt iTCO_vendor_support 
pcspkr joydev ses enclosure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit 
font bitblit softcursor
[1811816.496923] Pid: 26170, comm: qemu-system-x86 Tainted: GW  
2.6.31-14-server #48
[1811816.496925] Call Trace:
[1811816.496934]  [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm]
[1811816.496937]  [8105f458] warn_slowpath_common+0x78/0xd0
[1811816.496940]  [8105f4bf] warn_slowpath_null+0xf/0x20
[1811816.496950]  [a0140651] kvm_queue_exception_e+0x61/0x70 [kvm]
[1811816.496959]  [a01406ff] load_guest_segment_descriptor+0x9f/0xb0 
[kvm]
[1811816.496969]  [a0140753] kvm_load_segment_descriptor+0x43/0x110 
[kvm]
[1811816.496978]  [a0140b52] ? kvm_set_cr3+0xc2/0x170 [kvm]
[1811816.496988]  [a0140d9c] load_state_from_tss32+0x19c/0x240 [kvm]
[1811816.496997]  [a0142a69] kvm_task_switch_32+0x119/0x130 [kvm]
[1811816.497006]  [a0142f93] kvm_task_switch+0x153/0x300 [kvm]
[1811816.497015]  [a0137df1] ?

Re: Loss of network connectivity with high load

2010-06-16 Thread Daniel Bareiro

On Wednesday, 16 June 2010 00:36:43 -0300,
Daniel Bareiro wrote:

  There have been a few bugs similar to this reported that should be
  pretty easy to find. Basically, I'd say try 0.12.4, it should have a
  few fixes in this area over 0.12.3.

 I was testing with qemu-kvm 0.12.4 but I'm having the same problem. I
 re-read the announce about qemu-kvm 0.12.4 sent by Avi, but I didn't
 find in it some reference about fixes related to Virtio.
 
 If there's anything else I can try, please tell me.

I found some references to similar problems:

http://forum.proxmox.com/threads/3117-virtio-net-crashing-(stop-sending-traffic)
http://www.mail-archive.com/kvm@vger.kernel.org/msg26033.html
http://marc.info/?l=kvmm=126564542625725w=2


In particular, I note the patches provided by Tom Lendacky and Herbert
Xu. Are these patches applied to qemu-kvm 0.12.3 or 0.12.4? If so, it
could be a regression?


Thanks in advance for your reply.

Regards,
Daniel
-- 
Fingerprint: BFB3 08D6 B4D1 31B2 72B9  29CE 6696 BF1B 14E6 1D37
Powered by Debian GNU/Linux Lenny - Linux user #188.598


signature.asc
Description: Digital signature

Re: kvm_queue_exception - what is going wrong here?

2010-06-16 Thread Gleb Natapov

On Wed, Jun 16, 2010 at 12:05:10PM +0200, Sebastian Hetze wrote:
 Hi *,
 
 we recently have encountered kvm_queue_exception events in dmesg.
 What is causing these exceptions? It appears that the guest is
 at least suffering performance loss. Is the qemu-system-x86 hang
 event linked to the exception? Is there anything we can do
 to prevent these exceptions to happen?
 
What is your guest? It look like it tries to switch task and tss it tries
to switch to is broken, so task switch emulation code encounters multiple
exception and it complains about it.  Theoretically we should abort task
switch after fist exception condition (set_cr3: #GP, reserved bits),
but error handling in task switch emulation code is far from perfect
especially in such old kernel like 2.6.31.

 Best regards,
 
   Sebastian
 
 -
 [1811816.496620] set_cr3: #GP, reserved bits
 [1811816.496624] [ cut here ]
 [1811816.496640] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 
 kvm_queue_exception_e+0x61/0x70 [kvm
 ]()
 [1811816.496643] Hardware name: S5520HC
 [1811816.496644] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat 
 nf_nat nf_conntrack_ipv4 nf_defrag_ipv
 4 xt_state nf_conntrack ipt_REJECT xt_tcpudp ppdev iptable_filter ip_tables 
 lp parport x_tables kvm_intel kvm
 bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt 
 iTCO_vendor_support pcspkr joydev ses encl
 osure usbhid bnx2 aacraid e1000e igb dca fbcon tileblit font bitblit 
 softcursor
 [1811816.496669] Pid: 26170, comm: qemu-system-x86 Not tainted 
 2.6.31-14-server #48
 [1811816.496672] Call Trace:
 [1811816.496682]  [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm]
 [1811816.496689]  [8105f458] warn_slowpath_common+0x78/0xd0
 [1811816.496692]  [8105f4bf] warn_slowpath_null+0xf/0x20
 [1811816.496701]  [a0140651] kvm_queue_exception_e+0x61/0x70 [kvm]
 [1811816.496711]  [a01406ff] 
 load_guest_segment_descriptor+0x9f/0xb0 [kvm]
 [1811816.496721]  [a0140753] kvm_load_segment_descriptor+0x43/0x110 
 [kvm]
 [1811816.496730]  [a0140a20] ? kvm_inject_gp+0x10/0x20 [kvm]
 [1811816.496740]  [a0140bed] ? kvm_set_cr3+0x15d/0x170 [kvm]
 [1811816.496749]  [a0140d67] load_state_from_tss32+0x167/0x240 [kvm]
 [1811816.496758]  [a0142a69] kvm_task_switch_32+0x119/0x130 [kvm]
 [1811816.496768]  [a0142f93] kvm_task_switch+0x153/0x300 [kvm]
 [1811816.496777]  [a0137df1] ? gfn_to_hva+0x11/0x90 [kvm]
 [1811816.496786]  [a0138122] ? kvm_read_guest_page+0x62/0x70 [kvm]
 [1811816.496792]  [a016cbab] handle_task_switch+0x6b/0x160 
 [kvm_intel]
 [1811816.496796]  [a0169b45] vmx_handle_exit+0xf5/0x280 [kvm_intel]
 [1811816.496801]  [a016d822] ? vmx_vcpu_run+0x292/0x856 [kvm_intel]
 [1811816.496811]  [a013ddbf] vcpu_enter_guest+0x2cf/0x600 [kvm]
 [1811816.496814]  [8107a12e] ? finish_wait+0x5e/0x80
 [1811816.496819]  [81541710] ? _spin_lock_irq+0x10/0x20
 [1811816.496822]  [81541603] ? __down_read+0xc3/0xce
 [1811816.496825]  [81079fa0] ? autoremove_wake_function+0x0/0x40
 [1811816.496834]  [a013e153] __vcpu_run+0x63/0x330 [kvm]
 [1811816.496844]  [a01440ab] kvm_arch_vcpu_ioctl_run+0x8b/0x1f0 
 [kvm]
 [1811816.496853]  [a01367f2] kvm_vcpu_ioctl+0x2e2/0x630 [kvm]
 [1811816.496856]  [8106e99f] ? dequeue_signal+0x9f/0x180
 [1811816.496859]  [81131d41] vfs_ioctl+0x31/0xa0
 [1811816.496862]  [811321b3] do_vfs_ioctl+0x373/0x400
 [1811816.496864]  [811322d9] sys_ioctl+0x99/0xa0
 [1811816.496862]  [811321b3] do_vfs_ioctl+0x373/0x400
 [1811816.496864]  [811322d9] sys_ioctl+0x99/0xa0
 [1811816.496867]  [8106d3e9] ? do_sigpending+0xa9/0xc0
 [1811816.496870]  [81012082] system_call_fastpath+0x16/0x1b
 [1811816.496872] ---[ end trace c6ea0da2c3c9d0e4 ]---
 [1811816.496887] set_cr3: #GP, pdptrs reserved bits
 [1811816.496889] [ cut here ]
 [1811816.496898] WARNING: at /usr/src/linux-2.6.31/arch/x86/kvm/x86.c:202 
 kvm_queue_exception_e+0x61/0x70 [kvm]()
 [1811816.496900] Hardware name: S5520HC
 [1811816.496901] Modules linked in: drbd tun ipt_MASQUERADE iptable_nat 
 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT 
 xt_tcpudp ppdev iptable_filter ip_tables lp parport x_tables kvm_intel kvm 
 bridge stp bonding e100 via_rhine 3c59x 8139too mii shpchp iTCO_wdt 
 iTCO_vendor_support pcspkr joydev ses enclosure usbhid bnx2 aacraid e1000e 
 igb dca fbcon tileblit font bitblit softcursor
 [1811816.496923] Pid: 26170, comm: qemu-system-x86 Tainted: GW  
 2.6.31-14-server #48
 [1811816.496925] Call Trace:
 [1811816.496934]  [a0140651] ? kvm_queue_exception_e+0x61/0x70 [kvm]
 [1811816.496937]  [8105f458] warn_slowpath_common+0x78/0xd0
 [1811816.496940]  [8105f4bf] warn_slowpath_null+0xf/0x20

Re: RFC qdev path semantics (was: [Qemu-devel] [RFC PATCH 0/5] Introduce canonical device hierarchy string)

2010-06-16 Thread Paul Brook

 * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM
   is the bus number, as above.
 
 ### Paul proposes to either drop TYPE.NUM (and require drivers to
 provide bus names), or make NUM count separately for each bus type.

I revised this proposal: Drop the .NUM part, and require drivers provide a bus 
name if TYPE would result in ambiguous names.

Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread huang ying

On Tue, Jun 15, 2010 at 7:22 PM, Avi Kivity a...@redhat.com wrote:
 btw, is_hwpoison_address() is racy.  While it looks up the address, some
 other task can unmap the page tables under us.

 Andi/Huang?

 One way of fixing it is get_user_pages_ptes_fast(), which also returns the
 pte, also atomically.  I want it for other reasons as well (respond to a
 read fault by gupping the page for read, but allowing write access if the
 pte indicates it is writeable).

Yes. is_hwpoison_address() is racy. But I think it is not absolutely
necessary to call is_hwpoison_address() in hva_to_pfn_atomic(), is it?

For is_hwpoison_address() in hva_to_pfn(), we can protect it with mmap_sem.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/6] KVM: MMU: introduce gfn_to_page_atomic() and gfn_to_pfn_atomic()

2010-06-16 Thread Avi Kivity


On 06/16/2010 01:51 PM, huang ying wrote:

On Tue, Jun 15, 2010 at 7:22 PM, Avi Kivitya...@redhat.com  wrote:
   

btw, is_hwpoison_address() is racy.  While it looks up the address, some
other task can unmap the page tables under us.

Andi/Huang?

One way of fixing it is get_user_pages_ptes_fast(), which also returns the
pte, also atomically.  I want it for other reasons as well (respond to a
read fault by gupping the page for read, but allowing write access if the
pte indicates it is writeable).
 

Yes. is_hwpoison_address() is racy. But I think it is not absolutely
necessary to call is_hwpoison_address() in hva_to_pfn_atomic(), is it?
   


We can probably ignore it, yes.


For is_hwpoison_address() in hva_to_pfn(), we can protect it with mmap_sem.
   


Not very appealing, but should work.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/24] Implement VMXON and VMXOFF

2010-06-16 Thread Nadav Har'El

Hi,

On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 3/24] Implement VMXON 
and VMXOFF:
 On 06/13/2010 03:24 PM, Nadav Har'El wrote:
 This patch allows a guest to use the VMXON and VMXOFF instructions, and
 emulates them accordingly. Basically this amounts to checking some
 prerequisites, and then remembering whether the guest has enabled or 
 disabled
 VMX operation.
 
 Should probably reorder with next patch.

I can't do this if I want the code to compile after each patch, because the
next patch (controlling when setting cr4.VMXE can be set) needs to check
whether VMXON was done.

 Please (here and elsewhere) use the standard kernel style for multiline 
 comments - start with /* on a line by itself.

Sure, sorry about that. I guess I need to (re)read the Linux coding style
document.

 +vmx-nested.vmxon = 1;

 = true

I'll change that. I learned C more than a decade before the advent of
stdbool.h, so in my mind, 1 has always been, and still is, the right and
only way to write true... But of course it doesn't mean I need to inflict
my old style on everybody else ;-)

 Need to block INIT signals in the local apic as well (fine for a 
 separate patch).

I've been looking into how I might best go about achieving this.

The APIC_DM_INIT handler is in lapic.c, which is not aware of VMX or
(obviously) nested VMX. So I need to add some sort of generic block INIT
flag which that code will check. Is this the sort of fix you had in mind?

A different change could be to write a handler for exit reason 3, which we 
get if there's a real INIT signal in the host; If we get exit reason 3 from
L2, we need to exit to L1 to handle it, while if we get exit reason 3 from
L1 that has done VMXON, we simply need to do nothing (according to the spec).

So I'm not sure which of these two things is what we really about. What kind
of scenario did you have in mind where this INIT business is relevant?



Here is the patch with your above comments fixed *except* the INIT thing:

---
Subject: [PATCH 3/24] Implement VMXON and VMXOFF

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/vmx.c  2010-06-16 13:20:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2010-06-16 13:20:19.0 +0300
@@ -125,6 +125,17 @@ struct shared_msr_entry {
u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+   /* Has the level1 guest done vmxon? */
+   bool vmxon;
+};
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -176,6 +187,9 @@ struct vcpu_vmx {
u32 exit_reason;
 
bool rdtscp_enabled;
+
+   /* Support for guest hypervisors (nested VMX) */
+   struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3361,6 +3375,90 @@ static int handle_vmx_insn(struct kvm_vc
return 1;
 }
 
+/*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called VMXON pointer) because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that the their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+   struct kvm_segment cs;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /* The Intel VMX Instruction Reference lists a bunch of bits that
+* are prerequisite to running VMXON, most notably CR4.VMXE must be
+* set to 1. Otherwise, we should fail with #UD. We test these now:
+*/
+   if (!nested ||
+   !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+   !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+   (vmx_get_rflags(vcpu)  X86_EFLAGS_VM)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmx_get_segment(vcpu, cs, VCPU_SREG_CS);
+   if (is_long_mode(vcpu)  !cs.l) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (vmx_get_cpl(vcpu)) {
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+   }
+
+   vmx-nested.vmxon = true;
+
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly

Re: [PATCH 3/24] Implement VMXON and VMXOFF

2010-06-16 Thread Avi Kivity


On 06/16/2010 02:14 PM, Nadav Har'El wrote:

Hi,

On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 3/24] Implement VMXON and 
VMXOFF:
   

On 06/13/2010 03:24 PM, Nadav Har'El wrote:
 

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or
disabled
VMX operation.
   

Should probably reorder with next patch.
 

I can't do this if I want the code to compile after each patch, because the
next patch (controlling when setting cr4.VMXE can be set) needs to check
whether VMXON was done.
   


You can have this patch add the vmxon check.  But it doesn't matter too 
much, you can keep the current order.



Need to block INIT signals in the local apic as well (fine for a
separate patch).
 

I've been looking into how I might best go about achieving this.

The APIC_DM_INIT handler is in lapic.c, which is not aware of VMX or
(obviously) nested VMX. So I need to add some sort of generic block INIT
flag which that code will check. Is this the sort of fix you had in mind?
   


It's not enough to block INIT, there is also exit reason 3, INIT 
signal.  So you need to call x86.c code from the lapic, which needs to 
call a kvm_x86_op hook which lets vmx.c decide whether the INIT needs to 
be intercepted or not, and what to do with it (ignore in root mode, exit 
in non-root mode)


Note the check needs to be done in vcpu context, not during delivery as 
it is done now.  So we probably need a KVM_REQ_INIT bit in 
vcpu-requests, which we can check during guest entry where we know if 
we're in root or non-root mode.


Pretty complicated and esoteric.  We can defer this now while we work 
out more immediate issues, but it needs to be addressed.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC qdev path semantics

2010-06-16 Thread Jan Kiszka

Markus Armbruster wrote:
 A number of changes to qdev paths have been proposed in various threads.
 It's becoming harder to keep track of them, so let me sum them up in one
 place.  Please correct me if I misrepresent your ideas.
 
 I'm going to describe the current state of things, and the proposed
 changes (marked with ###).
 
 
 The device tree has the main system bus as root.  A child of a bus is a
 device.  A child of a device is a bus.
 
 A qdev path consists of qdev path components separated by '/'.  It
 resolves to a node in the device tree, either bus or device.
 
 The qdev path / resolves to the root, i.e. the main system bus.

Another aspect: A path may start with an arbitrary bus name, not only
the system bus. Although this is ambiguous, we need to keep it for
addressing the bus itself due to existing client use. But, IMO, we
should at least start deprecating this for addressing elements below
that bus (e.g. pci.0/e1000).

And besides specifying devices via absolute qdev paths, we also support
addressing them via their ID if present. I checked if ID/BUS[/...] was
supported so far, but it wasn't. So I did not propose this yet although
it might be a useful abbreviation.

 
 The qdev path IDENT, where IDENT is an identifier, resolves to the
 device whose id is IDENT.
 
 If PATH resolves to device DEV, and a child of DEV has the name IDENT,
 then we resolve to that bus.
 
 Bus names are chosen by the system as follows:
 
 * If the driver of the parent device model provides a name, use that.
 
 * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus
   number, counting from zero in creation order.
 
 * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM
   is the bus number, as above.
 
 ### Paul proposes to drop ID.NUM.
 
 ### Paul proposes to either drop TYPE.NUM (and require drivers to
 provide bus names), or make NUM count separately for each bus type.
 
 If PATH resolves to bus BUS, then we resolve PATH/IDENT as follows:
 
 * If a child of BUS has id IDENT, then we resolve to that device.
 
   ### Jan proposes to drop this rule.
 
 * Else, if a child of BUS has a driver with name IDENT, then we resolve
   to that device.
 
   If more than one exist, resolve to the first one.  This assumes
   children are ordered.  Order is the same as in info qtree.
   Currently, it's reverse creation order.
 
   This is *not* a stable address.
 
 * Else, if a child of BUS has a driver with alias name IDENT, then we
   resolve to that device.
 
   If more than one exist, resolve to the first one.  This assumes
   children are ordered.  Order is the same as in info qtree.
   Currently, it's reverse creation order.
 
   This is *not* a stable address.
 
 ### I propose: we resolve PATH/@BUS-ADDR to the child of BUS with bus
 address BUS-ADDR, if devices on this type of bus have bus addresses.
 The format of BUS-ADDR depends on the bus.
 
 ### Paul proposes to require all buses to define bus addresses.  Make
 one up if necessary.
 
 ### Jan proposes: we resolve PATH/IDENT.NUM as follows:
 
 * If a child of BUS has a driver with name IDENT and an instance
   number NUM, then we resolve to that device.
 
   Need a suitable definition of instance number.
 
   Jan proposes to number devices with the same driver on the same
   bus.  This assumes children are ordered, see above.
 
   This is *not* a stable address if the bus supports hot-plug.
 
   I propose to us bus-address as instance number.  Works best
   together with Paul's proposal to define bus addresses.  Syntax
   id...@bus-addr makes more sense then.

I would be fine with this scheme, but I assume we still need instance
numbers as fallback for buses without any usable addressing. Example: I
have a patch queued that uses this for internal addressing of all hpet
devices on the system bus (to connect them to the ISA IRQs).

 
 * Else, same with driver alias name.
 
 ### Here's a possible synthesis of the above three proposals: require
 bus addresses, and permit any of
 
 PATH/IDENT
 PATH/@BUS-ADDR
 PATH/id...@bus-addr
 
 PATH/IDENT can't address instances that don't come first.

PATH/IDENT[.INSTANCE] would resolve the addressability.

 
 IDENT in PATH/id...@bus-addr is redundant.  Therefore, it can't be
 the canonical qdev path.  That's fine, PATH/@BUS-ADDR serves.
 
 If the above rules resolve PATH to a device in a context where we expect
 a bus, and the device has exactly one bus, resolve it to that bus
 instead.
 
 ### Jan and I propose to drop this rule.

Thanks for this summary!

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-16 Thread Avi Kivity


On 06/15/2010 05:47 PM, Dave Hansen wrote:



That's a bug that needs to be fixed.  Eventually the host will come
under pressure and will balloon the guest.  If that kills the guest, the
ballooning is not effective as a host memory management technique.
 

I'm not convinced that it's just a bug that can be fixed.  Consider a
case where a host sees a guest with 100MB of free memory at the exact
moment that a database app sees that memory.  The host tries to balloon
that memory away at the same time that the app goes and allocates it.
That can certainly lead to an OOM very quickly, even for very small
amounts of memory (much less than 100MB).  Where's the bug?

I think the issues are really fundamental to ballooning.
   


There are two issues involved.

One is, can the kernel accurately determine the amount of memory it 
needs to work?  We have resources such as RAM and swap.  We have 
liabilities in the form of swappable userspace memory, mlocked userspace 
memory, kernel memory to support these, and various reclaimable and 
non-reclaimable kernel caches.  Can we determine the minimum amount of 
RAM to support are workload at a point in time?


If we had this, we could modify the balloon to refuse to balloon if it 
takes the kernel beneath the minimum amount of RAM needed.


In fact, this is similar to allocating memory with overcommit_memory = 
0.  The difference is the balloon allocates mlocked memory, while normal 
allocations can be charged against swap.  But fundamentally it's the same.



If all the guests do this, then it leaves that much more free memory on
the host, which can be used flexibly for extra host page cache, new
guests, etc...
   

If the host detects lots of pagecache misses it can balloon guests
down.  If pagecache is quiet, why change anything?
 

Page cache misses alone are not really sufficient.  This is the classic
problem where we try to differentiate streaming I/O (which we can't
effectively cache) from I/O which can be effectively cached.
   


True.  Random I/O across a very large dataset is also difficult to cache.


If the host wants to start new guests, it can balloon guests down.  If
no new guests are wanted, why change anything?
 

We're talking about an environment which we're always trying to
optimize.  Imagine that we're always trying to consolidate guests on to
smaller numbers of hosts.  We're effectively in a state where we
_always_ want new guests.
   


If this came at no cost to the guests, you'd be right.  But at some 
point guest performance will be hit by this, so the advantage gained 
from freeing memory will be balanced by the disadvantage.


Also, memory is not the only resource.  At some point you become cpu 
bound; at that point freeing memory doesn't help and in fact may 
increase your cpu load.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC qdev path semantics

2010-06-16 Thread Paul Brook

 Markus Armbruster wrote:
  A number of changes to qdev paths have been proposed in various threads.
  It's becoming harder to keep track of them, so let me sum them up in one
  place.  Please correct me if I misrepresent your ideas.
  
  I'm going to describe the current state of things, and the proposed
  changes (marked with ###).
  
  
  The device tree has the main system bus as root.  A child of a bus is a
  device.  A child of a device is a bus.
  
  A qdev path consists of qdev path components separated by '/'.  It
  resolves to a node in the device tree, either bus or device.
  
  The qdev path / resolves to the root, i.e. the main system bus.
 
 Another aspect: A path may start with an arbitrary bus name, not only
 the system bus. Although this is ambiguous, we need to keep it for
 addressing the bus itself due to existing client use. But, IMO, we
 should at least start deprecating this for addressing elements below
 that bus (e.g. pci.0/e1000).

I think this would be better served by adding explicit aliases/IDs for those 
use-cases. i.e. define the global ID pci.0 to be an alias for
 /i440FX-pcihost/pci

Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: KVM call minutes for June 15

2010-06-16 Thread Arnd Bergmann

On Wednesday 16 June 2010, Markus Armbruster wrote:
 Can't hurt reviewer motivation.  Could it be automated?  Find replies,
 extract tags.  If you want your acks to be picked up, you better make
 sure your References header works, and your tags are formatted
 correctly.

I think pwclient (https://patchwork.kernel.org/) can do this for you.

Arnd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC qdev path semantics

2010-06-16 Thread Jan Kiszka

Paul Brook wrote:
 Markus Armbruster wrote:
 A number of changes to qdev paths have been proposed in various threads.
 It's becoming harder to keep track of them, so let me sum them up in one
 place.  Please correct me if I misrepresent your ideas.

 I'm going to describe the current state of things, and the proposed
 changes (marked with ###).


 The device tree has the main system bus as root.  A child of a bus is a
 device.  A child of a device is a bus.

 A qdev path consists of qdev path components separated by '/'.  It
 resolves to a node in the device tree, either bus or device.

 The qdev path / resolves to the root, i.e. the main system bus.
 Another aspect: A path may start with an arbitrary bus name, not only
 the system bus. Although this is ambiguous, we need to keep it for
 addressing the bus itself due to existing client use. But, IMO, we
 should at least start deprecating this for addressing elements below
 that bus (e.g. pci.0/e1000).
 
 I think this would be better served by adding explicit aliases/IDs for those 
 use-cases. i.e. define the global ID pci.0 to be an alias for
  /i440FX-pcihost/pci

Makes sense. We could attach this ID to the BusState (corresponding to
DeviceState:id) and manually set it during machine init.
qbus_find_recursive would then look for a matching ID instead of a name.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-06-16 Thread Kevin Wolf

Am 04.06.2010 16:06, schrieb Kevin Wolf:
 Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger:
 From: Nicholas Bellinger n...@linux-iscsi.org

 This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and 
 PERSISTENT_RESERVE_IN
 case in scsi_req_length() to extra the incoming buffer length into 
 SCSIRequest-cmd.xfer,
 and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in 
 order to properly
 set SCSI_XFER_TO_DEV for WRITE data.

 Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop 
 target ports.

 Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
 ---
  hw/scsi-bus.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)

 diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
 index b8e4b71..75ec74e 100644
 --- a/hw/scsi-bus.c
 +++ b/hw/scsi-bus.c
 @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t 
 *cmd)
  case INQUIRY:
  req-cmd.xfer = cmd[4] | (cmd[3]  8);
  break;
 +case PERSISTENT_RESERVE_OUT:
 +case PERSISTENT_RESERVE_IN:
 +req-cmd.xfer = cmd[8] | (cmd[7]  8);
 
 Maybe I'm missing something, but isn't exactly the same value set in the
 switch block above? (for cmd[0]  5 == 2)

Nicholas? This isn't applied yet because I'm waiting for your answer.

Is there a reason why it makes sense to do it explicitly here instead
using the generic code a few lines above? I think the same applied to
patch 2/2.

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: RFC qdev path semantics

2010-06-16 Thread Paul Brook

  I think this would be better served by adding explicit aliases/IDs for
  those use-cases. i.e. define the global ID pci.0 to be an alias for
  
   /i440FX-pcihost/pci
 
 Makes sense. We could attach this ID to the BusState (corresponding to
 DeviceState:id) and manually set it during machine init.
 qbus_find_recursive would then look for a matching ID instead of a name.

If we accept your proposal that global ids are not accepted as local device 
identifiers, then these probably shouldn't be stored in the device tree at 
all.

Allowing at most one global ID per node also seems like a rather arbitrary 
limitation.

How about having a global alias table that maps an ID to a devtree path?
The code to handle these becomes a straight string substitution followed by a 
normal path lookup. We don't need separate code for device v.s. bus, and 
global_device_id/busname just works.

Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1

2010-06-16 Thread Nadav Har'El

On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 5/24] Introduce 
vmcs12: a VMCS structure for L1:
 +struct __attribute__ ((__packed__)) vmcs12 {
 +/* According to the Intel spec, a VMCS region must start with the
 + * following two fields. Then follow implementation-specific data.
 + */
 +u32 revision_id;
 +u32 abort;
 +};

 
 Note that this structure becomes an ABI, it cannot change except in a 
 backward compatible way due to the need for live migration.  So I'd like 
 a documentation patch that adds a description of the content to 
 Documentation/kvm/.  It can be as simple as listing the structure 
 definition.

I agree that if struct vmcs12 is changed, this will cause problems for live
migration, but why does this mean that the struct's fields or layout an ABI
worth documenting?
After all, isn't the idea of VMCS that its internal content and layout
is opaque for the L1 guest - he can only read/write it with VMREAD/VMWRITE,
and those two instructions are the ABI (which is of course documented in the
Intel spec) - not the content of the vmcs12 structure. Even if the guest knew
the exact layout of this structure, he's not supposed to use it.

By the way, we have not actually checked that live migration is working
as expected with nested virtualization running. I expect there to be more
pitfalls and bugs even before we consider migration between two different
versions. We would indeed like to allow live migration of different kinds
(of L1 with all its L2 guests; Of all L2 guests of a L1; Of a single L2
guest), but we're trying to finish the more basic functionality first.

Thanks,
Nadav.


-- 
Nadav Har'El|Wednesday, Jun 16 2010, 4 Tammuz 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |It is error alone which needs the support
http://nadav.harel.org.il   |of government. Truth can stand by itself.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/3] Monitor support QEMU trace framework

2010-06-16 Thread Prerna Saxena

This is v3 of a set of patches to enable trace visualization  control 
via the QEMU monitor. It is based on trace infrastructure posted by 
Stefan :
( http://lists.gnu.org/archive/html/qemu-devel/2010-05/msg02407.html )


This patchset adds monitor commands :
- info trace : to view current contents of the trace buffer
- info tracepoints : to view all available tracepoints and their state.
- tracepoint NAME on|off : to enable/disable the logging of data
 from tracepoint 'NAME' to on/off.

Changelog :
- Clean-ups from v2, particularly relating to export of tdb_hash()

-- 
Prerna Saxena

Linux Technology Centre,
IBM Systems and Technology Lab,
Bangalore, India

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] Export hash function

2010-06-16 Thread Prerna Saxena

Rename tdb_hash() to qemu_hash().
Move definition from qdict.c to a new file qemu-misc.c for use by tracing
infrastructure.

Signed-off-by: Prerna Saxena pre...@linux.vnet.ibm.com
---

 Makefile.objs |2 +-
 qdict.c   |   24 
 qemu-common.h |3 +++
 qemu-misc.c   |   24 
 4 files changed, 32 insertions(+), 21 deletions(-)
 create mode 100644 qemu-misc.c


diff --git a/Makefile.objs b/Makefile.objs
index 7cb40ac..53e3a65 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -1,6 +1,6 @@
 ###
 # QObject
-qobject-obj-y = qint.o qstring.o qdict.o qlist.o qfloat.o qbool.o
+qobject-obj-y = qint.o qstring.o qdict.o qemu-misc.o qlist.o qfloat.o qbool.o
 qobject-obj-y += qjson.o json-lexer.o json-streamer.o json-parser.o
 qobject-obj-y += qerror.o
 
diff --git a/qdict.c b/qdict.c
index 175bc17..d4588ef 100644
--- a/qdict.c
+++ b/qdict.c
@@ -53,22 +53,6 @@ QDict *qobject_to_qdict(const QObject *obj)
 }
 
 /**
- * tdb_hash(): based on the hash agorithm from gdbm, via tdb
- * (from module-init-tools)
- */
-static unsigned int tdb_hash(const char *name)
-{
-unsigned value;/* Used to compute the hash value.  */
-unsigned   i;  /* Used to cycle through random values. */
-
-/* Set the initial value from the key size. */
-for (value = 0x238F13AF * strlen(name), i=0; name[i]; i++)
-value = (value + (((const unsigned char *)name)[i]  (i*5 % 24)));
-
-return (1103515243 * value + 12345);
-}
-
-/**
  * alloc_entry(): allocate a new QDictEntry
  */
 static QDictEntry *alloc_entry(const char *key, QObject *value)
@@ -113,7 +97,7 @@ void qdict_put_obj(QDict *qdict, const char *key, QObject 
*value)
 unsigned int hash;
 QDictEntry *entry;
 
-hash = tdb_hash(key) % QDICT_HASH_SIZE;
+hash = qemu_hash(key) % QDICT_HASH_SIZE;
 entry = qdict_find(qdict, key, hash);
 if (entry) {
 /* replace key's value */
@@ -137,7 +121,7 @@ QObject *qdict_get(const QDict *qdict, const char *key)
 {
 QDictEntry *entry;
 
-entry = qdict_find(qdict, key, tdb_hash(key) % QDICT_HASH_SIZE);
+entry = qdict_find(qdict, key, qemu_hash(key) % QDICT_HASH_SIZE);
 return (entry == NULL ? NULL : entry-value);
 }
 
@@ -148,7 +132,7 @@ QObject *qdict_get(const QDict *qdict, const char *key)
  */
 int qdict_haskey(const QDict *qdict, const char *key)
 {
-unsigned int hash = tdb_hash(key) % QDICT_HASH_SIZE;
+unsigned int hash = qemu_hash(key) % QDICT_HASH_SIZE;
 return (qdict_find(qdict, key, hash) == NULL ? 0 : 1);
 }
 
@@ -347,7 +331,7 @@ void qdict_del(QDict *qdict, const char *key)
 {
 QDictEntry *entry;
 
-entry = qdict_find(qdict, key, tdb_hash(key) % QDICT_HASH_SIZE);
+entry = qdict_find(qdict, key, qemu_hash(key) % QDICT_HASH_SIZE);
 if (entry) {
 QLIST_REMOVE(entry, next);
 qentry_destroy(entry);
diff --git a/qemu-common.h b/qemu-common.h
index a4888e5..d225e45 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -17,6 +17,9 @@ typedef struct QEMUTimer QEMUTimer;
 typedef struct QEMUFile QEMUFile;
 typedef struct QEMUBH QEMUBH;
 
+/* Hash function definition */
+unsigned int qemu_hash(const char *name);
+
 /* Hack around the mess dyngen-exec.h causes: We need QEMU_NORETURN in files 
that
cannot include the following headers without conflicts. This condition has
to be removed once dyngen is gone. */
diff --git a/qemu-misc.c b/qemu-misc.c
new file mode 100644
index 000..a69196a
--- /dev/null
+++ b/qemu-misc.c
@@ -0,0 +1,24 @@
+/*
+ * Definition of tdb_hash() moved here, from qdict.c. Renamed to qemu_hash()
+ * 
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+#include qemu-common.h
+
+/**
+ * tdb_hash(): based on the hash agorithm from gdbm, via tdb
+ * (from module-init-tools). Renamed to qemu_hash(). 
+ */
+unsigned int qemu_hash(const char *name)
+{
+unsigned value; /* Used to compute the hash value.  */
+unsigned   i;   /* Used to cycle through random values. */
+
+/* Set the initial value from the key size. */
+for (value = 0x238F13AF * strlen(name), i=0; name[i]; i++)
+value = (value + (((const unsigned char *)name)[i]  (i*5 % 24)));
+
+return (1103515243 * value + 12345);
+}


-- 
Prerna Saxena

Linux Technology Centre,
IBM Systems and Technology Lab,
Bangalore, India

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] Monitor command 'info trace'

2010-06-16 Thread Prerna Saxena

Monitor command 'info trace' to display contents of trace buffer

Signed-off-by: Prerna Saxena pre...@linux.vnet.ibm.com
---

 configure   |3 +++
 monitor.c   |   12 
 qemu-monitor.hx |4 
 simpletrace.c   |   13 +
 tracetool   |2 ++
 5 files changed, 34 insertions(+), 0 deletions(-)


diff --git a/configure b/configure
index 675d0fc..56af8dd 100755
--- a/configure
+++ b/configure
@@ -2302,6 +2302,9 @@ bsd)
 esac
 
 echo TRACE_BACKEND=$trace_backend  $config_host_mak
+if test $trace_backend = simple; then
+  echo CONFIG_SIMPLE_TRACE=y  $config_host_mak
+fi
 if test $trace_backend = ust; then
   LIBS=-lust $LIBS
 fi
diff --git a/monitor.c b/monitor.c
index ad50f12..8b60830 100644
--- a/monitor.c
+++ b/monitor.c
@@ -55,6 +55,9 @@
 #include json-streamer.h
 #include json-parser.h
 #include osdep.h
+#ifdef CONFIG_SIMPLE_TRACE
+#include trace.h
+#endif
 
 //#define DEBUG
 //#define DEBUG_COMPLETION
@@ -2780,6 +2783,15 @@ static const mon_cmd_t info_cmds[] = {
 .help   = show roms,
 .mhandler.info = do_info_roms,
 },
+#if defined(CONFIG_SIMPLE_TRACE)
+{
+.name   = trace,
+.args_type  = ,
+.params = ,
+.help   = show current contents of trace buffer,
+.mhandler.info = do_info_trace,
+},
+#endif
 {
 .name   = NULL,
 },
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index b6e3467..766c30f 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -114,6 +114,10 @@ show migration status
 show balloon information
 @item info qtree
 show device tree
+#ifdef CONFIG_SIMPLE_TRACE
+...@item info trace
+show contents of trace buffer
+#endif
 @end table
 ETEXI
 
diff --git a/simpletrace.c b/simpletrace.c
index 2fec4d3..239ae3f 100644
--- a/simpletrace.c
+++ b/simpletrace.c
@@ -62,3 +62,16 @@ void trace4(TraceEvent event, unsigned long x1, unsigned 
long x2, unsigned long
 void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned 
long x3, unsigned long x4, unsigned long x5) {
 trace(event, x1, x2, x3, x4, x5);
 }
+
+void do_info_trace(Monitor *mon)
+{
+unsigned int i, max_idx;
+
+max_idx = trace_idx ? trace_idx : TRACE_BUF_LEN;
+
+for (i=0; imax_idx ;i++) {
+monitor_printf(mon, Event %ld : %ld %ld %ld %ld %ld\n,
+  trace_buf[i].event, trace_buf[i].x1, trace_buf[i].x2,
+trace_buf[i].x3, trace_buf[i].x4, trace_buf[i].x5);
+}
+}
diff --git a/tracetool b/tracetool
index 9ea9c08..2c73bab 100755
--- a/tracetool
+++ b/tracetool
@@ -130,6 +130,7 @@ void trace2(TraceEvent event, unsigned long x1, unsigned 
long x2);
 void trace3(TraceEvent event, unsigned long x1, unsigned long x2, unsigned 
long x3);
 void trace4(TraceEvent event, unsigned long x1, unsigned long x2, unsigned 
long x3, unsigned long x4);
 void trace5(TraceEvent event, unsigned long x1, unsigned long x2, unsigned 
long x3, unsigned long x4, unsigned long x5);
+void do_info_trace(Monitor *mon);
 EOF
 
 simple_event_num=0
@@ -289,6 +290,7 @@ tracetoh()
 #define TRACE_H
 
 #include qemu-common.h
+#include monitor.h
 EOF
 convert h
 echo #endif /* TRACE_H */


-- 
Prerna Saxena

Linux Technology Centre,
IBM Systems and Technology Lab,
Bangalore, India

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] Toggle tracepoint state

2010-06-16 Thread Prerna Saxena

This patch adds support for dynamically enabling/disabling of tracepoints.
This is done by internally maintaining each tracepoint's state, and 
permitting logging of data from a tracepoint only if it is in an 
'active' state.

Monitor commands added :
1) info tracepoints : to view all available tracepoints and 
  their state.
2) tracepoint NAME on|off   : to enable/disable data logging from a 
  given tracepoint.
  Eg, tracepoint paio_submit off 
disables logging of data when 
paio_submit is hit.

Signed-off-by: Prerna Saxena pre...@linux.vnet.ibm.com
---

 monitor.c   |   16 ++
 qemu-monitor.hx |   18 
 simpletrace.c   |   63 +++
 tracetool   |   30 +++---
 vl.c|6 +
 5 files changed, 129 insertions(+), 4 deletions(-)


diff --git a/monitor.c b/monitor.c
index 8b60830..238bdf0 100644
--- a/monitor.c
+++ b/monitor.c
@@ -548,6 +548,15 @@ static void do_commit(Monitor *mon, const QDict *qdict)
 }
 }
 
+#ifdef CONFIG_SIMPLE_TRACE
+static void do_change_tracepoint_state(Monitor *mon, const QDict *qdict)
+{
+const char *tp_name = qdict_get_str(qdict, name);
+bool new_state = qdict_get_bool(qdict, option);
+change_tracepoint_state(tp_name, new_state);
+}
+#endif
+
 static void user_monitor_complete(void *opaque, QObject *ret_data)
 {
 MonitorCompletionData *data = (MonitorCompletionData *)opaque; 
@@ -2791,6 +2800,13 @@ static const mon_cmd_t info_cmds[] = {
 .help   = show current contents of trace buffer,
 .mhandler.info = do_info_trace,
 },
+{
+.name   = tracepoints,
+.args_type  = ,
+.params = ,
+.help   = show available tracepoints  their state,
+.mhandler.info = do_info_all_tracepoints,
+},
 #endif
 {
 .name   = NULL,
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index 766c30f..8540b8f 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -117,6 +117,8 @@ show device tree
 #ifdef CONFIG_SIMPLE_TRACE
 @item info trace
 show contents of trace buffer
+...@item info tracepoints
+show available tracepoints and their state
 #endif
 @end table
 ETEXI
@@ -225,6 +227,22 @@ STEXI
 @item logfile @var{filename}
 @findex logfile
 Output logs to @var{filename}.
+#ifdef CONFIG_SIMPLE_TRACE
+ETEXI
+
+{
+.name   = tracepoint,
+.args_type  = name:s,option:b,
+.params = name on|off,
+.help   = changes status of a specific tracepoint,
+.mhandler.cmd = do_change_tracepoint_state,
+},
+
+STEXI
+...@item tracepoint
+...@findex tracepoint
+changes status of a tracepoint
+#endif
 ETEXI
 
 {
diff --git a/simpletrace.c b/simpletrace.c
index 239ae3f..4221a8f 100644
--- a/simpletrace.c
+++ b/simpletrace.c
@@ -3,6 +3,12 @@
 #include trace.h
 
 typedef struct {
+char *tp_name;
+bool state;
+unsigned int hash;
+} Tracepoint;
+
+typedef struct {
 unsigned long event;
 unsigned long x1;
 unsigned long x2;
@@ -18,11 +24,29 @@ enum {
 static TraceRecord trace_buf[TRACE_BUF_LEN];
 static unsigned int trace_idx;
 static FILE *trace_fp;
+static Tracepoint trace_list[NR_TRACEPOINTS];
+
+void init_tracepoint(const char *tname, TraceEvent tevent)
+{
+if (!tname || tevent  NR_TRACEPOINTS) {
+return;
+}
+
+trace_list[tevent].tp_name = (char*)qemu_malloc(strlen(tname)+1);
+strncpy(trace_list[tevent].tp_name, tname, strlen(tname));
+trace_list[tevent].hash = qemu_hash(tname);
+trace_list[tevent].state = 1; /* Enable all by default */
+}
 
 static void trace(TraceEvent event, unsigned long x1,
   unsigned long x2, unsigned long x3,
   unsigned long x4, unsigned long x5) {
 TraceRecord *rec = trace_buf[trace_idx];
+
+if (!trace_list[event].state) {
+return;
+}
+
 rec-event = event;
 rec-x1 = x1;
 rec-x2 = x2;
@@ -75,3 +99,42 @@ void do_info_trace(Monitor *mon)
 trace_buf[i].x3, trace_buf[i].x4, trace_buf[i].x5);
 }
 }
+
+void do_info_all_tracepoints(Monitor *mon)
+{
+unsigned int i;
+
+for (i=0; iNR_TRACEPOINTS; i++) {
+monitor_printf(mon, %s [Event ID %u] : state %u\n,
+trace_list[i].tp_name, i, trace_list[i].state);
+}
+}
+
+static Tracepoint* find_tracepoint_by_name(const char *tname)
+{
+unsigned int i, name_hash;
+
+if (!tname) {
+return NULL;
+}
+
+name_hash = qemu_hash(tname);
+
+for (i=0; iNR_TRACEPOINTS; i++) {
+if (trace_list[i].hash == name_hash 
+   !strncmp(trace_list[i].tp_name, tname, strlen(tname))) {
+return trace_list[i];
+}
+}
+return NULL; /* indicates

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-16 Thread Markus Armbruster

Jan Kiszka jan.kis...@siemens.com writes:

 Markus Armbruster wrote:
 Jan Kiszka jan.kis...@siemens.com writes:
[...]
 The format I will propose is global-ID|/absolute/path, no more
 /path/global-ID as this comes with the risk of ambiguity (ID may shadow
 bus-local name of a device).
 
 Doesn't this break existing usage?

 Please name one. It could only be weird corner cases like device_add
 driver,bus=bus1/ID/bus2 or bus=ID/bus. But given that we always
 allowed to address bus2 directly (and this is something I cannot and
 will not change), does this really matter? Maybe allowing paths to start
 with an ID is something worth considering, will think about this again.

I checked with Dan: libvirt doesn't care.  So I don't either.

[...]
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-06-16 Thread Nicholas A. Bellinger

On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote:
 Am 04.06.2010 16:06, schrieb Kevin Wolf:
  Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger:
  From: Nicholas Bellinger n...@linux-iscsi.org
 
  This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and 
  PERSISTENT_RESERVE_IN
  case in scsi_req_length() to extra the incoming buffer length into 
  SCSIRequest-cmd.xfer,
  and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in 
  order to properly
  set SCSI_XFER_TO_DEV for WRITE data.
 
  Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and 
  TCM_Loop target ports.
 
  Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
  ---
   hw/scsi-bus.c |5 +
   1 files changed, 5 insertions(+), 0 deletions(-)
 
  diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
  index b8e4b71..75ec74e 100644
  --- a/hw/scsi-bus.c
  +++ b/hw/scsi-bus.c
  @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t 
  *cmd)
   case INQUIRY:
   req-cmd.xfer = cmd[4] | (cmd[3]  8);
   break;
  +case PERSISTENT_RESERVE_OUT:
  +case PERSISTENT_RESERVE_IN:
  +req-cmd.xfer = cmd[8] | (cmd[7]  8);
  
  Maybe I'm missing something, but isn't exactly the same value set in the
  switch block above? (for cmd[0]  5 == 2)
 
 Nicholas? This isn't applied yet because I'm waiting for your answer.
 
 Is there a reason why it makes sense to do it explicitly here instead
 using the generic code a few lines above? I think the same applied to
 patch 2/2.

Hi Kevin,

I just tested this again and you are correct, the reassignment of
req-cmd.xfer for PR and Maintence CDBs is unnecessary in
scsi_req_length().  I will go ahead and drop part this from my tree.

Please let me know if you would like me to resend the patch series.

Best,

--nab

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC qdev path semantics

2010-06-16 Thread Markus Armbruster

Markus Armbruster arm...@redhat.com writes:

[...]
 Bus names are chosen by the system as follows:

 * If the driver of the parent device model provides a name, use that.

 * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus
   number, counting from zero in creation order.

 * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM
   is the bus number, as above.

 ### Paul proposes to drop ID.NUM.

ABI change: -device lsi,id=my-scsi -device scsi-disk,bus=my-scsi.0 no
longer works.

 ### Paul proposes to either drop TYPE.NUM (and require drivers to
 provide bus names), or make NUM count separately for each bus type.

Likewise.

I'm not saying we can't do this, just that we need to consider backward
compatibility.

[...]
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 01/17] Eliminate duplicated timer code

2010-06-16 Thread Glauber Costa

On Mon, Jun 14, 2010 at 09:34:03PM -1000, Zachary Amsden wrote:
 Move duplicated timer related code into arch_vcpu_load rather
 than vendor callouts.  Should be an isomorphic transformation.
 
 Signed-off-by: Zachary Amsden zams...@redhat.com
Acked-by: Glauber Costa glom...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-06-16 Thread Kevin Wolf

Am 16.06.2010 15:03, schrieb Nicholas A. Bellinger:
 On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote:
 Am 04.06.2010 16:06, schrieb Kevin Wolf:
 Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger:
 From: Nicholas Bellinger n...@linux-iscsi.org

 This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and 
 PERSISTENT_RESERVE_IN
 case in scsi_req_length() to extra the incoming buffer length into 
 SCSIRequest-cmd.xfer,
 and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in 
 order to properly
 set SCSI_XFER_TO_DEV for WRITE data.

 Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and 
 TCM_Loop target ports.

 Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
 ---
  hw/scsi-bus.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)

 diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
 index b8e4b71..75ec74e 100644
 --- a/hw/scsi-bus.c
 +++ b/hw/scsi-bus.c
 @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t 
 *cmd)
  case INQUIRY:
  req-cmd.xfer = cmd[4] | (cmd[3]  8);
  break;
 +case PERSISTENT_RESERVE_OUT:
 +case PERSISTENT_RESERVE_IN:
 +req-cmd.xfer = cmd[8] | (cmd[7]  8);

 Maybe I'm missing something, but isn't exactly the same value set in the
 switch block above? (for cmd[0]  5 == 2)

 Nicholas? This isn't applied yet because I'm waiting for your answer.

 Is there a reason why it makes sense to do it explicitly here instead
 using the generic code a few lines above? I think the same applied to
 patch 2/2.
 
 Hi Kevin,
 
 I just tested this again and you are correct, the reassignment of
 req-cmd.xfer for PR and Maintence CDBs is unnecessary in
 scsi_req_length().  I will go ahead and drop part this from my tree.
 
 Please let me know if you would like me to resend the patch series.

Sure, taking simpler code is always better, so I'd be happier with a v2.

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-06-16 Thread Nicholas A. Bellinger

On Wed, 2010-06-16 at 06:03 -0700, Nicholas A. Bellinger wrote:
 On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote:
  Am 04.06.2010 16:06, schrieb Kevin Wolf:
   Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger:
   From: Nicholas Bellinger n...@linux-iscsi.org
  
   This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and 
   PERSISTENT_RESERVE_IN
   case in scsi_req_length() to extra the incoming buffer length into 
   SCSIRequest-cmd.xfer,
   and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in 
   order to properly
   set SCSI_XFER_TO_DEV for WRITE data.
  
   Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and 
   TCM_Loop target ports.
  
   Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
   ---
hw/scsi-bus.c |5 +
1 files changed, 5 insertions(+), 0 deletions(-)
  
   diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
   index b8e4b71..75ec74e 100644
   --- a/hw/scsi-bus.c
   +++ b/hw/scsi-bus.c
   @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, 
   uint8_t *cmd)
case INQUIRY:
req-cmd.xfer = cmd[4] | (cmd[3]  8);
break;
   +case PERSISTENT_RESERVE_OUT:
   +case PERSISTENT_RESERVE_IN:
   +req-cmd.xfer = cmd[8] | (cmd[7]  8);
   
   Maybe I'm missing something, but isn't exactly the same value set in the
   switch block above? (for cmd[0]  5 == 2)
  
  Nicholas? This isn't applied yet because I'm waiting for your answer.
  
  Is there a reason why it makes sense to do it explicitly here instead
  using the generic code a few lines above? I think the same applied to
  patch 2/2.
 
 Hi Kevin,
 
 I just tested this again and you are correct, the reassignment of
 req-cmd.xfer for PR and Maintence CDBs is unnecessary in
 scsi_req_length().  I will go ahead and drop part this from my tree.
 
 Please let me know if you would like me to resend the patch series.
 

Actually, I should mention that I have only tested this with TYPE_DISK
so far.  I think we do still need the Maintenance CDBs length
reassignment for MMC and TYPE_ROM from the second patch:

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 75ec74e..7d80405 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -329,6 +329,17 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd)
 case PERSISTENT_RESERVE_IN:
 req-cmd.xfer = cmd[8] | (cmd[7]  8);
 break;
+case MAINTENANCE_OUT:
+case MAINTENANCE_IN:
+if (req-dev-type != TYPE_ROM) {
+/* Used for MI_REPORT_TARGET_PGS, MO_SET_TARGET_PGS et al. */
+req-cmd.xfer = cmd[9] | (cmd[8]  8) |
+(cmd[7]  16) | (cmd[6]  24);
+} else {
+/* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands */
+req-cmd.xfer = cmd[9] | (cmd[8]  8);
+}
+break;
 }

Do you have a problem with leaving this reassignment in for TYPE_ROM..?

Best,

--nab


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1

2010-06-16 Thread Avi Kivity


On 06/16/2010 03:24 PM, Nadav Har'El wrote:

On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 5/24] Introduce vmcs12: a 
VMCS structure for L1:
   

+struct __attribute__ ((__packed__)) vmcs12 {
+   /* According to the Intel spec, a VMCS region must start with the
+* following two fields. Then follow implementation-specific data.
+*/
+   u32 revision_id;
+   u32 abort;
+};

   

Note that this structure becomes an ABI, it cannot change except in a
backward compatible way due to the need for live migration.  So I'd like
a documentation patch that adds a description of the content to
Documentation/kvm/.  It can be as simple as listing the structure
definition.
 

I agree that if struct vmcs12 is changed, this will cause problems for live
migration, but why does this mean that the struct's fields or layout an ABI
worth documenting?
   


It's a way of adding a barrier to changing it, and of determining which 
versions are compatible with which other versions.



After all, isn't the idea of VMCS that its internal content and layout
is opaque for the L1 guest - he can only read/write it with VMREAD/VMWRITE,
and those two instructions are the ABI (which is of course documented in the
Intel spec) - not the content of the vmcs12 structure. Even if the guest knew
the exact layout of this structure, he's not supposed to use it.
   


Right, it's only migration, not for guest use.  Or perhaps for someone 
debugging a hypervisor.



By the way, we have not actually checked that live migration is working
as expected with nested virtualization running. I expect there to be more
pitfalls and bugs even before we consider migration between two different
versions. We would indeed like to allow live migration of different kinds
(of L1 with all its L2 guests; Of all L2 guests of a L1; Of a single L2
guest), but we're trying to finish the more basic functionality first.
   


Live migration will not work without ioctls to save/load the vmptr and 
vmxon state.


nsvm has a hack where they force an exit before existing to host 
userspace, so host userspace never sees guest mode.  I don't like it 
much, and in any case it can't work for nvmx since you need to migrate 
the vmxon state.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 03/17] Unify vendor TSC logic

2010-06-16 Thread Glauber Costa

On Wed, Jun 16, 2010 at 04:10:10PM +0800, Jason Wang wrote:
 Zachary Amsden wrote:
   
   void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
   {
  +   kvm_x86_ops-vcpu_load(vcpu, cpu);
  if (unlikely(vcpu-cpu != cpu)) {
  +   /* Make sure TSC doesn't go backwards */
  +   s64 tsc_delta = !vcpu-arch.last_host_tsc ? 0 : 
  +   native_read_tsc() - vcpu-arch.last_host_tsc;
  +   if (tsc_delta  0 || check_tsc_unstable())

 It's better to do the adjustment also when tsc_delta  0
And why do you think so? Doing it on tsc_delta  0 would force us to adjust
at every entry but the first. And I guess we want to adjust as few times as 
we can.

For example, we would adjust on every cpu bounce even for machines that has
a perfectly sync tsc. This could introduce an error not present before.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-06-16 Thread Kevin Wolf

Am 16.06.2010 15:09, schrieb Nicholas A. Bellinger:
 On Wed, 2010-06-16 at 06:03 -0700, Nicholas A. Bellinger wrote:
 On Wed, 2010-06-16 at 14:13 +0200, Kevin Wolf wrote:
 Am 04.06.2010 16:06, schrieb Kevin Wolf:
 Am 31.05.2010 03:43, schrieb Nicholas A. Bellinger:
 From: Nicholas Bellinger n...@linux-iscsi.org

 This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and 
 PERSISTENT_RESERVE_IN
 case in scsi_req_length() to extra the incoming buffer length into 
 SCSIRequest-cmd.xfer,
 and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in 
 order to properly
 set SCSI_XFER_TO_DEV for WRITE data.

 Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and 
 TCM_Loop target ports.

 Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
 ---
  hw/scsi-bus.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)

 diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
 index b8e4b71..75ec74e 100644
 --- a/hw/scsi-bus.c
 +++ b/hw/scsi-bus.c
 @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t 
 *cmd)
  case INQUIRY:
  req-cmd.xfer = cmd[4] | (cmd[3]  8);
  break;
 +case PERSISTENT_RESERVE_OUT:
 +case PERSISTENT_RESERVE_IN:
 +req-cmd.xfer = cmd[8] | (cmd[7]  8);

 Maybe I'm missing something, but isn't exactly the same value set in the
 switch block above? (for cmd[0]  5 == 2)

 Nicholas? This isn't applied yet because I'm waiting for your answer.

 Is there a reason why it makes sense to do it explicitly here instead
 using the generic code a few lines above? I think the same applied to
 patch 2/2.

 Hi Kevin,

 I just tested this again and you are correct, the reassignment of
 req-cmd.xfer for PR and Maintence CDBs is unnecessary in
 scsi_req_length().  I will go ahead and drop part this from my tree.

 Please let me know if you would like me to resend the patch series.

 
 Actually, I should mention that I have only tested this with TYPE_DISK
 so far.  I think we do still need the Maintenance CDBs length
 reassignment for MMC and TYPE_ROM from the second patch:
 
 diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
 index 75ec74e..7d80405 100644
 --- a/hw/scsi-bus.c
 +++ b/hw/scsi-bus.c
 @@ -329,6 +329,17 @@ static int scsi_req_length(SCSIRequest *req, uint8_t 
 *cmd)
  case PERSISTENT_RESERVE_IN:
  req-cmd.xfer = cmd[8] | (cmd[7]  8);
  break;
 +case MAINTENANCE_OUT:
 +case MAINTENANCE_IN:
 +if (req-dev-type != TYPE_ROM) {
 +/* Used for MI_REPORT_TARGET_PGS, MO_SET_TARGET_PGS et al. */
 +req-cmd.xfer = cmd[9] | (cmd[8]  8) |
 +(cmd[7]  16) | (cmd[6]  24);
 +} else {
 +/* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands 
 */
 +req-cmd.xfer = cmd[9] | (cmd[8]  8);
 +}
 +break;
  }
 
 Do you have a problem with leaving this reassignment in for TYPE_ROM..?

In a place where it's different from what was set above, I'm not against
it. I just wanted to reduce some code duplication.

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC qdev path semantics

2010-06-16 Thread Paul Brook

  Bus names are chosen by the system as follows:
  
  * If the driver of the parent device model provides a name, use that.
  
  * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus
  
number, counting from zero in creation order.
  
  * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM
  
is the bus number, as above.
  
  ### Paul proposes to drop ID.NUM.
 
 ABI change: -device lsi,id=my-scsi -device scsi-disk,bus=my-scsi.0 no
 longer works.

IMO this is a fundamentally broken ABI, so I don't care.
 
  ### Paul proposes to either drop TYPE.NUM (and require drivers to
  provide bus names), or make NUM count separately for each bus type.
 
 Likewise.

I'd be surprised if anyone actually uses absolute device paths at this time, 
and they're probably going to be broken by other changes.

Using these default bus names as global identifiers is fixable using aliases
(e.g. -device lsi,bus=pci.0).  I'd expect this to cover most interesting uses.
See http://lists.nongnu.org/archive/html/qemu-devel/2010-06/msg02149.html

Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 04/17] Fix deep C-state TSC desynchronization

2010-06-16 Thread Glauber Costa

On Mon, Jun 14, 2010 at 09:34:06PM -1000, Zachary Amsden wrote:
 When CPUs with unstable TSCs enter deep C-state, TSC may stop
 running.  This causes us to require resynchronization.  Since
 we can't tell when this may potentially happen, we assume the
 worst by forcing re-compensation for it at every point the VCPU
 task is descheduled.
can't we just compensate everytime check_tsc_unstable() is true?


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC

2010-06-16 Thread Glauber Costa

On Mon, Jun 14, 2010 at 09:34:07PM -1000, Zachary Amsden wrote:
 SMP VMs on machines with unstable TSC have their TSC offset adjusted by the
 local offset delta from last measurement.  This does not take into account how
 long it has been since the measurement, leading to drift.  Minimize the drift
 by accounting for any time difference the kernel has observed.
 
 Signed-off-by: Zachary Amsden zams...@redhat.com
I believe this should be done not only if we have check_tsc_unstable() == true,
but anytime we adjust the tsc. I mean:

Sure it is expected to be much more relevant in this case, but if we're testing
generally for tsc_delta  0 in the adjustment code, it is because we believe
it can happen, even if tsc is stable (otherwise, we'd better take it off 
completely).

And in that case, we should account elapsed time too. 

 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/x86.c  |   20 +++-
  2 files changed, 20 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 94f6ce8..1afecd7 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -337,6 +337,7 @@ struct kvm_vcpu_arch {
   unsigned int time_offset;
   struct page *time_page;
   u64 last_host_tsc;
 + u64 last_host_ns;
  
   bool nmi_pending;
   bool nmi_injected;
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 618c435..b1bdf05 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1810,6 +1810,19 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
   /* Make sure TSC doesn't go backwards */
   s64 tsc_delta = !vcpu-arch.last_host_tsc ? 0 : 
   native_read_tsc() - vcpu-arch.last_host_tsc;
 +
 + /* Subtract elapsed cycle time from the delta computation */
 + if (check_tsc_unstable()  vcpu-arch.last_host_ns) {
 + s64 delta;
 + struct timespec ts;
 + ktime_get_ts(ts);
 + monotonic_to_bootbased(ts);
 + delta = timespec_to_ns(ts) - vcpu-arch.last_host_ns;
 + delta = delta * per_cpu(cpu_tsc_khz, cpu);
 + delta = delta / USEC_PER_SEC;
 + tsc_delta -= delta;
 + }
 +
   if (tsc_delta  0 || check_tsc_unstable())
   kvm_x86_ops-adjust_tsc_offset(vcpu, -tsc_delta);
   kvm_migrate_timers(vcpu);
 @@ -1832,8 +1845,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
* vcpu-cpu != cpu can not detect this condition.  So set
* vcpu-cpu = -1 to force the recalibration above.
*/
 - if (check_tsc_unstable())
 + if (check_tsc_unstable()) {
 + struct timespec ts;
 + ktime_get_ts(ts);
 + monotonic_to_bootbased(ts);
 + vcpu-arch.last_host_ns = timespec_to_ns(ts);
   vcpu-cpu = -1;
 + }
  }
  
  static int is_efer_nx(void)
 -- 
 1.7.1
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 10/24] Implement VMPTRLD

2010-06-16 Thread Gleb Natapov

On Sun, Jun 13, 2010 at 03:27:41PM +0300, Nadav Har'El wrote:
 This patch implements the VMPTRLD instruction.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
 --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
 @@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv
   return 0;
  }
  
 +static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu)
 +{
 + unsigned long rflags;
 + rflags = vmx_get_rflags(vcpu);
 + rflags |= X86_EFLAGS_CF;
 + rflags = ~X86_EFLAGS_PF  ~X86_EFLAGS_AF  ~X86_EFLAGS_ZF 
 + ~X86_EFLAGS_SF  ~X86_EFLAGS_OF;
 + vmx_set_rflags(vcpu, rflags);
 +}
 +
 +static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu)
 +{
 + unsigned long rflags;
 + rflags = vmx_get_rflags(vcpu);
 + rflags |= X86_EFLAGS_ZF;
 + rflags = ~X86_EFLAGS_PF  ~X86_EFLAGS_AF  ~X86_EFLAGS_CF 
 + ~X86_EFLAGS_SF  ~X86_EFLAGS_OF;
 + vmx_set_rflags(vcpu, rflags);
 +}
 +
  static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
  {
   unsigned long rflags;
 @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
   return 1;
  }
  
 +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t 
 guest_vmcs_addr)
 +{
 + bool ret;
 + struct vmcs12 *vmcs12;
 + struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
 + if (vmcs_page == NULL)
 + return 0;
 + vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
 + if (vmcs12-revision_id == VMCS12_REVISION)
 + ret = 1;
 + else {
 + set_rflags_to_vmx_fail_valid(vcpu);
Should set VM-Instruction Error Field accordingly.

 + ret = 0;
 + }
 + kunmap_atomic(vmcs12, KM_USER0);
 + kvm_release_page_dirty(vmcs_page);
 + return ret;
 +}
 +
 +/* Emulate the VMPTRLD instruction */
 +static int handle_vmptrld(struct kvm_vcpu *vcpu)
 +{
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 + gpa_t guest_vmcs_addr;
 +
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;
 +
 + if (read_guest_vmcs_gpa(vcpu, guest_vmcs_addr)) {
 + set_rflags_to_vmx_fail_invalid(vcpu);
 + return 1;
 + }
 +
 + if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
 + return 1;

Should check that guest_vmcs_addr != VMXON address. I think this check
is missing from VMCLEAR too.

 +
 + if (vmx-nested.current_vmptr != guest_vmcs_addr) {
 + vmx-nested.current_vmptr = guest_vmcs_addr;
 +
 + if (nested_create_current_vmcs(vcpu)) {
 + printk(KERN_ERR %s error could not allocate memory,
 + __func__);
 + return -ENOMEM;
 + }
 + }
 +
 + clear_rflags_cf_zf(vcpu);
 + skip_emulated_instruction(vcpu);
 + return 1;
 +}
 +
  static int handle_invlpg(struct kvm_vcpu *vcpu)
  {
   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 @@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st
   [EXIT_REASON_VMCALL]  = handle_vmcall,
   [EXIT_REASON_VMCLEAR] = handle_vmclear,
   [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
 - [EXIT_REASON_VMPTRLD] = handle_vmx_insn,
 + [EXIT_REASON_VMPTRLD] = handle_vmptrld,
   [EXIT_REASON_VMPTRST] = handle_vmx_insn,
   [EXIT_REASON_VMREAD]  = handle_vmx_insn,
   [EXIT_REASON_VMRESUME]= handle_vmx_insn,
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/2] [scsi-bus]: Changes for PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs v2

2010-06-16 Thread Nicholas A. Bellinger

From: Nicholas Bellinger n...@linux-iscsi.org

Greetings,

Attached is v2 for a series containing a handful of changes required for 
scsi-bus in order
to get PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs working as expected.  Thanks 
to Kevin Wolf
for pointing out the unnecessary SCSIRequest-cmd.xfer reassignments in 
scsi_req_length() from
the original series.

Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation + scsi-generic 
and
TCM_Loop target ports.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org

Nicholas Bellinger (2):
  [scsi-bus]: Add PERSISTENT_RESERVE_OUT SCSIRequest-cmd.mode setup
  [scsi-bus]: Add MAINTENANCE_IN and MAINTENANCE_OUT SCSIRequest xfer
and mode assignments

 hw/scsi-bus.c  |   11 +++
 hw/scsi-defs.h |2 ++
 2 files changed, 13 insertions(+), 0 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] [scsi-bus]: Add PERSISTENT_RESERVE_OUT SCSIRequest-cmd.mode setup

2010-06-16 Thread Nicholas A. Bellinger

From: Nicholas Bellinger n...@linux-iscsi.org

This patch updates hw/scsi-bus.c to add the PERSISTENT_RESERVE_OUT cdb
case in scsi_req_xfer_mode() to set SCSI_XFER_TO_DEV for outgoing WRITE data.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 hw/scsi-bus.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index b8e4b71..7a5fa48 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -389,6 +389,7 @@ static void scsi_req_xfer_mode(SCSIRequest *req)
 case MEDIUM_SCAN:
 case SEND_VOLUME_TAG:
 case WRITE_LONG_2:
+case PERSISTENT_RESERVE_OUT:
 req-cmd.mode = SCSI_XFER_TO_DEV;
 break;
 default:
-- 
1.5.6.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] [scsi-bus]: Add MAINTENANCE_IN and MAINTENANCE_OUT SCSIRequest xfer and mode assignments

2010-06-16 Thread Nicholas A. Bellinger

From: Nicholas Bellinger n...@linux-iscsi.org

This patch updates hw/scsi-bus.c to add MAINTENANCE_IN and MAINTENANCE_OUT case 
in
scsi_req_length() for TYPE_ROM with MMC commands.  It also adds the 
MAINTENANCE_OUT
case in scsi_req_xfer_mode() to set SCSI_XFER_TO_DEV for outgoing write data.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 hw/scsi-bus.c  |   10 ++
 hw/scsi-defs.h |2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 7a5fa48..ed3554c 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -325,6 +325,13 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd)
 case INQUIRY:
 req-cmd.xfer = cmd[4] | (cmd[3]  8);
 break;
+case MAINTENANCE_OUT:
+case MAINTENANCE_IN:
+if (req-dev-type == TYPE_ROM) {
+/* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands */
+req-cmd.xfer = cmd[9] | (cmd[8]  8);
+}
+break;
 }
 return 0;
 }
@@ -390,6 +397,7 @@ static void scsi_req_xfer_mode(SCSIRequest *req)
 case SEND_VOLUME_TAG:
 case WRITE_LONG_2:
 case PERSISTENT_RESERVE_OUT:
+case MAINTENANCE_OUT:
 req-cmd.mode = SCSI_XFER_TO_DEV;
 break;
 default:
@@ -535,6 +543,8 @@ const char *scsi_command_name(uint8_t cmd)
 [ SPACE] = SPACE,
 [ INQUIRY  ] = INQUIRY,
 [ RECOVER_BUFFERED_DATA] = RECOVER_BUFFERED_DATA,
+[ MAINTENANCE_IN   ] = MAINTENANCE_IN,
+[ MAINTENANCE_OUT  ] = MAINTENANCE_OUT,
 [ MODE_SELECT  ] = MODE_SELECT,
 [ RESERVE  ] = RESERVE,
 [ RELEASE  ] = RELEASE,
diff --git a/hw/scsi-defs.h b/hw/scsi-defs.h
index 5890094..a4a3518 100644
--- a/hw/scsi-defs.h
+++ b/hw/scsi-defs.h
@@ -84,6 +84,8 @@
 #define MODE_SENSE_10 0x5a
 #define PERSISTENT_RESERVE_IN 0x5e
 #define PERSISTENT_RESERVE_OUT 0x5f
+#define MAINTENANCE_IN0xa3
+#define MAINTENANCE_OUT   0xa4
 #define MOVE_MEDIUM   0xa5
 #define READ_12   0xa8
 #define WRITE_12  0xaa
-- 
1.5.6.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: RFC qdev path semantics

2010-06-16 Thread Jan Kiszka

Paul Brook wrote:
 I think this would be better served by adding explicit aliases/IDs for
 those use-cases. i.e. define the global ID pci.0 to be an alias for

  /i440FX-pcihost/pci
 Makes sense. We could attach this ID to the BusState (corresponding to
 DeviceState:id) and manually set it during machine init.
 qbus_find_recursive would then look for a matching ID instead of a name.
 
 If we accept your proposal that global ids are not accepted as local device 
 identifiers, then these probably shouldn't be stored in the device tree at 
 all.
 
 Allowing at most one global ID per node also seems like a rather arbitrary 
 limitation.
 
 How about having a global alias table that maps an ID to a devtree path?
 The code to handle these becomes a straight string substitution followed by a 
 normal path lookup. We don't need separate code for device v.s. bus, and 
 global_device_id/busname just works.

No principle concerns about such a table. It should nevertheless be
reference-based to assist lifetime management of aliases that are
created together with the devices they point to.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] Monitor support QEMU trace framework

2010-06-16 Thread Jan Kiszka

Prerna Saxena wrote:
 This is v3 of a set of patches to enable trace visualization  control 
 via the QEMU monitor. It is based on trace infrastructure posted by 
 Stefan :
 ( http://lists.gnu.org/archive/html/qemu-devel/2010-05/msg02407.html )
 
 
 This patchset adds monitor commands :
 - info trace : to view current contents of the trace buffer
 - info tracepoints : to view all available tracepoints and their state.
 - tracepoint NAME on|off : to enable/disable the logging of data
  from tracepoint 'NAME' to on/off.
 
 Changelog :
 - Clean-ups from v2, particularly relating to export of tdb_hash()
 

But it's still using the legacy monitor command interface. See [1]+[2]
for a mechanism that allows you to convert to the new interface without
automatically exposing your commands via QMP.

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/74356
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/74372

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 16/17] TSC reset compensation

2010-06-16 Thread Glauber Costa

On Mon, Jun 14, 2010 at 09:34:18PM -1000, Zachary Amsden wrote:
 Attempt to synchronize TSCs which are reset to the same value.  In the
 case of a reliable hardware TSC, we can just re-use the same offset, but
 on non-reliable hardware, we can get closer by adjusting the offset to
 match the elapsed time.
 
 Signed-off-by: Zachary Amsden zams...@redhat.com
 ---
  arch/x86/kvm/x86.c |   34 --
  1 files changed, 32 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 8e836e9..cedb71f 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -937,14 +937,44 @@ static inline void kvm_request_guest_time_update(struct 
 kvm_vcpu *v)
   set_bit(KVM_REQ_CLOCK_SYNC, v-requests);
  }
  
 +static inline int kvm_tsc_reliable(void)
 +{
 + return (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) 
 + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) 
 + !check_tsc_unstable());
 +}
 +
why can't we re-use vmware TSC_RELIABLE flag?


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/24] Implement VMPTRST

2010-06-16 Thread Gleb Natapov

On Mon, Jun 14, 2010 at 12:15:10PM +0300, Avi Kivity wrote:
 On 06/13/2010 03:28 PM, Nadav Har'El wrote:
 This patch implements the VMPTRST instruction.
 
 Signed-off-by: Nadav Har'Eln...@il.ibm.com
 ---
 --- .before/arch/x86/kvm/x86.c   2010-06-13 15:01:29.0 +0300
 +++ .after/arch/x86/kvm/x86.c2010-06-13 15:01:29.0 +0300
 @@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv
  return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
   }
 
 -static int kvm_write_guest_virt_system(gva_t addr, void *val,
 +int kvm_write_guest_virt_system(gva_t addr, void *val,
 unsigned int bytes,
 struct kvm_vcpu *vcpu,
 u32 *error)
 
 write_guest_virt_system() is used by writes which need to ignore the
 cpl, for example when a cpl 3 instruction loads a segment, the
 processor needs to update the accessed flag even though it is only
 accessible to cpl 0.  That's not your case, you need the ordinary
 write_guest_virt().
 
 Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.
 
the code uses this function after checking cpl to be zero, so may be it
is ok, not to pretty though. I was actually hoping to get rid of all
kvm_(read|write)_guest_virt* and replace existing uses with
emulator_(read|write)_emulated, but this patch series adds more users
that will be hard to replace :(

 
 +/* Emulate the VMPTRST instruction */
 +static int handle_vmptrst(struct kvm_vcpu *vcpu)
 +{
 +int r = 0;
 +unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 +u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 +gva_t vmcs_gva;
 +
 +if (!nested_vmx_check_permission(vcpu))
 +return 1;
 +
 +vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification,
 +   vmx_instruction_info);
 +if (vmcs_gva == 0)
 +return 1;
 
 What's wrong with gva 0?  It's favoured by exploiters everywhere.
 
 +r = kvm_write_guest_virt_system(vmcs_gva,
 + (void *)to_vmx(vcpu)-nested.current_vmptr,
 + sizeof(u64), vcpu, NULL);
 +if (r) {
 
 Check against the X86EMUL return codes.  You'll need to inject a
 page fault on failure.
 
 +printk(KERN_INFO %s failed to write vmptr\n, __func__);
 +return 1;
 +}
 +clear_rflags_cf_zf(vcpu);
 +skip_emulated_instruction(vcpu);
 +return 1;
 +}
 +
 
 -- 
 error compiling committee.c: too many arguments to function
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

2010-06-16 Thread Glauber Costa

On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
 Zachary Amsden wrote:
  Kernel time, which advances in discrete steps may progress much slower
  than TSC.  As a result, when kvmclock is adjusted to a new base, the
  apparent time to the guest, which runs at a much higher, nsec scaled
  rate based on the current TSC, may have already been observed to have
  a larger value (kernel_ns + scaled tsc) than the value to which we are
  setting it (kernel_ns + 0).

 This is one issue of kvmclock which tries to supply a clocksource whose
 precision may even higher than host.
What if we export to the guest the current clock resolution, and when doing 
guest
reads, simply chop whatever value we got to the lowest acceptable value?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 12/17] Add helper function get_kernel_ns

2010-06-16 Thread Glauber Costa

On Wed, Jun 16, 2010 at 04:12:46PM +0800, Jason Wang wrote:
 Zachary Amsden wrote:
  On 06/14/2010 10:41 PM, Avi Kivity wrote:

  On 06/15/2010 10:34 AM, Zachary Amsden wrote:
  
  Add a helper function for the multiple places this is used.  Note 
  that it
  must not be called in preemptible context, as that would mean the kernel
  could enter software suspend state, which would cause non-atomic 
  operation
  of the monotonic_to_bootbased computation.
 
  Open question: should the KVM_SET_CLOCK / KVM_GET_CLOCK ioctls use this
  as well?  Currently, they are not bootbased (but perhaps should be).
 
  Signed-off-by: Zachary Amsdenzams...@redhat.com
  ---
arch/x86/kvm/x86.c |   26 +-
1 files changed, 13 insertions(+), 13 deletions(-)
 
  diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
  index 703ea43..15c7317 100644
  --- a/arch/x86/kvm/x86.c
  +++ b/arch/x86/kvm/x86.c
  @@ -915,6 +915,16 @@ static void kvm_get_time_scale(uint32_t 
  scaled_khz, uint32_t base_khz,
 __func__, base_khz, scaled_khz, shift, *pmultiplier);
}
 
  +static inline u64 get_kernel_ns(void)
  +{
  +struct timespec ts;
  +
  +WARN_ON(preemptible());
  +ktime_get_ts(ts);
  +monotonic_to_bootbased(ts);
  +return timespec_to_ns(ts);
  +}
  +
static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 

  Isn't something like this a candidate for the time infrastructure?
 
  
 
  Should it be?  It certainly seems reasonable.
 
 
  Also, should we be using it for the cases KVM_GET_CLOCK / 
  KVM_SET_CLOCK?  It seems global kvmclock_offset for the VM ignores the 
  bootbased conversion.
 

 Yes we should.
indeed. Also shows the importance of doing it in a central place.
Have we had an acessor like that before, we would probably not have
made this mistake.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

2010-06-16 Thread Avi Kivity


On 06/16/2010 04:58 PM, Glauber Costa wrote:

On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
   

Zachary Amsden wrote:
 

Kernel time, which advances in discrete steps may progress much slower
than TSC.  As a result, when kvmclock is adjusted to a new base, the
apparent time to the guest, which runs at a much higher, nsec scaled
rate based on the current TSC, may have already been observed to have
a larger value (kernel_ns + scaled tsc) than the value to which we are
setting it (kernel_ns + 0).

   

This is one issue of kvmclock which tries to supply a clocksource whose
precision may even higher than host.
 

What if we export to the guest the current clock resolution, and when doing 
guest
reads, simply chop whatever value we got to the lowest acceptable value?
   


The clock resolution can change, and while we can expose it reliably 
through pvclock, do we need a notification so that the guest can update 
other internal structures?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Loss of network connectivity with high load

2010-06-16 Thread Avi Kivity


On 06/16/2010 12:00 AM, Daniel Bareiro wrote:

Hi all!

I'm using Linux 2.6.31.13 compiled with the kernel.org source code on
KVM host with Debian GNU/Linux Lenny amd64. Also I'm using Debian
GNU/Linux Lenny amd64 virtual machine with kernel 2.6.26-2 from Debian
repositories. I'm using qemu-kvm 0.12.3.

   


2.6.26 is really old.  Can you try a new kernel in the guest?  Recent 
kernels had many virtio fixes.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/2] [scsi-bus]: Changes for PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs v2

2010-06-16 Thread Kevin Wolf

Am 16.06.2010 15:42, schrieb Nicholas A. Bellinger:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 Greetings,
 
 Attached is v2 for a series containing a handful of changes required for 
 scsi-bus in order
 to get PERSISTENT_RESERVE_OUT and MAINTENANCE CDBs working as expected.  
 Thanks to Kevin Wolf
 for pointing out the unnecessary SCSIRequest-cmd.xfer reassignments in 
 scsi_req_length() from
 the original series.
 
 Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation + scsi-generic 
 and
 TCM_Loop target ports.
 
 Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org

Thanks, applied to the block branch.

One more thing: You usually put things like [scsi-bus] in brackets so
that they are removed by git am. I think this is actually useful
information to have in the commit log, so would you mind leaving the
brackets out for your next submissions?

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 12/24] Add VMCS fields to the vmcs12

2010-06-16 Thread Gleb Natapov

On Sun, Jun 13, 2010 at 03:28:43PM +0300, Nadav Har'El wrote:
 In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
 standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs.
 
 Later patches will enable L1 to read and write these fields using VMREAD/
 VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real
 VMCS for running L2.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
 --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
 @@ -117,6 +117,136 @@ struct shared_msr_entry {
   u64 mask;
  };
  
 +/* shadow_vmcs is a structure used in nested VMX for holding a copy of all
 + * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12),
 + * and also for easier access to VMCS data (see l1_shadow_vmcs).
 + */
 +struct __attribute__ ((__packed__)) shadow_vmcs {
 + u16 virtual_processor_id;
 + u16 guest_es_selector;
 + u16 guest_cs_selector;
 + u16 guest_ss_selector;
 + u16 guest_ds_selector;
 + u16 guest_fs_selector;
 + u16 guest_gs_selector;
 + u16 guest_ldtr_selector;
 + u16 guest_tr_selector;
 + u16 host_es_selector;
 + u16 host_cs_selector;
 + u16 host_ss_selector;
 + u16 host_ds_selector;
 + u16 host_fs_selector;
 + u16 host_gs_selector;
 + u16 host_tr_selector;
 + u64 io_bitmap_a;
 + u64 io_bitmap_b;
 + u64 msr_bitmap;
 + u64 vm_exit_msr_store_addr;
 + u64 vm_exit_msr_load_addr;
 + u64 vm_entry_msr_load_addr;
 + u64 tsc_offset;
 + u64 virtual_apic_page_addr;
 + u64 apic_access_addr;
 + u64 ept_pointer;
 + u64 guest_physical_address;
 + u64 vmcs_link_pointer;
 + u64 guest_ia32_debugctl;
 + u64 guest_ia32_pat;
 + u64 guest_pdptr0;
 + u64 guest_pdptr1;
 + u64 guest_pdptr2;
 + u64 guest_pdptr3;
 + u64 host_ia32_pat;
 + u32 pin_based_vm_exec_control;
 + u32 cpu_based_vm_exec_control;
 + u32 exception_bitmap;
 + u32 page_fault_error_code_mask;
 + u32 page_fault_error_code_match;
 + u32 cr3_target_count;
 + u32 vm_exit_controls;
 + u32 vm_exit_msr_store_count;
 + u32 vm_exit_msr_load_count;
 + u32 vm_entry_controls;
 + u32 vm_entry_msr_load_count;
 + u32 vm_entry_intr_info_field;
 + u32 vm_entry_exception_error_code;
 + u32 vm_entry_instruction_len;
 + u32 tpr_threshold;
 + u32 secondary_vm_exec_control;
 + u32 vm_instruction_error;
 + u32 vm_exit_reason;
 + u32 vm_exit_intr_info;
 + u32 vm_exit_intr_error_code;
 + u32 idt_vectoring_info_field;
 + u32 idt_vectoring_error_code;
 + u32 vm_exit_instruction_len;
 + u32 vmx_instruction_info;
 + u32 guest_es_limit;
 + u32 guest_cs_limit;
 + u32 guest_ss_limit;
 + u32 guest_ds_limit;
 + u32 guest_fs_limit;
 + u32 guest_gs_limit;
 + u32 guest_ldtr_limit;
 + u32 guest_tr_limit;
 + u32 guest_gdtr_limit;
 + u32 guest_idtr_limit;
 + u32 guest_es_ar_bytes;
 + u32 guest_cs_ar_bytes;
 + u32 guest_ss_ar_bytes;
 + u32 guest_ds_ar_bytes;
 + u32 guest_fs_ar_bytes;
 + u32 guest_gs_ar_bytes;
 + u32 guest_ldtr_ar_bytes;
 + u32 guest_tr_ar_bytes;
 + u32 guest_interruptibility_info;
 + u32 guest_activity_state;
 + u32 guest_sysenter_cs;
 + u32 host_ia32_sysenter_cs;
 + unsigned long cr0_guest_host_mask;
 + unsigned long cr4_guest_host_mask;
 + unsigned long cr0_read_shadow;
 + unsigned long cr4_read_shadow;
 + unsigned long cr3_target_value0;
 + unsigned long cr3_target_value1;
 + unsigned long cr3_target_value2;
 + unsigned long cr3_target_value3;
 + unsigned long exit_qualification;
 + unsigned long guest_linear_address;
 + unsigned long guest_cr0;
 + unsigned long guest_cr3;
 + unsigned long guest_cr4;
 + unsigned long guest_es_base;
 + unsigned long guest_cs_base;
 + unsigned long guest_ss_base;
 + unsigned long guest_ds_base;
 + unsigned long guest_fs_base;
 + unsigned long guest_gs_base;
 + unsigned long guest_ldtr_base;
 + unsigned long guest_tr_base;
 + unsigned long guest_gdtr_base;
 + unsigned long guest_idtr_base;
 + unsigned long guest_dr7;
 + unsigned long guest_rsp;
 + unsigned long guest_rip;
 + unsigned long guest_rflags;
 + unsigned long guest_pending_dbg_exceptions;
 + unsigned long guest_sysenter_esp;
 + unsigned long guest_sysenter_eip;
 + unsigned long host_cr0;
 + unsigned long host_cr3;
 + unsigned long host_cr4;
 + unsigned long host_fs_base;
 + unsigned long host_gs_base;
 + unsigned long host_tr_base;
 + unsigned long host_gdtr_base;
 + unsigned long host_idtr_base;
 + unsigned long host_ia32_sysenter_esp;
 + unsigned long host_ia32_sysenter_eip;
 + unsigned long host_rsp;
 + unsigned long host_rip;
 +};
 +

Re: [Qemu-devel] Re: RFC qdev path semantics

2010-06-16 Thread Markus Armbruster

Paul Brook p...@codesourcery.com writes:

  Bus names are chosen by the system as follows:
  
  * If the driver of the parent device model provides a name, use that.
  
  * Else, if the parent device has id ID, use ID.NUM, where NUM is the bus
  
number, counting from zero in creation order.
  
  * Else, use TYPE.NUM, where TYPE is derived from the bus type, and NUM
  
is the bus number, as above.
  
  ### Paul proposes to drop ID.NUM.
 
 ABI change: -device lsi,id=my-scsi -device scsi-disk,bus=my-scsi.0 no
 longer works.

 IMO this is a fundamentally broken ABI, so I don't care.

Users of this ABI won't appreciate that attitude.

I do support dropping ID.NUM, but we owe our users due ABI diligence.

  ### Paul proposes to either drop TYPE.NUM (and require drivers to
  provide bus names), or make NUM count separately for each bus type.
 
 Likewise.

 I'd be surprised if anyone actually uses absolute device paths at this time, 
 and they're probably going to be broken by other changes.

Yes.

 Using these default bus names as global identifiers is fixable using aliases
 (e.g. -device lsi,bus=pci.0).  I'd expect this to cover most interesting uses.
 See http://lists.nongnu.org/archive/html/qemu-devel/2010-06/msg02149.html

Keeping the old bus names work for the buses we create automatically
shouldn't be hard.  Only a few, and no ID.NUM there.  It's the
user-created buses that worry me.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages

2010-06-16 Thread Marcelo Tosatti

On Tue, Jun 15, 2010 at 06:55:22AM -0700, Dave Hansen wrote:
 
 I think doing this makes the code much more readable.  That's
 borne out by the fact that this patch removes code.  used
 also happens to be the number that we need to return back to
 the slab code when our shrinker gets called.  Keeping this
 value as opposed to free makes the next patch simpler.
 
 So, 'struct kvm' is kzalloc()'d.  'struct kvm_arch' is a
 structure member (and not a pointer) of 'struct kvm'.  That
 means they start out zeroed.  I _think_ they get initialized
 properly by kvm_mmu_change_mmu_pages().  But, that only happens
 via kvm ioctls.
 
 I have a suspicion that they values are actually inconsistent
 until those ioctls get called; free and alloc are both zero.
 But, the VM can't really get run until these ioctl()s get called
 anyway.  There are also some checks for negative used_pages
 values which confused me.  It might all tie together.
 
 Anyway, another benefit of storing 'used' intead of 'free' is
 that the values are consistent from the moment the structure is
 allocated: no negative used value.
 
 Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com
 ---
 
  linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h |2 -
  linux-2.6.git-dave/arch/x86/kvm/mmu.c  |   29 
 +++--
  linux-2.6.git-dave/arch/x86/kvm/mmu.h  |3 +-
  3 files changed, 13 insertions(+), 21 deletions(-)
 
 diff -puN arch/x86/include/asm/kvm_host.h~replace-free-with-used 
 arch/x86/include/asm/kvm_host.h
 --- linux-2.6.git/arch/x86/include/asm/kvm_host.h~replace-free-with-used  
 2010-06-09 15:14:29.0 -0700
 +++ linux-2.6.git-dave/arch/x86/include/asm/kvm_host.h2010-06-09 
 15:14:29.0 -0700
 @@ -380,7 +380,7 @@ struct kvm_mem_aliases {
  struct kvm_arch {
   struct kvm_mem_aliases *aliases;
  
 - unsigned int n_free_mmu_pages;
 + unsigned int n_used_mmu_pages;
   unsigned int n_requested_mmu_pages;
   unsigned int n_max_mmu_pages;
   atomic_t invlpg_counter;
 diff -puN arch/x86/kvm/mmu.c~replace-free-with-used arch/x86/kvm/mmu.c
 --- linux-2.6.git/arch/x86/kvm/mmu.c~replace-free-with-used   2010-06-09 
 15:14:29.0 -0700
 +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c 2010-06-09 15:14:29.0 
 -0700
 @@ -898,7 +898,7 @@ static void kvm_mmu_free_page(struct kvm
   __free_page(virt_to_page(sp-spt));
   __free_page(virt_to_page(sp-gfns));
   kfree(sp);
 - ++kvm-arch.n_free_mmu_pages;
 + --kvm-arch.n_used_mmu_pages;
  }
  
  static unsigned kvm_page_table_hashfn(gfn_t gfn)
 @@ -919,7 +919,7 @@ static struct kvm_mmu_page *kvm_mmu_allo
   bitmap_zero(sp-slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
   sp-multimapped = 0;
   sp-parent_pte = parent_pte;
 - --vcpu-kvm-arch.n_free_mmu_pages;
 + ++vcpu-kvm-arch.n_used_mmu_pages;
   return sp;
  }
  
 @@ -1516,39 +1516,30 @@ static int kvm_mmu_zap_page(struct kvm *
  
  /*
   * Changing the number of mmu pages allocated to the vm
 - * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
 + * Note: if goal_nr_mmu_pages is too small, you will get dead lock
   */
 -void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 +void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int 
 goal_nr_mmu_pages)
  {
 - int used_pages;
 -
 - used_pages = kvm-arch.n_max_mmu_pages - kvm_mmu_available_pages(kvm);
 - used_pages = max(0, used_pages);
 -
   /*
* If we set the number of mmu pages to be smaller be than the
* number of actived pages , we must to free some mmu pages before we
* change the value
*/
  
 - if (used_pages  kvm_nr_mmu_pages) {
 - while (used_pages  kvm_nr_mmu_pages 
 + if (kvm-arch.n_used_mmu_pages  goal_nr_mmu_pages) {
 + while (kvm-arch.n_used_mmu_pages  goal_nr_mmu_pages 
   !list_empty(kvm-arch.active_mmu_pages)) {
   struct kvm_mmu_page *page;
  
   page = container_of(kvm-arch.active_mmu_pages.prev,
   struct kvm_mmu_page, link);
 - used_pages -= kvm_mmu_zap_page(kvm, page);
 - used_pages--;
 + kvm-arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, 
 page);
 + kvm-arch.n_used_mmu_pages--;

kvm-arch.n_used_mmu_pages is deaccounted for in kvm_mmu_zap_page - 
kvm_mmu_free_page
already.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: x86 emulator: fix pusha instruction emulation

2010-06-16 Thread Marcelo Tosatti

On Tue, Jun 15, 2010 at 09:00:16AM +0800, Wei Yongjun wrote:
 emulate pusha instruction only writeback the last
 EDI register, but the other registers which need
 to be writeback is ignored. This patch fixed it.
 
 ---
  arch/x86/kvm/emulate.c |  133 
 ++--
  1 files changed, 73 insertions(+), 60 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 13/24] Implement VMREAD and VMWRITE

2010-06-16 Thread Gleb Natapov

On Mon, Jun 14, 2010 at 12:36:02PM +0300, Avi Kivity wrote:
 On 06/13/2010 03:29 PM, Nadav Har'El wrote:
 Implement the VMREAD and VMWRITE instructions. With these instructions, L1
 can read and write to the VMCS it is holding. The values are read or written
 to the fields of the shadow_vmcs structure introduced in the previous patch.
 
 
 +
 +static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
 +{
 +switch (field_type) {
 +case VMCS_FIELD_TYPE_U16:
 +return 2;
 +case VMCS_FIELD_TYPE_U32:
 +return 4;
 +case VMCS_FIELD_TYPE_U64:
 +return 8;
 +case VMCS_FIELD_TYPE_ULONG:
 +#ifdef CONFIG_X86_64
 +if (is_long_mode(vcpu))
 +return 8;
 +#endif
 +return 4;
 
 No need for the ifdef, is_long_mode() works everywhere.
 
 +}
 +return 0; /* should never happen */
 
 Then BUG()?
 
 +}
 +
   struct vcpu_vmx {
  struct kvm_vcpu   vcpu;
  struct list_head  local_vcpus_link;
 @@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp
  return 1;
   }
 
 
 +static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg,
 + unsigned long field)
 +{
 +u64 field_value;
 +if (!nested_vmcs_read_any(vcpu, field,field_value))
 +return 0;
 +
 +#ifdef CONFIG_X86_64
 +switch (vmcs_field_type(field)) {
 +case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
 +if (!is_long_mode(vcpu)) {
 +kvm_register_write(vcpu, reg+1, field_value  32);
 
 What's this reg+1 thing?  I thought vmread simply ignores the upper half.
 
 +field_value = (u32)field_value;
 +}
 +}
 +#endif
 +kvm_register_write(vcpu, reg, field_value);
 +return 1;
 +}
 +
 +static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva,
 + unsigned long field)
 +{
 +u64 field_value;
 +if (!nested_vmcs_read_any(vcpu, field,field_value))
 +return 0;
 +
 +/* It's ok to use *_system, because handle_vmread verifies cpl=0 */
 
 +kvm_write_guest_virt_system(gva,field_value,
 + vmcs_field_size(vmcs_field_type(field), vcpu),
 + vcpu, NULL);
 
 vmread doesn't support 64-bit writes to memory outside long mode, so
 you'll have to truncate the write.
 
 I think you'll be better off returning a 32-bit size in
 vmcs_field_size() in these cases.
 
Actually write should be always 32bit long outside IA-32e mode and
64bit long in 64 bit mode. Unused bits should be set to zero.


 +return 1;
 +}
 +
 +static int handle_vmread(struct kvm_vcpu *vcpu)
 +{
 +unsigned long field;
 +unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 +u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 +gva_t gva = 0;
 +int read_succeed;
 +
 +if (!nested_vmx_check_permission(vcpu))
 +return 1;
 +
 +if (!nested_map_current(vcpu)) {
 +printk(KERN_INFO %s invalid shadow vmcs\n, __func__);
 +set_rflags_to_vmx_fail_invalid(vcpu);
 +return 1;
 +}
 
 Can do the read_any() here.
 
 +
 +/* decode instruction info to get the field to read and where to store
 + * its value */
 +field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
 +if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
 +read_succeed = handle_vmread_reg(vcpu,
 +VMX_OPERAND_REG(vmx_instruction_info), field);
 +} else {
 +gva = get_vmx_mem_address(vcpu, exit_qualification,
 +  vmx_instruction_info);
 +if (gva == 0)
 +return 1;
 +read_succeed = handle_vmread_mem(vcpu, gva, field);
 +}
 +
 +if (read_succeed) {
 +clear_rflags_cf_zf(vcpu);
 +skip_emulated_instruction(vcpu);
 +} else {
 +set_rflags_to_vmx_fail_valid(vcpu);
 +vmcs_write32(VM_INSTRUCTION_ERROR, 12);
 
 s_e_i() in any case but an exception.
 
 +}
 +
 +nested_unmap_current(vcpu);
 +return 1;
 +}
 +
 +
 
 +if (VMX_OPERAND_IS_REG(vmx_instruction_info))
 +field_value = kvm_register_read(vcpu,
 +VMX_OPERAND_REG(vmx_instruction_info));
 +else {
 +gva  = get_vmx_mem_address(vcpu, exit_qualification,
 +vmx_instruction_info);
 +if (gva == 0)
 +return 1;
 +kvm_read_guest_virt(gva,field_value,
 +vmcs_field_size(field_type, vcpu), vcpu, NULL);
 
 Check for exception.
 
 
 -- 
 error compiling committee.c: too many arguments to function
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from

Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock

2010-06-16 Thread Glauber Costa

On Wed, Jun 16, 2010 at 05:13:02PM +0300, Avi Kivity wrote:
 On 06/16/2010 04:58 PM, Glauber Costa wrote:
 On Wed, Jun 16, 2010 at 04:11:26PM +0800, Jason Wang wrote:
 Zachary Amsden wrote:
 Kernel time, which advances in discrete steps may progress much slower
 than TSC.  As a result, when kvmclock is adjusted to a new base, the
 apparent time to the guest, which runs at a much higher, nsec scaled
 rate based on the current TSC, may have already been observed to have
 a larger value (kernel_ns + scaled tsc) than the value to which we are
 setting it (kernel_ns + 0).
 
 This is one issue of kvmclock which tries to supply a clocksource whose
 precision may even higher than host.
 What if we export to the guest the current clock resolution, and when doing 
 guest
 reads, simply chop whatever value we got to the lowest acceptable value?
 
 The clock resolution can change, and while we can expose it reliably
 through pvclock, do we need a notification so that the guest can
 update other internal structures?
I believe the only thing we need is to warn the guest whenever this changes.
We can probably fit it in one of the paddings we have, and add a flag to say
it is valid (now that we have the infrastructure).
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] qemu: Enable XSAVE related CPUID

2010-06-16 Thread Sheng Yang

On Thursday 10 June 2010 11:31:02 Sheng Yang wrote:
 We can support it in KVM now. The 0xd leaf is queried from KVM.

Hi Marcelo  Avi

How about patch 1 and 2 in this series? They are used to enable XSAVE cpuid.

--
regards
Yang, Sheng


 
 Signed-off-by: Sheng Yang sh...@linux.intel.com
 ---
  target-i386/cpuid.c |   21 +
  1 files changed, 21 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
 index fb78061..26e4054 100644
 --- a/target-i386/cpuid.c
 +++ b/target-i386/cpuid.c
 @@ -1081,6 +1081,27 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
 uint32_t count, *ecx = 0;
  *edx = 0;
  break;
 +case 0xD:
 +/* Processor Extended State */
 +if (!(env-cpuid_ext_features  CPUID_EXT_XSAVE)) {
 +*eax = 0;
 +*ebx = 0;
 +*ecx = 0;
 +*edx = 0;
 +break;
 +}
 +if (kvm_enabled()) {
 +*eax = kvm_arch_get_supported_cpuid(env, 0xd, count, R_EAX);
 +*ebx = kvm_arch_get_supported_cpuid(env, 0xd, count, R_EBX);
 +*ecx = kvm_arch_get_supported_cpuid(env, 0xd, count, R_ECX);
 +*edx = kvm_arch_get_supported_cpuid(env, 0xd, count, R_EDX);
 +} else {
 +*eax = 0;
 +*ebx = 0;
 +*ecx = 0;
 +*edx = 0;
 +}
 +break;
  case 0x8000:
  *eax = env-cpuid_xlevel;
  *ebx = env-cpuid_vendor1;
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 13/24] Implement VMREAD and VMWRITE

2010-06-16 Thread Gleb Natapov

On Sun, Jun 13, 2010 at 03:29:13PM +0300, Nadav Har'El wrote:
 Implement the VMREAD and VMWRITE instructions. With these instructions, L1
 can read and write to the VMCS it is holding. The values are read or written
 to the fields of the shadow_vmcs structure introduced in the previous patch.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
 --- .before/arch/x86/kvm/vmx.c2010-06-13 15:01:29.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.0 +0300
 @@ -299,6 +299,42 @@ struct nested_vmx {
   int l2_vmcs_num;
  };
  
 +enum vmcs_field_type {
 + VMCS_FIELD_TYPE_U16 = 0,
 + VMCS_FIELD_TYPE_U64 = 1,
 + VMCS_FIELD_TYPE_U32 = 2,
 + VMCS_FIELD_TYPE_ULONG = 3
 +};
 +
 +#define VMCS_FIELD_LENGTH_OFFSET 13
 +#define VMCS_FIELD_LENGTH_MASK 0x6000
 +
 +static inline int vmcs_field_type(unsigned long field)
 +{
 + if (0x1  field)/* one of the *_HIGH fields, all are 32 bit */
 + return VMCS_FIELD_TYPE_U32;
 + return (VMCS_FIELD_LENGTH_MASK  field)  13;
 +}
 +
 +static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
 +{
 + switch (field_type) {
 + case VMCS_FIELD_TYPE_U16:
 + return 2;
 + case VMCS_FIELD_TYPE_U32:
 + return 4;
 + case VMCS_FIELD_TYPE_U64:
 + return 8;
 + case VMCS_FIELD_TYPE_ULONG:
 +#ifdef CONFIG_X86_64
 + if (is_long_mode(vcpu))
 + return 8;
 +#endif
 + return 4;
 + }
 + return 0; /* should never happen */
 +}
 +
  struct vcpu_vmx {
   struct kvm_vcpu   vcpu;
   struct list_head  local_vcpus_link;
 @@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp
   return 1;
  }
  
 +static inline bool nested_vmcs_read_any(struct kvm_vcpu *vcpu,
 + unsigned long field, u64 *ret)
 +{
 + short offset = vmcs_field_to_offset(field);
 + char *p;
 +
 + if (offset  0)
 + return 0;
 + if (!to_vmx(vcpu)-nested.current_l2_page)
 + return 0;
 +
 + p = ((char *)(get_shadow_vmcs(vcpu))) + offset;
 +
 + switch (vmcs_field_type(field)) {
 + case VMCS_FIELD_TYPE_ULONG:
 + *ret = *((unsigned long *)p);
 + return 1;
 + case VMCS_FIELD_TYPE_U16:
 + *ret = (u16) *((unsigned long *)p);
 + return 1;
 + case VMCS_FIELD_TYPE_U32:
 + *ret = (u32) *((unsigned long *)p);
 + return 1;
 + case VMCS_FIELD_TYPE_U64:
 + *ret = *((u64 *)p);
 + return 1;
 + default:
 + return 0; /* can never happen. */
 + }
 +}
 +
 +static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg,
 +  unsigned long field)
 +{
 + u64 field_value;
 + if (!nested_vmcs_read_any(vcpu, field, field_value))
 + return 0;
 +
 +#ifdef CONFIG_X86_64
 + switch (vmcs_field_type(field)) {
 + case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
 + if (!is_long_mode(vcpu)) {
 + kvm_register_write(vcpu, reg+1, field_value  32);
 + field_value = (u32)field_value;
 + }
 + }
 +#endif
 + kvm_register_write(vcpu, reg, field_value);
 + return 1;
 +}
 +
 +static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva,
 +  unsigned long field)
 +{
 + u64 field_value;
 + if (!nested_vmcs_read_any(vcpu, field, field_value))
 + return 0;
 +
 + /* It's ok to use *_system, because handle_vmread verifies cpl=0 */
 + kvm_write_guest_virt_system(gva, field_value,
 +  vmcs_field_size(vmcs_field_type(field), vcpu),
 +  vcpu, NULL);
 + return 1;
 +}
 +
 +static int handle_vmread(struct kvm_vcpu *vcpu)
 +{
 + unsigned long field;
 + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 + u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 + gva_t gva = 0;
 + int read_succeed;
 +
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;
 +
 + if (!nested_map_current(vcpu)) {
 + printk(KERN_INFO %s invalid shadow vmcs\n, __func__);
 + set_rflags_to_vmx_fail_invalid(vcpu);
 + return 1;
 + }
 +
 + /* decode instruction info to get the field to read and where to store
 +  * its value */
 + field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
 + if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
 + read_succeed = handle_vmread_reg(vcpu,
 + VMX_OPERAND_REG(vmx_instruction_info), field);
 + } else {
 + gva = get_vmx_mem_address(vcpu, exit_qualification,
 +   vmx_instruction_info);
 + if (gva == 0)
 + return 1;
 + read_succeed = handle_vmread_mem(vcpu, gva, field);
 + }
 +

Re: [RFC][PATCH 0/9] rework KVM mmu_shrink() code

2010-06-16 Thread Dave Hansen

On Wed, 2010-06-16 at 11:38 +0300, Avi Kivity wrote:
 On 06/15/2010 04:55 PM, Dave Hansen wrote:
  These seem to boot and run fine.  I'm running about 40 VMs at
  once, while doing echo 3  /proc/sys/vm/drop_caches, and
  killing/restarting VMs constantly.
 
 
 Will drop_caches actually shrink the kvm caches too?  If so we probably 
 need to add that to autotest since it's a really good stress test for 
 the mmu.

I'm completely sure.  I crashed my machines several times this way
during testing.

  Seems to be relatively stable, and seems to keep the numbers
  of kvm_mmu_page_header objects down.
 
 
 That's no necessarily a good thing, those things are expensive to 
 recreate.  Of course, when we do need to reclaim them, that should be 
 efficient.

Oh, I meant that I didn't break the shrinker completely.

 We also do a very bad job of selecting which page to reclaim.  We need 
 to start using the accessed bit on sptes that point to shadow page 
 tables, and then look those up and reclaim unreferenced pages sooner.  
 With shadow paging there can be tons of unsync pages that are basically 
 unused and can be reclaimed at no cost to future runtime.

Sounds like a good next step.

-- Dave

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value

2010-06-16 Thread Dave Hansen

On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote:
 On 06/15/2010 04:55 PM, Dave Hansen wrote:
  +/*
  + * This value is the sum of all of the kvm instances's
  + * kvm-arch.n_used_mmu_pages values.  We need a global,
  + * aggregate version in order to make the slab shrinker
  + * faster
  + */
  +static unsigned int kvm_total_used_mmu_pages;
 
 
 The variable needs to be at the top of the file.

Gotcha, will do.

  +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
  +{
  +   kvm-arch.n_used_mmu_pages += nr;
  +   kvm_total_used_mmu_pages += nr;
 
 
 Needs an atomic operation, since there's no global lock here.  To avoid 
 bouncing this cacheline around, make the variable percpu and make 
 readers take a sum across all cpus.  Side benefit is that you no longer 
 need an atomic but a local_t, which is considerably cheaper.

That's a good point.  All of the modifications are done under locks, but
the fast path isn't any more.  I'll fix it up.

-- Dave

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 9/9] make kvm mmu shrinker more aggressive

2010-06-16 Thread Dave Hansen

On Wed, 2010-06-16 at 12:24 +0300, Avi Kivity wrote:
 On 06/15/2010 04:55 PM, Dave Hansen wrote:
  In a previous patch, we removed the 'nr_to_scan' tracking.
  It was not being used to track the number of objects
  scanned, so we stopped using it entirely.  Here, we
  strart using it again.
 
  The theory here is simple; if we already have the refcount
  and the kvm-mmu_lock, then we should do as much work as
  possible under the lock.  The downside is that we're less
  fair about the KVM instances from which we reclaim.  Each
  call to mmu_shrink() will tend to pick on one instance,
  after which it gets moved to the end of the list and left
  alone for a while.
 
 
 That also increases the latency hit, as well as a potential fault storm, 
 on that instance.  Spreading out is less efficient, but smoother.

This is probably something that we need to go back and actually measure.
My suspicion is that, when memory fills up and this shrinker is getting
called a lot, it will be naturally fair.  That list gets shuffled around
enough, and mmu_shrink() called often enough that no VMs get picked on
too unfairly.

I'll go back and see if I can quantify this a bit, though.

I do worry about the case where you really have only a single CPU going
into reclaim and a very small number of VMs on the system.  You're
basically guaranteeing that you'll throw away nr_to_scan of the poor
victim VM's, with no penalty on the other guy.  

  If mmu_shrink() has already done a significant amount of
  scanning, the use of 'nr_to_scan' inside shrink_kvm_mmu()
  will also ensure that we do not over-reclaim when we have
  already done a lot of work in this call.
 
  In the end, this patch defines a scan as:
  1. An attempt to acquire a refcount on a 'struct kvm'
  2. freeing a kvm mmu page
 
  This would probably be most ideal if we can expose some
  of the work done by kvm_mmu_remove_some_alloc_mmu_pages()
  as also counting as scanning, but I think we have churned
  enough for the moment.
 
 It usually removes one page.

Does it always just go right now and free it, or is there any real
scanning that has to go on?

  diff -puN arch/x86/kvm/mmu.c~make-shrinker-more-aggressive 
  arch/x86/kvm/mmu.c
  --- linux-2.6.git/arch/x86/kvm/mmu.c~make-shrinker-more-aggressive  
  2010-06-14 11:30:44.0 -0700
  +++ linux-2.6.git-dave/arch/x86/kvm/mmu.c   2010-06-14 11:38:04.0 
  -0700
  @@ -2935,8 +2935,10 @@ static int shrink_kvm_mmu(struct kvm *kv
 
  idx = srcu_read_lock(kvm-srcu);
  spin_lock(kvm-mmu_lock);
  -   if (kvm-arch.n_used_mmu_pages  0)
  -   freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
  +   while (nr_to_scan  0  kvm-arch.n_used_mmu_pages  0) {
  +   freed_pages += kvm_mmu_remove_some_alloc_mmu_pages(kvm);
  +   nr_to_scan--;
  +   }
 
 
 What tree are you patching?

These applied to Linus's latest as of yesterday.

-- Dave

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/24] Implement VMPTRST

2010-06-16 Thread Nadav Har'El

On Wed, Jun 16, 2010, Gleb Natapov wrote about Re: [PATCH 11/24] Implement 
VMPTRST:
 On Mon, Jun 14, 2010 at 12:15:10PM +0300, Avi Kivity wrote:
  write_guest_virt_system() is used by writes which need to ignore the
  cpl, for example when a cpl 3 instruction loads a segment, the
  processor needs to update the accessed flag even though it is only
  accessible to cpl 0.  That's not your case, you need the ordinary
  write_guest_virt().
  
  Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.
  
 the code uses this function after checking cpl to be zero, so may be it
 is ok, not to pretty though. I was actually hoping to get rid of all
 kvm_(read|write)_guest_virt* and replace existing uses with
 emulator_(read|write)_emulated, but this patch series adds more users
 that will be hard to replace :(

If I remember the history correctly, this is exactly what happened in this
code. We used to use kvm_write_guest_virt(), until a few months ago it
disappeared. I thought it to be fine to call write_guest_virt_system()
because, like you said, we only already check above that cpl=0, so it is fine
to assume we have cpl 0 privileges.

So while it might look a bit strange at first, I think it should be fine and
there's no need to create more functions.

-- 
Nadav Har'El|Wednesday, Jun 16 2010, 4 Tammuz 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Amateurs built the ark - professionals
http://nadav.harel.org.il   |built the Titanic.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ kvm-Bugs-2948108 ] qemu-kvm fails to migrate properly in 0.12.2

2010-06-16 Thread SourceForge.net

Bugs item #2948108, was opened at 2010-02-08 14:06
Message generated for change (Comment added) made by ramereth
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2948108group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Lance Albertson (ramereth)
Assigned to: Nobody/Anonymous (nobody)
Summary: qemu-kvm fails to migrate properly in 0.12.2

Initial Comment:
I'm using Ganeti and encountered a problem [1] with migrating KVM virtual 
machines. I have confirmed that this is not a problem specific to Ganeti and is 
also happening to the Libvirt project [2].

When migrating from host A to host B, everything goes fine however the guest VM 
(running CentOS 5.4) clock changes to approximately 600 seconds in the future. 
When attempting to migrate back to host B, the VM locks up and stops 
responding. However, if you fix the clock prior to syncing back to host A, the 
migration happens however the clock on the VM literally stops so the guest is 
pretty unusable. 

I tried the same scenario using Ubuntu 9.10 for the guest OS and has different 
issues. Upon migrating to host B, the clock on the guest VM was ahead by over 3 
days into the future. Then when I migrate it back to host A the VM locks up 
hard.

I can confirm I have no problem using migration when using 0.11.1.

Here's the detail information about our setup:
Hardware: Intel(R) Xeon(R) CPU 5160  @ 3.00GHz
qemu-kvm: 0.12.2
kernel: 2.6.29-hardened
Linux OS: Gentoo Hardened

If you need more information, please let me know and I'll be happy to provide 
it.

Thanks-

[1] http://groups.google.com/group/ganeti/browse_thread/thread/9b2c556557e749cf
[2] http://thread.gmane.org/gmane.comp.emulators.libvirt/20771

--

Comment By: Lance Albertson (ramereth)
Date: 2010-06-16 08:39

Message:
Has there been any progress made on this bug? I noticed that I had the same
problem with 0.11.1 and just resorted to not using the kvm-clock.

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-09 11:29

Message:
The patch you provided doesn't seem to apply cleanly on older kernels so I
went ahead and used the latest stable 2.6.32.8 which seems to include the
patch. Unfortunately I have the same problem as before with the time moving
up by nearly three days and the VM locking up upon returning to the
original host.

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-09 08:09

Message:
I'll give that a try sometime today

--

Comment By: Brian Jackson (iggy_cav)
Date: 2010-02-09 07:55

Message:
Can someone try this?

http://www.mail-archive.com/kvm@vger.kernel.org/msg28046.html

--

Comment By: ttt (mrt2k9)
Date: 2010-02-09 01:50

Message:
Using acpi_pm both on hosts and guests (Debian 5.0 AMD64) seems to work. I
now did 20-30 live migrations or so and it worked like a charm. Thanks!

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-08 16:33

Message:
Any idea why kvmclock is causing this issue? Has there been any recent
changes to that part of qemu-kvm?

Thanks for the tips on CentOS, I'll give that a try and see if that works
as a workaround for now.

--

Comment By: Brian Jackson (iggy_cav)
Date: 2010-02-08 16:30

Message:
I think 5.4 had kvmclock backported to it... I just think it doesn't have
the same /sys adjustments for clock, so you have to use the no-kvmclock
thing or possibly something else.

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Virtualization_Guide/chap-Virtualization-KVM_guest_timing_management.html

has some tips for disabling it I think

possibly just clocksource=acpi_pm is needed

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-08 16:13

Message:
It seems to work if I use something like acpi_pm for the clock source (on
Debian). The clock also seemed to stay in time. I haven't tested it on the
other OS's but it seems that CentOS doesn't even use kvmclock (which means
sense considering the age of the kernel).

--

Comment By: Brian Jackson (iggy_cav)
Date: 2010-02-08 15:54

Message:
guest

you can either use the no-kvmclock boot param or choose a different clock
in:
/sys/devices/system/clocksource/clocksource0/current_clocksource

possible values are in:

Re: [RFC][PATCH 3/9] replace x86 kvm n_free_mmu_pages with n_used_mmu_pages

2010-06-16 Thread Dave Hansen

On Wed, 2010-06-16 at 11:25 -0300, Marcelo Tosatti wrote:
  - if (used_pages  kvm_nr_mmu_pages) {
  - while (used_pages  kvm_nr_mmu_pages 
  + if (kvm-arch.n_used_mmu_pages  goal_nr_mmu_pages) {
  + while (kvm-arch.n_used_mmu_pages  goal_nr_mmu_pages 
!list_empty(kvm-arch.active_mmu_pages)) {
struct kvm_mmu_page *page;
   
page = container_of(kvm-arch.active_mmu_pages.prev,
struct kvm_mmu_page, link);
  - used_pages -= kvm_mmu_zap_page(kvm, page);
  - used_pages--;
  + kvm-arch.n_used_mmu_pages -= kvm_mmu_zap_page(kvm, 
  page);
  + kvm-arch.n_used_mmu_pages--;
 
 kvm-arch.n_used_mmu_pages is deaccounted for in kvm_mmu_zap_page -
 kvm_mmu_free_page
 already.

Ahh, I see that now.  Thanks!

-- Dave

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status

2010-06-16 Thread Juan Quintela

Anthony Liguori anth...@codemonkey.ws wrote:
 On 06/14/2010 03:28 PM, Alex Williamson wrote:
 PCI hotplug currently doesn't work after a migration because
 we don't migrate the enable bits of the GPE state.  Pull hotplug
 structs into vmstate.

 Signed-off-by: Alex Williamsonalex.william...@redhat.com


 Applied.  Thanks.

 Regards,

 Anthony Liguori

I think this is better implemented as a subsection.  We didin't need
this until hotplug arrived, I think that checking if any up/down are
!= 0 and then send it as subsections is a best way to do it.

This way it could also be backported to stable.

Later, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] qemu: kvm: Enable XSAVE live migration support

2010-06-16 Thread Jan Kiszka

Marcelo Tosatti wrote:
 On Fri, Jun 11, 2010 at 12:36:49PM +0800, Sheng Yang wrote:
 Signed-off-by: Sheng Yang sh...@linux.intel.com
 ---
  qemu-kvm-x86.c|  109 
 -
  qemu-kvm.c|   24 +++
  qemu-kvm.h|   28 +
  target-i386/cpu.h |5 ++
  target-i386/kvm.c |2 +
  target-i386/machine.c |   20 +
  6 files changed, 169 insertions(+), 19 deletions(-)
 
 Applied, thanks.

Oops, late remark: Why introducing this feature against qemu-kvm instead
of upstream? Doesn't this just generate additional conversion work and
the risk of divergence to upstream in the migration protocol?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 4/9] create aggregate kvm_total_used_mmu_pages value

2010-06-16 Thread Dave Hansen

On Wed, 2010-06-16 at 11:48 +0300, Avi Kivity wrote:
  +static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
  +{
  + kvm-arch.n_used_mmu_pages += nr;
  + kvm_total_used_mmu_pages += nr;
 
 
 Needs an atomic operation, since there's no global lock here.  To avoid 
 bouncing this cacheline around, make the variable percpu and make 
 readers take a sum across all cpus.  Side benefit is that you no longer 
 need an atomic but a local_t, which is considerably cheaper.

We do have the stuff in:

include/linux/percpu_counter.h

the downside being that they're not precise and they're *HUGE* according
to the comment. :)

It's actually fairly difficult to do a counter which is precise,
scalable, and works well for small CPU counts when NR_CPUS is large.  Do
you mind if we just stick with a plain atomic_t for now?

-- Dave

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status

2010-06-16 Thread Alex Williamson

On Wed, 2010-06-16 at 17:47 +0200, Juan Quintela wrote:
 Anthony Liguori anth...@codemonkey.ws wrote:
  On 06/14/2010 03:28 PM, Alex Williamson wrote:
  PCI hotplug currently doesn't work after a migration because
  we don't migrate the enable bits of the GPE state.  Pull hotplug
  structs into vmstate.
 
  Signed-off-by: Alex Williamsonalex.william...@redhat.com
 
 
  Applied.  Thanks.
 
  Regards,
 
  Anthony Liguori
 
 I think this is better implemented as a subsection.  We didin't need
 this until hotplug arrived, I think that checking if any up/down are
 != 0 and then send it as subsections is a best way to do it.
 
 This way it could also be backported to stable.

The slots aren't really the issue, they were mostly for completeness.
The key is gpe.en, which is likely always going to be all 1s for an ACPI
aware OS.  So if we test != 0, we're going to need that subsection in
99% of the cases.  Maybe we can assume gpe.en is all set on the target,
but I don't really look forward to finding out the ways that might
break.  Thanks,

Alex


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] qemu: kvm: Enable XSAVE live migration support

2010-06-16 Thread Marcelo Tosatti

On Wed, Jun 16, 2010 at 05:48:46PM +0200, Jan Kiszka wrote:
 Marcelo Tosatti wrote:
  On Fri, Jun 11, 2010 at 12:36:49PM +0800, Sheng Yang wrote:
  Signed-off-by: Sheng Yang sh...@linux.intel.com
  ---
   qemu-kvm-x86.c|  109 
  -
   qemu-kvm.c|   24 +++
   qemu-kvm.h|   28 +
   target-i386/cpu.h |5 ++
   target-i386/kvm.c |2 +
   target-i386/machine.c |   20 +
   6 files changed, 169 insertions(+), 19 deletions(-)
  
  Applied, thanks.
 
 Oops, late remark: Why introducing this feature against qemu-kvm instead
 of upstream? Doesn't this just generate additional conversion work and
 the risk of divergence to upstream in the migration protocol?

Thats true. Sheng, can you add save/restore support to uq/master to
avoid these problems?

Then the cpuid bits can be also merged upstream.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[ kvm-Bugs-2948108 ] qemu-kvm fails to migrate properly in 0.12.2

2010-06-16 Thread SourceForge.net

Bugs item #2948108, was opened at 2010-02-08 16:06
Message generated for change (Comment added) made by iggy_cav
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2948108group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Lance Albertson (ramereth)
Assigned to: Nobody/Anonymous (nobody)
Summary: qemu-kvm fails to migrate properly in 0.12.2

Initial Comment:
I'm using Ganeti and encountered a problem [1] with migrating KVM virtual 
machines. I have confirmed that this is not a problem specific to Ganeti and is 
also happening to the Libvirt project [2].

When migrating from host A to host B, everything goes fine however the guest VM 
(running CentOS 5.4) clock changes to approximately 600 seconds in the future. 
When attempting to migrate back to host B, the VM locks up and stops 
responding. However, if you fix the clock prior to syncing back to host A, the 
migration happens however the clock on the VM literally stops so the guest is 
pretty unusable. 

I tried the same scenario using Ubuntu 9.10 for the guest OS and has different 
issues. Upon migrating to host B, the clock on the guest VM was ahead by over 3 
days into the future. Then when I migrate it back to host A the VM locks up 
hard.

I can confirm I have no problem using migration when using 0.11.1.

Here's the detail information about our setup:
Hardware: Intel(R) Xeon(R) CPU 5160  @ 3.00GHz
qemu-kvm: 0.12.2
kernel: 2.6.29-hardened
Linux OS: Gentoo Hardened

If you need more information, please let me know and I'll be happy to provide 
it.

Thanks-

[1] http://groups.google.com/group/ganeti/browse_thread/thread/9b2c556557e749cf
[2] http://thread.gmane.org/gmane.comp.emulators.libvirt/20771

--

Comment By: Brian Jackson (iggy_cav)
Date: 2010-06-16 12:30

Message:
There have been some fixes committed and there is another long set of
patches being discussed on the list now. Unfortunately, the fixes require
both host and guest updates. I don't think they will be in till 2.6.36 or
so. If they are suitable for stable releases, they can probably go into
2.6.32 and the other maintained stable trees.

--

Comment By: Lance Albertson (ramereth)
Date: 2010-06-16 10:39

Message:
Has there been any progress made on this bug? I noticed that I had the same
problem with 0.11.1 and just resorted to not using the kvm-clock.

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-09 13:29

Message:
The patch you provided doesn't seem to apply cleanly on older kernels so I
went ahead and used the latest stable 2.6.32.8 which seems to include the
patch. Unfortunately I have the same problem as before with the time moving
up by nearly three days and the VM locking up upon returning to the
original host.

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-09 10:09

Message:
I'll give that a try sometime today

--

Comment By: Brian Jackson (iggy_cav)
Date: 2010-02-09 09:55

Message:
Can someone try this?

http://www.mail-archive.com/kvm@vger.kernel.org/msg28046.html

--

Comment By: ttt (mrt2k9)
Date: 2010-02-09 03:50

Message:
Using acpi_pm both on hosts and guests (Debian 5.0 AMD64) seems to work. I
now did 20-30 live migrations or so and it worked like a charm. Thanks!

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-08 18:33

Message:
Any idea why kvmclock is causing this issue? Has there been any recent
changes to that part of qemu-kvm?

Thanks for the tips on CentOS, I'll give that a try and see if that works
as a workaround for now.

--

Comment By: Brian Jackson (iggy_cav)
Date: 2010-02-08 18:30

Message:
I think 5.4 had kvmclock backported to it... I just think it doesn't have
the same /sys adjustments for clock, so you have to use the no-kvmclock
thing or possibly something else.

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.4/html/Virtualization_Guide/chap-Virtualization-KVM_guest_timing_management.html

has some tips for disabling it I think

possibly just clocksource=acpi_pm is needed

--

Comment By: Lance Albertson (ramereth)
Date: 2010-02-08 18:13

Message:
It seems to work if I use something like acpi_pm for the clock source (on
Debian). The clock also seemed

Re: [PATCH 2/3] qemu: Enable XSAVE related CPUID

2010-06-16 Thread Marcelo Tosatti

On Wed, Jun 16, 2010 at 10:59:07PM +0800, Sheng Yang wrote:
 On Thursday 10 June 2010 11:31:02 Sheng Yang wrote:
  We can support it in KVM now. The 0xd leaf is queried from KVM.
 
 Hi Marcelo  Avi
 
 How about patch 1 and 2 in this series? They are used to enable XSAVE cpuid.

Applied.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] Re: KVM call minutes for June 15

2010-06-16 Thread Blue Swirl

On Tue, Jun 15, 2010 at 4:13 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 06/15/2010 10:41 AM, Christoph Hellwig wrote:

 On Tue, Jun 15, 2010 at 08:18:12AM -0700, Chris Wright wrote:


 KVM/qemu patches
 - patch rate is high, documentation is low, review is low
 - patches need to include better descriptions and documentation
   - will slow down patch writers
   - will make it easier for patch reviewers


 What is the qemu patch review policy anyway?

 We don't really have a coherent policy.  Suggestions for improvement are
 always appreciated.

I think a policy exists, it's just sort of 'laissez-faire'. But a
clear, written policy agreed by all wouldn't hurt.

Perhaps we should make a checklist or FAQ. I think it should be an
easy task even for someone without much QEMU background to do just
CODING_STYLE or missing SoB checks. Currently there are a lot of
commits with blatant CODING_STYLE violations. Even partial reviews for
specific issues would be helpful.


   There are no
 Reviewed-by: included in the actual commits,

 Reviewed-by/Ack-by's are pretty helpful for me.  In terms of including them
 in commit messages, if there's a strong feeling that that would be helpful
 then it's something I can look at doing but it also requires a fair bit of
 manual work during commit.

  and the requirement
 for a positive review also seem to vary a lot, up to the point that
 some commiters commit code that has never hit a public mailing list
 before.


 That really shouldn't happen and if it does, please point it out on the
 list.

This also happens due to lack of policy.

 Regards,

 Anthony Liguori

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 >

1 - 100 of 122 matches

Mail list logo