Re: [PATCH v2 7/7] s390/kvm: In-kernel channel subsystem support.

2012-09-20 Thread Christian Borntraeger
On 19/09/12 16:47, Alexander Graf wrote:
 
 On 04.09.2012, at 17:13, Cornelia Huck wrote:
 
 Handle most support for channel I/O instructions in the kernel itself.

 Only asynchronous functions (such as the start function) need to be
 handled by userspace.
 
 Phew. This is a lot of code for something that is usually handled in user 
 space in the kvm
 world. The x86 equivalent would be an in-kernel PCI bus, right? Have you 
 measured major 
 performance penalties when running this from user space?

Conny is on vacation, but I will try to answer that based on the discussions
with Carsten and Conny that we have had over the last 9 months. (So Conny,
Carsten, please correct me if I got something wrong.)

In essence it is that way, because we have our interrupt delivery in the kernel.

Here is the story of how this evolved:

We started with a userspace solution, but it turned out that this cannot work
in an architecture-compliant way. The problem is that the channel subsystem
interacts with the interrupts of the system. For example, there is a test
pending interruption (tpi) instruction that can clear pending interrupts (a
pending interrupt must not be delivered after tpi, but it might already be
queued in KVM for delivery on another cpu).
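To make that tpi race concrete, here is a minimal sketch (invented names, not
the actual kvm/s390 code) of why the dequeue must sit next to the delivery
queue: both sides have to serialize on the same lock, which is only possible
if both live in the kernel.

	#include <pthread.h>
	#include <stddef.h>

	struct io_irq { struct io_irq *next; unsigned int subchannel_id; };

	struct irq_queue {
		pthread_mutex_t lock;	/* also taken by the delivery path */
		struct io_irq *pending;	/* floating I/O interrupts */
	};

	/*
	 * tpi: atomically take a pending I/O interrupt off the queue.  After
	 * this returns, no vcpu may still deliver that interrupt -- the
	 * architectural guarantee a userspace split cannot give once the
	 * interrupt is already queued in KVM for another cpu.
	 */
	static struct io_irq *tpi_dequeue(struct irq_queue *q)
	{
		struct io_irq *irq;

		pthread_mutex_lock(&q->lock);
		irq = q->pending;
		if (irq)
			q->pending = irq->next;
		pthread_mutex_unlock(&q->lock);
		return irq;	/* NULL => cc 0; else store the intparm, cc 1 */
	}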
Since the channel subsystem and the interrupt delivery work so closely
together, the code structure has to follow that. So there are two possible
ways of implementing it:

1. do the basic channel subsystem (instructions + interrupts) and interrupt
handling in userspace
- e.g. qemu would have a call into kvm to ask for a cpu that can accept a
certain interrupt type
- if that cpu returns to qemu, qemu would then do the psw swap (deliver the
interrupt) and go back to KVM_RUN.
Given that interrupts have a priority, this also means that in the long run
qemu would need to do that for all kinds of interrupts, even those that the
kernel currently handles. For example, if a sigp and an I/O interrupt should
both be delivered to a cpu, it is hard to obey the priorities if kvm and qemu
are both allowed to do a psw swap.
- we already had that variant prototyped, but it has its downsides:
- it makes things like vhost impossible (you have to go to userspace to
deliver an interrupt)
- interrupts require an additional call into the kernel (get it out + KVM_RUN,
instead of one KVM_S390_INTERRUPT call)
- (future) passthrough of ccw devices requires in-kernel handling anyway

2. do the basic channel subsystem in kvm
- the kernel handles the basic channel subsystem instructions so that the
interrupt delivery can be architecturally correct. qemu implements the devices
on top of this in-kernel channel subsystem. This allows for things like vhost
and is also cheaper for things that the kernel already handles. The downside
is that the non-kvm case differs in qemu (but Conny left her userspace
implementation in).

Please note that the channel subsystem architecture is tightly coupled to the
cpu architecture (e.g. by having instructions for interaction), so the
comparison with PCI is not fully correct.


Christian



Re: Can't use USB devices in their designated applications on Windows Vista guest

2012-09-20 Thread Dirk Heinrichs
On Thursday, 20 September 2012, 09:14:18, Michael Tokarev wrote:

 This appears to be the same as https://bugs.launchpad.net/bugs/1033727 ,
 FWIW.

Yes, seems so. I've subscribed to it.

Bye...

Dirk
-- 
Dirk Heinrichs dirk.heinri...@altum.de
Tel: +49 (0)2471 209385 | Mobil: +49 (0)176 34473913
GPG Public Key C2E467BB | Jabber: dirk.heinri...@altum.de




Re: Paravirtualized pause loop handling

2012-09-20 Thread Raghavendra K T

On 09/13/2012 02:48 AM, Jiannan Ouyang wrote:

Hi Raghu,

I've been working on improving paravirtualized spinlock performance for a
while; based on my past findings, I have come up with a new idea to make the
pause-loop handler more efficient.

Our original idea was to expose vmm scheduling information to the guest, so a
lock requester can sleep/yield when the lock holder has been scheduled out,
instead of spinning for SPIN_THRESHOLD loops. However, as I moved forward, I
found the problems of this approach to be:
- the saving from SPIN_THRESHOLD is only a few microseconds


We try to set SPIN_THRESHOLD to an optimal value (the typical lock-holding
time). If we spin for longer than that, it would ideally mean a lock-holder
preemption (LHP) case.

But I agree that choosing a good SPIN_THRESHOLD is a little tricky.


- yielding to another CPU is not efficient because the yielder will only come
back after a few ms, 1000x longer than the normal lock waiting time


No. It is efficient if we are able to refine the candidate vcpus to
yield_to. But it is tricky to find a good candidate too.
Here was one successful attempt.
https://lkml.org/lkml/2012/7/18/247


- sleeping upon lock holder preemption makes sense, but that has been done
very well by your pv_lock patch

Below is some data I got
- 4 core guest x2 on 4 core host
- guest1: hackbench 10 run average completion time, lower is better
- guest2: 4 process while true

                          Average(s)    Stdev
Native                      8.6739      0.51965
Stock kernel (-ple)        84.1841     17.37156
+ ple                      80.6322     27.6574
+ cpu binding              25.6569      1.93028
+ pv_lock                  17.8462      0.74884
+ cpu binding + pv_lock    16.9935      0.772416

Observations are:
- the improvement from ple (~4s) is much smaller than from pv_lock and
cpu_binding (~60s)
- the best performance comes from pv_lock with cpu binding, which binds the
4 vcpus to four physical cores. Idea from (1)



Results are interesting. I am trying out V9 with all the improvements that
took place after V8.



Then I came up with the paravirtualized pause-loop exit idea.
The current vcpu boosting strategy upon ple is not very efficient, because
1) it may boost the wrong vcpu, and 2) the time for the lock holder to come
back is very likely to be a few ms, much longer than the normal lock waiting
time of a few us.

What we can do is expose guest lock waiting information to the VMM; upon ple,
the vmm can make the vcpu sleep on the lock holder's wait queue and wake it
up later, when the lock holder is scheduled in. Or, taking it one step
further, make a vcpu sleep on the previous ticket holder's wait queue, thus
ensuring the order of wake-up, roughly as sketched below.
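A rough host-side sketch of that ordering idea (plain C with invented names;
the real patches are not posted yet):

	#include <stdbool.h>
	#include <stddef.h>

	struct vcpu {
		bool asleep;
		unsigned int waiting_ticket;	/* read from a guest-shared page */
		struct vcpu *next_waiter;	/* waiters, ordered by ticket */
	};

	/* On a PLE exit: instead of yield_to() on a guessed vcpu, queue this
	 * vcpu behind the previous ticket holder so wake-ups follow the
	 * FIFO ticket order. */
	static void ple_exit_sleep(struct vcpu *self, struct vcpu *prev_holder)
	{
		struct vcpu **p = &prev_holder->next_waiter;

		while (*p && (*p)->waiting_ticket < self->waiting_ticket)
			p = &(*p)->next_waiter;
		self->next_waiter = *p;
		*p = self;
		self->asleep = true;	/* a real host would schedule() out here */
	}

	/* When a vcpu is scheduled back in (and may now release its lock),
	 * wake the first waiter so the next ticket owner makes progress. */
	static void vcpu_sched_in(struct vcpu *v)
	{
		if (v->next_waiter)
			v->next_waiter->asleep = false;	/* wake_up() stand-in */
	}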



This is very interesting. Can you share the patches?


I'm almost done with the implementation, except for some testing work. Any
comments or suggestions?

Thanks
--Jiannan

Reference
(1) Is co-scheduling too expensive for SMP VMs? O. Sukwong, H. S. Kim, EuroSys '11




[PATCH 1/1] kvmclock: fix guest stop notification

2012-09-20 Thread Amit Shah
Commit f349c12c0434e29c79ecde89029320c4002f7253 added the guest stop
notification, but it did it in a way that the stop notification would
never reach the kernel.  The kvm_vm_state_changed() function gets a
value of 0 for the 'running' parameter when the VM is stopped, making
all the code added previously dead code.

This patch reworks the code so that it's called when 'running' is 0,
which indicates the VM was stopped.

CC: Eric B Munson emun...@mgebm.net
CC: Raghavendra K T raghavendra...@linux.vnet.ibm.com
CC: Andreas Färber afaer...@suse.de
CC: Marcelo Tosatti mtosa...@redhat.com
CC: Paolo Bonzini pbonz...@redhat.com
CC: Laszlo Ersek ler...@redhat.com
Signed-off-by: Amit Shah amit.s...@redhat.com
---
 hw/kvm/clock.c |   21 +++--
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/hw/kvm/clock.c b/hw/kvm/clock.c
index 824b978..f3427eb 100644
--- a/hw/kvm/clock.c
+++ b/hw/kvm/clock.c
@@ -71,18 +71,19 @@ static void kvmclock_vm_state_change(void *opaque, int 
running,
 
 if (running) {
 s->clock_valid = false;
+return;
+}
 
-if (!cap_clock_ctrl) {
-return;
-}
-for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
-ret = kvm_vcpu_ioctl(penv, KVM_KVMCLOCK_CTRL, 0);
-if (ret) {
-if (ret != -EINVAL) {
-fprintf(stderr, "%s: %s\n", __func__, strerror(-ret));
-}
-return;
+if (!cap_clock_ctrl) {
+return;
+}
+for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
+ret = kvm_vcpu_ioctl(penv, KVM_KVMCLOCK_CTRL, 0);
+if (ret) {
+if (ret != -EINVAL) {
+fprintf(stderr, "%s: %s\n", __func__, strerror(-ret));
 }
+return;
 }
 }
 }
-- 
1.7.7.6



Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-20 Thread Avi Kivity
On 09/20/2012 04:43 AM, Hao, Xudong wrote:
 -Original Message-
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Wednesday, September 19, 2012 6:24 PM
 To: Hao, Xudong
 Cc: Marcelo Tosatti; kvm@vger.kernel.org; Zhang, Xiantao
 Subject: Re: [PATCH v3] kvm/fpu: Enable fully eager restore kvm FPU
  That may be:
 
  static bool lazy_fpu_allowed()
  {
 return !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY));
  }
 
 Shouldn't it depend on cr4.osxsave as well?
 
 
 It does need to check cr4.osxsave; that is done in a separate function:
 
 static bool lazy_fpu_allowed(struct kvm_vcpu *vcpu)
 {
   return !kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) ||
   !(vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY));
 }

Yes.

 
 
  On guest entry:
  if (!lazy_fpu_allowed(vcpu))
 kvm_x86_ops->fpu_activate(vcpu);
 
 
 But we already have that:
 
 if (vcpu->fpu_active)
  kvm_load_guest_fpu(vcpu);
 
 so why not manage fpu_active to be always set when needed?  I don't want
 more checks in the entry path.

 I meant adding fpu_activate() in kvm_set_xcr(), not in guest entry. Then
 fpu_active will always be set when the guest initializes xstate.
  
 @@ -574,6 +574,9 @@ int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 kvm_inject_gp(vcpu, 0);
 return 1;
 }
 +   if (!lazy_fpu_allowed(vcpu))
 +   kvm_x86_ops->fpu_activate(vcpu);
 return 0;
 

And of course on cr4 update.  So a function update_lazy_fpu() to be
called from both places is needed.
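Putting the two together, the helper could be as small as this sketch (names
taken from this thread; untested):

	/* Recompute eager-FPU state; call from kvm_set_xcr() and from the
	 * cr4 write path so the two stay in sync. */
	static void update_lazy_fpu(struct kvm_vcpu *vcpu)
	{
		if (!lazy_fpu_allowed(vcpu))
			kvm_x86_ops->fpu_activate(vcpu);
	}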


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 01/15] ARM: add mem_type prot_pte accessor

2012-09-20 Thread Marc Zyngier
On 18/09/12 22:53, Christoffer Dall wrote:
 On Tue, Sep 18, 2012 at 5:04 PM, Russell King - ARM Linux
 li...@arm.linux.org.uk wrote:
 On Sat, Sep 15, 2012 at 11:34:36AM -0400, Christoffer Dall wrote:
 From: Marc Zyngier marc.zyng...@arm.com

 The KVM hypervisor mmu code requires access to the mem_type prot_pte
 field when setting up page tables pointing to a device. Unfortunately,
 the mem_type structure is opaque.

 Add an accessor (get_mem_type_prot_pte()) to retrieve the prot_pte
 value.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

 Is there a reason why we need this to be exposed, along with all the
 page table manipulation in patch 7?

 Is there a reason why we can't have new MT_ types for PAGE_HYP and
 the HYP MT_DEVICE type (which is the same as MT_DEVICE but with
 PTE_USER set) and have the standard ARM/generic kernel code build
 those mappings?
 
 For hyp mode we can do this, but we cannot do this for the cpu
 interfaces that need to be mapped into each VM, as they each have
 their own pgd. 

Isn't that the same problem? The HYP mode has its own pgd too. I think
this is the main issue with the generic code. If we can come up with an
interface that allows the generic code to work on alternative pgds, we
could pretty much do what Russell suggests here.

M.
-- 
Jazz is not dead. It just smells funny...



Re: [PATCH v2 3/5] KVM: MMU: cleanup FNAME(page_fault)

2012-09-20 Thread Avi Kivity
On 09/14/2012 12:58 PM, Xiao Guangrong wrote:
 Let it return emulate state instead of spte like __direct_map
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/paging_tmpl.h |   28 ++--
  1 files changed, 10 insertions(+), 18 deletions(-)
 
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 92f466c..0adf376 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -463,20 +463,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
 struct guest_walker *gw,
  /*
   * Fetch a shadow pte for a specific level in the paging hierarchy.
   */
 -static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 +static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
struct guest_walker *gw,
int user_fault, int write_fault, int hlevel,
 -  int *emulate, pfn_t pfn, bool map_writable,
 -  bool prefault)
 +  pfn_t pfn, bool map_writable, bool prefault)
  {

Please document the return value in the comment.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH v2 4/5] KVM: MMU: introduce page_fault_start and page_fault_end

2012-09-20 Thread Avi Kivity
On 09/14/2012 12:59 PM, Xiao Guangrong wrote:
 Wrap the common operations into these two functions
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   53 +++
  arch/x86/kvm/paging_tmpl.h |   16 +
  2 files changed, 39 insertions(+), 30 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 29ce28b..7e7b8cd 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2825,6 +2825,29 @@ exit:
  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
gva_t gva, pfn_t *pfn, bool write, bool *writable);
 
 +static bool
 +page_fault_start(struct kvm_vcpu *vcpu, gfn_t *gfnp, pfn_t *pfnp, int *levelp,
 +  bool force_pt_level, unsigned long mmu_seq)
 +{
 + spin_lock(&vcpu->kvm->mmu_lock);
 + if (mmu_notifier_retry(vcpu, mmu_seq))
 + return false;
 +
 + kvm_mmu_free_some_pages(vcpu);
 + if (likely(!force_pt_level))
 + transparent_hugepage_adjust(vcpu, gfnp, pfnp, levelp);
 +
 + return true;
 +}
 +
 +static void page_fault_end(struct kvm_vcpu *vcpu, pfn_t pfn)
 +{
 + spin_unlock(&vcpu->kvm->mmu_lock);
 +
 + if (!is_error_pfn(pfn))
 + kvm_release_pfn_clean(pfn);
 +}

Needs sparse annotations (__acquires, __releases).

These code blocks have nothing in common except for being shared.  Often
that's not good for maintainability because it means that further
changes can affect one path but not the other.  But we can try it out
and see.



-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-09-20 Thread Avi Kivity
On 09/19/2012 08:44 PM, Auld, Will wrote:
 From 9982bb73460b05c1328068aae047b14b2294e2da Mon Sep 17 00:00:00 2001
 From: Will Auld will.a...@intel.com
 Date: Wed, 12 Sep 2012 18:10:56 -0700
 Subject: [PATCH] Enabling IA32_TSC_ADJUST for guest VM
 
 CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
 
 Basic design is to emulate the MSR by allowing reads and writes to a guest 
 vcpu specific location to store the value of the emulated MSR while adding 
 the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will 
 be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This 
 is of course as long as the use TSC counter offsetting VM-execution control 
 is enabled as well as the IA32_TSC_ADJUST control.
 
 However, because hardware will only return TSC + IA32_TSC_ADJUST + vmcs
 tsc_offset for a guest process when it does a rdtsc (with the correct
 settings), the value of our virtualized IA32_TSC_ADJUST must be stored in
 one of these three locations. The argument against storing it in the actual
 MSR is performance: it is likely to be seldom used, while the save/restore
 would be required on every transition. IA32_TSC_ADJUST was created as a way
 to solve some issues with writing TSC itself, so that is not an option
 either. The remaining option, defined above as our solution, has the problem
 of returning incorrect vmcs tsc_offset values (unless we intercept and fix,
 not done here) as mentioned above. However, more problematic is that storing
 the data in the vmcs tsc_offset will have a different semantic effect on the
 system than using the actual MSR. This is illustrated in the following
 example: the hypervisor sets IA32_TSC_ADJUST, then the guest sets it, and a
 guest process performs a rdtsc. In this case the guest process will get
 TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including
 IA32_TSC_ADJUST_guest. While the total system semantics changed, the
 semantics as seen by the guest do not, and hence this will not cause a
 problem.
 +++ b/arch/x86/kvm/cpuid.c
 @@ -248,8 +248,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
 u32 function,
  
   /* cpuid 7.0.ebx */
   const u32 kvm_supported_word9_x86_features =
 - F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
 - F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
 + F(FSGSBASE) | F(TSC_ADJUST) | F(BMI1) | F(HLE) |
 + F(AVX2) | F(SMEP) | F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
  

You're exposing this feature unconditionally, but part of the
implementation is in vmx.c.  This means that if an AMD processor arrives
that implements the feature, we will expose the feature even though we
lack some of the implementation.

So we need to mask the feature here based on a callback from kvm_x86_ops.
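For illustration, the masking could look roughly like the sketch below; the
has_tsc_adjust() callback is invented here and would have to be added to
kvm_x86_ops and implemented (or left NULL) by vmx/svm:

	/* cpuid 7.0.ebx: let the vendor module veto TSC_ADJUST */
	u32 kvm_supported_word9_x86_features =
		F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
		F(BMI2) | F(ERMS) | f_invpcid | F(RTM);

	if (kvm_x86_ops->has_tsc_adjust && kvm_x86_ops->has_tsc_adjust())
		kvm_supported_word9_x86_features |= F(TSC_ADJUST);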

   /* all calls to cpuid_count() should be made on the same cpu */
   get_cpu();
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index c00f03d..35d11b3 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -2173,6 +2173,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, u32 
 msr_index, u64 *pdata)
   case MSR_IA32_SYSENTER_ESP:
   data = vmcs_readl(GUEST_SYSENTER_ESP);
   break;
 + case MSR_TSC_ADJUST:
 + data = (u64)vcpu->arch.tsc_adjust;
 + break;

Can be moved to common code.

   case MSR_TSC_AUX:
   if (!to_vmx(vcpu)->rdtscp_enabled)
   return 1;
 @@ -2241,6 +2244,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 
 msr_index, u64 data)
   }
   ret = kvm_set_msr_common(vcpu, msr_index, data);
   break;
 + case MSR_TSC_ADJUST:
 +#define DUMMY 1

What is this?

 + vmx_adjust_tsc_offset(vcpu,
 + (s64)(data-vcpu->arch.tsc_adjust),

Cast unneeded; space between operands please.

 + (bool)DUMMY);
 + vcpu->arch.tsc_adjust = (s64)data;

Cast is unneeded.

 + break;
   case MSR_TSC_AUX:
   if (!vmx->rdtscp_enabled)
   return 1;
 @@ -3931,6 +3941,8 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
  
   vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP));
  
 + vcpu->arch.tsc_adjust = 0x0;
 +

Can be moved to common code.

   vmx->rmode.vm86_active = 0;
  

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] Enabling IA32_TSC_ADJUST for guest VM

2012-09-20 Thread Avi Kivity
On 09/19/2012 08:44 PM, Auld, Will wrote:
 @@ -2241,6 +2244,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, u32 
 msr_index, u64 data)
   }
   ret = kvm_set_msr_common(vcpu, msr_index, data);
   break;
 + case MSR_TSC_ADJUST:
 +#define DUMMY 1
 + vmx_adjust_tsc_offset(vcpu,
 + (s64)(data-vcpu->arch.tsc_adjust),
 + (bool)DUMMY);
 + vcpu->arch.tsc_adjust = (s64)data;
 + break;
   case MSR_TSC_AUX:
   if (!vmx->rdtscp_enabled)
   return 1;

Writes to MSR_IA32_TSC also need to adjust MSR_IA32_TSC_ADJUST.
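(Per the SDM, a WRMSR that moves the TSC by some delta also moves
IA32_TSC_ADJUST by the same delta, so the write path would need something
along these lines; kvm_read_current_tsc() is a stand-in for however the
current guest TSC is read back:)

	case MSR_IA32_TSC: {
		/* keep TSC_ADJUST tracking the delta applied to TSC */
		s64 delta = data - kvm_read_current_tsc(vcpu);

		vcpu->arch.tsc_adjust += delta;
		kvm_write_tsc(vcpu, data);
		break;
	}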


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH] KVM: x86: Fix guest debug across vcpu INIT reset

2012-09-20 Thread Avi Kivity
On 09/19/2012 10:38 PM, Jan Kiszka wrote:
 If we reset a vcpu on INIT, make sure to not touch dr7 as stored in the
 VMCS/VMCB and also switch_db_regs if guest debugging is using hardware
 breakpoints. Otherwise, the vcpu will not trigger hardware breakpoints
 until userspace issues another KVM_SET_GUEST_DEBUG IOCTL for it.
 
 Found while trying to stop on start_secondary.
 
 @@ -1146,7 +1146,8 @@ static void init_vmcb(struct vcpu_svm *svm)
  
   svm_set_efer(&svm->vcpu, 0);
   save->dr6 = 0x0ff0;
 - save->dr7 = 0x400;
 + if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_USE_HW_BP))
 + save->dr7 = 0x400;

Whenever we multiplex a resource foo for several users, we should have a
function update_foo() that recalculates foo from all its sources:

vcpu->arch.dr7 = 0x400;
update_dr7(&svm->vcpu);

(don't know if the first line really belongs here)

   kvm_set_rflags(&svm->vcpu, 2);
   save->rip = 0xfff0;
   svm->vcpu.arch.regs[VCPU_REGS_RIP] = save->rip;
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index d62b413..37f68f7 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -3959,7 +3959,8 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
   kvm_rip_write(vcpu, 0);
   kvm_register_write(vcpu, VCPU_REGS_RSP, 0);
  
 - vmcs_writel(GUEST_DR7, 0x400);
 + if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP))
 + vmcs_writel(GUEST_DR7, 0x400);
  

Ditto. update_dr7() could even be generic and use ->set_dr7().

   vmcs_writel(GUEST_GDTR_BASE, 0);
   vmcs_write32(GUEST_GDTR_LIMIT, 0x);
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index c4d451e..7c8c2b8 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6043,7 +6043,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
   vcpu->arch.nmi_pending = 0;
   vcpu->arch.nmi_injected = false;
 
 - vcpu->arch.switch_db_regs = 0;
 + if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP))
 + vcpu->arch.switch_db_regs = 0;
   memset(vcpu->arch.db, 0, sizeof(vcpu->arch.db));
   vcpu->arch.dr6 = DR6_FIXED_1;
   vcpu->arch.dr7 = DR7_FIXED_1;
 

You could move the switch_db_regs calculation into update_dr7(), and
move the update_dr7() call after assigning it here.  This fixes
everything neatly (and obviates the need for the first line in the
snippet above).
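i.e. something like this sketch (the guest_debug_dr7 field is assumed for the
host-requested value; the callback is the ->set_dr7() mentioned above):

	/* Recompute the effective dr7 from both of its sources. */
	static void update_dr7(struct kvm_vcpu *vcpu)
	{
		unsigned long dr7;

		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
			dr7 = vcpu->arch.guest_debug_dr7;	/* host breakpoints */
		else
			dr7 = vcpu->arch.dr7;			/* guest's own value */

		kvm_x86_ops->set_dr7(vcpu, dr7);
		vcpu->arch.switch_db_regs = (dr7 & ~DR7_FIXED_1) != 0;
	}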


-- 
error compiling committee.c: too many arguments to function


Re: [PATCHv4] KVM: optimize apic interrupt delivery

2012-09-20 Thread Avi Kivity
On 09/13/2012 05:19 PM, Gleb Natapov wrote:
 Most interrupts are delivered to only one vcpu. Use pre-built tables to
 find the interrupt destination instead of looping through all vcpus. In
 case of logical mode, loop only through the vcpus in the logical cluster
 the irq is sent to.
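(For readers without the patch at hand, the gist as a generic illustration,
not the patch's exact structures:)

	struct kvm_vcpu;	/* opaque here */

	/* Rebuilt whenever a vcpu's APIC state changes; fixed-mode delivery
	 * then indexes a table instead of scanning every vcpu. */
	struct apic_map {
		struct kvm_vcpu *phys_map[256];		/* physical APIC id -> vcpu */
		struct kvm_vcpu *logical_map[16][16];	/* cluster, bit -> vcpu */
	};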

Applied, thanks.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH 1/1] kvmclock: fix guest stop notification

2012-09-20 Thread Marcelo Tosatti
On Thu, Sep 20, 2012 at 01:55:20PM +0530, Amit Shah wrote:
 Commit f349c12c0434e29c79ecde89029320c4002f7253 added the guest stop
 notification, but it did it in a way that the stop notification would
 never reach the kernel.  The kvm_vm_state_changed() function gets a
 value of 0 for the 'running' parameter when the VM is stopped, making
 all the code added previously dead code.
 
 This patch reworks the code so that it's called when 'running' is 0,
 which indicates the VM was stopped.
 
 CC: Eric B Munson emun...@mgebm.net
 CC: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 CC: Andreas Färber afaer...@suse.de
 CC: Marcelo Tosatti mtosa...@redhat.com
 CC: Paolo Bonzini pbonz...@redhat.com
 CC: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Amit Shah amit.s...@redhat.com
 ---
  hw/kvm/clock.c |   21 +++--
  1 files changed, 11 insertions(+), 10 deletions(-)
 
 diff --git a/hw/kvm/clock.c b/hw/kvm/clock.c
 index 824b978..f3427eb 100644
 --- a/hw/kvm/clock.c
 +++ b/hw/kvm/clock.c
 @@ -71,18 +71,19 @@ static void kvmclock_vm_state_change(void *opaque, int 
 running,
  
  if (running) {
  s->clock_valid = false;
 +return;
 +}
  
 -if (!cap_clock_ctrl) {
 -return;
 -}
 -for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
 -ret = kvm_vcpu_ioctl(penv, KVM_KVMCLOCK_CTRL, 0);
 -if (ret) {
 -if (ret != -EINVAL) {
 -fprintf(stderr, "%s: %s\n", __func__, strerror(-ret));
 -}
 -return;
 +if (!cap_clock_ctrl) {
 +return;
 +}
 +for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
 +ret = kvm_vcpu_ioctl(penv, KVM_KVMCLOCK_CTRL, 0);
 +if (ret) {
 +if (ret != -EINVAL) {
 +fprintf(stderr, "%s: %s\n", __func__, strerror(-ret));
  }
 +return;
  }
  }
  }
 -- 
 1.7.7.6

ACK

Avi, please merge through uq/master.



Re: [PATCH 1/1] kvmclock: fix guest stop notification

2012-09-20 Thread Marcelo Tosatti

Amit, please use uq/master in the subject to help
the person who is merging patches.

On Thu, Sep 20, 2012 at 01:55:20PM +0530, Amit Shah wrote:
 Commit f349c12c0434e29c79ecde89029320c4002f7253 added the guest stop
 notification, but it did it in a way that the stop notification would
 never reach the kernel.  The kvm_vm_state_changed() function gets a
 value of 0 for the 'running' parameter when the VM is stopped, making
 all the code added previously dead code.
 
 This patch reworks the code so that it's called when 'running' is 0,
 which indicates the VM was stopped.
 
 CC: Eric B Munson emun...@mgebm.net
 CC: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 CC: Andreas Färber afaer...@suse.de
 CC: Marcelo Tosatti mtosa...@redhat.com
 CC: Paolo Bonzini pbonz...@redhat.com
 CC: Laszlo Ersek ler...@redhat.com
 Signed-off-by: Amit Shah amit.s...@redhat.com
 ---
  hw/kvm/clock.c |   21 +++--
  1 files changed, 11 insertions(+), 10 deletions(-)
 
 diff --git a/hw/kvm/clock.c b/hw/kvm/clock.c
 index 824b978..f3427eb 100644
 --- a/hw/kvm/clock.c
 +++ b/hw/kvm/clock.c
 @@ -71,18 +71,19 @@ static void kvmclock_vm_state_change(void *opaque, int 
 running,
  
  if (running) {
  s->clock_valid = false;
 +return;
 +}
  
 -if (!cap_clock_ctrl) {
 -return;
 -}
 -for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
 -ret = kvm_vcpu_ioctl(penv, KVM_KVMCLOCK_CTRL, 0);
 -if (ret) {
 -if (ret != -EINVAL) {
 -fprintf(stderr, "%s: %s\n", __func__, strerror(-ret));
 -}
 -return;
 +if (!cap_clock_ctrl) {
 +return;
 +}
 +for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
 +ret = kvm_vcpu_ioctl(penv, KVM_KVMCLOCK_CTRL, 0);
 +if (ret) {
 +if (ret != -EINVAL) {
 +fprintf(stderr, "%s: %s\n", __func__, strerror(-ret));
  }
 +return;
  }
  }
  }
 -- 
 1.7.7.6


Re: [PATCH v8 1/3] KVM: x86: export svm/vmx exit code and vector code to userspace

2012-09-20 Thread Marcelo Tosatti
On Mon, Sep 17, 2012 at 07:58:57AM -0700, Arnaldo Carvalho de Melo wrote:
 On Mon, Sep 17, 2012 at 04:31:13PM +0800, Dong Hao wrote:
  From: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  
  Exporting KVM exit information to userspace to be consumed by perf.
  
  [ Dong Hao haod...@linux.vnet.ibm.com: rebase it on acme's git tree ]
  Signed-off-by: Dong Hao haod...@linux.vnet.ibm.com
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 
 Do we have acked/reviewed-by for this patch? Marcelo? Avi?

Code move... ACK.

 
   arch/x86/include/asm/kvm.h  |   16 +++
   arch/x86/include/asm/kvm_host.h |   16 ---
   arch/x86/include/asm/svm.h  |  205 
  +--
   arch/x86/include/asm/vmx.h  |  127 
   arch/x86/kvm/trace.h|   89 -
   5 files changed, 230 insertions(+), 223 deletions(-)
  
  diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
  index 246617e..41e08cb 100644
  --- a/arch/x86/include/asm/kvm.h
  +++ b/arch/x86/include/asm/kvm.h
  @@ -9,6 +9,22 @@
   #include linux/types.h
   #include linux/ioctl.h
   
  +#define DE_VECTOR 0
  +#define DB_VECTOR 1
  +#define BP_VECTOR 3
  +#define OF_VECTOR 4
  +#define BR_VECTOR 5
  +#define UD_VECTOR 6
  +#define NM_VECTOR 7
  +#define DF_VECTOR 8
  +#define TS_VECTOR 10
  +#define NP_VECTOR 11
  +#define SS_VECTOR 12
  +#define GP_VECTOR 13
  +#define PF_VECTOR 14
  +#define MF_VECTOR 16
  +#define MC_VECTOR 18
  +
   /* Select x86 specific features in linux/kvm.h */
   #define __KVM_HAVE_PIT
   #define __KVM_HAVE_IOAPIC
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index 09155d6..1eaa6b0 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -75,22 +75,6 @@
   #define KVM_HPAGE_MASK(x)  (~(KVM_HPAGE_SIZE(x) - 1))
   #define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
   
  -#define DE_VECTOR 0
  -#define DB_VECTOR 1
  -#define BP_VECTOR 3
  -#define OF_VECTOR 4
  -#define BR_VECTOR 5
  -#define UD_VECTOR 6
  -#define NM_VECTOR 7
  -#define DF_VECTOR 8
  -#define TS_VECTOR 10
  -#define NP_VECTOR 11
  -#define SS_VECTOR 12
  -#define GP_VECTOR 13
  -#define PF_VECTOR 14
  -#define MF_VECTOR 16
  -#define MC_VECTOR 18
  -
   #define SELECTOR_TI_MASK (1  2)
   #define SELECTOR_RPL_MASK 0x03
   
  diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
  index f2b83bc..cdf5674 100644
  --- a/arch/x86/include/asm/svm.h
  +++ b/arch/x86/include/asm/svm.h
  @@ -1,6 +1,135 @@
   #ifndef __SVM_H
   #define __SVM_H
   
  +#define SVM_EXIT_READ_CR0  0x000
  +#define SVM_EXIT_READ_CR3  0x003
  +#define SVM_EXIT_READ_CR4  0x004
  +#define SVM_EXIT_READ_CR8  0x008
  +#define SVM_EXIT_WRITE_CR0 0x010
  +#define SVM_EXIT_WRITE_CR3 0x013
  +#define SVM_EXIT_WRITE_CR4 0x014
  +#define SVM_EXIT_WRITE_CR8 0x018
  +#define SVM_EXIT_READ_DR0  0x020
  +#define SVM_EXIT_READ_DR1  0x021
  +#define SVM_EXIT_READ_DR2  0x022
  +#define SVM_EXIT_READ_DR3  0x023
  +#define SVM_EXIT_READ_DR4  0x024
  +#define SVM_EXIT_READ_DR5  0x025
  +#define SVM_EXIT_READ_DR6  0x026
  +#define SVM_EXIT_READ_DR7  0x027
  +#define SVM_EXIT_WRITE_DR0 0x030
  +#define SVM_EXIT_WRITE_DR1 0x031
  +#define SVM_EXIT_WRITE_DR2 0x032
  +#define SVM_EXIT_WRITE_DR3 0x033
  +#define SVM_EXIT_WRITE_DR4 0x034
  +#define SVM_EXIT_WRITE_DR5 0x035
  +#define SVM_EXIT_WRITE_DR6 0x036
  +#define SVM_EXIT_WRITE_DR7 0x037
  +#define SVM_EXIT_EXCP_BASE 0x040
  +#define SVM_EXIT_INTR  0x060
  +#define SVM_EXIT_NMI   0x061
  +#define SVM_EXIT_SMI   0x062
  +#define SVM_EXIT_INIT  0x063
  +#define SVM_EXIT_VINTR 0x064
  +#define SVM_EXIT_CR0_SEL_WRITE 0x065
  +#define SVM_EXIT_IDTR_READ 0x066
  +#define SVM_EXIT_GDTR_READ 0x067
  +#define SVM_EXIT_LDTR_READ 0x068
  +#define SVM_EXIT_TR_READ   0x069
  +#define SVM_EXIT_IDTR_WRITE0x06a
  +#define SVM_EXIT_GDTR_WRITE0x06b
  +#define SVM_EXIT_LDTR_WRITE0x06c
  +#define SVM_EXIT_TR_WRITE  0x06d
  +#define SVM_EXIT_RDTSC 0x06e
  +#define SVM_EXIT_RDPMC 0x06f
  +#define SVM_EXIT_PUSHF 0x070
  +#define SVM_EXIT_POPF  0x071
  +#define SVM_EXIT_CPUID 0x072
  +#define SVM_EXIT_RSM   0x073
  +#define SVM_EXIT_IRET  0x074
  +#define SVM_EXIT_SWINT 0x075
  +#define SVM_EXIT_INVD  0x076
  +#define SVM_EXIT_PAUSE 0x077
  +#define SVM_EXIT_HLT   0x078
  +#define SVM_EXIT_INVLPG0x079
  +#define SVM_EXIT_INVLPGA   0x07a
  +#define SVM_EXIT_IOIO  0x07b
  +#define SVM_EXIT_MSR   0x07c
  +#define SVM_EXIT_TASK_SWITCH   0x07d
  +#define SVM_EXIT_FERR_FREEZE   0x07e
  +#define SVM_EXIT_SHUTDOWN  0x07f
  +#define SVM_EXIT_VMRUN 0x080
  +#define SVM_EXIT_VMMCALL   0x081
  

RE: [PATCH 01/10] ARM: KVM: Keep track of currently running vcpus

2012-09-20 Thread Min-gyu Kim
 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Christoffer Dall
 Sent: Sunday, September 16, 2012 12:37 AM
 To: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
 kvm...@lists.cs.columbia.edu
 Subject: [PATCH 01/10] ARM: KVM: Keep track of currently running vcpus
 
 From: Marc Zyngier marc.zyng...@arm.com
 
 When an interrupt occurs for the guest, it is sometimes necessary to find
 out which vcpu was running at that point.
 
 Keep track of which vcpu is being run in kvm_arch_vcpu_ioctl_run(), and
 allow the data to be retrieved using either:
 - kvm_arm_get_running_vcpu(): returns the vcpu running at this point
   on the current CPU. Can only be used in a non-preemptable context.

What's the purpose of kvm_arm_get_running_vcpu?
It seems to be enough to pass the vcpu struct through a function argument,
and there is no caller for now.


 - kvm_arm_get_running_vcpus(): returns the per-CPU variable holding
   the running vcpus, usable for per-CPU interrupts.
 
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com
 ---
  arch/arm/include/asm/kvm_host.h |9 +
  arch/arm/kvm/arm.c  |   30 ++
  2 files changed, 39 insertions(+)
 
 diff --git a/arch/arm/include/asm/kvm_host.h
 b/arch/arm/include/asm/kvm_host.h index 3fec9ad..2e3ac1c 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -208,4 +208,13 @@ static inline int kvm_test_age_hva(struct kvm *kvm,
 unsigned long hva)  {
   return 0;
  }
 +
 +struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
 +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void);
 +
 +int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices);
 +unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu);
 +struct kvm_one_reg;
 +int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
 +int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
  #endif /* __ARM_KVM_HOST_H__ */
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index
 64fbec7..e6c3743 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -54,11 +54,38 @@ static DEFINE_PER_CPU(unsigned long,
 kvm_arm_hyp_stack_page);  static struct vfp_hard_struct __percpu
 *kvm_host_vfp_state;  static unsigned long hyp_default_vectors;
 
 +/* Per-CPU variable containing the currently running vcpu. */
 +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu);
 +
  /* The VMID used in the VTTBR */
  static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1);  static u8
 kvm_next_vmid;  static DEFINE_SPINLOCK(kvm_vmid_lock);
 
 +static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu)
 +{
 +	BUG_ON(preemptible());
 +	__get_cpu_var(kvm_arm_running_vcpu) = vcpu;
 +}
 +
 +/**
 + * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU.
 + * Must be called from non-preemptible context
 + */
 +struct kvm_vcpu *kvm_arm_get_running_vcpu(void)
 +{
 + BUG_ON(preemptible());
 + return __get_cpu_var(kvm_arm_running_vcpu);
 +}
 +
 +/**
 + * kvm_arm_get_running_vcpus - get the per-CPU array of currently running vcpus.
 + */
 +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void)
 +{
 +	return &kvm_arm_running_vcpu;
 +}
 +
  int kvm_arch_hardware_enable(void *garbage)  {
   return 0;
 @@ -293,10 +320,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int
 cpu)
   cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
   flush_cache_all(); /* We'd really want v7_flush_dcache_all()
 */
   }
 +
 + kvm_arm_set_running_vcpu(vcpu);
  }
 
  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)  {
 + kvm_arm_set_running_vcpu(NULL);
  }
 
  int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 


Re: [PATCH 01/15] ARM: add mem_type prot_pte accessor

2012-09-20 Thread Christoffer Dall
On Thu, Sep 20, 2012 at 6:01 AM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 18/09/12 22:53, Christoffer Dall wrote:
 On Tue, Sep 18, 2012 at 5:04 PM, Russell King - ARM Linux
 li...@arm.linux.org.uk wrote:
 On Sat, Sep 15, 2012 at 11:34:36AM -0400, Christoffer Dall wrote:
 From: Marc Zyngier marc.zyng...@arm.com

 The KVM hypervisor mmu code requires access to the mem_type prot_pte
 field when setting up page tables pointing to a device. Unfortunately,
 the mem_type structure is opaque.

 Add an accessor (get_mem_type_prot_pte()) to retrieve the prot_pte
 value.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

 Is there a reason why we need this to be exposed, along with all the
 page table manipulation in patch 7?

 Is there a reason why we can't have new MT_ types for PAGE_HYP and
 the HYP MT_DEVICE type (which is the same as MT_DEVICE but with
 PTE_USER set) and have the standard ARM/generic kernel code build
 those mappings?

 For hyp mode we can do this, but we cannot do this for the cpu
 interfaces that need to be mapped into each VM as they have each their
 own pgd.

 Isn't that the same problem? The HYP mode has its own pgd too. I think
 this is the main issue with the generic code. If we can come up with an
 interface that allows the generic code to work on alternative pgds, we
 could pretty much do what Russell suggests here.

Hyp mode has its own pgd, but there will only be one of them which can
be allocated at boot and setup at boot in mmu.c

This will not be the case for guest pages.  On the other hand, this
locks us to only one user of such mappings and more users could
potentially (I know it's a stretch) clutter mmu.c later on, which is
why I suggest the PAGE_KVM_DEVICE approach for now, which should then
be renamed, PAGE_S2_DEVICE probably.


Re: [PATCH v8 3/3] KVM: perf: kvm events analysis tool

2012-09-20 Thread David Ahern

On 9/17/12 2:31 AM, Dong Hao wrote:

From: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

Add 'perf kvm stat' support to analyze kvm vmexit/mmio/ioport smartly

Usage:
- kvm stat
   run a command and gather performance counter statistics, it is the alias of
   perf stat

- trace kvm events:
   perf kvm stat record, or, if other tracepoints are interesting as well, we
   can append the events like this:
   perf kvm stat record -e timer:* -a

   If many guests are running, we can track the specified guest by using -p or
   --pid, -a is used to track events generated by all guests.

- show the result:
   perf kvm stat report

The output example is following:
# pgrep qemu
13005
13059

total 2 guests are running on the host

Then, track the guest whose pid is 13059:
# ./perf kvm stat record -p 13059
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.253 MB perf.data.guest (~11065 samples) ]

See the vmexit events:
# ./perf kvm stat report --event=vmexit


Analyze events for all VCPUs:

              VM-EXIT    Samples  Samples%     Time%     Avg time

          APIC_ACCESS        460    70.55%     0.01%      22.44us ( +-   1.75% )
                  HLT         93    14.26%    99.98%  832077.26us ( +-  10.42% )
   EXTERNAL_INTERRUPT         64     9.82%     0.00%      35.35us ( +-  14.21% )
    PENDING_INTERRUPT         24     3.68%     0.00%       9.29us ( +-  31.39% )
            CR_ACCESS          7     1.07%     0.00%       8.12us ( +-   5.76% )
       IO_INSTRUCTION          3     0.46%     0.00%      18.00us ( +-  11.79% )
        EXCEPTION_NMI          1     0.15%     0.00%       5.83us ( +-   -nan% )

Total Samples:652, Total events handled time:77396109.80us.

See the mmio events:
# ./perf kvm stat report --event=mmio


Analyze events for all VCPUs:

          MMIO Access    Samples  Samples%     Time%     Avg time

         0xfee00380:W        387    84.31%    79.28%       8.29us ( +-   3.32% )
         0xfee00300:W         24     5.23%     9.96%      16.79us ( +-   1.97% )
         0xfee00300:R         24     5.23%     7.83%      13.20us ( +-   3.00% )
         0xfee00310:W         24     5.23%     2.93%       4.94us ( +-   3.84% )

Total Samples:459, Total events handled time:4044.59us.

See the ioport event:
# ./perf kvm stat report --event=ioport


Analyze events for all VCPUs:

       IO Port Access    Samples  Samples%     Time%     Avg time

          0xc050:POUT          3   100.00%   100.00%      13.75us ( +-  10.83% )

Total Samples:3, Total events handled time:41.26us.

And, --vcpu is used to track the specified vcpu and --key is used to sort the
result:
# ./perf kvm stat report --event=vmexit --vcpu=0 --key=time


Analyze events for VCPU 0:

              VM-EXIT    Samples  Samples%     Time%     Avg time

                  HLT         27    13.85%    99.97%  405790.24us ( +-  12.70% )
   EXTERNAL_INTERRUPT         13     6.67%     0.00%      27.94us ( +-  22.26% )
          APIC_ACCESS        146    74.87%     0.03%      21.69us ( +-   2.91% )
       IO_INSTRUCTION          2     1.03%     0.00%      17.77us ( +-  20.56% )
            CR_ACCESS          2     1.03%     0.00%       8.55us ( +-   6.47% )
    PENDING_INTERRUPT          5     2.56%     0.00%       6.27us ( +-   3.94% )

Total Samples:195, Total events handled time:10959950.90us.

[ Dong Hao haod...@linux.vnet.ibm.com
   Runzhen Wang runz...@linux.vnet.ibm.com:

  - rebase it on current acme's tree

  - fix the compiling-error on i386

]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Signed-off-by: Dong Hao haod...@linux.vnet.ibm.com
Signed-off-by: Runzhen Wang runz...@linux.vnet.ibm.com
---
  tools/perf/Documentation/perf-kvm.txt |   30 ++-
  tools/perf/MANIFEST   |3 +
  tools/perf/builtin-kvm.c  |  840 -
  tools/perf/util/header.c  |   59 +++-
  tools/perf/util/header.h  |1 +
  tools/perf/util/thread.h  |2 +
  6 files changed, 929 insertions(+), 6 deletions(-)



Acked-by: David Ahern dsah...@gmail.com



RE: [PATCH 01/10] ARM: KVM: Keep track of currently running vcpus

2012-09-20 Thread Marc Zyngier
On Thu, 20 Sep 2012 21:53:33 +0900, Min-gyu Kim mingyu84@samsung.com
wrote:
 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Christoffer Dall
 Sent: Sunday, September 16, 2012 12:37 AM
 To: kvm@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
 kvm...@lists.cs.columbia.edu
 Subject: [PATCH 01/10] ARM: KVM: Keep track of currently running vcpus
 
 From: Marc Zyngier marc.zyng...@arm.com
 
 When an interrupt occurs for the guest, it is sometimes necessary to
find
 out which vcpu was running at that point.
 
 Keep track of which vcpu is being tun in kvm_arch_vcpu_ioctl_run(), and
 allow the data to be retrived using either:
 - kvm_arm_get_running_vcpu(): returns the vcpu running at this point
   on the current CPU. Can only be used in a non-preemptable context.
 
 What's the purpose of kvm_arm_get_running_vcpu?
  It seems to be enough to pass the vcpu struct through a function argument,
  and there is no caller for now.

This is also designed to be used in an interrupt handler, and is used by
the (currently out of tree) perf code:
https://lists.cs.columbia.edu/pipermail/kvmarm/2012-September/003192.html

Basically, you need this infrastructure when handling interrupts.
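As a hypothetical example of the pattern (the handler name is made up; the
perf series linked above does the equivalent):

	static irqreturn_t guest_pmu_irq_handler(int irq, void *dev)
	{
		/* Interrupt context is non-preemptible, so this is safe. */
		struct kvm_vcpu *vcpu = kvm_arm_get_running_vcpu();

		if (!vcpu)
			return IRQ_NONE;	/* interrupt did not hit a guest */

		/* ... account/inject the event against that vcpu ... */
		return IRQ_HANDLED;
	}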

M.
-- 
Fast, cheap, reliable. Pick two.


[PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-20 Thread Mel Gorman
Hi Richard,

This series is following up from your mail at
http://www.spinics.net/lists/kvm/msg80080.html . I am pleased the lock
contention is now reduced but acknowledge that the scanning rates are
stupidly high. Fortunately, I am reasonably confident I know what is
going wrong. If all goes according to plan this should drastically reduce
the amount of time your workload spends on compaction. I would very much
appreciate if you drop the MM patches (i.e. keep the btrfs patches) and
replace them with this series. I know that Rik's patches are dropped and
this is deliberate. I reimplemented his idea on top of the fifth patch on
this series to cover both the migrate and free scanners. Thanks to Rik who
discussed how the idea could be reimplemented on IRC which was very helpful.
Hopefully the patch actually reflects what we discussed :)

Shaohua, I would also appreciate if you tested this series. I picked up
one of your patches but replaced another and want to make sure that the
workload you were investigating is still ok.

===

Richard Davies and Shaohua Li have both reported lock contention problems
in compaction on the zone and LRU locks as well as significant amounts of
time being spent in compaction. It is critical that performance gains from
THP are not offset by the cost of allocating them in the first place. This
series aims to reduce lock contention and scanning rates.

Patch 1 is a fix for c67fe375 (mm: compaction: Abort async compaction if
locks are contended or taking too long) to properly abort in all
cases when contention is detected.

Patch 2 defers acquiring the zone->lru_lock as long as possible.

Patch 3 defers acquiring the zone->lock as long as possible.

Patch 4 reverts Rik's skip-free patches as the core concept gets
reimplemented later and the remaining patches are easier to
understand if this is reverted first.

Patch 5 adds a pageblock-skip bit to the pageblock flags to cache what
pageblocks should be skipped by the migrate and free scanners.
This drastically reduces the amount of scanning compaction has
to do.

Patch 6 reimplements something similar to Rik's idea except it uses the
pageblock-skip information to decide where the scanners should
restart from and does not need to wrap around.

I tested this on 3.6-rc5 as that was the kernel base that the earlier threads
worked on. It will need a bit of work to rebase on top of Andrew's tree for
merging due to other compaction changes, but it will not be a major problem.
Kernels tested were

vanilla 3.6-rc5
lesslockPatches 1-3
revert  Patches 1-4
cachefail   Patches 1-5
skipuseless Patches 1-6

Stress high-order allocation tests looked ok.

STRESS-HIGHALLOC
              3.6.0-rc5       3.6.0-rc5       3.6.0-rc5       3.6.0-rc5       3.6.0-rc5
              vanilla         lesslock        revert          cachefail       skipuseless
Pass 1        17.00 ( 0.00%)  19.00 ( 2.00%)  29.00 (12.00%)  24.00 ( 7.00%)  20.00 ( 3.00%)
Pass 2        16.00 ( 0.00%)  19.00 ( 3.00%)  39.00 (23.00%)  37.00 (21.00%)  35.00 (19.00%)
while Rested  88.00 ( 0.00%)  88.00 ( 0.00%)  88.00 ( 0.00%)  85.00 (-3.00%)  86.00 (-2.00%)

Success rates are improved a bit by the series as there are fewer
opportunities to race with other allocation requests if compaction is
scanning less. I recognise the success rates are still low but patches
that tackle parts of that are in Andrew's tree already.

The time to complete the tests did not vary that much and are uninteresting
as were the vmstat statistics so I will not present them here.

Using ftrace I recorded how much scanning was done by compaction and got this

                          3.6.0-rc5   3.6.0-rc5   3.6.0-rc5   3.6.0-rc5   3.6.0-rc5
                          vanilla     lesslock    revert      cachefail   skipuseless
Total free scanned        185020625   223313210   744553485   37149462    29231432
Total free isolated       845094      1174759     4301672     906689      721963
Total free efficiency     0.0046%     0.0053%     0.0058%     0.0244%     0.0247%
Total migrate scanned     187708506   143133150   428180990   21941574    12288851
Total migrate isolated    714376      1081134     3950098     711357      590552
Total migrate efficiency  0.0038%     0.0076%     0.0092%     0.0324%     0.0481%

The efficiency is worthless because of the nature of the test and the
number of failures.  The really interesting point as far as this patch
series is concerned is the number of pages scanned.

Note that reverting Rik's patches massively increases the number of pages 
scanned
indicating that those patches really did make a huge difference to CPU usage.

However, caching what pageblocks should be skipped has a much higher
impact. With patches 1-5 applied, free page scanning is reduced by 80%
in comparison to the vanilla kernel and 

[PATCH 6/6] mm: compaction: Restart compaction from near where it left off

2012-09-20 Thread Mel Gorman
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order allocations,
it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to compact
at the start of the zone, and for free pages to compact things to at the
end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at
the end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from on
subsequent compaction invocations using the pageblock-skip information. When
compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 include/linux/mmzone.h |4 
 mm/compaction.c|   54 
 mm/internal.h  |4 
 3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a456361..e7792a3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -370,6 +370,10 @@ struct zone {
int all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
unsigned long   compact_blockskip_expire;
+
+   /* pfns where compaction scanners should start */
+   unsigned long   compact_cached_free_pfn;
+   unsigned long   compact_cached_migrate_pfn;
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
diff --git a/mm/compaction.c b/mm/compaction.c
index fae0011..45a17c9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -79,6 +79,9 @@ static void reset_isolation_suitable(struct zone *zone)
 */
	if (time_before(jiffies, zone->compact_blockskip_expire))
return;
+
+   zone->compact_cached_migrate_pfn = start_pfn;
+   zone->compact_cached_free_pfn = end_pfn;
	zone->compact_blockskip_expire = jiffies + (HZ * 5);
 
/* Walk the zone and mark every pageblock as suitable for isolation */
@@ -99,13 +102,29 @@ static void reset_isolation_suitable(struct zone *zone)
  * If no pages were isolated then mark this pageblock to be skipped in the
  * future. The information is later cleared by reset_isolation_suitable().
  */
-static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+static void update_pageblock_skip(struct compact_control *cc,
+   struct page *page, unsigned long nr_isolated,
+   bool migrate_scanner)
 {
+   struct zone *zone = cc->zone;
if (!page)
return;
 
-   if (!nr_isolated)
+   if (!nr_isolated) {
+   unsigned long pfn = page_to_pfn(page);
set_pageblock_skip(page);
+
+   /* Update where compaction should restart */
+   if (migrate_scanner) {
+   if (!cc->finished_update_migrate &&
+   pfn > zone->compact_cached_migrate_pfn)
+   zone->compact_cached_migrate_pfn = pfn;
+   } else {
+   if (!cc->finished_update_free &&
+   pfn < zone->compact_cached_free_pfn)
+   zone->compact_cached_free_pfn = pfn;
+   }
+   }
 }
 
 static inline bool should_release_lock(spinlock_t *lock)
@@ -257,7 +276,7 @@ out:
 
/* Update the pageblock-skip if the whole pageblock was scanned */
if (blockpfn == end_pfn)
-   update_pageblock_skip(valid_page, total_isolated);
+   update_pageblock_skip(cc, valid_page, total_isolated, false);
 
return total_isolated;
 }
@@ -472,6 +491,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 */
	if (!cc->sync && last_pageblock_nr != pageblock_nr &&
	!migrate_async_suitable(get_pageblock_migratetype(page))) {
+   cc->finished_update_migrate = true;
goto next_pageblock;
}
 
@@ -520,6 +540,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
VM_BUG_ON(PageTransCompound(page));
 
/* Successfully isolated */
+   cc->finished_update_migrate = true;
	del_page_from_lru_list(page, lruvec, page_lru(page));
	list_add(&page->lru, 

[PATCH 5/6] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-20 Thread Mel Gorman
When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are a serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock
is marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is when to ignore the cached
information? If it's ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still allows
excessive scanning when THP allocations are failing in quick succession
due to memory pressure. Waiting until memory pressure is relieved would
cause compaction to continually fail instead of using reclaim/compaction
to try allocate the page. The time-based mechanism is clumsy but a better
option is not obvious.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 include/linux/mmzone.h  |3 ++
 include/linux/pageblock-flags.h |   19 +++-
 mm/compaction.c |   93 +--
 mm/internal.h   |1 +
 mm/page_alloc.c |1 +
 5 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 603d0b5..a456361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,6 +368,9 @@ struct zone {
 */
spinlock_t  lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+   unsigned long   compact_blockskip_expire;
+#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t   span_seqlock;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 19ef95d..eed27f4 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,6 +30,9 @@ enum pageblock_bits {
PB_migrate,
PB_migrate_end = PB_migrate + 3 - 1,
/* 3 bits required for migrate types */
+#ifdef CONFIG_COMPACTION
+   PB_migrate_skip,/* If set the block is skipped by compaction */
+#endif /* CONFIG_COMPACTION */
NR_PAGEBLOCK_BITS
 };
 
@@ -65,10 +68,22 @@ unsigned long get_pageblock_flags_group(struct page *page,
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
int start_bitidx, int end_bitidx);
 
+#ifdef CONFIG_COMPACTION
+#define get_pageblock_skip(page) \
+   get_pageblock_flags_group(page, PB_migrate_skip, \
+   PB_migrate_skip + 1)
+#define clear_pageblock_skip(page) \
+   set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
+   PB_migrate_skip + 1)
+#define set_pageblock_skip(page) \
+   set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
+   PB_migrate_skip + 1)
+#endif /* CONFIG_COMPACTION */
+
 #define get_pageblock_flags(page) \
-   get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+   get_pageblock_flags_group(page, 0, PB_migrate_end)
 #define set_pageblock_flags(page, flags) \
set_pageblock_flags_group(page, flags,  \
- 0, NR_PAGEBLOCK_BITS-1)
+ 0, PB_migrate_end)
 
 #endif /* PAGEBLOCK_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 6058822..fae0011 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,64 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool isolation_suitable(struct compact_control 
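
(The hunk is cut short here. As a sketch only, the check presumably builds on
the get_pageblock_skip() helper added above; the ignore-hint field standing in
for the "CMA always ignores the information" rule is a guess:)

	static inline bool isolation_suitable(struct compact_control *cc,
						struct page *page)
	{
		/* CMA ignores the cached information, per the changelog */
		if (cc->ignore_skip_hint)
			return true;

		return !get_pageblock_skip(page);
	}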

[PATCH 4/6] Revert "mm: have order > 0 compaction start off where it left"

2012-09-20 Thread Mel Gorman
This reverts commit 7db8889a (mm: have order > 0 compaction start off
where it left) and commit de74f1cc (mm: have order > 0 compaction start
near a pageblock with free pages). These patches were a good idea and
tests confirmed that they massively reduced the amount of scanning but
the implementation is complex and tricky to understand. A later patch
will cache what pageblocks should be skipped and reimplements the
concept of compact_cached_free_pfn on top for both migration and
free scanners.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 include/linux/mmzone.h |4 ---
 mm/compaction.c|   65 
 mm/internal.h  |6 -
 mm/page_alloc.c|5 
 4 files changed, 5 insertions(+), 75 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..603d0b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,10 +368,6 @@ struct zone {
 */
spinlock_t  lock;
int all_unreclaimable; /* All pages pinned */
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-   /* pfn where the last incremental compaction isolated free pages */
-   unsigned long   compact_cached_free_pfn;
-#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t   span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 70c7cbd..6058822 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -481,20 +481,6 @@ next_pageblock:
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
 /*
- * Returns the start pfn of the last page block in a zone.  This is the starting
- * point for full compaction of a zone.  Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-	unsigned long free_pfn;
-	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	free_pfn &= ~(pageblock_nr_pages-1);
-	return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -562,19 +548,8 @@ static void isolate_freepages(struct zone *zone,
 * looking for free pages, the search will restart here as
 * page migration may have returned some pages to the allocator
 */
-	if (isolated) {
+	if (isolated)
 		high_pfn = max(high_pfn, pfn);
-
-		/*
-		 * If the free scanner has wrapped, update
-		 * compact_cached_free_pfn to point to the highest
-		 * pageblock with free pages. This reduces excessive
-		 * scanning of full pageblocks near the end of the
-		 * zone
-		 */
-		if (cc->order > 0 && cc->wrapped)
-			zone->compact_cached_free_pfn = high_pfn;
-	}
}
 
/* split_free_page does not map the pages */
@@ -582,11 +557,6 @@ static void isolate_freepages(struct zone *zone,
 
 	cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
-
-	/* If compact_cached_free_pfn is reset then set it now */
-	if (cc->order > 0 && !cc->wrapped &&
-	    zone->compact_cached_free_pfn == start_free_pfn(zone))
-		zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -682,26 +652,8 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;
 
-	/*
-	 * A full (order == -1) compaction run starts at the beginning and
-	 * end of a zone; it completes when the migrate and free scanner meet.
-	 * A partial (order > 0) compaction can start with the free scanner
-	 * at a random point in the zone, and may have to restart.
-	 */
-	if (cc->free_pfn <= cc->migrate_pfn) {
-		if (cc->order > 0 && !cc->wrapped) {
-			/* We started partway through; restart at the end. */
-			unsigned long free_pfn = start_free_pfn(zone);
-			zone->compact_cached_free_pfn = free_pfn;
-			cc->free_pfn = free_pfn;
-			cc->wrapped = 1;
-			return COMPACT_CONTINUE;
-		}
-		return COMPACT_COMPLETE;
-	}
-
-	/* We wrapped around and ended up where we started. */
-	if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
/*
@@ -799,15 +751,8 @@ static int compact_zone(struct zone *zone, struct 

[PATCH 1/6] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-20 Thread Mel Gorman
From: Shaohua Li s...@fusionio.com

Changelog since V2
o Fix BUG_ON triggered due to pages left on cc.migratepages
o Make compact_zone_order() require non-NULL arg `contended'

Changelog since V1
o only abort the compaction if lock is contended or run too long
o Rearranged the code by Andrea Arcangeli.

isolate_migratepages_range() might isolate no pages if, for example,
zone->lru_lock is contended and compaction is running asynchronously. In this
case we should abort compaction; otherwise compact_zone will run a
useless loop and make zone->lru_lock even more contended.

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Shaohua Li s...@fusionio.com
Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |   17 -
 mm/internal.h   |2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7fcd3a5..a8de20d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 	/* async aborts if taking too long or contended */
 	if (!cc->sync) {
-		if (cc->contended)
-			*cc->contended = true;
+		cc->contended = true;
 		return false;
 	}
 
@@ -634,7 +633,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	/* Perform the isolation */
 	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn)
+	if (!low_pfn || cc->contended)
 		return ISOLATE_ABORT;
 
 	cc->migrate_pfn = low_pfn;
@@ -787,6 +786,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		switch (isolate_migratepages(zone, cc)) {
 		case ISOLATE_ABORT:
 			ret = COMPACT_PARTIAL;
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:
continue;
@@ -831,6 +832,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 int order, gfp_t gfp_mask,
 bool sync, bool *contended)
 {
+   unsigned long ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -838,12 +840,17 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
-   .contended = contended,
};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
-	return compact_zone(zone, &cc);
+	ret = compact_zone(zone, &cc);
+
+	VM_BUG_ON(!list_empty(&cc.freepages));
+	VM_BUG_ON(!list_empty(&cc.migratepages));
+
+   *contended = cc.contended;
+   return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index b8c91b3..4bd7c0e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,7 +130,7 @@ struct compact_control {
int order;  /* order a direct compactor needs */
int migratetype;/* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
-   bool *contended;/* True if a lock was contended */
+   bool contended; /* True if a lock was contended */
 };
 
 unsigned long
-- 
1.7.9.2



[PATCH 3/6] mm: compaction: Acquire the zone->lock as late as possible

2012-09-20 Thread Mel Gorman
Compaction's free scanner acquires the zone->lock when checking for PageBuddy
pages and isolating them. It does this even if there are no PageBuddy pages
in the range.

This patch defers acquiring the zone lock for as long as possible. In the
event there are no free pages in the pageblock then the lock will not be
acquired at all, which reduces contention on zone->lock.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |  141 +--
 1 file changed, 75 insertions(+), 66 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 6450c3e..70c7cbd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -89,10 +89,26 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
return true;
 }
 
-static inline bool compact_trylock_irqsave(spinlock_t *lock,
-   unsigned long *flags, struct compact_control *cc)
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
 {
-   return compact_checklock_irqsave(lock, flags, false, cc);
+
+   int migratetype = get_pageblock_migratetype(page);
+
+   /* Don't interfere with memory hot-remove or the min_free_kbytes blocks 
*/
+   if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+   return false;
+
+   /* If the page is a large free page, then allow migration */
+   if (PageBuddy(page) && page_order(page) >= pageblock_order)
+   return true;
+
+   /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+   if (migrate_async_suitable(migratetype))
+   return true;
+
+   /* Otherwise skip the block */
+   return false;
 }
 
 /*
@@ -101,13 +117,16 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
  * pages inside of the pageblock (even though it may still end up isolating
  * some pages).
  */
-static unsigned long isolate_freepages_block(unsigned long blockpfn,
+static unsigned long isolate_freepages_block(struct compact_control *cc,
+   unsigned long blockpfn,
unsigned long end_pfn,
struct list_head *freelist,
bool strict)
 {
int nr_scanned = 0, total_isolated = 0;
struct page *cursor;
+   unsigned long flags;
+   bool locked = false;
 
cursor = pfn_to_page(blockpfn);
 
@@ -116,23 +135,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
int isolated, i;
struct page *page = cursor;
 
-   if (!pfn_valid_within(blockpfn)) {
-   if (strict)
-   return 0;
-   continue;
-   }
+   if (!pfn_valid_within(blockpfn))
+   goto strict_check;
nr_scanned++;
 
-   if (!PageBuddy(page)) {
-   if (strict)
-   return 0;
-   continue;
-   }
+   if (!PageBuddy(page))
+   goto strict_check;
+
+		/*
+		 * The zone lock must be held to isolate freepages.
+		 * Unfortunately this is a very coarse lock and can be
+		 * heavily contended if there are parallel allocations
+		 * or parallel compactions. For async compaction, do not
+		 * spin on the lock; the lock is acquired as late as
+		 * possible.
+		 */
+		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
+							locked, cc);
+   if (!locked)
+   break;
+
+   /* Recheck this is a suitable migration target under lock */
+   if (!strict  !suitable_migration_target(page))
+   break;
+
+   /* Recheck this is a buddy page under lock */
+   if (!PageBuddy(page))
+   goto strict_check;
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
 		if (!isolated && strict)
-			return 0;
+			goto strict_check;
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -144,9 +178,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+
+   continue;
+
+strict_check:
+   /* Abort isolation if the caller requested strict isolation */
+   if (strict) {
+   total_isolated = 0;
+

[PATCH 2/6] mm: compaction: Acquire the zone->lru_lock as late as possible

2012-09-20 Thread Mel Gorman
Compaction's migrate scanner acquires the zone->lru_lock when scanning a range
of pages looking for LRU pages to acquire. It does this even if there are
no LRU pages in the range. If multiple processes are compacting then this
can cause severe locking contention. To make matters worse commit b2eef8c0
(mm: compaction: minimise the time IRQs are disabled while isolating pages
for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
the lock is contended. This reduces the number of times it unnecessarily
disables and re-enables IRQs. The second is that it defers acquiring the
LRU lock for as long as possible. If there are no LRU pages or the only
LRU pages are transhuge then the LRU lock will not be acquired at all
which reduces contention on zone->lru_lock.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 mm/compaction.c |   63 +--
 1 file changed, 43 insertions(+), 20 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a8de20d..6450c3e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+static inline bool should_release_lock(spinlock_t *lock)
+{
+   return need_resched() || spin_is_contended(lock);
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. Check if the process needs to be scheduled or
@@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype)
 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
  bool locked, struct compact_control *cc)
 {
-   if (need_resched() || spin_is_contended(lock)) {
+   if (should_release_lock(lock)) {
if (locked) {
spin_unlock_irqrestore(lock, *flags);
locked = false;
@@ -275,7 +280,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
isolate_mode_t mode = 0;
struct lruvec *lruvec;
unsigned long flags;
-   bool locked;
+   bool locked = false;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -295,23 +300,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
/* Time to isolate some pages for migration */
cond_resched();
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	locked = true;
 	for (; low_pfn < end_pfn; low_pfn++) {
struct page *page;
 
/* give a chance to irqs before checking need_resched() */
-		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
-			locked = false;
+		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
+			if (should_release_lock(&zone->lru_lock)) {
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				locked = false;
+			}
}
 
-		/* Check if it is ok to still hold the lock */
-		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
-							locked, cc);
-		if (!locked)
-			break;
-
/*
 * migrate_pfn does not necessarily start aligned to a
 * pageblock. Ensure that pfn_valid is called when moving
@@ -351,21 +350,38 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
!migrate_async_suitable(get_pageblock_migratetype(page))) {
-   low_pfn += pageblock_nr_pages;
-   low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
-   last_pageblock_nr = pageblock_nr;
-   continue;
+   goto next_pageblock;
}
 
+   /* Check may be lockless but that's ok as we recheck later */
if (!PageLRU(page))
continue;
 
/*
-* PageLRU is set, and lru_lock excludes isolation,
-* splitting and collapsing (collapsing has already
-* happened if PageLRU is set).
+* PageLRU is set. lru_lock normally excludes isolation
+* splitting and collapsing (collapsing has already happened
+* if PageLRU is set) but the lock is not necessarily taken
+* here and it is wasteful to 

Re: [PATCH v2 3/5] s390: Add new channel I/O based virtio transport.

2012-09-20 Thread Anthony Liguori
Cornelia Huck cornelia.h...@de.ibm.com writes:

 Add a new virtio transport that uses channel commands to perform
 virtio operations.

 Add a new machine type s390-ccw that uses this virtio-ccw transport
 and make it the default machine for s390.

 Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
 ---

 Changes v1-v2:
 - update to virtio-ccw interface changes

 ---
  hw/qdev-monitor.c  |   5 +
  hw/s390-virtio.c   | 277 
  hw/s390x/Makefile.objs |   1 +
  hw/s390x/css.c |  45 +++
  hw/s390x/css.h |   3 +
  hw/s390x/virtio-ccw.c  | 875 
 +
  hw/s390x/virtio-ccw.h  |  79 +
  vl.c   |   1 +
  8 files changed, 1215 insertions(+), 71 deletions(-)
  create mode 100644 hw/s390x/virtio-ccw.c
  create mode 100644 hw/s390x/virtio-ccw.h

 diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c
 index 33b7f79..92b7c59 100644
 --- a/hw/qdev-monitor.c
 +++ b/hw/qdev-monitor.c
 @@ -42,6 +42,11 @@ static const QDevAlias qdev_alias_table[] = {
  { "virtio-blk-s390", "virtio-blk", QEMU_ARCH_S390X },
  { "virtio-net-s390", "virtio-net", QEMU_ARCH_S390X },
  { "virtio-serial-s390", "virtio-serial", QEMU_ARCH_S390X },
 +{ "virtio-blk-ccw", "virtio-blk", QEMU_ARCH_S390X },
 +{ "virtio-net-ccw", "virtio-net", QEMU_ARCH_S390X },
 +{ "virtio-serial-ccw", "virtio-serial", QEMU_ARCH_S390X },
 +{ "virtio-balloon-ccw", "virtio-balloon", QEMU_ARCH_S390X },
 +{ "virtio-scsi-ccw", "virtio-scsi", QEMU_ARCH_S390X },
  { "lsi53c895a", "lsi" },
  { "ich9-ahci", "ahci" },

Please don't add aliases.  That's just an ugly hack to maintain compatibility.

  { }
 diff --git a/hw/s390-virtio.c b/hw/s390-virtio.c
 index 47eed35..2509291 100644
 --- a/hw/s390-virtio.c
 +++ b/hw/s390-virtio.c
 @@ -30,8 +30,11 @@
  #include "hw/sysbus.h"
  #include "kvm.h"
  #include "exec-memory.h"
 +#include "qemu-thread.h"
  
  #include "hw/s390-virtio-bus.h"
 +#include "hw/s390x/css.h"
 +#include "hw/s390x/virtio-ccw.h"
  
  //#define DEBUG_S390
  
 @@ -46,6 +49,7 @@
  #define KVM_S390_VIRTIO_NOTIFY  0
  #define KVM_S390_VIRTIO_RESET   1
  #define KVM_S390_VIRTIO_SET_STATUS  2
 +#define KVM_S390_VIRTIO_CCW_NOTIFY  3
  
  #define KERN_IMAGE_START        0x010000UL
  #define KERN_PARM_AREA  0x010480UL
 @@ -62,6 +66,7 @@
  
  static VirtIOS390Bus *s390_bus;
  static S390CPU **ipi_states;
 +VirtioCcwBus *ccw_bus;
  
  S390CPU *s390_cpu_addr2state(uint16_t cpu_addr)
  {
 @@ -75,15 +80,21 @@ S390CPU *s390_cpu_addr2state(uint16_t cpu_addr)
  int s390_virtio_hypercall(CPUS390XState *env, uint64_t mem, uint64_t 
 hypercall)
  {
  int r = 0, i;
 +int cssid, ssid, schid, m;
 +SubchDev *sch;
  
  dprintf("KVM hypercall: %ld\n", hypercall);
  switch (hypercall) {
  case KVM_S390_VIRTIO_NOTIFY:
  if (mem < ram_size) {
 -VirtIOS390Device *dev = s390_virtio_bus_find_vring(s390_bus,
 -   mem, &i);
 -if (dev) {
 -virtio_queue_notify(dev->vdev, i);
 +if (s390_bus) {
 +VirtIOS390Device *dev = s390_virtio_bus_find_vring(s390_bus,
 +   mem, &i);
 +if (dev) {
 +virtio_queue_notify(dev->vdev, i);
 +} else {
 +r = -EINVAL;
 +}
  } else {
  r = -EINVAL;
  }
 @@ -92,28 +103,49 @@ int s390_virtio_hypercall(CPUS390XState *env, uint64_t mem, uint64_t hypercall)
  }
  break;
  case KVM_S390_VIRTIO_RESET:
 -{
 -VirtIOS390Device *dev;
 -
 -dev = s390_virtio_bus_find_mem(s390_bus, mem);
 -virtio_reset(dev->vdev);
 -stb_phys(dev->dev_offs + VIRTIO_DEV_OFFS_STATUS, 0);
 -s390_virtio_device_sync(dev);
 -s390_virtio_reset_idx(dev);
 +if (s390_bus) {
 +VirtIOS390Device *dev;
 +
 +dev = s390_virtio_bus_find_mem(s390_bus, mem);
 +virtio_reset(dev->vdev);
 +stb_phys(dev->dev_offs + VIRTIO_DEV_OFFS_STATUS, 0);
 +s390_virtio_device_sync(dev);
 +s390_virtio_reset_idx(dev);
 +} else {
 +r = -EINVAL;
 +}
  break;
 -}
  case KVM_S390_VIRTIO_SET_STATUS:
 -{
 -VirtIOS390Device *dev;
 +if (s390_bus) {
 +VirtIOS390Device *dev;
  
 -dev = s390_virtio_bus_find_mem(s390_bus, mem);
 -if (dev) {
 -s390_virtio_device_update_status(dev);
 +dev = s390_virtio_bus_find_mem(s390_bus, mem);
 +if (dev) {
 +s390_virtio_device_update_status(dev);
 +} else {
 +r = -EINVAL;
 +}
  } else {
  r = -EINVAL;
  }
  break;
 -}
 +case KVM_S390_VIRTIO_CCW_NOTIFY:
 +if (ccw_bus) {
 

Re: [PATCH v2 5/5] [HACK] Handle multiple virtio aliases.

2012-09-20 Thread Anthony Liguori
Cornelia Huck cornelia.h...@de.ibm.com writes:

 This patch enables using both virtio-xxx-s390 and virtio-xxx-ccw
 by making the alias lookup code verify that a driver is actually
 registered.

 (Only included in order to allow testing of virtio-ccw; should be
 replaced by cleaning up the virtio bus model.)

 Not-signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com

No more aliases.  Just drop the whole thing.

Regards,

Anthony Liguori

 ---
  blockdev.c|  6 +---
  hw/qdev-monitor.c | 85 
 +--
  vl.c  |  6 +---
  3 files changed, 53 insertions(+), 44 deletions(-)

 diff --git a/blockdev.c b/blockdev.c
 index 7c83baa..a7c39b6 100644
 --- a/blockdev.c
 +++ b/blockdev.c
 @@ -560,11 +560,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
  case IF_VIRTIO:
  /* add virtio block device */
  opts = qemu_opts_create(qemu_find_opts("device"), NULL, 0, NULL);
 -if (arch_type == QEMU_ARCH_S390X) {
 -qemu_opt_set(opts, "driver", "virtio-blk-s390");
 -} else {
 -qemu_opt_set(opts, "driver", "virtio-blk-pci");
 -}
 +qemu_opt_set(opts, "driver", "virtio-blk");
  qemu_opt_set(opts, "drive", dinfo->id);
  if (devaddr)
  qemu_opt_set(opts, "addr", devaddr);
 diff --git a/hw/qdev-monitor.c b/hw/qdev-monitor.c
 index 92b7c59..9245a1e 100644
 --- a/hw/qdev-monitor.c
 +++ b/hw/qdev-monitor.c
 @@ -118,9 +118,53 @@ static int set_property(const char *name, const char *value, void *opaque)
  return 0;
  }
  
 -static const char *find_typename_by_alias(const char *alias)
 +static BusState *qbus_find_recursive(BusState *bus, const char *name,
 + const char *bus_typename)
 +{
 +BusChild *kid;
 +BusState *child, *ret;
 +int match = 1;
 +
 +if (name && (strcmp(bus->name, name) != 0)) {
 +match = 0;
 +}
 +if (bus_typename &&
 +(strcmp(object_get_typename(OBJECT(bus)), bus_typename) != 0)) {
 +match = 0;
 +}
 +if (match) {
 +return bus;
 +}
 +
 +QTAILQ_FOREACH(kid, &bus->children, sibling) {
 +DeviceState *dev = kid->child;
 +QLIST_FOREACH(child, &dev->child_bus, sibling) {
 +ret = qbus_find_recursive(child, name, bus_typename);
 +if (ret) {
 +return ret;
 +}
 +}
 +}
 +return NULL;
 +}
 +
 +static bool qdev_verify_bus(DeviceClass *dc)
 +{
 +BusState *bus;
 +
 +if (dc) {
 +bus = qbus_find_recursive(sysbus_get_default(), NULL, dc->bus_type);
 +if (bus) {
 +return true;
 +}
 +}
 +return false;
 +}
 +
 +static const char *find_typename_by_alias(const char *alias, bool check_bus)
  {
  int i;
 +ObjectClass *oc;
  
  for (i = 0; qdev_alias_table[i].alias; i++) {
   if (qdev_alias_table[i].arch_mask &&
 @@ -129,7 +173,10 @@ static const char *find_typename_by_alias(const char *alias)
  }
  
  if (strcmp(qdev_alias_table[i].alias, alias) == 0) {
 -return qdev_alias_table[i].typename;
 +oc = object_class_by_name(qdev_alias_table[i].typename);
 +if (oc && (!check_bus || qdev_verify_bus(DEVICE_CLASS(oc)))) {
 +return qdev_alias_table[i].typename;
 +}
  }
  }
  
 @@ -155,7 +202,7 @@ int qdev_device_help(QemuOpts *opts)
  
  klass = object_class_by_name(driver);
  if (!klass) {
 -const char *typename = find_typename_by_alias(driver);
 +const char *typename = find_typename_by_alias(driver, false);
  
  if (typename) {
  driver = typename;
 @@ -283,36 +330,6 @@ static DeviceState *qbus_find_dev(BusState *bus, char *elem)
  return NULL;
  }
  
 -static BusState *qbus_find_recursive(BusState *bus, const char *name,
 - const char *bus_typename)
 -{
 -BusChild *kid;
 -BusState *child, *ret;
 -int match = 1;
 -
 -if (name && (strcmp(bus->name, name) != 0)) {
 -match = 0;
 -}
 -if (bus_typename &&
 -(strcmp(object_get_typename(OBJECT(bus)), bus_typename) != 0)) {
 -match = 0;
 -}
 -if (match) {
 -return bus;
 -}
 -
 -QTAILQ_FOREACH(kid, &bus->children, sibling) {
 -DeviceState *dev = kid->child;
 -QLIST_FOREACH(child, &dev->child_bus, sibling) {
 -ret = qbus_find_recursive(child, name, bus_typename);
 -if (ret) {
 -return ret;
 -}
 -}
 -}
 -return NULL;
 -}
 -
  static BusState *qbus_find(const char *path)
  {
  DeviceState *dev;
 @@ -417,7 +434,7 @@ DeviceState *qdev_device_add(QemuOpts *opts)
  /* find driver */
  obj = object_class_by_name(driver);
  if (!obj) {
 -const char *typename = find_typename_by_alias(driver);
 +const char *typename = find_typename_by_alias(driver, true);
  
 

Re: [PATCH v2 3/5] s390: Add new channel I/O based virtio transport.

2012-09-20 Thread Alexander Graf


On 20.09.2012, at 16:24, Anthony Liguori aligu...@us.ibm.com wrote:

 [...]

Re: [Qemu-devel] [PATCH v3 05/17] target-i386: Add x86_set_hyperv.

2012-09-20 Thread Eduardo Habkost
On Wed, Sep 19, 2012 at 05:26:01PM -0400, Don Slutz wrote:
 On 09/19/12 15:32, Eduardo Habkost wrote:
 On Mon, Sep 17, 2012 at 10:00:55AM -0400, Don Slutz wrote:
 This is used to set the cpu object's hypervisor level to the default for 
 Microsoft's Hypervisor.
 
 Signed-off-by: Don Slutz d...@cloudswitch.com
 ---
   target-i386/cpu.c |   10 ++
   1 files changed, 10 insertions(+), 0 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 0e4a18d..4120393 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -1192,6 +1192,13 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
   }
 
   #if !defined(CONFIG_USER_ONLY)
 +static void x86_set_hyperv(Object *obj, Error **errp)
 +{
 +X86CPU *cpu = X86_CPU(obj);
 +
 +cpu->env.cpuid_hv_level = HYPERV_CPUID_MIN;
 HYPERV_CPUID_MIN is defined in linux-headers/asm-x86/hyperv.h, which is
 included only if the build host is linux-x86 and CONFIG_KVM is set.
 
 Will fix.  I see 3 options:
 
 1) Use the numbers like 0x40000005

If we're going to use the number directly, it's better to have a define
for it (so #2 is better).

 
 2) Use QEMU defines like:
 #define CPUID_HV_LEVEL_HYPERV  0x40000005
 
 3) Use condtional define:
  #ifndef HYPERV_CPUID_MIN
  #define CPUID_HV_LEVEL_HYPERV  0x4005
  #else
  #define CPUID_HV_LEVEL_HYPERV  HYPERV_CPUID_MIN
  #endif
 

If QEMU is going to contain a #define CPUID_HV_LEVEL_HYPERV 0x40000005
in the code, I don't see a reason to try to use the definition from
asm-x86/hyperv.h if available.

So, #2 looks like the best option.

 I lean to #2 or #3, and both over #1.  This is because if in the
 future if HYPERV_CPUID_MIN ever changes then a patch to QEMU needs to
 be done and considered before that change can be seen by a guest.
 Note: since hypervisor-level=0x40000006 can be specified, this might
 be enough to do some testing.

I don't think HYPERV_CPUID_MIN will ever change, but the meaning of the
constant is not clear to me. Anyway, if it ever changes, that's another
reason for QEMU to have its own constant. The resulting CPUID bits
exposed to the guest should be a function of the machine-type and
command-line/config parameters, and nothing else (otherwise the CPUID
bits would change under the guest's feet when live-migrating).

-- 
Eduardo


qemu.git/qemu-kvm.git bugs during migrate + reboot

2012-09-20 Thread Lucas Meneghel Rodrigues

Hi guys,

We're seeing the following problem during upstream testing:

qemu: VQ 0 size 0x80 Guest index 0x2d6
inconsistent with Host index 0x18: delta 0x2be
qemu: warning:  error while loading state for
instance 0x0 of device '0000:00:04.0/virtio-blk'
load of migration failed

This is happening consistently with qemu and qemu-kvm. The test case is
simple: while the VM goes through a reboot loop, a parallel ping-pong
migration loop runs.


I'm happy to provide more details and logs.

Lucas


Re: [KVM] Guest Debugging Facilities in KVM

2012-09-20 Thread Dean Pucsek

On 2012-09-19, at 7:45 AM, Jan Kiszka jan.kis...@siemens.com wrote:

 On 2012-09-19 16:38, Avi Kivity wrote:
 On 09/17/2012 10:36 PM, Dean Pucsek wrote:
 Hello,
 
 For my Master's thesis I am investigating the usage of Intel VT-x and branch 
 tracing in the domain of malware analysis.  Essentially what I'm aiming to 
 do is trace the execution of a guest VM and then pass that trace on to some 
 other tools.  I've been playing with KVM for a couple of weeks now but from 
 comments such as (in arch/x86/kvm/vmx.c): 
 
   /*
* Forward all other exceptions that are valid in real mode.
* FIXME: Breaks guest debugging in real mode, needs to be fixed with
*the required debugging infrastructure rework.
*/
 
 And (from an email sent to the list in July 2008): 
 
Note that guest debugging in real mode is broken now. This has to be
fixed by the scheduled debugging infrastructure rework (will be done
once base patches for QEMU have been accepted).
 
 it is unclear to me how much support there is for guest debugging in KVM 
 currently (I wasn't able to find any recent documentation on it) and what 
 the debugging infrastructure referred to by these comments is.  I am 
 interested in becoming involved with the KVM project in this respect 
 however some guidance and direction on the guest debugging facilities would 
 be greatly appreciated.
 
 Guest debugging works (but not in real mode due to the issue above).
 
 That doesn't apply to CPUs with Unrestricted Guest support, right? At
 least I didn't notice any limitations recently. [I did notice some other
 corner-case issue with guest debugging, still need to dig into that...]
 
 You can set hardware and software breakpoints and kvm will forward them
 to userspace, and from there to the debugger.  I'll be happy to help, as
 I'm sure Jan (as the author of most of the guest debugging code) will as
 well.
 

Is there a roadmap or plan for how the KVM project envisions the debugging 
facilities evolving? 

 
 This may help as a starter:
 
 http://chemnitzer.linux-tage.de/2012/vortraege/folien/1061-VirtualDebugging.pdf
 

That is a huge help, thanks!

 Jan
 
 -- 
 Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
 Corporate Competence Center Embedded Linux



pci-assign terminates the guest upon pread() / pwrite() error?

2012-09-20 Thread Etienne Martineau
In hw/kvm/pci-assign.c a pread() error in assigned_dev_pci_read()
results in a hw_error(). Similarly, a pwrite() error in
assigned_dev_pci_write() also results in a hw_error().


Would there be a way to avoid terminating the guest for those cases? How 
about we deassign the device upon error?


thanks,
Etienne


Re: [RFC v2 PATCH 04/21] x86: Avoid RCU warnings on slave CPUs

2012-09-20 Thread Paul E. McKenney
On Thu, Sep 06, 2012 at 08:27:40PM +0900, Tomoki Sekiyama wrote:
 Initialize RCU-related variables to avoid warnings about RCU usage while
 slave CPUs are running specified functions. Also notify the RCU subsystem
 before the slave CPU enters the idle state.

Hello, Tomoki,

A few questions and comments interspersed below.

Thanx, Paul

 Signed-off-by: Tomoki Sekiyama tomoki.sekiyama...@hitachi.com
 Cc: Avi Kivity a...@redhat.com
 Cc: Marcelo Tosatti mtosa...@redhat.com
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 ---
 
  arch/x86/kernel/smpboot.c |4 
  kernel/rcutree.c  |   14 ++
  2 files changed, 18 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
 index e8cfe377..45dfc1d 100644
 --- a/arch/x86/kernel/smpboot.c
 +++ b/arch/x86/kernel/smpboot.c
 @@ -382,6 +382,8 @@ notrace static void __cpuinit start_slave_cpu(void 
 *unused)
   f = per_cpu(slave_cpu_func, cpu);
   per_cpu(slave_cpu_func, cpu).func = NULL;
 
 + rcu_note_context_switch(cpu);
 +

Why not use rcu_idle_enter() and rcu_idle_exit()?  These would tell
RCU to ignore the slave CPU for the duration of its idle period.
The way you have it, if a slave CPU stayed idle for too long, you
would get RCU CPU stall warnings, and possibly system hangs as well.

Or is this being called from some task that is not the idle task?
If so, you instead want the new rcu_user_enter() and rcu_user_exit()
that are hopefully on their way into 3.7.  Or maybe better, use a real
idle task, so that idle_task(smp_processor_id()) returns true and RCU
stops complaining.  ;-)

Note that CPUs that RCU believes to be idle are not permitted to contain
RCU read-side critical sections, which in turn means no entering the
scheduler, no sleeping, and so on.  There is an RCU_NONIDLE() macro
to tell RCU to pay attention to the CPU only for the duration of the
statement passed to RCU_NONIDLE, and there are also an _rcuidle variant
of the tracing statement to allow tracing from idle.
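
As a rough sketch of how the slave idle loop could honour that (the loop shape
and slave_cpu_func usage mirror the hunk quoted above; the tracepoint name and
the argument field are purely illustrative):

	for (;;) {
		f = per_cpu(slave_cpu_func, cpu);
		per_cpu(slave_cpu_func, cpu).func = NULL;

		if (!f.func) {
			rcu_idle_enter();	/* RCU may now ignore this CPU */
			native_safe_halt();
			rcu_idle_exit();	/* back under RCU's watch */
			continue;
		}

		/* An RCU-using statement inside an idle section would need: */
		RCU_NONIDLE(trace_slave_cpu_wakeup_rcuidle(cpu));

		f.func(f.data);		/* the argument field is assumed */
	}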

   if (!f.func) {
   native_safe_halt();
   continue;
 @@ -1005,6 +1007,8 @@ int __cpuinit slave_cpu_up(unsigned int cpu)
   if (IS_ERR(idle))
   return PTR_ERR(idle);
 
 + slave_cpu_notify(CPU_SLAVE_UP_PREPARE, cpu);
 +
   ret = __native_cpu_up(cpu, idle, 1);
 
   cpu_maps_update_done();
 diff --git a/kernel/rcutree.c b/kernel/rcutree.c
 index f280e54..31a7c8c 100644
 --- a/kernel/rcutree.c
 +++ b/kernel/rcutree.c
 @@ -2589,6 +2589,9 @@ static int __cpuinit rcu_cpu_notify(struct 
 notifier_block *self,
   switch (action) {
   case CPU_UP_PREPARE:
   case CPU_UP_PREPARE_FROZEN:
 +#ifdef CONFIG_SLAVE_CPU
 + case CPU_SLAVE_UP_PREPARE:
 +#endif

Why do you need #ifdef here?  Why not define CPU_SLAVE_UP_PREPARE
unconditionally?  Then if CONFIG_SLAVE_CPU=n, rcu_cpu_notify() would
never be invoked with CPU_SLAVE_UP_PREPARE, so no problems.
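
A sketch of that alternative, assuming the actions live in a shared header and
that these numeric values are free (both assumptions):

	/*
	 * Define the slave actions unconditionally; with CONFIG_SLAVE_CPU=n
	 * they are simply never passed to rcu_cpu_notify(), so the extra
	 * case labels are harmless and the #ifdef/#endif pairs go away.
	 */
	#define CPU_SLAVE_UP_PREPARE	0x100b
	#define CPU_SLAVE_UP		0x100c
	#define CPU_SLAVE_DYING		0x100d
	#define CPU_SLAVE_DEAD		0x100e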

   rcu_prepare_cpu(cpu);
   rcu_prepare_kthreads(cpu);
   break;
 @@ -2603,6 +2606,9 @@ static int __cpuinit rcu_cpu_notify(struct 
 notifier_block *self,
   break;
   case CPU_DYING:
   case CPU_DYING_FROZEN:
 +#ifdef CONFIG_SLAVE_CPU
 + case CPU_SLAVE_DYING:
 +#endif

Same here.

   /*
* The whole machine is stopped except this CPU, so we can
* touch any data without introducing corruption. We send the
 @@ -2616,6 +2622,9 @@ static int __cpuinit rcu_cpu_notify(struct 
 notifier_block *self,
   case CPU_DEAD_FROZEN:
   case CPU_UP_CANCELED:
   case CPU_UP_CANCELED_FROZEN:
 +#ifdef CONFIG_SLAVE_CPU
 + case CPU_SLAVE_DEAD:
 +#endif

And here.

   for_each_rcu_flavor(rsp)
   rcu_cleanup_dead_cpu(cpu, rsp);
   break;
 @@ -2797,6 +2806,10 @@ static void __init rcu_init_geometry(void)
   rcu_num_nodes -= n;
  }
 
 +static struct notifier_block __cpuinitdata rcu_slave_nb = {
 + .notifier_call = rcu_cpu_notify,
 +};
 +
  void __init rcu_init(void)
  {
   int cpu;
 @@ -2814,6 +2827,7 @@ void __init rcu_init(void)
* or the scheduler are operational.
*/
   cpu_notifier(rcu_cpu_notify, 0);
 + register_slave_cpu_notifier(&rcu_slave_nb);
   for_each_online_cpu(cpu)
   rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
   check_cpu_stall_init();
 
 


Re: pci-assign terminates the guest upon pread() / pwrite() error?

2012-09-20 Thread Jan Kiszka
On 2012-09-20 19:27, Etienne Martineau wrote:
 In hw/kvm/pci-assign.c a pread() error in assigned_dev_pci_read()
 results in a hw_error(). Similarly, a pwrite() error in
 assigned_dev_pci_write() also results in a hw_error().
 
 Would there be a way to avoid terminating the guest for those cases? How
 about we deassign the device upon error?

First of all, is this a regression of latest QEMU / qemu-kvm? Or was it
always like this for you?

Then, can you provide more information about the device (lspci -vv) and
what accesses go wrong (printf, complete console output)?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: pci-assign terminates the guest upon pread() / pwrite() error?

2012-09-20 Thread Alex Williamson
On Thu, 2012-09-20 at 13:27 -0400, Etienne Martineau wrote:
 In hw/kvm/pci-assign.c a pread() error in assigned_dev_pci_read()
 results in a hw_error(). Similarly, a pwrite() error in
 assigned_dev_pci_write() also results in a hw_error().
 
 Would there be a way to avoid terminating the guest for those cases? How 
 about we deassign the device upon error?

By terminating the guest we contain the error vs allowing the guest to
continue running with invalid data.  De-assigning the device is
asynchronous and relies on guest involvement, so damage is potentially
already done.  Is this a theoretical problem or do you actually have
hardware that hits this?  Thanks,

Alex



Re: [PATCH 5/6] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are a serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock
is marked to be skipped in the future. When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is deciding when to ignore the
cached information. If it's ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations will fail
that might have otherwise succeeded. In this patch


Big hammer, but I guess it is effective...

Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 4/6] Revert "mm: have order > 0 compaction start off where it left"

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

This reverts commit 7db8889a (mm: have order > 0 compaction start off
where it left) and commit de74f1cc (mm: have order > 0 compaction start
near a pageblock with free pages). These patches were a good idea and
tests confirmed that they massively reduced the amount of scanning but
the implementation is complex and tricky to understand. A later patch
will cache what pageblocks should be skipped and reimplements the
concept of compact_cached_free_pfn on top for both migration and
free scanners.

Signed-off-by: Mel Gorman mgor...@suse.de


Sure, it makes your next patches easier...

Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 6/6] mm: compaction: Restart compaction from near where it left off

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order allocations,
it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to compact
at the start of the zone, and for free pages to compact things to at the
end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at
the end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from on
subsequent compaction invocations using the pageblock-skip information. When
compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman mgor...@suse.de


Together with patch 5/6, this has the effect of
skipping compaction in a zone if the free and
isolate markers have met, and it has been less
than 5 seconds since the skip information was
reset.

Compaction on zones where we cycle through more
slowly can continue, even when this particular
zone is experiencing problems, so I guess this
is desired behaviour...

Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 3/6] mm: compaction: Acquire the zone->lock as late as possible

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

Compaction's free scanner acquires the zone->lock when checking for PageBuddy
pages and isolating them. It does this even if there are no PageBuddy pages
in the range.

This patch defers acquiring the zone lock for as long as possible. In the
event there are no free pages in the pageblock then the lock will not be
acquired at all, which reduces contention on zone->lock.

Signed-off-by: Mel Gorman mgor...@suse.de


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 2/6] mm: compaction: Acquire the zone->lru_lock as late as possible

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

Compaction's migrate scanner acquires the zone->lru_lock when scanning a range
of pages looking for LRU pages to acquire. It does this even if there are
no LRU pages in the range. If multiple processes are compacting then this
can cause severe locking contention. To make matters worse commit b2eef8c0
(mm: compaction: minimise the time IRQs are disabled while isolating pages
for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
the lock is contended. This reduces the number of times it unnecessarily
disables and re-enables IRQs. The second is that it defers acquiring the
LRU lock for as long as possible. If there are no LRU pages or the only
LRU pages are transhuge then the LRU lock will not be acquired at all
which reduces contention on zone->lru_lock.

Signed-off-by: Mel Gorman mgor...@suse.de


Acked-by: Rik van Riel r...@redhat.com



Re: [PATCH 1/6] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-20 Thread Rik van Riel

On 09/20/2012 10:04 AM, Mel Gorman wrote:

From: Shaohua Li s...@fusionio.com

Changelog since V2
o Fix BUG_ON triggered due to pages left on cc.migratepages
o Make compact_zone_order() require non-NULL arg `contended'

Changelog since V1
o only abort the compaction if lock is contended or run too long
o Rearranged the code by Andrea Arcangeli.

isolate_migratepages_range() might isolate no pages if, for example,
zone->lru_lock is contended and compaction is running asynchronously. In this
case we should abort compaction; otherwise compact_zone will run a
useless loop and make zone->lru_lock even more contended.

[minc...@kernel.org: Putback pages isolated for migration if aborting]
[a...@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Shaohua Li s...@fusionio.com
Signed-off-by: Mel Gorman mgor...@suse.de


Acked-by: Rik van Riel r...@redhat.com



[PATCH v4 00/17] Allow changing of Hypervisor CPUIDs.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization CPUIDs.

This is primarily done so that the guest will think it is running
under vmware when hypervisor-vendor=vmware is specified as a
property of a cpu.


This depends on:

http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01400.html

As far as I know it is #4. It depends on (1) and (2) and (3).

This change is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documention on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

Changes from v3 to v4:
  Added CPUID_HV_LEVEL_HYPERV, CPUID_HV_LEVEL_KVM.
  Added CPUID_HV_VENDOR_HYPERV.
  Added hyperv as a known hypervisor-vendor.
  Allow hypervisor-level to be 0.

Changes from v2 to v3:
  Clean post to qemu-devel.

Changes from v1 to v2:

1) Added 1/4 from 
http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg05153.html

   Because Fred is changing jobs and so will not be pushing to get
   this in. It needed to be rebased, And I needed it to complete the
   testing of this change.

2) Added 2/4 because of the re-work I needed a way to clear all KVM bits,

3) The rework of v1.  Make it fit into the object model re-work of cpu.c for 
x86.

4) Added 3/4 -- The split out of the code that is not needed for accel=kvm.

Changes from v2 to v3:

Marcelo Tosatti:
  Its one big patch, better split in logically correlated patches
  (with better changelog). This would help reviewers.

So split 3 and 4 into 3 to 17.  More info in change log.
No code change.

Don Slutz (17):
  target-i386: Allow tsc-frequency to be larger then 2.147G
  target-i386: Add missing kvm bits.
  target-i386: Add Hypervisor level.
  target-i386: Add cpu object access routines for Hypervisor level.
  target-i386: Add x86_set_hyperv.
  target-i386: Use Hypervisor level in -machine pc,accel=kvm.
  target-i386: Use Hypervisor level in -machine pc,accel=tcg.
  target-i386: Add Hypervisor vendor.
  target-i386: Add cpu object access routines for Hypervisor vendor.
  target-i386: Use Hypervisor vendor in -machine pc,accel=kvm.
  target-i386: Use Hypervisor vendor in -machine pc,accel=tcg.
  target-i386: Add some known names to Hypervisor vendor.
  target-i386: Add optional Hypervisor leaf extra.
  target-i386: Add cpu object access routines for Hypervisor leaf
extra.
  target-i386: Add setting of Hypervisor leaf extra for known vmare4.
  target-i386: Use Hypervisor leaf extra in -machine pc,accel=kvm.
  target-i386: Use Hypervisor leaf extra in -machine pc,accel=tcg.

 target-i386/cpu.c |  277 -
 target-i386/cpu.h |   29 ++
 target-i386/kvm.c |   36 ++-
 3 files changed, 331 insertions(+), 11 deletions(-)



[PATCH v4 01/17] target-i386: Allow tsc-frequency to be larger then 2.147G

2012-09-20 Thread Don Slutz
The check using INT_MAX (2147483647) is wrong in this case.

Signed-off-by: Fred Oliveira folive...@cloudswitch.com
Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index af50a8f..0313cf5 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1146,7 +1146,7 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque,
 {
 X86CPU *cpu = X86_CPU(obj);
 const int64_t min = 0;
-const int64_t max = INT_MAX;
+const int64_t max = INT64_MAX;
 int64_t value;
 
 visit_type_freq(v, &value, name, errp);
-- 
1.7.1



[PATCH v4 02/17] target-i386: Add missing kvm bits.

2012-09-20 Thread Don Slutz
Fix duplicate name (kvmclock => kvm_clock2) also.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 0313cf5..5f9866a 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -87,10 +87,14 @@ static const char *ext3_feature_name[] = {
 };
 
 static const char *kvm_feature_name[] = {
-"kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock", "kvm_asyncpf", NULL, "kvm_pv_eoi", NULL,
-NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
-NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
-NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+"kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvm_clock2",
+"kvm_asyncpf", "kvm_steal_time", "kvm_pv_eoi", NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+"kvm_clock_stable", NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
 };
 
 static const char *svm_feature_name[] = {
-- 
1.7.1



[PATCH v4 03/17] target-i386: Add Hypervisor level.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization level or maximum cpuid function present in
this leaf. This is just the EAX value for leaf 0x40000000.

QEMU knows this is KVM_CPUID_SIGNATURE (0x40000000).
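
For reference, leaves in the 0x40000000 range are conventionally reserved for
hypervisors. A minimal guest-side probe of this leaf (illustrative only, not
part of this patch) would look like:

	/* Read CPUID leaf 0x40000000: EAX = maximum hypervisor leaf,
	 * EBX/ECX/EDX = hypervisor vendor signature (e.g. "KVMKVMKVM").
	 */
	static inline uint32_t hypervisor_cpuid_level(void)
	{
		uint32_t eax, ebx, ecx, edx;

		__asm__ volatile("cpuid"
				 : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
				 : "a"(0x40000000));
		return eax;
	}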

This is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documention on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

QEMU has the value HYPERV_CPUID_MIN defined.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.h |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 5265c5a..05c0848 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -782,6 +782,8 @@ typedef struct CPUX86State {
 uint32_t cpuid_ext4_features;
 /* Flags from CPUID[EAX=7,ECX=0].EBX */
 uint32_t cpuid_7_0_ebx;
+/* Hypervisor CPUIDs */
+uint32_t cpuid_hv_level;
 
 /* MTRRs */
 uint64_t mtrr_fixed[11];
-- 
1.7.1



[PATCH v4 05/17] target-i386: Add x86_set_hyperv.

2012-09-20 Thread Don Slutz
This is used to set the cpu object's hypervisor level to the default for 
Microsoft's Hypervisor.

HYPERV_CPUID_MIN (0x40000005) is defined in a linux header file.
CPUID_HV_LEVEL_HYPERV (0x40000005) is used instead.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   10 ++
 target-i386/cpu.h |2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 0e4a18d..6aeb194 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1192,6 +1192,13 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
 }
 
 #if !defined(CONFIG_USER_ONLY)
+static void x86_set_hyperv(Object *obj, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+cpu->env.cpuid_hv_level = CPUID_HV_LEVEL_HYPERV;
+}
+
 static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
  const char *name, Error **errp)
 {
@@ -1214,6 +1221,7 @@ static void x86_set_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
 return;
 }
 hyperv_set_spinlock_retries(value);
+x86_set_hyperv(obj, errp);
 }
 
 static void x86_get_hv_relaxed(Object *obj, Visitor *v, void *opaque,
@@ -1234,6 +1242,7 @@ static void x86_set_hv_relaxed(Object *obj, Visitor *v, void *opaque,
 return;
 }
 hyperv_enable_relaxed_timing(value);
+x86_set_hyperv(obj, errp);
 }
 
 static void x86_get_hv_vapic(Object *obj, Visitor *v, void *opaque,
@@ -1254,6 +1263,7 @@ static void x86_set_hv_vapic(Object *obj, Visitor *v, void *opaque,
 return;
 }
 hyperv_enable_vapic_recommended(value);
+x86_set_hyperv(obj, errp);
 }
 #endif
 
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 05c0848..7fc7906 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -488,6 +488,8 @@
 
 #define CPUID_VENDOR_VIA   "CentaurHauls"
 
+#define CPUID_HV_LEVEL_HYPERV  0x40000005
+
 #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX (1 << 0) /* enumeration supported */
 
-- 
1.7.1



[PATCH v4 04/17] target-i386: Add cpu object access routines for Hypervisor level.

2012-09-20 Thread Don Slutz
These are modeled after x86_cpuid_get_xlevel and x86_cpuid_set_xlevel.
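
As an illustration of the setter's convention (a hypothetical helper, not in
the patch): non-zero values below 0x40000000 are offset into the hypervisor
CPUID range, as the diff below shows.

/* Hypothetical helper mirroring the normalization in the setter below:
 * hypervisor-level=5 is treated as 0x40000005, while 0 stays 0. */
static uint32_t normalize_hv_level(uint32_t value)
{
    if (value != 0 && value < 0x40000000) {
        value += 0x40000000;
    }
    return value;
}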

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   28 
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 5f9866a..0e4a18d 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1166,6 +1166,31 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque,
 cpu->env.tsc_khz = value / 1000;
 }
 
+static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, &cpu->env.cpuid_hv_level, name, errp);
+}
+
+static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+uint32_t value;
+
+visit_type_uint32(v, &value, name, errp);
+if (error_is_set(errp)) {
+return;
+}
+
+if ((value != 0) && (value < 0x40000000)) {
+value += 0x40000000;
+}
+cpu->env.cpuid_hv_level = value;
+}
+
 #if !defined(CONFIG_USER_ONLY)
 static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
  const char *name, Error **errp)
@@ -2061,6 +2086,9 @@ static void x86_cpu_initfn(Object *obj)
 object_property_add(obj, "enforce", "bool",
 x86_cpuid_get_enforce,
 x86_cpuid_set_enforce, NULL, NULL, NULL);
+object_property_add(obj, "hypervisor-level", "int",
+x86_cpuid_get_hv_level,
+x86_cpuid_set_hv_level, NULL, NULL, NULL);
 #if !defined(CONFIG_USER_ONLY)
 object_property_add(obj, "hv_spinlocks", "int",
 x86_get_hv_spinlocks,
-- 
1.7.1



[PATCH v4 07/17] target-i386: Use Hypervisor level in -machine pc,accel=tcg.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization level.

This change is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documentation on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

QEMU knows this as KVM_CPUID_SIGNATURE (0x40000000) in kvm on linux.

This does not provide vendor support in tcg yet.

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 6aeb194..b7532b7 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1651,6 +1651,16 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 index = env->cpuid_xlevel;
 }
 }
+} else if (index & 0x40000000) {
+if (env->cpuid_hv_level > 0) {
+/* Handle Hypervisor CPUIDs */
+if (index > env->cpuid_hv_level) {
+index = env->cpuid_hv_level;
+}
+} else {
+if (index > env->cpuid_level)
+index = env->cpuid_level;
+}
 } else {
 if (index > env->cpuid_level)
 index = env->cpuid_level;
@@ -1789,6 +1799,18 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 *edx = 0;
 }
 break;
+case 0x40000000:
+*eax = env->cpuid_hv_level;
+*ebx = 0;
+*ecx = 0;
+*edx = 0;
+break;
+case 0x40000001:
+*eax = env->cpuid_kvm_features;
+*ebx = 0;
+*ecx = 0;
+*edx = 0;
+break;
 case 0x80000000:
 *eax = env->cpuid_xlevel;
 *ebx = env->cpuid_vendor1;
-- 
1.7.1



[PATCH v4 06/17] target-i386: Use Hypervisor level in -machine pc,accel=kvm.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization level.

This change is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documentation on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

QEMU knows this is KVM_CPUID_SIGNATURE (0x40000000).

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/kvm.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 895d848..bf27793 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -389,12 +389,12 @@ int kvm_arch_init_vcpu(CPUX86State *env)
 c = &cpuid_data.entries[cpuid_i++];
 memset(c, 0, sizeof(*c));
 c->function = KVM_CPUID_SIGNATURE;
-if (!hyperv_enabled()) {
+if (env->cpuid_hv_level == 0) {
 memcpy(signature, "KVMKVMKVM\0\0\0", 12);
 c->eax = 0;
 } else {
 memcpy(signature, "Microsoft Hv", 12);
-c->eax = HYPERV_CPUID_MIN;
+c->eax = env->cpuid_hv_level;
 }
 c->ebx = signature[0];
 c->ecx = signature[1];
-- 
1.7.1



[PATCH v4 09/17] target-i386: Add cpu object access routines for Hypervisor vendor.

2012-09-20 Thread Don Slutz
These are modeled after x86_cpuid_set_vendor and x86_cpuid_get_vendor.
Since kvm's vendor is shorter, the test for correct size is removed and zero 
padding is added.

Set Microsoft's Vendor now that we can.  Value defined in:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx
And matches what is in target-i386/kvm.c
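
For illustration, the 12-byte vendor string is split little-endian across the
three vendor words (EBX/ECX/EDX of leaf 0x40000000); a hedged sketch of the
packing rule (hypothetical helper, mirroring the setter in the diff below):

/* Hypothetical helper showing how a 12-byte vendor string maps onto the
 * three vendor words.  "Microsoft Hv" packs as:
 *   vendor1 = 'M','i','c','r' = 0x7263694d (EBX)
 *   vendor2 = 'o','s','o','f' = 0x666f736f (ECX)
 *   vendor3 = 't',' ','H','v' = 0x76482074 (EDX)
 * Zero padding covers shorter strings such as "KVMKVMKVM". */
static void pack_hv_vendor(const char *s, uint32_t v[3])
{
    int i;

    memset(v, 0, 3 * sizeof(uint32_t));
    for (i = 0; i < 12 && s[i]; i++) {
        v[i / 4] |= (uint32_t)(uint8_t)s[i] << (8 * (i % 4));
    }
}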

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   44 
 target-i386/cpu.h |2 ++
 2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index b7532b7..d8f7e22 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1191,12 +1191,53 @@ static void x86_cpuid_set_hv_level(Object *obj, Visitor *v, void *opaque,
 cpu->env.cpuid_hv_level = value;
 }
 
+static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+CPUX86State *env = &cpu->env;
+char *value;
+int i;
+
+value = (char *)g_malloc(CPUID_VENDOR_SZ + 1);
+for (i = 0; i < 4; i++) {
+value[i + 0] = env->cpuid_hv_vendor1 >> (8 * i);
+value[i + 4] = env->cpuid_hv_vendor2 >> (8 * i);
+value[i + 8] = env->cpuid_hv_vendor3 >> (8 * i);
+}
+value[CPUID_VENDOR_SZ] = '\0';
+
+return value;
+}
+
+static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
+ Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+CPUX86State *env = &cpu->env;
+int i;
+char adj_value[CPUID_VENDOR_SZ + 1];
+
+memset(adj_value, 0, sizeof(adj_value));
+
+pstrcpy(adj_value, sizeof(adj_value), value);
+
+env->cpuid_hv_vendor1 = 0;
+env->cpuid_hv_vendor2 = 0;
+env->cpuid_hv_vendor3 = 0;
+for (i = 0; i < 4; i++) {
+env->cpuid_hv_vendor1 |= ((uint8_t)adj_value[i + 0]) << (8 * i);
+env->cpuid_hv_vendor2 |= ((uint8_t)adj_value[i + 4]) << (8 * i);
+env->cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8]) << (8 * i);
+}
+}
+
 #if !defined(CONFIG_USER_ONLY)
 static void x86_set_hyperv(Object *obj, Error **errp)
 {
 X86CPU *cpu = X86_CPU(obj);
 
 cpu->env.cpuid_hv_level = CPUID_HV_LEVEL_HYPERV;
+x86_cpuid_set_hv_vendor(obj, CPUID_HV_VENDOR_HYPERV, errp);
 }
 
 static void x86_get_hv_spinlocks(Object *obj, Visitor *v, void *opaque,
@@ -2121,6 +2162,9 @@ static void x86_cpu_initfn(Object *obj)
 object_property_add(obj, "hypervisor-level", "int",
 x86_cpuid_get_hv_level,
 x86_cpuid_set_hv_level, NULL, NULL, NULL);
+object_property_add_str(obj, "hypervisor-vendor",
+x86_cpuid_get_hv_vendor,
+x86_cpuid_set_hv_vendor, NULL);
 #if !defined(CONFIG_USER_ONLY)
 object_property_add(obj, "hv_spinlocks", "int",
 x86_get_hv_spinlocks,
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index e13a44a..91ddf76 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -488,6 +488,8 @@
 
 #define CPUID_VENDOR_VIA   "CentaurHauls"
 
+#define CPUID_HV_VENDOR_HYPERV "Microsoft Hv"
+
 #define CPUID_HV_LEVEL_HYPERV  0x40000005
 
 #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */
-- 
1.7.1



[PATCH v4 10/17] target-i386: Use Hypervisor vendor in -machine pc,accel=kvm.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization vendor.

This change is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documentation on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/kvm.c |   15 ++-
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index bf27793..dde9214 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -389,16 +389,21 @@ int kvm_arch_init_vcpu(CPUX86State *env)
 c = &cpuid_data.entries[cpuid_i++];
 memset(c, 0, sizeof(*c));
 c->function = KVM_CPUID_SIGNATURE;
-if (env->cpuid_hv_level == 0) {
+if (env->cpuid_hv_level == 0 &&
+env->cpuid_hv_vendor1 == 0 &&
+env->cpuid_hv_vendor2 == 0 &&
+env->cpuid_hv_vendor3 == 0) {
 memcpy(signature, "KVMKVMKVM\0\0\0", 12);
 c->eax = 0;
+c->ebx = signature[0];
+c->ecx = signature[1];
+c->edx = signature[2];
 } else {
-memcpy(signature, "Microsoft Hv", 12);
 c->eax = env->cpuid_hv_level;
+c->ebx = env->cpuid_hv_vendor1;
+c->ecx = env->cpuid_hv_vendor2;
+c->edx = env->cpuid_hv_vendor3;
 }
-c->ebx = signature[0];
-c->ecx = signature[1];
-c->edx = signature[2];
 
 c = &cpuid_data.entries[cpuid_i++];
 memset(c, 0, sizeof(*c));
-- 
1.7.1



[PATCH v4 13/17] target-i386: Add optional Hypervisor leaf extra.

2012-09-20 Thread Don Slutz
Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 6dafaeb..e158c54 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -807,6 +807,10 @@ typedef struct CPUX86State {
 uint32_t cpuid_hv_vendor1;
 uint32_t cpuid_hv_vendor2;
 uint32_t cpuid_hv_vendor3;
+/* VMware extra data */
+uint32_t cpuid_hv_extra;
+uint32_t cpuid_hv_extra_a;
+uint32_t cpuid_hv_extra_b;
 
 /* MTRRs */
 uint64_t mtrr_fixed[11];
-- 
1.7.1



[PATCH v4 14/17] target-i386: Add cpu object access routines for Hypervisor leaf extra.

2012-09-20 Thread Don Slutz
Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   66 +
 1 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 904b08f..7e9c43b 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1273,6 +1273,63 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
 }
 }
 
+static void x86_cpuid_get_hv_extra(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, &cpu->env.cpuid_hv_extra, name, errp);
+}
+
+static void x86_cpuid_set_hv_extra(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+uint32_t value;
+
+visit_type_uint32(v, &value, name, errp);
+if (error_is_set(errp)) {
+return;
+}
+
+if ((value != 0) && (value < 0x40000000)) {
+value += 0x40000000;
+}
+cpu->env.cpuid_hv_extra = value;
+}
+
+static void x86_cpuid_get_hv_extra_a(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, &cpu->env.cpuid_hv_extra_a, name, errp);
+}
+
+static void x86_cpuid_set_hv_extra_a(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, &cpu->env.cpuid_hv_extra_a, name, errp);
+}
+
+static void x86_cpuid_get_hv_extra_b(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, &cpu->env.cpuid_hv_extra_b, name, errp);
+}
+
+static void x86_cpuid_set_hv_extra_b(Object *obj, Visitor *v, void *opaque,
+const char *name, Error **errp)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+visit_type_uint32(v, &cpu->env.cpuid_hv_extra_b, name, errp);
+}
+
 #if !defined(CONFIG_USER_ONLY)
 static void x86_set_hyperv(Object *obj, Error **errp)
 {
@@ -2215,6 +2272,15 @@ static void x86_cpu_initfn(Object *obj)
 object_property_add_str(obj, "hypervisor-vendor",
 x86_cpuid_get_hv_vendor,
 x86_cpuid_set_hv_vendor, NULL);
+object_property_add(obj, "hypervisor-extra", "int",
+x86_cpuid_get_hv_extra,
+x86_cpuid_set_hv_extra, NULL, NULL, NULL);
+object_property_add(obj, "hypervisor-extra-a", "int",
+x86_cpuid_get_hv_extra_a,
+x86_cpuid_set_hv_extra_a, NULL, NULL, NULL);
+object_property_add(obj, "hypervisor-extra-b", "int",
+x86_cpuid_get_hv_extra_b,
+x86_cpuid_set_hv_extra_b, NULL, NULL, NULL);
 #if !defined(CONFIG_USER_ONLY)
 object_property_add(obj, "hv_spinlocks", "int",
 x86_get_hv_spinlocks,
-- 
1.7.1



[PATCH v4 15/17] target-i386: Add setting of Hypervisor leaf extra for known vmware4.

2012-09-20 Thread Don Slutz
This was taken from:
  http://article.gmane.org/gmane.comp.emulators.kvm.devel/22643
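
For reference, a guest that knows the VMware timing leaf would consume it
roughly as follows (illustrative sketch, not part of the patch; leaf
0x40000010 and the register layout are per the referenced article, and the
VMware vendor signature is assumed to have been found already):

/* Hypothetical guest-side read of the VMware timing leaf set up below:
 * EAX = (virtual) TSC frequency in kHz, EBX = (virtual) bus / local
 * APIC timer frequency in kHz. */
#include <stdint.h>

static uint64_t vmware_tsc_hz(void)
{
    uint32_t a, b, c, d;

    __asm__ volatile("cpuid"
                     : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                     : "a"(0x40000010));
    return (uint64_t)a * 1000;  /* kHz -> Hz */
}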

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   32 
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 7e9c43b..4594693 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1135,6 +1135,36 @@ static void x86_cpuid_set_model_id(Object *obj, const char *model_id,
 }
 }
 
+static void x86_cpuid_set_vmware_extra(Object *obj)
+{
+X86CPU *cpu = X86_CPU(obj);
+
+if ((cpu->env.tsc_khz != 0) &&
+(cpu->env.cpuid_hv_level == CPUID_HV_LEVEL_VMARE_4) &&
+(cpu->env.cpuid_hv_vendor1 == CPUID_HV_VENDOR_VMWARE_1) &&
+(cpu->env.cpuid_hv_vendor2 == CPUID_HV_VENDOR_VMWARE_2) &&
+(cpu->env.cpuid_hv_vendor3 == CPUID_HV_VENDOR_VMWARE_3)) {
+const uint32_t apic_khz = 1000000L;
+
+/*
+ * From article.gmane.org/gmane.comp.emulators.kvm.devel/22643
+ *
+ *    Leaf 0x40000010, Timing Information.
+ *
+ *    VMware has defined the first generic leaf to provide timing
+ *    information.  This leaf returns the current TSC frequency and
+ *    current Bus frequency in kHz.
+ *
+ *    # EAX: (Virtual) TSC frequency in kHz.
+ *    # EBX: (Virtual) Bus (local apic timer) frequency in kHz.
+ *    # ECX, EDX: RESERVED (Per above, reserved fields are set to zero).
+ */
+cpu->env.cpuid_hv_extra = 0x40000010;
+cpu->env.cpuid_hv_extra_a = (uint32_t)cpu->env.tsc_khz;
+cpu->env.cpuid_hv_extra_b = apic_khz;
+}
+}
+
 static void x86_cpuid_get_tsc_freq(Object *obj, Visitor *v, void *opaque,
const char *name, Error **errp)
 {
@@ -1164,6 +1194,7 @@ static void x86_cpuid_set_tsc_freq(Object *obj, Visitor *v, void *opaque,
 }
 
 cpu->env.tsc_khz = value / 1000;
+x86_cpuid_set_vmware_extra(obj);
 }
 
 static void x86_cpuid_get_hv_level(Object *obj, Visitor *v, void *opaque,
@@ -1271,6 +1302,7 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
 env->cpuid_hv_vendor2 |= ((uint8_t)adj_value[i + 4]) << (8 * i);
 env->cpuid_hv_vendor3 |= ((uint8_t)adj_value[i + 8]) << (8 * i);
 }
+x86_cpuid_set_vmware_extra(obj);
 }
 
 static void x86_cpuid_get_hv_extra(Object *obj, Visitor *v, void *opaque,
-- 
1.7.1



[PATCH v4 16/17] target-i386: Use Hypervisor leaf extra in -machine pc,accel=kvm.

2012-09-20 Thread Don Slutz
Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/kvm.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index dde9214..bd7753f 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -457,6 +457,25 @@ int kvm_arch_init_vcpu(CPUX86State *env)
 c->ebx = signature[0];
 c->ecx = signature[1];
 c->edx = signature[2];
+} else if (env->cpuid_hv_level > 0) {
+for (i = KVM_CPUID_FEATURES + 1; i <= env->cpuid_hv_level; i++) {
+c = &cpuid_data.entries[cpuid_i++];
+memset(c, 0, sizeof(*c));
+c->function = i;
+if (i == env->cpuid_hv_extra) {
+c->eax = env->cpuid_hv_extra_a;
+c->ebx = env->cpuid_hv_extra_b;
+}
+}
+
+c = &cpuid_data.entries[cpuid_i++];
+memset(c, 0, sizeof(*c));
+c->function = KVM_CPUID_SIGNATURE_NEXT;
+memcpy(signature, "KVMKVMKVM\0\0\0", 12);
+c->eax = 0;
+c->ebx = signature[0];
+c->ecx = signature[1];
+c->edx = signature[2];
 }
 
 has_msr_async_pf_en = c->eax & (1 << KVM_FEATURE_ASYNC_PF);
-- 
1.7.1



[PATCH v4 12/17] target-i386: Add some known names to Hypervisor vendor.

2012-09-20 Thread Don Slutz
Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   44 +++-
 target-i386/cpu.h |   14 ++
 2 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 5cf7146..904b08f 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1206,6 +1206,23 @@ static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp)
 }
 value[CPUID_VENDOR_SZ] = '\0';
 
+/* Convert known names */
+if (!strcmp(value, CPUID_HV_VENDOR_HYPERV) &&
+   env->cpuid_hv_level == CPUID_HV_LEVEL_HYPERV) {
+pstrcpy(value, sizeof(value), "hyperv");
+} else if (!strcmp(value, CPUID_HV_VENDOR_VMWARE)) {
+if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMARE_4) {
+pstrcpy(value, sizeof(value), "vmware4");
+} else if (env->cpuid_hv_level == CPUID_HV_LEVEL_VMARE_3) {
+pstrcpy(value, sizeof(value), "vmware3");
+}
+} else if (!strcmp(value, CPUID_HV_VENDOR_XEN) &&
+   env->cpuid_hv_level == CPUID_HV_LEVEL_XEN) {
+pstrcpy(value, sizeof(value), "xen");
+} else if (!strcmp(value, CPUID_HV_VENDOR_KVM) &&
+   env->cpuid_hv_level == 0) {
+pstrcpy(value, sizeof(value), "kvm");
+}
 return value;
 }
 
@@ -1219,7 +1236,32 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
 
 memset(adj_value, 0, sizeof(adj_value));
 
-pstrcpy(adj_value, sizeof(adj_value), value);
+/* Convert known names */
+if (!strcmp(value, "hyperv")) {
+if (env->cpuid_hv_level == 0) {
+env->cpuid_hv_level = CPUID_HV_LEVEL_HYPERV;
+}
+pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_HYPERV);
+} else if (!strcmp(value, "vmware") || !strcmp(value, "vmware4")) {
+if (env->cpuid_hv_level == 0) {
+env->cpuid_hv_level = CPUID_HV_LEVEL_VMARE_4;
+}
+pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_VMWARE);
+} else if (!strcmp(value, "vmware3")) {
+if (env->cpuid_hv_level == 0) {
+env->cpuid_hv_level = CPUID_HV_LEVEL_VMARE_3;
+}
+pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_VMWARE);
+} else if (!strcmp(value, "xen")) {
+if (env->cpuid_hv_level == 0) {
+env->cpuid_hv_level = CPUID_HV_LEVEL_XEN;
+}
+pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_XEN);
+} else if (!strcmp(value, "kvm")) {
+pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_KVM);
+} else {
+pstrcpy(adj_value, sizeof(adj_value), value);
+}
 
 env->cpuid_hv_vendor1 = 0;
 env->cpuid_hv_vendor2 = 0;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index e3e176b..6dafaeb 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -490,10 +490,24 @@
 
 #define CPUID_HV_VENDOR_HYPERV "Microsoft Hv"
 
+#define CPUID_HV_VENDOR_VMWARE_1 0x61774d56 /* "VMwa" */
+#define CPUID_HV_VENDOR_VMWARE_2 0x4d566572 /* "reVM" */
+#define CPUID_HV_VENDOR_VMWARE_3 0x65726177 /* "ware" */
+#define CPUID_HV_VENDOR_VMWARE "VMwareVMware"
+
+#define CPUID_HV_VENDOR_XEN "XenVMMXenVMM"
+
+#define CPUID_HV_VENDOR_KVM "KVMKVMKVM"
+
 #define CPUID_HV_LEVEL_HYPERV  0x40000005
 
 #define CPUID_HV_LEVEL_KVM  0x40000001
 
+#define CPUID_HV_LEVEL_XEN  0x40000002
+
+#define CPUID_HV_LEVEL_VMARE_3 0x40000002
+#define CPUID_HV_LEVEL_VMARE_4 0x40000010
+
 #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX (1 << 0) /* enumeration supported */
 
-- 
1.7.1



[PATCH v4 11/17] target-i386: Use Hypervisor vendor in -machine pc,accel=tcg.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization vendor.

This change is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
This is where treating 0 the same as 0x40000001 is defined.

VMware documentation on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   20 ++--
 target-i386/cpu.h |2 ++
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index d8f7e22..5cf7146 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1693,10 +1693,18 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 }
 }
 } else if (index & 0x40000000) {
-if (env->cpuid_hv_level > 0) {
+if (env->cpuid_hv_level != 0 ||
+env->cpuid_hv_vendor1 != 0 ||
+env->cpuid_hv_vendor2 != 0 ||
+env->cpuid_hv_vendor3 != 0) {
+uint32_t real_level = env->cpuid_hv_level;
+
+/* Handle KVM's old level. */
+if (real_level == 0)
+real_level = CPUID_HV_LEVEL_KVM;
 /* Handle Hypervisor CPUIDs */
-if (index > env->cpuid_hv_level) {
-index = env->cpuid_hv_level;
+if (index > real_level) {
+index = real_level;
 }
 } else {
 if (index > env->cpuid_level)
@@ -1842,9 +1850,9 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 break;
 case 0x40000000:
 *eax = env->cpuid_hv_level;
-*ebx = 0;
-*ecx = 0;
-*edx = 0;
+*ebx = env->cpuid_hv_vendor1;
+*ecx = env->cpuid_hv_vendor2;
+*edx = env->cpuid_hv_vendor3;
 break;
 case 0x40000001:
 *eax = env->cpuid_kvm_features;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 91ddf76..e3e176b 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -492,6 +492,8 @@
 
 #define CPUID_HV_LEVEL_HYPERV  0x40000005
 
+#define CPUID_HV_LEVEL_KVM  0x40000001
+
 #define CPUID_MWAIT_IBE (1 << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX (1 << 0) /* enumeration supported */
 
-- 
1.7.1



[PATCH v4 17/17] target-i386: Use Hypervisor leaf extra in -machine pc,accel=tcg.

2012-09-20 Thread Don Slutz
Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 4594693..72a8442 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1991,6 +1991,17 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 *ecx = 0;
 *edx = 0;
 break;
+case 0x40000002 ... 0x400000FF:
+if (index == env->cpuid_hv_extra) {
+*eax = env->cpuid_hv_extra_a;
+*ebx = env->cpuid_hv_extra_b;
+} else {
+*eax = 0;
+*ebx = 0;
+}
+*ecx = 0;
+*edx = 0;
+break;
 case 0x80000000:
 *eax = env->cpuid_xlevel;
 *ebx = env->cpuid_vendor1;
-- 
1.7.1



[PATCH v4 08/17] target-i386: Add Hypervisor vendor.

2012-09-20 Thread Don Slutz
Also known as Paravirtualization vendor.
This is EBX, ECX, EDX data for 0x40000000.

QEMU knows this is KVM_CPUID_SIGNATURE (0x40000000).

This is based on:

Microsoft Hypervisor CPUID Leaves:
  
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542428%28v=vs.85%29.aspx

Linux kernel change starts with:
  http://fixunix.com/kernel/538707-use-cpuid-communicate-hypervisor.html
Also:
  http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html

VMware documentation on CPUIDs (Mechanisms to determine if software is
running in a VMware virtual machine):
  
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 7fc7906..e13a44a 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -786,6 +786,9 @@ typedef struct CPUX86State {
 uint32_t cpuid_7_0_ebx;
 /* Hypervisor CPUIDs */
 uint32_t cpuid_hv_level;
+uint32_t cpuid_hv_vendor1;
+uint32_t cpuid_hv_vendor2;
+uint32_t cpuid_hv_vendor3;
 
 /* MTRRs */
 uint64_t mtrr_fixed[11];
-- 
1.7.1



[PATCH v2 0/1] target-i386: Fix default Hypervisor level for kvm

2012-09-20 Thread Don Slutz
Looking at http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
The new value for EAX is 0x40000001.

This depends on 
http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02497.html

As far as I know, it is #5.  It depends on (1), (2), (3) and (4).

Based on cpu-queue[1] branch.
(From http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg02639.html)

[1] https://github.com/ehabkost/qemu/commits/cpu-queue
My branch is now based on Andreas's qom-cpu branch from
https://github.com/afaerber/qemu-cpu/commits/qom-cpu

Changes from v1 to v2:
  Drop patch #1 -- possible live-migrating issues.

  Added "kvm1" and "kvm0" to handle the 2 cases.

Don Slutz (1):
  target-i386: Fix default Hypervisor level for hypervisor-vendor=kvm.

 target-i386/cpu.c |   17 +
 1 files changed, 13 insertions(+), 4 deletions(-)



[PATCH v2 1/1] target-i386: Fix default Hypervisor level for hypervisor-vendor=kvm.

2012-09-20 Thread Don Slutz
From http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/00100.html
EAX should be KVM_CPUID_FEATURES (0x40000001) not 0.

Added hypervisor-vendor=kvm0 to get the older CPUID result. kvm1 selects the 
newer one.
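
As a usage sketch (hypothetical, not part of the patch; assumes a QOM X86CPU
object and QEMU's object_property_set_str()/error_report() helpers, and the
function name select_kvm_cpuid_level is invented for illustration):

/* Hypothetical QOM usage of the property touched by this patch:
 * "kvm1" reports EAX = KVM_CPUID_FEATURES (0x40000001), while
 * "kvm0" keeps the legacy EAX = 0 behaviour. */
static void select_kvm_cpuid_level(X86CPU *cpu, bool legacy)
{
    Error *err = NULL;

    object_property_set_str(OBJECT(cpu), legacy ? "kvm0" : "kvm1",
                            "hypervisor-vendor", &err);
    if (err != NULL) {
        error_report("cannot set hypervisor-vendor: %s",
                     error_get_pretty(err));
    }
}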

Signed-off-by: Don Slutz d...@cloudswitch.com
---
 target-i386/cpu.c |   17 +
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 72a8442..e51b2b1 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1250,9 +1250,12 @@ static char *x86_cpuid_get_hv_vendor(Object *obj, Error **errp)
 } else if (!strcmp(value, CPUID_HV_VENDOR_XEN) &&
    env->cpuid_hv_level == CPUID_HV_LEVEL_XEN) {
 pstrcpy(value, sizeof(value), "xen");
-} else if (!strcmp(value, CPUID_HV_VENDOR_KVM) &&
-   env->cpuid_hv_level == 0) {
-pstrcpy(value, sizeof(value), "kvm");
+} else if (!strcmp(value, CPUID_HV_VENDOR_KVM)) {
+if (env->cpuid_hv_level == CPUID_HV_LEVEL_KVM) {
+pstrcpy(value, sizeof(value), "kvm1");
+} else if (env->cpuid_hv_level == 0) {
+pstrcpy(value, sizeof(value), "kvm0");
+}
 }
 return value;
 }
@@ -1288,7 +1291,13 @@ static void x86_cpuid_set_hv_vendor(Object *obj, const char *value,
 env->cpuid_hv_level = CPUID_HV_LEVEL_XEN;
 }
 pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_XEN);
-} else if (!strcmp(value, "kvm")) {
+} else if (!strcmp(value, "kvm") || !strcmp(value, "kvm1")) {
+if (env->cpuid_hv_level == 0) {
+env->cpuid_hv_level = CPUID_HV_LEVEL_KVM;
+}
+pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_KVM);
+} else if (!strcmp(value, "kvm0")) {
+env->cpuid_hv_level = 0;
 pstrcpy(adj_value, sizeof(adj_value), CPUID_HV_VENDOR_KVM);
 } else {
 pstrcpy(adj_value, sizeof(adj_value), value);
-- 
1.7.1



Re: pci-assign terminates the guest upon pread() / pwrite() error?

2012-09-20 Thread Etienne Martineau

On 09/20/2012 03:37 PM, Alex Williamson wrote:

On Thu, 2012-09-20 at 15:08 -0400, Etienne Martineau wrote:

On 09/20/2012 02:16 PM, Alex Williamson wrote:

On Thu, 2012-09-20 at 13:27 -0400, Etienne Martineau wrote:

In hw/kvm/pci-assign.c a pread() error part of assigned_dev_pci_read()
result in a hw_error(). Similarly a pwrite() error part of
assigned_dev_pci_write() also result in a hw_error().

Would there be a way to avoid terminating the guest for those cases? How
about we deassign the device upon error?


By terminating the guest we contain the error vs allowing the guest to
continue running with invalid data.  De-assigning the device is
asynchronous and relies on guest involvement, so damage is potentially
already done.  Is this a theoretical problem or do you actually have
hardware that hits this?  Thanks,

Alex



This problem is in the context of a Hot-pluggable device assigned to the
guest. If the guest rd/wr the config space at the same time as the
device is physically taken out then the guest will terminate with
hw_error().

Because this limits the availability of the guest I think we should try
to recover instead. I don't see what other damage can happen since
guest's MMIO access to the stale device will go nowhere?


So you're looking at implementing surprise device removal?  There's not
just config space, there's slow bar access and mmap'd spaces to worry
about too.  What does going nowhere mean?  If it means reads return -1
and the guest is trying to read the data portion of a packet from the
network or an hba, we've now passed bad data to the guest.  Thanks,

Alex





Thanks for your answer;

Yes we are doing 'surprise device removal' for assigned device. Note 
that the problem also exists with standard 'attention button' device removal.


The problem is all about fault isolation. Ideally, only the 
corresponding driver should be affected by this 'surprise device 
removal'. I think that taking down the guest is too coarse. Think about 
a 'surprise device removal' on the host. In that case the host is not 
taken down so why not do the same with the guest?


Yes some badness will be latched into the guest, but really this is no 
different than having a misbehaving device.


thanks,
Etienne




Re: [libvirt] TSC scaling interface to management

2012-09-20 Thread Dor Laor

On 09/12/2012 06:39 PM, Marcelo Tosatti wrote:



HW TSC scaling is a feature of AMD processors that allows a
multiplier to be specified to the TSC frequency exposed to the guest.

KVM also contains provision to trap TSC (KVM: Infrastructure for
software and hardware based TSC rate scaling cc578287e3224d0da)
or advance TSC frequency.

This is useful when migrating to a host with different frequency and
the guest is possibly using direct RDTSC instructions for purposes
other than measuring cycles (that is, it previously calculated
cycles-per-second, and uses that information which is stale after
migration).

qemu-x86: Set tsc_khz in kvm when supported (e7429073ed1a76518)
added support for tsc_khz= option in QEMU.

I am proposing the following changes so that management applications
can work with this:

1) New option for tsc_khz, which is tsc_khz=host (QEMU command line
option). Host means that QEMU is responsible for retrieving the
TSC frequency of the host processor and use that.
Management application does not have to deal with the burden.

2) New subsection with tsc_khz value. Destination host should consult
supported features of running kernel and fail if feature is unsupported.


It is not necessary to use this tsc_khz setting with modern guests
using paravirtual clocks, or when it's known that applications make
proper use of the time interface provided by operating systems.

On the other hand, legacy applications or setups which require no
modification and correct operation while virtualized and make
use of RDTSC might need this.

Therefore it appears that this tsc_khz=auto option can be specified
only if the user specifies so (it can be a per-guest flag hidden
in the management configuration/manual).

Sending this email to gather suggestions (or objections)
to this interface.


I'm not sure I understand the exact difference between the proposals.
We can define these 3 options:

1. Qemu/kvm won't make use of tsc scaling feature at all.
2. tsc scaling is used and we take the value either from the host or
   from the live migration data that overrides the latter for incoming.
   As you've said, it should be passed through a sub section.
3. Manual setting of the value (uncommon).

Is there another option worth considering?
The question is what should be the default. IMHO #2 is more appropriate 
to serve as a default since we do expect tsc to change between hosts.


Cheers,
Dor


Re: pci-assign terminates the guest upon pread() / pwrite() error?

2012-09-20 Thread Alex Williamson
On Thu, 2012-09-20 at 16:36 -0400, Etienne Martineau wrote:
 On 09/20/2012 03:37 PM, Alex Williamson wrote:
  On Thu, 2012-09-20 at 15:08 -0400, Etienne Martineau wrote:
  On 09/20/2012 02:16 PM, Alex Williamson wrote:
  On Thu, 2012-09-20 at 13:27 -0400, Etienne Martineau wrote:
  In hw/kvm/pci-assign.c a pread() error part of assigned_dev_pci_read()
  result in a hw_error(). Similarly a pwrite() error part of
  assigned_dev_pci_write() also result in a hw_error().
 
  Would there be a way to avoid terminating the guest for those cases? How
  about we deassign the device upon error?
 
  By terminating the guest we contain the error vs allowing the guest to
  continue running with invalid data.  De-assigning the device is
  asynchronous and relies on guest involvement, so damage is potentially
  already done.  Is this a theoretical problem or do you actually have
  hardware that hits this?  Thanks,
 
  Alex
 
 
  This problem is in the context of a Hot-pluggable device assigned to the
  guest. If the guest rd/wr the config space at the same time as the
  device is physically taken out then the guest will terminate with
  hw_error().
 
  Because this limits the availability of the guest I think we should try
  to recover instead. I don't see what other damage can happen since
  guest's MMIO access to the stale device will go nowhere?
 
  So you're looking at implementing surprise device removal?  There's not
  just config space, there's slow bar access and mmap'd spaces to worry
  about too.  What does going nowhere mean?  If it means reads return -1
  and the guest is trying to read the data portion of a packet from the
  network or an hba, we've now passed bad data to the guest.  Thanks,
 
  Alex
 
 
 
 
 Thanks for your answer;
 
 Yes we are doing 'surprise device removal' for assigned device. Note 
  that the problem also exists with standard 'attention button' device removal.
 
 The problem is all about fault isolation. Ideally, only the 
 corresponding driver should be affected by this 'surprise device 
 removal'. I think that taking down the guest is too coarse. Think about 
 a 'surprise device removal' on the host. In that case the host is not 
 taken down so why not do the same with the guest?

It depends on the host hardware.  Some x86 hardware will try to isolate
the fault with an NMI; other architectures such as ia64 would pull a
machine check on a driver access to unresponsive devices.

  Yes some badness will be latched into the guest, but really this is no 
  different than having a misbehaving device.

... which is a bad thing, but often undetectable.  This is detectable.
Thanks,

Alex



Re: [Xen-users] Recommendations for Virtualization Hardware

2012-09-20 Thread Nuno Magalhães
Hi,

Just a Xen newbie myself, but from what i've gathered and fiddled, Xen
(P)VMs don't come with a graphics card. You'd have to remote to your
Windows HVM to play games. You can fiddle with PCI pass through for
some video cards and there's some VGA passthrough as well, but i don't
think running a recent game on a VM would provide good results (i'm
thinking FPSs here, not solitaire or simcity).

As for hardware i considered an ATI/Asus board with a Phenom II X6 and
16GB of DDR3 a while ago, plus box, PSU, no disks, around 500€... YMMV
and your needs as well.

Just my ill-informed 2c. HTH

Nuno


Recommendations for Virtualization Hardware

2012-09-20 Thread ShadesOfGrey
I'm looking to build a new personal computer.  I want it to function as 
a Linux desktop, provide network services for my home, and lastly, 
support occasional Windows gaming.  From what I've gathered, virtualization 
using a Type 1 Hypervisor supporting PCI/VGA pass-through like KVM or 
Xen would be an attractive solution for my needs.  For reference, 
reading these threads on Ars Technica may be helpful to understand where 
I'm coming from, 
http://arstechnica.com/civis/viewtopic.php?f=6t=1175674 and 
http://arstechnica.com/civis/viewtopic.php?f=11t=1181867. But 
basically, I use Linux as my primary OS and would rather avoid dual 
booting or building two boxes just to play Windows games when I want to 
play Windows games.  I'm also intrigued by the concept of virtualization 
and would like to experiment with it as a solution for my case.


My problem is isolating which hardware to choose, specifically which 
combination of CPU, motherboard and video card.  Previously I had been 
relying on web searches to glean information from gaming and enthusiast 
web sites and tech specs from motherboard manufacturers.  After what I 
learned during my participation in the referenced threads at Ars 
Technica, I find myself back at square one.  Instead of trying to guess 
what hardware supports KVM & Xen, and vice versa, I'd like to know what 
hardware KVM & Xen users are actually using to run KVM & Xen? 
Particularly with consideration for 3D gaming and current generation 
hardware, BTW.


If there is need for further clarification, I'll answer any queries you 
might have.



Re: Recommendations for Virtualization Hardware

2012-09-20 Thread Alex Williamson
On Thu, 2012-09-20 at 17:12 -0400, ShadesOfGrey wrote:
 I'm looking to build a new personal computer.  I want it to function as 
 a Linux desktop, provide network services for my home, and lastly, 
 occasional Windows gaming.  From what I've gathered, virtualization 
 using a Type 1 Hypervisor supporting PCI/VGA pass-through like KVM or 
 Xen would be an attractive solution for my needs.  For reference, 
 reading these threads on Ars Technica may be helpful to understand where 
 I'm coming from, 
 http://arstechnica.com/civis/viewtopic.php?f=6t=1175674 and 
 http://arstechnica.com/civis/viewtopic.php?f=11t=1181867. But 
 basically, I use Linux as my primary OS and would rather avoid dual 
 booting or building two boxes just to play Windows games when I want to 
 play Windows games.  I'm also intrigued by the concept of virtualization 
 and would like to experiment with it as a solution for my case.
 
 My problem is isolating which hardware to choose, specifically which 
 combination of CPU, motherboard and video card.  Previously I had been 
 relying on web searches to glean information from gaming and enthusiast 
 web sites and tech specs from motherboard manufacturers.  After what I 
 learned during my participation in the referenced threads at Ars 
 Technica, I find myself back at square one.  Instead of trying to guess 
 what hardware supports KVM & Xen, and vice versa, I'd like to know what 
 hardware KVM & Xen users are actually using to run KVM & Xen? 
 Particularly with consideration for 3D gaming and current generation 
 hardware, BTW.
 
 If there is need for further clarification, I'll answer any queries you 
 might have.

There have been a couple success reports of using AMD/ATI graphics cards
on Intel VT-d systems with KVM device assignment.  Nvidia cards have not
met with the same degree of success.  In both our cases, the graphics
device was assigned to a Windows guest as a secondary graphics head.
For me, Windows took over the assigned device as the primary graphics,
disabling the emulated graphics.  For my slow HD 5450, 3dMark seems to
get a similar score to what others see on bare metal.

Graphics device assignment on KVM is getting better, but should be
considered experimental at best.  Thanks,

Alex



Re: [libvirt] TSC scaling interface to management

2012-09-20 Thread Marcelo Tosatti
On Fri, Sep 21, 2012 at 12:02:46AM +0300, Dor Laor wrote:
 On 09/12/2012 06:39 PM, Marcelo Tosatti wrote:
 
 
 HW TSC scaling is a feature of AMD processors that allows a
 multiplier to be specified to the TSC frequency exposed to the guest.
 
 KVM also contains provision to trap TSC (KVM: Infrastructure for
 software and hardware based TSC rate scaling cc578287e3224d0da)
 or advance TSC frequency.
 
 This is useful when migrating to a host with different frequency and
 the guest is possibly using direct RDTSC instructions for purposes
 other than measuring cycles (that is, it previously calculated
 cycles-per-second, and uses that information which is stale after
 migration).
 
 qemu-x86: Set tsc_khz in kvm when supported (e7429073ed1a76518)
 added support for tsc_khz= option in QEMU.
 
 I am proposing the following changes so that management applications
 can work with this:
 
 1) New option for tsc_khz, which is tsc_khz=host (QEMU command line
 option). Host means that QEMU is responsible for retrieving the
 TSC frequency of the host processor and use that.
 Management application does not have to deal with the burden.
 
 2) New subsection with tsc_khz value. Destination host should consult
 supported features of running kernel and fail if feature is unsupported.
 
 
 It is not necessary to use this tsc_khz setting with modern guests
 using paravirtual clocks, or when it's known that applications make
 proper use of the time interface provided by operating systems.
 
 On the other hand, legacy applications or setups which require no
 modification and correct operation while virtualized and make
 use of RDTSC might need this.
 
 Therefore it appears that this tsc_khz=auto option can be specified
 only if the user specifies so (it can be a per-guest flag hidden
 in the management configuration/manual).
 
 Sending this email to gather suggestions (or objections)
 to this interface.
 
 I'm not sure I understand the exact difference between the proposals.
 We can define these 3 options:
 
 1. Qemu/kvm won't make use of tsc scaling feature at all.
 2. tsc scaling is used and we take the value either from the host or
    from the live migration data that overrides the latter for incoming.
As you've said, it should be passed through a sub section.
 3. Manual setting of the value (uncommon).
 
 Is there another option worth considering?
 The question is what should be the default. IMHO #2 is more
 appropriate to serve as a default since we do expect tsc to change
 between hosts.

Option 1. is more appropriate to serve as a default given that
modern guests make use of paravirt, as you have observed.

That is, tsc scaling is only required if the guest does direct RDTSC
on the expectation that the frequency won't change.

 Cheers,
 Dor


[PATCH v2] KVM: x86: Fix guest debug across vcpu INIT reset

2012-09-20 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This changes the focus of the set_guest_debug vendor op to
update_db_bp_intercept.

Found while trying to stop on start_secondary.
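
The resulting selection rule can be summarized as follows (illustrative
sketch, condensing the x86.c/vmx.c/svm.c hunks below; not literal patch code):

/* Illustrative summary of the new dr7 handling: the dr7 written via
 * KVM_SET_GUEST_DEBUG is remembered in guest_debug_dr7 and only takes
 * effect while hardware breakpoints are in use; otherwise the guest's
 * own arch.dr7 (reset to 0x400 on INIT) is loaded. */
static unsigned long effective_dr7(struct kvm_vcpu *vcpu)
{
	if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
		return vcpu->arch.guest_debug_dr7;
	return vcpu->arch.dr7;
}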

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/include/asm/kvm_host.h |4 ++--
 arch/x86/kvm/svm.c  |   23 ---
 arch/x86/kvm/vmx.c  |   14 +-
 arch/x86/kvm/x86.c  |   26 +-
 4 files changed, 24 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..e8baa1c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -457,6 +457,7 @@ struct kvm_vcpu_arch {
unsigned long dr6;
unsigned long dr7;
unsigned long eff_db[KVM_NR_DB_REGS];
+   unsigned long guest_debug_dr7;
 
u64 mcg_cap;
u64 mcg_status;
@@ -621,8 +622,7 @@ struct kvm_x86_ops {
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);
 
-   void (*set_guest_debug)(struct kvm_vcpu *vcpu,
-   struct kvm_guest_debug *dbg);
+   void (*update_db_bp_intercept)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
int (*set_msr)(struct kvm_vcpu *vcpu, u32 msr_index, u64 data);
u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 611c728..076064b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1146,7 +1146,6 @@ static void init_vmcb(struct vcpu_svm *svm)
 
 	svm_set_efer(&svm->vcpu, 0);
 	save->dr6 = 0xffff0ff0;
-	save->dr7 = 0x400;
 	kvm_set_rflags(&svm->vcpu, 2);
 	save->rip = 0x0000fff0;
 	svm->vcpu.arch.regs[VCPU_REGS_RIP] = save->rip;
@@ -1643,7 +1642,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
 	mark_dirty(svm->vmcb, VMCB_SEG);
 }
 
-static void update_db_intercept(struct kvm_vcpu *vcpu)
+static void update_db_bp_intercept(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1663,20 +1662,6 @@ static void update_db_intercept(struct kvm_vcpu *vcpu)
 	vcpu->guest_debug = 0;
 }
 
-static void svm_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-
-	if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
-		svm->vmcb->save.dr7 = dbg->arch.debugreg[7];
-	else
-		svm->vmcb->save.dr7 = vcpu->arch.dr7;
-
-	mark_dirty(svm->vmcb, VMCB_DR);
-
-	update_db_intercept(vcpu);
-}
-
 static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
 {
 	if (sd->next_asid > sd->max_asid) {
@@ -1748,7 +1733,7 @@ static int db_interception(struct vcpu_svm *svm)
 	if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP))
 		svm->vmcb->save.rflags &=
 			~(X86_EFLAGS_TF | X86_EFLAGS_RF);
-	update_db_intercept(&svm->vcpu);
+	update_db_bp_intercept(&svm->vcpu);
 	}
 
 	if (svm->vcpu.guest_debug &
@@ -3659,7 +3644,7 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
 */
 	svm->nmi_singlestep = true;
 	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
-   update_db_intercept(vcpu);
+   update_db_bp_intercept(vcpu);
 }
 
 static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
@@ -4259,7 +4244,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.vcpu_load = svm_vcpu_load,
.vcpu_put = svm_vcpu_put,
 
-   .set_guest_debug = svm_guest_debug,
+   .update_db_bp_intercept = update_db_bp_intercept,
.get_msr = svm_get_msr,
.set_msr = svm_set_msr,
.get_segment_base = svm_get_segment_base,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d62b413..b0f7b34 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2286,16 +2286,6 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 	}
 }
 
-static void set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
-{
-	if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
-		vmcs_writel(GUEST_DR7, dbg->arch.debugreg[7]);
-	else
-		vmcs_writel(GUEST_DR7, vcpu->arch.dr7);
-
-	update_exception_bitmap(vcpu);
-}
-
 static __init int cpu_has_kvm_support(void)
 {
return cpu_has_vmx();
@@ -3959,8 +3949,6 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
kvm_rip_write(vcpu, 0);
kvm_register_write(vcpu, VCPU_REGS_RSP, 0);
 
-   vmcs_writel(GUEST_DR7, 0x400);
-
   

[PATCH 03/10] KVM: PPC: Book3S HV: Fix updates of vcpu->cpu

2012-09-20 Thread Paul Mackerras
This removes the powerpc generic updates of vcpu->cpu in load and
put, and moves them to the various backends.

The reason is that HV KVM does its own sauce with that field
and the generic updates might corrupt it. The field contains the
CPU# of the -first- HW CPU of the core always for all the VCPU
threads of a core (the one that's online from a host Linux
perspective).

However, the preempt notifiers are going to be called on the
threads' VCPUs when they are running (due to them sleeping on our
private waitqueue) causing unload to be called, potentially
clobbering the value.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_pr.c |3 ++-
 arch/powerpc/kvm/booke.c |2 ++
 arch/powerpc/kvm/powerpc.c   |2 --
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 4d0667a..bf3ec5d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -64,7 +64,7 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	svcpu->slb_max = to_book3s(vcpu)->slb_shadow_max;
 	svcpu_put(svcpu);
 #endif
-
+	vcpu->cpu = smp_processor_id();
 #ifdef CONFIG_PPC_BOOK3S_32
 	current->thread.kvm_shadow_vcpu = to_book3s(vcpu)->shadow_vcpu;
 #endif
@@ -84,6 +84,7 @@ void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
kvmppc_giveup_ext(vcpu, MSR_FP);
kvmppc_giveup_ext(vcpu, MSR_VEC);
kvmppc_giveup_ext(vcpu, MSR_VSX);
+	vcpu->cpu = -1;
 }
 
 int kvmppc_core_check_requests(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 3a6490f..69d047c 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1509,12 +1509,14 @@ void kvmppc_decrementer_func(unsigned long data)
 
 void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	vcpu->cpu = smp_processor_id();
 	current->thread.kvm_vcpu = vcpu;
 }
 
 void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	current->thread.kvm_vcpu = NULL;
+	vcpu->cpu = -1;
 }
 
 int __init kvmppc_booke_init(void)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2b08564..fd73763 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -510,7 +510,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
 #endif
kvmppc_core_vcpu_load(vcpu, cpu);
-	vcpu->cpu = smp_processor_id();
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -519,7 +518,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_BOOKE
 	vcpu->arch.vrsave = mfspr(SPRN_VRSAVE);
 #endif
-	vcpu->cpu = -1;
 }
 
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
-- 
1.7.10



[PATCH 09/10] KVM: PPC: Book3S HV: Fix accounting of stolen time

2012-09-20 Thread Paul Mackerras
Currently the code that accounts stolen time tends to overestimate the
stolen time, and will sometimes report more stolen time in a DTL
(dispatch trace log) entry than has elapsed since the last DTL entry.
This can cause guests to underflow the user or system time measured
for some tasks, leading to ridiculous CPU percentages and total runtimes
being reported by top and other utilities.

In addition, the current code was designed for the previous policy where
a vcore would only run when all the vcpus in it were runnable, and so
only counted stolen time on a per-vcore basis.  Now that a vcore can
run while some of the vcpus in it are doing other things in the kernel
(e.g. handling a page fault), we need to count the time when a vcpu task
is preempted while it is not running as part of a vcore as stolen also.

To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
vcpu_load/put functions to count preemption time while the vcpu is
in that state.  Handling the transitions between the RUNNING and
BUSY_IN_HOST states requires checking and updating two variables
(accumulated time stolen and time last preempted), so we add a new
spinlock, vcpu->arch.tbacct_lock.  This protects both the per-vcpu
stolen/preempt-time variables, and the per-vcore variables while this
vcpu is running the vcore.

Finally, we now don't count time spent in userspace as stolen time.
The task could be executing in userspace on behalf of the vcpu, or
it could be preempted, or the vcpu could be genuinely stopped.  Since
we have no way of dividing up the time between these cases, we don't
count any of it as stolen.
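
Schematically, the per-vcpu half of the accounting follows this pattern
(simplified sketch of the load/put hooks, not the exact patch code; the
fields and constants are the ones added by this patch):

/* Simplified sketch of the BUSY_IN_HOST accounting described above:
 * on put, remember when we were preempted away; on load, fold the
 * elapsed timebase ticks into busy_stolen.  TB_NIL marks "no value". */
static void acct_load(struct kvm_vcpu_arch *a, u64 now)
{
	spin_lock(&a->tbacct_lock);
	if (a->state == KVMPPC_VCPU_BUSY_IN_HOST && a->busy_preempt != TB_NIL)
		a->busy_stolen += now - a->busy_preempt;
	a->busy_preempt = TB_NIL;
	spin_unlock(&a->tbacct_lock);
}

static void acct_put(struct kvm_vcpu_arch *a, u64 now)
{
	spin_lock(&a->tbacct_lock);
	if (a->state == KVMPPC_VCPU_BUSY_IN_HOST)
		a->busy_preempt = now;
	spin_unlock(&a->tbacct_lock);
}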

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |5 ++
 arch/powerpc/kvm/book3s_hv.c|  127 ++-
 2 files changed, 117 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1e8cbd1..3093896 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -559,12 +559,17 @@ struct kvm_vcpu_arch {
unsigned long dtl_index;
u64 stolen_logged;
struct kvmppc_vpa slb_shadow;
+
+   spinlock_t tbacct_lock;
+   u64 busy_stolen;
+   u64 busy_preempt;
 #endif
 };
 
 /* Values for vcpu->arch.state */
 #define KVMPPC_VCPU_NOTREADY   0
 #define KVMPPC_VCPU_RUNNABLE   1
+#define KVMPPC_VCPU_BUSY_IN_HOST   2
 
 /* Values for vcpu->arch.io_gpr */
 #define KVM_MMIO_REG_MASK  0x001f
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index dc34a69..f953f73 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -57,23 +57,74 @@
 /* #define EXIT_DEBUG_SIMPLE */
 /* #define EXIT_DEBUG_INT */
 
+/* Used as a null value for timebase values */
+#define TB_NIL (~(u64)0)
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
+/*
+ * We use the vcpu_load/put functions to measure stolen time.
+ * Stolen time is counted as time when either the vcpu is able to
+ * run as part of a virtual core, but the task running the vcore
+ * is preempted or sleeping, or when the vcpu needs something done
+ * in the kernel by the task running the vcpu, but that task is
+ * preempted or sleeping.  Those two things have to be counted
+ * separately, since one of the vcpu tasks will take on the job
+ * of running the core, and the other vcpu tasks in the vcore will
+ * sleep waiting for it to do that, but that sleep shouldn't count
+ * as stolen time.
+ *
+ * Hence we accumulate stolen time when the vcpu can run as part of
+ * a vcore using vc->stolen_tb, and the stolen time when the vcpu
+ * needs its task to do other things in the kernel (for example,
+ * service a page fault) in busy_stolen.  We don't accumulate
+ * stolen time for a vcore when it is inactive, or for a vcpu
+ * when it is in state RUNNING or NOTREADY.  NOTREADY is a bit of
+ * a misnomer; it means that the vcpu task is not executing in
+ * the KVM_VCPU_RUN ioctl, i.e. it is in userspace or elsewhere in
+ * the kernel.  We don't have any way of dividing up that time
+ * between time that the vcpu is genuinely stopped, time that
+ * the task is actively working on behalf of the vcpu, and time
+ * that the task is preempted, so we don't count any of it as
+ * stolen.
+ *
+ * Updates to busy_stolen are protected by arch.tbacct_lock;
+ * updates to vc->stolen_tb are protected by the arch.tbacct_lock
+ * of the vcpu that has taken responsibility for running the vcore
+ * (i.e. vc->runner).  The stolen times are measured in units of
+ * timebase ticks.  (Note that the != TB_NIL checks below are
+ * purely defensive; they should never fail.)
+ */
+
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
-   if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
+   
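
The rest of this hunk is cut off in the archive.  As a standalone
illustration of the accounting scheme the commit message describes --
record a timestamp on vcpu_put, accumulate the elapsed time on
vcpu_load under tbacct_lock -- here is a minimal userspace sketch.
The names mirror the patch, but mftb() is stubbed with a counter, so
this is not the kernel code itself:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define TB_NIL (~(uint64_t)0)	/* null timebase value, as in the patch */

static uint64_t fake_tb;	/* stand-in for the timebase register */
static uint64_t mftb(void) { return ++fake_tb; }

struct vcpu {
	pthread_mutex_t tbacct_lock;	/* protects the two fields below */
	uint64_t busy_stolen;		/* accumulated stolen time */
	uint64_t busy_preempt;		/* time last preempted, or TB_NIL */
};

static void vcpu_put(struct vcpu *v)
{
	pthread_mutex_lock(&v->tbacct_lock);
	v->busy_preempt = mftb();	/* preemption starts now */
	pthread_mutex_unlock(&v->tbacct_lock);
}

static void vcpu_load(struct vcpu *v)
{
	pthread_mutex_lock(&v->tbacct_lock);
	if (v->busy_preempt != TB_NIL) {	/* defensive, as in the patch */
		v->busy_stolen += mftb() - v->busy_preempt;
		v->busy_preempt = TB_NIL;
	}
	pthread_mutex_unlock(&v->tbacct_lock);
}

int main(void)
{
	struct vcpu v = { PTHREAD_MUTEX_INITIALIZER, 0, TB_NIL };

	vcpu_put(&v);		/* task preempted while BUSY_IN_HOST */
	fake_tb += 100;		/* 100 ticks elapse */
	vcpu_load(&v);		/* elapsed time becomes stolen time */
	printf("busy_stolen = %llu ticks\n", (unsigned long long)v.busy_stolen);
	return 0;
}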

[PATCH 01/10] KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas

2012-09-20 Thread Paul Mackerras
The PAPR paravirtualization interface lets guests register three
different types of per-vCPU buffer areas in its memory for communication
with the hypervisor.  These are called virtual processor areas (VPAs).
Currently the hypercalls to register and unregister VPAs are handled
by KVM in the kernel, and userspace has no way to know about or save
and restore these registrations across a migration.

This adds get and set ioctls to allow userspace to see what addresses
have been registered, and to register or unregister them.  This will
be needed for guest hibernation and migration, and is also needed so
that userspace can unregister them on reset (otherwise we corrupt
guest memory after reboot by writing to the VPAs registered by the
previous kernel).  We also add a capability to indicate that the
ioctls are supported.

This also fixes a bug where we were calling init_vpa unconditionally,
leading to an oops when unregistering the VPA.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt  |   32 +
 arch/powerpc/include/asm/kvm_ppc.h |3 ++
 arch/powerpc/kvm/book3s_hv.c   |   54 +++-
 arch/powerpc/kvm/powerpc.c |   26 +
 include/linux/kvm.h|   11 
 5 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index a12f4e4..76a07a6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1992,6 +1992,38 @@ return the hash table order in the parameter.  (If the guest is using
 the virtualized real-mode area (VRMA) facility, the kernel will
 re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
 
+4.77 KVM_PPC_GET_VPA_INFO
+
+Capability: KVM_CAP_PPC_VPA
+Architectures: powerpc
+Type: vcpu ioctl
+Parameters: Pointer to struct kvm_ppc_vpa (out)
+Returns: 0 on success, -1 on error
+
+This populates and returns a structure containing the guest physical
+addresses and sizes of the three per-virtual-processor areas that the
+guest can register with the hypervisor under the PAPR
+paravirtualization interface, namely the Virtual Processor Area, the
+SLB (Segment Lookaside Buffer) Shadow Area, and the Dispatch Trace
+Log.
+
+4.78 KVM_PPC_SET_VPA_INFO
+
+Capability: KVM_CAP_PPC_VPA
+Architectures: powerpc
+Type: vcpu ioctl
+Parameters: Pointer to struct kvm_ppc_vpa (in)
+Returns: 0 on success, -1 on error
+
+This sets the guest physical addresses and sizes of the three
+per-virtual-processor areas that the guest can register with the
+hypervisor under the PAPR paravirtualization interface, namely the
+Virtual Processor Area, the SLB (Segment Lookaside Buffer) Shadow
+Area, and the Dispatch Trace Log.  Providing an address of zero for
+any of these areas causes the kernel to unregister any previously
+registered area; a non-zero address replaces any previously registered
+area.
+
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3fb980d..2c94cb3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -205,6 +205,9 @@ int kvmppc_set_sregs_ivor(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg);
 int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg);
 
+int kvm_vcpu_get_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa);
+int kvm_vcpu_set_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa);
+
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 38c7f1b..bebf9cb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -143,6 +143,57 @@ static void init_vpa(struct kvm_vcpu *vcpu, struct lppaca *vpa)
	vpa->yield_count = 1;
 }
 
+int kvm_vcpu_get_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa)
+{
+   spin_lock(&vcpu->arch.vpa_update_lock);
+   vpa->vpa_addr = vcpu->arch.vpa.next_gpa;
+   vpa->slb_shadow_addr = vcpu->arch.slb_shadow.next_gpa;
+   vpa->slb_shadow_size = vcpu->arch.slb_shadow.len;
+   vpa->dtl_addr = vcpu->arch.dtl.next_gpa;
+   vpa->dtl_size = vcpu->arch.dtl.len;
+   spin_unlock(&vcpu->arch.vpa_update_lock);
+   return 0;
+}
+
+static inline void set_vpa(struct kvmppc_vpa *v, unsigned long addr,
+  unsigned long len)
+{
+   if (v->next_gpa != addr || v->len != len) {
+   v->next_gpa = addr;
+   v->len = addr ? len : 0;
+   v->update_pending = 1;
+   }
+}
+
+int kvm_vcpu_set_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa)
+{
+   /* check that addresses are cacheline aligned */
+   if ((vpa->vpa_addr & (L1_CACHE_BYTES - 1)) ||
+   (vpa->slb_shadow_addr & (L1_CACHE_BYTES - 
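
The linux/kvm.h hunk defining struct kvm_ppc_vpa and the ioctl numbers
is cut off here.  Assuming those (truncated) uapi definitions, a VMM
might use the new ioctls around reset roughly as follows; this is an
illustrative sketch, not code from the patch:

#include <err.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Save the VPA registrations of one vcpu, then unregister them so the
 * kernel stops writing to guest memory across a reboot. */
static void save_and_clear_vpas(int vcpu_fd, struct kvm_ppc_vpa *saved)
{
	if (ioctl(vcpu_fd, KVM_PPC_GET_VPA_INFO, saved) < 0)
		err(1, "KVM_PPC_GET_VPA_INFO");

	/* zero addresses mean "unregister", per the documentation hunk */
	struct kvm_ppc_vpa none;
	memset(&none, 0, sizeof(none));
	if (ioctl(vcpu_fd, KVM_PPC_SET_VPA_INFO, &none) < 0)
		err(1, "KVM_PPC_SET_VPA_INFO");
}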

[PATCH 05/10] KVM: PPC: Book3S HV: Fix some races in starting secondary threads

2012-09-20 Thread Paul Mackerras
Subsequent patches implementing in-kernel XICS emulation will make it
possible for IPIs to arrive at secondary threads at arbitrary times.
This fixes some races in how we start the secondary threads, which
if not fixed could lead to occasional crashes of the host kernel.

This makes sure that (a) we have grabbed all the secondary threads,
and verified that they are no longer in the kernel, before we start
any thread, (b) that the secondary thread loads its vcpu pointer
after clearing the IPI that woke it up (so we don't miss a wakeup),
and (c) that the secondary thread clears its vcpu pointer before
incrementing the nap count.  It also removes unnecessary setting
of the vcpu and vcore pointers in the paca in kvmppc_core_vcpu_load.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv.c|   41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   11 ++---
 2 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a917603..cd3dc12 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -64,8 +64,6 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
-   local_paca->kvm_hstate.kvm_vcpu = vcpu;
-   local_paca->kvm_hstate.kvm_vcore = vc;
if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
vc->stolen_tb += mftb() - vc->preempt_tb;
 }
@@ -776,6 +774,7 @@ static int kvmppc_grab_hwthread(int cpu)
 
/* Ensure the thread won't go into the kernel if it wakes */
tpaca->kvm_hstate.hwthread_req = 1;
+   tpaca->kvm_hstate.kvm_vcpu = NULL;
 
/*
 * If the thread is already executing in the kernel (e.g. handling
@@ -825,7 +824,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu)
smp_wmb();
 #if defined(CONFIG_PPC_ICP_NATIVE) && defined(CONFIG_SMP)
if (vcpu->arch.ptid) {
-   kvmppc_grab_hwthread(cpu);
xics_wake_cpu(cpu);
++vc->n_woken;
}
@@ -851,7 +849,8 @@ static void kvmppc_wait_for_nap(struct kvmppc_vcore *vc)
 
 /*
  * Check that we are on thread 0 and that any other threads in
- * this core are off-line.
+ * this core are off-line.  Then grab the threads so they can't
+ * enter the kernel.
  */
 static int on_primary_thread(void)
 {
@@ -863,6 +862,17 @@ static int on_primary_thread(void)
while (++thr < threads_per_core)
if (cpu_online(cpu + thr))
return 0;
+
+   /* Grab all hw threads so they can't go into the kernel */
+   for (thr = 1; thr < threads_per_core; ++thr) {
+   if (kvmppc_grab_hwthread(cpu + thr)) {
+   /* Couldn't grab one; let the others go */
+   do {
+   kvmppc_release_hwthread(cpu + thr);
+   } while (--thr > 0);
+   return 0;
+   }
+   }
return 1;
 }
 
@@ -911,16 +921,6 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
}
 
/*
-* Make sure we are running on thread 0, and that
-* secondary threads are offline.
-*/
-   if (threads_per_core > 1 && !on_primary_thread()) {
-   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
-   vcpu->arch.ret = -EBUSY;
-   goto out;
-   }
-
-   /*
 * Assign physical thread IDs, first to non-ceded vcpus
 * and then to ceded ones.
 */
@@ -939,15 +939,22 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
if (vcpu->arch.ceded)
vcpu->arch.ptid = ptid++;
 
+   /*
+* Make sure we are running on thread 0, and that
+* secondary threads are offline.
+*/
+   if (threads_per_core > 1 && !on_primary_thread()) {
+   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
+   vcpu->arch.ret = -EBUSY;
+   goto out;
+   }
+
vc->stolen_tb += mftb() - vc->preempt_tb;
vc->pcpu = smp_processor_id();
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
kvmppc_start_thread(vcpu);
kvmppc_create_dtl_entry(vcpu, vc);
}
-   /* Grab any remaining hw threads so they can't go into the kernel */
-   for (i = ptid; i < threads_per_core; ++i)
-   kvmppc_grab_hwthread(vc->pcpu + i);
 
preempt_disable();
spin_unlock(&vc->lock);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 44b72fe..1e90ef6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -134,8 +134,11 @@ kvm_start_guest:
 
 27:/* XXX should handle hypervisor maintenance interrupts etc. here */
 
+   /* reload vcpu pointer after clearing 
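
The assembly hunk is truncated at this point.  The C-side pattern the
patch adds to on_primary_thread() -- grab every secondary thread or
release whatever was already grabbed -- reduces to the following
standalone sketch, with try_grab()/release() standing in for
kvmppc_grab_hwthread()/kvmppc_release_hwthread():

#include <stdbool.h>
#include <stdio.h>

#define THREADS_PER_CORE 4

static bool grabbed[THREADS_PER_CORE];

static int try_grab(int thr)
{
	if (thr == 2)		/* pretend thread 2 is busy in the kernel */
		return -1;
	grabbed[thr] = true;
	return 0;
}

static void release(int thr) { grabbed[thr] = false; }

/* Returns 1 if all secondary threads were grabbed, 0 after rollback. */
static int grab_all_secondaries(void)
{
	int thr;

	for (thr = 1; thr < THREADS_PER_CORE; ++thr) {
		if (try_grab(thr)) {
			/* couldn't grab one; let the others go, including
			 * the partial state of the slot that just failed */
			do {
				release(thr);
			} while (--thr > 0);
			return 0;
		}
	}
	return 1;
}

int main(void)
{
	printf("grabbed all: %d\n", grab_all_secondaries());	/* prints 0 */
	return 0;
}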

[PATCH 02/10] KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online

2012-09-20 Thread Paul Mackerras
When a Book3S HV KVM guest is running, we need the host to be in
single-thread mode, that is, all of the cores (or at least all of
the cores where the KVM guest could run) to be running only one
active hardware thread.  This is because of the hardware restriction
in POWER processors that all of the hardware threads in the core
must be in the same logical partition.  Complying with this restriction
is much easier if, from the host kernel's point of view, only one
hardware thread is active.

This adds two hooks in the SMP hotplug code to allow the KVM code to
make sure that secondary threads (i.e. hardware threads other than
thread 0) cannot come online while any KVM guest exists.  The KVM
code still has to check that any core where it runs a guest has the
secondary threads offline, but having done that check it can now be
sure that they will not come online while the guest is running.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/smp.h |8 +++
 arch/powerpc/kernel/smp.c  |   46 
 arch/powerpc/kvm/book3s_hv.c   |   12 +--
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ebc24dc..b625a1a 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -66,6 +66,14 @@ void generic_cpu_die(unsigned int cpu);
 void generic_mach_cpu_die(void);
 void generic_set_cpu_dead(unsigned int cpu);
 int generic_check_cpu_restart(unsigned int cpu);
+
+extern void inhibit_secondary_onlining(void);
+extern void uninhibit_secondary_onlining(void);
+
+#else /* HOTPLUG_CPU */
+static inline void inhibit_secondary_onlining(void) {}
+static inline void uninhibit_secondary_onlining(void) {}
+
 #endif
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0321007..c45f51d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -410,6 +410,45 @@ int generic_check_cpu_restart(unsigned int cpu)
 {
return per_cpu(cpu_state, cpu) == CPU_UP_PREPARE;
 }
+
+static atomic_t secondary_inhibit_count;
+
+/*
+ * Don't allow secondary CPU threads to come online
+ */
+void inhibit_secondary_onlining(void)
+{
+   /*
+* This makes secondary_inhibit_count stable during cpu
+* online/offline operations.
+*/
+   get_online_cpus();
+
+   atomic_inc(&secondary_inhibit_count);
+   put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(inhibit_secondary_onlining);
+
+/*
+ * Allow secondary CPU threads to come online again
+ */
+void uninhibit_secondary_onlining(void)
+{
+   get_online_cpus();
+   atomic_dec(&secondary_inhibit_count);
+   put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(uninhibit_secondary_onlining);
+
+static int secondaries_inhibited(void)
+{
+   return atomic_read(&secondary_inhibit_count);
+}
+
+#else /* HOTPLUG_CPU */
+
+#define secondaries_inhibited() 0
+
 #endif
 
 static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
@@ -428,6 +467,13 @@ int __cpuinit __cpu_up(unsigned int cpu, struct task_struct *tidle)
 {
int rc, c;
 
+   /*
+* Don't allow secondary threads to come online if inhibited
+*/
+   if (threads_per_core > 1 && secondaries_inhibited() &&
+   cpu % threads_per_core != 0)
+   return -EBUSY;
+
if (smp_ops == NULL ||
(smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)))
return -EINVAL;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index bebf9cb..6fe1410 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -47,6 +47,7 @@
 #include <asm/page.h>
 #include <asm/hvcall.h>
 #include <asm/switch_to.h>
+#include <asm/smp.h>
 #include <linux/gfp.h>
 #include <linux/vmalloc.h>
 #include <linux/highmem.h>
@@ -918,8 +919,6 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
/*
 * Make sure we are running on thread 0, and that
 * secondary threads are offline.
-* XXX we should also block attempts to bring any
-* secondary threads online.
 */
if (threads_per_core > 1 && !on_primary_thread()) {
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
@@ -1632,11 +1631,20 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 
kvm->arch.using_mmu_notifiers = !!cpu_has_feature(CPU_FTR_ARCH_206);
spin_lock_init(&kvm->arch.slot_phys_lock);
+
+   /*
+* Don't allow secondary CPU threads to come online
+* while any KVM VMs exist.
+*/
+   inhibit_secondary_onlining();
+
return 0;
 }
 
 void kvmppc_core_destroy_vm(struct kvm *kvm)
 {
+   uninhibit_secondary_onlining();
+
if (kvm->arch.rma) {
kvm_release_rma(kvm->arch.rma);
kvm->arch.rma = NULL;
-- 
1.7.10
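
To make the onlining rule above concrete, here is a standalone
simulation of just the counting logic (a sketch; the real code's
get_online_cpus()/put_online_cpus() serialization against hotplug is
elided, and -EBUSY prints as -16 on Linux):

#include <errno.h>
#include <stdio.h>

#define THREADS_PER_CORE 4

static int secondary_inhibit_count;

/* Mirrors the check added to __cpu_up(): refuse to online any thread
 * other than thread 0 of a core while the count is non-zero. */
static int cpu_up(int cpu)
{
	if (THREADS_PER_CORE > 1 && secondary_inhibit_count &&
	    cpu % THREADS_PER_CORE != 0)
		return -EBUSY;
	return 0;	/* the real function would boot the cpu here */
}

int main(void)
{
	++secondary_inhibit_count;		/* first VM created */
	printf("cpu 1: %d\n", cpu_up(1));	/* -EBUSY: secondary thread */
	printf("cpu 4: %d\n", cpu_up(4));	/* 0: thread 0 of core 1 */
	--secondary_inhibit_count;		/* last VM destroyed */
	printf("cpu 1: %d\n", cpu_up(1));	/* 0: allowed again */
	return 0;
}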


[PATCH 0/10] HV KVM fixes, reposted

2012-09-20 Thread Paul Mackerras
This is a repost of 10 patches out of a series of 12 that I posted
more than three weeks ago that have had no comments but have not yet
been applied.  They have been rediffed against Alex Graf's current
kvm-ppc-next branch.

This series contains various fixes collected during the process of
getting reboot of Book3S HV guests to work correctly, plus some needed
for Ben H's forthcoming series to implement in-kernel XICS (interrupt
controller) emulation.  As part of getting reboot to work, we have
relaxed the previous policy where, on POWER7, a virtual core would
only run when all of the vcpus in it were ready to run (or were idle).
Now a virtual core will run as soon as any one of its vcpus are ready.
This avoids the problem where the guest wouldn't run after reboot
because userspace (qemu) had stopped all except cpu 0.

Please apply.

Paul.


[PATCH 06/10] KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock

2012-09-20 Thread Paul Mackerras
There were a few places where we were traversing the list of runnable
threads in a virtual core, i.e. vc->runnable_threads, without holding
the vcore spinlock.  This extends the places where we hold the vcore
spinlock to cover everywhere that we traverse that list.

Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
this moves the call of it from kvmppc_handle_exit out to
kvmppc_vcpu_run, where we don't hold the vcore lock.

In kvmppc_vcore_blocked, we don't actually need to check whether
all vcpus are ceded and don't have any pending exceptions, since the
caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
actually checking for pending exceptions, so we add that.

The change of if to while in kvmppc_run_vcpu is to make sure that we
never call kvmppc_remove_runnable() when the vcore state is RUNNING or
EXITING.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_asm.h |1 +
 arch/powerpc/kvm/book3s_hv.c   |   64 +---
 2 files changed, 31 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 76fdcfe..fb99a21 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -123,6 +123,7 @@
 #define RESUME_GUEST_NV RESUME_FLAG_NV
 #define RESUME_HOST RESUME_FLAG_HOST
 #define RESUME_HOST_NV  (RESUME_FLAG_HOST|RESUME_FLAG_NV)
+#define RESUME_PAGE_FAULT  (1<<2)
 
 #define KVM_GUEST_MODE_NONE0
 #define KVM_GUEST_MODE_GUEST   1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cd3dc12..bd3c5c1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -466,7 +466,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
  struct task_struct *tsk)
 {
int r = RESUME_HOST;
-   int srcu_idx;
 
vcpu->stat.sum_exits++;
 
@@ -526,16 +525,12 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 * have been handled already.
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
-   srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   vcpu->arch.fault_dar, vcpu->arch.fault_dsisr);
-   srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+   r = RESUME_PAGE_FAULT;
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
-   srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   kvmppc_get_pc(vcpu), 0);
-   srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+   vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
+   vcpu->arch.fault_dsisr = 0;
+   r = RESUME_PAGE_FAULT;
break;
/*
 * This occurs if the guest executes an illegal instruction.
@@ -880,22 +875,24 @@ static int on_primary_thread(void)
  * Run a set of guest threads on a physical core.
  * Called with vc-lock held.
  */
-static int kvmppc_run_core(struct kvmppc_vcore *vc)
+static void kvmppc_run_core(struct kvmppc_vcore *vc)
 {
struct kvm_vcpu *vcpu, *vcpu0, *vnext;
long ret;
u64 now;
int ptid, i, need_vpa_update;
int srcu_idx;
+   struct kvm_vcpu *vcpus_to_update[threads_per_core];
 
/* don't start if any threads have a signal pending */
need_vpa_update = 0;
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
if (signal_pending(vcpu->arch.run_task))
-   return 0;
-   need_vpa_update |= vcpu->arch.vpa.update_pending |
-   vcpu->arch.slb_shadow.update_pending |
-   vcpu->arch.dtl.update_pending;
+   return;
+   if (vcpu->arch.vpa.update_pending ||
+   vcpu->arch.slb_shadow.update_pending ||
+   vcpu->arch.dtl.update_pending)
+   vcpus_to_update[need_vpa_update++] = vcpu;
}
 
/*
@@ -915,8 +912,8 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 */
if (need_vpa_update) {
spin_unlock(&vc->lock);
-   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
-   kvmppc_update_vpas(vcpu);
+   for (i = 0; i < need_vpa_update; ++i)
+   kvmppc_update_vpas(vcpus_to_update[i]);
spin_lock(&vc->lock);
}
 
@@ -933,8 +930,10 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
vcpu->arch.ptid = ptid++;
}
}
-   if (!vcpu0)
-   return 0;   /* nothing to run */
+   if (!vcpu0) {
+   vc->vcore_state = VCORE_INACTIVE;
+   return; /* nothing to run; should 
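
The hunk is truncated here.  The locking pattern this patch adopts --
snapshot, under the vcore lock, the vcpus that need (possibly
sleeping) VPA work, then drop the lock to do that work -- looks like
this in isolation.  This is a sketch with a pthread mutex standing in
for the kernel spinlock, since a spinlock must not be held across
sleeping calls:

#include <pthread.h>

#define MAX_VCPUS 4

struct vcpu { int needs_update; };

static pthread_mutex_t vc_lock = PTHREAD_MUTEX_INITIALIZER;
static struct vcpu vcpus[MAX_VCPUS];

static void update_vpa(struct vcpu *v) { v->needs_update = 0; /* may sleep */ }

static void update_pending_vpas(void)
{
	struct vcpu *to_update[MAX_VCPUS];
	int i, n = 0;

	pthread_mutex_lock(&vc_lock);
	for (i = 0; i < MAX_VCPUS; i++)		/* traverse only under the lock */
		if (vcpus[i].needs_update)
			to_update[n++] = &vcpus[i];
	pthread_mutex_unlock(&vc_lock);

	for (i = 0; i < n; i++)			/* sleeping work, lock dropped */
		update_vpa(to_update[i]);
}

int main(void)
{
	vcpus[1].needs_update = 1;
	update_pending_vpas();
	return vcpus[1].needs_update;	/* 0: the update ran unlocked */
}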

[PATCH 04/10] KVM: PPC: Book3S HV: Remove bogus update of physical thread IDs

2012-09-20 Thread Paul Mackerras
When making a vcpu non-runnable we incorrectly changed the
thread IDs of all other threads on the core, just remove that
code.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6fe1410..a917603 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -759,17 +759,11 @@ extern void xics_wake_cpu(int cpu);
 static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
   struct kvm_vcpu *vcpu)
 {
-   struct kvm_vcpu *v;
-
if (vcpu->arch.state != KVMPPC_VCPU_RUNNABLE)
return;
vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
--vc->n_runnable;
++vc->n_busy;
-   /* decrement the physical thread id of each following vcpu */
-   v = vcpu;
-   list_for_each_entry_continue(v, &vc->runnable_threads, arch.run_list)
-   --v->arch.ptid;
list_del(&vcpu->arch.run_list);
 }
 
-- 
1.7.10



[PATCH 10/10] KVM: PPC: Book3S HV: Fix calculation of guest phys address for MMIO emulation

2012-09-20 Thread Paul Mackerras
In the case where the host kernel is using a 64kB base page size and
the guest uses a 4k HPTE (hashed page table entry) to map an emulated
MMIO device, we were calculating the guest physical address wrongly.
We were calculating a gfn as the guest physical address shifted right
16 bits (PAGE_SHIFT) but then only adding back in 12 bits from the
effective address, since the HPTE had a 4k page size.  Thus the gpa
reported to userspace was missing 4 bits.

Instead, we now compute the guest physical address from the HPTE
without reference to the host page size, and then compute the gfn
by shifting the gpa right PAGE_SHIFT bits.

Reported-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f598366..7a4aae9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -571,7 +571,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
struct kvm *kvm = vcpu->kvm;
unsigned long *hptep, hpte[3], r;
unsigned long mmu_seq, psize, pte_size;
-   unsigned long gfn, hva, pfn;
+   unsigned long gpa, gfn, hva, pfn;
struct kvm_memory_slot *memslot;
unsigned long *rmap;
struct revmap_entry *rev;
@@ -609,15 +609,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
/* Translate the logical address and get the page */
psize = hpte_page_size(hpte[0], r);
-   gfn = hpte_rpn(r, psize);
+   gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
+   gfn = gpa >> PAGE_SHIFT;
memslot = gfn_to_memslot(kvm, gfn);
 
/* No memslot means it's an emulated MMIO region */
-   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-   unsigned long gpa = (gfn << PAGE_SHIFT) | (ea & (psize - 1));
+   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
  dsisr & DSISR_ISSTORE);
-   }
 
if (!kvm->arch.using_mmu_notifiers)
return -EFAULT; /* should never get here */
-- 
1.7.10
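
A worked example of the fix, runnable in userspace: with a 64k host
page size (PAGE_SHIFT = 16) and a 4k HPTE mapping, the old code
dropped bits 12-15 of the guest physical address.  The values below
are made up purely for illustration:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 16			/* 64k host pages */

int main(void)
{
	uint64_t psize = 0x1000;	/* 4k HPTE page size */
	uint64_t rpn_bits = 0x12345000;	/* r & HPTE_R_RPN for the HPTE */
	uint64_t ea = 0xabc;		/* low bits of the faulting address */

	/* old: gfn built with PAGE_SHIFT, then only psize-1 bits added back */
	uint64_t gfn_old = rpn_bits >> PAGE_SHIFT;
	uint64_t gpa_old = (gfn_old << PAGE_SHIFT) | (ea & (psize - 1));

	/* new: gpa computed directly from the HPTE, gfn derived from it */
	uint64_t gpa_new = (rpn_bits & ~(psize - 1)) | (ea & (psize - 1));
	uint64_t gfn_new = gpa_new >> PAGE_SHIFT;

	printf("old gpa %#llx, new gpa %#llx (gfn %#llx)\n",
	       (unsigned long long)gpa_old, (unsigned long long)gpa_new,
	       (unsigned long long)gfn_new);
	/* old gpa 0x12340abc loses the 0x5000; new gpa 0x12345abc keeps it */
	return 0;
}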



[PATCH 07/10] KVM: PPC: Book3S HV: Fixes for late-joining threads

2012-09-20 Thread Paul Mackerras
If a thread in a virtual core becomes runnable while other threads
in the same virtual core are already running in the guest, it is
possible for the latecomer to join the others on the core without
first pulling them all out of the guest.  Currently this only happens
rarely, when a vcpu is first started.  This fixes some bugs and
omissions in the code in this case.

First, we need to check for VPA updates for the latecomer and make
a DTL entry for it.  Secondly, if it comes along while the master
vcpu is doing a VPA update, we don't need to do anything since the
master will pick it up in kvmppc_run_core.  To handle this correctly
we introduce a new vcore state, VCORE_STARTING.  Thirdly, there is
a race because we currently clear the hardware thread's hwthread_req
before waiting to see it get to nap.  A latecomer thread could have
its hwthread_req cleared before it gets to test it, and therefore
never increment the nap_count, leading to messages about wait_for_nap
timeouts.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |7 ---
 arch/powerpc/kvm/book3s_hv.c|   14 +++---
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 68f5a30..218534d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -289,9 +289,10 @@ struct kvmppc_vcore {
 
 /* Values for vcore_state */
 #define VCORE_INACTIVE 0
-#define VCORE_RUNNING  1
-#define VCORE_EXITING  2
-#define VCORE_SLEEPING 3
+#define VCORE_SLEEPING 1
+#define VCORE_STARTING 2
+#define VCORE_RUNNING  3
+#define VCORE_EXITING  4
 
 /*
  * Struct used to manage memory for a virtual processor area
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index bd3c5c1..8e84625 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -368,6 +368,11 @@ static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, struct kvmppc_vpa *vpap)
 
 static void kvmppc_update_vpas(struct kvm_vcpu *vcpu)
 {
+   if (!(vcpu->arch.vpa.update_pending ||
+ vcpu->arch.slb_shadow.update_pending ||
+ vcpu->arch.dtl.update_pending))
+   return;
+
spin_lock(&vcpu->arch.vpa_update_lock);
if (vcpu->arch.vpa.update_pending) {
kvmppc_update_vpa(vcpu, &vcpu->arch.vpa);
@@ -902,7 +907,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
vc->n_woken = 0;
vc->nap_count = 0;
vc->entry_exit_count = 0;
-   vc->vcore_state = VCORE_RUNNING;
+   vc->vcore_state = VCORE_STARTING;
vc->in_guest = 0;
vc->napping_threads = 0;
 
@@ -955,6 +960,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
kvmppc_create_dtl_entry(vcpu, vc);
}
 
+   vc->vcore_state = VCORE_RUNNING;
preempt_disable();
spin_unlock(&vc->lock);
 
@@ -963,8 +969,6 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
srcu_idx = srcu_read_lock(&vcpu0->kvm->srcu);
 
__kvmppc_vcore_entry(NULL, vcpu0);
-   for (i = 0; i < threads_per_core; ++i)
-   kvmppc_release_hwthread(vc->pcpu + i);
 
spin_lock(&vc->lock);
/* disable sending of IPIs on virtual external irqs */
@@ -973,6 +977,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
/* wait for secondary threads to finish writing their state to memory */
if (vc->nap_count < vc->n_woken)
kvmppc_wait_for_nap(vc);
+   for (i = 0; i < threads_per_core; ++i)
+   kvmppc_release_hwthread(vc->pcpu + i);
/* prevent other vcpu threads from doing kvmppc_start_thread() now */
vc->vcore_state = VCORE_EXITING;
spin_unlock(&vc->lock);
@@ -1063,6 +1069,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
kvm_run->exit_reason = 0;
vcpu->arch.ret = RESUME_GUEST;
vcpu->arch.trap = 0;
+   kvmppc_update_vpas(vcpu);
 
/*
 * Synchronize with other threads in this virtual core
@@ -1086,6 +1093,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
if (vc->vcore_state == VCORE_RUNNING &&
VCORE_EXIT_COUNT(vc) == 0) {
vcpu->arch.ptid = vc->n_runnable - 1;
+   kvmppc_create_dtl_entry(vcpu, vc);
kvmppc_start_thread(vcpu);
}
 
-- 
1.7.10



[PATCH 08/10] KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run

2012-09-20 Thread Paul Mackerras
Currently the Book3S HV code implements a policy on multi-threaded
processors (i.e. POWER7) that requires all of the active vcpus in a
virtual core to be ready to run before we run the virtual core.
However, that causes problems on reset, because reset stops all vcpus
except vcpu 0, and can also reduce throughput since all four threads
in a virtual core have to wait whenever any one of them hits a
hypervisor page fault.

This relaxes the policy, allowing the virtual core to run as soon as
any vcpu in it is runnable.  With this, the KVMPPC_VCPU_STOPPED state
and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single
KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish
between them.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |5 +--
 arch/powerpc/kvm/book3s_hv.c|   74 ++-
 2 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 218534d..1e8cbd1 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -563,9 +563,8 @@ struct kvm_vcpu_arch {
 };
 
 /* Values for vcpu->arch.state */
-#define KVMPPC_VCPU_STOPPED0
-#define KVMPPC_VCPU_BUSY_IN_HOST   1
-#define KVMPPC_VCPU_RUNNABLE   2
+#define KVMPPC_VCPU_NOTREADY   0
+#define KVMPPC_VCPU_RUNNABLE   1
 
 /* Values for vcpu-arch.io_gpr */
 #define KVM_MMIO_REG_MASK  0x001f
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8e84625..dc34a69 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -669,10 +669,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 
kvmppc_mmu_book3s_hv_init(vcpu);
 
-   /*
-* We consider the vcpu stopped until we see the first run ioctl for it.
-*/
-   vcpu->arch.state = KVMPPC_VCPU_STOPPED;
+   vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
 
init_waitqueue_head(&vcpu->arch.cpu_run);
 
@@ -759,9 +756,8 @@ static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
 {
if (vcpu->arch.state != KVMPPC_VCPU_RUNNABLE)
return;
-   vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
+   vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
--vc->n_runnable;
-   ++vc->n_busy;
list_del(&vcpu->arch.run_list);
 }
 
@@ -1062,7 +1058,6 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
 static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 {
int n_ceded;
-   int prev_state;
struct kvmppc_vcore *vc;
struct kvm_vcpu *v, *vn;
 
@@ -1079,7 +1074,6 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
vcpu->arch.ceded = 0;
vcpu->arch.run_task = current;
vcpu->arch.kvm_run = kvm_run;
-   prev_state = vcpu->arch.state;
vcpu->arch.state = KVMPPC_VCPU_RUNNABLE;
list_add_tail(&vcpu->arch.run_list, &vc->runnable_threads);
++vc->n_runnable;
@@ -1089,35 +1083,26 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 * If the vcore is already running, we may be able to start
 * this thread straight away and have it join in.
 */
-   if (prev_state == KVMPPC_VCPU_STOPPED) {
+   if (!signal_pending(current)) {
if (vc->vcore_state == VCORE_RUNNING &&
VCORE_EXIT_COUNT(vc) == 0) {
vcpu->arch.ptid = vc->n_runnable - 1;
kvmppc_create_dtl_entry(vcpu, vc);
kvmppc_start_thread(vcpu);
+   } else if (vc->vcore_state == VCORE_SLEEPING) {
+   wake_up(&vc->wq);
}
 
-   } else if (prev_state == KVMPPC_VCPU_BUSY_IN_HOST)
-   --vc->n_busy;
+   }
 
while (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE &&
   !signal_pending(current)) {
-   if (vc->n_busy || vc->vcore_state != VCORE_INACTIVE) {
+   if (vc->vcore_state != VCORE_INACTIVE) {
spin_unlock(&vc->lock);
kvmppc_wait_for_exec(vcpu, TASK_INTERRUPTIBLE);
spin_lock(&vc->lock);
continue;
}
-   vc->runner = vcpu;
-   n_ceded = 0;
-   list_for_each_entry(v, &vc->runnable_threads, arch.run_list)
-   if (!v->arch.pending_exceptions)
-   n_ceded += v->arch.ceded;
-   if (n_ceded == vc->n_runnable)
-   kvmppc_vcore_blocked(vc);
-   else
-   kvmppc_run_core(vc);
-
list_for_each_entry_safe(v, vn, &vc->runnable_threads,
 arch.run_list) {
kvmppc_core_prepare_to_enter(v);
@@ -1129,23 +1114,40 @@ static int 
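
The remainder of the diff is cut off.  The decision logic visible in
the removed lines above -- block the vcore only when every runnable
vcpu has ceded and none has a pending exception, otherwise run the
core -- reduces to this standalone sketch:

#include <stdbool.h>
#include <stdio.h>

struct vcpu { bool ceded; bool pending_exceptions; };

static bool should_block(const struct vcpu *v, int n_runnable)
{
	int n_ceded = 0, i;

	for (i = 0; i < n_runnable; i++)
		if (!v[i].pending_exceptions)
			n_ceded += v[i].ceded;
	return n_ceded == n_runnable;
}

int main(void)
{
	struct vcpu vcpus[2] = { { true, false }, { false, false } };

	printf("%s\n", should_block(vcpus, 2) ? "block" : "run core");
	return 0;	/* prints "run core": one vcpu has not ceded */
}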

[PATCH v3 2/2] KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface

2012-09-20 Thread Paul Mackerras
This enables userspace to get and set all the guest floating-point
state using the KVM_[GS]ET_ONE_REG ioctls.  The floating-point state
includes all of the traditional floating-point registers and the
FPSCR (floating point status/control register), all the VMX/Altivec
vector registers and the VSCR (vector status/control register), and
on POWER7, the vector-scalar registers (note that each FP register
is the high-order half of the corresponding VSR).

Most of these are implemented in common Book 3S code, except for VSX
on POWER7.  Because HV and PR differ in how they store the FP and VSX
registers on POWER7, the code for these cases is not common.  On POWER7,
the FP registers are the upper halves of the VSX registers vsr0 - vsr31.
PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the
arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas
HV KVM on POWER7 stores the whole VSX register in arch.vsr[].

Signed-off-by: Paul Mackerras pau...@samba.org
---
v3: Handle most registers in common book3s code, where possible

 Documentation/virtual/kvm/api.txt  |   11 +
 arch/powerpc/include/asm/kvm.h |   20 
 arch/powerpc/include/asm/kvm_ppc.h |2 ++
 arch/powerpc/kvm/book3s.c  |   44 
 arch/powerpc/kvm/book3s_hv.c   |   42 ++
 arch/powerpc/kvm/book3s_pr.c   |   24 
 6 files changed, 143 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index e4a2067..02bb8e1 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1759,6 +1759,17 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_PMC6  | 32
   PPC   | KVM_REG_PPC_PMC7  | 32
   PPC   | KVM_REG_PPC_PMC8  | 32
+  PPC   | KVM_REG_PPC_FPR0 | 64
+  ...
+  PPC   | KVM_REG_PPC_FPR31 | 64
+  PPC   | KVM_REG_PPC_VR0  | 128
+  ...
+  PPC   | KVM_REG_PPC_VR31  | 128
+  PPC   | KVM_REG_PPC_VSR0 | 128
+  ...
+  PPC   | KVM_REG_PPC_VSR31 | 128
+  PPC   | KVM_REG_PPC_FPSCR | 64
+  PPC   | KVM_REG_PPC_VSCR  | 32
 
 4.69 KVM_GET_ONE_REG
 
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 9557576..1466975 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -360,4 +360,24 @@ struct kvm_book3e_206_tlb_params {
 #define KVM_REG_PPC_PMC7   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1e)
 #define KVM_REG_PPC_PMC8   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1f)
 
+/* 32 floating-point registers */
+#define KVM_REG_PPC_FPR0   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x20)
+#define KVM_REG_PPC_FPR(n) (KVM_REG_PPC_FPR0 + (n))
+#define KVM_REG_PPC_FPR31  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x3f)
+
+/* 32 VMX/Altivec vector registers */
+#define KVM_REG_PPC_VR0(KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x40)
+#define KVM_REG_PPC_VR(n)  (KVM_REG_PPC_VR0 + (n))
+#define KVM_REG_PPC_VR31   (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x5f)
+
+/* 32 double-width FP registers for VSX */
+/* High-order halves overlap with FP regs */
+#define KVM_REG_PPC_VSR0   (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x60)
+#define KVM_REG_PPC_VSR(n) (KVM_REG_PPC_VSR0 + (n))
+#define KVM_REG_PPC_VSR31  (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x7f)
+
+/* FP and vector status/control registers */
+#define KVM_REG_PPC_FPSCR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x80)
+#define KVM_REG_PPC_VSCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x81)
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 6002b0a..d9fb406 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -199,6 +199,8 @@ static inline u32 kvmppc_set_field(u64 inst, int msb, int lsb, int value)
 union kvmppc_one_reg {
u32 wval;
u64 dval;
+   vector128 vval;
+   u64 vsxval[2];
 };
 
 void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6d1306c..abdd9ef 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -512,6 +512,28 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg)
case KVM_REG_PPC_DSISR:
val.wval = vcpu->arch.shared->dsisr;
break;
+   case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
+   val.dval = vcpu->arch.fpr[reg->id - KVM_REG_PPC_FPR0];
+   break;
+   case KVM_REG_PPC_FPSCR:
+   val.dval = vcpu->arch.fpscr;
+   break;
+#ifdef CONFIG_ALTIVEC
+   case KVM_REG_PPC_VR0 ... KVM_REG_PPC_VR31:
+   if (!cpu_has_feature(CPU_FTR_ALTIVEC)) {
+   r = -ENXIO;
+ 
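
The diff is truncated here.  KVM_GET_ONE_REG and struct kvm_one_reg
are the existing ONE_REG interface; the KVM_REG_PPC_FPR() id comes
from this patch's kvm.h hunk, so the sketch below only compiles
against headers with the patch applied.  vcpu_fd is assumed to be an
open vcpu file descriptor:

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int read_fpr0(int vcpu_fd)
{
	uint64_t val;
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_FPR(0),	/* 64-bit FP register 0 */
		.addr = (uintptr_t)&val,	/* kernel copies to here */
	};

	if (ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg) < 0) {
		perror("KVM_GET_ONE_REG");
		return -1;
	}
	printf("FPR0 = %#llx\n", (unsigned long long)val);
	return 0;
}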

[PATCH v3 1/2] KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface

2012-09-20 Thread Paul Mackerras
This enables userspace to get and set various SPRs (special-purpose
registers) using the KVM_[GS]ET_ONE_REG ioctls.  With this, userspace
can get and set all the SPRs that are part of the guest state, either
through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
the KVM_[GS]ET_ONE_REG ioctls.

The SPRs that are added here are:

- DABR:  Data address breakpoint register
- DSCR:  Data stream control register
- PURR:  Processor utilization of resources register
- SPURR: Scaled PURR
- DAR:   Data address register
- DSISR: Data storage interrupt status register
- AMR:   Authority mask register
- UAMOR: User authority mask override register
- MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
- PMC1..PMC8: Performance monitor unit counter registers

In order to reduce code duplication between PR and HV KVM code, this
moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and
centralizes the copying between user and kernel space there.  The
registers that are handled differently between PR and HV, and those
that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg()
functions that are specific to each flavor.

Signed-off-by: Paul Mackerras pau...@samba.org
---
v3: handle DAR and DSISR, plus copy to/from userspace, in common code

 Documentation/virtual/kvm/api.txt  |   19 +
 arch/powerpc/include/asm/kvm.h |   21 ++
 arch/powerpc/include/asm/kvm_ppc.h |7 
 arch/powerpc/kvm/book3s.c  |   74 +++
 arch/powerpc/kvm/book3s_hv.c   |   76 ++--
 arch/powerpc/kvm/book3s_pr.c   |   23 ++-
 6 files changed, 196 insertions(+), 24 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 76a07a6..e4a2067 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1740,6 +1740,25 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_IAC4  | 64
   PPC   | KVM_REG_PPC_DAC1  | 64
   PPC   | KVM_REG_PPC_DAC2  | 64
+  PPC   | KVM_REG_PPC_DABR  | 64
+  PPC   | KVM_REG_PPC_DSCR  | 64
+  PPC   | KVM_REG_PPC_PURR  | 64
+  PPC   | KVM_REG_PPC_SPURR | 64
+  PPC   | KVM_REG_PPC_DAR   | 64
+  PPC   | KVM_REG_PPC_DSISR | 32
+  PPC   | KVM_REG_PPC_AMR   | 64
+  PPC   | KVM_REG_PPC_UAMOR | 64
+  PPC   | KVM_REG_PPC_MMCR0 | 64
+  PPC   | KVM_REG_PPC_MMCR1 | 64
+  PPC   | KVM_REG_PPC_MMCRA | 64
+  PPC   | KVM_REG_PPC_PMC1  | 32
+  PPC   | KVM_REG_PPC_PMC2  | 32
+  PPC   | KVM_REG_PPC_PMC3  | 32
+  PPC   | KVM_REG_PPC_PMC4  | 32
+  PPC   | KVM_REG_PPC_PMC5  | 32
+  PPC   | KVM_REG_PPC_PMC6  | 32
+  PPC   | KVM_REG_PPC_PMC7  | 32
+  PPC   | KVM_REG_PPC_PMC8  | 32
 
 4.69 KVM_GET_ONE_REG
 
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 3c14202..9557576 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -338,5 +338,26 @@ struct kvm_book3e_206_tlb_params {
 #define KVM_REG_PPC_IAC4   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x5)
 #define KVM_REG_PPC_DAC1   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x6)
 #define KVM_REG_PPC_DAC2   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x7)
+#define KVM_REG_PPC_DABR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8)
+#define KVM_REG_PPC_DSCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9)
+#define KVM_REG_PPC_PURR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa)
+#define KVM_REG_PPC_SPURR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb)
+#define KVM_REG_PPC_DAR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc)
+#define KVM_REG_PPC_DSISR  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xd)
+#define KVM_REG_PPC_AMR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xe)
+#define KVM_REG_PPC_UAMOR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xf)
+
+#define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
+#define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
+#define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+
+#define KVM_REG_PPC_PMC1   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
+#define KVM_REG_PPC_PMC2   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
+#define KVM_REG_PPC_PMC3   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1a)
+#define KVM_REG_PPC_PMC4   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1b)
+#define KVM_REG_PPC_PMC5   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1c)
+#define KVM_REG_PPC_PMC6   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1d)
+#define KVM_REG_PPC_PMC7   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1e)
+#define KVM_REG_PPC_PMC8   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1f)
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2c94cb3..6002b0a 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -196,6 +196,11 @@ static inline u32 kvmppc_set_field(u64 inst, int msb, int lsb, int value)
return r;
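
A companion sketch for the SPR ids above: writing the guest DSCR from
userspace with KVM_SET_ONE_REG.  KVM_REG_PPC_DSCR comes from this
patch's kvm.h hunk (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9), so this is
illustrative and only compiles against headers with the patch applied:

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_dscr(int vcpu_fd, uint64_t value)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_DSCR,	/* 64-bit SPR id from this patch */
		.addr = (uintptr_t)&value,	/* kernel copies from here */
	};

	if (ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg) < 0) {
		perror("KVM_SET_ONE_REG");
		return -1;
	}
	return 0;
}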

Re: [PATCH] powerpc-kvm: fixing page alignment for TCE

2012-09-20 Thread Alexander Graf

On 04.09.2012, at 09:36, Alexey Kardashevskiy wrote:

 From: Paul Mackerras pau...@samba.org
 
 TODO: ask Paul to make a proper message.

TODO?

Also, Ben or Paul, please ack if you think it's correct.


Alex

 
 This is the fix for a host kernel compiled with a page size
 other than 4K (TCE page size). In the case of a 64K page size,
 the host used to lose address bits in hpte_rpn().
 The patch fixes it.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)
 
 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 80a5775..a41f11b 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -503,7 +503,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
   struct kvm *kvm = vcpu->kvm;
   unsigned long *hptep, hpte[3], r;
   unsigned long mmu_seq, psize, pte_size;
 - unsigned long gfn, hva, pfn;
 + unsigned long gpa, gfn, hva, pfn;
   struct kvm_memory_slot *memslot;
   unsigned long *rmap;
   struct revmap_entry *rev;
 @@ -541,15 +541,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 
   /* Translate the logical address and get the page */
   psize = hpte_page_size(hpte[0], r);
 - gfn = hpte_rpn(r, psize);
  + gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
  + gfn = gpa >> PAGE_SHIFT;
   memslot = gfn_to_memslot(kvm, gfn);
 
   /* No memslot means it's an emulated MMIO region */
  - if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
  - unsigned long gpa = (gfn << PAGE_SHIFT) | (ea & (psize - 1));
  + if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
    return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
  dsisr & DSISR_ISSTORE);
 - }
 
  if (!kvm->arch.using_mmu_notifiers)
   return -EFAULT; /* should never get here */
 -- 
 1.7.10.4
 


[PATCH 03/10] KVM: PPC: Book3S HV: Fix updates of vcpu->cpu

2012-09-20 Thread Paul Mackerras
This removes the powerpc generic updates of vcpu->cpu in load and
put, and moves them to the various backends.

The reason is that HV KVM does its own sauce with that field
and the generic updates might corrupt it. The field contains the
CPU# of the -first- HW CPU of the core always for all the VCPU
threads of a core (the one that's online from a host Linux
perspective).

However, the preempt notifiers are going to be called on the
threads VCPUs when they are running (due to them sleeping on our
private waitqueue) causing unload to be called, potentially
clobbering the value.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_pr.c |3 ++-
 arch/powerpc/kvm/booke.c |2 ++
 arch/powerpc/kvm/powerpc.c   |2 --
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 4d0667a..bf3ec5d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -64,7 +64,7 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
svcpu->slb_max = to_book3s(vcpu)->slb_shadow_max;
svcpu_put(svcpu);
 #endif
-
+   vcpu->cpu = smp_processor_id();
 #ifdef CONFIG_PPC_BOOK3S_32
current->thread.kvm_shadow_vcpu = to_book3s(vcpu)->shadow_vcpu;
 #endif
@@ -84,6 +84,7 @@ void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
kvmppc_giveup_ext(vcpu, MSR_FP);
kvmppc_giveup_ext(vcpu, MSR_VEC);
kvmppc_giveup_ext(vcpu, MSR_VSX);
+   vcpu->cpu = -1;
 }
 
 int kvmppc_core_check_requests(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 3a6490f..69d047c 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1509,12 +1509,14 @@ void kvmppc_decrementer_func(unsigned long data)
 
 void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+   vcpu->cpu = smp_processor_id();
current->thread.kvm_vcpu = vcpu;
 }
 
 void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu)
 {
current->thread.kvm_vcpu = NULL;
+   vcpu->cpu = -1;
 }
 
 int __init kvmppc_booke_init(void)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2b08564..fd73763 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -510,7 +510,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
mtspr(SPRN_VRSAVE, vcpu->arch.vrsave);
 #endif
kvmppc_core_vcpu_load(vcpu, cpu);
-   vcpu->cpu = smp_processor_id();
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -519,7 +518,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_BOOKE
vcpu->arch.vrsave = mfspr(SPRN_VRSAVE);
 #endif
-   vcpu->cpu = -1;
 }
 
 int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
-- 
1.7.10



[PATCH 02/10] KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online

2012-09-20 Thread Paul Mackerras
When a Book3S HV KVM guest is running, we need the host to be in
single-thread mode, that is, all of the cores (or at least all of
the cores where the KVM guest could run) to be running only one
active hardware thread.  This is because of the hardware restriction
in POWER processors that all of the hardware threads in the core
must be in the same logical partition.  Complying with this restriction
is much easier if, from the host kernel's point of view, only one
hardware thread is active.

This adds two hooks in the SMP hotplug code to allow the KVM code to
make sure that secondary threads (i.e. hardware threads other than
thread 0) cannot come online while any KVM guest exists.  The KVM
code still has to check that any core where it runs a guest has the
secondary threads offline, but having done that check it can now be
sure that they will not come online while the guest is running.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/smp.h |8 +++
 arch/powerpc/kernel/smp.c  |   46 
 arch/powerpc/kvm/book3s_hv.c   |   12 +--
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ebc24dc..b625a1a 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -66,6 +66,14 @@ void generic_cpu_die(unsigned int cpu);
 void generic_mach_cpu_die(void);
 void generic_set_cpu_dead(unsigned int cpu);
 int generic_check_cpu_restart(unsigned int cpu);
+
+extern void inhibit_secondary_onlining(void);
+extern void uninhibit_secondary_onlining(void);
+
+#else /* HOTPLUG_CPU */
+static inline void inhibit_secondary_onlining(void) {}
+static inline void uninhibit_secondary_onlining(void) {}
+
 #endif
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0321007..c45f51d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -410,6 +410,45 @@ int generic_check_cpu_restart(unsigned int cpu)
 {
return per_cpu(cpu_state, cpu) == CPU_UP_PREPARE;
 }
+
+static atomic_t secondary_inhibit_count;
+
+/*
+ * Don't allow secondary CPU threads to come online
+ */
+void inhibit_secondary_onlining(void)
+{
+   /*
+* This makes secondary_inhibit_count stable during cpu
+* online/offline operations.
+*/
+   get_online_cpus();
+
+   atomic_inc(secondary_inhibit_count);
+   put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(inhibit_secondary_onlining);
+
+/*
+ * Allow secondary CPU threads to come online again
+ */
+void uninhibit_secondary_onlining(void)
+{
+   get_online_cpus();
+   atomic_dec(secondary_inhibit_count);
+   put_online_cpus();
+}
+EXPORT_SYMBOL_GPL(uninhibit_secondary_onlining);
+
+static int secondaries_inhibited(void)
+{
+   return atomic_read(secondary_inhibit_count);
+}
+
+#else /* HOTPLUG_CPU */
+
+#define secondaries_inhibited()0
+
 #endif
 
 static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
@@ -428,6 +467,13 @@ int __cpuinit __cpu_up(unsigned int cpu, struct 
task_struct *tidle)
 {
int rc, c;
 
+   /*
+* Don't allow secondary threads to come online if inhibited
+*/
+   if (threads_per_core  1  secondaries_inhibited() 
+   cpu % threads_per_core != 0)
+   return -EBUSY;
+
if (smp_ops == NULL ||
(smp_ops-cpu_bootable  !smp_ops-cpu_bootable(cpu)))
return -EINVAL;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index bebf9cb..6fe1410 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -47,6 +47,7 @@
 #include asm/page.h
 #include asm/hvcall.h
 #include asm/switch_to.h
+#include asm/smp.h
 #include linux/gfp.h
 #include linux/vmalloc.h
 #include linux/highmem.h
@@ -918,8 +919,6 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
/*
 * Make sure we are running on thread 0, and that
 * secondary threads are offline.
-* XXX we should also block attempts to bring any
-* secondary threads online.
 */
if (threads_per_core  1  !on_primary_thread()) {
list_for_each_entry(vcpu, vc-runnable_threads, arch.run_list)
@@ -1632,11 +1631,20 @@ int kvmppc_core_init_vm(struct kvm *kvm)
 
kvm-arch.using_mmu_notifiers = !!cpu_has_feature(CPU_FTR_ARCH_206);
spin_lock_init(kvm-arch.slot_phys_lock);
+
+   /*
+* Don't allow secondary CPU threads to come online
+* while any KVM VMs exist.
+*/
+   inhibit_secondary_onlining();
+
return 0;
 }
 
 void kvmppc_core_destroy_vm(struct kvm *kvm)
 {
+   uninhibit_secondary_onlining();
+
if (kvm-arch.rma) {
kvm_release_rma(kvm-arch.rma);
kvm-arch.rma = NULL;
-- 
1.7.10

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a 

[PATCH 01/10] KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas

2012-09-20 Thread Paul Mackerras
The PAPR paravirtualization interface lets guests register three
different types of per-vCPU buffer areas in its memory for communication
with the hypervisor.  These are called virtual processor areas (VPAs).
Currently the hypercalls to register and unregister VPAs are handled
by KVM in the kernel, and userspace has no way to know about or save
and restore these registrations across a migration.

This adds get and set ioctls to allow userspace to see what addresses
have been registered, and to register or unregister them.  This will
be needed for guest hibernation and migration, and is also needed so
that userspace can unregister them on reset (otherwise we corrupt
guest memory after reboot by writing to the VPAs registered by the
previous kernel).  We also add a capability to indicate that the
ioctls are supported.

This also fixes a bug where we were calling init_vpa unconditionally,
leading to an oops when unregistering the VPA.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt  |   32 +
 arch/powerpc/include/asm/kvm_ppc.h |3 ++
 arch/powerpc/kvm/book3s_hv.c   |   54 +++-
 arch/powerpc/kvm/powerpc.c |   26 +
 include/linux/kvm.h|   11 
 5 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index a12f4e4..76a07a6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1992,6 +1992,38 @@ return the hash table order in the parameter.  (If the 
guest is using
 the virtualized real-mode area (VRMA) facility, the kernel will
 re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
 
+4.77 KVM_PPC_GET_VPA_INFO
+
+Capability: KVM_CAP_PPC_VPA
+Architectures: powerpc
+Type: vcpu ioctl
+Parameters: Pointer to struct kvm_ppc_vpa (out)
+Returns: 0 on success, -1 on error
+
+This populates and returns a structure containing the guest physical
+addresses and sizes of the three per-virtual-processor areas that the
+guest can register with the hypervisor under the PAPR
+paravirtualization interface, namely the Virtual Processor Area, the
+SLB (Segment Lookaside Buffer) Shadow Area, and the Dispatch Trace
+Log.
+
+4.78 KVM_PPC_SET_VPA_INFO
+
+Capability: KVM_CAP_PPC_VPA
+Architectures: powerpc
+Type: vcpu ioctl
+Parameters: Pointer to struct kvm_ppc_vpa (in)
+Returns: 0 on success, -1 on error
+
+This sets the guest physical addresses and sizes of the three
+per-virtual-processor areas that the guest can register with the
+hypervisor under the PAPR paravirtualization interface, namely the
+Virtual Processor Area, the SLB (Segment Lookaside Buffer) Shadow
+Area, and the Dispatch Trace Log.  Providing an address of zero for
+any of these areas causes the kernel to unregister any previously
+registered area; a non-zero address replaces any previously registered
+area.
+
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 3fb980d..2c94cb3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -205,6 +205,9 @@ int kvmppc_set_sregs_ivor(struct kvm_vcpu *vcpu, struct 
kvm_sregs *sregs);
 int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg);
 int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu, struct kvm_one_reg *reg);
 
+int kvm_vcpu_get_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa);
+int kvm_vcpu_set_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa);
+
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 38c7f1b..bebf9cb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -143,6 +143,57 @@ static void init_vpa(struct kvm_vcpu *vcpu, struct lppaca 
*vpa)
vpa->yield_count = 1;
 }
 
+int kvm_vcpu_get_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa)
+{
+   spin_lock(&vcpu->arch.vpa_update_lock);
+   vpa->vpa_addr = vcpu->arch.vpa.next_gpa;
+   vpa->slb_shadow_addr = vcpu->arch.slb_shadow.next_gpa;
+   vpa->slb_shadow_size = vcpu->arch.slb_shadow.len;
+   vpa->dtl_addr = vcpu->arch.dtl.next_gpa;
+   vpa->dtl_size = vcpu->arch.dtl.len;
+   spin_unlock(&vcpu->arch.vpa_update_lock);
+   return 0;
+}
+
+static inline void set_vpa(struct kvmppc_vpa *v, unsigned long addr,
+  unsigned long len)
+{
+   if (v->next_gpa != addr || v->len != len) {
+   v->next_gpa = addr;
+   v->len = addr ? len : 0;
+   v->update_pending = 1;
+   }
+}
+
+int kvm_vcpu_set_vpa_info(struct kvm_vcpu *vcpu, struct kvm_ppc_vpa *vpa)
+{
+   /* check that addresses are cacheline aligned */
+   if ((vpa->vpa_addr & (L1_CACHE_BYTES - 1)) ||
+   (vpa->slb_shadow_addr & (L1_CACHE_BYTES - 

[PATCH 09/10] KVM: PPC: Book3S HV: Fix accounting of stolen time

2012-09-20 Thread Paul Mackerras
Currently the code that accounts stolen time tends to overestimate the
stolen time, and will sometimes report more stolen time in a DTL
(dispatch trace log) entry than has elapsed since the last DTL entry.
This can cause guests to underflow the user or system time measured
for some tasks, leading to ridiculous CPU percentages and total runtimes
being reported by top and other utilities.

In addition, the current code was designed for the previous policy where
a vcore would only run when all the vcpus in it were runnable, and so
only counted stolen time on a per-vcore basis.  Now that a vcore can
run while some of the vcpus in it are doing other things in the kernel
(e.g. handling a page fault), we need to count the time when a vcpu task
is preempted while it is not running as part of a vcore as stolen also.

To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
vcpu_load/put functions to count preemption time while the vcpu is
in that state.  Handling the transitions between the RUNNING and
BUSY_IN_HOST states requires checking and updating two variables
(accumulated time stolen and time last preempted), so we add a new
spinlock, vcpu-arch.tbacct_lock.  This protects both the per-vcpu
stolen/preempt-time variables, and the per-vcore variables while this
vcpu is running the vcore.

Finally, we now don't count time spent in userspace as stolen time.
The task could be executing in userspace on behalf of the vcpu, or
it could be preempted, or the vcpu could be genuinely stopped.  Since
we have no way of dividing up the time between these cases, we don't
count any of it as stolen.
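As a minimal sketch of the accounting pattern described above (hedged:
the field and helper names follow the patch below; the surrounding
kernel plumbing is elided):

	/* Sketch: on vcpu_load, fold any preemption interval into
	 * busy_stolen; TB_NIL means "not currently preempted". */
	static void account_resume(struct kvm_vcpu *vcpu)
	{
		spin_lock(&vcpu->arch.tbacct_lock);
		if (vcpu->arch.state == KVMPPC_VCPU_BUSY_IN_HOST &&
		    vcpu->arch.busy_preempt != TB_NIL) {
			vcpu->arch.busy_stolen += mftb() - vcpu->arch.busy_preempt;
			vcpu->arch.busy_preempt = TB_NIL;
		}
		spin_unlock(&vcpu->arch.tbacct_lock);
	}

The matching vcpu_put path records mftb() into busy_preempt so that
the interval can be closed off on the next load.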

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |5 ++
 arch/powerpc/kvm/book3s_hv.c|  127 ++-
 2 files changed, 117 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1e8cbd1..3093896 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -559,12 +559,17 @@ struct kvm_vcpu_arch {
unsigned long dtl_index;
u64 stolen_logged;
struct kvmppc_vpa slb_shadow;
+
+   spinlock_t tbacct_lock;
+   u64 busy_stolen;
+   u64 busy_preempt;
 #endif
 };
 
 /* Values for vcpu->arch.state */
 #define KVMPPC_VCPU_NOTREADY   0
 #define KVMPPC_VCPU_RUNNABLE   1
+#define KVMPPC_VCPU_BUSY_IN_HOST   2
 
 /* Values for vcpu->arch.io_gpr */
 #define KVM_MMIO_REG_MASK  0x001f
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index dc34a69..f953f73 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -57,23 +57,74 @@
 /* #define EXIT_DEBUG_SIMPLE */
 /* #define EXIT_DEBUG_INT */
 
+/* Used as a null value for timebase values */
+#define TB_NIL (~(u64)0)
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
+/*
+ * We use the vcpu_load/put functions to measure stolen time.
+ * Stolen time is counted as time when either the vcpu is able to
+ * run as part of a virtual core, but the task running the vcore
+ * is preempted or sleeping, or when the vcpu needs something done
+ * in the kernel by the task running the vcpu, but that task is
+ * preempted or sleeping.  Those two things have to be counted
+ * separately, since one of the vcpu tasks will take on the job
+ * of running the core, and the other vcpu tasks in the vcore will
+ * sleep waiting for it to do that, but that sleep shouldn't count
+ * as stolen time.
+ *
+ * Hence we accumulate stolen time when the vcpu can run as part of
+ * a vcore using vc->stolen_tb, and the stolen time when the vcpu
+ * needs its task to do other things in the kernel (for example,
+ * service a page fault) in busy_stolen.  We don't accumulate
+ * stolen time for a vcore when it is inactive, or for a vcpu
+ * when it is in state RUNNING or NOTREADY.  NOTREADY is a bit of
+ * a misnomer; it means that the vcpu task is not executing in
+ * the KVM_VCPU_RUN ioctl, i.e. it is in userspace or elsewhere in
+ * the kernel.  We don't have any way of dividing up that time
+ * between time that the vcpu is genuinely stopped, time that
+ * the task is actively working on behalf of the vcpu, and time
+ * that the task is preempted, so we don't count any of it as
+ * stolen.
+ *
+ * Updates to busy_stolen are protected by arch.tbacct_lock;
+ * updates to vc->stolen_tb are protected by the arch.tbacct_lock
+ * of the vcpu that has taken responsibility for running the vcore
+ * (i.e. vc->runner).  The stolen times are measured in units of
+ * timebase ticks.  (Note that the != TB_NIL checks below are
+ * purely defensive; they should never fail.)
+ */
+
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
-   if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
+   

[PATCH 05/10] KVM: PPC: Book3S HV: Fix some races in starting secondary threads

2012-09-20 Thread Paul Mackerras
Subsequent patches implementing in-kernel XICS emulation will make it
possible for IPIs to arrive at secondary threads at arbitrary times.
This fixes some races in how we start the secondary threads, which
if not fixed could lead to occasional crashes of the host kernel.

This makes sure that (a) we have grabbed all the secondary threads,
and verified that they are no longer in the kernel, before we start
any thread, (b) that the secondary thread loads its vcpu pointer
after clearing the IPI that woke it up (so we don't miss a wakeup),
and (c) that the secondary thread clears its vcpu pointer before
incrementing the nap count.  It also removes unnecessary setting
of the vcpu and vcore pointers in the paca in kvmppc_core_vcpu_load.
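The ordering in (b) amounts to a standard publish/consume pairing; a
hedged C-level sketch (the real secondary-thread side is assembly in
book3s_hv_rmhandlers.S, and clear_ipi() here is an illustrative
stand-in):

	/* Waker: publish the vcpu pointer before sending the IPI. */
	tpaca->kvm_hstate.kvm_vcpu = vcpu;
	smp_wmb();
	xics_wake_cpu(cpu);

	/* Woken secondary thread: clear the IPI first, then load the
	 * vcpu pointer, so a wakeup arriving in between is not lost. */
	clear_ipi();
	smp_rmb();
	vcpu = local_paca->kvm_hstate.kvm_vcpu;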

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv.c|   41 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   11 ++---
 2 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a917603..cd3dc12 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -64,8 +64,6 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
-   local_paca->kvm_hstate.kvm_vcpu = vcpu;
-   local_paca->kvm_hstate.kvm_vcore = vc;
if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
vc->stolen_tb += mftb() - vc->preempt_tb;
 }
@@ -776,6 +774,7 @@ static int kvmppc_grab_hwthread(int cpu)
 
/* Ensure the thread won't go into the kernel if it wakes */
tpaca->kvm_hstate.hwthread_req = 1;
+   tpaca->kvm_hstate.kvm_vcpu = NULL;
 
/*
 * If the thread is already executing in the kernel (e.g. handling
@@ -825,7 +824,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu)
smp_wmb();
 #if defined(CONFIG_PPC_ICP_NATIVE) && defined(CONFIG_SMP)
if (vcpu->arch.ptid) {
-   kvmppc_grab_hwthread(cpu);
xics_wake_cpu(cpu);
++vc->n_woken;
}
@@ -851,7 +849,8 @@ static void kvmppc_wait_for_nap(struct kvmppc_vcore *vc)
 
 /*
  * Check that we are on thread 0 and that any other threads in
- * this core are off-line.
+ * this core are off-line.  Then grab the threads so they can't
+ * enter the kernel.
  */
 static int on_primary_thread(void)
 {
@@ -863,6 +862,17 @@ static int on_primary_thread(void)
while (++thr < threads_per_core)
if (cpu_online(cpu + thr))
return 0;
+
+   /* Grab all hw threads so they can't go into the kernel */
+   for (thr = 1; thr < threads_per_core; ++thr) {
+   if (kvmppc_grab_hwthread(cpu + thr)) {
+   /* Couldn't grab one; let the others go */
+   do {
+   kvmppc_release_hwthread(cpu + thr);
+   } while (--thr > 0);
+   return 0;
+   }
+   }
return 1;
 }
 
@@ -911,16 +921,6 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
}
 
/*
-* Make sure we are running on thread 0, and that
-* secondary threads are offline.
-*/
-   if (threads_per_core > 1 && !on_primary_thread()) {
-   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
-   vcpu->arch.ret = -EBUSY;
-   goto out;
-   }
-
-   /*
 * Assign physical thread IDs, first to non-ceded vcpus
 * and then to ceded ones.
 */
@@ -939,15 +939,22 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
if (vcpu->arch.ceded)
vcpu->arch.ptid = ptid++;
 
+   /*
+* Make sure we are running on thread 0, and that
+* secondary threads are offline.
+*/
+   if (threads_per_core > 1 && !on_primary_thread()) {
+   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
+   vcpu->arch.ret = -EBUSY;
+   goto out;
+   }
+
vc->stolen_tb += mftb() - vc->preempt_tb;
vc->pcpu = smp_processor_id();
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
kvmppc_start_thread(vcpu);
kvmppc_create_dtl_entry(vcpu, vc);
}
-   /* Grab any remaining hw threads so they can't go into the kernel */
-   for (i = ptid; i < threads_per_core; ++i)
-   kvmppc_grab_hwthread(vc->pcpu + i);
 
preempt_disable();
spin_unlock(&vc->lock);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 44b72fe..1e90ef6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -134,8 +134,11 @@ kvm_start_guest:
 
 27:/* XXX should handle hypervisor maintenance interrupts etc. here */
 
+   /* reload vcpu pointer after clearing 

[PATCH 07/10] KVM: PPC: Book3S HV: Fixes for late-joining threads

2012-09-20 Thread Paul Mackerras
If a thread in a virtual core becomes runnable while other threads
in the same virtual core are already running in the guest, it is
possible for the latecomer to join the others on the core without
first pulling them all out of the guest.  Currently this only happens
rarely, when a vcpu is first started.  This fixes some bugs and
omissions in the code in this case.

First, we need to check for VPA updates for the latecomer and make
a DTL entry for it.  Secondly, if it comes along while the master
vcpu is doing a VPA update, we don't need to do anything since the
master will pick it up in kvmppc_run_core.  To handle this correctly
we introduce a new vcore state, VCORE_STARTING.  Thirdly, there is
a race because we currently clear the hardware thread's hwthread_req
before waiting to see it get to nap.  A latecomer thread could have
its hwthread_req cleared before it gets to test it, and therefore
never increment the nap_count, leading to messages about wait_for_nap
timeouts.
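A hedged sketch of the late-joiner path this implies (condensed from
the changes below; only the RUNNING check is shown):

	/* A latecomer only starts its own hardware thread once the
	 * master has finished VPA updates and marked the vcore RUNNING;
	 * during VCORE_STARTING the master will pick it up itself. */
	if (vc->vcore_state == VCORE_RUNNING && VCORE_EXIT_COUNT(vc) == 0) {
		vcpu->arch.ptid = vc->n_runnable - 1;
		kvmppc_create_dtl_entry(vcpu, vc);
		kvmppc_start_thread(vcpu);
	}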

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |7 ---
 arch/powerpc/kvm/book3s_hv.c|   14 +++---
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 68f5a30..218534d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -289,9 +289,10 @@ struct kvmppc_vcore {
 
 /* Values for vcore_state */
 #define VCORE_INACTIVE 0
-#define VCORE_RUNNING  1
-#define VCORE_EXITING  2
-#define VCORE_SLEEPING 3
+#define VCORE_SLEEPING 1
+#define VCORE_STARTING 2
+#define VCORE_RUNNING  3
+#define VCORE_EXITING  4
 
 /*
  * Struct used to manage memory for a virtual processor area
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index bd3c5c1..8e84625 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -368,6 +368,11 @@ static void kvmppc_update_vpa(struct kvm_vcpu *vcpu, 
struct kvmppc_vpa *vpap)
 
 static void kvmppc_update_vpas(struct kvm_vcpu *vcpu)
 {
+   if (!(vcpu->arch.vpa.update_pending ||
+ vcpu->arch.slb_shadow.update_pending ||
+ vcpu->arch.dtl.update_pending))
+   return;
+
spin_lock(&vcpu->arch.vpa_update_lock);
if (vcpu->arch.vpa.update_pending) {
kvmppc_update_vpa(vcpu, &vcpu->arch.vpa);
@@ -902,7 +907,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
vc->n_woken = 0;
vc->nap_count = 0;
vc->entry_exit_count = 0;
-   vc->vcore_state = VCORE_RUNNING;
+   vc->vcore_state = VCORE_STARTING;
vc->in_guest = 0;
vc->napping_threads = 0;
 
@@ -955,6 +960,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
kvmppc_create_dtl_entry(vcpu, vc);
}
 
+   vc->vcore_state = VCORE_RUNNING;
preempt_disable();
spin_unlock(&vc->lock);
 
@@ -963,8 +969,6 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
srcu_idx = srcu_read_lock(&vcpu0->kvm->srcu);
 
__kvmppc_vcore_entry(NULL, vcpu0);
-   for (i = 0; i < threads_per_core; ++i)
-   kvmppc_release_hwthread(vc->pcpu + i);
 
spin_lock(&vc->lock);
/* disable sending of IPIs on virtual external irqs */
@@ -973,6 +977,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
/* wait for secondary threads to finish writing their state to memory */
if (vc->nap_count < vc->n_woken)
kvmppc_wait_for_nap(vc);
+   for (i = 0; i < threads_per_core; ++i)
+   kvmppc_release_hwthread(vc->pcpu + i);
/* prevent other vcpu threads from doing kvmppc_start_thread() now */
vc->vcore_state = VCORE_EXITING;
spin_unlock(&vc->lock);
@@ -1063,6 +1069,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
kvm_run->exit_reason = 0;
vcpu->arch.ret = RESUME_GUEST;
vcpu->arch.trap = 0;
+   kvmppc_update_vpas(vcpu);
 
/*
 * Synchronize with other threads in this virtual core
@@ -1086,6 +1093,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
if (vc->vcore_state == VCORE_RUNNING &&
VCORE_EXIT_COUNT(vc) == 0) {
vcpu->arch.ptid = vc->n_runnable - 1;
+   kvmppc_create_dtl_entry(vcpu, vc);
kvmppc_start_thread(vcpu);
}
 
-- 
1.7.10



[PATCH 10/10] KVM: PPC: Book3S HV: Fix calculation of guest phys address for MMIO emulation

2012-09-20 Thread Paul Mackerras
In the case where the host kernel is using a 64kB base page size and
the guest uses a 4k HPTE (hashed page table entry) to map an emulated
MMIO device, we were calculating the guest physical address wrongly.
We were calculating a gfn as the guest physical address shifted right
16 bits (PAGE_SHIFT) but then only adding back in 12 bits from the
effective address, since the HPTE had a 4k page size.  Thus the gpa
reported to userspace was missing 4 bits.

Instead, we now compute the guest physical address from the HPTE
without reference to the host page size, and then compute the gfn
by shifting the gpa right PAGE_SHIFT bits.
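To make the arithmetic concrete, a worked example (a hedged sketch;
the constants are purely illustrative):

	/* 64k host pages (PAGE_SHIFT = 16), 4k guest HPTE (psize = 0x1000). */
	unsigned long rpn_bits = 0x12345000UL;	/* RPN bits from the HPTE */
	unsigned long ea = 0xabc;		/* offset within the 4k page */

	unsigned long gpa = (rpn_bits & ~0xfffUL) | (ea & 0xfffUL);
						/* = 0x12345abc, correct */
	unsigned long gfn = gpa >> 16;		/* = 0x1234 */

	/* The old code rebuilt the gpa from the gfn and lost bits 12-15: */
	unsigned long old_gpa = (gfn << 16) | (ea & 0xfffUL);
						/* = 0x12340abc, wrong */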

Reported-by: Alexey Kardashevskiy a...@ozlabs.ru
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f598366..7a4aae9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -571,7 +571,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
struct kvm *kvm = vcpu->kvm;
unsigned long *hptep, hpte[3], r;
unsigned long mmu_seq, psize, pte_size;
-   unsigned long gfn, hva, pfn;
+   unsigned long gpa, gfn, hva, pfn;
struct kvm_memory_slot *memslot;
unsigned long *rmap;
struct revmap_entry *rev;
@@ -609,15 +609,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 
/* Translate the logical address and get the page */
psize = hpte_page_size(hpte[0], r);
-   gfn = hpte_rpn(r, psize);
+   gpa = (r & HPTE_R_RPN & ~(psize - 1)) | (ea & (psize - 1));
+   gfn = gpa >> PAGE_SHIFT;
memslot = gfn_to_memslot(kvm, gfn);
 
/* No memslot means it's an emulated MMIO region */
-   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-   unsigned long gpa = (gfn << PAGE_SHIFT) | (ea & (psize - 1));
+   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
  dsisr & DSISR_ISSTORE);
-   }
 
if (!kvm->arch.using_mmu_notifiers)
return -EFAULT; /* should never get here */
-- 
1.7.10



[PATCH 06/10] KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock

2012-09-20 Thread Paul Mackerras
There were a few places where we were traversing the list of runnable
threads in a virtual core, i.e. vc->runnable_threads, without holding
the vcore spinlock.  This extends the places where we hold the vcore
spinlock to cover everywhere that we traverse that list.

Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
this moves the call of it from kvmppc_handle_exit out to
kvmppc_vcpu_run, where we don't hold the vcore lock.

In kvmppc_vcore_blocked, we don't actually need to check whether
all vcpus are ceded and don't have any pending exceptions, since the
caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
actually checking for pending exceptions, so we add that.

The change of if to while in kvmppc_run_vcpu is to make sure that we
never call kvmppc_remove_runnable() when the vcore state is RUNNING or
EXITING.
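Concretely, the fault is now flagged in kvmppc_handle_exit and handled
later without the vcore lock held; a condensed, hedged sketch of the
pattern (names follow the patch):

	/* In kvmppc_handle_exit(), under the vcore lock: */
	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
		r = RESUME_PAGE_FAULT;
		break;

	/* Later, in kvmppc_vcpu_run(), with the vcore lock dropped: */
	if (r == RESUME_PAGE_FAULT) {
		srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
		r = kvmppc_book3s_hv_page_fault(run, vcpu,
				vcpu->arch.fault_dar, vcpu->arch.fault_dsisr);
		srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
	}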

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_asm.h |1 +
 arch/powerpc/kvm/book3s_hv.c   |   64 +---
 2 files changed, 31 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 76fdcfe..fb99a21 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -123,6 +123,7 @@
 #define RESUME_GUEST_NV RESUME_FLAG_NV
 #define RESUME_HOST RESUME_FLAG_HOST
 #define RESUME_HOST_NV  (RESUME_FLAG_HOST|RESUME_FLAG_NV)
+#define RESUME_PAGE_FAULT  (1<<2)
 
 #define KVM_GUEST_MODE_NONE0
 #define KVM_GUEST_MODE_GUEST   1
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cd3dc12..bd3c5c1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -466,7 +466,6 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
  struct task_struct *tsk)
 {
int r = RESUME_HOST;
-   int srcu_idx;
 
vcpu->stat.sum_exits++;
 
@@ -526,16 +525,12 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 * have been handled already.
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
-   srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   vcpu->arch.fault_dar, vcpu->arch.fault_dsisr);
-   srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+   r = RESUME_PAGE_FAULT;
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
-   srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmppc_book3s_hv_page_fault(run, vcpu,
-   kvmppc_get_pc(vcpu), 0);
-   srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+   vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
+   vcpu->arch.fault_dsisr = 0;
+   r = RESUME_PAGE_FAULT;
break;
/*
 * This occurs if the guest executes an illegal instruction.
@@ -880,22 +875,24 @@ static int on_primary_thread(void)
  * Run a set of guest threads on a physical core.
  * Called with vc->lock held.
  */
-static int kvmppc_run_core(struct kvmppc_vcore *vc)
+static void kvmppc_run_core(struct kvmppc_vcore *vc)
 {
struct kvm_vcpu *vcpu, *vcpu0, *vnext;
long ret;
u64 now;
int ptid, i, need_vpa_update;
int srcu_idx;
+   struct kvm_vcpu *vcpus_to_update[threads_per_core];
 
/* don't start if any threads have a signal pending */
need_vpa_update = 0;
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
if (signal_pending(vcpu->arch.run_task))
-   return 0;
-   need_vpa_update |= vcpu->arch.vpa.update_pending |
-   vcpu->arch.slb_shadow.update_pending |
-   vcpu->arch.dtl.update_pending;
+   return;
+   if (vcpu->arch.vpa.update_pending ||
+   vcpu->arch.slb_shadow.update_pending ||
+   vcpu->arch.dtl.update_pending)
+   vcpus_to_update[need_vpa_update++] = vcpu;
}
 
/*
@@ -915,8 +912,8 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 */
if (need_vpa_update) {
spin_unlock(&vc->lock);
-   list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
-   kvmppc_update_vpas(vcpu);
+   for (i = 0; i < need_vpa_update; ++i)
+   kvmppc_update_vpas(vcpus_to_update[i]);
spin_lock(&vc->lock);
}
 
@@ -933,8 +930,10 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
vcpu->arch.ptid = ptid++;
}
}
-   if (!vcpu0)
-   return 0;   /* nothing to run */
+   if (!vcpu0) {
+   vc->vcore_state = VCORE_INACTIVE;
+   return; /* nothing to run; should 

[PATCH 04/10] KVM: PPC: Book3S HV: Remove bogus update of physical thread IDs

2012-09-20 Thread Paul Mackerras
When making a vcpu non-runnable we incorrectly changed the
thread IDs of all the other threads on the core; just remove that
code.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/kvm/book3s_hv.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6fe1410..a917603 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -759,17 +759,11 @@ extern void xics_wake_cpu(int cpu);
 static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
   struct kvm_vcpu *vcpu)
 {
-   struct kvm_vcpu *v;
-
if (vcpu->arch.state != KVMPPC_VCPU_RUNNABLE)
return;
vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
--vc->n_runnable;
++vc->n_busy;
-   /* decrement the physical thread id of each following vcpu */
-   v = vcpu;
-   list_for_each_entry_continue(v, &vc->runnable_threads, arch.run_list)
-   --v->arch.ptid;
list_del(&vcpu->arch.run_list);
 }
 
-- 
1.7.10



[PATCH 0/10] HV KVM fixes, reposted

2012-09-20 Thread Paul Mackerras
This is a repost of 10 patches out of a series of 12 that I posted
more than three weeks ago that have had no comments but have not yet
been applied.  They have been rediffed against Alex Graf's current
kvm-ppc-next branch.

This series contains various fixes collected during the process of
getting reboot of Book3S HV guests to work correctly, plus some needed
for Ben H's forthcoming series to implement in-kernel XICS (interrupt
controller) emulation.  As part of getting reboot to work, we have
relaxed the previous policy where, on POWER7, a virtual core would
only run when all of the vcpus in it were ready to run (or were idle).
Now a virtual core will run as soon as any one of its vcpus is ready.
This avoids the problem where the guest wouldn't run after reboot
because userspace (qemu) had stopped all except cpu 0.

Please apply.

Paul.


[PATCH 08/10] KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run

2012-09-20 Thread Paul Mackerras
Currently the Book3S HV code implements a policy on multi-threaded
processors (i.e. POWER7) that requires all of the active vcpus in a
virtual core to be ready to run before we run the virtual core.
However, that causes problems on reset, because reset stops all vcpus
except vcpu 0, and can also reduce throughput since all four threads
in a virtual core have to wait whenever any one of them hits a
hypervisor page fault.

This relaxes the policy, allowing the virtual core to run as soon as
any vcpu in it is runnable.  With this, the KVMPPC_VCPU_STOPPED state
and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single
KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish
between them.

Signed-off-by: Paul Mackerras pau...@samba.org
---
 arch/powerpc/include/asm/kvm_host.h |5 +--
 arch/powerpc/kvm/book3s_hv.c|   74 ++-
 2 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 218534d..1e8cbd1 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -563,9 +563,8 @@ struct kvm_vcpu_arch {
 };
 
 /* Values for vcpu->arch.state */
-#define KVMPPC_VCPU_STOPPED0
-#define KVMPPC_VCPU_BUSY_IN_HOST   1
-#define KVMPPC_VCPU_RUNNABLE   2
+#define KVMPPC_VCPU_NOTREADY   0
+#define KVMPPC_VCPU_RUNNABLE   1
 
 /* Values for vcpu->arch.io_gpr */
 #define KVM_MMIO_REG_MASK  0x001f
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8e84625..dc34a69 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -669,10 +669,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, 
unsigned int id)
 
kvmppc_mmu_book3s_hv_init(vcpu);
 
-   /*
-* We consider the vcpu stopped until we see the first run ioctl for it.
-*/
-   vcpu->arch.state = KVMPPC_VCPU_STOPPED;
+   vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
 
init_waitqueue_head(&vcpu->arch.cpu_run);
 
@@ -759,9 +756,8 @@ static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
 {
if (vcpu->arch.state != KVMPPC_VCPU_RUNNABLE)
return;
-   vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
+   vcpu->arch.state = KVMPPC_VCPU_NOTREADY;
--vc->n_runnable;
-   ++vc->n_busy;
list_del(&vcpu->arch.run_list);
 }
 
@@ -1062,7 +1058,6 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore *vc)
 static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 {
int n_ceded;
-   int prev_state;
struct kvmppc_vcore *vc;
struct kvm_vcpu *v, *vn;
 
@@ -1079,7 +1074,6 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
vcpu->arch.ceded = 0;
vcpu->arch.run_task = current;
vcpu->arch.kvm_run = kvm_run;
-   prev_state = vcpu->arch.state;
vcpu->arch.state = KVMPPC_VCPU_RUNNABLE;
list_add_tail(&vcpu->arch.run_list, &vc->runnable_threads);
++vc->n_runnable;
@@ -1089,35 +1083,26 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, 
struct kvm_vcpu *vcpu)
 * If the vcore is already running, we may be able to start
 * this thread straight away and have it join in.
 */
-   if (prev_state == KVMPPC_VCPU_STOPPED) {
+   if (!signal_pending(current)) {
if (vc->vcore_state == VCORE_RUNNING &&
VCORE_EXIT_COUNT(vc) == 0) {
vcpu->arch.ptid = vc->n_runnable - 1;
kvmppc_create_dtl_entry(vcpu, vc);
kvmppc_start_thread(vcpu);
+   } else if (vc->vcore_state == VCORE_SLEEPING) {
+   wake_up(&vc->wq);
}
 
-   } else if (prev_state == KVMPPC_VCPU_BUSY_IN_HOST)
-   --vc->n_busy;
+   }
 
while (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE &&
   !signal_pending(current)) {
-   if (vc->n_busy || vc->vcore_state != VCORE_INACTIVE) {
+   if (vc->vcore_state != VCORE_INACTIVE) {
spin_unlock(&vc->lock);
kvmppc_wait_for_exec(vcpu, TASK_INTERRUPTIBLE);
spin_lock(&vc->lock);
continue;
}
-   vc->runner = vcpu;
-   n_ceded = 0;
-   list_for_each_entry(v, &vc->runnable_threads, arch.run_list)
-   if (!v->arch.pending_exceptions)
-   n_ceded += v->arch.ceded;
-   if (n_ceded == vc->n_runnable)
-   kvmppc_vcore_blocked(vc);
-   else
-   kvmppc_run_core(vc);
-
list_for_each_entry_safe(v, vn, &vc->runnable_threads,
 arch.run_list) {
kvmppc_core_prepare_to_enter(v);
@@ -1129,23 +1114,40 @@ static int 

[PATCH v3 1/2] KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface

2012-09-20 Thread Paul Mackerras
This enables userspace to get and set various SPRs (special-purpose
registers) using the KVM_[GS]ET_ONE_REG ioctls.  With this, userspace
can get and set all the SPRs that are part of the guest state, either
through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
the KVM_[GS]ET_ONE_REG ioctls.

The SPRs that are added here are:

- DABR:  Data address breakpoint register
- DSCR:  Data stream control register
- PURR:  Processor utilization of resources register
- SPURR: Scaled PURR
- DAR:   Data address register
- DSISR: Data storage interrupt status register
- AMR:   Authority mask register
- UAMOR: User authority mask override register
- MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
- PMC1..PMC8: Performance monitor unit counter registers

In order to reduce code duplication between PR and HV KVM code, this
moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and
centralizes the copying between user and kernel space there.  The
registers that are handled differently between PR and HV, and those
that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg()
functions that are specific to each flavor.
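For reference, a hedged sketch of the userspace side (standard
KVM_GET_ONE_REG usage; the register ID is defined in this patch):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Read the guest DSCR from an open KVM vcpu fd. */
	static int get_dscr(int vcpu_fd, uint64_t *val)
	{
		struct kvm_one_reg reg = {
			.id   = KVM_REG_PPC_DSCR,
			.addr = (uintptr_t)val,
		};
		return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
	}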

Signed-off-by: Paul Mackerras pau...@samba.org
---
v3: handle DAR and DSISR, plus copy to/from userspace, in common code

 Documentation/virtual/kvm/api.txt  |   19 +
 arch/powerpc/include/asm/kvm.h |   21 ++
 arch/powerpc/include/asm/kvm_ppc.h |7 
 arch/powerpc/kvm/book3s.c  |   74 +++
 arch/powerpc/kvm/book3s_hv.c   |   76 ++--
 arch/powerpc/kvm/book3s_pr.c   |   23 ++-
 6 files changed, 196 insertions(+), 24 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 76a07a6..e4a2067 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1740,6 +1740,25 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_IAC4  | 64
   PPC   | KVM_REG_PPC_DAC1  | 64
   PPC   | KVM_REG_PPC_DAC2  | 64
+  PPC   | KVM_REG_PPC_DABR  | 64
+  PPC   | KVM_REG_PPC_DSCR  | 64
+  PPC   | KVM_REG_PPC_PURR  | 64
+  PPC   | KVM_REG_PPC_SPURR | 64
+  PPC   | KVM_REG_PPC_DAR   | 64
+  PPC   | KVM_REG_PPC_DSISR | 32
+  PPC   | KVM_REG_PPC_AMR   | 64
+  PPC   | KVM_REG_PPC_UAMOR | 64
+  PPC   | KVM_REG_PPC_MMCR0 | 64
+  PPC   | KVM_REG_PPC_MMCR1 | 64
+  PPC   | KVM_REG_PPC_MMCRA | 64
+  PPC   | KVM_REG_PPC_PMC1  | 32
+  PPC   | KVM_REG_PPC_PMC2  | 32
+  PPC   | KVM_REG_PPC_PMC3  | 32
+  PPC   | KVM_REG_PPC_PMC4  | 32
+  PPC   | KVM_REG_PPC_PMC5  | 32
+  PPC   | KVM_REG_PPC_PMC6  | 32
+  PPC   | KVM_REG_PPC_PMC7  | 32
+  PPC   | KVM_REG_PPC_PMC8  | 32
 
 4.69 KVM_GET_ONE_REG
 
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 3c14202..9557576 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -338,5 +338,26 @@ struct kvm_book3e_206_tlb_params {
 #define KVM_REG_PPC_IAC4   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x5)
 #define KVM_REG_PPC_DAC1   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x6)
 #define KVM_REG_PPC_DAC2   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x7)
+#define KVM_REG_PPC_DABR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8)
+#define KVM_REG_PPC_DSCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9)
+#define KVM_REG_PPC_PURR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa)
+#define KVM_REG_PPC_SPURR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb)
+#define KVM_REG_PPC_DAR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc)
+#define KVM_REG_PPC_DSISR  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xd)
+#define KVM_REG_PPC_AMR(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xe)
+#define KVM_REG_PPC_UAMOR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xf)
+
+#define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
+#define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
+#define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+
+#define KVM_REG_PPC_PMC1   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
+#define KVM_REG_PPC_PMC2   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
+#define KVM_REG_PPC_PMC3   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1a)
+#define KVM_REG_PPC_PMC4   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1b)
+#define KVM_REG_PPC_PMC5   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1c)
+#define KVM_REG_PPC_PMC6   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1d)
+#define KVM_REG_PPC_PMC7   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1e)
+#define KVM_REG_PPC_PMC8   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1f)
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2c94cb3..6002b0a 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -196,6 +196,11 @@ static inline u32 kvmppc_set_field(u64 inst, int msb, int 
lsb, int value)
return r;

[PATCH v3 2/2] KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface

2012-09-20 Thread Paul Mackerras
This enables userspace to get and set all the guest floating-point
state using the KVM_[GS]ET_ONE_REG ioctls.  The floating-point state
includes all of the traditional floating-point registers and the
FPSCR (floating point status/control register), all the VMX/Altivec
vector registers and the VSCR (vector status/control register), and
on POWER7, the vector-scalar registers (note that each FP register
is the high-order half of the corresponding VSR).

Most of these are implemented in common Book 3S code, except for VSX
on POWER7.  Because HV and PR differ in how they store the FP and VSX
registers on POWER7, the code for these cases is not common.  On POWER7,
the FP registers are the upper halves of the VSX registers vsr0 - vsr31.
PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the
arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas
HV KVM on POWER7 stores the whole VSX register in arch.vsr[].

Signed-off-by: Paul Mackerras pau...@samba.org
---
v3: Handle most registers in common book3s code, where possible

 Documentation/virtual/kvm/api.txt  |   11 +
 arch/powerpc/include/asm/kvm.h |   20 
 arch/powerpc/include/asm/kvm_ppc.h |2 ++
 arch/powerpc/kvm/book3s.c  |   44 
 arch/powerpc/kvm/book3s_hv.c   |   42 ++
 arch/powerpc/kvm/book3s_pr.c   |   24 
 6 files changed, 143 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index e4a2067..02bb8e1 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1759,6 +1759,17 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_PMC6  | 32
   PPC   | KVM_REG_PPC_PMC7  | 32
   PPC   | KVM_REG_PPC_PMC8  | 32
+  PPC   | KVM_REG_PPC_FPR0  | 64
+  ...
+  PPC   | KVM_REG_PPC_FPR31 | 64
+  PPC   | KVM_REG_PPC_VR0   | 128
+  ...
+  PPC   | KVM_REG_PPC_VR31  | 128
+  PPC   | KVM_REG_PPC_VSR0  | 128
+  ...
+  PPC   | KVM_REG_PPC_VSR31 | 128
+  PPC   | KVM_REG_PPC_FPSCR | 64
+  PPC   | KVM_REG_PPC_VSCR  | 32
 
 4.69 KVM_GET_ONE_REG
 
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 9557576..1466975 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -360,4 +360,24 @@ struct kvm_book3e_206_tlb_params {
 #define KVM_REG_PPC_PMC7   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1e)
 #define KVM_REG_PPC_PMC8   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x1f)
 
+/* 32 floating-point registers */
+#define KVM_REG_PPC_FPR0   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x20)
+#define KVM_REG_PPC_FPR(n) (KVM_REG_PPC_FPR0 + (n))
+#define KVM_REG_PPC_FPR31  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x3f)
+
+/* 32 VMX/Altivec vector registers */
+#define KVM_REG_PPC_VR0(KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x40)
+#define KVM_REG_PPC_VR(n)  (KVM_REG_PPC_VR0 + (n))
+#define KVM_REG_PPC_VR31   (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x5f)
+
+/* 32 double-width FP registers for VSX */
+/* High-order halves overlap with FP regs */
+#define KVM_REG_PPC_VSR0   (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x60)
+#define KVM_REG_PPC_VSR(n) (KVM_REG_PPC_VSR0 + (n))
+#define KVM_REG_PPC_VSR31  (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x7f)
+
+/* FP and vector status/control registers */
+#define KVM_REG_PPC_FPSCR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x80)
+#define KVM_REG_PPC_VSCR   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x81)
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 6002b0a..d9fb406 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -199,6 +199,8 @@ static inline u32 kvmppc_set_field(u64 inst, int msb, int 
lsb, int value)
 union kvmppc_one_reg {
u32 wval;
u64 dval;
+   vector128 vval;
+   u64 vsxval[2];
 };
 
 void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6d1306c..abdd9ef 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -512,6 +512,28 @@ int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu, 
struct kvm_one_reg *reg)
case KVM_REG_PPC_DSISR:
val.wval = vcpu->arch.shared->dsisr;
break;
+   case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
+   val.dval = vcpu->arch.fpr[reg->id - KVM_REG_PPC_FPR0];
+   break;
+   case KVM_REG_PPC_FPSCR:
+   val.dval = vcpu->arch.fpscr;
+   break;
+#ifdef CONFIG_ALTIVEC
+   case KVM_REG_PPC_VR0 ... KVM_REG_PPC_VR31:
+   if (!cpu_has_feature(CPU_FTR_ALTIVEC)) {
+   r = -ENXIO;
+