Re: [PATCH v6 4/4] add 2nd stage page fault handling during live migration

2014-05-28 Thread Christoffer Dall
On Tue, May 27, 2014 at 06:30:23PM -0700, Mario Smarduch wrote:
 On 05/27/2014 01:19 PM, Christoffer Dall wrote:
  On Thu, May 15, 2014 at 11:27:31AM -0700, Mario Smarduch wrote:
  This patch adds support for handling 2nd stage page faults during 
  migration,
  it disables faulting in huge pages, and splits up existing huge pages.
 
  Signed-off-by: Mario Smarduch m.smard...@samsung.com
  ---
   arch/arm/kvm/mmu.c |   36 ++--
   1 file changed, 34 insertions(+), 2 deletions(-)
 
  diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
  index b939312..10e7bf6 100644
  --- a/arch/arm/kvm/mmu.c
  +++ b/arch/arm/kvm/mmu.c
  @@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
  phys_addr_t fault_ipa,
 struct kvm_mmu_memory_cache *memcache = vcpu-arch.mmu_page_cache;
 struct vm_area_struct *vma;
 pfn_t pfn;
  +  bool migration_active;
   
 write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
 if (fault_status == FSC_PERM  !write_fault) {
  @@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
  phys_addr_t fault_ipa,
 return -EFAULT;
   
 spin_lock(kvm-mmu_lock);
  +
  +  /*
  +   * Place inside lock to prevent race condition when whole VM is being
  +   * write proteced. Prevent race of huge page install when migration is
  +   * active.
  +   */
  +  migration_active = vcpu-kvm-arch.migration_in_progress;
  +
 if (mmu_notifier_retry(kvm, mmu_seq))
 goto out_unlock;
  -  if (!hugetlb  !force_pte)
  +
  +  /* When migrating don't spend cycles coalescing huge pages */
  +  if (!hugetlb  !force_pte  !migration_active)
 hugetlb = transparent_hugepage_adjust(pfn, fault_ipa);
   
  -  if (hugetlb) {
  +  /* During migration don't install huge pages */
  
  again, all this is not about migration per se, it's about when logging
  dirty pages, (which may be commonly used for migration).
  
 
 Yes that's true , I'll update but until recently (new RFC on qemu list) where
 dirty logging is used for getting VM RSS or hot memory regions, I don't see 
 any
 other use case.
 

That doesn't really matter.  KVM doesn't really know (or care) what user
space is doing with its features; it implements a certain functionality
behind an ABI, and that's it.  For things to be consistent and make
sense in the kernel, you can only refer to concepts defined by KVM, not
by how QEMU or kvmtools (or some other user space client) may use it.

  +  if (hugetlb  !migration_active) {
 pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
 new_pmd = pmd_mkhuge(new_pmd);
 if (writable) {
  @@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
  phys_addr_t fault_ipa,
 ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, new_pmd);
 } else {
 pte_t new_pte = pfn_pte(pfn, PAGE_S2);
  +
  +  /*
  +   * If pmd is  mapping a huge page then split it up into
  +   * small pages, when doing live migration.
  +   */
  +  if (migration_active) {
  +  pmd_t *pmd;
  +  if (hugetlb) {
  +  pfn += pte_index(fault_ipa);
  +  gfn = fault_ipa  PAGE_SHIFT;
  +  }
  
  how can you have hugetlb when we entered this else-clause conditional on
  having !hugetlb?
  
 - if(hugetlb  !migration_active)

ah, you changed that, sorry, my bad.

 
 forces all page faults to enter here while in migration. Huge page entries
 are cleared and stage2_set_pte() splits the huge page, and installs the pte
 for the fault_ipa. I placed that there since it flows with installing 
 a pte as well as splitting a huge page. But your comment on performance
 split up huge page vs. deferred  page faulting should move it out of here. 
 

Why do you need to make that change?  I would think that just not
setting hugetlb when you have dirty page logging activated should take
care of all that you need.

Wrt. my comments on performance, I had a look at the x86 code, and they
seem to take your approach.  We should probably talk more closely to
them about their experiences.

 
  +  new_pte = pfn_pte(pfn, PAGE_S2);
  +  pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
  +  if (pmd  kvm_pmd_huge(*pmd))
  +  clear_pmd_entry(kvm, pmd, fault_ipa);
  
  If we have a huge pmd entry, how did we take a fault on there?  Would
  that be if a different CPU inserted a huge page entry since we got here,
  is this what you're trying to handle?
  
  I'm confused.
  
 
 I thing this related to the above.
 

Well, if you're taking a fault, it means that you either don't have a
PMD or you don't have a pte.  If you have kvm_pmd_huge() you have a pmd
and you don't need a pte, so this should never happen. Ever.

  +  }
  +
 if (writable) {
 kvm_set_s2pte_writable(new_pte);

Question about MSR_K7_HWCR in kvm_set_msr_common

2014-05-28 Thread Jidong Xiao
Hi,

In kvm_set_msr_common(), I see that the follow piece of code will
handle the write operation to the register MSR_K7_HWCR.

case MSR_K7_HWCR:
  data = ~(u64)0x40; /* ignore flush filter disable */
  data = ~(u64)0x100; /* ignore ignne emulation enable */
  data = ~(u64)0x8; /* ignore TLB cache disable */
  if (data != 0) {
   pr_unimpl(vcpu, unimplemented HWCR wrmsr: 0x%llx\n,data);
   return 1;
   }
  break;

I am totally confused that, from this piece of code, we can see,
nothing will actually be written to MSR_K7_HWCR, if so, why do we
explicitly ignore some bits?

if we don't want the guest to write 0x40, 0x100, 0x8 to this register,
why don't we just return 1 and do nothing else. Like this:

case MSR_K7_HWCR:
 {
   pr_unimpl(vcpu, unimplemented HWCR wrmsr: 0x%llx\n,data);
   return 1;
   }

Or, we can simply use the default case, which may also return 1.

So, my question is, if we explicitly emulate this register, why do we
also explicitly ignore all the write operation to this register?

-Jidong
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL RESEND] KVM fixes for 3.15-rc6

2014-05-28 Thread Paolo Bonzini
Linus,

The following changes since commit 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd:

  Linux 3.15-rc4 (2014-05-04 18:14:42 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to a4e91d04b86504f145cc5f766c2609357a68b186:

  Merge tag 'kvm-s390-for-3.15-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master 
(2014-05-15 14:46:57 +0200)



Small fixes for x86, slightly larger fixes for PPC, and a forgotten s390 patch.
The PPC fixes are important because they fix breakage that is new in 3.15.


Alexander Graf (2):
  KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit
  KVM guest: Make pv trampoline code executable

Cornelia Huck (1):
  KVM: s390: announce irqfd capability

Marcelo Tosatti (1):
  KVM: x86: disable master clock if TSC is reset during suspend

Paolo Bonzini (3):
  KVM: vmx: disable APIC virtualization in nested guests
  Merge tag 'signed-for-3.15' of git://github.com/agraf/linux-2.6 into 
kvm-master
  Merge tag 'kvm-s390-for-3.15-1' of git://git.kernel.org/.../kvms390/linux 
into kvm-master

Paul Mackerras (1):
  KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest 
exit

pingf...@linux.vnet.ibm.com (1):
  KVM: PPC: Book3S: HV: make _PAGE_NUMA take effect

 Documentation/virtual/kvm/api.txt   |   2 +-
 arch/powerpc/include/asm/sections.h |  11 
 arch/powerpc/kernel/kvm.c   |   2 +-
 arch/powerpc/kvm/book3s.c   |   6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   2 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 104 
 arch/powerpc/kvm/book3s_pr.c|   6 +-
 arch/powerpc/mm/hash_utils_64.c |   4 ++
 arch/s390/kvm/kvm-s390.c|   1 +
 arch/x86/kvm/vmx.c  |   7 ++-
 arch/x86/kvm/x86.c  |   6 +-
 11 files changed, 139 insertions(+), 12 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 3/4] live migration support for VM dirty log management

2014-05-28 Thread Christoffer Dall
On Tue, May 27, 2014 at 02:55:21PM -0700, Mario Smarduch wrote:
 On 05/27/2014 01:12 PM, Christoffer Dall wrote:
  On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:

[...]

  +
  +  /* If pgd, pud, pmd not present and you cross pmd range check next
  +   * index.
  +   */
  +  pgd = pgdp + pgd_index(ipa);
  +  if (unlikely(crosses_pmd  !pgd_present(*pgd))) {
  +  pgd = pgdp + pgd_index(next);
  +  if (!pgd_present(*pgd))
  +  return;
  +  }
  +
  +  pud = pud_offset(pgd, ipa);
  +  if (unlikely(crosses_pmd  !pud_present(*pud))) {
  +  pud = pud_offset(pgd, next);
  +  if (!pud_present(*pud))
  +  return;
  +  }
  +
  +  pmd = pmd_offset(pud, ipa);
  +  if (unlikely(crosses_pmd  !pmd_present(*pmd))) {
  +  pmd = pmd_offset(pud, next);
  +  if (!pmd_present(*pmd))
  +  return;
  +  }
  +
  +  for (;;) {
  +  pte = pte_offset_kernel(pmd, ipa);
  +  if (!pte_present(*pte))
  +  goto next_ipa;
  +
  +  if (kvm_s2pte_readonly(pte))
  +  goto next_ipa;
  +  kvm_set_s2pte_readonly(pte);
  +next_ipa:
  +  mask = mask - 1;
  +  if (!mask)
  +  break;
  +
  +  /* find next page */
  +  ipa = (gfnofst + __ffs(mask))  PAGE_SHIFT;
  +
  +  /* skip upper page table lookups */
  +  if (!crosses_pmd)
  +  continue;
  +
  +  pgd = pgdp + pgd_index(ipa);
  +  if (unlikely(!pgd_present(*pgd)))
  +  goto next_ipa;
  +  pud = pud_offset(pgd, ipa);
  +  if (unlikely(!pud_present(*pud)))
  +  goto next_ipa;
  +  pmd = pmd_offset(pud, ipa);
  +  if (unlikely(!pmd_present(*pmd)))
  +  goto next_ipa;
  +  }
  
  So I think the reason this is done separately on x86 is that they have
  an rmap structure for their gfn mappings so that they can quickly lookup
  ptes based on a gfn and write-protect it without having to walk the
  stage-2 page tables.
 
 Yes, they also use rmapps for mmu notifiers, invalidations on huge VMs and 
 large ranges resulted in excessive times. 
  
  Unless you want to introduce this on ARM, I think you will be much
 
 Eventually yes but that would also require reworking mmu notifiers.  I had 
 two step approach in mind. Initially get the dirty page marking to work, 
 TLB flushing, GIC/arch-timer migration, validate migration under various 
 stress loads (page reclaim) with mmu notifiers, test several VMs and 
 migration 
 times. 
 
 Then get rmapp (or something similar) working - eventually for huge VMs it's
 needed. In short two phases.
 
  better off just having a single (properly written) iterating
  write-protect function, that takes a start and end IPA and a bitmap for
  which pages to actually write-protect, which can then handle the generic
  case (either NULL or all-ones bitmap) or a specific case, which just
  traverses the IPA range given as input.  Such a function should follow
  the model of page table walk functions discussed previously
  (separate functions: wp_pgd_enties(), wp_pud_entries(),
  wp_pmd_entries(), wp_pte_entries()).
  
  However, you may want to verify my assumption above with the x86 people
  and look at sharing the rmap logic between architectures.
  
  In any case, this code is very difficult to read and understand, and it
  doesn't look at all like the other code we have to walk page tables.  I
  understand you are trying to optimize for performance (by skipping some
  intermediate page table level lookups), but you never declare that goal
  anywhere in the code or in the commit message.
 
 Marc's comment noticed I was walking a small range (128k), using upper table
 iterations that covered 1G, 2MB ranges. As you mention the code tries to
 optimize upper table lookups. Yes the function is too bulky, but I'm not sure 
 how 
 to remove the upper table checks since page tables may change between the 
 time pages are marked dirty and the log is retrieved. And if a memory slot 
 is very dirty walking upper tables will impact performance. I'll think some 
 more on this function.
 
I think you should aim at the simplest possible implementation that
functionally works, first.  Let's verify that this thing works, have
clean working code that implementation-wise is as minimal as possible.

Then we can run perf on that and see if our migrations are very slow,
where we are actually spending time, and only then optimize it.

The solution to this specific problem for the time being appears quite
clear to me: Follow the exact same scheme as for unmap_range (the one I
sent out here:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009592.html, the
diff is hard to read, so I recommend you apply the patch and look at the
resulting code).  Have a similar scheme, call it wp_ipa_range() or
something like that, and use that for now.

-Christoffer

Re: [PATCH v2 0/9] arm64: KVM: debug infrastructure support

2014-05-28 Thread Marc Zyngier
On 25/05/14 16:34, Christoffer Dall wrote:
 On Tue, May 20, 2014 at 05:55:36PM +0100, Marc Zyngier wrote:
 This patch series adds debug support, a key feature missing from the
 KVM/arm64 port.

 The main idea is to keep track of whether the debug registers are
 dirty (changed by the guest) or not. In this case, perform the usual
 save/restore dance, for one run only. It means we only have a penalty
 if a guest is actively using the debug registers.

 The amount of registers is properly frightening, but CPUs actually
 only implement a subset of them. Also, there is a number of registers
 we don't bother emulating (things having to do with external debug and
 OSlock).
 
 What is the rationale about not having to deal with external debug and
 OSlock?

External debug is when you actually plug a physical JTAG into the CPU.
OSlock is a way to prevent other software to play with the debug
registers. My understanding is that it is only useful in combination
with the external debug.

In both case, implementing support for this is probably not worth the
effort, at least for the time being.

M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] arm64: KVM: debug infrastructure support

2014-05-28 Thread Christoffer Dall
On Wed, May 28, 2014 at 10:56:45AM +0100, Marc Zyngier wrote:
 On 25/05/14 16:34, Christoffer Dall wrote:
  On Tue, May 20, 2014 at 05:55:36PM +0100, Marc Zyngier wrote:
  This patch series adds debug support, a key feature missing from the
  KVM/arm64 port.
 
  The main idea is to keep track of whether the debug registers are
  dirty (changed by the guest) or not. In this case, perform the usual
  save/restore dance, for one run only. It means we only have a penalty
  if a guest is actively using the debug registers.
 
  The amount of registers is properly frightening, but CPUs actually
  only implement a subset of them. Also, there is a number of registers
  we don't bother emulating (things having to do with external debug and
  OSlock).
  
  What is the rationale about not having to deal with external debug and
  OSlock?
 
 External debug is when you actually plug a physical JTAG into the CPU.
 OSlock is a way to prevent other software to play with the debug
 registers. My understanding is that it is only useful in combination
 with the external debug.
 
 In both case, implementing support for this is probably not worth the
 effort, at least for the time being.
 
OK, can we document that somewhere clearly in the code then so we know
how we can simply ignore those registers?

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] arm64: KVM: debug infrastructure support

2014-05-28 Thread Marc Zyngier
On 28/05/14 10:58, Christoffer Dall wrote:
 On Wed, May 28, 2014 at 10:56:45AM +0100, Marc Zyngier wrote:
 On 25/05/14 16:34, Christoffer Dall wrote:
 On Tue, May 20, 2014 at 05:55:36PM +0100, Marc Zyngier wrote:
 This patch series adds debug support, a key feature missing from the
 KVM/arm64 port.

 The main idea is to keep track of whether the debug registers are
 dirty (changed by the guest) or not. In this case, perform the usual
 save/restore dance, for one run only. It means we only have a penalty
 if a guest is actively using the debug registers.

 The amount of registers is properly frightening, but CPUs actually
 only implement a subset of them. Also, there is a number of registers
 we don't bother emulating (things having to do with external debug and
 OSlock).

 What is the rationale about not having to deal with external debug and
 OSlock?

 External debug is when you actually plug a physical JTAG into the CPU.
 OSlock is a way to prevent other software to play with the debug
 registers. My understanding is that it is only useful in combination
 with the external debug.

 In both case, implementing support for this is probably not worth the
 effort, at least for the time being.

 OK, can we document that somewhere clearly in the code then so we know
 how we can simply ignore those registers?

Sure. I'll add some documentation.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/9] arm64: KVM: add trap handlers for AArch64 debug registers

2014-05-28 Thread Marc Zyngier
On 25/05/14 16:34, Christoffer Dall wrote:
 On Tue, May 20, 2014 at 05:55:39PM +0100, Marc Zyngier wrote:
 Add handlers for all the AArch64 debug registers that are accessible
 from EL0 or EL1. The trapping code keeps track of the state of the
 debug registers, allowing for the switch code to implement a lazy
 switching strategy.

 Reviewed-by: Anup Patel anup.pa...@linaro.org
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm64/include/asm/kvm_asm.h  |  28 ++--
  arch/arm64/include/asm/kvm_host.h |   3 +
  arch/arm64/kvm/sys_regs.c | 130 
 +-
  3 files changed, 151 insertions(+), 10 deletions(-)

 diff --git a/arch/arm64/include/asm/kvm_asm.h 
 b/arch/arm64/include/asm/kvm_asm.h
 index 9fcd54b..e6b159a 100644
 --- a/arch/arm64/include/asm/kvm_asm.h
 +++ b/arch/arm64/include/asm/kvm_asm.h
 @@ -43,14 +43,25 @@
  #define  AMAIR_EL1   19  /* Aux Memory Attribute Indirection 
 Register */
  #define  CNTKCTL_EL1 20  /* Timer Control Register (EL1) */
  #define  PAR_EL1 21  /* Physical Address Register */
 +#define MDSCR_EL122  /* Monitor Debug System Control Register */
 +#define DBGBCR0_EL1  23  /* Debug Breakpoint Control Registers (0-15) */
 +#define DBGBCR15_EL1 38
 +#define DBGBVR0_EL1  39  /* Debug Breakpoint Value Registers (0-15) */
 +#define DBGBVR15_EL1 54
 +#define DBGWCR0_EL1  55  /* Debug Watchpoint Control Registers (0-15) */
 +#define DBGWCR15_EL1 70
 +#define DBGWVR0_EL1  71  /* Debug Watchpoint Value Registers (0-15) */
 +#define DBGWVR15_EL1 86
 +#define MDCCINT_EL1  87  /* Monitor Debug Comms Channel Interrupt 
 Enable Reg */
 +
  /* 32bit specific registers. Keep them at the end of the range */
 -#define  DACR32_EL2  22  /* Domain Access Control Register */
 -#define  IFSR32_EL2  23  /* Instruction Fault Status Register */
 -#define  FPEXC32_EL2 24  /* Floating-Point Exception Control 
 Register */
 -#define  DBGVCR32_EL225  /* Debug Vector Catch Register */
 -#define  TEECR32_EL1 26  /* ThumbEE Configuration Register */
 -#define  TEEHBR32_EL127  /* ThumbEE Handler Base Register */
 -#define  NR_SYS_REGS 28
 +#define  DACR32_EL2  88  /* Domain Access Control Register */
 +#define  IFSR32_EL2  89  /* Instruction Fault Status Register */
 +#define  FPEXC32_EL2 90  /* Floating-Point Exception Control 
 Register */
 +#define  DBGVCR32_EL291  /* Debug Vector Catch Register */
 +#define  TEECR32_EL1 92  /* ThumbEE Configuration Register */
 +#define  TEEHBR32_EL193  /* ThumbEE Handler Base Register */
 +#define  NR_SYS_REGS 94

  /* 32bit mapping */
  #define c0_MPIDR (MPIDR_EL1 * 2) /* MultiProcessor ID Register */
 @@ -87,6 +98,9 @@
  #define ARM_EXCEPTION_IRQ  0
  #define ARM_EXCEPTION_TRAP 1

 +#define KVM_ARM64_DEBUG_DIRTY_SHIFT  0
 +#define KVM_ARM64_DEBUG_DIRTY(1  
 KVM_ARM64_DEBUG_DIRTY_SHIFT)
 +
  #ifndef __ASSEMBLY__
  struct kvm;
  struct kvm_vcpu;
 diff --git a/arch/arm64/include/asm/kvm_host.h 
 b/arch/arm64/include/asm/kvm_host.h
 index 0a1d697..4737961 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -101,6 +101,9 @@ struct kvm_vcpu_arch {
   /* Exception Information */
   struct kvm_vcpu_fault_info fault;

 + /* Debug state */
 + u64 debug_flags;
 +
   /* Pointer to host CPU context */
   kvm_cpu_context_t *host_cpu_context;

 diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
 index c3d28f1..d46a965 100644
 --- a/arch/arm64/kvm/sys_regs.c
 +++ b/arch/arm64/kvm/sys_regs.c
 @@ -30,6 +30,7 @@
  #include asm/kvm_mmu.h
  #include asm/cacheflush.h
  #include asm/cputype.h
 +#include asm/debug-monitors.h
  #include trace/events/kvm.h

  #include sys_regs.h
 @@ -173,6 +174,58 @@ static bool trap_raz_wi(struct kvm_vcpu *vcpu,
   return read_zero(vcpu, p);
  }

 +static bool trap_oslsr_el1(struct kvm_vcpu *vcpu,
 +const struct sys_reg_params *p,
 +const struct sys_reg_desc *r)
 +{
 + if (p-is_write) {
 + return ignore_write(vcpu, p);
 + } else {
 + *vcpu_reg(vcpu, p-Rt) = (1  3);
 + return true;
 + }
 +}
 +
 +static bool trap_dbgauthstatus_el1(struct kvm_vcpu *vcpu,
 +const struct sys_reg_params *p,
 +const struct sys_reg_desc *r)
 +{
 + if (p-is_write) {
 + return ignore_write(vcpu, p);
 + } else {
 + *vcpu_reg(vcpu, p-Rt) = 0x; /* Implemented and disabled */
 
 is this always safe?  What happens when you stop trapping accesses to
 this register and the hardware tells you something different?
 
 Are we assuming that this is always the case since otherwise none of
 this works, or?

No, 

Re: [PATCH v2 4/9] arm64: KVM: common infrastructure for handling AArch32 CP14/CP15

2014-05-28 Thread Marc Zyngier
On 25/05/14 16:34, Christoffer Dall wrote:
 On Tue, May 20, 2014 at 05:55:40PM +0100, Marc Zyngier wrote:
 As we're about to trap a bunch of CP14 registers, let's rework
 the CP15 handling so it can be generalized and work with multiple
 tables.

 Reviewed-by: Anup Patel anup.pa...@linaro.org
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm64/include/asm/kvm_asm.h|   2 +-
  arch/arm64/include/asm/kvm_coproc.h |   3 +-
  arch/arm64/include/asm/kvm_host.h   |   9 ++-
  arch/arm64/kvm/handle_exit.c|   4 +-
  arch/arm64/kvm/sys_regs.c   | 121 
 +---
  5 files changed, 111 insertions(+), 28 deletions(-)

 diff --git a/arch/arm64/include/asm/kvm_asm.h 
 b/arch/arm64/include/asm/kvm_asm.h
 index e6b159a..12f9dd7 100644
 --- a/arch/arm64/include/asm/kvm_asm.h
 +++ b/arch/arm64/include/asm/kvm_asm.h
 @@ -93,7 +93,7 @@
  #define c10_AMAIR0  (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */
  #define c10_AMAIR1  (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */
  #define c14_CNTKCTL (CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */
 -#define NR_CP15_REGS(NR_SYS_REGS * 2)
 +#define NR_COPRO_REGS   (NR_SYS_REGS * 2)
  
  #define ARM_EXCEPTION_IRQ 0
  #define ARM_EXCEPTION_TRAP1
 diff --git a/arch/arm64/include/asm/kvm_coproc.h 
 b/arch/arm64/include/asm/kvm_coproc.h
 index 9a59301..0b52377 100644
 --- a/arch/arm64/include/asm/kvm_coproc.h
 +++ b/arch/arm64/include/asm/kvm_coproc.h
 @@ -39,7 +39,8 @@ void kvm_register_target_sys_reg_table(unsigned int target,
 struct kvm_sys_reg_target_table *table);
  
  int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run);
 -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run);
 +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
 +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run);
 diff --git a/arch/arm64/include/asm/kvm_host.h 
 b/arch/arm64/include/asm/kvm_host.h
 index 4737961..31cff7a 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -86,7 +86,7 @@ struct kvm_cpu_context {
  struct kvm_regs gp_regs;
  union {
  u64 sys_regs[NR_SYS_REGS];
 -u32 cp15[NR_CP15_REGS];
 +u32 copro[NR_COPRO_REGS];
  };
  };
  
 @@ -141,7 +141,12 @@ struct kvm_vcpu_arch {
  
  #define vcpu_gp_regs(v) ((v)-arch.ctxt.gp_regs)
  #define vcpu_sys_reg(v,r)   ((v)-arch.ctxt.sys_regs[(r)])
 -#define vcpu_cp15(v,r)  ((v)-arch.ctxt.cp15[(r)])
 +/*
 + * CP14 and CP15 live in the same array, as they are backed by the
 + * same system registers.
 + */
 +#define vcpu_cp14(v,r)  ((v)-arch.ctxt.copro[(r)])
 +#define vcpu_cp15(v,r)  ((v)-arch.ctxt.copro[(r)])
  
  struct kvm_vm_stat {
  u32 remote_tlb_flush;
 diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
 index 7bc41ea..f0ca49f 100644
 --- a/arch/arm64/kvm/handle_exit.c
 +++ b/arch/arm64/kvm/handle_exit.c
 @@ -69,9 +69,9 @@ static exit_handle_fn arm_exit_handlers[] = {
  [ESR_EL2_EC_WFI]= kvm_handle_wfx,
  [ESR_EL2_EC_CP15_32]= kvm_handle_cp15_32,
  [ESR_EL2_EC_CP15_64]= kvm_handle_cp15_64,
 -[ESR_EL2_EC_CP14_MR]= kvm_handle_cp14_access,
 +[ESR_EL2_EC_CP14_MR]= kvm_handle_cp14_32,
  [ESR_EL2_EC_CP14_LS]= kvm_handle_cp14_load_store,
 -[ESR_EL2_EC_CP14_64]= kvm_handle_cp14_access,
 +[ESR_EL2_EC_CP14_64]= kvm_handle_cp14_64,
  [ESR_EL2_EC_HVC32]  = handle_hvc,
  [ESR_EL2_EC_SMC32]  = handle_smc,
  [ESR_EL2_EC_HVC64]  = handle_hvc,
 diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
 index d46a965..429e38c 100644
 --- a/arch/arm64/kvm/sys_regs.c
 +++ b/arch/arm64/kvm/sys_regs.c
 @@ -474,6 +474,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
NULL, reset_val, FPEXC32_EL2, 0x70 },
  };
  
 +/* Trapped cp14 registers */
 +static const struct sys_reg_desc cp14_regs[] = {
 +};
 +
  /*
   * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding,
   * depending on the way they are accessed (as a 32bit or a 64bit
 @@ -581,26 +585,19 @@ int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, 
 struct kvm_run *run)
  return 1;
  }
  
 -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run)
 -{
 -kvm_inject_undefined(vcpu);
 -return 1;
 -}
 -
 -static void emulate_cp15(struct kvm_vcpu *vcpu,
 - const struct sys_reg_params *params)
 
 document the return value please?

Sure.

 +static int emulate_cp(struct kvm_vcpu *vcpu,
 +  const struct sys_reg_params *params,
 +  const struct sys_reg_desc *table,
 +  

[Bug 76331] kernel BUG at drivers/iommu/intel-iommu.c:844!

2014-05-28 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=76331

--- Comment #5 from Matt mspe...@users.sourceforge.net ---
Hi Alex,

Great news !
Yesterday I had the opportunity to recompile my kernel with your suggested fix
in intel-iommu driver : dmar_domain-gaw = min(dmar_domain-gaw, addr_width);

After multiple tests I can confirm that this successfully fixed the issue.
How can we have this integrated in the official kernel sources ?

I also tried to re-order the qemu command-line...
With or without the fix I don't see any difference and I always end up with
various problems related to the gpu pass-thru :
- one VM blue screen at boot (VIDEO_TDR_ERROR)
- one Host crash !
- driver error (code 43) and dmesg full of errors like :
[ 2283.900194] dmar: DMAR:[DMA Read] Request device [06:00.0] fault addr
12de0a000 
DMAR:[fault reason 12] non-zero reserved fields in PTE
[ 2283.900201] dmar: DMAR:[DMA Write] Request device [06:00.0] fault addr
aff93000 
DMAR:[fault reason 12] non-zero reserved fields in PTE
[ 2286.149141] dmar: DRHD: handling fault status reg 602

But I'm not sure if this problem is related...

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


release VCPUs while VM is running,

2014-05-28 Thread Anshul Makkar
Hi,

Is there any work in progress on releasing VCPUs while the VM is still
in progress (Hot unplugging) ?

The following patches were not accepted.

Thanks
Anshul Makkar
www.justkernel.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Raghavendra K T
In virtualized environment there are mainly three problems
related to spinlocks that affect performance.
1. LHP (lock holder preemption)
2. Lock Waiter Preemption (LWP)
3. Starvation/fairness

 Though ticketlocks solve the fairness problem, it worsens LWP, LHP problems.
pv-ticketlocks tried to address this. But we can further improve at the
cost of relaxed fairness.

In this patch, we form a batch of eligible lock holders and we serve the 
eligible
(to hold the lock) batches in FIFO, but the lock-waiters within eligible batch 
are served
in an unfair manner. This increases probability of any eligible lock-holder 
being
in running state (to an average of (batch_size/2)-1). It also provides needed
bounded starvation since any lock requester can not acquire more than batch_size
times repeatedly during contention. On the negetive side we would need an extra 
cmpxchg.

 The patch has the batch size of 4. (As we know increasing  batch size means we 
are
closer to unfair locks and batch size of 1 = ticketlock).

Result:
Test system: 32cpu 2node machine w/ 64GB each (16 pcpu machine +ht).
Guests:  8GB  16vcpu guests (average of 8 iterations)

 % Improvements with kvm guests (batch size = 4):
  ebizzy_0.5x   4.3
  ebizzy_1.0x   7.8
  ebizzy_1.5x  23.4
  ebizzy_2.0x  48.6

Baremetal:
ebizzy showed very high stdev, kernbench result was good but both of them did
not show much difference.

ebizzy: rec/sec higher is better
base50452
patched 50703

kernbench  time in sec (lesser is better)
base48.9 sec
patched 48.8 sec

Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/spinlock.h   | 35 +--
 arch/x86/include/asm/spinlock_types.h | 14 ++
 arch/x86/kernel/kvm.c |  6 --
 3 files changed, 39 insertions(+), 16 deletions(-)

TODO:
- we need an intelligent way to nullify the effect of batching for baremetal
 (because extra cmpxchg is not required).

- we may have to make batch size as kernel arg to solve above problem
 (to run same kernel for host/guest). Increasing batch size also seem to help
 virtualized guest more, so we will have flexibility of tuning depending on vm 
size.

- My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem to
  show the impact of extra cmpxchg. but there should be effect of extra cmpxchg.

- virtualized guest had slight impact on 1x cases of some benchmarks but we 
have got
 impressive performance for 1x cases. So overall, patch needs exhaustive 
tesing.

- we can further add dynamically changing batch_size implementation 
(inspiration and
  hint by Paul McKenney) as necessary.
 
 I have found that increasing  batch size gives excellent improvements for 
 overcommitted guests. I understand that we need more exhaustive testing.

 Please provide your suggestion and comments.

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..87685f1 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,23 +81,36 @@ static __always_inline int 
arch_spin_value_unlocked(arch_spinlock_t lock)
  */
 static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
-   register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
+   register struct __raw_tickets inc = { .tail = TICKET_LOCK_TAIL_INC };
+   struct __raw_tickets new;
 
inc = xadd(lock-tickets, inc);
-   if (likely(inc.head == inc.tail))
-   goto out;
 
inc.tail = ~TICKET_SLOWPATH_FLAG;
for (;;) {
unsigned count = SPIN_THRESHOLD;
 
do {
-   if (ACCESS_ONCE(lock-tickets.head) == inc.tail)
-   goto out;
+   if ((inc.head  TICKET_LOCK_BATCH_MASK) == (inc.tail 
+   TICKET_LOCK_BATCH_MASK))
+   goto spin;
cpu_relax();
+   inc.head = ACCESS_ONCE(lock-tickets.head);
} while (--count);
__ticket_lock_spinning(lock, inc.tail);
}
+spin:
+   for (;;) {
+   inc.head = ACCESS_ONCE(lock-tickets.head);
+   if (!(inc.head  TICKET_LOCK_HEAD_INC)) {
+   new.head = inc.head | TICKET_LOCK_HEAD_INC;
+   if (cmpxchg(lock-tickets.head, inc.head, new.head)
+   == inc.head)
+   goto out;
+   }
+   cpu_relax();
+   }
+
 out:   barrier();  /* make sure nothing creeps before the lock is taken */
 }
 
@@ -109,7 +122,8 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
if (old.tickets.head != (old.tickets.tail  ~TICKET_SLOWPATH_FLAG))
return 0;
 
-   new.head_tail = old.head_tail + (TICKET_LOCK_INC  TICKET_SHIFT);
+   new.head_tail = 

Re: [PATCH 0/8] Bug fixes for HV KVM, v2

2014-05-28 Thread Alexander Graf


On 26.05.14 11:48, Paul Mackerras wrote:

This series of patches fixes a few bugs that have been found in
testing HV KVM recently.  It also adds workarounds for a couple of
POWER8 PMU bugs, fixes the definition of KVM_REG_PPC_WORT, and adds
some things that were missing from Documentation/virtual/kvm/api.txt.

The patches are against Alex Graf's kvm-ppc-queue branch.

Please apply for 3.16.  The first couple would be safe to go into 3.15
as well, and probably should.


Thanks, applied all to kvm-ppc-queue. I don't think it's necessary for 
the first few to really go into 3.15 - if user space uses a header from 
there it will just get unimplemented ONE_REG returns for WORT.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: release VCPUs while VM is running,

2014-05-28 Thread Igor Mammedov
On Wed, 28 May 2014 13:59:31 +0200
Anshul Makkar anshul.mak...@profitbricks.com wrote:

 Hi,
 
 Is there any work in progress on releasing VCPUs while the VM is still
 in progress (Hot unplugging) ?
As far as I know nobody works on it yet.

 
 The following patches were not accepted.
 
 Thanks
 Anshul Makkar
 www.justkernel.com
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] KVM: PPC: Book3S: Controls for in-kernel PAPR hypercall handling

2014-05-28 Thread Alexander Graf


On 26.05.14 14:17, Paul Mackerras wrote:

This provides a way for userspace controls which PAPR hcalls get
handled in the kernel.  Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS.  The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.

Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for
the KVM_CAP_PPC_ENABLE_HCALL capability.  The args field of the
struct kvm_enable_cap specifies the hcall number in args[0] and
the enable/disable flag in args[1]; 0 means disable in-kernel
handling (so that the hcall will always cause an exit to userspace)
and 1 means enable.

Enabling or disabling in-kernel handling of an hcall is effective
across the whole VM, even though the KVM_ENABLE_CAP ioctl is
applied to a vcpu.

When a VM is created, an initial set of hcalls are enabled for
in-kernel handling.  The set that is enabled is the set that have
an in-kernel implementation at this point.  Any new hcall
implementations from this point onwards should not be added to the
default set.

No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.

Signed-off-by: Paul Mackerras pau...@samba.org
---
  Documentation/virtual/kvm/api.txt   | 17 +++
  arch/powerpc/include/asm/kvm_book3s.h   |  1 +
  arch/powerpc/include/asm/kvm_host.h |  2 ++
  arch/powerpc/kernel/asm-offsets.c   |  1 +
  arch/powerpc/kvm/book3s_hv.c| 51 +
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 +++
  arch/powerpc/kvm/book3s_pr.c|  5 
  arch/powerpc/kvm/book3s_pr_papr.c   | 37 
  arch/powerpc/kvm/powerpc.c  | 19 
  include/uapi/linux/kvm.h|  1 +
  10 files changed, 145 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 6b0225d..dfd6e0c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2983,3 +2983,20 @@ Parameters: args[0] is the XICS device fd
  args[1] is the XICS CPU number (server ID) for this vcpu
  
  This capability connects the vcpu to an in-kernel XICS device.

+
+6.8 KVM_CAP_PPC_ENABLE_HCALL
+
+Architectures: ppc
+Parameters: args[0] is the PAPR hcall number
+   args[1] is 0 to disable, 1 to enable in-kernel handling
+
+This capability controls whether individual PAPR hypercalls (hcalls)
+get handled by the kernel or not.  Enabling or disabling in-kernel
+handling of an hcall is effective across the VM.  On creation, an


Hrm. Could we move the CAP to vm level then?


+initial set of hcalls are enabled for in-kernel handling, which
+consists of those hcalls for which in-kernel handlers were implemented
+before this capability was implemented.  If disabled, the kernel will
+not to attempt to handle the hcall, but will always exit to userspace
+to handle it.  Note that it may not make sense to enable some and
+disable others of a group of related hcalls, but KVM will not prevent
+userspace from doing that.
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index f52f656..772044b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -189,6 +189,7 @@ extern void kvmppc_hv_entry_trampoline(void);
  extern u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst);
  extern ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst);
  extern int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd);
+extern void kvmppc_pr_init_default_hcalls(struct kvm *kvm);
  extern void kvmppc_copy_to_svcpu(struct kvmppc_book3s_shadow_vcpu *svcpu,
 struct kvm_vcpu *vcpu);
  extern void kvmppc_copy_from_svcpu(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index bb66d8b..2889587 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -34,6 +34,7 @@
  #include asm/processor.h
  #include asm/page.h
  #include asm/cacheflush.h
+#include asm/hvcall.h
  
  #define KVM_MAX_VCPUS		NR_CPUS

  #define KVM_MAX_VCORESNR_CPUS
@@ -263,6 +264,7 @@ struct kvm_arch {
  #ifdef CONFIG_PPC_BOOK3S_64
struct list_head spapr_tce_tables;
struct list_head rtas_tokens;
+   DECLARE_BITMAP(enabled_hcalls, MAX_HCALL_OPCODE/4 + 1);
  #endif
  #ifdef CONFIG_KVM_MPIC
struct openpic *mpic;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 93e1465..c427b51 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -492,6 +492,7 @@ int main(void)
DEFINE(KVM_HOST_SDR1, 

Re: [PATCH 2/3] KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled

2014-05-28 Thread Alexander Graf


On 26.05.14 14:17, Paul Mackerras wrote:

This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.

Signed-off-by: Paul Mackerras pau...@samba.org


Please add this as sanity check to the default enabled list as well - in 
case we lose the ability to enable an in-kernel hcall later.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling

2014-05-28 Thread Alexander Graf


On 26.05.14 14:17, Paul Mackerras wrote:

From: Michael Neuling mi...@neuling.org

This adds support for the H_SET_MODE hcall.  This hcall is a
multiplexer that has several functions, some of which are called
rarely, and some which are potentially called very frequently.
Here we add support for the functions that set the debug registers
CIABR (Completed Instruction Address Breakpoint Register) and
DAWR/DAWRX (Data Address Watchpoint Register and eXtension),
since they could be updated by the guest as often as every context
switch.

This also adds a kvmppc_power8_compatible() function to test to see
if a guest is compatible with POWER8 or not.  The CIABR and DAWR/X
only exist on POWER8.

Signed-off-by: Michael Neuling mi...@neuling.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
  arch/powerpc/include/asm/hvcall.h |  6 +
  arch/powerpc/kvm/book3s_hv.c  | 51 ++-
  2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 5dbbb29..85bc8c0 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -279,6 +279,12 @@
  #define H_GET_24X7_DATA   0xF07C
  #define H_GET_PERF_COUNTER_INFO   0xF080
  
+/* Values for 2nd argument to H_SET_MODE */

+#define H_SET_MODE_RESOURCE_SET_CIABR  1
+#define H_SET_MODE_RESOURCE_SET_DAWR   2
+#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE3
+#define H_SET_MODE_RESOURCE_LE 4
+
  #ifndef __ASSEMBLY__
  
  /**

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1f91130..53e2b63 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -557,6 +557,47 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
vcpu-arch.dtl.dirty = true;
  }
  
+static bool kvmppc_power8_compatible(struct kvm_vcpu *vcpu)

+{
+   if (vcpu-arch.vcore-arch_compat = PVR_ARCH_207)
+   return true;
+   if ((!vcpu-arch.vcore-arch_compat) 
+   cpu_has_feature(CPU_FTR_ARCH_207S))
+   return true;
+   return false;
+}
+
+static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, unsigned long mflags,
+unsigned long resource, unsigned long value1,
+unsigned long value2)
+{
+   switch (resource) {
+   case H_SET_MODE_RESOURCE_SET_CIABR:
+   if (!kvmppc_power8_compatible(vcpu))
+   return H_P2;
+   if (value2)
+   return H_P4;
+   if (mflags)
+   return H_UNSUPPORTED_FLAG_START;
+   if ((value1  0x3) == 0x3)


What is this?

Alex


+   return H_P3;
+   vcpu-arch.ciabr  = value1;
+   return H_SUCCESS;
+   case H_SET_MODE_RESOURCE_SET_DAWR:
+   if (!kvmppc_power8_compatible(vcpu))
+   return H_P2;
+   if (mflags)
+   return H_UNSUPPORTED_FLAG_START;
+   if (value2  DABRX_HYP)
+   return H_P4;
+   vcpu-arch.dawr  = value1;
+   vcpu-arch.dawrx = value2;
+   return H_SUCCESS;
+   default:
+   return H_TOO_HARD;
+   }
+}
+
  int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
  {
unsigned long req = kvmppc_get_gpr(vcpu, 3);
@@ -626,7 +667,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
  
  		/* Send the error out to userspace via KVM_RUN */

return rc;
-
+   case H_SET_MODE:
+   ret = kvmppc_h_set_mode(vcpu, kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5),
+   kvmppc_get_gpr(vcpu, 6),
+   kvmppc_get_gpr(vcpu, 7));
+   if (ret == H_TOO_HARD)
+   return RESUME_HOST;
+   break;
case H_XIRR:
case H_CPPR:
case H_EOI:
@@ -652,6 +700,7 @@ static int kvmppc_hcall_impl_hv(struct kvm *kvm, unsigned 
long cmd)
case H_PROD:
case H_CONFER:
case H_REGISTER_VPA:
+   case H_SET_MODE:
  #ifdef CONFIG_KVM_XICS
case H_XIRR:
case H_CPPR:


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvmtool: virtio: pass trapped vcpu to IO accessors

2014-05-28 Thread Will Deacon
On Tue, May 27, 2014 at 11:24:19AM +0100, Marc Zyngier wrote:
 The recent introduction of bi-endianness on arm/arm64 had the
 odd effect of breaking virtio-pci support on these platforms, as the
 device endian field defaults to being VIRTIO_ENDIAN_HOST, which
 is the wrong thing to have on a bi-endian capable architecture.
 
 The fix is to check for the endianness on the ioport path the
 same way we do it for mmio, which implies passing the vcpu all
 the way down. Patch is a bit ugly, but aligns MMIO and ioport nicely.
 
 Tested on arm64 and x86.
 
 Acked-by: Will Deacon will.dea...@arm.com
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com

Please can you pick this one up Pekka? It fixes an unfortunate regression
in PCI caused by the bi-endianness series.

Cheers,

Will
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Book3S PR: Rework SLB switching code

2014-05-28 Thread Alexander Graf


On 17.05.14 07:36, Paul Mackerras wrote:

On Thu, May 15, 2014 at 02:43:53PM +0200, Alexander Graf wrote:

On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

I think this is a great idea; I have been thinking we should do
something like this.


While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.

Signed-off-by: Alexander Graf ag...@suse.de

[snip]


+#define SHADOW_SLB_ENTRY_LEN   0x10

Normally we'd define structure offsets/sizes like this in
asm-offsets.c.  However, since the structure can't change I guess this
is OK.


/* Fill SLB with our shadow */
  
+	lis	r7, SLB_ESID_V@h

+
lbz r12, SVCPU_SLB_MAX(r3)
mulli   r12, r12, 16
addir12, r12, SVCPU_SLB
@@ -99,7 +76,7 @@ slb_loop_enter:
  
  	ld	r10, 0(r11)
  
-	rldicl. r0, r10, 37, 63

+   and.r9, r10, r7

Or...
andis.  r9, r10, SLB_ESID_V@h
and save a register and an instruction.


Good idea :)




cmpdcr0, r11, r12
blt slb_loop_enter
  
+	isync

+   sync

Why?


Hrm, I guess I was trying to narrow down why things broke. I'll omit it 
and see whether my test machine can still successfully run PR KVM.





+BEGIN_FW_FTR_SECTION
+
+   /* Declare SLB shadow as SLB_NUM_BOLTED entries big */
+
+   li  r8, SLB_NUM_BOLTED
+   stb r8, 3(r11)

Currently it's true that slb_shadow.persistent is always
SLB_NUM_BOLTED, but if you are going to embed that assumption here in


We had that assumption before too ;)


the KVM code you should at least add some comments over in
arch/powerpc/mm/slb.c and in arch/powerpc/kernel/paca.c (where
slb_shadow.persistent gets initialized) warning people that if they
break that assumption they need to fix KVM code as well.


but I like warnings, so I'll add some.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/21] MIPS: KVM: Add nanosecond count bias KVM register

2014-05-28 Thread James Hogan
Hi Paolo, David, Andreas,

On 26/04/14 10:37, Paolo Bonzini wrote:
 Il 26/04/2014 00:34, James Hogan ha scritto:
 So yes, you could technically manage without (4) by using (2) ((4) was
 implemented first), but I think it probably still has some value since
 you can
 do it with a single ioctl rather than 4 ioctls (freeze timer, read
 resume_time, read or write count, unfreeze timer).

 Enough value to be worthwhile? I haven't really made up my mind yet
 but I'm
 leaning towards yes.
 
 It would be interesting to see how the userspace patches use this
 independent of COUNT_RESUME.

The implementation in QEMU that I've settled upon makes do with just
COUNT_CTL and COUNT_RESUME, but with a slight kernel modification so
that COUNT_RESUME is writeable (to any positive monotonic nanosecond
value = now). It works fairly cleanly and correctly even with stopping
and starting VM clock (gdb, stop/cont, savevm/loadvm, live migration),
to match the behaviour of the existing mips cpu timer emulation, so I
plan to drop this bias patch, and will post a v2 patchset soon with just
a few modifications.

QEMU saves the state of the KVM timer from kvm_arch_get_registers() or
when the VM clock is stopped (via a vmstate notifier) - whichever comes
first. It then restores the KVM timer from kvm_arch_put_registers() or
when the VM clock is started - whichever comes last.
Example sequence:
stop VM - SAVE
get regs - vm clock already stopped, not saved again
start VM - regs dirty, not restored
put regs - vm clock running, RESTORE

Saving involves:
COUNT_CTL.DC = 1 (freeze KVM timer)
get CP0_Cause, CP0_Count and COUNT_RESUME
store a copy of the calculated VM clock @COUNT_RESUME nanoseconds
(i.e. the VM clock corresponding to the saved CP0_Count)

Restoring involves:
put COUNT_RESUME = now - (vm clock @now - saved vm clock)
(resume occurs at the same interval into the past that the VM clock has
increased since saving)
put CP0_Cause, CP0_Count
(the stored CP0_Count applies at that resume time)
COUNT_CTL.DC = 0 (resume KVM timer from CP0_Count at COUNT_RESUME)

I'll post an updated QEMU patchset ASAP after the KVM patchset, but
wanted to explain how this API can actually be used. Does it sound
reasonable?

Thanks
James
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] KVM: PPC: Book3S PR: Use SLB entry 0

2014-05-28 Thread Alexander Graf
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.

Just enable use of SLB entry 0 for our shadow SLB code.

Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - flush ERAT by writing 0 to slb0
---
 arch/powerpc/kvm/book3s_64_mmu_host.c | 11 ---
 arch/powerpc/kvm/book3s_64_slb.S  |  3 ++-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c 
b/arch/powerpc/kvm/book3s_64_mmu_host.c
index e2efb85..0ac9839 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -271,11 +271,8 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, 
ulong esid)
int found_inval = -1;
int r;
 
-   if (!svcpu-slb_max)
-   svcpu-slb_max = 1;
-
/* Are we overwriting? */
-   for (i = 1; i  svcpu-slb_max; i++) {
+   for (i = 0; i  svcpu-slb_max; i++) {
if (!(svcpu-slb[i].esid  SLB_ESID_V))
found_inval = i;
else if ((svcpu-slb[i].esid  ESID_MASK) == esid) {
@@ -285,7 +282,7 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, 
ulong esid)
}
 
/* Found a spare entry that was invalidated before */
-   if (found_inval  0) {
+   if (found_inval = 0) {
r = found_inval;
goto out;
}
@@ -359,7 +356,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong 
ea, ulong seg_size)
ulong seg_mask = -seg_size;
int i;
 
-   for (i = 1; i  svcpu-slb_max; i++) {
+   for (i = 0; i  svcpu-slb_max; i++) {
if ((svcpu-slb[i].esid  SLB_ESID_V) 
(svcpu-slb[i].esid  seg_mask) == ea) {
/* Invalidate this entry */
@@ -373,7 +370,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong 
ea, ulong seg_size)
 void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu)
 {
struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-   svcpu-slb_max = 1;
+   svcpu-slb_max = 0;
svcpu-slb[0].esid = 0;
svcpu_put(svcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 596140e..84c52c6 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -138,7 +138,8 @@ slb_do_enter:
 
/* Restore bolted entries from the shadow and fix it along the way */
 
-   /* We don't store anything in entry 0, so we don't need to take care of 
it */
+   li  r0, r0
+   slbmte  r0, r0
slbia
isync
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] KVM: PPC: Book3S PR: Rework SLB switching code

2014-05-28 Thread Alexander Graf
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.

Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - use andis.
  - remove superfluous isync/sync
  - add KVM warning comments in SLB bolting code
---
 arch/powerpc/kernel/paca.c   |  3 ++
 arch/powerpc/kvm/book3s_64_slb.S | 83 ++--
 arch/powerpc/mm/slb.c|  2 +-
 3 files changed, 42 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index ad302f8..d6e195e 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -98,6 +98,9 @@ static inline void free_lppacas(void) { }
 /*
  * 3 persistent SLBs are registered here.  The buffer will be zero
  * initially, hence will all be invaild until we actually write them.
+ *
+ * If you make the number of persistent SLB entries dynamic, please also
+ * update PR KVM to flush and restore them accordingly.
  */
 static struct slb_shadow *slb_shadow;
 
diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 84c52c6..3589c4e 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -17,29 +17,9 @@
  * Authors: Alexander Graf ag...@suse.de
  */
 
-#define SHADOW_SLB_ESID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10))
-#define SHADOW_SLB_VSID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10) + 0x8)
-#define UNBOLT_SLB_ENTRY(num) \
-   li  r11, SHADOW_SLB_ESID(num);  \
-   LDX_BE  r9, r12, r11;   \
-   /* Invalid? Skip. */;   \
-   rldicl. r0, r9, 37, 63; \
-   beq slb_entry_skip_ ## num; \
-   xoris   r9, r9, SLB_ESID_V@h;   \
-   STDX_BE r9, r12, r11;   \
-  slb_entry_skip_ ## num:
-
-#define REBOLT_SLB_ENTRY(num) \
-   li  r8, SHADOW_SLB_ESID(num);   \
-   li  r7, SHADOW_SLB_VSID(num);   \
-   LDX_BE  r10, r11, r8;   \
-   cmpdi   r10, 0; \
-   beq slb_exit_skip_ ## num;  \
-   orisr10, r10, SLB_ESID_V@h; \
-   LDX_BE  r9, r11, r7;\
-   slbmte  r9, r10;\
-   STDX_BE r10, r11, r8;   \
-slb_exit_skip_ ## num:
+#define SHADOW_SLB_ENTRY_LEN   0x10
+#define OFFSET_ESID(x) (SHADOW_SLB_ENTRY_LEN * x)
+#define OFFSET_VSID(x) ((SHADOW_SLB_ENTRY_LEN * x) + 8)
 
 /**
  **
@@ -63,20 +43,15 @@ slb_exit_skip_ ## num:
 * SVCPU[LR]  = guest LR
 */
 
-   /* Remove LPAR shadow entries */
+BEGIN_FW_FTR_SECTION
 
-#if SLB_NUM_BOLTED == 3
+   /* Declare SLB shadow as 0 entries big */
 
-   ld  r12, PACA_SLBSHADOWPTR(r13)
+   ld  r11, PACA_SLBSHADOWPTR(r13)
+   li  r8, 0
+   stb r8, 3(r11)
 
-   /* Remove bolted entries */
-   UNBOLT_SLB_ENTRY(0)
-   UNBOLT_SLB_ENTRY(1)
-   UNBOLT_SLB_ENTRY(2)
-   
-#else
-#error unknown number of bolted entries
-#endif
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR)
 
/* Flush SLB */
 
@@ -99,7 +74,7 @@ slb_loop_enter:
 
ld  r10, 0(r11)
 
-   rldicl. r0, r10, 37, 63
+   andis.  r9, r10, SLB_ESID_V@h
beq slb_loop_enter_skip
 
ld  r9, 8(r11)
@@ -136,24 +111,42 @@ slb_do_enter:
 *
 */
 
-   /* Restore bolted entries from the shadow and fix it along the way */
+   /* Remove all SLB entries that are in use. */
 
li  r0, r0
slbmte  r0, r0
slbia
-   isync
 
-#if SLB_NUM_BOLTED == 3
+   /* Restore bolted entries from the shadow */
 
ld  r11, PACA_SLBSHADOWPTR(r13)
 
-   REBOLT_SLB_ENTRY(0)
-   REBOLT_SLB_ENTRY(1)
-   REBOLT_SLB_ENTRY(2)
-   
-#else
-#error unknown number of bolted entries
-#endif
+BEGIN_FW_FTR_SECTION
+
+   /* Declare SLB shadow as SLB_NUM_BOLTED entries big */
+
+   li  r8, SLB_NUM_BOLTED
+   stb r8, 3(r11)
+
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR)
+
+   /* Manually load all entries from shadow SLB */
+
+   li  r8, SLBSHADOW_SAVEAREA
+   li  r7, 

Re: [PATCH v2 7/9] arm64: KVM: add trap handlers for AArch32 debug registers

2014-05-28 Thread Marc Zyngier
On 25/05/14 16:35, Christoffer Dall wrote:
 On Tue, May 20, 2014 at 05:55:43PM +0100, Marc Zyngier wrote:
 Add handlers for all the AArch32 debug registers that are accessible
 from EL0 or EL1. The code follow the same strategy as the AArch64
 counterpart with regards to tracking the dirty state of the debug
 registers.

 Reviewed-by: Anup Patel anup.pa...@linaro.org
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm64/include/asm/kvm_asm.h |   9 +++
  arch/arm64/kvm/sys_regs.c| 137 
 ++-
  2 files changed, 145 insertions(+), 1 deletion(-)

 diff --git a/arch/arm64/include/asm/kvm_asm.h 
 b/arch/arm64/include/asm/kvm_asm.h
 index 12f9dd7..993a7db 100644
 --- a/arch/arm64/include/asm/kvm_asm.h
 +++ b/arch/arm64/include/asm/kvm_asm.h
 @@ -93,6 +93,15 @@
  #define c10_AMAIR0  (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */
  #define c10_AMAIR1  (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */
  #define c14_CNTKCTL (CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */
 +
 +#define cp14_DBGDSCRext (MDSCR_EL1 * 2)
 +#define cp14_DBGBCR0(DBGBCR0_EL1 * 2)
 +#define cp14_DBGBVR0(DBGBVR0_EL1 * 2)
 +#define cp14_DBGBXVR0   (cp14_DBGBVR0 + 1)
 +#define cp14_DBGWCR0(DBGWCR0_EL1 * 2)
 +#define cp14_DBGWVR0(DBGWVR0_EL1 * 2)
 +#define cp14_DBGDCCINT  (MDCCINT_EL1 * 2)
 +
  #define NR_COPRO_REGS   (NR_SYS_REGS * 2)
  
  #define ARM_EXCEPTION_IRQ 0
 diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
 index 98d60d1..5960d5b 100644
 --- a/arch/arm64/kvm/sys_regs.c
 +++ b/arch/arm64/kvm/sys_regs.c
 @@ -474,12 +474,148 @@ static const struct sys_reg_desc sys_reg_descs[] = {
NULL, reset_val, FPEXC32_EL2, 0x70 },
  };
  
 +static bool trap_dbgidr(struct kvm_vcpu *vcpu,
 +const struct sys_reg_params *p,
 +const struct sys_reg_desc *r)
 +{
 +if (p-is_write) {
 +return ignore_write(vcpu, p);
 +} else {
 +u64 dfr = read_cpuid(ID_AA64DFR0_EL1);
 +u64 pfr = read_cpuid(ID_AA64PFR0_EL1);
 +u32 el3 = !!((pfr  12)  0xf);
 +
 +*vcpu_reg(vcpu, p-Rt) = dfr  20)  0xf)  28) |
 +  (((dfr  12)  0xf)  24) |
 +  (((dfr  28)  0xf)  20) |
 +  (6  16) | (el3  14) | (el3  
 12));
 +return true;
 +}
 +}
 +
 +static bool trap_debug32(struct kvm_vcpu *vcpu,
 + const struct sys_reg_params *p,
 + const struct sys_reg_desc *r)
 +{
 +if (p-is_write) {
 +vcpu_cp14(vcpu, r-reg) = *vcpu_reg(vcpu, p-Rt);
 +vcpu-arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY;
 +} else {
 +*vcpu_reg(vcpu, p-Rt) = vcpu_cp14(vcpu, r-reg);
 +}
 +
 +return true;
 +}
 +
 +#define DBG_BCR_BVR_WCR_WVR(n)  \
 +/* DBGBCRn */   \
 +{ Op1( 0), CRn( 0), CRm((n)), Op2( 4), trap_debug32,\
 +  NULL, (cp14_DBGBCR0 + (n) * 2) }, \
 +/* DBGBVRn */   \
 +{ Op1( 0), CRn( 0), CRm((n)), Op2( 5), trap_debug32,\
 +  NULL, (cp14_DBGBVR0 + (n) * 2) }, \
 
 I think you switched the order of DBGBCRn and DBGBVRn here?  Opc2==4 is
 the DBGBVR and Opc2==5 is the DBGBCR0.

I can't get it right, can I? It's amazing that it worked...

 +/* DBGWVRn */   \
 +{ Op1( 0), CRn( 0), CRm((n)), Op2( 6), trap_debug32,\
 +  NULL, (cp14_DBGWVR0 + (n) * 2) }, \
 +/* DBGWCRn */   \
 +{ Op1( 0), CRn( 0), CRm((n)), Op2( 7), trap_debug32,\
 +  NULL, (cp14_DBGWCR0 + (n) * 2) }
 +
 +#define DBGBXVR(n)  \
 +{ Op1( 0), CRn( 1), CRm((n)), Op2( 1), trap_debug32,\
 +  NULL, cp14_DBGBXVR0 + n * 2 }
 +
  /* Trapped cp14 registers */
  static const struct sys_reg_desc cp14_regs[] = {
 +/* DBGIDR */
 +{ Op1( 0), CRn( 0), CRm( 0), Op2( 0), trap_dbgidr },
 +/* DBGDTRRXext */
 +{ Op1( 0), CRn( 0), CRm( 0), Op2( 2), trap_raz_wi },
 +
 +DBG_BCR_BVR_WCR_WVR(0),
 +/* DBGDSCRint */
 +{ Op1( 0), CRn( 0), CRm( 1), Op2( 0), trap_raz_wi },
 +DBG_BCR_BVR_WCR_WVR(1),
 +/* DBGDCCINT */
 +{ Op1( 0), CRn( 0), CRm( 2), Op2( 0), trap_debug32 },
 +/* DBGDSCRext */
 +{ Op1( 0), CRn( 0), CRm( 2), Op2( 2), trap_debug32 },
 +DBG_BCR_BVR_WCR_WVR(2),
 +/* DBGDTRTXext */
 +{ Op1( 0), CRn( 0), CRm( 3), Op2( 2), trap_raz_wi },
 +DBG_BCR_BVR_WCR_WVR(3),
 +DBG_BCR_BVR_WCR_WVR(4),
 
 So we don't have a handler for the TRRX_int (nor the TRRX_EL0).  I
 understand that it doesn't make sense to keep state for them or
 context-switch them, but do we never trap on these and if we do, is 

Re: [PATCH v2 9/9] arm64: KVM: enable trapping of all debug registers

2014-05-28 Thread Marc Zyngier
On 25/05/14 16:36, Christoffer Dall wrote:
 On Tue, May 20, 2014 at 05:55:45PM +0100, Marc Zyngier wrote:
 Enable trapping of the debug registers, preventing the guests to
 mess with the host state (and allowing guests to use the debug
 infrastructure as well).

 Reviewed-by: Anup Patel anup.pa...@linaro.org
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm64/kvm/hyp.S | 8 
  1 file changed, 8 insertions(+)

 diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
 index 73ec5c4..72ed0bf8 100644
 --- a/arch/arm64/kvm/hyp.S
 +++ b/arch/arm64/kvm/hyp.S
 @@ -777,6 +777,14 @@ __kvm_hyp_code_start:
  mrs x2, mdcr_el2
  and x2, x2, #MDCR_EL2_HPMN_MASK
  orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
 +orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
 
 so we unconditionally trap on the OS register access, but we don't
 properly emulate these do we?  What's the rationale?  (atmittedly,
 again, I'm not 100% clear on how the OS lock thingy is supposed to
 work/be used).

The rational is that we don't want the guest to mess with the host
state, which may have decided to use the OSlock thing (we don't use it
at all, but who knows...). So we trap it, discard whatever the guest
wants to put there, and carry on.

I'm not sure if this would confuse any guest (we only have Linux so far,
so I'm not too worried). Should a more adventurous guest show up, we can
revisit this.

M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/21] MIPS: KVM: Add nanosecond count bias KVM register

2014-05-28 Thread Paolo Bonzini

Il 28/05/2014 16:21, James Hogan ha scritto:

The implementation in QEMU that I've settled upon makes do with just
COUNT_CTL and COUNT_RESUME, but with a slight kernel modification so
that COUNT_RESUME is writeable (to any positive monotonic nanosecond
value = now). It works fairly cleanly and correctly even with stopping
and starting VM clock (gdb, stop/cont, savevm/loadvm, live migration),
to match the behaviour of the existing mips cpu timer emulation, so I
plan to drop this bias patch, and will post a v2 patchset soon with just
a few modifications.


It makes sense to have writable registers in the emulator, even if they 
are read-only in real hardware.  We also do that for x86, FWIW.


So the idea looks okay to me.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Divide error in kvm_unlock_kick()

2014-05-28 Thread Chris Webb
Running a 3.14.4 x86-64 SMP guest kernel on qemu-2.0, with kvm enabled and
-cpu host on a 3.14.4 AMD Opteron host, I'm seeing a reliable kernel panic from
the guest shortly after boot. I think is happening in kvm_unlock_kick() in the
paravirt_ops code:

divide error:  [#1] PREEMPT SMP 
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.4-guest #16
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
task: 88007d384880 ti: 88007d3b2000 task.ti: 88007d3b2000
RIP: 0010:[8102f0cc]  [8102f0cc] kvm_unlock_kick+0x63/0x6b
RSP: 0018:88007fc83db0  EFLAGS: 00010046
RAX: 0005 RBX:  RCX: 0003
RDX: 0003 RSI: 88007fd91d40 RDI: 0008
RBP: 88007fd91d40 R08:  R09: 8198e840
R10: 88007cbc7400 R11: 88007cbc9d00 R12: cec0
R13: 0001 R14: 88007fd91d40 R15: 0001
FS:  7ff42a4d3700() GS:88007fc8() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 7ff42a290006 CR3: 7c76d000 CR4: 000406e0
Stack:
 88007fd11d40 88007d361cc0 88007fc8d240 81563990
 810e42a6 00038102fa73 0282 
 88007fd12668 88007fc83ecc 00ff 006b
Call Trace:
 IRQ 
 [81563990] ? _raw_spin_unlock+0x57/0x61
 [810e42a6] ? load_balance+0x4ff/0x783
 [810e4681] ? rebalance_domains+0x157/0x20c
 [810e4841] ? run_rebalance_domains+0x10b/0x148
 [810be7c1] ? __do_softirq+0xec/0x1fe
 [810beacc] ? irq_exit+0x48/0x8d
 [815658dd] ? reschedule_interrupt+0x6d/0x80
 EOI 
 [8100a842] ? hard_enable_TSC+0x2e/0x2e
 [8102fbe1] ? native_safe_halt+0x2/0x3
 [8100a853] ? default_idle+0x11/0x14
 [810ed4e7] ? cpu_startup_entry+0x153/0x1d2
 [810277ad] ? start_secondary+0x220/0x23c
Code: 0c c5 40 50 87 81 49 8d 04 0c 48 8b 30 48 39 ee 75 ca 8a 40 08 38 d8 75 
c3 48 c7 c0 22 b0 00 00 31 db 0f b7 0c 08 b8 05 00 00 00 0f 01 c1 5b 5d 41 5c 
c3 4c 8d 54 24 08 48 83 e4 f0 b9 0a 00 00 
RIP  [8102f0cc] kvm_unlock_kick+0x63/0x6b
 RSP 88007fc83db0
---[ end trace 2278d9742b4dff74 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Shutting down cpus with NMI
Kernel Offset: 0x0 from 0x8100 (relocation range: 
0x8000-0x9fff)

My host kernel config is http://cdw.me.uk/tmp/host-config.txt and the guest
config is http://cdw.me.uk/tmp/guest-config.txt with qemu command line:

  qemu-system-x86 -enable-kvm -cpu qemu64 -machine q35 -m 2048 -name $1 \
-smp sockets=1,cores=4 -pidfile /run/$1.pid -runas nobody \
-serial stdio -vga none -vnc none -kernel /boot/vmlinuz-guest \
-append console=ttyS0 root=/dev/vda \
-drive file=/dev/guest/$1,cache=none,format=raw,if=virtio \
-device virtio-net-pci,netdev=nic,mac=$( /sys/class/net/$1/address) \
-netdev tap,id=nic,fd=3 3/dev/tap$( /sys/class/net/$1/ifindex)

I can stop this crash by disabling CONFIG_PARAVIRT_SPINLOCKS in my guest
kernel, running with -cpu qemu64 instead of -cpu host, or running with -smp 1
instead of -smp 4. (Removing/changing the -machine q35 makes no difference.)

My CPU flags inside the crashing guest look like this:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb lm rep_good nopl
extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave
avx f16c hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw xop fma4 tbm arat npt nrip_save tsc_adjust bmi1

whereas in a (working) -cpu qemu64 guest, they look like this:

fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx
fxsr sse sse2 ht syscall nx lm nopl pni cx16 x2apic popcnt hypervisor lahf_lm
cmp_legacy svm abm sse4a

I tried enabling CONFIG_PARAVIRT_DEBUG, but no extra information was reported.

Very happy to do any testing at my end which might help track down what's going
on here.

Best wishes,

Chris.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 3/4] live migration support for VM dirty log management

2014-05-28 Thread Mario Smarduch
On 05/28/2014 02:08 AM, Christoffer Dall wrote:
 On Tue, May 27, 2014 at 02:55:21PM -0700, Mario Smarduch wrote:
 On 05/27/2014 01:12 PM, Christoffer Dall wrote:
 On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:
 
 [...]
 
 +
 +  /* If pgd, pud, pmd not present and you cross pmd range check next
 +   * index.
 +   */
 +  pgd = pgdp + pgd_index(ipa);
 +  if (unlikely(crosses_pmd  !pgd_present(*pgd))) {
 +  pgd = pgdp + pgd_index(next);
 +  if (!pgd_present(*pgd))
 +  return;
 +  }
 +
 +  pud = pud_offset(pgd, ipa);
 +  if (unlikely(crosses_pmd  !pud_present(*pud))) {
 +  pud = pud_offset(pgd, next);
 +  if (!pud_present(*pud))
 +  return;
 +  }
 +
 +  pmd = pmd_offset(pud, ipa);
 +  if (unlikely(crosses_pmd  !pmd_present(*pmd))) {
 +  pmd = pmd_offset(pud, next);
 +  if (!pmd_present(*pmd))
 +  return;
 +  }
 +
 +  for (;;) {
 +  pte = pte_offset_kernel(pmd, ipa);
 +  if (!pte_present(*pte))
 +  goto next_ipa;
 +
 +  if (kvm_s2pte_readonly(pte))
 +  goto next_ipa;
 +  kvm_set_s2pte_readonly(pte);
 +next_ipa:
 +  mask = mask - 1;
 +  if (!mask)
 +  break;
 +
 +  /* find next page */
 +  ipa = (gfnofst + __ffs(mask))  PAGE_SHIFT;
 +
 +  /* skip upper page table lookups */
 +  if (!crosses_pmd)
 +  continue;
 +
 +  pgd = pgdp + pgd_index(ipa);
 +  if (unlikely(!pgd_present(*pgd)))
 +  goto next_ipa;
 +  pud = pud_offset(pgd, ipa);
 +  if (unlikely(!pud_present(*pud)))
 +  goto next_ipa;
 +  pmd = pmd_offset(pud, ipa);
 +  if (unlikely(!pmd_present(*pmd)))
 +  goto next_ipa;
 +  }

 So I think the reason this is done separately on x86 is that they have
 an rmap structure for their gfn mappings so that they can quickly lookup
 ptes based on a gfn and write-protect it without having to walk the
 stage-2 page tables.

 Yes, they also use rmapps for mmu notifiers, invalidations on huge VMs and 
 large ranges resulted in excessive times. 

 Unless you want to introduce this on ARM, I think you will be much

 Eventually yes but that would also require reworking mmu notifiers.  I had 
 two step approach in mind. Initially get the dirty page marking to work, 
 TLB flushing, GIC/arch-timer migration, validate migration under various 
 stress loads (page reclaim) with mmu notifiers, test several VMs and 
 migration 
 times. 

 Then get rmapp (or something similar) working - eventually for huge VMs it's
 needed. In short two phases.

 better off just having a single (properly written) iterating
 write-protect function, that takes a start and end IPA and a bitmap for
 which pages to actually write-protect, which can then handle the generic
 case (either NULL or all-ones bitmap) or a specific case, which just
 traverses the IPA range given as input.  Such a function should follow
 the model of page table walk functions discussed previously
 (separate functions: wp_pgd_enties(), wp_pud_entries(),
 wp_pmd_entries(), wp_pte_entries()).

 However, you may want to verify my assumption above with the x86 people
 and look at sharing the rmap logic between architectures.

 In any case, this code is very difficult to read and understand, and it
 doesn't look at all like the other code we have to walk page tables.  I
 understand you are trying to optimize for performance (by skipping some
 intermediate page table level lookups), but you never declare that goal
 anywhere in the code or in the commit message.

 Marc's comment noticed I was walking a small range (128k), using upper table
 iterations that covered 1G, 2MB ranges. As you mention the code tries to
 optimize upper table lookups. Yes the function is too bulky, but I'm not 
 sure how 
 to remove the upper table checks since page tables may change between the 
 time pages are marked dirty and the log is retrieved. And if a memory slot 
 is very dirty walking upper tables will impact performance. I'll think some 
 more on this function.

 I think you should aim at the simplest possible implementation that
 functionally works, first.  Let's verify that this thing works, have
 clean working code that implementation-wise is as minimal as possible.
 
 Then we can run perf on that and see if our migrations are very slow,
 where we are actually spending time, and only then optimize it.
 
 The solution to this specific problem for the time being appears quite
 clear to me: Follow the exact same scheme as for unmap_range (the one I
 sent out here:
 https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009592.html, the
 diff is hard to read, so I recommend you apply the patch and look at the
 resulting code).  Have a similar scheme, call it wp_ipa_range() or
 something like that, and use that for now.

Ok I'll reuse that code. I'll need to 

Re: [PATCH v6 4/4] add 2nd stage page fault handling during live migration

2014-05-28 Thread Mario Smarduch

emslot dirty_bitmap during and after write protect.
 

 -Christoffer

Regarding huge pud that's causing some design problems, should huge PUD
pages be considered at all?

Thanks,
  Mario

 
 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 02/12] kvm tools: Fix print format warnings

2014-05-28 Thread Andreas Herrmann
This should fix following warnings

 builtin-stat.c:93:3: warning: format '%llu' expects argument of type 'long 
long unsigned int', but argument 2 has type '__u64' [-Wformat]
 builtin-run.c:188:4: warning: format '%Lu' expects argument of type 'long long 
unsigned int', but argument 3 has type '__u64' [-Wformat]
 builtin-run.c:554:3: warning: format '%llu' expects argument of type 'long 
long unsigned int', but argument 2 has type 'u64' [-Wformat]
 builtin-run.c:554:3: warning: format '%llu' expects argument of type 'long 
long unsigned int', but argument 3 has type 'u64' [-Wformat]
 builtin-run.c:645:3: warning: format '%Lu' expects argument of type 'long long 
unsigned int', but argument 4 has type 'u64' [-Wformat]
 disk/core.c:330:4: warning: format '%llu' expects argument of type 'long long 
unsigned int', but argument 4 has type '__dev_t' [-Wformat]
 disk/core.c:330:4: warning: format '%llu' expects argument of type 'long long 
unsigned int', but argument 5 has type '__dev_t' [-Wformat]
 disk/core.c:330:4: warning: format '%llu' expects argument of type 'long long 
unsigned int', but argument 6 has type '__ino64_t' [-Wformat]
 mmio.c:134:5: warning: format '%llx' expects argument of type 'long long 
unsigned int', but argument 4 has type 'u64' [-Wformat]
 util/util.c:101:7: warning: format '%lld' expects argument of type 'long long 
int', but argument 3 has type 'u64' [-Wformat]
 util/util.c:113:7: warning: format '%lld' expects argument of type 'long long 
int', but argument 2 has type 'u64' [-Wformat]
 hw/pci-shmem.c:339:3: warning: format '%llx' expects argument of type 'long 
long unsigned int', but argument 2 has type 'u64' [-Wformat]
 hw/pci-shmem.c:340:3: warning: format '%llx' expects argument of type 'long 
long unsigned int', but argument 2 has type 'u64' [-Wformat]

as observed when compiling on mips64.

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/builtin-run.c  |   12 
 tools/kvm/builtin-stat.c |2 +-
 tools/kvm/disk/core.c|4 +++-
 tools/kvm/hw/pci-shmem.c |5 +++--
 tools/kvm/mmio.c |5 +++--
 tools/kvm/util/util.c|4 ++--
 6 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index da95d71..1ee75ad 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -184,8 +184,8 @@ panic_kvm:
current_kvm_cpu-kvm_run-exit_reason,
kvm_exit_reasons[current_kvm_cpu-kvm_run-exit_reason]);
if (current_kvm_cpu-kvm_run-exit_reason == KVM_EXIT_UNKNOWN)
-   fprintf(stderr, KVM exit code: 0x%Lu\n,
-   current_kvm_cpu-kvm_run-hw.hardware_exit_reason);
+   fprintf(stderr, KVM exit code: 0x%llu\n,
+   (unsigned long 
long)current_kvm_cpu-kvm_run-hw.hardware_exit_reason);
 
kvm_cpu__set_debug_fd(STDOUT_FILENO);
kvm_cpu__show_registers(current_kvm_cpu);
@@ -551,7 +551,9 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
kvm-cfg.ram_size = get_ram_size(kvm-cfg.nrcpus);
 
if (kvm-cfg.ram_size  host_ram_size())
-   pr_warning(Guest memory size %lluMB exceeds host physical RAM 
size %lluMB, kvm-cfg.ram_size, host_ram_size());
+   pr_warning(Guest memory size %lluMB exceeds host physical RAM 
size %lluMB,
+   (unsigned long long)kvm-cfg.ram_size,
+   (unsigned long long)host_ram_size());
 
kvm-cfg.ram_size = MB_SHIFT;
 
@@ -639,7 +641,9 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
kvm-cfg.real_cmdline = real_cmdline;
 
printf(  # %s run -k %s -m %Lu -c %d --name %s\n, KVM_BINARY_NAME,
-   kvm-cfg.kernel_filename, kvm-cfg.ram_size / 1024 / 1024, 
kvm-cfg.nrcpus, kvm-cfg.guest_name);
+   kvm-cfg.kernel_filename,
+   (unsigned long long)kvm-cfg.ram_size / 1024 / 1024,
+   kvm-cfg.nrcpus, kvm-cfg.guest_name);
 
if (init_list__init(kvm)  0)
die (Initialisation failed);
diff --git a/tools/kvm/builtin-stat.c b/tools/kvm/builtin-stat.c
index ffd72e8..5d6407e 100644
--- a/tools/kvm/builtin-stat.c
+++ b/tools/kvm/builtin-stat.c
@@ -90,7 +90,7 @@ static int do_memstat(const char *name, int sock)
printf(The total amount of memory available (in 
bytes):);
break;
}
-   printf(%llu\n, stats[i].val);
+   printf(%llu\n, (unsigned long long)stats[i].val);
}
printf(\n);
 
diff --git a/tools/kvm/disk/core.c b/tools/kvm/disk/core.c
index 4e9bda0..309e16c 100644
--- a/tools/kvm/disk/core.c
+++ b/tools/kvm/disk/core.c
@@ -327,7 +327,9 @@ ssize_t disk_image__get_serial(struct disk_image *disk, 
void *buffer, ssize_t *l
return r;
 
*len = snprintf(buffer, *len, %llu%llu%llu,
-   (u64)st.st_dev, 

[PATCH v3 00/12] kvm tools: Misc patches (mips support)

2014-05-28 Thread Andreas Herrmann
Hi,

This is v3 of my patch set to run lkvm on MIPS.

It's rebased on v3.13-rc1-1436-g1fc83c5 of
git://github.com/penberg/linux-kvm.git

Diffstat is:

 tools/kvm/Makefile   |9 +-
 tools/kvm/arm/fdt.c  |7 -
 tools/kvm/arm/include/arm-common/kvm-arch.h  |2 +
 tools/kvm/builtin-run.c  |   12 +-
 tools/kvm/builtin-stat.c |2 +-
 tools/kvm/disk/core.c|4 +-
 tools/kvm/hw/pci-shmem.c |5 +-
 tools/kvm/include/kvm/kvm.h  |1 +
 tools/kvm/include/kvm/term.h |2 +
 tools/kvm/kvm.c  |   20 +-
 tools/kvm/mips/include/kvm/barrier.h |   20 ++
 tools/kvm/mips/include/kvm/kvm-arch.h|   38 +++
 tools/kvm/mips/include/kvm/kvm-config-arch.h |7 +
 tools/kvm/mips/include/kvm/kvm-cpu-arch.h|   42 
 tools/kvm/mips/irq.c |   10 +
 tools/kvm/mips/kvm-cpu.c |  219 +
 tools/kvm/mips/kvm.c |  328 ++
 tools/kvm/mmio.c |5 +-
 tools/kvm/pci.c  |   16 +-
 tools/kvm/powerpc/include/kvm/kvm-arch.h |2 +
 tools/kvm/powerpc/kvm.c  |7 -
 tools/kvm/term.c |   10 +-
 tools/kvm/util/util.c|4 +-
 tools/kvm/virtio/pci.c   |6 +-
 tools/kvm/x86/include/kvm/kvm-arch.h |2 +
 25 files changed, 742 insertions(+), 38 deletions(-)

Changelog:
v3:
 - Rebased to v3.13-rc1-1436-g1fc83c5
 - Moved patch kvm tools: Provide per arch macro to specify type for
   KVM_CREATE_VM before patch kvm tools, mips: Enable build of mips
   support to avoid broken built.
 - Added macro for hypercall number (KVM_HC_MIPS_CONSOLE_OUTPUT)
   (Once mips-paravirt is upstream and its changes merged into your
   tree this should be replaced by including the appropriate kernel
   header file.)
v2:
 - 
http://marc.info/?i=1400518411-9759-1-git-send-email-andreas.herrm...@caviumnetworks.com
v1:
 - 
http://marc.info/?i=1399391491-5021-1-git-send-email-andreas.herrm...@caviumnetworks.com


Please apply.


Thanks,

Andreas
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 01/12] kvm tools: Print message on failure of KVM_CREATE_VM

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/kvm.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index d7d2e84..7bd20d3 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -286,6 +286,7 @@ int kvm__init(struct kvm *kvm)
 
kvm-vm_fd = ioctl(kvm-sys_fd, KVM_CREATE_VM, 0);
if (kvm-vm_fd  0) {
+   pr_err(KVM_CREATE_VM ioctl);
ret = kvm-vm_fd;
goto err_sys_fd;
}
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 04/12] kvm tools: Allow to load ELF binary

2014-05-28 Thread Andreas Herrmann
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/include/kvm/kvm.h |1 +
 tools/kvm/kvm.c |   11 +++
 2 files changed, 12 insertions(+)

diff --git a/tools/kvm/include/kvm/kvm.h b/tools/kvm/include/kvm/kvm.h
index f1b71a0..58cb73b 100644
--- a/tools/kvm/include/kvm/kvm.h
+++ b/tools/kvm/include/kvm/kvm.h
@@ -109,6 +109,7 @@ void *guest_flat_to_host(struct kvm *kvm, u64 offset);
 u64 host_to_guest_flat(struct kvm *kvm, void *ptr);
 
 int load_flat_binary(struct kvm *kvm, int fd_kernel, int fd_initrd, const char 
*kernel_cmdline);
+int load_elf_binary(struct kvm *kvm, int fd_kernel, int fd_initrd, const char 
*kernel_cmdline);
 bool load_bzimage(struct kvm *kvm, int fd_kernel, int fd_initrd, const char 
*kernel_cmdline);
 
 /*
diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index 7bd20d3..cfc0693 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -349,6 +349,12 @@ static bool initrd_check(int fd)
!memcmp(id, CPIO_MAGIC, 4);
 }
 
+int __attribute__((__weak__)) load_elf_binary(struct kvm *kvm, int fd_kernel,
+   int fd_initrd, const char *kernel_cmdline)
+{
+   return false;
+}
+
 bool kvm__load_kernel(struct kvm *kvm, const char *kernel_filename,
const char *initrd_filename, const char *kernel_cmdline)
 {
@@ -375,6 +381,11 @@ bool kvm__load_kernel(struct kvm *kvm, const char 
*kernel_filename,
 
pr_warning(%s is not a bzImage. Trying to load it as a flat 
binary..., kernel_filename);
 
+   ret = load_elf_binary(kvm, fd_kernel, fd_initrd, kernel_cmdline);
+
+   if (ret)
+   goto found_kernel;
+
ret = load_flat_binary(kvm, fd_kernel, fd_initrd, kernel_cmdline);
 
if (ret)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 03/12] kvm tools: Move definition of TERM_MAX_DEVS to header

2014-05-28 Thread Andreas Herrmann
In order to use it in other C files (in addition to term.c).

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/include/kvm/term.h |2 ++
 tools/kvm/term.c |1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/include/kvm/term.h b/tools/kvm/include/kvm/term.h
index 5f63457..dc9882e 100644
--- a/tools/kvm/include/kvm/term.h
+++ b/tools/kvm/include/kvm/term.h
@@ -10,6 +10,8 @@
 #define CONSOLE_VIRTIO 2
 #define CONSOLE_HV 3
 
+#define TERM_MAX_DEVS  4
+
 int term_putc_iov(struct iovec *iov, int iovcnt, int term);
 int term_getc_iov(struct kvm *kvm, struct iovec *iov, int iovcnt, int term);
 int term_putc(char *addr, int cnt, int term);
diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index 214f5e2..3de410b 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -16,7 +16,6 @@
 
 #define TERM_FD_IN  0
 #define TERM_FD_OUT 1
-#define TERM_MAX_DEVS  4
 
 static struct termios  orig_term;
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 07/12] kvm tools: Provide per arch macro to specify type for KVM_CREATE_VM

2014-05-28 Thread Andreas Herrmann
This is is usually 0 for most archs. On mips we have two types.
TE (type 0) and MIPS-VZ (type 1). Default to 1 on mips.

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/arm/include/arm-common/kvm-arch.h |2 ++
 tools/kvm/kvm.c |2 +-
 tools/kvm/mips/include/kvm/kvm-arch.h   |5 +
 tools/kvm/powerpc/include/kvm/kvm-arch.h|2 ++
 tools/kvm/x86/include/kvm/kvm-arch.h|2 ++
 5 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/arm/include/arm-common/kvm-arch.h 
b/tools/kvm/arm/include/arm-common/kvm-arch.h
index 5d2fab2..082131d 100644
--- a/tools/kvm/arm/include/arm-common/kvm-arch.h
+++ b/tools/kvm/arm/include/arm-common/kvm-arch.h
@@ -32,6 +32,8 @@
 
 #define KVM_IRQ_OFFSET GIC_SPI_IRQ_BASE
 
+#define KVM_VM_TYPE0
+
 #define VIRTIO_DEFAULT_TRANS(kvm)  \
((kvm)-cfg.arch.virtio_trans_pci ? VIRTIO_PCI : VIRTIO_MMIO)
 
diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index 6046434..e1b9f6c 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -284,7 +284,7 @@ int kvm__init(struct kvm *kvm)
goto err_sys_fd;
}
 
-   kvm-vm_fd = ioctl(kvm-sys_fd, KVM_CREATE_VM, 0);
+   kvm-vm_fd = ioctl(kvm-sys_fd, KVM_CREATE_VM, KVM_VM_TYPE);
if (kvm-vm_fd  0) {
pr_err(KVM_CREATE_VM ioctl);
ret = kvm-vm_fd;
diff --git a/tools/kvm/mips/include/kvm/kvm-arch.h 
b/tools/kvm/mips/include/kvm/kvm-arch.h
index 4a8407b..7eadbf4 100644
--- a/tools/kvm/mips/include/kvm/kvm-arch.h
+++ b/tools/kvm/mips/include/kvm/kvm-arch.h
@@ -17,6 +17,11 @@
 
 #define KVM_IRQ_OFFSET 1
 
+/*
+ * MIPS-VZ (trap and emulate is 0)
+ */
+#define KVM_VM_TYPE1
+
 #define VIRTIO_DEFAULT_TRANS(kvm)  VIRTIO_PCI
 
 #include stdbool.h
diff --git a/tools/kvm/powerpc/include/kvm/kvm-arch.h 
b/tools/kvm/powerpc/include/kvm/kvm-arch.h
index f8627a2..fdd518f 100644
--- a/tools/kvm/powerpc/include/kvm/kvm-arch.h
+++ b/tools/kvm/powerpc/include/kvm/kvm-arch.h
@@ -44,6 +44,8 @@
 
 #define KVM_IRQ_OFFSET 16
 
+#define KVM_VM_TYPE0
+
 #define VIRTIO_DEFAULT_TRANS(kvm)  VIRTIO_PCI
 
 struct spapr_phb;
diff --git a/tools/kvm/x86/include/kvm/kvm-arch.h 
b/tools/kvm/x86/include/kvm/kvm-arch.h
index a9f23b8..673bdf1 100644
--- a/tools/kvm/x86/include/kvm/kvm-arch.h
+++ b/tools/kvm/x86/include/kvm/kvm-arch.h
@@ -27,6 +27,8 @@
 
 #define KVM_IRQ_OFFSET 5
 
+#define KVM_VM_TYPE0
+
 #define VIRTIO_DEFAULT_TRANS(kvm)  VIRTIO_PCI
 
 struct kvm_arch {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 05/12] kvm tools: Introduce weak (default) load_bzimage function

2014-05-28 Thread Andreas Herrmann
... to get rid of its function definition from archs that don't
support it.

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/arm/fdt.c |7 ---
 tools/kvm/kvm.c |6 ++
 tools/kvm/powerpc/kvm.c |7 ---
 3 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/tools/kvm/arm/fdt.c b/tools/kvm/arm/fdt.c
index 30cd75a..186a718 100644
--- a/tools/kvm/arm/fdt.c
+++ b/tools/kvm/arm/fdt.c
@@ -276,10 +276,3 @@ int load_flat_binary(struct kvm *kvm, int fd_kernel, int 
fd_initrd,
 
return true;
 }
-
-bool load_bzimage(struct kvm *kvm, int fd_kernel, int fd_initrd,
- const char *kernel_cmdline)
-{
-   /* To b or not to b? That is the zImage. */
-   return false;
-}
diff --git a/tools/kvm/kvm.c b/tools/kvm/kvm.c
index cfc0693..6046434 100644
--- a/tools/kvm/kvm.c
+++ b/tools/kvm/kvm.c
@@ -355,6 +355,12 @@ int __attribute__((__weak__)) load_elf_binary(struct kvm 
*kvm, int fd_kernel,
return false;
 }
 
+bool __attribute__((__weak__)) load_bzimage(struct kvm *kvm, int fd_kernel,
+   int fd_initrd, const char *kernel_cmdline)
+{
+   return false;
+}
+
 bool kvm__load_kernel(struct kvm *kvm, const char *kernel_filename,
const char *initrd_filename, const char *kernel_cmdline)
 {
diff --git a/tools/kvm/powerpc/kvm.c b/tools/kvm/powerpc/kvm.c
index c1712cf..2b03a12 100644
--- a/tools/kvm/powerpc/kvm.c
+++ b/tools/kvm/powerpc/kvm.c
@@ -204,13 +204,6 @@ int load_flat_binary(struct kvm *kvm, int fd_kernel, int 
fd_initrd, const char *
return true;
 }
 
-bool load_bzimage(struct kvm *kvm, int fd_kernel, int fd_initrd,
- const char *kernel_cmdline)
-{
-   /* We don't support bzImages. */
-   return false;
-}
-
 struct fdt_prop {
void *value;
int size;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 06/12] kvm tools, mips: Add MIPS support

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

So far this was tested with host running KVM using MIPS-VZ (on Cavium
Octeon3). A paravirtualized mips kernel was used for the guest.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/mips/include/kvm/barrier.h |   20 +++
 tools/kvm/mips/include/kvm/kvm-arch.h|   33 
 tools/kvm/mips/include/kvm/kvm-config-arch.h |7 +
 tools/kvm/mips/include/kvm/kvm-cpu-arch.h|   42 +
 tools/kvm/mips/irq.c |   10 ++
 tools/kvm/mips/kvm-cpu.c |  219 ++
 tools/kvm/mips/kvm.c |  128 +++
 7 files changed, 459 insertions(+)
 create mode 100644 tools/kvm/mips/include/kvm/barrier.h
 create mode 100644 tools/kvm/mips/include/kvm/kvm-arch.h
 create mode 100644 tools/kvm/mips/include/kvm/kvm-config-arch.h
 create mode 100644 tools/kvm/mips/include/kvm/kvm-cpu-arch.h
 create mode 100644 tools/kvm/mips/irq.c
 create mode 100644 tools/kvm/mips/kvm-cpu.c
 create mode 100644 tools/kvm/mips/kvm.c

[andreas.herrmann:
   * Rename kvm__arch_periodic_poll to kvm__arch_read_term
 because of commit fa817d892508b6d3a90f478dbeedbe5583b14da7
 (kvm tools: remove periodic tick in favour of a polling thread)
   * Add ioport__map_irq skeleton to fix build problem
   * Rely on TERM_MAX_DEVS instead of using other macros
   * Adaptions for MMIO support
   * Set coalesc offset
   * Fix compile warnings
   * Fix debug output format for register dump
   * Add check for upper bound of len in hypercall_write_cons
   * Adapt signature of kvm_cpu__emulate_mmio to latest changes
 commit 0f99176f615b894152f0b424f9b7156a26e81730
 (kvmtool: pass trapped vcpu to MMIO accessors)
   * Use macro for hypercall number]

diff --git a/tools/kvm/mips/include/kvm/barrier.h 
b/tools/kvm/mips/include/kvm/barrier.h
new file mode 100644
index 000..45bfa72
--- /dev/null
+++ b/tools/kvm/mips/include/kvm/barrier.h
@@ -0,0 +1,20 @@
+#ifndef _KVM_BARRIER_H_
+#define _KVM_BARRIER_H_
+
+#define barrier() asm volatile(: : :memory)
+
+#define mb()   asm volatile (.set push\n\t.set mips2\n\tsync\n\t.set pop: : 
:memory)
+#define rmb() mb()
+#define wmb() mb()
+
+#ifdef CONFIG_SMP
+#define smp_mb()   mb()
+#define smp_rmb()  rmb()
+#define smp_wmb()  wmb()
+#else
+#define smp_mb()   barrier()
+#define smp_rmb()  barrier()
+#define smp_wmb()  barrier()
+#endif
+
+#endif /* _KVM_BARRIER_H_ */
diff --git a/tools/kvm/mips/include/kvm/kvm-arch.h 
b/tools/kvm/mips/include/kvm/kvm-arch.h
new file mode 100644
index 000..4a8407b
--- /dev/null
+++ b/tools/kvm/mips/include/kvm/kvm-arch.h
@@ -0,0 +1,33 @@
+#ifndef KVM__KVM_ARCH_H
+#define KVM__KVM_ARCH_H
+
+#define KVM_MMIO_START 0x1000
+#define KVM_PCI_CFG_AREA   KVM_MMIO_START
+#define KVM_PCI_MMIO_AREA  (KVM_MMIO_START + 0x100)
+#define KVM_VIRTIO_MMIO_AREA   (KVM_MMIO_START + 0x200)
+
+/*
+ * Just for reference. This and the above corresponds to what's used
+ * in mipsvz_page_fault() in kvm_mipsvz.c of the host kernel.
+ */
+#define KVM_MIPS_IOPORT_AREA   0x1e00
+#define KVM_MIPS_IOPORT_SIZE   0x0001
+#define KVM_MIPS_IRQCHIP_AREA  0x1e01
+#define KVM_MIPS_IRQCHIP_SIZE  0x0001
+
+#define KVM_IRQ_OFFSET 1
+
+#define VIRTIO_DEFAULT_TRANS(kvm)  VIRTIO_PCI
+
+#include stdbool.h
+
+#include linux/types.h
+
+struct kvm_arch {
+   u64 entry_point;
+   u64 argc;
+   u64 argv;
+   bool is64bit;
+};
+
+#endif /* KVM__KVM_ARCH_H */
diff --git a/tools/kvm/mips/include/kvm/kvm-config-arch.h 
b/tools/kvm/mips/include/kvm/kvm-config-arch.h
new file mode 100644
index 000..8a28f9d
--- /dev/null
+++ b/tools/kvm/mips/include/kvm/kvm-config-arch.h
@@ -0,0 +1,7 @@
+#ifndef KVM__KVM_CONFIG_ARCH_H
+#define KVM__KVM_CONFIG_ARCH_H
+
+struct kvm_config_arch {
+};
+
+#endif /* KVM__MIPS_KVM_CONFIG_ARCH_H */
diff --git a/tools/kvm/mips/include/kvm/kvm-cpu-arch.h 
b/tools/kvm/mips/include/kvm/kvm-cpu-arch.h
new file mode 100644
index 000..fc286b1
--- /dev/null
+++ b/tools/kvm/mips/include/kvm/kvm-cpu-arch.h
@@ -0,0 +1,42 @@
+#ifndef KVM__KVM_CPU_ARCH_H
+#define KVM__KVM_CPU_ARCH_H
+
+#include linux/kvm.h /* for struct kvm_regs */
+#include kvm/kvm.h   /* for kvm__emulate_{mm}io() */
+#include pthread.h
+
+struct kvm;
+
+struct kvm_cpu {
+   pthread_t   thread; /* VCPU thread */
+
+   unsigned long   cpu_id;
+
+   struct kvm  *kvm;   /* parent KVM */
+   int vcpu_fd;/* For VCPU ioctls() */
+   struct kvm_run  *kvm_run;
+
+   struct kvm_regs regs;
+
+   u8  is_running;
+   u8  paused;
+   u8  needs_nmi;
+
+   struct kvm_coalesced_mmio_ring *ring;
+};
+
+/*
+ * As these are such simple wrappers, let's have them in the header so they'll
+ * be cheaper to call:
+ */
+static 

[PATCH v3 08/12] kvm tools, mips: Enable build of mips support

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/Makefile |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index b872651..880d580 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -105,7 +105,7 @@ OBJS+= virtio/mmio.o
 
 # Translate uname -m into ARCH string
 ARCH ?= $(shell uname -m | sed -e s/i.86/i386/ -e s/ppc.*/powerpc/ \
- -e s/armv7.*/arm/ -e s/aarch64.*/arm64/)
+ -e s/armv7.*/arm/ -e s/aarch64.*/arm64/ -e s/mips64/mips/)
 
 ifeq ($(ARCH),i386)
ARCH := x86
@@ -184,6 +184,13 @@ ifeq ($(ARCH), arm64)
ARCH_WANT_LIBFDT := y
 endif
 
+ifeq ($(ARCH),mips)
+   DEFINES += -DCONFIG_MIPS
+   ARCH_INCLUDE:= mips/include
+   OBJS+= mips/kvm.o
+   OBJS+= mips/kvm-cpu.o
+   OBJS+= mips/irq.o
+endif
 ###
 
 ifeq (,$(ARCH_INCLUDE))
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 09/12] kvm tools: Handle virtio/pci I/O space as little endian

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

It doesn't work on big endian hosts as is.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/pci.c|   16 +---
 tools/kvm/virtio/pci.c |6 +++---
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/tools/kvm/pci.c b/tools/kvm/pci.c
index 5d60585..eac718f 100644
--- a/tools/kvm/pci.c
+++ b/tools/kvm/pci.c
@@ -10,7 +10,7 @@
 
 #define PCI_BAR_OFFSET(b)  (offsetof(struct pci_device_header, 
bar[b]))
 
-static union pci_config_addresspci_config_address;
+static u32 pci_config_address_bits;
 
 /* This is within our PCI gap - in an unused area.
  * Note this is a PCI *bus address*, is used to assign BARs etc.!
@@ -49,7 +49,7 @@ static void *pci_config_address_ptr(u16 port)
void *base;
 
offset  = port - PCI_CONFIG_ADDRESS;
-   base= pci_config_address;
+   base= pci_config_address_bits;
 
return base + offset;
 }
@@ -79,6 +79,10 @@ static struct ioport_operations pci_config_address_ops = {
 
 static bool pci_device_exists(u8 bus_number, u8 device_number, u8 
function_number)
 {
+   union pci_config_address pci_config_address;
+
+   pci_config_address.w = ioport__read32(pci_config_address_bits);
+
if (pci_config_address.bus_number != bus_number)
return false;
 
@@ -90,6 +94,9 @@ static bool pci_device_exists(u8 bus_number, u8 
device_number, u8 function_numbe
 
 static bool pci_config_data_out(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size)
 {
+   union pci_config_address pci_config_address;
+
+   pci_config_address.w = ioport__read32(pci_config_address_bits);
/*
 * If someone accesses PCI configuration space offsets that are not
 * aligned to 4 bytes, it uses ioports to signify that.
@@ -103,6 +110,9 @@ static bool pci_config_data_out(struct ioport *ioport, 
struct kvm *kvm, u16 port
 
 static bool pci_config_data_in(struct ioport *ioport, struct kvm *kvm, u16 
port, void *data, int size)
 {
+   union pci_config_address pci_config_address;
+
+   pci_config_address.w = ioport__read32(pci_config_address_bits);
/*
 * If someone accesses PCI configuration space offsets that are not
 * aligned to 4 bytes, it uses ioports to signify that.
@@ -133,7 +143,7 @@ void pci__config_wr(struct kvm *kvm, union 
pci_config_address addr, void *data,
void *p = device__find_dev(DEVICE_BUS_PCI, 
dev_num)-data;
struct pci_device_header *hdr = p;
u8 bar = (offset - PCI_BAR_OFFSET(0)) / (sizeof(u32));
-   u32 sz = PCI_IO_SIZE;
+   u32 sz = cpu_to_le32(PCI_IO_SIZE);
 
if (bar  6  hdr-bar_size[bar])
sz = hdr-bar_size[bar];
diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
index 57ccde6..b8122b0 100644
--- a/tools/kvm/virtio/pci.c
+++ b/tools/kvm/virtio/pci.c
@@ -378,9 +378,9 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct 
virtio_device *vdev,
| 
PCI_BASE_ADDRESS_SPACE_MEMORY),
.status = cpu_to_le16(PCI_STATUS_CAP_LIST),
.capabilities   = (void *)vpci-pci_hdr.msix - (void 
*)vpci-pci_hdr,
-   .bar_size[0]= IOPORT_SIZE,
-   .bar_size[1]= IOPORT_SIZE,
-   .bar_size[2]= PCI_IO_SIZE * 2,
+   .bar_size[0]= cpu_to_le32(IOPORT_SIZE),
+   .bar_size[1]= cpu_to_le32(IOPORT_SIZE),
+   .bar_size[2]= cpu_to_le32(PCI_IO_SIZE*2),
};
 
vpci-dev_hdr = (struct device_header) {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 10/12] kvm tools, mips: Add support for loading elf binaries

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/mips/kvm.c |  200 ++
 1 file changed, 200 insertions(+)

[andreas.herrmann:
   * Fixed compile warnings]

diff --git a/tools/kvm/mips/kvm.c b/tools/kvm/mips/kvm.c
index 9eecfb5..fc0428b 100644
--- a/tools/kvm/mips/kvm.c
+++ b/tools/kvm/mips/kvm.c
@@ -6,6 +6,7 @@
 
 #include ctype.h
 #include unistd.h
+#include elf.h
 
 struct kvm_ext kvm_req_ext[] = {
{ 0, 0 }
@@ -99,6 +100,43 @@ int kvm__arch_setup_firmware(struct kvm *kvm)
return 0;
 }
 
+static void kvm__mips_install_cmdline(struct kvm *kvm)
+{
+   char *p = kvm-ram_start;
+   u64 cmdline_offset = 0x2000;
+   u64 argv_start = 0x3000;
+   u64 argv_offset = argv_start;
+   u64 argc = 0;
+
+   sprintf(p + cmdline_offset, mem=0x%llx@0 ,
+(unsigned long long)kvm-ram_size);
+
+   strcat(p + cmdline_offset, kvm-cfg.real_cmdline); /* maximum size is 
2K */
+
+   while (p[cmdline_offset]) {
+   if (!isspace(p[cmdline_offset])) {
+   if (kvm-arch.is64bit) {
+   *(u64 *)(p + argv_offset) = 
0x8000ull + cmdline_offset;
+   argv_offset += sizeof(u64);
+   } else {
+   *(u32 *)(p + argv_offset) = 0x8000u + 
cmdline_offset;
+   argv_offset += sizeof(u32);
+   }
+   argc++;
+   while(p[cmdline_offset]  !isspace(p[cmdline_offset]))
+   cmdline_offset++;
+   continue;
+   }
+   /* Must be a space character skip over these*/
+   while(p[cmdline_offset]  isspace(p[cmdline_offset])) {
+   p[cmdline_offset] = 0;
+   cmdline_offset++;
+   }
+   }
+   kvm-arch.argc = argc;
+   kvm-arch.argv = 0x8000ull + argv_start;
+}
+
 /* Load at the 1M point. */
 #define KERNEL_LOAD_ADDR 0x100
 int load_flat_binary(struct kvm *kvm, int fd_kernel, int fd_initrd, const char 
*kernel_cmdline)
@@ -123,6 +161,168 @@ int load_flat_binary(struct kvm *kvm, int fd_kernel, int 
fd_initrd, const char *
return true;
 }
 
+struct kvm__arch_elf_info {
+   u64 load_addr;
+   u64 entry_point;
+   size_t len;
+   size_t offset;
+};
+
+static bool kvm__arch_get_elf_64_info(Elf64_Ehdr *ehdr, int fd_kernel,
+ struct kvm__arch_elf_info *ei)
+{
+   int i;
+   size_t nr;
+   Elf64_Phdr phdr;
+
+   if (ehdr-e_phentsize != sizeof(phdr)) {
+   pr_info(Incompatible ELF PHENTSIZE %d, ehdr-e_phentsize);
+   return false;
+   }
+
+   ei-entry_point = ehdr-e_entry;
+
+   if (lseek(fd_kernel, ehdr-e_phoff, SEEK_SET)  0)
+   die_perror(lseek);
+
+   phdr.p_type = PT_NULL;
+   for (i = 0; i  ehdr-e_phnum; i++) {
+   nr = read(fd_kernel, phdr, sizeof(phdr));
+   if (nr != sizeof(phdr)) {
+   pr_info(Couldn't read %d bytes for ELF PHDR., 
(int)sizeof(phdr));
+   return false;
+   }
+   if (phdr.p_type == PT_LOAD)
+   break;
+   }
+   if (phdr.p_type != PT_LOAD) {
+   pr_info(No PT_LOAD Program Header found.);
+   return false;
+   }
+
+   ei-load_addr = phdr.p_paddr;
+
+   if ((ei-load_addr  0xc000ull) == 0x8000ull)
+   ei-load_addr = 0x1ffful; /* Convert KSEG{0,1} to 
physical. */
+   if ((ei-load_addr  0xc000ull) == 0x8000ull)
+   ei-load_addr = 0x07ffull; /* Convert XKPHYS to 
pysical */
+
+
+   ei-len = phdr.p_filesz;
+   ei-offset = phdr.p_offset;
+
+   return true;
+}
+
+static bool kvm__arch_get_elf_32_info(Elf32_Ehdr *ehdr, int fd_kernel,
+ struct kvm__arch_elf_info *ei)
+{
+   int i;
+   size_t nr;
+   Elf32_Phdr phdr;
+
+   if (ehdr-e_phentsize != sizeof(phdr)) {
+   pr_info(Incompatible ELF PHENTSIZE %d, ehdr-e_phentsize);
+   return false;
+   }
+
+   ei-entry_point = (s64)((s32)ehdr-e_entry);
+
+   if (lseek(fd_kernel, ehdr-e_phoff, SEEK_SET)  0)
+   die_perror(lseek);
+
+   phdr.p_type = PT_NULL;
+   for (i = 0; i  ehdr-e_phnum; i++) {
+   nr = read(fd_kernel, phdr, sizeof(phdr));
+   if (nr != sizeof(phdr)) {
+   pr_info(Couldn't read %d bytes for ELF PHDR., 
(int)sizeof(phdr));
+   return false;
+   }
+   if (phdr.p_type == PT_LOAD)
+ 

[PATCH v3 12/12] kvm tools: Return number of bytes written by term_putc

2014-05-28 Thread Andreas Herrmann
No caller is currently using the return value but better return
number of bytes written instead of 0 in case of an error.

Cc: Sergei Shtylyov sergei.shtyl...@cogentembedded.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/term.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index b153eed..1b8131a 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -57,7 +57,7 @@ int term_putc(char *addr, int cnt, int term)
while (num_remaining) {
ret = write(term_fds[term][TERM_FD_OUT], addr, num_remaining);
if (ret  0)
-   return 0;
+   return cnt - num_remaining;
num_remaining -= ret;
addr += ret;
}
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 11/12] kvm tools: Modify term_putc to write more than one char

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

It is a performance enhancement. When running in a simulator, each
system call to write a character takes a lot of time.  Batching them
up decreases the overhead (in the root kernel) of each virtio console
write.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/term.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index 3de410b..b153eed 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -52,11 +52,14 @@ int term_getc(struct kvm *kvm, int term)
 int term_putc(char *addr, int cnt, int term)
 {
int ret;
+   int num_remaining = cnt;
 
-   while (cnt--) {
-   ret = write(term_fds[term][TERM_FD_OUT], addr++, 1);
+   while (num_remaining) {
+   ret = write(term_fds[term][TERM_FD_OUT], addr, num_remaining);
if (ret  0)
return 0;
+   num_remaining -= ret;
+   addr += ret;
}
 
return cnt;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Some more basic questions..

2014-05-28 Thread Marcus White
Hello,
Some more basic questions..

1. How can I ensure that memory for the a guest is available and
reserved? In other words, I bring up a Linux VM which has 4G
allocated, I want to make sure it has all the 4G available right away.
I saw references to balloon driver, it seemed like that was more for
dynamic memory exchange between host and guest. In my case, it is a
Linux guest with a Linux VM.

2. Does the host reclaim pages from guest if it needs it without a
balloon driver?

3. This might be a very basic question, please bear with me:) If I use
virtio for say network and block, does network and block traffic still
go through QEMU? Is the host part of virtio basically QEMU or is it
something that runs in the host kernel. If QEMU, does every IO still
pass through it? Found some conflicting information, so not a 100%
sure. Found this, not sure if it is 100% accurate? Trying to
understand the flow through different layers, and what the layers are.
http://events.linuxfoundation.org/sites/events/files/slides/MasakiKimura_LinuxConNorthAmerica2013_1.pdf


Thank you in Advance:)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 01/13] MIPS: OCTEON: Enable use of FPU

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

Some versions of the assembler will not assemble CFC1 for OCTEON, so
override the ISA for these.

Add r4k_fpu.o to handle low level FPU initialization.

Modify octeon_switch.S to save the FPU registers.  And include
r4k_switch.S to pick up more FPU support.

Get rid of #define cpu_has_fpu 0

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 .../asm/mach-cavium-octeon/cpu-feature-overrides.h |1 -
 arch/mips/kernel/Makefile  |2 +-
 arch/mips/kernel/branch.c  |6 +-
 arch/mips/kernel/octeon_switch.S   |   84 ++--
 arch/mips/kernel/r4k_switch.S  |3 +
 arch/mips/math-emu/cp1emu.c|6 +-
 6 files changed, 75 insertions(+), 27 deletions(-)

diff --git a/arch/mips/include/asm/mach-cavium-octeon/cpu-feature-overrides.h 
b/arch/mips/include/asm/mach-cavium-octeon/cpu-feature-overrides.h
index 94ed063..cf80228 100644
--- a/arch/mips/include/asm/mach-cavium-octeon/cpu-feature-overrides.h
+++ b/arch/mips/include/asm/mach-cavium-octeon/cpu-feature-overrides.h
@@ -22,7 +22,6 @@
 #define cpu_has_3k_cache   0
 #define cpu_has_4k_cache   0
 #define cpu_has_tx39_cache 0
-#define cpu_has_fpu0
 #define cpu_has_counter1
 #define cpu_has_watch  1
 #define cpu_has_divec  1
diff --git a/arch/mips/kernel/Makefile b/arch/mips/kernel/Makefile
index 8f8b531..e00a1eb 100644
--- a/arch/mips/kernel/Makefile
+++ b/arch/mips/kernel/Makefile
@@ -41,7 +41,7 @@ obj-$(CONFIG_CPU_R4K_FPU) += r4k_fpu.o r4k_switch.o
 obj-$(CONFIG_CPU_R3000)+= r2300_fpu.o r2300_switch.o
 obj-$(CONFIG_CPU_R6000)+= r6000_fpu.o r4k_switch.o
 obj-$(CONFIG_CPU_TX39XX)   += r2300_fpu.o r2300_switch.o
-obj-$(CONFIG_CPU_CAVIUM_OCTEON) += octeon_switch.o
+obj-$(CONFIG_CPU_CAVIUM_OCTEON)+= r4k_fpu.o octeon_switch.o
 
 obj-$(CONFIG_SMP)  += smp.o
 obj-$(CONFIG_SMP_UP)   += smp-up.o
diff --git a/arch/mips/kernel/branch.c b/arch/mips/kernel/branch.c
index 101ab9a..7b2df22 100644
--- a/arch/mips/kernel/branch.c
+++ b/arch/mips/kernel/branch.c
@@ -562,7 +562,11 @@ int __compute_return_epc_for_insn(struct pt_regs *regs,
case cop1_op:
preempt_disable();
if (is_fpu_owner())
-   asm volatile(cfc1\t%0,$31 : =r (fcr31));
+   asm volatile(
+   .set push\n
+   \t.set mips1\n
+   \tcfc1\t%0,$31\n
+   \t.set pop : =r (fcr31));
else
fcr31 = current-thread.fpu.fcr31;
preempt_enable();
diff --git a/arch/mips/kernel/octeon_switch.S b/arch/mips/kernel/octeon_switch.S
index 029e002..f654768 100644
--- a/arch/mips/kernel/octeon_switch.S
+++ b/arch/mips/kernel/octeon_switch.S
@@ -10,24 +10,12 @@
  * Copyright (C) 2000 MIPS Technologies, Inc.
  *written by Carsten Langgaard, carst...@mips.com
  */
-#include asm/asm.h
-#include asm/cachectl.h
-#include asm/fpregdef.h
-#include asm/mipsregs.h
-#include asm/asm-offsets.h
-#include asm/pgtable-bits.h
-#include asm/regdef.h
-#include asm/stackframe.h
-#include asm/thread_info.h
-
-#include asm/asmmacro.h
-
-/*
- * Offset to the current process status flags, the first 32 bytes of the
- * stack are not used.
- */
-#define ST_OFF (_THREAD_SIZE - 32 - PT_SIZE + PT_STATUS)
 
+#define USE_ALTERNATE_RESUME_IMPL 1
+   .set push
+   .set arch=mips64r2
+#include r4k_switch.S
+   .set pop
 /*
  * task_struct *resume(task_struct *prev, task_struct *next,
  *struct thread_info *next_ti, int usedfpu)
@@ -40,6 +28,61 @@
cpu_save_nonscratch a0
LONG_S  ra, THREAD_REG31(a0)
 
+   /*
+* check if we need to save FPU registers
+*/
+   PTR_L   t3, TASK_THREAD_INFO(a0)
+   LONG_L  t0, TI_FLAGS(t3)
+   li  t1, _TIF_USEDFPU
+   and t2, t0, t1
+   beqzt2, 1f
+   nor t1, zero, t1
+
+   and t0, t0, t1
+   LONG_S  t0, TI_FLAGS(t3)
+
+   /*
+* clear saved user stack CU1 bit
+*/
+   LONG_L  t0, ST_OFF(t3)
+   li  t1, ~ST0_CU1
+   and t0, t0, t1
+   LONG_S  t0, ST_OFF(t3)
+
+   .set push
+   .set arch=mips64r2
+   fpu_save_double a0 t0 t1# c0_status passed in t0
+   # clobbers t1
+   .set pop
+1:
+
+   /* check if we need to save COP2 registers */
+   PTR_L   t2, TASK_THREAD_INFO(a0)
+   LONG_L  t0, ST_OFF(t2)
+   bbit0   t0, 30, 1f
+
+   /* Disable COP2 in the stored process state */
+   li  t1, ST0_CU2
+   xor t0, t1
+   LONG_S  t0, ST_OFF(t2)
+
+   /* Enable COP2 so we can 

[PATCH v2 00/13] MIPS: Add mips_paravirt

2014-05-28 Thread Andreas Herrmann
Hi,

This is v2 of patches to add support for paravirtualized guest on mips
(mips_paravirt). Some of the patches add basic support to run it on
octeon3.

The core of mips_paravirt is David's work.

I've run it using lkvm on a host with KVM MIPS-VZ support (on
octeon3). I've tested guest kernels built for CPU_MIPS64_R2 and
CPU_MIPS32_R2.

When the guest kernel is built for CPU_CAVIUM_OCTEON it requires an
additional patch to avoid usage of octeon_send_ipi_single in
octeon_flush_icache_all_cores. Latest patch for this is not yet
included as it caused a regression and needs some rework.

To built a mips_paravirt guest kernel it's easiest to start with
mips_paravirt_defconfig, check/modify CPU selection (defconfig has
CPU_MIPS64_R2), and kick off kernel built.

Patches are against linux-next/master as of today (next-20140528).
(To make use of __BITFIELD_FIELD macro.)

Diffstat is

 arch/mips/Kbuild.platforms |1 +
 arch/mips/Kconfig  |   30 +-
 arch/mips/cavium-octeon/Kconfig|   23 +-
 arch/mips/configs/mips_paravirt_defconfig  |  103 ++
 arch/mips/include/asm/cpu-features.h   |9 +-
 arch/mips/include/asm/cpu-type.h   |1 +
 arch/mips/include/asm/kvm_para.h   |   91 +
 .../asm/mach-cavium-octeon/cpu-feature-overrides.h |1 -
 .../asm/mach-paravirt/cpu-feature-overrides.h  |   36 ++
 arch/mips/include/asm/mach-paravirt/irq.h  |   19 +
 .../include/asm/mach-paravirt/kernel-entry-init.h  |   50 +++
 arch/mips/include/asm/mach-paravirt/war.h  |   25 ++
 arch/mips/include/asm/mipsregs.h   |9 +
 arch/mips/include/asm/r4kcache.h   |2 +
 arch/mips/include/uapi/asm/kvm_para.h  |6 +-
 arch/mips/kernel/Makefile  |2 +-
 arch/mips/kernel/branch.c  |6 +-
 arch/mips/kernel/cpu-probe.c   |2 +-
 arch/mips/kernel/octeon_switch.S   |   84 +++--
 arch/mips/kernel/r4k_switch.S  |3 +
 arch/mips/math-emu/cp1emu.c|6 +-
 arch/mips/mm/c-r4k.c   |   48 ++-
 arch/mips/mm/tlbex.c   |2 +-
 arch/mips/paravirt/Kconfig |6 +
 arch/mips/paravirt/Makefile|   14 +
 arch/mips/paravirt/Platform|9 +
 arch/mips/paravirt/paravirt-irq.c  |  369 
 arch/mips/paravirt/paravirt-smp.c  |  148 
 arch/mips/paravirt/serial.c|   40 +++
 arch/mips/paravirt/setup.c |   67 
 arch/mips/pci/Makefile |2 +-
 arch/mips/pci/pci-virtio-guest.c   |  131 +++
 include/uapi/linux/kvm_para.h  |3 +
 33 files changed, 1295 insertions(+), 53 deletions(-)

Changelog:
v2:
 - Rebased patches on latest linux-next
 - Define hypercalls and HC numbers for MIPS in kvm_para.h header files
 - Misc changes to pci-virtio-guest.c:
   * Make use of __BITFIELD_FIELD macro
   * Calculate IO address for in[lwb] and out[lwb] depending on size
 as usually done throughout the kernel.
   * I still kept this driver version. Once that Generic
 Configuration Access Mechanism support
 (https://lkml.org/lkml/2014/5/18/54) is mainline I might have to
 think about how to make use of that instead.
 - Provide minimal defconfig
 - Renaming mips_cpunum to get_ebase_cpunum
 - Provide _machine_halt function with initial patch of paravirt code
   * No _machine_restart so far. I have to look into this separately
 from this initial patch set -- I think it requires additionl
 kvm-tool changes.
 - Fix barriers when booting secondary CPUs
 - Replace check for 64-bit kernel by common macro
 - Remove R4600_HIT_CACHEOP_WAR_IMPL from r4k_blast_dcache_page_dc128()
 - Use switch statement in r4k_blast_dcache_page_setup()]
 - Remove mistakenly introduced config options from patch
  MIPS: OCTEON: Move CAVIUM_OCTEON_CVMSEG_SIZE to CPU_CAVIUM_OCTEON
 - Use on_each_cpu unconditionally in irq_core_bus_sync_unlock
 - Misc minor changes after review of v1
 - Remove call to irq_reserve_irq from irq_init_core as linux-next
   contains a patches to remove this function and friends
v1:
 - 
http://marc.info/?i=1400597236-11352-1-git-send-email-andreas.herrm...@caviumnetworks.com

Comments are welcome.


Thanks,

Andreas

PS: 1, or 2 comments from mailing list after 1st submission are still
not addressed. I'll look into this asap but I thought sending out v2
shouldn't be delayed.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 03/13] MIPS: OCTEON: Move CAVIUM_OCTEON_CVMSEG_SIZE to CPU_CAVIUM_OCTEON

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

CVMSEG is related to the CPU core not the SoC system.  So needs to be
configurable there.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/cavium-octeon/Kconfig |   23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/mips/cavium-octeon/Kconfig b/arch/mips/cavium-octeon/Kconfig
index 227705d..6028666 100644
--- a/arch/mips/cavium-octeon/Kconfig
+++ b/arch/mips/cavium-octeon/Kconfig
@@ -10,6 +10,17 @@ config CAVIUM_CN63XXP1
  non-CN63XXP1 hardware, so it is recommended to select n
  unless it is known the workarounds are needed.
 
+config CAVIUM_OCTEON_CVMSEG_SIZE
+   int Number of L1 cache lines reserved for CVMSEG memory
+   range 0 54
+   default 1
+   help
+ CVMSEG LM is a segment that accesses portions of the dcache as a
+ local memory; the larger CVMSEG is, the smaller the cache is.
+ This selects the size of CVMSEG LM, which is in cache blocks. The
+ legally range is from zero to 54 cache blocks (i.e. CVMSEG LM is
+ between zero and 6192 bytes).
+
 endif # CPU_CAVIUM_OCTEON
 
 if CAVIUM_OCTEON_SOC
@@ -23,17 +34,6 @@ config CAVIUM_OCTEON_2ND_KERNEL
  with this option to be run at the same time as one built without this
  option.
 
-config CAVIUM_OCTEON_CVMSEG_SIZE
-   int Number of L1 cache lines reserved for CVMSEG memory
-   range 0 54
-   default 1
-   help
- CVMSEG LM is a segment that accesses portions of the dcache as a
- local memory; the larger CVMSEG is, the smaller the cache is.
- This selects the size of CVMSEG LM, which is in cache blocks. The
- legally range is from zero to 54 cache blocks (i.e. CVMSEG LM is
- between zero and 6192 bytes).
-
 config CAVIUM_OCTEON_LOCK_L2
bool Lock often used kernel code in the L2
default y
@@ -86,7 +86,6 @@ config SWIOTLB
select IOMMU_HELPER
select NEED_SG_DMA_LENGTH
 
-
 config OCTEON_ILM
tristate Module to measure interrupt latency using Octeon CIU Timer
help
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 02/13] MIPS: Move system level config items from CPU_CAVIUM_OCTEON to CAVIUM_OCTEON_SOC

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

They are a property of the SoC not the CPU itself.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/Kconfig |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 125edd4..d540945 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -730,6 +730,11 @@ config CAVIUM_OCTEON_SOC
select ZONE_DMA32
select HOLES_IN_ZONE
select ARCH_REQUIRE_GPIOLIB
+   select LIBFDT
+   select USE_OF
+   select ARCH_SPARSEMEM_ENABLE
+   select SYS_SUPPORTS_SMP
+   select NR_CPUS_DEFAULT_16
help
  This option supports all of the Octeon reference boards from Cavium
  Networks. It builds a kernel that dynamically determines the Octeon
@@ -1408,16 +1413,11 @@ config CPU_SB1
 config CPU_CAVIUM_OCTEON
bool Cavium Octeon processor
depends on SYS_HAS_CPU_CAVIUM_OCTEON
-   select ARCH_SPARSEMEM_ENABLE
select CPU_HAS_PREFETCH
select CPU_SUPPORTS_64BIT_KERNEL
-   select SYS_SUPPORTS_SMP
-   select NR_CPUS_DEFAULT_16
select WEAK_ORDERING
select CPU_SUPPORTS_HIGHMEM
select CPU_SUPPORTS_HUGEPAGES
-   select LIBFDT
-   select USE_OF
select USB_EHCI_BIG_ENDIAN_MMIO
select MIPS_L1_CACHE_SHIFT_7
help
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 04/13] MIPS: Don't use RI/XI with 32-bit kernels on 64-bit CPUs

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

The TLB handlers cannot handle this case, so disable it for now.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/include/asm/cpu-features.h |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/mips/include/asm/cpu-features.h 
b/arch/mips/include/asm/cpu-features.h
index f75dd70..c7d8c99 100644
--- a/arch/mips/include/asm/cpu-features.h
+++ b/arch/mips/include/asm/cpu-features.h
@@ -110,9 +110,15 @@
 #ifndef cpu_has_smartmips
 #define cpu_has_smartmips  (cpu_data[0].ases  MIPS_ASE_SMARTMIPS)
 #endif
+
 #ifndef cpu_has_rixi
-#define cpu_has_rixi   (cpu_data[0].options  MIPS_CPU_RIXI)
+# ifdef CONFIG_64BIT
+# define cpu_has_rixi  (cpu_data[0].options  MIPS_CPU_RIXI)
+# else /* CONFIG_32BIT */
+# define cpu_has_rixi  ((cpu_data[0].options  MIPS_CPU_RIXI)  
!cpu_has_64bits)
+# endif
 #endif
+
 #ifndef cpu_has_mmips
 # ifdef CONFIG_SYS_SUPPORTS_MICROMIPS
 #  define cpu_has_mmips(cpu_data[0].options  
MIPS_CPU_MICROMIPS)
@@ -120,6 +126,7 @@
 #  define cpu_has_mmips0
 # endif
 #endif
+
 #ifndef cpu_has_vtag_icache
 #define cpu_has_vtag_icache(cpu_data[0].icache.flags  MIPS_CACHE_VTAG)
 #endif
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 05/13] MIPS: Don't build fast TLB refill handler with 32-bit kernels

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

The fast handler only supports 64-bit kernels.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/mm/tlbex.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/mips/mm/tlbex.c b/arch/mips/mm/tlbex.c
index f99ec587..af91f3e 100644
--- a/arch/mips/mm/tlbex.c
+++ b/arch/mips/mm/tlbex.c
@@ -1250,13 +1250,17 @@ static void build_r4000_tlb_refill_handler(void)
unsigned int final_len;
struct mips_huge_tlb_info htlb_info __maybe_unused;
enum vmalloc64_mode vmalloc_mode __maybe_unused;
-
+#ifdef CONFIG_64BIT
+   bool is64bit = true;
+#else
+   bool is64bit = false;
+#endif
memset(tlb_handler, 0, sizeof(tlb_handler));
memset(labels, 0, sizeof(labels));
memset(relocs, 0, sizeof(relocs));
memset(final_handler, 0, sizeof(final_handler));
 
-   if ((scratch_reg = 0 || scratchpad_available())  use_bbit_insns()) {
+   if (is64bit  (scratch_reg = 0 || scratchpad_available())  
use_bbit_insns()) {
htlb_info = build_fast_tlb_refill_handler(p, l, r, K0, K1,
  scratch_reg);
vmalloc_mode = refill_scratch;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 07/13] MIPS: Add function get_ebase_cpunum

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

This returns the CPUNum from the low order Ebase bits.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/include/asm/mipsregs.h |9 +
 arch/mips/kernel/cpu-probe.c |2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

[andreas.herrmann:
  * Renamed function from mips_cpunum
  * Make use of function in cpu-probe.c]

diff --git a/arch/mips/include/asm/mipsregs.h b/arch/mips/include/asm/mipsregs.h
index fb2d174..98e9754 100644
--- a/arch/mips/include/asm/mipsregs.h
+++ b/arch/mips/include/asm/mipsregs.h
@@ -1792,6 +1792,15 @@ __BUILD_SET_C0(brcm_cmt_ctrl)
 __BUILD_SET_C0(brcm_config)
 __BUILD_SET_C0(brcm_mode)
 
+/*
+ * Return low 10 bits of ebase.
+ * Note that under KVM (MIPSVZ) this returns vcpu id.
+ */
+static inline unsigned int get_ebase_cpunum(void)
+{
+   return read_c0_ebase()  0x3ff;
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_MIPSREGS_H */
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index e8638c5..dfc6f62 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -423,7 +423,7 @@ static void decode_configs(struct cpuinfo_mips *c)
 
 #ifndef CONFIG_MIPS_CPS
if (cpu_has_mips_r2) {
-   c-core = read_c0_ebase()  0x3ff;
+   c-core = get_ebase_cpunum();
if (cpu_has_mipsmt)
c-core = fls(core_nvpes()) - 1;
}
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 06/13] MIPS: Add minimal support for OCTEON3 to c-r4k.c

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

These are needed to boot a generic mips64r2 kernel on OCTEONIII.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/include/asm/r4kcache.h |2 ++
 arch/mips/mm/c-r4k.c |   48 ++
 2 files changed, 46 insertions(+), 4 deletions(-)

[andreas.herrmann:
  * Remove R4600_HIT_CACHEOP_WAR_IMPL from r4k_blast_dcache_page_dc128()
  * Use switch statement in r4k_blast_dcache_page_setup()]

diff --git a/arch/mips/include/asm/r4kcache.h b/arch/mips/include/asm/r4kcache.h
index fe8d1b6..0b8bd28 100644
--- a/arch/mips/include/asm/r4kcache.h
+++ b/arch/mips/include/asm/r4kcache.h
@@ -523,6 +523,8 @@ __BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, 
Hit_Writeback_Inv_SD, 32,
 __BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 64, 
)
 __BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 64, )
 __BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 
64, )
+__BUILD_BLAST_CACHE(d, dcache, Index_Writeback_Inv_D, Hit_Writeback_Inv_D, 
128, )
+__BUILD_BLAST_CACHE(i, icache, Index_Invalidate_I, Hit_Invalidate_I, 128, )
 __BUILD_BLAST_CACHE(s, scache, Index_Writeback_Inv_SD, Hit_Writeback_Inv_SD, 
128, )
 
 __BUILD_BLAST_CACHE(inv_d, dcache, Index_Writeback_Inv_D, Hit_Invalidate_D, 
16, )
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index 5c21282..3eb7270 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -108,18 +108,34 @@ static inline void r4k_blast_dcache_page_dc64(unsigned 
long addr)
blast_dcache64_page(addr);
 }
 
+static inline void r4k_blast_dcache_page_dc128(unsigned long addr)
+{
+   blast_dcache128_page(addr);
+}
+
 static void r4k_blast_dcache_page_setup(void)
 {
unsigned long  dc_lsize = cpu_dcache_line_size();
 
-   if (dc_lsize == 0)
+   switch (dc_lsize) {
+   case 0:
r4k_blast_dcache_page = (void *)cache_noop;
-   else if (dc_lsize == 16)
+   break;
+   case 16:
r4k_blast_dcache_page = blast_dcache16_page;
-   else if (dc_lsize == 32)
+   break;
+   case 32:
r4k_blast_dcache_page = r4k_blast_dcache_page_dc32;
-   else if (dc_lsize == 64)
+   break;
+   case 64:
r4k_blast_dcache_page = r4k_blast_dcache_page_dc64;
+   break;
+   case 128:
+   r4k_blast_dcache_page = r4k_blast_dcache_page_dc128;
+   break;
+   default:
+   break;
+   }
 }
 
 #ifndef CONFIG_EVA
@@ -158,6 +174,8 @@ static void r4k_blast_dcache_page_indexed_setup(void)
r4k_blast_dcache_page_indexed = blast_dcache32_page_indexed;
else if (dc_lsize == 64)
r4k_blast_dcache_page_indexed = blast_dcache64_page_indexed;
+   else if (dc_lsize == 128)
+   r4k_blast_dcache_page_indexed = blast_dcache128_page_indexed;
 }
 
 void (* r4k_blast_dcache)(void);
@@ -175,6 +193,8 @@ static void r4k_blast_dcache_setup(void)
r4k_blast_dcache = blast_dcache32;
else if (dc_lsize == 64)
r4k_blast_dcache = blast_dcache64;
+   else if (dc_lsize == 128)
+   r4k_blast_dcache = blast_dcache128;
 }
 
 /* force code alignment (used for TX49XX_ICACHE_INDEX_INV_WAR) */
@@ -264,6 +284,8 @@ static void r4k_blast_icache_page_setup(void)
r4k_blast_icache_page = blast_icache32_page;
else if (ic_lsize == 64)
r4k_blast_icache_page = blast_icache64_page;
+   else if (ic_lsize == 128)
+   r4k_blast_icache_page = blast_icache128_page;
 }
 
 #ifndef CONFIG_EVA
@@ -337,6 +359,8 @@ static void r4k_blast_icache_setup(void)
r4k_blast_icache = blast_icache32;
} else if (ic_lsize == 64)
r4k_blast_icache = blast_icache64;
+   else if (ic_lsize == 128)
+   r4k_blast_icache = blast_icache128;
 }
 
 static void (* r4k_blast_scache_page)(unsigned long addr);
@@ -1093,6 +1117,21 @@ static void probe_pcache(void)
c-dcache.waybit = 0;
break;
 
+   case CPU_CAVIUM_OCTEON3:
+   /* For now lie about the number of ways. */
+   c-icache.linesz = 128;
+   c-icache.sets = 16;
+   c-icache.ways = 8;
+   c-icache.flags |= MIPS_CACHE_VTAG;
+   icache_size = c-icache.sets * c-icache.ways * 
c-icache.linesz;
+
+   c-dcache.linesz = 128;
+   c-dcache.ways = 8;
+   c-dcache.sets = 8;
+   dcache_size = c-dcache.sets * c-dcache.ways * 
c-dcache.linesz;
+   c-options |= MIPS_CPU_PREFETCH;
+   break;
+
default:
if (!(config  MIPS_CONF_M))
panic(Don't know how to probe P-caches on this cpu.);
@@ 

[PATCH v2 13/13] MIPS: Add minimal defconfig for mips_paravirt

2014-05-28 Thread Andreas Herrmann
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/configs/mips_paravirt_defconfig |  103 +
 1 file changed, 103 insertions(+)
 create mode 100644 arch/mips/configs/mips_paravirt_defconfig

diff --git a/arch/mips/configs/mips_paravirt_defconfig 
b/arch/mips/configs/mips_paravirt_defconfig
new file mode 100644
index 000..84cfcb4
--- /dev/null
+++ b/arch/mips/configs/mips_paravirt_defconfig
@@ -0,0 +1,103 @@
+CONFIG_MIPS_PARAVIRT=y
+CONFIG_CPU_MIPS64_R2=y
+CONFIG_64BIT=y
+CONFIG_TRANSPARENT_HUGEPAGE=y
+CONFIG_SMP=y
+CONFIG_HZ_1000=y
+CONFIG_PREEMPT=y
+CONFIG_SYSVIPC=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_BSD_PROCESS_ACCT_V3=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_RELAY=y
+CONFIG_BLK_DEV_INITRD=y
+CONFIG_EXPERT=y
+CONFIG_SLAB=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+# CONFIG_BLK_DEV_BSG is not set
+CONFIG_PCI=y
+CONFIG_MIPS32_COMPAT=y
+CONFIG_MIPS32_O32=y
+CONFIG_MIPS32_N32=y
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_IP_MULTICAST=y
+CONFIG_IP_ADVANCED_ROUTER=y
+CONFIG_IP_MULTIPLE_TABLES=y
+CONFIG_IP_ROUTE_MULTIPATH=y
+CONFIG_IP_ROUTE_VERBOSE=y
+CONFIG_IP_PNP=y
+CONFIG_IP_PNP_DHCP=y
+CONFIG_IP_PNP_BOOTP=y
+CONFIG_IP_PNP_RARP=y
+CONFIG_IP_MROUTE=y
+CONFIG_IP_PIMSM_V1=y
+CONFIG_IP_PIMSM_V2=y
+CONFIG_SYN_COOKIES=y
+# CONFIG_INET_LRO is not set
+CONFIG_IPV6=y
+# CONFIG_WIRELESS is not set
+CONFIG_UEVENT_HELPER_PATH=/sbin/hotplug
+# CONFIG_FW_LOADER is not set
+CONFIG_BLK_DEV_LOOP=y
+CONFIG_VIRTIO_BLK=y
+CONFIG_SCSI=y
+CONFIG_BLK_DEV_SD=y
+CONFIG_NETDEVICES=y
+CONFIG_VIRTIO_NET=y
+# CONFIG_NET_VENDOR_BROADCOM is not set
+# CONFIG_NET_VENDOR_INTEL is not set
+# CONFIG_NET_VENDOR_MARVELL is not set
+# CONFIG_NET_VENDOR_MICREL is not set
+# CONFIG_NET_VENDOR_NATSEMI is not set
+# CONFIG_NET_VENDOR_SMSC is not set
+# CONFIG_NET_VENDOR_STMICRO is not set
+# CONFIG_NET_VENDOR_WIZNET is not set
+CONFIG_PHYLIB=y
+CONFIG_MARVELL_PHY=y
+CONFIG_BROADCOM_PHY=y
+CONFIG_BCM87XX_PHY=y
+# CONFIG_WLAN is not set
+# CONFIG_INPUT is not set
+# CONFIG_SERIO is not set
+# CONFIG_VT is not set
+CONFIG_VIRTIO_CONSOLE=y
+# CONFIG_HW_RANDOM is not set
+# CONFIG_HWMON is not set
+# CONFIG_USB_SUPPORT is not set
+CONFIG_VIRTIO_PCI=y
+CONFIG_VIRTIO_BALLOON=y
+CONFIG_VIRTIO_MMIO=y
+# CONFIG_IOMMU_SUPPORT is not set
+CONFIG_EXT4_FS=y
+CONFIG_EXT4_FS_POSIX_ACL=y
+CONFIG_EXT4_FS_SECURITY=y
+CONFIG_MSDOS_FS=y
+CONFIG_VFAT_FS=y
+CONFIG_PROC_KCORE=y
+CONFIG_TMPFS=y
+CONFIG_HUGETLBFS=y
+# CONFIG_MISC_FILESYSTEMS is not set
+CONFIG_NFS_FS=y
+CONFIG_NFS_V4=y
+CONFIG_NFS_V4_1=y
+CONFIG_ROOT_NFS=y
+CONFIG_NLS_CODEPAGE_437=y
+CONFIG_NLS_ASCII=y
+CONFIG_NLS_ISO8859_1=y
+CONFIG_NLS_UTF8=y
+CONFIG_DEBUG_INFO=y
+CONFIG_DEBUG_FS=y
+CONFIG_MAGIC_SYSRQ=y
+# CONFIG_SCHED_DEBUG is not set
+# CONFIG_FTRACE is not set
+CONFIG_CRYPTO_CBC=y
+CONFIG_CRYPTO_HMAC=y
+CONFIG_CRYPTO_MD5=y
+CONFIG_CRYPTO_DES=y
+# CONFIG_CRYPTO_ANSI_CPRNG is not set
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 11/13] MIPS: paravirt: Add pci controller for virtio

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com


Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/Kconfig|1 +
 arch/mips/paravirt/Kconfig   |6 ++
 arch/mips/pci/Makefile   |2 +-
 arch/mips/pci/pci-virtio-guest.c |  131 ++
 4 files changed, 139 insertions(+), 1 deletion(-)
 create mode 100644 arch/mips/paravirt/Kconfig
 create mode 100644 arch/mips/pci/pci-virtio-guest.c

[andreas.herrmann:
  * Make use of __BITFIELD_FIELD macro
  * Calculate IO address for in[lwb] and out[lwb] depending on size]

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index d540945..1f836d8 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -823,6 +823,7 @@ source arch/mips/cavium-octeon/Kconfig
 source arch/mips/loongson/Kconfig
 source arch/mips/loongson1/Kconfig
 source arch/mips/netlogic/Kconfig
+source arch/mips/paravirt/Kconfig
 
 endmenu
 
diff --git a/arch/mips/paravirt/Kconfig b/arch/mips/paravirt/Kconfig
new file mode 100644
index 000..ecae586
--- /dev/null
+++ b/arch/mips/paravirt/Kconfig
@@ -0,0 +1,6 @@
+if MIPS_PARAVIRT
+
+config MIPS_PCI_VIRTIO
+   def_bool y
+
+endif #  MIPS_PARAVIRT
diff --git a/arch/mips/pci/Makefile b/arch/mips/pci/Makefile
index d61138a..ff8a553 100644
--- a/arch/mips/pci/Makefile
+++ b/arch/mips/pci/Makefile
@@ -21,7 +21,7 @@ obj-$(CONFIG_BCM63XX) += pci-bcm63xx.o 
fixup-bcm63xx.o \
 obj-$(CONFIG_MIPS_ALCHEMY) += pci-alchemy.o
 obj-$(CONFIG_SOC_AR71XX)   += pci-ar71xx.o
 obj-$(CONFIG_PCI_AR724X)   += pci-ar724x.o
-
+obj-$(CONFIG_MIPS_PCI_VIRTIO)  += pci-virtio-guest.o
 #
 # These are still pretty much in the old state, watch, go blind.
 #
diff --git a/arch/mips/pci/pci-virtio-guest.c b/arch/mips/pci/pci-virtio-guest.c
new file mode 100644
index 000..40a078b
--- /dev/null
+++ b/arch/mips/pci/pci-virtio-guest.c
@@ -0,0 +1,131 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2013 Cavium, Inc.
+ */
+
+#include linux/kernel.h
+#include linux/init.h
+#include linux/interrupt.h
+#include linux/pci.h
+
+#include uapi/asm/bitfield.h
+#include asm/byteorder.h
+#include asm/io.h
+
+#define PCI_CONFIG_ADDRESS 0xcf8
+#define PCI_CONFIG_DATA0xcfc
+
+union pci_config_address {
+   struct {
+   __BITFIELD_FIELD(unsigned enable_bit  : 1,  /* 31   */
+   __BITFIELD_FIELD(unsigned reserved: 7,  /* 30 .. 24 */
+   __BITFIELD_FIELD(unsigned bus_number  : 8,  /* 23 .. 16 */
+   __BITFIELD_FIELD(unsigned devfn_number: 8,  /* 15 .. 8  */
+   __BITFIELD_FIELD(unsigned register_number : 8,  /* 7  .. 0  */
+   );
+   };
+   u32 w;
+};
+
+int pcibios_plat_dev_init(struct pci_dev *dev)
+{
+   return 0;
+}
+
+int pcibios_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
+{
+   return ((pin + slot) % 4)+ MIPS_IRQ_PCIA;
+}
+
+static void pci_virtio_guest_write_config_addr(struct pci_bus *bus,
+   unsigned int devfn, int reg)
+{
+   union pci_config_address pca = { .w = 0 };
+
+   pca.register_number = reg;
+   pca.devfn_number = devfn;
+   pca.bus_number = bus-number;
+   pca.enable_bit = 1;
+
+   outl(pca.w, PCI_CONFIG_ADDRESS);
+}
+
+static int pci_virtio_guest_write_config(struct pci_bus *bus,
+   unsigned int devfn, int reg, int size, u32 val)
+{
+   pci_virtio_guest_write_config_addr(bus, devfn, reg);
+
+   switch (size) {
+   case 1:
+   outb(val, PCI_CONFIG_DATA + (reg  3));
+   break;
+   case 2:
+   outw(val, PCI_CONFIG_DATA + (reg  2));
+   break;
+   case 4:
+   outl(val, PCI_CONFIG_DATA);
+   break;
+   }
+
+   return PCIBIOS_SUCCESSFUL;
+}
+
+static int pci_virtio_guest_read_config(struct pci_bus *bus, unsigned int 
devfn,
+   int reg, int size, u32 *val)
+{
+   pci_virtio_guest_write_config_addr(bus, devfn, reg);
+
+   switch (size) {
+   case 1:
+   *val = inb(PCI_CONFIG_DATA + (reg  3));
+   break;
+   case 2:
+   *val = inw(PCI_CONFIG_DATA + (reg  2));
+   break;
+   case 4:
+   *val = inl(PCI_CONFIG_DATA);
+   break;
+   }
+   return PCIBIOS_SUCCESSFUL;
+}
+
+static struct pci_ops pci_virtio_guest_ops = {
+   .read  = pci_virtio_guest_read_config,
+   .write = pci_virtio_guest_write_config,
+};
+
+static struct resource pci_virtio_guest_mem_resource = {
+   .name = Virtio MEM,
+   .flags = IORESOURCE_MEM,
+   .start  = 0x1000,
+   .end= 0x1dff
+};
+
+static struct resource 

[PATCH v2 10/13] MIPS: Add code for new system 'paravirt'

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

For para-virtualized guests running under KVM or other equivalent
hypervisor.

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 .../asm/mach-paravirt/cpu-feature-overrides.h  |   36 ++
 arch/mips/include/asm/mach-paravirt/irq.h  |   19 +
 .../include/asm/mach-paravirt/kernel-entry-init.h  |   50 +++
 arch/mips/include/asm/mach-paravirt/war.h  |   25 ++
 arch/mips/mm/tlbex.c   |8 +-
 arch/mips/paravirt/Makefile|   14 +
 arch/mips/paravirt/Platform|9 +
 arch/mips/paravirt/paravirt-irq.c  |  369 
 arch/mips/paravirt/paravirt-smp.c  |  148 
 arch/mips/paravirt/serial.c|   40 +++
 arch/mips/paravirt/setup.c |   67 
 11 files changed, 779 insertions(+), 6 deletions(-)
 create mode 100644 arch/mips/include/asm/mach-paravirt/cpu-feature-overrides.h
 create mode 100644 arch/mips/include/asm/mach-paravirt/irq.h
 create mode 100644 arch/mips/include/asm/mach-paravirt/kernel-entry-init.h
 create mode 100644 arch/mips/include/asm/mach-paravirt/war.h
 create mode 100644 arch/mips/paravirt/Makefile
 create mode 100644 arch/mips/paravirt/Platform
 create mode 100644 arch/mips/paravirt/paravirt-irq.c
 create mode 100644 arch/mips/paravirt/paravirt-smp.c
 create mode 100644 arch/mips/paravirt/serial.c
 create mode 100644 arch/mips/paravirt/setup.c

[andreas.herrmann:
  * Adaptions after renaming of mips_cpunum to get_ebase_cpunum
  * Provide _machine_halt function to exit VM on shutdown of guest
  * Adapt serial.c and setup.c to use new hypercalls and HC numbers
  * Fix barriers when booting secondary CPUs
  * Replace check for 64-bit kernel by common macro
  * Remove call to irq_reserve_irq from irq_init_core
The function was removed (linux-next) see
- commit 1d008353ba088fdec0b2a944e140ff9154a5fb20
  (genirq: Remove irq_reserve_irq[s])
- commit f63b6a05f2b11537612266a8b27a61f412344a1d
  (genirq: Replace reserve_irqs in core code)
- commit be4034016c11f8913d38fccf692007fab1c50be1
  (s390: Avoid call to irq_reserve_irqs())
Not reserving unused irqs shouldn't hurt.
  * Use on_each_cpu unconditionally in irq_core_bus_sync_unlock]
  * Misc other minor changes after review of v1]

diff --git a/arch/mips/include/asm/mach-paravirt/cpu-feature-overrides.h 
b/arch/mips/include/asm/mach-paravirt/cpu-feature-overrides.h
new file mode 100644
index 000..725e1ed
--- /dev/null
+++ b/arch/mips/include/asm/mach-paravirt/cpu-feature-overrides.h
@@ -0,0 +1,36 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2013 Cavium, Inc.
+ */
+#ifndef __ASM_MACH_PARAVIRT_CPU_FEATURE_OVERRIDES_H
+#define __ASM_MACH_PARAVIRT_CPU_FEATURE_OVERRIDES_H
+
+#define cpu_has_4kex   1
+#define cpu_has_3k_cache   0
+#define cpu_has_tx39_cache 0
+#define cpu_has_counter1
+#define cpu_has_llsc   1
+/*
+ * We Disable LL/SC on non SMP systems as it is faster to disable
+ * interrupts for atomic access than a LL/SC.
+ */
+#ifdef CONFIG_SMP
+# define kernel_uses_llsc  1
+#else
+# define kernel_uses_llsc  0
+#endif
+
+#ifdef CONFIG_CPU_CAVIUM_OCTEON
+#define cpu_dcache_line_size() 128
+#define cpu_icache_line_size() 128
+#define cpu_has_octeon_cache   1
+#define cpu_has_4k_cache   0
+#else
+#define cpu_has_octeon_cache   0
+#define cpu_has_4k_cache   1
+#endif
+
+#endif /* __ASM_MACH_PARAVIRT_CPU_FEATURE_OVERRIDES_H */
diff --git a/arch/mips/include/asm/mach-paravirt/irq.h 
b/arch/mips/include/asm/mach-paravirt/irq.h
new file mode 100644
index 000..9b4d35e
--- /dev/null
+++ b/arch/mips/include/asm/mach-paravirt/irq.h
@@ -0,0 +1,19 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2013 Cavium, Inc.
+ */
+#ifndef __ASM_MACH_PARAVIRT_IRQ_H__
+#define  __ASM_MACH_PARAVIRT_IRQ_H__
+
+#define NR_IRQS 64
+#define MIPS_CPU_IRQ_BASE 1
+
+#define MIPS_IRQ_PCIA (MIPS_CPU_IRQ_BASE + 8)
+
+#define MIPS_IRQ_MBOX0 (MIPS_CPU_IRQ_BASE + 32)
+#define MIPS_IRQ_MBOX1 (MIPS_CPU_IRQ_BASE + 33)
+
+#endif /* __ASM_MACH_PARAVIRT_IRQ_H__ */
diff --git a/arch/mips/include/asm/mach-paravirt/kernel-entry-init.h 
b/arch/mips/include/asm/mach-paravirt/kernel-entry-init.h
new file mode 100644
index 000..2f82bfa
--- /dev/null
+++ b/arch/mips/include/asm/mach-paravirt/kernel-entry-init.h
@@ -0,0 +1,50 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file COPYING in the main directory of this archive
+ * for more details.
+ 

[PATCH v2 12/13] MIPS: Enable build for new system 'paravirt'

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/Kbuild.platforms |1 +
 arch/mips/Kconfig  |   19 +++
 2 files changed, 20 insertions(+)

diff --git a/arch/mips/Kbuild.platforms b/arch/mips/Kbuild.platforms
index 6e23912..f5e18bf 100644
--- a/arch/mips/Kbuild.platforms
+++ b/arch/mips/Kbuild.platforms
@@ -18,6 +18,7 @@ platforms += loongson1
 platforms += mti-malta
 platforms += mti-sead3
 platforms += netlogic
+platforms += paravirt
 platforms += pmcs-msp71xx
 platforms += pnx833x
 platforms += ralink
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 1f836d8..df16e1e 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -803,6 +803,25 @@ config NLM_XLP_BOARD
  This board is based on Netlogic XLP Processor.
  Say Y here if you have a XLP based board.
 
+config MIPS_PARAVIRT
+   bool Para-Virtualized guest system
+   select CEVT_R4K
+   select CSRC_R4K
+   select DMA_COHERENT
+   select SYS_SUPPORTS_64BIT_KERNEL
+   select SYS_SUPPORTS_32BIT_KERNEL
+   select SYS_SUPPORTS_BIG_ENDIAN
+   select SYS_SUPPORTS_SMP
+   select NR_CPUS_DEFAULT_4
+   select SYS_HAS_EARLY_PRINTK
+   select SYS_HAS_CPU_MIPS32_R2
+   select SYS_HAS_CPU_MIPS64_R2
+   select SYS_HAS_CPU_CAVIUM_OCTEON
+   select HW_HAS_PCI
+   select SWAP_IO_SPACE
+   help
+ This option supports guest running under 
+
 endchoice
 
 source arch/mips/alchemy/Kconfig
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 08/13] MIPS: OCTEON: Add OCTEON3 to __get_cpu_type

2014-05-28 Thread Andreas Herrmann
Otherwise __builtin_unreachable might be called.

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/include/asm/cpu-type.h |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/mips/include/asm/cpu-type.h b/arch/mips/include/asm/cpu-type.h
index e3308b4..b4e2bd8 100644
--- a/arch/mips/include/asm/cpu-type.h
+++ b/arch/mips/include/asm/cpu-type.h
@@ -163,6 +163,7 @@ static inline int __pure __get_cpu_type(const int cpu_type)
case CPU_CAVIUM_OCTEON:
case CPU_CAVIUM_OCTEON_PLUS:
case CPU_CAVIUM_OCTEON2:
+   case CPU_CAVIUM_OCTEON3:
 #endif
 
 #if defined(CONFIG_SYS_HAS_CPU_BMIPS32_3300) || \
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 09/13] MIPS: Add functions for hypervisor call

2014-05-28 Thread Andreas Herrmann
From: David Daney david.da...@cavium.com

Introduce kvm_hypercall[0-3].
Define three new hypercalls for MIPS: GET_CLOCK_FREQ, EXIT_VM, and
CONSOLE_OUTPUT.

[andreas.herrmann:
  * Properly define hypercalls and HC numbers for MIPS
in kvm_para.h header files]

Signed-off-by: David Daney david.da...@cavium.com
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 arch/mips/include/asm/kvm_para.h  |   91 +
 arch/mips/include/uapi/asm/kvm_para.h |6 ++-
 include/uapi/linux/kvm_para.h |3 ++
 3 files changed, 99 insertions(+), 1 deletion(-)
 create mode 100644 arch/mips/include/asm/kvm_para.h

diff --git a/arch/mips/include/asm/kvm_para.h b/arch/mips/include/asm/kvm_para.h
new file mode 100644
index 000..e559812
--- /dev/null
+++ b/arch/mips/include/asm/kvm_para.h
@@ -0,0 +1,91 @@
+#ifndef _ASM_MIPS_KVM_PARA_H
+#define _ASM_MIPS_KVM_PARA_H
+
+#include uapi/asm/kvm_para.h
+
+#define KVM_HYPERCALL .word 0x4228
+
+/*
+ * Hypercalls for KVM.
+ *
+ * Hypercall number is passed in v0.
+ * Return value will be placed in v0.
+ * Up to 3 arguments are passed in a0, a1, and a2.
+ */
+static inline unsigned long kvm_hypercall0(unsigned long num)
+{
+   register unsigned long n asm(v0);
+   register unsigned long r asm(v0);
+
+   n = num;
+   __asm__ __volatile__(
+   KVM_HYPERCALL
+   : =r (r) : r (n) : memory
+   );
+
+   return r;
+}
+
+static inline unsigned long kvm_hypercall1(unsigned long num,
+   unsigned long arg0)
+{
+   register unsigned long n asm(v0);
+   register unsigned long r asm(v0);
+   register unsigned long a0 asm(a0);
+
+   n = num;
+   a0 = arg0;
+   __asm__ __volatile__(
+   KVM_HYPERCALL
+   : =r (r) : r (n), r (a0) : memory
+   );
+
+   return r;
+}
+
+static inline unsigned long kvm_hypercall2(unsigned long num,
+   unsigned long arg0, unsigned long arg1)
+{
+   register unsigned long n asm(v0);
+   register unsigned long r asm(v0);
+   register unsigned long a0 asm(a0);
+   register unsigned long a1 asm(a1);
+
+   n = num;
+   a0 = arg0;
+   a1 = arg1;
+   __asm__ __volatile__(
+   KVM_HYPERCALL
+   : =r (r) : r (n), r (a0), r (a1) : memory
+   );
+
+   return r;
+}
+
+static inline unsigned long kvm_hypercall3(unsigned long num,
+   unsigned long arg0, unsigned long arg1, unsigned long arg2)
+{
+   register unsigned long n asm(v0);
+   register unsigned long r asm(v0);
+   register unsigned long a0 asm(a0);
+   register unsigned long a1 asm(a1);
+   register unsigned long a2 asm(a2);
+
+   n = num;
+   a0 = arg0;
+   a1 = arg1;
+   a2 = arg2;
+   __asm__ __volatile__(
+   KVM_HYPERCALL
+   : =r (r) : r (n), r (a0), r (a1), r (a2) : memory
+   );
+
+   return r;
+}
+
+static inline unsigned int kvm_arch_para_features(void)
+{
+   return 0;
+}
+
+#endif /* _ASM_MIPS_KVM_PARA_H */
diff --git a/arch/mips/include/uapi/asm/kvm_para.h 
b/arch/mips/include/uapi/asm/kvm_para.h
index 14fab8f..7e16d7c 100644
--- a/arch/mips/include/uapi/asm/kvm_para.h
+++ b/arch/mips/include/uapi/asm/kvm_para.h
@@ -1 +1,5 @@
-#include asm-generic/kvm_para.h
+#ifndef _UAPI_ASM_MIPS_KVM_PARA_H
+#define _UAPI_ASM_MIPS_KVM_PARA_H
+
+
+#endif /* _UAPI_ASM_MIPS_KVM_PARA_H */
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 2841f86..bf6cd7d 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -20,6 +20,9 @@
 #define KVM_HC_FEATURES3
 #define KVM_HC_PPC_MAP_MAGIC_PAGE  4
 #define KVM_HC_KICK_CPU5
+#define KVM_HC_MIPS_GET_CLOCK_FREQ 6
+#define KVM_HC_MIPS_EXIT_VM7
+#define KVM_HC_MIPS_CONSOLE_OUTPUT 8
 
 /*
  * hypercalls use architecture specific
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] PCI: Introduce new device binding path using pci_dev.driver_override

2014-05-28 Thread Greg KH
On Tue, May 27, 2014 at 09:07:42PM -0600, Bjorn Helgaas wrote:
 On Tue, May 20, 2014 at 08:53:21AM -0600, Alex Williamson wrote:
  The driver_override field allows us to specify the driver for a device
  rather than relying on the driver to provide a positive match of the
  device.  This shortcuts the existing process of looking up the vendor
  and device ID, adding them to the driver new_id, binding the device,
  then removing the ID, but it also provides a couple advantages.
  
  First, the above existing process allows the driver to bind to any
  device matching the new_id for the window where it's enabled.  This is
  often not desired, such as the case of trying to bind a single device
  to a meta driver like pci-stub or vfio-pci.  Using driver_override we
  can do this deterministically using:
  
  echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
  echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
  echo :03:00.0  /sys/bus/pci/drivers_probe
  
  Previously we could not invoke drivers_probe after adding a device
  to new_id for a driver as we get non-deterministic behavior whether
  the driver we intend or the standard driver will claim the device.
  Now it becomes a deterministic process, only the driver matching
  driver_override will probe the device.
  
  To return the device to the standard driver, we simply clear the
  driver_override and reprobe the device:
  
  echo  /sys/bus/pci/devices/:03:00.0/driver_override
  echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
  echo :03:00.0  /sys/bus/pci/drivers_probe
  
  Another advantage to this approach is that we can specify a driver
  override to force a specific binding or prevent any binding.  For
  instance when an IOMMU group is exposed to userspace through VFIO
  we require that all devices within that group are owned by VFIO.
  However, devices can be hot-added into an IOMMU group, in which case
  we want to prevent the device from binding to any driver (override
  driver = none) or perhaps have it automatically bind to vfio-pci.
  With driver_override it's a simple matter for this field to be set
  internally when the device is first discovered to prevent driver
  matches.
  
  Signed-off-by: Alex Williamson alex.william...@redhat.com
  Cc: Greg Kroah-Hartman gre...@linuxfoundation.org
 
 Greg, are you going to weigh in on this?  It does seem to solve some real
 problems.  ISTR you had an opinion once, but I don't know your current
 thoughts.

This looks good to me:

Acked-by: Greg Kroah-Hartman gre...@linuxfoundation.org
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Rik van Riel
On 05/28/2014 08:16 AM, Raghavendra K T wrote:

This patch looks very promising.

 TODO:
 - we need an intelligent way to nullify the effect of batching for baremetal
  (because extra cmpxchg is not required).

On (larger?) NUMA systems, the unfairness may be a nice performance
benefit, reducing cache line bouncing through the system, and it
could well outweigh the extra cmpxchg at times.

 - we may have to make batch size as kernel arg to solve above problem
  (to run same kernel for host/guest). Increasing batch size also seem to help
  virtualized guest more, so we will have flexibility of tuning depending on 
 vm size.
 
 - My kernbench/ebizzy test on baremetal (32 cpu +ht sandybridge) did not seem 
 to
   show the impact of extra cmpxchg. but there should be effect of extra 
 cmpxchg.

Canceled out by better NUMA locality?

Or maybe cmpxchg is cheap once you already own the cache line
exclusively?

 - virtualized guest had slight impact on 1x cases of some benchmarks but we 
 have got
  impressive performance for 1x cases. So overall, patch needs exhaustive 
 tesing.
 
 - we can further add dynamically changing batch_size implementation 
 (inspiration and
   hint by Paul McKenney) as necessary.

I could see a larger batch size being beneficial.

Currently the maximum wait time for a spinlock on a system
with N CPUs is N times the length of the largest critical
section.

Having the batch size set equal to the number of CPUs would only
double that, and better locality (CPUs local to the current
lock holder winning the spinlock operation) might speed things
up enough to cancel that part of that out again...

  I have found that increasing  batch size gives excellent improvements for 
  overcommitted guests. I understand that we need more exhaustive testing.
 
  Please provide your suggestion and comments.

I have only nitpicks so far...

 diff --git a/arch/x86/include/asm/spinlock_types.h 
 b/arch/x86/include/asm/spinlock_types.h
 index 4f1bea1..b04c03d 100644
 --- a/arch/x86/include/asm/spinlock_types.h
 +++ b/arch/x86/include/asm/spinlock_types.h
 @@ -3,15 +3,16 @@
  
  #include linux/types.h
  
 +#define TICKET_LOCK_INC_SHIFT 1
 +#define __TICKET_LOCK_TAIL_INC (1TICKET_LOCK_INC_SHIFT)
 +
  #ifdef CONFIG_PARAVIRT_SPINLOCKS
 -#define __TICKET_LOCK_INC2
  #define TICKET_SLOWPATH_FLAG ((__ticket_t)1)
  #else
 -#define __TICKET_LOCK_INC1
  #define TICKET_SLOWPATH_FLAG ((__ticket_t)0)
  #endif

For the !CONFIG_PARAVIRT case, TICKET_LOCK_INC_SHIFT used to be 0,
now you are making it one. Probably not an issue, since even people
who compile with 128  CONFIG_NR_CPUS = 256 will likely have their
spinlocks padded out to 32 or 64 bits anyway in most data structures.

 -#if (CONFIG_NR_CPUS  (256 / __TICKET_LOCK_INC))
 +#if (CONFIG_NR_CPUS  (256 / __TICKET_LOCK_TAIL_INC))
  typedef u8  __ticket_t;
  typedef u16 __ticketpair_t;
  #else
 @@ -19,7 +20,12 @@ typedef u16 __ticket_t;
  typedef u32 __ticketpair_t;
  #endif
  
 -#define TICKET_LOCK_INC  ((__ticket_t)__TICKET_LOCK_INC)
 +#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC)
 +
 +#define TICKET_LOCK_HEAD_INC ((__ticket_t)1)
 +#define TICKET_BATCH0x4 /* 4 waiters can contend simultaneously */
 +#define TICKET_LOCK_BATCH_MASK (~(TICKET_BATCHTICKET_LOCK_INC_SHIFT) + \
 +   TICKET_LOCK_TAIL_INC - 1)

I do not see the value in having TICKET_BATCH declared with a
hexadecimal number, and it may be worth making sure the code
does not compile if someone tried a TICKET_BATCH value that
is not a power of 2.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/15] MIPS: paravirt: Provide _machine_halt function to exit VM on shutdown of guest

2014-05-28 Thread Andreas Herrmann
On Wed, May 21, 2014 at 02:44:49PM +0100, James Hogan wrote:
 On 20/05/14 15:47, Andreas Herrmann wrote:
  Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
 
 Does it make sense to provide a _machine_restart too?

Hmm, I've not seen a real need for this so far.
(Halting the guest and relaunching it from the shell with lkvm was fast
enough for my tests ;-)

But it's worth to get it working. I might be wrong but I think that
this requires lkvm changes to actually handle the reboot.

 I think this should be squashed into patch 10 really,

Done that.

 or else patch 10
 split up into several parts (irq, smp, serial, other).

Still kept the pci stuff as a separate patch in case that it might be
replaced with something based on PCI: Generic Configuration Access
Mechanism support (https://lkml.org/lkml/2014/5/18/54) or similar.

Andreas

 Cheers
 James
 
  ---
   arch/mips/paravirt/setup.c |7 +++
   1 file changed, 7 insertions(+)
  
  diff --git a/arch/mips/paravirt/setup.c b/arch/mips/paravirt/setup.c
  index f80c3bc..6d2781c 100644
  --- a/arch/mips/paravirt/setup.c
  +++ b/arch/mips/paravirt/setup.c
  @@ -8,6 +8,7 @@
   
   #include linux/kernel.h
   
  +#include asm/reboot.h
   #include asm/bootinfo.h
   #include asm/mipsregs.h
   #include asm/smp-ops.h
  @@ -27,6 +28,11 @@ void __init plat_time_init(void)
  preset_lpj = mips_hpt_frequency / (2 * HZ);
   }
   
  +static void pv_machine_halt(void)
  +{
  +   hypcall0(1 /* Exit VM */);
  +}
  +
   /*
* Early entry point for arch setup
*/
  @@ -47,6 +53,7 @@ void __init prom_init(void)
  if (i  argc - 1)
  strlcat(arcs_cmdline,  , COMMAND_LINE_SIZE);
  }
  +   _machine_halt = pv_machine_halt;
  register_smp_ops(paravirt_smp_ops);
   }
   
  
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] PCI: Introduce new device binding path using pci_dev.driver_override

2014-05-28 Thread Bjorn Helgaas
On Tue, May 20, 2014 at 08:53:21AM -0600, Alex Williamson wrote:
 The driver_override field allows us to specify the driver for a device
 rather than relying on the driver to provide a positive match of the
 device.  This shortcuts the existing process of looking up the vendor
 and device ID, adding them to the driver new_id, binding the device,
 then removing the ID, but it also provides a couple advantages.
 
 First, the above existing process allows the driver to bind to any
 device matching the new_id for the window where it's enabled.  This is
 often not desired, such as the case of trying to bind a single device
 to a meta driver like pci-stub or vfio-pci.  Using driver_override we
 can do this deterministically using:
 
 echo pci-stub  /sys/bus/pci/devices/:03:00.0/driver_override
 echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
 echo :03:00.0  /sys/bus/pci/drivers_probe
 
 Previously we could not invoke drivers_probe after adding a device
 to new_id for a driver as we get non-deterministic behavior whether
 the driver we intend or the standard driver will claim the device.
 Now it becomes a deterministic process, only the driver matching
 driver_override will probe the device.
 
 To return the device to the standard driver, we simply clear the
 driver_override and reprobe the device:
 
 echo  /sys/bus/pci/devices/:03:00.0/driver_override
 echo :03:00.0  /sys/bus/pci/devices/:03:00.0/driver/unbind
 echo :03:00.0  /sys/bus/pci/drivers_probe
 
 Another advantage to this approach is that we can specify a driver
 override to force a specific binding or prevent any binding.  For
 instance when an IOMMU group is exposed to userspace through VFIO
 we require that all devices within that group are owned by VFIO.
 However, devices can be hot-added into an IOMMU group, in which case
 we want to prevent the device from binding to any driver (override
 driver = none) or perhaps have it automatically bind to vfio-pci.
 With driver_override it's a simple matter for this field to be set
 internally when the device is first discovered to prevent driver
 matches.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 Cc: Greg Kroah-Hartman gre...@linuxfoundation.org

I applied this with Reviewed-bys/Acks from Konrad, Alexander, and Greg to
pci/virtualization for v3.16, thanks!

 ---
 
 v3: kfree() override buffer on device release, noted by Alex Graf
 
 v2: Use strchr() as suggested by Guenter Roeck and adopted by the
 platform driver version of this same interface.
 
  Documentation/ABI/testing/sysfs-bus-pci |   21 
  drivers/pci/pci-driver.c|   25 +--
  drivers/pci/pci-sysfs.c |   40 
 +++
  drivers/pci/probe.c |1 +
  include/linux/pci.h |1 +
  5 files changed, 85 insertions(+), 3 deletions(-)
 
 diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
 b/Documentation/ABI/testing/sysfs-bus-pci
 index a3c5a66..898ddc4 100644
 --- a/Documentation/ABI/testing/sysfs-bus-pci
 +++ b/Documentation/ABI/testing/sysfs-bus-pci
 @@ -250,3 +250,24 @@ Description:
   valid.  For example, writing a 2 to this file when sriov_numvfs
   is not 0 and not 2 already will return an error. Writing a 10
   when the value of sriov_totalvfs is 8 will return an error.
 +
 +What:/sys/bus/pci/devices/.../driver_override
 +Date:April 2014
 +Contact: Alex Williamson alex.william...@redhat.com
 +Description:
 + This file allows the driver for a device to be specified which
 + will override standard static and dynamic ID matching.  When
 + specified, only a driver with a name matching the value written
 + to driver_override will have an opportunity to bind to the
 + device.  The override is specified by writing a string to the
 + driver_override file (echo pci-stub  driver_override) and
 + may be cleared with an empty string (echo  driver_override).
 + This returns the device to standard matching rules binding.
 + Writing to driver_override does not automatically unbind the
 + device from its current driver or make any attempt to
 + automatically load the specified driver.  If no driver with a
 + matching name is currently loaded in the kernel, the device
 + will not bind to any driver.  This also allows devices to
 + opt-out of driver binding using a driver_override name such as
 + none.  Only a single driver may be specified in the override,
 + there is no support for parsing delimiters.
 diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
 index d911e0c..4393c12 100644
 --- a/drivers/pci/pci-driver.c
 +++ b/drivers/pci/pci-driver.c
 @@ -216,6 +216,13 @@ const struct pci_device_id 

Re: [PATCH 11/15] MIPS: paravirt: Add pci controller for virtio

2014-05-28 Thread Andreas Herrmann
On Thu, May 22, 2014 at 10:17:07PM +0200, Andreas Herrmann wrote:
 On Wed, May 21, 2014 at 12:42:52PM +0100, James Hogan wrote:
  On 20/05/14 15:47, Andreas Herrmann wrote:
   From: David Daney david.da...@cavium.com
   
   Signed-off-by: David Daney david.da...@cavium.com
   Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
   ---
arch/mips/Kconfig|1 +
arch/mips/paravirt/Kconfig   |6 ++
arch/mips/pci/Makefile   |2 +-
arch/mips/pci/pci-virtio-guest.c |  140 
   ++
4 files changed, 148 insertions(+), 1 deletion(-)
create mode 100644 arch/mips/paravirt/Kconfig
create mode 100644 arch/mips/pci/pci-virtio-guest.c
  
  If I understand correctly this just drives a simple PCI controller for a
  PCI bus that a virtio device happens to be usually plugged in to, yeh?
 
 Yes.
 
  It sounds like it would make sense to take advantage of Will Deacon's
  recent efforts to make a generic pci controller driver for this sort of
  thing which specifically mentions emulation by kvmtool? Is it
  effectively the same PCI controller that is being emulated?
 
 I think, it's very similar. But it depends on OF.
  
  http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/thread.html#233491
  http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/233491.html
  http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/233490.html
 
 Currently we are at v6:
 http://marc.info/?i=1399478839-3564-1-git-send-email-will.dea...@arm.com
 
 Will take a closer look (trying to get it running for mips_paravirt).

FYI, I've dismissed this (for v2) after taking a closer look and after
I've seen https://lkml.org/lkml/2014/5/18/54.


Andreas
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Linus Torvalds
On Wed, May 28, 2014 at 2:55 PM, Rik van Riel r...@redhat.com wrote:

 Or maybe cmpxchg is cheap once you already own the cache line
 exclusively?

A locked cmpxchg ends up being anything between ~15-50 cycles
depending on microarchitecture if things are already exclusively in
the cache (with the P4 being an outlier, and all locked instructions
tend to take ~100+ cycles, but I can't say I can really find it in
myself to even care about netburst any more).

The most noticeable downside we've seen has been when we've used
read-op-cmpxchg as a _replacement_ for something like lock [x]add,
when that read+cmpxchg has caused two cacheline ops (cacheline first
loaded shared by the read, then exclusive by the cmpxchg). That's bad.

But if preceded by a write (or, in this case, an xadd), that doesn't
happen. Still, those roughly 15-50 cycles can certainly be noticeable
(especially at the high end), but you need to have some load that
doesn't bounce the lock, and largely fit in the caches to see it. And
you probably want to test one of the older CPU's, I think Haswell is
the lower end (ie in the ~20 cycle range).

If somebody has a P4 still, that's likely the worst case by far.

  Linus
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Thomas Gleixner
On Wed, 28 May 2014, Linus Torvalds wrote:
 
 If somebody has a P4 still, that's likely the worst case by far.

I do, but I'm only using it during winter and only if the ia64 machine
does not provide sufficient heating. So you have to wait at least half
a year until I'm able to test it.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/15] MIPS: paravirt: Provide _machine_halt function to exit VM on shutdown of guest

2014-05-28 Thread James Hogan
On Thursday 29 May 2014 00:04:18 Andreas Herrmann wrote:
 On Wed, May 21, 2014 at 02:44:49PM +0100, James Hogan wrote:
  On 20/05/14 15:47, Andreas Herrmann wrote:
   Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
  
  Does it make sense to provide a _machine_restart too?
 
 Hmm, I've not seen a real need for this so far.
 (Halting the guest and relaunching it from the shell with lkvm was fast
 enough for my tests ;-)
 
 But it's worth to get it working. I might be wrong but I think that
 this requires lkvm changes to actually handle the reboot.

Fair enough. No point implementing something that can't be used/tested yet. If 
QEMU gets support for paravirt, it can be done then.

Cheers
James
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Implement Batched (group) ticket lock

2014-05-28 Thread Rik van Riel
On 05/28/2014 06:19 PM, Linus Torvalds wrote:

 If somebody has a P4 still, that's likely the worst case by far.

I'm sure cmpxchg isn't the only thing making P4 the worst case :)

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 4/4] add 2nd stage page fault handling during live migration

2014-05-28 Thread Mario Smarduch

Little bit more details on this question -

For 2nd stage 3-level tables PUD blocks don't exist - although
it appears you can have a PGD block but I don't see any
support for that. But should the code still work as if PUDs
(4-level table) are used and check for pud_huge()?

Looking at ARMv8 there are several block formats, I don't know which one
will be use for 2nd stage (4KB, 16,...) but one of them supports 4-level
table (have not looked at this in detail, could be wrong here).

Should pud_huge() be supported for future compatibility?

This impacts logging -
 - Some decisions are needed either clear the PUD entry and
   force them to pages or mark dirty bit map for each 4k page
   in the PUD Block range, IA64 appears to that in mark_pages_dirty().

 - If you assume pud_huge() then you probably have to support
   the logic for PUD Block descriptor even though
   it's not used in 3-level table at this time.

I think until PUD Blocks are actually used it's maybe better to
ignore them.

- Mario



On 05/28/2014 11:42 AM, Mario Smarduch wrote:
 
 emslot dirty_bitmap during and after write protect.


 -Christoffer
 
 Regarding huge pud that's causing some design problems, should huge PUD
 pages be considered at all?
 
 Thanks,
   Mario


 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

 
 ___
 kvmarm mailing list
 kvm...@lists.cs.columbia.edu
 https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] KVM: PPC: Book3S: Controls for in-kernel PAPR hypercall handling

2014-05-28 Thread Paul Mackerras
On Wed, May 28, 2014 at 03:27:32PM +0200, Alexander Graf wrote:
 
 On 26.05.14 14:17, Paul Mackerras wrote:
 +6.8 KVM_CAP_PPC_ENABLE_HCALL
 +
 +Architectures: ppc
 +Parameters: args[0] is the PAPR hcall number
 +args[1] is 0 to disable, 1 to enable in-kernel handling
 +
 +This capability controls whether individual PAPR hypercalls (hcalls)
 +get handled by the kernel or not.  Enabling or disabling in-kernel
 +handling of an hcall is effective across the VM.  On creation, an
 
 Hrm. Could we move the CAP to vm level then?

You mean, define a VM ioctl instead of using a capability?  Or are you
suggesting something else?

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling

2014-05-28 Thread Michael Neuling
Alex,

  +static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, unsigned long mflags,
  +unsigned long resource, unsigned long value1,
  +unsigned long value2)
  +{
  +   switch (resource) {
  +   case H_SET_MODE_RESOURCE_SET_CIABR:
  +   if (!kvmppc_power8_compatible(vcpu))
  +   return H_P2;
  +   if (value2)
  +   return H_P4;
  +   if (mflags)
  +   return H_UNSUPPORTED_FLAG_START;
  +   if ((value1  0x3) == 0x3)
 
 What is this?

It's what it says in PAPR (I wish that was public!!!).  Joking aside... 

If you refer to the 2.07 HW arch (not PAPR), the bottom two bits of the
CIABR tell you what mode to match in.  0x3 means match in hypervisor,
which we obviously don't want the guest to be able to do.

I'll add some #defines to make it a clearer and repost.

Mikey

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM] About releasing vcpu when closing vcpu fd

2014-05-28 Thread Gu Zheng
Hi Gleb,

On 05/23/2014 05:43 PM, Gleb Natapov wrote:

 CCing Paolo.
 
 On Tue, May 20, 2014 at 01:45:55PM +0800, Gu Zheng wrote:
 Hi Gleb,
 Excuse me for offline noisy.
 You will get much quicker response if you'll post to the list :)

Got it.:)

 
 There was a patch(from Chen Fan, last august) about releasing vcpu when
 closing vcpu fd http://www.spinics.net/lists/kvm/msg95701.html, but
 your comment said Attempt where made to make it possible to destroy 
 individual vcpus separately from destroying VM before, but they were
 unsuccessful thus far.
 So what is the pain here? If we want to achieve the goal, what should we do?
 Looking forward to your further comments.:)

 CPU array is accessed locklessly in a lot of places, so it will have to be 
 RCUified.
 There was attempt to do so 2 year or so ago, but it didn't go anyware. Adding 
 locks is
 to big a price to pay for ability to free a little bit of memory by 
 destroying vcpu. 

Yes, it's a pain here. But if we want to implement vcpu hot-remove, this must 
be
fixed sooner or later.
And any guys working on kvm vcpu hot-remove now?

 An
 alternative may be to make sure that stopped vcpu takes as little memory as 
 possible.

Yeah. But if we add a new vcpu with the old id that we stopped before, it will 
fail.

Best regards,
Gu

 
 --
   Gleb.
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf


On 28.05.14 02:57, Alex Williamson wrote:

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:

On 28.05.14 02:39, Alex Williamson wrote:

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:

On 27.05.14 20:15, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
Documentation/vfio.txt  | 92 
-
drivers/vfio/pci/Makefile   |  1 +
drivers/vfio/pci/vfio_pci.c | 20 +---
drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
drivers/vfio/pci/vfio_pci_private.h |  5 ++
drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
include/uapi/linux/vfio.h   | 66 ++
7 files changed, 308 insertions(+), 7 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


+
+   return ret;
+}
+
static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
{
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(container-lock);
return 0;
+   case VFIO_EEH_PE_SET_OPTION:
+   case VFIO_EEH_PE_GET_STATE:
+   case VFIO_EEH_PE_RESET:
+   case VFIO_EEH_PE_CONFIGURE:
+   return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,


Sure, that's what patch review is about. I'm really more concerned about 
whose court the number space is in - you or Gavin. If we're talking 
about top level ioctls you will care a lot more.


But I'm not religious about this. You're the VFIO maintainer, so it's 
your call. I just personally cringe when I see an ioctl that gets an 
opcode and a parameter argument where the parameter argument is a 
union with one struct for each opcode.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf


On 28.05.14 02:55, Gavin Shan wrote:

On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
  Documentation/vfio.txt  | 92 -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c


[...]


diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
  
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
  
+/*

+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+   __u32 argsz;
+   __u32 flags;
+   __u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE0   /* Disable EEH  */
+#define VFIO_EEH_PE_SET_OPT_ENABLE 1   /* Enable EEH   */
+#define VFIO_EEH_PE_SET_OPT_IO 2   /* Enable IO*/
+#define VFIO_EEH_PE_SET_OPT_DMA3   /* Enable DMA   */

This is more of a command than an option isn't it?  Each of these
probably needs a more significant description.


Yeah, it would be regarded as opcode and I'll add more description about
them in next revision.


Please just call them commands.




+};
+
+#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS0   /* Get address  */
+#define VFIO_EEH_PE_GET_MODE   1   /* Query mode   */
+#define VFIO_EEH_PE_MODE_NONE  0   /* Not a PE */
+#define VFIO_EEH_PE_MODE_NOT_SHARED1   /* Exclusive*/
+#define VFIO_EEH_PE_MODE_SHARED2   /* Shared mode  */
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+   __u32 argsz;
+   __u32 flags;
+   __u32 state;
+};

Should state be a union to better describe the value returned?  What
exactly is the address and why does the user need to know it?  Does this
need user input or could we just return the address and mode regardless?


Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for address just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The address
here means PE address and user has to figure it out according to PE
segmentation.


Why would the user ever need the address?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] Bug fixes for HV KVM, v2

2014-05-28 Thread Alexander Graf


On 26.05.14 11:48, Paul Mackerras wrote:

This series of patches fixes a few bugs that have been found in
testing HV KVM recently.  It also adds workarounds for a couple of
POWER8 PMU bugs, fixes the definition of KVM_REG_PPC_WORT, and adds
some things that were missing from Documentation/virtual/kvm/api.txt.

The patches are against Alex Graf's kvm-ppc-queue branch.

Please apply for 3.16.  The first couple would be safe to go into 3.15
as well, and probably should.


Thanks, applied all to kvm-ppc-queue. I don't think it's necessary for 
the first few to really go into 3.15 - if user space uses a header from 
there it will just get unimplemented ONE_REG returns for WORT.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf


On 28.05.14 14:49, Gavin Shan wrote:

On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:

On 28.05.14 02:55, Gavin Shan wrote:

On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
  Documentation/vfio.txt  | 92 -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+/*
+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+   __u32 argsz;
+   __u32 flags;
+   __u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE0   /* Disable EEH  */
+#define VFIO_EEH_PE_SET_OPT_ENABLE 1   /* Enable EEH   */
+#define VFIO_EEH_PE_SET_OPT_IO 2   /* Enable IO*/
+#define VFIO_EEH_PE_SET_OPT_DMA3   /* Enable DMA   */

This is more of a command than an option isn't it?  Each of these
probably needs a more significant description.


Yeah, it would be regarded as opcode and I'll add more description about
them in next revision.

Please just call them commands.


Ok. I guess you want me to change the macro names like this ?

#define VFIO_EEH_CMD_DISABLE0   /* Disable EEH functionality*/
#define VFIO_EEH_CMD_ENABLE 1   /* Enable EEH functionality */
#define VFIO_EEH_CMD_ENABLE_IO  2   /* Enable IO for frozen PE  */
#define VFIO_EEH_CMD_ENABLE_DMA 3   /* Enable DMA for frozen PE */


Yes, the ioctl name too.




+};
+
+#define VFIO_EEH_PE_SET_OPTION _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS0   /* Get address  */
+#define VFIO_EEH_PE_GET_MODE   1   /* Query mode   */
+#define VFIO_EEH_PE_MODE_NONE  0   /* Not a PE */
+#define VFIO_EEH_PE_MODE_NOT_SHARED1   /* Exclusive*/
+#define VFIO_EEH_PE_MODE_SHARED2   /* Shared mode  */
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+   __u32 argsz;
+   __u32 flags;
+   __u32 state;
+};

Should state be a union to better describe the value returned?  What
exactly is the address and why does the user need to know it?  Does this
need user input or could we just return the address and mode regardless?


Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for address just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The address
here means PE address and user has to figure it out according to PE
segmentation.

Why would the user ever need the address?


I will remove those address related macros in next revision because it's
user-level bussiness, not related to host kernel any more.


Ok, so the next question is whether there will be any state outside of 
GET_MODE left in the future.



Alex


If the user is QEMU + guest, we need the address to identify the PE though PHB
BUID could be used as same purpose to get PHB, which is one-to-one mapping with
IOMMU group on sPAPR platform. However, once the PE address is built and 
returned
to guest, guest will use the PE address as input parameter in subsequent RTAS
calls.

If the user is some kind of application who just uses the ioctl() without 
supporting
RTAS calls. We don't need care about PE address.

Thanks,
Gavin

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: 

Re: [PATCH 1/3] KVM: PPC: Book3S: Controls for in-kernel PAPR hypercall handling

2014-05-28 Thread Alexander Graf


On 26.05.14 14:17, Paul Mackerras wrote:

This provides a way for userspace controls which PAPR hcalls get
handled in the kernel.  Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS.  The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.

Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for
the KVM_CAP_PPC_ENABLE_HCALL capability.  The args field of the
struct kvm_enable_cap specifies the hcall number in args[0] and
the enable/disable flag in args[1]; 0 means disable in-kernel
handling (so that the hcall will always cause an exit to userspace)
and 1 means enable.

Enabling or disabling in-kernel handling of an hcall is effective
across the whole VM, even though the KVM_ENABLE_CAP ioctl is
applied to a vcpu.

When a VM is created, an initial set of hcalls are enabled for
in-kernel handling.  The set that is enabled is the set that have
an in-kernel implementation at this point.  Any new hcall
implementations from this point onwards should not be added to the
default set.

No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.

Signed-off-by: Paul Mackerras pau...@samba.org
---
  Documentation/virtual/kvm/api.txt   | 17 +++
  arch/powerpc/include/asm/kvm_book3s.h   |  1 +
  arch/powerpc/include/asm/kvm_host.h |  2 ++
  arch/powerpc/kernel/asm-offsets.c   |  1 +
  arch/powerpc/kvm/book3s_hv.c| 51 +
  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 +++
  arch/powerpc/kvm/book3s_pr.c|  5 
  arch/powerpc/kvm/book3s_pr_papr.c   | 37 
  arch/powerpc/kvm/powerpc.c  | 19 
  include/uapi/linux/kvm.h|  1 +
  10 files changed, 145 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 6b0225d..dfd6e0c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2983,3 +2983,20 @@ Parameters: args[0] is the XICS device fd
  args[1] is the XICS CPU number (server ID) for this vcpu
  
  This capability connects the vcpu to an in-kernel XICS device.

+
+6.8 KVM_CAP_PPC_ENABLE_HCALL
+
+Architectures: ppc
+Parameters: args[0] is the PAPR hcall number
+   args[1] is 0 to disable, 1 to enable in-kernel handling
+
+This capability controls whether individual PAPR hypercalls (hcalls)
+get handled by the kernel or not.  Enabling or disabling in-kernel
+handling of an hcall is effective across the VM.  On creation, an


Hrm. Could we move the CAP to vm level then?


+initial set of hcalls are enabled for in-kernel handling, which
+consists of those hcalls for which in-kernel handlers were implemented
+before this capability was implemented.  If disabled, the kernel will
+not to attempt to handle the hcall, but will always exit to userspace
+to handle it.  Note that it may not make sense to enable some and
+disable others of a group of related hcalls, but KVM will not prevent
+userspace from doing that.
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index f52f656..772044b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -189,6 +189,7 @@ extern void kvmppc_hv_entry_trampoline(void);
  extern u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst);
  extern ulong kvmppc_alignment_dar(struct kvm_vcpu *vcpu, unsigned int inst);
  extern int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd);
+extern void kvmppc_pr_init_default_hcalls(struct kvm *kvm);
  extern void kvmppc_copy_to_svcpu(struct kvmppc_book3s_shadow_vcpu *svcpu,
 struct kvm_vcpu *vcpu);
  extern void kvmppc_copy_from_svcpu(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index bb66d8b..2889587 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -34,6 +34,7 @@
  #include asm/processor.h
  #include asm/page.h
  #include asm/cacheflush.h
+#include asm/hvcall.h
  
  #define KVM_MAX_VCPUS		NR_CPUS

  #define KVM_MAX_VCORESNR_CPUS
@@ -263,6 +264,7 @@ struct kvm_arch {
  #ifdef CONFIG_PPC_BOOK3S_64
struct list_head spapr_tce_tables;
struct list_head rtas_tokens;
+   DECLARE_BITMAP(enabled_hcalls, MAX_HCALL_OPCODE/4 + 1);
  #endif
  #ifdef CONFIG_KVM_MPIC
struct openpic *mpic;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 93e1465..c427b51 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -492,6 +492,7 @@ int main(void)
DEFINE(KVM_HOST_SDR1, 

Re: [PATCH 2/3] KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled

2014-05-28 Thread Alexander Graf


On 26.05.14 14:17, Paul Mackerras wrote:

This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.

Signed-off-by: Paul Mackerras pau...@samba.org


Please add this as sanity check to the default enabled list as well - in 
case we lose the ability to enable an in-kernel hcall later.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Book3S PR: Rework SLB switching code

2014-05-28 Thread Alexander Graf


On 17.05.14 07:36, Paul Mackerras wrote:

On Thu, May 15, 2014 at 02:43:53PM +0200, Alexander Graf wrote:

On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

I think this is a great idea; I have been thinking we should do
something like this.


While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.

Signed-off-by: Alexander Graf ag...@suse.de

[snip]


+#define SHADOW_SLB_ENTRY_LEN   0x10

Normally we'd define structure offsets/sizes like this in
asm-offsets.c.  However, since the structure can't change I guess this
is OK.


/* Fill SLB with our shadow */
  
+	lis	r7, SLB_ESID_V@h

+
lbz r12, SVCPU_SLB_MAX(r3)
mulli   r12, r12, 16
addir12, r12, SVCPU_SLB
@@ -99,7 +76,7 @@ slb_loop_enter:
  
  	ld	r10, 0(r11)
  
-	rldicl. r0, r10, 37, 63

+   and.r9, r10, r7

Or...
andis.  r9, r10, SLB_ESID_V@h
and save a register and an instruction.


Good idea :)




cmpdcr0, r11, r12
blt slb_loop_enter
  
+	isync

+   sync

Why?


Hrm, I guess I was trying to narrow down why things broke. I'll omit it 
and see whether my test machine can still successfully run PR KVM.





+BEGIN_FW_FTR_SECTION
+
+   /* Declare SLB shadow as SLB_NUM_BOLTED entries big */
+
+   li  r8, SLB_NUM_BOLTED
+   stb r8, 3(r11)

Currently it's true that slb_shadow.persistent is always
SLB_NUM_BOLTED, but if you are going to embed that assumption here in


We had that assumption before too ;)


the KVM code you should at least add some comments over in
arch/powerpc/mm/slb.c and in arch/powerpc/kernel/paca.c (where
slb_shadow.persistent gets initialized) warning people that if they
break that assumption they need to fix KVM code as well.


but I like warnings, so I'll add some.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] KVM: PPC: Book3S PR: Use SLB entry 0

2014-05-28 Thread Alexander Graf
We didn't make use of SLB entry 0 because ... of no good reason. SLB entry 0
will always be used by the Linux linear SLB entry, so the fact that slbia
does not invalidate it doesn't matter as we overwrite SLB 0 on exit anyway.

Just enable use of SLB entry 0 for our shadow SLB code.

Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - flush ERAT by writing 0 to slb0
---
 arch/powerpc/kvm/book3s_64_mmu_host.c | 11 ---
 arch/powerpc/kvm/book3s_64_slb.S  |  3 ++-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c 
b/arch/powerpc/kvm/book3s_64_mmu_host.c
index e2efb85..0ac9839 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -271,11 +271,8 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, 
ulong esid)
int found_inval = -1;
int r;
 
-   if (!svcpu-slb_max)
-   svcpu-slb_max = 1;
-
/* Are we overwriting? */
-   for (i = 1; i  svcpu-slb_max; i++) {
+   for (i = 0; i  svcpu-slb_max; i++) {
if (!(svcpu-slb[i].esid  SLB_ESID_V))
found_inval = i;
else if ((svcpu-slb[i].esid  ESID_MASK) == esid) {
@@ -285,7 +282,7 @@ static int kvmppc_mmu_next_segment(struct kvm_vcpu *vcpu, 
ulong esid)
}
 
/* Found a spare entry that was invalidated before */
-   if (found_inval  0) {
+   if (found_inval = 0) {
r = found_inval;
goto out;
}
@@ -359,7 +356,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong 
ea, ulong seg_size)
ulong seg_mask = -seg_size;
int i;
 
-   for (i = 1; i  svcpu-slb_max; i++) {
+   for (i = 0; i  svcpu-slb_max; i++) {
if ((svcpu-slb[i].esid  SLB_ESID_V) 
(svcpu-slb[i].esid  seg_mask) == ea) {
/* Invalidate this entry */
@@ -373,7 +370,7 @@ void kvmppc_mmu_flush_segment(struct kvm_vcpu *vcpu, ulong 
ea, ulong seg_size)
 void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu)
 {
struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
-   svcpu-slb_max = 1;
+   svcpu-slb_max = 0;
svcpu-slb[0].esid = 0;
svcpu_put(svcpu);
 }
diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 596140e..84c52c6 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -138,7 +138,8 @@ slb_do_enter:
 
/* Restore bolted entries from the shadow and fix it along the way */
 
-   /* We don't store anything in entry 0, so we don't need to take care of 
it */
+   li  r0, r0
+   slbmte  r0, r0
slbia
isync
 
-- 
1.8.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] KVM: PPC: Book3S PR: Rework SLB switching code

2014-05-28 Thread Alexander Graf
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.

Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.

Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.

While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.

Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - use andis.
  - remove superfluous isync/sync
  - add KVM warning comments in SLB bolting code
---
 arch/powerpc/kernel/paca.c   |  3 ++
 arch/powerpc/kvm/book3s_64_slb.S | 83 ++--
 arch/powerpc/mm/slb.c|  2 +-
 3 files changed, 42 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index ad302f8..d6e195e 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -98,6 +98,9 @@ static inline void free_lppacas(void) { }
 /*
  * 3 persistent SLBs are registered here.  The buffer will be zero
  * initially, hence will all be invaild until we actually write them.
+ *
+ * If you make the number of persistent SLB entries dynamic, please also
+ * update PR KVM to flush and restore them accordingly.
  */
 static struct slb_shadow *slb_shadow;
 
diff --git a/arch/powerpc/kvm/book3s_64_slb.S b/arch/powerpc/kvm/book3s_64_slb.S
index 84c52c6..3589c4e 100644
--- a/arch/powerpc/kvm/book3s_64_slb.S
+++ b/arch/powerpc/kvm/book3s_64_slb.S
@@ -17,29 +17,9 @@
  * Authors: Alexander Graf ag...@suse.de
  */
 
-#define SHADOW_SLB_ESID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10))
-#define SHADOW_SLB_VSID(num)   (SLBSHADOW_SAVEAREA + (num * 0x10) + 0x8)
-#define UNBOLT_SLB_ENTRY(num) \
-   li  r11, SHADOW_SLB_ESID(num);  \
-   LDX_BE  r9, r12, r11;   \
-   /* Invalid? Skip. */;   \
-   rldicl. r0, r9, 37, 63; \
-   beq slb_entry_skip_ ## num; \
-   xoris   r9, r9, SLB_ESID_V@h;   \
-   STDX_BE r9, r12, r11;   \
-  slb_entry_skip_ ## num:
-
-#define REBOLT_SLB_ENTRY(num) \
-   li  r8, SHADOW_SLB_ESID(num);   \
-   li  r7, SHADOW_SLB_VSID(num);   \
-   LDX_BE  r10, r11, r8;   \
-   cmpdi   r10, 0; \
-   beq slb_exit_skip_ ## num;  \
-   orisr10, r10, SLB_ESID_V@h; \
-   LDX_BE  r9, r11, r7;\
-   slbmte  r9, r10;\
-   STDX_BE r10, r11, r8;   \
-slb_exit_skip_ ## num:
+#define SHADOW_SLB_ENTRY_LEN   0x10
+#define OFFSET_ESID(x) (SHADOW_SLB_ENTRY_LEN * x)
+#define OFFSET_VSID(x) ((SHADOW_SLB_ENTRY_LEN * x) + 8)
 
 /**
  **
@@ -63,20 +43,15 @@ slb_exit_skip_ ## num:
 * SVCPU[LR]  = guest LR
 */
 
-   /* Remove LPAR shadow entries */
+BEGIN_FW_FTR_SECTION
 
-#if SLB_NUM_BOLTED == 3
+   /* Declare SLB shadow as 0 entries big */
 
-   ld  r12, PACA_SLBSHADOWPTR(r13)
+   ld  r11, PACA_SLBSHADOWPTR(r13)
+   li  r8, 0
+   stb r8, 3(r11)
 
-   /* Remove bolted entries */
-   UNBOLT_SLB_ENTRY(0)
-   UNBOLT_SLB_ENTRY(1)
-   UNBOLT_SLB_ENTRY(2)
-   
-#else
-#error unknown number of bolted entries
-#endif
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR)
 
/* Flush SLB */
 
@@ -99,7 +74,7 @@ slb_loop_enter:
 
ld  r10, 0(r11)
 
-   rldicl. r0, r10, 37, 63
+   andis.  r9, r10, SLB_ESID_V@h
beq slb_loop_enter_skip
 
ld  r9, 8(r11)
@@ -136,24 +111,42 @@ slb_do_enter:
 *
 */
 
-   /* Restore bolted entries from the shadow and fix it along the way */
+   /* Remove all SLB entries that are in use. */
 
li  r0, r0
slbmte  r0, r0
slbia
-   isync
 
-#if SLB_NUM_BOLTED == 3
+   /* Restore bolted entries from the shadow */
 
ld  r11, PACA_SLBSHADOWPTR(r13)
 
-   REBOLT_SLB_ENTRY(0)
-   REBOLT_SLB_ENTRY(1)
-   REBOLT_SLB_ENTRY(2)
-   
-#else
-#error unknown number of bolted entries
-#endif
+BEGIN_FW_FTR_SECTION
+
+   /* Declare SLB shadow as SLB_NUM_BOLTED entries big */
+
+   li  r8, SLB_NUM_BOLTED
+   stb r8, 3(r11)
+
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_LPAR)
+
+   /* Manually load all entries from shadow SLB */
+
+   li  r8, SLBSHADOW_SAVEAREA
+   li  r7, 

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alex Williamson
On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
 On 28.05.14 02:57, Alex Williamson wrote:
  On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
  On 28.05.14 02:39, Alex Williamson wrote:
  On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
  On 27.05.14 20:15, Alex Williamson wrote:
  On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
  The patch adds new IOCTL commands for sPAPR VFIO container device
  to support EEH functionality for PCI devices, which have been passed
  through from host to somebody else via VFIO.
 
  Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
  ---
  Documentation/vfio.txt  | 92 
  -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 
  ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
  [...]
 
  +
  +  return ret;
  +}
  +
  static long tce_iommu_ioctl(void *iommu_data,
  unsigned int cmd, unsigned long arg)
  {
  @@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
 tce_iommu_disable(container);
 mutex_unlock(container-lock);
 return 0;
  +  case VFIO_EEH_PE_SET_OPTION:
  +  case VFIO_EEH_PE_GET_STATE:
  +  case VFIO_EEH_PE_RESET:
  +  case VFIO_EEH_PE_CONFIGURE:
  +  return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
  This is where it would have really made sense to have a single
  VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
  AlexG, are you really attached to splitting these out into separate
  ioctls?
  I don't see the problem. We need to forward 4 ioctls to a separate piece
  of code, so we forward 4 ioctls to a separate piece of code :). Putting
  them into one ioctl just moves the switch() into another function.
  And uses an extra 3 ioctl numbers and gives us extra things to update if
  we ever need to add more ioctls, etc.  ioctl numbers are an address
  space, how much address space do we really want to give to EEH?  It's
  not a big difference, but I don't think it's completely even either.
  Thanks,
  Yes, that's the point. I by far prefer to have you push back on anyone
  who introduces useless ioctls rather than have a separate EEH number
  space that people can just throw anything in they like ;).
  Well, I appreciate that, but having them as separate ioctls doesn't
  really prevent that either.  Any one of these 4 could be set to take a
  sub-option to extend and contort the EEH interface.  The only way to
  prevent that would be to avoid the argsz+flags hack that make the ioctl
  extendable.  Thanks,
 
 Sure, that's what patch review is about. I'm really more concerned about 
 whose court the number space is in - you or Gavin. If we're talking 
 about top level ioctls you will care a lot more.
 
 But I'm not religious about this. You're the VFIO maintainer, so it's 
 your call. I just personally cringe when I see an ioctl that gets an 
 opcode and a parameter argument where the parameter argument is a 
 union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate op?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alex Williamson
On Wed, 2014-05-28 at 10:55 +1000, Gavin Shan wrote:
 On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
 On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
  The patch adds new IOCTL commands for sPAPR VFIO container device
  to support EEH functionality for PCI devices, which have been passed
  through from host to somebody else via VFIO.
  
  Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
  ---
   Documentation/vfio.txt  | 92 
  -
   drivers/vfio/pci/Makefile   |  1 +
   drivers/vfio/pci/vfio_pci.c | 20 +---
   drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
   drivers/vfio/pci/vfio_pci_private.h |  5 ++
   drivers/vfio/vfio_iommu_spapr_tce.c | 85 
  ++
   include/uapi/linux/vfio.h   | 66 ++
   7 files changed, 308 insertions(+), 7 deletions(-)
   create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
  
  diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
  index b9ca023..d890fed 100644
  --- a/Documentation/vfio.txt
  +++ b/Documentation/vfio.txt
  @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented 
  in real mode which provides
   an excellent performance which has limitations such as inability to do
   locked pages accounting in real time.
   
  -So 3 additional ioctls have been added:
  +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an 
  I/O
  +subtree that can be treated as a unit for the purposes of partitioning and
  +error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
  +function of a multi-function IOA, or multiple IOAs (possibly including 
  switch
  +and bridge structures above the multiple IOAs). PPC64 guests detect PCI 
  errors
  +and recover from them via EEH RTAS services, which works on the basis of
  +additional ioctl commands.
  +
  +So 7 additional ioctls have been added:
   
 VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
 of the DMA window on the PCI bus.
  @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
   
 VFIO_IOMMU_DISABLE - disables the container.
   
  +  VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
  +  specified device. Also, it can be used to remove IO or DMA
  +  stopped state on the frozen PE.
  +
  +  VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
  +
  +  VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
  +  error recovering.
  +
  +  VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
  +  one of the major steps for error recoverying.
   
   The code flow from the example above should be slightly changed:
   
  @@ -346,6 +365,77 @@ The code flow from the example above should be 
  slightly changed:
 ioctl(container, VFIO_IOMMU_MAP_DMA, dma_map);
 .
   
  +Based on the initial example we have, the following piece of code could be
  +reference for EEH setup and error handling:
  +
  +  struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
  +  struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
  +  struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
  +  struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
  +
  +  
  +
  +  /* Get a file descriptor for the device */
  +  device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, :06:0d.0);
  +
  +  /* Enable the EEH functionality on the device */
  +  option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
  +  ioctl(container, VFIO_EEH_PE_SET_OPTION, option);
  +
  +  /* You're suggested to create additional data struct to represent
  +   * PE, and put child devices belonging to same IOMMU group to the
  +   * PE instance for later reference.
  +   */
  +
  +  /* Check the PE's state and make sure it's in functional state */
  +  ioctl(container, VFIO_EEH_PE_GET_STATE, state);
  +
  +  /* Save device's state. pci_save_state() would be good enough
  +   * as an example.
  +   */
  +
  +  /* Test and setup the device */
  +  ioctl(device, VFIO_DEVICE_GET_INFO, device_info);
  +
  +  
  +
  +  /* When 0xFF's returned from reading PCI config space or IO BARs
  +   * of the PCI device. Check the PE state to see if that has been
  +   * frozen.
  +   */
  +  ioctl(container, VFIO_EEH_PE_GET_STATE, state);
  +
  +  /* Waiting for pending PCI transactions to be completed and don't
  +   * produce any more PCI traffic from/to the affected PE until
  +   * recovery is finished.
  +   */
  +
  +  /* Enable IO for the affected PE and collect logs. Usually, the
  +   * standard part of PCI config space, AER registers are dumped
  +   * as logs for further analysis.
  +   */
  +  option.option = VFIO_EEH_PE_SET_OPT_IO;
  +  ioctl(container, VFIO_EEH_PE_SET_OPTION, option);
  +
  +  /* Issue PE reset */
  +  reset.option = VFIO_EEH_PE_RESET_HOT;
  +  ioctl(container, 

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt
On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:
 
 I will remove those address related macros in next revision because it's
 user-level bussiness, not related to host kernel any more.
 
 If the user is QEMU + guest, we need the address to identify the PE though PHB
 BUID could be used as same purpose to get PHB, which is one-to-one mapping 
 with
 IOMMU group on sPAPR platform. However, once the PE address is built and 
 returned
 to guest, guest will use the PE address as input parameter in subsequent RTAS
 calls.
 
 If the user is some kind of application who just uses the ioctl() without 
 supporting
 RTAS calls. We don't need care about PE address. 

I am a bit reluctant with that PE==PHB equation we seem to be introducing.

This isn't the case in HW. It's possible that this is how we handle VFIO *today*
in qemu but it doesn't have to be does it ?

It also paints us into a corner if we want to start implementing some kind of
emulated EEH for selected emulated devices and/or virtio.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf


On 28.05.14 18:17, Alex Williamson wrote:

On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:

On 28.05.14 02:57, Alex Williamson wrote:

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:

On 28.05.14 02:39, Alex Williamson wrote:

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:

On 27.05.14 20:15, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 Documentation/vfio.txt  | 92 
-
 drivers/vfio/pci/Makefile   |  1 +
 drivers/vfio/pci/vfio_pci.c | 20 +---
 drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
 drivers/vfio/pci/vfio_pci_private.h |  5 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
 include/uapi/linux/vfio.h   | 66 ++
 7 files changed, 308 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


+
+   return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(container-lock);
return 0;
+   case VFIO_EEH_PE_SET_OPTION:
+   case VFIO_EEH_PE_GET_STATE:
+   case VFIO_EEH_PE_RESET:
+   case VFIO_EEH_PE_CONFIGURE:
+   return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,

Sure, that's what patch review is about. I'm really more concerned about
whose court the number space is in - you or Gavin. If we're talking
about top level ioctls you will care a lot more.

But I'm not religious about this. You're the VFIO maintainer, so it's
your call. I just personally cringe when I see an ioctl that gets an
opcode and a parameter argument where the parameter argument is a
union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate op?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,


Yup, that looks well to me as well :)


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf


On 28.05.14 23:58, Benjamin Herrenschmidt wrote:

On Wed, 2014-05-28 at 22:49 +1000, Gavin Shan wrote:

I will remove those address related macros in next revision because it's
user-level bussiness, not related to host kernel any more.

If the user is QEMU + guest, we need the address to identify the PE though PHB
BUID could be used as same purpose to get PHB, which is one-to-one mapping with
IOMMU group on sPAPR platform. However, once the PE address is built and 
returned
to guest, guest will use the PE address as input parameter in subsequent RTAS
calls.

If the user is some kind of application who just uses the ioctl() without 
supporting
RTAS calls. We don't need care about PE address.

I am a bit reluctant with that PE==PHB equation we seem to be introducing.

This isn't the case in HW. It's possible that this is how we handle VFIO *today*
in qemu but it doesn't have to be does it ?


Right, but that's pure QEMU internals. From the VFIO kernel interface's 
point of view, a VFIO group is a PE context, as that's what gets IOMMU 
controlled together too.



It also paints us into a corner if we want to start implementing some kind of
emulated EEH for selected emulated devices and/or virtio.


I don't think so :). In QEMU the PHB emulation would have to notify the 
container (IOMMU emulation layer - PE) that a PE operation happened. 
It's that emulation code's responsibility to broadcast operations across 
its own emulated operations (recover config space access, reconfigure 
BARs, etc) and the VFIO PE operations.


So from a kernel interface point of view, I think leaving out any 
address information is the right way to go. Whether we managed to get 
all QEMU internal interfaces modeled correctly yet has to be seen on the 
next patch set revision :).



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan
On Wed, May 28, 2014 at 03:12:35PM +0200, Alexander Graf wrote:

On 28.05.14 14:49, Gavin Shan wrote:
On Wed, May 28, 2014 at 01:41:35PM +0200, Alexander Graf wrote:
On 28.05.14 02:55, Gavin Shan wrote:
On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
  Documentation/vfio.txt  | 92 
 -
  drivers/vfio/pci/Makefile   |  1 +
  drivers/vfio/pci/vfio_pci.c | 20 +---
  drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
  drivers/vfio/pci/vfio_pci_private.h |  5 ++
  drivers/vfio/vfio_iommu_spapr_tce.c | 85 
 ++
  include/uapi/linux/vfio.h   | 66 ++
  7 files changed, 308 insertions(+), 7 deletions(-)
  create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
[...]

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..c5fac36 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,72 @@ struct vfio_iommu_spapr_tce_info {
  #define VFIO_IOMMU_SPAPR_TCE_GET_INFO   _IO(VFIO_TYPE, VFIO_BASE + 12)
+/*
+ * EEH functionality can be enabled or disabled on one specific device.
+ * Also, the DMA or IO frozen state can be removed from the frozen PE
+ * if required.
+ */
+struct vfio_eeh_pe_set_option {
+ __u32 argsz;
+ __u32 flags;
+ __u32 option;
+#define VFIO_EEH_PE_SET_OPT_DISABLE  0   /* Disable EEH  */
+#define VFIO_EEH_PE_SET_OPT_ENABLE   1   /* Enable EEH   */
+#define VFIO_EEH_PE_SET_OPT_IO   2   /* Enable IO*/
+#define VFIO_EEH_PE_SET_OPT_DMA  3   /* Enable DMA   */
This is more of a command than an option isn't it?  Each of these
probably needs a more significant description.

Yeah, it would be regarded as opcode and I'll add more description about
them in next revision.
Please just call them commands.

Ok. I guess you want me to change the macro names like this ?

#define VFIO_EEH_CMD_DISABLE  0   /* Disable EEH functionality*/
#define VFIO_EEH_CMD_ENABLE   1   /* Enable EEH functionality */
#define VFIO_EEH_CMD_ENABLE_IO2   /* Enable IO for frozen PE  
*/
#define VFIO_EEH_CMD_ENABLE_DMA   3   /* Enable DMA for frozen PE 
*/

Yes, the ioctl name too.


Ok. Thanks. I will also to rename those option / command related macros
to VFIO_EEH_CMD_ in next revision.


+};
+
+#define VFIO_EEH_PE_SET_OPTION   _IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * Each EEH PE should have unique address to be identified. PE's
+ * sharing mode is also useful information as well.
+ */
+#define VFIO_EEH_PE_GET_ADDRESS  0   /* Get address  */
+#define VFIO_EEH_PE_GET_MODE 1   /* Query mode   */
+#define VFIO_EEH_PE_MODE_NONE0   /* Not a PE */
+#define VFIO_EEH_PE_MODE_NOT_SHARED  1   /* Exclusive*/
+#define VFIO_EEH_PE_MODE_SHARED  2   /* Shared mode  */
+
+/*
+ * EEH PE might have been frozen because of PCI errors. Also, it might
+ * be experiencing reset for error revoery. The following command helps
+ * to get the state.
+ */
+struct vfio_eeh_pe_get_state {
+ __u32 argsz;
+ __u32 flags;
+ __u32 state;
+};
Should state be a union to better describe the value returned?  What
exactly is the address and why does the user need to know it?  Does this
need user input or could we just return the address and mode regardless?

Ok. I think you want enum (not union) for state. I'll have macros for the
state in next revision as I did that for other cases.

Those macros defined for address just for ABI stuff as Alex.G mentioned.
There isn't corresponding ioctl command for host to get address any more
because QEMU (user) will have to figure it out by himself. The address
here means PE address and user has to figure it out according to PE
segmentation.
Why would the user ever need the address?

I will remove those address related macros in next revision because it's
user-level bussiness, not related to host kernel any more.

Ok, so the next question is whether there will be any state outside
of GET_MODE left in the future.


That's also user-level business and those macros should be removed as well :-)

Thanks,
Gavin

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan
On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:
On 28.05.14 18:17, Alex Williamson wrote:
On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:
On 28.05.14 02:57, Alex Williamson wrote:
On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:
On 28.05.14 02:39, Alex Williamson wrote:
On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:
On 27.05.14 20:15, Alex Williamson wrote:
On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 Documentation/vfio.txt  | 92 
 -
 drivers/vfio/pci/Makefile   |  1 +
 drivers/vfio/pci/vfio_pci.c | 20 +---
 drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
 drivers/vfio/pci/vfio_pci_private.h |  5 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 85 
 ++
 include/uapi/linux/vfio.h   | 66 
 ++
 7 files changed, 308 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
[...]

+
+  return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
unsigned int cmd, unsigned long arg)
 {
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
   tce_iommu_disable(container);
   mutex_unlock(container-lock);
   return 0;
+  case VFIO_EEH_PE_SET_OPTION:
+  case VFIO_EEH_PE_GET_STATE:
+  case VFIO_EEH_PE_RESET:
+  case VFIO_EEH_PE_CONFIGURE:
+  return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);
This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?
I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.
And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,
Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).
Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,
Sure, that's what patch review is about. I'm really more concerned about
whose court the number space is in - you or Gavin. If we're talking
about top level ioctls you will care a lot more.

But I'm not religious about this. You're the VFIO maintainer, so it's
your call. I just personally cringe when I see an ioctl that gets an
opcode and a parameter argument where the parameter argument is a
union with one struct for each opcode.
Well, what would it look like...

struct vfio_eeh_pe_op {
  __u32 argsz;
  __u32 flags;
  __u32 op;
};

Couldn't every single one of these be a separate op?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,

Yup, that looks well to me as well :)


s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.

I'll include this in next revision. Thanks, Alex.

Thanks,
Gavin

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Alexander Graf


On 29.05.14 01:37, Gavin Shan wrote:

On Thu, May 29, 2014 at 12:40:26AM +0200, Alexander Graf wrote:

On 28.05.14 18:17, Alex Williamson wrote:

On Wed, 2014-05-28 at 13:37 +0200, Alexander Graf wrote:

On 28.05.14 02:57, Alex Williamson wrote:

On Wed, 2014-05-28 at 02:44 +0200, Alexander Graf wrote:

On 28.05.14 02:39, Alex Williamson wrote:

On Wed, 2014-05-28 at 00:49 +0200, Alexander Graf wrote:

On 27.05.14 20:15, Alex Williamson wrote:

On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:

The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
---
 Documentation/vfio.txt  | 92 
-
 drivers/vfio/pci/Makefile   |  1 +
 drivers/vfio/pci/vfio_pci.c | 20 +---
 drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
 drivers/vfio/pci/vfio_pci_private.h |  5 ++
 drivers/vfio/vfio_iommu_spapr_tce.c | 85 ++
 include/uapi/linux/vfio.h   | 66 ++
 7 files changed, 308 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c

[...]


+
+   return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
@@ -283,6 +363,11 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(container-lock);
return 0;
+   case VFIO_EEH_PE_SET_OPTION:
+   case VFIO_EEH_PE_GET_STATE:
+   case VFIO_EEH_PE_RESET:
+   case VFIO_EEH_PE_CONFIGURE:
+   return tce_iommu_eeh_ioctl(iommu_data, cmd, arg);

This is where it would have really made sense to have a single
VFIO_EEH_OP ioctl with a data structure passed to indicate the sub-op.
AlexG, are you really attached to splitting these out into separate
ioctls?

I don't see the problem. We need to forward 4 ioctls to a separate piece
of code, so we forward 4 ioctls to a separate piece of code :). Putting
them into one ioctl just moves the switch() into another function.

And uses an extra 3 ioctl numbers and gives us extra things to update if
we ever need to add more ioctls, etc.  ioctl numbers are an address
space, how much address space do we really want to give to EEH?  It's
not a big difference, but I don't think it's completely even either.
Thanks,

Yes, that's the point. I by far prefer to have you push back on anyone
who introduces useless ioctls rather than have a separate EEH number
space that people can just throw anything in they like ;).

Well, I appreciate that, but having them as separate ioctls doesn't
really prevent that either.  Any one of these 4 could be set to take a
sub-option to extend and contort the EEH interface.  The only way to
prevent that would be to avoid the argsz+flags hack that make the ioctl
extendable.  Thanks,

Sure, that's what patch review is about. I'm really more concerned about
whose court the number space is in - you or Gavin. If we're talking
about top level ioctls you will care a lot more.

But I'm not religious about this. You're the VFIO maintainer, so it's
your call. I just personally cringe when I see an ioctl that gets an
opcode and a parameter argument where the parameter argument is a
union with one struct for each opcode.

Well, what would it look like...

struct vfio_eeh_pe_op {
__u32 argsz;
__u32 flags;
__u32 op;
};

Couldn't every single one of these be a separate op?  Are there any
cases where we can't use the ioctl return value?

VFIO_EEH_PE_DISABLE
VFIO_EEH_PE_ENABLE
VFIO_EEH_PE_UNFREEZE_IO
VFIO_EEH_PE_UNFREEZE_DMA
VFIO_EEH_PE_GET_MODE
VFIO_EEH_PE_RESET_DEACTIVATE
VFIO_EEH_PE_RESET_HOT
VFIO_EEH_PE_RESET_FUNDAMENTAL
VFIO_EEH_PE_CONFIGURE

It doesn't look that bad to me, what am I missing?  Thanks,

Yup, that looks well to me as well :)


s/VFIO_EEH_PE_GET_MODE/VFIO_EEH_PE_GET_STATE.

I'll include this in next revision. Thanks, Alex.


Yup, no need for CMD anymore then either :)


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Gavin Shan
On Wed, May 28, 2014 at 10:32:11AM -0600, Alex Williamson wrote:
On Wed, 2014-05-28 at 10:55 +1000, Gavin Shan wrote:
 On Tue, May 27, 2014 at 12:15:27PM -0600, Alex Williamson wrote:
 On Tue, 2014-05-27 at 18:40 +1000, Gavin Shan wrote:
  The patch adds new IOCTL commands for sPAPR VFIO container device
  to support EEH functionality for PCI devices, which have been passed
  through from host to somebody else via VFIO.
  
  Signed-off-by: Gavin Shan gws...@linux.vnet.ibm.com
  ---
   Documentation/vfio.txt  | 92 
  -
   drivers/vfio/pci/Makefile   |  1 +
   drivers/vfio/pci/vfio_pci.c | 20 +---
   drivers/vfio/pci/vfio_pci_eeh.c | 46 +++
   drivers/vfio/pci/vfio_pci_private.h |  5 ++
   drivers/vfio/vfio_iommu_spapr_tce.c | 85 
  ++
   include/uapi/linux/vfio.h   | 66 ++
   7 files changed, 308 insertions(+), 7 deletions(-)
   create mode 100644 drivers/vfio/pci/vfio_pci_eeh.c
  
  diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
  index b9ca023..d890fed 100644
  --- a/Documentation/vfio.txt
  +++ b/Documentation/vfio.txt
  @@ -305,7 +305,15 @@ faster, the map/unmap handling has been implemented 
  in real mode which provides
   an excellent performance which has limitations such as inability to do
   locked pages accounting in real time.
   
  -So 3 additional ioctls have been added:
  +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an 
  I/O
  +subtree that can be treated as a unit for the purposes of partitioning 
  and
  +error recovery. A PE may be a single or multi-function IOA (IO Adapter), 
  a
  +function of a multi-function IOA, or multiple IOAs (possibly including 
  switch
  +and bridge structures above the multiple IOAs). PPC64 guests detect PCI 
  errors
  +and recover from them via EEH RTAS services, which works on the basis of
  +additional ioctl commands.
  +
  +So 7 additional ioctls have been added:
   
VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
of the DMA window on the PCI bus.
  @@ -316,6 +324,17 @@ So 3 additional ioctls have been added:
   
VFIO_IOMMU_DISABLE - disables the container.
   
  + VFIO_EEH_PE_SET_OPTION - enables or disables EEH functionality on the
  + specified device. Also, it can be used to remove IO or DMA
  + stopped state on the frozen PE.
  +
  + VFIO_EEH_PE_GET_STATE - retrieve PE's state: frozen or normal state.
  +
  + VFIO_EEH_PE_RESET - do PE reset, which is one of the major steps for
  + error recovering.
  +
  + VFIO_EEH_PE_CONFIGURE - configure the PCI bridges after PE reset. It's
  + one of the major steps for error recoverying.
   
   The code flow from the example above should be slightly changed:
   
  @@ -346,6 +365,77 @@ The code flow from the example above should be 
  slightly changed:
ioctl(container, VFIO_IOMMU_MAP_DMA, dma_map);
.
   
  +Based on the initial example we have, the following piece of code could 
  be
  +reference for EEH setup and error handling:
  +
  + struct vfio_eeh_pe_set_option option = { .argsz = sizeof(option) };
  + struct vfio_eeh_pe_get_state state = { .argsz = sizeof(state) };
  + struct vfio_eeh_pe_reset reset = { .argsz = sizeof(reset) };
  + struct vfio_eeh_pe_configure configure = { .argsz = sizeof(configure) };
  +
  + 
  +
  + /* Get a file descriptor for the device */
  + device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, :06:0d.0);
  +
  + /* Enable the EEH functionality on the device */
  + option.option = VFIO_EEH_PE_SET_OPT_ENABLE;
  + ioctl(container, VFIO_EEH_PE_SET_OPTION, option);
  +
  + /* You're suggested to create additional data struct to represent
  +  * PE, and put child devices belonging to same IOMMU group to the
  +  * PE instance for later reference.
  +  */
  +
  + /* Check the PE's state and make sure it's in functional state */
  + ioctl(container, VFIO_EEH_PE_GET_STATE, state);
  +
  + /* Save device's state. pci_save_state() would be good enough
  +  * as an example.
  +  */
  +
  + /* Test and setup the device */
  + ioctl(device, VFIO_DEVICE_GET_INFO, device_info);
  +
  + 
  +
  + /* When 0xFF's returned from reading PCI config space or IO BARs
  +  * of the PCI device. Check the PE state to see if that has been
  +  * frozen.
  +  */
  + ioctl(container, VFIO_EEH_PE_GET_STATE, state);
  +
  + /* Waiting for pending PCI transactions to be completed and don't
  +  * produce any more PCI traffic from/to the affected PE until
  +  * recovery is finished.
  +  */
  +
  + /* Enable IO for the affected PE and collect logs. Usually, the
  +  * standard part of PCI config space, AER registers are dumped
  +  * as logs for further analysis.
  +  */
  + option.option = VFIO_EEH_PE_SET_OPT_IO;
  + ioctl(container, VFIO_EEH_PE_SET_OPTION, option);
  +
  + /* Issue PE reset */
  + reset.option = VFIO_EEH_PE_RESET_HOT;
  + 

Re: [PATCH v7 3/3] drivers/vfio: EEH support for VFIO PCI device

2014-05-28 Thread Benjamin Herrenschmidt
On Thu, 2014-05-29 at 10:05 +1000, Gavin Shan wrote:
 The log stuff is TBD and I'll figure it out later.
 
 About to what are the errors, there are a lot. Most of them are related
 to hardware level, for example unstable PCI link. Usually, those error
 bits defined in AER fatal error state register contribute to EEH errors.
 It could be software related, e.g. violating IOMMU protection (read/write
 permission etc), or even one PCI device isn't capable of DMAing. Hopefully,
 it's the explaination you're looking for? :-)

Note to Alex('s) ...

The log we get from FW at the moment in the host is:

  - In the case of pHyp / RTAS host, opaque. Basically it's a blob that we store
and that can be sent to IBM service people :-) Not terribly practical.

  - On PowerNV, it's a IODA specific data structure (basically a dump of a 
bunch of register state and tables). IODA is our IO architecture (sadly the
document itself isn't public at this point) and we have two versions, IODA1
and IODA2. You can consider the structure as chipset specific basically.

What I want to do in the long run is:

  - In the case of pHyp/RTAS host, there's not much we can do, so basically
forward that log as-is.

  - In the case of PowerNV, forward the log *and* add a well-defined blob to
it that does some basic interpretation of it. In fact I want to do the latter
more generally in the host kernel for host kernel errors as well, but we
can forward it to the guest via VFIO too. What I mean by interpretation is
something along the lines of an error type DMA IOMMU fault, MMIO error,
Link loss, PCIe UE, ... among a list of well known error types that
represent the most common ones, with a little bit of added info when
available (for most DMA errors we can provide the DMA address that faulted
for example).

So Gavin and I need to work a bit on that, both in the context of the host
kernel to improve the reporting there, and in the context of what we pass to
user space.

However, no driver today cares about that information. The PCIe error recovery
system doesn't carry it and it has no impact on the EEH recovery procedures,
so EEH in that sense is useful without that reporting. It's useful for the
programmer (or user/admin) to identify what went wrong but it's not used by
the automated recovery process.

One last thing to look at is in the case of a VFIO device, we might want to
silence the host kernel printf's once we support guest EEH since otherwise
the guest has a path to flood the host kernel log by triggering a lot of
EEH errors purposefully.

Cheers,
Ben.


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] KVM: PPC: Book3S: Controls for in-kernel PAPR hypercall handling

2014-05-28 Thread Paul Mackerras
On Wed, May 28, 2014 at 03:27:32PM +0200, Alexander Graf wrote:
 
 On 26.05.14 14:17, Paul Mackerras wrote:
 +6.8 KVM_CAP_PPC_ENABLE_HCALL
 +
 +Architectures: ppc
 +Parameters: args[0] is the PAPR hcall number
 +args[1] is 0 to disable, 1 to enable in-kernel handling
 +
 +This capability controls whether individual PAPR hypercalls (hcalls)
 +get handled by the kernel or not.  Enabling or disabling in-kernel
 +handling of an hcall is effective across the VM.  On creation, an
 
 Hrm. Could we move the CAP to vm level then?

You mean, define a VM ioctl instead of using a capability?  Or are you
suggesting something else?

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling

2014-05-28 Thread Michael Neuling
Alex,

  +static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, unsigned long mflags,
  +unsigned long resource, unsigned long value1,
  +unsigned long value2)
  +{
  +   switch (resource) {
  +   case H_SET_MODE_RESOURCE_SET_CIABR:
  +   if (!kvmppc_power8_compatible(vcpu))
  +   return H_P2;
  +   if (value2)
  +   return H_P4;
  +   if (mflags)
  +   return H_UNSUPPORTED_FLAG_START;
  +   if ((value1  0x3) == 0x3)
 
 What is this?

It's what it says in PAPR (I wish that was public!!!).  Joking aside... 

If you refer to the 2.07 HW arch (not PAPR), the bottom two bits of the
CIABR tell you what mode to match in.  0x3 means match in hypervisor,
which we obviously don't want the guest to be able to do.

I'll add some #defines to make it a clearer and repost.

Mikey

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html