Re: [PATCH kernel v11 20/34] powerpc/powernv/ioda2: Move TCE kill register address to PE

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:44PM +1000, Alexey Kardashevskiy wrote:
 At the moment the DMA setup code looks for the ibm,opal-tce-kill
 property which contains the TCE kill register address. Writing to
 this register invalidates TCE cache on IODA/IODA2 hub.
 
 This moves the register address from iommu_table to pnv_phb as this
 register belongs to PHB and invalidates TCE cache for all tables of
 all attached PEs.
 
 This moves the property reading/remapping code to a helper which is
 called when DMA is being configured for PE and which does DMA setup
 for both IODA1 and IODA2.
 
 This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
 invalidates cache for the entire table. It should be called after
 every call to opal_pci_map_pe_dma_window(). It was not required before
 because there was just a single TCE table and 64bit DMA was handled via
 bypass window (which has no table so no cache was used) but this is going
 to change with Dynamic DMA windows (DDW).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
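
(For illustration, a minimal sketch of the call pattern described above;
the OPAL arguments are elided and the names are taken from the patch:)

	rc = opal_pci_map_pe_dma_window(/* phb id, PE#, window id, levels,
					   table address, size, page size */);
	if (rc == OPAL_SUCCESS)
		pnv_pci_ioda2_tce_invalidate_entire(phb);
	/* flushing the PHB-wide TCE cache here prevents stale translations
	 * once DDW makes window (re)maps a runtime event */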

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 17/34] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:41PM +1000, Alexey Kardashevskiy wrote:
 Modern IBM POWERPC systems support multiple (currently two) TCE tables
 per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
 for TCE tables. Right now just one table is supported.
 
 For IODA, instead of embedding iommu_table, the new iommu_table_group
 keeps pointers to those. The iommu_table structs are allocated
 dynamically now by a pnv_pci_table_alloc() helper as PCI hotplug
 code (for EEH recovery) and SRIOV are supported there.
 
 For P5IOC2, both iommu_table_group and iommu_table are embedded into
 PE struct. As there is no EEH and SRIOV support for P5IOC2,
 iommu_free_table() should not be called on iommu_table struct pointers
 so we can keep it embedded in pnv_phb::p5ioc2.
 
 For pSeries, this replaces multiple calls of kzalloc_node() with a new
 iommu_pseries_group_alloc() helper and stores the table group struct
 pointer into the pci_dn struct. For release, an iommu_table_group_free()
 helper is added.
 
 This moves iommu_table struct allocation from SR-IOV code to
 the generic DMA initialization code in pnv_pci_ioda2_setup_dma_pe.
 
 This replaces a single pointer to iommu_group with a list of
 iommu_table_group structs. For now it is just a single iommu_table_group
 in this list but later with TCE table sharing enabled, the list will
 keep all the IOMMU groups which use the particular table. The list
 uses iommu_table_group_link structs rather than iommu_table_group::next
 because a VFIO container may have 2 IOMMU tables, each of which will have
 its own list head pointer; the list is mainly for the TCE invalidation
 code, which should walk through all attached groups and invalidate the TCE
 cache, so the table has to keep the list head pointer. The other option would
 be storing list head in a VFIO container but it would not work as
 the platform code (which does TCE table update and invalidation) has
 no idea about VFIO.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

It looks like this commit message doesn't match the code - it seems
like an older or newer version of the message from the previous patch.

This patch seems instead to be about changing the table_group - table
relationship from 1:1 to many:many.
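
(For illustration only: with the many:many relationship, attaching a table
to a group amounts to allocating a link object - a sketch based on the
iommu_table_group_link struct quoted below:)

	struct iommu_table_group_link *tgl;

	tgl = kzalloc(sizeof(*tgl), GFP_KERNEL);
	if (!tgl)
		return -ENOMEM;
	tgl->table_group = table_group;		/* back-pointer to the group */
	list_add_rcu(&tgl->next, &tbl->it_group_list);	/* table -> groups */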

 ---
 Changes:
 v10:
 * iommu_table is not embedded into iommu_table_group but allocated
 dynamically
 * iommu_table allocation is moved to a single place for IODA2's
 pnv_pci_ioda_setup_dma_pe where it belongs to
 * added list of groups into iommu_table; most of the code just looks at
 the first item to keep the patch simpler
 
 v9:
 * s/it_group/it_table_group/
 * added and used iommu_table_group_free(), from now iommu_free_table()
 is only used for VIO
 * added iommu_pseries_group_alloc()
 * squashed powerpc/iommu: Introduce iommu_table_alloc() helper into this
 ---
  arch/powerpc/include/asm/iommu.h|   8 +-
  arch/powerpc/kernel/iommu.c |   9 +-
  arch/powerpc/platforms/powernv/pci-ioda.c   |  45 ++
  arch/powerpc/platforms/powernv/pci-p5ioc2.c |   3 +
  arch/powerpc/platforms/powernv/pci.c|  76 +
  arch/powerpc/platforms/powernv/pci.h|   7 ++
  arch/powerpc/platforms/pseries/iommu.c  |  33 +++-
  drivers/vfio/vfio_iommu_spapr_tce.c | 122 
 
  8 files changed, 242 insertions(+), 61 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 5a7267f..44a20cc 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -91,7 +91,7 @@ struct iommu_table {
   struct iommu_pool pools[IOMMU_NR_POOLS];
   unsigned long *it_map;   /* A simple allocation bitmap for now */
   unsigned long  it_page_shift;/* table iommu page size */
 - struct iommu_table_group *it_table_group;
 + struct list_head it_group_list;/* List of iommu_table_group_link */
   struct iommu_table_ops *it_ops;
   void (*set_bypass)(struct iommu_table *tbl, bool enable);
  };
 @@ -126,6 +126,12 @@ extern struct iommu_table *iommu_init_table(struct 
 iommu_table * tbl,
   int nid);
  #define IOMMU_TABLE_GROUP_MAX_TABLES 1
  
 +struct iommu_table_group_link {
 + struct list_head next;
 + struct rcu_head rcu;
 + struct iommu_table_group *table_group;
 +};
 +
  struct iommu_table_group {
   struct iommu_group *group;
   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
 diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
 index 719f048..e305a8f 100644
 --- a/arch/powerpc/kernel/iommu.c
 +++ b/arch/powerpc/kernel/iommu.c
 @@ -1078,6 +1078,7 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership);
  int iommu_add_device(struct device 

Re: [PATCH kernel v11 18/34] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:42PM +1000, Alexey Kardashevskiy wrote:
 This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
 which call in a loop iommu_take_ownership()/iommu_release_ownership()
 for every table on the group. As there is just one now, no change in
 behaviour is expected.
 
 At the moment the iommu_table struct has a set_bypass() which enables/
 disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
 which calls this callback when external IOMMU users such as VFIO are
 about to take over a PHB.
 
 The set_bypass() callback is not really an iommu_table function but
 IOMMU/PE function. This introduces a iommu_table_group_ops struct and
 adds take_ownership()/release_ownership() callbacks to it which are
 called when an external user takes/releases control over the IOMMU.
 
 This replaces set_bypass() with ownership callbacks as it is not
 necessarily just bypass enabling; it can be something else or more,
 so let's give it a more generic name.
 
 The callback is implemented for IODA2 only. Other platforms (P5IOC2,
 IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
 The following patches will replace iommu_take_ownership/
 iommu_release_ownership calls in IODA2 with full IOMMU table release/
 create.
 
 Since we are here touching bypass control anyway, this removes
 pnv_pci_ioda2_setup_bypass_pe() as it does not do much
 beyond pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
 initialization to pnv_pci_ioda2_setup_dma_pe.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 21/34] powerpc/powernv/ioda2: Add TCE invalidation for all attached groups

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:45PM +1000, Alexey Kardashevskiy wrote:
 The iommu_table struct keeps a list of IOMMU groups it is used for.
 At the moment there is just a single group attached but further
 patches will add TCE table sharing. When sharing is enabled, the TCE cache
 in each PE needs to be invalidated, which is what this patch does.
 
 This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
 to enable TCE table sharing on PHBs older than IODA2.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
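
(Sketch of the walk the message describes - illustrative only; the per-PE
invalidation helper name here is hypothetical:)

	struct iommu_table_group_link *tgl;

	list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
		struct pnv_ioda_pe *pe = container_of(tgl->table_group,
				struct pnv_ioda_pe, table_group);

		pnv_pci_ioda2_tce_invalidate_pe(pe);	/* hypothetical */
	}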

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 23/34] powerpc/iommu/powernv: Release replaced TCE

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:47PM +1000, Alexey Kardashevskiy wrote:
 At the moment writing new TCE value to the IOMMU table fails with EBUSY
 if there is a valid entry already. However PAPR specification allows
 the guest to write new TCE value without clearing it first.
 
 Another problem this patch is addressing is the use of pool locks for
 external IOMMU users such as VFIO. The pool locks are to protect
 DMA page allocator rather than entries and since the host kernel does
 not control what pages are in use, there is no point in pool locks and
 exchange()+put_page(oldtce) is sufficient to avoid possible races.
 
 This adds an exchange() callback to iommu_table_ops which does the same
 thing as set() plus it returns replaced TCE and DMA direction so
 the caller can release the pages afterwards. The exchange() receives
 a physical address unlike set() which receives linear mapping address;
 and returns a physical address as the clear() does.
 
 This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
 for a platform to have exchange() implemented in order to support VFIO.
 
 This replaces iommu_tce_build() and iommu_clear_tce() with
 a single iommu_tce_xchg().
 
 This makes sure that TCE permission bits are not set in TCE passed to
 IOMMU API as those are to be calculated by platform code from
 DMA direction.
 
 This moves SetPageDirty() to the IOMMU code to make it work for both
 the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
 available later).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
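
(A minimal sketch of the exchange() usage described above - illustrative;
the write-direction check before dirtying is omitted for brevity:)

	unsigned long hpa = new_hpa;	/* in: new TCE, out: replaced TCE */
	enum dma_data_direction dir = DMA_BIDIRECTIONAL;

	if (!iommu_tce_xchg(tbl, entry, &hpa, &dir) && dir != DMA_NONE) {
		struct page *pg = pfn_to_page(hpa >> PAGE_SHIFT);

		SetPageDirty(pg);	/* dirtying now lives in IOMMU code */
		put_page(pg);		/* release the replaced page */
	}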

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Paolo Bonzini


On 01/06/2015 13:42, Christian Borntraeger wrote:
 Am 01.06.2015 um 13:34 schrieb Paolo Bonzini:


 On 01/06/2015 09:47, Christian Borntraeger wrote:

 1: disable, guest, disable again and save, restore to disable, 
 enable
 and now it is
 2: disable, guest, enable
 and with your patch it is
 3: disable, guest, enable, disable, enable

 I assume that 3 and 1 are similar in its costs, so this is probably ok.

 At least on x86, 3 and 2 are similar, but 3 is much more expensive than
 1!  See https://lkml.org/lkml/2015/5/5/835:
 
 That does not make sense. If 3 and 2 are similar, then 2 must be much more
 expensive than 1 as well. As 2 is a strict subset of 1 it must be cheaper, no?

Yes, it must.  I meant 3 is much cheaper than 1.

Paolo

 Cost of: CLI              insn   same-IF :  0 cycles
 Cost of: CLI              insn   flip-IF :  0 cycles
 Cost of: STI              insn   same-IF :  0 cycles
 Cost of: STI              insn   flip-IF :  0 cycles
 Cost of: PUSHF            insn           :  0 cycles
 Cost of: POPF             insn   same-IF : 20 cycles
 Cost of: POPF             insn   flip-IF : 28 cycles
 Cost of: local_irq_save()    fn          : 20 cycles
 Cost of: local_irq_restore() fn  same-IF : 24 cycles
 Cost of: local_irq_restore() fn  flip-IF : 28 cycles
 Cost of: irq_save()+restore() fn same-IF : 48 cycles
 Cost of: irq_save()+restore() fn flip-IF : 48 cycles
 
 Yes, it's similar on s390. local_irq_save/restore is noticeable in guest exit
 hot loops (that's what inspired my patch), but a simple irq disable is
 just a single cycle, pipelined. Given aggressive out-of-order
 designs with all the architectural ordering, this makes sense.
 
 Christian
 


Re: [PATCH] treewide: Fix typo compatability - compatibility

2015-06-01 Thread Alexander Graf


On 27.05.15 14:05, Laurent Pinchart wrote:
 Even though 'compatability' has a dedicated entry in the Wiktionary,
 it's listed as 'Misspelling of compatibility'. Fix it.
 
 Signed-off-by: Laurent Pinchart laurent.pinch...@ideasonboard.com
 ---
  arch/metag/include/asm/elf.h | 2 +-


  arch/powerpc/kvm/book3s.c| 2 +-

Acked-by: Alexander Graf ag...@suse.de

for the PPC KVM bit.


Alex


Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Christian Borntraeger
Am 01.06.2015 um 13:34 schrieb Paolo Bonzini:
 
 
 On 01/06/2015 09:47, Christian Borntraeger wrote:

 1: disable, guest, disable again and save, restore to disable, 
 enable
 and now it is
 2: disable, guest, enable
 and with your patch it is
 3: disable, guest, enable, disable, enable

 I assume that 3 and 1 are similar in its costs, so this is probably ok.
 
 At least on x86, 3 and 2 are similar, but 3 is much more expensive than
 1!  See https://lkml.org/lkml/2015/5/5/835:

That does not make sense. If 3 and 2 are similar, then 2 must be much more
expensive than 1 as well. As 2 is a strict subset of 1 it must be cheaper, no?


 
  Cost of: CLI              insn   same-IF :  0 cycles
  Cost of: CLI              insn   flip-IF :  0 cycles
  Cost of: STI              insn   same-IF :  0 cycles
  Cost of: STI              insn   flip-IF :  0 cycles
  Cost of: PUSHF            insn           :  0 cycles
  Cost of: POPF             insn   same-IF : 20 cycles
  Cost of: POPF             insn   flip-IF : 28 cycles
  Cost of: local_irq_save()    fn          : 20 cycles
  Cost of: local_irq_restore() fn  same-IF : 24 cycles
  Cost of: local_irq_restore() fn  flip-IF : 28 cycles
  Cost of: irq_save()+restore() fn same-IF : 48 cycles
  Cost of: irq_save()+restore() fn flip-IF : 48 cycles

Yes, it's similar on s390. local_irq_save/restore is noticeable in guest exit
hot loops (that's what inspired my patch), but a simple irq disable is
just a single cycle, pipelined. Given aggressive out-of-order
designs with all the architectural ordering, this makes sense.

Christian



Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Paolo Bonzini


On 01/06/2015 09:47, Christian Borntraeger wrote:
 
 1: disable, guest, disable again and save, restore to disable, 
 enable
 and now it is
 2: disable, guest, enable
 and with your patch it is
 3: disable, guest, enable, disable, enable
 
 I assume that 3 and 1 are similar in its costs, so this is probably ok.

At least on x86, 3 and 2 are similar, but 3 is much more expensive than
1!  See https://lkml.org/lkml/2015/5/5/835:

Cost of: CLI              insn   same-IF :  0 cycles
Cost of: CLI              insn   flip-IF :  0 cycles
Cost of: STI              insn   same-IF :  0 cycles
Cost of: STI              insn   flip-IF :  0 cycles
Cost of: PUSHF            insn           :  0 cycles
Cost of: POPF             insn   same-IF : 20 cycles
Cost of: POPF             insn   flip-IF : 28 cycles
Cost of: local_irq_save()    fn          : 20 cycles
Cost of: local_irq_restore() fn  same-IF : 24 cycles
Cost of: local_irq_restore() fn  flip-IF : 28 cycles
Cost of: irq_save()+restore() fn same-IF : 48 cycles
Cost of: irq_save()+restore() fn flip-IF : 48 cycles

Paolo


Re: [PATCH v5 10/12] KVM: arm64: guest debug, HW assisted debug support

2015-06-01 Thread Alex Bennée

Will Deacon will.dea...@arm.com writes:

 Hi Alex,

 On Fri, May 29, 2015 at 10:30:26AM +0100, Alex Bennée wrote:
 This adds support for userspace to control the HW debug registers for
 guest debug. In the debug ioctl we copy the IMPDEF defined number of
 registers into a new register set called host_debug_state. There is now
 a new vcpu parameter called debug_ptr which selects which register set
 is to copied into the real registers when world switch occurs.
 
 I've moved some helper functions into the hw_breakpoint.h header for
 re-use.
 
 As with single step we need to tweak the guest registers to enable the
 exceptions so we need to save and restore those bits.
 
 Two new capabilities have been added to the KVM_EXTENSION ioctl to allow
 userspace to query the number of hardware break and watch points
 available on the host hardware.
 
 Signed-off-by: Alex Bennée alex.ben...@linaro.org

 [...]

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 33c8143..ada57df 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2668,7 +2668,7 @@ The top 16 bits of the control field are architecture 
 specific control
  flags which can include the following:
  
- KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64]
 -  - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390]
 +  - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, 
 arm64]
- KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
- KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
- KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
 @@ -2683,6 +2683,11 @@ updated to the correct (supplied) values.
  The second part of the structure is architecture specific and
  typically contains a set of debug registers.
  
 +For arm64 the number of debug registers is implementation defined and
 +can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and
 +KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number
 +indicating the number of supported registers.
 +
  When debug events exit the main run loop with the reason
  KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
  structure containing architecture specific debug information.
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 0d17c7b..6df47c1 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -307,6 +307,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  
  #define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE |\
  KVM_GUESTDBG_USE_SW_BP | \
 +KVM_GUESTDBG_USE_HW_BP | \
  KVM_GUESTDBG_SINGLESTEP)
  
  /**
 @@ -327,6 +328,12 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
 *vcpu,
  
  if (dbg->control & KVM_GUESTDBG_ENABLE) {
  vcpu->guest_debug = dbg->control;
  +
  +/* Hardware assisted Break and Watch points */
  +if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
  +vcpu->arch.external_debug_state = dbg->arch;
  +}
  +
  } else {
  /* If not enabled clear all flags */
  vcpu->guest_debug = 0;
 diff --git a/arch/arm64/include/asm/hw_breakpoint.h 
 b/arch/arm64/include/asm/hw_breakpoint.h
 index 52b484b..c450552 100644
 --- a/arch/arm64/include/asm/hw_breakpoint.h
 +++ b/arch/arm64/include/asm/hw_breakpoint.h
 @@ -130,6 +130,18 @@ static inline void ptrace_hw_copy_thread(struct 
 task_struct *task)
  }
  #endif
  
 +/* Determine number of BRP registers available. */
 +static inline int get_num_brps(void)
 +{
  +return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1;
 +}
 +
 +/* Determine number of WRP registers available. */
 +static inline int get_num_wrps(void)
 +{
  +return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1;
 +}

 I'm fine with moving these into the header file, but we should probably
 revisit this at some point so that we use a shadow value to indicate the
 minimum number of registers across the system for e.g. big.LITTLE platforms
 that don't have uniform debug resources.

There is work going forward in QEMU to specify this limitation when
creating vcpus. At the moment the kernel sanity checks these are all the
same on boot up but I guess we will have to decide where the smarts will
end up. Do we:

  - report the real number per real-cpu (QEMU figures out vcpu & config)

  or

  - report the lowest common denominator
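
For the userspace side, querying the counts goes through the standard
capability mechanism (illustrative sketch, using the capability names from
the patch):

	int nbps = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_DEBUG_HW_BPS);
	int nwps = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_DEBUG_HW_WPS);

	/* positive return values = number of supported break/watchpoints */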
  


  extern struct pmu perf_ops_bp;
  
  #endif  /* __KERNEL__ */
 diff --git a/arch/arm64/include/asm/kvm_host.h 
 b/arch/arm64/include/asm/kvm_host.h
 index e5040b6..498d4f7 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -113,12 +113,13 @@ struct kvm_vcpu_arch {
  
  /*
   * For debugging the guest we need to keep a set of debug
 - * registers which can override the guests own debug state
 + * registers which can 

Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Christian Borntraeger
Am 28.05.2015 um 20:49 schrieb Christoffer Dall:
 Until now we have been calling kvm_guest_exit after re-enabling
 interrupts when we come back from the guest, but this has the
 unfortunate effect that CPU time accounting done in the context of timer
 interrupts occurring while the guest is running doesn't properly notice
 that the time since the last tick was spent in the guest.

Can you verify that a CPU bound guest has almost zero guest time?
Assuming that your answer is yes, your patch makes sense, as host
timer interrupts should then be the only reason for guest exits.


 Inspired by the comment in the x86 code, move the kvm_guest_exit() call
 below the local_irq_enable() call and change __kvm_guest_exit() to
 kvm_guest_exit(), because we are now calling this function with
 interrupts enabled.  We have to now explicitly disable preemption and
 not enable preemption before we've called kvm_guest_exit(), since
 otherwise we could be preempted and everything happening before we
 eventually get scheduled again would be accounted for as guest time.
 
 At the same time, move the trace_kvm_exit() call outside of the atomic
 section, since there is no reason for us to do that with interrupts
 disabled.
 
 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
 rework recently posted by Christian Borntraeger.  I hope I got the logic
 of this right, there were 2 slightly worrying facts about this:
 
 First, we now enable and disable and enable interrupts on each exit
 path, but I couldn't see any performance overhead on hackbench - yes the
 only benchmark we care about.

This should be somewhat similar to the situation before my patch.
There it was

1: disable, guest, disable again and save, restore to disable, enable
and now it is
2: disable, guest, enable
and with your patch it is
3: disable, guest, enable, disable, enable

I assume that 3 and 1 are similar in its costs, so this is probably ok.


 
 Second, looking at the ppc and mips code, they seem to also call
 kvm_guest_exit() before enabling interrupts, so I don't understand how
 guest CPU time accounting works on those architectures.

Not an expert here, but I assume mips has the same logic as arm, so if your
patch is right for arm it's probably also right for mips.

powerpc looks similar to what s390 does (it does not use the tick; instead it
uses a hw-timer), so this should be fine.
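
(For illustration, the ordering under discussion in rough pseudo-C - a
sketch of the description, not the patch itself:)

	preempt_disable();
	local_irq_disable();
	/* ... world switch into and out of the guest ... */
	local_irq_enable();	/* pending host timer ticks are taken here */
	kvm_guest_exit();	/* so tick accounting still sees guest context */
	/* trace_kvm_exit() can also run here, outside the irq-off section */
	preempt_enable();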




Re: [PATCH v2 09/11] KVM: arm: disable debug mode if we don't actually need it.

2015-06-01 Thread Will Deacon
On Sun, May 31, 2015 at 05:27:10AM +0100, Zhichao Huang wrote:
 Until now we enable debug mode all the time even if we don't
 actually need it.
 
 Inspired by the implementation in arm64, disable debug mode if
 we don't need it. And then we are able to reduce unnecessary
 registers saving/restoring when the debug mode is disabled.

I'm terrified about this patch. Enabling monitor mode has proven to be
*extremely* fragile in practice on 32-bit ARM SoCs, so trying to do this
more often makes me very nervous.

 Signed-off-by: Zhichao Huang zhichao.hu...@linaro.org
 ---
  arch/arm/kernel/hw_breakpoint.c | 55 
 ++---
  1 file changed, 46 insertions(+), 9 deletions(-)
 
 diff --git a/arch/arm/kernel/hw_breakpoint.c b/arch/arm/kernel/hw_breakpoint.c
 index dc7d0a9..1d27563 100644
 --- a/arch/arm/kernel/hw_breakpoint.c
 +++ b/arch/arm/kernel/hw_breakpoint.c
 @@ -266,8 +266,7 @@ static int enable_monitor_mode(void)
   }
  
   /* Check that the write made it through. */
 - ARM_DBG_READ(c0, c1, 0, dscr);
 - if (!(dscr & ARM_DSCR_MDBGEN)) {
 + if (!monitor_mode_enabled()) {
 pr_warn_once("Failed to enable monitor mode on CPU %d.\n",
   smp_processor_id());
   return -EPERM;

Ok, this hunk is harmless :)

 @@ -277,6 +276,43 @@ out:
   return 0;
  }
  
 +static int disable_monitor_mode(void)
 +{
 + u32 dscr;
 +
 + ARM_DBG_READ(c0, c1, 0, dscr);
 +
 + /* If monitor mode is already disabled, just return. */
 + if (!(dscr & ARM_DSCR_MDBGEN))
 + goto out;
 +
 + /* Write to the corresponding DSCR. */
 + switch (get_debug_arch()) {
 + case ARM_DEBUG_ARCH_V6:
 + case ARM_DEBUG_ARCH_V6_1:
 + ARM_DBG_WRITE(c0, c1, 0, (dscr & ~ARM_DSCR_MDBGEN));
 + break;
 + case ARM_DEBUG_ARCH_V7_ECP14:
 + case ARM_DEBUG_ARCH_V7_1:
 + case ARM_DEBUG_ARCH_V8:
 + ARM_DBG_WRITE(c0, c2, 2, (dscr & ~ARM_DSCR_MDBGEN));
 + isb();
 + break;
 + default:
 + return -ENODEV;
 + }
 +
 + /* Check that the write made it through. */
 + if (monitor_mode_enabled()) {
 + pr_warn_once("Failed to disable monitor mode on CPU %d.\n",
 + smp_processor_id());
 + return -EPERM;
 + }
 +
 +out:
 + return 0;
 +}

I'm not comfortable with this. enable_monitor_mode has precisly one caller
[reset_ctrl_regs] which goes to some lengths to get the system into a
well-defined state. On top of that, the whole thing is run with an undef
hook registered because there isn't an architected way to discover whether
or not DBGSWENABLE is driven low.

Why exactly do you need this? Can you not trap guest debug accesses some
other way?

  int hw_breakpoint_slots(int type)
  {
   if (!debug_arch_supported())
 @@ -338,6 +374,8 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
   int i, max_slots, ctrl_base, val_base;
   u32 addr, ctrl;
  
 + enable_monitor_mode();
 +
   addr = info-address;
   ctrl = encode_ctrl_reg(info-ctrl) | 0x1;
  
 @@ -430,6 +468,8 @@ void arch_uninstall_hw_breakpoint(struct perf_event *bp)
  
   /* Reset the control register. */
   write_wb_reg(base + i, 0);
 +
 + disable_monitor_mode();

My previous concerns notwithstanding, shouldn't this be refcounted?
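
(A refcounted variant might look roughly like this - purely illustrative,
ignoring the per-CPU nature of monitor mode:)

	static atomic_t mdbgen_refcount = ATOMIC_INIT(0);

	static int monitor_mode_get(void)
	{
		if (atomic_inc_return(&mdbgen_refcount) == 1)
			return enable_monitor_mode();
		return 0;
	}

	static void monitor_mode_put(void)
	{
		if (atomic_dec_and_test(&mdbgen_refcount))
			disable_monitor_mode();
	}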

Will


Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Christoffer Dall
On Mon, Jun 01, 2015 at 09:47:46AM +0200, Christian Borntraeger wrote:
 Am 28.05.2015 um 20:49 schrieb Christoffer Dall:
  Until now we have been calling kvm_guest_exit after re-enabling
  interrupts when we come back from the guest, but this has the
  unfortunate effect that CPU time accounting done in the context of timer
  interrupts occurring while the guest is running doesn't properly notice
  that the time since the last tick was spent in the guest.
 
 Can you verify that a CPU bound guest has almost zero guest time?
 Assuming that your answer is yes, your patch makes sense, as host
 timer interrupts should then be the only reason for guest exits.
 

Yes, 'cat /dev/urandom  /dev/null' in the guest shows up as 100% sys in
mpstat on the host, 0% guest.

 
  Inspired by the comment in the x86 code, move the kvm_guest_exit() call
  below the local_irq_enable() call and change __kvm_guest_exit() to
  kvm_guest_exit(), because we are now calling this function with
  interrupts enabled.  We have to now explicitly disable preemption and
  not enable preemption before we've called kvm_guest_exit(), since
  otherwise we could be preempted and everything happening before we
  eventually get scheduled again would be accounted for as guest time.
  
  At the same time, move the trace_kvm_exit() call outside of the atomic
  section, since there is no reason for us to do that with interrupts
  disabled.
  
  Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
  ---
  This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
  rework recently posted by Christian Borntraeger.  I hope I got the logic
  of this right, there were 2 slightly worrying facts about this:
  
  First, we now enable and disable and enable interrupts on each exit
  path, but I couldn't see any performance overhead on hackbench - yes the
  only benchmark we care about.
 
 This should be somewhat similar to the situation before my patch.
 There it was
 
 1: disable, guest, disable again and save, restore to disable, 
 enable
 and now it is
 2: disable, guest, enable
 and with your patch it is
 3: disable, guest, enable, disable, enable
 
 I assume that 3 and 1 are similar in its costs, so this is probably ok.
 
 
  
  Second, looking at the ppc and mips code, they seem to also call
  kvm_guest_exit() before enabling interrupts, so I don't understand how
  guest CPU time accounting works on those architectures.
 
 Not an expert here, but I assume mips has the same logic as arm, so if your
 patch is right for arm it's probably also right for mips.
 
 powerpc looks similar to what s390 does (it does not use the tick; instead it
 uses a hw-timer), so this should be fine.
 
I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
this for free, which would avoid the need for this patch?

Thanks,
-Christoffer


Re: [PATCH 07/15] KVM: MTRR: improve kvm_mtrr_get_guest_memory_type

2015-06-01 Thread Paolo Bonzini


On 30/05/2015 12:59, Xiao Guangrong wrote:
  - kvm_mtrr_get_guest_memory_type() only checks one page in MTRRs, so that
it's unnecessary to check whether the range is partially covered in
MTRR
 
  - optimize the check of overlapping memory types and add some comments to explain
the precedence
 
 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/kvm/mtrr.c | 89 
 ++---
  1 file changed, 44 insertions(+), 45 deletions(-)
 
 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index bc9c6da..d3c06d2 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -231,24 +231,16 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, 
 u64 *pdata)
   return 0;
  }
  
 -/*
 - * The function is based on mtrr_type_lookup() in
 - * arch/x86/kernel/cpu/mtrr/generic.c
 - */
 -static int get_mtrr_type(struct kvm_mtrr *mtrr_state,
 -  u64 start, u64 end)
 +u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
 - u64 base, mask;
 - u8 prev_match, curr_match;
 - int i, num_var_ranges = KVM_NR_VAR_MTRR;
 + struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
 + u64 base, mask, start = gfn_to_gpa(gfn);
 + int i, num_var_ranges = KVM_NR_VAR_MTRR, type_mask, type = -1;

Do not mix initialized and uninitialized variables on the same line
(preexisting, I know, but let's fix it instead of making it worse :)).
Please put each initialized variable on a separate line.

Also please initialize type_mask here (more on this below).

  
   /* MTRR is completely disabled, use UC for all of physical memory. */
 if (!mtrr_state->mtrr_enabled)
   return MTRR_TYPE_UNCACHABLE;
  
 - /* Make end inclusive end, instead of exclusive */
 - end--;
 -
   /* Look in fixed ranges. Just return the type as per start */
 if (mtrr_state->fixed_mtrr_enabled && (start < 0x100000)) {
   int idx;
 @@ -273,9 +265,9 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state,
* Look of multiple ranges matching this address and pick type
* as per MTRR precedence
*/
 - prev_match = 0xFF;
 + type_mask = (1 << MTRR_TYPE_WRBACK) | (1 << MTRR_TYPE_WRTHROUGH);
 for (i = 0; i < num_var_ranges; ++i) {
 - unsigned short start_state, end_state;
 + int curr_type;
  
 if (!(mtrr_state->var_ranges[i].mask & (1 << 11)))
   continue;
 @@ -283,50 +275,57 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state,
 base = mtrr_state->var_ranges[i].base & PAGE_MASK;
 mask = mtrr_state->var_ranges[i].mask & PAGE_MASK;
 
 - start_state = ((start & mask) == (base & mask));
 - end_state = ((end & mask) == (base & mask));
 - if (start_state != end_state)
 - return 0xFE;
 -
 if ((start & mask) != (base & mask))
   continue;
  
 - curr_match = mtrr_state->var_ranges[i].base & 0xff;
 - if (prev_match == 0xFF) {
 - prev_match = curr_match;
 + /*
 +  * Please refer to Intel SDM Volume 3: 11.11.4.1 MTRR
 +  * Precedences.
 +  */
 +
 + curr_type = mtrr_state->var_ranges[i].base & 0xff;
 + if (type == -1) {
 + type = curr_type;
   continue;
   }
  
 - if (prev_match == MTRR_TYPE_UNCACHABLE ||
 - curr_match == MTRR_TYPE_UNCACHABLE)
 + /*
 +  * If two or more variable memory ranges match and the
 +  * memory types are identical, then that memory type is
 +  * used.
 +  */
 + if (type == curr_type)
 + continue;
 +
 + /*
 +  * If two or more variable memory ranges match and one of
 +  * the memory types is UC, the UC memory type used.
 +  */
 + if (curr_type == MTRR_TYPE_UNCACHABLE)
   return MTRR_TYPE_UNCACHABLE;
  
 - if ((prev_match == MTRR_TYPE_WRBACK &&
 -  curr_match == MTRR_TYPE_WRTHROUGH) ||
 - (prev_match == MTRR_TYPE_WRTHROUGH &&
 -  curr_match == MTRR_TYPE_WRBACK)) {
 - prev_match = MTRR_TYPE_WRTHROUGH;
 - curr_match = MTRR_TYPE_WRTHROUGH;
 + /*
 +  * If two or more variable memory ranges match and the
 +  * memory types are WT and WB, the WT memory type is used.
 +  */
 + if (((1 << type) & type_mask) &&
 +   ((1 << curr_type) & type_mask)) {

Please inline definition of type_mask in the if, or rename it to
wt_wb_mask and make it const.  Or another suggestion below...

 + type = MTRR_TYPE_WRTHROUGH;
 + continue;
   }
  
 - 

Re: [PATCH 14/15] KVM: MTRR: do not map huge page for non-consistent range

2015-06-01 Thread Paolo Bonzini


On 01/06/2015 11:36, Paolo Bonzini wrote:
 Does this have a performance impact on shadow?  Perhaps we could cache
 in struct kvm_arch_memory_slot whether the memslot is covered by MTRRs?

Nevermind, patch 15 answers my question.

Paolo


Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Christian Borntraeger
Am 01.06.2015 um 11:08 schrieb Christoffer Dall:


 Second, looking at the ppc and mips code, they seem to also call
 kvm_guest_exit() before enabling interrupts, so I don't understand how
 guest CPU time accounting works on those architectures.

 Not an expert here, but I assume mips has the same logic as arm so if your
 patch is right for arm its probably also for mips.

 powerpc looks similar to what s390 does (not using the tick, instead it uses
 a hw-timer) so this should be fine.

 I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
 this for free which would avoid the need for this patch?

Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similarly to
HAVE_VIRT_CPU_ACCOUNTING on s390/power with respect to not relying on ticks
- yes, it might work out. Can you give it a try?

Christian




Re: [PATCH 14/15] KVM: MTRR: do not map huge page for non-consistent range

2015-06-01 Thread Paolo Bonzini


On 30/05/2015 12:59, Xiao Guangrong wrote:
 Currently guest MTRR is completely prohibited if cache snoop is supported on
 the IOMMU (!noncoherent_dma) and the host does the emulation based on
 host-side knowledge; however, the host side is not in a good position to know
 what the guest's purpose is. A good example is that a passed-through VGA
 frame buffer is not always UC as the host expects

Can you explain how?  The original idea was that such a framebuffer
would be kvm_is_reserved_pfn and thus be unconditionally UC.

 +bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
 +   int page_num)
 +{
 + struct mtrr_looker looker;
 + struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
 + u64 start = gfn_to_gpa(gfn), end = gfn_to_gpa(gfn + page_num);
 + int type = -1;
 +
 + mtrr_for_each_mem_type(&looker, mtrr_state, start, end) {
 + if (type == -1) {
 + type = looker.mem_type;
 + continue;
 + }
 +
 + if (type != looker.mem_type)
 + return false;
 + }
 +
 + if ((type != -1) && looker.partial_map &&
 +   (mtrr_state->def_type != type))
 + return false;
 +

No Pascal-like parentheses.

Does this have a performance impact on shadow?  Perhaps we could cache
in struct kvm_arch_memory_slot whether the memslot is covered by MTRRs?

Paolo


Re: [PATCH] KVM: arm: vgic: Drop useless Group0 warning

2015-06-01 Thread Christoffer Dall
On Fri, May 29, 2015 at 05:30:15PM +0100, Marc Zyngier wrote:
 If a GICv3-enabled guest tries to configure Group0, we print a
 warning on the console (because we don't support Group0 interrupts).
 
 This is fairly pointless, and would allow a guest to spam the
 console. Let's just drop the warning.
 
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com

Acked-by: Christoffer Dall christoffer.d...@linaro.org


Re: [PATCH v2 01/11] KVM: arm: plug guest debug exploit

2015-06-01 Thread Marc Zyngier
Hi Zhichao,

On 31/05/15 05:27, Zhichao Huang wrote:
 Hardware debugging in guests is not intercepted currently, which means
 that a malicious guest can bring down the entire machine by writing
 to the debug registers.
 
 This patch enables trapping of all debug registers, preventing the guest
 from messing with the host state.
 
 However, it is a precursor for later patches which will need to do
 more to world-switch debug states when necessary.
 
 Cc: sta...@vger.kernel.org
 Signed-off-by: Zhichao Huang zhichao.hu...@linaro.org
 ---
  arch/arm/include/asm/kvm_coproc.h |  3 +-
  arch/arm/kvm/coproc.c | 60 
 +++
  arch/arm/kvm/handle_exit.c|  4 +--
  arch/arm/kvm/interrupts_head.S|  2 +-
  4 files changed, 59 insertions(+), 10 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_coproc.h 
 b/arch/arm/include/asm/kvm_coproc.h
 index 4917c2f..e74ab0f 100644
 --- a/arch/arm/include/asm/kvm_coproc.h
 +++ b/arch/arm/include/asm/kvm_coproc.h
 @@ -31,7 +31,8 @@ void kvm_register_target_coproc_table(struct 
 kvm_coproc_target_table *table);
  int kvm_handle_cp10_id(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_cp_0_13_access(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run);
 -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run);
 +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
 +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
  int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
  
 diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
 index f3d88dc..2e12760 100644
 --- a/arch/arm/kvm/coproc.c
 +++ b/arch/arm/kvm/coproc.c
 @@ -91,12 +91,6 @@ int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, 
 struct kvm_run *run)
   return 1;
  }
  
 -int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run)
 -{
 - kvm_inject_undefined(vcpu);
 - return 1;
 -}
 -
  static void reset_mpidr(struct kvm_vcpu *vcpu, const struct coproc_reg *r)
  {
   /*
 @@ -519,6 +513,60 @@ int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct 
 kvm_run *run)
 return emulate_cp15(vcpu, &params);
  }
  
 +/**
 + * kvm_handle_cp14_64 -- handles a mrrc/mcrr trap on a guest CP14 access
 + * @vcpu: The VCPU pointer
 + * @run:  The kvm_run struct
 + */
 +int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
 +{
 + struct coproc_params params;
 +
 + params.CRn = (kvm_vcpu_get_hsr(vcpu) >> 1) & 0xf;
 + params.Rt1 = (kvm_vcpu_get_hsr(vcpu) >> 5) & 0xf;
 + params.is_write = ((kvm_vcpu_get_hsr(vcpu) & 1) == 0);
 + params.is_64bit = true;
 +
 + params.Op1 = (kvm_vcpu_get_hsr(vcpu) >> 16) & 0xf;
 + params.Op2 = 0;
 + params.Rt2 = (kvm_vcpu_get_hsr(vcpu) >> 10) & 0xf;
 + params.CRm = 0;
 +
 + /* raz_wi */
 + (void)pm_fake(vcpu, &params, NULL);
 +
 + /* handled */
 + kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
 + return 1;
 +}
 +
 +/**
 + * kvm_handle_cp14_32 -- handles a mrc/mcr trap on a guest CP14 access
 + * @vcpu: The VCPU pointer
 + * @run:  The kvm_run struct
 + */
 +int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run)
 +{
 + struct coproc_params params;
 +
 + params.CRm = (kvm_vcpu_get_hsr(vcpu) >> 1) & 0xf;
 + params.Rt1 = (kvm_vcpu_get_hsr(vcpu) >> 5) & 0xf;
 + params.is_write = ((kvm_vcpu_get_hsr(vcpu) & 1) == 0);
 + params.is_64bit = false;
 +
 + params.CRn = (kvm_vcpu_get_hsr(vcpu) >> 10) & 0xf;
 + params.Op1 = (kvm_vcpu_get_hsr(vcpu) >> 14) & 0x7;
 + params.Op2 = (kvm_vcpu_get_hsr(vcpu) >> 17) & 0x7;
 + params.Rt2 = 0;
 +
 + /* raz_wi */
 + (void)pm_fake(vcpu, &params, NULL);
 +
 + /* handled */
 + kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
 + return 1;
 +}
 +
  
 /**
   * Userspace API
   
 */
 diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
 index 95f12b2..357ad1b 100644
 --- a/arch/arm/kvm/handle_exit.c
 +++ b/arch/arm/kvm/handle_exit.c
 @@ -104,9 +104,9 @@ static exit_handle_fn arm_exit_handlers[] = {
   [HSR_EC_WFI]= kvm_handle_wfx,
   [HSR_EC_CP15_32]= kvm_handle_cp15_32,
   [HSR_EC_CP15_64]= kvm_handle_cp15_64,
 - [HSR_EC_CP14_MR]= kvm_handle_cp14_access,
 + [HSR_EC_CP14_MR]= kvm_handle_cp14_32,
   [HSR_EC_CP14_LS]= kvm_handle_cp14_load_store,
 - [HSR_EC_CP14_64]= kvm_handle_cp14_access,
 + [HSR_EC_CP14_64]= kvm_handle_cp14_64,
   [HSR_EC_CP_0_13]= kvm_handle_cp_0_13_access,
   [HSR_EC_CP10_ID]= kvm_handle_cp10_id,
   [HSR_EC_SVC_HYP]= handle_svc_hyp,
 diff --git 

Re: [PATCH kernel v11 25/34] powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:49PM +1000, Alexey Kardashevskiy wrote:
 This is a part of moving TCE table allocation into an iommu_ops
 callback to support multiple IOMMU groups per one VFIO container.
 
 This moves the code which allocates the actual TCE tables to helpers:
 pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
 These do not allocate/free the iommu_table struct.
 
 This enforces window size to be a power of two.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au
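
(The power-of-two enforcement mentioned above amounts to something like
this - a sketch with a hypothetical parameter name:)

	if (!is_power_of_2(window_size))	/* from linux/log2.h */
		return -EINVAL;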

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH 05/15] KVM: MTRR: clean up mtrr default type

2015-06-01 Thread Paolo Bonzini


On 30/05/2015 12:59, Xiao Guangrong wrote:
 Use a union definition to avoid the decode/encode workload and drop all the
 hard-coded values
 
 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h | 12 ++--
  arch/x86/kvm/mtrr.c | 19 ---
  2 files changed, 18 insertions(+), 13 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 2c3c52d..95ce2ff 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -347,8 +347,16 @@ enum {
  struct kvm_mtrr {
   struct mtrr_var_range var_ranges[KVM_NR_VAR_MTRR];
   mtrr_type fixed_ranges[KVM_NR_FIXED_MTRR_REGION];
 - unsigned char enabled;
 - mtrr_type def_type;
 +
 + union {
 + u64 deftype;
 + struct {
 + unsigned def_type:8;
 + unsigned reserved:2;
 + unsigned fixed_mtrr_enabled:1;
 + unsigned mtrr_enabled:1;
 + };
 + };
  };
  
  struct kvm_vcpu_arch {
 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index 562341b..6de49dd 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -105,7 +105,6 @@ EXPORT_SYMBOL_GPL(kvm_mtrr_valid);
  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
  {
 struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
 - unsigned char mtrr_enabled = mtrr_state->enabled;
   gfn_t start, end, mask;
   int index;
   bool is_fixed = true;
 @@ -114,7 +113,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 !kvm_arch_has_noncoherent_dma(vcpu->kvm))
   return;
  
 - if (!(mtrr_enabled & 0x2) && msr != MSR_MTRRdefType)
 + if (!mtrr_state->mtrr_enabled && msr != MSR_MTRRdefType)

I know Linus doesn't like bitfields too much.  Can you change these to
inline functions, and only leave an u64 deftype in the struct?

Paolo
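
For example, something along these lines (a sketch of the suggestion; the
bit positions follow the MTRRdefType layout: type in bits 7:0, FE bit 10,
E bit 11):

	static inline bool mtrr_is_enabled(struct kvm_mtrr *mtrr_state)
	{
		return !!(mtrr_state->deftype & (1ULL << 11));
	}

	static inline bool fixed_mtrr_is_enabled(struct kvm_mtrr *mtrr_state)
	{
		return !!(mtrr_state->deftype & (1ULL << 10));
	}

	static inline u8 mtrr_default_type(struct kvm_mtrr *mtrr_state)
	{
		return mtrr_state->deftype & 0xff;
	}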

   return;
  
   switch (msr) {
 @@ -153,7 +152,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 end = ((start & mask) | ~mask) + 1;
   }
  
 - if (is_fixed && !(mtrr_enabled & 0x1))
 + if (is_fixed && !mtrr_state->fixed_mtrr_enabled)
   return;
  
   kvm_zap_gfn_range(vcpu-kvm, gpa_to_gfn(start), gpa_to_gfn(end));
 @@ -166,10 +165,9 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 
 data)
   if (!kvm_mtrr_valid(vcpu, msr, data))
   return 1;
  
 - if (msr == MSR_MTRRdefType) {
 - vcpu->arch.mtrr_state.def_type = data;
 - vcpu->arch.mtrr_state.enabled = (data & 0xc00) >> 10;
 - } else if (msr == MSR_MTRRfix64K_0)
 + if (msr == MSR_MTRRdefType)
 + vcpu->arch.mtrr_state.deftype = data;
 + else if (msr == MSR_MTRRfix64K_0)
   p[0] = data;
   else if (msr == MSR_MTRRfix16K_8 || msr == MSR_MTRRfix16K_A)
   p[1 + msr - MSR_MTRRfix16K_8] = data;
 @@ -216,8 +214,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 
 *pdata)
   return 1;
  
   if (msr == MSR_MTRRdefType)
 - *pdata = vcpu->arch.mtrr_state.def_type +
 -  (vcpu->arch.mtrr_state.enabled << 10);
 + *pdata = vcpu->arch.mtrr_state.deftype;
   else if (msr == MSR_MTRRfix64K_0)
   *pdata = p[0];
   else if (msr == MSR_MTRRfix16K_8 || msr == MSR_MTRRfix16K_A)
 @@ -256,14 +253,14 @@ static int get_mtrr_type(struct kvm_mtrr *mtrr_state,
   int i, num_var_ranges = KVM_NR_VAR_MTRR;
  
   /* MTRR is completely disabled, use UC for all of physical memory. */
 - if (!(mtrr_state->enabled & 0x2))
 + if (!mtrr_state->mtrr_enabled)
   return MTRR_TYPE_UNCACHABLE;
  
   /* Make end inclusive end, instead of exclusive */
   end--;
  
   /* Look in fixed ranges. Just return the type as per start */
 - if ((mtrr_state->enabled & 0x1) && (start < 0x100000)) {
 + if (mtrr_state->fixed_mtrr_enabled && (start < 0x100000)) {
   int idx;
  
 if (start < 0x80000) {
 


Re: [PATCH 10/15] KVM: MTRR: sort variable MTRRs

2015-06-01 Thread Paolo Bonzini


On 30/05/2015 12:59, Xiao Guangrong wrote:
 Sort all valid variable MTRRs based on their base addresses; this will help
 us check whether a range is fully contained in variable MTRRs
 
 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/include/asm/kvm_host.h |  3 +++
  arch/x86/kvm/mtrr.c | 39 +++
  arch/x86/kvm/x86.c  |  2 +-
  arch/x86/kvm/x86.h  |  1 +
  4 files changed, 44 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index f3fc152..5be8f2e 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -347,6 +347,7 @@ enum {
  struct kvm_mtrr_range {
   u64 base;
   u64 mask;
 + struct list_head node;
  };
  
  struct kvm_mtrr {
 @@ -362,6 +363,8 @@ struct kvm_mtrr {
   unsigned mtrr_enabled:1;
   };
   };
 +
 + struct list_head head;
  };
  
  struct kvm_vcpu_arch {
 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index aeb9767..8e3b81a 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -262,6 +262,38 @@ do_zap:
   kvm_zap_gfn_range(vcpu-kvm, gpa_to_gfn(start), gpa_to_gfn(end));
  }
  
 +static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
 +{
 + u64 start, end;
 +
 + if (!(range->mask & (1 << 11)))
 + return false;
 +
 + var_mtrr_range(range, &start, &end);
 + return end > start;
 +}
 +
 +static void set_var_mtrr_start(struct kvm_mtrr *mtrr_state, int index)
 +{
 + /* remove the entry if it's in the list. */
 + if (var_mtrr_range_is_valid(&mtrr_state->var_ranges[index]))
 + list_del(&mtrr_state->var_ranges[index].node);
 +}
 +
 +static void set_var_mtrr_end(struct kvm_mtrr *mtrr_state, int index)
 +{
 + struct kvm_mtrr_range *tmp, *cur = &mtrr_state->var_ranges[index];
 +
 + /* add it to the list if it's valid. */
 + if (var_mtrr_range_is_valid(&mtrr_state->var_ranges[index])) {
 + list_for_each_entry(tmp, &mtrr_state->head, node)
 + if (cur->base < tmp->base)
 + list_add_tail(&cur->node, &tmp->node);
 +
 + list_add_tail(&cur->node, &mtrr_state->head);
 + }
 +}
 +
  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
   int index;
 @@ -281,10 +313,12 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, 
 u64 data)
  
   index = (msr - 0x200) / 2;
   is_mtrr_mask = msr - 0x200 - 2 * index;
 + set_var_mtrr_start(&vcpu->arch.mtrr_state, index);
   if (!is_mtrr_mask)
 vcpu->arch.mtrr_state.var_ranges[index].base = data;
 else
 vcpu->arch.mtrr_state.var_ranges[index].mask = data;
 + set_var_mtrr_end(&vcpu->arch.mtrr_state, index);

Just move the whole code to a new kvm_set_var_mtrr function.
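
(Such a helper could look roughly like this, reusing the quoted lines -
illustrative only:)

	static void kvm_set_var_mtrr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
	{
		struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
		int index = (msr - 0x200) / 2;
		bool is_mtrr_mask = msr - 0x200 - 2 * index;

		set_var_mtrr_start(mtrr_state, index);
		if (!is_mtrr_mask)
			mtrr_state->var_ranges[index].base = data;
		else
			mtrr_state->var_ranges[index].mask = data;
		set_var_mtrr_end(mtrr_state, index);
	}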

   }
  
   update_mtrr(vcpu, msr);
 @@ -331,6 +365,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 
 *pdata)
   return 0;
  }
  
 +void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
 +{
 + INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
 +}
 +
  u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
 struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 64c2891..8084ac3 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6991,7 +6991,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
   kvm_vcpu_reset(vcpu, false);
   kvm_mmu_setup(vcpu);
   vcpu_put(vcpu);
 -
 + kvm_vcpu_mtrr_init(vcpu);

Please move this call much earlier.

Paolo

   return r;
  }
  
 diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
 index 956558d..a3dae49 100644
 --- a/arch/x86/kvm/x86.h
 +++ b/arch/x86/kvm/x86.h
 @@ -162,6 +162,7 @@ int kvm_write_guest_virt_system(struct x86_emulate_ctxt 
 *ctxt,
   gva_t addr, void *val, unsigned int bytes,
   struct x86_exception *exception);
  
 +void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
  bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
  int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 


Re: [PATCH 12/15] KVM: MTRR: introduce mtrr_for_each_mem_type

2015-06-01 Thread Paolo Bonzini


On 30/05/2015 12:59, Xiao Guangrong wrote:
 +struct mtrr_looker {
 + /* input fields. */
 + struct kvm_mtrr *mtrr_state;
 + u64 start;
 + u64 end;

s/looker/iter/ or s/looker/lookup/

 +static void mtrr_lookup_start(struct mtrr_looker *looker)
 +{
 + looker->mem_type = -1;
 +
 + if (!looker->mtrr_state->mtrr_enabled) {
 + looker->partial_map = true;
 + return;
 + }
 +
 + if (!mtrr_lookup_fixed_start(looker))
 + mtrr_lookup_var_start(looker);
 +}
 +

Separating mtrr_lookup_start and mtrr_lookup_init is weird.

There are common parts of mtrr_lookup_*_start and mtrr_lookup_*_next.  
For example this:

 + looker->mem_type = looker->mtrr_state->fixed_ranges[index];
 + looker->start = fixed_mtrr_range_end_addr(seg, index);
 + return true;

in mtrr_lookup_fixed_start is the same as this:

 
 + end = fixed_mtrr_range_end_addr(looker->seg, looker->index);
 +
 + /* switch to next segment. */
 + if (end >= eseg->end) {
 + looker->seg++;
 + looker->index = 0;
 +
 + /* have looked up for all fixed MTRRs. */
 + if (looker->seg >= ARRAY_SIZE(fixed_seg_table))
 + return mtrr_lookup_var_start(looker);
 +
 + end = fixed_mtrr_range_end_addr(looker->seg, looker->index);
 + }
 +
 + looker->mem_type = mtrr_state->fixed_ranges[looker->index];
 + looker->start = end;

in mtrr_lookup_fixed_next.  Can you try to make them more common?

Basically you should have

+#define mtrr_for_each_mem_type(_looker_, _mtrr_, _gpa_start_, _gpa_end_) \
+   for (mtrr_lookup_init(_looker_, _mtrr_, _gpa_start_, _gpa_end_); \
+!mtrr_lookup_end(_looker_); mtrr_lookup_next(_looker_))

Paolo


Re: [PATCH v6 2/8] tun: add tun_is_little_endian() helper

2015-06-01 Thread Michael S. Tsirkin
On Fri, Apr 24, 2015 at 02:24:38PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Dave, could you please ack merging this through
the virtio tree?

 ---
  drivers/net/tun.c |9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index 857dca4..3c3d6c0 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c
 @@ -206,14 +206,19 @@ struct tun_struct {
   u32 flow_count;
  };
  
 +static inline bool tun_is_little_endian(struct tun_struct *tun)
 +{
 + return tun->flags & TUN_VNET_LE;
 +}
 +
  static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
  {
 - return __virtio16_to_cpu(tun->flags & TUN_VNET_LE, val);
 + return __virtio16_to_cpu(tun_is_little_endian(tun), val);
  }
  
  static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
  {
 - return __cpu_to_virtio16(tun->flags & TUN_VNET_LE, val);
 + return __cpu_to_virtio16(tun_is_little_endian(tun), val);
  }
  
  static inline u32 tun_hashfn(u32 rxhash)
 


Re: [PATCH v5 10/12] KVM: arm64: guest debug, HW assisted debug support

2015-06-01 Thread Will Deacon
Hi Alex,

On Fri, May 29, 2015 at 10:30:26AM +0100, Alex Bennée wrote:
 This adds support for userspace to control the HW debug registers for
 guest debug. In the debug ioctl we copy the IMPDEF defined number of
 registers into a new register set called host_debug_state. There is now
 a new vcpu parameter called debug_ptr which selects which register set
 is to copied into the real registers when world switch occurs.
 
 I've moved some helper functions into the hw_breakpoint.h header for
 re-use.
 
 As with single step we need to tweak the guest registers to enable the
 exceptions so we need to save and restore those bits.
 
 Two new capabilities have been added to the KVM_EXTENSION ioctl to allow
 userspace to query the number of hardware break and watch points
 available on the host hardware.
 
 Signed-off-by: Alex Bennée alex.ben...@linaro.org

[...]

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 33c8143..ada57df 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2668,7 +2668,7 @@ The top 16 bits of the control field are architecture 
 specific control
  flags which can include the following:
  
- KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64]
 -  - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390]
 +  - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, arm64]
- KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
- KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
- KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
 @@ -2683,6 +2683,11 @@ updated to the correct (supplied) values.
  The second part of the structure is architecture specific and
  typically contains a set of debug registers.
  
 +For arm64 the number of debug registers is implementation defined and
 +can be determined by querying the KVM_CAP_GUEST_DEBUG_HW_BPS and
 +KVM_CAP_GUEST_DEBUG_HW_WPS capabilities which return a positive number
 +indicating the number of supported registers.
 +
  When debug events exit the main run loop with the reason
  KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
  structure containing architecture specific debug information.
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 0d17c7b..6df47c1 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -307,6 +307,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  
  #define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE |\
   KVM_GUESTDBG_USE_SW_BP | \
 + KVM_GUESTDBG_USE_HW_BP | \
   KVM_GUESTDBG_SINGLESTEP)
  
  /**
 @@ -327,6 +328,12 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu 
 *vcpu,
  
  if (dbg->control & KVM_GUESTDBG_ENABLE) {
  vcpu->guest_debug = dbg->control;
 +
 + /* Hardware assisted Break and Watch points */
 + if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
 + vcpu->arch.external_debug_state = dbg->arch;
 + }
 +
  } else {
  /* If not enabled clear all flags */
  vcpu->guest_debug = 0;
 diff --git a/arch/arm64/include/asm/hw_breakpoint.h 
 b/arch/arm64/include/asm/hw_breakpoint.h
 index 52b484b..c450552 100644
 --- a/arch/arm64/include/asm/hw_breakpoint.h
 +++ b/arch/arm64/include/asm/hw_breakpoint.h
 @@ -130,6 +130,18 @@ static inline void ptrace_hw_copy_thread(struct 
 task_struct *task)
  }
  #endif
  
 +/* Determine number of BRP registers available. */
 +static inline int get_num_brps(void)
 +{
 + return ((read_cpuid(ID_AA64DFR0_EL1) >> 12) & 0xf) + 1;
 +}
 +
 +/* Determine number of WRP registers available. */
 +static inline int get_num_wrps(void)
 +{
 + return ((read_cpuid(ID_AA64DFR0_EL1) >> 20) & 0xf) + 1;
 +}

I'm fine with moving these into the header file, but we should probably
revisit this at some point so that we use a shadow value to indicate the
minimum number of registers across the system for e.g. big.LITTLE platforms
that don't have uniform debug resources.
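
A sketch of that idea, with hypothetical names (the per-CPU reads are the
helpers quoted above):

    /* hypothetical: keep the smallest count seen across all CPUs */
    static int min_sys_brps = INT_MAX;

    static void cpu_update_debug_resources(void)
    {
            min_sys_brps = min(min_sys_brps, get_num_brps());
    }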

  extern struct pmu perf_ops_bp;
  
  #endif   /* __KERNEL__ */
 diff --git a/arch/arm64/include/asm/kvm_host.h 
 b/arch/arm64/include/asm/kvm_host.h
 index e5040b6..498d4f7 100644
 --- a/arch/arm64/include/asm/kvm_host.h
 +++ b/arch/arm64/include/asm/kvm_host.h
 @@ -113,12 +113,13 @@ struct kvm_vcpu_arch {
  
   /*
* For debugging the guest we need to keep a set of debug
 -  * registers which can override the guests own debug state
 +  * registers which can override the guest's own debug state
* while being used. These are set via the KVM_SET_GUEST_DEBUG
* ioctl.
*/
   struct kvm_guest_debug_arch *debug_ptr;
   struct kvm_guest_debug_arch vcpu_debug_state;
 + struct kvm_guest_debug_arch external_debug_state;
  
   /* Pointer to host CPU context */
   kvm_cpu_context_t *host_cpu_context;
 diff 

Re: [PATCH 08/15] KVM: MTRR: introduce fixed_mtrr_segment table

2015-06-01 Thread Paolo Bonzini


On 30/05/2015 12:59, Xiao Guangrong wrote:
 This table summarizes the information of fixed MTRRs and introduces some APIs
 to abstract their operation, which helps us to clean up the code and will be
 used in later patches
 
 Signed-off-by: Xiao Guangrong guangrong.x...@linux.intel.com
 ---
  arch/x86/kvm/mtrr.c | 191 
 ++--
  1 file changed, 142 insertions(+), 49 deletions(-)
 
 diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
 index d3c06d2..888441e 100644
 --- a/arch/x86/kvm/mtrr.c
 +++ b/arch/x86/kvm/mtrr.c
 @@ -102,12 +102,126 @@ bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, 
 u64 data)
  }
  EXPORT_SYMBOL_GPL(kvm_mtrr_valid);
  
 +struct fixed_mtrr_segment {
 + u64 start;
 + u64 end;
 +
 + /*
 +  * a unit corresponds to one MSR entry in this segment; the unit
 +  * size is the amount of address space covered by one MSR.
 +  */
 + u64 unit_size;
 +
 + /* a range is covered in one memory cache type. */
 + u64 range_size;
 +
 + /* the start position in kvm_mtrr.fixed_ranges[]. */
 + int range_start;
 +};
 +
 +static struct fixed_mtrr_segment fixed_seg_table[] = {
 + /* MSR_MTRRfix64K_00000, 1 unit. 64K fixed mtrr. */
 + {
 + .start = 0x0,
 + .end = 0x80000,
 + .unit_size = 0x80000,

Unit size is always range size * 8: each MSR covers 8 ranges, so
64K * 8 = 0x80000, 16K * 8 = 0x20000 = (0xc0000 - 0x80000) / 2, and
4K * 8 = 0x8000 = (0x100000 - 0xc0000) / 8.

 + .range_size = 1ULL << 16,

Perhaps 64 * 1024 (and so on below) is clearer, because it matches the
name of the MSR?

 + .range_start = 0,
 + },
 +
 + /*
 +  * MSR_MTRRfix16K_80000 ... MSR_MTRRfix16K_A0000, 2 units,
 +  * 16K fixed mtrr.
 +  */
 + {
 + .start = 0x80000,
 + .end = 0xc0000,
 + .unit_size = (0xc0000 - 0x80000) / 2,
 + .range_size = 1ULL << 14,
 + .range_start = 8,
 + },
 +
 + /*
 +  * MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000, 8 units,
 +  * 4K fixed mtrr.
 +  */
 + {
 + .start = 0xc0000,
 + .end = 0x100000,
 + .unit_size = (0x100000 - 0xc0000) / 8,
 + .range_size = 1ULL << 12,
 + .range_start = 24,
 + }
 +};
 +
 +static int fixed_msr_to_range_index(u32 msr)
 +{
 + int seg, unit;
 +
 + if (!fixed_msr_to_seg_unit(msr, &seg, &unit))
 + return -1;
 +
 + fixed_mtrr_seg_unit_range_index(seg, unit);

This looks wrong.

 + return 0;
 +}
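
The computed index appears to be discarded and 0 returned instead; the
likely intent, sketched:

    /* sketch: propagate the range index instead of dropping it */
    static int fixed_msr_to_range_index(u32 msr)
    {
            int seg, unit;

            if (!fixed_msr_to_seg_unit(msr, &seg, &unit))
                    return -1;

            return fixed_mtrr_seg_unit_range_index(seg, unit);
    }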
 +
  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
  {
  struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
   gfn_t start, end, mask;
   int index;
 - bool is_fixed = true;
  
   if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
  !kvm_arch_has_noncoherent_dma(vcpu->kvm))
 @@ -116,32 +230,19 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
  if (!mtrr_state->mtrr_enabled && msr != MSR_MTRRdefType)
   return;
  
 + if (fixed_msr_to_range(msr, &start, &end)) {
 + if (!mtrr_state->fixed_mtrr_enabled)
 + return;
 + goto do_zap;
 + }
 +
   switch (msr) {

Please move the defType handling into an if above the
if (fixed_msr_to_range(msr, &start, &end)).  Then you don't need the goto.

Paolo
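
For illustration, the restructured flow suggested above could look like
this (a sketch; helper names as in the patch):

    /* sketch: defType first, then fixed ranges, so no goto is needed */
    if (msr == MSR_MTRRdefType) {
            start = 0x0;
            end = ~0ULL;
    } else if (fixed_msr_to_range(msr, &start, &end)) {
            if (!mtrr_state->fixed_mtrr_enabled)
                    return;
    } else {
            /* variable range MTRRs, handled as in the patch */
    }

    kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));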

 - case MSR_MTRRfix64K_00000:
 - start = 0x0;
 - end = 0x80000;
 - break;
 - case MSR_MTRRfix16K_80000:
 - start = 0x80000;
 - end = 0xa0000;
 - break;
 - case MSR_MTRRfix16K_A0000:
 - start = 0xa0000;
 - end = 0xc0000;
 - break;
 - case MSR_MTRRfix4K_C0000 ... MSR_MTRRfix4K_F8000:
 - index = msr - MSR_MTRRfix4K_C0000;
 - start = 0xc0000 + index * (32 << 10);
 - end = start + (32 << 10);
 - break;
   case MSR_MTRRdefType:
 - is_fixed = false;
   start = 0x0;
   end = ~0ULL;
   break;
   default:
   /* variable range MTRRs. */
 - is_fixed = false;
   index = (msr - 0x200) / 2;
  start = mtrr_state->var_ranges[index].base & PAGE_MASK;
  mask = mtrr_state->var_ranges[index].mask & PAGE_MASK;
 @@ -150,38 +251,33 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
  end = ((start & mask) | ~mask) + 1;
   }
  
 - if (is_fixed && !mtrr_state->fixed_mtrr_enabled)
 - return;
 -
 +do_zap:
   kvm_zap_gfn_range(vcpu-kvm, gpa_to_gfn(start), gpa_to_gfn(end));
  }
  
  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
  {
 - u64 *p = (u64 *)&vcpu->arch.mtrr_state.fixed_ranges;
 + int index;
  
   if (!kvm_mtrr_valid(vcpu, msr, data))
   return 1;
  
 - if (msr == MSR_MTRRdefType)
 + index = fixed_msr_to_range_index(msr);
 + if (index >= 0)
 + *(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index] = data;
 + else if (msr == MSR_MTRRdefType)
   

Re: [PATCH v6 3/8] macvtap: introduce macvtap_is_little_endian() helper

2015-06-01 Thread Michael S. Tsirkin
On Fri, Apr 24, 2015 at 02:24:48PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com

Dave, could you pls ack merging this through the virtio tree?

 ---
  drivers/net/macvtap.c |9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
 index 27ecc5c..a2f2958 100644
 --- a/drivers/net/macvtap.c
 +++ b/drivers/net/macvtap.c
 @@ -49,14 +49,19 @@ struct macvtap_queue {
  
  #define MACVTAP_VNET_LE 0x8000
  
 +static inline bool macvtap_is_little_endian(struct macvtap_queue *q)
 +{
 + return q->flags & MACVTAP_VNET_LE;
 +}
 +
  static inline u16 macvtap16_to_cpu(struct macvtap_queue *q, __virtio16 val)
  {
 - return __virtio16_to_cpu(q->flags & MACVTAP_VNET_LE, val);
 + return __virtio16_to_cpu(macvtap_is_little_endian(q), val);
  }
  
  static inline __virtio16 cpu_to_macvtap16(struct macvtap_queue *q, u16 val)
  {
 - return __cpu_to_virtio16(q->flags & MACVTAP_VNET_LE, val);
 + return __cpu_to_virtio16(macvtap_is_little_endian(q), val);
  }
  
  static struct proto macvtap_proto = {
 


Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Mario Smarduch
On 05/30/2015 11:59 PM, Christoffer Dall wrote:
 Hi Mario,
 
 On Fri, May 29, 2015 at 03:34:47PM -0700, Mario Smarduch wrote:
 On 05/28/2015 11:49 AM, Christoffer Dall wrote:
 Until now we have been calling kvm_guest_exit after re-enabling
 interrupts when we come back from the guest, but this has the
 unfortunate effect that CPU time accounting done in the context of timer
 interrupts occurring while the guest is running doesn't properly notice
 that the time since the last tick was spent in the guest.

 Inspired by the comment in the x86 code, move the kvm_guest_exit() call
 below the local_irq_enable() call and change __kvm_guest_exit() to
 kvm_guest_exit(), because we are now calling this function with
 interrupts enabled.  We have to now explicitly disable preemption and
 not enable preemption before we've called kvm_guest_exit(), since
 otherwise we could be preempted and everything happening before we
 eventually get scheduled again would be accounted for as guest time.

 At the same time, move the trace_kvm_exit() call outside of the atomic
 section, since there is no reason for us to do that with interrupts
 disabled.

 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org
 ---
 This patch is based on kvm/queue, because it has the kvm_guest_enter/exit
 rework recently posted by Christian Borntraeger.  I hope I got the logic
 of this right, there were 2 slightly worrying facts about this:

 First, we now enable and disable and enable interrupts on each exit
 path, but I couldn't see any performance overhead on hackbench - yes the
 only benchmark we care about.

 Second, looking at the ppc and mips code, they seem to also call
 kvm_guest_exit() before enabling interrupts, so I don't understand how
 guest CPU time accounting works on those architectures.

 Changes since v1:
  - Tweak comment and commit text based on Marc's feedback.
  - Explicitly disable preemption and enable it only after kvm_guest_exit().

  arch/arm/kvm/arm.c | 21 +
  1 file changed, 17 insertions(+), 4 deletions(-)

 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index e41cb11..fe8028d 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -532,6 +532,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
 struct kvm_run *run)
 kvm_vgic_flush_hwstate(vcpu);
 kvm_timer_flush_hwstate(vcpu);
  
 +   preempt_disable();
 local_irq_disable();
  
 /*
 @@ -544,6 +545,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
 struct kvm_run *run)
  
 if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
 local_irq_enable();
 +   preempt_enable();
 kvm_timer_sync_hwstate(vcpu);
 kvm_vgic_sync_hwstate(vcpu);
 continue;
 @@ -559,8 +561,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
 struct kvm_run *run)
 ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
  
 vcpu->mode = OUTSIDE_GUEST_MODE;
 -   __kvm_guest_exit();
 -   trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
 +   /*
 +* Back from guest
 +*/
 +
 /*
  * We may have taken a host interrupt in HYP mode (ie
  * while executing the guest). This interrupt is still
 @@ -574,8 +578,17 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
 struct kvm_run *run)
 local_irq_enable();
  
 /*
 -* Back from guest
 -*/
 +* We do local_irq_enable() before calling kvm_guest_exit() so
 +* that if a timer interrupt hits while running the guest we
 +* account that tick as being spent in the guest.  We enable
 +* preemption after calling kvm_guest_exit() so that if we get
 +* preempted we make sure ticks after that is not counted as
 +* guest time.
 +*/
 +   kvm_guest_exit();
 +   trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
 +   preempt_enable();
 +
  
 kvm_timer_sync_hwstate(vcpu);
 kvm_vgic_sync_hwstate(vcpu);


 Hi Christoffer,
  so currently we take a snapshot when we enter the guest
 (tsk->vtime_snap) and upon exit add the time we spent in
 the guest and update the accrued time, which appears correct.
 
 not on arm64, because we don't select HAVE_VIRT_CPU_ACCOUNTING_GEN.  Or
 am I missing something obvious here?
I see what you mean - we can't use cycle-based accounting to accrue
guest time.

 

 With this patch it appears that interrupts running
 in host mode are accrued to Guest time, and additional preemption
 latency is added.

 It is true that interrupt processing in host mode (if interrupts hit on a CPU
 while it is running a guest) is accrued to guest time, but without this
 

Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Christoffer Dall
On Mon, Jun 01, 2015 at 11:21:19AM +0200, Christian Borntraeger wrote:
 Am 01.06.2015 um 11:08 schrieb Christoffer Dall:
 
 
  Second, looking at the ppc and mips code, they seem to also call
  kvm_guest_exit() before enabling interrupts, so I don't understand how
  guest CPU time accounting works on those architectures.
 
  Not an expert here, but I assume mips has the same logic as arm so if your
  patch is right for arm it's probably also right for mips.
 
  powerpc looks similar to what s390 does (not using the tick, instead it uses
  a hw-timer) so this should be fine.
 
  I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
  this for free which would avoid the need for this patch?
 
 Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similarly to
 HAVE_VIRT_CPU_ACCOUNTING on s390/power with respect to not relying on ticks
 - yes it might work out. Can you give it a try?
 
Adding HAVE_VIRT_CPU_ACCOUNTING_GEN to arch/arm64/Kconfig works, but has
no effect unless you also enable CONFIG_NO_HZ_FULL, so that hardly feels
like a fix since it would be a shame to force users to use this config
option to report CPU usage correctly.

I'm not entirely sure what the history and meaning behind these configs
are, so maybe there is an entirely different rework needed here.  It
seems logical that you could simply sample the counter at entry/exit of
the guest, but if there is nowhere to store this data without
NO_HZ_FULL+VIRT_CPU_ACCOUNTING_GEN then I guess that would be why?

Thanks,
-Christoffer
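
For reference, the combination discussed above amounts to this (a sketch
of the experiment, not a proposed patch):

    # arch/arm64/Kconfig
    config ARM64
            select HAVE_VIRT_CPU_ACCOUNTING_GEN

    # .config - both must be set for the accounting to take effect
    CONFIG_NO_HZ_FULL=y
    CONFIG_VIRT_CPU_ACCOUNTING_GEN=y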


Re: [PATCH v2] arm/arm64: KVM: Properly account for guest CPU time

2015-06-01 Thread Christian Borntraeger
Am 01.06.2015 um 15:35 schrieb Christoffer Dall:
 On Mon, Jun 01, 2015 at 11:21:19AM +0200, Christian Borntraeger wrote:
 Am 01.06.2015 um 11:08 schrieb Christoffer Dall:


 Second, looking at the ppc and mips code, they seem to also call
 kvm_guest_exit() before enabling interrupts, so I don't understand how
 guest CPU time accounting works on those architectures.

 Not an expert here, but I assume mips has the same logic as arm so if your
 patch is right for arm its probably also for mips.

 powerpc looks similar to what s390 does (not using the tick, instead it 
 uses
 a hw-timer) so this should be fine.

 I wonder if we can simply enable HAVE_VIRT_CPU_ACCOUNTING_GEN and get
 this for free which would avoid the need for this patch?

  Assuming that HAVE_VIRT_CPU_ACCOUNTING_GEN behaves similarly to
  HAVE_VIRT_CPU_ACCOUNTING on s390/power with respect to not relying on ticks
  - yes it might work out. Can you give it a try?

 Adding HAVE_VIRT_CPU_ACCOUNTING_GEN to arch/arm64/Kconfig works, but has
 no effect unless you also enable CONFIG_NO_HZ_FULL, so that hardly feels
 like a fix since it would be a shame to force users to use this config
 option to report CPU usage correctly.
 
 I'm not entirely sure what the history and meaning behind these configs
 are, so maybe there is an entirely different rework needed here.  It
 seems logical that you could simply sample the counter at entry/exit of
 the guest, but if there is nowhere to store this data without
 NO_HZ_FULL+VIRT_CPU_ACCOUNTING_GEN then I guess that would be why?

Given Paolo's response that irq_disable/enable is faster than save/restore
at least on x86, your v2 patch might actually be the right thing to do.

Christian



Re: [PATCH 12/15] KVM: MTRR: introduce mtrr_for_each_mem_type

2015-06-01 Thread Paolo Bonzini


On 01/06/2015 11:33, Paolo Bonzini wrote:
 +looker->mem_type = looker->mtrr_state->fixed_ranges[index];
  +  looker->start = fixed_mtrr_range_end_addr(seg, index);
  +  return true;
 in mtrr_lookup_fixed_start is the same as this:
 
  
  +  end = fixed_mtrr_range_end_addr(looker->seg, looker->index);
  +
  +  /* switch to next segment. */
  +  if (end >= eseg->end) {
  +  looker->seg++;
  +  looker->index = 0;
  +
  +  /* have looked up for all fixed MTRRs. */
  +  if (looker->seg >= ARRAY_SIZE(fixed_seg_table))
  +  return mtrr_lookup_var_start(looker);
  +
  +  end = fixed_mtrr_range_end_addr(looker->seg, looker->index);
  +  }
  +
  +  looker->mem_type = mtrr_state->fixed_ranges[looker->index];
  +  looker->start = end;
 in mtrr_lookup_fixed_next.  Can you try to make them more common?
 
 Basically you should have
 
 +#define mtrr_for_each_mem_type(_looker_, _mtrr_, _gpa_start_, _gpa_end_) \
 + for (mtrr_lookup_init(_looker_, _mtrr_, _gpa_start_, _gpa_end_); \
 +  !mtrr_lookup_end(_looker_); mtrr_lookup_next(_looker_))

... where the above code I quoted would be part of mtrr_lookup_end.

Paolo


[PULL] vhost: cleanups and fixes

2015-06-01 Thread Michael S. Tsirkin
The following changes since commit c65b99f046843d2455aa231747b5a07a999a9f3d:

  Linux 4.1-rc6 (2015-05-31 19:01:07 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 8a7b19d8b542b87bccc3eaaf81dcc90a5ca48aea:

  include/uapi/linux/virtio_balloon.h: include linux/virtio_types.h (2015-06-01 
15:46:54 +0200)


virtio: last-minute fix for 4.1

This tweaks an exported user-space header to fix
build breakage for userspace using it.

Signed-off-by: Michael S. Tsirkin m...@redhat.com


Mikko Rapeli (1):
  include/uapi/linux/virtio_balloon.h: include linux/virtio_types.h

 include/uapi/linux/virtio_balloon.h | 1 +
 1 file changed, 1 insertion(+)


Re: [PULL] vhost: cleanups and fixes

2015-06-01 Thread Michael S. Tsirkin
The mail subject is wrong of course - the one in
the pull request is correct.
Sorry about that - this shouldn't interfere with merging it, I'll
correct it for future requests.

-- 
MST


Re: [PATCH v6 3/8] macvtap: introduce macvtap_is_little_endian() helper

2015-06-01 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Mon, 1 Jun 2015 12:30:16 +0200

 On Fri, Apr 24, 2015 at 02:24:48PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
 
 Dave, could you pls ack merging this through the virtio tree?

Acked-by: David S. Miller da...@davemloft.net


Re: [PATCH v6 2/8] tun: add tun_is_little_endian() helper

2015-06-01 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Mon, 1 Jun 2015 12:29:43 +0200

 On Fri, Apr 24, 2015 at 02:24:38PM +0200, Greg Kurz wrote:
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
 
 Dave, could you please ack merging this through
 the virtio tree?

Acked-by: David S. Miller da...@davemloft.net


Re: [PATCH kernel v11 27/34] powerpc/powernv: Implement multilevel TCE tables

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:51PM +1000, Alexey Kardashevskiy wrote:
 TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
 on huge guests (hundreds of GB of RAM) so the kernel might be unable to
 allocate a contiguous chunk of physical memory to store the TCE table.
 
 To address this, POWER8 CPU (actually, IODA2) supports multi-level
 TCE tables, up to 5 levels, which split the table into a tree of
 smaller subtables.
 
 This adds multi-level TCE tables support to
 pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
 helpers.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 ---
 Changes:
 v10:
 * fixed multiple comments received for v9
 
 v9:
 * moved from ioda2 to common powernv pci code
 * fixed cleanup if allocation fails in the middle
 * removed check for the size - all boundary checks happen in the calling code
 anyway
 ---
  arch/powerpc/include/asm/iommu.h  |  2 +
  arch/powerpc/platforms/powernv/pci-ioda.c | 98 
 ---
  arch/powerpc/platforms/powernv/pci.c  | 13 
  3 files changed, 104 insertions(+), 9 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/iommu.h 
 b/arch/powerpc/include/asm/iommu.h
 index 4636734..706cfc0 100644
 --- a/arch/powerpc/include/asm/iommu.h
 +++ b/arch/powerpc/include/asm/iommu.h
 @@ -96,6 +96,8 @@ struct iommu_pool {
  struct iommu_table {
   unsigned long  it_busno; /* Bus number this table belongs to */
   unsigned long  it_size;  /* Size of iommu table in entries */
 + unsigned long  it_indirect_levels;
 + unsigned long  it_level_size;
   unsigned long  it_offset;/* Offset into global table */
   unsigned long  it_base;  /* mapped address of tce table */
   unsigned long  it_index; /* which iommu table this is */
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index fda01c1..68ffc7a 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -49,6 +49,9 @@
  /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
  #define TCE32_TABLE_SIZE ((0x10000000 / 0x1000) * 8)
  
 +#define POWERNV_IOMMU_DEFAULT_LEVELS 1
 +#define POWERNV_IOMMU_MAX_LEVELS 5
 +
  static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
  
  static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 @@ -1975,6 +1978,8 @@ static long pnv_pci_ioda2_set_window(struct 
 iommu_table_group *table_group,
   table_group);
   struct pnv_phb *phb = pe-phb;
   int64_t rc;
 + const unsigned long size = tbl->it_indirect_levels ?
 + tbl->it_level_size : tbl->it_size;
  const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
  const __u64 win_size = tbl->it_size << tbl->it_page_shift;
  
 @@ -1989,9 +1994,9 @@ static long pnv_pci_ioda2_set_window(struct 
 iommu_table_group *table_group,
  rc = opal_pci_map_pe_dma_window(phb->opal_id,
  pe->pe_number,
  pe->pe_number << 1,
 - 1,
 + tbl->it_indirect_levels + 1,
  __pa(tbl->it_base),
 - tbl->it_size << 3,
 + size << 3,
  IOMMU_PAGE_SIZE(tbl));
   if (rc) {
  pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
 @@ -2071,11 +2076,19 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct 
 pnv_phb *phb)
  phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8);
  }
  
 -static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift)
 +static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift,
 + unsigned levels, unsigned long limit,
 + unsigned long *tce_table_allocated)
  {
   struct page *tce_mem = NULL;
 - __be64 *addr;
 + __be64 *addr, *tmp;
   unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
 + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT);
 + unsigned entries = 1UL << (shift - 3);
 + long i;
 +
 + if (*tce_table_allocated >= limit)
 + return NULL;

I'm not quite clear what case this limit logic is trying to catch.

  
   tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
   if (!tce_mem) {
 @@ -2083,31 +2096,69 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int 
 nid, unsigned shift)
   return NULL;
   }
   addr = page_address(tce_mem);
 - memset(addr, 0, 1UL << (order + PAGE_SHIFT));
 + memset(addr, 0, local_allocated);
 +
 + --levels;
 + if (!levels) {
 + *tce_table_allocated += local_allocated;
 + return addr;
 + }
 +
 + for (i = 0; i < entries; ++i) {
 + tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift,
 + levels, limit, tce_table_allocated);
 + if (!tmp)
 + break;
 +
 +
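
To put numbers on the geometry (an illustrative calculation, not from the
patch): each level page of 1 << shift bytes holds 1 << (shift - 3) 8-byte
TCEs, and the limit caps last-level allocation at what it_size actually
needs rather than populating a full subtree under every entry.

    /* illustration: two-level table, 64K level pages, 4K IOMMU pages */
    static unsigned long example_window_bytes(void)
    {
            unsigned shift = 16;                          /* 64K level page */
            unsigned long entries = 1UL << (shift - 3);   /* 8192 TCEs/page */
            unsigned long total_tces = entries * entries; /* 2 levels: 64M  */

            return total_tces << 12;    /* 4K IOMMU pages: 256GB window */
    }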

Re: [PATCH kernel v11 26/34] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:50PM +1000, Alexey Kardashevskiy wrote:
 This is a part of moving DMA window programming to an iommu_ops
 callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
 a first parameter (not pnv_ioda_pe) as it is going to be used as
 a callback for VFIO DDW code.
 
 This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is

I'm assuming that's what's now called pnv_pci_ioda2_invalidate_entire()?

 a good thing to do. It does not have immediate effect now as the table
 is never recreated after reboot but it will in the following patches.
 
 This should cause no behavioural change.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Reviewed-by: Gavin Shan gws...@linux.vnet.ibm.com
 ---
 Changes:
 v11:
 * replaced some 1 << it_page_shift with IOMMU_PAGE_SIZE() macro
 
 v9:
 * initialize pe-table_group.tables[0] at the very end when
 tbl is fully initialized
 * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch
 ---
  arch/powerpc/platforms/powernv/pci-ioda.c | 47 
 +--
  1 file changed, 38 insertions(+), 9 deletions(-)
 
 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
 b/arch/powerpc/platforms/powernv/pci-ioda.c
 index 3d29fe3..fda01c1 100644
 --- a/arch/powerpc/platforms/powernv/pci-ioda.c
 +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
 @@ -1968,6 +1968,43 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
 *phb,
   }
  }
  
 +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group,
 + int num, struct iommu_table *tbl)
 +{
 + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
 + table_group);
 + struct pnv_phb *phb = pe->phb;
 + int64_t rc;
 + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
 + const __u64 win_size = tbl->it_size << tbl->it_page_shift;
 +
 + pe_info(pe, "Setting up window %llx..%llx pg=%x\n",
 + start_addr, start_addr + win_size - 1,
 + IOMMU_PAGE_SIZE(tbl));
 +
 + /*
 +  * Map TCE table through TVT. The TVE index is the PE number
 +  * shifted by 1 bit for 32-bits DMA space.
 +  */
 + rc = opal_pci_map_pe_dma_window(phb->opal_id,
 + pe->pe_number,
 + pe->pe_number << 1,
 + 1,
 + __pa(tbl->it_base),
 + tbl->it_size << 3,
 + IOMMU_PAGE_SIZE(tbl));
 + if (rc) {
 + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
 + return rc;
 + }
 +
 + pnv_pci_link_table_and_group(phb->hose->node, num,
 + tbl, &pe->table_group);
 + pnv_pci_ioda2_tce_invalidate_entire(pe);
 +
 + return 0;
 +}
 +
  static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
  {
  uint16_t window_id = (pe->pe_number << 1) + 1;
 @@ -2123,21 +2160,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
 *phb,
  pe->table_group.ops = &pnv_pci_ioda2_ops;
  #endif
  
 - /*
 -  * Map TCE table through TVT. The TVE index is the PE number
 -  * shifted by 1 bit for 32-bits DMA space.
 -  */
 - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
 - pe->pe_number << 1, 1, __pa(tbl->it_base),
 - tbl->it_size << 3, 1ULL << tbl->it_page_shift);
 + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
   if (rc) {
  pe_err(pe, "Failed to configure 32-bit TCE table,"
  " err %ld\n", rc);
   goto fail;
   }
  
 - pnv_pci_ioda2_tce_invalidate_entire(pe);
 -
   /* OPAL variant of PHB3 invalidated TCEs */
  if (phb->ioda.tce_inval_reg)
  tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would require
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and the system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical page pinning and TCE table update
 into 2 different operations. It requires:
 1) guest pages to be registered first
 2) subsequent map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 In order to support memory pre-registration, we need a way to track
 the use of every registered memory region and only allow unregistration
 if a region is not in use anymore. So we need a way to tell which
 region the just-cleared TCE came from.
 
 This adds a userspace view of the TCE table into iommu_table struct.
 It contains userspace address, one per TCE entry. The table is only
 allocated when the ownership over an IOMMU group is taken which means
 it is only used from outside of the powernv code (such as VFIO).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v11:
 * mm_iommu_put() does not return a code so this does not check it
 * moved v2 in tce_container to pack the struct
 
 v10:
 * moved it_userspace allocation to vfio_iommu_spapr_tce as it is a VFIO
 specific thing
 * squashed powerpc/iommu: Add userspace view of TCE table into this as
 it is
 a part of IOMMU v2
 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
 * fixed some function names to have tce_iommu_ in the beginning rather
 just tce_
 * as mm_iommu_mapped_inc() can now fail, check for the return code
 
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
  Documentation/vfio.txt  |  31 ++-
  arch/powerpc/include/asm/iommu.h|   6 +
  drivers/vfio/vfio_iommu_spapr_tce.c | 512 
 ++--
  include/uapi/linux/vfio.h   |  27 ++
  4 files changed, 487 insertions(+), 89 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..7dcf2b5 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
  
  This implementation has some specifics:
  
 -1) Only one IOMMU group per container is supported as an IOMMU group
 -represents the minimal entity which isolation can be guaranteed for and
 -groups are allocated statically, one per a Partitionable Endpoint (PE)
 +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
 +container is supported as an IOMMU table is allocated at boot time,
 +one table per IOMMU group which is a Partitionable Endpoint (PE)
  (PE is often a PCI domain but not always).
 +Newer systems (POWER8 with IODA2) have an improved hardware design which
 +allows removing this limitation and having multiple IOMMU groups per VFIO
 +container.
  
  2) The hardware supports so called DMA windows - 
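
A sketch of how userspace would drive the new ioctls (the struct layout is
assumed from the uapi additions in this patch; error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* pre-register guest RAM once so later map/unmap calls work
     * against already pinned, already accounted memory */
    static int preregister_ram(int container_fd, void *ram, uint64_t size)
    {
            struct vfio_iommu_spapr_register_memory reg = {
                    .argsz = sizeof(reg),
                    .flags = 0,
                    .vaddr = (uint64_t)(uintptr_t)ram,
                    .size  = size,
            };

            return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
    }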

Re: [PATCH kernel v11 31/34] vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:55PM +1000, Alexey Kardashevskiy wrote:
 Before the IOMMU user (VFIO) would take control over the IOMMU table
 belonging to a specific IOMMU group. This approach did not allow sharing
 tables between IOMMU groups attached to the same container.
 
 This introduces a new IOMMU ownership flavour where the user does not just
 control the existing IOMMU table but can remove/create tables on demand.
 If an IOMMU implements take/release_ownership() callbacks, this lets
 the user have full control over the IOMMU group. When the ownership
 is taken, the platform code removes all the windows so the caller must
 create them.
 Before returning the ownership back to the platform code, VFIO
 unprograms and removes all the tables it created.
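
The resulting lifecycle, sketched (callback names from the commit message;
the caller is hypothetical):

    /* sketch of the DDW ownership flow described above */
    static void ownership_flow_sketch(struct iommu_table_group *grp)
    {
            /* platform removes all windows; the group is now bare */
            grp->ops->take_ownership(grp);

            /* ... VFIO creates, programs and eventually removes its
             * own tables/windows here ... */

            /* platform restores the default 32-bit window */
            grp->ops->release_ownership(grp);
    }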
 
 This changes IODA2's ownership handler to remove the existing table
 rather than manipulating the existing one. From now on,
 iommu_take_ownership() and iommu_release_ownership() are only called
 from the vfio_iommu_spapr_tce driver.
 
 Old-style ownership is still supported allowing VFIO to run on older
 P5IOC2 and IODA IO controllers.
 
 No change in userspace-visible behaviour is expected. Since it recreates
 TCE tables on each ownership change, related kernel traces will appear
 more often.
 
 This adds a pnv_pci_ioda2_setup_default_config() which is called
 when PE is being configured at boot time and when the ownership is
 passed from VFIO to the platform code.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 29/34] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:53PM +1000, Alexey Kardashevskiy wrote:
 The existing code programmed TVT#0 with some address and then
 immediately released that memory.
 
 This makes use of pnv_pci_ioda2_unset_window() and
 pnv_pci_ioda2_set_bypass() which do correct resource release and
 TVT update.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 32/34] powerpc/mmu: Add userspace-to-physical addresses translation cache

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:56PM +1000, Alexey Kardashevskiy wrote:
 We are adding support for DMA memory pre-registration to be used in
 conjunction with VFIO. The idea is that the userspace which is going to
 run a guest may want to pre-register a user space memory region so
 it all gets pinned once and never goes away. Having this done,
 a hypervisor will not have to pin/unpin pages on every DMA map/unmap
 request. This is going to help with multiple pinning of the same memory.
 
 Another use of it is in-kernel real mode (mmu off) acceleration of
 DMA requests where real time translation of guest physical to host
 physical addresses is non-trivial and may fail as linux ptes may be
 temporarily invalid. Also, having cached host physical addresses
 (compared to just pinning at the start and then walking the page table
 again on every H_PUT_TCE), we can be sure that the addresses which we put
 into TCE table are the ones we already pinned.
 
 This adds a list of memory regions to mm_context_t. Each region consists
 of a header and a list of physical addresses. This adds API to:
 1. register/unregister memory regions;
 2. do final cleanup (which puts all pre-registered pages);
 3. do userspace to physical address translation;
 4. manage usage counters; multiple registration of the same memory
 is allowed (once per container).
 
 This implements 2 counters per registered memory region:
 - @mapped: incremented on every DMA mapping; decremented on unmapping;
 initialized to 1 when a region is just registered; once it becomes zero,
 no more mappings are allowed;
 - @used: incremented on every register ioctl; decremented on
 unregister; unregistration is allowed for DMA mapped regions unless
 it is the very last reference. For the very last reference this checks
 that the region is still mapped and returns -EBUSY so the userspace
 gets to know that memory is still pinned and unregistration needs to
 be retried; @used remains 1.
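
The discipline for @mapped, sketched (a hypothetical helper; the real code
lives behind the mm_iommu_* API):

    /* sketch: @mapped starts at 1; a new DMA mapping may only take a
     * reference while the region is still live (mapped != 0) */
    static bool mem_region_map_inc(atomic64_t *mapped)
    {
            return atomic64_add_unless(mapped, 1, 0) != 0;
    }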
 
 Host physical addresses are stored in a vmalloc'ed array. In order to
 access these in real mode (mmu off), there is a real_vmalloc_addr()
 helper. In-kernel acceleration patchset will move it from KVM to MMU code.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

There's a bunch of ways this could be cleaned up, but those can be
done later.  So,

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 34/34] vfio: powerpc/spapr: Support Dynamic DMA windows

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:58PM +1000, Alexey Kardashevskiy wrote:
 This adds create/remove window ioctls to create and remove DMA windows.
 sPAPR defines a Dynamic DMA windows capability which allows
 para-virtualized guests to create additional DMA windows on a PCI bus.
 The existing linux kernels use this new window to map the entire guest
 memory and switch to the direct DMA operations, saving time on map/unmap
 requests which would normally happen in big amounts.
 
 This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
 VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
 Up to 2 windows are supported now by the hardware and by this driver.
 
 This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
 information such as the number of supported windows and the maximum
 number of TCE table levels.
 
 DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
 as we still want to support v2 on platforms which cannot do DDW for
 the sake of TCE acceleration in KVM (coming soon).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 28/34] vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:52PM +1000, Alexey Kardashevskiy wrote:
 This extends iommu_table_group_ops by a set of callbacks to support
 dynamic DMA windows management.
 
 create_table() creates a TCE table with specific parameters.
 It receives an iommu_table_group to know the nodeid in order to allocate
 TCE table memory closer to the PHB. The exact format of the allocated
 multi-level table might also be specific to the PHB model (not
 the case now though).
 This callback calculates the DMA window offset on a PCI bus from @num
 and stores it in the just created table.
 
 set_window() sets the window at specified TVT index + @num on PHB.
 
 unset_window() unsets the window from specified TVT.
 
 This adds a free() callback to iommu_table_ops to free the memory
 (potentially a tree of tables) allocated for the TCE table.
 
 create_table() and free() are supposed to be called once per
 VFIO container and set_window()/unset_window() are supposed to be
 called for every group in a container.
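
Put together, the callback set described above would look roughly like this
(prototypes are assumptions based on the commit message, since the hunks are
not quoted here):

    struct iommu_table_group_ops {
            long (*create_table)(struct iommu_table_group *table_group,
                            int num, __u32 page_shift, __u64 window_size,
                            __u32 levels, struct iommu_table **ptbl);
            long (*set_window)(struct iommu_table_group *table_group,
                            int num, struct iommu_table *tbl);
            long (*unset_window)(struct iommu_table_group *table_group,
                            int num);
    };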
 
 This adds IOMMU capabilities to iommu_table_group such as default
 32bit window parameters and others. This makes use of new values in
 vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
 advertise pagemasks to the userspace.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 Acked-by: Alex Williamson alex.william...@redhat.com

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v11 30/34] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:54PM +1000, Alexey Kardashevskiy wrote:
 This adds a way for the IOMMU user to know how much a new table will
 use so it can be accounted in the locked_vm limit before allocation
 happens.
 
 This stores the allocated table size in pnv_pci_ioda2_get_table_size()
 so the locked_vm counter can be updated correctly when a table is
 being disposed.
 
 This defines an iommu_table_group_ops callback to let VFIO know
 how much memory will be locked if a table is created.
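
As a back-of-the-envelope example of what such a callback reports
(illustrative numbers only):

    /* a 2GB window with 4K IOMMU pages needs 512K TCEs of 8 bytes
     * each, i.e. 4MB charged to locked_vm before allocation */
    static unsigned long example_table_size(void)
    {
            unsigned long window = 2UL << 30;          /* 2GB window  */
            unsigned page_shift = 12;                  /* 4K pages    */
            unsigned long tces = window >> page_shift; /* 524288 TCEs */

            return tces * 8;                           /* 4MB         */
    }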
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

Reviewed-by: David Gibson da...@gibson.dropbear.id.au

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

