Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-04 Thread Gavin Shan
On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
>The existing implementation accounts the whole DMA window in
>the locked_vm counter. This is going to be worse with multiple
>containers and huge DMA windows. Also, real-time accounting would requite
>additional tracking of accounted pages due to the page size difference -
>IOMMU uses 4K pages and system uses 4K or 64K pages.
>
>Another issue is that actual pages pinning/unpinning happens on every
>DMA map/unmap request. This does not affect the performance much now as
>we spend way too much time now on switching context between
>guest/userspace/host but this will start to matter when we add in-kernel
>DMA map/unmap acceleration.
>
>This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
>New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
>2 new ioctls to register/unregister DMA memory -
>VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
>which receive user space address and size of a memory region which
>needs to be pinned/unpinned and counted in locked_vm.
>New IOMMU splits physical pages pinning and TCE table update
>into 2 different operations. It requires:
>1) guest pages to be registered first
>2) consequent map/unmap requests to work only with pre-registered memory.
>For the default single window case this means that the entire guest
>(instead of 2GB) needs to be pinned before using VFIO.
>When a huge DMA window is added, no additional pinning will be
>required, otherwise it would be guest RAM + 2GB.
>
>The new memory registration ioctls are not supported by
>VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
>will require memory to be preregistered in order to work.
>
>The accounting is done per the user process.
>
>This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
>can do with v1 or v2 IOMMUs.
>
>In order to support memory pre-registration, we need a way to track
>the use of every registered memory region and only allow unregistration
>if a region is not in use anymore. So we need a way to tell from what
>region the just cleared TCE was from.
>
>This adds a userspace view of the TCE table into iommu_table struct.
>It contains userspace address, one per TCE entry. The table is only
>allocated when the ownership over an IOMMU group is taken which means
>it is only used from outside of the powernv code (such as VFIO).
>
>Signed-off-by: Alexey Kardashevskiy 
>[aw: for the vfio related changes]
>Acked-by: Alex Williamson 
>---
>Changes:
>v11:
>* mm_iommu_put() does not return a code so this does not check it
>* moved "v2" in tce_container to pack the struct
>
>v10:
>* moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
>specific thing
>* squashed "powerpc/iommu: Add userspace view of TCE table" into this as
>it is
>a part of IOMMU v2
>* s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
>* fixed some function names to have "tce_iommu_" in the beginning rather
>just "tce_"
>* as mm_iommu_mapped_inc() can now fail, check for the return code
>
>v9:
>* s/tce_get_hva_cached/tce_iommu_use_page_v2/
>
>v7:
>* now memory is registered per mm (i.e. process)
>* moved memory registration code to powerpc/mmu
>* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
>* limited new ioctls to v2 IOMMU
>* updated doc
>* unsupported ioclts return -ENOTTY instead of -EPERM
>
>v6:
>* tce_get_hva_cached() returns hva via a pointer
>
>v4:
>* updated docs
>* s/kzmalloc/vzalloc/
>* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
>replaced offset with index
>* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
>and removed duplicating vfio_iommu_spapr_register_memory
>---
> Documentation/vfio.txt  |  31 ++-
> arch/powerpc/include/asm/iommu.h|   6 +
> drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++--
> include/uapi/linux/vfio.h   |  27 ++
> 4 files changed, 487 insertions(+), 89 deletions(-)
>
>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
>index 96978ec..7dcf2b5 100644
>--- a/Documentation/vfio.txt
>+++ b/Documentation/vfio.txt
>@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>
> This implementation has some specifics:
>
>-1) Only one IOMMU group per container is supported as an IOMMU group
>-represents the minimal entity which isolation can be guaranteed for and
>-groups are allocated statically, one per a Partitionable Endpoint (PE)
>+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
>+container is supported as an IOMMU table is allocated at the boot time,
>+one table per a IOMMU group which is a Partitionable Endpoint (PE)
> (PE is often a PCI domain but not always).
>+Newer systems (POWER8 with IODA2) have improved hardware design which allows
>+to remove this limitation and have multiple IOMMU groups per a VFIO container.
>
> 2) The hardware supports so called DMA windows - the PCI address range
> within which 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-04 Thread Gavin Shan
On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
[aw: for the vfio related changes]
Acked-by: Alex Williamson alex.william...@redhat.com
---
Changes:
v11:
* mm_iommu_put() does not return a code so this does not check it
* moved v2 in tce_container to pack the struct

v10:
* moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
specific thing
* squashed powerpc/iommu: Add userspace view of TCE table into this as
it is
a part of IOMMU v2
* s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
* fixed some function names to have tce_iommu_ in the beginning rather
just tce_
* as mm_iommu_mapped_inc() can now fail, check for the return code

v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged vfio: powerpc/spapr: Define v2 IOMMU into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
 Documentation/vfio.txt  |  31 ++-
 arch/powerpc/include/asm/iommu.h|   6 +
 drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++--
 include/uapi/linux/vfio.h   |  27 ++
 4 files changed, 487 insertions(+), 89 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note

 This implementation has some specifics:

-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
 (PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have improved hardware design which allows
+to remove this limitation and have multiple IOMMU groups per a VFIO container.

 2) The hardware supports so called DMA windows - the PCI address range
 within which DMA transfer is allowed, any attempt to access address space
@@ -427,6 +429,29 @@ 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-03 Thread David Gibson
On Wed, Jun 03, 2015 at 09:40:49PM +1000, Alexey Kardashevskiy wrote:
> On 06/02/2015 02:17 PM, David Gibson wrote:
> >On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
> >>The existing implementation accounts the whole DMA window in
> >>the locked_vm counter. This is going to be worse with multiple
> >>containers and huge DMA windows. Also, real-time accounting would requite
> >>additional tracking of accounted pages due to the page size difference -
> >>IOMMU uses 4K pages and system uses 4K or 64K pages.
> >>
> >>Another issue is that actual pages pinning/unpinning happens on every
> >>DMA map/unmap request. This does not affect the performance much now as
> >>we spend way too much time now on switching context between
> >>guest/userspace/host but this will start to matter when we add in-kernel
> >>DMA map/unmap acceleration.
> >>
> >>This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> >>New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> >>2 new ioctls to register/unregister DMA memory -
> >>VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> >>which receive user space address and size of a memory region which
> >>needs to be pinned/unpinned and counted in locked_vm.
> >>New IOMMU splits physical pages pinning and TCE table update
> >>into 2 different operations. It requires:
> >>1) guest pages to be registered first
> >>2) consequent map/unmap requests to work only with pre-registered memory.
> >>For the default single window case this means that the entire guest
> >>(instead of 2GB) needs to be pinned before using VFIO.
> >>When a huge DMA window is added, no additional pinning will be
> >>required, otherwise it would be guest RAM + 2GB.
> >>
> >>The new memory registration ioctls are not supported by
> >>VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> >>will require memory to be preregistered in order to work.
> >>
> >>The accounting is done per the user process.
> >>
> >>This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> >>can do with v1 or v2 IOMMUs.
> >>
> >>In order to support memory pre-registration, we need a way to track
> >>the use of every registered memory region and only allow unregistration
> >>if a region is not in use anymore. So we need a way to tell from what
> >>region the just cleared TCE was from.
> >>
> >>This adds a userspace view of the TCE table into iommu_table struct.
> >>It contains userspace address, one per TCE entry. The table is only
> >>allocated when the ownership over an IOMMU group is taken which means
> >>it is only used from outside of the powernv code (such as VFIO).
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>[aw: for the vfio related changes]
> >>Acked-by: Alex Williamson 
> >>---
> >>Changes:
> >>v11:
> >>* mm_iommu_put() does not return a code so this does not check it
> >>* moved "v2" in tce_container to pack the struct
> >>
> >>v10:
> >>* moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
> >>specific thing
> >>* squashed "powerpc/iommu: Add userspace view of TCE table" into this as
> >>it is
> >>a part of IOMMU v2
> >>* s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
> >>* fixed some function names to have "tce_iommu_" in the beginning rather
> >>just "tce_"
> >>* as mm_iommu_mapped_inc() can now fail, check for the return code
> >>
> >>v9:
> >>* s/tce_get_hva_cached/tce_iommu_use_page_v2/
> >>
> >>v7:
> >>* now memory is registered per mm (i.e. process)
> >>* moved memory registration code to powerpc/mmu
> >>* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> >>* limited new ioctls to v2 IOMMU
> >>* updated doc
> >>* unsupported ioclts return -ENOTTY instead of -EPERM
> >>
> >>v6:
> >>* tce_get_hva_cached() returns hva via a pointer
> >>
> >>v4:
> >>* updated docs
> >>* s/kzmalloc/vzalloc/
> >>* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> >>replaced offset with index
> >>* renamed vfio_iommu_type_register_memory to 
> >>vfio_iommu_spapr_register_memory
> >>and removed duplicating vfio_iommu_spapr_register_memory
> >>---
> >>  Documentation/vfio.txt  |  31 ++-
> >>  arch/powerpc/include/asm/iommu.h|   6 +
> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 512 
> >> ++--
> >>  include/uapi/linux/vfio.h   |  27 ++
> >>  4 files changed, 487 insertions(+), 89 deletions(-)
> >>
> >>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >>index 96978ec..7dcf2b5 100644
> >>--- a/Documentation/vfio.txt
> >>+++ b/Documentation/vfio.txt
> >>@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
> >>
> >>  This implementation has some specifics:
> >>
> >>-1) Only one IOMMU group per container is supported as an IOMMU group
> >>-represents the minimal entity which isolation can be guaranteed for and
> >>-groups are allocated statically, one per a Partitionable Endpoint (PE)
> >>+1) On older systems (POWER7 with P5IOC2/IODA1) only one 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-03 Thread Alexey Kardashevskiy

On 06/02/2015 02:17 PM, David Gibson wrote:

On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
---
Changes:
v11:
* mm_iommu_put() does not return a code so this does not check it
* moved "v2" in tce_container to pack the struct

v10:
* moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
specific thing
* squashed "powerpc/iommu: Add userspace view of TCE table" into this as
it is
a part of IOMMU v2
* s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
* fixed some function names to have "tce_iommu_" in the beginning rather
just "tce_"
* as mm_iommu_mapped_inc() can now fail, check for the return code

v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  31 ++-
  arch/powerpc/include/asm/iommu.h|   6 +
  drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++--
  include/uapi/linux/vfio.h   |  27 ++
  4 files changed, 487 insertions(+), 89 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note

  This implementation has some specifics:

-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
  (PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have improved hardware design which allows
+to remove this limitation and have multiple IOMMU groups per a VFIO container.

  2) The hardware supports so called DMA windows - the PCI address range
  within which DMA transfer is allowed, any attempt to access address 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-03 Thread Alexey Kardashevskiy

On 06/02/2015 02:17 PM, David Gibson wrote:

On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
[aw: for the vfio related changes]
Acked-by: Alex Williamson alex.william...@redhat.com
---
Changes:
v11:
* mm_iommu_put() does not return a code so this does not check it
* moved v2 in tce_container to pack the struct

v10:
* moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
specific thing
* squashed powerpc/iommu: Add userspace view of TCE table into this as
it is
a part of IOMMU v2
* s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
* fixed some function names to have tce_iommu_ in the beginning rather
just tce_
* as mm_iommu_mapped_inc() can now fail, check for the return code

v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged vfio: powerpc/spapr: Define v2 IOMMU into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  31 ++-
  arch/powerpc/include/asm/iommu.h|   6 +
  drivers/vfio/vfio_iommu_spapr_tce.c | 512 ++--
  include/uapi/linux/vfio.h   |  27 ++
  4 files changed, 487 insertions(+), 89 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note

  This implementation has some specifics:

-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
  (PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have improved hardware design which allows
+to remove this limitation and have multiple IOMMU groups per a VFIO container.

  2) The hardware supports so called DMA windows - the PCI address range
  within which DMA transfer is allowed, 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-03 Thread David Gibson
On Wed, Jun 03, 2015 at 09:40:49PM +1000, Alexey Kardashevskiy wrote:
 On 06/02/2015 02:17 PM, David Gibson wrote:
 On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would requite
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update
 into 2 different operations. It requires:
 1) guest pages to be registered first
 2) consequent map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 In order to support memory pre-registration, we need a way to track
 the use of every registered memory region and only allow unregistration
 if a region is not in use anymore. So we need a way to tell from what
 region the just cleared TCE was from.
 
 This adds a userspace view of the TCE table into iommu_table struct.
 It contains userspace address, one per TCE entry. The table is only
 allocated when the ownership over an IOMMU group is taken which means
 it is only used from outside of the powernv code (such as VFIO).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v11:
 * mm_iommu_put() does not return a code so this does not check it
 * moved v2 in tce_container to pack the struct
 
 v10:
 * moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
 specific thing
 * squashed powerpc/iommu: Add userspace view of TCE table into this as
 it is
 a part of IOMMU v2
 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
 * fixed some function names to have tce_iommu_ in the beginning rather
 just tce_
 * as mm_iommu_mapped_inc() can now fail, check for the return code
 
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to 
 vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
   Documentation/vfio.txt  |  31 ++-
   arch/powerpc/include/asm/iommu.h|   6 +
   drivers/vfio/vfio_iommu_spapr_tce.c | 512 
  ++--
   include/uapi/linux/vfio.h   |  27 ++
   4 files changed, 487 insertions(+), 89 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..7dcf2b5 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
 
   This implementation has some specifics:
 
 -1) Only one IOMMU group per container is supported as an IOMMU group
 -represents the minimal entity which isolation can be guaranteed for and
 -groups are allocated statically, one per a Partitionable Endpoint (PE)
 +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
 +container is supported as an IOMMU table is allocated at the boot time,
 +one table per a IOMMU group which is a Partitionable Endpoint (PE)
   (PE is often a PCI domain but not always).
 +Newer systems (POWER8 with IODA2) have improved hardware design which 
 allows
 +to 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
> The existing implementation accounts the whole DMA window in
> the locked_vm counter. This is going to be worse with multiple
> containers and huge DMA windows. Also, real-time accounting would requite
> additional tracking of accounted pages due to the page size difference -
> IOMMU uses 4K pages and system uses 4K or 64K pages.
> 
> Another issue is that actual pages pinning/unpinning happens on every
> DMA map/unmap request. This does not affect the performance much now as
> we spend way too much time now on switching context between
> guest/userspace/host but this will start to matter when we add in-kernel
> DMA map/unmap acceleration.
> 
> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> 2 new ioctls to register/unregister DMA memory -
> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> which receive user space address and size of a memory region which
> needs to be pinned/unpinned and counted in locked_vm.
> New IOMMU splits physical pages pinning and TCE table update
> into 2 different operations. It requires:
> 1) guest pages to be registered first
> 2) consequent map/unmap requests to work only with pre-registered memory.
> For the default single window case this means that the entire guest
> (instead of 2GB) needs to be pinned before using VFIO.
> When a huge DMA window is added, no additional pinning will be
> required, otherwise it would be guest RAM + 2GB.
> 
> The new memory registration ioctls are not supported by
> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> will require memory to be preregistered in order to work.
> 
> The accounting is done per the user process.
> 
> This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> can do with v1 or v2 IOMMUs.
> 
> In order to support memory pre-registration, we need a way to track
> the use of every registered memory region and only allow unregistration
> if a region is not in use anymore. So we need a way to tell from what
> region the just cleared TCE was from.
> 
> This adds a userspace view of the TCE table into iommu_table struct.
> It contains userspace address, one per TCE entry. The table is only
> allocated when the ownership over an IOMMU group is taken which means
> it is only used from outside of the powernv code (such as VFIO).
> 
> Signed-off-by: Alexey Kardashevskiy 
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson 
> ---
> Changes:
> v11:
> * mm_iommu_put() does not return a code so this does not check it
> * moved "v2" in tce_container to pack the struct
> 
> v10:
> * moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
> specific thing
> * squashed "powerpc/iommu: Add userspace view of TCE table" into this as
> it is
> a part of IOMMU v2
> * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
> * fixed some function names to have "tce_iommu_" in the beginning rather
> just "tce_"
> * as mm_iommu_mapped_inc() can now fail, check for the return code
> 
> v9:
> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
> 
> v7:
> * now memory is registered per mm (i.e. process)
> * moved memory registration code to powerpc/mmu
> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> * limited new ioctls to v2 IOMMU
> * updated doc
> * unsupported ioclts return -ENOTTY instead of -EPERM
> 
> v6:
> * tce_get_hva_cached() returns hva via a pointer
> 
> v4:
> * updated docs
> * s/kzmalloc/vzalloc/
> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> replaced offset with index
> * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
> and removed duplicating vfio_iommu_spapr_register_memory
> ---
>  Documentation/vfio.txt  |  31 ++-
>  arch/powerpc/include/asm/iommu.h|   6 +
>  drivers/vfio/vfio_iommu_spapr_tce.c | 512 
> ++--
>  include/uapi/linux/vfio.h   |  27 ++
>  4 files changed, 487 insertions(+), 89 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 96978ec..7dcf2b5 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
>  
>  This implementation has some specifics:
>  
> -1) Only one IOMMU group per container is supported as an IOMMU group
> -represents the minimal entity which isolation can be guaranteed for and
> -groups are allocated statically, one per a Partitionable Endpoint (PE)
> +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
> +container is supported as an IOMMU table is allocated at the boot time,
> +one table per a IOMMU group which is a Partitionable Endpoint (PE)
>  (PE is often a PCI domain but not always).
> +Newer systems (POWER8 with IODA2) have improved hardware design which allows
> +to remove this limitation and have multiple IOMMU 

Re: [PATCH kernel v11 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-01 Thread David Gibson
On Fri, May 29, 2015 at 06:44:57PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would requite
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update
 into 2 different operations. It requires:
 1) guest pages to be registered first
 2) consequent map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 In order to support memory pre-registration, we need a way to track
 the use of every registered memory region and only allow unregistration
 if a region is not in use anymore. So we need a way to tell from what
 region the just cleared TCE was from.
 
 This adds a userspace view of the TCE table into iommu_table struct.
 It contains userspace address, one per TCE entry. The table is only
 allocated when the ownership over an IOMMU group is taken which means
 it is only used from outside of the powernv code (such as VFIO).
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v11:
 * mm_iommu_put() does not return a code so this does not check it
 * moved v2 in tce_container to pack the struct
 
 v10:
 * moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
 specific thing
 * squashed powerpc/iommu: Add userspace view of TCE table into this as
 it is
 a part of IOMMU v2
 * s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
 * fixed some function names to have tce_iommu_ in the beginning rather
 just tce_
 * as mm_iommu_mapped_inc() can now fail, check for the return code
 
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
  Documentation/vfio.txt  |  31 ++-
  arch/powerpc/include/asm/iommu.h|   6 +
  drivers/vfio/vfio_iommu_spapr_tce.c | 512 
 ++--
  include/uapi/linux/vfio.h   |  27 ++
  4 files changed, 487 insertions(+), 89 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..7dcf2b5 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
  
  This implementation has some specifics:
  
 -1) Only one IOMMU group per container is supported as an IOMMU group
 -represents the minimal entity which isolation can be guaranteed for and
 -groups are allocated statically, one per a Partitionable Endpoint (PE)
 +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
 +container is supported as an IOMMU table is allocated at the boot time,
 +one table per a IOMMU group which is a Partitionable Endpoint (PE)
  (PE is often a PCI domain but not always).
 +Newer systems (POWER8 with IODA2) have improved hardware design which allows
 +to remove this limitation and have multiple IOMMU groups per a VFIO 
 container.
  
  2) The hardware supports so called DMA windows -