Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-05-10 Thread Alexey Kardashevskiy

On 05/11/2015 12:11 PM, Alexey Kardashevskiy wrote:
> On 05/05/2015 10:02 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
> >>On 05/01/2015 02:23 PM, David Gibson wrote:
> >>>On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> >>>>On 04/29/2015 04:31 PM, David Gibson wrote:
> >>>>>On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>>>>>In order to support memory pre-registration, we need a way to track
> >>>>>>the use of every registered memory region and only allow unregistration
> >>>>>>if a region is not in use anymore. So we need a way to tell from what
> >>>>>>region the just cleared TCE was from.
> >>>>>>
> >>>>>>This adds a userspace view of the TCE table into iommu_table struct.
> >>>>>>It contains userspace address, one per TCE entry. The table is only
> >>>>>>allocated when the ownership over an IOMMU group is taken which means
> >>>>>>it is only used from outside of the powernv code (such as VFIO).
> >>>>>>
> >>>>>>Signed-off-by: Alexey Kardashevskiy
> >>>>>>---
> >>>>>>Changes:
> >>>>>>v9:
> >>>>>>* fixed code flow in error cases added in v8
> >>>>>>
> >>>>>>v8:
> >>>>>>* added ENOMEM on failed vzalloc()
> >>>>>>---
> >>>>>>  arch/powerpc/include/asm/iommu.h  |  6 ++
> >>>>>>  arch/powerpc/kernel/iommu.c   | 18 ++
> >>>>>>  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
> >>>>>>  3 files changed, 44 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>>>index 7694546..1472de3 100644
> >>>>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>>>@@ -111,9 +111,15 @@ struct iommu_table {
> >>>>>>unsigned long *it_map;   /* A simple allocation bitmap for now */
> >>>>>>unsigned long  it_page_shift;/* table iommu page size */
> >>>>>>struct iommu_table_group *it_table_group;
> >>>>>>+   unsigned long *it_userspace; /* userspace view of the table */
> >>>>>
> >>>>>A single unsigned long doesn't seem like enough.
> >>>>
> >>>>Why single? This is an array.
> >>>
> >>>As in single per page.
> >>
> >>Sorry, I am not following you here.
> >>It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed
> >>with either system page or a huge page.
> >>
> >>>>>How do you know
> >>>>>which process's address space this address refers to?
> >>>>
> >>>>It is a current task. Multiple userspaces cannot use the same
> >>>>container/tables.
> >>>
> >>>Where is that enforced?
> >>
> >>It is accessed from VFIO DMA map/unmap which are ioctls() to a container's
> >>fd which is per a process.
> >
> >Usually, but what enforces that.  If you open a container fd, then
> >fork(), and attempt to map from both parent and child, what happens?
>
> vfio_group_fops::open() checks if the group is already opened, and I want
> to believe open() is called from fork() for new fd so no mapping can happen
> later.

I am wrong here. Nothing prevents multiple userspace from using the same
container. It still does not seem really dangerous as in order to use VFIO,
someone with the root privilege should set right permissions on /dev/vfio*
first anyway and that person knows what QEMU does and what QEMU does not :)

I could add pid into iommu_table, next to it_userspace, and fail when other
pid is trying to change the it_userspace table. Not sure if I want to do
this check in realmode though (performance). Or make sure somehow that
fork() closes container and group fd's (but how?). In the worst case, wrong
userspace page will be put and there will be random backtraces on the host
kernel. What would you do?



--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
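
The "add pid next to it_userspace and fail for any other pid" idea from the message above can be sketched in plain userspace C. This is only a model under stated assumptions: struct uview, uview_set() and the "first caller claims the table" policy are hypothetical names and choices made for illustration, not anything from the patch, and a real kernel check might well compare something other than a pid.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical userspace model of a table with an owner check. */
struct uview {
	pid_t owner;             /* 0 means nobody has claimed the table yet */
	unsigned long *entries;  /* one userspace address per IOMMU page */
	size_t nentries;
};

/*
 * Record a userspace address for one entry.
 * Returns 0 on success, -EINVAL for an out-of-range index and
 * -EPERM when a caller other than the owner tries to change the table.
 */
int uview_set(struct uview *v, pid_t caller, size_t idx, unsigned long uaddr)
{
	assert(v && v->entries);

	if (idx >= v->nentries)
		return -EINVAL;

	if (v->owner == 0)
		v->owner = caller;       /* assumption: first user becomes the owner */
	else if (v->owner != caller)
		return -EPERM;           /* e.g. a forked child sharing the same fd */

	v->entries[idx] = uaddr;
	return 0;
}
```

With this shape a second process that inherited the descriptor gets an error instead of silently overwriting entries; the cost is one comparison per update, which is the realmode performance concern raised above.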


Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-05-10 Thread Alexey Kardashevskiy

On 05/05/2015 10:02 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
> >On 05/01/2015 02:23 PM, David Gibson wrote:
> >>On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> >>>On 04/29/2015 04:31 PM, David Gibson wrote:
> >>>>On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>>>>In order to support memory pre-registration, we need a way to track
> >>>>>the use of every registered memory region and only allow unregistration
> >>>>>if a region is not in use anymore. So we need a way to tell from what
> >>>>>region the just cleared TCE was from.
> >>>>>
> >>>>>This adds a userspace view of the TCE table into iommu_table struct.
> >>>>>It contains userspace address, one per TCE entry. The table is only
> >>>>>allocated when the ownership over an IOMMU group is taken which means
> >>>>>it is only used from outside of the powernv code (such as VFIO).
> >>>>>
> >>>>>Signed-off-by: Alexey Kardashevskiy
> >>>>>---
> >>>>>Changes:
> >>>>>v9:
> >>>>>* fixed code flow in error cases added in v8
> >>>>>
> >>>>>v8:
> >>>>>* added ENOMEM on failed vzalloc()
> >>>>>---
> >>>>>  arch/powerpc/include/asm/iommu.h  |  6 ++
> >>>>>  arch/powerpc/kernel/iommu.c   | 18 ++
> >>>>>  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
> >>>>>  3 files changed, 44 insertions(+), 2 deletions(-)
> >>>>>
> >>>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>>index 7694546..1472de3 100644
> >>>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>>@@ -111,9 +111,15 @@ struct iommu_table {
> >>>>>unsigned long *it_map;   /* A simple allocation bitmap for now */
> >>>>>unsigned long  it_page_shift;/* table iommu page size */
> >>>>>struct iommu_table_group *it_table_group;
> >>>>>+   unsigned long *it_userspace; /* userspace view of the table */
> >>>>
> >>>>A single unsigned long doesn't seem like enough.
> >>>
> >>>Why single? This is an array.
> >>
> >>As in single per page.
> >
> >Sorry, I am not following you here.
> >It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed
> >with either system page or a huge page.
> >
> >>>>How do you know
> >>>>which process's address space this address refers to?
> >>>
> >>>It is a current task. Multiple userspaces cannot use the same container/tables.
> >>
> >>Where is that enforced?
> >
> >It is accessed from VFIO DMA map/unmap which are ioctls() to a container's
> >fd which is per a process.
>
> Usually, but what enforces that.  If you open a container fd, then
> fork(), and attempt to map from both parent and child, what happens?

vfio_group_fops::open() checks if the group is already opened, and I want
to believe open() is called from fork() for new fd so no mapping can happen
later.



--
Alexey
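
The fork() question can be checked from userspace without VFIO at all. The sketch below is an assumption-laden stand-in: a pipe plays the role of the container fd and child_can_use_inherited_fd() is a made-up helper name. It shows that a child can use a descriptor it inherited from its parent without ever calling open() itself, so an open()-time check does not run again for the child.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child managed to write through a descriptor that
 * only the parent created, 0 if it did not, and -1 on a setup error.
 */
int child_can_use_inherited_fd(void)
{
	int fds[2];
	int status = 0;
	char got = 0;
	ssize_t n;
	pid_t pid;

	if (pipe(fds) < 0)
		return -1;
	assert(fds[0] >= 0 && fds[1] >= 0);

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: no open() or pipe() call here, the fd came with fork(). */
		char byte = 'x';

		close(fds[0]);
		_exit(write(fds[1], &byte, 1) == 1 ? 0 : 1);
	}

	/* Parent: read what the child wrote through the shared descriptor. */
	close(fds[1]);
	n = read(fds[0], &got, 1);
	close(fds[0]);

	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return n == 1 && got == 'x' &&
	       WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The same sharing applies to any descriptor, which is why a check performed only in the driver's open() handler cannot by itself keep a forked child away from the container.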


Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-05-05 Thread David Gibson
On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 02:23 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/29/2015 04:31 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>>>In order to support memory pre-registration, we need a way to track
> >>>>the use of every registered memory region and only allow unregistration
> >>>>if a region is not in use anymore. So we need a way to tell from what
> >>>>region the just cleared TCE was from.
> >>>>
> >>>>This adds a userspace view of the TCE table into iommu_table struct.
> >>>>It contains userspace address, one per TCE entry. The table is only
> >>>>allocated when the ownership over an IOMMU group is taken which means
> >>>>it is only used from outside of the powernv code (such as VFIO).
> >>>>
> >>>>Signed-off-by: Alexey Kardashevskiy
> >>>>---
> >>>>Changes:
> >>>>v9:
> >>>>* fixed code flow in error cases added in v8
> >>>>
> >>>>v8:
> >>>>* added ENOMEM on failed vzalloc()
> >>>>---
> >>>>  arch/powerpc/include/asm/iommu.h  |  6 ++
> >>>>  arch/powerpc/kernel/iommu.c   | 18 ++
> >>>>  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
> >>>>  3 files changed, 44 insertions(+), 2 deletions(-)
> >>>>
> >>>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>>index 7694546..1472de3 100644
> >>>>--- a/arch/powerpc/include/asm/iommu.h
> >>>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>>@@ -111,9 +111,15 @@ struct iommu_table {
> >>>>unsigned long *it_map;   /* A simple allocation bitmap for now */
> >>>>unsigned long  it_page_shift;/* table iommu page size */
> >>>>struct iommu_table_group *it_table_group;
> >>>>+   unsigned long *it_userspace; /* userspace view of the table */
> >>>
> >>>A single unsigned long doesn't seem like enough.
> >>
> >>Why single? This is an array.
> >
> >As in single per page.
> 
> 
> Sorry, I am not following you here.
> It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed
> with either system page or a huge page.
> 
> 
> >
> >>>How do you know
> >>>which process's address space this address refers to?
> >>
> >>It is a current task. Multiple userspaces cannot use the same 
> >>container/tables.
> >
> >Where is that enforced?
> 
> 
> It is accessed from VFIO DMA map/unmap which are ioctls() to a container's
> fd which is per a process.

Usually, but what enforces that.  If you open a container fd, then
fork(), and attempt to map from both parent and child, what happens?

> Same for KVM - when it registers IOMMU groups in
> KVM, fd's of opened IOMMU groups are passed there. Or I did not understand
> the question...
> 
> 
> >More to the point, that's a VFIO constraint, but it's here affecting
> >the design of a structure owned by the platform code.
> 
> Right. But keeping in mind KVM, I cannot think of any better design here.
> 
> 
> >[snip]
> >>>>  static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>>>@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> >>>>    int nid = pe->phb->hose->node;
> >>>>    __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> >>>>    long ret;
> >>>>+   unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
> >>>>+
> >>>>+   uas = vzalloc(uas_cb);
> >>>>+   if (!uas)
> >>>>+           return -ENOMEM;
> >>>
> >>>I don't see why this is allocated both here as well as in
> >>>take_ownership.
> >>
> >>Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
> >>want to touch iommu_table fields there.
> >
> >Well to put it another way, why isn't take_ownership calling create
> >itself (or at least a common helper).
> 
> I am trying to keep DDW stuff away from platform-oriented
> arch/powerpc/kernel/iommu.c which main purpose is to implement
> iommu_alloc()&co. It already has
> 
> I'd rather move it_userspace allocation completely to vfio_iommu_spapr_tce
> (should have done earlier, actually), would this be ok?

Yeah, that makes more sense to me.

> >Clearly the it_userspace table needs to have lifetime which matches
> >the TCE table itself, so there should be a single function that marks
> >the beginning of that joint lifetime.
> 
> 
> No. it_userspace lives as long as the platform code does not control the
> table. For IODA2 it is equal for the lifetime of the table, for IODA1/P5IOC2
> it is not.

Right, I was imprecise.  I was thinking of the ownership change as an
end/beginning of lifetime even for IODA1, because the table has to be
fully cleared at that point, even though it's not actually
reallocated.

> >>>Isn't this function used for core-kernel users of the
> >>>iommu as well, in which case it shouldn't need the it_userspace.
> >>
> >>
> >>No. This is an iommu_table_group_ops callback which calls what the platform
> >>code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
> >>The callback is only called from VFIO.
> >
> >Ok.
> >
> >As touched on above it seems more like this should be owned by VFIO
> >code than the platform code.
>
> Agree now :) I'll move the allocation to VFIO. Thanks!

-- 
David Gibson| 
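
The lifetime rule discussed above (it_userspace exists only while the platform code does not control the table, with the ownership change marking its beginning and end) can be modelled in a few lines of userspace C. This is a sketch under assumptions: struct table_model and both function names are hypothetical, and calloc()/free() merely stand in for whatever allocator the real code would use.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical model: the userspace view exists only while "owned". */
struct table_model {
	size_t nentries;
	unsigned long *userspace;  /* NULL while the platform code owns the table */
};

/* Start of the joint lifetime: allocate a zeroed view, one slot per entry. */
int model_take_ownership(struct table_model *t)
{
	assert(t && t->nentries > 0);

	if (t->userspace)
		return -EBUSY;         /* already taken */

	t->userspace = calloc(t->nentries, sizeof(*t->userspace));
	return t->userspace ? 0 : -ENOMEM;
}

/* End of the lifetime: the view goes away when ownership is returned. */
void model_release_ownership(struct table_model *t)
{
	assert(t);

	free(t->userspace);
	t->userspace = NULL;
}
```

Keeping allocation and release in one pair of functions gives the single begin/end point asked for in the thread, regardless of whether the hardware table itself is reallocated at that moment.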

Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-05-01 Thread Alexey Kardashevskiy

On 05/01/2015 02:23 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> >On 04/29/2015 04:31 PM, David Gibson wrote:
> >>On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>>In order to support memory pre-registration, we need a way to track
> >>>the use of every registered memory region and only allow unregistration
> >>>if a region is not in use anymore. So we need a way to tell from what
> >>>region the just cleared TCE was from.
> >>>
> >>>This adds a userspace view of the TCE table into iommu_table struct.
> >>>It contains userspace address, one per TCE entry. The table is only
> >>>allocated when the ownership over an IOMMU group is taken which means
> >>>it is only used from outside of the powernv code (such as VFIO).
> >>>
> >>>Signed-off-by: Alexey Kardashevskiy
> >>>---
> >>>Changes:
> >>>v9:
> >>>* fixed code flow in error cases added in v8
> >>>
> >>>v8:
> >>>* added ENOMEM on failed vzalloc()
> >>>---
> >>>  arch/powerpc/include/asm/iommu.h  |  6 ++
> >>>  arch/powerpc/kernel/iommu.c   | 18 ++
> >>>  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
> >>>  3 files changed, 44 insertions(+), 2 deletions(-)
> >>>
> >>>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>>index 7694546..1472de3 100644
> >>>--- a/arch/powerpc/include/asm/iommu.h
> >>>+++ b/arch/powerpc/include/asm/iommu.h
> >>>@@ -111,9 +111,15 @@ struct iommu_table {
> >>>unsigned long *it_map;   /* A simple allocation bitmap for now */
> >>>unsigned long  it_page_shift;/* table iommu page size */
> >>>struct iommu_table_group *it_table_group;
> >>>+   unsigned long *it_userspace; /* userspace view of the table */
> >>
> >>A single unsigned long doesn't seem like enough.
> >
> >Why single? This is an array.
>
> As in single per page.

Sorry, I am not following you here.
It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully
backed with either system page or a huge page.

> >>How do you know
> >>which process's address space this address refers to?
> >
> >It is a current task. Multiple userspaces cannot use the same container/tables.
>
> Where is that enforced?

It is accessed from VFIO DMA map/unmap which are ioctls() to a container's
fd which is per a process. Same for KVM - when it registers IOMMU groups in
KVM, fd's of opened IOMMU groups are passed there. Or I did not understand
the question...

> More to the point, that's a VFIO constraint, but it's here affecting
> the design of a structure owned by the platform code.

Right. But keeping in mind KVM, I cannot think of any better design here.

> [snip]
> >>>  static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>>@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> >>>    int nid = pe->phb->hose->node;
> >>>    __u64 bus_offset = num ? pe->tce_bypass_base : 0;
> >>>    long ret;
> >>>+   unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
> >>>+
> >>>+   uas = vzalloc(uas_cb);
> >>>+   if (!uas)
> >>>+           return -ENOMEM;
> >>
> >>I don't see why this is allocated both here as well as in
> >>take_ownership.
> >
> >Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
> >want to touch iommu_table fields there.
>
> Well to put it another way, why isn't take_ownership calling create
> itself (or at least a common helper).

I am trying to keep DDW stuff away from platform-oriented
arch/powerpc/kernel/iommu.c which main purpose is to implement
iommu_alloc()&co. It already has

I'd rather move it_userspace allocation completely to vfio_iommu_spapr_tce
(should have done earlier, actually), would this be ok?

> Clearly the it_userspace table needs to have lifetime which matches
> the TCE table itself, so there should be a single function that marks
> the beginning of that joint lifetime.

No. it_userspace lives as long as the platform code does not control the
table. For IODA2 it is equal for the lifetime of the table, for
IODA1/P5IOC2 it is not.

> >>Isn't this function used for core-kernel users of the
> >>iommu as well, in which case it shouldn't need the it_userspace.
> >
> >No. This is an iommu_table_group_ops callback which calls what the platform
> >code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
> >The callback is only called from VFIO.
>
> Ok.
>
> As touched on above it seems more like this should be owned by VFIO
> code than the platform code.

Agree now :) I'll move the allocation to VFIO. Thanks!


--
Alexey
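
The sizing expression in the quoted hunk, sizeof(*uas) * (window_size >> page_shift), allocates one unsigned long per IOMMU page in the DMA window. A userspace sketch of the same arithmetic follows; uas_entries() and uas_alloc() are hypothetical helper names, and calloc() is used only as a stand-in for a zeroing allocator such as vzalloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of IOMMU pages, and therefore of entries, in a DMA window. */
size_t uas_entries(uint64_t window_size, unsigned int page_shift)
{
	assert(page_shift < 64);
	return (size_t)(window_size >> page_shift);
}

/*
 * Allocate a zero-filled array with one slot per IOMMU page.
 * Returns NULL for an empty window or on allocation failure;
 * on success *nentries receives the entry count.
 */
unsigned long *uas_alloc(uint64_t window_size, unsigned int page_shift,
			 size_t *nentries)
{
	size_t n = uas_entries(window_size, page_shift);
	unsigned long *uas;

	if (n == 0)
		return NULL;

	/* calloc() zero-fills and rejects an overflowing n * size product. */
	uas = calloc(n, sizeof(*uas));
	if (uas)
		*nentries = n;
	return uas;
}
```

As a worked example, a 1 GiB window with 64 KiB IOMMU pages (page_shift of 16) needs 2^30 / 2^16 = 16384 entries, which is 128 KiB of unsigned long on an LP64 system.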


Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 04:31 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> >>In order to support memory pre-registration, we need a way to track
> >>the use of every registered memory region and only allow unregistration
> >>if a region is not in use anymore. So we need a way to tell from what
> >>region the just cleared TCE was from.
> >>
> >>This adds a userspace view of the TCE table into iommu_table struct.
> >>It contains userspace address, one per TCE entry. The table is only
> >>allocated when the ownership over an IOMMU group is taken which means
> >>it is only used from outside of the powernv code (such as VFIO).
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>---
> >>Changes:
> >>v9:
> >>* fixed code flow in error cases added in v8
> >>
> >>v8:
> >>* added ENOMEM on failed vzalloc()
> >>---
> >>  arch/powerpc/include/asm/iommu.h  |  6 ++
> >>  arch/powerpc/kernel/iommu.c   | 18 ++
> >>  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
> >>  3 files changed, 44 insertions(+), 2 deletions(-)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h 
> >>b/arch/powerpc/include/asm/iommu.h
> >>index 7694546..1472de3 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -111,9 +111,15 @@ struct iommu_table {
> >>unsigned long *it_map;   /* A simple allocation bitmap for now */
> >>unsigned long  it_page_shift;/* table iommu page size */
> >>struct iommu_table_group *it_table_group;
> >>+   unsigned long *it_userspace; /* userspace view of the table */
> >
> >A single unsigned long doesn't seem like enough.
> 
> Why single? This is an array.

As in single per page.

> > How do you know
> >which process's address space this address refers to?
> 
> It is a current task. Multiple userspaces cannot use the same 
> container/tables.

Where is that enforced?

More to the point, that's a VFIO constraint, but it's here affecting
the design of a structure owned by the platform code.

[snip]
> >>  static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
> >>@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct 
> >>iommu_table_group *table_group,
> >>int nid = pe->phb->hose->node;
> >>__u64 bus_offset = num ? pe->tce_bypass_base : 0;
> >>long ret;
> >>+   unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift);
> >>+
> >>+   uas = vzalloc(uas_cb);
> >>+   if (!uas)
> >>+   return -ENOMEM;
> >
> >I don't see why this is allocated both here as well as in
> >take_ownership.
> 
> Where else? The only alternative is vfio_iommu_spapr_tce but I really do not
> want to touch iommu_table fields there.

Well to put it another way, why isn't take_ownership calling create
itself (or at least a common helper).

Clearly the it_userspace table needs to have lifetime which matches
the TCE table itself, so there should be a single function that marks
the beginning of that joint lifetime.
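
A rough sketch of what such a common helper could look like, written as plain userspace C so it can be compiled on its own. The name demo_alloc_userspace_view() and its signature are hypothetical; only the sizing expression (window_size >> page_shift entries of unsigned long) comes from the patch, and calloc() stands in for vzalloc(). The overflow and zero-size checks are additions for the sketch, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a zeroed userspace view with one
 * unsigned long per IOMMU page of the window. Returns NULL on a bad
 * size or allocation failure; *nr_entries is written only when the
 * size is valid.
 */
static unsigned long *demo_alloc_userspace_view(uint64_t window_size,
						unsigned int page_shift,
						size_t *nr_entries)
{
	uint64_t n;

	if (page_shift >= 64)
		return NULL;	/* shifting by >= width is undefined */

	n = window_size >> page_shift;
	if (n == 0 || n > SIZE_MAX / sizeof(unsigned long))
		return NULL;	/* empty window or size would overflow */

	*nr_entries = (size_t)n;
	return calloc((size_t)n, sizeof(unsigned long));
}
```

Both the take-ownership path and the create-table path could call one function like this, so the size calculation and the failure handling live in a single place.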

> >Isn't this function used for core-kernel users of the
> >iommu as well, in which case it shouldn't need the it_userspace.
> 
> 
> No. This is an iommu_table_group_ops callback which calls what the platform
> code calls (pnv_pci_create_table()) plus allocates this it_userspace thing.
> The callback is only called from VFIO.

Ok.

As touched on above it seems more like this should be owned by VFIO
code than the platform code.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-30 Thread Alexey Kardashevskiy

On 04/29/2015 04:31 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v9:
* fixed code flow in error cases added in v8

v8:
* added ENOMEM on failed vzalloc()
---
  arch/powerpc/include/asm/iommu.h  |  6 ++
  arch/powerpc/kernel/iommu.c   | 18 ++
  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
  3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
struct iommu_table_group *it_table_group;
+   unsigned long *it_userspace; /* userspace view of the table */


A single unsigned long doesn't seem like enough.


Why single? This is an array.


 How do you know
which process's address space this address refers to?


It is a current task. Multiple userspaces cannot use the same container/tables.




struct iommu_table_ops *it_ops;
  };

+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+   ((tbl)->it_userspace ? \
+   &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
+   NULL)
+
  /* Pure 2^n version of get_order */
  static inline __attribute_const__
  int get_iommu_order(unsigned long size, struct iommu_table *tbl)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2eaba0c..74a3f52 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
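  #include <linux/pci.h>
  #include <linux/iommu.h>
  #include <linux/sched.h>
+#include <linux/vmalloc.h>
  #include <asm/io.h>
  #include <asm/prom.h>
  #include <asm/iommu.h>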
@@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char 
*node_name)
free_pages((unsigned long) tbl->it_map, order);
}

+   WARN_ON(tbl->it_userspace);
+
memset(tbl, 0, sizeof(*tbl));
  }

@@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
  {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
int ret = 0;
+   unsigned long *uas;

/*
 * VFIO does not control TCE entries allocation and the guest
@@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
if (!tbl->it_ops->exchange)
return -EINVAL;

+   uas = vzalloc(sizeof(*uas) * tbl->it_size);
+   if (!uas)
+   return -ENOMEM;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
@@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl->it_map, 0xff, sz);
}

+   if (ret) {
+   vfree(uas);
+   } else {
+   BUG_ON(tbl->it_userspace);
+   tbl->it_userspace = uas;
+   }
+
for (i = 0; i < tbl->nr_pools; i++)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
@@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
  {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;

+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 45bc131..e0be556 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
  #include <linux/memblock.h>
  #include <linux/iommu.h>
  #include <linux/sizes.h>
+#include <linux/vmalloc.h>

  #include <asm/sections.h>
  #include <asm/io.h>
@@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
  }

+void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
+{
+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+
+   pnv_pci_free_table(tbl);
+}
+
  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
  #ifdef CONFIG_IOMMU_API
@@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
  #endif
.clear = 

Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-29 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:
> In order to support memory pre-registration, we need a way to track
> the use of every registered memory region and only allow unregistration
> if a region is not in use anymore. So we need a way to tell from what
> region the just cleared TCE was from.
> 
> This adds a userspace view of the TCE table into iommu_table struct.
> It contains userspace address, one per TCE entry. The table is only
> allocated when the ownership over an IOMMU group is taken which means
> it is only used from outside of the powernv code (such as VFIO).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v9:
> * fixed code flow in error cases added in v8
> 
> v8:
> * added ENOMEM on failed vzalloc()
> ---
>  arch/powerpc/include/asm/iommu.h  |  6 ++
>  arch/powerpc/kernel/iommu.c   | 18 ++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
>  3 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index 7694546..1472de3 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -111,9 +111,15 @@ struct iommu_table {
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
>   unsigned long  it_page_shift;/* table iommu page size */
>   struct iommu_table_group *it_table_group;
> + unsigned long *it_userspace; /* userspace view of the table */

A single unsigned long doesn't seem like enough.  How do you know
which process's address space this address refers to?

>   struct iommu_table_ops *it_ops;
>  };
>  
> +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> + ((tbl)->it_userspace ? \
> + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
> + NULL)
> +
>  /* Pure 2^n version of get_order */
>  static inline __attribute_const__
>  int get_iommu_order(unsigned long size, struct iommu_table *tbl)
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 2eaba0c..74a3f52 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -38,6 +38,7 @@
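>  #include <linux/pci.h>
>  #include <linux/iommu.h>
>  #include <linux/sched.h>
> +#include <linux/vmalloc.h>
>  #include <asm/io.h>
>  #include <asm/prom.h>
>  #include <asm/iommu.h>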
> @@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const 
> char *node_name)
>   free_pages((unsigned long) tbl->it_map, order);
>   }
>  
> + WARN_ON(tbl->it_userspace);
> +
>   memset(tbl, 0, sizeof(*tbl));
>  }
>  
> @@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
>  {
>   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>   int ret = 0;
> + unsigned long *uas;
>  
>   /*
>* VFIO does not control TCE entries allocation and the guest
> @@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
>   if (!tbl->it_ops->exchange)
>   return -EINVAL;
>  
> + uas = vzalloc(sizeof(*uas) * tbl->it_size);
> + if (!uas)
> + return -ENOMEM;
> +
>   spin_lock_irqsave(&tbl->large_pool.lock, flags);
>   for (i = 0; i < tbl->nr_pools; i++)
>   spin_lock(&tbl->pools[i].lock);
> @@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
>   memset(tbl->it_map, 0xff, sz);
>   }
>  
> + if (ret) {
> + vfree(uas);
> + } else {
> + BUG_ON(tbl->it_userspace);
> + tbl->it_userspace = uas;
> + }
> +
>   for (i = 0; i < tbl->nr_pools; i++)
>   spin_unlock(&tbl->pools[i].lock);
>   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
> @@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
>  {
>   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
>  
> + vfree(tbl->it_userspace);
> + tbl->it_userspace = NULL;
> +
>   spin_lock_irqsave(&tbl->large_pool.lock, flags);
>   for (i = 0; i < tbl->nr_pools; i++)
>   spin_lock(&tbl->pools[i].lock);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 45bc131..e0be556 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -25,6 +25,7 @@
>  #include <linux/memblock.h>
>  #include <linux/iommu.h>
>  #include <linux/sizes.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/sections.h>
>  #include <asm/io.h>
> @@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table 
> *tbl, long index,
>   pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
>  }
>  
> +void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
> +{
> + vfree(tbl->it_userspace);
> + tbl->it_userspace = NULL;
> +
> + pnv_pci_free_table(tbl);
> +}
> +
>  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>   .set = pnv_ioda2_tce_build,
>  #ifdef CONFIG_IOMMU_API
> @@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
>  #endif
>   .clear = 

[PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-25 Thread Alexey Kardashevskiy
In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).
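
As an illustration of the "one userspace address per TCE entry" layout, here is a small userspace sketch, not the kernel code. struct demo_table and the demo_* functions are hypothetical; the lookup mirrors the shape of IOMMU_TABLE_USERSPACE_ENTRY() from this patch (index by entry minus it_offset, NULL when there is no view), with a bounds check added for the sketch that the macro does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the iommu_table fields the macro touches. */
struct demo_table {
	unsigned long it_offset;      /* first TCE index covered by the table */
	unsigned long it_size;        /* number of TCE entries */
	unsigned long *it_userspace;  /* one userspace address per entry, or NULL */
};

/* Return the slot for a TCE index, or NULL if no view or out of range. */
static unsigned long *demo_userspace_entry(struct demo_table *tbl,
					   unsigned long entry)
{
	if (!tbl->it_userspace)
		return NULL;
	if (entry < tbl->it_offset || entry - tbl->it_offset >= tbl->it_size)
		return NULL;
	return &tbl->it_userspace[entry - tbl->it_offset];
}

/* On map: remember which userspace address backs this TCE. */
static int demo_map(struct demo_table *tbl, unsigned long entry,
		    unsigned long ua)
{
	unsigned long *pua = demo_userspace_entry(tbl, entry);

	if (!pua)
		return -1;
	*pua = ua;
	return 0;
}

/* On unmap: return the remembered address (0 if none) and clear the slot. */
static unsigned long demo_unmap(struct demo_table *tbl, unsigned long entry)
{
	unsigned long *pua = demo_userspace_entry(tbl, entry);
	unsigned long old;

	if (!pua)
		return 0;
	old = *pua;
	*pua = 0;
	return old;
}
```

The address returned on unmap is what lets the caller find the registered memory region that the just-cleared TCE belonged to.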

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v9:
* fixed code flow in error cases added in v8

v8:
* added ENOMEM on failed vzalloc()
---
 arch/powerpc/include/asm/iommu.h  |  6 ++
 arch/powerpc/kernel/iommu.c   | 18 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
struct iommu_table_group *it_table_group;
+   unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
 };
 
+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+   ((tbl)->it_userspace ? \
+   &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
+   NULL)
+
 /* Pure 2^n version of get_order */
 static inline __attribute_const__
 int get_iommu_order(unsigned long size, struct iommu_table *tbl)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2eaba0c..74a3f52 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
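 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/sched.h>
+#include <linux/vmalloc.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>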
@@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char 
*node_name)
free_pages((unsigned long) tbl->it_map, order);
}
 
+   WARN_ON(tbl->it_userspace);
+
memset(tbl, 0, sizeof(*tbl));
 }
 
@@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
int ret = 0;
+   unsigned long *uas;
 
/*
 * VFIO does not control TCE entries allocation and the guest
@@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
if (!tbl->it_ops->exchange)
return -EINVAL;
 
+   uas = vzalloc(sizeof(*uas) * tbl->it_size);
+   if (!uas)
+   return -ENOMEM;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
@@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl->it_map, 0xff, sz);
}
 
+   if (ret) {
+   vfree(uas);
+   } else {
+   BUG_ON(tbl->it_userspace);
+   tbl->it_userspace = uas;
+   }
+
for (i = 0; i < tbl->nr_pools; i++)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
@@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 45bc131..e0be556 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
 #include <linux/memblock.h>
 #include <linux/iommu.h>
 #include <linux/sizes.h>
+#include <linux/vmalloc.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
 }
 
+void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
+{
+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+
+   pnv_pci_free_table(tbl);
+}
+
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
@@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 #endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
-   .free = pnv_pci_free_table,
+   .free = pnv_pci_ioda2_free_table,
 };
 
 static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
@@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
int nid = pe->phb->hose->node;
__u64 bus_offset = num ? 

[PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table

2015-04-25 Thread Alexey Kardashevskiy
In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v9:
* fixed code flow in error cases added in v8

v8:
* added ENOMEM on failed vzalloc()
---
 arch/powerpc/include/asm/iommu.h  |  6 ++
 arch/powerpc/kernel/iommu.c   | 18 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
struct iommu_table_group *it_table_group;
+   unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops;
 };
 
+#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
+   ((tbl)->it_userspace ? \
+   &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
+   NULL)
+
 /* Pure 2^n version of get_order */
 static inline __attribute_const__
 int get_iommu_order(unsigned long size, struct iommu_table *tbl)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2eaba0c..74a3f52 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/sched.h>
+#include <linux/vmalloc.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
free_pages((unsigned long) tbl->it_map, order);
}
 
+   WARN_ON(tbl->it_userspace);
+
memset(tbl, 0, sizeof(*tbl));
 }
 
@@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
int ret = 0;
+   unsigned long *uas;
 
/*
 * VFIO does not control TCE entries allocation and the guest
@@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl)
if (!tbl->it_ops->exchange)
return -EINVAL;
 
+   uas = vzalloc(sizeof(*uas) * tbl->it_size);
+   if (!uas)
+   return -ENOMEM;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
@@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl->it_map, 0xff, sz);
}
 
+   if (ret) {
+   vfree(uas);
+   } else {
+   BUG_ON(tbl->it_userspace);
+   tbl->it_userspace = uas;
+   }
+
for (i = 0; i < tbl->nr_pools; i++)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
@@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+
spin_lock_irqsave(&tbl->large_pool.lock, flags);
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(&tbl->pools[i].lock);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 45bc131..e0be556 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
 #include <linux/memblock.h>
 #include <linux/iommu.h>
 #include <linux/sizes.h>
+#include <linux/vmalloc.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
 }
 
+void pnv_pci_ioda2_free_table(struct iommu_table *tbl)
+{
+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+
+   pnv_pci_free_table(tbl);
+}
+
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
@@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 #endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
-   .free = pnv_pci_free_table,
+   .free = pnv_pci_ioda2_free_table,
 };
 
 static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb,
@@ -2062,12