Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-05-10 Thread Alexey Kardashevskiy

On 05/05/2015 09:58 PM, David Gibson wrote:

On Fri, May 01, 2015 at 04:53:08PM +1000, Alexey Kardashevskiy wrote:

On 05/01/2015 03:12 PM, David Gibson wrote:

On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:

On 04/29/2015 04:40 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:

This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v9:
* reimplemented the whole patch
---
  arch/powerpc/include/asm/iommu.h  |  5 +
  arch/powerpc/platforms/powernv/pci-ioda.c | 14 
  arch/powerpc/platforms/powernv/pci.c  | 36 +++
  arch/powerpc/platforms/powernv/pci.h  |  2 ++
  4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
  struct iommu_table_group;

  struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
  }

  #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+   if (!ret)
+   return ret;
+
+   /* Add size of it_userspace */
+   return ret + (window_size >> page_shift) * sizeof(unsigned long);


This doesn't make much sense.  The userspace view can't possibly be a
property of the specific low-level IOMMU model.



This it_userspace thing is all about memory preregistration.

I need some way to track how many actual mappings the
mm_iommu_table_group_mem_t has in order to decide whether to allow
unregistering or not.

When I clear TCE, I can read the old value which is host physical address
which I cannot use to find the preregistered region and adjust the mappings
counter; I can only use userspace addresses for this (not even guest
physical addresses as it is VFIO and probably no KVM).

So I have to keep userspace addresses somewhere, one per IOMMU page, and the
iommu_table seems a natural place for this.
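
To put a number on the per-page bookkeeping described above, here is a small standalone sketch of the extra term that pnv_pci_ioda2_get_table_size() adds. It is not kernel code: the helper name userspace_array_bytes() is hypothetical, and uint64_t stands in for the kernel's unsigned long on the assumption of a 64-bit host.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: bytes needed to keep one
 * userspace address per IOMMU page for a window of window_size bytes.
 * Mirrors (window_size >> page_shift) * sizeof(unsigned long), with
 * uint64_t assumed to match a 64-bit unsigned long.
 */
static uint64_t userspace_array_bytes(uint32_t page_shift, uint64_t window_size)
{
	assert(page_shift < 64);	/* shifting by >= 64 is undefined */
	return (window_size >> page_shift) * sizeof(uint64_t);
}
```

Under these assumptions a 1 GiB window with 64 KiB IOMMU pages (page_shift = 16) has 16384 entries, so the array costs 128 KiB on top of the TCE table itself; with 4 KiB pages the same window would need 2 MiB.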


Well.. sort of.  But as noted elsewhere this pulls VFIO specific
constraints into a platform code structure.  And whether you get this
table depends on the platform IOMMU type rather than on what VFIO
wants to do with it, which doesn't make sense.

What might make more sense is an opaque pointer in iommu_table for use
by the table "owner" (in the take_ownership sense).  The pointer would
be stored in iommu_table, but VFIO is responsible for populating and
managing its contents.

Or you could just put the userspace mappings in the container.
Although you might want a different data structure in that case.


Nope. I need this table in in-kernel acceleration to update the mappings
counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only
have IOMMU tables, not containers or groups. QEMU creates a guest view of
the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE
tables to it via a set of ioctls (one per IOMMU group) to the VFIO KVM device.

So if I call it it_opaque (instead of it_userspace), I will still need a
common place (visible to VFIO and PowerKVM) to put this:
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)


I think it should be in a VFIO header.  If I'm understanding right
this part of the PowerKVM code is explicitly VFIO aware - that's kind
of the point.


Well. 

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-05-05 Thread David Gibson
On Fri, May 01, 2015 at 04:53:08PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 03:12 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/29/2015 04:40 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
> This adds a way for the IOMMU user to know how much a new table will
> use so it can be accounted in the locked_vm limit before allocation
> happens.
> 
> This stores the allocated table size in pnv_pci_create_table()
> so the locked_vm counter can be updated correctly when a table is
> being disposed.
> 
> This defines an iommu_table_group_ops callback to let VFIO know
> how much memory will be locked if a table is created.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v9:
> * reimplemented the whole patch
> ---
>   arch/powerpc/include/asm/iommu.h  |  5 +
>   arch/powerpc/platforms/powernv/pci-ioda.c | 14 
>   arch/powerpc/platforms/powernv/pci.c  | 36 +++
>   arch/powerpc/platforms/powernv/pci.h  |  2 ++
>   4 files changed, 57 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 1472de3..9844c106 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -99,6 +99,7 @@ struct iommu_table {
>   unsigned long  it_size;  /* Size of iommu table in entries */
>   unsigned long  it_indirect_levels;
>   unsigned long  it_level_size;
> + unsigned long  it_allocated_size;
>   unsigned long  it_offset;/* Offset into global table */
>   unsigned long  it_base;  /* mapped address of tce table */
>   unsigned long  it_index; /* which iommu table this is */
> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>   struct iommu_table_group;
> 
>   struct iommu_table_group_ops {
> + unsigned long (*get_table_size)(
> + __u32 page_shift,
> + __u64 window_size,
> + __u32 levels);
>   long (*create_table)(struct iommu_table_group *table_group,
>   int num,
>   __u32 page_shift,
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index e0be556..7f548b4 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>   }
> 
>   #ifdef CONFIG_IOMMU_API
> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels)
> +{
> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
> +
> + if (!ret)
> + return ret;
> +
> + /* Add size of it_userspace */
> + return ret + (window_size >> page_shift) * sizeof(unsigned long);
> >>>
> >>>This doesn't make much sense.  The userspace view can't possibly be a
> >>>property of the specific low-level IOMMU model.
> >>
> >>
> >>This it_userspace thing is all about memory preregistration.
> >>
> >>I need some way to track how many actual mappings the
> >>mm_iommu_table_group_mem_t has in order to decide whether to allow
> >>unregistering or not.
> >>
> >>When I clear TCE, I can read the old value which is host physical address
> >>which I cannot use to find the preregistered region and adjust the mappings
> >>counter; I can only use userspace addresses for this (not even guest
> >>physical addresses as it is VFIO and probably no KVM).
> >>
> >>So I have to keep userspace addresses somewhere, one per IOMMU page, and the
> >>iommu_table seems a natural place for this.
> >
> >Well.. sort of.  But as noted elsewhere this pulls VFIO specific
> >constraints into a platform code structure.  And whether you get this
> >table depends on the platform IOMMU type rather than on what VFIO
> >wants to do with it, which doesn't make sense.
> >
> >What might make more sense is an opaque pointer in iommu_table for use
> >by the table "owner" (in the take_ownership sense).  The pointer would
> >be stored in iommu_table, but VFIO is responsible for populating and
> >managing its contents.
> >
> >Or you could just put the userspace mappings in the container.
> >Although you might want a different data structure in that case.
> 
> Nope. I need this table in in-kernel acceleration to update the mappings
> counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only
> have IOMMU tables, not containers or groups. QEMU creates a guest view of
> the table 

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-05-01 Thread Alexey Kardashevskiy

On 05/01/2015 03:12 PM, David Gibson wrote:

On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:

On 04/29/2015 04:40 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:

This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v9:
* reimplemented the whole patch
---
  arch/powerpc/include/asm/iommu.h  |  5 +
  arch/powerpc/platforms/powernv/pci-ioda.c | 14 
  arch/powerpc/platforms/powernv/pci.c  | 36 +++
  arch/powerpc/platforms/powernv/pci.h  |  2 ++
  4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
  struct iommu_table_group;

  struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
  }

  #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+   if (!ret)
+   return ret;
+
+   /* Add size of it_userspace */
+   return ret + (window_size >> page_shift) * sizeof(unsigned long);


This doesn't make much sense.  The userspace view can't possibly be a
property of the specific low-level IOMMU model.



This it_userspace thing is all about memory preregistration.

I need some way to track how many actual mappings the
mm_iommu_table_group_mem_t has in order to decide whether to allow
unregistering or not.

When I clear TCE, I can read the old value which is host physical address
which I cannot use to find the preregistered region and adjust the mappings
counter; I can only use userspace addresses for this (not even guest
physical addresses as it is VFIO and probably no KVM).

So I have to keep userspace addresses somewhere, one per IOMMU page, and the
iommu_table seems a natural place for this.


Well.. sort of.  But as noted elsewhere this pulls VFIO specific
constraints into a platform code structure.  And whether you get this
table depends on the platform IOMMU type rather than on what VFIO
wants to do with it, which doesn't make sense.

What might make more sense is an opaque pointer in iommu_table for use
by the table "owner" (in the take_ownership sense).  The pointer would
be stored in iommu_table, but VFIO is responsible for populating and
managing its contents.

Or you could just put the userspace mappings in the container.
Although you might want a different data structure in that case.


Nope. I need this table in in-kernel acceleration to update the mappings 
counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only 
have IOMMU tables, not containers or groups. QEMU creates a guest view of 
the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE 
tables to it via a set of ioctls (one per IOMMU group) to the VFIO KVM device.


So if I call it it_opaque (instead of it_userspace), I will still need a
common place (visible to VFIO and PowerKVM) to put this:

#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)

So far this place was arch/powerpc/include/asm/iommu.h and the iommu_table 
struct.
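
For readers following along, the accessor under discussion could look roughly like the sketch below. This is only an illustration of the idea, assuming an it_userspace array with one slot per table entry, indexed relative to it_offset. The cut-down struct iommu_table_sketch and the sketch_set_ua() helper are hypothetical names for this example, not code from the patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct iommu_table: only the fields used here. */
struct iommu_table_sketch {
	unsigned long it_offset;	/* first entry number covered by the table */
	unsigned long it_size;		/* number of entries */
	unsigned long *it_userspace;	/* one userspace address per entry, or NULL */
};

/*
 * Assumed shape of the accessor: pointer to the userspace-address slot
 * for absolute entry number `entry`, or NULL if no array is attached.
 */
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
	((tbl)->it_userspace ? \
		&(tbl)->it_userspace[(entry) - (tbl)->it_offset] : NULL)

/* Record the userspace address backing `entry`; returns 0 or -1. */
static int sketch_set_ua(struct iommu_table_sketch *tbl,
			 unsigned long entry, unsigned long ua)
{
	unsigned long *slot;

	assert(tbl != NULL);
	/* Reject entries outside [it_offset, it_offset + it_size). */
	if (entry < tbl->it_offset || entry - tbl->it_offset >= tbl->it_size)
		return -1;
	slot = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
	if (!slot)
		return -1;
	*slot = ua;
	return 0;
}
```

Whichever header ends up hosting the macro, both VFIO and the KVM real mode handlers would need to see the same definition, which is the constraint described above.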




The other thing to bear in mind is that registered regions are likely
to be large contiguous blocks in user addresses, though obviously not
contiguous in physical addr.  So you 

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
> On 04/29/2015 04:40 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
> >>This adds a way for the IOMMU user to know how much a new table will
> >>use so it can be accounted in the locked_vm limit before allocation
> >>happens.
> >>
> >>This stores the allocated table size in pnv_pci_create_table()
> >>so the locked_vm counter can be updated correctly when a table is
> >>being disposed.
> >>
> >>This defines an iommu_table_group_ops callback to let VFIO know
> >>how much memory will be locked if a table is created.
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>---
> >>Changes:
> >>v9:
> >>* reimplemented the whole patch
> >>---
> >>  arch/powerpc/include/asm/iommu.h  |  5 +
> >>  arch/powerpc/platforms/powernv/pci-ioda.c | 14 
> >>  arch/powerpc/platforms/powernv/pci.c  | 36 +++
> >>  arch/powerpc/platforms/powernv/pci.h  |  2 ++
> >>  4 files changed, 57 insertions(+)
> >>
> >>diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >>index 1472de3..9844c106 100644
> >>--- a/arch/powerpc/include/asm/iommu.h
> >>+++ b/arch/powerpc/include/asm/iommu.h
> >>@@ -99,6 +99,7 @@ struct iommu_table {
> >>unsigned long  it_size;  /* Size of iommu table in entries */
> >>unsigned long  it_indirect_levels;
> >>unsigned long  it_level_size;
> >>+   unsigned long  it_allocated_size;
> >>unsigned long  it_offset;/* Offset into global table */
> >>unsigned long  it_base;  /* mapped address of tce table */
> >>unsigned long  it_index; /* which iommu table this is */
> >>@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
> >>  struct iommu_table_group;
> >>
> >>  struct iommu_table_group_ops {
> >>+   unsigned long (*get_table_size)(
> >>+   __u32 page_shift,
> >>+   __u64 window_size,
> >>+   __u32 levels);
> >>long (*create_table)(struct iommu_table_group *table_group,
> >>int num,
> >>__u32 page_shift,
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index e0be556..7f548b4 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
> >>  }
> >>
> >>  #ifdef CONFIG_IOMMU_API
> >>+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> >>+   __u64 window_size, __u32 levels)
> >>+{
> >>+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
> >>+
> >>+   if (!ret)
> >>+   return ret;
> >>+
> >>+   /* Add size of it_userspace */
> >>+   return ret + (window_size >> page_shift) * sizeof(unsigned long);
> >
> >This doesn't make much sense.  The userspace view can't possibly be a
> >property of the specific low-level IOMMU model.
> 
> 
> This it_userspace thing is all about memory preregistration.
> 
> I need some way to track how many actual mappings the
> mm_iommu_table_group_mem_t has in order to decide whether to allow
> unregistering or not.
> 
> When I clear TCE, I can read the old value which is host physical address
> which I cannot use to find the preregistered region and adjust the mappings
> counter; I can only use userspace addresses for this (not even guest
> physical addresses as it is VFIO and probably no KVM).
> 
> So I have to keep userspace addresses somewhere, one per IOMMU page, and the
> iommu_table seems a natural place for this.

Well.. sort of.  But as noted elsewhere this pulls VFIO specific
constraints into a platform code structure.  And whether you get this
table depends on the platform IOMMU type rather than on what VFIO
wants to do with it, which doesn't make sense.

What might make more sense is an opaque pointer in iommu_table for use
by the table "owner" (in the take_ownership sense).  The pointer would
be stored in iommu_table, but VFIO is responsible for populating and
managing its contents.

Or you could just put the userspace mappings in the container.
Although you might want a different data structure in that case.

The other thing to bear in mind is that registered regions are likely
to be large contiguous blocks in user addresses, though obviously not
contiguous in physical addr.  So you might be able to compactify this
information by storing it as a list of variable length blocks in
userspace address space, rather than a per-page address..
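One possible reading of the "variable length blocks" idea, as a minimal user-space sketch: instead of one userspace address per IOMMU page, keep one record per run of consecutive table entries that map consecutive userspace pages. The struct ua_run type and ua_run_lookup() are hypothetical names for illustration only; they are not part of the patch or of any kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run record; not a structure from the patch or the kernel. */
struct ua_run {
	uint64_t first_entry;	/* first IOMMU table entry covered by the run */
	uint64_t nentries;	/* number of consecutive entries in the run */
	uint64_t ua_start;	/* userspace address mapped at first_entry */
};

/*
 * Recover the userspace address behind a table entry.
 * Returns 0 and stores the address in *ua, or -1 if no run covers the entry.
 * Linear scan for brevity; a sorted array or a tree would scale better.
 */
static int ua_run_lookup(const struct ua_run *runs, size_t nruns,
			 uint64_t entry, unsigned int page_shift, uint64_t *ua)
{
	for (size_t i = 0; i < nruns; i++) {
		const struct ua_run *r = &runs[i];

		if (entry >= r->first_entry &&
		    entry - r->first_entry < r->nentries) {
			*ua = r->ua_start +
			      ((entry - r->first_entry) << page_shift);
			return 0;
		}
	}
	return -1;
}
```

With two runs covering entries 0-3 and 16-17 and 4K pages (page_shift = 12), looking up entry 3 yields ua_start + 0x3000 for the first run, while entry 4 fails because no run covers it. The memory cost then scales with the number of runs rather than with the window size.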



But.. isn't there a bigger problem here.  As Paulus was pointing out,
there's nothing guaranteeing the page tables continue to contain the
same page as was there at gup() time.

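What's going to happen if you REGISTER a memory region, then mremap()
over it?  Then attempt to PUT_TCE a page in the region? Or what if you
mremap() it to someplace else then try to PUT_TCE a page there? Or
REGISTER it again in its new location?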

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-30 Thread Alexey Kardashevskiy

On 04/29/2015 04:40 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:

This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v9:
* reimplemented the whole patch
---
  arch/powerpc/include/asm/iommu.h  |  5 +
  arch/powerpc/platforms/powernv/pci-ioda.c | 14 
  arch/powerpc/platforms/powernv/pci.c  | 36 +++
  arch/powerpc/platforms/powernv/pci.h  |  2 ++
  4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
  struct iommu_table_group;

  struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb 
*phb,
  }

  #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+   if (!ret)
+   return ret;
+
+   /* Add size of it_userspace */
+   return ret + (window_size >> page_shift) * sizeof(unsigned long);


This doesn't make much sense.  The userspace view can't possibly be a
property of the specific low-level IOMMU model.



This it_userspace thing is all about memory preregistration.

I need some way to track how many actual mappings the 
mm_iommu_table_group_mem_t has in order to decide whether to allow 
unregistering or not.


When I clear TCE, I can read the old value which is host physical address 
which I cannot use to find the preregistered region and adjust the mappings 
counter; I can only use userspace addresses for this (not even guest 
physical addresses as it is VFIO and probably no KVM).


So I have to keep userspace addresses somewhere, one per IOMMU page, and 
the iommu_table seems a natural place for this.

+}
+
  static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
@@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,

BUG_ON(tbl->it_userspace);
tbl->it_userspace = uas;
+   tbl->it_allocated_size += uas_cb;
tbl->it_ops = &pnv_ioda2_iommu_ops;
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
@@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct 
iommu_table_group *table_group)
  }

  static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+   .get_table_size = pnv_pci_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index fc129c4..1b5b48a 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
  }

+unsigned long pnv_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long bytes = 0;
+   const unsigned window_shift = ilog2(window_size);
+   unsigned entries_shift = window_shift - page_shift;
+   unsigned table_shift = 

Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-29 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
> This adds a way for the IOMMU user to know how much a new table will
> use so it can be accounted in the locked_vm limit before allocation
> happens.
> 
> This stores the allocated table size in pnv_pci_create_table()
> so the locked_vm counter can be updated correctly when a table is
> being disposed.
> 
> This defines an iommu_table_group_ops callback to let VFIO know
> how much memory will be locked if a table is created.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v9:
> * reimplemented the whole patch
> ---
>  arch/powerpc/include/asm/iommu.h  |  5 +
>  arch/powerpc/platforms/powernv/pci-ioda.c | 14 
>  arch/powerpc/platforms/powernv/pci.c  | 36 
> +++
>  arch/powerpc/platforms/powernv/pci.h  |  2 ++
>  4 files changed, 57 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index 1472de3..9844c106 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -99,6 +99,7 @@ struct iommu_table {
>   unsigned long  it_size;  /* Size of iommu table in entries */
>   unsigned long  it_indirect_levels;
>   unsigned long  it_level_size;
> + unsigned long  it_allocated_size;
>   unsigned long  it_offset;/* Offset into global table */
>   unsigned long  it_base;  /* mapped address of tce table */
>   unsigned long  it_index; /* which iommu table this is */
> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct 
> iommu_table * tbl,
>  struct iommu_table_group;
>  
>  struct iommu_table_group_ops {
> + unsigned long (*get_table_size)(
> + __u32 page_shift,
> + __u64 window_size,
> + __u32 levels);
>   long (*create_table)(struct iommu_table_group *table_group,
>   int num,
>   __u32 page_shift,
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index e0be556..7f548b4 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
> pnv_phb *phb,
>  }
>  
>  #ifdef CONFIG_IOMMU_API
> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels)
> +{
> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
> +
> + if (!ret)
> + return ret;
> +
> + /* Add size of it_userspace */
> + return ret + (window_size >> page_shift) * sizeof(unsigned long);

This doesn't make much sense.  The userspace view can't possibly be a
property of the specific low-level IOMMU model.

> +}
> +
>  static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
>   int num, __u32 page_shift, __u64 window_size, __u32 levels,
>   struct iommu_table *tbl)
> @@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct 
> iommu_table_group *table_group,
>  
>   BUG_ON(tbl->it_userspace);
>   tbl->it_userspace = uas;
> + tbl->it_allocated_size += uas_cb;
>   tbl->it_ops = &pnv_ioda2_iommu_ops;
>   if (pe->tce_inval_reg)
>   tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
> @@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct 
> iommu_table_group *table_group)
>  }
>  
>  static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> + .get_table_size = pnv_pci_ioda2_get_table_size,
>   .create_table = pnv_pci_ioda2_create_table,
>   .set_window = pnv_pci_ioda2_set_window,
>   .unset_window = pnv_pci_ioda2_unset_window,
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index fc129c4..1b5b48a 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>   tbl->it_type = TCE_PCI;
>  }
>  
> +unsigned long pnv_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels)
> +{
> + unsigned long bytes = 0;
> + const unsigned window_shift = ilog2(window_size);
> + unsigned entries_shift = window_shift - page_shift;
> + unsigned table_shift = entries_shift + 3;
> + unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
> + unsigned long direct_table_size;
> +
> + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
> + (window_size > memory_hotplug_max()) ||
> + !is_power_of_2(window_size))
> + return 0;
> +
> + /* Calculate a direct table size from window_size and levels */
> + entries_shift = ROUND_UP(entries_shift, levels) / levels;
> + table_shift = entries_shift + 3;
> + 

[PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-25 Thread Alexey Kardashevskiy
This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v9:
* reimplemented the whole patch
---
 arch/powerpc/include/asm/iommu.h  |  5 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 
 arch/powerpc/platforms/powernv/pci.c  | 36 +++
 arch/powerpc/platforms/powernv/pci.h  |  2 ++
 4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
 struct iommu_table_group;
 
 struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb 
*phb,
 }
 
 #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+   if (!ret)
+   return ret;
+
+   /* Add size of it_userspace */
+   return ret + (window_size >> page_shift) * sizeof(unsigned long);
+}
+
 static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
@@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
 
BUG_ON(tbl->it_userspace);
tbl->it_userspace = uas;
+   tbl->it_allocated_size += uas_cb;
tbl->it_ops = &pnv_ioda2_iommu_ops;
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
@@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct 
iommu_table_group *table_group)
 }
 
 static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+   .get_table_size = pnv_pci_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index fc129c4..1b5b48a 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
 }
 
+unsigned long pnv_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long bytes = 0;
+   const unsigned window_shift = ilog2(window_size);
+   unsigned entries_shift = window_shift - page_shift;
+   unsigned table_shift = entries_shift + 3;
+   unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
+   unsigned long direct_table_size;
+
+   if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
+   (window_size > memory_hotplug_max()) ||
+   !is_power_of_2(window_size))
+   return 0;
+
+   /* Calculate a direct table size from window_size and levels */
+   entries_shift = ROUND_UP(entries_shift, levels) / levels;
+   table_shift = entries_shift + 3;
+   table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
+   direct_table_size =  1UL << table_shift;
+
+   for ( ; levels; --levels) {
+   bytes += ROUND_UP(tce_table_size, direct_table_size);
+
+   tce_table_size /= direct_table_size;
+   tce_table_size <<= 3;
+   tce_table_size = 
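For a feel of the numbers, here is a user-space sketch of the single-level part of the calculation in the patch above, plus the it_userspace overhead added by pnv_pci_ioda2_get_table_size(). It assumes 8-byte entries (a 64-bit unsigned long) and a power-of-two window_size no smaller than one IOMMU page, and it leaves out the multi-level loop and the validity checks. The names ilog2_u64(), tce_table_bytes() and userspace_array_bytes() are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(x)) for x > 0; stands in for the kernel's ilog2(). */
static unsigned int ilog2_u64(uint64_t x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/* Bytes for a single-level TCE table: 8 bytes per entry, at least 4K. */
static uint64_t tce_table_bytes(unsigned int page_shift, uint64_t window_size)
{
	unsigned int entries_shift = ilog2_u64(window_size) - page_shift;
	uint64_t bytes = 1ULL << (entries_shift + 3);

	return bytes < 0x1000 ? 0x1000 : bytes;
}

/* Bytes for the userspace-address array: one 8-byte slot per IOMMU page. */
static uint64_t userspace_array_bytes(unsigned int page_shift,
				      uint64_t window_size)
{
	return (window_size >> page_shift) * sizeof(uint64_t);
}
```

For a 1 GiB window with 64K IOMMU pages (page_shift = 16) both functions return 131072 bytes, so under these assumptions the userspace array is as large as the single-level TCE table itself.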

[PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-04-25 Thread Alexey Kardashevskiy
This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.

This stores the allocated table size in pnv_pci_create_table()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
---
Changes:
v9:
* reimplemented the whole patch
---
 arch/powerpc/include/asm/iommu.h  |  5 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 14 
 arch/powerpc/platforms/powernv/pci.c  | 36 +++
 arch/powerpc/platforms/powernv/pci.h  |  2 ++
 4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
 struct iommu_table_group;
 
 struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 }
 
 #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+   if (!ret)
+   return ret;
+
+   /* Add size of it_userspace */
+   return ret + (window_size >> page_shift) * sizeof(unsigned long);
+}
+
 static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
int num, __u32 page_shift, __u64 window_size, __u32 levels,
struct iommu_table *tbl)
@@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
 
BUG_ON(tbl->it_userspace);
tbl->it_userspace = uas;
+   tbl->it_allocated_size += uas_cb;
tbl->it_ops = &pnv_ioda2_iommu_ops;
if (pe->tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
@@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
 }
 
 static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+   .get_table_size = pnv_pci_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index fc129c4..1b5b48a 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
tbl->it_type = TCE_PCI;
 }
 
+unsigned long pnv_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long bytes = 0;
+   const unsigned window_shift = ilog2(window_size);
+   unsigned entries_shift = window_shift - page_shift;
+   unsigned table_shift = entries_shift + 3;
+   unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
+   unsigned long direct_table_size;
+
+   if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
+   (window_size > memory_hotplug_max()) ||
+   !is_power_of_2(window_size))
+   return 0;
+
+   /* Calculate a direct table size from window_size and levels */
+   entries_shift = ROUND_UP(entries_shift, levels) / levels;
+   table_shift = entries_shift + 3;
+   table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
+   direct_table_size =  1UL << table_shift;
+
+   for ( ; levels; --levels) {
+   bytes += ROUND_UP(tce_table_size, direct_table_size);
+
+   tce_table_size /= direct_table_size;
+   tce_table_size <<= 3;
+   tce_table_size =