Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-17 Thread Leonardo Brás
Alexey, Fred:

On Fri, 2021-07-23 at 15:34 +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 22/07/2021 01:04, Frederic Barrat wrote:
> > 
> > 
> > On 21/07/2021 05:32, Alexey Kardashevskiy wrote:
> > > > > +    struct iommu_table *newtbl;
> > > > > +    int i;
> > > > > +
> > > > > +    for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
> > > > > +    const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
> > > > > +
> > > > > +    /* Look for MMIO32 */
> > > > > +    if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
> > > > > +    break;
> > > > > +    }
> > > > > +
> > > > > +    if (i == ARRAY_SIZE(pci->phb->mem_resources))
> > > > > +    goto out_del_list;
> > > > 
> > > > 
> > > > So we exit and do nothing if there's no MMIO32 bar?
> > > > Isn't the intent just to figure out the MMIO32 area to reserve
> > > > it 
> > > > when init'ing the table? In which case we could default to 0,0
> > > > 
> > > > I'm actually not clear why we are reserving this area on
> > > > pseries.
> > > 
> > > 
> > > 
> > > If we do not reserve it, then the iommu code will allocate DMA
> > > pages 
> > > from there and these addresses are MMIO32 from the kernel pov at 
> > > least. I saw crashes when (I think) a device tried DMAing to the
> > > top 
> > > 2GB of the bus space which happened to be some other device's
> > > BAR.
> > 
> > 
> > hmmm... then figuring out the correct range needs more work. We
> > could 
> > have more than one MMIO32 bar. And they don't have to be adjacent. 
> 
> They all have to be within the MMIO32 window of a PHB and we reserve
> the 
> entire window here.
> 
> > I 
> > don't see that we are reserving any range on the initial table
> > though 
> > (on pseries).
> True, we did not need to, as the hypervisor always ensured the DMA
> and MMIO32 regions did not overlap.
> 
> And in this series we do not (strictly speaking) need this either as 
> phyp never allocates more than one window dynamically and that only 
> window is always the second one starting from 0x800....
> It 
> is probably my mistake that KVM allows a new window to start from 0 -
> PAPR did not prohibit this explicitly.
> 
> And for the KVM case, we do not need to remove the default window as
> KVM 
> can pretty much always allocate as many TCEs as the VM wants. But we 
> still allow removing the default window and creating a huge one
> instead 
> at 0x0 as this way we can allow 1:1 for every single PCI device even
> if 
> it only allows 48-bit (or similar, less than 64-bit) DMA. Hope this
> makes 
> sense. Thanks,
> 
> 

Thank you for this discussion, I got to learn a lot!

If I got this right, no further change is necessary, is that correct?

I am testing a v6, and I intend to send it soon.



Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-17 Thread Leonardo Brás
Hello Alexey, Fred. Thanks for reviewing!


On Wed, 2021-07-21 at 13:32 +1000, Alexey Kardashevskiy wrote:
> > >     spin_lock(&direct_window_list_lock);
> > 
> > 
> > 
> > 
> > Somewhere around here, we have:
> > 
> > out_remove_win:
> >  remove_ddw(pdn, true, DIRECT64_PROPNAME);
> > 
> > We should replace with:
> >  remove_ddw(pdn, true, win_name);
> 
> 
> True. Good spotting. Or rework remove_dma_window() to take just a
> liobn. 
> Thanks,

Fred,
Good catch! I missed this one when rebasing the latest changes.

Due to reworking v5 06/11, I ended up with a solution that takes a
liobn instead of a window name (as suggested by Alexey), so this call
is gone now :)


> 
> > 
> > 
> >    Fred
> > 
> > 
> > 
> > > 
> 

Thanks!



Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-22 Thread Alexey Kardashevskiy




On 22/07/2021 01:04, Frederic Barrat wrote:



On 21/07/2021 05:32, Alexey Kardashevskiy wrote:

+    struct iommu_table *newtbl;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
+    const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
+
+    /* Look for MMIO32 */
+    if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
+    break;
+    }
+
+    if (i == ARRAY_SIZE(pci->phb->mem_resources))
+    goto out_del_list;



So we exit and do nothing if there's no MMIO32 bar?
Isn't the intent just to figure out the MMIO32 area to reserve it 
when init'ing the table? In which case we could default to 0,0


I'm actually not clear why we are reserving this area on pseries.




If we do not reserve it, then the iommu code will allocate DMA pages 
from there and these addresses are MMIO32 from the kernel pov at 
least. I saw crashes when (I think) a device tried DMAing to the top 
2GB of the bus space which happened to be some other device's BAR.



hmmm... then figuring out the correct range needs more work. We could 
have more than one MMIO32 bar. And they don't have to be adjacent. 


They all have to be within the MMIO32 window of a PHB and we reserve the 
entire window here.


I 
don't see that we are reserving any range on the initial table though 
(on pseries).
True, we did not need to, as the hypervisor always ensured the DMA and
MMIO32 regions did not overlap.


And in this series we do not (strictly speaking) need this either as 
phyp never allocates more than one window dynamically and that only 
window is always the second one starting from 0x800.... It 
is probably my mistake that KVM allows a new window to start from 0 - 
PAPR did not prohibit this explicitly.


And for the KVM case, we do not need to remove the default window as KVM 
can pretty much always allocate as many TCEs as the VM wants. But we 
still allow removing the default window and creating a huge one instead 
at 0x0 as this way we can allow 1:1 for every single PCI device even if 
it only allows 48-bit (or similar, less than 64-bit) DMA. Hope this makes 
sense. Thanks,



--
Alexey


Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-21 Thread Frederic Barrat




On 21/07/2021 05:32, Alexey Kardashevskiy wrote:

+    struct iommu_table *newtbl;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
+    const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
+
+    /* Look for MMIO32 */
+    if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM)
+    break;
+    }
+
+    if (i == ARRAY_SIZE(pci->phb->mem_resources))
+    goto out_del_list;



So we exit and do nothing if there's no MMIO32 bar?
Isn't the intent just to figure out the MMIO32 area to reserve it when 
init'ing the table? In which case we could default to 0,0


I'm actually not clear why we are reserving this area on pseries.




If we do not reserve it, then the iommu code will allocate DMA pages 
from there and these addresses are MMIO32 from the kernel pov at least. 
I saw crashes when (I think) a device tried DMAing to the top 2GB of the 
bus space which happened to be some other device's BAR.



hmmm... then figuring out the correct range needs more work. We could 
have more than one MMIO32 bar. And they don't have to be adjacent. I 
don't see that we are reserving any range on the initial table though 
(on pseries).


  Fred


Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-20 Thread Alexey Kardashevskiy




On 21/07/2021 04:12, Frederic Barrat wrote:



On 16/07/2021 10:27, Leonardo Bras wrote:

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for
the default DMA window, and also gains access to much larger page sizes
(16MB as implemented in QEMU vs 4K from the default DMA window), causing
a significant increase in the maximum amount of memory that can be
IOMMU-mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 87 +-
  1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c

index 22d251e15b61..a67e71c49aeb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
  /* protects initializing window twice for same device */
  static DEFINE_MUTEX(direct_window_init_mutex);
  #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
  unsigned long num_pfn, const void *arg)
@@ -925,6 +926,7 @@ static int find_existing_ddw_windows(void)
  return 0;
  find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+    find_existing_ddw_windows_named(DMA64_PROPNAME);
  return 0;
  }
@@ -1211,14 +1213,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

  struct ddw_create_response create;
  int page_shift;
  u64 win_addr;
+    const char *win_name;
  struct device_node *dn;
  u32 ddw_avail[DDW_APPLICABLE_SIZE];
  struct direct_window *window;
  struct property *win64;
  bool ddw_enabled = false;
  struct failed_ddw_pdn *fpdn;
-    bool default_win_removed = false;
+    bool default_win_removed = false, direct_mapping = false;
  bool pmem_present;
+    struct pci_dn *pci = PCI_DN(pdn);
+    struct iommu_table *tbl = pci->table_group->tables[0];
  dn = of_find_node_by_type(NULL, "ibm,pmemory");
  pmem_present = dn != NULL;
@@ -1227,6 +1232,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

  mutex_lock(&direct_window_init_mutex);
  if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+    direct_mapping = (len >= max_ram_len);
  ddw_enabled = true;
  goto out_unlock;
  }
@@ -1307,8 +1313,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

    query.page_size);
  goto out_failed;
  }
-    /* verify the window * number of ptes will map the partition */
-    /* check largest block * page size > max memory hotplug addr */
+
  /*
   * The "ibm,pmemory" can appear anywhere in the address space.
   * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1324,13 +1329,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)

  dev_info(&dev->dev, "Skipping ibm,pmemory");
  }
+    /* check if the available block * number of ptes will map everything */

  if (query.largest_available_block < (1ULL << (len - page_shift))) {
  dev_dbg(&dev->dev,
  "can't map partition max 0x%llx with %llu %llu-sized pages\n",

  1ULL << len,
  query.largest_available_block,
  1ULL << page_shift);
-    goto out_failed;
+
+    /* DDW + IOMMU on single window may fail if there is any allocation */

+    if (default_win_removed && iommu_table_in_use(tbl)) {
+    dev_dbg(&dev->dev, "current IOMMU table in use, can't be replaced.\n");

+    goto out_failed;
+    }
+
+    len = 

Re: [PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-20 Thread Frederic Barrat




On 16/07/2021 10:27, Leonardo Bras wrote:

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for
the default DMA window, and also gains access to much larger page sizes
(16MB as implemented in QEMU vs 4K from the default DMA window), causing
a significant increase in the maximum amount of memory that can be
IOMMU-mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 87 +-
  1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 22d251e15b61..a67e71c49aeb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
  /* protects initializing window twice for same device */
  static DEFINE_MUTEX(direct_window_init_mutex);
  #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
  
  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,

unsigned long num_pfn, const void *arg)
@@ -925,6 +926,7 @@ static int find_existing_ddw_windows(void)
return 0;
  
  	find_existing_ddw_windows_named(DIRECT64_PROPNAME);

+   find_existing_ddw_windows_named(DMA64_PROPNAME);
  
  	return 0;

  }
@@ -1211,14 +1213,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+   const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
  
  	dn = of_find_node_by_type(NULL, "ibm,pmemory");

pmem_present = dn != NULL;
@@ -1227,6 +1232,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	mutex_lock(&direct_window_init_mutex);
  
	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {

+   direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1307,8 +1313,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  query.page_size);
goto out_failed;
}
-   /* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
+
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1324,13 +1329,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	dev_info(&dev->dev, "Skipping ibm,pmemory");
}
  
+	/* check if the available block * number of ptes will map everything */

if (query.largest_available_block < (1ULL << (len - page_shift))) {
		dev_dbg(&dev->dev,
			"can't map partition max 0x%llx with %llu %llu-sized pages\n",
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
-   goto out_failed;
+
+   /* DDW + IOMMU on single window may fail if there is any allocation */
+   if (default_win_removed && 

[PATCH v5 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-07-16 Thread Leonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for
the default DMA window, and also gains access to much larger page sizes
(16MB as implemented in QEMU vs 4K from the default DMA window), causing
a significant increase in the maximum amount of memory that can be
IOMMU-mapped at the same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 87 +-
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 22d251e15b61..a67e71c49aeb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -925,6 +926,7 @@ static int find_existing_ddw_windows(void)
return 0;
 
find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+   find_existing_ddw_windows_named(DMA64_PROPNAME);
 
return 0;
 }
@@ -1211,14 +1213,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+   const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
 
dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1227,6 +1232,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	mutex_lock(&direct_window_init_mutex);
 
	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1307,8 +1313,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  query.page_size);
goto out_failed;
}
-   /* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
+
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1324,13 +1329,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	dev_info(&dev->dev, "Skipping ibm,pmemory");
}
 
+   /* check if the available block * number of ptes will map everything */
if (query.largest_available_block < (1ULL << (len - page_shift))) {
		dev_dbg(&dev->dev,
			"can't map partition max 0x%llx with %llu %llu-sized pages\n",
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
-   goto out_failed;
+
+   /* DDW + IOMMU on single window may fail if there is any allocation */
+   if (default_win_removed && iommu_table_in_use(tbl)) {
+