Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-27 Thread Leonardo Brás
Hello Fred,

On Fri, 2021-08-27 at 19:06 +0200, Frederic Barrat wrote:
> 
> I think it looks ok now as it was mostly me who was misunderstanding
> one part of the previous iteration.
> Reviewed-by: Frederic Barrat 
> 

Thank you for reviewing this series!

> Sorry for the late review, I was enjoying some time off. And thanks
> for that series, I believe it should help with those bugs complaining
> about lack of DMA space. It was also very educational for me, thanks
> to you and Alexey for your detailed answers.

Working on this series caused me to learn a lot about how DMA and IOMMU
work on Power, and it was a great experience. Thanks to Alexey, who
helped and guided me through this!

Best regards,
Leonardo Bras



Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-27 Thread Frederic Barrat

On 17/08/2021 08:39, Leonardo Bras wrote:

So far it's been assumed possible to map the guest RAM 1:1 to the bus,
which works with a small number of devices. SRIOV changes this, as the
user can configure hundreds of VFs; since phyp preallocates TCEs and
does not allow IOMMU pages bigger than 64K, it has to limit the number
of TCEs per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for
the default DMA window, and can also use much larger page sizes (16MB
as implemented in QEMU vs 4K from the default DMA window), significantly
increasing the maximum amount of memory that can be IOMMU-mapped at the
same time.

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window in order to use DDW with indirect
mapping is only allowed if no IOMMU memory is currently allocated in
the iommu_table; otherwise, enable_ddw() is aborted.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel could assume direct mapping and skip
iommu_alloc(), causing undesirable behavior.
So a new property name, DMA64_PROPNAME ("linux,dma64-ddr-window-info"),
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---



I think it looks ok now as it was mostly me who was misunderstanding one 
part of the previous iteration.

Reviewed-by: Frederic Barrat 

Sorry for the late review, I was enjoying some time off. And thanks for 
that series, I believe it should help with those bugs complaining about 
lack of DMA space. It was also very educational for me, thanks to you 
and Alexey for your detailed answers.


  Fred



  arch/powerpc/platforms/pseries/iommu.c | 89 +-
  1 file changed, 74 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e11c00b2dc1e..0eccc29f5573 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
 		unsigned long num_pfn, const void *arg)
@@ -940,6 +941,7 @@ static int find_existing_ddw_windows(void)
 		return 0;
 
 	find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+	find_existing_ddw_windows_named(DMA64_PROPNAME);
 
 	return 0;
 }
@@ -1226,14 +1228,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+   const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
  
	dn = of_find_node_by_type(NULL, "ibm,pmemory");
	pmem_present = dn != NULL;
@@ -1242,6 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	mutex_lock(&direct_window_init_mutex);
  
	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+		direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1322,8 +1328,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  query.page_size);
goto out_failed;
}
-   /* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
+
+
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1339,13 +1345,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
		dev_info(&dev->dev, "Skipping ibm,pmemory");
}
  
+	/* check if the available block * number of ptes will map everything */
	if (query.largest_available_block < (1ULL << (len - page_shift))) {
		dev_dbg(&dev->dev,

[PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-17 Thread Leonardo Bras
So far it's been assumed possible to map the guest RAM 1:1 to the bus,
which works with a small number of devices. SRIOV changes this, as the
user can configure hundreds of VFs; since phyp preallocates TCEs and
does not allow IOMMU pages bigger than 64K, it has to limit the number
of TCEs per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for
the default DMA window, and can also use much larger page sizes (16MB
as implemented in QEMU vs 4K from the default DMA window), significantly
increasing the maximum amount of memory that can be IOMMU-mapped at the
same time.
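
As a rough illustration of that scaling (the TCE count below is a
hypothetical number, not taken from this patch), the same number of
TCEs maps vastly more memory with 16MB pages than with 4K pages:

#include <stdio.h>

/* Illustration only: memory mappable by a fixed number of TCEs as a
 * function of the IOMMU page size. The TCE count is an assumption. */
int main(void)
{
	unsigned long long tces = 1ULL << 21;	/* assume 2M TCEs in the window */
	unsigned long long sz_4k = 1ULL << 12;	/* 4K IOMMU pages */
	unsigned long long sz_16m = 1ULL << 24;	/* 16MB IOMMU pages */

	printf("4K pages:   %llu GiB mappable\n", (tces * sz_4k) >> 30);
	printf("16MB pages: %llu GiB mappable\n", (tces * sz_16m) >> 30);
	return 0;
}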

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.
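
A minimal sketch of what that re-creation amounts to (the helper name
is made up for illustration; the it_* fields are the actual struct
iommu_table members):

#include <linux/types.h>
#include <asm/iommu.h>

/* Sketch only: point an iommu_table at the new DDW geometry so that
 * iommu_alloc() hands out DMA addresses inside the new window. */
static void ddw_table_setparms(struct iommu_table *tbl, unsigned long liobn,
			       u64 win_addr, u64 win_size, unsigned int page_shift)
{
	tbl->it_index = liobn;				/* hypervisor window handle */
	tbl->it_offset = win_addr >> page_shift;	/* first TCE covers win_addr */
	tbl->it_size = win_size >> page_shift;		/* number of TCEs in the window */
	tbl->it_page_shift = page_shift;		/* e.g. 24 for 16MB pages */
}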

Removing the default DMA window in order to use DDW with indirect
mapping is only allowed if no IOMMU memory is currently allocated in
the iommu_table; otherwise, enable_ddw() is aborted.
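
In code, that check boils down to something like the following sketch
(a hypothetical helper; the real hunk, using the series' new
iommu_table_in_use(), appears at the end of the diff below):

#include <asm/iommu.h>

/* Sketch: once the default window is gone, a table that already has
 * TCEs allocated cannot be safely replaced, so enable_ddw() gives up. */
static bool ddw_replace_allowed(struct iommu_table *tbl, bool default_win_removed)
{
	return !(default_win_removed && iommu_table_in_use(tbl));
}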

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel could assume direct mapping and skip
iommu_alloc(), causing undesirable behavior.
So a new property name, DMA64_PROPNAME ("linux,dma64-ddr-window-info"),
was created to represent a DDW that does not allow direct mapping.
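
The choice between the two property names then reduces to a one-liner
(a sketch of the idea, using the win_name and direct_mapping variables
added by the diff below):

	/* Sketch: advertise the window's capability through the property
	 * name, so a kexec()ed kernel restores the right mapping mode. */
	win_name = direct_mapping ? DIRECT64_PROPNAME : DMA64_PROPNAME;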

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 89 +-
 1 file changed, 74 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e11c00b2dc1e..0eccc29f5573 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -940,6 +941,7 @@ static int find_existing_ddw_windows(void)
return 0;
 
find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+   find_existing_ddw_windows_named(DMA64_PROPNAME);
 
return 0;
 }
@@ -1226,14 +1228,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+   const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
 
dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1242,6 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
	mutex_lock(&direct_window_init_mutex);
 
	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1322,8 +1328,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  query.page_size);
goto out_failed;
}
-   /* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
+
+
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1339,13 +1345,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
		dev_info(&dev->dev, "Skipping ibm,pmemory");
}
 
+   /* check if the available block * number of ptes will map everything */
if (query.largest_available_block < (1ULL << (len - page_shift))) {
		dev_dbg(&dev->dev,
			"can't map partition max 0x%llx with %llu %llu-sized pages\n",
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
-   goto out_failed;
+
+		/* DDW + IOMMU on single window may fail if there is any allocation */
+   if (default_win_removed && iommu_table_in_use(tbl)) {
+