[PATCH] powerpc/pseries/iommu: Split Dynamic DMA Window to be used in Hybrid mode

2024-05-13 Thread Gaurav Batra
Dynamic DMA Window (DDW) supports TCEs that are backed by 2MB page size.
In most configurations, DDW is big enough to pre-map all of LPAR memory
for IO. Pre-mapping of memory for DMA results in improvements in IO
performance.

Persistent memory, vPMEM, can be assigned to an LPAR as well. vPMEM is not
contiguous with LPAR memory and is usually assigned at high memory
addresses. This makes it impossible to pre-map both vPMEM and LPAR
memory in the same DDW.

For a dedicated adapter this limitation is not an issue. Dedicated
adapters can have both the default DMA window, which is backed by 4K page
size TCEs, and a DDW backed by 2MB page size TCEs. In this scenario, LPAR
memory is pre-mapped in the DDW. Any DMA going to the vPMEM is routed via
dynamically allocated TCEs in the default window.

The issue arises with SR-IOV adapters. There is only one DMA window -
either the default window or a DDW. If an LPAR has vPMEM assigned, memory
is not pre-mapped in the DDW since TCEs need to be allocated for vPMEM as
well. In this case, a DDW is created and TCEs are dynamically allocated
for both vPMEM and LPAR memory.

Today, a DDW is used in only a single mode - either direct-mapped TCEs or
dynamically mapped TCEs. This enhancement splits a single DDW into 2 regions -

1. First region to pre-map LPAR memory
2. Second region to dynamically allocate TCEs for IO to vPMEM

The DDW is split only if it is big enough to pre-map complete LPAR memory
and still have some space left to dynamically map vPMEM. The maximum size
DDW permitted by the Hypervisor is created.

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/include/asm/iommu.h       |  2 ++
 arch/powerpc/platforms/pseries/iommu.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 026695943550..bb252a15cd4c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -31,6 +31,8 @@
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
 #define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
+#define MIN_DDW_VPMEM_DMA_WINDOW	SZ_2G
+
 /* Boot time flags */
 extern int iommu_is_off;
 extern int iommu_force_on;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..b0a8e8d2159e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1296,7 +1296,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 win_addr;
+   u64 win_addr, dynamic_offset = 0;
const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
@@ -1304,6 +1304,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct property *win64;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false, direct_mapping = false;
+   bool dynamic_mapping = false;
bool pmem_present;
struct pci_dn *pci = PCI_DN(pdn);
struct property *default_win = NULL;
@@ -1399,7 +1400,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_failed;
}
 
-
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1424,14 +1424,43 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
1ULL << page_shift);
 
len = order_base_2(query.largest_available_block << page_shift);
-   win_name = DMA64_PROPNAME;
+
+   dynamic_mapping = true;
} else {
direct_mapping = !default_win_removed ||
(len == MAX_PHYSMEM_BITS) ||
(!pmem_present && (len == max_ram_len));
-   win_name = direct_mapping ? DIRECT64_PROPNAME : DMA64_PROPNAME;
+
+   /* DDW is big enough to direct map RAM. If there is vPMEM, check
+* if enough space is left in DDW where we can dynamically
+* allocate TCEs for vPMEM. For now, this Hybrid sharing of DDW
+* is only for SR-IOV devices.
+*/
+   if (default_win_removed && pmem_present && !direct_mapping) {
+   /* DDW is big enough to be split */
+   if ((query.largest_available_block << page_shift) >=
+MIN_DDW_VPMEM_DMA_WINDOW + (1ULL << max_ram_len)) {
+
+   direct_mapping = true;
+
+   /* offset of the Dynamic part of DDW */
+   dynamic_offset = 1ULL << max_ram_len

[PATCH v2] powerpc/pseries/iommu: LPAR panics during boot up with a frozen PE

2024-04-22 Thread Gaurav Batra
At the time of LPAR boot up, partition firmware provides Open Firmware
property ibm,dma-window for the PE. This property is provided on the PCI
bus the PE is attached to.

There are exceptions where the partition firmware might not provide this
property for the PE at the time of LPAR boot up. One such scenario is
where the firmware has frozen the PE due to some error condition. The
PE remains frozen for 24 hours, or until the whole system is reinitialized.

Within this time frame, if the LPAR is booted, the frozen PE will be
presented to the LPAR but ibm,dma-window property could be missing.

Today, under these circumstances, the LPAR oopses with a NULL pointer
dereference when configuring the PCI bus the PE is attached to.

BUG: Kernel NULL pointer dereference on read at 0x00c8
Faulting instruction address: 0xc01024c0
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
Supported: Yes
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
(NM1060_023) hv:phyp pSeries
NIP:  c01024c0 LR: c01024b0 CTR: c0102450
REGS: c37db5c0 TRAP: 0300   Not tainted  (6.4.0-150600.9-default)
MSR:  82009033   CR: 28000822  XER: 
CFAR: c010254c DAR: 00c8 DSISR: 0008 IRQMASK: 0
...
NIP [c01024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0
LR [c01024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0
Call Trace:
pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable)
pcibios_setup_bus_self+0x1c0/0x370
__of_scan_bus+0x2f8/0x330
pcibios_scan_phb+0x280/0x3d0
pcibios_init+0x88/0x12c
do_one_initcall+0x60/0x320
kernel_init_freeable+0x344/0x3e4
kernel_init+0x34/0x1d0
ret_from_kernel_user_thread+0x14/0x1c

Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window")
Signed-off-by: Gaurav Batra 
---
 arch/powerpc/platforms/pseries/iommu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..b1e6d275cda9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -786,8 +786,16 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 * parent bus. During reboot, there will be ibm,dma-window property to
 * define DMA window. For kdump, there will at least be default window or DDW
 * or both.
+* There is an exception to the above. In case the PE goes into frozen
+* state, firmware may not provide ibm,dma-window property at the time
+* of LPAR boot up.
 */
 
+   if (!pdn) {
+   pr_debug("  no ibm,dma-window property !\n");
+   return;
+   }
+
ppci = PCI_DN(pdn);
 
pr_debug("  parent is %pOF, iommu_table: 0x%p\n",

base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
-- 
2.39.3 (Apple Git-146)



Re: [PATCH] powerpc/pseries/iommu: LPAR panics when rebooted with a frozen PE

2024-04-19 Thread Gaurav Batra
You are right. I think the "reboot" should be replaced with just "boot 
up". If there are no other comments or code changes, I can re-word the 
commit message and submit it for review.


Thanks,

Gaurav

On 4/19/24 6:11 AM, Michal Suchánek wrote:

Hello,

On Fri, Apr 19, 2024 at 04:12:46PM +1000, Michael Ellerman wrote:

Gaurav Batra  writes:

At the time of LPAR reboot, partition firmware provides Open Firmware
property ibm,dma-window for the PE. This property is provided on the PCI
bus the PE is attached to.

AFAICS you're actually describing a bug that happens during boot *up*?

Describing it as "reboot" makes me think you're talking about the
shutdown path. I think that will confuse people, me at least :)

there is probably an assumption that it must have been running
previously for the errors to happen in the first place but given the
error state persists for a day it may be a very long 'reboot'.

Thanks

Michal

cheers


There are exceptions where the partition firmware might not provide this
property for the PE at the time of LPAR reboot. One such scenario is
where the firmware has frozen the PE due to some error conditions. The
PE remains frozen for 24 hours, or until the whole system is reinitialized.

Within this time frame, if the LPAR is rebooted, the frozen PE will be
presented to the LPAR but ibm,dma-window property could be missing.

Today, under these circumstances, the LPAR oopses with NULL pointer
dereference, when configuring the PCI bus the PE is attached to.

BUG: Kernel NULL pointer dereference on read at 0x00c8
Faulting instruction address: 0xc01024c0
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
Supported: Yes
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
(NM1060_023) hv:phyp pSeries
NIP:  c01024c0 LR: c01024b0 CTR: c0102450
REGS: c37db5c0 TRAP: 0300   Not tainted  (6.4.0-150600.9-default)
MSR:  82009033   CR: 28000822  XER: 
CFAR: c010254c DAR: 00c8 DSISR: 0008 IRQMASK: 0
...
NIP [c01024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0
LR [c01024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0
Call Trace:
pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable)
pcibios_setup_bus_self+0x1c0/0x370
__of_scan_bus+0x2f8/0x330
pcibios_scan_phb+0x280/0x3d0
pcibios_init+0x88/0x12c
do_one_initcall+0x60/0x320
kernel_init_freeable+0x344/0x3e4
kernel_init+0x34/0x1d0
ret_from_kernel_user_thread+0x14/0x1c

Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window")
Signed-off-by: Gaurav Batra 
---
  arch/powerpc/platforms/pseries/iommu.c | 8 ++++++++
  1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..e808d5b1fa49 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -786,8 +786,16 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 * parent bus. During reboot, there will be ibm,dma-window property to
 * define DMA window. For kdump, there will at least be default window or DDW
 * or both.
+* There is an exception to the above. In case the PE goes into frozen
+* state, firmware may not provide ibm,dma-window property at the time
+* of LPAR reboot.
 */
  
+	if (!pdn) {

+   pr_debug("  no ibm,dma-window property !\n");
+   return;
+   }
+
ppci = PCI_DN(pdn);
  
  	pr_debug("  parent is %pOF, iommu_table: 0x%p\n",


base-commit: 2c71fdf02a95b3dd425b42f28fd47fb2b1d22702
--
2.39.3 (Apple Git-146)


[PATCH] powerpc/pseries/iommu: LPAR panics when rebooted with a frozen PE

2024-04-16 Thread Gaurav Batra
At the time of LPAR reboot, partition firmware provides Open Firmware
property ibm,dma-window for the PE. This property is provided on the PCI
bus the PE is attached to.

There are exceptions where the partition firmware might not provide this
property for the PE at the time of LPAR reboot. One such scenario is
where the firmware has frozen the PE due to some error conditions. The
PE remains frozen for 24 hours, or until the whole system is reinitialized.

Within this time frame, if the LPAR is rebooted, the frozen PE will be
presented to the LPAR but ibm,dma-window property could be missing.

Today, under these circumstances, the LPAR oopses with NULL pointer
dereference, when configuring the PCI bus the PE is attached to.

BUG: Kernel NULL pointer dereference on read at 0x00c8
Faulting instruction address: 0xc01024c0
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
Supported: Yes
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
(NM1060_023) hv:phyp pSeries
NIP:  c01024c0 LR: c01024b0 CTR: c0102450
REGS: c37db5c0 TRAP: 0300   Not tainted  (6.4.0-150600.9-default)
MSR:  82009033   CR: 28000822  XER: 
CFAR: c010254c DAR: 00c8 DSISR: 0008 IRQMASK: 0
...
NIP [c01024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0
LR [c01024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0
Call Trace:
pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable)
pcibios_setup_bus_self+0x1c0/0x370
__of_scan_bus+0x2f8/0x330
pcibios_scan_phb+0x280/0x3d0
pcibios_init+0x88/0x12c
do_one_initcall+0x60/0x320
kernel_init_freeable+0x344/0x3e4
kernel_init+0x34/0x1d0
ret_from_kernel_user_thread+0x14/0x1c

Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window")
Signed-off-by: Gaurav Batra 
---
 arch/powerpc/platforms/pseries/iommu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..e808d5b1fa49 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -786,8 +786,16 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 * parent bus. During reboot, there will be ibm,dma-window property to
 * define DMA window. For kdump, there will at least be default window or DDW
 * or both.
+* There is an exception to the above. In case the PE goes into frozen
+* state, firmware may not provide ibm,dma-window property at the time
+* of LPAR reboot.
 */
 
+   if (!pdn) {
+   pr_debug("  no ibm,dma-window property !\n");
+   return;
+   }
+
ppci = PCI_DN(pdn);
 
pr_debug("  parent is %pOF, iommu_table: 0x%p\n",

base-commit: 2c71fdf02a95b3dd425b42f28fd47fb2b1d22702
-- 
2.39.3 (Apple Git-146)



[PATCH v3] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-02-15 Thread Gaurav Batra
Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/include/asm/ppc-pci.h         | 10 ++++++++++
 arch/powerpc/kernel/iommu.c                | 23 +++++++++++++++++++------
 arch/powerpc/platforms/pseries/pci_dlpar.c |  4 ++++
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index ce2b1b5eebdd..a8b7e8682f5b 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -30,6 +30,16 @@ void *pci_traverse_device_nodes(struct device_node *start,
void *data);
 extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
 
+#if defined(CONFIG_IOMMU_API) && (defined(CONFIG_PPC_PSERIES) || \
+ defined(CONFIG_PPC_POWERNV))
+extern void ppc_iommu_register_device(struct pci_controller *phb);
+extern void ppc_iommu_unregister_device(struct pci_controller *phb);
+#else
+static inline void ppc_iommu_register_device(struct pci_controller *phb) { }
+static inline void ppc_iommu_unregister_device(struct pci_controller *phb) { }
+#endif
+
+
 /* From rtas_pci.h */
 extern void init_pci_config_tokens (void);
 extern unsigned long get_phb_buid (struct device_node *);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd462..3425b917c8d6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1339,7 +1339,7 @@ static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
struct pci_controller *hose;
 
if (!dev_is_pci(dev))
-   return ERR_PTR(-EPERM);
+   return ERR_PTR(-ENODEV);
 
pdev = to_pci_dev(dev);
hose = pdev->bus->sysdata;
@@ -1388,6 +1388,21 @@ static const struct attribute_group *spapr_tce_iommu_groups[] = {
NULL,
 };
 
+void ppc_iommu_register_device(struct pci_controller *phb)
+{
+	iommu_device_sysfs_add(&phb->iommu, phb->parent,
+			spapr_tce_iommu_groups, "iommu-phb%04x",
+			phb->global_number);
+	iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops,
+			phb->parent);
+}
+
+void ppc_iommu_unregister_device(struct pci_controller *phb)
+{
+	iommu_device_unregister(&phb->iommu);
+	iommu_device_sysfs_remove(&phb->iommu);
+}
+
 /*
  * This registers IOMMU devices of PHBs. This needs to happen
  * after core_initcall(iommu_init) + postcore_initcall(pci_driver_init) and
@@ -1398,11 +1413,7 @@ static int __init spapr_tce_setup_phb_iommus_initcall(void)
struct pci_controller *hose;
 
	list_for_each_entry(hose, &hose_list, list_node) {
-		iommu_device_sysfs_add(&hose->iommu, hose->parent,
-				       spapr_tce_iommu_groups, "iommu-phb%04x",
-				       hose->global_number);
-		iommu_device_register(&hose->iommu, &spapr_tce_iommu_ops,
-				      hose->parent);
+   ppc_iommu_register_device(hose);
}
return 0;
 }
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 4ba824568119..4448386268d9 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -35,6 +35,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
 
pseries_msi_allocate_domains(phb);
 
+   ppc_iommu_register_device(phb);
+
/* Create EEH devices for the PHB */
eeh_phb_pe_create(phb);
 
@@ -76,6 +78,8 @@ int remove_phb_dynamic(struct pci_controller *phb)
}
}
 
+   ppc_iommu_unregister_device(phb);
+
pseries_msi_free_domains(phb);
 
/* Keep a reference so phb isn't freed yet */

base-commit: ecb1b8288dc7ccbdcb3b9df005fa1c0e0c0388a7
-- 
2.39.3 (Apple Git-145)



Re: [PATCH] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-02-07 Thread Gaurav Batra

Hello All,

There is still some issue even after applying the patch. This is not a 
complete fix. I am working on V3 of the patch. Please do not merge this 
patch upstream.


Thanks,

Gaurav

On 1/10/24 4:53 PM, Gaurav Batra wrote:

When a PCI device is Dynamically added, LPAR OOPS with NULL pointer
exception.

Complete stack is as below

[  211.239206] BUG: Kernel NULL pointer dereference on read at 0x0030
[  211.239210] Faulting instruction address: 0xc06bbe5c
[  211.239214] Oops: Kernel access of bad area, sig: 11 [#1]
[  211.239218] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  211.239223] Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 
auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding 
nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc 
rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser 
libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs 
ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core 
mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto 
pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
[  211.239280] CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
[  211.239284] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
[  211.239289] NIP:  c06bbe5c LR: c0a13e68 CTR: c00579f8
[  211.239293] REGS: c0009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
[  211.239298] MSR:  80009033   CR: 24002220  
XER: 20040006
[  211.239306] CFAR: c0a13e64 DAR: 0030 DSISR: 4000 
IRQMASK: 0
[  211.239306] GPR00: c0a13e68 c0009924f4e0 c15a2b00 

[  211.239306] GPR04: c13c5590  c6d07970 
c000d8f8f180
[  211.239306] GPR08: 06ec c000d8f8f180 c2c35d58 
24002228
[  211.239306] GPR12: c00579f8 c003ffeb3880  

[  211.239306] GPR16:    

[  211.239306] GPR20:    

[  211.239306] GPR24: c000919460c0  f000 
c10088e8
[  211.239306] GPR28: c13c5590 c6d07970 c000919460c0 
c000919460c0
[  211.239354] NIP [c06bbe5c] sysfs_add_link_to_group+0x34/0x94
[  211.239361] LR [c0a13e68] iommu_device_link+0x5c/0x118
[  211.239367] Call Trace:
[  211.239369] [c0009924f4e0] [c0a109b8] 
iommu_init_device+0x26c/0x318 (unreliable)
[  211.239376] [c0009924f520] [c0a13e68] 
iommu_device_link+0x5c/0x118
[  211.239382] [c0009924f560] [c0a107f4] 
iommu_init_device+0xa8/0x318
[  211.239387] [c0009924f5c0] [c0a11a08] 
iommu_probe_device+0xc0/0x134
[  211.239393] [c0009924f600] [c0a11ac0] 
iommu_bus_notifier+0x44/0x104
[  211.239398] [c0009924f640] [c018dcc0] 
notifier_call_chain+0xb8/0x19c
[  211.239405] [c0009924f6a0] [c018df88] 
blocking_notifier_call_chain+0x64/0x98
[  211.239411] [c0009924f6e0] [c0a250fc] bus_notify+0x50/0x7c
[  211.239416] [c0009924f720] [c0a20838] device_add+0x640/0x918
[  211.239421] [c0009924f7f0] [c08f1a34] pci_device_add+0x23c/0x298
[  211.239427] [c0009924f840] [c0077460] 
of_create_pci_dev+0x400/0x884
[  211.239432] [c0009924f8e0] [c0077a08] of_scan_pci_dev+0x124/0x1b0
[  211.239437] [c0009924f980] [c0077b0c] __of_scan_bus+0x78/0x18c
[  211.239442] [c0009924fa10] [c0073f90] 
pcibios_scan_phb+0x2a4/0x3b0
[  211.239447] [c0009924fad0] [c01007a8] init_phb_dynamic+0xb8/0x110
[  211.239453] [c0009924fb40] [c00806920620] dlpar_add_slot+0x170/0x3b8 
[rpadlpar_io]
[  211.239461] [c0009924fbe0] [c00806920d64] 
add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
[  211.239468] [c0009924fc70] [c0fb4144] kobj_attr_store+0x2c/0x48
[  211.239473] [c0009924fc90] [c06b90e4] sysfs_kf_write+0x64/0x78
[  211.239479] [c0009924fcb0] [c06b7b78] 
kernfs_fop_write_iter+0x1b0/0x290
[  211.239485] [c0009924fd00] [c05b6fdc] vfs_write+0x350/0x4a0
[  211.239491] [c0009924fdc0] [c05b7450] ksys_write+0x84/0x140
[  211.239496] [c0009924fe10] [c0030a04] 
system_call_exception+0x124/0x330
[  211.239502] [c0009924fe50] [c000cedc] 
system_call_vectored_common+0x15c/0x2ec

Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR ADD of pci devices.

The above commit added an iommu_device structure to pci_controller. During
system boot, pci devices are discovered and this newly added iommu_device
structure is initialized by a call to iommu_device_register().

During

[PATCH v2] powerpc/pseries/iommu: IOMMU table is not initialized for kdump over SR-IOV

2024-01-25 Thread Gaurav Batra
From: Gaurav Batra 

When the kdump kernel tries to copy dump data over SR-IOV, the LPAR panics
due to a NULL pointer exception.

Here is the complete stack

[   19.944378] Kernel attempted to read user page (0) - exploit attempt? (uid: 
0)^M
[   19.944388] BUG: Kernel NULL pointer dereference on read at 0x^M
[   19.944394] Faulting instruction address: 0xc00020847ad4^M
[   19.944400] Oops: Kernel access of bad area, sig: 11 [#1]^M
[   19.944406] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries^M
[   19.944413] Modules linked in: mlx5_core(+) vmx_crypto pseries_wdt papr_scm 
libnvdimm mlxfw tls psample sunrpc fuse overlay squashfs loop^M
[   19.944435] CPU: 12 PID: 315 Comm: systemd-udevd Not tainted 6.4.0-Test102+ 
#12^M
[   19.92] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries^M
[   19.944450] NIP:  c00020847ad4 LR: c0002083b2dc CTR: 
006cd18c^M
[   19.944457] REGS: c00029162ca0 TRAP: 0300   Not tainted  
(6.4.0-Test102+)^M
[   19.944463] MSR:  8280b033   CR: 
48288244  XER: 0008^M
[   19.944480] CFAR: c0002083b2d8 DAR:  DSISR: 4000 
IRQMASK: 1 ^M
[   19.944480] GPR00: c0002088ecd4 c00029162f40 c00021542900 
 ^M
[   19.944480] GPR04:    
 ^M
[   19.944480] GPR08:    
0001 ^M
[   19.944480] GPR12:  c00022ec1000 4000 
 ^M
[   19.944480] GPR16: c0002b8b3900 c0002b8b3918 c0002b8b3800 
 ^M
[   19.944480] GPR20:    
 ^M
[   19.944480] GPR24:  0001  
 ^M
[   19.944480] GPR28:    
 ^M
[   19.944553] NIP [c00020847ad4] _find_next_zero_bit+0x24/0x110^M
[   19.944564] LR [c0002083b2dc] bitmap_find_next_zero_area_off+0x5c/0xe0^M
[   19.944572] Call Trace:^M
[   19.944576] [c00029162f40] [c000209fec70] dev_printk_emit+0x38/0x48 
(unreliable)^M
[   19.944587] [c00029162fa0] [c0002088ecd4] 
iommu_area_alloc+0xc4/0x180^M
[   19.944596] [c00029163020] [c0002005a3d8] 
iommu_range_alloc+0x1e8/0x580^M
[   19.944606] [c00029163150] [c0002005a7d0] iommu_alloc+0x60/0x130^M
[   19.944613] [c000291631a0] [c0002005b9f8] 
iommu_alloc_coherent+0x158/0x2b0^M
[   19.944621] [c00029163260] [c00020058fdc] 
dma_iommu_alloc_coherent+0x3c/0x50^M
[   19.944629] [c00029163280] [c00020235260] 
dma_alloc_attrs+0x170/0x1f0^M
[   19.944637] [c000291632f0] [c0080140c058] mlx5_cmd_init+0xc0/0x760 
[mlx5_core]^M
[   19.944745] [c000291633c0] [c00801403448] 
mlx5_function_setup+0xf0/0x510 [mlx5_core]^M
[   19.944845] [c00029163490] [c008014039bc] mlx5_init_one+0x84/0x210 
[mlx5_core]^M
[   19.944943] [c00029163520] [c00801404c00] probe_one+0x118/0x2c0 
[mlx5_core]^M
[   19.945041] [c000291635b0] [c000208ddce8] 
local_pci_probe+0x68/0x110^M
[   19.945049] [c00029163630] [c000208deb98] pci_call_probe+0x68/0x200^M
[   19.945057] [c00029163790] [c000208dfecc] 
pci_device_probe+0xbc/0x1a0^M
[   19.945065] [c000291637d0] [c00020a032f4] really_probe+0x104/0x540^M
[   19.945072] [c00029163850] [c00020a037e4] 
__driver_probe_device+0xb4/0x230^M
[   19.945080] [c000291638d0] [c00020a039b4] 
driver_probe_device+0x54/0x130^M
[   19.945088] [c00029163910] [c00020a03da8] 
__driver_attach+0x158/0x2b0^M
[   19.945096] [c00029163990] [c000209ffa78] 
bus_for_each_dev+0xa8/0x130^M
[   19.945103] [c000291639f0] [c00020a026c4] driver_attach+0x34/0x50^M
[   19.945110] [c00029163a10] [c00020a01aac] 
bus_add_driver+0x16c/0x300^M
[   19.945118] [c00029163aa0] [c00020a05944] 
driver_register+0xa4/0x1b0^M
[   19.945126] [c00029163b10] [c000208dd428] 
__pci_register_driver+0x68/0x80^M
[   19.945133] [c00029163b30] [c008015414cc] mlx5_init+0xb8/0x100 
[mlx5_core]^M
[   19.945247] [c00029163ba0] [c00020012f60] 
do_one_initcall+0x60/0x300^M
[   19.945255] [c00029163c80] [c00020241cbc] do_init_module+0x7c/0x2b0^M

At the time of LPAR dump, before kexec hands over control to the kdump
kernel, DDWs are scanned and added to the FDT. For the SR-IOV case, the
default DMA window "ibm,dma-window" is removed from the FDT and a DDW is
added for the device.

Now, kexec hands over control to the kdump kernel.

When the kdump kernel initializes, PCI buses are scanned and IOMMU
groups/tables are created, in pci_dma_bus_setup_pSeriesLP(). For the SR-IOV
case, there is no "ibm,dma-window". The original commit b1fc44eaa9ba
fixed the path where memory is pre-mapped (direct mapped) to the DDW. When
TCEs are direct mapped, there is no need

[PATCH v2] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-01-22 Thread Gaurav Batra
Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra 
---
 arch/powerpc/include/asm/ppc-pci.h         |  3 +++
 arch/powerpc/kernel/iommu.c                | 21 ++++++++++++++++-----
 arch/powerpc/platforms/pseries/pci_dlpar.c |  4 ++++
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index ce2b1b5eebdd..c3a3f3df36d1 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -29,6 +29,9 @@ void *pci_traverse_device_nodes(struct device_node *start,
void *(*fn)(struct device_node *, void *),
void *data);
 extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
+extern void ppc_iommu_register_device(struct pci_controller *phb);
+extern void ppc_iommu_unregister_device(struct pci_controller *phb);
+
 
 /* From rtas_pci.h */
 extern void init_pci_config_tokens (void);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd462..c6f62e130d55 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1388,6 +1388,21 @@ static const struct attribute_group *spapr_tce_iommu_groups[] = {
NULL,
 };
 
+void ppc_iommu_register_device(struct pci_controller *phb)
+{
+	iommu_device_sysfs_add(&phb->iommu, phb->parent,
+			spapr_tce_iommu_groups, "iommu-phb%04x",
+			phb->global_number);
+	iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops,
+			phb->parent);
+}
+
+void ppc_iommu_unregister_device(struct pci_controller *phb)
+{
+	iommu_device_unregister(&phb->iommu);
+	iommu_device_sysfs_remove(&phb->iommu);
+}
+
 /*
  * This registers IOMMU devices of PHBs. This needs to happen
  * after core_initcall(iommu_init) + postcore_initcall(pci_driver_init) and
@@ -1398,11 +1413,7 @@ static int __init spapr_tce_setup_phb_iommus_initcall(void)
struct pci_controller *hose;
 
	list_for_each_entry(hose, &hose_list, list_node) {
-		iommu_device_sysfs_add(&hose->iommu, hose->parent,
-				       spapr_tce_iommu_groups, "iommu-phb%04x",
-				       hose->global_number);
-		iommu_device_register(&hose->iommu, &spapr_tce_iommu_ops,
-				      hose->parent);
+   ppc_iommu_register_device(hose);
}
return 0;
 }
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 4ba824568119..4448386268d9 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -35,6 +35,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
 
pseries_msi_allocate_domains(phb);
 
+   ppc_iommu_register_device(phb);
+
/* Create EEH devices for the PHB */
eeh_phb_pe_create(phb);
 
@@ -76,6 +78,8 @@ int remove_phb_dynamic(struct pci_controller *phb)
}
}
 
+   ppc_iommu_unregister_device(phb);
+
pseries_msi_free_domains(phb);
 
/* Keep a reference so phb isn't freed yet */
-- 
2.39.3 (Apple Git-145)



[PATCH] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-01-10 Thread Gaurav Batra
Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains")
Signed-off-by: Gaurav Batra 
---
 arch/powerpc/include/asm/ppc-pci.h         |  3 +++
 arch/powerpc/kernel/iommu.c                | 15 +++++++++++++++
 arch/powerpc/platforms/pseries/pci_dlpar.c |  4 ++++
 3 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index ce2b1b5eebdd..55a2ba36e9c4 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -29,6 +29,9 @@ void *pci_traverse_device_nodes(struct device_node *start,
void *(*fn)(struct device_node *, void *),
void *data);
 extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
+extern void pci_register_device_dynamic(struct pci_controller *phb);
+extern void pci_unregister_device_dynamic(struct pci_controller *phb);
+
 
 /* From rtas_pci.h */
 extern void init_pci_config_tokens (void);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd462..342739fe74c4 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1388,6 +1388,21 @@ static const struct attribute_group *spapr_tce_iommu_groups[] = {
NULL,
 };
 
+void pci_register_device_dynamic(struct pci_controller *phb)
+{
+	iommu_device_sysfs_add(&phb->iommu, phb->parent,
+			spapr_tce_iommu_groups, "iommu-phb%04x",
+			phb->global_number);
+	iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops,
+			phb->parent);
+}
+
+void pci_unregister_device_dynamic(struct pci_controller *phb)
+{
+	iommu_device_unregister(&phb->iommu);
+	iommu_device_sysfs_remove(&phb->iommu);
+}
+
 /*
  * This registers IOMMU devices of PHBs. This needs to happen
  * after core_initcall(iommu_init) + postcore_initcall(pci_driver_init) and
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 4ba824568119..ec70ca435b7e 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -35,6 +35,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
 
pseries_msi_allocate_domains(phb);
 
+   pci_register_device_dynamic(phb);
+
/* Create EEH devices for the PHB */
eeh_phb_pe_create(phb);
 
@@ -76,6 +78,8 @@ int remove_phb_dynamic(struct pci_controller *phb)
}
}
 
+   pci_unregister_device_dynamic(phb);
+
pseries_msi_free_domains(phb);
 
/* Keep a reference so phb isn't freed yet */
-- 
2.39.3 (Apple Git-145)



Re: [PATCH] powerpc/pseries/iommu: IOMMU table is not initialized for kdump over SR-IOV

2023-12-14 Thread Gaurav Batra



On 12/14/23 4:18 AM, Aneesh Kumar K.V wrote:

Gaurav Batra  writes:


When the kdump kernel tries to copy dump data over SR-IOV, the LPAR panics
due to a NULL pointer exception.

Here is the complete stack

[   19.944378] Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
[   19.944388] BUG: Kernel NULL pointer dereference on read at 0x
[   19.944394] Faulting instruction address: 0xc00020847ad4
[   19.944400] Oops: Kernel access of bad area, sig: 11 [#1]
[   19.944406] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[   19.944413] Modules linked in: mlx5_core(+) vmx_crypto pseries_wdt papr_scm libnvdimm mlxfw tls psample sunrpc fuse overlay squashfs loop
[   19.944435] CPU: 12 PID: 315 Comm: systemd-udevd Not tainted 6.4.0-Test102+ #12
[   19.92] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
[   19.944450] NIP:  c00020847ad4 LR: c0002083b2dc CTR: 006cd18c
[   19.944457] REGS: c00029162ca0 TRAP: 0300   Not tainted  (6.4.0-Test102+)
[   19.944463] MSR:  8280b033   CR: 48288244  XER: 0008
[   19.944480] CFAR: c0002083b2d8 DAR:  DSISR: 4000 IRQMASK: 1
[   19.944480] GPR00: c0002088ecd4 c00029162f40 c00021542900
[   19.944480] GPR04:
[   19.944480] GPR08:    0001
[   19.944480] GPR12:  c00022ec1000 4000
[   19.944480] GPR16: c0002b8b3900 c0002b8b3918 c0002b8b3800
[   19.944480] GPR20:
[   19.944480] GPR24:  0001
[   19.944480] GPR28:
[   19.944553] NIP [c00020847ad4] _find_next_zero_bit+0x24/0x110
[   19.944564] LR [c0002083b2dc] bitmap_find_next_zero_area_off+0x5c/0xe0
[   19.944572] Call Trace:
[   19.944576] [c00029162f40] [c000209fec70] dev_printk_emit+0x38/0x48 (unreliable)
[   19.944587] [c00029162fa0] [c0002088ecd4] iommu_area_alloc+0xc4/0x180
[   19.944596] [c00029163020] [c0002005a3d8] iommu_range_alloc+0x1e8/0x580
[   19.944606] [c00029163150] [c0002005a7d0] iommu_alloc+0x60/0x130
[   19.944613] [c000291631a0] [c0002005b9f8] iommu_alloc_coherent+0x158/0x2b0
[   19.944621] [c00029163260] [c00020058fdc] dma_iommu_alloc_coherent+0x3c/0x50
[   19.944629] [c00029163280] [c00020235260] dma_alloc_attrs+0x170/0x1f0
[   19.944637] [c000291632f0] [c0080140c058] mlx5_cmd_init+0xc0/0x760 [mlx5_core]
[   19.944745] [c000291633c0] [c00801403448] mlx5_function_setup+0xf0/0x510 [mlx5_core]
[   19.944845] [c00029163490] [c008014039bc] mlx5_init_one+0x84/0x210 [mlx5_core]
[   19.944943] [c00029163520] [c00801404c00] probe_one+0x118/0x2c0 [mlx5_core]
[   19.945041] [c000291635b0] [c000208ddce8] local_pci_probe+0x68/0x110
[   19.945049] [c00029163630] [c000208deb98] pci_call_probe+0x68/0x200
[   19.945057] [c00029163790] [c000208dfecc] pci_device_probe+0xbc/0x1a0
[   19.945065] [c000291637d0] [c00020a032f4] really_probe+0x104/0x540
[   19.945072] [c00029163850] [c00020a037e4] __driver_probe_device+0xb4/0x230
[   19.945080] [c000291638d0] [c00020a039b4] driver_probe_device+0x54/0x130
[   19.945088] [c00029163910] [c00020a03da8] __driver_attach+0x158/0x2b0
[   19.945096] [c00029163990] [c000209ffa78] bus_for_each_dev+0xa8/0x130
[   19.945103] [c000291639f0] [c00020a026c4] driver_attach+0x34/0x50
[   19.945110] [c00029163a10] [c00020a01aac] bus_add_driver+0x16c/0x300
[   19.945118] [c00029163aa0] [c00020a05944] driver_register+0xa4/0x1b0
[   19.945126] [c00029163b10] [c000208dd428] __pci_register_driver+0x68/0x80
[   19.945133] [c00029163b30] [c008015414cc] mlx5_init+0xb8/0x100 [mlx5_core]
[   19.945247] [c00029163ba0] [c00020012f60] do_one_initcall+0x60/0x300
[   19.945255] [c00029163c80] [c00020241cbc] do_init_module+0x7c/0x2b0

At the time of LPAR dump, before kexec hands over control to the kdump
kernel, DDWs are scanned and added to the FDT. For the SR-IOV case, the
default DMA window "ibm,dma-window" is removed from the FDT and a DDW is
added for the device.

Now, kexec hands over control to the kdump kernel.

When the kdump kernel initializes, PCI buses are scanned and IOMMU
groups/tables are created, in pci_dma_bus_setup_pSeriesLP(). For the SR-IOV
case, there is no "ibm,dma-window". As a result, due to a code bug, the
newly created IOMMU table is not initialized. Later, when the

[PATCH] powerpc/pseries/iommu: IOMMU table is not initialized for kdump over SR-IOV

2023-12-13 Thread Gaurav Batra
ption is thrown
from iommu_area_alloc().

The fix is to initialize the IOMMU table with the DDW property stored in
the FDT. There are 2 points to remember -

1. For the dedicated adapter, kdump kernel would encounter both
   default and DDW in FDT. In this case, DDW property is used to
   initialize the IOMMU table.

2. A DDW could be direct or dynamic mapped. kdump kernel would
   initialize IOMMU table and mark the existing DDW as
   "dynamic". This works fine since, at the time of table
   initialization, iommu_table_clear() makes some space in the
   DDW, for some predefined number of TCEs which are needed for
   kdump to succeed.
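The two rules above can be sketched as a small selection helper. This is an illustrative stand-in, not the kernel function: `pick_window()` is a hypothetical name, and the enum values simply model which FDT property the kdump kernel would use to initialize the IOMMU table (a DDW, direct or dynamic, takes precedence over the 4K default window).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the window-selection rule described above.
 * Names are hypothetical; this only models the precedence, not the
 * actual FDT parsing done in pci_dma_find()/iommu_table setup. */

enum win_choice { WIN_NONE, WIN_DEFAULT, WIN_DDW };

enum win_choice pick_window(bool has_default, bool has_ddw)
{
	if (has_ddw)
		return WIN_DDW;		/* dedicated adapter under kdump: DDW wins */
	if (has_default)
		return WIN_DEFAULT;	/* normal boot: only the default window */
	return WIN_NONE;		/* no window found - unexpected for PCI */
}
```

For an SR-IOV function under kdump only the DDW property is present, so `pick_window(false, true)` models that case; a dedicated adapter has both properties and still resolves to the DDW.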

Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window")
Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/platforms/pseries/iommu.c | 150 -
 1 file changed, 99 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 496e16c588aa..6f022c51bd57 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -574,29 +574,6 @@ static void iommu_table_setparms(struct pci_controller *phb,
 
 struct iommu_table_ops iommu_table_lpar_multi_ops;
 
-/*
- * iommu_table_setparms_lpar
- *
- * Function: On pSeries LPAR systems, return TCE table info, given a pci bus.
- */
-static void iommu_table_setparms_lpar(struct pci_controller *phb,
- struct device_node *dn,
- struct iommu_table *tbl,
- struct iommu_table_group *table_group,
- const __be32 *dma_window)
-{
-   unsigned long offset, size, liobn;
-
-	of_parse_dma_window(dn, dma_window, &liobn, &offset, &size);
-
-	iommu_table_setparms_common(tbl, phb->bus->number, liobn, offset, size, IOMMU_PAGE_SHIFT_4K, NULL,
-				    &iommu_table_lpar_multi_ops);
-
-
-   table_group->tce32_start = offset;
-   table_group->tce32_size = size;
-}
-
 struct iommu_table_ops iommu_table_pseries_ops = {
.set = tce_build_pSeries,
.clear = tce_free_pSeries,
@@ -724,26 +701,71 @@ struct iommu_table_ops iommu_table_lpar_multi_ops = {
  * dynamic 64bit DMA window, walking up the device tree.
  */
 static struct device_node *pci_dma_find(struct device_node *dn,
-   const __be32 **dma_window)
+   struct dynamic_dma_window_prop *prop)
 {
-   const __be32 *dw = NULL;
+   const __be32 *default_prop = NULL;
+   const __be32 *ddw_prop = NULL;
+   struct device_node *rdn = NULL;
+   bool default_win = false, ddw_win = false;
 
for ( ; dn && PCI_DN(dn); dn = dn->parent) {
-   dw = of_get_property(dn, "ibm,dma-window", NULL);
-   if (dw) {
-   if (dma_window)
-   *dma_window = dw;
-   return dn;
+   default_prop = of_get_property(dn, "ibm,dma-window", NULL);
+   if (default_prop) {
+   rdn = dn;
+   default_win = true;
+   }
+   ddw_prop = of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (ddw_prop) {
+   rdn = dn;
+   ddw_win = true;
+   break;
+   }
+   ddw_prop = of_get_property(dn, DMA64_PROPNAME, NULL);
+   if (ddw_prop) {
+   rdn = dn;
+   ddw_win = true;
+   break;
}
-   dw = of_get_property(dn, DIRECT64_PROPNAME, NULL);
-   if (dw)
-   return dn;
-   dw = of_get_property(dn, DMA64_PROPNAME, NULL);
-   if (dw)
-   return dn;
+
+		/* At least found default window, which is the case for normal boot */
+		if (default_win)
+			break;
}
 
-   return NULL;
+	/* For PCI devices there will always be a DMA window, either on the device
+	 * or parent bus
+	 */
+   WARN_ON(!(default_win | ddw_win));
+
+   /* caller doesn't want to get DMA window property */
+   if (!prop)
+   return rdn;
+
+	/* parse DMA window property. During normal system boot, only default
+	 * DMA window is passed in OF. But, for kdump, a dedicated adapter might
+	 * have both default and DDW in FDT. In this scenario, DDW takes precedence
+	 * over default window.
+	 */
+   if (ddw_win) {
+   struct dynamic_dma_window_prop *p;
+
+   p = (struct dynamic_dma_window_prop *)ddw_prop;
+   

[PATCH v2] powerpc/pseries/iommu: enable_ddw incorrectly returns direct mapping for SR-IOV device

2023-10-02 Thread Gaurav Batra
When a device is initialized, the driver invokes dma_supported() twice -
first for streaming mappings followed by coherent mappings. For an
SR-IOV device, default window is deleted and DDW created. With vPMEM
enabled, TCE mappings are dynamically created for both vPMEM and SR-IOV
device.  There are no direct mappings.

The first time dma_supported() is called with a 64-bit mask, the DDW is
created and marked as a dynamic window. The second time dma_supported() is
called, enable_ddw() finds the existing window for the device and
incorrectly returns it as a "direct mapping".

This only happens when the DDW is big enough to map the max LPAR memory.

This results in streaming TCEs not getting dynamically mapped, since the
code incorrectly assumes these are already pre-mapped. The adapter
initially comes up but goes down due to EEH.
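The shape of the bug and of the fix can be sketched in a few lines of plain C. This is an illustrative model under assumed names (`struct win`, `lookup_buggy`, `lookup_fixed` are hypothetical, not the kernel's types): the buggy lookup re-derives "direct" from the window size, which is wrong for a dynamic window that happens to be large enough to cover RAM, while the fixed lookup returns the flag recorded when the window was created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached DMA window entry; the kernel's
 * struct dma_win similarly records whether the window was created
 * direct-mapped. */
struct win {
	uint64_t dma_base;
	bool direct;	/* recorded at window-creation time */
};

/* Buggy variant: infers "direct" from whether the window could cover
 * all of RAM - misreports a large *dynamic* window as direct. */
static bool lookup_buggy(const struct win *w, uint64_t win_len,
			 uint64_t max_ram_len, bool *direct)
{
	if (!w)
		return false;
	*direct = (win_len >= max_ram_len);
	return true;
}

/* Fixed variant: returns the flag stored with the window, as the
 * patch does via find_existing_ddw()'s new direct_mapping argument. */
static bool lookup_fixed(const struct win *w, bool *direct)
{
	if (!w)
		return false;
	*direct = w->direct;
	return true;
}
```

With a dynamic window large enough to map all LPAR memory, the buggy variant reports a direct mapping and streaming TCEs are never dynamically mapped; the fixed variant reports what the window actually is.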

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect 
mapping")
Cc: sta...@vger.kernel.org
Signed-off-by: Gaurav Batra 
---
 arch/powerpc/platforms/pseries/iommu.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 16d93b580f61..496e16c588aa 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -914,7 +914,8 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_
return 0;
 }
 
-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift,
+			      bool *direct_mapping)
 {
struct dma_win *window;
const struct dynamic_dma_window_prop *dma64;
@@ -927,6 +928,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *windo
dma64 = window->prop;
*dma_addr = be64_to_cpu(dma64->dma_base);
*window_shift = be32_to_cpu(dma64->window_shift);
+   *direct_mapping = window->direct;
found = true;
break;
}
@@ -1270,10 +1272,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
	mutex_lock(&dma_win_init_mutex);
 
-	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
-		direct_mapping = (len >= max_ram_len);
+	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len, &direct_mapping))
 		goto out_unlock;
-	}
 
/*
 * If we already went through this for a previous function of
-- 
2.39.2 (Apple Git-143)



[PATCH] powerpc/pseries/iommu: enable_ddw incorrectly returns direct mapping for SR-IOV device.

2023-10-02 Thread Gaurav Batra
When a device is initialized, the driver invokes dma_supported() twice - first
for streaming mappings followed by coherent mappings. For an SR-IOV device,
default window is deleted and DDW created. With vPMEM enabled, TCE mappings
are dynamically created for both vPMEM and SR-IOV device. There are no direct
mappings.

First time when dma_supported() is called with 64 bit mask, DDW is created and
marked as dynamic window. The second time dma_supported() is called,
enable_ddw() finds existing window for the device and incorrectly returns it
as "direct mapping".

This only happens when the DDW is capable of mapping the max LPAR memory.

This results in streaming TCEs not getting dynamically mapped, since the
code incorrectly assumes these are already pre-mapped. The adapter
initially comes up but goes down due to EEH.
---
 arch/powerpc/platforms/pseries/iommu.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 16d93b580f61..d8b4adcef1ad 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -914,7 +914,8 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_
return 0;
 }
 
-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift,
+			      bool *direct_mapping)
 {
struct dma_win *window;
const struct dynamic_dma_window_prop *dma64;
@@ -927,6 +928,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *windo
dma64 = window->prop;
*dma_addr = be64_to_cpu(dma64->dma_base);
*window_shift = be32_to_cpu(dma64->window_shift);
+   *direct_mapping = window->direct;
found = true;
break;
}
@@ -1270,8 +1272,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
	mutex_lock(&dma_win_init_mutex);
 
-	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
-		direct_mapping = (len >= max_ram_len);
+	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len, &direct_mapping)) {
 		goto out_unlock;
 	}
 
-- 
2.39.2 (Apple Git-143)



Re: [PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

2023-06-26 Thread Gaurav Batra

Thanks a lot


On 6/25/23 11:54 PM, Michael Ellerman wrote:

Gaurav Batra  writes:

Hello Michael,

Did you get a chance to look into this patch? I don't mean to rush you.
Just wondering if there is anything I can do to help make the patch to
Upstream.

I skimmed it and decided it wasn't a critical bug fix, and hoped someone
else would review it - silly me :D

But the patch looks simple enough, and the explanation is very good so I
think it looks good to merge.

I'll apply it for v6.5.

cheers


On 6/13/23 12:17 PM, Gaurav Batra wrote:

Hello Michael,

I found this bug while going through the code. This bug is exposed when
DDW is smaller than the max memory of the LPAR. This will result in
creating DDW which will have Dynamically mapped TCEs (no direct mapping).

I would like to stress that this bug is exposed only in the Upstream
kernel. Current kernel levels in Distros are not exposed to this since
they don't have the concept of "dynamically mapped" DDW.

I didn't have access to any of the P10 boxes with large amount of
memory to  re-create the scenario. On P10 we have 2MB TCEs, which
results in DDW large enough to be able to cover  max memory I could
have for the LPAR. As a result,  IO Bus Addresses generated were
always within DDW limits and no H_PARAMETER was returned by HCALL.

So, I hacked the kernel to force the use of 64K TCEs. This resulted in
DDW smaller than max memory.

When I tried to DLPAR ADD memory, it failed with error code of -4
(H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when
iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().

I didn't test the DLPAR REMOVE path, to verify if incorrect TCEs are
removed by tce_clearrange_multi_pSeriesLP(), since I would need to
hack kernel to force dynamically added TCEs to the high IO Bus
Addresses. But, the concept is  same.

Thanks,

Gaurav

On 6/13/23 12:16 PM, Gaurav Batra wrote:

When memory is dynamically added/removed, iommu_mem_notifier() is
invoked. This
routine traverses through all the DMA windows (DDW only, not default
windows)
to add/remove "direct" TCE mappings. The routines for this purpose are
tce_setrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().

Both these routines are designed for Direct mapped DMA windows only.

The issue is that there could be some DMA windows in the list which
are not
"direct" mapped. Calling these routines will either,

1) remove some dynamically mapped TCEs, Or
2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER

Here are the side effects when these routines are incorrectly invoked
for
"dynamically" mapped DMA windows.

tce_setrange_multi_pSeriesLP()

This adds direct mapped TCEs. Now, this could invoke HCALL to add
TCEs with
out-of-bound range. In this scenario, HCALL will return H_PARAMETER
and DLPAR
ADD of memory will fail.

tce_clearrange_multi_pSeriesLP()

This will remove range of TCEs. The TCE range that is calculated,
depending on
the memory range being added, could in fact be mapping some other memory
address (for dynamic DMA window scenario). This will wipe out those
TCEs.

The solution is for iommu_mem_notifier() to only invoke these
routines for
"direct" mapped DMA windows.

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
   arch/powerpc/platforms/pseries/iommu.c | 17 +
   1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c
b/arch/powerpc/platforms/pseries/iommu.c
index 918f511837db..24dd61636400 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
   struct dma_win {
   struct device_node *device;
   const struct dynamic_dma_window_prop *prop;
+    bool    direct;
   struct list_head list;
   };

@@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
   goto out_del_prop;

   if (direct_mapping) {
+    window->direct = true;
+
   /* DDW maps the whole partition, so enable direct DMA
mapping */
   ret = walk_system_ram_range(0, memblock_end_of_DRAM() >>
PAGE_SHIFT,
   win64->value,
tce_setrange_multi_pSeriesLP_walk);
@@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
   int i;
   unsigned long start = 0, end = 0;

+    window->direct = false;
+
   for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
   const unsigned long mask = IORESOURCE_MEM_64 |
IORESOURCE_MEM;

@@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct
notifier_block *nb, unsigned long action,
   case MEM_GOING_ONLINE:
    spin_lock(&dma_win_list_lock);
    list_for_each_entry(window, &dma_win_list, list) {
-    ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
-    arg->nr_pages, window->prop);
+   

Re: [PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

2023-06-23 Thread Gaurav Batra

Hello Michael,

Did you get a chance to look into this patch? I don't mean to rush you. 
Just wondering if there is anything I can do to help make the patch to 
Upstream.


Thanks,

Gaurav

On 6/13/23 12:17 PM, Gaurav Batra wrote:

Hello Michael,

I found this bug while going through the code. This bug is exposed when
DDW is smaller than the max memory of the LPAR. This will result in 
creating DDW which will have Dynamically mapped TCEs (no direct mapping).


I would like to stress that this bug is exposed only in the Upstream
kernel. Current kernel levels in Distros are not exposed to this since
they don't have the concept of "dynamically mapped" DDW.


I didn't have access to any of the P10 boxes with large amount of 
memory to  re-create the scenario. On P10 we have 2MB TCEs, which 
results in DDW large enough to be able to cover  max memory I could 
have for the LPAR. As a result,  IO Bus Addresses generated were 
always within DDW limits and no H_PARAMETER was returned by HCALL.


So, I hacked the kernel to force the use of 64K TCEs. This resulted in 
DDW smaller than max memory.


When I tried to DLPAR ADD memory, it failed with error code of -4 
(H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when 
iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().


I didn't test the DLPAR REMOVE path, to verify if incorrect TCEs are 
removed by tce_clearrange_multi_pSeriesLP(), since I would need to 
hack kernel to force dynamically added TCEs to the high IO Bus 
Addresses. But, the concept is  same.


Thanks,

Gaurav

On 6/13/23 12:16 PM, Gaurav Batra wrote:
When memory is dynamically added/removed, iommu_mem_notifier() is 
invoked. This
routine traverses through all the DMA windows (DDW only, not default 
windows)

to add/remove "direct" TCE mappings. The routines for this purpose are
tce_setrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().

Both these routines are designed for Direct mapped DMA windows only.

The issue is that there could be some DMA windows in the list which 
are not

"direct" mapped. Calling these routines will either,

1) remove some dynamically mapped TCEs, Or
2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER

Here are the side effects when these routines are incorrectly invoked
for

"dynamically" mapped DMA windows.

tce_setrange_multi_pSeriesLP()

This adds direct mapped TCEs. Now, this could invoke HCALL to add 
TCEs with
out-of-bound range. In this scenario, HCALL will return H_PARAMETER 
and DLPAR

ADD of memory will fail.

tce_clearrange_multi_pSeriesLP()

This will remove range of TCEs. The TCE range that is calculated, 
depending on

the memory range being added, could in fact be mapping some other memory
address (for dynamic DMA window scenario). This will wipe out those 
TCEs.


The solution is for iommu_mem_notifier() to only invoke these 
routines for

"direct" mapped DMA windows.

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
  arch/powerpc/platforms/pseries/iommu.c | 17 +
  1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c

index 918f511837db..24dd61636400 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
  struct dma_win {
  struct device_node *device;
  const struct dynamic_dma_window_prop *prop;
+    bool    direct;
  struct list_head list;
  };

@@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, 
struct device_node *pdn)

  goto out_del_prop;

  if (direct_mapping) {
+    window->direct = true;
+
  /* DDW maps the whole partition, so enable direct DMA 
mapping */
  ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> 
PAGE_SHIFT,
  win64->value, 
tce_setrange_multi_pSeriesLP_walk);
@@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, 
struct device_node *pdn)

  int i;
  unsigned long start = 0, end = 0;

+    window->direct = false;
+
  for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
  const unsigned long mask = IORESOURCE_MEM_64 | 
IORESOURCE_MEM;


@@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct 
notifier_block *nb, unsigned long action,

  case MEM_GOING_ONLINE:
   spin_lock(&dma_win_list_lock);
   list_for_each_entry(window, &dma_win_list, list) {
-    ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
-    arg->nr_pages, window->prop);
+    if (window->direct) {
+    ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
+    arg->nr_pages, window->prop);
+    }
  /* XXX log error */
  }
   spin_unlock(&dma_win_list_lock);
@@ -15

Re: [PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

2023-06-13 Thread Gaurav Batra

Hello Michael,

I found this bug while going through the code. This bug is exposed when
the DDW is smaller than the max memory of the LPAR. This will result in
creating a DDW which will have dynamically mapped TCEs (no direct mapping).


I would like to stress that this bug is exposed only in the Upstream
kernel. Current kernel levels in Distros are not exposed to this since
they don't have the concept of "dynamically mapped" DDW.


I didn't have access to any of the P10 boxes with large amount of memory 
to  re-create the scenario. On P10 we have 2MB TCEs, which results in 
DDW large enough to be able to cover  max memory I could have for the 
LPAR. As a result,  IO Bus Addresses generated were always within DDW 
limits and no H_PARAMETER was returned by HCALL.


So, I hacked the kernel to force the use of 64K TCEs. This resulted in 
DDW smaller than max memory.


When I tried to DLPAR ADD memory, it failed with error code of -4 
(H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when 
iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().


I didn't test the DLPAR REMOVE path, to verify if incorrect TCEs are 
removed by tce_clearrange_multi_pSeriesLP(), since I would need to hack 
kernel to force dynamically added TCEs to the high IO Bus Addresses. 
But, the concept is  same.


Thanks,

Gaurav

On 6/13/23 12:16 PM, Gaurav Batra wrote:

When memory is dynamically added/removed, iommu_mem_notifier() is invoked. This
routine traverses through all the DMA windows (DDW only, not default windows)
to add/remove "direct" TCE mappings. The routines for this purpose are
tce_setrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().

Both these routines are designed for Direct mapped DMA windows only.

The issue is that there could be some DMA windows in the list which are not
"direct" mapped. Calling these routines will either,

1) remove some dynamically mapped TCEs, Or
2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER

Here are the side effects when these routines are incorrectly invoked for
"dynamically" mapped DMA windows.

tce_setrange_multi_pSeriesLP()

This adds direct mapped TCEs. Now, this could invoke HCALL to add TCEs with
out-of-bound range. In this scenario, HCALL will return H_PARAMETER and DLPAR
ADD of memory will fail.

tce_clearrange_multi_pSeriesLP()

This will remove range of TCEs. The TCE range that is calculated, depending on
the memory range being added, could in fact be mapping some other memory
address (for dynamic DMA window scenario). This will wipe out those TCEs.

The solution is for iommu_mem_notifier() to only invoke these routines for
"direct" mapped DMA windows.

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
  arch/powerpc/platforms/pseries/iommu.c | 17 +
  1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 918f511837db..24dd61636400 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
  struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
+   booldirect;
struct list_head list;
  };

@@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_del_prop;

if (direct_mapping) {
+   window->direct = true;
+
/* DDW maps the whole partition, so enable direct DMA mapping */
ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> 
PAGE_SHIFT,
win64->value, 
tce_setrange_multi_pSeriesLP_walk);
@@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
int i;
unsigned long start = 0, end = 0;

+   window->direct = false;
+
for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
const unsigned long mask = IORESOURCE_MEM_64 | 
IORESOURCE_MEM;

@@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, 
unsigned long action,
case MEM_GOING_ONLINE:
	spin_lock(&dma_win_list_lock);
	list_for_each_entry(window, &dma_win_list, list) {
-   ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
-   arg->nr_pages, window->prop);
+   if (window->direct) {
+   ret |= 
tce_setrange_multi_pSeriesLP(arg->start_pfn,
+   arg->nr_pages, window->prop);
+   }
/* XXX log error */
}
	spin_unlock(&dma_win_list_lock);
@@ -1597,8 +1604,10 @@ static in

[PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

2023-06-13 Thread Gaurav Batra
When memory is dynamically added/removed, iommu_mem_notifier() is invoked. This
routine traverses through all the DMA windows (DDW only, not default windows)
to add/remove "direct" TCE mappings. The routines for this purpose are
tce_setrange_multi_pSeriesLP() and tce_clearrange_multi_pSeriesLP().

Both these routines are designed for Direct mapped DMA windows only.

The issue is that there could be some DMA windows in the list which are not
"direct" mapped. Calling these routines will either,

1) remove some dynamically mapped TCEs, Or
2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER

Here are the side effects when these routines are incorrectly invoked for
"dynamically" mapped DMA windows.

tce_setrange_multi_pSeriesLP()

This adds direct mapped TCEs. Now, this could invoke HCALL to add TCEs with
out-of-bound range. In this scenario, HCALL will return H_PARAMETER and DLPAR
ADD of memory will fail.

tce_clearrange_multi_pSeriesLP()

This will remove range of TCEs. The TCE range that is calculated, depending on
the memory range being added, could in fact be mapping some other memory
address (for dynamic DMA window scenario). This will wipe out those TCEs.

The solution is for iommu_mem_notifier() to only invoke these routines for
"direct" mapped DMA windows.
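The solution can be modeled with a minimal sketch. The names here (`struct demo_win`, `mem_notify`) are hypothetical stand-ins, not the kernel's: on a memory hotplug event the notifier walks the window list and issues TCE set/clear calls only for windows that were created direct-mapped, leaving dynamically mapped DDWs untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct dma_win with the new 'direct' flag. */
struct demo_win {
	bool direct;
	int tce_ops;	/* counts the set/clear calls this window received */
};

/* Model of the fixed iommu_mem_notifier() loop: only direct-mapped
 * windows get their TCE range updated on a hotplug event. */
void mem_notify(struct demo_win *wins, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (wins[i].direct)
			wins[i].tce_ops++;	/* tce_set/clearrange would run here */
	}
}
```

Before the fix, the dynamic window in this model would also receive the call, either wiping out live TCEs or triggering H_PARAMETER from the hcall for out-of-bounds IO bus addresses.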

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/platforms/pseries/iommu.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 918f511837db..24dd61636400 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
 struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
+   booldirect;
struct list_head list;
 };
 
@@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_del_prop;
 
if (direct_mapping) {
+   window->direct = true;
+
/* DDW maps the whole partition, so enable direct DMA mapping */
		ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
				win64->value, tce_setrange_multi_pSeriesLP_walk);
@@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
int i;
unsigned long start = 0, end = 0;
 
+   window->direct = false;
+
for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
		const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
 
@@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
case MEM_GOING_ONLINE:
	spin_lock(&dma_win_list_lock);
	list_for_each_entry(window, &dma_win_list, list) {
-   ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
-   arg->nr_pages, window->prop);
+   if (window->direct) {
+			ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
+					arg->nr_pages, window->prop);
+   }
/* XXX log error */
}
	spin_unlock(&dma_win_list_lock);
@@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
case MEM_OFFLINE:
	spin_lock(&dma_win_list_lock);
	list_for_each_entry(window, &dma_win_list, list) {
-   ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
-   arg->nr_pages, window->prop);
+   if (window->direct) {
+			ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
+					arg->nr_pages, window->prop);
+   }
/* XXX log error */
}
	spin_unlock(&dma_win_list_lock);
-- 
2.39.2 (Apple Git-143)



Re: [PATCH v2] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-25 Thread Gaurav Batra

Hello Michael,

Here are the changes in v2 of the patch

1. added some wording to change log to specify why we are seeing the 
issue now. Also added "CC: sta...@vger.kernel.org"


2. changed "limit" to unsigned long. I have kept "rpages" as "long"

Thanks,

Gaurav

On 5/25/23 9:34 AM, Gaurav Batra wrote:

As of now, in tce_freemulti_pSeriesLP(), there is no limit on how many TCEs
are passed to H_STUFF_TCE hcall. This was not an issue until now. Newer
firmware releases have started enforcing this requirement.

The interface has been in it's current form since the beginning.

Cc: sta...@vger.kernel.org

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
  arch/powerpc/platforms/pseries/iommu.c | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..f159a195101d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -306,13 +306,22 @@ static void tce_free_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
  static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, 
long npages)
  {
u64 rc;
+   long rpages = npages;
+   unsigned long limit;

if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
return tce_free_pSeriesLP(tbl->it_index, tcenum,
  tbl->it_page_shift, npages);

-   rc = plpar_tce_stuff((u64)tbl->it_index,
-(u64)tcenum << tbl->it_page_shift, 0, npages);
+   do {
+   limit = min_t(unsigned long, rpages, 512);
+
+   rc = plpar_tce_stuff((u64)tbl->it_index,
+   (u64)tcenum << tbl->it_page_shift, 0, limit);
+
+   rpages -= limit;
+   tcenum += limit;
+   } while (rpages > 0 && !rc);

if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");


[PATCH v2] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-25 Thread Gaurav Batra
As of now, in tce_freemulti_pSeriesLP(), there is no limit on how many TCEs
are passed to H_STUFF_TCE hcall. This was not an issue until now. Newer
firmware releases have started enforcing this requirement.

The interface has been in its current form since the beginning.

Cc: sta...@vger.kernel.org

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/platforms/pseries/iommu.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..f159a195101d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -306,13 +306,22 @@ static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
 static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
u64 rc;
+   long rpages = npages;
+   unsigned long limit;
 
if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
return tce_free_pSeriesLP(tbl->it_index, tcenum,
  tbl->it_page_shift, npages);
 
-   rc = plpar_tce_stuff((u64)tbl->it_index,
-(u64)tcenum << tbl->it_page_shift, 0, npages);
+   do {
+   limit = min_t(unsigned long, rpages, 512);
+
+   rc = plpar_tce_stuff((u64)tbl->it_index,
+   (u64)tcenum << tbl->it_page_shift, 0, limit);
+
+   rpages -= limit;
+   tcenum += limit;
+   } while (rpages > 0 && !rc);
 
if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
-- 
2.39.2 (Apple Git-143)



Re: [PATCH] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-22 Thread Gaurav Batra


On 5/17/23 7:19 AM, Michael Ellerman wrote:

Gaurav Batra  writes:

Hello Michael,

System test hit the crash. I believe it was PHYP that caused it, due to
the number of TCEs passed in being >512.

OK. It's always good to spell out in the change log whether it's a
theoretical/unlikely bug, or one that's actually been hit in testing or
the field.
I will submit another version of the patch with some changes in the log 
once I figure out how to tag it for stable (as mentioned below).

I was wondering about the Fixes tag as well. But this interface, in
its current form, has been there since the day the file was created. So, in
this case, should I mention the first commit which created this source file?

If it really goes back to the origin commit, then it's probably better
to just say so and tag it for stable, rather than pointing to 1da177e4.
How do I tag it for stable? Will it be part of the "Fixes:" tag or 
some other tag?


I wonder though is there something else that changed that means this bug
is now being hit but wasn't before? Or maybe it's just that we are
testing on systems with large enough amounts of memory to hit this but
which aren't using a direct mapping?


From the details in Bugzilla, it does seem like the HCALL was 
previously taking long as well, but PHYP was more relaxed about it. Now, 
PHYP is limiting how long an HCALL can take.


Below are some excerpts from the Bug: 202349

Linux is passing too many counts in H_STUFF_TCE. The higher the counts, 
the longer the HCALL takes. From a Hypervisor perspective, we cannot 
stop Linux from doing this or it will violate the rules in the PAPR 
(which then would cause Linux to crash). The dispatcher team has 
"tightened the screws" on long running HCALLs by causing this trap to 
fire. From our discussions, they will not put the limits back where they 
were before.



Thanks

Gaurav



cheers

Re: [PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-05-22 Thread Gaurav Batra

Hello Alexey,

No worries. I resolved the issue with Michael's help. The patch is 
merged upstream and it fixes the issue.


Here is the link

https://lore.kernel.org/all/20230504175913.83844-1-gba...@linux.vnet.ibm.com/

Thanks,

Gaurav

On 5/21/23 7:08 PM, Alexey Kardashevskiy wrote:

Hi Gaurav,

Sorry I missed this. Please share the link to your fix; I do not 
see it in my mail. In general, the problem can probably be solved by 
using huge pages (anything more than 64K) only for 1:1 mapping.



On 03/05/2023 13:25, Gaurav Batra wrote:

Hello Alexey,

I recently joined IOMMU team. There was a bug reported by test team 
where Mellanox driver was timing out during configuration. I proposed 
a fix for the same, which is below in the email.


You suggested a fix for Srikar's reported problem. Basically, both 
these fixes will resolve Srikar and Mellanox driver issues. The 
problem is with 2MB DDW.


Since you have extensive knowledge of IOMMU design and code, in your 
opinion, which patch should we adopt?


Thanks a lot

Gaurav

On 4/20/23 2:45 PM, Gaurav Batra wrote:

Hello Michael,

I was looking into the Bug: 199106 
(https://bugzilla.linux.ibm.com/show_bug.cgi?id=199106).


In the Bug, Mellanox driver was timing out when enabling SRIOV device.

I tested Alexey's patch and it fixes the issue with the Mellanox 
driver. The downside to Alexey's fix is that even a small memory 
request by the driver will be aligned up to 2MB. In my test, the 
Mellanox driver is issuing multiple requests of 64K size. All these 
will get aligned up to 2MB, which is quite a waste of resources.

In any case, both the patches work. Let me know which approach you 
prefer. In case we decide to go with my patch, I just realized that I 
need to fix nio_pages in iommu_free_coherent() as well.


Thanks,

Gaurav

On 4/20/23 10:21 AM, Michael Ellerman wrote:

Gaurav Batra  writes:

When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The 
code

was incorrectly setting the DMA address to the beginning of the TCE
range.

Mellanox driver is reporting timeout trying to ENABLE_HCA for an 
SR-IOV

ethernet port, when DMA window is backed by 2MB TCEs.

I assume this is similar or related to the bug Srikar reported?

https://lore.kernel.org/linuxppc-dev/20230323095333.gi1005...@linux.vnet.ibm.com/ 



In that thread Alexey suggested a patch, have you tried his patch? He
suggested rounding up the allocation size, rather than adjusting the
dma_handle.


Fixes: 3872731187141d5d0a5c4fb30007b8b9ec36a44d
That's not the right syntax, it's described in the documentation 
how to

generate it.

It should be:

   Fixes: 387273118714 ("powerps/pseries/dma: Add support for 2M 
IOMMU page size")


cheers

diff --git a/arch/powerpc/kernel/iommu.c 
b/arch/powerpc/kernel/iommu.c

index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, 
struct iommu_table *tbl,

  /* Convert entry to a dma_addr_t */
  entry += tbl->it_offset;
  dma_addr = entry << tbl->it_page_shift;
-    dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+    dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));
    DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
  npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, 
struct iommu_table *tbl,

  unsigned int order;
  unsigned int nio_pages, io_order;
  struct page *page;
+    int tcesize = (1 << tbl->it_page_shift);
    size = PAGE_ALIGN(size);
  order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, 
struct iommu_table *tbl,

  memset(ret, 0, size);
    /* Set up tces to cover the allocated range */
-    nio_pages = size >> tbl->it_page_shift;
+    nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+
  io_order = get_iommu_order(size, tbl);
  mapping = iommu_alloc(dev, tbl, ret, nio_pages, 
DMA_BIDIRECTIONAL,

    mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, 
struct iommu_table *tbl,

  free_pages((unsigned long)ret, order);
  return NULL;
  }
-    *dma_handle = mapping;
+
+    *dma_handle = mapping | ((u64)ret & (tcesize - 1));
  return ret;
  }
  --




Re: [PATCH] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-11 Thread Gaurav Batra

Hello Michael,

System test hit the crash. I believe it was PHYP that caused it, due to 
the number of TCEs passed in being >512.


I was wondering about the Fixes tag as well. But this interface, in 
its current form, has been there since the day the file was created. So, in 
this case, should I mention the first commit which created this source file?


Thanks a lot for looking into it.

Gaurav

On 5/11/23 9:35 PM, Michael Ellerman wrote:

Gaurav Batra  writes:

As of now, in tce_freemulti_pSeriesLP(), there is no limit on how many TCEs
are passed to H_STUFF_TCE hcall. PAPR is enforcing this to be limited to
512 TCEs.

Did you actually hit a bug here, or just noticed via code inspection?

Can you provide a Fixes: tag ?

cheers


Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
  arch/powerpc/platforms/pseries/iommu.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..1b134b1b795a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -306,13 +306,21 @@ static void tce_free_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
  static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, 
long npages)
  {
u64 rc;
+   long limit, rpages = npages;
   
I don't know why npages is signed, but we don't ever want limit to be
negative, so it'd be better off as unsigned long, wouldn't it?


if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
return tce_free_pSeriesLP(tbl->it_index, tcenum,
  tbl->it_page_shift, npages);
  
-	rc = plpar_tce_stuff((u64)tbl->it_index,

-(u64)tcenum << tbl->it_page_shift, 0, npages);
+   do {
+   limit = min_t(long, rpages, 512);

And here we'd use unsigned long.


+   rc = plpar_tce_stuff((u64)tbl->it_index,
+   (u64)tcenum << tbl->it_page_shift, 0, limit);
+
+   rpages -= limit;
+   tcenum += limit;
+   } while (rpages > 0 && !rc);
  
  	if (rc && printk_ratelimit()) {

printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
--

cheers


[PATCH] powerpc/iommu: limit number of TCEs to 512 for H_STUFF_TCE hcall

2023-05-09 Thread Gaurav Batra
As of now, in tce_freemulti_pSeriesLP(), there is no limit on how many TCEs
are passed to H_STUFF_TCE hcall. PAPR is enforcing this to be limited to
512 TCEs.

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
 arch/powerpc/platforms/pseries/iommu.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..1b134b1b795a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -306,13 +306,21 @@ static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
 static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
u64 rc;
+   long limit, rpages = npages;
 
if (!firmware_has_feature(FW_FEATURE_STUFF_TCE))
return tce_free_pSeriesLP(tbl->it_index, tcenum,
  tbl->it_page_shift, npages);
 
-   rc = plpar_tce_stuff((u64)tbl->it_index,
-(u64)tcenum << tbl->it_page_shift, 0, npages);
+   do {
+   limit = min_t(long, rpages, 512);
+
+   rc = plpar_tce_stuff((u64)tbl->it_index,
+   (u64)tcenum << tbl->it_page_shift, 0, limit);
+
+   rpages -= limit;
+   tcenum += limit;
+   } while (rpages > 0 && !rc);
 
if (rc && printk_ratelimit()) {
printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
-- 



[PATCH] iommu/powerpc: Incorrect DDW Table is referenced for SR-IOV device

2023-05-05 Thread Gaurav Batra
For an SR-IOV device, while enabling DDW, a new table is created and added
at index 1 in the group. In the below 2 scenarios, the table is incorrectly
referenced at index 0 (which is where the table is for default DMA window).

1. When adding DDW

This issue is exposed with "slub_debug". Error thrown out from
dma_iommu_dma_supported()

Warning: IOMMU offset too big for device mask
mask: 0x, table offset: 0x800

2. During Dynamic removal of the PCI device.

Error is from iommu_tce_table_put() since a NULL table pointer is
passed in.

Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect 
mapping")

Signed-off-by: Gaurav Batra 

Reviewed-by: Brian King 
---
 arch/powerpc/kernel/dma-iommu.c|  4 +++-
 arch/powerpc/platforms/pseries/iommu.c | 13 +
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 038ce8d9061d..8920862ffd79 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -144,7 +144,7 @@ static bool dma_iommu_bypass_supported(struct device *dev, u64 mask)
 /* We support DMA to/from any memory page via the iommu */
 int dma_iommu_dma_supported(struct device *dev, u64 mask)
 {
-   struct iommu_table *tbl = get_iommu_table_base(dev);
+   struct iommu_table *tbl;
 
if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
/*
@@ -162,6 +162,8 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
return 1;
}
 
+   tbl = get_iommu_table_base(dev);
+
if (!tbl) {
dev_err(dev, "Warning: IOMMU dma not supported: mask 0x%08llx, table unavailable\n", mask);
return 0;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index c74b71d4733d..a8acd3b402d7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -85,19 +85,24 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 static void iommu_pseries_free_group(struct iommu_table_group *table_group,
const char *node_name)
 {
-   struct iommu_table *tbl;
-
if (!table_group)
return;
 
-   tbl = table_group->tables[0];
 #ifdef CONFIG_IOMMU_API
if (table_group->group) {
iommu_group_put(table_group->group);
BUG_ON(table_group->group);
}
 #endif
-   iommu_tce_table_put(tbl);
+
+   /* Default DMA window table is at index 0, while DDW at 1. SR-IOV
+* adapters only have table on index 1.
+*/
+   if (table_group->tables[0])
+   iommu_tce_table_put(table_group->tables[0]);
+
+   if (table_group->tables[1])
+   iommu_tce_table_put(table_group->tables[1]);
 
kfree(table_group);
 }
-- 



Re: [PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-05-04 Thread Gaurav Batra

Hello Michael,

I agree with your concerns regarding a device being able to access memory 
that doesn't belong to it. We already have that exposure today with 2MB 
TCEs. With 2MB TCEs, the DMA window will be big enough, for dedicated 
adapters, that the whole memory is going to be mapped "direct". This 
essentially means that a "rogue" device/driver has the potential to 
corrupt LPAR-wide memory.


I have sent you v3.

Thanks,

Gaurav

On 5/4/23 12:10 AM, Michael Ellerman wrote:

Gaurav Batra  writes:

Hello Michael,

I was looking into the Bug: 199106
(https://bugzilla.linux.ibm.com/show_bug.cgi?id=199106).

In the Bug, Mellanox driver was timing out when enabling SRIOV device.

I tested, Alexey's patch and it fixes the issue with Mellanox driver.
The down side

to Alexey's fix is that even a small memory request by the driver will
be aligned up

to 2MB. In my test, the Mellanox driver is issuing multiple requests of
64K size.

All these will get aligned up to 2MB, which is quite a waste of resources.

OK. I guess we should use your patch then.

It's not ideal as it means the device can potentially read/write to
memory it shouldn't, but 2MB is a lot to waste for a 64K alloc.


In any case, both the patches work. Let me know which approach you
prefer. In case

we decide to go with my patch, I just realized that I need to fix
nio_pages in

iommu_free_coherent() as well.

Can you send a v3 with that fixed please.

cheers


[PATCH v3] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-05-04 Thread Gaurav Batra
When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.

Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.

Fixes: 387273118714 ("powerps/pseries/dma: Add support for 2M IOMMU page size")

Signed-off-by: Gaurav Batra 

Reviewed-by: Greg Joyce 
Reviewed-by: Brian King 
---
 arch/powerpc/kernel/iommu.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..b8b7a189cd3c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
/* Convert entry to a dma_addr_t */
entry += tbl->it_offset;
dma_addr = entry << tbl->it_page_shift;
-   dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+   dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));
 
DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+   int tcesize = (1 << tbl->it_page_shift);
 
size = PAGE_ALIGN(size);
order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
memset(ret, 0, size);
 
/* Set up tces to cover the allocated range */
-   nio_pages = size >> tbl->it_page_shift;
+   nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+
io_order = get_iommu_order(size, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
  mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
free_pages((unsigned long)ret, order);
return NULL;
}
-   *dma_handle = mapping;
+
+   *dma_handle = mapping | ((u64)ret & (tcesize - 1));
return ret;
 }
 
@@ -949,7 +952,7 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
unsigned int nio_pages;
 
size = PAGE_ALIGN(size);
-   nio_pages = size >> tbl->it_page_shift;
+   nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
iommu_free(tbl, dma_handle, nio_pages);
size = PAGE_ALIGN(size);
free_pages((unsigned long)vaddr, get_order(size));
-- 



Re: [PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-05-02 Thread Gaurav Batra

Hello Alexey,

I recently joined IOMMU team. There was a bug reported by test team 
where Mellanox driver was timing out during configuration. I proposed a 
fix for the same, which is below in the email.


You suggested a fix for Srikar's reported problem. Basically, both these 
fixes will resolve Srikar and Mellanox driver issues. The problem is 
with 2MB DDW.


Since you have extensive knowledge of IOMMU design and code, in your 
opinion, which patch should we adopt?


Thanks a lot

Gaurav

On 4/20/23 2:45 PM, Gaurav Batra wrote:

Hello Michael,

I was looking into the Bug: 199106 
(https://bugzilla.linux.ibm.com/show_bug.cgi?id=199106).


In the Bug, Mellanox driver was timing out when enabling SRIOV device.

I tested Alexey's patch and it fixes the issue with the Mellanox 
driver. The downside to Alexey's fix is that even a small memory 
request by the driver will be aligned up to 2MB. In my test, the 
Mellanox driver is issuing multiple requests of 64K size. All these 
will get aligned up to 2MB, which is quite a waste of resources.

In any case, both the patches work. Let me know which approach you 
prefer. In case we decide to go with my patch, I just realized that I 
need to fix nio_pages in iommu_free_coherent() as well.


Thanks,

Gaurav

On 4/20/23 10:21 AM, Michael Ellerman wrote:

Gaurav Batra  writes:

When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.

Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.

I assume this is similar or related to the bug Srikar reported?

https://lore.kernel.org/linuxppc-dev/20230323095333.gi1005...@linux.vnet.ibm.com/

In that thread Alexey suggested a patch, have you tried his patch? He
suggested rounding up the allocation size, rather than adjusting the
dma_handle.


Fixes: 3872731187141d5d0a5c4fb30007b8b9ec36a44d

That's not the right syntax, it's described in the documentation how to
generate it.

It should be:

   Fixes: 387273118714 ("powerps/pseries/dma: Add support for 2M 
IOMMU page size")


cheers


diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct 
iommu_table *tbl,

  /* Convert entry to a dma_addr_t */
  entry += tbl->it_offset;
  dma_addr = entry << tbl->it_page_shift;
-    dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+    dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));
    DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
  npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, 
struct iommu_table *tbl,

  unsigned int order;
  unsigned int nio_pages, io_order;
  struct page *page;
+    int tcesize = (1 << tbl->it_page_shift);
    size = PAGE_ALIGN(size);
  order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, 
struct iommu_table *tbl,

  memset(ret, 0, size);
    /* Set up tces to cover the allocated range */
-    nio_pages = size >> tbl->it_page_shift;
+    nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+
  io_order = get_iommu_order(size, tbl);
  mapping = iommu_alloc(dev, tbl, ret, nio_pages, 
DMA_BIDIRECTIONAL,

    mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, 
struct iommu_table *tbl,

  free_pages((unsigned long)ret, order);
  return NULL;
  }
-    *dma_handle = mapping;
+
+    *dma_handle = mapping | ((u64)ret & (tcesize - 1));
  return ret;
  }
  --


Re: [PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-04-20 Thread Gaurav Batra

Hello Michael,

I was looking into the Bug: 199106 
(https://bugzilla.linux.ibm.com/show_bug.cgi?id=199106).


In the Bug, Mellanox driver was timing out when enabling SRIOV device.

I tested Alexey's patch and it fixes the issue with the Mellanox 
driver. The downside to Alexey's fix is that even a small memory 
request by the driver will be aligned up to 2MB. In my test, the 
Mellanox driver is issuing multiple requests of 64K size. All these 
will get aligned up to 2MB, which is quite a waste of resources.

In any case, both the patches work. Let me know which approach you 
prefer. In case we decide to go with my patch, I just realized that I 
need to fix nio_pages in iommu_free_coherent() as well.


Thanks,

Gaurav

On 4/20/23 10:21 AM, Michael Ellerman wrote:

Gaurav Batra  writes:

When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.

Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.

I assume this is similar or related to the bug Srikar reported?

   
https://lore.kernel.org/linuxppc-dev/20230323095333.gi1005...@linux.vnet.ibm.com/

In that thread Alexey suggested a patch, have you tried his patch? He
suggested rounding up the allocation size, rather than adjusting the
dma_handle.


Fixes: 3872731187141d5d0a5c4fb30007b8b9ec36a44d

That's not the right syntax, it's described in the documentation how to
generate it.

It should be:

   Fixes: 387273118714 ("powerps/pseries/dma: Add support for 2M IOMMU page 
size")

cheers


diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table 
*tbl,
/* Convert entry to a dma_addr_t */
entry += tbl->it_offset;
dma_addr = entry << tbl->it_page_shift;
-   dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+   dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));
  
  		DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",

npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+   int tcesize = (1 << tbl->it_page_shift);
  
  	size = PAGE_ALIGN(size);

order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
memset(ret, 0, size);
  
  	/* Set up tces to cover the allocated range */

-   nio_pages = size >> tbl->it_page_shift;
+   nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+
io_order = get_iommu_order(size, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
  mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
free_pages((unsigned long)ret, order);
return NULL;
}
-   *dma_handle = mapping;
+
+   *dma_handle = mapping | ((u64)ret & (tcesize - 1));
return ret;
  }
  
--


[PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-04-19 Thread Gaurav Batra
When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.

Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.

Fixes: 3872731187141d5d0a5c4fb30007b8b9ec36a44d
Signed-off-by: Gaurav Batra 

Reviewed-by: Greg Joyce 
Reviewed-by: Brian King 
---
 arch/powerpc/kernel/iommu.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
/* Convert entry to a dma_addr_t */
entry += tbl->it_offset;
dma_addr = entry << tbl->it_page_shift;
-   dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+   dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));
 
DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+   int tcesize = (1 << tbl->it_page_shift);
 
size = PAGE_ALIGN(size);
order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
memset(ret, 0, size);
 
/* Set up tces to cover the allocated range */
-   nio_pages = size >> tbl->it_page_shift;
+   nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+
io_order = get_iommu_order(size, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
  mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
free_pages((unsigned long)ret, order);
return NULL;
}
-   *dma_handle = mapping;
+
+   *dma_handle = mapping | ((u64)ret & (tcesize - 1));
return ret;
 }
 
-- 



[PATCH] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs

2023-04-19 Thread Gaurav Batra
When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.

Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.

Signed-off-by: Gaurav Batra 

Reviewed-by: Greg Joyce 
Reviewed-by: Brian King 
---
 arch/powerpc/kernel/iommu.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
/* Convert entry to a dma_addr_t */
entry += tbl->it_offset;
dma_addr = entry << tbl->it_page_shift;
-   dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+   dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));
 
DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+   int tcesize = (1 << tbl->it_page_shift);
 
size = PAGE_ALIGN(size);
order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
memset(ret, 0, size);
 
/* Set up tces to cover the allocated range */
-   nio_pages = size >> tbl->it_page_shift;
+   nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+
io_order = get_iommu_order(size, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
  mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
free_pages((unsigned long)ret, order);
return NULL;
}
-   *dma_handle = mapping;
+
+   *dma_handle = mapping | ((u64)ret & (tcesize - 1));
return ret;
 }
 
--