[RFC PATCH v2 6/6] powerpc/iommu: Implement the iommu_table_group_ops for pSeries

2024-04-30 Thread Shivaprasad G Bhat
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA
windows for PEs, their ownership transfer, create/set/unset the TCE
tables for the Dynamic DMA wundows(DDW). VFIOS uses these APIs for
support on POWER.

The commit 9d67c9433509 "(powerpc/iommu: Add "borrowing"
iommu_table_group_ops)" implemented partial support for this API with
"borrow" mechanism wherein the DMA windows if created already by the
host driver, they would be available for VFIO to use. Also, it didn't
have the support to control/modify the window size or the IO page
size.

The current patch implements all the necessary iommu_table_group_ops
APIs there by avoiding the "borrrowing". So, just the way it is on the
PowerNV platform, with this patch the iommu table group ownership is
transferred to the VFIO PPC subdriver, the iommu table, DMA windows
creation/deletion all driven through the APIs.

The pSeries uses the query-pe-dma-window, create-pe-dma-window and
reset-pe-dma-window RTAS calls for DMA window creation, deletion and
reset to defaul. The RTAs calls do show some minor differences to the
way things are to be handled on the pSeries which are listed below.

* On pSeries, the default DMA window size is "fixed" cannot be custom
sized as requested by the user. For non-SRIOV VFs, It is fixed at 2GB
and for SRIOV VFs, its variable sized based on the capacity assigned
to it during the VF assignment to the LPAR. So, for the  default DMA
window alone the size if requested less than tce32_size, the smaller
size is enforced using the iommu table->it_size.

* The DMA start address for 32-bit window is 0, and for the 64-bit window
in case of PowerNV is hardcoded to TVE select (bit 59) at 512PiB offset.
And this address is returned at the time of create_table() API call(even
before the window is created), the subsequent set_window() call actually
opens the DMA window. On pSeries, the DMA start address for 32-bit
window is known from the 'ibm,dma-window' DT property. However, the
64-bit window start address is not known until the create-pe-dma RTAS
call is made. So, the create_table() which returns the DMA window start
address actually opens the DMA window and returns the DMA start address
as returned by the Hypervisor for the create-pe-dma RTAS call.

* The reset-pe-dma RTAS call resets the DMA windows and restores the
default DMA window, however it does not clear the TCE table entries
if there are any. In case of ownership transfer from platform domain
which used direct mapping, the patch chooses remove-pe-dma instead of
reset-pe for the 64-bit window intentionally so that the
clear_dma_window() is called.

Other than the DMA window management changes mentioned above, the
patch also brings back the userspace view for the single level TCE
as it existed before commit 090bad39b237a ("powerpc/powernv: Add
indirect levels to it_userspace") along with the relavent
refactoring.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/include/asm/iommu.h  |4 
 arch/powerpc/kernel/iommu.c   |4 
 arch/powerpc/platforms/powernv/pci-ioda.c |6 
 arch/powerpc/platforms/pseries/iommu.c|  643 +
 4 files changed, 559 insertions(+), 98 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 784809e34abe..1e0c2b2882c2 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -181,9 +181,9 @@ struct iommu_table_group_ops {
long (*unset_window)(struct iommu_table_group *table_group,
int num);
/* Switch ownership from platform code to external user (e.g. VFIO) */
-   long (*take_ownership)(struct iommu_table_group *table_group);
+   long (*take_ownership)(struct iommu_table_group *table_group, struct 
device *dev);
/* Switch ownership from external user (e.g. VFIO) back to core */
-   void (*release_ownership)(struct iommu_table_group *table_group);
+   void (*release_ownership)(struct iommu_table_group *table_group, struct 
device *dev);
 };
 
 struct iommu_table_group_link {
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 31dc663be0cc..2ad1e320f69f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1170,7 +1170,7 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
 * The domain being set to PLATFORM from earlier
 * BLOCKED. The table_group ownership has to be released.
 */
-   table_group->ops->release_ownership(table_group);
+   table_group->ops->release_ownership(table_group, dev);
iommu_group_put(grp);
 
return 0;
@@ -1198,7 +1198,7 @@ spapr_tce_blocked_iommu_attach_dev(struct iommu_domain 
*platform_domain,
 * also sets the dma_api ops
 */
table_group = iommu_group_get_iommudata(grp);
-   ret = table_group->ops->take_o

[RFC PATCH v2 4/6] vfio/spapr: Always clear TCEs before unsetting the window

2024-04-30 Thread Shivaprasad G Bhat
The PAPR expects the TCE table to have no entries at the time of
unset window(i.e. remove-pe). The TCE clear right now is done
before freeing the iommu table. On pSeries, the unset window
makes those entries inaccessible to the OS and the H_PUT/GET calls
fail on them with H_CONSTRAINED.

On PowerNV, this has no side effect as the TCE clear can be done
before the DMA window removal as well.

Signed-off-by: Shivaprasad G Bhat 
---
 drivers/vfio/vfio_iommu_spapr_tce.c |   13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index a94ec6225d31..5f9e7e477078 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -364,7 +364,6 @@ static void tce_iommu_release(void *iommu_data)
if (!tbl)
continue;
 
-   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
tce_iommu_free_table(container, tbl);
}
 
@@ -720,6 +719,8 @@ static long tce_iommu_remove_window(struct tce_container 
*container,
 
BUG_ON(!tbl->it_size);
 
+   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+
/* Detach groups from IOMMUs */
list_for_each_entry(tcegrp, >group_list, next) {
table_group = iommu_group_get_iommudata(tcegrp->grp);
@@ -738,7 +739,6 @@ static long tce_iommu_remove_window(struct tce_container 
*container,
}
 
/* Free table */
-   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
tce_iommu_free_table(container, tbl);
container->tables[num] = NULL;
 
@@ -1197,9 +1197,14 @@ static void tce_iommu_release_ownership(struct 
tce_container *container,
return;
}
 
-   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
-   if (container->tables[i])
+   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+   if (container->tables[i]) {
+   tce_iommu_clear(container, container->tables[i],
+   container->tables[i]->it_offset,
+   container->tables[i]->it_size);
table_group->ops->unset_window(table_group, i);
+   }
+   }
 }
 
 static long tce_iommu_take_ownership(struct tce_container *container,




[RFC PATCH v2 5/6] powerpc/iommu: Move dev_has_iommu_table() to iommu.c

2024-04-30 Thread Shivaprasad G Bhat
Move function dev_has_iommu_table() to powerpc/kernel/iommu.c
as it is going to be used by machine specific iommu code as
well in subsequent patches.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/include/asm/iommu.h |1 +
 arch/powerpc/kernel/eeh.c|   16 
 arch/powerpc/kernel/iommu.c  |   17 +
 3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 744cc5fc22d3..784809e34abe 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -220,6 +220,7 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
enum dma_data_direction *direction);
 extern void iommu_tce_kill(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
+int dev_has_iommu_table(struct device *dev, void *data);
 
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index ab316e155ea9..37bfbef929b5 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1264,22 +1264,6 @@ EXPORT_SYMBOL(eeh_dev_release);
 
 #ifdef CONFIG_IOMMU_API
 
-static int dev_has_iommu_table(struct device *dev, void *data)
-{
-   struct pci_dev *pdev = to_pci_dev(dev);
-   struct pci_dev **ppdev = data;
-
-   if (!dev)
-   return 0;
-
-   if (device_iommu_mapped(dev)) {
-   *ppdev = pdev;
-   return 1;
-   }
-
-   return 0;
-}
-
 /**
  * eeh_iommu_group_to_pe - Convert IOMMU group to EEH PE
  * @group: IOMMU group
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index bc4584e73061..31dc663be0cc 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -987,6 +987,23 @@ unsigned long iommu_direction_to_tce_perm(enum 
dma_data_direction dir)
 EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
 
 #ifdef CONFIG_IOMMU_API
+
+int dev_has_iommu_table(struct device *dev, void *data)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev **ppdev = data;
+
+   if (!dev)
+   return 0;
+
+   if (device_iommu_mapped(dev)) {
+   *ppdev = pdev;
+   return 1;
+   }
+
+   return 0;
+}
+
 /*
  * SPAPR TCE API
  */




[RFC PATCH v2 3/6] powerpc/pseries/iommu: Use the iommu table[0] for IOV VF's DDW

2024-04-30 Thread Shivaprasad G Bhat
This patch basically brings consistency with PowerNV approach
to use the first freely available iommu table when the default
window is removed.

The pSeries iommu code convention has been that the table[0] is
for the default 32 bit DMA window and the table[1] is for the
64 bit DDW.

With VFs having only 1 DMA window, the default has to be removed
for creating the larger DMA window. The existing code uses the
table[1] for that, while marking the table[0] as NULL. This is
fine as long as the host driver itself uses the device.

For the VFIO user, on pSeries there is no way to skip table[0]
as the VFIO subdriver uses the first freely available table.
The window 0, when created as 64-bit DDW in that context would
still be on table[0], as the maximum number of windows is 1.

This won't have any impact for the host driver as the table is
fetched from the device's iommu_table_base.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/platforms/pseries/iommu.c |   12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 5b68a4918d63..e701255560a6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -155,7 +155,7 @@ static void iommu_pseries_free_group(struct 
iommu_table_group *table_group,
 #endif
 
/* Default DMA window table is at index 0, while DDW at 1. SR-IOV
-* adapters only have table on index 1.
+* adapters only have table on index 0(if not direct mapped).
 */
if (table_group->tables[0])
iommu_tce_table_put(table_group->tables[0]);
@@ -1519,6 +1519,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
clean_dma_window(pdn, win64->value);
goto out_del_list;
}
+   if (default_win_removed) {
+   iommu_tce_table_put(pci->table_group->tables[0]);
+   pci->table_group->tables[0] = NULL;
+   set_iommu_table_base(>dev, NULL);
+   }
} else {
struct iommu_table *newtbl;
int i;
@@ -1548,15 +1553,12 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1UL << len, page_shift, NULL, 
_table_lpar_multi_ops);
iommu_init_table(newtbl, pci->phb->node, start, end);
 
-   pci->table_group->tables[1] = newtbl;
+   pci->table_group->tables[default_win_removed ? 0 : 1] = newtbl;
 
set_iommu_table_base(>dev, newtbl);
}
 
if (default_win_removed) {
-   iommu_tce_table_put(pci->table_group->tables[0]);
-   pci->table_group->tables[0] = NULL;
-
/* default_win is valid here because default_win_removed == 
true */
of_remove_property(pdn, default_win);
dev_info(>dev, "Removed default DMA window for %pOF\n", 
pdn);




[RFC PATCH v2 2/6] powerpc/pseries/iommu: Fix the VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl output

2024-04-30 Thread Shivaprasad G Bhat
The ioctl VFIO_IOMMU_SPAPR_TCE_GET_INFO is not reporting the
actuals on the platform as not all the details are correctly
collected during the platform probe/scan into the iommu_table_group.

Collect the information during the device setup time as the DMA
window property is already looked up on parent nodes anyway.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/platforms/pseries/iommu.c |   81 ++--
 1 file changed, 67 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index c6850ec1919a..5b68a4918d63 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -857,13 +857,6 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
be32_to_cpu(prop.tce_shift), NULL,
_table_lpar_multi_ops);
 
-   /* Only for normal boot with default window. Doesn't matter even
-* if we set these with DDW which is 64bit during kdump, since
-* these will not be used during kdump.
-*/
-   ppci->table_group->tce32_start = be64_to_cpu(prop.dma_base);
-   ppci->table_group->tce32_size = 1 << 
be32_to_cpu(prop.window_shift);
-
if (!iommu_init_table(tbl, ppci->phb->node, 0, 0))
panic("Failed to initialize iommu table");
 
@@ -1615,6 +1608,71 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
return direct_mapping;
 }
 
+static __u64 query_page_size_to_mask(u32 query_page_size)
+{
+   const long shift[] = {
+   (SZ_4K),   (SZ_64K), (SZ_16M),
+   (SZ_32M),  (SZ_64M), (SZ_128M),
+   (SZ_256M), (SZ_16G), (SZ_2M)
+   };
+   int i, ret = 0;
+
+   for (i = 0; i < ARRAY_SIZE(shift); i++) {
+   if (query_page_size & (1 << i))
+   ret |= shift[i];
+   }
+
+   return ret;
+}
+
+static void spapr_tce_init_table_group(struct pci_dev *pdev,
+  struct device_node *pdn,
+  struct dynamic_dma_window_prop prop)
+{
+   struct iommu_table_group  *table_group = PCI_DN(pdn)->table_group;
+   u32 ddw_avail[DDW_APPLICABLE_SIZE];
+
+   struct ddw_query_response query;
+   int ret;
+
+   /* Only for normal boot with default window. Doesn't matter during
+* kdump, since these will not be used during kdump.
+*/
+   if (is_kdump_kernel())
+   return;
+
+   if (table_group->max_dynamic_windows_supported != 0)
+   return; /* already initialized */
+
+   table_group->tce32_start = be64_to_cpu(prop.dma_base);
+   table_group->tce32_size = 1 << be32_to_cpu(prop.window_shift);
+
+   if (!of_find_property(pdn, "ibm,dma-window", NULL))
+   dev_err(>dev, "default dma window missing!\n");
+
+   ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
+   _avail[0], DDW_APPLICABLE_SIZE);
+   if (ret) {
+   table_group->max_dynamic_windows_supported = -1;
+   return;
+   }
+
+   ret = query_ddw(pdev, ddw_avail, , pdn);
+   if (ret) {
+   dev_err(>dev, "%s: query_ddw failed\n", __func__);
+   table_group->max_dynamic_windows_supported = -1;
+   return;
+   }
+
+   if (query.windows_available == 0)
+   table_group->max_dynamic_windows_supported = 1;
+   else
+   table_group->max_dynamic_windows_supported = 
IOMMU_TABLE_GROUP_MAX_TABLES;
+
+   table_group->max_levels = 1;
+   table_group->pgsizes |= query_page_size_to_mask(query.page_size);
+}
+
 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
 {
struct device_node *pdn, *dn;
@@ -1654,13 +1712,6 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev 
*dev)
be32_to_cpu(prop.tce_shift), NULL,
_table_lpar_multi_ops);
 
-   /* Only for normal boot with default window. Doesn't matter even
-* if we set these with DDW which is 64bit during kdump, since
-* these will not be used during kdump.
-*/
-   pci->table_group->tce32_start = be64_to_cpu(prop.dma_base);
-   pci->table_group->tce32_size = 1 << 
be32_to_cpu(prop.window_shift);
-
iommu_init_table(tbl, pci->phb->node, 0, 0);
iommu_register_group(pci->table_group,
pci_domain_nr(pci->phb->bus), 0);
@@ -1669,6 +1720,8 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev 
*dev)
pr_debug("  found DMA w

[RFC PATCH v2 1/6] powerpc/iommu: Move pSeries specific functions to pseries/iommu.c

2024-04-30 Thread Shivaprasad G Bhat
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c.
The pSeries specific table_group_ops are sitting in the generic powerpc
file. Move it to where it actually belong(pseries/iommu.c).

Only code movement, no functional changes intended.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/include/asm/iommu.h   |4 +
 arch/powerpc/kernel/iommu.c|  149 
 arch/powerpc/platforms/pseries/iommu.c |  144 +++
 3 files changed, 149 insertions(+), 148 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 026695943550..744cc5fc22d3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -156,6 +156,9 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
 bool iommu_table_in_use(struct iommu_table *tbl);
+extern void iommu_table_reserve_pages(struct iommu_table *tbl,
+   unsigned long res_start, unsigned long res_end);
+extern void iommu_table_clear(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
@@ -218,7 +221,6 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 extern void iommu_tce_kill(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 
-extern struct iommu_table_group_ops spapr_tce_table_group_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 29a8c8e18585..bc4584e73061 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -642,7 +642,7 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct 
scatterlist *sglist,
tbl->it_ops->flush(tbl);
 }
 
-static void iommu_table_clear(struct iommu_table *tbl)
+void iommu_table_clear(struct iommu_table *tbl)
 {
/*
 * In case of firmware assisted dump system goes through clean
@@ -683,7 +683,7 @@ static void iommu_table_clear(struct iommu_table *tbl)
 #endif
 }
 
-static void iommu_table_reserve_pages(struct iommu_table *tbl,
+void iommu_table_reserve_pages(struct iommu_table *tbl,
unsigned long res_start, unsigned long res_end)
 {
int i;
@@ -1101,59 +1101,6 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-static int iommu_take_ownership(struct iommu_table *tbl)
-{
-   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-   int ret = 0;
-
-   /*
-* VFIO does not control TCE entries allocation and the guest
-* can write new TCEs on top of existing ones so iommu_tce_build()
-* must be able to release old pages. This functionality
-* requires exchange() callback defined so if it is not
-* implemented, we disallow taking ownership over the table.
-*/
-   if (!tbl->it_ops->xchg_no_kill)
-   return -EINVAL;
-
-   spin_lock_irqsave(>large_pool.lock, flags);
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock_nest_lock(>pools[i].lock, >large_pool.lock);
-
-   if (iommu_table_in_use(tbl)) {
-   pr_err("iommu_tce: it_map is not empty");
-   ret = -EBUSY;
-   } else {
-   memset(tbl->it_map, 0xff, sz);
-   }
-
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_unlock(>pools[i].lock);
-   spin_unlock_irqrestore(>large_pool.lock, flags);
-
-   return ret;
-}
-
-static void iommu_release_ownership(struct iommu_table *tbl)
-{
-   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-
-   spin_lock_irqsave(>large_pool.lock, flags);
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock_nest_lock(>pools[i].lock, >large_pool.lock);
-
-   memset(tbl->it_map, 0, sz);
-
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_unlock(>pools[i].lock);
-   spin_unlock_irqrestore(>large_pool.lock, flags);
-}
-#endif
-
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
/*
@@ -1185,98 +1132,6 @@ int iommu_add_device(struct iommu_table_group 
*table_group, struct device *dev)
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-/*
- * A simple iommu_table_group_ops which only allows reusing the existing
- * iommu_table. This handles VFIO for POWER7 or the nested KVM.
- * The ops does not allow creating windows and only allo

[RFC PATCH v2 0/6] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO

2024-04-30 Thread Shivaprasad G Bhat
RFC v1 was posted here [1]. As I was testing more and fixing the
issues, I realized its clean to have the table_group_ops implemented
the way it is done on PowerNV and stop 'borrowing' the DMA windows
for pSeries.

This patch-set implements the iommu table_group_ops for pSeries for
VFIO SPAPR TCE sub-driver thereby enabling the VFIO support on POWER
pSeries machines.

So, this patchset is a re-write and not close to the V1 except
for few changes.

Structure of the patchset:
-
The first and fifth patches just code movements.

Second patch takes care of collecting the TCE and DDW information
for the vfio_iommu_spapr_tce_ddw_info during probe.

Third patch fixes the convention of using table[1] for VFs on
pSeries when used by the host driver.

Fourth patch fixes the VFIO to call TCE clear before unset window.

The last patch has the API implementations, please find the
details on its commit description.

Testing:
---
Tested with nested guest for NVME card, Mellanox multi-function
card by attaching them to nested kvm guest running on a pSeries
lpar.
Also vfio-test [2] by Alex Willamson, was forked and updated to
add support for pSeries guest and used to test these patches[3].

Limitations/Known Issues:

* The DMA window restrictions with SRIOV VF scenarios of having
maximum 1 dma window is taken care in the current patches itself.
However, the necessary changes required in
vfio_iommu_spapr_tce_ddw_info to expose the default window being
a 64-bit one and the qemu changes handle the same will be taken
care in next versions.
* KVM guest boot throws warning at remap_pfn_range_notrack(), on
the host, I will post the fix along in the next versions.
* The DLPAR hotplugged device has no FDT entry until next reboot,
default dma window property has to be preserved differently for
this case.

References:
--
[1] 
https://lore.kernel.org/linuxppc-dev/171026724548.8367.8321359354119254395.st...@linux.ibm.com/
[2] https://github.com/awilliam/tests
[3] https://github.com/nnmwebmin/vfio-ppc-tests/tree/vfio-ppc-ex

---
Changelog:
v1: 
https://lore.kernel.org/linuxppc-dev/171026724548.8367.8321359354119254395.st...@linux.ibm.com/
 - Rewrite as to stop borrowing the DMA windows and implemented
 the table_group_ops for pSeries.
 - Cover letter and Patch 6 has more details as this was a rewrite.

Shivaprasad G Bhat (6):
  powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
  powerpc/pseries/iommu: Fix the VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl output
  powerpc/pseries/iommu: Use the iommu table[0] for IOV VF's DDW
  vfio/spapr: Always clear TCEs before unsetting the window
  powerpc/iommu: Move dev_has_iommu_table() to iommu.c
  powerpc/iommu: Implement the iommu_table_group_ops for pSeries


 arch/powerpc/include/asm/iommu.h  |   9 +-
 arch/powerpc/kernel/eeh.c |  16 -
 arch/powerpc/kernel/iommu.c   | 170 +
 arch/powerpc/platforms/powernv/pci-ioda.c |   6 +-
 arch/powerpc/platforms/pseries/iommu.c| 720 +-
 drivers/vfio/vfio_iommu_spapr_tce.c   |  13 +-
 6 files changed, 729 insertions(+), 205 deletions(-)

--
Signature



Re: [RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create

2024-03-25 Thread Shivaprasad G Bhat

Hi Michael,

On 3/13/24 18:23, Michael Ellerman wrote:

Hi Shivaprasad,

Shivaprasad G Bhat  writes:

The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
iommu_table_group_ops") implemented the "borrow" mechanism for
the pSeries SPAPR TCE. It did implement this support partially
that it left out creating the DDW if not present already.

The patch here attempts to fix the missing gaps.
  - Expose the DDW info to user by collecting it during probe.
  - Create the window and the iommu table if not present during
VFIO_SPAPR_TCE_CREATE.
  - Remove and recreate the window if the pageshift and window sizes
do not match.
  - Restore the original window in enable_ddw() if the user had
created/modified the DDW. As there is preference for DIRECT mapping
on the host driver side, the user created window is removed.

The changes work only for the non-SRIOV-VF scenarios for PEs having
2 DMA windows.

This crashes on powernv.



Thanks for pointing this out.  I will take care of this in v2 of this RFC.


Regards,

Shivaprasad



Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables

2024-03-20 Thread Shivaprasad G Bhat

Hi Jason,

On 3/19/24 20:02, Jason Gunthorpe wrote:

On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:

The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
it_userspace") which implemented the tce indirect levels
support for PowerNV ended up removing the single level support
which existed by default(generic tce_iommu_userspace_view_alloc/free()
calls). On pSeries the TCEs are single level, and the allocation
of userspace view is lost with the removal of generic code.

:( :(

If this has been broken since 2018 and nobody cared till now can we
please go in a direction of moving this code to the new iommu APIs
instead of doubling down on more of this old stuff that apparently
almost nobody cares about ??


We have existing software stack deployments using VFIO userspace
device assignment running on Power platform. We have to enable
similar software stack on newer generation Power10 platform and
also in a pSeries lpar environment. These distros rely on VFIO enabled
in kernel and currently have IOMMUFD disabled. This patch series is
a simpler low risk enablement that functionally get the software stack
working while we continue to enable and move to IOMMUFD in phases.
We have to fix the older APIs in order to stage the functional enablement
in small increments.

We are working on iommufd support for pSeries and looking forward
to Timothy's patches.


-Thanks

Shivaprasad


Jason


[RFC PATCH 3/3] pseries/iommu: Enable DDW for VFIO TCE create

2024-03-12 Thread Shivaprasad G Bhat
The commit 9d67c9433509 ("powerpc/iommu: Add \"borrowing\"
iommu_table_group_ops") implemented the "borrow" mechanism for
the pSeries SPAPR TCE. It did implement this support partially
that it left out creating the DDW if not present already.

The patch here attempts to fix the missing gaps.
 - Expose the DDW info to user by collecting it during probe.
 - Create the window and the iommu table if not present during
   VFIO_SPAPR_TCE_CREATE.
 - Remove and recreate the window if the pageshift and window sizes
   do not match.
 - Restore the original window in enable_ddw() if the user had
   created/modified the DDW. As there is preference for DIRECT mapping
   on the host driver side, the user created window is removed.

The changes work only for the non-SRIOV-VF scenarios for PEs having
2 DMA windows.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/include/asm/iommu.h   |3 
 arch/powerpc/kernel/iommu.c|7 -
 arch/powerpc/platforms/pseries/iommu.c |  362 +++-
 3 files changed, 360 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 744cc5fc22d3..fde174122844 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -110,6 +110,7 @@ struct iommu_table {
unsigned long  it_page_shift;/* table iommu page size */
struct list_head it_group_list;/* List of iommu_table_group_link */
__be64 *it_userspace; /* userspace view of the table */
+   bool reset_ddw;
struct iommu_table_ops *it_ops;
struct krefit_kref;
int it_nid;
@@ -169,6 +170,8 @@ struct iommu_table_group_ops {
__u32 page_shift,
__u64 window_size,
__u32 levels);
+   void (*init_group)(struct iommu_table_group *table_group,
+   struct device *dev);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index aa11b2acf24f..1cce2b8b8f2c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -740,6 +740,7 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
return NULL;
}
 
+   tbl->it_nid = nid;
iommu_table_reserve_pages(tbl, res_start, res_end);
 
/* We only split the IOMMU table if we have 1GB or more of space */
@@ -1141,7 +1142,10 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_group *grp = iommu_group_get(dev);
-   struct iommu_table_group *table_group;
+   struct iommu_table_group *table_group = iommu_group_get_iommudata(grp);
+
+   /* This should have been in spapr_tce_iommu_probe_device() ?*/
+   table_group->ops->init_group(table_group, dev);
 
/* At first attach the ownership is already set */
if (!domain) {
@@ -1149,7 +1153,6 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
return 0;
}
 
-   table_group = iommu_group_get_iommudata(grp);
/*
 * The domain being set to PLATFORM from earlier
 * BLOCKED. The table_group ownership has to be released.
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 3d9865dadf73..7224107a0f60 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -630,6 +630,62 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
phb->dma_window_base_cur += phb->dma_window_size;
 }
 
+static int iommu_table_reset(struct iommu_table *tbl, unsigned long busno,
+  unsigned long liobn, unsigned long win_addr,
+  unsigned long window_size, unsigned long 
page_shift,
+  void *base, struct iommu_table_ops 
*table_ops)
+{
+   unsigned long sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+   unsigned int i, oldsize = tbl->it_size;
+   struct iommu_pool *p;
+
+   WARN_ON(!tbl->it_ops);
+
+   if (oldsize != (window_size >> page_shift)) {
+   vfree(tbl->it_map);
+
+   tbl->it_map = vzalloc_node(sz, tbl->it_nid);
+   if (!tbl->it_map)
+   return -ENOMEM;
+
+   tbl->it_size = window_size >> page_shift;
+   if (oldsize < (window_size >> page_shift))
+   iommu_table_clear(tbl);
+   }
+   tbl->it_busno = busno;
+   tbl->it_index = liobn;
+   tbl->it_offset = win_addr >> page_shift;
+   tbl->it_blocksize = 16;
+   t

[RFC PATCH 2/3] powerpc/iommu: Move pSeries specific functions to pseries/iommu.c

2024-03-12 Thread Shivaprasad G Bhat
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c.
The pSeries specific table_group_ops are sitting in the generic powerpc
file. Move it to where it actually belong(pseries/iommu.c).

Only code movement, no functional changes intended.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/include/asm/iommu.h   |4 +
 arch/powerpc/kernel/iommu.c|  149 
 arch/powerpc/platforms/pseries/iommu.c |  145 +++
 3 files changed, 150 insertions(+), 148 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 026695943550..744cc5fc22d3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -156,6 +156,9 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
 bool iommu_table_in_use(struct iommu_table *tbl);
+extern void iommu_table_reserve_pages(struct iommu_table *tbl,
+   unsigned long res_start, unsigned long res_end);
+extern void iommu_table_clear(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
@@ -218,7 +221,6 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 extern void iommu_tce_kill(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 
-extern struct iommu_table_group_ops spapr_tce_table_group_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1185efebf032..aa11b2acf24f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -642,7 +642,7 @@ void ppc_iommu_unmap_sg(struct iommu_table *tbl, struct 
scatterlist *sglist,
tbl->it_ops->flush(tbl);
 }
 
-static void iommu_table_clear(struct iommu_table *tbl)
+void iommu_table_clear(struct iommu_table *tbl)
 {
/*
 * In case of firmware assisted dump system goes through clean
@@ -683,7 +683,7 @@ static void iommu_table_clear(struct iommu_table *tbl)
 #endif
 }
 
-static void iommu_table_reserve_pages(struct iommu_table *tbl,
+void iommu_table_reserve_pages(struct iommu_table *tbl,
unsigned long res_start, unsigned long res_end)
 {
int i;
@@ -1101,59 +1101,6 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-static int iommu_take_ownership(struct iommu_table *tbl)
-{
-   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-   int ret = 0;
-
-   /*
-* VFIO does not control TCE entries allocation and the guest
-* can write new TCEs on top of existing ones so iommu_tce_build()
-* must be able to release old pages. This functionality
-* requires exchange() callback defined so if it is not
-* implemented, we disallow taking ownership over the table.
-*/
-   if (!tbl->it_ops->xchg_no_kill)
-   return -EINVAL;
-
-   spin_lock_irqsave(>large_pool.lock, flags);
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock_nest_lock(>pools[i].lock, >large_pool.lock);
-
-   if (iommu_table_in_use(tbl)) {
-   pr_err("iommu_tce: it_map is not empty");
-   ret = -EBUSY;
-   } else {
-   memset(tbl->it_map, 0xff, sz);
-   }
-
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_unlock(>pools[i].lock);
-   spin_unlock_irqrestore(>large_pool.lock, flags);
-
-   return ret;
-}
-
-static void iommu_release_ownership(struct iommu_table *tbl)
-{
-   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
-
-   spin_lock_irqsave(>large_pool.lock, flags);
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock_nest_lock(>pools[i].lock, >large_pool.lock);
-
-   memset(tbl->it_map, 0, sz);
-
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-
-   for (i = 0; i < tbl->nr_pools; i++)
-   spin_unlock(>pools[i].lock);
-   spin_unlock_irqrestore(>large_pool.lock, flags);
-}
-#endif
-
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
/*
@@ -1185,98 +1132,6 @@ int iommu_add_device(struct iommu_table_group 
*table_group, struct device *dev)
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
-/*
- * A simple iommu_table_group_ops which only allows reusing the existing
- * iommu_table. This handles VFIO for POWER7 or the nested KVM.
- * The ops does not allow creating windows and only allo

[RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables

2024-03-12 Thread Shivaprasad G Bhat
The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
it_userspace") which implemented the tce indirect levels
support for PowerNV ended up removing the single level support
which existed by default(generic tce_iommu_userspace_view_alloc/free()
calls). On pSeries the TCEs are single level, and the allocation
of userspace view is lost with the removal of generic code.

The patch attempts to bring it back for pseries on the refactored
code base.

On pSeries, the windows/tables are "borrowed", so the it_ops->free()
is not called during the container detach or the tce release call paths
as the table is not really freed. So, decoupling the userspace view
array free and alloc from table's it_ops just the way it was before.

Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/platforms/pseries/iommu.c |   19 ++--
 drivers/vfio/vfio_iommu_spapr_tce.c|   51 
 2 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..40de8d55faef 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -143,7 +143,7 @@ static int tce_build_pSeries(struct iommu_table *tbl, long 
index,
 }
 
 
-static void tce_free_pSeries(struct iommu_table *tbl, long index, long npages)
+static void tce_clear_pSeries(struct iommu_table *tbl, long index, long npages)
 {
__be64 *tcep;
 
@@ -162,6 +162,11 @@ static unsigned long tce_get_pseries(struct iommu_table 
*tbl, long index)
return be64_to_cpu(*tcep);
 }
 
+static void tce_free_pSeries(struct iommu_table *tbl)
+{
+   /* Do nothing. */
+}
+
 static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
@@ -576,7 +581,7 @@ struct iommu_table_ops iommu_table_lpar_multi_ops;
 
 struct iommu_table_ops iommu_table_pseries_ops = {
.set = tce_build_pSeries,
-   .clear = tce_free_pSeries,
+   .clear = tce_clear_pSeries,
.get = tce_get_pseries
 };
 
@@ -685,15 +690,23 @@ static int tce_exchange_pseries(struct iommu_table *tbl, 
long index, unsigned
 
return rc;
 }
+
+static __be64 *tce_useraddr_pSeriesLP(struct iommu_table *tbl, long index,
+ bool __always_unused alloc)
+{
+   return tbl->it_userspace ? >it_userspace[index - tbl->it_offset] : 
NULL;
+}
 #endif
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
.set = tce_buildmulti_pSeriesLP,
 #ifdef CONFIG_IOMMU_API
.xchg_no_kill = tce_exchange_pseries,
+   .useraddrptr = tce_useraddr_pSeriesLP,
 #endif
.clear = tce_freemulti_pSeriesLP,
-   .get = tce_get_pSeriesLP
+   .get = tce_get_pSeriesLP,
+   .free = tce_free_pSeries
 };
 
 /*
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index a94ec6225d31..1cf36d687559 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -177,6 +177,50 @@ static long tce_iommu_register_pages(struct tce_container 
*container,
return ret;
 }
 
+static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
+   struct mm_struct *mm)
+{
+   unsigned long cb = ALIGN(sizeof(tbl->it_userspace[0]) *
+   tbl->it_size, PAGE_SIZE);
+   unsigned long *uas;
+   long ret;
+
+   if (tbl->it_indirect_levels)
+   return 0;
+
+   WARN_ON(tbl->it_userspace);
+
+   ret = account_locked_vm(mm, cb >> PAGE_SHIFT, true);
+   if (ret)
+   return ret;
+
+   uas = vzalloc(cb);
+   if (!uas) {
+   account_locked_vm(mm, cb >> PAGE_SHIFT, false);
+   return -ENOMEM;
+   }
+   tbl->it_userspace = (__be64 *) uas;
+
+   return 0;
+}
+
+static void tce_iommu_userspace_view_free(struct iommu_table *tbl,
+   struct mm_struct *mm)
+{
+   unsigned long cb = ALIGN(sizeof(tbl->it_userspace[0]) *
+   tbl->it_size, PAGE_SIZE);
+
+   if (!tbl->it_userspace)
+   return;
+
+   if (tbl->it_indirect_levels)
+   return;
+
+   vfree(tbl->it_userspace);
+   tbl->it_userspace = NULL;
+   account_locked_vm(mm, cb >> PAGE_SHIFT, false);
+}
+
 static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
unsigned int it_page_shift)
 {
@@ -554,6 +598,12 @@ static long tce_iommu_build_v2(struct tce_container 
*container,
unsigned long hpa;
enum dma_data_direction dirtmp;
 
+   if (!tbl->it_userspace) {
+   ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
+   if (ret)
+   return ret;
+   }
+
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group

[RFC PATCH 0/3] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO

2024-03-12 Thread Shivaprasad G Bhat
This patch-set fills the missing gaps on pSeries for VFIO SPAPR TCE
sub-driver thereby re-enabling the VFIO support on POWER pSeries
machines.

Structure of the patchset
=

* Currently, due to [1] VFIO_IOMMU_UNMAP_{DMA,UNMAP} ioctls are broken
on pSeries as it only supports single level TCEs. This is addressed in
the first patch as the lost functionality is restored.

On pSeries, the DMA windows (default 32-bit and the 64-bit DDW) are
"borrowed" between the host and the vfio driver. The necessary
mechanism for this is already been in place since [2]. However,
the VFIO SPAPR-TCE sub-driver doesn't open the 64-bit window if
it wasn't already done by a host driver like NVME. So the user-space
only gets access to the default 32-bit DMA window alone. This poses a
challenge for devices having no host kernel drivers and completely
depend on VFIO user-space interface to request DMA. The
VFIO_SPAPR_TCE_CREATE ioctl currently just returns EPERM without
attempting to open the second window for such devices.

* The second patch is just code movement for pSeries specific
functions from arch iommu to the pSeries platform iommu file.
This is needed as the DMA window manipulation operations
introduced in the third patch depend on these functions and are
entirely pSeries specific.

* The third patch adds necessary support to open up the 64-bit DMA
window on VFIO_SPAPR_TCE_CREATE ioctl from the user-space. It also
collects the DDW information from the platform for exposing it to
user through 'struct vfio_iommu_spapr_tce_ddw_info'.

Testing

These patches are tested with by attaching a nvme disk to a nested
kvm guest running a pSeries lpar. Also vfio-test [3] by Alex Willamson,
was forked and updated to add support for pSeries guest and used to
test these patches[4].

Limitations/Known Issues

* Does not work for SRIOV VFs, as they have only one DMA window.
* Does not work for multi-function cards.
* Bugs
  - mmdrop() in tce_iommu_release() when container detached
with pending unmaps.

[1] Commit: 090bad39b237 ("powerpc/powernv: Add indirect levels to 
it_userspace")
[2] Commit: 9d67c9433509 ("powerpc/iommu: Add \"borrowing\" 
iommu_table_group_ops")
[3] https://github.com/awilliam/tests
[4] https://github.com/nnmwebmin/vfio-ppc-tests/tree/vfio-ppc-ex

---

Shivaprasad G Bhat (3):
  powerpc/pseries/iommu: Bring back userspace view for single level TCE 
tables
  powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
  pseries/iommu: Enable DDW for VFIO TCE create


 arch/powerpc/include/asm/iommu.h   |   7 +-
 arch/powerpc/kernel/iommu.c| 156 +---
 arch/powerpc/platforms/pseries/iommu.c | 514 -
 drivers/vfio/vfio_iommu_spapr_tce.c|  51 +++
 4 files changed, 571 insertions(+), 157 deletions(-)



Re: [PATCH v2] powerpc/iommu: Fix the iommu group reference leak during platform domain attach

2024-02-15 Thread Shivaprasad G Bhat

On 2/15/24 08:01, Michael Ellerman wrote:

Shivaprasad G Bhat  writes:

The function spapr_tce_platform_iommu_attach_dev() is missing to call
iommu_group_put() when the domain is already set. This refcount leak
shows up with BUG_ON() during DLPAR remove operation as,



   [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
   [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
   [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
   [c013aed5fe50] [c000d05c] system_call_vectored_common+0x15c/0x2ec
   --- interrupt: 3000 at 0x2433acb4
   
   ---[ end trace  ]---

The patch makes the iommu_group_get() call only when using it there by
avoiding the leak.

Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions after 
probe")
Reported-by: Venkat Rao Bagalkote 
Closes: 
https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e2...@linux.vnet.ibm.com
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
v1: 
https://lore.kernel.org/all/170784021983.6249.10039296655906636112.st...@linux.ibm.com/
  - Minor refactor to call the iommu_group_get() only if required.
  - Updated the title, description and signature(Closes/Reported-by).

Sorry I already applied v1.

If you send this as a patch on top of v1 with a new change log I can
merge it as a cleanup/rework.


I have posted the cleanup patch at 
https://lore.kernel.org/linux-iommu/170800513841.2411.13524607664262048895.st...@linux.ibm.com/


Thank you!

Shivaprasad


cheers


[PATCH] powerpc/iommu: Refactor spapr_tce_platform_iommu_attach_dev()

2024-02-15 Thread Shivaprasad G Bhat
The patch makes the iommu_group_get() call only when using it
thereby avoiding the unnecessary get & put for domain already
being set case.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
v2: 
https://lore.kernel.org/linux-iommu/170793401503.7491.9431631474642074097.st...@linux.ibm.com/
 - As the v1 itself was merged, the patch was suggested to be reposted as
   cleanup/refactoring to be applied on top of v1.
 - Removed the versioning as this is actually new cleanup/refactoring.
 - Retaining the Reviewed-by as the effective new code was actually reviewed.

v1: 
https://lore.kernel.org/all/170784021983.6249.10039296655906636112.st...@linux.ibm.com/
 - Minor refactor to call the iommu_get_group only if required.
 - Updated the title, description and signature(Closes/Reported-by).

 arch/powerpc/kernel/iommu.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a9bebfd56b3b..37fae3bd89c6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1285,15 +1285,14 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
+   struct iommu_group *grp;

/* At first attach the ownership is already set */
-   if (!domain) {
-   iommu_group_put(grp);
+   if (!domain)
return 0;
-   }

+   grp = iommu_group_get(dev);
table_group = iommu_group_get_iommudata(grp);
/*
 * The domain being set to PLATFORM from earlier




Re: [PATCH] powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach

2024-02-14 Thread Shivaprasad G Bhat



On 2/14/24 18:28, Jason Gunthorpe wrote:

On Wed, Feb 14, 2024 at 11:53:20PM +1100, Michael Ellerman wrote:

Venkat Rao Bagalkote  writes:

Thanks for the patch. Applied this patch and verified and issue is fixed.

This issue way originally reported in the below mail.

https://marc.info/?l=linux-kernel=170737160630106=2

Please use lore for links, in this case:

https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e2...@linux.vnet.ibm.com/

Also if you are respinning you may prefer this

@@ -1285,14 +1285,15 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
 struct device *dev)
  {
 struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct iommu_group *grp = iommu_group_get(dev);
 struct iommu_table_group *table_group;
+   struct iommu_group *grp;
 int ret = -EINVAL;
  
 /* At first attach the ownership is already set */

 if (!domain)
 return 0;
  
+   grp = iommu_group_get(dev);

 if (!grp)
 return -ENODEV;

Which is sort of why this happened in the first place :)


Right! Posted the v2 here

https://lore.kernel.org/linux-iommu/170793401503.7491.9431631474642074097.st...@linux.ibm.com/


Thanks,

Shivaprasad


Jason


[PATCH v2] powerpc/iommu: Fix the iommu group reference leak during platform domain attach

2024-02-14 Thread Shivaprasad G Bhat
The function spapr_tce_platform_iommu_attach_dev() is missing to call
iommu_group_put() when the domain is already set. This refcount leak
shows up with BUG_ON() during DLPAR remove operation as,

  KernelBug: Kernel bug in state 'None': kernel BUG at 
arch/powerpc/platforms/pseries/iommu.c:100!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
  
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
(NH1060_016) hv:phyp pSeries
  NIP:  c00ff4d4 LR: c00ff4cc CTR: 
  REGS: c013aed5f840 TRAP: 0700   Tainted: G  I 
(6.8.0-rc3-autotest-g99bd3cb0d12e)
  MSR:  80029033   CR: 44002402  XER: 2004
  CFAR: c0a0d170 IRQMASK: 0
  GPR00: c00ff4cc c013aed5fae0 c1512700 c013aa362138
  GPR04:    000119c8afd0
  GPR08:  c01284442b00 0001 1003
  GPR12: 0003 c0182f00  
  GPR16:    
  GPR20:    
  GPR24: c013aed5fc40 0002  c2757d90
  GPR28: c00ff440 c2757cb8 c0183799c1a0 c013aa362b00
  NIP [c00ff4d4] iommu_reconfig_notifier+0x94/0x200
  LR [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200
  Call Trace:
  [c013aed5fae0] [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200 
(unreliable)
  [c013aed5fb10] [c01a27b0] notifier_call_chain+0xb8/0x19c
  [c013aed5fb70] [c01a2a78] blocking_notifier_call_chain+0x64/0x98
  [c013aed5fbb0] [c0c4a898] of_reconfig_notify+0x44/0xdc
  [c013aed5fc20] [c0c4add4] of_detach_node+0x78/0xb0
  [c013aed5fc70] [c00f96a8] ofdt_write.part.0+0x86c/0xbb8
  [c013aed5fce0] [c069b4bc] proc_reg_write+0xf4/0x150
  [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
  [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
  [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
  [c013aed5fe50] [c000d05c] system_call_vectored_common+0x15c/0x2ec
  --- interrupt: 3000 at 0x2433acb4
  
  ---[ end trace  ]---

The patch makes the iommu_group_get() call only when using it there by
avoiding the leak.

Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions 
after probe")
Reported-by: Venkat Rao Bagalkote 
Closes: 
https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e2...@linux.vnet.ibm.com
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
v1: 
https://lore.kernel.org/all/170784021983.6249.10039296655906636112.st...@linux.ibm.com/
 - Minor refactor to call the iommu_group_get() only if required.
 - Updated the title, description and signature(Closes/Reported-by).

 arch/powerpc/kernel/iommu.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d71eac3b2887..37fae3bd89c6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1285,13 +1285,14 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
+   struct iommu_group *grp;

/* At first attach the ownership is already set */
if (!domain)
return 0;

+   grp = iommu_group_get(dev);
table_group = iommu_group_get_iommudata(grp);
/*
 * The domain being set to PLATFORM from earlier




[PATCH] powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach

2024-02-13 Thread Shivaprasad G Bhat
The function spapr_tce_platform_iommu_attach_dev() is missing to call
iommu_group_put() when the domain is already set. This refcount leak
shows up with BUG_ON() during DLPAR remove operation as,

  KernelBug: Kernel bug in state 'None': kernel BUG at 
arch/powerpc/platforms/pseries/iommu.c:100!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
  
  Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
(NH1060_016) hv:phyp pSeries
  NIP:  c00ff4d4 LR: c00ff4cc CTR: 
  REGS: c013aed5f840 TRAP: 0700   Tainted: G  I 
(6.8.0-rc3-autotest-g99bd3cb0d12e)
  MSR:  80029033   CR: 44002402  XER: 2004
  CFAR: c0a0d170 IRQMASK: 0
  GPR00: c00ff4cc c013aed5fae0 c1512700 c013aa362138
  GPR04:    000119c8afd0
  GPR08:  c01284442b00 0001 1003
  GPR12: 0003 c0182f00  
  GPR16:    
  GPR20:    
  GPR24: c013aed5fc40 0002  c2757d90
  GPR28: c00ff440 c2757cb8 c0183799c1a0 c013aa362b00
  NIP [c00ff4d4] iommu_reconfig_notifier+0x94/0x200
  LR [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200
  Call Trace:
  [c013aed5fae0] [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200 
(unreliable)
  [c013aed5fb10] [c01a27b0] notifier_call_chain+0xb8/0x19c
  [c013aed5fb70] [c01a2a78] blocking_notifier_call_chain+0x64/0x98
  [c013aed5fbb0] [c0c4a898] of_reconfig_notify+0x44/0xdc
  [c013aed5fc20] [c0c4add4] of_detach_node+0x78/0xb0
  [c013aed5fc70] [c00f96a8] ofdt_write.part.0+0x86c/0xbb8
  [c013aed5fce0] [c069b4bc] proc_reg_write+0xf4/0x150
  [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
  [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
  [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
  [c013aed5fe50] [c000d05c] system_call_vectored_common+0x15c/0x2ec
  --- interrupt: 3000 at 0x2433acb4
  
  ---[ end trace  ]---

The patch adds the missing iommu_group_put() call.

Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions 
after probe")
Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/kernel/iommu.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d71eac3b2887..a9bebfd56b3b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1289,8 +1289,10 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct iommu_table_group *table_group;
 
/* At first attach the ownership is already set */
-   if (!domain)
+   if (!domain) {
+   iommu_group_put(grp);
return 0;
+   }
 
table_group = iommu_group_get_iommudata(grp);
/*




Re: [PATCH] powerpc/papr_scm: Move duplicate definitions to common header files

2024-01-27 Thread Shivaprasad G Bhat

On 1/26/24 02:46, Christophe Leroy wrote:


Le 18/04/2022 à 06:38, Shivaprasad G Bhat a écrit :

papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across
papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
arch independent and can run on platforms other than PPC64, a way
needs to be deviced to avoid redundancy and duplication of PDSM
structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include-
-/uapi/ to the generic include/uapi/linux directory. Also, there are
some #defines common between papr_scm and ndtest which are not exported
to the user space. So, move them to a header file which can be shared
across ndtest and papr_scm via newly introduced include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 

This patch doesn't apply, if still relevant can you please rebase and
re-submit ?


Thanks for taking a look.


I have rebased and reposted the patch here

https://lore.kernel.org/nvdimm/170638176942.112443.2937254675538057083.st...@ltcd48-lp2.aus.stglab.ibm.com/T/#u


Thanks!

Shivaprasad





[REPOST PATCH v3] powerpc/papr_scm: Move duplicate definitions to common header files

2024-01-27 Thread Shivaprasad G Bhat
papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across
papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
arch independent and can run on platforms other than PPC64, a way
needs to be deviced to avoid redundancy and duplication of PDSM
structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include-
-/uapi/ to the generic include/uapi/linux directory. Also, there
are some #defines common between papr_scm and ndtest which are not
exported to the user space. So, move them to a header file which
can be shared across ndtest and papr_scm via newly introduced
include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 
---
Changelog:
Repost of v3:
Link: 
https://lore.kernel.org/all/165763948558.3501667.16230079003621326755.st...@ltc-boston123.aus.stglabs.ibm.com/
Rebased again, no changes.

Repost of v3:
Link: 
https://lore.kernel.org/nvdimm/165025666388.2927278.9540058958498766114.st...@lep8c.aus.stglabs.ibm.com/
Rebased and no changes.

Since v2:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/163454440296.431294.2368481747380790011.st...@lep8c.aus.stglabs.ibm.com/
* Made it like v1, and rebased.
* Fixed repeating words in comments of the header file papr_scm.h

Since v1:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
* Removed dependency on this patch for the other patches

 MAINTAINERS   |2
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -
 arch/powerpc/platforms/pseries/papr_scm.c |   43 
 include/linux/papr_scm.h  |   49 +
 include/uapi/linux/papr_pdsm.h|  165 +
 tools/testing/nvdimm/test/ndtest.c|2
 tools/testing/nvdimm/test/ndtest.h|   31 -
 7 files changed, 220 insertions(+), 237 deletions(-)
 delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
 create mode 100644 include/linux/papr_scm.h
 create mode 100644 include/uapi/linux/papr_pdsm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 39219b144c23..70da9aa81654 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12393,6 +12393,8 @@ F:  drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
 F: drivers/tty/hvc/hvc_opal.c
 F: drivers/watchdog/wdrtas.c
+F: include/linux/papr_scm.h
+F: include/uapi/linux/papr_pdsm.h
 F: tools/testing/selftests/powerpc
 N: /pmac
 N: powermac
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
deleted file mode 100644
index 17439925045c..
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ /dev/null
@@ -1,165 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
- *
- * (C) Copyright IBM 2020
- *
- * Author: Vaibhav Jain 
- */
-
-#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-
-#include 
-#include 
-
-/*
- * PDSM Envelope:
- *
- * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
- * envelope which consists of 2 headers sections and payload sections as
- * illustrated below:
- *  +-+---+---+
- *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
- *  +-+---+---+
- *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
- *  +-+---+---+
- *  | nd_family   |   |   |
- *  | nd_size_out | cmd_status|   |
- *  | nd_size_in  | reserved  | nd_pdsm_payload   |
- *  | nd_command  | payload   --> |   |
- *  | nd_fw_size  |   |   |
- *  | nd_payload ---> |   |   |
- *  +---+-+---+
- *
- * ND Header:
- * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
- * which is interpreted by libnvdimm before passed on to papr_scm. Important
- * member fields used are:
- * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM
- * 'nd_size_in': (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
- * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
- * 'nd_command' : (In) One of PAPR_PDSM_XXX
- * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
- *
- * PDSM Header:
- * This is papr-scm specific header that precedes the payload. This is defined
- * as nd_cmd_pdsm_pkg.  Following fields aare available in this header:
- *
- * 'cmd_status': (Out) Errors if any encountered while

Re: [PATCH 2/2] iommu: Fix the domain type checks when default_domain is set

2024-01-26 Thread Shivaprasad G Bhat

On 1/25/24 21:22, Jason Gunthorpe wrote:

On Thu, Jan 25, 2024 at 06:08:52AM -0600, Shivaprasad G Bhat wrote:

On PPC64, the iommu_ops.def_domain_type() is not defined and
CONFIG_IOMMU_DMA not enabled. With commit 0f6a90436a57 ("iommu: Do not
use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is not enabled"), the
iommu_get_default_domain_type() return IOMMU_DOMAIN_IDENTITY. With
commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove
set_platform_dma_ops"), the defaule_domain is set wih the type being
PLATFORM. With these two changes, iommu_group_alloc_default_domain()
ends up returning the NULL(with recent changes, ERR_PTR(-EINVAL))
leading to iommu_probe_device() failure and the device has no
iommu_group set in effect. Subsequently, the bind to vfio(VFIO_IOMMU)
fail as the iommu_group is not set for the device.

Make the iommu_get_default_domain_type() to take default_domain->type
into consideration along with default_domain_type() and fix
iommu_group_alloc_default_domain() to not error out if the requested
type is same as default domain type.

Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
set_platform_dma_ops")
Fixes: 0f6a90436a57 ("iommu: Do not use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is not 
enabled")
Signed-off-by: Shivaprasad G Bhat 
---
  drivers/iommu/iommu.c |   14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

Are you OK with this version?

https://lore.kernel.org/linux-iommu/20240123174905.gs50...@ziepe.ca/

?

I think it does the same thing


Yes, This works. I see a very minor difference of default_domain->type 
is given


precedence over def_domain_type(). Please keep me CC when you post this 
fix, I'll


test it with any(?) other changes if coming along with it.


Thanks,

Shivaprasad


Jason


Re: [PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-26 Thread Shivaprasad G Bhat



On 1/25/24 21:20, Jason Gunthorpe wrote:

On Thu, Jan 25, 2024 at 06:08:39AM -0600, Shivaprasad G Bhat wrote:

The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and

[snip]

+   /*
+* The domain being set to PLATFORM from earlier
+* BLOCKED. The table_group ownership has to be released.
+*/
+   table_group->ops->release_ownership(table_group);
+   } else if (platform_domain->type == IOMMU_DOMAIN_BLOCKED) {
+   table_group = iommu_group_get_iommudata(grp);
+   ret = table_group->ops->take_ownership(table_group);
+   iommu_group_put(grp);
+   }

Sure, but please split the function, don't test on the
platform->domain_type.

Sure.

Also, is there any chance someone can work on actually fixing this to
be a proper iommu driver? I think that will become important for power
to use the common dma_iommu code in the next year...

We are looking into it.

Sort of like this:

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd46298..0d6a7fea2bd9a5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1287,20 +1287,20 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_group *grp = iommu_group_get(dev);

[snip]

+static const struct iommu_domain_ops spapr_tce_blocked_domain_ops = {
+   .attach_dev = spapr_tce_blocked_iommu_attach_dev,
+};
+
+static struct iommu_domain spapr_tce_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = _tce_blocked_domain_ops,
  };
  
  static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)


I have posted the next version as suggested.

Thanks for the quick review!

Regards,

Shivaprasad




[PATCH v2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-26 Thread Shivaprasad G Bhat
The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and
remove set_platform_dma_ops") refactored the code removing the
set_platform_dma_ops(). It missed out the table group
release_ownership() call which would have got called otherwise
during the guest shutdown via vfio_group_detach_container(). On
PPC64, this particular call actually sets up the 32-bit TCE table,
and enables the 64-bit DMA bypass etc. Now after guest shutdown,
the subsequent host driver (e.g megaraid-sas) probe post unbind
from vfio-pci fails like,

megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 
0x7fff, table unavailable
megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x, 
table unavailable
megaraid_sas 0031:01:00.0: Failed to set DMA mask
megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539

The patch brings back the call to table_group release_ownership()
call when switching back to PLATFORM domain from BLOCKED, while
also separates the domain_ops for both.

Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
set_platform_dma_ops")
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
v1: 
https://lore.kernel.org/linux-iommu/170618451433.3805.9015493852395837391.st...@ltcd48-lp2.aus.stglab.ibm.com/
 - Split the common attach_dev call to platform and blocked attach_dev
   calls as suggested.

 arch/powerpc/kernel/iommu.c |   37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd462..d71eac3b2887 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1287,20 +1287,20 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
-   int ret = -EINVAL;

/* At first attach the ownership is already set */
if (!domain)
return 0;

-   if (!grp)
-   return -ENODEV;
-
table_group = iommu_group_get_iommudata(grp);
-   ret = table_group->ops->take_ownership(table_group);
+   /*
+* The domain being set to PLATFORM from earlier
+* BLOCKED. The table_group ownership has to be released.
+*/
+   table_group->ops->release_ownership(table_group);
iommu_group_put(grp);

-   return ret;
+   return 0;
 }

 static const struct iommu_domain_ops spapr_tce_platform_domain_ops = {
@@ -1312,13 +1312,32 @@ static struct iommu_domain spapr_tce_platform_domain = {
.ops = _tce_platform_domain_ops,
 };

-static struct iommu_domain spapr_tce_blocked_domain = {
-   .type = IOMMU_DOMAIN_BLOCKED,
+static int
+spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
+struct device *dev)
+{
+   struct iommu_group *grp = iommu_group_get(dev);
+   struct iommu_table_group *table_group;
+   int ret = -EINVAL;
+
/*
 * FIXME: SPAPR mixes blocked and platform behaviors, the blocked domain
 * also sets the dma_api ops
 */
-   .ops = _tce_platform_domain_ops,
+   table_group = iommu_group_get_iommudata(grp);
+   ret = table_group->ops->take_ownership(table_group);
+   iommu_group_put(grp);
+
+   return ret;
+}
+
+static const struct iommu_domain_ops spapr_tce_blocked_domain_ops = {
+   .attach_dev = spapr_tce_blocked_iommu_attach_dev,
+};
+
+static struct iommu_domain spapr_tce_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = _tce_blocked_domain_ops,
 };

 static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)




[PATCH 2/2] iommu: Fix the domain type checks when default_domain is set

2024-01-25 Thread Shivaprasad G Bhat
On PPC64, the iommu_ops.def_domain_type() is not defined and
CONFIG_IOMMU_DMA not enabled. With commit 0f6a90436a57 ("iommu: Do not
use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is not enabled"), the
iommu_get_default_domain_type() return IOMMU_DOMAIN_IDENTITY. With
commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove
set_platform_dma_ops"), the defaule_domain is set wih the type being
PLATFORM. With these two changes, iommu_group_alloc_default_domain()
ends up returning the NULL(with recent changes, ERR_PTR(-EINVAL))
leading to iommu_probe_device() failure and the device has no
iommu_group set in effect. Subsequently, the bind to vfio(VFIO_IOMMU)
fail as the iommu_group is not set for the device.

Make the iommu_get_default_domain_type() to take default_domain->type
into consideration along with default_domain_type() and fix
iommu_group_alloc_default_domain() to not error out if the requested
type is same as default domain type.

Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
set_platform_dma_ops")
Fixes: 0f6a90436a57 ("iommu: Do not use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is 
not enabled")
Signed-off-by: Shivaprasad G Bhat 
---
 drivers/iommu/iommu.c |   14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 68e648b55767..04083a1d9f7e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -135,6 +135,8 @@ static struct group_device *iommu_group_alloc_device(struct 
iommu_group *group,
 struct device *dev);
 static void __iommu_group_free_device(struct iommu_group *group,
  struct group_device *grp_dev);
+static int iommu_get_def_domain_type(struct iommu_group *group,
+struct device *dev, int cur_type);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)  \
 struct iommu_group_attribute iommu_group_attr_##_name =\
@@ -1788,7 +1790,8 @@ __iommu_group_alloc_default_domain(struct iommu_group 
*group, int req_type)
 static struct iommu_domain *
 iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
-   const struct iommu_ops *ops = 
dev_iommu_ops(iommu_group_first_dev(group));
+   struct device *dev = iommu_group_first_dev(group);
+   const struct iommu_ops *ops = dev_iommu_ops(dev);
struct iommu_domain *dom;
 
lockdep_assert_held(>mutex);
@@ -1799,7 +1802,7 @@ iommu_group_alloc_default_domain(struct iommu_group 
*group, int req_type)
 * domain. Do not use in new drivers.
 */
if (ops->default_domain) {
-   if (req_type)
+   if (iommu_get_def_domain_type(group, dev, req_type) != req_type)
return ERR_PTR(-EINVAL);
return ops->default_domain;
}
@@ -1871,10 +1874,13 @@ static int iommu_get_def_domain_type(struct iommu_group 
*group,
const struct iommu_ops *ops = dev_iommu_ops(dev);
int type;
 
-   if (!ops->def_domain_type)
+   if (ops->def_domain_type)
+   type = ops->def_domain_type(dev);
+   else if (group->default_domain)
+   type = group->default_domain->type;
+   else
return cur_type;
 
-   type = ops->def_domain_type(dev);
if (!type || cur_type == type)
return cur_type;
if (!cur_type)




[PATCH 1/2] powerpc: iommu: Bring back table group release_ownership() call

2024-01-25 Thread Shivaprasad G Bhat
The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and
remove set_platform_dma_ops") refactored the code removing the
set_platform_dma_ops(). It missed out the table group
release_ownership() call which would have got called otherwise
during the guest shutdown via vfio_group_detach_container(). On
PPC64, this particular call actually sets up the 32-bit TCE table,
and enables the 64-bit DMA bypass etc. Now after guest shutdown,
the subsequent host driver (e.g megaraid-sas) probe post unbind
from vfio-pci fails like,

megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 
0x7fff, table unavailable
megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x, 
table unavailable
megaraid_sas 0031:01:00.0: Failed to set DMA mask
megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539

The patch brings back the call to table_group release_ownership()
call when switching back to PLATFORM domain.

Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
set_platform_dma_ops")
Signed-off-by: Shivaprasad G Bhat 
---
 arch/powerpc/kernel/iommu.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ebe259bdd462..ac7df43fa7ef 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1296,9 +1296,19 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
if (!grp)
return -ENODEV;
 
-   table_group = iommu_group_get_iommudata(grp);
-   ret = table_group->ops->take_ownership(table_group);
-   iommu_group_put(grp);
+   if (platform_domain->type == IOMMU_DOMAIN_PLATFORM) {
+   ret = 0;
+   table_group = iommu_group_get_iommudata(grp);
+   /*
+* The domain being set to PLATFORM from earlier
+* BLOCKED. The table_group ownership has to be released.
+*/
+   table_group->ops->release_ownership(table_group);
+   } else if (platform_domain->type == IOMMU_DOMAIN_BLOCKED) {
+   table_group = iommu_group_get_iommudata(grp);
+   ret = table_group->ops->take_ownership(table_group);
+   iommu_group_put(grp);
+   }
 
return ret;
 }




[PATCH 0/2] powerpc: iommu: Fix the vfio-pci bind and unbind issues

2024-01-25 Thread Shivaprasad G Bhat
On PowerNV and pSeries machines, the bind and unbind of vfio-pci is
broken with the below commits.

2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
set_platform_dma_ops")
0f6a90436a57 ("iommu: Do not use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is not 
enabled")

Details of the same are in the following patch descriptions.

---

Shivaprasad G Bhat (2):
  powerpc: iommu: Bring back table group release_ownership() call
  iommu: Fix the domain type checks when default_domain is set


 arch/powerpc/kernel/iommu.c | 16 +---
 drivers/iommu/iommu.c   | 14 ++
 2 files changed, 23 insertions(+), 7 deletions(-)

--
Signature



[REPOST PATCH] ndtest: Cleanup all of blk namespace specific code

2022-07-12 Thread Shivaprasad G Bhat
With the nd_namespace_blk and nd_blk_region infrastructures being removed,
the ndtest still has some references to the old code. So the
compilation fails as below,

../tools/testing/nvdimm/test/ndtest.c:204:25: error: ‘ND_DEVICE_NAMESPACE_BLK’ 
undeclared here (not in a function); did you mean ‘ND_DEVICE_NAMESPACE_IO’?
  204 | .type = ND_DEVICE_NAMESPACE_BLK,
  | ^~~
  | ND_DEVICE_NAMESPACE_IO
../tools/testing/nvdimm/test/ndtest.c: In function ‘ndtest_create_region’:
../tools/testing/nvdimm/test/ndtest.c:630:17: error: ‘ndbr_desc’ undeclared 
(first use in this function); did you mean ‘ndr_desc’?
  630 | ndbr_desc.enable = ndtest_blk_region_enable;
  | ^
  | ndr_desc
../tools/testing/nvdimm/test/ndtest.c:630:17: note: each undeclared identifier 
is reported only once for each function it appears in
../tools/testing/nvdimm/test/ndtest.c:630:36: error: ‘ndtest_blk_region_enable’ 
undeclared (first use in this function)
  630 | ndbr_desc.enable = ndtest_blk_region_enable;
  |^~~~
../tools/testing/nvdimm/test/ndtest.c:631:35: error: ‘ndtest_blk_do_io’ 
undeclared (first use in this function); did you mean ‘ndtest_blk_mmio’?
  631 | ndbr_desc.do_io = ndtest_blk_do_io;
  |   ^~~~
  |   ndtest_blk_mmio

The current patch removes the specific code to cleanup all obsolete
references.

Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
Repost of v1:
Link - 
https://patchwork.kernel.org/project/linux-nvdimm/patch/165025395730.2821159.14794984437851867426.st...@lep8c.aus.stglabs.ibm.com/
No changes.

 tools/testing/nvdimm/test/ndtest.c |   77 
 1 file changed, 77 deletions(-)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 4d1a947367f9..01ceb98c15a0 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -134,39 +134,6 @@ static struct ndtest_mapping region1_mapping[] = {
},
 };

-static struct ndtest_mapping region2_mapping[] = {
-   {
-   .dimm = 0,
-   .position = 0,
-   .start = 0,
-   .size = DIMM_SIZE,
-   },
-};
-
-static struct ndtest_mapping region3_mapping[] = {
-   {
-   .dimm = 1,
-   .start = 0,
-   .size = DIMM_SIZE,
-   }
-};
-
-static struct ndtest_mapping region4_mapping[] = {
-   {
-   .dimm = 2,
-   .start = 0,
-   .size = DIMM_SIZE,
-   }
-};
-
-static struct ndtest_mapping region5_mapping[] = {
-   {
-   .dimm = 3,
-   .start = 0,
-   .size = DIMM_SIZE,
-   }
-};
-
 static struct ndtest_region bus0_regions[] = {
{
.type = ND_DEVICE_NAMESPACE_PMEM,
@@ -182,34 +149,6 @@ static struct ndtest_region bus0_regions[] = {
.size = DIMM_SIZE * 2,
.range_index = 2,
},
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region2_mapping),
-   .mapping = region2_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 3,
-   },
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region3_mapping),
-   .mapping = region3_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 4,
-   },
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region4_mapping),
-   .mapping = region4_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 5,
-   },
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region5_mapping),
-   .mapping = region5_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 6,
-   },
 };

 static struct ndtest_mapping region6_mapping[] = {
@@ -501,21 +440,6 @@ static int ndtest_create_region(struct ndtest_priv *p,
nd_set->altcookie = nd_set->cookie1;
ndr_desc->nd_set = nd_set;

-   if (region->type == ND_DEVICE_NAMESPACE_BLK) {
-   mappings[0].start = 0;
-   mappings[0].size = DIMM_SIZE;
-   mappings[0].nvdimm = p->config->dimms[ndimm].nvdimm;
-
-   ndr_desc->mapping = [0];
-   ndr_desc->num_mappings = 1;
-   ndr_desc->num_lanes = 1;
-   ndbr_desc.enable = ndtest_blk_region_enable;
-   ndbr_desc.do_io = ndtest_blk_do_io;
-   region->region = nvdimm_blk_region_create(p->bus, ndr_desc);
-
-   goto done;

[PATCH v3] tests/nvdimm/ndtest: Simulate nvdimm health, DSC and smart-inject

2022-04-18 Thread Shivaprasad G Bhat
The 'papr_scm' module and 'papr' implementation in libndctl supports
PDSMs for reporting PAPR NVDIMM health, dirty-shutdown-count and
injecting smart-errors. This patch adds support for those PDSMs in
ndtest module so that PDSM specific paths in libndctl can be exercised.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
---
Changelog:
Since v2:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/163454437514.431245.15482985237822269917.st...@lep8c.aus.stglabs.ibm.com/
* Made it like v1 which had the patches based on the moved header files.
  So, this patch depends on the patch moving the header files posted at -
  
https://patchwork.kernel.org/project/linux-nvdimm/patch/165025666388.2927278.9540058958498766114.st...@lep8c.aus.stglabs.ibm.com/

Since v1:
Link: https://patchwork.kernel.org/project/linux-nvdimm/list/?series=521767
* Removed the dependency on a header movement patch
 tools/testing/nvdimm/test/ndtest.c |  146 
 tools/testing/nvdimm/test/ndtest.h |7 ++
 2 files changed, 153 insertions(+)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 5eb946a02c95..b07cbdd1952d 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -50,6 +50,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1e5c75d2-b618-11ea-9aa3-507b9ddc0f72",
.physical_id = 0,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -57,6 +61,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1c4d43ac-b618-11ea-be80-507b9ddc0f72",
.physical_id = 1,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -64,6 +72,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "a9f17ffc-b618-11ea-b36d-507b9ddc0f72",
.physical_id = 2,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -71,6 +83,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "b6b83b22-b618-11ea-8aae-507b9ddc0f72",
.physical_id = 3,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -237,6 +253,103 @@ static int ndtest_get_config_size(struct ndtest_dimm 
*dimm, unsigned int buf_len
return 0;
 }
 
+static int ndtest_pdsm_health(struct ndtest_dimm *dimm,
+   union nd_pdsm_payload *payload,
+   unsigned int buf_len)
+{
+   struct nd_papr_pdsm_health *health = >health;
+
+   if (buf_len < sizeof(health))
+   return -EINVAL;
+
+   health->extension_flags = 0;
+   health->dimm_unarmed = !!(dimm->flags & PAPR_PMEM_UNARMED_MASK);
+   health->dimm_bad_shutdown = !!(dimm->flags & 
PAPR_PMEM_BAD_SHUTDOWN_MASK);
+   health->dimm_bad_restore = !!(dimm->flags & PAPR_PMEM_BAD_RESTORE_MASK);
+   health->dimm_health = PAPR_PDSM_DIMM_HEALTHY;
+
+   if (dimm->flags & PAPR_PMEM_HEALTH_FATAL)
+   health->dimm_health = PAPR_PDSM_DIMM_FATAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_CRITICAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_UNHEALTHY ||
+dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_UNHEALTHY;
+
+   health->extension_flags = 0;
+   if (dimm->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID) {
+   health->dimm_fuel_gauge = dimm->dimm_fuel_gauge;
+   health->extension_flags |= PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   }
+   if (dimm->extension_flags & PDSM_DIMM_DSC_VALID) {
+   health->dimm_dsc = dimm->dimm_dsc;
+   health->extension_flags |= PDSM_DIMM_DSC_VALID;
+   }
+
+   return 0;
+}
+
+static 

[RFC PATCH] ndtest: Make ndtest a module on its own

2022-04-17 Thread Shivaprasad G Bhat
Today ndtest module is compiled as nfit_test.ko depending on
if the CONFIG_ACPI_NFIT defined or not.

It is more advantageous to make ndtest a module on its own
so that the unit tests can be run serially on the same host
without a need for recompilation of sources.

The patch modifies the Kbuild file to take care of that.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
---
 tools/testing/nvdimm/test/Kbuild |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/nvdimm/test/Kbuild b/tools/testing/nvdimm/test/Kbuild
index 197bcb2b7f35..4d4835f73b46 100644
--- a/tools/testing/nvdimm/test/Kbuild
+++ b/tools/testing/nvdimm/test/Kbuild
@@ -2,12 +2,12 @@
 ccflags-y := -I$(srctree)/drivers/nvdimm/
 ccflags-y += -I$(srctree)/drivers/acpi/nfit/
 
-obj-m += nfit_test.o
+obj-m += ndtest.o
 obj-m += nfit_test_iomap.o
 
 ifeq  ($(CONFIG_ACPI_NFIT),m)
+   obj-m += nfit_test.o
nfit_test-y := nfit.o
-else
-   nfit_test-y := ndtest.o
 endif
+
 nfit_test_iomap-y := iomap.o




[PATCH] powerpc/papr_scm: Move duplicate definitions to common header files

2022-04-17 Thread Shivaprasad G Bhat
papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across
papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
arch independent and can run on platforms other than PPC64, a way
needs to be deviced to avoid redundancy and duplication of PDSM
structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include-
-/uapi/ to the generic include/uapi/linux directory. Also, there are
some #defines common between papr_scm and ndtest which are not exported
to the user space. So, move them to a header file which can be shared
across ndtest and papr_scm via newly introduced include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 
---
Changelog:
Since v2:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/163454440296.431294.2368481747380790011.st...@lep8c.aus.stglabs.ibm.com/
* Made it like v1, and rebased.
* Fixed repeating words in comments of the header file papr_scm.h

Since v1:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
* Removed dependency on this patch for the other patches

 MAINTAINERS   |2 
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -
 arch/powerpc/platforms/pseries/papr_scm.c |   43 
 include/linux/papr_scm.h  |   49 +
 include/uapi/linux/papr_pdsm.h|  165 +
 tools/testing/nvdimm/test/ndtest.c|2 
 tools/testing/nvdimm/test/ndtest.h|   31 -
 7 files changed, 220 insertions(+), 237 deletions(-)
 delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
 create mode 100644 include/linux/papr_scm.h
 create mode 100644 include/uapi/linux/papr_pdsm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 1699bb7cc867..03685b074dda 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11254,6 +11254,8 @@ F:  drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
 F: drivers/tty/hvc/hvc_opal.c
 F: drivers/watchdog/wdrtas.c
+F: include/linux/papr_scm.h
+F: include/uapi/linux/papr_pdsm.h
 F: tools/testing/selftests/powerpc
 N: /pmac
 N: powermac
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
deleted file mode 100644
index 17439925045c..
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ /dev/null
@@ -1,165 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
- *
- * (C) Copyright IBM 2020
- *
- * Author: Vaibhav Jain 
- */
-
-#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-
-#include 
-#include 
-
-/*
- * PDSM Envelope:
- *
- * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
- * envelope which consists of 2 headers sections and payload sections as
- * illustrated below:
- *  +-+---+---+
- *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
- *  +-+---+---+
- *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
- *  +-+---+---+
- *  | nd_family   |   |   |
- *  | nd_size_out | cmd_status|   |
- *  | nd_size_in  | reserved  | nd_pdsm_payload   |
- *  | nd_command  | payload   --> |   |
- *  | nd_fw_size  |   |   |
- *  | nd_payload ---> |   |   |
- *  +---+-+---+
- *
- * ND Header:
- * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
- * which is interpreted by libnvdimm before passed on to papr_scm. Important
- * member fields used are:
- * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM
- * 'nd_size_in': (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
- * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
- * 'nd_command' : (In) One of PAPR_PDSM_XXX
- * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
- *
- * PDSM Header:
- * This is papr-scm specific header that precedes the payload. This is defined
- * as nd_cmd_pdsm_pkg.  Following fields aare available in this header:
- *
- * 'cmd_status': (Out) Errors if any encountered while 
servicing PDSM.
- * 'reserved'  : Not used, reserved for future and should be set to 0.
- * 'payload': A union of all the possible payload structs
- *
- * PDSM Payload:
- *
- * The layout of the PDSM Payload is defined by various structs shared between
- * papr_scm and libndctl so that con

[PATCH] ndtest: Cleanup all of blk namespace specific code

2022-04-17 Thread Shivaprasad G Bhat
With the nd_namespace_blk and nd_blk_region infrastructures being removed,
the ndtest still has some references to the old code. So the
compilation fails as below,

../tools/testing/nvdimm/test/ndtest.c:204:25: error: ‘ND_DEVICE_NAMESPACE_BLK’ 
undeclared here (not in a function); did you mean ‘ND_DEVICE_NAMESPACE_IO’?
  204 | .type = ND_DEVICE_NAMESPACE_BLK,
  | ^~~
  | ND_DEVICE_NAMESPACE_IO
../tools/testing/nvdimm/test/ndtest.c: In function ‘ndtest_create_region’:
../tools/testing/nvdimm/test/ndtest.c:630:17: error: ‘ndbr_desc’ undeclared 
(first use in this function); did you mean ‘ndr_desc’?
  630 | ndbr_desc.enable = ndtest_blk_region_enable;
  | ^
  | ndr_desc
../tools/testing/nvdimm/test/ndtest.c:630:17: note: each undeclared identifier 
is reported only once for each function it appears in
../tools/testing/nvdimm/test/ndtest.c:630:36: error: ‘ndtest_blk_region_enable’ 
undeclared (first use in this function)
  630 | ndbr_desc.enable = ndtest_blk_region_enable;
  |^~~~
../tools/testing/nvdimm/test/ndtest.c:631:35: error: ‘ndtest_blk_do_io’ 
undeclared (first use in this function); did you mean ‘ndtest_blk_mmio’?
  631 | ndbr_desc.do_io = ndtest_blk_do_io;
  |   ^~~~
  |   ndtest_blk_mmio

The current patch removes the specific code to cleanup all obsolete
references.

Signed-off-by: Shivaprasad G Bhat 
---
 tools/testing/nvdimm/test/ndtest.c |   77 
 1 file changed, 77 deletions(-)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 4d1a947367f9..01ceb98c15a0 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -134,39 +134,6 @@ static struct ndtest_mapping region1_mapping[] = {
},
 };
 
-static struct ndtest_mapping region2_mapping[] = {
-   {
-   .dimm = 0,
-   .position = 0,
-   .start = 0,
-   .size = DIMM_SIZE,
-   },
-};
-
-static struct ndtest_mapping region3_mapping[] = {
-   {
-   .dimm = 1,
-   .start = 0,
-   .size = DIMM_SIZE,
-   }
-};
-
-static struct ndtest_mapping region4_mapping[] = {
-   {
-   .dimm = 2,
-   .start = 0,
-   .size = DIMM_SIZE,
-   }
-};
-
-static struct ndtest_mapping region5_mapping[] = {
-   {
-   .dimm = 3,
-   .start = 0,
-   .size = DIMM_SIZE,
-   }
-};
-
 static struct ndtest_region bus0_regions[] = {
{
.type = ND_DEVICE_NAMESPACE_PMEM,
@@ -182,34 +149,6 @@ static struct ndtest_region bus0_regions[] = {
.size = DIMM_SIZE * 2,
.range_index = 2,
},
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region2_mapping),
-   .mapping = region2_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 3,
-   },
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region3_mapping),
-   .mapping = region3_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 4,
-   },
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region4_mapping),
-   .mapping = region4_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 5,
-   },
-   {
-   .type = ND_DEVICE_NAMESPACE_BLK,
-   .num_mappings = ARRAY_SIZE(region5_mapping),
-   .mapping = region5_mapping,
-   .size = DIMM_SIZE,
-   .range_index = 6,
-   },
 };
 
 static struct ndtest_mapping region6_mapping[] = {
@@ -501,21 +440,6 @@ static int ndtest_create_region(struct ndtest_priv *p,
nd_set->altcookie = nd_set->cookie1;
ndr_desc->nd_set = nd_set;
 
-   if (region->type == ND_DEVICE_NAMESPACE_BLK) {
-   mappings[0].start = 0;
-   mappings[0].size = DIMM_SIZE;
-   mappings[0].nvdimm = p->config->dimms[ndimm].nvdimm;
-
-   ndr_desc->mapping = [0];
-   ndr_desc->num_mappings = 1;
-   ndr_desc->num_lanes = 1;
-   ndbr_desc.enable = ndtest_blk_region_enable;
-   ndbr_desc.do_io = ndtest_blk_do_io;
-   region->region = nvdimm_blk_region_create(p->bus, ndr_desc);
-
-   goto done;
-   }
-
for (i = 0; i < region->num_mappings; i++) {
ndimm = region->mapping[i].dimm;
mappings[i].start = region->map

[REPOST RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files

2021-10-18 Thread Shivaprasad G Bhat
papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across
papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
arch independent and can run on platforms other than PPC64, a way
needs to be deviced to avoid redundancy and duplication of PDSM
structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include-
-/uapi/ to the generic include/uapi/linux directory. Also, there are
some #defines common between papr_scm and ndtest which are not exported
to the user space. So, move them to a header file which can be shared
across ndtest and papr_scm via newly introduced include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 
---
Changelog:

Since v1:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
* Removed dependency on this patch for the other patches

 MAINTAINERS   |2 
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -
 arch/powerpc/platforms/pseries/papr_scm.c |   43 
 include/linux/papr_scm.h  |   48 
 include/uapi/linux/papr_pdsm.h|  165 +
 tools/testing/nvdimm/test/ndtest.c|2 
 tools/testing/nvdimm/test/ndtest.h|  120 -
 7 files changed, 219 insertions(+), 326 deletions(-)
 delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
 create mode 100644 include/linux/papr_scm.h
 create mode 100644 include/uapi/linux/papr_pdsm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 5b33791bb8e97..3e78f595f3d91 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10746,6 +10746,8 @@ F:  drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
 F: drivers/tty/hvc/hvc_opal.c
 F: drivers/watchdog/wdrtas.c
+F: include/linux/papr_scm.h
+F: include/uapi/linux/papr_pdsm.h
 F: tools/testing/selftests/powerpc
 N: /pmac
 N: powermac
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
deleted file mode 100644
index 17439925045cf..0
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ /dev/null
@@ -1,165 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
- *
- * (C) Copyright IBM 2020
- *
- * Author: Vaibhav Jain 
- */
-
-#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-
-#include 
-#include 
-
-/*
- * PDSM Envelope:
- *
- * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
- * envelope which consists of 2 headers sections and payload sections as
- * illustrated below:
- *  +-+---+---+
- *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
- *  +-+---+---+
- *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
- *  +-+---+---+
- *  | nd_family   |   |   |
- *  | nd_size_out | cmd_status|   |
- *  | nd_size_in  | reserved  | nd_pdsm_payload   |
- *  | nd_command  | payload   --> |   |
- *  | nd_fw_size  |   |   |
- *  | nd_payload ---> |   |   |
- *  +---+-+---+
- *
- * ND Header:
- * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
- * which is interpreted by libnvdimm before passed on to papr_scm. Important
- * member fields used are:
- * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM
- * 'nd_size_in': (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
- * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
- * 'nd_command' : (In) One of PAPR_PDSM_XXX
- * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
- *
- * PDSM Header:
- * This is papr-scm specific header that precedes the payload. This is defined
- * as nd_cmd_pdsm_pkg.  Following fields aare available in this header:
- *
- * 'cmd_status': (Out) Errors if any encountered while 
servicing PDSM.
- * 'reserved'  : Not used, reserved for future and should be set to 0.
- * 'payload': A union of all the possible payload structs
- *
- * PDSM Payload:
- *
- * The layout of the PDSM Payload is defined by various structs shared between
- * papr_scm and libndctl so that contents of payload can be interpreted. As 
such
- * its defined as a union of all possible payload structs as
- * 'union nd_pdsm_payload'. Based on the value of 'nd_cmd_pkg.nd_command'
- * appropriate member of the u

[REPOST PATCH v2] tests/nvdimm/ndtest: Simulate nvdimm health, DSC and smart-inject

2021-10-18 Thread Shivaprasad G Bhat
The 'papr_scm' module and 'papr' implementation in libndctl supports
PDSMs for reporting PAPR NVDIMM health, dirty-shutdown-count and
injecting smart-errors. This patch adds support for those PDSMs in
ndtest module so that PDSM specific paths in libndctl can be exercised.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
---
Changelog:
Since v1:
Link: https://patchwork.kernel.org/project/linux-nvdimm/list/?series=521767
* Removed the dependency on a header movement patch

 tools/testing/nvdimm/test/ndtest.c |  148 
 tools/testing/nvdimm/test/ndtest.h |   96 +++
 2 files changed, 244 insertions(+)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 6862915f1fb0c..45d42cd25e82f 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -48,6 +48,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1e5c75d2-b618-11ea-9aa3-507b9ddc0f72",
.physical_id = 0,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -55,6 +59,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1c4d43ac-b618-11ea-be80-507b9ddc0f72",
.physical_id = 1,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -62,6 +70,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "a9f17ffc-b618-11ea-b36d-507b9ddc0f72",
.physical_id = 2,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -69,6 +81,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "b6b83b22-b618-11ea-8aae-507b9ddc0f72",
.physical_id = 3,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -296,6 +312,103 @@ static int ndtest_get_config_size(struct ndtest_dimm 
*dimm, unsigned int buf_len
return 0;
 }
 
+static int ndtest_pdsm_health(struct ndtest_dimm *dimm,
+   union nd_pdsm_payload *payload,
+   unsigned int buf_len)
+{
+   struct nd_papr_pdsm_health *health = >health;
+
+   if (buf_len < sizeof(health))
+   return -EINVAL;
+
+   health->extension_flags = 0;
+   health->dimm_unarmed = !!(dimm->flags & PAPR_PMEM_UNARMED_MASK);
+   health->dimm_bad_shutdown = !!(dimm->flags & 
PAPR_PMEM_BAD_SHUTDOWN_MASK);
+   health->dimm_bad_restore = !!(dimm->flags & PAPR_PMEM_BAD_RESTORE_MASK);
+   health->dimm_health = PAPR_PDSM_DIMM_HEALTHY;
+
+   if (dimm->flags & PAPR_PMEM_HEALTH_FATAL)
+   health->dimm_health = PAPR_PDSM_DIMM_FATAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_CRITICAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_UNHEALTHY ||
+dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_UNHEALTHY;
+
+   health->extension_flags = 0;
+   if (dimm->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID) {
+   health->dimm_fuel_gauge = dimm->dimm_fuel_gauge;
+   health->extension_flags |= PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   }
+   if (dimm->extension_flags & PDSM_DIMM_DSC_VALID) {
+   health->dimm_dsc = dimm->dimm_dsc;
+   health->extension_flags |= PDSM_DIMM_DSC_VALID;
+   }
+
+   return 0;
+}
+
+static void smart_notify(struct ndtest_dimm *dimm)
+{
+   struct device *bus = dimm->dev->parent;
+
+   if (!(dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL) ||
+   (dimm->flags & PAPR_PMEM_BAD_SHUTDOWN_MASK)) {
+   device_lock(bus);
+   /* send smart notification */
+   if (dimm->notify_handle)
+   sysfs_notify_dir

[REPORT PATCH v2] powerpc/papr_scm: Implement initial support for injecting smart errors

2021-10-18 Thread Shivaprasad G Bhat
From: Vaibhav Jain 

Presently PAPR doesn't support injecting smart errors on an
NVDIMM. This makes testing the NVDIMM health reporting functionality
difficult as simulating NVDIMM health related events need a hacked up
qemu version.

To solve this problem this patch proposes simulating certain set of
NVDIMM health related events in papr_scm. Specifically 'fatal' health
state and 'dirty' shutdown state. These error can be injected via the
user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
corresponding ndctl patches following command flow is expected:

$ sudo ndctl list -DH -d nmem0
...
  "health_state":"ok",
  "shutdown_state":"clean",
...
 # inject unsafe shutdown and fatal health error
$ sudo ndctl inject-smart nmem0 -Uf
...
  "health_state":"fatal",
  "shutdown_state":"dirty",
...
 # uninject all errors
$ sudo ndctl inject-smart nmem0 -N
...
  "health_state":"ok",
  "shutdown_state":"clean",
...

The patch adds two members 'health_bitmap_mask' and
'health_bitmap_override' inside struct papr_scm_priv which are then
bit blt'ed to the health bitmaps fetched from the hypervisor. In case
we are not able to fetch health information from the hypervisor we
service the health bitmap from these two members. These members are
accessible from sysfs at nmemX/papr/health_bitmap_override

A new PDSM named 'SMART_INJECT' is proposed that accepts newly
introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
exchanged between libndctl and papr_scm to indicate the requested
smart-error states.

When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
constructs a pair or 'mask' and 'override' bitmaps from the payload
and bit-blt it to the 'health_bitmap_{mask, override}' members. This
ensures the after being fetched from the hypervisor, the health_bitmap
reflects requested smart-error states.

Signed-off-by: Vaibhav Jain 
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:

Since v1:
Link: https://patchwork.kernel.org/project/linux-nvdimm/list/?series=513881
* Updated the patch description.
* Removed dependency of a header movement patch.
* Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' [Aneesh]

 arch/powerpc/include/uapi/asm/papr_pdsm.h |   18 ++
 arch/powerpc/platforms/pseries/papr_scm.c |   94 -
 2 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 82488b1e7276e..17439925045cf 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -116,6 +116,22 @@ struct nd_papr_pdsm_health {
};
 };
 
+/* Flags for injecting specific smart errors */
+#define PDSM_SMART_INJECT_HEALTH_FATAL (1 << 0)
+#define PDSM_SMART_INJECT_BAD_SHUTDOWN (1 << 1)
+
+struct nd_papr_pdsm_smart_inject {
+   union {
+   struct {
+   /* One or more of PDSM_SMART_INJECT_ */
+   __u32 flags;
+   __u8 fatal_enable;
+   __u8 unsafe_shutdown_enable;
+   };
+   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+   };
+};
+
 /*
  * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel
  * via 'nd_cmd_pkg.nd_command' member of the ioctl struct
@@ -123,12 +139,14 @@ struct nd_papr_pdsm_health {
 enum papr_pdsm {
PAPR_PDSM_MIN = 0x0,
PAPR_PDSM_HEALTH,
+   PAPR_PDSM_SMART_INJECT,
PAPR_PDSM_MAX,
 };
 
 /* Maximal union that can hold all possible payload types */
 union nd_pdsm_payload {
struct nd_papr_pdsm_health health;
+   struct nd_papr_pdsm_smart_inject smart_inject;
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
 } __packed;
 
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f48e87ac89c9b..de4cf329cfb3b 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -68,6 +68,10 @@
 #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
+/* Use bitblt method to override specific bits in the '_bitmap_' */
+#define BITBLT_BITMAP(_bitmap_, _mask_, _override_)\
+   (((_bitmap_) & ~(_mask_)) | ((_mask_) & (_override_)))
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
u8 stat_id[8];
@@ -120,6 +124,12 @@ struct papr_scm_priv {
 
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
+
+   /* The bits which needs to be overridden */
+   u64 health_bitmap_mask;
+
+   /* The overridden values for the bits having the masks set */
+   u64 health_bitmap_override;
 };
 
 static int papr_scm_pmem_flush(struct nd_regio

[RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files

2021-09-06 Thread Shivaprasad G Bhat
papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across papr_pdsm.h
and ndtest.h header files. Since 'ndtest' is essentially arch independent and 
can
run on platforms other than PPC64, a way needs to be deviced to avoid redundancy
and duplication of PDSM structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/ to
the generic include/uapi/linux directory. Also, there are some #defines common
between papr_scm and ndtest which are not exported to the user space. So, move
them to a header file which can be shared across ndtest and papr_scm via newly
introduced include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 
---
Changelog:

Since v1:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
* Removed dependency on this patch for the other patches

 MAINTAINERS   |2 
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -
 arch/powerpc/platforms/pseries/papr_scm.c |   43 
 include/linux/papr_scm.h  |   48 
 include/uapi/linux/papr_pdsm.h|  165 +
 tools/testing/nvdimm/test/ndtest.c|2 
 tools/testing/nvdimm/test/ndtest.h|  120 -
 7 files changed, 219 insertions(+), 326 deletions(-)
 delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
 create mode 100644 include/linux/papr_scm.h
 create mode 100644 include/uapi/linux/papr_pdsm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 6c8be735cc91..03fe0c77cefa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10661,6 +10661,8 @@ F:  drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
 F: drivers/tty/hvc/hvc_opal.c
 F: drivers/watchdog/wdrtas.c
+F: include/linux/papr_scm.h
+F: include/uapi/linux/papr_pdsm.h
 F: tools/testing/selftests/powerpc
 N: /pmac
 N: powermac
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
deleted file mode 100644
index 17439925045c..
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ /dev/null
@@ -1,165 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
- *
- * (C) Copyright IBM 2020
- *
- * Author: Vaibhav Jain 
- */
-
-#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-
-#include 
-#include 
-
-/*
- * PDSM Envelope:
- *
- * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
- * envelope which consists of 2 headers sections and payload sections as
- * illustrated below:
- *  +-+---+---+
- *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
- *  +-+---+---+
- *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
- *  +-+---+---+
- *  | nd_family   |   |   |
- *  | nd_size_out | cmd_status|   |
- *  | nd_size_in  | reserved  | nd_pdsm_payload   |
- *  | nd_command  | payload   --> |   |
- *  | nd_fw_size  |   |   |
- *  | nd_payload ---> |   |   |
- *  +---+-+---+
- *
- * ND Header:
- * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
- * which is interpreted by libnvdimm before passed on to papr_scm. Important
- * member fields used are:
- * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM
- * 'nd_size_in': (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
- * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
- * 'nd_command' : (In) One of PAPR_PDSM_XXX
- * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
- *
- * PDSM Header:
- * This is papr-scm specific header that precedes the payload. This is defined
- * as nd_cmd_pdsm_pkg.  Following fields aare available in this header:
- *
- * 'cmd_status': (Out) Errors if any encountered while 
servicing PDSM.
- * 'reserved'  : Not used, reserved for future and should be set to 0.
- * 'payload': A union of all the possible payload structs
- *
- * PDSM Payload:
- *
- * The layout of the PDSM Payload is defined by various structs shared between
- * papr_scm and libndctl so that contents of payload can be interpreted. As 
such
- * its defined as a union of all possible payload structs as
- * 'union nd_pdsm_payload'. Based on the value of 'nd_cmd_pkg.nd_command'
- * appropriate member of the union is accessed

[PATCH v2] tests/nvdimm/ndtest: Simulate nvdimm health, DSC and smart-inject

2021-09-06 Thread Shivaprasad G Bhat
The 'papr_scm' module and 'papr' implementation in libndctl supports
PDSMs for reporting PAPR NVDIMM health, dirty-shutdown-count and
injecting smart-errors. This patch adds support for those PDSMs in
ndtest module so that PDSM specific paths in libndctl can be exercised.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
---
Changelog:

Since v1:
Link: https://patchwork.kernel.org/project/linux-nvdimm/list/?series=521767
* Removed the dependency on a header movement patch

 tools/testing/nvdimm/test/ndtest.c |  148 
 tools/testing/nvdimm/test/ndtest.h |   96 +++
 2 files changed, 244 insertions(+)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 6862915f1fb0..45d42cd25e82 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -48,6 +48,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1e5c75d2-b618-11ea-9aa3-507b9ddc0f72",
.physical_id = 0,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -55,6 +59,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1c4d43ac-b618-11ea-be80-507b9ddc0f72",
.physical_id = 1,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -62,6 +70,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "a9f17ffc-b618-11ea-b36d-507b9ddc0f72",
.physical_id = 2,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -69,6 +81,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "b6b83b22-b618-11ea-8aae-507b9ddc0f72",
.physical_id = 3,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -296,6 +312,103 @@ static int ndtest_get_config_size(struct ndtest_dimm 
*dimm, unsigned int buf_len
return 0;
 }
 
+static int ndtest_pdsm_health(struct ndtest_dimm *dimm,
+   union nd_pdsm_payload *payload,
+   unsigned int buf_len)
+{
+   struct nd_papr_pdsm_health *health = >health;
+
+   if (buf_len < sizeof(health))
+   return -EINVAL;
+
+   health->extension_flags = 0;
+   health->dimm_unarmed = !!(dimm->flags & PAPR_PMEM_UNARMED_MASK);
+   health->dimm_bad_shutdown = !!(dimm->flags & 
PAPR_PMEM_BAD_SHUTDOWN_MASK);
+   health->dimm_bad_restore = !!(dimm->flags & PAPR_PMEM_BAD_RESTORE_MASK);
+   health->dimm_health = PAPR_PDSM_DIMM_HEALTHY;
+
+   if (dimm->flags & PAPR_PMEM_HEALTH_FATAL)
+   health->dimm_health = PAPR_PDSM_DIMM_FATAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_CRITICAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_UNHEALTHY ||
+dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_UNHEALTHY;
+
+   health->extension_flags = 0;
+   if (dimm->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID) {
+   health->dimm_fuel_gauge = dimm->dimm_fuel_gauge;
+   health->extension_flags |= PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   }
+   if (dimm->extension_flags & PDSM_DIMM_DSC_VALID) {
+   health->dimm_dsc = dimm->dimm_dsc;
+   health->extension_flags |= PDSM_DIMM_DSC_VALID;
+   }
+
+   return 0;
+}
+
+static void smart_notify(struct ndtest_dimm *dimm)
+{
+   struct device *bus = dimm->dev->parent;
+
+   if (!(dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL) ||
+   (dimm->flags & PAPR_PMEM_BAD_SHUTDOWN_MASK)) {
+   device_lock(bus);
+   /* send smart notification */
+   if (dimm->notify_handle)
+   sysfs_notify_dir

[PATCH v2] powerpc/papr_scm: Implement initial support for injecting smart errors

2021-09-06 Thread Shivaprasad G Bhat
From: Vaibhav Jain 

Presently PAPR doesn't support injecting smart errors on an
NVDIMM. This makes testing the NVDIMM health reporting functionality
difficult as simulating NVDIMM health related events need a hacked up
qemu version.

To solve this problem this patch proposes simulating certain set of
NVDIMM health related events in papr_scm. Specifically 'fatal' health
state and 'dirty' shutdown state. These error can be injected via the
user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
corresponding ndctl patches following command flow is expected:

$ sudo ndctl list -DH -d nmem0
...
  "health_state":"ok",
  "shutdown_state":"clean",
...
 # inject unsafe shutdown and fatal health error
$ sudo ndctl inject-smart nmem0 -Uf
...
  "health_state":"fatal",
  "shutdown_state":"dirty",
...
 # uninject all errors
$ sudo ndctl inject-smart nmem0 -N
...
  "health_state":"ok",
  "shutdown_state":"clean",
...

The patch adds two members 'health_bitmap_mask' and
'health_bitmap_override' inside struct papr_scm_priv which are then
bit blt'ed to the health bitmaps fetched from the hypervisor. In case
we are not able to fetch health information from the hypervisor we
service the health bitmap from these two members. These members are
accessible from sysfs at nmemX/papr/health_bitmap_override

A new PDSM named 'SMART_INJECT' is proposed that accepts newly
introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
exchanged between libndctl and papr_scm to indicate the requested
smart-error states.

When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
constructs a pair or 'mask' and 'override' bitmaps from the payload
and bit-blt it to the 'health_bitmap_{mask, override}' members. This
ensures the after being fetched from the hypervisor, the health_bitmap
reflects requested smart-error states.

Signed-off-by: Vaibhav Jain 
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:

Since v1:
Link: https://patchwork.kernel.org/project/linux-nvdimm/list/?series=513881
* Updated the patch description.
* Removed dependency of a header movement patch.
* Removed '__packed' attribute for 'struct nd_papr_pdsm_smart_inject' [Aneesh]

 arch/powerpc/include/uapi/asm/papr_pdsm.h |   18 ++
 arch/powerpc/platforms/pseries/papr_scm.c |   94 -
 2 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 82488b1e7276..17439925045c 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -116,6 +116,22 @@ struct nd_papr_pdsm_health {
};
 };
 
+/* Flags for injecting specific smart errors */
+#define PDSM_SMART_INJECT_HEALTH_FATAL (1 << 0)
+#define PDSM_SMART_INJECT_BAD_SHUTDOWN (1 << 1)
+
+struct nd_papr_pdsm_smart_inject {
+   union {
+   struct {
+   /* One or more of PDSM_SMART_INJECT_ */
+   __u32 flags;
+   __u8 fatal_enable;
+   __u8 unsafe_shutdown_enable;
+   };
+   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
+   };
+};
+
 /*
  * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel
  * via 'nd_cmd_pkg.nd_command' member of the ioctl struct
@@ -123,12 +139,14 @@ struct nd_papr_pdsm_health {
 enum papr_pdsm {
PAPR_PDSM_MIN = 0x0,
PAPR_PDSM_HEALTH,
+   PAPR_PDSM_SMART_INJECT,
PAPR_PDSM_MAX,
 };
 
 /* Maximal union that can hold all possible payload types */
 union nd_pdsm_payload {
struct nd_papr_pdsm_health health;
+   struct nd_papr_pdsm_smart_inject smart_inject;
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
 } __packed;
 
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index f48e87ac89c9..de4cf329cfb3 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -68,6 +68,10 @@
 #define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
 #define PAPR_SCM_PERF_STATS_VERSION 0x1
 
+/* Use bitblt method to override specific bits in the '_bitmap_' */
+#define BITBLT_BITMAP(_bitmap_, _mask_, _override_)\
+   (((_bitmap_) & ~(_mask_)) | ((_mask_) & (_override_)))
+
 /* Struct holding a single performance metric */
 struct papr_scm_perf_stat {
u8 stat_id[8];
@@ -120,6 +124,12 @@ struct papr_scm_priv {
 
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
+
+   /* The bits which needs to be overridden */
+   u64 health_bitmap_mask;
+
+   /* The overridden values for the bits having the masks set */
+   u64 health_bitmap_override;
 };
 
 static int papr_scm_pmem_flush(struct nd_region *nd_region,

[PATCH] tests/nvdimm/ndtest: Simulate nvdimm health, DSC and smart-inject

2021-07-26 Thread Shivaprasad G Bhat
The 'papr_scm' module and 'papr' implementation in libndctl supports
PDSMs for reporting PAPR NVDIMM health, its dirty-shutdown-count and
injecting smart-error. This patch adds support for those PDSMs in
ndtest module so that PDSM specific paths in libndctl can be exercised.

Signed-off-by: Shivaprasad G Bhat 
---
The patch depends on the PAPR PDSM smart-inject payload definitions
added with the patch - 
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg191337.html

 tools/testing/nvdimm/test/ndtest.c |  149 
 tools/testing/nvdimm/test/ndtest.h |7 ++
 2 files changed, 156 insertions(+)

diff --git a/tools/testing/nvdimm/test/ndtest.c 
b/tools/testing/nvdimm/test/ndtest.c
index 00ec2c213061..6622e8adbd11 100644
--- a/tools/testing/nvdimm/test/ndtest.c
+++ b/tools/testing/nvdimm/test/ndtest.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../watermark.h"
 #include "nfit_test.h"
@@ -49,6 +50,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1e5c75d2-b618-11ea-9aa3-507b9ddc0f72",
.physical_id = 0,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -56,6 +61,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "1c4d43ac-b618-11ea-be80-507b9ddc0f72",
.physical_id = 1,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -63,6 +72,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "a9f17ffc-b618-11ea-b36d-507b9ddc0f72",
.physical_id = 2,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -70,6 +83,10 @@ static struct ndtest_dimm dimm_group1[] = {
.uuid_str = "b6b83b22-b618-11ea-8aae-507b9ddc0f72",
.physical_id = 3,
.num_formats = 2,
+   .flags = PAPR_PMEM_HEALTH_NON_CRITICAL,
+   .extension_flags = PDSM_DIMM_DSC_VALID | 
PDSM_DIMM_HEALTH_RUN_GAUGE_VALID,
+   .dimm_fuel_gauge = 95,
+   .dimm_dsc = 42,
},
{
.size = DIMM_SIZE,
@@ -297,6 +314,103 @@ static int ndtest_get_config_size(struct ndtest_dimm 
*dimm, unsigned int buf_len
return 0;
 }
 
+static int ndtest_pdsm_health(struct ndtest_dimm *dimm,
+   union nd_pdsm_payload *payload,
+   unsigned int buf_len)
+{
+   struct nd_papr_pdsm_health *health = >health;
+
+   if (buf_len < sizeof(health))
+   return -EINVAL;
+
+   health->extension_flags = 0;
+   health->dimm_unarmed = !!(dimm->flags & PAPR_PMEM_UNARMED_MASK);
+   health->dimm_bad_shutdown = !!(dimm->flags & 
PAPR_PMEM_BAD_SHUTDOWN_MASK);
+   health->dimm_bad_restore = !!(dimm->flags & PAPR_PMEM_BAD_RESTORE_MASK);
+   health->dimm_health = PAPR_PDSM_DIMM_HEALTHY;
+
+   if (dimm->flags & PAPR_PMEM_HEALTH_FATAL)
+   health->dimm_health = PAPR_PDSM_DIMM_FATAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_CRITICAL;
+   else if (dimm->flags & PAPR_PMEM_HEALTH_UNHEALTHY ||
+dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL)
+   health->dimm_health = PAPR_PDSM_DIMM_UNHEALTHY;
+
+   health->extension_flags = 0;
+   if (dimm->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID) {
+   health->dimm_fuel_gauge = dimm->dimm_fuel_gauge;
+   health->extension_flags |= PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   }
+   if (dimm->extension_flags & PDSM_DIMM_DSC_VALID) {
+   health->dimm_dsc = dimm->dimm_dsc;
+   health->extension_flags |= PDSM_DIMM_DSC_VALID;
+   }
+
+   return 0;
+}
+
+static void smart_notify(struct ndtest_dimm *dimm)
+{
+   struct device *bus = dimm->dev->parent;
+
+   if (!(dimm->flags & PAPR_PMEM_HEALTH_NON_CRITICAL) ||
+   (dimm->flags & PAPR_PMEM_BAD_SHUTDOWN_MASK)) {
+   device_lock(bus);
+   /* se

[PATCH] powerpc/papr_scm: Move duplicate definitions to common header files

2021-06-30 Thread Shivaprasad G Bhat
The papr_scm uses PDSM structures like nd_papr_pdsm_health as the
PDSM payload which would be reused by ndtest. Since ndtest is arch
independent, move the PDSM header from arch/powerpc/include/uapi/
to the generic include/uapi/linux directory.

Also, there are some #defines common between papr_scm and ndtest
which are not exported to the user space. So, move them to a header
file which can be shared across ndtest and papr_scm via newly
introduced include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Suggested-by: "Aneesh Kumar K.V" 
---
 MAINTAINERS   |2 
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  147 -
 arch/powerpc/platforms/pseries/papr_scm.c |   43 
 include/linux/papr_scm.h  |   48 +
 include/uapi/linux/papr_pdsm.h|  147 +
 tools/testing/nvdimm/test/ndtest.c|1 
 tools/testing/nvdimm/test/ndtest.h|   31 --
 7 files changed, 200 insertions(+), 219 deletions(-)
 delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
 create mode 100644 include/linux/papr_scm.h
 create mode 100644 include/uapi/linux/papr_pdsm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 008fcad7ac008..5d2d8a6b5eb53 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10494,6 +10494,8 @@ F:  drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
 F: drivers/tty/hvc/hvc_opal.c
 F: drivers/watchdog/wdrtas.c
+F: include/linux/papr_scm.h
+F: include/uapi/linux/papr_pdsm.h
 F: tools/testing/selftests/powerpc
 N: /pmac
 N: powermac
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
deleted file mode 100644
index 82488b1e7276e..0
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ /dev/null
@@ -1,147 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
- *
- * (C) Copyright IBM 2020
- *
- * Author: Vaibhav Jain 
- */
-
-#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-
-#include 
-#include 
-
-/*
- * PDSM Envelope:
- *
- * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
- * envelope which consists of 2 headers sections and payload sections as
- * illustrated below:
- *  +-+---+---+
- *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
- *  +-+---+---+
- *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
- *  +-+---+---+
- *  | nd_family   |   |   |
- *  | nd_size_out | cmd_status|   |
- *  | nd_size_in  | reserved  | nd_pdsm_payload   |
- *  | nd_command  | payload   --> |   |
- *  | nd_fw_size  |   |   |
- *  | nd_payload ---> |   |   |
- *  +---+-+---+
- *
- * ND Header:
- * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
- * which is interpreted by libnvdimm before passed on to papr_scm. Important
- * member fields used are:
- * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM
- * 'nd_size_in': (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
- * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
- * 'nd_command' : (In) One of PAPR_PDSM_XXX
- * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
- *
- * PDSM Header:
- * This is papr-scm specific header that precedes the payload. This is defined
- * as nd_cmd_pdsm_pkg.  Following fields aare available in this header:
- *
- * 'cmd_status': (Out) Errors if any encountered while 
servicing PDSM.
- * 'reserved'  : Not used, reserved for future and should be set to 0.
- * 'payload': A union of all the possible payload structs
- *
- * PDSM Payload:
- *
- * The layout of the PDSM Payload is defined by various structs shared between
- * papr_scm and libndctl so that contents of payload can be interpreted. As 
such
- * its defined as a union of all possible payload structs as
- * 'union nd_pdsm_payload'. Based on the value of 'nd_cmd_pkg.nd_command'
- * appropriate member of the union is accessed.
- */
-
-/* Max payload size that we can handle */
-#define ND_PDSM_PAYLOAD_MAX_SIZE 184
-
-/* Max payload size that we can handle */
-#define ND_PDSM_HDR_SIZE \
-   (sizeof(struct nd_pkg_pdsm) - ND_PDSM_PAYLOAD_MAX_SIZE)
-
-/* Various nvdimm health indicators */
-#define PAPR_PDSM_DIMM_HEALTHY   0
-#define PAPR_PDSM_DIMM_UNHEALTHY 1
-#define PAPR_PDSM_DIMM_CRITICAL  2
-#define PAPR_PDSM_DIMM_FATAL 3
-
-/* struct nd_papr_pdsm_health.exten

[PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-03-29 Thread Shivaprasad G Bhat
Add support for ND_REGION_ASYNC capability if the device tree
indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.

If the flush request failed, the hypervisor is expected to
to reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.

This patch prevents mmap of namespaces with MAP_SYNC flag if the
nvdimm requires an explicit flush[1].

References:
[1] 
https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c

Signed-off-by: Shivaprasad G Bhat 
---
v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
Changes from v2:
   - Fixed the commit message.
   - Add dev_dbg before the H_SCM_FLUSH hcall

v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
Changes from v1:
   - Hcall semantics finalized, all changes are to accomodate them.

 Documentation/powerpc/papr_hcalls.rst |   14 ++
 arch/powerpc/include/asm/hvcall.h |3 +-
 arch/powerpc/platforms/pseries/papr_scm.c |   40 +
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/powerpc/papr_hcalls.rst 
b/Documentation/powerpc/papr_hcalls.rst
index 48fcf1255a33..648f278eea8f 100644
--- a/Documentation/powerpc/papr_hcalls.rst
+++ b/Documentation/powerpc/papr_hcalls.rst
@@ -275,6 +275,20 @@ Health Bitmap Flags:
 Given a DRC Index collect the performance statistics for NVDIMM and copy them
 to the resultBuffer.
 
+**H_SCM_FLUSH**
+
+| Input: *drcIndex, continue-token*
+| Out: *continue-token*
+| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
+
+Given a DRC Index Flush the data to backend NVDIMM device.
+
+The hcall returns H_BUSY when the flush takes longer time and the hcall needs
+to be issued multiple times in order to be completely serviced. The
+*continue-token* from the output to be passed in the argument list of
+subsequent hcalls to the hypervisor until the hcall is completely serviced
+at which point H_SUCCESS or other error is returned by the hypervisor.
+
 References
 ==
 .. [1] "Power Architecture Platform Reference"
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index ed6086d57b22..9f7729a97ebd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -315,7 +315,8 @@
 #define H_SCM_HEALTH0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
-#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
+#define H_SCM_FLUSH0x44C
+#define MAX_HCALL_OPCODE   H_SCM_FLUSH
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..b7a47fcc5aa5 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -93,6 +93,7 @@ struct papr_scm_priv {
uint64_t block_size;
int metadata_size;
bool is_volatile;
+   bool hcall_flush_required;
 
uint64_t bound_addr;
 
@@ -117,6 +118,39 @@ struct papr_scm_priv {
size_t stat_buffer_len;
 };
 
+static int papr_scm_pmem_flush(struct nd_region *nd_region,
+  struct bio *bio __maybe_unused)
+{
+   struct papr_scm_priv *p = nd_region_provider_data(nd_region);
+   unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
+   uint64_t token = 0;
+   int64_t rc;
+
+   dev_dbg(>pdev->dev, "flush drc 0x%x", p->drc_index);
+
+   do {
+   rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);
+   token = ret_buf[0];
+
+   /* Check if we are stalled for some time */
+   if (H_IS_LONG_BUSY(rc)) {
+   msleep(get_longbusy_msecs(rc));
+   rc = H_BUSY;
+   } else if (rc == H_BUSY) {
+   cond_resched();
+   }
+   } while (rc == H_BUSY);
+
+   if (rc) {
+   dev_err(>pdev->dev, "flush error: %lld", rc);
+   rc = -EIO;
+   } else {
+   dev_dbg(>pdev->dev, "flush drc 0x%x complete", p->drc_index);
+   }
+
+   return rc;
+}
+
 static LIST_HEAD(papr_nd_regions);
 static DEFINE_MUTEX(papr_ndr_lock);
 
@@ -943,6 +977,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.num_mappings = 1;
ndr_desc.nd_set = >nd_set;
 
+   if (p->hcall_flush_required) {
+   set_bit(ND_REGION_ASYNC, _desc.flags);
+   ndr_desc.flush = papr_scm_pmem_flush;
+   }
+
if (p->is_volatile)
p->region = nvdimm_volatile_region_create(p->bus, _desc);
else {
@@ -1088,6 +1127,7 @@ static int papr_scm_probe(struct platform_device *pdev)
p->block_size = block_size;
p->blocks = blocks;
  

[PATCH v2] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-03-23 Thread Shivaprasad G Bhat
Add support for ND_REGION_ASYNC capability if the device tree
indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.

If the flush request failed, the hypervisor is expected to
to reflect the problem in the subsequent dimm health request call.

This patch prevents mmap of namespaces with MAP_SYNC flag if the
nvdimm requires explicit flush[1].

References:
[1] 
https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c

Signed-off-by: Shivaprasad G Bhat 
---
v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
Changes from v1:
   - Hcall semantics finalized, all changes are to accomodate them.

 Documentation/powerpc/papr_hcalls.rst |   14 ++
 arch/powerpc/include/asm/hvcall.h |3 +-
 arch/powerpc/platforms/pseries/papr_scm.c |   39 +
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/Documentation/powerpc/papr_hcalls.rst 
b/Documentation/powerpc/papr_hcalls.rst
index 48fcf1255a33..648f278eea8f 100644
--- a/Documentation/powerpc/papr_hcalls.rst
+++ b/Documentation/powerpc/papr_hcalls.rst
@@ -275,6 +275,20 @@ Health Bitmap Flags:
 Given a DRC Index collect the performance statistics for NVDIMM and copy them
 to the resultBuffer.
 
+**H_SCM_FLUSH**
+
+| Input: *drcIndex, continue-token*
+| Out: *continue-token*
+| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
+
+Given a DRC Index Flush the data to backend NVDIMM device.
+
+The hcall returns H_BUSY when the flush takes longer time and the hcall needs
+to be issued multiple times in order to be completely serviced. The
+*continue-token* from the output to be passed in the argument list of
+subsequent hcalls to the hypervisor until the hcall is completely serviced
+at which point H_SUCCESS or other error is returned by the hypervisor.
+
 References
 ==
 .. [1] "Power Architecture Platform Reference"
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index ed6086d57b22..9f7729a97ebd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -315,7 +315,8 @@
 #define H_SCM_HEALTH0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
-#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
+#define H_SCM_FLUSH0x44C
+#define MAX_HCALL_OPCODE   H_SCM_FLUSH
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..f0407e135410 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -93,6 +93,7 @@ struct papr_scm_priv {
uint64_t block_size;
int metadata_size;
bool is_volatile;
+   bool hcall_flush_required;
 
uint64_t bound_addr;
 
@@ -117,6 +118,38 @@ struct papr_scm_priv {
size_t stat_buffer_len;
 };
 
+static int papr_scm_pmem_flush(struct nd_region *nd_region,
+  struct bio *bio __maybe_unused)
+{
+   struct papr_scm_priv *p = nd_region_provider_data(nd_region);
+   unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
+   uint64_t token = 0;
+   int64_t rc;
+
+   do {
+   rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);
+   token = ret_buf[0];
+
+   /* Check if we are stalled for some time */
+   if (H_IS_LONG_BUSY(rc)) {
+   msleep(get_longbusy_msecs(rc));
+   rc = H_BUSY;
+   } else if (rc == H_BUSY) {
+   cond_resched();
+   }
+
+   } while (rc == H_BUSY);
+
+   if (rc) {
+   dev_err(>pdev->dev, "flush error: %lld", rc);
+   rc = -EIO;
+   } else {
+   dev_dbg(>pdev->dev, "flush drc 0x%x complete", p->drc_index);
+   }
+
+   return rc;
+}
+
 static LIST_HEAD(papr_nd_regions);
 static DEFINE_MUTEX(papr_ndr_lock);
 
@@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.num_mappings = 1;
ndr_desc.nd_set = >nd_set;
 
+   if (p->hcall_flush_required) {
+   set_bit(ND_REGION_ASYNC, _desc.flags);
+   ndr_desc.flush = papr_scm_pmem_flush;
+   }
+
if (p->is_volatile)
p->region = nvdimm_volatile_region_create(p->bus, _desc);
else {
@@ -1088,6 +1126,7 @@ static int papr_scm_probe(struct platform_device *pdev)
p->block_size = block_size;
p->blocks = blocks;
p->is_volatile = !of_property_read_bool(dn, "ibm,cache-flush-required");
+   p->hcall_flush_required = of_property_read_bool(dn, 
"ibm,hcall-flush-required");
 
/* We just need to ensure that set cookies are unique across */
uuid_parse(uuid_str, (uuid_t *) uuid);




Re: [RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level

2021-03-23 Thread Shivaprasad G Bhat

Hi David,

Sorry about the delay.

On 2/8/21 11:51 AM, David Gibson wrote:

On Tue, Jan 19, 2021 at 12:40:31PM +0530, Shivaprasad G Bhat wrote:

Thanks for the comments!


On 12/28/20 2:08 PM, David Gibson wrote:


On Mon, Dec 21, 2020 at 01:08:53PM +0100, Greg Kurz wrote:

...

The overall idea looks good but I think you should consider using
a thread pool to implement it. See below.

I am not convinced, however.  Specifically, attaching this to the DRC
doesn't make sense to me.  We're adding exactly one DRC related async
hcall, and I can't really see much call for another one.  We could
have other async hcalls - indeed we already have one for HPT resizing
- but attaching this to DRCs doesn't help for those.

The semantics of the hcall made me think, if this is going to be
re-usable for future if implemented at DRC level.

It would only be re-usable for operations that are actually connected
to DRCs.  It doesn't seem to me particularly likely that we'll ever
have more asynchronous hcalls that are also associated with DRCs.

Okay


Other option
is to move the async-hcall-state/list into the NVDIMMState structure
in include/hw/mem/nvdimm.h and handle it with machine->nvdimms_state
at a global level.

I'm ok with either of two options:

A) Implement this ad-hoc for this specific case, making whatever
simplifications you can based on this specific case.


I am simplifying it to nvdimm use-case alone and limiting the scope.



B) Implement a general mechanism for async hcalls that is *not* tied
to DRCs.  Then use that for the existing H_RESIZE_HPT_PREPARE call as
well as this new one.


Hope you are okay with using the pool based approach that Greg

Honestly a thread pool seems like it might be overkill for this
application.


I think its appropriate here as that is what is being done by virtio-pmem

too for flush requests. The aio infrastructure simplifies lot of the

thread handling usage. Please suggest if you think there are better ways.


I am sending the next version addressing all the comments from you and Greg.


Thanks,

Shivaprasad



Re: [RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level

2021-01-18 Thread Shivaprasad G Bhat

Thanks for the comments!


On 12/28/20 2:08 PM, David Gibson wrote:


On Mon, Dec 21, 2020 at 01:08:53PM +0100, Greg Kurz wrote:

...

The overall idea looks good but I think you should consider using
a thread pool to implement it. See below.

I am not convinced, however.  Specifically, attaching this to the DRC
doesn't make sense to me.  We're adding exactly one DRC related async
hcall, and I can't really see much call for another one.  We could
have other async hcalls - indeed we already have one for HPT resizing
- but attaching this to DRCs doesn't help for those.


The semantics of the hcall made me think, if this is going to be

re-usable for future if implemented at DRC level. Other option

is to move the async-hcall-state/list into the NVDIMMState structure

in include/hw/mem/nvdimm.h and handle it with machine->nvdimms_state

at a global level.


Hope you are okay with using the pool based approach that Greg

suggested.


Please let me know.


Thanks,

Shivaprasad




[RFC PATCH] powerpc/papr_scm: Implement scm async flush

2020-12-01 Thread Shivaprasad G Bhat
Tha patch implements SCM async-flush hcall and sets the
ND_REGION_ASYNC capability when the platform device tree
has "ibm,async-flush-required" set.

The below demonstration shows the map_sync behavior when
ibm,async-flush-required is present in device tree.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm without async-flush-required,
and pmem1 is from nvdimm with async-flush-required, mounted as
/dev/pmem0 on /mnt1 type xfs 
(rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs 
(rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

#./mapsync /mnt1/newfile> Without async-flush-required
#./mapsync /mnt2/newfile> With async-flush-required
Failed to mmap  with Operation not supported

Signed-off-by: Shivaprasad G Bhat 
---
The HCALL semantics are in review, not final.

 Documentation/powerpc/papr_hcalls.rst |   14 ++
 arch/powerpc/include/asm/hvcall.h |3 +-
 arch/powerpc/platforms/pseries/papr_scm.c |   39 +
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/Documentation/powerpc/papr_hcalls.rst 
b/Documentation/powerpc/papr_hcalls.rst
index 48fcf1255a33..cc310814f24c 100644
--- a/Documentation/powerpc/papr_hcalls.rst
+++ b/Documentation/powerpc/papr_hcalls.rst
@@ -275,6 +275,20 @@ Health Bitmap Flags:
 Given a DRC Index collect the performance statistics for NVDIMM and copy them
 to the resultBuffer.
 
+**H_SCM_ASYNC_FLUSH**
+
+| Input: *drcIndex*
+| Out: *continue-token*
+| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
+
+Given a DRC Index Flush the data to backend NVDIMM device.
+
+The hcall returns H_BUSY when the flush takes longer time and the hcall needs
+to be issued multiple times in order to be completely serviced. The
+*continue-token* from the output to be passed in the argument list in
+subsequent hcalls to the hypervisor until the hcall is completely serviced
+at which point H_SUCCESS is returned by the hypervisor.
+
 References
 ==
 .. [1] "Power Architecture Platform Reference"
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index c1fbccb04390..4a13074bc782 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -306,7 +306,8 @@
 #define H_SCM_HEALTH0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
-#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
+#define H_SCM_ASYNC_FLUSH  0x4A0
+#define MAX_HCALL_OPCODE   H_SCM_ASYNC_FLUSH
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..1f8c5153cb3d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -93,6 +93,7 @@ struct papr_scm_priv {
uint64_t block_size;
int metadata_size;
bool is_volatile;
+   bool async_flush_required;
 
uint64_t bound_addr;
 
@@ -117,6 +118,38 @@ struct papr_scm_priv {
size_t stat_buffer_len;
 };
 
+static int papr_scm_pmem_flush(struct nd_region *nd_region, struct bio *bio)
+{
+   unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   struct papr_scm_priv *p = nd_region_provider_data(nd_region);
+   int64_t rc;
+   uint64_t token = 0;
+
+   do {
+   rc = plpar_hcall(H_SCM_ASYNC_FLUSH, ret, p->drc_index, token);
+
+   /* Check if we are stalled for some time */
+   token = ret[0];
+   if (H_IS_LONG_BUSY(rc)) {
+   msleep(get_longbusy_msecs(rc));
+   rc = H_BUSY;
+   } else if (rc == H_BUSY) {
+   cond_resched();
+   }
+
+   } while (rc == H_BUSY);
+
+   if (rc)
+   dev_err(>pdev->dev, "flush error: %lld\n", rc);
+   else
+   dev_dbg(>pdev->dev, "flush drc 0x%x complete\n",
+   p->drc_index);
+
+   dev_dbg(>pdev->dev, "Flush call complete\n");
+
+   return rc;
+}
+
 static LIST_HEAD(papr_nd_regions);
 static DEFINE_MUTEX(papr_ndr_lock);
 
@@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.num_mappings = 1;
ndr_desc.nd_set = >nd_set;
 
+   if (p->async_flush_required) {
+   set_bit(ND_REGION_ASYNC, _desc.flags);
+   ndr_desc.flush = papr_scm_pmem_flush;
+   }
+
if (p->is_volatile)
p->region = nvdimm_volatile_region_create(p->bus, _desc);
else {
@@ -1088,6 +1126,7 @@ static int papr_scm_probe(struct platform_device *pdev)
p->block_size = block_size;
p->blocks = blocks;
p->is_volatile = !of_pro

[RFC Qemu PATCH v2 2/2] spapr: nvdimm: Implement async flush hcalls

2020-11-30 Thread Shivaprasad G Bhat
When the persistent memory beacked by a file, a cpu cache flush instruction
is not sufficient to ensure the stores are correctly flushed to the media.

The patch implements the async hcalls for flush operation on demand from the
guest kernel.

The device option sync-dax is by default off and enables explicit asynchronous
flush requests from guest. It can be disabled by setting syn-dax=on.

Signed-off-by: Shivaprasad G Bhat 
---
 hw/mem/nvdimm.c |1 +
 hw/ppc/spapr_nvdimm.c   |   79 +++
 include/hw/mem/nvdimm.h |   10 ++
 include/hw/ppc/spapr.h  |3 +-
 4 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 03c2201b56..37a4db0135 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -220,6 +220,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
const void *buf,
 
 static Property nvdimm_properties[] = {
 DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
+DEFINE_PROP_BOOL(NVDIMM_SYNC_DAX_PROP, NVDIMMDevice, sync_dax, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/ppc/spapr_nvdimm.c b/hw/ppc/spapr_nvdimm.c
index a833a63b5e..557e36aa98 100644
--- a/hw/ppc/spapr_nvdimm.c
+++ b/hw/ppc/spapr_nvdimm.c
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr_drc.h"
 #include "hw/ppc/spapr_nvdimm.h"
@@ -155,6 +156,11 @@ static int spapr_dt_nvdimm(SpaprMachineState *spapr, void 
*fdt,
  "operating-system")));
 _FDT(fdt_setprop(fdt, child_offset, "ibm,cache-flush-required", NULL, 0));
 
+if (!nvdimm->sync_dax) {
+_FDT(fdt_setprop(fdt, child_offset, "ibm,async-flush-required",
+ NULL, 0));
+}
+
 return child_offset;
 }
 
@@ -370,6 +376,78 @@ static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, 
SpaprMachineState *spapr,
 return H_SUCCESS;
 }
 
+typedef struct SCMAsyncFlushData {
+int fd;
+uint64_t token;
+} SCMAsyncFlushData;
+
+static int flush_worker_cb(void *opaque)
+{
+int ret = H_SUCCESS;
+SCMAsyncFlushData *req_data = opaque;
+
+/* flush raw backing image */
+if (qemu_fdatasync(req_data->fd) < 0) {
+error_report("papr_scm: Could not sync nvdimm to backend file: %s",
+ strerror(errno));
+ret = H_HARDWARE;
+}
+
+g_free(req_data);
+
+return ret;
+}
+
+static target_ulong h_scm_async_flush(PowerPCCPU *cpu, SpaprMachineState 
*spapr,
+  target_ulong opcode, target_ulong *args)
+{
+int ret;
+uint32_t drc_index = args[0];
+uint64_t continue_token = args[1];
+SpaprDrc *drc = spapr_drc_by_index(drc_index);
+PCDIMMDevice *dimm;
+HostMemoryBackend *backend = NULL;
+SCMAsyncFlushData *req_data = NULL;
+
+if (!drc || !drc->dev ||
+spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
+return H_PARAMETER;
+}
+
+if (continue_token != 0) {
+ret = spapr_drc_get_async_hcall_status(drc, continue_token);
+if (ret == H_BUSY) {
+args[0] = continue_token;
+return H_LONG_BUSY_ORDER_1_SEC;
+}
+
+return ret;
+}
+
+dimm = PC_DIMM(drc->dev);
+backend = MEMORY_BACKEND(dimm->hostmem);
+
+req_data = g_malloc0(sizeof(SCMAsyncFlushData));
+req_data->fd = memory_region_get_fd(>mr);
+
+continue_token = spapr_drc_get_new_async_hcall_token(drc);
+if (!continue_token) {
+g_free(req_data);
+return H_P2;
+}
+req_data->token = continue_token;
+
+spapr_drc_run_async_hcall(drc, continue_token, _worker_cb, req_data);
+
+ret = spapr_drc_get_async_hcall_status(drc, continue_token);
+if (ret == H_BUSY) {
+args[0] = req_data->token;
+return ret;
+}
+
+return ret;
+}
+
 static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
  target_ulong opcode, target_ulong *args)
 {
@@ -486,6 +564,7 @@ static void spapr_scm_register_types(void)
 spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
 spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
 spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
+spapr_register_hypercall(H_SCM_ASYNC_FLUSH, h_scm_async_flush);
 }
 
 type_init(spapr_scm_register_types)
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index c699842dd0..9e8795766e 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
 #define NVDIMM_UUID_PROP   "uuid"
 #define NVDIMM_UNARMED_PROP"unarmed"
+#define NVDIMM_SYNC_DAX_PROP

[RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level

2020-11-30 Thread Shivaprasad G Bhat
The patch adds support for async hcalls at the DRC level for the
spapr devices. To be used by spapr-scm devices in the patch/es to follow.

Signed-off-by: Shivaprasad G Bhat 
---
 hw/ppc/spapr_drc.c |  149 
 include/hw/ppc/spapr_drc.h |   25 +++
 2 files changed, 174 insertions(+)

diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
index 77718cde1f..4ecd04f686 100644
--- a/hw/ppc/spapr_drc.c
+++ b/hw/ppc/spapr_drc.c
@@ -15,6 +15,7 @@
 #include "qapi/qmp/qnull.h"
 #include "cpu.h"
 #include "qemu/cutils.h"
+#include "qemu/guest-random.h"
 #include "hw/ppc/spapr_drc.h"
 #include "qom/object.h"
 #include "migration/vmstate.h"
@@ -421,6 +422,148 @@ void spapr_drc_detach(SpaprDrc *drc)
 spapr_drc_release(drc);
 }
 
+
+/*
+ * @drc : device DRC targetting which the async hcalls to be made.
+ *
+ * All subsequent requests to run/query the status should use the
+ * unique token returned here.
+ */
+uint64_t spapr_drc_get_new_async_hcall_token(SpaprDrc *drc)
+{
+Error *err = NULL;
+uint64_t token;
+SpaprDrcDeviceAsyncHCallState *tmp, *next, *state;
+
+state = g_malloc0(sizeof(*state));
+state->pending = true;
+
+qemu_mutex_lock(>async_hcall_states_lock);
+retry:
+if (qemu_guest_getrandom(, sizeof(token), ) < 0) {
+error_report_err(err);
+g_free(state);
+qemu_mutex_unlock(>async_hcall_states_lock);
+return 0;
+}
+
+if (!token) /* Token should be non-zero */
+goto retry;
+
+if (!QLIST_EMPTY(>async_hcall_states)) {
+QLIST_FOREACH_SAFE(tmp, >async_hcall_states, node, next) {
+if (tmp->continue_token == token) {
+/* If the token already in use, get a new one */
+goto retry;
+}
+}
+}
+
+state->continue_token = token;
+QLIST_INSERT_HEAD(>async_hcall_states, state, node);
+
+qemu_mutex_unlock(>async_hcall_states_lock);
+
+return state->continue_token;
+}
+
+static void *spapr_drc_async_hcall_runner(void *opaque)
+{
+int response = -1;
+SpaprDrcDeviceAsyncHCallState *state = opaque;
+
+/*
+ * state is freed only after this thread finishes(after pthread_join()),
+ * don't worry about it becoming NULL.
+ */
+
+response = state->func(state->data);
+
+state->hcall_ret = response;
+state->pending = 0;
+
+return NULL;
+}
+
+/*
+ * @drc  : device DRC targetting which the async hcalls to be made.
+ * token : The continue token to be used for tracking as recived from
+ * spapr_drc_get_new_async_hcall_token
+ * @func() : the worker function which needs to be executed asynchronously
+ * @data : data to be passed to the asynchronous function. Worker is supposed
+ * to free/cleanup the data that is passed here
+ */
+void spapr_drc_run_async_hcall(SpaprDrc *drc, uint64_t token,
+   SpaprDrcAsyncHcallWorkerFunc *func, void *data)
+{
+SpaprDrcDeviceAsyncHCallState *state;
+
+qemu_mutex_lock(>async_hcall_states_lock);
+QLIST_FOREACH(state, >async_hcall_states, node) {
+if (state->continue_token == token) {
+state->func = func;
+state->data = data;
+qemu_thread_create(>thread, "sPAPR Async HCALL",
+   spapr_drc_async_hcall_runner, state,
+   QEMU_THREAD_JOINABLE);
+break;
+}
+}
+qemu_mutex_unlock(>async_hcall_states_lock);
+}
+
+/*
+ * spapr_drc_finish_async_hcalls
+ *  Waits for all pending async requests to complete
+ *  thier execution and free the states
+ */
+static void spapr_drc_finish_async_hcalls(SpaprDrc *drc)
+{
+SpaprDrcDeviceAsyncHCallState *state, *next;
+
+if (QLIST_EMPTY(>async_hcall_states)) {
+return;
+}
+
+qemu_mutex_lock(>async_hcall_states_lock);
+QLIST_FOREACH_SAFE(state, >async_hcall_states, node, next) {
+qemu_thread_join(>thread);
+QLIST_REMOVE(state, node);
+g_free(state);
+}
+qemu_mutex_unlock(>async_hcall_states_lock);
+}
+
+/*
+ * spapr_drc_get_async_hcall_status
+ *  Fetches the status of the hcall worker and returns H_BUSY
+ *  if the worker is still running.
+ */
+int spapr_drc_get_async_hcall_status(SpaprDrc *drc, uint64_t token)
+{
+int ret = H_PARAMETER;
+SpaprDrcDeviceAsyncHCallState *state, *node;
+
+qemu_mutex_lock(>async_hcall_states_lock);
+QLIST_FOREACH_SAFE(state, >async_hcall_states, node, node) {
+if (state->continue_token == token) {
+if (state->pending) {
+ret = H_BUSY;
+break;
+} else {
+ret = state->hcall_ret;
+qemu_thread_join(>thread);
+QLIST_REMOVE(sta

[RFC Qemu PATCH v2 0/2] spapr: nvdimm: Asynchronus flush hcall support

2020-11-30 Thread Shivaprasad G Bhat
The nvdimm devices are expected to ensure write persistent during power
failure kind of scenarios.

The libpmem has architecture specific instructions like dcbf on power
to flush the cache data to backend nvdimm device during normal writes.

Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest
doesn't traslate to actual flush to the backend file on the host in case
of file backed vnvdimms. This is addressed by virtio-pmem in case of x86_64
by making asynchronous flushes.

On PAPR, issue is addressed by adding a new hcall to
request for an explicit asynchronous flush requests from the guest ndctl
driver when the backend nvdimm cannot ensure write persistence with dcbf
alone. So, the approach here is to convey when the asynchronous flush is
required in a device tree property. The guest makes the hcall when the
property is found, instead of relying on dcbf.

The first patch adds the necessary asynchronous hcall support infrastructure
code at the DRC level. Second patch implements the hcall using the
infrastructure.

Hcall semantics are in review and not final.

A new device property sync-dax is added to the nvdimm device. When the 
sync-dax is off(default), the asynchronous hcalls will be called.

With respect to save from new qemu to restore on old qemu, having the
sync-dax by default off(when not specified) causes IO errors in guests as
the async-hcall would not be supported on old qemu. The new hcall
implementation being supported only on the new  pseries machine version,
the current machine version checks may be sufficient to prevent
such migration. Please suggest what should be done.

The below demonstration shows the map_sync behavior with sync-dax on & off.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm with With sync-dax=on, and pmem1 is from nvdimm with 
syn-dax=off, mounted as
/dev/pmem0 on /mnt1 type xfs 
(rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs 
(rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

[root@atest-guest ~]# ./mapsync /mnt1/newfile> When sync-dax=off
[root@atest-guest ~]# ./mapsync /mnt2/newfile> when sync-dax=on
Failed to mmap  with Operation not supported

---
v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html
Changes from v1
  - Fixed a missed-out unlock
  - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token

Shivaprasad G Bhat (2):
  spapr: drc: Add support for async hcalls at the drc level
  spapr: nvdimm: Implement async flush hcalls


 hw/mem/nvdimm.c|1
 hw/ppc/spapr_drc.c |  146 
 hw/ppc/spapr_nvdimm.c  |   79 
 include/hw/mem/nvdimm.h|   10 +++
 include/hw/ppc/spapr.h |3 +
 include/hw/ppc/spapr_drc.h |   25 
 6 files changed, 263 insertions(+), 1 deletion(-)

--
Signature