Re: [RFC PATCH v2 0/6] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO
On 2/5/24 00:09, Jason Gunthorpe wrote:

On Tue, Apr 30, 2024 at 03:05:34PM -0500, Shivaprasad G Bhat wrote:

RFC v1 was posted here [1]. As I was testing more and fixing the issues, I realized it is cleaner to have the table_group_ops implemented the way it is done on PowerNV and stop "borrowing" the DMA windows for pSeries. This patch set implements the iommu table_group_ops for pSeries for the VFIO SPAPR TCE sub-driver, thereby enabling VFIO support on POWER pSeries machines.

Wait, did they previously not have any support?

> Again, this TCE stuff needs to go away, not grow. I can grudgingly accept fixing it where it used to work, but not enabling more HW that never worked before! :(

This used to work when I tried it last time, 2+ years ago; it is not new.

Thanks,

--
Alexey
Re: [PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs
Hi Gaurav,

Sorry I missed this. Please share the link to your fix, I do not see it in my mail. In general, the problem can probably be solved by using huge pages (anything more than 64K) only for 1:1 mapping.

On 03/05/2023 13:25, Gaurav Batra wrote:

Hello Alexey,

I recently joined the IOMMU team. There was a bug reported by the test team where the Mellanox driver was timing out during configuration. I proposed a fix for it, which is below in the email. You suggested a fix for Srikar's reported problem. Both of these fixes will resolve the Srikar and Mellanox driver issues. The problem is with 2MB DDW.

Since you have extensive knowledge of the IOMMU design and code, in your opinion, which patch should we adopt?

Thanks a lot,
Gaurav

On 4/20/23 2:45 PM, Gaurav Batra wrote:

Hello Michael,

I was looking into Bug 199106 (https://bugzilla.linux.ibm.com/show_bug.cgi?id=199106). In the bug, the Mellanox driver was timing out when enabling an SR-IOV device. I tested Alexey's patch and it fixes the issue with the Mellanox driver. The downside of Alexey's fix is that even a small memory request by the driver will be aligned up to 2MB. In my test, the Mellanox driver issues multiple requests of 64K size. All of these get aligned up to 2MB, which is quite a waste of resources.

In any case, both patches work. Let me know which approach you prefer. In case we decide to go with my patch, I just realized that I need to fix nio_pages in iommu_free_coherent() as well.

Thanks,
Gaurav

On 4/20/23 10:21 AM, Michael Ellerman wrote:

Gaurav Batra writes:

When a DMA window is backed by 2MB TCEs, the DMA address for the mapped page should be the offset of the page relative to the 2MB TCE. The code was incorrectly setting the DMA address to the beginning of the TCE range. The Mellanox driver reports a timeout trying to ENABLE_HCA for an SR-IOV ethernet port when the DMA window is backed by 2MB TCEs.

I assume this is similar or related to the bug Srikar reported?
https://lore.kernel.org/linuxppc-dev/20230323095333.gi1005...@linux.vnet.ibm.com/

In that thread Alexey suggested a patch, have you tried his patch? He suggested rounding up the allocation size, rather than adjusting the dma_handle.

Fixes: 3872731187141d5d0a5c4fb30007b8b9ec36a44d

That's not the right syntax, it's described in the documentation how to generate it. It should be:

Fixes: 387273118714 ("powerpc/pseries/dma: Add support for 2M IOMMU page size")

cheers

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		/* Convert entry to a dma_addr_t */
 		entry += tbl->it_offset;
 		dma_addr = entry << tbl->it_page_shift;
-		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+		dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));

 		DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
 			npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	unsigned int order;
 	unsigned int nio_pages, io_order;
 	struct page *page;
+	int tcesize = (1 << tbl->it_page_shift);

 	size = PAGE_ALIGN(size);
 	order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	memset(ret, 0, size);

 	/* Set up tces to cover the allocated range */
-	nio_pages = size >> tbl->it_page_shift;
+	nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+	io_order = get_iommu_order(size, tbl);
 	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
 			mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 		free_pages((unsigned long)ret, order);
 		return NULL;
 	}
-	*dma_handle = mapping;
+
+	*dma_handle = mapping | ((u64)ret & (tcesize - 1));

 	return ret;
 }

--
Alexey
Re: Probing nvme disks fails on Upstream kernels on powerpc Maxconfig
On 05/04/2023 15:45, Michael Ellerman wrote:

"Linux regression tracking (Thorsten Leemhuis)" writes:

[CCing the regression list, as it should be in the loop for regressions: https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 23.03.23 10:53, Srikar Dronamraju wrote:

I am unable to boot upstream kernels from v5.16 to the latest upstream kernel on a maxconfig system. (Machine config details given below.) At boot, we see a series of messages like the below.

dracut-initqueue[13917]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
dracut-initqueue[13917]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f93dc0767-18aa-467f-afa7-5b4e9c13108a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
dracut-initqueue[13917]: [ -e "/dev/disk/by-uuid/93dc0767-18aa-467f-afa7-5b4e9c13108a" ]
dracut-initqueue[13917]: fi"

Alexey, did you look into this? This is apparently caused by a commit of yours (see quoted part below) that Michael applied. Looks like it fell through the cracks from here, but maybe I'm missing something.

Unfortunately Alexey is not working at IBM any more, so he won't have access to any hardware to debug/test this. Srikar, are you debugging this? If not, we'll have to find someone else to look at it.

Has this been fixed and I missed the cc:? Anyway, without the full log, I still see it is a huge guest, so chances are the guest could not map all RAM and instead uses the biggest possible DDW with 2M pages.
If that's the case, this might help it:

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 614af78b3695..996acf245ae5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -906,7 +906,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	unsigned int nio_pages, io_order;
 	struct page *page;

-	size = PAGE_ALIGN(size);
+	size = _ALIGN(size, IOMMU_PAGE_SIZE(tbl));
 	order = get_order(size);
@@ -949,10 +949,9 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 	if (tbl) {
 		unsigned int nio_pages;

-		size = PAGE_ALIGN(size);
+		size = _ALIGN(size, IOMMU_PAGE_SIZE(tbl));
 		nio_pages = size >> tbl->it_page_shift;
 		iommu_free(tbl, dma_handle, nio_pages);
-		size = PAGE_ALIGN(size);
 		free_pages((unsigned long)vaddr, get_order(size));
 	}

And there may be other places where PAGE_SIZE is used instead of IOMMU_PAGE_SIZE(tbl).

Thanks,

--
Alexey
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
On 07/03/2023 10:46, Alex Williamson wrote:

On Mon, 6 Mar 2023 11:29:53 -0600 (CST), Timothy Pearson wrote:

This patch series reenables VFIO support on POWER systems. It is based on Alexey Kardashevskiy's patch series, rebased and successfully tested under QEMU with a Marvell PCIe SATA controller on a POWER9 Blackbird host.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

Timothy Pearson (1):
  Add myself to MAINTAINERS for Power VFIO support

 MAINTAINERS                               |   5 +
 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/kernel/iommu.c               | 246 +-
 arch/powerpc/kernel/pci_64.c              |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
 arch/powerpc/platforms/pseries/iommu.c    |  27 +++
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  96 ++---
 10 files changed, 338 insertions(+), 94 deletions(-)

For vfio and MAINTAINERS portions,

Acked-by: Alex Williamson

I'll note though that spapr_tce_take_ownership() looks like it copied a bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map are tested before calling iommu_take_ownership() but not in the unwind loop, i.e. tables we might have skipped on setup are unconditionally released on unwind. Thanks,

Ah, true, a bug. Thanks for pointing it out.

--
Alexey
Re: [PATCH kernel v2 0/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Michael, Fred, ping?

On 20/09/2022 23:04, Alexey Kardashevskiy wrote:

Here is another take on iommu_ops on POWER to make VFIO work again on POWERPC64. Tested on PPC, kudos to Fred!

The tree with all prerequisites is here: https://github.com/aik/linux/tree/kvm-fixes-wip

The previous discussion is here:
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/
https://lore.kernel.org/all/20220714081822.3717693-3-...@ozlabs.ru/T/

This is based on sha1 ce888220d5c7 Linus Torvalds "Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/kernel/iommu.c               | 247 +-
 arch/powerpc/kernel/pci_64.c              |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
 arch/powerpc/platforms/pseries/iommu.c    |  27 +++
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  96 ++---
 9 files changed, 334 insertions(+), 94 deletions(-)

--
Alexey
[PATCH kernel v2 3/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Up until now, PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added two uses of iommu_ops to the generic VFIO code which broke POWER:
- a coherency capability check;
- blocking IOMMU domain - iommu_group_dma_owner_claimed()/...

This adds a simple iommu_ops which reports support for cache coherency and provides basic support for blocking domains. No other domain types are implemented, so the default domain is NULL. Since iommu_ops now controls the group ownership, this takes it out of VFIO.

This adds an IOMMU device to a pci_controller (=PHB) and registers it in the IOMMU subsystem; iommu_ops is registered at this point. This setup is done in postcore_initcall_sync.

This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why an explicit iommu_probe_device() call is still needed.
Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence")
Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices")
Cc: Deming Wang
Cc: Robin Murphy
Cc: Jason Gunthorpe
Cc: Alex Williamson
Cc: Daniel Henrique Barboza
Cc: Fabiano Rosas
Cc: Murilo Opsfelder Araujo
Cc: Nicholas Piggin
Signed-off-by: Alexey Kardashevskiy
---
Changes:
v2:
* replaced a default domain with blocked
---
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/kernel/iommu.c               | 149 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  30 +
 arch/powerpc/platforms/pseries/iommu.c    |  24
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |   8 --
 7 files changed, 215 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index e18c95f4e1d4..fcab0e4b203b 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include

 struct device_node;

@@ -44,6 +45,9 @@ struct pci_controller_ops {
 #endif

 	void		(*shutdown)(struct pci_controller *hose);
+
+	struct iommu_group *(*device_group)(struct pci_controller *hose,
+					    struct pci_dev *pdev);
 };

 /*
@@ -131,6 +135,9 @@ struct pci_controller {
 	struct irq_domain	*dev_domain;
 	struct irq_domain	*msi_domain;
 	struct fwnode_handle	*fwnode;
+
+	/* iommu_ops support */
+	struct iommu_device	iommu;
 };

 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 1d75b7742ef0..f8bce40ebd0c 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -123,5 +123,9 @@ static inline void pseries_lpar_read_hblkrm_characteristics(void) { }
 #endif

 void pseries_rng_init(void);

+#ifdef CONFIG_SPAPR_TCE_IOMMU
+struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose,
+					     struct pci_dev *pdev);
+#endif
 #endif /* _PSERIES_PSERIES_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d873c123ab49..823da727aac7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include

 #define DBG(...)

@@ -1158,8 +1159,14 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 	pr_debug("%s: Adding %s to iommu group %d\n",
 		 __func__, dev_name(dev), iommu_group_id(table_group->group));
-
-	return iommu_group_add_device(table_group->group, dev);
+	/*
+	 * This is still not adding devices via the IOMMU bus notifier because
+	 * of pcibios_init() from arch/powerpc/kernel/pci_64.c which calls
+	 * pcibios_scan_phb() first (and this guy adds devices and triggers
+	 * the notifier) and only then it calls pci_bus_add_devices() which
+	 * configures DMA for buses which also creates PEs and IOMMU groups.
+	 */
+	return iommu_probe_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);

@@ -1239,6 +1246,7 @@ static long spapr_tce_take_ownership(struct iommu_table_group *table_group)
 		rc = iommu_take_ownership(tbl);
 		if (!rc)
 			continue;
+
 		for (j = 0; j < i; ++j)
 			iommu_release_ownership(table_group->tables[j]);
 		return rc;
@@ -1271,4 +1279,141 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = {
 	.release_ownership = spapr_tce_release_ownership,
 };

+/*
+ * A simple iommu_ops to allow less cr
[PATCH kernel v2 0/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Here is another take on iommu_ops on POWER to make VFIO work again on POWERPC64. Tested on PPC, kudos to Fred!

The tree with all prerequisites is here: https://github.com/aik/linux/tree/kvm-fixes-wip

The previous discussion is here:
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/
https://lore.kernel.org/all/20220714081822.3717693-3-...@ozlabs.ru/T/

This is based on sha1 ce888220d5c7 Linus Torvalds "Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/kernel/iommu.c               | 247 +-
 arch/powerpc/kernel/pci_64.c              |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
 arch/powerpc/platforms/pseries/iommu.c    |  27 +++
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  96 ++---
 9 files changed, 334 insertions(+), 94 deletions(-)

--
2.37.3
[PATCH kernel v2 1/3] powerpc/iommu: Add "borrowing" iommu_table_group_ops
The PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: it controls the ownership and creates/sets/unsets a table in the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER.

So far only PowerNV IODA2 (POWER8 and newer machines) implemented this; other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW, 2) ownership transfer is done directly in the VFIO SPAPR TCE driver.

Soon POWER is going to get its own iommu_ops and ownership control is going to move there.

This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts at table_group.tce32_start and is as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV).

This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/kernel/iommu.c               |  98 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c |   6 +-
 arch/powerpc/platforms/pseries/iommu.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  94 --
 5 files changed, 121 insertions(+), 86 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7e29c73e3dd4..678b5bdc79b1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -175,7 +175,7 @@ struct iommu_table_group_ops {
 	long (*unset_window)(struct iommu_table_group *table_group,
 			int num);
 	/* Switch ownership from platform code to external user (e.g. VFIO) */
-	void (*take_ownership)(struct iommu_table_group *table_group);
+	long (*take_ownership)(struct iommu_table_group *table_group);
 	/* Switch ownership from external user (e.g. VFIO) back to core */
 	void (*release_ownership)(struct iommu_table_group *table_group);
 };
@@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 		enum dma_data_direction *direction);
 extern void iommu_tce_kill(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
+
+extern struct iommu_table_group_ops spapr_tce_table_group_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
@@ -303,8 +305,6 @@ extern int iommu_tce_check_gpa(unsigned long page_shift,
 		iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))

 extern void iommu_flush_tce(struct iommu_table *tbl);
-extern int iommu_take_ownership(struct iommu_table *tbl);
-extern void iommu_release_ownership(struct iommu_table *tbl);

 extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
 extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index caebe1431596..d873c123ab49 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,7 +1088,7 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);

-int iommu_take_ownership(struct iommu_table *tbl)
+static int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0;
@@ -1120,9 +1120,8 @@ int iommu_take_ownership(struct iommu_table *tbl)

 	return ret;
 }
-EXPORT_SYMBOL_GPL(iommu_take_ownership);

-void iommu_release_ownership(struct iommu_table *tbl)
+static void iommu_release_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;

@@ -1139,7 +1138,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 		spin_unlock(&tbl->pools[i].lock);
 	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
-EXPORT_SYMBOL_GPL(iommu_release_ownership);

 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
@@ -1181,4 +1179,96 @@ void iommu_del_device(struct device *dev)
 	iommu_group_remove_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_del_device);
+
+/*
+ * A simple iommu_table_group_ops which only allows reusing the existing
+ * iommu_table. This handles VFIO for POWER7 or the nested KVM.
+ * The ops does not allow creating windows and only allows reusing the existing
+ * one if it matches table_group->tce32_start/tce32_size/page_shift.
+ */
+static unsigned long spapr_tce_get_table_size(__u32 page_shift,
+					      __u64 window_size, __u32 leve
[PATCH kernel v2 2/3] powerpc/pci_64: Init pcibios subsys a bit later
The following patches are going to add a dependency on/use of iommu_ops which is initialized in subsys_initcall as well. This moves pcibios_init() to the next initcall level.

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/pci_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 0c7cfb9fab04..9cd763d512ae 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -73,7 +73,7 @@ static int __init pcibios_init(void)
 	return 0;
 }

-subsys_initcall(pcibios_init);
+subsys_initcall_sync(pcibios_init);

 int pcibios_unmap_io_space(struct pci_bus *bus)
 {
--
2.37.3
Re: [PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
Ping? It's been a while and probably got lost :-/

On 18/05/2022 16:27, Alexey Kardashevskiy wrote:

On 5/4/22 17:48, Alexey Kardashevskiy wrote:

When introduced, IRQFD resampling worked on POWER8 with XICS. However, KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all.

This moves the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy
---
Or I could move this one together with KVM_CAP_IRQFD. Thoughts?

Ping?
---
 arch/arm64/kvm/arm.c       | 3 +++
 arch/mips/kvm/mips.c       | 3 +++
 arch/powerpc/kvm/powerpc.c | 6 ++
 arch/riscv/kvm/vm.c        | 3 +++
 arch/s390/kvm/kvm-s390.c   | 3 +++
 arch/x86/kvm/x86.c         | 3 +++
 virt/kvm/kvm_main.c        | 1 -
 7 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 523bc934fe2f..092f0614bae3 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -210,6 +210,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..0f3de470a73e 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_SYNC_MMU:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 875c30c12db0..87698ffef3be 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -591,6 +591,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+		r = !xive_enabled();
+		break;
+#endif
+
 	case KVM_CAP_PPC_ALLOC_HTAB:
 		r = hv_enabled;
 		break;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index c768f75279ef..b58579b386bb 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -63,6 +63,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 156d1c25a3c1..85e093fc8d13 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -564,6 +564,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_S390_DIAG318:
 	case KVM_CAP_S390_MEM_OP_EXTENSION:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c0ca599a353..a0a7b769483d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4273,6 +4273,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SYS_ATTRIBUTES:
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70e05af5ebea..885e72e668a5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4293,7 +4293,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
-	case KVM_CAP_IRQFD_RESAMPLE:
 #endif
 	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
 	case KVM_CAP_CHECK_EXTENSION_VM:

--
Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 29/07/2022 13:10, Tian, Kevin wrote:

From: Oliver O'Halloran, Sent: Friday, July 29, 2022 10:53 AM

On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy wrote:

*snip*

About this. If a platform has a concept of explicit DMA windows (2 or more), is it one domain with 2 windows, or 2 domains with one window each? If it is 2 windows, iommu_domain_ops misses window manipulation callbacks (I vaguely remember them being there for embedded PPC64 but cannot find them quickly). If it is 1 window per domain, then can a device be attached to 2 domains, at least in theory (I suspect not)?

On server POWER CPUs, each DMA window is backed by an independent IOMMU page table. (reminder) A window is a bus address range where devices are allowed to DMA to/from ;)

I've always thought of windows as being entries to a top-level "iommu page table" for the device / domain. The fact that each window is backed by a separate IOMMU page table shouldn't really be relevant outside the arch/platform.

Yes. This is what was agreed when discussing how to integrate iommufd with POWER [1]. One domain represents one address space. Windows are just constraints on the address space for what ranges can be mapped. Having two page tables underlying it is just a POWER-specific format.

It is a POWER-specific thing with one not-so-obvious consequence: each window has an independent page size (fixed at the moment of creation) and (most likely) a different page size, like 4K vs. 2M.

Thanks,
Kevin

[1] https://lore.kernel.org/all/Yns+TCSa6hWbU7wZ@yekko/

--
Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 17:32, Tian, Kevin wrote:

From: Alexey Kardashevskiy, Sent: Friday, July 8, 2022 2:35 PM

On 7/8/22 15:00, Alexey Kardashevskiy wrote:

On 7/8/22 01:10, Jason Gunthorpe wrote:

On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:

Historically, PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development, though, has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; a similar story with iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash.

Stale comment, since this patch doesn't use bus_set_iommu() now.

+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+				      struct device *dev)
+{
+	return 0;
+}

It is important when this returns that the iommu translation is all emptied. There should be no leftover translations from the DMA API at this point. I have no idea how POWER works in this regard, but it should be explained why this is safe in a comment at a minimum.

> It will turn into a security problem to allow kernel mappings to leak past this point.

I've added for v2 a check that there are no valid mappings for a device (or, more precisely, in the associated iommu_group); this domain does not need checking, right?

Uff, not that simple. Looks like once a device is in a group, its dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess then there is a way to set those to NULL or do something similar to let dma_map_direct() from kernel/dma/mapping.c return "true", is not there?

dev->dma_ops is NULL as long as you don't handle the DMA domain type here and don't call iommu_setup_dma_ops(). Given this only supports the blocking domain, the above should be irrelevant.

For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing, as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.

Thanks,

In general, is "domain" something from hardware or is it a software concept? Thanks,

'domain' is a software concept to represent the hardware I/O page table.

About this. If a platform has a concept of explicit DMA windows (2 or more), is it one domain with 2 windows, or 2 domains with one window each? If it is 2 windows, iommu_domain_ops misses window manipulation callbacks (I vaguely remember them being there for embedded PPC64 but cannot find them quickly). If it is 1 window per domain, then can a device be attached to 2 domains, at least in theory (I suspect not)?

On server POWER CPUs, each DMA window is backed by an independent IOMMU page table. (reminder) A window is a bus address range where devices are allowed to DMA to/from ;)

Thanks,

A blocking domain means all DMAs from a device attached to this domain are blocked/rejected (equivalent to an empty I/O page table), usually enforced in the .attach_dev() callback. Yes, a comment for why having a NULL .attach_dev() is OK is welcomed.

Thanks,
Kevin

--
Alexey
Re: [PATCH kernel 3/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 19/07/2022 04:09, Jason Gunthorpe wrote: On Thu, Jul 14, 2022 at 06:18:22PM +1000, Alexey Kardashevskiy wrote: +/* + * A simple iommu_ops to allow less cruft in generic VFIO code. + */ +static bool spapr_tce_iommu_capable(enum iommu_cap cap) +{ + switch (cap) { + case IOMMU_CAP_CACHE_COHERENCY: I would add a remark here that it is because vfio is going to use SPAPR mode but still checks that the iommu driver support coherency - with out that detail it looks very strange to have caps without implementing unmanaged domains +static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type) +{ + struct iommu_domain *dom; + + if (type != IOMMU_DOMAIN_BLOCKED) + return NULL; + + dom = kzalloc(sizeof(*dom), GFP_KERNEL); + if (!dom) + return NULL; + + dom->geometry.aperture_start = 0; + dom->geometry.aperture_end = ~0ULL; + dom->geometry.force_aperture = true; A blocked domain doesn't really have an aperture, all DMA is rejected, so I think these can just be deleted and left at zero. Generally I'm suggesting drivers just use a static singleton instance for the blocked domain instead of the allocation like this, but that is a very minor nit. +static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev) +{ + struct pci_dev *pdev; + struct pci_controller *hose; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); Less "weirdly", more by design. The iommu driver should check if the given struct device is supported or not, it isn't really a bus specific operation. +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct pci_controller *hose; + struct pci_dev *pdev; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); This doesn't need repeating, if probe_device() fails then this will never be called. 
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+				      struct device *dev)
+{
+	struct iommu_group *grp = iommu_group_get(dev);
+	struct iommu_table_group *table_group;
+	int ret = -EINVAL;
+
+	if (!grp)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+
+	if (dom->type == IOMMU_DOMAIN_BLOCKED)
+		ret = table_group->ops->take_ownership(table_group);

Ideally there shouldn't be dom->type checks like this. The blocking domain should have its own iommu_domain_ops that only process the blocking operation. Ie call this like spapr_tce_iommu_blocking_attach_dev(). Instead of having a "default_domain_ops" leave it NULL and create a spapr_tce_blocking_domain_ops with these two functions and assign it to domain->ops when creating. Then it is really clear these functions are only called for the DOMAIN_BLOCKED type and you don't need to check it.

+static void spapr_tce_iommu_detach_dev(struct iommu_domain *dom,
+				       struct device *dev)
+{
+	struct iommu_group *grp = iommu_group_get(dev);
+	struct iommu_table_group *table_group;
+
+	table_group = iommu_group_get_iommudata(grp);
+	WARN_ON(dom->type != IOMMU_DOMAIN_BLOCKED);
+	table_group->ops->release_ownership(table_group);
+}

Ditto

+struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose,
+					     struct pci_dev *pdev)
+{
+	struct device_node *pdn, *dn = pdev->dev.of_node;
+	struct iommu_group *grp;
+	struct pci_dn *pci;
+
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn))
+		return ERR_PTR(-ENODEV);
+
+	pci = PCI_DN(pdn);
+	if (!pci->table_group)
+		return ERR_PTR(-ENODEV);
+
+	grp = pci->table_group->group;
+	if (!grp)
+		return ERR_PTR(-ENODEV);
+
+	return iommu_group_ref_get(grp);

Not for this series, but this is kind of backwards, the driver specific data (ie the table_group) should be in iommu_group_get_iommudata()...
It is there but here we are getting from a device to a group - a device is not added to a group yet when iommu_probe_device() works and tries adding a device via iommu_group_get_for_dev().

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8a65ea61744c..3b53b466e49b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1152,8 +1152,6 @@ static void tce_iommu_release_ownership(struct tce_container *container,
 	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
 		if (container->tables[i])
 			table_group->ops->
[PATCH kernel 3/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added two uses of iommu_ops to the generic VFIO code which broke POWER:
- a coherency capability check;
- a blocking IOMMU domain - iommu_group_dma_owner_claimed()/...

This adds a simple iommu_ops which reports support for cache coherency and provides basic support for blocking domains. No other domain types are implemented, so the default domain is NULL. Since iommu_ops now controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem; iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why an explicit iommu_probe_device() is still needed.
The previous discussion is here: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/ https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Cc: Deming Wang Cc: Robin Murphy Cc: Jason Gunthorpe Cc: Alex Williamson Cc: Daniel Henrique Barboza Cc: Fabiano Rosas Cc: Murilo Opsfelder Araujo Cc: Nicholas Piggin Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/pci-bridge.h | 7 + arch/powerpc/platforms/pseries/pseries.h | 5 + arch/powerpc/kernel/iommu.c | 159 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 arch/powerpc/platforms/pseries/iommu.c| 24 arch/powerpc/platforms/pseries/setup.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 8 -- 7 files changed, 226 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index c85f901227c9..338a45b410b4 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -8,6 +8,7 @@ #include #include #include +#include struct device_node; @@ -44,6 +45,9 @@ struct pci_controller_ops { #endif void(*shutdown)(struct pci_controller *hose); + + struct iommu_group *(*device_group)(struct pci_controller *hose, + struct pci_dev *pdev); }; /* @@ -131,6 +135,9 @@ struct pci_controller { struct irq_domain *dev_domain; struct irq_domain *msi_domain; struct fwnode_handle*fwnode; + + /* iommu_ops support */ + struct iommu_device iommu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index f5c916c839c9..9a49a16dd89a 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -122,4 +122,9 @@ void pseries_lpar_read_hblkrm_characteristics(void); static inline void 
pseries_lpar_read_hblkrm_characteristics(void) { } #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU +struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose, +struct pci_dev *pdev); +#endif + #endif /* _PSERIES_PSERIES_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index d873c123ab49..b5301e289714 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -35,6 +35,7 @@ #include #include #include +#include #define DBG(...) @@ -1158,8 +1159,14 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - - return iommu_group_add_device(table_group->group, dev); + /* +* This is still not adding devices via the IOMMU bus notifier because +* of pcibios_init() from arch/powerpc/kernel/pci_64.c which calls +* pcibios_scan_phb() first (and this guy adds devices and triggers +* the notifier) and only then it calls pci_bus_add_devices() which +* configures DMA for buses which also creates PEs and IOMMU groups. +*/ + return iommu_probe_device(dev); } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1239,6 +1246,7 @@ static long spapr_tce_take_ownership(struct iommu_table_group *table_group) rc = iommu_take_ownership(tbl); if (!rc) continue; + for (j = 0; j < i; ++j) iommu_release_ownership(table_group->tables[j]); return rc; @@ -1271,4
[PATCH kernel 2/3] powerpc/pci_64: Init pcibios subsys a bit later
The following patches are going to add a dependency on/use of iommu_ops, which is also initialized in a subsys_initcall. This moves pcibios_init() to the next initcall level. This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/pci_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 19b03ddf5631..79472d2f1739 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -73,7 +73,7 @@ static int __init pcibios_init(void)
 	return 0;
 }

-subsys_initcall(pcibios_init);
+subsys_initcall_sync(pcibios_init);

 int pcibios_unmap_io_space(struct pci_bus *bus)
 {
-- 
2.30.2
[PATCH kernel 1/3] powerpc/iommu: Add "borrowing" iommu_table_group_ops
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: it controls the ownership and creates/sets/unsets a table in the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this; other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means: 1) no DDW; 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts at table_group.tce32_start and is as big as pe->table_group.tce32_size (which is not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/kernel/iommu.c | 98 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 6 +- arch/powerpc/platforms/pseries/iommu.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 94 -- 5 files changed, 121 insertions(+), 86 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..678b5bdc79b1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -175,7 +175,7 @@ struct iommu_table_group_ops { long (*unset_window)(struct iommu_table_group *table_group, int num); /* Switch ownership from platform code to external user (e.g. 
VFIO) */ - void (*take_ownership)(struct iommu_table_group *table_group); + long (*take_ownership)(struct iommu_table_group *table_group); /* Switch ownership from external user (e.g. VFIO) back to core */ void (*release_ownership)(struct iommu_table_group *table_group); }; @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern struct iommu_table_group_ops spapr_tce_table_group_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -303,8 +305,6 @@ extern int iommu_tce_check_gpa(unsigned long page_shift, iommu_tce_check_gpa((tbl)->it_page_shift, (gpa))) extern void iommu_flush_tce(struct iommu_table *tbl); -extern int iommu_take_ownership(struct iommu_table *tbl); -extern void iommu_release_ownership(struct iommu_table *tbl); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index caebe1431596..d873c123ab49 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1088,7 +1088,7 @@ void iommu_tce_kill(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_kill); -int iommu_take_ownership(struct iommu_table *tbl) +static int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; int ret = 0; @@ -1120,9 +1120,8 @@ int iommu_take_ownership(struct iommu_table *tbl) return ret; } -EXPORT_SYMBOL_GPL(iommu_take_ownership); -void iommu_release_ownership(struct iommu_table *tbl) +static void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; @@ -1139,7 +1138,6 @@ void iommu_release_ownership(struct iommu_table *tbl) spin_unlock(&tbl->pools[i].lock); 
spin_unlock_irqrestore(&tbl->large_pool.lock, flags); } -EXPORT_SYMBOL_GPL(iommu_release_ownership); int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { @@ -1181,4 +1179,96 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); + +/* + * A simple iommu_table_group_ops which only allows reusing the existing + * iommu_table. This handles VFIO for POWER7 or the nested KVM. + * The ops does not allow creating windows and only allows reusing the existing + * one if it matches table_group->tce32_start/tce32_size/page_shift. + */ +static unsigned long spapr_tce_get_table_size(__u32 page_shift, + __u64 window_size, __u32 leve
[PATCH kernel 0/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Here is another take on iommu_ops on POWER to make VFIO work again on POWERPC64. The tree with all prerequisites is here: https://github.com/aik/linux/tree/kvm-fixes-wip The previous discussion is here: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/ https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Please comment. Thanks. Alexey Kardashevskiy (3): powerpc/iommu: Add "borrowing" iommu_table_group_ops powerpc/pci_64: Init pcibios subsys a bit later powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/include/asm/pci-bridge.h | 7 + arch/powerpc/platforms/pseries/pseries.h | 5 + arch/powerpc/kernel/iommu.c | 257 +- arch/powerpc/kernel/pci_64.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++- arch/powerpc/platforms/pseries/iommu.c| 27 +++ arch/powerpc/platforms/pseries/setup.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 96 ++-- 9 files changed, 345 insertions(+), 94 deletions(-) -- 2.30.2
[PATCH kernel] powerpc/iommu: Fix iommu_table_in_use for a small default DMA window case
The existing iommu_table_in_use() helper checks if the kernel is using any of the TCEs. There are some reserved TCEs:
1) the very first one, if the DMA window starts at 0, to avoid having a zero but still valid DMA handle;
2) it_reserved_start..it_reserved_end, to exclude the MMIO32 window in case the default window spans across it - this is the default for the first DMA window on PowerNV.

When 1) is the case and 2) is not, the helper does not skip 1) and returns the wrong status. This only seems to occur when passing through a PCI device to a nested guest (not something we support really well), so it has not been seen before. This fixes the bug by adding a special case for no MMIO32 reservation.

Fixes: 3c33066a2190 ("powerpc/kernel/iommu: Add new iommu_table_in_use() helper")
Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/iommu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7e56ddb3e0b9..caebe1431596 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -775,6 +775,11 @@ bool iommu_table_in_use(struct iommu_table *tbl)
 	/* ignore reserved bit0 */
 	if (tbl->it_offset == 0)
 		start = 1;
+
+	/* Simple case with no reserved MMIO32 region */
+	if (!tbl->it_reserved_start && !tbl->it_reserved_end)
+		return find_next_bit(tbl->it_map, tbl->it_size, start) != tbl->it_size;
+
 	end = tbl->it_reserved_start - tbl->it_offset;
 	if (find_next_bit(tbl->it_map, end, start) != end)
 		return true;
-- 
2.30.2
[PATCH kernel] powerpc/ioda/iommu/debugfs: Generate unique debugfs entries
The iommu_table::it_index is a LIOBN which is not initialized on PowerNV as it is not used except by IOMMU debugfs, where it is used for a node name. This initializes it_index with a unique number to avoid warnings and to have a node for every iommu_table. This should not cause any behavioral change without CONFIG_IOMMU_DEBUGFS.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c8cf2728031a..9de9b2fb163d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1609,6 +1609,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 	tbl->it_ops = &pnv_ioda1_iommu_ops;
 	pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
 	pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
+	tbl->it_index = (phb->hose->global_number << 16) | pe->pe_number;

 	if (!iommu_init_table(tbl, phb->hose->node, 0, 0))
 		panic("Failed to initialize iommu table");
@@ -1779,6 +1780,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 		res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
 	}

+	tbl->it_index = (pe->phb->hose->global_number << 16) | pe->pe_number;
 	if (iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end))
 		rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
 	else
-- 
2.30.2
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 7/12/22 04:46, Jason Gunthorpe wrote: On Mon, Jul 11, 2022 at 11:24:32PM +1000, Alexey Kardashevskiy wrote: I really think that for 5.19 we should really move this blocked domain business to Type1 like this: https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf This creates the same security bug for power we are discussing here. If you How so? attach_dev() on power uninitializes the DMA setup for the group on the hardware level, any other DMA user won't be able to initiate DMA. don't want to fix it then let's just merge this iommu_ops patch as is rather than mangle the core code. The core code should not be assuming iommu_ops != NULL, Type1 should, I thought that is the whole point of having Type1, why is it not the case anymore? -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 10/07/2022 22:32, Alexey Kardashevskiy wrote: On 10/07/2022 16:29, Jason Gunthorpe wrote: On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote: driver->ops->attach_group on POWER attaches a group so VFIO claims ownership over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does not need to keep track of the state, it is one group, one ownership transfer, easy. It should not change, I think you can just map the attach_dev to the group? There are multiple devices in a group, cannot just map 1:1. What exactly is the reason why iommu_group_claim_dma_owner() cannot stay inside Type1 (sorry if it was explained, I could have missed it)? It has nothing to do with type1 - the ownership system is designed to exclude other in-kernel drivers from using the group at the same time vfio is using the group. power still needs this protection regardless of whether it is using the formal iommu api or not. POWER deals with it in vfio_iommu_driver_ops::attach_group. I really think that for 5.19 we should really move this blocked domain business to Type1 like this: https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf Thanks, Also, from another mail, you said iommu_alloc_default_domain() should fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole iommu_group_claim_dma_owner() thing falls apart. Yes And iommu_ops::domain_alloc() is not told if it is asked to create a default domain, it only takes a type. "default domain" refers to the default type passed to domain_alloc(), it will never be blocking, so it will always fail on power. "default domain" is better understood as the domain used by the DMA API The DMA API on POWER does not use iommu_ops, it is dma_iommu_ops from arch/powerpc/kernel/dma-iommu.c from before 2005. so the default domain is type == 0 where 0 == BLOCKED. -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 10/07/2022 16:29, Jason Gunthorpe wrote: On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote: driver->ops->attach_group on POWER attaches a group so VFIO claims ownership over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does not need to keep track of the state, it is one group, one ownership transfer, easy. It should not change, I think you can just map the attach_dev to the group? There are multiple devices in a group, cannot just map 1:1. What exactly is the reason why iommu_group_claim_dma_owner() cannot stay inside Type1 (sorry if it was explained, I could have missed it)? It has nothing to do with type1 - the ownership system is designed to exclude other in-kernel drivers from using the group at the same time vfio is using the group. power still needs this protection regardless of whether it is using the formal iommu api or not. POWER deals with it in vfio_iommu_driver_ops::attach_group. Also, from another mail, you said iommu_alloc_default_domain() should fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole iommu_group_claim_dma_owner() thing falls apart. Yes And iommu_ops::domain_alloc() is not told if it is asked to create a default domain, it only takes a type. "default domain" refers to the default type passed to domain_alloc(), it will never be blocking, so it will always fail on power. "default domain" is better understood as the domain used by the DMA API The DMA API on POWER does not use iommu_ops, it is dma_iommu_ops from arch/powerpc/kernel/dma-iommu.c from before 2005. so the default domain is type == 0 where 0 == BLOCKED. -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 21:55, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote: For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. That will still cause a security problem because tce_iommu_take_ownership()/etc are called too late. This is the moment in the flow when the ownership must change away from the DMA API that power implements and to VFIO, not later. Trying to do that. vfio_group_set_container: iommu_group_claim_dma_owner driver->ops->attach_group iommu_group_claim_dma_owner sets a domain to a group. Good. But it attaches devices, not groups. Bad. driver->ops->attach_group on POWER attaches a group so VFIO claims ownership over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does not need to keep track of the state, it is one group, one ownership transfer, easy. What exactly is the reason why iommu_group_claim_dma_owner() cannot stay inside Type1 (sorry if it was explained, I could have missed it)? Also, from another mail, you said iommu_alloc_default_domain() should fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole iommu_group_claim_dma_owner() thing falls apart. And iommu_ops::domain_alloc() is not told if it is asked to create a default domain, it only takes a type. -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 23:19, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 11:10:07PM +1000, Alexey Kardashevskiy wrote: On 08/07/2022 21:55, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote: For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. That will still cause a security problem because tce_iommu_take_ownership()/etc are called too late. This is the moment in the flow when the ownership must change away from the DMA API that power implements and to VFIO, not later. It is getting better and better :) On POWERNV, at boot time the platform sets up PHBs, enables bypass, creates groups and attaches devices. As for now, attaching devices to the default domain (which is BLOCKED) fails the not-being-used check as enabled bypass means "everything is mapped for DMA". So at this point the default domain has to be IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_UNMANAGED so later on VFIO can move devices to IOMMU_DOMAIN_BLOCKED. Am I missing something? For power the default domain should be NULL NULL means that the platform is using the group to provide its DMA ops. IIRC this patch was already setup correctly to do this? The transition from NULL to blocking must isolate the group so all DMA is blocked. blocking to NULL should re-establish platform DMA API control. The default domain should be non-NULL when the normal dma-iommu stuff is providing the DMA API. So, I think it is already setup properly, it is just the question of what to do when entering/leaving blocking mode. Well, the patch calls iommu_probe_device() which calls iommu_alloc_default_domain() which creates IOMMU_DOMAIN_BLOCKED (==0) as nothing initialized iommu_def_domain_type. Need a different default type (and return NULL when IOMMU API tries creating this type)? Jason -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 21:55, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote: For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. That will still cause a security problem because tce_iommu_take_ownership()/etc are called too late. This is the moment in the flow when the ownership must change away from the DMA API that power implements and to VFIO, not later. It is getting better and better :) On POWERNV, at boot time the platform sets up PHBs, enables bypass, creates groups and attaches devices. As for now, attaching devices to the default domain (which is BLOCKED) fails the not-being-used check as enabled bypass means "everything is mapped for DMA". So at this point the default domain has to be IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_UNMANAGED so later on VFIO can move devices to IOMMU_DOMAIN_BLOCKED. Am I missing something? Jason -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 17:32, Tian, Kevin wrote: From: Alexey Kardashevskiy Sent: Friday, July 8, 2022 2:35 PM On 7/8/22 15:00, Alexey Kardashevskiy wrote: On 7/8/22 01:10, Jason Gunthorpe wrote: On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote: Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. stale comment since this patch doesn't use bus_set_iommu() now. + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. > It will turn into a security problem to allow kernel mappings to leak > past this point. > I've added for v2 checking for no valid mappings for a device (or, more precisely, in the associated iommu_group), this domain does not need checking, right? Uff, not that simple. Looks like once a device is in a group, its dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess then there is a way to set those to NULL or do something similar to let dma_map_direct() from kernel/dma/mapping.c return "true", is not there? dev->dma_ops is NULL as long as you don't handle DMA domain type here and don't call iommu_setup_dma_ops(). Given this only supports blocking domain then above should be irrelevant. 
For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. Thanks, In general, is "domain" something from hardware or is it a software concept? Thanks, 'domain' is a software concept to represent the hardware I/O page table. A blocking domain means all DMAs from a device attached to this domain are blocked/rejected (equivalent to an empty I/O page table), usually enforced in the .attach_dev() callback. Yes, a comment for why having a NULL .attach_dev() is OK is welcomed. Making it NULL makes __iommu_attach_device() fail, .attach_dev() needs to return 0 in this crippled environment. Thanks for explaining the rest, good food for thought. Thanks Kevin -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 7/8/22 15:00, Alexey Kardashevskiy wrote: On 7/8/22 01:10, Jason Gunthorpe wrote: On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote: Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. Because now we have to set iommu_ops to the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(_bus_type) which is postcore_initcall TODO: check if it still works, read sha1, for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387 Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group. So, basically there is one iommu_device per PHB and devices are added to groups indirectly via series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix needed for barely supported platform in regard to VFIO). The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though. does it make sense to have this many callbacks, or the generic IOMMU code can safely operate without some (given I add some more checks for !NULL)? 
thanks, I wouldn't worry about it.. @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret); For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps? It looks to me that of_iommu.c expects the iommu setup to happen before linux starts as linux looks for #iommu-cells or iommu-map properties in the device tree. The powernv firmware (aka skiboot) does not do this and it is linux which manages iommu groups. +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev) +{ + return false; +} I think you can NULL this op: static bool iommu_is_attach_deferred(struct device *dev) { const struct iommu_ops *ops = dev_iommu_ops(dev); if (ops->is_attach_deferred) return ops->is_attach_deferred(dev); return false; } +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct pci_controller *hose; + struct pci_dev *pdev; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); + + pdev = to_pci_dev(dev); + hose = pdev->bus->sysdata; + + if (!hose->controller_ops.device_group) + return ERR_PTR(-ENOENT); + + return hose->controller_ops.device_group(hose, pdev); +} Is this missing a refcount get on the group? + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. 
> It will turn into a security problem to allow kernel mappings to leak past this point.

For v2 I've added a check that there are no valid mappings for a device (or, more precisely, for the associated iommu_group); this domain does not need checking, right?

Uff, not that simple. It looks like once a device is in a group, its dma_ops is set to iommu_dma_ops and the IOMMU code owns DMA. I guess there is then a way to set those to NULL or do something similar to let dma_map_direct() from kernel/dma/mapping.c return "true", isn't there?

For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. Thanks,

In general, is "domain" something from hardware or is it a software concept?
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 7/8/22 01:10, Jason Gunthorpe wrote: On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:

Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; a similar story applies to iommu_group_dma_owner_claimed().

This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides the minimum code for the probing to not crash.

Because now we have to set iommu_ops on the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(&pci_bus_type), which is a postcore_initcall. TODO: check if it still works; read sha1 for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387

Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group.

So, basically there is one iommu_device per PHB and devices are added to groups indirectly via a series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix is needed for that barely supported platform with regard to VFIO).

The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/

I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though.

Does it make sense to have this many callbacks, or can the generic IOMMU code safely operate without some of them (given I add some more checks for !NULL)? Thanks,

I wouldn't worry about it..
@@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret); For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps? It looks to me that of_iommu.c expects the iommu setup to happen before linux starts as linux looks for #iommu-cells or iommu-map properties in the device tree. The powernv firmware (aka skiboot) does not do this and it is linux which manages iommu groups. +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev) +{ + return false; +} I think you can NULL this op: static bool iommu_is_attach_deferred(struct device *dev) { const struct iommu_ops *ops = dev_iommu_ops(dev); if (ops->is_attach_deferred) return ops->is_attach_deferred(dev); return false; } +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct pci_controller *hose; + struct pci_dev *pdev; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); + + pdev = to_pci_dev(dev); + hose = pdev->bus->sysdata; + + if (!hose->controller_ops.device_group) + return ERR_PTR(-ENOENT); + + return hose->controller_ops.device_group(hose, pdev); +} Is this missing a refcount get on the group? + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. 
> It will turn into a security problem to allow kernel mappings to leak > past this point. > I've added for v2 checking for no valid mappings for a device (or, more precisely, in the associated iommu_group), this domain does not need checking, right? In general, is "domain" something from hardware or is it a software concept? Thanks, Jason -- Alexey
[PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; a similar story applies to iommu_group_dma_owner_claimed().

This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides the minimum code for the probing to not crash.

Because now we have to set iommu_ops on the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(&pci_bus_type), which is a postcore_initcall. TODO: check if it still works; read sha1 for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387

Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group.

So, basically there is one iommu_device per PHB and devices are added to groups indirectly via a series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix is needed for that barely supported platform with regard to VFIO).
The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Cc: Oliver O'Halloran Cc: Robin Murphy Cc: Jason Gunthorpe Cc: Alex Williamson Cc: Daniel Henrique Barboza Cc: Fabiano Rosas Cc: Murilo Opsfelder Araujo Cc: Nicholas Piggin Signed-off-by: Alexey Kardashevskiy --- does it make sense to have this many callbacks, or the generic IOMMU code can safely operate without some (given I add some more checks for !NULL)? thanks, --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/include/asm/pci-bridge.h | 7 ++ arch/powerpc/kernel/iommu.c | 106 +- arch/powerpc/kernel/pci-common.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 40 5 files changed, 155 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..4bdae0ee29d0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern const struct iommu_ops spapr_tce_iommu_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index c85f901227c9..338a45b410b4 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -8,6 +8,7 @@ #include #include #include +#include struct device_node; @@ -44,6 +45,9 @@ struct pci_controller_ops { #endif void(*shutdown)(struct pci_controller *hose); + + struct iommu_group *(*device_group)(struct pci_controller *hose, + struct pci_dev *pdev); }; /* @@ -131,6 +135,9 @@ struct 
pci_controller { struct irq_domain *dev_domain; struct irq_domain *msi_domain; struct fwnode_handle*fwnode; + + /* iommu_ops support */ + struct iommu_device iommu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7e56ddb3e0b9..c4c7eb596fef 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1138,6 +1138,8 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership); int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { + int ret; + /* * The sysfs entries should be populated before * binding IOMMU group. If sysfs entries isn't @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret); + + return ret; } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1176,4 +1181,103 @@ void iommu_del_device(struct devi
[PATCH kernel] powerpc/iommu: Add simple iommu_ops to report capabilities
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64. This adds a simple iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Signed-off-by: Alexey Kardashevskiy --- I have not looked into the domains for ages, what is missing here? With this on top of 5.19-rc1 VFIO works again on my POWER9 box. Thanks, --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/kernel/iommu.c | 70 arch/powerpc/kernel/pci_64.c | 3 ++ 3 files changed, 75 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..4bdae0ee29d0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern const struct iommu_ops spapr_tce_iommu_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7e56ddb3e0b9..2205b448f7d5 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1176,4 +1176,74 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); + +/* + * A simple iommu_ops to 
allow less cruft in generic VFIO code. + */ +static bool spapr_tce_iommu_capable(enum iommu_cap cap) +{ + switch (cap) { + case IOMMU_CAP_CACHE_COHERENCY: + return true; + default: + break; + } + + return false; +} + +static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type) +{ + struct iommu_domain *domain = kzalloc(sizeof(*domain), GFP_KERNEL); + + if (!domain) + return NULL; + + domain->geometry.aperture_start = 0; + domain->geometry.aperture_end = ~0ULL; + domain->geometry.force_aperture = true; + + return domain; +} + +static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev) +{ + struct iommu_device *iommu_dev = kzalloc(sizeof(struct iommu_device), GFP_KERNEL); + + iommu_dev->dev = dev; + iommu_dev->ops = &spapr_tce_iommu_ops; + + return iommu_dev; +} + +static void spapr_tce_iommu_release_device(struct device *dev) +{ +} + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} + +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct iommu_group *grp = dev->iommu_group; + + if (!grp) + grp = ERR_PTR(-ENODEV); + return grp; +} + +const struct iommu_ops spapr_tce_iommu_ops = { + .capable = spapr_tce_iommu_capable, + .domain_alloc = spapr_tce_iommu_domain_alloc, + .probe_device = spapr_tce_iommu_probe_device, + .release_device = spapr_tce_iommu_release_device, + .device_group = spapr_tce_iommu_device_group, + .default_domain_ops = &(const struct iommu_domain_ops) { + .attach_dev = spapr_tce_iommu_attach_dev, + } +}; + #endif /* CONFIG_IOMMU_API */ diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c index 19b03ddf5631..04bc0c52e45c 100644 --- a/arch/powerpc/kernel/pci_64.c +++ b/arch/powerpc/kernel/pci_64.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -27,6 +28,7 @@ #include #include #include +#include /* pci_io_base -- the base address from which io bars are offsets. 
* This is the lowest I/O base address (so bar values are always positive), @@ -69,6 +71,7 @@ static int __init pcibios_init(void) ppc_md.pcibios_fixup(); printk(KERN_DEBUG "PCI: Probing PCI hardware done\n"); + bus_set_iommu(&pci_bus_type, &spapr_tce_iommu_ops); return 0; } -- 2.30.2
[PATCH llvm v2] powerpc/llvm/lto: Allow LLVM LTO builds
This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTR alternative section if alt branches use "bc" (Branch Conditional), which is limited to 16-bit offsets. This shows up in errors like:

ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0)

This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b", which allows 26-bit offsets. This catches the problem instructions in vmlinux.o before it is LTO'ed:

$ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\<bc\>'
30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30>
f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0>

This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds, but this is not POWERPC-specific. This makes the copy routines slower on POWER6 as this partially reverts a4e22f02f5b6 ("powerpc: Update 64bit __copy_tofrom_user() using CPU_FTR_UNALIGNED_LD_STD") Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * dropped FTR sections which were only meant to improve POWER6 as Paul suggested --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. 
For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ --- arch/powerpc/Kconfig | 2 ++ arch/powerpc/kernel/exceptions-64s.S | 4 +++- arch/powerpc/lib/copyuser_64.S | 15 +-- arch/powerpc/lib/feature-fixups-test.S | 3 +-- arch/powerpc/lib/memcpy_64.S | 14 +- 5 files changed, 8 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3eaddb8997a9..35050264ea7b 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -162,6 +162,8 @@ config PPC select ARCH_WANTS_MODULES_DATA_IN_VMALLOC if PPC_BOOK3S_32 || PPC_8xx select ARCH_WANTS_NO_INSTR select ARCH_WEAK_RELEASE_ACQUIRE + select ARCH_SUPPORTS_LTO_CLANG + select ARCH_SUPPORTS_LTO_CLANG_THIN select BINFMT_ELF select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b66dd6f775a4..5b783bd51260 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,11 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE - bne masked_interrupt ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S index db8719a14846..b914e52ed240 100644 --- a/arch/powerpc/lib/copyuser_64.S +++ b/arch/powerpc/lib/copyuser_64.S @@ -9,7 +9,7 @@ #include #ifndef SELFTEST_CASE -/* 0 == most CPUs, 1 == POWER6, 2 == Cell */ +/* 0 == most CPUs, 2 == Cell */ #define SELFTEST_CASE 0 #endif @@ -68,19 +68,6 @@ _GLOBAL(__copy_tofrom_user_base) andi. r6,r6,7 PPC_MTOCRF(0x01,r5) blt cr1,.Lshort_copy -/* Below we want to nop out the bne if we're on a CPU that has the - * CPU_FTR_UNALIGNED_LD_STD bit set and the CPU_FTR_CP_USE_DCBTZ bit - * cleared. 
- * At the time of writing the only CPU that has this combination of bits - * set is Power6. - */ -test_feature = (SELFTEST_CASE == 1) -BEGIN_FTR_SECTION - nop -FTR_SECTION_ELSE - bne .Ldst_unaligned -ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ - CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: addir3,r3,-16 r3_offset = 16 diff --git a/arch/powerpc/lib/feature-fixups-test.S b/arch/powerpc/lib/feature-fixups-test.S index 480172fbd024..2751e42a9fd7 100644 --- a/arch/powerpc/lib/feature-fixups-test.S +++ b/arch/powerpc/lib/feature-fixups-test.S @@ -145,7 +145,6 @@ BEGIN_FTR_SECTION FTR_SECTION_ELSE 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b @@ -160,10 +159,10 @@ globl(ftr_fixup_test6_expected) 1: or 1,1,1 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b + nop 3: or 1,1,1 or 2,2,2 or 3,3,3 diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S index 016c91e958d8..117399dbc891 100644 --- a/arch/powerpc/lib/me
[PATCH kernel v2] pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window
The pseries platform uses a 32bit default DMA window (always 4K pages) and an optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), with 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM, which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs, so certain OSes do not need to use the DDW API in order to utilize all available TCE budget.

Linux on the other hand removes the default window and creates a bigger one (with more TCEs and/or a bigger page size - 64K/2M) in a bid to map the entire RAM; even if the new window is smaller than RAM, Linux still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not.

When kdump is invoked, the existing code tries reusing the existing 64bit DMA window whose location and parameters are stored in the device tree, but this fails as the new property does not make it to the kdump device tree blob. So the code falls back to the default window, which does not exist anymore although the device tree says that it does. The result is that PCI devices become unusable and cannot be used for kdumping.

This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices.

Because the DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. It is harmless for kdump as it does not touch the actual window (it only reads what is mapped and marks those IO pages as used) but it is bad for kexec, which clears it thinking it is a smaller default window rather than a bigger DDW window. 
This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy --- Looks like SYSCALL(kexec_file_load) never supported DMA, so it could be: Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") --- Changes: v2: * reworked enable_ddw() to reuse default_win * removed @tbl as it was only used once and later on this zeroes it * undef for xxx64_PROPNAME in file_load_64.c * renamed new functions in file_load_64.c --- arch/powerpc/kexec/file_load_64.c | 54 arch/powerpc/platforms/pseries/iommu.c | 89 ++ 2 files changed, 102 insertions(+), 41 deletions(-) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index b4981b651d9a..5d2c22aa34fb 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1038,6 +1038,48 @@ static int update_cpus_node(void *fdt) return ret; } +static int copy_property(void *fdt, int node_offset, const struct device_node *dn, +const char *propname) +{ + const void *prop, *fdtprop; + int len = 0, fdtlen = 0, ret; + + prop = of_get_property(dn, propname, &len); + fdtprop = fdt_getprop(fdt, node_offset, propname, &fdtlen); + + if (fdtprop && !prop) + ret = fdt_delprop(fdt, node_offset, propname); + else if (prop) + ret = fdt_setprop(fdt, node_offset, propname, prop, len); + + return ret; +} + +static int update_pci_dma_nodes(void *fdt, const char *dmapropname) +{ + struct device_node *dn; + int pci_offset, root_offset, ret = 0; + + if (!firmware_has_feature(FW_FEATURE_LPAR)) + return 0; + + root_offset = fdt_path_offset(fdt, "/"); + for_each_node_with_property(dn, dmapropname) { + pci_offset = fdt_subnode_offset(fdt, root_offset, of_node_full_name(dn)); + if (pci_offset < 0) + continue; + + ret = copy_property(fdt, pci_offset, dn, "ibm,dma-window"); + if (ret < 0) + break; + ret = 
copy_property(fdt, pci_offset, dn, dmapropname); + if (ret < 0) + break; + } + + return ret; +} + /** * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel * being loaded. @@ -1099,6 +1141,18 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, if (ret < 0) goto out; +#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" + ret = update_pci_dma_nodes(fdt, DIRECT64_PROPNAME); + if (ret < 0) + goto out; + + ret = update_pci_dma_nodes(fdt,
[PATCH kernel] KVM: PPC: Do not warn when userspace asked for too big TCE table
KVM manages emulated TCE tables for guest LIOBNs by a two level table which maps up to 128TiB with 16MB IOMMU pages (enabled in QEMU by default) and MAX_ORDER=11 (the kernel's default). Note that the last level of the table is allocated when actual TCE is updated. However these tables are created via ioctl() on kvmfd and the userspace can trigger WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp) in mm/page_alloc.c and flood dmesg. This adds __GFP_NOWARN. Signed-off-by: Alexey Kardashevskiy --- We could probably switch to vmalloc() to allow even bigger emulated tables which we do not really want the userspace to create though. --- arch/powerpc/kvm/book3s_64_vio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index d6589c4fe889..40864373ef87 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -307,7 +307,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, return ret; ret = -ENOMEM; - stt = kzalloc(struct_size(stt, pages, npages), GFP_KERNEL); + stt = kzalloc(struct_size(stt, pages, npages), GFP_KERNEL | __GFP_NOWARN); if (!stt) goto fail_acct; -- 2.30.2
Re: [PATCH kernel] pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window
On 6/27/22 14:10, Russell Currey wrote: On Thu, 2022-06-16 at 17:59 +1000, Alexey Kardashevskiy wrote:

The pseries platform uses a 32bit default DMA window (always 4K pages) and an optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), with 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM, which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs, so certain OSes do not need to use the DDW API in order to utilize all available TCE budget.

Linux on the other hand removes the default window and creates a bigger one (with more TCEs and/or a bigger page size - 64K/2M) in a bid to map the entire RAM; even if the new window is smaller than RAM, Linux still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not.

When kdump is invoked, the existing code tries reusing the existing 64bit DMA window whose location and parameters are stored in the device tree, but this fails as the new property does not make it to the kdump device tree blob. So the code falls back to the default window, which does not exist anymore although the device tree says that it does. The result is that PCI devices become unusable and cannot be used for kdumping.

This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices.

Because the DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. 
It is harmless for kdump as it does not touch the actual window (only reads what is mapped and marks those IO pages as used) but it is bad for kexec which clears it thinking it is a smaller default window rather than a bigger DDW window. This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Signed-off-by: Alexey Kardashevskiy Hey Alexey, great description of the problem. Would this need a Fixes: tag? Maybe. But which patch does it fix really - the one which did not preserve the dma64 properties or the one which started removing the default window? :) --- arch/powerpc/kexec/file_load_64.c | 52 +++ arch/powerpc/platforms/pseries/iommu.c | 88 +++- -- 2 files changed, 102 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index b4981b651d9a..b4b486b68b63 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1038,6 +1038,48 @@ static int update_cpus_node(void *fdt) return ret; } +static int copy_dma_property(void *fdt, int node_offset, const struct device_node *dn, + const char *propname) +{ + const void *prop, *fdtprop; + int len = 0, fdtlen = 0, ret; + + prop = of_get_property(dn, propname, &len); + fdtprop = fdt_getprop(fdt, node_offset, propname, &fdtlen); + + if (fdtprop && !prop) + ret = fdt_delprop(fdt, node_offset, propname); + else if (prop) + ret = fdt_setprop(fdt, node_offset, propname, prop, len); If fdtprop and prop are both false, ret is returned uninitialised. 
+ + return ret; +} + +static int update_pci_nodes(void *fdt, const char *dmapropname) +{ + struct device_node *dn; + int pci_offset, root_offset, ret = 0; + + if (!firmware_has_feature(FW_FEATURE_LPAR)) + return 0; + + root_offset = fdt_path_offset(fdt, "/"); + for_each_node_with_property(dn, dmapropname) { + pci_offset = fdt_subnode_offset(fdt, root_offset, of_node_full_name(dn)); + if (pci_offset < 0) + continue; + + ret = copy_dma_property(fdt, pci_offset, dn, "ibm,dma-window"); + if (ret < 0) + break; + ret = copy_dma_property(fdt, pci_offset, dn, dmapropname); + if (ret < 0) + break; + } + + return ret; +} + /** * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel * being loaded. @@ -1099,6 +1141,16 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, if (ret < 0) goto out; +#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" Instead of having these defined in two different places, could they be moved out of iommu.c and into a header? Though we hardcode ibm,dma-window everywhere anyway. These properties are f
[PATCH kernel] KVM: PPC: Book3s: Fix warning about xics_rm_h_xirr_x
This fixes "no previous prototype": arch/powerpc/kvm/book3s_hv_rm_xics.c:482:15: warning: no previous prototype for 'xics_rm_h_xirr_x' [-Wmissing-prototypes] Reported by the kernel test robot. Fixes: b22af9041927 ("KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers") Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_xics.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h index 8e4c79e2fcd8..08fb0843faf5 100644 --- a/arch/powerpc/kvm/book3s_xics.h +++ b/arch/powerpc/kvm/book3s_xics.h @@ -143,6 +143,7 @@ static inline struct kvmppc_ics *kvmppc_xics_find_ics(struct kvmppc_xics *xics, } extern unsigned long xics_rm_h_xirr(struct kvm_vcpu *vcpu); +extern unsigned long xics_rm_h_xirr_x(struct kvm_vcpu *vcpu); extern int xics_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, unsigned long mfrr); extern int xics_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -- 2.30.2
Re: [PATCH v2 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On 6/3/22 03:53, Scott Cheloha wrote: PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. Signed-off-by: Scott Cheloha Besides the patch ordering and 0444 vs. 0644 (which is up to the PPC maintainer to decide anyway :) ), looks good to me. Reviewed-by: Alexey Kardashevskiy --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 264 ++ 4 files changed, 285 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..29153eed6689 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 0 (power off), 1 (restart), + 2 (dump and restart). (default=1) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. 
+ (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..cfe53587457d --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,264 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. + */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * The H_WATCHDOG hypercall first appears in PAPR v2.12 and is + * described fully in sections 14.5 and 14.15.6. 
+ * + * + * H_WATCHDOG Input + * + * R4: "flags": + * + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + * 0x02 Stop Watchdog + * 0x03 Query Watchdog Capabilities + */ +#define PSERIES_WDTF_OP(op)SETFIELD((op), 48, 55) +#define PSERIES_WDTF_OP_START PSERIES_WDTF_OP(0x1) +#define PSERIES_WDTF_OP_STOP PSERIES_WDTF_OP(0x2) +#define PSERIES_WDTF_OP_QUERY PSERIES_WDTF_OP(0x3) + +/* + * Bits 56-63: "timeoutAction" (for "Start Watchdog" only) + * + * 0x01 Hard poweroff + * 0x02 Hard restart + * 0x03 Dump restart + */ +#define PSERIES_WDTF_ACTION(ac)SETFIELD(ac, 56, 63) +#define PSERIES_WDTF_ACTION_HARD_POWEROFF PSERIES_WDTF_ACTION(0x1) +#define PSERIES_WDTF_ACTION_HARD_RESTART PSERIES_WDTF_ACTION(0x2) +#define PSERIES_WDTF_ACTION_DUMP_RES
[PATCH kernel] pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window
The pseries platform uses a 32bit default DMA window (always 4K pages) and an optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), with 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM, which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs so certain OSes do not need to use the DDW API in order to utilize all available TCE budget. Linux on the other hand removes the default window and creates a bigger one (with more TCEs and/or a bigger page size - 64K/2M) in a bid to map the entire RAM; even if the new window turns out smaller than RAM, it still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not. When kdump is invoked, the existing code tries reusing the existing 64bit DMA window whose location and parameters are stored in the device tree, but this fails as the new property does not make it into the kdump device tree blob. So the code falls back to the default window which does not exist anymore, although the device tree says that it does. The result is that PCI devices become unusable and cannot be used for kdumping. This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices. Because the DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. This is harmless for kdump, which does not touch the actual window (it only reads what is mapped and marks those IO pages as used), but it is bad for kexec, which clears it thinking it is a smaller default window rather than a bigger DDW window.
This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kexec/file_load_64.c | 52 +++ arch/powerpc/platforms/pseries/iommu.c | 88 +++--- 2 files changed, 102 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index b4981b651d9a..b4b486b68b63 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1038,6 +1038,48 @@ static int update_cpus_node(void *fdt) return ret; } +static int copy_dma_property(void *fdt, int node_offset, const struct device_node *dn, +const char *propname) +{ + const void *prop, *fdtprop; + int len = 0, fdtlen = 0, ret; + + prop = of_get_property(dn, propname, &len); + fdtprop = fdt_getprop(fdt, node_offset, propname, &fdtlen); + + if (fdtprop && !prop) + ret = fdt_delprop(fdt, node_offset, propname); + else if (prop) + ret = fdt_setprop(fdt, node_offset, propname, prop, len); + + return ret; +} + +static int update_pci_nodes(void *fdt, const char *dmapropname) +{ + struct device_node *dn; + int pci_offset, root_offset, ret = 0; + + if (!firmware_has_feature(FW_FEATURE_LPAR)) + return 0; + + root_offset = fdt_path_offset(fdt, "/"); + for_each_node_with_property(dn, dmapropname) { + pci_offset = fdt_subnode_offset(fdt, root_offset, of_node_full_name(dn)); + if (pci_offset < 0) + continue; + + ret = copy_dma_property(fdt, pci_offset, dn, "ibm,dma-window"); + if (ret < 0) + break; + ret = copy_dma_property(fdt, pci_offset, dn, dmapropname); + if (ret < 0) + break; + } + + return ret; +} + /** * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel * being loaded. 
@@ -1099,6 +1141,16 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, if (ret < 0) goto out; +#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" + ret = update_pci_nodes(fdt, DIRECT64_PROPNAME); + if (ret < 0) + goto out; + + ret = update_pci_nodes(fdt, DMA64_PROPNAME); + if (ret < 0) + goto out; + /* Update memory reserve map */ ret = get_reserved_memory_ranges(); if (ret) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index fba64304e859..af3c871668df 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -700,6 +700,33 @@ struct iommu_table_ops iommu_table_lpar_multi_ops = { .get
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
On 6/2/22 00:48, Scott Cheloha wrote: On Wed, May 25, 2022 at 04:35:11PM +1000, Alexey Kardashevskiy wrote: On 5/21/22 04:35, Scott Cheloha wrote: PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. This one should go before 4/4 in the series for bisectability. What is platform_device_register_simple("pseries-wdt",...) going to do without the driver? This is a chicken-and-egg problem without an obvious solution. A device without a driver is a body without a soul. A driver without a device is a ghost without a machine. ... or something like that, don't quote me :) Absent some very compelling reasoning, I would like to keep the current order. It feels logical to me to keep the powerpc/pseries patches adjacent and prior to the watchdog driver patch. Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 337 ++ 4 files changed, 358 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..4ffe725e796c 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 1 (power off), 2 (restart), + 3 (dump and restart). 
(default=2) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..f41bc4d3b7a2 --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. 
+ */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * H_WATCHDOG Hypercall Input + * + * R4: "flags": + * + * A 64-bit value structured as follows: + * + * Bits 0-46: Reserved (must be zero). + */ +#define PSERIES_WDTF_RESERVED PPC_BITMASK(0, 46) + +/* + * Bit 47: "leaveOtherWatchdogsRunningOnTimeout" + * + * 0 Stop outstanding watchdogs on timeout. + * 1 Leave outstanding watchdogs running on timeout. + */ +#define PSERIES_WDTF_LEAVE_OTHER PPC_BIT(47) + +/* + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + *
[PATCH kernel] powerpc/pseries/iommu: Print ibm,query-pe-dma-windows parameters
PowerVM has a stricter policy about allocating TCEs for LPARs and often there are not enough TCEs for 1:1 mapping; this adds the supported numbers into dev_info() to help analyze bug reports. Signed-off-by: Alexey Kardashevskiy --- A PowerVM admin can enable "enlarged IO capacity" for a passed through PCI device but there is no way from inside LPAR to know if that worked or how many more TCEs became available. --- arch/powerpc/platforms/pseries/iommu.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 7639e7355df2..84edc8d730e1 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1022,9 +1022,6 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out, cfg_addr, BUID_HI(buid), BUID_LO(buid)); - dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n", -ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid), -BUID_LO(buid), ret); switch (out_sz) { case 5: @@ -1042,6 +1039,11 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, break; } + dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d, lb=%llx ps=%x wn=%d\n", +ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid), +BUID_LO(buid), ret, query->largest_available_block, +query->page_size, query->windows_available); + return ret; } -- 2.30.2
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
static int pseries_wdt_start(struct watchdog_device *wdd) +{ + struct device *dev = wdd->parent; + struct pseries_wdt *pw = watchdog_get_drvdata(wdd); + unsigned long flags, msecs; + long rc; + + flags = action | PSERIES_WDTF_OP_START; + msecs = wdd->timeout * 1000UL; + rc = plpar_hcall_norets(H_WATCHDOG, flags, pw->num, msecs); + if (rc != H_SUCCESS) { + dev_crit(dev, "H_WATCHDOG: %ld: failed to start timer %lu", +rc, pw->num); + return -EIO; + } + return 0; +} + +static int pseries_wdt_stop(struct watchdog_device *wdd) +{ + struct device *dev = wdd->parent; + struct pseries_wdt *pw = watchdog_get_drvdata(wdd); + long rc; + + rc = plpar_hcall_norets(H_WATCHDOG, PSERIES_WDTF_OP_STOP, pw->num); + if (rc != H_SUCCESS && rc != H_NOOP) { + dev_crit(dev, "H_WATCHDOG: %ld: failed to stop timer %lu", +rc, pw->num); + return -EIO; + } + return 0; +} + +static struct watchdog_info pseries_wdt_info = { + .identity = DRV_NAME, + .options = WDIOF_KEEPALIVEPING | WDIOF_MAGICCLOSE | WDIOF_SETTIMEOUT + | WDIOF_PRETIMEOUT, +}; + +static const struct watchdog_ops pseries_wdt_ops = { + .owner = THIS_MODULE, + .start = pseries_wdt_start, + .stop = pseries_wdt_stop, +}; + +static int pseries_wdt_probe(struct platform_device *pdev) +{ + unsigned long ret[PLPAR_HCALL_BUFSIZE] = { 0 }; + unsigned long cap, min_timeout_ms; + long rc; + struct pseries_wdt *pw; + int err; + + rc = plpar_hcall(H_WATCHDOG, ret, PSERIES_WDTF_OP_QUERY); + if (rc != H_SUCCESS) + return rc == H_FUNCTION ? -ENODEV : -EIO; Nit: if (rc == H_FUNCTION) return -ENODEV; if (rc != H_SUCCESS) return -EIO; ? + cap = ret[0]; + + pw = devm_kzalloc(&pdev->dev, sizeof(*pw), GFP_KERNEL); + if (!pw) + return -ENOMEM; + + /* +* Assume watchdogNumber 1 for now. If we ever support +* multiple timers we will need to devise a way to choose a +* distinct watchdogNumber for each platform device at device +* registration time. 
+*/ + pw->num = 1; + + pw->wd.parent = &pdev->dev; + pw->wd.info = &pseries_wdt_info; + pw->wd.ops = &pseries_wdt_ops; + min_timeout_ms = PSERIES_WDTQ_MIN_TIMEOUT(cap); + pw->wd.min_timeout = roundup(min_timeout_ms, 1000) / 1000; + pw->wd.max_timeout = UINT_MAX; + watchdog_init_timeout(&pw->wd, timeout, NULL); If PSERIES_WDTF_OP_QUERY returns 2min and this driver's default is 1min, watchdog_init_timeout() returns an error, don't we want to handle it here? Thanks, + watchdog_set_nowayout(&pw->wd, nowayout); + watchdog_stop_on_reboot(&pw->wd); + watchdog_stop_on_unregister(&pw->wd); + watchdog_set_drvdata(&pw->wd, pw); + + err = devm_watchdog_register_device(&pdev->dev, &pw->wd); + if (err) + return err; + + platform_set_drvdata(pdev, &pw->wd); + + return 0; +} + +static int pseries_wdt_suspend(struct platform_device *pdev, pm_message_t state) +{ + struct watchdog_device *wd = platform_get_drvdata(pdev); + + if (watchdog_active(wd)) + return pseries_wdt_stop(wd); + return 0; +} + +static int pseries_wdt_resume(struct platform_device *pdev) +{ + struct watchdog_device *wd = platform_get_drvdata(pdev); + + if (watchdog_active(wd)) + return pseries_wdt_start(wd); + return 0; +} + +static const struct platform_device_id pseries_wdt_id[] = { + { .name = "pseries-wdt" }, + {} +}; +MODULE_DEVICE_TABLE(platform, pseries_wdt_id); + +static struct platform_driver pseries_wdt_driver = { + .driver = { + .name = DRV_NAME, + .owner = THIS_MODULE, + }, + .id_table = pseries_wdt_id, + .probe = pseries_wdt_probe, + .resume = pseries_wdt_resume, + .suspend = pseries_wdt_suspend, +}; +module_platform_driver(pseries_wdt_driver); + +MODULE_AUTHOR("Alexey Kardashevskiy "); +MODULE_AUTHOR("Scott Cheloha "); +MODULE_DESCRIPTION("POWER Architecture Platform Watchdog Driver"); +MODULE_LICENSE("GPL"); -- Alexey
Re: [PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
On 5/4/22 17:48, Alexey Kardashevskiy wrote: When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moved the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy --- Or I could move this one together with KVM_CAP_IRQFD. Thoughts? Ping? --- arch/arm64/kvm/arm.c | 3 +++ arch/mips/kvm/mips.c | 3 +++ arch/powerpc/kvm/powerpc.c | 6 ++ arch/riscv/kvm/vm.c| 3 +++ arch/s390/kvm/kvm-s390.c | 3 +++ arch/x86/kvm/x86.c | 3 +++ virt/kvm/kvm_main.c| 1 - 7 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 523bc934fe2f..092f0614bae3 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -210,6 +210,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index a25e0b73ee70..0f3de470a73e 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_SYNC_MMU: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 875c30c12db0..87698ffef3be 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -591,6 +591,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) break; #endif +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: + r = !xive_enabled(); + break; +#endif 
+ case KVM_CAP_PPC_ALLOC_HTAB: r = hv_enabled; break; diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index c768f75279ef..b58579b386bb 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -63,6 +63,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_MP_STATE: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 156d1c25a3c1..85e093fc8d13 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -564,6 +564,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_S390_DIAG318: case KVM_CAP_S390_MEM_OP_EXTENSION: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c0ca599a353..a0a7b769483d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4273,6 +4273,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SYS_ATTRIBUTES: case KVM_CAP_VAPIC: case KVM_CAP_ENABLE_CAP: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_EXIT_HYPERCALL: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 70e05af5ebea..885e72e668a5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4293,7 +4293,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) #endif #ifdef CONFIG_HAVE_KVM_IRQFD case KVM_CAP_IRQFD: - case KVM_CAP_IRQFD_RESAMPLE: #endif case KVM_CAP_IOEVENTFD_ANY_LENGTH: case KVM_CAP_CHECK_EXTENSION_VM: -- Alexey
Re: [PATCH kernel] KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers
On 5/11/22 03:58, Cédric Le Goater wrote: Hello Alexey, On 5/9/22 09:11, Alexey Kardashevskiy wrote: Currently we have 2 sets of interrupt controller hypercall handlers for real and virtual modes, this is from POWER8 times when switching MMU on was considered an expensive operation. POWER9 however does not have dependent threads and MMU is enabled for handling hcalls so the XIVE native or XICS-on-XIVE real mode handlers never execute on real P9 and later CPUs. XIVE native does not have any real-mode hcall handlers. In fact, all are handled at the QEMU level. They are not? I am surprised. It must be a "recent" change. Anyhow, if you can remove them safely, this is good news and you should be able to clean up some more code in the PowerNV native interface. Yes, this is the result of that massive work of Nick to move the KVM's asm to C for P9. It could have been the case even before that but harder to see in that asm code :) This untemplates the handlers, only keeps the real mode handlers for XICS native (up to POWER8) and removes the rest of the dead code. Changes in functions are mechanical except a few missing empty lines to make checkpatch.pl happy. The default implemented hcalls list already contains XICS hcalls so no change there. This should not cause any behavioral change. In the worst case, it impacts performance a bit but only on "old" distros (kernel < 4.14), I doubt anyone will complain. Signed-off-by: Alexey Kardashevskiy Acked-by: Cédric Le Goater Thanks! Thanks, C. 
--- arch/powerpc/kvm/Makefile | 2 +- arch/powerpc/include/asm/kvm_ppc.h | 7 - arch/powerpc/kvm/book3s_xive.h | 7 - arch/powerpc/kvm/book3s_hv_builtin.c | 64 --- arch/powerpc/kvm/book3s_hv_rm_xics.c | 5 + arch/powerpc/kvm/book3s_hv_rm_xive.c | 46 -- arch/powerpc/kvm/book3s_xive.c | 638 +++- arch/powerpc/kvm/book3s_xive_template.c | 636 --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +- 9 files changed, 632 insertions(+), 785 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive.c delete mode 100644 arch/powerpc/kvm/book3s_xive_template.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 8e3681a86074..f17379b0f161 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -73,7 +73,7 @@ kvm-hv-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm.o kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \ - book3s_hv_rm_xics.o book3s_hv_rm_xive.o + book3s_hv_rm_xics.o kvm-book3s_64-builtin-tm-objs-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm_builtin.o diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 44200a27371b..a775377a570e 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -787,13 +787,6 @@ long kvmppc_rm_h_page_init(struct kvm_vcpu *vcpu, unsigned long flags, unsigned long dest, unsigned long src); long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr, unsigned long slb_v, unsigned int status, bool data); -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_xirr_x(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, - unsigned long mfrr); -int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu); /* diff --git a/arch/powerpc/kvm/book3s_xive.h 
b/arch/powerpc/kvm/book3s_xive.h index 09d0657596c3..1e48f72e8aa5 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -285,13 +285,6 @@ static inline u32 __xive_read_eq(__be32 *qpage, u32 msk, u32 *idx, u32 *toggle) return cur & 0x7fff; } -extern unsigned long xive_rm_h_xirr(struct kvm_vcpu *vcpu); -extern unsigned long xive_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -extern int xive_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, - unsigned long mfrr); -extern int xive_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -extern int xive_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); - /* * Common Xive routines for XICS-over-XIVE and XIVE native */ diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 7e52d0beee77..88a8f6473c4e 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -489,70 +489,6 @@ static long kvmppc_read_one_intr(bool *again) return kvmppc_check_passthru(xisr, xirr, again); } -#ifdef CONFIG_KVM_XICS -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu) -{ - if (!kvmppc_xics_
Re: [PATCH 2/2] powerpc/vdso: Link with ld.lld when requested
On 5/10/22 06:46, Nathan Chancellor wrote: The PowerPC vDSO is linked with $(CC) instead of $(LD), which means the default linker of the compiler is used instead of the linker requested by the builder. $ make ARCH=powerpc LLVM=1 mrproper defconfig arch/powerpc/kernel/vdso/ ... $ llvm-readelf -p .comment arch/powerpc/kernel/vdso/vdso{32,64}.so.dbg File: arch/powerpc/kernel/vdso/vdso32.so.dbg String dump of section '.comment': [ 0] clang version 14.0.0 (Fedora 14.0.0-1.fc37) File: arch/powerpc/kernel/vdso/vdso64.so.dbg String dump of section '.comment': [ 0] clang version 14.0.0 (Fedora 14.0.0-1.fc37) The compiler option '-fuse-ld' tells the compiler which linker to use when it is invoked as both the compiler and linker. Use '-fuse-ld=lld' when LD=ld.lld has been specified (CONFIG_LD_IS_LLD) so that the vDSO is linked with the same linker as the rest of the kernel. $ llvm-readelf -p .comment arch/powerpc/kernel/vdso/vdso{32,64}.so.dbg File: arch/powerpc/kernel/vdso/vdso32.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) File: arch/powerpc/kernel/vdso/vdso64.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) LD can be a full path to ld.lld, which will not be handled properly by '-fuse-ld=lld' if the full path to ld.lld is outside of the compiler's search path. '-fuse-ld' can take a path to the linker but it is deprecated in clang 12.0.0; '--ld-path' is preferred for this scenario. Use '--ld-path' if it is supported, as it will handle a full path or just 'ld.lld' properly. See the LLVM commit below for the full details of '--ld-path'. 
Link: https://github.com/ClangBuiltLinux/linux/issues/774 Link: https://github.com/llvm/llvm-project/commit/1bc5c84710a8c73ef21295e63c19d10a8c71f2f5 Signed-off-by: Nathan Chancellor --- arch/powerpc/kernel/vdso/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile index 954974287ee7..096b0bf1335f 100644 --- a/arch/powerpc/kernel/vdso/Makefile +++ b/arch/powerpc/kernel/vdso/Makefile @@ -48,6 +48,7 @@ UBSAN_SANITIZE := n KASAN_SANITIZE := n ccflags-y := -shared -fno-common -fno-builtin -nostdlib -Wl,--hash-style=both +ccflags-$(CONFIG_LD_IS_LLD) += $(call cc-option,--ld-path=$(LD),-fuse-ld=lld) Out of curiosity - how does this work exactly? I can see --ld-path= in the output so it works but there is no -fuse-ld=lld, is the second argument of cc-option only picked when the first one is not supported? Anyway, Tested-by: Alexey Kardashevskiy Reviewed-by: Alexey Kardashevskiy CC32FLAGS := -Wl,-soname=linux-vdso32.so.1 -m32 AS32FLAGS := -D__VDSO32__ -s
Re: [PATCH 1/2] powerpc/vdso: Remove unused ENTRY in linker scripts
On 5/10/22 06:46, Nathan Chancellor wrote: When linking vdso{32,64}.so.dbg with ld.lld, there is a warning about not finding _start for the starting address: ld.lld: warning: cannot find entry symbol _start; not setting start address ld.lld: warning: cannot find entry symbol _start; not setting start address Looking at GCC + GNU ld, the entry point address is 0x0: $ llvm-readelf -h vdso{32,64}.so.dbg &| rg "(File|Entry point address):" File: vdso32.so.dbg Entry point address: 0x0 File: vdso64.so.dbg Entry point address: 0x0 This matches what ld.lld emits: $ powerpc64le-linux-gnu-readelf -p .comment vdso{32,64}.so.dbg File: vdso32.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) File: vdso64.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) $ llvm-readelf -h vdso{32,64}.so.dbg &| rg "(File|Entry point address):" File: vdso32.so.dbg Entry point address: 0x0 File: vdso64.so.dbg Entry point address: 0x0 Remove ENTRY to remove the warning, as it is unnecessary for the vDSO to function correctly. Sounds more like a bugfix to me - _start is simply not defined, I wonder why ld is not complaining. 
Tested-by: Alexey Kardashevskiy Reviewed-by: Alexey Kardashevskiy Signed-off-by: Nathan Chancellor --- arch/powerpc/kernel/vdso/vdso32.lds.S | 1 - arch/powerpc/kernel/vdso/vdso64.lds.S | 1 - 2 files changed, 2 deletions(-) diff --git a/arch/powerpc/kernel/vdso/vdso32.lds.S b/arch/powerpc/kernel/vdso/vdso32.lds.S index 58e0099f70f4..e0d19d74455f 100644 --- a/arch/powerpc/kernel/vdso/vdso32.lds.S +++ b/arch/powerpc/kernel/vdso/vdso32.lds.S @@ -13,7 +13,6 @@ OUTPUT_FORMAT("elf32-powerpcle", "elf32-powerpcle", "elf32-powerpcle") OUTPUT_FORMAT("elf32-powerpc", "elf32-powerpc", "elf32-powerpc") #endif OUTPUT_ARCH(powerpc:common) -ENTRY(_start) SECTIONS { diff --git a/arch/powerpc/kernel/vdso/vdso64.lds.S b/arch/powerpc/kernel/vdso/vdso64.lds.S index 0288cad428b0..1a4a7bc4c815 100644 --- a/arch/powerpc/kernel/vdso/vdso64.lds.S +++ b/arch/powerpc/kernel/vdso/vdso64.lds.S @@ -13,7 +13,6 @@ OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", "elf64-powerpcle") OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc") #endif OUTPUT_ARCH(powerpc:common64) -ENTRY(_start) SECTIONS {
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/9/22 15:18, Alexey Kardashevskiy wrote: On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) What option is that? 
I only add -DLLVM_ENABLE_LLD=ON which (I think) tells cmake to use lld to link the LLVM being built but does not seem to tell what the built clang should do. Without -DLLVM_ENABLE_LLD=ON, building just fails: [fstn1-p1 ~/pbuild/llvm/llvm-lto-latest-cleanbuild]$ ninja -j 100 [619/3501] Linking CXX executable bin/not FAILED: bin/not : && /usr/bin/clang++ -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -flto -O3 -DNDEBUG -flto -Wl,-rpath-link,/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/./lib -Wl,--gc-sections utils/not/CMakeFiles/not.dir/not.cpp.o -o bin/not -Wl,-rpath,"\$ORIGIN/../lib" -lpthread lib/libLLVMSupport.a -lrt -ldl -lpthread -lm /usr/lib/powerpc64le-linux-gnu/libz.so /usr/lib/powerpc64le-linux-gnu/libtinfo.so lib/libLLVMDemangle.a && : /usr/bin/ld: lib/libLLVMSupport.a: error adding symbols: archive has no index; run ranlib to add one clang: error: linker command failed with exit code 1 (use -v to see invocation) [701/3501] Building CXX object utils/TableGen/CMakeFiles/llvm-tblgen.dir/GlobalISelEmitter.cpp.o ninja: build stopped: subcommand failed. My head hurts :( The above example is running on PPC. Now I am trying x86 box: A bit of progress. 
cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_TARGET_ARCH=PowerPC -DLLVM_TARGETS_TO_BUILD=PowerPC ~/llvm-project//llvm -DLLVM_ENABLE_LTO=ON -DLLVM_BINUTILS_INCDIR=/usr/lib/gcc/powerpc64le-linux-gnu/11/plugin/include/ -DCMAKE_BUILD_TYPE=Release produces: -- Native target architecture is PowerPC -- LLVM host triple: x86_64-unknown-linux-gnu -- LLVM default target triple: x86_64-unknown-linux-gnu and the resulting "clang" can only do "Target: x86_64-unknown-linux-gnu". How do you build LLVM exactly? Thanks,
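For reference, the "default target triple" in the output above is controlled by a separate cmake variable from the backend list. A minimal sketch of an invocation that makes the built clang default to ppc64le on an x86 host (paths are placeholders, not taken from the thread):

```shell
# Build clang+lld whose *default* target is ppc64le, not the host triple.
# LLVM_TARGETS_TO_BUILD only selects which backends get compiled in;
# LLVM_DEFAULT_TARGET_TRIPLE controls what plain "clang" targets by default.
cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_TARGETS_TO_BUILD=PowerPC \
  -DLLVM_DEFAULT_TARGET_TRIPLE=powerpc64le-linux-gnu \
  -DCMAKE_BUILD_TYPE=Release \
  ~/llvm-project/llvm
ninja -j "$(nproc)" clang lld
```

With that set, `clang --version` should report `Target: powerpc64le-linux-gnu` without needing an explicit `--target=` on every invocation.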
[PATCH kernel] KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers
Currently we have 2 sets of interrupt controller hypercall handlers for real and virtual modes; this dates from POWER8 times when switching the MMU on was considered an expensive operation. POWER9, however, does not have dependent threads and the MMU is enabled for handling hcalls, so the XIVE native and XICS-on-XIVE real mode handlers never execute on real POWER9 and later CPUs. This untemplates the handlers, keeps the real mode handlers only for native XICS (up to POWER8), and removes the rest of the dead code. Changes in the functions are mechanical, except for a few empty lines added to make checkpatch.pl happy. The default implemented hcalls list already contains the XICS hcalls, so no change there. This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy --- Minus 153 lines nevertheless. --- arch/powerpc/kvm/Makefile | 2 +- arch/powerpc/include/asm/kvm_ppc.h | 7 - arch/powerpc/kvm/book3s_xive.h | 7 - arch/powerpc/kvm/book3s_hv_builtin.c| 64 --- arch/powerpc/kvm/book3s_hv_rm_xics.c| 5 + arch/powerpc/kvm/book3s_hv_rm_xive.c| 46 -- arch/powerpc/kvm/book3s_xive.c | 638 +++- arch/powerpc/kvm/book3s_xive_template.c | 636 --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +- 9 files changed, 632 insertions(+), 785 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive.c delete mode 100644 arch/powerpc/kvm/book3s_xive_template.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 8e3681a86074..f17379b0f161 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -73,7 +73,7 @@ kvm-hv-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm.o kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \ - book3s_hv_rm_xics.o book3s_hv_rm_xive.o + book3s_hv_rm_xics.o kvm-book3s_64-builtin-tm-objs-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm_builtin.o diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 44200a27371b..a775377a570e 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ 
b/arch/powerpc/include/asm/kvm_ppc.h @@ -787,13 +787,6 @@ long kvmppc_rm_h_page_init(struct kvm_vcpu *vcpu, unsigned long flags, unsigned long dest, unsigned long src); long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr, unsigned long slb_v, unsigned int status, bool data); -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_xirr_x(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, -unsigned long mfrr); -int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu); /* diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 09d0657596c3..1e48f72e8aa5 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -285,13 +285,6 @@ static inline u32 __xive_read_eq(__be32 *qpage, u32 msk, u32 *idx, u32 *toggle) return cur & 0x7fff; } -extern unsigned long xive_rm_h_xirr(struct kvm_vcpu *vcpu); -extern unsigned long xive_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -extern int xive_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, -unsigned long mfrr); -extern int xive_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -extern int xive_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); - /* * Common Xive routines for XICS-over-XIVE and XIVE native */ diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 7e52d0beee77..88a8f6473c4e 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -489,70 +489,6 @@ static long kvmppc_read_one_intr(bool *again) return kvmppc_check_passthru(xisr, xirr, again); } -#ifdef CONFIG_KVM_XICS -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu) -{ - if (!kvmppc_xics_enabled(vcpu)) - return H_TOO_HARD; - 
if (xics_on_xive()) - return xive_rm_h_xirr(vcpu); - else - return xics_rm_h_xirr(vcpu); -} - -unsigned long kvmppc_rm_h_xirr_x(struct kvm_vcpu *vcpu) -{ - if (!kvmppc_xics_enabled(vcpu)) - return H_TOO_HARD; - vcpu->arch.regs.gpr[5] = get_tb(); - if (xics_on_xive()) - return xive_rm_h_xirr(vcpu); - else - return xics_rm_h_xirr(vcpu); -} - -unsigned long kvmppc_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server) -{ - if (!kvmppc_xics_enabled(vcpu)) - return H_TOO_HARD; - if (xics_on_xive()) - return xive_rm_h_ipoll(vcpu, server); -
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) What option is that? 
I only add -DLLVM_ENABLE_LLD=ON which (I think) tells cmake to use lld to link the LLVM being built but does not seem to tell what the built clang should do. Without -DLLVM_ENABLE_LLD=ON, building just fails: [fstn1-p1 ~/pbuild/llvm/llvm-lto-latest-cleanbuild]$ ninja -j 100 [619/3501] Linking CXX executable bin/not FAILED: bin/not : && /usr/bin/clang++ -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -flto -O3 -DNDEBUG -flto -Wl,-rpath-link,/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/./lib -Wl,--gc-sections utils/not/CMakeFiles/not.dir/not.cpp.o -o bin/not -Wl,-rpath,"\$ORIGIN/../lib" -lpthread lib/libLLVMSupport.a -lrt -ldl -lpthread -lm /usr/lib/powerpc64le-linux-gnu/libz.so /usr/lib/powerpc64le-linux-gnu/libtinfo.so lib/libLLVMDemangle.a && : /usr/bin/ld: lib/libLLVMSupport.a: error adding symbols: archive has no index; run ranlib to add one clang: error: linker command failed with exit code 1 (use -v to see invocation) [701/3501] Building CXX object utils/TableGen/CMakeFiles/llvm-tblgen.dir/GlobalISelEmitter.cpp.o ninja: build stopped: subcommand failed. My head hurts :( The above example is running on PPC. 
Now I am trying x86 box: [2693/3505] Linking CXX shared library lib/libLTO.so.15git FAILED: lib/libLTO.so.15git : && /usr/bin/clang++ -fPIC -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -flto -O3 -DNDEBUG -Wl,-z,defs -Wl,-z,nodelete -fuse-ld=ld -flto -Wl,-rpath-link,/home/aik/llvm-build/./lib -Wl,--gc-sections -Wl,--version-script,"/home/aik/llvm-build/tools/lto/LTO.exports" -shared -Wl,-soname,libLTO.so.15git -o lib/libLTO.so.15git tools/lto/CMakeFiles/LTO.dir/LTODisassembler.cpp.o tools/lto/CMakeFiles/LTO.dir/lto.cpp.o -Wl,-rpath,"\$ORIGIN/../lib" lib/libLLVMPowerPCAsmParser.a lib/libLLVMPowerPCCodeGen.a lib/libLLVMPowerPCDesc.a lib/libLLVMPowerPCDisassembler.a lib/libLLVMPowerPCInfo.a lib/libLLVMBitReader.a lib/libLLVMCore.a lib/libLLVMCodeGen.a lib/libLLVMLTO.a lib/libLLVMMC.a lib/libLLVMMCDisassembler.a lib/libLLVMSupport.a lib/libLLVMTarget.a lib/libLLVMAsmPrinter.a lib/libLLVMGlobalISel.a lib/libLLVMSelectionDAG.
[PATCH kernel] KVM: PPC: Book3s: PR: Enable default TCE hypercalls
When KVM_CAP_PPC_ENABLE_HCALL was introduced, H_GET_TCE and H_PUT_TCE were already implemented and enabled by default; however H_GET_TCE was left out on PR KVM (probably because the handler was in the real mode code at the time). This enables H_GET_TCE by default. While at it, this wraps the cases in #ifdef CONFIG_SPAPR_TCE_IOMMU, just like HV KVM does. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_pr_papr.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c index dc4f51ac84bc..a1f2978b2a86 100644 --- a/arch/powerpc/kvm/book3s_pr_papr.c +++ b/arch/powerpc/kvm/book3s_pr_papr.c @@ -433,9 +433,12 @@ int kvmppc_hcall_impl_pr(unsigned long cmd) case H_REMOVE: case H_PROTECT: case H_BULK_REMOVE: +#ifdef CONFIG_SPAPR_TCE_IOMMU + case H_GET_TCE: case H_PUT_TCE: case H_PUT_TCE_INDIRECT: case H_STUFF_TCE: +#endif case H_CEDE: case H_LOGICAL_CI_LOAD: case H_LOGICAL_CI_STORE: @@ -464,7 +467,10 @@ static unsigned int default_hcall_list[] = { H_REMOVE, H_PROTECT, H_BULK_REMOVE, +#ifdef CONFIG_SPAPR_TCE_IOMMU + H_GET_TCE, H_PUT_TCE, +#endif H_CEDE, H_SET_MODE, #ifdef CONFIG_KVM_XICS -- 2.30.2
[PATCH kernel v2] KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers
LoPAPR defines a guest-visible IOMMU with hypercalls to use it - H_PUT_TCE etc. It was implemented first on POWER7, where hypercalls would trap into KVM in real mode (with the MMU off). The problem with real mode is that some memory is not available and some API usage crashed the host, while enabling the MMU was an expensive operation. The problems with the real mode handlers are:
1. Occasionally these cannot complete the request, so the code is copied+modified to work in virtual mode, and very little is shared;
2. The real mode handlers have to be linked into vmlinux to work;
3. An exception in real mode immediately reboots the machine.
If the small DMA window is used, the real mode handlers bring better performance. However, since POWER8 there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE), whose virtual mode handler is even closer to its real mode version. On POWER9, hypercalls trap straight into virtual mode, so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and MMU improvements in POWER9 and later, there is no point in duplicating the code. 32-bit passed-through devices may slow down, but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demonstrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and the related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. 
Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * update the list of enabled hcalls as they are removed from .S --- arch/powerpc/kvm/Makefile | 3 - arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/include/asm/kvm_ppc.h| 2 - arch/powerpc/include/asm/mmu_context.h| 5 - arch/powerpc/platforms/powernv/pci.h | 3 +- arch/powerpc/kernel/iommu.c | 4 +- arch/powerpc/kvm/book3s_64_vio.c | 43 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 672 -- arch/powerpc/kvm/book3s_hv.c | 6 + arch/powerpc/mm/book3s64/iommu_api.c | 68 -- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 5 +- arch/powerpc/platforms/powernv/pci-ioda.c | 46 +- arch/powerpc/platforms/pseries/iommu.c| 3 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 10 - 14 files changed, 75 insertions(+), 801 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 9bdfc8b50899..8e3681a86074 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -37,9 +37,6 @@ kvm-e500mc-objs := \ e500_emulate.o kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs) -kvm-book3s_64-builtin-objs-$(CONFIG_SPAPR_TCE_IOMMU) := \ - book3s_64_vio_hv.o - kvm-pr-y := \ fpu.o \ emulate.o \ diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d7912b66c874..7e29c73e3dd4 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -51,13 +51,11 @@ struct iommu_table_ops { int (*xchg_no_kill)(struct iommu_table *tbl, long index, unsigned long *hpa, - enum dma_data_direction *direction, - bool realmode); + enum dma_data_direction *direction); void (*tce_kill)(struct iommu_table *tbl, unsigned long index, - unsigned long pages, - bool realmode); + unsigned long pages); __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); #endif diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 838d4cb460b7..44200a27371b 100644 --- 
a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -177,8 +177,6 @@ extern void kvmppc_setup_partition_table(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); -extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm *kvm, unsigned long liobn); #define kvmppc_ioba_validate(stt, ioba, npages) \ (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \ (stt)->size, (ioba), (npages)) ?\ diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b8527a74bd4d..3f25bd3e14eb 100644 --- a/arch/powerpc/include/asm/mmu_contex
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 04/05/2022 17:11, Alexey Kardashevskiy wrote: On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) I was not. 
Just did clean build like this: mkdir ~/pbuild/llvm/llvm-lto-latest-cleanbuild cd ~/pbuild/llvm/llvm-lto-latest-cleanbuild CC='clang' CXX='clang++' cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_TARGETS_TO_BUILD=PowerPC ~/p/llvm/llvm-latest/llvm/ -DLLVM_ENABLE_LTO=ON -DLLVM_ENABLE_LLD=ON -DLLVM_BINUTILS_INCDIR=/usr/include -DCMAKE_BUILD_TYPE=Release ninja -j 50 It builds fine: [fstn1-p1 ~/p/kernels-llvm/llvm]$ find /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/ -iname LLVMgold.so -exec ls -l {} \; -rwxrwxr-x 1 aik aik 39032840 May 4 13:06 /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/lib/LLVMgold.so and then in the kernel tree: PATH=/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/bin:$PATH make -j64 O=/home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/ ARCH=powerpc LLVM_IAS=1 CC=clang LLVM=1 ppc64le_defconfig then enabled LTO in that .config and then just built "vmlinux": [fstn1-p1 ~/p/kernels-llvm/llvm]$ ls -l /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux -rwxrwxr-x 1 aik aik 48145272 May 4 17:00 /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux which boots under qemu, the kernel version is: Preparing to boot Linux version 5.18.0-rc2_0bb153baeff0_a+fstn1 (aik@fstn1-p1) (clang version 15.0.0 (https://github.com/llvm/llvm-proje ct.git e29dc0c6fde284e7f05aa5f45b05c629c9fad295), LLD 15.0.0) #1 SMP Wed May 4 16:54:16 AEST 2022 Before I got to this point, I did many unspeakable things to that build system so may be it is screwed in some way but I cannot pinpoint it. The installed clang/lld is 12.0.0-3ubuntu1~21.04.2 and -DLLVM_ENABLE_LLD=ON from cmake is to accelerate rebuilding of LLVM (for bisecting). I'll try without it now, just takes ages to complete. And I just did. 
clang built with gcc simply crashes while building kernel's scripts/basic/fixdep :-D I may have to file a bug against clang now :-/ Perhaps for now I should just send: ``` diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile index 954974287ee7..8762e6513683 100644 --- a/arch/powerpc/kernel/vdso/Makefile +++ b/arch/powerpc/kernel/vdso/Makefile @@ -55,6 +55,11 @@ AS32FLAGS := -D__VDSO32__ -s CC64FLAGS := -Wl,-soname=linux-vdso64.so.1 AS64FLAGS := -D__VDSO64__ -s +ifneq ($(LLVM),) +CC32FLAGS += -fuse-ld=lld +CC64FLAGS += -fuse-ld=lld +endif + targets += vdso32.lds CPPFLAGS_vdso32.lds += -P -C -Upowerpc targets += vdso64.lds ``` Signed-off-by: Alexey Kardashevskiy --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ Yeah, I just hit this: ``` LTO vmlinux.o LLVM ERROR: Function Import:
[PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moves the capability advertising to the platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy --- Or I could move this one together with KVM_CAP_IRQFD. Thoughts? --- arch/arm64/kvm/arm.c | 3 +++ arch/mips/kvm/mips.c | 3 +++ arch/powerpc/kvm/powerpc.c | 6 ++ arch/riscv/kvm/vm.c| 3 +++ arch/s390/kvm/kvm-s390.c | 3 +++ arch/x86/kvm/x86.c | 3 +++ virt/kvm/kvm_main.c| 1 - 7 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 523bc934fe2f..092f0614bae3 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -210,6 +210,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index a25e0b73ee70..0f3de470a73e 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_SYNC_MMU: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 875c30c12db0..87698ffef3be 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -591,6 +591,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) break; #endif +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: + r = !xive_enabled(); + break; +#endif + case KVM_CAP_PPC_ALLOC_HTAB: r = hv_enabled; 
break; diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index c768f75279ef..b58579b386bb 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -63,6 +63,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_MP_STATE: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 156d1c25a3c1..85e093fc8d13 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -564,6 +564,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_S390_DIAG318: case KVM_CAP_S390_MEM_OP_EXTENSION: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c0ca599a353..a0a7b769483d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4273,6 +4273,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SYS_ATTRIBUTES: case KVM_CAP_VAPIC: case KVM_CAP_ENABLE_CAP: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_EXIT_HYPERCALL: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 70e05af5ebea..885e72e668a5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4293,7 +4293,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) #endif #ifdef CONFIG_HAVE_KVM_IRQFD case KVM_CAP_IRQFD: - case KVM_CAP_IRQFD_RESAMPLE: #endif case KVM_CAP_IOEVENTFD_ANY_LENGTH: case KVM_CAP_CHECK_EXTENSION_VM: -- 2.30.2
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) I was not. 
Just did clean build like this: mkdir ~/pbuild/llvm/llvm-lto-latest-cleanbuild cd ~/pbuild/llvm/llvm-lto-latest-cleanbuild CC='clang' CXX='clang++' cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_TARGETS_TO_BUILD=PowerPC ~/p/llvm/llvm-latest/llvm/ -DLLVM_ENABLE_LTO=ON -DLLVM_ENABLE_LLD=ON -DLLVM_BINUTILS_INCDIR=/usr/include -DCMAKE_BUILD_TYPE=Release ninja -j 50 It builds fine: [fstn1-p1 ~/p/kernels-llvm/llvm]$ find /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/ -iname LLVMgold.so -exec ls -l {} \; -rwxrwxr-x 1 aik aik 39032840 May 4 13:06 /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/lib/LLVMgold.so and then in the kernel tree: PATH=/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/bin:$PATH make -j64 O=/home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/ ARCH=powerpc LLVM_IAS=1 CC=clang LLVM=1 ppc64le_defconfig then enabled LTO in that .config and then just built "vmlinux": [fstn1-p1 ~/p/kernels-llvm/llvm]$ ls -l /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux -rwxrwxr-x 1 aik aik 48145272 May 4 17:00 /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux which boots under qemu, the kernel version is: Preparing to boot Linux version 5.18.0-rc2_0bb153baeff0_a+fstn1 (aik@fstn1-p1) (clang version 15.0.0 (https://github.com/llvm/llvm-proje ct.git e29dc0c6fde284e7f05aa5f45b05c629c9fad295), LLD 15.0.0) #1 SMP Wed May 4 16:54:16 AEST 2022 Before I got to this point, I did many unspeakable things to that build system so may be it is screwed in some way but I cannot pinpoint it. The installed clang/lld is 12.0.0-3ubuntu1~21.04.2 and -DLLVM_ENABLE_LLD=ON from cmake is to accelerate rebuilding of LLVM (for bisecting). I'll try without it now, just takes ages to complete. 
Perhaps for now I should just send: ``` diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile index 954974287ee7..8762e6513683 100644 --- a/arch/powerpc/kernel/vdso/Makefile +++ b/arch/powerpc/kernel/vdso/Makefile @@ -55,6 +55,11 @@ AS32FLAGS := -D__VDSO32__ -s CC64FLAGS := -Wl,-soname=linux-vdso64.so.1 AS64FLAGS := -D__VDSO64__ -s +ifneq ($(LLVM),) +CC32FLAGS += -fuse-ld=lld +CC64FLAGS += -fuse-ld=lld +endif + targets += vdso32.lds CPPFLAGS_vdso32.lds += -P -C -Upowerpc targets += vdso64.lds ``` Signed-off-by: Alexey Kardashevskiy --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ Yeah, I just hit this: ``` LTO vmlinux.o LLVM ERROR: Function Import: link error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(entropy_common.o at 5782)' and 'lib/built-in.a(zstd_decompress_block.o at 6202)' PLEASE submit a bug report to https:
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/4/22 07:24, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b66dd6f775a4..5b783bd51260 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,11 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE Do you need to have the ELSE even if there's nothing in it; should it have a nop? The rest of the assembler changes LGTM, but withholding RB tag until we have Kconfig dependencies in better shape. The FTR patcher will add the necessary amount of "nop"s there and dropping "FTR_SECTION_ELSE" breaks the build as it does some "pushsection" magic. - bne masked_interrupt ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else
[PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTR alternative section if alt branches use "bc" (Branch Conditional), which is limited to 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b", which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it is LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\<bc\>' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds, but this is not POWERPC-specific. Signed-off-by: Alexey Kardashevskiy --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. 
For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ --- arch/powerpc/Kconfig | 2 ++ arch/powerpc/kernel/exceptions-64s.S | 4 +++- arch/powerpc/lib/copyuser_64.S | 3 ++- arch/powerpc/lib/feature-fixups-test.S | 3 +-- arch/powerpc/lib/memcpy_64.S | 3 ++- 5 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 174edabb74fa..e2c7b5c1d0a6 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -158,6 +158,8 @@ config PPC select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WEAK_RELEASE_ACQUIRE + select ARCH_SUPPORTS_LTO_CLANG + select ARCH_SUPPORTS_LTO_CLANG_THIN select BINFMT_ELF select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b66dd6f775a4..5b783bd51260 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,11 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE - bne masked_interrupt ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S index db8719a14846..d07f95eebc65 100644 --- a/arch/powerpc/lib/copyuser_64.S +++ b/arch/powerpc/lib/copyuser_64.S @@ -75,10 +75,11 @@ _GLOBAL(__copy_tofrom_user_base) * set is Power6. 
*/ test_feature = (SELFTEST_CASE == 1) + beq .Ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: diff --git a/arch/powerpc/lib/feature-fixups-test.S b/arch/powerpc/lib/feature-fixups-test.S index 480172fbd024..2751e42a9fd7 100644 --- a/arch/powerpc/lib/feature-fixups-test.S +++ b/arch/powerpc/lib/feature-fixups-test.S @@ -145,7 +145,6 @@ BEGIN_FTR_SECTION FTR_SECTION_ELSE 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b @@ -160,10 +159,10 @@ globl(ftr_fixup_test6_expected) 1: or 1,1,1 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b + nop 3: or 1,1,1 or 2,2,2 or 3,3,3 diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S index 016c91e958d8..286c7e2d0883 100644 --- a/arch/powerpc/lib/memcpy_64.S +++ b/arch/powerpc/lib/memcpy_64.S @@ -50,10 +50,11 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY) At the time of writing the only CPU that has this combination of bits set is Power6. */ test_feature = (SELFTEST_CASE == 1) + beq .Ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: -- 2.30.2
[PATCH kernel] KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers
LoPAPR defines a guest visible IOMMU with hypercalls to use it - H_PUT_TCE/etc. These were first implemented on POWER7 where hypercalls would trap into KVM in real mode (with the MMU off) because enabling the MMU was an expensive operation; the downside of real mode is that some memory is not available and some API usage crashed the host. The problems with the real mode handlers are: 1. Occasionally they cannot complete the request so the code is copied+modified to work in virtual mode, and very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However since POWER8, there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such a 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE) whose virtual mode handler is very close to its real mode version. On POWER9 hypercalls trap straight to virtual mode so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and the MMU improvements in POWER9 and later, there is no point in duplicating the code. Passed-through 32bit devices may slow down but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demonstrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and the related code from the powernv platform. This changes the ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/Makefile | 3 - arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/include/asm/kvm_ppc.h| 2 - arch/powerpc/include/asm/mmu_context.h| 5 - arch/powerpc/platforms/powernv/pci.h | 3 +- arch/powerpc/kernel/iommu.c | 4 +- arch/powerpc/kvm/book3s_64_vio.c | 43 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 672 -- arch/powerpc/mm/book3s64/iommu_api.c | 68 -- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 5 +- arch/powerpc/platforms/powernv/pci-ioda.c | 46 +- arch/powerpc/platforms/pseries/iommu.c| 3 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 10 - 13 files changed, 69 insertions(+), 801 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 9bdfc8b50899..8e3681a86074 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -37,9 +37,6 @@ kvm-e500mc-objs := \ e500_emulate.o kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs) -kvm-book3s_64-builtin-objs-$(CONFIG_SPAPR_TCE_IOMMU) := \ - book3s_64_vio_hv.o - kvm-pr-y := \ fpu.o \ emulate.o \ diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d7912b66c874..7e29c73e3dd4 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -51,13 +51,11 @@ struct iommu_table_ops { int (*xchg_no_kill)(struct iommu_table *tbl, long index, unsigned long *hpa, - enum dma_data_direction *direction, - bool realmode); + enum dma_data_direction *direction); void (*tce_kill)(struct iommu_table *tbl, unsigned long index, - unsigned long pages, - bool realmode); + unsigned long pages); __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); #endif diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 838d4cb460b7..44200a27371b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -177,8 +177,6 @@ extern void 
kvmppc_setup_partition_table(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); -extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm *kvm, unsigned long liobn); #define kvmppc_ioba_validate(stt, ioba, npages) \ (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \ (stt)->size, (ioba), (npages)) ?\ diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b8527a74bd4d..3f25bd3e14eb 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -34,15 +34,10 @@ extern void mm_iommu_init(struct mm_struct *mm); extern void mm_iommu_cleanup(struct mm_struct *mm); extern struct mm_iommu_table_group_mem_t *mm_iommu_look
[PATCH kernel] powerpc/perf: Fix 32bit compile
The "read_bhrb" global symbol is only called from CONFIG_PPC64 code in arch/powerpc/perf/core-book3s.c but it is compiled for both 32 and 64 bit anyway (and LLVM fails to link this on 32bit). This fixes it by moving bhrb.o to the obj64 targets. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/perf/Makefile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 2f46e31c7612..4f53d0b97539 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -3,11 +3,11 @@ obj-y += callchain.o callchain_$(BITS).o perf_regs.o obj-$(CONFIG_COMPAT) += callchain_32.o -obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o +obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o obj64-$(CONFIG_PPC_PERF_CTRS) += ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ isa207-common.o power8-pmu.o power9-pmu.o \ - generic-compat-pmu.o power10-pmu.o + generic-compat-pmu.o power10-pmu.o bhrb.o obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o obj-$(CONFIG_PPC_POWERNV) += imc-pmu.o -- 2.30.2
[PATCH kernel v2] KVM: PPC: Fix TCE handling for VFIO
The LoPAPR spec defines a guest visible IOMMU with a variable page size. Currently QEMU advertises 4K, 64K, 2M and 16MB pages and a Linux VM picks the biggest (16MB). In the case of a passed-through PCI device, there is a hardware IOMMU which does not support all of the above page sizes - P8 cannot do 2MB and P9 cannot do 16MB. So for each emulated 16M IOMMU page we may create several smaller mappings ("TCEs") in the hardware IOMMU. The code wrongly uses the emulated TCE index instead of the hardware TCE index in error handling. The problem is easier to see on POWER8 with multi-level TCE tables (when only the first level is preallocated) as hash mode uses the real mode TCE hypercall handlers. The kernel starts using indirect tables when VMs get bigger than 128GB (depends on the max page order). The very first real mode hcall is going to fail with H_TOO_HARD as in real mode we cannot allocate memory for TCEs (we can in virtual mode), but on the way out the code attempts to clear hardware TCEs using emulated TCE indexes which corrupts random kernel memory because it_offset==1<<59 is subtracted from those indexes and the resulting index is out of the TCE table bounds. This fixes kvmppc_clear_tce() to use the correct TCE indexes. While at it, this fixes TCE cache invalidation which used emulated TCE indexes instead of the hardware ones. This went unnoticed as 64bit DMA is used these days and VMs map all RAM in one go and only then do DMA, and this is when the TCE cache gets populated. Potentially this could slow down mapping; however, normally 16MB emulated pages are backed by 64K hardware pages so it is one write to the "TCE Kill" register per 256 updates which is not that bad considering the size of the cache (1024 TCEs or so). 
Fixes: ca1fc489cfa0 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages") Reviewed-by: Frederic Barrat Reviewed-by: David Gibson Tested-by: David Gibson Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * reworded the first paragraph of the commit log --- arch/powerpc/kvm/book3s_64_vio.c| 45 +++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 44 ++-- 2 files changed, 45 insertions(+), 44 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index d42b4b6d4a79..85cfa6328222 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -420,13 +420,19 @@ static void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt, tbl[idx % TCES_PER_PAGE] = tce; } -static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl, - unsigned long entry) +static void kvmppc_clear_tce(struct mm_struct *mm, struct kvmppc_spapr_tce_table *stt, + struct iommu_table *tbl, unsigned long entry) { - unsigned long hpa = 0; - enum dma_data_direction dir = DMA_NONE; + unsigned long i; + unsigned long subpages = 1ULL << (stt->page_shift - tbl->it_page_shift); + unsigned long io_entry = entry << (stt->page_shift - tbl->it_page_shift); - iommu_tce_xchg_no_kill(mm, tbl, entry, , ); + for (i = 0; i < subpages; ++i) { + unsigned long hpa = 0; + enum dma_data_direction dir = DMA_NONE; + + iommu_tce_xchg_no_kill(mm, tbl, io_entry + i, , ); + } } static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm, @@ -485,6 +491,8 @@ static long kvmppc_tce_iommu_unmap(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -544,6 +552,8 @@ static long kvmppc_tce_iommu_map(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -590,10 +600,9 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, ret = kvmppc_tce_iommu_map(vcpu->kvm, stt, stit->tbl, entry, ua, dir); - iommu_tce_kill(stit->tbl, entry, 1); if (ret != H_SUCCESS) { - 
kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stt, stit->tbl, entry); goto unlock_exit; } } @@ -669,13 +678,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, */ if (get_user(tce, tces + i)) { ret = H_TOO_HARD; - goto invalidate_exit; + goto unlock_exit; } tce = be64_to_cpu(tce); if (kvmppc_tce_to_ua(vcpu->kvm, tce, )) { ret = H_PARAMETER; - goto invalidate_exit; + goto unlock_exit;
[PATCH kernel v3] powerpc/boot: Stop using RELACOUNT
So far the RELACOUNT tag from the ELF header contained the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However, a recent LLVM change [1] makes it equal-or-less than the actual number, which makes it useless. This replaces RELACOUNT in the zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support"). To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorted by relocation type. Unlike d79976918852, this does not add unaligned UADDR/UADDR64 relocations as we are unlikely to see those in practice - the zImage is small and very arch specific so there is a smaller chance that some generic feature (such as PRINTK_INDEX) triggers unaligned relocations. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 Signed-off-by: Alexey Kardashevskiy --- Changes: v3: * s/divd/divdu/ for ppc64 v2: * s/divd/divwu/ for ppc32 * updated the commit log * named all new labels instead of numbering them (s/101f/.Lcheck_for_relaent/ and so on) --- arch/powerpc/boot/crt0.S | 45 ++-- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..44544720daae 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne .Lcheck_for_relaent + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +.Lcheck_for_relaent: + cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12:addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10:/* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divwu r0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2: lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne .Lnext lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +.Lnext:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +166,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne .Lcheck_for_relaent + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +.Lcheck_for_relaent: + cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11:addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. 
*/ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdu r8,r8,r14 /* RELASZ / RELAENT */ mtctr r8 13:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,22 /* R_PPC64_RELATIVE */ - bne 3f + bne .Lnext ld r12,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ add r0,r0,r13 stdxr0,r13,r12 - addir9,r
Re: [PATCH kernel v2] powerpc/boot: Stop using RELACOUNT
On 4/6/22 14:58, Gabriel Paubert wrote: On Wed, Apr 06, 2022 at 02:01:48PM +1000, Alexey Kardashevskiy wrote: So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support"). To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike d79976918852, this does not add unaligned UADDR/UADDR64 relocations as we are likely not to see those in practice - the zImage is small and very arch specific so there is a smaller chance that some generic feature (such as PRINK_INDEX) triggers unaligned relocations. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * s/divd/divwu/ for ppc32 * updated the commit log * named all new labels instead of numbering them (s/101f/.Lcheck_for_relaent/ and so on) --- arch/powerpc/boot/crt0.S | 45 ++-- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..e9306d862f8d 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne .Lcheck_for_relaent + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +.Lcheck_for_relaent: + cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12: addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10: /* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divwu r0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2:lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne .Lnext lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +.Lnext:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +166,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne .Lcheck_for_relaent + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +.Lcheck_for_relaent: + cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11: addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. 
*/ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdr8,r8,r14 /* RELASZ / RELAENT */ While you are at it, this one should also be divdu. I really wished IBM had used explicit signed/unsigned indication in the mnemonics (divds, divdu, divws, divwu) instead. Fortunately very little assemby code uses these instructions nowadays. Fair enough, v3 is coming. Thanks, mtctr
[PATCH kernel] KVM: PPC: Fix TCE handling for VFIO
At the moment the IOMMU page size in a pseries VM is 16MB (the biggest allowed by LoPAPR) and this page size is used for an emulated TCE table. If there is a passed-through PCI device, there are hardware IOMMU tables with equal or smaller IOMMU page sizes, so one emulated IOMMU page is backed by a power-of-two number of hardware pages. The code wrongly uses the emulated TCE index instead of the hardware TCE index in error handling. The problem is easier to see on POWER8 with multi-level TCE tables (when only the first level is preallocated) as hash mode uses the real mode TCE hypercall handlers. The kernel starts using indirect tables when VMs get bigger than 128GB (depends on the max page order). The very first real mode hcall is going to fail with H_TOO_HARD as in real mode we cannot allocate memory for TCEs (we can in virtual mode), but on the way out the code attempts to clear hardware TCEs using emulated TCE indexes which corrupts random kernel memory because it_offset==1<<59 is subtracted from those indexes and the resulting index is out of the TCE table bounds. This fixes kvmppc_clear_tce() to use the correct TCE indexes. While at it, this fixes TCE cache invalidation which used emulated TCE indexes instead of the hardware ones. This went unnoticed as 64bit DMA is used these days and VMs map all RAM in one go and only then do DMA, and this is when the TCE cache gets populated. Potentially this could slow down mapping; however, normally 16MB emulated pages are backed by 64K hardware pages so it is one write to the "TCE Kill" register per 256 updates which is not that bad considering the size of the cache (1024 TCEs or so). 
Fixes: ca1fc489cfa0 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages") Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_64_vio.c| 45 +++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 44 ++-- 2 files changed, 45 insertions(+), 44 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index d42b4b6d4a79..85cfa6328222 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -420,13 +420,19 @@ static void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt, tbl[idx % TCES_PER_PAGE] = tce; } -static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl, - unsigned long entry) +static void kvmppc_clear_tce(struct mm_struct *mm, struct kvmppc_spapr_tce_table *stt, + struct iommu_table *tbl, unsigned long entry) { - unsigned long hpa = 0; - enum dma_data_direction dir = DMA_NONE; + unsigned long i; + unsigned long subpages = 1ULL << (stt->page_shift - tbl->it_page_shift); + unsigned long io_entry = entry << (stt->page_shift - tbl->it_page_shift); - iommu_tce_xchg_no_kill(mm, tbl, entry, , ); + for (i = 0; i < subpages; ++i) { + unsigned long hpa = 0; + enum dma_data_direction dir = DMA_NONE; + + iommu_tce_xchg_no_kill(mm, tbl, io_entry + i, , ); + } } static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm, @@ -485,6 +491,8 @@ static long kvmppc_tce_iommu_unmap(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -544,6 +552,8 @@ static long kvmppc_tce_iommu_map(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -590,10 +600,9 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, ret = kvmppc_tce_iommu_map(vcpu->kvm, stt, stit->tbl, entry, ua, dir); - iommu_tce_kill(stit->tbl, entry, 1); if (ret != H_SUCCESS) { - kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stt, stit->tbl, entry); goto unlock_exit; } } @@ -669,13 +678,13 
@@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, */ if (get_user(tce, tces + i)) { ret = H_TOO_HARD; - goto invalidate_exit; + goto unlock_exit; } tce = be64_to_cpu(tce); if (kvmppc_tce_to_ua(vcpu->kvm, tce, )) { ret = H_PARAMETER; - goto invalidate_exit; + goto unlock_exit; } list_for_each_entry_lockless(stit, >iommu_tables, next) { @@ -684,19 +693,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, iommu_tce_direction(tce)); if (ret !=
[PATCH kernel v2] powerpc/boot: Stop using RELACOUNT
So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support"). To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike d79976918852, this does not add unaligned UADDR/UADDR64 relocations as we are likely not to see those in practice - the zImage is small and very arch specific so there is a smaller chance that some generic feature (such as PRINK_INDEX) triggers unaligned relocations. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * s/divd/divwu/ for ppc32 * updated the commit log * named all new labels instead of numbering them (s/101f/.Lcheck_for_relaent/ and so on) --- arch/powerpc/boot/crt0.S | 45 ++-- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..e9306d862f8d 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne .Lcheck_for_relaent + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +.Lcheck_for_relaent: + cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12:addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10:/* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divwu r0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2: lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne .Lnext lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +.Lnext:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +166,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne .Lcheck_for_relaent + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +.Lcheck_for_relaent: + cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11:addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. 
*/ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdr8,r8,r14 /* RELASZ / RELAENT */ mtctr r8 13:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,22 /* R_PPC64_RELATIVE */ - bne 3f + bne .Lnext ld r12,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ add r0,r0,r13 stdxr0,r13,r12 - addir9,r9,24 +.Lnext:
Re: [PATCH kernel] powerpc/boot: Stop using RELACOUNT
On 3/22/22 13:12, Michael Ellerman wrote: Alexey Kardashevskiy writes: So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in [2]. That's committed so you can say: in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support") To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike [1], this does not add unaligned UADDR/UADDR64 relocations ^ that should be 2? Yes. as in hardly possible to see those in arch-specific zImage. I don't quite parse that. Is it true we can never see them in zImage? Maybe it's true that we don't see them in practice. I can force UADDR64 in zImage as I did for d79976918852 but zImage is lot smaller and more arch-specific than vmlinux and so far only PRINT_INDEX triggered UADDR64 in vmlinux and chances of the same thing happening in zImage are small. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 [2] https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next=d799769188529a Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/boot/crt0.S | 43 +--- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..6ea3417da3b7 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,38 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne 111f + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +111: cmpwi r8,RELAENT Can you use named local labels for new labels you introduce? This could be .Lcheck_for_relaent: perhaps. Then I'll need to rename them all/most and add more noise to the patch which reduces chances of it being reviewed. But sure, I can rename labels. bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12: addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10: /* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divdr0,r0,r14 /* RELASZ / RELAENT */ This is in the 32-bit portion isn't it. AFAIK 32-bit CPUs don't implement divd. I'm not sure why the toolchain allowed it. I would expect it to trap if run on real 32-bit hardware. Uff, my bad, "divw", right? I am guessing it works as zImage for 64bit BigEndian is still ELF32 which runs in 64bit CPU and I did not test on real PPC32 as I'm not quite sure how and I hoped your farm will do this for me :) mtctr r0 2:lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne 22f lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +22:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ cheers
[PATCH kernel] powerpc/boot: Stop using RELACOUNT
So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in [2]. To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike [1], this does not add unaligned UADDR/UADDR64 relocations as in hardly possible to see those in arch-specific zImage. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 [2] https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next=d799769188529a Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/boot/crt0.S | 43 +--- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..6ea3417da3b7 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. @@ -75,34 +76,38 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne 111f + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +111: cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12:addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. 
*/ +* which need to be initialized with addend + offset */ 10:/* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divdr0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2: lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne 22f lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +22:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +165,38 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne 101f + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +101: cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11:addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. */ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdr8,r8,r14 /* RELASZ / RELAENT */ mtctr r8 13:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,22 /* R_PPC64_RELATIVE */ - bne 3f + bne 14f ld r12,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ add r0,r0,r13 stdxr0,r13,r12 - addir9,r9,24 +14:add r9,r9,r14 bdnz13b /* Do a cache flush for our text, in case the loader didn't */ -- 2.30.2
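As a cross-check of what the new loop computes, the same walk can be sketched in C (a sketch only: entry fields are shown as `uintptr_t` so it runs on any host, whereas the real 32-bit zImage uses 12-byte `Elf32_Rela` entries, i.e. RELAENT == 12; names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Dynamic tags (values from the ELF specification). */
enum { DT_NULL = 0, DT_RELA = 7, DT_RELASZ = 8, DT_RELAENT = 9 };
#define R_PPC_RELATIVE 22

/* Schematic entry types; the real zImage parses Elf32_Dyn/Elf32_Rela. */
struct dyn  { uintptr_t d_tag, d_val; };
struct rela { uintptr_t r_offset, r_info, r_addend; };

/* Apply R_PPC_RELATIVE relocations; 'base' is the runtime load offset. */
static void relocate(const struct dyn *dyn, uintptr_t base)
{
	const struct rela *rela = NULL;
	uintptr_t relasz = 0, relaent = 0;

	for (; dyn->d_tag != DT_NULL; dyn++) {
		if (dyn->d_tag == DT_RELA)
			rela = (const struct rela *)(base + dyn->d_val);
		else if (dyn->d_tag == DT_RELASZ)
			relasz = dyn->d_val;
		else if (dyn->d_tag == DT_RELAENT)
			relaent = dyn->d_val;
	}
	if (!rela || !relasz || !relaent)
		return;	/* same as the asm "skip relocation" checks */

	/* Walk the entire section instead of trusting a RELACOUNT-style
	 * "the RELATIVE ones come first" assumption. */
	for (uintptr_t i = 0; i < relasz / relaent; i++)
		if ((rela[i].r_info & 0xff) == R_PPC_RELATIVE)
			*(uintptr_t *)(base + rela[i].r_offset) =
				base + rela[i].r_addend;
}
```

The point of RELASZ/RELAENT over RELACOUNT is visible in the loop bound: the count is derived as RELASZ / RELAENT, and non-RELATIVE entries are simply skipped rather than assumed absent.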
Re: [PATCH] powerpc: Replace ppc64 DT_RELACOUNT usage with DT_RELASZ/24
On 3/11/22 15:15, Michael Ellerman wrote: Fāng-ruì Sòng writes: On Thu, Mar 10, 2022 at 11:48 AM Nick Desaulniers wrote: On Tue, Mar 8, 2022 at 9:53 PM Fangrui Song wrote: DT_RELACOUNT is an ELF dynamic tag inherited from SunOS indicating the number of R_*_RELATIVE relocations. It is optional but {ld.lld,ld.lld} -z combreloc always creates it (if non-zero) to slightly speed up glibc ld.so relocation resolving by avoiding R_*R_PPC64_RELATIVE type comparison. The tag is otherwise nearly unused in the wild and I'd recommend that software avoids using it. lld>=14.0.0 (since commit da0e5b885b25cf4ded0fa89b965dc6979ac02ca9) underestimates DT_RELACOUNT for ppc64 when position-independent long branch thunks are used. Correcting it needs non-trivial arch-specific complexity which I'd prefer to avoid. Since our code always compares the relocation type with R_PPC64_RELATIVE, replacing every occurrence of DT_RELACOUNT with DT_RELASZ/sizeof(Elf64_Rela)=DT_RELASZ/24 is a correct alternative. checking that sizeof(Elf64_Rela) == 24, yep: https://godbolt.org/z/bb4aKbo5T DT_RELASZ is in practice bounded by an uint32_t. Dividing x by 24 can be implemented as (uint32_t)(x*0xaaab) >> 4. Yep: https://godbolt.org/z/x9445ePPv Link: https://github.com/ClangBuiltLinux/linux/issues/1581 Reported-by: Nathan Chancellor Signed-off-by: Fangrui Song --- arch/powerpc/boot/crt0.S | 28 +--- arch/powerpc/kernel/reloc_64.S | 15 +-- 2 files changed, 26 insertions(+), 17 deletions(-) ... I rebased the patch on git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master and got a conflict. Seems that https://lore.kernel.org/linuxppc-dev/20220309061822.168173-1-...@ozlabs.ru/T/#u ("[PATCH kernel v4] powerpc/64: Add UADDR64 relocation support") fixed the issue. It just doesn't change arch/powerpc/boot/crt0.S Yeah sorry, I applied Alexey's v4 just before I saw your patch arrive on the list. 
If one of you can rework this so it applies on top that would be great :) I guess it is me, as now I have to add that UADDR64 thing to crt0.S as well, don't I? And are we also giving up on llvm lld having a bug with RELACOUNT?
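For reference, the division the thread proposes is exact: sizeof(Elf64_Rela) is 24 (three 8-byte fields), so the entry count is DT_RELASZ/24. The multiply-trick constant as quoted appears mangled by the archive; a standard exact form of the strength reduction, assuming RELASZ is a multiple of 24, is:

```c
#include <stdint.h>

/* Three 8-byte fields, hence DT_RELASZ / 24 entries. */
typedef struct {
	uint64_t r_offset;
	uint64_t r_info;
	int64_t  r_addend;
} Elf64_Rela_sketch;

/* Exact division of a multiple of 24 without a divide instruction:
 * x/24 == (x/8)/3, and dividing an exact multiple of 3 can be done by
 * multiplying by the inverse of 3 modulo 2^32, which is 0xAAAAAAAB. */
static uint32_t relasz_to_count(uint32_t relasz)
{
	return (relasz >> 3) * 0xAAAAAAABu;	/* requires relasz % 24 == 0 */
}
```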
[PATCH kernel v4] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour (this disables optimization for the demonstration purposes only, this also happens with -O1/-O2 when CONFIG_PRINTK_INDEX=y, for example): \#pragma GCC push_options \#pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); \#pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with RELASZ which is the size of all relocation records. 
Signed-off-by: Alexey Kardashevskiy --- Changes: v4: * fixed reloc->r_info hadling on big endian v3: * named some labels v2: * replaced RELACOUNT with RELASZ/RELAENT * removed FIXME --- arch/powerpc/kernel/reloc_64.S | 67 +- arch/powerpc/kernel/vmlinux.lds.S | 2 - arch/powerpc/tools/relocs_check.sh | 7 +--- 3 files changed, 48 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..232e4549defe 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -8,8 +8,10 @@ #include RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,29 +27,38 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* -* Scan the dynamic section for the RELA and RELACOUNT entries. +* Scan the dynamic section for the RELA, RELASZ and RELAENT entries. 
*/ li r7,0 li r8,0 -1: ld r6,0(r11) /* get tag */ +.Ltags: + ld r6,0(r11) /* get tag */ cmpdi r6,0 - beq 4f /* end of list */ + beq .Lend_of_list /* end of list */ cmpdi r6,RELA bne 2f ld r7,8(r11) /* get RELA pointer in r7 */ - b 3f -2: addis r6,r6,(-RELACOUNT)@ha - cmpdi r6,RELACOUNT@l + b 4f +2: cmpdi r6,RELASZ bne 3f - ld r8,8(r11) /* get RELACOUNT value in r8 */ -3: addir11,r11,16 - b 1b -4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ + ld r8,8(r11) /* get RELASZ value in r8 */ + b 4f +3: cmpdi r6,RELAENT + bne 4f + ld r12,8(r11) /* get RELAENT value in r12 */ +4: addir11,r11,16 + b .Ltags +.Lend_of_list: + cmpdi r7,0/* check we have RELA, RELASZ, RELAENT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq .Lout + beq cr1,.Lout + cmpdi r12,0 + beq .Lout /* * Work out linktime address of _stext and hence the @@ -62,23 +73,39 @@ _GLOBAL(relocate) /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. */ + divdr8,r8,r12 /* RELASZ / RELAENT */ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +.Lrels:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,R_PPC64_RELATIVE - bne 6f + bne .Luaddr64 ld r6,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ + b .Lstore +.Luaddr64: + srdir14,r0,32 /* ELF64_R_SYM(reloc->r_info) */ + clrldi r0,r0,32 + cmpdi r0,R_PPC64_UADDR64 + bne .Lnext + ld r6,0(r9) + ld r0,16(r9) + mulli r14,r14,24 /* 24 == sizeof(elf64_sym) */ + add r14,r14,r13 /* elf64_sym[ELF64_R_SYM] */ + ld r14,8(r14) + add r0,r0,r14 +.Lstore: add r0,r0,r3 stdxr0,r7,r6 - addir9,r9,24 - bdnz5b - -6: blr +.Lnext: + add r9,r9,r12 + bdnz.Lrels +.Lout: + blr .balign 8 p_dyn: .8byte __dynamic_start - 0b p_r
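In C terms, the relocation loop the v4 patch implements does roughly the following (a schematic, not the kernel code: the image is treated as if linked at address 0, and the unaligned UADDR64 store is spelled with `memcpy` where the assembly simply uses `stdx`, which Power handles on unaligned addresses):

```c
#include <stdint.h>
#include <string.h>

#define R_PPC64_RELATIVE 22
#define R_PPC64_UADDR64  43

typedef struct { uint64_t r_offset, r_info; int64_t r_addend; } Rela;
typedef struct { uint32_t st_name; unsigned char st_info, st_other;
		 uint16_t st_shndx; uint64_t st_value, st_size; } Sym; /* 24 bytes */

#define ELF64_R_TYPE(i) ((uint32_t)(i))
#define ELF64_R_SYM(i)  ((i) >> 32)

/* 'base' is the desired final address; 'image' is the runtime start,
 * indexed by link-time offset (vmlinux is linked at 0). */
static void relocate64(const Rela *rela, uint64_t relasz, uint64_t relaent,
		       const Sym *dynsym, uint8_t *image, uint64_t base)
{
	for (uint64_t i = 0; i < relasz / relaent; i++) {
		const Rela *r = &rela[i];
		uint64_t val;

		switch (ELF64_R_TYPE(r->r_info)) {
		case R_PPC64_RELATIVE:
			val = base + r->r_addend;
			break;
		case R_PPC64_UADDR64:
			/* addend is relative to the symbol (from .dynsym) */
			val = base + dynsym[ELF64_R_SYM(r->r_info)].st_value
				   + r->r_addend;
			break;
		default:
			continue;
		}
		/* r_offset itself may be unaligned for UADDR64 */
		memcpy(image + r->r_offset, &val, sizeof(val));
	}
}
```

This also shows why UADDR64 needs `__dynamic_symtab`: unlike RELATIVE, its addend is symbol-relative, so the symbol's `st_value` (at offset 8 of the 24-byte `Elf64_Sym`, matching the `mulli r14,r14,24` / `ld r14,8(r14)` in the patch) must be added in.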
[PATCH kernel v3] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour (this disables optimization for the demonstration purposes only, this also happens with -O1/-O2 when CONFIG_PRINTK_INDEX=y, for example): \#pragma GCC push_options \#pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); \#pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with RELASZ which is the size of all relocation records. Signed-off-by: Alexey Kardashevskiy --- Changes: v3: * named some labels v2: * replaced RELACOUNT with RELASZ/RELAENT * removed FIXME --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ... 
c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x0 After relocation in __boot_from_prom: >>> p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 After relocation in __after_prom_start: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0xc1a804b0 >>> --- arch/powerpc/kernel/reloc_64.S | 67 +- arch/powerpc/kernel/vmlinux.lds.S | 2 - arch/powerpc/tools/relocs_check.sh | 7 +--- 3 files changed, 48 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..4a8eccbaebb4 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -8,8 +8,10 @@ #include RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,29 +27,38 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* -* Scan the dynamic section for the RELA and RELACOUNT entries. +* Scan the dynamic section for the RELA, RELASZ and RELAENT entries. 
*/ li r7,0 li r8,0 -1: ld r6,0(r11) /* get tag */ +.Ltags: + ld r6,0(r11) /* get tag */ cmpdi r6,0 - beq 4f /* end of list */ + beq .Lend_of_list /* end of list */ cmpdi r6,RELA bne 2f ld r7,8(r11) /* get RELA pointer in r7 */ - b 3f -2: addis r6,r6,(-RELACOUNT)@ha - cmpdi r6,RELACOUNT@l + b 4f +2: cmpdi r6,RELASZ bne 3f - ld r8,8(r11) /* get RELACOUNT value in r8 */ -3: addir11,r11,16 - b 1b -4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ + ld r8,8(r11) /* get RELASZ value in r8 */ + b 4f +3: cmpdi r6,RELAENT + bne 4f + ld r12,8(r11) /* get RELAENT value in r12 */ +4: addir11,r11,16 + b .Ltags +.Lend_of_list: + cmpdi r7,0/* check we have RELA, RELASZ, RELAENT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq .Lout + beq cr1,.Lout + cmpdi r12,0 + beq .Lout /* * Work out linktime address of _stext and hence the @@ -62,23 +73,39 @@ _GLOBAL(relocate) /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. */ + divdr8,r8,r12 /* RELASZ / RELAENT */ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +.Lrelocations: + lwa r0,8(r9)
Re: [PATCH kernel 2/3] powerpc/llvm: Sample config for LLVM LTO
On 2/12/22 11:05, Nick Desaulniers wrote: On Thu, Feb 10, 2022 at 6:31 PM Alexey Kardashevskiy wrote: The config is a copy of ppc64_defconfig with a few tweaks. This could be a smaller config to merge into ppc64_defconfig but unfortunately merger does not allow disabling already enabled options. Cool series! This is a command line to compile the kernel using the upstream llvm: make -j64 O=/home/aik/pbuild/kernels-llvm/ \ "KCFLAGS=-Wmissing-braces -Wno-array-bounds" \ ARCH=powerpc LLVM_IAS=1 ppc64le_lto_defconfig CC=clang LLVM=1 That command line invocation is kind of a mess, and many things shouldn't be necessary. O= is just noise; if folks are doing in tree builds then that doesn't add anything meaningful. KCFLAGS= why? I know -Warray-bounds is being worked on actively, but do we have instances of -Wmissing-braces at the moment? Let's get those fixed up. LLVM_IAS=1 is implied by LLVM=1. CC=clang is implied by LLVM=1 why add a new config? I think it would be simpler to just show command line invocations of `./scripts/config -e` and `make`. No new config required. I should have added "RFC" in this one as the purpose of the patch is to show what works right now and not for actual submission. Forces CONFIG_BTRFS_FS=y to make CONFIG_ZSTD_COMPRESS=y to fix: ld.lld: error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(entropy_common.o at 5332)' and 'ld-temp.o' because modules are linked with -mcmodel=large but the kernel uses -mcmodel=medium Please file a bug about this. https://github.com/ClangBuiltLinux/linux/issues Enables CONFIG_USERFAULTFD=y as otherwise vm_userfaultfd_ctx becomes 0 bytes long and clang sanitizer crashes as https://bugs.llvm.org/show_bug.cgi?id=500375 The above hyperlink doesn't work for me. Upstream llvm just moved from bugzilla to github issue tracker. aah this is the correct one: https://bugs.llvm.org/show_bug.cgi?id=50037 https://github.com/llvm/llvm-project/issues oh ok. 
Disables CONFIG_FTR_FIXUP_SELFTEST as it uses FTR_SECTION_ELSE with conditional branches. There are other places like this and the following patches address that. Disables CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT as CONFIG_HAS_LTO_CLANG depends on it being disabled. In order to avoid disabling way too many options (like DYNAMIC_FTRACE/FUNCTION_TRACER), this converts FTRACE_MCOUNT_USE_RECORDMCOUNT from def_bool to bool. Note that even with this config there is a good chance that LTO is going to fail linking vmlinux because of the "bc" problem. I think rather than adding a new config with LTO enabled and a few things turned off, it would be better to not allow LTO to be selectable if those things are turned on, until the combination of the two are fixed. Well, if I want people to try this thing, I kinda need to provide an easy way to allow LTO. The new config seemed the easiest (== the shortest) :)
[PATCH kernel 3/3] powerpc/llvm/lto: Workaround conditional branches in FTR_SECTION_ELSE
LTO invites ld/lld to optimize the output binary and this may affect the FTR alternative section if alt branches use "bc" (Branch Conditional) which only allows 16-bit offsets. This manifests in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the problem by replacing "bc" and its alias(es) in FTR_SECTION_ELSE with "b" which allows 26-bit offsets. This catches the problem instructions in vmlinux.o before it is LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> The change in copyuser_64.S is needed even when building default configs, the other two changes are needed if the kernel config grows. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/exceptions-64s.S | 6 +- arch/powerpc/lib/copyuser_64.S | 3 ++- arch/powerpc/lib/memcpy_64.S | 3 ++- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 55caeee37c08..b8d9a2f5f3a5 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,13 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE - bne masked_interrupt + nop + nop ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S index db8719a14846..d07f95eebc65 100644 --- a/arch/powerpc/lib/copyuser_64.S +++ b/arch/powerpc/lib/copyuser_64.S @@ -75,10 +75,11 @@ _GLOBAL(__copy_tofrom_user_base) * set is Power6. 
*/ test_feature = (SELFTEST_CASE == 1) + beq .Ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S index 016c91e958d8..286c7e2d0883 100644 --- a/arch/powerpc/lib/memcpy_64.S +++ b/arch/powerpc/lib/memcpy_64.S @@ -50,10 +50,11 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY) At the time of writing the only CPU that has this combination of bits set is Power6. */ test_feature = (SELFTEST_CASE == 1) + beq .ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: -- 2.30.2
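For the record, the offset arithmetic behind the workaround, per the standard Power ISA encodings: `bc` (B-form) carries a 14-bit signed word displacement shifted left by 2, i.e. a signed 16-bit byte offset, while `b` (I-form) carries a 24-bit signed word displacement, i.e. a signed 26-bit byte offset:

```c
#include <stdint.h>

/* bc (B-form): 14-bit signed word displacement, <<2 => 16-bit byte offset */
#define BC_MAX_FWD	((1 << 15) - 4)		/* +32764 bytes */
#define BC_MIN_BACK	(-(1 << 15))		/* -32768 bytes */

/* b (I-form): 24-bit signed word displacement, <<2 => 26-bit byte offset */
#define B_MAX_FWD	((1 << 25) - 4)		/* +33554428 bytes */
#define B_MIN_BACK	(-(1 << 25))		/* -33554432 bytes */

/* Can a conditional branch encode this byte offset directly? */
static int bc_reaches(int64_t off)
{
	return (off & 3) == 0 && off >= BC_MIN_BACK && off <= BC_MAX_FWD;
}
```

An LTO-inserted range-extension thunk easily pushes an alt-section branch past the ±32KB `bc` reach, while `b` still reaches ±32MB, which is why the patch restructures the feature sections so that only `b` needs the long reach.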
[PATCH kernel 2/3] powerpc/llvm: Sample config for LLVM LTO
The config is a copy of ppc64_defconfig with a few tweaks. This could be a smaller config to merge into ppc64_defconfig but unfortunately merger does not allow disabling already enabled options. This is a command line to compile the kernel using the upstream llvm: make -j64 O=/home/aik/pbuild/kernels-llvm/ \ "KCFLAGS=-Wmissing-braces -Wno-array-bounds" \ ARCH=powerpc LLVM_IAS=1 ppc64le_lto_defconfig CC=clang LLVM=1 Forces CONFIG_BTRFS_FS=y to make CONFIG_ZSTD_COMPRESS=y to fix: ld.lld: error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(entropy_common.o at 5332)' and 'ld-temp.o' because modules are linked with -mcmodel=large but the kernel uses -mcmodel=medium Enables CONFIG_USERFAULTFD=y as otherwise vm_userfaultfd_ctx becomes 0 bytes long and clang sanitizer crashes as https://bugs.llvm.org/show_bug.cgi?id=500375 Disables CONFIG_FTR_FIXUP_SELFTEST as it uses FTR_SECTION_ELSE with conditional branches. There are other places like this and the following patches address that. Disables CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT as CONFIG_HAS_LTO_CLANG depends on it being disabled. In order to avoid disabling way too many options (like DYNAMIC_FTRACE/FUNCTION_TRACER), this converts FTRACE_MCOUNT_USE_RECORDMCOUNT from def_bool to bool. Note that even with this config there is a good chance that LTO is going to fail linking vmlinux because of the "bc" problem. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/Makefile| 4 + arch/powerpc/configs/ppc64_lto_defconfig | 381 +++ 2 files changed, 385 insertions(+) create mode 100644 arch/powerpc/configs/ppc64_lto_defconfig diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 5f16ac1583c5..23f1ade8abc9 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -308,6 +308,10 @@ PHONY += ppc64le_defconfig ppc64le_defconfig: $(call merge_into_defconfig,ppc64_defconfig,le) +PHONY += ppc64le_lto_defconfig +ppc64le_lto_defconfig: + $(call merge_into_defconfig,ppc64_lto_defconfig,le) + PHONY += ppc64le_guest_defconfig ppc64le_guest_defconfig: $(call merge_into_defconfig,ppc64_defconfig,le guest) diff --git a/arch/powerpc/configs/ppc64_lto_defconfig b/arch/powerpc/configs/ppc64_lto_defconfig new file mode 100644 index ..67f82b422b7d --- /dev/null +++ b/arch/powerpc/configs/ppc64_lto_defconfig @@ -0,0 +1,381 @@ +CONFIG_SYSVIPC=y +CONFIG_POSIX_MQUEUE=y +CONFIG_NO_HZ=y +CONFIG_HIGH_RES_TIMERS=y +CONFIG_TASKSTATS=y +CONFIG_TASK_DELAY_ACCT=y +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y +CONFIG_LOG_BUF_SHIFT=18 +CONFIG_LOG_CPU_MAX_BUF_SHIFT=13 +CONFIG_NUMA_BALANCING=y +CONFIG_CGROUPS=y +CONFIG_MEMCG=y +CONFIG_CGROUP_SCHED=y +CONFIG_CGROUP_FREEZER=y +CONFIG_CPUSETS=y +CONFIG_CGROUP_DEVICE=y +CONFIG_CGROUP_CPUACCT=y +CONFIG_CGROUP_PERF=y +CONFIG_CGROUP_BPF=y +CONFIG_BLK_DEV_INITRD=y +CONFIG_BPF_SYSCALL=y +# CONFIG_COMPAT_BRK is not set +CONFIG_PROFILING=y +CONFIG_PPC64=y +CONFIG_NR_CPUS=2048 +CONFIG_PPC_SPLPAR=y +CONFIG_DTL=y +CONFIG_PPC_SMLPAR=y +CONFIG_IBMEBUS=y +CONFIG_PPC_SVM=y +CONFIG_PPC_MAPLE=y +CONFIG_PPC_PASEMI=y +CONFIG_PPC_PASEMI_IOMMU=y +CONFIG_PPC_PS3=y +CONFIG_PS3_DISK=m +CONFIG_PS3_ROM=m +CONFIG_PS3_FLASH=m +CONFIG_PS3_LPM=m +CONFIG_PPC_IBM_CELL_BLADE=y +CONFIG_RTAS_FLASH=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_GOV_POWERSAVE=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y +CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y +CONFIG_CPU_FREQ_PMAC64=y +CONFIG_HZ_100=y 
+CONFIG_PPC_TRANSACTIONAL_MEM=y +CONFIG_KEXEC=y +CONFIG_KEXEC_FILE=y +CONFIG_CRASH_DUMP=y +CONFIG_FA_DUMP=y +CONFIG_IRQ_ALL_CPUS=y +CONFIG_SCHED_SMT=y +CONFIG_HOTPLUG_PCI=y +CONFIG_HOTPLUG_PCI_RPA=m +CONFIG_HOTPLUG_PCI_RPA_DLPAR=m +CONFIG_PCCARD=y +CONFIG_ELECTRA_CF=y +CONFIG_VIRTUALIZATION=y +CONFIG_KVM_BOOK3S_64=m +CONFIG_KVM_BOOK3S_64_HV=m +CONFIG_VHOST_NET=m +CONFIG_KPROBES=y +CONFIG_JUMP_LABEL=y +CONFIG_MODULES=y +CONFIG_MODULE_UNLOAD=y +CONFIG_MODVERSIONS=y +CONFIG_MODULE_SRCVERSION_ALL=y +CONFIG_PARTITION_ADVANCED=y +CONFIG_BINFMT_MISC=m +CONFIG_MEMORY_HOTPLUG=y +CONFIG_MEMORY_HOTREMOVE=y +CONFIG_KSM=y +CONFIG_TRANSPARENT_HUGEPAGE=y +CONFIG_NET=y +CONFIG_PACKET=y +CONFIG_UNIX=y +CONFIG_XFRM_USER=m +CONFIG_NET_KEY=m +CONFIG_INET=y +CONFIG_IP_MULTICAST=y +CONFIG_IP_PNP=y +CONFIG_IP_PNP_DHCP=y +CONFIG_IP_PNP_BOOTP=y +CONFIG_NET_IPIP=y +CONFIG_SYN_COOKIES=y +CONFIG_INET_AH=m +CONFIG_INET_ESP=m +CONFIG_INET_IPCOMP=m +CONFIG_IPV6=y +CONFIG_NETFILTER=y +# CONFIG_NETFILTER_ADVANCED is not set +CONFIG_BRIDGE=m +CONFIG_NET_SCHED=y +CONFIG_NET_CLS_BPF=m +CONFIG_NET_CLS_ACT=y +CONFIG_NET_ACT_BPF=m +CONFIG_BPF_JIT=y +CONFIG_DEVTMPFS=y +CONFIG_DEVTMPFS_MOUNT=y +CONFIG_BLK_DEV_FD=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_BLK_DEV_NBD=m +CONFIG_BLK_DEV_RAM=y +CONFIG_BLK_DEV_RAM_SIZE=65536 +CONFIG_VIRTIO_BLK=m +CONFIG_BLK_DEV_SD=y +CONFIG_CHR_DEV_ST=m +CONFIG_BLK_DEV_SR=y +CONFIG_CHR_DEV_SG=y +CONFIG_SCSI_CONSTANTS=y +CONFIG_SCSI_FC_ATTRS=y +CONFI
[PATCH kernel 1/3] powerpc/64: Allow LLVM LTO builds
The upstream LLVM supports now LTO on PPC, enable it. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/Kconfig | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b779603978e1..91c14f83 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -153,6 +153,8 @@ config PPC select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WEAK_RELEASE_ACQUIRE + select ARCH_SUPPORTS_LTO_CLANG if PPC64 + select ARCH_SUPPORTS_LTO_CLANG_THIN if PPC64 select BINFMT_ELF select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS -- 2.30.2
[PATCH kernel 0/3] powerpc/llvm/lto: Enable CONFIG_LTO_CLANG_THIN=y
This is based on sha1 1b43a74f255c Michael Ellerman "Automatic merge of 'master' into merge (2022-02-01 10:41)". Please comment. Thanks. Alexey Kardashevskiy (3): powerpc/64: Allow LLVM LTO builds powerpc/llvm: Sample config for LLVM LTO powerpc/llvm/lto: Workaround conditional branches in FTR_SECTION_ELSE arch/powerpc/Makefile| 4 + arch/powerpc/Kconfig | 2 + arch/powerpc/configs/ppc64_lto_defconfig | 381 +++ arch/powerpc/kernel/exceptions-64s.S | 6 +- arch/powerpc/lib/copyuser_64.S | 3 +- arch/powerpc/lib/memcpy_64.S | 3 +- 6 files changed, 396 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/configs/ppc64_lto_defconfig -- 2.30.2
[PATCH kernel v2] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour (this disables optimization for the demonstration purposes only, this also happens with -O1/-O2 when CONFIG_PRINTK_INDEX=y, for example): \#pragma GCC push_options \#pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); \#pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with RELASZ which is the size of all relocation records. Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * replaced RELACOUNT with RELASZ/RELAENT * removed FIXME --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ... 
c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x0 After relocation in __boot_from_prom: >>> p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 After relocation in __after_prom_start: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0xc1a804b0 >>> --- arch/powerpc/kernel/reloc_64.S | 56 -- arch/powerpc/kernel/vmlinux.lds.S | 2 -- arch/powerpc/tools/relocs_check.sh | 7 +--- 3 files changed, 39 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..f7dcc25e93d0 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -8,8 +8,10 @@ #include RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,29 +27,36 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* -* Scan the dynamic section for the RELA and RELACOUNT entries. +* Scan the dynamic section for the RELA, RELASZ and RELAENT entries. 
*/ li r7,0 li r8,0 1: ld r6,0(r11) /* get tag */ cmpdi r6,0 - beq 4f /* end of list */ + beq 5f /* end of list */ cmpdi r6,RELA bne 2f ld r7,8(r11) /* get RELA pointer in r7 */ - b 3f -2: addis r6,r6,(-RELACOUNT)@ha - cmpdi r6,RELACOUNT@l + b 4f +2: cmpdi r6,RELASZ bne 3f - ld r8,8(r11) /* get RELACOUNT value in r8 */ -3: addir11,r11,16 + ld r8,8(r11) /* get RELASZ value in r8 */ + b 4f +3: cmpdi r6,RELAENT + bne 4f + ld r12,8(r11) /* get RELAENT value in r12 */ +4: addir11,r11,16 b 1b -4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ +5: cmpdi r7,0/* check we have RELA, RELASZ, RELAENT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq 10f + beq cr1,10f + cmpdi r12,0 + beq 10f /* * Work out linktime address of _stext and hence the @@ -62,23 +71,34 @@ _GLOBAL(relocate) /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. */ + divdr8,r8,r12 /* RELASZ / RELAENT */ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +5: lwa r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,R_PPC64_RELATIVE - bne 6f + bne 7f ld r6,0(r9)/
Re: [PATCH 2/2] KVM: selftests: Add support for ppc64le
VM_MODE_P36V48_16K, VM_MODE_P36V48_64K, VM_MODE_P36V47_16K, + VM_MODE_P51V52_64K, NUM_VM_MODES, }; @@ -87,6 +88,12 @@ extern enum vm_guest_mode vm_mode_default; #define MIN_PAGE_SHIFT12U #define ptes_per_page(page_size) ((page_size) / 8) +#elif defined(__powerpc__) + +#define VM_MODE_DEFAULTVM_MODE_P51V52_64K +#define MIN_PAGE_SHIFT 16U +#define ptes_per_page(page_size) ((page_size) / 8) + #endif #define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT) diff --git a/tools/testing/selftests/kvm/include/ppc64le/processor.h b/tools/testing/selftests/kvm/include/ppc64le/processor.h new file mode 100644 index ..fbc1332b2b80 --- /dev/null +++ b/tools/testing/selftests/kvm/include/ppc64le/processor.h @@ -0,0 +1,43 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * powerpc processor specific defines + */ +#ifndef SELFTEST_KVM_PROCESSOR_H +#define SELFTEST_KVM_PROCESSOR_H + +#define PPC_BIT(x) (1ULL << (63 - x)) Put the "x" in braces. + +#define MSR_SF PPC_BIT(0) +#define MSR_IR PPC_BIT(58) +#define MSR_DR PPC_BIT(59) +#define MSR_LE PPC_BIT(63) + +#define LPCR_UPRT PPC_BIT(41) +#define LPCR_EVIRT PPC_BIT(42) +#define LPCR_HRPPC_BIT(43) +#define LPCR_GTSE PPC_BIT(53) + +#define PATB_GRPPC_BIT(0) + +#define PTE_VALID PPC_BIT(0) +#define PTE_LEAF PPC_BIT(1) +#define PTE_RPPC_BIT(55) +#define PTE_CPPC_BIT(56) +#define PTE_RC (PTE_R | PTE_C) +#define PTE_READ 0x4 +#define PTE_WRITE 0x2 +#define PTE_EXEC 0x1 +#define PTE_RWX (PTE_READ|PTE_WRITE|PTE_EXEC) + +extern uint64_t hcall(uint64_t nr, ...); + +static inline uint32_t mfpvr(void) +{ + uint32_t pvr; + + asm ("mfpvr %0" +: "=r"(pvr)); + return pvr; +} + +#endif diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index c22a17aac6b0..cc5247c2cfeb 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -205,6 +205,7 @@ const char *vm_guest_mode_string(uint32_t i) [VM_MODE_P36V48_16K]= "PA-bits:36, VA-bits:48, 16K pages", 
[VM_MODE_P36V48_64K]= "PA-bits:36, VA-bits:48, 64K pages", [VM_MODE_P36V47_16K]= "PA-bits:36, VA-bits:47, 16K pages", + [VM_MODE_P51V52_64K]= "PA-bits:51, VA-bits:52, 64K pages", }; _Static_assert(sizeof(strings)/sizeof(char *) == NUM_VM_MODES, "Missing new mode strings?"); @@ -230,6 +231,7 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = { [VM_MODE_P36V48_16K]= { 36, 48, 0x4000, 14 }, [VM_MODE_P36V48_64K]= { 36, 48, 0x1, 16 }, [VM_MODE_P36V47_16K]= { 36, 47, 0x4000, 14 }, + [VM_MODE_P51V52_64K]= { 51, 52, 0x1, 16 }, }; _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct vm_guest_mode_params) == NUM_VM_MODES, "Missing new mode params?"); @@ -331,6 +333,9 @@ struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm) case VM_MODE_P44V64_4K: vm->pgtable_levels = 5; break; + case VM_MODE_P51V52_64K: + vm->pgtable_levels = 4; + break; default: TEST_FAIL("Unknown guest mode, mode: 0x%x", mode); } diff --git a/tools/testing/selftests/kvm/lib/powerpc/hcall.S b/tools/testing/selftests/kvm/lib/powerpc/hcall.S new file mode 100644 index ..a78b88f3b207 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/powerpc/hcall.S @@ -0,0 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +.globl hcall; + +hcall: + sc 1 + blr diff --git a/tools/testing/selftests/kvm/lib/powerpc/processor.c b/tools/testing/selftests/kvm/lib/powerpc/processor.c new file mode 100644 index ..2ffd5423a968 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/powerpc/processor.c @@ -0,0 +1,343 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * KVM selftest powerpc library code + * + * Copyright (C) 2021, IBM Corp. 2022? Otherwise looks good and works well and we have another test for instruction emulation on top of this which highlighted a bug so this is useful stuff. 
Reviewed-by: Alexey Kardashevskiy + */ + +#define _GNU_SOURCE +//#define DEBUG + +#include "kvm_util.h" +#include "../kvm_util_internal.h" +#include "processor.h" + +/* + * 2^(12+PRTS) = Process table size + * + * But the hardware doesn't seem to care, so 0 for now. + */ +#define PRTS 0 +#define RTS ((0x5UL << 5) | (0x2UL << 61)) /* 2^(RTS+31) = 2^52 */ +#define RPDS 0xd +#define RPDB_MASK 0x0f00UL +#define RPN_MASK 0x01fff000UL + +#define MIN_FRAME_SZ 32 + +static const int radix_64k_index_
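The "put the x in braces" review comment is the usual macro-hygiene point: without parentheses around the argument, an expression binds to the subtraction incorrectly. A minimal illustration (the second macro is the fixed form being asked for):

```c
#include <stdint.h>

#define PPC_BIT_NOBRACES(x)	(1ULL << (63 - x))	/* as posted */
#define PPC_BIT(x)		(1ULL << (63 - (x)))	/* with braces */

/* With an expression argument the two diverge:
 * PPC_BIT_NOBRACES(3 - 1) expands to 1ULL << (63 - 3 - 1) == 1ULL << 59,
 * while the intended value is 1ULL << (63 - 2) == 1ULL << 61. */
static inline uint64_t ppc_bit(unsigned int n)
{
	return PPC_BIT(n);
}
```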
Re: [PATCH kernel] powerpc/64: Add UADDR64 relocation support
On 1/31/22 17:38, Christophe Leroy wrote: Le 31/01/2022 à 05:14, Alexey Kardashevskiy a écrit : When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour: According to relocs_check.sh, this is expected to happen only with binutils < 2.19. Today minimum binutils version is 2.23 Have you observed this problem with newer version of binutils ? Oh yeah. 2.36.1. And the toolchain folks explained internally that this is correct behavior and this was a ticking bomb which exploded now and the kernel has to deal with it. #pragma GCC push_options #pragma GCC optimize ("O0") AFAIU Linux Kernel is always built with O2 Correct. Even O1 hides this. Have you observed the problem with O2 ? Yes, I see it once I enable CONFIG_PRINTK_INDEX (this is how it was spotted with my particular config; there is still a fair chance that this config option does not cause UADDR64 always) but I did not debug with it enabled as pretty much every single __func__ passed to printk caused unaligned relocation (tens of thousands). Note that this particular case can be fixed by removing __packed from "struct pi_entry" (== re-arm the bomb). Thanks, struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); #pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. This adds a workaround for the number of relocations as the DT_RELACOUNT ELF Dynamic Array Tag does not include relocations other than R_PPC64_RELATIVE.
This instead iterates over the entire .rela.dyn section. Signed-off-by: Alexey Kardashevskiy --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ... c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 p *(unsigned long *) 0x1e804dc $2 = 0x0 After: p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 --- arch/powerpc/kernel/reloc_64.S | 47 +- arch/powerpc/kernel/vmlinux.lds.S | 3 +- arch/powerpc/tools/relocs_check.sh | 6 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..a91175723d9d 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -10,6 +10,7 @@ RELA = 7 RELACOUNT = 0x6ff9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,6 +26,8 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* * Scan the dynamic section for the RELA and RELACOUNT entries. @@ -46,8 +49,8 @@ _GLOBAL(relocate) b 1b 4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq 9f + beq cr1,9f /* * Work out linktime address of _stext and hence the @@ -60,25 +63,55 @@ _GLOBAL(relocate) subfr10,r7,r10 subfr3,r10,r3 /* final_offset */ + /* +* FIXME +* Here r8 is a number of relocations in .rela.dyn. +* When ld issues UADDR64 relocations, they end up at the end +* of the .rela.dyn section. However RELACOUNT does not include +* them so the loop below is going to finish after the last +* R_PPC64_RELATIVE as they normally go first. 
+* Work out the size of .rela.dyn at compile time. +*/ + ld r8,(p_rela_end - 0b)(r12) + ld r18,(p_rela - 0b)(r12) + sub r8,r8,r18 + li r18,24 /* 24 == sizeof(elf64_rela) */ + divdr8,r8,r18 + /* * Run through the list of relocations and p
[PATCH kernel] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour: #pragma GCC push_options #pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); #pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. This adds a workaround for the number of relocations as the DT_RELACOUNT ELF Dynamic Array Tag does not include relocations other than R_PPC64_RELATIVE. This instead iterates over the entire .rela.dyn section. Signed-off-by: Alexey Kardashevskiy --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ...
c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x0 After: >>> p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 --- arch/powerpc/kernel/reloc_64.S | 47 +- arch/powerpc/kernel/vmlinux.lds.S | 3 +- arch/powerpc/tools/relocs_check.sh | 6 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..a91175723d9d 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -10,6 +10,7 @@ RELA = 7 RELACOUNT = 0x6ff9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,6 +26,8 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* * Scan the dynamic section for the RELA and RELACOUNT entries. @@ -46,8 +49,8 @@ _GLOBAL(relocate) b 1b 4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq 9f + beq cr1,9f /* * Work out linktime address of _stext and hence the @@ -60,25 +63,55 @@ _GLOBAL(relocate) subfr10,r7,r10 subfr3,r10,r3 /* final_offset */ + /* +* FIXME +* Here r8 is a number of relocations in .rela.dyn. +* When ld issues UADDR64 relocations, they end up at the end +* of the .rela.dyn section. However RELACOUNT does not include +* them so the loop below is going to finish after the last +* R_PPC64_RELATIVE as they normally go first. +* Work out the size of .rela.dyn at compile time. +*/ + ld r8,(p_rela_end - 0b)(r12) + ld r18,(p_rela - 0b)(r12) + sub r8,r8,r18 + li r18,24 /* 24 == sizeof(elf64_rela) */ + divdr8,r8,r18 + /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. 
*/ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +5: lwa r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,R_PPC64_RELATIVE bne 6f ld r6,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ - add r0,r0,r3 + b 7f + +6: cmpdi r0,R_PPC64_UADDR64 + bne 8f + ld r6,0(r9) + ld r0,16(r9) + lwa r14,12(r9) /* ELF64_R_SYM(reloc->r_info) */ + mulli r14,r14,24 /* 24 == sizeof(elf64_sym) */ + add r14,r14,r13 /* elf64_sym[ELF64_R_SYM] */ + ld r14,8(r14) + add r0,r0,r14 + +7: add r0,r0,r3 stdxr0,r7,r6 - addir9,r9,24 + +8: addir9,r9,24 bdnz5b -6: blr +9: blr .balign 8 p_dyn: .8byte __dynamic_start - 0b p_rela:.8byte __rela
[PATCH kernel v5] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
At the moment KVM on PPC creates 4 types of entries under the kvm debugfs: 1) "%pid-%fd" per KVM instance (for all platforms); 2) "vm%pid" (for PPC Book3s HV KVM); 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM); 4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS). The problem with this is that multiple VMs per process are not allowed for 2) and 3), which makes it possible for userspace to trigger errors by creating duplicated debugfs entries. This merges all these into 1). This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs(). This defines 2 hooks in kvmppc_ops that allow specific KVM implementations to add necessary entries; this also adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for. This makes use of the already existing kvm_arch_create_vcpu_debugfs() on PPC. This removes the no-longer-used debugfs_dir pointers from the PPC kvm_arch structs. This stops removing vcpu entries as, once created, vcpus stay around for the entire life of a VM and are removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories").
Suggested-by: Fabiano Rosas Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * fixed e500mc2 v4: * added "kvm-xive-%p" v3: * reworked commit log, especially, the bit about removing vcpus v2: * handled powerpc-booke * s/kvm/vm/ in arch hooks --- arch/powerpc/include/asm/kvm_host.h| 6 ++--- arch/powerpc/include/asm/kvm_ppc.h | 2 ++ arch/powerpc/kvm/timing.h | 12 +- arch/powerpc/kvm/book3s_64_mmu_hv.c| 2 +- arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +- arch/powerpc/kvm/book3s_hv.c | 31 ++ arch/powerpc/kvm/book3s_xics.c | 13 ++- arch/powerpc/kvm/book3s_xive.c | 13 ++- arch/powerpc/kvm/book3s_xive_native.c | 13 ++- arch/powerpc/kvm/e500.c| 1 + arch/powerpc/kvm/e500mc.c | 1 + arch/powerpc/kvm/powerpc.c | 16 ++--- arch/powerpc/kvm/timing.c | 21 + 13 files changed, 51 insertions(+), 82 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 17263276189e..f5e14fa683f4 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -26,6 +26,8 @@ #include #include +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS + #define KVM_MAX_VCPUS NR_CPUS #define KVM_MAX_VCORES NR_CPUS @@ -295,7 +297,6 @@ struct kvm_arch { bool dawr1_enabled; pgd_t *pgtable; u64 process_table; - struct dentry *debugfs_dir; struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */ #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */ #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE @@ -673,7 +674,6 @@ struct kvm_vcpu_arch { u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES]; u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES]; u64 timing_last_exit; - struct dentry *debugfs_exit_timing; #endif #ifdef CONFIG_PPC_BOOK3S @@ -829,8 +829,6 @@ struct kvm_vcpu_arch { struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */ struct kvmhv_tb_accumulator guest_time; /* guest execution */ struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */ - - struct dentry *debugfs_dir; #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */ }; diff --git 
a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 33db83b82fbd..d2b192dea0d2 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -316,6 +316,8 @@ struct kvmppc_ops { int (*svm_off)(struct kvm *kvm); int (*enable_dawr1)(struct kvm *kvm); bool (*hash_v3_possible)(void); + int (*create_vm_debugfs)(struct kvm *kvm); + int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry); }; extern struct kvmppc_ops *kvmppc_hv_ops; diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h index feef7885ba82..45817ab82bb4 100644 --- a/arch/powerpc/kvm/timing.h +++ b/arch/powerpc/kvm/timing.h @@ -14,8 +14,8 @@ #ifdef CONFIG_KVM_EXIT_TIMING void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu); void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu); -void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id); -void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu); +int kvmppc_create_vcpu_debugfs_e500(struct kvm_vcpu *vcpu, + struct dentry *debugfs_dentry); static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type) { @@ -26,9 +26,11 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type) /* if exit timing is not configured there is no need to build the c file */ static inline void kvmppc_
Re: [PATCH v3 5/6] KVM: PPC: mmio: Return to guest after emulation failure
On 1/10/22 18:36, Nicholas Piggin wrote: Excerpts from Fabiano Rosas's message of January 8, 2022 7:00 am: If MMIO emulation fails we don't want to crash the whole guest by returning to userspace. The original commit bbf45ba57eae ("KVM: ppc: PowerPC 440 KVM implementation") added a todo: /* XXX Deliver Program interrupt to guest. */ and later the commit d69614a295ae ("KVM: PPC: Separate loadstore emulation from priv emulation") added the Program interrupt injection but in another file, so I'm assuming it was missed that this block needed to be altered. Signed-off-by: Fabiano Rosas Reviewed-by: Alexey Kardashevskiy --- arch/powerpc/kvm/powerpc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 6daeea4a7de1..56b0faab7a5f 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -309,7 +309,7 @@ int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu) kvmppc_get_last_inst(vcpu, INST_GENERIC, _inst); kvmppc_core_queue_program(vcpu, 0); pr_info("%s: emulation failed (%08x)\n", __func__, last_inst); - r = RESUME_HOST; + r = RESUME_GUEST; So at this point can the pr_info just go away? I wonder if this shouldn't be a DSI rather than a program check. DSI with DSISR[37] looks a bit more expected. Not that Linux probably does much with it but at least it would give a SIGBUS rather than SIGILL. It does not look like it is more expected to me: it is not about wrong memory attributes, it is the instruction itself which cannot execute. DSISR[37]: Set to 1 if the access is due to a lq, stq, lwat, ldat, lbarx, lharx, lwarx, ldarx, lqarx, stwat, stdat, stbcx., sthcx., stwcx., stdcx., or stqcx. instruction that addresses storage that is Write Through Required or Caching Inhibited; or if the access is due to a copy or paste.
instruction that addresses storage that is Caching Inhibited; or if the access is due to a lwat, ldat, stwat, or stdat instruction that addresses storage that is Guarded; otherwise set to 0.
Re: [PATCH v3 4/6] KVM: PPC: mmio: Queue interrupt at kvmppc_emulate_mmio
On 08/01/2022 08:00, Fabiano Rosas wrote: If MMIO emulation fails, we queue a Program interrupt to the guest. Move that line up into kvmppc_emulate_mmio, which is where we set RESUME_GUEST/HOST. This allows the removal of the 'advance' variable. No functional change, just separation of responsibilities. Signed-off-by: Fabiano Rosas Reviewed-by: Alexey Kardashevskiy --- arch/powerpc/kvm/emulate_loadstore.c | 8 +--- arch/powerpc/kvm/powerpc.c | 2 +- 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kvm/emulate_loadstore.c b/arch/powerpc/kvm/emulate_loadstore.c index 48272a9b9c30..4dec920fe4c9 100644 --- a/arch/powerpc/kvm/emulate_loadstore.c +++ b/arch/powerpc/kvm/emulate_loadstore.c @@ -73,7 +73,6 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) { u32 inst; enum emulation_result emulated = EMULATE_FAIL; - int advance = 1; struct instruction_op op; /* this default type might be overwritten by subcategories */ @@ -355,15 +354,10 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) } } - if (emulated == EMULATE_FAIL) { - advance = 0; - kvmppc_core_queue_program(vcpu, 0); - } - trace_kvm_ppc_instr(inst, kvmppc_get_pc(vcpu), emulated); /* Advance past emulated instruction. */ - if (advance) + if (emulated != EMULATE_FAIL) kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4); return emulated; diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 4d7d0d080232..6daeea4a7de1 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -307,7 +307,7 @@ int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu) u32 last_inst; kvmppc_get_last_inst(vcpu, INST_GENERIC, _inst); - /* XXX Deliver Program interrupt to guest. */ + kvmppc_core_queue_program(vcpu, 0); pr_info("%s: emulation failed (%08x)\n", __func__, last_inst); r = RESUME_HOST; break;
Re: [PATCH v2 6/7] KVM: PPC: mmio: Return to guest after emulation failure
On 07/01/2022 07:03, Fabiano Rosas wrote: If MMIO emulation fails we don't want to crash the whole guest by returning to userspace. The original commit bbf45ba57eae ("KVM: ppc: PowerPC 440 KVM implementation") added a todo: /* XXX Deliver Program interrupt to guest. */ and later the commit d69614a295ae ("KVM: PPC: Separate loadstore emulation from priv emulation") added the Program interrupt injection but in another file, so I'm assuming it was missed that this block needed to be altered. Signed-off-by: Fabiano Rosas Looks right. Reviewed-by: Alexey Kardashevskiy but this means if I want to keep debugging those kvm selftests in comfort, I'll have to have some exception handlers in the vm as otherwise the failing $pc is lost after this change :) --- arch/powerpc/kvm/powerpc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index a2e78229d645..50e08635e18a 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -309,7 +309,7 @@ int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu) kvmppc_get_last_inst(vcpu, INST_GENERIC, _inst); kvmppc_core_queue_program(vcpu, 0); pr_info("%s: emulation failed (%08x)\n", __func__, last_inst); - r = RESUME_HOST; + r = RESUME_GUEST; break; } default: -- Alexey
Re: [PATCH v2 3/7] KVM: PPC: Fix mmio length message
On 07/01/2022 07:03, Fabiano Rosas wrote: We check against 'bytes' but print 'run->mmio.len' which at that point has an old value. e.g. 16-byte load: before: __kvmppc_handle_load: bad MMIO length: 8 now: __kvmppc_handle_load: bad MMIO length: 16 Signed-off-by: Fabiano Rosas --- arch/powerpc/kvm/powerpc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 92e552ab5a77..0b0818d032e1 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -1246,7 +1246,7 @@ static int __kvmppc_handle_load(struct kvm_vcpu *vcpu, if (bytes > sizeof(run->mmio.data)) { printk(KERN_ERR "%s: bad MMIO length: %d\n", __func__, - run->mmio.len); + bytes); "return EMULATE_FAIL;" here and below as there is really no point in trashing kvm_run::mmio (not much harm too but still) and this code does not handle more than 8 bytes anyway. } run->mmio.phys_addr = vcpu->arch.paddr_accessed; @@ -1335,7 +1335,7 @@ int kvmppc_handle_store(struct kvm_vcpu *vcpu, if (bytes > sizeof(run->mmio.data)) { printk(KERN_ERR "%s: bad MMIO length: %d\n", __func__, - run->mmio.len); + bytes); } run->mmio.phys_addr = vcpu->arch.paddr_accessed; -- Alexey
Re: [PATCH 2/3] KVM: PPC: Fix vmx/vsx mixup in mmio emulation
On 28/12/2021 04:28, Fabiano Rosas wrote: Nicholas Piggin writes: Excerpts from Fabiano Rosas's message of December 24, 2021 7:15 am: The MMIO emulation code for vector instructions is duplicated between VSX and VMX. When emulating VMX we should check the VMX copy size instead of the VSX one. Fixes: acc9eb9305fe ("KVM: PPC: Reimplement LOAD_VMX/STORE_VMX instruction ...") Signed-off-by: Fabiano Rosas Good catch. AFAIKS handle_vmx_store needs the same treatment? If you agree then Half the bug now, half the bug next year... haha I'll send a v2. aside: All this duplication is kind of annoying. I'm looking into what it would take to have quadword instruction emulation here as well (Alexey caught a bug with syzkaller) and the code would be really similar. I see that x86 has a more generic implementation that maybe we could take advantage of. See "f78146b0f923 (KVM: Fix page-crossing MMIO)" Uff. My head exploded with vsx/vmx/vec :) But this seems to have fixed "lvx" (which is VMX, right?). Tested with: https://github.com/aik/linux/commits/my_kvm_tests -- Alexey
[PATCH llvm 6/6] powerpc/mm/book3s64/hash: Switch pre 2.06 tlbiel to .long
The llvm integrated assembler does not recognise the ISA 2.05 tlbiel version. Work around it by switching to .long when an old arch level is detected. Signed-off-by: Daniel Axtens [aik: did "Eventually do this more smartly"] Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/ppc-opcode.h | 2 ++ arch/powerpc/mm/book3s64/hash_native.c | 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 9fe3223e7820..efad07081cc0 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -394,6 +394,7 @@ (0x7c000264 | ___PPC_RB(rb) | ___PPC_RS(rs) | ___PPC_RIC(ric) | ___PPC_PRS(prs) | ___PPC_R(r)) #define PPC_RAW_TLBIEL(rb, rs, ric, prs, r) \ (0x7c000224 | ___PPC_RB(rb) | ___PPC_RS(rs) | ___PPC_RIC(ric) | ___PPC_PRS(prs) | ___PPC_R(r)) +#define PPC_RAW_TLBIEL_v205(rb, l) (0x7c000224 | ___PPC_RB(rb) | (l << 21)) #define PPC_RAW_TLBSRX_DOT(a, b) (0x7c0006a5 | __PPC_RA0(a) | __PPC_RB(b)) #define PPC_RAW_TLBIVAX(a, b) (0x7c000624 | __PPC_RA0(a) | __PPC_RB(b)) #define PPC_RAW_ERATWE(s, a, w) (0x7c0001a6 | __PPC_RS(s) | __PPC_RA(a) | __PPC_WS(w)) @@ -606,6 +607,7 @@ stringify_in_c(.long PPC_RAW_TLBIE_5(rb, rs, ric, prs, r)) #define PPC_TLBIEL(rb,rs,ric,prs,r) \ stringify_in_c(.long PPC_RAW_TLBIEL(rb, rs, ric, prs, r)) +#define PPC_TLBIEL_v205(rb, l) stringify_in_c(.long PPC_RAW_TLBIEL_v205(rb, l)) #define PPC_TLBSRX_DOT(a, b) stringify_in_c(.long PPC_RAW_TLBSRX_DOT(a, b)) #define PPC_TLBIVAX(a, b) stringify_in_c(.long PPC_RAW_TLBIVAX(a, b)) diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c index d2a320828c0b..623a7b7ab38b 100644 --- a/arch/powerpc/mm/book3s64/hash_native.c +++ b/arch/powerpc/mm/book3s64/hash_native.c @@ -163,7 +163,7 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize) va |= ssize << 8; sllp = get_sllp_encoding(apsize); va |= sllp << 5; - asm
volatile(ASM_FTR_IFSET("tlbiel %0", "tlbiel %0,0", %1) + asm volatile(ASM_FTR_IFSET("tlbiel %0", PPC_TLBIEL_v205(%0, 0), %1) : : "r" (va), "i" (CPU_FTR_ARCH_206) : "memory"); break; @@ -182,7 +182,7 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize) */ va |= (vpn & 0xfe); va |= 1; /* L */ - asm volatile(ASM_FTR_IFSET("tlbiel %0", "tlbiel %0,1", %1) + asm volatile(ASM_FTR_IFSET("tlbiel %0", PPC_TLBIEL_v205(%0, 1), %1) : : "r" (va), "i" (CPU_FTR_ARCH_206) : "memory"); break; -- 2.30.2
[PATCH llvm 5/6] powerpc/mm: Switch obsolete dssall to .long
The dssall ("Data Stream Stop All") instruction is obsolete, along with other Data Cache Instructions, since ISA 2.03 (year 2006). LLVM IAS does not support it but PPC970 seems to be using it. This switches dssall to .long as there is not much point in fixing LLVM. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/ppc-opcode.h | 2 ++ arch/powerpc/kernel/idle.c | 2 +- arch/powerpc/mm/mmu_context.c | 2 +- arch/powerpc/kernel/idle_6xx.S | 2 +- arch/powerpc/kernel/l2cr_6xx.S | 6 +++--- arch/powerpc/kernel/swsusp_32.S | 2 +- arch/powerpc/kernel/swsusp_asm64.S | 2 +- arch/powerpc/platforms/powermac/cache.S | 4 ++-- 8 files changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index f50213e2a3e0..9fe3223e7820 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -249,6 +249,7 @@ #define PPC_INST_COPY 0x7c20060c #define PPC_INST_DCBA 0x7c0005ec #define PPC_INST_DCBA_MASK 0xfc0007fe +#define PPC_INST_DSSALL 0x7e00066c #define PPC_INST_ISEL 0x7c1e #define PPC_INST_ISEL_MASK 0xfc3e #define PPC_INST_LSWI 0x7c0004aa @@ -577,6 +578,7 @@ #define PPC_DCBZL(a, b) stringify_in_c(.long PPC_RAW_DCBZL(a, b)) #define PPC_DIVDE(t, a, b) stringify_in_c(.long PPC_RAW_DIVDE(t, a, b)) #define PPC_DIVDEU(t, a, b) stringify_in_c(.long PPC_RAW_DIVDEU(t, a, b)) +#define PPC_DSSALL stringify_in_c(.long PPC_INST_DSSALL) #define PPC_LQARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LQARX(t, a, b, eh)) #define PPC_STQCX(t, a, b) stringify_in_c(.long PPC_RAW_STQCX(t, a, b)) #define PPC_MADDHD(t, a, b, c) stringify_in_c(.long PPC_RAW_MADDHD(t, a, b, c)) diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c index 1f835539fda4..4ad79eb638c6 100644 --- a/arch/powerpc/kernel/idle.c +++ b/arch/powerpc/kernel/idle.c @@ -82,7 +82,7 @@ void power4_idle(void) return; if (cpu_has_feature(CPU_FTR_ALTIVEC)) - asm volatile("DSSALL ; sync" ::: "memory"); + asm
volatile(PPC_DSSALL " ; sync" ::: "memory"); power4_idle_nap(); diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c index 735c36f26388..1fb9c99f8679 100644 --- a/arch/powerpc/mm/mmu_context.c +++ b/arch/powerpc/mm/mmu_context.c @@ -90,7 +90,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * context */ if (cpu_has_feature(CPU_FTR_ALTIVEC)) - asm volatile ("dssall"); + asm volatile (PPC_DSSALL); if (!new_on_cpu) membarrier_arch_switch_mm(prev, next, tsk); diff --git a/arch/powerpc/kernel/idle_6xx.S b/arch/powerpc/kernel/idle_6xx.S index 13cad9297d82..3c097356366b 100644 --- a/arch/powerpc/kernel/idle_6xx.S +++ b/arch/powerpc/kernel/idle_6xx.S @@ -129,7 +129,7 @@ BEGIN_FTR_SECTION END_FTR_SECTION_IFCLR(CPU_FTR_NO_DPM) mtspr SPRN_HID0,r4 BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL sync END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) lwz r8,TI_LOCAL_FLAGS(r2) /* set napping bit */ diff --git a/arch/powerpc/kernel/l2cr_6xx.S b/arch/powerpc/kernel/l2cr_6xx.S index 225511d73bef..f2e03ed423d0 100644 --- a/arch/powerpc/kernel/l2cr_6xx.S +++ b/arch/powerpc/kernel/l2cr_6xx.S @@ -96,7 +96,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_L2CR) /* Stop DST streams */ BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL sync END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) @@ -292,7 +292,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_L3CR) isync /* Stop DST streams */ - DSSALL + PPC_DSSALL sync /* Get the current enable bit of the L3CR into r4 */ @@ -401,7 +401,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_L3CR) _GLOBAL(__flush_disable_L1) /* Stop pending alitvec streams and memory accesses */ BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) sync diff --git a/arch/powerpc/kernel/swsusp_32.S b/arch/powerpc/kernel/swsusp_32.S index f73f4d72fea4..e0cbd63007f2 100644 --- a/arch/powerpc/kernel/swsusp_32.S +++ b/arch/powerpc/kernel/swsusp_32.S @@ -181,7 +181,7 @@ _GLOBAL(swsusp_arch_resume) #ifdef CONFIG_ALTIVEC /* Stop pending alitvec streams and memory accesses */ 
BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) #endif sync diff --git a/arch/powerpc/kernel/swsusp_asm64.S b/arch/powerpc/kernel/swsusp_asm64.S index 96bb20715aa9..9f1903c7f540 100644 --- a/arch/powerpc/kernel/swsusp_asm64.S +++ b/arch/powerpc/kernel/swsusp_asm64.S @@ -141,7 +141,7 @@ END_FW_FTR_SECTION_IFCLR(F
[PATCH llvm 4/6] powerpc/64/asm: Do not reassign labels
From: Daniel Axtens The LLVM integrated assembler really does not like us reassigning things to the same label: :7:9: error: invalid reassignment of non-absolute variable 'fs_label' This happens across a bunch of platforms: https://github.com/ClangBuiltLinux/linux/issues/1043 https://github.com/ClangBuiltLinux/linux/issues/1008 https://github.com/ClangBuiltLinux/linux/issues/920 https://github.com/ClangBuiltLinux/linux/issues/1050 There is no hope of getting this fixed in LLVM (see https://github.com/ClangBuiltLinux/linux/issues/1043#issuecomment-641571200 and https://bugs.llvm.org/show_bug.cgi?id=47798#c1 ) so if we want to build with LLVM_IAS, we need to hack around it ourselves. For us the big problem comes from this: #define USE_FIXED_SECTION(sname) \ fs_label = start_##sname; \ fs_start = sname##_start; \ use_ftsec sname; #define USE_TEXT_SECTION() fs_label = start_text; \ fs_start = text_start; \ .text and in particular fs_label. This works around it by not setting those 'variables' and requiring that users of the variables instead track for themselves what section they are in. This isn't amazing, by any stretch, but it gets us further in the compilation.
Note that even though users have to keep track of the section, using a wrong one produces an error with both binutils and llvm, which prevents using the wrong section at compile time: llvm error example: AS arch/powerpc/kernel/head_64.o :0: error: Cannot represent a difference across sections make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1 binutils error example: /home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S: Assembler messages: /home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S:1974: Error: can't resolve `system_call_common' {.text section} - `start_real_vectors' {.head.text.real_vectors section} make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1 Signed-off-by: Daniel Axtens Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/head-64.h | 12 +-- arch/powerpc/kernel/exceptions-64s.S | 32 ++-- arch/powerpc/kernel/head_64.S| 18 arch/powerpc/kernel/interrupt_64.S | 2 +- 4 files changed, 31 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/include/asm/head-64.h b/arch/powerpc/include/asm/head-64.h index 242204e12993..d73153b0275d 100644 --- a/arch/powerpc/include/asm/head-64.h +++ b/arch/powerpc/include/asm/head-64.h @@ -98,13 +98,9 @@ linker_stub_catch: \ . = sname##_len; #define USE_FIXED_SECTION(sname) \ - fs_label = start_##sname; \ - fs_start = sname##_start; \ use_ftsec sname; #define USE_TEXT_SECTION() \ - fs_label = start_text; \ - fs_start = text_start; \ .text #define CLOSE_FIXED_SECTION(sname) \ @@ -161,13 +157,15 @@ end_##sname: * - ABS_ADDR is used to find the absolute address of any symbol, from within * a fixed section.
*/ -#define DEFINE_FIXED_SYMBOL(label) \ - label##_absolute = (label - fs_label + fs_start) +// define label as being _in_ sname +#define DEFINE_FIXED_SYMBOL(label, sname) \ + label##_absolute = (label - start_ ## sname + sname ## _start) #define FIXED_SYMBOL_ABS_ADDR(label) \ (label##_absolute) -#define ABS_ADDR(label) (label - fs_label + fs_start) +// find label from _within_ sname +#define ABS_ADDR(label, sname) (label - start_ ## sname + sname ## _start) #endif /* __ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 83d37678f7cf..44b70bf535e3 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -48,7 +48,7 @@ .balign IFETCH_ALIGN_BYTES; \ .global name; \ _ASM_NOKPROBE_SYMBOL(name); \ - DEFINE_FIXED_SYMBOL(name); \ + DEFINE_FIXED_SYMBOL(name, text);\ name: #define TRAMP_REAL_BEGIN(name) \ @@ -76,18 +76,18 @@ name: ld reg,PACAKBASE(r13); /* get high part of */ \ ori reg,reg,FIXED_SYMBOL_ABS_ADDR(label) -#define __LOAD_HANDLER(reg, label) \ +#define
[PATCH llvm 3/6] powerpc/64/asm: Inline BRANCH_TO_C000
It is used just once and does not really help with readability, remove it. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/exceptions-64s.S | 17 +++-- 1 file changed, 3 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a30f563bc7a8..83d37678f7cf 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -89,19 +89,6 @@ name: ori reg,reg,(ABS_ADDR(label))@l;\ addis reg,reg,(ABS_ADDR(label))@h -/* - * Branch to label using its 0xC000 address. This results in instruction - * address suitable for MSR[IR]=0 or 1, which allows relocation to be turned - * on using mtmsr rather than rfid. - * - * This could set the 0xc bits for !RELOCATABLE as an immediate, rather than - * load KBASE for a slight optimisation. - */ -#define BRANCH_TO_C000(reg, label) \ - __LOAD_FAR_HANDLER(reg, label); \ - mtctr reg;\ - bctr - /* * Interrupt code generation macros */ @@ -962,7 +949,9 @@ TRAMP_REAL_BEGIN(system_reset_idle_wake) /* We are waking up from idle, so may clobber any volatile register */ cmpwi cr1,r5,2 bltlr cr1 /* no state loss, return to idle caller with r3=SRR1 */ - BRANCH_TO_C000(r12, DOTSYM(idle_return_gpr_loss)) + __LOAD_FAR_HANDLER(r12, DOTSYM(idle_return_gpr_loss)) + mtctr r12 + bctr #endif #ifdef CONFIG_PPC_PSERIES -- 2.30.2
[PATCH llvm 2/6] powerpc: check for support for -Wa,-m{power4,any}
From: Daniel Axtens

LLVM's integrated assembler does not like either -Wa,-mpower4 or -Wa,-many. So just don't pass them if they're not supported.

Signed-off-by: Daniel Axtens
Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index e9aa4e8b07dd..5f16ac1583c5 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -245,7 +245,9 @@ cpu-as-$(CONFIG_E500)		+= -Wa,-me500

 # When using '-many -mpower4' gas will first try and find a matching power4
 # mnemonic and failing that it will allow any valid mnemonic that GAS knows
 # about. GCC will pass -many to GAS when assembling, clang does not.
-cpu-as-$(CONFIG_PPC_BOOK3S_64)	+= -Wa,-mpower4 -Wa,-many
+# LLVM IAS doesn't understand either flag: https://github.com/ClangBuiltLinux/linux/issues/675
+# but LLVM IAS only supports ISA >= 2.06 for Book3S 64 anyway...
+cpu-as-$(CONFIG_PPC_BOOK3S_64)	+= $(call as-option,-Wa$(comma)-mpower4) $(call as-option,-Wa$(comma)-many)
 cpu-as-$(CONFIG_PPC_E500MC)	+= $(call as-option,-Wa$(comma)-me500mc)

 KBUILD_AFLAGS += $(cpu-as-y)
--
2.30.2
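The `as-option` helper used above probes whether the assembler actually accepts a flag before adding it. A rough user-space sketch of what it does (the `probe_as_flag` name and the `CC` default of gcc are assumptions for this illustration, not Kbuild internals):

```shell
# Feed the assembler an empty program with the candidate flag and report
# whether it was accepted; an unknown flag (or missing compiler) counts
# as "not supported", which is exactly why -Wa,-many is now optional.
probe_as_flag() {
    if echo '' | "${CC:-gcc}" -Wa,"$1" -c -x assembler -o /dev/null - 2>/dev/null; then
        echo "supported"
    else
        echo "not supported"
    fi
}

probe_as_flag -many
```

With this in place, gas keeps getting `-mpower4 -many` while LLVM IAS simply never sees the flags it would reject.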
[PATCH llvm 1/6] powerpc/toc: PowerPC64 future proof kernel toc, revised for lld
From: Alan Modra

This patch future-proofs the kernel against linker changes that might put the toc pointer at some location other than .got+0x8000, by replacing __toc_start+0x8000 with .TOC. throughout. If the kernel's idea of the toc pointer doesn't agree with the linker, bad things happen.

prom_init.c code relocating its toc is also changed so that a symbolic __prom_init_toc_start toc-pointer relative address is calculated rather than assuming that it is always at toc-pointer - 0x8000. The length calculations loading values from the toc are also avoided. It's a little incestuous to do that with unreloc_toc picking up adjusted values (which is fine in practice, they both adjust by the same amount if all goes well).

I've also changed the way .got is aligned in vmlinux.lds and zImage.lds, mostly so that dumping out section info by objdump or readelf plainly shows the alignment is 256. This linker script feature was added 2005-09-27, available in FSF binutils releases from 2.17 onwards. Should be safe to use in the kernel, I think.

Finally, put *(.got) before the prom_init.o entry which only needs *(.toc), so that the GOT header goes in the correct place. I don't believe this makes any difference for the kernel as it would for dynamic objects being loaded by ld.so. That change is just to stop lusers who blindly copy kernel scripts being led astray. Of course, this change needs the prom_init.c changes.

Some notes on .toc and .got. .toc is a compiler generated section of addresses. .got is a linker generated section of addresses, generally built when the linker sees R_*_*GOT* relocations. In the case of powerpc64 ld.bfd, there are multiple generated .got sections, one per input object file. So you can somewhat reasonably write in a linker script an input section statement like *prom_init.o(.got .toc) to mean "the .got and .toc section for files matching *prom_init.o".
On other architectures that doesn't make sense, because the linker generally has just one .got section. Even on powerpc64, note well that the GOT entries for prom_init.o may be merged with GOT entries from other objects. That means that if prom_init.o references, say, _end via some GOT relocation, and some other object also references _end via a GOT relocation, the GOT entry for _end may be in the range __prom_init_toc_start to __prom_init_toc_end and if the kernel does something special to GOT/TOC entries in that range then the value of _end as seen by objects other than prom_init.o will be affected. On the other hand the GOT entry for _end may not be in the range __prom_init_toc_start to __prom_init_toc_end. Which way it turns out is deterministic but a detail of linker operation that should not be relied on.

A feature of ld.bfd is that input .toc (and .got) sections matching one linker input section statement may be sorted, to put entries used by small-model code first, near the toc base. This is why scripts for powerpc64 normally use *(.got .toc) rather than *(.got) *(.toc), since the first form allows more freedom to sort.

Another feature of ld.bfd is that indirect addressing sequences using the GOT/TOC may be edited by the linker to relative addressing. In many cases relative addressing would be emitted by gcc for -mcmodel=medium if you appropriately decorate variable declarations with non-default visibility.
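The sorting point reads more concretely as a linker-script fragment. This is an illustrative sketch in GNU ld script syntax, not a copy of the kernel's actual vmlinux.lds.S:

```
.got : ALIGN(256) {
	/* One input-section statement: ld.bfd is free to sort all the
	   matching .got/.toc entries together, putting entries used by
	   small-model code nearest the toc base. */
	*(.got .toc)
}
```

Writing `*(.got) *(.toc)` instead would pin all .got entries ahead of all .toc entries and take that sorting freedom away, which is the distinction the paragraph above draws.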
The original patch is here: https://lore.kernel.org/linuxppc-dev/20210310034813.gm6...@bubble.grove.modra.org/

Signed-off-by: Alan Modra
[aik: removed non-relocatable which is gone in 24d33ac5b8ffb]
[aik: added <=2.24 check]
[aik: because of llvm-as, kernel_toc_addr() uses "mr" instead of global register variable]
Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/Makefile               |  5 +++--
 arch/powerpc/include/asm/sections.h | 14 +++---
 arch/powerpc/boot/crt0.S            |  2 +-
 arch/powerpc/boot/zImage.lds.S      |  7 ++-
 arch/powerpc/kernel/head_64.S       |  2 +-
 arch/powerpc/kernel/vmlinux.lds.S   |  8 +++-
 6 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index e02568f17334..e9aa4e8b07dd 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -445,10 +445,11 @@ PHONY += checkbin
 # Check toolchain versions:
 # - gcc-4.6 is the minimum kernel-wide version so nothing required.
 checkbin:
-	@if test "x${CONFIG_CPU_LITTLE_ENDIAN}" = "xy" \
-	    && $(LD) --version | head -1 | grep ' 2\.24$$' >/dev/null ; then \
+	@if test "x${CONFIG_LD_IS_LLD}" != "xy" -a \
+		"x$(call ld-ifversion, -le, 22400, y)" = "xy" ; then \
 		echo -n '*** binutils 2.24 miscompiles weak symbols ' ; \
 		echo 'in some circumstances.' ; \
+		echo '*** binutils 2.23 do not define the TOC symbol ' ; \
 		echo -n '*** Please use a different binutils versi
[PATCH kernel 0/6] powerpc: Build with LLVM_IAS=1
This allows compiling the upstream Linux with the upstream llvm with one fix on top: https://reviews.llvm.org/D115419

This is based on sha1 798527287598 Michael Ellerman "Automatic merge of 'next' into merge (2021-12-14 00:12)".

Please comment. Thanks.

Alan Modra (1):
  powerpc/toc: PowerPC64 future proof kernel toc, revised for lld

Alexey Kardashevskiy (3):
  powerpc/64/asm: Inline BRANCH_TO_C000
  powerpc/mm: Switch obsolete dssall to .long
  powerpc/mm/book3s64/hash: Switch pre 2.06 tlbiel to .long

Daniel Axtens (2):
  powerpc: check for support for -Wa,-m{power4,any}
  powerpc/64/asm: Do not reassign labels

 arch/powerpc/Makefile                   |  9 +++--
 arch/powerpc/include/asm/head-64.h      | 12 +++
 arch/powerpc/include/asm/ppc-opcode.h   |  4 +++
 arch/powerpc/include/asm/sections.h     | 14 
 arch/powerpc/kernel/idle.c              |  2 +-
 arch/powerpc/mm/book3s64/hash_native.c  |  4 +--
 arch/powerpc/mm/mmu_context.c           |  2 +-
 arch/powerpc/boot/crt0.S                |  2 +-
 arch/powerpc/boot/zImage.lds.S          |  7 ++--
 arch/powerpc/kernel/exceptions-64s.S    | 47 ++---
 arch/powerpc/kernel/head_64.S           | 20 +--
 arch/powerpc/kernel/idle_6xx.S          |  2 +-
 arch/powerpc/kernel/interrupt_64.S      |  2 +-
 arch/powerpc/kernel/l2cr_6xx.S          |  6 ++--
 arch/powerpc/kernel/swsusp_32.S         |  2 +-
 arch/powerpc/kernel/swsusp_asm64.S      |  2 +-
 arch/powerpc/kernel/vmlinux.lds.S       |  8 ++---
 arch/powerpc/platforms/powermac/cache.S |  4 +--
 18 files changed, 69 insertions(+), 80 deletions(-)

--
2.30.2
Re: [PATCH kernel v4] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
On 12/20/21 18:29, Cédric Le Goater wrote:
> On 12/20/21 02:23, Alexey Kardashevskiy wrote:
>> At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
>> 1) "%pid-%fd" per a KVM instance (for all platforms);
>> 2) "vm%pid" (for PPC Book3s HV KVM);
>> 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
>> 4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);
>>
>> The problem with this is that multiple VMs per process is not allowed for
>> 2) and 3) which makes it possible for userspace to trigger errors when
>> creating duplicated debugfs entries.
>>
>> This merges all these into 1).
>>
>> This defines kvm_arch_create_kvm_debugfs() similar to
>> kvm_arch_create_vcpu_debugfs().
>>
>> This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
>> add necessary entries, this adds the _e500 suffix to
>> kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.
>>
>> This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.
>>
>> This removes no more used debugfs_dir pointers from PPC kvm_arch structs.
>>
>> This stops removing vcpu entries as once created vcpus stay around
>> for the entire life of a VM and removed when the KVM instance is closed,
>> see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
>> debugfs directories").
>
> Reviewed-by: Cédric Le Goater
>
> One comment below.
>
>> ---
>> Changes:
>> v4:
>> * added "kvm-xive-%p"
>>
>> v3:
>> * reworked commit log, especially, the bit about removing vcpus
>>
>> v2:
>> * handled powerpc-booke
>> * s/kvm/vm/ in arch hooks
>> ---
>> arch/powerpc/include/asm/kvm_host.h    | 6 ++---
>> arch/powerpc/include/asm/kvm_ppc.h     | 2 ++
>> arch/powerpc/kvm/timing.h              | 9 
>> arch/powerpc/kvm/book3s_64_mmu_hv.c    | 2 +-
>> arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
>> arch/powerpc/kvm/book3s_hv.c           | 31 ++
>> arch/powerpc/kvm/book3s_xics.c         | 13 ++-
>> arch/powerpc/kvm/book3s_xive.c         | 13 ++-
>> arch/powerpc/kvm/book3s_xive_native.c  | 13 ++-
>> arch/powerpc/kvm/e500.c                | 1 +
>> arch/powerpc/kvm/e500mc.c              | 1 +
>> arch/powerpc/kvm/powerpc.c             | 16 ++---
>> arch/powerpc/kvm/timing.c              | 20 -
>> 13 files changed, 47 insertions(+), 82 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>> b/arch/powerpc/include/asm/kvm_host.h
>> index 17263276189e..f5e14fa683f4 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -26,6 +26,8 @@
>> #include
>> #include
>>
>> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>> +
>> #define KVM_MAX_VCPUS NR_CPUS
>> #define KVM_MAX_VCORES NR_CPUS
>>
>> @@ -295,7 +297,6 @@ struct kvm_arch {
>> bool dawr1_enabled;
>> pgd_t *pgtable;
>> u64 process_table;
>> -struct dentry *debugfs_dir;
>> struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
>> #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
>> #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
>> @@ -673,7 +674,6 @@ struct kvm_vcpu_arch {
>> u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
>> u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
>> u64 timing_last_exit;
>> -struct dentry *debugfs_exit_timing;
>> #endif
>>
>> #ifdef CONFIG_PPC_BOOK3S
>> @@ -829,8 +829,6 @@ struct kvm_vcpu_arch {
>> struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
>> struct kvmhv_tb_accumulator guest_time; /* guest execution */
>> struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */
>> -
>> -struct dentry *debugfs_dir;
>> #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
>> };
>>
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>> b/arch/powerpc/include/asm/kvm_ppc.h
>> index 33db83b82fbd..d2b192dea0d2 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -316,6 +316,8 @@ struct kvmppc_ops {
>> int (*svm_off)(struct kvm *kvm);
>> int (*enable_dawr1)(struct kvm *kvm);
>> bool (*hash_v
[PATCH kernel v4] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);

The problem with this is that multiple VMs per process is not allowed for 2) and 3) which makes it possible for userspace to trigger errors when creating duplicated debugfs entries.

This merges all these into 1).

This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs().

This defines 2 hooks in kvmppc_ops that allow specific KVM implementations add necessary entries, this adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.

This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

This stops removing vcpu entries as once created vcpus stay around for the entire life of a VM and removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories").
Suggested-by: Fabiano Rosas
Signed-off-by: Alexey Kardashevskiy
---
Changes:
v4:
* added "kvm-xive-%p"

v3:
* reworked commit log, especially, the bit about removing vcpus

v2:
* handled powerpc-booke
* s/kvm/vm/ in arch hooks
---
 arch/powerpc/include/asm/kvm_host.h    | 6 ++---
 arch/powerpc/include/asm/kvm_ppc.h     | 2 ++
 arch/powerpc/kvm/timing.h              | 9 
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 arch/powerpc/kvm/book3s_hv.c           | 31 ++
 arch/powerpc/kvm/book3s_xics.c         | 13 ++-
 arch/powerpc/kvm/book3s_xive.c         | 13 ++-
 arch/powerpc/kvm/book3s_xive_native.c  | 13 ++-
 arch/powerpc/kvm/e500.c                | 1 +
 arch/powerpc/kvm/e500mc.c              | 1 +
 arch/powerpc/kvm/powerpc.c             | 16 ++---
 arch/powerpc/kvm/timing.c              | 20 -
 13 files changed, 47 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 17263276189e..f5e14fa683f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -26,6 +26,8 @@
 #include
 #include

+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+
 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS

@@ -295,7 +297,6 @@ struct kvm_arch {
 	bool dawr1_enabled;
 	pgd_t *pgtable;
 	u64 process_table;
-	struct dentry *debugfs_dir;
 	struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -673,7 +674,6 @@ struct kvm_vcpu_arch {
 	u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_last_exit;
-	struct dentry *debugfs_exit_timing;
 #endif

 #ifdef CONFIG_PPC_BOOK3S
@@ -829,8 +829,6 @@ struct kvm_vcpu_arch {
 	struct kvmhv_tb_accumulator rm_exit;	/* real-mode exit code */
 	struct kvmhv_tb_accumulator guest_time;	/* guest execution */
 	struct kvmhv_tb_accumulator cede_time;	/* time napping inside guest */
-
-	struct dentry *debugfs_dir;
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 33db83b82fbd..d2b192dea0d2 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
 	int (*svm_off)(struct kvm *kvm);
 	int (*enable_dawr1)(struct kvm *kvm);
 	bool (*hash_v3_possible)(void);
+	int (*create_vm_debugfs)(struct kvm *kvm);
+	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
 };

 extern struct kvmppc_ops *kvmppc_hv_ops;

diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
index feef7885ba82..493a7d510fd5 100644
--- a/arch/powerpc/kvm/timing.h
+++ b/arch/powerpc/kvm/timing.h
@@ -14,8 +14,8 @@
 #ifdef CONFIG_KVM_EXIT_TIMING
 void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
 void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
-void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
-void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
+void kvmppc_create_vcpu_debugfs_e500(struct kvm_vcpu *vcpu,
+				     struct dentry *debugfs_dentry);

 static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 {
@@ -26,9 +26,8 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 /* if exit timing is not configured there is no need to build the c file */
 static inline void kvmppc_
Re: [PATCH kernel v3] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
On 12/16/21 05:11, Cédric Le Goater wrote:
> On 12/15/21 02:33, Alexey Kardashevskiy wrote:
>> At the moment KVM on PPC creates 3 types of entries under the kvm debugfs:
>> 1) "%pid-%fd" per a KVM instance (for all platforms);
>> 2) "vm%pid" (for PPC Book3s HV KVM);
>> 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM).
>>
>> The problem with this is that multiple VMs per process is not allowed for
>> 2) and 3) which makes it possible for userspace to trigger errors when
>> creating duplicated debugfs entries.
>>
>> This merges all these into 1).
>>
>> This defines kvm_arch_create_kvm_debugfs() similar to
>> kvm_arch_create_vcpu_debugfs().
>>
>> This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
>> add necessary entries, this adds the _e500 suffix to
>> kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.
>>
>> This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.
>>
>> This removes no more used debugfs_dir pointers from PPC kvm_arch structs.
>>
>> This stops removing vcpu entries as once created vcpus stay around
>> for the entire life of a VM and removed when the KVM instance is closed,
>> see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
>> debugfs directories").
>
> It would be nice to also move the KVM device debugfs files :
>
> /sys/kernel/debug/powerpc/kvm-xive-%p
>
> These are dynamically created and destroyed at run time depending
> on the interrupt mode negotiated by CAS. It might be more complex ?
With this addition:

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 99db9ac49901..511f643e2875 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1267,10 +1267,10 @@ static void xive_native_debugfs_init(struct kvmppc_xive *xive)
 		return;
 	}

-	xive->dentry = debugfs_create_file(name, 0444, arch_debugfs_dir,
+	xive->dentry = debugfs_create_file(name, 0444, xive->kvm->debugfs_dentry,
 					   xive, _native_debug_fops);

it looks fine, this is "before":

root@zz1:/sys/kernel/debug# find -iname "*xive*"
./slab/xive-provision
./powerpc/kvm-xive-c000208c
./powerpc/xive

and this is "after" the patch applied.

root@zz1:/sys/kernel/debug# find -iname "*xive*"
./kvm/29058-11/kvm-xive-c000208c
./slab/xive-provision
./powerpc/xive

I'll repost unless there is something more to it.

Thanks,

--
Alexey
[PATCH kernel v3] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
At the moment KVM on PPC creates 3 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM).

The problem with this is that multiple VMs per process is not allowed for 2) and 3) which makes it possible for userspace to trigger errors when creating duplicated debugfs entries.

This merges all these into 1).

This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs().

This defines 2 hooks in kvmppc_ops that allow specific KVM implementations add necessary entries, this adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.

This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

This stops removing vcpu entries as once created vcpus stay around for the entire life of a VM and removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories").
Suggested-by: Fabiano Rosas
Signed-off-by: Alexey Kardashevskiy
---
Changes:
v3:
* reworked commit log, especially, the bit about removing vcpus

v2:
* handled powerpc-booke
* s/kvm/vm/ in arch hooks
---
 arch/powerpc/include/asm/kvm_host.h    | 6 ++---
 arch/powerpc/include/asm/kvm_ppc.h     | 2 ++
 arch/powerpc/kvm/timing.h              | 9 
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 arch/powerpc/kvm/book3s_hv.c           | 31 ++
 arch/powerpc/kvm/e500.c                | 1 +
 arch/powerpc/kvm/e500mc.c              | 1 +
 arch/powerpc/kvm/powerpc.c             | 16 ++---
 arch/powerpc/kvm/timing.c              | 20 -
 10 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 17263276189e..f5e14fa683f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -26,6 +26,8 @@
 #include
 #include

+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+
 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS

@@ -295,7 +297,6 @@ struct kvm_arch {
 	bool dawr1_enabled;
 	pgd_t *pgtable;
 	u64 process_table;
-	struct dentry *debugfs_dir;
 	struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -673,7 +674,6 @@ struct kvm_vcpu_arch {
 	u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_last_exit;
-	struct dentry *debugfs_exit_timing;
 #endif

 #ifdef CONFIG_PPC_BOOK3S
@@ -829,8 +829,6 @@ struct kvm_vcpu_arch {
 	struct kvmhv_tb_accumulator rm_exit;	/* real-mode exit code */
 	struct kvmhv_tb_accumulator guest_time;	/* guest execution */
 	struct kvmhv_tb_accumulator cede_time;	/* time napping inside guest */
-
-	struct dentry *debugfs_dir;
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 33db83b82fbd..d2b192dea0d2 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
 	int (*svm_off)(struct kvm *kvm);
 	int (*enable_dawr1)(struct kvm *kvm);
 	bool (*hash_v3_possible)(void);
+	int (*create_vm_debugfs)(struct kvm *kvm);
+	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
 };

 extern struct kvmppc_ops *kvmppc_hv_ops;

diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
index feef7885ba82..493a7d510fd5 100644
--- a/arch/powerpc/kvm/timing.h
+++ b/arch/powerpc/kvm/timing.h
@@ -14,8 +14,8 @@
 #ifdef CONFIG_KVM_EXIT_TIMING
 void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
 void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
-void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
-void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
+void kvmppc_create_vcpu_debugfs_e500(struct kvm_vcpu *vcpu,
+				     struct dentry *debugfs_dentry);

 static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 {
@@ -26,9 +26,8 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 /* if exit timing is not configured there is no need to build the c file */
 static inline void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu) {}
 static inline void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu) {}
-static inline void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
-					      unsigned int id) {}
-static inline void kvmppc_remove_vcpu_debug
[PATCH kernel 3/3] powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window
There is a possibility of having just one DMA window available with a limited capacity which the existing code does not handle that well. If the window is big enough for the system RAM but less than MAX_PHYSMEM_BITS (which we want when persistent memory is present), we create a 1:1 window and leave persistent memory without DMA.

This disables 1:1 mapping entirely if there is persistent memory and either:
- the huge DMA window does not cover the entire address space;
- the default DMA window is removed.

This relies on reverted 54fc3c681ded ("powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory") to return the actual amount of RAM in ddw_memory_hotplug_max() (posted separately).

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 301fa5b3d528..8f998e55735b 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1356,8 +1356,10 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 		len = order_base_2(query.largest_available_block << page_shift);
 		win_name = DMA64_PROPNAME;
 	} else {
-		direct_mapping = true;
-		win_name = DIRECT64_PROPNAME;
+		direct_mapping = !default_win_removed ||
+			(len == MAX_PHYSMEM_BITS) ||
+			(!pmem_present && (len == max_ram_len));
+		win_name = direct_mapping ? DIRECT64_PROPNAME : DMA64_PROPNAME;
 	}

 	ret = create_ddw(dev, ddw_avail, , page_shift, len);
--
2.30.2
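The new `direct_mapping` condition is compact enough to be worth spelling out. The sketch below mirrors the expression from the hunk above as a standalone predicate (the function name and the concrete bit widths are illustrative, not kernel code):

```c
#include <stdbool.h>

/*
 * When is a 1:1 (direct) mapping safe?
 *  - the default window is still there to dynamically map whatever the
 *    huge window misses, or
 *  - the huge window covers the whole address space, or
 *  - there is no persistent memory and the window covers all of RAM.
 */
static bool can_direct_map(bool default_win_removed, bool pmem_present,
                           int len, int max_physmem_bits, int max_ram_len)
{
    return !default_win_removed ||
           (len == max_physmem_bits) ||
           (!pmem_present && (len == max_ram_len));
}
```

The failure mode the patch closes is the first test case below: persistent memory present, default window gone, window smaller than the full address space. Direct-mapping there would leave the pmem range with no way to get DMA at all.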
[PATCH kernel 2/3] powerpc/pseries/ddw: simplify enable_ddw()
This drops rather useless ddw_enabled flag as direct_mapping implies it anyway. While at this, fix indents in enable_ddw().

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
This replaces "powerpc/pseries/iommu: Fix indentations"
---
 arch/powerpc/platforms/pseries/iommu.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 64385d6f33c2..301fa5b3d528 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1229,7 +1229,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	struct dma_win *window;
 	struct property *win64;
-	bool ddw_enabled = false;
 	struct failed_ddw_pdn *fpdn;
 	bool default_win_removed = false, direct_mapping = false;
 	bool pmem_present;
@@ -1244,7 +1243,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (find_existing_ddw(pdn, >dev.archdata.dma_offset, )) {
 		direct_mapping = (len >= max_ram_len);
-		ddw_enabled = true;
 		goto out_unlock;
 	}
@@ -1397,8 +1395,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 			dev_info(>dev,
 				 "failed to map DMA window for %pOF: %d\n",
 				 dn, ret);

-	/* Make sure to clean DDW if any TCE was set*/
-	clean_dma_window(pdn, win64->value);
+			/* Make sure to clean DDW if any TCE was set*/
+			clean_dma_window(pdn, win64->value);
 			goto out_del_list;
 		}
 	} else {
@@ -1445,7 +1443,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	spin_unlock(_win_list_lock);

 	dev->dev.archdata.dma_offset = win_addr;
-	ddw_enabled = true;
 	goto out_unlock;

 out_del_list:
@@ -1481,10 +1478,10 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 * as RAM, then we failed to create a window to cover persistent
 	 * memory and need to set the DMA limit.
 	 */
-	if (pmem_present && ddw_enabled && direct_mapping && len == max_ram_len)
+	if (pmem_present && direct_mapping && len == max_ram_len)
 		dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset +
 			(1ULL << len);

-	return ddw_enabled && direct_mapping;
+	return direct_mapping;
 }

 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
--
2.30.2
[PATCH kernel 1/3] powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
This reverts commit 54fc3c681ded9437e4548e2501dc1136b23cfa9a which does not allow 1:1 mapping even for the system RAM which is usually possible.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 49b401536d29..64385d6f33c2 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1094,15 +1094,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
--
2.30.2
[PATCH kernel 0/3] powerpc/pseries/ddw: Fixes for persistent memory case
This is based on sha1 f855455dee0b Michael Ellerman "Automatic merge of 'next' into merge (2021-11-05 22:19)".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
  powerpc/pseries/ddw: simplify enable_ddw()
  powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window

 arch/powerpc/platforms/pseries/iommu.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

--
2.30.2
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 28/10/2021 08:30, Brian King wrote:
On 10/26/21 12:39 AM, Alexey Kardashevskiy wrote:
On 10/26/21 01:40, Brian King wrote:
On 10/23/21 7:18 AM, Alexey Kardashevskiy wrote:
On 23/10/2021 07:18, Brian King wrote:
On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:
On 22/10/2021 04:44, Brian King wrote:

If ibm,pmemory is installed in the system, it can appear anywhere in the address space. This patch enhances how we handle DMA for devices when ibm,pmemory is present. In the case where we have enough DMA space to direct map all of RAM, but not ibm,pmemory, we use direct DMA for I/O to RAM and use the default window to dynamically map ibm,pmemory. In the case where we only have a single DMA window, this won't work, so if the window is not big enough to map the entire address range, we cannot direct map.

but we want the pmem range to be mapped into the huge DMA window too if we can, why skip it?

This patch should simply do what the comment in this commit mentioned below suggests, which says that ibm,pmemory can appear anywhere in the address space. If the DMA window is large enough to map all of MAX_PHYSMEM_BITS, we will indeed simply do direct DMA for everything, including the pmem. If we do not have a big enough window to do that, we will do direct DMA for DRAM and dynamic mapping for pmem.

Right, and this is what we do already, do not we? I missing something here.

The upstream code does not work correctly that I can see.
If I boot an upstream kernel with an nvme device and vpmem assigned to the LPAR, and enable dev_dbg in arch/powerpc/platforms/pseries/iommu.c, I see the following in the logs:

[2.157549] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
[2.157561] nvme 0121:50:00.0: Skipping ibm,pmemory
[2.157567] nvme 0121:50:00.0: can't map partition max 0x8 with 16777216 65536-sized pages
[2.170150] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 28 returned 0 (liobn = 0x7121 starting addr = 800 0)
[2.170170] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
[2.356260] nvme 0121:50:00.0: node is /pci@8002121/pci1014,683@0

This means we are heading down the leg in enable_ddw where we do not set direct_mapping to true. We use create the DDW window, but don't do any direct DMA. This is because the window is not large enough to map 2PB of memory, which is what ddw_memory_hotplug_max returns without my patch.

With my patch applied, I get this in the logs:

[2.204866] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
[2.204875] nvme 0121:50:00.0: Skipping ibm,pmemory
[2.205058] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 21 returned 0 (liobn = 0x7121 starting addr = 800 0)
[2.205068] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
[2.215898] nvme 0121:50:00.0: iommu: 64-bit OK but direct DMA is limited by 802

ah I see. then...
Thanks, Brian

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48

Thanks, Brian

Signed-off-by: Brian King
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 */
 	len = max_ram_len;
 	if (pmem_present) {
+		if (default_win_removed) {
+			/*
+			 * If we only have one DMA window and have pmem present,
+			 * then we need to be able to map the entire address
+			 * range in order to be able to do direct DMA to RAM.
+			 */
+			len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 10/26/21 01:40, Brian King wrote:
> On 10/23/21 7:18 AM, Alexey Kardashevskiy wrote:
>>
>> On 23/10/2021 07:18, Brian King wrote:
>>> On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:
>>>>
>>>> On 22/10/2021 04:44, Brian King wrote:
>>>>> If ibm,pmemory is installed in the system, it can appear anywhere
>>>>> in the address space. This patch enhances how we handle DMA for devices when
>>>>> ibm,pmemory is present. In the case where we have enough DMA space to
>>>>> direct map all of RAM, but not ibm,pmemory, we use direct DMA for
>>>>> I/O to RAM and use the default window to dynamically map ibm,pmemory.
>>>>> In the case where we only have a single DMA window, this won't work, so
>>>>> if the window is not big enough to map the entire address range,
>>>>> we cannot direct map.
>>>>
>>>> but we want the pmem range to be mapped into the huge DMA window too if we
>>>> can, why skip it?
>>>
>>> This patch should simply do what the comment in this commit mentioned below
>>> suggests, which says that ibm,pmemory can appear anywhere in the address
>>> space. If the DMA window is large enough to map all of MAX_PHYSMEM_BITS,
>>> we will indeed simply do direct DMA for everything, including the pmem.
>>> If we do not have a big enough window to do that, we will do direct DMA
>>> for DRAM and dynamic mapping for pmem.
>>
>> Right, and this is what we do already, do we not? Am I missing something here.
>
> The upstream code does not work correctly that I can see.
> If I boot an upstream kernel with an nvme device and vpmem assigned to the
> LPAR, and enable dev_dbg in arch/powerpc/platforms/pseries/iommu.c,
> I see the following in the logs:
>
> [2.157549] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
> [2.157561] nvme 0121:50:00.0: Skipping ibm,pmemory
> [2.157567] nvme 0121:50:00.0: can't map partition max 0x8 with 16777216 65536-sized pages
> [2.170150] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 28 returned 0 (liobn = 0x7121 starting addr = 800 0)
> [2.170170] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
> [2.356260] nvme 0121:50:00.0: node is /pci@8002121/pci1014,683@0
>
> This means we are heading down the leg in enable_ddw where we do not set
> direct_mapping to true. We do create the DDW window, but don't do any
> direct DMA. This is because the window is not large enough to map 2PB of
> memory, which is what ddw_memory_hotplug_max returns without my patch.
>
> With my patch applied, I get this in the logs:
>
> [2.204866] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
> [2.204875] nvme 0121:50:00.0: Skipping ibm,pmemory
> [2.205058] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 21 returned 0 (liobn = 0x7121 starting addr = 800 0)
> [2.205068] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
> [2.215898] nvme 0121:50:00.0: iommu: 64-bit OK but direct DMA is limited by 802

ah I see. then...
>
> Thanks,
>
> Brian
>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48
>>>
>>> Thanks,
>>>
>>> Brian
>>>
>>>>> Signed-off-by: Brian King
>>>>> ---
>>>>>  arch/powerpc/platforms/pseries/iommu.c | 19 ++-
>>>>>  1 file changed, 10 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>>> index 269f61d519c2..d9ae985d10a4 100644
>>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>>> @@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>>>>>  	phys_addr_t max_addr = memory_hotplug_max();
>>>>>  	struct device_node *memory;
>>>>> -	/*
>>>>> -	 * The "ibm,pmemory"
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 23/10/2021 07:18, Brian King wrote:

On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:

On 22/10/2021 04:44, Brian King wrote:

If ibm,pmemory is installed in the system, it can appear anywhere in the
address space. This patch enhances how we handle DMA for devices when
ibm,pmemory is present. In the case where we have enough DMA space to direct
map all of RAM, but not ibm,pmemory, we use direct DMA for I/O to RAM and
use the default window to dynamically map ibm,pmemory. In the case where we
only have a single DMA window, this won't work, so if the window is not big
enough to map the entire address range, we cannot direct map.

but we want the pmem range to be mapped into the huge DMA window too if we
can, why skip it?

This patch should simply do what the comment in this commit mentioned below
suggests, which says that ibm,pmemory can appear anywhere in the address
space. If the DMA window is large enough to map all of MAX_PHYSMEM_BITS, we
will indeed simply do direct DMA for everything, including the pmem. If we
do not have a big enough window to do that, we will do direct DMA for DRAM
and dynamic mapping for pmem.

Right, and this is what we do already, do we not? Am I missing something here.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48

Thanks,

Brian

Signed-off-by: Brian King
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 */
 	len = max_ram_len;
 	if (pmem_present) {
+		if (default_win_removed) {
+			/*
+			 * If we only have one DMA window and have pmem present,
+			 * then we need to be able to map the entire address
+			 * range in order to be able to do direct DMA to RAM.
+			 */
+			len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+					(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));
+		}
+
 		if (query.largest_available_block >=
 		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
 			len = MAX_PHYSMEM_BITS;

-- 
Alexey
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 22/10/2021 04:44, Brian King wrote:

If ibm,pmemory is installed in the system, it can appear anywhere in the
address space. This patch enhances how we handle DMA for devices when
ibm,pmemory is present. In the case where we have enough DMA space to direct
map all of RAM, but not ibm,pmemory, we use direct DMA for I/O to RAM and
use the default window to dynamically map ibm,pmemory. In the case where we
only have a single DMA window, this won't work, so if the window is not big
enough to map the entire address range, we cannot direct map.

but we want the pmem range to be mapped into the huge DMA window too if we
can, why skip it?

Signed-off-by: Brian King
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 */
 	len = max_ram_len;
 	if (pmem_present) {
+		if (default_win_removed) {
+			/*
+			 * If we only have one DMA window and have pmem present,
+			 * then we need to be able to map the entire address
+			 * range in order to be able to do direct DMA to RAM.
+			 */
+			len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+					(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));
+		}
+
 		if (query.largest_available_block >=
 		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
 			len = MAX_PHYSMEM_BITS;

-- 
Alexey