Re: [RFC PATCH v2 0/6] powerpc: pSeries: vfio: iommu: Re-enable support for SPAPR TCE VFIO
On 2/5/24 00:09, Jason Gunthorpe wrote:

On Tue, Apr 30, 2024 at 03:05:34PM -0500, Shivaprasad G Bhat wrote:

RFC v1 was posted here [1]. As I was testing more and fixing the issues, I realized it is cleaner to have the table_group_ops implemented the way it is done on PowerNV and stop "borrowing" the DMA windows for pSeries. This patch set implements the iommu table_group_ops for pSeries for the VFIO SPAPR TCE sub-driver, thereby enabling VFIO support on POWER pSeries machines.

Wait, did they previously not have any support?

> Again, this TCE stuff needs to go away, not grow. I can grudgingly accept fixing it where it used to work, but not enabling more HW that never worked before! :(

This used to work when I tried it last time, 2+ years ago; it is not new.

Thanks,

--
Alexey
Re: [PATCH v2] powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs
Hi Gaurav,

Sorry I missed this. Please share the link to your fix, I do not see it in my mail. In general, the problem can probably be solved by using huge pages (anything more than 64K) only for 1:1 mapping.

On 03/05/2023 13:25, Gaurav Batra wrote:

Hello Alexey,

I recently joined the IOMMU team. There was a bug reported by the test team where the Mellanox driver was timing out during configuration. I proposed a fix for it, which is below in the email. You suggested a fix for Srikar's reported problem. Both of these fixes will resolve the Srikar and Mellanox driver issues. The problem is with 2MB DDW.

Since you have extensive knowledge of the IOMMU design and code, in your opinion, which patch should we adopt?

Thanks a lot,
Gaurav

On 4/20/23 2:45 PM, Gaurav Batra wrote:

Hello Michael,

I was looking into Bug 199106 (https://bugzilla.linux.ibm.com/show_bug.cgi?id=199106). In the bug, the Mellanox driver was timing out when enabling an SR-IOV device. I tested Alexey's patch and it fixes the issue with the Mellanox driver. The downside of Alexey's fix is that even a small memory request by the driver will be aligned up to 2MB. In my test, the Mellanox driver issues multiple requests of 64K size. All of these get aligned up to 2MB, which is quite a waste of resources.

In any case, both patches work. Let me know which approach you prefer. In case we decide to go with my patch, I just realized that I need to fix nio_pages in iommu_free_coherent() as well.

Thanks,
Gaurav

On 4/20/23 10:21 AM, Michael Ellerman wrote:

Gaurav Batra writes:

When a DMA window is backed by 2MB TCEs, the DMA address for the mapped page should be the offset of the page relative to the 2MB TCE. The code was incorrectly setting the DMA address to the beginning of the TCE range. The Mellanox driver reports a timeout trying to ENABLE_HCA for an SR-IOV ethernet port when the DMA window is backed by 2MB TCEs.

I assume this is similar or related to the bug Srikar reported?
https://lore.kernel.org/linuxppc-dev/20230323095333.gi1005...@linux.vnet.ibm.com/

In that thread Alexey suggested a patch, have you tried his patch? He suggested rounding up the allocation size, rather than adjusting the dma_handle.

Fixes: 3872731187141d5d0a5c4fb30007b8b9ec36a44d

That's not the right syntax, it's described in the documentation how to generate it. It should be:

Fixes: 387273118714 ("powerpc/pseries/dma: Add support for 2M IOMMU page size")

cheers

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..ca57526ce47a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -517,7 +517,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		/* Convert entry to a dma_addr_t */
 		entry += tbl->it_offset;
 		dma_addr = entry << tbl->it_page_shift;
-		dma_addr |= (s->offset & ~IOMMU_PAGE_MASK(tbl));
+		dma_addr |= (vaddr & ~IOMMU_PAGE_MASK(tbl));

 		DBG("  - %lu pages, entry: %lx, dma_addr: %lx\n",
 			npages, entry, dma_addr);
@@ -904,6 +904,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	unsigned int order;
 	unsigned int nio_pages, io_order;
 	struct page *page;
+	int tcesize = (1 << tbl->it_page_shift);

 	size = PAGE_ALIGN(size);
 	order = get_order(size);
@@ -930,7 +931,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	memset(ret, 0, size);

 	/* Set up tces to cover the allocated range */
-	nio_pages = size >> tbl->it_page_shift;
+	nio_pages = IOMMU_PAGE_ALIGN(size, tbl) >> tbl->it_page_shift;
+	io_order = get_iommu_order(size, tbl);
 	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
 			mask >> tbl->it_page_shift, io_order, 0);
@@ -938,7 +940,8 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 		free_pages((unsigned long)ret, order);
 		return NULL;
 	}
-	*dma_handle = mapping;
+
+	*dma_handle = mapping | ((u64)ret & (tcesize - 1));

 	return ret;
 }

--
Alexey
Re: Probing nvme disks fails on Upstream kernels on powerpc Maxconfig
On 05/04/2023 15:45, Michael Ellerman wrote:

"Linux regression tracking (Thorsten Leemhuis)" writes:

[CCing the regression list, as it should be in the loop for regressions: https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 23.03.23 10:53, Srikar Dronamraju wrote:

I am unable to boot upstream kernels from v5.16 to the latest upstream kernel on a maxconfig system. (Machine config details given below.) At boot, we see a series of messages like the below.

dracut-initqueue[13917]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
dracut-initqueue[13917]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f93dc0767-18aa-467f-afa7-5b4e9c13108a.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
dracut-initqueue[13917]: [ -e "/dev/disk/by-uuid/93dc0767-18aa-467f-afa7-5b4e9c13108a" ]
dracut-initqueue[13917]: fi"

Alexey, did you look into this? This is apparently caused by a commit of yours (see quoted part below) that Michael applied. Looks like it fell through the cracks from here, but maybe I'm missing something.

Unfortunately Alexey is not working at IBM any more, so he won't have access to any hardware to debug/test this. Srikar, are you debugging this? If not, we'll have to find someone else to look at it.

Has this been fixed and I missed the cc:? Anyway, without the full log, I still see it is a huge guest, so chances are the guest could not map all RAM and instead uses the biggest possible DDW with 2M pages.
If that's the case, this might help it:

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 614af78b3695..996acf245ae5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -906,7 +906,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
 	unsigned int nio_pages, io_order;
 	struct page *page;

-	size = PAGE_ALIGN(size);
+	size = _ALIGN(size, IOMMU_PAGE_SIZE(tbl));
 	order = get_order(size);
@@ -949,10 +949,9 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 	if (tbl) {
 		unsigned int nio_pages;

-		size = PAGE_ALIGN(size);
+		size = _ALIGN(size, IOMMU_PAGE_SIZE(tbl));
 		nio_pages = size >> tbl->it_page_shift;
 		iommu_free(tbl, dma_handle, nio_pages);
-		size = PAGE_ALIGN(size);
 		free_pages((unsigned long)vaddr, get_order(size));
 	}

And there may be other places where PAGE_SIZE is used instead of IOMMU_PAGE_SIZE(tbl).

Thanks,

--
Alexey
Re: [PATCH v2 0/4] Reenable VFIO support on POWER systems
On 07/03/2023 10:46, Alex Williamson wrote:

On Mon, 6 Mar 2023 11:29:53 -0600 (CST), Timothy Pearson wrote:

This patch series reenables VFIO support on POWER systems. It is based on Alexey Kardashevskiy's patch series, rebased and successfully tested under QEMU with a Marvell PCIe SATA controller on a POWER9 Blackbird host.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

Timothy Pearson (1):
  Add myself to MAINTAINERS for Power VFIO support

 MAINTAINERS                               |   5 +
 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/kernel/iommu.c               | 246 +-
 arch/powerpc/kernel/pci_64.c              |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
 arch/powerpc/platforms/pseries/iommu.c    |  27 +++
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  96 ++---
 10 files changed, 338 insertions(+), 94 deletions(-)

For vfio and MAINTAINERS portions,

Acked-by: Alex Williamson

I'll note though that spapr_tce_take_ownership() looks like it copied a bug from the old tce_iommu_take_ownership() where tbl and tbl->it_map are tested before calling iommu_take_ownership() but not in the unwind loop, i.e. tables we might have skipped on setup are unconditionally released on unwind. Thanks,

Ah, true, a bug. Thanks for pointing it out.

--
Alexey
Re: [PATCH kernel v2 0/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Michael, Fred, ping?

On 20/09/2022 23:04, Alexey Kardashevskiy wrote:

Here is another take on iommu_ops on POWER to make VFIO work again on POWERPC64. Tested on PPC, kudos to Fred!

The tree with all prerequisites is here: https://github.com/aik/linux/tree/kvm-fixes-wip

The previous discussion is here:
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/
https://lore.kernel.org/all/20220714081822.3717693-3-...@ozlabs.ru/T/

This is based on sha1 ce888220d5c7 Linus Torvalds "Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/kernel/iommu.c               | 247 +-
 arch/powerpc/kernel/pci_64.c              |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
 arch/powerpc/platforms/pseries/iommu.c    |  27 +++
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  96 ++---
 9 files changed, 334 insertions(+), 94 deletions(-)

--
Alexey
[PATCH kernel v2 3/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Up until now, PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added two uses of iommu_ops to the generic VFIO code which broke POWER:
- a coherency capability check;
- blocking IOMMU domain - iommu_group_dma_owner_claimed()/...

This adds a simple iommu_ops which reports support for cache coherency and provides basic support for blocking domains. No other domain types are implemented, so the default domain is NULL. Since iommu_ops now controls the group ownership, this takes it out of VFIO.

This adds an IOMMU device to a pci_controller (=PHB) and registers it in the IOMMU subsystem; iommu_ops is registered at this point. This setup is done in postcore_initcall_sync.

This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why an explicit iommu_probe_device() call is still needed.
Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence")
Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices")
Cc: Deming Wang
Cc: Robin Murphy
Cc: Jason Gunthorpe
Cc: Alex Williamson
Cc: Daniel Henrique Barboza
Cc: Fabiano Rosas
Cc: Murilo Opsfelder Araujo
Cc: Nicholas Piggin
Signed-off-by: Alexey Kardashevskiy
---
Changes:
v2:
* replaced a default domain with blocked
---
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/kernel/iommu.c               | 149 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  30 +
 arch/powerpc/platforms/pseries/iommu.c    |  24
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |   8 --
 7 files changed, 215 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index e18c95f4e1d4..fcab0e4b203b 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include

 struct device_node;

@@ -44,6 +45,9 @@ struct pci_controller_ops {
 #endif

 	void		(*shutdown)(struct pci_controller *hose);
+
+	struct iommu_group *(*device_group)(struct pci_controller *hose,
+					    struct pci_dev *pdev);
 };

 /*
@@ -131,6 +135,9 @@ struct pci_controller {
 	struct irq_domain	*dev_domain;
 	struct irq_domain	*msi_domain;
 	struct fwnode_handle	*fwnode;
+
+	/* iommu_ops support */
+	struct iommu_device	iommu;
 };

 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 1d75b7742ef0..f8bce40ebd0c 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -123,5 +123,9 @@ static inline void pseries_lpar_read_hblkrm_characteristics(void) { }
 #endif

 void pseries_rng_init(void);

+#ifdef CONFIG_SPAPR_TCE_IOMMU
+struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose,
+					     struct pci_dev *pdev);
+#endif
 #endif /* _PSERIES_PSERIES_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d873c123ab49..823da727aac7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include

 #define DBG(...)

@@ -1158,8 +1159,14 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 	pr_debug("%s: Adding %s to iommu group %d\n",
 		 __func__, dev_name(dev), iommu_group_id(table_group->group));
-
-	return iommu_group_add_device(table_group->group, dev);
+	/*
+	 * This is still not adding devices via the IOMMU bus notifier because
+	 * of pcibios_init() from arch/powerpc/kernel/pci_64.c which calls
+	 * pcibios_scan_phb() first (and this guy adds devices and triggers
+	 * the notifier) and only then it calls pci_bus_add_devices() which
+	 * configures DMA for buses which also creates PEs and IOMMU groups.
+	 */
+	return iommu_probe_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);

@@ -1239,6 +1246,7 @@ static long spapr_tce_take_ownership(struct iommu_table_group *table_group)
 		rc = iommu_take_ownership(tbl);
 		if (!rc)
 			continue;
+
 		for (j = 0; j < i; ++j)
 			iommu_release_ownership(table_group->tables[j]);
 		return rc;
@@ -1271,4 +1279,141 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = {
 	.release_ownership = spapr_tce_release_ownership,
 };

+/*
+ * A simple iommu_ops to allow less cr
[PATCH kernel v2 0/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Here is another take on iommu_ops on POWER to make VFIO work again on POWERPC64. Tested on PPC, kudos to Fred!

The tree with all prerequisites is here: https://github.com/aik/linux/tree/kvm-fixes-wip

The previous discussion is here:
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/
https://lore.kernel.org/all/20220714081822.3717693-3-...@ozlabs.ru/T/

This is based on sha1 ce888220d5c7 Linus Torvalds "Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/iommu: Add "borrowing" iommu_table_group_ops
  powerpc/pci_64: Init pcibios subsys a bit later
  powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/include/asm/pci-bridge.h     |   7 +
 arch/powerpc/platforms/pseries/pseries.h  |   4 +
 arch/powerpc/kernel/iommu.c               | 247 +-
 arch/powerpc/kernel/pci_64.c              |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  36 +++-
 arch/powerpc/platforms/pseries/iommu.c    |  27 +++
 arch/powerpc/platforms/pseries/setup.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  96 ++---
 9 files changed, 334 insertions(+), 94 deletions(-)

--
2.37.3
[PATCH kernel v2 1/3] powerpc/iommu: Add "borrowing" iommu_table_group_ops
The PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: it controls the ownership and creates/sets/unsets a table in the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER.

So far only PowerNV IODA2 (POWER8 and newer machines) implemented this; other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW, 2) ownership transfer is done directly in the VFIO SPAPR TCE driver.

Soon POWER is going to get its own iommu_ops and ownership control is going to move there.

This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts at table_group.tce32_start and is as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV).

This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/iommu.h          |   6 +-
 arch/powerpc/kernel/iommu.c               |  98 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c |   6 +-
 arch/powerpc/platforms/pseries/iommu.c    |   3 +
 drivers/vfio/vfio_iommu_spapr_tce.c       |  94 --
 5 files changed, 121 insertions(+), 86 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7e29c73e3dd4..678b5bdc79b1 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -175,7 +175,7 @@ struct iommu_table_group_ops {
 	long (*unset_window)(struct iommu_table_group *table_group,
 			int num);
 	/* Switch ownership from platform code to external user (e.g. VFIO) */
-	void (*take_ownership)(struct iommu_table_group *table_group);
+	long (*take_ownership)(struct iommu_table_group *table_group);
 	/* Switch ownership from external user (e.g. VFIO) back to core */
 	void (*release_ownership)(struct iommu_table_group *table_group);
 };
@@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 		enum dma_data_direction *direction);
 extern void iommu_tce_kill(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
+
+extern struct iommu_table_group_ops spapr_tce_table_group_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
@@ -303,8 +305,6 @@ extern int iommu_tce_check_gpa(unsigned long page_shift,
 		iommu_tce_check_gpa((tbl)->it_page_shift, (gpa)))

 extern void iommu_flush_tce(struct iommu_table *tbl);
-extern int iommu_take_ownership(struct iommu_table *tbl);
-extern void iommu_release_ownership(struct iommu_table *tbl);

 extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
 extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index caebe1431596..d873c123ab49 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,7 +1088,7 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);

-int iommu_take_ownership(struct iommu_table *tbl)
+static int iommu_take_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 	int ret = 0;
@@ -1120,9 +1120,8 @@ int iommu_take_ownership(struct iommu_table *tbl)

 	return ret;
 }
-EXPORT_SYMBOL_GPL(iommu_take_ownership);

-void iommu_release_ownership(struct iommu_table *tbl)
+static void iommu_release_ownership(struct iommu_table *tbl)
 {
 	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;

@@ -1139,7 +1138,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 		spin_unlock(&tbl->pools[i].lock);
 	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
-EXPORT_SYMBOL_GPL(iommu_release_ownership);

 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
@@ -1181,4 +1179,96 @@ void iommu_del_device(struct device *dev)
 	iommu_group_remove_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_del_device);
+
+/*
+ * A simple iommu_table_group_ops which only allows reusing the existing
+ * iommu_table. This handles VFIO for POWER7 or the nested KVM.
+ * The ops does not allow creating windows and only allows reusing the existing
+ * one if it matches table_group->tce32_start/tce32_size/page_shift.
+ */
+static unsigned long spapr_tce_get_table_size(__u32 page_shift,
+					      __u64 window_size, __u32 leve
[PATCH kernel v2 2/3] powerpc/pci_64: Init pcibios subsys a bit later
The following patches are going to add a dependency on/use of iommu_ops which is initialized in subsys_initcall as well. This moves pcibios_init() to the next initcall level.

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/pci_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 0c7cfb9fab04..9cd763d512ae 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -73,7 +73,7 @@ static int __init pcibios_init(void)
 	return 0;
 }

-subsys_initcall(pcibios_init);
+subsys_initcall_sync(pcibios_init);

 int pcibios_unmap_io_space(struct pci_bus *bus)
 {
--
2.37.3
Re: [PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
Ping? It's been a while and probably got lost :-/

On 18/05/2022 16:27, Alexey Kardashevskiy wrote:

On 5/4/22 17:48, Alexey Kardashevskiy wrote:

When introduced, IRQFD resampling worked on POWER8 with XICS. However, KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all.

This moves the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later.

Signed-off-by: Alexey Kardashevskiy
---
Or I could move this one together with KVM_CAP_IRQFD. Thoughts?

Ping?
---
 arch/arm64/kvm/arm.c       | 3 +++
 arch/mips/kvm/mips.c       | 3 +++
 arch/powerpc/kvm/powerpc.c | 6 ++
 arch/riscv/kvm/vm.c        | 3 +++
 arch/s390/kvm/kvm-s390.c   | 3 +++
 arch/x86/kvm/x86.c         | 3 +++
 virt/kvm/kvm_main.c        | 1 -
 7 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 523bc934fe2f..092f0614bae3 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -210,6 +210,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index a25e0b73ee70..0f3de470a73e 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_SYNC_MMU:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 875c30c12db0..87698ffef3be 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -591,6 +591,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+		r = !xive_enabled();
+		break;
+#endif
+
 	case KVM_CAP_PPC_ALLOC_HTAB:
 		r = hv_enabled;
 		break;
diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c
index c768f75279ef..b58579b386bb 100644
--- a/arch/riscv/kvm/vm.c
+++ b/arch/riscv/kvm/vm.c
@@ -63,6 +63,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_IMMEDIATE_EXIT:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_NR_VCPUS:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 156d1c25a3c1..85e093fc8d13 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -564,6 +564,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_S390_DIAG318:
 	case KVM_CAP_S390_MEM_OP_EXTENSION:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c0ca599a353..a0a7b769483d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4273,6 +4273,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SYS_ATTRIBUTES:
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
+#ifdef CONFIG_HAVE_KVM_IRQFD
+	case KVM_CAP_IRQFD_RESAMPLE:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70e05af5ebea..885e72e668a5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4293,7 +4293,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
-	case KVM_CAP_IRQFD_RESAMPLE:
 #endif
 	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
 	case KVM_CAP_CHECK_EXTENSION_VM:

--
Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 29/07/2022 13:10, Tian, Kevin wrote:

From: Oliver O'Halloran, Sent: Friday, July 29, 2022 10:53 AM

On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy wrote:

*snip*

About this. If a platform has a concept of explicit DMA windows (2 or more), is it one domain with 2 windows, or 2 domains with one window each? If it is 2 windows, iommu_domain_ops misses window manipulation callbacks (I vaguely remember them being there for embedded PPC64 but cannot find them quickly). If it is 1 window per domain, then can a device be attached to 2 domains, at least in theory (I suspect not)?

On server POWER CPUs, each DMA window is backed by an independent IOMMU page table. (reminder) A window is a bus address range where devices are allowed to DMA to/from ;)

I've always thought of windows as being entries to a top-level "iommu page table" for the device / domain. The fact that each window is backed by a separate IOMMU page table shouldn't really be relevant outside the arch/platform.

Yes. This is what was agreed when discussing how to integrate iommufd with POWER [1]. One domain represents one address space. Windows are just constraints on the address space for what ranges can be mapped. Having two page tables underlying it is just a POWER-specific format.

It is a POWER-specific thing with one not-so-obvious consequence: each window has an independent page size (fixed at the moment of creation) and (most likely) a different page size, like 4K vs. 2M.

Thanks,
Kevin

[1] https://lore.kernel.org/all/Yns+TCSa6hWbU7wZ@yekko/

--
Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 17:32, Tian, Kevin wrote:

From: Alexey Kardashevskiy, Sent: Friday, July 8, 2022 2:35 PM

On 7/8/22 15:00, Alexey Kardashevskiy wrote:

On 7/8/22 01:10, Jason Gunthorpe wrote:

On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:

Historically, PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development, though, has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; a similar story with iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash.

Stale comment, since this patch doesn't use bus_set_iommu() now.

+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+				      struct device *dev)
+{
+	return 0;
+}

It is important when this returns that the iommu translation is all emptied. There should be no leftover translations from the DMA API at this point. I have no idea how POWER works in this regard, but it should be explained why this is safe in a comment at a minimum.

> It will turn into a security problem to allow kernel mappings to leak past this point.

I've added for v2 a check that there are no valid mappings for a device (or, more precisely, in the associated iommu_group); this domain does not need checking, right?

Uff, not that simple. Looks like once a device is in a group, its dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess then there is a way to set those to NULL or do something similar to let dma_map_direct() from kernel/dma/mapping.c return "true", is not there?

dev->dma_ops is NULL as long as you don't handle the DMA domain type here and don't call iommu_setup_dma_ops(). Given this only supports the blocking domain, the above should be irrelevant.

For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing, as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.

Thanks,

In general, is "domain" something from hardware or is it a software concept? Thanks,

'domain' is a software concept to represent the hardware I/O page table.

About this. If a platform has a concept of explicit DMA windows (2 or more), is it one domain with 2 windows, or 2 domains with one window each? If it is 2 windows, iommu_domain_ops misses window manipulation callbacks (I vaguely remember them being there for embedded PPC64 but cannot find them quickly). If it is 1 window per domain, then can a device be attached to 2 domains, at least in theory (I suspect not)?

On server POWER CPUs, each DMA window is backed by an independent IOMMU page table. (reminder) A window is a bus address range where devices are allowed to DMA to/from ;)

Thanks,

A blocking domain means all DMAs from a device attached to this domain are blocked/rejected (equivalent to an empty I/O page table), usually enforced in the .attach_dev() callback. Yes, a comment for why having a NULL .attach_dev() is OK is welcomed.

Thanks,
Kevin

--
Alexey
Re: [PATCH kernel 3/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 19/07/2022 04:09, Jason Gunthorpe wrote: On Thu, Jul 14, 2022 at 06:18:22PM +1000, Alexey Kardashevskiy wrote: +/* + * A simple iommu_ops to allow less cruft in generic VFIO code. + */ +static bool spapr_tce_iommu_capable(enum iommu_cap cap) +{ + switch (cap) { + case IOMMU_CAP_CACHE_COHERENCY: I would add a remark here that it is because vfio is going to use SPAPR mode but still checks that the iommu driver support coherency - with out that detail it looks very strange to have caps without implementing unmanaged domains +static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type) +{ + struct iommu_domain *dom; + + if (type != IOMMU_DOMAIN_BLOCKED) + return NULL; + + dom = kzalloc(sizeof(*dom), GFP_KERNEL); + if (!dom) + return NULL; + + dom->geometry.aperture_start = 0; + dom->geometry.aperture_end = ~0ULL; + dom->geometry.force_aperture = true; A blocked domain doesn't really have an aperture, all DMA is rejected, so I think these can just be deleted and left at zero. Generally I'm suggesting drivers just use a static singleton instance for the blocked domain instead of the allocation like this, but that is a very minor nit. +static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev) +{ + struct pci_dev *pdev; + struct pci_controller *hose; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); Less "weirdly", more by design. The iommu driver should check if the given struct device is supported or not, it isn't really a bus specific operation. +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct pci_controller *hose; + struct pci_dev *pdev; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); This doesn't need repeating, if probe_device() fails then this will never be called. 
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+				      struct device *dev)
+{
+	struct iommu_group *grp = iommu_group_get(dev);
+	struct iommu_table_group *table_group;
+	int ret = -EINVAL;
+
+	if (!grp)
+		return -ENODEV;
+
+	table_group = iommu_group_get_iommudata(grp);
+
+	if (dom->type == IOMMU_DOMAIN_BLOCKED)
+		ret = table_group->ops->take_ownership(table_group);

Ideally there shouldn't be dom->type checks like this. The blocking domain should have its own iommu_domain_ops that only process the blocking operation. Ie call this like spapr_tce_iommu_blocking_attach_dev(). Instead of having a "default_domain_ops" leave it NULL and create a spapr_tce_blocking_domain_ops with these two functions and assign it to domain->ops when creating. Then it is really clear these functions are only called for the DOMAIN_BLOCKED type and you don't need to check it.

+static void spapr_tce_iommu_detach_dev(struct iommu_domain *dom,
+				       struct device *dev)
+{
+	struct iommu_group *grp = iommu_group_get(dev);
+	struct iommu_table_group *table_group;
+
+	table_group = iommu_group_get_iommudata(grp);
+	WARN_ON(dom->type != IOMMU_DOMAIN_BLOCKED);
+	table_group->ops->release_ownership(table_group);
+}

Ditto

+struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose,
+					     struct pci_dev *pdev)
+{
+	struct device_node *pdn, *dn = pdev->dev.of_node;
+	struct iommu_group *grp;
+	struct pci_dn *pci;
+
+	pdn = pci_dma_find(dn, NULL);
+	if (!pdn || !PCI_DN(pdn))
+		return ERR_PTR(-ENODEV);
+
+	pci = PCI_DN(pdn);
+	if (!pci->table_group)
+		return ERR_PTR(-ENODEV);
+
+	grp = pci->table_group->group;
+	if (!grp)
+		return ERR_PTR(-ENODEV);
+
+	return iommu_group_ref_get(grp);

Not for this series, but this is kind of backwards, the driver specific data (ie the table_group) should be in iommu_group_get_iommudata()...
It is there but here we are getting from a device to a group - a device is not added to a group yet when iommu_probe_device() works and tries adding a device via iommu_group_get_for_dev().

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 8a65ea61744c..3b53b466e49b 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -1152,8 +1152,6 @@ static void tce_iommu_release_ownership(struct tce_container *container,
 	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
 		if (container->tables[i])
 			table_group->ops->
[PATCH kernel 3/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added two uses of iommu_ops to the generic VFIO code which broke POWER:
- a coherency capability check;
- a blocking IOMMU domain - iommu_group_dma_owner_claimed()/...

This adds a simple iommu_ops which reports support for cache coherency and provides basic support for blocking domains. No other domain types are implemented, so the default domain is NULL. Since iommu_ops now controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem; iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why an explicit iommu_probe_device() is still needed.
The previous discussion is here: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/ https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Cc: Deming Wang Cc: Robin Murphy Cc: Jason Gunthorpe Cc: Alex Williamson Cc: Daniel Henrique Barboza Cc: Fabiano Rosas Cc: Murilo Opsfelder Araujo Cc: Nicholas Piggin Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/pci-bridge.h | 7 + arch/powerpc/platforms/pseries/pseries.h | 5 + arch/powerpc/kernel/iommu.c | 159 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 arch/powerpc/platforms/pseries/iommu.c| 24 arch/powerpc/platforms/pseries/setup.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 8 -- 7 files changed, 226 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index c85f901227c9..338a45b410b4 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -8,6 +8,7 @@ #include #include #include +#include struct device_node; @@ -44,6 +45,9 @@ struct pci_controller_ops { #endif void(*shutdown)(struct pci_controller *hose); + + struct iommu_group *(*device_group)(struct pci_controller *hose, + struct pci_dev *pdev); }; /* @@ -131,6 +135,9 @@ struct pci_controller { struct irq_domain *dev_domain; struct irq_domain *msi_domain; struct fwnode_handle*fwnode; + + /* iommu_ops support */ + struct iommu_device iommu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h index f5c916c839c9..9a49a16dd89a 100644 --- a/arch/powerpc/platforms/pseries/pseries.h +++ b/arch/powerpc/platforms/pseries/pseries.h @@ -122,4 +122,9 @@ void pseries_lpar_read_hblkrm_characteristics(void); static inline void 
pseries_lpar_read_hblkrm_characteristics(void) { } #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU +struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose, +struct pci_dev *pdev); +#endif + #endif /* _PSERIES_PSERIES_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index d873c123ab49..b5301e289714 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -35,6 +35,7 @@ #include #include #include +#include #define DBG(...) @@ -1158,8 +1159,14 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - - return iommu_group_add_device(table_group->group, dev); + /* +* This is still not adding devices via the IOMMU bus notifier because +* of pcibios_init() from arch/powerpc/kernel/pci_64.c which calls +* pcibios_scan_phb() first (and this guy adds devices and triggers +* the notifier) and only then it calls pci_bus_add_devices() which +* configures DMA for buses which also creates PEs and IOMMU groups. +*/ + return iommu_probe_device(dev); } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1239,6 +1246,7 @@ static long spapr_tce_take_ownership(struct iommu_table_group *table_group) rc = iommu_take_ownership(tbl); if (!rc) continue; + for (j = 0; j < i; ++j) iommu_release_ownership(table_group->tables[j]); return rc; @@ -1271,4
[PATCH kernel 2/3] powerpc/pci_64: Init pcibios subsys a bit later
The following patches are going to add a dependency on/use of iommu_ops, which is also initialized in a subsys_initcall. This moves pcibios_init() to the next initcall level. This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/pci_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 19b03ddf5631..79472d2f1739 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -73,7 +73,7 @@ static int __init pcibios_init(void)
 	return 0;
 }

-subsys_initcall(pcibios_init);
+subsys_initcall_sync(pcibios_init);

 int pcibios_unmap_io_space(struct pci_bus *bus)
 {
-- 
2.30.2
[PATCH kernel 1/3] powerpc/iommu: Add "borrowing" iommu_table_group_ops
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: it controls the ownership and creates/sets/unsets a table in the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this; other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means: 1) no DDW; 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts at table_group.tce32_start and is as big as pe->table_group.tce32_size (which is not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/kernel/iommu.c | 98 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 6 +- arch/powerpc/platforms/pseries/iommu.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 94 -- 5 files changed, 121 insertions(+), 86 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..678b5bdc79b1 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -175,7 +175,7 @@ struct iommu_table_group_ops { long (*unset_window)(struct iommu_table_group *table_group, int num); /* Switch ownership from platform code to external user (e.g. 
VFIO) */ - void (*take_ownership)(struct iommu_table_group *table_group); + long (*take_ownership)(struct iommu_table_group *table_group); /* Switch ownership from external user (e.g. VFIO) back to core */ void (*release_ownership)(struct iommu_table_group *table_group); }; @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern struct iommu_table_group_ops spapr_tce_table_group_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -303,8 +305,6 @@ extern int iommu_tce_check_gpa(unsigned long page_shift, iommu_tce_check_gpa((tbl)->it_page_shift, (gpa))) extern void iommu_flush_tce(struct iommu_table *tbl); -extern int iommu_take_ownership(struct iommu_table *tbl); -extern void iommu_release_ownership(struct iommu_table *tbl); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index caebe1431596..d873c123ab49 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1088,7 +1088,7 @@ void iommu_tce_kill(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_kill); -int iommu_take_ownership(struct iommu_table *tbl) +static int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; int ret = 0; @@ -1120,9 +1120,8 @@ int iommu_take_ownership(struct iommu_table *tbl) return ret; } -EXPORT_SYMBOL_GPL(iommu_take_ownership); -void iommu_release_ownership(struct iommu_table *tbl) +static void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; @@ -1139,7 +1138,6 @@ void iommu_release_ownership(struct iommu_table *tbl) spin_unlock(&tbl->pools[i].lock); 
spin_unlock_irqrestore(&tbl->large_pool.lock, flags); } -EXPORT_SYMBOL_GPL(iommu_release_ownership); int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { @@ -1181,4 +1179,96 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); + +/* + * A simple iommu_table_group_ops which only allows reusing the existing + * iommu_table. This handles VFIO for POWER7 or the nested KVM. + * The ops does not allow creating windows and only allows reusing the existing + * one if it matches table_group->tce32_start/tce32_size/page_shift. + */ +static unsigned long spapr_tce_get_table_size(__u32 page_shift, + __u64 window_size, __u32 leve
[PATCH kernel 0/3] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Here is another take on iommu_ops on POWER to make VFIO work again on POWERPC64. The tree with all prerequisites is here: https://github.com/aik/linux/tree/kvm-fixes-wip The previous discussion is here: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220707135552.3688927-1-...@ozlabs.ru/ https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Please comment. Thanks. Alexey Kardashevskiy (3): powerpc/iommu: Add "borrowing" iommu_table_group_ops powerpc/pci_64: Init pcibios subsys a bit later powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/include/asm/pci-bridge.h | 7 + arch/powerpc/platforms/pseries/pseries.h | 5 + arch/powerpc/kernel/iommu.c | 257 +- arch/powerpc/kernel/pci_64.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++- arch/powerpc/platforms/pseries/iommu.c| 27 +++ arch/powerpc/platforms/pseries/setup.c| 3 + drivers/vfio/vfio_iommu_spapr_tce.c | 96 ++-- 9 files changed, 345 insertions(+), 94 deletions(-) -- 2.30.2
[PATCH kernel] powerpc/iommu: Fix iommu_table_in_use for a small default DMA window case
The existing iommu_table_in_use() helper checks if the kernel is using any of the TCEs. There are some reserved TCEs:
1) the very first one, if the DMA window starts at 0, to avoid having a zero but still valid DMA handle;
2) it_reserved_start..it_reserved_end, to exclude the MMIO32 window in case the default window spans across it - this is the default for the first DMA window on PowerNV.

When 1) is the case and 2) is not, the helper does not skip 1) and returns the wrong status. This only seems to occur when passing through a PCI device to a nested guest (not something we support really well), so it has not been seen before. This fixes the bug by adding a special case for no MMIO32 reservation.

Fixes: 3c33066a2190 ("powerpc/kernel/iommu: Add new iommu_table_in_use() helper")
Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/iommu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7e56ddb3e0b9..caebe1431596 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -775,6 +775,11 @@ bool iommu_table_in_use(struct iommu_table *tbl)
 	/* ignore reserved bit0 */
 	if (tbl->it_offset == 0)
 		start = 1;
+
+	/* Simple case with no reserved MMIO32 region */
+	if (!tbl->it_reserved_start && !tbl->it_reserved_end)
+		return find_next_bit(tbl->it_map, tbl->it_size, start) != tbl->it_size;
+
 	end = tbl->it_reserved_start - tbl->it_offset;
 	if (find_next_bit(tbl->it_map, end, start) != end)
 		return true;
-- 
2.30.2
[PATCH kernel] powerpc/ioda/iommu/debugfs: Generate unique debugfs entries
The iommu_table::it_index is a LIOBN which is not initialized on PowerNV as it is not used except by IOMMU debugfs, where it is used for a node name. This initializes it_index with a unique number to avoid warnings and to have a node for every iommu_table. This should not cause any behavioral change without CONFIG_IOMMU_DEBUGFS.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c8cf2728031a..9de9b2fb163d 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1609,6 +1609,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
 	tbl->it_ops = &pnv_ioda1_iommu_ops;
 	pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
 	pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
+	tbl->it_index = (phb->hose->global_number << 16) | pe->pe_number;

 	if (!iommu_init_table(tbl, phb->hose->node, 0, 0))
 		panic("Failed to initialize iommu table");
@@ -1779,6 +1780,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 		res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
 	}

+	tbl->it_index = (pe->phb->hose->global_number << 16) | pe->pe_number;
 	if (iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end))
 		rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
 	else
-- 
2.30.2
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 7/12/22 04:46, Jason Gunthorpe wrote: On Mon, Jul 11, 2022 at 11:24:32PM +1000, Alexey Kardashevskiy wrote: I really think that for 5.19 we should really move this blocked domain business to Type1 like this: https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf This creates the same security bug for power we are discussing here. If you How so? attach_dev() on power uninitializes the DMA setup for the group on the hardware level, any other DMA user won't be able to initiate DMA. don't want to fix it then let's just merge this iommu_ops patch as is rather than mangle the core code. The core code should not be assuming iommu_ops != NULL, Type1 should, I thought that is the whole point of having Type1, why is it not the case anymore? -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 10/07/2022 22:32, Alexey Kardashevskiy wrote: On 10/07/2022 16:29, Jason Gunthorpe wrote: On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote: driver->ops->attach_group on POWER attaches a group so VFIO claims ownership over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does not need to keep track of the state, it is one group, one ownership transfer, easy. It should not change, I think you can just map the attach_dev to the group? There are multiple devices in a group, cannot just map 1:1. What exactly is the reason why iommu_group_claim_dma_owner() cannot stay inside Type1 (sorry if it was explained, I could have missed it)? It has nothing to do with type1 - the ownership system is designed to exclude other in-kernel drivers from using the group at the same time vfio is using the group. power still needs this protection regardless of whether it is using the formal iommu api or not. POWER deals with it in vfio_iommu_driver_ops::attach_group. I really think that for 5.19 we should really move this blocked domain business to Type1 like this: https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf Thanks, Also, from another mail, you said iommu_alloc_default_domain() should fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole iommu_group_claim_dma_owner() thing falls apart. Yes And iommu_ops::domain_alloc() is not told if it is asked to create a default domain, it only takes a type. "default domain" refers to the default type passed to domain_alloc(), it will never be blocking, so it will always fail on power. "default domain" is better understood as the domain used by the DMA API The DMA API on POWER does not use iommu_ops, it is dma_iommu_ops from arch/powerpc/kernel/dma-iommu.c from before 2005. so the default domain is type == 0 where 0 == BLOCKED. -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 10/07/2022 16:29, Jason Gunthorpe wrote: On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote: driver->ops->attach_group on POWER attaches a group so VFIO claims ownership over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does not need to keep track of the state, it is one group, one ownership transfer, easy. It should not change, I think you can just map the attach_dev to the group? There are multiple devices in a group, cannot just map 1:1. What exactly is the reason why iommu_group_claim_dma_owner() cannot stay inside Type1 (sorry if it was explained, I could have missed it)? It has nothing to do with type1 - the ownership system is designed to exclude other in-kernel drivers from using the group at the same time vfio is using the group. power still needs this protection regardless of whether it is using the formal iommu api or not. POWER deals with it in vfio_iommu_driver_ops::attach_group. Also, from another mail, you said iommu_alloc_default_domain() should fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole iommu_group_claim_dma_owner() thing falls apart. Yes And iommu_ops::domain_alloc() is not told if it is asked to create a default domain, it only takes a type. "default domain" refers to the default type passed to domain_alloc(), it will never be blocking, so it will always fail on power. "default domain" is better understood as the domain used by the DMA API The DMA API on POWER does not use iommu_ops, it is dma_iommu_ops from arch/powerpc/kernel/dma-iommu.c from before 2005. so the default domain is type == 0 where 0 == BLOCKED. -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 21:55, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote: For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. That will still cause a security problem because tce_iommu_take_ownership()/etc are called too late. This is the moment in the flow when the ownership must change away from the DMA API that power implements and to VFIO, not later. Trying to do that. vfio_group_set_container: iommu_group_claim_dma_owner driver->ops->attach_group iommu_group_claim_dma_owner sets a domain to a group. Good. But it attaches devices, not groups. Bad. driver->ops->attach_group on POWER attaches a group so VFIO claims ownership over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does not need to keep track of the state, it is one group, one ownership transfer, easy. What exactly is the reason why iommu_group_claim_dma_owner() cannot stay inside Type1 (sorry if it was explained, I could have missed it)? Also, from another mail, you said iommu_alloc_default_domain() should fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole iommu_group_claim_dma_owner() thing falls apart. And iommu_ops::domain_alloc() is not told if it is asked to create a default domain, it only takes a type. -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 23:19, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 11:10:07PM +1000, Alexey Kardashevskiy wrote: On 08/07/2022 21:55, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote: For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. That will still cause a security problem because tce_iommu_take_ownership()/etc are called too late. This is the moment in the flow when the ownership must change away from the DMA API that power implements and to VFIO, not later. It is getting better and better :) On POWERNV, at boot time the platform sets up PHBs, enables bypass, creates groups and attaches devices. As for now, attaching devices to the default domain (which is BLOCKED) fails the not-being-used check as enabled bypass means "everything is mapped for DMA". So at this point the default domain has to be IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_UNMANAGED so later on VFIO can move devices to IOMMU_DOMAIN_BLOCKED. Am I missing something? For power the default domain should be NULL NULL means that the platform is using the group to provide its DMA ops. IIRC this patch was already setup correctly to do this? The transition from NULL to blocking must isolate the group so all DMA is blocked. blocking to NULL should re-establish platform DMA API control. The default domain should be non-NULL when the normal dma-iommu stuff is providing the DMA API. So, I think it is already setup properly, it is just the question of what to do when entering/leaving blocking mode. Well, the patch calls iommu_probe_device() which calls iommu_alloc_default_domain() which creates IOMMU_DOMAIN_BLOCKED (==0) as nothing initialized iommu_def_domain_type. Need a different default type (and return NULL when IOMMU API tries creating this type)? Jason -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 21:55, Jason Gunthorpe wrote: On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote: For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. That will still cause a security problem because tce_iommu_take_ownership()/etc are called too late. This is the moment in the flow when the ownership must change away from the DMA API that power implements and to VFIO, not later. It is getting better and better :) On POWERNV, at boot time the platform sets up PHBs, enables bypass, creates groups and attaches devices. As for now, attaching devices to the default domain (which is BLOCKED) fails the not-being-used check as enabled bypass means "everything is mapped for DMA". So at this point the default domain has to be IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_UNMANAGED so later on VFIO can move devices to IOMMU_DOMAIN_BLOCKED. Am I missing something? Jason -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 08/07/2022 17:32, Tian, Kevin wrote: From: Alexey Kardashevskiy Sent: Friday, July 8, 2022 2:35 PM On 7/8/22 15:00, Alexey Kardashevskiy wrote: On 7/8/22 01:10, Jason Gunthorpe wrote: On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote: Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. stale comment since this patch doesn't use bus_set_iommu() now. + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. > It will turn into a security problem to allow kernel mappings to leak > past this point. > I've added for v2 checking for no valid mappings for a device (or, more precisely, in the associated iommu_group), this domain does not need checking, right? Uff, not that simple. Looks like once a device is in a group, its dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess then there is a way to set those to NULL or do something similar to let dma_map_direct() from kernel/dma/mapping.c return "true", is not there? dev->dma_ops is NULL as long as you don't handle DMA domain type here and don't call iommu_setup_dma_ops(). Given this only supports blocking domain then above should be irrelevant. 
For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. Thanks, In general, is "domain" something from hardware or is it a software concept? Thanks, 'domain' is a software concept to represent the hardware I/O page table. A blocking domain means all DMAs from a device attached to this domain are blocked/rejected (equivalent to an empty I/O page table), usually enforced in the .attach_dev() callback. Yes, a comment for why having a NULL .attach_dev() is OK is welcomed. Making it NULL makes __iommu_attach_device() fail, .attach_dev() needs to return 0 in this crippled environment. Thanks for explaining the rest, good food for thought. Thanks Kevin -- Alexey
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 7/8/22 15:00, Alexey Kardashevskiy wrote: On 7/8/22 01:10, Jason Gunthorpe wrote: On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote: Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. Because now we have to set iommu_ops to the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(_bus_type) which is postcore_initcall TODO: check if it still works, read sha1, for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387 Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group. So, basically there is one iommu_device per PHB and devices are added to groups indirectly via series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix needed for barely supported platform in regard to VFIO). The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though. does it make sense to have this many callbacks, or the generic IOMMU code can safely operate without some (given I add some more checks for !NULL)? 
thanks, I wouldn't worry about it.. @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret); For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps? It looks to me that of_iommu.c expects the iommu setup to happen before linux starts as linux looks for #iommu-cells or iommu-map properties in the device tree. The powernv firmware (aka skiboot) does not do this and it is linux which manages iommu groups. +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev) +{ + return false; +} I think you can NULL this op: static bool iommu_is_attach_deferred(struct device *dev) { const struct iommu_ops *ops = dev_iommu_ops(dev); if (ops->is_attach_deferred) return ops->is_attach_deferred(dev); return false; } +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct pci_controller *hose; + struct pci_dev *pdev; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); + + pdev = to_pci_dev(dev); + hose = pdev->bus->sysdata; + + if (!hose->controller_ops.device_group) + return ERR_PTR(-ENOENT); + + return hose->controller_ops.device_group(hose, pdev); +} Is this missing a refcount get on the group? + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. 
> It will turn into a security problem to allow kernel mappings to leak past this point.

For v2 I've added a check that there are no valid mappings for a device (or, more precisely, for the associated iommu_group); this domain does not need checking, right?

Uff, not that simple. It looks like once a device is in a group, its dma_ops is set to iommu_dma_ops and the IOMMU code owns DMA. I guess there is then a way to set those to NULL or do something similar to let dma_map_direct() from kernel/dma/mapping.c return "true", isn't there?

For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine to do nothing as tce_iommu_take_ownership() and tce_iommu_take_ownership_ddw() take care of not having active DMA mappings. Thanks,

In general, is "domain" something from hardware or is it a software concept?
Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
On 7/8/22 01:10, Jason Gunthorpe wrote: On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:

Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; a similar story applies to iommu_group_dma_owner_claimed().

This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides the minimum code for the probing to not crash.

Because now we have to set iommu_ops on the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(&pci_bus_type), which is a postcore_initcall. TODO: check if it still works; read sha1 for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387

Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group.

So, basically there is one iommu_device per PHB and devices are added to groups indirectly via a series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix is needed for that barely supported platform with regard to VFIO).

The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/

I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though.

Does it make sense to have this many callbacks, or can the generic IOMMU code safely operate without some of them (given I add some more checks for !NULL)? Thanks,

I wouldn't worry about it..
@@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret); For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps? It looks to me that of_iommu.c expects the iommu setup to happen before linux starts as linux looks for #iommu-cells or iommu-map properties in the device tree. The powernv firmware (aka skiboot) does not do this and it is linux which manages iommu groups. +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev) +{ + return false; +} I think you can NULL this op: static bool iommu_is_attach_deferred(struct device *dev) { const struct iommu_ops *ops = dev_iommu_ops(dev); if (ops->is_attach_deferred) return ops->is_attach_deferred(dev); return false; } +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct pci_controller *hose; + struct pci_dev *pdev; + + /* Weirdly iommu_device_register() assigns the same ops to all buses */ + if (!dev_is_pci(dev)) + return ERR_PTR(-EPERM); + + pdev = to_pci_dev(dev); + hose = pdev->bus->sysdata; + + if (!hose->controller_ops.device_group) + return ERR_PTR(-ENOENT); + + return hose->controller_ops.device_group(hose, pdev); +} Is this missing a refcount get on the group? + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. 
> It will turn into a security problem to allow kernel mappings to leak > past this point. > I've added for v2 checking for no valid mappings for a device (or, more precisely, in the associated iommu_group), this domain does not need checking, right? In general, is "domain" something from hardware or is it a software concept? Thanks, Jason -- Alexey
[PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; a similar story applies to iommu_group_dma_owner_claimed().

This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides the minimum code for the probing to not crash.

Because now we have to set iommu_ops on the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(&pci_bus_type), which is a postcore_initcall. TODO: check if it still works; read sha1 for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387

Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group.

So, basically there is one iommu_device per PHB and devices are added to groups indirectly via a series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix is needed for that barely supported platform with regard to VFIO).
The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Cc: Oliver O'Halloran Cc: Robin Murphy Cc: Jason Gunthorpe Cc: Alex Williamson Cc: Daniel Henrique Barboza Cc: Fabiano Rosas Cc: Murilo Opsfelder Araujo Cc: Nicholas Piggin Signed-off-by: Alexey Kardashevskiy --- does it make sense to have this many callbacks, or the generic IOMMU code can safely operate without some (given I add some more checks for !NULL)? thanks, --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/include/asm/pci-bridge.h | 7 ++ arch/powerpc/kernel/iommu.c | 106 +- arch/powerpc/kernel/pci-common.c | 2 +- arch/powerpc/platforms/powernv/pci-ioda.c | 40 5 files changed, 155 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..4bdae0ee29d0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern const struct iommu_ops spapr_tce_iommu_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index c85f901227c9..338a45b410b4 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -8,6 +8,7 @@ #include #include #include +#include struct device_node; @@ -44,6 +45,9 @@ struct pci_controller_ops { #endif void(*shutdown)(struct pci_controller *hose); + + struct iommu_group *(*device_group)(struct pci_controller *hose, + struct pci_dev *pdev); }; /* @@ -131,6 +135,9 @@ struct 
pci_controller { struct irq_domain *dev_domain; struct irq_domain *msi_domain; struct fwnode_handle*fwnode; + + /* iommu_ops support */ + struct iommu_device iommu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7e56ddb3e0b9..c4c7eb596fef 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1138,6 +1138,8 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership); int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { + int ret; + /* * The sysfs entries should be populated before * binding IOMMU group. If sysfs entries isn't @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret); + + return ret; } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1176,4 +1181,103 @@ void iommu_del_device(struct devi
[PATCH kernel] powerpc/iommu: Add simple iommu_ops to report capabilities
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64. This adds a simple iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/ Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence") Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices") Signed-off-by: Alexey Kardashevskiy --- I have not looked into the domains for ages, what is missing here? With this on top of 5.19-rc1 VFIO works again on my POWER9 box. Thanks, --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/kernel/iommu.c | 70 arch/powerpc/kernel/pci_64.c | 3 ++ 3 files changed, 75 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e29c73e3dd4..4bdae0ee29d0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm, enum dma_data_direction *direction); extern void iommu_tce_kill(struct iommu_table *tbl, unsigned long entry, unsigned long pages); + +extern const struct iommu_ops spapr_tce_iommu_ops; #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7e56ddb3e0b9..2205b448f7d5 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1176,4 +1176,74 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); + +/* + * A simple iommu_ops to 
allow less cruft in generic VFIO code. + */ +static bool spapr_tce_iommu_capable(enum iommu_cap cap) +{ + switch (cap) { + case IOMMU_CAP_CACHE_COHERENCY: + return true; + default: + break; + } + + return false; +} + +static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type) +{ + struct iommu_domain *domain = kzalloc(sizeof(*domain), GFP_KERNEL); + + if (!domain) + return NULL; + + domain->geometry.aperture_start = 0; + domain->geometry.aperture_end = ~0ULL; + domain->geometry.force_aperture = true; + + return domain; +} + +static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev) +{ + struct iommu_device *iommu_dev = kzalloc(sizeof(struct iommu_device), GFP_KERNEL); + + iommu_dev->dev = dev; + iommu_dev->ops = &spapr_tce_iommu_ops; + + return iommu_dev; +} + +static void spapr_tce_iommu_release_device(struct device *dev) +{ +} + +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, + struct device *dev) +{ + return 0; +} + +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) +{ + struct iommu_group *grp = dev->iommu_group; + + if (!grp) + grp = ERR_PTR(-ENODEV); + return grp; +} + +const struct iommu_ops spapr_tce_iommu_ops = { + .capable = spapr_tce_iommu_capable, + .domain_alloc = spapr_tce_iommu_domain_alloc, + .probe_device = spapr_tce_iommu_probe_device, + .release_device = spapr_tce_iommu_release_device, + .device_group = spapr_tce_iommu_device_group, + .default_domain_ops = &(const struct iommu_domain_ops) { + .attach_dev = spapr_tce_iommu_attach_dev, + } +}; + #endif /* CONFIG_IOMMU_API */ diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c index 19b03ddf5631..04bc0c52e45c 100644 --- a/arch/powerpc/kernel/pci_64.c +++ b/arch/powerpc/kernel/pci_64.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -27,6 +28,7 @@ #include #include #include +#include /* pci_io_base -- the base address from which io bars are offsets. 
* This is the lowest I/O base address (so bar values are always positive), @@ -69,6 +71,7 @@ static int __init pcibios_init(void) ppc_md.pcibios_fixup(); printk(KERN_DEBUG "PCI: Probing PCI hardware done\n"); + bus_set_iommu(&pci_bus_type, &spapr_tce_iommu_ops); return 0; } -- 2.30.2
[PATCH llvm v2] powerpc/llvm/lto: Allow LLVM LTO builds
This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTR alternative section if alt branches use "bc" (Branch Conditional), which is limited to 16-bit offsets. This shows up in errors like:

ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0)

This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b", which allows 26-bit offsets. This catches the problem instructions in vmlinux.o before it is LTO'ed:

$ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\<bc\>'
30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30>
f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0>

This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds, but this is not POWERPC-specific. This makes the copy routines slower on POWER6 as this partially reverts a4e22f02f5b6 ("powerpc: Update 64bit __copy_tofrom_user() using CPU_FTR_UNALIGNED_LD_STD") Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * dropped FTR sections which were only meant to improve POWER6 as Paul suggested --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. 
For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ --- arch/powerpc/Kconfig | 2 ++ arch/powerpc/kernel/exceptions-64s.S | 4 +++- arch/powerpc/lib/copyuser_64.S | 15 +-- arch/powerpc/lib/feature-fixups-test.S | 3 +-- arch/powerpc/lib/memcpy_64.S | 14 +- 5 files changed, 8 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3eaddb8997a9..35050264ea7b 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -162,6 +162,8 @@ config PPC select ARCH_WANTS_MODULES_DATA_IN_VMALLOC if PPC_BOOK3S_32 || PPC_8xx select ARCH_WANTS_NO_INSTR select ARCH_WEAK_RELEASE_ACQUIRE + select ARCH_SUPPORTS_LTO_CLANG + select ARCH_SUPPORTS_LTO_CLANG_THIN select BINFMT_ELF select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b66dd6f775a4..5b783bd51260 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,11 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE - bne masked_interrupt ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S index db8719a14846..b914e52ed240 100644 --- a/arch/powerpc/lib/copyuser_64.S +++ b/arch/powerpc/lib/copyuser_64.S @@ -9,7 +9,7 @@ #include #ifndef SELFTEST_CASE -/* 0 == most CPUs, 1 == POWER6, 2 == Cell */ +/* 0 == most CPUs, 2 == Cell */ #define SELFTEST_CASE 0 #endif @@ -68,19 +68,6 @@ _GLOBAL(__copy_tofrom_user_base) andi. r6,r6,7 PPC_MTOCRF(0x01,r5) blt cr1,.Lshort_copy -/* Below we want to nop out the bne if we're on a CPU that has the - * CPU_FTR_UNALIGNED_LD_STD bit set and the CPU_FTR_CP_USE_DCBTZ bit - * cleared. 
- * At the time of writing the only CPU that has this combination of bits - * set is Power6. - */ -test_feature = (SELFTEST_CASE == 1) -BEGIN_FTR_SECTION - nop -FTR_SECTION_ELSE - bne .Ldst_unaligned -ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ - CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: addir3,r3,-16 r3_offset = 16 diff --git a/arch/powerpc/lib/feature-fixups-test.S b/arch/powerpc/lib/feature-fixups-test.S index 480172fbd024..2751e42a9fd7 100644 --- a/arch/powerpc/lib/feature-fixups-test.S +++ b/arch/powerpc/lib/feature-fixups-test.S @@ -145,7 +145,6 @@ BEGIN_FTR_SECTION FTR_SECTION_ELSE 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b @@ -160,10 +159,10 @@ globl(ftr_fixup_test6_expected) 1: or 1,1,1 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b + nop 3: or 1,1,1 or 2,2,2 or 3,3,3 diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S index 016c91e958d8..117399dbc891 100644 --- a/arch/powerpc/lib/me
[PATCH kernel v2] pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window
The pseries platform uses a 32bit default DMA window (always 4K pages) and an optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), with 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM, which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs, so certain OSes do not need to use the DDW API in order to utilize all available TCE budget.

Linux on the other hand removes the default window and creates a bigger one (with more TCEs and/or a bigger page size - 64K/2M) in a bid to map the entire RAM; even if the new window is smaller than RAM, Linux still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not.

When kdump is invoked, the existing code tries reusing the existing 64bit DMA window whose location and parameters are stored in the device tree, but this fails as the new property does not make it to the kdump device tree blob. So the code falls back to the default window, which does not exist anymore although the device tree says that it does. The result is that PCI devices become unusable and cannot be used for kdumping.

This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices.

Because the DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. It is harmless for kdump as it does not touch the actual window (it only reads what is mapped and marks those IO pages as used) but it is bad for kexec, which clears it thinking it is a smaller default window rather than a bigger DDW window. 
This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy --- Looks like SYSCALL(kexec_file_load) never supported DMA, so it could be: Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") --- Changes: v2: * reworked enable_ddw() to reuse default_win * removed @tbl as it was only used once and later on this zeroes it * undef for xxx64_PROPNAME in file_load_64.c * renamed new functions in file_load_64.c --- arch/powerpc/kexec/file_load_64.c | 54 arch/powerpc/platforms/pseries/iommu.c | 89 ++ 2 files changed, 102 insertions(+), 41 deletions(-) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index b4981b651d9a..5d2c22aa34fb 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1038,6 +1038,48 @@ static int update_cpus_node(void *fdt) return ret; } +static int copy_property(void *fdt, int node_offset, const struct device_node *dn, +const char *propname) +{ + const void *prop, *fdtprop; + int len = 0, fdtlen = 0, ret; + + prop = of_get_property(dn, propname, &len); + fdtprop = fdt_getprop(fdt, node_offset, propname, &fdtlen); + + if (fdtprop && !prop) + ret = fdt_delprop(fdt, node_offset, propname); + else if (prop) + ret = fdt_setprop(fdt, node_offset, propname, prop, len); + + return ret; +} + +static int update_pci_dma_nodes(void *fdt, const char *dmapropname) +{ + struct device_node *dn; + int pci_offset, root_offset, ret = 0; + + if (!firmware_has_feature(FW_FEATURE_LPAR)) + return 0; + + root_offset = fdt_path_offset(fdt, "/"); + for_each_node_with_property(dn, dmapropname) { + pci_offset = fdt_subnode_offset(fdt, root_offset, of_node_full_name(dn)); + if (pci_offset < 0) + continue; + + ret = copy_property(fdt, pci_offset, dn, "ibm,dma-window"); + if (ret < 0) + break; + ret = 
copy_property(fdt, pci_offset, dn, dmapropname); + if (ret < 0) + break; + } + + return ret; +} + /** * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel * being loaded. @@ -1099,6 +1141,18 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, if (ret < 0) goto out; +#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" + ret = update_pci_dma_nodes(fdt, DIRECT64_PROPNAME); + if (ret < 0) + goto out; + + ret = update_pci_dma_nodes(fdt,
[PATCH kernel] KVM: PPC: Do not warn when userspace asked for too big TCE table
KVM manages emulated TCE tables for guest LIOBNs by a two level table which maps up to 128TiB with 16MB IOMMU pages (enabled in QEMU by default) and MAX_ORDER=11 (the kernel's default). Note that the last level of the table is allocated when actual TCE is updated. However these tables are created via ioctl() on kvmfd and the userspace can trigger WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp) in mm/page_alloc.c and flood dmesg. This adds __GFP_NOWARN. Signed-off-by: Alexey Kardashevskiy --- We could probably switch to vmalloc() to allow even bigger emulated tables which we do not really want the userspace to create though. --- arch/powerpc/kvm/book3s_64_vio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index d6589c4fe889..40864373ef87 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -307,7 +307,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, return ret; ret = -ENOMEM; - stt = kzalloc(struct_size(stt, pages, npages), GFP_KERNEL); + stt = kzalloc(struct_size(stt, pages, npages), GFP_KERNEL | __GFP_NOWARN); if (!stt) goto fail_acct; -- 2.30.2
Re: [PATCH kernel] pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window
On 6/27/22 14:10, Russell Currey wrote: On Thu, 2022-06-16 at 17:59 +1000, Alexey Kardashevskiy wrote:

The pseries platform uses a 32bit default DMA window (always 4K pages) and an optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), with 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM, which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs, so certain OSes do not need to use the DDW API in order to utilize all available TCE budget.

Linux on the other hand removes the default window and creates a bigger one (with more TCEs and/or a bigger page size - 64K/2M) in a bid to map the entire RAM; even if the new window is smaller than RAM, Linux still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not.

When kdump is invoked, the existing code tries reusing the existing 64bit DMA window whose location and parameters are stored in the device tree, but this fails as the new property does not make it to the kdump device tree blob. So the code falls back to the default window, which does not exist anymore although the device tree says that it does. The result is that PCI devices become unusable and cannot be used for kdumping.

This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices.

Because the DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. 
It is harmless for kdump as it does not touch the actual window (only reads what is mapped and marks those IO pages as used) but it is bad for kexec which clears it thinking it is a smaller default window rather than a bigger DDW window. This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Signed-off-by: Alexey Kardashevskiy Hey Alexey, great description of the problem. Would this need a Fixes: tag? Maybe. But which patch does it fix really - the one which did not preserve the dma64 properties or the one which started removing the default window? :) --- arch/powerpc/kexec/file_load_64.c | 52 +++ arch/powerpc/platforms/pseries/iommu.c | 88 +++- -- 2 files changed, 102 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index b4981b651d9a..b4b486b68b63 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1038,6 +1038,48 @@ static int update_cpus_node(void *fdt) return ret; } +static int copy_dma_property(void *fdt, int node_offset, const struct device_node *dn, + const char *propname) +{ + const void *prop, *fdtprop; + int len = 0, fdtlen = 0, ret; + + prop = of_get_property(dn, propname, &len); + fdtprop = fdt_getprop(fdt, node_offset, propname, &fdtlen); + + if (fdtprop && !prop) + ret = fdt_delprop(fdt, node_offset, propname); + else if (prop) + ret = fdt_setprop(fdt, node_offset, propname, prop, len); If fdtprop and prop are both false, ret is returned uninitialised. 
+ + return ret; +} + +static int update_pci_nodes(void *fdt, const char *dmapropname) +{ + struct device_node *dn; + int pci_offset, root_offset, ret = 0; + + if (!firmware_has_feature(FW_FEATURE_LPAR)) + return 0; + + root_offset = fdt_path_offset(fdt, "/"); + for_each_node_with_property(dn, dmapropname) { + pci_offset = fdt_subnode_offset(fdt, root_offset, of_node_full_name(dn)); + if (pci_offset < 0) + continue; + + ret = copy_dma_property(fdt, pci_offset, dn, "ibm,dma-window"); + if (ret < 0) + break; + ret = copy_dma_property(fdt, pci_offset, dn, dmapropname); + if (ret < 0) + break; + } + + return ret; +} + /** * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel * being loaded. @@ -1099,6 +1141,16 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, if (ret < 0) goto out; +#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" Instead of having these defined in two different places, could they be moved out of iommu.c and into a header? Though we hardcode ibm,dma-window everywhere anyway. These properties are f
[PATCH kernel] KVM: PPC: Book3s: Fix warning about xics_rm_h_xirr_x
This fixes "no previous prototype": arch/powerpc/kvm/book3s_hv_rm_xics.c:482:15: warning: no previous prototype for 'xics_rm_h_xirr_x' [-Wmissing-prototypes] Reported by the kernel test robot. Fixes: b22af9041927 ("KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers") Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_xics.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h index 8e4c79e2fcd8..08fb0843faf5 100644 --- a/arch/powerpc/kvm/book3s_xics.h +++ b/arch/powerpc/kvm/book3s_xics.h @@ -143,6 +143,7 @@ static inline struct kvmppc_ics *kvmppc_xics_find_ics(struct kvmppc_xics *xics, } extern unsigned long xics_rm_h_xirr(struct kvm_vcpu *vcpu); +extern unsigned long xics_rm_h_xirr_x(struct kvm_vcpu *vcpu); extern int xics_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, unsigned long mfrr); extern int xics_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -- 2.30.2
Re: [PATCH v2 4/4] watchdog/pseries-wdt: initial support for H_WATCHDOG-based watchdog timers
On 6/3/22 03:53, Scott Cheloha wrote: PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. Signed-off-by: Scott Cheloha Besides the patch ordering and 0444 vs. 0644 (which is up to the PPC maintainer to decide anyway :) ), looks good to me. Reviewed-by: Alexey Kardashevskiy --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 264 ++ 4 files changed, 285 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..29153eed6689 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 0 (power off), 1 (restart), + 2 (dump and restart). (default=1) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. 
+ (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..cfe53587457d --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,264 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. + */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * The H_WATCHDOG hypercall first appears in PAPR v2.12 and is + * described fully in sections 14.5 and 14.15.6. 
+ * + * + * H_WATCHDOG Input + * + * R4: "flags": + * + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + * 0x02 Stop Watchdog + * 0x03 Query Watchdog Capabilities + */ +#define PSERIES_WDTF_OP(op)SETFIELD((op), 48, 55) +#define PSERIES_WDTF_OP_START PSERIES_WDTF_OP(0x1) +#define PSERIES_WDTF_OP_STOP PSERIES_WDTF_OP(0x2) +#define PSERIES_WDTF_OP_QUERY PSERIES_WDTF_OP(0x3) + +/* + * Bits 56-63: "timeoutAction" (for "Start Watchdog" only) + * + * 0x01 Hard poweroff + * 0x02 Hard restart + * 0x03 Dump restart + */ +#define PSERIES_WDTF_ACTION(ac)SETFIELD(ac, 56, 63) +#define PSERIES_WDTF_ACTION_HARD_POWEROFF PSERIES_WDTF_ACTION(0x1) +#define PSERIES_WDTF_ACTION_HARD_RESTART PSERIES_WDTF_ACTION(0x2) +#define PSERIES_WDTF_ACTION_DUMP_RES
[PATCH kernel] pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window
The pseries platform uses a 32bit default DMA window (always 4K pages) and an optional 64bit DMA window available via DDW ("Dynamic DMA Windows"), with 64K or 2M pages. For ages the default one was not removed and a huge window was created in addition. Things changed with SRIOV-enabled PowerVM, which creates a default-and-bigger DMA window in 64bit space (still using 4K pages) for IOV VFs so certain OSes do not need to use the DDW API in order to utilize all available TCE budget. Linux on the other hand removes the default window and creates a bigger one (with more TCEs and/or a bigger page size - 64K/2M) in a bid to map the entire RAM; even if the new window turns out smaller than RAM, it still uses this new bigger window. The result is that the default window is removed but the "ibm,dma-window" property is not. When kdump is invoked, the existing code tries reusing the existing 64bit DMA window whose location and parameters are stored in the device tree, but this fails as the new property does not make it into the kdump device tree blob. So the code falls back to the default window which does not exist anymore, although the device tree says that it does. The result is that PCI devices become unusable and cannot be used for kdumping. This preserves the DMA64 and DIRECT64 properties in the device tree blob for the crash kernel. Since the crash kernel setup is done after device drivers are loaded and probed, the proper DMA config is stored at least for boot time devices. Because the DDW window is optional and the code configures the default window first, the existing code creates an IOMMU table descriptor for the non-existing default DMA window. This is harmless for kdump, which does not touch the actual window (it only reads what is mapped and marks those IO pages as used), but it is bad for kexec, which clears it thinking it is a smaller default window rather than a bigger DDW window.
This removes the "ibm,dma-window" property from the device tree after a bigger window is created and the crash kernel setup picks it up. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kexec/file_load_64.c | 52 +++ arch/powerpc/platforms/pseries/iommu.c | 88 +++--- 2 files changed, 102 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index b4981b651d9a..b4b486b68b63 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -1038,6 +1038,48 @@ static int update_cpus_node(void *fdt) return ret; } +static int copy_dma_property(void *fdt, int node_offset, const struct device_node *dn, +const char *propname) +{ + const void *prop, *fdtprop; + int len = 0, fdtlen = 0, ret; + + prop = of_get_property(dn, propname, &len); + fdtprop = fdt_getprop(fdt, node_offset, propname, &fdtlen); + + if (fdtprop && !prop) + ret = fdt_delprop(fdt, node_offset, propname); + else if (prop) + ret = fdt_setprop(fdt, node_offset, propname, prop, len); + + return ret; +} + +static int update_pci_nodes(void *fdt, const char *dmapropname) +{ + struct device_node *dn; + int pci_offset, root_offset, ret = 0; + + if (!firmware_has_feature(FW_FEATURE_LPAR)) + return 0; + + root_offset = fdt_path_offset(fdt, "/"); + for_each_node_with_property(dn, dmapropname) { + pci_offset = fdt_subnode_offset(fdt, root_offset, of_node_full_name(dn)); + if (pci_offset < 0) + continue; + + ret = copy_dma_property(fdt, pci_offset, dn, "ibm,dma-window"); + if (ret < 0) + break; + ret = copy_dma_property(fdt, pci_offset, dn, dmapropname); + if (ret < 0) + break; + } + + return ret; +} + /** * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel * being loaded. 
@@ -1099,6 +1141,16 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, if (ret < 0) goto out; +#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" + ret = update_pci_nodes(fdt, DIRECT64_PROPNAME); + if (ret < 0) + goto out; + + ret = update_pci_nodes(fdt, DMA64_PROPNAME); + if (ret < 0) + goto out; + /* Update memory reserve map */ ret = get_reserved_memory_ranges(); if (ret) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index fba64304e859..af3c871668df 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -700,6 +700,33 @@ struct iommu_table_ops iommu_table_lpar_multi_ops = { .get
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
On 6/2/22 00:48, Scott Cheloha wrote: On Wed, May 25, 2022 at 04:35:11PM +1000, Alexey Kardashevskiy wrote: On 5/21/22 04:35, Scott Cheloha wrote: PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. This one should go before 4/4 in the series for bisectability. What is platform_device_register_simple("pseries-wdt",...) going to do without the driver? This is a chicken-and-egg problem without an obvious solution. A device without a driver is a body without a soul. A driver without a device is a ghost without a machine. ... or something like that, don't quote me :) Absent some very compelling reasoning, I would like to keep the current order. It feels logical to me to keep the powerpc/pseries patches adjacent and prior to the watchdog driver patch. Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c| 337 ++ 4 files changed, 358 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..4ffe725e796c 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: +action: + Action taken when watchdog expires: 1 (power off), 2 (restart), + 3 (dump and restart). 
(default=2) +timeout: + Initial watchdog timeout in seconds. (default=60) +nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..f41bc4d3b7a2 --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. 
+ */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * H_WATCHDOG Hypercall Input + * + * R4: "flags": + * + * A 64-bit value structured as follows: + * + * Bits 0-46: Reserved (must be zero). + */ +#define PSERIES_WDTF_RESERVED PPC_BITMASK(0, 46) + +/* + * Bit 47: "leaveOtherWatchdogsRunningOnTimeout" + * + * 0 Stop outstanding watchdogs on timeout. + * 1 Leave outstanding watchdogs running on timeout. + */ +#define PSERIES_WDTF_LEAVE_OTHER PPC_BIT(47) + +/* + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + *
[PATCH kernel] powerpc/pseries/iommu: Print ibm,query-pe-dma-windows parameters
PowerVM has a stricter policy about allocating TCEs for LPARs and often there are not enough TCEs for 1:1 mapping; this adds the supported numbers into dev_info() to help analyze bug reports. Signed-off-by: Alexey Kardashevskiy --- A PowerVM admin can enable "enlarged IO capacity" for a passed through PCI device but there is no way from inside LPAR to know if that worked or how many more TCEs became available. --- arch/powerpc/platforms/pseries/iommu.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 7639e7355df2..84edc8d730e1 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1022,9 +1022,6 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, ret = rtas_call(ddw_avail[DDW_QUERY_PE_DMA_WIN], 3, out_sz, query_out, cfg_addr, BUID_HI(buid), BUID_LO(buid)); - dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d\n", -ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid), -BUID_LO(buid), ret); switch (out_sz) { case 5: @@ -1042,6 +1039,11 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail, break; } + dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x returned %d, lb=%llx ps=%x wn=%d\n", +ddw_avail[DDW_QUERY_PE_DMA_WIN], cfg_addr, BUID_HI(buid), +BUID_LO(buid), ret, query->largest_available_block, +query->page_size, query->windows_available); + return ret; } -- 2.30.2
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
static int pseries_wdt_start(struct watchdog_device *wdd) +{ + struct device *dev = wdd->parent; + struct pseries_wdt *pw = watchdog_get_drvdata(wdd); + unsigned long flags, msecs; + long rc; + + flags = action | PSERIES_WDTF_OP_START; + msecs = wdd->timeout * 1000UL; + rc = plpar_hcall_norets(H_WATCHDOG, flags, pw->num, msecs); + if (rc != H_SUCCESS) { + dev_crit(dev, "H_WATCHDOG: %ld: failed to start timer %lu", +rc, pw->num); + return -EIO; + } + return 0; +} + +static int pseries_wdt_stop(struct watchdog_device *wdd) +{ + struct device *dev = wdd->parent; + struct pseries_wdt *pw = watchdog_get_drvdata(wdd); + long rc; + + rc = plpar_hcall_norets(H_WATCHDOG, PSERIES_WDTF_OP_STOP, pw->num); + if (rc != H_SUCCESS && rc != H_NOOP) { + dev_crit(dev, "H_WATCHDOG: %ld: failed to stop timer %lu", +rc, pw->num); + return -EIO; + } + return 0; +} + +static struct watchdog_info pseries_wdt_info = { + .identity = DRV_NAME, + .options = WDIOF_KEEPALIVEPING | WDIOF_MAGICCLOSE | WDIOF_SETTIMEOUT + | WDIOF_PRETIMEOUT, +}; + +static const struct watchdog_ops pseries_wdt_ops = { + .owner = THIS_MODULE, + .start = pseries_wdt_start, + .stop = pseries_wdt_stop, +}; + +static int pseries_wdt_probe(struct platform_device *pdev) +{ + unsigned long ret[PLPAR_HCALL_BUFSIZE] = { 0 }; + unsigned long cap, min_timeout_ms; + long rc; + struct pseries_wdt *pw; + int err; + + rc = plpar_hcall(H_WATCHDOG, ret, PSERIES_WDTF_OP_QUERY); + if (rc != H_SUCCESS) + return rc == H_FUNCTION ? -ENODEV : -EIO; Nit: if (rc == H_FUNCTION) return -ENODEV; if (rc != H_SUCCESS) return -EIO; ? + cap = ret[0]; + + pw = devm_kzalloc(&pdev->dev, sizeof(*pw), GFP_KERNEL); + if (!pw) + return -ENOMEM; + + /* +* Assume watchdogNumber 1 for now. If we ever support +* multiple timers we will need to devise a way to choose a +* distinct watchdogNumber for each platform device at device +* registration time. 
+*/ + pw->num = 1; + + pw->wd.parent = &pdev->dev; + pw->wd.info = &pseries_wdt_info; + pw->wd.ops = &pseries_wdt_ops; + min_timeout_ms = PSERIES_WDTQ_MIN_TIMEOUT(cap); + pw->wd.min_timeout = roundup(min_timeout_ms, 1000) / 1000; + pw->wd.max_timeout = UINT_MAX; + watchdog_init_timeout(&pw->wd, timeout, NULL); If PSERIES_WDTF_OP_QUERY returns 2min and this driver's default is 1min, watchdog_init_timeout() returns an error, don't we want to handle it here? Thanks, + watchdog_set_nowayout(&pw->wd, nowayout); + watchdog_stop_on_reboot(&pw->wd); + watchdog_stop_on_unregister(&pw->wd); + watchdog_set_drvdata(&pw->wd, pw); + + err = devm_watchdog_register_device(&pdev->dev, &pw->wd); + if (err) + return err; + + platform_set_drvdata(pdev, &pw->wd); + + return 0; +} + +static int pseries_wdt_suspend(struct platform_device *pdev, pm_message_t state) +{ + struct watchdog_device *wd = platform_get_drvdata(pdev); + + if (watchdog_active(wd)) + return pseries_wdt_stop(wd); + return 0; +} + +static int pseries_wdt_resume(struct platform_device *pdev) +{ + struct watchdog_device *wd = platform_get_drvdata(pdev); + + if (watchdog_active(wd)) + return pseries_wdt_start(wd); + return 0; +} + +static const struct platform_device_id pseries_wdt_id[] = { + { .name = "pseries-wdt" }, + {} +}; +MODULE_DEVICE_TABLE(platform, pseries_wdt_id); + +static struct platform_driver pseries_wdt_driver = { + .driver = { + .name = DRV_NAME, + .owner = THIS_MODULE, + }, + .id_table = pseries_wdt_id, + .probe = pseries_wdt_probe, + .resume = pseries_wdt_resume, + .suspend = pseries_wdt_suspend, +}; +module_platform_driver(pseries_wdt_driver); + +MODULE_AUTHOR("Alexey Kardashevskiy "); +MODULE_AUTHOR("Scott Cheloha "); +MODULE_DESCRIPTION("POWER Architecture Platform Watchdog Driver"); +MODULE_LICENSE("GPL"); -- Alexey
Re: [PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
On 5/4/22 17:48, Alexey Kardashevskiy wrote: When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moved the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy --- Or I could move this one together with KVM_CAP_IRQFD. Thoughts? Ping? --- arch/arm64/kvm/arm.c | 3 +++ arch/mips/kvm/mips.c | 3 +++ arch/powerpc/kvm/powerpc.c | 6 ++ arch/riscv/kvm/vm.c| 3 +++ arch/s390/kvm/kvm-s390.c | 3 +++ arch/x86/kvm/x86.c | 3 +++ virt/kvm/kvm_main.c| 1 - 7 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 523bc934fe2f..092f0614bae3 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -210,6 +210,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index a25e0b73ee70..0f3de470a73e 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_SYNC_MMU: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 875c30c12db0..87698ffef3be 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -591,6 +591,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) break; #endif +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: + r = !xive_enabled(); + break; +#endif 
+ case KVM_CAP_PPC_ALLOC_HTAB: r = hv_enabled; break; diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index c768f75279ef..b58579b386bb 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -63,6 +63,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_MP_STATE: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 156d1c25a3c1..85e093fc8d13 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -564,6 +564,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_S390_DIAG318: case KVM_CAP_S390_MEM_OP_EXTENSION: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c0ca599a353..a0a7b769483d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4273,6 +4273,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SYS_ATTRIBUTES: case KVM_CAP_VAPIC: case KVM_CAP_ENABLE_CAP: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_EXIT_HYPERCALL: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 70e05af5ebea..885e72e668a5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4293,7 +4293,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) #endif #ifdef CONFIG_HAVE_KVM_IRQFD case KVM_CAP_IRQFD: - case KVM_CAP_IRQFD_RESAMPLE: #endif case KVM_CAP_IOEVENTFD_ANY_LENGTH: case KVM_CAP_CHECK_EXTENSION_VM: -- Alexey
Re: [PATCH kernel] KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers
On 5/11/22 03:58, Cédric Le Goater wrote: Hello Alexey, On 5/9/22 09:11, Alexey Kardashevskiy wrote: Currently we have 2 sets of interrupt controller hypercall handlers for real and virtual modes, this is from POWER8 times when switching MMU on was considered an expensive operation. POWER9 however does not have dependent threads and MMU is enabled for handling hcalls so the XIVE native or XICS-on-XIVE real mode handlers never execute on real P9 and later CPUs. XIVE native does not have any real-mode hcall handlers. In fact, all are handled at the QEMU level. They are not? I am surprised. It must be a "recent" change. Anyhow, if you can remove them safely, this is good news and you should be able to clean up some more code in the PowerNV native interface. Yes, this is the result of that massive work of Nick to move the KVM's asm to C for P9. It could have been the case even before that but harder to see in that asm code :) This untemplates the handlers, only keeps the real mode handlers for XICS native (up to POWER8) and removes the rest of the dead code. Changes in functions are mechanical except a few missing empty lines to make checkpatch.pl happy. The default implemented hcalls list already contains XICS hcalls so no change there. This should not cause any behavioral change. In the worst case, it impacts performance a bit but only on "old" distros (kernel < 4.14), I doubt anyone will complain. Signed-off-by: Alexey Kardashevskiy Acked-by: Cédric Le Goater Thanks! Thanks, C. 
--- arch/powerpc/kvm/Makefile | 2 +- arch/powerpc/include/asm/kvm_ppc.h | 7 - arch/powerpc/kvm/book3s_xive.h | 7 - arch/powerpc/kvm/book3s_hv_builtin.c | 64 --- arch/powerpc/kvm/book3s_hv_rm_xics.c | 5 + arch/powerpc/kvm/book3s_hv_rm_xive.c | 46 -- arch/powerpc/kvm/book3s_xive.c | 638 +++- arch/powerpc/kvm/book3s_xive_template.c | 636 --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +- 9 files changed, 632 insertions(+), 785 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive.c delete mode 100644 arch/powerpc/kvm/book3s_xive_template.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 8e3681a86074..f17379b0f161 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -73,7 +73,7 @@ kvm-hv-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm.o kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \ - book3s_hv_rm_xics.o book3s_hv_rm_xive.o + book3s_hv_rm_xics.o kvm-book3s_64-builtin-tm-objs-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm_builtin.o diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 44200a27371b..a775377a570e 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -787,13 +787,6 @@ long kvmppc_rm_h_page_init(struct kvm_vcpu *vcpu, unsigned long flags, unsigned long dest, unsigned long src); long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr, unsigned long slb_v, unsigned int status, bool data); -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_xirr_x(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, - unsigned long mfrr); -int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu); /* diff --git a/arch/powerpc/kvm/book3s_xive.h 
b/arch/powerpc/kvm/book3s_xive.h index 09d0657596c3..1e48f72e8aa5 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -285,13 +285,6 @@ static inline u32 __xive_read_eq(__be32 *qpage, u32 msk, u32 *idx, u32 *toggle) return cur & 0x7fff; } -extern unsigned long xive_rm_h_xirr(struct kvm_vcpu *vcpu); -extern unsigned long xive_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -extern int xive_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, - unsigned long mfrr); -extern int xive_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -extern int xive_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); - /* * Common Xive routines for XICS-over-XIVE and XIVE native */ diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 7e52d0beee77..88a8f6473c4e 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -489,70 +489,6 @@ static long kvmppc_read_one_intr(bool *again) return kvmppc_check_passthru(xisr, xirr, again); } -#ifdef CONFIG_KVM_XICS -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu) -{ - if (!kvmppc_xics_
Re: [PATCH 2/2] powerpc/vdso: Link with ld.lld when requested
On 5/10/22 06:46, Nathan Chancellor wrote: The PowerPC vDSO is linked with $(CC) instead of $(LD), which means the default linker of the compiler is used instead of the linker requested by the builder. $ make ARCH=powerpc LLVM=1 mrproper defconfig arch/powerpc/kernel/vdso/ ... $ llvm-readelf -p .comment arch/powerpc/kernel/vdso/vdso{32,64}.so.dbg File: arch/powerpc/kernel/vdso/vdso32.so.dbg String dump of section '.comment': [ 0] clang version 14.0.0 (Fedora 14.0.0-1.fc37) File: arch/powerpc/kernel/vdso/vdso64.so.dbg String dump of section '.comment': [ 0] clang version 14.0.0 (Fedora 14.0.0-1.fc37) The compiler option '-fuse-ld' tells the compiler which linker to use when it is invoked as both the compiler and linker. Use '-fuse-ld=lld' when LD=ld.lld has been specified (CONFIG_LD_IS_LLD) so that the vDSO is linked with the same linker as the rest of the kernel. $ llvm-readelf -p .comment arch/powerpc/kernel/vdso/vdso{32,64}.so.dbg File: arch/powerpc/kernel/vdso/vdso32.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) File: arch/powerpc/kernel/vdso/vdso64.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) LD can be a full path to ld.lld, which will not be handled properly by '-fuse-ld=lld' if the full path to ld.lld is outside of the compiler's search path. '-fuse-ld' can take a path to the linker but it is deprecated in clang 12.0.0; '--ld-path' is preferred for this scenario. Use '--ld-path' if it is supported, as it will handle a full path or just 'ld.lld' properly. See the LLVM commit below for the full details of '--ld-path'. 
Link: https://github.com/ClangBuiltLinux/linux/issues/774 Link: https://github.com/llvm/llvm-project/commit/1bc5c84710a8c73ef21295e63c19d10a8c71f2f5 Signed-off-by: Nathan Chancellor --- arch/powerpc/kernel/vdso/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile index 954974287ee7..096b0bf1335f 100644 --- a/arch/powerpc/kernel/vdso/Makefile +++ b/arch/powerpc/kernel/vdso/Makefile @@ -48,6 +48,7 @@ UBSAN_SANITIZE := n KASAN_SANITIZE := n ccflags-y := -shared -fno-common -fno-builtin -nostdlib -Wl,--hash-style=both +ccflags-$(CONFIG_LD_IS_LLD) += $(call cc-option,--ld-path=$(LD),-fuse-ld=lld) Out of curiosity - how does this work exactly? I can see --ld-path= in the output so it works but there is no -fuse-ld=lld, is the second argument of cc-option only picked when the first one is not supported? Anyway, Tested-by: Alexey Kardashevskiy Reviewed-by: Alexey Kardashevskiy CC32FLAGS := -Wl,-soname=linux-vdso32.so.1 -m32 AS32FLAGS := -D__VDSO32__ -s
Re: [PATCH 1/2] powerpc/vdso: Remove unused ENTRY in linker scripts
On 5/10/22 06:46, Nathan Chancellor wrote: When linking vdso{32,64}.so.dbg with ld.lld, there is a warning about not finding _start for the starting address: ld.lld: warning: cannot find entry symbol _start; not setting start address ld.lld: warning: cannot find entry symbol _start; not setting start address Looking at GCC + GNU ld, the entry point address is 0x0: $ llvm-readelf -h vdso{32,64}.so.dbg &| rg "(File|Entry point address):" File: vdso32.so.dbg Entry point address: 0x0 File: vdso64.so.dbg Entry point address: 0x0 This matches what ld.lld emits: $ powerpc64le-linux-gnu-readelf -p .comment vdso{32,64}.so.dbg File: vdso32.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) File: vdso64.so.dbg String dump of section '.comment': [ 0] Linker: LLD 14.0.0 [14] clang version 14.0.0 (Fedora 14.0.0-1.fc37) $ llvm-readelf -h vdso{32,64}.so.dbg &| rg "(File|Entry point address):" File: vdso32.so.dbg Entry point address: 0x0 File: vdso64.so.dbg Entry point address: 0x0 Remove ENTRY to remove the warning, as it is unnecessary for the vDSO to function correctly. Sounds more like a bugfix to me - _start is simply not defined, I wonder why ld is not complaining. 
Tested-by: Alexey Kardashevskiy Reviewed-by: Alexey Kardashevskiy Signed-off-by: Nathan Chancellor --- arch/powerpc/kernel/vdso/vdso32.lds.S | 1 - arch/powerpc/kernel/vdso/vdso64.lds.S | 1 - 2 files changed, 2 deletions(-) diff --git a/arch/powerpc/kernel/vdso/vdso32.lds.S b/arch/powerpc/kernel/vdso/vdso32.lds.S index 58e0099f70f4..e0d19d74455f 100644 --- a/arch/powerpc/kernel/vdso/vdso32.lds.S +++ b/arch/powerpc/kernel/vdso/vdso32.lds.S @@ -13,7 +13,6 @@ OUTPUT_FORMAT("elf32-powerpcle", "elf32-powerpcle", "elf32-powerpcle") OUTPUT_FORMAT("elf32-powerpc", "elf32-powerpc", "elf32-powerpc") #endif OUTPUT_ARCH(powerpc:common) -ENTRY(_start) SECTIONS { diff --git a/arch/powerpc/kernel/vdso/vdso64.lds.S b/arch/powerpc/kernel/vdso/vdso64.lds.S index 0288cad428b0..1a4a7bc4c815 100644 --- a/arch/powerpc/kernel/vdso/vdso64.lds.S +++ b/arch/powerpc/kernel/vdso/vdso64.lds.S @@ -13,7 +13,6 @@ OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", "elf64-powerpcle") OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc") #endif OUTPUT_ARCH(powerpc:common64) -ENTRY(_start) SECTIONS {
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/9/22 15:18, Alexey Kardashevskiy wrote: On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) What option is that? 
I only add -DLLVM_ENABLE_LLD=ON which (I think) tells cmake to use lld to link the LLVM being built but does not seem to tell what the built clang should do. Without -DLLVM_ENABLE_LLD=ON, building just fails: [fstn1-p1 ~/pbuild/llvm/llvm-lto-latest-cleanbuild]$ ninja -j 100 [619/3501] Linking CXX executable bin/not FAILED: bin/not : && /usr/bin/clang++ -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -flto -O3 -DNDEBUG -flto -Wl,-rpath-link,/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/./lib -Wl,--gc-sections utils/not/CMakeFiles/not.dir/not.cpp.o -o bin/not -Wl,-rpath,"\$ORIGIN/../lib" -lpthread lib/libLLVMSupport.a -lrt -ldl -lpthread -lm /usr/lib/powerpc64le-linux-gnu/libz.so /usr/lib/powerpc64le-linux-gnu/libtinfo.so lib/libLLVMDemangle.a && : /usr/bin/ld: lib/libLLVMSupport.a: error adding symbols: archive has no index; run ranlib to add one clang: error: linker command failed with exit code 1 (use -v to see invocation) [701/3501] Building CXX object utils/TableGen/CMakeFiles/llvm-tblgen.dir/GlobalISelEmitter.cpp.o ninja: build stopped: subcommand failed. My head hurts :( The above example is running on PPC. Now I am trying x86 box: A bit of progress. 
cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_TARGET_ARCH=PowerPC -DLLVM_TARGETS_TO_BUILD=PowerPC ~/llvm-project//llvm -DLLVM_ENABLE_LTO=ON -DLLVM_BINUTILS_INCDIR=/usr/lib/gcc/powerpc64le-linux-gnu/11/plugin/include/ -DCMAKE_BUILD_TYPE=Release produces: -- Native target architecture is PowerPC -- LLVM host triple: x86_64-unknown-linux-gnu -- LLVM default target triple: x86_64-unknown-linux-gnu and the resulting "clang" can only do "Target: x86_64-unknown-linux-gnu". How do you build LLVM exactly? Thanks,
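For reference, the "default target triple" in the output above is controlled by a separate cmake variable from the backend list. A minimal sketch of an invocation that makes the built clang default to ppc64le on an x86 host (paths are placeholders, not taken from the thread):

```shell
# Build clang+lld whose *default* target is ppc64le, not the host triple.
# LLVM_TARGETS_TO_BUILD only selects which backends get compiled in;
# LLVM_DEFAULT_TARGET_TRIPLE controls what plain "clang" targets by default.
cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_TARGETS_TO_BUILD=PowerPC \
  -DLLVM_DEFAULT_TARGET_TRIPLE=powerpc64le-linux-gnu \
  -DCMAKE_BUILD_TYPE=Release \
  ~/llvm-project/llvm
ninja -j "$(nproc)" clang lld
```

With that set, `clang --version` should report `Target: powerpc64le-linux-gnu` without needing an explicit `--target=` on every invocation.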
[PATCH kernel] KVM: PPC: Book3s: Remove real mode interrupt controller hcalls handlers
Currently we have 2 sets of interrupt controller hypercall handlers for real and virtual modes; this dates from POWER8 times when switching the MMU on was considered an expensive operation. POWER9, however, does not have dependent threads and the MMU is enabled for handling hcalls, so the XIVE native and XICS-on-XIVE real mode handlers never execute on real POWER9 and later CPUs. This untemplates the handlers, keeps the real mode handlers only for native XICS (up to POWER8), and removes the rest of the dead code. Changes in the functions are mechanical, except for a few empty lines added to make checkpatch.pl happy. The default implemented hcalls list already contains the XICS hcalls, so no change there. This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy --- Minus 153 lines nevertheless. --- arch/powerpc/kvm/Makefile | 2 +- arch/powerpc/include/asm/kvm_ppc.h | 7 - arch/powerpc/kvm/book3s_xive.h | 7 - arch/powerpc/kvm/book3s_hv_builtin.c| 64 --- arch/powerpc/kvm/book3s_hv_rm_xics.c| 5 + arch/powerpc/kvm/book3s_hv_rm_xive.c| 46 -- arch/powerpc/kvm/book3s_xive.c | 638 +++- arch/powerpc/kvm/book3s_xive_template.c | 636 --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 +- 9 files changed, 632 insertions(+), 785 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive.c delete mode 100644 arch/powerpc/kvm/book3s_xive_template.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 8e3681a86074..f17379b0f161 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -73,7 +73,7 @@ kvm-hv-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm.o kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \ - book3s_hv_rm_xics.o book3s_hv_rm_xive.o + book3s_hv_rm_xics.o kvm-book3s_64-builtin-tm-objs-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \ book3s_hv_tm_builtin.o diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 44200a27371b..a775377a570e 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ 
b/arch/powerpc/include/asm/kvm_ppc.h @@ -787,13 +787,6 @@ long kvmppc_rm_h_page_init(struct kvm_vcpu *vcpu, unsigned long flags, unsigned long dest, unsigned long src); long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr, unsigned long slb_v, unsigned int status, bool data); -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_xirr_x(struct kvm_vcpu *vcpu); -unsigned long kvmppc_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, -unsigned long mfrr); -int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu); /* diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 09d0657596c3..1e48f72e8aa5 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -285,13 +285,6 @@ static inline u32 __xive_read_eq(__be32 *qpage, u32 msk, u32 *idx, u32 *toggle) return cur & 0x7fff; } -extern unsigned long xive_rm_h_xirr(struct kvm_vcpu *vcpu); -extern unsigned long xive_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server); -extern int xive_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, -unsigned long mfrr); -extern int xive_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); -extern int xive_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); - /* * Common Xive routines for XICS-over-XIVE and XIVE native */ diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 7e52d0beee77..88a8f6473c4e 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -489,70 +489,6 @@ static long kvmppc_read_one_intr(bool *again) return kvmppc_check_passthru(xisr, xirr, again); } -#ifdef CONFIG_KVM_XICS -unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu) -{ - if (!kvmppc_xics_enabled(vcpu)) - return H_TOO_HARD; - 
if (xics_on_xive()) - return xive_rm_h_xirr(vcpu); - else - return xics_rm_h_xirr(vcpu); -} - -unsigned long kvmppc_rm_h_xirr_x(struct kvm_vcpu *vcpu) -{ - if (!kvmppc_xics_enabled(vcpu)) - return H_TOO_HARD; - vcpu->arch.regs.gpr[5] = get_tb(); - if (xics_on_xive()) - return xive_rm_h_xirr(vcpu); - else - return xics_rm_h_xirr(vcpu); -} - -unsigned long kvmppc_rm_h_ipoll(struct kvm_vcpu *vcpu, unsigned long server) -{ - if (!kvmppc_xics_enabled(vcpu)) - return H_TOO_HARD; - if (xics_on_xive()) - return xive_rm_h_ipoll(vcpu, server); -
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) What option is that? 
I only add -DLLVM_ENABLE_LLD=ON which (I think) tells cmake to use lld to link the LLVM being built but does not seem to tell what the built clang should do. Without -DLLVM_ENABLE_LLD=ON, building just fails: [fstn1-p1 ~/pbuild/llvm/llvm-lto-latest-cleanbuild]$ ninja -j 100 [619/3501] Linking CXX executable bin/not FAILED: bin/not : && /usr/bin/clang++ -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -flto -O3 -DNDEBUG -flto -Wl,-rpath-link,/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/./lib -Wl,--gc-sections utils/not/CMakeFiles/not.dir/not.cpp.o -o bin/not -Wl,-rpath,"\$ORIGIN/../lib" -lpthread lib/libLLVMSupport.a -lrt -ldl -lpthread -lm /usr/lib/powerpc64le-linux-gnu/libz.so /usr/lib/powerpc64le-linux-gnu/libtinfo.so lib/libLLVMDemangle.a && : /usr/bin/ld: lib/libLLVMSupport.a: error adding symbols: archive has no index; run ranlib to add one clang: error: linker command failed with exit code 1 (use -v to see invocation) [701/3501] Building CXX object utils/TableGen/CMakeFiles/llvm-tblgen.dir/GlobalISelEmitter.cpp.o ninja: build stopped: subcommand failed. My head hurts :( The above example is running on PPC. 
Now I am trying x86 box: [2693/3505] Linking CXX shared library lib/libLTO.so.15git FAILED: lib/libLTO.so.15git : && /usr/bin/clang++ -fPIC -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -flto -O3 -DNDEBUG -Wl,-z,defs -Wl,-z,nodelete -fuse-ld=ld -flto -Wl,-rpath-link,/home/aik/llvm-build/./lib -Wl,--gc-sections -Wl,--version-script,"/home/aik/llvm-build/tools/lto/LTO.exports" -shared -Wl,-soname,libLTO.so.15git -o lib/libLTO.so.15git tools/lto/CMakeFiles/LTO.dir/LTODisassembler.cpp.o tools/lto/CMakeFiles/LTO.dir/lto.cpp.o -Wl,-rpath,"\$ORIGIN/../lib" lib/libLLVMPowerPCAsmParser.a lib/libLLVMPowerPCCodeGen.a lib/libLLVMPowerPCDesc.a lib/libLLVMPowerPCDisassembler.a lib/libLLVMPowerPCInfo.a lib/libLLVMBitReader.a lib/libLLVMCore.a lib/libLLVMCodeGen.a lib/libLLVMLTO.a lib/libLLVMMC.a lib/libLLVMMCDisassembler.a lib/libLLVMSupport.a lib/libLLVMTarget.a lib/libLLVMAsmPrinter.a lib/libLLVMGlobalISel.a lib/libLLVMSelectionDAG.
[PATCH kernel] KVM: PPC: Book3s: PR: Enable default TCE hypercalls
When KVM_CAP_PPC_ENABLE_HCALL was introduced, H_GET_TCE and H_PUT_TCE were already implemented and enabled by default; however H_GET_TCE was left out on PR KVM (probably because the handler was in the real mode code at the time). This enables H_GET_TCE by default. While at it, this wraps the cases in #ifdef CONFIG_SPAPR_TCE_IOMMU, just like HV KVM does. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_pr_papr.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c index dc4f51ac84bc..a1f2978b2a86 100644 --- a/arch/powerpc/kvm/book3s_pr_papr.c +++ b/arch/powerpc/kvm/book3s_pr_papr.c @@ -433,9 +433,12 @@ int kvmppc_hcall_impl_pr(unsigned long cmd) case H_REMOVE: case H_PROTECT: case H_BULK_REMOVE: +#ifdef CONFIG_SPAPR_TCE_IOMMU + case H_GET_TCE: case H_PUT_TCE: case H_PUT_TCE_INDIRECT: case H_STUFF_TCE: +#endif case H_CEDE: case H_LOGICAL_CI_LOAD: case H_LOGICAL_CI_STORE: @@ -464,7 +467,10 @@ static unsigned int default_hcall_list[] = { H_REMOVE, H_PROTECT, H_BULK_REMOVE, +#ifdef CONFIG_SPAPR_TCE_IOMMU + H_GET_TCE, H_PUT_TCE, +#endif H_CEDE, H_SET_MODE, #ifdef CONFIG_KVM_XICS -- 2.30.2
[PATCH kernel v2] KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers
LoPAPR defines a guest-visible IOMMU with hypercalls to use it - H_PUT_TCE etc. It was implemented first on POWER7, where hypercalls would trap into KVM in real mode (with the MMU off). The problem with real mode is that some memory is not available and some API usage crashed the host, while enabling the MMU was an expensive operation. The problems with the real mode handlers are:
1. Occasionally these cannot complete the request, so the code is copied+modified to work in virtual mode, and very little is shared;
2. The real mode handlers have to be linked into vmlinux to work;
3. An exception in real mode immediately reboots the machine.
If the small DMA window is used, the real mode handlers bring better performance. However, since POWER8 there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE), whose virtual mode handler is even closer to its real mode version. On POWER9, hypercalls trap straight into virtual mode, so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and MMU improvements in POWER9 and later, there is no point in duplicating the code. 32-bit passed-through devices may slow down, but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demonstrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and the related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. 
Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * update the list of enabled hcalls as they are removed from .S --- arch/powerpc/kvm/Makefile | 3 - arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/include/asm/kvm_ppc.h| 2 - arch/powerpc/include/asm/mmu_context.h| 5 - arch/powerpc/platforms/powernv/pci.h | 3 +- arch/powerpc/kernel/iommu.c | 4 +- arch/powerpc/kvm/book3s_64_vio.c | 43 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 672 -- arch/powerpc/kvm/book3s_hv.c | 6 + arch/powerpc/mm/book3s64/iommu_api.c | 68 -- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 5 +- arch/powerpc/platforms/powernv/pci-ioda.c | 46 +- arch/powerpc/platforms/pseries/iommu.c| 3 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 10 - 14 files changed, 75 insertions(+), 801 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 9bdfc8b50899..8e3681a86074 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -37,9 +37,6 @@ kvm-e500mc-objs := \ e500_emulate.o kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs) -kvm-book3s_64-builtin-objs-$(CONFIG_SPAPR_TCE_IOMMU) := \ - book3s_64_vio_hv.o - kvm-pr-y := \ fpu.o \ emulate.o \ diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d7912b66c874..7e29c73e3dd4 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -51,13 +51,11 @@ struct iommu_table_ops { int (*xchg_no_kill)(struct iommu_table *tbl, long index, unsigned long *hpa, - enum dma_data_direction *direction, - bool realmode); + enum dma_data_direction *direction); void (*tce_kill)(struct iommu_table *tbl, unsigned long index, - unsigned long pages, - bool realmode); + unsigned long pages); __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); #endif diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 838d4cb460b7..44200a27371b 100644 --- 
a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -177,8 +177,6 @@ extern void kvmppc_setup_partition_table(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); -extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm *kvm, unsigned long liobn); #define kvmppc_ioba_validate(stt, ioba, npages) \ (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \ (stt)->size, (ioba), (npages)) ?\ diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b8527a74bd4d..3f25bd3e14eb 100644 --- a/arch/powerpc/include/asm/mmu_contex
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 04/05/2022 17:11, Alexey Kardashevskiy wrote: On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) I was not. 
Just did clean build like this: mkdir ~/pbuild/llvm/llvm-lto-latest-cleanbuild cd ~/pbuild/llvm/llvm-lto-latest-cleanbuild CC='clang' CXX='clang++' cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_TARGETS_TO_BUILD=PowerPC ~/p/llvm/llvm-latest/llvm/ -DLLVM_ENABLE_LTO=ON -DLLVM_ENABLE_LLD=ON -DLLVM_BINUTILS_INCDIR=/usr/include -DCMAKE_BUILD_TYPE=Release ninja -j 50 It builds fine: [fstn1-p1 ~/p/kernels-llvm/llvm]$ find /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/ -iname LLVMgold.so -exec ls -l {} \; -rwxrwxr-x 1 aik aik 39032840 May 4 13:06 /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/lib/LLVMgold.so and then in the kernel tree: PATH=/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/bin:$PATH make -j64 O=/home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/ ARCH=powerpc LLVM_IAS=1 CC=clang LLVM=1 ppc64le_defconfig then enabled LTO in that .config and then just built "vmlinux": [fstn1-p1 ~/p/kernels-llvm/llvm]$ ls -l /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux -rwxrwxr-x 1 aik aik 48145272 May 4 17:00 /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux which boots under qemu, the kernel version is: Preparing to boot Linux version 5.18.0-rc2_0bb153baeff0_a+fstn1 (aik@fstn1-p1) (clang version 15.0.0 (https://github.com/llvm/llvm-proje ct.git e29dc0c6fde284e7f05aa5f45b05c629c9fad295), LLD 15.0.0) #1 SMP Wed May 4 16:54:16 AEST 2022 Before I got to this point, I did many unspeakable things to that build system so may be it is screwed in some way but I cannot pinpoint it. The installed clang/lld is 12.0.0-3ubuntu1~21.04.2 and -DLLVM_ENABLE_LLD=ON from cmake is to accelerate rebuilding of LLVM (for bisecting). I'll try without it now, just takes ages to complete. And I just did. 
clang built with gcc simply crashes while building kernel's scripts/basic/fixdep :-D I may have to file a bug against clang now :-/ Perhaps for now I should just send: ``` diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile index 954974287ee7..8762e6513683 100644 --- a/arch/powerpc/kernel/vdso/Makefile +++ b/arch/powerpc/kernel/vdso/Makefile @@ -55,6 +55,11 @@ AS32FLAGS := -D__VDSO32__ -s CC64FLAGS := -Wl,-soname=linux-vdso64.so.1 AS64FLAGS := -D__VDSO64__ -s +ifneq ($(LLVM),) +CC32FLAGS += -fuse-ld=lld +CC64FLAGS += -fuse-ld=lld +endif + targets += vdso32.lds CPPFLAGS_vdso32.lds += -P -C -Upowerpc targets += vdso64.lds ``` Signed-off-by: Alexey Kardashevskiy --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ Yeah, I just hit this: ``` LTO vmlinux.o LLVM ERROR: Function Import:
[PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moves the capability advertising to the platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy --- Or I could move this one together with KVM_CAP_IRQFD. Thoughts? --- arch/arm64/kvm/arm.c | 3 +++ arch/mips/kvm/mips.c | 3 +++ arch/powerpc/kvm/powerpc.c | 6 ++ arch/riscv/kvm/vm.c| 3 +++ arch/s390/kvm/kvm-s390.c | 3 +++ arch/x86/kvm/x86.c | 3 +++ virt/kvm/kvm_main.c| 1 - 7 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 523bc934fe2f..092f0614bae3 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -210,6 +210,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_VCPU_ATTRIBUTES: case KVM_CAP_PTP_KVM: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index a25e0b73ee70..0f3de470a73e 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1071,6 +1071,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_SYNC_MMU: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 875c30c12db0..87698ffef3be 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -591,6 +591,12 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) break; #endif +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: + r = !xive_enabled(); + break; +#endif + case KVM_CAP_PPC_ALLOC_HTAB: r = hv_enabled; 
break; diff --git a/arch/riscv/kvm/vm.c b/arch/riscv/kvm/vm.c index c768f75279ef..b58579b386bb 100644 --- a/arch/riscv/kvm/vm.c +++ b/arch/riscv/kvm/vm.c @@ -63,6 +63,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_READONLY_MEM: case KVM_CAP_MP_STATE: case KVM_CAP_IMMEDIATE_EXIT: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_NR_VCPUS: diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 156d1c25a3c1..85e093fc8d13 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -564,6 +564,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SET_GUEST_DEBUG: case KVM_CAP_S390_DIAG318: case KVM_CAP_S390_MEM_OP_EXTENSION: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0c0ca599a353..a0a7b769483d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4273,6 +4273,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SYS_ATTRIBUTES: case KVM_CAP_VAPIC: case KVM_CAP_ENABLE_CAP: +#ifdef CONFIG_HAVE_KVM_IRQFD + case KVM_CAP_IRQFD_RESAMPLE: +#endif r = 1; break; case KVM_CAP_EXIT_HYPERCALL: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 70e05af5ebea..885e72e668a5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4293,7 +4293,6 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) #endif #ifdef CONFIG_HAVE_KVM_IRQFD case KVM_CAP_IRQFD: - case KVM_CAP_IRQFD_RESAMPLE: #endif case KVM_CAP_IOEVENTFD_ANY_LENGTH: case KVM_CAP_CHECK_EXTENSION_VM: -- 2.30.2
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/4/22 07:21, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTP alternative section if alt branches use "bc" (Branch Conditional) which is limited by 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b" which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds but this is not POWERPC-specific. $ ARCH=powerpc make LLVM=1 -j72 ppc64le_defconfig $ ARCH=powerpc make LLVM=1 -j72 menuconfig $ ARCH=powerpc make LLVM=1 -j72 ... VDSO64L arch/powerpc/kernel/vdso/vdso64.so.dbg /usr/bin/powerpc64le-linux-gnu-ld: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: error loading plugin: /android0/llvm-project/llvm/build/bin/../lib/LLVMgold.so: cannot open shared object file: No such file or directory clang-15: error: linker command failed with exit code 1 (use -v to see invocation) make[1]: *** [arch/powerpc/kernel/vdso/Makefile:67: arch/powerpc/kernel/vdso/vdso64.so.dbg] Error 1 Looks like LLD isn't being invoked correctly to link the vdso. Probably need to revisit https://lore.kernel.org/lkml/20200901222523.1941988-1-ndesaulni...@google.com/ How were you working around this issue? Perhaps you built clang to default to LLD? (there's a cmake option for that) I was not. 
Just did clean build like this: mkdir ~/pbuild/llvm/llvm-lto-latest-cleanbuild cd ~/pbuild/llvm/llvm-lto-latest-cleanbuild CC='clang' CXX='clang++' cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_TARGETS_TO_BUILD=PowerPC ~/p/llvm/llvm-latest/llvm/ -DLLVM_ENABLE_LTO=ON -DLLVM_ENABLE_LLD=ON -DLLVM_BINUTILS_INCDIR=/usr/include -DCMAKE_BUILD_TYPE=Release ninja -j 50 It builds fine: [fstn1-p1 ~/p/kernels-llvm/llvm]$ find /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/ -iname LLVMgold.so -exec ls -l {} \; -rwxrwxr-x 1 aik aik 39032840 May 4 13:06 /home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/lib/LLVMgold.so and then in the kernel tree: PATH=/home/aik/pbuild/llvm/llvm-lto-latest-cleanbuild/bin:$PATH make -j64 O=/home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/ ARCH=powerpc LLVM_IAS=1 CC=clang LLVM=1 ppc64le_defconfig then enabled LTO in that .config and then just built "vmlinux": [fstn1-p1 ~/p/kernels-llvm/llvm]$ ls -l /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux -rwxrwxr-x 1 aik aik 48145272 May 4 17:00 /home/aik/pbuild/kernels-llvm/llvm-wip-llvm-latest-lto-full-cleanbuild/vmlinux which boots under qemu, the kernel version is: Preparing to boot Linux version 5.18.0-rc2_0bb153baeff0_a+fstn1 (aik@fstn1-p1) (clang version 15.0.0 (https://github.com/llvm/llvm-proje ct.git e29dc0c6fde284e7f05aa5f45b05c629c9fad295), LLD 15.0.0) #1 SMP Wed May 4 16:54:16 AEST 2022 Before I got to this point, I did many unspeakable things to that build system so may be it is screwed in some way but I cannot pinpoint it. The installed clang/lld is 12.0.0-3ubuntu1~21.04.2 and -DLLVM_ENABLE_LLD=ON from cmake is to accelerate rebuilding of LLVM (for bisecting). I'll try without it now, just takes ages to complete. 
Perhaps for now I should just send: ``` diff --git a/arch/powerpc/kernel/vdso/Makefile b/arch/powerpc/kernel/vdso/Makefile index 954974287ee7..8762e6513683 100644 --- a/arch/powerpc/kernel/vdso/Makefile +++ b/arch/powerpc/kernel/vdso/Makefile @@ -55,6 +55,11 @@ AS32FLAGS := -D__VDSO32__ -s CC64FLAGS := -Wl,-soname=linux-vdso64.so.1 AS64FLAGS := -D__VDSO64__ -s +ifneq ($(LLVM),) +CC32FLAGS += -fuse-ld=lld +CC64FLAGS += -fuse-ld=lld +endif + targets += vdso32.lds CPPFLAGS_vdso32.lds += -P -C -Upowerpc targets += vdso64.lds ``` Signed-off-by: Alexey Kardashevskiy --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ Yeah, I just hit this: ``` LTO vmlinux.o LLVM ERROR: Function Import: link error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(entropy_common.o at 5782)' and 'lib/built-in.a(zstd_decompress_block.o at 6202)' PLEASE submit a bug report to https:
Re: [PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
On 5/4/22 07:24, Nick Desaulniers wrote: On Thu, Apr 28, 2022 at 11:46 PM Alexey Kardashevskiy wrote: diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b66dd6f775a4..5b783bd51260 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,11 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE Do you need to have the ELSE even if there's nothing in it; should it have a nop? The rest of the assembler changes LGTM, but withholding RB tag until we have Kconfig dependencies in better shape. The FTR patcher will add the necessary amount of "nop"s there and dropping "FTR_SECTION_ELSE" breaks the build as it does some "pushsection" magic. - bne masked_interrupt ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else
[PATCH kernel] powerpc/llvm/lto: Allow LLVM LTO builds
This enables LTO_CLANG builds on POWER with the upstream version of LLVM. LTO optimizes the output vmlinux binary and this may affect the FTR alternative section if alt branches use "bc" (Branch Conditional), which is limited to 16 bit offsets. This shows up in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the issue by replacing "bc" in FTR_SECTION_ELSE with "b", which allows 26 bit offsets. This catches the problem instructions in vmlinux.o before it is LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\<bc\>' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> This allows LTO builds for ppc64le_defconfig plus LTO options. Note that DYNAMIC_FTRACE/FUNCTION_TRACER is not supported by LTO builds, but this is not POWERPC-specific. Signed-off-by: Alexey Kardashevskiy --- Note 1: This is further development of https://lore.kernel.org/all/20220211023125.1790960-1-...@ozlabs.ru/T/ Note 2: CONFIG_ZSTD_COMPRESS and CONFIG_ZSTD_DECOMPRESS must be both "m" or "y" or it won't link. 
For details: https://lore.kernel.org/lkml/20220428043850.1706973-1-...@ozlabs.ru/T/ --- arch/powerpc/Kconfig | 2 ++ arch/powerpc/kernel/exceptions-64s.S | 4 +++- arch/powerpc/lib/copyuser_64.S | 3 ++- arch/powerpc/lib/feature-fixups-test.S | 3 +-- arch/powerpc/lib/memcpy_64.S | 3 ++- 5 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 174edabb74fa..e2c7b5c1d0a6 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -158,6 +158,8 @@ config PPC select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WEAK_RELEASE_ACQUIRE + select ARCH_SUPPORTS_LTO_CLANG + select ARCH_SUPPORTS_LTO_CLANG_THIN select BINFMT_ELF select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index b66dd6f775a4..5b783bd51260 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,11 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE - bne masked_interrupt ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S index db8719a14846..d07f95eebc65 100644 --- a/arch/powerpc/lib/copyuser_64.S +++ b/arch/powerpc/lib/copyuser_64.S @@ -75,10 +75,11 @@ _GLOBAL(__copy_tofrom_user_base) * set is Power6. 
*/ test_feature = (SELFTEST_CASE == 1) + beq .Ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: diff --git a/arch/powerpc/lib/feature-fixups-test.S b/arch/powerpc/lib/feature-fixups-test.S index 480172fbd024..2751e42a9fd7 100644 --- a/arch/powerpc/lib/feature-fixups-test.S +++ b/arch/powerpc/lib/feature-fixups-test.S @@ -145,7 +145,6 @@ BEGIN_FTR_SECTION FTR_SECTION_ELSE 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b @@ -160,10 +159,10 @@ globl(ftr_fixup_test6_expected) 1: or 1,1,1 2: or 2,2,2 PPC_LCMPI r3,1 - beq 3f blt 2b b 3f b 1b + nop 3: or 1,1,1 or 2,2,2 or 3,3,3 diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S index 016c91e958d8..286c7e2d0883 100644 --- a/arch/powerpc/lib/memcpy_64.S +++ b/arch/powerpc/lib/memcpy_64.S @@ -50,10 +50,11 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY) At the time of writing the only CPU that has this combination of bits set is Power6. */ test_feature = (SELFTEST_CASE == 1) + beq .Ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: -- 2.30.2
[PATCH kernel] KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers
LoPAPR defines a guest visible IOMMU with hypercalls to use it - H_PUT_TCE/etc. These were first implemented on POWER7 where hypercalls would trap into KVM in real mode (with the MMU off) because enabling the MMU was an expensive operation; the downside of real mode is that some memory is not available and some API usage crashed the host. The problems with the real mode handlers are: 1. Occasionally they cannot complete the request so the code is copied+modified to work in virtual mode, and very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However since POWER8, there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such a 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE) whose virtual mode handler is very close to its real mode version. On POWER9 hypercalls trap straight to virtual mode so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and the MMU improvements in POWER9 and later, there is no point in duplicating the code. Passed-through 32bit devices may slow down but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demonstrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and the related code from the powernv platform. This changes the ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/Makefile | 3 - arch/powerpc/include/asm/iommu.h | 6 +- arch/powerpc/include/asm/kvm_ppc.h| 2 - arch/powerpc/include/asm/mmu_context.h| 5 - arch/powerpc/platforms/powernv/pci.h | 3 +- arch/powerpc/kernel/iommu.c | 4 +- arch/powerpc/kvm/book3s_64_vio.c | 43 ++ arch/powerpc/kvm/book3s_64_vio_hv.c | 672 -- arch/powerpc/mm/book3s64/iommu_api.c | 68 -- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 5 +- arch/powerpc/platforms/powernv/pci-ioda.c | 46 +- arch/powerpc/platforms/pseries/iommu.c| 3 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 10 - 13 files changed, 69 insertions(+), 801 deletions(-) delete mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile index 9bdfc8b50899..8e3681a86074 100644 --- a/arch/powerpc/kvm/Makefile +++ b/arch/powerpc/kvm/Makefile @@ -37,9 +37,6 @@ kvm-e500mc-objs := \ e500_emulate.o kvm-objs-$(CONFIG_KVM_E500MC) := $(kvm-e500mc-objs) -kvm-book3s_64-builtin-objs-$(CONFIG_SPAPR_TCE_IOMMU) := \ - book3s_64_vio_hv.o - kvm-pr-y := \ fpu.o \ emulate.o \ diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d7912b66c874..7e29c73e3dd4 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -51,13 +51,11 @@ struct iommu_table_ops { int (*xchg_no_kill)(struct iommu_table *tbl, long index, unsigned long *hpa, - enum dma_data_direction *direction, - bool realmode); + enum dma_data_direction *direction); void (*tce_kill)(struct iommu_table *tbl, unsigned long index, - unsigned long pages, - bool realmode); + unsigned long pages); __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); #endif diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 838d4cb460b7..44200a27371b 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -177,8 +177,6 @@ extern void 
kvmppc_setup_partition_table(struct kvm *kvm); extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm, struct kvm_create_spapr_tce_64 *args); -extern struct kvmppc_spapr_tce_table *kvmppc_find_table( - struct kvm *kvm, unsigned long liobn); #define kvmppc_ioba_validate(stt, ioba, npages) \ (iommu_tce_check_ioba((stt)->page_shift, (stt)->offset, \ (stt)->size, (ioba), (npages)) ?\ diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index b8527a74bd4d..3f25bd3e14eb 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -34,15 +34,10 @@ extern void mm_iommu_init(struct mm_struct *mm); extern void mm_iommu_cleanup(struct mm_struct *mm); extern struct mm_iommu_table_group_mem_t *mm_iommu_look
[PATCH kernel] powerpc/perf: Fix 32bit compile
The "read_bhrb" global symbol is only called from CONFIG_PPC64 code in arch/powerpc/perf/core-book3s.c but it is compiled for both 32 and 64 bit anyway (and LLVM fails to link this on 32bit). This fixes it by moving bhrb.o to the obj64 targets. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/perf/Makefile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 2f46e31c7612..4f53d0b97539 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -3,11 +3,11 @@ obj-y += callchain.o callchain_$(BITS).o perf_regs.o obj-$(CONFIG_COMPAT) += callchain_32.o -obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o +obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o obj64-$(CONFIG_PPC_PERF_CTRS) += ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ isa207-common.o power8-pmu.o power9-pmu.o \ - generic-compat-pmu.o power10-pmu.o + generic-compat-pmu.o power10-pmu.o bhrb.o obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o obj-$(CONFIG_PPC_POWERNV) += imc-pmu.o -- 2.30.2
[PATCH kernel v2] KVM: PPC: Fix TCE handling for VFIO
The LoPAPR spec defines a guest visible IOMMU with a variable page size. Currently QEMU advertises 4K, 64K, 2M and 16MB pages and a Linux VM picks the biggest (16MB). In the case of a passed-through PCI device, there is a hardware IOMMU which does not support all of the above page sizes - P8 cannot do 2MB and P9 cannot do 16MB. So for each emulated 16M IOMMU page we may create several smaller mappings ("TCEs") in the hardware IOMMU. The code wrongly uses the emulated TCE index instead of the hardware TCE index in error handling. The problem is easier to see on POWER8 with multi-level TCE tables (when only the first level is preallocated) as hash mode uses the real mode TCE hypercall handlers. The kernel starts using indirect tables when VMs get bigger than 128GB (depends on the max page order). The very first real mode hcall is going to fail with H_TOO_HARD as in real mode we cannot allocate memory for TCEs (we can in virtual mode), but on the way out the code attempts to clear hardware TCEs using emulated TCE indexes which corrupts random kernel memory because it_offset==1<<59 is subtracted from those indexes and the resulting index is out of the TCE table bounds. This fixes kvmppc_clear_tce() to use the correct TCE indexes. While at it, this fixes TCE cache invalidation which used emulated TCE indexes instead of the hardware ones. This went unnoticed as 64bit DMA is used these days and VMs map all RAM in one go and only then do DMA, and this is when the TCE cache gets populated. Potentially this could slow down mapping; however, normally 16MB emulated pages are backed by 64K hardware pages so it is one write to the "TCE Kill" register per 256 updates which is not that bad considering the size of the cache (1024 TCEs or so). 
Fixes: ca1fc489cfa0 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages") Reviewed-by: Frederic Barrat Reviewed-by: David Gibson Tested-by: David Gibson Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * reworded the first paragraph of the commit log --- arch/powerpc/kvm/book3s_64_vio.c| 45 +++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 44 ++-- 2 files changed, 45 insertions(+), 44 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index d42b4b6d4a79..85cfa6328222 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -420,13 +420,19 @@ static void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt, tbl[idx % TCES_PER_PAGE] = tce; } -static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl, - unsigned long entry) +static void kvmppc_clear_tce(struct mm_struct *mm, struct kvmppc_spapr_tce_table *stt, + struct iommu_table *tbl, unsigned long entry) { - unsigned long hpa = 0; - enum dma_data_direction dir = DMA_NONE; + unsigned long i; + unsigned long subpages = 1ULL << (stt->page_shift - tbl->it_page_shift); + unsigned long io_entry = entry << (stt->page_shift - tbl->it_page_shift); - iommu_tce_xchg_no_kill(mm, tbl, entry, , ); + for (i = 0; i < subpages; ++i) { + unsigned long hpa = 0; + enum dma_data_direction dir = DMA_NONE; + + iommu_tce_xchg_no_kill(mm, tbl, io_entry + i, , ); + } } static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm, @@ -485,6 +491,8 @@ static long kvmppc_tce_iommu_unmap(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -544,6 +552,8 @@ static long kvmppc_tce_iommu_map(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -590,10 +600,9 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, ret = kvmppc_tce_iommu_map(vcpu->kvm, stt, stit->tbl, entry, ua, dir); - iommu_tce_kill(stit->tbl, entry, 1); if (ret != H_SUCCESS) { - 
kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stt, stit->tbl, entry); goto unlock_exit; } } @@ -669,13 +678,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, */ if (get_user(tce, tces + i)) { ret = H_TOO_HARD; - goto invalidate_exit; + goto unlock_exit; } tce = be64_to_cpu(tce); if (kvmppc_tce_to_ua(vcpu->kvm, tce, )) { ret = H_PARAMETER; - goto invalidate_exit; + goto unlock_exit;
[PATCH kernel v3] powerpc/boot: Stop using RELACOUNT
So far the RELACOUNT tag from the ELF header contained the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However, a recent LLVM change [1] makes it equal-or-less than the actual number, which makes it useless. This replaces RELACOUNT in the zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support"). To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorted by relocation type. Unlike d79976918852, this does not add unaligned UADDR/UADDR64 relocations as we are unlikely to see those in practice - the zImage is small and very arch specific so there is a smaller chance that some generic feature (such as PRINTK_INDEX) triggers unaligned relocations. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 Signed-off-by: Alexey Kardashevskiy --- Changes: v3: * s/divd/divdu/ for ppc64 v2: * s/divd/divwu/ for ppc32 * updated the commit log * named all new labels instead of numbering them (s/101f/.Lcheck_for_relaent/ and so on) --- arch/powerpc/boot/crt0.S | 45 ++-- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..44544720daae 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne .Lcheck_for_relaent + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +.Lcheck_for_relaent: + cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12:addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10:/* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divwu r0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2: lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne .Lnext lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +.Lnext:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +166,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne .Lcheck_for_relaent + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +.Lcheck_for_relaent: + cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11:addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. 
*/ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdu r8,r8,r14 /* RELASZ / RELAENT */ mtctr r8 13:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,22 /* R_PPC64_RELATIVE */ - bne 3f + bne .Lnext ld r12,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ add r0,r0,r13 stdxr0,r13,r12 - addir9,r
Re: [PATCH kernel v2] powerpc/boot: Stop using RELACOUNT
On 4/6/22 14:58, Gabriel Paubert wrote: On Wed, Apr 06, 2022 at 02:01:48PM +1000, Alexey Kardashevskiy wrote: So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support"). To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike d79976918852, this does not add unaligned UADDR/UADDR64 relocations as we are likely not to see those in practice - the zImage is small and very arch specific so there is a smaller chance that some generic feature (such as PRINK_INDEX) triggers unaligned relocations. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * s/divd/divwu/ for ppc32 * updated the commit log * named all new labels instead of numbering them (s/101f/.Lcheck_for_relaent/ and so on) --- arch/powerpc/boot/crt0.S | 45 ++-- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..e9306d862f8d 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne .Lcheck_for_relaent + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +.Lcheck_for_relaent: + cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12: addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10: /* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divwu r0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2:lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne .Lnext lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +.Lnext:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +166,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne .Lcheck_for_relaent + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +.Lcheck_for_relaent: + cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11: addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. 
*/ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdr8,r8,r14 /* RELASZ / RELAENT */ While you are at it, this one should also be divdu. I really wished IBM had used explicit signed/unsigned indication in the mnemonics (divds, divdu, divws, divwu) instead. Fortunately very little assemby code uses these instructions nowadays. Fair enough, v3 is coming. Thanks, mtctr
[PATCH kernel] KVM: PPC: Fix TCE handling for VFIO
At the moment the IOMMU page size in a pseries VM is 16MB (the biggest allowed by LoPAPR) and this page size is used for an emulated TCE table. If there is a passed-through PCI device, there are hardware IOMMU tables with equal or smaller IOMMU page sizes, so one emulated IOMMU page is backed by a power-of-two number of hardware pages. The code wrongly uses the emulated TCE index instead of the hardware TCE index in error handling. The problem is easier to see on POWER8 with multi-level TCE tables (when only the first level is preallocated) as hash mode uses the real mode TCE hypercall handlers. The kernel starts using indirect tables when VMs get bigger than 128GB (depends on the max page order). The very first real mode hcall is going to fail with H_TOO_HARD as in real mode we cannot allocate memory for TCEs (we can in virtual mode), but on the way out the code attempts to clear hardware TCEs using emulated TCE indexes which corrupts random kernel memory because it_offset==1<<59 is subtracted from those indexes and the resulting index is out of the TCE table bounds. This fixes kvmppc_clear_tce() to use the correct TCE indexes. While at it, this fixes TCE cache invalidation which used emulated TCE indexes instead of the hardware ones. This went unnoticed as 64bit DMA is used these days and VMs map all RAM in one go and only then do DMA, and this is when the TCE cache gets populated. Potentially this could slow down mapping; however, normally 16MB emulated pages are backed by 64K hardware pages so it is one write to the "TCE Kill" register per 256 updates which is not that bad considering the size of the cache (1024 TCEs or so). 
Fixes: ca1fc489cfa0 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages") Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/book3s_64_vio.c| 45 +++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 44 ++-- 2 files changed, 45 insertions(+), 44 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index d42b4b6d4a79..85cfa6328222 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -420,13 +420,19 @@ static void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt, tbl[idx % TCES_PER_PAGE] = tce; } -static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl, - unsigned long entry) +static void kvmppc_clear_tce(struct mm_struct *mm, struct kvmppc_spapr_tce_table *stt, + struct iommu_table *tbl, unsigned long entry) { - unsigned long hpa = 0; - enum dma_data_direction dir = DMA_NONE; + unsigned long i; + unsigned long subpages = 1ULL << (stt->page_shift - tbl->it_page_shift); + unsigned long io_entry = entry << (stt->page_shift - tbl->it_page_shift); - iommu_tce_xchg_no_kill(mm, tbl, entry, , ); + for (i = 0; i < subpages; ++i) { + unsigned long hpa = 0; + enum dma_data_direction dir = DMA_NONE; + + iommu_tce_xchg_no_kill(mm, tbl, io_entry + i, , ); + } } static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm, @@ -485,6 +491,8 @@ static long kvmppc_tce_iommu_unmap(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -544,6 +552,8 @@ static long kvmppc_tce_iommu_map(struct kvm *kvm, break; } + iommu_tce_kill(tbl, io_entry, subpages); + return ret; } @@ -590,10 +600,9 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, ret = kvmppc_tce_iommu_map(vcpu->kvm, stt, stit->tbl, entry, ua, dir); - iommu_tce_kill(stit->tbl, entry, 1); if (ret != H_SUCCESS) { - kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stt, stit->tbl, entry); goto unlock_exit; } } @@ -669,13 +678,13 
@@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, */ if (get_user(tce, tces + i)) { ret = H_TOO_HARD; - goto invalidate_exit; + goto unlock_exit; } tce = be64_to_cpu(tce); if (kvmppc_tce_to_ua(vcpu->kvm, tce, )) { ret = H_PARAMETER; - goto invalidate_exit; + goto unlock_exit; } list_for_each_entry_lockless(stit, >iommu_tables, next) { @@ -684,19 +693,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, iommu_tce_direction(tce)); if (ret !=
[PATCH kernel v2] powerpc/boot: Stop using RELACOUNT
So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support"). To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike d79976918852, this does not add unaligned UADDR/UADDR64 relocations as we are likely not to see those in practice - the zImage is small and very arch specific so there is a smaller chance that some generic feature (such as PRINK_INDEX) triggers unaligned relocations. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * s/divd/divwu/ for ppc32 * updated the commit log * named all new labels instead of numbering them (s/101f/.Lcheck_for_relaent/ and so on) --- arch/powerpc/boot/crt0.S | 45 ++-- 1 file changed, 29 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..e9306d862f8d 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne .Lcheck_for_relaent + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +.Lcheck_for_relaent: + cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12:addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10:/* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divwu r0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2: lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne .Lnext lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +.Lnext:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +166,39 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne .Lcheck_for_relaent + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +.Lcheck_for_relaent: + cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11:addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. 
*/ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdr8,r8,r14 /* RELASZ / RELAENT */ mtctr r8 13:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,22 /* R_PPC64_RELATIVE */ - bne 3f + bne .Lnext ld r12,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ add r0,r0,r13 stdxr0,r13,r12 - addir9,r9,24 +.Lnext:
Re: [PATCH kernel] powerpc/boot: Stop using RELACOUNT
On 3/22/22 13:12, Michael Ellerman wrote: Alexey Kardashevskiy writes: So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in [2]. That's committed so you can say: in commit d79976918852 ("powerpc/64: Add UADDR64 relocation support") To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike [1], this does not add unaligned UADDR/UADDR64 relocations ^ that should be 2? Yes. as in hardly possible to see those in arch-specific zImage. I don't quite parse that. Is it true we can never see them in zImage? Maybe it's true that we don't see them in practice. I can force UADDR64 in zImage as I did for d79976918852 but zImage is lot smaller and more arch-specific than vmlinux and so far only PRINT_INDEX triggered UADDR64 in vmlinux and chances of the same thing happening in zImage are small. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 [2] https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next=d799769188529a Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/boot/crt0.S | 43 +--- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..6ea3417da3b7 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. 
@@ -75,34 +76,38 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne 111f + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +111: cmpwi r8,RELAENT Can you use named local labels for new labels you introduce? This could be .Lcheck_for_relaent: perhaps. Then I'll need to rename them all/most and add more noise to the patch which reduces chances of it being reviewed. But sure, I can rename labels. bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12: addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. */ +* which need to be initialized with addend + offset */ 10: /* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divdr0,r0,r14 /* RELASZ / RELAENT */ This is in the 32-bit portion isn't it. AFAIK 32-bit CPUs don't implement divd. I'm not sure why the toolchain allowed it. I would expect it to trap if run on real 32-bit hardware. Uff, my bad, "divw", right? I am guessing it works as zImage for 64bit BigEndian is still ELF32 which runs in 64bit CPU and I did not test on real PPC32 as I'm not quite sure how and I hoped your farm will do this for me :) mtctr r0 2:lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne 22f lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +22:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ cheers
[PATCH kernel] powerpc/boot: Stop using RELACOUNT
So far the RELACOUNT tag from the ELF header was containing the exact number of R_PPC_RELATIVE/R_PPC64_RELATIVE relocations. However the LLVM's recent change [1] make it equal-or-less than the actual number which makes it useless. This replaces RELACOUNT in zImage loader with a pair of RELASZ and RELAENT. The vmlinux relocation code is fixed in [2]. To make it more future proof, this walks through the entire .rela.dyn section instead of assuming that the section is sorter by a relocation type. Unlike [1], this does not add unaligned UADDR/UADDR64 relocations as in hardly possible to see those in arch-specific zImage. [1] https://github.com/llvm/llvm-project/commit/da0e5b885b25cf4 [2] https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next=d799769188529a Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/boot/crt0.S | 43 +--- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S index feadee18e271..6ea3417da3b7 100644 --- a/arch/powerpc/boot/crt0.S +++ b/arch/powerpc/boot/crt0.S @@ -8,7 +8,8 @@ #include "ppc_asm.h" RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 .data /* A procedure descriptor used when booting this as a COFF file. @@ -75,34 +76,38 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 11f lwz r9,4(r12) /* get RELA pointer in r9 */ b 12f -11:addis r8,r8,(-RELACOUNT)@ha - cmpwi r8,RELACOUNT@l +11:cmpwi r8,RELASZ + bne 111f + lwz r0,4(r12) /* get RELASZ value in r0 */ + b 12f +111: cmpwi r8,RELAENT bne 12f - lwz r0,4(r12) /* get RELACOUNT value in r0 */ + lwz r14,4(r12) /* get RELAENT value in r14 */ 12:addir12,r12,8 b 9b /* The relocation section contains a list of relocations. * We now do the R_PPC_RELATIVE ones, which point to words -* which need to be initialized with addend + offset. -* The R_PPC_RELATIVE ones come first and there are RELACOUNT -* of them. 
*/ +* which need to be initialized with addend + offset */ 10:/* skip relocation if we don't have both */ cmpwi r0,0 beq 3f cmpwi r9,0 beq 3f + cmpwi r14,0 + beq 3f add r9,r9,r11 /* Relocate RELA pointer */ + divdr0,r0,r14 /* RELASZ / RELAENT */ mtctr r0 2: lbz r0,4+3(r9) /* ELF32_R_INFO(reloc->r_info) */ cmpwi r0,22 /* R_PPC_RELATIVE */ - bne 3f + bne 22f lwz r12,0(r9) /* reloc->r_offset */ lwz r0,8(r9)/* reloc->r_addend */ add r0,r0,r11 stwxr0,r11,r12 - addir9,r9,12 +22:add r9,r9,r14 bdnz2b /* Do a cache flush for our text, in case the loader didn't */ @@ -160,32 +165,38 @@ p_base: mflrr10 /* r10 now points to runtime addr of p_base */ bne 10f ld r13,8(r11) /* get RELA pointer in r13 */ b 11f -10:addis r12,r12,(-RELACOUNT)@ha - cmpdi r12,RELACOUNT@l - bne 11f - ld r8,8(r11) /* get RELACOUNT value in r8 */ +10:cmpwi r12,RELASZ + bne 101f + lwz r8,8(r11) /* get RELASZ pointer in r8 */ + b 11f +101: cmpwi r12,RELAENT + bne 11f + lwz r14,8(r11) /* get RELAENT pointer in r14 */ 11:addir11,r11,16 b 9b 12: - cmpdi r13,0/* check we have both RELA and RELACOUNT */ + cmpdi r13,0/* check we have both RELA, RELASZ, RELAENT*/ cmpdi cr1,r8,0 beq 3f beq cr1,3f + cmpdi r14,0 + beq 3f /* Calcuate the runtime offset. */ subfr13,r13,r9 /* Run through the list of relocations and process the * R_PPC64_RELATIVE ones. */ + divdr8,r8,r14 /* RELASZ / RELAENT */ mtctr r8 13:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,22 /* R_PPC64_RELATIVE */ - bne 3f + bne 14f ld r12,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ add r0,r0,r13 stdxr0,r13,r12 - addir9,r9,24 +14:add r9,r9,r14 bdnz13b /* Do a cache flush for our text, in case the loader didn't */ -- 2.30.2
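As a cross-check of what the new loop computes, the same walk can be sketched in C (a sketch only: entry fields are shown as `uintptr_t` so it runs on any host, whereas the real 32-bit zImage uses 12-byte `Elf32_Rela` entries, i.e. RELAENT == 12; names here are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Dynamic tags (values from the ELF specification). */
enum { DT_NULL = 0, DT_RELA = 7, DT_RELASZ = 8, DT_RELAENT = 9 };
#define R_PPC_RELATIVE 22

/* Schematic entry types; the real zImage parses Elf32_Dyn/Elf32_Rela. */
struct dyn  { uintptr_t d_tag, d_val; };
struct rela { uintptr_t r_offset, r_info, r_addend; };

/* Apply R_PPC_RELATIVE relocations; 'base' is the runtime load offset. */
static void relocate(const struct dyn *dyn, uintptr_t base)
{
	const struct rela *rela = NULL;
	uintptr_t relasz = 0, relaent = 0;

	for (; dyn->d_tag != DT_NULL; dyn++) {
		if (dyn->d_tag == DT_RELA)
			rela = (const struct rela *)(base + dyn->d_val);
		else if (dyn->d_tag == DT_RELASZ)
			relasz = dyn->d_val;
		else if (dyn->d_tag == DT_RELAENT)
			relaent = dyn->d_val;
	}
	if (!rela || !relasz || !relaent)
		return;	/* same as the asm "skip relocation" checks */

	/* Walk the entire section instead of trusting a RELACOUNT-style
	 * "the RELATIVE ones come first" assumption. */
	for (uintptr_t i = 0; i < relasz / relaent; i++)
		if ((rela[i].r_info & 0xff) == R_PPC_RELATIVE)
			*(uintptr_t *)(base + rela[i].r_offset) =
				base + rela[i].r_addend;
}
```

The point of RELASZ/RELAENT over RELACOUNT is visible in the loop bound: the count is derived as RELASZ / RELAENT, and non-RELATIVE entries are simply skipped rather than assumed absent.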
Re: [PATCH] powerpc: Replace ppc64 DT_RELACOUNT usage with DT_RELASZ/24
On 3/11/22 15:15, Michael Ellerman wrote: Fāng-ruì Sòng writes: On Thu, Mar 10, 2022 at 11:48 AM Nick Desaulniers wrote: On Tue, Mar 8, 2022 at 9:53 PM Fangrui Song wrote: DT_RELACOUNT is an ELF dynamic tag inherited from SunOS indicating the number of R_*_RELATIVE relocations. It is optional but {ld.lld,ld.lld} -z combreloc always creates it (if non-zero) to slightly speed up glibc ld.so relocation resolving by avoiding R_*R_PPC64_RELATIVE type comparison. The tag is otherwise nearly unused in the wild and I'd recommend that software avoids using it. lld>=14.0.0 (since commit da0e5b885b25cf4ded0fa89b965dc6979ac02ca9) underestimates DT_RELACOUNT for ppc64 when position-independent long branch thunks are used. Correcting it needs non-trivial arch-specific complexity which I'd prefer to avoid. Since our code always compares the relocation type with R_PPC64_RELATIVE, replacing every occurrence of DT_RELACOUNT with DT_RELASZ/sizeof(Elf64_Rela)=DT_RELASZ/24 is a correct alternative. checking that sizeof(Elf64_Rela) == 24, yep: https://godbolt.org/z/bb4aKbo5T DT_RELASZ is in practice bounded by an uint32_t. Dividing x by 24 can be implemented as (uint32_t)(x*0xaaab) >> 4. Yep: https://godbolt.org/z/x9445ePPv Link: https://github.com/ClangBuiltLinux/linux/issues/1581 Reported-by: Nathan Chancellor Signed-off-by: Fangrui Song --- arch/powerpc/boot/crt0.S | 28 +--- arch/powerpc/kernel/reloc_64.S | 15 +-- 2 files changed, 26 insertions(+), 17 deletions(-) ... I rebased the patch on git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master and got a conflict. Seems that https://lore.kernel.org/linuxppc-dev/20220309061822.168173-1-...@ozlabs.ru/T/#u ("[PATCH kernel v4] powerpc/64: Add UADDR64 relocation support") fixed the issue. It just doesn't change arch/powerpc/boot/crt0.S Yeah sorry, I applied Alexey's v4 just before I saw your patch arrive on the list. 
If one of you can rework this so it applies on top that would be great :) I guess it is me, as now I have to add that UADDR64 thing to crt0.S as well, don't I? And are we also giving up on llvm lld having a bug with RELACOUNT?
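For reference, the division the thread proposes is exact: sizeof(Elf64_Rela) is 24 (three 8-byte fields), so the entry count is DT_RELASZ/24. The multiply-trick constant as quoted appears mangled by the archive; a standard exact form of the strength reduction, assuming RELASZ is a multiple of 24, is:

```c
#include <stdint.h>

/* Three 8-byte fields, hence DT_RELASZ / 24 entries. */
typedef struct {
	uint64_t r_offset;
	uint64_t r_info;
	int64_t  r_addend;
} Elf64_Rela_sketch;

/* Exact division of a multiple of 24 without a divide instruction:
 * x/24 == (x/8)/3, and dividing an exact multiple of 3 can be done by
 * multiplying by the inverse of 3 modulo 2^32, which is 0xAAAAAAAB. */
static uint32_t relasz_to_count(uint32_t relasz)
{
	return (relasz >> 3) * 0xAAAAAAABu;	/* requires relasz % 24 == 0 */
}
```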
[PATCH kernel v4] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour (this disables optimization for the demonstration purposes only, this also happens with -O1/-O2 when CONFIG_PRINTK_INDEX=y, for example): \#pragma GCC push_options \#pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); \#pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with RELASZ which is the size of all relocation records. 
Signed-off-by: Alexey Kardashevskiy --- Changes: v4: * fixed reloc->r_info hadling on big endian v3: * named some labels v2: * replaced RELACOUNT with RELASZ/RELAENT * removed FIXME --- arch/powerpc/kernel/reloc_64.S | 67 +- arch/powerpc/kernel/vmlinux.lds.S | 2 - arch/powerpc/tools/relocs_check.sh | 7 +--- 3 files changed, 48 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..232e4549defe 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -8,8 +8,10 @@ #include RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,29 +27,38 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* -* Scan the dynamic section for the RELA and RELACOUNT entries. +* Scan the dynamic section for the RELA, RELASZ and RELAENT entries. 
*/ li r7,0 li r8,0 -1: ld r6,0(r11) /* get tag */ +.Ltags: + ld r6,0(r11) /* get tag */ cmpdi r6,0 - beq 4f /* end of list */ + beq .Lend_of_list /* end of list */ cmpdi r6,RELA bne 2f ld r7,8(r11) /* get RELA pointer in r7 */ - b 3f -2: addis r6,r6,(-RELACOUNT)@ha - cmpdi r6,RELACOUNT@l + b 4f +2: cmpdi r6,RELASZ bne 3f - ld r8,8(r11) /* get RELACOUNT value in r8 */ -3: addir11,r11,16 - b 1b -4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ + ld r8,8(r11) /* get RELASZ value in r8 */ + b 4f +3: cmpdi r6,RELAENT + bne 4f + ld r12,8(r11) /* get RELAENT value in r12 */ +4: addir11,r11,16 + b .Ltags +.Lend_of_list: + cmpdi r7,0/* check we have RELA, RELASZ, RELAENT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq .Lout + beq cr1,.Lout + cmpdi r12,0 + beq .Lout /* * Work out linktime address of _stext and hence the @@ -62,23 +73,39 @@ _GLOBAL(relocate) /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. */ + divdr8,r8,r12 /* RELASZ / RELAENT */ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +.Lrels:ld r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,R_PPC64_RELATIVE - bne 6f + bne .Luaddr64 ld r6,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ + b .Lstore +.Luaddr64: + srdir14,r0,32 /* ELF64_R_SYM(reloc->r_info) */ + clrldi r0,r0,32 + cmpdi r0,R_PPC64_UADDR64 + bne .Lnext + ld r6,0(r9) + ld r0,16(r9) + mulli r14,r14,24 /* 24 == sizeof(elf64_sym) */ + add r14,r14,r13 /* elf64_sym[ELF64_R_SYM] */ + ld r14,8(r14) + add r0,r0,r14 +.Lstore: add r0,r0,r3 stdxr0,r7,r6 - addir9,r9,24 - bdnz5b - -6: blr +.Lnext: + add r9,r9,r12 + bdnz.Lrels +.Lout: + blr .balign 8 p_dyn: .8byte __dynamic_start - 0b p_r
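In C terms, the relocation loop the v4 patch implements does roughly the following (a schematic, not the kernel code: the image is treated as if linked at address 0, and the unaligned UADDR64 store is spelled with `memcpy` where the assembly simply uses `stdx`, which Power handles on unaligned addresses):

```c
#include <stdint.h>
#include <string.h>

#define R_PPC64_RELATIVE 22
#define R_PPC64_UADDR64  43

typedef struct { uint64_t r_offset, r_info; int64_t r_addend; } Rela;
typedef struct { uint32_t st_name; unsigned char st_info, st_other;
		 uint16_t st_shndx; uint64_t st_value, st_size; } Sym; /* 24 bytes */

#define ELF64_R_TYPE(i) ((uint32_t)(i))
#define ELF64_R_SYM(i)  ((i) >> 32)

/* 'base' is the desired final address; 'image' is the runtime start,
 * indexed by link-time offset (vmlinux is linked at 0). */
static void relocate64(const Rela *rela, uint64_t relasz, uint64_t relaent,
		       const Sym *dynsym, uint8_t *image, uint64_t base)
{
	for (uint64_t i = 0; i < relasz / relaent; i++) {
		const Rela *r = &rela[i];
		uint64_t val;

		switch (ELF64_R_TYPE(r->r_info)) {
		case R_PPC64_RELATIVE:
			val = base + r->r_addend;
			break;
		case R_PPC64_UADDR64:
			/* addend is relative to the symbol (from .dynsym) */
			val = base + dynsym[ELF64_R_SYM(r->r_info)].st_value
				   + r->r_addend;
			break;
		default:
			continue;
		}
		/* r_offset itself may be unaligned for UADDR64 */
		memcpy(image + r->r_offset, &val, sizeof(val));
	}
}
```

This also shows why UADDR64 needs `__dynamic_symtab`: unlike RELATIVE, its addend is symbol-relative, so the symbol's `st_value` (at offset 8 of the 24-byte `Elf64_Sym`, matching the `mulli r14,r14,24` / `ld r14,8(r14)` in the patch) must be added in.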
[PATCH kernel v3] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour (this disables optimization for the demonstration purposes only, this also happens with -O1/-O2 when CONFIG_PRINTK_INDEX=y, for example): \#pragma GCC push_options \#pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); \#pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with RELASZ which is the size of all relocation records. Signed-off-by: Alexey Kardashevskiy --- Changes: v3: * named some labels v2: * replaced RELACOUNT with RELASZ/RELAENT * removed FIXME --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ... 
c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x0 After relocation in __boot_from_prom: >>> p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 After relocation in __after_prom_start: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0xc1a804b0 >>> --- arch/powerpc/kernel/reloc_64.S | 67 +- arch/powerpc/kernel/vmlinux.lds.S | 2 - arch/powerpc/tools/relocs_check.sh | 7 +--- 3 files changed, 48 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..4a8eccbaebb4 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -8,8 +8,10 @@ #include RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,29 +27,38 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* -* Scan the dynamic section for the RELA and RELACOUNT entries. +* Scan the dynamic section for the RELA, RELASZ and RELAENT entries. 
*/ li r7,0 li r8,0 -1: ld r6,0(r11) /* get tag */ +.Ltags: + ld r6,0(r11) /* get tag */ cmpdi r6,0 - beq 4f /* end of list */ + beq .Lend_of_list /* end of list */ cmpdi r6,RELA bne 2f ld r7,8(r11) /* get RELA pointer in r7 */ - b 3f -2: addis r6,r6,(-RELACOUNT)@ha - cmpdi r6,RELACOUNT@l + b 4f +2: cmpdi r6,RELASZ bne 3f - ld r8,8(r11) /* get RELACOUNT value in r8 */ -3: addir11,r11,16 - b 1b -4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ + ld r8,8(r11) /* get RELASZ value in r8 */ + b 4f +3: cmpdi r6,RELAENT + bne 4f + ld r12,8(r11) /* get RELAENT value in r12 */ +4: addir11,r11,16 + b .Ltags +.Lend_of_list: + cmpdi r7,0/* check we have RELA, RELASZ, RELAENT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq .Lout + beq cr1,.Lout + cmpdi r12,0 + beq .Lout /* * Work out linktime address of _stext and hence the @@ -62,23 +73,39 @@ _GLOBAL(relocate) /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. */ + divdr8,r8,r12 /* RELASZ / RELAENT */ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +.Lrelocations: + lwa r0,8(r9)
Re: [PATCH kernel 2/3] powerpc/llvm: Sample config for LLVM LTO
On 2/12/22 11:05, Nick Desaulniers wrote: On Thu, Feb 10, 2022 at 6:31 PM Alexey Kardashevskiy wrote: The config is a copy of ppc64_defconfig with a few tweaks. This could be a smaller config to merge into ppc64_defconfig but unfortunately merger does not allow disabling already enabled options. Cool series! This is a command line to compile the kernel using the upstream llvm: make -j64 O=/home/aik/pbuild/kernels-llvm/ \ "KCFLAGS=-Wmissing-braces -Wno-array-bounds" \ ARCH=powerpc LLVM_IAS=1 ppc64le_lto_defconfig CC=clang LLVM=1 That command line invocation is kind of a mess, and many things shouldn't be necessary. O= is just noise; if folks are doing in tree builds then that doesn't add anything meaningful. KCFLAGS= why? I know -Warray-bounds is being worked on actively, but do we have instances of -Wmissing-braces at the moment? Let's get those fixed up. LLVM_IAS=1 is implied by LLVM=1. CC=clang is implied by LLVM=1 why add a new config? I think it would be simpler to just show command line invocations of `./scripts/config -e` and `make`. No new config required. I should have added "RFC" in this one as the purpose of the patch is to show what works right now and not for actual submission. Forces CONFIG_BTRFS_FS=y to make CONFIG_ZSTD_COMPRESS=y to fix: ld.lld: error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(entropy_common.o at 5332)' and 'ld-temp.o' because modules are linked with -mcmodel=large but the kernel uses -mcmodel=medium Please file a bug about this. https://github.com/ClangBuiltLinux/linux/issues Enables CONFIG_USERFAULTFD=y as otherwise vm_userfaultfd_ctx becomes 0 bytes long and clang sanitizer crashes as https://bugs.llvm.org/show_bug.cgi?id=500375 The above hyperlink doesn't work for me. Upstream llvm just moved from bugzilla to github issue tracker. aah this is the correct one: https://bugs.llvm.org/show_bug.cgi?id=50037 https://github.com/llvm/llvm-project/issues oh ok. 
Disables CONFIG_FTR_FIXUP_SELFTEST as it uses FTR_SECTION_ELSE with conditional branches. There are other places like this and the following patches address that. Disables CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT as CONFIG_HAS_LTO_CLANG depends on it being disabled. In order to avoid disabling way too many options (like DYNAMIC_FTRACE/FUNCTION_TRACER), this converts FTRACE_MCOUNT_USE_RECORDMCOUNT from def_bool to bool. Note that even with this config there is a good chance that LTO is going to fail linking vmlinux because of the "bc" problem. I think rather than adding a new config with LTO enabled and a few things turned off, it would be better to not allow LTO to be selectable if those things are turned on, until the combination of the two are fixed. Well, if I want people to try this thing, I kinda need to provide an easy way to allow LTO. The new config seemed the easiest (== the shortest) :)
[PATCH kernel 3/3] powerpc/llvm/lto: Workaround conditional branches in FTR_SECTION_ELSE
LTO invites ld/lld to optimize the output binary and this may affect the FTR alternative section if alt branches use "bc" (Branch Conditional) which only allows 16-bit offsets. This manifests in errors like: ld.lld: error: InputSection too large for range extension thunk vmlinux.o:(__ftr_alt_97+0xF0) This works around the problem by replacing "bc" and its alias(es) in FTR_SECTION_ELSE with "b" which allows 26-bit offsets. This catches the problem instructions in vmlinux.o before it is LTO'ed: $ objdump -d -M raw -j __ftr_alt_97 vmlinux.o | egrep '\S+\s*\' 30: 00 00 82 40 bc 4,eq,30 <__ftr_alt_97+0x30> f0: 00 00 82 40 bc 4,eq,f0 <__ftr_alt_97+0xf0> The change in copyuser_64.S is needed even when building default configs, the other two changes are needed if the kernel config grows. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/exceptions-64s.S | 6 +- arch/powerpc/lib/copyuser_64.S | 3 ++- arch/powerpc/lib/memcpy_64.S | 3 ++- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 55caeee37c08..b8d9a2f5f3a5 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -476,9 +476,13 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) .if IHSRR_IF_HVMODE BEGIN_FTR_SECTION bne masked_Hinterrupt + b 4f FTR_SECTION_ELSE - bne masked_interrupt + nop + nop ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) + bne masked_interrupt +4: .elseif IHSRR bne masked_Hinterrupt .else diff --git a/arch/powerpc/lib/copyuser_64.S b/arch/powerpc/lib/copyuser_64.S index db8719a14846..d07f95eebc65 100644 --- a/arch/powerpc/lib/copyuser_64.S +++ b/arch/powerpc/lib/copyuser_64.S @@ -75,10 +75,11 @@ _GLOBAL(__copy_tofrom_user_base) * set is Power6. 
*/ test_feature = (SELFTEST_CASE == 1) + beq .Ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: diff --git a/arch/powerpc/lib/memcpy_64.S b/arch/powerpc/lib/memcpy_64.S index 016c91e958d8..286c7e2d0883 100644 --- a/arch/powerpc/lib/memcpy_64.S +++ b/arch/powerpc/lib/memcpy_64.S @@ -50,10 +50,11 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_VMX_COPY) At the time of writing the only CPU that has this combination of bits set is Power6. */ test_feature = (SELFTEST_CASE == 1) + beq .ldst_aligned BEGIN_FTR_SECTION nop FTR_SECTION_ELSE - bne .Ldst_unaligned + b .Ldst_unaligned ALT_FTR_SECTION_END(CPU_FTR_UNALIGNED_LD_STD | CPU_FTR_CP_USE_DCBTZ, \ CPU_FTR_UNALIGNED_LD_STD) .Ldst_aligned: -- 2.30.2
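For the record, the offset arithmetic behind the workaround, per the standard Power ISA encodings: `bc` (B-form) carries a 14-bit signed word displacement shifted left by 2, i.e. a signed 16-bit byte offset, while `b` (I-form) carries a 24-bit signed word displacement, i.e. a signed 26-bit byte offset:

```c
#include <stdint.h>

/* bc (B-form): 14-bit signed word displacement, <<2 => 16-bit byte offset */
#define BC_MAX_FWD	((1 << 15) - 4)		/* +32764 bytes */
#define BC_MIN_BACK	(-(1 << 15))		/* -32768 bytes */

/* b (I-form): 24-bit signed word displacement, <<2 => 26-bit byte offset */
#define B_MAX_FWD	((1 << 25) - 4)		/* +33554428 bytes */
#define B_MIN_BACK	(-(1 << 25))		/* -33554432 bytes */

/* Can a conditional branch encode this byte offset directly? */
static int bc_reaches(int64_t off)
{
	return (off & 3) == 0 && off >= BC_MIN_BACK && off <= BC_MAX_FWD;
}
```

An LTO-inserted range-extension thunk easily pushes an alt-section branch past the ±32KB `bc` reach, while `b` still reaches ±32MB, which is why the patch restructures the feature sections so that only `b` needs the long reach.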
[PATCH kernel 2/3] powerpc/llvm: Sample config for LLVM LTO
The config is a copy of ppc64_defconfig with a few tweaks. This could be a smaller config to merge into ppc64_defconfig but unfortunately merger does not allow disabling already enabled options. This is a command line to compile the kernel using the upstream llvm: make -j64 O=/home/aik/pbuild/kernels-llvm/ \ "KCFLAGS=-Wmissing-braces -Wno-array-bounds" \ ARCH=powerpc LLVM_IAS=1 ppc64le_lto_defconfig CC=clang LLVM=1 Forces CONFIG_BTRFS_FS=y to make CONFIG_ZSTD_COMPRESS=y to fix: ld.lld: error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(entropy_common.o at 5332)' and 'ld-temp.o' because modules are linked with -mcmodel=large but the kernel uses -mcmodel=medium Enables CONFIG_USERFAULTFD=y as otherwise vm_userfaultfd_ctx becomes 0 bytes long and clang sanitizer crashes as https://bugs.llvm.org/show_bug.cgi?id=500375 Disables CONFIG_FTR_FIXUP_SELFTEST as it uses FTR_SECTION_ELSE with conditional branches. There are other places like this and the following patches address that. Disables CONFIG_FTRACE_MCOUNT_USE_RECORDMCOUNT as CONFIG_HAS_LTO_CLANG depends on it being disabled. In order to avoid disabling way too many options (like DYNAMIC_FTRACE/FUNCTION_TRACER), this converts FTRACE_MCOUNT_USE_RECORDMCOUNT from def_bool to bool. Note that even with this config there is a good chance that LTO is going to fail linking vmlinux because of the "bc" problem. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/Makefile| 4 + arch/powerpc/configs/ppc64_lto_defconfig | 381 +++ 2 files changed, 385 insertions(+) create mode 100644 arch/powerpc/configs/ppc64_lto_defconfig diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 5f16ac1583c5..23f1ade8abc9 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -308,6 +308,10 @@ PHONY += ppc64le_defconfig ppc64le_defconfig: $(call merge_into_defconfig,ppc64_defconfig,le) +PHONY += ppc64le_lto_defconfig +ppc64le_lto_defconfig: + $(call merge_into_defconfig,ppc64_lto_defconfig,le) + PHONY += ppc64le_guest_defconfig ppc64le_guest_defconfig: $(call merge_into_defconfig,ppc64_defconfig,le guest) diff --git a/arch/powerpc/configs/ppc64_lto_defconfig b/arch/powerpc/configs/ppc64_lto_defconfig new file mode 100644 index ..67f82b422b7d --- /dev/null +++ b/arch/powerpc/configs/ppc64_lto_defconfig @@ -0,0 +1,381 @@ +CONFIG_SYSVIPC=y +CONFIG_POSIX_MQUEUE=y +CONFIG_NO_HZ=y +CONFIG_HIGH_RES_TIMERS=y +CONFIG_TASKSTATS=y +CONFIG_TASK_DELAY_ACCT=y +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y +CONFIG_LOG_BUF_SHIFT=18 +CONFIG_LOG_CPU_MAX_BUF_SHIFT=13 +CONFIG_NUMA_BALANCING=y +CONFIG_CGROUPS=y +CONFIG_MEMCG=y +CONFIG_CGROUP_SCHED=y +CONFIG_CGROUP_FREEZER=y +CONFIG_CPUSETS=y +CONFIG_CGROUP_DEVICE=y +CONFIG_CGROUP_CPUACCT=y +CONFIG_CGROUP_PERF=y +CONFIG_CGROUP_BPF=y +CONFIG_BLK_DEV_INITRD=y +CONFIG_BPF_SYSCALL=y +# CONFIG_COMPAT_BRK is not set +CONFIG_PROFILING=y +CONFIG_PPC64=y +CONFIG_NR_CPUS=2048 +CONFIG_PPC_SPLPAR=y +CONFIG_DTL=y +CONFIG_PPC_SMLPAR=y +CONFIG_IBMEBUS=y +CONFIG_PPC_SVM=y +CONFIG_PPC_MAPLE=y +CONFIG_PPC_PASEMI=y +CONFIG_PPC_PASEMI_IOMMU=y +CONFIG_PPC_PS3=y +CONFIG_PS3_DISK=m +CONFIG_PS3_ROM=m +CONFIG_PS3_FLASH=m +CONFIG_PS3_LPM=m +CONFIG_PPC_IBM_CELL_BLADE=y +CONFIG_RTAS_FLASH=m +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_GOV_POWERSAVE=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y +CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y +CONFIG_CPU_FREQ_PMAC64=y +CONFIG_HZ_100=y 
+CONFIG_PPC_TRANSACTIONAL_MEM=y +CONFIG_KEXEC=y +CONFIG_KEXEC_FILE=y +CONFIG_CRASH_DUMP=y +CONFIG_FA_DUMP=y +CONFIG_IRQ_ALL_CPUS=y +CONFIG_SCHED_SMT=y +CONFIG_HOTPLUG_PCI=y +CONFIG_HOTPLUG_PCI_RPA=m +CONFIG_HOTPLUG_PCI_RPA_DLPAR=m +CONFIG_PCCARD=y +CONFIG_ELECTRA_CF=y +CONFIG_VIRTUALIZATION=y +CONFIG_KVM_BOOK3S_64=m +CONFIG_KVM_BOOK3S_64_HV=m +CONFIG_VHOST_NET=m +CONFIG_KPROBES=y +CONFIG_JUMP_LABEL=y +CONFIG_MODULES=y +CONFIG_MODULE_UNLOAD=y +CONFIG_MODVERSIONS=y +CONFIG_MODULE_SRCVERSION_ALL=y +CONFIG_PARTITION_ADVANCED=y +CONFIG_BINFMT_MISC=m +CONFIG_MEMORY_HOTPLUG=y +CONFIG_MEMORY_HOTREMOVE=y +CONFIG_KSM=y +CONFIG_TRANSPARENT_HUGEPAGE=y +CONFIG_NET=y +CONFIG_PACKET=y +CONFIG_UNIX=y +CONFIG_XFRM_USER=m +CONFIG_NET_KEY=m +CONFIG_INET=y +CONFIG_IP_MULTICAST=y +CONFIG_IP_PNP=y +CONFIG_IP_PNP_DHCP=y +CONFIG_IP_PNP_BOOTP=y +CONFIG_NET_IPIP=y +CONFIG_SYN_COOKIES=y +CONFIG_INET_AH=m +CONFIG_INET_ESP=m +CONFIG_INET_IPCOMP=m +CONFIG_IPV6=y +CONFIG_NETFILTER=y +# CONFIG_NETFILTER_ADVANCED is not set +CONFIG_BRIDGE=m +CONFIG_NET_SCHED=y +CONFIG_NET_CLS_BPF=m +CONFIG_NET_CLS_ACT=y +CONFIG_NET_ACT_BPF=m +CONFIG_BPF_JIT=y +CONFIG_DEVTMPFS=y +CONFIG_DEVTMPFS_MOUNT=y +CONFIG_BLK_DEV_FD=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_BLK_DEV_NBD=m +CONFIG_BLK_DEV_RAM=y +CONFIG_BLK_DEV_RAM_SIZE=65536 +CONFIG_VIRTIO_BLK=m +CONFIG_BLK_DEV_SD=y +CONFIG_CHR_DEV_ST=m +CONFIG_BLK_DEV_SR=y +CONFIG_CHR_DEV_SG=y +CONFIG_SCSI_CONSTANTS=y +CONFIG_SCSI_FC_ATTRS=y +CONFI
[PATCH kernel 1/3] powerpc/64: Allow LLVM LTO builds
The upstream LLVM supports now LTO on PPC, enable it. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/Kconfig | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b779603978e1..91c14f83 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -153,6 +153,8 @@ config PPC select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WEAK_RELEASE_ACQUIRE + select ARCH_SUPPORTS_LTO_CLANG if PPC64 + select ARCH_SUPPORTS_LTO_CLANG_THIN if PPC64 select BINFMT_ELF select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS -- 2.30.2
[PATCH kernel 0/3] powerpc/llvm/lto: Enable CONFIG_LTO_CLANG_THIN=y
This is based on sha1 1b43a74f255c Michael Ellerman "Automatic merge of 'master' into merge (2022-02-01 10:41)". Please comment. Thanks. Alexey Kardashevskiy (3): powerpc/64: Allow LLVM LTO builds powerpc/llvm: Sample config for LLVM LTO powerpc/llvm/lto: Workaround conditional branches in FTR_SECTION_ELSE arch/powerpc/Makefile| 4 + arch/powerpc/Kconfig | 2 + arch/powerpc/configs/ppc64_lto_defconfig | 381 +++ arch/powerpc/kernel/exceptions-64s.S | 6 +- arch/powerpc/lib/copyuser_64.S | 3 +- arch/powerpc/lib/memcpy_64.S | 3 +- 6 files changed, 396 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/configs/ppc64_lto_defconfig -- 2.30.2
[PATCH kernel v2] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour (this disables optimization for the demonstration purposes only, this also happens with -O1/-O2 when CONFIG_PRINTK_INDEX=y, for example): \#pragma GCC push_options \#pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); \#pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with RELASZ which is the size of all relocation records. Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * replaced RELACOUNT with RELASZ/RELAENT * removed FIXME --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ... 
c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x0 After relocation in __boot_from_prom: >>> p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 After relocation in __after_prom_start: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0xc1a804b0 >>> --- arch/powerpc/kernel/reloc_64.S | 56 -- arch/powerpc/kernel/vmlinux.lds.S | 2 -- arch/powerpc/tools/relocs_check.sh | 7 +--- 3 files changed, 39 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..f7dcc25e93d0 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -8,8 +8,10 @@ #include RELA = 7 -RELACOUNT = 0x6ff9 +RELASZ = 8 +RELAENT = 9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,29 +27,36 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* -* Scan the dynamic section for the RELA and RELACOUNT entries. +* Scan the dynamic section for the RELA, RELASZ and RELAENT entries. 
*/ li r7,0 li r8,0 1: ld r6,0(r11) /* get tag */ cmpdi r6,0 - beq 4f /* end of list */ + beq 5f /* end of list */ cmpdi r6,RELA bne 2f ld r7,8(r11) /* get RELA pointer in r7 */ - b 3f -2: addis r6,r6,(-RELACOUNT)@ha - cmpdi r6,RELACOUNT@l + b 4f +2: cmpdi r6,RELASZ bne 3f - ld r8,8(r11) /* get RELACOUNT value in r8 */ -3: addir11,r11,16 + ld r8,8(r11) /* get RELASZ value in r8 */ + b 4f +3: cmpdi r6,RELAENT + bne 4f + ld r12,8(r11) /* get RELAENT value in r12 */ +4: addir11,r11,16 b 1b -4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ +5: cmpdi r7,0/* check we have RELA, RELASZ, RELAENT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq 10f + beq cr1,10f + cmpdi r12,0 + beq 10f /* * Work out linktime address of _stext and hence the @@ -62,23 +71,34 @@ _GLOBAL(relocate) /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. */ + divdr8,r8,r12 /* RELASZ / RELAENT */ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +5: lwa r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,R_PPC64_RELATIVE - bne 6f + bne 7f ld r6,0(r9)/
Re: [PATCH 2/2] KVM: selftests: Add support for ppc64le
VM_MODE_P36V48_16K, VM_MODE_P36V48_64K, VM_MODE_P36V47_16K, + VM_MODE_P51V52_64K, NUM_VM_MODES, }; @@ -87,6 +88,12 @@ extern enum vm_guest_mode vm_mode_default; #define MIN_PAGE_SHIFT12U #define ptes_per_page(page_size) ((page_size) / 8) +#elif defined(__powerpc__) + +#define VM_MODE_DEFAULTVM_MODE_P51V52_64K +#define MIN_PAGE_SHIFT 16U +#define ptes_per_page(page_size) ((page_size) / 8) + #endif #define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT) diff --git a/tools/testing/selftests/kvm/include/ppc64le/processor.h b/tools/testing/selftests/kvm/include/ppc64le/processor.h new file mode 100644 index ..fbc1332b2b80 --- /dev/null +++ b/tools/testing/selftests/kvm/include/ppc64le/processor.h @@ -0,0 +1,43 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * powerpc processor specific defines + */ +#ifndef SELFTEST_KVM_PROCESSOR_H +#define SELFTEST_KVM_PROCESSOR_H + +#define PPC_BIT(x) (1ULL << (63 - x)) Put the "x" in braces. + +#define MSR_SF PPC_BIT(0) +#define MSR_IR PPC_BIT(58) +#define MSR_DR PPC_BIT(59) +#define MSR_LE PPC_BIT(63) + +#define LPCR_UPRT PPC_BIT(41) +#define LPCR_EVIRT PPC_BIT(42) +#define LPCR_HRPPC_BIT(43) +#define LPCR_GTSE PPC_BIT(53) + +#define PATB_GRPPC_BIT(0) + +#define PTE_VALID PPC_BIT(0) +#define PTE_LEAF PPC_BIT(1) +#define PTE_RPPC_BIT(55) +#define PTE_CPPC_BIT(56) +#define PTE_RC (PTE_R | PTE_C) +#define PTE_READ 0x4 +#define PTE_WRITE 0x2 +#define PTE_EXEC 0x1 +#define PTE_RWX (PTE_READ|PTE_WRITE|PTE_EXEC) + +extern uint64_t hcall(uint64_t nr, ...); + +static inline uint32_t mfpvr(void) +{ + uint32_t pvr; + + asm ("mfpvr %0" +: "=r"(pvr)); + return pvr; +} + +#endif diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index c22a17aac6b0..cc5247c2cfeb 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -205,6 +205,7 @@ const char *vm_guest_mode_string(uint32_t i) [VM_MODE_P36V48_16K]= "PA-bits:36, VA-bits:48, 16K pages", 
[VM_MODE_P36V48_64K]= "PA-bits:36, VA-bits:48, 64K pages", [VM_MODE_P36V47_16K]= "PA-bits:36, VA-bits:47, 16K pages", + [VM_MODE_P51V52_64K]= "PA-bits:51, VA-bits:52, 64K pages", }; _Static_assert(sizeof(strings)/sizeof(char *) == NUM_VM_MODES, "Missing new mode strings?"); @@ -230,6 +231,7 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = { [VM_MODE_P36V48_16K]= { 36, 48, 0x4000, 14 }, [VM_MODE_P36V48_64K]= { 36, 48, 0x1, 16 }, [VM_MODE_P36V47_16K]= { 36, 47, 0x4000, 14 }, + [VM_MODE_P51V52_64K]= { 51, 52, 0x1, 16 }, }; _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct vm_guest_mode_params) == NUM_VM_MODES, "Missing new mode params?"); @@ -331,6 +333,9 @@ struct kvm_vm *vm_create(enum vm_guest_mode mode, uint64_t phy_pages, int perm) case VM_MODE_P44V64_4K: vm->pgtable_levels = 5; break; + case VM_MODE_P51V52_64K: + vm->pgtable_levels = 4; + break; default: TEST_FAIL("Unknown guest mode, mode: 0x%x", mode); } diff --git a/tools/testing/selftests/kvm/lib/powerpc/hcall.S b/tools/testing/selftests/kvm/lib/powerpc/hcall.S new file mode 100644 index ..a78b88f3b207 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/powerpc/hcall.S @@ -0,0 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +.globl hcall; + +hcall: + sc 1 + blr diff --git a/tools/testing/selftests/kvm/lib/powerpc/processor.c b/tools/testing/selftests/kvm/lib/powerpc/processor.c new file mode 100644 index ..2ffd5423a968 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/powerpc/processor.c @@ -0,0 +1,343 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * KVM selftest powerpc library code + * + * Copyright (C) 2021, IBM Corp. 2022? Otherwise looks good and works well and we have another test for instruction emulation on top of this which highlighted a bug so this is useful stuff. 
Reviewed-by: Alexey Kardashevskiy + */ + +#define _GNU_SOURCE +//#define DEBUG + +#include "kvm_util.h" +#include "../kvm_util_internal.h" +#include "processor.h" + +/* + * 2^(12+PRTS) = Process table size + * + * But the hardware doesn't seem to care, so 0 for now. + */ +#define PRTS 0 +#define RTS ((0x5UL << 5) | (0x2UL << 61)) /* 2^(RTS+31) = 2^52 */ +#define RPDS 0xd +#define RPDB_MASK 0x0f00UL +#define RPN_MASK 0x01fff000UL + +#define MIN_FRAME_SZ 32 + +static const int radix_64k_index_
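The "put the x in braces" review comment is the usual macro-hygiene point: without parentheses around the argument, an expression binds to the subtraction incorrectly. A minimal illustration (the second macro is the fixed form being asked for):

```c
#include <stdint.h>

#define PPC_BIT_NOBRACES(x)	(1ULL << (63 - x))	/* as posted */
#define PPC_BIT(x)		(1ULL << (63 - (x)))	/* with braces */

/* With an expression argument the two diverge:
 * PPC_BIT_NOBRACES(3 - 1) expands to 1ULL << (63 - 3 - 1) == 1ULL << 59,
 * while the intended value is 1ULL << (63 - 2) == 1ULL << 61. */
static inline uint64_t ppc_bit(unsigned int n)
{
	return PPC_BIT(n);
}
```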
Re: [PATCH kernel] powerpc/64: Add UADDR64 relocation support
On 1/31/22 17:38, Christophe Leroy wrote: Le 31/01/2022 à 05:14, Alexey Kardashevskiy a écrit : When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour: According to relocs_check.sh, this is expected to happen only with binutils < 2.19. Today minimum binutils version is 2.23 Have you observed this problem with newer version of binutils ? Oh yeah. 2.36.1. And the toolchain folks explained internally that this is correct behavior and this was a ticking bomb which exploded now and the kernel has to deal with it. #pragma GCC push_options #pragma GCC optimize ("O0") AFAIU Linux Kernel is always built with O2 Correct. Even O1 hides this. Have you observed the problem with O2 ? Yes, I see it once I enable CONFIG_PRINTK_INDEX (this is how it was spotted with my particular config; there is still a fair chance that this config option does not cause UADDR64 always) but I did not debug with it enabled as pretty much every single __func__ passed to printk caused unaligned relocation (tens of thousands). Note that this particular case can be fixed by removing __packed from "struct pi_entry" (== re-arm the bomb). Thanks, struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); #pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. This adds a workaround for the number of relocations as the DT_RELACOUNT ELF Dynamic Array Tag does not include relocations other than R_PPC64_RELATIVE.
This instead iterates over the entire .rela.dyn section. Signed-off-by: Alexey Kardashevskiy --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ... c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 p *(unsigned long *) 0x1e804dc $2 = 0x0 After: p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 --- arch/powerpc/kernel/reloc_64.S | 47 +- arch/powerpc/kernel/vmlinux.lds.S | 3 +- arch/powerpc/tools/relocs_check.sh | 6 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..a91175723d9d 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -10,6 +10,7 @@ RELA = 7 RELACOUNT = 0x6ff9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,6 +26,8 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* * Scan the dynamic section for the RELA and RELACOUNT entries. @@ -46,8 +49,8 @@ _GLOBAL(relocate) b 1b 4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq 9f + beq cr1,9f /* * Work out linktime address of _stext and hence the @@ -60,25 +63,55 @@ _GLOBAL(relocate) subfr10,r7,r10 subfr3,r10,r3 /* final_offset */ + /* +* FIXME +* Here r8 is a number of relocations in .rela.dyn. +* When ld issues UADDR64 relocations, they end up at the end +* of the .rela.dyn section. However RELACOUNT does not include +* them so the loop below is going to finish after the last +* R_PPC64_RELATIVE as they normally go first. 
+* Work out the size of .rela.dyn at compile time. +*/ + ld r8,(p_rela_end - 0b)(r12) + ld r18,(p_rela - 0b)(r12) + sub r8,r8,r18 + li r18,24 /* 24 == sizeof(elf64_rela) */ + divdr8,r8,r18 + /* * Run through the list of relocations and p
[PATCH kernel] powerpc/64: Add UADDR64 relocation support
When ld detects unaligned relocations, it emits R_PPC64_UADDR64 relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are detected by arch/powerpc/tools/relocs_check.sh and expected not to work. Below is a simple chunk to trigger this behaviour: #pragma GCC push_options #pragma GCC optimize ("O0") struct entry { const char *file; int line; } __attribute__((packed)); static const struct entry e1 = { .file = __FILE__, .line = __LINE__ }; static const struct entry e2 = { .file = __FILE__, .line = __LINE__ }; ... prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr()); prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file); #pragma GCC pop_options This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab from the 32bit which supports more relocation types already. This adds a workaround for the number of relocations as the DT_RELACOUNT ELF Dynamic Array Tag does not include relocations other than R_PPC64_RELATIVE. This instead iterates over the entire .rela.dyn section. Signed-off-by: Alexey Kardashevskiy --- Tested via qemu gdb stub (the kernel is loaded at 0x40). Disasm: c1a804d0 : c1a804d0: b0 04 a8 01 .long 0x1a804b0 c1a804d0: R_PPC64_RELATIVE *ABS*-0x3e57fb50 c1a804d4: 00 00 00 c0 lfs f0,0(0) c1a804d8: fa 08 00 00 .long 0x8fa c1a804dc : ...
c1a804dc: R_PPC64_UADDR64 .rodata+0x4b0 Before relocation: >>> p *(unsigned long *) 0x1e804d0 $1 = 0xc1a804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x0 After: >>> p *(unsigned long *) 0x1e804d0 $1 = 0x1e804b0 >>> p *(unsigned long *) 0x1e804dc $2 = 0x1e804b0 --- arch/powerpc/kernel/reloc_64.S | 47 +- arch/powerpc/kernel/vmlinux.lds.S | 3 +- arch/powerpc/tools/relocs_check.sh | 6 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/kernel/reloc_64.S b/arch/powerpc/kernel/reloc_64.S index 02d4719bf43a..a91175723d9d 100644 --- a/arch/powerpc/kernel/reloc_64.S +++ b/arch/powerpc/kernel/reloc_64.S @@ -10,6 +10,7 @@ RELA = 7 RELACOUNT = 0x6ff9 R_PPC64_RELATIVE = 22 +R_PPC64_UADDR64 = 43 /* * r3 = desired final address of kernel @@ -25,6 +26,8 @@ _GLOBAL(relocate) add r9,r9,r12 /* r9 has runtime addr of .rela.dyn section */ ld r10,(p_st - 0b)(r12) add r10,r10,r12 /* r10 has runtime addr of _stext */ + ld r13,(p_sym - 0b)(r12) + add r13,r13,r12 /* r13 has runtime addr of .dynsym */ /* * Scan the dynamic section for the RELA and RELACOUNT entries. @@ -46,8 +49,8 @@ _GLOBAL(relocate) b 1b 4: cmpdi r7,0/* check we have both RELA and RELACOUNT */ cmpdi cr1,r8,0 - beq 6f - beq cr1,6f + beq 9f + beq cr1,9f /* * Work out linktime address of _stext and hence the @@ -60,25 +63,55 @@ _GLOBAL(relocate) subfr10,r7,r10 subfr3,r10,r3 /* final_offset */ + /* +* FIXME +* Here r8 is a number of relocations in .rela.dyn. +* When ld issues UADDR64 relocations, they end up at the end +* of the .rela.dyn section. However RELACOUNT does not include +* them so the loop below is going to finish after the last +* R_PPC64_RELATIVE as they normally go first. +* Work out the size of .rela.dyn at compile time. +*/ + ld r8,(p_rela_end - 0b)(r12) + ld r18,(p_rela - 0b)(r12) + sub r8,r8,r18 + li r18,24 /* 24 == sizeof(elf64_rela) */ + divdr8,r8,r18 + /* * Run through the list of relocations and process the -* R_PPC64_RELATIVE ones. +* R_PPC64_RELATIVE and R_PPC64_UADDR64 ones. 
*/ mtctr r8 -5: ld r0,8(9) /* ELF64_R_TYPE(reloc->r_info) */ +5: lwa r0,8(r9)/* ELF64_R_TYPE(reloc->r_info) */ cmpdi r0,R_PPC64_RELATIVE bne 6f ld r6,0(r9)/* reloc->r_offset */ ld r0,16(r9) /* reloc->r_addend */ - add r0,r0,r3 + b 7f + +6: cmpdi r0,R_PPC64_UADDR64 + bne 8f + ld r6,0(r9) + ld r0,16(r9) + lwa r14,12(r9) /* ELF64_R_SYM(reloc->r_info) */ + mulli r14,r14,24 /* 24 == sizeof(elf64_sym) */ + add r14,r14,r13 /* elf64_sym[ELF64_R_SYM] */ + ld r14,8(r14) + add r0,r0,r14 + +7: add r0,r0,r3 stdxr0,r7,r6 - addir9,r9,24 + +8: addir9,r9,24 bdnz5b -6: blr +9: blr .balign 8 p_dyn: .8byte __dynamic_start - 0b p_rela:.8byte __rela
[PATCH kernel v5] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
At the moment KVM on PPC creates 4 types of entries under the kvm debugfs: 1) "%pid-%fd" per KVM instance (for all platforms); 2) "vm%pid" (for PPC Book3s HV KVM); 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM); 4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS). The problem with this is that multiple VMs per process are not allowed for 2) and 3), which makes it possible for userspace to trigger errors by creating duplicated debugfs entries. This merges all these into 1). This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs(). This defines 2 hooks in kvmppc_ops that allow specific KVM implementations to add necessary entries; this also adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for. This makes use of the already existing kvm_arch_create_vcpu_debugfs() on PPC. This removes the no-longer-used debugfs_dir pointers from the PPC kvm_arch structs. This stops removing vcpu entries as, once created, vcpus stay around for the entire life of a VM and are removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories").
Suggested-by: Fabiano Rosas Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * fixed e500mc2 v4: * added "kvm-xive-%p" v3: * reworked commit log, especially, the bit about removing vcpus v2: * handled powerpc-booke * s/kvm/vm/ in arch hooks --- arch/powerpc/include/asm/kvm_host.h| 6 ++--- arch/powerpc/include/asm/kvm_ppc.h | 2 ++ arch/powerpc/kvm/timing.h | 12 +- arch/powerpc/kvm/book3s_64_mmu_hv.c| 2 +- arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +- arch/powerpc/kvm/book3s_hv.c | 31 ++ arch/powerpc/kvm/book3s_xics.c | 13 ++- arch/powerpc/kvm/book3s_xive.c | 13 ++- arch/powerpc/kvm/book3s_xive_native.c | 13 ++- arch/powerpc/kvm/e500.c| 1 + arch/powerpc/kvm/e500mc.c | 1 + arch/powerpc/kvm/powerpc.c | 16 ++--- arch/powerpc/kvm/timing.c | 21 + 13 files changed, 51 insertions(+), 82 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 17263276189e..f5e14fa683f4 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -26,6 +26,8 @@ #include #include +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS + #define KVM_MAX_VCPUS NR_CPUS #define KVM_MAX_VCORES NR_CPUS @@ -295,7 +297,6 @@ struct kvm_arch { bool dawr1_enabled; pgd_t *pgtable; u64 process_table; - struct dentry *debugfs_dir; struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */ #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */ #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE @@ -673,7 +674,6 @@ struct kvm_vcpu_arch { u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES]; u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES]; u64 timing_last_exit; - struct dentry *debugfs_exit_timing; #endif #ifdef CONFIG_PPC_BOOK3S @@ -829,8 +829,6 @@ struct kvm_vcpu_arch { struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */ struct kvmhv_tb_accumulator guest_time; /* guest execution */ struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */ - - struct dentry *debugfs_dir; #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */ }; diff --git 
a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 33db83b82fbd..d2b192dea0d2 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -316,6 +316,8 @@ struct kvmppc_ops { int (*svm_off)(struct kvm *kvm); int (*enable_dawr1)(struct kvm *kvm); bool (*hash_v3_possible)(void); + int (*create_vm_debugfs)(struct kvm *kvm); + int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry); }; extern struct kvmppc_ops *kvmppc_hv_ops; diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h index feef7885ba82..45817ab82bb4 100644 --- a/arch/powerpc/kvm/timing.h +++ b/arch/powerpc/kvm/timing.h @@ -14,8 +14,8 @@ #ifdef CONFIG_KVM_EXIT_TIMING void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu); void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu); -void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id); -void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu); +int kvmppc_create_vcpu_debugfs_e500(struct kvm_vcpu *vcpu, + struct dentry *debugfs_dentry); static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type) { @@ -26,9 +26,11 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type) /* if exit timing is not configured there is no need to build the c file */ static inline void kvmppc_
Re: [PATCH v3 5/6] KVM: PPC: mmio: Return to guest after emulation failure
On 1/10/22 18:36, Nicholas Piggin wrote: Excerpts from Fabiano Rosas's message of January 8, 2022 7:00 am: If MMIO emulation fails we don't want to crash the whole guest by returning to userspace. The original commit bbf45ba57eae ("KVM: ppc: PowerPC 440 KVM implementation") added a todo: /* XXX Deliver Program interrupt to guest. */ and later the commit d69614a295ae ("KVM: PPC: Separate loadstore emulation from priv emulation") added the Program interrupt injection but in another file, so I'm assuming it was missed that this block needed to be altered. Signed-off-by: Fabiano Rosas Reviewed-by: Alexey Kardashevskiy --- arch/powerpc/kvm/powerpc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 6daeea4a7de1..56b0faab7a5f 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -309,7 +309,7 @@ int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu) kvmppc_get_last_inst(vcpu, INST_GENERIC, _inst); kvmppc_core_queue_program(vcpu, 0); pr_info("%s: emulation failed (%08x)\n", __func__, last_inst); - r = RESUME_HOST; + r = RESUME_GUEST; So at this point can the pr_info just go away? I wonder if this shouldn't be a DSI rather than a program check. DSI with DSISR[37] looks a bit more expected. Not that Linux probably does much with it but at least it would give a SIGBUS rather than SIGILL. It does not look like it is more expected to me: it is not about wrong memory attributes, it is the instruction itself which cannot execute. DSISR[37]: Set to 1 if the access is due to a lq, stq, lwat, ldat, lbarx, lharx, lwarx, ldarx, lqarx, stwat, stdat, stbcx., sthcx., stwcx., stdcx., or stqcx. instruction that addresses storage that is Write Through Required or Caching Inhibited; or if the access is due to a copy or paste.
instruction that addresses storage that is Caching Inhibited; or if the access is due to a lwat, ldat, stwat, or stdat instruction that addresses storage that is Guarded; otherwise set to 0.
Re: [PATCH v3 4/6] KVM: PPC: mmio: Queue interrupt at kvmppc_emulate_mmio
On 08/01/2022 08:00, Fabiano Rosas wrote: If MMIO emulation fails, we queue a Program interrupt to the guest. Move that line up into kvmppc_emulate_mmio, which is where we set RESUME_GUEST/HOST. This allows the removal of the 'advance' variable. No functional change, just separation of responsibilities. Signed-off-by: Fabiano Rosas Reviewed-by: Alexey Kardashevskiy --- arch/powerpc/kvm/emulate_loadstore.c | 8 +--- arch/powerpc/kvm/powerpc.c | 2 +- 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kvm/emulate_loadstore.c b/arch/powerpc/kvm/emulate_loadstore.c index 48272a9b9c30..4dec920fe4c9 100644 --- a/arch/powerpc/kvm/emulate_loadstore.c +++ b/arch/powerpc/kvm/emulate_loadstore.c @@ -73,7 +73,6 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) { u32 inst; enum emulation_result emulated = EMULATE_FAIL; - int advance = 1; struct instruction_op op; /* this default type might be overwritten by subcategories */ @@ -355,15 +354,10 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu) } } - if (emulated == EMULATE_FAIL) { - advance = 0; - kvmppc_core_queue_program(vcpu, 0); - } - trace_kvm_ppc_instr(inst, kvmppc_get_pc(vcpu), emulated); /* Advance past emulated instruction. */ - if (advance) + if (emulated != EMULATE_FAIL) kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4); return emulated; diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 4d7d0d080232..6daeea4a7de1 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -307,7 +307,7 @@ int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu) u32 last_inst; kvmppc_get_last_inst(vcpu, INST_GENERIC, _inst); - /* XXX Deliver Program interrupt to guest. */ + kvmppc_core_queue_program(vcpu, 0); pr_info("%s: emulation failed (%08x)\n", __func__, last_inst); r = RESUME_HOST; break;
Re: [PATCH v2 6/7] KVM: PPC: mmio: Return to guest after emulation failure
On 07/01/2022 07:03, Fabiano Rosas wrote: If MMIO emulation fails we don't want to crash the whole guest by returning to userspace. The original commit bbf45ba57eae ("KVM: ppc: PowerPC 440 KVM implementation") added a todo: /* XXX Deliver Program interrupt to guest. */ and later the commit d69614a295ae ("KVM: PPC: Separate loadstore emulation from priv emulation") added the Program interrupt injection but in another file, so I'm assuming it was missed that this block needed to be altered. Signed-off-by: Fabiano Rosas Looks right. Reviewed-by: Alexey Kardashevskiy but this means if I want to keep debugging those kvm selftests in comfort, I'll have to have some exception handlers in the vm as otherwise the failing $pc is lost after this change :) --- arch/powerpc/kvm/powerpc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index a2e78229d645..50e08635e18a 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -309,7 +309,7 @@ int kvmppc_emulate_mmio(struct kvm_vcpu *vcpu) kvmppc_get_last_inst(vcpu, INST_GENERIC, _inst); kvmppc_core_queue_program(vcpu, 0); pr_info("%s: emulation failed (%08x)\n", __func__, last_inst); - r = RESUME_HOST; + r = RESUME_GUEST; break; } default: -- Alexey
Re: [PATCH v2 3/7] KVM: PPC: Fix mmio length message
On 07/01/2022 07:03, Fabiano Rosas wrote: We check against 'bytes' but print 'run->mmio.len' which at that point has an old value. e.g. 16-byte load: before: __kvmppc_handle_load: bad MMIO length: 8 now: __kvmppc_handle_load: bad MMIO length: 16 Signed-off-by: Fabiano Rosas --- arch/powerpc/kvm/powerpc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 92e552ab5a77..0b0818d032e1 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -1246,7 +1246,7 @@ static int __kvmppc_handle_load(struct kvm_vcpu *vcpu, if (bytes > sizeof(run->mmio.data)) { printk(KERN_ERR "%s: bad MMIO length: %d\n", __func__, - run->mmio.len); + bytes); "return EMULATE_FAIL;" here and below as there is really no point in trashing kvm_run::mmio (not much harm too but still) and this code does not handle more than 8 bytes anyway. } run->mmio.phys_addr = vcpu->arch.paddr_accessed; @@ -1335,7 +1335,7 @@ int kvmppc_handle_store(struct kvm_vcpu *vcpu, if (bytes > sizeof(run->mmio.data)) { printk(KERN_ERR "%s: bad MMIO length: %d\n", __func__, - run->mmio.len); + bytes); } run->mmio.phys_addr = vcpu->arch.paddr_accessed; -- Alexey
Re: [PATCH 2/3] KVM: PPC: Fix vmx/vsx mixup in mmio emulation
On 28/12/2021 04:28, Fabiano Rosas wrote: Nicholas Piggin writes: Excerpts from Fabiano Rosas's message of December 24, 2021 7:15 am: The MMIO emulation code for vector instructions is duplicated between VSX and VMX. When emulating VMX we should check the VMX copy size instead of the VSX one. Fixes: acc9eb9305fe ("KVM: PPC: Reimplement LOAD_VMX/STORE_VMX instruction ...") Signed-off-by: Fabiano Rosas Good catch. AFAIKS handle_vmx_store needs the same treatment? If you agree then Half the bug now, half the bug next year... haha I'll send a v2. aside: All this duplication is kind of annoying. I'm looking into what it would take to have quadword instruction emulation here as well (Alexey caught a bug with syzkaller) and the code would be really similar. I see that x86 has a more generic implementation that maybe we could take advantage of. See "f78146b0f923 (KVM: Fix page-crossing MMIO)" Uff. My head exploded with vsx/vmx/vec :) But this seems to have fixed "lvx" (which is VMX, right?). Tested with: https://github.com/aik/linux/commits/my_kvm_tests -- Alexey
[PATCH llvm 6/6] powerpc/mm/book3s64/hash: Switch pre 2.06 tlbiel to .long
The llvm integrated assembler does not recognise the ISA 2.05 tlbiel version. Work around it by switching to .long when an old arch level is detected. Signed-off-by: Daniel Axtens [aik: did "Eventually do this more smartly"] Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/ppc-opcode.h | 2 ++ arch/powerpc/mm/book3s64/hash_native.c | 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 9fe3223e7820..efad07081cc0 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -394,6 +394,7 @@ (0x7c000264 | ___PPC_RB(rb) | ___PPC_RS(rs) | ___PPC_RIC(ric) | ___PPC_PRS(prs) | ___PPC_R(r)) #define PPC_RAW_TLBIEL(rb, rs, ric, prs, r) \ (0x7c000224 | ___PPC_RB(rb) | ___PPC_RS(rs) | ___PPC_RIC(ric) | ___PPC_PRS(prs) | ___PPC_R(r)) +#define PPC_RAW_TLBIEL_v205(rb, l) (0x7c000224 | ___PPC_RB(rb) | (l << 21)) #define PPC_RAW_TLBSRX_DOT(a, b) (0x7c0006a5 | __PPC_RA0(a) | __PPC_RB(b)) #define PPC_RAW_TLBIVAX(a, b) (0x7c000624 | __PPC_RA0(a) | __PPC_RB(b)) #define PPC_RAW_ERATWE(s, a, w) (0x7c0001a6 | __PPC_RS(s) | __PPC_RA(a) | __PPC_WS(w)) @@ -606,6 +607,7 @@ stringify_in_c(.long PPC_RAW_TLBIE_5(rb, rs, ric, prs, r)) #define PPC_TLBIEL(rb,rs,ric,prs,r) \ stringify_in_c(.long PPC_RAW_TLBIEL(rb, rs, ric, prs, r)) +#define PPC_TLBIEL_v205(rb, l) stringify_in_c(.long PPC_RAW_TLBIEL_v205(rb, l)) #define PPC_TLBSRX_DOT(a, b) stringify_in_c(.long PPC_RAW_TLBSRX_DOT(a, b)) #define PPC_TLBIVAX(a, b) stringify_in_c(.long PPC_RAW_TLBIVAX(a, b)) diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c index d2a320828c0b..623a7b7ab38b 100644 --- a/arch/powerpc/mm/book3s64/hash_native.c +++ b/arch/powerpc/mm/book3s64/hash_native.c @@ -163,7 +163,7 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize) va |= ssize << 8; sllp = get_sllp_encoding(apsize); va |= sllp << 5; - asm
volatile(ASM_FTR_IFSET("tlbiel %0", "tlbiel %0,0", %1) + asm volatile(ASM_FTR_IFSET("tlbiel %0", PPC_TLBIEL_v205(%0, 0), %1) : : "r" (va), "i" (CPU_FTR_ARCH_206) : "memory"); break; @@ -182,7 +182,7 @@ static inline void __tlbiel(unsigned long vpn, int psize, int apsize, int ssize) */ va |= (vpn & 0xfe); va |= 1; /* L */ - asm volatile(ASM_FTR_IFSET("tlbiel %0", "tlbiel %0,1", %1) + asm volatile(ASM_FTR_IFSET("tlbiel %0", PPC_TLBIEL_v205(%0, 1), %1) : : "r" (va), "i" (CPU_FTR_ARCH_206) : "memory"); break; -- 2.30.2
[PATCH llvm 5/6] powerpc/mm: Switch obsolete dssall to .long
The dssall ("Data Stream Stop All") instruction is obsolete, along with other Data Cache Instructions, since ISA 2.03 (year 2006). LLVM IAS does not support it but PPC970 seems to be using it. This switches dssall to .long as there is not much point in fixing LLVM. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/ppc-opcode.h | 2 ++ arch/powerpc/kernel/idle.c | 2 +- arch/powerpc/mm/mmu_context.c | 2 +- arch/powerpc/kernel/idle_6xx.S | 2 +- arch/powerpc/kernel/l2cr_6xx.S | 6 +++--- arch/powerpc/kernel/swsusp_32.S | 2 +- arch/powerpc/kernel/swsusp_asm64.S | 2 +- arch/powerpc/platforms/powermac/cache.S | 4 ++-- 8 files changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index f50213e2a3e0..9fe3223e7820 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -249,6 +249,7 @@ #define PPC_INST_COPY 0x7c20060c #define PPC_INST_DCBA 0x7c0005ec #define PPC_INST_DCBA_MASK 0xfc0007fe +#define PPC_INST_DSSALL 0x7e00066c #define PPC_INST_ISEL 0x7c1e #define PPC_INST_ISEL_MASK 0xfc3e #define PPC_INST_LSWI 0x7c0004aa @@ -577,6 +578,7 @@ #define PPC_DCBZL(a, b) stringify_in_c(.long PPC_RAW_DCBZL(a, b)) #define PPC_DIVDE(t, a, b) stringify_in_c(.long PPC_RAW_DIVDE(t, a, b)) #define PPC_DIVDEU(t, a, b) stringify_in_c(.long PPC_RAW_DIVDEU(t, a, b)) +#define PPC_DSSALL stringify_in_c(.long PPC_INST_DSSALL) #define PPC_LQARX(t, a, b, eh) stringify_in_c(.long PPC_RAW_LQARX(t, a, b, eh)) #define PPC_STQCX(t, a, b) stringify_in_c(.long PPC_RAW_STQCX(t, a, b)) #define PPC_MADDHD(t, a, b, c) stringify_in_c(.long PPC_RAW_MADDHD(t, a, b, c)) diff --git a/arch/powerpc/kernel/idle.c b/arch/powerpc/kernel/idle.c index 1f835539fda4..4ad79eb638c6 100644 --- a/arch/powerpc/kernel/idle.c +++ b/arch/powerpc/kernel/idle.c @@ -82,7 +82,7 @@ void power4_idle(void) return; if (cpu_has_feature(CPU_FTR_ALTIVEC)) - asm volatile("DSSALL ; sync" ::: "memory"); + asm
volatile(PPC_DSSALL " ; sync" ::: "memory"); power4_idle_nap(); diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c index 735c36f26388..1fb9c99f8679 100644 --- a/arch/powerpc/mm/mmu_context.c +++ b/arch/powerpc/mm/mmu_context.c @@ -90,7 +90,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * context */ if (cpu_has_feature(CPU_FTR_ALTIVEC)) - asm volatile ("dssall"); + asm volatile (PPC_DSSALL); if (!new_on_cpu) membarrier_arch_switch_mm(prev, next, tsk); diff --git a/arch/powerpc/kernel/idle_6xx.S b/arch/powerpc/kernel/idle_6xx.S index 13cad9297d82..3c097356366b 100644 --- a/arch/powerpc/kernel/idle_6xx.S +++ b/arch/powerpc/kernel/idle_6xx.S @@ -129,7 +129,7 @@ BEGIN_FTR_SECTION END_FTR_SECTION_IFCLR(CPU_FTR_NO_DPM) mtspr SPRN_HID0,r4 BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL sync END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) lwz r8,TI_LOCAL_FLAGS(r2) /* set napping bit */ diff --git a/arch/powerpc/kernel/l2cr_6xx.S b/arch/powerpc/kernel/l2cr_6xx.S index 225511d73bef..f2e03ed423d0 100644 --- a/arch/powerpc/kernel/l2cr_6xx.S +++ b/arch/powerpc/kernel/l2cr_6xx.S @@ -96,7 +96,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_L2CR) /* Stop DST streams */ BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL sync END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) @@ -292,7 +292,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_L3CR) isync /* Stop DST streams */ - DSSALL + PPC_DSSALL sync /* Get the current enable bit of the L3CR into r4 */ @@ -401,7 +401,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_L3CR) _GLOBAL(__flush_disable_L1) /* Stop pending alitvec streams and memory accesses */ BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) sync diff --git a/arch/powerpc/kernel/swsusp_32.S b/arch/powerpc/kernel/swsusp_32.S index f73f4d72fea4..e0cbd63007f2 100644 --- a/arch/powerpc/kernel/swsusp_32.S +++ b/arch/powerpc/kernel/swsusp_32.S @@ -181,7 +181,7 @@ _GLOBAL(swsusp_arch_resume) #ifdef CONFIG_ALTIVEC /* Stop pending alitvec streams and memory accesses */ 
BEGIN_FTR_SECTION - DSSALL + PPC_DSSALL END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) #endif sync diff --git a/arch/powerpc/kernel/swsusp_asm64.S b/arch/powerpc/kernel/swsusp_asm64.S index 96bb20715aa9..9f1903c7f540 100644 --- a/arch/powerpc/kernel/swsusp_asm64.S +++ b/arch/powerpc/kernel/swsusp_asm64.S @@ -141,7 +141,7 @@ END_FW_FTR_SECTION_IFCLR(F
[PATCH llvm 4/6] powerpc/64/asm: Do not reassign labels
From: Daniel Axtens The LLVM integrated assembler really does not like us reassigning things to the same label: :7:9: error: invalid reassignment of non-absolute variable 'fs_label' This happens across a bunch of platforms: https://github.com/ClangBuiltLinux/linux/issues/1043 https://github.com/ClangBuiltLinux/linux/issues/1008 https://github.com/ClangBuiltLinux/linux/issues/920 https://github.com/ClangBuiltLinux/linux/issues/1050 There is no hope of getting this fixed in LLVM (see https://github.com/ClangBuiltLinux/linux/issues/1043#issuecomment-641571200 and https://bugs.llvm.org/show_bug.cgi?id=47798#c1 ) so if we want to build with LLVM_IAS, we need to hack around it ourselves. For us the big problem comes from this: #define USE_FIXED_SECTION(sname) \ fs_label = start_##sname; \ fs_start = sname##_start; \ use_ftsec sname; #define USE_TEXT_SECTION() fs_label = start_text; \ fs_start = text_start; \ .text and in particular fs_label. This works around it by not setting those 'variables' and requiring that users of the variables instead track for themselves what section they are in. This isn't amazing, by any stretch, but it gets us further in the compilation.
Note that even though users have to keep track of the section, using a wrong one produces an error with both binutils and llvm, which prevents using the wrong section at compile time: llvm error example: AS arch/powerpc/kernel/head_64.o :0: error: Cannot represent a difference across sections make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1 binutils error example: /home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S: Assembler messages: /home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S:1974: Error: can't resolve `system_call_common' {.text section} - `start_real_vectors' {.head.text.real_vectors section} make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1 Signed-off-by: Daniel Axtens Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/head-64.h | 12 +-- arch/powerpc/kernel/exceptions-64s.S | 32 ++-- arch/powerpc/kernel/head_64.S| 18 arch/powerpc/kernel/interrupt_64.S | 2 +- 4 files changed, 31 insertions(+), 33 deletions(-) diff --git a/arch/powerpc/include/asm/head-64.h b/arch/powerpc/include/asm/head-64.h index 242204e12993..d73153b0275d 100644 --- a/arch/powerpc/include/asm/head-64.h +++ b/arch/powerpc/include/asm/head-64.h @@ -98,13 +98,9 @@ linker_stub_catch: \ . = sname##_len; #define USE_FIXED_SECTION(sname) \ - fs_label = start_##sname; \ - fs_start = sname##_start; \ use_ftsec sname; #define USE_TEXT_SECTION() \ - fs_label = start_text; \ - fs_start = text_start; \ .text #define CLOSE_FIXED_SECTION(sname) \ @@ -161,13 +157,15 @@ end_##sname: * - ABS_ADDR is used to find the absolute address of any symbol, from within * a fixed section.
*/ -#define DEFINE_FIXED_SYMBOL(label) \ - label##_absolute = (label - fs_label + fs_start) +// define label as being _in_ sname +#define DEFINE_FIXED_SYMBOL(label, sname) \ + label##_absolute = (label - start_ ## sname + sname ## _start) #define FIXED_SYMBOL_ABS_ADDR(label) \ (label##_absolute) -#define ABS_ADDR(label) (label - fs_label + fs_start) +// find label from _within_ sname +#define ABS_ADDR(label, sname) (label - start_ ## sname + sname ## _start) #endif /* __ASSEMBLY__ */ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 83d37678f7cf..44b70bf535e3 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -48,7 +48,7 @@ .balign IFETCH_ALIGN_BYTES; \ .global name; \ _ASM_NOKPROBE_SYMBOL(name); \ - DEFINE_FIXED_SYMBOL(name); \ + DEFINE_FIXED_SYMBOL(name, text);\ name: #define TRAMP_REAL_BEGIN(name) \ @@ -76,18 +76,18 @@ name: ld reg,PACAKBASE(r13); /* get high part of */ \ ori reg,reg,FIXED_SYMBOL_ABS_ADDR(label) -#define __LOAD_HANDLER(reg, label) \ +#define
[PATCH llvm 3/6] powerpc/64/asm: Inline BRANCH_TO_C000
It is used just once and does not really help with readability, remove it. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/exceptions-64s.S | 17 +++-- 1 file changed, 3 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a30f563bc7a8..83d37678f7cf 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -89,19 +89,6 @@ name: ori reg,reg,(ABS_ADDR(label))@l;\ addis reg,reg,(ABS_ADDR(label))@h -/* - * Branch to label using its 0xC000 address. This results in instruction - * address suitable for MSR[IR]=0 or 1, which allows relocation to be turned - * on using mtmsr rather than rfid. - * - * This could set the 0xc bits for !RELOCATABLE as an immediate, rather than - * load KBASE for a slight optimisation. - */ -#define BRANCH_TO_C000(reg, label) \ - __LOAD_FAR_HANDLER(reg, label); \ - mtctr reg;\ - bctr - /* * Interrupt code generation macros */ @@ -962,7 +949,9 @@ TRAMP_REAL_BEGIN(system_reset_idle_wake) /* We are waking up from idle, so may clobber any volatile register */ cmpwi cr1,r5,2 bltlr cr1 /* no state loss, return to idle caller with r3=SRR1 */ - BRANCH_TO_C000(r12, DOTSYM(idle_return_gpr_loss)) + __LOAD_FAR_HANDLER(r12, DOTSYM(idle_return_gpr_loss)) + mtctr r12 + bctr #endif #ifdef CONFIG_PPC_PSERIES -- 2.30.2
[PATCH llvm 2/6] powerpc: check for support for -Wa,-m{power4,any}
From: Daniel Axtens

LLVM's integrated assembler does not like either -Wa,-mpower4 or -Wa,-many. So just don't pass them if they're not supported.

Signed-off-by: Daniel Axtens
Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index e9aa4e8b07dd..5f16ac1583c5 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -245,7 +245,9 @@ cpu-as-$(CONFIG_E500)		+= -Wa,-me500

 # When using '-many -mpower4' gas will first try and find a matching power4
 # mnemonic and failing that it will allow any valid mnemonic that GAS knows
 # about. GCC will pass -many to GAS when assembling, clang does not.
-cpu-as-$(CONFIG_PPC_BOOK3S_64)	+= -Wa,-mpower4 -Wa,-many
+# LLVM IAS doesn't understand either flag: https://github.com/ClangBuiltLinux/linux/issues/675
+# but LLVM IAS only supports ISA >= 2.06 for Book3S 64 anyway...
+cpu-as-$(CONFIG_PPC_BOOK3S_64)	+= $(call as-option,-Wa$(comma)-mpower4) $(call as-option,-Wa$(comma)-many)
 cpu-as-$(CONFIG_PPC_E500MC)	+= $(call as-option,-Wa$(comma)-me500mc)

 KBUILD_AFLAGS += $(cpu-as-y)
--
2.30.2
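The `as-option` helper used above probes whether the assembler actually accepts a flag before adding it. A rough user-space sketch of what it does (the `probe_as_flag` name and the `CC` default of gcc are assumptions for this illustration, not Kbuild internals):

```shell
# Feed the assembler an empty program with the candidate flag and report
# whether it was accepted; an unknown flag (or missing compiler) counts
# as "not supported", which is exactly why -Wa,-many is now optional.
probe_as_flag() {
    if echo '' | "${CC:-gcc}" -Wa,"$1" -c -x assembler -o /dev/null - 2>/dev/null; then
        echo "supported"
    else
        echo "not supported"
    fi
}

probe_as_flag -many
```

With this in place, gas keeps getting `-mpower4 -many` while LLVM IAS simply never sees the flags it would reject.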
[PATCH llvm 1/6] powerpc/toc: PowerPC64 future proof kernel toc, revised for lld
From: Alan Modra

This patch future-proofs the kernel against linker changes that might put the toc pointer at some location other than .got+0x8000, by replacing __toc_start+0x8000 with .TOC. throughout. If the kernel's idea of the toc pointer doesn't agree with the linker, bad things happen.

prom_init.c code relocating its toc is also changed so that a symbolic __prom_init_toc_start toc-pointer relative address is calculated rather than assuming that it is always at toc-pointer - 0x8000. The length calculations loading values from the toc are also avoided. It's a little incestuous to do that with unreloc_toc picking up adjusted values (which is fine in practice, they both adjust by the same amount if all goes well).

I've also changed the way .got is aligned in vmlinux.lds and zImage.lds, mostly so that dumping out section info by objdump or readelf plainly shows the alignment is 256. This linker script feature was added 2005-09-27, available in FSF binutils releases from 2.17 onwards. Should be safe to use in the kernel, I think.

Finally, put *(.got) before the prom_init.o entry which only needs *(.toc), so that the GOT header goes in the correct place. I don't believe this makes any difference for the kernel as it would for dynamic objects being loaded by ld.so. That change is just to stop lusers who blindly copy kernel scripts being led astray. Of course, this change needs the prom_init.c changes.

Some notes on .toc and .got. .toc is a compiler generated section of addresses. .got is a linker generated section of addresses, generally built when the linker sees R_*_*GOT* relocations. In the case of powerpc64 ld.bfd, there are multiple generated .got sections, one per input object file. So you can somewhat reasonably write in a linker script an input section statement like *prom_init.o(.got .toc) to mean "the .got and .toc section for files matching *prom_init.o".
On other architectures that doesn't make sense, because the linker generally has just one .got section. Even on powerpc64, note well that the GOT entries for prom_init.o may be merged with GOT entries from other objects. That means that if prom_init.o references, say, _end via some GOT relocation, and some other object also references _end via a GOT relocation, the GOT entry for _end may be in the range __prom_init_toc_start to __prom_init_toc_end and if the kernel does something special to GOT/TOC entries in that range then the value of _end as seen by objects other than prom_init.o will be affected. On the other hand the GOT entry for _end may not be in the range __prom_init_toc_start to __prom_init_toc_end. Which way it turns out is deterministic but a detail of linker operation that should not be relied on.

A feature of ld.bfd is that input .toc (and .got) sections matching one linker input section statement may be sorted, to put entries used by small-model code first, near the toc base. This is why scripts for powerpc64 normally use *(.got .toc) rather than *(.got) *(.toc), since the first form allows more freedom to sort.

Another feature of ld.bfd is that indirect addressing sequences using the GOT/TOC may be edited by the linker to relative addressing. In many cases relative addressing would be emitted by gcc for -mcmodel=medium if you appropriately decorate variable declarations with non-default visibility.
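The sorting point reads more concretely as a linker-script fragment. This is an illustrative sketch in GNU ld script syntax, not a copy of the kernel's actual vmlinux.lds.S:

```
.got : ALIGN(256) {
	/* One input-section statement: ld.bfd is free to sort all the
	   matching .got/.toc entries together, putting entries used by
	   small-model code nearest the toc base. */
	*(.got .toc)
}
```

Writing `*(.got) *(.toc)` instead would pin all .got entries ahead of all .toc entries and take that sorting freedom away, which is the distinction the paragraph above draws.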
The original patch is here: https://lore.kernel.org/linuxppc-dev/20210310034813.gm6...@bubble.grove.modra.org/

Signed-off-by: Alan Modra
[aik: removed non-relocatable which is gone in 24d33ac5b8ffb]
[aik: added <=2.24 check]
[aik: because of llvm-as, kernel_toc_addr() uses "mr" instead of global register variable]
Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/Makefile               |  5 +++--
 arch/powerpc/include/asm/sections.h | 14 +++---
 arch/powerpc/boot/crt0.S            |  2 +-
 arch/powerpc/boot/zImage.lds.S      |  7 ++-
 arch/powerpc/kernel/head_64.S       |  2 +-
 arch/powerpc/kernel/vmlinux.lds.S   |  8 +++-
 6 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index e02568f17334..e9aa4e8b07dd 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -445,10 +445,11 @@ PHONY += checkbin
 # Check toolchain versions:
 # - gcc-4.6 is the minimum kernel-wide version so nothing required.
 checkbin:
-	@if test "x${CONFIG_CPU_LITTLE_ENDIAN}" = "xy" \
-	    && $(LD) --version | head -1 | grep ' 2\.24$$' >/dev/null ; then \
+	@if test "x${CONFIG_LD_IS_LLD}" != "xy" -a \
+		"x$(call ld-ifversion, -le, 22400, y)" = "xy" ; then \
 		echo -n '*** binutils 2.24 miscompiles weak symbols ' ; \
 		echo 'in some circumstances.' ; \
+		echo '*** binutils 2.23 do not define the TOC symbol ' ; \
 		echo -n '*** Please use a different binutils versi
[PATCH kernel 0/6] powerpc: Build with LLVM_IAS=1
This allows compiling the upstream Linux with the upstream llvm with one fix on top: https://reviews.llvm.org/D115419

This is based on sha1 798527287598 Michael Ellerman "Automatic merge of 'next' into merge (2021-12-14 00:12)".

Please comment. Thanks.

Alan Modra (1):
  powerpc/toc: PowerPC64 future proof kernel toc, revised for lld

Alexey Kardashevskiy (3):
  powerpc/64/asm: Inline BRANCH_TO_C000
  powerpc/mm: Switch obsolete dssall to .long
  powerpc/mm/book3s64/hash: Switch pre 2.06 tlbiel to .long

Daniel Axtens (2):
  powerpc: check for support for -Wa,-m{power4,any}
  powerpc/64/asm: Do not reassign labels

 arch/powerpc/Makefile                   |  9 +++--
 arch/powerpc/include/asm/head-64.h      | 12 +++
 arch/powerpc/include/asm/ppc-opcode.h   |  4 +++
 arch/powerpc/include/asm/sections.h     | 14 
 arch/powerpc/kernel/idle.c              |  2 +-
 arch/powerpc/mm/book3s64/hash_native.c  |  4 +--
 arch/powerpc/mm/mmu_context.c           |  2 +-
 arch/powerpc/boot/crt0.S                |  2 +-
 arch/powerpc/boot/zImage.lds.S          |  7 ++--
 arch/powerpc/kernel/exceptions-64s.S    | 47 ++---
 arch/powerpc/kernel/head_64.S           | 20 +--
 arch/powerpc/kernel/idle_6xx.S          |  2 +-
 arch/powerpc/kernel/interrupt_64.S      |  2 +-
 arch/powerpc/kernel/l2cr_6xx.S          |  6 ++--
 arch/powerpc/kernel/swsusp_32.S         |  2 +-
 arch/powerpc/kernel/swsusp_asm64.S      |  2 +-
 arch/powerpc/kernel/vmlinux.lds.S       |  8 ++---
 arch/powerpc/platforms/powermac/cache.S |  4 +--
 18 files changed, 69 insertions(+), 80 deletions(-)

--
2.30.2
Re: [PATCH kernel v4] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
On 12/20/21 18:29, Cédric Le Goater wrote:
> On 12/20/21 02:23, Alexey Kardashevskiy wrote:
>> At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
>> 1) "%pid-%fd" per a KVM instance (for all platforms);
>> 2) "vm%pid" (for PPC Book3s HV KVM);
>> 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
>> 4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);
>>
>> The problem with this is that multiple VMs per process is not allowed for
>> 2) and 3) which makes it possible for userspace to trigger errors when
>> creating duplicated debugfs entries.
>>
>> This merges all these into 1).
>>
>> This defines kvm_arch_create_kvm_debugfs() similar to
>> kvm_arch_create_vcpu_debugfs().
>>
>> This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
>> add necessary entries, this adds the _e500 suffix to
>> kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.
>>
>> This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.
>>
>> This removes no more used debugfs_dir pointers from PPC kvm_arch structs.
>>
>> This stops removing vcpu entries as once created vcpus stay around
>> for the entire life of a VM and removed when the KVM instance is closed,
>> see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
>> debugfs directories").
>
> Reviewed-by: Cédric Le Goater
>
> One comment below.
>
>> ---
>> Changes:
>> v4:
>> * added "kvm-xive-%p"
>>
>> v3:
>> * reworked commit log, especially, the bit about removing vcpus
>>
>> v2:
>> * handled powerpc-booke
>> * s/kvm/vm/ in arch hooks
>> ---
>> arch/powerpc/include/asm/kvm_host.h    | 6 ++---
>> arch/powerpc/include/asm/kvm_ppc.h     | 2 ++
>> arch/powerpc/kvm/timing.h              | 9 
>> arch/powerpc/kvm/book3s_64_mmu_hv.c    | 2 +-
>> arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
>> arch/powerpc/kvm/book3s_hv.c           | 31 ++
>> arch/powerpc/kvm/book3s_xics.c         | 13 ++-
>> arch/powerpc/kvm/book3s_xive.c         | 13 ++-
>> arch/powerpc/kvm/book3s_xive_native.c  | 13 ++-
>> arch/powerpc/kvm/e500.c                | 1 +
>> arch/powerpc/kvm/e500mc.c              | 1 +
>> arch/powerpc/kvm/powerpc.c             | 16 ++---
>> arch/powerpc/kvm/timing.c              | 20 -
>> 13 files changed, 47 insertions(+), 82 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h
>> b/arch/powerpc/include/asm/kvm_host.h
>> index 17263276189e..f5e14fa683f4 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -26,6 +26,8 @@
>> #include
>> #include
>>
>> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>> +
>> #define KVM_MAX_VCPUS NR_CPUS
>> #define KVM_MAX_VCORES NR_CPUS
>>
>> @@ -295,7 +297,6 @@ struct kvm_arch {
>> bool dawr1_enabled;
>> pgd_t *pgtable;
>> u64 process_table;
>> -struct dentry *debugfs_dir;
>> struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
>> #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
>> #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
>> @@ -673,7 +674,6 @@ struct kvm_vcpu_arch {
>> u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
>> u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
>> u64 timing_last_exit;
>> -struct dentry *debugfs_exit_timing;
>> #endif
>>
>> #ifdef CONFIG_PPC_BOOK3S
>> @@ -829,8 +829,6 @@ struct kvm_vcpu_arch {
>> struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */
>> struct kvmhv_tb_accumulator guest_time; /* guest execution */
>> struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */
>> -
>> -struct dentry *debugfs_dir;
>> #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
>> };
>>
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h
>> b/arch/powerpc/include/asm/kvm_ppc.h
>> index 33db83b82fbd..d2b192dea0d2 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -316,6 +316,8 @@ struct kvmppc_ops {
>> int (*svm_off)(struct kvm *kvm);
>> int (*enable_dawr1)(struct kvm *kvm);
>> bool (*hash_v
[PATCH kernel v4] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);

The problem with this is that multiple VMs per process is not allowed for 2) and 3) which makes it possible for userspace to trigger errors when creating duplicated debugfs entries.

This merges all these into 1).

This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs().

This defines 2 hooks in kvmppc_ops that allow specific KVM implementations add necessary entries, this adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.

This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

This stops removing vcpu entries as once created vcpus stay around for the entire life of a VM and removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories").
Suggested-by: Fabiano Rosas
Signed-off-by: Alexey Kardashevskiy
---
Changes:
v4:
* added "kvm-xive-%p"

v3:
* reworked commit log, especially, the bit about removing vcpus

v2:
* handled powerpc-booke
* s/kvm/vm/ in arch hooks
---
 arch/powerpc/include/asm/kvm_host.h    | 6 ++---
 arch/powerpc/include/asm/kvm_ppc.h     | 2 ++
 arch/powerpc/kvm/timing.h              | 9 
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 arch/powerpc/kvm/book3s_hv.c           | 31 ++
 arch/powerpc/kvm/book3s_xics.c         | 13 ++-
 arch/powerpc/kvm/book3s_xive.c         | 13 ++-
 arch/powerpc/kvm/book3s_xive_native.c  | 13 ++-
 arch/powerpc/kvm/e500.c                | 1 +
 arch/powerpc/kvm/e500mc.c              | 1 +
 arch/powerpc/kvm/powerpc.c             | 16 ++---
 arch/powerpc/kvm/timing.c              | 20 -
 13 files changed, 47 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 17263276189e..f5e14fa683f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -26,6 +26,8 @@
 #include
 #include

+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+
 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS

@@ -295,7 +297,6 @@ struct kvm_arch {
 	bool dawr1_enabled;
 	pgd_t *pgtable;
 	u64 process_table;
-	struct dentry *debugfs_dir;
 	struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -673,7 +674,6 @@ struct kvm_vcpu_arch {
 	u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_last_exit;
-	struct dentry *debugfs_exit_timing;
 #endif

 #ifdef CONFIG_PPC_BOOK3S
@@ -829,8 +829,6 @@ struct kvm_vcpu_arch {
 	struct kvmhv_tb_accumulator rm_exit;	/* real-mode exit code */
 	struct kvmhv_tb_accumulator guest_time;	/* guest execution */
 	struct kvmhv_tb_accumulator cede_time;	/* time napping inside guest */
-
-	struct dentry *debugfs_dir;
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 33db83b82fbd..d2b192dea0d2 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
 	int (*svm_off)(struct kvm *kvm);
 	int (*enable_dawr1)(struct kvm *kvm);
 	bool (*hash_v3_possible)(void);
+	int (*create_vm_debugfs)(struct kvm *kvm);
+	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
 };

 extern struct kvmppc_ops *kvmppc_hv_ops;

diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
index feef7885ba82..493a7d510fd5 100644
--- a/arch/powerpc/kvm/timing.h
+++ b/arch/powerpc/kvm/timing.h
@@ -14,8 +14,8 @@
 #ifdef CONFIG_KVM_EXIT_TIMING
 void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
 void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
-void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
-void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
+void kvmppc_create_vcpu_debugfs_e500(struct kvm_vcpu *vcpu,
+				     struct dentry *debugfs_dentry);

 static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 {
@@ -26,9 +26,8 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 /* if exit timing is not configured there is no need to build the c file */
 static inline void kvmppc_
Re: [PATCH kernel v3] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
On 12/16/21 05:11, Cédric Le Goater wrote:
> On 12/15/21 02:33, Alexey Kardashevskiy wrote:
>> At the moment KVM on PPC creates 3 types of entries under the kvm debugfs:
>> 1) "%pid-%fd" per a KVM instance (for all platforms);
>> 2) "vm%pid" (for PPC Book3s HV KVM);
>> 3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM).
>>
>> The problem with this is that multiple VMs per process is not allowed for
>> 2) and 3) which makes it possible for userspace to trigger errors when
>> creating duplicated debugfs entries.
>>
>> This merges all these into 1).
>>
>> This defines kvm_arch_create_kvm_debugfs() similar to
>> kvm_arch_create_vcpu_debugfs().
>>
>> This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
>> add necessary entries, this adds the _e500 suffix to
>> kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.
>>
>> This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.
>>
>> This removes no more used debugfs_dir pointers from PPC kvm_arch structs.
>>
>> This stops removing vcpu entries as once created vcpus stay around
>> for the entire life of a VM and removed when the KVM instance is closed,
>> see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
>> debugfs directories").
>
> It would be nice to also move the KVM device debugfs files :
>
> /sys/kernel/debug/powerpc/kvm-xive-%p
>
> These are dynamically created and destroyed at run time depending
> on the interrupt mode negotiated by CAS. It might be more complex ?
With this addition:

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 99db9ac49901..511f643e2875 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -1267,10 +1267,10 @@ static void xive_native_debugfs_init(struct kvmppc_xive *xive)
 		return;
 	}

-	xive->dentry = debugfs_create_file(name, 0444, arch_debugfs_dir,
+	xive->dentry = debugfs_create_file(name, 0444, xive->kvm->debugfs_dentry,
 					   xive, _native_debug_fops);

it looks fine, this is "before":

root@zz1:/sys/kernel/debug# find -iname "*xive*"
./slab/xive-provision
./powerpc/kvm-xive-c000208c
./powerpc/xive

and this is "after" the patch applied.

root@zz1:/sys/kernel/debug# find -iname "*xive*"
./kvm/29058-11/kvm-xive-c000208c
./slab/xive-provision
./powerpc/xive

I'll repost unless there is something more to it.

Thanks,

--
Alexey
[PATCH kernel v3] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
At the moment KVM on PPC creates 3 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM).

The problem with this is that multiple VMs per process is not allowed for 2) and 3) which makes it possible for userspace to trigger errors when creating duplicated debugfs entries.

This merges all these into 1).

This defines kvm_arch_create_kvm_debugfs() similar to kvm_arch_create_vcpu_debugfs().

This defines 2 hooks in kvmppc_ops that allow specific KVM implementations add necessary entries, this adds the _e500 suffix to kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.

This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.

This removes no more used debugfs_dir pointers from PPC kvm_arch structs.

This stops removing vcpu entries as once created vcpus stay around for the entire life of a VM and removed when the KVM instance is closed, see commit d56f5136b010 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories").
Suggested-by: Fabiano Rosas
Signed-off-by: Alexey Kardashevskiy
---
Changes:
v3:
* reworked commit log, especially, the bit about removing vcpus

v2:
* handled powerpc-booke
* s/kvm/vm/ in arch hooks
---
 arch/powerpc/include/asm/kvm_host.h    | 6 ++---
 arch/powerpc/include/asm/kvm_ppc.h     | 2 ++
 arch/powerpc/kvm/timing.h              | 9 
 arch/powerpc/kvm/book3s_64_mmu_hv.c    | 2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 arch/powerpc/kvm/book3s_hv.c           | 31 ++
 arch/powerpc/kvm/e500.c                | 1 +
 arch/powerpc/kvm/e500mc.c              | 1 +
 arch/powerpc/kvm/powerpc.c             | 16 ++---
 arch/powerpc/kvm/timing.c              | 20 -
 10 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 17263276189e..f5e14fa683f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -26,6 +26,8 @@
 #include
 #include

+#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+
 #define KVM_MAX_VCPUS		NR_CPUS
 #define KVM_MAX_VCORES		NR_CPUS

@@ -295,7 +297,6 @@ struct kvm_arch {
 	bool dawr1_enabled;
 	pgd_t *pgtable;
 	u64 process_table;
-	struct dentry *debugfs_dir;
 	struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -673,7 +674,6 @@ struct kvm_vcpu_arch {
 	u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
 	u64 timing_last_exit;
-	struct dentry *debugfs_exit_timing;
 #endif

 #ifdef CONFIG_PPC_BOOK3S
@@ -829,8 +829,6 @@ struct kvm_vcpu_arch {
 	struct kvmhv_tb_accumulator rm_exit;	/* real-mode exit code */
 	struct kvmhv_tb_accumulator guest_time;	/* guest execution */
 	struct kvmhv_tb_accumulator cede_time;	/* time napping inside guest */
-
-	struct dentry *debugfs_dir;
 #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
 };

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 33db83b82fbd..d2b192dea0d2 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -316,6 +316,8 @@ struct kvmppc_ops {
 	int (*svm_off)(struct kvm *kvm);
 	int (*enable_dawr1)(struct kvm *kvm);
 	bool (*hash_v3_possible)(void);
+	int (*create_vm_debugfs)(struct kvm *kvm);
+	int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
 };

 extern struct kvmppc_ops *kvmppc_hv_ops;

diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
index feef7885ba82..493a7d510fd5 100644
--- a/arch/powerpc/kvm/timing.h
+++ b/arch/powerpc/kvm/timing.h
@@ -14,8 +14,8 @@
 #ifdef CONFIG_KVM_EXIT_TIMING
 void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
 void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
-void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
-void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
+void kvmppc_create_vcpu_debugfs_e500(struct kvm_vcpu *vcpu,
+				     struct dentry *debugfs_dentry);

 static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 {
@@ -26,9 +26,8 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
 /* if exit timing is not configured there is no need to build the c file */
 static inline void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu) {}
 static inline void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu) {}
-static inline void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
-					      unsigned int id) {}
-static inline void kvmppc_remove_vcpu_debug
[PATCH kernel 3/3] powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window
There is a possibility of having just one DMA window available with a limited capacity which the existing code does not handle that well. If the window is big enough for the system RAM but less than MAX_PHYSMEM_BITS (which we want when persistent memory is present), we create a 1:1 window and leave persistent memory without DMA.

This disables 1:1 mapping entirely if there is persistent memory and either:
- the huge DMA window does not cover the entire address space;
- the default DMA window is removed.

This relies on reverted 54fc3c681ded ("powerpc/pseries/ddw: Extend upper limit for huge DMA window for persistent memory") to return the actual amount of RAM in ddw_memory_hotplug_max() (posted separately).

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 301fa5b3d528..8f998e55735b 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1356,8 +1356,10 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 		len = order_base_2(query.largest_available_block << page_shift);
 		win_name = DMA64_PROPNAME;
 	} else {
-		direct_mapping = true;
-		win_name = DIRECT64_PROPNAME;
+		direct_mapping = !default_win_removed ||
+			(len == MAX_PHYSMEM_BITS) ||
+			(!pmem_present && (len == max_ram_len));
+		win_name = direct_mapping ? DIRECT64_PROPNAME : DMA64_PROPNAME;
 	}

 	ret = create_ddw(dev, ddw_avail, , page_shift, len);
--
2.30.2
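The new `direct_mapping` condition is compact enough to be worth spelling out. The sketch below mirrors the expression from the hunk above as a standalone predicate (the function name and the concrete bit widths are illustrative, not kernel code):

```c
#include <stdbool.h>

/*
 * When is a 1:1 (direct) mapping safe?
 *  - the default window is still there to dynamically map whatever the
 *    huge window misses, or
 *  - the huge window covers the whole address space, or
 *  - there is no persistent memory and the window covers all of RAM.
 */
static bool can_direct_map(bool default_win_removed, bool pmem_present,
                           int len, int max_physmem_bits, int max_ram_len)
{
    return !default_win_removed ||
           (len == max_physmem_bits) ||
           (!pmem_present && (len == max_ram_len));
}
```

The failure mode the patch closes is the first test case below: persistent memory present, default window gone, window smaller than the full address space. Direct-mapping there would leave the pmem range with no way to get DMA at all.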
[PATCH kernel 2/3] powerpc/pseries/ddw: simplify enable_ddw()
This drops rather useless ddw_enabled flag as direct_mapping implies it anyway. While at this, fix indents in enable_ddw().

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy
---
This replaces "powerpc/pseries/iommu: Fix indentations"
---
 arch/powerpc/platforms/pseries/iommu.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 64385d6f33c2..301fa5b3d528 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1229,7 +1229,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	u32 ddw_avail[DDW_APPLICABLE_SIZE];
 	struct dma_win *window;
 	struct property *win64;
-	bool ddw_enabled = false;
 	struct failed_ddw_pdn *fpdn;
 	bool default_win_removed = false, direct_mapping = false;
 	bool pmem_present;
@@ -1244,7 +1243,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (find_existing_ddw(pdn, >dev.archdata.dma_offset, )) {
 		direct_mapping = (len >= max_ram_len);
-		ddw_enabled = true;
 		goto out_unlock;
 	}
@@ -1397,8 +1395,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 			dev_info(>dev,
 				 "failed to map DMA window for %pOF: %d\n",
 				 dn, ret);

-	/* Make sure to clean DDW if any TCE was set*/
-	clean_dma_window(pdn, win64->value);
+			/* Make sure to clean DDW if any TCE was set*/
+			clean_dma_window(pdn, win64->value);
 			goto out_del_list;
 		}
 	} else {
@@ -1445,7 +1443,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	spin_unlock(_win_list_lock);

 	dev->dev.archdata.dma_offset = win_addr;
-	ddw_enabled = true;
 	goto out_unlock;

 out_del_list:
@@ -1481,10 +1478,10 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 * as RAM, then we failed to create a window to cover persistent
 	 * memory and need to set the DMA limit.
 	 */
-	if (pmem_present && ddw_enabled && direct_mapping && len == max_ram_len)
+	if (pmem_present && direct_mapping && len == max_ram_len)
 		dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset +
 			(1ULL << len);

-	return ddw_enabled && direct_mapping;
+	return direct_mapping;
 }

 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
--
2.30.2
[PATCH kernel 1/3] powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
This reverts commit 54fc3c681ded9437e4548e2501dc1136b23cfa9a which does not allow 1:1 mapping even for the system RAM which is usually possible.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 49b401536d29..64385d6f33c2 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1094,15 +1094,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
--
2.30.2
[PATCH kernel 0/3] powerpc/pseries/ddw: Fixes for persistent memory case
This is based on sha1 f855455dee0b Michael Ellerman "Automatic merge of 'next' into merge (2021-11-05 22:19)".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
  powerpc/pseries/ddw: simplify enable_ddw()
  powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window

 arch/powerpc/platforms/pseries/iommu.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

--
2.30.2
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 28/10/2021 08:30, Brian King wrote:
On 10/26/21 12:39 AM, Alexey Kardashevskiy wrote:
On 10/26/21 01:40, Brian King wrote:
On 10/23/21 7:18 AM, Alexey Kardashevskiy wrote:
On 23/10/2021 07:18, Brian King wrote:
On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:
On 22/10/2021 04:44, Brian King wrote:

If ibm,pmemory is installed in the system, it can appear anywhere in the address space. This patch enhances how we handle DMA for devices when ibm,pmemory is present. In the case where we have enough DMA space to direct map all of RAM, but not ibm,pmemory, we use direct DMA for I/O to RAM and use the default window to dynamically map ibm,pmemory. In the case where we only have a single DMA window, this won't work, so if the window is not big enough to map the entire address range, we cannot direct map.

but we want the pmem range to be mapped into the huge DMA window too if we can, why skip it?

This patch should simply do what the comment in this commit mentioned below suggests, which says that ibm,pmemory can appear anywhere in the address space. If the DMA window is large enough to map all of MAX_PHYSMEM_BITS, we will indeed simply do direct DMA for everything, including the pmem. If we do not have a big enough window to do that, we will do direct DMA for DRAM and dynamic mapping for pmem.

Right, and this is what we do already, do not we? I missing something here.

The upstream code does not work correctly that I can see.
If I boot an upstream kernel with an nvme device and vpmem assigned to the LPAR, and enable dev_dbg in arch/powerpc/platforms/pseries/iommu.c, I see the following in the logs:

[2.157549] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
[2.157561] nvme 0121:50:00.0: Skipping ibm,pmemory
[2.157567] nvme 0121:50:00.0: can't map partition max 0x8 with 16777216 65536-sized pages
[2.170150] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 28 returned 0 (liobn = 0x7121 starting addr = 800 0)
[2.170170] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
[2.356260] nvme 0121:50:00.0: node is /pci@8002121/pci1014,683@0

This means we are heading down the leg in enable_ddw where we do not set direct_mapping to true. We use create the DDW window, but don't do any direct DMA. This is because the window is not large enough to map 2PB of memory, which is what ddw_memory_hotplug_max returns without my patch.

With my patch applied, I get this in the logs:

[2.204866] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
[2.204875] nvme 0121:50:00.0: Skipping ibm,pmemory
[2.205058] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 21 returned 0 (liobn = 0x7121 starting addr = 800 0)
[2.205068] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
[2.215898] nvme 0121:50:00.0: iommu: 64-bit OK but direct DMA is limited by 802

ah I see. then...
Thanks, Brian

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48

Thanks, Brian

Signed-off-by: Brian King
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 */
 	len = max_ram_len;
 	if (pmem_present) {
+		if (default_win_removed) {
+			/*
+			 * If we only have one DMA window and have pmem present,
+			 * then we need to be able to map the entire address
+			 * range in order to be able to do direct DMA to RAM.
+			 */
+			len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 10/26/21 01:40, Brian King wrote:
> On 10/23/21 7:18 AM, Alexey Kardashevskiy wrote:
>>
>> On 23/10/2021 07:18, Brian King wrote:
>>> On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:
>>>>
>>>> On 22/10/2021 04:44, Brian King wrote:
>>>>> If ibm,pmemory is installed in the system, it can appear anywhere
>>>>> in the address space. This patch enhances how we handle DMA for devices when
>>>>> ibm,pmemory is present. In the case where we have enough DMA space to
>>>>> direct map all of RAM, but not ibm,pmemory, we use direct DMA for
>>>>> I/O to RAM and use the default window to dynamically map ibm,pmemory.
>>>>> In the case where we only have a single DMA window, this won't work, so
>>>>> if the window is not big enough to map the entire address range,
>>>>> we cannot direct map.
>>>>
>>>> but we want the pmem range to be mapped into the huge DMA window too if we
>>>> can, why skip it?
>>>
>>> This patch should simply do what the comment in this commit mentioned below
>>> suggests, which says that ibm,pmemory can appear anywhere in the address
>>> space. If the DMA window is large enough to map all of MAX_PHYSMEM_BITS,
>>> we will indeed simply do direct DMA for everything, including the pmem.
>>> If we do not have a big enough window to do that, we will do direct DMA
>>> for DRAM and dynamic mapping for pmem.
>>
>> Right, and this is what we do already, do we not? Am I missing something here.
>
> The upstream code does not work correctly that I can see.
> If I boot an upstream kernel with an nvme device and vpmem assigned to the
> LPAR, and enable dev_dbg in arch/powerpc/platforms/pseries/iommu.c,
> I see the following in the logs:
>
> [2.157549] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
> [2.157561] nvme 0121:50:00.0: Skipping ibm,pmemory
> [2.157567] nvme 0121:50:00.0: can't map partition max 0x8 with 16777216 65536-sized pages
> [2.170150] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 28 returned 0 (liobn = 0x7121 starting addr = 800 0)
> [2.170170] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
> [2.356260] nvme 0121:50:00.0: node is /pci@8002121/pci1014,683@0
>
> This means we are heading down the leg in enable_ddw where we do not set
> direct_mapping to true. We do create the DDW window, but don't do any
> direct DMA. This is because the window is not large enough to map 2PB of
> memory, which is what ddw_memory_hotplug_max returns without my patch.
>
> With my patch applied, I get this in the logs:
>
> [2.204866] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 50 800 2121 returned 0
> [2.204875] nvme 0121:50:00.0: Skipping ibm,pmemory
> [2.205058] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 50 800 2121 10 21 returned 0 (liobn = 0x7121 starting addr = 800 0)
> [2.205068] nvme 0121:50:00.0: created tce table LIOBN 0x7121 for /pci@8002121/pci1014,683@0
> [2.215898] nvme 0121:50:00.0: iommu: 64-bit OK but direct DMA is limited by 802

ah I see. then...
>
> Thanks,
>
> Brian
>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48
>>>
>>> Thanks,
>>>
>>> Brian
>>>
>>>>> Signed-off-by: Brian King
>>>>> ---
>>>>>  arch/powerpc/platforms/pseries/iommu.c | 19 ++-
>>>>>  1 file changed, 10 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>>> index 269f61d519c2..d9ae985d10a4 100644
>>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>>> @@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>>>>>  	phys_addr_t max_addr = memory_hotplug_max();
>>>>>  	struct device_node *memory;
>>>>> -	/*
>>>>> -	 * The "ibm,pmemory"
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 23/10/2021 07:18, Brian King wrote:

On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:

On 22/10/2021 04:44, Brian King wrote:

If ibm,pmemory is installed in the system, it can appear anywhere in the
address space. This patch enhances how we handle DMA for devices when
ibm,pmemory is present. In the case where we have enough DMA space to direct
map all of RAM, but not ibm,pmemory, we use direct DMA for I/O to RAM and
use the default window to dynamically map ibm,pmemory. In the case where we
only have a single DMA window, this won't work, so if the window is not big
enough to map the entire address range, we cannot direct map.

but we want the pmem range to be mapped into the huge DMA window too if we
can, why skip it?

This patch should simply do what the comment in this commit mentioned below
suggests, which says that ibm,pmemory can appear anywhere in the address
space. If the DMA window is large enough to map all of MAX_PHYSMEM_BITS, we
will indeed simply do direct DMA for everything, including the pmem. If we
do not have a big enough window to do that, we will do direct DMA for DRAM
and dynamic mapping for pmem.

Right, and this is what we do already, do we not? Am I missing something here.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48

Thanks,

Brian

Signed-off-by: Brian King
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 */
 	len = max_ram_len;
 	if (pmem_present) {
+		if (default_win_removed) {
+			/*
+			 * If we only have one DMA window and have pmem present,
+			 * then we need to be able to map the entire address
+			 * range in order to be able to do direct DMA to RAM.
+			 */
+			len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+					(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));
+		}
+
 		if (query.largest_available_block >=
 		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
 			len = MAX_PHYSMEM_BITS;

-- 
Alexey
Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
On 22/10/2021 04:44, Brian King wrote:

If ibm,pmemory is installed in the system, it can appear anywhere in the
address space. This patch enhances how we handle DMA for devices when
ibm,pmemory is present. In the case where we have enough DMA space to direct
map all of RAM, but not ibm,pmemory, we use direct DMA for I/O to RAM and
use the default window to dynamically map ibm,pmemory. In the case where we
only have a single DMA window, this won't work, so if the window is not big
enough to map the entire address range, we cannot direct map.

but we want the pmem range to be mapped into the huge DMA window too if we
can, why skip it?

Signed-off-by: Brian King
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 	phys_addr_t max_addr = memory_hotplug_max();
 	struct device_node *memory;

-	/*
-	 * The "ibm,pmemory" can appear anywhere in the address space.
-	 * Assuming it is still backed by page structs, set the upper limit
-	 * for the huge DMA window as MAX_PHYSMEM_BITS.
-	 */
-	if (of_find_node_by_type(NULL, "ibm,pmemory"))
-		return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-			(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
 	for_each_node_by_type(memory, "memory") {
 		unsigned long start, size;
 		int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	 */
 	len = max_ram_len;
 	if (pmem_present) {
+		if (default_win_removed) {
+			/*
+			 * If we only have one DMA window and have pmem present,
+			 * then we need to be able to map the entire address
+			 * range in order to be able to do direct DMA to RAM.
+			 */
+			len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+					(phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));
+		}
+
 		if (query.largest_available_block >=
 		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
 			len = MAX_PHYSMEM_BITS;

-- 
Alexey