Re: [Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation
On Thu, Oct 12, 2023 at 02:35:15PM +0200, Christian König wrote: > Additional to that from the software side Felix summarized it in the HMM > peer2peer discussion thread recently quite well. Do you have a pointer to that discussion?
Re: [Nouveau] [PATCH drm-next v5 03/14] drm: manager to keep track of GPUs VA mappings
Why are none of these EXPORT_SYMBOL_GPL as it's very linux-internal stuff?
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
Thank you. I'll queue it up as a separate patch.
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
On Fri, Jun 09, 2023 at 05:38:28PM +0200, Juergen Gross wrote: >>> guest started with e820_host=1 even if no PCI passthrough was planned. >>> But this should be rather rare (at least I hope so). >> >> So is this an ACK for the patch and can we go ahead with it? > > As long as above mentioned check of the E820 map is done, yes. > > If you want I can send a diff to be folded into your patch on Monday. Yes, that would be great!
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
On Mon, May 22, 2023 at 10:37:09AM +0200, Juergen Gross wrote: > In normal cases PCI passthrough in PV guests requires to start the guest > with e820_host=1. So it should be rather easy to limit allocating the > 64MB in PV guests to the cases where the memory map has non-RAM regions > especially in the first 1MB of the memory. > > This will cover even hotplug cases. The only case not covered would be a > guest started with e820_host=1 even if no PCI passthrough was planned. > But this should be rather rare (at least I hope so). So is this an ACK for the patch and can we go ahead with it? (I'd still like to merge swiotlb-xen into swiotlb eventually, but it's probably not going to happen this merge window)
Re: [Nouveau] [PATCH 3/4] drm/nouveau: stop using is_swiotlb_active
On Thu, May 18, 2023 at 04:30:49PM -0400, Lyude Paul wrote: > Reviewed-by: Lyude Paul > > Thanks for getting to this! I've tentantively queued this up in the dma-mapping for-next tree. Let me know if you'd prefer it to go through the nouveau tree.
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
On Fri, May 19, 2023 at 02:58:57PM +0200, Christoph Hellwig wrote: > On Fri, May 19, 2023 at 01:49:46PM +0100, Andrew Cooper wrote: > > > The alternative would be to finally merge swiotlb-xen into swiotlb, in > > > which case we might be able to do this later. Let me see what I can > > > do there. > > > > If that is an option, it would be great to reduce the special-cashing. > > I think it's doable, and I've been wanting it for a while. I just > need motivated testers, but it seems like I just found at least two :) So looking at swiotlb-xen it does these off things where it takes a value generated originally be xen_phys_to_dma, then only does a dma_to_phys to go back and call pfn_valid on the result. Does this make sense, or is it wrong and just works by accident? I.e. is the patch below correct? diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 67aa74d201627d..3396c5766f0dd8 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -90,9 +90,7 @@ static inline int range_straddles_page_boundary(phys_addr_t p, size_t size) static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) { - unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr)); - unsigned long xen_pfn = bfn_to_local_pfn(bfn); - phys_addr_t paddr = (phys_addr_t)xen_pfn << XEN_PAGE_SHIFT; + phys_addr_t paddr = xen_dma_to_phys(dev, dma_addr); /* If the address is outside our domain, it CAN * have the same virtual address as another address @@ -234,7 +232,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page, done: if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { - if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dev_addr + if (pfn_valid(PFN_DOWN(phys))) arch_sync_dma_for_device(phys, size, dir); else xen_dma_sync_for_device(dev, dev_addr, size, dir); @@ -258,7 +256,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr, BUG_ON(dir == DMA_NONE); if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { - if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr + if (pfn_valid(PFN_DOWN(paddr))) arch_sync_dma_for_cpu(paddr, size, dir); else xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir); @@ -276,7 +274,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, dma_addr_t dma_addr, phys_addr_t paddr = xen_dma_to_phys(dev, dma_addr); if (!dev_is_dma_coherent(dev)) { - if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr + if (pfn_valid(PFN_DOWN(paddr))) arch_sync_dma_for_cpu(paddr, size, dir); else xen_dma_sync_for_cpu(dev, dma_addr, size, dir); @@ -296,7 +294,7 @@ xen_swiotlb_sync_single_for_device(struct device *dev, dma_addr_t dma_addr, swiotlb_sync_single_for_device(dev, paddr, size, dir); if (!dev_is_dma_coherent(dev)) { - if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr + if (pfn_valid(PFN_DOWN(paddr))) arch_sync_dma_for_device(paddr, size, dir); else xen_dma_sync_for_device(dev, dma_addr, size, dir);
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
On Fri, May 19, 2023 at 01:49:46PM +0100, Andrew Cooper wrote: > > The alternative would be to finally merge swiotlb-xen into swiotlb, in > > which case we might be able to do this later. Let me see what I can > > do there. > > If that is an option, it would be great to reduce the special-cashing. I think it's doable, and I've been wanting it for a while. I just need motivated testers, but it seems like I just found at least two :)
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
On Fri, May 19, 2023 at 12:10:26PM +0200, Marek Marczykowski-Górecki wrote: > While I would say PCI passthrough is not very common for PV guests, can > the decision about xen-swiotlb be delayed until you can enumerate > xenstore to check if there are any PCI devices connected (and not > allocate xen-swiotlb by default if there are none)? This would > still not cover the hotplug case (in which case, you'd need to force it > with a cmdline), but at least you wouldn't loose much memory just > because one of your VMs may use PCI passthrough (so, you have it enabled > in your kernel). How early can we query xenstore? We'd need to do this before setting up DMA for any device. The alternative would be to finally merge swiotlb-xen into swiotlb, in which case we might be able to do this later. Let me see what I can do there.
Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
On Thu, May 18, 2023 at 08:18:39PM +0200, Marek Marczykowski-Górecki wrote: > On Thu, May 18, 2023 at 03:42:51PM +0200, Christoph Hellwig wrote: > > Remove the dangerous late initialization of xen-swiotlb in > > pci_xen_swiotlb_init_late and instead just always initialize > > xen-swiotlb in the boot code if CONFIG_XEN_PCIDEV_FRONTEND is enabled. > > > > Signed-off-by: Christoph Hellwig > > Doesn't it mean all the PV guests will basically waste 64MB of RAM > by default each if they don't really have PCI devices? If CONFIG_XEN_PCIDEV_FRONTEND is enabled, and the kernel's isn't booted with swiotlb=noforce, yes.
[Nouveau] [PATCH 3/4] drm/nouveau: stop using is_swiotlb_active
Drivers have no business looking into dma-mapping internals and check what backend is used. Unfortunstely the DRM core is still broken and tries to do plain page allocations instead of using DMA API allocators by default and uses various bandaids on when to use dma_alloc_coherent. Switch nouveau to use the same (broken) scheme as amdgpu and radeon to remove the last driver user of is_swiotlb_active. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/nouveau/nouveau_ttm.c | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c index 1469a88910e45d..486f39f31a38df 100644 --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -24,9 +24,9 @@ */ #include -#include #include +#include #include "nouveau_drv.h" #include "nouveau_gem.h" @@ -265,7 +265,6 @@ nouveau_ttm_init(struct nouveau_drm *drm) struct nvkm_pci *pci = device->pci; struct nvif_mmu *mmu = >client.mmu; struct drm_device *dev = drm->dev; - bool need_swiotlb = false; int typei, ret; ret = nouveau_ttm_init_host(drm, 0); @@ -300,13 +299,10 @@ nouveau_ttm_init(struct nouveau_drm *drm) drm->agp.cma = pci->agp.cma; } -#if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86) - need_swiotlb = is_swiotlb_active(dev->dev); -#endif - ret = ttm_device_init(>ttm.bdev, _bo_driver, drm->dev->dev, dev->anon_inode->i_mapping, - dev->vma_offset_manager, need_swiotlb, + dev->vma_offset_manager, + drm_need_swiotlb(drm->client.mmu.dmabits), drm->client.mmu.dmabits <= 32); if (ret) { NV_ERROR(drm, "error initialising bo driver, %d\n", ret); -- 2.39.2
[Nouveau] unexport swiotlb_active
Hi all, this little series removes the last swiotlb API exposed to modules. Diffstat: arch/x86/include/asm/xen/swiotlb-xen.h |6 -- arch/x86/kernel/pci-dma.c | 28 drivers/gpu/drm/nouveau/nouveau_ttm.c | 10 +++--- drivers/pci/xen-pcifront.c |6 -- kernel/dma/swiotlb.c |1 - 5 files changed, 7 insertions(+), 44 deletions(-)
[Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabling
Remove the dangerous late initialization of xen-swiotlb in pci_xen_swiotlb_init_late and instead just always initialize xen-swiotlb in the boot code if CONFIG_XEN_PCIDEV_FRONTEND is enabled. Signed-off-by: Christoph Hellwig --- arch/x86/include/asm/xen/swiotlb-xen.h | 6 -- arch/x86/kernel/pci-dma.c | 25 +++-- drivers/pci/xen-pcifront.c | 6 -- 3 files changed, 3 insertions(+), 34 deletions(-) diff --git a/arch/x86/include/asm/xen/swiotlb-xen.h b/arch/x86/include/asm/xen/swiotlb-xen.h index 77a2d19cc9909e..abde0f44df57dc 100644 --- a/arch/x86/include/asm/xen/swiotlb-xen.h +++ b/arch/x86/include/asm/xen/swiotlb-xen.h @@ -2,12 +2,6 @@ #ifndef _ASM_X86_SWIOTLB_XEN_H #define _ASM_X86_SWIOTLB_XEN_H -#ifdef CONFIG_SWIOTLB_XEN -extern int pci_xen_swiotlb_init_late(void); -#else -static inline int pci_xen_swiotlb_init_late(void) { return -ENXIO; } -#endif - int xen_swiotlb_fixup(void *buf, unsigned long nslabs); int xen_create_contiguous_region(phys_addr_t pstart, unsigned int order, unsigned int address_bits, diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index f887b08ac5ffe4..c4a7ead9eb674e 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -81,27 +81,6 @@ static void __init pci_xen_swiotlb_init(void) if (IS_ENABLED(CONFIG_PCI)) pci_request_acs(); } - -int pci_xen_swiotlb_init_late(void) -{ - if (dma_ops == _swiotlb_dma_ops) - return 0; - - /* we can work with the default swiotlb */ - if (!io_tlb_default_mem.nslabs) { - int rc = swiotlb_init_late(swiotlb_size_or_default(), - GFP_KERNEL, xen_swiotlb_fixup); - if (rc < 0) - return rc; - } - - /* XXX: this switches the dma ops under live devices! */ - dma_ops = _swiotlb_dma_ops; - if (IS_ENABLED(CONFIG_PCI)) - pci_request_acs(); - return 0; -} -EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late); #else static inline void __init pci_xen_swiotlb_init(void) { @@ -111,7 +90,9 @@ static inline void __init pci_xen_swiotlb_init(void) void __init pci_iommu_alloc(void) { if (xen_pv_domain()) { - if (xen_initial_domain() || x86_swiotlb_enable) + if (xen_initial_domain() || + IS_ENABLED(CONFIG_XEN_PCIDEV_FRONTEND) || + x86_swiotlb_enable) pci_xen_swiotlb_init(); return; } diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c index 83c0ab50676dff..11636634ae512f 100644 --- a/drivers/pci/xen-pcifront.c +++ b/drivers/pci/xen-pcifront.c @@ -22,7 +22,6 @@ #include #include #include -#include #include #include @@ -669,11 +668,6 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev) spin_unlock(_dev_lock); - if (!err && !is_swiotlb_active(>xdev->dev)) { - err = pci_xen_swiotlb_init_late(); - if (err) - dev_err(>xdev->dev, "Could not setup SWIOTLB!\n"); - } return err; } -- 2.39.2
[Nouveau] [PATCH 1/4] x86: move a check out of pci_xen_swiotlb_init
Move the exact checks when to initialize the Xen swiotlb code out of pci_xen_swiotlb_init and into the caller so that is uses readable positive checks, rather than negative ones that will get even more confusing with another addition. Signed-off-by: Christoph Hellwig --- arch/x86/kernel/pci-dma.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index de6be0a3965ee4..f887b08ac5ffe4 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -74,8 +74,6 @@ static inline void __init pci_swiotlb_detect(void) #ifdef CONFIG_SWIOTLB_XEN static void __init pci_xen_swiotlb_init(void) { - if (!xen_initial_domain() && !x86_swiotlb_enable) - return; x86_swiotlb_enable = true; x86_swiotlb_flags |= SWIOTLB_ANY; swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup); @@ -113,7 +111,8 @@ static inline void __init pci_xen_swiotlb_init(void) void __init pci_iommu_alloc(void) { if (xen_pv_domain()) { - pci_xen_swiotlb_init(); + if (xen_initial_domain() || x86_swiotlb_enable) + pci_xen_swiotlb_init(); return; } pci_swiotlb_detect(); -- 2.39.2
[Nouveau] [PATCH 4/4] swiotlb: unexport is_swiotlb_active
Drivers have no business looking at dma-mapping or swiotlb internals. Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index af2e304c672c43..9f1fd28264a067 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -921,7 +921,6 @@ bool is_swiotlb_active(struct device *dev) return mem && mem->nslabs; } -EXPORT_SYMBOL_GPL(is_swiotlb_active); #ifdef CONFIG_DEBUG_FS -- 2.39.2
Re: [Nouveau] [PATCH v2] mm: Take a page reference when removing device exclusive entries
s/page/folio/ in the entire commit log?
Re: [Nouveau] [RFC] drm/nouveau/ttm: Stop calling into swiotlb
Hi Lyude, and thanks for taking a look. > -#if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86) > - need_swiotlb = is_swiotlb_active(dev->dev); > -#endif > - > ret = ttm_device_init(>ttm.bdev, _bo_driver, drm->dev->dev, > - dev->anon_inode->i_mapping, > - dev->vma_offset_manager, need_swiotlb, > - drm->client.mmu.dmabits <= 32); > + dev->anon_inode->i_mapping, > + dev->vma_offset_manager, > + nouveau_drm_use_coherent_gpu_mapping(drm), > + drm->client.mmu.dmabits <= 32); This will break setups for two reasons: - swiotlb is not only used to do device addressing limitations, so this will not catch the case of interconnect addressing limitations or forced bounce buffering which used used e.g. in secure VMs. - we might need bouncing for any DMA address below the physical address limit of the CPU But more fundamentally the use_dma32 argument to ttm_device_init is rather broken, as the onlyway to get a memory allocation that fits the DMA addressing needs of a device is to use the proper DMA mapping helpers. i.e. ttm_pool_alloc_page really needs to use dma_alloc_pages instead of alloc_pages as a first step. That way all users of the TTM pool will always get dma addressable pages and there is no need to guess the addressing limitations. The use_dma_alloc is then only needed for users that require coherent memory and are willing to deal with the limitations that this entails (e.g. no way to get at the page struct). > if (ret) { > NV_ERROR(drm, "error initialising bo driver, %d\n", ret); > return ret; > -- > 2.35.3 ---end quoted text---
Re: [Nouveau] susetting the remaining swioltb couplin in DRM
On Mon, Jul 11, 2022 at 04:31:49PM -0400, Rodrigo Vivi wrote: > On Mon, Jul 11, 2022 at 10:26:14AM +0200, Christoph Hellwig wrote: > > Hi i915 and nouveau maintainers, > > > > any chance I could get some help to remove the remaining direct > > driver calls into swiotlb, namely swiotlb_max_segment and > > is_swiotlb_active. Either should not matter to a driver as they > > should be written to the DMA API. > > Hi Christoph, > > while we take a look here, could you please share the reasons > behind sunsetting this calls? Because they are a completely broken layering violation. A driver has absolutely no business knowing the dma-mapping violation. The DMA API reports what we think is all useful constraints (e.g. dma_max_mapping_size()), and provides useful APIs to (e.g. dma_alloc_noncoherent or dma_alloc_noncontiguous) to allocate pages that can be mapped without bounce buffering and drivers should use the proper API instead of poking into one particular implementation and restrict it from changing. swiotlb_max_segment in particular returns a value that isn't actually correct (a driver can't just use all of swiotlb) AND actually doesn't work as is in various scenarious that are becoming more common, most notably host with memory encryption schemes that always require bounce buffering.
[Nouveau] susetting the remaining swioltb couplin in DRM
Hi i915 and nouveau maintainers, any chance I could get some help to remove the remaining direct driver calls into swiotlb, namely swiotlb_max_segment and is_swiotlb_active. Either should not matter to a driver as they should be written to the DMA API. In the i915 case it seems like the driver should use dma_alloc_noncontiguous and/or dma_alloc_noncoherent to allocate DMAable memory instead of using alloc_page and the streaming dma mapping helpers. For the latter it seems like it should just stop passing use_dma_alloc == true to ttm_device_init and/or that function should switch to use dma_alloc_noncoherent.
Re: [Nouveau] [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file
On Thu, Feb 10, 2022 at 09:35:10PM +1100, Alistair Popple wrote: > I got the following build error: > > /data/source/linux/mm/migrate_device.c: In function ‘migrate_vma_collect_pmd’: > /data/source/linux/mm/migrate_device.c:242:3: error: implicit declaration of > function ‘flush_tlb_range’; did you mean ‘flush_pmd_tlb_range’? > [-Werror=implicit-function-declaration] > 242 | flush_tlb_range(walk->vma, start, end); > | ^~~ > | flush_pmd_tlb_range > > Including asm/tlbflush.h in migrate_device.c fixed it for me. Yes, the buildbot also complained about this, but somehow in my test configfs it got pulled in implicitly.
[Nouveau] [PATCH 27/27] tools: add hmm gup test for long term pinned device pages
From: Alex Sierra The intention is to test device coherent type pages that have been called through get user pages with PIN_LONGTERM flag set. These pages should get migrated back to normal system memory. Signed-off-by: Alex Sierra Signed-off-by: Alistair Popple Reviewed-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- tools/testing/selftests/vm/Makefile| 2 +- tools/testing/selftests/vm/hmm-tests.c | 81 ++ 2 files changed, 82 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 1607322a112c91..58c8427114f0c2 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -142,7 +142,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap $(OUTPUT)/gup_test: ../../../../mm/gup_test.h -$(OUTPUT)/hmm-tests: local_config.h +$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 84ec8c4a1dc7b6..11b83a8084fee2 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -36,6 +36,7 @@ * in the usual include/uapi/... directory. */ #include "../../../../lib/test_hmm_uapi.h" +#include "../../../../mm/gup_test.h" struct hmm_buffer { void*ptr; @@ -60,6 +61,8 @@ enum { #define NTIMES 10 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) +/* Just the flags we need, copied from mm.h: */ +#define FOLL_WRITE 0x01/* check pte is writable */ FIXTURE(hmm) { @@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow) hmm_buffer_free(buffer); } +/* + * Test get user device pages through gup_test. Setting PIN_LONGTERM flag. + * This should trigger a migration back to system memory for both, private + * and coherent type pages. + * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added + * to your configuration before you run it. + */ +TEST_F(hmm, hmm_gup_test) +{ + struct hmm_buffer *buffer; + struct gup_test gup; + int gup_fd; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + unsigned char *m; + + gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR); + if (gup_fd == -1) + SKIP(return, "Skipping test, could not find gup_test driver"); + + npages = 4; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Migrate memory to device. */ + ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + gup.nr_pages_per_call = npages; + gup.addr = (unsigned long)buffer->ptr; + gup.gup_flags = FOLL_WRITE; + gup.size = size; + /* +* Calling gup_test ioctl. It will try to PIN_LONGTERM these device pages +* causing a migration back to system memory for both, private and coherent +* type pages. +*/ + if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, )) { + perror("ioctl on PIN_LONGTERM_BENCHMARK\n"); + goto out_test; + } + + /* Take snapshot to make sure pages have been migrated to sys memory */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + m = buffer->mirror; + for (i = 0; i < npages; i++) + ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE); +out_test: + close(gup_fd); + hmm_buffer_free(buffer); +} TEST_HARNESS_MAIN -- 2.30.2
[Nouveau] [PATCH 26/27] mm/gup: migrate device coherent pages when pinning instead of failing
From: Alistair Popple Currently any attempts to pin a device coherent page will fail. This is because device coherent pages need to be managed by a device driver, and pinning them would prevent a driver from migrating them off the device. However this is no reason to fail pinning of these pages. These are coherent and accessible from the CPU so can be migrated just like pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin them first try migrating them out of ZONE_DEVICE. Signed-off-by: Alistair Popple Acked-by: Felix Kuehling [hch: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c] Signed-off-by: Christoph Hellwig --- mm/gup.c| 37 ++- mm/internal.h | 1 + mm/migrate_device.c | 53 + 3 files changed, 85 insertions(+), 6 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 39b23ad39a7bde..41349b685eafb4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1889,9 +1889,31 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, ret = -EFAULT; goto unpin_pages; } + + /* +* Device coherent pages are managed by a driver and should not +* be pinned indefinitely as it prevents the driver moving the +* page. So when trying to pin with FOLL_LONGTERM instead try +* to migrate the page out of device memory. +*/ if (is_device_coherent_page(head)) { - ret = -EFAULT; - goto unpin_pages; + WARN_ON_ONCE(PageCompound(head)); + + /* +* Migration will fail if the page is pinned, so convert +* the pin on the source page to a normal reference. +*/ + if (gup_flags & FOLL_PIN) { + get_page(head); + unpin_user_page(head); + } + + pages[i] = migrate_device_page(head, gup_flags); + if (!pages[i]) { + ret = -EBUSY; + goto unpin_pages; + } + continue; } if (is_pinnable_page(head)) @@ -1931,10 +1953,13 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, return nr_pages; unpin_pages: - if (gup_flags & FOLL_PIN) { - unpin_user_pages(pages, nr_pages); - } else { - for (i = 0; i < nr_pages; i++) + for (i = 0; i < nr_pages; i++) { + if (!pages[i]) + continue; + + if (gup_flags & FOLL_PIN) + unpin_user_page(pages[i]); + else put_page(pages[i]); } diff --git a/mm/internal.h b/mm/internal.h index a67222d17e5987..1bded5d7f41a9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -719,5 +719,6 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, unsigned long addr, int page_nid, int *flags); void free_zone_device_page(struct page *page); +struct page *migrate_device_page(struct page *page, unsigned int gup_flags); #endif /* __MM_INTERNAL_H */ diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 03e182f9fc7865..3373b535d5c9d9 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -767,3 +767,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate) } } EXPORT_SYMBOL(migrate_vma_finalize); + +/* + * Migrate a device coherent page back to normal memory. The caller should have + * a reference on page which will be copied to the new page if migration is + * successful or dropped on failure. + */ +struct page *migrate_device_page(struct page *page, unsigned int gup_flags) +{ + unsigned long src_pfn, dst_pfn = 0; + struct migrate_vma args; + struct page *dpage; + + lock_page(page); + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; + args.src = _pfn; + args.dst = _pfn; + args.cpages = 1; + args.npages = 1; + args.vma = NULL; + migrate_vma_setup(); + if (!(src_pfn & MIGRATE_PFN_MIGRATE)) + return NULL; + + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0); + + /* +* get/pin the new page now so we don't have to retry gup after +* migrating. We already have a reference so this should never fail. +*/ + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) { + __free_pages(dpage, 0); + dpage = NULL; + } + + if (dpage) { + lock_page(dpage); + dst_pfn =
[Nouveau] [PATCH 23/27] tools: update hmm-test to support device coherent type
From: Alex Sierra Test cases such as migrate_fault and migrate_multiple, were modified to explicit migrate from device to sys memory without the need of page faults, when using device coherent type. Snapshot test case updated to read memory device type first and based on that, get the proper returned results migrate_ping_pong test case added to test explicit migration from device to sys memory for both private and coherent zone types. Helpers to migrate from device to sys memory and vicerversa were also added. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple Signed-off-by: Christoph Hellwig --- tools/testing/selftests/vm/hmm-tests.c | 123 - 1 file changed, 102 insertions(+), 21 deletions(-) diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 203323967b507a..84ec8c4a1dc7b6 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -44,6 +44,14 @@ struct hmm_buffer { int fd; uint64_tcpages; uint64_tfaults; + int zone_device_type; +}; + +enum { + HMM_PRIVATE_DEVICE_ONE, + HMM_PRIVATE_DEVICE_TWO, + HMM_COHERENCE_DEVICE_ONE, + HMM_COHERENCE_DEVICE_TWO, }; #define TWOMEG (1 << 21) @@ -60,6 +68,21 @@ FIXTURE(hmm) unsigned intpage_shift; }; +FIXTURE_VARIANT(hmm) +{ + int device_number; +}; + +FIXTURE_VARIANT_ADD(hmm, hmm_device_private) +{ + .device_number = HMM_PRIVATE_DEVICE_ONE, +}; + +FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent) +{ + .device_number = HMM_COHERENCE_DEVICE_ONE, +}; + FIXTURE(hmm2) { int fd0; @@ -68,6 +91,24 @@ FIXTURE(hmm2) unsigned intpage_shift; }; +FIXTURE_VARIANT(hmm2) +{ + int device_number0; + int device_number1; +}; + +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private) +{ + .device_number0 = HMM_PRIVATE_DEVICE_ONE, + .device_number1 = HMM_PRIVATE_DEVICE_TWO, +}; + +FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent) +{ + .device_number0 = HMM_COHERENCE_DEVICE_ONE, + .device_number1 = HMM_COHERENCE_DEVICE_TWO, +}; + static int hmm_open(int unit) { char pathname[HMM_PATH_MAX]; @@ -81,12 +122,19 @@ static int hmm_open(int unit) return fd; } +static bool hmm_is_coherent_type(int dev_num) +{ + return (dev_num >= HMM_COHERENCE_DEVICE_ONE); +} + FIXTURE_SETUP(hmm) { self->page_size = sysconf(_SC_PAGE_SIZE); self->page_shift = ffs(self->page_size) - 1; - self->fd = hmm_open(0); + self->fd = hmm_open(variant->device_number); + if (self->fd < 0 && hmm_is_coherent_type(variant->device_number)) + SKIP(exit(0), "DEVICE_COHERENT not available"); ASSERT_GE(self->fd, 0); } @@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2) self->page_size = sysconf(_SC_PAGE_SIZE); self->page_shift = ffs(self->page_size) - 1; - self->fd0 = hmm_open(0); + self->fd0 = hmm_open(variant->device_number0); + if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0)) + SKIP(exit(0), "DEVICE_COHERENT not available"); ASSERT_GE(self->fd0, 0); - self->fd1 = hmm_open(1); + self->fd1 = hmm_open(variant->device_number1); ASSERT_GE(self->fd1, 0); } @@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd, } buffer->cpages = cmd.cpages; buffer->faults = cmd.faults; + buffer->zone_device_type = cmd.zone_device_type; return 0; } @@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n) nanosleep(, NULL); } +static int hmm_migrate_sys_to_dev(int fd, + struct hmm_buffer *buffer, + unsigned long npages) +{ + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages); +} + +static int hmm_migrate_dev_to_sys(int fd, + struct hmm_buffer *buffer, + unsigned long npages) +{ + return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages); +} + /* * Simple NULL test of device open/close. */ @@ -875,7 +940,7 @@ TEST_F(hmm, migrate) ptr[i] = i; /* Migrate memory to device. */ - ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages); ASSERT_EQ(ret, 0); ASSERT_EQ(buffer->cpages, npages); @@ -923,7 +988,7 @@ TEST_F(hmm, migrate_fault) ptr[i] = i; /* Migrate memory to device. */ - ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ret = hmm_migrate_sys_to_dev(self-&g
[Nouveau] [PATCH 25/27] mm: remove the vma check in migrate_vma_setup()
From: Alistair Popple migrate_vma_setup() checks that a valid vma is passed so that the page tables can be walked to find the pfns associated with a given address range. However in some cases the pfns are already known, such as when migrating device coherent pages during pin_user_pages() meaning a valid vma isn't required. Signed-off-by: Alistair Popple Acked-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- mm/migrate_device.c | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 0b295594e7626d..03e182f9fc7865 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -462,24 +462,24 @@ int migrate_vma_setup(struct migrate_vma *args) args->start &= PAGE_MASK; args->end &= PAGE_MASK; - if (!args->vma || is_vm_hugetlb_page(args->vma) || - (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma)) - return -EINVAL; - if (nr_pages <= 0) - return -EINVAL; - if (args->start < args->vma->vm_start || - args->start >= args->vma->vm_end) - return -EINVAL; - if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end) - return -EINVAL; if (!args->src || !args->dst) return -EINVAL; - - memset(args->src, 0, sizeof(*args->src) * nr_pages); - args->cpages = 0; - args->npages = 0; - - migrate_vma_collect(args); + if (args->vma) { + if (is_vm_hugetlb_page(args->vma) || + (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma)) + return -EINVAL; + if (args->start < args->vma->vm_start || + args->start >= args->vma->vm_end) + return -EINVAL; + if (args->end <= args->vma->vm_start || + args->end > args->vma->vm_end) + return -EINVAL; + memset(args->src, 0, sizeof(*args->src) * nr_pages); + args->cpages = 0; + args->npages = 0; + + migrate_vma_collect(args); + } if (args->cpages) migrate_vma_unmap(args); @@ -661,7 +661,7 @@ void migrate_vma_pages(struct migrate_vma *migrate) continue; } - if (!page) { + if (!page && migrate->vma) { if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue; if (!notified) { -- 2.30.2
[Nouveau] [PATCH 24/27] tools: update test_hmm script to support SP config
From: Alex Sierra Add two more parameters to set spm_addr_dev0 & spm_addr_dev1 addresses. These two parameters configure the start SP addresses for each device in test_hmm driver. Consequently, this configures zone device type as coherent. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple Signed-off-by: Christoph Hellwig --- tools/testing/selftests/vm/test_hmm.sh | 24 +--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh index 0647b525a62564..539c9371e592a1 100755 --- a/tools/testing/selftests/vm/test_hmm.sh +++ b/tools/testing/selftests/vm/test_hmm.sh @@ -40,11 +40,26 @@ check_test_requirements() load_driver() { - modprobe $DRIVER > /dev/null 2>&1 + if [ $# -eq 0 ]; then + modprobe $DRIVER > /dev/null 2>&1 + else + if [ $# -eq 2 ]; then + modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2 + > /dev/null 2>&1 + else + echo "Missing module parameters. Make sure pass"\ + "spm_addr_dev0 and spm_addr_dev1" + usage + fi + fi if [ $? == 0 ]; then major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices) mknod /dev/hmm_dmirror0 c $major 0 mknod /dev/hmm_dmirror1 c $major 1 + if [ $# -eq 2 ]; then + mknod /dev/hmm_dmirror2 c $major 2 + mknod /dev/hmm_dmirror3 c $major 3 + fi fi } @@ -58,7 +73,7 @@ run_smoke() { echo "Running smoke test. Note, this test provides basic coverage." - load_driver + load_driver $1 $2 $(dirname "${BASH_SOURCE[0]}")/hmm-tests unload_driver } @@ -75,6 +90,9 @@ usage() echo "# Smoke testing" echo "./${TEST_NAME}.sh smoke" echo + echo "# Smoke testing with SPM enabled" + echo "./${TEST_NAME}.sh smoke " + echo exit 0 } @@ -84,7 +102,7 @@ function run_test() usage else if [ "$1" = "smoke" ]; then - run_smoke + run_smoke $2 $3 else usage fi -- 2.30.2
[Nouveau] [PATCH 22/27] lib: add support for device coherent type in test_hmm
From: Alex Sierra Device Coherent type uses device memory that is coherently accesible by the CPU. This could be shown as SP (special purpose) memory range at the BIOS-e820 memory enumeration. If no SP memory is supported in system, this could be faked by setting CONFIG_EFI_FAKE_MEMMAP. Currently, test_hmm only supports two different SP ranges of at least 256MB size. This could be specified in the kernel parameter variable efi_fake_mem. Ex. Two SP ranges of 1GB starting at 0x1 & 0x14000 physical address. Ex. efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4 Private and coherent device mirror instances can be created in the same probed. This is done by passing the module parameters spm_addr_dev0 & spm_addr_dev1. In this case, it will create four instances of device_mirror. The first two correspond to private device type, the last two to coherent type. Then, they can be easily accessed from user space through /dev/hmm_mirror. Usually num_device 0 and 1 are for private, and 2 and 3 for coherent types. If no module parameters are passed, two instances of private type device_mirror will be created only. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Poppple --- lib/test_hmm.c | 253 +--- lib/test_hmm_uapi.h | 15 ++- 2 files changed, 202 insertions(+), 66 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 15747f70c5bc9a..361a026c5d2126 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -32,11 +32,22 @@ #include "test_hmm_uapi.h" -#define DMIRROR_NDEVICES 2 +#define DMIRROR_NDEVICES 4 #define DMIRROR_RANGE_FAULT_TIMEOUT1000 #define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U) #define DEVMEM_CHUNKS_RESERVE 16 +/* + * For device_private pages, dpage is just a dummy struct page + * representing a piece of device memory. dmirror_devmem_alloc_page + * allocates a real system memory page as backing storage to fake a + * real device. zone_device_data points to that backing page. But + * for device_coherent memory, the struct page represents real + * physical CPU-accessible memory that we can use directly. + */ +#define BACKING_PAGE(page) (is_device_private_page((page)) ? \ + (page)->zone_device_data : (page)) + static unsigned long spm_addr_dev0; module_param(spm_addr_dev0, long, 0644); MODULE_PARM_DESC(spm_addr_dev0, @@ -125,6 +136,21 @@ static int dmirror_bounce_init(struct dmirror_bounce *bounce, return 0; } +static bool dmirror_is_private_zone(struct dmirror_device *mdevice) +{ + return (mdevice->zone_device_type == + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false; +} + +static enum migrate_vma_direction +dmirror_select_device(struct dmirror *dmirror) +{ + return (dmirror->mdevice->zone_device_type == + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? + MIGRATE_VMA_SELECT_DEVICE_PRIVATE : + MIGRATE_VMA_SELECT_DEVICE_COHERENT; +} + static void dmirror_bounce_fini(struct dmirror_bounce *bounce) { vfree(bounce->ptr); @@ -575,16 +601,19 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice, static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) { struct page *dpage = NULL; - struct page *rpage; + struct page *rpage = NULL; /* -* This is a fake device so we alloc real system memory to store -* our device memory. +* For ZONE_DEVICE private type, this is a fake device so we allocate +* real system memory to store our device memory. +* For ZONE_DEVICE coherent type we use the actual dpage to store the +* data and ignore rpage. */ - rpage = alloc_page(GFP_HIGHUSER); - if (!rpage) - return NULL; - + if (dmirror_is_private_zone(mdevice)) { + rpage = alloc_page(GFP_HIGHUSER); + if (!rpage) + return NULL; + } spin_lock(>lock); if (mdevice->free_pages) { @@ -603,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) return dpage; error: - __free_page(rpage); + if (rpage) + __free_page(rpage); return NULL; } @@ -629,12 +659,16 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, * unallocated pte_none() or read-only zero page. */ spage = migrate_pfn_to_page(*src); + if (WARN(spage && is_zone_device_page(spage), +"page already in device spage pfn: 0x%lx\n", +page_to_pfn(spage))) + continue; dpage = dmirror_devmem_alloc_page(mdevice); if (!dpage) continue; - rpage = dpage->zone_device_data; + rpage =
[Nouveau] [PATCH 21/27] lib: test_hmm add module param for zone device type
From: Alex Sierra In order to configure device coherent in test_hmm, two module parameters should be passed, which correspond to the SP start address of each device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed, private device type is configured. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Poppple Signed-off-by: Christoph Hellwig --- lib/test_hmm.c | 73 - lib/test_hmm_uapi.h | 1 + 2 files changed, 53 insertions(+), 21 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 7a27584484ce0f..15747f70c5bc9a 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -37,6 +37,16 @@ #define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U) #define DEVMEM_CHUNKS_RESERVE 16 +static unsigned long spm_addr_dev0; +module_param(spm_addr_dev0, long, 0644); +MODULE_PARM_DESC(spm_addr_dev0, + "Specify start address for SPM (special purpose memory) used for device 0. By setting this Coherent device type will be used. Make sure spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE."); + +static unsigned long spm_addr_dev1; +module_param(spm_addr_dev1, long, 0644); +MODULE_PARM_DESC(spm_addr_dev1, + "Specify start address for SPM (special purpose memory) used for device 1. By setting this Coherent device type will be used. Make sure spm_addr_dev0 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE."); + static const struct dev_pagemap_ops dmirror_devmem_ops; static const struct mmu_interval_notifier_ops dmirror_min_ops; static dev_t dmirror_dev; @@ -455,28 +465,44 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd) return ret; } -static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, +static int dmirror_allocate_chunk(struct dmirror_device *mdevice, struct page **ppage) { struct dmirror_chunk *devmem; - struct resource *res; + struct resource *res = NULL; unsigned long pfn; unsigned long pfn_first; unsigned long pfn_last; void *ptr; + int ret = -ENOMEM; devmem = kzalloc(sizeof(*devmem), GFP_KERNEL); if (!devmem) - return false; + return ret; - res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE, - "hmm_dmirror"); - if (IS_ERR(res)) + switch (mdevice->zone_device_type) { + case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE: + res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE, + "hmm_dmirror"); + if (IS_ERR_OR_NULL(res)) + goto err_devmem; + devmem->pagemap.range.start = res->start; + devmem->pagemap.range.end = res->end; + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; + break; + case HMM_DMIRROR_MEMORY_DEVICE_COHERENT: + devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) ? + spm_addr_dev0 : + spm_addr_dev1; + devmem->pagemap.range.end = devmem->pagemap.range.start + + DEVMEM_CHUNK_SIZE - 1; + devmem->pagemap.type = MEMORY_DEVICE_COHERENT; + break; + default: + ret = -EINVAL; goto err_devmem; + } - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; - devmem->pagemap.range.start = res->start; - devmem->pagemap.range.end = res->end; devmem->pagemap.nr_range = 1; devmem->pagemap.ops = _devmem_ops; devmem->pagemap.owner = mdevice; @@ -497,10 +523,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, mdevice->devmem_capacity = new_capacity; mdevice->devmem_chunks = new_chunks; } - ptr = memremap_pages(>pagemap, numa_node_id()); - if (IS_ERR(ptr)) + if (IS_ERR_OR_NULL(ptr)) { + if (ptr) + ret = PTR_ERR(ptr); + else + ret = -EFAULT; goto err_release; + } devmem->mdevice = mdevice; pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT; @@ -529,15 +559,17 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, } spin_unlock(>lock); - return true; + return 0; err_release: mutex_unlock(>devmem_lock); - release_mem_region(devmem->pagemap.range.start, range_len(>pagemap.range)); + if (res && devmem->pagemap.type == MEMORY_DEVICE_PRIVATE) +
[Nouveau] [PATCH 20/27] lib: test_hmm add ioctl to get zone device type
From: Alex Sierra new ioctl cmd added to query zone device type. This will be used once the test_hmm adds zone device coherent type. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Poppple Signed-off-by: Christoph Hellwig --- lib/test_hmm.c | 23 +-- lib/test_hmm_uapi.h | 8 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index cfe63204783918..7a27584484ce0f 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -87,6 +87,7 @@ struct dmirror_chunk { struct dmirror_device { struct cdev cdevice; struct hmm_devmem *devmem; + unsigned intzone_device_type; unsigned intdevmem_capacity; unsigned intdevmem_count; @@ -1026,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror, return ret; } +static int dmirror_get_device_type(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + mutex_lock(>mutex); + cmd->zone_device_type = dmirror->mdevice->zone_device_type; + mutex_unlock(>mutex); + + return 0; +} static long dmirror_fops_unlocked_ioctl(struct file *filp, unsigned int command, unsigned long arg) @@ -1076,6 +1086,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp, ret = dmirror_snapshot(dmirror, ); break; + case HMM_DMIRROR_GET_MEM_DEV_TYPE: + ret = dmirror_get_device_type(dmirror, ); + break; default: return -EINVAL; } @@ -1260,14 +1273,20 @@ static void dmirror_device_remove(struct dmirror_device *mdevice) static int __init hmm_dmirror_init(void) { int ret; - int id; + int id = 0; + int ndevices = 0; ret = alloc_chrdev_region(_dev, 0, DMIRROR_NDEVICES, "HMM_DMIRROR"); if (ret) goto err_unreg; - for (id = 0; id < DMIRROR_NDEVICES; id++) { + memset(dmirror_devices, 0, DMIRROR_NDEVICES * sizeof(dmirror_devices[0])); + dmirror_devices[ndevices++].zone_device_type = + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE; + dmirror_devices[ndevices++].zone_device_type = + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE; + for (id = 0; id < ndevices; id++) { ret = dmirror_device_init(dmirror_devices + id, id); if (ret) goto err_chrdev; diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h index f14dea5dcd062b..17f842f1aa02c7 100644 --- a/lib/test_hmm_uapi.h +++ b/lib/test_hmm_uapi.h @@ -19,6 +19,7 @@ * @npages: (in) number of pages to read/write * @cpages: (out) number of pages copied * @faults: (out) number of device page faults seen + * @zone_device_type: (out) zone device memory type */ struct hmm_dmirror_cmd { __u64 addr; @@ -26,6 +27,7 @@ struct hmm_dmirror_cmd { __u64 npages; __u64 cpages; __u64 faults; + __u64 zone_device_type; }; /* Expose the address space of the calling process through hmm device file */ @@ -35,6 +37,7 @@ struct hmm_dmirror_cmd { #define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd) #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd) #define HMM_DMIRROR_CHECK_EXCLUSIVE_IOWR('H', 0x05, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_GET_MEM_DEV_TYPE _IOWR('H', 0x06, struct hmm_dmirror_cmd) /* * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT. @@ -62,4 +65,9 @@ enum { HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30, }; +enum { + /* 0 is reserved to catch uninitialized type fields */ + HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1, +}; + #endif /* _LIB_TEST_HMM_UAPI_H */ -- 2.30.2
[Nouveau] [PATCH 19/27] drm/amdkfd: coherent type as sys mem on migration to ram
From: Alex Sierra Coherent device type memory on VRAM to RAM migration, has similar access as System RAM from the CPU. This flag sets the source from the sender. Which in Coherent type case, should be set as MIGRATE_VMA_SELECT_DEVICE_COHERENT. Signed-off-by: Alex Sierra Reviewed-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 2c51f2ac3b46ac..6646291d75d574 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -659,9 +659,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange, migrate.vma = vma; migrate.start = start; migrate.end = end; - migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev); + if (adev->gmc.xgmi.connected_to_cpu) + migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT; + else + migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t); size *= npages; buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO); -- 2.30.2
[Nouveau] [PATCH 18/27] drm/amdkfd: add SPM support for SVM
From: Alex Sierra When CPU is connected throug XGMI, it has coherent access to VRAM resource. In this case that resource is taken from a table in the device gmc aperture base. This resource is used along with the device type, which could be DEVICE_PRIVATE or DEVICE_COHERENT to create the device page map region. Signed-off-by: Alex Sierra Reviewed-by: Felix Kuehling Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 28 ++-- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index e27ca375876230..2c51f2ac3b46ac 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -933,7 +933,7 @@ int svm_migrate_init(struct amdgpu_device *adev) { struct kfd_dev *kfddev = adev->kfd.dev; struct dev_pagemap *pgmap; - struct resource *res; + struct resource *res = NULL; unsigned long size; void *r; @@ -948,28 +948,34 @@ int svm_migrate_init(struct amdgpu_device *adev) * should remove reserved size */ size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20); - res = devm_request_free_mem_region(adev->dev, _resource, size); - if (IS_ERR(res)) - return -ENOMEM; + if (adev->gmc.xgmi.connected_to_cpu) { + pgmap->range.start = adev->gmc.aper_base; + pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 1; + pgmap->type = MEMORY_DEVICE_COHERENT; + } else { + res = devm_request_free_mem_region(adev->dev, _resource, size); + if (IS_ERR(res)) + return -ENOMEM; + pgmap->range.start = res->start; + pgmap->range.end = res->end; + pgmap->type = MEMORY_DEVICE_PRIVATE; + } - pgmap->type = MEMORY_DEVICE_PRIVATE; pgmap->nr_range = 1; - pgmap->range.start = res->start; - pgmap->range.end = res->end; pgmap->ops = _migrate_pgmap_ops; pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev); - pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; - + pgmap->flags = 0; /* Device manager releases device-specific resources, memory region and * pgmap when driver disconnects from device. */ r = devm_memremap_pages(adev->dev, pgmap); if (IS_ERR(r)) { pr_err("failed to register HMM device memory\n"); - /* Disable SVM support capability */ pgmap->type = 0; - devm_release_mem_region(adev->dev, res->start, resource_size(res)); + if (pgmap->type == MEMORY_DEVICE_PRIVATE) + devm_release_mem_region(adev->dev, res->start, + res->end - res->start + 1); return PTR_ERR(r); } -- 2.30.2
[Nouveau] [PATCH 16/27] mm: add device coherent vma selection for memory migration
From: Alex Sierra This case is used to migrate pages from device memory, back to system memory. Device coherent type memory is cache coherent from device and CPU point of view. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Poppple Signed-off-by: Christoph Hellwig --- include/linux/migrate.h | 1 + mm/migrate_device.c | 12 +--- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index db96e10eb8da22..66a34eae8cb635 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -130,6 +130,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn) enum migrate_vma_direction { MIGRATE_VMA_SELECT_SYSTEM = 1 << 0, MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1, + MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2, }; struct migrate_vma { diff --git a/mm/migrate_device.c b/mm/migrate_device.c index bfd66e7d830b02..0b295594e7626d 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -147,15 +147,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (is_writable_device_private_entry(entry)) mpfn |= MIGRATE_PFN_WRITE; } else { - if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) - goto next; pfn = pte_pfn(pte); - if (is_zero_pfn(pfn)) { + if (is_zero_pfn(pfn) && + (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) { mpfn = MIGRATE_PFN_MIGRATE; migrate->cpages++; goto next; } page = vm_normal_page(migrate->vma, addr, pte); + if (page && !is_zone_device_page(page) && + !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) + goto next; + else if (page && is_device_coherent_page(page) && + (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) || +page->pgmap->owner != migrate->pgmap_owner)) + goto next; mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE; mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0; } -- 2.30.2
[Nouveau] [PATCH 17/27] mm/gup: fail get_user_pages for LONGTERM dev coherent type
From: Alex Sierra Avoid long term pinning for Coherent device type pages. This could interfere with their own device memory manager. For now, we are just returning error for PIN_LONGTERM Coherent device type pages. Eventually, these type of pages will get migrated to system memory, once the device migration pages support is added. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Poppple [hch: rebased on previous cleanups, split the two checks] Signed-off-by: Christoph Hellwig --- mm/gup.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index 37d6c24ca71225..39b23ad39a7bde 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1881,6 +1881,19 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, continue; prev_head = head; + /* +* Device private pages will get faulted in during gup so it +* shouldn't be possible to see one here. +*/ + if (WARN_ON_ONCE(is_device_private_page(head))) { + ret = -EFAULT; + goto unpin_pages; + } + if (is_device_coherent_page(head)) { + ret = -EFAULT; + goto unpin_pages; + } + if (is_pinnable_page(head)) continue; @@ -1925,7 +1938,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, put_page(pages[i]); } - if (!list_empty(_page_list)) { + if (!ret && !list_empty(_page_list)) { struct migration_target_control mtc = { .nid = NUMA_NO_NODE, .gfp_mask = GFP_USER | __GFP_NOWARN, -- 2.30.2
[Nouveau] [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file
Split the code used to migrate to and from ZONE_DEVICE memory from migrate.c into a new file. Signed-off-by: Christoph Hellwig --- mm/Kconfig | 3 + mm/Makefile | 1 + mm/migrate.c| 753 --- mm/migrate_device.c | 765 4 files changed, 769 insertions(+), 753 deletions(-) create mode 100644 mm/migrate_device.c diff --git a/mm/Kconfig b/mm/Kconfig index a1901ae6d06293..6391d8d3a616f3 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -249,6 +249,9 @@ config MIGRATION pages as migration can relocate pages to satisfy a huge page allocation instead of reclaiming. +config DEVICE_MIGRATION + def_bool MIGRATION && DEVICE_PRIVATE + config ARCH_ENABLE_HUGEPAGE_MIGRATION bool diff --git a/mm/Makefile b/mm/Makefile index 70d4309c9ce338..4cc13f3179a518 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o +obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o diff --git a/mm/migrate.c b/mm/migrate.c index 746e1230886ddb..c31d04b46a5e17 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -38,12 +38,10 @@ #include #include #include -#include #include #include #include #include -#include #include #include #include @@ -2125,757 +2123,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_NUMA */ -#ifdef CONFIG_DEVICE_PRIVATE -static int migrate_vma_collect_skip(unsigned long start, - unsigned long end, - struct mm_walk *walk) -{ - struct migrate_vma *migrate = walk->private; - unsigned long addr; - - for (addr = start; addr < end; addr += PAGE_SIZE) { - migrate->dst[migrate->npages] = 0; - migrate->src[migrate->npages++] = 0; - } - - return 0; -} - -static int migrate_vma_collect_hole(unsigned long start, - unsigned long end, - __always_unused int depth, - struct mm_walk *walk) -{ - struct migrate_vma *migrate = walk->private; - unsigned long addr; - - /* Only allow populating anonymous memory. */ - if (!vma_is_anonymous(walk->vma)) - return migrate_vma_collect_skip(start, end, walk); - - for (addr = start; addr < end; addr += PAGE_SIZE) { - migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; - migrate->dst[migrate->npages] = 0; - migrate->npages++; - migrate->cpages++; - } - - return 0; -} - -static int migrate_vma_collect_pmd(pmd_t *pmdp, - unsigned long start, - unsigned long end, - struct mm_walk *walk) -{ - struct migrate_vma *migrate = walk->private; - struct vm_area_struct *vma = walk->vma; - struct mm_struct *mm = vma->vm_mm; - unsigned long addr = start, unmapped = 0; - spinlock_t *ptl; - pte_t *ptep; - -again: - if (pmd_none(*pmdp)) - return migrate_vma_collect_hole(start, end, -1, walk); - - if (pmd_trans_huge(*pmdp)) { - struct page *page; - - ptl = pmd_lock(mm, pmdp); - if (unlikely(!pmd_trans_huge(*pmdp))) { - spin_unlock(ptl); - goto again; - } - - page = pmd_page(*pmdp); - if (is_huge_zero_page(page)) { - spin_unlock(ptl); - split_huge_pmd(vma, pmdp, addr); - if (pmd_trans_unstable(pmdp)) - return migrate_vma_collect_skip(start, end, - walk); - } else { - int ret; - - get_page(page); - spin_unlock(ptl); - if (unlikely(!trylock_page(page))) - return migrate_vma_collect_skip(start, end, - walk); - ret = split_huge_page(page); - unlock_page(page); - put_page(page); - if (ret) - return migrate_vma_collect_skip(start, end, - walk); -
[Nouveau] [PATCH 14/27] mm: build migrate_vma_* for all configs with ZONE_DEVICE support
This code will be used for device coherent memory as well in a bit, so relax the ifdef a bit. Signed-off-by: Christoph Hellwig --- mm/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index 6391d8d3a616f3..95d4aa3acaefe0 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -250,7 +250,7 @@ config MIGRATION allocation instead of reclaiming. config DEVICE_MIGRATION - def_bool MIGRATION && DEVICE_PRIVATE + def_bool MIGRATION && ZONE_DEVICE config ARCH_ENABLE_HUGEPAGE_MIGRATION bool -- 2.30.2
[Nouveau] [PATCH 15/27] mm: add zone device coherent type memory support
From: Alex Sierra Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. Signed-off-by: Alex Sierra Acked-by: Felix Kuehling Reviewed-by: Alistair Popple [hch: rebased ontop of the refcount changes, removed is_dev_private_or_coherent_page] Signed-off-by: Christoph Hellwig --- include/linux/memremap.h | 14 ++ mm/memcontrol.c | 7 --- mm/memory-failure.c | 8 ++-- mm/memremap.c| 10 ++ mm/migrate_device.c | 16 +++- mm/rmap.c| 5 +++-- 6 files changed, 44 insertions(+), 16 deletions(-) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index d6a114dd5ea8b7..eb73630a49da39 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -41,6 +41,13 @@ struct vmem_altmap { * A more complete discussion of unaddressable memory may be found in * include/linux/hmm.h and Documentation/vm/hmm.rst. * + * MEMORY_DEVICE_COHERENT: + * Device memory that is cache coherent from device and CPU point of view. This + * is used on platforms that have an advanced system bus (like CAPI or CXL). A + * driver can hotplug the device memory using ZONE_DEVICE and with that memory + * type. Any page of a process can be migrated to such memory. However no one + * should be allowed to pin such memory so that it can always be evicted. + * * MEMORY_DEVICE_FS_DAX: * Host memory that has similar access semantics as System RAM i.e. DMA * coherent and supports page pinning. In support of coordinating page @@ -61,6 +68,7 @@ struct vmem_altmap { enum memory_type { /* 0 is reserved to catch uninitialized type fields */ MEMORY_DEVICE_PRIVATE = 1, + MEMORY_DEVICE_COHERENT, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, MEMORY_DEVICE_PCI_P2PDMA, @@ -138,6 +146,12 @@ static inline bool is_device_private_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PRIVATE; } +static inline bool is_device_coherent_page(const struct page *page) +{ + return is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_COHERENT; +} + static inline bool is_pci_p2pdma_page(const struct page *page) { return IS_ENABLED(CONFIG_PCI_P2PDMA) && diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 510cbfb82bb62a..10259c35fde20d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5687,8 +5687,8 @@ static int mem_cgroup_move_account(struct page *page, * 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a * target for charge migration. if @target is not NULL, the entry is stored * in target->ent. - * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE - * (so ZONE_DEVICE page and thus not on the lru). + * 3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is device memory and + * thus not on the lru. * For now we such page is charge like a regular page would be as for all * intent and purposes it is just special memory taking the place of a * regular page. @@ -5722,7 +5722,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma, */ if (page_memcg(page) == mc.from) { ret = MC_TARGET_PAGE; - if (is_device_private_page(page)) + if (is_device_private_page(page) || + is_device_coherent_page(page)) ret = MC_TARGET_DEVICE; if (target) target->page = page; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 97a9ed8f87a96a..f498ed3ece79ae 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1617,12 +1617,16 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, goto unlock; } - if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + switch (pgmap->type) { + case MEMORY_DEVICE_PRIVATE: + case MEMORY_DEVICE_COHERENT: /* -* TODO: Handle HMM pages which may need coordination +* TODO: Handle device pages which may need coordination * with device-side memory. */ goto unlock; + default: + break; } /* diff --git a/mm/memremap.c b/mm/memremap.c index e00ffcdba7b632..d00bb21a0630cd 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -313,6 +313,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) return ERR_PTR(-EINVAL); } break; + case MEMORY_DEVICE_COHERENT: +
[Nouveau] [PATCH 12/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_pages
Make the flow a little more clear and prepare for adding a new ZONE_DEVICE memory type. Signed-off-by: Christoph Hellwig --- mm/migrate.c | 27 --- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 30ecd7223656c1..746e1230886ddb 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2788,24 +2788,21 @@ void migrate_vma_pages(struct migrate_vma *migrate) mapping = page_mapping(page); - if (is_zone_device_page(newpage)) { - if (is_device_private_page(newpage)) { - /* -* For now only support private anonymous when -* migrating to un-addressable device memory. -*/ - if (mapping) { - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; - continue; - } - } else { - /* -* Other types of ZONE_DEVICE page are not -* supported. -*/ + if (is_device_private_page(newpage)) { + /* +* For now only support private anonymous when migrating +* to un-addressable device memory. +*/ + if (mapping) { migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; continue; } + } else if (is_zone_device_page(newpage)) { + /* +* Other types of ZONE_DEVICE page are not supported. +*/ + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; + continue; } r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY); -- 2.30.2
[Nouveau] [PATCH 11/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page
Make the flow a little more clear and prepare for adding a new ZONE_DEVICE memory type. Signed-off-by: Christoph Hellwig --- mm/migrate.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 8e0370a73f8a43..30ecd7223656c1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2670,26 +2670,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, */ __SetPageUptodate(page); - if (is_zone_device_page(page)) { - if (is_device_private_page(page)) { - swp_entry_t swp_entry; + if (is_device_private_page(page)) { + swp_entry_t swp_entry; - if (vma->vm_flags & VM_WRITE) - swp_entry = make_writable_device_private_entry( - page_to_pfn(page)); - else - swp_entry = make_readable_device_private_entry( - page_to_pfn(page)); - entry = swp_entry_to_pte(swp_entry); - } else { - /* -* For now we only support migrating to un-addressable -* device memory. -*/ + if (vma->vm_flags & VM_WRITE) + swp_entry = make_writable_device_private_entry( + page_to_pfn(page)); + else + swp_entry = make_readable_device_private_entry( + page_to_pfn(page)); + entry = swp_entry_to_pte(swp_entry); + } else { + /* +* For now we only support migrating to un-addressable device +* memory. +*/ + if (is_zone_device_page(page)) { pr_warn_once("Unsupported ZONE_DEVICE page type.\n"); goto abort; } - } else { entry = mk_pte(page, vma->vm_page_prot); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); -- 2.30.2
[Nouveau] [PATCH 10/27] mm: refactor check_and_migrate_movable_pages
Remove up to two levels of indentation by using continue statements and move variables to local scope where possible. Signed-off-by: Christoph Hellwig --- mm/gup.c | 81 ++-- 1 file changed, 44 insertions(+), 37 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index a9d4d724aef749..37d6c24ca71225 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1868,72 +1868,79 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages, unsigned int gup_flags) { - unsigned long i; - unsigned long isolation_error_count = 0; - bool drain_allow = true; - LIST_HEAD(movable_page_list); - long ret = 0; + unsigned long isolation_error_count = 0, i; struct page *prev_head = NULL; - struct page *head; - struct migration_target_control mtc = { - .nid = NUMA_NO_NODE, - .gfp_mask = GFP_USER | __GFP_NOWARN, - }; + LIST_HEAD(movable_page_list); + bool drain_allow = true; + int ret = 0; for (i = 0; i < nr_pages; i++) { - head = compound_head(pages[i]); + struct page *head = compound_head(pages[i]); + if (head == prev_head) continue; prev_head = head; + + if (is_pinnable_page(head)) + continue; + /* -* If we get a movable page, since we are going to be pinning -* these entries, try to move them out if possible. +* Try to move out any movable page before pinning the range. */ - if (!is_pinnable_page(head)) { - if (PageHuge(head)) { - if (!isolate_huge_page(head, _page_list)) - isolation_error_count++; - } else { - if (!PageLRU(head) && drain_allow) { - lru_add_drain_all(); - drain_allow = false; - } + if (PageHuge(head)) { + if (!isolate_huge_page(head, _page_list)) + isolation_error_count++; + continue; + } - if (isolate_lru_page(head)) { - isolation_error_count++; - continue; - } - list_add_tail(>lru, _page_list); - mod_node_page_state(page_pgdat(head), - NR_ISOLATED_ANON + - page_is_file_lru(head), - thp_nr_pages(head)); - } + if (!PageLRU(head) && drain_allow) { + lru_add_drain_all(); + drain_allow = false; + } + + if (isolate_lru_page(head)) { + isolation_error_count++; + continue; } + list_add_tail(>lru, _page_list); + mod_node_page_state(page_pgdat(head), + NR_ISOLATED_ANON + page_is_file_lru(head), + thp_nr_pages(head)); } + if (!list_empty(_page_list) || isolation_error_count) + goto unpin_pages; + /* * If list is empty, and no isolation errors, means that all pages are * in the correct zone. */ - if (list_empty(_page_list) && !isolation_error_count) - return nr_pages; + return nr_pages; +unpin_pages: if (gup_flags & FOLL_PIN) { unpin_user_pages(pages, nr_pages); } else { for (i = 0; i < nr_pages; i++) put_page(pages[i]); } + if (!list_empty(_page_list)) { + struct migration_target_control mtc = { + .nid = NUMA_NO_NODE, + .gfp_mask = GFP_USER | __GFP_NOWARN, + }; + ret = migrate_pages(_page_list, alloc_migration_target, NULL, (unsigned long), MIGRATE_SYNC, MR_LONGTERM_PIN, NULL); - if (ret && !list_empty(_page_list)) - putback_movable_pages(_page_list); + if (ret > 0) /* number of pages not migrated */ + ret = -ENOMEM; } - return ret > 0 ? -ENOMEM : ret; + if (ret && !list_empty(_page_list)) +
[Nouveau] [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure
Key off on the existence of ->page_free to prepare for adding support for more pgmap types that are device managed and thus need the free callback. Signed-off-by: Christoph Hellwig --- mm/memremap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index fef5734d5e4933..e00ffcdba7b632 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap); void free_zone_device_page(struct page *page) { - if (WARN_ON_ONCE(!is_device_private_page(page))) + if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free)) return; __ClearPageWaiters(page); @@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page) mem_cgroup_uncharge(page_folio(page)); /* -* When a device_private page is freed, the page->mapping field +* When a device managed page is freed, the page->mapping field * may still contain a (stale) mapping value. For example, the * lower bits of page->mapping may still identify the page as an * anonymous page. Ultimately, this entire field is just stale -- 2.30.2
[Nouveau] [PATCH 08/27] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX support, as one of the two is required at runtime for fsdax code to actually work. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe --- fs/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/Kconfig b/fs/Kconfig index e9433bbc48010a..7f2455e8e18ae2 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,6 +48,7 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) + depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX help -- 2.30.2
[Nouveau] [PATCH 07/27] mm: remove the extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell . Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Ralph Campbell Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Acked-by: Felix Kuehling --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - fs/Kconfig | 1 - include/linux/memremap.h | 12 +++-- include/linux/mm.h | 6 +-- lib/test_hmm.c | 1 - mm/Kconfig | 4 -- mm/internal.h| 2 + mm/memcontrol.c | 11 ++--- mm/memremap.c| 57 mm/migrate.c | 6 --- mm/swap.c| 16 ++- 13 files changed, 36 insertions(+), 83 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index e414ca44839fd1..8b6438fa18fc2b 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) dpage = pfn_to_page(uvmem_pfn); dpage->zone_device_data = pvt; - get_page(dpage); lock_page(dpage); return dpage; out_clear: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index cb835f95a76e66..e27ca375876230 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn) page = pfn_to_page(pfn); svm_range_bo_ref(prange->svm_bo); page->zone_device_data = prange->svm_bo; - get_page(page); lock_page(page); } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index a5cdfbe32b5e54..7ba66ad68a8a1e 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) return NULL; } - get_page(page); lock_page(page); return page; } diff --git a/fs/Kconfig b/fs/Kconfig index 6c7dc1387beb0f..e9433bbc48010a 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,7 +48,6 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) - select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED) select FS_IOMAP select DAX help diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 514ab46f597e5c..d6a114dd5ea8b7 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -68,9 +68,9 @@ enum memory_type { struct dev_pagemap_ops { /* -* Called once the page refcount reaches 1. (ZONE_DEVICE pages never -* reach 0 refcount unless there is a refcount bug. This allows the -* device driver to implement its own memory management.) +* Called once the page refcount reaches 0. The reference count will be +* reset to one by the core code after the method is called to prepare +* for handing out the page again. */ void (*page_free)(struct page *page); @@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) static inline bool is_device_private_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + return IS_ENABLED(CONFIG_DEVICE_PRIVATE) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PRIVATE; } static inline bool is_pci_p2pdma_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_PCI_P2PDMA) && + return IS_ENABLED(CONFIG_PCI_P2PDMA) && is_zone_device_page(page) &am
[Nouveau] [PATCH 06/27] mm: don't include in
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Dan Williams Acked-by: Felix Kuehling --- arch/arm64/mm/mmu.c| 1 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/drm_cache.c| 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + drivers/gpu/drm/nouveau/nouveau_svm.c | 1 + drivers/infiniband/core/rw.c | 1 + drivers/nvdimm/pmem.h | 1 + drivers/nvme/host/pci.c| 1 + drivers/nvme/target/io-cmd-bdev.c | 1 + fs/fuse/virtio_fs.c| 1 + include/linux/memremap.h | 18 ++ include/linux/mm.h | 20 lib/test_hmm.c | 1 + mm/memcontrol.c| 1 + mm/memremap.c | 6 +- 15 files changed, 35 insertions(+), 22 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index acfae9b41cc8c9..580abae6c0b93f 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index ea68f3b3a4e9cb..6d643b4b791d87 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -25,6 +25,7 @@ #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c index f19d9acbe95936..50b8a088f763a6 100644 --- a/drivers/gpu/drm/drm_cache.c +++ b/drivers/gpu/drm/drm_cache.c @@ -27,11 +27,11 @@ /* * Authors: Thomas Hellström */ - #include #include #include #include +#include #include #include diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index e886a3b9e08c7d..a5cdfbe32b5e54 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include #include /* diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 266809e511e2c1..090b9b47708cca 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include struct nouveau_svm { diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index 5a3bd41b331c93..4d98f931a13ddd 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -2,6 +2,7 @@ /* * Copyright (c) 2016 HGST, a Western Digital Company. */ +#include #include #include #include diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 59cfe13ea8a85c..1f51a23614299b 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -3,6 +3,7 @@ #define __NVDIMM_PMEM_H__ #include #include +#include #include #include #include diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 6a99ed68091589..ab15bc72710dbe 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 70ca9dfc1771a9..a141446db1bea3 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -6,6 +6,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include #include +#include #include #include "nvmet.h" diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 9d737904d07c0b..86b7dbb6a0d43e 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 1fafcc38acbad6..514ab46f597e5c 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -1,6 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_MEMREMAP_H_ #define _LINUX_MEMREMAP_H_ + +#include #include #include #include @@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) return 1 << pgmap->vmemmap_shift; } +static inline bool is_device_private_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PRIVATE; +} + +static inline bool is_pci_p2pdma_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_P
[Nouveau] [PATCH 05/27] mm: simplify freeing of devmap managed pages
Make put_devmap_managed_page return if it took charge of the page or not and remove the separate page_is_devmap_managed helper. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Dan Williams --- include/linux/mm.h | 34 ++ mm/memremap.c | 20 +--- mm/swap.c | 10 +- 3 files changed, 20 insertions(+), 44 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 91dd0bc786a9ec..26baadcef4556b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct page *page) #ifdef CONFIG_DEV_PAGEMAP_OPS DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool page_is_devmap_managed(struct page *page) +bool __put_devmap_managed_page(struct page *page); +static inline bool put_devmap_managed_page(struct page *page) { if (!static_branch_unlikely(_managed_key)) return false; if (!is_zone_device_page(page)) return false; - switch (page->pgmap->type) { - case MEMORY_DEVICE_PRIVATE: - case MEMORY_DEVICE_FS_DAX: - return true; - default: - break; - } - return false; + if (page->pgmap->type != MEMORY_DEVICE_PRIVATE && + page->pgmap->type != MEMORY_DEVICE_FS_DAX) + return false; + return __put_devmap_managed_page(page); } -void put_devmap_managed_page(struct page *page); - #else /* CONFIG_DEV_PAGEMAP_OPS */ -static inline bool page_is_devmap_managed(struct page *page) +static inline bool put_devmap_managed_page(struct page *page) { return false; } - -static inline void put_devmap_managed_page(struct page *page) -{ -} #endif /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool is_device_private_page(const struct page *page) @@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page) struct folio *folio = page_folio(page); /* -* For devmap managed pages we need to catch refcount transition from -* 2 to 1, when refcount reach one it means the page is free and we -* need to inform the device driver through callback. See -* include/linux/memremap.h and HMM for details. +* For some devmap managed pages we need to catch refcount transition +* from 2 to 1: */ - if (page_is_devmap_managed(>page)) { - put_devmap_managed_page(>page); + if (put_devmap_managed_page(>page)) return; - } - folio_put(folio); } diff --git a/mm/memremap.c b/mm/memremap.c index 55d23e9f5c04ec..f41233a67edb12 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page) page->pgmap->ops->page_free(page); } -void put_devmap_managed_page(struct page *page) +bool __put_devmap_managed_page(struct page *page) { - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - /* * devmap page refcounts are 1-based, rather than 0-based: if * refcount is 1, then the page is free and the refcount is * stable because nobody holds a reference on the page. */ - if (count == 1) + switch (page_ref_dec_return(page)) { + case 1: free_devmap_managed_page(page); - else if (!count) + break; + case 0: __put_page(page); + break; + } + return true; } -EXPORT_SYMBOL(put_devmap_managed_page); +EXPORT_SYMBOL(__put_devmap_managed_page); #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index 08058f74cae23e..25b55c56614311 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr) unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - /* -* ZONE_DEVICE pages that return 'false' from -* page_is_devmap_managed() do not require special -* processing, and instead, expect a call to -* put_page_testzero(). -*/ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); + if (put_devmap_managed_page(page)) continue; - } if (put_page_testzero(page)) put_dev_pagemap(page->pgmap); continue; -- 2.30.2
[Nouveau] [PATCH 04/27] mm: move free_devmap_managed_page to memremap.c
free_devmap_managed_page has nothing to do with the code in swap.c, move it to live with the rest of the code for devmap handling. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Muchun Song Reviewed-by: Dan Williams --- include/linux/mm.h | 1 - mm/memremap.c | 21 + mm/swap.c | 23 --- 3 files changed, 21 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7b46174989b086..91dd0bc786a9ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page *page) } #ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); static inline bool page_is_devmap_managed(struct page *page) diff --git a/mm/memremap.c b/mm/memremap.c index 5f04a0709e436e..55d23e9f5c04ec 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page) page->mapping = NULL; page->pgmap->ops->page_free(page); } + +void put_devmap_managed_page(struct page *page) +{ + int count; + + if (WARN_ON_ONCE(!page_is_devmap_managed(page))) + return; + + count = page_ref_dec_return(page); + + /* +* devmap page refcounts are 1-based, rather than 0-based: if +* refcount is 1, then the page is free and the refcount is +* stable because nobody holds a reference on the page. +*/ + if (count == 1) + free_devmap_managed_page(page); + else if (!count) + __put_page(page); +} +EXPORT_SYMBOL(put_devmap_managed_page); #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index bcf3ac288b56d5..08058f74cae23e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1153,26 +1153,3 @@ void __init swap_setup(void) * _really_ don't want to cluster much more */ } - -#ifdef CONFIG_DEV_PAGEMAP_OPS -void put_devmap_managed_page(struct page *page) -{ - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - - /* -* devmap page refcounts are 1-based, rather than 0-based: if -* refcount is 1, then the page is free and the refcount is -* stable because nobody holds a reference on the page. -*/ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); -} -EXPORT_SYMBOL(put_devmap_managed_page); -#endif -- 2.30.2
[Nouveau] [PATCH 02/27] mm: remove the __KERNEL__ guard from
__KERNEL__ ifdefs don't make sense outside of include/uapi/. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Muchun Song Reviewed-by: Dan Williams --- include/linux/mm.h | 4 1 file changed, 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b19223..7b46174989b086 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3,9 +3,6 @@ #define _LINUX_MM_H #include - -#ifdef __KERNEL__ - #include #include #include @@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start, } #endif -#endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ -- 2.30.2
[Nouveau] [PATCH 03/27] mm: remove pointless includes from
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 + drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + include/linux/hmm.h | 9 ++--- lib/test_hmm.c | 2 ++ 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index ed5385137f4831..cb835f95a76e66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_object.h" #include "amdgpu_vm.h" diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 3828aafd3ac46f..e886a3b9e08c7d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include /* * FIXME: this is ugly right now we are using TTM to allocate vram and we pin diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 2fd2e91d5107c0..d5a6f101f843e6 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -9,14 +9,9 @@ #ifndef LINUX_HMM_H #define LINUX_HMM_H -#include -#include +#include -#include -#include -#include -#include -#include +struct mmu_interval_notifier; /* * On output: diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 767538089a62e4..396beee6b061d4 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -26,6 +26,8 @@ #include #include #include +#include +#include #include "test_hmm_uapi.h" -- 2.30.2
[Nouveau] [PATCH 01/27] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove the superflous extra check. Signed-off-by: Christoph Hellwig Reviewed-by: Logan Gunthorpe Reviewed-by: Jason Gunthorpe Reviewed-by: Chaitanya Kulkarni Reviewed-by: Muchun Song Reviewed-by: Dan Williams --- mm/memremap.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index 6aa5f0c2d11fda..5f04a0709e436e 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } break; case MEMORY_DEVICE_FS_DAX: - if (!IS_ENABLED(CONFIG_ZONE_DEVICE) || - IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { WARN(1, "File system DAX not supported\n"); return ERR_PTR(-EINVAL); } -- 2.30.2
[Nouveau] start sorting out the ZONE_DEVICE refcount mess v2
Hi all, this series removes the offset by one refcount for ZONE_DEVICE pages that are freed back to the driver owning them, which is just device private ones for now, but also the planned device coherent pages and the ehanced p2p ones pending. It does not address the fsdax pages yet, which will be attacked in a follow on series. Note that if we want to get the p2p series rebased on top of this we'll need a git branch for this series. I could offer to host one. A git tree is available here: git://git.infradead.org/users/hch/misc.git pgmap-refcount Gitweb: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount Changes since v1: - add a missing memremap.h include in memcontrol.c - include rebased versions of the device coherent support and device coherent migration support series as well as additional cleanup patches Diffstt: arch/arm64/mm/mmu.c |1 arch/powerpc/kvm/book3s_hv_uvmem.c |1 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 35 - drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 drivers/gpu/drm/drm_cache.c |2 drivers/gpu/drm/nouveau/nouveau_dmem.c |3 drivers/gpu/drm/nouveau/nouveau_svm.c|1 drivers/infiniband/core/rw.c |1 drivers/nvdimm/pmem.h|1 drivers/nvme/host/pci.c |1 drivers/nvme/target/io-cmd-bdev.c|1 fs/Kconfig |2 fs/fuse/virtio_fs.c |1 include/linux/hmm.h |9 include/linux/memremap.h | 36 + include/linux/migrate.h |1 include/linux/mm.h | 59 -- lib/test_hmm.c | 353 ++--- lib/test_hmm_uapi.h | 22 mm/Kconfig |7 mm/Makefile |1 mm/gup.c | 127 +++- mm/internal.h|3 mm/memcontrol.c | 19 mm/memory-failure.c |8 mm/memremap.c| 75 +- mm/migrate.c | 763 mm/migrate_device.c | 822 +++ mm/rmap.c|5 mm/swap.c| 49 - tools/testing/selftests/vm/Makefile |2 tools/testing/selftests/vm/hmm-tests.c | 204 ++- tools/testing/selftests/vm/test_hmm.sh | 24 33 files changed, 1552 insertions(+), 1088 deletions(-)
Re: [Nouveau] [PATCH 6/8] mm: don't include in
On Thu, Feb 10, 2022 at 01:10:47PM +1100, Alistair Popple wrote: > diff --git a/mm/gup.c b/mm/gup.c > index cbb49abb7992..8e85c9fb8df4 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2007,7 +2007,6 @@ static long check_and_migrate_movable_pages(unsigned > long nr_pages, > if (!ret && list_empty(_page_list) && !isolation_error_count) > return nr_pages; > > - ret = 0; > unpin_pages: This isn't quite correct as ret is initially set to -EFAULT now. I'll fix it by removing the early ret initialization and always using the goto. I've also added another refactoring patch for this messy function. I've folded the inversion of the is_device_coherent_page check in migrate.c in as well, thanks!
Re: [Nouveau] [PATCH 6/8] mm: don't include in
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote: > > Am 2022-02-07 um 01:32 schrieb Christoph Hellwig: >> Move the check for the actual pgmap types that need the free at refcount >> one behavior into the out of line helper, and thus avoid the need to >> pull memremap.h into mm.h. >> >> Signed-off-by: Christoph Hellwig > > The amdkfd part looks good to me. > > It looks like this patch is not based on Alex Sierra's coherent memory > series. He added two new helpers is_device_coherent_page and > is_dev_private_or_coherent_page that would need to be moved along with > is_device_private_page and is_pci_p2pdma_page. FYI, here is a branch that contains a rebase of the coherent memory related patches on top of this series: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount I don't have a good way to test this, but I'll at least let the build bot finish before sending it out (probably tomorrow).
Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
On Wed, Feb 09, 2022 at 08:29:56AM -0400, Jason Gunthorpe wrote: > It is nice, but the other series are still impacted by the fsdax mess > - they still stuff pages into ptes without proper refcounts and have > to carry nonsense to dance around this problem. > > I certainly would be unhappy if the amd driver, for instance, gained > the fsdax problem as well and started pushing 4k pages into PMDs. As said before: I think this all needs to be fixed. But I'd rather fix it gradually and I think this series is a nice step forward. After that we can look at the pte mappings.
Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
On Tue, Feb 08, 2022 at 07:30:11PM -0800, Dan Williams wrote: > Interesting. I had expected that to really fix the refcount problem > that fs/dax.c would need to start taking real page references as pages > were added to a mapping, just like page cache. I think we should do that eventually. But I think this series that just attacks the device private type and extends to the device coherent and p2p enhacements is a good first step to stop the proliferation of the one off refcount and to allow to deal with the fsdax pages in another more focuessed series.
Re: [Nouveau] [PATCH 6/8] mm: don't include in
On Tue, Feb 08, 2022 at 03:53:14PM -0800, Dan Williams wrote: > Yeah, same as Logan: > > mm/memcontrol.c: In function ‘get_mctgt_type’: > mm/memcontrol.c:5724:29: error: implicit declaration of function > ‘is_device_private_page’; did you mean > ‘is_device_private_entry’? [-Werror=implicit-function-declaration] > 5724 | if (is_device_private_page(page)) > | ^~ > | is_device_private_entry > > ...needs: Yeah, the buildbot also complained. I've fixed this up locally now.
Re: [Nouveau] [PATCH 6/8] mm: don't include in
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote: > > Am 2022-02-07 um 01:32 schrieb Christoph Hellwig: >> Move the check for the actual pgmap types that need the free at refcount >> one behavior into the out of line helper, and thus avoid the need to >> pull memremap.h into mm.h. >> >> Signed-off-by: Christoph Hellwig > > The amdkfd part looks good to me. > > It looks like this patch is not based on Alex Sierra's coherent memory > series. He added two new helpers is_device_coherent_page and > is_dev_private_or_coherent_page that would need to be moved along with > is_device_private_page and is_pci_p2pdma_page. Yes. I Naked that series because it spreads te mess with the refcount further in this latest version. My intent is that it gets rebased on top of this to avoid that spread. Same for the p2p series form Logan.
[Nouveau] [PATCH 8/8] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX support, as one of the two is required at runtime for fsdax code to actually work. Signed-off-by: Christoph Hellwig --- fs/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/Kconfig b/fs/Kconfig index 05efea674bffa0..6e8818a5e53c45 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,6 +48,7 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) + depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX help -- 2.30.2
[Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell . Signed-off-by: Christoph Hellwig --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 - drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 - fs/Kconfig | 1 - include/linux/memremap.h | 12 +++-- include/linux/mm.h | 6 +-- lib/test_hmm.c | 1 - mm/Kconfig | 4 -- mm/internal.h| 2 + mm/memcontrol.c | 11 ++--- mm/memremap.c| 57 mm/migrate.c | 6 --- mm/swap.c| 16 ++- 13 files changed, 36 insertions(+), 83 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index e414ca44839fd1..8b6438fa18fc2b 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm) dpage = pfn_to_page(uvmem_pfn); dpage->zone_device_data = pvt; - get_page(dpage); lock_page(dpage); return dpage; out_clear: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index cb835f95a76e66..e27ca375876230 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn) page = pfn_to_page(pfn); svm_range_bo_ref(prange->svm_bo); page->zone_device_data = prange->svm_bo; - get_page(page); lock_page(page); } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index a5cdfbe32b5e54..7ba66ad68a8a1e 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) return NULL; } - get_page(page); lock_page(page); return page; } diff --git a/fs/Kconfig b/fs/Kconfig index 7a2b11c0b8036d..05efea674bffa0 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -48,7 +48,6 @@ config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU depends on !(ARM || MIPS || SPARC) - select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED) select FS_IOMAP select DAX help diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 514ab46f597e5c..d6a114dd5ea8b7 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -68,9 +68,9 @@ enum memory_type { struct dev_pagemap_ops { /* -* Called once the page refcount reaches 1. (ZONE_DEVICE pages never -* reach 0 refcount unless there is a refcount bug. This allows the -* device driver to implement its own memory management.) +* Called once the page refcount reaches 0. The reference count will be +* reset to one by the core code after the method is called to prepare +* for handing out the page again. */ void (*page_free)(struct page *page); @@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) static inline bool is_device_private_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + return IS_ENABLED(CONFIG_DEVICE_PRIVATE) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PRIVATE; } static inline bool is_pci_p2pdma_page(const struct page *page) { - return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && - IS_ENABLED(CONFIG_PCI_P2PDMA) && + return IS_ENABLED(CONFIG_PCI_P2PDMA) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } diff --git a/include/linux/mm.h b/include/linux/mm.h index
[Nouveau] [PATCH 6/8] mm: don't include in
Move the check for the actual pgmap types that need the free at refcount one behavior into the out of line helper, and thus avoid the need to pull memremap.h into mm.h. Signed-off-by: Christoph Hellwig --- arch/arm64/mm/mmu.c| 1 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/drm_cache.c| 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + drivers/gpu/drm/nouveau/nouveau_svm.c | 1 + drivers/infiniband/core/rw.c | 1 + drivers/nvdimm/pmem.h | 1 + drivers/nvme/host/pci.c| 1 + drivers/nvme/target/io-cmd-bdev.c | 1 + fs/fuse/virtio_fs.c| 1 + include/linux/memremap.h | 18 ++ include/linux/mm.h | 20 lib/test_hmm.c | 1 + mm/memremap.c | 6 +- 14 files changed, 34 insertions(+), 22 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index acfae9b41cc8c9..580abae6c0b93f 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index ea68f3b3a4e9cb..6d643b4b791d87 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -25,6 +25,7 @@ #include #include +#include #include #include #include diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c index f19d9acbe95936..50b8a088f763a6 100644 --- a/drivers/gpu/drm/drm_cache.c +++ b/drivers/gpu/drm/drm_cache.c @@ -27,11 +27,11 @@ /* * Authors: Thomas Hellström */ - #include #include #include #include +#include #include #include diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index e886a3b9e08c7d..a5cdfbe32b5e54 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include #include /* diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 266809e511e2c1..090b9b47708cca 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -35,6 +35,7 @@ #include #include #include +#include #include struct nouveau_svm { diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index 5a3bd41b331c93..4d98f931a13ddd 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -2,6 +2,7 @@ /* * Copyright (c) 2016 HGST, a Western Digital Company. */ +#include #include #include #include diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 59cfe13ea8a85c..1f51a23614299b 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -3,6 +3,7 @@ #define __NVDIMM_PMEM_H__ #include #include +#include #include #include #include diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 6a99ed68091589..ab15bc72710dbe 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 70ca9dfc1771a9..a141446db1bea3 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -6,6 +6,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include #include +#include #include #include "nvmet.h" diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 9d737904d07c0b..86b7dbb6a0d43e 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 1fafcc38acbad6..514ab46f597e5c 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -1,6 +1,8 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_MEMREMAP_H_ #define _LINUX_MEMREMAP_H_ + +#include #include #include #include @@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap) return 1 << pgmap->vmemmap_shift; } +static inline bool is_device_private_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PRIVATE; +} + +static inline bool is_pci_p2pdma_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_PCI_P2PDMA) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PC
[Nouveau] [PATCH 5/8] mm: simplify freeing of devmap managed pages
Make put_devmap_managed_page return if it took charge of the page or not and remove the separate page_is_devmap_managed helper. Signed-off-by: Christoph Hellwig --- include/linux/mm.h | 34 ++ mm/memremap.c | 20 +--- mm/swap.c | 10 +- 3 files changed, 20 insertions(+), 44 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 91dd0bc786a9ec..26baadcef4556b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct page *page) #ifdef CONFIG_DEV_PAGEMAP_OPS DECLARE_STATIC_KEY_FALSE(devmap_managed_key); -static inline bool page_is_devmap_managed(struct page *page) +bool __put_devmap_managed_page(struct page *page); +static inline bool put_devmap_managed_page(struct page *page) { if (!static_branch_unlikely(_managed_key)) return false; if (!is_zone_device_page(page)) return false; - switch (page->pgmap->type) { - case MEMORY_DEVICE_PRIVATE: - case MEMORY_DEVICE_FS_DAX: - return true; - default: - break; - } - return false; + if (page->pgmap->type != MEMORY_DEVICE_PRIVATE && + page->pgmap->type != MEMORY_DEVICE_FS_DAX) + return false; + return __put_devmap_managed_page(page); } -void put_devmap_managed_page(struct page *page); - #else /* CONFIG_DEV_PAGEMAP_OPS */ -static inline bool page_is_devmap_managed(struct page *page) +static inline bool put_devmap_managed_page(struct page *page) { return false; } - -static inline void put_devmap_managed_page(struct page *page) -{ -} #endif /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool is_device_private_page(const struct page *page) @@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page) struct folio *folio = page_folio(page); /* -* For devmap managed pages we need to catch refcount transition from -* 2 to 1, when refcount reach one it means the page is free and we -* need to inform the device driver through callback. See -* include/linux/memremap.h and HMM for details. +* For some devmap managed pages we need to catch refcount transition +* from 2 to 1: */ - if (page_is_devmap_managed(>page)) { - put_devmap_managed_page(>page); + if (put_devmap_managed_page(>page)) return; - } - folio_put(folio); } diff --git a/mm/memremap.c b/mm/memremap.c index 55d23e9f5c04ec..f41233a67edb12 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page) page->pgmap->ops->page_free(page); } -void put_devmap_managed_page(struct page *page) +bool __put_devmap_managed_page(struct page *page) { - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - /* * devmap page refcounts are 1-based, rather than 0-based: if * refcount is 1, then the page is free and the refcount is * stable because nobody holds a reference on the page. */ - if (count == 1) + switch (page_ref_dec_return(page)) { + case 1: free_devmap_managed_page(page); - else if (!count) + break; + case 0: __put_page(page); + break; + } + return true; } -EXPORT_SYMBOL(put_devmap_managed_page); +EXPORT_SYMBOL(__put_devmap_managed_page); #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index 08058f74cae23e..25b55c56614311 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr) unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - /* -* ZONE_DEVICE pages that return 'false' from -* page_is_devmap_managed() do not require special -* processing, and instead, expect a call to -* put_page_testzero(). -*/ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); + if (put_devmap_managed_page(page)) continue; - } if (put_page_testzero(page)) put_dev_pagemap(page->pgmap); continue; -- 2.30.2
[Nouveau] [PATCH 4/8] mm: move free_devmap_managed_page to memremap.c
free_devmap_managed_page has nothing to do with the code in swap.c, move it to live with the rest of the code for devmap handling. Signed-off-by: Christoph Hellwig --- include/linux/mm.h | 1 - mm/memremap.c | 21 + mm/swap.c | 23 --- 3 files changed, 21 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7b46174989b086..91dd0bc786a9ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page *page) } #ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page); DECLARE_STATIC_KEY_FALSE(devmap_managed_key); static inline bool page_is_devmap_managed(struct page *page) diff --git a/mm/memremap.c b/mm/memremap.c index 5f04a0709e436e..55d23e9f5c04ec 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page) page->mapping = NULL; page->pgmap->ops->page_free(page); } + +void put_devmap_managed_page(struct page *page) +{ + int count; + + if (WARN_ON_ONCE(!page_is_devmap_managed(page))) + return; + + count = page_ref_dec_return(page); + + /* +* devmap page refcounts are 1-based, rather than 0-based: if +* refcount is 1, then the page is free and the refcount is +* stable because nobody holds a reference on the page. +*/ + if (count == 1) + free_devmap_managed_page(page); + else if (!count) + __put_page(page); +} +EXPORT_SYMBOL(put_devmap_managed_page); #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/swap.c b/mm/swap.c index bcf3ac288b56d5..08058f74cae23e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1153,26 +1153,3 @@ void __init swap_setup(void) * _really_ don't want to cluster much more */ } - -#ifdef CONFIG_DEV_PAGEMAP_OPS -void put_devmap_managed_page(struct page *page) -{ - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - - /* -* devmap page refcounts are 1-based, rather than 0-based: if -* refcount is 1, then the page is free and the refcount is -* stable because nobody holds a reference on the page. -*/ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); -} -EXPORT_SYMBOL(put_devmap_managed_page); -#endif -- 2.30.2
[Nouveau] [PATCH 3/8] mm: remove pointless includes from
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 + drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + include/linux/hmm.h | 9 ++--- lib/test_hmm.c | 2 ++ 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index ed5385137f4831..cb835f95a76e66 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "amdgpu_sync.h" #include "amdgpu_object.h" #include "amdgpu_vm.h" diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 3828aafd3ac46f..e886a3b9e08c7d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -39,6 +39,7 @@ #include #include +#include /* * FIXME: this is ugly right now we are using TTM to allocate vram and we pin diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 2fd2e91d5107c0..d5a6f101f843e6 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -9,14 +9,9 @@ #ifndef LINUX_HMM_H #define LINUX_HMM_H -#include -#include +#include -#include -#include -#include -#include -#include +struct mmu_interval_notifier; /* * On output: diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 767538089a62e4..396beee6b061d4 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -26,6 +26,8 @@ #include #include #include +#include +#include #include "test_hmm_uapi.h" -- 2.30.2
[Nouveau] [PATCH 2/8] mm: remove the __KERNEL__ guard from
__KERNEL__ ifdefs don't make sense outside of include/uapi/. Signed-off-by: Christoph Hellwig --- include/linux/mm.h | 4 1 file changed, 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b19223..7b46174989b086 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3,9 +3,6 @@ #define _LINUX_MM_H #include - -#ifdef __KERNEL__ - #include #include #include @@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start, } #endif -#endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ -- 2.30.2
[Nouveau] start sorting out the ZONE_DEVICE refcount mess
Hi all, this series removes the offset by one refcount for ZONE_DEVICE pages that are freed back to the driver owning them, which is just device private ones for now, but also the planned device coherent pages and the ehanced p2p ones pending. It does not address the fsdax pages yet, which will be attacked in a follow on series. Diffstat: arch/arm64/mm/mmu.c |1 arch/powerpc/kvm/book3s_hv_uvmem.c |1 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |2 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 drivers/gpu/drm/drm_cache.c |2 drivers/gpu/drm/nouveau/nouveau_dmem.c |3 - drivers/gpu/drm/nouveau/nouveau_svm.c|1 drivers/infiniband/core/rw.c |1 drivers/nvdimm/pmem.h|1 drivers/nvme/host/pci.c |1 drivers/nvme/target/io-cmd-bdev.c|1 fs/Kconfig |2 fs/fuse/virtio_fs.c |1 include/linux/hmm.h |9 include/linux/memremap.h | 22 +- include/linux/mm.h | 59 - lib/test_hmm.c |4 + mm/Kconfig |4 - mm/internal.h|2 mm/memcontrol.c | 11 + mm/memremap.c| 63 --- mm/migrate.c |6 -- mm/swap.c| 49 ++-- 23 files changed, 90 insertions(+), 157 deletions(-)
[Nouveau] [PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove the superflous extra check. Signed-off-by: Christoph Hellwig --- mm/memremap.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/memremap.c b/mm/memremap.c index 6aa5f0c2d11fda..5f04a0709e436e 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } break; case MEMORY_DEVICE_FS_DAX: - if (!IS_ENABLED(CONFIG_ZONE_DEVICE) || - IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) { WARN(1, "File system DAX not supported\n"); return ERR_PTR(-EINVAL); } -- 2.30.2
[Nouveau] [PATCH 7/7] vgaarb: don't pass a cookie to vga_client_register
The VGA arbitration is entirely based on pci_dev structures, so just pass that back to the set_vga_decode callback. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 drivers/gpu/drm/i915/display/intel_vga.c | 7 --- drivers/gpu/drm/nouveau/nouveau_vga.c | 6 +++--- drivers/gpu/drm/radeon/radeon_device.c | 9 drivers/gpu/vga/vgaarb.c | 24 +- drivers/vfio/pci/vfio_pci.c| 9 include/linux/vgaarb.h | 10 - 7 files changed, 36 insertions(+), 38 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index e433fab6bcf6..8398daa0c06a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -1266,15 +1266,16 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev) /** * amdgpu_device_vga_set_decode - enable/disable vga decode * - * @cookie: amdgpu_device pointer + * @pdev: PCI device pointer * @state: enable/disable vga decode * * Enable/disable vga decode (all asics). * Returns VGA resource flags. */ -static unsigned int amdgpu_device_vga_set_decode(void *cookie, bool state) +static unsigned int amdgpu_device_vga_set_decode(struct pci_dev *pdev, + bool state) { - struct amdgpu_device *adev = cookie; + struct amdgpu_device *adev = drm_to_adev(pci_get_drvdata(pdev)); amdgpu_asic_set_vga_state(adev, state); if (state) return VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM | @@ -3715,7 +3716,7 @@ int amdgpu_device_init(struct amdgpu_device *adev, /* this will fail for cards that aren't VGA class devices, just * ignore it */ if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA) - vga_client_register(adev->pdev, adev, amdgpu_device_vga_set_decode); + vga_client_register(adev->pdev, amdgpu_device_vga_set_decode); if (amdgpu_device_supports_px(ddev)) { px = true; diff --git a/drivers/gpu/drm/i915/display/intel_vga.c b/drivers/gpu/drm/i915/display/intel_vga.c index 0222719e0824..16c250700985 100644 --- a/drivers/gpu/drm/i915/display/intel_vga.c +++ b/drivers/gpu/drm/i915/display/intel_vga.c @@ -121,9 +121,9 @@ intel_vga_set_state(struct drm_i915_private *i915, bool enable_decode) } static unsigned int -intel_vga_set_decode(void *cookie, bool enable_decode) +intel_vga_set_decode(struct pci_dev *pdev, bool enable_decode) { - struct drm_i915_private *i915 = cookie; + struct drm_i915_private *i915 = pdev_to_i915(pdev); intel_vga_set_state(i915, enable_decode); @@ -136,6 +136,7 @@ intel_vga_set_decode(void *cookie, bool enable_decode) int intel_vga_register(struct drm_i915_private *i915) { + struct pci_dev *pdev = to_pci_dev(i915->drm.dev); int ret; @@ -147,7 +148,7 @@ int intel_vga_register(struct drm_i915_private *i915) * then we do not take part in VGA arbitration and the * vga_client_register() fails with -ENODEV. */ - ret = vga_client_register(pdev, i915, intel_vga_set_decode); + ret = vga_client_register(pdev, intel_vga_set_decode); if (ret && ret != -ENODEV) return ret; diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c b/drivers/gpu/drm/nouveau/nouveau_vga.c index d071c11249a3..60cd8c0463df 100644 --- a/drivers/gpu/drm/nouveau/nouveau_vga.c +++ b/drivers/gpu/drm/nouveau/nouveau_vga.c @@ -11,9 +11,9 @@ #include "nouveau_vga.h" static unsigned int -nouveau_vga_set_decode(void *priv, bool state) +nouveau_vga_set_decode(struct pci_dev *pdev, bool state) { - struct nouveau_drm *drm = nouveau_drm(priv); + struct nouveau_drm *drm = nouveau_drm(pci_get_drvdata(pdev)); struct nvif_object *device = >client.device.object; if (drm->client.device.info.family == NV_DEVICE_INFO_V0_CURIE && @@ -94,7 +94,7 @@ nouveau_vga_init(struct nouveau_drm *drm) return; pdev = to_pci_dev(dev->dev); - vga_client_register(pdev, dev, nouveau_vga_set_decode); + vga_client_register(pdev, nouveau_vga_set_decode); /* don't register Thunderbolt eGPU with vga_switcheroo */ if (pci_is_thunderbolt_attached(pdev)) diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 11e8e42d99b3..cec03238e14d 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1067,15 +1067,16 @@ void radeon_combios_fini(struct radeon_device *rdev) /** * radeon_vga_set_decode - enable/disable vga decode * - * @cookie: radeon_device pointer + * @pdev: PCI device * @state: enable/disable vga decode * * Enable/disable vga decode (all asics). * Returns VGA resource flags. */ -static unsigned int radeon_v
[Nouveau] [PATCH 6/7] vgaarb: remove the unused irq_set_state argument to vga_client_register
All callers pass NULL as the irq_set_state argument, so remove it and the ->irq_set_state member in struct vga_device. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- drivers/gpu/drm/i915/display/intel_vga.c | 2 +- drivers/gpu/drm/nouveau/nouveau_vga.c | 2 +- drivers/gpu/drm/radeon/radeon_device.c | 2 +- drivers/gpu/vga/vgaarb.c | 23 +- drivers/vfio/pci/vfio_pci.c| 2 +- include/linux/vgaarb.h | 4 +--- 7 files changed, 7 insertions(+), 30 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 53afe0198e52..e433fab6bcf6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3715,7 +3715,7 @@ int amdgpu_device_init(struct amdgpu_device *adev, /* this will fail for cards that aren't VGA class devices, just * ignore it */ if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA) - vga_client_register(adev->pdev, adev, NULL, amdgpu_device_vga_set_decode); + vga_client_register(adev->pdev, adev, amdgpu_device_vga_set_decode); if (amdgpu_device_supports_px(ddev)) { px = true; diff --git a/drivers/gpu/drm/i915/display/intel_vga.c b/drivers/gpu/drm/i915/display/intel_vga.c index 833f9ec14493..0222719e0824 100644 --- a/drivers/gpu/drm/i915/display/intel_vga.c +++ b/drivers/gpu/drm/i915/display/intel_vga.c @@ -147,7 +147,7 @@ int intel_vga_register(struct drm_i915_private *i915) * then we do not take part in VGA arbitration and the * vga_client_register() fails with -ENODEV. */ - ret = vga_client_register(pdev, i915, NULL, intel_vga_set_decode); + ret = vga_client_register(pdev, i915, intel_vga_set_decode); if (ret && ret != -ENODEV) return ret; diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c b/drivers/gpu/drm/nouveau/nouveau_vga.c index de7a3a860139..d071c11249a3 100644 --- a/drivers/gpu/drm/nouveau/nouveau_vga.c +++ b/drivers/gpu/drm/nouveau/nouveau_vga.c @@ -94,7 +94,7 @@ nouveau_vga_init(struct nouveau_drm *drm) return; pdev = to_pci_dev(dev->dev); - vga_client_register(pdev, dev, NULL, nouveau_vga_set_decode); + vga_client_register(pdev, dev, nouveau_vga_set_decode); /* don't register Thunderbolt eGPU with vga_switcheroo */ if (pci_is_thunderbolt_attached(pdev)) diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index d781914f8bcb..11e8e42d99b3 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1434,7 +1434,7 @@ int radeon_device_init(struct radeon_device *rdev, /* if we have > 1 VGA cards, then disable the radeon VGA resources */ /* this will fail for cards that aren't VGA class devices, just * ignore it */ - vga_client_register(rdev->pdev, rdev, NULL, radeon_vga_set_decode); + vga_client_register(rdev->pdev, rdev, radeon_vga_set_decode); if (rdev->flags & RADEON_IS_PX) runtime = true; diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c index 85b765b80abf..4bde017f6f22 100644 --- a/drivers/gpu/vga/vgaarb.c +++ b/drivers/gpu/vga/vgaarb.c @@ -72,9 +72,7 @@ struct vga_device { unsigned int io_norm_cnt; /* normal IO count */ unsigned int mem_norm_cnt; /* normal MEM count */ bool bridge_has_one_vga; - /* allow IRQ enable/disable hook */ void *cookie; - void (*irq_set_state)(void *cookie, bool enable); unsigned int (*set_vga_decode)(void *cookie, bool decode); }; @@ -218,13 +216,6 @@ int vga_remove_vgacon(struct pci_dev *pdev) #endif EXPORT_SYMBOL(vga_remove_vgacon); -static inline void vga_irq_set_state(struct vga_device *vgadev, bool state) -{ - if (vgadev->irq_set_state) - vgadev->irq_set_state(vgadev->cookie, state); -} - - /* If we don't ever use VGA arb we should avoid turning off anything anywhere due to old X servers getting confused about the boot device not being VGA */ @@ -325,10 +316,8 @@ static struct vga_device *__vga_tryget(struct vga_device *vgadev, if ((match & conflict->decodes) & VGA_RSRC_LEGACY_IO) pci_bits |= PCI_COMMAND_IO; - if (pci_bits) { - vga_irq_set_state(conflict, false); + if (pci_bits) flags |= PCI_VGA_STATE_CHANGE_DECODES; - } } if (change_bridge) @@ -365,9 +354,6 @@ static struct vga_device *__vga_tryget(struct vga_device *vgadev, pci_set_vga_state(vgadev->
[Nouveau] [PATCH 5/7] vgaarb: provide a vga_client_unregister wrapper
Add a trivial wrapper for the unregister case that sets all fields to NULL. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- drivers/gpu/drm/drm_irq.c | 4 ++-- drivers/gpu/drm/i915/display/intel_vga.c | 2 +- drivers/gpu/drm/nouveau/nouveau_vga.c | 2 +- drivers/gpu/drm/radeon/radeon_device.c | 2 +- drivers/gpu/vga/vgaarb.c | 3 +-- drivers/vfio/pci/vfio_pci.c| 2 +- include/linux/vgaarb.h | 5 + 8 files changed, 13 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index d303e88e3c23..53afe0198e52 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3838,7 +3838,7 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev) vga_switcheroo_fini_domain_pm_ops(adev->dev); } if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA) - vga_client_register(adev->pdev, NULL, NULL, NULL); + vga_client_unregister(adev->pdev); if (IS_ENABLED(CONFIG_PERF_EVENTS)) amdgpu_pmu_fini(adev); diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c index c3bd664ea733..c87b0fb384e4 100644 --- a/drivers/gpu/drm/drm_irq.c +++ b/drivers/gpu/drm/drm_irq.c @@ -140,7 +140,7 @@ int drm_irq_install(struct drm_device *dev, int irq) if (ret < 0) { dev->irq_enabled = false; if (drm_core_check_feature(dev, DRIVER_LEGACY)) - vga_client_register(to_pci_dev(dev->dev), NULL, NULL, NULL); + vga_client_unregister(to_pci_dev(dev->dev)); free_irq(irq, dev); } else { dev->irq = irq; @@ -203,7 +203,7 @@ int drm_irq_uninstall(struct drm_device *dev) DRM_DEBUG("irq=%d\n", dev->irq); if (drm_core_check_feature(dev, DRIVER_LEGACY)) - vga_client_register(to_pci_dev(dev->dev), NULL, NULL, NULL); + vga_client_unregister(to_pci_dev(dev->dev)); if (dev->driver->irq_uninstall) dev->driver->irq_uninstall(dev); diff --git a/drivers/gpu/drm/i915/display/intel_vga.c b/drivers/gpu/drm/i915/display/intel_vga.c index f002b82ba9c0..833f9ec14493 100644 --- a/drivers/gpu/drm/i915/display/intel_vga.c +++ b/drivers/gpu/drm/i915/display/intel_vga.c @@ -158,5 +158,5 @@ void intel_vga_unregister(struct drm_i915_private *i915) { struct pci_dev *pdev = to_pci_dev(i915->drm.dev); - vga_client_register(pdev, NULL, NULL, NULL); + vga_client_unregister(pdev); } diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c b/drivers/gpu/drm/nouveau/nouveau_vga.c index 7c4b374b3eca..de7a3a860139 100644 --- a/drivers/gpu/drm/nouveau/nouveau_vga.c +++ b/drivers/gpu/drm/nouveau/nouveau_vga.c @@ -118,7 +118,7 @@ nouveau_vga_fini(struct nouveau_drm *drm) return; pdev = to_pci_dev(dev->dev); - vga_client_register(pdev, NULL, NULL, NULL); + vga_client_unregister(pdev); if (pci_is_thunderbolt_attached(pdev)) return; diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 46eea01950cb..d781914f8bcb 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1530,7 +1530,7 @@ void radeon_device_fini(struct radeon_device *rdev) vga_switcheroo_unregister_client(rdev->pdev); if (rdev->flags & RADEON_IS_PX) vga_switcheroo_fini_domain_pm_ops(rdev->dev); - vga_client_register(rdev->pdev, NULL, NULL, NULL); + vga_client_unregister(rdev->pdev); if (rdev->rio_mem) pci_iounmap(rdev->pdev, rdev->rio_mem); rdev->rio_mem = NULL; diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c index 3ed3734f66d9..85b765b80abf 100644 --- a/drivers/gpu/vga/vgaarb.c +++ b/drivers/gpu/vga/vgaarb.c @@ -877,8 +877,7 @@ EXPORT_SYMBOL(vga_set_legacy_decoding); * This function does not check whether a client for @pdev has been registered * already. * - * To unregister just call this function with @irq_set_state and @set_vga_decode - * both set to NULL for the same @pdev as originally used to register them. + * To unregister just call vga_client_unregister(). * * Returns: 0 on success, -1 on failure */ diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 318864d52837..47d13a1fb7fb 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1967,7 +1967,7 @@ static void vfio_pci_vga_uninit(struct vfio_pci_device *vdev) if (!vfio_pci_is_vga(pdev)) return; - vga_client_register(pdev, NULL, NULL, NULL); + vga
[Nouveau] [PATCH 4/7] vgaarb: cleanup vgaarb.h
Merge the different CONFIG_VGA_ARB ifdef blocks, remove superflous externs, and regularize the stubs for !CONFIG_VGA_ARB. Signed-off-by: Christoph Hellwig --- include/linux/vgaarb.h | 90 -- 1 file changed, 42 insertions(+), 48 deletions(-) diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h index fdce9007d57e..05171fc7e26a 100644 --- a/include/linux/vgaarb.h +++ b/include/linux/vgaarb.h @@ -33,6 +33,8 @@ #include +struct pci_dev; + /* Legacy VGA regions */ #define VGA_RSRC_NONE 0x00 #define VGA_RSRC_LEGACY_IO 0x01 @@ -42,23 +44,47 @@ #define VGA_RSRC_NORMAL_IO 0x04 #define VGA_RSRC_NORMAL_MEM0x08 -struct pci_dev; - -/* For use by clients */ - -#if defined(CONFIG_VGA_ARB) -extern void vga_set_legacy_decoding(struct pci_dev *pdev, - unsigned int decodes); -#else +#ifdef CONFIG_VGA_ARB +void vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes); +int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible); +void vga_put(struct pci_dev *pdev, unsigned int rsrc); +struct pci_dev *vga_default_device(void); +void vga_set_default_device(struct pci_dev *pdev); +int vga_remove_vgacon(struct pci_dev *pdev); +int vga_client_register(struct pci_dev *pdev, void *cookie, + void (*irq_set_state)(void *cookie, bool state), + unsigned int (*set_vga_decode)(void *cookie, bool state)); +#else /* CONFIG_VGA_ARB */ static inline void vga_set_legacy_decoding(struct pci_dev *pdev, - unsigned int decodes) { }; -#endif - -#if defined(CONFIG_VGA_ARB) -extern int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible); -#else -static inline int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible) { return 0; } -#endif + unsigned int decodes) +{ +}; +static inline int vga_get(struct pci_dev *pdev, unsigned int rsrc, + int interruptible) +{ + return 0; +} +static inline void vga_put(struct pci_dev *pdev, unsigned int rsrc) +{ +} +static inline struct pci_dev *vga_default_device(void) +{ + return NULL; +} +static inline void vga_set_default_device(struct pci_dev *pdev) +{ +} +static inline int vga_remove_vgacon(struct pci_dev *pdev) +{ + return 0; +} +static inline int vga_client_register(struct pci_dev *pdev, void *cookie, + void (*irq_set_state)(void *cookie, bool state), + unsigned int (*set_vga_decode)(void *cookie, bool state)) +{ + return 0; +} +#endif /* CONFIG_VGA_ARB */ /** * vga_get_interruptible @@ -90,36 +116,4 @@ static inline int vga_get_uninterruptible(struct pci_dev *pdev, return vga_get(pdev, rsrc, 0); } -#if defined(CONFIG_VGA_ARB) -extern void vga_put(struct pci_dev *pdev, unsigned int rsrc); -#else -static inline void vga_put(struct pci_dev *pdev, unsigned int rsrc) -{ -} -#endif - - -#ifdef CONFIG_VGA_ARB -extern struct pci_dev *vga_default_device(void); -extern void vga_set_default_device(struct pci_dev *pdev); -extern int vga_remove_vgacon(struct pci_dev *pdev); -#else -static inline struct pci_dev *vga_default_device(void) { return NULL; } -static inline void vga_set_default_device(struct pci_dev *pdev) { } -static inline int vga_remove_vgacon(struct pci_dev *pdev) { return 0; } -#endif - -#if defined(CONFIG_VGA_ARB) -int vga_client_register(struct pci_dev *pdev, void *cookie, - void (*irq_set_state)(void *cookie, bool state), - unsigned int (*set_vga_decode)(void *cookie, bool state)); -#else -static inline int vga_client_register(struct pci_dev *pdev, void *cookie, - void (*irq_set_state)(void *cookie, bool state), - unsigned int (*set_vga_decode)(void *cookie, bool state)) -{ - return 0; -} -#endif - #endif /* LINUX_VGA_H */ -- 2.30.2 ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [PATCH 3/7] vgaarb: move the kerneldoc for vga_set_legacy_decoding to vgaarb.c
Kerneldoc comments should be at the implementation side, not in the header just declaring the prototype. Signed-off-by: Christoph Hellwig --- drivers/gpu/vga/vgaarb.c | 11 +++ include/linux/vgaarb.h | 13 - 2 files changed, 11 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c index fccc7ef5153a..3ed3734f66d9 100644 --- a/drivers/gpu/vga/vgaarb.c +++ b/drivers/gpu/vga/vgaarb.c @@ -834,6 +834,17 @@ static void __vga_set_legacy_decoding(struct pci_dev *pdev, spin_unlock_irqrestore(_lock, flags); } +/** + * vga_set_legacy_decoding + * @pdev: pci device of the VGA card + * @decodes: bit mask of what legacy regions the card decodes + * + * Indicates to the arbiter if the card decodes legacy VGA IOs, legacy VGA + * Memory, both, or none. All cards default to both, the card driver (fbdev for + * example) should tell the arbiter if it has disabled legacy decoding, so the + * card can be left out of the arbitration process (and can be safe to take + * interrupts at any time. + */ void vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes) { __vga_set_legacy_decoding(pdev, decodes, false); diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h index ca5160218538..fdce9007d57e 100644 --- a/include/linux/vgaarb.h +++ b/include/linux/vgaarb.h @@ -46,19 +46,6 @@ struct pci_dev; /* For use by clients */ -/** - * vga_set_legacy_decoding - * - * @pdev: pci device of the VGA card - * @decodes: bit mask of what legacy regions the card decodes - * - * Indicates to the arbiter if the card decodes legacy VGA IOs, - * legacy VGA Memory, both, or none. All cards default to both, - * the card driver (fbdev for example) should tell the arbiter - * if it has disabled legacy decoding, so the card can be left - * out of the arbitration process (and can be safe to take - * interrupts at any time. - */ #if defined(CONFIG_VGA_ARB) extern void vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes); -- 2.30.2 ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [PATCH 2/7] vgaarb: remove vga_conflicts
vga_conflicts only has a single caller and none of the arch overrides mentioned in the comment. Just remove it and the thus dead check in the caller. Signed-off-by: Christoph Hellwig --- drivers/gpu/vga/vgaarb.c | 6 -- include/linux/vgaarb.h | 12 2 files changed, 18 deletions(-) diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c index 949fde433ea2..fccc7ef5153a 100644 --- a/drivers/gpu/vga/vgaarb.c +++ b/drivers/gpu/vga/vgaarb.c @@ -284,12 +284,6 @@ static struct vga_device *__vga_tryget(struct vga_device *vgadev, if (vgadev == conflict) continue; - /* Check if the architecture allows a conflict between those -* 2 devices or if they are on separate domains -*/ - if (!vga_conflicts(vgadev->pdev, conflict->pdev)) - continue; - /* We have a possible conflict. before we go further, we must * check if we sit on the same bus as the conflicting device. * if we don't, then we must tie both IO and MEM resources diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h index 26ec8a057d2a..ca5160218538 100644 --- a/include/linux/vgaarb.h +++ b/include/linux/vgaarb.h @@ -122,18 +122,6 @@ static inline void vga_set_default_device(struct pci_dev *pdev) { } static inline int vga_remove_vgacon(struct pci_dev *pdev) { return 0; } #endif -/* - * Architectures should define this if they have several - * independent PCI domains that can afford concurrent VGA - * decoding - */ -#ifndef __ARCH_HAS_VGA_CONFLICT -static inline int vga_conflicts(struct pci_dev *p1, struct pci_dev *p2) -{ - return 1; -} -#endif - #if defined(CONFIG_VGA_ARB) int vga_client_register(struct pci_dev *pdev, void *cookie, void (*irq_set_state)(void *cookie, bool state), -- 2.30.2 ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [PATCH 1/7] vgaarb: remove VGA_DEFAULT_DEVICE
The define is entirely unused. Signed-off-by: Christoph Hellwig --- include/linux/vgaarb.h | 6 -- 1 file changed, 6 deletions(-) diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h index dc6ddce92066..26ec8a057d2a 100644 --- a/include/linux/vgaarb.h +++ b/include/linux/vgaarb.h @@ -42,12 +42,6 @@ #define VGA_RSRC_NORMAL_IO 0x04 #define VGA_RSRC_NORMAL_MEM0x08 -/* Passing that instead of a pci_dev to use the system "default" - * device, that is the one used by vgacon. Archs will probably - * have to provide their own vga_default_device(); - */ -#define VGA_DEFAULT_DEVICE (NULL) - struct pci_dev; /* For use by clients */ -- 2.30.2 ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] misc vgaarb cleanups
Hi all, this series cleans up a bunch of lose ends in the vgaarb code. Diffstat: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +- drivers/gpu/drm/drm_irq.c |4 drivers/gpu/drm/i915/display/intel_vga.c |9 +- drivers/gpu/drm/nouveau/nouveau_vga.c |8 - drivers/gpu/drm/radeon/radeon_device.c | 11 +- drivers/gpu/vga/vgaarb.c | 67 +--- drivers/vfio/pci/vfio_pci.c| 11 +- include/linux/vgaarb.h | 118 ++--- 8 files changed, 93 insertions(+), 146 deletions(-) ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 01/15] swiotlb: Refactor swiotlb init functions
Looks good, Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 04/15] swiotlb: Add restricted DMA pool initialization
> +#ifdef CONFIG_DMA_RESTRICTED_POOL > +#include > +#include > +#include > +#include > +#include > +#endif I don't think any of this belongs into swiotlb.c. Marking swiotlb_init_io_tlb_mem non-static and having all this code in a separate file is probably a better idea. > +#ifdef CONFIG_DMA_RESTRICTED_POOL > +static int rmem_swiotlb_device_init(struct reserved_mem *rmem, > + struct device *dev) > +{ > + struct io_tlb_mem *mem = rmem->priv; > + unsigned long nslabs = rmem->size >> IO_TLB_SHIFT; > + > + if (dev->dma_io_tlb_mem) > + return 0; > + > + /* Since multiple devices can share the same pool, the private data, > + * io_tlb_mem struct, will be initialized by the first device attached > + * to it. > + */ This is not the normal kernel comment style. > +#ifdef CONFIG_ARM > + if (!PageHighMem(pfn_to_page(PHYS_PFN(rmem->base { > + kfree(mem); > + return -EINVAL; > + } > +#endif /* CONFIG_ARM */ And this is weird. Why would ARM have such a restriction? And if we have such rstrictions it absolutely belongs into an arch helper. > + swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false); > + > + rmem->priv = mem; > + > +#ifdef CONFIG_DEBUG_FS > + if (!debugfs_dir) > + debugfs_dir = debugfs_create_dir("swiotlb", NULL); > + > + swiotlb_create_debugfs(mem, rmem->name, debugfs_dir); Doesn't the debugfs_create_dir belong into swiotlb_create_debugfs? Also please use IS_ENABLEd or a stub to avoid ifdefs like this. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 02/15] swiotlb: Refactor swiotlb_create_debugfs
Looks good, Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 08/15] swiotlb: Bounce data from/to restricted DMA pool if available
> +static inline bool is_dev_swiotlb_force(struct device *dev) > +{ > +#ifdef CONFIG_DMA_RESTRICTED_POOL > + if (dev->dma_io_tlb_mem) > + return true; > +#endif /* CONFIG_DMA_RESTRICTED_POOL */ > + return false; > +} > + > /* If SWIOTLB is active, use its maximum mapping size */ > if (is_swiotlb_active(dev) && > - (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE)) > + (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE || > + is_dev_swiotlb_force(dev))) This is a mess. I think the right way is to have an always_bounce flag in the io_tlb_mem structure instead. Then the global swiotlb_force can go away and be replace with this and the fact that having no io_tlb_mem structure at all means forced no buffering (after a little refactoring). ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 05/15] swiotlb: Add a new get_io_tlb_mem getter
> +static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev) > +{ > +#ifdef CONFIG_DMA_RESTRICTED_POOL > + if (dev && dev->dma_io_tlb_mem) > + return dev->dma_io_tlb_mem; > +#endif /* CONFIG_DMA_RESTRICTED_POOL */ > + > + return io_tlb_default_mem; Given that we're also looking into a not addressing restricted pool I'd rather always assign the active pool to dev->dma_io_tlb_mem and do away with this helper. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH] drm/ttm: use dma_alloc_pages for the page pool
On Tue, May 11, 2021 at 09:35:20AM +0200, Christian König wrote: > We certainly going to need the drm_need_swiotlb() for userptr support > (unless we add some approach for drivers to opt out of swiotlb). swiotlb use is driven by three things: 1) addressing limitations of the device 2) addressing limitations of the interconnect 3) virtualiztion modes that require it not sure how the driver could opt out. What is the problem with userptr support? > Then while I really want to get rid of GFP_DMA32 as well I'm not 100% sure > if we can handle this without the flag. Note that this is still using GFP_DMA32 underneath where required, just in a layer that can decide that ѕensibly. > And last we need something better to store the DMA address and order than > allocating a separate memory object for each page. Yeah. If you use __GFP_COMP for the allocations we can find the order from the page itself, which might be useful. For 64-bit platforms the dma address could be store in page->private, or depending on how the page gets used the dma_addr field in struct page that overloads the lru field and is used by the networking page pool could be used. Maybe we could even have a common page pool between net and drm, but I don't want to go there myself, not being an expert on either subsystem. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] RFC: use dma_alloc_noncoherent in ttm_pool_alloc_page
Hi all, the memory allocation for the TTM pool is a big mess with two allocation methods that both have issues, a layering violation and odd guessing of pools in the callers. This patch switches to the dma_alloc_noncoherent API instead fixing all of the above issues. Warning: i don't have any of the relevant hardware, so this is a compile tested request for comments only! Diffstat: drivers/gpu/drm/amd/amdgpu/amdgpu.h |1 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |4 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c |1 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |1 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |1 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |1 drivers/gpu/drm/drm_cache.c | 31 - drivers/gpu/drm/drm_gem_vram_helper.c |3 drivers/gpu/drm/nouveau/nouveau_ttm.c |8 - drivers/gpu/drm/qxl/qxl_ttm.c |3 drivers/gpu/drm/radeon/radeon.h |1 drivers/gpu/drm/radeon/radeon_device.c |1 drivers/gpu/drm/radeon/radeon_ttm.c |4 drivers/gpu/drm/ttm/ttm_device.c|7 - drivers/gpu/drm/ttm/ttm_pool.c | 178 drivers/gpu/drm/ttm/ttm_tt.c| 25 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c |4 include/drm/drm_cache.h |1 include/drm/ttm/ttm_device.h|3 include/drm/ttm/ttm_pool.h |9 - 20 files changed, 41 insertions(+), 246 deletions(-) ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [PATCH] drm/ttm: use dma_alloc_pages for the page pool
Use the dma_alloc_pages allocator for the TTM pool allocator. This allocator is a front end to the page allocator which takes the DMA mask of the device into account, thus offering the best of both worlds of the two existing allocator versions. This conversion also removes the ugly layering violation where the TTM pool assumes what kind of virtual address dma_alloc_attrs can return. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 +- drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 1 - drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 1 - drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 1 - drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 1 - drivers/gpu/drm/drm_cache.c | 31 - drivers/gpu/drm/drm_gem_vram_helper.c | 3 +- drivers/gpu/drm/nouveau/nouveau_ttm.c | 8 +- drivers/gpu/drm/qxl/qxl_ttm.c | 3 +- drivers/gpu/drm/radeon/radeon.h | 1 - drivers/gpu/drm/radeon/radeon_device.c | 1 - drivers/gpu/drm/radeon/radeon_ttm.c | 4 +- drivers/gpu/drm/ttm/ttm_device.c| 7 +- drivers/gpu/drm/ttm/ttm_pool.c | 178 drivers/gpu/drm/ttm/ttm_tt.c| 25 +--- drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 4 +- include/drm/drm_cache.h | 1 - include/drm/ttm/ttm_device.h| 3 +- include/drm/ttm/ttm_pool.h | 9 +- 20 files changed, 41 insertions(+), 246 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index dc3a69296321b3..5f40527eeef1ff 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -819,7 +819,6 @@ struct amdgpu_device { int usec_timeout; const struct amdgpu_asic_funcs *asic_funcs; boolshutdown; - boolneed_swiotlb; boolaccel_working; struct notifier_block acpi_nb; struct amdgpu_i2c_chan *i2c_bus[AMDGPU_MAX_I2C_BUS]; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 3bef0432cac2f7..9bf17b44cba6fe 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -1705,9 +1705,7 @@ int amdgpu_ttm_init(struct amdgpu_device *adev) /* No others user of address space so set it to 0 */ r = ttm_device_init(>mman.bdev, _bo_driver, adev->dev, adev_to_drm(adev)->anon_inode->i_mapping, - adev_to_drm(adev)->vma_offset_manager, - adev->need_swiotlb, - dma_addressing_limited(adev->dev)); + adev_to_drm(adev)->vma_offset_manager); if (r) { DRM_ERROR("failed initializing buffer object driver(%d).\n", r); return r; diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c index 405d6ad09022ca..2d4fa754513033 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c @@ -846,7 +846,6 @@ static int gmc_v6_0_sw_init(void *handle) dev_warn(adev->dev, "No suitable DMA available.\n"); return r; } - adev->need_swiotlb = drm_need_swiotlb(44); r = gmc_v6_0_init_microcode(adev); if (r) { diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c index 210ada2289ec9c..a504db24f4c2a8 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c @@ -1025,7 +1025,6 @@ static int gmc_v7_0_sw_init(void *handle) pr_warn("No suitable DMA available\n"); return r; } - adev->need_swiotlb = drm_need_swiotlb(40); r = gmc_v7_0_init_microcode(adev); if (r) { diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c index e4f27b3f28fb58..42e7b1eb84b3bc 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c @@ -1141,7 +1141,6 @@ static int gmc_v8_0_sw_init(void *handle) pr_warn("No suitable DMA available\n"); return r; } - adev->need_swiotlb = drm_need_swiotlb(40); r = gmc_v8_0_init_microcode(adev); if (r) { diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c index 455bb91060d0bc..f74784b3423740 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c @@ -1548,7 +1548,6 @@ static int gmc_v9_0_sw_init(void *handle) printk(KERN_WARNING "amdgpu: No suitable DMA available.\n"); ret
Re: [Nouveau] [PATCH v5 01/16] swiotlb: Fix the type of index
On Thu, Apr 22, 2021 at 04:14:53PM +0800, Claire Chang wrote: > Fix the type of index from unsigned int to int since find_slots() might > return -1. > > Fixes: 0774983bc923 ("swiotlb: refactor swiotlb_tbl_map_single") > Signed-off-by: Claire Chang Looks good: Reviewed-by: Christoph Hellwig it really should go into 5.12. I'm not sure if Konrad is going to be able to queue this up due to his vacation, so I'm tempted to just queue it up in the dma-mapping tree. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 8/8] nouveau/svm: Implement atomic SVM access
> - /*XXX: atomic? */ > - return (fa->access == 0 || fa->access == 3) - > -(fb->access == 0 || fb->access == 3); > + /* Atomic access (2) has highest priority */ > + return (-1*(fa->access == 2) + (fa->access == 0 || fa->access == 3)) - > +(-1*(fb->access == 2) + (fb->access == 0 || fb->access == 3)); This looks really unreabable. If the magic values 0, 2 and 3 had names it might become a little more understadable, then factor the duplicated calculation of the priority value into a helper and we'll have code that mere humans can understand.. > + mutex_lock(>mutex); > + if (mmu_interval_read_retry(>notifier, > + notifier_seq)) { > + mutex_unlock(>mutex); > + continue; > + } > + break; > + } This looks good, why not: mutex_lock(>mutex); if (!mmu_interval_read_retry(>notifier, notifier_seq)) break; mutex_unlock(>mutex); } ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 5/8] mm: Device exclusive memory access
> +Not all devices support atomic access to system memory. To support atomic > +operations to a shared virtual memory page such a device needs access to that > +page which is exclusive of any userspace access from the CPU. The > +``make_device_exclusive_range()`` function can be used to make a memory range > +inaccessible from userspace. s/Not all devices/Some devices/ ? > static inline int mm_has_notifiers(struct mm_struct *mm) > @@ -528,7 +534,17 @@ static inline void mmu_notifier_range_init_migrate( > { > mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm, > start, end); > - range->migrate_pgmap_owner = pgmap; > + range->owner = pgmap; > +} > + > +static inline void mmu_notifier_range_init_exclusive( > + struct mmu_notifier_range *range, unsigned int flags, > + struct vm_area_struct *vma, struct mm_struct *mm, > + unsigned long start, unsigned long end, void *owner) > +{ > + mmu_notifier_range_init(range, MMU_NOTIFY_EXCLUSIVE, flags, vma, mm, > + start, end); > + range->owner = owner; Maybe just replace mmu_notifier_range_init_migrate with a mmu_notifier_range_init_owner helper that takes the owner but does not hard code a type? > } > + } else if (is_device_exclusive_entry(entry)) { > + page = pfn_swap_entry_to_page(entry); > + > + get_page(page); > + rss[mm_counter(page)]++; > + > + if (is_writable_device_exclusive_entry(entry) && > + is_cow_mapping(vm_flags)) { > + /* > + * COW mappings require pages in both > + * parent and child to be set to read. > + */ > + entry = make_readable_device_exclusive_entry( > + swp_offset(entry)); > + pte = swp_entry_to_pte(entry); > + if (pte_swp_soft_dirty(*src_pte)) > + pte = pte_swp_mksoft_dirty(pte); > + if (pte_swp_uffd_wp(*src_pte)) > + pte = pte_swp_mkuffd_wp(pte); > + set_pte_at(src_mm, addr, src_pte, pte); > + } Just cosmetic, but I wonder if should factor this code block into a little helper. > + > +static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma, > + unsigned long address, void *arg) > +{ > + struct mm_struct *mm = vma->vm_mm; > + struct page_vma_mapped_walk pvmw = { > + .page = page, > + .vma = vma, > + .address = address, > + }; > + struct ttp_args *ttp = (struct ttp_args *) arg; This cast should not be needed. > + return ttp.valid && (!page_mapcount(page) ? true : false); This can be simplified to: return ttp.valid && !page_mapcount(page); > + npages = get_user_pages_remote(mm, start, npages, > +FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD, > +pages, NULL, NULL); > + for (i = 0; i < npages; i++, start += PAGE_SIZE) { > + if (!trylock_page(pages[i])) { > + put_page(pages[i]); > + pages[i] = NULL; > + continue; > + } > + > + if (!try_to_protect(pages[i], mm, start, arg)) { > + unlock_page(pages[i]); > + put_page(pages[i]); > + pages[i] = NULL; > + } Should the trylock_page go into try_to_protect to simplify the loop a little? Also I wonder if we need make_device_exclusive_range or should just open code the get_user_pages_remote + try_to_protect loop in the callers, as that might allow them to also deduct other information about the found pages. Otherwise looks good: Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
On Fri, Mar 12, 2021 at 07:38:46PM +1100, Alistair Popple wrote: > The behaviour of try_to_unmap_one() is difficult to follow because it > performs different operations based on a fairly large set of flags used > in different combinations. > > TTU_MUNLOCK is one such flag. However it is exclusively used by > try_to_munlock() which specifies no other flags. Therefore rather than > overload try_to_unmap_one() with unrelated behaviour split this out into > it's own function and remove the flag. > > Signed-off-by: Alistair Popple > Reviewed-by: Ralph Campbell > > --- > > Christoph - I didn't add your Reviewed-by from v3 because removal of the > extra VM_LOCKED check in v4 changed things slightly. Let me know if > you're still ok for me to add it. Thanks. Still looks good to me: Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v6 1/8] mm: Remove special swap entry functions
On Fri, Mar 12, 2021 at 07:38:44PM +1100, Alistair Popple wrote: > Remove the migration and device private entry_to_page() and > entry_to_pfn() inline functions and instead open code them directly. > This results in shorter code which is easier to understand. I think this commit log should mention pfn_swap_entry_to_page() now. Otherwise looks good: Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 4/8] mm/rmap: Split migration into its own function
Nice cleanup! Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
> + while (page_vma_mapped_walk()) { > + /* > + * If the page is mlock()d, we cannot swap it out. > + * If it's recently referenced (perhaps page_referenced > + * skipped over this mm) then we should reactivate it. > + */ > + if (vma->vm_flags & VM_LOCKED) { > + /* PTE-mapped THP are never mlocked */ > + if (!PageTransCompound(page)) { > + /* > + * Holding pte lock, we do *not* need > + * mmap_lock here > + */ > + mlock_vma_page(page); > + } > + ret = false; > + page_vma_mapped_walk_done(); > + break; Just return false here directly and remove the ret variable? Very nice cleanup! Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 2/8] mm/swapops: Rework swap entry manipulation code
On Fri, Feb 26, 2021 at 06:18:26PM +1100, Alistair Popple wrote: > Both migration and device private pages use special swap entries which > are manipluated by a range of inline functions. The arguments to these > are somewhat inconsitent so rework them to remove flag type arguments > and to make the arguments similar for both a read and write entry > creation. > > Signed-off-by: Alistair Popple Looks good, Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 1/8] mm: Remove special swap entry functions
> - struct page *page = migration_entry_to_page(entry); > + struct page *page = pfn_to_page(swp_offset(entry)); I wonder if keeping a single special_entry_to_page() helper would still me a useful. But I'm not entirely sure. There are also two more open coded copies of this in the THP migration code. > -#define free_swap_and_cache(e) ({(is_migration_entry(e) || > is_device_private_entry(e));}) > -#define swapcache_prepare(e) ({(is_migration_entry(e) || > is_device_private_entry(e));}) > +#define free_swap_and_cache(e) is_special_entry(e) > +#define swapcache_prepare(e) is_special_entry(e) Staring at this I'm really, really confused at what this is doing. Looking a little closer these are the !CONFIG_SWAP stubs, but it could probably use a comment or two. > } else if (is_migration_entry(entry)) { > - page = migration_entry_to_page(entry); > + page = pfn_to_page(swp_offset(entry)); > > rss[mm_counter(page)]++; > > @@ -737,7 +737,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct > mm_struct *src_mm, > set_pte_at(src_mm, addr, src_pte, pte); > } > } else if (is_device_private_entry(entry)) { > - page = device_private_entry_to_page(entry); > + page = pfn_to_page(swp_offset(entry)); > > /* >* Update rss count even for unaddressable pages, as > @@ -1274,7 +1274,7 @@ static unsigned long zap_pte_range(struct mmu_gather > *tlb, > > entry = pte_to_swp_entry(ptent); > if (is_device_private_entry(entry)) { > - struct page *page = device_private_entry_to_page(entry); > + struct page *page = pfn_to_page(swp_offset(entry)); > > if (unlikely(details && details->check_mapping)) { > /* > @@ -1303,7 +1303,7 @@ static unsigned long zap_pte_range(struct mmu_gather > *tlb, > else if (is_migration_entry(entry)) { > struct page *page; > > - page = migration_entry_to_page(entry); > + page = pfn_to_page(swp_offset(entry)); > rss[mm_counter(page)]--; > } > if (unlikely(!free_swap_and_cache(entry))) > @@ -3271,7 +3271,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > migration_entry_wait(vma->vm_mm, vmf->pmd, >vmf->address); > } else if (is_device_private_entry(entry)) { > - vmf->page = device_private_entry_to_page(entry); > + vmf->page = pfn_to_page(swp_offset(entry)); > ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); > } else if (is_hwpoison_entry(entry)) { > ret = VM_FAULT_HWPOISON; > diff --git a/mm/migrate.c b/mm/migrate.c > index 20ca887ea769..72adcc3d8f5b 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -321,7 +321,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t > *ptep, > if (!is_migration_entry(entry)) > goto out; > > - page = migration_entry_to_page(entry); > + page = pfn_to_page(swp_offset(entry)); > > /* >* Once page cache replacement of page migration started, page_count > @@ -361,7 +361,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t > *pmd) > ptl = pmd_lock(mm, pmd); > if (!is_pmd_migration_entry(*pmd)) > goto unlock; > - page = migration_entry_to_page(pmd_to_swp_entry(*pmd)); > + page = pfn_to_page(swp_offset(pmd_to_swp_entry(*pmd))); > if (!get_page_unless_zero(page)) > goto unlock; > spin_unlock(ptl); > @@ -2437,7 +2437,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > if (!is_device_private_entry(entry)) > goto next; > > - page = device_private_entry_to_page(entry); > + page = pfn_to_page(swp_offset(entry)); > if (!(migrate->flags & > MIGRATE_VMA_SELECT_DEVICE_PRIVATE) || > page->pgmap->owner != migrate->pgmap_owner) > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c > index 86e3a3688d59..34230d08556a 100644 > --- a/mm/page_vma_mapped.c > +++ b/mm/page_vma_mapped.c > @@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) > if (!is_migration_entry(entry)) > return false; > > - pfn = migration_entry_to_pfn(entry); > + pfn = swp_offset(entry); > } else if (is_swap_pte(*pvmw->pte)) { > swp_entry_t entry; > > @@ -105,7 +105,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) > if (!is_device_private_entry(entry)) > return false; > > -
Re: [Nouveau] [PATCH v2 1/4] hmm: Device exclusive memory access
> page = migration_entry_to_page(swpent); > else if (is_device_private_entry(swpent)) > page = device_private_entry_to_page(swpent); > + else if (is_device_exclusive_entry(swpent)) > + page = device_exclusive_entry_to_page(swpent); > page = migration_entry_to_page(swpent); > else if (is_device_private_entry(swpent)) > page = device_private_entry_to_page(swpent); > + else if (is_device_exclusive_entry(swpent)) > + page = device_exclusive_entry_to_page(swpent); > if (is_device_private_entry(entry)) > page = device_private_entry_to_page(entry); > + > + if (is_device_exclusive_entry(entry)) > + page = device_exclusive_entry_to_page(entry); Any chance we can come up with a clever scheme to avoid all this boilerplate code (and maybe also what it gets compiled to)? > diff --git a/include/linux/hmm.h b/include/linux/hmm.h > index 866a0fa104c4..5d28ff6d4d80 100644 > --- a/include/linux/hmm.h > +++ b/include/linux/hmm.h > @@ -109,6 +109,10 @@ struct hmm_range { > */ > int hmm_range_fault(struct hmm_range *range); > > +int hmm_exclusive_range(struct mm_struct *mm, unsigned long start, > + unsigned long end, struct page **pages); > +vm_fault_t hmm_remove_exclusive_entry(struct vm_fault *vmf); Can we avoid the hmm naming for new code (we should probably also kill it off for the existing code)? > +#define free_swap_and_cache(e) ({(is_migration_entry(e) || > is_device_private_entry(e) \ > + || is_device_exclusive_entry(e)); }) > +#define swapcache_prepare(e) ({(is_migration_entry(e) || > is_device_private_entry(e) \ > + || is_device_exclusive_entry(e)); }) Can you turn these into properly formatted inline functions? As-is this becomes pretty unreadable. > +static inline void make_device_exclusive_entry_read(swp_entry_t *entry) > +{ > + *entry = swp_entry(SWP_DEVICE_EXCLUSIVE_READ, swp_offset(*entry)); > +} s/make_device_exclusive_entry_read/mark_device_exclusive_entry_readable/ ?? > + > +static inline swp_entry_t make_device_exclusive_entry(struct page *page, > bool write) > +{ > + return swp_entry(write ? SWP_DEVICE_EXCLUSIVE_WRITE : > SWP_DEVICE_EXCLUSIVE_READ, > + page_to_pfn(page)); > +} I'd split this into two helpers, which is easier to follow and avoids the pointlessly overlong lines. > +static inline bool is_device_exclusive_entry(swp_entry_t entry) > +{ > + int type = swp_type(entry); > + return type == SWP_DEVICE_EXCLUSIVE_READ || type == > SWP_DEVICE_EXCLUSIVE_WRITE; > +} Another overly long line. I also wouldn't bother with the local variable: return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ || swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE; > +static inline bool is_write_device_exclusive_entry(swp_entry_t entry) > +{ > + return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE; > +} Or reuse these kind of helpers.. > + > +static inline unsigned long device_exclusive_entry_to_pfn(swp_entry_t entry) > +{ > + return swp_offset(entry); > +} > + > +static inline struct page *device_exclusive_entry_to_page(swp_entry_t entry) > +{ > + return pfn_to_page(swp_offset(entry)); > +} I'd rather open code these two, and as a prep patch also kill off the equivalents for the migration and device private entries, which would actually clean up a lot of the mess mentioned in my first comment above. > +static int hmm_exclusive_skip(unsigned long start, > + unsigned long end, > + __always_unused int depth, > + struct mm_walk *walk) > +{ > + struct hmm_exclusive_walk *hmm_exclusive_walk = walk->private; > + unsigned long addr; > + > + for (addr = start; addr < end; addr += PAGE_SIZE) > + hmm_exclusive_walk->pages[hmm_exclusive_walk->npages++] = NULL; > + > + return 0; > +} Wouldn't pre-zeroing the array be simpler and more efficient? > +int hmm_exclusive_range(struct mm_struct *mm, unsigned long start, > + unsigned long end, struct page **pages) > +{ > + struct hmm_exclusive_walk hmm_exclusive_walk = { .pages = pages, > .npages = 0 }; > + int i; > + > + /* Collect and lock candidate pages */ > + walk_page_range(mm, start, end, _exclusive_walk_ops, > _exclusive_walk); Please avoid the overly long lines. But more importantly: Unless I'm missing something obvious this walk_page_range call just open codes get_user_pages_fast, why can't you use that? > +#if defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) || defined(CONFIG_HUGETLB) > + if (PageTransHuge(page)) { > + VM_BUG_ON_PAGE(1, page); > + continue; > +
Re: [Nouveau] [PATCH 0/9] Add support for SVM atomics in Nouveau
On Wed, Feb 10, 2021 at 01:59:13PM -0400, Jason Gunthorpe wrote: > Really what you want to do here is leave the CPU page in the VMA and > the page tables where it started and deny CPU access to the page. Then > all the proper machinery will continue to work. > > IMHO "migration" is the wrong idea if the data isn't actually moving. Agreed. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays
On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote: > So one thing that has been on my mind for a while: I'd really like > to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb > to swiotlb the main difference seems to be: > > - additional reasons to bounce I/O vs the plain DMA capable > - the possibility to do a hypercall on arm/arm64 > - an extra translation layer before doing the phys_to_dma and vice >versa > - an special memory allocator > > I wonder if inbetween a few jump labels or other no overhead enablement > options and possibly better use of the dma_range_map we could kill > off most of swiotlb-xen instead of maintaining all this code duplication? So I looked at this a bit more. For x86 with XENFEAT_auto_translated_physmap (how common is that?) pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is. xen_arch_need_swiotlb always returns true for x86, and range_straddles_page_boundary should never be true for the XENFEAT_auto_translated_physmap case. So as far as I can tell the mapping fast path for the XENFEAT_auto_translated_physmap can be trivially reused from swiotlb. That leaves us with the next more complicated case, x86 or fully cache coherent arm{,64} without XENFEAT_auto_translated_physmap. In that case we need to patch in a phys_to_dma/dma_to_phys that performs the MFN lookup, which could be done using alternatives or jump labels. I think if that is done right we should also be able to let that cover the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but in that worst case that would need another alternative / jump label. For non-coherent arm{,64} we'd also need to use alternatives or jump labels to for the cache maintainance ops, but that isn't a hard problem either. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote: > This patch converts several swiotlb related variables to arrays, in > order to maintain stat/status for different swiotlb buffers. Here are > variables involved: > > - io_tlb_start and io_tlb_end > - io_tlb_nslabs and io_tlb_used > - io_tlb_list > - io_tlb_index > - max_segment > - io_tlb_orig_addr > - no_iotlb_memory > > There is no functional change and this is to prepare to enable 64-bit > swiotlb. Claire Chang (on Cc) already posted a patch like this a month ago, which looks much better because it actually uses a struct instead of all the random variables. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays
So one thing that has been on my mind for a while: I'd really like to kill the separate dma ops in Xen swiotlb. If we compare xen-swiotlb to swiotlb the main difference seems to be: - additional reasons to bounce I/O vs the plain DMA capable - the possibility to do a hypercall on arm/arm64 - an extra translation layer before doing the phys_to_dma and vice versa - an special memory allocator I wonder if inbetween a few jump labels or other no overhead enablement options and possibly better use of the dma_range_map we could kill off most of swiotlb-xen instead of maintaining all this code duplication? ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory
[adding a few of the usual suspects] On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote: > There are 4 types of ZONE_DEVICE struct pages: > MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and > MEMORY_DEVICE_PCI_P2PDMA. > > Currently, memremap_pages() allocates struct pages for a physical address > range > with a page_ref_count(page) of one and increments the pgmap->ref per CPU > reference count by the number of pages created since each ZONE_DEVICE struct > page has a pointer to the pgmap. > > The struct pages are not freed until memunmap_pages() is called which > calls put_page() which calls put_dev_pagemap() which releases a reference to > pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count > to be zero. As far as I can tell, the put_page() in memunmap_pages() has to > be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA). > My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that > the struct page reference count can go to zero and back to non-zero without > changing the pgmap->ref reference count. > > Q1: Is that safe? Is there some code that depends on put_page() dropping > the pgmap->ref reference count as part of memunmap_pages()? > My testing of [1] seems OK but I'm sure there are lots of cases I didn't test. It should be safe, but the audit you've done is important to make sure we do not miss anything important. > MEMORY_DEVICE_PCI_P2PDMA: > Struct pages are created in pci_p2pdma_add_resource() and represent device > memory accessible by PCIe bar address space. Memory is allocated with > pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner() > call will allocate memory in a minimum of PAGE_SIZE units. > Reference counting is +1 per *allocation* on the pgmap->ref reference count. > Note that this is not +1 per page which is what put_page() expects. So > currently, a get_page()/put_page() works OK because the page reference count > only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count > would be incorrect if the allocation size was greater than one page. > > I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and > pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *. > Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist. > There are a huge number of places sg_page() is called so it is hard to tell > whether or not get_page()/put_page() is ever called on > MEMORY_DEVICE_PCI_P2PDMA > pages. Nothing should call get_page/put_page on them, as they are not treated as refcountable memory. More importantly nothing is allowed to keep a reference longer than the time of the I/O. > pci_p2pmem_virt_to_bus() will return the physical address and I guess > pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page. > > Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be > modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page > reference count. Or maybe just leave it at one like it is now. And yes, doing that is probably a sensible safe guard. > MEMORY_DEVICE_FS_DAX: > Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with > an initial reference count of one. > The problem I see is that there are 3 states that are important: > a) memory is free and not allocated to any file (page_ref_count() == 0). > b) memory is allocated to a file and in the page cache (page_ref_count() == > 1). > c) some gup() or I/O has a reference even after calling unmap_mapping_pages() >(page_ref_count() > 1). ext4_break_layouts() basically waits until the >page_ref_count() == 1 with put_page() calling wake_up_var(>_refcount) >to wake up ext4_break_layouts(). > The current code doesn't seem to distinguish (a) and (b). If we want to use > the 0->1 reference count to signal (c), then the page cache would have hold > entries with a page_ref_count() == 0 which doesn't match the general page > cache I think the sensible model here is to grab a reference when it is added to the page cache. That is exactly how normal system memory pages work. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory
On Fri, Nov 20, 2020 at 04:01:33PM -0400, Jason Gunthorpe wrote: > On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote: > > > MEMORY_DEVICE_GENERIC: > > Struct pages are created in dev_dax_probe() and represent non-volatile > > memory. > > The device can be mmap()'ed which calls dax_mmap() which sets > > vma->vm_flags | VM_HUGEPAGE. > > A CPU page fault will result in a PTE, PMD, or PUD sized page > > (but not compound) to be inserted by vmf_insert_mixed() which will call > > either > > insert_pfn() or insert_page(). > > Neither insert_pfn() nor insert_page() increments the page reference > > count. > > But why was this done? It seems very strange to put a pfn with a > struct page into a VMA and then deliberately not take the refcount for > the duration of that pfn being in the VMA? > > What prevents memunmap_pages() from progressing while VMAs still point > at the memory? Agreed. Adding Roger who added MEMORY_DEVICE_GENERIC and the only user. > > I think just leaving the page reference count at one is better than trying > > to use the mmu_interval_notifier or changing vmf_insert_mixed() and > > invalidations of pfn_t_devmap(pfn) to adjust the page reference count. > > Why so? The entire point of getting struct page's for this stuff was > to be able to follow the struct page flow. I never did learn a reason > why there is devmap stuff all over the place in the page table code... Exactly. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory
On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote: > > On 11/6/20 12:03 AM, Christoph Hellwig wrote: >> I hate the extra pin count magic here. IMHO we really need to finish >> off the series to get rid of the extra references on the ZONE_DEVICE >> pages first. > > First, thanks for the review comments. > > I don't like the extra refcount either, that is why I tried to fix that up > before resending this series. However, you didn't like me just fixing the > refcount only for device private pages and I don't know the dax/pmem code > and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how > long it will take me to fix all the use cases. > So I wanted to make progress on the THP migration code in the mean time. I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like a normal memory allocator. DAX is the interesting case, any specific help that you need with that? ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory
I hate the extra pin count magic here. IMHO we really need to finish off the series to get rid of the extra references on the ZONE_DEVICE pages first. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 4/6] mm/thp: add THP allocation helper
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > +extern struct page *alloc_transhugepage(struct vm_area_struct *vma, > + unsigned long addr); No need for the extern. And also here: do we actually need the stub, or can the caller make sure (using IS_ENABLED and similar) that the compiler knows the code is dead? > +struct page *alloc_transhugepage(struct vm_area_struct *vma, > + unsigned long haddr) > +{ > + gfp_t gfp; > + struct page *page; > + > + gfp = alloc_hugepage_direct_gfpmask(vma); > + page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER); > + if (page) > + prep_transhuge_page(page); > + return page; I think do_huge_pmd_anonymous_page should be switched to use this helper as well. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()
Looks good: Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()
On Thu, Nov 05, 2020 at 04:51:43PM -0800, Ralph Campbell wrote: > Move the definition of migrate_vma_collect_skip() to make it callable > by migrate_vma_collect_hole(). This helps make the next patch easier > to read. > > Signed-off-by: Ralph Campbell Looks good, Reviewed-by: Christoph Hellwig ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau