[PATCH v2] nouveau/hmm: map pages after migration
When memory is migrated to the GPU, it is likely to be accessed by GPU code soon afterwards. Instead of waiting for a GPU fault, map the migrated memory into the GPU page tables with the same access permissions as the source CPU page table entries. This preserves copy on write semantics. Signed-off-by: Ralph Campbell Cc: Christoph Hellwig Cc: Jason Gunthorpe Cc: "Jérôme Glisse" Cc: Ben Skeggs --- Originally this patch was targeted for Jason's rdma tree since other HMM related changes were queued there. Now that those have been merged, this patch just contains changes to nouveau so it could go through any tree. I guess Ben Skeggs' tree would be appropriate. Changes since v1: Rebase to linux-5.6.0-rc4 Address Christoph Hellwig's comments drivers/gpu/drm/nouveau/nouveau_dmem.c | 44 - drivers/gpu/drm/nouveau/nouveau_svm.c | 85 ++ drivers/gpu/drm/nouveau/nouveau_svm.h | 5 ++ 3 files changed, 118 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 0ad5d87b5a8e..172e0c98cec5 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -25,11 +25,13 @@ #include "nouveau_dma.h" #include "nouveau_mem.h" #include "nouveau_bo.h" +#include "nouveau_svm.h" #include #include #include #include +#include #include #include @@ -558,10 +560,11 @@ nouveau_dmem_init(struct nouveau_drm *drm) } static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm, - unsigned long src, dma_addr_t *dma_addr) + unsigned long src, dma_addr_t *dma_addr, u64 *pfn) { struct device *dev = drm->dev->dev; struct page *dpage, *spage; + unsigned long paddr; spage = migrate_pfn_to_page(src); if (!spage || !(src & MIGRATE_PFN_MIGRATE)) @@ -569,17 +572,21 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm, dpage = nouveau_dmem_page_alloc_locked(drm); if (!dpage) - return 0; + goto out; *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL); if (dma_mapping_error(dev, *dma_addr)) goto out_free_page; + paddr = nouveau_dmem_page_addr(dpage); if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM, - nouveau_dmem_page_addr(dpage), NOUVEAU_APER_HOST, - *dma_addr)) + paddr, NOUVEAU_APER_HOST, *dma_addr)) goto out_dma_unmap; + *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM | + ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT); + if (src & MIGRATE_PFN_WRITE) + *pfn |= NVIF_VMM_PFNMAP_V0_W; return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; out_dma_unmap: @@ -587,18 +594,19 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm, out_free_page: nouveau_dmem_page_free_locked(drm, dpage); out: + *pfn = NVIF_VMM_PFNMAP_V0_NONE; return 0; } static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm, - struct migrate_vma *args, dma_addr_t *dma_addrs) + struct migrate_vma *args, dma_addr_t *dma_addrs, u64 *pfns) { struct nouveau_fence *fence; unsigned long addr = args->start, nr_dma = 0, i; for (i = 0; addr < args->end; i++) { args->dst[i] = nouveau_dmem_migrate_copy_one(drm, args->src[i], - dma_addrs + nr_dma); + dma_addrs + nr_dma, pfns + i); if (args->dst[i]) nr_dma++; addr += PAGE_SIZE; @@ -607,15 +615,12 @@ static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm, nouveau_fence_new(drm->dmem->migrate.chan, false, ); migrate_vma_pages(args); nouveau_dmem_fence_done(); + nouveau_pfns_map(drm, args->vma->vm_mm, args->start, pfns, i); while (nr_dma--) { dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE, DMA_BIDIRECTIONAL); } - /* -* FIXME optimization: update GPU page table to point to newly migrated -* memory. -*/ migrate_vma_finalize(args); } @@ -632,7 +637,8 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm, .vma= vma, .start = start, }; - unsigned long c, i; + unsigned long i; + u64 *pfns; int ret = -ENOMEM; args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL); @@ -646,19 +652,25 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm, if (!dma_addrs)
Re: [PATCH hmm 9/8] mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()
On 3/12/20 12:33 PM, Jason Gunthorpe wrote: pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no fault is requested then the pfns should be returned with the not valid flags. It should not unconditionally fault if faulting is not requested. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Signed-off-by: Jason Gunthorpe Looks good to me. Reviewed-by: Ralph Campbell --- mm/hmm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Bonus patch, this one got found after I made the series.. diff --git a/mm/hmm.c b/mm/hmm.c index ca33d086bdc190..6d9da4b0f0a9f8 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -226,7 +226,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr, hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags, , _fault); - if (pmd_protnone(pmd) || fault || write_fault) + if (fault || write_fault) return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk); pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 1/8] mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte()
On 3/11/20 11:34 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe Many of the direct returns of error skipped doing the pte_unmap(). All non zero exit paths must unmap the pte. The pte_unmap() is split unnaturally like this because some of the error exit paths trigger a sleep and must release the lock before sleeping. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()") Signed-off-by: Jason Gunthorpe The changes look OK to me but one issue noted below. In any case, you can add: Reviewed-by: Ralph Campbell --- mm/hmm.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index 72e5a6d9a41756..35f85424176d14 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -325,6 +325,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, } /* Report error for everything else */ + pte_unmap(ptep); *pfn = range->values[HMM_PFN_ERROR]; return -EFAULT; } else { @@ -339,10 +340,13 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, if (pte_devmap(pte)) { hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte), hmm_vma_walk->pgmap); - if (unlikely(!hmm_vma_walk->pgmap)) + if (unlikely(!hmm_vma_walk->pgmap)) { + pte_unmap(ptep); return -EBUSY; + } } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) { if (!is_zero_pfn(pte_pfn(pte))) { + pte_unmap(ptep); *pfn = range->values[HMM_PFN_SPECIAL]; return -EFAULT; } @@ -437,7 +441,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, [i]); if (r) { - /* hmm_vma_handle_pte() did unmap pte directory */ + /* hmm_vma_handle_pte() did pte_unmap() */ hmm_vma_walk->last = addr; return r; } I think there is a case where hmm_vma_handle_pte() is called, a fault is requested, pte_unmap() and hmm_vma_walk_hole_() are called, the latter returns zero (the fault was handled OK), but now the page table is unmapped for successive loop iterations and pte_unmap(ptep - 1) is called at the loop end. Since this isn't an issue on x86_64 but is on x86_32, I can see how it could be missed. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 2/8] mm/hmm: don't free the cached pgmap while scanning
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe The pgmap is held in the hmm_vma_walk variable in hope of speeding up future get_dev_pagemap() calls by hitting the same pointer. The algorithm doesn't actually care about how long the pgmap is held for. Move the put of the cached pgmap to after the walk is completed and delete all the other now redundant puts. This solves a possible leak of the reference in hmm_vma_walk_pmd() if a hmm_vma_handle_pte() fails while looping. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 31 +-- 1 file changed, 9 insertions(+), 22 deletions(-) We talked about just deleting this stuff, but I think it makes alot sense for hmm_range_fault() to trigger fault on devmap pages that are not compatible with the caller - so lets just fix the leak on error path for now. diff --git a/mm/hmm.c b/mm/hmm.c index 35f85424176d14..9e8f68eb83287a 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -239,10 +239,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr, } pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags; } - if (hmm_vma_walk->pgmap) { - put_dev_pagemap(hmm_vma_walk->pgmap); - hmm_vma_walk->pgmap = NULL; - } hmm_vma_walk->last = end; return 0; } @@ -360,10 +356,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, return 0; fault: - if (hmm_vma_walk->pgmap) { - put_dev_pagemap(hmm_vma_walk->pgmap); - hmm_vma_walk->pgmap = NULL; - } pte_unmap(ptep); /* Fault any virtual address we were asked to fault */ return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk); @@ -446,16 +438,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return r; } } - if (hmm_vma_walk->pgmap) { - /* -* We do put_dev_pagemap() here and not in hmm_vma_handle_pte() -* so that we can leverage get_dev_pagemap() optimization which -* will not re-take a reference on a pgmap if we already have -* one. -*/ - put_dev_pagemap(hmm_vma_walk->pgmap); - hmm_vma_walk->pgmap = NULL; - } pte_unmap(ptep - 1); hmm_vma_walk->last = addr; @@ -529,10 +511,6 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags; } - if (hmm_vma_walk->pgmap) { - put_dev_pagemap(hmm_vma_walk->pgmap); - hmm_vma_walk->pgmap = NULL; - } hmm_vma_walk->last = end; goto out_unlock; } @@ -694,6 +672,15 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags) return -EBUSY; ret = walk_page_range(mm, hmm_vma_walk.last, range->end, _walk_ops, _vma_walk); + /* +* A pgmap is kept cached in the hmm_vma_walk to avoid expensive +* searching in the probably common case that the pgmap is the +* same for the entire requested range. +*/ + if (hmm_vma_walk.pgmap) { + put_dev_pagemap(hmm_vma_walk.pgmap); + hmm_vma_walk.pgmap = NULL; + } } while (ret == -EBUSY); if (ret) ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 3/8] mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlock
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first. hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock. Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") Cc: Steven Price Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index 9e8f68eb83287a..32dcbfd3908315 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -473,8 +473,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, pud = READ_ONCE(*pudp); if (pud_none(pud)) { - ret = hmm_vma_walk_hole(start, end, -1, walk); - goto out_unlock; + spin_unlock(ptl); + return hmm_vma_walk_hole(start, end, -1, walk); } if (pud_huge(pud) && pud_devmap(pud)) { @@ -483,8 +483,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, bool fault, write_fault; if (!pud_present(pud)) { - ret = hmm_vma_walk_hole(start, end, -1, walk); - goto out_unlock; + spin_unlock(ptl); + return hmm_vma_walk_hole(start, end, -1, walk); } i = (addr - range->start) >> PAGE_SHIFT; @@ -495,9 +495,9 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end, hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags, , _fault); if (fault || write_fault) { - ret = hmm_vma_walk_hole_(addr, end, fault, -write_fault, walk); - goto out_unlock; + spin_unlock(ptl); + return hmm_vma_walk_hole_(addr, end, fault, write_fault, + walk); } pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 5/8] mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULT
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe All return paths that do EFAULT must call hmm_range_need_fault() to determine if the user requires this page to be valid. If the page cannot be made valid if the user later requires it, due to vma flags in this case, then the return should be HMM_PFN_ERROR. Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range") Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 19 --- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index 5f5ccf13dd1e85..e10cd0adba7b37 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -582,18 +582,15 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end, struct vm_area_struct *vma = walk->vma; /* -* Skip vma ranges that don't have struct page backing them or -* map I/O devices directly. -*/ - if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP)) - return -EFAULT; - - /* +* Skip vma ranges that don't have struct page backing them or map I/O +* devices directly. +* * If the vma does not allow read access, then assume that it does not -* allow write access either. HMM does not support architectures -* that allow write without read. +* allow write access either. HMM does not support architectures that +* allow write without read. */ - if (!(vma->vm_flags & VM_READ)) { + if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP)) || + !(vma->vm_flags & VM_READ)) { bool fault, write_fault; /* @@ -607,7 +604,7 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end, if (fault || write_fault) return -EFAULT; - hmm_pfns_fill(start, end, range, HMM_PFN_NONE); + hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); hmm_vma_walk->last = end; /* Skip this vma and continue processing the next vma. */ ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 4/8] mm/hmm: add missing pfns set to hmm_vma_walk_pmd()
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe All success exit paths from the walker functions must set the pfns array. A migration entry with no required fault is a HMM_PFN_NONE return, just like the pte case. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hmm.c b/mm/hmm.c index 32dcbfd3908315..5f5ccf13dd1e85 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -394,7 +394,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, pmd_migration_entry_wait(walk->mm, pmdp); return -EBUSY; } - return 0; + return hmm_pfns_fill(start, end, range, HMM_PFN_NONE); } else if (!pmd_present(pmd)) return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 6/8] mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte()
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe The intention with this code is to determine if the caller required the pages to be valid, and if so, then take some action to make them valid. The action varies depending on the page type. In all cases, if the caller doesn't ask for the page, then hmm_range_fault() should not return an error. Revise the implementation to be clearer, and fix some bugs: - hmm_pte_need_fault() must always be called before testing fault or write_fault otherwise the defaults of false apply and the if()'s don't work. This was missed on the is_migration_entry() branch - -EFAULT should not be returned unless hmm_pte_need_fault() indicates fault is required - ie snapshotting should not fail. - For !pte_present() the cpu_flags are always 0, except in the special case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is confusing. Reorganize the flow so that it always follows the pattern of calling hmm_pte_need_fault() and then checking fault || write_fault. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 35 +++ 1 file changed, 15 insertions(+), 20 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index e10cd0adba7b37..bf676cfef3e8ee 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -282,15 +282,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, if (!pte_present(pte)) { swp_entry_t entry = pte_to_swp_entry(pte); - if (!non_swap_entry(entry)) { - cpu_flags = pte_to_hmm_pfn_flags(range, pte); - hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags, - , _fault); - if (fault || write_fault) - goto fault; - return 0; - } - /* * This is a special swap entry, ignore migration, use * device and report anything else as error. @@ -310,26 +301,30 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, return 0; } - if (is_migration_entry(entry)) { - if (fault || write_fault) { - pte_unmap(ptep); - hmm_vma_walk->last = addr; - migration_entry_wait(walk->mm, pmdp, addr); - return -EBUSY; - } + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, , + _fault); + if (!fault && !write_fault) return 0; + + if (!non_swap_entry(entry)) + goto fault; + + if (is_migration_entry(entry)) { + pte_unmap(ptep); + hmm_vma_walk->last = addr; + migration_entry_wait(walk->mm, pmdp, addr); + return -EBUSY; } /* Report error for everything else */ pte_unmap(ptep); *pfn = range->values[HMM_PFN_ERROR]; return -EFAULT; - } else { - cpu_flags = pte_to_hmm_pfn_flags(range, pte); - hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags, - , _fault); } + cpu_flags = pte_to_hmm_pfn_flags(range, pte); + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags, , + _fault); if (fault || write_fault) goto fault; ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 7/8] mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe hmm_range_fault() should never return 0 if the caller requested a valid page, but the pfns output for that page would be HMM_PFN_ERROR. hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to detect if the page is in faulting mode or not. Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated code. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table") Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 38 +- 1 file changed, 21 insertions(+), 17 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index bf676cfef3e8ee..f61fddf2ef6505 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -363,8 +363,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, { struct hmm_vma_walk *hmm_vma_walk = walk->private; struct hmm_range *range = hmm_vma_walk->range; - uint64_t *pfns = range->pfns; - unsigned long addr = start, i; + uint64_t *pfns = >pfns[(start - range->start) >> PAGE_SHIFT]; + unsigned long npages = (end - start) >> PAGE_SHIFT; + unsigned long addr = start; + bool fault, write_fault; pte_t *ptep; pmd_t pmd; @@ -374,14 +376,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_vma_walk_hole(start, end, -1, walk); if (thp_migration_supported() && is_pmd_migration_entry(pmd)) { - bool fault, write_fault; - unsigned long npages; - uint64_t *pfns; - - i = (addr - range->start) >> PAGE_SHIFT; - npages = (end - addr) >> PAGE_SHIFT; - pfns = >pfns[i]; - hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, , _fault); if (fault || write_fault) { @@ -390,8 +384,15 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return -EBUSY; } return hmm_pfns_fill(start, end, range, HMM_PFN_NONE); - } else if (!pmd_present(pmd)) + } + + if (!pmd_present(pmd)) { + hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, , +_fault); + if (fault || write_fault) + return -EFAULT; return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); Shouldn't this fill with HMM_PFN_NONE instead of HMM_PFN_ERROR? Otherwise, when a THP is swapped out, you will get a different value than if a PTE is swapped out and you are prefetching/snapshotting. + } if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) { /* @@ -408,8 +409,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd)) goto again; - i = (addr - range->start) >> PAGE_SHIFT; - return hmm_vma_handle_pmd(walk, addr, end, [i], pmd); + return hmm_vma_handle_pmd(walk, addr, end, pfns, pmd); } /* @@ -418,15 +418,19 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, * entry pointing to pte directory or it is a bad pmd that will not * recover. */ - if (pmd_bad(pmd)) + if (pmd_bad(pmd)) { + hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, , +_fault); + if (fault || write_fault) + return -EFAULT; return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); + } ptep = pte_offset_map(pmdp, addr); - i = (addr - range->start) >> PAGE_SHIFT; - for (; addr < end; addr += PAGE_SIZE, ptep++, i++) { + for (; addr < end; addr += PAGE_SIZE, ptep++, pfns++) { int r; - r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, [i]); + r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, pfns); if (r) { /* hmm_vma_handle_pte() did pte_unmap() */ hmm_vma_walk->last = addr; ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 8/8] mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling
On 3/11/20 11:35 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses). EFAULT should only be returned after testing with hmm_pte_need_fault(). Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED, pte_special() is stubbed out to return false on unsupported architectures. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- mm/hmm.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/mm/hmm.c b/mm/hmm.c index f61fddf2ef6505..ca33d086bdc190 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -335,16 +335,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, pte_unmap(ptep); return -EBUSY; } - } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) { - if (!is_zero_pfn(pte_pfn(pte))) { + } + + /* +* Since each architecture defines a struct page for the zero page, just +* fall through and treat it like a normal page. +*/ + if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) { + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, , + _fault); + if (fault || write_fault) { pte_unmap(ptep); - *pfn = range->values[HMM_PFN_SPECIAL]; return -EFAULT; } - /* -* Since each architecture defines a struct page for the zero -* page, just fall through and treat it like a normal page. -*/ + *pfn = range->values[HMM_PFN_SPECIAL]; + return 0; } *pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags; ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 2/2] mm: remove device private page support from hmm_range_fault
On 3/16/20 10:52 AM, Christoph Hellwig wrote: No driver has actually used properly wire up and support this feature. There is various code related to it in nouveau, but as far as I can tell it never actually got turned on, and the only changes since the initial commit are global cleanups. This is not actually true. OpenCL 2.x does support SVM with nouveau and device private memory via clEnqueueSVMMigrateMem(). Also, Ben Skeggs has accepted a set of patches to map GPU memory after being migrated and this change would conflict with that. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 37 - drivers/gpu/drm/nouveau/nouveau_dmem.h | 2 -- drivers/gpu/drm/nouveau/nouveau_svm.c | 3 -- include/linux/hmm.h | 2 -- mm/hmm.c| 28 --- 6 files changed, 73 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index dee446278417..90821ce5e6ca 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -776,7 +776,6 @@ struct amdgpu_ttm_tt { static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = { (1 << 0), /* HMM_PFN_VALID */ (1 << 1), /* HMM_PFN_WRITE */ - 0 /* HMM_PFN_DEVICE_PRIVATE */ }; static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = { diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 7605c4c48985..42808efceaf2 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -671,40 +671,3 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm, out: return ret; } - -static inline bool -nouveau_dmem_page(struct nouveau_drm *drm, struct page *page) -{ - return is_device_private_page(page) && drm->dmem == page_to_dmem(page); -} - -void -nouveau_dmem_convert_pfn(struct nouveau_drm *drm, -struct hmm_range *range) -{ - unsigned long i, npages; - - npages = (range->end - range->start) >> PAGE_SHIFT; - for (i = 0; i < npages; ++i) { - struct page *page; - uint64_t addr; - - page = hmm_device_entry_to_page(range, range->pfns[i]); - if (page == NULL) - continue; - - if (!(range->pfns[i] & range->flags[HMM_PFN_DEVICE_PRIVATE])) { - continue; - } - - if (!nouveau_dmem_page(drm, page)) { - WARN(1, "Some unknown device memory !\n"); - range->pfns[i] = 0; - continue; - } - - addr = nouveau_dmem_page_addr(page); - range->pfns[i] &= ((1UL << range->pfn_shift) - 1); - range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift; - } -} diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.h b/drivers/gpu/drm/nouveau/nouveau_dmem.h index 92394be5d649..1ac620b3d4fb 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.h +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.h @@ -38,8 +38,6 @@ int nouveau_dmem_migrate_vma(struct nouveau_drm *drm, unsigned long start, unsigned long end); -void nouveau_dmem_convert_pfn(struct nouveau_drm *drm, - struct hmm_range *range); #else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */ static inline void nouveau_dmem_init(struct nouveau_drm *drm) {} static inline void nouveau_dmem_fini(struct nouveau_drm *drm) {} diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index df9bf1fd1bc0..7e0376dca137 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -367,7 +367,6 @@ static const u64 nouveau_svm_pfn_flags[HMM_PFN_FLAG_MAX] = { [HMM_PFN_VALID ] = NVIF_VMM_PFNMAP_V0_V, [HMM_PFN_WRITE ] = NVIF_VMM_PFNMAP_V0_W, - [HMM_PFN_DEVICE_PRIVATE] = NVIF_VMM_PFNMAP_V0_VRAM, }; static const u64 @@ -558,8 +557,6 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, break; } - nouveau_dmem_convert_pfn(drm, ); - svmm->vmm->vmm.object.client->super = true; ret = nvif_object_ioctl(>vmm->vmm.object, data, size, NULL); svmm->vmm->vmm.object.client->super = false; diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 4bf8d6997b12..5e6034f105c3 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -74,7 +74,6 @@ * Flags: * HMM_PFN_VALID: pfn is valid. It has, at least, read permission. * HMM_PFN_WRITE: CPU page table has write permission set - * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE) * * The driver provides a flags array for mapping page protections to device * PTE bits. If the
Re: [PATCH v2 hmm 0/9] Small hmm_range_fault() cleanups
On 3/23/20 6:14 PM, Jason Gunthorpe wrote: From: Jason Gunthorpe This is v2 of the first simple series with a few additional patches of little adjustments. This needs an additional patch to the hmm tester: diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 033a12c7ab5b6d..da15471a2bbf9a 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -1274,7 +1274,7 @@ TEST_F(hmm2, snapshot) /* Check what the device saw. */ m = buffer->mirror; ASSERT_EQ(m[0], HMM_DMIRROR_PROT_ERROR); - ASSERT_EQ(m[1], HMM_DMIRROR_PROT_NONE); + ASSERT_EQ(m[1], HMM_DMIRROR_PROT_ERROR); ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ); ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ); ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE); v2 changes: - Simplify and rename the flags, rework hmm_vma_walk_test in patch 2 (CH) - Adjust more comments in patch 3 (CH, Ralph) - Put the ugly boolean logic into a function in patch 3 (CH) - Update commit message of patch 4 (CH) - Adjust formatting in patch 5 (CH) Patches 6, 7, 8 are new v1: https://lore.kernel.org/r/20200320164905.21722-1-...@ziepe.ca Jason Gunthorpe (9): mm/hmm: remove pgmap checking for devmap pages mm/hmm: return the fault type from hmm_pte_need_fault() mm/hmm: remove unused code and tidy comments mm/hmm: remove HMM_FAULT_SNAPSHOT mm/hmm: remove the CONFIG_TRANSPARENT_HUGEPAGE #ifdef mm/hmm: use device_private_entry_to_pfn() mm/hmm: do not unconditionally set pfns when returning EBUSY mm/hmm: do not set pfns when returning an error code mm/hmm: return error for non-vma snapshots Documentation/vm/hmm.rst| 12 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +- drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +- include/linux/hmm.h | 109 + mm/hmm.c| 312 ++-- 5 files changed, 133 insertions(+), 304 deletions(-) I was able to recompile Karol Herbst's mesa tree and Jerome's SVM tests to test this with nouveau so for the series you can add, Tested-by: Ralph Campbell ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm v2 5/5] mm/hmm: remove the customizable pfn format from hmm_range_fault
On 5/1/20 11:20 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe Presumably the intent here was that hmm_range_fault() could put the data into some HW specific format and thus avoid some work. However, nothing actually does that, and it isn't clear how anything actually could do that as hmm_range_fault() provides CPU addresses which must be DMA mapped. Perhaps there is some special HW that does not need DMA mapping, but we don't have any examples of this, and the theoretical performance win of avoiding an extra scan over the pfns array doesn't seem worth the complexity. Plus pfns needs to be scanned anyhow to sort out any DEVICE_PRIVATE pages. This version replaces the uint64_t with an usigned long containing a pfn and fixed flags. On input flags is filled with the HMM_PFN_REQ_* values, on successful output it is filled with HMM_PFN_* values, describing the state of the pages. amdgpu is simple to convert, it doesn't use snapshot and doesn't use per-page flags. nouveau uses only 16 hmm_pte entries at most (ie fits in a few cache lines), and it sweeps over its pfns array a couple of times anyhow. It also has a nasty call chain before it reaches the dma map and hardware suggesting performance isn't important: nouveau_svm_fault(): args.i.m.method = NVIF_VMM_V0_PFNMAP nouveau_range_fault() nvif_object_ioctl() client->driver->ioctl() struct nvif_driver nvif_driver_nvkm: .ioctl = nvkm_client_ioctl nvkm_ioctl() nvkm_ioctl_path() nvkm_ioctl_v0[type].func(..) nvkm_ioctl_mthd() nvkm_object_mthd() struct nvkm_object_func nvkm_uvmm: .mthd = nvkm_uvmm_mthd nvkm_uvmm_mthd() nvkm_uvmm_mthd_pfnmap() nvkm_vmm_pfn_map() nvkm_vmm_ptes_get_map() func == gp100_vmm_pgt_pfn struct nvkm_vmm_desc_func gp100_vmm_desc_spt: .pfn = gp100_vmm_pgt_pfn nvkm_vmm_iter() REF_PTES == func == gp100_vmm_pgt_pfn() dma_map_page() Acked-by: Felix Kuehling Tested-by: Ralph Campbell Signed-off-by: Jason Gunthorpe Signed-off-by: Christoph Hellwig --- Documentation/vm/hmm.rst| 26 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 35 ++ drivers/gpu/drm/nouveau/nouveau_dmem.c | 27 +--- drivers/gpu/drm/nouveau/nouveau_dmem.h | 3 +- drivers/gpu/drm/nouveau/nouveau_svm.c | 87 - include/linux/hmm.h | 99 ++- mm/hmm.c| 160 +++- 7 files changed, 192 insertions(+), 245 deletions(-) ...snip... +static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm, + struct hmm_range *range, u64 *ioctl_addr) +{ + unsigned long i, npages; + + /* +* The ioctl_addr prepared here is passed through nvif_object_ioctl() +* to an eventual DMA map in something like gp100_vmm_pgt_pfn() +* +* This is all just encoding the internal hmm reprensetation into a s/reprensetation/representation/ Looks good and still tests OK with nouveau. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 0/5] Adjust hmm_range_fault() API
On 4/21/20 5:21 PM, Jason Gunthorpe wrote: From: Jason Gunthorpe The API is a bit complicated for the uses we actually have, and disucssions for simplifying have come up a number of times. This small series removes the customizable pfn format and simplifies the return code of hmm_range_fault() All the drivers are adjusted to process in the simplified format. I would appreciated tested-by's for the two drivers, thanks! For nouveau you can add: Tested-by: Ralph Campbell ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
On 3/16/20 12:32 PM, Christoph Hellwig wrote: Hmm range fault will succeed for any kind of device private memory, even if it doesn't belong to the calling entity. While nouveau has some crude checks for that, they are broken because they assume nouveau is the only user of device private memory. Fix this by passing in an expected pgmap owner in the hmm_range_fault structure. Signed-off-by: Christoph Hellwig Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE") Looks good. Reviewed-by: Ralph Campbell --- drivers/gpu/drm/nouveau/nouveau_dmem.c | 12 include/linux/hmm.h| 2 ++ mm/hmm.c | 10 +- 3 files changed, 11 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index edfd0805fba4..ad89e09a0be3 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -672,12 +672,6 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm, return ret; } -static inline bool -nouveau_dmem_page(struct nouveau_drm *drm, struct page *page) -{ - return is_device_private_page(page) && drm->dmem == page_to_dmem(page); -} - void nouveau_dmem_convert_pfn(struct nouveau_drm *drm, struct hmm_range *range) @@ -696,12 +690,6 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm, if (!is_device_private_page(page)) continue; - if (!nouveau_dmem_page(drm, page)) { - WARN(1, "Some unknown device memory !\n"); - range->pfns[i] = 0; - continue; - } - addr = nouveau_dmem_page_addr(page); range->pfns[i] &= ((1UL << range->pfn_shift) - 1); range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift; diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 5e6034f105c3..bb6be4428633 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -132,6 +132,7 @@ enum hmm_pfn_value_e { * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT) * @valid: pfns array did not change since it has been fill by an HMM function + * @dev_private_owner: owner of device private pages */ struct hmm_range { struct mmu_interval_notifier *notifier; @@ -144,6 +145,7 @@ struct hmm_range { uint64_tdefault_flags; uint64_tpfn_flags_mask; uint8_t pfn_shift; + void*dev_private_owner; }; /* diff --git a/mm/hmm.c b/mm/hmm.c index cfad65f6a67b..b75b3750e03d 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -216,6 +216,14 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr, unsigned long end, uint64_t *pfns, pmd_t pmd); #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +static inline bool hmm_is_device_private_entry(struct hmm_range *range, + swp_entry_t entry) +{ + return is_device_private_entry(entry) && + device_private_entry_to_page(entry)->pgmap->owner == + range->dev_private_owner; +} + static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte) { if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte)) @@ -254,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, * Never fault in device private pages pages, but just report * the PFN even if not present. */ - if (is_device_private_entry(entry)) { + if (hmm_is_device_private_entry(range, entry)) { *pfn = hmm_device_entry_from_pfn(range, swp_offset(entry)); *pfn |= range->flags[HMM_PFN_VALID]; ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma
On 3/16/20 12:32 PM, Christoph Hellwig wrote: Add a new src_owner field to struct migrate_vma. If the field is set, only device private pages with page->pgmap->owner equal to that field are migrated. If the field is not set only "normal" pages are migrated. Signed-off-by: Christoph Hellwig Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU") When migrating to device private memory, setting the src_owner lets the caller know about source pages that are already migrated and skips pages migrated to a different device similar to pages swapped out an actual swap device. But, it prevents normal pages from being migrated to device private memory. It can still be useful for the driver to know that a page is owned by a different device if it has a device to device way of migrating data. nouveau_dmem_migrate_vma() isn't setting args.src_owner so only normal pages will be migrated which I guess is OK for now since nouveau doesn't handle direct GPU to GPU migration currently. When migrating device private memory to system memory due to a CPU fault, the source page should be the device's device private struct page so if it isn't, then it does make sense to not migrate whatever normal page is there. nouveau_dmem_migrate_to_ram() sets src_owner so this case looks OK. Just had to think this through. Reviewed-by: Ralph Campbell --- arch/powerpc/kvm/book3s_hv_uvmem.c | 1 + drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + include/linux/migrate.h| 8 mm/migrate.c | 9 ++--- 4 files changed, 16 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 67fefb03b9b7..f44f6b27950f 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -563,6 +563,7 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start, mig.end = end; mig.src = _pfn; mig.dst = _pfn; + mig.src_owner = _uvmem_pgmap; mutex_lock(>arch.uvmem_lock); /* The requested page is already paged-out, nothing to do */ diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index a4682272586e..0e36345d395c 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -176,6 +176,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf) .end= vmf->address + PAGE_SIZE, .src= , .dst= , + .src_owner = drm->dev, }; /* diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 72120061b7d4..3e546cbf03dd 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -196,6 +196,14 @@ struct migrate_vma { unsigned long npages; unsigned long start; unsigned long end; + + /* +* Set to the owner value also stored in page->pgmap->owner for +* migrating out of device private memory. If set only device +* private pages with this owner are migrated. If not set +* device private pages are not migrated at all. +*/ + void*src_owner; }; int migrate_vma_setup(struct migrate_vma *args); diff --git a/mm/migrate.c b/mm/migrate.c index b1092876e537..7605d2c23433 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2241,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, arch_enter_lazy_mmu_mode(); for (; addr < end; addr += PAGE_SIZE, ptep++) { - unsigned long mpfn, pfn; + unsigned long mpfn = 0, pfn; struct page *page; swp_entry_t entry; pte_t pte; @@ -2255,8 +2255,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, } if (!pte_present(pte)) { - mpfn = 0; - /* * Only care about unaddressable device page special * page table entry. Other special swap entries are not @@ -2267,11 +2265,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, goto next; page = device_private_entry_to_page(entry); + if (page->pgmap->owner != migrate->src_owner) + goto next; + mpfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; if (is_write_device_private_entry(entry)) mpfn |= MIGRATE_PFN_WRITE; } else { + if (migrate->src_owner) + goto next; pfn = pte_pfn(pte); if (is_zero_pfn(pfn)) {
Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
On 3/17/20 5:59 AM, Christoph Hellwig wrote: On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote: I've been using v7 of Ralph's tester and it is working well - it has DEVICE_PRIVATE support so I think it can test this flow too. Ralph are you able? This hunk seems trivial enough to me, can we include it now? I can send a separate patch for it once the tester covers it. I don't want to add it to the original patch as it is a significant behavior change compared to the existing code. Attached is an updated version of my HMM tests based on linux-5.6.0-rc6. I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM clean ups, and Christoph's 1-4 device private page changes applied. I'm working on getting my nouveau tests running again on a different test machine and will report on that when ready. >From d499fb343bfa9764695ecdcd759fb16bc1ca3c93 Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Tue, 17 Mar 2020 11:10:38 -0700 Subject: [PATCH] mm/hmm/test: add self tests for HMM MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add some basic stand alone self tests for HMM. Signed-off-by: Ralph Campbell Signed-off-by: Jérôme Glisse --- MAINTAINERS|3 + include/uapi/linux/test_hmm.h | 59 ++ lib/Kconfig.debug | 12 + lib/Makefile |1 + lib/test_hmm.c | 1210 + tools/testing/selftests/vm/.gitignore |1 + tools/testing/selftests/vm/Makefile|3 + tools/testing/selftests/vm/config |2 + tools/testing/selftests/vm/hmm-tests.c | 1353 tools/testing/selftests/vm/run_vmtests | 16 + tools/testing/selftests/vm/test_hmm.sh | 97 ++ 11 files changed, 2757 insertions(+) create mode 100644 include/uapi/linux/test_hmm.h create mode 100644 lib/test_hmm.c create mode 100644 tools/testing/selftests/vm/hmm-tests.c create mode 100755 tools/testing/selftests/vm/test_hmm.sh diff --git a/MAINTAINERS b/MAINTAINERS index cc1d18cb5d18..98c80a589a52 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7608,7 +7608,10 @@ L: linux...@kvack.org S: Maintained F: mm/hmm* F: include/linux/hmm* +F: include/uapi/linux/test_hmm* F: Documentation/vm/hmm.rst +F: lib/test_hmm* +F: tools/testing/selftests/vm/*hmm* HOST AP DRIVER M: Jouni Malinen diff --git a/include/uapi/linux/test_hmm.h b/include/uapi/linux/test_hmm.h new file mode 100644 index ..8c5f70c160bf --- /dev/null +++ b/include/uapi/linux/test_hmm.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * This is a module to test the HMM (Heterogeneous Memory Management) API + * of the kernel. It allows a userspace program to expose its entire address + * space through the HMM test module device file. + */ +#ifndef _UAPI_LINUX_HMM_DMIRROR_H +#define _UAPI_LINUX_HMM_DMIRROR_H + +#include +#include + +/* + * Structure to pass to the HMM test driver to mimic a device accessing + * system memory and ZONE_DEVICE private memory through device page tables. + * + * @addr: (in) user address the device will read/write + * @ptr: (in) user address where device data is copied to/from + * @npages: (in) number of pages to read/write + * @cpages: (out) number of pages copied + * @faults: (out) number of device page faults seen + */ +struct hmm_dmirror_cmd { + __u64 addr; + __u64 ptr; + __u64 npages; + __u64 cpages; + __u64 faults; +}; + +/* Expose the address space of the calling process through hmm device file */ +#define HMM_DMIRROR_READ _IOWR('H', 0x00, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd) + +/* + * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT. + * HMM_DMIRROR_PROT_ERROR: no valid mirror PTE for this page + * HMM_DMIRROR_PROT_NONE: unpopulated PTE or PTE with no access + * HMM_DMIRROR_PROT_READ: read-only PTE + * HMM_DMIRROR_PROT_WRITE: read/write PTE + * HMM_DMIRROR_PROT_ZERO: special read-only zero page + * HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL: Migrated device private page on the + * device the ioctl() is made + * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some + * other device + */ +enum { + HMM_DMIRROR_PROT_ERROR = 0xFF, + HMM_DMIRROR_PROT_NONE = 0x00, + HMM_DMIRROR_PROT_READ = 0x01, + HMM_DMIRROR_PROT_WRITE = 0x02, + HMM_DMIRROR_PROT_ZERO = 0x10, + HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL = 0x20, + HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30, +}; + +#endif /* _UAPI_LINUX_HMM_DMIRROR_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 69def4a9df00..4d22ce7879a7 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2162,6 +2162,18 @@ config TEST_MEMINIT If unsure, say N. +config TEST_HMM + tristate &q
Re: [PATCH 2/2] mm: remove device private page support from hmm_range_fault
On 3/17/20 4:56 AM, Jason Gunthorpe wrote: On Mon, Mar 16, 2020 at 01:24:09PM -0700, Ralph Campbell wrote: The reason for it being backwards is that "normally" a device doesn't want the device private page to be faulted back to system memory, it wants to get the device private struct page so it can update its page table to point to the memory already in the device. The "backwards" is you set the flag on input and never get it on output, clear the flag in input and maybe get it on output. Compare with valid or write which don't work that way. Also, a device like Nvidia's GPUs may have an alternate path for copying one GPU's memory to another (nvlink) without going through system memory so getting a device private struct page/PFN from hmm_range_fault() that isn't "owned" by the faulting GPU is useful. I agree that the current code isn't well tested or thought out for multiple devices (rdma, NVMe drives, GPUs, etc.) but it also ties in with peer-to-peer access via PCIe. I think the series here is a big improvement. The GPU driver can set owners that match the hidden cluster interconnect. Jason I agree this is an improvement. I was just thinking about possible future use cases. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
On 3/17/20 12:34 AM, Christoph Hellwig wrote: On Mon, Mar 16, 2020 at 03:49:51PM -0700, Ralph Campbell wrote: On 3/16/20 12:32 PM, Christoph Hellwig wrote: Remove the code to fault device private pages back into system memory that has never been used by any driver. Also replace the usage of the HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple is_device_private_page check in nouveau. Signed-off-by: Christoph Hellwig Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can look at the struct page but what if a driver needs to fault in a page from another device's private memory? Should it call handle_mm_fault()? Obviously no driver cared for that so far. Once we have test cases for that and thus testable code we can add code to fault it in from hmm_vma_handle_pte. I'm OK with the series. I think I would have been less confused if I looked at patch 4 then 3. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
Adding linux-kselft...@vger.kernel.org for the test config question. On 3/19/20 11:17 AM, Jason Gunthorpe wrote: On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote: On 3/17/20 5:59 AM, Christoph Hellwig wrote: On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote: I've been using v7 of Ralph's tester and it is working well - it has DEVICE_PRIVATE support so I think it can test this flow too. Ralph are you able? This hunk seems trivial enough to me, can we include it now? I can send a separate patch for it once the tester covers it. I don't want to add it to the original patch as it is a significant behavior change compared to the existing code. Attached is an updated version of my HMM tests based on linux-5.6.0-rc6. I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM clean ups, and Christoph's 1-4 device private page changes applied. I'd like to get this to mergable, it looks pretty good now, but I have no idea about selftests - and I'm struggling to even compile the tools dir diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 69def4a9df00..4d22ce7879a7 100644 +++ b/lib/Kconfig.debug @@ -2162,6 +2162,18 @@ config TEST_MEMINIT If unsure, say N. +config TEST_HMM + tristate "Test HMM (Heterogeneous Memory Management)" + depends on DEVICE_PRIVATE + select HMM_MIRROR +select MMU_NOTIFIER extra spaces Will fix in v8. In general I wonder if it even makes sense that DEVICE_PRIVATE is user selectable? Should tests enable the feature or the feature enable the test? IMHO, if the feature is being compiled into the kernel, that should enable the menu item for the test. If the feature isn't selected, no need to test it :-) +static int dmirror_fops_open(struct inode *inode, struct file *filp) +{ + struct cdev *cdev = inode->i_cdev; + struct dmirror *dmirror; + int ret; + + /* Mirror this process address space */ + dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL); + if (dmirror == NULL) + return -ENOMEM; + + dmirror->mdevice = container_of(cdev, struct dmirror_device, cdevice); + mutex_init(>mutex); + xa_init(>pt); + + ret = mmu_interval_notifier_insert(>notifier, current->mm, + 0, ULONG_MAX & PAGE_MASK, _min_ops); + if (ret) { + kfree(dmirror); + return ret; + } + + /* Pairs with the mmdrop() in dmirror_fops_release(). */ + mmgrab(current->mm); + dmirror->mm = current->mm; The notifier holds a mmgrab, no need for another one OK. I'll replace dmirror->mm with dmirror->notifier.mm. + /* Only the first open registers the address space. */ + filp->private_data = dmirror; Not sure what this comment means I'll change the comment to: /* * The first open of the device character file registers the address * space of the process doing the open() system call with the device. * Subsequent file opens by other processes will have access to the * first process' address space. */ +static inline struct dmirror_device *dmirror_page_to_device(struct page *page) + +{ + struct dmirror_chunk *devmem; + + devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap); + return devmem->mdevice; +} extra devmem var is not really needed I'll change this to: return container_of(page->pgmap, struct dmirror_chunk, pagemap)->mdevice; + +static bool dmirror_device_is_mine(struct dmirror_device *mdevice, + struct page *page) +{ + if (!is_zone_device_page(page)) + return false; + return page->pgmap->ops == _devmem_ops && + dmirror_page_to_device(page) == mdevice; +} Use new owner stuff, right? Actually this is redunant now, the check should be just WARN_ON pageowner != self owner I'll clean this up. dmirror_device_is_mine() isn't needed now. +static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range) +{ + uint64_t *pfns = range->pfns; + unsigned long pfn; + + for (pfn = (range->start >> PAGE_SHIFT); +pfn < (range->end >> PAGE_SHIFT); +pfn++, pfns++) { + struct page *page; + void *entry; + + /* +* HMM_PFN_ERROR is returned if it is accessing invalid memory +* either because of memory error (hardware detected memory +* corruption) or more likely because of truncate on mmap +* file. +*/ + if (*pfns == range->values[HMM_PFN_ERROR]) + return -EFAULT; Unless that snapshot is use hmm_range_fault() never returns success and sets PFN_ERROR, so this should be
Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
On 3/19/20 5:14 PM, Jason Gunthorpe wrote: On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote: +static int dmirror_fault(struct dmirror *dmirror, unsigned long start, +unsigned long end, bool write) +{ + struct mm_struct *mm = dmirror->mm; + unsigned long addr; + uint64_t pfns[64]; + struct hmm_range range = { + .notifier = >notifier, + .pfns = pfns, + .flags = dmirror_hmm_flags, + .values = dmirror_hmm_values, + .pfn_shift = DPT_SHIFT, + .pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] | + dmirror_hmm_flags[HMM_PFN_WRITE]), Since pfns is not initialized pfn_flags_mask should be 0. Good point. + .default_flags = dmirror_hmm_flags[HMM_PFN_VALID] | + (write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0), + .dev_private_owner = dmirror->mdevice, + }; + int ret = 0; +static int dmirror_snapshot(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + struct mm_struct *mm = dmirror->mm; + unsigned long start, end; + unsigned long size = cmd->npages << PAGE_SHIFT; + unsigned long addr; + unsigned long next; + uint64_t pfns[64]; + unsigned char perm[64]; + char __user *uptr; + struct hmm_range range = { + .pfns = pfns, + .flags = dmirror_hmm_flags, + .values = dmirror_hmm_values, + .pfn_shift = DPT_SHIFT, + .pfn_flags_mask = ~0ULL, Same here, especially since this is snapshot Jason Actually, snapshot ignores pfn_flags_mask and default_flags. In hmm_pte_need_fault(), HMM_FAULT_SNAPSHOT is checked and returns early before checking pfn_flags_mask and default_flags since no faults are being requested. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 0/6] Small hmm_range_fault() cleanups
On 3/20/20 9:48 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe I've had these in my work queue for a bit, nothing profound here, just some small edits for clarity. The hmm tester changes are clear enough but I'm having a bit of trouble figuring out what this series applies cleanly to since I'm trying to apply it on top of the other patches you and Christoph have sent out. Is there a private git tree/branch where everything is applied? Ralph's hmm tester will need a small diff to work after this - which illustrates how setting default_flags == 0 is the same as what was called SNAPSHOT: diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 6ca953926dc13f..5f31f5b3e64cb9 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -300,7 +300,7 @@ static int dmirror_range_fault(struct dmirror *dmirror, range->notifier_seq = mmu_interval_read_begin(range->notifier); down_read(>mmap_sem); - count = hmm_range_fault(range, 0); + count = hmm_range_fault(range); up_read(>mmap_sem); if (count <= 0) { if (count == 0 || count == -EBUSY) @@ -337,8 +337,7 @@ static int dmirror_fault(struct dmirror *dmirror, unsigned long start, .flags = dmirror_hmm_flags, .values = dmirror_hmm_values, .pfn_shift = DPT_SHIFT, - .pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] | - dmirror_hmm_flags[HMM_PFN_WRITE]), + .pfn_flags_mask = 0, .default_flags = dmirror_hmm_flags[HMM_PFN_VALID] | (write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0), .dev_private_owner = dmirror->mdevice, @@ -872,7 +871,7 @@ static int dmirror_range_snapshot(struct dmirror *dmirror, range->notifier_seq = mmu_interval_read_begin(range->notifier); down_read(>mmap_sem); - count = hmm_range_fault(range, HMM_FAULT_SNAPSHOT); + count = hmm_range_fault(range); up_read(>mmap_sem); if (count <= 0) { if (count == 0 || count == -EBUSY) @@ -916,7 +915,7 @@ static int dmirror_snapshot(struct dmirror *dmirror, .flags = dmirror_hmm_flags, .values = dmirror_hmm_values, .pfn_shift = DPT_SHIFT, - .pfn_flags_mask = ~0ULL, + .pfn_flags_mask = 0, .dev_private_owner = dmirror->mdevice, }; int ret = 0; Jason Gunthorpe (6): mm/hmm: remove pgmap checking for devmap pages mm/hmm: return the fault type from hmm_pte_need_fault() mm/hmm: remove unused code and tidy comments mm/hmm: remove HMM_FAULT_SNAPSHOT mm/hmm: remove the CONFIG_TRANSPARENT_HUGEPAGE #ifdef mm/hmm: use device_private_entry_to_pfn() Documentation/vm/hmm.rst| 12 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +- drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +- include/linux/hmm.h | 55 +- mm/hmm.c| 238 +--- 5 files changed, 98 insertions(+), 211 deletions(-) ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 0/6] Small hmm_range_fault() cleanups
On 3/20/20 9:48 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe I've had these in my work queue for a bit, nothing profound here, just some small edits for clarity. Ralph's hmm tester will need a small diff to work after this - which illustrates how setting default_flags == 0 is the same as what was called SNAPSHOT: diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 6ca953926dc13f..5f31f5b3e64cb9 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -300,7 +300,7 @@ static int dmirror_range_fault(struct dmirror *dmirror, range->notifier_seq = mmu_interval_read_begin(range->notifier); down_read(>mmap_sem); - count = hmm_range_fault(range, 0); + count = hmm_range_fault(range); up_read(>mmap_sem); if (count <= 0) { if (count == 0 || count == -EBUSY) @@ -337,8 +337,7 @@ static int dmirror_fault(struct dmirror *dmirror, unsigned long start, .flags = dmirror_hmm_flags, .values = dmirror_hmm_values, .pfn_shift = DPT_SHIFT, - .pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] | - dmirror_hmm_flags[HMM_PFN_WRITE]), + .pfn_flags_mask = 0, .default_flags = dmirror_hmm_flags[HMM_PFN_VALID] | (write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0), .dev_private_owner = dmirror->mdevice, @@ -872,7 +871,7 @@ static int dmirror_range_snapshot(struct dmirror *dmirror, range->notifier_seq = mmu_interval_read_begin(range->notifier); down_read(>mmap_sem); - count = hmm_range_fault(range, HMM_FAULT_SNAPSHOT); + count = hmm_range_fault(range); up_read(>mmap_sem); if (count <= 0) { if (count == 0 || count == -EBUSY) @@ -916,7 +915,7 @@ static int dmirror_snapshot(struct dmirror *dmirror, .flags = dmirror_hmm_flags, .values = dmirror_hmm_values, .pfn_shift = DPT_SHIFT, - .pfn_flags_mask = ~0ULL, + .pfn_flags_mask = 0, .dev_private_owner = dmirror->mdevice, }; int ret = 0; Jason Gunthorpe (6): mm/hmm: remove pgmap checking for devmap pages mm/hmm: return the fault type from hmm_pte_need_fault() mm/hmm: remove unused code and tidy comments mm/hmm: remove HMM_FAULT_SNAPSHOT mm/hmm: remove the CONFIG_TRANSPARENT_HUGEPAGE #ifdef mm/hmm: use device_private_entry_to_pfn() Documentation/vm/hmm.rst| 12 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +- drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +- include/linux/hmm.h | 55 +- mm/hmm.c| 238 +--- 5 files changed, 98 insertions(+), 211 deletions(-) The series looks good to me so, Reviewed-by: Ralph Campbell ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH hmm 3/6] mm/hmm: remove unused code and tidy comments
On 3/20/20 9:49 AM, Jason Gunthorpe wrote: From: Jason Gunthorpe Delete several functions that are never called, fix some desync between comments and structure content, remove an unused ret, and move one function only used by hmm.c into hmm.c Signed-off-by: Jason Gunthorpe Reviewed-by: Ralph Campbell --- include/linux/hmm.h | 50 - mm/hmm.c| 12 +++ 2 files changed, 12 insertions(+), 50 deletions(-) diff --git a/include/linux/hmm.h b/include/linux/hmm.h index bb6be4428633a8..184a8633260f9d 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -120,9 +120,6 @@ enum hmm_pfn_value_e { * * @notifier: a mmu_interval_notifier that includes the start/end * @notifier_seq: result of mmu_interval_read_begin() - * @hmm: the core HMM structure this range is active against - * @vma: the vm area struct for the range - * @list: all range lock are on a list * @start: range virtual start address (inclusive) * @end: range virtual end address (exclusive) * @pfns: array of pfns (big enough for the range) @@ -131,7 +128,6 @@ enum hmm_pfn_value_e { * @default_flags: default flags for the range (write, read, ... see hmm doc) * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT) s/pfn_shifts/pfn_shift - * @valid: pfns array did not change since it has been fill by an HMM function * @dev_private_owner: owner of device private pages */ struct hmm_range { @@ -171,52 +167,6 @@ static inline struct page *hmm_device_entry_to_page(const struct hmm_range *rang return pfn_to_page(entry >> range->pfn_shift); } -/* - * hmm_device_entry_to_pfn() - return pfn value store in a device entry - * @range: range use to decode device entry value - * @entry: device entry to extract pfn from - * Return: pfn value if device entry is valid, -1UL otherwise - */ -static inline unsigned long -hmm_device_entry_to_pfn(const struct hmm_range *range, uint64_t pfn) -{ - if (pfn == range->values[HMM_PFN_NONE]) - return -1UL; - if (pfn == range->values[HMM_PFN_ERROR]) - return -1UL; - if (pfn == range->values[HMM_PFN_SPECIAL]) - return -1UL; - if (!(pfn & range->flags[HMM_PFN_VALID])) - return -1UL; - return (pfn >> range->pfn_shift); -} - -/* - * hmm_device_entry_from_page() - create a valid device entry for a page - * @range: range use to encode HMM pfn value - * @page: page for which to create the device entry - * Return: valid device entry for the page - */ -static inline uint64_t hmm_device_entry_from_page(const struct hmm_range *range, - struct page *page) -{ - return (page_to_pfn(page) << range->pfn_shift) | - range->flags[HMM_PFN_VALID]; -} - -/* - * hmm_device_entry_from_pfn() - create a valid device entry value from pfn - * @range: range use to encode HMM pfn value - * @pfn: pfn value for which to create the device entry - * Return: valid device entry for the pfn - */ -static inline uint64_t hmm_device_entry_from_pfn(const struct hmm_range *range, -unsigned long pfn) -{ - return (pfn << range->pfn_shift) | - range->flags[HMM_PFN_VALID]; -} - /* Don't fault in missing PTEs, just snapshot the current state. */ #define HMM_FAULT_SNAPSHOT(1 << 1) diff --git a/mm/hmm.c b/mm/hmm.c index b4f662eadb7a7c..687d21c675ee60 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -37,6 +37,18 @@ enum { NEED_WRITE_FAULT = 1 << 1, }; +/* + * hmm_device_entry_from_pfn() - create a valid device entry value from pfn + * @range: range use to encode HMM pfn value + * @pfn: pfn value for which to create the device entry + * Return: valid device entry for the pfn + */ +static uint64_t hmm_device_entry_from_pfn(const struct hmm_range *range, + unsigned long pfn) +{ + return (pfn << range->pfn_shift) | range->flags[HMM_PFN_VALID]; +} + static int hmm_pfns_fill(unsigned long addr, unsigned long end, struct hmm_range *range, enum hmm_pfn_value_e value) { ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 2/2] mm: remove device private page support from hmm_range_fault
On 3/16/20 1:09 PM, Jason Gunthorpe wrote: On Mon, Mar 16, 2020 at 07:49:35PM +0100, Christoph Hellwig wrote: On Mon, Mar 16, 2020 at 11:42:19AM -0700, Ralph Campbell wrote: On 3/16/20 10:52 AM, Christoph Hellwig wrote: No driver has actually used properly wire up and support this feature. There is various code related to it in nouveau, but as far as I can tell it never actually got turned on, and the only changes since the initial commit are global cleanups. This is not actually true. OpenCL 2.x does support SVM with nouveau and device private memory via clEnqueueSVMMigrateMem(). Also, Ben Skeggs has accepted a set of patches to map GPU memory after being migrated and this change would conflict with that. Can you explain me how we actually invoke this code? For that we'd need HMM_PFN_DEVICE_PRIVATE NVIF_VMM_PFNMAP_V0_VRAM set in ->pfns before calling hmm_range_fault, which isn't happening. Oh, I got tripped on this too The logic is backwards from what you'd think.. If you *set* HMM_PFN_DEVICE_PRIVATE then this triggers: hmm_pte_need_fault(): if ((cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) { /* Do we fault on device memory ? */ if (pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) { *write_fault = pfns & range->flags[HMM_PFN_WRITE]; *fault = true; } return; } Ie if the cpu page is a DEVICE_PRIVATE and the caller sets HMM_PFN_DEVICE_PRIVATE in the input flags (pfns) then it always faults it and never sets HMM_PFN_DEVICE_PRIVATE in the output flags. So setting 0 enabled device_private support, and nouveau is Ok. AMDGPU is broken because it can't handle device private and can't set the flag to supress it. I was going to try and sort this out as part of getting rid of range->flags Jason The reason for it being backwards is that "normally" a device doesn't want the device private page to be faulted back to system memory, it wants to get the device private struct page so it can update its page table to point to the memory already in the device. Also, a device like Nvidia's GPUs may have an alternate path for copying one GPU's memory to another (nvlink) without going through system memory so getting a device private struct page/PFN from hmm_range_fault() that isn't "owned" by the faulting GPU is useful. I agree that the current code isn't well tested or thought out for multiple devices (rdma, NVMe drives, GPUs, etc.) but it also ties in with peer-to-peer access via PCIe. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 2/2] mm: remove device private page support from hmm_range_fault
On 3/16/20 11:49 AM, Christoph Hellwig wrote: On Mon, Mar 16, 2020 at 11:42:19AM -0700, Ralph Campbell wrote: On 3/16/20 10:52 AM, Christoph Hellwig wrote: No driver has actually used properly wire up and support this feature. There is various code related to it in nouveau, but as far as I can tell it never actually got turned on, and the only changes since the initial commit are global cleanups. This is not actually true. OpenCL 2.x does support SVM with nouveau and device private memory via clEnqueueSVMMigrateMem(). Also, Ben Skeggs has accepted a set of patches to map GPU memory after being migrated and this change would conflict with that. Can you explain me how we actually invoke this code? GPU memory is allocated when the device private memory "struct page" is allocated. See where nouveau_dmem_chunk_alloc() calls nouveau_bo_new(). Then when a page is migrated to the GPU, the GPU memory physical address is just the offset into the "fake" PFN range allocated by devm_request_free_mem_region(). I'm looking into allocating GPU memory at the time of migration instead of when the device private memory struct pages are allocated but that is a future improvement. System memory is migrated to GPU memory: # mesa clEnqueueSVMMigrateMem() svm_migrate_op() q.svm_migrate() pipe->svm_migrate() // really nvc0_svm_migrate() drmCommandWrite() // in libdrm drmIoctl() ioctl() nouveau_drm_ioctl() // nouveau_drm.c drm_ioctl() nouveau_svmm_bind() nouveau_dmem_migrate_vma() migrate_vma_setup() nouveau_dmem_migrate_chunk() nouveau_dmem_migrate_copy_one() // allocate device private struct page dpage = nouveau_dmem_page_alloc_locked() dpage = nouveau_dmem_pages_alloc() // Get GPU VRAM physical address nouveau_dmem_page_addr(dpage) // This does the DMA to GPU memory drm->dmem->migrate.copy_func() migrate_vma_pages() migrate_vma_finalize() Without my recent patch set, there is no GPU page table entry created for this migrated memory so there will be a GPU fault which is handled in a worker thread: nouveau_svm_fault() // examine fault buffer entries and compute range of pages nouveau_range_fault() // This will fill in the pfns array with a device private entry PFN hmm_range_fault() // This sees the range->flags[HMM_PFN_DEVICE_PRIVATE] flag // and converts the HMM PFN to a GPU physical address nouveau_dmem_convert_pfn() // This sets up the GPU page tables nvif_object_ioctl() For that we'd need HMM_PFN_DEVICE_PRIVATE NVIF_VMM_PFNMAP_V0_VRAM set in ->pfns before calling hmm_range_fault, which isn't happening. It is set by hmm_range_fault() via the range->flags[HMM_PFN_DEVICE_PRIVATE] entry when hmm_range_fault() sees a device private struct page. The call to nouveau_dmem_convert_pfn() is just replacing the "fake" PFN with the real PFN but not clearing/changing the read/write or VRAM/system memory PTE bits. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 1/4] memremap: add an owner field to struct dev_pagemap
On 3/16/20 12:32 PM, Christoph Hellwig wrote: Add a new opaque owner field to struct dev_pagemap, which will allow the hmm and migrate_vma code to identify who owns ZONE_DEVICE memory, and refuse to work on mappings not owned by the calling entity. Signed-off-by: Christoph Hellwig This looks like a reasonable approach to take. Reviewed-by: Ralph Campbell --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 ++ drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + include/linux/memremap.h | 4 mm/memremap.c | 4 4 files changed, 11 insertions(+) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 79b1202b1c62..67fefb03b9b7 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -779,6 +779,8 @@ int kvmppc_uvmem_init(void) kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE; kvmppc_uvmem_pgmap.res = *res; kvmppc_uvmem_pgmap.ops = _uvmem_ops; + /* just one global instance: */ + kvmppc_uvmem_pgmap.owner = _uvmem_pgmap; addr = memremap_pages(_uvmem_pgmap, NUMA_NO_NODE); if (IS_ERR(addr)) { ret = PTR_ERR(addr); diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 0ad5d87b5a8e..a4682272586e 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -526,6 +526,7 @@ nouveau_dmem_init(struct nouveau_drm *drm) drm->dmem->pagemap.type = MEMORY_DEVICE_PRIVATE; drm->dmem->pagemap.res = *res; drm->dmem->pagemap.ops = _dmem_pagemap_ops; + drm->dmem->pagemap.owner = drm->dev; if (IS_ERR(devm_memremap_pages(device, >dmem->pagemap))) goto out_free; diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 6fefb09af7c3..60d97e8fd3c0 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -103,6 +103,9 @@ struct dev_pagemap_ops { * @type: memory type: see MEMORY_* in memory_hotplug.h * @flags: PGMAP_* flags to specify defailed behavior * @ops: method table + * @owner: an opaque pointer identifying the entity that manages this + * instance. Used by various helpers to make sure that no + * foreign ZONE_DEVICE memory is accessed. */ struct dev_pagemap { struct vmem_altmap altmap; @@ -113,6 +116,7 @@ struct dev_pagemap { enum memory_type type; unsigned int flags; const struct dev_pagemap_ops *ops; + void *owner; }; static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) diff --git a/mm/memremap.c b/mm/memremap.c index 09b5b7adc773..9b2c97ceb775 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -181,6 +181,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) WARN(1, "Missing migrate_to_ram method\n"); return ERR_PTR(-EINVAL); } + if (!pgmap->owner) { + WARN(1, "Missing owner\n"); + return ERR_PTR(-EINVAL); + } break; case MEMORY_DEVICE_FS_DAX: if (!IS_ENABLED(CONFIG_ZONE_DEVICE) || ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
On 3/16/20 12:32 PM, Christoph Hellwig wrote: Remove the code to fault device private pages back into system memory that has never been used by any driver. Also replace the usage of the HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple is_device_private_page check in nouveau. Signed-off-by: Christoph Hellwig Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can look at the struct page but what if a driver needs to fault in a page from another device's private memory? Should it call handle_mm_fault()? --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 5 +++-- drivers/gpu/drm/nouveau/nouveau_svm.c | 1 - include/linux/hmm.h | 2 -- mm/hmm.c| 25 + 5 files changed, 8 insertions(+), 26 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index dee446278417..90821ce5e6ca 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -776,7 +776,6 @@ struct amdgpu_ttm_tt { static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = { (1 << 0), /* HMM_PFN_VALID */ (1 << 1), /* HMM_PFN_WRITE */ - 0 /* HMM_PFN_DEVICE_PRIVATE */ }; static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = { diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 0e36345d395c..edfd0805fba4 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -28,6 +28,7 @@ #include #include +#include #include #include @@ -692,9 +693,8 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm, if (page == NULL) continue; - if (!(range->pfns[i] & range->flags[HMM_PFN_DEVICE_PRIVATE])) { + if (!is_device_private_page(page)) continue; - } if (!nouveau_dmem_page(drm, page)) { WARN(1, "Some unknown device memory !\n"); @@ -705,5 +705,6 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm, addr = nouveau_dmem_page_addr(page); range->pfns[i] &= ((1UL << range->pfn_shift) - 1); range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift; + range->pfns[i] |= NVIF_VMM_PFNMAP_V0_VRAM; } } diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index df9bf1fd1bc0..39c731a99937 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -367,7 +367,6 @@ static const u64 nouveau_svm_pfn_flags[HMM_PFN_FLAG_MAX] = { [HMM_PFN_VALID ] = NVIF_VMM_PFNMAP_V0_V, [HMM_PFN_WRITE ] = NVIF_VMM_PFNMAP_V0_W, - [HMM_PFN_DEVICE_PRIVATE] = NVIF_VMM_PFNMAP_V0_VRAM, }; static const u64 diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 4bf8d6997b12..5e6034f105c3 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -74,7 +74,6 @@ * Flags: * HMM_PFN_VALID: pfn is valid. It has, at least, read permission. * HMM_PFN_WRITE: CPU page table has write permission set - * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE) * * The driver provides a flags array for mapping page protections to device * PTE bits. If the driver valid bit for an entry is bit 3, @@ -86,7 +85,6 @@ enum hmm_pfn_flag_e { HMM_PFN_VALID = 0, HMM_PFN_WRITE, - HMM_PFN_DEVICE_PRIVATE, HMM_PFN_FLAG_MAX }; diff --git a/mm/hmm.c b/mm/hmm.c index 180e398170b0..cfad65f6a67b 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -118,15 +118,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk, /* We aren't ask to do anything ... */ if (!(pfns & range->flags[HMM_PFN_VALID])) return; - /* If this is device memory then only fault if explicitly requested */ - if ((cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) { - /* Do we fault on device memory ? */ - if (pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) { - *write_fault = pfns & range->flags[HMM_PFN_WRITE]; - *fault = true; - } - return; - } /* If CPU page table is not valid then we need to fault */ *fault = !(cpu_flags & range->flags[HMM_PFN_VALID]); @@ -260,21 +251,15 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, swp_entry_t entry = pte_to_swp_entry(pte); /* -* This is a special swap entry, ignore migration, use -* device and report anything else as error. +* Never fault in device private pages pages, but just report +* the PFN even if not present. */
Re: linux-next: manual merge of the hmm tree with the drm tree
On 7/30/20 5:03 AM, Jason Gunthorpe wrote: On Thu, Jul 30, 2020 at 07:21:10PM +1000, Stephen Rothwell wrote: Hi all, Today's linux-next merge of the hmm tree got a conflict in: drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c between commit: 7763d24f3ba0 ("drm/nouveau/vmm/gp100-: fix mapping 2MB sysmem pages") from the drm tree and commits: 4725c6b82a48 ("nouveau: fix mapping 2MB sysmem pages") 1a77decd0cae ("nouveau: fix storing invalid ptes") from the hmm tree. 7763d24f3ba0 and 4725c6b82a48 are exactly the same patch. Oh? Ralph? What happened here? Ben did email me saying he was planning to take this patch into his nouveau tree and I did reply to him saying you had also taken it into your tree and that I had more nouveau/SVM patches for you on the way. So, I'm not sure what happened. Ben can you drop 7763d24f3ba0 ? Jason ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount
On 6/17/21 8:16 AM, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. v2: AS: merged this patch in linux 5.11 version Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 4 +- include/linux/dax.h| 2 +- include/linux/memremap.h | 7 +-- include/linux/mm.h | 44 - lib/test_hmm.c | 2 +- mm/internal.h | 8 +++ mm/memremap.c | 68 +++--- mm/migrate.c | 5 -- mm/page_alloc.c| 3 ++ mm/swap.c | 45 ++--- 12 files changed, 45 insertions(+), 147 deletions(-) I think it is great that you are picking this up and trying to revive it. However, I have a number of concerns about how it affects existing ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC struct pages and then: dev_dax_fault() dev_dax_huge_fault() __dev_dax_pte_fault() vmf_insert_mixed() which just inserts the PFN into the CPU page tables without increasing the page refcount so it is zero (whereas it was one before). But using get_page() will trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of free verses allocated for these struct pages. I suppose init_page_count() could be called on all the struct pages in dev_dax_probe() to fix that though. I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have clear allocate and free states for backing storage but there are the complications with the page cache references, etc. to consider. The >1 to 1 reference count seems to be used to tell when a page is idle (no I/O, reclaim scanners) rather than free (not allocated to any file) but I'm not 100% sure about that since I don't really understand all the issues around why a file system needs to have a DAX mount option besides knowing that the storage block size has to be a multiple of the page size.
Re: [PATCH v3 2/8] mm: remove extra ZONE_DEVICE struct page refcount
On 6/28/21 9:46 AM, Felix Kuehling wrote: Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell: On 6/17/21 8:16 AM, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. v2: AS: merged this patch in linux 5.11 version Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 4 +- include/linux/dax.h | 2 +- include/linux/memremap.h | 7 +-- include/linux/mm.h | 44 - lib/test_hmm.c | 2 +- mm/internal.h | 8 +++ mm/memremap.c | 68 +++--- mm/migrate.c | 5 -- mm/page_alloc.c | 3 ++ mm/swap.c | 45 ++--- 12 files changed, 45 insertions(+), 147 deletions(-) I think it is great that you are picking this up and trying to revive it. However, I have a number of concerns about how it affects existing ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users and I don't see this addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC struct pages and then: dev_dax_fault() dev_dax_huge_fault() __dev_dax_pte_fault() vmf_insert_mixed() which just inserts the PFN into the CPU page tables without increasing the page refcount so it is zero (whereas it was one before). But using get_page() will trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of free verses allocated for these struct pages. I suppose init_page_count() could be called on all the struct pages in dev_dax_probe() to fix that though. Hi Ralph, For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So I'm not sure what the reference counting is good for in this case. Alex is going to add free_zone_device_page support for DEVICE_GENERIC pages (patch 8 of this series). However, even then, it only does anything if there is an actual call to put_page. Where would that call come from in the dev_dax driver case? Correct, the drivers/dax/device.c driver allocates MEMORY_DEVICE_GENERIC struct pages and doesn't seem to allocate/free the page nor increment/decrement the reference count but it does call vmf_insert_mixed() if the /dev/file is mmap()'ed into a user process' address space. If devm_memremap_pages() returns the array of ZONE_DEVICE struct pages initialized with a reference count of zero, then the CPU page tables will have a PTE/PFN that points to a struct page with a zero reference count. This goes against the normal expectation in the rest of the mm code that assumes a page mapped by a CPU has a non-zero reference count. So yes, nothing "bad" happens because put_page() isn't called but the reference count will differ from other drivers that call vmf_insert_mixed() or vm_insert_page() where the page was allocated with alloc_pages() or similar. I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have clear allocate and free states for backing storage but there are the complications with the page cache references, etc. to consider. The >1 to 1 reference count seems to be used to tell when a page is idle (no I/O, reclaim scanners) rather than free (not allocated to any file) but I'm not 100% sure about that since I don't really understand all the issues around why a file system needs to have a DAX mount option besides knowing that the storage block size has to be a multiple of the page size. The only thing that happens in free_zone_device_page for FS_DAX pages is wake_up_var(>_refcount). I guess, whoever is waiting for this wake-up will need to be prepared to see a refcount 0 instead of 1 now. I see these callers that would need to be updated: ./fs/ext4/inode.c: error = ___wait_var_event(>_refcount, ./fs/ext4/inode.c- atomic_read(>_refcount) == 1, ./fs/ext4/inode.c- TASK_INTERRUPTIBLE, 0, 0, ./fs/ext4/inode.c- ext4_wait_dax_page(ei)); -- ./fs/fuse/dax.c: return ___wait_var_event(>_refcount, ./fs/fuse/dax.c- atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE, ./fs/fuse/dax.c- 0, 0, fuse_wait_dax_page(inode)); -- ./fs/xfs/xfs_file.c: return ___wait_var_event(>_refcount, ./fs/xfs/xfs_file.c- atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE, ./fs/xfs/xfs_file.c- 0, 0, xfs_wait_dax_page(inode)); Regarding your page-cache comment, doesn't DAX mean that those file pages a
RE: [PATCH v3 6/8] mm: Selftests for exclusive device memory
> From: Alistair Popple > Sent: Thursday, February 25, 2021 11:19 PM > To: linux...@kvack.org; nouv...@lists.freedesktop.org; > bske...@redhat.com; a...@linux-foundation.org > Cc: linux-...@vger.kernel.org; linux-ker...@vger.kernel.org; dri- > de...@lists.freedesktop.org; John Hubbard ; Ralph > Campbell ; jgli...@redhat.com; Jason Gunthorpe > ; h...@infradead.org; dan...@ffwll.ch; Alistair Popple > > Subject: [PATCH v3 6/8] mm: Selftests for exclusive device memory > > Adds some selftests for exclusive device memory. > > Signed-off-by: Alistair Popple One minor nit below, but you can add Tested-by: Ralph Campbell Reviewed-by: Ralph Campbell > +static int dmirror_exclusive(struct dmirror *dmirror, > + struct hmm_dmirror_cmd *cmd) > +{ > + unsigned long start, end, addr; > + unsigned long size = cmd->npages << PAGE_SHIFT; > + struct mm_struct *mm = dmirror->notifier.mm; > + struct page *pages[64]; > + struct dmirror_bounce bounce; > + unsigned long next; > + int ret; > + > + start = cmd->addr; > + end = start + size; > + if (end < start) > + return -EINVAL; > + > + /* Since the mm is for the mirrored process, get a reference first. */ > + if (!mmget_not_zero(mm)) > + return -EINVAL; > + > + mmap_read_lock(mm); > + for (addr = start; addr < end; addr = next) { > + int i, mapped; > + > + if (end < addr + (64 << PAGE_SHIFT)) > + next = end; > + else > + next = addr + (64 << PAGE_SHIFT); I suggest using ARRAY_SIZE(pages) instead of '64' to make the meaning clear. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
RE: [PATCH v3 5/8] mm: Device exclusive memory access
> From: Alistair Popple > Sent: Thursday, February 25, 2021 11:18 PM > To: linux...@kvack.org; nouv...@lists.freedesktop.org; > bske...@redhat.com; a...@linux-foundation.org > Cc: linux-...@vger.kernel.org; linux-ker...@vger.kernel.org; dri- > de...@lists.freedesktop.org; John Hubbard ; Ralph > Campbell ; jgli...@redhat.com; Jason Gunthorpe > ; h...@infradead.org; dan...@ffwll.ch; Alistair Popple > > Subject: [PATCH v3 5/8] mm: Device exclusive memory access > > Some devices require exclusive write access to shared virtual memory (SVM) > ranges to perform atomic operations on that memory. This requires CPU page > tables to be updated to deny access whilst atomic operations are occurring. > > In order to do this introduce a new swap entry type (SWP_DEVICE_EXCLUSIVE). > When a SVM range needs to be marked for exclusive access by a device all page > table mappings for the particular range are replaced with device exclusive > swap > entries. This causes any CPU access to the page to result in a fault. > > Faults are resovled by replacing the faulting entry with the original > mapping. This > results in MMU notifiers being called which a driver uses to update access > permissions such as revoking atomic access. After notifiers have been called > the > device will no longer have exclusive access to the region. > > Signed-off-by: Alistair Popple > --- > Documentation/vm/hmm.rst | 15 > include/linux/rmap.h | 3 + > include/linux/swap.h | 4 +- > include/linux/swapops.h | 44 ++- > mm/hmm.c | 5 ++ > mm/memory.c | 108 +- > mm/mprotect.c| 8 ++ > mm/page_vma_mapped.c | 9 ++- > mm/rmap.c| 163 +++ > 9 files changed, 352 insertions(+), 7 deletions(-) ... > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, > + unsigned long end, struct page **pages) { > + long npages = (end - start) >> PAGE_SHIFT; > + long i; Nit: you should use unsigned long for 'i' and 'npages' to match start/end. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v4 2/8] mm/swapops: Rework swap entry manipulation code
On 3/3/21 10:16 PM, Alistair Popple wrote: Both migration and device private pages use special swap entries that are manipluated by a range of inline functions. The arguments to these are somewhat inconsitent so rework them to remove flag type arguments and to make the arguments similar for both read and write entry creation. Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig Reviewed-by: Jason Gunthorpe Looks good to me too. Reviewed-by: Ralph Campbell ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v4 1/8] mm: Remove special swap entry functions
On 3/3/21 10:16 PM, Alistair Popple wrote: Remove the migration and device private entry_to_page() and entry_to_pfn() inline functions and instead open code them directly. This results in shorter code which is easier to understand. Signed-off-by: Alistair Popple Looks OK to me. Reviewed-by: Ralph Campbell --- v4: * Added pfn_swap_entry_to_page() * Reinstated check that migration entries point to locked pages * Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0 builds --- arch/s390/mm/pgtable.c | 2 +- fs/proc/task_mmu.c | 23 +- include/linux/swap.h| 4 +-- include/linux/swapops.h | 69 ++--- mm/hmm.c| 5 ++- mm/huge_memory.c| 4 +-- mm/memcontrol.c | 2 +- mm/memory.c | 10 +++--- mm/migrate.c| 6 ++-- mm/page_vma_mapped.c| 6 ++-- 10 files changed, 50 insertions(+), 81 deletions(-) diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 18205f851c24..aae001096c46 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) if (!non_swap_entry(entry)) dec_mm_counter(mm, MM_SWAPENTS); else if (is_migration_entry(entry)) { - struct page *page = migration_entry_to_page(entry); + struct page *page = pfn_swap_entry_to_page(entry)); dec_mm_counter(mm, mm_counter(page)); } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 602e3a52884d..7c75af0fc423 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, } else { mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; } - } else if (is_migration_entry(swpent)) - page = migration_entry_to_page(swpent); - else if (is_device_private_entry(swpent)) - page = device_private_entry_to_page(swpent); + } else if (is_pfn_swap_entry(swpent)) + page = pfn_swap_entry_to_page(swpent); } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { page = xa_load(>vm_file->f_mapping->i_pages, @@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, swp_entry_t entry = pmd_to_swp_entry(*pmd); if (is_migration_entry(entry)) - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); } if (IS_ERR_OR_NULL(page)) return; @@ -691,10 +689,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, } else if (is_swap_pte(*pte)) { swp_entry_t swpent = pte_to_swp_entry(*pte); - if (is_migration_entry(swpent)) - page = migration_entry_to_page(swpent); - else if (is_device_private_entry(swpent)) - page = device_private_entry_to_page(swpent); + if (is_pfn_swap_entry(swpent)) + page = pfn_swap_entry_to_page(swpent); } if (page) { int mapcount = page_mapcount(page); @@ -1382,11 +1378,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, frame = swp_type(entry) | (swp_offset(entry) << MAX_SWAPFILES_SHIFT); flags |= PM_SWAP; - if (is_migration_entry(entry)) - page = migration_entry_to_page(entry); - - if (is_device_private_entry(entry)) - page = device_private_entry_to_page(entry); + if (is_pfn_swap_entry(entry)) + page = pfn_swap_entry_to_page(entry); } if (page && !PageAnon(page)) @@ -1443,7 +1436,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pmd_swp_soft_dirty(pmd)) flags |= PM_SOFT_DIRTY; VM_BUG_ON(!is_pmd_migration_entry(pmd)); - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); } #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 596bc2f4d9b0..57a7690966a4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -519,8 +519,8 @@ static inline void show_swap_cache_info(void) { } -#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));}) -#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e)
Re: [PATCH v4 5/8] mm: Device exclusive memory access
On 3/3/21 10:16 PM, Alistair Popple wrote: Some devices require exclusive write access to shared virtual memory (SVM) ranges to perform atomic operations on that memory. This requires CPU page tables to be updated to deny access whilst atomic operations are occurring. In order to do this introduce a new swap entry type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive access by a device all page table mappings for the particular range are replaced with device exclusive swap entries. This causes any CPU access to the page to result in a fault. Faults are resovled by replacing the faulting entry with the original mapping. This results in MMU notifiers being called which a driver uses to update access permissions such as revoking atomic access. After notifiers have been called the device will no longer have exclusive access to the region. Signed-off-by: Alistair Popple I see in the next two patches how make_device_exclusive_entry() and check_device_exclusive_range() are used. This points out a similar problem that migrate_vma_setup() had before I added the mmu_notifier_range_init_migrate() helper to pass a cookie from migrate_vma_setup() to the invalidation callback so the device driver could ignore an invalidation callback triggered by the caller and thus resulting in a deadlock or having to invalidate device PTEs that wouldn't be migrating. I think you can eliminate the need for check_device_exclusive_range() in the same way by adding a "void *" pointer to make_device_exclusive_entry() and passing that through to try_to_protect(), setting rmap_walk_control rwc.arg and then passing arg to mmu_notifier_range_init_migrate(). Although, maybe it would be better to define a new mmu_notifier_range_init_exclusive() and event type MMU_NOTIFY_EXCLUSIVE so that a device driver can revoke atomic/exclusive access but keep read/write access to other parts of the page. I thought about how make_device_exclusive_entry() is similar to hmm_range_fault() and whether it would be possible to add a new HMM_PFN_REQ_EXCLUSIVE flag but I see that make_device_exclusive_entry() returns the pages locked and with an additional get_page() reference. This doesn't fit well with the other hmm_range_fault() entries being returned as a "snapshot" so having a different API makes sense. I think it would be useful to add a HMM_PFN_EXCLUSIVE flag so that snapshots of the page tables can at least report that a page is exclusively being accessed by *some* device. Unfortunately, there is no pgmap pointer to be able to tell which device has exclusive access (since any struct page could be exclusively accessed, not just device private ones). ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v4 3/8] mm/rmap: Split try_to_munlock from try_to_unmap
On 3/3/21 10:16 PM, Alistair Popple wrote: The behaviour of try_to_unmap_one() is difficult to follow because it performs different operations based on a fairly large set of flags used in different combinations. TTU_MUNLOCK is one such flag. However it is exclusively used by try_to_munlock() which specifies no other flags. Therefore rather than overload try_to_unmap_one() with unrelated behaviour split this out into it's own function and remove the flag. Signed-off-by: Alistair Popple Looks good to me. Reviewed-by: Ralph Campbell ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v4 4/8] mm/rmap: Split migration into its own function
On 3/3/21 10:16 PM, Alistair Popple wrote: Migration is currently implemented as a mode of operation for try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE. However it does not have much in common with the rest of the unmap functionality of try_to_unmap_one() and thus splitting it into a separate function reduces the complexity of try_to_unmap_one() making it more readable. Several simplifications can also be made in try_to_migrate_one() based on the following observations: - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK. - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON. - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH. TTU_SPLIT_FREEZE is a special case of migration used when splitting an anonymous page. This is most easily dealt with by calling the correct function from unmap_page() in mm/huge_memory.c - either try_to_migrate() for PageAnon or try_to_unmap(). Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig Looks reasonable to me. I do worry a bit about code duplication. At some point in the future, it might be discovered that other combinations of TTU_XXX flags are needed in which case a careful check of try_to_migrate() and try_to_unmap() will be needed. Reviewed-by: Ralph Campbell ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
RE: [PATCH v3 6/8] mm: Selftests for exclusive device memory
> From: Jason Gunthorpe > Sent: Monday, March 1, 2021 9:56 AM > To: Alistair Popple > Cc: linux...@kvack.org; nouv...@lists.freedesktop.org; > bske...@redhat.com; a...@linux-foundation.org; linux-...@vger.kernel.org; > linux-ker...@vger.kernel.org; dri-devel@lists.freedesktop.org; John Hubbard > ; Ralph Campbell ; > jgli...@redhat.com; h...@infradead.org; dan...@ffwll.ch > Subject: Re: [PATCH v3 6/8] mm: Selftests for exclusive device memory > > On Fri, Feb 26, 2021 at 06:18:30PM +1100, Alistair Popple wrote: > > Adds some selftests for exclusive device memory. > > > > Signed-off-by: Alistair Popple > > --- > > lib/test_hmm.c | 124 ++ > > lib/test_hmm_uapi.h| 2 + > > tools/testing/selftests/vm/hmm-tests.c | 219 + > > 3 files changed, 345 insertions(+) > > Please get Ralph to review this, otherwise: > > Acked-by: Jason Gunthorpe > > Jason I'm working on it. Thanks for encouragement. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
On 8/17/21 5:35 PM, Felix Kuehling wrote: Am 2021-08-17 um 8:01 p.m. schrieb Ralph Campbell: On 8/12/21 11:31 PM, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. v2: AS: merged this patch in linux 5.11 version v5: AS: add condition at try_grab_page to check for the zone device type, while page ref counter is checked less/equal to zero. In case of device zone, pages ref counter are initialized to zero. Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 4 +- include/linux/dax.h | 2 +- include/linux/memremap.h | 7 +-- include/linux/mm.h | 13 + lib/test_hmm.c | 2 +- mm/internal.h | 8 +++ mm/memremap.c | 68 +++--- mm/migrate.c | 5 -- mm/page_alloc.c | 3 ++ mm/swap.c | 45 ++--- 12 files changed, 46 insertions(+), 115 deletions(-) I haven't seen a response to the issues I raised back at v3 of this series. https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc2...@nvidia.com/ Did I miss something? I think part of the response was that we did more testing. Alex added support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE about a zero page refcount in try_get_page. The fix is in the latest version of patch 2. But it's already obsolete because John Hubbard is about to remove that function altogether. I think the issues you raised were more uncertainty than known bugs. It seems the fact that you can have DAX pages with 0 refcount is a feature more than a bug. Regards, Felix Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined? In that case, mmap() of a DAX device will call insert_page() which calls get_page() which would trigger VM_BUG_ON_PAGE(). I can believe it is OK for PTE_SPECIAL page table entries to have no struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with a zero reference count using insert_pfn(). I find it hard to believe that other MM developers don't see an issue with a struct page with refcount == 0 and mapcount == 1. I don't see where init_page_count() is being called for the MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD driver allocates and passes to migrate_vma_setup(). Looks like svm_migrate_get_vram_page() needs to call init_page_count() instead of get_page(). (I'm looking at branch origin/alexsierrag/device_generic https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git) Also, what about the other places where is_device_private_page() is called? Don't they need to be updated to call is_device_page() instead? One of my goals for this patch was to remove special casing reference counts for ZONE_DEVICE pages in rmap.c, etc. I still think this patch needs an ACK from a FS/DAX maintainer.
Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount
On 8/12/21 11:31 PM, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. v2: AS: merged this patch in linux 5.11 version v5: AS: add condition at try_grab_page to check for the zone device type, while page ref counter is checked less/equal to zero. In case of device zone, pages ref counter are initialized to zero. Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 4 +- include/linux/dax.h| 2 +- include/linux/memremap.h | 7 +-- include/linux/mm.h | 13 + lib/test_hmm.c | 2 +- mm/internal.h | 8 +++ mm/memremap.c | 68 +++--- mm/migrate.c | 5 -- mm/page_alloc.c| 3 ++ mm/swap.c | 45 ++--- 12 files changed, 46 insertions(+), 115 deletions(-) I haven't seen a response to the issues I raised back at v3 of this series. https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc2...@nvidia.com/ Did I miss something?
Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount
On 8/25/21 4:15 AM, Vlastimil Babka wrote: On 8/25/21 05:48, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. That's certainly welcome. I just wonder what was the reason to use 1 in the first place and why it's no longer necessary? I'm sure this is a long story that I don't know most of the history. I'm guessing that ZONE_DEVICE started out with a reference count of one since that is what most "normal" struct page pages start with. Then put_page() is used to free newly initialized struct pages to the slab/slob/slub page allocator. This made it easy to call get_page()/put_page() on ZONE_DEVICE pages since get_page() asserts that the caller has a reference. However, most drivers that create ZONE_DEVICE struct pages just insert a PTE into the user page tables and don't increment/decrement the reference count. MEMORY_DEVICE_FS_DAX used the >1 to 1 reference count transition to signal that a page was idle so that made put_page() a bit more complex. Then MEMORY_DEVICE_PRIVATE pages were added and this special casing of what "idle" meant got more complicated and more parts of mm had to check for is_device_private_page(). My goal was to make ZONE_DEVICE struct pages reference counts be zero based and allocated/freed similar to the page allocator so that more of the mm code could be used, like THP ZONE_DEVICE pages, without special casing ZONE_DEVICE.
Re: [PATCH v2 02/12] mm: remove extra ZONE_DEVICE struct page refcount
On 9/13/21 9:15 AM, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra Reviewed-by: Christoph Hellwig --- v2: AS: merged this patch in linux 5.11 version v5: AS: add condition at try_grab_page to check for the zone device type, while page ref counter is checked less/equal to zero. In case of device zone, pages ref counter are initialized to zero. v7: AS: fix condition at try_grab_page added at v5, is invalid. It supposed to fix xfstests/generic/413 test, however, there's a known issue on this test where DAX mapped area DIO to non-DAX expect to fail. https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com This condition was removed after rebase over patch series https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 4 +- include/linux/dax.h| 2 +- include/linux/memremap.h | 7 +-- include/linux/mm.h | 11 lib/test_hmm.c | 2 +- mm/internal.h | 8 +++ mm/memremap.c | 69 +++--- mm/migrate.c | 5 -- mm/page_alloc.c| 3 ++ mm/swap.c | 45 ++--- 12 files changed, 45 insertions(+), 115 deletions(-) I don't see mm/memcontrol.c listed here so I think you will need this fragment too: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 389b5766e74f..0cc9c7c71e79 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5541,11 +5541,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, */ if (is_device_private_entry(ent)) { page = pfn_swap_entry_to_page(ent); - /* -* MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have -* a refcount of 1 when free (unlike normal page) -*/ - if (!page_ref_add_unless(page, 1, 1)) + if (!get_page_unless_zero(page)) return NULL; return page; }
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On 10/14/21 10:06 AM, Jason Gunthorpe wrote: On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote: From: Ralph Campbell ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE. Signed-off-by: Ralph Campbell Signed-off-by: Alex Sierra Reviewed-by: Christoph Hellwig --- v2: AS: merged this patch in linux 5.11 version v5: AS: add condition at try_grab_page to check for the zone device type, while page ref counter is checked less/equal to zero. In case of device zone, pages ref counter are initialized to zero. v7: AS: fix condition at try_grab_page added at v5, is invalid. It supposed to fix xfstests/generic/413 test, however, there's a known issue on this test where DAX mapped area DIO to non-DAX expect to fail. https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-send-email-xz...@redhat.com This condition was removed after rebase over patch series https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com --- arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +- drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +- fs/dax.c | 4 +- include/linux/dax.h| 2 +- include/linux/memremap.h | 7 +-- include/linux/mm.h | 11 lib/test_hmm.c | 2 +- mm/internal.h | 8 +++ mm/memcontrol.c| 6 +-- mm/memremap.c | 69 +++--- mm/migrate.c | 5 -- mm/page_alloc.c| 3 ++ mm/swap.c | 45 ++--- 13 files changed, 46 insertions(+), 120 deletions(-) Has anyone tested this with FSDAX? Does get_user_pages() on fsdax backed memory still work? I ran xfstests-dev using the kernel boot option to "fake" a pmem device when I first posted this patch. The tests ran OK (or at least the same tests passed with and without my patch). However, I could never really convince myself the changes were "OK" for fsdax since I didn't understand the code that well. I would still like to see a xfsdax maintainer or expert ACK this change. https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git What refcount value does the struct pages have when they are installed in the PTEs? Remember a 0 refcount will make all the get_user_pages() fail. I'm looking at the call path starting in ext4_punch_hole() and I would expect to see something manipulating the page ref count before the ext4_break_layouts() call path gets to the dax_page_unused() test. All I see is we go into unmap_mapping_pages() - that would normally put back the page references held by PTEs but insert_pfn() has this: if (pfn_t_devmap(pfn)) entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); And: static inline pte_t pte_mkdevmap(pte_t pte) { return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP); } Which interacts with vm_normal_page(): if (pte_devmap(pte)) return NULL; To disable that refcounting? So... I have a feeling this will have PTEs pointing to 0 refcount pages? Unless FSDAX is !pte_devmap which is not the case, right? This seems further confirmed by this comment: /* * If we race get_user_pages_fast() here either we'll see the * elevated page count in the iteration and wait, or * get_user_pages_fast() will see that the page it took a reference * against is no longer mapped in the page tables and bail to the * get_user_pages() slow path. The slow path is protected by * pte_lock() and pmd_lock(). New references are not taken without * holding those locks, and unmap_mapping_pages() will not zero the * pte or pmd without holding the respective lock, so we are * guaranteed to either see new references or prevent new * references from being established. */ Which seems to explain this scheme relies on unmap_mapping_pages() to fence GUP_fast, not on GUP_fast observing 0 refcounts when it should stop. This seems like it would be properly fixed by using normal page refcounting for PTEs - ie stop using special for these pages? Does anyone know why devmap is pte_special anyhow? +void free_zone_device_page(struct page *page) +{ + switch (page->pgmap->type) { + case MEMORY_DEVICE_PRIVATE: + free_device_page(page); + return; + case MEMORY_DEVICE_FS_DAX: + /* notify page idle */ + wake_up_var(>_refcount); + return; It is not for this series, but I wonder if we should ju
Re: [PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On 10/14/21 11:01 AM, Jason Gunthorpe wrote: On Thu, Oct 14, 2021 at 10:35:27AM -0700, Ralph Campbell wrote: I ran xfstests-dev using the kernel boot option to "fake" a pmem device when I first posted this patch. The tests ran OK (or at least the same tests passed with and without my patch). Hmm. I know nothing of xfstests but tests/generic/413 Looks kind of like it might cover this situation? Did it run for you? Jason I don't remember. I'll have to rerun the test which might take a day or two to set up again.
Re: [PATCH] mm/migrate.c: Remove MIGRATE_PFN_LOCKED
On 10/24/21 21:16, Alistair Popple wrote: MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't then the a second attempt is made to lock the page. However if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Signed-off-by: Alistair Popple You can add: Reviewed-by: Ralph Campbell --- Documentation/vm/hmm.rst | 2 +- arch/powerpc/kvm/book3s_hv_uvmem.c | 4 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 - drivers/gpu/drm/nouveau/nouveau_dmem.c | 4 +- include/linux/migrate.h | 1 - lib/test_hmm.c | 5 +- mm/migrate.c | 145 +-- 7 files changed, 35 insertions(+), 128 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index a14c2938e7af..f2a59ed82ed3 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -360,7 +360,7 @@ between device driver specific code and shared common code: system memory page, locks the page with ``lock_page()``, and fills in the ``dst`` array entry with:: - dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + dst[i] = migrate_pfn(page_to_pfn(dpage)); Now that the driver knows that this page is being migrated, it can invalidate device private MMU mappings and copy device private memory diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index a7061ee3b157..28c436df9935 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -560,7 +560,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma, gpa, 0, page_shift); if (ret == U_SUCCESS) - *mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED; + *mig.dst = migrate_pfn(pfn); else { unlock_page(dpage); __free_page(dpage); @@ -774,7 +774,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, } } - *mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + *mig.dst = migrate_pfn(page_to_pfn(dpage)); migrate_vma_pages(); out_finalize: migrate_vma_finalize(); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 4a16e3c257b9..41d9417f182b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -300,7 +300,6 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange, migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]); svm_migrate_get_vram_page(prange, migrate->dst[i]); migrate->dst[i] = migrate_pfn(migrate->dst[i]); - migrate->dst[i] |= MIGRATE_PFN_LOCKED; src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_TO_DEVICE); r = dma_mapping_error(dev, src[i]); @@ -580,7 +579,6 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange, dst[i] >> PAGE_SHIFT, page_to_pfn(dpage)); migrate->dst[i] = migrate_pfn(page_to_pfn(dpage)); - migrate->dst[i] |= MIGRATE_PFN_LOCKED; j++; } diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 92987daa5e17..3828aafd3ac4 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -166,7 +166,7 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm, goto error_dma_unmap; mutex_unlock(>mutex); - args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + args->dst[0] = migrate_pfn(page_to_pfn(dpage)); return 0; error_dma_unmap: @@ -602,7 +602,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm, ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT); if (src & MIGRATE_PFN_WRITE) *pfn |= NVIF_VMM_PFNMAP_V0_W; - return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + return migrate_pfn(page_to_pfn(dpage)); out_dma_unmap: dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL); diff --git a/include/linux/migrate.h b/include/linux/migrate.h index c8077e936691..479b861ae490 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -119,7 +119,6 @@ static inline int migrate_misplaced_page(
Re: [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount
On 2/6/22 22:32, Christoph Hellwig wrote: ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell . Signed-off-by: Christoph Hellwig Thanks for working on this, definite step forward. Reviewed-by: Ralph Campbell
Re: [PATCH] nouveau/svm: Fix to migrate all requested pages
On 7/19/22 23:27, Alistair Popple wrote: Users may request that pages from an OpenCL SVM allocation be migrated to the GPU with clEnqueueSVMMigrateMem(). In Nouveau this will call into nouveau_dmem_migrate_vma() to do the migration. If the total range to be migrated exceeds SG_MAX_SINGLE_ALLOC the pages will be migrated in chunks of size SG_MAX_SINGLE_ALLOC. However a typo in updating the starting address means that only the first chunk will get migrated. Fix the calculation so that the entire range will get migrated if possible. Signed-off-by: Alistair Popple Fixes: e3d8b0890469 ("drm/nouveau/svm: map pages after migration") Thanks for fixing this! Reviewed-by: Ralph Campbell
Re: [PATCH] mm/migrate_device: Return number of migrating pages in args->cpages
On 11/10/22 16:51, Alistair Popple wrote: migrate_vma->cpages originally contained a count of the number of pages migrating including non-present pages which can be poluated "populated" directly on the target. Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()") inadvertantly changed this to contain just the number of pages that were unmapped. Usage of migrate_vma->cpages isn't documented, but most drivers use it to see if all the requested addresses can be migrated so restore the original behaviour. Fixes: 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()") Signed-off-by: Alistair Popple Reported-by: Ralph Campbell You can add Reviewed-by: Ralph Campbell Thanks!
[PATCH] drm/edid: Add quirk for OSVR HDK 2.0
The OSVR virtual reality headset HDK 2.0 uses a different EDID vendor and device identifier than the HDK 1.1 - 1.4 headsets. Add the HDK 2.0 vendor and device identifier to the quirks table so that window managers do not try to display the desktop screen on the headset display. Signed-off-by: Ralph Campbell Tested-by: Ralph Campbell --- drivers/gpu/drm/drm_edid.c | 1 + 1 file changed, 1 insertion(+) I don't know how many of these VR headsets are still around but I have a working one and I saw and entry for HDK 1.x so I thought it would be good to add HDK 2.0. diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c index 0454da505687..3b8cc1fe05e8 100644 --- a/drivers/gpu/drm/drm_edid.c +++ b/drivers/gpu/drm/drm_edid.c @@ -230,6 +230,7 @@ static const struct edid_quirk { /* OSVR HDK and HDK2 VR Headsets */ EDID_QUIRK('S', 'V', 'R', 0x1019, EDID_QUIRK_NON_DESKTOP), + EDID_QUIRK('A', 'O', 'U', 0x, EDID_QUIRK_NON_DESKTOP), }; /* -- 2.40.1
Re: [PATCH] drm/edid: Add quirk for OSVR HDK 2.0
On 6/9/23 02:03, Jani Nikula wrote: On Thu, 08 Jun 2023, Ralph Campbell wrote: The OSVR virtual reality headset HDK 2.0 uses a different EDID vendor and device identifier than the HDK 1.1 - 1.4 headsets. Add the HDK 2.0 vendor and device identifier to the quirks table so that window managers do not try to display the desktop screen on the headset display. At some point in time we requested bugs to be filed about quirks, with EDIDs attached, so we could look at them later, and maybe remove the quirks. The headset non-desktop thing started off as a quirk, but since then we've added both Microsoft VSDB and DisplayID primary use as ways to indicate this without quirks. BR, Jani. If you want me to file a bug, I can do that and I have the EDID too. Where would I file it? I did see the DisplayID 2.0 code. This headset is no longer being manufactured so updating the EDID is not practical. Signed-off-by: Ralph Campbell Tested-by: Ralph Campbell --- drivers/gpu/drm/drm_edid.c | 1 + 1 file changed, 1 insertion(+) I don't know how many of these VR headsets are still around but I have a working one and I saw and entry for HDK 1.x so I thought it would be good to add HDK 2.0. diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c index 0454da505687..3b8cc1fe05e8 100644 --- a/drivers/gpu/drm/drm_edid.c +++ b/drivers/gpu/drm/drm_edid.c @@ -230,6 +230,7 @@ static const struct edid_quirk { /* OSVR HDK and HDK2 VR Headsets */ EDID_QUIRK('S', 'V', 'R', 0x1019, EDID_QUIRK_NON_DESKTOP), + EDID_QUIRK('A', 'O', 'U', 0x, EDID_QUIRK_NON_DESKTOP), }; /*
Re: [PATCH] drm/edid: Add quirk for OSVR HDK 2.0
On 6/10/23 00:22, Jani Nikula wrote: On Fri, 09 Jun 2023, Ralph Campbell wrote: On 6/9/23 02:03, Jani Nikula wrote: On Thu, 08 Jun 2023, Ralph Campbell wrote: The OSVR virtual reality headset HDK 2.0 uses a different EDID vendor and device identifier than the HDK 1.1 - 1.4 headsets. Add the HDK 2.0 vendor and device identifier to the quirks table so that window managers do not try to display the desktop screen on the headset display. At some point in time we requested bugs to be filed about quirks, with EDIDs attached, so we could look at them later, and maybe remove the quirks. The headset non-desktop thing started off as a quirk, but since then we've added both Microsoft VSDB and DisplayID primary use as ways to indicate this without quirks. BR, Jani. If you want me to file a bug, I can do that and I have the EDID too. Where would I file it? I suppose at https://gitlab.freedesktop.org/drm/misc/-/issues We should then reference the issue in the commit message (no need to resend, this can be added while applying). I did see the DisplayID 2.0 code. This headset is no longer being manufactured so updating the EDID is not practical. I'm not saying the EDID should be updated, just that we might drop the quirk if we find another generic way to identify non-desktops that covers the EDID in question. If the device isn't covered by the existing mechanisms, I'm not opposed to merging as-is, with the issue reference added. Thanks, Jani. Thanks, I created issue #30 "OSVR virtual reality headset HDK 2.0 not recognized as a non-desktop display" Signed-off-by: Ralph Campbell Tested-by: Ralph Campbell --- drivers/gpu/drm/drm_edid.c | 1 + 1 file changed, 1 insertion(+) I don't know how many of these VR headsets are still around but I have a working one and I saw and entry for HDK 1.x so I thought it would be good to add HDK 2.0. diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c index 0454da505687..3b8cc1fe05e8 100644 --- a/drivers/gpu/drm/drm_edid.c +++ b/drivers/gpu/drm/drm_edid.c @@ -230,6 +230,7 @@ static const struct edid_quirk { /* OSVR HDK and HDK2 VR Headsets */ EDID_QUIRK('S', 'V', 'R', 0x1019, EDID_QUIRK_NON_DESKTOP), + EDID_QUIRK('A', 'O', 'U', 0x, EDID_QUIRK_NON_DESKTOP), }; /*
[PATCH v2] drm/edid: Add quirk for OSVR HDK 2.0
The OSVR virtual reality headset HDK 2.0 uses a different EDID vendor and device identifier than the HDK 1.1 - 1.4 headsets. Add the HDK 2.0 vendor and device identifier to the quirks table so that window managers do not try to display the desktop screen on the headset display. Signed-off-by: Ralph Campbell Tested-by: Ralph Campbell --- drivers/gpu/drm/drm_edid.c | 1 + 1 file changed, 1 insertion(+) I don't know how many of these VR headsets are still around but I have a working one and I saw an entry for HDK 1.x so I thought it would be good to add HDK 2.0. v2: The vendor ID was byte swapped. I'm not sure how I missed that in v1. diff --git a/drivers/gpu/drm/drm_edid.c b/drivers/gpu/drm/drm_edid.c index 0454da505687..3b8cc1fe05e8 100644 --- a/drivers/gpu/drm/drm_edid.c +++ b/drivers/gpu/drm/drm_edid.c @@ -230,6 +230,7 @@ static const struct edid_quirk { /* OSVR HDK and HDK2 VR Headsets */ EDID_QUIRK('S', 'V', 'R', 0x1019, EDID_QUIRK_NON_DESKTOP), + EDID_QUIRK('A', 'U', 'O', 0x, EDID_QUIRK_NON_DESKTOP), }; /* -- 2.40.1