[PATCH 5/5] mm: remove range parameter from follow_invalidate_pte()
The only user (DAX) of the range parameter of follow_invalidate_pte()
is gone, so it is safe to remove the range parameter and make the
function static to simplify the code.

Signed-off-by: Muchun Song
---
 include/linux/mm.h |  3 ---
 mm/memory.c        | 23 +++--------------------
 2 files changed, 3 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d211a06784d5..7895b17f6847 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1814,9 +1814,6 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
-int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
-			  struct mmu_notifier_range *range, pte_t **ptepp,
-			  pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pte(struct mm_struct *mm, unsigned long address,
 	       pte_t **ptepp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 514a81cdd1ae..e8ce066be5f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4869,9 +4869,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
-			  struct mmu_notifier_range *range, pte_t **ptepp,
-			  pmd_t **pmdpp, spinlock_t **ptlp)
+static int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
+				 pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4898,31 +4897,17 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 		if (!pmdpp)
 			goto out;
 
-		if (range) {
-			mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0,
-						NULL, mm, address & PMD_MASK,
-						(address & PMD_MASK) + PMD_SIZE);
-			mmu_notifier_invalidate_range_start(range);
-		}
 		*ptlp = pmd_lock(mm, pmd);
 		if (pmd_huge(*pmd)) {
 			*pmdpp = pmd;
 			return 0;
 		}
 		spin_unlock(*ptlp);
-		if (range)
-			mmu_notifier_invalidate_range_end(range);
 	}
 
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
-	if (range) {
-		mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
-					address & PAGE_MASK,
-					(address & PAGE_MASK) + PAGE_SIZE);
-		mmu_notifier_invalidate_range_start(range);
-	}
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
 	if (!pte_present(*ptep))
 		goto unlock;
@@ -4930,8 +4915,6 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 	return 0;
 unlock:
 	pte_unmap_unlock(ptep, *ptlp);
-	if (range)
-		mmu_notifier_invalidate_range_end(range);
 out:
 	return -EINVAL;
 }
@@ -4960,7 +4943,7 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 int follow_pte(struct mm_struct *mm, unsigned long address,
 	       pte_t **ptepp, spinlock_t **ptlp)
 {
-	return follow_invalidate_pte(mm, address, NULL, ptepp, NULL, ptlp);
+	return follow_invalidate_pte(mm, address, ptepp, NULL, ptlp);
 }
 EXPORT_SYMBOL_GPL(follow_pte);
-- 
2.11.0
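[Editor's note] The notifier ranges this patch deletes were always spans aligned around the faulting address. As a quick userspace sketch of that aligned-range arithmetic — hypothetical helper names and x86-64 page geometry (4 KiB pages, 2 MiB PMDs) assumed, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 geometry; illustrative constants only. */
#define MY_PAGE_SHIFT	12
#define MY_PAGE_SIZE	(1ULL << MY_PAGE_SHIFT)
#define MY_PAGE_MASK	(~(MY_PAGE_SIZE - 1))
#define MY_PMD_SHIFT	21
#define MY_PMD_SIZE	(1ULL << MY_PMD_SHIFT)
#define MY_PMD_MASK	(~(MY_PMD_SIZE - 1))

/* Span mmu_notifier_range_init() received for a huge PMD entry. */
static void pmd_invalidate_range(uint64_t addr, uint64_t *start, uint64_t *end)
{
	*start = addr & MY_PMD_MASK;
	*end = (addr & MY_PMD_MASK) + MY_PMD_SIZE;
}

/* Span it received for a single PTE entry. */
static void pte_invalidate_range(uint64_t addr, uint64_t *start, uint64_t *end)
{
	*start = addr & MY_PAGE_MASK;
	*end = (addr & MY_PAGE_MASK) + MY_PAGE_SIZE;
}
```

Since DAX (the sole caller that passed a range) now does its own invalidation bookkeeping, this computation no longer needs to live in the walker.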
[PATCH 4/5] dax: fix missing writeprotect the pte entry
Currently dax_mapping_entry_mkclean() fails to clean and write protect
the pte entry within a DAX PMD entry during a *sync operation.  This
can result in data loss in the following sequence:

1) process A mmap write to DAX PMD, dirtying PMD radix tree entry and
   making the pmd entry dirty and writeable.
2) process B mmap with the @offset (e.g. 4K) and @length (e.g. 4K)
   write to the same file, dirtying PMD radix tree entry (already done
   in 1)) and making the pte entry dirty and writeable.
3) fsync, flushing out PMD data and cleaning the radix tree entry.  We
   currently fail to mark the pte entry as clean and write protected
   since the vma of process B is not covered in dax_entry_mkclean().
4) process B writes to the pte.  This does not cause any page faults
   since the pte entry is dirty and writeable.  The radix tree entry
   remains clean.
5) fsync, which fails to flush the dirty PMD data because the radix
   tree entry was clean.
6) crash - dirty data that should have been fsync'd as part of 5)
   could still have been in the processor cache, and is lost.

Reuse some of the page_mkclean_one() infrastructure so that DAX can
handle this similar case, fixing the issue.

Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush")
Signed-off-by: Muchun Song
---
 fs/dax.c             | 78 ++++------------------------------
 include/linux/rmap.h |  9 +++++
 mm/internal.h        | 27 ++++++-------
 mm/rmap.c            | 69 ++++++++++++++++++++++--------
 4 files changed, 85 insertions(+), 98 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2955ec65eb65..7d4e3e68b861 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,6 +25,7 @@
 #include <linux/sizes.h>
 #include <linux/mmu_notifier.h>
 #include <linux/iomap.h>
+#include <linux/rmap.h>
 #include <asm/pgalloc.h>
 
 #define CREATE_TRACE_POINTS
@@ -801,86 +802,21 @@ static void *dax_insert_entry(struct xa_state *xas,
 	return entry;
 }
 
-static inline
-unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
-{
-	unsigned long address;
-
-	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	return address;
-}
-
 /* Walk all mappings of a given index of a file and writeprotect them */
-static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
-		unsigned long pfn)
+static void dax_entry_mkclean(struct address_space *mapping, unsigned long pfn,
+			      unsigned long npfn, pgoff_t pgoff_start)
 {
 	struct vm_area_struct *vma;
-	pte_t pte, *ptep = NULL;
-	pmd_t *pmdp = NULL;
-	spinlock_t *ptl;
+	pgoff_t pgoff_end = pgoff_start + npfn - 1;
 
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
-		struct mmu_notifier_range range;
-		unsigned long address;
-
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
 		cond_resched();
 
 		if (!(vma->vm_flags & VM_SHARED))
 			continue;
 
-		address = pgoff_address(index, vma);
-
-		/*
-		 * follow_invalidate_pte() will use the range to call
-		 * mmu_notifier_invalidate_range_start() on our behalf before
-		 * taking any lock.
-		 */
-		if (follow_invalidate_pte(vma->vm_mm, address, &range, &ptep,
-					  &pmdp, &ptl))
-			continue;
-
-		/*
-		 * No need to call mmu_notifier_invalidate_range() as we are
-		 * downgrading page table protection not changing it to point
-		 * to a new page.
-		 *
-		 * See Documentation/vm/mmu_notifier.rst
-		 */
-		if (pmdp) {
-#ifdef CONFIG_FS_DAX_PMD
-			pmd_t pmd;
-
-			if (pfn != pmd_pfn(*pmdp))
-				goto unlock_pmd;
-			if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
-				goto unlock_pmd;
-
-			flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
-			pmd = pmdp_invalidate(vma, address, pmdp);
-			pmd = pmd_wrprotect(pmd);
-			pmd = pmd_mkclean(pmd);
-			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-unlock_pmd:
-#endif
-			spin_unlock(ptl);
-		} else {
-			if (pfn != pte_pfn(*ptep))
-				goto unlock_pte;
-			if (!pte_dirty(*ptep) && !pte_write(*ptep))
-				goto unlock_pte;
-
-			flush_cache_page(vma, address, pfn);
-			pte = ptep_clear_flush(vma, address, ptep);
-			pte =
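[Editor's note] The six-step race in this changelog reduces to a small model: one dirty bit on the radix-tree entry gates writeback, while per-pte dirty/write bits gate write faults. This toy userspace sketch uses made-up names (nothing here is kernel code); the `fix` flag models the patched behaviour of also write-protecting the pte during sync:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the changelog's race; all names are illustrative. */
struct radix_entry { bool dirty; };		/* tracks the 2 MiB DAX PMD */
struct pte_model   { bool dirty, writable; };	/* one 4 KiB pte mapping */

/* fsync: flush and clean only if the radix entry is dirty. */
static int dax_sync(struct radix_entry *e, struct pte_model *pte, bool fix)
{
	int flushed = 0;

	if (e->dirty) {
		flushed = 1;			/* write back CPU cache */
		e->dirty = false;
		if (fix) {			/* patched: also mkclean the pte */
			pte->dirty = false;
			pte->writable = false;
		}
	}
	return flushed;
}

/* store: only a write-protected pte faults and re-dirties the radix entry */
static void store(struct radix_entry *e, struct pte_model *pte)
{
	if (!pte->writable) {			/* write-fault path */
		pte->writable = true;
		e->dirty = true;
	}
	pte->dirty = true;
}

/* Returns whether the second fsync actually flushes process B's write. */
static int second_fsync_flushes(bool fixed)
{
	struct radix_entry e = { .dirty = true };		/* steps 1)-2) */
	struct pte_model pte = { .dirty = true, .writable = true };

	dax_sync(&e, &pte, fixed);		/* step 3): first fsync */
	store(&e, &pte);			/* step 4): process B writes */
	return dax_sync(&e, &pte, fixed);	/* step 5): second fsync */
}
```

Without the fix the second fsync sees a clean radix entry and flushes nothing (step 6's data loss); with the pte write-protected, step 4 faults and re-dirties the entry.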
[PATCH 3/5] mm: page_vma_mapped: support checking if a pfn is mapped into a vma
page_vma_mapped_walk() is supposed to check if a page is mapped into a
vma.  However, not all page frames (e.g. PFN_DEV) have an associated
struct page.  We would end up duplicating code similar to this function
if someone wanted to check whether a pfn (without a struct page) is
mapped into a vma.  So add support for checking if a pfn is mapped into
a vma.  In the next patch, DAX will use this new feature.

Signed-off-by: Muchun Song
---
 include/linux/rmap.h | 13 ++++++++---
 mm/internal.h        | 25 +++++++++++--------
 mm/page_vma_mapped.c | 65 ++++++++++++++++++++++++++++++------------
 3 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 221c3c6438a7..7628474732e7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,9 +204,18 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Walk the page table by checking the pfn instead of a struct page */
+#define PVMW_PFN_WALK		(1 << 2)
 
 struct page_vma_mapped_walk {
-	struct page *page;
+	union {
+		struct page *page;
+		struct {
+			unsigned long pfn;
+			unsigned int nr;
+			pgoff_t index;
+		};
+	};
 	struct vm_area_struct *vma;
 	unsigned long address;
 	pmd_t *pmd;
@@ -218,7 +227,7 @@ struct page_vma_mapped_walk {
 static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 {
 	/* HugeTLB pte is set to the relevant page table entry without pte_mapped. */
-	if (pvmw->pte && !PageHuge(pvmw->page))
+	if (pvmw->pte && ((pvmw->flags & PVMW_PFN_WALK) || !PageHuge(pvmw->page)))
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);
diff --git a/mm/internal.h b/mm/internal.h
index deb9bda18e59..d6e3e8e1be2d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -478,25 +478,34 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 }
 
 /*
- * Then at what user virtual address will none of the page be found in vma?
- * Assumes that vma_address() already returned a good starting address.
- * If page is a compound head, the entire compound page is considered.
+ * Return the end of user virtual address at the specific offset within
+ * a vma.
  */
 static inline unsigned long
-vma_address_end(struct page *page, struct vm_area_struct *vma)
+vma_pgoff_address_end(pgoff_t pgoff, unsigned long nr_pages,
+		      struct vm_area_struct *vma)
 {
-	pgoff_t pgoff;
 	unsigned long address;
 
-	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
-	pgoff = page_to_pgoff(page) + compound_nr(page);
-	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	address = vma->vm_start + ((pgoff + nr_pages - vma->vm_pgoff) << PAGE_SHIFT);
 	/* Check for address beyond vma (or wrapped through 0?) */
 	if (address < vma->vm_start || address > vma->vm_end)
 		address = vma->vm_end;
 	return address;
 }
 
+/*
+ * Then at what user virtual address will none of the page be found in vma?
+ * Assumes that vma_address() already returned a good starting address.
+ * If page is a compound head, the entire compound page is considered.
+ */
+static inline unsigned long
+vma_address_end(struct page *page, struct vm_area_struct *vma)
+{
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	return vma_pgoff_address_end(page_to_pgoff(page), compound_nr(page), vma);
+}
+
 static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 						    struct file *fpin)
 {
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f7b331081791..c8819770d457 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -53,10 +53,16 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
-static inline bool pfn_is_match(struct page *page, unsigned long pfn)
+static inline bool pfn_is_match(struct page_vma_mapped_walk *pvmw, unsigned long pfn)
 {
-	unsigned long page_pfn = page_to_pfn(page);
+	struct page *page;
+	unsigned long page_pfn;
+
+	if (pvmw->flags & PVMW_PFN_WALK)
+		return pfn >= pvmw->pfn && pfn - pvmw->pfn < pvmw->nr;
+
+	page = pvmw->page;
+	page_pfn = page_to_pfn(page);
 
 	/* normal page and hugetlbfs page */
 	if (!PageTransCompound(page) || PageHuge(page))
 		return page_pfn == pfn;
@@ -116,7 +122,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		pfn = pte_pfn(*pvmw->pte);
 	}
 
-	return pfn_is_match(pvmw, pfn);
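[Editor's note] The heart of the new PVMW_PFN_WALK mode is a plain half-open range check: with no struct page available, a pte/pmd pfn "matches" when it falls inside the walk's [pfn, pfn + nr) window. A standalone sketch (simplified struct, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the pfn fields of page_vma_mapped_walk. */
struct pfn_walk {
	unsigned long pfn;	/* first pfn of interest */
	unsigned int nr;	/* number of pfns in the window */
};

static bool pfn_walk_match(const struct pfn_walk *w, unsigned long pfn)
{
	/*
	 * Same expression as the patch.  The first comparison guards the
	 * unsigned subtraction, so pfns below the window cannot wrap
	 * around and spuriously match.
	 */
	return pfn >= w->pfn && pfn - w->pfn < w->nr;
}
```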
[PATCH 2/5] dax: fix cache flush on PMD-mapped pages
flush_cache_page() only removes a PAGE_SIZE sized range from the cache.
However, it does not cover the full pages of a THP except for the head
page.  Replace it with flush_cache_range() to fix this issue.

Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
Signed-off-by: Muchun Song
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 88be1c02a151..2955ec65eb65 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -857,7 +857,7 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 			if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
 				goto unlock_pmd;
 
-			flush_cache_page(vma, address, pfn);
+			flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
 			pmd = pmdp_invalidate(vma, address, pmdp);
 			pmd = pmd_wrprotect(pmd);
 			pmd = pmd_mkclean(pmd);
-- 
2.11.0
[PATCH 1/5] mm: rmap: fix cache flush on THP pages
flush_cache_page() only removes a PAGE_SIZE sized range from the cache.
However, it does not cover the full pages of a THP except for the head
page.  Replace it with flush_cache_range() to fix this issue.  At
least, no problems have been observed due to this, perhaps because few
architectures have virtually indexed caches.

Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
Signed-off-by: Muchun Song
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index b0fd9dc19eba..65670cb805d6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			if (!pmd_dirty(*pmd) && !pmd_write(*pmd))
 				continue;
 
-			flush_cache_page(vma, address, page_to_pfn(page));
+			flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
 			entry = pmdp_invalidate(vma, address, pmd);
 			entry = pmd_wrprotect(entry);
 			entry = pmd_mkclean(entry);
-- 
2.11.0
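[Editor's note] A back-of-the-envelope sketch of the gap this pair of cache-flush patches closes, assuming 4 KiB base pages and 2 MiB PMD mappings (illustrative constants, not kernel API): flushing a single page of a PMD-mapped THP leaves 511 of its 512 subpages unflushed on virtually indexed cache machines.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for the sketch; real values are per-architecture. */
#define BASE_PAGE_SIZE	4096ULL
#define HPAGE_PMD_NR	512ULL
#define HPAGE_PMD_SZ	(HPAGE_PMD_NR * BASE_PAGE_SIZE)	/* 2 MiB */

/* Bytes of the THP mapping left out when only `flushed` bytes are flushed. */
static uint64_t thp_unflushed_bytes(uint64_t flushed)
{
	return HPAGE_PMD_SZ - flushed;
}
```

With flush_cache_page() only BASE_PAGE_SIZE is covered; flush_cache_range(vma, address, address + HPAGE_PMD_SIZE) covers the whole mapping.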
Re: [PATCH v9 02/10] dax: Introduce holder for dax_device
On Thu, Jan 20, 2022 at 06:22:00PM -0800, Darrick J. Wong wrote:
> Hm, so that means XFS can only support dax+pmem when there aren't
> partitions in use?  Ew.

Yes.  Or any sensible DAX usage going forward for that matter.

> > (2) extend the holder mechanism to cover a range
>
> I don't think I was around for the part where "hch balked at a notifier
> call chain" -- what were the objections there, specifically?  I would
> hope that pmem problems would be infrequent enough that the locking
> contention (or rcu expiration) wouldn't be an issue...?

notifiers are a nightmare untyped API leading to tons of boilerplate
code.  Open coding the notification is almost always a better idea.
Re: [PATCH v9 10/10] fsdax: set a CoW flag when associate reflink mappings
On Fri, Jan 21, 2022 at 10:33:58AM +0800, Shiyang Ruan wrote:
> > But different question, how does this not conflict with:
> >
> > #define PAGE_MAPPING_ANON	0x1
> >
> > in page-flags.h?
>
> Now we are treating dax pages, so I think its flags should be different
> from normal pages.  In other words, PAGE_MAPPING_ANON is a flag of the
> rmap mechanism for normal pages; it doesn't work for dax pages.  And
> now we have dax rmap for dax pages.  So I think these two kinds of
> flags are supposed to be used in different mechanisms and won't
> conflict.

It just needs someone to use folio_test_anon in a place where a DAX
folio can be passed.  This probably should not happen, but we need to
clearly document that.

> > Either way I think this flag should move to page-flags.h and be
> > integrated with the PAGE_MAPPING_FLAGS infrastructure.
>
> And that's why I keep them in this dax.c file.

But that does not integrate it with the infrastructure.  For people to
debug things it needs to be next to PAGE_MAPPING_ANON and have
documentation explaining why they are exclusive.
Re: [PATCH v9 10/10] fsdax: set a CoW flag when associate reflink mappings
On 2022/1/20 16:59, Christoph Hellwig wrote:
> On Sun, Dec 26, 2021 at 10:34:39PM +0800, Shiyang Ruan wrote:
> > +#define FS_DAX_MAPPING_COW	1UL
> > +
> > +#define MAPPING_SET_COW(m)	(m = (struct address_space *)FS_DAX_MAPPING_COW)
> > +#define MAPPING_TEST_COW(m)	(((unsigned long)m & FS_DAX_MAPPING_COW) == \
> > +					FS_DAX_MAPPING_COW)
>
> These really should be inline functions and probably use lower case
> names.

OK.

> But different question, how does this not conflict with:
>
> #define PAGE_MAPPING_ANON	0x1
>
> in page-flags.h?

Now we are treating dax pages, so I think its flags should be different
from normal pages.  In other words, PAGE_MAPPING_ANON is a flag of the
rmap mechanism for normal pages; it doesn't work for dax pages.  And now
we have dax rmap for dax pages.  So I think these two kinds of flags are
supposed to be used in different mechanisms and won't conflict.

> Either way I think this flag should move to page-flags.h and be
> integrated with the PAGE_MAPPING_FLAGS infrastructure.

And that's why I keep them in this dax.c file.

--
Thanks,
Ruan.
Re: [PATCH v9 02/10] dax: Introduce holder for dax_device
On Fri, Jan 21, 2022 at 09:26:52AM +0800, Shiyang Ruan wrote:
> On 2022/1/20 16:46, Christoph Hellwig wrote:
> > On Wed, Jan 05, 2022 at 04:12:04PM -0800, Dan Williams wrote:
> > > We ended up with explicit callbacks after hch balked at a notifier
> > > call-chain, but I think we're back to that now. The partition mistake
> > > might be unfixable, but at least bdev_dax_pgoff() is dead. Notifier
> > > call chains have their own locking so, Ruan, this still does not need
> > > to touch dax_read_lock().
> >
> > I think we have a few options here:
> >
> >  (1) don't allow error notifications on partitions.  An error return
> >      from the holder registration with proper error handling in the
> >      file system would give us that

Hm, so that means XFS can only support dax+pmem when there aren't
partitions in use?  Ew.

> >  (2) extend the holder mechanism to cover a range

I don't think I was around for the part where "hch balked at a notifier
call chain" -- what were the objections there, specifically?  I would
hope that pmem problems would be infrequent enough that the locking
contention (or rcu expiration) wouldn't be an issue...?

> >  (3) bite the bullet and create a new stacked dax_device for each
> >      partition
> >
> > I think (1) is the best option for now.  If people really do need
> > partitions we'll have to go for (3)
>
> Yes, I agree.  I'm doing it the first way right now.
>
> I think that since we can use namespace to divide a big NVDIMM into
> multiple pmems, partition on a pmem seems not so meaningful.

I'll try to find out what will happen if pmem suddenly stops supporting
partitions...

--D
Re: [PATCH v9 02/10] dax: Introduce holder for dax_device
On 2022/1/20 16:46, Christoph Hellwig wrote:
> On Wed, Jan 05, 2022 at 04:12:04PM -0800, Dan Williams wrote:
> > We ended up with explicit callbacks after hch balked at a notifier
> > call-chain, but I think we're back to that now. The partition mistake
> > might be unfixable, but at least bdev_dax_pgoff() is dead. Notifier
> > call chains have their own locking so, Ruan, this still does not need
> > to touch dax_read_lock().
>
> I think we have a few options here:
>
>  (1) don't allow error notifications on partitions.  An error return
>      from the holder registration with proper error handling in the
>      file system would give us that
>  (2) extend the holder mechanism to cover a range
>  (3) bite the bullet and create a new stacked dax_device for each
>      partition
>
> I think (1) is the best option for now.  If people really do need
> partitions we'll have to go for (3)

Yes, I agree.  I'm doing it the first way right now.

I think that since we can use namespace to divide a big NVDIMM into
multiple pmems, partition on a pmem seems not so meaningful.

--
Thanks,
Ruan.
Re: [PATCH v9 10/10] fsdax: set a CoW flag when associate reflink mappings
On Sun, Dec 26, 2021 at 10:34:39PM +0800, Shiyang Ruan wrote:
> +#define FS_DAX_MAPPING_COW	1UL
> +
> +#define MAPPING_SET_COW(m)	(m = (struct address_space *)FS_DAX_MAPPING_COW)
> +#define MAPPING_TEST_COW(m)	(((unsigned long)m & FS_DAX_MAPPING_COW) == \
> +					FS_DAX_MAPPING_COW)

These really should be inline functions and probably use lower case
names.

But different question, how does this not conflict with:

#define PAGE_MAPPING_ANON	0x1

in page-flags.h?

Either way I think this flag should move to page-flags.h and be
integrated with the PAGE_MAPPING_FLAGS infrastructure.
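[Editor's note] One possible shape for the inline-function version suggested in this review, sketched in userspace with a stubbed `struct address_space` — the names and details here are a guess at the suggestion, not the kernel's eventual implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct address_space;		/* opaque stub for the sketch */

#define FS_DAX_MAPPING_COW	1UL

/* Tag a page's mapping pointer as a CoW marker rather than a real mapping. */
static inline void mapping_set_cow(struct address_space **mapping)
{
	*mapping = (struct address_space *)FS_DAX_MAPPING_COW;
}

/* Test whether the mapping pointer carries the CoW tag bit. */
static inline bool mapping_test_cow(struct address_space *mapping)
{
	return ((uintptr_t)mapping & FS_DAX_MAPPING_COW) == FS_DAX_MAPPING_COW;
}
```

Lower-case inline helpers like these give type checking and avoid the macro's hidden assignment; the conflict concern above is that the same low bit is PAGE_MAPPING_ANON for non-DAX pages, so the two uses must be kept strictly exclusive and documented side by side.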
Re: [PATCH v9 08/10] mm: Introduce mf_dax_kill_procs() for fsdax case
Please only build the new DAX code if CONFIG_FS_DAX is set.
Re: [PATCH v9 07/10] mm: move pgoff_address() to vma_pgoff_address()
On Sun, Dec 26, 2021 at 10:34:36PM +0800, Shiyang Ruan wrote:
> Since it is not a DAX-specific function, move it into mm and rename it
> to be a generic helper.
>
> Signed-off-by: Shiyang Ruan

Looks good,

Reviewed-by: Christoph Hellwig
Re: [PATCH v9 05/10] fsdax: fix function description
On Sun, Dec 26, 2021 at 10:34:34PM +0800, Shiyang Ruan wrote:
> The function name has been changed, so the description should be
> updated too.
>
> Signed-off-by: Shiyang Ruan

Looks good,

Reviewed-by: Christoph Hellwig

Dan, can you send this to Linus for 5.17 so that we can get it out of
the way?
Re: [PATCH v9 02/10] dax: Introduce holder for dax_device
On Wed, Jan 05, 2022 at 04:12:04PM -0800, Dan Williams wrote:
> We ended up with explicit callbacks after hch balked at a notifier
> call-chain, but I think we're back to that now. The partition mistake
> might be unfixable, but at least bdev_dax_pgoff() is dead. Notifier
> call chains have their own locking so, Ruan, this still does not need
> to touch dax_read_lock().

I think we have a few options here:

 (1) don't allow error notifications on partitions.  An error return
     from the holder registration with proper error handling in the
     file system would give us that
 (2) extend the holder mechanism to cover a range
 (3) bite the bullet and create a new stacked dax_device for each
     partition

I think (1) is the best option for now.  If people really do need
partitions we'll have to go for (3)