Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation
"Aneesh Kumar K.V" writes: > Add a new kconfig option that can be selected if we want to allow > pageblock alignment by reserving pages in the vmemmap altmap area. > This implies we will be reserving some pages for every memoryblock > This also allows the memmap on memory feature to be widely useful > with different memory block size values. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/Kconfig | 9 +++ > mm/memory_hotplug.c | 59 + > 2 files changed, 58 insertions(+), 10 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 932349271e28..88a1472b2086 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -570,6 +570,15 @@ config MHP_MEMMAP_ON_MEMORY > depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP > depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE > > +config MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY > + bool "Allow Reserving pages for page block aligment" > + depends on MHP_MEMMAP_ON_MEMORY > + help > + This option allows memmap on memory feature to be more useful > + with different memory block sizes. This is achieved by marking some > pages > + in each memory block as reserved so that we can get page-block alignment > + for the remaining pages. > + > endif # MEMORY_HOTPLUG > > config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 07c99b0cc371..f36aec1f7626 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1252,15 +1252,17 @@ static inline bool > arch_supports_memmap_on_memory(unsigned long size) > { > unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; > unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); > - unsigned long remaining_size = size - vmemmap_size; > > - return IS_ALIGNED(vmemmap_size, PMD_SIZE) && > - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); > + return IS_ALIGNED(vmemmap_size, PMD_SIZE); > } > #endif > > static bool mhp_supports_memmap_on_memory(unsigned long size) > { > + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; > + unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); > + unsigned long remaining_size = size - vmemmap_size; > + > /* >* Besides having arch support and the feature enabled at runtime, we >* need a few more assumptions to hold true: > @@ -1287,9 +1289,30 @@ static bool mhp_supports_memmap_on_memory(unsigned > long size) >* altmap as an alternative source of memory, and we do not > exactly >* populate a single PMD. >*/ > - return mhp_memmap_on_memory() && > - size == memory_block_size_bytes() && > - arch_supports_memmap_on_memory(size); > + if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) > + return false; > + /* > + * Without page reservation remaining pages should be pageblock > aligned. > + */ > + if (!IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY) && > + !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT))) > + return false; > + > + return arch_supports_memmap_on_memory(size); > +} > + > +static inline unsigned long memory_block_align_base(unsigned long size) > +{ > + if (IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY)) { > + unsigned long align; > + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; > + unsigned long vmemmap_size; > + > + vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> > PAGE_SHIFT; DIV_ROUND_UP()? 
> + align = pageblock_align(vmemmap_size) - vmemmap_size; > + return align; > + } else > + return 0; > } > > /* > @@ -1302,7 +1325,11 @@ int __ref add_memory_resource(int nid, struct resource > *res, mhp_t mhp_flags) > { > struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; > enum memblock_flags memblock_flags = MEMBLOCK_NONE; > - struct vmem_altmap mhp_altmap = {}; > + struct vmem_altmap mhp_altmap = { > + .base_pfn = PHYS_PFN(res->start), > + .end_pfn = PHYS_PFN(res->end), > + .reserve = memory_block_align_base(resource_size(res)), > + }; > struct memory_group *group = NULL; > u64 start, size; > bool new_node = false; > @@ -1347,8 +1374,7 @@ int __ref add_memory_resource(int nid, struct resource > *res, mhp_t mhp_flags) >*/ > if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { > if (mhp_supports_memmap_on_memory(size)) { > - mhp_altmap.free = PHYS_PFN(size); > - mhp_altmap.base_pfn = PHYS_PFN(start); > + mhp_altmap.free = PHYS_PFN(size) - mhp_altmap.reserve; > params.altmap = &mhp_altmap; > } > /* fallback to not using altmap */ > @@ -1360,7 +
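
As a side note on the DIV_ROUND_UP() question raised inline above: the shift in the quoted memory_block_align_base() truncates, so a memmap area that is not an exact multiple of PAGE_SIZE would be under-counted. A minimal sketch of the arithmetic with the suggested rounding applied (illustrative only; the variable names are renamed here for clarity, and pageblock_align()/DIV_ROUND_UP() are used as in the patch and in common kernel code):

        /*
         * Sketch: pages of struct page needed for a memory block of "size"
         * bytes, rounded up, plus the padding pages to reserve so that the
         * first page after the memmap is pageblock aligned.
         */
        static unsigned long memmap_reserve_pages(unsigned long size)
        {
                unsigned long nr_pages = size >> PAGE_SHIFT;
                unsigned long vmemmap_pages =
                        DIV_ROUND_UP(nr_pages * sizeof(struct page), PAGE_SIZE);

                return pageblock_align(vmemmap_pages) - vmemmap_pages;
        }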
Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
Alistair Popple writes:
> Peter Xu writes:
>
>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>
>>> Peter Xu writes:
>>>
>>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>>> >> installs migration entries directly if it can lock the migrating page.
>>> >> When removing a dirty pte the dirty bit is supposed to be carried over
>>> >> to the underlying page to prevent it being lost.
>>> >>
>>> >> Currently migrate_vma_*() can only be used for private anonymous
>>> >> mappings. That means loss of the dirty bit usually doesn't result in
>>> >> data loss because these pages are typically not file-backed. However
>>> >> pages may be backed by swap storage which can result in data loss if an
>>> >> attempt is made to migrate a dirty page that doesn't yet have the
>>> >> PageDirty flag set.
>>> >>
>>> >> In this case migration will fail due to unexpected references but the
>>> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
>>> >> won't be written back to swap storage as it is considered uptodate,
>>> >> resulting in data loss if the page is subsequently accessed.
>>> >>
>>> >> Prevent this by copying the dirty bit to the page when removing the pte
>>> >> to match what try_to_migrate_one() does.
>>> >>
>>> >> Signed-off-by: Alistair Popple
>>> >> Acked-by: Peter Xu
>>> >> Reported-by: Huang Ying
>>> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> >> Cc: sta...@vger.kernel.org
>>> >>
>>> >> ---
>>> >>
>>> >> Changes for v3:
>>> >>
>>> >> - Defer TLB flushing
>>> >> - Split a TLB flushing fix into a separate change.
>>> >>
>>> >> Changes for v2:
>>> >>
>>> >> - Fixed up Reported-by tag.
>>> >> - Added Peter's Acked-by.
>>> >> - Atomically read and clear the pte to prevent the dirty bit getting set after reading it.
>>> >> - Added fixes tag
>>> >> ---
>>> >>  mm/migrate_device.c | 9 +++--
>>> >>  1 file changed, 7 insertions(+), 2 deletions(-)
>>> >>
>>> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> >> index 6a5ef9f..51d9afa 100644
>>> >> --- a/mm/migrate_device.c
>>> >> +++ b/mm/migrate_device.c
>>> >> @@ -7,6 +7,7 @@
>>> >>  #include
>>> >>  #include
>>> >>  #include
>>> >> +#include
>>> >>  #include
>>> >>  #include
>>> >>  #include
>>> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >>  		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>> >>  		if (anon_exclusive) {
>>> >>  			flush_cache_page(vma, addr, pte_pfn(*ptep));
>>> >> -			ptep_clear_flush(vma, addr, ptep);
>>> >> +			pte = ptep_clear_flush(vma, addr, ptep);
>>> >>
>>> >>  			if (page_try_share_anon_rmap(page)) {
>>> >>  				set_pte_at(mm, addr, ptep, pte);
>>> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >>  				goto next;
>>> >>  			}
>>> >>  		} else {
>>> >> -			ptep_get_and_clear(mm, addr, ptep);
>>> >> +			pte = ptep_get_and_clear(mm, addr, ptep);
>>> >>  		}
>>> >
>>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>> > moved above the condition check so they're called unconditionally. Could
>>> > you explain the rational on why it's changed back (since
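
For readers following the quoted v3 hunks above, the two clearing paths being discussed reduce to roughly the following sketch (names taken from the quoted diff; the surrounding loop, locking and error handling are omitted, and the comments are this summary's reading rather than the author's wording):

        if (anon_exclusive) {
                /* Exclusive anon page: flush the TLB as part of clearing
                 * the pte, before trying to clear PageAnonExclusive. */
                flush_cache_page(vma, addr, pte_pfn(*ptep));
                pte = ptep_clear_flush(vma, addr, ptep);

                if (page_try_share_anon_rmap(page)) {
                        /* Back out: restore the pte and skip this page. */
                        set_pte_at(mm, addr, ptep, pte);
                        goto next;
                }
        } else {
                /* Non-exclusive: clear the pte now, but defer the TLB flush
                 * to a batched flush later in the walk ("Defer TLB flushing"
                 * in the v3 changelog). */
                pte = ptep_get_and_clear(mm, addr, ptep);
        }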
Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> When clearing a PTE the TLB should be flushed whilst still holding the
>>> PTL to avoid a potential race with madvise/munmap/etc. For example
>>> consider the following sequence:
>>>
>>>   CPU0                            CPU1
>>>
>>>   migrate_vma_collect_pmd()
>>>   pte_unmap_unlock()
>>>                                   madvise(MADV_DONTNEED)
>>>                                   -> zap_pte_range()
>>>                                   pte_offset_map_lock()
>>>                                   [ PTE not present, TLB not flushed ]
>>>                                   pte_unmap_unlock()
>>>                                   [ page is still accessible via stale TLB ]
>>>   flush_tlb_range()
>>>
>>> In this case the page may still be accessed via the stale TLB entry
>>> after madvise returns. Fix this by flushing the TLB while holding the
>>> PTL.
>>>
>>> Signed-off-by: Alistair Popple
>>> Reported-by: Nadav Amit
>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> Cc: sta...@vger.kernel.org
>>>
>>> ---
>>>
>>> Changes for v3:
>>>
>>> - New for v3
>>> ---
>>>  mm/migrate_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 27fb37d..6a5ef9f 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  		migrate->dst[migrate->npages] = 0;
>>>  		migrate->src[migrate->npages++] = mpfn;
>>>  	}
>>> -	arch_leave_lazy_mmu_mode();
>>> -	pte_unmap_unlock(ptep - 1, ptl);
>>>
>>>  	/* Only flush the TLB if we actually modified any entries */
>>>  	if (unmapped)
>>>  		flush_tlb_range(walk->vma, start, end);
>>
>> It appears that we can increase "unmapped" only if ptep_get_and_clear()
>> is used?
>
> In other words you mean we only need to increase unmapped if pte_present
> && !anon_exclusive?
>
> Agree, that's a good optimisation to make. However I'm just trying to
> solve a data corruption issue (not dirtying the page) here, so will post
> that as a separate optimisation patch. Thanks.

OK. Then the patch looks good to me. Feel free to add my,

Reviewed-by: "Huang, Ying"

Best Regards,
Huang, Ying

>>
>>> +	arch_leave_lazy_mmu_mode();
>>> +	pte_unmap_unlock(ptep - 1, ptl);
>>> +
>>>  	return 0;
>>>  }
>>>
>>>
>>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
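
Reduced to its essentials, the reordering acked above is the following (a sketch of the quoted diff with the context lines dropped, not the complete function):

        /* before: unlock first, flush later -> a concurrent zap can run in
         * the window while a stale TLB entry still maps the page */
        arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(ptep - 1, ptl);
        if (unmapped)
                flush_tlb_range(walk->vma, start, end);

        /* after: flush while the PTL is still held, then unlock */
        if (unmapped)
                flush_tlb_range(walk->vma, start, end);
        arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(ptep - 1, ptl);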
Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
Alistair Popple writes:
> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
>
>   CPU0                            CPU1
>
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
>                                   madvise(MADV_DONTNEED)
>                                   -> zap_pte_range()
>                                   pte_offset_map_lock()
>                                   [ PTE not present, TLB not flushed ]
>                                   pte_unmap_unlock()
>                                   [ page is still accessible via stale TLB ]
>   flush_tlb_range()
>
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
>
> Signed-off-by: Alistair Popple
> Reported-by: Nadav Amit
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: sta...@vger.kernel.org
>
> ---
>
> Changes for v3:
>
> - New for v3
> ---
>  mm/migrate_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..6a5ef9f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		migrate->dst[migrate->npages] = 0;
>  		migrate->src[migrate->npages++] = mpfn;
>  	}
> -	arch_leave_lazy_mmu_mode();
> -	pte_unmap_unlock(ptep - 1, ptl);
>
>  	/* Only flush the TLB if we actually modified any entries */
>  	if (unmapped)
>  		flush_tlb_range(walk->vma, start, end);

It appears that we can increase "unmapped" only if ptep_get_and_clear()
is used?

Best Regards,
Huang, Ying

> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
>  	return 0;
>  }
>
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
Peter Xu writes:
> On Thu, Aug 18, 2022 at 02:34:45PM +0800, Huang, Ying wrote:
>> > In this specific case, the only way to do safe tlb batching in my mind is:
>> >
>> > 	pte_offset_map_lock();
>> > 	arch_enter_lazy_mmu_mode();
>> > 	// If any pending tlb, do it now
>> > 	if (mm_tlb_flush_pending())
>> > 		flush_tlb_range(vma, start, end);
>> > 	else
>> > 		flush_tlb_batched_pending();
>>
>> I don't think we need the above 4 lines. Because we will flush TLB
>> before we access the pages.
>
> Could you elaborate?

As you have said below, we don't use non-present PTEs and flush present
PTEs before we access the pages.

>> Can you find any issue if we don't use the above 4 lines?
>
> It seems okay to me to leave stall tlb at least within the scope of this
> function. It only collects present ptes and flush propoerly for them. I
> don't quickly see any other implications to other not touched ptes - unlike
> e.g. mprotect(), there's a strong barrier of not allowing further write
> after mprotect() returns.

Yes. I think so too.

> Still I don't know whether there'll be any side effect of having stall tlbs
> in !present ptes because I'm not familiar enough with the private dev swap
> migration code. But I think having them will be safe, even if redundant.

I don't think it's a good idea to be redundant. That may hide the real
issue.

Best Regards,
Huang, Ying
Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
Peter Xu writes:
> On Wed, Aug 17, 2022 at 02:41:19AM -0700, Nadav Amit wrote:
>> 4. Having multiple TLB flushing infrastructures makes all of these
>> discussions very complicated and unmaintainable. I need to convince myself
>> in every occasion (including this one) whether calls to
>> flush_tlb_batched_pending() and tlb_flush_pending() are needed or not.
>>
>> What I would like to have [3] is a single infrastructure that gets a
>> “ticket” (generation when the batching started), the old PTE and the new PTE
>> and checks whether a TLB flush is needed based on the arch behavior and the
>> current TLB generation. If needed, it would update the “ticket” to the new
>> generation. Andy wanted a ring for pending TLB flushes, but I think it is an
>> overkill with more overhead and complexity than needed.
>>
>> But the current situation in which every TLB flush is a basis for long
>> discussions and prone to bugs is impossible.
>>
>> I hope it helps. Let me know if you want me to revive the patch-set or other
>> feedback.
>>
>> [1] https://lore.kernel.org/all/20220711034615.482895-5-21cn...@gmail.com/
>> [2] https://lore.kernel.org/all/20220718120212.3180-13-na...@vmware.com/
>> [3] https://lore.kernel.org/all/20210131001132.3368247-16-na...@vmware.com/
>
> I need more reading on tlb code and also [3] which looks useful to me.
> It's definitely sad to make tlb flushing so complicated. It'll be great if
> things can be sorted out someday.
>
> In this specific case, the only way to do safe tlb batching in my mind is:
>
> 	pte_offset_map_lock();
> 	arch_enter_lazy_mmu_mode();
> 	// If any pending tlb, do it now
> 	if (mm_tlb_flush_pending())
> 		flush_tlb_range(vma, start, end);
> 	else
> 		flush_tlb_batched_pending();

I don't think we need the above 4 lines. Because we will flush TLB
before we access the pages.

Can you find any issue if we don't use the above 4 lines?

Best Regards,
Huang, Ying

> 	loop {
> 		...
> 		pte = ptep_get_and_clear();
> 		...
> 		if (pte_present())
> 			unmapped++;
> 		...
> 	}
> 	if (unmapped)
> 		flush_tlb_range(walk->vma, start, end);
> 	arch_leave_lazy_mmu_mode();
> 	pte_unmap_unlock();
>
> I may miss something, but even if not it already doesn't look pretty.
>
> Thanks,
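
Assembling Peter's skeleton above into one fragment, with the part Huang, Ying questions marked (illustrative only; whether those lines are needed is exactly what is being debated, and the loop body is elided):

        ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
        arch_enter_lazy_mmu_mode();

        /* Peter's proposal: resolve any TLB flush other threads have already
         * batched for this range before relying on the ptes about to be
         * cleared. Huang, Ying's reply argues this is unnecessary here,
         * because only present ptes are collected and they are flushed
         * below, before the pages are accessed. */
        if (mm_tlb_flush_pending(mm))
                flush_tlb_range(vma, start, end);
        else
                flush_tlb_batched_pending(mm);

        for (addr = start; addr < end; addr += PAGE_SIZE, ptep++) {
                pte = ptep_get_and_clear(mm, addr, ptep);
                if (pte_present(pte))
                        unmapped++;
                /* ... install the migration entry ... */
        }

        if (unmapped)
                flush_tlb_range(walk->vma, start, end);

        arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(ptep - 1, ptl);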
Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
Nadav Amit writes:
> On Aug 17, 2022, at 12:17 AM, Huang, Ying wrote:
>
>> Alistair Popple writes:
>>
>>> Peter Xu writes:
>>>
>>>> On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
>>>>> Peter Xu writes:
>>>>>
>>>>>> On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
>>>>>>>> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>  		bool anon_exclusive;
>>>>>>>>  		pte_t swp_pte;
>>>>>>>>
>>>>>>>> +		flush_cache_page(vma, addr, pte_pfn(*ptep));
>>>>>>>> +		pte = ptep_clear_flush(vma, addr, ptep);
>>>>>>>
>>>>>>> Although I think it's possible to batch the TLB flushing just before
>>>>>>> unlocking PTL. The current code looks correct.
>>>>>>
>>>>>> If we're with unconditionally ptep_clear_flush(), does it mean we should
>>>>>> probably drop the "unmapped" and the last flush_tlb_range() already since
>>>>>> they'll be redundant?
>>>>>
>>>>> This patch does that, unless I missed something?
>>>>
>>>> Yes it does. Somehow I didn't read into the real v2 patch, sorry!
>>>>
>>>>>> If that'll need to be dropped, it looks indeed better to still keep the
>>>>>> batch to me but just move it earlier (before unlock iiuc then it'll be
>>>>>> safe), then we can keep using ptep_get_and_clear() afaiu but keep "pte"
>>>>>> updated.
>>>>>
>>>>> I think we would also need to check should_defer_flush(). Looking at
>>>>> try_to_unmap_one() there is this comment:
>>>>>
>>>>> 	if (should_defer_flush(mm, flags) && !anon_exclusive) {
>>>>> 		/*
>>>>> 		 * We clear the PTE but do not flush so potentially
>>>>> 		 * a remote CPU could still be writing to the folio.
>>>>> 		 * If the entry was previously clean then the
>>>>> 		 * architecture must guarantee that a clear->dirty
>>>>> 		 * transition on a cached TLB entry is written through
>>>>> 		 * and traps if the PTE is unmapped.
>>>>> 		 */
>>>>>
>>>>> And as I understand it we'd need the same guarantee here. Given
>>>>> try_to_migrate_one() doesn't do batched TLB flushes either I'd rather
>>>>> keep the code as consistent as possible between
>>>>> migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at
>>>>> introducing TLB flushing for both in some future patch series.
>>>>
>>>> should_defer_flush() is TTU-specific code?
>>>
>>> I'm not sure, but I think we need the same guarantee here as mentioned
>>> in the comment otherwise we wouldn't see a subsequent CPU write that
>>> could dirty the PTE after we have cleared it but before the TLB flush.
>>>
>>> My assumption was should_defer_flush() would ensure we have that
>>> guarantee from the architecture, but maybe there are alternate/better
>>> ways of enforcing that?
>>>
>>>> IIUC the caller sets TTU_BATCH_FLUSH showing that tlb can be omitted since
>>>> the caller will be responsible for doing it. In migrate_vma_collect_pmd()
>>>> iiuc we don't need that hint because it'll be flushed within the same
>>>> function but just only after the loop of modifying the ptes. Also it'll be
>>>> with the pgtable lock held.
>>>
>>> Right, but the pgtable lock doesn't protect against HW PTE changes such
>>> as setting the dirty bit so we need to ensure the HW does the right
>>> thing here and I don't know if all HW does.
>>
>> This sounds sensible. But I take a look at zap_pte_range(), and find
>> that it appears that the implementation requires the PTE dirty bit to be
>> write-through. Do I miss something?
>>
>> Hi, Nadav, Can you help?
>
> Sorry for joinin
Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
Alistair Popple writes:
> Peter Xu writes:
>
>> On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
>>>
>>> Peter Xu writes:
>>>
>>> > On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
>>> >> > @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >> >  		bool anon_exclusive;
>>> >> >  		pte_t swp_pte;
>>> >> >
>>> >> > +		flush_cache_page(vma, addr, pte_pfn(*ptep));
>>> >> > +		pte = ptep_clear_flush(vma, addr, ptep);
>>> >>
>>> >> Although I think it's possible to batch the TLB flushing just before
>>> >> unlocking PTL. The current code looks correct.
>>> >
>>> > If we're with unconditionally ptep_clear_flush(), does it mean we should
>>> > probably drop the "unmapped" and the last flush_tlb_range() already since
>>> > they'll be redundant?
>>>
>>> This patch does that, unless I missed something?
>>
>> Yes it does. Somehow I didn't read into the real v2 patch, sorry!
>>
>>>
>>> > If that'll need to be dropped, it looks indeed better to still keep the
>>> > batch to me but just move it earlier (before unlock iiuc then it'll be
>>> > safe), then we can keep using ptep_get_and_clear() afaiu but keep "pte"
>>> > updated.
>>>
>>> I think we would also need to check should_defer_flush(). Looking at
>>> try_to_unmap_one() there is this comment:
>>>
>>> 	if (should_defer_flush(mm, flags) && !anon_exclusive) {
>>> 		/*
>>> 		 * We clear the PTE but do not flush so potentially
>>> 		 * a remote CPU could still be writing to the folio.
>>> 		 * If the entry was previously clean then the
>>> 		 * architecture must guarantee that a clear->dirty
>>> 		 * transition on a cached TLB entry is written through
>>> 		 * and traps if the PTE is unmapped.
>>> 		 */
>>>
>>> And as I understand it we'd need the same guarantee here. Given
>>> try_to_migrate_one() doesn't do batched TLB flushes either I'd rather
>>> keep the code as consistent as possible between
>>> migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at
>>> introducing TLB flushing for both in some future patch series.
>>
>> should_defer_flush() is TTU-specific code?
>
> I'm not sure, but I think we need the same guarantee here as mentioned
> in the comment otherwise we wouldn't see a subsequent CPU write that
> could dirty the PTE after we have cleared it but before the TLB flush.
>
> My assumption was should_defer_flush() would ensure we have that
> guarantee from the architecture, but maybe there are alternate/better
> ways of enforcing that?
>
>> IIUC the caller sets TTU_BATCH_FLUSH showing that tlb can be omitted since
>> the caller will be responsible for doing it. In migrate_vma_collect_pmd()
>> iiuc we don't need that hint because it'll be flushed within the same
>> function but just only after the loop of modifying the ptes. Also it'll be
>> with the pgtable lock held.
>
> Right, but the pgtable lock doesn't protect against HW PTE changes such
> as setting the dirty bit so we need to ensure the HW does the right
> thing here and I don't know if all HW does.

This sounds sensible. But I take a look at zap_pte_range(), and find
that it appears that the implementation requires the PTE dirty bit to be
write-through. Do I miss something?

Hi, Nadav, Can you help?

Best Regards,
Huang, Ying

[snip]
Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page
On Tue, Aug 16, 2022 at 3:39 PM Alistair Popple wrote:
>
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
>
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
>
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
>
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
>
> Signed-off-by: Alistair Popple
> Acked-by: Peter Xu
> Reported-by: Huang Ying
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: sta...@vger.kernel.org
>
> ---
>
> Changes for v2:
>
> - Fixed up Reported-by tag.
> - Added Peter's Acked-by.
> - Atomically read and clear the pte to prevent the dirty bit getting
>   set after reading it.
> - Added fixes tag
> ---
>  mm/migrate_device.c | 21 -
>  1 file changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..e2d09e5 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -7,6 +7,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -61,7 +62,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  	struct migrate_vma *migrate = walk->private;
>  	struct vm_area_struct *vma = walk->vma;
>  	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long addr = start, unmapped = 0;
> +	unsigned long addr = start;
>  	spinlock_t *ptl;
>  	pte_t *ptep;
>
> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		bool anon_exclusive;
>  		pte_t swp_pte;
>
> +		flush_cache_page(vma, addr, pte_pfn(*ptep));
> +		pte = ptep_clear_flush(vma, addr, ptep);

Although I think it's possible to batch the TLB flushing just before
unlocking PTL. The current code looks correct.

Reviewed-by: "Huang, Ying"

Best Regards,
Huang, Ying

>  		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>  		if (anon_exclusive) {
> -			flush_cache_page(vma, addr, pte_pfn(*ptep));
> -			ptep_clear_flush(vma, addr, ptep);
> -
>  			if (page_try_share_anon_rmap(page)) {
>  				set_pte_at(mm, addr, ptep, pte);
>  				unlock_page(page);
> @@ -205,12 +205,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  				mpfn = 0;
>  				goto next;
>  			}
> -		} else {
> -			ptep_get_and_clear(mm, addr, ptep);
>  		}
>
>  		migrate->cpages++;
>
> +		/* Set the dirty flag on the folio now the pte is gone. */
> +		if (pte_dirty(pte))
> +			folio_mark_dirty(page_folio(page));
> +
>  		/* Setup special migration page table entry */
>  		if (mpfn & MIGRATE_PFN_WRITE)
>  			entry = make_writable_migration_entry(
> @@ -242,9 +244,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		 */
>  		page_remove_rmap(page, vma, false);
>  		put_page(page);
> -
> -		if (pte_present(pte))
> -			unmapped++;
>  	} else {
>  		put_page(page);
>  		mpfn = 0;
> @@ -257,10 +256,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(ptep - 1, ptl);
>
> -	/* Only flush the TLB if we actually modified any entries */
> -	if (unmapped)
> -		flush_tlb_range(walk->vma, start, end);
> -
>  	return 0;
>  }
>
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
> --
> git-series 0.9.1
>
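
Stripped of the surrounding page-table walk, the idiom the patch above introduces is roughly this (a sketch using the names from the quoted diff):

        pte = ptep_clear_flush(vma, addr, ptep);

        /* The pte is gone from the page table, so its dirty bit may be the
         * only record that the page differs from its copy on swap. Transfer
         * it to the page before the migration entry is installed, matching
         * what try_to_migrate_one() does. */
        if (pte_dirty(pte))
                folio_mark_dirty(page_folio(page));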
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
On Thu, Apr 11, 2019 at 12:08 AM Waiman Long wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset? And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks! Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>
Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features
Hi, Waiman,

What's the status of this patchset? And its merging plan?

Best Regards,
Huang, Ying
Re: [PATCH] mm: convert totalram_pages, totalhigh_pages and managed_pages to atomic.
Arun KS writes:
> Remove managed_page_count_lock spinlock and instead use atomic
> variables.
>
> Suggested-by: Michal Hocko
> Suggested-by: Vlastimil Babka
> Signed-off-by: Arun KS
>
> ---
> As discussed here,
> https://patchwork.kernel.org/patch/10627521/#22261253

My 2 cents. I think you should include at least part of the discussion
in the patch description to make it more readable by itself.

Best Regards,
Huang, Ying
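
For context, the conversion pattern under discussion is roughly the following (a sketch only, not the exact form that was eventually merged; the function names here are illustrative):

        /* before: a plain global protected by a spinlock */
        static DEFINE_SPINLOCK(managed_page_count_lock);
        unsigned long totalram_pages __read_mostly;

        static void adjust_totalram_old(long count)
        {
                spin_lock(&managed_page_count_lock);
                totalram_pages += count;
                spin_unlock(&managed_page_count_lock);
        }

        /* after: an atomic counter, so the update needs no lock */
        atomic_long_t _totalram_pages __read_mostly;

        static void adjust_totalram_new(long count)
        {
                atomic_long_add(count, &_totalram_pages);
        }

        static unsigned long totalram_pages_read(void)
        {
                return atomic_long_read(&_totalram_pages);
        }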