Re: [PATCH v3 4/7] mm/hotplug: Allow pageblock alignment via altmap reservation

2023-07-10 Thread Huang, Ying
"Aneesh Kumar K.V"  writes:

> Add a new kconfig option that can be selected if we want to allow
> pageblock alignment by reserving pages in the vmemmap altmap area.
> This implies we will be reserving some pages for every memory block.
> This also allows the memmap on memory feature to be widely useful
> with different memory block size values.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/Kconfig  |  9 +++
>  mm/memory_hotplug.c | 59 +
>  2 files changed, 58 insertions(+), 10 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 932349271e28..88a1472b2086 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -570,6 +570,15 @@ config MHP_MEMMAP_ON_MEMORY
>   depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
>   depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
>  
> +config MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY
> +   bool "Allow Reserving pages for page block alignment"
> +   depends on MHP_MEMMAP_ON_MEMORY
> +   help
> + This option allows the memmap on memory feature to be more useful
> + with different memory block sizes. This is achieved by marking some pages
> + in each memory block as reserved so that we can get page-block alignment
> + for the remaining pages.
> +
>  endif # MEMORY_HOTPLUG
>  
>  config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 07c99b0cc371..f36aec1f7626 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1252,15 +1252,17 @@ static inline bool arch_supports_memmap_on_memory(unsigned long size)
>  {
>   unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
>   unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> - unsigned long remaining_size = size - vmemmap_size;
>  
> - return IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> + return IS_ALIGNED(vmemmap_size, PMD_SIZE);
>  }
>  #endif
>  
>  static bool mhp_supports_memmap_on_memory(unsigned long size)
>  {
> + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
> + unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
> + unsigned long remaining_size = size - vmemmap_size;
> +
>   /*
>* Besides having arch support and the feature enabled at runtime, we
>* need a few more assumptions to hold true:
> @@ -1287,9 +1289,30 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>*   altmap as an alternative source of memory, and we do not exactly
>*   populate a single PMD.
>*/
> - return mhp_memmap_on_memory() &&
> - size == memory_block_size_bytes() &&
> - arch_supports_memmap_on_memory(size);
> + if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
> + return false;
> +  /*
> +   * Without page reservation remaining pages should be pageblock aligned.
> +   */
> + if (!IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY) &&
> + !IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)))
> + return false;
> +
> + return arch_supports_memmap_on_memory(size);
> +}
> +
> +static inline unsigned long memory_block_align_base(unsigned long size)
> +{
> + if (IS_ENABLED(CONFIG_MHP_RESERVE_PAGES_MEMMAP_ON_MEMORY)) {
> + unsigned long align;
> + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
> + unsigned long vmemmap_size;
> +
> + vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> PAGE_SHIFT;

DIV_ROUND_UP()?
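
That is, something along these lines in memory_block_align_base() -- just a
sketch of the suggestion, so that a struct page area which does not fill a
whole page is rounded up rather than truncated (DIV_ROUND_UP() and PAGE_SIZE
are the standard kernel helpers; everything else is from the patch above):

	vmemmap_size = DIV_ROUND_UP(nr_vmemmap_pages * sizeof(struct page),
				    PAGE_SIZE);
	return pageblock_align(vmemmap_size) - vmemmap_size;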

> + align = pageblock_align(vmemmap_size) - vmemmap_size;
> + return align;
> + } else
> + return 0;
>  }
>  
>  /*
> @@ -1302,7 +1325,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
>   struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>   enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {};
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn =  PHYS_PFN(res->start),
> + .end_pfn  =  PHYS_PFN(res->end),
> + .reserve  = memory_block_align_base(resource_size(res)),
> + };
>   struct memory_group *group = NULL;
>   u64 start, size;
>   bool new_node = false;
> @@ -1347,8 +1374,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>*/
>   if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>   if (mhp_supports_memmap_on_memory(size)) {
> - mhp_altmap.free = PHYS_PFN(size);
> - mhp_altmap.base_pfn = PHYS_PFN(start);
> + mhp_altmap.free = PHYS_PFN(size) - mhp_altmap.reserve;
>   params.altmap = &mhp_altmap;
>   }
>   /* fallback to not using altmap  */
> @@ -1360,7 +

Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-25 Thread Huang, Ying
Alistair Popple  writes:

> Peter Xu  writes:
>
>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>
>>> Peter Xu  writes:
>>>
>>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>>> >> installs migration entries directly if it can lock the migrating page.
>>> >> When removing a dirty pte the dirty bit is supposed to be carried over
>>> >> to the underlying page to prevent it being lost.
>>> >>
>>> >> Currently migrate_vma_*() can only be used for private anonymous
>>> >> mappings. That means loss of the dirty bit usually doesn't result in
>>> >> data loss because these pages are typically not file-backed. However
>>> >> pages may be backed by swap storage which can result in data loss if an
>>> >> attempt is made to migrate a dirty page that doesn't yet have the
>>> >> PageDirty flag set.
>>> >>
>>> >> In this case migration will fail due to unexpected references but the
>>> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
>>> >> won't be written back to swap storage as it is considered uptodate,
>>> >> resulting in data loss if the page is subsequently accessed.
>>> >>
>>> >> Prevent this by copying the dirty bit to the page when removing the pte
>>> >> to match what try_to_migrate_one() does.
>>> >>
>>> >> Signed-off-by: Alistair Popple 
>>> >> Acked-by: Peter Xu 
>>> >> Reported-by: Huang Ying 
>>> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> >> Cc: sta...@vger.kernel.org
>>> >>
>>> >> ---
>>> >>
>>> >> Changes for v3:
>>> >>
>>> >>  - Defer TLB flushing
>>> >>  - Split a TLB flushing fix into a separate change.
>>> >>
>>> >> Changes for v2:
>>> >>
>>> >>  - Fixed up Reported-by tag.
>>> >>  - Added Peter's Acked-by.
>>> >>  - Atomically read and clear the pte to prevent the dirty bit getting
>>> >>set after reading it.
>>> >>  - Added fixes tag
>>> >> ---
>>> >>  mm/migrate_device.c |  9 +++--
>>> >>  1 file changed, 7 insertions(+), 2 deletions(-)
>>> >>
>>> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> >> index 6a5ef9f..51d9afa 100644
>>> >> --- a/mm/migrate_device.c
>>> >> +++ b/mm/migrate_device.c
>>> >> @@ -7,6 +7,7 @@
>>> >>  #include 
>>> >>  #include 
>>> >>  #include 
>>> >> +#include 
>>> >>  #include 
>>> >>  #include 
>>> >>  #include 
>>> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >>  anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>> >>  if (anon_exclusive) {
>>> >>  flush_cache_page(vma, addr, pte_pfn(*ptep));
>>> >> -ptep_clear_flush(vma, addr, ptep);
>>> >> +pte = ptep_clear_flush(vma, addr, ptep);
>>> >>
>>> >>  if (page_try_share_anon_rmap(page)) {
>>> >>  set_pte_at(mm, addr, ptep, pte);
>>> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >>  goto next;
>>> >>  }
>>> >>  } else {
>>> >> -ptep_get_and_clear(mm, addr, ptep);
>>> >> +pte = ptep_get_and_clear(mm, addr, ptep);
>>> >>  }
>>> >
>>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>> > moved above the condition check so they're called unconditionally.  Could
>>> > you explain the rationale on why it's changed back (since

Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL

2022-08-25 Thread Huang, Ying
Alistair Popple  writes:

> "Huang, Ying"  writes:
>
>> Alistair Popple  writes:
>>
>>> When clearing a PTE the TLB should be flushed whilst still holding the
>>> PTL to avoid a potential race with madvise/munmap/etc. For example
>>> consider the following sequence:
>>>
>>>   CPU0  CPU1
>>>     
>>>
>>>   migrate_vma_collect_pmd()
>>>   pte_unmap_unlock()
>>> madvise(MADV_DONTNEED)
>>> -> zap_pte_range()
>>> pte_offset_map_lock()
>>> [ PTE not present, TLB not flushed ]
>>> pte_unmap_unlock()
>>> [ page is still accessible via stale TLB ]
>>>   flush_tlb_range()
>>>
>>> In this case the page may still be accessed via the stale TLB entry
>>> after madvise returns. Fix this by flushing the TLB while holding the
>>> PTL.
>>>
>>> Signed-off-by: Alistair Popple 
>>> Reported-by: Nadav Amit 
>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> Cc: sta...@vger.kernel.org
>>>
>>> ---
>>>
>>> Changes for v3:
>>>
>>>  - New for v3
>>> ---
>>>  mm/migrate_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 27fb37d..6a5ef9f 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> migrate->dst[migrate->npages] = 0;
>>> migrate->src[migrate->npages++] = mpfn;
>>> }
>>> -   arch_leave_lazy_mmu_mode();
>>> -   pte_unmap_unlock(ptep - 1, ptl);
>>>
>>> /* Only flush the TLB if we actually modified any entries */
>>> if (unmapped)
>>> flush_tlb_range(walk->vma, start, end);
>>
>> It appears that we can increase "unmapped" only if ptep_get_and_clear()
>> is used?
>
> In other words you mean we only need to increase unmapped if pte_present
> && !anon_exclusive?
>
> Agree, that's a good optimisation to make. However I'm just trying to
> solve a data corruption issue (not dirtying the page) here, so will post
> that as a separate optimisation patch. Thanks.

OK.  Then the patch looks good to me.  Feel free to add my,

Reviewed-by: "Huang, Ying" 

Best Regards,
Huang, Ying

>>
>>> +   arch_leave_lazy_mmu_mode();
>>> +   pte_unmap_unlock(ptep - 1, ptl);
>>> +
>>> return 0;
>>>  }
>>>
>>>
>>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136


Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL

2022-08-24 Thread Huang, Ying
Alistair Popple  writes:

> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
>
>   CPU0  CPU1
>     
>
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
> madvise(MADV_DONTNEED)
> -> zap_pte_range()
> pte_offset_map_lock()
> [ PTE not present, TLB not flushed ]
> pte_unmap_unlock()
> [ page is still accessible via stale TLB ]
>   flush_tlb_range()
>
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
>
> Signed-off-by: Alistair Popple 
> Reported-by: Nadav Amit 
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: sta...@vger.kernel.org
>
> ---
>
> Changes for v3:
>
>  - New for v3
> ---
>  mm/migrate_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..6a5ef9f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   migrate->dst[migrate->npages] = 0;
>   migrate->src[migrate->npages++] = mpfn;
>   }
> - arch_leave_lazy_mmu_mode();
> - pte_unmap_unlock(ptep - 1, ptl);
>  
>   /* Only flush the TLB if we actually modified any entries */
>   if (unmapped)
>   flush_tlb_range(walk->vma, start, end);

It appears that we can increase "unmapped" only if ptep_get_and_clear()
is used?
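
That is, roughly something like the following (a sketch only; the
anon_exclusive path already flushes via ptep_clear_flush(), so only the
ptep_get_and_clear() path can leave a stale TLB entry behind and needs to
be counted for the batched flush_tlb_range()):

			} else {
				ptep_get_and_clear(mm, addr, ptep);
				/* No per-pte flush here, so count it for
				 * the batched flush_tlb_range() below. */
				if (pte_present(pte))
					unmapped++;
			}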

Best Regards,
Huang, Ying

> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(ptep - 1, ptl);
> +
>   return 0;
>  }
>  
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136


Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-18 Thread Huang, Ying
Peter Xu  writes:

> On Thu, Aug 18, 2022 at 02:34:45PM +0800, Huang, Ying wrote:
>> > In this specific case, the only way to do safe tlb batching in my mind is:
>> >
>> >pte_offset_map_lock();
>> >arch_enter_lazy_mmu_mode();
>> > // If any pending tlb, do it now
>> > if (mm_tlb_flush_pending())
>> >flush_tlb_range(vma, start, end);
>> > else
>> > flush_tlb_batched_pending();
>> 
>> I don't think we need the above 4 lines.  Because we will flush TLB
>> before we access the pages.
>
> Could you elaborate?

As you have said below, we don't use non-present PTEs and flush present
PTEs before we access the pages.

>> Can you find any issue if we don't use the above 4 lines?
>
> It seems okay to me to leave stale tlb at least within the scope of this
> function. It only collects present ptes and flushes properly for them.  I
> don't quickly see any other implications to other not touched ptes - unlike
> e.g. mprotect(), there's a strong barrier of not allowing further write
> after mprotect() returns.

Yes.  I think so too.

> Still I don't know whether there'll be any side effect of having stale tlbs
> in !present ptes because I'm not familiar enough with the private dev swap
> migration code.  But I think having them will be safe, even if redundant.

I don't think it's a good idea to be redundant.  That may hide the real
issue.

Best Regards,
Huang, Ying


Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-17 Thread Huang, Ying
Peter Xu  writes:

> On Wed, Aug 17, 2022 at 02:41:19AM -0700, Nadav Amit wrote:
>> 4. Having multiple TLB flushing infrastructures makes all of these
>> discussions very complicated and unmaintainable. I need to convince myself
>> in every occasion (including this one) whether calls to
>> flush_tlb_batched_pending() and tlb_flush_pending() are needed or not.
>> 
>> What I would like to have [3] is a single infrastructure that gets a
>> “ticket” (generation when the batching started), the old PTE and the new PTE
>> and checks whether a TLB flush is needed based on the arch behavior and the
>> current TLB generation. If needed, it would update the “ticket” to the new
>> generation. Andy wanted a ring for pending TLB flushes, but I think it is an
>> overkill with more overhead and complexity than needed.
>> 
>> But the current situation in which every TLB flush is a basis for long
>> discussions and prone to bugs is impossible.
>> 
>> I hope it helps. Let me know if you want me to revive the patch-set or other
>> feedback.
>> 
>> [1] https://lore.kernel.org/all/20220711034615.482895-5-21cn...@gmail.com/
>> [2] https://lore.kernel.org/all/20220718120212.3180-13-na...@vmware.com/
>> [3] https://lore.kernel.org/all/20210131001132.3368247-16-na...@vmware.com/
>
> I need more reading on tlb code and also [3] which looks useful to me.
> It's definitely sad to make tlb flushing so complicated.  It'll be great if
> things can be sorted out someday.
>
> In this specific case, the only way to do safe tlb batching in my mind is:
>
>   pte_offset_map_lock();
>   arch_enter_lazy_mmu_mode();
> // If any pending tlb, do it now
> if (mm_tlb_flush_pending())
>   flush_tlb_range(vma, start, end);
> else
> flush_tlb_batched_pending();

I don't think we need the above 4 lines.  Because we will flush TLB
before we access the pages.  Can you find any issue if we don't use the
above 4 lines?

Best Regards,
Huang, Ying

> loop {
> ...
> pte = ptep_get_and_clear();
> ...
> if (pte_present())
> unmapped++;
> ...
> }
>   if (unmapped)
>   flush_tlb_range(walk->vma, start, end);
>   arch_leave_lazy_mmu_mode();
>   pte_unmap_unlock();
>
> I may miss something, but even if not it already doesn't look pretty.
>
> Thanks,


Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-17 Thread Huang, Ying
Nadav Amit  writes:

> On Aug 17, 2022, at 12:17 AM, Huang, Ying  wrote:
>
>> Alistair Popple  writes:
>> 
>>> Peter Xu  writes:
>>> 
>>>> On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
>>>>> Peter Xu  writes:
>>>>> 
>>>>>> On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
>>>>>>>> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>bool anon_exclusive;
>>>>>>>>pte_t swp_pte;
>>>>>>>> 
>>>>>>>> +   flush_cache_page(vma, addr, pte_pfn(*ptep));
>>>>>>>> +   pte = ptep_clear_flush(vma, addr, ptep);
>>>>>>> 
>>>>>>> Although I think it's possible to batch the TLB flushing just before
>>>>>>> unlocking PTL.  The current code looks correct.
>>>>>> 
>>>>>> If we're with unconditionally ptep_clear_flush(), does it mean we should
>>>>>> probably drop the "unmapped" and the last flush_tlb_range() already since
>>>>>> they'll be redundant?
>>>>> 
>>>>> This patch does that, unless I missed something?
>>>> 
>>>> Yes it does.  Somehow I didn't read into the real v2 patch, sorry!
>>>> 
>>>>>> If that'll need to be dropped, it looks indeed better to still keep the
>>>>>> batch to me but just move it earlier (before unlock iiuc then it'll be
>>>>>> safe), then we can keep using ptep_get_and_clear() afaiu but keep "pte"
>>>>>> updated.
>>>>> 
>>>>> I think we would also need to check should_defer_flush(). Looking at
>>>>> try_to_unmap_one() there is this comment:
>>>>> 
>>>>>   if (should_defer_flush(mm, flags) && !anon_exclusive) {
>>>>>   /*
>>>>>* We clear the PTE but do not flush so potentially
>>>>>* a remote CPU could still be writing to the folio.
>>>>>* If the entry was previously clean then the
>>>>>* architecture must guarantee that a clear->dirty
>>>>>* transition on a cached TLB entry is written through
>>>>>* and traps if the PTE is unmapped.
>>>>>*/
>>>>> 
>>>>> And as I understand it we'd need the same guarantee here. Given
>>>>> try_to_migrate_one() doesn't do batched TLB flushes either I'd rather
>>>>> keep the code as consistent as possible between
>>>>> migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at
>>>>> introducing TLB flushing for both in some future patch series.
>>>> 
>>>> should_defer_flush() is TTU-specific code?
>>> 
>>> I'm not sure, but I think we need the same guarantee here as mentioned
>>> in the comment otherwise we wouldn't see a subsequent CPU write that
>>> could dirty the PTE after we have cleared it but before the TLB flush.
>>> 
>>> My assumption was should_defer_flush() would ensure we have that
>>> guarantee from the architecture, but maybe there are alternate/better
>>> ways of enforcing that?
>>>> IIUC the caller sets TTU_BATCH_FLUSH showing that tlb can be omitted since
>>>> the caller will be responsible for doing it.  In migrate_vma_collect_pmd()
>>>> iiuc we don't need that hint because it'll be flushed within the same
>>>> function but just only after the loop of modifying the ptes.  Also it'll be
>>>> with the pgtable lock held.
>>> 
>>> Right, but the pgtable lock doesn't protect against HW PTE changes such
>>> as setting the dirty bit so we need to ensure the HW does the right
>>> thing here and I don't know if all HW does.
>> 
>> This sounds sensible.  But I take a look at zap_pte_range(), and find
>> that it appears that the implementation requires the PTE dirty bit to be
>> write-through.  Do I miss something?
>> 
>> Hi, Nadav, Can you help?
>
> Sorry for joinin

Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-17 Thread Huang, Ying
Alistair Popple  writes:

> Peter Xu  writes:
>
>> On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
>>>
>>> Peter Xu  writes:
>>>
>>> > On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
>>> >> > @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >> > bool anon_exclusive;
>>> >> > pte_t swp_pte;
>>> >> >
>>> >> > +   flush_cache_page(vma, addr, pte_pfn(*ptep));
>>> >> > +   pte = ptep_clear_flush(vma, addr, ptep);
>>> >>
>>> >> Although I think it's possible to batch the TLB flushing just before
>>> >> unlocking PTL.  The current code looks correct.
>>> >
>>> > If we're with unconditionally ptep_clear_flush(), does it mean we should
>>> > probably drop the "unmapped" and the last flush_tlb_range() already since
>>> > they'll be redundant?
>>>
>>> This patch does that, unless I missed something?
>>
>> Yes it does.  Somehow I didn't read into the real v2 patch, sorry!
>>
>>>
>>> > If that'll need to be dropped, it looks indeed better to still keep the
>>> > batch to me but just move it earlier (before unlock iiuc then it'll be
>>> > safe), then we can keep using ptep_get_and_clear() afaiu but keep "pte"
>>> > updated.
>>>
>>> I think we would also need to check should_defer_flush(). Looking at
>>> try_to_unmap_one() there is this comment:
>>>
>>> if (should_defer_flush(mm, flags) && !anon_exclusive) {
>>> /*
>>>  * We clear the PTE but do not flush so potentially
>>>  * a remote CPU could still be writing to the folio.
>>>  * If the entry was previously clean then the
>>>  * architecture must guarantee that a clear->dirty
>>>  * transition on a cached TLB entry is written through
>>>  * and traps if the PTE is unmapped.
>>>  */
>>>
>>> And as I understand it we'd need the same guarantee here. Given
>>> try_to_migrate_one() doesn't do batched TLB flushes either I'd rather
>>> keep the code as consistent as possible between
>>> migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at
>>> introducing TLB flushing for both in some future patch series.
>>
>> should_defer_flush() is TTU-specific code?
>
> I'm not sure, but I think we need the same guarantee here as mentioned
> in the comment otherwise we wouldn't see a subsequent CPU write that
> could dirty the PTE after we have cleared it but before the TLB flush.
>
> My assumption was should_defer_flush() would ensure we have that
> guarantee from the architecture, but maybe there are alternate/better
> ways of enforcing that?
>> IIUC the caller sets TTU_BATCH_FLUSH showing that tlb can be omitted since
>> the caller will be responsible for doing it.  In migrate_vma_collect_pmd()
>> iiuc we don't need that hint because it'll be flushed within the same
>> function but just only after the loop of modifying the ptes.  Also it'll be
>> with the pgtable lock held.
>
> Right, but the pgtable lock doesn't protect against HW PTE changes such
> as setting the dirty bit so we need to ensure the HW does the right
> thing here and I don't know if all HW does.

This sounds sensible.  But I take a look at zap_pte_range(), and find
that it appears that the implementation requires the PTE dirty bit to be
write-through.  Do I miss something?

Hi, Nadav, Can you help?

Best Regards,
Huang, Ying

[snip]


Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-16 Thread huang ying
On Tue, Aug 16, 2022 at 3:39 PM Alistair Popple  wrote:
>
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
>
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
>
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
>
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
>
> Signed-off-by: Alistair Popple 
> Acked-by: Peter Xu 
> Reported-by: Huang Ying 
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: sta...@vger.kernel.org
>
> ---
>
> Changes for v2:
>
>  - Fixed up Reported-by tag.
>  - Added Peter's Acked-by.
>  - Atomically read and clear the pte to prevent the dirty bit getting
>set after reading it.
>  - Added fixes tag
> ---
>  mm/migrate_device.c | 21 -
>  1 file changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..e2d09e5 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -61,7 +62,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> struct migrate_vma *migrate = walk->private;
> struct vm_area_struct *vma = walk->vma;
> struct mm_struct *mm = vma->vm_mm;
> -   unsigned long addr = start, unmapped = 0;
> +   unsigned long addr = start;
> spinlock_t *ptl;
> pte_t *ptep;
>
> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> bool anon_exclusive;
> pte_t swp_pte;
>
> +   flush_cache_page(vma, addr, pte_pfn(*ptep));
> +   pte = ptep_clear_flush(vma, addr, ptep);

Although I think it's possible to batch the TLB flushing just before
unlocking PTL.  The current code looks correct.
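
I.e. keep ptep_get_and_clear() and the "unmapped" counter, and do a single
flush while the PTL is still held -- roughly (a sketch only; compare the
v3 1/3 patch earlier in this digest, which moves the batched flush under
the PTL):

	/* Flush before dropping the PTL so a racing zap/madvise cannot
	 * observe a cleared pte while a stale TLB entry still exists.
	 */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);
	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);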

Reviewed-by: "Huang, Ying" 

Best Regards,
Huang, Ying

> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> if (anon_exclusive) {
> -   flush_cache_page(vma, addr, pte_pfn(*ptep));
> -   ptep_clear_flush(vma, addr, ptep);
> -
> if (page_try_share_anon_rmap(page)) {
> set_pte_at(mm, addr, ptep, pte);
> unlock_page(page);
> @@ -205,12 +205,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> mpfn = 0;
> goto next;
> }
> -   } else {
> -   ptep_get_and_clear(mm, addr, ptep);
> }
>
> migrate->cpages++;
>
> +   /* Set the dirty flag on the folio now the pte is gone. */
> +   if (pte_dirty(pte))
> +   folio_mark_dirty(page_folio(page));
> +
> /* Setup special migration page table entry */
> if (mpfn & MIGRATE_PFN_WRITE)
> entry = make_writable_migration_entry(
> @@ -242,9 +244,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  */
> page_remove_rmap(page, vma, false);
> put_page(page);
> -
> -   if (pte_present(pte))
> -   unmapped++;
> } else {
> put_page(page);
> mpfn = 0;
> @@ -257,10 +256,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(ptep - 1, ptl);
>
> -   /* Only flush the TLB if we actually modified any entries */
> -   if (unmapped)
> -   flush_tlb_range(walk->vma, start, end);
> -
> return 0;
>  }
>
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
> --
> git-series 0.9.1
>


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-04-11 Thread huang ying
On Thu, Apr 11, 2019 at 12:08 AM Waiman Long  wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset?  And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks!  Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>


Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

2019-04-10 Thread huang ying
Hi, Waiman,

What's the status of this patchset?  And its merging plan?

Best Regards,
Huang, Ying


Re: [PATCH] mm: convert totalram_pages, totalhigh_pages and managed_pages to atomic.

2018-10-23 Thread Huang, Ying
Arun KS  writes:

> Remove managed_page_count_lock spinlock and instead use atomic
> variables.
>
> Suggested-by: Michal Hocko 
> Suggested-by: Vlastimil Babka 
> Signed-off-by: Arun KS 
>
> ---
> As discussed here,
> https://patchwork.kernel.org/patch/10627521/#22261253

My 2 cents.  I think you should include at least part of the discussion
in the patch description to make it more readable by itself.

Best Regards,
Huang, Ying